[gpfsug-discuss] help with multi-cluster setup: Network is unreachable
Jaime Pinto
pinto at scinet.utoronto.ca
Mon May 8 17:23:01 BST 2017
Quoting "Simon Thompson (IT Research Support)" <S.J.Thompson at bham.ac.uk>:
> Do you have multiple networks on the hosts? We've seen this sort of
> thing when rp_filter is dropping traffic with asymmetric routing.
>
Yes Simon,
All clients and servers have multiple interfaces on different
networks, but we've been careful to always join nodes with the
<hostname>-ib0 resolution, always on IB.
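
The -ib0 resolution can be double-checked independently of GPFS by resolving each name and confirming the address lands in the IB range. A minimal Python sketch, assuming the IB fabric is 10.20.0.0/16 (inferred from the 10.20.x.x addresses and the 10.20.255.255 broadcast in this thread); the function name names_off_ib is illustrative only:

```python
import ipaddress
import socket

# Assumption: all IB interfaces live in 10.20.0.0/16, as suggested by the
# 10.20.x.x addresses and the 10.20.255.255 broadcast mentioned in this thread.
IB_NET = ipaddress.ip_network("10.20.0.0/16")


def names_off_ib(names, resolve=socket.gethostbyname, net=IB_NET):
    """Return {name: reason} for every name that does not resolve into `net`."""
    problems = {}
    for name in names:
        try:
            addr = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[name] = f"unresolvable: {exc}"
            continue
        if ipaddress.ip_address(addr) not in net:
            problems[name] = f"resolves to {addr}, outside {net}"
    return problems
```

Calling names_off_ib() with the daemon and admin node names that 'mmlscluster' reports, on a client and on a cluster 4 server, should return an empty dict if every name really resolves onto IB from both sides.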
I can also query with 'mmlscluster' and all nodes involved are listed
with the 10.20.x.x IP and -ib0 extension on their names. We don't have
mmnetverify anywhere yet.
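
Without mmnetverify, the rp_filter possibility raised above can still be checked by hand on each node. A rough Python sketch that reads the standard Linux sysctl tree under /proc/sys/net/ipv4/conf; the two function names are made up for illustration, and the conf_dir parameter exists only so the code can be pointed at a test directory:

```python
from pathlib import Path


def rp_filter_settings(conf_dir="/proc/sys/net/ipv4/conf"):
    """Return {interface: rp_filter value} read from the kernel's sysctl tree."""
    settings = {}
    for entry in sorted(Path(conf_dir).glob("*/rp_filter")):
        settings[entry.parent.name] = int(entry.read_text().strip())
    return settings


def strict_interfaces(settings):
    """Interfaces whose effective rp_filter is 1 (strict)."""
    # The kernel uses max(conf/all, conf/<iface>) as the effective value:
    # 0 = no source validation, 1 = strict, 2 = loose.
    base = settings.get("all", 0)
    return [
        name
        for name, value in settings.items()
        if name not in ("all", "default") and max(base, value) == 1
    ]
```

Any interface returned by strict_interfaces(rp_filter_settings()) is a candidate for dropping replies that arrive on a different interface than the one the route points to. The same values can be read with 'sysctl net.ipv4.conf.all.rp_filter' and the per-interface equivalents.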
Thanks
Jaime
> I know you said it's set to only go over IB, but if you have names
> that resolve onto your Ethernet, and admin name etc. are not correct,
> it might be your problem.
>
> If you had 4.2, I'd suggest mmnetverify. I suppose that might work
> if you copied it out of the 4.x packages anyway?
>
> Simon
> ________________________________________
> From: gpfsug-discuss-bounces at spectrumscale.org
> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of
> pinto at scinet.utoronto.ca [pinto at scinet.utoronto.ca]
> Sent: 08 May 2017 17:06
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] help with multi-cluster setup: Network is
> unreachable
>
> We have a setup in which "cluster 0" is made up of clients only on
> GPFS v3.5, i.e., no NSDs or formal storage on this primary membership.
>
> All storage for those clients comes in a multi-cluster fashion, from
> clusters 1 (3.5.0-23), 2 (3.5.0-11) and 3 (4.1.1-7).
>
> We recently added a new storage cluster 4 (4.1.1-14), and for some
> obscure reason we keep getting "Network is unreachable" during mount
> by clients, even though there were no issues or errors with the
> multi-cluster setup, ie, 'mmremotecluster add' and 'mmremotefs add'
> worked fine, and all clients have an entry in /etc/fstab for the file
> system associated with the new cluster 4. The weird thing is that we
> can mount cluster 3 fine (also 4.1).
>
> Another piece of information is that as far as GPFS goes all clusters
> are configured to communicate exclusively over Infiniband, each on a
> different 10.20.x.x network, but all with broadcast 10.20.255.255. As far as
> the IB network goes there are no problems routing/pinging around all
> the clusters. So this must be internal to GPFS.
>
> None of the clusters have the subnet parameter set explicitly at
> configuration, and on reading the 3.5 and 4.1 manuals it doesn't seem
> we need to. All have cipherList AUTHONLY. One difference is that
> cluster 4 has DMAPI enabled (don't think it matters).
>
> Below is an excerpt of the /var/mmfs/gen/mmfslog in one of the clients
> during mount (10.20.179.1 is one of the NSDs on cluster 4):
> Mon May 8 11:35:27.773 2017: [I] Waiting to join remote cluster
> wosgpfs.wos-gateway01-ib0
> Mon May 8 11:35:28.777 2017: [W] The TLS handshake with node
> 10.20.179.1 failed with error 447 (client side).
> Mon May 8 11:35:28.781 2017: [E] Failed to join remote cluster
> wosgpfs.wos-gateway01-ib0
> Mon May 8 11:35:28.782 2017: [W] Command: err 719: mount
> wosgpfs.wos-gateway01-ib0:wosgpfs
> Mon May 8 11:35:28.783 2017: Network is unreachable
>
>
> I see this reference to "TLS handshake" and error 447; however,
> according to the manual, TLS only becomes the default from 4.2
> onwards, not in the 4.1.1-14 we have now, where the default is
> supposed to be EMPTY.
>
> mmdiag --network on some of the clients gives this excerpt (broken status):
> tapenode-ib0 <c4p1> 10.20.83.5 broken 233 -1 0 0 Linux/L
> gpc-f114n014-ib0 <c4p2> 10.20.114.14 broken 233 -1 0 0 Linux/L
> gpc-f114n015-ib0 <c4p3> 10.20.114.15 broken 233 -1 0 0 Linux/L
> gpc-f114n016-ib0 <c4p4> 10.20.114.16 broken 233 -1 0 0 Linux/L
> wos-gateway01-ib0 <c4p5> 10.20.179.1 broken 233 -1 0 0 Linux/L
>
>
>
> I guess I just need a hint on how to troubleshoot this situation (the
> 4.1 troubleshooting guide is not helping).
>
> Thanks
> Jaime
>
>
>
> ---
> Jaime Pinto
> SciNet HPC Consortium - Compute/Calcul Canada
> www.scinet.utoronto.ca - www.computecanada.ca
> University of Toronto
> 661 University Ave. (MaRS), Suite 1140
> Toronto, ON, M5G1M1
> P: 416-978-2755
> C: 416-505-1477
>
> ----------------------------------------------------------------
> This message was sent using IMP at SciNet Consortium, University of Toronto.
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>