[gpfsug-discuss] help with multi-cluster setup: Network is unreachable

Jaime Pinto pinto at scinet.utoronto.ca
Mon May 8 17:23:01 BST 2017


Quoting "Simon Thompson (IT Research Support)" <S.J.Thompson at bham.ac.uk>:

> Do you have multiple networks on the hosts? We've seen this sort of
> thing when rp_filter is dropping traffic with asymmetric routing.
>

Yes Simon,

All clients and servers have multiple interfaces on different  
networks, but we've been careful to always join nodes with the  
<hostname>-ib0 resolution, always on IB.

I can also query with 'mmlscluster' and all nodes involved are listed  
with the 10.20.x.x IP and -ib0 extension on their names. We don't have  
mmnetverify anywhere yet.

Thanks
Jaime
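
For anyone wanting to check the rp_filter angle on a multi-homed node, here is a minimal Python sketch. It assumes a Linux host that exposes the usual /proc/sys/net/ipv4/conf/<interface>/rp_filter files; the function names are made up for illustration and are not part of GPFS. As far as I know the kernel applies the larger of the "all" value and the per-interface value, so the sketch does the same.

```python
from pathlib import Path

# Meaning of each rp_filter value, per the Linux kernel ip-sysctl documentation.
RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}


def read_rp_filter(conf_dir="/proc/sys/net/ipv4/conf"):
    """Return {interface: rp_filter value} for every interface under conf_dir."""
    base = Path(conf_dir)
    if not base.is_dir():
        # Not a Linux host, or the path is wrong: nothing to report.
        return {}
    settings = {}
    for entry in sorted(base.iterdir()):
        rp_file = entry / "rp_filter"
        if rp_file.is_file():
            settings[entry.name] = int(rp_file.read_text().strip())
    return settings


def strict_interfaces(settings):
    """List interfaces whose effective mode is strict (1)."""
    # The effective value is the maximum of conf/all and conf/<interface>.
    all_value = settings.get("all", 0)
    return [
        name
        for name, value in settings.items()
        if name not in ("all", "default") and max(all_value, value) == 1
    ]


if __name__ == "__main__":
    current = read_rp_filter()
    for name, value in current.items():
        mode = RP_FILTER_MODES.get(value, "unknown")
        print(f"{name:12s} rp_filter={value} ({mode})")
    print("effectively strict:", strict_interfaces(current))
```

An interface reported as strict is only a hint: it would matter here if replies to the -ib0 addresses left through a different interface than the one the request arrived on.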

> I know you said it's set to only go over IB, but if you have names
> that resolve onto your Ethernet, and the admin names etc. are not
> correct, that might be your problem.
>
> If you had 4.2, I'd suggest mmnetverify. I suppose that might work   
> if you copied it out of the 4.x packages anyway?
>
> Simon
> ________________________________________
> From: gpfsug-discuss-bounces at spectrumscale.org   
> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of   
> pinto at scinet.utoronto.ca [pinto at scinet.utoronto.ca]
> Sent: 08 May 2017 17:06
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] help with multi-cluster setup: Network is   
>     unreachable
>
> We have a setup in which "cluster 0" is made up of clients only, on
> GPFS v3.5, i.e., no NSDs or formal storage in this primary membership.
>
> All storage for those clients comes in a multi-cluster fashion, from
> clusters 1 (3.5.0-23), 2 (3.5.0-11) and 3 (4.1.1-7).
>
> We recently added a new storage cluster 4 (4.1.1-14), and for some
> obscure reason we keep getting "Network is unreachable" during mount
> by clients, even though there were no issues or errors with the
> multi-cluster setup, i.e., 'mmremotecluster add' and 'mmremotefs add'
> worked fine, and all clients have an entry in /etc/fstab for the file
> system associated with the new cluster 4. The weird thing is that we
> can mount cluster 3 fine (also 4.1).
>
> Another piece of information is that, as far as GPFS goes, all clusters
> are configured to communicate exclusively over InfiniBand, each on a
> different 10.20.x.x network, but with broadcast 10.20.255.255. As far as
> the IB network goes there are no problems routing/pinging around all
> the clusters. So this must be internal to GPFS.
>
> None of the clusters have the subnet parameter set explicitly at
> configuration, and on reading the 3.5 and 4.1 manuals it doesn't seem
> we need to. All have cipherList AUTHONLY. One difference is that
> cluster 4 has DMAPI enabled (don't think it matters).
>
> Below is an excerpt of the /var/mmfs/gen/mmfslog on one of the clients
> during mount (10.20.179.1 is one of the NSD servers on cluster 4):
> Mon May  8 11:35:27.773 2017: [I] Waiting to join remote cluster wosgpfs.wos-gateway01-ib0
> Mon May  8 11:35:28.777 2017: [W] The TLS handshake with node 10.20.179.1 failed with error 447 (client side).
> Mon May  8 11:35:28.781 2017: [E] Failed to join remote cluster wosgpfs.wos-gateway01-ib0
> Mon May  8 11:35:28.782 2017: [W] Command: err 719: mount wosgpfs.wos-gateway01-ib0:wosgpfs
> Mon May  8 11:35:28.783 2017: Network is unreachable
>
>
> I see this reference to "TLS handshake" and error 447; however,
> according to the manual TLS only became the default from 4.2
> onwards, not in the 4.1.1-14 we have now, where the default is
> supposed to be EMPTY.
>
> mmdiag --network on some of the clients gives this excerpt (broken status):
>      tapenode-ib0          <c4p1>   10.20.83.5     broken     233  -1    0         0          Linux/L
>      gpc-f114n014-ib0      <c4p2>   10.20.114.14   broken     233  -1    0         0          Linux/L
>      gpc-f114n015-ib0      <c4p3>   10.20.114.15   broken     233  -1    0         0          Linux/L
>      gpc-f114n016-ib0      <c4p4>   10.20.114.16   broken     233  -1    0         0          Linux/L
>      wos-gateway01-ib0     <c4p5>   10.20.179.1    broken     233  -1    0         0          Linux/L
>
>
>
> I guess I just need a hint on how to troubleshoot this situation (the
> 4.1 troubleshooting guide is not helping).
>
> Thanks
> Jaime
>
>
>
> ---
> Jaime Pinto
> SciNet HPC Consortium - Compute/Calcul Canada
> www.scinet.utoronto.ca - www.computecanada.ca
> University of Toronto
> 661 University Ave. (MaRS), Suite 1140
> Toronto, ON, M5G1M1
> P: 416-978-2755
> C: 416-505-1477
>
> ----------------------------------------------------------------
> This message was sent using IMP at SciNet Consortium, University of Toronto.
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
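
When there are many clients to check, the 'mmdiag --network' rows quoted above can be filtered mechanically. The following is a rough Python sketch, assuming each peer sits on one line in the layout shown in the excerpt (hostname, a <cNpM> tag, IPv4 address, status, then counters). The regular expression and function names are illustrative only and may need adjusting for other GPFS releases.

```python
import re

# One peer row as it appears in the quoted excerpt:
# hostname, <tag>, IPv4 address, status word (remaining columns are ignored).
ROW = re.compile(
    r"^\s*(?P<host>\S+)\s+<(?P<tag>[^>]+)>\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+(?P<status>\S+)"
)

# Rows taken from the excerpt in this thread, with the mail wrapping undone.
SAMPLE = """\
     tapenode-ib0       <c4p1>   10.20.83.5    broken  233  -1  0  0  Linux/L
     gpc-f114n014-ib0   <c4p2>   10.20.114.14  broken  233  -1  0  0  Linux/L
     wos-gateway01-ib0  <c4p5>   10.20.179.1   broken  233  -1  0  0  Linux/L
"""


def parse_peers(text):
    """Return one dict (host, tag, ip, status) per line that looks like a peer row."""
    peers = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            peers.append(match.groupdict())
    return peers


def broken_peers(text):
    """Return (host, ip) pairs for peers whose status column reads 'broken'."""
    return [(p["host"], p["ip"]) for p in parse_peers(text) if p["status"] == "broken"]


if __name__ == "__main__":
    for host, ip in broken_peers(SAMPLE):
        print(f"{host} ({ip}) is reported as broken")
```

To use it on a live node, the output of 'mmdiag --network' would be captured to a string and passed to broken_peers() in place of SAMPLE.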

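On the addressing described in the quoted message (each cluster on a different 10.20.x.x range, but broadcast 10.20.255.255), a small Python sketch with the standard ipaddress module shows which network that broadcast address implies for a given host. The helper name is made up for illustration, and the result only holds if the interfaces really carry that broadcast address.

```python
import ipaddress


def implied_network(address, broadcast):
    """Return the IPv4 network that contains address and has this broadcast, or None."""
    addr = ipaddress.ip_address(address)
    bcast = ipaddress.ip_address(broadcast)
    # Try every prefix length, from the most specific to the least specific.
    for prefix in range(32, -1, -1):
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        if net.broadcast_address == bcast:
            return net
    return None


if __name__ == "__main__":
    # Addresses taken from the mmdiag excerpt in this thread.
    peers = ["10.20.83.5", "10.20.114.14", "10.20.179.1"]
    net = implied_network(peers[0], "10.20.255.255")
    print("implied network:", net)
    for peer in peers:
        print(peer, "in", net, "->", ipaddress.ip_address(peer) in net)
```

If all the peers land in the same implied network, then at the IP level the clusters share one subnet, which is worth keeping in mind when reading what the manuals say about the subnets parameter.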