[gpfsug-discuss] Edge case failure mode
Simon Thompson (IT Research Support)
S.J.Thompson at bham.ac.uk
Thu May 11 18:49:02 BST 2017
Just following up on some discussions we had at the UG this week. I
mentioned a few weeks back that we were having issues with failover of
NFS, and we figured out a workaround for our clients so that failover
works great now (plus there are some code fixes coming down the line as
well to help).
Here's my story of fun with protocol nodes ...
Since then we've occasionally been seeing the load average of one CES node
rise to over 400, at which point it's SOOO SLOW to respond to NFS and SMB
clients. After a lot of digging we found that CTDB was reporting > 80%
memory used, so we tweaked the page pool down to solve this.
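For anyone following along, that was roughly the following; the 8G figure is
just an example, and I'm assuming the protocol nodes are in the usual
cesNodes node class:

    # Check what the page pool is currently set to
    mmlsconfig pagepool

    # Shrink it on the protocol nodes only (8G is an illustrative value);
    # note it won't take effect until GPFS is restarted on those nodes
    # (or use -i where your release supports it for pagepool)
    mmchconfig pagepool=8G -N cesNodes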
Great we thought ... But alas that wasn't the cause.
Just to be clear, 95% of the time the CES node is fine: I can do an ls in
the mounted file-systems and all is good. When the load rises to 400, an
ls takes 20-30 seconds, so they are related, but what is the initial
cause? Other CES nodes are 100% fine, and if we do mmces node suspend and
then resume, all is well on the node (and no other CES node inherits the
problem as the IP moves). It's not always the same CES IP, node or even
data centre, and most of the time it looks fine.
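For reference, the suspend/resume dance that clears it looks like this (run
on the affected node; CES moves its addresses onto the other nodes while it
is suspended):

    # On the sick protocol node: take it out of the CES address pool
    mmces node suspend

    # ... load average drops back, ls responds instantly again ...

    # Put it back in service and let the addresses rebalance
    mmces node resume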
I logged a ticket with OCF today, and one thing they suggested was to
disable NFSv3, as they've seen similar behaviour at another site. As far as
I know all my NFS clients are v4, but sure, we disabled v3 anyway as it's
not actually needed. (Both at the Ganesha layer, changing the default for
exports, and reconfiguring all existing exports to v4 only for good
measure.) That didn't help, but it was certainly worth a try!
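The export side of that looked something like the following; the export path
and client spec are placeholders, and the attribute name is as I remember it
from the docs, so check mmnfs export change --help on your release:

    # See which protocol versions each export currently allows
    mmnfs export list

    # Restrict an existing export to NFSv4 only
    # (/gpfs/some-export and the "*" client spec are examples)
    mmnfs export change /gpfs/some-export --nfschange "*(Protocols=4)"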
Note that my CES cluster is multi-cluster mounting the file-systems, and
from the POSIX side it's fine most of the time.
We've used the mmnetverify command to check that all is well there too. Of
course this only checks the local cluster, not remote nodes, but as we
aren't seeing expels and can access the FS, we assume that the GPFS layer
is working fine.
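What we ran was roughly this (operation names from the 4.2.x docs, from
memory):

    # Name resolution, ping, ssh and copy checks between all local cluster nodes
    mmnetverify connectivity -N all --target-nodes all

    # Check the GPFS daemon/sdrserv ports are reachable as well
    mmnetverify port -N all --target-nodes all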
So we finally log a PMR with IBM. I catch a node in a broken state, pull a
trace from it, upload that, and ask what other traces they might want
(apparently there is no protocol trace for NFS in 4.2.2-3).
Now, when we run this, I note that it's doing things like mmlsfileset to
the remote storage, coming from two clusters, and some of this is timing
out. We've already had issues with rp_filter on remote nodes causing
expels, but the storage backend here has only one NIC, and we can mount and
access it all fine.
So why can't I make these "admin" calls to this node when I can ping it
(ICMP ping, not GPFS ping, of course)? SSH to it appears to work fine as
well, BTW.
So I check on my CES nodes: they are multi-homed and rp_filter is enabled.
Setting it to a value of 2 seems to make mmlsfileset work, so yes, I'm
sure I'm an edge case, but it would be REALLY REALLY helpful to get
mmnetverify to work across a cluster (e.g. I say this is a remote node and
here's its FQDN, can you talk to it), which would have helped with
diagnosis here. I'm not entirely sure why ssh etc. would work and pass
rp_filter but not GPFS traffic (in some cases, apparently), but I guess
it's something to do with how GPFS binds its sockets and how the kernel
routing layer then picks the return path.
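For anyone hitting the same thing, the checks and the change on the CES node
were along these lines (the remote address is a placeholder):

    # See which interface/source address the kernel would use to reach
    # the remote storage node
    ip route get 10.20.30.40

    # Current reverse-path filtering mode per interface (1 = strict, 2 = loose)
    sysctl -a 2>/dev/null | grep '\.rp_filter'

    # Relax to loose source validation; the effective mode per interface is
    # the maximum of the "all" and per-interface values, so setting "all"
    # to 2 is enough on a multi-homed box
    sysctl -w net.ipv4.conf.all.rp_filter=2

    # Persist it across reboots
    echo 'net.ipv4.conf.all.rp_filter = 2' > /etc/sysctl.d/90-rp_filter.conf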
I'm still not sure if this is my root cause as the occurrences of the high
load are a bit random (anything from every hour to being stable for 2-3
days), but since making the rp_filter change this afternoon, so far ...?
I've created an RFE for mmnetverify to be able to test across a cluster...
https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=105030
Simon