[gpfsug-discuss] bizarre performance behavior

Kumaran Rajaram kums at us.ibm.com
Thu May 4 23:22:12 BST 2017


Hi,

>>So the problem was some bad IB routing. We changed some IB links, and
then we also got 12GB/s read with nsdperf.
>>On our clients we are then able to achieve the 7.2GB/s in total that we
also saw using the NSD servers!

This is good to hear.

>> We are now running some tests with different block sizes and parameters,
because our backend storage is able to do more than the 7.2GB/s we get
with GPFS (more like 14GB/s in total). I guess prefetchThreads and
nsdworkerthreads are the ones to look at?

If you are on 4.2.0.3 or higher, you can use the workerThreads config
parameter (start with a value of 128 and increase in increments of 128 up
to the supported maximum). This setting auto-adjusts the values of related
parameters such as prefetchThreads, worker3Threads, etc.

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Tuning%20Parameters
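For illustration, the change might look like this (a sketch; as far as I
know, on 4.2.x workerThreads takes effect only after a daemon restart, so
plan a maintenance window):

    # set workerThreads cluster-wide, then restart GPFS for it to apply
    mmchconfig workerThreads=128
    mmshutdown -a && mmstartup -a
    # confirm the active value
    mmlsconfig workerThreads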

In addition to trying a larger file-system block size (e.g. 4MiB or
higher, so that it aligns with the storage volume RAID stripe width) and
config parameters (e.g. workerThreads, ignorePrefetchLUNCount), it will be
good to assess the "backend storage" performance for a random I/O access
pattern (with block I/O sizes in units of the FS block size), as this is
the more likely I/O scenario the backend storage will experience when many
GPFS nodes perform I/O to the file system simultaneously (in a production
environment).
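A rough sketch of such a backend test with fio (the device path, block
size, and job counts are placeholders; note that random *writes* against a
raw LUN destroy data, so stick to randread or use scratch LUNs):

    # random reads straight off a raw LUN, using 4MiB blocks to mimic
    # FS-block-sized I/O; /dev/mapper/lun01 is a placeholder
    fio --name=backend-rand --filename=/dev/mapper/lun01 --direct=1 \
        --rw=randread --bs=4m --ioengine=libaio --iodepth=16 \
        --numjobs=4 --runtime=60 --time_based --group_reporting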

mmcrfs has the option "[-j {cluster | scatter}]". "-j scatter" is
recommended for consistent file-system performance over the lifetime of
the file system, but "-j scatter" will result in random I/O to the backend
storage (even when the application performs sequential I/O). For your test
purposes, you may assess GPFS file-system performance with "-j cluster",
and you may see good sequential results (compared to "-j scatter") at
lower client counts; but as you scale the client count, the combined
workload will look like "-j scatter" to the backend storage anyway
(limiting FS performance to the random-I/O performance of the backend
storage).
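For example, test file systems for such a comparison might be created
roughly as follows (device names and NSD stanza files are placeholders):

    # 4MiB block size with scatter allocation
    mmcrfs fs1 -F nsd_stanza.txt -B 4M -j scatter
    # and a second one with cluster allocation for comparison
    mmcrfs fs2 -F nsd_stanza2.txt -B 4M -j cluster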

[snip from mmcrfs]
layoutMap={scatter | cluster}
                  Specifies the block allocation map type. When
                  allocating blocks for a given file, GPFS first
                  uses a round‐robin algorithm to spread the data
                  across all disks in the storage pool. After a
                  disk is selected, the location of the data
                  block on the disk is determined by the block
                  allocation map type. If cluster is
                  specified, GPFS attempts to allocate blocks in
                  clusters. Blocks that belong to a particular
                  file are kept adjacent to each other within
                  each cluster. If scatter is specified,
                  the location of the block is chosen randomly.

                  The cluster allocation method may provide
                  better disk performance for some disk
                  subsystems in relatively small installations.
                  The benefits of clustered block allocation
                  diminish when the number of nodes in the
                  cluster or the number of disks in a file system
                  increases, or when the file system’s free space
                  becomes fragmented. The cluster
                  allocation method is the default for GPFS
                  clusters with eight or fewer nodes and for file
                  systems with eight or fewer disks.

                  The scatter allocation method provides
                  more consistent file system performance by
                  averaging out performance variations due to
                  block location (for many disk subsystems, the
                  location of the data relative to the disk edge
                  has a substantial effect on performance). This
                  allocation method is appropriate in most cases
                  and is the default for GPFS clusters with more
                  than eight nodes or file systems with more than
                  eight disks.

                  The block allocation map type cannot be changed
                  after the storage pool has been created.
..
..
        -j {cluster | scatter}
         Specifies the default block allocation map type to be
         used if layoutMap is not specified for a given
         storage pool.
[/snip from mmcrfs]

My two cents,
-Kums




From:   Kenneth Waegeman <kenneth.waegeman at ugent.be>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   05/04/2017 09:23 AM
Subject:        Re: [gpfsug-discuss] bizarre performance behavior
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi,
We found out using ib_read_bw and ib_write_bw that some links between
servers and clients were degraded, having a bandwidth of 350MB/s.
Strangely, nsdperf did not report the same: it reported 12GB/s write and
9GB/s read, which was much more than we could actually achieve.
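For anyone reproducing this: the perftest tools run as a server/client
pair per link, roughly as follows (the IB device name is a placeholder):

    # on one end of the link under test:
    ib_read_bw -d mlx4_0
    # on the other end, pointing at the first host:
    ib_read_bw -d mlx4_0 nsd00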
So the problem was some bad IB routing. We changed some IB links, and then
we also got 12GB/s read with nsdperf.
On our clients we are then able to achieve the 7.2GB/s in total that we
also saw using the NSD servers!
Many thanks for the help!!
We are now running some tests with different block sizes and parameters,
because our backend storage is able to do more than the 7.2GB/s we get
with GPFS (more like 14GB/s in total). I guess prefetchThreads and
nsdworkerthreads are the ones to look at?
Cheers!
Kenneth
On 21/04/17 22:27, Kumaran Rajaram wrote:
Hi Kenneth,

As mentioned earlier, it will be good to first verify the raw network
performance between the NSD client and NSD server using the nsdperf tool,
built with RDMA support:
g++ -O2 -DRDMA -o nsdperf -lpthread -lrt -libverbs -lrdmacm nsdperf.C
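Once built, a typical session (a sketch from memory; hostnames are
placeholders, and the nsdperf wiki page linked later in this thread has
the exact command set) starts nsdperf in server mode on all test nodes and
then drives them interactively from one node:

    # on every node under test (NSD servers and clients):
    ./nsdperf -s
    # then, from any one control node, interactively:
    ./nsdperf
    > server nsd00 nsd02           # hostnames are placeholders
    > client client01 client02
    > rdma on                      # exercise verbs RDMA, not just TCP
    > test write read
    > quit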

In addition, since you have 2 x NSD servers, it will be good to run the
NSD client file-system performance test against just a single NSD server
(mmshutdown the other server, assuming all the NSDs have a primary NSD
server configured and quorum will stay intact when one NSD server is
brought down). This shows whether it helps to improve the read
performance, and whether the file-system read bandwidth varies between
NSD_server#1 'active' and NSD_server#2 'active' (with the other NSD server
in the GPFS "down" state). If there is significant variation, it can help
to isolate the issue to a particular NSD server (HW or IB issue?).

You can issue "mmdiag --waiters" on the NSD client as well as the NSD
servers during your dd test, to verify whether there are unusually long
GPFS waiters. In addition, you may issue the Linux "perf top -z" command
on the GPFS node to see if there is high CPU usage by any particular
call/event (e.g., if the GPFS config parameter verbsRdmaMaxSendBytes has
been set to a low value from the default of 16M, it can cause the RDMA
completion threads to go CPU bound). Please review the performance
scenarios detailed in Chapter 22 of the Spectrum Scale Problem
Determination Guide (link below).

https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/pdf/scale_pdg.pdf?view=kc
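Concretely, the checks might look like this (mmlsconfig with an attribute
name is assumed to print just that parameter's value on your level):

    mmdiag --waiters                   # unusually long waiters during dd?
    mmlsconfig verbsRdmaMaxSendBytes   # confirm it has not been lowered
    perf top -z                        # any single hot call/event?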


Thanks,
-Kums 





From:        Kenneth Waegeman <kenneth.waegeman at ugent.be>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        04/21/2017 11:43 AM
Subject:        Re: [gpfsug-discuss] bizarre performance behavior
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi, 
We already verified this on our nsds:
[root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed
QpiSpeed=maxdatarate
[root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode
turbomode=enable
[root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --SysProfile
SysProfile=perfoptimized
so sadly this is not the issue.
Also, the output of the verbs commands looks OK: there are connections
from the client to the NSDs and there is data being read and written.
Thanks again! 
Kenneth

On 21/04/17 16:01, Kumaran Rajaram wrote:
Hi,

Try enabling the following in the BIOS of the NSD servers (screenshots
were attached to the original message):
Turbo Mode - Enable
QPI Link Frequency - Max Performance
Operating Mode - Maximum Performance
>>>>While we have even better performance with sequential reads on raw
storage LUNs, using GPFS we can only reach 1GB/s in total (each NSD server
seems limited to 0.5GB/s) independent of the number of clients
>>We are testing from 2 testing machines connected to the NSDs with
InfiniBand, verbs enabled.

Also, it will be good to verify that all the GPFS nodes have verbs RDMA
started, using "mmfsadm test verbs status", and that the NSD client-server
communication during the "dd" run is actually using verbs RDMA, using the
"mmfsadm test verbs conn" command (on the NSD client doing the dd). If
not, then GPFS might be using the TCP/IP network over which the cluster is
configured, impacting performance (if this is the case, check
mmfs.log.latest for any verbs RDMA related errors and resolve them).
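That is, on the NSD client while the dd is running:

    mmfsadm test verbs status    # is verbs RDMA started on this node?
    mmfsadm test verbs conn      # are the active connections using RDMA?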





Regards,
-Kums






From:        "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" 
<aaron.s.knister at nasa.gov>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        04/21/2017 09:11 AM
Subject:        Re: [gpfsug-discuss] bizarre performance behavior
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Fantastic news! It might also be worth running "cpupower monitor" or 
"turbostat" on your NSD servers while you're running dd tests from the 
clients to see what CPU frequency your cores are actually running at.  

A typical NSD server workload (especially with IB verbs and for reads) can
be pretty light on CPU, which might not prompt your CPU frequency governor
to up the frequency (which can affect throughput). In my testing I've seen
exactly this behavior when the frequency scaling governor doesn't kick up
the frequency of the CPUs.
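Something along these lines on the NSD servers while the client dd is in
flight (turbostat's -i sets the sample interval; flags can vary a bit by
version):

    cpupower monitor            # per-core idle-state residency / frequency
    turbostat -i 5              # actual core frequencies, sampled every 5s
    cpupower frequency-info     # current governor and frequency limits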

-Aaron




On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman
<kenneth.waegeman at ugent.be> wrote:
Hi, 
We are running a test setup with 2 NSD servers backed by 4 Dell PowerVault
MD3460s. nsd00 is the primary server for the LUNs of controller A of the 4
PowerVaults; nsd02 is the primary for the LUNs of controller B.
We are testing from 2 testing machines connected to the NSDs with
InfiniBand, verbs enabled.
When we do dd from the NSD servers, we indeed see performance going to
5.8GB/s for one NSD server and 7.2GB/s for the two! So it looks like GPFS
is able to get the data at a decent speed. Since we can write from the
clients at a good speed, I didn't suspect the communication between
clients and NSDs being the issue, especially since total performance stays
the same using 1 or multiple clients.

I'll use the nsdperf tool to see if we can find anything, 

thanks!

K

On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] 
wrote:
Interesting. Could you share a little more about your architecture? Is it
possible to mount the fs on an NSD server and do some dd's from the fs on
the NSD server? If that gives you decent performance, perhaps try nsdperf
next:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf


-Aaron




On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman
<kenneth.waegeman at ugent.be> wrote:
Hi,
Having an issue that looks the same as this one:
We can do sequential writes to the filesystem at 7.8 GB/s total, which is
the expected speed for our current storage backend. While we have even
better performance with sequential reads on raw storage LUNs, using GPFS
we can only reach 1GB/s in total (each NSD server seems limited to
0.5GB/s), independent of the number of clients (1, 2, 4, ...) or the ways
we tested (fio, dd). We played with blockdev params, maxMBpS,
prefetchThreads, hyperthreading, C1E/C-states, etc. as discussed in this
thread, but nothing seems to impact this read performance.
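For reference, the sequential-read leg of such a test might look roughly
like this (path, block size, and count are placeholders; iflag=direct
bypasses the client page cache):

    # sequential read of a pre-written test file on the GPFS mount
    dd if=/gpfs/fs0/testfile of=/dev/null bs=16M count=2048 iflag=direct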
Any ideas?
Thanks!

Kenneth

On 17/02/17 19:29, Jan-Frode Myklebust wrote:
I just had a similar experience with a SanDisk InfiniFlash system
SAS-attached to a single host. gpfsperf reported 3.2 GByte/s for writes
and 250-300 MByte/s on sequential reads!! Random reads were on the order
of 2 GByte/s.

After a bit of head scratching and fumbling around, I found out that
reducing maxMBpS from 10000 to 100 fixed the problem! Digging further, I
found that reducing prefetchThreads from the default of 72 to 32 also
fixed it, while leaving maxMBpS at 10000. It can now also read at 3.2
GByte/s.
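In mmchconfig terms, that is roughly (a sketch; maxMBpS can be applied
immediately with -i, while a prefetchThreads change, as far as I recall,
takes effect only after a daemon restart):

    # option 1: cap the prefetch bandwidth assumption, effective now
    mmchconfig maxMBpS=100 -i
    # option 2: fewer prefetch threads, after a GPFS restart
    mmchconfig prefetchThreads=32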

Could something like this be the problem on your box as well?



-jf
Fri. 17 Feb 2017 at 18:13, Aaron Knister <aaron.s.knister at nasa.gov>
wrote:
Well, I'm somewhat scrounging for hardware. This is in our test
environment :) And yep, it's got the 2U GPU-tray in it, although it has 2
PCIe slots onboard (excluding the on-board dual-port mezz card), so I
think it would make a fine NSD server even without the riser.

-Aaron

On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services)
wrote:
> Maybe it's related to interrupt handlers somehow? You drive the load up
on one socket, you push all the interrupt handling to the other socket
where the fabric card is attached?
>
> Dunno ... (Though I am intrigued you use iDataPlex nodes as NSD servers,
I assume it's some 2U GPU-tray riser one or something!)
>
> Simon
> ________________________________________
> From: gpfsug-discuss-bounces at spectrumscale.org[
gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [
aaron.s.knister at nasa.gov]
> Sent: 17 February 2017 15:52
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] bizarre performance behavior
>
> This is a good one. I've got an NSD server with 4x 16Gb fibre
> connections coming in and 1x FDR10 and 1x QDR connection going out to
> the clients. I was having a really hard time getting anything resembling
> sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for
> reads). The back-end is a DDN SFA12K and I *know* it can do better than
> that.
>
> I don't remember quite how I figured this out, but simply by running
> "openssl speed -multi 16" on the NSD server to drive up the load, I saw
> an almost 4x performance jump, which pretty much goes against every
> sysadmin fiber in me (i.e. "drive up the CPU load with unrelated crap to
> quadruple your I/O performance").
>
> This feels like some type of C-states/frequency-scaling shenanigans that
> I haven't quite ironed down yet. I booted the box with the kernel
> parameters "intel_idle.max_cstate=0 processor.max_cstate=0", which
> didn't seem to make much of a difference. I also tried setting the
> frequency governor to userspace and setting the minimum frequency to
> 2.6GHz (it's a 2.6GHz CPU). None of that really matters -- I still have
> to run something to drive up the CPU load, and then performance improves.
>
> I'm wondering if this could be an issue with the C1E state? I'm curious
> if anyone has seen anything like this. The node is a dx360 M4
> (Sandy Bridge) with 16 2.6GHz cores and 32GB of RAM.
>
> -Aaron
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


















More information about the gpfsug-discuss mailing list