Spectrum Scale is a highly scalable, high-performance storage solution for file and object storage. It started more than 20 years ago as a research project and is now used by thousands of customers. IBM continues to enhance Spectrum Scale in response to recent hardware advancements and evolving workloads.
This presentation discusses selected improvements in Spectrum Scale V5, focusing on inode management, vCPU scaling, and NUMA considerations.
Part 1 (inode management):
Part 2 (vCPU scaling and NUMA considerations):
Download slides here
Q&A – inode management
Q: If I make the block size 8K, can an inode store a file of that size?
A: No, the maximum inode size is 4K. As discussed on the call, changing the block size (including the metadata block size) doesn’t impact the size of an inode, which is currently limited to 4K.
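As a quick check, the configured inode size of a file system can be displayed with mmlsfs; a minimal sketch, where the file system name gpfs1 is only a placeholder:
# Show the inode size in bytes for file system gpfs1 (placeholder name)
mmlsfs gpfs1 -i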
Q: When files are created, an inode number is assigned from 1 to the maximum. 32-bit applications can only address inodes up to about 4 billion. With millions of temp files created during application runs, inodes get used up very quickly. Once a job is finished the files are deleted, but the inodes are not recycled. This results in the inode space filling up while the file system isn’t full. Can the inodes of deleted files be recycled for future use?
A: Inodes do get reused after deletion. How many independent filesets do you have? If you only have the root fileset and you set maxInodes to less than 4 billion, then you can never have an inode number greater than 4 billion.
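To see how close an inode space is to its limit, the allocated and maximum inode counts can be inspected per fileset and for the whole file system; a sketch, again with gpfs1 as a placeholder name:
# List all filesets with their inode space, maximum and allocated inodes
mmlsfileset gpfs1 -L
# Show overall inode usage for the file system
mmdf gpfs1 -F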
Q: Do we have any idea how long it takes to re-layout the inode allocation map? And can that be done while the FS is mounted?
A: The run time for this operation depends on the size of the existing inode allocation map file, since we migrate data from the existing map to the new map. In one customer engagement the migration completed in an hour; in another case it took 18 hours.
While this operation can theoretically be done while the file system is mounted, we currently restrict it to an offline file system for safety reasons. We are evaluating making this an online operation in a future release. The re-layout parameters can, however, be tested with the file system mounted.
Q: Are there counters that report the lock collisions/waiters for lock contention that would indicate whether a re-layout is desirable?
A: ‘mmfsadm dump ialloc’ provides counters on segment searches. Grep for ‘inodes allocated from’. Ideally, we expect allocations to happen from ‘inodes allocated from list of prefetched/deleted inodes’ or ‘inodes allocated from current ialloc segment’. Also, long waiters during file creation are an indication of inode space pressure.
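A rough sketch of pulling those counters (keeping in mind that, as noted below, the mmfsadm dump command should be avoided in production):
# Extract the inode allocation counters to see where allocations come from
mmfsadm dump ialloc | grep 'inodes allocated from'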
Q: Why does a large NumNodes value influence mmdf run time? (We have seen run times of several minutes.)
A: mmdf fetches cached data. This should not be impacted by cluster size.
Q: How does NumNodes relate to the number of segments?
A: The number of inode allocation map segments is chosen such that every node can find a segment with free inodes even if 75% of all segments are full. This has to do with inode expansion being triggered only when the inode space is 75% full; we want inode allocation to continue while the expansion is taking place. This means that the number of segments is roughly 4 times NumNodes. For example, a file system created with NumNodes=512 would be laid out with roughly 2048 segments.
Q: Are there any general recommendations for initial inode allocation? I know this depends on the filesystem’s expected use. We typically just base it roughly off existing systems.
A: Use the default value of allocated inodes (by omitting the NumInodesToPreallocate argument of the --inode-limit option of mmcrfs/mmcrfileset) when creating a file system or independent fileset, and let inodes expand on demand.
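For illustration, a sketch of both cases; the device name, stanza file, fileset name, and inode limits are placeholders:
# File system: set only the maximum; preallocation stays at the default
# and inodes expand on demand
mmcrfs gpfs1 -F nsd.stanza --inode-limit 10000000
# Independent fileset: same idea
mmcrfileset gpfs1 fset1 --inode-space new --inode-limit 1000000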
Q: How is the inode allocation map, and its segmentation, affected if metadata NSDs are added or deleted?
A: The inode allocation map is not affected by newly added NSDs, as it only tracks inode state. The block allocation map is the one that tracks free/used disk blocks, and it is updated when disks are added or deleted.
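For example, the per-disk block usage that the block allocation map tracks can be reviewed after adding or deleting NSDs with mmdf (gpfs1 is a placeholder name):
# Show size and free blocks for each NSD in the file system
mmdf gpfs1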
Q: Can we shrink the inode space if we allocate too large an inode space by mistake using --inode-limit?
A: No.
Q: When files are deleted, does the recovery of free inodes happen in a lazy way? One customer just reported that after deleting data from a 5 TB file system, the free space is not reflected on the file system.
A: Yes. The files are deleted asynchronously in the background. You can run ‘mmfsadm dump deferreddeletions’ to see the number of inodes that are queued for deletion in mounted file systems.
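A rough sketch of watching that queue drain (again, mmfsadm dump is a service command and should be used with care):
# Print the deferred-deletion counts every 30 seconds
while true; do
  mmfsadm dump deferreddeletions
  sleep 30
done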
Q: At what version is automatic inode expansion available?
A: Since the earliest Spectrum Scale versions.
Q: How do you identify the metanode?
A: Here is an example:
ls -i testfile
68608 testfile
Then find this inode number in ‘mmfsadm dump files’. (Note that the mmfsadm dump command should be avoided in production.)
===== dump files =====
[… search on inode]
inode 68608 snap 0 USERFILE nlink 1 genNum 0x49DE6F0F mode‑
The above is an example of how you might look up the metanode for a file.
You can map the cluster name by looking at the ‘tscomm’ section of a dump, e.g.:
===== dump tscomm =====
[…]
Domain , myAddr <c1n2> (was: <c1p0>)[…]
UID domain 0x1800DF65038 (0xFFFFB6768DF65038) Name “c202f06fs04a-ib1.gpfs.net”
Q: Is the metanode transient?
A: A metanode is a per file assignment. It lasts for as long as there are open instances of the file. The assignment is dynamic and the metanode role may automatically migrate to other nodes for better performance.
Q: If some nodes went down and the metanode is unable to get updates from those failed nodes, how are updates maintained by the metanode?
A: A non-metanode will send its updates to the metanode before it writes any dependent blocks to disk. If the non-metanode went down before it could send its updates, then log recovery ensures that there are no inconsistent modifications to disk data by the non-metanode. Spectrum Scale only guarantees persistence of data/metadata from the last sync window.
Q: Can we prevent the metanode from migrating to a remote node? Also, will it help improve performance if we limit the metanode to the storage cluster?
A: Metanode performance depends on how many nodes are sending metanode updates and how expensive the network sends are. The file system uses such heuristics to determine the optimal metanode placement, and in most cases it is best to let the file system make this decision. The only known use case for preventing metanode migration to a remote node is when the remote node is in a compute cluster that cannot afford the overhead of metanode operations. For this rare case we have an undocumented configuration parameter to force the metanode to stay in the storage cluster.
Q: Sometimes when we delete a large amount of data, it takes significant time for the freed space to show in ‘df -h’ output. Do we need to run mmrestripefs to reclaim the deleted space faster?
A: ‘df -h’ returns cached information on free space. It is likely that the large file that was deleted has not yet freed up its space, as file deletes happen in the background. You can use ‘mmfsadm dump deferreddeletions’ to get a count of the number of inodes that are queued for background deletion. If the node is not overloaded on I/O and you find that the number of to-be-deleted inodes is not decreasing at a reasonable rate (depending on the file size and I/O throughput of the node), then we would need to investigate further by collecting dumps and traces; please open a ticket with IBM support in such a case. The mmrestripefs command is for restoring/rebalancing data and metadata replicas. It would not have any impact on speeding up background file deletion.
Q&A – vCPU scaling and NUMA considerations
Q: We now see the following message in the mmfs log; what does it mean? What is missing?
[W] NUMA BIOS/platform support for NUMA is disabled or not available. NUMA results are approximated and GPFS NUMA awareness may suffer.
A: It means libnuma was found but numa_available() returned false. This is a platform firmware functionality shortcoming. Spectrum Scale can still get a lot of information, as some is derivable from /proc. File a ticket with your server vendor reporting that libnuma::numa_available() returns false.
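To verify from the OS side whether NUMA support is actually exposed, a quick check along these lines can help (assuming numactl is installed):
# numactl uses the same libnuma call; it reports an error when
# numa_available() fails, otherwise it prints the node layout
numactl --hardware
# The kernel's own view of the NUMA topology
ls /sys/devices/system/node/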
Q: So, any recommendations on POWER9 for SMT settings? AIX versus Linux on Power? We used to suggest smaller SMT modes in the past.
A: We are running SMT-4 on some large POWER9 systems. Evaluate based on I/O versus workload needs, as discussed on the call.
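On Linux on Power, the SMT mode can be queried and changed with ppc64_cpu (on AIX the equivalent is smtctl); a minimal sketch:
# Show the current SMT mode
ppc64_cpu --smt
# Switch to SMT-4, the mode mentioned above
ppc64_cpu --smt=4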
Q: Are there any special NUMA considerations for AMD systems which are different to NUMA considerations for Intel systems?
A: This is highly dependent on the processor and chipset, independent of brand, and on what that processor and chipset offer for tuning. We do not have any prescriptive guidance.
User group host: Simon Thompson
Speakers:
Michael Harris: Mike is a Senior Software Engineer on the Spectrum Scale Core Team. He has a deep background in OS kernels, device drivers, virtualization, and system software, with a focus on NUMA, atomics, and high-CPU-count concurrency. On GPFS he focuses on NUMA and scaling, as well as DMAPI, host file system integration, and system calls.
Karthik Iyer: Karthik Iyer is a Senior Software Engineer in Spectrum Scale Core. Karthik has 18 years of design and development experience in distributed system software, specifically in the areas of file system core and database management. Karthik also specialises in troubleshooting Spectrum Scale corruption issues.