Spectrum Scale is a highly scalable, high-performance storage solution for file and object storage. It started more than 20 years ago as a research project and is now used by thousands of customers. IBM continues to enhance Spectrum Scale in response to recent hardware advancements and evolving workloads.
This presentation discusses selected improvements in Spectrum Scale V5, focusing on inode management, vCPU scaling, and NUMA considerations.
Part 1 (inode management):
Part 2 (vCPU scaling and NUMA considerations):
Download slides here
Q&A – inode management
Q: If I make the block size 8K, can an inode store a file of that size?
A: No, the maximum inode size is 4K. As discussed on the call, changing the block size (including the metadata block size) doesn’t impact the size of an inode, which is currently limited to 4K.
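As a quick check, the configured inode size of a file system can be displayed with mmlsfs; a minimal sketch, where the file system name gpfs1 is only a placeholder:
# Show the inode size in bytes for file system gpfs1 (placeholder name)
mmlsfs gpfs1 -i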
Q: When files are created, an inode number is assigned from 1 to the maximum. 32-bit applications can only address inodes up to about 4 billion. With millions of temp files created during application runs, inodes get used up very quickly. Once a job is finished the files are deleted, but the inodes are not recycled. This results in the inode space filling up while the file system isn’t full. Can the inodes of deleted files be recycled for future use?
A: Inodes do get reused after deletion. How many independent filesets do you have? If you only have the root fileset and you set maxInodes to less than 4 billion, then you can never have an inode number greater than 4 billion.
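To see how close an inode space is to its limit, the allocated and maximum inode counts can be inspected per fileset and for the whole file system; a sketch, again with gpfs1 as a placeholder name:
# List all filesets with their inode space, maximum and allocated inodes
mmlsfileset gpfs1 -L
# Show overall inode usage for the file system
mmdf gpfs1 -F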
Q: Do we have any idea how long it takes to re-layout the inode allocation map? And can that be done while the FS is mounted?
A: The run time for this operation depends on the size of the existing inode allocation map file, since we migrate data from the existing map to the new map. In one customer engagement the migration completed in an hour; in another case it took 18 hours.
While this operation can theoretically be done while the file system is mounted, we currently restrict it to an offline file system for safety reasons. We are evaluating making this an online operation in a future release. The re-layout parameters can, however, be tested with the file system mounted.
Q: Are there counters that report the lock collisions/waiters for lock contention that would indicate whether a re-layout is desirable?
A: ‘mmfsadm dump ialloc’ provides counters on segment searches. Grep for ‘inodes allocated from’. Ideally, we expect allocations to happen from ‘inodes allocated from list of prefetched/deleted inodes’ or ‘inodes allocated from current ialloc segment’. Also, long waiters during file creation are an indication of inode space pressure.
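A rough sketch of pulling those counters (keeping in mind that, as noted below, the mmfsadm dump command should be avoided in production):
# Extract the inode allocation counters to see where allocations come from
mmfsadm dump ialloc | grep 'inodes allocated from'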
Q: Why does a large NumNodes value influence mmdf run time? (We have seen run times of several minutes.)
A: mmdf fetches cached data. This should not be impacted by cluster size.
Q: How does NumNodes relate to the number of segments?
A: The number of inode allocation map segments is chosen such that every node can find a segment with free inodes even if 75% of all segments are full. This has to do with inode expansion being triggered only when the inode space is 75% full; we want inode allocation to continue while the expansion is taking place. This means that the number of segments is roughly 4 times NumNodes. For example, a file system created with NumNodes=512 would be laid out with roughly 2048 segments.
Q: Are there any general recommendations for initial inode allocation? I know this depends on the filesystem’s expected use. We typically just base it roughly off existing systems.
A: Use the default value of allocated inodes (by omitting the NumInodesToPreallocate argument of the --inode-limit option of mmcrfs/mmcrfileset) when creating a file system or independent fileset, and let inodes expand on demand.
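For illustration, a sketch of both cases; the device name, stanza file, fileset name, and inode limits are placeholders:
# File system: set only the maximum; preallocation stays at the default
# and inodes expand on demand
mmcrfs gpfs1 -F nsd.stanza --inode-limit 10000000
# Independent fileset: same idea
mmcrfileset gpfs1 fset1 --inode-space new --inode-limit 1000000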
Q: How is the inode allocation map, and its segmentation, affected if metadata NSDs are added or deleted?
A: The inode allocation map is not affected by newly added NSDs, as it only tracks inode state. The block allocation map is the one that tracks free/used disk blocks, and it is updated when disks are added or deleted.
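For example, the per-disk block usage that the block allocation map tracks can be reviewed after adding or deleting NSDs with mmdf (gpfs1 is a placeholder name):
# Show size and free blocks for each NSD in the file system
mmdf gpfs1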
Q: Can we shrink the inode space if we allocate too large an inode space by mistake using --inode-limit?
A: No.
Q: When files are deleted, does the recovery of free inodes happen in a lazy way? One customer just reported that after deleting data from a 5 TB file system, the free space is not reflected on the file system.
A: Yes. The files are deleted asynchronously in the background. You can run ‘mmfsadm dump deferreddeletions’ to see the number of inodes that are queued for deletion in mounted file systems.
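A rough sketch of watching that queue drain (again, mmfsadm dump is a service command and should be used with care):
# Print the deferred-deletion counts every 30 seconds
while true; do
  mmfsadm dump deferreddeletions
  sleep 30
done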
Q: At what version is automatic inode expansion available?
A: Since the earliest Spectrum Scale versions.
Q: How do you identify the metanode?
A: Here is an example:
ls -i testfile
68608 testfile
Then find this inode number in ‘mmfsadm dump files’. (Note that the mmfsadm dump command should be avoided in production.)
===== dump files =====
[… search on inode]
inode 68608 snap 0 USERFILE nlink 1 genNum 0x49DE6F0F mode‑
The above is an example of how you might look up the metanode for a file.
You can map the cluster name by looking at the ‘tscomm’ section of a dump, e.g.:
===== dump tscomm =====
[…]
Domain , myAddr <c1n2> (was: <c1p0>)[…]
UID domain 0x1800DF65038 (0xFFFFB6768DF65038) Name “c202f06fs04a-ib1.gpfs.net”
Q: Is the metanode transient?
A: A metanode is a per file assignment. It lasts for as long as there are open instances of the file. The assignment is dynamic and the metanode role may automatically migrate to other nodes for better performance.
Q: If some nodes went down and the metanode is unable to get updates from those failed nodes, how are updates maintained by the metanode?
A: A non-metanode will send its updates to the metanode before it writes any dependent blocks to disk. If the non-metanode went down before it could send its updates, then log recovery ensures that there are no inconsistent modifications to disk data by the non-metanode. Spectrum Scale only guarantees persistence of data/metadata from the last sync window.
Q: Can we prevent the metanode from migrating to a remote node? Also, will it help improve performance if we limit the metanode to the storage cluster?
A: Metanode performance depends on how many nodes are sending metanode updates and how expensive the network sends are. The file system uses such heuristics to determine the optimal metanode placement, and in most cases it is best to let the file system make this decision. The only known use case for preventing metanode migration to a remote node is when the remote node is in a compute cluster that cannot afford the overhead of metanode operations. For this rare case we have an undocumented configuration parameter to force the metanode to stay in the storage cluster.
Q: Sometimes when we delete a large amount of data, it takes significant time for the freed space to show in ‘df -h’ output. Do we need to run mmrestripefs to reclaim the deleted space faster?
A: ‘df -h’ returns cached information on free space. It is likely that the large file that was deleted has not yet freed up its space, as file deletes happen in the background. You can use ‘mmfsadm dump deferreddeletions’ to get a count of the number of inodes that are queued for background deletion. If the node is not overloaded on I/O and you find that the number of to-be-deleted inodes is not decreasing at a reasonable rate (depending on the file size and I/O throughput of the node), then we would need to investigate further by collecting dumps and traces; please open a ticket with IBM support in such a case. The mmrestripefs command is for restoring/rebalancing data and metadata replicas. It would not have any impact on speeding up background file deletion.
Q&A – vCPU scaling and NUMA considerations
Q: We now see the following message in the mmfs log; what does it mean? What is missing?
[W] NUMA BIOS/platform support for NUMA is disabled or not available. NUMA results are approximated and GPFS NUMA awareness may suffer.
A: It means libnuma was found but numa_available() returned false. This is a platform firmware functionality shortcoming. Spectrum Scale can still get a lot of information, as some is derivable from /proc. File a ticket with your server vendor reporting that libnuma::numa_available() returns false.
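To verify from the OS side whether NUMA support is actually exposed, a quick check along these lines can help (assuming numactl is installed):
# numactl uses the same libnuma call; it reports an error when
# numa_available() fails, otherwise it prints the node layout
numactl --hardware
# The kernel's own view of the NUMA topology
ls /sys/devices/system/node/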
Q: So, any recommendations on POWER9 for SMT settings? AIX versus Linux on Power? We used to suggest smaller SMT modes in the past.
A: We are running SMT-4 on some large POWER9 systems. Evaluate based on I/O versus workload needs, as discussed on the call.
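On Linux on Power, the SMT mode can be queried and changed with ppc64_cpu (on AIX the equivalent is smtctl); a minimal sketch:
# Show the current SMT mode
ppc64_cpu --smt
# Switch to SMT-4, the mode mentioned above
ppc64_cpu --smt=4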
Q: Are there any special NUMA considerations for AMD systems which are different to NUMA considerations for Intel systems?
A: This is highly dependent on the processor and chipset, independent of brand, and on what that processor and chipset offer for tuning. We do not have any prescriptive guidance.
User group host: Simon Thompson
Speakers:
Michael Harris: Mike is a Senior Software Engineer on the Spectrum Scale Core Team. He has a deep background in OS kernels, device drivers, virtualization, and system software, with a focus on NUMA, atomics, and high-CPU-count concurrency. On GPFS he focuses on NUMA and scaling, as well as DMAPI, host file system integration, and system calls.
Karthik Iyer: Karthik Iyer is a Senior Software Engineer in Spectrum Scale Core. Karthik has 18 years of design and development experience in distributed system software, specifically in the areas of file system core and database management. Karthik also specialises in troubleshooting Spectrum Scale corruption issues.