At UMBC, the Lustre file systems (/umbc/lustre) are shared among many users and many application processes, which causes contention for various Lustre resources. This article explains how Lustre I/O works, and provides best practices for improving application performance.
How Does Lustre I/O Work?
When a client (a compute node from your job) needs to create or access a file, the client queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the file. Once the file is opened and the client obtains the file location, the MDS is no longer involved in the file I/O process. The client interacts directly with the object storage servers (OSSes) and object storage targets (OSTs) to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple clients try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency so that all clients see consistent results.
Jobs running on Maya contend for shared resources in UMBC’s Lustre filesystem. The Lustre server can only handle about 15,000 remote procedure calls per second (RPCs are inter-process communications that allow the client to cause a procedure to be executed on the server). Contention slows the performance of your applications and weakens the overall health of the Lustre filesystem. To reduce contention and improve performance, please follow the practices below in your compute jobs while working in our high-end computing environment.
Avoid Using ls -l
The ls -l command displays information such as the ownership, permissions, and size of all files and directories. The ownership and permission metadata is stored on the MDTs. However, the file size metadata is only available from the OSTs. So, the ls -l command issues RPCs to the MDS/MDT and OSSes/OSTs for every file and directory to be listed. RPC requests to the OSSes/OSTs are very costly and can take a long time to complete if there are many files and directories.
- Use ls by itself if you just want to see if a file exists
- Use ls -l filename if you want the long listing of a specific file
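The same distinction applies inside scripts. As a minimal sketch in Python (the function names are illustrative and not part of any Maya tooling), listing names only reads the directory, while a size or ownership lookup is limited to the one file you care about:

```python
import os


def list_names(directory):
    """Return entry names only, like plain `ls`.

    os.listdir reads the directory and does not stat each entry,
    so no per-file size lookup is issued.
    """
    return sorted(os.listdir(directory))


def long_listing_for(path):
    """Stat a single file, like `ls -l filename`."""
    info = os.stat(path)
    return {
        "size": info.st_size,
        "mode": oct(info.st_mode & 0o777),
        "uid": info.st_uid,
    }
```

Calling long_listing_for inside a loop over every name would reproduce the cost of ls -l, so reserve it for individual files.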
Avoid Having a Large Number of Files in a Single Directory
Opening a file keeps a lock on the parent directory. When many files in the same directory are to be opened, it creates contention. A better practice is to split a large number of files (in the thousands or more) into multiple subdirectories to minimize contention.
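One way to split files across subdirectories is to derive the subdirectory from the file name. The following is a sketch only: the bucket count of 100, the three-digit directory names, and the helper names are assumptions chosen for illustration.

```python
import os
import zlib


def bucket_for(filename, n_buckets=100):
    """Map a file name to a stable bucket index in [0, n_buckets).

    zlib.crc32 is used because it gives the same value on every run,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(filename.encode("utf-8")) % n_buckets


def bucketed_path(root, filename, n_buckets=100):
    """Return root/<bucket>/<filename>, creating the bucket directory."""
    subdir = os.path.join(root, f"{bucket_for(filename, n_buckets):03d}")
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, filename)
```

Because the mapping depends only on the file name, a reader can recompute the path later without scanning any directory.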
Avoid Accessing Small Files on Lustre Filesystems
Accessing small files on the Lustre filesystem is not efficient. When possible, keep them on an NFS-mounted filesystem (such as your home filesystem on Maya, /home/<username>), or copy them from Lustre to $JOB_SCRATCH_DIR on each node at the beginning of the job and access the files from $JOB_SCRATCH_DIR.
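The copy step at the start of a job might look like the sketch below. It assumes the small files sit in a single flat directory; the function name is hypothetical, and $JOB_SCRATCH_DIR is read from the environment only when no destination is given.

```python
import os
import shutil


def stage_to_scratch(src_dir, scratch_dir=None):
    """Copy every regular file in src_dir into scratch_dir once.

    scratch_dir defaults to $JOB_SCRATCH_DIR; a KeyError is raised
    if that variable is not set. Returns the staged file paths.
    """
    if scratch_dir is None:
        scratch_dir = os.environ["JOB_SCRATCH_DIR"]
    os.makedirs(scratch_dir, exist_ok=True)
    staged = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            staged.append(shutil.copy2(src, os.path.join(scratch_dir, name)))
    return staged
```

After staging, the job reads only the returned paths, so each small file is touched on Lustre once per node rather than on every access.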
Keep Copies of Your Source Code on the Maya Home Filesystem
Be aware that files under /umbc/lustre are not backed up. Make sure that you have copies of your source code, makefiles, and any other important files saved on your Maya home filesystem.
Avoid Accessing Executables on Lustre Filesystems
There have been a few incidents on Maya where users’ jobs encountered problems while accessing their executables on the /umbc/lustre filesystem. The main issue is that the Lustre clients can become unmounted temporarily when there is a very high load on the Lustre filesystem. This can cause a bus error when a job tries to bring the next set of instructions from the inaccessible executable into memory.
Executables run slower when run from the Lustre filesystem. It is best to run executables, and any libraries they link against, from your home filesystem (or group_saved space if available) on Maya. Libraries are especially susceptible to issues as Lustre does not have any sense of file locking. On rare occasions, running executables from the Lustre filesystem can cause executables to be corrupted. Avoid copying new executables over existing ones of the same name within the Lustre filesystem. The copy creates a window of time (about 20 minutes) during which the executable will not function. Instead, the executable should be accessed from your home filesystem during runtime.
Limit the Number of Processes Performing Parallel I/O
Given that the numbers of OSSes and OSTs on Maya are about a hundred or fewer, there will be contention if a large number of processes of an application are involved in parallel I/O. Instead of allowing all processes to do the I/O, choose just a few processes to do the work. For writes, these few processes should collect the data from other processes before the writes. For reads, these few processes should read the data and then broadcast the data to others.
When reading and writing, do not allow multiple processes to write to the same file. The same applies to creating files in the same directory: too many file creations in one directory will slow the system. A better way to handle large amounts of data is to create fewer, larger files, which maximizes I/O throughput and minimizes metadata queries.
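The idea of funneling writes through a few processes can be sketched without MPI. In the example below, a list stands in for the per-rank data, and the grouping rule (contiguous blocks of ranks, lowest rank writes) and the part_NNNNN.dat naming are assumptions for illustration. In a real MPI code the collection step would be a gather to the writer rank rather than a loop.

```python
import os


def writer_for(rank, n_ranks, n_writers):
    """Return the rank that performs I/O on behalf of `rank`.

    Ranks are split into contiguous groups of equal size (the last
    may be smaller); the lowest rank in each group is its writer.
    At most n_writers distinct writers result.
    """
    group_size = -(-n_ranks // n_writers)  # ceiling division
    return (rank // group_size) * group_size


def write_aggregated(data_by_rank, n_writers, out_dir):
    """Simulate gather-then-write: one file per writer, not per rank."""
    n_ranks = len(data_by_rank)
    gathered = {}
    for rank in range(n_ranks):
        writer = writer_for(rank, n_ranks, n_writers)
        gathered.setdefault(writer, []).append(data_by_rank[rank])
    paths = []
    for writer, chunks in sorted(gathered.items()):
        path = os.path.join(out_dir, f"part_{writer:05d}.dat")
        with open(path, "wb") as f:
            f.write(b"".join(chunks))  # one large write per writer
        paths.append(path)
    return paths
```

Each writer produces its own file, so no two processes write to the same file and the number of files created stays small.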
Avoid Repetitive “stat” Operations
It is possible that some users have implemented logic in their scripts to test for the existence of certain files. Such tests generate “stat” requests to the Lustre server. When the testing becomes excessive, it creates a significant load on the filesystem. A workaround is to slow down the testing process by adding sleep in the logic. For example, the following user script tests the existence of the files WAIT and STOP to decide what to do next.
    touch WAIT
    rm STOP
    while ( 0 <= 1 )
      if(-e WAIT) then
        mpiexec ...
        rm WAIT
      endif
      if(-e STOP) then
        exit
      endif
    end
When neither the WAIT nor STOP file exists, the loop ends up testing for their existence as quickly as possible (on the order of 5,000 times per second). Adding sleep inside the loop slows down the testing.
    touch WAIT
    rm STOP
    while ( 0 <= 1 )
      if(-e WAIT) then
        mpiexec ...
        rm WAIT
      endif
      if(-e STOP) then
        exit
      endif
      sleep 15
    end
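For scripts written in Python, the same throttled polling can be sketched as follows. The function name, the one-hour default timeout, and the injectable sleep parameter (handy for testing) are assumptions; the 15-second interval mirrors the sleep in the loop above.

```python
import os
import time


def wait_for_file(path, interval=15.0, timeout=3600.0, sleep=time.sleep):
    """Poll for `path`, sleeping `interval` seconds between checks.

    Returns True once the file exists, or False after `timeout`
    seconds of accumulated sleeping. Each check issues one stat
    request, so a longer interval means less filesystem load.
    """
    waited = 0.0
    while not os.path.exists(path):
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True
```

With a 15-second interval, a waiting job issues about 4 stat requests per minute instead of thousands per second.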
Avoid Having Multiple Processes Open the Same File(s) at the Same Time
On Lustre filesystems, if multiple processes try to open the same file(s) at the same time, some processes may not be able to find the file(s), and your job will fail.
The source code can be modified to call the sleep function between I/O operations. This will reduce the occurrence of multiple, simultaneous access attempts to the same file from different processes.
    100 open(unit,file='filename',IOSTAT=ierr)
        if (ierr.ne.0) then
          ...
          call sleep(1)
          go to 100
        endif

When opening a read-only file in Fortran, use ACTION='read' instead of the default ACTION='readwrite'. The former will reduce contention by not locking the file.
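A comparable retry in Python is sketched below. Unlike an unbounded retry, it stops after a fixed number of attempts; the function name, the default of 5 attempts, and the injectable sleep parameter are assumptions for illustration. The default mode "r" opens the file read-only.

```python
import time


def open_with_retry(path, mode="r", attempts=5, delay=1.0, sleep=time.sleep):
    """Open `path`, pausing `delay` seconds after each failed attempt.

    Gives up after `attempts` tries and re-raises the last OSError,
    so a file that is truly missing does not cause an endless loop.
    """
    for attempt in range(1, attempts + 1):
        try:
            return open(path, mode)
        except OSError:
            if attempt == attempts:
                raise
            sleep(delay)
```

Bounding the retries keeps a job from generating open requests indefinitely when the file is never going to appear.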
Avoid Repetitive Open/Close Operations
Opening and closing files incurs overhead, so repetitive open/close operations should be avoided.
If you intend to open the files for read only, make sure to use ACTION='READ' in the open statement. If possible, read each file once and save the results, instead of reading the files repeatedly.
If you intend to write to a file many times during a run, open the file once at the beginning of the run. When all writes are done, close the file at the end of the run.
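Both habits are sketched below in Python: open the output once for all writes, and keep the contents of a file in memory after the first read. The function names and the dictionary used as a cache are hypothetical, and the cache assumes the files are small enough to hold in memory.

```python
def write_all(path, records):
    """Open the output file once, write every record, then close it."""
    with open(path, "w") as out:
        for record in records:
            out.write(f"{record}\n")


def read_once(path, cache):
    """Read a file the first time it is requested; reuse the result after.

    `cache` is a dict owned by the caller, mapping path -> contents.
    """
    if path not in cache:
        with open(path, "r") as f:
            cache[path] = f.read()
    return cache[path]
```

The pattern to avoid is calling open inside the loop, which turns each record into a separate open/close pair.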