Table of Contents
- Running Serial Jobs
- Running Parallel Jobs
- Node selection
- Some details about the batch system
- Parts of a SLURM script
- Job scheduling issues
- Specifying a time limit
- Email notifications for job status changes
- Controlling exclusive vs. shared access to nodes
- The memory limit
- Heavy Memory Use Jobs (such as MATLAB)
- Requesting an arbitrary number of tasks
- Setting a ‘begin’ time
- Requeue-ability of your jobs
- Using scratch storage
- Charging computing time to a PI, for users under multiple PIs
- Interactive jobs
- Selecting specific nodes
- Jobs stuck in the “PD” state
- What is the priority of my job?
- How can I check my fair-share level?
Running a program on maya is a bit different than running one on a standard workstation. When we log into the cluster, we are interacting with the user node. But we would like our programs to run on the compute nodes, which is where the real computing power of the cluster is. We will walk through the processes of running serial and parallel code on the cluster, and then later discuss some of the finer details. This page uses the code examples from the compilation tutorial. Please download and compile those examples, so you can follow along.
Resource intensive jobs (long running, high memory demand, etc.) should be run on the compute nodes of the cluster. You cannot execute jobs directly on the compute nodes yourself; you must request that the cluster’s batch system do it on your behalf. To use the batch system, you will submit a special script which contains instructions to execute your job on the compute nodes. When submitting your job, you specify a partition (group of nodes: for testing vs. production) and a QOS (a classification that determines what kind of resources your job will need). Your job will wait in the queue until it is “next in line”, and free processors on the compute nodes become available. Which job is “next in line” is determined by the scheduling rules of the cluster. Once a job is started, it continues until it either completes or reaches its time limit, in which case it is terminated by the system.
The batch system used on maya is called SLURM, which is short for Simple Linux Utility for Resource Management. Users transitioning from the cluster hpc should be aware that SLURM behaves a bit differently than PBS, and the scripting is a little different too. Unfortunately, this means you will need to rewrite your batch scripts. However many of the confusing points of PBS, such as requesting the number of nodes and tasks per node, are simplified in SLURM.
Scheduling Fundamentals on maya: Partitions, QOS’s, and more
Please read first the scheduling rules web page for complete background on the available queues and their limitations.
Interacting with the Batch System
There are several basic commands you’ll need to know to submit jobs, cancel them, and check their status. These are:
- sbatch – submit a job to the batch queue system
- squeue – check the current jobs in the batch queue system
- sinfo – view the current status of the queues
- scancel – cancel a job
Check here for more detailed information about job monitoring.
The first command we will mention is scancel. If you’ve submitted a job that you no longer want, you should be a responsible user and kill it. This will prevent resources from being wasted, and allow other users’ jobs to run. Jobs can be killed while they are pending (waiting to run), or while they are actually running. To remove a job from the queue or to cancel a running job cleanly, use the scancel command with the identifier of the job to be deleted. For instance:
[araim1@maya-usr1 hello_serial]$ scancel 636
[araim1@maya-usr1 hello_serial]$
The job identifier can be obtained from the job listing from squeue (see below) or you might have noted it from the response of the call to sbatch, when you originally submitted the job (also below). Try “man scancel” for more information.
Now that we know how to cancel a job, we will see how to submit one. You can use the sbatch command to submit a script to the batch queue system.
[araim1@maya-usr1 hello_serial]$ sbatch run.slurm
sbatch: Submitted batch job 2626
[araim1@maya-usr1 hello_serial]$
In this example run.slurm is the script we are sending to the batch queue system. We will see shortly how to formulate such a script. Notice that sbatch returns a job identifier. We can use this to kill the job later if necessary, or to check its status. For more information, check the man page by running “man sbatch”.
You can use the squeue command to check the status of jobs in the batch queue system. Here’s an example of the basic usage:
[araim1@maya-usr1 ~]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES    QOS NODELIST(REASON)
   2564     batch   MPI_DG  gobbert PD       0:00     64 medium (Resources)
   2626     batch fmMle_no   araim1  R       0:02      4 normal n[9-12]
   2579     batch   MPI_DG  gobbert  R 1-02:40:36      2   long n[7-8]
   2615     batch     test   aaronk  R    2:41:51     32 medium n[3-6,14-41]
[araim1@maya-usr1 ~]$
The most interesting column is the one titled ST for “status”. It shows what a job is doing at this point in time. The state “PD” indicates that the job has been queued. When free processor cores become available and this process is “next in line”, it will change to the “R” state and begin executing. You may also see a job with status “CG” which means it’s completing, and about to exit the batch system. Other statuses are possible too, see the man page for squeue. Once a job has exited the batch queue system, it will no longer show up in the squeue display.
We can also see several other pieces of useful information. The TIME column shows the current walltime used by the job. For example, job 2579 has been running for 1 day, 2 hours, 40 minutes, and 36 seconds. The NODELIST column shows which compute nodes have been assigned to the job. For job 2626, nodes n9, n10, n11, and n12 are being used. However for job 2564, we can see that it’s pending because it’s waiting on resources.
The sinfo command also shows the current status of the batch system, but from the point of view of the SLURM partitions. Here is an example
[gobbert@maya-usr1 hello-serial]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
develop*     up     31:00     2  idle n[1-2]
batch        up  infinite    82  idle n[3-84]
[araim1@maya-usr1 ~]$
Running Serial Jobs
This section assumes you’ve already compiled the serial hello world example. Now we’ll see how to run it several different ways.
Test runs on the user node
The most obvious way to run the program is on the user node, which we normally log into.
[hu6@maya-usr1 hello_serial]$ ./hello_serial
Hello world from maya-usr1
[hu6@maya-usr1 hello_serial]$
We can see the reported hostname which confirms that the program ran on the user node.
Let’s submit our job to the develop partition, since we just created it and we’re not completely sure that it works. The following script will accomplish this. Save it to your account alongside the “hello-serial” executable.
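The script itself was not reproduced here; a minimal sketch consistent with the flags described below might look as follows (treat the exact file contents as illustrative):

```shell
#!/bin/bash
# Sketch of run-testing.slurm: run hello_serial on the develop partition
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop

./hello_serial
```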
Here, the partition flag chooses the develop partition. The output and error flags set the file name for capturing standard output (stdout) and standard error (stderr), respectively, and the job-name flag simply sets the string that is displayed as the name of the job in squeue. Now we’re ready to submit our job to the scheduler. To accomplish this, use the sbatch command as follows
[araim1@maya-usr1 hello_serial]$ sbatch run-testing.slurm
sbatch: Submitted batch job 2626
[araim1@maya-usr1 hello_serial]$
If the submission was successful, the sbatch command returns a job identifier to us. We can use this to check the status of the job (squeue), or delete it (scancel) if necessary. This job should run very quickly if there are processors available, but we can try to check its status in the batch queue system. The following command shows that our job is not in the system – it is so quick that it has already completed!
[araim1@maya-usr1 hello_serial]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES    QOS NODELIST(REASON)
[araim1@maya-usr1 hello_serial]$
We should have obtained two output files. The file slurm.err contains stderr output from our program. If slurm.err isn’t empty, check the contents carefully as something may have gone wrong. The file slurm.out contains our stdout output; it should contain the hello world message from our program.
[araim1@maya-usr1 hello_serial]$ ls slurm.*
slurm.err  slurm.out
[araim1@maya-usr1 hello_serial]$ cat slurm.err
[araim1@maya-usr1 hello_serial]$ cat slurm.out
Hello world from n70
[araim1@maya-usr1 hello_serial]$
Notice that the hostname no longer matches the user node, but one of the test nodes. We’ve successfully used one of the compute nodes to run our job. The develop partition limits jobs to five minutes by default, measured in “walltime”, which is just the elapsed run time. The limit can be raised to up to 30 minutes using the --time flag; details are given later on this page. After your job has reached this time limit, it is stopped by the scheduler. This is done to ensure that everyone has a fair chance to use the cluster.
Production runs on the batch partition
Once our job has been tested and we’re confident that it’s working correctly, we can run it in the batch partition. Now the walltime limit for our job may be raised, based on the QOS we choose. There are also many more compute nodes available in this partition, so we probably won’t have to wait long to find a free processor. Start by creating the following script with some of the most important features.
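The production script was not reproduced here; a sketch consistent with the flags described below might be (file and job names are illustrative):

```shell
#!/bin/bash
# Sketch of run-serial.slurm: run hello_serial on the batch partition
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=normal

./hello_serial
```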
The flags for job-name, output, and error are the same as in the previous script. The partition flag is now set to batch. Additionally, the qos flag chooses the normal QOS. This is the default QOS, so the result would be the same if we had not specified any QOS; we recommend always specifying a QOS explicitly for clarity. To submit our job to the scheduler, we issue the command
[araim1@maya-usr1 hello_serial]$ sbatch run-serial.slurm
sbatch: Submitted batch job 2626
[araim1@maya-usr1 hello_serial]$
We can check the job’s status, but due to its shortness, it has already completed and does not show up any more.
[araim1@maya-usr1 hello_serial]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES    QOS NODELIST(REASON)
[araim1@maya-usr1 hello_serial]$
This time our stdout output file indicates that our job has run on one of the primary compute nodes, rather than a develop node
[araim1@maya-usr1 hello_serial]$ ls slurm.*
slurm.err  slurm.out
[araim1@maya-usr1 hello_serial]$ cat slurm.err
[araim1@maya-usr1 hello_serial]$ cat slurm.out
Hello world from n71
[araim1@maya-usr1 hello_serial]$
Selecting a QOS
Notice that we specified the normal QOS with the qos flag. Because we know our job is very quick, a more appropriate QOS would be short. To specify the short QOS, replace normal by short to get the line
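In other words, the QOS line in the submission script becomes:

```shell
#SBATCH --qos=short
```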
in the submission script. In the same way, you can access any of the QOS’s listed in the scheduling rules. The rule of thumb is that you should always choose the QOS whose wall time limit is the most appropriate for your job. Realizing that these limits are hard upper limits, you will want to stay safely under them, or in other words, pick a QOS whose wall time limit is comfortably larger than the actually expected run time.
Note that the QOS of each job is shown by default in the squeue output. We have set this up on maya as a convenience, by setting the SQUEUE_FORMAT environment variable.
Running Parallel Jobs
This section assumes that you’ve successfully compiled the parallel hello world example. Now we’ll see how to run this program on the cluster.
Test runs on the develop partition
Example 1: One node, one process
First we will run the hello_parallel program as a single process. This will appear very similar to the serial job case. The difference is that now we are using the MPI-enabled executable hello_parallel, rather than the plain hello_serial executable. Create the following script in the same directory as the hello_parallel program. Notice the addition of the “srun” command before the executable, which is used to launch MPI-enabled programs. We’ve also added “--nodes=1” and “--ntasks-per-node=1” to specify what kind of resources we’ll need for our parallel program.
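The script was not reproduced here; a sketch consistent with the description might be (the file name intel-np1.slurm matches the submission below; job name is illustrative):

```shell
#!/bin/bash
# Sketch of intel-np1.slurm: one MPI process on one develop node
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

srun ./hello_parallel
```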
Now submit the script
[hu6@maya-usr1 hpcfweb]$ sbatch intel-np1.slurm
Submitted batch job 21503
[hu6@maya-usr1 hpcfweb]$
Checking the output after the job has completed, we can see that exactly one process has run and reported back.
[hu6@maya-usr1 hpcfweb]$ ls slurm.*
slurm.err  slurm.out
[hu6@maya-usr1 hpcfweb]$ cat slurm.err
[hu6@maya-usr1 hpcfweb]$ cat slurm.out
Hello world from process 000 out of 001, processor name n1
Example 2: One node, two processes
Next we will run the job on two processes of the same node. This is one important test, to ensure that our code will function in parallel. We want to be especially careful that the communications work correctly, and that processes don’t hang. We modify the single process script and set “--ntasks-per-node=2”.
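A sketch of the modified script (file name intel-ppn2.slurm matches the submission below):

```shell
#!/bin/bash
# Sketch of intel-ppn2.slurm: two MPI processes on one develop node
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2

srun ./hello_parallel
```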
Submit the script to the batch queue system
[hu6@maya-usr1 hpcfweb]$ sbatch intel-ppn2.slurm
sbatch: Submitted batch job 2626
[hu6@maya-usr1 hpcfweb]$
Now observe that two processes have run and reported in. Both were located on the same node as we expected.
[araim1@maya-usr1 hello_parallel]$ ls slurm.*
slurm.err  slurm.out
[araim1@maya-usr1 hello_parallel]$ cat slurm.err
[araim1@maya-usr1 hello_parallel]$ cat slurm.out
Hello world from process 000 out of 002, processor name n1
Hello world from process 001 out of 002, processor name n1
[araim1@maya-usr1 hello_parallel]$
Example 3: Two nodes, one process per node
Now let’s try to use two different nodes, but only one process on each node. This will exercise our program’s use of the high performance network, which didn’t come into the picture when a single node was used.
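A sketch of the corresponding script (file name intel-nodes2-ppn1.slurm matches the submission below):

```shell
#!/bin/bash
# Sketch of intel-nodes2-ppn1.slurm: one MPI process on each of two nodes
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

srun ./hello_parallel
```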
Submit the script to the batch queue system
[hu6@maya-usr1 hpcfweb]$ sbatch intel-nodes2-ppn1.slurm
Submitted batch job 21505
Notice that again we have two processes, but this time they have distinct processor names.
[hu6@maya-usr1 hpcfweb]$ ls slurm.*
slurm.err  slurm.out
[hu6@maya-usr1 hpcfweb]$ cat slurm.err
[hu6@maya-usr1 hpcfweb]$ cat slurm.out
Hello world from process 000 out of 002, processor name n1
Hello world from process 001 out of 002, processor name n2
[hu6@maya-usr1 hpcfweb]$
Example 4: Two nodes, eight processes per node
To illustrate the use of more processes, let’s try a job that uses two nodes, eight processes on each node. This is still possible on the develop partition. Therefore it is possible to run small performance studies which are completely restricted to the develop partition. Use the following batch script
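A sketch of this script (file name intel-nodes2-ppn8.slurm matches the submission below):

```shell
#!/bin/bash
# Sketch of intel-nodes2-ppn8.slurm: eight MPI processes on each of two nodes
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

srun ./hello_parallel
```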
Submit the script to the batch system
[araim1@maya-usr1 hello_parallel]$ sbatch intel-nodes2-ppn8.slurm
sbatch: Submitted batch job 2626
[araim1@maya-usr1 hello_parallel]$
For reference, we quote the output of squeue, when using the above environment variable setting, which reads for this job
[araim1@maya-usr1 hello_parallel]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES    QOS NODELIST(REASON)
  61911   develop hello_pa   araim1  R       0:02      2 normal n[1-2]
Now observe the output. Notice that the processes have reported back in a non-deterministic order, and there are eight per node if you count them.
[araim1@maya-usr1 hello_parallel]$ ls slurm.*
slurm.err  slurm.out
[araim1@maya-usr1 hello_parallel]$ cat slurm.err
[araim1@maya-usr1 hello_parallel]$ cat slurm.out
Hello world from process 002 out of 016, processor name n1
Hello world from process 011 out of 016, processor name n2
Hello world from process 014 out of 016, processor name n2
Hello world from process 006 out of 016, processor name n1
Hello world from process 010 out of 016, processor name n2
Hello world from process 007 out of 016, processor name n1
Hello world from process 001 out of 016, processor name n1
Hello world from process 015 out of 016, processor name n2
Hello world from process 000 out of 016, processor name n1
Hello world from process 008 out of 016, processor name n2
Hello world from process 003 out of 016, processor name n1
Hello world from process 012 out of 016, processor name n2
Hello world from process 005 out of 016, processor name n1
Hello world from process 013 out of 016, processor name n2
Hello world from process 004 out of 016, processor name n1
Hello world from process 009 out of 016, processor name n2
[araim1@maya-usr1 hello_parallel]$
Production runs on the batch partition
Now we’ve tested our program in several important configurations in the develop partition. We know that it performs correctly, and processes do not hang. We may now want to solve larger problems which are more time consuming, or perhaps we may wish to use more processes. We can promote our code to “production”, by simply changing “--partition=develop” to “--partition=batch”. We may also want to specify “--qos=short” as before, since the expected run time of our job is several seconds at most. Of course if this were a more substantial program, we might need to specify a longer QOS like normal, medium, or long.
Node selection
The maya cluster has several different types of nodes available, and users may want to select certain kinds of nodes to suit their jobs. In this section we will discuss how to do this.
The following variation of the sinfo command shows some basic information about nodes in maya.
[hu6@maya-usr1 ~]$ sinfo -o "%10N %8z %8m %40f %10G"
NODELIST   S:C:T    MEMORY   FEATURES                                 GRES
n[34-51]   16:1:1   64508    hpcf2013,e5_2650v2,michost,mic_5110p,qdr mic:2
n[3-33]    16:1:1   64508    hpcf2013,e5_2650v2,qdr                   (null)
n[72-153,2 8:1:1    1+       hpcf2010,x5560,mellanox                  (null)
n[52-69]   16:1:1   64510    hpcf2013,e5_2650v2,qdr                   gpu:2
n[156-237] 8:1:1    24149    hpcf2009,x5550,qdr,mellanox              (null)
n[1-2]     16:1:1   1+       hpcf2013,e5_2650v2                       (null)
n[70-71,15 8:1:1    24149    hpcf2010,x5560                           (null)
maya-usr2- 1:1:1    1+       miccard,miccard                          (null)
[hu6@maya-usr1 ~]$
- In the column S:C:T, “S” is the number of processors (“sockets”), “C” is the number of cores per processor, and “T” is the number of threads per core. It is seen that nodes n[1-69] have 16 processor cores per node, but nodes n[70-153] have eight.
- The memory column shows the total system memory in MB.
- The features column describes static properties of the nodes; we use this to declare the section of the cluster (e.g. “hpcf2013”), the type of CPU (e.g. “e5_2650v2”), the fact that the node contains a GPU or Phi (e.g. “gpu” or “mic”), and the specific kind of GPU or Phi (e.g. “mic_5110p” or “gpu_k20”).
- The gres column (a SLURM acronym which stands for Generic Resource) gives some specific resources that can be reserved, along with the amount available. We currently use this to record that there are two GPUs or two Phis available in the GPU- and Phi-enabled nodes, respectively.
The following extensions of the above “sinfo” command may be useful to view the availability of each node type. In the last column, the four numbers correspond to the number of nodes which are currently in the following states: A = allocated, I = idle, O = other, and T = total.
[araim1@maya-usr1 ~]$ sinfo -o "%10N %8z %8m %40f %10G %F"
NODELIST   S:C:T    MEMORY   FEATURES                                 GRES       NODES(A/I/O/T)
n[34-51]   16:1:1   64508    hpcf2013,e5_2650v2,michost,mic_5110p,qdr mic:2      0/0/18/18
n[3-33]    16:1:1   64508    hpcf2013,e5_2650v2,qdr                   (null)     23/8/0/31
n[72-153,2 8:1:1    1+       hpcf2010,x5560,mellanox                  (null)     80/12/2/94
n[52-69]   16:1:1   64510    hpcf2013,e5_2650v2,qdr                   gpu:2      0/18/0/18
n[156-237] 8:1:1    24149    hpcf2009,x5550,qdr,mellanox              (null)     0/82/0/82
n[1-2]     16:1:1   1+       hpcf2013,e5_2650v2                       (null)     0/0/2/2
n[70-71,15 8:1:1    24149    hpcf2010,x5560                           (null)     0/4/0/4
maya-usr2- 1:1:1    1+       miccard,miccard                          (null)     0/32/4/36
[araim1@maya-usr1 ~]$
Nodes marked as allocated might still have available processors. The following command, using %C instead of %F, will show us availability at the processor level.
[araim1@maya-usr1 ~]$ sinfo -o "%10N %8z %8m %40f %10G %C"
NODELIST   S:C:T    MEMORY   FEATURES                                 GRES       CPUS(A/I/O/T)
n[34-51]   16:1:1   64508    hpcf2013,e5_2650v2,michost,mic_5110p,qdr mic:2      0/0/288/288
n[3-33]    16:1:1   64508    hpcf2013,e5_2650v2,qdr                   (null)     282/214/0/496
n[72-153,2 8:1:1    1+       hpcf2010,x5560,mellanox                  (null)     602/134/16/752
n[52-69]   16:1:1   64510    hpcf2013,e5_2650v2,qdr                   gpu:2      0/288/0/288
n[156-237] 8:1:1    24149    hpcf2009,x5550,qdr,mellanox              (null)     0/656/0/656
n[1-2]     16:1:1   1+       hpcf2013,e5_2650v2                       (null)     0/0/32/32
n[70-71,15 8:1:1    24149    hpcf2010,x5560                           (null)     0/32/0/32
maya-usr2- 1:1:1    1+       miccard,miccard                          (null)     0/32/4/36
[araim1@maya-usr1 ~]$
Other types of summaries are possible as well; try “man sinfo” for more information.
As we will see below, the “features” correspond to things that can be specified using the --constraint option, and “gres” corresponds to things that can be specified by “--gres”.
Select nodes by CPU type
To demonstrate node selection on maya, first consider a very simple batch script which does not specify any type of node.
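The script was not reproduced here; judging from the output below (two nodes, three tasks each, each reporting its hostname), a sketch might be:

```shell
#!/bin/bash
# Sketch of run.slurm: report the hostname seen by each task
#SBATCH --job-name=node_select
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=short
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3

srun hostname
```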
[hu6@maya-usr1 hpcfweb]$ sbatch run.slurm
Submitted batch job 21506
[hu6@maya-usr1 hpcfweb]$ cat slurm.err
[hu6@maya-usr1 hpcfweb]$ cat slurm.out
n71
n71
n71
n72
n72
n72
[hu6@maya-usr1 hpcfweb]$
Notice that we are assigned two nodes from the hpcf2010 equipment. This can be verified by checking the table of hostnames at System Description.
Suppose we would like to use the hpcf2013 nodes instead. This can be accomplished with the “--constraint” option and specifying the feature “hpcf2013”. Recall that the list of features was obtained above from the sinfo output.
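That is, we add one line to the previous hostname-reporting sketch (the rest of the script is illustrative):

```shell
#!/bin/bash
# Sketch: same hostname-reporting job, restricted to hpcf2013 nodes
#SBATCH --job-name=node_select
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=short
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --constraint=hpcf2013

srun hostname
```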
[hu6@maya-usr1 hpcfweb]$ sbatch run.slurm
Submitted batch job 21507
[hu6@maya-usr1 hpcfweb]$ cat slurm.err
[hu6@maya-usr1 hpcfweb]$ cat slurm.out
n3
n3
n3
n4
n4
n4
[hu6@maya-usr1 hpcfweb]$
Select GPU nodes
Selecting GPU-enabled nodes can be done using the “--gres” option. Specifying “--gres=gpu” requests one GPU on each node of our job. This allows the scheduler to allocate the remaining CPUs and GPUs to other users.
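A sketch of such a script, again based on the hostname-reporting example (job name and task counts are illustrative):

```shell
#!/bin/bash
# Sketch: request one GPU on each node of the job
#SBATCH --job-name=gpu_select
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=short
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --gres=gpu

srun hostname
```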
[araim1@maya-usr1 gpu-constraint]$ sbatch run.slurm
[araim1@maya-usr1 gpu-constraint]$ cat slurm.err
[araim1@maya-usr1 gpu-constraint]$ cat slurm.out
n52
n52
n52
n53
n53
n53
[araim1@maya-usr1 gpu-constraint]$
We can also request one GPU per CPU by specifying “--gres=gpu*cpu”, or two GPUs per node by “--gres=gpu:2”.
For more detailed instructions on running your code on a GPU, see CUDA for GPU.
Select Phi nodes
Selecting Phi-enabled nodes is exactly the same as selecting GPU-enabled nodes, except using the appropriate feature names (“mic” or “mic_5110p”) for “--constraint” or resource name (“mic”) for “--gres”. Start with the GPU examples above and make the appropriate substitutions.
For more detailed instructions on running your code on a Phi, see Intel Phi.
Heterogeneous jobs: mix of node types
Some details about the batch system
A SLURM batch script is a special kind of shell script. As we’ve seen, it contains information about the job like its name, expected walltime, etc. It also contains the procedure to actually run the job. Read on for some important details about SLURM scripting, as well as a few other features that we didn’t mention yet.
For more information, try the following sources
- Try the command “man sbatch”
- See the DoIT SLURM Usage Guide (UMBC account required)
- See the official SLURM documentation
Parts of a SLURM script
Here is a quick reference for the options discussed on this page.
|#||Indicates a comment line, which is ignored by the scheduler (lines beginning with #SBATCH are the exception).|
|#SBATCH||Indicates a special line that should be interpreted by the scheduler.|
|srun ./hello_parallel||This is a special command used to execute MPI programs. The command uses directions from SLURM to assign your job to the scheduled nodes.|
|--job-name=hello_serial||This sets the name of the job; the name that shows up in the “Name” column in squeue’s output. The name has no significance to the scheduler, but helps make the display more convenient to read.|
|--output=slurm.out, --error=slurm.err||These tell SLURM where it should send your job’s output stream and error stream, respectively. If you would like to prevent either of these streams from being written, set the file name to /dev/null.|
|--partition=batch||Set the partition in which your job will run.|
|--qos=normal||Set the QOS in which your job will run.|
|--nodes=4||Request four nodes.|
|--ntasks-per-node=8||Request eight tasks to be run on each node. The number of tasks may not exceed the number of processor cores on the node.|
|--ntasks=11||Request 11 tasks for your job.|
|--array=0-135||Used to submit and manage a collection of similar jobs quickly; in this example case, 136 jobs.|
|--time=1-12:30:00||This option sets the maximum amount of time SLURM will allow your job to run before it is automatically killed. In the example shown, we have requested 1 day, 12 hours, 30 minutes, and 0 seconds. Several other formats are accepted, such as “HH:MM:SS” (assuming less than a day). If your specified time is too large for the partition/QOS you’ve specified, the scheduler will not run your job.|
|--mail-type=type||SLURM can email you when your job reaches certain states. Set type to one of: BEGIN to notify you when your job starts, END for when it ends, FAIL for if it fails to run, or ALL for all of the above. See the example below.|
|--mail-user=email@example.com||Specify the recipient(s) for notification emails (see the example below).|
|--mem-per-cpu=MB||Specify a memory limit for each process of your job. The default is 2944 MB.|
|--mem=MB||Specify a memory limit for each node of your job. The default is that there is a per-core limit.|
|--exclusive||Specify that you need exclusive access to nodes for your job. This is the opposite of “--share”.|
|--share||Specify that your job may share nodes with other jobs. This is the opposite of “--exclusive”.|
|--begin=2010-01-20T01:30:00||Tell the scheduler not to attempt to run the job until the given time has passed.|
|--dependency=afterany:15030:15031||Tell the scheduler not to run the job until jobs with IDs 15030 and 15031 have completed.|
|--account=pi_name||Tell the scheduler to charge this job to pi_name.|
|--constraint=feature_name||Tell the scheduler that nodes scheduled for this job must have the feature “feature_name”.|
|--gres=resource_name||Tell the scheduler that nodes scheduled for this job will use the resource “resource_name”.|
Job scheduling issues
- Don’t leave large-scale jobs enqueued during weekdays. Suppose you have a job that requires all the nodes on the cluster. If you submit this job to the scheduler, it will remain enqueued until all the nodes become available. Usually the scheduler will allow smaller jobs to run first, but sometimes it will enqueue them behind yours. The latter causes a problem during peak usage hours, because it clogs up the queue and diminishes the overall throughput of the cluster. The best times to submit these large-scale types of jobs are nights and weekends.
- Make sure the scheduler is in control of your programs. Avoid spawning processes or running background jobs from your code, as the scheduler can lose track of them. Programs running outside of the scheduler hinder its ability to allocate new jobs to free processors.
- Avoid jobs that run for many days uninterrupted. It’s best for the overall productivity of the cluster to design jobs that are “small to medium” in size – they should be large enough to accomplish a significant amount of work, but small enough so that resources are not tied up for too long. Avoiding long jobs is also in your best interest.
- Consider saving your progress. This is related to the issue above. Running your entire computation at once might be impractical or infeasible. Besides the fact that very long jobs can make scheduling new jobs difficult, it can also be very inconvenient for you if they fail after running for several days (due to a hardware issue for example). It may be possible to design your job to save its progress occasionally, so that it won’t need to restart from the beginning if there are any issues.
- Estimate memory usage before running your job. As discussed later in this page, jobs on maya have a default memory limit which you can raise or lower as needed. If your job uses more than this limit, it will be killed by the scheduler. Requesting the maximum available memory for every job is poor practice, though, because it can lower the overall productivity of the system. Imagine you’re running a serial job which requires half the available memory. Specifying this in your submission script will allow the scheduler to use the remaining cores and memory on the node for other users’ jobs. The best strategy is to estimate how much memory you will need, and specify a reasonable upper bound in your submission script. Two suggestions for making this estimate are (1) calculate the sizes of the main objects in your code, and (2) run a small-scale version of the problem and observe the actual memory used. For more information on how to observe memory usage, see checking memory usage.
- Estimate walltime usage before running your job. Giving an accurate estimate of your job’s necessary walltime is very helpful for the scheduler, and the overall productivity of the cluster. The QOS choices (short, normal, etc) provide an easy way to specify this information. You may also specify it more granularly as an elapsed time; see below for more information and an example.
- Use fewer nodes and more processes per node. Performance studies have demonstrated that using multiple cores on the same node gives generally comparable performance to using a single core on the same number of nodes. And using the minimum number of nodes required for your job benefits the overall productivity of the cluster. See for example the technical report HPCF-2010-2.
- Consider sharing your nodes unless you require exclusive access. If your job isn’t using all cores on a node, it might be possible for other jobs to make use of the free cores. Allowing your nodes to be shared helps to improve the overall productivity of the cluster. The downside however is that other jobs might interfere with the performance of your job. If your job uses a small amount of memory and is not a performance study (for example), sharing is probably a good option. To see how to control exclusive vs. shared, see the examples below.
Specifying a time limit
The partitions and QOS’s on maya have time limits built in; for example, the “normal” QOS imposes a maximum walltime on every job, which may be lower for jobs requesting many nodes. These are hard upper limits. Suppose you will be using 64 nodes, but will only require at most 2 hours. It is beneficial to supply this information to the scheduler, and it may also allow your job to be backfilled. See the scheduling rules for more information, and also note the following example.
Suppose the system currently has 14 free nodes, and there are two jobs in the queue waiting to run. Suppose also that no additional nodes will become free in the next few hours. The first queued job (“job #1”) requires 16 nodes, and the second job (“job #2”) requires only 2 nodes. Since job #1 was queued first, job #2 would normally need to wait behind it. However, if the scheduler sees that job #2 would complete in the time that job #1 would be waiting, it can allow job #2 to skip ahead.
Here is an example where we’ve specified a time limit of 3 hours, 15 minutes, and 0 seconds. Notice that we’ve started with the batch script from Running Parallel Jobs, Example 1 and added a single “--time=” statement.
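A sketch of such a script (only the --time line is new relative to the earlier example; the rest is illustrative):

```shell
#!/bin/bash
# Sketch: parallel hello world with a 3 hour 15 minute walltime limit
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=3:15:00

srun ./hello_parallel
```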
Note that in our experience, SLURM seems to round time limits up to the next minute. For example, specifying “--time=00:00:20” will result in an actual time limit of 1 minute.
Email notifications for job status changes
You can request the scheduler to email you on certain events related to your job. Namely:
- When the job starts running
- When the job exits normally
- If the job is aborted
As an example of how to use this feature, let’s ask the scheduler to email us on all three events when running the hello_serial program. Let’s start with the batch script developed earlier, and add the options “--mail-type=ALL” and “--mail-user=email@example.com”, where “email@example.com” is your actual email address. After submitting this script, we can check our email and receive the following messages.
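A sketch of the resulting script (the address is a placeholder to replace with your own):

```shell
#!/bin/bash
# Sketch: hello_serial with email notifications on start, end, and failure
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@example.com

./hello_serial
```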
From: Simple Linux Utility for Resource Management <firstname.lastname@example.org>
Date: Thu, Jan 14, 2010 at 10:53 AM
Subject: SLURM Job_id=2655 Name=hello_serial Began
To: email@example.com
From: Simple Linux Utility for Resource Management <firstname.lastname@example.org>
Date: Thu, Jan 14, 2010 at 10:53 AM
Subject: SLURM Job_id=2655 Name=hello_serial Ended
To: email@example.com
Because hello_serial is such a trivial program, the start and end emails appear to have been sent simultaneously. For a more substantial program the waiting time could be significant, both for your job to start and for it to run to completion. In this case email notifications could be useful to you.
By default, you are not given exclusive access to the nodes you're assigned by the scheduler. This means that your job may run on a node with another user's jobs. You can override this default behavior using the "--exclusive" option.
Here’s an example where we reserve an entire node.
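A minimal sketch (the surrounding options are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
# Request that no other jobs be scheduled on our node
#SBATCH --exclusive

srun ./hello_serial
```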
If our job involves multiple nodes, specifying the "--exclusive" flag requests exclusive access to all nodes that will be in use by the job.
We may also explicitly permit sharing of our nodes with the "--share" flag.
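For example (a sketch):

```shell
# Explicitly allow other jobs to run on the unused cores of our nodes
#SBATCH --share
```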
Jobs on maya are limited to a maximum of 23,552 MB per node out of the total 24 GB system memory. The rest is reserved for the operating system. Jobs are run inside a “job container”, which protects the cluster against jobs overwhelming the nodes and taking them offline. By default, jobs are limited to 2944 MB per core, based on the number of cores you have requested. If your job goes over the memory limit, it will be killed by the batch system.
The memory limit may be specified per core or per node. To set the limit per core, simply add a line to your submission script as follows, keeping in mind that a larger memory request might increase the amount of time your job spends in the waiting queue:
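For example (the 3000 MB figure is only illustrative; use one form or the other):

```shell
# Per-core limit (MB): applies to each requested core
#SBATCH --mem-per-cpu=3000

# Alternatively, a per-node limit (MB): applies to each node as a whole
# (remove the extra '#' to enable, and omit --mem-per-cpu)
##SBATCH --mem=3000
```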
In the serial case, the two options are equivalent. For parallel computing situations it may be more natural to use the per core limit, given that the scheduler has some freedom to assign processes to nodes for you.
If your job is killed because it has exceeded its memory limit, you will receive an error similar to the following in your stderr output. Notice that the effective limit is reported in the error.
slurmd[n1]: error: Job 13902 exceeded 3065856 KB memory limit, being killed
slurmd[n1]: error: *** JOB 13902 CANCELLED AT 2010-04-22T17:21:40 ***
srun: forcing job termination
Note that the memory limit can be useful in conducting performance studies. If your code runs out of physical memory and begins to use swap space, the performance will be severely degraded. For a performance study, this may be considered an invalid result and you may want to try a smaller problem, use more nodes, etc. One way to protect against this is to reserve entire nodes (as discussed elsewhere on this page), and set the memory limit to 23 GB per node (or less). That is about the maximum you can use before swapping starts to occur. Then the batch system will kill your job if it’s close enough to swapping.
Note that a memory limit can be specified even for non-SLURM jobs. This can be useful for interactive jobs on the user node. For example, running the following command limits the virtual memory available to subsequent commands in the current shell session to 2 GB (2097152 KB).
[araim1@maya-usr1 ~]$ ulimit -S -v 2097152
Heavy Memory Use Jobs (such as MATLAB)
When scheduling jobs that have heavy memory use (such as MATLAB), you should determine in advance the amount of memory that the process may use and request it explicitly in the SLURM script. Also notice the sleep command: for large jobs where many MATLAB instances will be brought up, the sleep command prevents the job from overwhelming MATLAB's license server.
#!/bin/bash
# Name of the job:
#SBATCH --job-name=RUN_MATLAB_1GB_RAM
# N specifies that 1 job step is to be allocated per instance of matlab
#SBATCH -N1
# This specifies the number of cores per matlab session will be available for parallel jobs
#SBATCH --cpus-per-task 1
# Specify the desired partition develop/batch/prod
#SBATCH --partition=batch
# Specify the qos and run time (format: dd-hh:mm:ss)
#SBATCH --qos=medium
#SBATCH --time=13:00
# This is in MB
#SBATCH --mem=5000
# Specify the job array (format: start-stop:step)
#SBATCH --array=0-31

sleep $(( RANDOM % 30 ))
srun matlab -nodisplay -r "matlab_job;exit"
job = str2num(getenv('SLURM_ARRAY_TASK_ID'))
$ sbatch matlab_job.sbatch
Submitted batch job 67451
$ tail -4 slurm-67451_1.out
job = 1
$ tail -4 slurm-67451_25.out
job = 25
OPTIMIZING MATLAB FOR CLUSTER USAGE
Here are some tips for debugging your MATLAB code if you run into segmentation faults:
#!/bin/bash
# Name of the job:
#SBATCH --job-name=RUN_IDL_2GB_RAM
# N specifies that 1 job step is to be allocated per instance of matlab
#SBATCH -N1
# This specifies the number of cores per matlab session will be available for parallel jobs
#SBATCH --cpus-per-task 1
# Specify the desired partition develop/batch/prod
#SBATCH --partition=batch
# This is in MB
#SBATCH --mem-per-cpu=2000
# Specify the qos and run time (format: dd-hh:mm:ss)
#SBATCH --qos=medium
#SBATCH --time=13:00
# Specify the job array (format: start-stop:step)
#SBATCH --array=0-31

srun idl -e array_test
pro array_test
  ; Load array index into number
  number = LONG(GETENV('SLURM_ARRAY_TASK_ID'))
  PRINT, number
end
[schou@maya-usr1 ~]$ sbatch idl.sbatch
Submitted batch job 225270
[schou@maya-usr1 ~]$ cat slurm-225302_15.out
IDL Version 8.4 (linux x86_64 m64). (c) 2014, Exelis Visual Information Solutions, Inc.
Installation number: 208932-19.
Licensed for use by: University of Maryland (MAIN)
          15
% Compiled module: ARRAY_TEST.
[schou@maya-usr1 ~]$ cat slurm-225302_12.out
IDL Version 8.4 (linux x86_64 m64). (c) 2014, Exelis Visual Information Solutions, Inc.
Installation number: 208932-19.
Licensed for use by: University of Maryland (MAIN)
          12
% Compiled module: ARRAY_TEST.
Requesting an arbitrary number of tasks
So far on this page we've requested some number of nodes, and some number of tasks per node. But what if our application requires a number of tasks, like 11, which cannot be split evenly among a set of nodes? That is, unless we use one process per node, which isn't a very efficient use of those nodes. We can split our 11 processes among as few as two nodes, using the following script. Notice that we don't specify anything else, like how many nodes to use. The scheduler will figure this out for us, and most likely use the minimum number of nodes (two) to accommodate our tasks.
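A sketch of the script (the job name, output files, and partition are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
# Request 11 tasks; node placement is left to the scheduler
#SBATCH --ntasks=11

srun ./hello_parallel
```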
Running this yields the following output in slurm.out
[araim1@maya-usr1 hello_parallel]$ cat slurm.out
Hello world from process 009 out of 011, processor name n2
Hello world from process 008 out of 011, processor name n2
Hello world from process 000 out of 011, processor name n1
Hello world from process 001 out of 011, processor name n1
Hello world from process 002 out of 011, processor name n1
Hello world from process 003 out of 011, processor name n1
Hello world from process 004 out of 011, processor name n1
Hello world from process 005 out of 011, processor name n1
Hello world from process 006 out of 011, processor name n2
Hello world from process 010 out of 011, processor name n2
Hello world from process 007 out of 011, processor name n2
[araim1@maya-usr1 hello_parallel]$
Now suppose we want to limit the number of tasks per node to 2. This can be accomplished with the following batch script.
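A sketch of the script (the job name, output files, and batch partition are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
# 11 tasks total, with at most 2 tasks placed on any one node
#SBATCH --ntasks=11
#SBATCH --ntasks-per-node=2

srun ./hello_parallel
```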
Notice that we needed to move out of the develop queue to demonstrate this scenario. Now we've specified --ntasks-per-node=2 at the top of the script, in addition to --ntasks=11.
[hu6@maya-usr1 hpcfweb]$ sort slurm.out
Hello world from process 000 out of 011, processor name n71
Hello world from process 001 out of 011, processor name n71
Hello world from process 002 out of 011, processor name n72
Hello world from process 003 out of 011, processor name n72
Hello world from process 004 out of 011, processor name n73
Hello world from process 005 out of 011, processor name n73
Hello world from process 006 out of 011, processor name n74
Hello world from process 007 out of 011, processor name n74
Hello world from process 008 out of 011, processor name n75
Hello world from process 009 out of 011, processor name n75
Hello world from process 010 out of 011, processor name n76
where we’ve sorted the output to make it easier to read.
It's also possible to use the "--ntasks" and "--nodes" options together, to specify the number of tasks and nodes, but leave the number of tasks per node up to the scheduler. See "man sbatch" for more information about these options.
Requesting an arbitrary number of tasks using slurm array
Slurm also supports the --array option for job submission. Job arrays enable the user to submit and manage a collection of similar jobs quickly. All jobs must have the same initial options. Job arrays are only supported for batch jobs, and array index values can be specified using the --array or -a option of the sbatch command, as in the example below. More information can be found in the official SLURM documentation. Here is an example job submission script, called array_test.sbatch, that illustrates the --array feature of slurm:
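A sketch of what array_test.sbatch might contain (the partition is an assumption; the array range 0-15 is chosen to match the output shown below):

```shell
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --partition=batch
# Submit array elements with indices 0 through 15
#SBATCH --array=0-15

srun bash array_test.sh
```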
An example array_test shell script that illustrates the --array feature of slurm:
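A sketch of the array_test shell script; the echo line is an assumption chosen to match the output shown below (taskset -cp reports the CPU affinity of the current process):

```shell
#!/bin/bash
# Report the array index, the host we ran on, and our CPU affinity
echo "I am element $SLURM_ARRAY_TASK_ID on host $(hostname), $(taskset -cp $$)"
```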
To start from the command line:
$ sbatch array_test.sbatch
Submitted batch job 31367
Once the job has run, the output shows the node on which each task ran and the core to which it was assigned:
$ grep host slurm-*.out
slurm-31367_10.out:I am element 10 on host n112, pid 12488's current affinity list: 2
slurm-31367_11.out:I am element 11 on host n112, pid 12487's current affinity list: 3
slurm-31367_12.out:I am element 12 on host n112, pid 12535's current affinity list: 4
slurm-31367_13.out:I am element 13 on host n112, pid 12555's current affinity list: 5
slurm-31367_14.out:I am element 14 on host n112, pid 12272's current affinity list: 6
slurm-31367_15.out:I am element 15 on host n112, pid 12791's current affinity list: 7
...
Note: A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example, "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4.
Setting a ‘begin’ time
You can tell the scheduler to wait a specified amount of time before attempting to run your job. This is useful, for example, if your job requires many nodes. Being a conscientious user, you may want to wait until late at night for your job to run. By adding the following to your batch script, we can have the scheduler wait until, for example, 1:30 AM on 2010-01-20.
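The directive might look like this (a sketch):

```shell
# Do not start the job before 1:30 AM on January 20, 2010
#SBATCH --begin=2010-01-20T01:30:00
```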
You can also specify a relative time
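For example (a sketch):

```shell
# Wait at least one hour after submission before starting
#SBATCH --begin=now+1hour
```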
See “man sbatch” for more information.
You may want a job to wait until another one starts or finishes. This can be useful if one job's input depends on the other's output. It can also be useful to ensure that you're not running too many jobs at once. For example, suppose we want our job to wait until the jobs with IDs 15030 and 15031 complete. This can be accomplished by adding the following to our batch script.
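A sketch of the directive:

```shell
# Begin only after jobs 15030 and 15031 have terminated
#SBATCH --dependency=afterany:15030:15031
```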
You can also specify that both jobs should have finished in a non-error state before the current job can start.
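A sketch of the directive:

```shell
# Begin only after jobs 15030 and 15031 have both completed successfully
#SBATCH --dependency=afterok:15030:15031
```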
Notice that the above examples required us to note down the job IDs of our dependencies and specify them when launching the new job. Suppose you have a collection of jobs, and your only requirement is that only one should run at a time. A convenient way to accomplish this is with the “singleton” flag.
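A sketch, assuming the jobs are all named "myjob":

```shell
#SBATCH --job-name=myjob
# Wait until no other job named "myjob" from our account is running
#SBATCH --dependency=singleton
```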
This job will wait until no other job called “myjob” is running from your account.
See “man sbatch” for more information.
Requeue-ability of your jobs
By default it’s assumed that your job can be restarted if a node fails, or if the cluster is about to be brought offline for maintenance. For many jobs this is a safe assumption, but sometimes it may not be.
For example suppose your job appends to an existing data file as it runs. Suppose it runs partially, but then is restarted and then runs to completion. The output will then be incorrect, and it may not be easy for you to recognize. An easy way to avoid this situation is to make sure output files are newly created on each run.
Another way to avoid problems is to specify the following option in your submission script. This will prevent the scheduler from automatically restarting your job if any system failures occur.
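The option in question (a sketch):

```shell
# Do not automatically requeue this job after a node failure or maintenance
#SBATCH --no-requeue
```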
For very long-running jobs, you might also want to consider designing them to save their progress occasionally. This way if it’s necessary to restart such a job, it won’t need to start completely again from the beginning. See job scheduling issues for more information.
Temporary scratch storage is available when you run a job on the compute nodes. The storage is local to each node. You can find the name of your scratch directory in the environment variable “$JOB_SCRATCH_DIR” which is provided by SLURM. Here is an example of how it may be accessed by your batch script.
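A sketch of such a script, written to match the output shown below (the partition and file names are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=test_scratch
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop

echo "My scratch directory is: $JOB_SCRATCH_DIR"
# Create a small file in scratch, then list the directory and display the file
echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" > "$JOB_SCRATCH_DIR/testfile"
echo "Here is a listing of my scratch directory"
ls -l "$JOB_SCRATCH_DIR"
echo "Here are the contents of $JOB_SCRATCH_DIR/testfile"
cat "$JOB_SCRATCH_DIR/testfile"
```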
Submitting this script should yield something like the following
[araim1@maya-usr1 test_scratch]$ cat slurm.out
My scratch directory is: /scratch/367738
Here is a listing of my scratch directory
total 4
-rw-rw---- 1 araim1 pi_nagaraj 27 Jun  8 18:36 testfile
Here are the contents of /scratch/367738/testfile
ABCDEFGHIJKLMNOPQRSTUVWXYZ
[araim1@maya-usr1 test_scratch]$
You can of course also access $JOB_SCRATCH_DIR from C, MATLAB, or any other language or package. Remember that the files only exist for the duration of your job, so make sure to copy anything you want to keep to a separate location, before your job exits.
Check here for more information about scratch and other storage areas.
Charging computing time to a PI, for users under multiple PIs
If you are a member of multiple research groups on maya, this will apply to you. When you run a job on maya, the resources you've used (e.g. computing time) are "charged" to your PI. This simply means that there is a record of your group's use of the cluster. This information is leveraged to make sure everyone has access to their fair share of resources (through the fair-share scheduling rules), especially the PIs who have paid for nodes. Therefore, it's important to charge your jobs to the correct PI.
You have a “primary” account which your jobs are charged to by default. To see this, try checking one of your jobs as follows (suppose our job has ID 25097)
[araim1@maya-usr1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_nagaraj QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   TimeLimit=04:00:00 Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   SubmitTime=2010-06-30T00:14:24 EligibleTime=2010-06-30T00:14:24
   StartTime=2010-06-30T00:14:24 EndTime=2010-06-30T04:14:24
   SuspendTime=None SecsPreSuspend=0
   ...
[araim1@maya-usr1 ~]$
Notice the "Account=pi_nagaraj" field; in this example, this is our default account. This should also be the same as our primary Unix group:
[araim1@maya-usr1 ~]$ id
uid=28398(araim1) gid=1057(pi_nagaraj) groups=100(users),700(contrib),
701(alloc_node_ssh),1057(pi_nagaraj),32296(pi_gobbert)
[araim1@maya-usr1 ~]$
The primary group is given above as "gid=1057(pi_nagaraj)".
Suppose we are also working for another PI, "pi_gobbert". When running jobs for that PI, it's only fair that we charge the computing resources to them instead. To accomplish that, we may add the "--account" option to our batch scripts.
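For example (a sketch):

```shell
# Charge this job to pi_gobbert rather than the default account
#SBATCH --account=pi_gobbert
```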
Note that if you specify an invalid name for the account (a group that does not exist, or which you do not belong to), the scheduler will silently revert to your default account. You can quickly check the Account field in the scontrol output to make sure the option worked.
[araim1@maya-usr1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_gobbert QOS=normal
   ...
[araim1@maya-usr1 ~]$
Normally interactive programs should be run on the user node (i.e. not using the scheduler). If you need to run them on the compute nodes for some reason, contact our HPCF Point of Contact as requested in the usage rules.
If you are running a job on a set of compute nodes, it is also possible to interact with those nodes, for example to collect diagnostic information about the job's memory usage.
Selecting specific nodes
It is possible to select specific compute nodes for your job, although it is usually desirable not to do this. Usually we would rather request only the number of processors, nodes, memory, etc, and let the scheduler find the first available set of nodes which meets our requirements. To select specific nodes, consider the following batch script.
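A sketch of such a script (the surrounding options are assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
# Run only on these two specific nodes
#SBATCH --nodelist=n[31-32]
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

srun ./hello_parallel
```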
Here we have selected two nodes, namely n31 and n32, and request 8 processes per node.
Jobs stuck in the “PD” state
Your job may become stuck in the “PD” state either if there are not enough nodes available to run your job, or the cluster’s scheduler has decided to run other jobs before yours.
A job cannot be run until there are enough free processor cores/nodes to meet its requirement. To illustrate, if somebody submits a job that uses all of the cluster nodes for twelve hours, nobody else can run any jobs until that large job finishes. If you are trying to run a sixteen node job, and there are a set of jobs running which leave less than 16 nodes available, then your job must wait.
When there are a sufficient number of processes/nodes available, the scheduler must decide which job to run next. The decision is based on several factors:
- The number of nodes your job uses. A job that takes up the entire cluster will not run very soon. Use the options mentioned earlier to set the number of nodes your job uses.
- The maximum length of time that your job claims it will take to run. As mentioned earlier, give a walltime estimate to give the scheduler an idea of how long this will be. Smaller jobs may be allowed to run ahead of larger ones. If you do not give an estimate, the scheduler will assume a default, which is based on the queue you’ve submitted to.
- The job priority. This depends on when you submitted your job (generally first-in-first-out (FIFO) is used) and which queue you use. If you use the perform queue, your job will probably run before jobs in the serial queue.
It is also possible that someone else’s job has gotten stuck, or that there is another problem on the cluster. If you suspect that may be the case, run squeue. If there are many jobs whose state (“ST” column) is “R” or “PD” then there are probably no problems on the cluster – there are just a lot of jobs taking up nodes. If a job has been in the “R” state for most of the day, or if you see jobs that are in states other than “PD” or “R” for more than a few seconds, then something is wrong. If this is the case, or if you notice any other strange behavior contact us.
What is the priority of my job?
If your job has been submitted and is in the pending state, its waiting time will depend on the currently running jobs and the other pending jobs. We can use the sprio utility to see the priorities of all pending jobs. Suppose we have the following scenario.
[araim1@maya-usr1 ~]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   4056     batch  users16   araim1  PD       0:00      8 (Resources)
   4061     batch  users21   araim1  PD       0:00      8 (Priority)
   4052     batch  users12   araim1  PD       0:00      8 (Priority)
   4055     batch  users15   araim1  PD       0:00      8 (Priority)
   4057     batch  users17   araim1  PD       0:00      8 (Priority)
   4058     batch  users18   araim1  PD       0:00      8 (Priority)
   4059     batch  users19   araim1  PD       0:00      8 (Priority)
   4060     batch  users20   araim1  PD       0:00      8 (Priority)
   4053     batch  users13   araim1  PD       0:00      8 (Priority)
   4054     batch  users14   araim1  PD       0:00      8 (Priority)
   4062     batch contrib0   araim1  PD       0:00      1 (Priority)
   4063     batch contrib0   araim1  PD       0:00      1 (Priority)
   4051     batch  users11   araim1   R       2:35      8 n[77-84]
   4045     batch  users05   araim1   R       2:36      8 n[29-36]
   4046     batch  users06   araim1   R       2:36      8 n[37-44]
   4047     batch  users07   araim1   R       2:36      8 n[45-52]
   4048     batch  users08   araim1   R       2:36      8 n[53-60]
   4049     batch  users09   araim1   R       2:36      8 n[61-68]
   4050     batch  users10   araim1   R       2:36      8 n[69-76]
   4042     batch  users02   araim1   R       2:37      8 n[5-12]
   4043     batch  users03   araim1   R       2:37      8 n[13-20]
   4044     batch  users04   araim1   R       2:37      8 n[21-28]
   4041     batch  users01   araim1   R       2:37      2 n[3-4]
[araim1@maya-usr1 ~]$
Notice that all compute nodes are in use, and the next job in line to run is 4056. The remaining pending jobs are waiting because they have been assigned a lower priority. If we run sprio, we can see the priorities of the pending jobs.
[araim1@maya-usr1 ~]$ sprio
  JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION
   4052        572         14        463         94          0
   4053        572         14        463         94          0
   4054        572         14        463         94          0
   4055        572         14        463         94          0
   4056        572         14        463         94          0
   4057        572         14        463         94          0
   4058        572         14        463         94          0
   4059        572         14        463         94          0
   4060        572         14        463         94          0
   4061        572         14        463         94          0
   4062        415         12        391         11          0
   4063        415         12        391         11          0
[araim1@maya-usr1 ~]$
All jobs except 4062 and 4063 effectively have the same priority, which is given in the second column. A higher number corresponds to a higher priority. Notice that job 4056 (the next job to run) is in the higher priority group, but is not necessarily the earliest job to be submitted. However, jobs 4052 through 4061 were submitted close enough in time that the age factor is about the same. The 3rd to 6th columns are factors that went into computing the priority. They are combined according to some weights (which are subject to change with system load). The weights themselves can be seen through sprio.
[araim1@maya-usr1 ~]$ sprio -w
  JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION
  Weights                1000       1000       1000       2000
[araim1@maya-usr1 ~]$
If you compute a weighted sum of the given factors, they will not necessarily add up to the displayed priority. If you are interested in the details, you can check the SLURM website. But for most users, we think that observing the priorities and the factors should be sufficient.
How can I check my fair-share level?
The sshare command can be used to check your fair-share usage, which is an important factor in the priority of your jobs.
[araim1@maya-usr1 ~]$ sshare
             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage Fair-share
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          1.000000           0      1.000000   0.500000
 contribution                           80    0.714286           0      0.000000   0.857143
  pi_ithorpe                            16    0.211640           0      0.000000   0.605820
  ...
  pi_strow                              11    0.145503           0      0.000000   0.572751
 community                              20    0.178571           0      0.000000   0.589286
  pi_nagaraj                             1    0.044643           0      0.000000   0.522321
   pi_nagaraj           araim1           1    0.044643           0      0.000000   0.522321
  ...
  pi_gobbert                             1    0.044643           0      0.000000   0.522321
   pi_gobbert           araim1           1    0.014881           0      0.000000   0.507440
[araim1@maya-usr1 ~]$
There are also various viewing options, such as viewing only specific accounts (PIs), and all users under those accounts. Let’s look at the usage for all users under the “pi_gobbert” account.
[araim1@maya-usr1 ~]$ sshare -A pi_gobbert -a
Accounts requested:
        : pi_gobbert
             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage Fair-share
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
pi_gobbert                               1    0.044643           0      0.000000   0.522321
 pi_gobbert            dtrott1           1    0.014881           0      0.000000   0.507440
 pi_gobbert             araim1           1    0.014881           0      0.000000   0.507440
 pi_gobbert            gobbert           1    0.014881           0      0.000000   0.507440
[araim1@maya-usr1 ~]$
- Fair-share is the most important number here. A fair-share of 0.5 means that you’ve used exactly your fair share of the system. A fair-share close to 0 means you’ve used much more than your share, and conversely a fair-share close to 1 means you’ve used almost none of your share of the system.
- Raw Shares are associated with accounts to describe their relative entitlement to the cluster. This is partially determined by financial contributions, and so the "contribution" accounting group has a majority of the raw shares. The raw shares are normalized to sum to 1, and this is shown as Norm Shares. We can see that the norm shares for an account are split among the users of that account.
- Raw Usage is a summary of your usage. It is computed with a “half-life” so that previous activity contributes a smaller and smaller amount as it ages. Some normalization of this number is done to yield Effectv Usage, which is a number between 0 and 1.
- Raw Shares and Effectv Usage, summaries of your entitlement and usage, are combined to compute Fair-share.
For more information about the fair-share calculations, see SLURM’s web page for the Multifactor Priority Plugin. There are some specifics we have not discussed here, such as the rate of decay for previous usage. These are set by the system administrators, and are subject to change.