
How to run programs on Stampede2

Introduction

Running a program on Stampede2 is a bit different from running one on a standard workstation. When we log into the cluster, we are interacting with the login node. But we would like our programs to run on the compute nodes, which are where the real computing power of the cluster lies. We will walk through the process of running serial and parallel code on the cluster, and then discuss some of the finer details. This page uses the code examples from the compilation tutorial; please download and compile those examples so you can follow along.

Resource-intensive jobs (long running, high memory demand, etc.) should be run on the compute nodes of the cluster. You cannot execute jobs directly on the compute nodes yourself; you must ask the cluster's batch system to do it on your behalf. To use the batch system, you submit a special script which contains instructions to execute your job on the compute nodes. When submitting your job, you specify a partition (a group of nodes, e.g. for testing vs. production) and a QOS (a classification that determines what kind of resources your job will need). Your job then waits in the queue until it is "next in line" and free processors on the compute nodes become available. Which job is "next in line" is determined by the scheduling rules of the cluster. Once a job is started, it runs until it either completes or reaches its time limit, in which case it is terminated by the system.

The batch system used on Stampede2 is called SLURM, which is short for Simple Linux Utility for Resource Management. Users transitioning from the cluster hpc should be aware that SLURM behaves a bit differently than PBS, and the scripting is a little different too. Unfortunately, this means you will need to rewrite your batch scripts. However, many of the confusing points of PBS, such as requesting the number of nodes and tasks per node, are simplified in SLURM.

Scheduling Fundamentals on Stampede2: Partitions, QOS's, and more

Please first read the scheduling rules web page for complete background on the available queues and their limitations.

Interacting with the Batch System

There are several basic commands you’ll need to know to submit jobs, cancel them, and check their status. These are:

  • sbatch – submit a job to the batch queue system
  • squeue – check the current jobs in the batch queue system
  • sinfo – view the current status of the queues
  • scancel – cancel a job

Check here for more detailed information about job monitoring.

scancel

The first command we will mention is scancel. If you’ve submitted a job that you no longer want, you should be a responsible user and kill it. This will prevent resources from being wasted, and allow other users’ jobs to run. Jobs can be killed while they are pending (waiting to run), or while they are actually running. To remove a job from the queue or to cancel a running job cleanly, use the scancel command with the identifier of the job to be deleted. For instance:

login1(1001)$ scancel 727
login1(1002)$

The job identifier can be obtained from the job listing from squeue (see below) or after using sbatch, when you originally submitted the job (also below). Try “man scancel” for more information.

sbatch

Now that we know how to cancel a job, we will see how to submit one. You can use the sbatch command to submit a script to the queue system.

login1(1001)$ sbatch run.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1789195
login1(1013)$

In this example, run.slurm is the script we are submitting to the skx-dev partition. We will see shortly how to write such a script. Notice that sbatch returns a job identifier. We can use this to kill the job later if necessary, or to check its status. For more information, check the man page by running "man sbatch".

squeue

You can use the squeue command to check the status of jobs in the batch queue system. Here’s an example of the basic usage:

login2(1051)$ squeue
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1847451      normal parste-t   wg4889 CG       3:21      1 c401-051
           1843026  skx-normal launcher  xiangl1 CG    1:04:59      1 c503-083
           1761083  skx-normal FC2000cl tg850735 PD       0:00      1 (DependencyNeverSatisfied)
           1846312       large      amr tg830986 PD       0:00   1480 (Resources)
           1847337      normal  AuSV6x6  cgroome PD       0:00      1 (Resources)
           1847297      normal    EbDis    y4shi PD       0:00      1 (Resources)
           1847324      normal  epDenst    y4shi PD       0:00      1 (Resources)

The most interesting column is the one titled ST for “status”. It shows what a job is doing at this point in time. The state “PD” indicates that the job has been queued. When free processor cores become available and this process is “next in line”, it will change to the “R” state and begin executing. You may also see a job with status “CG” which means it’s completing, and about to exit the batch system. Other statuses are possible too, see the man page for squeue. Once a job has exited the batch queue system, it will no longer show up in the squeue display.

We can also see several other pieces of useful information. The TIME column shows the current walltime used by the job. For example, job 1843026 has been running for 1 hour, 4 minutes, and 59 seconds. The NODELIST column shows which compute nodes have been assigned to the job. For job 1847451, node c401-051 is being used. For job 1847337, on the other hand, we can see that it is pending because it is waiting on resources.

sinfo

The sinfo command also shows the current status of the batch system, but from the point of view of the SLURM partitions. Here is an example:

login1(1014)$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
development* up 2:00:00 1 drain* c455-114
development* up 2:00:00 16 comp c455-[062,071,092-094,101-104,111-113,121-124]
development* up 2:00:00 24 alloc c455-[001-004,011-014,021-024,031-034,041-044,051-053,091]
development* up 2:00:00 71 idle c455-[054,061,063-064,072-074,081-084,131-134]
normal up 14-00:00:0 14 drain* c415-071,c422-[011-014],c426-[003,052],c438-081,c466-123,c472-[121-124]
normal up 14-00:00:0 3 drain c470-074,c472-081,c474-064

Running Serial Jobs

This section assumes you’ve already compiled the serial hello world example. Now we’ll see how to run it several different ways.

Test runs on the login node

The most obvious way to run the program is on the login node, which we normally log into.

login1(1021)$ ./hello_serial
Hello world from login1.stampede2.tacc.utexas.edu
login1(1022)$

We can see the reported hostname which confirms that the program ran on the login node.

Jobs should be run on the login node only for testing purposes. The purpose of the login node is to develop code and to submit jobs to the compute nodes. Everyone who uses Stampede2 must interact with the login node, so slowing it down will affect all users. Therefore, the usage rules prohibit using the login node to run jobs.

Test runs on the develop partition

Let's submit our job to the develop partition, since we just created the program and we're not completely sure that it works. The following script will accomplish this. Save it to your account alongside the hello_serial executable.
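A minimal sketch of what run-testing.slurm might look like; the five-minute time limit is an illustrative choice, not a requirement:

#!/bin/bash
# Test run on the Skylake development partition
#SBATCH --partition=skx-dev
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --job-name=hello_serial
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00

# Run the serial executable on the assigned compute node
./hello_serial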

Here, the partition flag chooses the Skylake develop partition (skx-dev). The output and error flags set the file name for capturing standard output (stdout) and standard error (stderr), respectively, and the job-name flag simply sets the string that is displayed as the name of the job in squeue. The next two flags set the total number of nodes requested, and the number of MPI tasks per node. The time flag sets a run-time limit on the submitted job. After your job has reached this time limit, it is stopped by the scheduler. This is done to ensure that everyone has a fair chance to use the cluster. Now we’re ready to submit our job to the scheduler. To accomplish this, use the sbatch command as follows

login2(1014)$ sbatch run-testing.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1796977

If the submission was successful, the sbatch command returns a job identifier to us. We can use this to check the status of the job (squeue), or delete it (scancel) if necessary. This job should run very quickly if there are processors available, but we can try to check its status in the batch queue system. The following command shows that our job is not in the system – it is so quick that it has already completed!

[araim1@maya-usr1 hello_serial]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES QOS     NODELIST(REASON)
[araim1@maya-usr1 hello_serial]$

We should have obtained two output files. The file slurm.err contains stderr output from our program. If slurm.err isn’t empty, check the contents carefully as something may have gone wrong. The file slurm.out contains our stdout output; it should contain the hello world message from our program.

login2(1021)$ ls -l
total 448
-rwxrwx--- 1 jdella-g G-819243 225496 Jul 18 09:14 hello_serial
-rwxrwx--- 1 jdella-g G-819243 216080 Jul 18 09:34 hello_serial2
-rw-rw---- 1 jdella-g G-819243    184 Feb  1  2014 hello_serial.c
-rw-rw-r-- 1 jdella-g G-819243    392 Jul 20 08:32 run-testing.slurm
-rw-rw---- 1 jdella-g G-819243      0 Jul 20 08:29 slurm.err
-rw-rw---- 1 jdella-g G-819243     52 Jul 20 08:29 slurm.out
login2(1022)$ cat slurm.err
login2(1023)$ cat slurm.out
Hello world from c506-001.stampede2.tacc.utexas.edu

Notice that the hostname no longer matches the login node, but one of the compute nodes. We’ve successfully used one of the compute nodes to run our job. The develop partition limits jobs to two hours, measured in “walltime”, which is just the elapsed run time.

Note that with SLURM, the stdout and stderr files (slurm.out and slurm.err) are written gradually as your job executes. This is different from PBS, which was used on hpc, where the stdout/stderr files did not exist until the job completed.

The stdout and stderr mechanisms in the batch system are not intended for large amounts of output. If your program writes out more than a few KB of output, consider using file I/O to write to log or data files.

Production runs on the batch partition

Once our job has been tested and we’re confident that it’s working correctly, we can run it in the batch partition. Now the walltime limit for our job may be raised, based on the QOS we choose. There are also many more compute nodes available in this partition, so we probably won’t have to wait long to find a free processor. Start by creating the following script with some of the most important features.


Download: ../code-2018/stampede2/Hello_Serial/run-serial.slurm
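A minimal sketch of what run-serial.slurm might look like; the skx- prefixed QOS name and the five-minute time limit are assumptions made for illustration:

#!/bin/bash
# Production run on the skx-normal partition with an explicit QOS
#SBATCH --partition=skx-normal
#SBATCH --qos=skx-normal
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --job-name=hello_serial
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00

./hello_serial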

The flags for job-name, output, and error are the same as in the previous script. The partition flag is now set to skx-normal. Additionally, the qos flag chooses the normal QOS. This is the default QOS, so the result would be the same if we had not specified any QOS; we recommend always specifying a QOS explicitly for clarity. To submit our job to the scheduler, we issue the command

login2(1011)$ sbatch run-serial.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1798806
login2(1012)$

We can check the job’s status, but due to its speed, it has already completed and does not show up any more.

[araim1@maya-usr1 hello_serial]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES QOS     NODELIST(REASON)
[araim1@maya-usr1 hello_serial]$

This time our stdout output file indicates that our job has run on one of the primary compute nodes, rather than a develop node

login2(1012)$ ls -l slurm.*
-rw-rw---- 1 jdella-g G-819243  0 Jul 20 14:04 slurm.err
-rw-rw---- 1 jdella-g G-819243 52 Jul 20 14:04 slurm.out
login2(1013)$ cat slurm.err
login2(1014)$ cat slurm.out
Hello world from c485-031.stampede2.tacc.utexas.edu
login2(1015)$
When using the skx-normal partition, you’ll be sharing resources with other researchers. So keep your duties as a responsible user in mind, which are described in this tutorial and in the usage rules.

Selecting a QOS

Notice that we specified the normal QOS with the qos flag. Because we know our job is very quick, a more appropriate QOS would be short. To specify the short QOS, replace skx-normal with skx-short, giving the line

#SBATCH --qos=skx-short

in the submission script. In the same way, you can access any of the QOS's listed in the scheduling rules. The rule of thumb is to choose the QOS whose wall time limit is most appropriate for your job. Since these limits are hard upper limits, you will want to stay safely under them; in other words, pick a QOS whose wall time limit is comfortably larger than the expected run time of your job.

Note that the QOS of each job is shown by default in the squeue output. We have set this up on Stampede2 as a convenience, by setting the SQUEUE_FORMAT environment variable.

The develop partition only has one QOS, namely the SLURM default normal, so expect to see “normal” in the QOS column for any job in the develop partition.

Running Parallel Jobs

This section assumes that you’ve successfully compiled the parallel hello world example. Now, we’ll see how to run this program on the cluster.

Test runs on the develop partition

Example 1: Single process

First we will run the hello_parallel program as a single process. This will appear very similar to the serial job case. The difference is that now we are using the MPI-enabled executable hello_parallel, rather than the plain hello_serial executable. Create the following script in the same directory as the hello_parallel program. Notice the addition of the "ibrun" command before the executable, which is used to launch MPI-enabled programs on Stampede2. We've also added "--nodes=1" and "--ntasks-per-node=1" to specify what kind of resources we'll need for our parallel program.
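A minimal sketch of what intel-n1-ppn1.slurm might look like; the time limit is illustrative, and the job name is simply carried over from the serial example:

#!/bin/bash
# One node, one MPI task, on the Skylake development partition
#SBATCH --partition=skx-dev
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --job-name=hello_serial
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00

# ibrun launches the MPI tasks on the assigned node(s)
ibrun ./hello_parallel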

Now, we submit the script.

login2(1024)$ sbatch intel-n1-ppn1.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1798854
login2(1025)$

Checking the output after the job has completed, we can see that exactly one process has run and reported back.

login2(1030)$ ls -l slurm.*
-rw-rw---- 1 jdella-g G-819243   0 Jul 20 14:22 slurm.err
-rw-rw---- 1 jdella-g G-819243 194 Jul 20 14:22 slurm.out
login2(1031)$ cat slurm.err
login2(1032)$ cat slurm.out
TACC:  Starting up job 1798871
TACC:  Starting parallel tasks...
Hello world from process 000 out of 001, processor name c487-101.stampede2.tacc.utexas.edu
TACC:  Shutdown complete. Exiting.
login2(1033)$

Example 2: One node, two processes

Next, we will run the job as two processes on the same node. This is an important test to ensure that our code functions in parallel. We want to be especially careful that the communications work correctly, and that processes don't hang. We modify the single-process script and set "--ntasks-per-node=2", as sketched below.
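Only the resource request changes relative to intel-n1-ppn1.slurm:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2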

Submit the script to the batch queue system

login2(1041)$ sbatch intel-n1-ppn2.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1799100
login2(1042)$

Now observe that two processes have run and reported in. Both were located on the same node as we expected.

login2(1043)$ ls -l slurm.*
-rw-rw---- 1 jdella-g G-819243   0 Jul 20 14:44 slurm.err
-rw-rw---- 1 jdella-g G-819243 285 Jul 20 14:44 slurm.out
login2(1044)$ cat slurm.err
login2(1045)$ cat slurm.out
TACC:  Starting up job 1799100
TACC:  Starting parallel tasks...
Hello world from process 000 out of 002, processor name c505-131.stampede2.tacc.utexas.edu
Hello world from process 001 out of 002, processor name c505-131.stampede2.tacc.utexas.edu
TACC:  Shutdown complete. Exiting.
login2(1046)$

Example 3: Two nodes, one process per node

Now let’s try to use two different nodes, but only one process on each node. This will exercise our program’s use of the high performance network, which didn’t come into the picture when a single node was used.
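In intel-n2-ppn1.slurm, the resource request becomes two nodes with one task each:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1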

Submit the script to the batch queue system

login2(1049)$ sbatch intel-n2-ppn1.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1799117

Notice that again we have two processes, but this time they have distinct processor names.

login2(1050)$ ls -l slurm.*
-rw-rw---- 1 jdella-g G-819243   0 Jul 20 14:49 slurm.err
-rw-rw---- 1 jdella-g G-819243 285 Jul 20 14:49 slurm.out
login2(1051)$ cat slurm.err
login2(1052)$ cat slurm.out
TACC:  Starting up job 1799117
TACC:  Starting parallel tasks...
Hello world from process 000 out of 002, processor name c505-131.stampede2.tacc.utexas.edu
Hello world from process 001 out of 002, processor name c505-132.stampede2.tacc.utexas.edu
TACC:  Shutdown complete. Exiting.

Example 4: Two nodes, eight processes per node

To illustrate the use of more processes, let's try a job that uses two nodes, with eight processes on each node. This is still possible on the develop partition, so small performance studies can be run entirely within the develop partition. Use the following batch script.
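A minimal sketch of intel-n2-ppn8.slurm, assuming the same layout as the earlier parallel scripts (time limit illustrative):

#!/bin/bash
# Two nodes, eight MPI tasks per node (16 tasks total), on the development partition
#SBATCH --partition=skx-dev
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --job-name=hello_serial
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:05:00

ibrun ./hello_parallel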

Submit the script to the batch system

login2(1056)$ sbatch intel-n2-ppn8.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1799125
login2(1057)$

For reference, here is the squeue output for this job, using the SQUEUE_FORMAT environment variable setting mentioned above:

login2(1059)$ squeue 
             JOBID   PARTITION     NAME     USER     ST      TIME  NODES NODELIST(REASON)
           1799130  skx-dev      hello_se  jdella-g  R       0:00      2 c476-034,c496-033

Now observe the output. Notice that the processes have reported back in numerical order, and there are eight per node.

login2(1060)$ ls -l slurm.*
-rw-rw---- 1 jdella-g G-819243    0 Jul 20 14:54 slurm.err
-rw-rw---- 1 jdella-g G-819243 1559 Jul 20 14:54 slurm.out
login2(1061)$ cat slurm.err
login2(1062)$ cat slurm.out
TACC:  Starting up job 1799130
TACC:  Starting parallel tasks...
Hello world from process 000 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 001 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 002 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 003 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 004 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 005 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 006 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 007 out of 016, processor name c476-034.stampede2.tacc.utexas.edu
Hello world from process 008 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 009 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 010 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 011 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 012 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 013 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 014 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
Hello world from process 015 out of 016, processor name c496-033.stampede2.tacc.utexas.edu
TACC:  Shutdown complete. Exiting.

Production runs on the batch partition

Now we've tested our program in several important configurations in the develop partition. We know that it performs correctly, and processes do not hang. We may now want to solve larger problems which are more time consuming, or perhaps we may wish to use more processes. We can promote our code to "production" simply by changing "--partition=skx-dev" to "--partition=skx-normal". We may also want to specify "--qos=skx-short" as before, since the expected run time of our job is several seconds at most. Of course, if this were a more substantial program, we might need to specify a longer QOS such as normal or long.
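As a sketch, the directives that change for the production run would look like the following (the QOS name follows the skx- naming assumed earlier on this page):

# Production partition and a short QOS for a quick job
#SBATCH --partition=skx-normal
#SBATCH --qos=skx-short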

Node selection

The Stampede2 cluster has several different types of nodes available, and users may want to select certain kinds of nodes to suit their jobs. In this section we will discuss how to do this.

The following variation of the sinfo command shows some basic information about nodes in Stampede2.

login2(1001)$ sinfo -o "%10N %8z %8m %40f %10G"
NODELIST   S:C:T    MEMORY   AVAIL_FEATURES                           GRES
c455-[001- 1:68:4   1        knl1_5                                   (null)
c401-[001- 1:68:4   1        knl                                      (null)
c476-[001- 2:24:2   1        skx                                      (null)
login2(1002)$
  • In the column S:C:T, "S" is the number of processors ("sockets"), "C" is the number of cores per processor, and "T" is the number of threads per core. For example, the skx nodes listed above have 2 processors with 24 cores each and 2 threads per core, while the knl nodes have a single 68-core processor with 4 threads per core.
  • The memory column shows the total system memory in MB.
  • The features column describes static properties of the nodes; here it indicates the CPU type of each group of nodes, e.g. "skx" for the Skylake nodes and "knl" for the Knights Landing nodes.

The following extensions of the above "sinfo" command may be useful to view the availability of each node type. In the last column, the four numbers correspond to the number of nodes which are currently in the following states: A = allocated, I = idle, O = other, and T = total.

login2(1002)$ sinfo -o "%10N %8z %8m %40f %10G %F"
NODELIST   S:C:T    MEMORY   AVAIL_FEATURES                           GRES       NODES(A/I/O/T)
c455-[001- 1:68:4   1        knl1_5                                   (null)     499/5/0/504
c401-[001- 1:68:4   1        knl                                      (null)     3690/0/6/3696
c476-[001- 2:24:2   1        skx                                      (null)     1709/25/2/1736
login2(1003)$

Nodes marked as allocated might still have available processors. The following command, using %C instead of %F, will show us availability at the processor level.

login2(1003)$ sinfo -o "%10N %8z %8m %40f %10G %C"
NODELIST   S:C:T    MEMORY   AVAIL_FEATURES                           GRES       CPUS(A/I/O/T)
c455-[001- 1:68:4   1        knl1_5                                   (null)     135728/1360/0/137088
c401-[001- 1:68:4   1        knl                                      (null)     1003680/0/1632/1005312
c476-[001- 2:24:2   1        skx                                      (null)     163776/2688/192/166656
login2(1004)$

Other types of summaries are possible as well; try “man sinfo” for more information.

As we will see below, the "features" correspond to properties that can be selected with the "--constraint" option, and "gres" corresponds to resources that can be requested with "--gres".

Select nodes by CPU type

To demonstrate node selection on Stampede2, first consider a very simple batch script which does not specify any type of node.

login2(1036)$ sbatch run.slurm

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05030/jdella-g)...OK
--> Verifying availability of your work dir (/work/05030/jdella-g/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05030/jdella-g)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-DMS170007)...OK
Submitted batch job 1846968
login2(1037)$ ls -l
total 8
-rw-rw-r-- 1 jdella-g G-819243 214 Jul 31 10:01 run.slurm
-rw-rw---- 1 jdella-g G-819243   0 Jul 31 10:01 slurm.err
-rw-rw---- 1 jdella-g G-819243 210 Jul 31 10:01 slurm.out
login2(1038)$ cat slurm.err
login2(1039)$ cat slurm.out
c488-053.stampede2.tacc.utexas.edu
c488-053.stampede2.tacc.utexas.edu
c488-053.stampede2.tacc.utexas.edu
c493-121.stampede2.tacc.utexas.edu
c493-121.stampede2.tacc.utexas.edu
c493-121.stampede2.tacc.utexas.edu
login2(1040)$

Notice that we are assigned two nodes, c488-053 and c493-121, without having specified what type of node we want. The node types can be verified by checking the table of hostnames in the System Description.

Suppose we would like to make sure the job runs on a particular type of node, for example the Skylake (skx) nodes. This can be accomplished with the "--constraint" option, specifying one of the features obtained above from the sinfo output.
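For example, adding a line like the following (a sketch; the feature name must be one listed by sinfo) restricts the job to the Skylake nodes:

# Only schedule this job on nodes with the "skx" feature
#SBATCH --constraint=skx

After resubmitting with this line, the hostnames written to slurm.out should correspond to nodes of the requested type.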


Some details about the batch system

A SLURM batch script is a special kind of shell script. As we’ve seen, it contains information about the job like its name, expected walltime, etc. It also contains the procedure to actually run the job. Read on for some important details about SLURM scripting, as well as a few other features that we didn’t mention yet.

For more information, try the man pages for sbatch, squeue, sinfo, and scancel.

Parts of a SLURM script

Here is a quick reference for the options discussed on this page.

  • # – A line beginning with "#" is a comment and is ignored, except for lines beginning with #SBATCH.
  • #SBATCH – Indicates a special line that is interpreted by the scheduler.
  • ibrun ./hello_parallel – A special command used to execute MPI programs. It uses directions from SLURM to place your tasks on the scheduled nodes.
  • --job-name=hello_serial – Sets the name of the job, i.e. the name that shows up in the "Name" column of squeue's output. The name has no significance to the scheduler, but it makes the display more convenient to read.
  • --output=slurm.out and --error=slurm.err – Tell SLURM where to send your job's output stream and error stream, respectively. To prevent either of these streams from being written, set the file name to /dev/null.
  • --partition=skx-normal – Sets the partition in which your job will run.
  • --constraint=skx – Tells the scheduler that the nodes assigned to this job must have the given feature (here "skx").
  • --nodes=4 – Requests four nodes.
  • --ntasks-per-node=8 – Requests eight MPI tasks to be run on each node. The number of tasks may not exceed the number of processor cores on the node.
  • --time=1-12:30:00 – Sets the maximum amount of time SLURM will allow your job to run before it is automatically killed. In the example shown, we have requested 1 day, 12 hours, 30 minutes, and 0 seconds. Several other formats are accepted, such as "HH:MM:SS" (assuming less than a day). If the specified time is too large for the partition/QOS you've chosen, the scheduler will not run your job.
  • --mail-type=type – SLURM can email you when your job reaches certain states. Set type to one of: BEGIN to be notified when your job starts, END for when it ends, FAIL for if it fails to run, or ALL for all of the above. See the example below.
  • --mail-user=username@domain.edu – Specifies the recipient(s) of notification emails (see the example below).
  • --mem=MB – Specifies a memory limit in MB for each node of your job. The default is a per-core limit.
  • --share – Specifies that your job may share nodes with other jobs. This is the opposite of "--exclusive".
  • --begin=2010-01-20T01:30:00 – Tells the scheduler not to attempt to run the job until the given time has passed.
  • --dependency=afterany:15030:15031 – Tells the scheduler not to run the job until jobs with IDs 15030 and 15031 have completed.

Job scheduling issues

  • Don’t leave large-scale jobs enqueued during weekdays. Suppose you have a job that requires all the nodes on the cluster. If you submit this job to the scheduler, it will remain enqueued until all the nodes become available. Usually the scheduler will allow smaller jobs to run first, but sometimes it will enqueue them behind yours. The latter causes a problem during peak usage hours, because it clogs up the queue and diminishes the overall throughput of the cluster. The best times to submit these large-scale types of jobs are nights and weekends.
  • Make sure the scheduler is in control of your programs. Avoid spawning processes or running background jobs from your code, as the scheduler can lose track of them. Programs running outside of the scheduler hinder its ability to allocate new jobs to free processors.
  • Avoid jobs that run for many days uninterrupted. It’s best for the overall productivity of the cluster to design jobs that are “small to medium” in size – they should be large enough to accomplish a significant amount of work, but small enough so that resources are not tied up for too long. Avoiding long jobs is also in your best interest.
  • Consider saving your progress. This is related to the issue above. Running your entire computation at once might be impractical or infeasible. Besides the fact that very long jobs can make scheduling new jobs difficult, it can also be very inconvenient for you if they fail after running for several days (due to a hardware issue for example). It may be possible to design your job to save its progress occasionally, so that it won’t need to restart from the beginning if there are any issues.
  • Estimate memory usage before running your job. As discussed later on this page, jobs have a default memory limit which you can raise or lower as needed. If your job uses more than this limit, it will be killed by the scheduler. Requesting the maximum available memory for every job is not good practice, though, because it can lower the overall productivity of the system. Imagine you're running a serial job which requires half the available memory. Specifying this in your submission script allows the scheduler to use the remaining cores and memory on the node for other users' jobs. The best strategy is to estimate how much memory you will need and specify a reasonable upper bound in your submission script. Two ways to make this estimate are (1) calculate the sizes of the main objects in your code, and (2) run a small-scale version of the problem and observe the actual memory used. For more information on how to observe memory usage, see checking memory usage.
  • Estimate walltime usage before running your job. Giving an accurate estimate of your job’s necessary walltime is very helpful for the scheduler, and the overall productivity of the cluster. The QOS choices (short, normal, etc) provide an easy way to specify this information. You may also specify it more granularly as an elapsed time; see below for more information and an example.
  • Use fewer nodes and more processes per node. Performance studies have demonstrated that using multiple cores on the same node gives generally comparable performance to using a single core on the same number of nodes. And using the minimum number of nodes required for your job benefits the overall productivity of the cluster. See for example the technical report HPCF-2010-2.
  • Consider sharing your nodes unless you require exclusive access. If your job isn’t using all cores on a node, it might be possible for other jobs to make use of the free cores. Allowing your nodes to be shared helps to improve the overall productivity of the cluster. The downside however is that other jobs might interfere with the performance of your job. If your job uses a small amount of memory and is not a performance study (for example), sharing is probably a good option. To see how to control exclusive vs. shared, see the examples below.

Specifying a time limit

The partitions and QOS's on Stampede2 have built-in time limits; a job in a given QOS is limited to that QOS's maximum walltime, and this is only an upper limit. Suppose you will be using 64 nodes, but will only require at most 2 hours. It is beneficial to supply this information to the scheduler, and it may also allow your job to be backfilled. See the scheduling rules for more information, and also note the following example.

Suppose the system currently has 14 free nodes, and there are two jobs in the queue waiting to run. Suppose also that no additional nodes will become free in the next few hours. The first queued job (“job #1”) requires 16 nodes, and the second job (“job #2”) requires only 2 nodes. Since job #1 was queued first, job #2 would normally need to wait behind it. However, if the scheduler sees that job #2 would complete in the time that job #1 would be waiting, it can allow job #2 to skip ahead.

Here is an example where we've specified a time limit of 3 hours, 15 minutes, and 0 seconds. Notice that we've started with the batch script from Running Parallel Jobs, Example 1 and added a single "--time=" statement.
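The added directive would look like this:

# Limit the job to 3 hours, 15 minutes of walltime
#SBATCH --time=03:15:00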

Note that in our experience, SLURM rounds time limits up to the next minute. For example, specifying "--time=00:00:20" will result in an actual time limit of 1 minute.

Email notifications for job status changes

You can request the scheduler to email you on certain events related to your job. Namely:

  • When the job starts running
  • When the job exits normally
  • If the job is aborted

As an example of how to use this feature, let's ask the scheduler to email us on all three events when running the hello_serial program. We start with the batch script developed earlier and add the options "--mail-type=ALL" and "--mail-user=username@domain.edu", where "username@domain.edu" is replaced by your actual email address.
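The two added lines would look like this:

# Email notifications on start, normal end, and failure
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@domain.edu

After submitting this script, we can check our email and should receive messages like the following.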

From: Simple Linux Utility for Resource Management <slurm@maya-mgt.rs.umbc.edu>
Date: Thu, Jan 14, 2010 at 10:53 AM
Subject: SLURM Job_id=2655 Name=hello_serial Began
To: username@domain.edu

From: Simple Linux Utility for Resource Management <slurm@maya-mgt.rs.umbc.edu>
Date: Thu, Jan 14, 2010 at 10:53 AM
Subject: SLURM Job_id=2655 Name=hello_serial Ended
To: username@domain.edu

Because hello_serial is such a trivial program, the start and end emails appear to have been sent simultaneously. For a more substantial program the waiting time could be significant, both for your job to start and for it to run to completion. In this case email notifications could be useful to you.

Controlling exclusive vs. shared access to nodes

By default, you are not given exclusive access to the nodes the scheduler assigns to you. This means that your job may run on a node alongside other users' jobs. You can override this default behavior using the "--exclusive" option.

Here’s an example where we reserve an entire node.
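A sketch of the change, assuming the rest of the script is unchanged from the earlier examples; the single directive below is all that is added:

# Request exclusive access to the assigned node(s)
#SBATCH --exclusive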

If our job involves multiple nodes, specifying the "--exclusive" flag requests exclusive access to all nodes that will be used by the job.

We may also explicitly permit sharing of our nodes with the "--share" flag.


Download: ../code/hello_parallel/intel-share.slurm
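A sketch of the relevant directive in that script; note that newer SLURM releases spell this option "--oversubscribe":

# Allow other jobs to run on this job's node(s)
#SBATCH --share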

Overriding the default shared behavior should not be done arbitrarily. Before using these options in a job, make sure you've confirmed that exclusive access is necessary. If every job were given exclusive access, it would have a negative effect on the overall productivity of the cluster.

The memory limit

This section is specific to the hpcf2009 nodes and needs to be updated to reflect the various node types of maya.

Jobs on maya are limited to a maximum of 23,552 MB per node out of the total 24 GB system memory. The rest is reserved for the operating system. Jobs are run inside a “job container”, which protects the cluster against jobs overwhelming the nodes and taking them offline. By default, jobs are limited to 2944 MB per core, based on the number of cores you have requested. If your job goes over the memory limit, it will be killed by the batch system.

The memory limit may be specified per core or per node. To set the limit per core, simply add a line to your submission script as follows, keeping in mind that a larger memory request might increase the amount of time your job spends in the waiting queue:

#SBATCH --mem-per-cpu=5000

Memory is measured in MB, so the above statement requests 5000 MB (roughly 5 GB) of memory. Similarly, to set the limit per node rather than per core, you can use this instead.

#SBATCH --mem=5000

In the serial case, the two options are equivalent. For parallel computing situations it may be more natural to use the per core limit, given that the scheduler has some freedom to assign processes to nodes for you.

If your job is killed because it has exceeded its memory limit, you will receive an error similar to the following in your stderr output. Notice that the effective limit is reported in the error.

slurmd[n1]: error: Job 13902 exceeded 3065856 KB memory limit, being killed
slurmd[n1]: error: *** JOB 13902 CANCELLED AT 2010-04-22T17:21:40 ***
srun: forcing job termination

Note that the memory limit can be useful in conducting performance studies. If your code runs out of physical memory and begins to use swap space, the performance will be severely degraded. For a performance study, this may be considered an invalid result and you may want to try a smaller problem, use more nodes, etc. One way to protect against this is to reserve entire nodes (as discussed elsewhere on this page), and set the memory limit to 23 GB per node (or less). That is about the maximum you can use before swapping starts to occur. Then the batch system will kill your job if it’s close enough to swapping.

5/13/2010: Note that when using MVAPICH2, if your job has exclusive access to its assigned nodes (by virtue of the queue you've used, for example the parallel queue, or by the "--exclusive" flag), it will have access to the maximum available memory. This is not the case with OpenMPI. We hope to obtain a version of SLURM that supports this feature consistently. To avoid confusion in the meantime, we recommend the "--mem" and "--mem-per-cpu" options as the preferred method of controlling the memory limit.

7/12/2011: The memory limit is being lowered from 23,954 MB (maximum) per node to 23,552 MB. This is being done to further protect nodes against crashing due to low memory. The default per-core limit is being lowered from 2994 MB to 2944 MB accordingly.

Note that a memory limit can be specified even for non-SLURM jobs. This can be useful for interactive work on the login node. For example, running the following command

[araim1@maya-usr1 ~]$ ulimit -S -v 2097152

will limit the memory use of any subsequent command in the session to 2 GB.