Table of Contents

Introduction
Scheduling Fundamentals on taki: Partitions, QOS’s, and more
Interacting with the SLURM Scheduling System
Running Serial Jobs
Running Parallel Jobs
Parallel Runs on the Production Partitions
Additional details about the batch system

Introduction

Running a program on taki is different than running one on a standard workstation. When we log into the cluster, we are interacting with the login node. But we would like our programs to run on the compute nodes, which is where the real computing power of the cluster is. We will walk through the processes of running serial and parallel code on the cluster, and then later discuss some of the finer details. This page uses the code examples from the compilation tutorial. Please download and compile those examples first, before following the run examples here.

On taki, jobs must be run on the compute nodes of the cluster. You cannot execute jobs directly on the compute nodes yourself; you must request the cluster’s batch system do it on your behalf. To use the batch system, you will submit a special script which contains instructions to execute your job on the compute nodes. When submitting your job, you specify a partition (a group of nodes, e.g., develop for testing or high_mem for production) and a QOS (a classification that determines what kind of resources your job will need). Your job will wait in the queue until it is “next in line” and free processors on the compute nodes become available. Which job is “next in line” is determined by the scheduling rules of the cluster. Once a job is started, it continues to run until it either completes (with or without error) or reaches its time limit, in which case it is terminated by the scheduler.

During the runtime of your job, your instructions will be executed across the compute nodes. These instructions will have access to the resources of the nodes on which they are running, notably the memory, processors, and local disk space (/scratch space). Note that the /scratch space is cleared after every job terminates.

The batch system (also called the scheduler or workload manager) used on taki is called SLURM, which is short for Simple Linux Utility for Resource Management.

Scheduling Fundamentals on taki: Partitions, QOS’s, and more

For complete background on the available queues and their limitations, please read the scheduling rules web page. The examples on this page are designed for running code on the CPU cluster, such as the C programs in the compile tutorial. They start with the develop partition, which is designed for testing code in brief runs, and then move to the production partitions such as high_mem. For other software and packages, see the Other Software and Packages web page.

Interacting with the SLURM Scheduling System

There are several basic commands you need to know to submit jobs, cancel them, and check their status. These are:

- sbatch – submit a job to the batch queue system
- squeue – check the current jobs in the batch queue system
- sinfo – view the current status of the queues
- scancel – cancel a job

scancel

The first command we will mention is scancel. If you have submitted a job that you no longer want, you should be a responsible user and kill it. This will prevent resources from being wasted, and allows other users’ jobs to run. Jobs can be killed while they are pending (waiting to run), or while they are actually running. To remove a job from the queue or to cancel a running job cleanly, use the scancel command with the identifier of the job to be deleted, for instance

[gobbert@taki-usr1 ~]$ scancel 9133

The job identifier can be obtained from the job listing from squeue (see below) or immediately after using sbatch, when you originally submitted the job (also below). See “man scancel” for more information.

sbatch

Now that we know how to cancel a job, we will see how to submit one. You can use the sbatch command to submit a script to the queue system.

[gobbert@taki-usr1 ~]$ sbatch run.slurm 
Submitted batch job 9133

In this example, run.slurm is the script we are sending to the SLURM scheduler. We will see shortly how to formulate such a script. Notice that sbatch returns a job identifier. We can use this to kill the job later if necessary (as in the scancel example above), or to check its status. For more information, see “man sbatch”.
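When submissions are scripted, the job identifier can be captured instead of copied by hand. The following is a minimal sketch that parses the message format shown above. The sample message is hard-coded so that the snippet also runs away from the cluster; on taki, the first assignment would presumably be replaced by the actual sbatch call shown in the comment.

```shell
# Sample sbatch message, copied from the example above. On the cluster you
# would instead use: submit_msg=$(sbatch run.slurm)
submit_msg="Submitted batch job 9133"

# The job identifier is the last whitespace-separated word of the message.
jobid=${submit_msg##* }
echo "Job ID: $jobid"

# The ID can then be passed on, for example: scancel "$jobid"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text; see “man sbatch”.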

squeue

You can use the squeue command to check the status of jobs in the batch queue system. Here’s an example of the basic usage:

[gobbert@taki-usr1 Hello_Serial]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9141  high_mem hello_se  gobbert  R       0:01      1 cnode031

The most interesting column is the one titled ST for “status”. It shows what a job is doing at this point in time. The state “PD” indicates that the job is pending, that is, waiting in the queue. When enough free processor cores become available, it will change to the “R” state and begin running. You may also see a job with status “CF”, which means it is configuring (the nodes are being set up just before the job runs), or “CG”, which means it is completing (such as still writing stdout and stderr) and about to exit the batch system. Other statuses are possible too; see “man squeue”. Once a job has exited the batch queue system, it will no longer show up in the squeue display.

We can also see several other pieces of useful information. The TIME column shows the walltime used by the job up to the present time. For example, job 9141 has been running for 1 second so far. The NODELIST column shows which compute node or nodes have been assigned to the job. For job 9141, node cnode031 is being used.
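On a busy cluster the full listing can be long, so it may help to reduce it to the columns discussed above. The sketch below is only an illustration: it feeds the sample listing from this page through awk, so it runs without SLURM, and the column positions are assumed to match the default squeue layout shown here.

```shell
# Sample squeue output copied from the example above. On the cluster you
# would pipe the real command instead, e.g.: squeue -u "$USER" | awk ...
squeue_out='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9141  high_mem hello_se  gobbert  R       0:01      1 cnode031'

# Skip the header line (NR > 1) and print job ID, state, and node list.
summary=$(printf '%s\n' "$squeue_out" | awk 'NR > 1 { print $1, $5, $8 }')
echo "$summary"
```

The -u option of squeue restricts the listing to the jobs of one user; see “man squeue”.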

sinfo

The sinfo command also shows the current status of the batch system, but from the point of view of the SLURM partitions. Here is an example of its output, taken at a time when some nodes were idle and others were allocated:

[gobbert@taki-usr1 230905]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu2021      up  infinite     3   mix cnode[060,062,064]
cpu2021      up  infinite    13 alloc cnode[051-059,061,063,065-068]
develop      up     30:00     2  idle cnode[001,030]
gpu2018      up  infinite     1  idle gpunode001
high_mem     up  infinite    17 alloc bdnode[001-008],cnode[002-008,026-029,031-033]
high_mem     up  infinite    28  idle cnode[009-025,034-044]

Running Serial Jobs

This section assumes you have already compiled the serial “Hello, world!” example. Now we will see how to run it several different ways.

Test runs on the user node

The most obvious way to run the program is on the user node, also called the login node, which is the node we normally log into.

[gobbert@taki-usr1 Hello_Serial]$ ./hello_serial 
Hello world from taki-usr1

We can see the reported hostname which confirms that the program ran on the login node.

Jobs should only be run on the login node for such brief testing purposes. The purpose of the login node is to develop code and submit jobs to the compute nodes. Everyone who uses taki must interact with the login node, so slowing it down will affect all users. Therefore, the usage rules prohibit the use of the login node for running anything beyond such brief tests.

Test runs on the develop partition

Let us submit our job to the develop partition, since we just created it and we are not completely sure that it works. The following script will accomplish this. Download it using wget to your workspace alongside the “hello_serial” executable.

Download: ../code-2018/taki/Hello_Serial/run-testing.slurm
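The script itself is not reproduced on this page. As a rough guide, a script of this kind might look like the following sketch. It is an assumption built from standard sbatch options and from the file names and limits used on this page, and the downloadable file may differ in its details. The snippet writes the sketch to a file with the hypothetical name run-testing-sketch.slurm, so that it does not overwrite the downloaded script.

```shell
# Write a sketch of a serial test script; all values are assumptions
# for illustration and should be checked against the downloaded script.
cat > run-testing-sketch.slurm <<'EOF'
#!/bin/bash
# Name shown in squeue, and files for stdout and stderr
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
# Partition for brief test runs, QOS, and walltime limit of 5 minutes
#SBATCH --partition=develop
#SBATCH --qos=short
#SBATCH --time=00:05:00
# One node with one task on it: a serial job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

./hello_serial
EOF

# Show the directives that were written.
grep '^#SBATCH' run-testing-sketch.slurm
```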

Here, the job-name flag simply sets the string that is displayed as the name of the job in squeue. The output and error flags set the file names for capturing standard output (stdout) and standard error (stderr), respectively. The partition flag requests the develop partition of the CPU cluster for the job. The qos flag requests the short QOS, since this particular “Hello, world!” job should just run for a brief moment, and the time flag provides a more precise estimate of the maximum possible time for the job to take. After a job has reached its time limit, it is stopped by the scheduler. This is done to ensure that everyone has a fair chance to use the cluster. The next two flags set the total number of nodes requested and the number of MPI tasks per node; by choosing both of these as 1, we are requesting space for a serial job. Now we are ready to submit our job to the scheduler. To accomplish this, use the sbatch command as follows

[gobbert@taki-usr1 Hello_Serial]$ sbatch run-testing.slurm
Submitted batch job 9111

If the submission is successful, the sbatch command returns a job identifier to us. We can use this to check the status of the job (squeue), or delete it (scancel) if necessary. This job should run very quickly if there are processors available, but we can try to check its status in the batch queue system. The first call to squeue in the following snapshot shows that the code had been running for 3 seconds at that point in time. Checking again a little later no longer shows our job in the list, so it is done at this point in time.

[gobbert@taki-usr1 Hello_Serial]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9111   develop hello_se  gobbert  R       0:03      1 cnode001
[gobbert@taki-usr1 Hello_Serial]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

After possibly a brief delay, we should have obtained two output files. The file slurm.err contains stderr output from our program. If slurm.err is not empty, check the contents carefully as something may have gone wrong. The file slurm.out contains our stdout output; it should contain the hello world message from our program.

[gobbert@taki-usr1 Hello_Serial]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert  0 Jan 27 18:17 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 26 Jan 27 18:17 slurm.out
[gobbert@taki-usr1 Hello_Serial]$ more slurm.err
[gobbert@taki-usr1 Hello_Serial]$ more slurm.out
Hello world from cnode001

Notice that the hostname is no longer the login node, but is one of the develop nodes; see output of sinfo above. The develop partition limits a job to 5 minutes by default and 30 minutes at maximum, measured in “walltime”, which is just the elapsed run time.

Note that with SLURM, the stdout and stderr files (slurm.out and slurm.err) will be written gradually as your job executes. The stdout and stderr mechanisms in the batch system are not intended for large amounts of output. If your program writes out more than a few kB of output, consider using file I/O to write to logs or data files.

Production runs on the high_mem partition

Once our job has been tested and we are confident that it is working correctly, we can run it in the high_mem partition, which is intended for production runs and has a longer time limit. Now the walltime limit for our job may be raised, based on the QOS we choose. There are also many more compute nodes available in this partition. Start by creating the following script with some of the most important features.

Download: ../code-2018/taki/Hello_Serial/run-serial.slurm

The flags for job-name, output, and error are the same as in the previous script. The partition flag is now set to high_mem. The qos flag still chooses the short QOS and the time limit is still set at 5 minutes, since the job is still expected to be quick, just like before for this “Hello, world!” example. It is also still a serial job, so no change to the requested number of nodes and tasks per node. To submit our job to the scheduler, we issue the command

[gobbert@taki-usr1 Hello_Serial]$ sbatch run-serial.slurm
Submitted batch job 9116

We can check the job’s status, but due to its speed, it has already completed and does not show up any more.

[gobbert@taki-usr1 Hello_Serial]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

This time our stdout output file indicates that our job has run on one of the nodes of the high_mem partition, rather than a develop node:

[gobbert@taki-usr1 Hello_Serial]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert  0 Jan 27 18:24 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 26 Jan 27 18:24 slurm.out
[gobbert@taki-usr1 Hello_Serial]$ more slurm.err
[gobbert@taki-usr1 Hello_Serial]$ more slurm.out
Hello world from cnode032

When using the taki cluster, you are sharing resources with other researchers. So keep your duties as a responsible user in mind, which are described in this tutorial and in the usage rules.

Selecting a QOS

Notice that we specified the short QOS with the qos flag, because we know our job is very quick. In the same way, you can access any of the QOS’s listed in the scheduling rules. The rule of thumb is that you should always choose the QOS whose wall time limit is the most appropriate for your job. Realizing that these limits are hard upper limits, you will want to stay safely under them, or in other words, pick a QOS whose wall time limit is comfortably larger than the actually expected run time.

Running Parallel Jobs

This section assumes that you have successfully compiled the parallel “Hello, world!” example. Now, we will see how to run this program on the CPU cluster.

Test runs on the develop partition

Example 1: Single process

First we will run the hello_parallel program as a single process. This will appear very similar to the serial job case. The difference is that now we are using the MPI-enabled executable hello_parallel, rather than the plain hello_serial executable. Create the following script in the same directory as the hello_parallel program. Notice the addition of the “mpirun” command before the executable, which is used to launch MPI-enabled programs on taki. The number of nodes and tasks per node requested are still 1.

Download: ../code-2018/taki/Hello_Parallel/run-n1-ppn1.slurm

Now, we submit the script.

[gobbert@taki-usr1 Hello_Parallel]$ sbatch run-n1-ppn1.slurm
Submitted batch job 9126
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9126   develop hello_pa  gobbert CF       0:00      1 cnode001
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

This time, we caught the job during start-up, with status CF, as the node was being initialized for the MPI job. A short moment later, the job was already done, and the queue does not contain our job any more.
Checking the output after the job has completed, we can see that exactly one process has run and reported back.

[gobbert@taki-usr1 Hello_Parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert  0 Jan 27 19:43 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 65 Jan 27 19:43 slurm.out
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.err
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.out
Hello world from process 000 out of 001, processor name cnode001

Example 2: One node, two processes per node

Next, we will run the job with two processes on the same node. This is one important test, to ensure that our code will function in parallel. We want to be especially careful that the communications work correctly, and that processes do not hang. We modify the single process script and set “--ntasks-per-node=2”.

Download: ../code-2018/taki/Hello_Parallel/run-n1-ppn2.slurm

Submit the script to the batch queue system

[gobbert@taki-usr1 Hello_Parallel]$ sbatch run-n1-ppn2.slurm
Submitted batch job 9129
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9129   develop hello_pa  gobbert  R       0:00      1 (None)
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

This time, squeue shows the job running briefly, with 0 seconds so far, before the job is again done quickly and gone from the queue.
Now observe that two processes have run and reported in. Both were located on the same node as we expected.

[gobbert@taki-usr1 Hello_Parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert   0 Jan 27 19:50 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 130 Jan 27 19:50 slurm.out
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.err
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.out
Hello world from process 000 out of 002, processor name cnode001
Hello world from process 001 out of 002, processor name cnode001

Example 3: Two nodes, one process per node

Now let us try to use two different nodes, but only one process on each node. This will exercise our program’s use of the high performance network, which did not come into play when a single node was used.

Download: ../code-2018/taki/Hello_Parallel/run-n2-ppn1.slurm

Submit the script to the batch queue system

[gobbert@taki-usr1 Hello_Parallel]$ sbatch run-n2-ppn1.slurm
Submitted batch job 9130
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Notice that again we have two processes, but this time they have distinct processor names.

[gobbert@taki-usr1 Hello_Parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert   0 Jan 27 19:51 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 130 Jan 27 19:51 slurm.out
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.err
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.out
Hello world from process 000 out of 002, processor name cnode001
Hello world from process 001 out of 002, processor name cnode030

Example 4: Two nodes, eight processes per node

To illustrate the use of more processes per node, let us try a job that uses two nodes, eight processes on each node. This is still possible on the develop partition. Therefore it is possible to run small performance studies which are completely restricted to the develop partition. Use the following batch script.

Download: ../code-2018/taki/Hello_Parallel/run-n2-ppn8.slurm

Submit the script to the batch system

[gobbert@taki-usr1 Hello_Parallel]$ sbatch run-n2-ppn8.slurm
Submitted batch job 9131
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9131   develop hello_pa  gobbert CF       0:00      2 (None)
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9131   develop hello_pa  gobbert  R       0:01      2 cnode[001,030]
[gobbert@taki-usr1 Hello_Parallel]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

This time, we first caught the job during its initialization, then while it was running, before it vanished from the queue after finishing. Notice that squeue lists the two nodes used for the job by name.

Now observe the output. Notice that the processes in the following output have not reported in strict numerical order: all processes can write to stdout and stderr, and which one gets there first is effectively random. But upon careful checking, there are two nodes reported by hostname and eight MPI processes per node.

[gobbert@taki-usr1 Hello_Parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert    0 Jan 27 19:52 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 1040 Jan 27 19:52 slurm.out
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.err
[gobbert@taki-usr1 Hello_Parallel]$ more slurm.out
Hello world from process 000 out of 016, processor name cnode001
Hello world from process 008 out of 016, processor name cnode030
Hello world from process 001 out of 016, processor name cnode001
Hello world from process 009 out of 016, processor name cnode030
Hello world from process 002 out of 016, processor name cnode001
Hello world from process 010 out of 016, processor name cnode030
Hello world from process 003 out of 016, processor name cnode001
Hello world from process 011 out of 016, processor name cnode030
Hello world from process 004 out of 016, processor name cnode001
Hello world from process 012 out of 016, processor name cnode030
Hello world from process 005 out of 016, processor name cnode001
Hello world from process 013 out of 016, processor name cnode030
Hello world from process 006 out of 016, processor name cnode001
Hello world from process 014 out of 016, processor name cnode030
Hello world from process 007 out of 016, processor name cnode001
Hello world from process 015 out of 016, processor name cnode030
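The four test scripts above differ only in the number of nodes and the number of tasks per node. Since the scripts are not reproduced on this page, the following sketch generates one hypothetical script per configuration from a common template. The job name, partition, QOS, and time limit are assumptions carried over from the serial case, and the file names carry a -sketch suffix so that the downloaded scripts are not overwritten.

```shell
# Hypothetical generator: one sketch script per "nodes tasks-per-node" pair.
for cfg in "1 1" "1 2" "2 1" "2 8"; do
    # Split the pair into the positional parameters $1 and $2.
    set -- $cfg
    nodes=$1
    ppn=$2
    file="run-n${nodes}-ppn${ppn}-sketch.slurm"
    cat > "$file" <<EOF
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --qos=short
#SBATCH --time=00:05:00
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${ppn}

mpirun ./hello_parallel
EOF
    # The total number of MPI processes is nodes times tasks per node.
    echo "$file: $((nodes * ppn)) MPI process(es) in total"
done
```

The last configuration gives 2 x 8 = 16 processes, which matches the 16 lines of output in Example 4.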

Parallel Runs on the Production Partitions

Now we have tested our program in several important configurations in the develop partition. We know that it performs correctly, and processes do not hang. We may now want to solve larger problems which are more time consuming, or perhaps we may wish to use more processes. There are two production partitions, cpu2021 and high_mem. We can promote our code to one of the production partitions by simply changing the partition to cpu2021 or high_mem. We use the nodesused program from the compile tutorial as an example here. Since the expected run time of our job is still only several seconds at most, we keep the short QOS and the time limit of 5 minutes. Of course, if this were a more substantial program, we might need to specify a longer QOS like normal and a longer time limit. The submission script for the nodesused program, with the obvious changes to the job-name and the mpirun line, reads as follows.

Download: ../code-2018/taki/Nodesused/run-n2-ppn8-batch.slurm

Running the code using sbatch gives the following results:

[gobbert@taki-usr1 Nodesused]$ ll slurm.*
-rw-rw---- 1 gobbert    0 Feb 28  2019 slurm.err
-rw-rw---- 1 gobbert 1072 Feb 28  2019 slurm.out
[gobbert@taki-usr1 Nodesused]$ more slurm.err
[gobbert@taki-usr1 Nodesused]$ more slurm.out
Hello world from process 0000 out of 0016, processor name cnode118
Hello world from process 0008 out of 0016, processor name cnode119
Hello world from process 0001 out of 0016, processor name cnode118
Hello world from process 0009 out of 0016, processor name cnode119
Hello world from process 0002 out of 0016, processor name cnode118
Hello world from process 0010 out of 0016, processor name cnode119
Hello world from process 0003 out of 0016, processor name cnode118
Hello world from process 0011 out of 0016, processor name cnode119
Hello world from process 0004 out of 0016, processor name cnode118
Hello world from process 0012 out of 0016, processor name cnode119
Hello world from process 0005 out of 0016, processor name cnode118
Hello world from process 0013 out of 0016, processor name cnode119
Hello world from process 0006 out of 0016, processor name cnode118
Hello world from process 0014 out of 0016, processor name cnode119
Hello world from process 0007 out of 0016, processor name cnode118
Hello world from process 0015 out of 0016, processor name cnode119

Notice that the node numbers confirm that the job was run on the high_mem partition of the CPU cluster. As before for the parallel “Hello, world!” program, the order of output lines to stdout is slightly random. The listing of the file nodesused.log shows that our code in the nodesused() function ordered the output by the MPI process IDs:

[gobbert@taki-usr1 Nodesused]$ more nodesused.log
MPI process 0000 of 0016 on node cnode118
MPI process 0001 of 0016 on node cnode118
MPI process 0002 of 0016 on node cnode118
MPI process 0003 of 0016 on node cnode118
MPI process 0004 of 0016 on node cnode118
MPI process 0005 of 0016 on node cnode118
MPI process 0006 of 0016 on node cnode118
MPI process 0007 of 0016 on node cnode118
MPI process 0008 of 0016 on node cnode119
MPI process 0009 of 0016 on node cnode119
MPI process 0010 of 0016 on node cnode119
MPI process 0011 of 0016 on node cnode119
MPI process 0012 of 0016 on node cnode119
MPI process 0013 of 0016 on node cnode119
MPI process 0014 of 0016 on node cnode119
MPI process 0015 of 0016 on node cnode119

Additional details about the batch system

Charging computing time to a PI, for users under multiple PIs

If you are a member of multiple research groups on taki, this section applies to you. When you run a job on taki, the resources you have used (e.g., computing time) are “charged” to your PI. This simply means that there is a record of your group’s use of the cluster. This information is used to make sure everyone has access to their fair share of resources, especially the PIs who have paid for nodes. Therefore, it is important to charge your jobs to the correct PI.

You have a “primary” account which your jobs are charged to by default. To see this, try checking one of your jobs as follows (suppose our job has ID 25097)

[araim1@maya-usr1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_nagaraj QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   TimeLimit=04:00:00 Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   SubmitTime=2010-06-30T00:14:24 EligibleTime=2010-06-30T00:14:24
   StartTime=2010-06-30T00:14:24 EndTime=2010-06-30T04:14:24
   SuspendTime=None SecsPreSuspend=0
...

Notice the “Account=pi_nagaraj” field – in this example, this is our default account. This should also be the same as our primary Unix group

[araim1@maya-usr1 ~]$ id
uid=28398(araim1) gid=1057(pi_nagaraj) groups=100(users),700(contrib),
701(alloc_node_ssh),1057(pi_nagaraj),32296(pi_gobbert)
[araim1@maya-usr1 ~]$

The primary group is given above as “gid=1057(pi_nagaraj)”

Suppose we also work for another PI “gobbert”. When running jobs for that PI, it’s only fair that we charge the computing resources to them instead. To accomplish that, we may add the “--account” option to our batch scripts.

#SBATCH --account=pi_gobbert

Note that if you specify an invalid name for the account (a group that does not exist, or which you do not belong to), the scheduler will silently revert back to your default account. You can quickly check the Account field in the scontrol output to make sure the option worked.

[araim1@maya-usr1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_gobbert QOS=normal
...
[araim1@maya-usr1 ~]$
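Because an invalid account name is not reported as an error, it can be worth checking the Account field automatically. The following minimal sketch extracts the field from a line of the sample output above; the line is hard-coded so that the snippet runs without SLURM, and on the cluster the real scontrol output would be piped in as indicated in the comment.

```shell
# Sample line copied from the scontrol output above. On the cluster you
# would pipe the real command instead: scontrol show job 25097 | grep -o ...
scontrol_line="   Priority=4294798165 Account=pi_gobbert QOS=normal"

# Extract the Account=... field, then keep only the value after the "=".
account=$(printf '%s\n' "$scontrol_line" | grep -o 'Account=[^ ]*' | cut -d= -f2)
echo "Job is charged to: $account"
```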

Node selection

The CPU cluster in taki has several portions, as explained in the system description. By default, your job will start on the next available node(s) that comply with all parameters of your request. The default order in which nodes are assigned follows from performance studies, such as reported in HPCF tech. rep. HPCF-2019-1, that demonstrate that in core-to-core comparisons, the 2013 CPUs of taki are no slower than the 2018 CPU cores. Only for heavily parallelized jobs do the 2018 CPUs have a direct benefit. To specify the type of node and possibly override the default selection, use the flag

--constraint=hpcf2013

for instance, in your SLURM submission script.

Any user can check on the features for each node by running the following command

sinfo --format=%b,%N

Note that not every user may have access to all of these features via the above --constraint directive.

Requesting an exclusive node

The default behavior of the scheduler is that multiple jobs are assigned to one node, if their requested numbers of processes do not exceed the total number of cores on the node and the total of their requested memory fits into the node’s memory. This is the most efficient use of the limited number of nodes, each of which has a large number of cores. In case you have a real reason for not wanting to share a node, such as because your job needs all of the node’s memory or because you are running a performance study, you can ask the scheduler to reserve the entire node for you with the option

#SBATCH --exclusive

The memory limit

Each job has a limit on the memory that it can use. This is in place, so that one job cannot overwhelm the node and other users’ code on the node. By default, this limit is the amount of memory of the node divided by the number of computational cores on the node (or more precisely, the limit is slightly less to allow for some space for the operating system).

Just like estimating the wall time expected for your job, you should estimate the memory needed for your job and then include that information in your SLURM submission script. This allows the scheduler to select the appropriate node for your job. The memory limit may be specified per core or per node. To set the limit per core, simply add a line to your submission script

#SBATCH --mem-per-cpu=5000

Memory is measured in MB so the above statement is requesting 5000MB of memory or 5GB. Similarly, to set the limit per node rather than by process, you can use this line instead

#SBATCH --mem=5000

In the serial case, the two options are equivalent. For parallel computing situations it may be more natural to use the per core limit, given that the scheduler has some freedom to assign processes to nodes for you.
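To see how the two options relate for a parallel job, the following sketch computes the per-node memory that corresponds to a given per-core limit. It assumes one core per task and 8 tasks per node, as in Example 4 above; both numbers are illustrative only.

```shell
# Per-core limit in MB, as in the --mem-per-cpu example above.
mem_per_cpu_mb=5000
# Assumed number of tasks per node (one core per task), as in Example 4.
ntasks_per_node=8

# Memory the job needs on each node under these assumptions.
mem_per_node_mb=$((mem_per_cpu_mb * ntasks_per_node))
echo "--mem-per-cpu=${mem_per_cpu_mb} with ${ntasks_per_node} tasks per node needs ${mem_per_node_mb} MB per node"
```

Under these assumptions, the same request expressed per node would be --mem=40000.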

If your job is killed because it has exceeded its memory limit, you will receive an error similar to the following in your stderr output. Notice that the effective limit is reported in the error.

slurmd[n1]: error: Job 13902 exceeded 3065856 KB memory limit, being killed
slurmd[n1]: error: *** JOB 13902 CANCELLED AT 2010-04-22T17:21:40 ***
srun: forcing job termination

Requesting an arbitrary number of tasks

So far on this page we’ve requested some number of nodes, and some number of tasks per node. But what if our application requires a number of tasks like 11, which cannot be split evenly among a set of nodes, unless we use one process per node, which isn’t a very efficient use of those nodes? We can split our 11 processes among as few as two nodes by requesting only the total number of tasks, without specifying anything else like how many nodes to use. The scheduler will figure this out for us. If two nodes are idle, i.e., all of their cores are available, it will most likely use the minimum number of nodes needed to accommodate our tasks. But if no nodes are idle, there may still be enough cores available across several nodes, so requesting only the number of tasks might let your job start faster.
Request this with the following line in your SLURM submission script

#SBATCH --ntasks=11

Ordinarily, you would use this line to replace the specifications in --nodes and --ntasks-per-node, but two of these together can also be usefully combined. However, consider directly requesting the needed memory, if that is really the reason for your restrictions.
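The minimum number of nodes for a given number of tasks follows from a ceiling division by the number of cores per node. The sketch below illustrates the arithmetic; the value of 8 cores per node is a hypothetical figure chosen only so that the result matches the two nodes mentioned above, and the actual core counts are given in the system description.

```shell
ntasks=11
# Hypothetical core count per node, for illustration only.
cores_per_node=8

# Ceiling division: round up so that every task gets a core.
min_nodes=$(( (ntasks + cores_per_node - 1) / cores_per_node ))
echo "At least $min_nodes node(s) are needed for $ntasks tasks"
```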

Requesting an arbitrary number of tasks using slurm array

Slurm also allows the use of --array when submitting jobs. These job arrays enable the user to submit and manage a collection of similar jobs quickly. All jobs must have the same initial options. Job arrays are only supported for batch jobs, and array index values can be specified using the --array or -a option of the sbatch command, as in the example below. More information can be found in the official SLURM documentation. An example job submission script called array_test.sbatch that illustrates the --array feature of slurm:

Download: ../code/slurm_array/array_test.sbatch

An example array_test shell script that illustrates the --array feature of slurm:

Download: ../code/slurm_array/array_test

To start from the command line:

$ sbatch array_test.sbatch
Submitted batch job 31367

Once the jobs have run, the output shows the node on which each array element ran and the core to which it was assigned:

$ grep host slurm-*.out
slurm-31367_10.out:I am element 10 on host n112, pid 12488's current affinity list: 2
slurm-31367_11.out:I am element 11 on host n112, pid 12487's current affinity list: 3
slurm-31367_12.out:I am element 12 on host n112, pid 12535's current affinity list: 4
slurm-31367_13.out:I am element 13 on host n112, pid 12555's current affinity list: 5
slurm-31367_14.out:I am element 14 on host n112, pid 12272's current affinity list: 6
slurm-31367_15.out:I am element 15 on host n112, pid 12791's current affinity list: 7
...

Note: A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example, “--array=0-15%4” will limit the number of simultaneously running tasks from this job array to 4.
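The downloadable files are not reproduced on this page. As a rough idea of what a job array script can look like, here is a minimal sketch that produces output of a similar shape to the lines above. It is a guess for illustration only, not the actual array_test.sbatch: the job name is hypothetical, the index range simply reuses the “0-15%4” example from the note, and the taskset utility is assumed to be installed on the compute nodes. SLURM_ARRAY_TASK_ID is the environment variable in which SLURM passes each element its index.

```shell
#!/bin/bash
#SBATCH --job-name=array_sketch      # hypothetical job name
#SBATCH --array=0-15%4               # 16 elements, at most 4 running at once

# Report which array element this is and where it runs.
report_element() {
    echo "I am element ${SLURM_ARRAY_TASK_ID:-0} on host $(uname -n)"
    # taskset (util-linux) prints the cores this process may use; skip if missing.
    if command -v taskset >/dev/null 2>&1; then
        taskset -cp $$ || echo "affinity not available"
    fi
}

report_element
```

A common use of the index is to select a different input for each element, for example by building a file name from ${SLURM_ARRAY_TASK_ID}.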

Setting a ‘begin’ time

You can tell the scheduler to wait until a specified time before attempting to run your job. This is useful, for example, if your job requires many nodes: being a conscientious user, you may want your job to wait until late at night to run. By adding the following to your batch script, you can have the scheduler wait until 1:30am on 2018-10-20, for example.

#SBATCH --begin=2018-10-20T01:30:00

Alternatively, you can specify a time relative to the current time:

#SBATCH --begin=now+1hour

See “man sbatch” for more information.
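Since a fixed date has to be edited before every submission, it can be convenient to compute the timestamp instead and pass --begin on the sbatch command line. The sketch below assumes GNU date (for its -d option), and run.slurm is a hypothetical script name; it only prints the command rather than submitting anything.

```shell
#!/bin/bash
# Print tomorrow at 01:30 in the YYYY-MM-DDTHH:MM:SS form used with --begin.
# Assumes GNU date (the -d option), as found on typical Linux systems.
next_begin_time() {
    date -d 'tomorrow 01:30' +%Y-%m-%dT%H:%M:%S
}

# Show the command that would be run; run.slurm is a hypothetical script name.
echo "sbatch --begin=$(next_begin_time) run.slurm"
```

Options given on the sbatch command line take precedence over #SBATCH lines in the script, so the script itself does not need to be changed.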

Dependencies

You may want a job to wait until another one starts or finishes. This can be useful if one job’s input depends on the other’s output. It can also be useful to ensure that you’re not running too many jobs at once. For example, suppose we want our job to wait until the jobs with IDs 15030 and 15031 have both terminated, whether or not they succeeded. This can be accomplished by adding the following to our batch script.

#SBATCH --dependency=afterany:15030:15031

You can also specify that both jobs should have finished in a non-error state before the current job can start.

#SBATCH --dependency=afterok:15030:15031

Notice that the above examples required us to note down the job IDs of our dependencies and specify them when launching the new job. Suppose you have a collection of jobs, and your only requirement is that only one should run at a time. A convenient way to accomplish this is with the “singleton” dependency in conjunction with the --job-name option.

#SBATCH --dependency=singleton

This job will wait until no other job with the name given in --job-name is running under your username.
See “man sbatch” for more information.
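Noting down job IDs by hand can be avoided by capturing them at submission time. The following is a minimal sketch under stated assumptions: build_dependency and submit_chain are hypothetical helper names, and the script names passed to it are whatever job scripts you want to run one after another. It relies on the --parsable option of sbatch, which prints only the job ID (followed by a semicolon and the cluster name, if there is one).

```shell
#!/bin/bash
# build_dependency TYPE ID [ID...]
# Join a dependency type and job IDs with colons, e.g. afterok:15030:15031.
build_dependency() {
    local type="$1"
    shift
    local IFS=:
    echo "${type}:$*"
}

# submit_chain SCRIPT [SCRIPT...]
# Submit the scripts so that each one starts only after the previous one
# has finished successfully (afterok).
submit_chain() {
    local prev="" id script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            id=$(sbatch --parsable "$script") || return 1
        else
            id=$(sbatch --parsable --dependency="$(build_dependency afterok "$prev")" "$script") || return 1
        fi
        prev=${id%%;*}                 # drop a trailing ";cluster" if present
        echo "submitted $script as job $prev"
    done
}

# Submit only when script names are given on the command line.
if [ "$#" -gt 0 ]; then
    submit_chain "$@"
fi
```

Saved under a name of your choice, it would be called with the job scripts as arguments, in the order in which they should run.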

Requeue-ability of your jobs

By default it is assumed that your job can be restarted if a node fails, or if the cluster is about to be brought offline for maintenance. For many jobs this is a safe assumption, but sometimes it may not be.

For example, suppose your job appends to an existing data file as it runs. Suppose it runs partially, is then restarted, and runs to completion. The output will then be incorrect, and the error may not be easy for you to recognize. An easy way to avoid this situation is to make sure output files are newly created on each run.

Another way to avoid problems is to specify the following option in your submission script. This will prevent the scheduler from automatically restarting your job if any system failures occur.

#SBATCH --no-requeue

For very long-running jobs, you might also want to consider designing them to save their progress occasionally. This way, if it is necessary to restart such a job, it will not need to start again from the beginning.
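Saving progress can be as simple as recording the last completed step in a small file and reading it back at startup. The following is a minimal sketch of that idea, not a prescribed method: run_steps, the file name progress.chk, and the variables CHECKPOINT_FILE and TOTAL_STEPS are all hypothetical, and the echo stands in for the real work of one step.

```shell
#!/bin/bash
#SBATCH --job-name=checkpoint_sketch   # hypothetical job name

# run_steps CHECKPOINT_FILE TOTAL_STEPS
# Resume after the last step recorded in CHECKPOINT_FILE, if it exists.
run_steps() {
    local checkpoint="$1" total="$2" step=0
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        echo "running step $step"      # placeholder for the real work
        # Write to a temporary file first so the checkpoint is never half-written.
        echo "$step" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
    done
}

run_steps "${CHECKPOINT_FILE:-progress.chk}" "${TOTAL_STEPS:-5}"
```

If the job is restarted after step 3 has been recorded, the next run continues with step 4 instead of repeating the earlier steps. Remember to delete the checkpoint file before starting a genuinely new run.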

Selecting and excluding specific nodes

It is possible to select specific compute nodes for your job, although it is usually desirable not to do this. Usually we would rather request only the number of processors, nodes, memory, etc., and let the scheduler find the first available set of nodes that meets our requirements. To select specific nodes, for instance cnode032 and cnode033, put this line in your SLURM submission script:

#SBATCH --nodelist=cnode[032,033]

If, on the other hand, you have a reason to exclude a node from being considered for your job, you can do so, using cnode031 as an example, by putting the following line in your submission script:

#SBATCH --exclude=cnode031
