Table of Contents
- Scheduling Fundamentals on taki: Partitions, QOS’s, and more
- Interacting with the SLURM Scheduling System
- Running Serial Jobs
- Running Parallel Jobs
- Some details about the batch system
Running a program on taki is different from running one on a standard workstation. When we log into the cluster, we are interacting with the login node. But we would like our programs to run on the compute nodes, which is where the real computing power of the cluster is. We will walk through the process of running serial and parallel code on the cluster, and then later discuss some of the finer details. This page uses the code examples from the compilation tutorial. Please download and compile those examples first, before following the run examples here.
Resource intensive jobs (long running, high memory demand, etc.) must be run on the compute nodes of the cluster. You cannot execute jobs directly on the compute nodes yourself; you must request the cluster’s batch system do it on your behalf. To use the batch system, you will submit a special script which contains instructions to execute your job on the compute nodes. When submitting your job, you specify a partition (group of nodes, e.g., develop for testing or compute for production) and a QOS (a classification that determines what kind of resources your job will need). Your job will wait in the queue until it is “next in line”, and free processors on the compute nodes become available. Which job is “next in line” is determined by the scheduling rules of the cluster. Once a job is started, it continues to run until it either completes or reaches its time limit, in which case it is terminated by the scheduler.
Scheduling Fundamentals on taki: Partitions, QOS’s, and more
Please read the scheduling rules web page at some point for complete background on the available queues and their limitations. The examples on this page are designed for running code on the CPU cluster, such as the C programs in the compile tutorial. They start with the develop partition, which is designed for testing code in brief runs, and mention the compute partition, which is for production runs. For other software and packages, see the Other Software and Packages web page.
Interacting with the Batch System
There are several basic commands you need to know to submit jobs, cancel them, and check their status. These are:
- sbatch – submit a job to the batch queue system
- squeue – check the current jobs in the batch queue system
- sinfo – view the current status of the queues
- scancel – cancel a job
The first command we will mention is scancel. If you have submitted a job that you no longer want, you should be a responsible user and kill it. This will prevent resources from being wasted, and allow other users’ jobs to run. Jobs can be killed while they are pending (waiting to run), or while they are actually running. To remove a job from the queue or to cancel a running job cleanly, use the scancel command with the identifier of the job to be deleted, for instance
[gobbert@taki-usr1 ~]$ scancel 9133
The job identifier can be obtained from the job listing from squeue (see below) or immediately after using sbatch, when you originally submitted the job (also below). See “man scancel” for more information.
Now that we know how to cancel a job, we will see how to submit one. You can use the sbatch command to submit a script to the queue system.
[gobbert@taki-usr1 ~]$ sbatch run.slurm
Submitted batch job 9133
In this example, run.slurm is the script we are sending to the slurm scheduler. We will see shortly how to formulate such a script. Notice that sbatch returns a job identifier. We can use this to kill the job later if necessary (as in the scancel example above), or to check its status. For more information, see “man sbatch”.
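Because sbatch prints the identifier in a fixed message, a script can capture it for later calls to squeue or scancel. The following is a minimal sketch that assumes the message has the form shown above; the helper name job_id_from_sbatch is made up for illustration and is not part of SLURM.

```shell
# Hypothetical helper: extract the job ID from sbatch's confirmation message.
# Assumes the message reads "Submitted batch job <ID>", as in the example above.
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Typical use on the cluster (commented out, since it needs a SLURM installation):
#   jobid=$(job_id_from_sbatch "$(sbatch run.slurm)")
#   scancel "$jobid"
```

As an alternative, sbatch has a --parsable option that prints only the job ID, which avoids parsing the message at all.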
You can use the squeue command to check the status of jobs in the batch queue system. Here’s an example of the basic usage:
[gobbert@taki-usr1 Hello_Serial]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  9141     batch hello_se  gobbert  R       0:01      1 cnode031
The most interesting column is the one titled ST for “status”. It shows what a job is doing at this point in time. The state “PD” indicates that the job is pending, that is, queued and waiting for resources. When enough free processor cores become available, it will change to the “R” state and begin running. You may also see a job with status “CF”, which means its nodes are being configured just before it starts, or “CG”, which means it is completing (such as still writing stdout and stderr) and about to exit the batch system. Other statuses are possible too, see “man squeue”. Once a job has exited the batch queue system, it will no longer show up in the squeue display.
We can also see several other pieces of useful information. The TIME column shows the walltime used by the job so far. For example, job 9141 has been running for 1 second. The NODELIST column shows which compute node or nodes have been assigned to the job. For job 9141, node cnode031 is being used.
The sinfo command also shows the current status of the batch system, but from the point of view of the SLURM partitions. Here is a partial example
[gobbert@taki-usr1 Hello_Serial]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up   infinite      2 drain* gpunode[116-117]
gpu          up   infinite      1   idle gpunode001
develop      up   infinite      2   idle cnode[001,030]
batch*       up   infinite     42   idle cnode[002-029,031-044]
high_mem     up   infinite     42   idle cnode[002-029,031-044]
Running Serial Jobs
This section assumes you have already compiled the serial “Hello, world!” example. Now we will see how to run it several different ways.
Test runs on the user node
The most obvious way to run the program is on the user node, which we normally log into.
[gobbert@taki-usr1 Hello_Serial]$ ./hello_serial
Hello world from taki-usr1
We can see the reported hostname which confirms that the program ran on the login node.
Let us submit our job to the develop partition, since we just created it and we are not completely sure that it works. The following script will accomplish this. Download it using wget to your workspace alongside the “hello_serial” executable.
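For readers without the downloaded file at hand, here is a sketch of what such a script might look like, reconstructed from the description of the flags on this page. The file published on the site may differ in details; the job name, file names, and time value are taken from the terminal sessions and text here, not from the original script.

```shell
#!/bin/bash
#SBATCH --job-name=hello_serial     # name shown in squeue
#SBATCH --output=slurm.out          # file that captures stdout
#SBATCH --error=slurm.err           # file that captures stderr
#SBATCH --partition=develop         # partition for brief test runs
#SBATCH --qos=short                 # QOS for short jobs
#SBATCH --time=00:05:00             # walltime limit of 5 minutes
#SBATCH --nodes=1                   # one node
#SBATCH --ntasks-per-node=1         # one task on that node: a serial job

./hello_serial
```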
Here, the job-name flag simply sets the string that is displayed as the name of the job in squeue. The output and error flags set the file names for capturing standard output (stdout) and standard error (stderr), respectively. The next flag chooses the develop partition of the CPU cluster to request for the job to run on. The QOS flag requests the short queue, since this particular “Hello, world!” job should just run for a brief moment, and the time flag provides a more precise estimate of the maximum possible time for the job to take. After a job has reached its time limit, it is stopped by the scheduler. This is done to ensure that everyone has a fair chance to use the cluster. The next two flags set the total number of nodes requested, and the number of MPI tasks per node; by choosing both of these as 1, we are requesting space for a serial job. Now we are ready to submit our job to the scheduler. To accomplish this, use the sbatch command as follows
[gobbert@taki-usr1 Hello_Serial]$ sbatch run-testing.slurm
Submitted batch job 9111
If the submission is successful, the sbatch command returns a job identifier to us. We can use this to check the status of the job (squeue), or delete it (scancel) if necessary. This job should run very quickly if there are processors available, but we can try to check its status in the batch queue system. The first call to squeue in the following snapshot shows that the code ran 3 seconds up to that point in time. Checking again a little later does not show our job any more in the list, thus it is done at this point in time.
[gobbert@taki-usr1 Hello_Serial]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  9111   develop hello_se  gobbert  R       0:03      1 cnode001
[gobbert@taki-usr1 Hello_Serial]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
After possibly a brief delay, we should have obtained two output files. The file slurm.err contains stderr output from our program. If slurm.err is not empty, check the contents carefully as something may have gone wrong. The file slurm.out contains our stdout output; it should contain the hello world message from our program.
[gobbert@taki-usr1 Hello_Serial]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert  0 Jan 27 18:17 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 26 Jan 27 18:17 slurm.out
[gobbert@taki-usr1 Hello_Serial]$ more slurm.err
[gobbert@taki-usr1 Hello_Serial]$ more slurm.out
Hello world from cnode001
Notice that the hostname is no longer the login node, but is one of the develop nodes; see output of sinfo above. The develop partition limits a job to 5 minutes by default and 30 minutes at maximum, measured in “walltime”, which is just the elapsed run time.
Production runs on the compute partition
Once our job has been tested and we are confident that it is working correctly, we can run it in the compute partition, which is intended for production runs and has a longer time limit. Now the walltime limit for our job may be raised, based on the QOS we choose. There are also many more compute nodes available in this partition. Start by creating the following script with some of the most important features.
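A sketch of such a production script, under the same assumptions as for the test run (names and values taken from the text on this page, not from the original file). Only the partition line differs from the develop version.

```shell
#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=compute         # production partition instead of develop
#SBATCH --qos=short                 # still a very quick job
#SBATCH --time=00:05:00             # walltime limit of 5 minutes
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

./hello_serial
```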
The flags for job-name, output, and error are the same as in the previous script. The partition flag is now set to compute. The qos flag still chooses the short QOS and the time limit is still set at 5 minutes, since the job is still expected to be quick, just like before for this “Hello, world!” example. It is also still a serial job, so no change to the requested number of nodes and tasks per node. To submit our job to the scheduler, we issue the command
[gobbert@taki-usr1 Hello_Serial]$ sbatch run-serial.slurm
Submitted batch job 9116
We can check the job’s status, but due to its speed, it has already completed and does not show up any more.
[gobbert@taki-usr1 Hello_Serial]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
This time our stdout output file indicates that our job has run on one of the compute nodes, rather than a develop node:
[gobbert@taki-usr1 Hello_Serial]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert  0 Jan 27 18:24 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 26 Jan 27 18:24 slurm.out
[gobbert@taki-usr1 Hello_Serial]$ more slurm.err
[gobbert@taki-usr1 Hello_Serial]$ more slurm.out
Hello world from cnode032
Selecting a QOS
Notice that we specified the short QOS with the qos flag, because we know our job is very quick. In the same way, you can access any of the QOS’s listed in the scheduling rules. The rule of thumb is that you should always choose the QOS whose wall time limit is the most appropriate for your job. Realizing that these limits are hard upper limits, you will want to stay safely under them, or in other words, pick a QOS whose wall time limit is comfortably larger than the actually expected run time.
Running Parallel Jobs
This section assumes that you have successfully compiled the parallel “Hello, world!” example. Now, we will see how to run this program on the CPU cluster.
Test runs on the develop partition
First we will run the hello_parallel program as a single process. This will appear very similar to the serial job case. The difference is that now we are using the MPI-enabled executable hello_parallel, rather than the plain hello_serial executable. Create the following script in the same directory as the hello_parallel program. Notice the addition of the “srun” command before the executable, which is used to launch MPI-enabled programs on taki. The number of nodes and tasks per node requested are still 1.
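A sketch of a single-process script along these lines, assuming the same flags as in the serial case; the job name and file names are inferred from the terminal sessions on this page rather than copied from the original file.

```shell
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --qos=short
#SBATCH --time=00:05:00
#SBATCH --nodes=1                   # one node
#SBATCH --ntasks-per-node=1         # one MPI process on that node

# srun launches the MPI-enabled executable with the requested number of tasks
srun ./hello_parallel
```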
Now, we submit the script.
[gobbert@taki-usr1 hello_parallel]$ sbatch run-n1-ppn1.slurm
Submitted batch job 3268
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  3268   develop hello_pa  gobbert CF       0:01      1 cnode001
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
This time, we caught the job during start-up, with status CF, as the node was being initialized for the MPI job. A short moment later, the job was already done, and the queue no longer contains our job.
Checking the output after the job has completed, we can see that exactly one process has run and reported back.
[gobbert@taki-usr1 hello_parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert  0 Oct 12 12:26 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 65 Oct 12 12:26 slurm.out
[gobbert@taki-usr1 hello_parallel]$ more slurm.err
[gobbert@taki-usr1 hello_parallel]$ more slurm.out
Hello world from process 000 out of 001, processor name cnode001
Example 2: One node, two processes per node
Next, we will run the job on two processes of the same node. This is one important test, to ensure that our code will function in parallel. We want to be especially careful that the communications work correctly, and that processes do not hang. We modify the single process script and set “--ntasks-per-node=2”.
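In terms of the submission script, this change presumably touches only the resource request; a sketch of the two relevant lines, with the rest of the script left as in the single-process case:

```shell
#SBATCH --nodes=1                   # still one node
#SBATCH --ntasks-per-node=2         # two MPI processes on that node
```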
Submit the script to the batch queue system
[gobbert@taki-usr1 hello_parallel]$ sbatch run-n1-ppn2.slurm
Submitted batch job 3269
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  3269   develop hello_pa  gobbert  R       0:00      1 cnode001
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
This time, squeue shows the job running briefly, with 0 seconds so far, before the job is again done quickly and gone from the queue.
Now observe that two processes have run and reported in. Both were located on the same node as we expected.
[gobbert@taki-usr1 hello_parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert   0 Oct 12 12:28 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 130 Oct 12 12:28 slurm.out
[gobbert@taki-usr1 hello_parallel]$ more slurm.err
[gobbert@taki-usr1 hello_parallel]$ more slurm.out
Hello world from process 000 out of 002, processor name cnode001
Hello world from process 001 out of 002, processor name cnode001
Example 3: Two nodes, one process per node
Now let us try to use two different nodes, but only one process on each node. This will exercise our program’s use of the high performance network, which did not come into play when a single node was used.
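A sketch of the resource request for this case, assuming the rest of the script is unchanged from the earlier parallel examples:

```shell
#SBATCH --nodes=2                   # two different nodes
#SBATCH --ntasks-per-node=1         # one MPI process on each
```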
Submit the script to the batch queue system
[gobbert@taki-usr1 hello_parallel]$ sbatch run-n2-ppn1.slurm
Submitted batch job 3270
Notice that again we have two processes, but this time they have distinct processor names.
[gobbert@taki-usr1 hello_parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert   0 Oct 12 12:29 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 130 Oct 12 12:29 slurm.out
[gobbert@taki-usr1 hello_parallel]$ more slurm.err
[gobbert@taki-usr1 hello_parallel]$ more slurm.out
Hello world from process 000 out of 002, processor name cnode001
Hello world from process 001 out of 002, processor name cnode030
Example 4: Two nodes, eight processes per node
To illustrate the use of more processes per node, let us try a job that uses two nodes, eight processes on each node. This is still possible on the develop partition. Therefore it is possible to run small performance studies which are completely restricted to the develop partition. Use the following batch script.
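A sketch of the corresponding resource request, again assuming the remaining lines match the earlier parallel scripts. The total number of MPI processes is the product of the two values, here 2 × 8 = 16.

```shell
#SBATCH --nodes=2                   # two nodes
#SBATCH --ntasks-per-node=8         # eight MPI processes on each, 16 in total
```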
Submit the script to the batch system
[gobbert@taki-usr1 hello_parallel]$ sbatch run-n2-ppn8.slurm
Submitted batch job 3271
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  3271   develop hello_pa  gobbert CF       0:01      2 cnode[001,030]
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  3271   develop hello_pa  gobbert  R       0:01      2 cnode[001,030]
[gobbert@taki-usr1 hello_parallel]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
This time, we caught the job first during its initialization, then as a running job, before it vanished from the queue after finishing. Notice that squeue lists the two nodes used for the job by name.
Now observe the output. Notice that the processes in the following output have reported mostly, but not completely, in numerical order, since all processes can access stdout and stderr and may reach them sooner or later at random. But upon careful checking, there are two nodes reported by hostname and eight MPI processes per node.
[gobbert@taki-usr1 hello_parallel]$ ll slurm.*
-rw-rw---- 1 gobbert pi_gobbert    0 Oct 12 12:31 slurm.err
-rw-rw---- 1 gobbert pi_gobbert 1040 Oct 12 12:31 slurm.out
[gobbert@taki-usr1 hello_parallel]$ more slurm.err
[gobbert@taki-usr1 hello_parallel]$ more slurm.out
Hello world from process 000 out of 016, processor name cnode001
Hello world from process 008 out of 016, processor name cnode030
Hello world from process 001 out of 016, processor name cnode001
Hello world from process 002 out of 016, processor name cnode001
Hello world from process 003 out of 016, processor name cnode001
Hello world from process 004 out of 016, processor name cnode001
Hello world from process 005 out of 016, processor name cnode001
Hello world from process 006 out of 016, processor name cnode001
Hello world from process 007 out of 016, processor name cnode001
Hello world from process 009 out of 016, processor name cnode030
Hello world from process 010 out of 016, processor name cnode030
Hello world from process 011 out of 016, processor name cnode030
Hello world from process 012 out of 016, processor name cnode030
Hello world from process 013 out of 016, processor name cnode030
Hello world from process 014 out of 016, processor name cnode030
Hello world from process 015 out of 016, processor name cnode030
Production runs on the compute partition
Now we have tested our program in several important configurations in the develop partition. We know that it performs correctly, and processes do not hang. We may now want to solve larger problems which are more time consuming, or perhaps we may wish to use more processes. We can promote our code to production by simply changing the partition to compute. We use the nodesused program from the compile tutorial as an example here. Since the expected run time of our job is still only several seconds at most, we keep the short QOS and the time limit of 5 minutes. Of course if this were a more substantial program, we might need to specify a longer QOS like normal. The submission script for the nodesused program, with obvious changes to the job-name and the srun line, reads as follows.
Running the code using sbatch gives the following results
Notice that the node numbers confirm that the job was run on the compute partition of the CPU cluster. As before for the parallel “Hello, world!” program, order of output lines to stdout is slightly random. The listing of the file nodesused.log shows that our code in the nodesused() function ordered the output by the MPI process IDs.
Additional details about the batch system
The CPU cluster in taki has several portions, as explained in the system description. By default, your job will start on the next available node(s) that comply with all parameters of your request. The default order in which nodes are assigned follows from performance studies, such as reported in HPCF tech. rep. HPCF-2018-18, that demonstrate that in core-to-core comparisons, the 2009 and 2013 CPUs of taki are no slower than the 2018 CPU cores. Only for heavily parallelized jobs do the 2018 CPUs have a direct benefit. To override the default selection of nodes, use the flag
, for instance, in your SLURM submission script.
Each job has a limit on the memory that it can use. This is in place, so that one job cannot overwhelm the node and other users’ code on the node. By default, this limit is the amount of memory of the node divided by the number of computational cores on the node (or more precisely, the limit is slightly less to allow for some space for the operating system).
Just like estimating the wall time expected for your job, you should estimate the memory needed for your job and then include that information in your SLURM submission script. This allows the scheduler to select the appropriate node for your job. The memory limit may be specified per core or per node. To set the limit per core, simply add a line to your submission script
Memory is measured in MB so the above statement is requesting 5000MB of memory or 5GB. Similarly, to set the limit per node rather than by process, you can use this line instead
In the serial case, the two options are equivalent. For parallel computing situations it may be more natural to use the per core limit, given that the scheduler has some freedom to assign processes to nodes for you.
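The two ways of stating the memory limit can be sketched as follows, using the 5000 MB value discussed above. Both are standard sbatch options; they are alternatives, so a script should contain only one of the two lines.

```shell
# Option 1: limit per core, 5000 MB for each allocated core
#SBATCH --mem-per-cpu=5000

# Option 2: limit per node, 5000 MB on each node (use instead of option 1)
#SBATCH --mem=5000
```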
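If a job exceeds its memory limit, the scheduler kills it and reports errors such as the following.
slurmd[n1]: error: Job 13902 exceeded 3065856 KB memory limit, being killed
slurmd[n1]: error: *** JOB 13902 CANCELLED AT 2010-04-22T17:21:40 ***
srun: forcing job termination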
Requesting an arbitrary number of tasks
So far on this page we’ve requested some number of nodes, and some number of tasks per node. But what if our application requires a number of tasks like 11, which cannot be split evenly among a set of nodes? That is, unless we use one process per node, which isn’t a very efficient use of those nodes. We can split our 11 processes among as few as two nodes, using the following script. Notice that we don’t specify anything else like how many nodes to use. The scheduler will figure this out for us. If two nodes are idle, i.e., all of their cores are available, it will most likely use the minimum number of nodes to accommodate our tasks. But if no nodes are idle, there may still be enough cores available across several nodes, in which case requesting only the number of tasks might let your job start faster.
Request this by this line in your SLURM submission script
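For the 11-task example, a sketch of such a request using the standard sbatch option for a total task count:

```shell
#SBATCH --ntasks=11                 # 11 MPI processes in total; the scheduler picks the nodes
```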
Ordinarily, you would use this line to replace the specifications in --nodes and --ntasks-per-node, but two of these together can also be usefully combined. However, consider directly requesting the needed memory, if that is really the reason for your restrictions.
Requesting an arbitrary number of tasks using slurm array
Slurm also supports the --array option when submitting jobs. These job arrays enable the user to submit and manage a collection of similar jobs quickly. All jobs must have the same initial options. Job arrays are only supported for batch jobs, and array index values can be specified using the --array or -a option of the sbatch command as in the example below. More information can be found in the official SLURM documentation. An example job submission script called array_test.sbatch that illustrates the --array feature of slurm:
An example array_test shell script that illustrates the --array feature of slurm:
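The following is a minimal sketch of a job array script, not the original array_test files. It assumes an index range of 0 to 15 and a made-up job name; the function array_element_banner is hypothetical. SLURM sets SLURM_ARRAY_TASK_ID for each array element and by default writes each element's output to slurm-<jobid>_<index>.out. The original output also reports the CPU affinity of each process, which this sketch leaves out.

```shell
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --array=0-15                # 16 array elements, indices 0 through 15
#SBATCH --ntasks=1                  # each element is a single task

# Print which array element this is and which host it runs on.
# Outside of SLURM the index variable is unset, so fall back to 0.
array_element_banner() {
    printf 'I am element %s on host %s\n' "${SLURM_ARRAY_TASK_ID:-0}" "$(uname -n)"
}

array_element_banner
```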
To start from the command line:
$ sbatch array_test.sbatch
Submitted batch job 31367
Once run one can see the output demonstrating the node in which the job ran and the core on which the job was assigned:
$ grep host slurm-*.out
slurm-31367_10.out:I am element 10 on host n112, pid 12488's current affinity list: 2
slurm-31367_11.out:I am element 11 on host n112, pid 12487's current affinity list: 3
slurm-31367_12.out:I am element 12 on host n112, pid 12535's current affinity list: 4
slurm-31367_13.out:I am element 13 on host n112, pid 12555's current affinity list: 5
slurm-31367_14.out:I am element 14 on host n112, pid 12272's current affinity list: 6
slurm-31367_15.out:I am element 15 on host n112, pid 12791's current affinity list: 7
...
Note: A maximum number of simultaneously running tasks from the job array may be specified using a “%” separator. For example “--array=0-15%4” will limit the number of simultaneously running tasks from this job array to 4.
Setting a ‘begin’ time
You can tell the scheduler to wait a specified amount of time before attempting to run your job. This is useful for example, if your job requires many nodes. Being a conscientious user, you may want to wait until late at night for your job to run. By adding the following to your batch script, we can have the scheduler wait until 1:30am on 2018-10-20 for example.
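A sketch of such a line for the date and time mentioned above, using the absolute time format that sbatch accepts for its --begin option:

```shell
#SBATCH --begin=2018-10-20T01:30:00   # do not start before 1:30am on 2018-10-20
```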
Alternatively, you can also specify a time that is relative to the current time by
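As an illustration of a relative start time, the line below delays the job by one hour from submission; the one-hour value is only an example.

```shell
#SBATCH --begin=now+1hour             # earliest start is one hour after submission
```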
See “man sbatch” for more information.
You may want a job to wait until another one starts or finishes. This can be useful if one job’s input depends on the other’s output. It can also be useful to ensure that you’re not running too many jobs at once. For example, suppose we want our job to wait until either of the jobs with IDs 15030 or 15031 completes. This can be accomplished by adding the following to our batch script.
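A sketch of one way to express this with the --dependency option of sbatch. The afterany type is satisfied once the named job has ended, whatever its exit status, and separating conditions with “?” means that any one of them is enough.

```shell
#SBATCH --dependency=afterany:15030?afterany:15031   # start once either job has ended
```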
You can also specify that both jobs should have finished in a non-error state before the current job can start.
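For the same two job IDs, a sketch using the afterok type, which requires every listed job to have completed successfully:

```shell
#SBATCH --dependency=afterok:15030:15031   # start only after both jobs finished without error
```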
Notice that the above examples required us to note down the job IDs of our dependencies and specify them when launching the new job. Suppose you have a collection of jobs, and your only requirement is that only one should run at a time. A convenient way to accomplish this is with the “singleton” flag in conjunction with the --job-name flag.
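A sketch of the two lines involved; the job name myjob is a placeholder chosen for illustration, and every job in the collection would use the same name.

```shell
#SBATCH --job-name=myjob              # placeholder name shared by all jobs in the collection
#SBATCH --dependency=singleton        # wait for earlier jobs with this name and user to end
```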
This job will wait until no other job by the name in --job-name is running from your account.
See “man sbatch” for more information.
Requeue-ability of your jobs
By default it is assumed that your job can be restarted if a node fails, or if the cluster is about to be brought offline for maintenance. For many jobs this is a safe assumption, but sometimes it may not be.
For example suppose your job appends to an existing data file as it runs. Suppose it runs partially, but then is restarted and then runs to completion. The output will then be incorrect, and it may not be easy for you to recognize. An easy way to avoid this situation is to make sure output files are newly created on each run.
Another way to avoid problems is to specify the following option in your submission script. This will prevent the scheduler from automatically restarting your job if any system failures occur.
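The standard sbatch option for this is shown in the following one-line sketch:

```shell
#SBATCH --no-requeue                  # never restart this job automatically
```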
For very long-running jobs, you might also want to consider designing them to save their progress occasionally. This way if it is necessary to restart such a job, it will not need to start completely again from the beginning.
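One simple way to save progress is to record the last completed step in a file and skip ahead on restart. The sketch below illustrates the idea in shell; the function name run_with_checkpoint, the file name progress.chk, and the echo that stands in for the real computation are all made up for illustration.

```shell
# Hypothetical checkpoint file; override by setting CHECKPOINT_FILE beforehand.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-progress.chk}"

# Run steps 1..N, resuming after the last step recorded in the checkpoint file.
run_with_checkpoint() {
    last_step=$1
    step=1
    if [ -f "$CHECKPOINT_FILE" ]; then
        # Resume with the step after the last one that completed.
        step=$(( $(cat "$CHECKPOINT_FILE") + 1 ))
    fi
    while [ "$step" -le "$last_step" ]; do
        echo "working on step $step"          # placeholder for the real computation
        # Write to a temporary file first, then rename, so that an interruption
        # never leaves a half-written checkpoint behind.
        echo "$step" > "$CHECKPOINT_FILE.tmp" && mv "$CHECKPOINT_FILE.tmp" "$CHECKPOINT_FILE"
        step=$(( step + 1 ))
    done
}
```

A batch script would call, for example, run_with_checkpoint 100; if the job is requeued after step 40, the next run continues with step 41.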
Charging computing time to a PI, for users under multiple PIs
If you are a member of multiple research groups on taki, this section applies to you. When you run a job on taki, the resources you have used (e.g., computing time) are “charged” to your PI. This simply means that there is a record of your group’s use of the cluster. This information is leveraged to make sure everyone has access to their fair share of resources, especially the PIs who have paid for nodes. Therefore, it is important to charge your jobs to the correct PI.
You have a “primary” account which your jobs are charged to by default. To see this, try checking one of your jobs as follows (suppose our job has ID 25097)
[araim1@maya-usr1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_nagaraj QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   TimeLimit=04:00:00 Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   SubmitTime=2010-06-30T00:14:24 EligibleTime=2010-06-30T00:14:24
   StartTime=2010-06-30T00:14:24 EndTime=2010-06-30T04:14:24
   SuspendTime=None SecsPreSuspend=0
...
Notice the “Account=pi_nagaraj” field – in this example, this is our default account. This should also be the same as our primary Unix group
[araim1@maya-usr1 ~]$ id
uid=28398(araim1) gid=1057(pi_nagaraj) groups=100(users),700(contrib),701(alloc_node_ssh),1057(pi_nagaraj),32296(pi_gobbert)
[araim1@maya-usr1 ~]$
The primary group is given above as “gid=1057(pi_nagaraj)”
Suppose we also work for another PI “gobbert”. When running jobs for that PI, it’s only fair that we charge the computing resources to them instead. To accomplish that, we may add the “--account” option to our batch scripts.
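For the PI in this example, whose group appears as pi_gobbert in the id output above, the line would presumably read:

```shell
#SBATCH --account=pi_gobbert          # charge this job to the pi_gobbert group
```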
Note that if you specify an invalid name for the account (a group that does not exist, or which you do not belong to), the scheduler will silently revert to your default account. You can quickly check the Account field in the scontrol output to make sure the option worked.
[araim1@maya-usr1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_gobbert QOS=normal
...
[araim1@maya-usr1 ~]$
Selecting and excluding specific nodes
It is possible to select specific compute nodes for your job, although it is usually desirable not to do this. Usually we would rather request only the number of processors, nodes, memory, etc., and let the scheduler find the first available set of nodes which meets our requirements. To select specific nodes, for instance, cnode032 and cnode033, put this line in your SLURM submission script
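A sketch of such a line for the two nodes named above, using the standard --nodelist option and SLURM's bracket notation for node ranges:

```shell
#SBATCH --nodelist=cnode[032-033]     # run on cnode032 and cnode033
```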
If on the other hand, you have a reason to exclude a node from being considered for your job, you can accomplish this, using cnode031 as example, by putting the following line in your submission script
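A sketch of the corresponding line with the standard --exclude option:

```shell
#SBATCH --exclude=cnode031            # never place this job on cnode031
```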