Monitoring the status of running batch jobs
Once your job has been submitted, you can check its status with the two most important commands, squeue and scontrol. It is also important to understand the common job states in order to interpret their output:
Code | State | Explanation |
---|---|---|
R | Running | Job has a resource allocation and is currently executing |
PD | Pending | Job is awaiting resource allocation |
CD | Completed | Job has been completed and exited |
F | Failed | Job terminated with a non-zero exit code |
CA | Canceled | Job was explicitly canceled by the user or system administrator |
squeue
squeue provides information about jobs in the SLURM scheduling queue and their state. It is best used for viewing job and job step information for active jobs. For more details, refer to the squeue manual or run squeue --help or man squeue.
(base)[uw76577@ada ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
33067 gpu mean user1 R 0:01 1 g05
18956 gpu calc user2 R 48:38 1 g03
18967 gpu wrap user1 R 14:25 1 g09
The most common arguments to squeue are -u $USER, which lists only your own jobs, and -j $JOBID, which lists only the job with the given job ID.
To view the current user's jobs:
squeue -u $USER
To view a specific job, use the -j option followed by the job ID.
squeue -j $JOBID
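For scripts that need to wait on a job, the -j filter can be wrapped in a polling loop. The following is a minimal sketch rather than a site-provided tool: wait_for_job, SQUEUE_CMD, and POLL_SECONDS are hypothetical names introduced here, and the loop assumes that squeue prints nothing (or exits with an error) once a single, non-array job has left the queue.

```shell
# Poll squeue until the given job no longer appears in the queue.
# SQUEUE_CMD and POLL_SECONDS are illustrative knobs, not SLURM settings.
wait_for_job() {
    jobid=$1
    interval=${POLL_SECONDS:-30}
    # -h drops the header line; -o %t prints only the compact state code (R, PD, ...).
    while state=$(${SQUEUE_CMD:-squeue} -h -j "$jobid" -o %t 2>/dev/null) && [ -n "$state" ]; do
        echo "job $jobid is in state $state"
        sleep "$interval"
    done
    echo "job $jobid has left the queue"
}
```

For example, wait_for_job 33067 would report the state every 30 seconds until the job disappears from the queue. Frequent polling adds load to the scheduler, so a generous interval is preferable.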
Commands with options | Outcome |
---|---|
squeue --long | Provide more job information |
squeue --user=USER_ID | Provide job information for specific UserID |
squeue --states=pending | Show pending jobs only |
squeue --account=ACCOUNT_ID | Provide information for jobs running for given AccountID |
squeue --Format=jobid,prioritylong,feature,tres-alloc:50,state | Customize output of squeue |
squeue --help | Show all options |
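The squeue output options combine well with standard text tools. As a rough sketch, the hypothetical helper count_states below tallies jobs per state. It assumes its input is one compact state code per line, as produced by squeue -h -o %t (-h suppresses the header and %t selects the compact state code).

```shell
# Read one state code per line on stdin and print "STATE COUNT" pairs.
# Intended input: squeue -h -u "$USER" -o %t
count_states() {
    # uniq -c prints "COUNT STATE"; awk swaps the two columns.
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

A typical invocation would be squeue -h -u "$USER" -o %t | count_states, which prints a line such as "R 2" for each state present in the queue.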
scontrol
scontrol is used to monitor and modify queued jobs. It provides more information about the system than squeue.
(base)[uw76577@ada ~]$ scontrol show jobs
JobId=32895 JobName=bash
UserId=user1(163392) GroupId=pi_hp(1136) MCS_label=N/A
Priority=356 Nice=0 Account=pi_hp QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=9-04:18:31 TimeLimit=10-00:00:00 TimeMin=N/A
SubmitTime=2021-11-12T14:39:33 EligibleTime=2021-11-12T14:39:33 AccrueTime=2021-11-12T14:39:33
StartTime=2021-11-13T07:28:04 EndTime=2021-11-23T07:28:04 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-13T07:28:04
Partition=gpu AllocNode:Sid=ada:17816 ReqNodeList=(null) ExcNodeList=(null) NodeList=g05 BatchHost=g05
NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=150G,node=1,billing=69,gres/gpu=4 Features=rtx_6000 DelayBoot=00:00:00
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=* MinCPUsNode=24 MinMemoryNode=150G MinTmpDiskNode=0
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/bin/bash
WorkDir=/nfs/ada/hp/users/user1/code/projects/semi-sup-cmsf/constrained_msf
TresPerNode=gpu:4 MailUser=navanek1 MailType=NONE
Fields worth paying attention to include Command, TimeLimit, Requeue, StdErr, and StdOut.
To display information about a specific job:
(base)[uw76577@ada ~]$ scontrol show --detail JobId=3918
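Because scontrol prints many Key=Value pairs, it can help to filter for the few fields of interest. The sketch below defines a hypothetical helper, job_fields, that reads scontrol show job output on standard input and prints only the requested keys. It assumes the values contain no spaces, so a Command or WorkDir with embedded spaces would be truncated.

```shell
# Print selected Key=Value fields from `scontrol show job` output on stdin.
# Usage sketch: scontrol show job 3918 | job_fields JobState TimeLimit
job_fields() {
    # Build an alternation such as "JobState|TimeLimit" from the arguments.
    keys=$(printf '%s|' "$@")
    keys=${keys%|}
    # Fields are separated by spaces and newlines; put one field per line,
    # then keep only the lines that start with a requested key.
    tr -s ' ' '\n' | grep -E "^(${keys})="
}
```

Applied to the sample output above, job_fields JobState TimeLimit would print JobState=RUNNING and TimeLimit=10-00:00:00 on separate lines.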
Reviewing completed jobs
The history of a completed job (one that is no longer in the queue) can be retrieved with the sacct command.
sacct
sacct reports job accounting information about active or completed jobs. For a complete list of sacct options, please refer to the sacct manual or run man sacct.
(base)[uw76577@ada ~]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
18963 wrap gpu pi 2 COMPLETED 0:0
18964 mean1 gpu pi 1 COMPLETED 0:0
To retrieve statistics for a particular job:
sacct -j <JobID> \
--format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize
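For use in scripts, sacct can emit pipe-delimited output through its --parsable2 and --noheader options. As an illustrative sketch, the hypothetical helper nonzero_exit below lists every job or job step whose exit code is not 0:0. It assumes the input was produced with --format=JobID,State,ExitCode, so the three fields arrive in that order.

```shell
# Read `sacct --parsable2 --noheader --format=JobID,State,ExitCode` output
# on stdin and print the records whose exit code is not 0:0.
nonzero_exit() {
    # Fields are JobID|State|ExitCode; ExitCode has the form "code:signal".
    awk -F'|' '$3 != "0:0" { print $1, $2, $3 }'
}
```

A possible invocation is sacct -j <JobID> --parsable2 --noheader --format=JobID,State,ExitCode | nonzero_exit, which prints nothing when every step exited cleanly.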
Commands with options | Outcome |
---|---|
sacct --starttime=2021-09-27 --endtime=2021-10-04 | Show information from 27 Sept 2021 to 4 Oct 2021 |
sacct --format="JobID,user,elapsed,MaxRSS,ReqMem,MaxVMSize" | Show only the listed fields for each job |
sacct -s PD | Show pending jobs only |
sacct --allusers --accounts=ACCOUNT_ID | Show information for all the users under AccountID |
sacct --help | Show all options |
Useful SLURM commands
sinfo
sinfo allows users to view information about SLURM nodes and partitions and their states.
(base)[uw76577@ada ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
--------- ----- ---------- ------- ------- ----------
gpu* up infinite 10 mix g[01-02,05-12]
gpu* up infinite 3 idle g[03-04,13]
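To see at a glance how many nodes are free, the sinfo output can be summed per state. The sketch below is one possible approach under stated assumptions: nodes_by_state is a hypothetical helper, and its input is expected to come from sinfo -h -o "%P %D %t", where %P is the partition, %D the node count, and %t the compact node state.

```shell
# Read "PARTITION NODES STATE" lines on stdin and print "STATE TOTAL" pairs.
# Intended input: sinfo -h -o "%P %D %t"
nodes_by_state() {
    # Sum the node count (field 2) per state (field 3), then sort by state name.
    awk '{ count[$3] += $2 } END { for (s in count) print s, count[s] }' | sort
}
```

With the sample output above, sinfo -h -o "%P %D %t" | nodes_by_state would report 3 idle nodes and 10 nodes in the mix state.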
Other SLURM commands
Command | Meaning |
---|---|
sbatch | Submit a batch script to SLURM |
sdiag | Display scheduling statistics and timing parameters |
smap | A curses-based tool for displaying jobs, partitions and reservations |
sprio | Display the factors that comprise a job's scheduling priority |
sreport | Generate canned reports from job accounting data and machine utilization statistics |
srun | Launch one or more tasks of an application across requested resources |
sshare | Display the shares and usage for each charge account and user |
sstat | Display process statistics of a running job step |
sview | A graphical tool for displaying jobs, partitions and reservations |