
Monitoring Jobs

Monitoring the status of running batch jobs

Once your job has been submitted, you can check its status using the squeue command. squeue shows a listing of all currently queued jobs and their states. Common states include:

Code  State      Explanation
R     Running    Job has a resource allocation and is currently executing
PD    Pending    Job is awaiting resource allocation
CD    Completed  Job has completed and exited
F     Failed     Job terminated with a non-zero exit code
CA    Cancelled  Job was explicitly cancelled by the user or a system administrator

squeue

squeue provides information about jobs in the Slurm scheduling queue, and is best used for viewing job and job-step information for active jobs. For more details on squeue, refer to the squeue manual or run man squeue.

$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      33067       gpu     mean    user1  R       0:01      1 g05
      18956       gpu     calc    user2  R      48:38      1 g03
      18967       gpu     wrap    user1  R      14:25      1 g09

The most common arguments to squeue are -u $USER, which lists only your own jobs, and -j <jobid>, which lists the job with the given job ID.

To view current user jobs:

squeue -u $USER

To filter by job ID, use the -j option followed by the job ID.

squeue -j 2542
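The output of squeue also combines well with standard shell tools. As an illustrative sketch (the pipeline is an assumption, not part of the official documentation; it assumes squeue is available on the cluster), you can summarize your own jobs by state:

```shell
# Count your queued jobs per state. --noheader suppresses the header
# row and --format="%T" prints only the extended state name, one per
# job, so sort | uniq -c tallies jobs in each state.
squeue -u "$USER" --noheader --format="%T" | sort | uniq -c | sort -rn
```

The same pattern works with any of the squeue format fields, e.g. "%P" to count jobs per partition.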
Commands with options  Outcome
squeue --long  Provide more job information
squeue --user=USER_ID  Show jobs for a specific user ID
squeue --states=pending  Show pending jobs only
squeue --account=ACCOUNT_ID  Show jobs running under a given account ID
squeue --Format=jobid,prioritylong,feature,tres-alloc:50,state  Customize the output of squeue
squeue --help  Show all options

 

Checking finished jobs

We can retrieve the history of a completed job (one no longer in the queue) using the sacct command.

sacct

sacct reports job accounting information about active or completed jobs. For a complete list of sacct options, refer to the sacct manual or run man sacct.

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
       18963       wrap        gpu         pi          2  COMPLETED      0:0 
       18964      mean1        gpu         pi          1  COMPLETED      0:0 

To retrieve statistics for a particular job:

sacct -j <JobID> --format=User,JobID,JobName,Partition,State,Time,Start,End,Elapsed,MaxRSS,MaxVMSize,NNodes,NCPUS,NodeList
Commands with options  Outcome
sacct --starttime=2021-09-27 --endtime=2021-10-04  Show jobs from 27 Sept 2021 to 4 Oct 2021
sacct --format="JobID,User,Elapsed,MaxRSS,ReqMem,MaxVMSize"  Customize the output fields of sacct
sacct --accounts=ACCOUNT_ID  Show jobs for all users under a given account ID
sacct --help  Show all options
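sacct output can also be post-processed with shell tools. The pipeline below is an illustrative sketch (it is not part of the Slurm documentation and assumes GNU date and a cluster with sacct available) for spotting jobs that did not finish cleanly:

```shell
# List jobs from the last 7 days whose State is not COMPLETED.
# --noheader drops the header rows and --parsable2 emits clean
# pipe-delimited fields, which the awk filter then inspects.
sacct --starttime="$(date -d '7 days ago' +%F)" --noheader --parsable2 \
      --format=JobID,JobName,State,ExitCode |
  awk -F'|' '$3 != "COMPLETED" {print $1, $2, $3, $4}'
```

Jobs in states such as FAILED, CANCELLED, or TIMEOUT will be printed with their exit codes, which is a quick way to triage a batch of runs.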

 

Useful SLURM commands

sinfo

sinfo allows users to view information about Slurm nodes and partitions and their states.

 $ sinfo
       PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
       gpu*         up   infinite     10    mix g[01-02,05-12]
       gpu*         up   infinite      3   idle g[03-04,13]

scontrol

scontrol is used to monitor and modify queued jobs. It provides more detailed information about the system than squeue and sinfo.

$ scontrol show jobs
 JobId=32895 JobName=bash 
 UserId=user1(163392) GroupId=pi_hp(1136) MCS_label=N/A
 Priority=356 Nice=0 Account=pi_hp QOS=normal
 JobState=RUNNING Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
 RunTime=9-04:18:31 TimeLimit=10-00:00:00 TimeMin=N/A 
 SubmitTime=2021-11-12T14:39:33 EligibleTime=2021-11-12T14:39:33 AccrueTime=2021-11-12T14:39:33
 StartTime=2021-11-13T07:28:04 EndTime=2021-11-23T07:28:04 Deadline=N/A 
 SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-13T07:28:04
 Partition=gpu AllocNode:Sid=ada:17816 ReqNodeList=(null) ExcNodeList=(null) NodeList=g05 BatchHost=g05
 NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:*
 TRES=cpu=24,mem=150G,node=1,billing=69,gres/gpu=4 Features=rtx_6000 DelayBoot=00:00:00
 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*  MinCPUsNode=24 MinMemoryNode=150G MinTmpDiskNode=0
 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
 Command=/bin/bash
 WorkDir=/nfs/ada/hp/users/user1/code/projects/semi-sup-cmsf/constrained_msf 
 TresPerNode=gpu:4 MailUser=user1 MailType=NONE

To display detailed information about a specific job:

$ scontrol -d show job 3918
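Because scontrol prints KEY=VALUE pairs, a single field is easy to pull out with grep. A minimal sketch (the job ID is illustrative and assumes scontrol is available on the cluster):

```shell
# Extract just the current state of a job from scontrol's output.
# The -o flag makes grep print only the matching KEY=VALUE token.
scontrol show job 3918 | grep -o 'JobState=[A-Z]*'
```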

srun

The srun command is used to run a parallel job on a Slurm-managed cluster. When invoked directly from the command line, srun first creates a resource allocation in which to run the parallel job.

$ srun --time=00:05:00 --mem=200 --gres=gpu:1 hostname 
  g11

salloc

salloc is used to obtain a Slurm job allocation (a set of nodes), typically to execute a command, releasing the allocation when the command finishes. The command blocks until the allocation is granted, at which point an interactive shell is started on the assigned node(s). The user can then run normal commands and launch applications. This is useful for troubleshooting or debugging a program, or when a program requires user input.

To launch an interactive job requesting 1 GPU, 100 GB of memory, and 20 minutes of wall time:

 $ salloc --nodes=1 --time=0:20:00 --gres=gpu:1 --mem=100G
    salloc: Granted job allocation 33519
 $ srun hostname
    g11
Other SLURM commands
Command Meaning
sbatch Submit a batch script to Slurm
sdiag Display scheduling statistics and timing parameters
smap A curses-based tool for displaying jobs, partitions and reservations
sprio Display the factors that comprise a job’s scheduling priority
sreport Generate canned reports from job accounting data and machine utilization statistics
srun Launch one or more tasks of an application across requested resources
sshare Display the shares and usage for each charge account and user
sstat Display process statistics of a running job step
sview A graphical tool for displaying jobs, partitions and reservations