Monitoring Jobs

Monitoring the status of running batch jobs

Once your job has been submitted, you can check its status with two key commands, squeue and scontrol. To interpret their output, it helps to know the most common job states:

Code  State      Explanation
R     Running    Job has a resource allocation and is currently executing
PD    Pending    Job is awaiting resource allocation
CD    Completed  Job has completed and exited
F     Failed     Job terminated with a non-zero exit code
CA    Canceled   Job was explicitly canceled by the user or a system administrator
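
When these codes turn up in scripts, a small lookup can make reports easier to read. The following is a minimal sketch that covers only the five codes listed above; the function name state_name is made up for this example, and any other state SLURM reports falls through to the default branch.

```shell
# Hypothetical helper: translate a compact job state code into its name.
# Only the five codes from the table above are handled.
state_name() {
    case "$1" in
        R)  echo "Running" ;;
        PD) echo "Pending" ;;
        CD) echo "Completed" ;;
        F)  echo "Failed" ;;
        CA) echo "Canceled" ;;
        *)  echo "Unknown ($1)" ;;   # any state not listed in the table
    esac
}

state_name PD
```

The last line prints Pending.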

squeue

squeue provides information about jobs in the SLURM scheduling queue and their state. It is best used for viewing job and job step information for active jobs. For more details, refer to the squeue manual or run squeue --help or man squeue.

$ squeue
      JOBID PARTITION     NAME   USER  ST     TIME  NODES NODELIST(REASON)
      33067     gpu      mean   user1  R       0:01      1 g05
      18956     gpu      calc   user2  R      48:38      1 g03
      18967     gpu      wrap   user1  R      14:25      1 g09

The most common arguments to squeue are -u $USER, which lists only your own jobs, and -j $JOBID, which lists the job with the given job ID.

To view the current user's jobs:

squeue -u $USER

To view a specific job, use the -j option followed by the job ID.

squeue -j $JOBID

Commands with options                                           Outcome
squeue --long                                                   Provide more job information
squeue --user=USER_ID                                           Provide job information for a specific user ID
squeue --states=pending                                         Show pending jobs only
squeue --account=ACCOUNT_ID                                     Provide information for jobs running under the given account ID
squeue --Format=jobid,prioritylong,feature,tres-alloc:50,state  Customize the output of squeue
squeue --help                                                   Show all options
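
For a quick overview of a busy queue, squeue output can be combined with standard text tools. The sketch below counts jobs per user and state. It assumes its input has one "user state" pair per line, which squeue can produce with -h (no header) and -o "%u %t"; the function name tally_jobs is hypothetical.

```shell
# Hypothetical helper: count jobs per (user, state) pair.
# Reads lines of the form "<user> <state>" on stdin.
tally_jobs() {
    awk 'NF >= 2 { n[$1 " " $2]++ }               # skip blank or short lines
         END     { for (k in n) print k, n[k] }' | sort
}

# On a cluster with SLURM available:
#   squeue -h -o "%u %t" | tally_jobs
```

For the three jobs in the sample squeue output shown earlier, this would print "user1 R 2" and "user2 R 1".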

scontrol

scontrol is used to monitor and modify queued jobs. It provides more information about the system than squeue.

$ scontrol show jobs
 JobId=32895 JobName=bash 
 UserId=user1(163392) GroupId=pi_hp(1136) MCS_label=N/A
 Priority=356 Nice=0 Account=pi_hp QOS=normal
 JobState=RUNNING Reason=None Dependency=(null)
 Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
 RunTime=9-04:18:31 TimeLimit=10-00:00:00 TimeMin=N/A 
 SubmitTime=2021-11-12T14:39:33 EligibleTime=2021-11-12T14:39:33 AccrueTime=2021-11-12T14:39:33
 StartTime=2021-11-13T07:28:04 EndTime=2021-11-23T07:28:04 Deadline=N/A 
 SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-13T07:28:04
 Partition=gpu AllocNode:Sid=ada:17816 ReqNodeList=(null) ExcNodeList=(null) NodeList=g05 BatchHost=g05
 NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:*
 TRES=cpu=24,mem=150G,node=1,billing=69,gres/gpu=4 Features=rtx_6000 DelayBoot=00:00:00
 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*  MinCPUsNode=24 MinMemoryNode=150G MinTmpDiskNode=0
 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
 Command=/bin/bash
 WorkDir=/nfs/ada/hp/users/user1/code/projects/semi-sup-cmsf/constrained_msf 
 TresPerNode=gpu:4 MailUser=navanek1 MailType=NONE

It is worth paying attention to fields such as Command, TimeLimit, Requeue, StdErr, and StdOut.
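
Because scontrol prints each record as Key=Value pairs, a single field can be pulled out with standard tools. The following is a rough sketch, not part of SLURM: the function name job_field is made up, and it assumes the value contains no spaces (a WorkDir or Command with spaces would be cut short).

```shell
# Hypothetical helper: print the value of one Key=Value field from
# `scontrol show job` output read on stdin.
job_field() {
    # Put each Key=Value pair on its own line, then match the key exactly.
    tr -s ' ' '\n' | awk -v key="$1" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# On a cluster with SLURM available:
#   scontrol show job 32895 | job_field TimeLimit
```

Matching on the full "Key=" prefix keeps TimeLimit from being confused with fields such as TimeMin or RunTime.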

To display detailed information about a specific job:

$ scontrol --details show job 3918

Reviewing completed jobs

We can retrieve the history of a completed job (one that is no longer in the queue) using the sacct command.

sacct

sacct reports job accounting information about active or completed jobs. For a complete list of sacct options, refer to the sacct manual or run man sacct.

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
       18963       wrap        gpu         pi          2  COMPLETED      0:0
       18964      mean1        gpu         pi          1  COMPLETED      0:0

To retrieve statistics for a particular job:

sacct -j <JobID> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize

Commands with options                                        Outcome
sacct --starttime=2021-09-27 --endtime=2021-10-04            Show information from 27 Sep 2021 to 4 Oct 2021
sacct --format="JobID,user,elapsed,MaxRSS,ReqMem,MaxVMSize"  Customize the output fields
sacct --accounts=ACCOUNT_ID                                  Show information for all the users under AccountID
sacct --help                                                 Show all options
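
Elapsed times reported by sacct (and the RunTime shown by scontrol) are easier to compare or sum once they are converted to seconds. The sketch below assumes the [days-]hours:minutes:seconds layout seen in this guide, and also accepts minutes:seconds; the function name elapsed_seconds is hypothetical.

```shell
# Hypothetical helper: convert [D-]HH:MM:SS or MM:SS to seconds.
elapsed_seconds() {
    echo "$1" | awk -F'[-:]' '{
        n = NF
        s = $n                               # seconds
        if (n >= 2) s += 60    * $(n - 1)    # minutes
        if (n >= 3) s += 3600  * $(n - 2)    # hours
        if (n >= 4) s += 86400 * $(n - 3)    # days
        print s
    }'
}

elapsed_seconds 9-04:18:31
```

The last line prints 793111, the RunTime of the job in the scontrol example expressed in seconds.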

Useful SLURM commands

sinfo

sinfo allows users to view information about SLURM nodes and partitions and their states.

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
---------  -----  ---------  -----  -----  --------------
gpu*       up     infinite      10  mix    g[01-02,05-12]
gpu*       up     infinite       3  idle   g[03-04,13]
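
To see at a glance how many nodes are in each state, the sinfo listing can be summed by its STATE column. This is a minimal sketch that assumes the default six-column layout shown above (NODES in the fourth field, STATE in the fifth); the function name nodes_by_state is made up, and a node that belongs to several partitions would be counted once per partition.

```shell
# Hypothetical helper: total node count per state from default sinfo output
# read on stdin. Header and separator lines are skipped because their
# fourth field is not a number.
nodes_by_state() {
    awk '$4 ~ /^[0-9]+$/ { n[$5] += $4 }
         END             { for (s in n) print s, n[s] }' | sort
}

# On a cluster with SLURM available:
#   sinfo | nodes_by_state
```

For the sample listing above, this would print "idle 3" and "mix 10".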

Other SLURM commands

Command  Meaning
sbatch   Submit a batch script to SLURM
sdiag    Display scheduling statistics and timing parameters
smap     A curses-based tool for displaying jobs, partitions and reservations
sprio    Display the factors that comprise a job’s scheduling priority
sreport  Generate canned reports from job accounting data and machine utilization statistics
srun     Launch one or more tasks of an application across requested resources
sshare   Display the shares and usage for each charge account and user
sstat    Display process statistics of a running job step
sview    A graphical tool for displaying jobs, partitions and reservations