Scheduling rules on maya

Introduction

This page documents the scheduling rules that implement the current usage rules. These rules differ from past rules in some significant ways, including the incorporation of SLURM concepts such as QOS and fair-share priority, but we believe that they will be natural to use. They are designed to implement the philosophical underpinnings of the usage rules concretely. The scheduling rules and usage rules are designed together to help support the productivity goals of the cluster, including:

  • Throughput – handle as many jobs as possible from our users.
  • Utilization – don’t leave processors idling if work is available.
  • Responsiveness – if you submit a job that will take X hours to run, ideally it shouldn’t have to wait more than X hours to start.
  • Give priority to faculty who have contributed to HPCF, but support the work of community users as much as possible.

All users are expected to read this page and the usage rules carefully; by using the cluster, you agree to abide by the rules stated in these pages.

Fundamentals

A partition represents a group of nodes in the cluster. There are several partitions, described in the following table; a command for listing them on the cluster itself is sketched after the table.

Partition Description Walltime limits
develop There are six nodes in the develop partition: n1, n2, n70, n112, n156 and n196. This partition is dedicated to code under development. Jobs using many cores may be tested, but run time is supposed to be negligible. 5 min default, 30 min max
batch The majority of the compute nodes on maya are allocated to this partition. There are 211 nodes: n3, …, n51, n71, …, n111, n113, …, n153, n157, …, n195, n197, …, n237. Jobs running on these nodes are considered “production” runs; users should have a high degree of confidence that bugs have been worked out. 5 day maximum
prod A large subset of the batch compute nodes on maya is allocated to this partition. There are 162 nodes: n71, …, n111, n113, …, n153, n157, …, n195, n197, …, n237. This partition is meant for contributing members of the cluster, allowing access to the medium_prod and long_prod QOS. 45 day maximum
gpu The nodes with two NVIDIA K20 GPUs each on maya are allocated to this partition. There are 18 nodes each with 2 GPUs: n52, …, n69. Jobs running on these nodes are considered “production” runs; users should have a high degree of confidence that bugs have been worked out. 5 day maximum
mic The nodes with two Intel Phi each on maya are allocated to this partition. There are 18 nodes each with 2 mic cards: n34, …, n51. Jobs running on these nodes are considered “production” runs; users should have a high degree of confidence that bugs have been worked out. 5 day maximum
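
The partitions in the table above can also be listed directly from the scheduler. The following is a minimal sketch, assuming the standard SLURM client tools are available on the login node:

    # List all partitions with their time limits, node counts, and node states
    sinfo

    # Show more detail for a single partition, e.g. develop
    sinfo --long --partition=develop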

When a user runs a job, time is charged to their PI group for accounting purposes, as explained in the usage rules. The historical accounting data is used in scheduling to influence priority, that is, to determine which job runs next when multiple jobs are queued. Note that priority does not affect jobs which are already running. On maya we have implemented fair-share rules: PIs who have used more than their allocation in their recent history will have a reduced priority, while PIs who have used less than their allocation will have an increased priority.
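
How your PI group’s recent usage translates into priority can be inspected with the standard SLURM accounting and priority tools; the commands below are a sketch under the assumption that these tools are installed on the login nodes, with pi_groupname as a placeholder for your PI group’s account name:

    # Show recent usage and the resulting fair-share factor for your PI group
    sshare --accounts=pi_groupname

    # Break down the priority (fair-share, age, etc.) of pending jobs
    sprio -l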

Following the terminology of the SLURM scheduler, queues are referred to as QOS’s, short for Quality of Service. Several QOS’s are available, designed to handle different kinds of jobs. Every job is run under a particular QOS, chosen by the user in the submission script, and is subject to that QOS’s resource limitations.

The following bullets first introduce the QOS’s from a motivational viewpoint; an example submission script is sketched after the list:

  • short – Designed for very short jobs, which may require many nodes but will not take very long – on the order of several minutes to a lunch break.
  • normal (default) – Designed for average length jobs, which may require a significant number of nodes. We consider average length to be on the order of a lunch break to half a workday. This is the default QOS if you do not specify one explicitly in your sbatch job submission script.
  • medium – Designed for medium-length jobs, which we consider to be on the order of half a workday up to an overnight run, but which require only a few nodes. There is a 16-node limit for any single user at a time and a limit of 96 nodes in total use with this QOS.
  • medium_prod – This QOS is similar to the medium QOS, but jobs are allowed twice the walltime and access is limited to contribution PI groups. There is a 48-node limit for any single user at a time and a limit of 96 nodes in total use with this QOS.
  • long – Designed for long jobs, which we consider to be on the order of overnight to several days. Any user (community or contribution) may use this QOS, but there is a 2-node maximum for any one user at a time and a limit of 64 nodes in total use with this QOS.
  • long_contrib – This QOS is similar to the long QOS, but access is limited to contribution PI groups. There is a 16-node limit on the number of nodes in use by any single user at a time and a limit of 64 nodes in total use with this QOS. Conflicts in usage are expected to be infrequent, and will be resolved between the affected PIs and the HPCF Point of Contact.
  • long_prod – This QOS is similar to the long QOS, but access is limited to contribution PI groups. There is an 8-node total limit on the number of nodes in use with this QOS. This QOS is intended for use only with the prod partition, so that partition must also be specified in your SLURM submission script, that is, --partition=prod. Conflicts in usage are expected to be very infrequent, and will be resolved between the affected PIs and the HPCF Point of Contact.
  • support – The support QOS is designed for critical jobs run by HPCF support personnel. It has minimal restrictions (time limits, node limits) and the highest possible priority. It is intended for special circumstances and not for everyday use.
    To use the support QOS, you must have access to the support account, and must specify --account=support in your batch script.
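
As a concrete illustration of choosing a QOS, the following is a minimal sketch of a submission script for a 2-node, 8-hour job under the medium QOS in the batch partition; the job name, account name, and executable are placeholders to be replaced with your own values:

    #!/bin/bash
    # Minimal example submission script; names and values are placeholders.
    # Partition and QOS are chosen from the tables and list above.
    #SBATCH --job-name=example_run
    #SBATCH --partition=batch
    #SBATCH --qos=medium
    #SBATCH --account=pi_groupname
    # 2 nodes with 16 tasks each stays within the per-user limits of the medium QOS.
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    # 8 hours of walltime: 2 nodes x 16 cores x 8 hours = 256 CPU hours, under the 1024-hour limit.
    #SBATCH --time=8:00:00
    #SBATCH --output=slurm.out
    #SBATCH --error=slurm.err

    # my_program is a placeholder for your executable.
    srun ./my_program

The script is submitted with sbatch, and the QOS and partition directives can be swapped for any of the combinations described above.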

The specific definitions of the QOS’s are given in the following table. Please note that a similar table can be generated on the command line with the ‘hpc_qosstat’ command.

QOS Wall time limit per job CPU time limit per job Total cores limit for the QOS Cores limit per user Total jobs limit per user
short 1 hour 1024 hours 2048 560
normal (default) 4 hours 1024 hours 256
medium 24 hours 1024 hours 2048 256
medium_prod 48 hours 2048 hours 2048 768
long 5 days 256 16 4
long_contrib 5 days 768 128 4
long_prod 45 days 64
support

where

  • CPU time limit per job is the maximum CPU time allowed for a single job, measured as the product of walltime and the number of cores requested by the job. Here, “CPU time” is actually a misnomer; please note the definition given in the previous sentence; we use the term following the SLURM documentation.
  • Total cores limit for the QOS is the number of cores that may be in use at any given time, across all jobs in the given QOS. Please note that some CPU nodes on the cluster have 16 cores while others have 8 cores. This means that a limit of 256 cores can mean a 16-node limit for the maya2013 nodes as well as a 32-node limit for the maya2010 and maya2009 nodes. Any combination of nodes can be used as long as the total number of cores for the QOS falls within the limit.
  • Cores limit per user is the number of cores that may be in use at any given time by a particular user in the given QOS. Any combination of nodes can be used as long as the total number of cores for the user in the QOS falls within the limit.

A general guideline is to choose the QOS with the minimum resources needed to run your job. This is good user behavior, which responsible users should follow, but it also carries a direct benefit to the user: backfilling. Backfilling is a feature of the scheduler that allows your job to start ahead of higher-priority jobs, provided that your job’s estimated run time is shorter than the estimated wait time for those jobs to start. A very responsible user (or one who really wishes to take advantage of backfilling) can set specific walltime and memory limits for their job, based on their own estimates, as in the sketch below.
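
The following directives (a sketch with placeholder values, to be added to a submission script such as the one sketched earlier) request a tight walltime and an explicit memory estimate instead of the QOS maxima, which makes the job a better candidate for backfilling:

    # A realistic walltime estimate of 2.5 hours instead of the 4-hour normal maximum
    #SBATCH --time=02:30:00
    # An estimated memory requirement per core in MB (placeholder value)
    #SBATCH --mem-per-cpu=2000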

Note that QOS’s work the same way across all partitions, including develop and batch. QOS limits (e.g., number of cores and walltime) are applied in conjunction with any partition limits. In the case of the develop partition, we suggest setting the QOS to the default “normal”, or simply leaving it blank.

The CPU time limit per job enforced in some of the QOS’s limits the number of nodes and the time limit that a job can request, in such a way that the total resource use of CPU time is limited. For this purpose, it is considered equivalent to use 64 nodes for 1 hour, 32 nodes for 2 hours, 16 nodes for 4 hours, etc. Thus, with 16 cores per node, the CPU time limit (walltime times number of cores) is 64 nodes times 16 cores times 1 hour, which equals 1024 hours of CPU time. For demonstration, the following table lists some sample combinations of numbers of nodes and time limits per job, all of which are equivalent to 1024 hours of CPU time. Note that when using --exclusive in a submission script, all cores on the allocated nodes will be counted in the CPU time limit calculation.

Number of nodes Cores per node Total number of cores Wall time (hours) CPU time (hours)
64 16 1024 1 1024
32 16 512 2 1024
16 16 256 4 1024
8 16 128 8 1024
4 16 64 16 1024
2 16 32 32 1024
1 16 16 64 1024
1 8 8 128 1024
1 4 4 256 1024
1 2 2 512 1024
1 1 1 1024 1024

Here, the number of nodes in the first column and the cores per node in the second column are multiplied to give the total number of cores in the third column; this is the quantity that enters into the SLURM definition of CPU time. Thus, the fourth column shows which wall times in hours yield 1024 hours of CPU time in the fifth column. The numbers in this table also explain how the wall time limits for the QOS’s short, normal, and medium were chosen, namely to accommodate the equivalence of the job that uses 64 nodes for 1 hour. Specifically, the wall time limits in each QOS together with the CPU time limit per job limit each job to 64 nodes in the short QOS, to 16 nodes in the normal QOS, and to 4 nodes in the medium QOS. These choices ensure that only short (1 hour) jobs can use many nodes in the cluster, while only jobs with relatively few nodes can take a long time (16 hours). Notice in any case that the wall time limits of a QOS also apply; for instance, 5 days for the long QOS is 120 hours, and thus jobs with the parameters of the last four rows of the above table would not complete even in the long QOS.
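
Before submitting, you can check a proposed job against the CPU time limit by doing the same arithmetic yourself; the following shell sketch reproduces one row of the table above:

    # CPU time (hours) = nodes x cores per node x walltime (hours)
    nodes=8
    cores_per_node=16
    walltime_hours=8
    cpu_time_hours=$(( nodes * cores_per_node * walltime_hours ))
    echo "CPU time: ${cpu_time_hours} hours"   # prints 1024, exactly at the 1024-hour limit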

How to submit jobs

Use of the batch system is discussed in detail on the how to run page.

Failure modes

In this section we give some commonly encountered failure modes and the kind of behavior you will observe when experiencing them.

  1. A community user tries to run in the long_contrib QOS.
    [araim1@maya-usr1 ~]$ sbatch run.slurm
    sbatch: error: Batch job submission failed: Job has invalid qos
    [araim1@maya-usr1 ~]$
    

    Additionally, no slurm.out or slurm.err output is generated.

  2. You attempt to use more than two nodes in the long QOS
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@maya-usr1 ~]$
    
  3. You attempt to use more than 30 nodes total in the long_contrib QOS
    [araim1@slurm-dev ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@maya-usr1 ~]$
    
  4. You attempt to use 30 nodes total in the long_contrib QOS, but not all nodes are available at the time of submission. Suppose we first submitted a 2-node job, and then a 30-node job.
    [araim1@maya-usr1 ~]$ squeue
      JOBID PARTITION     NAME     USER  ST       TIME  NODES QOS    NODELIST(REASON)
       4278     batch  users01   araim1  PD       0:00     30 normal (AssociationResourceLimit)
       4277     batch  users01   araim1   R       2:54      2 normal n[7-8]
    [araim1@maya-usr1 ~]$
    

    The 30 node job remains queued with reason “AssociationResourceLimit”, until all 30 nodes of long_contrib become available.

  5. Your job reaches a maximum walltime limit. The job is killed with a message in the stderr output.
    [araim1@maya-usr1 ~]$ cat slurm.err
    slurmd[n1]: error: *** JOB 59545 CANCELLED AT 2011-05-20T08:10:52 DUE TO TIME
    LIMIT ***
    [araim1@maya-usr1 ~]$
    
  6. Your job violates the maximum CPU time limit for its QOS. The job is killed with a message in the stderr output.
    [araim1@maya-usr1 ~]$ cat slurm.err 
    slurmd[n3]: *** JOB 4254 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
    slurmd[n3]: *** STEP 4254.0 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
    [araim1@maya-usr1 ~]$ 
    
  7. Run a job with a walltime limit too high for the QOS or partition.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@maya-usr1 ~]$
    
  8. Try to charge time to a PI who doesn’t exist, or who you don’t work for.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Invalid account specified
    [araim1@maya-usr1 ~]$
    
  9. Try to use a QOS that doesn’t exist.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job has invalid qos
    [araim1@maya-usr1 ~]$ 
    
  10. Try to use a partition that doesn’t exist.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Invalid partition name specified
    [araim1@maya-usr1 ~]$ 
    
  11. Try to use more processes per node than are available.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Requested node configuration is not available
    [araim1@maya-usr1 ~]$
    
  12. Try to use more nodes than available in a partition.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Node count specification invalid
    [araim1@maya-usr1 ~]$ 
    
  13. Invalid syntax in SLURM batch script
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: unrecognized option `--ndoes=2'
    sbatch: error: Try "sbatch --help" for more information
    [araim1@maya-usr1 ~]$ 
    
  14. Memory limit exceeded by your program
    [araim1@maya-usr1 ~]$ cat slurm.err 
    slurmd[n1]: error: Job 60204 exceeded 10240 KB memory limit, being killed
    slurmd[n1]: error: *** JOB 60204 CANCELLED AT 2011-05-27T19:34:34 ***
    [araim1@maya-usr1 ~]$ 
    
  15. You’ve set the memory limit too high for the available memory.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Requested node configuration is not available
    [araim1@maya-usr1 ~]$ 
    
  16. You try to use more than 30 minutes of walltime in the develop partition by setting the --time flag. The job will be stuck in the pending state, with reason “PartitionTimeLimit”.
    [araim1@maya-usr1 ~]$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES QOS     NODELIST(REASON)
      62280   develop     SNOW   araim1 PD       0:00      1 normal  (PartitionTimeLimit)
    [araim1@maya-usr1 ~]$