Scheduling Rules on taki

Introduction

This page documents the scheduling rules implementing the current usage rules. The scheduling rules and usage rules are designed together to help support the productivity goals of the cluster, including:

  • Throughput – handle as many jobs as possible from all users.
  • Utilization – don’t leave processors idling if work is available.
  • Responsiveness – if you submit a job that will take X hours to run, ideally it shouldn’t have to wait more than X hours to start.
  • Give priority to faculty who have contributed to HPCF, but support the work of community users as much as possible.

All users are expected to read this page and the usage rules carefully; by using the cluster, you agree to abide by the rules stated in these pages.

Fundamentals

A slurm parent account represents a set of access controls afforded to research groups who belong to (or inherit from) these parent accounts. Each research group, at the time of its creation in the HPCF, is required to inherit from one of three slurm parent accounts. Questions regarding how the parent account for your research group might be modified should be directed to the HPCF Point of Contact.

 

Parent account | Description
general | Average HPC user. Permitted use of the cluster within the usage rules and other restrictions (QOS access) as described on this webpage.
contrib | HPC user who has contributed to HPCF, for instance by participating in grant proposals for the maintenance or extension of HPCF. Permitted use of the cluster within the usage rules and other restrictions (QOS access) as described on this webpage. These users enjoy access to the develop, gpu, and high_mem partitions with limited QOS options as described below.
contrib+ | HPC user who has contributed significant funding to HPCF. Permitted use of the cluster within the usage rules and other restrictions as described on this webpage. These users can access all partitions and the ‘+’ QOS options as described below.

 

A partition represents a group of nodes in the cluster. There are several partitions:

Click here for a hardware description of the partitions.

 

Partition | Description
develop | There are four nodes in the develop partition: cnode001, cnode030, cnode101, and cnode134. This partition is dedicated to code under development. Jobs using many cores may be tested, but run time is supposed to be negligible.
high_mem | There are 42 nodes: cnode002-cnode029 and cnode031-cnode044.
cpu2021 | There are 18 nodes: cnode051-cnode068.

When a user runs a job, time is charged to their PI group for accounting purposes, as explained in the usage rules. The historical accounting data is used in scheduling to influence priority. Priority helps to determine which job will run next if there are multiple queued jobs. Note that priority does not affect jobs which are already running. On taki we have implemented fair-share rules. PIs who have used more than their allocation in their recent history will have a reduced priority, while PIs who have used less than their allocation will have an increased priority.
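
To see the usage and fair-share information recorded for your account, and the priority factors of your pending jobs, the standard slurm utilities sshare and sprio can be used (a sketch, assuming these utilities are available to regular users on taki; $USER expands to your own user name):

sshare -u $USER    # fair-share usage and factor for your associations
sprio -u $USER     # priority factor breakdown for your pending jobs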

While the fair-share factor is calculated from the history of scheduled jobs, the raw shares factor is determined solely by contribution level and directly augments priority. Users belonging to research groups designated as general receive 0 additional priority units, those designated as contrib receive 1 additional priority unit, and those designated as contrib+ receive at least 1 additional priority unit.
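
The raw shares assigned to your association can be inspected with sacctmgr (a sketch, assuming regular users are permitted to query associations on taki):

sacctmgr show association where user=$USER format=Account,User,Fairshare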

Following the terminology of the slurm scheduler, queues are referred to as a Quality of Service or QOS. Several QOSs are available, which are designed to handle different kinds of jobs. Every job is run under a particular QOS, chosen by the user in the submission script, and is subject to the resource limitations of that QOS.

The following QOS options exist on taki and have these motivations:

  • short – Designed for very short jobs, on the order of several minutes to a lunch break.
  • normal (default) – Designed for average length jobs. We consider average length to be on the order of a lunch break to a few hours. This is the default QOS if you do not specify one explicitly in your sbatch job submission script.
  • medium – Designed for medium length jobs, which we consider to be on the order of half a workday up to an overnight run.
  • long – Designed for long jobs, which we consider to be on the order of overnight to several days.
  • support – For necessary testing and special cases by system administration and user support only.
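
For illustration, a minimal sbatch submission script that selects one of these QOS options might look like the following sketch (the partition, resource counts, wall time, and program name are placeholders, not recommendations):

#!/bin/bash
#SBATCH --job-name=example_job      # placeholder job name
#SBATCH --partition=high_mem        # partition to run in
#SBATCH --qos=medium                # QOS chosen from the tables below
#SBATCH --nodes=2                   # number of nodes
#SBATCH --ntasks-per-node=16        # 16 tasks (cores) per node
#SBATCH --time=12:00:00             # wall time request, within the medium limit

srun ./example_program              # placeholder executable

With 2 nodes of 16 cores each running for at most 12 hours, this request stays within the medium QOS limits in the tables below (2 x 16 x 12 = 384 CPU hours, below the 512-hour CPU time limit per job).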

 

The following QOS options are only available for users belonging to research groups designated as general.

QOS | Wall time limit per job | CPU time limit per job | Total cores limit for the QOS | Cores limit per user | Total jobs limit per user
short | 1 hour | 512 hours | 1080 | 216 | 32
normal (default) | 4 hours | 512 hours | 1080 | 216 | 32
medium | 24 hours | 512 hours | 1080 | 216 | 32
long | 5 days | | 540 | 36 | 1

 

The following QOS options are only available for users belonging to research groups designated as contrib.

QOS | Wall time limit per job | CPU time limit per job | Total cores limit for the QOS | Cores limit per user | Total jobs limit per user
short | 1 hour | 512 hours | 1080 | 216 | 32
normal (default) | 4 hours | 512 hours | 1080 | 216 | 32
medium | 24 hours | 512 hours | 1080 | 216 | 32
long | 5 days | | 540 | 36 | 1

 

The following QOS options are only available for users belonging to research groups designated as contrib+.

QOS | Wall time limit per job | CPU time limit per job | Total cores limit for the QOS | Cores limit per user | Total jobs limit per user
short+ | 1 hour | 1152 hours | 4320 | 1152 |
normal+ (default) | 4 hours | 3456 hours | 4320 | 864 |
medium+ | 24 hours | 4032 hours | 2160 | 360 |
long+ | 12 days | | 1152 | 96 | 4

where

  • Wall time limit per job is the maximum amount of time (as seen by a clock on the wall) that the job may run.
  • CPU time limit per job is the maximum CPU time allowed for a single job, measured as the product of wall time and the number of cores requested by the job. Strictly speaking, “CPU time” is a misnomer here; we use the term, with the definition just given, to follow the slurm documentation.
  • Total cores limit for the QOS is the number of cores that may be in use at any given time, across all jobs in the given QOS. Please note that some CPU nodes on the cluster have 16 cores while others have 8 cores. Any combination of nodes can be used as long as the total number of cores for the QOS falls within the limit.
  • Cores limit per user is the number of cores that may be in use at any given time by a particular user in the given QOS. Any combination of nodes can be used as long as the total number of cores for the user on the QOS falls within the limit.
  • Total jobs limit per user is the total number of jobs that may be running at any given time by a particular user in the given QOS.
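
For example, under the normal QOS a job that requests 128 cores (8 nodes with 16 cores each) may run for at most 512 CPU hours / 128 cores = 4 hours of wall time, which coincides with the normal wall time limit; a job requesting 216 cores is limited to roughly 2.4 hours of wall time by the 512-hour CPU time limit per job, even though the wall time limit alone would allow 4 hours.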

A similar table can be generated at any time via the taki command line by submitting the following command:

sacctmgr show QOS format=Name,MaxWall,MaxTRESMins,GrpTRES,MaxTRESPU,MaxJobsPU

To identify which QOS set you can access, use the following commands:

id -ng userName    # prints your default pi_group
sacctmgr show account pi_group

where userName is your taki user ID and pi_group is the group name printed by the first command. You will see something like

Account Descr Org
pi_group pi_group contrib+

The last column, “Org”, indicates which QOS set you have access to. To see this for yourself with username $USERNAME, use this command (note the backticks):

sacctmgr show account `id -ng $USERNAME`

Another way to check which PI group you belong to, and the corresponding QOS, is the following command, again supplying your own username via the environment variable “$USERNAME” (note the double quotes):

hpc_userInfo "$USERNAME"

This is particularly useful if you belong to multiple PI groups, since it will print the QOS availability for each group you are a part of, as well as print your primary group.

A general guideline is to choose the QOS with the minimum resources needed to run your job. This is good user behavior that responsible users should follow, but it also benefits you directly through back-filling. Back-filling is a feature of the scheduler that allows your job to be scheduled ahead of higher-priority jobs, provided that your job’s estimated run time is shorter than the estimated wait time for those jobs to start. A very responsible user (or one who really wishes to take advantage of back-filling) can set specific wall-time and memory limits for their job, based on their own estimates.
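
As a sketch of how such estimates can be expressed (the values here are placeholders, not recommendations), the relevant sbatch directives are --time and --mem:

#SBATCH --time=02:30:00    # estimated wall time: 2 hours 30 minutes
#SBATCH --mem=8000         # estimated memory per node, in MB

A tighter, but still safe, estimate gives the scheduler more opportunities to back-fill your job into otherwise idle time slots.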

Note that QOS options work the same way across all partitions including develop. QOS limits (e.g. number of cores and wall-time) are applied in conjunction with any partition limits. In the case of the develop partition we suggest setting the QOS to the default “normal”, or simply leaving it blank.
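
For instance, a short interactive test on the develop partition might be requested along these lines (a sketch; the task count and time are placeholders):

srun --partition=develop --ntasks=8 --time=00:05:00 --pty bash

Since no QOS is specified, the job runs under the default normal QOS, as suggested above.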

It is considered equivalent to use 64 nodes for 1 hour, 32 nodes for 2 hours, 16 nodes for 4 hours, etc. With 16 cores per node, each of these amounts to the same CPU time (wall time times number of cores): for instance, 64 nodes times 16 cores times 1 hour equals 1024 hours of CPU time. For demonstration, the following table lists some sample combinations of numbers of nodes and time limits per job, all of which are equivalent to 1024 hours of CPU time. Note that when using --exclusive in a submission script, all cores on the allocated nodes are counted in the CPU time limit calculation.

Number of nodes | Cores per node | Total number of cores | Wall time (hours) | CPU time (hours)
64 | 16 | 1024 | 1 | 1024
32 | 16 | 512 | 2 | 1024
16 | 16 | 256 | 4 | 1024
8 | 16 | 128 | 8 | 1024
4 | 16 | 64 | 16 | 1024
2 | 16 | 32 | 32 | 1024
1 | 16 | 16 | 64 | 1024
1 | 8 | 8 | 128 | 1024
1 | 4 | 4 | 256 | 1024
1 | 2 | 2 | 512 | 1024
1 | 1 | 1 | 1024 | 1024

Here, the number of nodes in the first column and the cores per node in the second column are multiplied to give the total number of cores in the third column; this is the quantity that enters into the slurm definition of CPU time. Thus, the fourth column shows which wall times in hours yield the 1024 hours of CPU time in the fifth column.
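
To make the table concrete, the row with 8 nodes corresponds to a request along the following lines (a sketch; only the resource directives are shown):

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=16
#SBATCH --time=08:00:00    # 8 nodes x 16 cores x 8 hours = 1024 hours of CPU time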

How to submit jobs

Use of the batch system is discussed in detail on the how to run page.