Magi scheduling explained

Scheduling on Magi relies on partition priorities. Each partition comes in three versions: PARTITION_NAME, PARTITION_NAME-SHORT, and PARTITION_NAME-VERYSHORT.

It should be highlighted that when a partition includes more than one node, only a subset of the nodes is eligible for infinite-lifetime jobs.

I will illustrate this policy with an example. The job is a script that sleeps for 10 minutes and exits.
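A minimal sleep.slurm matching that description could look like the sketch below. The script name comes from the sbatch commands used later in this walkthrough; the exact directives are an assumption, not the cluster's actual script.

```shell
#!/bin/bash
#SBATCH --job-name=sleep1        # overridden on the command line for sleep2..sleep6
#SBATCH --ntasks=1
#SBATCH --output=sleep-%j.out    # one log file per job ID

# Sleep for 10 minutes, then exit successfully.
sleep 600
```

The job does no real work; it only occupies a node long enough for us to observe how Slurm orders the queue.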

Let's check the available partitions:

nicolas.greneche@magi1:~/tests$ scontrol show partitions
PartitionName=COMPUTE
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74]
   PriorityJobFactor=1 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-SHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=3000 PriorityTier=6000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-VERYSHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=03:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=6000 PriorityTier=9000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

We have three partitions derived from COMPUTE: COMPUTE for infinite-lifetime jobs, COMPUTE-SHORT for jobs of up to three days (MaxTime=3-00:00:00), and COMPUTE-VERYSHORT for jobs of up to three hours (MaxTime=03:00:00). COMPUTE includes a single node (magi74), while the other two include two nodes (magi74 and magi96). Note the increasing PriorityTier values: 3000, 6000, and 9000. We submit our job several times to different partitions:
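The submission commands that produce the queue below are not shown; they were presumably along these lines (job names and partitions inferred from the squeue output that follows):

```shell
# Submit one long job to COMPUTE, one more that will have to wait,
# and three short jobs to COMPUTE-SHORT. The partition chosen at
# submission time determines the job's scheduling priority tier.
sbatch --job-name=sleep1 --partition=COMPUTE sleep.slurm
sbatch --job-name=sleep2 --partition=COMPUTE sleep.slurm
sbatch --job-name=sleep3 --partition=COMPUTE-SHORT sleep.slurm
sbatch --job-name=sleep4 --partition=COMPUTE-SHORT sleep.slurm
sbatch --job-name=sleep5 --partition=COMPUTE-SHORT sleep.slurm
```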

nicolas.greneche@magi1:~/tests$ squeue

           JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
              79 COMPUTE-S   sleep4 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
              80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
              76   COMPUTE   sleep1 nicolas.  R       4:23      1 magi74
              78 COMPUTE-S   sleep3 nicolas.  R       4:08      1 magi96

We have a long-running job, sleep1, running on magi74 (part of the COMPUTE partition), and a short job, sleep3, running on magi96 (part of COMPUTE-SHORT and COMPUTE-VERYSHORT). We can also see that sleep2 (a long-running job), sleep4, and sleep5 (short jobs) are waiting in the queue. Let's submit a very short job:

nicolas.greneche@magi1:~/tests$ sbatch --job-name=sleep6 --partition=COMPUTE-VERYSHORT sleep.slurm

Let's see where this new job lands in the queue:

nicolas.greneche@magi1:~/tests$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
                79 COMPUTE-S   sleep4 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
                80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
                81 COMPUTE-V   sleep6 nicolas. PD       0:00      1 (Resources)
                76   COMPUTE   sleep1 nicolas.  R       4:23      1 magi74
                78 COMPUTE-S   sleep3 nicolas.  R       4:08      1 magi96

The job went ahead of sleep4 and sleep5, which can access the same subset of nodes but through a partition (COMPUTE-SHORT) with a lower priority. A few minutes later, sleep1 and sleep3 have completed and the queue looks like this:
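To see for yourself why a pending job is held, you can query Slurm directly. These are standard Slurm commands; the job ID 81 is taken from the listing above.

```shell
# Show full details for the pending very-short job, including its
# Reason field and the partition it was submitted to.
scontrol show job 81

# List only pending jobs, sorted by descending priority.
squeue --state=PD --sort=-p
```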

nicolas.greneche@magi1:~/tests$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
                80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Resources)
                79 COMPUTE-S   sleep4 nicolas.  R       0:11      1 magi96
                81 COMPUTE-V   sleep6 nicolas.  R       0:26      1 magi74

My job is running on magi74, which is part of all COMPUTE-based partitions, and it started BEFORE sleep2, a long-running job, even though sleep2 was submitted before sleep6. This is the effect of the partitions' PriorityTier values: jobs in a higher-tier partition are allocated nodes first.
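On the administration side, a partition layout like the one reported by scontrol would be declared in slurm.conf roughly as follows. This is a sketch reconstructed from the scontrol output above, not the cluster's actual configuration file.

```
# slurm.conf excerpt (sketch): three tiers over the same hardware.
# A higher PriorityTier wins when jobs compete for the same nodes.
PartitionName=COMPUTE           Nodes=magi74      Default=YES MaxTime=UNLIMITED  PriorityTier=3000 PriorityJobFactor=1
PartitionName=COMPUTE-SHORT     Nodes=magi[74,96] Default=NO  MaxTime=3-00:00:00 PriorityTier=6000 PriorityJobFactor=3000
PartitionName=COMPUTE-VERYSHORT Nodes=magi[74,96] Default=NO  MaxTime=03:00:00   PriorityTier=9000 PriorityJobFactor=6000
```

The design choice is that the shorter a job's maximum lifetime, the higher the tier of the partition it can use, so short jobs are never starved by infinite-lifetime ones.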