====== Magi scheduling explained ======

Scheduling on Magi relies on partition priorities. Each partition comes in three versions: PARTITION_NAME, PARTITION_NAME-SHORT and PARTITION_NAME-VERYSHORT.

  * PARTITION_NAME is designed for very long jobs. This partition has a low priority, but jobs submitted to it have no time limit.
  * PARTITION_NAME-SHORT is designed for jobs of moderate duration (maximum 3 days). When the 3-day limit is reached, the job is killed whether it is finished or not. Jobs submitted to this partition pass ahead of those submitted to PARTITION_NAME.
  * PARTITION_NAME-VERYSHORT is designed for short jobs (maximum 3 hours). When the 3-hour limit is reached, the job is killed whether it is finished or not. Jobs submitted to this partition pass ahead of those submitted to PARTITION_NAME-SHORT (and, as a consequence, PARTITION_NAME).

It should be highlighted that when a partition includes more than one node, only a subset of the nodes is eligible for unlimited-lifetime jobs. I will illustrate this policy with an example. The job is a script that sleeps for 10 minutes and exits.
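The ''sleep.slurm'' script used in the examples below is not reproduced on this page; a minimal sketch of such a script could look like this (the #SBATCH values are assumptions, not the actual site configuration):

```shell
#!/bin/bash
#SBATCH --ntasks=1              # a single task is enough for this test
#SBATCH --output=sleep-%j.out   # hypothetical output file, one per job id

# Sleep for 10 minutes, then exit.
sleep 600
```

The partition and job name are passed on the ''sbatch'' command line in the examples below, so they are not hard-coded here.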
Let's check the available partitions:

<code>
nicolas.greneche@magi1:~/tests$ scontrol show partitions
PartitionName=COMPUTE
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74]
   PriorityJobFactor=1 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-SHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=3000 PriorityTier=6000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-VERYSHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=03:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=6000 PriorityTier=9000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
</code>

We have three partitions derived from COMPUTE: COMPUTE for unlimited-lifetime jobs, COMPUTE-SHORT for jobs of at most three days and COMPUTE-VERYSHORT for jobs of at most three hours. COMPUTE includes one node (magi74), while the other two include two nodes (magi74 and magi96).
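The field that drives the ordering described above is PriorityTier: Slurm considers higher-tier partitions first when allocating nodes. As a quick illustration (this is just a sketch, not part of the cluster tooling), sorting the three tier values from the scontrol output reproduces the precedence order:

```python
# PriorityTier values taken from `scontrol show partitions` above.
partition_tiers = {
    "COMPUTE": 3000,
    "COMPUTE-SHORT": 6000,
    "COMPUTE-VERYSHORT": 9000,
}

# Higher PriorityTier is considered first by the scheduler.
by_priority = sorted(partition_tiers, key=partition_tiers.get, reverse=True)
print(by_priority)  # ['COMPUTE-VERYSHORT', 'COMPUTE-SHORT', 'COMPUTE']
```

This matches the policy stated in the introduction: VERYSHORT jobs pass ahead of SHORT jobs, which pass ahead of unlimited-lifetime jobs.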
We submit our job several times to different partitions:

<code>
nicolas.greneche@magi1:~/tests$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     79 COMPUTE-S   sleep4 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     76   COMPUTE   sleep1 nicolas.  R       4:23      1 magi74
     78 COMPUTE-S   sleep3 nicolas.  R       4:08      1 magi96
</code>

We have a long-lifetime job sleep1 running on magi74, part of the COMPUTE partition, and a short job sleep3 running on magi96, part of COMPUTE-SHORT (and also COMPUTE). We can also see sleep2 (a long-lifetime job), sleep4 and sleep5 (short jobs) waiting in the queue.

Let's submit a very short job:

<code>
nicolas.greneche@magi1:~/tests$ sbatch --job-name=sleep6 --partition=COMPUTE-VERYSHORT sleep.slurm
</code>

Let's see the place of this new job in the queue:

<code>
nicolas.greneche@magi1:~/tests$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     79 COMPUTE-S   sleep4 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     81 COMPUTE-V   sleep6 nicolas. PD       0:00      1 (Resources)
     76   COMPUTE   sleep1 nicolas.  R       4:23      1 magi74
     78 COMPUTE-S   sleep3 nicolas.  R       4:08      1 magi96
</code>

The new job went ahead of sleep4 and sleep5, which can access the same subset of nodes but through a partition (COMPUTE-SHORT) with a lower priority.

A few moments later, once sleep1 and sleep3 have finished:

<code>
nicolas.greneche@magi1:~/tests$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Resources)
     79 COMPUTE-S   sleep4 nicolas.  R       0:11      1 magi96
     81 COMPUTE-V   sleep6 nicolas.  R       0:26      1 magi74
</code>

My job is running on magi74, which is part of all the COMPUTE-based partitions, and it started BEFORE sleep2, a long-lifetime job, even though sleep2 was submitted prior to sleep6.
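The outcome of that last scheduling pass can be sketched as a toy model (an illustration only, not Slurm's actual algorithm): pending jobs are ordered by partition tier before submission order, and each job takes the first free node its partition can use. Under those assumptions, the model reproduces the placement shown above:

```python
# Toy model of the scheduling pass above (an illustration, not Slurm's
# real algorithm). Tier values come from the scontrol output earlier.
TIERS = {"COMPUTE": 3000, "COMPUTE-SHORT": 6000, "COMPUTE-VERYSHORT": 9000}
NODES = {
    "COMPUTE": ["magi74"],
    "COMPUTE-SHORT": ["magi74", "magi96"],
    "COMPUTE-VERYSHORT": ["magi74", "magi96"],
}

# Pending jobs once sleep1 and sleep3 have finished: (job id, name, partition).
pending = [
    (77, "sleep2", "COMPUTE"),
    (79, "sleep4", "COMPUTE-SHORT"),
    (80, "sleep5", "COMPUTE-SHORT"),
    (81, "sleep6", "COMPUTE-VERYSHORT"),
]

free = {"magi74", "magi96"}
placement = {}

# Higher partition tier wins; submission order (job id) breaks ties.
for jid, name, part in sorted(pending, key=lambda j: (-TIERS[j[2]], j[0])):
    node = next((n for n in NODES[part] if n in free), None)
    if node is not None:
        free.discard(node)
        placement[name] = node

print(placement)  # {'sleep6': 'magi74', 'sleep4': 'magi96'}
```

sleep6 is examined first despite being submitted last, takes magi74, and sleep4 falls back to magi96; sleep2 and sleep5 stay pending, exactly as in the squeue output.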