Scheduling on Magi relies on partition priorities. Each partition comes in three versions: PARTITION_NAME, PARTITION_NAME-SHORT and PARTITION_NAME-VERYSHORT.
Note that when a partition includes more than one node, only a subset of its nodes is eligible for jobs with an unlimited lifetime.
I will illustrate this policy with an example. The job is a script that sleeps for 10 minutes and then exits.
Let's check the available partitions:
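The content of sleep.slurm is not shown in this article. As a minimal sketch, assuming a single-task job, it could be created as follows (the #SBATCH options and the output file name are illustrative assumptions, not the actual script used on Magi):

```shell
# Write a hypothetical sleep.slurm; the real script is not shown in the article.
cat > sleep.slurm <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --output=sleep-%j.out

# Sleep for 10 minutes (600 seconds), then exit successfully.
sleep 600
exit 0
EOF
```

The job name and the partition are deliberately left out of the script so that they can be set on the sbatch command line for each submission.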
nicolas.greneche@magi1:~/tests$ scontrol show partitions
PartitionName=COMPUTE
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74]
   PriorityJobFactor=1 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-SHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=3000 PriorityTier=6000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-VERYSHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=03:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=6000 PriorityTier=9000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
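Only four fields of this output matter for the policy: the partition name, MaxTime, PriorityTier and Nodes. The sketch below pulls them out with awk. The function name summarize_partitions is a hypothetical helper, and it assumes the one-line-per-partition output produced by scontrol's --oneliner option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print name, time limit, priority tier and node list
# for each partition. Expects one partition per line on stdin, e.g. from:
#   scontrol show partitions --oneliner | summarize_partitions
summarize_partitions() {
  awk '{
    name = time = tier = nodes = "?"
    # Each field looks like Key=Value; keep only the four keys of interest.
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "PartitionName") name = kv[2]
      else if (kv[1] == "MaxTime") time = kv[2]
      else if (kv[1] == "PriorityTier") tier = kv[2]
      else if (kv[1] == "Nodes") nodes = kv[2]
    }
    printf "%-18s %-12s %-6s %s\n", name, time, tier, nodes
  }'
}
```

Piping the result through sort -k3,3nr would list the partitions from highest to lowest priority tier, which makes the relation between a shorter time limit and a higher tier easy to see.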
We have three partitions derived from COMPUTE: COMPUTE for jobs with an unlimited lifetime, COMPUTE-SHORT for jobs of at most three days and COMPUTE-VERYSHORT for jobs of at most three hours. COMPUTE includes one node (magi74), while the other two include two nodes each (magi74 and magi96). We submit our job several times on different partitions:
nicolas.greneche@magi1:~/tests$ squeue
JOBID PARTITION  NAME    USER      ST  TIME  NODES  NODELIST(REASON)
   77 COMPUTE    sleep2  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   79 COMPUTE-S  sleep4  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   80 COMPUTE-S  sleep5  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   76 COMPUTE    sleep1  nicolas.  R   4:23      1  magi74
   78 COMPUTE-S  sleep3  nicolas.  R   4:08      1  magi96
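The submissions that produce a queue like this one can be scripted. The sketch below is an assumption about how the five jobs were submitted, inferred from the job names and partitions listed by squeue. The function submit_sleep_jobs and the DRY_RUN variable are hypothetical and only serve the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit the same script under several job names
# and partitions. Set DRY_RUN=1 to print the sbatch commands instead of
# running them.
submit_sleep_jobs() {
  local script=${1:-sleep.slurm}
  local spec name partition
  for spec in sleep1:COMPUTE sleep2:COMPUTE \
              sleep3:COMPUTE-SHORT sleep4:COMPUTE-SHORT sleep5:COMPUTE-SHORT; do
    name=${spec%%:*}       # text before the colon
    partition=${spec#*:}   # text after the colon
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "sbatch --job-name=$name --partition=$partition $script"
    else
      sbatch --job-name="$name" --partition="$partition" "$script"
    fi
  done
}
```

Running DRY_RUN=1 submit_sleep_jobs prints the five commands without contacting the scheduler, which is a convenient way to check them before a real submission.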
We have a long-running job, sleep1, running on magi74, which is part of the COMPUTE partition, and a short job, sleep3, running on magi96, which is part of COMPUTE-SHORT (and also of COMPUTE-VERYSHORT). We can also see that sleep2 (a long-running job) and sleep4 and sleep5 (short jobs) are waiting in the queue. Let's submit a very short job:
nicolas.greneche@magi1:~/tests$ sbatch --job-name=sleep6 --partition=COMPUTE-VERYSHORT sleep.slurm
Let's see where this new job lands in the queue:
nicolas.greneche@magi1:~/tests$ squeue
JOBID PARTITION  NAME    USER      ST  TIME  NODES  NODELIST(REASON)
   77 COMPUTE    sleep2  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   79 COMPUTE-S  sleep4  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   80 COMPUTE-S  sleep5  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   81 COMPUTE-V  sleep6  nicolas.  PD  0:00      1  (Resources)
   76 COMPUTE    sleep1  nicolas.  R   4:23      1  magi74
   78 COMPUTE-S  sleep3  nicolas.  R   4:08      1  magi96
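The new job has moved ahead of sleep4 and sleep5, which can access the same subset of nodes but through a partition (COMPUTE-SHORT) with a lower priority. Its pending reason is (Resources): it is only waiting for a node to become free, whereas the other pending jobs are held back in favour of higher-priority partitions. A little later, the queue looks like this: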
nicolas.greneche@magi1:~/tests$ squeue
JOBID PARTITION  NAME    USER      ST  TIME  NODES  NODELIST(REASON)
   77 COMPUTE    sleep2  nicolas.  PD  0:00      1  (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   80 COMPUTE-S  sleep5  nicolas.  PD  0:00      1  (Resources)
   79 COMPUTE-S  sleep4  nicolas.  R   0:11      1  magi96
   81 COMPUTE-V  sleep6  nicolas.  R   0:26      1  magi74
My job is running on magi74, which is part of all the COMPUTE-based partitions, and it started BEFORE sleep2, a long-running job, even though sleep2 was submitted before sleep6.