====== Magi scheduling explained ======

Scheduling on Magi relies on partition priorities. Each partition comes in three versions: PARTITION_NAME, PARTITION_NAME-SHORT and PARTITION_NAME-VERYSHORT.

  * PARTITION_NAME is designed for very long jobs. This partition has a low priority, but jobs submitted to it have no time limit.
  * PARTITION_NAME-SHORT is designed for jobs of moderate duration (maximum 3 days). When the 3-day limit is reached, the job is killed whether it is finished or not. Jobs submitted to this partition pass ahead of those submitted to PARTITION_NAME.
  * PARTITION_NAME-VERYSHORT is designed for short jobs (maximum 3 hours). When the 3-hour limit is reached, the job is killed whether it is finished or not. Jobs submitted to this partition pass ahead of those submitted to PARTITION_NAME-SHORT (and, as a consequence, PARTITION_NAME).

It should be highlighted that when a partition includes more than one node, only a subset of the nodes is eligible for unlimited-lifetime jobs. I will illustrate this policy with an example. The job is a script that sleeps for 10 minutes and exits.
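The ''sleep.slurm'' script used in the examples below is not reproduced on this page; a minimal sketch of such a script could look like this (the #SBATCH values are assumptions, not the actual site configuration):

```shell
#!/bin/bash
#SBATCH --ntasks=1              # a single task is enough for this test
#SBATCH --output=sleep-%j.out   # hypothetical output file, one per job id

# Sleep for 10 minutes, then exit.
sleep 600
```

The partition and job name are passed on the ''sbatch'' command line in the examples below, so they are not hard-coded here.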
Let's check the available partitions:

<code>
nicolas.greneche@magi1:~/tests$ scontrol show partitions
PartitionName=COMPUTE
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74]
   PriorityJobFactor=1 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-SHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=3000 PriorityTier=6000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=COMPUTE-VERYSHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=03:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[74,96]
   PriorityJobFactor=6000 PriorityTier=9000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=80 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
</code>

We have three partitions derived from COMPUTE: COMPUTE for unlimited-lifetime jobs, COMPUTE-SHORT for jobs of at most three days and COMPUTE-VERYSHORT for jobs of at most three hours. COMPUTE includes one node (magi74), while the other two include two nodes (magi74 and magi96).
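The field that drives the ordering described above is PriorityTier: Slurm considers higher-tier partitions first when allocating nodes. As a quick illustration (this is just a sketch, not part of the cluster tooling), sorting the three tier values from the scontrol output reproduces the precedence order:

```python
# PriorityTier values taken from `scontrol show partitions` above.
partition_tiers = {
    "COMPUTE": 3000,
    "COMPUTE-SHORT": 6000,
    "COMPUTE-VERYSHORT": 9000,
}

# Higher PriorityTier is considered first by the scheduler.
by_priority = sorted(partition_tiers, key=partition_tiers.get, reverse=True)
print(by_priority)  # ['COMPUTE-VERYSHORT', 'COMPUTE-SHORT', 'COMPUTE']
```

This matches the policy stated in the introduction: VERYSHORT jobs pass ahead of SHORT jobs, which pass ahead of unlimited-lifetime jobs.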
We submit our job several times to different partitions:

<code>
nicolas.greneche@magi1:~/tests$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     79 COMPUTE-S   sleep4 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     76   COMPUTE   sleep1 nicolas.  R       4:23      1 magi74
     78 COMPUTE-S   sleep3 nicolas.  R       4:08      1 magi96
</code>

We have a long-lifetime job sleep1 running on magi74, part of the COMPUTE partition, and a short job sleep3 running on magi96, part of COMPUTE-SHORT (and also COMPUTE). We can also see sleep2 (a long-lifetime job), sleep4 and sleep5 (short jobs) waiting in the queue.

Let's submit a very short job:

<code>
nicolas.greneche@magi1:~/tests$ sbatch --job-name=sleep6 --partition=COMPUTE-VERYSHORT sleep.slurm
</code>

Let's see the place of this new job in the queue:

<code>
nicolas.greneche@magi1:~/tests$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     79 COMPUTE-S   sleep4 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     81 COMPUTE-V   sleep6 nicolas. PD       0:00      1 (Resources)
     76   COMPUTE   sleep1 nicolas.  R       4:23      1 magi74
     78 COMPUTE-S   sleep3 nicolas.  R       4:08      1 magi96
</code>

The new job went ahead of sleep4 and sleep5, which can access the same subset of nodes but through a partition (COMPUTE-SHORT) with a lower priority.

A few moments later, once sleep1 and sleep3 have finished:

<code>
nicolas.greneche@magi1:~/tests$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     77   COMPUTE   sleep2 nicolas. PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
     80 COMPUTE-S   sleep5 nicolas. PD       0:00      1 (Resources)
     79 COMPUTE-S   sleep4 nicolas.  R       0:11      1 magi96
     81 COMPUTE-V   sleep6 nicolas.  R       0:26      1 magi74
</code>

My job is running on magi74, which is part of all the COMPUTE-based partitions, and it started BEFORE sleep2, a long-lifetime job, even though sleep2 was submitted prior to sleep6.
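The outcome of that last scheduling pass can be sketched as a toy model (an illustration only, not Slurm's actual algorithm): pending jobs are ordered by partition tier before submission order, and each job takes the first free node its partition can use. Under those assumptions, the model reproduces the placement shown above:

```python
# Toy model of the scheduling pass above (an illustration, not Slurm's
# real algorithm). Tier values come from the scontrol output earlier.
TIERS = {"COMPUTE": 3000, "COMPUTE-SHORT": 6000, "COMPUTE-VERYSHORT": 9000}
NODES = {
    "COMPUTE": ["magi74"],
    "COMPUTE-SHORT": ["magi74", "magi96"],
    "COMPUTE-VERYSHORT": ["magi74", "magi96"],
}

# Pending jobs once sleep1 and sleep3 have finished: (job id, name, partition).
pending = [
    (77, "sleep2", "COMPUTE"),
    (79, "sleep4", "COMPUTE-SHORT"),
    (80, "sleep5", "COMPUTE-SHORT"),
    (81, "sleep6", "COMPUTE-VERYSHORT"),
]

free = {"magi74", "magi96"}
placement = {}

# Higher partition tier wins; submission order (job id) breaks ties.
for jid, name, part in sorted(pending, key=lambda j: (-TIERS[j[2]], j[0])):
    node = next((n for n in NODES[part] if n in free), None)
    if node is not None:
        free.discard(node)
        placement[name] = node

print(placement)  # {'sleep6': 'magi74', 'sleep4': 'magi96'}
```

sleep6 is examined first despite being submitted last, takes magi74, and sleep4 falls back to magi96; sleep2 and sleep5 stay pending, exactly as in the squeue output.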