====== Booking resources on Magi ======

Physical nodes on Magi are gathered in partitions. Nodes belonging to the same partition are strictly homogeneous regarding their hardware.

You must specify two resources for your job: processor and memory. Memory is the amount of RAM available for your job. The processor resource is trickier. For SLURM, the processor resource hierarchy is, from the lowest to the highest granularity:

  * Socket: the physical processor plugged into the motherboard
  * Core: a set of one or more CPUs with their own dedicated cache memory (a very small memory, faster than RAM)
  * CPU: an execution thread of a core

You can check partitions with scontrol:

<code>
nicolas.greneche@magi3:~$ scontrol show partition
PartitionName=COMPUTE
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[46-53]
   PriorityJobFactor=3000 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=320 TotalNodes=8 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=1600 MaxMemPerNode=UNLIMITED
</code>

In this output we can see the “DefMemPerCPU” attribute with the value “1600”. It means that every CPU you book ships with 1.6 GB of RAM by default (you can change it).

Now we can dig deeper into the node specification. The “Nodes” attribute has the value “magi[46-53]”. Remember that nodes in a partition are homogeneous, so we can display the properties of magi46:

<code>
nicolas.greneche@magi3:~$ scontrol show node magi46
NodeName=magi46 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=magi46 NodeHostName=magi46 Version=21.08.7
   OS=Linux 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18)
   RealMemory=64000 AllocMem=0 FreeMem=63289 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=COMPUTE
   BootTime=2022-04-25T15:15:55 SlurmdStartTime=2022-04-25T15:17:09
   LastBusyTime=2022-04-26T13:20:31
   CfgTRES=cpu=40,mem=62.50G,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</code>

This node has 1 motherboard (Boards=1), 2 physical processors (Sockets=2), 10 cores per processor (CoresPerSocket=10) and 2 execution threads per core (ThreadsPerCore=2), which gives 2 × 10 × 2 = 40 CPUs in the SLURM sense (CPUTot=40). Moreover, the memory available for jobs is 64000 MB (RealMemory=64000).

Some examples:

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

srun stress -c 1
</code>

<code>
nicolas.greneche@magi3:~/test-bullseye/stress$ scontrol show job 234
[...]
   Partition=COMPUTE AllocNode:Sid=slurmctld:585884
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=magi46
   BatchHost=magi46
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=12.50G,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=1600M MinTmpDiskNode=0
[...]
</code>

The job will run on 8 execution threads (NumCPUs=8) and cannot use more than 12 800 MB of memory (8 × 1600 MB, i.e. mem=12.50G).
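As a quick sanity check, you can also print what SLURM actually granted from inside the job itself, using the environment variables it exports to the batch script. The sketch below is not one of the Magi examples above (the job name is just illustrative); SLURM_CPUS_PER_TASK, SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE are standard SLURM output variables, but which of the two memory variables is set depends on whether memory was allocated per CPU (the DefMemPerCPU case) or per node (the --mem case).

<code bash>
#!/bin/bash
#SBATCH --job-name=check_alloc
#SBATCH --output=check_alloc.out.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# CPUs granted to the task, and the memory limit SLURM computed (in MB).
# Only one of the two memory variables is normally defined, depending on
# whether memory was requested per CPU (default on Magi) or per node (--mem).
echo "CPUs per task  : ${SLURM_CPUS_PER_TASK}"
echo "Memory per CPU : ${SLURM_MEM_PER_CPU:-not set} MB"
echo "Memory per node: ${SLURM_MEM_PER_NODE:-not set} MB"
</code>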
We now try to allocate more memory than requested:

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

srun stress -m 8 --vm-bytes 2048M
</code>

The stress command spawns 8 worker processes (-m 8). Each process tries to allocate 2048 MB, which exceeds the default limit of 1600 MB per CPU, so the job fails. Here is the log:

<code>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.err.236
stress: FAIL: [6859] (415) <-- worker 6867 got signal 9
stress: WARN: [6859] (417) now reaping child worker processes
stress: FAIL: [6859] (451) failed run completed in 2s
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=236.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: magi46: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=236.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
</code>

The job was killed by the OOM killer. You can change the amount of memory requested at job submission with the --mem option. Here we set it to the whole available memory of the node (RealMemory=64000):

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64000

srun stress -m 8 --vm-bytes 2048M
</code>

This time, the job succeeds.

The last thing to know is that you can avoid hyperthreading with the --hint=nomultithread option in the submission script:

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread

srun stress -c 1
</code>

<code>
nicolas.greneche@magi3:~/test-bullseye/stress$ scontrol show job 239
[...]
   Partition=COMPUTE AllocNode:Sid=slurmctld:585884
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=magi46
   BatchHost=magi46
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:1
   TRES=cpu=16,mem=12.50G,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=1600M MinTmpDiskNode=0
[...]
</code>

We want to use only 1 execution thread per core, so the 8 requested CPUs must land on 8 distinct cores. Since each core has 2 execution threads, SLURM booked 16 CPUs on the node (NumCPUs=16).
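If you just want the socket/core/thread layout and memory of each partition at a glance, rather than running scontrol on every node, a sinfo one-liner such as the following should work (the choice of columns here is ours, not a Magi default):

<code bash>
# One line per partition: name, node count, CPUs per node,
# sockets:cores:threads layout and memory (in MB) per node.
sinfo -o "%P %D %c %z %m"
</code>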
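Finally, to choose a sensible --mem value for a future run, it can help to look at what a finished job actually consumed. Assuming job accounting is enabled on the cluster, sacct can report it; MaxRSS is the peak resident memory of each job step (job ID 236 below is just the example from this page):

<code bash>
# Peak memory, allocated CPUs and requested memory of a finished job.
sacct -j 236 --format=JobID,AllocCPUS,ReqMem,MaxRSS,State,Elapsed
</code>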