====== Booking resources on Magi ======

Physical nodes on Magi are gathered in partitions. Nodes belonging to the same partition are strictly homogeneous regarding their hardware.

You must specify two resources for your job: processor and memory. Memory is the amount of RAM available for your job. The processor resource is trickier. For SLURM, the processor resource hierarchy is, from the lowest to the highest granularity:

  * Socket: the physical processor plugged into the motherboard
  * Core: a set of one or more CPUs with their own dedicated cache memory (a very small memory, faster than RAM)
  * CPU: an execution thread of a core

You can check partitions with scontrol:

<code>
nicolas.greneche@magi3:~$ scontrol show partition
PartitionName=COMPUTE
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=magi[46-53]
   PriorityJobFactor=3000 PriorityTier=3000 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=320 TotalNodes=8 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=1600 MaxMemPerNode=UNLIMITED
</code>

In this output we can see the “DefMemPerCPU” attribute with the value “1600”. It means that every CPU you book ships with 1.6 GB of RAM by default (you can change it).

Now we can dig deeper into the node specification. The “Nodes” attribute has the value “magi[46-53]”. Remember that nodes in a partition are homogeneous, so we can display the properties of magi46:

<code>
nicolas.greneche@magi3:~$ scontrol show node magi46
NodeName=magi46 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUTot=40 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=magi46 NodeHostName=magi46 Version=21.08.7
   OS=Linux 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18)
   RealMemory=64000 AllocMem=0 FreeMem=63289 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=COMPUTE
   BootTime=2022-04-25T15:15:55 SlurmdStartTime=2022-04-25T15:17:09
   LastBusyTime=2022-04-26T13:20:31
   CfgTRES=cpu=40,mem=62.50G,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</code>

This node has 1 motherboard (Boards=1), 2 physical processors (Sockets=2), 10 cores per processor (CoresPerSocket=10) and 2 execution threads per core (ThreadsPerCore=2), which gives 2 × 10 × 2 = 40 CPUs in the SLURM sense (CPUTot=40). Moreover, the memory available for jobs is 64000 MB (RealMemory=64000).

Some examples:

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

srun stress -c 1
</code>

<code>
nicolas.greneche@magi3:~/test-bullseye/stress$ scontrol show job 234
[...]
   Partition=COMPUTE AllocNode:Sid=slurmctld:585884
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=magi46
   BatchHost=magi46
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=12.50G,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=1600M MinTmpDiskNode=0
[...]
</code>

The job will run on 8 execution threads (NumCPUs=8) and cannot use more than 12 800 MB of memory (8 × 1600 MB, i.e. mem=12.50G).
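As a quick sanity check, you can also print what SLURM actually granted from inside the job itself, using the environment variables it exports to the batch script. The sketch below is not one of the Magi examples above (the job name is just illustrative); SLURM_CPUS_PER_TASK, SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE are standard SLURM output variables, but which of the two memory variables is set depends on whether memory was allocated per CPU (the DefMemPerCPU case) or per node (the --mem case).

<code bash>
#!/bin/bash
#SBATCH --job-name=check_alloc
#SBATCH --output=check_alloc.out.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# CPUs granted to the task, and the memory limit SLURM computed (in MB).
# Only one of the two memory variables is normally defined, depending on
# whether memory was requested per CPU (default on Magi) or per node (--mem).
echo "CPUs per task  : ${SLURM_CPUS_PER_TASK}"
echo "Memory per CPU : ${SLURM_MEM_PER_CPU:-not set} MB"
echo "Memory per node: ${SLURM_MEM_PER_NODE:-not set} MB"
</code>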
We now try to allocate more memory than requested:

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

srun stress -m 8 --vm-bytes 2048M
</code>

The stress command spawns 8 worker processes (-m 8). Each process tries to allocate 2048 MB, which exceeds the default limit of 1600 MB per CPU, so the job fails. Here is the log:

<code>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.err.236
stress: FAIL: [6859] (415) <-- worker 6867 got signal 9
stress: WARN: [6859] (417) now reaping child worker processes
stress: FAIL: [6859] (451) failed run completed in 2s
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=236.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: magi46: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=236.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
</code>

The job was killed by the OOM killer. You can change the amount of memory requested at job submission with the --mem option. Here we set it to the whole available memory of the node (RealMemory=64000):

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64000

srun stress -m 8 --vm-bytes 2048M
</code>

This time, the job succeeds.

The last thing to know is that you can avoid hyperthreading with the --hint=nomultithread option in the submission script:

<code bash>
nicolas.greneche@magi3:~/test-bullseye/stress$ cat stress.sh
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=stress.out.%j
#SBATCH --error=stress.err.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread

srun stress -c 1
</code>

<code>
nicolas.greneche@magi3:~/test-bullseye/stress$ scontrol show job 239
[...]
   Partition=COMPUTE AllocNode:Sid=slurmctld:585884
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=magi46
   BatchHost=magi46
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:1
   TRES=cpu=16,mem=12.50G,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=1600M MinTmpDiskNode=0
[...]
</code>

We want to use only 1 execution thread per core, so the 8 requested CPUs must land on 8 distinct cores. Since each core has 2 execution threads, SLURM booked 16 CPUs on the node (NumCPUs=16).
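If you just want the socket/core/thread layout and memory of each partition at a glance, rather than running scontrol on every node, a sinfo one-liner such as the following should work (the choice of columns here is ours, not a Magi default):

<code bash>
# One line per partition: name, node count, CPUs per node,
# sockets:cores:threads layout and memory (in MB) per node.
sinfo -o "%P %D %c %z %m"
</code>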
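Finally, to choose a sensible --mem value for a future run, it can help to look at what a finished job actually consumed. Assuming job accounting is enabled on the cluster, sacct can report it; MaxRSS is the peak resident memory of each job step (job ID 236 below is just the example from this page):

<code bash>
# Peak memory, allocated CPUs and requested memory of a finished job.
sacct -j 236 --format=JobID,AllocCPUS,ReqMem,MaxRSS,State,Elapsed
</code>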