Introduction to SLURM
SLURM (Simple Linux Utility for Resource Management) is a popular job scheduling and workload management system used in many high-performance computing environments. SLURM allows users to submit and manage jobs on a cluster of computers. It provides a framework for allocating resources (such as CPU cores, memory, and GPUs) and scheduling jobs efficiently.
- Logging in: To use SLURM, you need access to a cluster where SLURM is installed. Log in to the cluster using SSH or any other method provided by your system administrator (see the example after this list).
- Job Script: Create a job script that describes the resources required for your job and the commands to be executed. A typical SLURM job script is a shell script with special directives recognized by SLURM.
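For example, a typical SSH login looks like the following (the hostname is a placeholder; use the address provided by your administrator):
ssh your_username@cluster-login-node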
Getting Started with Slurm
To tell Slurm what resources you need, you will have to create an sbatch script (also called a Slurm script). In this tutorial, we will write our sbatch scripts in bash, but you can use any language whose interpreter treats lines starting with a pound sign (#) as comments, since Slurm's directives begin with #SBATCH. Your sbatch scripts will generally follow this format:
#!/bin/bash
# Declaring Slurm Configuration Options
# Loading Software/Libraries
# Running Code
Let’s start by going over the different configuration options for Slurm in the following example.
TP_1. SLURM Basics
Create an sbatch script ($ touch my-job.slurm). Open it in the vim editor ($ vim my-job.slurm) and insert the following code:
#!/bin/bash
#SBATCH --job-name=myjob # Name for your job
#SBATCH --comment="Run My Job" # Comment for your job
#SBATCH --output=%x_%j.out # Output file
#SBATCH --error=%x_%j.err # Error file
#SBATCH --time=0-00:05:00 # Time limit
#SBATCH --nodes=1 # How many nodes to run on
#SBATCH --ntasks=2 # Total number of tasks for the job
#SBATCH --cpus-per-task=2 # Number of CPUs per task
#SBATCH --mem-per-cpu=10g # Memory per CPU
#SBATCH --qos=short # Priority/quality of service (QOS)
# Command to run
hostname # Run the command hostname
So, in this example, we have requested a job with the following resources:
- Max Run Time: 5 Minutes
- Number of Nodes: 1
- Total Number of Tasks: 2
- Number of CPUs Per Task: 2
- Memory Per CPU: 10GB
Finally, we run the bash command hostname. You can run whatever kind of code you want here: C, C++, bash, Python, R, Ruby, etc.
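For instance, to run a Python script instead of hostname, the command section of the script could look like the sketch below (the module name and script name are assumptions and depend on your cluster):
# Command to run
module load python        # load a Python module if your cluster provides one
python3 my_script.py      # run your own script instead of hostname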
- Submitting a Job: Use the sbatch command to submit your job script to SLURM:
sbatch my-job.slurm
This will submit your job to the SLURM scheduler for execution.
You will then be given a message with the ID for that job:
Submitted batch job 1411747 on cluster nautilus
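Note that options passed on the sbatch command line override the matching #SBATCH directives in the script; for example (the values are illustrative):
sbatch --job-name=myjob-v2 --time=0-00:10:00 my-job.slurm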
- After we submit a job, Slurm will create the output and error files. You can see them by running:
ls
You'll see the following files:
myjob_1411747.err myjob_1411747.out my-job.slurm
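To inspect what the job printed, display the output (and, if needed, the error) file with cat, replacing the job ID with your own:
cat myjob_1411747.out
cat myjob_1411747.err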
- Checking Job Status: You can check the status of your jobs using the squeue command: squeue -u username
squeue -u jmir@ec-nantes.fr
or you can use
squeue -u $USER
The squeue command gives us the following information:
JOBID: The unique ID for your job.
PARTITION: The partition your job is running on (or scheduled to run on).
NAME: The name of your job.
USER: The username of whoever submitted the job.
ST: The status of the job. The typical status codes you may see are:
- CD (Completed): Job completed successfully
- CG (Completing): Job is finishing, Slurm is cleaning up
- PD (Pending): Job is scheduled, but the requested resources aren’t available yet
- R (Running): Job is actively running
TIME: How long your job has been running.
NODES: How many nodes your job is using.
NODELIST(REASON): Which nodes your job is running on (or scheduled to run on). If your job is not running yet, you will also see one of the following reason codes:
- Priority: Jobs with a higher priority are ahead of yours in the queue. When Slurm computes priorities, it takes your recent usage into account: if you often submit many jobs, you will be assigned a lower priority than someone who has never submitted a job or submits jobs very infrequently. Don't worry, your job will run eventually.
- Resources: Slurm is waiting for the requested resources to become available before starting your job.
- Dependency: If you are using dependent jobs, the parent job may show this reason if it’s waiting for a dependent job to complete.
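squeue also lets you choose which columns it prints via the --format option; a sketch (the field codes follow the squeue man page: %i job ID, %P partition, %j name, %u user, %t state, %M elapsed time, %D nodes, %R reason/node list):
squeue -u $USER --format="%.10i %.12P %.10j %.8u %.4t %.10M %.5D %R"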
To view detailed information about a specific job, including its resource usage, use the scontrol command: scontrol show job job_id -M nautilus
scontrol show job 1411747 -M nautilus
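For jobs that have already finished, scontrol may no longer have the information; if job accounting is enabled on the cluster, the sacct command can report usage after the fact (the field names follow the sacct man page):
sacct -M nautilus -j 1411747 --format=JobID,JobName,Elapsed,State,MaxRSS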
- Managing Jobs: You can cancel a running job using the scancel command followed by the job ID: scancel job_id
scancel 1411747
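scancel also accepts filters, which is handy when several jobs are queued; for example:
scancel -u $USER # cancel all of your jobs
scancel --name=myjob # cancel jobs by job name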
TP_2. Parallel Programming
IntelMPI
Let's try an embarrassingly parallel program. Create a file using the touch command and name it hello-mpi.cpp
Open the file using vim editor and insert the following code:
#include <iostream>
#include "mpi.h"
using namespace std;
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                  // Initialize the MPI environment
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // Rank (ID) of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // Total number of processes
    int namesize;
    char name[MPI_MAX_PROCESSOR_NAME];       // Buffer for the node name
    MPI_Get_processor_name(name, &namesize);
    cout << "hello from " << rank << " out of " << size << " running on " << name << " " << endl;
    MPI_Finalize();                          // Shut down the MPI environment
    return 0;
}
Load the IntelMPI module using:
module load intel/compiler intel/mpi
Now, compile the code using the following command:
mpicxx -cxx=icpx -O3 -o hello-intelmpi hello-mpi.cpp
Once compiled, create a slurm script and name it job-intel.slurm. Here's the script:
#!/bin/bash
#SBATCH --job-name=HelloWorldMpi
#SBATCH --partition=standard
module purge
module load intel/compiler intel/mpi
export I_MPI_PMI_LIBRARY=/lib64/libpmi2.so
export I_MPI_COLL_EXTERNAL=0
export I_MPI_ADJUST_BCAST=0
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=psm3
srun --mpi=pmi2 ./hello-intelmpi
Finally, submit the job using the following command:
sbatch -M nautilus -p standard -q short job-intel.slurm
Now, monitor your job and check the output files.
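Note that the script above does not request any resources explicitly, so the options given on the sbatch command line and the cluster defaults apply. To start several MPI ranks, you would typically add resource directives like those from TP_1 to the script, for example (the values are illustrative):
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks=4 # total number of MPI ranks started by srun
#SBATCH --time=0-00:05:00 # time limit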
OpenMPI
Let's try compiling with the GNU compiler and OpenMPI, but first you need to purge the loaded modules.
module purge
Now, load openMPI module using:
module load gcc openmpi/ucx/4.1.5_gcc_8.5.0_ucx_1.14.1_rdma_46.0
and compile hello-mpi.cpp program using:
mpicxx -O3 -o hello-openmpi hello-mpi.cpp
Once compiled, create a slurm script and name it job-mpi.slurm. Here's the script:
#!/bin/bash
#SBATCH --job-name=HelloWorldMpi
#SBATCH --partition=standard
module purge
module load gcc openmpi/ucx/4.1.5_gcc_8.5.0_ucx_1.14.1_rdma_46.0
export UCX_WARN_UNUSED_ENV_VARS=n
export OMPI_MCA_btl=^openib
export UCX_NET_DEVICES=mlx5_2:1
srun ./hello-openmpi
Submit the job using the following command:
sbatch -M nautilus -p standard -q short job-mpi.slurm
Now, monitor your job and check the output files.
TP_3. Conda and Micromamba
Conda is a software environment manager that is quite popular, especially in the Python community, but it has many issues in the context of HPC use.
Note: GLiCID administrators advise against its use, especially if the software you need is already available in Guix. However, some packages are not (yet) available in Guix. On GLiCID, we have decided to use micromamba, a lightweight alternative to conda. Here are the commands to run to install it locally on GLiCID:
# Download micromamba
mkdir -p $HOME/.local/bin
wget -P $HOME/.local/bin https://s3.glicid.fr/pkgs/micromamba
chmod u+x $HOME/.local/bin/micromamba
# Initialize micromamba
$HOME/.local/bin/micromamba -r /micromamba/$USER/ shell init --shell=bash --prefix=/micromamba/$USER/
# [OPTIONAL] Add an alias `conda`
echo -e '\n\n#Alias conda with micromamba\nalias conda=micromamba' >> ~/.bashrc
# Reload the .bashrc
source ~/.bashrc
Note: It is possible that the .bashrc file is not always sourced at login on GLiCID (investigations are ongoing). If that happens, remember to run source ~/.bashrc after each login so that micromamba is loaded properly.
Note: To set the proxy on Nautilus (not on devel), use:
export http_proxy=http://proxy-upgrade.univ-nantes.fr:3128/
export https_proxy=http://proxy-upgrade.univ-nantes.fr:3128/
To verify the installation:
micromamba --version # or: `conda --version`
# -> 1.4.0
# Create an environment with micromamba/conda
micromamba create -n my_env
# Activate the environment
micromamba activate my_env
# -> Your prompt should now be prefixed with: (my_env)
# Install a package from the conda-forge channel
micromamba install -c conda-forge numpy
# Test the package
python -V
python -c "import numpy as np; print(np.__version__)"
# -> 1.24.2
# Deactivate the environment
micromamba deactivate
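You can also use the environment inside a batch job. Below is a minimal sketch in the style of TP_1; the job name and limits are illustrative, and the shell-hook line follows the micromamba documentation (adjust the paths if you installed micromamba elsewhere):
#!/bin/bash
#SBATCH --job-name=numpy-test
#SBATCH --time=0-00:05:00
#SBATCH --ntasks=1
# Make micromamba usable in this non-interactive shell
export MAMBA_ROOT_PREFIX=/micromamba/$USER/
eval "$($HOME/.local/bin/micromamba shell hook --shell bash)"
micromamba activate my_env # environment created above
python -c "import numpy as np; print(np.__version__)"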
TP_4. FORTRAN
In this part of the tutorial, we will write our first Fortran program: the ubiquitous “Hello, World!” example.
However, before we can write our program, we need to ensure that we have a Fortran compiler set up.
Fortran is a compiled language, which means that, once written, the source code must be passed through a compiler to produce a machine executable that can be run.
Load the module using the following command:
module load gcc/13.1.0
Once you have loaded the module, open a new file called hello.f90 in the vim editor and enter the following:
program hello
! This is a comment line; it is ignored by the compiler
print *, 'Hello, World!'
end program hello
Having saved your program to hello.f90, compile at the command line with:
gfortran hello.f90 -o hello
.f90 is the standard file extension for modern Fortran source files; the 90 refers to Fortran 90, the first modern Fortran standard.
To run your compiled program:
./hello
Congratulations, you’ve written, compiled and run your first Fortran program!
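If you would like to run the compiled program through Slurm rather than on the login node, you can reuse the pattern from TP_1; a minimal sketch (the job name and time limit are illustrative):
#!/bin/bash
#SBATCH --job-name=hello-fortran
#SBATCH --time=0-00:05:00
#SBATCH --ntasks=1
module load gcc/13.1.0 # same module used to compile
./hello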