Using SLURM

One of the challenges to learning neuroimaging is the large amount of coding necessary to do it right. Many (like myself) begin with absolutely no coding experience and find themselves lost most of the time. Once I learned how to process a single scan I was pretty proud. Then I was told that I had to process scans in parallel on the supercomputer. I was less proud when I spent two weeks trying to figure that out. Here is a painfully detailed description of the first part of every batch script and some hints for using the supercomputer in general. This is pretty specific to my experiences in the brain imaging and behavior lab (BIABL) at BYU, but may have applicability elsewhere too.

Every batch script needs to start with this line. Sometimes I’ll copy and paste a large batch script into the terminal and miss this part. When the job fails and I have no logfiles to help figure out the errors, it’s usually because I missed this line.

#!/bin/bash 

This is the estimated run time for your script. I’ve read some material on the supercomputer website about how to estimate this, but it’s all very complicated and unnecessary. Typically you already have a ballpark guess of how long the job will take, and you can fine-tune it to within an hour or two of its actual run time while testing your script.

#SBATCH --time=50:00:00

This is the number of processor cores (tasks) your job requests. Unless you’re doing some pretty involved stuff, you probably don’t need to change this.

#SBATCH --ntasks=1

This is how many computers (nodes) will be used to run the script.

#SBATCH --nodes=1

This is the memory requested per processor core (with a single task, effectively the total memory your job can use), and it can actually become pretty important for some of the more involved scripts. Fortunately, you can typically figure it out by looking at other similar scripts or by trial and error. (FYI, 32768 MB, i.e. 32 GB, is a huge amount.)

#SBATCH --mem-per-cpu=32768M  
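A handy way to fine-tune both the time and the memory estimates is to check what a finished test job actually used. Here is a minimal sketch using sacct (the job ID is just a placeholder, and the fields available can vary by cluster):

# Elapsed time and peak memory for a finished job (the job ID here is hypothetical)
sacct -j 12345678 --format=JobID,JobName,Elapsed,Timelimit,MaxRSS,ReqMem,State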

The -o argument names the file where the verbose output generated by your script will be written. In this case I want a file named “output_jlf_0.txt” to contain all of the output. Similarly, the -e argument names a file for any errors thrown during the script. The -J argument lets you name the job, which is useful for tracking progress, especially when running a large number of jobs.

#SBATCH -o /fslhome/username/logfiles/dataset/output_jlf_0.txt
#SBATCH -e /fslhome/username/logfiles/dataset/error_jlf_0.txt
#SBATCH -J "jlf_0"

This is your email address for updates on the job’s progress.

#SBATCH --mail-user=GusTTShowbiz@email.com  

These lines specify which events trigger an email: when the job begins, ends, or fails.

#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

This part is a little unusual. It sets compatibility variables for PBS and can be deleted if you don’t need it. I’ve never used it, but it doesn’t hurt anything, so I leave it.

export PBS_NODEFILE=`/fslapps/fslutils/generate_pbs_nodefile`
export PBS_JOBID=$SLURM_JOB_ID
export PBS_O_WORKDIR="$SLURM_SUBMIT_DIR"
export PBS_QUEUE=batch

This sets the maximum number of threads for programs that use OpenMP. I’ve never looked into altering it, but there may come a time when you need to, because a lot of programs use OpenMP and multithreading can be very useful.

export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

That’s it for all of the necessary components for requesting resources. It’s also useful to know that, immediately following the resource request, you should set any environment variables needed to run your programs. Even if you’ve set these in your profile, you need to set them here because the script runs on a separate node/computer from your login session.

export ANTSPATH=/fslhome/username/bin/antsbin/bin/ 
PATH=${ANTSPATH}:${PATH}

Now putting it all together…

#!/bin/bash 

#SBATCH --time=50:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=32768M  
#SBATCH -o /fslhome/username/logfiles/dataset/output_jlf_0.txt
#SBATCH -e /fslhome/username/logfiles/dataset/error_jlf_0.txt
#SBATCH -J "jlf_0"
#SBATCH --mail-user=username@email.com  
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

export PBS_NODEFILE=`/fslapps/fslutils/generate_pbs_nodefile`
export PBS_JOBID=$SLURM_JOB_ID
export PBS_O_WORKDIR="$SLURM_SUBMIT_DIR"
export PBS_QUEUE=batch
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

export ANTSPATH=/fslhome/username/bin/antsbin/bin/ 
PATH=${ANTSPATH}:${PATH}
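From here, the rest of the script is just the commands you actually want to run on the node. As a hypothetical sketch only (the subject, the paths, and the ANTs registration call are placeholders for whatever your own pipeline does):

# Hypothetical script body: register one subject's T1 to a template with ANTs
antsRegistrationSyN.sh -d 3 \
    -f /fslhome/username/templates/template.nii.gz \
    -m /fslhome/username/dataset/subj001/T1.nii.gz \
    -o /fslhome/username/dataset/subj001/T1_to_template_

Once the script is saved (say as jlf_0.sh, another assumed name), you submit it from a login node:

sbatch jlf_0.sh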

Some tips and tricks:

  1. When I was working in the brain imaging and behavior lab (BIABL) at BYU, the supercomputer used a system for prioritizing batch jobs. The most obvious factor is how many resources your script asks for: there are fewer resources available for large jobs, so a script that reserves 30 minutes and 4 MB will start almost immediately, while the example above will take a lot longer to find resources. The other thing that affects your job’s priority is how accurately your user ID has estimated its required resources in the past. I think this was done to give higher priority to people who actually know what they’re doing and don’t waste resources.

  2. Your estimated run time DOES matter. If you underestimate your run time the script will be terminated prematurely. If you consistently grossly overestimate your run time then you will have a lower priority in obtaining reserved resources. This can be really obnoxious if you’re running jobs throughout the school year when the supercomputer is really busy.

  3. Your memory request DOES matter in exactly the same way as the run time. Don’t underestimate or grossly overestimate. In other words, approach these estimates like a dinner date: order too little food and your date goes hungry and thinks you’re stingy (the date might also end prematurely, like a batch job), but order way too much and you’ll both be too full for a romantic move at the doorstep. Just order a tiny bit more than you expect to need.

  4. Don’t batch process every subject before testing on a single subject. This helps you find the balance described in the previous two tips and ensures that you’ve worked out all the bugs in your script and get the results you expect.

  5. Use some basic SLURM commands to monitor and manage your jobs. For example, squeue -u username lets you see the status of all of your current jobs, and scancel -u username cancels all of your jobs. Both have additional flags that allow more detailed interactions, but I rarely use them; a few are sketched below.
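Here is a small sketch of the kinds of variations I mean (the job ID and job name are placeholders; check man squeue and man scancel for the full set of flags on your cluster):

squeue -u username              # all of your queued and running jobs
squeue -j 12345678              # status of one specific job by ID
scancel 12345678                # cancel a single job by ID
scancel -u username -n "jlf_0"  # cancel only your jobs with a given name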
