[gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Carlos Navarro carlos.navarro87 at gmail.com
Fri Jul 19 14:20:45 CEST 2019


Dear gmx-users,
I’m currently working on a server where each node has 40 physical cores
(40 threads) and 4 Nvidia V100 cards.
When I launch a single job (1 simulation using a single GPU card) I get a
performance of about 35 ns/day for a system of about 300k atoms. Monitoring
the card during the simulation, I see that its utilization is about 80%.
The problems arise when I increase the number of jobs running at the same
time. If, for instance, 2 jobs are running at the same time, the performance
drops to ~25 ns/day each, and the utilization of the GPU cards also drops
during the simulation to about 30-40% (sometimes to less than 5%).
There seems to be a communication problem between the GPU cards and the CPU
during the simulations, but I don’t know how to solve it.
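
Utilization figures like the ones above can be sampled with `nvidia-smi` while the jobs run. The following is only a minimal sketch, assuming `nvidia-smi` is available on the compute node; the helper name `flag_low_util` and the 50% threshold are hypothetical choices for illustration, and the sample input at the end is made up, not measured.

```shell
#!/bin/bash
# Hypothetical helper: read "index, utilization" CSV lines on stdin
# (the format produced by the nvidia-smi query shown below) and print
# the GPUs whose utilization is below a threshold (default 50%).
flag_low_util() {
    local threshold=${1:-50}
    awk -F', *' -v t="$threshold" '$2 + 0 < t { print "GPU " $1 " low: " $2 "%" }'
}

# On a GPU node, sample every 5 seconds while the jobs run:
#   nvidia-smi --query-gpu=index,utilization.gpu \
#              --format=csv,noheader,nounits -l 5 | flag_low_util 50

# Illustrative input only (made-up numbers, not measured data):
printf '0, 80\n1, 35\n' | flag_low_util 50
```

With the illustrative input, only GPU 1 is reported, since 35 is below the threshold of 50.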
Here is the script I use to run the simulations:

#!/bin/bash -x
#SBATCH --job-name=testAtTPC1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --account=hdd22
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --output=sout.%j
#SBATCH --error=s4err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4

module use /gpfs/software/juwels/otherstages
module load Stages/2018b
module load Intel/2019.0.117-GCC-7.3.0
module load IntelMPI/2019.0.117
module load GROMACS/2018.3

WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
EXE=" gmx mdrun "

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -pin on -pinoffset 0 \
    -ntomp 20 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -pin on -pinoffset 10 \
    -ntomp 20 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -pin on -pinoffset 20 \
    -ntomp 20 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -pin on -pinoffset 30 \
    -ntomp 20 &>log &

# Wait for all background job steps to finish before the batch script exits
wait


Regarding pinoffset, I first tried using 20 cores for each job, and then
also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2,
pinoffset 8 for job 3 and pinoffset 12 for job 4), but in the end the problem
persists.
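
The pinning arithmetic can be checked with a few lines of shell. This is a minimal sketch, assuming one thread per core and a pin stride of 1, so that a job started with -pinoffset O and -ntomp T occupies cores O to O+T-1; the helpers `pin_layout` and `ranges_overlap` are hypothetical and not part of GROMACS or SLURM. Nothing is launched, the script only prints core ranges.

```shell
#!/bin/bash
# Hypothetical helpers (not part of GROMACS or SLURM) for checking pinning.
# Assumption: one thread per core and -pinstride 1, so a job with
# -pinoffset O and -ntomp T occupies cores O .. O+T-1.

# Print an equal, non-overlapping layout for <jobs> jobs on <cores> cores.
pin_layout() {
    local cores=$1 jobs=$2
    local ntomp=$(( cores / jobs ))
    local j offset
    for (( j = 0; j < jobs; j++ )); do
        offset=$(( j * ntomp ))
        echo "job$(( j + 1 )) -ntomp $ntomp -pinoffset $offset cores ${offset}-$(( offset + ntomp - 1 ))"
    done
}

# Succeed (exit status 0) if the core ranges of two jobs overlap.
# Arguments: offset1 ntomp1 offset2 ntomp2
ranges_overlap() {
    local o1=$1 t1=$2 o2=$3 t2=$4
    (( o1 < o2 + t2 && o2 < o1 + t1 ))
}

# 40 cores shared by 4 jobs: 10 threads each at offsets 0, 10, 20, 30.
pin_layout 40 4

# Values from the script above: offsets 0 and 10 with 20 threads each.
if ranges_overlap 0 20 10 20; then
    echo "jobs 1 and 2 overlap"
fi
```

Under the stated assumptions, the settings in the script above (offsets 0 and 10 with 20 threads each) map two jobs onto cores 10-19, and the 8-thread variant with offsets 0 and 4 overlaps in the same way. Whether this is related to the slowdown is not established here.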

Currently I’m not able to use more than 1 GPU per job on this machine, so
running several single-GPU jobs at once is my only way to make full use of
the whole node.
If you need more information please just let me know.
Best regards.
Carlos

——————
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl

