[gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Mark Abraham mark.j.abraham at gmail.com
Mon Jul 29 11:48:00 CEST 2019


Hi,

When you use

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "

then the environment seems to make sure only one GPU is visible. (The log
files report only finding one GPU.) But it's probably the same GPU in each
case, with three remaining idle. I would suggest not using --gres unless
you can specify *which* of the four available GPUs each run can use.
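
One way to check what each job step is actually given (a minimal sketch, not something taken from the logs): Slurm normally exports CUDA_VISIBLE_DEVICES for steps started with --gres=gpu:N, and nvidia-smi -L lists the devices the process can open, with their UUIDs. The helper name gpu_mask is made up for this example.

```shell
#!/bin/bash
# Report which GPUs the current process has been handed.
# gpu_mask is a hypothetical helper name, not part of Slurm or GROMACS.
gpu_mask() {
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

gpu_mask
# nvidia-smi does not honour CUDA_VISIBLE_DEVICES; it lists every device
# the process is able to open, one line per GPU including its UUID.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
fi
```

If this is saved under some name (say check_gpu.sh, also hypothetical) and started once per job step with the same srun line as the simulations, the outputs can be compared. The same single UUID in every step would suggest the runs share one card. If every step lists all four cards, the CUDA_VISIBLE_DEVICES values are what to compare.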

Otherwise, drop --gres from DO_PARALLEL and use the facilities built into GROMACS, e.g.

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 -gpu_id 0
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 -gpu_id 1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 -gpu_id 2
etc.
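
For the full set of four runs, the same idea can be sketched as a loop. This is only an illustration under several assumptions that are not from the thread: -ntmpi is used as the mdrun option for the number of thread-MPI ranks; each run gets 10 OpenMP threads so that four pinned ranges fit on 40 cores without overlapping; the runs live in subdirectories 1 to 4 of the current directory; and job steps started without --gres can still see all four GPUs, which depends on the Slurm version and site configuration. LAUNCHER, NGPUS, NTOMP and DRY_RUN are names chosen for this example.

```shell
#!/bin/bash
# Sketch: one mdrun per GPU, each with its own GPU id and its own CPU range.
NGPUS=4                            # GPUs per node (from the thread)
NTOMP=10                           # assumption: 40 cores / 4 runs
LAUNCHER="srun --exclusive -n 1"   # assumption: DO_PARALLEL without --gres

# Print the command line for run number $1 (0-based).
mdrun_cmd() {
    echo "$LAUNCHER gmx mdrun -s eq6.tpr -deffnm eq6-20" \
         "-ntmpi 1 -ntomp $NTOMP -pin on -pinoffset $(( $1 * $NTOMP )) -gpu_id $1"
}

# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 launches them.
i=0
while [ "$i" -lt "$NGPUS" ]; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
        mdrun_cmd "$i"
    else
        # Run in directory 1..4 in the background, output captured in ./log.
        ( cd "$((i + 1))" && $(mdrun_cmd "$i") > log 2>&1 ) &
    fi
    i=$((i + 1))
done
wait   # keep the batch script alive until all background runs finish
```

Printing the commands first makes it easy to confirm that every run has a distinct -gpu_id and -pinoffset before anything is submitted. Whether -pinoffset counts from the start of the node or from the CPU set Slurm gives to the step is worth checking in the mdrun log.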

Mark

On Mon, 29 Jul 2019 at 11:34, Carlos Navarro <carlos.navarro87 at gmail.com>
wrote:

> Hi Szilárd,
> To answer your questions:
> **are you trying to run multiple simulations concurrently on the same
> node or are you trying to strong-scale?
> I'm trying to run multiple simulations on the same node at the same time.
>
> ** what are you simulating?
> Regular and CompEl simulations
>
> ** can you provide log files of the runs?
> Some log files are available at the following link:
> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> In short, alone.log -> single run on the node (using 1 GPU).
> multi1/2/3/4.log -> 4 independent simulations run at the same time on a
> single node. In all cases, 20 CPUs are used.
> Best regards,
> Carlos
>
> On Thu, 25 Jul 2019 at 10:59, Szilárd Páll (<pall.szilard at gmail.com>)
> wrote:
>
> > Hi,
> >
> > It is not clear to me how you are trying to set up your runs, so
> > please provide some details:
> > - are you trying to run multiple simulations concurrently on the same
> > node or are you trying to strong-scale?
> > - what are you simulating?
> > - can you provide log files of the runs?
> >
> > Cheers,
> >
> > --
> > Szilárd
> >
> > On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> > <carlos.navarro87 at gmail.com> wrote:
> > >
> > > Can no one give me an idea of what might be happening, or how I can
> > > solve it?
> > > Best regards,
> > > Carlos
> > >
> > > ——————
> > > Carlos Navarro Retamal
> > > Bioinformatic Engineering. PhD.
> > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > Simulations
> > > Universidad de Talca
> > > Av. Lircay S/N, Talca, Chile
> > > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> > >
> > > On July 19, 2019 at 2:20:41 PM, Carlos Navarro
> > > (carlos.navarro87 at gmail.com) wrote:
> > >
> > > Dear gmx-users,
> > > I’m currently working on a server where each node has 40 physical cores
> > > (40 threads) and 4 Nvidia V100 cards.
> > > When I launch a single job (1 simulation using a single GPU card) I get a
> > > performance of about 35 ns/day for a system of about 300k atoms. Looking
> > > at the usage of the video card during the simulation, I notice that the
> > > card is about 80% utilized.
> > > The problems arise when I increase the number of jobs running at the same
> > > time. If, for instance, 2 jobs are running at the same time, the
> > > performance drops to ~25 ns/day each and the usage of the video cards
> > > also drops during the simulation to about 30-40% (sometimes falling
> > > below 5%).
> > > Clearly there is a communication problem between the GPU cards and the
> > > CPU during the simulations, but I don’t know how to solve this.
> > > Here is the script I use to run the simulations:
> > >
> > > #!/bin/bash -x
> > > #SBATCH --job-name=testAtTPC1
> > > #SBATCH --ntasks-per-node=4
> > > #SBATCH --cpus-per-task=20
> > > #SBATCH --account=hdd22
> > > #SBATCH --nodes=1
> > > #SBATCH --mem=0
> > > #SBATCH --output=sout.%j
> > > #SBATCH --error=s4err.%j
> > > #SBATCH --time=00:10:00
> > > #SBATCH --partition=develgpus
> > > #SBATCH --gres=gpu:4
> > >
> > > module use /gpfs/software/juwels/otherstages
> > > module load Stages/2018b
> > > module load Intel/2019.0.117-GCC-7.3.0
> > > module load IntelMPI/2019.0.117
> > > module load GROMACS/2018.3
> > >
> > > WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> > > WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> > > WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> > > WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> > >
> > > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > > EXE=" gmx mdrun "
> > >
> > > cd $WORKDIR1
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 &>log &
> > > cd $WORKDIR2
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 &>log &
> > > cd $WORKDIR3
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 &>log &
> > > cd $WORKDIR4
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30 -ntomp 20 &>log &
> > >
> > >
> > > Regarding pinoffset, I first tried using 20 cores for each job, but then
> > > also tried 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2,
> > > pinoffset 8 for job 3 and pinoffset 12 for job 4), but in the end the
> > > problem persists.
> > >
> > > Currently on this machine I’m not able to use more than 1 GPU per job, so
> > > this is my only way to make proper use of the whole node.
> > > If you need more information please just let me know.
> > > Best regards.
> > > Carlos
> > >
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-request at gromacs.org.
>
>
>
> --
>
> ----------
>
> Carlos Navarro Retamal
>
> Bioinformatic Engineering. PhD
>
> Postdoctoral Researcher in Center for Bioinformatics and Molecular
> Simulations
>
> Universidad de Talca
>
> Av. Lircay S/N, Talca, Chile
>
> T: (+56) 712201 798
>
> E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl