[gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Szilárd Páll pall.szilard at gmail.com
Mon Jul 29 16:17:18 CEST 2019


Carlos,

You can accomplish the same thing using the multi-simulation feature of
mdrun and avoid having to manually manage the placement of runs; e.g.,
instead of the above you would just write
mpirun -np N gmx_mpi mdrun -multidir $WORKDIR1 $WORKDIR2 $WORKDIR3 ...
For more details see
http://manual.gromacs.org/documentation/current/user-guide/mdrun-features.html#running-multi-simulations
Note that if the different runs proceed at different speeds, then, just as
with your manual launch, the machine can end up partially utilized once
some of the runs have finished.
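
As a minimal sketch of what that could look like in your batch script
(assuming the MPI-enabled binary on your system is called gmx_mpi, and
giving each of the four members 10 of the 40 cores and one of the four
GPUs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:4

# one MPI rank per simulation; mdrun spreads the node's GPUs over the members
srun -n 4 gmx_mpi mdrun -multidir $WORKDIR1 $WORKDIR2 $WORKDIR3 $WORKDIR4 \
    -s eq6.tpr -deffnm eq6-20 -ntomp 10 -pin on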

Cheers,
--
Szilárd

On Mon, Jul 29, 2019 at 2:46 PM Carlos Navarro
<carlos.navarro87 at gmail.com> wrote:
>
> Hi Mark,
> I tried that before, but unfortunately in that case (removing --gres=gpu:1
> and including the -gpu_id flag in each line) for some reason the jobs
> run one at a time (one after the other), so I can’t properly use the
> whole node.
>
>
> ——————
> Carlos Navarro Retamal
> Bioinformatic Engineering. PhD.
> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> Simulations
> Universidad de Talca
> Av. Lircay S/N, Talca, Chile
> E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
>
> On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abraham at gmail.com)
> wrote:
>
> Hi,
>
> When you use
>
> DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
>
> then the environment seems to make sure only one GPU is visible. (The log
> files report only finding one GPU.) But it's probably the same GPU in each
> case, with three remaining idle. I would suggest not using --gres unless
> you can specify *which* of the four available GPUs each run can use.
>
> Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.
>
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> -ntomp 20 -gpu_id 0
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
> -ntomp 20 -gpu_id 1
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
> -ntomp 20 -gpu_id 2
> etc.
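>
> Purely as a sketch of the same idea (untested here), the four launches
> could also be written as a loop, assuming DO_PARALLEL is redefined
> without --gres and each run gets 10 threads so the pin offsets don't
> overlap on a 40-core node:
>
> DO_PARALLEL=" srun --exclusive -n 1 "
> EXE=" gmx mdrun "
> WORKDIRS=( $WORKDIR1 $WORKDIR2 $WORKDIR3 $WORKDIR4 )
> for i in 0 1 2 3; do
>     cd ${WORKDIRS[$i]}
>     # pin each run to its own block of 10 cores and give it its own GPU
>     $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on \
>         -pinoffset $((i*10)) -ntomp 10 -gpu_id $i &> log &
> done
> wait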
>
> Mark
>
> On Mon, 29 Jul 2019 at 11:34, Carlos Navarro <carlos.navarro87 at gmail.com>
> wrote:
>
> > Hi Szilárd,
> > To answer your questions:
> > **are you trying to run multiple simulations concurrently on the same
> > node or are you trying to strong-scale?
> > I'm trying to run multiple simulations on the same node at the same time.
> >
> > ** what are you simulating?
> > Regular and CompEl simulations
> >
> > ** can you provide log files of the runs?
> > In the following link are some logs files:
> > https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> > In short: alone.log -> a single run on the node (using 1 GPU);
> > multi1/2/3/4.log -> 4 independent simulations run at the same time on a
> > single node. In all cases, 20 CPU cores are used.
> > Best regards,
> > Carlos
> >
> > El jue., 25 jul. 2019 a las 10:59, Szilárd Páll (<pall.szilard at gmail.com>)
> > escribió:
> >
> > > Hi,
> > >
> > > It is not clear to me how you are trying to set up your runs, so
> > > please provide some details:
> > > - are you trying to run multiple simulations concurrently on the same
> > > node or are you trying to strong-scale?
> > > - what are you simulating?
> > > - can you provide log files of the runs?
> > >
> > > Cheers,
> > >
> > > --
> > > Szilárd
> > >
> > > On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> > > <carlos.navarro87 at gmail.com> wrote:
> > > >
> > > > Can no one give me an idea of what might be happening, or how I can
> > > > solve it?
> > > > Best regards,
> > > > Carlos
> > > >
> > > > ——————
> > > > Carlos Navarro Retamal
> > > > Bioinformatic Engineering. PhD.
> > > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > > Simulations
> > > > Universidad de Talca
> > > > Av. Lircay S/N, Talca, Chile
> > > > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> > > >
> > > > On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> > > carlos.navarro87 at gmail.com)
> > > > wrote:
> > > >
> > > > Dear gmx-users,
> > > > I’m currently working on a server where each node has 40 physical
> > > > cores (40 threads) and 4 NVIDIA V100 GPUs.
> > > > When I launch a single job (1 simulation using a single GPU card) I
> > > > get a performance of about ~35 ns/day for a system of about 300k
> > > > atoms. Looking at the usage of the video card during the simulation,
> > > > I notice that the card is being used at about ~80%.
> > > > The problems arise when I increase the number of jobs running at the
> > > > same time. If, for instance, 2 jobs are running at the same time, the
> > > > performance of each drops to ~25 ns/day and the usage of the video
> > > > cards also drops during the simulation to about 30-40% (sometimes to
> > > > less than 5%).
> > > > Clearly there is a communication problem between the GPU cards and
> > > > the CPU during the simulations, but I don’t know how to solve this.
> > > > Here is the script I use to run the simulations:
> > > >
> > > > #!/bin/bash -x
> > > > #SBATCH --job-name=testAtTPC1
> > > > #SBATCH --ntasks-per-node=4
> > > > #SBATCH --cpus-per-task=20
> > > > #SBATCH --account=hdd22
> > > > #SBATCH --nodes=1
> > > > #SBATCH --mem=0
> > > > #SBATCH --output=sout.%j
> > > > #SBATCH --error=s4err.%j
> > > > #SBATCH --time=00:10:00
> > > > #SBATCH --partition=develgpus
> > > > #SBATCH --gres=gpu:4
> > > >
> > > > module use /gpfs/software/juwels/otherstages
> > > > module load Stages/2018b
> > > > module load Intel/2019.0.117-GCC-7.3.0
> > > > module load IntelMPI/2019.0.117
> > > > module load GROMACS/2018.3
> > > >
> > > > WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> > > > WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> > > > WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> > > > WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> > > >
> > > > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > > > EXE=" gmx mdrun "
> > > >
> > > > cd $WORKDIR1
> > > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 &>log &
> > > > cd $WORKDIR2
> > > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 &>log &
> > > > cd $WORKDIR3
> > > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 &>log &
> > > > cd $WORKDIR4
> > > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30 -ntomp 20 &>log &
> > > >
> > > >
> > > > Regarding the pinoffset, I first tried using 20 cores for each job,
> > > > but then also tried with 8 cores (so pinoffset 0 for job 1, pinoffset
> > > > 4 for job 2, pinoffset 8 for job 3 and pinoffset 12 for job 4), but in
> > > > the end the problem persists.
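> > > > (Worked out explicitly, assuming each job should get its own block of
> > > > cores: with 40 cores split evenly over 4 jobs that is 10 cores each,
> > > > i.e. -ntomp 10 with -pinoffset 0, 10, 20 and 30; with 8 cores per job
> > > > the non-overlapping offsets would be 0, 8, 16 and 24.)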
> > > >
> > > > Currently on this machine I’m not able to use more than 1 GPU per
> > > > job, so this is my only option for making proper use of the whole
> > > > node.
> > > > If you need more information please just let me know.
> > > > Best regards.
> > > > Carlos
> > > >
> > > > ——————
> > > > Carlos Navarro Retamal
> > > > Bioinformatic Engineering. PhD.
> > > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > > Simulations
> > > > Universidad de Talca
> > > > Av. Lircay S/N, Talca, Chile
> > > > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> >
> >
> >
> > --
> >
> > ----------
> >
> > Carlos Navarro Retamal
> >
> > Bioinformatic Engineering. PhD
> >
> > Postdoctoral Researcher in Center for Bioinformatics and Molecular
> > Simulations
> >
> > Universidad de Talca
> >
> > Av. Lircay S/N, Talca, Chile
> >
> > T: (+56) 712201 798
> >
> > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl