[gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Mark Abraham mark.j.abraham at gmail.com
Mon Jul 29 18:11:12 CEST 2019


Hi,

Yes, and the -nmpi flag I copied from Carlos's post is ineffective; use -ntmpi instead.
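
For example, something along these lines (a rough, untested sketch; with
--gres dropped from DO_PARALLEL as suggested below, and assuming 10 OpenMP
threads per run so the four pin offsets don't overlap on the 40 cores; adjust
-ntomp and -pinoffset to whatever split you prefer):

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 10 -pin on -pinoffset 0 -gpu_id 0 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 10 -pin on -pinoffset 10 -gpu_id 1 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 10 -pin on -pinoffset 20 -gpu_id 2 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 10 -pin on -pinoffset 30 -gpu_id 3 &>log &
wait

The trailing wait keeps the batch script (and its allocation) alive until all
four backgrounded runs have finished.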

Mark


On Mon., 29 Jul. 2019, 15:15 Justin Lemkul, <jalemkul at vt.edu> wrote:

>
>
> On 7/29/19 8:46 AM, Carlos Navarro wrote:
> > Hi Mark,
> > I tried that before, but unfortunately in that case (removing --gres=gpu:1
> > and including the -gpu_id flag in each line), for some reason the jobs run
> > one at a time (one after the other), so I can’t properly use the whole
> > node.
> >
>
> You need to run all but the last mdrun process in the background (&).
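>
> For example, the pattern would look roughly like this ([mdrun options]
> stands in for each run's usual flags):
>
> cd $WORKDIR1 && $DO_PARALLEL $EXE [mdrun options] &>log &   # backgrounded
> cd $WORKDIR2 && $DO_PARALLEL $EXE [mdrun options] &>log &   # backgrounded
> cd $WORKDIR3 && $DO_PARALLEL $EXE [mdrun options] &>log &   # backgrounded
> cd $WORKDIR4 && $DO_PARALLEL $EXE [mdrun options] &>log     # last one in the foreground
> wait   # in case any backgrounded run outlives the last one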
>
> -Justin
>
> > ——————
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> >
> > On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abraham at gmail.com)
> > wrote:
> >
> > Hi,
> >
> > When you use
> >
> > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> >
> > then the environment seems to make sure only one GPU is visible. (The log
> > files report only finding one GPU.) But it's probably the same GPU in each
> > case, with three remaining idle. I would suggest not using --gres unless
> > you can specify *which* of the four available GPUs each run can use.
> >
> > Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.
> >
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 -gpu_id 0
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 -gpu_id 1
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 -gpu_id 2
> > etc.
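> >
> > (and, to have them actually run concurrently from the one batch script,
> > each line backgrounded and a final wait added, roughly:
> >
> > $DO_PARALLEL $EXE ... -gpu_id 0 &>log &
> > $DO_PARALLEL $EXE ... -gpu_id 1 &>log &
> > ...
> > wait
> >
> > otherwise the shell just runs them one after the other.)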
> >
> > Mark
> >
> > On Mon, 29 Jul 2019 at 11:34, Carlos Navarro <carlos.navarro87 at gmail.com>
> > wrote:
> >
> >> Hi Szilárd,
> >> To answer your questions:
> >> **are you trying to run multiple simulations concurrently on the same
> >> node or are you trying to strong-scale?
> >> I'm trying to run multiple simulations on the same node at the same
> >> time.
> >>
> >> ** what are you simulating?
> >> Regular and CompEl simulations
> >>
> >> ** can you provide log files of the runs?
> >> In the following link are some logs files:
> >> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> >> In short, alone.log -> single run on the node (using 1 GPU).
> >> multi1/2/3/4.log -> 4 independent simulations run at the same time on a
> >> single node. In all cases, 20 CPUs are used.
> >> Best regards,
> >> Carlos
> >>
> >> On Thu, 25 Jul 2019 at 10:59, Szilárd Páll (<pall.szilard at gmail.com>)
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> It is not clear to me how you are trying to set up your runs, so
> >>> please provide some details:
> >>> - are you trying to run multiple simulations concurrently on the same
> >>> node or are you trying to strong-scale?
> >>> - what are you simulating?
> >>> - can you provide log files of the runs?
> >>>
> >>> Cheers,
> >>>
> >>> --
> >>> Szilárd
> >>>
> >>> On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> >>> <carlos.navarro87 at gmail.com> wrote:
> >>>> Can no one give me an idea of what might be happening, or how I can
> >>>> solve it?
> >>>> Best regards,
> >>>> Carlos
> >>>>
> >>>> ——————
> >>>> Carlos Navarro Retamal
> >>>> Bioinformatic Engineering. PhD.
> >>>> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> >>>> Simulations
> >>>> Universidad de Talca
> >>>> Av. Lircay S/N, Talca, Chile
> >>>> E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> >>>>
> >>>> On July 19, 2019 at 2:20:41 PM, Carlos Navarro (carlos.navarro87 at gmail.com)
> >>>> wrote:
> >>>>
> >>>> Dear gmx-users,
> >>>> I’m currently working on a server where each node has 40 physical
> >>>> cores (40 threads) and 4 NVIDIA V100 GPUs.
> >>>> When I launch a single job (1 simulation using a single GPU card) I
> >>>> get a performance of about 35 ns/day for a system of about 300k atoms.
> >>>> Looking at the GPU usage during the simulation, I notice that the card
> >>>> is being used at about 80%.
> >>>> The problems arise when I increase the number of jobs running at the
> >>>> same time. If, for instance, 2 jobs are running at the same time, the
> >>>> performance drops to ~25 ns/day each, and the GPU usage also drops
> >>>> during the simulation to about 30-40% (sometimes dropping to less than
> >>>> 5%).
> >>>> Clearly there is a communication problem between the GPUs and the CPU
> >>>> during the simulations, but I don’t know how to solve it.
> >>>> Here is the script I use to run the simulations:
> >>>>
> >>>> #!/bin/bash -x
> >>>> #SBATCH --job-name=testAtTPC1
> >>>> #SBATCH --ntasks-per-node=4
> >>>> #SBATCH --cpus-per-task=20
> >>>> #SBATCH --account=hdd22
> >>>> #SBATCH --nodes=1
> >>>> #SBATCH --mem=0
> >>>> #SBATCH --output=sout.%j
> >>>> #SBATCH --error=s4err.%j
> >>>> #SBATCH --time=00:10:00
> >>>> #SBATCH --partition=develgpus
> >>>> #SBATCH --gres=gpu:4
> >>>>
> >>>> module use /gpfs/software/juwels/otherstages
> >>>> module load Stages/2018b
> >>>> module load Intel/2019.0.117-GCC-7.3.0
> >>>> module load IntelMPI/2019.0.117
> >>>> module load GROMACS/2018.3
> >>>>
> >>>> WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> >>>> WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> >>>> WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> >>>> WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> >>>>
> >>>> DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> >>>> EXE=" gmx mdrun "
> >>>>
> >>>> cd $WORKDIR1
> >>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 &>log &
> >>>> cd $WORKDIR2
> >>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 &>log &
> >>>> cd $WORKDIR3
> >>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 &>log &
> >>>> cd $WORKDIR4
> >>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30 -ntomp 20 &>log &
> >>>>
> >>>>
> >>>> Regarding the pinoffset, I first tried using 20 cores for each job,
> >>>> but then also tried with 8 cores (so pinoffset 0 for job 1, pinoffset
> >>>> 4 for job 2, pinoffset 8 for job 3 and pinoffset 12 for job 4), but in
> >>>> the end the problem persists.
> >>>>
> >>>> Currently on this machine I’m not able to use more than 1 GPU per job,
> >>>> so this is my only option to make proper use of the whole node.
> >>>> If you need more information please just let me know.
> >>>> Best regards.
> >>>> Carlos
> >>>>
> >>>> ——————
> >>>> Carlos Navarro Retamal
> >>>> Bioinformatic Engineering. PhD.
> >>>> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> >>>> Simulations
> >>>> Universidad de Talca
> >>>> Av. Lircay S/N, Talca, Chile
> >>>> E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> >>
> >>
> >> --
> >>
> >> ----------
> >>
> >> Carlos Navarro Retamal
> >>
> >> Bioinformatic Engineering. PhD
> >>
> >> Postdoctoral Researcher in Center for Bioinformatics and Molecular
> >> Simulations
> >>
> >> Universidad de Talca
> >>
> >> Av. Lircay S/N, Talca, Chile
> >>
> >> T: (+56) 712201 798
> >>
> >> E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
>
> --
> ==================================================
>
> Justin A. Lemkul, Ph.D.
> Assistant Professor
> Office: 301 Fralin Hall
> Lab: 303 Engel Hall
>
> Virginia Tech Department of Biochemistry
> 340 West Campus Dr.
> Blacksburg, VA 24061
>
> jalemkul at vt.edu | (540) 231-3129
> http://www.thelemkullab.com
>
> ==================================================
>

