[gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

Carlos Navarro carlos.navarro87 at gmail.com
Mon Jul 29 11:34:05 CEST 2019


Hi Szilárd,
To answer your questions:
**are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
I'm trying to run multiple simulations on the same node at the same time.

** what are you simulating?
Regular and CompEl simulations

** can you provide log files of the runs?
In the following link you can find some log files:
https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
In short, alone.log -> a single run on the node (using 1 GPU);
multi1/2/3/4.log -> 4 independent simulations run at the same time on a
single node. In all cases, 20 CPUs are used per simulation.
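
In case it helps when comparing them, the final throughput of each run can be
pulled from the log footers with something along these lines (assuming the
standard mdrun log ending):

  grep "Performance:" alone.log multi*.log
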
Best regards,
Carlos

On Thu, Jul 25, 2019 at 10:59, Szilárd Páll (<pall.szilard at gmail.com>)
wrote:

> Hi,
>
> It is not clear to me how you are trying to set up your runs, so
> please provide some details:
> - are you trying to run multiple simulations concurrently on the same
> node or are you trying to strong-scale?
> - what are you simulating?
> - can you provide log files of the runs?
>
> Cheers,
>
> --
> Szilárd
>
> On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> <carlos.navarro87 at gmail.com> wrote:
> >
> > Can no one give me an idea of what might be happening, or how I can
> > solve it?
> > Best regards,
> > Carlos
> >
> > ——————
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher at the Center for Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
> >
> > On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> carlos.navarro87 at gmail.com)
> > wrote:
> >
> > Dear gmx-users,
> > I’m currently working on a server where each node has 40 physical cores
> > (40 threads) and 4 NVIDIA V100 GPUs.
> > When I launch a single job (1 simulation using a single GPU card) I get a
> > performance of about ~35 ns/day for a system of about 300k atoms. Looking
> > at the usage of the video card during the simulation, I notice that the
> > card is being used at about ~80%.
> > The problems arise when I increase the number of jobs running at the same
> > time. If, for instance, 2 jobs are running at the same time, the
> > performance drops to ~25 ns/day each, and the usage of the video cards
> > also drops during the simulation to about ~30-40% (sometimes to less than
> > 5%).
> > Clearly there is a communication problem between the GPU cards and the CPU
> > during the simulations, but I don’t know how to solve this.
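> >
> > (For reference, a minimal way to watch the per-GPU utilization from another
> > shell while the jobs run, assuming nvidia-smi is available on the node:
> >
> > nvidia-smi --query-gpu=index,utilization.gpu --format=csv -l 1
> >
> > where -l 1 re-samples every second.)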
> > Here is the script I use to run the simulations:
> >
> > #!/bin/bash -x
> > #SBATCH --job-name=testAtTPC1
> > #SBATCH --ntasks-per-node=4
> > #SBATCH --cpus-per-task=20
> > #SBATCH --account=hdd22
> > #SBATCH --nodes=1
> > #SBATCH --mem=0
> > #SBATCH --output=sout.%j
> > #SBATCH --error=s4err.%j
> > #SBATCH --time=00:10:00
> > #SBATCH --partition=develgpus
> > #SBATCH --gres=gpu:4
> >
> > module use /gpfs/software/juwels/otherstages
> > module load Stages/2018b
> > module load Intel/2019.0.117-GCC-7.3.0
> > module load IntelMPI/2019.0.117
> > module load GROMACS/2018.3
> >
> > WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> > WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> > WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> > WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> >
> > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > EXE=" gmx mdrun "
> >
> > cd $WORKDIR1
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> > -ntomp 20 &>log &
> > cd $WORKDIR2
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
> > -ntomp 20 &>log &
> > cd $WORKDIR3
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
> > -ntomp 20 &>log &
> > cd $WORKDIR4
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30
> > -ntomp 20 &>log &
> >
> >
> > Regarding the pinoffset, I first tried using 20 cores for each job, but
> > then also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for
> > job 2, pinoffset 8 for job 3 and pinoffset 12 for job 4), but in the end
> > the problem persists.
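> >
> > For illustration only, a layout in which the four runs would not overlap on
> > the 40 hardware threads would look roughly like this (a sketch, assuming 10
> > OpenMP threads per run, consecutive pinning, and using -ntmpi for the
> > thread-MPI rank count):
> >
> > cd $WORKDIR1
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 10 -pin on
> > -pinoffset 0 -pinstride 1 &>log &
> > cd $WORKDIR2
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 10 -pin on
> > -pinoffset 10 -pinstride 1 &>log &
> > # ...and likewise -pinoffset 20 and -pinoffset 30 for WORKDIR3 and WORKDIR4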
> >
> > Currently on this machine I’m not able to use more than 1 GPU per job, so
> > this is my only way of making proper use of the whole node.
> > If you need more information please just let me know.
> > Best regards.
> > Carlos
> >
> > ——————
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher at the Center for Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl



-- 

----------

Carlos Navarro Retamal

Bioinformatic Engineering. PhD

Postdoctoral Researcher in Center for Bioinformatics and Molecular
Simulations

Universidad de Talca

Av. Lircay S/N, Talca, Chile

T: (+56) 712201 798

E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl

