[gmx-users] Performance issues running GROMACS with more than 1 GPU card in Slurm

Justin Lemkul jalemkul at vt.edu
Mon Jul 29 15:14:07 CEST 2019



On 7/29/19 8:46 AM, Carlos Navarro wrote:
> Hi Mark,
> I tried that before, but unfortunately in that case (removing --gres=gpu:1
> and including the -gpu_id flag in each line), for some reason the jobs
> run one at a time (one after the other), so I can't properly use the
> whole node.
>

You need to run all but the last mdrun process in the background (&).
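
For example, a minimal sketch of the shell pattern (mdrun flags omitted
for brevity):

  cd $WORKDIR1; $DO_PARALLEL $EXE ... &
  cd $WORKDIR2; $DO_PARALLEL $EXE ... &
  cd $WORKDIR3; $DO_PARALLEL $EXE ... &
  cd $WORKDIR4; $DO_PARALLEL $EXE ...    # the last run stays in the foreground

Alternatively, background all four and end the script with wait, which
blocks until every backgrounded process has finished.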

-Justin

> ——————
> Carlos Navarro Retamal
> Bioinformatic Engineering. PhD.
> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> Simulations
> Universidad de Talca
> Av. Lircay S/N, Talca, Chile
> E: carlos.navarro87 at gmail.com or cnavarro at utalca.cl
>
> On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abraham at gmail.com)
> wrote:
>
> Hi,
>
> When you use
>
> DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
>
> then the environment seems to make sure only one GPU is visible. (The log
> files report only finding one GPU.) But it's probably the same GPU in each
> case, with three remaining idle. I would suggest not using --gres unless
> you can specify *which* of the four available GPUs each run can use.
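>
> A quick way to see which card each job step actually gets (a sketch; it
> assumes the site's gres plugin exports CUDA_VISIBLE_DEVICES to each
> step, and note that some configurations remap the index to 0 inside
> every step, in which case compare the GPU reported in each md.log
> instead):
>
> srun --exclusive -n 1 --gres=gpu:1 bash -c 'echo "step $SLURM_STEP_ID sees GPU(s): $CUDA_VISIBLE_DEVICES"' &
> srun --exclusive -n 1 --gres=gpu:1 bash -c 'echo "step $SLURM_STEP_ID sees GPU(s): $CUDA_VISIBLE_DEVICES"' &
> wait
>
> If both steps report the same index, the runs are indeed sharing one
> card.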
>
> Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.
>
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 -gpu_id 0
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 -gpu_id 1
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 -gpu_id 2
> etc.
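>
> Since -pinoffset is given in logical cores, concurrent runs need
> offsets at least -ntomp apart, so on a 40-core node a non-overlapping
> four-run layout uses 10 threads per run. A sketch extrapolating the
> lines above, with each run backgrounded and a final wait:
>
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 10 -gpu_id 0 &
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 10 -gpu_id 1 &
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 10 -gpu_id 2 &
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30 -ntomp 10 -gpu_id 3 &
> wait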
>
> Mark
>
> On Mon, 29 Jul 2019 at 11:34, Carlos Navarro <carlos.navarro87 at gmail.com>
> wrote:
>
>> Hi Szilárd,
>> To answer your questions:
>> ** are you trying to run multiple simulations concurrently on the same
>> node or are you trying to strong-scale?
>> I'm trying to run multiple simulations on the same node at the same time.
>>
>> ** what are you simulating?
>> Regular and CompEl simulations.
>>
>> ** can you provide log files of the runs?
>> Some log files are available at the following link:
>> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
>> In short, alone.log -> a single run on the node (using 1 GPU);
>> multi1/2/3/4.log -> 4 independent simulations run at the same time on a
>> single node. In all cases, 20 CPUs are used.
>> Best regards,
>> Carlos
>>
>> On Thu, 25 Jul 2019 at 10:59, Szilárd Páll (<pall.szilard at gmail.com>)
>> wrote:
>>
>>> Hi,
>>>
>>> It is not clear to me how you are trying to set up your runs, so
>>> please provide some details:
>>> - are you trying to run multiple simulations concurrently on the same
>>> node or are you trying to strong-scale?
>>> - what are you simulating?
>>> - can you provide log files of the runs?
>>>
>>> Cheers,
>>>
>>> --
>>> Szilárd
>>>
>>> On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
>>> <carlos.navarro87 at gmail.com> wrote:
>>>> Can no one give me an idea of what might be happening, or how I can
>>>> solve it?
>>>> Best regards,
>>>> Carlos
>>>>
>>>> On July 19, 2019 at 2:20:41 PM, Carlos Navarro
>>>> (carlos.navarro87 at gmail.com) wrote:
>>>>
>>>> Dear gmx-users,
>>>> I'm currently working on a server where each node has 40 physical
>>>> cores (40 threads) and 4 NVIDIA V100 GPUs.
>>>> When I launch a single job (1 simulation using a single GPU card) I
>>>> get a performance of ~35 ns/day for a system of about 300k atoms.
>>>> Looking at the usage of the video card during the simulation, I
>>>> notice that the card is being used at about 80%.
>>>> The problems arise when I increase the number of jobs running at the
>>>> same time. If, for instance, 2 jobs are running at the same time, the
>>>> performance of each drops to ~25 ns/day, and the usage of the video
>>>> cards also drops during the simulation to about 30-40% (sometimes to
>>>> less than 5%).
>>>> Clearly there is a communication problem between the GPU cards and
>>>> the CPU during the simulations, but I don't know how to solve this.
>>>> Here is the script I use to run the simulations:
>>>>
>>>> #!/bin/bash -x
>>>> #SBATCH --job-name=testAtTPC1
>>>> #SBATCH --ntasks-per-node=4
>>>> #SBATCH --cpus-per-task=20
>>>> #SBATCH --account=hdd22
>>>> #SBATCH --nodes=1
>>>> #SBATCH --mem=0
>>>> #SBATCH --output=sout.%j
>>>> #SBATCH --error=s4err.%j
>>>> #SBATCH --time=00:10:00
>>>> #SBATCH --partition=develgpus
>>>> #SBATCH --gres=gpu:4
>>>>
>>>> module use /gpfs/software/juwels/otherstages
>>>> module load Stages/2018b
>>>> module load Intel/2019.0.117-GCC-7.3.0
>>>> module load IntelMPI/2019.0.117
>>>> module load GROMACS/2018.3
>>>>
>>>> WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
>>>> WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
>>>> WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
>>>> WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
>>>>
>>>> DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
>>>> EXE=" gmx mdrun "
>>>>
>>>> cd $WORKDIR1
>>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0 -ntomp 20 &>log &
>>>> cd $WORKDIR2
>>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10 -ntomp 20 &>log &
>>>> cd $WORKDIR3
>>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20 -ntomp 20 &>log &
>>>> cd $WORKDIR4
>>>> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30 -ntomp 20 &>log &
>>>>
>>>>
>>>> Regarding the pinoffset, I first tried using 20 cores for each job,
>>>> but then also tried with 8 cores (pinoffset 0 for job 1, pinoffset 4
>>>> for job 2, pinoffset 8 for job 3 and pinoffset 12 for job 4), but in
>>>> the end the problem persists.
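>>>>
>>>> Note that the pins only stay disjoint when each offset advances by
>>>> the run's -ntomp count (offset_i = i * ntomp), e.g. 0/8/16/24 for
>>>> 8-thread runs. A small illustrative loop capturing that rule (a
>>>> sketch, not the original script; it assumes --gres is dropped from
>>>> DO_PARALLEL so all four GPUs are visible, and uses 10 threads per
>>>> run so four runs exactly fill the 40 cores):
>>>>
>>>> DIRS=($WORKDIR1 $WORKDIR2 $WORKDIR3 $WORKDIR4)
>>>> NTOMP=10                     # 40 cores / 4 concurrent runs
>>>> for i in 0 1 2 3; do
>>>>     cd ${DIRS[$i]}
>>>>     $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on \
>>>>         -pinoffset $((i * NTOMP)) -ntomp $NTOMP -gpu_id $i &>log &
>>>> done
>>>> wait                         # keep the batch job alive until all runs end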
>>>>
>>>> Currently on this machine I'm not able to use more than 1 GPU per
>>>> job, so this is my only way to make proper use of the whole node.
>>>> If you need more information please just let me know.
>>>> Best regards.
>>>> Carlos
>>>>

-- 
==================================================

Justin A. Lemkul, Ph.D.
Assistant Professor
Office: 301 Fralin Hall
Lab: 303 Engel Hall

Virginia Tech Department of Biochemistry
340 West Campus Dr.
Blacksburg, VA 24061

jalemkul at vt.edu | (540) 231-3129
http://www.thelemkullab.com

==================================================


