[gmx-users] Two strange problems when using GPU on multiple nodes

Mark Abraham mark.j.abraham at gmail.com
Tue Jun 30 22:49:05 CEST 2015


Hi,

On Tue, Jun 30, 2015 at 10:38 PM Mark Zang <zangtw at gmail.com> wrote:

> Dear all,
> The cluster I am currently using has 24 CPU cores and 4 GPUs per node,
> and I am attempting to use 2 nodes for my simulation. While my sbatch
> script runs without errors, I have found two strange problems and I am a
> little confused about them. Here is my sbatch script:
>
>
> #!/bin/bash
> #SBATCH --job-name="zangtw"
> #SBATCH --output="zangtw.%j.%N.out"
> #SBATCH --partition=gpu
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=4
> #SBATCH --gres=gpu:4
> #SBATCH --export=ALL
> #SBATCH -t 47:30:00
>
> ibrun --npernode 4 mdrun -ntomp 6 -s run.tpr -pin on -gpu_id 0123
>
>
> and here are the descriptions of my problems:
> 1. My simulation stops every 10+ hours. Specifically, the job is still
> “running” in the queue but md.log/traj.trr stop updating.


That suggests a filesystem problem: it has gone AWOL, or filled up, or hit
a 2 GB file size limit, or something similar.
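
A rough sketch of how one might check those suspects from the run
directory. The helper name check_run_dir is made up for illustration; it
only wraps the standard df, ulimit and ls commands, and it assumes the
limits seen in your login shell match those on the compute nodes, which
may not be true on your cluster.

```shell
#!/bin/bash
# Sketch only: check_run_dir is a hypothetical helper, not a GROMACS tool.
# It prints free space, the shell's file size limit, and the output file
# sizes for the directory given as the first argument (default: current).
check_run_dir() {
    dir=${1:-.}

    echo "== free space on the filesystem holding $dir =="
    df -h "$dir"

    echo "== file size limit for this shell (ulimit -f) =="
    ulimit -f

    echo "== sizes and timestamps of files in $dir =="
    ls -l "$dir"
}
```

Calling check_run_dir . right after md.log stops updating would show
whether the disk is full, and whether traj.trr stalled at a suspiciously
round size such as 2 GB.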


> Is it due to the lack of memory?


No


> I have never encountered this before when running pure MPI, pure OpenMP,
> or even hybrid MPI/OpenMP (without GPU) jobs.
>

Your throughput is different now, but those probably ran on other
infrastructure, right? That's a more relevant difference.


> 2. I am attempting to use 8 GPUs, but only 4 GPUs are displayed in my log
> file (as follows):
>
>
> Using 8 MPI processes
> Using 6 OpenMP threads per MPI process
> 4 GPUs detected on host comet-30-06.sdsc.edu:
> #0: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
> #1: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
> #2: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
> #3: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
>
>
> Does it mean that only 4 GPUs in the first node are used while 4 GPUs in
> the second node are idle,


I suspect you can't even force that to happen if you wanted to ;-)


> or 8 GPUs are used in reality but only 4 of them are displayed in the log
> file?


The latter. The report names one specific node, and you are using two, so
the other node's 4 GPUs are in use but not listed. GROMACS 5.1 will be more
helpful with such reporting.
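
To make the mapping concrete, here is a small sketch of how the same
-gpu_id 0123 string is applied on each node. The function gpu_map is
hypothetical, and it assumes ranks are placed node by node (ranks 0-3 on
the first node, 4-7 on the second); the actual placement depends on your
MPI launcher.

```shell
#!/bin/bash
# Sketch only: gpu_map is a hypothetical helper for illustration.
# Arguments: number of nodes, ranks per node, the -gpu_id string.
# Assumes block placement of ranks: node 0 gets the first ranks, and so on.
gpu_map() {
    nodes=$1
    ranks_per_node=$2
    gpu_ids=$3

    node=0
    while [ "$node" -lt "$nodes" ]; do
        local_rank=0
        while [ "$local_rank" -lt "$ranks_per_node" ]; do
            # The n-th character of the string is the GPU id used by the
            # n-th rank on each node (cut counts characters from 1).
            gpu=$(printf '%s' "$gpu_ids" | cut -c "$((local_rank + 1))")
            echo "node $node, rank $((node * ranks_per_node + local_rank)) -> GPU $gpu"
            local_rank=$((local_rank + 1))
        done
        node=$((node + 1))
    done
}

gpu_map 2 4 0123
```

This prints eight lines, one per rank, with GPU ids 0 to 3 appearing once
on each node. Note also that 4 ranks times 6 OpenMP threads accounts for
all 24 cores of a node. On the cluster itself, running nvidia-smi -L on
each node should list the physical devices there.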


> In the second case, what should I do if I want to grep the information
> about the other 4 GPUs?
>

Upgrade to 5.1 shortly ;-)

Mark


>
>
> Thank you guys so much!
>
>
> Regards,
> Mark
>

