[gmx-users] Two strange problems when using GPU on multiple nodes

Mark Zang zangtw at gmail.com
Tue Jun 30 23:33:42 CEST 2015


Thanks for the quick response! The second problem is clear to me now, but I'm still not quite clear about the first one.
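
Just to convince myself that the GPUs on both nodes really are busy, I plan to poll nvidia-smi on every allocated node from inside the job. A rough sketch, assuming srun and nvidia-smi are available in the allocation (on this machine, ibrun --npernode 1 should do the same job as the srun line below):

# Run one nvidia-smi per allocated node and label the output with the hostname.
srun --nodes=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 bash -c \
    'echo "== $(hostname) =="; nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv'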

As for the first problem: I have run pure MPI simulations of the same system on the same machine before, and in that run an 11 GB trajectory file was written over 48 h without trouble. In the current GPU run, only 5-6 GB of trajectory had been written after 15 h (about 1.5x the simulation speed with half the number of nodes, thanks to the GPUs :D ) when the run stalled. Since that is still well below the 11 GB that was written successfully before, a file-size limit doesn't seem to explain it.
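
As a first check on my side, I can at least see when the output files last grew and what the end of the log says. Nothing GROMACS-specific here, just a sketch using the file names from this run:

# Size and last-modified time of the output files:
ls -lh --full-time md.log traj.trr
# The end of the log usually shows the last step written and any warnings before the stall:
tail -n 20 md.log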

Is there any way to figure out what happened to the filesystem after ~15 h of simulation?
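
For the record, these are the generic checks I would try from the run directory; the quota commands are an assumption and differ per site (Lustre, for example, uses lfs quota):

# Is the filesystem holding the run directory full?
df -h .
# Is a per-user disk quota in the way? (site-dependent; try whichever command exists)
quota -s || lfs quota -h -u $USER .
# Is there a per-process file-size limit that could cap traj.trr?
ulimit -f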

Thanks again


—
Sent from Mailbox

On Tue, Jun 30, 2015 at 3:50 PM, Mark Abraham <mark.j.abraham at gmail.com>
wrote:

> Hi,
> On Tue, Jun 30, 2015 at 10:38 PM Mark Zang <zangtw at gmail.com> wrote:
>> Dear all,
>> The cluster I am currently using has 24 CPU cores and 4 GPUs per node,
>> and I am attempting to use 2 nodes for my simulation. While my sbatch
>> script runs without error, I have found two strange problems and I am a
>> little confused by them. Here is my sbatch script:
>>
>>
>> #!/bin/bash
>> #SBATCH --job-name="zangtw"
>> #SBATCH --output="zangtw.%j.%N.out"
>> #SBATCH --partition=gpu
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=4
>> #SBATCH --gres=gpu:4
>> #SBATCH --export=ALL
>> #SBATCH -t 47:30:00
>>
>> ibrun --npernode 4 mdrun -ntomp 6 -s run.tpr -pin on -gpu_id 0123
>>
>>
>> and here are the descriptions of my problems:
>> 1. My simulation stops every 10+ hours. Specifically, the job is still
>> “running” in the queue but md.log/traj.trr stop updating.
> That suggests the filesystem has gone AWOL, or filled up, or hit a 2 GB
> file-size limit, or some such.
>> Is it due to a lack of memory?
> No
>> I have never encountered this before when I ran pure MPI, pure OpenMP, or even
>> hybrid MPI/OpenMP (without GPU) jobs.
>>
> Your throughput is different now, but those probably ran on other
> infrastructure, right? That's a more relevant difference.
>> 2. I am attempting to use 8 GPUs, but only 4 GPUs are displayed in my log
>> file (as follows):
>>
>>
>> Using 8 MPI processes
>> Using 6 OpenMP threads per MPI process
>> 4 GPUs detected on host comet-30-06.sdsc.edu:
>> #0: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
>> #1: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
>> #2: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
>> #3: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
>>
>>
>> Does it mean that only the 4 GPUs on the first node are used while the 4 GPUs on
>> the second node are idle,
> I suspect you can't even force that to happen if you wanted to ;-)
>> or that all 8 GPUs are used in reality but only 4 of them are displayed in the log
>> file?
> This - the reporting names a specific node, and you are using two. GROMACS
> 5.1 will be more helpful with such reporting.
>> In the second case, what should I do if I want to grep the information of
>> the other 4 GPUs?
>>
> Upgrade to 5.1 shortly ;-)
> Mark
>>
>>
>> Thank you guys so much!
>>
>>
>> Regards,
>> Mark
>>
>>
>>
>>
>>