[gmx-users] Re: Gromacs-4.6 on two Titans GPUs
James Starlight
jmsstarlight at gmail.com
Thu Nov 7 06:34:47 CET 2013
I've come to the conclusion that simulations with 1 or 2 GPUs give me the
same performance:

mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -v -deffnm md_CaM_test
mdrun -ntmpi 2 -ntomp 6 -gpu_id 0 -v -deffnm md_CaM_test

Could this be due to the small number of CPU cores, or is additional RAM
needed (this system has 32 GB)? Or are some extra options needed in the
config?
James
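A hedged sketch of one thing to try here, along the lines of Szilard's
advice further down in the thread: keep the same 12 threads, but split them
over four thread-MPI ranks so that each Titan is shared by two ranks (the
-gpu_id string lists one GPU per rank, so repeated digits share a GPU).
Untested on this particular machine:

mdrun -ntmpi 4 -ntomp 3 -gpu_id 0011 -v -deffnm md_CaM_test

If the two-GPU run still gives the same ns/day as the one-GPU run, the CPU
side is the more likely bottleneck than RAM.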
2013/11/6 Richard Broadbent <richard.broadbent09 at imperial.ac.uk>
> Hi Dwey,
>
>
> On 05/11/13 22:00, Dwey Kauffman wrote:
>
>> Hi Szilard,
>>
>> Thanks for your suggestions. I am indeed aware of this page. On an
>> 8-core AMD with 1 GPU, I am very happy with its performance. See below. My
>> intention is to obtain an even better one because we have multiple nodes.
>>
>> ### 8-core AMD with 1 GPU
>> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
>> For optimal performance this ratio should be close to 1!
>>
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> performance loss, consider using a shorter cut-off and a finer PME
>> grid.
>>
>> Core t (s) Wall t (s) (%)
>> Time: 216205.510 27036.812 799.7
>> 7h30:36
>> (ns/day) (hour/ns)
>> Performance: 31.956 0.751
>>
>> ### 8-core AMD with 2 GPUs
>>
>> Core t (s) Wall t (s) (%)
>> Time: 178961.450 22398.880 799.0
>> 6h13:18
>> (ns/day) (hour/ns)
>> Performance: 38.573 0.622
>> Finished mdrun on node 0 Sat Jul 13 09:24:39 2013
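On the NOTE above about the GPU carrying >20% more load than the CPU: it
suggests shifting work from the GPU to the CPU by shortening the cut-off and
refining the PME grid. A hedged .mdp sketch of that idea; the numbers are
only illustrative, not taken from this run, and cut-off and grid spacing are
scaled by the same factor to keep PME accuracy roughly constant:

rcoulomb        = 0.9    ; shorter cut-off -> less non-bonded work on the GPU (hypothetical, was 1.0)
rvdw            = 0.9    ; must equal rcoulomb with the Verlet scheme used for GPU runs
fourierspacing  = 0.108  ; finer PME grid -> more PME work on the CPU (scaled from 0.12)

Note that mdrun 4.6 also adjusts this balance on its own at run time via PME
tuning (see -tunepme), so hand-tuning may only help at the margins.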
>>
>>
> I'm almost certain that Szilard meant the lines above this that give the
> breakdown of where the time is spent in the simulation.
>
> Richard
>
>
>>> However, in your case I suspect that the
>>> bottleneck is multi-threaded scaling on the AMD CPUs and you should
>>> probably decrease the number of threads per MPI rank and share GPUs
>>> between 2-4 ranks.
>>>
>>
>>
>> OK, but can you give an example of an mdrun command, given an 8-core AMD
>> with 2 GPUs? I will try to run it again.
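A hedged sketch of what that advice might look like on an 8-core node with
two GPUs (the -deffnm name is just a placeholder); the -gpu_id string assigns
one GPU per thread-MPI rank, so repeating a digit shares that GPU between
ranks:

mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011 -deffnm topol
mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm topol

The first shares each GPU between two ranks, the second gives each rank its
own GPU; which one wins depends on the system, so both are worth timing.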
>>
>>
>>> Regarding scaling across nodes, you can't expect much from gigabit
>>> ethernet - especially not from the cheaper cards/switches, in my
>>> experience even reaction field runs don't scale across nodes with 10G
>>> ethernet if you have more than 4-6 ranks per node trying to
>>> communicate (let alone with PME). However, on infiniband clusters we
>>> have seen scaling to 100 atoms/core (at peak).
>>>
>>
>> From your comments, it sounds like a cluster of AMD CPUs is difficult to
>> scale across nodes in our current setup.
>>
>> Let's assume we install InfiniBand (20 or 40 Gb/s) in the same system of 16
>> nodes of 8-core AMD with 1 GPU each. Considering the same AMD system, what
>> is a good way to obtain better performance when we run a task across nodes?
>> In other words, what does mdrun_mpi look like?
>>
>> Thanks,
>> Dwey
>>
>>
>>
>>
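A hedged sketch of a cross-node launch for such a cluster, assuming an
MPI-enabled build (mdrun_mpi) started through Open MPI's mpirun; the rank
counts, launcher flags and the -deffnm name are placeholders that depend on
the actual installation:

mpirun -np 32 -npernode 2 mdrun_mpi -ntomp 4 -gpu_id 00 -deffnm topol

Here each of the 16 nodes runs two PP ranks with 4 OpenMP threads each, and
-gpu_id 00 makes both ranks on a node share its single GPU; a one-rank-per-node
variant would be -np 16 with -ntomp 8 and -gpu_id 0. With PME over InfiniBand,
dedicating some ranks to PME with -npme may also help.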