[gmx-users] Re: Gromacs-4.6 on two Titans GPUs

Richard Broadbent richard.broadbent09 at imperial.ac.uk
Wed Nov 6 12:31:52 CET 2013


Hi Dwey,

On 05/11/13 22:00, Dwey Kauffman wrote:
> Hi Szilard,
>
> Thanks for your suggestions. I am indeed aware of this page. On an 8-core
> AMD machine with 1 GPU, I am very happy with its performance; see below. My
> intention is to obtain even better performance because we have multiple nodes.
>
> ### 8-core AMD with 1 GPU
> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
> For optimal performance this ratio should be close to 1!
>
>
> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>        performance loss, consider using a shorter cut-off and a finer PME
> grid.
>
>                 Core t (s)   Wall t (s)        (%)
>         Time:   216205.510    27036.812      799.7
>                           7h30:36
>                   (ns/day)    (hour/ns)
> Performance:       31.956        0.751
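
Regarding the note above about a shorter cut-off and a finer PME grid: a
hypothetical starting point (untested; it assumes the Verlet cut-off scheme,
where rcoulomb and rvdw are kept equal, and the exact values depend on your
force field and system) would be to scale the cut-offs and the Fourier
spacing down by the same factor in the .mdp file, e.g.

    ; original settings (placeholder values)
    rcoulomb        = 1.0
    rvdw            = 1.0
    fourierspacing  = 0.12

    ; ~10% shorter cut-off with a proportionally finer PME grid,
    ; shifting short-range work off the GPU and onto the CPU's PME
    rcoulomb        = 0.9
    rvdw            = 0.9
    fourierspacing  = 0.108

As far as I understand, mdrun's automatic PME tuning only increases the
cut-off from the .mdp value, which is why the log suggests shortening it in
the input itself.
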
>
> ### 8-core AMD with 2 GPUs
>
>                 Core t (s)   Wall t (s)        (%)
>         Time:   178961.450    22398.880      799.0
>                           6h13:18
>                   (ns/day)    (hour/ns)
> Performance:       38.573        0.622
> Finished mdrun on node 0 Sat Jul 13 09:24:39 2013
>

I'm almost certain that Szilard meant the lines above this summary, which 
give the breakdown of where the time is spent in the simulation.

Richard
>
>> However, in your case I suspect that the
>> bottleneck is multi-threaded scaling on the AMD CPUs and you should
>> probably decrease the number of threads per MPI rank and share GPUs
>> between 2-4 ranks.
>
>
> OK, but can you give an example of an mdrun command for an 8-core AMD
> machine with 2 GPUs?
> I will try to run it again.
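
Not speaking for Szilard, but on a single 8-core node with 2 GPUs his
suggestion would translate into something like the following (untested,
assuming a thread-MPI build of 4.6; the -deffnm name is just a placeholder):

    # 4 thread-MPI ranks x 2 OpenMP threads = 8 cores,
    # two ranks sharing each GPU (-gpu_id lists one GPU id per rank)
    mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011 -deffnm topol

    # for comparison: one rank per GPU, 4 OpenMP threads each
    mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm topol

It is worth benchmarking a few rank/thread splits, since the best one is
system- and hardware-dependent.
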
>
>
>> Regarding scaling across nodes, you can't expect much from gigabit
>> ethernet - especially not from the cheaper cards/switches, in my
>> experience even reaction field runs don't scale across nodes with 10G
>> ethernet if you have more than 4-6 ranks per node trying to
>> communicate (let alone with PME). However, on infiniband clusters we
>> have seen scaling to 100 atoms/core (at peak).
>
> From your comments, it sounds like a cluster of AMD CPUs is difficult to
> scale across nodes in our current setup.
>
> Let's assume we install InfiniBand (20 or 40 Gb/s) in the same system of 16
> nodes, each an 8-core AMD with 1 GPU only. Considering the same AMD system,
> what is a good way to obtain better performance when we run a task across
> nodes? In other words, what does the mdrun_mpi command look like?
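
As a rough illustration only (untested; it assumes an MPI build of 4.6 named
mdrun_mpi, OpenMPI's -npernode option, and that each node's single GPU has
id 0), a run over all 16 nodes might look like:

    # 2 PP ranks per node x 16 nodes = 32 ranks, 4 OpenMP threads each;
    # -gpu_id 00 maps both ranks on a node onto that node's GPU 0
    mpirun -np 32 -npernode 2 mdrun_mpi -ntomp 4 -gpu_id 00 -deffnm topol

At larger rank counts it can also pay to dedicate ranks to PME with -npme,
but whether that helps here is something only benchmarking will tell.
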
>
> Thanks,
> Dwey
>
>
>
>
> --
> View this message in context: http://gromacs.5086.x6.nabble.com/Gromacs-4-6-on-two-Titans-GPUs-tp5012186p5012279.html
> Sent from the GROMACS Users Forum mailing list archive at Nabble.com.
>


