[gmx-users] Re: Gromacs-4.6 on two Titans GPUs
James Starlight
jmsstarlight at gmail.com
Thu Nov 7 06:34:47 CET 2013
I've come to the conclusion that simulations with 1 or 2 GPUs give me the
same performance:

mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -v -deffnm md_CaM_test
mdrun -ntmpi 2 -ntomp 6 -gpu_id 0 -v -deffnm md_CaM_test

Could this be due to the small number of CPU cores, or is additional RAM
needed (this system has 32 GB)? Or are some extra options needed in the
config?
James
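A hedged sketch of one thing to try here, along the lines of Szilard's
advice further down in the thread: keep the same 12 threads, but split them
over four thread-MPI ranks so that each Titan is shared by two ranks (the
-gpu_id string lists one GPU per rank, so repeated digits share a GPU).
Untested on this particular machine:

mdrun -ntmpi 4 -ntomp 3 -gpu_id 0011 -v -deffnm md_CaM_test

If the two-GPU run still gives the same ns/day as the one-GPU run, the CPU
side is the more likely bottleneck than RAM.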
2013/11/6 Richard Broadbent <richard.broadbent09 at imperial.ac.uk>
> Hi Dwey,
>
>
> On 05/11/13 22:00, Dwey Kauffman wrote:
>
>> Hi Szilard,
>>
>> Thanks for your suggestions. I am indeed aware of this page. On an
>> 8-core AMD with 1 GPU, I am very happy with its performance. See below. My
>> intention is to obtain an even better one because we have multiple nodes.
>>
>> ### 8-core AMD with 1 GPU
>> Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
>> For optimal performance this ratio should be close to 1!
>>
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> performance loss, consider using a shorter cut-off and a finer PME
>> grid.
>>
>> Core t (s) Wall t (s) (%)
>> Time: 216205.510 27036.812 799.7
>> 7h30:36
>> (ns/day) (hour/ns)
>> Performance: 31.956 0.751
>>
>> ### 8-core AMD with 2 GPUs
>>
>> Core t (s) Wall t (s) (%)
>> Time: 178961.450 22398.880 799.0
>> 6h13:18
>> (ns/day) (hour/ns)
>> Performance: 38.573 0.622
>> Finished mdrun on node 0 Sat Jul 13 09:24:39 2013
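On the NOTE above about the GPU carrying >20% more load than the CPU: it
suggests shifting work from the GPU to the CPU by shortening the cut-off and
refining the PME grid. A hedged .mdp sketch of that idea; the numbers are
only illustrative, not taken from this run, and cut-off and grid spacing are
scaled by the same factor to keep PME accuracy roughly constant:

rcoulomb        = 0.9    ; shorter cut-off -> less non-bonded work on the GPU (hypothetical, was 1.0)
rvdw            = 0.9    ; must equal rcoulomb with the Verlet scheme used for GPU runs
fourierspacing  = 0.108  ; finer PME grid -> more PME work on the CPU (scaled from 0.12)

Note that mdrun 4.6 also adjusts this balance on its own at run time via PME
tuning (see -tunepme), so hand-tuning may only help at the margins.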
>>
>>
> I'm almost certain that Szilard meant the lines above this that give the
> breakdown of where the time is spent in the simulation.
>
> Richard
>
>
>>> However, in your case I suspect that the
>>> bottleneck is multi-threaded scaling on the AMD CPUs and you should
>>> probably decrease the number of threads per MPI rank and share GPUs
>>> between 2-4 ranks.
>>>
>>
>>
>> OK, but can you give an example of an mdrun command, given an 8-core AMD
>> with 2 GPUs? I will try to run it again.
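A hedged sketch of what that advice might look like on an 8-core node with
two GPUs (the -deffnm name is just a placeholder); the -gpu_id string assigns
one GPU per thread-MPI rank, so repeating a digit shares that GPU between
ranks:

mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011 -deffnm topol
mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm topol

The first shares each GPU between two ranks, the second gives each rank its
own GPU; which one wins depends on the system, so both are worth timing.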
>>
>>
>>> Regarding scaling across nodes, you can't expect much from gigabit
>>> ethernet - especially not from the cheaper cards/switches, in my
>>> experience even reaction field runs don't scale across nodes with 10G
>>> ethernet if you have more than 4-6 ranks per node trying to
>>> communicate (let alone with PME). However, on infiniband clusters we
>>> have seen scaling to 100 atoms/core (at peak).
>>>
>>
>> From your comments, it sounds like a cluster of AMD CPUs is difficult to
>> scale across nodes in our current setup.
>>
>> Let's assume we install InfiniBand (20 or 40 Gb/s) in the same system of 16
>> nodes of 8-core AMD with 1 GPU each. Considering the same AMD system, what
>> is a good way to obtain better performance when we run a task across nodes?
>> In other words, what does mdrun_mpi look like?
>>
>> Thanks,
>> Dwey
>>
>>
>>
>>
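A hedged sketch of a cross-node launch for such a cluster, assuming an
MPI-enabled build (mdrun_mpi) started through Open MPI's mpirun; the rank
counts, launcher flags and the -deffnm name are placeholders that depend on
the actual installation:

mpirun -np 32 -npernode 2 mdrun_mpi -ntomp 4 -gpu_id 00 -deffnm topol

Here each of the 16 nodes runs two PP ranks with 4 OpenMP threads each, and
-gpu_id 00 makes both ranks on a node share its single GPU; a one-rank-per-node
variant would be -np 16 with -ntomp 8 and -gpu_id 0. With PME over InfiniBand,
dedicating some ranks to PME with -npme may also help.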