[gmx-users] Re: Gromacs-4.6 on two Titans GPUs

Dwey Kauffman mpi566 at gmail.com
Tue Nov 5 23:00:19 CET 2013


Hi Szilard,

   Thanks for your suggestions. I am indeed aware of that page. On an 8-core
AMD node with 1 GPU, I am very happy with the performance; see below. My
intention is to obtain an even better one because we have multiple nodes.

### 8-core AMD with 1 GPU
Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
For optimal performance this ratio should be close to 1!


NOTE: The GPU has >20% more load than the CPU. This imbalance causes
      performance loss, consider using a shorter cut-off and a finer PME
      grid.

               Core t (s)   Wall t (s)        (%)
       Time:   216205.510    27036.812      799.7
                         7h30:36
                 (ns/day)    (hour/ns)
Performance:       31.956        0.751
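
About that NOTE: as far as I understand it, the imbalance can be reduced by
shortening the real-space cut-off and compensating with a proportionally finer
PME grid, which keeps the accuracy roughly the same (GROMACS 4.6 can also do
this automatically through its PME tuning). A rough .mdp sketch of what I have
in mind, assuming the run uses PME with the Verlet scheme and originally had
rcoulomb = 1.0 nm and fourierspacing = 0.12 nm (those starting values are my
assumption):

    cutoff-scheme   = Verlet
    coulombtype     = PME
    rcoulomb        = 0.9     ; shortened to shift work from the GPU to the CPU
    rvdw            = 0.9     ; the Verlet scheme requires rvdw = rcoulomb
    fourierspacing  = 0.108   ; scaled by the same 0.9 factor to keep PME accuracy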

### 8-core AMD with 2 GPUs

               Core t (s)   Wall t (s)        (%)
       Time:   178961.450    22398.880      799.0
                         6h13:18
                 (ns/day)    (hour/ns)
Performance:       38.573        0.622
Finished mdrun on node 0 Sat Jul 13 09:24:39 2013


>However, in your case I suspect that the 
>bottleneck is multi-threaded scaling on the AMD CPUs and you should 
>probably decrease the number of threads per MPI rank and share GPUs 
>between 2-4 ranks.


OK, but can you give an example of an mdrun command for an 8-core AMD node
with 2 GPUs? I will try to run it again.
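
For instance, is something like the following what you have in mind? The 4x2
rank/thread split and the -deffnm name are only my guesses, not something you
stated:

    mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011 -deffnm md

i.e. 4 thread-MPI ranks with 2 OpenMP threads each (8 cores in total), where
the -gpu_id string maps ranks 0 and 1 to GPU 0 and ranks 2 and 3 to GPU 1.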


>Regarding scaling across nodes, you can't expect much from gigabit 
>ethernet - especially not from the cheaper cards/switches, in my 
>experience even reaction field runs don't scale across nodes with 10G 
>ethernet if you have more than 4-6 ranks per node trying to 
>communicate (let alone with PME). However, on infiniband clusters we 
>have seen scaling to 100 atoms/core (at peak). 

From your comments, it sounds like a cluster of AMD CPUs is difficult to
scale across nodes in our current setup.

Let's assume we install InfiniBand (20 or 40 Gb/s) in the same system of 16
nodes, each an 8-core AMD with 1 GPU. Given the same AMD hardware, what is a
good way to obtain better performance when we run a job across nodes? In
other words, what does the mdrun_mpi command look like?
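
To make the question concrete, is it something along these lines? This is only
my guess, assuming Open MPI (-npernode is an Open MPI option) and one PP rank
per node driving that node's single GPU with 8 OpenMP threads:

    mpirun -np 16 -npernode 1 mdrun_mpi -ntomp 8 -gpu_id 0 -deffnm md

or, following your advice about sharing a GPU between ranks, two ranks per
node sharing GPU 0:

    mpirun -np 32 -npernode 2 mdrun_mpi -ntomp 4 -gpu_id 00 -deffnm md

Would either of these be a reasonable starting point over InfiniBand?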

Thanks,
Dwey


    



