[gmx-users] GROMACS not scaling well with Core2 Quad CPUs

Yang Ye leafyoung at yahoo.com
Mon May 28 03:46:50 CEST 2007


Marshall, would you like to give LAM-MPI a try?
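
If you do, a minimal session might look roughly like this (only a sketch -
the host file, core count, and input file names are placeholders, and
mdrun_mpi needs to be built against LAM rather than your current MPI):

  lamboot hostfile                 # start the LAM daemons on the listed nodes
  grompp -np 4 -f run.mdp -c conf.gro -p topol.top -o run.tpr
  mpirun -np 4 mdrun_mpi -np 4 -s run.tpr
  lamhalt                          # shut the daemons down when finished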

Also, there is a patch that improves the communication in the PME part; it
has been featured on the GROMACS homepage:
http://wwwuser.gwdg.de/~ckutzne/

Regards,
Yang Ye

On 5/28/2007 9:07 AM, Mark Abraham wrote:
> Trevor Marshall wrote:
>> Can anybody give me any ideas that might help me optimize my new
>> cluster for a more linear speed increase as I add computing cores?
>> The new Intel Core2 CPUs are inherently very fast, but my mdrun
>> simulation performance is becoming asymptotic to a value only about
>> twice the speed I can get from a single core.
>
> The throughput rate is a better measure of performance than the Gflops
> reported by GROMACS's internal accounting. See
> http://www.gromacs.org/gromacs/benchmark/benchmarks.html
>
>> With mdrun_mpi I am simulating a 240-residue protein and a ligand for
>> 10,000 time steps. Here are the results for various combinations of
>> one, two, three, four and five cores.
>>
>> One local core only running mdrun:        18.3 hr/nsec    2.61 Gflops
>> Two local cores:                           9.98 hr/nsec    4.83 Gflops
>> Three local cores:                         7.35 hr/nsec    6.65 Gflops
>> Four local cores (one also controlling):   7.72 hr/nsec    6.42 Gflops
>> Three local cores and two remote cores:    7.59 hr/nsec    6.72 Gflops
>> One local and two remote cores:            9.76 hr/nsec    5.02 Gflops
>
> Here, the best you can expect three local cores to return is
> 18.3 / 3 ~ 6.1 h/ns, *if* there are no limitations from memory or I/O - and
> that 18.3 h/ns number was probably measured with the rest of the machine
> unloaded, so it is not a realistic baseline. Given Erik's suggestion, how is
> 7.35 h/ns so bad?
>
>> I get good performance with one local core doing control and three
>> doing calculations, giving 6.66 Gflops. However, adding two extra
>> remote cores only increases the speed slightly, to 6.72 Gflops, even
>> though the log (below) shows good task distribution (I think).
>
> Not really... nearly half of your flops (the 45.6% in Coul(T) + LJ [W3-W3],
> the nonbonded loops optimized for interactions between 3-point waters) are
> only getting 86% scaling, because CPU0 is doing only about half as much of
> that work as the others. That's because it has the whole protein/ligand
> on it.
>
> To fix this, particularly for heterogeneous clusters, I think you should be
> using the -sort and -shuffle options to grompp - see the man page.
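>
> For example, something along these lines (just a sketch - check grompp -h
> for your version; the .mdp/.gro/.top file names below are placeholders):
>
>   grompp -np 5 -shuffle -sort -f run.mdp -c conf.gro -p topol.top -o run.tpr
>   mpirun -np 5 mdrun_mpi -np 5 -s run.tpr
>
> The -np 5 matches the five processes in your log; shuffling and sorting let
> grompp spread the solute and solvent molecules more evenly over the nodes.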
>
>> Is there some problem with scaling when using these new fast CPUs? 
>> Can I tweak anything in mdrun_mpi to give better scaling?
>
> In short, no :-) More comments below.
>
>>         M E G A - F L O P S   A C C O U N T I N G
>>
>>         Parallel run - timing based on wallclock.
>>    RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>>    T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>>    NF=No Forces
>>
>>  Computing:                        M-Number         M-Flops  % of Flops
>> -----------------------------------------------------------------------
>>  LJ                              928.067418    30626.224794     1.1
>>  Coul(T)                         886.762558    37244.027436     1.4
>>  Coul(T) [W3]                     92.882138    11610.267250     0.4
>>  Coul(T) + LJ                    599.004388    32945.241340     1.2
>>  Coul(T) + LJ [W3]               243.730360    33634.789680     1.2
>>  Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000    45.6
>>  Outer nonbonded loop            945.783063     9457.830630     0.3
>>  1,4 nonbonded interactions       41.184118     3706.570620     0.1
>>  Spread Q Bspline              51931.592640   103863.185280     3.8
>>  Gather F Bspline              51931.592640   623179.111680    22.6
>>  3D-FFT                        40498.449440   323987.595520    11.7
>>  Solve PME                      3000.300000   192019.200000     7.0
>>  NS-Pairs                       1044.424912    21932.923152     0.8
>>  Reset In Box                     24.064040      216.576360     0.0
>>  Shift-X                         961.696160     5770.176960     0.2
>>  CG-CoM                            8.242234      239.024786     0.0
>>  Sum Forces                      721.272120      721.272120     0.0
>>  Bonds                            25.022502     1075.967586     0.0
>>  Angles                           36.343634     5924.012342     0.2
>>  Propers                          13.411341     3071.197089     0.1
>>  Impropers                        12.171217     2531.613136     0.1
>>  Virial                          241.774175     4351.935150     0.2
>>  Ext.ens. Update                 240.424040    12982.898160     0.5
>>  Stop-CM                         240.400000     2404.000000     0.1
>>  Calc-Ekin                       240.448080     6492.098160     0.2
>>  Constraint-V                    240.424040     1442.544240     0.1
>>  Constraint-Vir                  215.884746     5181.233904     0.2
>>  Settle                           71.961582    23243.590986     0.8
>> -----------------------------------------------------------------------
>>  Total                                       2757465.194361   100.0
>> -----------------------------------------------------------------------
>>
>>                NODE (s)   Real (s)      (%)
>>        Time:    408.000    408.000    100.0
>>                        6:48
>
> A six-minute simulation is pushing the low end for a benchmark. Nobody
> simulates for only 10 ps... I would go at least a factor of ten longer
> for benchmarking.
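>
> For example, if the current .mdp uses something like dt = 0.001 with
> nsteps = 10000 (i.e. 10 ps - substitute whatever your run actually uses), a
> more useful benchmark length would be:
>
>   dt      = 0.001   ; 1 fs time step
>   nsteps  = 100000  ; 100 ps, ten times the current run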
>
>>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
>> Performance:     14.810      6.758      3.176      7.556
>>
>> Detailed load balancing info in percentage of average
>> Type        NODE:  0   1   2   3   4 Scaling
>> -------------------------------------------
>>              LJ:423   0   3  41  32     23%
>>         Coul(T):500   0   0   0   0     20%
>>    Coul(T) [W3]:  0   0  32 291 176     34%
>>    Coul(T) + LJ:500   0   0   0   0     20%
>> Coul(T) + LJ [W3]:  0   0  24 296 178     33%
>> Coul(T) + LJ [W3-W3]: 60 116 108 106 107     86%
>> Outer nonbonded loop:246  42  45  79  85     40%
>> 1,4 nonbonded interactions:500   0   0   0   0     20%
>> Spread Q Bspline: 98 100 102 100  97     97%
>> Gather F Bspline: 98 100 102 100  97     97%
>>          3D-FFT:100 100 100 100 100    100%
>>       Solve PME:100 100 100 100 100    100%
>>        NS-Pairs:107  96  91 103 100     93%
>>    Reset In Box: 99 100 100 100  99     99%
>>         Shift-X: 99 100 100 100  99     99%
>>          CG-CoM:110  97  97  97  97     90%
>>      Sum Forces:100 100 100  99  99     99%
>>           Bonds:499   0   0   0   0     20%
>>          Angles:500   0   0   0   0     20%
>>         Propers:499   0   0   0   0     20%
>>       Impropers:500   0   0   0   0     20%
>>          Virial: 99 100 100 100  99     99%
>> Ext.ens. Update: 99 100 100 100  99     99%
>>         Stop-CM: 99 100 100 100  99     99%
>>       Calc-Ekin: 99 100 100 100  99     99%
>>    Constraint-V: 99 100 100 100  99     99%
>>  Constraint-Vir: 54 111 111 111 111     89%
>>          Settle: 54 111 111 111 111     89%
>>
>>     Total Force: 93 102  97 104 102     95%
>>
>>
>>     Total Shake: 56 110 110 110 110     90%
>>
>>
>> Total Scaling: 95% of max performance
>>
>> Finished mdrun on node 0 Sun May 27 07:29:57 2007
>
>> Erik,
>> I also have older systems which use Opteron 165 CPUs. I have run tests
>> of the AMD Opteron 165 CPUs (2.18 GHz) against the Intel Core2 Duos
>> (3 GHz). Twelve concurrent AutoDock jobs on each machine show the
>> Core2 Duos outperforming the Opterons by a factor of two.
>
> One worthwhile test is running four copies of the same single-CPU job
> on the same new node. Now the memory and disk access will
> de-synchronise, and you might see whether either of these is going to
> be rate-limiting for a four-CPU job. These numbers are a much better
> comparison for scaling than a one-CPU job with the rest of the box
> (presumably) unloaded.
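>
> Something like this would do it (a sketch - the directory names and the
> run.tpr input are placeholders for your own files):
>
>   for i in 1 2 3 4; do
>     mkdir -p copy$i && cp run.tpr copy$i/
>     (cd copy$i && mdrun -s run.tpr > mdrun.out 2>&1) &
>   done
>   wait
>
> Comparing the per-copy hr/nsec against the single unloaded run will show
> how much memory or disk contention costs on a fully loaded node.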
>
>> The data I posted showed inconsistencies which have nothing to do 
>> with memory bandwidth, and I was rather hoping for an analysis based 
>> upon the manner in which GROMACS mdrun distributes its computing tasks.
>
> They're also confounded with the interconnect performance in some cases.
>
>> I don't believe my data shows memory bandwidth-limiting effects. For
>> example, three 'local' cores on the quad-core machine are faster
>> (6.65 Gflops) than one core of the quad plus two cores from the
>> cluster (5.02 Gflops). How does that support the memory bandwidth
>> hypothesis?
>
> So here you've got 3 faster CPUs out-performing 1 faster CPU and 2 
> slower CPUs across a Gigabit network? That's not a huge surprise. 
> You'd need a strong memory bandwidth effect for the former to get hurt 
> enough to overcome the two limitations in the latter.
>
>> I figured that the GAMMA message-passing software might be causing
>> overhead, but when I examined the distribution of tasks by GROMACS (in
>> the log I provided), the tasks that mdrun distributed via GAMMA
>> appeared to be distributed well; rather, the manner in which CPU0
>> hogged most of the mdrun calculations might be the bottleneck. It was
>> insight into how mdrun distributes its work that I was seeking. Is
>> there any quantitative data available for me to review?
>
> CPU0 is not hogging - it's underloaded, if anything.
>
> Mark


