[gmx-users] GROMACS not scaling well with Core 2 Quad CPUs

Mark Abraham Mark.Abraham at anu.edu.au
Mon May 28 03:07:36 CEST 2007


Trevor Marshall wrote:
> Can anybody give me any ideas which might help me optimize my new 
> cluster for a more linear speed increase as I add computing cores? The 
> new intel Core2 CPUs are inherently very fast, and my mdrun simulation 
> performance is becoming asymptotic to a value only about twice the speed 
> I can get from a single core.

The throughput rate is a better measure of performance than the GFlops 
reported by GROMACS' internal accounting. See 
http://www.gromacs.org/gromacs/benchmark/benchmarks.html

> With mdrun_mpi I am calculating a 240aa protein and ligand for 10,000 
> time intervals. Here are the results for various combinations of one, 
> two, three, four and five cores.
> 
> One local core only running mdrun:        18.3 hr/nsec    2.61 GFlops
> Two local cores:                           9.98 hr/nsec    4.83 GFlops
> Three local cores:                         7.35 hr/nsec    6.65 GFlops
> Four local cores (one also controlling):   7.72 hr/nsec    6.42 GFlops
> Three local cores and two remote cores:    7.59 hr/nsec    6.72 GFlops
> One local and two remote cores:            9.76 hr/nsec    5.02 GFlops

Here, the best you can expect from three local cores is 6.1 hr/ns 
(18.3 / 3), *if* there are no limitations from memory or I/O - and that 
18.3 hr/ns number was probably measured with the rest of the machine 
unloaded, so it isn't necessarily a realistic baseline. Given Erik's 
suggestion, is 7.35 hr/ns really so bad?
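
If it helps, here's a quick back-of-the-envelope check (a minimal sketch 
in Python; it assumes the 18.3 hr/ns single-core figure is a fair 
baseline, which it may not be if the box was otherwise idle, and it 
ignores the fact that the remote cores in your mixed runs are slower):

  # naive linear-scaling estimate from the numbers you posted
  single_core = 18.3                      # hr/ns on one core
  measured = {2: 9.98, 3: 7.35, 4: 7.72, 5: 7.59}
  for n, t in sorted(measured.items()):
      ideal = single_core / n             # perfect linear scaling
      print("%d cores: ideal %.2f hr/ns, measured %.2f hr/ns, %.0f%% of ideal"
            % (n, ideal, t, 100 * ideal / t))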

> I get good performance with one local core doing control, and three 
> doing calculations, giving 6.66 Gflops. However, adding two extra remote 
> cores only increases the speed a very small amount to 6.72 Gflops, even 
> though the log (below) shows good task distribution (I think).

Not really... you're spending nearly half of your flops (45.6%, in 
Coul(T) + LJ [W3-W3], the nonbonded loops optimized for interactions 
between 3-point waters) at only 86% scaling, because CPU0 is doing only 
about half as much of that work as the others. That's because CPU0 has 
the whole protein/ligand on it.
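
As far as I recall, that per-row "Scaling" figure is roughly the average 
node load divided by the maximum node load (that's my reading of it, not 
gospel), which is why one underloaded node drags the whole row down. A 
quick sanity check against your W3-W3 row, in Python:

  # sketch: scaling ~= average load / maximum load
  loads = [60, 116, 108, 106, 107]        # Coul(T) + LJ [W3-W3], per node
  print("%.0f%%" % (100.0 * sum(loads) / len(loads) / max(loads)))   # ~86%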

To fix this, particularly on a heterogeneous cluster, I think you should 
be using the -sort and -shuffle options to grompp - see the man page, and 
the example below.
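
Something like this, with placeholder file names (check grompp -h on 
your version for the exact options, and use -np to match the number of 
nodes you will run on, if your build asks for it):

  grompp -np 5 -shuffle -sort -f run.mdp -c conf.gro -p topol.top -o topol.tpr

If I remember correctly, -shuffle reorders the molecules and grompp 
writes a deshuffling index you can later use with trjconv to map the 
trajectory back to the original atom order.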

> Is there some problem with scaling when using these new fast CPUs? Can I 
> tweak anything in mdrun_mpi to give better scaling?

In short, no :-) More comments below.

>         M E G A - F L O P S   A C C O U N T I N G
> 
>         Parallel run - timing based on wallclock.
>    RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>    T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>    NF=No Forces
> 
>  Computing:                        M-Number         M-Flops  % of Flops
> -----------------------------------------------------------------------
>  LJ                              928.067418    30626.224794     1.1
>  Coul(T)                         886.762558    37244.027436     1.4
>  Coul(T) [W3]                     92.882138    11610.267250     0.4
>  Coul(T) + LJ                    599.004388    32945.241340     1.2
>  Coul(T) + LJ [W3]               243.730360    33634.789680     1.2
>  Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000    45.6
>  Outer nonbonded loop            945.783063     9457.830630     0.3
>  1,4 nonbonded interactions       41.184118     3706.570620     0.1
>  Spread Q Bspline              51931.592640   103863.185280     3.8
>  Gather F Bspline              51931.592640   623179.111680    22.6
>  3D-FFT                        40498.449440   323987.595520    11.7
>  Solve PME                      3000.300000   192019.200000     7.0
>  NS-Pairs                       1044.424912    21932.923152     0.8
>  Reset In Box                     24.064040      216.576360     0.0
>  Shift-X                         961.696160     5770.176960     0.2
>  CG-CoM                            8.242234      239.024786     0.0
>  Sum Forces                      721.272120      721.272120     0.0
>  Bonds                            25.022502     1075.967586     0.0
>  Angles                           36.343634     5924.012342     0.2
>  Propers                          13.411341     3071.197089     0.1
>  Impropers                        12.171217     2531.613136     0.1
>  Virial                          241.774175     4351.935150     0.2
>  Ext.ens. Update                 240.424040    12982.898160     0.5
>  Stop-CM                         240.400000     2404.000000     0.1
>  Calc-Ekin                       240.448080     6492.098160     0.2
>  Constraint-V                    240.424040     1442.544240     0.1
>  Constraint-Vir                  215.884746     5181.233904     0.2
>  Settle                           71.961582    23243.590986     0.8
> -----------------------------------------------------------------------
>  Total                                       2757465.194361   100.0
> -----------------------------------------------------------------------
> 
>                NODE (s)   Real (s)      (%)
>        Time:    408.000    408.000    100.0
>                        6:48

A 6-minute run is at the low end for a benchmark. Nobody simulates for 
only 10 ps... I would go at least a factor of ten longer for 
benchmarking.
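
For example, if your timestep is 1 fs, something like this in the .mdp 
gives a 100 ps benchmark instead of 10 ps (the dt value here is just an 
assumption - keep whatever you already use):

  dt      = 0.001    ; ps
  nsteps  = 100000   ; 100000 * 0.001 ps = 100 ps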

>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:     14.810      6.758      3.176      7.556
> 
> Detailed load balancing info in percentage of average
> Type        NODE:  0   1   2   3   4 Scaling
> -------------------------------------------
>              LJ:423   0   3  41  32     23%
>         Coul(T):500   0   0   0   0     20%
>    Coul(T) [W3]:  0   0  32 291 176     34%
>    Coul(T) + LJ:500   0   0   0   0     20%
> Coul(T) + LJ [W3]:  0   0  24 296 178     33%
> Coul(T) + LJ [W3-W3]: 60 116 108 106 107     86%
> Outer nonbonded loop:246  42  45  79  85     40%
> 1,4 nonbonded interactions:500   0   0   0   0     20%
> Spread Q Bspline: 98 100 102 100  97     97%
> Gather F Bspline: 98 100 102 100  97     97%
>          3D-FFT:100 100 100 100 100    100%
>       Solve PME:100 100 100 100 100    100%
>        NS-Pairs:107  96  91 103 100     93%
>    Reset In Box: 99 100 100 100  99     99%
>         Shift-X: 99 100 100 100  99     99%
>          CG-CoM:110  97  97  97  97     90%
>      Sum Forces:100 100 100  99  99     99%
>           Bonds:499   0   0   0   0     20%
>          Angles:500   0   0   0   0     20%
>         Propers:499   0   0   0   0     20%
>       Impropers:500   0   0   0   0     20%
>          Virial: 99 100 100 100  99     99%
> Ext.ens. Update: 99 100 100 100  99     99%
>         Stop-CM: 99 100 100 100  99     99%
>       Calc-Ekin: 99 100 100 100  99     99%
>    Constraint-V: 99 100 100 100  99     99%
>  Constraint-Vir: 54 111 111 111 111     89%
>          Settle: 54 111 111 111 111     89%
> 
>     Total Force: 93 102  97 104 102     95%
> 
> 
>     Total Shake: 56 110 110 110 110     90%
> 
> 
> Total Scaling: 95% of max performance
> 
> Finished mdrun on node 0 Sun May 27 07:29:57 2007

> Erik,
> I also have older systems which use Opteron 165 CPUs. I have run tests of the AMD Opteron 165 CPUs (2.18GHz) against the Intel Core2 Duos (3GHz). Twelve concurrent AutoDock jobs on each machine show the Core2 duos outperforming the Opterons by a factor of two.

One worthwhile test is to run four copies of the same single-CPU job 
simultaneously on the same new node. The memory and disk accesses will 
then de-synchronise, and you can see whether either of those is going to 
be rate-limiting for a four-CPU job. Those numbers are a much better 
baseline for judging scaling than a one-CPU job with the rest of the box 
(presumably) unloaded.
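
For example (a rough sketch - the directory and file names are 
placeholders, and each directory needs its own single-processor .tpr):

  # start four independent single-core mdrun copies on the quad-core node
  for i in 1 2 3 4; do
    ( cd run$i && mdrun -s bench.tpr -g bench.log ) &
  done
  wait   # then compare each bench.log timing against your single-job number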

> The data I posted showed inconsistencies which have nothing to do with memory bandwidth, and I was rather hoping for an analysis based upon the manner in which GROMACS mdrun distributes its computing tasks.

In some cases they're also confounded with the interconnect performance.

> I don't believe my data shows memory bandwidth-limiting effects. For example, three 'local' CPUs on the quad core are faster (6.65Gflops) than one of the Quads (5.02 Gflops) and two from the cluster. How does that support the memory bandwidth hypothesis?

So here you've got three faster CPUs outperforming a combination of one 
faster CPU and two slower CPUs across a Gigabit network? That's not a 
huge surprise. You'd need a strong memory-bandwidth effect to hurt the 
former enough to overcome the two handicaps of the latter (slower cores 
plus the network).

> I figured that it might be possible that the GAMMA MP software is causing overhead, but when I examined the distribution of tasks by GROMACS (in the log I provided) it would seem that the tasks which mdrun distributed to GAMMA actually were distributed well, but that that the manner in which CPU0 hogged most of the mdrun calculations might be a bottleneck. It was insight into GROMACS' mdrun distribution methodology which I was seeking. Is there any quantitative data available for me to review?

CPU0 is not hogging - it's underloaded, if anything.

Mark


