[gmx-users] GROMACS not scaling well with Core2 Quad technology CPUs
Yang Ye
leafyoung at yahoo.com
Mon May 28 03:46:50 CEST 2007
Marshall, would you like to give LAM-MPI a try?
Also, there is a patch that improves the communication in the PME part;
it has been featured on the GROMACS homepage:
http://wwwuser.gwdg.de/~ckutzne/
Regards,
Yang Ye
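
For anyone who wants to try that, a minimal sketch of launching mdrun_mpi
under LAM-MPI, written as a small Python wrapper. It assumes lamboot, mpirun
and lamhalt (LAM-MPI) plus an mdrun_mpi binary built against LAM are on the
PATH; the hostfile name, process count and .tpr name are placeholders.

# Hypothetical launch script for a parallel mdrun under LAM-MPI.
# Assumes lamboot/mpirun/lamhalt (LAM-MPI) and an mdrun_mpi built against
# LAM are installed and on the PATH; all names below are placeholders.
import subprocess

NPROCS = 4           # number of MPI processes to start
HOSTFILE = "hosts"   # LAM boot schema listing the participating nodes

subprocess.run(["lamboot", HOSTFILE], check=True)        # start the LAM daemons
subprocess.run(["mpirun", "-np", str(NPROCS),
                "mdrun_mpi", "-s", "topol.tpr", "-v"],
               check=True)                                # run mdrun_mpi in parallel
subprocess.run(["lamhalt"], check=True)                   # shut the daemons down again
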
On 5/28/2007 9:07 AM, Mark Abraham wrote:
> Trevor Marshall wrote:
>> Can anybody give me any ideas which might help me optimize my new
>> cluster for a more linear speed increase as I add computing cores?
>> The new Intel Core2 CPUs are inherently very fast, and my mdrun
>> simulation performance is becoming asymptotic to a value only about
>> twice the speed I can get from a single core.
>
> The throughput rate is a better measure of performance than the Gflops
> reported by GROMACS internal accounting. See
> http://www.gromacs.org/gromacs/benchmark/benchmarks.html
>
>> With mdrun_mpi I am calculating a 240aa protein and ligand for 10,000
>> time intervals. Here are the results for various combinations of one,
>> two, three, four and five cores.
>>
>> One local core only running mdrun:        18.3 hr/nsec   2.61 GFlops
>> Two local cores:                           9.98 hr/nsec   4.83 GFlops
>> Three local cores:                         7.35 hr/nsec   6.65 GFlops
>> Four local cores (one also controlling):   7.72 hr/nsec   6.42 GFlops
>> Three local cores and two remote cores:    7.59 hr/nsec   6.72 GFlops
>> One local and two remote cores:            9.76 hr/nsec   5.02 GFlops
>
> Here, the best you can expect three local cores to return is 6.1 h/ns,
> *if* there are no limitations from memory or I/O - and that 18.3 h/ns
> number is probably with the rest of the machine unloaded, and so is
> not demonstrably realistic. Given Erik's suggestion, how is 7.35 h/ns
> so bad?
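
As a quick sanity check of that arithmetic, using only the timings quoted
above (a Python sketch; nothing GROMACS-specific):

# Back-of-the-envelope scaling check from the timings quoted above.
single_core = 18.3   # hr/ns on one local core
three_cores = 7.35   # hr/ns on three local cores

ideal_three = single_core / 3            # perfect linear scaling -> 6.1 hr/ns
speedup     = single_core / three_cores  # actual speedup         -> ~2.49x
efficiency  = speedup / 3                # parallel efficiency    -> ~83%

print(f"ideal 3-core time: {ideal_three:.1f} hr/ns")
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
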
>
>> I get good performance with one local core doing control, and three
>> doing calculations, giving 6.66 Gflops. However, adding two extra
>> remote cores only increases the speed a very small amount to 6.72
>> Gflops, even though the log (below) shows good task distribution (I
>> think).
>
> Not really... you're spending nearly half your simulation time (45.6% of
> the flops, in Coul(T) + LJ [W3-W3], the nonbonded loops optimized for
> interactions between 3-point water) getting only 86% scaling, because CPU0
> is only doing about half of the work of the others. That's because it's
> got the whole protein/ligand on it.
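
That 86% figure presumably falls out of the per-node load: the most heavily
loaded node sets the pace for the kernel, so the scaling is roughly the
average load divided by the maximum. A sketch using the Coul(T) + LJ [W3-W3]
row of the load-balancing table further down:

# Per-node load for Coul(T) + LJ [W3-W3], in percent of the average
# (taken from the "Detailed load balancing" table below).
loads = [60, 116, 108, 106, 107]

average = sum(loads) / len(loads)    # ~99.4 by construction
scaling = average / max(loads)       # most loaded node sets the pace -> ~0.86

print(f"estimated scaling: {scaling:.0%}")   # ~86%, matching the log
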
>
> To fix this, particularly for heterogeneous cluster cases, I think you
> should be using the -sort and -shuffle options to grompp - see the man page.
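
A sketch of what the preprocessing step might then look like, assuming a
GROMACS 3.x grompp on the PATH; the file names and node count are
placeholders:

# Hypothetical preprocessing step: regenerate the .tpr with -shuffle and
# -sort so that the protein/ligand atoms are spread more evenly over the
# nodes. Assumes a GROMACS 3.x grompp; file names are placeholders.
import subprocess

subprocess.run(["grompp",
                "-f", "md.mdp",        # run parameters
                "-c", "conf.gro",      # starting coordinates
                "-p", "topol.top",     # topology
                "-np", "5",            # number of nodes the run will use
                "-shuffle", "-sort",   # redistribute atoms for better load balance
                "-o", "topol.tpr"],
               check=True)
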
>
>> Is there some problem with scaling when using these new fast CPUs?
>> Can I tweak anything in mdrun_mpi to give better scaling?
>
> In short, no :-) More comments below.
>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> Parallel run - timing based on wallclock.
>> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
>> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> NF=No Forces
>>
>> Computing: M-Number M-Flops % of Flops
>> -----------------------------------------------------------------------
>> LJ 928.067418 30626.224794 1.1
>> Coul(T) 886.762558 37244.027436 1.4
>> Coul(T) [W3] 92.882138 11610.267250 0.4
>> Coul(T) + LJ 599.004388 32945.241340 1.2
>> Coul(T) + LJ [W3] 243.730360 33634.789680 1.2
>> Coul(T) + LJ [W3-W3] 3292.173000 1257610.086000 45.6
>> Outer nonbonded loop 945.783063 9457.830630 0.3
>> 1,4 nonbonded interactions 41.184118 3706.570620 0.1
>> Spread Q Bspline 51931.592640 103863.185280 3.8
>> Gather F Bspline 51931.592640 623179.111680 22.6
>> 3D-FFT 40498.449440 323987.595520 11.7
>> Solve PME 3000.300000 192019.200000 7.0
>> NS-Pairs 1044.424912 21932.923152 0.8
>> Reset In Box 24.064040 216.576360 0.0
>> Shift-X 961.696160 5770.176960 0.2
>> CG-CoM 8.242234 239.024786 0.0
>> Sum Forces 721.272120 721.272120 0.0
>> Bonds 25.022502 1075.967586 0.0
>> Angles 36.343634 5924.012342 0.2
>> Propers 13.411341 3071.197089 0.1
>> Impropers 12.171217 2531.613136 0.1
>> Virial 241.774175 4351.935150 0.2
>> Ext.ens. Update 240.424040 12982.898160 0.5
>> Stop-CM 240.400000 2404.000000 0.1
>> Calc-Ekin 240.448080 6492.098160 0.2
>> Constraint-V 240.424040 1442.544240 0.1
>> Constraint-Vir 215.884746 5181.233904 0.2
>> Settle 71.961582 23243.590986 0.8
>> -----------------------------------------------------------------------
>> Total 2757465.194361 100.0
>> -----------------------------------------------------------------------
>>
>> NODE (s) Real (s) (%)
>> Time: 408.000 408.000 100.0
>> 6:48
>
> A 6-minute simulation is pushing the low end for a benchmark. Nobody
> simulates for only 10 ps... I would go at least a factor of ten longer
> for benchmarking.
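
Going a factor of ten longer is just a change to nsteps in the .mdp file; a
sketch of the arithmetic, assuming the 10,000 steps mentioned in the original
post and a 1 fs timestep (consistent with the 10 ps remark):

# Scale the benchmark length by 10x, as suggested above.
current_nsteps = 10000    # steps in the original benchmark (from the post)
dt_fs          = 1.0      # assumed timestep in fs
factor         = 10       # "at least a factor of ten longer"

nsteps = current_nsteps * factor
print(f"set nsteps = {nsteps} in the .mdp  (~{nsteps * dt_fs / 1000:.0f} ps)")
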
>
>> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
>> Performance: 14.810 6.758 3.176 7.556
>>
>> Detailed load balancing info in percentage of average
>> Type NODE: 0 1 2 3 4 Scaling
>> -------------------------------------------
>> LJ:423 0 3 41 32 23%
>> Coul(T):500 0 0 0 0 20%
>> Coul(T) [W3]: 0 0 32 291 176 34%
>> Coul(T) + LJ:500 0 0 0 0 20%
>> Coul(T) + LJ [W3]: 0 0 24 296 178 33%
>> Coul(T) + LJ [W3-W3]: 60 116 108 106 107 86%
>> Outer nonbonded loop:246 42 45 79 85 40%
>> 1,4 nonbonded interactions:500 0 0 0 0 20%
>> Spread Q Bspline: 98 100 102 100 97 97%
>> Gather F Bspline: 98 100 102 100 97 97%
>> 3D-FFT:100 100 100 100 100 100%
>> Solve PME:100 100 100 100 100 100%
>> NS-Pairs:107 96 91 103 100 93%
>> Reset In Box: 99 100 100 100 99 99%
>> Shift-X: 99 100 100 100 99 99%
>> CG-CoM:110 97 97 97 97 90%
>> Sum Forces:100 100 100 99 99 99%
>> Bonds:499 0 0 0 0 20%
>> Angles:500 0 0 0 0 20%
>> Propers:499 0 0 0 0 20%
>> Impropers:500 0 0 0 0 20%
>> Virial: 99 100 100 100 99 99%
>> Ext.ens. Update: 99 100 100 100 99 99%
>> Stop-CM: 99 100 100 100 99 99%
>> Calc-Ekin: 99 100 100 100 99 99%
>> Constraint-V: 99 100 100 100 99 99%
>> Constraint-Vir: 54 111 111 111 111 89%
>> Settle: 54 111 111 111 111 89%
>>
>> Total Force: 93 102 97 104 102 95%
>>
>>
>> Total Shake: 56 110 110 110 110 90%
>>
>>
>> Total Scaling: 95% of max performance
>>
>> Finished mdrun on node 0 Sun May 27 07:29:57 2007
>
>> Erik,
>> I also have older systems which use Opteron 165 CPUs. I have run
>> tests of the AMD Opteron 165 CPUs (2.18GHz) against the Intel Core2
>> Duos (3GHz). Twelve concurrent AutoDock jobs on each machine show the
>> Core2 Duos outperforming the Opterons by a factor of two.
>
> One worthwhile test is running four copies of the same single-cpu job
> on the same new node. Now the memory and disk access will
> de-synchronise and you might see whether either of these is going to
> be rate-limiting for a four-cpu job. These numbers are a much better
> comparison for scaling than a one-cpu job with the rest of the box
> unloaded (presumably).
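
A sketch of that test: launch several independent single-CPU copies of the
same job at once and time them. It assumes a serial mdrun binary on the PATH
and a prepared topol.tpr in directories run0/ through run3/ (all
placeholders):

# Launch N independent single-CPU copies of the same mdrun job concurrently,
# as suggested above, to see whether memory/disk contention hurts throughput.
# Assumes a serial mdrun and a prepared topol.tpr in run0/ .. run3/.
import subprocess, time

N = 4
start = time.time()
procs = [subprocess.Popen(["mdrun", "-s", "topol.tpr"], cwd=f"run{i}")
         for i in range(N)]
for p in procs:
    p.wait()                      # wait for all copies to finish
elapsed = time.time() - start

print(f"{N} concurrent copies finished in {elapsed / 3600:.2f} h")

If the per-copy time is much worse than a single run on an otherwise idle
box, memory or disk bandwidth is a likely culprit for the parallel job too.
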
>
>> The data I posted showed inconsistencies which have nothing to do
>> with memory bandwidth, and I was rather hoping for an analysis based
>> upon the manner in which GROMACS mdrun distributes its computing tasks.
>
> They're also confounded with the interconnect performance in some cases.
>
>> I don't believe my data shows memory bandwidth-limiting effects. For
>> example, three 'local' CPUs on the quad core are faster (6.65Gflops)
>> than one of the Quads (5.02 Gflops) and two from the cluster. How
>> does that support the memory bandwidth hypothesis?
>
> So here you've got 3 faster CPUs out-performing 1 faster CPU and 2
> slower CPUs across a Gigabit network? That's not a huge surprise.
> You'd need a strong memory bandwidth effect for the former to get hurt
> enough to overcome the two limitations in the latter.
>
>> I figured that it might be possible that the GAMMA MP software is
>> causing overhead, but when I examined the distribution of tasks by
>> GROMACS (in the log I provided) it would seem that the tasks which
>> mdrun distributed to GAMMA actually were distributed well, but that
>> the manner in which CPU0 hogged most of the mdrun calculations
>> might be a bottleneck. It was insight into GROMACS' mdrun
>> distribution methodology which I was seeking. Is there any
>> quantitative data available for me to review?
>
> CPU0 is not hogging - it's underloaded, if anything.
>
> Mark