[gmx-users] GROMACS not scaling well with Core2 Quad CPUs
Trevor Marshall
trevor at trevormarshall.com
Mon May 28 01:59:26 CEST 2007
Erik,
I also have older systems that use Opteron 165 CPUs, and I have benchmarked
the AMD Opteron 165 (2.18 GHz) against the Intel Core2 Duo (3 GHz): twelve
concurrent AutoDock jobs on each machine show the Core2 Duo outperforming
the Opteron by a factor of two.
The data I posted showed inconsistencies which have nothing to do with
memory bandwidth, and I was rather hoping for an analysis based upon the
manner in which GROMACS mdrun distributes its computing tasks.
I don't believe my data shows memory-bandwidth-limiting effects. For
example, three 'local' cores on the quad-core chip are faster (6.65 GFlops)
than one core of the quad plus two cores from the remote cluster machine
(5.02 GFlops). How does that support the memory-bandwidth hypothesis?
I thought the MPI-GAMMA software might be adding overhead, but when I
examined the distribution of tasks by GROMACS (in the log I provided), the
work that mdrun handed out over GAMMA appears to have been distributed
well; the real bottleneck seems to be the way CPU0 hogs most of the mdrun
calculations. It is insight into mdrun's task-distribution methodology that
I am seeking.
Is there any quantitative data available for me to review?
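For reference, a rough way to quantify this from the log itself is sketched
below. It is only a sketch: it assumes the "Detailed load balancing info"
table keeps the layout shown further down this email, and it simply flags
the interaction types where node 0 carries far more than its share.

    #!/usr/bin/env python
    # Tally the "Detailed load balancing info" table from md.log.
    # A sketch only: assumes each data line looks like
    #   <name>:  p0  p1  p2  p3  p4  <scaling>%
    # where p0..pN are percentages of the per-node average.
    import sys

    def node_zero_hogs(logfile, factor=2):
        """Return interaction types whose node-0 load is >= factor * average."""
        hogs = []
        in_table = False
        for line in open(logfile):
            if "Detailed load balancing info" in line:
                in_table = True
                continue
            if not in_table or ":" not in line:
                continue
            name, _, rest = line.partition(":")
            fields = rest.split()
            if not fields or not fields[-1].endswith("%"):
                continue            # header, separator or summary line
            try:
                per_node = [int(x) for x in fields[:-1]]
            except ValueError:
                continue
            average = sum(per_node) / float(len(per_node))
            if average > 0 and per_node[0] >= factor * average:
                hogs.append((name.strip(), per_node[0]))
        return hogs

    if __name__ == "__main__":
        for name, load in node_zero_hogs(sys.argv[1]):
            print("%-28s node 0 at %d%% of average" % (name, load))

Run against the five-node log below, this flags the bonded terms and the
non-water tabulated Coulomb/LJ loops as sitting almost entirely on node 0,
which is exactly the CPU0 hogging I described above.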
Sincerely
Trevor
At 12:45 PM 5/27/2007, Erik Lindahl wrote:
>Hi Trevor,
>
>It's probably due to memory bandwidth limitations, as well as Intel's design.
>
>Intel managed to get quad cores to market by gluing together two dual-core
>chips. All communication between them has to go over the front side bus
>though, and all eight cores in a system share the bandwidth to memory.
>
>This can become a problem when you're running in parallel, since all eight
>processes are communicating (=using the bus bandwidth) at once, and have
>to share it. You will probably get much better performance by running
>multiple (8) independent simulations.
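>For instance, something along these lines (a sketch only; the directory
>names, core count and mdrun options are placeholders to adapt to your
>setup) keeps each core busy on its own single-core copy of the system:
>
>    import os, subprocess
>
>    # Launch one single-core mdrun per core, each in its own directory and
>    # pinned to that core with taskset.  Assumes run0 ... run7 each already
>    # contain a topol.tpr; adjust the range and file names as needed.
>    for core in range(8):
>        rundir = "run%d" % core
>        out = open(os.path.join(rundir, "mdrun.out"), "w")
>        subprocess.Popen(
>            ["taskset", "-c", str(core),
>             "mdrun", "-s", "topol.tpr", "-g", "md.log"],
>            cwd=rundir, stdout=out, stderr=subprocess.STDOUT)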
>
>Essentially, there's no such thing as a free lunch. Intel's quad-core
>chips are cheap, but have the same drawback as their first generation
>dual-core chips. AMD's solution with real quad-cores and on-chip memory
>controllers in Barcelona is looking a whole lot better, but I also expect
>it to be quite a bit more expensive.
>
>You might want to test the CVS version for better scaling. The lower
>amount of data communicated there might improve performance a bit for you.
>
>Cheers,
>
>Erik
>
>
>On May 27, 2007, at 6:28 PM, Trevor Marshall wrote:
>
>>Can anybody give me any ideas that might help me optimize my new cluster
>>for a more linear speed increase as I add computing cores? The new Intel
>>Core2 CPUs are inherently very fast, but my mdrun simulation performance
>>is levelling off at only about twice the speed I can get from a single
>>core.
>>
>>I have included the log output from mdrun_mpi when using five cores at
>>the foot of this email, but first, here is the system overview.
>>
>>My cluster comprises two computers running Fedora Core 6 and MPI-GAMMA.
>>Both have Intel Core2 CPUs running at a 3 GHz core speed (overclocked).
>>The main machine now has a sparkling new four-core Core2 Quad CPU and the
>>remote machine still has a dual-core Core2 Duo CPU.
>>
>>The networking hardware is crossover CAT6 cabling. The GAMMA traffic runs
>>through one Intel PRO/1000 board in each computer, with MTU 9000. A
>>Gigabit adapter with a Realtek chipset is the primary Linux network in
>>each machine, with MTU 1500. For the common filesystem I am running NFS
>>with "async" declared in the exports file: /dev/hde1 is mounted on
>>/media, and /media is then exported via NFS to the cluster machine. File
>>I/O does not seem to be a bottleneck.
>>
>>With mdrun_mpi I am simulating a 240-residue protein plus ligand for
>>10,000 time steps. Here are the results for various combinations of one,
>>two, three, four and five cores.
>>
>>One local core only running mdrun:        18.3 hr/nsec   2.61 GFlops
>>Two local cores:                          9.98 hr/nsec   4.83 GFlops
>>Three local cores:                        7.35 hr/nsec   6.65 GFlops
>>Four local cores (one also controlling):  7.72 hr/nsec   6.42 GFlops
>>Three local cores and two remote cores:   7.59 hr/nsec   6.72 GFlops
>>One local core and two remote cores:      9.76 hr/nsec   5.02 GFlops
>>
>>I get good performance with one local core doing control and three doing
>>calculations, giving 6.66 GFlops. However, adding two extra remote cores
>>increases the speed only marginally, to 6.72 GFlops, even though the log
>>(below) seems to show good task distribution.
>>
>>Is there some problem with scaling when using these new fast CPUs? Can I
>>tweak anything in mdrun_mpi to give better scaling?
>>
>>Sincerely
>>Trevor
>>------------------------------------------
>>Trevor G Marshall, PhD
>>School of Biological Sciences and Biotechnology, Murdoch University,
>>Western Australia
>>Director, Autoimmunity Research Foundation, Thousand Oaks, California
>>Patron, Australian Autoimmunity Foundation.
>>------------------------------------------
>>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> Parallel run - timing based on wallclock.
>> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
>> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> NF=No Forces
>>
>> Computing: M-Number M-Flops % of Flops
>>-----------------------------------------------------------------------
>> LJ 928.067418 30626.224794 1.1
>> Coul(T) 886.762558 37244.027436 1.4
>> Coul(T) [W3] 92.882138 11610.267250 0.4
>> Coul(T) + LJ 599.004388 32945.241340 1.2
>> Coul(T) + LJ [W3] 243.730360 33634.789680 1.2
>> Coul(T) + LJ [W3-W3] 3292.173000 1257610.086000 45.6
>> Outer nonbonded loop 945.783063 9457.830630 0.3
>> 1,4 nonbonded interactions 41.184118 3706.570620 0.1
>> Spread Q Bspline 51931.592640 103863.185280 3.8
>> Gather F Bspline 51931.592640 623179.111680 22.6
>> 3D-FFT 40498.449440 323987.595520 11.7
>> Solve PME 3000.300000 192019.200000 7.0
>> NS-Pairs 1044.424912 21932.923152 0.8
>> Reset In Box 24.064040 216.576360 0.0
>> Shift-X 961.696160 5770.176960 0.2
>> CG-CoM 8.242234 239.024786 0.0
>> Sum Forces 721.272120 721.272120 0.0
>> Bonds 25.022502 1075.967586 0.0
>> Angles 36.343634 5924.012342 0.2
>> Propers 13.411341 3071.197089 0.1
>> Impropers 12.171217 2531.613136 0.1
>> Virial 241.774175 4351.935150 0.2
>> Ext.ens. Update 240.424040 12982.898160 0.5
>> Stop-CM 240.400000 2404.000000 0.1
>> Calc-Ekin 240.448080 6492.098160 0.2
>> Constraint-V 240.424040 1442.544240 0.1
>> Constraint-Vir 215.884746 5181.233904 0.2
>> Settle 71.961582 23243.590986 0.8
>>-----------------------------------------------------------------------
>> Total 2757465.194361 100.0
>>-----------------------------------------------------------------------
>>
>> NODE (s) Real (s) (%)
>> Time: 408.000 408.000 100.0
>> 6:48
>> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
>>Performance: 14.810 6.758 3.176 7.556
>>
>>Detailed load balancing info in percentage of average
>>Type NODE: 0 1 2 3 4 Scaling
>>-------------------------------------------
>> LJ:423 0 3 41 32 23%
>> Coul(T):500 0 0 0 0 20%
>> Coul(T) [W3]: 0 0 32 291 176 34%
>> Coul(T) + LJ:500 0 0 0 0 20%
>>Coul(T) + LJ [W3]: 0 0 24 296 178 33%
>>Coul(T) + LJ [W3-W3]: 60 116 108 106 107 86%
>>Outer nonbonded loop:246 42 45 79 85 40%
>>1,4 nonbonded interactions:500 0 0 0 0 20%
>>Spread Q Bspline: 98 100 102 100 97 97%
>>Gather F Bspline: 98 100 102 100 97 97%
>> 3D-FFT:100 100 100 100 100 100%
>> Solve PME:100 100 100 100 100 100%
>> NS-Pairs:107 96 91 103 100 93%
>> Reset In Box: 99 100 100 100 99 99%
>> Shift-X: 99 100 100 100 99 99%
>> CG-CoM:110 97 97 97 97 90%
>> Sum Forces:100 100 100 99 99 99%
>> Bonds:499 0 0 0 0 20%
>> Angles:500 0 0 0 0 20%
>> Propers:499 0 0 0 0 20%
>> Impropers:500 0 0 0 0 20%
>> Virial: 99 100 100 100 99 99%
>>Ext.ens. Update: 99 100 100 100 99 99%
>> Stop-CM: 99 100 100 100 99 99%
>> Calc-Ekin: 99 100 100 100 99 99%
>> Constraint-V: 99 100 100 100 99 99%
>> Constraint-Vir: 54 111 111 111 111 89%
>> Settle: 54 111 111 111 111 89%
>>
>> Total Force: 93 102 97 104 102 95%
>>
>>
>> Total Shake: 56 110 110 110 110 90%
>>
>>
>>Total Scaling: 95% of max performance
>>
>>Finished mdrun on node 0 Sun May 27 07:29:57 2007
>>