[gmx-users] GROMACS not scaling well with Core2 Quad technology CPUs

Erik Lindahl lindahl at cbr.su.se
Sun May 27 21:45:47 CEST 2007


Hi Trevor,

It's probably due to memory bandwidth limitations, which come straight
from Intel's quad-core design.

Intel managed to get quad cores to market by gluing together two
dual-core dies in one package. All communication between the dies has
to go over the front side bus, though, and every core on that bus
shares the same bandwidth to memory.

This becomes a problem when you run in parallel, since all the
processes are communicating (i.e. using the bus bandwidth) at once and
have to share it. You will probably get much better aggregate
performance by running multiple independent simulations, one per core.
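
If you go that route, something like this is a minimal way to launch
the independent runs (again only a sketch: the directory layout, the
-deffnm prefix and the core count are assumptions to adapt to your own
setup; taskset is the standard Linux CPU-pinning utility).

import subprocess

NCORES = 4                      # cores on the quad machine

procs = []
for core in range(NCORES):
    # one serial mdrun per core, each working on its own run<N>/ input
    cmd = ["taskset", "-c", str(core),
           "mdrun", "-deffnm", "run%d/topol" % core]
    procs.append(subprocess.Popen(cmd))      # start without waiting

for p in procs:
    p.wait()                                 # block until every run finishes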

Essentially, there's no such thing as a free lunch. Intel's quad-core  
chips are cheap, but have the same drawback as their first generation  
dual-core chips. AMD's solution with real quad-cores and on-chip  
memory controllers in Barcelona is looking a whole lot better, but I  
also expect it to be quite a bit more expensive.

You might also want to test the CVS version; it communicates less data
between nodes, which might improve scaling a bit for you.
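
Just to put numbers on the scaling you are seeing, here is the
arithmetic (plain Python; the hr/nsec figures are taken straight from
the timings quoted below, with 18.3 hr/nsec as the single-core
reference):

single = 18.3                   # hr/nsec on one local core

runs = {                        # cores used -> hr/nsec reported
    2: 9.98,                    # two local
    3: 7.35,                    # three local
    4: 7.72,                    # four local (one also controlling)
    5: 7.59,                    # three local + two remote
}

for cores, hr_per_ns in runs.items():
    speedup = single / hr_per_ns
    efficiency = speedup / cores
    print("%d cores: speedup %.2fx, efficiency %.0f%%"
          % (cores, speedup, efficiency * 100))

Beyond three cores the parallel efficiency drops below 60%, which is
exactly the bandwidth wall described above.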

Cheers,

Erik


On May 27, 2007, at 6:28 PM, Trevor Marshall wrote:

> Can anybody give me any ideas that might help me optimize my new
> cluster for a more linear speed increase as I add computing cores?
> The new Intel Core2 CPUs are inherently very fast, but my mdrun
> simulation performance is becoming asymptotic to a value only about
> twice the speed I can get from a single core.
>
> I have included the log output from mdrun_mpi when using 5 cores at
> the foot of this email, but first, here is the system overview.
>
> My cluster comprises two computers running Fedora Core 6 and
> MPI-GAMMA. Both have Intel Core2 CPUs running at a 3 GHz core speed
> (overclocked). The main machine now has a sparkling new four-core
> Core2 Quad CPU, and the remote machine still has a dual-core Core2
> Duo CPU.
>
> Networking hardware is crossover CAT6 cable. The GAMMA traffic runs
> through one Intel PRO/1000 board in each computer, with MTU 9000. A
> Gigabit adapter with a Realtek chipset is the primary Linux network
> in each machine, with MTU 1500. For the common filesystem I am
> running NFS with "async" declared in the exports file: /dev/hde1 is
> mounted on /media, and /media is exported via NFS to the other
> cluster machine. File I/O does not seem to be a bottleneck.
>
> With mdrun_mpi I am simulating a 240-residue protein plus ligand for
> 10,000 time steps. Here are the results for various combinations of
> one, two, three, four and five cores.
>
> One local core only (mdrun):               18.3 hr/nsec    2.61 GFlops
> Two local cores:                            9.98 hr/nsec    4.83 GFlops
> Three local cores:                          7.35 hr/nsec    6.65 GFlops
> Four local cores (one also controlling):    7.72 hr/nsec    6.42 GFlops
> Three local cores + two remote cores:       7.59 hr/nsec    6.72 GFlops
> One local core + two remote cores:          9.76 hr/nsec    5.02 GFlops
>
> I get good performance with one local core doing control and three
> doing calculations, giving 6.66 GFlops. However, adding two extra
> remote cores only increases the speed slightly, to 6.72 GFlops, even
> though the log (below) shows good task distribution (I think).
>
> Is there some problem with scaling when using these new fast CPUs?  
> Can I tweak anything in mdrun_mpi to give better scaling?
>
> Sincerely
> Trevor
> ------------------------------------------
> Trevor G Marshall, PhD
> School of Biological Sciences and Biotechnology, Murdoch  
> University, Western Australia
> Director, Autoimmunity Research Foundation, Thousand Oaks, California
> Patron, Australian Autoimmunity Foundation.
> ------------------------------------------
>
>         M E G A - F L O P S   A C C O U N T I N G
>
>         Parallel run - timing based on wallclock.
>    RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>    T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>    NF=No Forces
>
>  Computing:                        M-Number         M-Flops  % of Flops
> -----------------------------------------------------------------------
>  LJ                              928.067418    30626.224794     1.1
>  Coul(T)                         886.762558    37244.027436     1.4
>  Coul(T) [W3]                     92.882138    11610.267250     0.4
>  Coul(T) + LJ                    599.004388    32945.241340     1.2
>  Coul(T) + LJ [W3]               243.730360    33634.789680     1.2
>  Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000    45.6
>  Outer nonbonded loop            945.783063     9457.830630     0.3
>  1,4 nonbonded interactions       41.184118     3706.570620     0.1
>  Spread Q Bspline              51931.592640   103863.185280     3.8
>  Gather F Bspline              51931.592640   623179.111680    22.6
>  3D-FFT                        40498.449440   323987.595520    11.7
>  Solve PME                      3000.300000   192019.200000     7.0
>  NS-Pairs                       1044.424912    21932.923152     0.8
>  Reset In Box                     24.064040      216.576360     0.0
>  Shift-X                         961.696160     5770.176960     0.2
>  CG-CoM                            8.242234      239.024786     0.0
>  Sum Forces                      721.272120      721.272120     0.0
>  Bonds                            25.022502     1075.967586     0.0
>  Angles                           36.343634     5924.012342     0.2
>  Propers                          13.411341     3071.197089     0.1
>  Impropers                        12.171217     2531.613136     0.1
>  Virial                          241.774175     4351.935150     0.2
>  Ext.ens. Update                 240.424040    12982.898160     0.5
>  Stop-CM                         240.400000     2404.000000     0.1
>  Calc-Ekin                       240.448080     6492.098160     0.2
>  Constraint-V                    240.424040     1442.544240     0.1
>  Constraint-Vir                  215.884746     5181.233904     0.2
>  Settle                           71.961582    23243.590986     0.8
> -----------------------------------------------------------------------
>  Total                                       2757465.194361   100.0
> -----------------------------------------------------------------------
>
>                NODE (s)   Real (s)      (%)
>        Time:    408.000    408.000    100.0
>                        6:48
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:     14.810      6.758      3.176      7.556
>
> Detailed load balancing info in percentage of average
> Type        NODE:  0   1   2   3   4 Scaling
> -------------------------------------------
>              LJ:423   0   3  41  32     23%
>         Coul(T):500   0   0   0   0     20%
>    Coul(T) [W3]:  0   0  32 291 176     34%
>    Coul(T) + LJ:500   0   0   0   0     20%
> Coul(T) + LJ [W3]:  0   0  24 296 178     33%
> Coul(T) + LJ [W3-W3]: 60 116 108 106 107     86%
> Outer nonbonded loop:246  42  45  79  85     40%
> 1,4 nonbonded interactions:500   0   0   0   0     20%
> Spread Q Bspline: 98 100 102 100  97     97%
> Gather F Bspline: 98 100 102 100  97     97%
>          3D-FFT:100 100 100 100 100    100%
>       Solve PME:100 100 100 100 100    100%
>        NS-Pairs:107  96  91 103 100     93%
>    Reset In Box: 99 100 100 100  99     99%
>         Shift-X: 99 100 100 100  99     99%
>          CG-CoM:110  97  97  97  97     90%
>      Sum Forces:100 100 100  99  99     99%
>           Bonds:499   0   0   0   0     20%
>          Angles:500   0   0   0   0     20%
>         Propers:499   0   0   0   0     20%
>       Impropers:500   0   0   0   0     20%
>          Virial: 99 100 100 100  99     99%
> Ext.ens. Update: 99 100 100 100  99     99%
>         Stop-CM: 99 100 100 100  99     99%
>       Calc-Ekin: 99 100 100 100  99     99%
>    Constraint-V: 99 100 100 100  99     99%
>  Constraint-Vir: 54 111 111 111 111     89%
>          Settle: 54 111 111 111 111     89%
>
>     Total Force: 93 102  97 104 102     95%
>
>
>     Total Shake: 56 110 110 110 110     90%
>
>
> Total Scaling: 95% of max performance
>
> Finished mdrun on node 0 Sun May 27 07:29:57 2007
>
