[gmx-users] GROMACS not scaling well with Core 2 Quad CPUs

Trevor Marshall trevor at trevormarshall.com
Sun May 27 18:28:00 CEST 2007


Can anybody give me ideas that might help me optimize my new cluster for a 
more linear speed increase as I add computing cores? The new Intel Core 2 
CPUs are inherently very fast, but my mdrun simulation performance is 
becoming asymptotic to a value only about twice the speed I can get from a 
single core.

I have included the log output from mdrun_mpi when using five cores at the 
foot of this email. But first, here is the system overview.

My cluster comprises two computers running Fedora Core 6 and MPI over 
GAMMA. Both have Intel Core 2 CPUs running at a 3 GHz core clock 
(overclocked). The main machine now has a sparkling new four-core Core 2 
Quad CPU, and the remote machine still has a dual-core Core 2 Duo CPU.

The networking hardware is crossover CAT6 cables. The GAMMA traffic runs 
through one Intel PRO/1000 board in each computer, with MTU 9000. A Gigabit 
adapter with a Realtek chipset is the primary Linux network interface in 
each machine, with MTU 1500. For the common filesystem I run NFS with 
"async" declared in the exports file: /dev/hde1 is mounted at /media, and 
/media is then exported via NFS to the other cluster machine. File I/O does 
not seem to be a bottleneck.
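
For completeness, the export and MTU setup is along the following lines. 
This is only a sketch: the subnet and interface name are illustrative 
placeholders rather than my exact values.

    # /etc/exports on the main machine ("async" is the relevant option;
    # the subnet is a placeholder)
    /media  192.168.1.0/24(rw,async,no_subtree_check)

    # jumbo frames on the Intel PRO/1000 interface that carries the GAMMA
    # traffic (the interface name is a placeholder)
    ifconfig eth1 mtu 9000 up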

With mdrun_mpi I am calculating a 240-residue protein plus ligand for 
10,000 time steps. Here are the results for various combinations of one, 
two, three, four and five cores.

One local core only running mdrun:         18.3 hr/ns    2.61 GFlops
Two local cores:                            9.98 hr/ns    4.83 GFlops
Three local cores:                          7.35 hr/ns    6.65 GFlops
Four local cores (one also controlling):    7.72 hr/ns    6.42 GFlops
Three local cores and two remote cores:     7.59 hr/ns    6.72 GFlops
One local core and two remote cores:        9.76 hr/ns    5.02 GFlops

I get good performance with one local core doing control and three doing 
calculations, giving 6.66 GFlops. However, adding two extra remote cores 
increases the speed only slightly, to 6.72 GFlops, even though the log 
(below) appears to show good task distribution.
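
A quick speedup calculation from the hr/ns figures above (using the 
18.3 hr/ns single-core run as the baseline) shows how flat the curve 
already is; the bc invocations are just for illustration:

    # parallel speedup relative to one local core, from the hr/ns numbers above
    echo "18.3 / 9.98" | bc -l    # two cores:   ~1.83x  (~92% parallel efficiency)
    echo "18.3 / 7.35" | bc -l    # three cores: ~2.49x  (~83% parallel efficiency)
    echo "18.3 / 7.59" | bc -l    # five cores:  ~2.41x  (~48% parallel efficiency)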

Is there some problem with scaling when using these new fast CPUs? Can I 
tweak anything in mdrun_mpi to give better scaling?
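
For reference, the five-core runs are launched roughly as sketched below. 
This is only an outline: the input file names and the machinefile are 
placeholders, and I have left out the GAMMA-specific launcher details.

    # GROMACS 3.x: the node count is fixed at preprocessing time with grompp -np
    grompp -np 5 -f md.mdp -c conf.gro -p topol.top -o topol.tpr

    # start five MPI processes (MPICH-style mpirun; the machinefile lists the two hosts)
    mpirun -np 5 -machinefile machines mdrun_mpi -s topol.tpr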

Sincerely
Trevor
------------------------------------------
Trevor G Marshall, PhD
School of Biological Sciences and Biotechnology, Murdoch University, 
Western Australia
Director, Autoimmunity Research Foundation, Thousand Oaks, California
Patron, Australian Autoimmunity Foundation.
------------------------------------------

         M E G A - F L O P S   A C C O U N T I N G

         Parallel run - timing based on wallclock.
    RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
    T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
    NF=No Forces

  Computing:                        M-Number         M-Flops  % of Flops
-----------------------------------------------------------------------
  LJ                              928.067418    30626.224794     1.1
  Coul(T)                         886.762558    37244.027436     1.4
  Coul(T) [W3]                     92.882138    11610.267250     0.4
  Coul(T) + LJ                    599.004388    32945.241340     1.2
  Coul(T) + LJ [W3]               243.730360    33634.789680     1.2
  Coul(T) + LJ [W3-W3]           3292.173000  1257610.086000    45.6
  Outer nonbonded loop            945.783063     9457.830630     0.3
  1,4 nonbonded interactions       41.184118     3706.570620     0.1
  Spread Q Bspline              51931.592640   103863.185280     3.8
  Gather F Bspline              51931.592640   623179.111680    22.6
  3D-FFT                        40498.449440   323987.595520    11.7
  Solve PME                      3000.300000   192019.200000     7.0
  NS-Pairs                       1044.424912    21932.923152     0.8
  Reset In Box                     24.064040      216.576360     0.0
  Shift-X                         961.696160     5770.176960     0.2
  CG-CoM                            8.242234      239.024786     0.0
  Sum Forces                      721.272120      721.272120     0.0
  Bonds                            25.022502     1075.967586     0.0
  Angles                           36.343634     5924.012342     0.2
  Propers                          13.411341     3071.197089     0.1
  Impropers                        12.171217     2531.613136     0.1
  Virial                          241.774175     4351.935150     0.2
  Ext.ens. Update                 240.424040    12982.898160     0.5
  Stop-CM                         240.400000     2404.000000     0.1
  Calc-Ekin                       240.448080     6492.098160     0.2
  Constraint-V                    240.424040     1442.544240     0.1
  Constraint-Vir                  215.884746     5181.233904     0.2
  Settle                           71.961582    23243.590986     0.8
-----------------------------------------------------------------------
  Total                                       2757465.194361   100.0
-----------------------------------------------------------------------

                NODE (s)   Real (s)      (%)
        Time:    408.000    408.000    100.0
                        6:48
                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     14.810      6.758      3.176      7.556

Detailed load balancing info in percentage of average
Type        NODE:  0   1   2   3   4 Scaling
-------------------------------------------
              LJ:423   0   3  41  32     23%
         Coul(T):500   0   0   0   0     20%
    Coul(T) [W3]:  0   0  32 291 176     34%
    Coul(T) + LJ:500   0   0   0   0     20%
Coul(T) + LJ [W3]:  0   0  24 296 178     33%
Coul(T) + LJ [W3-W3]: 60 116 108 106 107     86%
Outer nonbonded loop:246  42  45  79  85     40%
1,4 nonbonded interactions:500   0   0   0   0     20%
Spread Q Bspline: 98 100 102 100  97     97%
Gather F Bspline: 98 100 102 100  97     97%
          3D-FFT:100 100 100 100 100    100%
       Solve PME:100 100 100 100 100    100%
        NS-Pairs:107  96  91 103 100     93%
    Reset In Box: 99 100 100 100  99     99%
         Shift-X: 99 100 100 100  99     99%
          CG-CoM:110  97  97  97  97     90%
      Sum Forces:100 100 100  99  99     99%
           Bonds:499   0   0   0   0     20%
          Angles:500   0   0   0   0     20%
         Propers:499   0   0   0   0     20%
       Impropers:500   0   0   0   0     20%
          Virial: 99 100 100 100  99     99%
Ext.ens. Update: 99 100 100 100  99     99%
         Stop-CM: 99 100 100 100  99     99%
       Calc-Ekin: 99 100 100 100  99     99%
    Constraint-V: 99 100 100 100  99     99%
  Constraint-Vir: 54 111 111 111 111     89%
          Settle: 54 111 111 111 111     89%

     Total Force: 93 102  97 104 102     95%


     Total Shake: 56 110 110 110 110     90%


Total Scaling: 95% of max performance

Finished mdrun on node 0 Sun May 27 07:29:57 2007
