[gmx-users] GROMACS not scaling well with Core2 Quad technology CPUs
Trevor Marshall
trevor at trevormarshall.com
Sun May 27 18:28:00 CEST 2007
Can anybody give me any ideas that might help me optimize my new cluster
for a more linear speed increase as I add computing cores? The new Intel
Core2 CPUs are inherently very fast, but my mdrun simulation performance is
levelling off at only about twice the speed I can get from a single core.
I have included the log output from mdrun_mpi when using five cores at the
foot of this email. But first, here is the system overview.
My cluster comprises two computers running Fedora Core 6 and
MPI-GAMMA. Both have Intel Core2 CPUs running at a 3 GHz core speed
(overclocked). The main machine now has a sparkling new four-core Core2
Quad CPU, and the remote machine still has a dual-core Core2 Duo CPU.
The networking hardware is crossover CAT6 cables. The GAMMA traffic goes
through one Intel PRO/1000 board in each computer, with an MTU of 9000. A
Gigabit adapter with a Realtek chipset provides the primary Linux network in
each machine, with an MTU of 1500. For the common filesystem I run NFS with
"async" declared in the exports file: /dev/hde1 is mounted at /media, and
/media is then exported via NFS to the other cluster machine. File I/O does
not seem to be a bottleneck.
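For reference, the relevant pieces of that setup look roughly like this (the
client subnet, the hostname, and the options other than "async" are
illustrative, not copied from my actual files):

  # /etc/exports on the main machine -- subnet and extra options are placeholders
  /media   192.168.1.0/24(rw,async,no_subtree_check)

  # on the remote machine, mount the exported filesystem ("mainnode" is a placeholder)
  mount -t nfs mainnode:/media /media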
With mdrun_mpi I am calculating a 240-residue protein plus a ligand for
10,000 time steps. Here are the results for various combinations of one,
two, three, four and five cores:
One local core only (running mdrun):      18.3 hr/ns   2.61 GFlops
Two local cores:                          9.98 hr/ns   4.83 GFlops
Three local cores:                        7.35 hr/ns   6.65 GFlops
Four local cores (one also controlling):  7.72 hr/ns   6.42 GFlops
Three local + two remote cores:           7.59 hr/ns   6.72 GFlops
One local + two remote cores:             9.76 hr/ns   5.02 GFlops
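Expressed as speedup over the single-core run (speedup = 18.3 / hr-per-ns,
efficiency = speedup divided by the number of cores), those timings work out
to roughly:

  Two local cores:                  1.83x  (92% efficiency)
  Three local cores:                2.49x  (83%)
  Four local cores:                 2.37x  (59%)
  One local + two remote cores:     1.88x  (63%)
  Three local + two remote cores:   2.41x  (48%)

which is the levelling-off I mentioned above.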
I get good performance with one local core handling control and three doing
calculations, giving 6.66 GFlops. However, adding two extra remote cores
increases this only slightly, to 6.72 GFlops, even though the log (below)
appears to show good task distribution.
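For completeness, each parallel run is launched roughly like this, GROMACS
3.x style, with the system pre-partitioned by grompp (the file names and
exact flags below are illustrative rather than my literal command lines):

  # pre-partition the run input over 5 nodes (file names are placeholders)
  grompp -np 5 -f md.mdp -c conf.gro -p topol.top -o topol.tpr

  # launch the MPI run across the GAMMA-connected nodes
  mpirun -np 5 mdrun_mpi -s topol.tpr -o traj.trr -c confout.gro -g md.log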
Is there some problem with scaling when using these new fast CPUs? Can I
tweak anything in mdrun_mpi to give better scaling?
Sincerely
Trevor
------------------------------------------
Trevor G Marshall, PhD
School of Biological Sciences and Biotechnology, Murdoch University,
Western Australia
Director, Autoimmunity Research Foundation, Thousand Oaks, California
Patron, Australian Autoimmunity Foundation.
------------------------------------------
M E G A - F L O P S A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
 Computing:                       M-Number         M-Flops   % of Flops
-----------------------------------------------------------------------
 LJ                             928.067418    30626.224794          1.1
 Coul(T)                        886.762558    37244.027436          1.4
 Coul(T) [W3]                    92.882138    11610.267250          0.4
 Coul(T) + LJ                   599.004388    32945.241340          1.2
 Coul(T) + LJ [W3]              243.730360    33634.789680          1.2
 Coul(T) + LJ [W3-W3]          3292.173000  1257610.086000         45.6
 Outer nonbonded loop           945.783063     9457.830630          0.3
 1,4 nonbonded interactions      41.184118     3706.570620          0.1
 Spread Q Bspline             51931.592640   103863.185280          3.8
 Gather F Bspline             51931.592640   623179.111680         22.6
 3D-FFT                       40498.449440   323987.595520         11.7
 Solve PME                     3000.300000   192019.200000          7.0
 NS-Pairs                      1044.424912    21932.923152          0.8
 Reset In Box                    24.064040      216.576360          0.0
 Shift-X                        961.696160     5770.176960          0.2
 CG-CoM                           8.242234      239.024786          0.0
 Sum Forces                     721.272120      721.272120          0.0
 Bonds                           25.022502     1075.967586          0.0
 Angles                          36.343634     5924.012342          0.2
 Propers                         13.411341     3071.197089          0.1
 Impropers                       12.171217     2531.613136          0.1
 Virial                         241.774175     4351.935150          0.2
 Ext.ens. Update                240.424040    12982.898160          0.5
 Stop-CM                        240.400000     2404.000000          0.1
 Calc-Ekin                      240.448080     6492.098160          0.2
 Constraint-V                   240.424040     1442.544240          0.1
 Constraint-Vir                 215.884746     5181.233904          0.2
 Settle                          71.961582    23243.590986          0.8
-----------------------------------------------------------------------
 Total                                       2757465.194361       100.0
-----------------------------------------------------------------------
               NODE (s)   Real (s)      (%)
       Time:    408.000    408.000    100.0
                            (6:48)
               (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
Performance:     14.810      6.758      3.176       7.556
Detailed load balancing info in percentage of average
 Type                         NODE:    0    1    2    3    4   Scaling
 ----------------------------------------------------------------------
 LJ:                                 423    0    3   41   32       23%
 Coul(T):                            500    0    0    0    0       20%
 Coul(T) [W3]:                         0    0   32  291  176       34%
 Coul(T) + LJ:                       500    0    0    0    0       20%
 Coul(T) + LJ [W3]:                    0    0   24  296  178       33%
 Coul(T) + LJ [W3-W3]:                60  116  108  106  107       86%
 Outer nonbonded loop:               246   42   45   79   85       40%
 1,4 nonbonded interactions:         500    0    0    0    0       20%
 Spread Q Bspline:                    98  100  102  100   97       97%
 Gather F Bspline:                    98  100  102  100   97       97%
 3D-FFT:                             100  100  100  100  100      100%
 Solve PME:                          100  100  100  100  100      100%
 NS-Pairs:                           107   96   91  103  100       93%
 Reset In Box:                        99  100  100  100   99       99%
 Shift-X:                             99  100  100  100   99       99%
 CG-CoM:                             110   97   97   97   97       90%
 Sum Forces:                         100  100  100   99   99       99%
 Bonds:                              499    0    0    0    0       20%
 Angles:                             500    0    0    0    0       20%
 Propers:                            499    0    0    0    0       20%
 Impropers:                          500    0    0    0    0       20%
 Virial:                              99  100  100  100   99       99%
 Ext.ens. Update:                     99  100  100  100   99       99%
 Stop-CM:                             99  100  100  100   99       99%
 Calc-Ekin:                           99  100  100  100   99       99%
 Constraint-V:                        99  100  100  100   99       99%
 Constraint-Vir:                      54  111  111  111  111       89%
 Settle:                              54  111  111  111  111       89%

 Total Force:                         93  102   97  104  102       95%
 Total Shake:                         56  110  110  110  110       90%

 Total Scaling: 95% of max performance
Finished mdrun on node 0 Sun May 27 07:29:57 2007