[gmx-users] GROMACS not scaling well with Core2 Quad technology CPUs
Erik Lindahl
lindahl at cbr.su.se
Sun May 27 21:45:47 CEST 2007
Hi Trevor,
It's probably due to memory bandwidth limitations, which follow from
Intel's current quad-core design.
Intel managed to get quad cores to market by gluing two dual-core
dies together in one package. All communication between them has to
go over the front-side bus, though, and in a two-socket system all
eight cores share the bandwidth to memory.
This becomes a problem when you run in parallel, since all eight
processes are communicating (i.e. using bus bandwidth) at once and
have to share it. You will probably get much better aggregate
throughput by running multiple (e.g. eight) independent simulations
instead.
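
As a minimal sketch (directory and input names here are just
placeholders), you could start one single-core run per core from
separate directories so the output files don't clash:

  (cd run1 && mdrun -s topol.tpr > mdrun.out 2>&1) &
  (cd run2 && mdrun -s topol.tpr > mdrun.out 2>&1) &
  # ...and so on, one line per core

Each run then proceeds independently, with no inter-process
communication to synchronize, so the aggregate throughput is usually
much higher even though the runs still share memory bandwidth.
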
Essentially, there's no such thing as a free lunch. Intel's quad-core
chips are cheap, but they have the same drawback as the
first-generation dual-core chips. AMD's Barcelona, with real quad
cores and on-chip memory controllers, is looking a whole lot better,
but I also expect it to be quite a bit more expensive.
You might also want to test the CVS version for better scaling; the
smaller amount of data it communicates might improve performance a
bit for you.
Cheers,
Erik
On May 27, 2007, at 6:28 PM, Trevor Marshall wrote:
> Can anybody give me any ideas that might help me optimize my new
> cluster for a more linear speed increase as I add computing cores?
> The new Intel Core2 CPUs are inherently very fast, but my mdrun
> simulation performance levels off at only about twice the speed I
> can get from a single core.
>
> I have included the log output from mdrun_mpi when using five cores
> at the foot of this email, but here is the system overview first.
>
> My cluster comprises two computers running Fedora Core 6 and
> MPI-GAMMA. Both have Intel Core2 CPUs running at a 3 GHz core clock
> (overclocked). The main machine now has a sparkling new Core2 Quad
> four-core CPU, and the remote machine still has a dual-core Core2
> Duo CPU.
>
> The networking hardware is crossover CAT6 cabling. The GAMMA
> software runs over one Intel PRO/1000 board in each computer, with
> MTU 9000. A Gigabit adapter with a Realtek chipset is the primary
> Linux network in each machine, with MTU 1500. For the common
> filesystem I am running NFS with "async" declared in the exports
> file: /dev/hde1 is mounted on /media, and /media is then exported
> via NFS to the cluster machine. File I/O does not seem to be a
> bottleneck.
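>
> For reference, the relevant /etc/exports entry looks roughly like
> this (the host address below is only a placeholder for the remote
> node):
>
>   /media  192.168.0.2(rw,async)
>
> The remote machine then simply NFS-mounts that export.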
>
> With mdrun_mpi I am simulating a 240-residue protein plus a ligand
> for 10,000 time steps. Here are the results for various combinations
> of one, two, three, four and five cores.
>
> One local core only (running plain mdrun):  18.3 hr/nsec   2.61 GFlops
> Two local cores:                            9.98 hr/nsec   4.83 GFlops
> Three local cores:                          7.35 hr/nsec   6.65 GFlops
> Four local cores (one also controlling):    7.72 hr/nsec   6.42 GFlops
> Three local cores and two remote cores:     7.59 hr/nsec   6.72 GFlops
> One local core and two remote cores:        9.76 hr/nsec   5.02 GFlops
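>
> For reference, the five-process runs were launched essentially as
> follows (the file names are just the GROMACS defaults, and
> MPI-GAMMA's mpirun syntax may differ slightly):
>
>   grompp -np 5 -f grompp.mdp -c conf.gro -p topol.top -o topol.tpr
>   mpirun -np 5 mdrun_mpi -s topol.tpr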
>
> I get good performance with one local core doing control and three
> doing calculations, giving 6.66 GFlops. However, adding two extra
> remote cores increases the speed only slightly, to 6.72 GFlops, even
> though the log (below) seems to show good task distribution.
>
> Is there some problem with scaling when using these new fast CPUs?
> Can I tweak anything in mdrun_mpi to give better scaling?
>
> Sincerely
> Trevor
> ------------------------------------------
> Trevor G Marshall, PhD
> School of Biological Sciences and Biotechnology, Murdoch
> University, Western Australia
> Director, Autoimmunity Research Foundation, Thousand Oaks, California
> Patron, Australian Autoimmunity Foundation.
> ------------------------------------------
>
> M E G A - F L O P S A C C O U N T I N G
>
> Parallel run - timing based on wallclock.
> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
> NF=No Forces
>
> Computing:                     M-Number          M-Flops  % of Flops
> ----------------------------------------------------------------------
> LJ 928.067418 30626.224794 1.1
> Coul(T) 886.762558 37244.027436 1.4
> Coul(T) [W3] 92.882138 11610.267250 0.4
> Coul(T) + LJ 599.004388 32945.241340 1.2
> Coul(T) + LJ [W3] 243.730360 33634.789680 1.2
> Coul(T) + LJ [W3-W3] 3292.173000 1257610.086000 45.6
> Outer nonbonded loop 945.783063 9457.830630 0.3
> 1,4 nonbonded interactions 41.184118 3706.570620 0.1
> Spread Q Bspline 51931.592640 103863.185280 3.8
> Gather F Bspline 51931.592640 623179.111680 22.6
> 3D-FFT 40498.449440 323987.595520 11.7
> Solve PME 3000.300000 192019.200000 7.0
> NS-Pairs 1044.424912 21932.923152 0.8
> Reset In Box 24.064040 216.576360 0.0
> Shift-X 961.696160 5770.176960 0.2
> CG-CoM 8.242234 239.024786 0.0
> Sum Forces 721.272120 721.272120 0.0
> Bonds 25.022502 1075.967586 0.0
> Angles 36.343634 5924.012342 0.2
> Propers 13.411341 3071.197089 0.1
> Impropers 12.171217 2531.613136 0.1
> Virial 241.774175 4351.935150 0.2
> Ext.ens. Update 240.424040 12982.898160 0.5
> Stop-CM 240.400000 2404.000000 0.1
> Calc-Ekin 240.448080 6492.098160 0.2
> Constraint-V 240.424040 1442.544240 0.1
> Constraint-Vir 215.884746 5181.233904 0.2
> Settle 71.961582 23243.590986 0.8
> ----------------------------------------------------------------------
> Total                                   2757465.194361        100.0
> ----------------------------------------------------------------------
>
> NODE (s) Real (s) (%)
> Time: 408.000 408.000 100.0
> 6:48
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 14.810 6.758 3.176 7.556
>
> Detailed load balancing info in percentage of average
> Type                    NODE:    0    1    2    3    4  Scaling
> ----------------------------------------------------------------
> LJ:                            423    0    3   41   32      23%
> Coul(T):                       500    0    0    0    0      20%
> Coul(T) [W3]:                    0    0   32  291  176      34%
> Coul(T) + LJ:                  500    0    0    0    0      20%
> Coul(T) + LJ [W3]:               0    0   24  296  178      33%
> Coul(T) + LJ [W3-W3]:           60  116  108  106  107      86%
> Outer nonbonded loop:          246   42   45   79   85      40%
> 1,4 nonbonded interactions:    500    0    0    0    0      20%
> Spread Q Bspline:               98  100  102  100   97      97%
> Gather F Bspline:               98  100  102  100   97      97%
> 3D-FFT:                        100  100  100  100  100     100%
> Solve PME:                     100  100  100  100  100     100%
> NS-Pairs:                      107   96   91  103  100      93%
> Reset In Box:                   99  100  100  100   99      99%
> Shift-X:                        99  100  100  100   99      99%
> CG-CoM:                        110   97   97   97   97      90%
> Sum Forces:                    100  100  100   99   99      99%
> Bonds:                         499    0    0    0    0      20%
> Angles:                        500    0    0    0    0      20%
> Propers:                       499    0    0    0    0      20%
> Impropers:                     500    0    0    0    0      20%
> Virial:                         99  100  100  100   99      99%
> Ext.ens. Update:                99  100  100  100   99      99%
> Stop-CM:                        99  100  100  100   99      99%
> Calc-Ekin:                      99  100  100  100   99      99%
> Constraint-V:                   99  100  100  100   99      99%
> Constraint-Vir:                 54  111  111  111  111      89%
> Settle:                         54  111  111  111  111      89%
>
> Total Force: 93 102 97 104 102 95%
>
>
> Total Shake: 56 110 110 110 110 90%
>
>
> Total Scaling: 95% of max performance
>
> Finished mdrun on node 0 Sun May 27 07:29:57 2007
>