[gmx-users] Parallel Gromacs Benchmarking with Opteron Dual-Core & Gigabit Ethernet
Kazem Jahanbakhsh
k_jahanbakhsh at yahoo.com
Sun Jul 22 18:08:59 CEST 2007
Dear gmx users,
I have set up a Linux cluster consisting of 8 nodes with the following specification:
Node HW: two dual-core Opteron 2212 CPUs per node (2 GHz, 1 MB cache per core), i.e. 4 cores per node, plus 2 GB RAM and Gigabit Ethernet NICs.
Network infrastructure: Gigabit Ethernet (Catalyst 2960 switch) + the Linux TCP/IP stack.
OS: Fedora Core 5.
I configured and compiled LAM/MPI (with default parameters), FFTW (with double-precision support) and GROMACS (MPI-enabled, double precision) separately on the cluster without any problems.
I then benchmarked the cluster with the gmxbench package, running the parallel DPPC benchmark system that is provided with it.
Starting with a single node, I launched four processes on its four cores with the following commands:
grompp -np 4 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr
and then
mpirun -np 4 mdrun_d -v -deffnm grompp
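For completeness, the LAM runtime has to be booted before mpirun can be used; for a single node this looks roughly like the following (the hostfile name is arbitrary, and as far as I know "cpu=4" is LAM's boot-schema shorthand for declaring four slots on a host):
# boot LAM on the local node only; the boot schema declares 4 CPU slots
echo "localhost cpu=4" > lamhosts
lamboot -v lamhosts
lamnodes    # sanity check: should list one node with 4 CPUs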
Everything seemed fine: all four cores were utilized at over 90%, and the following benchmark results were obtained:
M E G A - F L O P S A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
Computing: M-Number M-Flops % of Flops
-----------------------------------------------------------------------
LJ 13783.350611 454850.570163 10.7
Coulomb 11511.123348 310800.330396 7.3
Coulomb [W3] 1477.194071 118175.525680 2.8
Coulomb [W3-W3] 2305.011660 539372.728440 12.7
Coulomb + LJ 6733.896263 255888.057994 6.0
Coulomb + LJ [W3] 2980.257052 271203.391732 6.4
Coulomb + LJ [W3-W3] 5589.019105 1369309.680725 32.4
Outer nonbonded loop 2892.716574 28927.165740 0.7
1,4 nonbonded interactions 148.509696 13365.872640 0.3
NS-Pairs 29597.265161 621542.568381 14.7
Reset In Box 61.049856 549.448704 0.0
Shift-X 1218.803712 7312.822272 0.2
CG-CoM 30.268416 877.784064 0.0
Sum Forces 1828.205568 1828.205568 0.0
Angles 291.898368 47579.433984 1.1
Propers 87.057408 19936.146432 0.5
Impropers 15.363072 3195.518976 0.1
RB-Dihedrals 122.904576 30357.430272 0.7
Virial 609.941964 10978.955352 0.3
Update 609.401856 18891.457536 0.4
Stop-CM 609.280000 6092.800000 0.1
Calc-Ekin 609.523712 16457.140224 0.4
Lincs 251.030528 15061.831680 0.4
Lincs-Mat 3504.181248 14016.724992 0.3
Constraint-V 609.401856 3656.411136 0.1
Constraint-Vir 604.522496 14508.539904 0.3
Settle 117.830656 38059.301888 0.9
-----------------------------------------------------------------------
Total 4232795.844875 100.0
-----------------------------------------------------------------------
              NODE (s)   Real (s)       (%)
Time:         2799.000   2799.000     100.0
                         46:39
              (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
Performance:    15.856      1.512      0.309      77.750
Then in the next step I repeated the above simulation, but on two nodes. For this phase I created a lamboot hostfile, with which the LAM daemons were started, containing:
Node-1 (repeated 4 times)
Node-2 (repeated 4 times)
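Concretely, the boot schema looked something like this (node1 and node2 are placeholders for our actual hostnames, and as far as I know listing a host four times is equivalent to LAM's "cpu=4" shorthand; the three-node hostfile used later is built the same way, just with Node-3 added):
# lamhosts boot schema for the two-node run (hostnames are placeholders)
node1
node1
node1
node1
node2
node2
node2
node2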
and then executed the following commands:
grompp -np 8 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr
and then
mpirun -np 8 mdrun_d -v -deffnm grompp
In this phase there were four mdrun processes running on every node, with a per-core utilization of about 60-70% (as reported by top). The benchmarks:
M E G A - F L O P S A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
Computing: M-Number M-Flops % of Flops
-----------------------------------------------------------------------
LJ 13784.875630 454900.895790 10.7
Coulomb 16468.628499 444652.969473 10.5
Coulomb [W3] 866.583623 69326.689840 1.6
Coulomb [W3-W3] 2304.621876 539281.518984 12.7
Coulomb + LJ 8301.510488 315457.398544 7.4
Coulomb + LJ [W3] 1413.085299 128590.762209 3.0
Coulomb + LJ [W3-W3] 5588.477053 1369176.877985 32.3
Outer nonbonded loop 2958.469329 29584.693290 0.7
1,4 nonbonded interactions 148.509696 13365.872640 0.3
NS-Pairs 29582.701180 621236.724780 14.7
Reset In Box 61.049856 549.448704 0.0
Shift-X 1218.803712 7312.822272 0.2
CG-CoM 30.268416 877.784064 0.0
Sum Forces 2437.607424 2437.607424 0.1
Angles 291.898368 47579.433984 1.1
Propers 87.057408 19936.146432 0.5
Impropers 15.363072 3195.518976 0.1
RB-Dihedrals 122.904576 30357.430272 0.7
Virial 610.482072 10988.677296 0.3
Update 609.401856 18891.457536 0.4
Stop-CM 609.280000 6092.800000 0.1
Calc-Ekin 609.523712 16457.140224 0.4
Lincs 251.030528 15061.831680 0.4
Lincs-Mat 3504.181248 14016.724992 0.3
Constraint-V 609.401856 3656.411136 0.1
Constraint-Vir 604.522496 14508.539904 0.3
Settle 117.830656 38059.301888 0.9
-----------------------------------------------------------------------
Total 4235553.480319 100.0
-----------------------------------------------------------------------
              NODE (s)   Real (s)       (%)
Time:         1337.000   1337.000     100.0
                         22:17
              (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
Performance:    36.446      3.168      0.646      37.139
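As a quick check of the scaling, comparing the wallclock times and GFlops of the one-node and two-node runs gives roughly a factor of two:
# speedup of the 8-core (2-node) run over the 4-core (1-node) run
echo "scale=3; 2799/1337" | bc    # = 2.093, i.e. ~105% of the ideal 2x
echo "scale=3; 3.168/1.512" | bc  # = 2.095, the GFlops ratio agrees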
Not so bad: I got a scaling efficiency of about 100%, but the bad news is that the utilization of every core dropped to about 60%. Finally, I repeated all of the above steps, but on three physical nodes. The lamboot hostfile for lamd:
Node-1 (repeated 4 times because of four cores on every node)
Node-2 (repeated 4 times)
Node-3 (repeated 4 times)
and then executed the following commands:
grompp -np 12 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr
and then
mpirun -np 12 mdrun_d -v -deffnm grompp
In this phase there were four mdrun processes running on every node, with a per-core utilization of about 45-50%. The benchmarks:
M E G A - F L O P S A C C O U N T I N G
Parallel run - timing based on wallclock.
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
NF=No Forces
Computing: M-Number M-Flops % of Flops
-----------------------------------------------------------------------
LJ 13784.842940 454899.817020 10.7
Coulomb 14582.358701 393723.684927 9.3
Coulomb [W3] 1138.280373 91062.429840 2.1
Coulomb [W3-W3] 2306.683307 539763.893838 12.7
Coulomb + LJ 7768.301364 295195.451832 7.0
Coulomb + LJ [W3] 1946.513725 177132.748975 4.2
Coulomb + LJ [W3-W3] 5594.156904 1370568.441480 32.3
Outer nonbonded loop 3059.386119 30593.861190 0.7
1,4 nonbonded interactions 148.509696 13365.872640 0.3
NS-Pairs 29577.291883 621123.129543 14.7
Reset In Box 61.049856 549.448704 0.0
Shift-X 1218.803712 7312.822272 0.2
CG-CoM 30.268416 877.784064 0.0
Sum Forces 4265.812992 4265.812992 0.1
Angles 291.898368 47579.433984 1.1
Propers 87.057408 19936.146432 0.5
Impropers 15.363072 3195.518976 0.1
RB-Dihedrals 122.904576 30357.430272 0.7
Virial 611.022180 10998.399240 0.3
Update 609.401856 18891.457536 0.4
Stop-CM 609.280000 6092.800000 0.1
Calc-Ekin 609.523712 16457.140224 0.4
Lincs 251.030528 15061.831680 0.4
Lincs-Mat 3504.181248 14016.724992 0.3
Constraint-V 609.401856 3656.411136 0.1
Constraint-Vir 604.522496 14508.539904 0.3
Settle 117.830656 38059.301888 0.9
-----------------------------------------------------------------------
Total 4239246.335581 100.0
-----------------------------------------------------------------------
              NODE (s)   Real (s)       (%)
Time:         1272.000   1272.000     100.0
                         21:12
              (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
Performance:    37.045      3.333      0.679      35.333
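The same quick check for the three-node run makes the problem obvious:
# speedup over the single-node run: only ~2.2x on 3x the cores (~73% efficiency)
echo "scale=3; 2799/1272" | bc    # = 2.200
# gain from adding the third node to the two-node run: barely 5%
echo "scale=3; 1337/1272" | bc    # = 1.051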
Very bad scalability!
I expected about 4.5 GFlops, but the result is essentially the same as the two-node run; in other words, the third node did nothing for us at all. I searched the GROMACS mailing lists and found many threads on this subject, and I suspect that Gigabit Ethernet's latency is the performance killer here. I would like to know whether there is any solution to this problem, such as recompiling the kernel, tuning TCP/IP stack parameters, recompiling LAM, setting up the simulations differently, or anything else.
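For example, one kind of TCP tuning I have seen mentioned is enlarging the kernel's socket buffers via sysctl; a rough sketch is below (the values are only examples, and I do not know whether this helps with latency rather than bandwidth):
# enlarge the maximum and default TCP socket buffer sizes (example values only)
sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576
sysctl -w net.ipv4.tcp_rmem="4096 87380 1048576"
sysctl -w net.ipv4.tcp_wmem="4096 16384 1048576"
Would something along these lines be expected to make any difference?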
Any help in this regard will be appreciated.
Thanks.
K. Jahanbakhsh