[gmx-users] Parallel Gromacs Benchmarking with Opteron Dual-Core & Gigabit Ethernet

Kazem Jahanbakhsh k_jahanbakhsh at yahoo.com
Sun Jul 22 18:08:59 CEST 2007


Dear gmx users,
    
    I have set up a Linux cluster consisting of 8 nodes with the following specification:
    
  Node HW: two dual-core Opteron 2212 CPUs (2 GHz, 1 MB cache per core), i.e. 4 cores per node, plus 2 GB RAM and Gigabit Ethernet NICs.
  Network infrastructure: Gigabit Ethernet (Cisco Catalyst 2960 switch) + the Linux TCP/IP stack.
  OS: Fedora Core 5.
    
    I configured and compiled LAM/MPI (with default parameters), FFTW (with double-precision support) and GROMACS (with MPI and double precision enabled) separately on our cluster without any problems.
    
    After that, I set out to benchmark the cluster with the gmxbench package. Following the GROMACS benchmark instructions, I ran the parallel benchmarks of the DPPC system provided with gmxbench.
    
    Starting with a single node, I launched four processes on one node (four cores) with the following commands:
    
    grompp -np 4 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr
    mpirun -np 4 mdrun_d -v -deffnm grompp
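    (Before this I had booted LAM on the local node only; roughly as sketched below from memory, where the file name "hostfile" is just a placeholder, not the name I actually used:)
    
    # local-only boot schema: one line, 4 CPUs on this machine
    echo "localhost cpu=4" > hostfile
    lamboot -v hostfile     # start the LAM daemon on this node
    # ... grompp and mdrun_d as above ...
    lamhalt                 # shut the LAM daemons down when finished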
    
    With this run everything looked fine: all four cores were utilized at over 90%, and the following benchmark results were obtained:
    
        M E G A - F L O P S   A C C O U N T I N G
    
        Parallel run - timing based on wallclock.
       RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
       T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
       NF=No Forces
    
   Computing:                         M-Number         M-Flops   % of Flops
    -----------------------------------------------------------------------
   LJ                             13783.350611   454850.570163    10.7
   Coulomb                        11511.123348   310800.330396     7.3
   Coulomb  [W3]                    1477.194071   118175.525680     2.8
   Coulomb  [W3-W3]                 2305.011660   539372.728440    12.7
   Coulomb +  LJ                    6733.896263   255888.057994     6.0
   Coulomb + LJ  [W3]               2980.257052   271203.391732     6.4
   Coulomb + LJ  [W3-W3]            5589.019105  1369309.680725    32.4
   Outer nonbonded  loop            2892.716574    28927.165740     0.7
   1,4 nonbonded interactions       148.509696    13365.872640     0.3
   NS-Pairs                       29597.265161   621542.568381    14.7
   Reset In  Box                      61.049856       549.448704     0.0
   Shift-X                         1218.803712     7312.822272      0.2
   CG-CoM                            30.268416       877.784064     0.0
   Sum  Forces                      1828.205568     1828.205568      0.0
   Angles                           291.898368    47579.433984     1.1
   Propers                           87.057408    19936.146432     0.5
   Impropers                         15.363072     3195.518976      0.1
   RB-Dihedrals                     122.904576    30357.430272     0.7
   Virial                           609.941964    10978.955352     0.3
   Update                           609.401856    18891.457536     0.4
   Stop-CM                          609.280000     6092.800000      0.1
   Calc-Ekin                        609.523712    16457.140224     0.4
   Lincs                            251.030528    15061.831680     0.4
   Lincs-Mat                       3504.181248    14016.724992     0.3
   Constraint-V                     609.401856     3656.411136      0.1
   Constraint-Vir                   604.522496    14508.539904     0.3
   Settle                           117.830656    38059.301888     0.9
    -----------------------------------------------------------------------
   Total                                        4232795.844875   100.0
    -----------------------------------------------------------------------
    
                  NODE (s)   Real (s)      (%)
           Time:   2799.000   2799.000    100.0
                          46:39
                  (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
  Performance:      15.856       1.512      0.309     77.750
    
    
    Next, I repeated the above run, but on two nodes. For this phase, I created a boot schema file from which lamboot started the LAM daemons; it contained:
    
    Node-1 (repeated 4 times)
    Node-2 (repeated 4 times)
    
    and then executed the following commands:
    
    grompp -np 8 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr
    mpirun -np 8 mdrun_d -v -deffnm grompp
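    (As a side note, I believe the same boot schema can be written more compactly with LAM's cpu= syntax, in which case mpirun's "C" argument starts one process per listed CPU; the file name and host names below are placeholders for my real node names:)
    
    # contents of a hypothetical boot schema file "bhost":
    #   node1 cpu=4
    #   node2 cpu=4
    lamboot -v bhost                       # boot LAM on both nodes
    mpirun C mdrun_d -v -deffnm grompp     # "C" = one process per CPU listed in the schema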
    
    In this phase there were four mdrun processes running on each node, with per-core utilization of about 60-70% as reported by top. The benchmark results:
    
        M E G A - F L O P S   A C C O U N T I N G
    
        Parallel run - timing based on wallclock.
       RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
       T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
       NF=No Forces
    
   Computing:                         M-Number         M-Flops   % of Flops
    -----------------------------------------------------------------------
   LJ                             13784.875630   454900.895790    10.7
   Coulomb                        16468.628499   444652.969473    10.5
   Coulomb  [W3]                     866.583623    69326.689840     1.6
   Coulomb  [W3-W3]                 2304.621876   539281.518984    12.7
   Coulomb +  LJ                    8301.510488   315457.398544     7.4
   Coulomb + LJ  [W3]               1413.085299   128590.762209     3.0
   Coulomb + LJ  [W3-W3]            5588.477053  1369176.877985    32.3
   Outer nonbonded  loop            2958.469329    29584.693290     0.7
   1,4 nonbonded interactions       148.509696    13365.872640     0.3
   NS-Pairs                       29582.701180   621236.724780    14.7
   Reset In  Box                      61.049856       549.448704     0.0
   Shift-X                         1218.803712     7312.822272      0.2
   CG-CoM                            30.268416       877.784064     0.0
   Sum  Forces                      2437.607424     2437.607424      0.1
   Angles                           291.898368    47579.433984     1.1
   Propers                           87.057408    19936.146432     0.5
   Impropers                         15.363072     3195.518976      0.1
   RB-Dihedrals                     122.904576    30357.430272     0.7
   Virial                           610.482072    10988.677296     0.3
   Update                           609.401856    18891.457536     0.4
   Stop-CM                          609.280000     6092.800000      0.1
   Calc-Ekin                        609.523712    16457.140224     0.4
   Lincs                            251.030528    15061.831680     0.4
   Lincs-Mat                       3504.181248    14016.724992     0.3
   Constraint-V                     609.401856     3656.411136      0.1
   Constraint-Vir                   604.522496    14508.539904     0.3
   Settle                           117.830656    38059.301888     0.9
    -----------------------------------------------------------------------
   Total                                        4235553.480319   100.0
    -----------------------------------------------------------------------
    
                  NODE (s)   Real (s)      (%)
           Time:   1337.000   1337.000    100.0
                          22:17
                  (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
  Performance:      36.446       3.168      0.646     37.139
    
    Not bad at all: the wall time dropped from 2799 s to 1337 s, a speedup of about 2.1x on twice the cores, i.e. roughly 100% parallel efficiency. The bad news is that the per-core utilization dropped to about 60%. Finally, I repeated all of the above steps, but on three physical nodes. The boot schema file for lamboot:
    
    Node-1 (repeated 4 times because of four cores on every node)
    Node-2 (repeated 4 times)
    Node-3 (repeated 4 times)
    
    and then executed the following commands:
    
    grompp -np 12 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr
    mpirun -np 12 mdrun_d -v -deffnm grompp
    
    In this phase there were again four mdrun processes running on each node, but with per-core utilization of only about 45-50%. The benchmark results:
    
    
        M E G A - F L O P S   A C C O U N T I N G
    
        Parallel run - timing based on wallclock.
       RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
       T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
       NF=No Forces
    
   Computing:                         M-Number         M-Flops   % of Flops
    -----------------------------------------------------------------------
   LJ                             13784.842940   454899.817020    10.7
   Coulomb                        14582.358701   393723.684927     9.3
   Coulomb  [W3]                    1138.280373    91062.429840     2.1
   Coulomb  [W3-W3]                 2306.683307   539763.893838    12.7
   Coulomb +  LJ                    7768.301364   295195.451832     7.0
   Coulomb + LJ  [W3]               1946.513725   177132.748975     4.2
   Coulomb + LJ  [W3-W3]            5594.156904  1370568.441480    32.3
   Outer nonbonded  loop            3059.386119    30593.861190     0.7
   1,4 nonbonded interactions       148.509696    13365.872640     0.3
   NS-Pairs                       29577.291883   621123.129543    14.7
   Reset In  Box                      61.049856       549.448704     0.0
   Shift-X                         1218.803712     7312.822272      0.2
   CG-CoM                            30.268416       877.784064     0.0
   Sum  Forces                      4265.812992     4265.812992      0.1
   Angles                           291.898368    47579.433984     1.1
   Propers                           87.057408    19936.146432     0.5
   Impropers                         15.363072     3195.518976      0.1
   RB-Dihedrals                     122.904576    30357.430272     0.7
   Virial                           611.022180    10998.399240     0.3
   Update                           609.401856    18891.457536     0.4
   Stop-CM                          609.280000     6092.800000      0.1
   Calc-Ekin                        609.523712    16457.140224     0.4
   Lincs                            251.030528    15061.831680     0.4
   Lincs-Mat                       3504.181248    14016.724992     0.3
   Constraint-V                     609.401856     3656.411136      0.1
   Constraint-Vir                   604.522496    14508.539904     0.3
   Settle                           117.830656    38059.301888     0.9
    -----------------------------------------------------------------------
   Total                                        4239246.335581   100.0
    -----------------------------------------------------------------------
    
                  NODE (s)   Real (s)      (%)
           Time:   1272.000   1272.000    100.0
                          21:12
                  (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
  Performance:      37.045       3.333      0.679     35.333
    
    
    Very poor scaling!
  With 12 cores I expected around 4.5 GFlops, but the result is essentially the same as the 8-core run; in other words, the third node contributed nothing at all. I googled the GROMACS mailing lists and found many threads on this topic. I suspect that Gigabit Ethernet's latency is the performance killer here. Is there any solution to this problem, such as recompiling the kernel, tuning TCP/IP stack parameters, recompiling LAM, setting up the simulations differently, or anything else?
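    (To be concrete about the TCP/IP tuning I have in mind: something like the sysctl settings below, i.e. larger socket buffers for Gigabit Ethernet. The values are only my own guesses, not recommendations from the GROMACS or LAM documentation:)
    
    # proposed additions to /etc/sysctl.conf (guessed values)
    net.core.rmem_max = 4194304
    net.core.wmem_max = 4194304
    net.ipv4.tcp_rmem = 4096 87380 4194304
    net.ipv4.tcp_wmem = 4096 65536 4194304
    # apply with: sysctl -p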
    
    Any help in this regard will be appreciated.
    
    Thanks.
    K. Jahanbakhsh
       