[gmx-users] Parallel Use of gromacs - parallel run is 20 times SLOWER than single node run

David van der Spoel spoel at xray.bmc.uu.se
Wed Jul 18 09:36:37 CEST 2007


Jim Kress wrote:
> I ran a parallel (mpi) compiled version of gromacs using the following 
> command line:
> 
> $ mpirun -np 5 mdrun_mpi -s topol.tpr -np 5 -v
> 
> At the end of the file md0.log I found:
> 
>         M E G A - F L O P S   A C C O U N T I N G
> 
>         Parallel run - timing based on wallclock.
>    RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>    T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>    NF=No Forces
> 
>  Computing:                        M-Number         M-Flops  % of Flops
> -----------------------------------------------------------------------
>  Coulomb + LJ [W4-W4]            876.631638   234060.647346    88.0
>  Outer nonbonded loop            692.459088     6924.590880     2.6
>  NS-Pairs                        457.344228     9604.228788     3.6
>  Reset In Box                     13.782888      124.045992     0.0
>  Shift-X                         137.773776      826.642656     0.3
>  CG-CoM                            3.445722       99.925938     0.0
>  Sum Forces                      206.660664      206.660664     0.1
>  Virial                           70.237023     1264.266414     0.5
>  Update                           68.886888     2135.493528     0.8
>  Stop-CM                          68.880000      688.800000     0.3
>  P-Coupling                       68.886888      413.321328     0.2
>  Calc-Ekin                        68.893776     1860.131952     0.7
>  Constraint-V                     68.886888      413.321328     0.2
>  Constraint-Vir                   51.675498     1240.211952     0.5
>  Settle                           17.225166     5563.728618     2.1
>  Virtual Site 3                   17.221722      637.203714     0.2
> -----------------------------------------------------------------------
>  Total                                        266063.221098   100.0
> -----------------------------------------------------------------------
> 
>                NODE (s)   Real (s)      (%)
>        Time:   3344.000   3344.000    100.0
>                        55:44
>                (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
> Performance:      0.262     79.564      0.517     46.444
> 
> Detailed load balancing info in percentage of average
> Type        NODE:  0   1   2   3   4 Scaling
> -------------------------------------------
> Coulomb + LJ [W4-W4]:118  94 101 104  80     84%
> Outer nonbonded loop: 97  98  98 103 102     96%
>        NS-Pairs:116  94 101 104  82     85%
>    Reset In Box: 99 100  99 100  99     99%
>         Shift-X: 99 100  99 100  99     99%
>          CG-CoM: 99 100  99 100  99     99%
>      Sum Forces: 99 100  99  99  99     99%
>          Virial: 99 100  99 100  99     99%
>          Update: 99 100  99 100  99     99%
>         Stop-CM: 99 100  99 100  99     99%
>      P-Coupling: 99 100  99 100  99     99%
>       Calc-Ekin: 99 100  99 100  99     99%
>    Constraint-V: 99 100  99 100  99     99%
>  Constraint-Vir: 99 100  99 100  99     99%
>          Settle: 99 100  99 100  99     99%
>  Virtual Site 3: 99 100  99 100  99     99%
> 
>     Total Force:118  94 101 104  81     84%
> 
> 
>     Total Shake: 99 100  99 100  99     99%
> 
> 
> Total Scaling: 85% of max performance
> 
> Finished mdrun on node 0 Sat Jul 14 23:32:32 2007
> 
> 
> Now, I tried the same calculation on one node and found the following at 
> the end of the file md.log:
> 
>         M E G A - F L O P S   A C C O U N T I N G
> 
>    RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>    T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>    NF=No Forces
> 
>  Computing:                        M-Number         M-Flops  % of Flops
> -----------------------------------------------------------------------
>  Coulomb + LJ [W4-W4]            875.182588   233673.750996    88.0
>  Outer nonbonded loop            688.853376     6888.533760     2.6
>  NS-Pairs                        456.997574     9596.949054     3.6
>  Reset In Box                     13.782888      124.045992     0.0
>  Shift-X                         137.773776      826.642656     0.3
>  CG-CoM                            3.445722       99.925938     0.0
>  Virial                           69.156915     1244.824470     0.5
>  Update                           68.886888     2135.493528     0.8
>  Stop-CM                          68.880000      688.800000     0.3
>  P-Coupling                       68.886888      413.321328     0.2
>  Calc-Ekin                        68.893776     1860.131952     0.7
>  Constraint-V                     68.886888      413.321328     0.2
>  Constraint-Vir                   51.675498     1240.211952     0.5
>  Settle                           17.225166     5563.728618     2.1
>  Virtual Site 3                   17.221722      637.203714     0.2
> -----------------------------------------------------------------------
>  Total                                        265406.885286   100.0
> -----------------------------------------------------------------------
> 
>                NODE (s)   Real (s)      (%)
>        Time:    165.870    167.000     99.3
>                        2:45
>                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:      5.276      1.600     10.418      2.304
> Finished mdrun on node 0 Thu Jul 12 15:17:49 2007
> 
> 
> While I didn't expect to find pure linear scaling with gromacs, I 
> certainly didn't expect a massive INCREASE in run time (roughly 20x) 
> across my 5-node, gigabit ethernet cluster.
> 
> Anybody understand why this happened?
It is just the communication overhead that kills you. With InfiniBand 
you might be able to scale it a bit further, and it will be better in 
the next release, but you should realize that communication times over 
TCP/IP are counted in milliseconds, and a millisecond means a million 
cycles on a 1 GHz chip. In that time gromacs computes roughly 5276 
nonbonded interactions for you (5.276 Mnbf/s in your single-node log 
above).
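
A minimal back-of-envelope sketch of that arithmetic in Python (the 
1 ms TCP latency and the 1 GHz clock are assumed round numbers; the 
5.276 Mnbf/s figure is taken from the single-node log above):

# assumed round numbers for one gigabit-Ethernet/TCP message and a 1 GHz CPU
tcp_latency_s = 1e-3
clock_hz      = 1e9
# nonbonded force evaluations per second, from the single-node md.log above
mnbf_per_s    = 5.276e6

cycles_idle_per_message  = tcp_latency_s * clock_hz     # ~1 million cycles
pairs_missed_per_message = tcp_latency_s * mnbf_per_s   # ~5276 interactions

print(f"cycles spent waiting per message:  {cycles_idle_per_message:.0f}")
print(f"nonbonded pairs not computed then: {pairs_missed_per_message:.0f}")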

It looks like you are running a small TIP4P water box; smaller systems 
scale worse because there is less computation per communication step. 
With the development code we have been able to scale large 
protein/water systems to 30-40 processors on Gbit Ethernet as well.
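
For perspective, here is a rough comparison of the two logs above in 
Python (the perfect-speedup baseline is an assumption for illustration, 
not something mdrun reports):

serial_node_s   = 165.87   # NODE (s) from the single-node md.log
parallel_node_s = 3344.0   # NODE (s) from the 5-node md0.log
n_nodes         = 5

ideal_parallel_s = serial_node_s / n_nodes           # ~33 s with perfect scaling
slowdown         = parallel_node_s / serial_node_s   # ~20x slower instead
overhead_s       = parallel_node_s - ideal_parallel_s
overhead_frac    = overhead_s / parallel_node_s      # ~99% of the wall clock

print(f"slowdown vs. one node:             {slowdown:.1f}x")
print(f"estimated communication/idle time: {overhead_s:.0f} s "
      f"({100 * overhead_frac:.0f}% of the parallel run)")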

-- 
David van der Spoel, Ph.D.
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se


