[gmx-users] Parallel Use of gromacs - parallel run is 20 times SLOWER than single node run
David van der Spoel
spoel at xray.bmc.uu.se
Wed Jul 18 09:36:37 CEST 2007
Jim Kress wrote:
> I ran a parallel (mpi) compiled version of gromacs using the following
> command line:
>
> $ mpirun -np 5 mdrun_mpi -s topol.tpr -np 5 -v
>
> At the end of the file md0.log I found:
>
> M E G A - F L O P S A C C O U N T I N G
>
> Parallel run - timing based on wallclock.
> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
> NF=No Forces
>
> Computing: M-Number M-Flops % of Flops
> -----------------------------------------------------------------------
> Coulomb + LJ [W4-W4] 876.631638 234060.647346 88.0
> Outer nonbonded loop 692.459088 6924.590880 2.6
> NS-Pairs 457.344228 9604.228788 3.6
> Reset In Box 13.782888 124.045992 0.0
> Shift-X 137.773776 826.642656 0.3
> CG-CoM 3.445722 99.925938 0.0
> Sum Forces 206.660664 206.660664 0.1
> Virial 70.237023 1264.266414 0.5
> Update 68.886888 2135.493528 0.8
> Stop-CM 68.880000 688.800000 0.3
> P-Coupling 68.886888 413.321328 0.2
> Calc-Ekin 68.893776 1860.131952 0.7
> Constraint-V 68.886888 413.321328 0.2
> Constraint-Vir 51.675498 1240.211952 0.5
> Settle 17.225166 5563.728618 2.1
> Virtual Site 3 17.221722 637.203714 0.2
> -----------------------------------------------------------------------
> Total 266063.221098 100.0
> -----------------------------------------------------------------------
>
> NODE (s) Real (s) (%)
> Time: 3344.000 3344.000 100.0
> 55:44
> (Mnbf/s) (MFlops) (ns/day) (hour/ns)
> Performance: 0.262 79.564 0.517 46.444
>
> Detailed load balancing info in percentage of average
> Type NODE: 0 1 2 3 4 Scaling
> -------------------------------------------
> Coulomb + LJ [W4-W4]:118 94 101 104 80 84%
> Outer nonbonded loop: 97 98 98 103 102 96%
> NS-Pairs:116 94 101 104 82 85%
> Reset In Box: 99 100 99 100 99 99%
> Shift-X: 99 100 99 100 99 99%
> CG-CoM: 99 100 99 100 99 99%
> Sum Forces: 99 100 99 99 99 99%
> Virial: 99 100 99 100 99 99%
> Update: 99 100 99 100 99 99%
> Stop-CM: 99 100 99 100 99 99%
> P-Coupling: 99 100 99 100 99 99%
> Calc-Ekin: 99 100 99 100 99 99%
> Constraint-V: 99 100 99 100 99 99%
> Constraint-Vir: 99 100 99 100 99 99%
> Settle: 99 100 99 100 99 99%
> Virtual Site 3: 99 100 99 100 99 99%
>
> Total Force:118 94 101 104 81 84%
>
>
> Total Shake: 99 100 99 100 99 99%
>
>
> Total Scaling: 85% of max performance
>
> Finished mdrun on node 0 Sat Jul 14 23:32:32 2007
>
>
> Now, I tried the same calculation on one node and found the following at
> the end of the file md.log:
>
> M E G A - F L O P S A C C O U N T I N G
>
> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
> NF=No Forces
>
> Computing: M-Number M-Flops % of Flops
> -----------------------------------------------------------------------
> Coulomb + LJ [W4-W4] 875.182588 233673.750996 88.0
> Outer nonbonded loop 688.853376 6888.533760 2.6
> NS-Pairs 456.997574 9596.949054 3.6
> Reset In Box 13.782888 124.045992 0.0
> Shift-X 137.773776 826.642656 0.3
> CG-CoM 3.445722 99.925938 0.0
> Virial 69.156915 1244.824470 0.5
> Update 68.886888 2135.493528 0.8
> Stop-CM 68.880000 688.800000 0.3
> P-Coupling 68.886888 413.321328 0.2
> Calc-Ekin 68.893776 1860.131952 0.7
> Constraint-V 68.886888 413.321328 0.2
> Constraint-Vir 51.675498 1240.211952 0.5
> Settle 17.225166 5563.728618 2.1
> Virtual Site 3 17.221722 637.203714 0.2
> -----------------------------------------------------------------------
> Total 265406.885286 100.0
> -----------------------------------------------------------------------
>
> NODE (s) Real (s) (%)
> Time: 165.870 167.000 99.3
> 2:45
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 5.276 1.600 10.418 2.304
> Finished mdrun on node 0 Thu Jul 12 15:17:49 2007
>
>
> I didn't expect to find pure linear scaling with gromacs. However, I
> also didn't expect to find a massive INCREASE in run time across my
> 5-node, gigabit ethernet cluster.
>
> Anybody understand why this happened?
It is just the communication overhead that kills you. With Infiniband
you might be able to scale it a bit further. It will be better in the
next release, but you should realize that communication times over
TCP/IP are counted in milliseconds, and one millisecond is 1 million
cycles on a GHz chip. In that time gromacs computes roughly 5276
nonbonded interactions for you (5.276 Mnbf/s, see above).
It looks like you're doing a small TIP4P box, and smaller systems scale
worse. With the development code we have been able to scale large
protein/water systems to 30-40 processors on Gbit ethernet as well.
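
For concreteness, a small back-of-envelope sketch of those numbers. The
1 ms TCP/IP latency and the 1 GHz clock are the round figures from the
paragraph above (assumptions, not measurements on this cluster); the
throughput and timings are taken from the two logs quoted above.

    # Rough sanity check of the latency-vs-compute argument.
    # Assumed round numbers: ~1 ms per TCP/IP exchange, a 1 GHz clock.
    latency_s = 1e-3              # one gigabit/TCP message, order of magnitude
    clock_hz = 1e9                # "a GHz chip"
    nbf_per_s = 5.276e6           # 5.276 Mnbf/s from the single-node log

    cycles_per_msg = latency_s * clock_hz   # ~1e6 cycles idle per message
    nbf_per_msg = latency_s * nbf_per_s     # ~5276 interactions of lost work

    single_node_s = 165.870       # NODE time, single-node run
    parallel_s = 3344.000         # NODE time, 5-process run
    slowdown = parallel_s / single_node_s   # ~20x, as in the subject line

    print(f"cycles wasted per message:        {cycles_per_msg:.0f}")
    print(f"interactions forgone per message: {nbf_per_msg:.0f}")
    print(f"parallel / single-node time:      {slowdown:.1f}x")

The point being: on a system this small, every message over gigabit
costs on the order of thousands of interactions' worth of useful work,
so the processors spend most of their time waiting.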
--
David van der Spoel, Ph.D.
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se spoel at gromacs.org http://folding.bmc.uu.se