[gmx-users] Poor load balancing
Carsten Kutzner
ckutzne at gwdg.de
Tue Feb 16 16:58:05 CET 2010
Deniz,
for calculations with PME you might want to use the g_tune_pme
tool, which helps to find the optimal settings on a given number of
cores. For Gromacs 4.0.x you can download it from
http://www.mpibpc.mpg.de/home/grubmueller/projects/MethodAdvancements/Gromacs/
You will find installation instructions at the top of the
g_tune_pme.c file.
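
A minimal invocation looks roughly like this (binary and file names are
placeholders, and the exact options may differ in the 4.0.x version of
the tool, so check the header of g_tune_pme.c):

  # tell the tool which commands to launch (assumption: an MPI-enabled mdrun)
  export MPIRUN=mpirun
  export MDRUN=mdrun_mpi
  # scan different PP/PME node splits for a total of 8 processes
  g_tune_pme -np 8 -s bench.tpr

It then reports the timings it measured for each tested split.
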
Carsten
On Feb 16, 2010, at 1:41 PM, Deniz KARASU wrote:
> Carsten, thank you for your response.
>
> I ran the same benchmark on 8 and 16 nodes, but this time with PME instead of a plain cut-off. To optimize the runs I varied the cut-off and the Fourier spacing. I would like to know whether these results are acceptable and whether more optimization is needed.
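>
> For reference, each run only changed the cut-offs (rcoulomb/rvdw) and
> fourierspacing in the .mdp file and was then regenerated and launched
> like this (a sketch; file and binary names are placeholders):
>
>   # e.g. for the 1.0 nm run: rcoulomb = rvdw = 1.0, fourierspacing = 0.13
>   grompp -f bench.mdp -c conf.gro -p topol.top -o bench.tpr
>   # 8 MPI processes, output files named bench_rc10.*
>   mpirun -np 8 mdrun_mpi -s bench.tpr -deffnm bench_rc10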
>
> Thanks.
>
> Deniz
>
> ====================================================
>
> 8 nodes, cut-off = 0.9 nm, fourier_spacing = 0.12
>
> Average load imbalance: 4.0 %
> Part of the total run time spent waiting due to load imbalance: 1.4 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
> Average PME mesh/force load: 1.758
> Part of the total run time spent waiting due to PP/PME imbalance: 15.7 %
>
> NOTE: 15.7 % performance was lost because the PME nodes
> had more work to do than the PP nodes.
> You might want to increase the number of PME nodes
> or increase the cut-off and the grid spacing.
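>
> One thing I could still try here, instead of increasing the cut-off
> further, is to give more nodes to PME explicitly, for example (the
> value below is just a guess; I understand g_tune_pme scans such splits
> automatically):
>
>   # dedicate 5 of the 8 processes to the PME mesh instead of the default split
>   mpirun -np 8 mdrun_mpi -npme 5 -s bench.tpr -deffnm bench_npme5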
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 4 1001 36.253 15.5 1.4
> Vsite constr. 4 5001 3.237 1.4 0.1
> Send X to PME 4 5001 10.365 4.4 0.4
> Comm. coord. 4 5001 15.193 6.5 0.6
> Neighbor search 4 1001 279.944 120.0 10.8
> Force 4 5001 451.185 193.5 17.4
> Wait + Comm. F 4 5001 63.147 27.1 2.4
> PME mesh 4 5001 940.073 403.1 36.3
> Wait + Comm. X/F 4 5001 356.494 152.9 13.7
> Wait + Recv. PME F 4 5001 345.820 148.3 13.3
> Vsite spread 4 10002 6.568 2.8 0.3
> Write traj. 4 1 0.350 0.2 0.0
> Update 4 5001 20.525 8.8 0.8
> Constraints 4 5001 42.245 18.1 1.6
> Comm. energies 4 5001 3.377 1.4 0.1
> Rest 4 18.393 7.9 0.7
> -----------------------------------------------------------------------
> Total 8 2593.170 1112.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 139.000 139.000 100.0
> 2:19
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 127.854 9.458 12.434 1.930
> Finished mdrun on node 0 Mon Feb 15 17:34:48 2010
>
> ====================================================
> 8 nodes, cut-off = 1.0 nm, fourier_spacing = 0.13
>
> Average load imbalance: 3.4 %
> Part of the total run time spent waiting due to load imbalance: 1.7 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
> Average PME mesh/force load: 1.129
> Part of the total run time spent waiting due to PP/PME imbalance: 3.7 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 4 1001 35.777 15.3 1.5
> Vsite constr. 4 5001 2.620 1.1 0.1
> Send X to PME 4 5001 10.182 4.4 0.4
> Comm. coord. 4 5001 15.727 6.7 0.7
> Neighbor search 4 1001 275.561 117.9 11.8
> Force 4 5001 576.720 246.7 24.7
> Wait + Comm. F 4 5001 69.631 29.8 3.0
> PME mesh 4 5001 752.485 321.8 32.2
> Wait + Comm. X/F 4 5001 416.550 178.2 17.8
> Wait + Recv. PME F 4 5001 91.857 39.3 3.9
> Vsite spread 4 10002 6.456 2.8 0.3
> Write traj. 4 1 0.426 0.2 0.0
> Update 4 5001 20.577 8.8 0.9
> Constraints 4 5001 41.959 17.9 1.8
> Comm. energies 4 5001 2.967 1.3 0.1
> Rest 4 18.612 8.0 0.8
> -----------------------------------------------------------------------
> Total 8 2338.108 1000.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 125.000 125.000 100.0
> 2:05
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 190.198 11.789 13.827 1.736
> Finished mdrun on node 0 Mon Feb 15 22:10:46 2010
>
> ====================================================
> 8 nodes, cut-off = 1.1 nm, fourier_spacing = 0.135
>
> Average load imbalance: 0.7 %
> Part of the total run time spent waiting due to load imbalance: 0.4 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
> Average PME mesh/force load: 0.872
> Part of the total run time spent waiting due to PP/PME imbalance: 4.2 %
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 4 1001 30.117 12.9 1.3
> Vsite constr. 4 5001 1.739 0.7 0.1
> Send X to PME 4 5001 9.944 4.3 0.4
> Comm. coord. 4 5001 16.964 7.3 0.7
> Neighbor search 4 1001 269.553 115.8 11.4
> Force 4 5001 708.179 304.2 29.9
> Wait + Comm. F 4 5001 50.572 21.7 2.1
> PME mesh 4 5001 671.310 288.3 28.4
> Wait + Comm. X/F 4 5001 511.451 219.7 21.6
> Wait + Recv. PME F 4 5001 10.333 4.4 0.4
> Vsite spread 4 10002 4.222 1.8 0.2
> Write traj. 4 1 0.348 0.1 0.0
> Update 4 5001 19.821 8.5 0.8
> Constraints 4 5001 39.736 17.1 1.7
> Comm. energies 4 5001 3.181 1.4 0.1
> Rest 4 18.084 7.8 0.8
> -----------------------------------------------------------------------
> Total 8 2365.556 1016.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 127.000 127.000 100.0
> 2:07
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 244.853 13.855 13.609 1.764
> Finished mdrun on node 0 Mon Feb 15 22:24:07 2010
>
> ====================================================
> 16 nodes, cut-off = 1.1 nm, fourier_spacing = 0.135
>
> Average load imbalance: 7.0 %
> Part of the total run time spent waiting due to load imbalance: 3.5 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 %
> Average PME mesh/force load: 0.872
> Part of the total run time spent waiting due to PP/PME imbalance: 4.2 %
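>
> If this 7 % imbalance grows further with more nodes, I could also try
> forcing dynamic load balancing on from the first step (a sketch; mdrun's
> -dlb option defaults to auto):
>
>   mpirun -np 16 mdrun_mpi -dlb yes -s bench.tpr -deffnm bench_16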
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Domain decomp. 8 1001 55.569 23.8 1.9
> Vsite constr. 8 5001 3.334 1.4 0.1
> Send X to PME 8 5001 24.192 10.4 0.8
> Comm. coord. 8 5001 49.191 21.1 1.7
> Neighbor search 8 1001 300.578 128.8 10.3
> Force 8 5001 734.497 314.9 25.2
> Wait + Comm. F 8 5001 166.258 71.3 5.7
> PME mesh 8 5001 809.589 347.1 27.8
> Wait + Comm. X/F 8 5001 640.310 274.5 22.0
> Wait + Recv. PME F 8 5001 12.332 5.3 0.4
> Vsite spread 8 10002 11.558 5.0 0.4
> Write traj. 8 1 0.685 0.3 0.0
> Update 8 5001 18.789 8.1 0.6
> Constraints 8 5001 47.320 20.3 1.6
> Comm. energies 8 5001 12.562 5.4 0.4
> Rest 8 24.538 10.5 0.8
> -----------------------------------------------------------------------
> Total 16 2911.302 1248.0 100.0
> -----------------------------------------------------------------------
>
> Parallel run - timing based on wallclock.
>
> NODE (s) Real (s) (%)
> Time: 78.000 78.000 100.0
> 1:18
> (Mnbf/s) (GFlops) (ns/day) (hour/ns)
> Performance: 398.725 22.539 22.158 1.083
> Finished mdrun on node 0 Mon Feb 15 22:54:31 2010
>
>
>
>
> On Mon, Feb 15, 2010 at 5:36 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
> Hi,
>
> 18 seconds of real time is a bit short for such a test; you should run
> for at least several minutes. The performance you can expect depends
> a lot on the interconnect you are using. You will definitely need a
> really low-latency interconnect if you have fewer than 1000 atoms
> per core.
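>
> As a quick check, divide the atom count of your system (grompp and
> mdrun print it) by the number of cores; a sketch with a placeholder
> atom count:
>
>   echo $((24000 / 64))   # ~24000 atoms on 64 cores gives 375 atoms/core, well below ~1000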
>
> Carsten
>
>
> On Feb 15, 2010, at 3:13 PM, Deniz KARASU wrote:
>
> > Hi All,
> >
> > I'm trying to run the d.lzm GROMACS benchmark on a 64-node machine, but the dynamic load balancing performance is very low.
> >
> > Any suggestions would be of great help.
> >
> > Thanks.
> >
> > Deniz KARASU
> >
>
--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/home/grubmueller/ihp/ckutzne