[gmx-users] Losing part of the available CPU time
Szilárd Páll
pall.szilard at gmail.com
Mon Aug 15 18:35:00 CEST 2016
Hi,
Although I don't know exactly what system you are simulating, one
thing is clear: you are pushing the parallelization limit with
- 200 atoms/core
- likely "concentrated" free energy interactions.
The former alone will make the run very sensitive to load imbalance,
and the latter makes the imbalance even worse, as the very expensive
free energy interactions likely all fall within a few domains (unless
your 8 perturbed atoms are scattered).
There is not much you can do beyond what I previously suggested
(trying more OpenMP threads, e.g. 2-4 per rank, or simply using fewer
cores); a sketch follows below. If you have the option, hardware with
fewer, faster cores (and perhaps a GPU) will also be much more
suitable than this 128-core AMD node.
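For example, a minimal sketch (assuming an MPI build launched through
mpirun; topol.tpr stands in for your actual run input):

  # 2-4 OpenMP threads per MPI rank instead of 128 single-threaded ranks
  mpirun -np 64 gmx_mpi mdrun -s topol.tpr -ntomp 2
  mpirun -np 32 gmx_mpi mdrun -s topol.tpr -ntomp 4

  # or simply run on fewer cores, e.g. half the node
  mpirun -np 64 gmx_mpi mdrun -s topol.tpr -ntomp 1

Fewer, larger domains give the load balancer more room to even out the
cost of the perturbed interactions.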
Cheers,
--
Szilárd
On Mon, Aug 15, 2016 at 4:01 PM, Alexander Alexander
<alexanderwien2k at gmail.com> wrote:
> Hi Szilárd,
>
> Thanks for your response; please find below a link to the required
> log files.
>
> https://drive.google.com/file/d/0B_CbyhnbKqQDc2FaeWxITWxqdDg/view?usp=sharing
>
> Thanks,
> Cheers,
> Alex
>
> On Mon, Aug 15, 2016 at 2:52 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
>> Hi,
>>
>> Please post full logs; what you cut out of the file often removes
>> information needed to diagnose your issue.
>>
>> At first sight it seems that you simply have an imbalanced system. I
>> am not sure about the source of the imbalance, and without knowing
>> more about your system/setup and how it is decomposed, what I can
>> suggest is to try other decomposition schemes or simply less
>> decomposition (use more OpenMP threads or fewer cores); see the
>> sketch below.
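>>
>> A sketch of both options (hypothetical file name and rank counts,
>> shown only to illustrate the flags, not tuned to your system):
>>
>>   # pick a specific DD grid with -dd, e.g. 8 x 4 x 3 = 96 PP ranks
>>   mpirun -np 128 gmx_mpi mdrun -s topol.tpr -npme 32 -dd 8 4 3
>>
>>   # less decomposition: fewer MPI ranks, more OpenMP threads per rank
>>   mpirun -np 32 gmx_mpi mdrun -s topol.tpr -ntomp 4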
>>
>> Additionally, you also have a pretty bad PP-PME load balance, but
>> that is likely to improve once your PP performance gets better.
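>>
>> If you want to attack the PP-PME balance directly, gmx tune_pme can
>> scan the number of separate PME ranks for you (it may need to be told
>> how to launch mdrun on your cluster), or you can set -npme by hand.
>> Again just a sketch with placeholder names and numbers:
>>
>>   gmx tune_pme -np 128 -s topol.tpr -steps 1000
>>   mpirun -np 128 gmx_mpi mdrun -s topol.tpr -npme 16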
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Sun, Aug 14, 2016 at 3:23 PM, Alexander Alexander
>> <alexanderwien2k at gmail.com> wrote:
>> > Dear GROMACS users,
>> >
>> > My free energy calculation works well; however, I am losing around
>> > 56.5 % of the available CPU time, as stated in my log file, which is
>> > really considerable. The problem is due to load imbalance and domain
>> > decomposition, but I have no idea how to improve it. Below is the
>> > very end of my log file; I would appreciate any help in avoiding
>> > this.
>> >
>> >
>> > D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>> >
>> > av. #atoms communicated per step for force: 2 x 115357.4
>> > av. #atoms communicated per step for LINCS: 2 x 2389.1
>> >
>> > Average load imbalance: 285.9 %
>> > Part of the total run time spent waiting due to load imbalance: 56.5 %
>> > Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 2 % Y 2 % Z 2 %
>> > Average PME mesh/force load: 0.384
>> > Part of the total run time spent waiting due to PP/PME imbalance: 14.5 %
>> >
>> > NOTE: 56.5 % of the available CPU time was lost due to load imbalance
>> > in the domain decomposition.
>> >
>> > NOTE: 14.5 % performance was lost because the PME ranks
>> > had less work to do than the PP ranks.
>> > You might want to decrease the number of PME ranks
>> > or decrease the cut-off and the grid spacing.
>> >
>> >
>> > R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>> >
>> > On 96 MPI ranks doing PP, and
>> > on 32 MPI ranks doing PME
>> >
>> > Computing:           Num   Num      Call    Wall time    Giga-Cycles
>> >                      Ranks Threads  Count      (s)       total sum    %
>> > -----------------------------------------------------------------------------
>> > Domain decomp.        96    1     175000     242.339      53508.472   0.5
>> > DD comm. load         96    1     174903       9.076       2003.907   0.0
>> > DD comm. bounds       96    1     174901      27.054       5973.491   0.1
>> > Send X to PME         96    1    7000001      44.342       9790.652   0.1
>> > Neighbor search       96    1     175001     251.994      55640.264   0.6
>> > Comm. coord.          96    1    6825000    1521.009     335838.747   3.4
>> > Force                 96    1    7000001    7001.990    1546039.264  15.5
>> > Wait + Comm. F        96    1    7000001   10761.296    2376093.759  23.8
>> > PME mesh *            32    1    7000001   11796.344     868210.788   8.7
>> > PME wait for PP *                          22135.752    1629191.096  16.3
>> > Wait + Recv. PME F    96    1    7000001     393.117      86800.265   0.9
>> > NB X/F buffer ops.    96    1   20650001     132.713      29302.991   0.3
>> > COM pull force        96    1    7000001     165.613      36567.368   0.4
>> > Write traj.           96    1       7037      55.020      12148.457   0.1
>> > Update                96    1   14000002     140.972      31126.607   0.3
>> > Constraints           96    1   14000002   12871.236    2841968.551  28.4
>> > Comm. energies        96    1     350001     261.976      57844.219   0.6
>> > Rest                                          52.349      11558.715   0.1
>> > -----------------------------------------------------------------------------
>> > Total                                      33932.096    9989607.639 100.0
>> > -----------------------------------------------------------------------------
>> > (*) Note that with separate PME ranks, the walltime column actually sums to
>> >     twice the total reported, but the cycle count total and % are correct.
>> > -----------------------------------------------------------------------------
>> > Breakdown of PME mesh computation
>> > -----------------------------------------------------------------------------
>> > PME redist. X/F       32    1   21000003    2334.608     171827.143   1.7
>> > PME spread/gather     32    1   28000004    3640.870     267967.972   2.7
>> > PME 3D-FFT            32    1   28000004    1587.105     116810.882   1.2
>> > PME 3D-FFT Comm.      32    1   56000008    4066.097     299264.666   3.0
>> > PME solve Elec        32    1   14000002     148.284      10913.728   0.1
>> > -----------------------------------------------------------------------------
>> >
>> >                Core t (s)   Wall t (s)        (%)
>> >        Time:  4341204.790    33932.096    12793.8
>> >                          9h25:32
>> >                  (ns/day)    (hour/ns)
>> > Performance:       35.648        0.673
>> > Finished mdrun on rank 0 Sat Aug 13 23:45:45 2016
>> >
>> > Thanks,
>> > Regards,
>> > Alex