[gmx-users] Losing part of the available CPU time
Szilárd Páll
pall.szilard at gmail.com
Mon Aug 15 18:35:00 CEST 2016
Hi,
Although I don't know exactly what system you are simulating, one
thing is clear: you are pushing the parallelization limit with
- 200 atoms/core
- likely "concentrated" free energy interactions.
The former alone will make the run very sensitive to load imbalance,
and the latter makes the imbalance even worse, as the very expensive
free energy interactions likely all fall within a few domains (unless
your 8 perturbed atoms are scattered).
There is not much you can do beyond what I previously suggested
(trying more OpenMP threads, e.g. 2-4 per rank, or simply using fewer
cores); a sketch follows below. If you have the option, hardware with
fewer, faster cores (and perhaps a GPU) will also be much more
suitable than this 128-core AMD node.
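For example, a minimal sketch (assuming an MPI build launched through
mpirun; topol.tpr stands in for your actual run input):

  # 2-4 OpenMP threads per MPI rank instead of 128 single-threaded ranks
  mpirun -np 64 gmx_mpi mdrun -s topol.tpr -ntomp 2
  mpirun -np 32 gmx_mpi mdrun -s topol.tpr -ntomp 4

  # or simply run on fewer cores, e.g. half the node
  mpirun -np 64 gmx_mpi mdrun -s topol.tpr -ntomp 1

Fewer, larger domains give the load balancer more room to even out the
cost of the perturbed interactions.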
Cheers,
--
Szilárd
On Mon, Aug 15, 2016 at 4:01 PM, Alexander Alexander
<alexanderwien2k at gmail.com> wrote:
> Hi Szilárd,
>
> Thanks for your response; please find below a link to the required
> log files.
>
> https://drive.google.com/file/d/0B_CbyhnbKqQDc2FaeWxITWxqdDg/view?usp=sharing
>
> Thanks,
> Cheers,
> Alex
>
> On Mon, Aug 15, 2016 at 2:52 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>
>> Hi,
>>
>> Please post full logs; what you cut out of the file often removes
>> information needed to diagnose your issue.
>>
>> At first sight it seems that you simply have an imbalanced system. I
>> am not sure about the source of the imbalance, and without knowing
>> more about your system/setup and how it is decomposed, what I can
>> suggest is to try other decomposition schemes or simply less
>> decomposition (use more OpenMP threads or fewer cores); see the
>> sketch below.
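>>
>> A sketch of both options (hypothetical file name and rank counts,
>> shown only to illustrate the flags, not tuned to your system):
>>
>>   # pick a specific DD grid with -dd, e.g. 8 x 4 x 3 = 96 PP ranks
>>   mpirun -np 128 gmx_mpi mdrun -s topol.tpr -npme 32 -dd 8 4 3
>>
>>   # less decomposition: fewer MPI ranks, more OpenMP threads per rank
>>   mpirun -np 32 gmx_mpi mdrun -s topol.tpr -ntomp 4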
>>
>> Additionally, you also have a pretty bad PP-PME load balance, but
>> that is likely to improve once your PP performance gets better.
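>>
>> If you want to attack the PP-PME balance directly, gmx tune_pme can
>> scan the number of separate PME ranks for you (it may need to be told
>> how to launch mdrun on your cluster), or you can set -npme by hand.
>> Again just a sketch with placeholder names and numbers:
>>
>>   gmx tune_pme -np 128 -s topol.tpr -steps 1000
>>   mpirun -np 128 gmx_mpi mdrun -s topol.tpr -npme 16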
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Sun, Aug 14, 2016 at 3:23 PM, Alexander Alexander
>> <alexanderwien2k at gmail.com> wrote:
>> > Dear GROMACS users,
>> >
>> > My free energy calculation works well; however, I am losing around
>> > 56.5 % of the available CPU time, as stated in my log file, which is
>> > really considerable. The problem is due to load imbalance and domain
>> > decomposition, but I have no idea how to improve it. Below is the
>> > very end of my log file; I would appreciate any help in avoiding
>> > this.
>> >
>> >
>> > D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>> >
>> > av. #atoms communicated per step for force: 2 x 115357.4
>> > av. #atoms communicated per step for LINCS: 2 x 2389.1
>> >
>> > Average load imbalance: 285.9 %
>> > Part of the total run time spent waiting due to load imbalance: 56.5 %
>> > Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 2 % Y 2 % Z 2 %
>> > Average PME mesh/force load: 0.384
>> > Part of the total run time spent waiting due to PP/PME imbalance: 14.5 %
>> >
>> > NOTE: 56.5 % of the available CPU time was lost due to load imbalance
>> > in the domain decomposition.
>> >
>> > NOTE: 14.5 % performance was lost because the PME ranks
>> > had less work to do than the PP ranks.
>> > You might want to decrease the number of PME ranks
>> > or decrease the cut-off and the grid spacing.
>> >
>> >
>> > R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>> >
>> > On 96 MPI ranks doing PP, and
>> > on 32 MPI ranks doing PME
>> >
>> > Computing:           Num   Num      Call    Wall time    Giga-Cycles
>> >                      Ranks Threads  Count      (s)       total sum    %
>> > -----------------------------------------------------------------------------
>> > Domain decomp.        96    1     175000     242.339      53508.472   0.5
>> > DD comm. load         96    1     174903       9.076       2003.907   0.0
>> > DD comm. bounds       96    1     174901      27.054       5973.491   0.1
>> > Send X to PME         96    1    7000001      44.342       9790.652   0.1
>> > Neighbor search       96    1     175001     251.994      55640.264   0.6
>> > Comm. coord.          96    1    6825000    1521.009     335838.747   3.4
>> > Force                 96    1    7000001    7001.990    1546039.264  15.5
>> > Wait + Comm. F        96    1    7000001   10761.296    2376093.759  23.8
>> > PME mesh *            32    1    7000001   11796.344     868210.788   8.7
>> > PME wait for PP *                          22135.752    1629191.096  16.3
>> > Wait + Recv. PME F    96    1    7000001     393.117      86800.265   0.9
>> > NB X/F buffer ops.    96    1   20650001     132.713      29302.991   0.3
>> > COM pull force        96    1    7000001     165.613      36567.368   0.4
>> > Write traj.           96    1       7037      55.020      12148.457   0.1
>> > Update                96    1   14000002     140.972      31126.607   0.3
>> > Constraints           96    1   14000002   12871.236    2841968.551  28.4
>> > Comm. energies        96    1     350001     261.976      57844.219   0.6
>> > Rest                                          52.349      11558.715   0.1
>> > -----------------------------------------------------------------------------
>> > Total                                      33932.096    9989607.639 100.0
>> > -----------------------------------------------------------------------------
>> > (*) Note that with separate PME ranks, the walltime column actually sums to
>> >     twice the total reported, but the cycle count total and % are correct.
>> > -----------------------------------------------------------------------------
>> > Breakdown of PME mesh computation
>> > -----------------------------------------------------------------------------
>> > PME redist. X/F       32    1   21000003    2334.608     171827.143   1.7
>> > PME spread/gather     32    1   28000004    3640.870     267967.972   2.7
>> > PME 3D-FFT            32    1   28000004    1587.105     116810.882   1.2
>> > PME 3D-FFT Comm.      32    1   56000008    4066.097     299264.666   3.0
>> > PME solve Elec        32    1   14000002     148.284      10913.728   0.1
>> > -----------------------------------------------------------------------------
>> >
>> >                Core t (s)   Wall t (s)        (%)
>> >        Time:  4341204.790    33932.096    12793.8
>> >                          9h25:32
>> >                  (ns/day)    (hour/ns)
>> > Performance:       35.648        0.673
>> > Finished mdrun on rank 0 Sat Aug 13 23:45:45 2016
>> >
>> > Thanks,
>> > Regards,
>> > Alex