[gmx-users] Re: mdrun -nosum still complains that 15 % of the run time was spent communicating energies

Mon Jul 20 23:06:10 CEST 2009

I have now tested with and without -nosum, and the option does appear to
be working (the number of energy communications drops from 501 to 51).
However, the total time spent communicating energies barely went down
(2973.9 s vs. 2702.1 s, i.e. 16.5 % vs. 15.6 % of the run time). That
seems strange to me. Does anybody know whether this is normal?

At the very least, I suggest adding an if statement to mdrun so that it 
doesn't output the -nosum usage note if the user did in fact use -nosum 
in that run.
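
The suggested guard could look roughly like the sketch below. It is written in Python purely for illustration and is not mdrun's actual code: the function name `energy_comm_note` and the 5 % threshold are hypothetical, since the thread does not say at what fraction mdrun 4.0.4 decides to print the note.

```python
from typing import Optional

# Hypothetical threshold: the value mdrun really uses is not given in this thread.
NOTE_THRESHOLD_PERCENT = 5.0


def energy_comm_note(percent_comm_energies: float, nosum_used: bool) -> Optional[str]:
    """Return the -nosum usage note, or None when the note would not help."""
    if percent_comm_energies < NOTE_THRESHOLD_PERCENT:
        # Energy communication is cheap enough; nothing to report.
        return None
    if nosum_used:
        # The user already passed -nosum, so recommending it again is misleading.
        return None
    return (
        f"NOTE: {percent_comm_energies:.0f} % of the run time was spent "
        "communicating energies,\n"
        "      you might want to use the -nosum option of mdrun"
    )
```

With the numbers from the runs in this message, `energy_comm_note(16.5, nosum_used=False)` returns the note, while `energy_comm_note(15.6, nosum_used=True)` returns `None`.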

Without using -nosum:

    R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
...
 Write traj.          256          2      233.218       93.7     0.5
 Update               256        501      777.511      312.5     1.7
 Constraints          256       1002     1203.894      483.9     2.7
 Comm. energies       256        501     7397.995     2973.9    16.5
 Rest                 256                 128.058       51.5     0.3
-----------------------------------------------------------------------
 Total                384               44897.468    18048.0   100.0
-----------------------------------------------------------------------

NOTE: 16 % of the run time was spent communicating energies,
      you might want to use the -nosum option of mdrun

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     47.000     47.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:  13485.788    712.634      1.842     13.029
Finished mdrun on node 0 Mon Jul 20 12:53:41 2009

#########

And using -nosum:

    R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
...
 Write traj.          256          2      213.521       83.3     0.5
 Update               256        501      776.606      303.0     1.8
 Constraints          256       1002     1200.285      468.2     2.7
 Comm. energies       256         51     6926.667     2702.1    15.6
 Rest                 256                 127.503       49.7     0.3
-----------------------------------------------------------------------
 Total                384               44296.670    17280.0   100.0
-----------------------------------------------------------------------

NOTE: 16 % of the run time was spent communicating energies,
      you might want to use the -nosum option of mdrun

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:     45.000     45.000    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:  14084.547    744.277      1.924     12.475

#########
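
To make the comparison concrete, the following Python sketch recomputes two quantities from the "Comm. energies" and "Total" rows of the two tables above: the cost per energy communication, and the node count implied by dividing the total seconds by the wallclock time. The dictionary layout and function names are illustrative only. It assumes that the Seconds column is summed over all nodes, which the totals seem to support (18048.0 s / 47 s and 17280.0 s / 45 s both give 384).

```python
# Values copied from the two accounting tables in this message.
runs = {
    "sum": {"calls": 501, "comm_seconds": 2973.9,
            "total_seconds": 18048.0, "wall_seconds": 47.0},
    "nosum": {"calls": 51, "comm_seconds": 2702.1,
              "total_seconds": 17280.0, "wall_seconds": 45.0},
}


def per_call_seconds(run: dict) -> float:
    """Summed seconds attributed to one energy communication."""
    return run["comm_seconds"] / run["calls"]


def implied_nodes(run: dict) -> float:
    """Node count implied if the Seconds column is summed over nodes."""
    return run["total_seconds"] / run["wall_seconds"]


if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name:>5}: {per_call_seconds(run):6.2f} s per call, "
              f"{implied_nodes(run):.0f} nodes implied")
    ratio = per_call_seconds(runs["nosum"]) / per_call_seconds(runs["sum"])
    print(f"per-call cost ratio (nosum / sum): {ratio:.1f}")
```

Each communication costs about 5.9 summed seconds without -nosum and about 53 with it, so the cost per call rises roughly ninefold while the number of calls drops tenfold. One possible reading, not confirmed in this thread, is that most of the time booked under "Comm. energies" is spent waiting for other nodes at the synchronization point rather than transferring data.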

Thanks,
Chris.

Chris Neale wrote:
> Hello,
>
> I have been running simulations on a larger number of processors 
> recently and am confused about the message regarding -nosum that 
> occurs at the end of the .log file. In this case, I have included the 
> -nosum option to mdrun and I still get this warning (gromacs 4.0.4).
>
> My command was:
> mpirun -np $(wc -l $PBS_NODEFILE | gawk '{print $1}') \
>   -machinefile $PBS_NODEFILE \
>   /scratch/cneale/exe/intel/gromacs-4.0.4/exec/bin/mdrun \
>   -deffnm test -nosum -npme 128
>
> #########
>
> To confirm that mdrun is receiving -nosum, this is what it prints to stderr:
> ...
> Option       Type   Value   Description
> ------------------------------------------------------
> -[no]h       bool   no      Print help info and quit
> -nice        int    0       Set the nicelevel
> -deffnm      string test    Set the default filename for all file options
> -[no]xvgr    bool   yes     Add specific codes (legends etc.) in the output
>                             xvg files for the xmgrace program
> -[no]pd      bool   no      Use particle decompostion
> -dd          vector 0 0 0   Domain decomposition grid, 0 is optimize
> -npme        int    128     Number of separate nodes to be used for PME, -1
>                             is guess
> -ddorder     enum   interleave  DD node order: interleave, pp_pme or cartesian
> -[no]ddcheck bool   yes     Check for all bonded interactions with DD
> -rdd         real   0       The maximum distance for bonded interactions with
>                             DD (nm), 0 is determine from initial coordinates
> -rcon        real   0       Maximum distance for P-LINCS (nm), 0 is estimate
> -dlb         enum   auto    Dynamic load balancing (with DD): auto, no or yes
> -dds         real   0.8     Minimum allowed dlb scaling of the DD cell size
> -[no]sum     bool   no      Sum the energies at every step
> -[no]v       bool   no      Be loud and noisy
> -[no]compact bool   yes     Write a compact log file
> -[no]seppot  bool   no      Write separate V and dVdl terms for each
>                             interaction type and node to the log file(s)
> -pforce      real   -1      Print all forces larger than this (kJ/mol nm)
> -[no]reprod  bool   no      Try to avoid optimizations that affect binary
>                             reproducibility
> -cpt         real   15      Checkpoint interval (minutes)
> -[no]append  bool   no      Append to previous output files when continuing
>                             from checkpoint
> -[no]addpart bool   yes     Add the simulation part number to all output
>                             files when continuing from checkpoint
> -maxh        real   -1      Terminate after 0.99 times this time (hours)
> -multi       int    0       Do multiple simulations in parallel
> -replex      int    0       Attempt replica exchange every # steps
> -reseed      int    -1      Seed for replica exchange, -1 is generate a seed
> -[no]glas    bool   no      Do glass simulation with special long range
>                             corrections
> -[no]ionize  bool   no      Do a simulation including the effect of an X-Ray
>                             bombardment on your system
> ...
>
> ########
>
> And the message at the end of the .log file is:
> ...
>    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>
> av. #atoms communicated per step for force:  2 x 3376415.3
> av. #atoms communicated per step for LINCS:  2 x 192096.6
>
> Average load imbalance: 11.7 %
> Part of the total run time spent waiting due to load imbalance: 7.9 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
> Average PME mesh/force load: 0.620
> Part of the total run time spent waiting due to PP/PME imbalance: 10.0 %
>
> NOTE: 7.9 % performance was lost due to load imbalance
>      in the domain decomposition.
>
> NOTE: 10.0 % performance was lost because the PME nodes
>      had less work to do than the PP nodes.
>      You might want to decrease the number of PME nodes
>      or decrease the cut-off and the grid spacing.
>
>
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> Computing:         Nodes     Number     G-Cycles    Seconds     %
> -----------------------------------------------------------------------
> Domain decomp.       256         51      337.551      131.2     0.7
> Send X to PME        256        501       59.454       23.1     0.1
> Comm. coord.         256        501      289.936      112.7     0.6
> Neighbor search      256         51     1250.088      485.9     2.8
> Force                256        501    16105.584     6259.9    35.4
> Wait + Comm. F       256        501     2441.390      948.9     5.4
> PME mesh             128        501     5552.336     2158.1    12.2
> Wait + Comm. X/F     128        501     9586.486     3726.1    21.1
> Wait + Recv. PME F   256        501      459.752      178.7     1.0
> Write traj.          256          2      223.993       87.1     0.5
> Update               256        501      777.618      302.2     1.7
> Constraints          256       1002     1223.093      475.4     2.7
> Comm. energies       256         51     7011.309     2725.1    15.4
> Rest                 256                 127.710       49.6     0.3
> -----------------------------------------------------------------------
> Total                384               45446.299    17664.0   100.0
> -----------------------------------------------------------------------
>
> NOTE: 15 % of the run time was spent communicating energies,
>      you might want to use the -nosum option of mdrun
>
>
>        Parallel run - timing based on wallclock.
>
>               NODE (s)   Real (s)      (%)
>       Time:     46.000     46.000    100.0
>               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> Performance:  13778.036    728.080      1.882     12.752
>
> ########
>
> Thanks,
> Chris