[gmx-users] [gmx-developers] About dynamic load balancing

Roland Schulz roland at utk.edu
Thu Aug 21 19:39:07 CEST 2014


Hi,

Please don't use gmx-developers for user questions. Feel free to use it if
you want to fix the problem yourself and have questions about the
implementation details.

Please provide more details: How large is your system? How much memory does
a node have? On how many nodes are you trying to run? How many MPI ranks do
you have per node?
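
A quick way to collect most of that, assuming you still have the mdrun log
around and can get a shell on a compute node (the file name md.log is just
an example):

  # system size and the decomposition mdrun chose
  grep -i "atoms"                     md.log | head
  grep -i "Domain decomposition grid" md.log

  # ranks and threads actually used in the run
  grep -i "MPI process"   md.log
  grep -i "OpenMP thread" md.log

  # memory and cores available on the node
  free -g
  nproc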

Roland

On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:

>  Hi Gromacs Developers,
>
> I found something really interesting about dynamic load balancing. I am
> running my simulation on the Stampede supercomputer, whose nodes each have
> 16 physical cores (16 Intel Xeon cores per node) with one NVIDIA Tesla K20m
> GPU attached.
>
> When I use only the CPUs, I turn on dynamic load balancing with -dlb yes.
> It seems to work really well: the load imbalance is only 1~2%, which
> improves performance by 5~7%. But when I run the same simulation in GPU-CPU
> hybrid mode (a GPU node, 16 CPU cores and 1 GPU), dynamic load balancing
> kicks in because the imbalance jumps to ~50% right after startup (a sketch
> of the launch line follows the error output below). The system then reports
> a fail-to-allocate-memory error:
>
> NOTE: Turning on dynamic load balancing
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 5.0
> Source code file:
> /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line:
> 226
>
> Fatal error:
> Not enough memory. Failed to realloc 1020720 bytes for dest->a,
> dest->a=d5800030
> (called from file
> /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, line
> 1061)
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
> : Cannot allocate memory
> Error on rank 0, will try to stop all ranks
> Halting parallel program mdrun_mpi on CPU 0 out of 4
>
> gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes
> Back)
>
> [cli_0]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected
> End-Of-File on file descriptor 6. MPI process died?
> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process
> (rank: 0, pid: 112839) exited with status 255
> TACC: MPI job exited with code: 1
>
> TACC: Shutdown complete. Exiting.
>
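> For completeness, the hybrid run is launched with something along these
> lines (this is only a sketch; the file names and thread count are
> illustrative, not my exact script):
>
>   ibrun mdrun_mpi -deffnm md -dlb yes -ntomp 4
>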
> So I manually turned off dynamic load balancing with -dlb no. The
> simulation then runs, but with a very high load imbalance, for example:
>
> DD  step 139999 load imb.: force 51.3%
>
>            Step           Time         Lambda
>          140000      280.00000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
>  Pressure (bar)   Constr. rmsd
>    -3.39003e+01    3.10750e-05
>
> DD  step 149999 load imb.: force 60.8%
>
>            Step           Time         Lambda
>          150000      300.00000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
>  Pressure (bar)   Constr. rmsd
>    -1.40623e+00    3.16495e-05
>
> I think this high load imbalance costs more than 20% of the performance,
> but at least it lets the simulation run. So the problem I would like to
> report is that when running a GPU-CPU hybrid simulation with very few GPUs,
> dynamic load balancing causes domain decomposition problems
> (fail-to-allocate-memory). Is there currently any solution to this problem,
> or anything that could be improved?
>
> Yunlong
>
>
>
>
> --
>
> ========================================
> Yunlong Liu, PhD Candidate
> Computational Biology and Biophysics
> Department of Biophysics and Biophysical Chemistry
> School of Medicine, The Johns Hopkins University
> Email: yliu120 at jhmi.edu
> Address: 725 N Wolfe St, WBSB RM 601, 21205
> ========================================
>



-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309

