[gmx-developers] About dynamics loading balance
Roland Schulz
roland at utk.edu
Thu Aug 21 19:39:07 CEST 2014
Hi,
Please don't use gmx-developers for user questions. Feel free to use it if
you want to fix the problem yourself and have questions about implementation
details.
Please provide more details: How large is your system? How much memory does
a node have? On how many nodes do you try to run? How many MPI ranks do you
have per node?
Roland
On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> Hi Gromacs Developers,
>
> I found something about dynamic load balancing that is really interesting.
> I am running my simulations on the Stampede supercomputer, whose nodes have
> 16 physical cores (Intel Xeon) and one NVIDIA Tesla K20m GPU each.
>
> When I am using only the CPUs, I turn on dynamic load balancing with -dlb
> yes. It seems to work really well: the load imbalance is only 1-2%, which
> improves performance by 5-7%. But when I run on a CPU-GPU hybrid node (16
> CPU cores and 1 GPU), dynamic load balancing kicks in because the imbalance
> jumps to ~50% right after startup, and then the system reports a
> failure to allocate memory:
>
> NOTE: Turning on dynamic load balancing
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 5.0
> Source code file:
> /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line:
> 226
>
> Fatal error:
> Not enough memory. Failed to realloc 1020720 bytes for dest->a,
> dest->a=d5800030
> (called from file
> /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, line
> 1061)
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
> : Cannot allocate memory
> Error on rank 0, will try to stop all ranks
> Halting parallel program mdrun_mpi on CPU 0 out of 4
>
> gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes
> Back)
>
> [cli_0]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected
> End-Of-File on file descriptor 6. MPI process died?
> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error
> while reading PMI socket. MPI process died?
> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process
> (rank: 0, pid: 112839) exited with status 255
> TACC: MPI job exited with code: 1
>
> TACC: Shutdown complete. Exiting.
>
> So I manually turned off dynamic load balancing with -dlb no. The
> simulation then runs through, though with very high load imbalance, e.g.:
>
> DD step 139999 load imb.: force 51.3%
>
> Step Time Lambda
> 140000 280.00000 0.00000
>
> Energies (kJ/mol)
> U-B Proper Dih. Improper Dih. CMAP Dih. LJ-14
> 4.88709e+04 1.21990e+04 2.99128e+03 -1.46719e+03 1.98569e+04
> Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
> 2.54663e+05 4.05141e+05 -3.16020e+04 -3.75610e+06 2.24819e+04
> Potential Kinetic En. Total Energy Temperature Pres. DC (bar)
> -3.02297e+06 6.15217e+05 -2.40775e+06 3.09312e+02 -2.17704e+02
> Pressure (bar) Constr. rmsd
> -3.39003e+01 3.10750e-05
>
> DD step 149999 load imb.: force 60.8%
>
> Step Time Lambda
> 150000 300.00000 0.00000
>
> Energies (kJ/mol)
> U-B Proper Dih. Improper Dih. CMAP Dih. LJ-14
> 4.96380e+04 1.21010e+04 2.99986e+03 -1.51918e+03 1.97542e+04
> Coulomb-14 LJ (SR) Disper. corr. Coulomb (SR) Coul. recip.
> 2.54305e+05 4.06024e+05 -3.15801e+04 -3.75534e+06 2.24001e+04
> Potential Kinetic En. Total Energy Temperature Pres. DC (bar)
> -3.02121e+06 6.17009e+05 -2.40420e+06 3.10213e+02 -2.17403e+02
> Pressure (bar) Constr. rmsd
> -1.40623e+00 3.16495e-05
>
> I think this high load imbalance costs more than 20% in performance, but at
> least it lets the simulation run. So the problem I would like to report is:
> when running a CPU-GPU hybrid simulation with very few GPUs, dynamic load
> balancing causes domain-decomposition problems (failure to allocate
> memory). Is there currently any solution to this problem, or anything that
> could be improved?
>
> Yunlong
>
>
>
>
> --
>
> ========================================
> Yunlong Liu, PhD Candidate
> Computational Biology and Biophysics
> Department of Biophysics and Biophysical Chemistry
> School of Medicine, The Johns Hopkins University
> Email: yliu120 at jhmi.edu
> Address: 725 N Wolfe St, WBSB RM 601, 21205
> ========================================
>
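For reference, the workaround described above amounts to disabling dynamic
load balancing on the mdrun command line. A minimal sketch of a
Stampede-style batch script, assuming a SLURM/ibrun launcher and
hypothetical module, partition, and input names:

```shell
#!/bin/bash
#SBATCH -N 1                # one GPU node (16 cores, 1 K20 GPU)
#SBATCH -n 4                # 4 MPI ranks per node, as in the log above
#SBATCH -p gpu              # partition name is an assumption
#SBATCH -t 04:00:00

module load gromacs         # hypothetical module name

# Work around the realloc failure in domdec_top.c by forcing dynamic
# load balancing off, accepting the ~50% force imbalance for now.
# "topol" is a placeholder for the actual .tpr base name.
ibrun mdrun_mpi -deffnm topol -dlb no -ntomp 4
```

Since -dlb defaults to auto, dropping "-dlb no" restores automatic
balancing once the underlying memory problem is fixed.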
--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309