[gmx-developers] About dynamic load balancing

Yunlong Liu yliu120 at jh.edu
Thu Aug 21 19:22:02 CEST 2014


Hi Gromacs Developers,

I found something really interesting about dynamic load balancing. 
I am running my simulations on the Stampede supercomputer, whose nodes 
have 16 physical cores (16 Intel Xeon cores per node) with an 
NVIDIA Tesla K20m GPU attached.

When using only the CPUs, I turn on dynamic load balancing with 
-dlb yes, and it seems to work really well: the load imbalance is 
only 1-2%, which improves performance by roughly 5-7%. But when I 
run on a CPU-GPU hybrid node (16 CPU cores and 1 GPU), dynamic 
load balancing kicks in because the imbalance jumps to ~50% 
right after startup. The system then reports a 
failed-to-allocate-memory error:

NOTE: Turning on dynamic load balancing


-------------------------------------------------------
Program mdrun_mpi, VERSION 5.0
Source code file: 
/home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, 
line: 226

Fatal error:
Not enough memory. Failed to realloc 1020720 bytes for dest->a, 
dest->a=d5800030
(called from file 
/home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, 
line 1061)
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
: Cannot allocate memory
Error on rank 0, will try to stop all ranks
Halting parallel program mdrun_mpi on CPU 0 out of 4

gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes 
Back)

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected 
End-Of-File on file descriptor 6. MPI process died?
[c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error 
while reading PMI socket. MPI process died?
[c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI 
process (rank: 0, pid: 112839) exited with status 255
TACC: MPI job exited with code: 1

TACC: Shutdown complete. Exiting.

So I manually turned off dynamic load balancing with -dlb no. The 
simulation then runs through, but with a very high load imbalance, e.g.:

DD  step 139999 load imb.: force 51.3%

            Step           Time         Lambda
          140000      280.00000        0.00000

    Energies (kJ/mol)
             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
     4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03 1.98569e+04
      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR) Coul. recip.
     2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06 2.24819e+04
       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
    -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02 -2.17704e+02
  Pressure (bar)   Constr. rmsd
    -3.39003e+01    3.10750e-05

DD  step 149999 load imb.: force 60.8%

            Step           Time         Lambda
          150000      300.00000        0.00000

    Energies (kJ/mol)
             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
     4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03 1.97542e+04
      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR) Coul. recip.
     2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06 2.24001e+04
       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
    -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02 -2.17403e+02
  Pressure (bar)   Constr. rmsd
    -1.40623e+00    3.16495e-05

I estimate this high load imbalance costs more than 20% of the 
performance, but at least it lets the simulation run. The problem I 
would like to report, therefore, is that when running a CPU-GPU hybrid 
simulation with very few GPUs, dynamic load balancing causes 
domain-decomposition problems (failure to allocate memory). Is there 
currently any solution to this, or anything that could be improved?
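For reference, the two invocations look roughly like the following (the launcher, deffnm name, and GPU id are placeholders for my actual Stampede job script, not exact copies of it):

```shell
# CPU-only run: dynamic load balancing works well (1-2% imbalance)
ibrun mdrun_mpi -deffnm topol -dlb yes

# CPU-GPU hybrid run: DLB triggers the realloc failure shown above,
# so as a workaround force it off and accept the ~50% imbalance
ibrun mdrun_mpi -deffnm topol -dlb no -gpu_id 0
```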

Yunlong




-- 

========================================
Yunlong Liu, PhD Candidate
Computational Biology and Biophysics
Department of Biophysics and Biophysical Chemistry
School of Medicine, The Johns Hopkins University
Email: yliu120 at jhmi.edu
Address: 725 N Wolfe St, WBSB RM 601, 21205
========================================
