[gmx-developers] About dynamic load balancing

Yunlong Liu yliu120 at jh.edu
Thu Aug 21 19:22:02 CEST 2014

Hi Gromacs Developers,

I found something interesting about dynamic load balancing. I am running 
my simulations on the Stampede supercomputer, which has nodes with 16 
physical cores (16 Intel Xeon cores per node) and an NVIDIA Tesla K20m 
GPU attached.

When I use only the CPUs, I turn on dynamic load balancing with -dlb 
yes. This works very well: the load imbalance is only 1~2%, and it 
improves performance by about 5~7%. But when I run on the GPU-CPU hybrid 
setup (GPU node, 16 CPU cores and 1 GPU), dynamic load balancing kicks 
in because the imbalance jumps to ~50% right after the run starts, and 
the system then reports a fail-to-allocate-memory error (a sketch of the 
launch command is given after the error output below):

NOTE: Turning on dynamic load balancing

Program mdrun_mpi, VERSION 5.0
Source code file: 
line: 226

Fatal error:
Not enough memory. Failed to realloc 1020720 bytes for dest->a, 
(called from file 
line 1061)
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
: Cannot allocate memory
Error on rank 0, will try to stop all ranks
Halting parallel program mdrun_mpi on CPU 0 out of 4

gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes Back)

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
[c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected 
End-Of-File on file descriptor 6. MPI process died?
[c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error 
while reading PMI socket. MPI process died?
[c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI 
process (rank: 0, pid: 112839) exited with status 255
TACC: MPI job exited with code: 1

TACC: Shutdown complete. Exiting.
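
For reference, the hybrid job is launched along the lines of the sketch 
below. This is only illustrative: the input name, the rank count (4, as 
in the log above) and the launcher options are placeholders, not my 
exact job script. With the default -dlb auto, mdrun switches DLB on by 
itself once the measured imbalance passes its threshold, and that is the 
point where the crash above occurs:

    ibrun -n 4 mdrun_mpi -deffnm md -dlb auto

On the GPU nodes mdrun detects the K20m by itself and offloads the 
non-bonded work to it, so no extra GPU flags are needed in the simplest 
case.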

So I manually turned off dynamic load balancing with -dlb no (a command 
sketch follows the log excerpt below). The simulation then runs through, 
but with a very high load imbalance, for example:

DD  step 139999 load imb.: force 51.3%

            Step           Time         Lambda
          140000      280.00000        0.00000

    Energies (kJ/mol)
            U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
    4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
 Pressure (bar)   Constr. rmsd
   -3.39003e+01    3.10750e-05

DD  step 149999 load imb.: force 60.8%

            Step           Time         Lambda
          150000      300.00000        0.00000

    Energies (kJ/mol)
            U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
    4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
 Pressure (bar)   Constr. rmsd
   -1.40623e+00    3.16495e-05
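
The workaround differs only in forcing DLB off; schematically (again 
with placeholder names and counts):

    ibrun -n 4 mdrun_mpi -deffnm md -dlb no

With -dlb no the domain decomposition stays fixed, so the reallocation 
failure never shows up, at the cost of the ~50-60% force imbalance 
reported above.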

I think this high load imbalance costs more than 20% of the performance 
(a rough estimate follows below), but at least it lets the simulation 
run. So the problem I would like to report is that when running GPU-CPU 
hybrid simulations with very few GPUs, dynamic load balancing causes 
domain decomposition problems (the fail-to-allocate-memory error above). 
Is there currently any solution to this, or anything that could be 
improved?
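
For a rough back-of-envelope estimate of that 20% figure (my own 
assumptions, not numbers taken from the log): with a force imbalance of 
about 55%, the most loaded domain needs about 1.55x the average force 
time per step, so the other ranks sit idle for roughly 0.55/1.55, i.e. 
about 35% of the force time. If the CPU force work accounts for 
something like half of each step in this hybrid setup, that is already 
on the order of 15-20% of the total run time lost to waiting.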



Yunlong Liu, PhD Candidate
Computational Biology and Biophysics
Department of Biophysics and Biophysical Chemistry
School of Medicine, The Johns Hopkins University
Email: yliu120 at jhmi.edu
Address: 725 N Wolfe St, WBSB RM 601, 21205
