[gmx-users] [gmx-developers] About dynamic load balancing
Yunlong Liu
yliu120 at jh.edu
Thu Aug 21 20:26:03 CEST 2014
Hi Roland,
I just compiled the latest gromacs-5.0 version, released on June 29th. I
will recompile it as you suggested, using those flags. It also seems that
the high load imbalance doesn't actually hurt performance, which
is weird.
Thank you.
Yunlong
On 8/21/14, 2:13 PM, Roland Schulz wrote:
> Hi,
>
>
> On Thu, Aug 21, 2014 at 1:56 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>
> Hi Roland,
>
> The problem I am posting is not caused by trivial errors (like not
> enough memory); I think it is a real bug inside the
> gromacs-GPU support code.
>
> It is unlikely to be a trivial error, because otherwise someone else would
> have noticed. You could try the release-5-0 branch from git, but I'm
> not aware of any bug fixes related to memory allocation.
> The memory allocation that triggers the error isn't itself the problem; the
> printed size is reasonable. You could recompile with PRINT_ALLOC_KB
> (add -DPRINT_ALLOC_KB to CMAKE_C_FLAGS) and rerun the simulation. It
> might tell you where the unusually large memory allocations happen.
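> A minimal sketch of that recompile, assuming an out-of-source CMake build
> (the build directory path here is hypothetical):
>
> cd /home1/03002/yliu120/build/gromacs-5.0-build  # hypothetical build dir
> cmake .. -DCMAKE_C_FLAGS="-DPRINT_ALLOC_KB"      # pass the define through CMAKE_C_FLAGS
> make -j 8 && make install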
>
> PS: Please don't reply to an individual Gromacs developer. Keep all
> conversation on the gmx-users list.
>
> Roland
>
> That is why I posted this problem to the developer
> mailing list.
>
> My system contains ~240,000 atoms. It is a rather large protein. The
> memory information of the node is:
>
> top - 12:46:59 up 15 days, 22:18, 1 user, load average: 1.13, 6.27, 11.28
> Tasks: 510 total, 2 running, 508 sleeping, 0 stopped, 0 zombie
> Cpu(s): 6.3%us, 0.0%sy, 0.0%ni, 93.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 32815324k total, 4983916k used, 27831408k free, 7984k buffers
> Swap: 4194296k total, 0k used, 4194296k free, 700588k cached
>
> I am running the simulation on 2 nodes, with 4 MPI ranks in total and
> 8 OpenMP threads per rank. The CPU and GPU information is listed here
> (a sketch of the corresponding launch line follows the listings):
>
> c442-702.stampede(1)$ nvidia-smi
> Thu Aug 21 12:46:17 2014
> +------------------------------------------------------+
> | NVIDIA-SMI 331.67     Driver Version: 331.67         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
> | N/A   22C    P0    46W / 225W |    172MiB /  4799MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Compute processes:                                               GPU Memory |
> |  GPU       PID  Process name                                     Usage      |
> |=============================================================================|
> |    0    113588  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi         77MiB   |
> |    0    113589  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi         77MiB   |
> +-----------------------------------------------------------------------------+
>
> c442-702.stampede(4)$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 16
> On-line CPU(s) list: 0-15
> Thread(s) per core: 1
> Core(s) per socket: 8
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 45
> Stepping: 7
> CPU MHz: 2701.000
> BogoMIPS: 5399.22
> Virtualization: VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 20480K
> NUMA node0 CPU(s): 0-7
> NUMA node1 CPU(s): 8-15
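>
> A sketch of the corresponding launch line (ibrun is TACC's MPI launcher on
> Stampede; the -deffnm file name is illustrative, not my actual file):
>
> # 2 nodes x 2 MPI ranks per node = 4 ranks, 8 OpenMP threads per rank
> ibrun -np 4 mdrun_mpi -deffnm protein_md -ntomp 8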
>
> I hope this information will help. Thank you.
>
> Yunlong
>
> On 8/21/14, 1:38 PM, Roland Schulz wrote:
>> Hi,
>>
>> please don't use gmx-developers for user questions. Feel free to
>> use it if you want to fix the problem, and have questions about
>> implementation details.
>>
>> Please provide more details: How large is your system? How much
>> memory does a node have? On how many nodes are you trying to run? How
>> many MPI ranks do you have per node?
>>
>> Roland
>>
>> On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>>
>> Hi Gromacs Developers,
>>
>> I found something about dynamic load balancing really
>> interesting. I am running my simulation on the Stampede
>> supercomputer, whose nodes have 16 physical cores (really
>> 16 Intel Xeon cores on one node) and an NVIDIA Tesla K20m
>> GPU attached.
>>
>> When I use only the CPUs, I turn on dynamic load
>> balancing with -dlb yes. It seems to work really well: the
>> load imbalance is only 1~2%, which improves performance
>> by 5~7%. But when I run in GPU-CPU hybrid mode (GPU node,
>> 16 CPUs and 1 GPU), dynamic load balancing kicks in because
>> the imbalance goes up to ~50% instantly after startup. Then
>> the system reports a fail-to-allocate-memory error:
>>
>> NOTE: Turning on dynamic load balancing
>>
>>
>> -------------------------------------------------------
>> Program mdrun_mpi, VERSION 5.0
>> Source code file: /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line: 226
>>
>> Fatal error:
>> Not enough memory. Failed to realloc 1020720 bytes for dest->a, dest->a=d5800030
>> (called from file /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, line 1061)
>> For more information and tips for troubleshooting, please check the GROMACS
>> website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> : Cannot allocate memory
>> Error on rank 0, will try to stop all ranks
>> Halting parallel program mdrun_mpi on CPU 0 out of 4
>>
>> gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes Back)
>>
>> [cli_0]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process (rank: 0, pid: 112839) exited with status 255
>> TACC: MPI job exited with code: 1
>>
>> TACC: Shutdown complete. Exiting.
>>
>> So I manually turned off dynamic load balancing with -dlb
>> no. The simulation then runs through, albeit with very high load
>> imbalance, for example:
>>
>> DD step 139999 load imb.: force 51.3%
>>
>>            Step           Time         Lambda
>>          140000      280.00000        0.00000
>>
>>    Energies (kJ/mol)
>>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>     4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
>>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)    Coul. recip.
>>     2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
>>       Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>>    -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
>>  Pressure (bar)   Constr. rmsd
>>    -3.39003e+01    3.10750e-05
>>
>> DD step 149999 load imb.: force 60.8%
>>
>>            Step           Time         Lambda
>>          150000      300.00000        0.00000
>>
>>    Energies (kJ/mol)
>>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>     4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
>>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)    Coul. recip.
>>     2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
>>       Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>>    -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
>>  Pressure (bar)   Constr. rmsd
>>    -1.40623e+00    3.16495e-05
>>
>> I think this high load imbalance costs more than 20%
>> of the performance, but at least it lets the simulation
>> run. So the problem I would like to report is: when
>> running a simulation in GPU-CPU hybrid mode with very few
>> GPUs, dynamic load balancing causes domain
>> decomposition problems (failure to allocate memory). I don't
>> know whether there is currently any solution to this problem,
>> or whether anything could be improved.
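>>
>> For reference, the two runs differ only in the -dlb flag; a sketch of the
>> invocations (the -deffnm file name is illustrative, not my actual file):
>>
>> # CPU-only run: forcing dynamic load balancing on works fine (~1-2% imbalance)
>> ibrun mdrun_mpi -deffnm protein_md -dlb yes
>>
>> # GPU-CPU hybrid run: forcing it off avoids the realloc crash
>> ibrun mdrun_mpi -deffnm protein_md -dlb no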
>>
>> Yunlong
>>
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
--
========================================
Yunlong Liu, PhD Candidate
Computational Biology and Biophysics
Department of Biophysics and Biophysical Chemistry
School of Medicine, The Johns Hopkins University
Email: yliu120 at jhmi.edu
Address: 725 N Wolfe St, WBSB RM 601, 21205
========================================