[gmx-users] [gmx-developers] About dynamic load balancing

Roland Schulz roland at utk.edu
Thu Aug 21 20:13:30 CEST 2014


Hi,


On Thu, Aug 21, 2014 at 1:56 PM, Yunlong Liu <yliu120 at jh.edu> wrote:

>  Hi Roland,
>
> The problem I am posting is not caused by trivial errors (like genuinely
> running out of memory); I think it is a real bug inside the GROMACS GPU
> support code.
>
It is unlikely to be a trivial error, because otherwise someone else would
have noticed it. You could try the release-5-0 branch from git, but I'm not
aware of any bug fixes related to memory allocation.
The memory allocation that triggers the error isn't the problem itself; the
printed size is reasonable. You could recompile with PRINT_ALLOC_KB (add
-DPRINT_ALLOC_KB to CMAKE_C_FLAGS) and rerun the simulation. That might tell
you where the unusually large memory allocation happens.
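
For example, reconfiguring your existing build directory along these lines
should do it (the path is a placeholder, the other options from your original
configuration stay in the CMake cache, and this is an untested sketch):

  cd /path/to/gromacs-5.0/build
  cmake . -DCMAKE_C_FLAGS="-DPRINT_ALLOC_KB"
  make -j 8 && make install

mdrun should then report large allocations as they happen, which should help
narrow down where the memory goes just before the crash.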

PS: Please don't reply to an individual Gromacs developer. Keep all
conversation on the gmx-users list.

Roland



> That is the reason why I posted this problem to the developer mailing list.
>
> My system contains ~240,000 atoms; it is a rather big protein system. The
> memory information of the node is:
>
> top - 12:46:59 up 15 days, 22:18,  1 user,  load average: 1.13, 6.27, 11.28
> Tasks: 510 total,   2 running, 508 sleeping,   0 stopped,   0 zombie
> Cpu(s):  6.3%us,  0.0%sy,  0.0%ni, 93.7%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:  32815324k total,  4983916k used, 27831408k free,     7984k buffers
> Swap:  4194296k total,        0k used,  4194296k free,   700588k cached
>
> I am running the simulation on 2 nodes with 4 MPI ranks in total (2 ranks
> per node), each rank using 8 OpenMP threads; a sketch of the corresponding
> launch line follows the hardware listings. The CPU and GPU information of
> the nodes is:
>
> c442-702.stampede(1)$ nvidia-smi
> Thu Aug 21 12:46:17 2014
> +------------------------------------------------------------------------------+
> | NVIDIA-SMI 331.67     Driver Version: 331.67                                 |
> |-------------------------------+----------------------+-----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC  |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M.  |
> |===============================+======================+=======================|
> |   0  Tesla K20m          Off  | 0000:03:00.0     Off |                    0  |
> | N/A   22C    P0    46W / 225W |    172MiB /  4799MiB |      0%      Default  |
> +-------------------------------+----------------------+-----------------------+
>
> +------------------------------------------------------------------------------+
> | Compute processes:                                               GPU Memory  |
> |  GPU       PID  Process name                                     Usage       |
> |==============================================================================|
> |    0    113588  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi      77MiB       |
> |    0    113589  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi      77MiB       |
> +------------------------------------------------------------------------------+
>
> c442-702.stampede(4)$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    1
> Core(s) per socket:    8
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 45
> Stepping:              7
> CPU MHz:               2701.000
> BogoMIPS:              5399.22
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              20480K
> NUMA node0 CPU(s):     0-7
> NUMA node1 CPU(s):     8-15
>
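> For reference, a launch line consistent with the layout above would look
> roughly like the following (a sketch only; the exact job-script options and
> file names differ):
>
>   export OMP_NUM_THREADS=8
>   ibrun mdrun_mpi -ntomp 8 -deffnm md
>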
> I hope this information will help. Thank you.
>
> Yunlong
>
>
>
>
>
>
> On 8/21/14, 1:38 PM, Roland Schulz wrote:
>
> Hi,
>
>  Please don't use gmx-developers for user questions. Feel free to use it
> if you want to fix the problem yourself and have questions about
> implementation details.
>
>  Please provide more details: How large is your system? How much memory
> does a node have? On how many nodes are you trying to run? How many MPI
> ranks do you have per node?
>
>  Roland
>
> On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>
>> Hi Gromacs Developers,
>>
>> I found something about dynamic load balancing really interesting. I am
>> running my simulation on the Stampede supercomputer, whose nodes have 16
>> physical cores (two 8-core Intel Xeon sockets) plus one NVIDIA Tesla K20m
>> GPU.
>>
>> When I am using only the CPUs, I turn on dynamic load balancing with
>> -dlb yes, and it seems to work really well: the load imbalance is only
>> 1~2%, which improves performance by about 5~7%. But when I run on a
>> CPU-GPU hybrid node (16 CPU cores and 1 GPU), dynamic load balancing
>> kicks in because the imbalance goes up to ~50% right after the run
>> starts, and then the system reports a fail-to-allocate-memory error:
>>
>> NOTE: Turning on dynamic load balancing
>>
>>
>> -------------------------------------------------------
>> Program mdrun_mpi, VERSION 5.0
>> Source code file:
>> /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line:
>> 226
>>
>> Fatal error:
>> Not enough memory. Failed to realloc 1020720 bytes for dest->a,
>> dest->a=d5800030
>> (called from file
>> /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, line
>> 1061)
>> For more information and tips for troubleshooting, please check the
>> GROMACS
>> website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> : Cannot allocate memory
>> Error on rank 0, will try to stop all ranks
>> Halting parallel program mdrun_mpi on CPU 0 out of 4
>>
>> gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes
>> Back)
>>
>> [cli_0]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected
>> End-Of-File on file descriptor 6. MPI process died?
>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error
>> while reading PMI socket. MPI process died?
>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI
>> process (rank: 0, pid: 112839) exited with status 255
>> TACC: MPI job exited with code: 1
>>
>> TACC: Shutdown complete. Exiting.
>>
>> So I manually turned off dynamic load balancing with -dlb no. The
>> simulation then runs, but with very high load imbalance, for example:
>>
>> DD  step 139999 load imb.: force 51.3%
>>
>>            Step           Time         Lambda
>>          140000      280.00000        0.00000
>>
>>    Energies (kJ/mol)
>>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>     4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
>>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>     2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
>>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>>    -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
>>  Pressure (bar)   Constr. rmsd
>>    -3.39003e+01    3.10750e-05
>>
>> DD  step 149999 load imb.: force 60.8%
>>
>>            Step           Time         Lambda
>>          150000      300.00000        0.00000
>>
>>    Energies (kJ/mol)
>>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>     4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
>>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>     2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
>>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>>    -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
>>  Pressure (bar)   Constr. rmsd
>>    -1.40623e+00    3.16495e-05
>>
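>> Those imbalance numbers can be tracked over the whole run with something
>> like the following (md.log here is a placeholder for the actual log file
>> name):
>>
>>   grep 'load imb' md.log
>>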
>> I think this high load imbalance costs more than 20% of the performance,
>> but at least it lets the simulation run. So the problem I would like to
>> report is: when running a CPU-GPU hybrid simulation with very few GPUs,
>> dynamic load balancing causes domain decomposition problems (failure to
>> allocate memory). Is there currently any solution to this problem, or
>> anything that could be improved?
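>>
>> For reference, the relevant difference between the failing and the working
>> run is just the -dlb flag; the commands were roughly of the form below
>> (input/output names are placeholders):
>>
>>   mdrun_mpi -s md.tpr -deffnm md -dlb auto   # DLB switches on, then the realloc error
>>   mdrun_mpi -s md.tpr -deffnm md -dlb no     # runs, but with ~50-60% force imbalance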
>>
>> Yunlong
>>
>>
>>
>>
>> --
>>
>> ========================================
>> Yunlong Liu, PhD Candidate
>> Computational Biology and Biophysics
>> Department of Biophysics and Biophysical Chemistry
>> School of Medicine, The Johns Hopkins University
>> Email: yliu120 at jhmi.edu
>> Address: 725 N Wolfe St, WBSB RM 601, 21205
>> ========================================
>>
>
>
>
>  --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
>
>
> --
>
> ========================================
> Yunlong Liu, PhD Candidate
> Computational Biology and Biophysics
> Department of Biophysics and Biophysical Chemistry
> School of Medicine, The Johns Hopkins University
> Email: yliu120 at jhmi.edu
> Address: 725 N Wolfe St, WBSB RM 601, 21205
> ========================================
>



-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309

