[gmx-users] [gmx-developers] About dynamic load balancing

Szilárd Páll pall.szilard at gmail.com
Sun Aug 24 20:19:06 CEST 2014


On Thu, Aug 21, 2014 at 8:25 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> Hi Roland,
>
> I just compiled the latest gromacs-5.0 version released on June 29th. I will
> recompile it as you suggested, using those flags. It also seems that the high
> load imbalance doesn't affect the performance, which is weird.

How did you draw that conclusion? Please show us the log files of the
respective runs; that will help us assess what is going on.
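
For a quick comparison, the imbalance numbers can also be pulled straight
out of the logs, e.g. (assuming the default log name md.log):

$ grep "load imb" md.log

and the performance summary at the end of each log reports the average
load imbalance and the time lost waiting because of it.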

--
Szilárd

> Thank you.
> Yunlong
>
> On 8/21/14, 2:13 PM, Roland Schulz wrote:
>>
>> Hi,
>>
>>
>>
>> On Thu, Aug 21, 2014 at 1:56 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>>
>>     Hi Roland,
>>
>>     The problem I am posting is not caused by trivial errors (like
>>     not enough memory); I think it is a real bug inside the
>>     gromacs-GPU support code.
>>
>> It is unlikely a trivial error, because otherwise someone else would have
>> noticed. You could try the release-5-0 branch from git, but I'm not aware of
>> any bugfixes related to memory allocation.
>> The allocation which triggers the error isn't the real problem; the
>> printed size is reasonable. You could recompile with PRINT_ALLOC_KB (add
>> -DPRINT_ALLOC_KB to CMAKE_C_FLAGS) and rerun the simulation. It might tell
>> you where the unusually large memory allocation happens.
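>>
>> For example (a sketch; adjust the source path and your usual cmake
>> options to your own build setup):
>>
>>     $ cmake ../gromacs-5.0 -DGMX_GPU=ON -DCMAKE_C_FLAGS="-DPRINT_ALLOC_KB"
>>     $ make -j 8 mdrun
>>
>> With that define set, smalloc.c prints a note for each large allocation,
>> which should show where the runaway one happens.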
>>
>> PS: Please don't reply to an individual Gromacs developer. Keep all
>> conversation on the gmx-users list.
>>
>> Roland
>>
>>     That is why I posted this problem to the developer mailing
>>     list.
>>
>>     My system contains ~240,000 atoms; it is a rather big protein
>>     system. The memory information of the node is:
>>
>>     top - 12:46:59 up 15 days, 22:18,  1 user,  load average: 1.13, 6.27, 11.28
>>     Tasks: 510 total,   2 running, 508 sleeping,   0 stopped,   0 zombie
>>     Cpu(s):  6.3%us,  0.0%sy,  0.0%ni, 93.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>     Mem:  32815324k total,  4983916k used, 27831408k free,     7984k buffers
>>     Swap:  4194296k total,        0k used,  4194296k free,   700588k cached
>>
>>     I am running the simulation on 2 nodes, with 4 MPI ranks in
>>     total and 8 OpenMP threads per rank (a sketch of the launch
>>     command follows the hardware listing). Their CPU and GPU
>>     information is listed here:
>>
>>     c442-702.stampede(1)$ nvidia-smi
>>     Thu Aug 21 12:46:17 2014
>>     +------------------------------------------------------+
>>     | NVIDIA-SMI 331.67     Driver Version: 331.67         |
>>     |-------------------------------+----------------------+----------------------+
>>     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>     |===============================+======================+======================|
>>     |   0  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
>>     | N/A   22C    P0    46W / 225W |    172MiB /  4799MiB |      0%      Default |
>>     +-------------------------------+----------------------+----------------------+
>>
>>     +-----------------------------------------------------------------------------+
>>     | Compute processes:                                               GPU Memory |
>>     |  GPU       PID  Process name                                     Usage      |
>>     |=============================================================================|
>>     |    0    113588  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi           77MiB |
>>     |    0    113589  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi           77MiB |
>>     +-----------------------------------------------------------------------------+
>>
>>     c442-702.stampede(4)$ lscpu
>>     Architecture:          x86_64
>>     CPU op-mode(s):        32-bit, 64-bit
>>     Byte Order:            Little Endian
>>     CPU(s):                16
>>     On-line CPU(s) list:   0-15
>>     Thread(s) per core:    1
>>     Core(s) per socket:    8
>>     Socket(s):             2
>>     NUMA node(s):          2
>>     Vendor ID:             GenuineIntel
>>     CPU family:            6
>>     Model:                 45
>>     Stepping:              7
>>     CPU MHz:               2701.000
>>     BogoMIPS:              5399.22
>>     Virtualization:        VT-x
>>     L1d cache:             32K
>>     L1i cache:             32K
>>     L2 cache:              256K
>>     L3 cache:              20480K
>>     NUMA node0 CPU(s):     0-7
>>     NUMA node1 CPU(s):     8-15
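>>
>>     For reference, the launch is of this form (a sketch; everything
>>     apart from the rank/thread layout is illustrative, with the 4
>>     MPI ranks set in the SLURM job script):
>>
>>     $ export OMP_NUM_THREADS=8
>>     $ ibrun mdrun_mpi -ntomp 8 -deffnm md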
>>
>>     I hope this information will help. Thank you.
>>
>>     Yunlong
>>
>>
>>
>>
>>
>>
>>     On 8/21/14, 1:38 PM, Roland Schulz wrote:
>>>
>>>     Hi,
>>>
>>>     please don't use gmx-developers for user questions. Feel free to
>>>     use it if you want to fix the problem, and have questions about
>>>     implementation details.
>>>
>>>     Please provide more details: How large is your system? How much
>>>     memory does a node have? On how many nodes do you try to run? How
>>>     many mpi-ranks do you have per node?
>>>
>>>     Roland
>>>
>>>     On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu
>>>     <mailto:yliu120 at jh.edu>> wrote:
>>>
>>>         Hi Gromacs Developers,
>>>
>>>         I found something really interesting about dynamic load
>>>         balancing. I am running my simulation on the Stampede
>>>         supercomputer, which has nodes with 16 physical cores
>>>         (really 16 Intel Xeon cores per node) and an NVIDIA Tesla
>>>         K20m GPU attached.
>>>
>>>         When using only the CPUs, I turned on dynamic load
>>>         balancing with -dlb yes. It seems to work really well: the
>>>         load imbalance is only 1~2%, which improves the performance
>>>         by 5~7%. But when I run on a CPU-GPU hybrid setup (GPU
>>>         node, 16 CPU cores and 1 GPU), dynamic load balancing kicks
>>>         in because the imbalance goes up to ~50% right after
>>>         startup. Then the system reports a fail-to-allocate-memory
>>>         error:
>>>
>>>         NOTE: Turning on dynamic load balancing
>>>
>>>
>>>         -------------------------------------------------------
>>>         Program mdrun_mpi, VERSION 5.0
>>>         Source code file:
>>>         /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line: 226
>>>
>>>         Fatal error:
>>>         Not enough memory. Failed to realloc 1020720 bytes for
>>>         dest->a, dest->a=d5800030
>>>         (called from file
>>>         /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c,
>>>         line 1061)
>>>         For more information and tips for troubleshooting, please
>>>         check the GROMACS website at
>>>         http://www.gromacs.org/Documentation/Errors
>>>         -------------------------------------------------------
>>>         : Cannot allocate memory
>>>         Error on rank 0, will try to stop all ranks
>>>         Halting parallel program mdrun_mpi on CPU 0 out of 4
>>>
>>>         gcq#274: "I Feel a Great Disturbance in the Force" (The
>>>         Emperor Strikes Back)
>>>
>>>         [cli_0]: aborting job:
>>>         application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
>>>         [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline]
>>>         Unexpected End-Of-File on file descriptor 6. MPI process died?
>>>         [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops]
>>>         Error while reading PMI socket. MPI process died?
>>>         [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler]
>>>         MPI process (rank: 0, pid: 112839) exited with status 255
>>>         TACC: MPI job exited with code: 1
>>>
>>>         TACC: Shutdown complete. Exiting.
>>>
>>>         So I manually turned off dynamic load balancing with
>>>         -dlb no. The simulation then goes through, but with very
>>>         high load imbalance, like:
>>>
>>>         DD  step 139999 load imb.: force 51.3%
>>>
>>>                    Step           Time         Lambda
>>>                  140000      280.00000        0.00000
>>>
>>>            Energies (kJ/mol)
>>>                     U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>>             4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
>>>              Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>>             2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
>>>               Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>>>            -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
>>>          Pressure (bar)   Constr. rmsd
>>>            -3.39003e+01    3.10750e-05
>>>
>>>         DD  step 149999 load imb.: force 60.8%
>>>
>>>                    Step           Time         Lambda
>>>                  150000      300.00000        0.00000
>>>
>>>            Energies (kJ/mol)
>>>                     U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>>             4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
>>>              Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>>>             2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
>>>               Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>>>            -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
>>>          Pressure (bar)   Constr. rmsd
>>>            -1.40623e+00    3.16495e-05
>>>
>>>         I think this high load imbalance costs more than 20% of
>>>         the performance, but at least it lets the simulation run.
>>>         Therefore, the problem I would like to report is that when
>>>         running a CPU-GPU hybrid simulation with very few GPUs,
>>>         dynamic load balancing causes domain decomposition
>>>         problems (the fail-to-allocate-memory error above). I don't
>>>         know whether there is currently any solution to this
>>>         problem, or whether anything could be improved.
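>>>
>>>         For completeness, the surviving run is launched with
>>>         something like this (a sketch; -dlb no is the essential
>>>         part):
>>>
>>>         $ ibrun mdrun_mpi -ntomp 8 -dlb no -deffnm md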
>>>
>>>         Yunlong
>>>
>>>
>>>
>>>
>>>         --
>>>         ========================================
>>>         Yunlong Liu, PhD Candidate
>>>         Computational Biology and Biophysics
>>>         Department of Biophysics and Biophysical Chemistry
>>>         School of Medicine, The Johns Hopkins University
>>>         Email: yliu120 at jhmi.edu
>>>
>>>         Address: 725 N Wolfe St, WBSB RM 601, 21205
>>>         ========================================
>>>
>>>
>>>
>>>
>>>     --
>>>     ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>>>     865-241-1537, ORNL PO BOX 2008 MS6309
>>
>>
>>     --
>>     ========================================
>>     Yunlong Liu, PhD Candidate
>>     Computational Biology and Biophysics
>>     Department of Biophysics and Biophysical Chemistry
>>     School of Medicine, The Johns Hopkins University
>>     Email: yliu120 at jhmi.edu
>>
>>     Address: 725 N Wolfe St, WBSB RM 601, 21205
>>     ========================================
>>
>>
>>
>>
>> --
>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>> 865-241-1537, ORNL PO BOX 2008 MS6309
>
>
> --
>
> ========================================
> Yunlong Liu, PhD Candidate
> Computational Biology and Biophysics
> Department of Biophysics and Biophysical Chemistry
> School of Medicine, The Johns Hopkins University
> Email: yliu120 at jhmi.edu
> Address: 725 N Wolfe St, WBSB RM 601, 21205
> ========================================
>