[gmx-users] [gmx-developers] About dynamic load balancing
Yunlong Liu
yliu120 at jhmi.edu
Mon Aug 25 06:08:30 CEST 2014
Hi Szilard,
I would like to send you the log file, and I really need your help. Please trust me that I have tested this many times: when I turn on DLB, the GPU nodes report a cannot-allocate-memory error and all MPI processes are shut down. I have to tolerate the large load imbalance (50%) to run my simulations. I wish I could figure out a way to make my simulations run on GPUs with better performance.
Where can I post the log file? If I paste it here, it will be really long.
Yunlong
> On Aug 24, 2014, at 2:20 PM, "Szilárd Páll" <pall.szilard at gmail.com> wrote:
>
>> On Thu, Aug 21, 2014 at 8:25 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>> Hi Roland,
>>
>> I just compiled the latest gromacs-5.0 version, released on June 29th. I will
>> recompile it as you suggested, using those flags. It also seems that the high
>> load imbalance doesn't affect the performance much, which is weird.
>
> How did you draw that conclusion? Please show us log files of the
> respective runs; that will help to assess what is going on.
>
> --
> Szilárd
>
>> Thank you.
>> Yunlong
>>
>>> On 8/21/14, 2:13 PM, Roland Schulz wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> On Thu, Aug 21, 2014 at 1:56 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>>>
>>> Hi Roland,
>>>
>>> The problem I am posting is not caused by trivial errors (like not
>>> having enough memory); I think it is a real bug inside the
>>> GROMACS GPU support code.
>>>
>>> It is unlikely to be a trivial error, because otherwise someone else would
>>> have noticed. You could try the release-5-0 branch from git, but I'm not
>>> aware of any bug fixes related to memory allocation.
>>> The memory allocation which triggers the error isn't itself the problem; the
>>> printed size is reasonable. You could recompile with PRINT_ALLOC_KB (add
>>> -DPRINT_ALLOC_KB to CMAKE_C_FLAGS) and rerun the simulation. It might tell
>>> you where the unusually large memory allocations happen.
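>>>
>>> For example, getting the branch and recompiling with that flag could look
>>> roughly like this (the paths and the other CMake options here are only
>>> placeholders; match them to your original build configuration):
>>>
>>> git clone git://git.gromacs.org/gromacs.git
>>> cd gromacs && git checkout release-5-0
>>> mkdir build && cd build
>>> cmake .. -DGMX_MPI=ON -DGMX_GPU=ON \
>>>     -DCMAKE_C_FLAGS="-DPRINT_ALLOC_KB" \
>>>     -DCMAKE_INSTALL_PREFIX=$WORK/gromacs-5-debug
>>> make -j 8 && make install
>>>
>>> With that define, mdrun should print a note for large allocations, which may
>>> help narrow down where the memory is going.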
>>>
>>> PS: Please don't reply to an individual Gromacs developer. Keep all
>>> conversation on the gmx-users list.
>>>
>>> Roland
>>>
>>> That is why I posted this problem to the developer
>>> mailing list.
>>>
>>> My system contains ~240,000 atoms. It is a rather big protein. The
>>> memory information of the node is:
>>>
>>> top - 12:46:59 up 15 days, 22:18, 1 user, load average: 1.13, 6.27, 11.28
>>> Tasks: 510 total, 2 running, 508 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 6.3%us, 0.0%sy, 0.0%ni, 93.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>> Mem: 32815324k total, 4983916k used, 27831408k free, 7984k buffers
>>> Swap: 4194296k total, 0k used, 4194296k free, 700588k cached
>>>
>>> I am running the simulation on 2 nodes, with 4 MPI ranks in total and 8
>>> OpenMP threads per rank. I list the information on their CPU and GPU here,
>>> and the launch command after it:
>>>
>>> c442-702.stampede(1)$ nvidia-smi
>>> Thu Aug 21 12:46:17 2014
>>> +------------------------------------------------------+
>>> | NVIDIA-SMI 331.67     Driver Version: 331.67         |
>>> |-------------------------------+----------------------+----------------------+
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |===============================+======================+======================|
>>> |   0  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
>>> | N/A   22C    P0    46W / 225W |    172MiB /  4799MiB |      0%      Default |
>>> +-------------------------------+----------------------+----------------------+
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Compute processes:                                               GPU Memory |
>>> |  GPU       PID  Process name                                     Usage      |
>>> |=============================================================================|
>>> |    0    113588  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi           77MiB |
>>> |    0    113589  /work/03002/yliu120/gromacs-5/bin/mdrun_mpi           77MiB |
>>> +-----------------------------------------------------------------------------+
>>>
>>> c442-702.stampede(4)$ lscpu
>>> Architecture: x86_64
>>> CPU op-mode(s): 32-bit, 64-bit
>>> Byte Order: Little Endian
>>> CPU(s): 16
>>> On-line CPU(s) list: 0-15
>>> Thread(s) per core: 1
>>> Core(s) per socket: 8
>>> Socket(s): 2
>>> NUMA node(s): 2
>>> Vendor ID: GenuineIntel
>>> CPU family: 6
>>> Model: 45
>>> Stepping: 7
>>> CPU MHz: 2701.000
>>> BogoMIPS: 5399.22
>>> Virtualization: VT-x
>>> L1d cache: 32K
>>> L1i cache: 32K
>>> L2 cache: 256K
>>> L3 cache: 20480K
>>> NUMA node0 CPU(s): 0-7
>>> NUMA node1 CPU(s): 8-15
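>>>
>>> For completeness, the job is launched roughly like this (the -deffnm file
>>> name is only a placeholder; ibrun starts the 4 MPI ranks requested in the
>>> job script, 2 per node):
>>>
>>> ibrun mdrun_mpi -ntomp 8 -gpu_id 00 -dlb no -deffnm md
>>>
>>> where -gpu_id 00 maps both ranks on a node onto GPU 0 and -ntomp 8 gives
>>> each rank 8 OpenMP threads.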
>>>
>>> I hope this information will help. Thank you.
>>>
>>> Yunlong
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On 8/21/14, 1:38 PM, Roland Schulz wrote:
>>>>
>>>> Hi,
>>>>
>>>> please don't use gmx-developers for user questions. Feel free to
>>>> use it if you want to fix the problem, and have questions about
>>>> implementation details.
>>>>
>>>> Please provide more details: How large is your system? How much
>>>> memory does a node have? On how many nodes are you trying to run? How
>>>> many MPI ranks do you have per node?
>>>>
>>>> Roland
>>>>
>>>> On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
>>>>
>>>> Hi Gromacs Developers,
>>>>
>>>> I found something about dynamic load balancing really
>>>> interesting. I am running my simulations on the Stampede
>>>> supercomputer, whose nodes have 16 physical cores (really 16
>>>> Intel Xeon cores on one node) and an NVIDIA Tesla K20m GPU
>>>> attached.
>>>>
>>>> When I use only the CPUs, I turn on dynamic load balancing with
>>>> -dlb yes, and it seems to work really well: the load imbalance is
>>>> only 1-2%, which improves performance by 5-7%. But when I run on
>>>> the CPU-GPU hybrid (a GPU node with 16 CPU cores and 1 GPU),
>>>> dynamic load balancing kicks in because the imbalance goes up to
>>>> ~50% right after start-up. Then the system reports a
>>>> fail-to-allocate-memory error:
>>>>
>>>> NOTE: Turning on dynamic load balancing
>>>>
>>>>
>>>> -------------------------------------------------------
>>>> Program mdrun_mpi, VERSION 5.0
>>>> Source code file: /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line: 226
>>>>
>>>> Fatal error:
>>>> Not enough memory. Failed to realloc 1020720 bytes for dest->a, dest->a=d5800030
>>>> (called from file /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, line 1061)
>>>> For more information and tips for troubleshooting, please check the GROMACS
>>>> website at http://www.gromacs.org/Documentation/Errors
>>>> -------------------------------------------------------
>>>> : Cannot allocate memory
>>>> Error on rank 0, will try to stop all ranks
>>>> Halting parallel program mdrun_mpi on CPU 0 out of 4
>>>>
>>>> gcq#274: "I Feel a Great Disturbance in the Force" (The Emperor Strikes Back)
>>>>
>>>> [cli_0]: aborting job:
>>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
>>>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>>>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
>>>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process (rank: 0, pid: 112839) exited with status 255
>>>> TACC: MPI job exited with code: 1
>>>>
>>>> TACC: Shutdown complete. Exiting.
>>>>
>>>> So I manually turned off dynamic load balancing with -dlb no. The
>>>> simulation then goes through, but with a very high load imbalance,
>>>> like:
>>>>
>>>> DD step 139999 load imb.: force 51.3%
>>>>
>>>> Step Time Lambda
>>>> 140000 280.00000 0.00000
>>>>
>>>> Energies (kJ/mol)
>>>>            U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>>>    4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
>>>>     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)    Coul. recip.
>>>>    2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
>>>>      Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>>>>   -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
>>>> Pressure (bar)   Constr. rmsd
>>>>   -3.39003e+01    3.10750e-05
>>>>
>>>> DD step 149999 load imb.: force 60.8%
>>>>
>>>> Step Time Lambda
>>>> 150000 300.00000 0.00000
>>>>
>>>> Energies (kJ/mol)
>>>>            U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>>>>    4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
>>>>     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)    Coul. recip.
>>>>    2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
>>>>      Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>>>>   -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
>>>> Pressure (bar)   Constr. rmsd
>>>>   -1.40623e+00    3.16495e-05
>>>>
>>>> I think this high load imbalance costs more than 20% of the
>>>> performance, but at least it lets the simulation run. So the
>>>> problem I would like to report is that, when running CPU-GPU
>>>> hybrid simulations with very few GPUs, dynamic load balancing
>>>> causes domain decomposition problems (fail-to-allocate-memory).
>>>> I don't know whether there is currently any solution to this
>>>> problem, or anything that could be improved.
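>>>>
>>>> For reference, a launch along these lines reproduces both cases (file
>>>> names are placeholders):
>>>>
>>>> ibrun mdrun_mpi -ntomp 8 -gpu_id 00 -deffnm md -dlb no   # runs, ~50% imbalance
>>>> ibrun mdrun_mpi -ntomp 8 -gpu_id 00 -deffnm md -dlb yes  # fails with the memory error above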
>>>>
>>>> Yunlong
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ========================================
>>>> Yunlong Liu, PhD Candidate
>>>> Computational Biology and Biophysics
>>>> Department of Biophysics and Biophysical Chemistry
>>>> School of Medicine, The Johns Hopkins University
>>>> Email: yliu120 at jhmi.edu
>>>>
>>>> Address: 725 N Wolfe St, WBSB RM 601, 21205
>>>> ========================================
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>>>> 865-241-1537, ORNL PO BOX 2008 MS6309
>>>
>>>
>>> --
>>> ========================================
>>> Yunlong Liu, PhD Candidate
>>> Computational Biology and Biophysics
>>> Department of Biophysics and Biophysical Chemistry
>>> School of Medicine, The Johns Hopkins University
>>> Email: yliu120 at jhmi.edu
>>>
>>> Address: 725 N Wolfe St, WBSB RM 601, 21205
>>> ========================================
>>>
>>>
>>>
>>>
>>> --
>>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>>> 865-241-1537, ORNL PO BOX 2008 MS6309
>>
>>
>> --
>>
>> ========================================
>> Yunlong Liu, PhD Candidate
>> Computational Biology and Biophysics
>> Department of Biophysics and Biophysical Chemistry
>> School of Medicine, The Johns Hopkins University
>> Email: yliu120 at jhmi.edu
>> Address: 725 N Wolfe St, WBSB RM 601, 21205
>> ========================================
>>