[gmx-users] [gmx-developers] About dynamics loading balance
Mark Abraham
mark.j.abraham at gmail.com
Mon Aug 25 06:55:51 CEST 2014
Please upload them to a file-sharing service on the web (there are lots
that are free-to-use), and paste the link here.
Mark
On Mon, Aug 25, 2014 at 6:07 AM, Yunlong Liu <yliu120 at jhmi.edu> wrote:
> Hi Szilard,
>
> I would like to send you the log file, and I really need your help. Please
> trust me that I have tested this many times: when I turn on DLB, the GPU
> nodes report a cannot-allocate-memory error and shut all MPI processes
> down. I have to tolerate the large load imbalance (50%) just to run my
> simulations. I hope to find some way to make my simulations run on GPUs
> with better performance.
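>
> For reference, this is roughly how I launch the runs (the MPI launcher
> and the input file name below are placeholders, not copied from my
> actual job script):
>
>     mpirun -np 4 /work/03002/yliu120/gromacs-5/bin/mdrun_mpi \
>         -ntomp 8 -dlb no -deffnm md_run
>
> With -dlb no this runs; as soon as dynamic load balancing turns on
> (-dlb yes, or auto once the imbalance triggers it), it fails with the
> memory error quoted below.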
>
> Where can I post the log file? If I paste it here, it will be really long.
>
> Yunlong
>
>
> > On Aug 24, 2014, at 2:20 PM, "Szilárd Páll" <pall.szilard at gmail.com> wrote:
> >
> >> On Thu, Aug 21, 2014 at 8:25 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> >> Hi Roland,
> >>
> >> I just compiled the latest GROMACS 5.0 version released on June 29th. I
> >> will recompile it as you suggested, using those flags. It also seems that
> >> the high load imbalance doesn't affect the performance, which is weird.
> >
> > How did you draw that conclusion? Please show us log files of the
> > respective runs; that will help us assess what is going on.
> >
> > --
> > Szilárd
> >
> >> Thank you.
> >> Yunlong
> >>
> >>> On 8/21/14, 2:13 PM, Roland Schulz wrote:
> >>>
> >>> Hi,
> >>>
> >>>
> >>>
> >>> On Thu, Aug 21, 2014 at 1:56 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> >>>
> >>> Hi Roland,
> >>>
> >>> The problem I am posting appears as a trivial error (not enough
> >>> memory), but I think it is really a bug inside the GROMACS GPU
> >>> support code.
> >>>
> >>> It is unlikely to be a trivial error, because otherwise someone else
> >>> would have noticed. You could try the release-5-0 branch from git, but
> >>> I'm not aware of any bugfixes related to memory allocation.
> >>> The memory allocation that triggers the error isn't the problem itself;
> >>> the printed size is reasonable. You could recompile with PRINT_ALLOC_KB
> >>> (add -DPRINT_ALLOC_KB to CMAKE_C_FLAGS) and rerun the simulation. That
> >>> might tell you where the unusually large memory allocations happen.
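> >>>
> >>> Something along these lines should do it; this is only a sketch, the
> >>> new build directory name is made up, and you should reuse whatever
> >>> CMake options (GMX_MPI, GMX_GPU, install prefix, ...) you configured
> >>> with originally, just adding the extra C flag:
> >>>
> >>>     mkdir /home1/03002/yliu120/build/gromacs-5.0-allockb
> >>>     cd /home1/03002/yliu120/build/gromacs-5.0-allockb
> >>>     cmake /home1/03002/yliu120/build/gromacs-5.0 \
> >>>           -DGMX_MPI=ON -DGMX_GPU=ON \
> >>>           -DCMAKE_C_FLAGS="-DPRINT_ALLOC_KB"
> >>>     make -j 8 && make install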
> >>>
> >>> PS: Please don't reply to an individual Gromacs developer. Keep all
> >>> conversation on the gmx-users list.
> >>>
> >>> Roland
> >>>
> >>> That is the reason why I posted this problem to the developer
> >>> mailing list.
> >>>
> >>> My system contains ~240,000 atoms; it is a rather big protein. The
> >>> memory information of the node is:
> >>>
> >>> top - 12:46:59 up 15 days, 22:18, 1 user, load average: 1.13, 6.27, 11.28
> >>> Tasks: 510 total, 2 running, 508 sleeping, 0 stopped, 0 zombie
> >>> Cpu(s): 6.3%us, 0.0%sy, 0.0%ni, 93.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> >>> Mem:  32815324k total, 4983916k used, 27831408k free, 7984k buffers
> >>> Swap:  4194296k total,       0k used,  4194296k free, 700588k cached
> >>>
> >>> I am running the simulation on 2 nodes, with 4 MPI ranks in total and
> >>> 8 OpenMP threads per rank. Here is the CPU and GPU information:
> >>>
> >>> c442-702.stampede(1)$ nvidia-smi
> >>> Thu Aug 21 12:46:17 2014
> >>> NVIDIA-SMI 331.67              Driver Version: 331.67
> >>> GPU 0: Tesla K20m   Persistence-M: Off   Bus-Id: 0000:03:00.0   Disp.A: Off   Uncorr. ECC: 0
> >>> Fan: N/A   Temp: 22C   Perf: P0   Pwr: 46W / 225W   Memory-Usage: 172MiB / 4799MiB   GPU-Util: 0%   Compute M.: Default
> >>> Compute processes:
> >>> GPU 0   PID 113588   /work/03002/yliu120/gromacs-5/bin/mdrun_mpi   77MiB
> >>> GPU 0   PID 113589   /work/03002/yliu120/gromacs-5/bin/mdrun_mpi   77MiB
> >>>
> >>> c442-702.stampede(4)$ lscpu
> >>> Architecture: x86_64
> >>> CPU op-mode(s): 32-bit, 64-bit
> >>> Byte Order: Little Endian
> >>> CPU(s): 16
> >>> On-line CPU(s) list: 0-15
> >>> Thread(s) per core: 1
> >>> Core(s) per socket: 8
> >>> Socket(s): 2
> >>> NUMA node(s): 2
> >>> Vendor ID: GenuineIntel
> >>> CPU family: 6
> >>> Model: 45
> >>> Stepping: 7
> >>> CPU MHz: 2701.000
> >>> BogoMIPS: 5399.22
> >>> Virtualization: VT-x
> >>> L1d cache: 32K
> >>> L1i cache: 32K
> >>> L2 cache: 256K
> >>> L3 cache: 20480K
> >>> NUMA node0 CPU(s): 0-7
> >>> NUMA node1 CPU(s): 8-15
> >>>
> >>> I hope this information will help. Thank you.
> >>>
> >>> Yunlong
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> On 8/21/14, 1:38 PM, Roland Schulz wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> please don't use gmx-developers for user questions. Feel free to
> >>>> use it if you want to fix the problem, and have questions about
> >>>> implementation details.
> >>>>
> >>>> Please provide more details: How large is your system? How much
> >>>> memory does a node have? On how many nodes do you try to run? How
> >>>> many MPI ranks do you have per node?
> >>>>
> >>>> Roland
> >>>>
> >>>> On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> >>>>
> >>>> Hi Gromacs Developers,
> >>>>
> >>>> I found something really interesting about dynamic load balancing.
> >>>> I am running my simulations on the Stampede supercomputer, whose
> >>>> nodes have 16 physical cores (really 16 Intel Xeon cores per node)
> >>>> and one NVIDIA Tesla K20m GPU attached.
> >>>>
> >>>> When I use only the CPUs, I turn on dynamic load balancing with
> >>>> -dlb yes. It seems to work really well: the load imbalance is only
> >>>> 1~2%, which improves performance by 5~7%. But when I run on the
> >>>> GPU-CPU hybrid setup (GPU node, 16 CPU cores and 1 GPU), dynamic
> >>>> load balancing kicks in because the imbalance goes up to ~50%
> >>>> almost immediately after loading, and then the system reports a
> >>>> fail-to-allocate-memory error:
> >>>>
> >>>> NOTE: Turning on dynamic load balancing
> >>>>
> >>>>
> >>>> -------------------------------------------------------
> >>>> Program mdrun_mpi, VERSION 5.0
> >>>> Source code file: /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c, line: 226
> >>>>
> >>>> Fatal error:
> >>>> Not enough memory. Failed to realloc 1020720 bytes for dest->a, dest->a=d5800030
> >>>> (called from file /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c, line 1061)
> >>>> For more information and tips for troubleshooting, please check the GROMACS
> >>>> website at http://www.gromacs.org/Documentation/Errors
> >>>> -------------------------------------------------------
> >>>> : Cannot allocate memory
> >>>> Error on rank 0, will try to stop all ranks
> >>>> Halting parallel program mdrun_mpi on CPU 0 out of 4
> >>>>
> >>>> gcq#274: "I Feel a Great Disturbance in the Force" (The
> >>>> Emperor Strikes Back)
> >>>>
> >>>> [cli_0]: aborting job:
> >>>> application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
> >>>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline]
> >>>> Unexpected End-Of-File on file descriptor 6. MPI process died?
> >>>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops]
> >>>> Error while reading PMI socket. MPI process died?
> >>>> [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler]
> >>>> MPI process (rank: 0, pid: 112839) exited with status 255
> >>>> TACC: MPI job exited with code: 1
> >>>>
> >>>> TACC: Shutdown complete. Exiting.
> >>>>
> >>>> So I manually turned off dynamic load balancing with -dlb no. The
> >>>> simulation then runs through, but with very high load imbalance,
> >>>> like:
> >>>>
> >>>> DD step 139999 load imb.: force 51.3%
> >>>>
> >>>>            Step           Time         Lambda
> >>>>          140000      280.00000        0.00000
> >>>>
> >>>>    Energies (kJ/mol)
> >>>>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
> >>>>     4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
> >>>>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
> >>>>     2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
> >>>>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
> >>>>    -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
> >>>>  Pressure (bar)   Constr. rmsd
> >>>>    -3.39003e+01    3.10750e-05
> >>>>
> >>>> DD step 149999 load imb.: force 60.8%
> >>>>
> >>>>            Step           Time         Lambda
> >>>>          150000      300.00000        0.00000
> >>>>
> >>>>    Energies (kJ/mol)
> >>>>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
> >>>>     4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
> >>>>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
> >>>>     2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
> >>>>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
> >>>>    -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
> >>>>  Pressure (bar)   Constr. rmsd
> >>>>    -1.40623e+00    3.16495e-05
> >>>>
> >>>> I think this high load imbalance costs more than 20% of the
> >>>> performance, but at least it lets the simulation run. So the problem
> >>>> I would like to report is: when running a GPU-CPU hybrid simulation
> >>>> with very few GPUs, dynamic load balancing causes domain
> >>>> decomposition problems (fail-to-allocate-memory). Is there currently
> >>>> any solution to this, or anything that could be improved?
> >>>>
> >>>> Yunlong
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> ========================================
> >>>> Yunlong Liu, PhD Candidate
> >>>> Computational Biology and Biophysics
> >>>> Department of Biophysics and Biophysical Chemistry
> >>>> School of Medicine, The Johns Hopkins University
> >>>> Email: yliu120 at jhmi.edu
> >>>>
> >>>> Address: 725 N Wolfe St, WBSB RM 601, 21205
> >>>> ========================================
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> >>>> 865-241-1537, ORNL PO BOX 2008 MS6309
> >>>
> >>>
> >>
> >>