[gmx-users] frequent stop of GPU jobs at specific simulation steps.
Szilárd Páll
pall.szilard at gmail.com
Thu Oct 1 18:02:51 CEST 2015
TL;DR: if you want to investigate the issue further, check my reply here:
http://comments.gmane.org/gmane.science.biology.gromacs.user/80120
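
Not necessarily what that reply covers, but as a minimal, generic first step when a run appears hung, one can attach a debugger to the stuck mdrun process on the node and see where its threads are blocked (using the PID that nvidia-smi reports below, 28651 in this case; gdb must be installed on the node):

  gdb -p 28651
  (gdb) thread apply all bt
  (gdb) detach
  (gdb) quit

If all threads are waiting in file I/O, that points at the filesystem; if they are waiting on the CUDA driver, that points at the GPU side.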
--
Szilárd
On Mon, Sep 28, 2015 at 5:34 AM, Zhongjin He <hzj1000 at 163.com> wrote:
> Dear GMX users,
>
>
> Recently I encountered a strange problem using GPU GROMACS 5.0.6. The
> GPU node has 24 CPU cores and 1 NVIDIA Tesla K40m card; the details of
> the node are below:
> 1 x GPU node (Supermicro 2U server or equivalent)
> - 2 x 12-core Intel E5-2670v3 processor
> - 64 GB RAM (4 x 16 GB DDR4-2133 REG ECC)
> - 2 x 1 TB 7200 RPM SATA 6 Gbps enterprise HDD
> - Integrated 1 GbE ports
> - 1 x NVIDIA Tesla K40m 12 GB GDDR5 GPU card
>
>
> GPU GROMACS 5.0.6 was installed by an engineer from the company that
> sold us this GPU node. This is how cmake was configured during
> compilation:
> sudo cmake .. \
>   -DCMAKE_C_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicc \
>   -DCMAKE_CXX_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicxx \
>   -DGMX_MPI=on \
>   -DGMX_GPU=on \
>   -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
>   -DCMAKE_PREFIX_PATH=/data/apps/fftw-3.3.4-intel-sp/ \
>   -DGMX_FFT_LIBRARY=fftw3 \
>   -DCMAKE_INSTALL_PREFIX=/data/apps/gromacs-5.0.6-gpu-sp
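>
> As a quick sanity check of that installation (a minimal sketch; the module
> and binary names are the ones used in the job script below), the installed
> binary can print its build configuration, which should report GPU support
> as enabled and list the expected CUDA and FFTW versions:
>
>   module load gromacs-5.0.6-gpu-sp
>   mdrun_mpi -version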
>
>
> I use OpenMP rather than mpirun to run a GPU job, as it runs faster. I
> use 12 CPU cores and 1 GPU card to run a GPU job:
> module load gromacs-5.0.6-gpu-sp
> export OMP_NUM_THREADS=12
> mdrun_mpi -deffnm md -pin on -pinoffset 0
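>
> (An equivalent launch that makes the thread count and GPU choice explicit
> rather than relying only on OMP_NUM_THREADS, assuming the single K40m is
> device 0, would be:
>
>   mdrun_mpi -deffnm md -ntomp 12 -gpu_id 0 -pin on -pinoffset 0
>
> where -ntomp sets the number of OpenMP threads for the rank and -gpu_id
> selects the GPU it uses.)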
>
>
> I have found that such GPU jobs run for about 10+ hours and then stop at
> specific simulation steps. The symptoms of this problem are similar to
> those described in
> http://permalink.gmane.org/gmane.science.biology.gromacs.user/77958:
> > 1. My simulation stops every 10+ hours. Specifically, the job is still
> > "running" in the queue but md.log/traj.trr stop updating.
> The reply in that thread was that this suggests the filesystem has gone
> AWOL, or filled up, or hit a 2 GB file-size limit, or some such.
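>
> (As a minimal sketch of how one might rule those out on the node, assuming
> standard Linux tools and that the run directory is the working directory:
>
>   df -h .                              # free space on the filesystem holding the run
>   ulimit -f                            # per-process file size limit (often "unlimited")
>   ls -lh md.log md.trr md.edr md.cpt   # sizes of the output files that stopped growing
>
> With -deffnm md the outputs are named md.*; the exact set depends on the
> .mdp output settings. A trajectory approaching 2 GB on a filesystem or
> toolchain with a 2 GB limit would fit the description above.)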
>
>
> Using qstat to check the state of this GPU job shows it as running. The
> GPU job runs partly on these 12 CPU cores and partly on the GPU card.
> But logging in to the GPU node, the CPU part of this job is missing,
> while nvidia-smi still shows the GPU part of it, with GPU-Util at 0%:
> +------------------------------------------------------+
> | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> | N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0     28651     C  mdrun_mpi                                       82MiB |
> +-----------------------------------------------------------------------------+
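>
> (To see what happened to the CPU side, a minimal sketch using the PID that
> nvidia-smi still lists, 28651 here:
>
>   ps -fLp 28651              # the process and its threads, if any are left
>   cat /proc/28651/status     # process state and thread count, if the PID still exists
>   ls -l /proc/28651/fd       # files the process still holds open
>
> If the PID no longer exists as a normal process, something killed it; if it
> exists but its threads sit in D (uninterruptible) state, it is most likely
> stuck waiting on I/O, which would fit the filesystem explanation above.)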
>
>
>
>
>
> Actually, on this GPU node I have tested two systems. Both stopped at
> specific simulation steps.
> The first system is in /home/he/C1C2C3/C3_New; I have tested it 3 times.
> The first run is named run1. After this job died, I resubmitted (NOT
> restarted) it as run2. After run2 died, I resubmitted the same job as
> run3. These 3 tests of the same simulation system show that after 10+
> hours of simulation, they died at the same simulation step. I use the
> command mdrun_mpi -deffnm md -pin on -pinoffset 12 or mdrun_mpi -deffnm
> md
> Run1
> tail -n 10 /home/he/C1C2C3/C3_New/run1/md.log
>            Step           Time         Lambda
>        62245000   124490.00000        0.00000
>
>    Energies (kJ/mol)
>           Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     3.99377e+02    3.60666e+04   -2.97791e+03   -2.49639e+05    8.68339e+02
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -2.15283e+05    2.59468e+04   -1.89336e+05    2.49125e+02   -3.11737e+02
>  Pressure (bar)   Constr. rmsd
>     5.47750e+02    1.02327e-06
>
>
> Run2
> tail -n 10 /home/he/C1C2C3/C3_New/run2/md.log
>            Step           Time         Lambda
>        62245000   124490.00000        0.00000
>
>    Energies (kJ/mol)
>           Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     4.24556e+02    3.59301e+04   -2.98718e+03   -2.50283e+05    8.91892e+02
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -2.16023e+05    2.57250e+04   -1.90298e+05    2.46995e+02   -3.13679e+02
>  Pressure (bar)   Constr. rmsd
>     5.07704e+02    9.80011e-07
>
>
> Run3
> tail -n 10 /home/he/C1C2C3/C3_New/md.log
>            Step           Time         Lambda
>        62245000   124490.00000        0.00000
>
>    Energies (kJ/mol)
>           Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     4.58207e+02    3.55527e+04   -3.00135e+03   -2.48904e+05    8.72284e+02
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -2.15022e+05    2.60553e+04   -1.88967e+05    2.50167e+02   -3.16659e+02
>  Pressure (bar)   Constr. rmsd
>     6.59141e+02    1.04033e-06
>
>
> A similar problem was also found in the second simulation system in
> /home/he/C1C2C3/PureCO2TraPPE. I use the command mdrun_mpi -deffnm md
> -pin on -pinoffset 0
> Run1
> tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/run1/md.log
>            Step           Time         Lambda
>        65470000   130940.00000        0.00000
>
>    Energies (kJ/mol)
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>     2.80567e+04   -1.99950e+03   -2.05836e+05    1.06692e+03   -1.78712e+05
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     2.09106e+04   -1.57801e+05    2.48747e+02   -2.91211e+02    4.97676e+02
>    Constr. rmsd
>     6.64528e-07
> Run2
> tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/md.log
>            Step           Time         Lambda
>        65470000   130940.00000        0.00000
>
>    Energies (kJ/mol)
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>     2.78408e+04   -2.00737e+03   -2.05654e+05    1.07244e+03   -1.78748e+05
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     2.09921e+04   -1.57756e+05    2.49716e+02   -2.93505e+02    5.29104e+02
>    Constr. rmsd
>     7.43034e-07
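>
> (One way to see whether the last step written before the stall lines up
> with an output or checkpoint interval is to read the intervals back out of
> the run input file; a minimal sketch, assuming a non-MPI gmx binary is
> installed alongside mdrun_mpi:
>
>   gmx dump -s md.tpr 2>/dev/null | grep -E 'nsteps|nstxout|nstvout|nstfout|nstenergy|nstlog'
>
> and then check whether 62245000 (first system) and 65470000 (second system)
> are multiples of those values.)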
>
>
> Hi GMX users, did anybody encounter such a problem with GPU GROMACS?
> Please give me some advice on how to solve it. Thanks!
>
>
> Best regards,
>
>
> Zhongjin HE