[gmx-users] frequent stop of GPU jobs at specific simulation steps.
Szilárd Páll
pall.szilard at gmail.com
Thu Oct 1 18:02:51 CEST 2015
TL;DR: if you want to investigate the issue further, check my reply here:
http://comments.gmane.org/gmane.science.biology.gromacs.user/80120
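
Not necessarily what that reply covers, but as a minimal, generic first step when a run appears hung, one can attach a debugger to the stuck mdrun process on the node and see where its threads are blocked (using the PID that nvidia-smi reports below, 28651 in this case; gdb must be installed on the node):

  gdb -p 28651
  (gdb) thread apply all bt
  (gdb) detach
  (gdb) quit

If all threads are waiting in file I/O, that points at the filesystem; if they are waiting on the CUDA driver, that points at the GPU side.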
--
Szilárd
On Mon, Sep 28, 2015 at 5:34 AM, Zhongjin He <hzj1000 at 163.com> wrote:
> Dear GMX users,
>
>
> Recently I encountered a strange problem using GPU GROMACS 5.0.6. The
> GPU node has 24 CPU cores and 1 NVIDIA Tesla K40m card; the details of
> the node are below:
> 1 x GPU node (Supermicro 2U server or equivalent)
> - 2 x 12-core Intel E5-2670v3 processor
> - 64 GB RAM (4 x 16 GB DDR4-2133 REG ECC)
> - 2 x 1 TB 7200 RPM SATA 6 Gbps enterprise HDD
> - Integrated 1 GbE ports
> - 1 x NVIDIA Tesla K40m 12 GB GDDR5 GPU card
>
>
> GPU GROMACS 5.0.6 was installed by an engineer from the company that
> sold us this GPU node. This is how cmake was configured during
> compilation:
> sudo cmake .. \
>   -DCMAKE_C_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicc \
>   -DCMAKE_CXX_COMPILER=/data/apps/openmpi-1.8.8-intel/bin/mpicxx \
>   -DGMX_MPI=on \
>   -DGMX_GPU=on \
>   -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
>   -DCMAKE_PREFIX_PATH=/data/apps/fftw-3.3.4-intel-sp/ \
>   -DGMX_FFT_LIBRARY=fftw3 \
>   -DCMAKE_INSTALL_PREFIX=/data/apps/gromacs-5.0.6-gpu-sp
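>
> As a quick sanity check of that installation (a minimal sketch; the module
> and binary names are the ones used in the job script below), the installed
> binary can print its build configuration, which should report GPU support
> as enabled and list the expected CUDA and FFTW versions:
>
>   module load gromacs-5.0.6-gpu-sp
>   mdrun_mpi -version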
>
>
> I use OpenMP rather than mpirun to run a GPU job, as it runs faster. I
> use 12 CPU cores and 1 GPU card to run a GPU job:
> module load gromacs-5.0.6-gpu-sp
> export OMP_NUM_THREADS=12
> mdrun_mpi -deffnm md -pin on -pinoffset 0
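>
> (An equivalent launch that makes the thread count and GPU choice explicit
> rather than relying only on OMP_NUM_THREADS, assuming the single K40m is
> device 0, would be:
>
>   mdrun_mpi -deffnm md -ntomp 12 -gpu_id 0 -pin on -pinoffset 0
>
> where -ntomp sets the number of OpenMP threads for the rank and -gpu_id
> selects the GPU it uses.)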
>
>
> I have found that such GPU jobs run for about 10+ hours and then stop at
> specific simulation steps. The symptoms of this problem are similar to
> those described in
> http://permalink.gmane.org/gmane.science.biology.gromacs.user/77958:
> > 1. My simulation stops every 10+ hours. Specifically, the job is still
> > "running" in the queue but md.log/traj.trr stop updating.
> The reply in that thread was that this suggests the filesystem has gone
> AWOL, or filled up, or hit a 2 GB file-size limit, or some such.
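>
> (As a minimal sketch of how one might rule those out on the node, assuming
> standard Linux tools and that the run directory is the working directory:
>
>   df -h .                              # free space on the filesystem holding the run
>   ulimit -f                            # per-process file size limit (often "unlimited")
>   ls -lh md.log md.trr md.edr md.cpt   # sizes of the output files that stopped growing
>
> With -deffnm md the outputs are named md.*; the exact set depends on the
> .mdp output settings. A trajectory approaching 2 GB on a filesystem or
> toolchain with a 2 GB limit would fit the description above.)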
>
>
> Using qstat to check the state of this GPU job shows it as running. The
> GPU job runs partly on these 12 CPU cores and partly on the GPU card.
> But logging in to the GPU node, the CPU part of this job is missing,
> while nvidia-smi still shows the GPU part of it, with GPU-Util at 0%:
> +------------------------------------------------------+
> | NVIDIA-SMI 346.46     Driver Version: 346.46         |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla K40m          Off  | 0000:82:00.0     Off |                    0 |
> | N/A   34C    P0    62W / 235W |    139MiB / 11519MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0     28651     C  mdrun_mpi                                       82MiB |
> +-----------------------------------------------------------------------------+
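>
> (To see what happened to the CPU side, a minimal sketch using the PID that
> nvidia-smi still lists, 28651 here:
>
>   ps -fLp 28651              # the process and its threads, if any are left
>   cat /proc/28651/status     # process state and thread count, if the PID still exists
>   ls -l /proc/28651/fd       # files the process still holds open
>
> If the PID no longer exists as a normal process, something killed it; if it
> exists but its threads sit in D (uninterruptible) state, it is most likely
> stuck waiting on I/O, which would fit the filesystem explanation above.)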
>
>
>
>
>
> Actually, on this GPU node I have tested two systems. Both stopped at
> specific simulation steps.
> The first system is in /home/he/C1C2C3/C3_New; I have tested it 3 times.
> The first run is named run1. After this job died, I resubmitted (NOT
> restarted) it as run2. After run2 died, I resubmitted the same job as
> run3. These 3 tests of the same simulation system show that after 10+
> hours of simulation, they died at the same simulation step. I use the
> command mdrun_mpi -deffnm md -pin on -pinoffset 12 or mdrun_mpi -deffnm
> md
> Run1
> tail -n 10 /home/he/C1C2C3/C3_New/run1/md.log
>            Step           Time         Lambda
>        62245000   124490.00000        0.00000
>
>    Energies (kJ/mol)
>           Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     3.99377e+02    3.60666e+04   -2.97791e+03   -2.49639e+05    8.68339e+02
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -2.15283e+05    2.59468e+04   -1.89336e+05    2.49125e+02   -3.11737e+02
>  Pressure (bar)   Constr. rmsd
>     5.47750e+02    1.02327e-06
>
>
> Run2
> tail -n 10 /home/he/C1C2C3/C3_New/run2/md.log
>            Step           Time         Lambda
>        62245000   124490.00000        0.00000
>
>    Energies (kJ/mol)
>           Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     4.24556e+02    3.59301e+04   -2.98718e+03   -2.50283e+05    8.91892e+02
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -2.16023e+05    2.57250e+04   -1.90298e+05    2.46995e+02   -3.13679e+02
>  Pressure (bar)   Constr. rmsd
>     5.07704e+02    9.80011e-07
>
>
> Run3
> tail -n 10 /home/he/C1C2C3/C3_New/md.log
>            Step           Time         Lambda
>        62245000   124490.00000        0.00000
>
>    Energies (kJ/mol)
>           Angle        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     4.58207e+02    3.55527e+04   -3.00135e+03   -2.48904e+05    8.72284e+02
>       Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
>    -2.15022e+05    2.60553e+04   -1.88967e+05    2.50167e+02   -3.16659e+02
>  Pressure (bar)   Constr. rmsd
>     6.59141e+02    1.04033e-06
>
>
> A similar problem was also found in the second simulation system in
> /home/he/C1C2C3/PureCO2TraPPE. I use the command mdrun_mpi -deffnm md
> -pin on -pinoffset 0
> Run1
> tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/run1/md.log
>            Step           Time         Lambda
>        65470000   130940.00000        0.00000
>
>    Energies (kJ/mol)
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>     2.80567e+04   -1.99950e+03   -2.05836e+05    1.06692e+03   -1.78712e+05
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     2.09106e+04   -1.57801e+05    2.48747e+02   -2.91211e+02    4.97676e+02
>    Constr. rmsd
>     6.64528e-07
> Run2
> tail -n 10 /home/he/C1C2C3/PureCO2TraPPE/md.log
>            Step           Time         Lambda
>        65470000   130940.00000        0.00000
>
>    Energies (kJ/mol)
>         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>     2.78408e+04   -2.00737e+03   -2.05654e+05    1.07244e+03   -1.78748e+05
>     Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
>     2.09921e+04   -1.57756e+05    2.49716e+02   -2.93505e+02    5.29104e+02
>    Constr. rmsd
>     7.43034e-07
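>
> (One way to see whether the last step written before the stall lines up
> with an output or checkpoint interval is to read the intervals back out of
> the run input file; a minimal sketch, assuming a non-MPI gmx binary is
> installed alongside mdrun_mpi:
>
>   gmx dump -s md.tpr 2>/dev/null | grep -E 'nsteps|nstxout|nstvout|nstfout|nstenergy|nstlog'
>
> and then check whether 62245000 (first system) and 65470000 (second system)
> are multiples of those values.)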
>
>
> Hi GMX users, did anybody encounter such a problem with GPU GROMACS?
> Please give me some advice on how to solve it. Thanks!
>
>
> Best regards,
>
>
> Zhongjin HE