[gmx-users] Job Output stops being written during long simulations on HPC cluster

Mark Abraham mark.j.abraham at gmail.com
Tue Jun 21 02:08:29 CEST 2016


Hi,

First, what GROMACS version is this? If not 5.1.2, then please try that :-)
Otherwise, that's all a bit confusing. Some of the output refers to 16
ranks and some to 9 threads, but you mention other numbers of cores. There
is no known problem with Intel MPI or the Intel compilers, though people
run and test with them less often. You could try other infrastructure, but
frankly it seems more likely that your simulation is unstable, which you
could probe with more frequent output, or fix with longer or gentler
equilibration.
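
For example, to probe for instability you could write output much more
often for a short continuation run; a minimal sketch of the relevant .mdp
settings (the values are only illustrative, pick what your disk can afford):

  nstlog              = 1000   ; step/energy summary to the .log
  nstenergy           = 1000   ; energies to the .edr
  nstxout-compressed  = 5000   ; coordinates to the .xtc
  nstcalcenergy       = 100    ; so diverging energies show up promptly

If LINCS warnings or rapidly growing energies appear just before the last
frame that reached disk, that points at the simulation rather than at the
MPI stack.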

Mark

On Tue, Jun 21, 2016 at 1:36 AM Benjamin Joseph Coscia <
Benjamin.Coscia at colorado.edu> wrote:

> Hi everyone,
>
> I have been attempting to run some long simulations on a supercomputer at
> the University of Colorado at Boulder. I am trying to run simulations for
> about 200 ns. I have done tests using 48 cores and 96 cores. In each case
> output stops being written at the same time step (~50 million steps). This
> is only about half of the simulation time I wanted. According to SLURM, the
> job is still running days past when it stopped outputting.
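>
> For reference, this is roughly how I check where a run stops (the run's
> -deffnm is 'wiggle'):
>
> tail -n 20 wiggle.log        # last step that actually reached disk
> gmx_mpi check -f wiggle.xtc  # confirm the trajectory is intact up to that frame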
>
> I checked how much space the output files are taking up. The largest
> file is the trajectory at ~0.2 GB, and I am only writing output every 1
> million steps, so I am convinced this isn't a disk-space issue.
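>
> (For the record, this is the kind of check I ran: sizes of the run's
> output plus free space on the filesystem; the exact quota command is
> site-specific.)
>
> ls -lh wiggle.*   # .log, .edr, .xtc, .cpt sizes
> df -h .           # free space on the filesystem holding the run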
>
> I've reached out to the people who run the supercomputer and they are not
> sure what is going on. One of the admins there ran a system trace on the
> 'gmx_mpi mdrun' process on one of the hung nodes and got the following
> output:
>
> [root at node0636 ~]# ps -ef | grep beco
> root 2053 32739 0 20:48 pts/1 00:00:00 grep beco
> beco4952 17561 17557 0 Jun14 ? 00:00:00 /bin/bash
> /tmp/node0636/job1470980/slurm_script
> beco4952 17597 17561 0 Jun14 ? 00:00:00 /bin/bash
> /projects/beco4952/Gromacs/Pores/GitHub/Shell-Scripts/Build_and_Sim.sh -M
> monomer1.pdb -I steep -S 50000 -c verlet -t 10 -o 6 -r 3 -p 40 -P 4 -w 10
> -l 20 -x 8.0 -y 8.0 -e 0.1 -T Equilibration in Vacuum -C verlet -i md -D
> 0.002 -L 200 -f 100 -v v-rescale -K 300 -b berendsen -Y semiisotropic -B 1
> -R 4.5e-5 -Z xyz -V 1 -n 8 -s off -m on
> beco4952 18307 17597 0 Jun14 ? 00:00:00 /bin/sh
> /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/mpirun -np 16 gmx_mpi
> mdrun -v -deffnm wiggle
> beco4952 18313 18307 0 Jun14 ? 00:01:34 mpiexec.hydra -np 16 gmx_mpi mdrun
> -v -deffnm wiggle
> beco4952 18314 18313 0 Jun14 ? 00:00:00 /curc/slurm/slurm/current/bin/srun
> --nodelist
> node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N
> 8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
> --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
> --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
> --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
> --proxy-id -1
> beco4952 18315 18314 0 Jun14 ? 00:00:00 /curc/slurm/slurm/current/bin/srun
> --nodelist
> node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N
> 8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
> --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
> --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
> --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
> --proxy-id -1
> beco4952 18334 18329 0 Jun14 ? 00:01:00
> /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
> --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
> --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
> --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
> --proxy-id -1
> beco4952 18354 18334 99 Jun14 ? 13-20:24:21 gmx_mpi mdrun -v -deffnm wiggle
> beco4952 18355 18334 99 Jun14 ? 13-20:30:41 gmx_mpi mdrun -v -deffnm wiggle
>
> [root at node0636 ~]# strace -f -p 18354
> Process 18354 attached with 9 threads - interrupt to quit
> [pid 18380] futex(0x2b76d6461484, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
> ...>
> [pid 18378] futex(0x2b76d6462984, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
> ...>
> [pid 18375] futex(0x2b76d6475484, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
> ...>
> [pid 18374] futex(0x2b76d6476984, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
> ...>
> [pid 18368] restart_syscall(<... resuming interrupted call ...> <unfinished
> ...>
> [pid 18377] futex(0x2b76d6463e84, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
> ...>
> [pid 18364] restart_syscall(<... resuming interrupted call ...> <unfinished
> ...>
> [pid 18354] futex(0x2b76d6477784, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
> ...>
> [pid 18373] restart_syscall(<... resuming interrupted call ...>) = -1
> ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631433, {1466303205,
> 353534000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631435, {1466303205,
> 553978000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631437, {1466303205,
> 754503000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631439, {1466303205,
> 954902000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631441, {1466303206,
> 155424000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631443, {1466303206,
> 355864000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> "
>
> I built MPI-enabled GROMACS following the recommended practices for this
> system, which included using the Intel compilers. Could it be that I need
> to rebuild with a different compiler such as GCC? It seems that I am stuck
> in some sort of deadlock.
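>
> If it comes to rebuilding, my understanding is the configure step would
> look roughly like this (the module names are guesses for this cluster;
> only -DGMX_MPI=ON is essential):
>
> module load gcc openmpi      # hypothetical module names
> CC=mpicc CXX=mpicxx cmake .. -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON
> make -j 8 && make install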
>
> Any ideas on how to address this problem would be much appreciated,
>
> Regards,
> Ben Coscia

