[gmx-users] Job output stops being written during long simulations on HPC cluster

Benjamin Joseph Coscia Benjamin.Coscia@Colorado.EDU
Tue Jun 21 01:36:02 CEST 2016


Hi everyone,

I have been trying to run some long simulations on a supercomputer at the
University of Colorado at Boulder, aiming for about 200 ns per run. I have
done tests using 48 cores and 96 cores, and in each case output stops
being written at the same step (~50 million steps, which at my 2 fs time
step is ~100 ns, only about half of the simulation time I wanted).
According to SLURM, the job is still running days after it stopped
producing output.
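
As a stopgap I assume I can kill the hung job and resume from the
checkpoint; if I understand the mdrun options correctly, something like
this should pick up where the run left off and append to the existing
output (file names are from my run):

# Resume from the last checkpoint, appending to the existing output files
mpirun -np 16 gmx_mpi mdrun -v -deffnm wiggle -cpi wiggle.cpt -append

# Or cap the wall time so mdrun checkpoints and exits cleanly before the
# SLURM limit (here 23.5 hours), rather than being killed mid-write
mpirun -np 16 gmx_mpi mdrun -v -deffnm wiggle -cpi wiggle.cpt -maxh 23.5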

I checked how much space the output files are taking up. The largest file
is the trajectory at ~0.2 GB, and I am only writing output every 1 million
steps, so I am convinced that this isn't a disk-space issue.
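
For reference, this is roughly how I checked, assuming a .trr trajectory
(with -deffnm wiggle the output files are wiggle.log, wiggle.trr, and so
on):

# Sizes of all output files from this run
du -h wiggle.*

# Last lines of the log show the final step that was actually written
tail -n 20 wiggle.log

# Confirm the trajectory is readable up to the last written frame
gmx_mpi check -f wiggle.trr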

I've reached out to the people who run the supercomputer, and they are not
sure what is going on. One of the admins ran a system trace on the
'gmx_mpi mdrun' process on one of the hung nodes and got the following
output:

[root@node0636 ~]# ps -ef | grep beco
root 2053 32739 0 20:48 pts/1 00:00:00 grep beco
beco4952 17561 17557 0 Jun14 ? 00:00:00 /bin/bash
/tmp/node0636/job1470980/slurm_script
beco4952 17597 17561 0 Jun14 ? 00:00:00 /bin/bash
/projects/beco4952/Gromacs/Pores/GitHub/Shell-Scripts/Build_and_Sim.sh -M
monomer1.pdb -I steep -S 50000 -c verlet -t 10 -o 6 -r 3 -p 40 -P 4 -w 10
-l 20 -x 8.0 -y 8.0 -e 0.1 -T Equilibration in Vacuum -C verlet -i md -D
0.002 -L 200 -f 100 -v v-rescale -K 300 -b berendsen -Y semiisotropic -B 1
-R 4.5e-5 -Z xyz -V 1 -n 8 -s off -m on
beco4952 18307 17597 0 Jun14 ? 00:00:00 /bin/sh
/curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/mpirun -np 16 gmx_mpi
mdrun -v -deffnm wiggle
beco4952 18313 18307 0 Jun14 ? 00:01:34 mpiexec.hydra -np 16 gmx_mpi mdrun
-v -deffnm wiggle
beco4952 18314 18313 0 Jun14 ? 00:00:00 /curc/slurm/slurm/current/bin/srun
--nodelist
node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N
8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
--control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
--pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
--enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
--proxy-id -1
beco4952 18315 18314 0 Jun14 ? 00:00:00 /curc/slurm/slurm/current/bin/srun
--nodelist
node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N
8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
--control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
--pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
--enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
--proxy-id -1
beco4952 18334 18329 0 Jun14 ? 00:01:00
/curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
--control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
--pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
--enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
--proxy-id -1
beco4952 18354 18334 99 Jun14 ? 13-20:24:21 gmx_mpi mdrun -v -deffnm wiggle
beco4952 18355 18334 99 Jun14 ? 13-20:30:41 gmx_mpi mdrun -v -deffnm wiggle

[root@node0636 ~]# strace -f -p 18354
Process 18354 attached with 9 threads - interrupt to quit
[pid 18380] futex(0x2b76d6461484, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
...>
[pid 18378] futex(0x2b76d6462984, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
...>
[pid 18375] futex(0x2b76d6475484, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
...>
[pid 18374] futex(0x2b76d6476984, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
...>
[pid 18368] restart_syscall(<... resuming interrupted call ...> <unfinished
...>
[pid 18377] futex(0x2b76d6463e84, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished
...>
[pid 18364] restart_syscall(<... resuming interrupted call ...> <unfinished
...>
[pid 18354] futex(0x2b76d6477784, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished
...>
[pid 18373] restart_syscall(<... resuming interrupted call ...>) = -1
ETIMEDOUT (Connection timed out)
[pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18373] futex(0x2b76d0736a44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631433, {1466303205,
353534000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18373] futex(0x2b76d0736a44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631435, {1466303205,
553978000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18373] futex(0x2b76d0736a44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631437, {1466303205,
754503000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18373] futex(0x2b76d0736a44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631439, {1466303205,
954902000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18373] futex(0x2b76d0736a44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631441, {1466303206,
155424000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18373] futex(0x2b76d0736a44,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631443, {1466303206,
355864000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
"

I built MPI-enabled GROMACS following the recommended practices for this
system, which included using the Intel compilers and Intel MPI. Could it
be that I need to rebuild with a different toolchain, such as gcc? It
looks like the processes are stuck in some sort of deadlock.
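
If rebuilding is the sensible next step, my understanding is that the
configuration would look roughly like the following (the install prefix is
a placeholder, and I am not certain which gcc-based MPI wrappers are the
right ones on this cluster):

# Configure an MPI-enabled GROMACS build with gcc instead of the Intel
# compilers; CC/CXX must point at gcc-based MPI wrappers
cmake .. \
  -DGMX_MPI=ON \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DGMX_BUILD_OWN_FFTW=ON \
  -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-gcc
make -j 8 && make install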

Any ideas on how to address this problem would be much appreciated,

Regards,
Ben Coscia

