[gmx-users] Job Output stops being written during long simulations on HPC cluster

Szilárd Páll pall.szilard at gmail.com
Wed Jun 22 15:38:12 CEST 2016


I doubt it's a compiler issue; if anything, it's more likely a
misbehaving system component (kernel or file system). I'd try writing
the output to another file system, e.g. /tmp if there is one, just to
check.
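
A minimal sketch of such a check, assuming a POSIX shell and the same
launcher line as in the job quoted below. The helper names fs_write_ok
and run_mdrun_to and the MDRUN_LAUNCH variable are made up for
illustration; the mdrun options used (-s, -o, -x, -cpo, -c, -e, -g, -v)
are the standard file options. Input is still read from the current
directory, and only the output files are redirected:

```shell
# Hypothetical helpers; only the mdrun file options and the launcher
# line (taken from the original job) are not made up here.

# Succeeds only if a small file can be written to and read back from
# directory $1.
fs_write_ok() {
    probe="$1/.write_probe.$$"
    printf 'probe\n' > "$probe" 2>/dev/null || return 1
    content=$(cat "$probe" 2>/dev/null)
    rm -f "$probe"
    [ "$content" = "probe" ]
}

# Run the job with wiggle.tpr read from the current directory and every
# output file redirected to directory $1.
# MDRUN_LAUNCH (an assumption) defaults to the original launcher line.
run_mdrun_to() {
    out=$1
    mkdir -p "$out" || return 1
    fs_write_ok "$out" || { echo "cannot write to $out" >&2; return 1; }
    ${MDRUN_LAUNCH:-mpirun -np 16 gmx_mpi} mdrun -v -s wiggle.tpr \
        -o "$out/wiggle.trr" -x "$out/wiggle.xtc" -cpo "$out/wiggle.cpt" \
        -c "$out/wiggle.gro" -e "$out/wiggle.edr" -g "$out/wiggle.log"
}
```

From the job script this would be called as, for example,
run_mdrun_to /tmp/$USER/wiggle_test. Note that /tmp is usually local to
each node, so the directory has to exist on the node where the rank
that writes the output runs (normally the first one), and the results
have to be copied off before the job ends. If the run gets past ~50
million steps this way, that points at the original file system.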
--
Szilárd


On Tue, Jun 21, 2016 at 1:35 AM, Benjamin Joseph Coscia
<Benjamin.Coscia at colorado.edu> wrote:
> Hi everyone,
>
> I have been attempting to run some long simulations on a supercomputer at
> the University of Colorado at Boulder. I am trying to run simulations for
> about 200 ns. I have done tests using 48 cores and 96 cores. In each case
> output stops being written at the same time step (~50 million steps). This
> is only about half of the simulation time I wanted. According to SLURM, the
> job is still running days past when it stopped outputting.
>
> I checked how much space is being taken up by output files. The largest
> file is the trajectory at ~0.2 GB. I am only outputting data every 1
> million steps. I am convinced that this isn't a memory issue.
>
> I've reached out to the people who run the supercomputer and they are not
> positive what is going on. One of the guys there ran a system trace on the
> 'gmx_mpi mdrun' process and got the following output by looking at one of
> the nodes that is hung up:
>
> [root@node0636 ~]# ps -ef | grep beco
> root 2053 32739 0 20:48 pts/1 00:00:00 grep beco
> beco4952 17561 17557 0 Jun14 ? 00:00:00 /bin/bash
> /tmp/node0636/job1470980/slurm_script
> beco4952 17597 17561 0 Jun14 ? 00:00:00 /bin/bash
> /projects/beco4952/Gromacs/Pores/GitHub/Shell-Scripts/Build_and_Sim.sh -M
> monomer1.pdb -I steep -S 50000 -c verlet -t 10 -o 6 -r 3 -p 40 -P 4 -w 10
> -l 20 -x 8.0 -y 8.0 -e 0.1 -T Equilibration in Vacuum -C verlet -i md -D
> 0.002 -L 200 -f 100 -v v-rescale -K 300 -b berendsen -Y semiisotropic -B 1
> -R 4.5e-5 -Z xyz -V 1 -n 8 -s off -m on
> beco4952 18307 17597 0 Jun14 ? 00:00:00 /bin/sh
> /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/mpirun -np 16 gmx_mpi
> mdrun -v -deffnm wiggle
> beco4952 18313 18307 0 Jun14 ? 00:01:34 mpiexec.hydra -np 16 gmx_mpi mdrun
> -v -deffnm wiggle
> beco4952 18314 18313 0 Jun14 ? 00:00:00 /curc/slurm/slurm/current/bin/srun
> --nodelist
> node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N
> 8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
> --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
> --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
> --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
> --proxy-id -1
> beco4952 18315 18314 0 Jun14 ? 00:00:00 /curc/slurm/slurm/current/bin/srun
> --nodelist
> node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N
> 8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
> --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
> --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
> --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
> --proxy-id -1
> beco4952 18334 18329 0 Jun14 ? 00:01:00
> /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy
> --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache
> --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0
> --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2
> --proxy-id -1
> beco4952 18354 18334 99 Jun14 ? 13-20:24:21 gmx_mpi mdrun -v -deffnm wiggle
> beco4952 18355 18334 99 Jun14 ? 13-20:30:41 gmx_mpi mdrun -v -deffnm wiggle
>
> [root@node0636 ~]# strace -f -p 18354
> Process 18354 attached with 9 threads - interrupt to quit
> [pid 18380] futex(0x2b76d6461484, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
> [pid 18378] futex(0x2b76d6462984, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
> [pid 18375] futex(0x2b76d6475484, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
> [pid 18374] futex(0x2b76d6476984, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
> [pid 18368] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> [pid 18377] futex(0x2b76d6463e84, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
> [pid 18364] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> [pid 18354] futex(0x2b76d6477784, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
> [pid 18373] restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631433, {1466303205, 353534000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631435, {1466303205, 553978000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631437, {1466303205, 754503000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631439, {1466303205, 954902000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631441, {1466303206, 155424000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631443, {1466303206, 355864000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
>
> I built MPI-enabled GROMACS using best practices for the computer system
> here, which included the use of Intel compilers. Could it be that I need to
> rebuild using a different compiler such as gcc? It seems that I am in some
> sort of deadlock.
>
> Any ideas on how to address this problem would be much appreciated,
>
> Regards,
> Ben Coscia