[gmx-users] Job Output stops being written during long simulations on HPC cluster

Mark Abraham mark.j.abraham at gmail.com
Wed Jun 22 17:00:51 CEST 2016


Hi,

Or the filesystem disappeared during the run...

Mark

On Wed, Jun 22, 2016 at 3:38 PM Szilárd Páll <pall.szilard at gmail.com> wrote:

> I doubt it's a compiler issue; if anything, it's more likely a system
> component that's misbehaving (kernel or file system). I'd try writing
> output to another file system, e.g. /tmp if there is one, just to check.
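>
> A quick way to do that check might be something along these lines (the
> test directory, step count, and .tpr location are just placeholders for
> a short test run):
>
>     # run a short test with all mdrun output going to node-local /tmp
>     mkdir -p /tmp/$USER/gmx_fs_test && cd /tmp/$USER/gmx_fs_test
>     cp /path/to/wiggle.tpr .   # wherever the real run's input lives
>     mpirun -np 16 gmx_mpi mdrun -v -deffnm wiggle -nsteps 2000000
>     # if output keeps appearing here well past the step count where the
>     # run on the shared file system stalled, suspect the shared fs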
> --
> Szilárd
>
>
> On Tue, Jun 21, 2016 at 1:35 AM, Benjamin Joseph Coscia
> <Benjamin.Coscia at colorado.edu> wrote:
> > Hi everyone,
> >
> > I have been attempting to run some long simulations on a supercomputer
> > at the University of Colorado at Boulder. I am trying to run simulations
> > for about 200 ns. I have done tests using 48 cores and 96 cores. In each
> > case, output stops being written at the same time step (~50 million
> > steps). This is only about half of the simulation time I wanted.
> > According to SLURM, the job is still running days past when it stopped
> > outputting.
> >
> > I checked how much space the output files are taking up. The largest
> > file is the trajectory at ~0.2 GB, and I am only writing output every 1
> > million steps, so I am convinced that this isn't a disk space issue.
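> >
> > For reference, the arithmetic and checks behind that (trajectory file
> > name assumed from -deffnm wiggle; the expected frame count is just
> > back-of-the-envelope):
> >
> >   # 200 ns at dt = 0.002 ps is 100,000,000 steps; writing every
> >   # 1,000,000 steps should give on the order of 100 frames in total
> >   gmx_mpi check -f wiggle.xtc     # or wiggle.trr; counts frames actually written
> >   gmx_mpi dump -s wiggle.tpr | grep -E 'nst(xout|vout|fout|log|energy)'
> >   df -h .                         # output file system is nowhere near full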
> >
> > I've reached out to the people who run the supercomputer, and they are
> > not sure what is going on. One of them ran a system call trace on the
> > 'gmx_mpi mdrun' process on one of the hung nodes and got the following
> > output:
> >
> > [root@node0636 ~]# ps -ef | grep beco
> > root      2053 32739  0 20:48 pts/1 00:00:00 grep beco
> > beco4952 17561 17557  0 Jun14 ?     00:00:00 /bin/bash /tmp/node0636/job1470980/slurm_script
> > beco4952 17597 17561  0 Jun14 ?     00:00:00 /bin/bash /projects/beco4952/Gromacs/Pores/GitHub/Shell-Scripts/Build_and_Sim.sh -M monomer1.pdb -I steep -S 50000 -c verlet -t 10 -o 6 -r 3 -p 40 -P 4 -w 10 -l 20 -x 8.0 -y 8.0 -e 0.1 -T Equilibration in Vacuum -C verlet -i md -D 0.002 -L 200 -f 100 -v v-rescale -K 300 -b berendsen -Y semiisotropic -B 1 -R 4.5e-5 -Z xyz -V 1 -n 8 -s off -m on
> > beco4952 18307 17597  0 Jun14 ?     00:00:00 /bin/sh /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/mpirun -np 16 gmx_mpi mdrun -v -deffnm wiggle
> > beco4952 18313 18307  0 Jun14 ?     00:01:34 mpiexec.hydra -np 16 gmx_mpi mdrun -v -deffnm wiggle
> > beco4952 18314 18313  0 Jun14 ?     00:00:00 /curc/slurm/slurm/current/bin/srun --nodelist node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N 8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2 --proxy-id -1
> > beco4952 18315 18314  0 Jun14 ?     00:00:00 /curc/slurm/slurm/current/bin/srun --nodelist node0636,node0637,node0638,node0639,node0640,node0641,node0642,node0643 -N 8 -n 8 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2 --proxy-id -1
> > beco4952 18334 18329  0 Jun14 ?     00:01:00 /curc/tools/x86_64/rh6/software/impi/5.0.3.048/bin64/pmi_proxy --control-port node0636.rc.int.colorado.edu:41464 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher slurm --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 156865336 --usize -2 --proxy-id -1
> > beco4952 18354 18334 99 Jun14 ?     13-20:24:21 gmx_mpi mdrun -v -deffnm wiggle
> > beco4952 18355 18334 99 Jun14 ?     13-20:30:41 gmx_mpi mdrun -v -deffnm wiggle
> >
> > [root@node0636 ~]# strace -f -p 18354
> > Process 18354 attached with 9 threads - interrupt to quit
> > [pid 18380] futex(0x2b76d6461484, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
> > [pid 18378] futex(0x2b76d6462984, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
> > [pid 18375] futex(0x2b76d6475484, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
> > [pid 18374] futex(0x2b76d6476984, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
> > [pid 18368] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> > [pid 18377] futex(0x2b76d6463e84, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
> > [pid 18364] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> > [pid 18354] futex(0x2b76d6477784, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
> > [pid 18373] restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
> > [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> > [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631433, {1466303205, 353534000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> > [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> > [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631435, {1466303205, 553978000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> > [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> > [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631437, {1466303205, 754503000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> > [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> > [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631439, {1466303205, 954902000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> > [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> > [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631441, {1466303206, 155424000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> > [pid 18373] futex(0x2b76d0736a00, FUTEX_WAKE_PRIVATE, 1) = 0
> > [pid 18373] futex(0x2b76d0736a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3631443, {1466303206, 355864000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
> >
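> > To tell a hung file system apart from an MPI deadlock, a few follow-up
> > checks on the hung node could look like this (gstack ships with gdb on
> > RHEL 6; the directory is a guess based on the script path above):
> >
> >   gstack 18354 > mdrun_stacks.txt   # user-space stacks of all 9 threads
> >   timeout 30 ls -l /projects/beco4952/Gromacs/Pores \
> >       || echo "shared file system not responding"
> >   # if the ls itself hangs or times out, the shared fs is unresponsive
> >   ps -eo pid,stat,wchan:32,cmd | grep '[g]mx_mpi mdrun'
> >   # a STAT of 'D' (uninterruptible sleep) usually means blocked on I/O
> >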
> > I built MPI-enabled GROMACS following the recommended practices for this
> > system, which included using the Intel compilers. Could it be that I need
> > to rebuild with a different compiler such as gcc? It seems that I am in
> > some sort of deadlock.
> >
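> > If a gcc rebuild is worth trying, I assume the configure step would look
> > roughly like this (version number and install prefix are placeholders;
> > with Intel MPI the mpicc/mpicxx wrappers use gcc/g++ by default):
> >
> >   cd gromacs-5.1.2 && mkdir build && cd build
> >   CC=mpicc CXX=mpicxx cmake .. -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
> >       -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-gcc
> >   make -j 8 && make install
> >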
> > Any ideas on how to address this problem would be much appreciated,
> >
> > Regards,
> > Ben Coscia