[gmx-users] mdrun bailing out on mpi cluster

Mark Abraham mark.j.abraham at gmail.com
Sat Jul 16 13:11:30 CEST 2016


Hi,

On Fri, Jul 15, 2016 at 4:35 PM Gregory, Andrew J. <Gregory.Andrew at mayo.edu>
wrote:

> We have run into two issues using a high performance cluster with gromacs
> that we have not had when using our own machines.
>
> 1)      Jobs stop writing to file after 2-3 days
> We have seen the "Simulation running but no output" entry
> http://www.gromacs.org/Documentation/Errors?highlight=stops+writing#Simulation_running_but_no_output
> on the GROMACS errors page, but none of those explanations seems to apply:
> - It writes output for two days at full speed, so it is not the speed of
> the simulation.
> - These simulations never run into this on our own machines, so it is
> unlikely they are producing extraneous NaNs, and the cluster does not
> report high system CPU usage but actually lower usage when output stops.
> - The .mdp file works on our other machine, and there is full output for
> the first couple of days.
> - The disk is not full; it is very large.
> - We are using MPI, but it works for the first couple of days, so it is
> unlikely that LAM was not started.
> There is no warning or error output with this, and the jobs will continue
> running for days in this state. We only know of this issue because we were
> monitoring the cluster nodes running our jobs.
>
> 2)      Nodes stop communicating with the error:
> [dnode28]  ud_channel.c:768  Fatal: UD timeout sending to dnode05 (after
> 600.16 seconds)
> (the full backtrace that follows it is attached; all of this comes from
> the .out file we pipe the output into)
> - The cluster IT department has looked into it and assured us it is not
> their error but a software error.
>

It can be, but since GROMACS runs on thousands of clusters the world over
and does nothing fancy at the network level, if it is a software error then
it is most likely in, e.g., the code or configuration of the MPI library.
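For example, the version header that mdrun prints also reports which MPI
library the binary was built against, so you can check that the build and
the runtime MPI actually match. A minimal sketch, using the gmx_mpi binary
name from your submission command:

  # Print GROMACS build information, including the "MPI library" line
  gmx_mpi -version | grep -i "MPI library"

  # Confirm which MPI shared library the binary is actually linked against
  ldd $(which gmx_mpi) | grep -i mpi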


> - Their only suggestion was running on one 16-core node to avoid the
> communication errors, but the slowdown would be too much, and with issue 1
> still occurring it wouldn't be worth it.
>
> In our sbatch submission we use the command:
>
> mpirun -np $NSLOTS gmx_mpi mdrun  -deffnm md >md.out
>
> We are running GROMACS 5.1.1 and Open MPI 1.8.1
>

Note that there have been a lot of bug fixes to Open MPI 1.8.x since then...
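A quick way to confirm what you actually have at run time (a sketch; it
assumes the mpirun on your PATH inside the job is the same one your sbatch
script ends up using, e.g. after loading the same modules):

  # Report the Open MPI version that mpirun will launch with
  mpirun --version

  # ompi_info gives the full build details of the Open MPI installation
  ompi_info | grep "Open MPI:"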


>
> Is there anything else that could be stopping our jobs from writing out?
>

If e.g. a networked disk decided to disappear, that could explain the
symptoms.
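One rough way to test that from the node the job is on (the path below is
just a placeholder for your actual run directory):

  # Is the run directory still mounted, and does it still accept writes?
  df -h /path/to/run/dir
  touch /path/to/run/dir/.write_test && echo "write OK"

  # Recent kernel messages often show NFS/Lustre timeouts when a networked
  # filesystem has gone away
  dmesg | tail -n 50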


> Could Gromacs cause a communication issue between nodes?
>

No.


> Is the node communication error connected to the writing out issue?
>

It depends. If the network has just gone away, then neither writing to a
networked disk nor MPI communication will work, but you can only assess
which is the case by, e.g., seeing whether there is still any output going
to the job's stdout/stderr.
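For example (a minimal sketch; md.out is the file your submission command
redirects into, and md.log is mdrun's own log for -deffnm md):

  # Compare when mdrun last wrote to its log versus the redirected stdout
  ls -l --full-time md.out md.log

  # Watch the redirected stdout for new lines while the job is "stuck"
  tail -f md.out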

Mark

> Thank you
>

