[gmx-users] mdrun bailing out on mpi cluster

Gregory, Andrew J. Gregory.Andrew at mayo.edu
Fri Jul 15 16:35:35 CEST 2016


We have run into two issues running GROMACS on a high-performance cluster that we have not seen on our own machines.

1)      Jobs stop writing to file after 2-3 days
We have read the "Simulation running but no output" entry on the GROMACS errors page (http://www.gromacs.org/Documentation/Errors?highlight=stops+writing#Simulation_running_but_no_output), but none of the causes listed there seem to apply:
-Output is written at full speed for two days, so the simulation is not simply slow
-These simulations never show this problem on our machines, so it is unlikely that they are producing extraneous NaNs. The cluster also does not report high system CPU usage; usage actually drops when output stops
-The .mdp file works on our other machine, and there is full output for the first couple of days
-The disk is not full; it is very large
-We are using MPI, but it works for the first couple of days, so it is unlikely that LAM was not started
No warning or error is output when this happens, and the jobs continue running for days in this state. We only know of the problem because we were monitoring the cluster nodes running our jobs.
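
Since the stall produces no message, one way to catch it without watching the nodes by hand would be to check how recently an output file was modified. The following is only a minimal sketch, not something we currently run: the function name is_fresh is hypothetical, and it assumes GNU stat (the -c %Y form) is available on the cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of GROMACS): returns 0 if FILE was modified
# within the last LIMIT seconds, 1 if it is older, 2 if it cannot be read.
is_fresh() {
    file=$1
    limit=$2
    now=$(date +%s)
    # Assumes GNU coreutils stat; BSD/macOS would need: stat -f %m "$file"
    mtime=$(stat -c %Y "$file") || return 2
    [ $((now - mtime)) -le "$limit" ]
}
```

With -deffnm md the log file is md.log, so a periodic check could look like: is_fresh md.log 3600 || echo "md.log not updated in the last hour" >&2. The one-hour limit is an arbitrary example and would need to be longer than the interval at which the run normally writes to the log.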

2)      Nodes stop communicating with the error:
[dnode28]  ud_channel.c:768  Fatal: UD timeout sending to dnode05 (after 600.16 seconds)
(the full backtrace that follows it is attached; all of it comes from the .out file we redirect the output into)
-The cluster IT department has looked into it and assured us it is not a fault on their side but a software error
-Their only suggestion was to run on a single 16-core node to avoid the communication errors, but the slowdown would be too great, and with issue 1 still occurring it would not be worth it.

In our sbatch submission we use the command:

mpirun -np $NSLOTS gmx_mpi mdrun -deffnm md > md.out

We are running GROMACS 5.1.1 and Open MPI 1.8.1.

Is there anything else that could be stopping our jobs from writing out?
Could GROMACS cause a communication issue between nodes?
Is the node communication error connected to the writing out issue?

Thank you


