[gmx-users] Production run error

Sanket Ghawali sanket.ghawali at gmail.com
Wed May 18 08:44:11 CEST 2016


Dear all,
I am running a 100 ns production simulation. It runs well up to 88 ns
but then stops with the following error message:

Program mdrun_mpi, VERSION 4.6.5
Source code file: /root/data/gromacs-4.6.5/src/mdlib/domdec.c, line: 4412

Fatal error:
A charge group moved too far between two domain decomposition steps
This usually means that your system is not well equilibrated
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"Come on boys, Let's push it hard" (P.J. Harvey)

Error on node 25, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 25 out of 48

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.5
Source code file: /root/data/gromacs-4.6.5/src/mdlib/domdec.c, line: 4412

Fatal error:
A charge group moved too far between two domain decomposition steps
This usually means that your system is not well equilibrated
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"Come on boys, Let's push it hard" (P.J. Harvey)

Error on node 37, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 37 out of 48

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.5
Source code file: /root/data/gromacs-4.6.5/src/mdlib/domdec.c, line: 4412

Fatal error:
A charge group moved too far between two domain decomposition steps
This usually means that your system is not well equilibrated
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"Come on boys, Let's push it hard" (P.J. Harvey)

Error on node 38, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 38 out of 48

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 38 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[compute-0-2.local][[20037,1],46][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-2.local][[20037,1],26][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-1.local][[20037,1],13][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-1.local][[20037,1],33][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-1.local][[20037,1],45][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[bicamp.bicnirrh.res.in:13287] 2 more processes have sent help message
help-mpi-api.txt / mpi-abort
[bicamp.bicnirrh.res.in:13287] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
[compute-0-0.local][[20037,1],24][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[compute-0-0.local][[20037,1],36][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun has exited due to process rank 38 with PID 5651 on
node compute-0-2 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here)

The command used is:

mpirun -np 48 -hostfile host /share/apps/gromacs/bin/mdrun_mpi -v -deffnm filename
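
For reference, once the cause is fixed I intend to continue the run from
the last checkpoint rather than start over. A rough sketch of that
continuation, assuming mdrun_mpi wrote filename.cpt alongside the other
-deffnm output files:

# continue from the checkpoint written near 88 ns (assumes filename.cpt exists)
mpirun -np 48 -hostfile host /share/apps/gromacs/bin/mdrun_mpi -v -deffnm filename -cpi filename.cpt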

I checked all the input files and everything seems to be fine; I have
used the same parameters for other simulations and they ran without
problems.

Does anyone know what the problem might be?

