[gmx-users] problem with Hamiltonian replica exchange restarting
francesco oteri
francesco.oteri at gmail.com
Fri Apr 20 15:32:26 CEST 2012
Dear gromacs users,
I ran a 20 ns REMD simulation with gromacs 4.5.3, with free energy enabled
and a different init_lambda value for each replica.
I ran the simulation on a cluster equipped with the Torque queue manager.
1) I used the following command in the submission script:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -v >& log.rest2_TrpCage
The run went fine and terminated correctly after 36 hours, before reaching
the 20 ns, and every output file was written.
2) Then I extended the simulation using the command:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -cpi -v >& log.resume.rest2_TrpCage
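For reference, -cpi makes mdrun continue from the per-replica checkpoint
files; if I read the -multi numbering correctly they are named state0.cpt
... state9.cpt, so a quick sanity check before resubmitting is something like:

ls -lrt state?.cpt   # expect one recent checkpoint per replica with -multi 10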
This time, the program crashed with the error:
[[28079,1],72][/caspur/shared/src/openmpi/openmpi-1.4.3/ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc]
from neo085 to: neo098 error polling LP CQ with status RETRY EXCEEDED ERROR
status number 12 for wr_id 427787264 opcode 36099 vendor error 129 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   neo085
  Local device: mthca0
  Peer host:    neo098

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 72 with PID 2083 on
node neo085 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
The simulation time reached, as written in md0.log, was 12.0766 ns.
3) I assumed it was a network error impairing the communication among the
nodes. I frequently get this error and usually restart the simulation
without any trouble.
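If it keeps happening, the two MCA parameters mentioned in the message above
can be raised on the mpirun command line; the values below are only a guess
on my side, not something I have verified:

mpirun --mca btl_openib_ib_timeout 20 --mca btl_openib_ib_retry_count 7 \
  -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -cpi -v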
Hence I restarted the simulation again:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -cpi -v >& log.resume1.rest2_TrpCage
The simulation went fine, reaching 20 ns without any complaints from
gromacs.
When I started the data analysis, I noticed that all 10 trajectory files
stop at nearly 12.07 ns, while the energy files are 20 ns long.
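The end times can be read off with gmxcheck, e.g. for replica 0 (it also
warns when frame times are inconsistent, which is how I would look for
spurious frames):

gmxcheck -f traj0.trr   # number of frames and time of the last frame in the trajectory
gmxcheck -e ener0.edr   # the same for the energy file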
If I check the last modification time with ls -l, it shows that the files
were modified nearly simultaneously:
[oteri at matrix2 REST2]$ ls -lrt *.trr *.edr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj8.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj3.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj2.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj1.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj7.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj9.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj4.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj6.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj0.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj5.trr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener9.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener8.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener7.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener6.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener5.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener4.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener3.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener2.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener1.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener0.edr
So gromacs did actually access both the trajectory and the energy files.
I have three questions:
1) Is this a known bug, and has it been fixed in gromacs 4.5.5?
2) How can I check whether the trajectories are correct? I mean, how can I
check whether spurious frames have been inserted?
3) If they are correct, how can I restart from 12 ns? (I sketch below what I had in mind.)
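What I had in mind for point 3 is roughly the following, though I am not
sure it is the right approach; the file names and the use of grompp -t to
restart from the last frame stored in each trr are only my guess:

# build a new tpr per replica that starts from the last frame in the trajectory
# (with nsteps in the mdp reduced to cover the remaining ~8 ns)
grompp -f rest2_0.mdp -c conf.gro -t traj0.trr -e ener0.edr -o rest2_12ns_0.tpr
# ... likewise for replicas 1-9, then run without -cpi:
mpirun -np $NP mdrun_mpi_gcc -s rest2_12ns_.tpr -multi 10 -replex 1000 -dd 2 2 2 \
  -maxh 36 -v >& log.redo.rest2_TrpCage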
You can download the log and mdp files from
http://160.80.35.105/download/problem/
The other 9 mdp files differ only in the init_lambda value:
rest2_0.mdp:init_lambda=-0.000000
rest2_1.mdp:init_lambda=0.143679
rest2_2.mdp:init_lambda=0.274297
rest2_3.mdp:init_lambda=0.388587
rest2_4.mdp:init_lambda=0.501717
rest2_5.mdp:init_lambda=0.611494
rest2_6.mdp:init_lambda=0.716387
rest2_7.mdp:init_lambda=0.818048
rest2_8.mdp:init_lambda=0.910347
rest2_9.mdp:init_lambda=1.000000
Thank you for your help,
Francesco