[gmx-users] problem with Hamiltonian replica exchange restarting
francesco oteri
francesco.oteri at gmail.com
Fri Apr 20 15:32:26 CEST 2012
Dear gromacs users,
I ran a 20 ns REMD simulation with gromacs 4.5.3, with free energy enabled
and a different init_lambda value for each replica.
I ran the simulation on a cluster equipped with the Torque queue manager.
1) I used the following command in the submission script:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -v >& log.rest2_TrpCage
The run went fine and terminated correctly after 36 hours, before reaching
the 20 ns, and every output file was written.
2) Then I extended the simulation using the command:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -cpi -v >& log.resume.rest2_TrpCage
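For reference, -cpi makes mdrun continue from the per-replica checkpoint
files; if I read the -multi numbering correctly they are named state0.cpt
... state9.cpt, so a quick sanity check before resubmitting is something like:

ls -lrt state?.cpt   # expect one recent checkpoint per replica with -multi 10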
This time, the program crashed with the error:
[[28079,1],72][/caspur/shared/src/openmpi/openmpi-1.4.3/ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc]
from neo085 to: neo098 error polling LP CQ with status RETRY EXCEEDED ERROR
status number 12 for wr_id 427787264 opcode 36099 vendor error 129 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   neo085
  Local device: mthca0
  Peer host:    neo098

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 72 with PID 2083 on
node neo085 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
The simulation time reached, as written in md0.log, was 12.0766 ns.
3) I assumed it was a network error impairing the communication among the
nodes. I frequently get this error and usually restart the simulation
without any trouble.
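If it keeps happening, the two MCA parameters mentioned in the message above
can be raised on the mpirun command line; the values below are only a guess
on my side, not something I have verified:

mpirun --mca btl_openib_ib_timeout 20 --mca btl_openib_ib_retry_count 7 \
  -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -cpi -v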
Hence I restarted the simulation again:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -cpi -v >& log.resume1.rest2_TrpCage
The simulation went fine, reaching 20 ns without any complaints from
gromacs.
When I started the data analysis, I noticed that all 10 trajectory files
stop at nearly 12.07 ns, while the energy files are 20 ns long.
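The end times can be read off with gmxcheck, e.g. for replica 0 (it also
warns when frame times are inconsistent, which is how I would look for
spurious frames):

gmxcheck -f traj0.trr   # number of frames and time of the last frame in the trajectory
gmxcheck -e ener0.edr   # the same for the energy file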
If I check the last modification time with ls -l, it shows that the files
were modified nearly simultaneously:
[oteri at matrix2 REST2]$ ls -lrt *.trr *.edr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj8.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj3.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj2.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj1.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj7.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj9.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj4.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj6.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj0.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj5.trr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener9.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener8.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener7.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener6.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener5.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener4.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener3.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener2.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener1.edr
-rw-r--r-- 1 oteri be7 1201324 avr 19 15:52 ener0.edr
So gromacs did actually access both the trajectory and the energy files.
I have three questions:
1) Is this a known bug, and has it been fixed in gromacs 4.5.5?
2) How can I check whether the trajectories are correct? I mean, how can I
check whether spurious frames have been inserted?
3) If they are correct, how can I restart from 12 ns? (I sketch below what I had in mind.)
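What I had in mind for point 3 is roughly the following, though I am not
sure it is the right approach; the file names and the use of grompp -t to
restart from the last frame stored in each trr are only my guess:

# build a new tpr per replica that starts from the last frame in the trajectory
# (with nsteps in the mdp reduced to cover the remaining ~8 ns)
grompp -f rest2_0.mdp -c conf.gro -t traj0.trr -e ener0.edr -o rest2_12ns_0.tpr
# ... likewise for replicas 1-9, then run without -cpi:
mpirun -np $NP mdrun_mpi_gcc -s rest2_12ns_.tpr -multi 10 -replex 1000 -dd 2 2 2 \
  -maxh 36 -v >& log.redo.rest2_TrpCage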
You can download the log and mdp files from
http://160.80.35.105/download/problem/
The other 9 mdp files differ only in the init_lambda value:
rest2_0.mdp:init_lambda=-0.000000
rest2_1.mdp:init_lambda=0.143679
rest2_2.mdp:init_lambda=0.274297
rest2_3.mdp:init_lambda=0.388587
rest2_4.mdp:init_lambda=0.501717
rest2_5.mdp:init_lambda=0.611494
rest2_6.mdp:init_lambda=0.716387
rest2_7.mdp:init_lambda=0.818048
rest2_8.mdp:init_lambda=0.910347
rest2_9.mdp:init_lambda=1.000000
Thank you for your help,
Francesco