[gmx-users] RE: Re: RE: Re: RE: About the binary identical results by restarting from the checkpoint file

Cuiying Jian cuiying_jian at hotmail.com
Mon Jun 17 18:41:14 CEST 2013


Hi Mark,
 
I tested the simulations again using the Berendsen thermostat in GROMACS version 4.6.2. Below are the procedures and results:
 
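(For reference: relative to the original .mdp file quoted at the bottom of this message, the only change needed for these Berendsen tests should be the thermostat selection, i.e. something like

Tcoupl   = berendsen
tau_t    = 0.1
tc-grps  = HEP
ref_t    = 300

with all other settings left unchanged.)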
1. Gromacs version 4.6.2 (gromacs462_openmp_mkl_nogpu):
Run simulation No. 1:
mdrun -v -nt 1 -ntmpi 1 -cpt 0 -s md.tpr -nsteps 1000 -deffnm md -reprod
and then restart it from the checkpoint:
mdrun -v -nt 1 -ntmpi 1 -cpt 0 -cpi md.cpt -s md.tpr -nsteps 1000 -deffnm md -reprod
Run simulation No. 2:
mdrun -v -nt 1 -ntmpi 1 -cpt 0 -s md.tpr -nsteps 2000 -deffnm md2 -reprod
 
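The restarted run (md) and the continuous run (md2) were then compared with gmxcheck, along the lines of:

gmxcheck -f md.trr -f2 md2.trr
gmxcheck -e md.edr -e2 md2.edr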
The time step is 2 fs. Comparing the two trajectories (md2.trr and md.trr): for the first 1000 steps, the two simulations give exactly the same results, but after the restart the results differ. For the last step (2000), part of the output is:
Force (total 7000 atoms):
f[ 6990] (-9.39294e+01  5.45975e+01  1.01624e+02) - (-1.02416e+02  5.23718e+01  1.01598e+02)
f[ 6991] ( 1.54992e+02 -5.87473e-02  6.75502e+01) - ( 1.58055e+02  1.48560e+00  6.96316e+01)
f[ 6992] (-2.78803e+01  3.77038e+01 -1.22465e+02) - (-2.89480e+01  3.75414e+01 -1.24998e+02)
f[ 6993] (-7.05751e+01 -4.50751e+01  6.68649e+01) - (-7.31574e+01 -3.31187e+01  6.14412e+01)
f[ 6994] ( 8.41666e+01 -8.24567e+01 -8.22464e+01) - ( 9.91297e+01 -9.07121e+01 -8.50464e+01)
f[ 6995] (-1.09495e+02 -4.55245e+01  7.79979e+01) - (-1.08700e+02 -4.04476e+01  7.19261e+01)
f[ 6996] ( 2.32501e+02  1.96111e+02 -3.86986e+01) - ( 2.08918e+02  1.92376e+02 -2.96415e+01)
f[ 6997] (-1.20652e+02  4.49553e+01 -6.55193e+01) - (-1.19118e+02  2.83817e+01 -6.51142e+01)
f[ 6998] (-1.41800e+02 -1.00833e+02  1.79822e+01) - (-1.26068e+02 -8.36744e+01  3.95021e+01)
f[ 6999] ( 1.20387e+02  3.74714e+01 -1.10871e+01) - ( 1.13464e+02  3.49046e+01 -3.11966e+01)
 
Comparing the two energy files (md2.edr and md.edr): again, the first 1000 steps give exactly the same results, but after the restart the results differ. For the last step (2000), the output is:
G96Angle        step 2000:    4607.98,  step 2000:    4617.72
Proper Dih.     step 2000:    5109.93,  step 2000:    5076.02
Potential       step 2000:   -6676.75,  step 2000:   -6703.89
Kinetic En.     step 2000:    18680.9,  step 2000:    18713.5
Temperature     step 2000:    299.631,  step 2000:    300.154
Pressure        step 2000:    19.9114,  step 2000:    20.3674
Vir-XX          step 2000:    7344.34,  step 2000:    7214.97
Vir-YY          step 2000:    4303.87,  step 2000:    4567.05
Vir-ZZ          step 2000:    5271.57,  step 2000:     5129.9
Pres-XX         step 2000:   -46.8698,  step 2000:    -41.164
Pres-YY         step 2000:    68.0412,  step 2000:    58.8771
Pres-ZZ         step 2000:    38.5629,  step 2000:     43.389
#Surf*SurfTen   step 2000:    277.815,  step 2000:    342.915
T-HEP           step 2000:    299.631,  step 2000:    300.154
 
Information from the .log files:
md2.log:
Log file opened on Sun Jun 16 10:07:42 2013
Host: cn225  pid: 8415  nodeid: 0  nnodes:  1
Gromacs version:    VERSION 4.6.2
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled
GPU support:        disabled
invsqrt routine:    gmx_software_invsqrt(x)
CPU acceleration:   SSE4.1
FFT library:        MKL
Large file support: enabled
RDTSCP usage:       enabled
Built on:           Sat Jun  1 04:19:49 MDT 2013
Built by:           phillips at parallel [CMAKE]
Build OS/arch:      Linux 2.6.18-274.18.1.el5 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
Build CPU family:   6   Model: 44   Stepping: 2
Build CPU features: aes apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
md.log:
Log file opened on Sun Jun 16 10:07:42 2013
Host: cn005  pid: 27975  nodeid: 0  nnodes:  1
Gromacs version:    VERSION 4.6.2
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled
GPU support:        disabled
invsqrt routine:    gmx_software_invsqrt(x)
CPU acceleration:   SSE4.1
FFT library:        MKL
Large file support: enabled
RDTSCP usage:       enabled
Built on:           Sat Jun  1 04:19:49 MDT 2013
Built by:           phillips at parallel [CMAKE]
Build OS/arch:      Linux 2.6.18-274.18.1.el5 x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
Build CPU family:   6   Model: 44   Stepping: 2
Build CPU features: aes apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
 
2. Gromacs version 4.6.2 (gromacs462_openmp_mkl_gpu):
The same commands shown above were used, and similar behavior is observed, i.e. restarted simulations are not binary identical to continuous ones.
 
I set -nt 1 and -ntmpi 1. Does this ensure that the simulation runs single-threaded?
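(If not, one way to rule out hidden OpenMP/MKL threading would be to pin everything explicitly -- a sketch, assuming the MKL build honours MKL_NUM_THREADS:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
mdrun -v -nt 1 -ntmpi 1 -ntomp 1 -cpt 0 -s md.tpr -nsteps 1000 -deffnm md -reprod

The thread counts reported near the top of the .log file show what mdrun actually used.)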
 
Thanks a lot.
 
Cheers,
Cuiying

> Message: 3
> Date: Sat, 15 Jun 2013 21:50:52 +0200
> From: Mark Abraham <mark.j.abraham at gmail.com>
> Subject: Re: [gmx-users] RE: Re: RE: About the binary identical
> 	results by restarting from the checkpoint file
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Message-ID:
> 	<CAMNuMAQBjD5LVCrOYBD4-qf04p4dN+WDpaBnBDv4ge-Ur-k38Q at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On Sat, Jun 15, 2013 at 9:00 PM, Cuiying Jian <cuiying_jian at hotmail.com> wrote:
> > Hi Mark,
> >
> > I tested the simulations again using the Berendsen thermostat -- still, I
> > cannot get binary identical results. I did two sets of simulations:
> > 1. Use Gromacs 4.5.2 installed on my personal computer:
> >
> 
> 4.6.2, I hope. Nobody is interested in reports about 4.5.2 :-)
> 
> 
> > Run 2 simulations using the command:
> > mdrun -s md.tpr -deffnm md -nt 1 -cpt 0 -reprod
> > (-nt 1 ensures that the number of threads to start is 1.)
> > Terminate one simulation manually. Restart this simulation by:
> > mdrun -s md.tpr -deffnm md -nt 1 -cpt 0 -cpi md.cpt -reprod -npme 0
> > (-npme 0 ensures that the number of PME nodes for the restart is the same
> > as in the checkpoint file.) Compare the results with those from the
> > continuous ones.
> 
> 
> What does gmxcheck say when comparing the resulting ostensibly equivalent
> trajectory files? Please provide a snippet of output if it says things
> differ. We want to see how big "different" is. Also the top 20 lines of a
> .log file.
> 
> Also, you can do the above procedure in a controlled manner in 4.6.2 by
> using mdrun -nsteps on the run you wish to stop prematurely.
> 
> Might your FFT library be multi-threading behind your back?
> 
> Mark
> 
> > 2. Use GROMACS 4.0.7 installed on a cluster (only one processor is used
> > during the simulation):
> > Run 2 simulations using the command:
> > mdrun_s -v -cpt 0 -s md.tpr -deffnm md -reprod
> > Terminate one simulation manually. Restart this simulation by:
> > mdrun_s -v -cpt 0 -cpi md.cpt -s md.tpr -deffnm md -reprod
> > Compare the results with those from the continuous ones. Still, I cannot
> > get binary identical results. As mentioned earlier, the only case in which
> > I can get binary identical results is for SPC rigid water molecules (using
> > the velocity-rescaling thermostat in GROMACS 4.0.7). I guess that this
> > problem may also be caused by the LINCS algorithm used to constrain all
> > bonds in every case except the rigid water one. Thanks a lot.
> > Cheers,
> > Cuiying
> >
> > > Date: Mon, 3 Jun 2013 19:15:12 +0200
> > > From: Mark Abraham <mark.j.abraham at gmail.com>
> > > Subject: Re: [gmx-users] RE: About the binary identical results by
> > >       restarting from the checkpoint file
> > > To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> > > Message-ID:
> > >       <CAMNuMARBEZ=m=Y_M1=C5PzNcGWV438MvEydOsf56R6yTc681bQ at mail.gmail.com>
> > > Content-Type: text/plain; charset=ISO-8859-1
> > >
> > > On Mon, Jun 3, 2013 at 6:59 PM, Cuiying Jian <cuiying_jian at hotmail.com> wrote:
> > >
> > > > Hi Mark,
> > > >
> > > > Thanks for your reply. I tested restarting simulations with .cpt files
> > > > in GROMACS 4.6.1, and the problems are still there, i.e. I cannot get
> > > > binary identical results from restarted simulations matching those from
> > > > continuous simulations. The command I used for restarting is the
> > > > following (only one processor is used during the simulations):
> > > > mdrun -v -s md.tpr -cpt 0 -cpi md.cpt -deffnm md -reprod
> > > >
> > >
> > > This is not generally enough to generate a serial run in 4.6, by the way.
> > > GROMACS tries very hard to automatically use all the resources available
> > > in the best way. See mdrun -h for the various -nt* options, and consult
> > > the pre-step-0 part of the .log file for feedback.
> > >
> > > > For further information, I attach my original .mdp file below:
> > > > constraints          = all-bonds   ; convert all bonds to constraints
> > > > integrator           = md
> > > > dt                   = 0.002       ; ps !
> > > > nsteps               = 10000       ; total 20 ps
> > > > nstcomm              = 10          ; frequency for center of mass motion removal
> > > > nstxout              = 5           ; write coordinates every 0.01 ps
> > > > nstxtcout            = 5           ; frequency to write coordinates to xtc trajectory
> > > > nstvout              = 5           ; frequency to write velocities to output trajectory
> > > > nstfout              = 5           ; frequency to write forces to output trajectory
> > > > nstlog               = 5           ; frequency to write energies to log file
> > > > nstenergy            = 5           ; frequency to write energies to energy file
> > > > nstlist              = 1           ; frequency to update the neighbor list
> > > > ns_type              = grid
> > > > rlist                = 1.4
> > > > coulombtype          = PME
> > > > rcoulomb             = 1.4
> > > > vdwtype              = cut-off
> > > > rvdw                 = 1.4
> > > > pme_order            = 8           ; use 6, 8 or 10 when running in parallel
> > > > ewald_rtol           = 1e-5
> > > > optimize_fft         = yes
> > > > DispCorr             = no          ; don't apply any correction
> > > > ; open LINCS
> > > > constraint_algorithm = LINCS
> > > > lincs_order          = 4           ; highest order in the expansion of the constraint coupling matrix
> > > > lincs_warnangle      = 30          ; maximum angle that a bond can rotate before LINCS will complain
> > > > lincs_iter           = 1           ; number of iterations to correct for rotational lengthening in LINCS
> > > > ; Temperature coupling is on
> > > > Tcoupl               = v-rescale
> > > >
> > >
> > > This coupling algorithm has a stochastic component, and at least at some
> > > points in history the random number generator was either not checkpointed
> > > properly, or not propagated in parallel properly. I'm not sure offhand if
> > > any of that has been fixed yet (I doubt it), but you can test (parts of)
> > > this hypothesis by using Berendsen (in any GROMACS 4.x), or really being
> > > sure you've run a single thread.
> > >
> > > If Berendsen is fully reproducible, then the RNG is the issue. While
> > > that's irritating, it probably won't get fixed before GROMACS 5 (as a
> > > side effect of other stuff going on).
> > >
> > > Mark
> > >
> > > > tau_t                = 0.1
> > > > tc-grps              = HEP
> > > > ref_t                = 300
> > > > ; Pressure coupling is on
> > > > Pcoupl               = parrinello-rahman
> > > > Pcoupltype           = isotropic
> > > > tau_p                = 1.0
> > > > compressibility      = 4.5e-5
> > > > ref_p                = 1.0
> > > > ; velocity generation is on at 300 K
> > > > gen_vel              = yes
> > > > gen_temp             = 300.0
> > > > gen_seed             = -1
> > > >
> > > > Is there something wrong with my .mdp file or my command? Thanks a lot.
> > > >
> > > > Cheers,
> > > > Cuiying
> > > > > On Sun, Jun 2, 2013 at 10:37 PM, Cuiying Jian <cuiying_jian at hotmail.com> wrote:
> > > > >
> > > > > > Hi GROMACS Users,
> > > > > >
> > > > > > These days, I am testing restarting simulations with .cpt files. I
> > > > > > already set nstlist=1 in the .mdp file. I can restart my simulations
> > > > > > (which are stopped manually) with the following command (version 4.0.7):
> > > > > > mpiexec mdrun_s_mpi -v -s md.tpr -cpt 0 -cpi md.cpt -deffnm md -reprod
> > > > > > (-reprod is used to force binary identical simulations.)
> > > > > >
> > > > > > During the restarted simulations, the same number of processors is
> > > > > > used as in the interrupted simulation. The only case in which I can
> > > > > > get binary identical results matching those from the continuous
> > > > > > simulations (which are not stopped manually) is for SPC water
> > > > > > molecules. For any other molecules (like n-heptane), I can never get
> > > > > > binary identical results matching those from the continuous
> > > > > > simulations.
> > > > > >
> > > > > > I also tried to get a new .tpr file by:
> > > > > > tpbconv_s -s md.tpr -f md.trr -e md.edr -c md_c.tpr -cont
> > > > > > and then:
> > > > > > mpiexec mdrun_s_mpi -v -s md_c.tpr -cpt 0 -cpi md.cpt -deffnm md_c -reprod
> > > > > > But I still cannot get binary identical results.
> > > > > >
> > > > > > I also tested the simulations with only one processor, and binary
> > > > > > identical results are still not obtained. Using double precision
> > > > > > does not solve the problem.
> > > > > >
> > > > > > I think that the above problems are caused by some information not
> > > > > > being stored during the running of the simulations.
> > > > > >
> > > > >
> > > > > That seems likely. The leading candidate would be a random number
> > > > > generator you're using for a stochastic integrator. Your .mdp file
> > > > > would have been useful.
> > > > >
> > > > > > On the other hand, if I run two independent simulations using
> > > > > > exactly the same number of processors, the same commands and the
> > > > > > same input files, i.e.
> > > > > > mpiexec mdrun_s_mpi -v -s md.tpr -deffnm md -reprod
> > > > > > I can always get binary identical results from these two
> > > > > > independent simulations.
> > > > > >
> > > > > > I understand that MD is chaotic and that, if we run a simulation
> > > > > > for long enough, simulation results should converge. Also, there
> > > > > > are factors which may affect reproducibility, as described on the
> > > > > > GROMACS website. But for my purpose, I am curious whether there are
> > > > > > certain methods through which I can get binary identical results
> > > > > > from restarted simulations and continuous simulations. Thanks a lot.
> > > > > >
> > > > >
> > > > > There are ways to be fully reproducible, but probably not every
> > > > > combination of algorithms has that property. 4.0.7 is so old no
> > > > > problem will be fixed, unless it can also be shown in 4.6 ;-)
> > > > >
> > > > > Mark