[gmx-users] RMSD truncation Restart simulation problems

Mark Abraham Mark.Abraham at anu.edu.au
Tue Mar 8 13:54:16 CET 2011


On 8/03/2011 9:41 PM, Henri Mone wrote:
> Hi All, hi Mark,
> Here are some more details. The outputs and error messages are
> attached at the end of the e-mail. After truncation I get the error
> message [1a]: gromacs has problems with the checksum of the trr files.
> After truncation the trajectories (xtc, trr) have the same length of
> 27752 frames [1b]. All the edr files end at the same time, t = 277518
> (138760 frames) [1b]. The cpt files used after truncation have step =
> 138762700 and t = 277525.400000 [1c].
> Before truncation I got the error message [2], gromacs complains that
> the 32 subsystems are not compatible.
> Does anyone have an idea what is going wrong?
>
> Thanks,
> Henri
>
>
>
> ====1a: AFTER TRUNCATION: ERROR MESSAGE
> Reading checkpoint file state1.cpt generated: Thu Jan 27 02:19:50 2011
>    #PME-nodes mismatch,
>      current program: -1
>      checkpoint file: 0
> Reading checkpoint file state2.cpt generated: Thu Jan 27 02:19:50 2011
>    #PME-nodes mismatch,
>      current program: -1
>      checkpoint file: 0
> Gromacs binary or parallel settings not identical to previous run.
> Continuation is exact, but is not guaranteed to be binary identical.
> ...
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.5.3
> Source code file: checkpoint.c, line: 1767
> Fatal error:
> Can't read 1048576 bytes of 'traj1.trr' to compute checksum. The file
> has been replaced or its contents has been modified.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.5.3
> Source code file: checkpoint.c, line: 1767
> Fatal error:
> Can't read 1048576 bytes of 'traj2.trr' to compute checksum. The file
> has been replaced or its contents has been modified.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
> Error on node 1, will try to stop all the nodes
> Halting parallel program mdrun_mpi on CPU 1 out of 32
> gcq#307: "Good Music Saves your Soul" (Lemmy)
> [n030212:18418] MPI_ABORT invoked on rank 1 in communicator
> MPI_COMM_WORLD with errorcode -1

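Ah yes, I remember now. mdrun tries to be smart and checks that all the
output files are still in the state they were in when the checkpoint was
written, by computing checksums when writing the checkpoint and again when
reading it. Truncating the trajectories changed them, so that check fails.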
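
To make the idea concrete, here is a minimal sketch of fingerprinting the tail of an output file. The 1048576-byte chunk size is taken from the error message above; the function name tail_md5 is hypothetical, and hashing the current last 1 MiB is only an illustration of the principle, not mdrun's exact procedure (mdrun works from the file position stored in the checkpoint).

```shell
#!/bin/sh
# Illustration only: fingerprint the last 1 MiB of a file so that a later
# modification (e.g. truncation) can be detected by comparing fingerprints.
CHUNK=1048576   # byte count quoted in the mdrun error message

tail_md5() {
    # Print the MD5 of the final $CHUNK bytes of file $1
    # (of the whole file if it is shorter than $CHUNK).
    tail -c "$CHUNK" "$1" | md5sum | cut -d ' ' -f 1
}

# Typical use (file name from this thread):
#   before=$(tail_md5 traj1.trr)
#   ... truncate traj1.trr ...
#   after=$(tail_md5 traj1.trr)
#   [ "$before" = "$after" ] || echo "traj1.trr changed"
```

Any edit that changes the end of the file, truncation included, changes the fingerprint, which is why the truncated .trr files no longer match what the old checkpoints expect.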

>
>
> ====1b: AFTER TRUNCATION: XTC TRR
> $ gmxcheck -f traj0.xtc
> Checking file traj0.xtc
> Reading frame       0 time    0.000
> # Atoms  224
> Precision 0.001 (nm)
> Reading frame   27000 time 270000.000
> Item        #frames Timestep (ps)
> Step         27752    10
> Time         27752    10
> Lambda           0
> Coords       27752    10
> Velocities       0
> Forces           0
> Box          27752    10
> ...
> $ gmxcheck -f traj31.xtc
> Checking file traj31.xtc
> Reading frame       0 time    0.000
> # Atoms  224
> Precision 0.001 (nm)
> Reading frame   27000 time 270000.000
> Item        #frames Timestep (ps)
> Step         27752    10
> Time         27752    10
> Lambda           0
> Coords       27752    10
> Velocities       0
> Forces           0
> Box          27752    10
>
> $ gmxcheck -f traj0.trr
> Checking file traj0.trr
> trn version: GMX_trn_file (single precision)
> Reading frame       0 time    0.000
> # Atoms  6647
> Reading frame   27000 time 270000.000
> Item        #frames Timestep (ps)
> Step         27752    10
> Time         27752    10
> Lambda       27752    10
> Coords       27752    10
> Velocities   27752    10
> Forces           0
> Box          27752    10
> $ gmxcheck -f traj1.trr
> Checking file traj1.trr
> trn version: GMX_trn_file (single precision)
> Reading frame       0 time    0.000
> # Atoms  6647
> Reading frame   27000 time 270000.000
> Item        #frames Timestep (ps)
> Step         27752    10
> Time         27752    10
> Lambda       27752    10
> Coords       27752    10
> Velocities   27752    10
> Forces           0
> Box          27752    10
> ...
> $ gmxcheck -f traj31.trr
> Checking file traj31.trr
> trn version: GMX_trn_file (single precision)
> Reading frame       0 time    0.000
> # Atoms  6647
> Reading frame   27000 time 270000.000
> Item        #frames Timestep (ps)
> Step         27752    10
> Time         27752    10
> Lambda       27752    10
> Coords       27752    10
> Velocities   27752    10
> Forces           0
> Box          27752    10
>
> $ eneconv -f ener0.edr
> Reading energy frame      0 time    0.000
> Continue writing frames from t=0, step=0
> Last energy frame read 138759 time 277518.000
> Last step written from ener0.edr: t 277518, step 138759000
> Last frame written was at step 138759000, time 277518.000000
> Wrote 138760 frames
> ...
> $ eneconv -f ener31.edr
> Reading energy frame      0 time    0.000
> Continue writing frames from t=0, step=0
> Last energy frame read 138759 time 277518.000
> Last step written from ener31.edr: t 277518, step 138759000
> Last frame written was at step 138759000, time 277518.000000
> Wrote 138760 frames
>
>
>
>
>
> ====1c: AFTER TRUNCATION: CPT
> state0.cpt:
> generation time = Thu Jan 27 02:19:50 2011
> step = 138762700
> t = 277525.400000
> ...
> state31.cpt:
> generation time = Thu Jan 27 02:19:50 2011
> step = 138762700
> t = 277525.400000
>
>
> $ gmxdump -cp state0.cpt|less
> GROMACS version = 4.5.3
> GROMACS build time = Fri Dec  3 03:20:53 CET 2010
> GROMACS build user = user at cluster
> GROMACS build machine = Linux 2.6.18-194.17.4.el5 x86_64
> generating program = /opt/gromacs-4.5.3/bin/mdrun_mpi
> generation time = Thu Jan 27 02:19:50 2011
> checkpoint file version = 12
> generating host = n040407
> #atoms = 6647
> #T-coupling groups = 1
> #Nose-Hoover T-chains = 0
> #Nose-Hoover T-chains for barostat  = 0
> integrator = 0
> simulation part # = 18
> step = 138762700
> t = 277525.400000
> #PP-nodes = 1
> dd_nc[x] = 1
> dd_nc[y] = 1
> dd_nc[z] = 1
> #PME-only nodes = 0
> state flags = 6594
> ekin data flags = 0
> energy history flags = 255
>
>
> ====2: BEFORE TRUNCATION
> $ less md2.log
> Initializing Replica Exchange
> Repl  There are 32 replicas:
> Multi-checking the number of atoms ... OK
> Multi-checking the integrator ... OK
> Multi-checking init_step+nsteps ... OK
> Multi-checking first exchange step: init_step/-replex ...
> first exchange step: init_step/-replex is not equal for all subsystems
>    subsystem 0: 70425
>    subsystem 1: 70437
>    subsystem 2: 70437
>    subsystem 3: 70437
>    subsystem 4: 70437
>    subsystem 5: 70437
>    subsystem 6: 70437
>    subsystem 7: 70437
>    subsystem 8: 70437
>    subsystem 9: 70437
>    subsystem 10: 70437
>    subsystem 11: 70437
>    subsystem 12: 70437
>    subsystem 13: 70437
>    subsystem 14: 70437
>    subsystem 15: 70437
>    subsystem 16: 70425
>    subsystem 17: 70437
>    subsystem 18: 70437
>    subsystem 19: 70437
>    subsystem 20: 70437
>    subsystem 21: 70437
>    subsystem 22: 70437
>    subsystem 23: 70437
>    subsystem 24: 70425
>    subsystem 25: 70437
>    subsystem 26: 70437
>    subsystem 27: 70437
>    subsystem 28: 70437
>    subsystem 29: 70437
>    subsystem 30: 70437
>    subsystem 31: 70437
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.5.3
> Source code file: main.c, line: 189
> Fatal error:
> The 32 subsystems are not compatible
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------

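So the original problem was that the steps recorded in the different
.cpt files differ, which is unsurprising after a crash: the first exchange
step is 70425 for subsystems 0, 16 and 24, and 70437 for all the others.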
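
The numbers in the log can be reproduced with a little integer arithmetic. This is a sketch under two assumptions that are not stated in this thread: that the check is a ceiling division of the checkpoint step by the -replex interval, and that the original run used -replex 2000. Both are inferred from state0.cpt (step = 140849180) giving 70425 in md2.log. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: first exchange step as a ceiling division.
#   $1 = init_step (the step stored in the checkpoint)
#   $2 = exchange interval given to -replex (assumed to be 2000 here)
first_exchange_step() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Step of state0.cpt before truncation, as quoted in this thread.
first_exchange_step 140849180 2000   # prints 70425
```

Replicas whose checkpoints were written at a later step give a larger value, and mdrun refuses to start unless all 32 values agree.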

To fix this, get the .mdp files used to make the original .tpr files and
call grompp the same way as before, adding -t with the matching checkpoint
for each replica (e.g. -t state0.cpt for replica 0). grompp will then take
the (matching) state information from the (matching) checkpoint files, and
because mdrun is not given a checkpoint, no checksum calculation can fail.
Then mdrun_mpi -multi 32 -replex 5000 -s new.tpr should be fine (do not
provide any .cpt to mdrun).
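
The procedure above could be scripted roughly as follows. The script only prints the commands so they can be inspected before anything is run. The input names md$i.mdp, conf$i.gro and topol.top are hypothetical placeholders for whatever was passed to grompp originally; the new$i.tpr names follow the way -multi adds the replica index to file names (as in traj0.xtc and state0.cpt above).

```shell
#!/bin/sh
# Dry run: print one grompp command per replica, then the mdrun command.
# md$i.mdp, conf$i.gro and topol.top are placeholder names (assumptions);
# state$i.cpt are the truncated, matching checkpoints from this thread.
NREP=32   # number of replicas

print_grompp_cmds() {
    # $1 = number of replicas
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "grompp -f md$i.mdp -c conf$i.gro -p topol.top -t state$i.cpt -o new$i.tpr"
        i=$((i + 1))
    done
}

print_grompp_cmds "$NREP"
# No -cpi here: the state now lives in the new .tpr files.
echo "mdrun_mpi -multi $NREP -replex 5000 -s new.tpr"
```

Once the printed commands look right, they can be piped to sh or pasted into the job script.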

By the way, while doing this I would avoid trying to use mdrun -append.
Concatenate the output files afterwards if you need to.
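
Joining the pieces afterwards can be done with trjcat for trajectories and eneconv for energy files. A sketch that prints the commands, assuming the old and new output live in directories called old/ and new/ and the result goes to joined/ (all three directory names are hypothetical):

```shell
#!/bin/sh
# Dry run: print the concatenation commands for each replica.
# old/, new/ and joined/ are placeholder directory names (assumptions).
print_concat_cmds() {
    # $1 = number of replicas
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "trjcat -f old/traj$i.xtc new/traj$i.xtc -o joined/traj$i.xtc"
        echo "trjcat -f old/traj$i.trr new/traj$i.trr -o joined/traj$i.trr"
        echo "eneconv -f old/ener$i.edr new/ener$i.edr -o joined/ener$i.edr"
        i=$((i + 1))
    done
}

print_concat_cmds 32
```

Running gmxcheck on the joined files afterwards is a cheap way to confirm that the frame counts and time steps are what you expect.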

Mark

>
>
> $ gmxdump -cp state0.cpt|less
> GROMACS version = 4.5.3
> GROMACS build time = Fri Dec  3 03:20:53 CET 2010
> GROMACS build user = user at cluster
> GROMACS build machine = Linux 2.6.18-194.17.4.el5 x86_64
> generating program = /opt/gromacs-4.5.3/bin/mdrun_mpi
> generation time = Thu Jan 27 15:08:32 2011
> checkpoint file version = 12
> generating host = n040407
> #atoms = 6647
> #T-coupling groups = 1
> #Nose-Hoover T-chains = 0
> #Nose-Hoover T-chains for barostat  = 0
> integrator = 0
> simulation part # = 19
> step = 140849180
> t = 281698.360000
> #PP-nodes = 1
> dd_nc[x] = 1
> dd_nc[y] = 1
> dd_nc[z] = 1
> #PME-only nodes = 0
> state flags = 6594
> ekin data flags = 0
> ...
>
>
> $ gmxcheck -f traj0.xtc
> Reading frame       0 time    0.000
> # Atoms  224
> Precision 0.001 (nm)
> Reading frame   28000 time 280000.000
> Item        #frames Timestep (ps)
> Step         28170    10
> Time         28170    10
> Lambda           0
> Coords       28170    10
> Velocities       0
> Forces           0
> Box          28170    10
>
>
> $ gmxcheck -f traj1.xtc
> Reading frame       0 time    0.000
> # Atoms  224
> Precision 0.001 (nm)
> Reading frame   28000 time 280000.000
> Item        #frames Timestep (ps)
> Step         28175    10
> Time         28175    10
> Lambda           0
> Coords       28175    10
> Velocities       0
> Forces           0
> Box          28175    10



