[gmx-developers] Bug with continuation from checkpoint with gromacs 4.0.7

Roland Schulz roland at utk.edu
Mon Mar 15 23:45:42 CET 2010


What is nsteps in the mdp file?

Did the simulation really run 3377000 steps up to this point?
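
To make the question concrete, here is a minimal sketch of the arithmetic behind it, using only the numbers from the mdrun output quoted below (500000 steps, 1000.0 ps, continuing from step 3377000). It assumes nsteps is counted from step 0 rather than from the checkpoint, which is an assumption about this run and not something the output confirms. The function names are hypothetical and not part of GROMACS.

```python
# Numbers taken from the quoted mdrun output.
NSTEPS = 500000            # "500000 steps"
TOTAL_PS = 1000.0          # "1000.0 ps"
CHECKPOINT_STEP = 3377000  # "continuing from step 3377000"


def time_at_step(step, nsteps, total_ps):
    """Simulation time in ps at `step`, given that `nsteps` steps span `total_ps`."""
    return step * total_ps / nsteps


def remaining_steps(nsteps, checkpoint_step):
    """Steps left to run IF nsteps counts from step 0 (assumption, see text)."""
    return nsteps - checkpoint_step


if __name__ == "__main__":
    print(time_at_step(CHECKPOINT_STEP, NSTEPS, TOTAL_PS))
    print(remaining_steps(NSTEPS, CHECKPOINT_STEP))
```

The first value is 6754.0 ps, which matches the time mdrun reports for the checkpoint. The second is -2877000: under the assumption above there would be no steps left to run, which would be consistent with mdrun finishing immediately, but it needs checking against the actual nsteps.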


2010/3/15 Alexey Shvetsov <alexxyum at gmail.com>

> Hi,
> It crashed with this error.
> In md.log I can see:
> -----------------------------------------------------------
> Restarting from checkpoint, appending to previous log file.
> Log file opened on Sat Mar 13 01:48:10 2010
> Host: n1  pid: 5575  nodeid: 0  nnodes:  128
> The Gromacs distribution was built Sun Feb 28 02:57:38 MSK 2010 by
> root at n1 (Linux 2.6.31-gentoo-r6 x86_64)
> Initializing Domain Decomposition on 128 nodes
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>    two-body bonded interactions: 0.430 nm, LJ-14, atoms 6915 6922
>  multi-body bonded interactions: 0.430 nm, Ryckaert-Bell., atoms 6915 6922
> Minimum cell size due to bonded interactions: 0.473 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
> Estimated maximum distance required for P-LINCS: 0.819 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Domain decomposition grid 16 x 7 x 1, separate PME nodes 16
> Interleaving PP and PME nodes
> This is a particle-particle only node
> Domain decomposition nodeid 0, coordinates 0 0 0
> Using two step summing over 16 groups of on average 7.0 processes
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw:     TRUE
> Will do PME sum in reciprocal space.
> U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
> Using a Gaussian width (1/beta) of 0.480244 nm for Ewald
> Using shifted Lennard-Jones, switch between 1.2 and 1.5 nm
> Cut-off's:   NS: 1.7   Coulomb: 1.5   LJ: 1.5
> System total charge: 0.000
> Generated table with 5400 data points for Ewald-Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for LJ6Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for LJ12Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 COUL.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 LJ6.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 LJ12.
> Tabscale = 2000 points/nm
> Enabling SPC water optimization for 115416 molecules.
> Configuring nonbonded kernels...
> Testing x86_64 SSE2 support... present.
> Initializing Parallel LINear Constraint Solver
> B. Hess
> P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
> J. Chem. Theory Comput. 4 (2008) pp. 116-122
> -------- -------- --- Thank You --- -------- --------
> The number of constraints is 43710
> There are inter charge-group constraints,
> will communicate selected coordinates each lincs iteration
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
> Linking all bonded interactions to atoms
> There are 237282 inter charge-group exclusions,
> will use an extra communication step for exclusion forces for PME-Switch
> The initial number of communication pulses is: X 2 Y 1
> The initial domain decomposition cell size is: X 1.18 nm Y 2.54 nm
> The maximum allowed distance for charge groups involved in interactions is:
>                 non-bonded interactions           1.700 nm
> (the following are initial values, they could change due to box deformation)
>            two-body bonded interactions  (-rdd)   1.700 nm
>          multi-body bonded interactions  (-rdd)   1.178 nm
>  atoms separated by up to 5 constraints  (-rcon)  1.178 nm
> When dynamic load balancing gets turned on, these settings will change to:
> The maximum number of communication pulses is: X 2 Y 2
> The minimum size for domain decomposition cells is 0.850 nm
> The requested allowed shrink of DD cells (option -dds) is: 0.80
> The allowed shrink of domain decomposition cells is: X 0.72 Y 0.33
> The maximum allowed distance for charge groups involved in interactions is:
>                 non-bonded interactions           1.700 nm
>            two-body bonded interactions  (-rdd)   1.700 nm
>          multi-body bonded interactions  (-rdd)   0.850 nm
>  atoms separated by up to 5 constraints  (-rcon)  0.850 nm
> Making 2D domain decomposition grid 16 x 7 x 1, home cell index 0 0 0
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>  0:  rest
> There are: 390476 Atoms
> Charge group distribution at step 3377000: 1125 1142 1147 1166 1139 1158
> 1135
> 1146 1216 1139 1298 1162 1139 1138 1182 1279 1151 1525 1334 1173 1162 1364
> 1509 1368 1884 1513 1149 1151 1286 1358 1714 2023 1752 1191 1150 1167 1480
> 2099 2239 2327 1411 1170 1132 1668 2235 1647 2174 1812 1195 1258 1542 1919
> 1307 1861 1793 1141 1415 1590 1810 1435 2106 1755 1149 1564 1711 2424 2156
> 2122 1457 1158 1278 1291 1862 2071 1775 1564 1118 1144 1175 1938 2062 1858
> 1637 1136 1141 1326 1685 1438 1348 1277 1144 1162 1159 1361 1142 1226 1184
> 1153 1142 1144 1195 1154 1144 1151 1124 1149 1148 1171 1140 1127 1145 1158
> Grid: 5 x 7 x 17 cells
> Initial temperature: 309.173 K
> Started mdrun on node 0 Sat Mar 13 01:48:12 2010
>        <======  ###############  ==>
>        <====  A V E R A G E S  ====>
>        <==  ###############  ======>
>   Energies (kJ/mol)
>          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
>    3.11012e+11    1.88259e+10    3.84061e+11    1.60351e+11    1.48403e+12
>        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>    2.00208e+12   -3.71589e+10   -1.93169e+13   -2.65842e+12   -1.76521e+13
>    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
>    3.40082e+12   -1.42513e+13    1.04681e+09    3.09179e+06    0.00000e+00
>          Box-X          Box-Y          Box-Z         Volume   Density (SI)
>    6.36496e+07    6.00618e+07    3.96178e+07    1.32807e+10    3.44326e+09
>             pV
>    7.21862e+08
>   Total Virial (kJ/mol)
>    1.13360e+12    1.84057e+08    1.42298e+08
>    1.84057e+08    1.13321e+12    1.72446e+08
>    1.42298e+08    1.72446e+08    1.13293e+12
>   Pressure (bar)
>    1.96772e+06    3.52902e+05   -4.76710e+05
>    3.52902e+05    3.34068e+06   -5.56943e+05
>   -4.76710e+05   -5.56943e+05    3.96697e+06
>   Total Dipole (Debye)
>    2.04296e+09    7.00647e+08    1.82307e+09
>  Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14
> Protein-Protein   -1.05061e+12   -2.88200e+11    1.48403e+12    1.60351e+11
> Protein-Non-Protein   -9.55694e+11   -7.40905e+10    0.00000e+00    0.00000e+00
> Non-Protein-Non-Protein   -1.73106e+13    2.36437e+12    0.00000e+00    0.00000e+00
>      T-Protein  T-Non-Protein
>    1.04653e+09    1.04684e+09
>        <======  ###############################  ==>
>        <====  R M S - F L U C T U A T I O N S  ====>
>        <==  ###############################  ======>
>   Energies (kJ/mol)
>          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
>    9.01957e+05    1.88394e+05    8.36483e+05    3.32678e+05    1.56961e+06
>        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
>    4.49504e+06    1.61531e+04    7.36949e+06    4.74527e+05    5.31921e+06
>    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
>    2.96016e+06    6.07351e+06    9.11168e+02    8.50221e+04    0.00000e+00
>          Box-X          Box-Y          Box-Z         Volume   Density (SI)
>    9.22282e+00    8.70294e+00    5.74061e+00    5.77310e+03    1.49680e+03
>             pV
>    2.01360e+07
>   Total Virial (kJ/mol)
>    1.43640e+07    8.71197e+06    8.72558e+06
>    8.71197e+06    1.43505e+07    8.72679e+06
>    8.72558e+06    8.72679e+06    1.39238e+07
>   Pressure (bar)
>    1.22151e+05    7.44143e+04    7.44974e+04
>    7.44143e+04    1.22011e+05    7.44967e+04
>    7.44974e+04    7.44967e+04    1.18187e+05
>   Total Dipole (Debye)
>    6.14300e+06    5.87464e+06    5.12540e+06
>  Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14
> Protein-Protein    7.16780e+06    1.41743e+06    1.56961e+06    3.32678e+05
> Protein-Non-Protein    1.08780e+07    9.73569e+05    0.00000e+00    0.00000e+00
> Non-Protein-Non-Protein    9.00044e+06    4.32503e+06    0.00000e+00    0.00000e+00
>      T-Protein  T-Non-Protein
>    2.66136e+03    9.69716e+02
>        M E G A - F L O P S   A C C O U N T I N G
>   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>   NF=No Forces
>  Computing:                         M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------
>  CG-CoM                             0.390476           1.171   100.0
> -----------------------------------------------------------------------
>  Total                                                 1.171   100.0
> -----------------------------------------------------------------------
>    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>  av. #atoms communicated per step for force:  2 x 1120267.0
>  av. #atoms communicated per step for LINCS:  2 x 34671.0
>     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>  Computing:         Nodes     Number     G-Cycles    Seconds     %
> -----------------------------------------------------------------------
>  Rest                 112               34860.808        0.0   100.0
> -----------------------------------------------------------------------
>  Total                128               34860.808        0.0   100.0
> -----------------------------------------------------------------------
> nodetime = 0! Infinite Giga flopses!
>        Parallel run - timing based on wallclock.
> Finished mdrun on node 0 Sat Mar 13 01:48:12 2010
> On Tuesday 16 March 2010 00:58:50 Roland Schulz wrote:
> > Alexey,
> >
> > you're not giving enough information.
> >
> > What exactly is the error? What happens? Does it hang or does it crash
> > with an error?
> >
> > Roland
> >
> > On Fri, Mar 12, 2010 at 7:01 PM, Alexey Shvetsov <alexxyum at gmail.com> wrote:
> > > Hi all,
> > > It seems there is a bug with continuation from checkpoint in GROMACS
> > > 4.0.7. Steps to reproduce:
> > > 1. Submit a parallel job to PBS
> > > 2. Kill the job
> > > 3. Try to resume from the checkpoint
> > >
> > > Relevant output from mdrun:
> > > Reading checkpoint file md.cpt generated: Thu Mar 11 12:20:46 2010
> > >
> > > Loaded with Money
> > >
> > > Making 2D domain decomposition 16 x 7 x 1
> > >
> > > WARNING: This run will generate roughly 20607979313638129664 Mb of data
> > >
> > > starting mdrun 'Protein in water'
> > > 500000 steps,   1000.0 ps (continuing from step 3377000,   6754.0 ps).
> > >
> > > nodetime = 0! Infinite Giga flopses!
> > >
> > >        Parallel run - timing based on wallclock.
> > >
