[gmx-developers] Bug with continuation from checkpoint with gromacs 4.0.7

Alexey Shvetsov alexxyum at gmail.com
Tue Mar 16 00:21:23 CET 2010


On Tuesday 16 March 2010 01:45:42 Roland Schulz wrote:
> Hi,
> 
> what is nsteps in the mdp file?
> 
> Did the simulation really do 3377000 steps up to this point?
> 
> Roland

nsteps is 50000000 (10 ns with a 2 fs timestep)
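
For reference, the step counts in this thread can be converted to simulated
time with a few lines of Python. This is only a sketch: the helper names are
made up for illustration and are not part of GROMACS, and it assumes the 2 fs
timestep stated above. It works in integer femtoseconds to avoid
floating-point rounding.

```python
# Illustrative helpers (hypothetical names, not GROMACS API).
DT_FS = 2  # timestep in femtoseconds, as stated in this message


def steps_to_ps(nsteps, dt_fs=DT_FS):
    """Simulated time in picoseconds covered by nsteps integration steps."""
    return nsteps * dt_fs / 1000


def steps_remaining(nsteps, checkpoint_step):
    """Steps left to run when restarting at checkpoint_step (never negative)."""
    return max(nsteps - checkpoint_step, 0)


if __name__ == "__main__":
    checkpoint_step = 3377000  # "continuing from step 3377000" in the mdrun output
    candidates = [
        ("nsteps quoted in this message", 50000000),
        ("steps reported by mdrun", 500000),
    ]
    print("checkpoint time (ps):", steps_to_ps(checkpoint_step))
    for label, nsteps in candidates:
        print(label, "->", steps_to_ps(nsteps), "ps total,",
              steps_remaining(nsteps, checkpoint_step), "steps remaining")
```

With these numbers, 50000000 steps at 2 fs come to 100 ns, while the 500000
steps that mdrun reports (quoted below) come to 1000 ps, which is less than
the 6754 ps already reached at the checkpoint. Whether that mismatch is what
makes mdrun finish immediately is a guess, not something confirmed in this
thread.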

> 
> 2010/3/15 Alexey Shvetsov <alexxyum at gmail.com>
> 
> > Hi,
> > 
> > It crashed with this error.
> > In md.log I can see:
> > 
> > -----------------------------------------------------------
> > Restarting from checkpoint, appending to previous log file.
> > 
> > Log file opened on Sat Mar 13 01:48:10 2010
> > Host: n1  pid: 5575  nodeid: 0  nnodes:  128
> > The Gromacs distribution was built Sun Feb 28 02:57:38 MSK 2010 by
> > root at n1 (Linux 2.6.31-gentoo-r6 x86_64)
> > 
> > 
> > 
> > Initializing Domain Decomposition on 128 nodes
> > Dynamic load balancing: auto
> > Will sort the charge groups at every domain (re)decomposition
> > 
> > Initial maximum inter charge-group distances:
> >     two-body bonded interactions: 0.430 nm, LJ-14, atoms 6915 6922
> >   multi-body bonded interactions: 0.430 nm, Ryckaert-Bell., atoms 6915 6922
> > 
> > Minimum cell size due to bonded interactions: 0.473 nm
> > Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
> > Estimated maximum distance required for P-LINCS: 0.819 nm
> > This distance will limit the DD cell size, you can override this with -rcon
> > Domain decomposition grid 16 x 7 x 1, separate PME nodes 16
> > Interleaving PP and PME nodes
> > This is a particle-particle only node
> > 
> > Domain decomposition nodeid 0, coordinates 0 0 0
> > 
> > Using two step summing over 16 groups of on average 7.0 processes
> > 
> > Table routines are used for coulomb: TRUE
> > Table routines are used for vdw:     TRUE
> > Will do PME sum in reciprocal space.
> > 
> > ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> > U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
> > A smooth particle mesh Ewald method
> > J. Chem. Phys. 103 (1995) pp. 8577-8592
> > -------- -------- --- Thank You --- -------- --------
> > 
> > Using a Gaussian width (1/beta) of 0.480244 nm for Ewald
> > Using shifted Lennard-Jones, switch between 1.2 and 1.5 nm
> > Cut-off's:   NS: 1.7   Coulomb: 1.5   LJ: 1.5
> > System total charge: 0.000
> > Generated table with 5400 data points for Ewald-Switch.
> > Tabscale = 2000 points/nm
> > Generated table with 5400 data points for LJ6Switch.
> > Tabscale = 2000 points/nm
> > Generated table with 5400 data points for LJ12Switch.
> > Tabscale = 2000 points/nm
> > Generated table with 5400 data points for 1-4 COUL.
> > Tabscale = 2000 points/nm
> > Generated table with 5400 data points for 1-4 LJ6.
> > Tabscale = 2000 points/nm
> > Generated table with 5400 data points for 1-4 LJ12.
> > Tabscale = 2000 points/nm
> > 
> > Enabling SPC water optimization for 115416 molecules.
> > 
> > Configuring nonbonded kernels...
> > Testing x86_64 SSE2 support... present.
> > 
> > 
> > 
> > Initializing Parallel LINear Constraint Solver
> > 
> > ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> > B. Hess
> > P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
> > J. Chem. Theory Comput. 4 (2008) pp. 116-122
> > -------- -------- --- Thank You --- -------- --------
> > 
> > The number of constraints is 43710
> > There are inter charge-group constraints,
> > will communicate selected coordinates each lincs iteration
> > 
> > ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> > S. Miyamoto and P. A. Kollman
> > SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid Water Models
> > J. Comp. Chem. 13 (1992) pp. 952-962
> > -------- -------- --- Thank You --- -------- --------
> > 
> > 
> > Linking all bonded interactions to atoms
> > There are 237282 inter charge-group exclusions,
> > will use an extra communication step for exclusion forces for PME-Switch
> > 
> > The initial number of communication pulses is: X 2 Y 1
> > The initial domain decomposition cell size is: X 1.18 nm Y 2.54 nm
> > 
> > The maximum allowed distance for charge groups involved in interactions is:
> >                 non-bonded interactions           1.700 nm
> > (the following are initial values, they could change due to box deformation)
> >            two-body bonded interactions  (-rdd)   1.700 nm
> >          multi-body bonded interactions  (-rdd)   1.178 nm
> >  atoms separated by up to 5 constraints  (-rcon)  1.178 nm
> > 
> > When dynamic load balancing gets turned on, these settings will change to:
> > The maximum number of communication pulses is: X 2 Y 2
> > The minimum size for domain decomposition cells is 0.850 nm
> > The requested allowed shrink of DD cells (option -dds) is: 0.80
> > The allowed shrink of domain decomposition cells is: X 0.72 Y 0.33
> > 
> > The maximum allowed distance for charge groups involved in interactions is:
> >                 non-bonded interactions           1.700 nm
> >            two-body bonded interactions  (-rdd)   1.700 nm
> >          multi-body bonded interactions  (-rdd)   0.850 nm
> >  atoms separated by up to 5 constraints  (-rcon)  0.850 nm
> > 
> > Making 2D domain decomposition grid 16 x 7 x 1, home cell index 0 0 0
> > 
> > Center of mass motion removal mode is Linear
> > 
> > We have the following groups for center of mass motion removal:
> >  0:  rest
> > 
> > There are: 390476 Atoms
> > Charge group distribution at step 3377000: 1125 1142 1147 1166 1139 1158
> > 1135
> > 1146 1216 1139 1298 1162 1139 1138 1182 1279 1151 1525 1334 1173 1162
> > 1364 1509 1368 1884 1513 1149 1151 1286 1358 1714 2023 1752 1191 1150
> > 1167 1480 2099 2239 2327 1411 1170 1132 1668 2235 1647 2174 1812 1195
> > 1258 1542 1919 1307 1861 1793 1141 1415 1590 1810 1435 2106 1755 1149
> > 1564 1711 2424 2156 2122 1457 1158 1278 1291 1862 2071 1775 1564 1118
> > 1144 1175 1938 2062 1858 1637 1136 1141 1326 1685 1438 1348 1277 1144
> > 1162 1159 1361 1142 1226 1184 1153 1142 1144 1195 1154 1144 1151 1124
> > 1149 1148 1171 1140 1127 1145 1158
> > Grid: 5 x 7 x 17 cells
> > Initial temperature: 309.173 K
> > 
> > Started mdrun on node 0 Sat Mar 13 01:48:12 2010
> > 
> >        <======  ###############  ==>
> >        <====  A V E R A G E S  ====>
> >        <==  ###############  ======>
> >   
> >   Energies (kJ/mol)
> >   
> >           Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
> >     3.11012e+11    1.88259e+10    3.84061e+11    1.60351e+11    1.48403e+12
> >         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
> >     2.00208e+12   -3.71589e+10   -1.93169e+13   -2.65842e+12   -1.76521e+13
> >     Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
> >     3.40082e+12   -1.42513e+13    1.04681e+09    3.09179e+06    0.00000e+00
> >           Box-X          Box-Y          Box-Z         Volume   Density (SI)
> >     6.36496e+07    6.00618e+07    3.96178e+07    1.32807e+10    3.44326e+09
> >              pV
> >     7.21862e+08
> >   
> >   Total Virial (kJ/mol)
> >   
> >    1.13360e+12    1.84057e+08    1.42298e+08
> >    1.84057e+08    1.13321e+12    1.72446e+08
> >    1.42298e+08    1.72446e+08    1.13293e+12
> >   
> >   Pressure (bar)
> >   
> >    1.96772e+06    3.52902e+05   -4.76710e+05
> >    3.52902e+05    3.34068e+06   -5.56943e+05
> >   -4.76710e+05   -5.56943e+05    3.96697e+06
> >   
> >   Total Dipole (Debye)
> >   
> >    2.04296e+09    7.00647e+08    1.82307e+09
> >  
> >           Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14
> >         Protein-Protein   -1.05061e+12   -2.88200e+11    1.48403e+12    1.60351e+11
> >     Protein-Non-Protein   -9.55694e+11   -7.40905e+10    0.00000e+00    0.00000e+00
> > Non-Protein-Non-Protein   -1.73106e+13    2.36437e+12    0.00000e+00    0.00000e+00
> > 
> >      T-Protein  T-Non-Protein
> >    
> >    1.04653e+09    1.04684e+09
> >    
> >        <======  ###############################  ==>
> >        <====  R M S - F L U C T U A T I O N S  ====>
> >        <==  ###############################  ======>
> >   
> >   Energies (kJ/mol)
> >   
> >           Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
> >     9.01957e+05    1.88394e+05    8.36483e+05    3.32678e+05    1.56961e+06
> >         LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
> >     4.49504e+06    1.61531e+04    7.36949e+06    4.74527e+05    5.31921e+06
> >     Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
> >     2.96016e+06    6.07351e+06    9.11168e+02    8.50221e+04    0.00000e+00
> >           Box-X          Box-Y          Box-Z         Volume   Density (SI)
> >     9.22282e+00    8.70294e+00    5.74061e+00    5.77310e+03    1.49680e+03
> >              pV
> >     2.01360e+07
> >   
> >   Total Virial (kJ/mol)
> >   
> >    1.43640e+07    8.71197e+06    8.72558e+06
> >    8.71197e+06    1.43505e+07    8.72679e+06
> >    8.72558e+06    8.72679e+06    1.39238e+07
> >   
> >   Pressure (bar)
> >   
> >    1.22151e+05    7.44143e+04    7.44974e+04
> >    7.44143e+04    1.22011e+05    7.44967e+04
> >    7.44974e+04    7.44967e+04    1.18187e+05
> >   
> >   Total Dipole (Debye)
> >   
> >    6.14300e+06    5.87464e+06    5.12540e+06
> >  
> >           Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14
> >         Protein-Protein    7.16780e+06    1.41743e+06    1.56961e+06    3.32678e+05
> >     Protein-Non-Protein    1.08780e+07    9.73569e+05    0.00000e+00    0.00000e+00
> > Non-Protein-Non-Protein    9.00044e+06    4.32503e+06    0.00000e+00    0.00000e+00
> > 
> >      T-Protein  T-Non-Protein
> >    
> >    2.66136e+03    9.69716e+02
> >    
> >        M E G A - F L O P S   A C C O U N T I N G
> >   
> >   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
> >   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
> >   NF=No Forces
> >  
> >  Computing:                         M-Number         M-Flops  % Flops
> > 
> > -----------------------------------------------------------------------
> > 
> >  CG-CoM                             0.390476           1.171   100.0
> > 
> > -----------------------------------------------------------------------
> > 
> >  Total                                                 1.171   100.0
> > 
> > -----------------------------------------------------------------------
> > 
> >    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
> >  
> >  av. #atoms communicated per step for force:  2 x 1120267.0
> >  av. #atoms communicated per step for LINCS:  2 x 34671.0
> >  
> >     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> >  
> >  Computing:         Nodes     Number     G-Cycles    Seconds     %
> > 
> > -----------------------------------------------------------------------
> > 
> >  Rest                 112               34860.808        0.0   100.0
> > 
> > -----------------------------------------------------------------------
> > 
> >  Total                128               34860.808        0.0   100.0
> > 
> > -----------------------------------------------------------------------
> > 
> > nodetime = 0! Infinite Giga flopses!
> > 
> >        Parallel run - timing based on wallclock.
> > 
> > Finished mdrun on node 0 Sat Mar 13 01:48:12 2010
> > 
> > On Tuesday 16 March 2010 00:58:50 Roland Schulz wrote:
> > > Alexey,
> > > 
> > > you're not giving enough information.
> > > 
> > > What exactly is the error? What happens? Does it hang or does it crash with
> > > an error?
> > > 
> > > Roland
> > > 
> > > On Fri, Mar 12, 2010 at 7:01 PM, Alexey Shvetsov <alexxyum at gmail.com> wrote:
> > > > Hi all
> > > > It seems there is a bug with continuation from a checkpoint in
> > > > GROMACS 4.0.7.
> > > > Steps to reproduce:
> > > > 1. Submit a parallel job to PBS.
> > > > 2. Kill the job.
> > > > 3. Try to resume from the checkpoint.
> > > > 
> > > > Relevant output from mdrun:
> > > > Reading checkpoint file md.cpt generated: Thu Mar 11 12:20:46 2010
> > > > 
> > > > Loaded with Money
> > > > 
> > > > Making 2D domain decomposition 16 x 7 x 1
> > > > 
> > > > WARNING: This run will generate roughly 20607979313638129664 Mb of data
> > > > 
> > > > starting mdrun 'Protein in water'
> > > > 500000 steps,   1000.0 ps (continuing from step 3377000,   6754.0 ps).
> > > > 
> > > > nodetime = 0! Infinite Giga flopses!
> > > > 
> > > >        Parallel run - timing based on wallclock.
> > > > 
> > > > --
> > > > Best Regards,
> > > > Alexey 'Alexxy' Shvetsov
> > > > Petersburg Nuclear Physics Institute, Russia
> > > > Department of Molecular and Radiation Biophysics
> > > > Gentoo Team Ru
> > > > Gentoo Linux Dev
> > > > mailto:alexxyum at gmail.com
> > > > mailto:alexxy at gentoo.org
> > > > mailto:alexxy at omrb.pnpi.spb.ru
> > > > 
> > > > --
> > > > gmx-developers mailing list
> > > > gmx-developers at gromacs.org
> > > > http://lists.gromacs.org/mailman/listinfo/gmx-developers
> > > > Please don't post (un)subscribe requests to the list. Use the
> > > > www interface or send it to gmx-developers-request at gromacs.org.
> > 

-- 
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru