[gmx-developers] Bug with continuation from checkpoint with gromacs 4.0.7
Roland Schulz
roland at utk.edu
Mon Mar 15 23:45:42 CET 2010
Hi,
what is nsteps in the mdp file?
Did the simulation up to this point really do 3377000 steps?
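As a quick sanity check on the step count, the numbers printed by mdrun in the quoted output can be compared with each other. This is a minimal sketch assuming t = step × dt with tinit = 0 (an assumption, not confirmed by the thread). It only shows that step 3377000 and 6754.0 ps are mutually consistent with dt = 0.002 ps. It cannot tell whether the earlier runs really performed that many steps.

```python
# Values copied from the mdrun output quoted in this thread.
run_steps = 500_000        # "500000 steps"
run_time_ps = 1000.0       # "1000.0 ps"
start_step = 3_377_000     # "continuing from step 3377000"
start_time_ps = 6754.0     # "6754.0 ps"

# Implied integration time step (assumes the run length is nsteps * dt).
dt_ps = run_time_ps / run_steps

# Time the checkpoint step should correspond to, assuming tinit = 0.
implied_start_ps = start_step * dt_ps

consistent = abs(implied_start_ps - start_time_ps) < 1e-6

print(f"dt = {dt_ps} ps")
print(f"step {start_step} -> {implied_start_ps:.1f} ps (log says {start_time_ps} ps)")
print("consistent" if consistent else "inconsistent")
```

If the printed times disagreed, that would point at the checkpoint step counter rather than at the mdp settings.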
Roland
2010/3/15 Alexey Shvetsov <alexxyum at gmail.com>
> Hi,
>
> It crashed with this error.
> In md.log I can see:
>
> -----------------------------------------------------------
> Restarting from checkpoint, appending to previous log file.
>
> Log file opened on Sat Mar 13 01:48:10 2010
> Host: n1 pid: 5575 nodeid: 0 nnodes: 128
> The Gromacs distribution was built Sun Feb 28 02:57:38 MSK 2010 by
> root at n1 (Linux 2.6.31-gentoo-r6 x86_64)
>
>
>
> Initializing Domain Decomposition on 128 nodes
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
> two-body bonded interactions: 0.430 nm, LJ-14, atoms 6915 6922
> multi-body bonded interactions: 0.430 nm, Ryckaert-Bell., atoms 6915 6922
> Minimum cell size due to bonded interactions: 0.473 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
> Estimated maximum distance required for P-LINCS: 0.819 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Domain decomposition grid 16 x 7 x 1, separate PME nodes 16
> Interleaving PP and PME nodes
> This is a particle-particle only node
>
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using two step summing over 16 groups of on average 7.0 processes
>
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw: TRUE
> Will do PME sum in reciprocal space.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Using a Gaussian width (1/beta) of 0.480244 nm for Ewald
> Using shifted Lennard-Jones, switch between 1.2 and 1.5 nm
> Cut-off's: NS: 1.7 Coulomb: 1.5 LJ: 1.5
> System total charge: 0.000
> Generated table with 5400 data points for Ewald-Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for LJ6Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for LJ12Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 COUL.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 LJ6.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 LJ12.
> Tabscale = 2000 points/nm
>
> Enabling SPC water optimization for 115416 molecules.
>
> Configuring nonbonded kernels...
> Testing x86_64 SSE2 support... present.
>
>
>
> Initializing Parallel LINear Constraint Solver
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> B. Hess
> P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
> J. Chem. Theory Comput. 4 (2008) pp. 116-122
> -------- -------- --- Thank You --- -------- --------
>
> The number of constraints is 43710
> There are inter charge-group constraints,
> will communicate selected coordinates each lincs iteration
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
>
>
> Linking all bonded interactions to atoms
> There are 237282 inter charge-group exclusions,
> will use an extra communication step for exclusion forces for PME-Switch
>
> The initial number of communication pulses is: X 2 Y 1
> The initial domain decomposition cell size is: X 1.18 nm Y 2.54 nm
>
> The maximum allowed distance for charge groups involved in interactions is:
> non-bonded interactions 1.700 nm
> (the following are initial values, they could change due to box deformation)
> two-body bonded interactions (-rdd) 1.700 nm
> multi-body bonded interactions (-rdd) 1.178 nm
> atoms separated by up to 5 constraints (-rcon) 1.178 nm
>
> When dynamic load balancing gets turned on, these settings will change to:
> The maximum number of communication pulses is: X 2 Y 2
> The minimum size for domain decomposition cells is 0.850 nm
> The requested allowed shrink of DD cells (option -dds) is: 0.80
> The allowed shrink of domain decomposition cells is: X 0.72 Y 0.33
> The maximum allowed distance for charge groups involved in interactions is:
> non-bonded interactions 1.700 nm
> two-body bonded interactions (-rdd) 1.700 nm
> multi-body bonded interactions (-rdd) 0.850 nm
> atoms separated by up to 5 constraints (-rcon) 0.850 nm
>
>
> Making 2D domain decomposition grid 16 x 7 x 1, home cell index 0 0 0
>
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: rest
> There are: 390476 Atoms
> Charge group distribution at step 3377000: 1125 1142 1147 1166 1139 1158 1135
> 1146 1216 1139 1298 1162 1139 1138 1182 1279 1151 1525 1334 1173 1162 1364
> 1509 1368 1884 1513 1149 1151 1286 1358 1714 2023 1752 1191 1150 1167 1480
> 2099 2239 2327 1411 1170 1132 1668 2235 1647 2174 1812 1195 1258 1542 1919
> 1307 1861 1793 1141 1415 1590 1810 1435 2106 1755 1149 1564 1711 2424 2156
> 2122 1457 1158 1278 1291 1862 2071 1775 1564 1118 1144 1175 1938 2062 1858
> 1637 1136 1141 1326 1685 1438 1348 1277 1144 1162 1159 1361 1142 1226 1184
> 1153 1142 1144 1195 1154 1144 1151 1124 1149 1148 1171 1140 1127 1145 1158
> Grid: 5 x 7 x 17 cells
> Initial temperature: 309.173 K
>
> Started mdrun on node 0 Sat Mar 13 01:48:12 2010
>
> <====== ############### ==>
> <==== A V E R A G E S ====>
> <== ############### ======>
>
> Energies (kJ/mol)
> Angle Proper Dih. Ryckaert-Bell. LJ-14 Coulomb-14
> 3.11012e+11 1.88259e+10 3.84061e+11 1.60351e+11 1.48403e+12
> LJ (SR) Disper. corr. Coulomb (SR) Coul. recip. Potential
> 2.00208e+12 -3.71589e+10 -1.93169e+13 -2.65842e+12 -1.76521e+13
> Kinetic En. Total Energy Temperature Pressure (bar) Cons. rmsd ()
> 3.40082e+12 -1.42513e+13 1.04681e+09 3.09179e+06 0.00000e+00
>
> Box-X Box-Y Box-Z Volume Density (SI)
> 6.36496e+07 6.00618e+07 3.96178e+07 1.32807e+10 3.44326e+09
> pV
> 7.21862e+08
>
> Total Virial (kJ/mol)
> 1.13360e+12 1.84057e+08 1.42298e+08
> 1.84057e+08 1.13321e+12 1.72446e+08
> 1.42298e+08 1.72446e+08 1.13293e+12
>
> Pressure (bar)
> 1.96772e+06 3.52902e+05 -4.76710e+05
> 3.52902e+05 3.34068e+06 -5.56943e+05
> -4.76710e+05 -5.56943e+05 3.96697e+06
>
> Total Dipole (Debye)
> 2.04296e+09 7.00647e+08 1.82307e+09
>
> Epot (kJ/mol) Coul-SR LJ-SR Coul-14 LJ-14
> Protein-Protein -1.05061e+12 -2.88200e+11 1.48403e+12 1.60351e+11
> Protein-Non-Protein -9.55694e+11 -7.40905e+10 0.00000e+00 0.00000e+00
> Non-Protein-Non-Protein -1.73106e+13 2.36437e+12 0.00000e+00 0.00000e+00
>
> T-Protein T-Non-Protein
> 1.04653e+09 1.04684e+09
>
> <====== ############################### ==>
> <==== R M S - F L U C T U A T I O N S ====>
> <== ############################### ======>
>
> Energies (kJ/mol)
> Angle Proper Dih. Ryckaert-Bell. LJ-14 Coulomb-14
> 9.01957e+05 1.88394e+05 8.36483e+05 3.32678e+05 1.56961e+06
> LJ (SR) Disper. corr. Coulomb (SR) Coul. recip. Potential
> 4.49504e+06 1.61531e+04 7.36949e+06 4.74527e+05 5.31921e+06
> Kinetic En. Total Energy Temperature Pressure (bar) Cons. rmsd ()
> 2.96016e+06 6.07351e+06 9.11168e+02 8.50221e+04 0.00000e+00
>
> Box-X Box-Y Box-Z Volume Density (SI)
> 9.22282e+00 8.70294e+00 5.74061e+00 5.77310e+03 1.49680e+03
> pV
> 2.01360e+07
>
> Total Virial (kJ/mol)
> 1.43640e+07 8.71197e+06 8.72558e+06
> 8.71197e+06 1.43505e+07 8.72679e+06
> 8.72558e+06 8.72679e+06 1.39238e+07
>
> Pressure (bar)
> 1.22151e+05 7.44143e+04 7.44974e+04
> 7.44143e+04 1.22011e+05 7.44967e+04
> 7.44974e+04 7.44967e+04 1.18187e+05
>
> Total Dipole (Debye)
> 6.14300e+06 5.87464e+06 5.12540e+06
>
> Epot (kJ/mol) Coul-SR LJ-SR Coul-14 LJ-14
> Protein-Protein 7.16780e+06 1.41743e+06 1.56961e+06 3.32678e+05
> Protein-Non-Protein 1.08780e+07 9.73569e+05 0.00000e+00 0.00000e+00
> Non-Protein-Non-Protein 9.00044e+06 4.32503e+06 0.00000e+00 0.00000e+00
>
> T-Protein T-Non-Protein
> 2.66136e+03 9.69716e+02
>
>
> M E G A - F L O P S A C C O U N T I N G
>
> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
> NF=No Forces
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------
> CG-CoM 0.390476 1.171 100.0
> -----------------------------------------------------------------------
> Total 1.171 100.0
> -----------------------------------------------------------------------
>
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 1120267.0
> av. #atoms communicated per step for LINCS: 2 x 34671.0
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Rest 112 34860.808 0.0 100.0
> -----------------------------------------------------------------------
> Total 128 34860.808 0.0 100.0
> -----------------------------------------------------------------------
>
> nodetime = 0! Infinite Giga flopses!
> Parallel run - timing based on wallclock.
>
> Finished mdrun on node 0 Sat Mar 13 01:48:12 2010
>
> On Tuesday 16 March 2010 00:58:50 Roland Schulz wrote:
> > Alexey,
> >
> > you're not giving enough information.
> >
> > What exactly is the error? What happens? Does it hang, or does it crash
> > with an error?
> >
> > Roland
> >
> > On Fri, Mar 12, 2010 at 7:01 PM, Alexey Shvetsov <alexxyum at gmail.com> wrote:
> > > Hi all
> > > It seems there is a bug with continuation from a checkpoint in GROMACS
> > > 4.0.7. Steps to reproduce:
> > > 1. Submit a parallel job to PBS.
> > > 2. Kill the job.
> > > 3. Try to resume from the checkpoint.
> > >
> > > Relevant output from mdrun:
> > > Reading checkpoint file md.cpt generated: Thu Mar 11 12:20:46 2010
> > >
> > > Loaded with Money
> > >
> > > Making 2D domain decomposition 16 x 7 x 1
> > >
> > > WARNING: This run will generate roughly 20607979313638129664 Mb of data
> > >
> > > starting mdrun 'Protein in water'
> > > 500000 steps, 1000.0 ps (continuing from step 3377000, 6754.0 ps).
> > >
> > > nodetime = 0! Infinite Giga flopses!
> > >
> > > Parallel run - timing based on wallclock.
> > >
> > > --
> > > Best Regards,
> > > Alexey 'Alexxy' Shvetsov
> > > Petersburg Nuclear Physics Institute, Russia
> > > Department of Molecular and Radiation Biophysics
> > > Gentoo Team Ru
> > > Gentoo Linux Dev
> > > mailto:alexxyum at gmail.com
> > > mailto:alexxy at gentoo.org
> > > mailto:alexxy at omrb.pnpi.spb.ru
> > >
> > > --
> > > gmx-developers mailing list
> > > gmx-developers at gromacs.org
> > > http://lists.gromacs.org/mailman/listinfo/gmx-developers
> > > Please don't post (un)subscribe requests to the list. Use the
> > > www interface or send it to gmx-developers-request at gromacs.org.
>
--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309