[gmx-developers] Bug with continuation from checkpoint with gromacs 4.0.7
Mark Abraham
Mark.Abraham at anu.edu.au
Mon Mar 15 23:52:32 CET 2010
On 16/03/2010 9:38 AM, Alexey Shvetsov wrote:
> Hi,
>
> It crashed with this error.
That's not a crash. That's an orderly finish of a run for which the
checkpoint corresponds to a step later than the end of the run to which
the .tpr corresponds.
Now, where did the checkpoint come from? Why do you think this behaviour
is anomalous? Does the output of gmxcheck on it accord with what you
think it should have?
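As a rough check of the numbers in the mdrun output quoted below (500000 steps for 1000.0 ps, continuing from step 3377000), this Python sketch shows why there is nothing left to run. The helper name remaining_steps is made up for illustration and is not part of GROMACS, and the sketch assumes the run described by the .tpr starts at step 0.

```python
# Values copied from the mdrun output quoted in this thread.
TPR_NSTEPS = 500000        # number of steps the .tpr asks for
TPR_LENGTH_PS = 1000.0     # run length reported for those steps
CHECKPOINT_STEP = 3377000  # step stored in md.cpt


def remaining_steps(tpr_nsteps, checkpoint_step):
    """Steps still to run; hypothetical helper, assumes the run starts at step 0."""
    return max(0, tpr_nsteps - checkpoint_step)


# Simulation time of the checkpoint, using the time step implied by the .tpr.
checkpoint_time_ps = CHECKPOINT_STEP * TPR_LENGTH_PS / TPR_NSTEPS
steps_past_end = CHECKPOINT_STEP - TPR_NSTEPS
steps_left = remaining_steps(TPR_NSTEPS, CHECKPOINT_STEP)

print("checkpoint time (ps):", checkpoint_time_ps)
print("steps past end of .tpr:", steps_past_end)
print("steps left to run:", steps_left)
```

With zero steps left, mdrun has only the averages and accounting sections to write, which matches the quoted md.log.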
Mark
> In md.log I can see
>
> -----------------------------------------------------------
> Restarting from checkpoint, appending to previous log file.
>
> Log file opened on Sat Mar 13 01:48:10 2010
> Host: n1 pid: 5575 nodeid: 0 nnodes: 128
> The Gromacs distribution was built Sun Feb 28 02:57:38 MSK 2010 by
> root at n1 (Linux 2.6.31-gentoo-r6 x86_64)
>
>
>
> Initializing Domain Decomposition on 128 nodes
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
> two-body bonded interactions: 0.430 nm, LJ-14, atoms 6915 6922
> multi-body bonded interactions: 0.430 nm, Ryckaert-Bell., atoms 6915 6922
> Minimum cell size due to bonded interactions: 0.473 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
> Estimated maximum distance required for P-LINCS: 0.819 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Domain decomposition grid 16 x 7 x 1, separate PME nodes 16
> Interleaving PP and PME nodes
> This is a particle-particle only node
>
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using two step summing over 16 groups of on average 7.0 processes
>
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw: TRUE
> Will do PME sum in reciprocal space.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Using a Gaussian width (1/beta) of 0.480244 nm for Ewald
> Using shifted Lennard-Jones, switch between 1.2 and 1.5 nm
> Cut-off's: NS: 1.7 Coulomb: 1.5 LJ: 1.5
> System total charge: 0.000
> Generated table with 5400 data points for Ewald-Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for LJ6Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for LJ12Switch.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 COUL.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 LJ6.
> Tabscale = 2000 points/nm
> Generated table with 5400 data points for 1-4 LJ12.
> Tabscale = 2000 points/nm
>
> Enabling SPC water optimization for 115416 molecules.
>
> Configuring nonbonded kernels...
> Testing x86_64 SSE2 support... present.
>
>
>
> Initializing Parallel LINear Constraint Solver
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> B. Hess
> P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
> J. Chem. Theory Comput. 4 (2008) pp. 116-122
> -------- -------- --- Thank You --- -------- --------
>
> The number of constraints is 43710
> There are inter charge-group constraints,
> will communicate selected coordinates each lincs iteration
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
>
>
> Linking all bonded interactions to atoms
> There are 237282 inter charge-group exclusions,
> will use an extra communication step for exclusion forces for PME-Switch
>
> The initial number of communication pulses is: X 2 Y 1
> The initial domain decomposition cell size is: X 1.18 nm Y 2.54 nm
>
> The maximum allowed distance for charge groups involved in interactions is:
> non-bonded interactions 1.700 nm
> (the following are initial values, they could change due to box deformation)
> two-body bonded interactions (-rdd) 1.700 nm
> multi-body bonded interactions (-rdd) 1.178 nm
> atoms separated by up to 5 constraints (-rcon) 1.178 nm
>
> When dynamic load balancing gets turned on, these settings will change to:
> The maximum number of communication pulses is: X 2 Y 2
> The minimum size for domain decomposition cells is 0.850 nm
> The requested allowed shrink of DD cells (option -dds) is: 0.80
> The allowed shrink of domain decomposition cells is: X 0.72 Y 0.33
> The maximum allowed distance for charge groups involved in interactions is:
> non-bonded interactions 1.700 nm
> two-body bonded interactions (-rdd) 1.700 nm
> multi-body bonded interactions (-rdd) 0.850 nm
> atoms separated by up to 5 constraints (-rcon) 0.850 nm
>
>
> Making 2D domain decomposition grid 16 x 7 x 1, home cell index 0 0 0
>
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
> 0: rest
> There are: 390476 Atoms
> Charge group distribution at step 3377000: 1125 1142 1147 1166 1139 1158 1135
> 1146 1216 1139 1298 1162 1139 1138 1182 1279 1151 1525 1334 1173 1162 1364
> 1509 1368 1884 1513 1149 1151 1286 1358 1714 2023 1752 1191 1150 1167 1480
> 2099 2239 2327 1411 1170 1132 1668 2235 1647 2174 1812 1195 1258 1542 1919
> 1307 1861 1793 1141 1415 1590 1810 1435 2106 1755 1149 1564 1711 2424 2156
> 2122 1457 1158 1278 1291 1862 2071 1775 1564 1118 1144 1175 1938 2062 1858
> 1637 1136 1141 1326 1685 1438 1348 1277 1144 1162 1159 1361 1142 1226 1184
> 1153 1142 1144 1195 1154 1144 1151 1124 1149 1148 1171 1140 1127 1145 1158
> Grid: 5 x 7 x 17 cells
> Initial temperature: 309.173 K
>
> Started mdrun on node 0 Sat Mar 13 01:48:12 2010
>
> <====== ############### ==>
> <==== A V E R A G E S ====>
> <== ############### ======>
>
> Energies (kJ/mol)
> Angle Proper Dih. Ryckaert-Bell. LJ-14 Coulomb-14
> 3.11012e+11 1.88259e+10 3.84061e+11 1.60351e+11 1.48403e+12
> LJ (SR) Disper. corr. Coulomb (SR) Coul. recip. Potential
> 2.00208e+12 -3.71589e+10 -1.93169e+13 -2.65842e+12 -1.76521e+13
> Kinetic En. Total Energy Temperature Pressure (bar) Cons. rmsd ()
> 3.40082e+12 -1.42513e+13 1.04681e+09 3.09179e+06 0.00000e+00
>
> Box-X Box-Y Box-Z Volume Density (SI)
> 6.36496e+07 6.00618e+07 3.96178e+07 1.32807e+10 3.44326e+09
> pV
> 7.21862e+08
>
> Total Virial (kJ/mol)
> 1.13360e+12 1.84057e+08 1.42298e+08
> 1.84057e+08 1.13321e+12 1.72446e+08
> 1.42298e+08 1.72446e+08 1.13293e+12
>
> Pressure (bar)
> 1.96772e+06 3.52902e+05 -4.76710e+05
> 3.52902e+05 3.34068e+06 -5.56943e+05
> -4.76710e+05 -5.56943e+05 3.96697e+06
>
> Total Dipole (Debye)
> 2.04296e+09 7.00647e+08 1.82307e+09
>
> Epot (kJ/mol)                 Coul-SR         LJ-SR       Coul-14         LJ-14
> Protein-Protein          -1.05061e+12  -2.88200e+11   1.48403e+12   1.60351e+11
> Protein-Non-Protein      -9.55694e+11  -7.40905e+10   0.00000e+00   0.00000e+00
> Non-Protein-Non-Protein  -1.73106e+13   2.36437e+12   0.00000e+00   0.00000e+00
>
> T-Protein T-Non-Protein
> 1.04653e+09 1.04684e+09
>
> <====== ############################### ==>
> <==== R M S - F L U C T U A T I O N S ====>
> <== ############################### ======>
>
> Energies (kJ/mol)
> Angle Proper Dih. Ryckaert-Bell. LJ-14 Coulomb-14
> 9.01957e+05 1.88394e+05 8.36483e+05 3.32678e+05 1.56961e+06
> LJ (SR) Disper. corr. Coulomb (SR) Coul. recip. Potential
> 4.49504e+06 1.61531e+04 7.36949e+06 4.74527e+05 5.31921e+06
> Kinetic En. Total Energy Temperature Pressure (bar) Cons. rmsd ()
> 2.96016e+06 6.07351e+06 9.11168e+02 8.50221e+04 0.00000e+00
>
> Box-X Box-Y Box-Z Volume Density (SI)
> 9.22282e+00 8.70294e+00 5.74061e+00 5.77310e+03 1.49680e+03
> pV
> 2.01360e+07
>
> Total Virial (kJ/mol)
> 1.43640e+07 8.71197e+06 8.72558e+06
> 8.71197e+06 1.43505e+07 8.72679e+06
> 8.72558e+06 8.72679e+06 1.39238e+07
>
> Pressure (bar)
> 1.22151e+05 7.44143e+04 7.44974e+04
> 7.44143e+04 1.22011e+05 7.44967e+04
> 7.44974e+04 7.44967e+04 1.18187e+05
>
> Total Dipole (Debye)
> 6.14300e+06 5.87464e+06 5.12540e+06
>
> Epot (kJ/mol)                 Coul-SR         LJ-SR       Coul-14         LJ-14
> Protein-Protein           7.16780e+06   1.41743e+06   1.56961e+06   3.32678e+05
> Protein-Non-Protein       1.08780e+07   9.73569e+05   0.00000e+00   0.00000e+00
> Non-Protein-Non-Protein   9.00044e+06   4.32503e+06   0.00000e+00   0.00000e+00
>
> T-Protein T-Non-Protein
> 2.66136e+03 9.69716e+02
>
>
> M E G A - F L O P S A C C O U N T I N G
>
> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
> NF=No Forces
>
> Computing: M-Number M-Flops % Flops
> -----------------------------------------------------------------------
> CG-CoM 0.390476 1.171 100.0
> -----------------------------------------------------------------------
> Total 1.171 100.0
> -----------------------------------------------------------------------
>
>
> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>
> av. #atoms communicated per step for force: 2 x 1120267.0
> av. #atoms communicated per step for LINCS: 2 x 34671.0
>
>
> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>
> Computing: Nodes Number G-Cycles Seconds %
> -----------------------------------------------------------------------
> Rest 112 34860.808 0.0 100.0
> -----------------------------------------------------------------------
> Total 128 34860.808 0.0 100.0
> -----------------------------------------------------------------------
>
> nodetime = 0! Infinite Giga flopses!
> Parallel run - timing based on wallclock.
>
> Finished mdrun on node 0 Sat Mar 13 01:48:12 2010
>
> On Tuesday 16 March 2010 00:58:50 Roland Schulz wrote:
>> Alexey,
>>
>> you're not giving enough information.
>>
>> What exactly is the error? What happens? Does it hang or does it crash with
>> an error?
>>
>> Roland
>>
>> On Fri, Mar 12, 2010 at 7:01 PM, Alexey Shvetsov<alexxyum at gmail.com> wrote:
>>> Hi all
>>> It seems there is a bug with continuation from a checkpoint in GROMACS
>>> 4.0.7. Steps to reproduce:
>>> 1. submit a parallel job to PBS
>>> 2. kill the job
>>> 3. try to resume from the checkpoint
>>>
>>> relevant output from mdrun
>>> Reading checkpoint file md.cpt generated: Thu Mar 11 12:20:46 2010
>>>
>>> Loaded with Money
>>>
>>> Making 2D domain decomposition 16 x 7 x 1
>>>
>>> WARNING: This run will generate roughly 20607979313638129664 Mb of data
>>>
>>> starting mdrun 'Protein in water'
>>> 500000 steps, 1000.0 ps (continuing from step 3377000, 6754.0 ps).
>>>
>>> nodetime = 0! Infinite Giga flopses!
>>>
>>> Parallel run - timing based on wallclock.
>>>
>>> --
>>> Best Regards,
>>> Alexey 'Alexxy' Shvetsov
>>> Petersburg Nuclear Physics Institute, Russia
>>> Department of Molecular and Radiation Biophysics
>>> Gentoo Team Ru
>>> Gentoo Linux Dev
>>> mailto:alexxyum at gmail.com
>>> mailto:alexxy at gentoo.org
>>> mailto:alexxy at omrb.pnpi.spb.ru
>>>
>