[gmx-developers] Bug with continuation from checkpoint with gromacs 4.0.7

Alexey Shvetsov alexxyum at gmail.com
Mon Mar 15 23:38:07 CET 2010


Hi,

It crashed with the error quoted below.
In md.log I can see:

-----------------------------------------------------------
Restarting from checkpoint, appending to previous log file.

Log file opened on Sat Mar 13 01:48:10 2010
Host: n1  pid: 5575  nodeid: 0  nnodes:  128
The Gromacs distribution was built Sun Feb 28 02:57:38 MSK 2010 by
root at n1 (Linux 2.6.31-gentoo-r6 x86_64)



Initializing Domain Decomposition on 128 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.430 nm, LJ-14, atoms 6915 6922
  multi-body bonded interactions: 0.430 nm, Ryckaert-Bell., atoms 6915 6922
Minimum cell size due to bonded interactions: 0.473 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
Estimated maximum distance required for P-LINCS: 0.819 nm
This distance will limit the DD cell size, you can override this with -rcon
Domain decomposition grid 16 x 7 x 1, separate PME nodes 16
Interleaving PP and PME nodes
This is a particle-particle only node

Domain decomposition nodeid 0, coordinates 0 0 0

Using two step summing over 16 groups of on average 7.0 processes

Table routines are used for coulomb: TRUE
Table routines are used for vdw:     TRUE
Will do PME sum in reciprocal space.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen 
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.480244 nm for Ewald
Using shifted Lennard-Jones, switch between 1.2 and 1.5 nm
Cut-off's:   NS: 1.7   Coulomb: 1.5   LJ: 1.5
System total charge: 0.000
Generated table with 5400 data points for Ewald-Switch.
Tabscale = 2000 points/nm
Generated table with 5400 data points for LJ6Switch.
Tabscale = 2000 points/nm
Generated table with 5400 data points for LJ12Switch.
Tabscale = 2000 points/nm
Generated table with 5400 data points for 1-4 COUL.
Tabscale = 2000 points/nm
Generated table with 5400 data points for 1-4 LJ6.
Tabscale = 2000 points/nm
Generated table with 5400 data points for 1-4 LJ12.
Tabscale = 2000 points/nm

Enabling SPC water optimization for 115416 molecules.

Configuring nonbonded kernels...
Testing x86_64 SSE2 support... present.



Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 43710
There are inter charge-group constraints,
will communicate selected coordinates each lincs iteration

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------


Linking all bonded interactions to atoms
There are 237282 inter charge-group exclusions,
will use an extra communication step for exclusion forces for PME-Switch

The initial number of communication pulses is: X 2 Y 1
The initial domain decomposition cell size is: X 1.18 nm Y 2.54 nm

The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.700 nm
(the following are initial values, they could change due to box deformation)
            two-body bonded interactions  (-rdd)   1.700 nm
          multi-body bonded interactions  (-rdd)   1.178 nm
  atoms separated by up to 5 constraints  (-rcon)  1.178 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 2 Y 2
The minimum size for domain decomposition cells is 0.850 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.72 Y 0.33
The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.700 nm
            two-body bonded interactions  (-rdd)   1.700 nm
          multi-body bonded interactions  (-rdd)   0.850 nm
  atoms separated by up to 5 constraints  (-rcon)  0.850 nm


Making 2D domain decomposition grid 16 x 7 x 1, home cell index 0 0 0

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
There are: 390476 Atoms
Charge group distribution at step 3377000: 1125 1142 1147 1166 1139 1158 1135 
1146 1216 1139 1298 1162 1139 1138 1182 1279 1151 1525 1334 1173 1162 1364 
1509 1368 1884 1513 1149 1151 1286 1358 1714 2023 1752 1191 1150 1167 1480 
2099 2239 2327 1411 1170 1132 1668 2235 1647 2174 1812 1195 1258 1542 1919 
1307 1861 1793 1141 1415 1590 1810 1435 2106 1755 1149 1564 1711 2424 2156 
2122 1457 1158 1278 1291 1862 2071 1775 1564 1118 1144 1175 1938 2062 1858 
1637 1136 1141 1326 1685 1438 1348 1277 1144 1162 1159 1361 1142 1226 1184 
1153 1142 1144 1195 1154 1144 1151 1124 1149 1148 1171 1140 1127 1145 1158
Grid: 5 x 7 x 17 cells
Initial temperature: 309.173 K

Started mdrun on node 0 Sat Mar 13 01:48:12 2010

        <======  ###############  ==>
        <====  A V E R A G E S  ====>
        <==  ###############  ======>

   Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    3.11012e+11    1.88259e+10    3.84061e+11    1.60351e+11    1.48403e+12
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    2.00208e+12   -3.71589e+10   -1.93169e+13   -2.65842e+12   -1.76521e+13
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    3.40082e+12   -1.42513e+13    1.04681e+09    3.09179e+06    0.00000e+00

          Box-X          Box-Y          Box-Z         Volume   Density (SI)
    6.36496e+07    6.00618e+07    3.96178e+07    1.32807e+10    3.44326e+09
             pV
    7.21862e+08

   Total Virial (kJ/mol)
    1.13360e+12    1.84057e+08    1.42298e+08
    1.84057e+08    1.13321e+12    1.72446e+08
    1.42298e+08    1.72446e+08    1.13293e+12

   Pressure (bar)
    1.96772e+06    3.52902e+05   -4.76710e+05
    3.52902e+05    3.34068e+06   -5.56943e+05
   -4.76710e+05   -5.56943e+05    3.96697e+06

   Total Dipole (Debye)
    2.04296e+09    7.00647e+08    1.82307e+09

  Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14   
Protein-Protein   -1.05061e+12   -2.88200e+11    1.48403e+12    1.60351e+11
Protein-Non-Protein   -9.55694e+11   -7.40905e+10    0.00000e+00    0.00000e+00
Non-Protein-Non-Protein   -1.73106e+13    2.36437e+12    0.00000e+00    0.00000e+00

      T-Protein  T-Non-Protein
    1.04653e+09    1.04684e+09

        <======  ###############################  ==>
        <====  R M S - F L U C T U A T I O N S  ====>
        <==  ###############################  ======>

   Energies (kJ/mol)
          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
    9.01957e+05    1.88394e+05    8.36483e+05    3.32678e+05    1.56961e+06
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    4.49504e+06    1.61531e+04    7.36949e+06    4.74527e+05    5.31921e+06
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    2.96016e+06    6.07351e+06    9.11168e+02    8.50221e+04    0.00000e+00

          Box-X          Box-Y          Box-Z         Volume   Density (SI)
    9.22282e+00    8.70294e+00    5.74061e+00    5.77310e+03    1.49680e+03
             pV
    2.01360e+07

   Total Virial (kJ/mol)
    1.43640e+07    8.71197e+06    8.72558e+06
    8.71197e+06    1.43505e+07    8.72679e+06
    8.72558e+06    8.72679e+06    1.39238e+07

   Pressure (bar)
    1.22151e+05    7.44143e+04    7.44974e+04
    7.44143e+04    1.22011e+05    7.44967e+04
    7.44974e+04    7.44967e+04    1.18187e+05

   Total Dipole (Debye)
    6.14300e+06    5.87464e+06    5.12540e+06

  Epot (kJ/mol)        Coul-SR          LJ-SR        Coul-14          LJ-14   
Protein-Protein    7.16780e+06    1.41743e+06    1.56961e+06    3.32678e+05
Protein-Non-Protein    1.08780e+07    9.73569e+05    0.00000e+00    0.00000e+00
Non-Protein-Non-Protein    9.00044e+06    4.32503e+06    0.00000e+00    0.00000e+00

      T-Protein  T-Non-Protein
    2.66136e+03    9.69716e+02


        M E G A - F L O P S   A C C O U N T I N G

   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
   NF=No Forces

 Computing:                         M-Number         M-Flops  % Flops
-----------------------------------------------------------------------
 CG-CoM                             0.390476           1.171   100.0
-----------------------------------------------------------------------
 Total                                                 1.171   100.0
-----------------------------------------------------------------------


    D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S

 av. #atoms communicated per step for force:  2 x 1120267.0
 av. #atoms communicated per step for LINCS:  2 x 34671.0


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Rest                 112               34860.808        0.0   100.0
-----------------------------------------------------------------------
 Total                128               34860.808        0.0   100.0
-----------------------------------------------------------------------

nodetime = 0! Infinite Giga flopses!
        Parallel run - timing based on wallclock.

Finished mdrun on node 0 Sat Mar 13 01:48:12 2010
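
A quick sanity check on the averages block above, as a rough sketch only. It assumes (this is not confirmed anywhere in the log) that the values printed under "A V E R A G E S" are running sums that were never divided by the number of steps, and it takes the step count from the "Charge group distribution at step 3377000" line. The names STEPS, reported and per_step are made up for this illustration.

```python
# Step count taken from the log line "Charge group distribution at step 3377000".
STEPS = 3377000

# Values copied from the A V E R A G E S block of md.log.
reported = {
    "Temperature (K)": 1.04681e+09,
    "Box-X (nm)": 6.36496e+07,
    "Box-Y (nm)": 6.00618e+07,
    "Box-Z (nm)": 3.96178e+07,
}


def per_step(values, steps):
    """Divide each reported value by the step count.

    Assumption: the reported values are sums over all steps, not averages.
    """
    return {name: value / steps for name, value in values.items()}


normalized = per_step(reported, STEPS)
for name, value in normalized.items():
    print(f"{name:16s} {value:10.3f}")
```

Under that assumption the temperature comes out at about 310 K, close to the "Initial temperature: 309.173 K" line, and the box comes out at roughly 18.85 x 17.79 x 11.73 nm, which is consistent with 16 cells of about 1.18 nm in X and 7 cells of about 2.54 nm in Y. So the huge numbers look more like unnormalized sums than a physically exploded system.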

On Tuesday 16 March 2010 00:58:50 Roland Schulz wrote:
> Alexey,
> 
> you're not giving enough information.
> 
> What exactly is the error? What happens? Does it hang or does it crash with
> an error?
> 
> Roland
> 
> On Fri, Mar 12, 2010 at 7:01 PM, Alexey Shvetsov <alexxyum at gmail.com> wrote:
> > Hi all
> > There seems to be a bug with continuation from checkpoint in gromacs
> > 4.0.7. Steps to reproduce:
> > 1. submit a parallel job to PBS
> > 2. kill the job
> > 3. try to resume from the checkpoint
> > 
> > Relevant output from mdrun:
> > Reading checkpoint file md.cpt generated: Thu Mar 11 12:20:46 2010
> > 
> > Loaded with Money
> > 
> > Making 2D domain decomposition 16 x 7 x 1
> > 
> > WARNING: This run will generate roughly 20607979313638129664 Mb of data
> > 
> > starting mdrun 'Protein in water'
> > 500000 steps,   1000.0 ps (continuing from step 3377000,   6754.0 ps).
> > 
> > nodetime = 0! Infinite Giga flopses!
> > 
> >        Parallel run - timing based on wallclock.
> > 
> > --
> > Best Regards,
> > Alexey 'Alexxy' Shvetsov
> > Petersburg Nuclear Physics Institute, Russia
> > Department of Molecular and Radiation Biophysics
> > Gentoo Team Ru
> > Gentoo Linux Dev
> > mailto:alexxyum at gmail.com
> > mailto:alexxy at gentoo.org
> > mailto:alexxy at omrb.pnpi.spb.ru

-- 
Best Regards,
Alexey 'Alexxy' Shvetsov
Petersburg Nuclear Physics Institute, Russia
Department of Molecular and Radiation Biophysics
Gentoo Team Ru
Gentoo Linux Dev
mailto:alexxyum at gmail.com
mailto:alexxy at gentoo.org
mailto:alexxy at omrb.pnpi.spb.ru