[gmx-users] MPI-based errors on the power6 at large parallelization

Roland Schulz roland at utk.edu
Thu Mar 5 09:29:12 CET 2009


Hi,

Error b sounds like a bug (not necessarily in Gromacs though - it could
also be the compiler or the MPI library). The count should never be
negative. Maybe the integer that holds the count wraps around? Do you get
a core file when the job crashes?
It might be enough to put
ulimit -c unlimited
in your job script to get a core file.
If you could give us the stack trace of where mdrun crashes, either from
the core file or by running under a debugger, that would be great. Maybe
a sysadmin can help you with this?
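For example, something along these lines might do it (only a sketch: the
mdrun path is the one from your log, the core file name assumes the AIX
default, and dbx is simply the debugger that ships with AIX - gdb would
work just as well if it is installed):

# in the job script, before mdrun is launched
ulimit -c unlimited

# after the crash, on the node that wrote the core file
dbx /scratch/cneale/exe/gromacs-4.0.4_aix_o2/exec/bin/mdrun_mpi core
# then at the (dbx) prompt, print the stack trace:
where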

With error a, it might either be a bug or an MPI buffer size that the
user can change. Sorry for being vague, but it seems to be hardware
dependent. What MPI library and interconnect do you use? Take a look at
your MPI documentation and see whether you can modify buffer sizes by
setting environment variables, for example along the lines of the sketch
below.
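In case it is IBM Parallel Environment (the 0032-xxx message numbers
suggest so), the buffer-related settings I am aware of are environment
variables roughly like the ones below. Treat the names and values only as
an illustration and check the exact spelling, defaults and limits in your
PE documentation before relying on them:

# candidate IBM PE (POE) settings for the job script, values illustrative only
export MP_EAGER_LIMIT=65536   # message size threshold (bytes) for the eager protocol
export MP_BUFFER_MEM=128M     # size of the early-arrival buffer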

Roland

On Wed, Mar 4, 2009 at 11:58 AM,  <chris.neale at utoronto.ca> wrote:
> Hello,
>
> This is a more detailed description of a problem that I previously reported
> under the title "MPI_Recv invalid count and system explodes for large but
> not small parallelization on power6 but not opterons"
> http://www.gromacs.org/pipermail/gmx-users/2009-March/040158.html but
> focuses solely on the MPI-related problems that I see for N=196
> parallelization.
>
> In summary, I see a variety of MPI-based errors:
> a. ERROR: 0032-117 User pack or receive buffer is too small  (24) in
> MPI_Sendrecv, task 183
> b. ERROR: 0032-103 Invalid count  (-8388608) in MPI_Recv, task 37
> c. Or the system crashes without giving an MPI-based error message.
>
> So I gather that there is some problem with the MPI. Is this something that
> I should try to solve by changing the way that MPI is set up? I would have
> thought that gromacs would be responsible for ensuring that buffers are
> large enough, etc.
>
> System contains 500,000 real atoms (not including the TIP4P MW virtual
> sites) and consists of all-atom OPLS protein, TIP4P water, ions, and
> united-atom detergent. I suppose the united-atom detergent may be causing
> some problem if gromacs assumes a uniform density when determining how many
> atoms are likely to be found in any one grid cell, and this may become more
> of a problem as the cells get smaller?
>
> ######
>
> Here is the detailed information.
>
> gromacs version 4.0.4
> cluster of 32 x Power6 boxes @ 4.3 GHz running SMT to yield 64 tasks per box.
>
> Compilation Information:
>
> export PATH=/usr/lpp/ppe.hpct/bin:/usr/vacpp/bin:.:/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java14/jre/bin:/usr/java14/bin:/usr/lpp/LoadL/full/bin:/usr/local/bin
> export F77=xlf_r
> export CC=xlc_r
> export CXX=xlc++_r
> export FFLAGS="-O2 -qarch=pwr6 -qtune=pwr6"
> export CFLAGS="-O2 -qarch=pwr6 -qtune=pwr6"
> export CXXFLAGS="-O2 -qarch=pwr6 -qtune=pwr6"
>
> export FFTW_LOCATION=/scratch/cneale/exe/fftw-3.1.2_aix/exec
> export GROMACS_LOCATION=/scratch/cneale/exe/gromacs-4.0.4_aix_o2/exec
> export CPPFLAGS=-I$FFTW_LOCATION/include
> export LDFLAGS=-L$FFTW_LOCATION/lib
>
> cd /scratch/cneale/exe/gromacs-4.0.4_aix_o2
> mkdir exec
>
> ./configure --prefix=$GROMACS_LOCATION --without-motif-includes >output.configure 2>&1
> make  >output.make 2>&1
> make install  >output.make_install 2>&1
> make distclean
>
> ######
>
> [cneale at tcs-f11n05]$ cat incubator1.mdp
> title               =  seriousMD
> cpp                 =  cpp
> integrator          =  md
> nsteps              =  500
> tinit               =  0
> dt                  =  0.002
> comm_mode           =  linear
> nstcomm             =  1
> comm_grps           =  System
> nstxout             =  5000
> nstvout             =  5000
> nstfout             =  5000
> nstlog              =  5000
> nstlist             =  10
> nstenergy           =  5000
> nstxtcout           =  5000
> ns_type             =  grid
> pbc                 =  xyz
> coulombtype         =  PME
> rcoulomb            =  0.9
> fourierspacing      =  0.12
> pme_order           =  4
> vdwtype             =  cut-off
> rvdw_switch         =  0
> rvdw                =  1.4
> rlist               =  0.9
> DispCorr            =  no
> Pcoupl              =  Berendsen
> pcoupltype          =  isotropic
> compressibility     =  4.5e-5
> ref_p               =  1.
> tau_p               =  4.0
> tcoupl              =  Berendsen
> tc_grps             =  Protein      DPC_LDA     SOL_NA+
> tau_t               =  0.1          0.1         0.1
> ref_t               =  300.         300.        300.
> annealing           =  no
> gen_vel             =  yes
> unconstrained-start =  no
> gen_temp            =  300.
> gen_seed            =  9896
> constraints         =  all-bonds
> constraint_algorithm=  lincs
> lincs-iter          =  1
> lincs-order         =  4
> ;EOF
>
> #################
>
> [cneale at tcs-f11n05]$ cat temp.log
> Log file opened on Wed Mar  4 11:38:00 2009
> Host: tcs-f09n10  pid: 279344  nodeid: 0  nnodes:  196
> The Gromacs distribution was built Tue Mar  3 11:49:04 EST 2009 by
> cneale at tcs-f03n07 (AIX 3 00CA27F24C00)
>
>
>                         :-)  G  R  O  M  A  C  S  (-:
>
>                          GROtesk MACabre and Sinister
>
>                            :-)  VERSION 4.0.4  (-:
>
>
>      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
>       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
>             Copyright (c) 2001-2008, The GROMACS development team,
>            check out http://www.gromacs.org for more information.
>
>         This program is free software; you can redistribute it and/or
>          modify it under the terms of the GNU General Public License
>         as published by the Free Software Foundation; either version 2
>             of the License, or (at your option) any later version.
>
>     :-)  /scratch/cneale/exe/gromacs-4.0.4_aix_o2/exec/bin/mdrun_mpi  (-:
>
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
> GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
> molecular simulation
> J. Chem. Theory Comput. 4 (2008) pp. 435-447
> -------- -------- --- Thank You --- -------- --------
>
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
> Berendsen
> GROMACS: Fast, Flexible and Free
> J. Comp. Chem. 26 (2005) pp. 1701-1719
> -------- -------- --- Thank You --- -------- --------
>
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> E. Lindahl and B. Hess and D. van der Spoel
> GROMACS 3.0: A package for molecular simulation and trajectory analysis
> J. Mol. Mod. 7 (2001) pp. 306-317
> -------- -------- --- Thank You --- -------- --------
>
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> H. J. C. Berendsen, D. van der Spoel and R. van Drunen
> GROMACS: A message-passing parallel molecular dynamics implementation
> Comp. Phys. Comm. 91 (1995) pp. 43-56
> -------- -------- --- Thank You --- -------- --------
>
> parameters of the run:
>   integrator           = md
>   nsteps               = 500
>   init_step            = 0
>   ns_type              = Grid
>   nstlist              = 10
>   ndelta               = 2
>   nstcomm              = 1
>   comm_mode            = Linear
>   nstlog               = 5000
>   nstxout              = 5000
>   nstvout              = 5000
>   nstfout              = 5000
>   nstenergy            = 5000
>   nstxtcout            = 5000
>   init_t               = 0
>   delta_t              = 0.002
>   xtcprec              = 1000
>   nkx                  = 175
>   nky                  = 175
>   nkz                  = 175
>   pme_order            = 4
>   ewald_rtol           = 1e-05
>   ewald_geometry       = 0
>   epsilon_surface      = 0
>   optimize_fft         = FALSE
>   ePBC                 = xyz
>   bPeriodicMols        = FALSE
>   bContinuation        = FALSE
>   bShakeSOR            = FALSE
>   etc                  = Berendsen
>   epc                  = Berendsen
>   epctype              = Isotropic
>   tau_p                = 4
>   ref_p (3x3):
>      ref_p[    0]={ 1.00000e+00,  0.00000e+00,  0.00000e+00}
>      ref_p[    1]={ 0.00000e+00,  1.00000e+00,  0.00000e+00}
>      ref_p[    2]={ 0.00000e+00,  0.00000e+00,  1.00000e+00}
>   compress (3x3):
>      compress[    0]={ 4.50000e-05,  0.00000e+00,  0.00000e+00}
>      compress[    1]={ 0.00000e+00,  4.50000e-05,  0.00000e+00}
>      compress[    2]={ 0.00000e+00,  0.00000e+00,  4.50000e-05}
>   refcoord_scaling     = No
>   posres_com (3):
>      posres_com[0]= 0.00000e+00
>      posres_com[1]= 0.00000e+00
>      posres_com[2]= 0.00000e+00
>   posres_comB (3):
>      posres_comB[0]= 0.00000e+00
>      posres_comB[1]= 0.00000e+00
>      posres_comB[2]= 0.00000e+00
>   andersen_seed        = 815131
>   rlist                = 0.9
>   rtpi                 = 0.05
>   coulombtype          = PME
>   rcoulomb_switch      = 0
>   rcoulomb             = 0.9
>   vdwtype              = Cut-off
>   rvdw_switch          = 0
>   rvdw                 = 1.4
>   epsilon_r            = 1
>   epsilon_rf           = 1
>   tabext               = 1
>   implicit_solvent     = No
>   gb_algorithm         = Still
>   gb_epsilon_solvent   = 80
>   nstgbradii           = 1
>   rgbradii             = 2
>   gb_saltconc          = 0
>   gb_obc_alpha         = 1
>   gb_obc_beta          = 0.8
>   gb_obc_gamma         = 4.85
>   sa_surface_tension   = 2.092
>   DispCorr             = No
>   free_energy          = no
>   init_lambda          = 0
>   sc_alpha             = 0
>   sc_power             = 0
>   sc_sigma             = 0.3
>   delta_lambda         = 0
>   nwall                = 0
>   wall_type            = 9-3
>   wall_atomtype[0]     = -1
>   wall_atomtype[1]     = -1
>   wall_density[0]      = 0
>   wall_density[1]      = 0
>   wall_ewald_zfac      = 3
>   pull                 = no
>   disre                = No
>   disre_weighting      = Conservative
>   disre_mixed          = FALSE
>   dr_fc                = 1000
>   dr_tau               = 0
>   nstdisreout          = 100
>   orires_fc            = 0
>   orires_tau           = 0
>   nstorireout          = 100
>   dihre-fc             = 1000
>   em_stepsize          = 0.01
>   em_tol               = 10
>   niter                = 20
>   fc_stepsize          = 0
>   nstcgsteep           = 1000
>   nbfgscorr            = 10
>   ConstAlg             = Lincs
>   shake_tol            = 0.0001
>   lincs_order          = 4
>   lincs_warnangle      = 30
>   lincs_iter           = 1
>   bd_fric              = 0
>   ld_seed              = 1993
>   cos_accel            = 0
>   deform (3x3):
>      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
>   userint1             = 0
>   userint2             = 0
>   userint3             = 0
>   userint4             = 0
>   userreal1            = 0
>   userreal2            = 0
>   userreal3            = 0
>   userreal4            = 0
> grpopts:
>   nrdf:     40031.9     67943.8      987429
>   ref_t:         300         300         300
>   tau_t:         0.1         0.1         0.1
> anneal:          No          No          No
> ann_npoints:           0           0           0
>   acc:            0           0           0
>   nfreeze:           N           N           N
>   energygrp_flags[  0]: 0
>   efield-x:
>      n = 0
>   efield-xt:
>      n = 0
>   efield-y:
>      n = 0
>   efield-yt:
>      n = 0
>   efield-z:
>      n = 0
>   efield-zt:
>      n = 0
>   bQMMM                = FALSE
>   QMconstraints        = 0
>   QMMMscheme           = 0
>   scalefactor          = 1
> qm_opts:
>   ngQM                 = 0
>
> Initializing Domain Decomposition on 196 nodes
> Dynamic load balancing: auto
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>    two-body bonded interactions: 0.556 nm, LJ-14, atoms 25035 25038
>  multi-body bonded interactions: 0.556 nm, Proper Dih., atoms 25035 25038
> Minimum cell size due to bonded interactions: 0.612 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
> Estimated maximum distance required for P-LINCS: 0.820 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Guess for relative PME load: 0.37
> Will use 108 particle-particle and 88 PME only nodes
> This is a guess, check the performance at the end of the log file
> Using 88 separate PME nodes
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 108 cells with a minimum initial size of 1.025 nm
> The maximum allowed number of cells is: X 16 Y 16 Z 14
> Domain decomposition grid 6 x 6 x 3, separate PME nodes 88
> Interleaving PP and PME nodes
> This is a particle-particle only node
>
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using two step summing over 4 groups of on average 27.0 processes
>
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw:     FALSE
> Will do PME sum in reciprocal space.
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
> A smooth particle mesh Ewald method
> J. Chem. Phys. 103 (1995) pp. 8577-8592
> -------- -------- --- Thank You --- -------- --------
>
> Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
> Cut-off's:   NS: 0.9   Coulomb: 0.9   LJ: 1.4
> System total charge: 0.000
> Generated table with 1200 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1200 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 1200 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 1200 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1200 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1200 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
>
> Enabling TIP4p water optimization for 164564 molecules.
>
> Configuring nonbonded kernels...
>
>
> Removing pbc first time
>
> Initializing Parallel LINear Constraint Solver
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> B. Hess
> P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
> J. Chem. Theory Comput. 4 (2008) pp. 116-122
> -------- -------- --- Thank You --- -------- --------
>
> The number of constraints is 52488
> There are inter charge-group constraints,
> will communicate selected coordinates each lincs iteration
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> S. Miyamoto and P. A. Kollman
> SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
> Water Models
> J. Comp. Chem. 13 (1992) pp. 952-962
> -------- -------- --- Thank You --- -------- --------
>
>
> Linking all bonded interactions to atoms
> There are 156384 inter charge-group exclusions,
> will use an extra communication step for exclusion forces for PME
>
> The initial number of communication pulses is: X 1 Y 1 Z 1
> The initial domain decomposition cell size is: X 2.81 nm Y 2.81 nm Z 4.86 nm
>
> The maximum allowed distance for charge groups involved in interactions is:
>                 non-bonded interactions           1.400 nm
> (the following are initial values, they could change due to box deformation)
>            two-body bonded interactions  (-rdd)   1.400 nm
>          multi-body bonded interactions  (-rdd)   1.400 nm
>  atoms separated by up to 5 constraints  (-rcon)  2.806 nm
>
> When dynamic load balancing gets turned on, these settings will change to:
> The maximum number of communication pulses is: X 1 Y 1 Z 1
> The minimum size for domain decomposition cells is 1.400 nm
> The requested allowed shrink of DD cells (option -dds) is: 0.80
> The allowed shrink of domain decomposition cells is: X 0.50 Y 0.50 Z 0.29
> The maximum allowed distance for charge groups involved in interactions is:
>                 non-bonded interactions           1.400 nm
>            two-body bonded interactions  (-rdd)   1.400 nm
>          multi-body bonded interactions  (-rdd)   1.400 nm
>  atoms separated by up to 5 constraints  (-rcon)  1.400 nm
>
>
> Making 3D domain decomposition grid 6 x 6 x 3, home cell index 0 0 0
>
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>  0:  System
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> H. J. C. Berendsen, J. P. M. Postma, A. DiNola and J. R. Haak
> Molecular dynamics with coupling to an external bath
> J. Chem. Phys. 81 (1984) pp. 3684-3690
> -------- -------- --- Thank You --- -------- --------
>
> There are: 547196 Atoms
> There are: 164564 VSites
> Charge group distribution at step 0: 1835 1783 1808 1763 1765 1723 1796 1810
> 1783 1889 1773 1802 1777 1813 1726 1765 1801 1790 1766 1759 1733 1767 1738
> 1761 1744 1728 1775 1785 1759 1758 1754 1807 1738 1781 1716 1751 1775 1771
> 1794 1764 1765 1777 1798 1776 1827 1819 1752 1800 1751 1809 1768 1849 1788
> 1847 1824 1767 1834 1745 1737 1753 1776 1778 1785 1829 1788 1863 1725 1785
> 1778 1800 1754 1820 1747 1757 1773 1773 1819 1726 1745 1784 1772 1753 1771
> 1788 1781 1784 1734 1761 1769 1733 1801 1767 1798 1781 1779 1758 1796 1800
> 1856 1795 1740 1828 1736 1779 1783 1795 1776 1849
> Grid: 13 x 13 x 10 cells
>
> Constraining the starting coordinates (step 0)
>
> Constraining the coordinates at t0-dt (step 0)
> RMS relative constraint deviation after constraining: 3.34e-05
> Initial temperature: 300.336 K
>
> Started mdrun on node 0 Wed Mar  4 11:38:03 2009
>
>           Step           Time         Lambda
>              0        0.00000        0.00000
>
>   Energies (kJ/mol)
>          Angle    Proper Dih. Ryckaert-Bell.          LJ-14     Coulomb-14
>    6.76081e+04    1.72586e+04    2.77334e+04    4.21216e+04    2.71567e+05
>        LJ (SR)        LJ (LR)   Coulomb (SR)   Coul. recip.      Potential
>    1.55700e+06   -4.61765e+04   -9.26002e+06   -1.91687e+06   -9.23978e+06
>    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
>    1.37042e+06   -7.86936e+06    3.00934e+02   -2.95959e+03    5.41981e-05
>
> <this is the end of the log file>
>
> ###############
>
> [cneale at tcs-f11n05]$cat my.stderr
> ATTENTION: 0031-408  196 tasks allocated by LoadLeveler, continuing...
> NNODES=196, MYRANK=192, HOSTNAME=tcs-f04n08
> ...
> <snip>
> ...
> NNODES=196, MYRANK=4, HOSTNAME=tcs-f09n10
> NODEID=64 argc=3
> ...
> <snip>
> ...
> NODEID=63 argc=3
>                         :-)  G  R  O  M  A  C  S  (-:
>
>               Gromacs Runs One Microsecond At Cannonball Speeds
>
>                            :-)  VERSION 4.0.4  (-:
>
>
>      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
>       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
>             Copyright (c) 2001-2008, The GROMACS development team,
>            check out http://www.gromacs.org for more information.
>
>         This program is free software; you can redistribute it and/or
>          modify it under the terms of the GNU General Public License
>         as published by the Free Software Foundation; either version 2
>             of the License, or (at your option) any later version.
>
>     :-)  /scratch/cneale/exe/gromacs-4.0.4_aix_o2/exec/bin/mdrun_mpi  (-:
>
> Option     Filename  Type         Description
> ------------------------------------------------------------
>  -s       temp.tpr  Input        Run input file: tpr tpb tpa
>  -o       temp.trr  Output       Full precision trajectory: trr trj cpt
>  -x       temp.xtc  Output, Opt. Compressed trajectory (portable xdr format)
> -cpi       temp.cpt  Input, Opt.  Checkpoint file
> -cpo       temp.cpt  Output, Opt. Checkpoint file
>  -c       temp.gro  Output       Structure file: gro g96 pdb
>  -e       temp.edr  Output       Energy file: edr ene
>  -g       temp.log  Output       Log file
> -dgdl      temp.xvg  Output, Opt. xvgr/xmgr file
> -field     temp.xvg  Output, Opt. xvgr/xmgr file
> -table     temp.xvg  Input, Opt.  xvgr/xmgr file
> -tablep    temp.xvg  Input, Opt.  xvgr/xmgr file
> -tableb    temp.xvg  Input, Opt.  xvgr/xmgr file
> -rerun     temp.xtc  Input, Opt.  Trajectory: xtc trr trj gro g96 pdb cpt
> -tpi       temp.xvg  Output, Opt. xvgr/xmgr file
> -tpid      temp.xvg  Output, Opt. xvgr/xmgr file
>  -ei       temp.edi  Input, Opt.  ED sampling input
>  -eo       temp.edo  Output, Opt. ED sampling output
>  -j       temp.gct  Input, Opt.  General coupling stuff
>  -jo       temp.gct  Output, Opt. General coupling stuff
> -ffout     temp.xvg  Output, Opt. xvgr/xmgr file
> -devout    temp.xvg  Output, Opt. xvgr/xmgr file
> -runav     temp.xvg  Output, Opt. xvgr/xmgr file
>  -px       temp.xvg  Output, Opt. xvgr/xmgr file
>  -pf       temp.xvg  Output, Opt. xvgr/xmgr file
> -mtx       temp.mtx  Output, Opt. Hessian matrix
>  -dn       temp.ndx  Output, Opt. Index file
>
> Option       Type   Value   Description
> ------------------------------------------------------
> -[no]h       bool   no      Print help info and quit
> -nice        int    0       Set the nicelevel
> -deffnm      string temp    Set the default filename for all file options
> -[no]xvgr    bool   yes     Add specific codes (legends etc.) in the output
>                            xvg files for the xmgrace program
> -[no]pd      bool   no      Use particle decompostion
> -dd          vector 0 0 0   Domain decomposition grid, 0 is optimize
> -npme        int    -1      Number of separate nodes to be used for PME, -1
>                            is guess
> -ddorder     enum   interleave  DD node order: interleave, pp_pme or cartesian
> -[no]ddcheck bool   yes     Check for all bonded interactions with DD
> -rdd         real   0       The maximum distance for bonded interactions with
>                            DD (nm), 0 is determine from initial coordinates
> -rcon        real   0       Maximum distance for P-LINCS (nm), 0 is estimate
> -dlb         enum   auto    Dynamic load balancing (with DD): auto, no or yes
> -dds         real   0.8     Minimum allowed dlb scaling of the DD cell size
> -[no]sum     bool   yes     Sum the energies at every step
> -[no]v       bool   no      Be loud and noisy
> -[no]compact bool   yes     Write a compact log file
> -[no]seppot  bool   no      Write separate V and dVdl terms for each
>                            interaction type and node to the log file(s)
> -pforce      real   -1      Print all forces larger than this (kJ/mol nm)
> -[no]reprod  bool   no      Try to avoid optimizations that affect binary
>                            reproducibility
> -cpt         real   15      Checkpoint interval (minutes)
> -[no]append  bool   no      Append to previous output files when continuing
>                            from checkpoint
> -[no]addpart bool   yes     Add the simulation part number to all output
>                            files when continuing from checkpoint
> -maxh        real   -1      Terminate after 0.99 times this time (hours)
> -multi       int    0       Do multiple simulations in parallel
> -replex      int    0       Attempt replica exchange every # steps
> -reseed      int    -1      Seed for replica exchange, -1 is generate a seed
> -[no]glas    bool   no      Do glass simulation with special long range
>                            corrections
> -[no]ionize  bool   no      Do a simulation including the effect of an X-Ray
>                            bombardment on your system
>
> Reading file temp.tpr, VERSION 4.0.4 (single precision)
>
> Will use 108 particle-particle and 88 PME only nodes
> This is a guess, check the performance at the end of the log file
>
> NOTE: For optimal PME load balancing at high parallelization
>      PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)
>
> Making 3D domain decomposition 6 x 6 x 3
>
> starting mdrun 'Big Box'
> 500 steps,      1.0 ps.
> ERROR: 0032-117 User pack or receive buffer is too small  (24) in
> MPI_Sendrecv, task 183
>
>
> #########
>
> Then from the IBM website:
> (http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe_linux43.messages.doc/am105_mpimsgs.html)
>
> 0032-117
>
> Parallel Environment for Linux V4.3 Messages
> SA38-0648-01
>
> User pack or receive buffer too small (number) in string, task number
> Explanation
>
> The buffer specified for the operation was too small to hold the message. In
> the PACK and UNPACK cases it is the space between current position and
> buffer end which is too small.
> User response
>
> Increase the size of the buffer or reduce the size of the message. Error
> Class: MPI_ERR_TRUNCATE
>
> ### And the error that I previously reported was:
>
> 0032-103
>
> Parallel Environment for Linux V4.3 Messages
> SA38-0648-01
>
> Invalid count (number) in string, task number
> Explanation
>
> The value of count (element count) is out of range.
> User response
>
> Make sure that the count is greater than or equal to zero.
>
> Error Class: MPI_ERR_COUNT
>
> ### And if I run it a third time, I don't get an MPI based error, but a
> crash:
>
> $tail temp.log
> ...
> Reading file temp.tpr, VERSION 4.0.4 (single precision)
>
> Will use 108 particle-particle and 88 PME only nodes
> This is a guess, check the performance at the end of the log file
>
> NOTE: For optimal PME load balancing at high parallelization
>      PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)
>
> Making 3D domain decomposition 6 x 6 x 3
>
> starting mdrun 'Big Box'
> 500 steps,      1.0 ps.
>
> t = 0.222 ps: Water molecule starting at atom 105813 can not be settled.
> Check for bad contacts and/or reduce the timestep.
> Wrote pdb files with previous and current coordinates
>
> Step 112, time 0.224 (ps)  LINCS WARNING
> relative constraint deviation after LINCS:
> rms 18077115.836165, max 343959040.000000 (between atoms 24714 and 24713)
> bonds that rotated more than 30 degrees:
>  atom 1 atom 2  angle  previous, current, constraint length
>  24715  24714   90.3    0.1530 14177746.0000      0.1530
>  24716  24715   93.1    0.1530 2540901.0000      0.1530
>  24717  24716   91.1    0.1530 473629.9375      0.1530
>  24718  24717  103.1    0.1530 63926.0820      0.1530
>  24719  24718  113.0    0.1530 3528.4724      0.1530
>  24720  24719   33.0    0.1530   0.1829      0.1530
>  24706  24703   94.5    0.1470 72767.5391      0.1470
>  24706  24704   91.5    0.1470 207189.8906      0.1470
>  24706  24705   94.1    0.1470 73579.9141      0.1470
>  24707  24706   97.4    0.1470 71872.3125      0.1470
>  24708  24707   94.8    0.1530 282677.5000      0.1530
>  24709  24708   90.6    0.1430 3258115.2500      0.1430
>  24710  24709   90.4    0.1610 15580227.0000      0.1610
>  24713  24710   90.2    0.1610 50052624.0000      0.1610
>  24712  24710   90.7    0.1480 15122392.0000      0.1480
>  24711  24710   90.8    0.1480 14735100.0000      0.1480
>  24714  24713   90.5    0.1430 49186144.0000      0.1430
>
> t = 0.224 ps: Water molecule starting at atom 164033 can not be settled.
> Check for bad contacts and/or reduce the timestep.
>
> t = 0.224 ps: Water molecule starting at atom 163709 can not be settled.
> Check for bad contacts and/or reduce the timestep.
> Wrote pdb files with previous and current coordinates
> Wrote pdb files with previous and current coordinates
> ERROR: 0031-250  task 145: Segmentation fault
> ERROR: 0031-250  task 151: Segmentation fault
> ERROR: 0031-250  task 153: Segmentation fault
> ERROR: 0031-250  task 112: Segmentation fault
> ERROR: 0031-250  task 118: Segmentation fault
> ERROR: 0031-250  task 119: Segmentation fault
>
> ##############
>
> Many thanks,
> Chris.
>
>



-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309


