[gmx-users] Need help on a SEGV mdrun mpi failure

Mostyn Lewis Mostyn.Lewis at sun.com
Fri Oct 24 18:21:01 CEST 2003


Hello,

I sent this last night, but it seems to have been lost (maybe because it
had a 270K attachment, topol.top.bz2?). So here goes again.

I'm having a problem with a benchmark case that causes a SEGV (signal 11)
in most MPI runs with more than 4 CPUs. The failure is always in
bondfree.c (gromacs-3.1.4 and gromacs-3.1.5_pre1), in the angles routine
at line 535:

    ivec_sub(SHIFT_IVEC(g,ai),jt,dt_ij);
    ivec_sub(SHIFT_IVEC(g,ak),jt,dt_kj);
    t1=IVEC2IS(dt_ij);
    t2=IVEC2IS(dt_kj);

      rvec_inc(fr->fshift[t1],f_i);
      rvec_inc(fr->fshift[CENTRAL],f_j);
----->rvec_inc(fr->fshift[t2],f_k);
    }                                           /* 168 TOTAL    */

On this line t2 has a BAD value, which causes an out-of-bounds reference
(the fault actually shows up a little later, at x=a[XX]+b[XX]; on line 235
of vec.h, due to the expansion of rvec_inc).
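
To make the out-of-bounds reference concrete, here is a minimal sketch of
how a shift vector could be flattened into an fshift index. The 3x3x3
layout and the formula are an assumption about what IVEC2IS does in 3.1.x,
not copied from the source, and the names ivec_to_shift_index,
shift_index_ok and checked_shift_index are hypothetical. Under that
assumption the formula reproduces both values from the dbx session further
down: a zero shift gives 13 (the t1 value) and dt_kj = (54073, 196608, 9002)
gives 724928 (the bad t2).

```c
#include <assert.h>

/* ASSUMPTION: one periodic image in each direction, so the shift table
 * is 3x3x3 = 27 entries. This mirrors, but is not copied from, the
 * GROMACS 3.1 macros. */
enum { XX, YY, ZZ, DIM };
#define D_BOX  1
#define N_BOX  (2 * D_BOX + 1)
#define N_IVEC (N_BOX * N_BOX * N_BOX)

typedef int ivec[DIM];

/* Flatten a shift vector into an index into the shift-force table
 * (hypothetical stand-in for IVEC2IS). */
static long ivec_to_shift_index(const ivec iv)
{
    return (long)N_BOX * (N_BOX * ((long)iv[ZZ] + D_BOX) + iv[YY] + D_BOX)
           + iv[XX] + D_BOX;
}

/* Return 1 if the index is inside the table, 0 otherwise. */
static int shift_index_ok(long is)
{
    return is >= 0 && is < N_IVEC;
}

/* Debugging aid: abort where the bad index is computed, instead of
 * faulting later inside rvec_inc. */
static long checked_shift_index(const ivec iv)
{
    long is = ivec_to_shift_index(iv);
    assert(shift_index_ok(is));
    return is;
}
```

Calling checked_shift_index(dt_kj) just before the rvec_inc calls, in a
build with assertions enabled, should stop at the point where the index
goes wrong. With the values from the core file it would trip immediately,
which suggests dt_kj is already garbage before fshift is touched.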

I enclose a run below with the grompp and mdrun_mpi output followed by
some dbx debugging output showing some values. This was on a 24 CPU
SUN SMP box (Sunfire 6800) using 8 CPUs.
I get the same failure on a cluster of Linux (2 CPU Xeon) boxes doing
MPI across Gigabit Ethernet. The failure occurs in Linux land with
Intel/PGI compiler and LAM/mpich combinations, so I think it depends on
the problem and/or Gromacs rather than on the platform.

I'm not a molecular person at all, just a humble benchmarker seeking
help from the enlightened.

Any files you'd like (topol.top ...) or more debugging are available
on request.

Sorry this is so long. Any help would be appreciated.

Regards,
Mostyn



                         :-)  G  R  O  M  A  C  S  (-:

                God Rules Over Mankind, Animals, Cosmos and Such

                            :-)  VERSION 3.1.4  (-:


       Copyright (c) 1991-2002, University of Groningen, The Netherlands
         This program is free software; you can redistribute it and/or
          modify it under the terms of the GNU General Public License
         as published by the Free Software Foundation; either version 2
             of the License, or (at your option) any later version.

            :-)  /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/sparc-sun-solaris2.9/ultrasparc3/bin/grompp_mpi  (-:

Option     Filename  Type          Description
------------------------------------------------------------
  -f     grompp.mdp  Input, Opt.   grompp input file with MD parameters
 -po      mdout.mdp  Output        grompp input file with MD parameters
  -c       conf.gro  Input         Generic structure: gro g96 pdb tpr tpb tpa
  -r       conf.gro  Input, Opt.   Generic structure: gro g96 pdb tpr tpb tpa
  -n      index.ndx  Input, Opt.   Index file
-deshuf  deshuf.ndx  Output, Opt.  Index file
  -p      topol.top  Input         Topology file
 -pp  processed.top  Output, Opt.  Topology file
  -o      topol.tpr  Output        Generic run input: tpr tpb tpa
  -t       traj.trr  Input, Opt.   Full precision trajectory: trr trj

      Option   Type  Value  Description
------------------------------------------------------
      -[no]h   bool     no  Print help info and quit
      -[no]X   bool     no  Use dialog box GUI to edit command line options
       -nice    int      0  Set the nicelevel
      -[no]v   bool    yes  Be loud and noisy
       -time   real     -1  Take frame at or first after this time.
         -np    int      8  Generate statusfile for # nodes
-[no]shuffle   bool     no  Shuffle molecules over nodes
   -[no]sort   bool     no  Sort molecules according to X coordinate
-[no]rmdumbds  bool    yes  Remove constant bonded interactions with dummies
       -load string         Releative load capacity of each node on a parallel
                            machine. Be sure to use quotes around the string,
                            which should contain a number for each node
    -maxwarn    int     10  Number of warnings after which input processing
                            stops
-[no]check14   bool     no  Remove 1-4 interactions without Van der Waals

creating statusfile for 8 nodes...

Back Off! I just backed up mdout.mdp to ./#mdout.mdp.1#
Warning: as of GMX v 2.0 unit of compressibility is truly 1/bar
checking input for internal consistency...
calling /lib/cpp...
processing topology...
Generated 3 of the 3 non-bonded parameter combinations
Excluding 3 bonded neighbours for PE6000       1
processing coordinates...
double-checking input for internal consistency...
Cleaning up constraints and constant bonded interactions with dummy particles
renumbering atomtypes...
converting bonded parameters...
#      BONDS:   17997
#     ANGLES:   23992
#     RBDIHS:   29985
#   DUMMY3FD:   29990
#  DUMMY3FAD:   10
Setting particle type to Dummy for dummy atoms
initialising group options...
processing index file...
Analysing residue names:
Opening library file /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/share/gromacs/top/aminoacids.dat
There are:     1      OTHER residues
There are:     0    PROTEIN residues
There are:     0        DNA residues
Analysing Other...
Making dummy/rest group for Acceleration containing 12000 elements
Making dummy/rest group for Freeze containing 12000 elements
Making dummy/rest group for Energy Mon. containing 12000 elements
Making dummy/rest group for VCM containing 12000 elements
Number of degrees of freedom in T-Coupling group System is 17997.00
Making dummy/rest group for User1 containing 12000 elements
Making dummy/rest group for User2 containing 12000 elements
Making dummy/rest group for XTC containing 12000 elements
Making dummy/rest group for Or. Res. Fit containing 12000 elements
T-Coupling       has 1 element(s): System
Energy Mon.      has 1 element(s): rest
Acceleration     has 1 element(s): rest
Freeze           has 1 element(s): rest
User1            has 1 element(s): rest
User2            has 1 element(s): rest
VCM              has 1 element(s): rest
XTC              has 1 element(s): rest
Or. Res. Fit     has 1 element(s): rest
Checking consistency between energy and charge groups...
splitting topology...
There are 6000 charge group borders and 12000 shake borders
There are 6000 total borders
Division over nodes in atoms:
  1500  1500  1500  1500  1500  1500  1500  1500
writing run input file...

Back Off! I just backed up topol.tpr to ./#topol.tpr.1#

gcq#209: "Cut It Deep and Cut It Wide" (The Walkabouts)

NNODES=8, MYRANK=0, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=1, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=2, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=4, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=5, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems

NNODES=8, MYRANK=3, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=6, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=7, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NODEID=2 argc=3
NODEID=3 argc=3
NODEID=4 argc=3
NODEID=5 argc=3
NODEID=6 argc=3
NODEID=0 argc=3
NODEID=1 argc=3
NODEID=7 argc=3
                         :-)  G  R  O  M  A  C  S  (-:

               Giving Russians Opium May Alter Current Situation

                            :-)  VERSION 3.1.4  (-:


       Copyright (c) 1991-2002, University of Groningen, The Netherlands
         This program is free software; you can redistribute it and/or
          modify it under the terms of the GNU General Public License
         as published by the Free Software Foundation; either version 2
             of the License, or (at your option) any later version.

           :-)  /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/sparc-sun-solaris2.9/ultrasparc3/bin/mdrun_mpi  (-:

Option     Filename  Type          Description
------------------------------------------------------------
  -s      topol.tpr  Input         Generic run input: tpr tpb tpa
  -o       traj.trr  Output        Full precision trajectory: trr trj
  -x       traj.xtc  Output, Opt.  Compressed trajectory (portable xdr format)
  -c    confout.gro  Output        Generic structure: gro g96 pdb
  -e       ener.edr  Output        Generic energy: edr ene
  -g      pc2_8.log  Output        Log file
-dgdl      dgdl.xvg  Output, Opt.  xvgr/xmgr file
-table    table.xvg  Input, Opt.   xvgr/xmgr file
-rerun    rerun.xtc  Input, Opt.   Generic trajectory: xtc trr trj gro g96 pdb
 -ei        sam.edi  Input, Opt.   ED sampling input
 -eo        sam.edo  Output, Opt.  ED sampling output
  -j       wham.gct  Input, Opt.   General coupling stuff
 -jo        bam.gct  Input, Opt.   General coupling stuff
-ffout      gct.xvg  Output, Opt.  xvgr/xmgr file
-devout   deviatie.xvg  Output, Opt.  xvgr/xmgr file
-runav  runaver.xvg  Output, Opt.  xvgr/xmgr file
 -pi       pull.ppa  Input, Opt.   Pull parameters
 -po    pullout.ppa  Output, Opt.  Pull parameters
 -pd       pull.pdo  Output, Opt.  Pull data output
 -pn       pull.ndx  Input, Opt.   Index file
-mtx         nm.mtx  Output, Opt.  Hessian matrix

      Option   Type  Value  Description
------------------------------------------------------
      -[no]h   bool     no  Print help info and quit
      -[no]X   bool     no  Use dialog box GUI to edit command line options
       -nice    int     19  Set the nicelevel
     -deffnm string         Set the default filename for all file options
         -np    int      1  Number of nodes, must be the same as used for
                            grompp
      -[no]v   bool     no  Be loud and noisy
-[no]compact   bool    yes  Write a compact log file
  -[no]multi   bool     no  Do multiple simulations in parallel (only with -np
                            > 1)
   -[no]glas   bool     no  Do glass simulation with special long range
                            corrections
 -[no]ionize   bool     no  Do a simulation including the effect of an X-Ray
                            bombardment on your system


Back Off! I just backed up pc2_81.log to ./#pc2_81.log.1#

Back Off! I just backed up pc2_82.log to ./#pc2_82.log.1#

Back Off! I just backed up pc2_85.log to ./#pc2_85.log.1#

Back Off! I just backed up pc2_84.log to ./#pc2_84.log.1#

Back Off! I just backed up pc2_86.log to ./#pc2_86.log.1#

Back Off! I just backed up pc2_80.log to ./#pc2_80.log.1#

Back Off! I just backed up pc2_87.log to ./#pc2_87.log.1#

Back Off! I just backed up pc2_83.log to ./#pc2_83.log.1#
Reading file topol.tpr, VERSION 3.1.4 (single precision)
Reading file topol.tpr, VERSION 3.1.4 (single precision)

Back Off! I just backed up ener.edr to ./#ener.edr.1#
starting mdrun 'pe'
5000 steps,      5.0 ps.

Job cre.1762 on kiara: received signal SEGV (core dumped).
$ dbx /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/sparc-sun-solaris2.9/ultrasparc3/bin/mdrun_mpi core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.2' in your .dbxrc
Reading mdrun_mpi
dbx: internal warning: writable memory segment 0xfc400000[22626304] of size 0 in core
core file header read successfully
Reading ld.so.1
Reading libf77compat.so.1
Reading libfui.so.1
Reading libfai.so.1
Reading libfai2.so.1
Reading libfsumai.so.1
Reading libfprodai.so.1
Reading libfminlai.so.1
Reading libfmaxlai.so.1
Reading libfminvai.so.1
Reading libfmaxvai.so.1
Reading libfsu.so.1
Reading libsunmath.so.1
Reading libnsl.so.1
Reading libm.so.1
Reading libXm.so.4
Reading libXt.so.4
Reading libSM.so.6
Reading libICE.so.6
Reading libXext.so.0
Reading libXp.so.1
Reading libX11.so.4
Reading libsocket.so.1
Reading libmpi.so.1
Reading libc.so.1
Reading libdl.so.1
Reading libmp.so.2
Reading libtnfprobe.so.1
Reading libthread.so.1
Reading librte.so.1
Reading libhpcshm.so.1
Reading librt.so.1
Reading libaio.so.1
Reading libmd5.so.1
Reading libc_psr.so.1
Reading libfsu_isa.so.1
Reading libcre.so.2
Reading libgen.so.1
Reading librpcsvc.so.1
Reading libelf.so.1
Reading libproject.so.1
Reading libsecdb.so.1
Reading libproc.so.1
Reading libcmd.so.1
Reading librtld_db.so.1
Reading shmpm.so.2
Reading libmd5_psr.so.1
Reading rsmpm.so.2
Reading libdoor.so.1
Reading tcppm.so.2
t at 1 (l at 1) program terminated by signal SEGV (no mapping at the fault address)
Current function is rvec_inc
  235     x=a[XX]+b[XX];
(dbx) where
current thread: t at 1
=>[1] rvec_inc(a = 0x1099818, b = 0xffbfe610), line 235 in "vec.h"
  [2] angles(nbonds = 3000, forceatoms = 0x7ea3f0, forceparams = 0x775128, x = 0x865458, f = 0x9383f8, fr = 0x84d298, g = 0x670780, box = 0x6c2b68, lambda = 0.0, dvdlambda = 0xffbfe760, md = 0x77d338, ngrp = 1, egnb = 0x6ffd00, egcoul = 0x6ffcf0, fcd = 0x670480), line 535 in "bondfree.c"
  [3] calc_bonds(log = 0x66a7c4, cr = 0x66e580, mcr = (nil), idef = 0x6b6744, x_s = 0x865458, f = 0x9383f8, fr = 0x84d298, g = 0x670780, epot = 0x6703c8, nrnb = 0xffbfedf0, box = 0x6c2b68, lambda = 0.0, md = 0x77d338, ngrp = 1, egnb = 0x6ffd00, egcoul = 0x6ffcf0, fcd = 0x670480, step = 0, bSepDVDL = 0), line 112 in "bondfree.c"
  [4] force(fp = 0x66a7c4, step = 0, fr = 0x84d298, ir = 0x6c2978, idef = 0x6b6744, nsb = 0x6b5718, cr = 0x66e580, mcr = (nil), nrnb = 0xffbfedf0, grps = 0x670540, md = 0x77d338, ngener = 1, opts = 0x6c2af8, x = 0x865458, f = 0x9383f8, epot = 0x6703c8, fcd = 0x670480, bVerbose = 0, box = 0x6c2b68, lambda = 0.0, graph = 0x670780, excl = 0x6c18dc, bNBFonly = 0, lr_vir = 0xffbff220, mu_tot = 0xffbfedd0, qsum = 0.0, bGatherOnly = 0), line 963 in "force.c"
  [5] do_force(log = 0x66a7c4, cr = 0x66e580, mcr = (nil), parm = 0x6c2978, nsb = 0x6b5718, vir_part = 0xffbff244, pme_vir = 0xffbff220, step = 0, nrnb = 0xffbfedf0, top = 0x6b6740, grps = 0x670540, x = 0x865458, v = 0x8886e0, f = 0x9383f8, buf = 0x915170, mdatoms = 0x77d338, ener = 0x6703c8, fcd = 0x670480, bVerbose = 0, lambda = 0.0, graph = 0x670780, bNS = 1, bNBFonly = 0, fr = 0x84d298, mu_tot = 0xffbfedd0, bGatherOnly = 0), line 285 in "sim_util.c"
  [6] do_md(log = 0x66a7c4, cr = 0x66e580, mcr = (nil), nfile = 21, fnm = 0x5fa3e8, bVerbose = 0, bCompact = 1, bDummies = 1, dummycomm = 0xffbff418, stepout = 10, parm = 0x6c2978, grps = 0x670540, top = 0x6b6740, ener = 0x6703c8, fcd = 0x670480, x = 0x865458, vold = 0x97e908, v = 0x8886e0, vt = 0x95b680, f = 0x9383f8, buf = 0x915170, mdatoms = 0x77d338, nsb = 0x6b5718, nrnb = 0x6c2c28, graph = 0x670780, edyn = 0xffbff538, fr = 0x84d298, box_size = 0xffbff470, Flags = 0), line 510 in "md.c"
  [7] mdrunner(cr = 0x66e580, mcr = (nil), nfile = 21, fnm = 0x5fa3e8, bVerbose = 0, bCompact = 1, nDlb = 0, nstepout = 10, edyn = 0xffbff538, Flags = 0), line 197 in "md.c"
  [8] main(argc = 3, argv = 0x66f3e8), line 212 in "mdrun.c"
(dbx) list
  235     x=a[XX]+b[XX];
  236     y=a[YY]+b[YY];
  237     z=a[ZZ]+b[ZZ];
  238
  239     a[XX]=x;
  240     a[YY]=y;
  241     a[ZZ]=z;
  242   }
  243
  244   static inline void rvec_sub(const rvec a,const rvec b,rvec c)
(dbx) print a
a = 0x1099818
(dbx) print *a
dbx: cannot access address 0x1099818
(dbx) print b
b = 0xffbfe610
(dbx) print *b
*b = 30.9305
(dbx) up
Current function is angles
  535         rvec_inc(fr->fshift[t2],f_k);
(dbx) print f_k
f_k = (30.9305, 0.1243687, -49.10062)
(dbx) print fr->fshift[t2]
dbx: cannot access address 0x1099818
(dbx) print t2
t2 = 724928
(dbx) print fr->fshift
fr->fshift = 0x84db18
(dbx) print *fr->fshift
*fr->fshift = (0.0, 0.0, 0.0)
(dbx) print fr->fshift[724000]
dbx: cannot access address 0x1096c98
(dbx) print fr->fshift[2]
fr->fshift[2] = (0.0, 0.0, 0.0)
(dbx) whatis fr->fshift
float (*fshift)[3];
(dbx) whatis fr
t_forcerec *fr;
(dbx) print t1
t1 = 13
(dbx) print t2
t2 = 724928
(dbx) print dt_kj
dt_kj = (54073, 196608, 9002)
(dbx) quit
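
The addresses in the session above are consistent with a plain indexing
error rather than a corrupted fshift pointer. A small sketch of the pointer
arithmetic, assuming 4-byte floats so that one rvec is 12 bytes (the
function name rvec_element_address is hypothetical):

```c
#include <stdint.h>

typedef float rvec[3];   /* matches "float (*fshift)[3]" from whatis */

/* Address of element `index` in an rvec array that starts at `base`. */
static uintptr_t rvec_element_address(uintptr_t base, long index)
{
    return base + (uintptr_t)index * sizeof(rvec);
}
```

With base = 0x84db18 (fr->fshift), index 724928 gives 0x1099818, the
address dbx could not access, and index 724000 gives 0x1096c98, matching
the print fr->fshift[724000] attempt. So fr->fshift itself looks sane and
only t2 is wrong.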



