[gmx-users] Need help on a SEGV mdrun mpi failure
Mostyn Lewis
Mostyn.Lewis at sun.com
Fri Oct 24 18:21:01 CEST 2003
Hello,
I sent this last night but it seems to have been lost (maybe because it
had a 270K attachment, topol.top.bz2?), so here goes again.
I'm having a problem with a benchmark case which causes a SEGV (signal 11)
in most MPI runs with more than 4 CPUs. The failure is always
in bondfree.c (gromacs-3.1.4 and gromacs-3.1.5_pre1), in the angles routine
at line 535:
ivec_sub(SHIFT_IVEC(g,ai),jt,dt_ij);
ivec_sub(SHIFT_IVEC(g,ak),jt,dt_kj);
t1=IVEC2IS(dt_ij);
t2=IVEC2IS(dt_kj);
rvec_inc(fr->fshift[t1],f_i);
rvec_inc(fr->fshift[CENTRAL],f_j);
----->rvec_inc(fr->fshift[t2],f_k);
} /* 168 TOTAL */
This line has a bad t2 value, which causes an out-of-bounds reference
(the fault actually occurs a little later, at x=a[XX]+b[XX]; on line 235
of vec.h, due to the expansion of rvec_inc).
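For reference, here is a minimal sketch of how such a shift index is presumably computed, and of a guard that would trap a bad value before it is used. The 3x3x3 layout and all names below (shift_index, shift_in_range, add_shift_force) are assumptions for illustration, not copied from the Gromacs source. The layout does agree with the values in the dbx session further down: a zero shift gives 13 (the t1 value), and dt_kj = (54073, 196608, 9002) gives 724928 (the t2 value).

```c
#include <assert.h>

/* Assumed layout: one shift vector per neighbouring periodic image,
 * -1..+1 in each dimension, so 3*3*3 = 27 entries. These constants are
 * an assumption for illustration, not copied from the Gromacs source. */
enum { D_BOX = 1, N_BOX = 2 * D_BOX + 1, N_SHIFTS = N_BOX * N_BOX * N_BOX };

/* Map an integer shift (x,y,z) to a flat index into the shift-force array. */
static int shift_index(const int iv[3])
{
    return N_BOX * (N_BOX * (iv[2] + D_BOX) + iv[1] + D_BOX) + iv[0] + D_BOX;
}

/* Return 1 when every component lies within -D_BOX..+D_BOX, else 0. */
static int shift_in_range(const int iv[3])
{
    int d;
    for (d = 0; d < 3; d++) {
        if (iv[d] < -D_BOX || iv[d] > D_BOX) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical debug guard: abort on a bad shift instead of writing
 * far outside the array. */
static void add_shift_force(float fshift[][3], const int iv[3],
                            const float f[3])
{
    int is, d;
    assert(shift_in_range(iv));
    is = shift_index(iv);
    for (d = 0; d < 3; d++) {
        fshift[is][d] += f[d];
    }
}
```

If this reading is right, the index arithmetic itself is behaving as designed, and the real fault is upstream: the shift vectors being subtracted for atoms ai/ak and the central atom already hold garbage on the failing node.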
I enclose a run below with the grompp and mdrun_mpi output followed by
some dbx debugging output showing some values. This was on a 24 CPU
SUN SMP box (Sunfire 6800) using 8 CPUs.
I get the same failure on a cluster of Linux (2 CPU Xeon) boxes doing
MPI across Gigabit ethernet. On Linux the failure occurs with both
Intel and PGI compilers and with both LAM and mpich, so I think it
depends on the problem and/or Gromacs rather than on the platform.
I'm not a molecular person at all, just a humble benchmarker, and I seek
help from the enlightened.
Any files you'd like (topol.top ...) or more debugging are available
on request.
Sorry this is so long. Any help would be appreciated.
Regards,
Mostyn
:-) G R O M A C S (-:
God Rules Over Mankind, Animals, Cosmos and Such
:-) VERSION 3.1.4 (-:
Copyright (c) 1991-2002, University of Groningen, The Netherlands
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/sparc-sun-solaris2.9/ultrasparc3/bin/grompp_mpi (-:
Option Filename Type Description
------------------------------------------------------------
-f grompp.mdp Input, Opt. grompp input file with MD parameters
-po mdout.mdp Output grompp input file with MD parameters
-c conf.gro Input Generic structure: gro g96 pdb tpr tpb tpa
-r conf.gro Input, Opt. Generic structure: gro g96 pdb tpr tpb tpa
-n index.ndx Input, Opt. Index file
-deshuf deshuf.ndx Output, Opt. Index file
-p topol.top Input Topology file
-pp processed.top Output, Opt. Topology file
-o topol.tpr Output Generic run input: tpr tpb tpa
-t traj.trr Input, Opt. Full precision trajectory: trr trj
Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-[no]X bool no Use dialog box GUI to edit command line options
-nice int 0 Set the nicelevel
-[no]v bool yes Be loud and noisy
-time real -1 Take frame at or first after this time.
-np int 8 Generate statusfile for # nodes
-[no]shuffle bool no Shuffle molecules over nodes
-[no]sort bool no Sort molecules according to X coordinate
-[no]rmdumbds bool yes Remove constant bonded interactions with dummies
-load string Releative load capacity of each node on a parallel
machine. Be sure to use quotes around the string,
which should contain a number for each node
-maxwarn int 10 Number of warnings after which input processing
stops
-[no]check14 bool no Remove 1-4 interactions without Van der Waals
creating statusfile for 8 nodes...
Back Off! I just backed up mdout.mdp to ./#mdout.mdp.1#
Warning: as of GMX v 2.0 unit of compressibility is truly 1/bar
checking input for internal consistency...
calling /lib/cpp...
processing topology...
Generated 3 of the 3 non-bonded parameter combinations
Excluding 3 bonded neighbours for PE6000 1
processing coordinates...
double-checking input for internal consistency...
Cleaning up constraints and constant bonded interactions with dummy particles
renumbering atomtypes...
converting bonded parameters...
# BONDS: 17997
# ANGLES: 23992
# RBDIHS: 29985
# DUMMY3FD: 29990
# DUMMY3FAD: 10
Setting particle type to Dummy for dummy atoms
initialising group options...
processing index file...
Analysing residue names:
Opening library file /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/share/gromacs/top/aminoacids.dat
There are: 1 OTHER residues
There are: 0 PROTEIN residues
There are: 0 DNA residues
Analysing Other...
Making dummy/rest group for Acceleration containing 12000 elements
Making dummy/rest group for Freeze containing 12000 elements
Making dummy/rest group for Energy Mon. containing 12000 elements
Making dummy/rest group for VCM containing 12000 elements
Number of degrees of freedom in T-Coupling group System is 17997.00
Making dummy/rest group for User1 containing 12000 elements
Making dummy/rest group for User2 containing 12000 elements
Making dummy/rest group for XTC containing 12000 elements
Making dummy/rest group for Or. Res. Fit containing 12000 elements
T-Coupling has 1 element(s): System
Energy Mon. has 1 element(s): rest
Acceleration has 1 element(s): rest
Freeze has 1 element(s): rest
User1 has 1 element(s): rest
User2 has 1 element(s): rest
VCM has 1 element(s): rest
XTC has 1 element(s): rest
Or. Res. Fit has 1 element(s): rest
Checking consistency between energy and charge groups...
splitting topology...
There are 6000 charge group borders and 12000 shake borders
There are 6000 total borders
Division over nodes in atoms:
1500 1500 1500 1500 1500 1500 1500 1500
writing run input file...
Back Off! I just backed up topol.tpr to ./#topol.tpr.1#
gcq#209: "Cut It Deep and Cut It Wide" (The Walkabouts)
NNODES=8, MYRANK=0, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=1, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=2, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=4, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=5, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=3, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=6, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NNODES=8, MYRANK=7, HOSTNAME=kiara brc.West.Sun.COM SunOS 5.9 SUNW,Sun-Fire Sun_Microsystems
NODEID=2 argc=3
NODEID=3 argc=3
NODEID=4 argc=3
NODEID=5 argc=3
NODEID=6 argc=3
NODEID=0 argc=3
NODEID=1 argc=3
NODEID=7 argc=3
:-) G R O M A C S (-:
Giving Russians Opium May Alter Current Situation
:-) VERSION 3.1.4 (-:
Copyright (c) 1991-2002, University of Groningen, The Netherlands
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-) /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/sparc-sun-solaris2.9/ultrasparc3/bin/mdrun_mpi (-:
Option Filename Type Description
------------------------------------------------------------
-s topol.tpr Input Generic run input: tpr tpb tpa
-o traj.trr Output Full precision trajectory: trr trj
-x traj.xtc Output, Opt. Compressed trajectory (portable xdr format)
-c confout.gro Output Generic structure: gro g96 pdb
-e ener.edr Output Generic energy: edr ene
-g pc2_8.log Output Log file
-dgdl dgdl.xvg Output, Opt. xvgr/xmgr file
-table table.xvg Input, Opt. xvgr/xmgr file
-rerun rerun.xtc Input, Opt. Generic trajectory: xtc trr trj gro g96 pdb
-ei sam.edi Input, Opt. ED sampling input
-eo sam.edo Output, Opt. ED sampling output
-j wham.gct Input, Opt. General coupling stuff
-jo bam.gct Input, Opt. General coupling stuff
-ffout gct.xvg Output, Opt. xvgr/xmgr file
-devout deviatie.xvg Output, Opt. xvgr/xmgr file
-runav runaver.xvg Output, Opt. xvgr/xmgr file
-pi pull.ppa Input, Opt. Pull parameters
-po pullout.ppa Output, Opt. Pull parameters
-pd pull.pdo Output, Opt. Pull data output
-pn pull.ndx Input, Opt. Index file
-mtx nm.mtx Output, Opt. Hessian matrix
Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-[no]X bool no Use dialog box GUI to edit command line options
-nice int 19 Set the nicelevel
-deffnm string Set the default filename for all file options
-np int 1 Number of nodes, must be the same as used for
grompp
-[no]v bool no Be loud and noisy
-[no]compact bool yes Write a compact log file
-[no]multi bool no Do multiple simulations in parallel (only with -np
> 1)
-[no]glas bool no Do glass simulation with special long range
corrections
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system
Back Off! I just backed up pc2_81.log to ./#pc2_81.log.1#
Back Off! I just backed up pc2_82.log to ./#pc2_82.log.1#
Back Off! I just backed up pc2_85.log to ./#pc2_85.log.1#
Back Off! I just backed up pc2_84.log to ./#pc2_84.log.1#
Back Off! I just backed up pc2_86.log to ./#pc2_86.log.1#
Back Off! I just backed up pc2_80.log to ./#pc2_80.log.1#
Back Off! I just backed up pc2_87.log to ./#pc2_87.log.1#
Back Off! I just backed up pc2_83.log to ./#pc2_83.log.1#
Reading file topol.tpr, VERSION 3.1.4 (single precision)
Reading file topol.tpr, VERSION 3.1.4 (single precision)
Back Off! I just backed up ener.edr to ./#ener.edr.1#
starting mdrun 'pe'
5000 steps, 5.0 ps.
Job cre.1762 on kiara: received signal SEGV (core dumped).
$ dbx /ptmp/mostyn/theogone/cg31_bench/Gromacs/bench/sparc-sun-solaris2.9/ultrasparc3/bin/mdrun_mpi core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.2' in your .dbxrc
Reading mdrun_mpi
dbx: internal warning: writable memory segment 0xfc400000[22626304] of size 0 in core
core file header read successfully
Reading ld.so.1
Reading libf77compat.so.1
Reading libfui.so.1
Reading libfai.so.1
Reading libfai2.so.1
Reading libfsumai.so.1
Reading libfprodai.so.1
Reading libfminlai.so.1
Reading libfmaxlai.so.1
Reading libfminvai.so.1
Reading libfmaxvai.so.1
Reading libfsu.so.1
Reading libsunmath.so.1
Reading libnsl.so.1
Reading libm.so.1
Reading libXm.so.4
Reading libXt.so.4
Reading libSM.so.6
Reading libICE.so.6
Reading libXext.so.0
Reading libXp.so.1
Reading libX11.so.4
Reading libsocket.so.1
Reading libmpi.so.1
Reading libc.so.1
Reading libdl.so.1
Reading libmp.so.2
Reading libtnfprobe.so.1
Reading libthread.so.1
Reading librte.so.1
Reading libhpcshm.so.1
Reading librt.so.1
Reading libaio.so.1
Reading libmd5.so.1
Reading libc_psr.so.1
Reading libfsu_isa.so.1
Reading libcre.so.2
Reading libgen.so.1
Reading librpcsvc.so.1
Reading libelf.so.1
Reading libproject.so.1
Reading libsecdb.so.1
Reading libproc.so.1
Reading libcmd.so.1
Reading librtld_db.so.1
Reading shmpm.so.2
Reading libmd5_psr.so.1
Reading rsmpm.so.2
Reading libdoor.so.1
Reading tcppm.so.2
t@1 (l@1) program terminated by signal SEGV (no mapping at the fault address)
Current function is rvec_inc
235 x=a[XX]+b[XX];
(dbx) where
current thread: t@1
=>[1] rvec_inc(a = 0x1099818, b = 0xffbfe610), line 235 in "vec.h"
[2] angles(nbonds = 3000, forceatoms = 0x7ea3f0, forceparams = 0x775128, x = 0x865458, f = 0x9383f8, fr = 0x84d298, g = 0x670780, box = 0x6c2b68, lambda = 0.0, dvdlambda = 0xffbfe760, md = 0x77d338, ngrp = 1, egnb = 0x6ffd00, egcoul = 0x6ffcf0, fcd = 0x670480), line 535 in "bondfree.c"
[3] calc_bonds(log = 0x66a7c4, cr = 0x66e580, mcr = (nil), idef = 0x6b6744, x_s = 0x865458, f = 0x9383f8, fr = 0x84d298, g = 0x670780, epot = 0x6703c8, nrnb = 0xffbfedf0, box = 0x6c2b68, lambda = 0.0, md = 0x77d338, ngrp = 1, egnb = 0x6ffd00, egcoul = 0x6ffcf0, fcd = 0x670480, step = 0, bSepDVDL = 0), line 112 in "bondfree.c"
[4] force(fp = 0x66a7c4, step = 0, fr = 0x84d298, ir = 0x6c2978, idef = 0x6b6744, nsb = 0x6b5718, cr = 0x66e580, mcr = (nil), nrnb = 0xffbfedf0, grps = 0x670540, md = 0x77d338, ngener = 1, opts = 0x6c2af8, x = 0x865458, f = 0x9383f8, epot = 0x6703c8, fcd = 0x670480, bVerbose = 0, box = 0x6c2b68, lambda = 0.0, graph = 0x670780, excl = 0x6c18dc, bNBFonly = 0, lr_vir = 0xffbff220, mu_tot = 0xffbfedd0, qsum = 0.0, bGatherOnly = 0), line 963 in "force.c"
[5] do_force(log = 0x66a7c4, cr = 0x66e580, mcr = (nil), parm = 0x6c2978, nsb = 0x6b5718, vir_part = 0xffbff244, pme_vir = 0xffbff220, step = 0, nrnb = 0xffbfedf0, top = 0x6b6740, grps = 0x670540, x = 0x865458, v = 0x8886e0, f = 0x9383f8, buf = 0x915170, mdatoms = 0x77d338, ener = 0x6703c8, fcd = 0x670480, bVerbose = 0, lambda = 0.0, graph = 0x670780, bNS = 1, bNBFonly = 0, fr = 0x84d298, mu_tot = 0xffbfedd0, bGatherOnly = 0), line 285 in "sim_util.c"
[6] do_md(log = 0x66a7c4, cr = 0x66e580, mcr = (nil), nfile = 21, fnm = 0x5fa3e8, bVerbose = 0, bCompact = 1, bDummies = 1, dummycomm = 0xffbff418, stepout = 10, parm = 0x6c2978, grps = 0x670540, top = 0x6b6740, ener = 0x6703c8, fcd = 0x670480, x = 0x865458, vold = 0x97e908, v = 0x8886e0, vt = 0x95b680, f = 0x9383f8, buf = 0x915170, mdatoms = 0x77d338, nsb = 0x6b5718, nrnb = 0x6c2c28, graph = 0x670780, edyn = 0xffbff538, fr = 0x84d298, box_size = 0xffbff470, Flags = 0), line 510 in "md.c"
[7] mdrunner(cr = 0x66e580, mcr = (nil), nfile = 21, fnm = 0x5fa3e8, bVerbose = 0, bCompact = 1, nDlb = 0, nstepout = 10, edyn = 0xffbff538, Flags = 0), line 197 in "md.c"
[8] main(argc = 3, argv = 0x66f3e8), line 212 in "mdrun.c"
(dbx) list
235 x=a[XX]+b[XX];
236 y=a[YY]+b[YY];
237 z=a[ZZ]+b[ZZ];
238
239 a[XX]=x;
240 a[YY]=y;
241 a[ZZ]=z;
242 }
243
244 static inline void rvec_sub(const rvec a,const rvec b,rvec c)
(dbx) print a
a = 0x1099818
(dbx) print *a
dbx: cannot access address 0x1099818
(dbx) print b
b = 0xffbfe610
(dbx) print *b
*b = 30.9305
(dbx) up
Current function is angles
535 rvec_inc(fr->fshift[t2],f_k);
(dbx) print f_k
f_k = (30.9305, 0.1243687, -49.10062)
(dbx) print fr->fshift[t2]
dbx: cannot access address 0x1099818
(dbx) print t2
t2 = 724928
(dbx) print fr->fshift
fr->fshift = 0x84db18
(dbx) print *fr->fshift
*fr->fshift = (0.0, 0.0, 0.0)
(dbx) print fr->fshift[724000]
dbx: cannot access address 0x1096c98
(dbx) print fr->fshift[2]
fr->fshift[2] = (0.0, 0.0, 0.0)
(dbx) whatis fr->fshift
float (*fshift)[3];
(dbx) whatis fr
t_forcerec *fr;
(dbx) print t1
t1 = 13
(dbx) print t2
t2 = 724928
(dbx) print dt_kj
dt_kj = (54073, 196608, 9002)
(dbx) quit
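The addresses in the dbx session above can be cross-checked with a little arithmetic. The sketch below assumes each fshift element is three 4-byte floats (12 bytes), as the "float (*fshift)[3]" output suggests. The helper names are hypothetical. With the base address 0x84db18, index 724928 lands on 0x1099818 (the faulting address) and index 724000 lands on 0x1096c98, both matching what dbx reported.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of element `index` in an array of float[3] starting at
 * `base`. Assumes sizeof(float) == 4, i.e. 12 bytes per element. */
static uint64_t rvec_element_address(uint64_t base, uint64_t index)
{
    return base + index * (uint64_t)(3 * sizeof(float));
}

/* Self-check against the values printed by dbx in this message. */
static void check_against_dbx_output(void)
{
    const uint64_t fshift = 0x84db18u;  /* print fr->fshift */

    /* print fr->fshift[t2] with t2 = 724928 */
    assert(rvec_element_address(fshift, 724928u) == 0x1099818u);
    /* print fr->fshift[724000] */
    assert(rvec_element_address(fshift, 724000u) == 0x1096c98u);
}
```

So the crash is a plain indexing fault: fshift itself is intact, and only the t2 index is wrong. If the array really holds 27 entries (an assumption, not confirmed here), the last valid element would start at 0x84dc50, far below the faulting address.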