[gmx-users] Problem with gromacs-3.3 using PME

Fri Jan 6 16:21:17 CET 2006

Hi,

A couple of weeks ago I sent a message to the list reporting a problem
I was having with gromacs-3.3 when using PME.  Gromacs was crashing
with a segmentation fault in pme.c after what I believe to be a small
number of timesteps.

The initial configuration was: 8 EM64T Xeon CPUs running RHEL WS
release 3, with LAM 7.1.1 and the GNU compilers.

I tried varying a number of things:
- compiling with the Intel compilers instead of GNU
- compiling with the Intel 32-bit compilers
- compiling without MPI and running sequentially on EM64T
- compiling without MPI and running sequentially on a 32-bit Xeon

All of these runs failed when using PME, but ran fine using cut-off.
I'll append some debugging info at the end of this email.

I then compiled gromacs-3.2.1, both sequentially and with LAM, and
was able to run the PME input case without problems.

It sure looks to me like a bug that is triggered by PME.  What is the
protocol for submitting a bug report?  I'd be happy to provide
whatever debugging info would be helpful.

I'm also curious: what would I be giving up by using gromacs-3.2.1
rather than 3.3?

I apologize if this is the incorrect list for this message; perhaps it
should have gone to the developers list directly.  Please let me know
if that is the case.

Rob Bjornson

<begin debugging info>

Here is a sample stack trace from a sequential run on 32bit Xeon:

#0  0x0807e46b in spread_q_bsplines (grid=0x82fcf68, idx=0x424c9008,
charge=0x40c82008, theta=0x828b524, nr=72240, order=6, nnx=0x8330c80,
nny=0x8331138,
    nnz=0x83315f0) at pme.c:527
#1  0x080814fd in spread_on_grid (logfile=0x8291068, grid=0x82fcf68,
homenr=72240, pme_order=6, x=0x409be008, charge=0x40c82008,
box=0x82911dc,
    bGatherOnly=0, bHaveSplines=0) at pme.c:1180
#2  0x080817da in do_pme (logfile=0x8291068, bVerbose=0, ir=0x82a2ee0,
x=0x409be008, f=0x417ba008, chargeA=0x40c82008, chargeB=0x40cc9008,
box=0x82911dc,
    cr=0x8291008, nsb=0x8292400, nrnb=0xbfffcfe0, vir=0x82fc27c,
ewaldcoeff=3.47045946, bFreeEnergy=0, lambda=0, dvdlambda=0xbfffca5c,
bGatherOnly=0)
    at pme.c:1276
#3  0x0806a83f in force (fplog=0x8291068, step=25, fr=0x82fc178,
ir=0x82a2ee0, idef=0x8293424, nsb=0x8292400, cr=0x8291008, mcr=0x0,
nrnb=0xbfffcfe0,
    grps=0x8291ed8, md=0x8291720, ngener=2, opts=0x82a30bc,
x=0x409be008, f=0x4111a008, epot=0x8291de8, fcd=0x8292318, bVerbose=0,
box=0x82911dc,
    lambda=0, graph=0x8291858, excl=0x82a1e2c, bNBFonly=0,
bDoForces=1, mu_tot=0xbfffcb20, bGatherOnly=0, edyn=0xbfffd8d0) at
force.c:1306
#4  0x0808f003 in do_force (fplog=0x8291068, cr=0x8291008, mcr=0x0,
inputrec=0x82a2ee0, nsb=0x8292400, step=25, nrnb=0xbfffcfe0,
top=0x8293420,
    grps=0x8291ed8, box=0x82911dc, x=0x409be008, f=0x4111a008,
buf=0x41046008, mdatoms=0x8291720, ener=0x8291de8, fcd=0x8292318,
bVerbose=0, lambda=0,
    graph=0x8291858, bStateChanged=1, bNS=0, bNBFonly=0, bDoForces=1,
fr=0x82fc178, mu_tot=0xbfffcfb0, bGatherOnly=0, t=0.0250000004,
field=0x0,
    edyn=0xbfffd8d0) at sim_util.c:334
#5  0x08059100 in do_md (log=0x8291068, cr=0x8291008, mcr=0x0,
nfile=25, fnm=0x82840a0, bVerbose=0, bCompact=1, bVsites=0,
vsitecomm=0x0, stepout=10,
    inputrec=0x82a2ee0, grps=0x8291ed8, top=0x8293420, ener=0x8291de8,
fcd=0x8292318, state=0x82911d0, vold=0x412c2008, vt=0x411ee008,
f=0x4111a008,
    buf=0x41046008, mdatoms=0x8291720, nsb=0x8292400, nrnb=0x82a3188,
graph=0x8291858, edyn=0xbfffd8d0, fr=0x82fc178, repl_ex_nst=0,
repl_ex_seed=-1,
    Flags=0) at md.c:622
#6  0x08057dda in mdrunner (cr=0x8291008, mcr=0x0, nfile=25,
fnm=0x82840a0, bVerbose=0, bCompact=1, nDlb=0, nstepout=10,
edyn=0xbfffd8d0, repl_ex_nst=0,
    repl_ex_seed=-1, Flags=0) at md.c:227
#7  0x0805ad10 in main (argc=3, argv=0xbfffd984) at mdrun.c:253

Examining things under gdb revealed that one element of the idxptr
array appeared to have been corrupted:

(gdb) print idxptr[0]
$33 = 1062338964
(gdb) print idxptr[1]
$34 = 12
(gdb) print idxptr[2]
$35 = 32
(gdb) print idxptr[-1]
$36 = 24
(gdb) print idxptr[-2]
$37 = 59

(gdb) print nx
$38 = 60
(gdb) print ny
$39 = 60
(gdb) print nz
$40 = 60
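
To make the mismatch explicit: with nx = ny = nz = 60, each index
should lie in [0, 60), and only idxptr[0] violates that.  Below is a
minimal standalone sketch of such a bounds check.  It is not GROMACS
code: in_range and count_bad_indices are hypothetical helpers, and the
half-open interval lo <= v < hi is my assumption about what a valid
grid index looks like.

```c
#include <stddef.h>

/* Hypothetical helper (not part of GROMACS):
 * returns 1 if lo <= v < hi, 0 otherwise. */
int in_range(int v, int lo, int hi)
{
    return v >= lo && v < hi;
}

/* Count how many of the three grid indices (x, y, z) fall outside
 * [0, dims[d]).  dims holds nx, ny, nz in that order. */
int count_bad_indices(const int idx[3], const int dims[3])
{
    int d;
    int bad = 0;

    for (d = 0; d < 3; d++) {
        if (!in_range(idx[d], 0, dims[d])) {
            bad++;
        }
    }
    return bad;
}
```

Called with the values from the gdb session (idx = {1062338964, 12,
32}, dims = {60, 60, 60}), count_bad_indices reports one out-of-range
index, the x component.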

Here is the code in question (pme.c).  The segmentation fault occurs
on line 527, after xidx was (apparently erroneously) set to a very
large value on line 515.  Note that DEBUG wasn't defined for my
compilation, so the range_check calls on lines 519-521 were compiled
out.

510       for(n=0; (n<nr); n++) {
511         qn     = charge[n];
512         idxptr = idx[n];
513
514         if (qn != 0) {
515           xidx    = idxptr[XX];
516           yidx    = idxptr[YY];
517           zidx    = idxptr[ZZ];
518     #ifdef DEBUG
519           range_check(xidx,0,nx);
520           range_check(yidx,0,ny);
521           range_check(zidx,0,nz);
522     #endif
523           i0      = ii0+xidx; /* Pointer arithmetic */
524           norder  = n*4;
525           norder1 = norder+4;
526
527           i = ii0[xidx];
528           j = jj0[yidx];
529           k = kk0[zidx];
530