[gmx-users] Segmentation fault, mdrun_mpi

Ladasky blind.watchmaker at yahoo.com
Thu Oct 4 11:43:43 CEST 2012


So I have spent the past few weeks debugging my equilibration protocols,
which were an odd hybrid of examples ranging from GROMACS 3.3 up to GROMACS
4.5.  I have cleaned out old code.  I added an in vacuo energy minimization
step for the protein without solvent, and a missing NVT equilibration step
after the solvent is added.  I have dimly grasped that, as long as you don't require
compatibility with an older simulation, the V-rescale thermostat is the
current recommended choice, and that switching thermostats (unlike
barostats) can cause instabilities.  I now know how to examine and graph
macroscopic system parameters to assess stability.  I think that everything
should be looking good right now -- except that it isn't, not quite.
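
In case it clarifies what I mean, the thermostat lines in my production .mdp
now look roughly like the sketch below.  The group split and tau_t value are
illustrative rather than a literal copy of my file; ref_t matches my 310 K
target.

tcoupl  = V-rescale             ; currently recommended thermostat
tc-grps = Protein Non-Protein   ; illustrative temperature-coupling groups
tau_t   = 0.1     0.1           ; ps, illustrative coupling time constant
ref_t   = 310     310           ; K, my target temperature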

Now that I have finally started the production MD runs, I have received two
segmentation faults on two different test structures.  They take a LONG time
to appear -- over 1,070,000 iterations on one run, and over 2,360,000
iterations on another.  On top of that, I'm not getting my usual error
messages -- no PME warnings, no SETTLE errors.  I'm also not getting a dump
of the last frame of my simulation.
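
One thing I can still try is restarting from the last checkpoint that mdrun
wrote, to see whether the crash reproduces around the same step.  A sketch of
what I have in mind (assuming the default state.cpt checkpoint name):

mpirun -np 5 mdrun_mpi -s test-prep.tpr -cpi state.cpt -append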

It was hard enough to accept that my simulation parameters were set up
incorrectly when runs failed 100,000 steps into production MD.  Am I really
supposed to believe that I still have instability problems more than a
million steps in?

Here is the terminal output from one run (executing mdrun_mpi):

Reading file test-prep.tpr, VERSION 4.5.4 (single precision)
Making 1D domain decomposition 5 x 1 x 1
starting mdrun 'Protein t=   0.00000 in water'
2500000 steps,   5000.0 ps.
[john-linux:09596] *** Process received signal ***
[john-linux:09596] Signal: Segmentation fault (11)
[john-linux:09596] Signal code: Address not mapped (1)
[john-linux:09596] Failing at address: 0x3e950840
[john-linux:09596] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060)
[0x7f8a8ad5c060]
[john-linux:09596] [ 1] /usr/lib/libgmx_mpi.openmpi.so.6(+0x1f9670)
[0x7f8a8b413670]
[john-linux:09596] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 9596 on node john-linux exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


The .log file does NOT contain any error messages indicating any kind of
instability.  It simply ends with the usual long chain of energy report
blocks.  Here's the last one:

DD  step 1078799 load imb.: force  1.8%

           Step           Time         Lambda
        1078800     2157.60000        0.00000

   Energies (kJ/mol)
       G96Angle    Proper Dih.  Improper Dih.          LJ-14     Coulomb-14
    2.07623e+03    8.42439e+02    6.03967e+02   -2.12322e+02    1.95589e+04
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential
    8.12766e+04   -9.12661e+02   -5.96406e+05   -4.40546e+04   -5.37227e+05
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    9.76293e+04   -4.39598e+05    3.10969e+02   -7.72781e+01    3.95185e+01
   Constr. rmsd
    1.90839e-05


I'm not a low-level programmer, so I don't have to deal with this much,
but... a segmentation fault generally indicates that a program is trying to
read or write memory outside of what has been allocated to it.  The third
line of the error message sent to the shell ("Address not mapped") would seem
to indicate exactly that.  That doesn't sound like it has anything to do with
my simulation being unstable.  (However, with applications written in C, I'm
willing to believe anything.)
I did check on my memory usage.  I have 8 GB of RAM on my system, running
Ubuntu Linux 11.10, AMD 64-bit.  At most, I'm using a bit more than half of
my RAM (I have other, undemanding applications open besides my GROMACS
terminal windows, and I also reserved one CPU core to run those apps).  I
think that I should be fine.
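
For what it's worth, this is roughly how I checked (the grep patterns are the
sort of thing I was looking for, not output I actually found):

free -m                          # total vs. used RAM while mdrun_mpi is running
dmesg | grep -i -e "segfault" -e "out of memory"    # any kernel-level hints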

If it would help, I can repost my cleaned-up MDP files.  I can also post
graphs of potential, pressure, temperature, density, etc., from any phase of
my protocol.  Or you could just take my word for it that all of these
parameters converge nicely during my equilibration procedure, and then remain
stable throughout the production MD run.  My target temperature is 310 K
(37 C), and I get very close to that value on average.  My average pressure
and density readings (0.80 bar and 988 kg/m^3, respectively) are both a bit
lower than my targets, but they are consistent.  I have examined a series of
snapshots of my protein, and it isn't undergoing any radical movements.
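
The graphs themselves come straight out of g_energy followed by a plot of the
resulting .xvg file -- a sketch, assuming the default ener.edr energy file
name and an arbitrary props.xvg output name:

echo "Potential Pressure Temperature Density" | g_energy -f ener.edr -o props.xvg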

My systems are on the small side, under 50,000 atoms, and they contain
nothing but amino acids and water molecules.

Puzzled once again.  Thanks for your advice!


