[gmx-users] Re: Segmentation fault, mdrun_mpi

Justin Lemkul jalemkul at vt.edu
Mon Oct 8 02:59:13 CEST 2012



On 10/7/12 2:15 PM, Ladasky wrote:
> Justin Lemkul wrote
>> Random segmentation faults are really hard to debug.  Can you resume the
>> run using a checkpoint file?  If it resumes cleanly, that would suggest an
>> MPI problem or something else external to Gromacs.  Without a reproducible
>> system and a debugging backtrace, it's going to be hard to figure out where
>> the problem is coming from.
>
> Thanks for that tip, Justin.  I tried to resume one run which failed at 1.06
> million steps, and it WORKED.  It proceeded all the way to the 2.50 million
> steps that I designated.  I now have two separate .trr files, but I suppose
> they can be merged.
>
> I don't know whether my crashes are random yet.  I will try re-running that
> simulation from time zero, to see whether it segfaults at the same
> place.  If it doesn't, then I have a problem which may have nothing to do
> with GROMACS.
>
> I looked in on memory usage several times while mdrun_mpi was executing.
> Overall, about 3 GB of my computer's 8 GB of RAM were in use.  As I
> expected, GROMACS used very little of this.  The mpirun process used a
> constant 708K.  I had five mdrun_mpi processes, all of which used slightly
> more RAM as they worked, but I didn't notice anything which suggested a
> gross memory leak.  The process which used the most RAM was using 14.4 MB
> right after it started, rose to 15.9 MB within the first ten minutes or so,
> and reached 16.0 MB after four hours.  The process which used the least RAM
> started at 10.6 MB and finished at 10.8 MB.  Altogether, GROMACS was using
> about 64 MB.
>
> I have a well-cooled CPU; core temperatures stay under 50 degrees Celsius
> when the system is running under full load.  My system doesn't lock up or crash on
> me.  I think that my hardware is good.
>
>

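On merging the output: since the continuation was started from a checkpoint 
(presumably via mdrun -cpi), the two trajectory files can simply be 
concatenated with trjcat.  A minimal sketch, assuming 4.5-era tool names and 
that the two pieces are called part1.trr and part2.trr (substitute your own 
file names; topol.tpr and state.cpt are just the defaults):

   # resume from the last checkpoint (roughly what you already did)
   mpirun -np 5 mdrun_mpi -s topol.tpr -cpi state.cpt

   # join the two pieces into one trajectory; by default trjcat discards
   # duplicate time frames at the restart point
   trjcat -f part1.trr part2.trr -o whole.trr

Running mdrun with -append instead writes the continuation onto the end of the 
existing files, so you end up with a single .trr in the first place.

As for the segfaults themselves:
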
My first guess would be a buggy MPI implementation.  I can't comment on hardware 
specs, but usually the random failures seen in mdrun_mpi are a result of some 
generic MPI failure.  What MPI are you using?
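
If you're not sure, the launcher will usually tell you.  A quick check, 
assuming a typical Open MPI or MPICH installation on the path:

   mpirun --version           # both Open MPI and MPICH report a version here
   ompi_info | head           # Open MPI only: version, compiler, build flags

   # see which MPI library mdrun_mpi is actually linked against
   ldd `which mdrun_mpi` | grep -i mpi

If it turns out to be an old distribution-packaged MPI, it may be worth 
upgrading or rebuilding against a current release and seeing whether the 
random segfaults go away.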

-Justin

-- 
========================================

Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================
