[gmx-users] Re: ci barely out of bounds

David van der Spoel spoel at xray.bmc.uu.se
Tue May 29 21:14:40 CEST 2007

Matteo Guglielmi wrote:
> David van der Spoel wrote:
>> Matteo Guglielmi wrote:
>>> David van der Spoel wrote:
>>>> Matteo Guglielmi wrote:
>>>>> David van der Spoel wrote:
>>>>>> chris.neale at utoronto.ca wrote:
>>>>>>> This email refers to my original posts on this topic:
>>>>>>> http://www.gromacs.org/pipermail/gmx-users/2006-October/024154.html
>>>>>>> http://www.gromacs.org/pipermail/gmx-users/2006-October/024333.html
>>>>>>> I have previously posted some of this information to somebody else's
>>>>>>> bugzilla post #109 including a possible workaround
>>>>>>> http://bugzilla.gromacs.org/show_bug.cgi?id=109
>>>>>>> I have compiled my gromacs always using one of the gcc3.3*
>>>>>>> distros. I
>>>>>>> didn't do anything fancy, just the usual configure,make,make
>>>>>>> install.
>>>>>>> I did have trouble with a gromacs version compiled using a gcc4*
>>>>>>> distro and somebody on this user list assisted me in determining
>>>>>>> that
>>>>>>> I should just roll back my gcc version. I run on all sorts of
>>>>>>> computers, opterons and intel chips, 32 and 64 bit, and this
>>>>>>> particular ci problem is the same for me on all of them.
>>>>>>> Matteo, I want to be sure that we are on the same page here: your ci
>>>>>>> is just *barely* out of bounds right? This is a different problem
>>>>>>> than when your ci is a huge negative number. In that case you have
>>>>>>> some other problem and your system is exploding.
>>>>>>> There is one test and one workaround included in my bugzilla post.
>>>>>>> The test is to recompile gromacs with the -DEBUG_PBC flag and see if
>>>>>>> the problem still occurs. For me this solved the problem (although
>>>>>>> gromacs runs much slower so it is not a great workaround). The
>>>>>>> solution was to remake my system with a few more or a few less
>>>>>>> waters
>>>>>>> so that the number of grids wasn't changing as the volume of the box
>>>>>>> fluctuates (slightly) during constant pressure simulations.
>>>>>>> I here include the text that I added to that bugzilla post:
>>>>>>> Did you try with a version of mdrun that was compiled with
>>>>>>> -DEBUG_PBC ?
>>>>>>> I have some runs that reliably (but stochastically) give errors
>>>>>>> about
>>>>>>> an atom
>>>>>>> being found in a grid just one block outside of the expected
>>>>>>> boundary
>>>>>>> only in
>>>>>>> parallel runs, and often other nodes have log files that indicate
>>>>>>> that they have
>>>>>>> just updated the grid size (constant pressure simulation). This
>>>>>>> error
>>>>>>> disappears
>>>>>>> when I run with a -DEBUG_PBC version. My assumption here is that
>>>>>>> there is some
>>>>>>> non-blocking MPI communication that is not getting through in time.
>>>>>>> The
>>>>>>> -DEBUG_PBC version spends a lot of time checking some things and
>>>>>>> although it
>>>>>>> never reports having found some problem, I assume that a side-effect
>>>>>>> of these
>>>>>>> extra calculations is to slow things down enough at the proper stage
>>>>>>> so that the
>>>>>>> MPI message gets through. I have solved my problem by adjusting my
>>>>>>> simulation
>>>>>>> cell so that it doesn't fall close to the grid boundaries. Perhaps
>>>>>>> you are
>>>>>>> experiencing some analogous problem?
>>>>>>> Quoting Matteo Guglielmi <matteo.guglielmi at epfl.ch>:
>>>>>>>> Hello Chris,
>>>>>>>> I have the same problem with gromacs and did not understand
>>>>>>>> what's going wrong yet.
>>>>>>>> I did not try to run a serial job (as you did) but all my 7
>>>>>>>> simulations
>>>>>>>> (6 solvated pores in membranes + 1 protein in water... all of them
>>>>>>>> with position restraints - double precision) keep crashing in the
>>>>>>>> same way.
>>>>>>>> Did you finally understand why they do crash (in parallel)?
>>>>>>>> How did you compile gromacs?
>>>>>>>> I used the Intel compilers (ifort icc icpc 9.1 series) with the
>>>>>>>> following optimization flags: -O3 -unroll -axT.
>>>>>>>> I've also tried the 8.0 series but no chance to get rid of the
>>>>>>>> problem.
>>>>>>>> I'm running on woodcrest (xeon cpu 5140 2.33GHz) and xeon cpu
>>>>>>>> 3.06GHz.
>>>>>>>> Thanks for your attention,
>>>>>>>> MG
>>>>>> Do you use pressure coupling? In principle that can cause problems
>>>>>> when combined with position restraints. Further once again, please
>>>>>> try
>>>>>> to reproduce the problem with gcc as well. If this is related to
>>>>>> bugzilla 109 as Chris suggests then please let's continue the
>>>>>> discussion there.
>>>>> Yes I do (anisotropic pressure).
>>>>> I got the same problem with the same system using a version of
>>>>> gromacs compiled with gcc 3.4.6.
>>>>> So, in my case, it doesn't matter which compiler I use to compile
>>>>> gromacs (either Intel series 9.1 or gcc 3.4.6); I always get the
>>>>> same error right after the "update" of the grid size.
>>>>> ....
>>>>> step 282180, will finish at Tue May 29 15:57:31 2007
>>>>> step 282190, will finish at Tue May 29 15:57:32 2007
>>>>> -------------------------------------------------------
>>>>> Program mdrun_mpi, VERSION 3.3.1
>>>>> Source code file: nsgrid.c, line: 226
>>>>> Range checking error:
>>>>> Explanation: During neighborsearching, we assign each particle to a
>>>>> grid
>>>>> based on its coordinates. If your system contains collisions or
>>>>> parameter
>>>>> errors that give particles very high velocities you might end up with
>>>>> some
>>>>> coordinates being +-Infinity or NaN (not-a-number). Obviously, we
>>>>> cannot
>>>>> put these on a grid, so this is usually where we detect those errors.
>>>>> Make sure your system is properly energy-minimized and that the
>>>>> potential
>>>>> energy seems reasonable before trying again.
>>>>> Variable ci has value 1472. It should have been within [ 0 .. 1440 ]
>>>>> Please report this to the mailing list (gmx-users at gromacs.org)
>>>>> -------------------------------------------------------
>>>>> "BioBeat is Not Available In Regular Shops" (P.J. Meulenhoff)
>>>>> Looking forward to a solution,
>>>>> thanks to all of you,
>>>>> MG.
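
For readers wondering how ci can land just past the upper bound instead of
at a huge value, here is a minimal Python sketch. It is not the code in
nsgrid.c. The 12 x 12 x 10 grid (chosen only because it gives 1440 cells),
the box lengths, the index ordering and the function name cell_index are
all assumptions made for illustration. It shows that an atom sitting a hair
outside a box edge that has just shrunk maps to a cell one row past the
grid.

```python
import math


def cell_index(pos, box, ncell):
    """Flattened cell index of a position in a rectangular box.

    pos, box and ncell are (x, y, z) triples. The x-major flattening
    order used here is an assumption, not taken from GROMACS.
    """
    ix, iy, iz = (
        int(math.floor(p / b * n)) for p, b, n in zip(pos, box, ncell)
    )
    return (ix * ncell[1] + iy) * ncell[2] + iz


# Hypothetical grid with 12 * 12 * 10 = 1440 cells.
ncell = (12, 12, 10)
total = ncell[0] * ncell[1] * ncell[2]

# Box after a small pressure-coupling step: x shrank from 6.00 to 5.99 nm.
box = (5.99, 6.0, 5.0)

inside = cell_index((3.0, 1.7, 1.2), box, ncell)
# This atom was inside the old box but is 0.005 nm past the new x edge.
outside = cell_index((5.995, 1.7, 1.2), box, ncell)

print(inside, 0 <= inside < total)    # 752 True
print(outside, 0 <= outside < total)  # 1472 False
```

In this toy setup the index overshoots by a single row of cells, which is
the "barely out of bounds" signature, as opposed to the huge or undefined
values expected from an exploding system.
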
>>>> If this is reproducible as you say it is then please save your energy
>>>> file at each step and plot the box as a function of time. Most likely
>>>> it is exploding gently. This is most likely caused by the combination
>>>> of position restraints and pressure coupling. Therefore it would be
>>>> good to turn one of them off. I would suggest starting with posres and no
>>>> pressure coupling, and once that equilibrates turn on pressure
>>>> coupling with a long tau_p (e.g. 5 ps).
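
The box check suggested above could be sketched as follows in Python,
assuming the box lengths have already been extracted to an .xvg-style text
file with the columns time, Box-X, Box-Y, Box-Z (for example with
g_energy). The sample numbers and the helper names read_xvg and box_drift
are made up for illustration.

```python
def read_xvg(lines):
    """Parse xvg-style text: skip '#' and '@' header lines, return rows of floats."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        rows.append([float(v) for v in line.split()])
    return rows


def box_drift(rows):
    """Relative change of each box length between the first and last frame."""
    first, last = rows[0], rows[-1]
    return [(b1 - b0) / b0 for b0, b1 in zip(first[1:], last[1:])]


# Illustrative data only: time (ps), Box-X, Box-Y, Box-Z (nm).
sample = """\
# made-up example
@ title "Box"
0.0 6.000 6.000 5.000
1.0 5.990 5.990 4.950
2.0 5.970 5.970 4.900
""".splitlines()

rows = read_xvg(sample)
drift = box_drift(rows)
for name, d in zip(("Box-X", "Box-Y", "Box-Z"), drift):
    print(f"{name}: {100 * d:+.2f} %")
```

A drift that keeps the same sign in all three dimensions over the run
points to a box that is steadily shrinking or growing, not one that is
fluctuating around an equilibrium size.
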
>>> What I see from the trajectory file is my volume dimension getting
>>> smaller in x, y and z.
>>> Along the z coordinate, which is orthogonal to the membrane, the
>>> volume of the box shrinks a bit faster because I did prepare/solvate
>>> my system using amber9 (lower water molecules density).
>>> All my systems were geometry optimized (emtol = 70) prior to
>>> any md step.
>>> I apply position restraints (fc=1K) to a transmembrane synthetic
>>> ion channel (located in the center of the simulation box) which is
>>> *well* surrounded by "soft" lipids (isotropic pressure, off diagonal
>>> compressibility elements are set to 0)
>>> I have the same problem with a completely different system
>>> where a full protein (position restrained) is immersed only in
>>> water (isotropic pressure)... variable ci gets *barely* out of bounds
>>> here too.
>>> My tau_p is set to 5 ps, my time step is set to 1fs and I use lincs
>>> only on hbonds. I have room temperature.
>>> The ci out of bounds problem usually occurs after the very first
>>> 0.3 ns.
>>> I did run the same systems in terms of initial geometry and conditions
>>> with other parallel MD codes, for more than 30ns each (actually I wanna
>>> compare gromacs to them) without observing any slowly exploding systems.
>>> That's why I think it's something related to the parallel version of
>>> gromacs and the grid update which occurs along with the initial
>>> decrease in the size of my box.
>> You are welcome to submit a bugzilla if you have a reproducible
>> problem. It would be great if you can upload a tpr file that
>> reproduces the problem as fast as possible.
>> Nevertheless I would urge you to try it without position restraints as
>> well. The problem could be related to restraining a molecule at a
>> position far outside the box or reducing the box size until the box is
>> smaller than your protein.
> I have the tpr file.
> My molecules are located in the center of the simulation box and
> their size is much smaller than the box itself.
> I could run a job without position restraints just to see what
> happens but I have no time at the moment.
Sorry, but please don't say this kind of thing; it is not very
encouraging for those wanting to help.

> (Actually my transmembrane pores will collapse for sure since
> the water density, all over the box, is much too low ;-) )
> Moreover, since I'm not the only gmx user fighting with this
> problem:
> http://www.gromacs.org/pipermail/gmx-users/2006-October/024154.html
> http://www.gromacs.org/pipermail/gmx-users/2006-October/024333.html
> http://bugzilla.gromacs.org/show_bug.cgi?id=109
> I'm pretty sure we are dealing with a "bad" communication between
> parallel processes (serial jobs do not suffer from such a problem)
> Thanks David,
> MG.
You have not uploaded the tpr to the bugzilla yet...

David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,  	75124 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se
