[gmx-users] Re: ci barely out of bounds

Matteo Guglielmi matteo.guglielmi at epfl.ch
Wed May 30 00:26:35 CEST 2007


David van der Spoel wrote:
> Matteo Guglielmi wrote:
>> David van der Spoel wrote:
>>> Matteo Guglielmi wrote:
>>>> David van der Spoel wrote:
>>>>> Matteo Guglielmi wrote:
>>>>>> David van der Spoel wrote:
>>>>>>> chris.neale at utoronto.ca wrote:
>>>>>>>> This email refers to my original posts on this topic:
>>>>>>>> http://www.gromacs.org/pipermail/gmx-users/2006-October/024154.html
>>>>>>>>
>>>>>>>> http://www.gromacs.org/pipermail/gmx-users/2006-October/024333.html
>>>>>>>>
>>>>>>>>
>>>>>>>> I have previously posted some of this information to somebody
>>>>>>>> else's
>>>>>>>> bugzilla post #109 including a possible workaround
>>>>>>>> http://bugzilla.gromacs.org/show_bug.cgi?id=109
>>>>>>>>
>>>>>>>> I have always compiled my gromacs using one of the gcc 3.3*
>>>>>>>> distros. I didn't do anything fancy, just the usual configure,
>>>>>>>> make, make install. I did have trouble with a gromacs version
>>>>>>>> compiled using a gcc 4* distro, and somebody on this user list
>>>>>>>> helped me determine that I should just roll back my gcc version.
>>>>>>>> I run on all sorts of computers, opterons and intel chips, 32 and
>>>>>>>> 64 bit, and this particular ci problem is the same for me on all
>>>>>>>> of them.
>>>>>>>>
>>>>>>>> Matteo, I want to be sure that we are on the same page here:
>>>>>>>> your ci is just *barely* out of bounds, right? This is a
>>>>>>>> different problem from when your ci is a huge negative number.
>>>>>>>> In that case you have some other problem and your system is
>>>>>>>> exploding.
>>>>>>>>
>>>>>>>> There is one test and one workaround included in my bugzilla
>>>>>>>> post. The test is to recompile gromacs with the -DEBUG_PBC flag
>>>>>>>> and see if the problem still occurs. For me this made the
>>>>>>>> problem go away (although gromacs runs much slower, so it is not
>>>>>>>> a great workaround). The solution was to remake my system with a
>>>>>>>> few more or a few fewer waters so that the number of grid cells
>>>>>>>> wasn't changing as the volume of the box fluctuates (slightly)
>>>>>>>> during constant pressure simulations.
>>>>>>>>
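For illustration, here is roughly why a small volume fluctuation can
change the grid: the number of neighbor-search cells along each box
edge is an integer truncation of the edge length over the cutoff, so an
edge hovering near a multiple of the cutoff can gain or lose a cell as
the box breathes under pressure coupling. The sketch below uses made-up
names and numbers and is not the actual nsgrid.c code.

    #include <math.h>
    #include <stdio.h>

    /* Sketch (made-up names): cells along one box dimension.  The
     * count is an integer truncation, so a box edge hovering near a
     * multiple of the cutoff can gain or lose a cell when the volume
     * fluctuates slightly under pressure coupling. */
    static int ncells(double box_len, double rcut)
    {
        int n = (int) floor(box_len / rcut);
        return n > 0 ? n : 1;
    }

    int main(void)
    {
        double rcut  = 1.0;                    /* nm, assumed cutoff   */
        double big   = 12.005, small = 11.995; /* nm, tiny fluctuation */

        printf("box %.3f nm -> %d cells\n", big,   ncells(big, rcut));
        printf("box %.3f nm -> %d cells\n", small, ncells(small, rcut));
        /* 12 vs 11 cells: the grid dimensions change, and anything
         * still indexed against the old grid can land one cell out. */
        return 0;
    }

Adding or removing a few waters, as described above, simply moves the
equilibrium box edge away from such a boundary.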
>>>>>>>> I include here the text that I added to that bugzilla post:
>>>>>>>>
>>>>>>>> Did you try with a version of mdrun that was compiled with
>>>>>>>> -DEBUG_PBC? I have some runs that reliably (but stochastically)
>>>>>>>> give errors about an atom being found in a grid cell just one
>>>>>>>> block outside of the expected boundary, but only in parallel
>>>>>>>> runs, and often the other nodes have log files indicating that
>>>>>>>> they have just updated the grid size (constant pressure
>>>>>>>> simulation). This error disappears when I run with a -DEBUG_PBC
>>>>>>>> version. My assumption is that some non-blocking MPI
>>>>>>>> communication is not getting through in time. The -DEBUG_PBC
>>>>>>>> version spends a lot of time checking things and, although it
>>>>>>>> never reports having found any problem, I assume that a
>>>>>>>> side-effect of these extra calculations is to slow things down
>>>>>>>> enough at the right stage for the MPI message to get through. I
>>>>>>>> solved my problem by adjusting my simulation cell so that its
>>>>>>>> size doesn't fall close to a grid boundary. Perhaps you are
>>>>>>>> experiencing an analogous problem?
>>>>>>>>
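To make the non-blocking MPI hypothesis concrete, here is a minimal
sketch of the pattern being described, with invented names (this is
not the actual gromacs communication code): updated grid dimensions
arriving via a non-blocking receive must be completed with MPI_Wait
before they are used, otherwise atoms could be binned against stale
grid bounds.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch (invented names, not gromacs code): rank 0 distributes
     * freshly recomputed neighbor-search grid dimensions; the other
     * ranks receive them non-blockingly.  The MPI_Wait is essential:
     * using griddim before the receive completes would mean binning
     * atoms against stale grid bounds -- the race hypothesized above. */
    int main(int argc, char *argv[])
    {
        int         rank, nranks, griddim[3] = {0, 0, 0};
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        if (rank == 0)
        {
            int newdim[3] = {12, 12, 10};  /* just recomputed */
            int i;
            for (i = 1; i < nranks; i++)
                MPI_Send(newdim, 3, MPI_INT, i, 0, MPI_COMM_WORLD);
        }
        else
        {
            MPI_Irecv(griddim, 3, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            /* ... other work may overlap here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* must complete first */
            printf("rank %d: grid %d x %d x %d\n",
                   rank, griddim[0], griddim[1], griddim[2]);
        }

        MPI_Finalize();
        return 0;
    }

Whether or not this is what actually happens inside mdrun, it is the
kind of timing-dependent behaviour that the extra -DEBUG_PBC checks
could mask simply by slowing the critical section down.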
>>>>>>>> Quoting Matteo Guglielmi <matteo.guglielmi at epfl.ch>:
>>>>>>>>
>>>>>>>>> Hello Chris,
>>>>>>>>> I have the same problem with gromacs and have not yet figured
>>>>>>>>> out what's going wrong.
>>>>>>>>>
>>>>>>>>> I did not try to run a serial job (as you did), but all 7 of my
>>>>>>>>> simulations (6 solvated pores in membranes + 1 protein in
>>>>>>>>> water... all of them with position restraints, double
>>>>>>>>> precision) keep crashing in the same way.
>>>>>>>>>
>>>>>>>>> Did you finally understand why they crash (in parallel)?
>>>>>>>>>
>>>>>>>>> How did you compile gromacs?
>>>>>>>>>
>>>>>>>>> I used the intel compilers (ifort icc icpc 9.1 series) with the
>>>>>>>>> following optimization flags: -O3 -unroll -axT.
>>>>>>>>>
>>>>>>>>> I've also tried the 8.0 series but had no luck getting rid of
>>>>>>>>> the problem.
>>>>>>>>>
>>>>>>>>> I'm running on woodcrest (xeon cpu 5140, 2.33GHz) and xeon cpu
>>>>>>>>> 3.06GHz.
>>>>>>>>>
>>>>>>>>> Thanks for your attention,
>>>>>>>>> MG
>>>>>>> Do you use pressure coupling? In principle that can cause
>>>>>>> problems when combined with position restraints. Furthermore,
>>>>>>> once again, please try to reproduce the problem with gcc as well.
>>>>>>> If this is related to bugzilla 109, as Chris suggests, then let's
>>>>>>> please continue the discussion there.
>>>>>>>
>>>>>> Yes I do (anisotropic pressure coupling).
>>>>>>
>>>>>> I got the same problem with the same system using a version of
>>>>>> gromacs compiled with gcc 3.4.6.
>>>>>>
>>>>>> So, in my case, it doesn't matter which compiler I use to build
>>>>>> gromacs (either the Intel 9.1 series or gcc 3.4.6); I always get
>>>>>> the same error right after the "update" of the grid size.
>>>>>>
>>>>>> ....
>>>>>>
>>>>>> step 282180, will finish at Tue May 29 15:57:31 2007
>>>>>> step 282190, will finish at Tue May 29 15:57:32 2007
>>>>>> -------------------------------------------------------
>>>>>> Program mdrun_mpi, VERSION 3.3.1
>>>>>> Source code file: nsgrid.c, line: 226
>>>>>>
>>>>>> Range checking error:
>>>>>> Explanation: During neighborsearching, we assign each particle to a
>>>>>> grid
>>>>>> based on its coordinates. If your system contains collisions or
>>>>>> parameter
>>>>>> errors that give particles very high velocities you might end up
>>>>>> with
>>>>>> some
>>>>>> coordinates being +-Infinity or NaN (not-a-number). Obviously, we
>>>>>> cannot
>>>>>> put these on a grid, so this is usually where we detect those
>>>>>> errors.
>>>>>> Make sure your system is properly energy-minimized and that the
>>>>>> potential
>>>>>> energy seems reasonable before trying again.
>>>>>>
>>>>>> Variable ci has value 1472. It should have been within [ 0 .. 1440 ]
>>>>>> Please report this to the mailing list (gmx-users at gromacs.org)
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> "BioBeat is Not Available In Regular Shops" (P.J. Meulenhoff)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Looking forward to a solution,
>>>>>> thanks to all of you,
>>>>>> MG.
>>>>>>
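For what it's worth, the range check that fires here is easy to
picture. The sketch below uses made-up names and an assumed 12x12x10
grid (chosen only so the numbers line up with the error above); it is
not the actual nsgrid.c routine. Each coordinate is mapped to integer
cell indices, flattened into ci, and ci must stay below the total cell
count. An atom sitting fractionally outside the box the grid was built
for, for instance right after the grid has been resized, lands barely
past the last cell, which is exactly a "ci barely out of bounds".

    #include <stdio.h>

    /* Sketch (made-up names, not the actual nsgrid.c code): map a
     * position to a flat neighbor-search cell index and range-check
     * it, as the error message above describes. */
    typedef struct {
        int    nx, ny, nz;   /* cells per box dimension */
        double lx, ly, lz;   /* box edge lengths (nm)   */
    } grid_t;

    static int cell_index(const grid_t *g, double x, double y, double z)
    {
        /* assumes coordinates already wrapped into [0, box) */
        int ix = (int) (x / g->lx * g->nx);
        int iy = (int) (y / g->ly * g->ny);
        int iz = (int) (z / g->lz * g->nz);

        return (ix * g->ny + iy) * g->nz + iz;   /* flat index ci */
    }

    int main(void)
    {
        grid_t g  = {12, 12, 10, 12.0, 12.0, 10.0};  /* 1440 cells */
        int    nc = g.nx * g.ny * g.nz;

        /* An atom a hair outside the box the grid describes: */
        int ci = cell_index(&g, 12.05, 3.5, 2.5);

        if (ci < 0 || ci >= nc)
            printf("range error: ci = %d, expected [0..%d]\n", ci, nc);
        else
            printf("ci = %d is in range\n", ci);
        return 0;
    }

Run as is, this hypothetical example prints ci = 1472 against an upper
bound of 1440, the same numbers as in the error, purely because of the
assumed grid dimensions.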
>>>>> If this is reproducible, as you say it is, then please save your
>>>>> energy file at each step and plot the box as a function of time.
>>>>> Most likely it is exploding gently, caused by the combination of
>>>>> position restraints and pressure coupling. Therefore it would be
>>>>> good to turn one of them off. I would suggest starting with posres
>>>>> and no pressure coupling, and once that equilibrates, turning on
>>>>> pressure coupling with a long tau_p (e.g. 5 ps).
>>>>>
>>>> What I see from the trajectory file is the box getting smaller in
>>>> x, y and z.
>>>>
>>>> Along the z coordinate, which is orthogonal to the membrane, the
>>>> box shrinks a bit faster because I prepared/solvated my system
>>>> using amber9 (lower water density).
>>>>
>>>> All my systems were geometry optimized (emtol = 70) prior to
>>>> any md step.
>>>>
>>>> I apply position restraints (fc = 1000) to a transmembrane
>>>> synthetic ion channel (located in the center of the simulation
>>>> box) which is *well* surrounded by "soft" lipids (isotropic
>>>> pressure, off-diagonal compressibility elements set to 0).
>>>>
>>>> I have the same problem with a completely different system where a
>>>> full protein (position restrained) is immersed only in water
>>>> (isotropic pressure)... variable ci gets *barely* out of bounds
>>>> there as well.
>>>>
>>>> My tau_p is set to 5 ps, my time step is 1 fs, and I use lincs
>>>> only on h-bonds. I run at room temperature.
>>>>
>>>> The ci out-of-bounds problem usually occurs after the first
>>>> 0.3 ns or so.
>>>>
>>>> I have run the same systems, with the same initial geometry and
>>>> conditions, with other parallel MD codes for more than 30 ns each
>>>> (actually I want to compare gromacs to them) without observing any
>>>> slowly exploding systems.
>>>>
>>>> That's why I think it's something related to the parallel version
>>>> of gromacs and the grid update that occurs as my box initially
>>>> shrinks.
>>>>
>>> You are welcome to submit a bugzilla if you have a reproducible
>>> problem. It would be great if you can upload a tpr file that
>>> reproduces the problem as fast as possible.
>>>
>>> Nevertheless I would urge you to try it without position restraints as
>>> well. The problem could be related to restraining a molecule at a
>>> position far outside the box or reducing the box size until the box is
>>> smaller than your protein.
>> I have the tpr file.
>>
>> My molecules are located in the center of the simulation box and
>> their size is much smaller than the box itself.
>>
>> I could run a job without position restraints just to see what
>> happens, but I have no time at the moment.
> Sorry, but please don't say that kind of thing; it is not very
> encouraging for those wanting to help.
>
>
>> (Actually my transmembrane pores will collapse for sure since the
>> water density, all over the box, is too low ;-) )
>>
>> Moreover, since I'm not the only gmx user fighting with this
>> problem:
>>
>> http://www.gromacs.org/pipermail/gmx-users/2006-October/024154.html
>> http://www.gromacs.org/pipermail/gmx-users/2006-October/024333.html
>> http://bugzilla.gromacs.org/show_bug.cgi?id=109
>>
>> I'm pretty sure we are dealing with "bad" communication between
>> parallel processes (serial jobs do not suffer from this problem).
>>
>> Thanks David,
>> MG.
> You have not uploaded the tpr to the bugzilla yet...
>
The tpr file is uploaded now.

The serial run is already "spinning"... I'll let you know ASAP.

Actually, a friend of mine told me he had exactly the same
problem with gromacs something like 3-4 years ago.

At that time he solved the problem by doing the very first
equilibration of the box with amber6 (~0.5 ns).


Best,
MG.


