[gmx-users] different results when using different number cpus

Wed Dec 5 18:37:10 CET 2007

> Message: 4
> Date: Wed, 05 Dec 2007 14:19:28 +0100
> From: "Berk Hess" <gmx3 at hotmail.com>
> Subject: Re: [gmx-users] different results when using different number
> 	cpus
> To: gmx-users at gromacs.org
> Message-ID: <BAY110-F3835AA47E9A1765473F1B68E6E0 at phx.gbl>
> Content-Type: text/plain; format=flowed
>
> Hi,
>
> With Gromacs and (nearly) all other MD packages you will never be able
> to get binary identical results when running on different numbers of CPUs.
> Since MD is chaotic, the results can be very different.
>
> Berk.
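
To make the point above concrete, here is a minimal Python sketch (not GROMACS code). It first shows that floating-point addition depends on the order of evaluation, then uses the logistic map as a stand-in for a chaotic system to show how a tiny difference grows. The map, the parameter r = 3.9, the starting value 0.3 and the perturbation 1e-12 are all illustrative assumptions, chosen only to mimic a last-digit rounding difference.

```python
# Part 1: floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("same sum, different order:", repr(left), repr(right), left == right)


# Part 2: a chaotic toy system amplifies a tiny initial difference.
def logistic_trajectory(x0, r=3.9, steps=200):
    """Iterate x -> r * x * (1 - x) and return all visited values.

    The logistic map is only a stand-in for chaotic dynamics;
    it is NOT an MD integrator.
    """
    x = x0
    values = [x]
    for _ in range(steps):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


PERTURBATION = 1e-12  # assumed stand-in for a rounding-level difference

a = logistic_trajectory(0.3)
b = logistic_trajectory(0.3 + PERTURBATION)

# Absolute difference between the two trajectories at every step.
diffs = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 5, 50, 100, 200):
    print(f"step {step:3d}: |difference| = {diffs[step]:.3e}")
print(f"largest difference seen: {max(diffs):.3f}")
```

In this toy the two trajectories agree to many digits for the first few steps and then decorrelate completely, which is qualitatively the behaviour described above for MD runs that differ only in rounding.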

I can confirm that I see the same thing when repeating a simulation
segment twice on 4 CPUs with gromacs-3.3.1 and fftw-3.1.2.
Further, while trying to debug a colleague's parameters that give a
LINCS error after long periods of simulation time on a single
processor, I find that a proper restart from just prior to the crash
does not lead to an exact repeat of the error (although an error does
eventually occur). This was unfortunate, since my plan was to save the
.trr every 100 ps and then do a restart in which I saved the .xtc every
integration step to get a good look at the problem. Carsten's comments
about fftw 3.x are useful, since I have been using fftw-3.1.2. Note that
I did not test whether a run on 1 CPU generates an identical
trajectory, only that the LINCS error is not exactly reproduced. I did
the restart using the .trr/.edr files and set
gen_vel=no and unconstrained_start=yes for the restart.
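
For concreteness, the output intervals for such a plan can be worked out as in the sketch below. The time step of 0.002 ps and the .xtc interval of the long run are assumptions (neither is stated here); nstxout, nstvout and nstxtcout are the usual .mdp output-interval options, and the gen_vel and unconstrained_start values are the ones given above.

```python
def steps_per_interval(interval_ps, dt_ps):
    """Number of integration steps that span interval_ps at time step dt_ps."""
    steps = round(interval_ps / dt_ps)
    if steps < 1 or abs(steps * dt_ps - interval_ps) > 1e-9:
        raise ValueError("interval must be a positive whole multiple of dt")
    return steps


DT_PS = 0.002  # ASSUMED time step in ps; not stated in the message

# Long production run: full-precision .trr frames every 100 ps.
long_run = {
    "dt": DT_PS,
    "nstxout": steps_per_interval(100.0, DT_PS),   # .trr coordinates
    "nstvout": steps_per_interval(100.0, DT_PS),   # .trr velocities
    "nstxtcout": steps_per_interval(10.0, DT_PS),  # ASSUMED .xtc interval
}

# Restart from the last .trr frame before the event: .xtc every step,
# velocities taken from the .trr rather than regenerated.
restart = dict(long_run, nstxtcout=1, gen_vel="no", unconstrained_start="yes")


def as_mdp(options):
    """Format a dict of options as .mdp-style 'key = value' lines."""
    return "\n".join(f"{key:<20} = {value}" for key, value in options.items())


print("; long run")
print(as_mdp(long_run))
print("; restart segment")
print(as_mdp(restart))
```

With the assumed 0.002 ps time step, saving the .trr every 100 ps corresponds to an interval of 50000 steps, so a restart segment written every step holds 50000 .xtc frames per 100 ps.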

I agree that statistical properties will be properly reproduced, but I
can imagine situations in which one would want a restart to be exactly
identical: e.g. an interest in the dynamics of quick, rare processes, in
which one might run for a long time while saving .xtc and .trr
infrequently and then restart at the proper place while saving .xtc
very frequently in order to capture the dynamics of an identified
transition.

>
>
>> From: Carsten Kutzner <ckutzne at gwdg.de>
>> Reply-To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> Subject: Re: [gmx-users] different results when using different number cpus
>> Date: Wed, 05 Dec 2007 14:10:06 +0100
>>
>> Hi Dechang,
>>
>> it is normal that results are not binary identical if you compare the
>> "same" MD system on different numbers of processors. If you use PME then
>> you will probably get slightly different charge grids for 2 and for 16
>> processors - since the charge grid has to be divisible by the number of
>> CPUs in x- and y-direction. Even if you manually set the grid dimensions
>> to be the same for both cases, your simulations could diverge when using
>> version 3.x of the FFTW. This version has a built-in timer and chooses
>> the fastest of several algorithms, which may differ even between two
>> runs on the same number of processors - depending on the timing results.
>> With different algorithms you get slight differences in the last digit
>> of the computed numbers (rounding / truncation / order of evaluation)
>> which will then grow during the simulation and lead to diverging
>> trajectories. Of course the averaged properties of the simulation are
>> unaffected by those differences and should be the same if averaged long
>> enough.
>> You could use FFTW 2.x and manually set the FFT grid size to the same
>> value for the 2 and 16 CPU case - but I am not sure if this is enough
>> to get binary identical results.
>> You could also repeat your simulations several times with (slightly)
>> different starting conditions (maybe different starting velocities) to
>> get a better picture of the average behaviour of your system. If in all
>> 16 processor cases you see the proteins diverge and in all 2 processor
>> cases you see them converge, I would guess something is wrong.
>>
>> Hope that helps,
>>   Carsten
>>
>>
>> Dechang Li wrote:
>> >  Dear all,
>> >
>> > I used Gromacs 3.3.1 to simulate two proteins in water (tip3p).
>> > I ran two similar simulations, one on 2 CPUs and the other on 16 CPUs.
>> > The two simulations used the same .gro, .top, and .mdp files, but I
>> > found that the results were not the same. In the 2-CPU simulation, the
>> > two proteins moved closer and closer together, but in the 16-CPU
>> > simulation they moved apart.
>> >    Is it normal to get different results when using different numbers
>> > of CPUs? The size of my simulation box is 9*7*7.
>> >
>> >
>> > Best regards,
>> >
>> > 2007-12-5
>> >
>> > =========================================
>> > Dechang Li, PhD Candidate
>> > Department of Engineering Mechanics
>> > Tsinghua University
>> > Beijing 100084
>> > PR China
>> >
>> > Tel:   +86-10-62773779(O)
>> > Email: li.dc06 at gmail.com
>> > =========================================
>> >