[gmx-users] Identical energies generated in a rerun calculation ... but ...

Fri Apr 24 09:56:56 CEST 2009

Mark Abraham wrote:
> 
> OK I have some confirmation of a possible bug here. Using 4.0.4 to do 
> reruns on the same positions-only NPT peptide+water trajectory with the 
> same run input file:
> 
> a) compiled without MPI, a single-processor rerun worked correctly, 
> including "zero" KE and temperature at each frame
> 
> b) compiled with MPI, a single-processor run worked correctly, including 
> zero KE and temperature, and agreed with a) within machine precision
> 
> c) compiled with MPI, a 4-processor run worked incorrectly: an 
> approximately-correct temperature and plausible positive KE were 
> reported, all PE terms were identical to about machine precision with 
> the first step of a) and b), and the reported pressure was different.
> 
> Thus it seems that a multi-processor mdrun is not updating the structure 
> for subsequent steps in the loop over structures, and/or is getting some 
> KE from somewhere that a single-processor calculation is not.
> 
> I'll step through c) with a debugger tomorrow.

d) compiled with MPI, a 4-processor run using particle decomposition 
worked correctly, agreeing with a).

Further, c) has the *same* plausible positive KE at each step.

From stepping through a run, I think the DD rerun problem arises because 
a rerun loads the data from the rerun trajectory into rerun_fr and later 
copies them into state, but not into state_global. state_global is 
initialized from the .tpr file (which *has* velocities) and is used for 
the DD initialization, but it is never subsequently updated. So, for each 
rerun step, the same .tpr state gets propagated, which leads to all the 
symptoms I describe above. The KE comes from the velocities in the .tpr 
file, and is thus constant.
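
To make the mechanism above concrete, here is a minimal toy model in C. It 
is not GROMACS code: every name in it (toy_state, toy_frame, 
toy_partition, toy_rerun_frame, the sync_global flag) is hypothetical, the 
"PE" is just the squared distance between two atoms, and unit masses are 
assumed. It only illustrates how redistributing from a master state that 
is never refreshed gives the same energies for every frame.

```c
#include <assert.h>

enum { NATOMS = 2, NDIM = 3 };
typedef float toy_rvec[NDIM];

/* Hypothetical stand-in for a state holding positions and velocities. */
typedef struct {
    toy_rvec x[NATOMS];
    toy_rvec v[NATOMS];
} toy_state;

/* Hypothetical stand-in for a positions-only rerun frame. */
typedef struct {
    toy_rvec x[NATOMS];
} toy_frame;

static void toy_copy_rvec(const toy_rvec a, toy_rvec b)
{
    for (int d = 0; d < NDIM; d++) {
        b[d] = a[d];
    }
}

/* Stand-in for redistributing the master state to the local state. */
static void toy_partition(const toy_state *global, toy_state *local)
{
    for (int i = 0; i < NATOMS; i++) {
        toy_copy_rvec(global->x[i], local->x[i]);
        toy_copy_rvec(global->v[i], local->v[i]);
    }
}

/* Toy "potential energy": squared distance between atoms 0 and 1. */
static float toy_pe(const toy_state *s)
{
    float r2 = 0.0f;
    for (int d = 0; d < NDIM; d++) {
        float dx = s->x[1][d] - s->x[0][d];
        r2 += dx * dx;
    }
    return r2;
}

/* Toy kinetic energy with unit masses: 0.5 * sum of v^2. */
static float toy_ke(const toy_state *s)
{
    float ke = 0.0f;
    for (int i = 0; i < NATOMS; i++) {
        for (int d = 0; d < NDIM; d++) {
            ke += 0.5f * s->v[i][d] * s->v[i][d];
        }
    }
    return ke;
}

/* Process one rerun frame and return the toy PE of the local state.
 * sync_global == 0: the frame never reaches the master state, so the
 *                   local state is rebuilt from the initial data.
 * sync_global != 0: frame positions are copied into the master state
 *                   before it is redistributed. */
static float toy_rerun_frame(toy_state *global, toy_state *local,
                             const toy_frame *frame, int sync_global)
{
    assert(global != 0 && local != 0 && frame != 0);
    if (sync_global) {
        for (int i = 0; i < NATOMS; i++) {
            toy_copy_rvec(frame->x[i], global->x[i]);
        }
    }
    toy_partition(global, local);
    return toy_pe(local);
}
```

Calling toy_rerun_frame() with sync_global set to 0 on two different 
frames returns the PE of the initial state both times, and toy_ke() on the 
local state keeps returning the KE of the initial velocities. With 
sync_global set to 1 the PE follows the frames.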

So, a preliminary work-around is to use mdrun -rerun -pd to get particle 
decomposition.

I tried to hack a fix for the DD code. It seemed that using

for (i=0; i<state_global->natoms; i++)
   copy_rvec(rerun_fr.x[i],state_global->x[i]);

before about line 1060 of do_md() in src/kernel/md.c should do the 
trick, since with bMasterState set for a rerun, dd_partition_system() 
should propagate state_global to the right places. However, I got a 
segfault in that copy_rvec with i==0, despite state_global->x being 
allocated and of the right dimensions according to Totalview's memory 
debugger.

I'll file a bugzilla in any case.

Mark