Subject: Re: [gmx-users] Issue with domain decomposition between v4.5.5 and 4.6.1

Justin Lemkul jalemkul at vt.edu
Mon Apr 15 23:10:02 CEST 2013



On 4/15/13 3:16 PM, Stephanie Teich-McGoldrick wrote:
> Hello Justin,
>
> Thank you for the reply, and I am glad to hear that this is normal output.
> Unfortunately, my simulations crash almost immediately when I use v4.6,
> and I was assuming it has something to do with the load balancing because
> that is the last line in my md.log file.
>
> I have run with the flag "mdrun -debug 1" and find the error:
> "mdrun_mpi:13106 terminated with signal 11 at PC=2abd88a03934
> SP=7fff6343f170.  Backtrace:
> /apps/x86_64/mpi/openmpi/intel-12.1-2011.7.256/openmpi-1.4.3_oobpr/lib/libmpi.so.0[0x2abd88a03934]"
>
>
> I know this is rather vague, but do you have any suggestions on where I
> should start tracking down this error? When I use particle decomposition my
> simulations run fine.
>

Standard advice applies: 
http://www.gromacs.org/Documentation/Terminology/Blowing_Up#Diagnosing_an_Unstable_System

If you want to troubleshoot further, please provide (at minimum) a complete .mdp
file.  The simple truth is that systems are not infinitely parallelizable, and
small changes in the number of processors can expose subtle instabilities in the
DD algorithm at high processor counts.  If it worked before, that may have been
a random success that now fails.  If there is something more nefarious going on,
we should be able to weed it out with some careful debugging, but there's no
evidence of a bug yet.
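
As a rough illustration of these limits, the following Python sketch reproduces
the "allowed shrink" figures from the md.log excerpt quoted further down in this
message.  It is not GROMACS code: the function name allowed_shrink is made up
for this example, and the formula (minimum cell size divided by initial cell
size) is an assumption that happens to match the logged values.

```python
# Values copied from the md.log excerpt quoted in this thread.
initial_cell_nm = {"X": 2.48, "Y": 2.48, "Z": 1.46}
min_cell_nm = 1.000  # "minimum size for domain decomposition cells"


def allowed_shrink(initial_nm, minimum_nm):
    """Fraction of its initial size a DD cell can shrink to (assumed formula)."""
    return minimum_nm / initial_nm


for axis, size in initial_cell_nm.items():
    frac = allowed_shrink(size, min_cell_nm)
    print(f"{axis}: {size:.2f} nm cell may shrink to {frac:.2f} of its size")
```

With more processors the initial cells get smaller, so this fraction moves
toward 1 and dynamic load balancing has less room to resize cells.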

-Justin

> Message: 3
> Date: Mon, 15 Apr 2013 06:08:13 -0400
> From: Justin Lemkul <jalemkul at vt.edu>
> Subject: Re: [gmx-users] Issue with domain decomposition between v4.5.5 and 4.6.1
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Message-ID: <516BD18D.8000803 at vt.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>
>
> On 4/14/13 11:23 PM, Stephanie Teich-McGoldrick wrote:
>> Dear all,
>>
>> I am running an NPT simulation of 33,534 TIP4P waters, and I am using domain
>> decomposition as the parallelization scheme. Previously, I had been using
>> Gromacs version 4.5.5 but have recently installed and switched to Gromacs
>> version 4.6.1. Using Gromacs 4.5.5 I can successfully run my water box
>> using domain decomposition over many different processor numbers. However,
>> the same simulation returns the following error when I try Gromacs 4.6.1:
>>
>> "The initial number of communication pulses is: X 1 Y 1 Z 1
>> The initial domain decomposition cell size is: X 2.48 nm Y 2.48 nm Z 1.46 nm
>>
>> When dynamic load balancing gets turned on, these settings will change to:
>> The maximum number of communication pulses is: X 1 Y 1 Z 1
>> The minimum size for domain decomposition cells is 1.000 nm
>> The requested allowed shrink of DD cells (option -dds) is: 0.80
>> The allowed shrink of domain decomposition cells is: X 0.40 Y 0.40 Z 0.68
>> "
>> The above error occurred running over 16 nodes / 128 processors. The system
>> runs with version 4.6.1 on 1, 8, and 16 processors but not on 32, 64, or 128
>> processors.
>>
>> I have tried other systems (including NVT, Berendsen/PR barostats,
>> anisotropic/isotropic ) at the higher number of processors using both
>> version 4.5.5 and 4.6.1 and get the same result - v4.5.5 runs fine while
>> v4.6.1 returns the error type listed above.
>>
>> Is anyone else having a similar issue? Is there something I am not
>> considering? Any help would be greatly appreciated! The details I have used
>> to compile each code are below. My log files indicate that I am indeed
>> calling the correct executable at run time.
>>
>
> Based on what you've posted, I don't see any error.  All of the above is normal
> output.
>
> -Justin
>
> --
> ========================================
>
> Justin A. Lemkul, Ph.D.
> Research Scientist
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>

-- 
========================================

Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


