[gmx-users] Replica Exchange MD on more than 64 processors

Mark Abraham Mark.Abraham at anu.edu.au
Mon Dec 28 00:56:43 CET 2009


bharat v. adkar wrote:
> On Sun, 27 Dec 2009, Mark Abraham wrote:
> 
>> bharat v. adkar wrote:
>>>  On Sun, 27 Dec 2009, Mark Abraham wrote:
>>>
>>> >  bharat v. adkar wrote:
>>> > >   Dear all,
>>> > >     I am trying to perform replica exchange MD (REMD) on a 'protein in
>>> > >   water' system. I am following the instructions given on the wiki
>>> > >   (How-Tos -> REMD). I have to perform the REMD simulation with 35
>>> > >   different temperatures. As per the advice on the wiki, I equilibrated
>>> > >   the system at the respective temperatures (a total of 35 equilibration
>>> > >   simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>>> > >   chk_34.tpr files from the equilibrated structures.
>>> > >
>>> > >   Now when I submit the final job for REMD with the following command
>>> > >   line, it gives an error:
>>> > >
>>> > >   command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>>> > >
>>> > >   error msg:
>>> > >   -------------------------------------------------------
>>> > >   Program mdrun_mpi, VERSION 4.0.7
>>> > >   Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>> > >
>>> > >   Fatal error:
>>> > >   Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>> > >   nlist->jjnr=0x9a400030
>>> > >   (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>> > >   -------------------------------------------------------
>>> > >
>>> > >   Thanx for Using GROMACS - Have a Nice Day
>>> > >   : Cannot allocate memory
>>> > >   Error on node 19, will try to stop all the nodes
>>> > >   Halting parallel program mdrun_mpi on CPU 19 out of 70
>>> > >   ***********************************************************************
>>> > >
>>> > >   Each node on the cluster has 8 GB of physical memory and 16 GB of
>>> > >   swap memory. Moreover, when logged onto the individual nodes, each
>>> > >   shows more than 1 GB of free memory, so there should be no problem
>>> > >   with cluster memory. Also, the equilibration jobs for the same system
>>> > >   were run on the same cluster without any problem.
>>> > >
>>> > >   What I have observed by submitting different test jobs with a varying
>>> > >   number of processors (and number of replicas, wherever necessary) is
>>> > >   that any job with a total number of processors <= 64 runs faithfully
>>> > >   without any problem. As soon as the total number of processors is more
>>> > >   than 64, it gives the above error. I have tested this with 65
>>> > >   processors/65 replicas as well.
>>> >
>>> >  This sounds like you might be running on fewer physical CPUs than you
>>> >  have available. If so, running multiple MPI processes per physical CPU
>>> >  can lead to memory shortage conditions.
>>>
>>>  I don't understand what you mean. Do you mean there might be more
>>>  than 8 processes running per node (each node has 8 processors)? But that
>>>  also does not seem to be the case, as the SGE (Sun Grid Engine) output
>>>  shows only eight processes per node.
>>
>> 65 processes can't have 8 processes per node.
> Why can't it? As I said, there are 8 processors per node. What I have 
> not mentioned is how many nodes it is using. The jobs got distributed 
> over 9 nodes: 8 of them account for 64 processors, plus 1 processor 
> from the 9th node.

OK, that's a full description. Your symptoms are indicative of someone 
making an error somewhere. Since GROMACS works over more than 64 
processors elsewhere, the presumption is that you are doing something 
wrong or the machine is not set up in the way you think it is or should 
be. To get the most effective help, you need to be sure you're providing 
full information - else we can't tell which error you're making or 
(potentially) eliminate you as a source of error.

> As far as I can tell, the job distribution seems okay to me. It is 1 job 
> per processor.

Does non-REMD GROMACS run on more than 64 processors? Does your cluster 
support using more than 8 nodes in a run? Can you run an MPI "Hello 
world" application that prints the processor and node ID across more 
than 64 processors?
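
If it would help, here is a minimal sketch of such a test (untested here; 
the file name hello_mpi.c and the mpicc wrapper are just examples, use 
whatever your MPI installation provides):

/* hello_mpi.c - prints the MPI rank and host name of every process,
 * so you can see how many ranks actually land on each node. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char nodename[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(nodename, &namelen);

    printf("Hello from rank %d of %d on node %s\n", rank, size, nodename);

    MPI_Finalize();
    return 0;
}

Compile it with something like "mpicc hello_mpi.c -o hello_mpi", then 
submit it through SGE with the same "mpiexec -np 70" line you use for 
mdrun_mpi. If all 70 ranks report and no node shows more than 8 of them, 
the MPI layer and the queue setup are probably fine and we can look more 
closely at how mdrun is being run; if not, the problem is below GROMACS.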

Mark


> bharat
> 
>>
>> Mark
>>
>>> >  I don't know what you mean by "swap memory".
>>>
>>>  Sorry, I meant cache memory.
>>>
>>>  bharat
>>>
>>> >  Mark
>>> >
>>> > >   System: Protein + water + Na ions (total 46878 atoms)
>>> > >   Gromacs version: tested with both v4.0.5 and v4.0.7
>>> > >   compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>> > >   compiler: gcc_3.4.6 -O3
>>> > >   machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>>> > >
>>> > >   I tried searching the mailing list without any luck. I am not sure if
>>> > >   I am doing anything wrong in giving the commands. Please correct me
>>> > >   if it is wrong.
>>> > >
>>> > >   Kindly let me know the solution.
>>> > >
>>> > >   bharat
>>> > >
>>> >
>>>
>>
> 


