[gmx-users] Replica Exchange MD on more than 64 processors

David van der Spoel spoel at xray.bmc.uu.se
Mon Dec 28 08:12:28 CET 2009


bharat v. adkar wrote:
> On Mon, 28 Dec 2009, Mark Abraham wrote:
> 
>> bharat v. adkar wrote:
>>>  On Sun, 27 Dec 2009, Mark Abraham wrote:
>>>
>>> >  bharat v. adkar wrote:
>>> > >  On Sun, 27 Dec 2009, Mark Abraham wrote:
>>> > > >  bharat v. adkar wrote:
>>> > > > >  Dear all,
>>> > > > >  I am trying to perform replica exchange MD (REMD) on a 'protein in
>>> > > > >  water' system. I am following the instructions given on the wiki
>>> > > > >  (How-Tos -> REMD). I have to perform the REMD simulation with 35
>>> > > > >  different temperatures. As per the advice on the wiki, I
>>> > > > >  equilibrated the system at the respective temperatures (a total of
>>> > > > >  35 equilibration simulations). After this I generated chk_0.tpr,
>>> > > > >  chk_1.tpr, ..., chk_34.tpr files from the equilibrated structures.
>>> > > > >
>>> > > > >  Now when I submit the final REMD job with the following command
>>> > > > >  line, it gives an error:
>>> > > > >
>>> > > > >  command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>>> > > > >
>>> > > > >  error msg:
>>> > > > >  -------------------------------------------------------
>>> > > > >  Program mdrun_mpi, VERSION 4.0.7
>>> > > > >  Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>> > > > >
>>> > > > >  Fatal error:
>>> > > > >  Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>> > > > >  nlist->jjnr=0x9a400030
>>> > > > >  (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>> > > > >  -------------------------------------------------------
>>> > > > >
>>> > > > >  Thanx for Using GROMACS - Have a Nice Day
>>> > > > >  : Cannot allocate memory
>>> > > > >  Error on node 19, will try to stop all the nodes
>>> > > > >  Halting parallel program mdrun_mpi on CPU 19 out of 70
>>> > > > >  ***********************************************************************
>>> > > > >
>>> > > > >  Each node on the cluster has 8 GB of physical memory and 16 GB of
>>> > > > >  swap memory. Moreover, when logged onto the individual nodes, they
>>> > > > >  show more than 1 GB of free memory, so there should be no problem
>>> > > > >  with cluster memory. Also, the equilibration jobs for the same
>>> > > > >  system ran on the same cluster without any problem.
>>> > > > >
>>> > > > >  What I have observed by submitting different test jobs with varying
>>> > > > >  numbers of processors (and numbers of replicas, where necessary) is
>>> > > > >  that any job with a total of <= 64 processors runs faithfully
>>> > > > >  without any problem. As soon as the total number of processors
>>> > > > >  exceeds 64, it gives the above error. I have tested this with 65
>>> > > > >  processors/65 replicas as well.
>>> > > >
>>> > > >  This sounds like you might be running on fewer physical CPUs than
>>> > > >  you have available. If so, running multiple MPI processes per
>>> > > >  physical CPU can lead to memory shortage conditions.
>>> > >
>>> > >  I don't understand what you mean. Do you mean there might be more than
>>> > >  8 processes running per node (each node has 8 processors)? But that
>>> > >  also does not seem to be the case, as the SGE (Sun Grid Engine) output
>>> > >  shows only eight processes per node.
>>> >
>>> >  65 processes can't have 8 processes per node.
>>>  Why can't it? As I said, there are 8 processors per node. What I have
>>>  not mentioned is how many nodes it is using. The jobs got distributed
>>>  over 9 nodes: 8 of them account for 64 processors, plus 1 processor
>>>  from the 9th node.
>>
>> OK, that's a full description. Your symptoms are indicative of someone 
>> making an error somewhere. Since GROMACS works over more than 64 
>> processors elsewhere, the presumption is that you are doing something 
>> wrong or the machine is not set up in the way you think it is or 
>> should be. To get the most effective help, you need to be sure you're 
>> providing full information - else we can't tell which error you're 
>> making or (potentially) eliminate you as a source of error.
>>
> Sorry for not being clear in my statements.
> 
>>>  As far as I can tell, the job distribution seems okay to me. It is
>>>  1 job per processor.
>>
>> Does non-REMD GROMACS run on more than 64 processors? Does your 
>> cluster support using more than 8 nodes in a run? Can you run an MPI 
>> "Hello world" application that prints the processor and node ID across 
>> more than 64 processors?
> 
> Yes, the cluster supports runs with more than 8 nodes. I generated a
> system with a 10 nm water box and submitted it on 80 processors. It was
> running fine: it printed all 80 NODEIDs and also showed me when the job
> would finish.
> 
> bharat
> 
> 
>>
>> Mark
>>
>>
>>>  bharat
>>>
>>> >  Mark
>>> > > >  I don't know what you mean by "swap memory".
>>> > >  Sorry, I meant cache memory.
>>> > >  bharat
>>> > > >  Mark
>>> > > > >  System: Protein + water + Na ions (total 46878 atoms)
>>> > > > >  Gromacs version: tested with both v4.0.5 and v4.0.7
>>> > > > >  compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>> > > > >  compiler: gcc_3.4.6 -O3
>>> > > > >  machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>>> > > > >
>>> > > > >  I tried searching the mailing list without any luck. I am not sure
>>> > > > >  if I am doing anything wrong in giving the commands. Please
>>> > > > >  correct me if it is wrong.
>>> > > > >
>>> > > > >  Kindly let me know the solution.
>>> > > > >
>>> > > > >  bharat
>>>
>>
> 
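(An aside on the setup quoted above: the 35 chk_*.tpr inputs are normally
produced with a grompp loop along the lines of the sketch below. This is
illustrative only; the remd_*.mdp and equil_*.gro file names and topol.top
are placeholders, not taken from the thread.)

  # Illustrative sketch (GROMACS 4.0.x tool names): one .tpr per replica.
  # remd_${i}.mdp holds the mdp options for replica i (e.g. its ref_t);
  # equil_${i}.gro is the structure equilibrated at that temperature.
  for i in $(seq 0 34); do
      grompp -f remd_${i}.mdp -c equil_${i}.gro -p topol.top -o chk_${i}.tpr
  done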
Your system is running out of memory: probably the system is too big, or
all the replicas are running on the same node.
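
One quick way to check for the latter is to see where the MPI processes
actually land, in the spirit of the "Hello world" test Mark suggested. A
minimal sketch, assuming the same mpiexec/SGE environment used for the
mdrun job and an mpiexec that can launch non-MPI programs (most MPICH and
Open MPI builds can):

  # Launch 70 copies of a trivial command under the same MPI setup used
  # for mdrun, then count how many processes land on each host.
  mpiexec -np 70 hostname | sort | uniq -c
  # With 8 cores per node you would expect 8 processes on each of eight
  # nodes and 6 on a ninth (8*8 + 6 = 70); far more than 8 on one host
  # would point to oversubscription and could explain the out-of-memory
  # failure.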

-- 
David van der Spoel, Ph.D., Professor of Biology
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se


