[gmx-users] Replica Exchange MD on more than 64 processors

bharat v. adkar bharat at sscu.iisc.ernet.in
Mon Dec 28 06:31:24 CET 2009


On Mon, 28 Dec 2009, Mark Abraham wrote:

> bharat v. adkar wrote:
>>  On Sun, 27 Dec 2009, Mark Abraham wrote:
>> 
>> >  bharat v. adkar wrote:
>> > >   On Sun, 27 Dec 2009, Mark Abraham wrote:
>> > > 
>> > > >   bharat v. adkar wrote:
>> > > > >  Dear all,
>> > > > >
>> > > > >  I am trying to perform replica exchange MD (REMD) on a 'protein in
>> > > > >  water' system. I am following the instructions given on the wiki
>> > > > >  (How-Tos -> REMD). I have to perform the REMD simulation with 35
>> > > > >  different temperatures. As advised on the wiki, I equilibrated the
>> > > > >  system at the respective temperatures (a total of 35 equilibration
>> > > > >  simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>> > > > >  chk_34.tpr files from the equilibrated structures.
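
For concreteness, the chk_*.tpr files above were generated with grompp from
the equilibrated structures, roughly along these lines (the .mdp, .gro and
.top file names below are only placeholders for my actual files):

    # one temperature-specific .mdp and one equilibrated structure per replica
    for i in $(seq 0 34); do
        grompp -f remd_${i}.mdp -c equil_${i}.gro -p topol.top -o chk_${i}.tpr
    done

With -multi 35 and -s chk_.tpr, mdrun should then pick up chk_0.tpr ...
chk_34.tpr, one per replica (two MPI processes per replica at -np 70).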
>> > > > >
>> > > > >  Now when I submit the final job for REMD with the following command
>> > > > >  line, it gives an error:
>> > > > >
>> > > > >  command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>> > > > >
>> > > > >  error msg:
>> > > > >  -------------------------------------------------------
>> > > > >  Program mdrun_mpi, VERSION 4.0.7
>> > > > >  Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>> > > > >
>> > > > >  Fatal error:
>> > > > >  Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>> > > > >  nlist->jjnr=0x9a400030
>> > > > >  (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>> > > > >  -------------------------------------------------------
>> > > > >
>> > > > >  Thanx for Using GROMACS - Have a Nice Day
>> > > > >  : Cannot allocate memory
>> > > > >  Error on node 19, will try to stop all the nodes
>> > > > >  Halting parallel program mdrun_mpi on CPU 19 out of 70
>> > > > >  ***********************************************************************
>> > > > >
>> > > > >  The individual nodes on the cluster have 8GB of physical memory and
>> > > > >  16GB of swap memory. Moreover, when logged onto the individual
>> > > > >  nodes, they show more than 1GB of free memory, so there should be
>> > > > >  no problem with cluster memory. Also, the equilibration jobs for
>> > > > >  the same system ran on the same cluster without any problem.
>> > > > >
>> > > > >  What I have observed by submitting different test jobs with varying
>> > > > >  numbers of processors (and numbers of replicas, wherever necessary)
>> > > > >  is that any job with a total number of processors <= 64 runs
>> > > > >  faithfully without any problem. As soon as the total number of
>> > > > >  processors is more than 64, it gives the above error. I have tested
>> > > > >  this with 65 processors/65 replicas as well.
>> > > >
>> > > >  This sounds like you might be running on fewer physical CPUs than you
>> > > >  have available. If so, running multiple MPI processes per physical
>> > > >  CPU can lead to memory shortage conditions.
>> > > 
>> > >
>> > >  I don't understand what you mean. Do you mean there might be more than
>> > >  8 processes running per node (each node has 8 processors)? But that
>> > >  also does not seem to be the case, as the SGE (Sun Grid Engine) output
>> > >  shows only eight processes per node.
>> > 
>> >  65 processes can't have 8 processes per node.
>>  Why can't it? As I said, there are 8 processors per node. What I have
>>  not mentioned is how many nodes it is using. The jobs got distributed
>>  over 9 nodes: 8 of them account for 64 processors, plus 1 processor from
>>  the 9th node.
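
For reference, the allocation that SGE actually granted can be checked from
inside the job script as well; a rough sketch, assuming the job runs under an
SGE parallel environment (in which case SGE sets $PE_HOSTFILE):

    # each line of $PE_HOSTFILE is: hostname  slots  queue  processor-range
    cat $PE_HOSTFILE
    # the total should be 70 slots, and no host should list more than 8
    awk '{ total += $2 } END { print "total slots:", total }' $PE_HOSTFILE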
>
> OK, that's a full description. Your symptoms are indicative of someone making 
> an error somewhere. Since GROMACS works over more than 64 processors 
> elsewhere, the presumption is that you are doing something wrong or the 
> machine is not set up in the way you think it is or should be. To get the 
> most effective help, you need to be sure you're providing full information - 
> else we can't tell which error you're making or (potentially) eliminate you 
> as a source of error.
>
Sorry for not being clear in my statements.

>>  As far as I can tell, the job distribution seems okay to me: it is 1 job
>>  per processor.
>
> Does non-REMD GROMACS run on more than 64 processors? Does your cluster 
> support using more than 8 nodes in a run? Can you run an MPI "Hello world" 
> application that prints the processor and node ID across more than 64 
> processors?

Yes, the cluster supports runs with more than 8 nodes. I generated a 
system with a 10 nm water box and submitted it on 80 processors. It ran 
fine: it printed all 80 NODEIDs and also showed when the job would 
finish.
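
As a further check of rank placement that does not involve GROMACS, something
like the following could be run in the same queue slot (whether mpiexec will
start a non-MPI program such as hostname this way depends on the MPI
installation):

    # launch 70 ranks and count how many land on each node;
    # more than 8 per node would mean the ranks are oversubscribed
    mpiexec -np 70 hostname | sort | uniq -c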

bharat


>
> Mark
>
>
>>  bharat
>> 
>> > 
>> >  Mark
>> > 
>> > > >   I don't know what you mean by "swap memory".
>> > > 
>> > >   Sorry, I meant cache memory..
>> > > 
>> > >   bharat
>> > > 
>> > > >  Mark
>> > > >
>> > > > >  System: Protein + water + Na ions (total 46878 atoms)
>> > > > >  Gromacs version: tested with both v4.0.5 and v4.0.7
>> > > > >  compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>> > > > >  compiler: gcc_3.4.6 -O3
>> > > > >  machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>> > > > >
>> > > > >  I tried searching the mailing list without any luck. I am not sure
>> > > > >  if I am doing anything wrong in giving the commands. Please correct
>> > > > >  me if it is wrong.
>> > > > >
>> > > > >  Kindly let me know the solution.
>> > > > >
>> > > > >  bharat
