[gmx-users] Replica Exchange MD on more than 64 processors

bharat v. adkar bharat at sscu.iisc.ernet.in
Sun Dec 27 14:54:19 CET 2009


On Sun, 27 Dec 2009, Mark Abraham wrote:

> bharat v. adkar wrote:
>>  On Sun, 27 Dec 2009, Mark Abraham wrote:
>> 
>> >  bharat v. adkar wrote:
>> > > 
>> > >  Dear all,
>> > >    I am trying to perform replica exchange MD (REMD) on a 'protein in
>> > >  water' system. I am following the instructions given on the wiki
>> > >  (How-Tos -> REMD). I have to perform the REMD simulation at 35
>> > >  different temperatures. As per the advice on the wiki, I equilibrated
>> > >  the system at the respective temperatures (a total of 35 equilibration
>> > >  simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>> > >  chk_34.tpr files from the equilibrated structures.
>> > > 
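For reference, the loop I used to generate the per-replica .tpr files was
roughly the following (a sketch; the temp_*.mdp, equil_*.gro and topol.top
names are placeholders for my actual input files):

    # one grompp call per replica: one .mdp per temperature
    # (temp_0.mdp ... temp_34.mdp) and the equilibrated coordinates
    # (equil_0.gro ... equil_34.gro) from the 35 equilibration runs
    for i in $(seq 0 34); do
        grompp -f temp_${i}.mdp -c equil_${i}.gro -p topol.top -o chk_${i}.tpr
    done
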
>> > >  Now when I submit the final REMD job with the following command line,
>> > >  it gives an error:
>> > > 
>> > >  command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>> > > 
>> > >   error msg:
>> > >   -------------------------------------------------------
>> > >   Program mdrun_mpi, VERSION 4.0.7
>> > >   Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>> > > 
>> > >   Fatal error:
>> > >   Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>> > >   nlist->jjnr=0x9a400030
>> > >   (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>> > >   -------------------------------------------------------
>> > > 
>> > >   Thanx for Using GROMACS - Have a Nice Day
>> > > :   Cannot allocate memory
>> > >   Error on node 19, will try to stop all the nodes
>> > >   Halting parallel program mdrun_mpi on CPU 19 out of 70
>> > >   ***********************************************************************
>> > > 
>> > > 
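To be explicit about what that command line asks for (the file-name
expansion is standard mdrun -multi behaviour; the process split is just
arithmetic):

    # with -multi, mdrun appends the replica index to the -s argument,
    # so replica i (i = 0..34) reads chk_i.tpr;
    # 70 MPI processes / 35 replicas = 2 processes per simulation
    mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
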
>> > >  Each individual node on the cluster has 8GB of physical memory and
>> > >  16GB of swap memory. Moreover, when logged onto the individual nodes,
>> > >  they show more than 1GB of free memory, so there should be no problem
>> > >  with the cluster memory. Also, the equilibration jobs for the same
>> > >  system ran on the same cluster without any problem.
>> > > 
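For completeness, this is roughly how I checked the memory on the nodes
(the node names are placeholders; ulimit -v shows the per-process
address-space limit, which a batch system could in principle set lower
than the physical memory):

    # free memory and per-process virtual-memory limit on each node
    for n in node0{1..9}; do
        ssh $n 'free -m; ulimit -v'
    done
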
>> > >  What I have observed by submitting different test jobs with a varying
>> > >  number of processors (and number of replicas, where necessary) is that
>> > >  any job with a total of <= 64 processors runs without any problem. As
>> > >  soon as the total number of processors exceeds 64, it gives the above
>> > >  error. I have tested this with 65 processors/65 replicas as well.
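The smallest pair of tests that brackets the boundary looked roughly like
this (assuming .tpr files exist for the replica counts used):

    # 64 MPI processes in total: runs without any problem
    mpiexec -np 64 mdrun -multi 32 -replex 1000 -s chk_.tpr -v
    # 65 MPI processes in total (65 replicas): fails with the realloc error
    mpiexec -np 65 mdrun -multi 65 -replex 1000 -s chk_.tpr -v
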
>> > 
>> >  This sounds like you might be running more MPI processes than you have
>> >  physical CPUs available. If so, running multiple MPI processes per
>> >  physical CPU can lead to memory shortage conditions.
>>
>>  I don't understand what you mean. Do you mean there might be more than 8
>>  processes running per node (each node has 8 processors)? But that does not
>>  seem to be the case, as the SGE (Sun Grid Engine) output shows only eight
>>  processes per node.
>
> 65 processes can't have 8 processes per node.
Why can't it? As I said, there are 8 processors per node. What I have not
mentioned is how many nodes the job is using. The jobs got distributed over
9 nodes: 8 of them account for 64 processors, plus 1 processor from the
9th node.
As far as I can tell, the job distribution seems okay. It is 1 job per
processor.
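For what it's worth, the slot counts can be read directly from the SGE
host file inside the job, e.g.:

    # slots granted per node for this job (PE_HOSTFILE is set by SGE);
    # expected: 8 slots on each of 8 nodes and 1 on the 9th (8*8 + 1 = 65)
    awk '{print $1, $2}' $PE_HOSTFILE
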

bharat

>
> Mark
>
>> >  I don't know what you mean by "swap memory".
>>
>>  Sorry, I meant cache memory.
>>
>>  bharat
>> 
>> > 
>> >  Mark
>> > 
>> > >   System: Protein + water + Na ions (total 46878 atoms)
>> > >   Gromacs version: tested with both v4.0.5 and v4.0.7
>> > >   compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>> > >   compiler: gcc_3.4.6 -O3
>> > >   machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>> > > 
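For reference, the build was configured roughly as follows (the
--program-suffix flag is my guess from the mdrun_mpi name in the error
message, not something stated above):

    # GROMACS 4.0.x autoconf build; mdrun is the only binary needing MPI
    ./configure --enable-float --with-fft=fftw3 --enable-mpi --program-suffix=_mpi
    make mdrun
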
>> > > 
>> > >  I tried searching the mailing list without any luck. I am not sure if
>> > >  I am doing anything wrong in the commands. Please correct me if I am.
>> > > 
>> > >   Kindly let me know the solution.
>> > > 
>> > > 
>> > >   bharat
>> > > 
>> > > 
>> >
>> 
>
