[gmx-users] Replica Exchange MD on more than 64 processors

bharat v. adkar bharat at sscu.iisc.ernet.in
Sun Dec 27 12:07:24 CET 2009

On Sun, 27 Dec 2009, Mark Abraham wrote:

> bharat v. adkar wrote:
>>  Dear all,
>>    I am trying to perform replica exchange MD (REMD) on a 'protein in
>>  water' system. I am following the instructions given on the wiki
>>  (How-Tos -> REMD). I have to perform the REMD simulation with 35 different
>>  temperatures. As per the advice on the wiki, I equilibrated the system at
>>  the respective temperatures (a total of 35 equilibration simulations).
>>  After this, I generated chk_0.tpr, chk_1.tpr, ..., chk_34.tpr files from
>>  the equilibrated structures.
>>  Now when I submit the final REMD job with the following command line, it
>>  gives an error:
>>  command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>>  error msg:
>>  -------------------------------------------------------
>>  Program mdrun_mpi, VERSION 4.0.7
>>  Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>  Fatal error:
>>  Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>  nlist->jjnr=0x9a400030
>>  (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>  -------------------------------------------------------
>>  Thanx for Using GROMACS - Have a Nice Day
>> :  Cannot allocate memory
>>  Error on node 19, will try to stop all the nodes
>>  Halting parallel program mdrun_mpi on CPU 19 out of 70
>>  ***********************************************************************
>>  Each node on the cluster has 8GB of physical memory and 16GB of
>>  swap memory. Moreover, when I log onto the individual nodes, they show
>>  more than 1GB of free memory, so there should be no problem with cluster
>>  memory. Also, the equilibration jobs for the same system ran on the
>>  same cluster without any problem.
>>  What I have observed, by submitting different test jobs with varying
>>  numbers of processors (and numbers of replicas, wherever necessary), is
>>  that any job with a total number of processors <= 64 runs without any
>>  problem. As soon as the total number of processors exceeds 64, it gives
>>  the above error. I have also tested this with 65 processors/65 replicas.
> This sounds like you might be running on fewer physical CPUs than you have 
> available. If so, running multiple MPI processes per physical CPU can lead to 
> memory shortage conditions.

I don't understand what you mean. Do you mean that there might be more than 8
processes running per node (each node has 8 processors)? That does not seem
to be the case either, as the SGE (Sun Grid Engine) output shows only
eight processes per node.
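
One way to cross-check the SGE output is to count the MPI ranks per host directly. The following is only a minimal sketch: the host names and the fill order of 8 ranks per node are hypothetical (in practice the list would come from the machine file the scheduler hands to mpiexec), and it assumes the 70 ranks are split evenly across the 35 replicas.

```python
from collections import Counter


def ranks_per_replica(n_ranks, n_replicas):
    """Return the number of MPI ranks each replica gets, assuming an even split."""
    if n_ranks % n_replicas != 0:
        raise ValueError("number of ranks is not a multiple of the number of replicas")
    return n_ranks // n_replicas


def oversubscribed_nodes(rank_hosts, cores_per_node):
    """Return {host: rank_count} for every host running more ranks than it has cores."""
    counts = Counter(rank_hosts)
    return {host: n for host, n in counts.items() if n > cores_per_node}


# Hypothetical placement: 70 ranks, filled 8 per node (node00 ... node08).
rank_hosts = [f"node{rank // 8:02d}" for rank in range(70)]

print(ranks_per_replica(70, 35))            # ranks used by each replica
print(oversubscribed_nodes(rank_hosts, 8))  # empty dict means no node is overloaded
```

If the second call returns a non-empty dictionary for the real host list, some node is running more processes than it has processors.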

> I don't know what you mean by "swap memory".

Sorry, I meant cache memory.


> Mark
>>  System: Protein + water + Na ions (total 46878 atoms)
>>  Gromacs version: tested with both v4.0.5 and v4.0.7
>>  compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>  compiler: gcc_3.4.6 -O3
>>  machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>>  I tried searching the mailing list without any luck. I am not sure if I
>>  am doing anything wrong in the commands I am giving. Please correct me if
>>  so.
>>  Kindly let me know the solution.
>>  bharat

