[gmx-users] Replica Exchange MD on more than 64 processors

Sun Dec 27 13:08:04 CET 2009

bharat v. adkar wrote:
> On Sun, 27 Dec 2009, Mark Abraham wrote:
> 
>> bharat v. adkar wrote:
>>>
>>>  Dear all,
>>>    I am trying to perform replica exchange MD (REMD) on a 'protein in
>>>  water' system. I am following the instructions given on the wiki
>>>  (How-Tos -> REMD). I have to perform the REMD simulation with 35
>>>  different temperatures. As per the advice on the wiki, I equilibrated
>>>  the system at the respective temperatures (total of 35 equilibration
>>>  simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>>>  chk_34.tpr files from the equilibrated structures.
>>>
>>>  Now when I submit the final job for REMD with the following command
>>>  line, it gives an error:
>>>
>>>  command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>>>
>>>  error msg:
>>>  -------------------------------------------------------
>>>  Program mdrun_mpi, VERSION 4.0.7
>>>  Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>>
>>>  Fatal error:
>>>  Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>>  nlist->jjnr=0x9a400030
>>>  (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>>  -------------------------------------------------------
>>>
>>>  Thanx for Using GROMACS - Have a Nice Day
>>> :  Cannot allocate memory
>>>  Error on node 19, will try to stop all the nodes
>>>  Halting parallel program mdrun_mpi on CPU 19 out of 70
>>>  ***********************************************************************
>>>
>>>
>>>  Each node on the cluster has 8GB of physical memory and 16GB of swap
>>>  memory. Moreover, when I log on to the individual nodes, they show more
>>>  than 1GB of free memory, so there should be no problem with cluster
>>>  memory. Also, the equilibration jobs for the same system ran on the same
>>>  cluster without any problem.
>>>
>>>  By submitting test jobs with varying numbers of processors (and numbers
>>>  of replicas, wherever necessary), I have observed that any job with a
>>>  total number of processors <= 64 runs without any problem. As soon as
>>>  the total number of processors exceeds 64, it gives the above error. I
>>>  have also tested this with 65 processors/65 replicas.
>>
>> This sounds like you might be running on fewer physical CPUs than you 
>> have available. If so, running multiple MPI processes per physical CPU 
>> can lead to memory shortage conditions.
> 
> I don't understand what you mean. Do you mean, there might be more than 
> 8 processes running per node (each node has 8 processors)? But that also 
> does not seem to be the case, as SGE (sun grid engine) output shows only 
> eight processes per node.

65 processes can't have 8 processes per node.
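
To make the arithmetic concrete, here is a minimal Python sketch. It
assumes processes are packed onto nodes in blocks of 8 (one per core, as
described above). The function name node_layout and the block-placement
assumption are illustrative only; the real placement is decided by SGE
and the MPI launcher, so check the job's host file to see what actually
happened.

```python
import math


def node_layout(n_procs, cores_per_node=8):
    """Return (full_nodes, leftover_procs, nodes_needed).

    Hypothetical helper: assumes simple block placement, i.e. each node
    is filled with cores_per_node processes before the next one is used.
    """
    full_nodes, leftover = divmod(n_procs, cores_per_node)
    nodes_needed = math.ceil(n_procs / cores_per_node)
    return full_nodes, leftover, nodes_needed


for n in (64, 65, 70):
    full, leftover, needed = node_layout(n)
    print(f"{n} processes: {full} full nodes, "
          f"{leftover} left over, {needed} nodes needed")
```

With 64 processes nothing is left over, whereas 65 or 70 processes need
a ninth node. If the job was only granted 8 nodes, the extra processes
have to share cores and memory with the others.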

Mark

>> I don't know what you mean by "swap memory".
> 
> Sorry, I meant cache memory.
> 
> bharat
> 
>>
>> Mark
>>
>>>  System: Protein + water + Na ions (total 46878 atoms)
>>>  Gromacs version: tested with both v4.0.5 and v4.0.7
>>>  compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>>  compiler: gcc_3.4.6 -O3
>>>  machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>>  I tried searching the mailing list without any luck. I am not sure if I
>>>  am doing anything wrong in giving the commands. Please correct me if
>>>  they are wrong.
>>>
>>>  Kindly let me know the solution.
>>>
>>>
>>>  bharat
>>>
>>>
>>
>