[gmx-users] Replica Exchange MD on more than 64 processors
Mark Abraham
Mark.Abraham at anu.edu.au
Sun Dec 27 13:08:04 CET 2009
bharat v. adkar wrote:
> On Sun, 27 Dec 2009, Mark Abraham wrote:
>
>> bharat v. adkar wrote:
>>>
>>> Dear all,
>>> I am trying to perform replica exchange MD (REMD) on a 'protein in
>>> water' system. I am following the instructions given on the wiki
>>> (How-Tos -> REMD). I have to perform the REMD simulation with 35
>>> different temperatures. As per the advice on the wiki, I equilibrated
>>> the system at the respective temperatures (a total of 35 equilibration
>>> simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>>> chk_34.tpr files from the equilibrated structures.
>>>
>>> Now when I submit the final job for REMD with the following command
>>> line, it gives an error:
>>>
>>> command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>>>
>>> error msg:
>>> -------------------------------------------------------
>>> Program mdrun_mpi, VERSION 4.0.7
>>> Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>>
>>> Fatal error:
>>> Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>> nlist->jjnr=0x9a400030
>>> (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>> -------------------------------------------------------
>>>
>>> Thanx for Using GROMACS - Have a Nice Day
>>> : Cannot allocate memory
>>> Error on node 19, will try to stop all the nodes
>>> Halting parallel program mdrun_mpi on CPU 19 out of 70
>>> ***********************************************************************
>>>
>>>
>>> Each node on the cluster has 8 GB of physical memory and 16 GB of swap
>>> memory. Moreover, when logged onto the individual nodes, they show more
>>> than 1 GB of free memory, so there should be no problem with the
>>> cluster's memory. Also, the equilibration jobs for the same system ran
>>> on the same cluster without any problem.
>>>
>>> What I have observed, by submitting test jobs with varying numbers of
>>> processors (and numbers of replicas, where necessary), is that any job
>>> with a total of <= 64 processors runs without any problem. As soon as
>>> the total number of processors exceeds 64, it gives the above error. I
>>> have tested this with 65 processors/65 replicas as well.
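For reference, the workflow quoted above boils down to something like the
following sketch. The .mdp/.gro file names here are assumptions; only the
chk_ prefix and the -multi/-replex usage come from the command quoted above.

  # Build one .tpr per replica with the GROMACS 4.0.x grompp tool
  # (input file names are illustrative, not from the original post)
  for i in $(seq 0 34); do
      grompp -f chk_${i}.mdp -c equil_${i}.gro -p topol.top -o chk_${i}.tpr
  done

  # Launch all 35 replicas as one multi-simulation; mdrun inserts the replica
  # number into the -s stem, so chk_.tpr resolves to chk_0.tpr ... chk_34.tpr
  mpiexec -np 70 mdrun_mpi -multi 35 -replex 1000 -s chk_.tpr -v

With 70 ranks and 35 replicas, each replica is run on 2 MPI processes.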
>>
>> This sounds like you might be running on fewer physical CPUs than you
>> have MPI processes. If so, running multiple MPI processes per physical
>> CPU can lead to memory shortages.
>
> I don't understand what you mean. Do you mean there might be more than
> 8 processes running per node (each node has 8 processors)? But that does
> not seem to be the case, as the SGE (Sun Grid Engine) output shows only
> eight processes per node.
65 processes can't be laid out as exactly 8 processes per node (65 is not
divisible by 8).
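A quick way to see how the ranks actually land on the nodes (assuming the
same MPI launcher and SGE allocation as in the job above) is to run
hostname in place of mdrun and count the entries per node:

  # count how many MPI ranks each node actually receives
  mpiexec -np 70 hostname | sort | uniq -c

If any node shows more than 8 entries, the ranks are oversubscribing the
physical CPUs on that node.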
Mark
>> I don't know what you mean by "swap memory".
>
> Sorry, I meant cache memory.
>
> bharat
>
>>
>> Mark
>>
>>> System: Protein + water + Na ions (total 46878 atoms)
>>> Gromacs version: tested with both v4.0.5 and v4.0.7
>>> compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>> compiler: gcc_3.4.6 -O3
>>> machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
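For completeness, the build options listed above correspond to an autotools
configure invocation roughly along these lines for the 4.0.x series (the
CC/CFLAGS placement is an assumption based on the stated compiler):

  # flags taken from the post above; CC/CFLAGS placement is an assumption
  ./configure --enable-float --with-fft=fftw3 --enable-mpi CC=gcc CFLAGS="-O3"
  make && make install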
>>> I tried searching the mailing list without any luck. I am not sure if
>>> I am doing anything wrong in the commands I am giving. Please correct
>>> me if I am.
>>>
>>> Kindly let me know the solution.
>>>
>>>
>>> bharat
>>>
>>>
>>
>