[gmx-users] Replica Exchange MD on more than 64 processors
David van der Spoel
spoel at xray.bmc.uu.se
Mon Dec 28 08:12:28 CET 2009
bharat v. adkar wrote:
> On Mon, 28 Dec 2009, Mark Abraham wrote:
>
>> bharat v. adkar wrote:
>>> On Sun, 27 Dec 2009, Mark Abraham wrote:
>>>
>>>> bharat v. adkar wrote:
>>>>> On Sun, 27 Dec 2009, Mark Abraham wrote:
>>>>>
>>>>>> bharat v. adkar wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I am trying to perform replica exchange MD (REMD) on a 'protein in
>>>>>>> water' system. I am following the instructions given on the wiki
>>>>>>> (How-Tos -> REMD). I have to perform the REMD simulation with 35
>>>>>>> different temperatures. As per the advice on the wiki, I equilibrated
>>>>>>> the system at the respective temperatures (a total of 35 equilibration
>>>>>>> simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>>>>>>> chk_34.tpr files from the equilibrated structures.
>>>>>>>
>>>>>>> Now when I submit the final REMD job with the following command line,
>>>>>>> it gives an error:
>>>>>>>
>>>>>>> command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>>>>>>>
>>>>>>> error msg:
>>>>>>> -------------------------------------------------------
>>>>>>> Program mdrun_mpi, VERSION 4.0.7
>>>>>>> Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>>>>>>
>>>>>>> Fatal error:
>>>>>>> Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>>>>>> nlist->jjnr=0x9a400030
>>>>>>> (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>>>>>> -------------------------------------------------------
>>>>>>>
>>>>>>> Thanx for Using GROMACS - Have a Nice Day
>>>>>>> : Cannot allocate memory
>>>>>>> Error on node 19, will try to stop all the nodes
>>>>>>> Halting parallel program mdrun_mpi on CPU 19 out of 70
>>>>>>> ***********************************************************************
>>>>>>>
>>>>>>> Each node on the cluster has 8 GB of physical memory and 16 GB of swap
>>>>>>> memory. Moreover, when logged onto the individual nodes, they show more
>>>>>>> than 1 GB of free memory, so there should be no problem with cluster
>>>>>>> memory. Also, the equilibration jobs for the same system ran on the
>>>>>>> same cluster without any problem.
>>>>>>>
>>>>>>> What I have observed by submitting different test jobs with varying
>>>>>>> numbers of processors (and numbers of replicas, where necessary) is
>>>>>>> that any job with a total of 64 processors or fewer runs without any
>>>>>>> problem. As soon as the total number of processors is more than 64, it
>>>>>>> gives the above error. I have tested this with 65 processors/65
>>>>>>> replicas as well.
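
For anyone reproducing the setup quoted above: the 35 per-replica .tpr files
are usually generated with a short shell loop along the lines of the sketch
below. The remd_${i}.mdp, equil_${i}.gro and topol.top names are placeholders
for the per-temperature .mdp files and equilibrated structures described
above, not files from the original post.

# one grompp call per replica, using the .mdp file and equilibrated
# coordinates for that temperature (file names are placeholders)
for i in $(seq 0 34); do
    grompp -f remd_${i}.mdp -c equil_${i}.gro -p topol.top -o chk_${i}.tpr
done

# launch all replicas in one job; with -multi 35, mdrun appends the replica
# index to the -s name and reads chk_0.tpr ... chk_34.tpr, so 70 MPI ranks
# means 2 ranks per replica
mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
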
>>>>>> This sounds like you might be running on fewer physical CPUs than you
>>>>>> have available. If so, running multiple MPI processes per physical CPU
>>>>>> can lead to memory shortage conditions.
>>>>>
>>>>> I don't understand what you mean. Do you mean there might be more than 8
>>>>> processes running per node (each node has 8 processors)? That also does
>>>>> not seem to be the case, as the SGE (Sun Grid Engine) output shows only
>>>>> eight processes per node.
>>>>
>>>> 65 processes can't have 8 processes per node.
>>>
>>> Why can't it? As I said, there are 8 processors per node. What I had not
>>> mentioned is how many nodes the job is using. The jobs got distributed
>>> over 9 nodes: 8 of them account for 64 processors, plus 1 processor from
>>> the 9th node.
>>
>> OK, that's a full description. Your symptoms are indicative of someone
>> making an error somewhere. Since GROMACS works over more than 64
>> processors elsewhere, the presumption is that you are doing something
>> wrong or the machine is not set up in the way you think it is or
>> should be. To get the most effective help, you need to be sure you're
>> providing full information - else we can't tell which error you're
>> making or (potentially) eliminate you as a source of error.
>>
> Sorry for not being clear in my statements.
>
>>> As far as I can tell, the job distribution seems okay to me. It is 1 job
>>> per processor.
>>
>> Does non-REMD GROMACS run on more than 64 processors? Does your
>> cluster support using more than 8 nodes in a run? Can you run an MPI
>> "Hello world" application that prints the processor and node ID across
>> more than 64 processors?
>
> Yes, the cluster supports runs with more than 8 nodes. I generated a
> system with a 10 nm water box and submitted it on 80 processors. It was
> running fine: it printed all 80 NODEIDs and also showed me when the job
> would finish.
>
> bharat
>
>
>>
>> Mark
>>
>>
>>> bharat
>>>
>>>> Mark
>>>>
>>>>>> I don't know what you mean by "swap memory".
>>>>>
>>>>> Sorry, I meant cache memory.
>>>>>
>>>>> bharat
>>>>>
>>>>>> Mark
>>>>>>
>>>>>>> System: Protein + water + Na ions (total 46878 atoms)
>>>>>>> Gromacs version: tested with both v4.0.5 and v4.0.7
>>>>>>> compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>>>>>> compiler: gcc_3.4.6 -O3
>>>>>>> machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> I tried searching the mailing list without any luck. I am not sure if
>>>>>>> I am doing anything wrong in giving the commands. Please correct me if
>>>>>>> it is wrong.
>>>>>>>
>>>>>>> Kindly let me know the solution.
>>>>>>>
>>>>>>> bharat
>>>
>>
>
Your system is running out of memory: probably the system is too big, or all
the replicas are running on the same node.
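
A quick way to check the second possibility, and to answer Mark's "Hello
world" question above directly, is a minimal MPI test that prints each rank
and the node it runs on, then counts ranks per node. This is only a sketch;
it assumes mpicc and mpiexec come from the same MPI installation that was
used to build mdrun_mpi, and that the test is submitted through the same 70
SGE slots as the failing REMD run.

# write a minimal MPI "hello world" that reports rank and node name
cat > hello_mpi.c << 'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this rank's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
    MPI_Get_processor_name(node, &len);    /* name of the node it runs on */
    printf("rank %d of %d on node %s\n", rank, size, node);
    MPI_Finalize();
    return 0;
}
EOF

mpicc -o hello_mpi hello_mpi.c

# count ranks per node: more than 8 on any one node means the scheduler is
# oversubscribing that node, which would explain the memory failures
mpiexec -np 70 ./hello_mpi | awk '{print $NF}' | sort | uniq -c

If every node reports at most 8 ranks, oversubscription can be ruled out and
the problem is more likely on the GROMACS/REMD side.
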
--
David van der Spoel, Ph.D., Professor of Biology
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone: +46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se spoel at gromacs.org http://folding.bmc.uu.se