[gmx-users] Replica Exchange MD on more than 64 processors
bharat v. adkar
bharat at sscu.iisc.ernet.in
Mon Dec 28 06:31:24 CET 2009
On Mon, 28 Dec 2009, Mark Abraham wrote:
> bharat v. adkar wrote:
>> On Sun, 27 Dec 2009, Mark Abraham wrote:
>>
>>> bharat v. adkar wrote:
>>>> On Sun, 27 Dec 2009, Mark Abraham wrote:
>>>>
>>>>> bharat v. adkar wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I am trying to perform replica exchange MD (REMD) on a 'protein in
>>>>>> water' system. I am following the instructions given on the wiki
>>>>>> (How-Tos -> REMD). I have to perform the REMD simulation with 35
>>>>>> different temperatures. As per the advice on the wiki, I equilibrated
>>>>>> the system at the respective temperatures (a total of 35 equilibration
>>>>>> simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
>>>>>> chk_34.tpr files from the equilibrated structures.
>>>>>>
>>>>>> Now when I submit the final REMD job with the following command line,
>>>>>> it gives an error:
>>>>>>
>>>>>> command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
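
For completeness, the preparation and launch boil down to something like the
following (just a sketch: the chk_*.mdp, chk_*.gro and topol.top names are
placeholders for my actual per-temperature input files; only the chk_*.tpr
names and the mdrun line above are the real ones):

  # build one .tpr per replica from the equilibrated structures
  for i in $(seq 0 34); do
      grompp -f chk_${i}.mdp -c chk_${i}.gro -p topol.top -o chk_${i}.tpr
  done

  # 70 MPI processes over 35 replicas = 2 processes per replica;
  # with -multi, mdrun inserts the replica index before the .tpr extension,
  # so "-s chk_.tpr" picks up chk_0.tpr ... chk_34.tpr
  mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
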
>>>>>>
>>>>>> error msg:
>>>>>> -------------------------------------------------------
>>>>>> Program mdrun_mpi, VERSION 4.0.7
>>>>>> Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>>>>>>
>>>>>> Fatal error:
>>>>>> Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
>>>>>> nlist->jjnr=0x9a400030
>>>>>> (called from file ../../../SRC/src/mdlib/ns.c, line 503)
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> Thanx for Using GROMACS - Have a Nice Day
>>>>>> : Cannot allocate memory
>>>>>> Error on node 19, will try to stop all the nodes
>>>>>> Halting parallel program mdrun_mpi on CPU 19 out of 70
>>>>>> ***********************************************************************
>>>>>>
>>>>>> The individual nodes on the cluster have 8 GB of physical memory and
>>>>>> 16 GB of swap memory. Moreover, when logged onto the individual nodes,
>>>>>> each shows more than 1 GB of free memory, so there should be no
>>>>>> problem with cluster memory. Also, the equilibration jobs for the same
>>>>>> system ran on the same cluster without any problem.
>>>>>>
>>>>>> What I have observed by submitting different test jobs with varying
>>>>>> numbers of processors (and numbers of replicas, where necessary) is
>>>>>> that any job with a total of <= 64 processors runs faithfully without
>>>>>> any problem. As soon as the total number of processors exceeds 64, it
>>>>>> gives the above error. I have tested this with 65 processors/65
>>>>>> replicas as well.
>>>>>
>>>>> This sounds like you might be running on fewer physical CPUs than you
>>>>> have available. If so, running multiple MPI processes per physical CPU
>>>>> can lead to memory shortage conditions.
>>>>
>>>> I don't understand what you mean. Do you mean there might be more than
>>>> 8 processes running per node (each node has 8 processors)? But that
>>>> also does not seem to be the case, as the SGE (Sun Grid Engine) output
>>>> shows only eight processes per node.
>>>
>>> 65 processes can't have 8 processes per node.
>>
>> Why can't it? As I said, there are 8 processors per node. What I have not
>> mentioned is how many nodes the job is using. The jobs got distributed
>> over 9 nodes: 8 of them account for 64 processors, plus 1 processor from
>> the 9th node.
>
> OK, that's a full description. Your symptoms are indicative of someone making
> an error somewhere. Since GROMACS works over more than 64 processors
> elsewhere, the presumption is that you are doing something wrong or the
> machine is not set up in the way you think it is or should be. To get the
> most effective help, you need to be sure you're providing full information -
> else we can't tell which error you're making or (potentially) eliminate you
> as a source of error.
>
Sorry for not being clear in my statements.
>> As far as I can tell, the job distribution seems okay to me. It is 1 job
>> per processor.
>
> Does non-REMD GROMACS run on more than 64 processors? Does your cluster
> support using more than 8 nodes in a run? Can you run an MPI "Hello world"
> application that prints the processor and node ID across more than 64
> processors?
Yes, the cluster supports runs with more than 8 nodes. I generated a system
with a 10 nm water box and submitted it on 80 processors. It ran fine: it
printed all 80 NODEIDs and also showed me when the job would finish.
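
In case it is useful, the placement check amounts to something like this (a
rough sketch; "node09" is just an example hostname, and mdrun_mpi is the
binary name from the error message above):

  # run one trivial task per MPI rank and count how many ranks land on
  # each node (this stands in for the MPI "hello world" test)
  mpiexec -np 80 hostname | sort | uniq -c

  # independently confirm on a single compute node that no more than
  # 8 mdrun_mpi ranks are actually running there
  ssh node09 'ps -C mdrun_mpi -o pid= | wc -l'
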
bharat
>
> Mark
>
>> bharat
>>
>>> Mark
>>>
>>>>> I don't know what you mean by "swap memory".
>>>>
>>>> Sorry, I meant cache memory..
>>>>
>>>> bharat
>>>>
>>>>> Mark
>>>>>
>>>>>> System: Protein + water + Na ions (total 46878 atoms)
>>>>>> Gromacs version: tested with both v4.0.5 and v4.0.7
>>>>>> compiled with: --enable-float --with-fft=fftw3 --enable-mpi
>>>>>> compiler: gcc_3.4.6 -O3
>>>>>> machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
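
(For reference, those flags correspond to a configure line roughly like the
one below; the --program-suffix=_mpi part is an assumption based on the
mdrun_mpi binary name, the rest just restates the options listed above.)

  export CC=gcc
  export CFLAGS=-O3
  ./configure --enable-float --with-fft=fftw3 --enable-mpi --program-suffix=_mpi
  make && make install
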
>>>>>>
>>>>>> I tried searching the mailing list without any luck. I am not sure if
>>>>>> I am doing anything wrong in giving the commands. Please correct me if
>>>>>> it is wrong.
>>>>>>
>>>>>> Kindly let me know the solution.
>>>>>>
>>>>>> bharat