[gmx-users] Problems with REMD in Gromacs 4.6.3
gigo at ibb.waw.pl
Fri Jul 12 17:27:56 CEST 2013
On 2013-07-12 11:15, Mark Abraham wrote:
> What does --loadbalance do?
It balances the total number of processes across all allocated nodes.
The thing is that mpiexec does not know that I want each replica to fork
to 4 OpenMP threads. Thus, without this option and without affinities
(more on those in a moment), mpiexec starts too many replicas on some nodes,
and gromacs then complains about the overload, while some cores on other
nodes are not used. It is possible to run my simulation like this:
mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
--loadbalance for mpiexec and without -ntomp for mdrun)
Then each replica runs on 4 MPI processes (I allocate 4 times more
cores than replicas and mdrun sees it). The problem is that it is much
slower than using OpenMP for each replica. I did not find any other way
than --loadbalance in mpiexec and then -multi 144 -ntomp 4 in mdrun to
use MPI and OpenMP at the same time on the torque-controlled cluster.
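As a sanity check on that layout, the core arithmetic can be sketched in a few lines of POSIX shell. The numbers are taken from the job script quoted later in this message; the variable names are only illustrative and are not part of Torque, OpenMPI or Gromacs.

```shell
#!/bin/sh
# Values from the PBS header and the mdrun command line in this thread.
nodes=48                # PBS: nodes=48
cores_per_node=12       # PBS: ppn=12
replicas=144            # mdrun -multi 144
threads_per_replica=4   # mdrun -ntomp 4

# Cores allocated by Torque versus cores the replicas will try to use.
total_cores=$((nodes * cores_per_node))
cores_needed=$((replicas * threads_per_replica))

# How many replicas fit on one node if each uses 4 OpenMP threads.
replicas_per_node=$((cores_per_node / threads_per_replica))

echo "total cores:       $total_cores"
echo "cores needed:      $cores_needed"
echo "replicas per node: $replicas_per_node"
```

For the allocation to be used fully and without overload, the first two numbers must match and each node must host exactly the third number of MPI processes, which is the placement --loadbalance is used for here.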
> What do the .log files say about
> OMP_NUM_THREADS, thread affinities, pinning, etc?
Each replica logs:
"Using 1 MPI process
Using 4 OpenMP threads",
That is correct. As I said, the threads are forked, but 3 out of 4
don't do anything, and the simulation does not progress at all.
About affinities Gromacs says:
"Can not set thread affinities on the current platform. On NUMA systems
this can cause performance degradation. If you think your platform should
support setting affinities, contact the GROMACS developers."
Well, the "current platform" is a normal x86_64 cluster, but all the
information about resources is passed by Torque to the OpenMPI-linked
Gromacs. Can it be that mdrun sees the resources allocated by Torque as
a big pool of CPUs and misses the information about node topology?
If you have any suggestions on how to debug or trace this issue, I would
be glad to participate.
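One possible starting point, given as a minimal sketch and not as a Gromacs tool: on Linux, the CPUs each thread of a process is allowed to run on can be read from /proc. The function name thread_affinity is hypothetical, and the sketch assumes a Linux compute node with /proc mounted.

```shell
#!/bin/sh
# Print, for every thread of the given process, the list of CPUs the
# kernel allows it to run on (field Cpus_allowed_list in /proc).
# Assumption: Linux with /proc; the PID is supplied by the user.
thread_affinity() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        tid=${task##*/}
        # The status line looks like "Cpus_allowed_list:<TAB>0-11".
        cpus=$(grep Cpus_allowed_list "$task/status" | cut -f2)
        echo "thread $tid may run on CPUs: $cpus"
    done
}

# Example: inspect the current shell. On a compute node one would pass
# the PID of one mdrun_mpi rank instead.
thread_affinity $$
```

If all four threads of a replica report the same single CPU, they are competing for one core, which would match the symptom of only one thread using any CPU. Per-thread CPU usage can be watched alongside this with top -H -p followed by the PID.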
> On Fri, Jul 12, 2013 at 3:46 AM, gigo <gigo at poczta.ibb.waw.pl> wrote:
>> Dear GMXers,
>> With Gromacs 4.6.2 I was running REMD with 144 replicas. Replicas are
>> separate MPI jobs of course (OpenMPI 1.6.4). Each replica I run on 4 cores
>> with OpenMP. There is Torque installed on the cluster built of 12-core
>> nodes, so I used the following script:
>> #!/bin/tcsh -f
>> #PBS -S /bin/tcsh
>> #PBS -N test
>> #PBS -l nodes=48:ppn=12
>> #PBS -l walltime=300:00:00
>> #PBS -l mem=288Gb
>> #PBS -r n
>> cd $PBS_O_WORKDIR
>> mpiexec -np 144 --loadbalance mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4
>> -replex 2000
>> It was working just great with 4.6.2. It does not work with 4.6.3. The new
>> version was compiled with the same options in the same environment. Mpiexec
>> spreads the replicas evenly over the cluster. Each replica forks 4 threads,
>> but only one of them uses any cpu. Logs end at the citations. Some
>> energy and trajectory files are created, but nothing is written to them.
>> Please let me know if you have any immediate suggestion on how to make it
>> work (maybe based on some differences between versions), or if I should
>> file a bug report with all the technical details.
>> Best Regards,
>> Grzegorz Wieczorek