[gmx-users] Problems with REMD in Gromacs 4.6.3
gigo at ibb.waw.pl
Sat Jul 13 02:24:04 CEST 2013
On 2013-07-12 20:00, Mark Abraham wrote:
> On Fri, Jul 12, 2013 at 4:27 PM, gigo <gigo at ibb.waw.pl> wrote:
>> On 2013-07-12 11:15, Mark Abraham wrote:
>>> What does --loadbalance do?
>> It balances the total number of processes across all allocated nodes.
> OK, but using it means you are hostage to its assumptions about
That's true, but as long as I do not try to use more resources than
Torque gives me, everything is OK. The question is: what is the proper
way of running multiple simulations in parallel with MPI, each further
parallelized with OpenMP, when pinning fails? I could not find any
>> thing is that mpiexec does not know that I want each replica to fork
>> to 4 OpenMP threads. Thus, without this option and without affinities
>> (in a sec about it) mpiexec starts too many replicas on some nodes -
>> gromacs complains about the overload then - while some cores on other
>> nodes are not used. It is possible to run my simulation like that:
>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
>> --loadbalance for mpiexec and without -ntomp for mdrun)
>> Then each replica runs on 4 MPI processes (I allocate 4 times more
>> processes than replicas and mdrun sees it). The problem is that it is
>> much slower than using OpenMP for each replica. I did not find any
>> other way than --loadbalance in mpiexec and then -multi 144 -ntomp 4
>> in mdrun to use MPI and OpenMP at the same time on the
>> torque-controlled cluster.
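The pure-MPI fallback quoted above can be sketched as a single command line. This is an untested sketch: the rank count of 576 (144 replicas x 4 ranks, matching the allocation described later in the thread) and the standard `mpiexec -n` option are my assumptions, not part of the original command:

```shell
# Untested sketch: pure-MPI REMD launch with 4 MPI ranks per replica
# (144 replicas x 4 ranks = 576 ranks). mdrun -multi splits the ranks
# evenly between the replicas, so each replica gets 4-way domain
# decomposition instead of 4 OpenMP threads.
mpiexec -n 576 mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi
```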
> That seems highly surprising. I have not yet encountered a job
> scheduler that was completely lacking a "do what I tell you" layout
> scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?
I think that Torque is very similar to all PBS-like resource managers
in this regard. It actually does what I tell it to do. There are 12-core
nodes, I ask for 48 of them - I get them (a simple #PBS -l ncpus=576
does not work), end of story. Now, the program that I run is responsible
for populating the resources that I got.
> Surely you want 3 MPI processes per 12-core node?
Yes - I want each node to run 3 MPI processes. Preferably, I would like
to run each MPI process on a separate node (spread over 12 cores with
OpenMP), but I would not get that many resources. But again, without the
--loadbalance hack I would not be able to properly populate the nodes...
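For reference, the hybrid MPI+OpenMP layout described above can be sketched as a Torque job script. This is an untested sketch assembled from the options quoted in this thread; the explicit `-n 144` is an assumption (standard mpiexec usage), not from the original command:

```shell
#!/bin/bash
# Untested sketch of the hybrid layout discussed in this thread.
#PBS -l nodes=48:ppn=12

cd $PBS_O_WORKDIR

# 144 replicas x 4 OpenMP threads = 576 cores = 48 nodes x 12 cores,
# i.e. 3 MPI ranks per 12-core node.
export OMP_NUM_THREADS=4

# --loadbalance spreads the 144 ranks evenly over the allocated nodes;
# -multi runs one replica per rank, -ntomp forks 4 threads per replica.
mpiexec --loadbalance -n 144 \
    mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -ntomp 4 -cpi
```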
>>> What do the .log files say about
>>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>> Each replica logs:
>> "Using 1 MPI process
>> Using 4 OpenMP threads",
>> That is correct. As I said, the threads are forked, but 3 out of 4
>> do not do anything, and the simulation does not progress at all.
>> About affinities Gromacs says:
>> "Can not set thread affinities on the current platform. On NUMA
>> systems this can cause performance degradation. If you think your
>> platform should support setting affinities, contact the GROMACS
>> developers."
>> Well, the "current platform" is a normal x86_64 cluster, but the
>> whole information about resources is passed by Torque to
>> OpenMPI-linked Gromacs.
>> Can it be that mdrun sees the resources allocated by Torque as a big
>> pool of CPUs and misses the information about node topology?
> mdrun gets its processor topology from the MPI layer, so that is where
> you need to focus. The error message confirms that GROMACS sees things
> that seem wrong.
Thank you, I will take a look. But the first thing I want to do is to
find out why Gromacs 4.6.3 is not able to run on my (slightly weird, I
admit) setup, while 4.6.2 does it very well.
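While investigating, one hedged thing to try is bypassing the failing automatic affinity detection by asking mdrun to pin threads explicitly; mdrun 4.6 has a `-pin` option for this. Whether forcing it helps on this cluster is an open question, so this is a diagnostic sketch only:

```shell
# Diagnostic sketch (untested): force mdrun's internal thread pinning
# instead of relying on the topology reported through the MPI layer.
export OMP_NUM_THREADS=4
mpiexec --loadbalance mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 \
    -ntomp 4 -pin on -cpi
```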