[gmx-users] Problems with REMD in Gromacs 4.6.3

Sat Jul 13 10:10:57 CEST 2013

On Sat, Jul 13, 2013 at 1:24 AM, gigo <gigo at ibb.waw.pl> wrote:
> On 2013-07-12 20:00, Mark Abraham wrote:
>>
>> On Fri, Jul 12, 2013 at 4:27 PM, gigo <gigo at ibb.waw.pl> wrote:
>>>
>>> Hi!
>>>
>>> On 2013-07-12 11:15, Mark Abraham wrote:
>>>>
>>>>
>>>> What does --loadbalance do?
>>>
>>>
>>>
>>> It balances the total number of processes across all allocated nodes.
>>
>>
>> OK, but using it means you are hostage to its assumptions about balance.
>
>
> That's true, but as long as I do not try to use more resources than
> Torque gives me, everything is OK. The question is, what is the proper way of
> running multiple simulations in parallel with MPI that are further
> parallelized with OpenMP, when pinning fails? I could not find any other.

I think pinning fails because you are double-crossing yourself. You do
not want 12 MPI processes per node, and that is likely what ppn is
setting. AFAIK your setup should work, but I haven't tested it.

>>
>>> The thing is that mpiexec does not know that I want each replica to fork
>>> to 4 OpenMP threads. Thus, without this option and without affinities (in
>>> a sec about it) mpiexec starts too many replicas on some nodes - gromacs
>>> complains about the overload then - while some cores on other nodes are
>>> not used. It is possible to run my simulation like that:
>>>
>>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
>>> --loadbalance for mpiexec and without -ntomp for mdrun)
>>>
>>> Then each replica runs on 4 MPI processes (I allocate 4 times more cores
>>> than replicas and mdrun sees it). The problem is that it is much slower
>>> than using OpenMP for each replica. I did not find any other way than
>>> --loadbalance in mpiexec and then -multi 144 -ntomp 4 in mdrun to use MPI
>>> and OpenMP at the same time on the torque-controlled cluster.
>>
>>
>> That seems highly surprising. I have not yet encountered a job
>> scheduler that was completely lacking a "do what I tell you" layout
>> scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?
>
>
> I think that torque is very similar to all PBS-like resource managers in
> this regard. It actually does what I tell it to do. There are 12-core nodes,
> I ask for 48 of them - I get them (simple #PBS -l ncpus=576 does not work),
> end of story. Now, the program that I run is responsible for populating
> resources that I got.

No, that's not the end of the story. The scheduler and the MPI system
typically cooperate to populate the MPI processes on the hardware, set
OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they are
set.
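
One way to see what the scheduler and the MPI launcher actually hand to
each process is to print the thread count and CPU mask from inside the job.
The following is only a minimal Python sketch: the function names are made
up for illustration, and launching it under the same mpiexec line in place
of mdrun_mpi is an assumption about how one might use it, not something
tested on this cluster.

```python
import os


def requested_omp_threads(env=os.environ):
    """Return OMP_NUM_THREADS as an int, or None if it is unset or empty."""
    value = env.get("OMP_NUM_THREADS")
    return int(value) if value else None


def allowed_cores():
    """Return the sorted core IDs this process may run on, or None.

    os.sched_getaffinity exists only on some platforms (Linux among them),
    so fall back to None where it is missing.
    """
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return None


if __name__ == "__main__":
    print("host:", os.uname().nodename)
    print("OMP_NUM_THREADS:", requested_omp_threads())
    print("allowed cores:", allowed_cores())
```

If every process on a node reports the same full set of cores, nothing has
been pinned by the launcher; if OMP_NUM_THREADS comes back as None, the
thread count is being left to the program.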

You seem to be using 12 because you know there are 12 cores per node.
The scheduler should know that already. ppn should be a command about
what to do with the hardware, not a description of what it is. More to
the point, you should read the docs and be sure what it does.

>> Surely you want 3 MPI processes per 12-core node?
>
>
> Yes - I want each node to run 3 MPI processes. Preferably, I would like to
> run each MPI process on a separate node (spread over 12 cores with OpenMP)
> but I will not get that many resources. But again, without the
> --loadbalance hack I would not be able to properly populate the nodes...

So try ppn 3!
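
The arithmetic behind that suggestion can be sketched in a few lines of
Python. This is only an illustration using the numbers from this thread
(144 replicas, 4 OpenMP threads each, 12-core nodes); the function name is
hypothetical. Whether Torque leaves the other cores of each node free for
the OpenMP threads when ppn is 3 depends on the site configuration, so
check the docs before relying on it.

```python
def hybrid_layout(replicas, omp_threads, cores_per_node):
    """Return (mpi_ranks_per_node, nodes_needed) for one MPI rank per replica."""
    if cores_per_node % omp_threads != 0:
        raise ValueError("OpenMP threads per rank must divide the cores per node")
    ranks_per_node = cores_per_node // omp_threads
    # Ceiling division: a partly filled node still has to be requested.
    nodes_needed = -(-replicas // ranks_per_node)
    return ranks_per_node, nodes_needed


ranks_per_node, nodes = hybrid_layout(replicas=144, omp_threads=4, cores_per_node=12)
# Print the resource request line that matches this layout.
print(f"#PBS -l nodes={nodes}:ppn={ranks_per_node}")
```

With these inputs the layout is 3 MPI processes on each of 48 nodes, which
fills all 576 cores once each process starts its 4 OpenMP threads.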

>>
>>>> What do the .log files say about
>>>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>>>
>>>
>>>
>>> Each replica logs:
>>> "Using 1 MPI process
>>> Using 4 OpenMP threads",
>>> That is correct. As I said, the threads are forked, but 3 out of 4
>>> don't do anything, and the simulation does not go at all.
>>>
>>> About affinities Gromacs says:
>>> "Can not set thread affinities on the current platform. On NUMA systems
>>> this can cause performance degradation. If you think your platform should
>>> support setting affinities, contact the GROMACS developers."
>>>
>>> Well, the "current platform" is a normal x86_64 cluster, but the whole
>>> information about resources is passed by Torque to OpenMPI-linked
>>> Gromacs. Can it be that mdrun sees the resources allocated by torque as a
>>> big pool of cpus and misses the information about node topology?
>>
>>
>> mdrun gets its processor topology from the MPI layer, so that is where
>> you need to focus. The error message confirms that GROMACS sees things
>> that seem wrong.
>
>
> Thank you, I will take a look. But the first thing I want to do is find
> the reason why Gromacs 4.6.3 is not able to run on my (slightly weird, I
> admit) setup, while 4.6.2 does it very well.

4.6.2 had a bug that inhibited any MPI-based mdrun from attempting to
set affinities. It's still not clear why ppn 12 worked at all.
Apparently mdrun was able to float some processes around to get
something that worked. The good news is that when you get it working
in 4.6.3, you will see a performance boost.

Mark