[gmx-users] Problems with REMD in Gromacs 4.6.3
gigo
gigo at ibb.waw.pl
Wed Jul 17 18:30:18 CEST 2013
On 2013-07-13 11:10, Mark Abraham wrote:
> On Sat, Jul 13, 2013 at 1:24 AM, gigo <gigo at ibb.waw.pl> wrote:
>> On 2013-07-12 20:00, Mark Abraham wrote:
>>>
>>> On Fri, Jul 12, 2013 at 4:27 PM, gigo <gigo at ibb.waw.pl> wrote:
>>>>
>>>> Hi!
>>>>
>>>> On 2013-07-12 11:15, Mark Abraham wrote:
>>>>>
>>>>>
>>>>> What does --loadbalance do?
>>>>
>>>>
>>>>
>>>> It balances the total number of processes across all allocated
>>>> nodes.
>>>
>>>
>>> OK, but using it means you are hostage to its assumptions about
>>> balance.
>>
>>
>> That's true, but as long as I do not try to use more resources than
>> Torque gives me, everything is OK. The question is: what is the proper
>> way of running multiple simulations in parallel with MPI, each further
>> parallelized with OpenMP, when pinning fails? I could not find any
>> other way.
>
> I think pinning fails because you are double-crossing yourself. You do
> not want 12 MPI processes per node, and that is likely what ppn is
> setting. AFAIK your setup should work, but I haven't tested it.
>
>>>
>>>> The thing is that mpiexec does not know that I want each replica to
>>>> fork into 4 OpenMP threads. Thus, without this option and without
>>>> affinities (more about them in a second), mpiexec starts too many
>>>> replicas on some nodes - GROMACS then complains about the overload -
>>>> while some cores on other nodes are not used. It is possible to run
>>>> my simulation like this:
>>>>
>>>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
>>>> --loadbalance for mpiexec and without -ntomp for mdrun)
>>>>
>>>> Then each replica runs on 4 MPI processes (I allocate 4 times more
>>>> cores than replicas, and mdrun sees it). The problem is that it is
>>>> much slower than using OpenMP for each replica. I did not find any
>>>> other way than --loadbalance for mpiexec combined with -multi 144
>>>> -ntomp 4 for mdrun to use MPI and OpenMP at the same time on the
>>>> Torque-controlled cluster.
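(For reference, the combined variant described above, reconstructed from
this thread - exact option order may differ - would be:

  mpiexec --loadbalance mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 \
      -ntomp 4 -cpi

i.e. --loadbalance on the mpiexec side plus -ntomp 4 on the mdrun side.)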
>>>
>>>
>>> That seems highly surprising. I have not yet encountered a job
>>> scheduler that was completely lacking a "do what I tell you" layout
>>> scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?
>>
>>
>> I think that Torque is very similar to all PBS-like resource managers
>> in this regard. It actually does what I tell it to do. There are
>> 12-core nodes, I ask for 48 of them - I get them (a simple
>> #PBS -l ncpus=576 does not work), end of story. Now, the program that
>> I run is responsible for populating the resources that I got.
>
> No, that's not the end of the story. The scheduler and the MPI system
> typically cooperate to populate the MPI processes on the hardware, set
> OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they are
> set.
I was able to run what I wanted flawlessly on another cluster with
PBS Pro. The Torque cluster seems to work as I said ("the end of story"
behaviour). REMD runs well under Torque when I give a whole physical
node to one replica. Otherwise the simulation either does not progress
or the pinning fails (sometimes partially). I have run out of options; I
did not find any working example or documentation on running hybrid
MPI/OpenMP jobs under Torque. It seems that I have stumbled upon a
limitation of this resource manager, and it is not really a GROMACS
issue.
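For reference, the layout I am after boils down to something like the
sketch below. It is reconstructed from the numbers in this thread, and
instead of --loadbalance it uses Open MPI's own rank-mapping options, so
it only applies if the mpiexec here is Open MPI's; the exact flag names
(-np, -npernode, --cpus-per-proc) and their availability depend on the
version, so treat it as an illustration rather than a recipe:

  #!/bin/bash
  #PBS -l nodes=48:ppn=12

  cd $PBS_O_WORKDIR

  # 4 OpenMP threads per replica
  export OMP_NUM_THREADS=4

  # 144 replicas = 144 MPI ranks, 3 ranks per 12-core node,
  # 4 cores (OpenMP threads) per rank
  mpiexec -np 144 -npernode 3 --cpus-per-proc 4 \
      mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -ntomp 4 -cpi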
Best Regards,
Grzegorz
>
> You seem to be using 12 because you know there are 12 cores per node.
> The scheduler should know that already. ppn should be a command about
> what to do with the hardware, not a description of what it is. More to
> the point, you should read the docs and be sure what it does.
>
>>> Surely you want 3 MPI processes per 12-core node?
>>
>>
>> Yes - I want each node to run 3 MPI processes. Preferably, I would
>> like to run each MPI process on a separate node (spread over 12 cores
>> with OpenMP), but I will not get that many resources. But again,
>> without the --loadbalance hack I would not be able to properly
>> populate the nodes...
>
> So try ppn 3!
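A minimal sketch of that suggestion (illustrative only; whether Torque
then leaves the remaining 9 cores of each node to this job's OpenMP
threads, or packs other jobs onto them, depends on how the site has
configured it):

  #PBS -l nodes=48:ppn=3

  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=4
  mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -ntomp 4 -cpi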
>
>>>
>>>>> What do the .log files say about
>>>>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>>>>
>>>>
>>>>
>>>> Each replica logs:
>>>> "Using 1 MPI process
>>>> Using 4 OpenMP threads",
>>>> which is correct. As I said, the threads are forked, but 3 out of 4
>>>> don't do anything, and the simulation does not progress at all.
>>>>
>>>> About affinities, GROMACS says:
>>>> "Can not set thread affinities on the current platform. On NUMA
>>>> systems this can cause performance degradation. If you think your
>>>> platform should support setting affinities, contact the GROMACS
>>>> developers."
>>>>
>>>> Well, the "current platform" is a normal x86_64 cluster, but all the
>>>> information about the resources is passed by Torque to the
>>>> OpenMPI-linked GROMACS. Can it be that mdrun sees the resources
>>>> allocated by Torque as one big pool of CPUs and misses the
>>>> information about node topology?
>>>
>>>
>>> mdrun gets its processor topology from the MPI layer, so that is
>>> where
>>> you need to focus. The error message confirms that GROMACS sees
>>> things
>>> that seem wrong.
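One way to see what the MPI layer is actually handing to mdrun - assuming
the mpiexec here really is Open MPI's, and that the version is recent
enough to have the flag - is its binding report:

  mpiexec --report-bindings mdrun_mpi -v -multi 144 -ntomp 4 -cpi

Each rank then prints the cores it was bound to on stderr, which should
show whether the node topology survives the Torque/MPI handoff.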
>>
>>
>> Thank you, I will take a look. But the first thing I want to do is to
>> find the reason why GROMACS 4.6.3 is not able to run on my (slightly
>> weird, I admit) setup, while 4.6.2 does it very well.
>
> 4.6.2 had a bug that inhibited any MPI-based mdrun from attempting to
> set affinities. It's still not clear why ppn 12 worked at all.
> Apparently mdrun was able to float some processes around to get
> something that worked. The good news is that when you get it working
> in 4.6.3, you will see a performance boost.
>
> Mark