[gmx-users] Problems with REMD in Gromacs 4.6.3
gigo
gigo at ibb.waw.pl
Wed Jul 17 18:30:18 CEST 2013
On 2013-07-13 11:10, Mark Abraham wrote:
> On Sat, Jul 13, 2013 at 1:24 AM, gigo <gigo at ibb.waw.pl> wrote:
>> On 2013-07-12 20:00, Mark Abraham wrote:
>>>
>>> On Fri, Jul 12, 2013 at 4:27 PM, gigo <gigo at ibb.waw.pl> wrote:
>>>>
>>>> Hi!
>>>>
>>>> On 2013-07-12 11:15, Mark Abraham wrote:
>>>>>
>>>>>
>>>>> What does --loadbalance do?
>>>>
>>>>
>>>>
>>>> It balances the total number of processes across all allocated
>>>> nodes.
>>>
>>>
>>> OK, but using it means you are hostage to its assumptions about
>>> balance.
>>
>>
>> That's true, but as long as I do not try to use more resources than
>> Torque gives me, everything is OK. The question is: what is the proper
>> way of running multiple simulations in parallel with MPI, each further
>> parallelized with OpenMP, when pinning fails? I could not find any
>> other way.
>
> I think pinning fails because you are double-crossing yourself. You do
> not want 12 MPI processes per node, and that is likely what ppn is
> setting. AFAIK your setup should work, but I haven't tested it.
>
>>>
>>>> The thing is that mpiexec does not know that I want each replica to
>>>> fork into 4 OpenMP threads. Thus, without this option and without
>>>> affinities (more about them in a second), mpiexec starts too many
>>>> replicas on some nodes - GROMACS then complains about the overload -
>>>> while some cores on other nodes are not used. It is possible to run
>>>> my simulation like this:
>>>>
>>>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
>>>> --loadbalance for mpiexec and without -ntomp for mdrun)
>>>>
>>>> Then each replica runs on 4 MPI processes (I allocate 4 times more
>>>> cores than replicas, and mdrun sees it). The problem is that it is
>>>> much slower than using OpenMP for each replica. I did not find any
>>>> other way than --loadbalance for mpiexec combined with -multi 144
>>>> -ntomp 4 for mdrun to use MPI and OpenMP at the same time on the
>>>> Torque-controlled cluster.
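(For reference, the combined variant described above, reconstructed from
this thread - exact option order may differ - would be:

  mpiexec --loadbalance mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 \
      -ntomp 4 -cpi

i.e. --loadbalance on the mpiexec side plus -ntomp 4 on the mdrun side.)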
>>>
>>>
>>> That seems highly surprising. I have not yet encountered a job
>>> scheduler that was completely lacking a "do what I tell you" layout
>>> scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?
>>
>>
>> I think that Torque is very similar to all PBS-like resource managers
>> in this regard. It actually does what I tell it to do. There are
>> 12-core nodes, I ask for 48 of them - I get them (a simple
>> #PBS -l ncpus=576 does not work), end of story. Now, the program that
>> I run is responsible for populating the resources that I got.
>
> No, that's not the end of the story. The scheduler and the MPI system
> typically cooperate to populate the MPI processes on the hardware, set
> OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they are
> set.
I was able to run what I wanted flawlessly on another cluster with
PBS Pro. The Torque cluster seems to work as I said ("the end of story"
behaviour). REMD runs well under Torque when I give a whole physical
node to one replica. Otherwise the simulation either does not progress
or the pinning fails (sometimes partially). I have run out of options; I
did not find any working example or documentation on running hybrid
MPI/OpenMP jobs under Torque. It seems that I have stumbled upon a
limitation of this resource manager, and it is not really a GROMACS
issue.
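For reference, the layout I am after boils down to something like the
sketch below. It is reconstructed from the numbers in this thread, and
instead of --loadbalance it uses Open MPI's own rank-mapping options, so
it only applies if the mpiexec here is Open MPI's; the exact flag names
(-np, -npernode, --cpus-per-proc) and their availability depend on the
version, so treat it as an illustration rather than a recipe:

  #!/bin/bash
  #PBS -l nodes=48:ppn=12

  cd $PBS_O_WORKDIR

  # 4 OpenMP threads per replica
  export OMP_NUM_THREADS=4

  # 144 replicas = 144 MPI ranks, 3 ranks per 12-core node,
  # 4 cores (OpenMP threads) per rank
  mpiexec -np 144 -npernode 3 --cpus-per-proc 4 \
      mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -ntomp 4 -cpi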
Best Regards,
Grzegorz
>
> You seem to be using 12 because you know there are 12 cores per node.
> The scheduler should know that already. ppn should be a command about
> what to do with the hardware, not a description of what it is. More to
> the point, you should read the docs and be sure what it does.
>
>>> Surely you want 3 MPI processes per 12-core node?
>>
>>
>> Yes - I want each node to run 3 MPI processes. Preferably, I would
>> like to run each MPI process on a separate node (spread over 12 cores
>> with OpenMP), but I will not get that many resources. But again,
>> without the --loadbalance hack I would not be able to properly
>> populate the nodes...
>
> So try ppn 3!
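A minimal sketch of that suggestion (illustrative only; whether Torque
then leaves the remaining 9 cores of each node to this job's OpenMP
threads, or packs other jobs onto them, depends on how the site has
configured it):

  #PBS -l nodes=48:ppn=3

  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=4
  mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -ntomp 4 -cpi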
>
>>>
>>>>> What do the .log files say about
>>>>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>>>>
>>>>
>>>>
>>>> Each replica logs:
>>>> "Using 1 MPI process
>>>> Using 4 OpenMP threads",
>>>> which is correct. As I said, the threads are forked, but 3 out of 4
>>>> don't do anything, and the simulation does not progress at all.
>>>>
>>>> About affinities, GROMACS says:
>>>> "Can not set thread affinities on the current platform. On NUMA
>>>> systems this can cause performance degradation. If you think your
>>>> platform should support setting affinities, contact the GROMACS
>>>> developers."
>>>>
>>>> Well, the "current platform" is a normal x86_64 cluster, but all the
>>>> information about the resources is passed by Torque to the
>>>> OpenMPI-linked GROMACS. Can it be that mdrun sees the resources
>>>> allocated by Torque as one big pool of CPUs and misses the
>>>> information about node topology?
>>>
>>>
>>> mdrun gets its processor topology from the MPI layer, so that is
>>> where
>>> you need to focus. The error message confirms that GROMACS sees
>>> things
>>> that seem wrong.
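One way to see what the MPI layer is actually handing to mdrun - assuming
the mpiexec here really is Open MPI's, and that the version is recent
enough to have the flag - is its binding report:

  mpiexec --report-bindings mdrun_mpi -v -multi 144 -ntomp 4 -cpi

Each rank then prints the cores it was bound to on stderr, which should
show whether the node topology survives the Torque/MPI handoff.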
>>
>>
>> Thank you, I will take a look. But the first thing I want to do is to
>> find the reason why GROMACS 4.6.3 is not able to run on my (slightly
>> weird, I admit) setup, while 4.6.2 does it very well.
>
> 4.6.2 had a bug that inhibited any MPI-based mdrun from attempting to
> set affinities. It's still not clear why ppn 12 worked at all.
> Apparently mdrun was able to float some processes around to get
> something that worked. The good news is that when you get it working
> in 4.6.3, you will see a performance boost.
>
> Mark