[gmx-users] Problems with REMD in Gromacs 4.6.3
gigo
gigo at ibb.waw.pl
Fri Jul 19 18:59:24 CEST 2013
Hi!
On 2013-07-17 21:08, Mark Abraham wrote:
> You tried ppn3 (with and without --loadbalance)?
I was testing on an 8-replica simulation.
1) Without --loadbalance and without -np 8.
Excerpts from the script:
#PBS -l nodes=8:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on
Excerpts from logs:
Using 3 MPI processes
Using 4 OpenMP threads per MPI process
(...)
Overriding thread affinity set outside mdrun_mpi
Pinning threads with an auto-selected logical core stride of 1
WARNING: In MPI process #0: Affinity setting for 1/4 threads failed.
This can cause performance degradation! If you think your settings
are correct, contact the GROMACS developers.
WARNING: In MPI process #2: Affinity setting for 4/4 threads failed.
Load: The job was allocated 24 cores (3 cores on each of 8 nodes).
Each OpenMP thread uses ~1/3 of a CPU core on average.
Conclusions: MPI starts as many processes as the number of cores
requested (nnodes*ppn = 24) and ignores the OMP_NUM_THREADS environment
variable ==> this is wrong, and it is not a GROMACS issue. Each MPI
process forks into 4 threads as requested. The 24-core limit granted by
Torque is not violated.
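A thought on 1): if the launcher here is OpenMPI's mpiexec, the rank
layout can be requested explicitly instead of letting it default to one
rank per allocated core slot; -npernode is an OpenMPI option, so this
is only an untested sketch:

#PBS -l nodes=8:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec -np 8 -npernode 1 mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on

That should place exactly one replica on each of the 8 nodes.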
2) The same script, but with -np 8, to limit the number of MPI
processes to the number of replicas
Logs:
Using 1 MPI process
Using 4 OpenMP threads
(...)
Replicas 0, 3, and 6: WARNING: Affinity setting for 1/4 threads failed.
Replicas 1, 2, 4, 5, and 7: WARNING: Affinity setting for 4/4 threads failed.
Load: The job was allocated 24 cores on 8 nodes, but mpiexec started
processes only on the first 3 nodes. Each OpenMP thread uses ~20% of a
CPU core.
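A possible fix for the packing seen in 2), again assuming OpenMPI's
mpiexec: its -bynode option assigns ranks round-robin across nodes
instead of filling the first ones up. Untested sketch:

mpiexec -np 8 -bynode mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on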
3) -np 8 --loadbalance
Excerpts from logs:
Using 1 MPI process
Using 4 OpenMP threads
(...)
Each replica says: WARNING: Affinity setting for 3/4 threads failed.
Load: MPI processes spread evenly on all 8 nodes. Each OpenMP thread
uses ~50% of a CPU core.
4) -np 8 --loadbalance, #PBS -l nodes=8:ppn=4 <== this worked ~OK with
gromacs 4.6.2
Logs:
WARNING: Affinity setting for 2/4 threads failed.
Load: 32 cores allocated on 8 nodes. MPI processes spread evenly, each
OpenMP thread uses ~70% of a CPU core.
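For reference, the complete combination from 4), assembled from the
excerpts above:

#PBS -l nodes=8:ppn=4
setenv OMP_NUM_THREADS 4
mpiexec -np 8 --loadbalance mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on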
With 144 replicas, however, the simulation did not produce any results;
it just got stuck.
Some thoughts: the main problem is most probably in the way MPI
interprets the information from Torque; it is not GROMACS-related. MPI
ignores OMP_NUM_THREADS. The environment is just broken. Since
GROMACS 4.6.2 behaved better than 4.6.3 there, I am going back to it.
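By the way, a quick sanity check that costs nothing: Torque writes the
granted slots to $PBS_NODEFILE, one line per core slot, so the job
script can print what each node really got before mpiexec runs:

sort $PBS_NODEFILE | uniq -c

If those counts look right while the processes still end up packed or
oversubscribed, the rank placement by mpiexec is to blame.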
Best,
G
>
> Mark
>
> On Wed, Jul 17, 2013 at 6:30 PM, gigo <gigo at ibb.waw.pl> wrote:
>> On 2013-07-13 11:10, Mark Abraham wrote:
>>>
>>> On Sat, Jul 13, 2013 at 1:24 AM, gigo <gigo at ibb.waw.pl> wrote:
>>>>
>>>> On 2013-07-12 20:00, Mark Abraham wrote:
>>>>>
>>>>>
>>>>> On Fri, Jul 12, 2013 at 4:27 PM, gigo <gigo at ibb.waw.pl> wrote:
>>>>>>
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> On 2013-07-12 11:15, Mark Abraham wrote:
>>>>>>>
>>>>>>> What does --loadbalance do?
>>>>>>
>>>>>> It balances the total number of processes across all allocated
>>>>>> nodes.
>>>>>
>>>>> OK, but using it means you are hostage to its assumptions about
>>>>> balance.
>>>>
>>>> That's true, but as long as I do not try to use more resources than
>>>> Torque gives me, everything is OK. The question is: what is the
>>>> proper way of running multiple simulations in parallel with MPI,
>>>> each further parallelized with OpenMP, when pinning fails? I could
>>>> not find any other.
>>>
>>>
>>> I think pinning fails because you are double-crossing yourself. You
>>> do
>>> not want 12 MPI processes per node, and that is likely what ppn is
>>> setting. AFAIK your setup should work, but I haven't tested it.
>>>
>>>>>
>>>>>> The thing is that mpiexec does not know that I want each replica
>>>>>> to fork into 4 OpenMP threads. Thus, without this option and
>>>>>> without affinities (more on that in a second), mpiexec starts too
>>>>>> many replicas on some nodes - GROMACS then complains about the
>>>>>> overload - while some cores on other nodes are left unused. It is
>>>>>> possible to run my simulation like that:
>>>>>>
>>>>>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi
>>>>>> (without --loadbalance for mpiexec and without -ntomp for mdrun)
>>>>>>
>>>>>> Then each replica runs on 4 MPI processes (I allocate 4 times
>>>>>> more cores than replicas, and mdrun sees it). The problem is that
>>>>>> it is much slower than using OpenMP for each replica. I did not
>>>>>> find any way other than --loadbalance for mpiexec plus -multi 144
>>>>>> -ntomp 4 for mdrun to use MPI and OpenMP at the same time on the
>>>>>> Torque-controlled cluster.
>>>>>
>>>>> That seems highly surprising. I have not yet encountered a job
>>>>> scheduler that was completely lacking a "do what I tell you"
>>>>> layout
>>>>> scheme. More importantly, why are you using #PBS -l
>>>>> nodes=48:ppn=12?
>>>>
>>>> I think that Torque is very similar to all PBS-like resource
>>>> managers in this regard. It actually does what I tell it to do.
>>>> There are 12-core nodes, I ask for 48 of them - I get them (a
>>>> simple #PBS -l ncpus=576 does not work), end of story. Now, the
>>>> program that I run is responsible for populating the resources
>>>> that I got.
>>>
>>>
>>> No, that's not the end of the story. The scheduler and the MPI
>>> system
>>> typically cooperate to populate the MPI processes on the hardware,
>>> set
>>> OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they
>>> are
>>> set.
>>
>>
>> I was able to run what I wanted flawlessly on another cluster with
>> PBS Pro. The Torque cluster seems to work as I said (the "end of
>> story" behaviour). REMD runs well on Torque when I give a whole
>> physical node to one replica. Otherwise the simulation does not go,
>> or the pinning fails (sometimes partially). I ran out of options; I
>> did not find any working example or documentation on running hybrid
>> MPI/OpenMP jobs in Torque. It seems that I stumbled upon limitations
>> of this resource manager, and it is not really a GROMACS issue.
>> Best Regards,
>> Grzegorz
>>
>>
>>>
>>> You seem to be using 12 because you know there are 12 cores per
>>> node.
>>> The scheduler should know that already. ppn should be a command
>>> about
>>> what to do with the hardware, not a description of what it is. More
>>> to
>>> the point, you should read the docs and be sure what it does.
>>>
>>>>> Surely you want 3 MPI processes per 12-core node?
>>>>
>>>> Yes - I want each node to run 3 MPI processes. Preferably, I would
>>>> like to run each MPI process on a separate node (spread over 12
>>>> cores with OpenMP), but I will not get that many resources. But
>>>> again, without the --loadbalance hack I would not be able to
>>>> properly populate the nodes...
>>>
>>>
>>> So try ppn 3!
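(For the full production run, Mark's ppn=3 suggestion would look
something like this; a sketch only, I have not tried it yet:

#PBS -l nodes=48:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec -np 144 mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4 -replex 2000 -cpi

i.e. 3 ranks per 12-core node, each forking into 4 OpenMP threads, so
12 threads per node.)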
>>>
>>>>>
>>>>>>> What do the .log files say about
>>>>>>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>>>>>>
>>>>>> Each replica logs:
>>>>>> "Using 1 MPI process
>>>>>> Using 4 OpenMP threads",
>>>>>> which is correct. As I said, the threads are forked, but 3 out
>>>>>> of 4 do not do anything, and the simulation does not go at all.
>>>>>>
>>>>>> About affinities Gromacs says:
>>>>>> "Can not set thread affinities on the current platform. On NUMA
>>>>>> systems
>>>>>> this
>>>>>> can cause performance degradation. If you think your platform
>>>>>> should
>>>>>> support
>>>>>> setting affinities, contact the GROMACS developers."
>>>>>>
>>>>>> Well, the "current platform" is a normal x86_64 cluster, but the
>>>>>> whole information about resources is passed by Torque to the
>>>>>> OpenMPI-linked GROMACS. Can it be that mdrun sees the resources
>>>>>> allocated by Torque as one big pool of CPUs and misses the
>>>>>> information about node topology?
>>>>>
>>>>> mdrun gets its processor topology from the MPI layer, so that is
>>>>> where
>>>>> you need to focus. The error message confirms that GROMACS sees
>>>>> things
>>>>> that seem wrong.
>>>>
>>>> Thank you, I will take a look. But the first thing I want to do is
>>>> find out why GROMACS 4.6.3 is not able to run on my (slightly
>>>> weird, I admit) setup, while 4.6.2 does it very well.
>>>
>>>
>>> 4.6.2 had a bug that inhibited any MPI-based mdrun from attempting
>>> to
>>> set affinities. It's still not clear why ppn 12 worked at all.
>>> Apparently mdrun was able to float some processes around to get
>>> something that worked. The good news is that when you get it working
>>> in 4.6.3, you will see a performance boost.
>>>
>>> Mark
>>