[gmx-users] Problems with REMD in Gromacs 4.6.3
gigo
gigo at ibb.waw.pl
Fri Jul 19 18:59:24 CEST 2013
Hi!
On 2013-07-17 21:08, Mark Abraham wrote:
> You tried ppn3 (with and without --loadbalance)?
I was testing on an 8-replica simulation.
1) Without --loadbalance and without -np 8.
Excerpts from the script:
#PBS -l nodes=8:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on
Excerpts from logs:
Using 3 MPI processes
Using 4 OpenMP threads per MPI process
(...)
Overriding thread affinity set outside mdrun_mpi
Pinning threads with an auto-selected logical core stride of 1
WARNING: In MPI process #0: Affinity setting for 1/4 threads failed.
This can cause performance degradation! If you think your settings
are correct, contact the GROMACS developers.
WARNING: In MPI process #2: Affinity setting for 4/4 threads failed.
Load: The job was allocated 24 cores (3 cores on each of 8 nodes).
Each OpenMP thread uses ~1/3 of a CPU core on average.
Conclusions: MPI starts as many processes as the number of cores
requested (nnodes*ppn = 24) and ignores the OMP_NUM_THREADS environment
variable ==> this is wrong, and it is not a GROMACS issue. Each MPI
process forks into 4 threads as requested. The 24-core limit granted by
Torque is not violated.
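A thought on 1): if the launcher here is OpenMPI's mpiexec, the rank
layout can be requested explicitly instead of letting it default to one
rank per allocated core slot; -npernode is an OpenMPI option, so this
is only an untested sketch:

#PBS -l nodes=8:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec -np 8 -npernode 1 mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on

That should place exactly one replica on each of the 8 nodes.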
2) The same script, but with -np 8, to limit the number of MPI
processes to the number of replicas
Logs:
Using 1 MPI process
Using 4 OpenMP threads
(...)
Replicas 0, 3, and 6: WARNING: Affinity setting for 1/4 threads failed.
Replicas 1, 2, 4, 5, and 7: WARNING: Affinity setting for 4/4 threads failed.
Load: The job was allocated 24 cores on 8 nodes, but mpiexec started
processes only on the first 3 nodes. Each OpenMP thread uses ~20% of a
CPU core.
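A possible fix for the packing seen in 2), again assuming OpenMPI's
mpiexec: its -bynode option assigns ranks round-robin across nodes
instead of filling the first ones up. Untested sketch:

mpiexec -np 8 -bynode mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on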
3) -np 8 --loadbalance
Excerpts from logs:
Using 1 MPI process
Using 4 OpenMP threads
(...)
Each replica says: WARNING: Affinity setting for 3/4 threads failed.
Load: MPI processes spread evenly on all 8 nodes. Each OpenMP thread
uses ~50% of a CPU core.
4) -np 8 --loadbalance, #PBS -l nodes=8:ppn=4 <== this worked ~OK with
gromacs 4.6.2
Logs:
WARNING: Affinity setting for 2/4 threads failed.
Load: 32 cores allocated on 8 nodes. MPI processes spread evenly, each
OpenMP thread uses ~70% of a CPU core.
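For reference, the complete combination from 4), assembled from the
excerpts above:

#PBS -l nodes=8:ppn=4
setenv OMP_NUM_THREADS 4
mpiexec -np 8 --loadbalance mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on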
With 144 replicas, however, the simulation did not produce any results;
it just got stuck.
Some thoughts: the main problem is most probably in the way MPI
interprets the information from Torque; it is not GROMACS-related. MPI
ignores OMP_NUM_THREADS. The environment is just broken. Since
GROMACS 4.6.2 behaved better than 4.6.3 there, I am going back to it.
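By the way, a quick sanity check that costs nothing: Torque writes the
granted slots to $PBS_NODEFILE, one line per core slot, so the job
script can print what each node really got before mpiexec runs:

sort $PBS_NODEFILE | uniq -c

If those counts look right while the processes still end up packed or
oversubscribed, the rank placement by mpiexec is to blame.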
Best,
G
>
> Mark
>
> On Wed, Jul 17, 2013 at 6:30 PM, gigo <gigo at ibb.waw.pl> wrote:
>> On 2013-07-13 11:10, Mark Abraham wrote:
>>>
>>> On Sat, Jul 13, 2013 at 1:24 AM, gigo <gigo at ibb.waw.pl> wrote:
>>>>
>>>> On 2013-07-12 20:00, Mark Abraham wrote:
>>>>>
>>>>>
>>>>> On Fri, Jul 12, 2013 at 4:27 PM, gigo <gigo at ibb.waw.pl> wrote:
>>>>>>
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> On 2013-07-12 11:15, Mark Abraham wrote:
>>>>>>>
>>>>>>> What does --loadbalance do?
>>>>>>
>>>>>> It balances the total number of processes across all allocated
>>>>>> nodes.
>>>>>
>>>>> OK, but using it means you are hostage to its assumptions about
>>>>> balance.
>>>>
>>>> That's true, but as long as I do not try to use more resources than
>>>> Torque gives me, everything is OK. The question is: what is the
>>>> proper way of running multiple simulations in parallel with MPI,
>>>> each further parallelized with OpenMP, when pinning fails? I could
>>>> not find any other.
>>>
>>>
>>> I think pinning fails because you are double-crossing yourself. You
>>> do
>>> not want 12 MPI processes per node, and that is likely what ppn is
>>> setting. AFAIK your setup should work, but I haven't tested it.
>>>
>>>>>
>>>>>> The thing is that mpiexec does not know that I want each replica
>>>>>> to fork into 4 OpenMP threads. Thus, without this option and
>>>>>> without affinities (more on that in a second), mpiexec starts too
>>>>>> many replicas on some nodes - GROMACS then complains about the
>>>>>> overload - while some cores on other nodes are left unused. It is
>>>>>> possible to run my simulation like that:
>>>>>>
>>>>>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi
>>>>>> (without --loadbalance for mpiexec and without -ntomp for mdrun)
>>>>>>
>>>>>> Then each replica runs on 4 MPI processes (I allocate 4 times
>>>>>> more cores than replicas, and mdrun sees it). The problem is that
>>>>>> it is much slower than using OpenMP for each replica. I did not
>>>>>> find any way other than --loadbalance for mpiexec plus -multi 144
>>>>>> -ntomp 4 for mdrun to use MPI and OpenMP at the same time on the
>>>>>> Torque-controlled cluster.
>>>>>
>>>>> That seems highly surprising. I have not yet encountered a job
>>>>> scheduler that was completely lacking a "do what I tell you"
>>>>> layout
>>>>> scheme. More importantly, why are you using #PBS -l
>>>>> nodes=48:ppn=12?
>>>>
>>>> I think that Torque is very similar to all PBS-like resource
>>>> managers in this regard. It actually does what I tell it to do.
>>>> There are 12-core nodes, I ask for 48 of them - I get them (a
>>>> simple #PBS -l ncpus=576 does not work), end of story. Now, the
>>>> program that I run is responsible for populating the resources
>>>> that I got.
>>>
>>>
>>> No, that's not the end of the story. The scheduler and the MPI
>>> system
>>> typically cooperate to populate the MPI processes on the hardware,
>>> set
>>> OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they
>>> are
>>> set.
>>
>>
>> I was able to run what I wanted flawlessly on another cluster with
>> PBS Pro. The Torque cluster seems to work as I said (the "end of
>> story" behaviour). REMD runs well on Torque when I give a whole
>> physical node to one replica. Otherwise the simulation does not go,
>> or the pinning fails (sometimes partially). I ran out of options; I
>> did not find any working example or documentation on running hybrid
>> MPI/OpenMP jobs in Torque. It seems that I stumbled upon limitations
>> of this resource manager, and it is not really a GROMACS issue.
>> Best Regards,
>> Grzegorz
>>
>>
>>>
>>> You seem to be using 12 because you know there are 12 cores per
>>> node.
>>> The scheduler should know that already. ppn should be a command
>>> about
>>> what to do with the hardware, not a description of what it is. More
>>> to
>>> the point, you should read the docs and be sure what it does.
>>>
>>>>> Surely you want 3 MPI processes per 12-core node?
>>>>
>>>> Yes - I want each node to run 3 MPI processes. Preferably, I would
>>>> like to run each MPI process on a separate node (spread over 12
>>>> cores with OpenMP), but I will not get that many resources. But
>>>> again, without the --loadbalance hack I would not be able to
>>>> properly populate the nodes...
>>>
>>>
>>> So try ppn 3!
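(For the full production run, Mark's ppn=3 suggestion would look
something like this; a sketch only, I have not tried it yet:

#PBS -l nodes=48:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec -np 144 mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4 -replex 2000 -cpi

i.e. 3 ranks per 12-core node, each forking into 4 OpenMP threads, so
12 threads per node.)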
>>>
>>>>>
>>>>>>> What do the .log files say about
>>>>>>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>>>>>>
>>>>>> Each replica logs:
>>>>>> "Using 1 MPI process
>>>>>> Using 4 OpenMP threads",
>>>>>> which is correct. As I said, the threads are forked, but 3 out
>>>>>> of 4 do not do anything, and the simulation does not go at all.
>>>>>>
>>>>>> About affinities Gromacs says:
>>>>>> "Can not set thread affinities on the current platform. On NUMA
>>>>>> systems
>>>>>> this
>>>>>> can cause performance degradation. If you think your platform
>>>>>> should
>>>>>> support
>>>>>> setting affinities, contact the GROMACS developers."
>>>>>>
>>>>>> Well, the "current platform" is a normal x86_64 cluster, but the
>>>>>> whole information about resources is passed by Torque to the
>>>>>> OpenMPI-linked GROMACS. Can it be that mdrun sees the resources
>>>>>> allocated by Torque as one big pool of CPUs and misses the
>>>>>> information about node topology?
>>>>>
>>>>> mdrun gets its processor topology from the MPI layer, so that is
>>>>> where
>>>>> you need to focus. The error message confirms that GROMACS sees
>>>>> things
>>>>> that seem wrong.
>>>>
>>>> Thank you, I will take a look. But the first thing I want to do is
>>>> find out why GROMACS 4.6.3 is not able to run on my (slightly
>>>> weird, I admit) setup, while 4.6.2 does it very well.
>>>
>>>
>>> 4.6.2 had a bug that inhibited any MPI-based mdrun from attempting
>>> to
>>> set affinities. It's still not clear why ppn 12 worked at all.
>>> Apparently mdrun was able to float some processes around to get
>>> something that worked. The good news is that when you get it working
>>> in 4.6.3, you will see a performance boost.
>>>
>>> Mark
>>