[gmx-users] Gromacs 4.6.7 with MPI and OpenMP

Szilárd Páll pall.szilard at gmail.com
Fri May 8 21:18:14 CEST 2015

On Fri, May 8, 2015 at 8:44 PM, Malcolm Tobias <mtobias at wustl.edu> wrote:
> Szilárd,
> On Friday 08 May 2015 20:25:09 Szilárd Páll wrote:
>> > I wouldn't expect the CPUSETs to be problematic, I've been using them with Gromacs for over a decade now ;-)
>> Thread affinity setting within mdrun has been employed since v4.6. We
>> do it on a per-thread basis, and not doing it can lead to pretty
>> severe performance degradation when using multi-threading. Depending
>> on the Linux kernel, OS jitter, and type/speed/scale of the simulation,
>> even MPI-only runs will see a benefit from correct affinity settings.
>> Hints:
>> - some useful mdrun command line arguments: "-pin on", "-pinoffset N"
>> (-pinstride N)
>> - more details:
>> http://www.gromacs.org/Documentation/Acceleration_and_parallelization
> Understood.  What's weird is that without pinning, it seems to start all 4 threads on the same CPU core.
> Based on some suggestions by Mark Abraham, I think I might be seeing where the problem is coming from: If I call omp_get_num_procs within an OpenMP thread that is part of an MPI process, it only reports 1 processor.  This seems to happen with one of my MPI implementations, but not others.  I wonder if this would affect the thread affinity: Since the threads think there is only one core, would they all try to run on it?

Try it with your "Hello OpenMP World" code. If you run the binary under
e.g. taskset 0x1, omp_get_num_procs() will return 1. We compare that to
the value returned by a syscall that asks for the number of CPUs, and
the mismatch is where the warning comes from.
Secondly, child threads of the mdrun_mpi process will inherit the
parent's affinity mask, and if that's set e.g. to 0x1, all spawned
threads will be running on the first core.
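
To see both effects outside of mdrun, a minimal Python sketch (Linux
only, and not what mdrun itself does internally) could look like the
following. It restricts the calling thread to a single CPU, roughly what
taskset 0x1 does except that it picks the lowest CPU the process is
already allowed to use, and then reports the mask a newly spawned thread
starts with. The function names are made up for illustration.

```python
import os
import threading


def allowed_cpus():
    """Return the set of CPUs the calling thread may run on (Linux only)."""
    return os.sched_getaffinity(0)


def affinity_seen_by_new_thread():
    """Spawn a thread and return the affinity mask it starts with."""
    seen = []
    t = threading.Thread(target=lambda: seen.append(allowed_cpus()))
    t.start()
    t.join()
    return seen[0]


def demo():
    original = allowed_cpus()
    first = min(original)
    try:
        # Restrict the calling thread to one CPU, similar to `taskset 0x1`.
        os.sched_setaffinity(0, {first})
        parent = allowed_cpus()
        child = affinity_seen_by_new_thread()
    finally:
        # Restore the original mask so the rest of the program is unaffected.
        os.sched_setaffinity(0, original)
    return {"online": os.cpu_count(), "parent": parent, "child": child}


if __name__ == "__main__":
    info = demo()
    print("CPUs reported by the OS:", info["online"])
    print("mask of restricted parent:", sorted(info["parent"]))
    print("mask inherited by new thread:", sorted(info["child"]))
```

On a multi-core node the first number should be larger than the size of
the two masks, which is the same kind of mismatch that triggers the
warning, and the child mask should equal the parent's.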

>> > Weird.  I wonder if anyone else has experience using pinning with CPUSETs?
>> What is your goal with using CPUSETs? Node sharing?
> Correct.  While it might be possible to see the cores that have been assigned to the job and do the correct 'pin setting' it would probably be ugly.

Not sure what you mean by "see the cores". Also not sure why it is
more ugly to construct a CPUSET than a pin offset, but hey, if you
want both performance and node sharing with automated resource
allocation, the solution won't be simple, I think.
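
As a rough illustration of what deriving the pin settings from the
assigned cores might involve, here is a small Python sketch. The helper
name pin_args_from_cpuset is hypothetical, and the sketch assumes that
the CPU ids in the affinity mask can be used directly as logical core
numbers for -pinoffset and -pinstride, which may not hold on every
machine (e.g. with hyperthreading). It gives up when the cores are not
evenly spaced, since a single offset/stride pair cannot describe such a
set.

```python
import os


def pin_args_from_cpuset(cpus):
    """Map a set of logical CPU ids to mdrun pinning arguments.

    Hypothetical helper: returns None for an empty set or for ids that
    are not evenly spaced.
    """
    ids = sorted(cpus)
    if not ids:
        return None
    if len(ids) == 1:
        # A single core needs only an offset.
        return ["-pin", "on", "-pinoffset", str(ids[0])]
    stride = ids[1] - ids[0]
    # Every consecutive pair must be separated by the same stride.
    if any(b - a != stride for a, b in zip(ids, ids[1:])):
        return None
    return ["-pin", "on", "-pinoffset", str(ids[0]),
            "-pinstride", str(stride)]


if __name__ == "__main__":
    # Linux only: use the cores this process is currently allowed to run on.
    print(pin_args_from_cpuset(os.sched_getaffinity(0)))
```

The number of threads still has to match the number of assigned cores
and must be set separately.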


> Cheers,
> Malcolm
> --
> Malcolm Tobias
> 314.362.1594