[gmx-developers] Problem with 4.6.x MPI, thread affinity, slurm and node-uneven task spread
Åke Sandgren
ake.sandgren at hpc2n.umu.se
Thu Oct 2 15:57:37 CEST 2014
Hi!
I just managed to pin down a weird problem: an uneven spread of tasks over
nodes, combined with thread affinity handling, causes jobs to hang in
gmx_set_thread_affinity.
This happens on our 48-core nodes with a 100-task job that, when
submitted through slurm (without specifying the distribution manually),
gets spread over 3 nodes as 6+47+47 tasks.
We are also using cgroups to allow multiple jobs per node, so the
node with 6 tasks has an affinity mask set for only the 6 cores on a
single NUMA node. The nodes with 47 tasks have the whole node allocated and
thus get a full 48-core affinity mask.
(Due to a bug(/feature?) in slurm the tasks on the node with
only 6 cores allocated actually get a single-core-per-task affinity, but
that's not relevant here.)
Anyway, when the code gets to line 1629 in runner.c (this is 4.6.7) and
the call to gmx_check_thread_affinity_set, the problems start.
The loop that sets bAllSet ends up with bAllSet TRUE for the tasks on
the two fully allocated nodes and FALSE for the tasks on the third node.
This in turn changes hw_opt->thread_affinity to threadaffOFF on those 6
tasks, but leaves it at threadaffAUTO for the other 2x47 tasks.
gmx_set_thread_affinity then promptly returns on those poor 6 tasks,
while on the other 2x47 tasks it tries in vain to do an MPI_Comm_split
with 6 tasks missing from the equation...
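That is the classic divergent-collective deadlock; schematically (a
standalone illustration of the pattern only, not the actual runner.c code,
and the rank cutoff of 6 is just mimicking our job):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int      rank, affinity_off;
    MPI_Comm intra;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* pretend the first 6 ranks saw a restricted affinity mask and
     * therefore skip the affinity-setting code entirely */
    affinity_off = (rank < 6);

    if (!affinity_off)
    {
        /* these ranks block here forever, waiting for the 6 ranks
         * that never enter the collective call */
        MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &intra);
        MPI_Comm_free(&intra);
    }
    printf("rank %d done\n", rank);

    MPI_Finalize();
    return 0;
}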
I suggest gathering the bAllSet result from all ranks in
gmx_check_thread_affinity_set and making sure all tasks have the same view
of the world...
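Something like the following inside gmx_check_thread_affinity_set, i.e. a
logical-AND allreduce of the flag so that every rank takes the same branch
afterwards (a sketch only; the guard macro and the use of MPI_COMM_WORLD
are my assumptions about what is available at that point):

#ifdef GMX_LIB_MPI
    {
        /* Make bAllSet globally consistent: MPI_LAND leaves it TRUE only
         * if every rank saw a full (unrestricted) affinity mask, so either
         * all ranks keep threadaffAUTO or all switch to threadaffOFF,
         * instead of only some of them skipping the later collectives. */
        int bAllSet_All = (int)bAllSet;

        MPI_Allreduce(MPI_IN_PLACE, &bAllSet_All, 1, MPI_INT,
                      MPI_LAND, MPI_COMM_WORLD);
        bAllSet = (gmx_bool)bAllSet_All;
    }
#endif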
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se