[gmx-developers] Problem with 4.6.x MPI, thread affinity, slurm and node-uneven task spread

Thu Oct 2 15:57:37 CEST 2014

Hi!

I just managed to pin down a weird problem: an uneven spread of tasks 
over nodes, combined with thread affinity, causes jobs to hang in 
gmx_set_thread_affinity.

This happens on our 48-core nodes with a 100-task job which, when 
submitted through slurm (without specifying the distribution manually), 
gets distributed over 3 nodes as 6+47+47 tasks.
We also use cgroups to allow multiple jobs per node, so the node with 
6 tasks has an affinity mask set for only the 6 cores on a single NUMA 
node. The nodes with 47 tasks have the whole node allocated and thus 
get a full 48-core affinity mask.

(Due to a bug (or feature?) in slurm, the tasks on the node with only 
6 cores allocated actually get a single-core-per-task affinity, but 
that's not relevant here.)

Anyway, when the code reaches the call to gmx_check_thread_affinity_set 
at line 1629 in runner.c (this is 4.6.7), we start having problems.

The loop to set bAllSet ends up setting bAllSet to TRUE for the tasks on 
the two fully allocated nodes and FALSE on the tasks on the third node.
This in turn changes hw_opt->thread_affinity to threadaffOFF on those 6 
tasks, but leaves it at threadaffAUTO for the other 2x47 tasks.
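
To make the divergence concrete, here is a minimal sketch in plain C of 
the per-task decision as described above. It is not the GROMACS code: 
the affinity mask is modelled as a 64-bit bitmask instead of a 
cpu_set_t, the names all_cores_set, decide_affinity, SKETCH_AFF_AUTO and 
SKETCH_AFF_OFF are made up for illustration, and it assumes bAllSet 
means "every core of the node is present in the task's mask".

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the two affinity settings mentioned above
 * (hypothetical names, not the GROMACS enum). */
enum sketch_aff { SKETCH_AFF_AUTO, SKETCH_AFF_OFF };

/* Returns 1 if bits 0..ncores-1 are all set in mask, else 0.
 * Mirrors the role of bAllSet under the assumption stated above. */
int all_cores_set(uint64_t mask, int ncores)
{
    uint64_t full;

    assert(ncores > 0 && ncores <= 64);
    /* Shifting a 64-bit value by 64 is undefined, so handle it apart. */
    full = (ncores == 64) ? UINT64_MAX : ((UINT64_C(1) << ncores) - 1);
    return (mask & full) == full;
}

/* Purely local decision: a restricted mask turns pinning off,
 * a full mask leaves it on auto. No other task is consulted. */
enum sketch_aff decide_affinity(uint64_t mask, int ncores)
{
    return all_cores_set(mask, ncores) ? SKETCH_AFF_AUTO : SKETCH_AFF_OFF;
}
```

With a 48-core node, a task holding the full mask gets SKETCH_AFF_AUTO 
while a task whose mask covers only 6 cores gets SKETCH_AFF_OFF, so 
tasks of the same job end up with different settings.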

gmx_set_thread_affinity then promptly returns for those poor 6 tasks, 
while the other 94 tasks try in vain to do an MPI_Comm_split (a 
collective call) with 6 tasks missing from the equation...

I suggest gathering the bAllSet result from all nodes in 
gmx_check_thread_affinity_set and making sure all tasks have the same 
view of the world...

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se