[gmx-developers] Problem with 4.6.x MPI, thread affinity, slurm and node-uneven task spread
Szilárd Páll
pall.szilard at gmail.com
Thu Oct 2 16:32:06 CEST 2014
Thanks for the detailed report! Could you please file a redmine issue?
redmine.gromacs.org
--
Szilárd
On Thu, Oct 2, 2014 at 3:57 PM, Åke Sandgren <ake.sandgren at hpc2n.umu.se> wrote:
> Hi!
>
> Just managed to pin down a weird problem: an uneven spread of tasks over
> nodes, combined with thread affinity, causes jobs to hang in
> gmx_set_thread_affinity.
>
> This happens on our 48-core nodes with a 100-task job that, when submitted
> through slurm (without specifying the distribution manually), gets
> distributed over 3 nodes with 6+47+47 tasks.
> We are also using cgroups to allow for multiple jobs per node, so the node
> with 6 tasks has an affinity mask set for only the 6 cores on a single NUMA
> node. The nodes with 47 tasks have the whole node allocated and thus get a
> full 48-core affinity mask.
>
> (Due to a bug(/feature?) in slurm, the tasks on the node with only 6 cores
> allocated actually get a single-core-per-task affinity, but that's not
> relevant here.)
>
> Anyway, when the code gets to line 1629 in runner.c (this is 4.6.7) and the
> call to gmx_check_thread_affinity_set we start having problems.
>
> The loop to set bAllSet ends up setting bAllSet to TRUE for the tasks on the
> two fully allocated nodes and FALSE on the tasks on the third node.
> This in turn changes hw_opt->thread_affinity to threadaffOFF on those 6
> tasks, but leaves it at threadaffAUTO for the other 2x47 tasks.
>
> gmx_set_thread_affinity then promptly returns for those poor 6 tasks, while
> the other 2x47 tasks try in vain to do an MPI_Comm_split with 6 tasks
> missing from the equation...
>
> I suggest gathering the bAllSet result from all nodes in
> gmx_check_thread_affinity_set and making sure all tasks have the same view
> of the world...
>
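The hang and the suggested fix can be modelled without MPI. The following is
a minimal sketch in plain C, not GROMACS code: all function names are
hypothetical, and combining the per-rank flags with a logical AND (the value
an MPI_Allreduce with MPI_LAND would deliver to every rank) is an assumption
about how the results would be gathered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fill per-rank flags for a layout like the one in the report: the first
 * `nrestricted` ranks sit on the cgroup-restricted node (mask does not cover
 * the whole node, so the flag is false); all other ranks see a full mask. */
static void fill_reported_layout(bool all_set[], size_t nranks,
                                 size_t nrestricted)
{
    for (size_t i = 0; i < nranks; i++)
    {
        all_set[i] = (i >= nrestricted);
    }
}

/* Logical AND over all ranks. In a real MPI code this is the value every
 * rank would receive from an all-reduce with MPI_LAND (assumed choice). */
static bool reduce_land(const bool all_set[], size_t nranks)
{
    for (size_t i = 0; i < nranks; i++)
    {
        if (!all_set[i])
        {
            return false;
        }
    }
    return true;
}

/* Count the ranks that keep affinity on "auto" and therefore go on to call
 * the collective split. With use_global == false each rank trusts only its
 * own flag (the reported behaviour); with use_global == true every rank
 * uses the shared, reduced flag (the suggested behaviour). */
static size_t ranks_entering_split(const bool all_set[], size_t nranks,
                                   bool use_global)
{
    bool   global = reduce_land(all_set, nranks);
    size_t count  = 0;

    for (size_t i = 0; i < nranks; i++)
    {
        bool keep_affinity = use_global ? global : all_set[i];
        if (keep_affinity)
        {
            count++;
        }
    }
    return count;
}

/* A collective over all ranks can only finish if every rank calls it;
 * if no rank calls it, nobody is left waiting either. */
static bool split_would_complete(size_t nentering, size_t nranks)
{
    return nentering == 0 || nentering == nranks;
}
```

For the 6+47+47 layout, ranks_entering_split gives 94 with per-rank decisions,
so the collective cannot complete, and 0 with the shared decision, so no rank
sets affinity and no rank is left waiting.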
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se