[gmx-users] multinode issue

Mark Abraham mark.j.abraham at gmail.com
Sat Dec 6 15:08:23 CET 2014


On Sat, Dec 6, 2014 at 9:29 AM, Éric Germaneau <germaneau at sjtu.edu.cn>
wrote:

> Dear Mark, Dear Szilárd,
>
> Thank you for your help.
> I did try different I_MPI... options without success.
> Something I can't figure out is that I can run jobs with 2 or more OpenMP
> threads per MPI process, but not with just one.
> It crashes with one OpenMP thread per MPI process, even if I disable
> I_MPI_PIN.
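>
> Roughly (a sketch, using $EXE and $INPUT as in my submit line; the real
> job script differs):
>
>      # works: 2 OpenMP threads per MPI rank
>      mpirun -np 16 -machinefile nodelist $EXE -ntomp 2 -v -deffnm $INPUT
>
>      # crashes: 1 OpenMP thread per MPI rank
>      mpirun -np 32 -machinefile nodelist $EXE -ntomp 1 -v -deffnm $INPUT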
>

OK, well that points to something being configured incorrectly in IMPI,
rather than any of the other theories. Try OpenMPI ;-)
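
For example, with Open MPI 1.8-style options that would be something like
this (a sketch, untested; option names differ between Open MPI versions):

    export OMP_NUM_THREADS=1
    mpirun -np 32 -machinefile nodelist -x OMP_NUM_THREADS \
           --map-by core --bind-to core $EXE -ntomp 1 -v -deffnm $INPUT

Note that Open MPI's hostfile syntax wants "node328 slots=16" rather than
the "node328:16" form your current machinefile uses.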

Mark


>
>   Éric.
>
>
>
> On 12/06/2014 02:54 AM, Szilárd Páll wrote:
>
>> On second thought (and some quick googling), it _seems_ that this is an
>> issue caused by the following:
>> - the OpenMP runtime gets initialized outside mdrun and its threads
>> (or just the master thread) get their affinity set;
>> - mdrun then executes the sanity check, at which point
>> omp_get_num_procs() reports 1 CPU, most probably because the master
>> thread is bound to a single core.
>>
>> This alone should not be a big deal as long as the affinity settings
>> get correctly overridden in mdrun. However, this can have the ugly
>> side-effect that, if mdrun's affinity setting gets disabled (mdrun
>> backs off if it detects externally set affinities, or if not all
>> cores/hardware threads are used), all compute threads will inherit the
>> previously set affinity and multiple threads will run on the same
>> core.
>>
>> Note that this warning should typically not cause a crash, but it is
>> telling you that something is not quite right, so it may be best to
>> start with eliminating this warning (hints: I_MPI_PIN for Intel MPI,
>> -cc for Cray's aprun, --cpu-bind for slurm).
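>>
>> For the Intel MPI case that could look something like this (a sketch;
>> check the documentation of your Intel MPI version):
>>
>>     # either pin explicitly, one rank per core ...
>>     export I_MPI_PIN=on
>>     export I_MPI_PIN_DOMAIN=core
>>     # ... or disable IMPI pinning entirely and let mdrun set affinities
>>     export I_MPI_PIN=off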
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Fri, Dec 5, 2014 at 7:35 PM, Szilárd Páll <pall.szilard at gmail.com>
>> wrote:
>>
>>> I don't think this is a sysconf issue. As you seem to have 16-core (hw
>>> thread?) nodes, it looks like sysconf returned the correct value
>>> (16), but the OpenMP runtime actually returned 1. This typically means
>>> that the OpenMP runtime was initialized outside mdrun and for some
>>> reason (which I'm not sure about) it returns 1.
>>>
>>> My guess is that your job scheduler is multi-threading aware and by
>>> default assumes 1 core/hardware thread per rank so you may want to set
>>> some rank depth/width option.
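>>>
>>> With LSF that would be something along these lines (a sketch; the exact
>>> resource strings depend on how your cluster is set up):
>>>
>>>     #BSUB -n 32
>>>     #BSUB -R "span[ptile=16]"       # 16 ranks per node
>>>     #BSUB -R "affinity[core(1)]"    # 1 core per rank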
>>>
>>> --
>>> Szilárd
>>>
>>>
>>> On Fri, Dec 5, 2014 at 1:37 PM, Éric Germaneau <germaneau at sjtu.edu.cn>
>>> wrote:
>>>
>>>> Thank you Mark,
>>>>
>>>> Yes this was the end of the log.
>>>> I tried another input and got the same issue:
>>>>
>>>>     Number of CPUs detected (16) does not match the number reported by
>>>>     OpenMP (1).
>>>>     Consider setting the launch configuration manually!
>>>>     Reading file yukuntest-70K.tpr, VERSION 4.6.3 (single precision)
>>>>     [16:node328] unexpected disconnect completion event from [0:node299]
>>>>     Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>>>     internal ABORT - process 16
>>>>
>>>> Actually, I'm running some tests for our users; I'll talk with the admin
>>>> about how to return information to the standard sysconf() routine in the
>>>> usual way.
>>>> Thank you,
>>>>
>>>>             Éric.
>>>>
>>>>
>>>> On 12/05/2014 07:38 PM, Mark Abraham wrote:
>>>>
>>>>> On Fri, Dec 5, 2014 at 9:15 AM, Éric Germaneau <germaneau at sjtu.edu.cn>
>>>>> wrote:
>>>>>
>>>>>  Dear all,
>>>>>>
>>>>>> I use impi and when I submit o job (via LSF) to more than one node I
>>>>>> get
>>>>>> the following message:
>>>>>>
>>>>>>      Number of CPUs detected (16) does not match the number reported by
>>>>>>      OpenMP (1).
>>>>>>
>>>>>  That suggests this machine has not been set up to return information
>>>>> to the standard sysconf() routine in the usual way. What kind of machine
>>>>> is this?
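>>>>>
>>>>> A quick way to check, on a compute node from within the same job
>>>>> environment (a rough diagnostic, not GROMACS-specific):
>>>>>
>>>>>      getconf _NPROCESSORS_ONLN   # what sysconf() reports
>>>>>      nproc                       # respects the process affinity mask
>>>>>      taskset -cp $$              # the affinity mask itself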
>>>>>
>>>>>      Consider setting the launch configuration manually!
>>>>>
>>>>>>      Reading file test184000atoms_verlet.tpr, VERSION 4.6.2 (single
>>>>>>      precision)
>>>>>>
>>>>>  I hope that's just a 4.6.2-era .tpr, but nobody should be using 4.6.2
>>>>> mdrun because there was a bug in only that version affecting precisely
>>>>> these kinds of issues...
>>>>>
>>>>>>      [16:node319] unexpected disconnect completion event from [11:node328]
>>>>>>      Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>>>>>      internal ABORT - process 16
>>>>>>
>>>>>> I submit the job with
>>>>>>
>>>>>>      mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT
>>>>>>
>>>>>> The machinefile looks like this
>>>>>>
>>>>>>      node328:16
>>>>>>      node319:16
>>>>>>
>>>>>> I'm running the release 4.6.7.
>>>>>> I do not set anything about OpenMP for this job; I'd like to have 32
>>>>>> MPI processes.
>>>>>>
>>>>>> Using one node it works fine.
>>>>>> Any hints here?
>>>>>>
>>>>>>  Everything seems fine. What was the end of the .log file? Can you run
>>>>> another MPI test program thus?
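>>>>>
>>>>> For instance, something as simple as
>>>>>
>>>>>      mpirun -np 32 -machinefile nodelist hostname
>>>>>
>>>>> should print each node name 16 times if the two-node launch itself is
>>>>> healthy (a sanity check of the MPI setup rather than of GROMACS).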
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>                                                                 Éric.
>>>>>>
>>>>>> --
>>>>>> Éric Germaneau (艾海克), Specialist
>>>>>> Center for High Performance Computing
>>>>>> Shanghai Jiao Tong University
>>>>>> Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
>>>>>> M:germaneau at sjtu.edu.cn P:+86-136-4161-6480 W:http://hpc.sjtu.edu.cn
>>>> --
>>>> Éric Germaneau (艾海克), Specialist
>>>> Center for High Performance Computing
>>>> Shanghai Jiao Tong University
>>>> Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
>>>> Email:germaneau at sjtu.edu.cn Mobi:+86-136-4161-6480
>>>> http://hpc.sjtu.edu.cn
>>>
> --
> Éric Germaneau (艾海克), Specialist
> Center for High Performance Computing
> Shanghai Jiao Tong University
> Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
> M:germaneau at sjtu.edu.cn P:+86-136-4161-6480 W:http://hpc.sjtu.edu.cn
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
>
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
>

