[gmx-users] multinode issue

Éric Germaneau germaneau at sjtu.edu.cn
Sat Dec 6 09:29:36 CET 2014


Dear Mark, Dear Szilárd,

Thank you for your help.
I did try different I_MPI... options without success.
Something I can't figure out is that I can run jobs with 2 or more OpenMP
threads per MPI process, but not with just one.
It crashes with one OpenMP thread per MPI process, even if I disable I_MPI_PIN.
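
For reference, this is roughly the failing launch (a sketch of what I run
here; $EXE is our MPI-enabled mdrun binary, and the OMP_NUM_THREADS=1 /
-ntomp 1 part is what I add to force one thread per rank):

    # 32 MPI ranks, one OpenMP thread per rank; I_MPI_PIN disabled in this test
    export OMP_NUM_THREADS=1
    export I_MPI_PIN=off
    mpirun -np 32 -machinefile nodelist $EXE -ntomp 1 -v -deffnm $INPUT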

   Éric.


On 12/06/2014 02:54 AM, Szilárd Páll wrote:
> On second thought (and a quick googling), it _seems_ that this is an
> issue caused by the following:
> - the OpenMP runtime gets initialized outside mdrun and its threads
> (or just the master thread) get their affinity set;
> - mdrun then executes the sanity check, at which point
> omp_get_num_procs() reports 1 CPU, most probably because the master
> thread is bound to a single core.
>
> This alone should not be a big deal as long as the affinity settings
> get correctly overridden in mdrun. However, this can have the ugly
> side-effect that, if mdrun's affinity setting gets disabled (mdrun
> backs off if it detects externally set affinities, or if not all
> cores/hardware threads are used), all compute threads will inherit the
> previously set affinity and multiple threads will end up running on the
> same core.
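>
> To see whether that is what is happening, you could print the affinity
> mask each rank inherits before mdrun even starts, e.g. something along
> these lines (a rough sketch, assuming Linux /proc and your existing
> machinefile):
>
>     mpirun -np 32 -machinefile nodelist \
>         sh -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'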
>
> Note that this warning should typically not cause a crash, but it is
> telling you that something is not quite right, so it may be best to
> start with eliminating this warning (hints: I_MPI_PIN for Intel MPI,
> -cc for Cray's aprun, --cpu-bind for slurm).
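>
> Roughly, that would look like the following (sketches off the top of my
> head, so check your local docs; mdrun_mpi stands for whatever your
> MPI-enabled mdrun binary is called):
>
>     # Intel MPI: disable the launcher's pinning, let mdrun pin its threads
>     I_MPI_PIN=off mpirun -np 32 -machinefile nodelist mdrun_mpi -pin on -v -deffnm $INPUT
>
>     # Cray aprun: no CPU binding from the launcher
>     aprun -n 32 -cc none mdrun_mpi -pin on -v -deffnm $INPUT
>
>     # SLURM: no CPU binding from the launcher
>     srun -n 32 --cpu-bind=none mdrun_mpi -pin on -v -deffnm $INPUT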
>
> Cheers,
> --
> Szilárd
>
>
> On Fri, Dec 5, 2014 at 7:35 PM, Szilárd Páll <pall.szilard at gmail.com> wrote:
>> I don't think this is a sysconf issue. As you seem to have 16-core (hw
>> thread?) nodes, it looks like sysconf returned the correct value
>> (16), but the OpenMP runtime actually returned 1. This typically means
>> that the OpenMP runtime was initialized outside mdrun and for some
>> reason (which I'm not sure about) it returns 1.
>>
>> My guess is that your job scheduler is multi-threading aware and by
>> default assumes 1 core/hardware thread per rank, so you may want to set
>> a rank depth/width option.
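>>
>> With LSF + Intel MPI that would look something along these lines (option
>> names off the top of my head, so double-check against your site's docs;
>> mdrun_mpi stands for your MPI-enabled mdrun binary):
>>
>>     #BSUB -n 32                  # 32 MPI ranks in total
>>     #BSUB -R "span[ptile=16]"    # 16 ranks per node
>>     export OMP_NUM_THREADS=1
>>     mpirun -np 32 -perhost 16 mdrun_mpi -ntomp 1 -v -deffnm $INPUT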
>>
>> --
>> Szilárd
>>
>>
>> On Fri, Dec 5, 2014 at 1:37 PM, Éric Germaneau <germaneau at sjtu.edu.cn> wrote:
>>> Thank you Mark,
>>>
>>> Yes, this was the end of the log.
>>> I tried another input and got the same issue:
>>>
>>>     Number of CPUs detected (16) does not match the number reported by
>>>     OpenMP (1).
>>>     Consider setting the launch configuration manually!
>>>     Reading file yukuntest-70K.tpr, VERSION 4.6.3 (single precision)
>>>     [16:node328] unexpected disconnect completion event from [0:node299]
>>>     Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>>     internal ABORT - process 16
>>>
>>> Actually, I'm running some tests for our users; I'll talk with the admin
>>> about how to return information to the standard sysconf() routine in the
>>> usual way.
>>> Thank you,
>>>
>>>             Éric.
>>>
>>>
>>> On 12/05/2014 07:38 PM, Mark Abraham wrote:
>>>> On Fri, Dec 5, 2014 at 9:15 AM, Éric Germaneau <germaneau at sjtu.edu.cn>
>>>> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I use Intel MPI (impi), and when I submit a job (via LSF) to more than
>>>>> one node I get the following message:
>>>>>
>>>>>      Number of CPUs detected (16) does not match the number reported by
>>>>>      OpenMP (1).
>>>>>
>>>> That suggests this machine has not been set up to return information to the
>>>> standard sysconf() routine in the usual way. What kind of machine is this?
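>>>>
>>>> (For what it's worth, a quick way to check what the OS itself reports on a
>>>> compute node, assuming a normal Linux userland, is:
>>>>
>>>>     getconf _NPROCESSORS_ONLN
>>>>
>>>> which queries the same value sysconf() reports; if that does not print 16
>>>> on your nodes, it would point at a detection problem.)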
>>>>
>>>>>      Consider setting the launch configuration manually!
>>>>>      Reading file test184000atoms_verlet.tpr, VERSION 4.6.2 (single
>>>>>      precision)
>>>>>
>>>> I hope that's just a 4.6.2-era .tpr, but nobody should be using 4.6.2
>>>> mdrun
>>>> because there was a bug in only that version affecting precisely these
>>>> kinds of issues...
>>>>
>>>>      [16:node319] unexpected disconnect completion event from [11:node328]
>>>>>      Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>>>>      internal ABORT - process 16
>>>>>
>>>>> I submit doing
>>>>>
>>>>>      mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT
>>>>>
>>>>> The machinefile looks like this
>>>>>
>>>>>      node328:16
>>>>>      node319:16
>>>>>
>>>>> I'm running the release 4.6.7.
>>>>> I do not set anything related to OpenMP for this job; I'd like to have
>>>>> 32 MPI processes.
>>>>>
>>>>> Using one node it works fine.
>>>>> Any hints here?
>>>>>
>>>> Everything seems fine. What was the end of the .log file? Can you run
>>>> another MPI test program thus?
>>>>
>>>> Mark
>>>>
>>>>
>>>>>                                                                Éric.
>>>>>
>>>>> --
>>>>> Éric Germaneau (艾海克), Specialist
>>>>> Center for High Performance Computing
>>>>> Shanghai Jiao Tong University
>>>>> Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
>>>>> M:germaneau at sjtu.edu.cn P:+86-136-4161-6480 W:http://hpc.sjtu.edu.cn
>>> --
>>> Éric Germaneau (艾海克), Specialist
>>> Center for High Performance Computing
>>> Shanghai Jiao Tong University
>>> Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
>>> Email:germaneau at sjtu.edu.cn Mobi:+86-136-4161-6480 http://hpc.sjtu.edu.cn

-- 
Éric Germaneau (艾海克), Specialist
Center for High Performance Computing
Shanghai Jiao Tong University
Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
M:germaneau at sjtu.edu.cn P:+86-136-4161-6480 W:http://hpc.sjtu.edu.cn

