[gmx-users] multinode issue

Éric Germaneau germaneau at sjtu.edu.cn
Sat Dec 6 15:12:08 CET 2014


Thanks, Mark, for trying to help.

On 12/06/2014 10:08 PM, Mark Abraham wrote:
> On Sat, Dec 6, 2014 at 9:29 AM, Éric Germaneau <germaneau at sjtu.edu.cn>
> wrote:
>
>> Dear Mark, Dear Szilárd,
>>
>> Thank you for your help.
>> I did try different I_MPI... options without success.
>> Something I can't figure out is that I can run jobs with 2 or more OpenMP
>> threads per MPI process, but not with just one.
>> It crashes with one OpenMP thread per MPI process, even if I disable
>> I_MPI_PIN.
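>>
>> To be explicit, by "one OpenMP thread per MPI process" I mean a launch along
>> these lines (just a sketch; mdrun_mpi stands for whatever our MPI-enabled
>> binary is called here):
>>
>>      export OMP_NUM_THREADS=1
>>      mpirun -np 32 -machinefile nodelist mdrun_mpi -ntomp 1 -v -deffnm $INPUT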
>>
> OK, well that points to something being configured incorrectly in IMPI,
> rather than any of the other theories. Try OpenMPI ;-)
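>
> If you do, an equivalent launch would look roughly like this (a sketch,
> assuming Open MPI 1.8-style options; note that Open MPI's hostfile syntax is
> "node328 slots=16" rather than "node328:16"):
>
>      mpirun -np 32 --hostfile nodelist --map-by ppr:16:node --bind-to core \
>          mdrun_mpi -v -deffnm $INPUT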
>
> Mark
>
>
>>    Éric.
>>
>>
>>
>> On 12/06/2014 02:54 AM, Szilárd Páll wrote:
>>
>>> On second thought (and after a quick googling), it _seems_ that this is an
>>> issue caused by the following:
>>> - the OpenMP runtime gets initialized outside mdrun and its threads
>>> (or just the master thread) get their affinity set;
>>> - mdrun then executes the sanity check, at which point
>>> omp_get_num_procs() reports 1 CPU, most probably because the master
>>> thread is bound to a single core.
>>>
>>> This alone should not be a big deal as long as the affinity settings
>>> get correctly overridden in mdrun. However, this can have the ugly
>>> side-effect that, if mdrun's affinity setting gets disabled (because mdrun
>>> detects the externally set affinities and backs off, or because not all
>>> cores/hardware threads are used), all compute threads will inherit the
>>> previously set affinity and multiple threads will end up running on the
>>> same core.
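>>>
>>> One quick way to test this hypothesis (a sketch, assuming Linux nodes and an
>>> mpirun that will happily launch a non-MPI command) is to print each
>>> process's affinity mask before mdrun is ever involved:
>>>
>>>      mpirun -np 32 -machinefile nodelist \
>>>          bash -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'
>>>
>>> If each line shows a single core (e.g. "0" instead of "0-15"), the launcher
>>> is pinning the processes before mdrun starts.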
>>>
>>> Note that this warning should typically not cause a crash, but it is
>>> telling you that something is not quite right, so it may be best to
>>> start with eliminating this warning (hints: I_MPI_PIN for Intel MPI,
>>> -cc for Cray's aprun, --cpu-bind for slurm).
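>>>
>>> For Intel MPI, for instance, explicit pinning would look something like this
>>> in front of your usual mpirun line (a sketch; the exact domain value may
>>> need tuning for your setup):
>>>
>>>      export I_MPI_PIN=1
>>>      export I_MPI_PIN_DOMAIN=core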
>>>
>>> Cheers,
>>> --
>>> Szilárd
>>>
>>>
>>> On Fri, Dec 5, 2014 at 7:35 PM, Szilárd Páll <pall.szilard at gmail.com>
>>> wrote:
>>>
>>>> I don't think this is a sysconf issue. As you seem to have 16-core (hw
>>>> thread?) nodes, it looks like sysconf returned the correct value
>>>> (16), but the OpenMP runtime actually returned 1. This typically means
>>>> that the OpenMP runtime was initialized outside mdrun and for some
>>>> reason (which I'm not sure about) it returns 1.
>>>>
>>>> My guess is that your job scheduler is multi-threading aware and by
>>>> default assumes 1 core/hardware thread per rank, so you may want to set
>>>> some rank depth/width option.
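>>>>
>>>> With LSF, for example, that could look roughly like this in the job script
>>>> (a sketch, assuming an LSF version with affinity support enabled on your
>>>> cluster):
>>>>
>>>>      #BSUB -n 32
>>>>      #BSUB -R "span[ptile=16]"
>>>>      #BSUB -R "affinity[core(1)]"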
>>>>
>>>> --
>>>> Szilárd
>>>>
>>>>
>>>> On Fri, Dec 5, 2014 at 1:37 PM, Éric Germaneau <germaneau at sjtu.edu.cn>
>>>> wrote:
>>>>
>>>>> Thank you Mark,
>>>>>
>>>>> Yes this was the end of the log.
>>>>> I tried another input and got the same issue:
>>>>>
>>>>>      Number of CPUs detected (16) does not match the number reported by
>>>>>      OpenMP (1).
>>>>>      Consider setting the launch configuration manually!
>>>>>      Reading file yukuntest-70K.tpr, VERSION 4.6.3 (single precision)
>>>>>      [16:node328] unexpected disconnect completion event from [0:node299]
>>>>>      Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>>>>      internal ABORT - process 16
>>>>>
>>>>> Actually, I'm running some tests for our users; I'll talk with the admin
>>>>> about how to return information to the standard sysconf() routine in the
>>>>> usual way.
>>>>> Thank you,
>>>>>
>>>>>              Éric.
>>>>>
>>>>>
>>>>> On 12/05/2014 07:38 PM, Mark Abraham wrote:
>>>>>
>>>>>> On Fri, Dec 5, 2014 at 9:15 AM, Éric Germaneau <germaneau at sjtu.edu.cn>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I use impi and when I submit a job (via LSF) to more than one node I get
>>>>>>> the following message:
>>>>>>>
>>>>>>>       Number of CPUs detected (16) does not match the number reported by
>>>>>>>       OpenMP (1).
>>>>>>>
>>>>>>   That suggests this machine has not been set up to return information to
>>>>>> the standard sysconf() routine in the usual way. What kind of machine is
>>>>>> this?
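>>>>>>
>>>>>> As a quick check on a compute node, something like the following should
>>>>>> print 16 if sysconf is reporting what we expect (assuming getconf is
>>>>>> available there):
>>>>>>
>>>>>>      getconf _NPROCESSORS_ONLN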
>>>>>>
>>>>>>>       Consider setting the launch configuration manually!
>>>>>>>       Reading file test184000atoms_verlet.tpr, VERSION 4.6.2 (single
>>>>>>>       precision)
>>>>>>>
>>>>>>   I hope that's just a 4.6.2-era .tpr, but nobody should be using a 4.6.2
>>>>>> mdrun, because there was a bug in only that version affecting precisely
>>>>>> these kinds of issues...
>>>>>>
>>>>>>>       [16:node319] unexpected disconnect completion event from [11:node328]
>>>>>>>       Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
>>>>>>>       internal ABORT - process 16
>>>>>>>
>>>>>>> I submit using
>>>>>>>
>>>>>>>       mpirun -np 32 -machinefile nodelist $EXE -v -deffnm $INPUT
>>>>>>>
>>>>>>> The machinefile looks like this
>>>>>>>
>>>>>>>       node328:16
>>>>>>>       node319:16
>>>>>>>
>>>>>>> I'm running release 4.6.7.
>>>>>>> I do not set anything about OpenMP for this job; I'd like to have 32 MPI
>>>>>>> processes.
>>>>>>>
>>>>>>> Using one node it works fine.
>>>>>>> Any hints here?
>>>>>>>
>>>>>>   Everything seems fine. What was the end of the .log file? Can you run
>>>>>> another MPI test program the same way?
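>>>>>>
>>>>>> Even something trivial would exercise the same multi-node path, e.g. (a
>>>>>> sketch; IMB-MPI1 is the Intel MPI Benchmarks binary, if that happens to be
>>>>>> installed alongside your Intel MPI):
>>>>>>
>>>>>>      mpirun -np 32 -machinefile nodelist hostname
>>>>>>      mpirun -np 32 -machinefile nodelist IMB-MPI1 PingPong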
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>>>    Éric.

-- 
Éric Germaneau (???), Specialist
Center for High Performance Computing
Shanghai Jiao Tong University
Room 205 Network Center, 800 Dongchuan Road, Shanghai 200240 China
Email:germaneau at sjtu.edu.cn Mobi:+86-136-4161-6480 http://hpc.sjtu.edu.cn

