[gmx-developers] Re: broadcast of zero-length arrays
Mark Abraham
Mark.Abraham at anu.edu.au
Tue Nov 24 01:12:14 CET 2009
Mathias PUETZ wrote:
> Hi,
>
> on BG/L, MPI_Bcast was buggy for zero-length broadcasts.
> I reported a defect to IBM BG development about a year ago and it was fixed
> shortly thereafter.
> By the MPI standard null pointers are legal for zero length buffers.
> The bug is in the BG MPI argument checking, not in the actual routine that
> does the broadcast.
>
> I suggest upgrading your BG/L driver software to the latest level. This
> should take care of it.
> If the problem persists, please contact your IBM customer support and
> open a new defect report (perhaps - I hope not - the original fix was lost
> in a newer driver version).
Thanks - I'll look into that.
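For reference, the rule Mathias describes (a null buffer is acceptable when the count is zero) amounts to an argument check along the following lines. This is only a sketch: check_bcast_args() and the CHECK_* codes are hypothetical names made up for illustration, not IBM's code or anything in the MPI library.

```c
#include <stddef.h>

/* Hypothetical error codes for this sketch; not the MPI library's own. */
enum { CHECK_OK = 0, CHECK_ERR_COUNT = 1, CHECK_ERR_BUFFER = 2 };

/* Validate a (buffer, count) pair: a null buffer is rejected only
 * when there is actually data to move. */
static int check_bcast_args(const void *buf, int count)
{
    if (count < 0)
    {
        return CHECK_ERR_COUNT;   /* negative counts are always invalid */
    }
    if (count > 0 && buf == NULL)
    {
        return CHECK_ERR_BUFFER;  /* data expected but no buffer given */
    }
    return CHECK_OK;              /* includes count == 0 with buf == NULL */
}
```

With a check of this shape, a call with a null pointer and zero count passes validation, while a null pointer with a positive count is still reported as an error.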
Brian Smith of IBM suggested off-list that setting BGLMPI_BCAST=MPICH
would be a useful diagnostic. Setting that allowed GROMACS to continue
past the previous crash point, indicating that the issue is probably in
the optimized version.
Per Brian's request, here's a stack trace with the above variable not
set. Core was dumped on many of the 64 MPI processes.
[12:54][bgfen1-c:timing_prebugfix]$ tail -n 20 core.0 | addr2line -f -e /hpc/home/mja163/builds/gromacs_builds/git/pre-bugfix/mpi_debug/src/kernel/mdrun
??
??:0
??
??:0
??
??:0
??
??:0
??
??:0
BGLMP_TreeBcastPacketDispatch
/bglhome/usr6/bgbuild/V1R3M4_300_2008-080728/ppc/src/bglsw/comm/sys/msglayer/util/BGLMLVNMutil.h:135
BGLML_Messager_tree_advance
/bglhome/usr6/bgbuild/V1R3M4_300_2008-080728/ppc/src/bglsw/comm/sys/msglayer/base/advance/BGLML_advance.h:295
BGLMP_TreeBcast
/bglhome/usr6/bgbuild/V1R3M4_300_2008-080728/ppc/src/bglsw/comm/sys/msglayer/proto/collectives/TreeBcast/BGLMP_TreeBcast.c:161
MPIDI_BGLTR_Bcast
/bglhome/usr6/bgbuild/V1R3M4_300_2008-080728/ppc/src/bglsw/comm/lib/mpi/mpich2/src/mpid/bgltorus5/src/coll/mpidi_bgltr/mpidi_bgltr_bcast.c:62
MPIDI_Coll_Comm_Bcast_wrapper
/bglhome/usr6/bgbuild/V1R3M4_300_2008-080728/ppc/src/bglsw/comm/lib/mpi/mpich2/src/mpid/bgltorus5/src/coll/mpid_collectives.c:1022
PMPI_Bcast
/bglhome/usr6/bgbuild/V1R3M4_300_2008-080728/ppc/src/bglsw/comm/lib/mpi/mpich2/src/mpi/coll/bcast.c:767
gmx_bcast
../../../src/gmxlib/network.c:363
bc_grpopts
../../../src/gmxlib/mvdata.c:341
bc_inputrec
../../../src/gmxlib/mvdata.c:402
bcast_ir_mtop
../../../src/gmxlib/mvdata.c:449
init_parallel
../../../src/mdlib/init.c:166
mdrunner
../../../src/kernel/md.c:165
main
../../../src/kernel/mdrun.c:496
_start_blrts
../sysdeps/blrts/start.c:107
??
??:0
That looks very much like Mathias's bugfix is required on my system.
Thanks to all for the prompt discussion.
Mark
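P.S. For anyone stuck on an older driver, the application-side workaround discussed in this thread (checking for non-zero length before broadcasting) can be sketched roughly as below. This is an illustration under assumptions, not the actual change to nblock_bc: guarded_bcast(), bcast_fn and count_calls() are hypothetical names, and the callback stands in for the real broadcast routine so that the sketch compiles without MPI.

```c
#include <stddef.h>

/* Hypothetical callback type standing in for the routine that really
 * performs the broadcast (gmx_bcast() plays that role in the trace). */
typedef void (*bcast_fn)(int nbytes, void *buf, void *ctx);

/* Broadcast nelem elements of elem_size bytes each, skipping the call
 * entirely when there is nothing to send. Returns 1 if the broadcast
 * routine was called and 0 if it was skipped. */
static int guarded_bcast(bcast_fn do_bcast, void *ctx,
                         int nelem, size_t elem_size, void *buf)
{
    if (nelem <= 0)
    {
        return 0; /* zero-length array: never reaches the MPI layer */
    }
    do_bcast((int)((size_t)nelem * elem_size), buf, ctx);
    return 1;
}

/* Stand-in broadcast used only to demonstrate the guard: it counts
 * how many times it was invoked via the int that ctx points to. */
static void count_calls(int nbytes, void *buf, void *ctx)
{
    (void)nbytes;
    (void)buf;
    ++*(int *)ctx;
}
```

Calling guarded_bcast(count_calls, &calls, 0, sizeof(double), NULL) leaves the counter untouched, whereas a call with three elements and a real buffer invokes the callback once. The guard is only safe if every rank sees the same length, so that all ranks skip the collective together.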
> Mit freundlichen Grüßen / Kind regards
> Dr. Mathias Puetz
>
> Application Performance Specialist
> IBM Sales & Distribution, STG Sales / Industries Deep Computing FTSS
>
>
> Today's Topics:
>
> 1. broadcast of zero-length arrays (Mark Abraham)
> 2. Re: broadcast of zero-length arrays (Roland Schulz)
> 3. Re: broadcast of zero-length arrays (hess at sbc.su.se)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 23 Nov 2009 14:11:32 +1100
> From: Mark Abraham <Mark.Abraham at anu.edu.au>
> Subject: [gmx-developers] broadcast of zero-length arrays
> To: Gromacs Developers <gmx-developers at gromacs.org>
> Message-ID: <4B09FD64.1020407 at anu.edu.au>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
>
> During src/gmxlib/mvdata.c bc_grpopts(), my BlueGene/L segfaults during
> the broadcasts of the QMMM stuff. The lines that break are attempts to
> broadcast arrays of zero length. Adding a check for non-zero length into
> the definition of nblock_bc fixes the problem. Presumably a null pointer
> is being dereferenced inside the MPI library.
>
> I'm not sure whether this observation is indicative of (this version of)
> IBM's MPI library not having implemented the full standard, the standard
> not specifying behaviour in this case, or GROMACS not being sufficiently
> defensive. I haven't found anything useful in the MPI documentation I
> have to hand. You could argue cases either way - the implementors of the
> library want to avoid such checks to speed performance, and the users of
> the library expect it either to take care of such housekeeping for them,
> or not dereference pointers unnecessarily (think buffering)...
>
> Does anyone know what expected behaviour is here?
>
> Cheers,
>
> Mark
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 23 Nov 2009 00:24:30 -0500
> From: Roland Schulz <roland at utk.edu>
> Subject: Re: [gmx-developers] broadcast of zero-length arrays
> To: Discussion list for GROMACS development
> <gmx-developers at gromacs.org>, Brian Smith
> <smithbr at us.ibm.com>
> Message-ID:
> <c93c21390911222124n717712fkaa683b0d6a40226 at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Brian,
>
> you were able to help before with a segfault in the BlueGene MPI layer (in
> scatterv). Do you consider the segfault described below a bug in the MPI
> layer or in GROMACS?
>
> Roland
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 23 Nov 2009 09:20:41 +0100 (CET)
> From: hess at sbc.su.se
> Subject: Re: [gmx-developers] broadcast of zero-length arrays
> To: "Discussion list for GROMACS development"
> <gmx-developers at gromacs.org>
> Cc: Brian Smith <smithbr at us.ibm.com>
> Message-ID: <37050.90.163.29.105.1258964441.squirrel at mail.sbc.su.se>
> Content-Type: text/plain;charset=iso-8859-1
>
> Hi,
>
> I remember having such issues before, I think also on an IBM machine,
> and discussing this with somebody from IBM.
> I thought I had removed all MPI calls with NULL pointers,
> but apparently this is not the case.
> I committed fixes for nblock_bc in mvdata for 4.0.6 and git master.
>
> Berk
>
>
>
>
> ------------------------------
>
> --
> gmx-developers mailing list
> gmx-developers at gromacs.org
> http://lists.gromacs.org/mailman/listinfo/gmx-developers
>
>
> End of gmx-developers Digest, Vol 67, Issue 16
> **********************************************
>
>