Fwd: [gmx-users] Assistance needed running gromacs 4.6.3 on Blue Gene/P

Mark Abraham mark.j.abraham at gmail.com
Mon Aug 12 12:37:56 CEST 2013


Re-sending to list; original bounced when we had some issues with
gmx-users over the weekend.

Mark

---------- Forwarded message ----------
From: Mark Abraham <mark.j.abraham at gmail.com>
Date: Sat, Aug 10, 2013 at 11:49 AM
Subject: Re: [gmx-users] Assistance needed running gromacs 4.6.3 on Blue Gene/P
To: prentice.bisbal at rutgers.edu, Discussion list for GROMACS users
<gmx-users at gromacs.org>


On Fri, Aug 9, 2013 at 6:03 PM, Prentice Bisbal
<prentice.bisbal at rutgers.edu> wrote:
> Mark,
>
> Since I was working with 4.6.2, I built 4.6.3 to see if this was the result
> of a bug in 4.6.2. It isn't; I get the same error with 4.6.3, but that is the
> version I'll be working with from now on, since it's the latest. Since the
> problem occurs with both versions, might as well try to fix it in the latest
> version, right?

Yep.

> I compiled 4.6.3 with the following options to include debugging
> information:
>
>
> cmake .. \
> -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
>   -DBUILD_SHARED_LIBS=OFF \
>   -DGMX_MPI=ON \
>   -DCMAKE_C_FLAGS="-O0 -g -qstrict -qarch=450 -qtune=450" \
>   -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.3 \
>   -DGMX_CPU_ACCELERATION=None \
>   -DGMX_THREAD_MPI=OFF \
>   -DGMX_OPENMP=OFF \
>   -DGMX_DEFAULT_SUFFIX=ON \
>   -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
>    2>&1 | tee cmake.log
>
> For qarch, I removed the 'd' from the end, so that the double-FPU isn't
> used, which can cause problems if the data isn't aligned correctly. The
> -qstrict makes sure certain optimizations aren't performed. It should be
> superfluous with optimization levels below 3, but I threw it in just to be
> safe, and set -O0. (Of course, I think -g turns off all optimizations anyway.)

Mostly true, but either way it's fine and immaterial :-)

> On the BG/P, I had to install FFTW3 separately, and that wasn't installed
> with debugging active, so there are no symbols for FFTW.

Yeah, that won't be a problem.

> One of my coworkers wrote a script that converts BG/P core files to stack
> traces. In all the kernels I've looked at so far (9 out of 64), the stack
> ends at a vfprintf call. For example:

Functions like vfprintf with va_list arguments use a macro that was
not implemented correctly on BG/L and BG/P. This has caused problems
with GROMACS before. See
http://www-01.ibm.com/support/docview.wss?uid=swg1LI73769 for details.
If this turns out to be the problem, then compiling just the files
that use va_list with -O0 should help (starting with
src/gmxlib/gmx_fatal.c). Or perhaps update the compiler, if IBM really
did fix this at some point, and/or file a support request with IBM.
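
If it comes to that, one way to pin the flag to individual files
(untested here, so treat it as a sketch, and extend the file list to
whatever else uses va_list) is a per-source-file property in
src/gmxlib/CMakeLists.txt:

  # Sketch: force -O0 on the file(s) that format va_list arguments, to
  # sidestep the BG/P macro bug; gmx_fatal.c is the obvious first candidate.
  set_source_files_properties(gmx_fatal.c PROPERTIES COMPILE_FLAGS "-O0")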

However...

> -------------------------------------------------------------
>
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/resolv/res_init.c:414
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/wgenops.c:419
> /scratch/pbisbal/build/gromacs-4.6.3/src/gmxlib/nonbonded/nb_kernel_c/nb_kernel_ElecRFCut_VdwBhamSh_GeomW4P1_c.c:673
> ??:0
> /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/sys/dcmf/../ccmi/executor/Broadcast.h:83
> /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/reduce/reduce_algorithms.c:69
> /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/bcast/bcast_algorithms.c:227
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:779
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:762
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:374
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/calcmu.c:88
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/mdrun.c:113
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1492
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
> ../stdio-common/printf_fphex.c:335
> ../stdio-common/printf_fphex.c:452
> ??:0
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
>
> -----------------------------------------------------------------

This is the kind of thing I wanted to see, but it looks like you are
analysing a core file using an executable that was not the one that
generated the core file. The above does not make sense as a stack
trace. You will need to run the debug-enabled code and look at the
stack trace with the same executable. If the problem is a va_list one,
you might see that the last function is gmx_fatal: mdrun was trying to
exit gracefully from some other normal error condition, and ran into
the above implementation error while trying to issue the error
message.
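
If it helps, you can also decode the addresses from a core file by hand
against the matching binary with addr2line (use the cross addr2line from
the BG/P toolchain if the front-end one complains), e.g.

  # Sketch only: the install path and addresses are placeholders; use the
  # addresses from your core file and the mdrun_mpi from your debug build.
  addr2line -f -e /scratch/bgapps/gromacs-4.6.3/bin/mdrun_mpi 0x01a2c400 0x0104f3b8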

> Another node with a different stack looks like this:
>
> ---------------------------------------------------------------
>
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/genops.c:982
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/string/memcpy.c:159
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/ns.c:423
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1646
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
> ../stdio-common/printf_fphex.c:335
> ../stdio-common/printf_fphex.c:452
> ??:0
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
>
> ---------------------------------------------------------------
>
> All the stacks look like one of these two.

Same problem.

> Is any of this information useful? My coworker, who has a lot of experience
> developing for Blue Gene/P's, says this looks like an I/O problem, but he
> doesn't have the time to dig into the Gromacs source code for us. I'm
> willing to do some digging, but some guidance from someone who knows the code
> well would be very helpful.

You've prompted me to remember what the issue probably is, but we
can't actually identify it until we have a proper stack trace.

Mark

> Prentice
>
>
>
>
> On 08/06/2013 08:19 PM, Mark Abraham wrote:
>>
>> That all looks fine so far. The core file processor won't help unless
>> you've compiled with -g. Hopefully cmake -DCMAKE_BUILD_TYPE=Debug will
>> do that, but I haven't actually checked that it really works. If not, you
>> might have to hack cmake/Platform/BlueGeneP-static-XL-C.cmake.
>>
>> Anyway, if you can compile with -g, then the core file will tell us in
>> what function it is dying, which might help locate the problem.
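>>
>> For example (a sketch; I haven't verified it on BG/P), adding the
>> build type to the cmake invocation you already use:
>>
>>   cmake .. \
>>     -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
>>     -DCMAKE_BUILD_TYPE=Debug \
>>     -DGMX_MPI=ON -DGMX_THREAD_MPI=OFF -DGMX_OPENMP=OFF \
>>     -DGMX_CPU_ACCELERATION=None \
>>     -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2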
>>
>> Mark
>>
>> On Tue, Aug 6, 2013 at 11:43 PM, Prentice Bisbal
>> <prentice.bisbal at rutgers.edu> wrote:
>>>
>>> Dear GMX-users,
>>>
>>> I need some assistance running Gromacs 4.6.3 on a Blue Gene/P. Although I
>>> have a background in Chemistry, I'm an experienced professional HPC
>>> admin
>>> who's relatively new to supporting Blue Genes and Gromacs. My first
>>> Gromacs
>>> user is having trouble running Gromacs on our BG/P. His jobs die and dump
>>> core, with no obvious signs (not to me, at least) of where the problem
>>> lies.
>>>
>>> I compiled Gromacs 4.6.3 with the following options:
>>>
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> cmake .. \
>>> -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
>>>    -DBUILD_SHARED_LIBS=OFF \
>>>    -DGMX_MPI=ON \
>>>    -DCMAKE_C_FLAGS="-O3 -qarch=450d -qtune=450" \
>>>    -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.2 \
>>>    -DGMX_CPU_ACCELERATION=None \
>>>    -DGMX_THREAD_MPI=OFF \
>>>    -DGMX_OPENMP=OFF \
>>>    -DGMX_DEFAULT_SUFFIX=ON \
>>>    -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
>>>     2>&1 | tee cmake.log
>>>
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> When one of my users submits a job, it dumps core. My scheduler is
>>> LoadLeveler, and I used this JCF file to replicate the problem. I added
>>> the
>>> '-debug 1' flag after searching the gmx-users archives:
>>>
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> #!/bin/bash
>>> # @ job_name = xiang
>>> # @ job_type = bluegene
>>> # @ bg_size = 64
>>> # @ class = small
>>> # @ wall_clock_limit = 01:00:00,00:50:00
>>> # @ error = job.$(Cluster).$(Process).err
>>> # @ output = job.$(Cluster).$(Process).out
>>> # @ environment = COPY_ALL;
>>> # @ queue
>>>
>>> source /scratch/bgapps/gromacs-4.6.2/bin/GMXRC.bash
>>>
>>> /bgsys/drivers/ppcfloor/bin/mpirun /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi -pin off -deffnm sbm-b_dyn3 -v -dlb yes -debug 1
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> The stderr file shows this at the bottom, which isn't too helpful:
>>>
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> Reading file sbm-b_dyn3.tpr, VERSION 4.6.2 (single precision)
>>>
>>> Will use 48 particle-particle and 16 PME only nodes
>>> This is a guess, check the performance at the end of the log file
>>> Using 64 MPI processes
>>> <Aug 06 17:25:55.303879> BE_MPI (ERROR): The error message in the job
>>> record
>>> is as follows:
>>> <Aug 06 17:25:55.303940> BE_MPI (ERROR):   "killed with signal 6"
>>>
>>>
>>> -----------------------------------------snip-----------------------------------------------
>>>
>>> I have a bunch of core files which I can analyze with the IBM Core file
>>> processor, and I also have bunch of debug files from mdrun. I went
>>> through
>>> about 12/64 of them, and didn't see anything that looked like an error.
>>>
>>> Can anyone offer me any suggestions of what to look for, or additional
>>> debugging steps I can take? Please keep in mind I'm the system
>>> administrator
>>> and not an expert-user of gromacs, so I'm not sure if the inputs are
>>> correct, or appropriate for my BG/P configuration. Any help will be
>>> greatly appreciated.
>>>
>>> Thanks,
>>> Prentice
>>>