[gmx-users] Assistance needed running gromacs 4.6.3 on Blue Gene/P

Prentice Bisbal prentice.bisbal at rutgers.edu
Fri Aug 9 18:03:48 CEST 2013


Mark,

Since I was working with 4.6.2, I built 4.6.3 to see if this was the 
result of a bug in 4.6.2. It isn't; I get the same error with 4.6.3. 
Still, 4.6.3 is the version I'll be working with from now on, since 
it's the latest. Since the problem occurs with both versions, we might 
as well try to fix it in the latest version, right?

I compiled 4.6.3 with the following options to include debugging 
information:

cmake .. \
-DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
   -DBUILD_SHARED_LIBS=OFF \
   -DGMX_MPI=ON \
   -DCMAKE_C_FLAGS="-O0 -g -qstrict -qarch=450 -qtune=450" \
   -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.3 \
   -DGMX_CPU_ACCELERATION=None \
   -DGMX_THREAD_MPI=OFF \
   -DGMX_OPENMP=OFF \
   -DGMX_DEFAULT_SUFFIX=ON \
   -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
    2>&1 | tee cmake.log

For -qarch, I removed the 'd' from the end, so that the double FPU isn't 
used, which can cause problems if the data isn't aligned correctly. The 
-qstrict flag keeps the compiler from performing optimizations that 
could change program semantics. It should be superfluous with 
optimization levels below 3, but I threw it in just to be safe, and set 
-O0. (Of course, I think -g turns off all optimizations anyway.)
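
To double-check that the flags above really ended up in the build 
configuration, something like the following could be run against the 
CMakeCache.txt that cmake writes into the build directory. This is only 
a sketch: the function names are made up for illustration, and the cache 
only records what was requested for CMAKE_C_FLAGS. The toolchain file or 
a build type can still append flags of their own, so the compile lines 
printed by "make VERBOSE=1" remain the final word.

```python
def cached_c_flags(cache_text, key="CMAKE_C_FLAGS"):
    """Return the flags stored for `key` in CMakeCache.txt text, or None.

    Cache entries have the form NAME:TYPE=VALUE; lines starting with
    '#' or '//' are comments.
    """
    for line in cache_text.splitlines():
        if line.startswith(("#", "//")) or "=" not in line:
            continue
        name_and_type, value = line.split("=", 1)
        name = name_and_type.split(":", 1)[0]
        if name == key:
            return value.split()
    return None


def missing_flags(cache_text,
                  wanted=("-O0", "-g", "-qstrict", "-qarch=450", "-qtune=450")):
    """List the wanted flags that are absent from the cached CMAKE_C_FLAGS."""
    flags = cached_c_flags(cache_text) or []
    return [flag for flag in wanted if flag not in flags]
```

Calling missing_flags(open("CMakeCache.txt").read()) from the build 
directory should return an empty list if all five flags were picked up.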

On the BG/P, I had to install FFTW3 separately, and that wasn't 
installed with debugging active, so there are no symbols for FFTW.

One of my coworkers wrote a script that converts BG/P core files to 
stack traces. In all the core files I've looked at so far (9 out of 64), 
the stack ends at a vfprintf call. For example:

-------------------------------------------------------------

/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/resolv/res_init.c:414
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/wgenops.c:419
/scratch/pbisbal/build/gromacs-4.6.3/src/gmxlib/nonbonded/nb_kernel_c/nb_kernel_ElecRFCut_VdwBhamSh_GeomW4P1_c.c:673
??:0
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/sys/dcmf/../ccmi/executor/Broadcast.h:83
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/reduce/reduce_algorithms.c:69
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/bcast/bcast_algorithms.c:227
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:779
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:762
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:374
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/calcmu.c:88
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/mdrun.c:113
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1492
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
../stdio-common/printf_fphex.c:335
../stdio-common/printf_fphex.c:452
??:0
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819

-----------------------------------------------------------------
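
For anyone who wants to reproduce a trace like the one above without my 
coworker's script, here is a rough sketch of the general approach. It is 
not his script. It assumes that the BG/P lightweight core files are 
plain text, that the call-chain addresses sit between "+++STACK" and 
"---STACK" marker lines, and that the toolchain's addr2line is available 
to map addresses to file:line pairs. The marker names, the function 
names, and the "last hex value on the line" rule are all assumptions to 
check against a real core file.

```python
import re
import subprocess

HEX = re.compile(r"0x[0-9a-fA-F]+")


def stack_addresses(core_text, start="+++STACK", end="---STACK"):
    """Collect the last hex value on each line between the two markers.

    The marker strings are assumptions about the core file layout.
    """
    addresses = []
    inside = False
    for line in core_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(start):
            inside = True
        elif stripped.startswith(end):
            inside = False
        elif inside:
            found = HEX.findall(stripped)
            if found:
                addresses.append(found[-1])
    return addresses


def resolve(executable, addresses, addr2line="addr2line"):
    """Run addr2line on the addresses and return its output lines."""
    if not addresses:
        return []
    result = subprocess.run(
        [addr2line, "-e", executable, *addresses],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

On a cross-compiled system the addr2line argument would have to point 
at the cross toolchain's binary, and the executable must be the exact 
mdrun_mpi that produced the core file. Addresses resolved against a 
different build give misleading file:line pairs.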

Another node with a different stack looks like this:

---------------------------------------------------------------

/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/genops.c:982
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/string/memcpy.c:159
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/ns.c:423
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1646
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
../stdio-common/printf_fphex.c:335
../stdio-common/printf_fphex.c:452
??:0
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819

---------------------------------------------------------------

All the stacks look like one of these two.
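
Since I have only gone through 9 of the 64 by hand, that claim could be 
checked across all of them with something like the sketch below. It 
groups traces that consist of exactly the same frames. The function 
names and the "core.*.trace" file naming are hypothetical; the real 
names depend on how the conversion script writes its output.

```python
from collections import defaultdict
from pathlib import Path


def load_frames(text):
    """Return the non-empty, stripped lines of one trace as a tuple."""
    return tuple(line.strip() for line in text.splitlines() if line.strip())


def group_traces(traces):
    """Map each distinct tuple of frames to the sorted names sharing it.

    `traces` is a dict of {name: trace text}.
    """
    groups = defaultdict(list)
    for name, text in traces.items():
        groups[load_frames(text)].append(name)
    return {frames: sorted(names) for frames, names in groups.items()}


def read_trace_dir(directory, pattern="core.*.trace"):
    """Read every trace file matching the (hypothetical) naming pattern."""
    return {p.name: p.read_text() for p in sorted(Path(directory).glob(pattern))}
```

If group_traces(read_trace_dir(".")) comes back with exactly two keys, 
every trace matches one of the two stacks shown above. Any additional 
key is a stack worth looking at separately.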

Is any of this information useful? My coworker, who has a lot of 
experience developing for the Blue Gene/P, says this looks like an I/O 
problem, but he doesn't have the time to dig into the Gromacs source 
code for us. I'm willing to do some digging, but some guidance from 
someone who knows the code well would be very helpful.

Prentice



On 08/06/2013 08:19 PM, Mark Abraham wrote:
> That all looks fine so far. The core file processor won't help unless
> you've compiled with -g. Hopefully cmake -DCMAKE_BUILD_TYPE=Debug will
> do that, but I haven't actually checked that really works. If not, you
> might have to hack cmake/Platform/BlueGeneP-static-XL-C.cmake.
>
> Anyway, if you can compile with -g, then the core file will tell us in
> what function it is dying, which might help locate the problem.
>
> Mark
>
> On Tue, Aug 6, 2013 at 11:43 PM, Prentice Bisbal
> <prentice.bisbal at rutgers.edu> wrote:
>> Dear GMX-users,
>>
>> I need some assistance running Gromacs 4.6.3 on a Blue Gene/P. Although I
>> have a background in Chemistry, I'm an experienced professional HPC admin
>> who's relatively new to supporting Blue Genes and Gromacs. My first Gromacs
>> user is having trouble running Gromacs on our BG/P. His jobs die and dump
>> core, with no obvious signs (not to me, at least) of where the problem lies.
>>
>> I compiled Gromacs 4.6.3 with the following options:
>>
>> ------------------------------------------snip-------------------------------------------
>>
>> cmake .. \
>> -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
>>    -DBUILD_SHARED_LIBS=OFF \
>>    -DGMX_MPI=ON \
>>    -DCMAKE_C_FLAGS="-O3 -qarch=450d -qtune=450" \
>>    -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.2 \
>>    -DGMX_CPU_ACCELERATION=None \
>>    -DGMX_THREAD_MPI=OFF \
>>    -DGMX_OPENMP=OFF \
>>    -DGMX_DEFAULT_SUFFIX=ON \
>>    -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
>>     2>&1 | tee cmake.log
>>
>> ------------------------------------------snip-------------------------------------------
>>
>> When one of my users submits a job, it dumps core. My scheduler is
>> LoadLeveler, and I used this JCF file to replicate the problem. I added the
>> '-debug 1' flag after searching the gmx-users archives:
>>
>> ------------------------------------------snip-------------------------------------------
>>
>> #!/bin/bash
>> # @ job_name = xiang
>> # @ job_type = bluegene
>> # @ bg_size = 64
>> # @ class = small
>> # @ wall_clock_limit = 01:00:00,00:50:00
>> # @ error = job.$(Cluster).$(Process).err
>> # @ output = job.$(Cluster).$(Process).out
>> # @ environment = COPY_ALL;
>> # @ queue
>>
>> source /scratch/bgapps/gromacs-4.6.2/bin/GMXRC.bash
>>
>> /bgsys/drivers/ppcfloor/bin/mpirun
>> /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi -pin off -deffnm sbm-b_dyn3 -v
>> -dlb yes -debug 1
>>
>> ------------------------------------------snip-------------------------------------------
>>
>> The stderr file shows this at the bottom, which isn't too helpful:
>>
>> ------------------------------------------snip-------------------------------------------
>>
>> Reading file sbm-b_dyn3.tpr, VERSION 4.6.2 (single precision)
>>
>> Will use 48 particle-particle and 16 PME only nodes
>> This is a guess, check the performance at the end of the log file
>> Using 64 MPI processes
>> <Aug 06 17:25:55.303879> BE_MPI (ERROR): The error message in the job record
>> is as follows:
>> <Aug 06 17:25:55.303940> BE_MPI (ERROR):   "killed with signal 6"
>>
>> -----------------------------------------snip-----------------------------------------------
>>
>> I have a bunch of core files which I can analyze with the IBM Core file
>> processor, and I also have a bunch of debug files from mdrun. I went through
>> about 12 of the 64, and didn't see anything that looked like an error.
>>
>> Can anyone offer me any suggestions of what to look for, or additional
>> debugging steps I can take? Please keep in mind I'm the system administrator
>> and not an expert user of Gromacs, so I'm not sure if the inputs are
>> correct, or are correct for my BG/P configuration. Any help will be
>> greatly appreciated.
>>
>> Thanks,
>> Prentice
>>



