[gmx-users] Assistance needed running gromacs 4.6.3 on Blue Gene/P

Prentice Bisbal prentice.bisbal at rutgers.edu
Tue Aug 6 23:43:52 CEST 2013


Dear GMX-users,

I need some assistance running Gromacs 4.6.3 on a Blue Gene/P. Although
I have a background in chemistry, I'm an experienced HPC administrator
who is relatively new to supporting Blue Genes and Gromacs. My first
Gromacs user is having trouble running on our BG/P: his jobs die and
dump core, with no obvious indication (to me, at least) of where the
problem lies.

I compiled Gromacs 4.6.3 with the following options:

------------------------------------------snip-------------------------------------------

cmake .. \
    -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
    -DBUILD_SHARED_LIBS=OFF \
    -DGMX_MPI=ON \
    -DCMAKE_C_FLAGS="-O3 -qarch=450d -qtune=450" \
    -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.2 \
    -DGMX_CPU_ACCELERATION=None \
    -DGMX_THREAD_MPI=OFF \
    -DGMX_OPENMP=OFF \
    -DGMX_DEFAULT_SUFFIX=ON \
    -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
    2>&1 | tee cmake.log

------------------------------------------snip-------------------------------------------
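One sanity check I still need to do (a rough sketch below; the exact
cache variable names may differ) is to confirm from the build directory
that CMake actually cached the XL cross-compiler and the FFTW under
/scratch/bgapps, rather than quietly falling back to the front-end
defaults:

------------------------------------------snip-------------------------------------------

# Run from the build directory after cmake completes. Shows which
# C compiler and FFTW paths were cached; adjust paths if yours differ.
grep 'CMAKE_C_COMPILER:' CMakeCache.txt
grep -i 'fftw' CMakeCache.txt

------------------------------------------snip-------------------------------------------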

When one of my users submits a job, it dumps core. My scheduler is
LoadLeveler, and I used this job command file to replicate the problem.
I added the '-debug 1' flag after searching the gmx-users archives:

------------------------------------------snip-------------------------------------------

#!/bin/bash
# @ job_name = xiang
# @ job_type = bluegene
# @ bg_size = 64
# @ class = small
# @ wall_clock_limit = 01:00:00,00:50:00
# @ error = job.$(Cluster).$(Process).err
# @ output = job.$(Cluster).$(Process).out
# @ environment = COPY_ALL;
# @ queue

source /scratch/bgapps/gromacs-4.6.2/bin/GMXRC.bash

/bgsys/drivers/ppcfloor/bin/mpirun \
    /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi -pin off -deffnm sbm-b_dyn3 \
    -v -dlb yes -debug 1

------------------------------------------snip-------------------------------------------
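If it's relevant, the next thing I plan to try is a scaled-down variant
of the same run, to see whether the crash depends on the rank count or
on the particle-particle/PME split (an untested sketch; I'm assuming
mdrun's -npme option and mpirun's -np behave here the way their
documentation suggests):

------------------------------------------snip-------------------------------------------

# Same input on 4 ranks with no separate PME ranks, to see whether
# the PP/PME split or the rank count is implicated in the crash.
source /scratch/bgapps/gromacs-4.6.2/bin/GMXRC.bash

/bgsys/drivers/ppcfloor/bin/mpirun -np 4 \
    /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi -pin off -deffnm sbm-b_dyn3 \
    -npme 0 -v -debug 1

------------------------------------------snip-------------------------------------------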

Back to the failing 64-rank run: the stderr file shows this at the
bottom, which isn't very helpful:

------------------------------------------snip-------------------------------------------

Reading file sbm-b_dyn3.tpr, VERSION 4.6.2 (single precision)

Will use 48 particle-particle and 16 PME only nodes
This is a guess, check the performance at the end of the log file
Using 64 MPI processes
<Aug 06 17:25:55.303879> BE_MPI (ERROR): The error message in the job 
record is as follows:
<Aug 06 17:25:55.303940> BE_MPI (ERROR):   "killed with signal 6"

-----------------------------------------snip-----------------------------------------------
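As far as I can tell, "killed with signal 6" means SIGABRT, i.e.
something (libc, the XL runtime, or Gromacs itself via a fatal-error
path) called abort() rather than crashing on a bad address. An abort
usually prints a message first, so I intend to sweep all of the job
output for it (a sketch; the patterns are guesses):

------------------------------------------snip-------------------------------------------

# Look for the assertion/fatal-error text that typically precedes
# an abort(), anywhere in the job's stdout/stderr files.
grep -n -i 'assert\|abort\|fatal\|error' job.*.err job.*.out

------------------------------------------snip-------------------------------------------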

I have a bunch of core files, which I can analyze with the IBM core
file processor, and I also have a bunch of debug files from mdrun. I
went through about 12 of the 64 debug files and didn't see anything
that looked like an error.
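For the core files themselves, my understanding is that BG/P writes
lightweight, plain-text cores whose stack frames are raw hex addresses,
so something like the following should symbolize them against the mdrun
binary (a sketch: the grep pattern is a guess at the core format, and
getting file/line info would mean rebuilding with -g added to
CMAKE_C_FLAGS):

------------------------------------------snip-------------------------------------------

# Extract the hex return addresses from one lightweight core file and
# map them back to functions in the static mdrun binary.
# Assumes frames appear as hex addresses; adjust the grep to match
# the actual core format. Needs a -g build for file/line detail.
grep -o '0x[0-9a-f]*' core.0 | \
    xargs addr2line -f -e /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi

------------------------------------------snip-------------------------------------------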

Can anyone offer suggestions on what to look for, or on additional
debugging steps I can take? Please keep in mind that I'm the system
administrator and not an expert Gromacs user, so I'm not sure whether
the inputs are correct, or whether they are appropriate for my BG/P
configuration. Any help will be greatly appreciated.

Thanks,
Prentice



