[gmx-users] Seeking advice on how to build Gromacs on Teragrid resources

Mark Abraham Mark.Abraham at anu.edu.au
Thu Dec 9 23:38:58 CET 2010


On 10/12/2010 9:14 AM, J. Nathan Scott wrote:
> Hello gmx users! I realize this may be a touch off topic, but I am
> hoping that someone out there can offer some advice on how to build
> Gromacs for parallel use on a Teragrid site. Our group is currently
> using Abe on Teragrid, and unfortunately the latest version of Gromacs
> compiled for public use on Abe is 4.0.2. Apparently installation of
> 4.5.3 is at least on the to-do list for Abe, but we would very much
> like to use 4.5.3 now if we can get this issue figured out.
>
> I have built a parallel version of mdrun using Abe installed versions
> of fftw3 and mvapich2 using the following commands:

MPICH use is certainly discouraged, as GROMACS seems to find bugs in 
it. I'm not sure about MVAPICH2, but you should certainly make sure 
you are using the latest version of it. Compare with OpenMPI if you can.
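
For comparison, something like the following would build mdrun against 
OpenMPI's compiler wrapper. This is only a rough sketch; the FFTW path 
is illustrative, and I don't know which OpenMPI module Abe actually 
provides, so adjust to whatever "module load" gives you there.

# run "make distclean" first if reusing the same source tree
setenv CC mpicc
setenv CPPFLAGS "-I/path/to/fftw-3.1.2/include"
setenv LDFLAGS "-L/path/to/fftw-3.1.2/lib"
./configure --enable-mpi --enable-float --prefix=/u/ac/jnscott/gromacs --program-suffix=_mpi
make -j 8 mdrun && make install-mdrun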

> setenv CPPFLAGS "-I/usr/apps/math/fftw/fftw-3.1.2/gcc/include/
> -I/usr/apps/mpi/marmot_mvapich2_intel/include"
> setenv LDFLAGS "-L/usr/apps/math/fftw/fftw-3.1.2/gcc/lib
> -L/usr/apps/mpi/marmot_mvapich2_intel/lib"
> ./configure --enable-mpi --enable-float --prefix=/u/ac/jnscott/gromacs
> --program-suffix=_mpi
> make -j 8 mdrun && make install-mdrun
>
> My PBS script file looks like the following:
>
> -------------------------------
> #!/bin/csh
> #PBS -l nodes=2:ppn=8

Simplify the conditions when trying to diagnose a problem: try running 
on a single 8-processor node, or even on one processor. Your crash is 
consistent with some MPI problem, because (off the top of my head) it 
seems to happen at the point where GROMACS starts communicating to 
distribute the input data.
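
For instance (untested, just a sketch trimmed down from your own 
script), a single-node test job could look something like:

#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=0:30:00
#PBS -N gmx_test
cd /u/ac/jnscott/1stn/1stn_wt/oplsaa_spce
mvapich2-start-mpd
setenv NP `wc -l ${PBS_NODEFILE} | cut -d'/' -f1`
mpirun -np ${NP} mdrun_mpi -s nvt.tpr -deffnm nvt_single

If that runs fine but a 16-process job across two nodes still dies, 
that points at the inter-node MPI layer rather than at GROMACS.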

Mark

> #PBS -V
> #PBS -o pbs_nvt.out
> #PBS -e pbs_nvt.err
> #PBS -l walltime=2:00:00
> #PBS -N gmx
> cd /u/ac/jnscott/1stn/1stn_wt/oplsaa_spce
> mvapich2-start-mpd
> setenv NP `wc -l ${PBS_NODEFILE} | cut -d'/' -f1`
> setenv MV2_SRQ_SIZE 4000
> mpirun -np ${NP} mdrun_mpi -s nvt.tpr -o nvt.trr -x nvt.xtc -cpo
> nvt.cpt -c nvt.gro -e nvt.edr -g nvt.log -dlb yes
> -------------------------------
>
> Unfortunately my runs always fail in the same manner. The log file
> simply ends, as you can see below. It appears that Gromacs is picking
> up the correct number of nodes specified in the PBS script, but then
> something causes it to quit abruptly with no error message.
>
> -------------------------------
> <snip>
> Initializing Domain Decomposition on 16 nodes
> Dynamic load balancing: yes
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>      two-body bonded interactions: 0.526 nm, LJ-14, atoms 1735 1744
>    multi-body bonded interactions: 0.526 nm, Ryckaert-Bell., atoms 1735 1744
> Minimum cell size due to bonded interactions: 0.578 nm
> Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
> Estimated maximum distance required for P-LINCS: 0.820 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Guess for relative PME load: 0.27
> Will use 10 particle-particle and 6 PME only nodes
> This is a guess, check the performance at the end of the log file
> Using 6 separate PME nodes
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 10 cells with a minimum initial size of 1.025 nm
> The maximum allowed number of cells is: X 5 Y 5 Z 4
> Domain decomposition grid 2 x 5 x 1, separate PME nodes 6
> PME domain decomposition: 2 x 3 x 1
> Interleaving PP and PME nodes
> This is a particle-particle only node
>
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using two step summing over 2 groups of on average 5.0 processes
>
> Table routines are used for coulomb: TRUE
> Table routines are used for vdw:     FALSE
> Will do PME sum in reciprocal space.
>
> <snip>
>
> Will do ordinary reciprocal space Ewald sum.
> Using a Gaussian width (1/beta) of 0.320163 nm for Ewald
> Cut-off's:   NS: 1   Coulomb: 1   LJ: 1
> Long Range LJ corr.: <C6>  3.3589e-04
> System total charge: 0.000
> Generated table with 1000 data points for Ewald.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for LJ6.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for LJ12.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for 1-4 COUL.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for 1-4 LJ6.
> Tabscale = 500 points/nm
> Generated table with 1000 data points for 1-4 LJ12.
> Tabscale = 500 points/nm
>
> Enabling SPC-like water optimization for 6952 molecules.
>
> Configuring nonbonded kernels...
> Configuring standard C nonbonded kernels...
> Testing x86_64 SSE2 support... present.
>
> Removing pbc first time
>
> Initializing Parallel LINear Constraint Solver
>
> <snip>
> Linking all bonded interactions to atoms
> There are 9778 inter charge-group exclusions,
> will use an extra communication step for exclusion forces for PME
>
> The maximum number of communication pulses is: X 1 Y 2
> The minimum size for domain decomposition cells is 0.827 nm
> The requested allowed shrink of DD cells (option -dds) is: 0.80
> The allowed shrink of domain decomposition cells is: X 0.35 Y 0.73
> The maximum allowed distance for charge groups involved in interactions is:
>                   non-bonded interactions           1.000 nm
>              two-body bonded interactions  (-rdd)   1.000 nm
>            multi-body bonded interactions  (-rdd)   0.827 nm
>    atoms separated by up to 5 constraints  (-rcon)  0.827 nm
>
>
> Making 2D domain decomposition grid 2 x 5 x 1, home cell index 0 0 0
>
> Center of mass motion removal mode is Linear
> We have the following groups for center of mass motion removal:
>    0:  rest
>
> ++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
> G. Bussi, D. Donadio and M. Parrinello
> Canonical sampling through velocity rescaling
> J. Chem. Phys. 126 (2007) pp. 014101
> -------- -------- --- Thank You --- -------- --------
> -----------------------------------------------------------
>
> My PBS error file is not of much help either, I fear; an example of
> such a file is pasted below:
>
> -----------------------------------
> stty: standard input: Invalid argument
> stty: standard input: Invalid argument
> NNODES=16, MYRANK=0, HOSTNAME=abe0828
> NNODES=16, MYRANK=2, HOSTNAME=abe0828
> NNODES=16, MYRANK=12, HOSTNAME=abe0828
> NNODES=16, MYRANK=4, HOSTNAME=abe0828
> NNODES=16, MYRANK=10, HOSTNAME=abe0828
> NNODES=16, MYRANK=8, HOSTNAME=abe0828
> NNODES=16, MYRANK=6, HOSTNAME=abe0828
> NNODES=16, MYRANK=14, HOSTNAME=abe0828
> NODEID=0 argc=17
> NODEID=2 argc=17
> NODEID=4 argc=17
> NODEID=10 argc=17
> NODEID=12 argc=17
> NODEID=6 argc=17
> NODEID=14 argc=17
> NODEID=8 argc=17
> NNODES=16, MYRANK=5, HOSTNAME=abe0825
> NNODES=16, MYRANK=13, HOSTNAME=abe0825
>                           :-)  G  R  O  M  A  C  S  (-:
>
> NNODES=16, MYRANK=9, HOSTNAME=abe0825
> NNODES=16, MYRANK=11, HOSTNAME=abe0825
>                     Great Red Oystrich Makes All Chemists Sane
>
>                              :-)  VERSION 4.5.3  (-:
>
> <snip>
> Back Off! I just backed up nvt.log to ./#nvt.log.2#
> Reading file nvt.tpr, VERSION 4.5.3 (single precision)
>
> Will use 10 particle-particle and 6 PME only nodes
> This is a guess, check the performance at the end of the log file
> Making 2D domain decomposition 2 x 5 x 1
>
> Back Off! I just backed up nvt.edr to ./#nvt.edr.2#
> ----------------------------------------------
>
> The non-Torque section of the PBS log file is below:
>
> -----------------------------------------------
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> running mpdallexit on abe0828
> LAUNCHED mpd on abe0828  via
> RUNNING: mpd on abe0828
> LAUNCHED mpd on abe0825  via  abe0828
> RUNNING: mpd on abe0825
> abe0828_43972 (10.1.67.66)
> abe0825_37571 (10.1.67.63)
> rank 1 in job 1  abe0828_43972   caused collective abort of all ranks
>    exit status of rank 1: killed by signal 9
> rank 0 in job 1  abe0828_43972   caused collective abort of all ranks
>    exit status of rank 0: killed by signal 9
> -------------------------------------------------
>
> I should also note that both .edr and .trr files are created in
> the working directory, but both files are 0 bytes.
>
> Like I said, I realize this question is perhaps a bit off the topic of
> Gromacs exclusively, but I hope that someone can offer some tips or
> spot any obvious problems with my method that I have not noticed. I
> would sincerely appreciate any help you can offer a novice.
>
> Best Wishes,
> -Nathan
>
>
> ----------
> J. Nathan Scott, Ph.D.
> Postdoctoral Fellow
> Department of Chemistry and Biochemistry
> Montana State University



