[gmx-users] problem with running mdrun in parallel

map110+ at pitt.edu
Thu Jan 8 21:03:37 CET 2009


Hi there,

I am trying to run an MD simulation of a 13-residue peptide with distance
restraints. Earlier, mdrun failed on this system with an error concerning
distance restraints and domain decomposition; that turned out to be a bug
in mshift.c, which has since been fixed. At this point, however, the
simulation works properly only with the serial version of mdrun. When I
try to run it in parallel, I get this error message:

NNODES=4, MYRANK=0, HOSTNAME=chong06.chem.pitt.edu
NNODES=4, MYRANK=2, HOSTNAME=chong06.chem.pitt.edu
NNODES=4, MYRANK=3, HOSTNAME=chong06.chem.pitt.edu
NNODES=4, MYRANK=1, HOSTNAME=chong06.chem.pitt.edu
NODEID=0 argc=13
NODEID=3 argc=13
NODEID=2 argc=13
NODEID=1 argc=13
                         :-)  G  R  O  M  A  C  S  (-:

                               Grunge ROck MAChoS

                            :-)  VERSION 4.0.2  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.

         This program is free software; you can redistribute it and/or
          modify it under the terms of the GNU General Public License
         as published by the Free Software Foundation; either version 2
             of the License, or (at your option) any later version.

            :-)  /home/map110/gromacs.4.0.2patched/bin/mdrunmpi  (-:

Option     Filename  Type         Description
------------------------------------------------------------
  -s         md.tpr  Input        Run input file: tpr tpb tpa
  -o         md.trr  Output       Full precision trajectory: trr trj cpt
  -x         md.xtc  Output, Opt! Compressed trajectory (portable xdr format)
-cpi      state.cpt  Input, Opt.  Checkpoint file
-cpo      state.cpt  Output, Opt. Checkpoint file
  -c         md.gro  Output       Structure file: gro g96 pdb
  -e         md.edr  Output       Energy file: edr ene
  -g         md.log  Output       Log file
-dgdl      dgdl.xvg  Output, Opt. xvgr/xmgr file
-field    field.xvg  Output, Opt. xvgr/xmgr file
-table    table.xvg  Input, Opt.  xvgr/xmgr file
-tablep  tablep.xvg  Input, Opt.  xvgr/xmgr file
-tableb   table.xvg  Input, Opt.  xvgr/xmgr file
-rerun    rerun.xtc  Input, Opt.  Trajectory: xtc trr trj gro g96 pdb cpt
-tpi        tpi.xvg  Output, Opt. xvgr/xmgr file
-tpid   tpidist.xvg  Output, Opt. xvgr/xmgr file
 -ei        sam.edi  Input, Opt.  ED sampling input
 -eo        sam.edo  Output, Opt. ED sampling output
  -j       wham.gct  Input, Opt.  General coupling stuff
 -jo        bam.gct  Output, Opt. General coupling stuff
-ffout      gct.xvg  Output, Opt. xvgr/xmgr file
-devout   deviatie.xvg  Output, Opt. xvgr/xmgr file
-runav  runaver.xvg  Output, Opt. xvgr/xmgr file
 -px      pullx.xvg  Output, Opt. xvgr/xmgr file
 -pf      pullf.xvg  Output, Opt. xvgr/xmgr file
-mtx         nm.mtx  Output, Opt. Hessian matrix
 -dn     dipole.ndx  Output, Opt. Index file

Option       Type   Value   Description
------------------------------------------------------
-[no]h       bool   no      Print help info and quit
-nice        int    0       Set the nicelevel
-deffnm      string         Set the default filename for all file options
-[no]xvgr    bool   yes     Add specific codes (legends etc.) in the output
                            xvg files for the xmgrace program
-[no]pd      bool   no      Use particle decomposition
-dd          vector 0 0 0   Domain decomposition grid, 0 is optimize
-npme        int    -1      Number of separate nodes to be used for PME, -1
                            is guess
-ddorder     enum   interleave  DD node order: interleave, pp_pme or
                            cartesian
-[no]ddcheck bool   yes     Check for all bonded interactions with DD
-rdd         real   0       The maximum distance for bonded interactions with
                            DD (nm), 0 is determine from initial coordinates
-rcon        real   0       Maximum distance for P-LINCS (nm), 0 is estimate
-dlb         enum   auto    Dynamic load balancing (with DD): auto, no or yes
-dds         real   0.8     Minimum allowed dlb scaling of the DD cell size
-[no]sum     bool   yes     Sum the energies at every step
-[no]v       bool   no      Be loud and noisy
-[no]compact bool   yes     Write a compact log file
-[no]seppot  bool   no      Write separate V and dVdl terms for each
                            interaction type and node to the log file(s)
-pforce      real   -1      Print all forces larger than this (kJ/mol nm)
-[no]reprod  bool   no      Try to avoid optimizations that affect binary
                            reproducibility
-cpt         real   15      Checkpoint interval (minutes)
-[no]append  bool   no      Append to previous output files when restarting
                            from checkpoint
-maxh        real   -1      Terminate after 0.99 times this time (hours)
-multi       int    0       Do multiple simulations in parallel
-replex      int    0       Attempt replica exchange every # steps
-reseed      int    -1      Seed for replica exchange, -1 is generate a seed
-[no]glas    bool   no      Do glass simulation with special long range
                            corrections
-[no]ionize  bool   no      Do a simulation including the effect of an X-Ray
                            bombardment on your system

Reading file md.tpr, VERSION 4.0 (single precision)

NOTE: atoms involved in distance restraints should be within the longest
cut-off distance; if this is not the case, mdrun generates a fatal error.
In that case, use particle decomposition (mdrun option -pd)


WARNING: Can not write distance restraint data to energy file with domain
decomposition

-------------------------------------------------------
Program mdrunmpi, VERSION 4.0.2
Source code file: domdec.c, line: 5842

Fatal error:
There is no domain decomposition for 4 nodes that is compatible with the
given box and a minimum cell size of 3.03524 nm
Change the number of nodes or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition
-------------------------------------------------------

"It's Bicycle Repair Man !" (Monty Python)

Error on node 0, will try to stop all the nodes
Halting parallel program mdrunmpi on CPU 0 out of 4

-------------------------------------------------------
Program mdrunmpi, VERSION 4.0.2
Source code file: domdec.c, line: 5860

Fatal error:
The size of the domain decomposition grid (0) does not match the number of
nodes (4). The total number of nodes is 4
-------------------------------------------------------

"It's Bicycle Repair Man !" (Monty Python)

Error on node 1, will try to stop all the nodes
Halting parallel program mdrunmpi on CPU 1 out of 4

gcq#205: "It's Bicycle Repair Man !" (Monty Python)

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

gcq#205: "It's Bicycle Repair Man !" (Monty Python)

[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
rank 1 in job 66  chong06.chem.pitt.edu_35438   caused collective abort of
all ranks
  exit status of rank 1: killed by signal 9
rank 0 in job 66  chong06.chem.pitt.edu_35438   caused collective abort of
all ranks
  exit status of rank 0: killed by signal 9
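
If I read the fatal error correctly, with 4 nodes even a 2x2x1
decomposition would need box edges of at least 2 x 3.03524 = 6.07 nm in
two dimensions, and a 4x1x1 grid a 12.14 nm edge in one, which the box
around a 13-residue peptide is unlikely to provide.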


Can anybody explain what the problem could be? Is it possible that this
is another bug?
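
In the meantime, these are the workarounds I understand the messages
above to be suggesting; I have not confirmed that any of them is
appropriate here, and the mpirun line is only a sketch of how I launch
jobs locally (the actual launcher and its flags may differ):

  # 1) Fall back to particle decomposition, as the NOTE suggests
  #    (this also avoids the WARNING about writing distance restraint
  #    data with domain decomposition):
  mpirun -np 4 mdrunmpi -s md.tpr -pd

  # 2) Keep domain decomposition but force an explicit 1-D grid;
  #    this should still fail unless the box is at least
  #    4 x 3.03524 nm long in x:
  mpirun -np 4 mdrunmpi -s md.tpr -dd 4 1 1

  # 3) Keep DD and lower the cut-off used to size the cells (the
  #    error message mentions -rdd; I assume shrinking it risks
  #    losing restraint interactions if the atoms drift apart):
  mpirun -np 4 mdrunmpi -s md.tpr -rdd 1.0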

Thanks in advance!

Maria