[gmx-users] parallel problems...

Arvid Soderhall arvid at fmp-berlin.de
Tue Feb 1 15:25:18 CET 2005

Hi all,
I have a problem running jobs on more than 4 CPUs on our brand new 
cluster. I use a PBS script which launches the command
/usr/local/encap/mpich-126-gcc-64/bin/mpirun -machinefile machines -np 4 
/swc/gromacs/bin/mdrun_mpi -np 4 -v -s RW12_s20_a.tpr -o RW12_s20_a.trr 
-x RW12_s20_a.xtc -c RW12_s20_a.gro -e RW12_s20_a.edr -g RW12_s20_a.log
This runs nicely, but if I use 6 CPUs (or 8) the calculation crashes and 
the following errors are reported in the standard output file:
rm_10214:  p4_error: semget failed for setnum: 0
p0_7618: (0.113281) net_recv failed for fd = 7
p0_7618:  p4_error: net_recv read, errno = : 104
p0_7618: (4.117188) net_send: could not write to fd=4, errno = 32
Moreover, some scattered mdrun_mpi processes keep on running after the 
crash (but never on the master node).
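
One check that might help narrow this down, assuming the "semget failed" 
error comes from System V semaphore arrays left behind by earlier crashed 
runs (an assumption, the log above does not confirm it). The helper name 
count_sem_arrays is hypothetical; it only counts the arrays a user owns in 
`ipcs -s` style output:

```shell
# Count System V semaphore arrays owned by a given user.
# Reads `ipcs -s` style output on stdin; the helper name is hypothetical.
count_sem_arrays() {
    # Data rows start with a hex key (0x...); the owner is the third column.
    awk -v user="$1" '$1 ~ /^0x/ && $3 == user { n++ } END { print n + 0 }'
}

# Typical use on each compute node (not run here):
#   ipcs -s | count_sem_arrays "$USER"
# Arrays left behind by a crashed run can then be removed by id, e.g.
#   ipcrm -s <semid>
```

A count that grows after every failed job would point to leftover 
semaphores rather than to GROMACS itself.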

I have tried to use mpiexec instead of mpirun:
/usr/local/bin/mpiexec -verbose -comm lam -kill -mpich-p4-no-shmem 
Then I get a different error message in the error output file:
mpiexec: Warning: parse_args: argument "-mpich-p4-[no-]shmem" ignored since
  communication library not MPICH/P4.
NNODES=1, MYRANK=0, HOSTNAME=node24.beowulf.cluster
                         :-)  G  R  O  M  A  C  S  (-:

                   Good gRace! Old Maple Actually Chews Slate

                            :-)  VERSION 3.2.1  (-:

      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2004, The GROMACS development team,
            check out http://www.gromacs.org for more information.
: And so on until
[no]compact   bool    yes  Write a compact log file
  -[no]multi   bool     no  Do multiple simulations in parallel (only with
                            -np > 1)
   -[no]glas   bool     no  Do glass simulation with special long range
 -[no]ionize   bool     no  Do a simulation including the effect of an X-Ray
                            bombardment on your system

Fatal error: Could not open RW12_s20_a.log
[0] MPI Abort by user Aborting program !
[0] Aborting program!
    p4_error: latest msg from perror: Permission denied
Notice that the node (node24.beowulf.cluster) tries to write to the file 
RW12_s20_a.log and fails. Normally it would write to the file 
RW12_s20_a#.log (where # is the node number).
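
To illustrate the naming I would expect, here is a small sketch. The 
function name per_node_log is hypothetical and only mimics the pattern 
described above; it is not taken from the GROMACS source:

```shell
# Build the per-node log name from the -g argument and a node number,
# following the RW12_s20_a#.log pattern described above.
# The function name is hypothetical, for illustration only.
per_node_log() {
    base="${1%.log}"                  # strip the .log extension
    printf '%s%s.log\n' "$base" "$2"  # insert the node number before it
}

# Example: per_node_log RW12_s20_a.log 3
```

With NNODES=1 reported above, every process seems to consider itself node 
0 and none of them uses a numbered name like this.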

Thanx in advance for any help!

   Arvid Söderhäll


Is there a way to use mpiexec instead of mpirun? I have some vague 
feeling that this may help...
