[gmx-users] parallel problems...
Arvid Soderhall
arvid at fmp-berlin.de
Tue Feb 1 15:25:18 CET 2005
Hi all,
I have a problem running on more than 4 CPUs on our brand-new
cluster. I use a PBS script that launches the command
----------------------------------------
/usr/local/encap/mpich-126-gcc-64/bin/mpirun -machinefile machines -np 4
/swc/gromacs/bin/mdrun_mpi -np 4 -v -s RW12_s20_a.tpr -o RW12_s20_a.trr
-x RW12_s20_a.xtc -c RW12_s20_a.gro -e RW12_s20_a.edr -g RW12_s20_a.log
---------------------------------------------
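(For completeness, the surrounding script is only a thin wrapper; roughly
like this, with placeholder #PBS headers:)
----------------------------------------
#!/bin/sh
#PBS -l nodes=2:ppn=2    # placeholder resource request; the CPU count varies per run
#PBS -j oe               # join stdout and stderr

cd $PBS_O_WORKDIR        # run in the submission directory so mdrun can write its files
<the mpirun command above>
----------------------------------------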
This runs nicely, but if I use 6 CPUs (or 8) the calculation crashes and
an error is reported in the standard output file:
-----------------------------------------------------------------------------
rm_10214: p4_error: semget failed for setnum: 0
p0_7618: (0.113281) net_recv failed for fd = 7
p0_7618: p4_error: net_recv read, errno = : 104
p0_7618: (4.117188) net_send: could not write to fd=4, errno = 32
---------------------------------------------------------------------------------
Moreover, some scattered mdrun_mpi processes keep on running afterwards
(though never on the master node).
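Something like the following sweep cleans them up (a sketch: it assumes
passwordless ssh to the nodes in the machines file, and the ipcrm pass is
a guess based on the semget error above, which looks like leftover
System V semaphores):
------------------------------------
# kill stray mdrun_mpi processes and remove my stale semaphores on every node
for node in $(sort -u machines); do
    ssh "$node" "killall -9 mdrun_mpi 2>/dev/null;
                 ipcs -s | grep \$USER | awk '{print \$2}' | xargs -r -n1 ipcrm -s"
done
------------------------------------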
I have also tried using mpiexec instead of mpirun:
------------------------------------
/usr/local/bin/mpiexec -verbose -comm lam -kill -mpich-p4-no-shmem
$APPLICATION $RUNFLAGS
--------------------------------------------
Then I get a different error message in the error output file:
--------------------------------------------------------------------------------------
mpiexec: Warning: parse_args: argument "-mpich-p4-[no-]shmem" ignored since
communication library not MPICH/P4.
NNODES=1, MYRANK=0, HOSTNAME=node24.beowulf.cluster
:-) G R O M A C S (-:
Good gRace! Old Maple Actually Chews Slate
:-) VERSION 3.2.1 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2004, The GROMACS development team,
check out http://www.gromacs.org for more information.
:
: And so on until
:
-[no]compact bool yes Write a compact log file
-[no]multi bool no Do multiple simulations in parallel (only with
-np > 1)
-[no]glas bool no Do glass simulation with special long range
corrections
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system
Fatal error: Could not open RW12_s20_a.log
[0] MPI Abort by user Aborting program !
[0] Aborting program!
p4_error: latest msg from perror: Permission denied
--------------------------------------------------------------------------------------------------------
Notice that the node (node24.beowulf.cluster) tries to write to the file
RW12_s20_a.log and fails. The normal thing would be for it to write to the
file RW12_s20_a#.log (where # is the node number).
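Also notice the NNODES=1 line above: mdrun_mpi apparently comes up as a
single MPI process, so I suspect this mpiexec is not talking to the MPICH
library that mdrun_mpi was linked against. What I imagine the invocation
should look like instead is sketched below (the -comm value is my guess,
assuming this mpiexec was built with MPICH/P4 support, and mdrun's -np
should match the number of CPUs the job actually got; -mpich-p4-no-shmem
would then also take effect, which might avoid the System V semaphores
behind the earlier semget error):
--------------------------------------------
/usr/local/bin/mpiexec -verbose -comm mpich-p4 -kill -mpich-p4-no-shmem \
    /swc/gromacs/bin/mdrun_mpi -np 6 -v -s RW12_s20_a.tpr -o RW12_s20_a.trr \
    -x RW12_s20_a.xtc -c RW12_s20_a.gro -e RW12_s20_a.edr -g RW12_s20_a.log
--------------------------------------------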
Is there a way to get mpiexec working instead of mpirun? I have a vague
feeling that this may help...
Thanks in advance for any help!
Arvid Söderhäll