[gmx-users] SGI Altix - mdrun_mpi
Nuno Loureiro Ferreira
nunolf at ci.uc.pt
Wed Nov 14 18:38:10 CET 2007
Hi guys
Just installed gmx on an SGI Altix system (64-way):
--> compiling mdrun with MPI support, single-precision
export CPPFLAGS=-I/home/nferreira/bin/fftw-3.0.1/include
export LDFLAGS=-L/home/nferreira/bin/fftw-3.0.1/lib
./configure --prefix=/home/nferreira/bin/gromacs-3.3.2 --enable-mpi
--program-suffix=_mpi
make mdrun
make install-mdrun
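For reference, a quick check I can run to confirm the MPI binary really picked up this LAM install (assuming ldd is available on the Altix and the build linked dynamically; paths are the ones from the configure above):
# Check which MPI libraries mdrun_mpi is linked against
ldd /home/nferreira/bin/gromacs-3.3.2/bin/mdrun_mpi | grep -i mpi
# I'd expect the libmpi / liblammpio from the LAM install here, not some other MPI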
--> compiling all gromacs package programs without MPI, single-precision
make distclean
./configure --prefix=/home/nferreira/bin/gromacs-3.3.2
make
make install
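Just to be sure the second (serial) install didn't clobber the MPI one, I can also check that both binaries ended up side by side in the install prefix:
ls /home/nferreira/bin/gromacs-3.3.2/bin/ | grep mdrun
# expecting both mdrun and mdrun_mpi to be listed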
It runs fine on single nodes, but I can't get gmx running in parallel.
The jobs are submitted directly (see a sample script below), without
any queuing system like PBS.
# This is my submission script
#######################
lamboot -v hostfile
lamnodes
# Testing mpi
mpirun -np 2 hello
# GMX run
grompp -np 2 -f equil -po equil_out -c after_em -p topology -o equil
mpirun -np 2 mdrun_mpi -np 2 -deffnm equil -c after_equil
lamhalt
# This is the hostfile
###############
# My boot schema
localhost cpu=64
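(If the localhost entry turns out to matter, I can also try the machine's real hostname in the hostfile; the error below reports the node as 127.0.0.1, so this is just a guess:)
# Alternative boot schema, using the hostname instead of localhost
behemoth cpu=64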
I tried several things (full paths for mpirun, mdrun_mpi, etc.), but I'm
always getting the same error. I also tested a hello program (from the
LAM-MPI user guide) and it gives no problems. Below is the output of
the submission script:
nferreira at behemoth:~/gmxbench> ./script
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
n-1<16608> ssi:boot:base:linear: booting n0 (localhost)
n-1<16608> ssi:boot:base:linear: finished
n0 localhost:64:origin,this_node
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
NNODES=2, MYRANK=1, HOSTNAME=behemoth
NNODES=2, MYRANK=0, HOSTNAME=behemoth
NODEID=0 argc=8
NODEID=1 argc=8
:-) G R O M A C S (-:
GRoups of Organic Molecules in ACtion for Science
:-) VERSION 3.3.2 (-:
[ ... snipped ...]
Getting Loaded...
Reading file equil.tpr, VERSION 3.3.2 (single precision)
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 16623 failed on node n0 (127.0.0.1) due to signal 11.
-----------------------------------------------------------------------------
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
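In case a backtrace would help, I can re-run with core dumps enabled and look at where the signal 11 happens (assuming gdb is installed on this machine; just a sketch):
ulimit -c unlimited
mpirun -np 2 mdrun_mpi -np 2 -deffnm equil -c after_equil
gdb /home/nferreira/bin/gromacs-3.3.2/bin/mdrun_mpi core
# (the core file may be named core.<pid>); then "bt" at the gdb prompt for the backtrace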
Any ideas? Searching the mailing list, it seems that this is a recurring
issue, but I was not able to find an answer.
Also, the machine admin is not proficient in MPI.
Cheers,
Nuno
P.S. Other programs are running fine on this machine using the same LAM-MPI.