[gmx-users] GROMACS parallel on multiple nodes - ERROR

Thu Feb 26 20:03:00 CET 2004

Hi all,

At the moment I am trying to install GROMACS on a Linux Beowulf cluster. I
configured the build with this command:

---

$ ./configure --enable-float --prefix=/ul/pspijker/exec/gromacs_mpi --enable-mpi --without-motif-libraries

---

Everything compiled fine, and running the code on one node gave no
problems. However, I suspect that it uses only one processor, since the
runtime for the same simulation was the same on 2 processors of one node
as on 1 processor of that node.

I am using a PBS script to start my job on the Linux Beowulf cluster. The
commands to start GROMACS are as follows (where $MDP, $TPR, $GRO, $TOP and
$NDX are the files and $NOD is the number of nodes):

---

#!/bin/csh

#PBS -l nodes=4:ppn=1
#PBS -N GROMACS
#PBS -q work
#PBS -o std.out
#PBS -e std.err
#PBS -m e

### Set variables
set NOD=4

### Script Commands
cd $PBS_O_WORKDIR

### Set Environments
setenv CONV_RSH ssh
setenv LD_LIBRARY_PATH "/usr/lib"

### Write info about nodes used
set n=`wc -l < $PBS_NODEFILE`
echo 'PBS_NODEFILE ' $PBS_NODEFILE ' has ' $n ' lines'
cat $PBS_NODEFILE
echo

### Run simulation
lamboot $PBS_NODEFILE
/ul/pspijker/exec/gromacs_mpi/i686-pc-linux-gnu/bin/grompp -f $MDP -c $GRO -p $TOP -o $TPR -np $NOD -deshuf $NDX -shuffle -sort
/ul/pspijker/exec/gromacs_mpi/i686-pc-linux-gnu/bin/mdrun -s $TPR -np $NOD

### Exit
echo
exit 0

---
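
A side note on $NOD: it is hard-coded to 4 in the script above and has to be
kept in step with the "nodes=4:ppn=1" request by hand. A minimal sketch of
deriving it from the node file instead, assuming POSIX sh rather than the csh
used in the job script; the function name count_hosts is made up for
illustration and is not part of the original script:

```shell
#!/bin/sh
# Hypothetical helper (not part of the original job script):
# print the number of distinct host names listed in a PBS-style node file.
count_hosts() {
    # sort -u drops repeated host names (PBS lists a host once per processor);
    # tr strips the padding some wc implementations put before the count.
    sort -u "$1" | wc -l | tr -d ' '
}
```

Inside a job this could be used as NOD=$(count_hosts "$PBS_NODEFILE"), so the
value passed to -np always follows the actual allocation.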

I cannot see anything obviously wrong in the script. When I run it, the
following is written to the error file std.err:

---

-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node
"node1-14.wag.caltech.edu",
but received some output on the standard error.

LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke "echo $SHELL" on the remote node.

Try invoking the following command at the unix command line:

        /usr/bin/ssh node1-14.wag.caltech.edu -n echo $SHELL

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates
that the LAM/MPI runtime environment is not operating.  The LAM/MPI
runtime environment is necessary for MPI programs to run (the MPI
program tired to invoke the "MPI_Init" function).

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------

---

I tried running the specified command by hand and it replied correctly with:
/bin/tcsh
Does this mean I have to change something in the environment? I cannot
understand why it works with one node but not with multiple nodes.
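
The LAM message complains about output on standard error, which is easy to
miss when running the command by hand, because stdout and stderr both end up
on the terminal. A minimal sketch for checking the two streams separately,
assuming POSIX sh (not the csh of the job script); the function name
stderr_is_clean is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: run a command and succeed only if it wrote
# no text to standard error. Standard output is discarded and the
# command's own exit status is ignored.
stderr_is_clean() {
    # 2>&1 sends stderr into the capture; >/dev/null then drops stdout
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}
```

For the node named in the error message this would be called, from the host
that runs lamboot, as:
stderr_is_clean /usr/bin/ssh node1-14.wag.caltech.edu -n 'echo $SHELL' && echo clean || echo "stderr output"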

If someone can help me, I would really appreciate it.

Kind regards,

Peter Spijker

---

Fulbright Fellow - The Netherland-America Foundation

California Institute of Technology
Biochemistry & Molecular Biophysics
Materials Process and Simulation Center
MC 139-74 Caltech
Pasadena, CA-91125
The United States of America

Phone: (626)-395-2844
E-mail: pspijker at wag.caltech.edu