[gmx-users] replica exchange: >4 processors
Paul Whitford
pwhitfor at ctbp.ucsd.edu
Fri Dec 7 06:18:09 CET 2007
I am using GROMACS 3.3.2 and 3.3.1, and I get the following problem with
both of them.
If I run replica exchange on more than 4 processors (2 and 4 are fine), the
simulations finish, but MPI gives the following errors and the job never
terminates.
This is the end of my log file:
-----------------------------------------------------------------------
               NODE (s)   Real (s)      (%)
       Time: 158483.430 159636.000     99.3
                       1d20h01:23
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:     18.919    818.029      2.726      8.805
p13_15442: p4_error: Timeout in establishing connection to remote process: 0
p12_15407: p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
p11_2364: p4_error: Timeout in establishing connection to remote process: 0
p9_20588: p4_error: Timeout in establishing connection to remote process: 0
p10_2329: p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
Broken pipe
Broken pipe
Broken pipe
p6_24137: p4_error: Timeout in establishing connection to remote process: 0
p7_24172: p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
Broken pipe
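A small sketch for summarizing output like the above, so it is easier to see which processes reported the timeout. It assumes the number right after the leading "p" identifies the process (as the messages suggest) and that the job output is piped in on stdin; the function name failed_ranks is made up for this example.

```shell
#!/bin/bash
# Print the process numbers that reported a p4_error, one per line, sorted.
# Input lines are expected to look like:
#   p13_15442: p4_error: Timeout in establishing connection ...
failed_ranks() {
    grep 'p4_error' \
        | sed -n 's/^p\([0-9][0-9]*\)_[0-9][0-9]*: p4_error.*/\1/p' \
        | sort -n
}

# A few lines copied from the output above, used as sample input.
sample='p13_15442: p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
p6_24137: p4_error: Timeout in establishing connection to remote process: 0
p9_20588: p4_error: Timeout in establishing connection to remote process: 0'

printf '%s\n' "$sample" | failed_ranks
```

On a real run the same function could be fed the scheduler's combined output file instead of the sample string.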
I have tried installing on three different clusters, using different
versions of mpich, and they all do this. However, I do not get the error
when I run a single simulation on 8 processors; I only get this problem
when I run replica exchange. Any ideas what is going on? I'm also including
my submission script. Perhaps I am missing something, but I'm just not
seeing it.
#!/bin/bash
#
#$ -N switch_less
#$ -pe mpich 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
#$ -l h_rt=00:05:00
MPIDIR=/opt/mpich/intel/bin/
MDDIR=/soft/linux/pkg/gromacs-3.3.1/bin
SYSTEM=free
INDEX=0
for T in 80 82 84 86 87 88 89 90
do
    sed "s/TTTT/$T/g" MDRUN > mdrun.$INDEX.mdp
    $MDDIR/grompp \
        -f mdrun.$INDEX \
        -c $SYSTEM.gro \
        -p $SYSTEM.top \
        -po mdout.$INDEX \
        -o $SYSTEM$INDEX.tpr
    let "INDEX += 1"
done
if test $NSLOTS -eq $INDEX
then
    $MPIDIR/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines \
        -nolocal $MDDIR/mdrun-mpi -v \
        -np $NSLOTS \
        -multi $NSLOTS \
        -replex 50 \
        -s $SYSTEM.tpr \
        -o $SYSTEM \
        -c $SYSTEM.out \
        -g $SYSTEM \
        -e $SYSTEM \
        -x $SYSTEM
else
    echo 'wrong number of nodes for the number of replicas'
fi
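The template step in the script above can be checked on its own, without running grompp. This is only a sketch: the template contents, the temporary directory, and the function name make_mdp_files are made up for illustration, and the only thing it verifies is that every TTTT placeholder gets replaced and that the file count matches the number of temperatures.

```shell
#!/bin/bash
# Stand-in template; the real MDRUN file would hold the full parameter set.
workdir=$(mktemp -d)
printf 'ref_t = TTTT\ngen_temp = TTTT\n' > "$workdir/MDRUN"

# Write one mdrun.N.mdp per temperature and print how many were written.
# Returns non-zero if a TTTT placeholder survives the substitution.
make_mdp_files() {
    local dir=$1; shift
    local index=0 t
    for t in "$@"; do
        sed "s/TTTT/$t/g" "$dir/MDRUN" > "$dir/mdrun.$index.mdp"
        if grep -q 'TTTT' "$dir/mdrun.$index.mdp"; then
            echo "placeholder left in mdrun.$index.mdp" >&2
            return 1
        fi
        index=$((index + 1))
    done
    echo "$index"
}

count=$(make_mdp_files "$workdir" 80 82 84 86 87 88 89 90)
echo "generated $count files"
```

The printed count is the number that has to equal $NSLOTS for the mpirun branch of the script to be taken.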
I have tried the -debug option when running GROMACS, but I can't tell what
is going on from its output. Is there something I should look for in the
debug log file?
Thanks,
-Paul