[gmx-users] Re: gmx-users Digest, Vol 72, Issue 13
Mark Abraham
Mark.Abraham at anu.edu.au
Sun Apr 4 02:42:02 CEST 2010
On 4/04/2010 3:13 AM, Hsin-Lin Chiang wrote:
> Hi,
>
> I tried to use 4 and 8 CPUs.
> There are about 6000 atoms in my system.
> The interconnect of our cluster is 1 Gb ethernet, not optical fiber.
Gigabit ethernet is too slow for good scaling of GROMACS beyond about 2 
or 4 CPUs.
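A quick way to check whether the network is the bottleneck is to compare 
ns/day between a serial run and a parallel run of the same .tpr, and work 
out the parallel efficiency. A minimal sketch (the ns/day figures below 
are hypothetical placeholders; read the real ones from the "Performance:" 
line at the end of each run's md.log):

```shell
#!/bin/sh
# Hypothetical ns/day values read from the "Performance:" line of md.log
NSDAY_1CPU=2.0    # serial run (placeholder value)
NSDAY_4CPU=4.8    # 4-CPU run  (placeholder value)
NPROCS=4

# Parallel efficiency = (speedup over serial / number of CPUs) * 100
EFF=$(awk -v s="$NSDAY_1CPU" -v p="$NSDAY_4CPU" -v n="$NPROCS" \
      'BEGIN { printf "%.0f", 100 * (p / s) / n }')
echo "parallel efficiency on $NPROCS CPUs: ${EFF}%"
```

An efficiency well below 100% (as in the placeholder numbers above) is 
consistent with a communication-bound run on a slow interconnect.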
> I'm sorry for my poor English; I couldn't express my question well.
> Every time I submit the parallel job, the nodes assigned to me are
> already at 100% load,
> and the CPU share available to me is less than 10%.
> I think there is something wrong with my submit script or executable
> script,
> so I posted them in my question before (please see below).
You're apparently using double-precision GROMACS. That makes everything,
including communication, much slower. Consider not doing that.
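For the 4.0.x installation used in md.sh below, the single-precision MPI 
binary conventionally has the same name without the _d suffix, so only 
the mdrun path needs to change. A sketch of the substitution (that a 
single-precision binary was actually built at this path is an assumption):

```shell
#!/bin/sh
# Double-precision binary currently called in md.sh
MDRUN_D=/stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d
# GROMACS names the single-precision build without the _d suffix
MDRUN=${MDRUN_D%_d}
echo "$MDRUN"
```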
Mark
> Hsin-Lin
> > Hi,
> >
> > how many CPUs do you try to use? How big is your system? What kind of
> > interconnect? Since you use Condor, it is probably some pretty slow
> > interconnect.
> > Then you can't expect it to work on many CPUs. If you want to use
> > many CPUs for MD you need a faster interconnect.
> >
> > Roland
> >
> > 2010/4/2 Hsin-Lin Chiang <jiangsl at phys.sinica.edu.tw>
> >
> > > Hi,
> > >
> > > Do someone use gromacs, lam, and condor together here?
> > > I use gromacs with lam/mpi on condor system.
> > > Every time I submit the parallel job,
> > > I get nodes which are already occupied, and the performance of each
> > > CPU is below 10%.
> > > How should I change the scripts?
> > > Below are one submit script and two executable scripts.
> > >
> > > condor_mpi:
> > > ----
> > > # Condor submit description file (not a shell script)
> > > Universe = parallel
> > > Executable = ./lamscript
> > > machine_count = 2
> > > output = md_$(NODE).out
> > > error = md_$(NODE).err
> > > log = md.log
> > > arguments = /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md.sh
> > > +WantIOProxy = True
> > > should_transfer_files = yes
> > > when_to_transfer_output = on_exit
> > > Queue
> > > -------
> > >
> > > lamscript:
> > > -------
> > > #!/bin/sh
> > >
> > > _CONDOR_PROCNO=$_CONDOR_PROCNO
> > > _CONDOR_NPROCS=$_CONDOR_NPROCS
> > > _CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR
> > >
> > > SSHD_SH=`condor_config_val libexec`
> > > SSHD_SH=$SSHD_SH/sshd.sh
> > >
> > > CONDOR_SSH=`condor_config_val libexec`
> > > CONDOR_SSH=$CONDOR_SSH/condor_ssh
> > >
> > > # Set this to the bin directory of your lam installation
> > > # This also must be in your .cshrc file, so the remote side
> > > # can find it!
> > > export LAMDIR=/stathome/jiangsl/soft/lam-7.1.4
> > > export PATH=${LAMDIR}/bin:${PATH}
> > > export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:.:/opt/intel/compilers/lib
> > >
> > >
> > > . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
> > >
> > > # If not the head node, just sleep forever, to let the
> > > # sshds run
> > > if [ $_CONDOR_PROCNO -ne 0 ]
> > > then
> > > wait
> > > sshd_cleanup
> > > exit 0
> > > fi
> > >
> > > EXECUTABLE=$1
> > > shift
> > >
> > > # the binary is copied but the executable flag is cleared,
> > > # so the script has to restore it
> > > chmod +x $EXECUTABLE
> > >
> > > # to allow multiple lam jobs running on a single machine,
> > > # we have to give each a reasonably unique session suffix
> > > export LAM_MPI_SESSION_SUFFIX=$$
> > > export LAMRSH=$CONDOR_SSH
> > > # when a job is killed by the user, this script will get SIGTERM.
> > > # The script has to catch it and clean up the
> > > # lam environment
> > > finalize()
> > > {
> > > sshd_cleanup
> > > lamhalt
> > > exit
> > > }
> > > trap finalize TERM
> > >
> > > CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
> > > export CONDOR_CONTACT_FILE
> > > # The second field in the contact file is the machine name
> > > # that condor_ssh knows how to use. Note that this used to
> > > # say "sort -n +0 ...", but -n option is now deprecated.
> > > sort < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
> > >
> > > # start the lam environment
> > > # For older versions of lam you may need to remove the "-ssi boot rsh" flags
> > > lamboot -ssi boot rsh -ssi rsh_agent "$LAMRSH -x" machines
> > >
> > > if [ $? -ne 0 ]
> > > then
> > > echo "lamscript error booting lam"
> > > exit 1
> > > fi
> > >
> > > mpirun C -ssi rpi usysv -ssi coll_smp 1 $EXECUTABLE $@ &
> > >
> > > CHILD=$!
> > > # wait returns an exit status above 128 when interrupted by a
> > > # trapped signal; loop until the child has really exited
> > > TMP=130
> > > while [ $TMP -gt 128 ] ; do
> > > wait $CHILD
> > > TMP=$?
> > > done
> > >
> > > # clean up files
> > > sshd_cleanup
> > > /bin/rm -f machines
> > >
> > > # clean up lam
> > > lamhalt
> > >
> > > exit $TMP
> > > ----
> > >
> > > md.sh
> > > ----
> > > #!/bin/sh
> > > #running GROMACS
> > > /stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d \
> > > -s /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.tpr \
> > > -e /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.edr \
> > > -o /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.trr \
> > > -g /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.log \
> > > -c /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.gro
> > > -----
> > >
> > >
> > > Hsin-Lin
>