[gmx-users] Re: gmx-users Digest, Vol 72, Issue 13
Mark Abraham
Mark.Abraham at anu.edu.au
Sun Apr 4 02:42:02 CEST 2010
On 4/04/2010 3:13 AM, Hsin-Lin Chiang wrote:
> Hi,
>
> I tried to use 4 and 8 CPUs.
> There are about 6000 atoms in my system.
> The interconnect of our cluster is 1 Gb ethernet, not optical fiber.
Gigabit ethernet is too slow for good scaling of GROMACS beyond about 2 
or 4 CPUs.
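A quick way to check whether the network is the bottleneck is to compare 
ns/day between a serial run and a parallel run of the same .tpr, and work 
out the parallel efficiency. A minimal sketch (the ns/day figures below 
are hypothetical placeholders; read the real ones from the "Performance:" 
line at the end of each run's md.log):

```shell
#!/bin/sh
# Hypothetical ns/day values read from the "Performance:" line of md.log
NSDAY_1CPU=2.0    # serial run (placeholder value)
NSDAY_4CPU=4.8    # 4-CPU run  (placeholder value)
NPROCS=4

# Parallel efficiency = (speedup over serial / number of CPUs) * 100
EFF=$(awk -v s="$NSDAY_1CPU" -v p="$NSDAY_4CPU" -v n="$NPROCS" \
      'BEGIN { printf "%.0f", 100 * (p / s) / n }')
echo "parallel efficiency on $NPROCS CPUs: ${EFF}%"
```

An efficiency well below 100% (as in the placeholder numbers above) is 
consistent with a communication-bound run on a slow interconnect.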
> I'm sorry for my poor English; I couldn't express my question well.
> Every time I submit the parallel job, the nodes assigned to me are
> already at 100% load,
> and the CPU share available to me is less than 10%.
> I think there is something wrong with my submit script or executable
> script,
> so I posted them in my question before (please see below).
You're apparently using double-precision GROMACS. That makes everything,
including communication, much slower. Consider not doing that.
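For the 4.0.x installation used in md.sh below, the single-precision MPI 
binary conventionally has the same name without the _d suffix, so only 
the mdrun path needs to change. A sketch of the substitution (that a 
single-precision binary was actually built at this path is an assumption):

```shell
#!/bin/sh
# Double-precision binary currently called in md.sh
MDRUN_D=/stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d
# GROMACS names the single-precision build without the _d suffix
MDRUN=${MDRUN_D%_d}
echo "$MDRUN"
```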
Mark
> Hsin-Lin
> > Hi,
> >
> > how many CPUs do you try to use? How big is your system? What kind of
> > interconnect? Since you use Condor, it is probably some pretty slow
> > interconnect.
> > Then you can't expect it to work on many CPUs. If you want to use
> > many CPUs for MD you need a faster interconnect.
> >
> > Roland
> >
> > 2010/4/2 Hsin-Lin Chiang <jiangsl at phys.sinica.edu.tw>
> >
> > > Hi,
> > >
> > > Do someone use gromacs, lam, and condor together here?
> > > I use gromacs with lam/mpi on condor system.
> > > Every time I submit the parallel job,
> > > I get nodes which are already occupied, and the performance of each
> > > CPU is below 10%.
> > > How should I change the scripts?
> > > Below are one submit script and two executable scripts.
> > >
> > > condor_mpi:
> > > ----
> > > # Condor submit description file (not a shell script)
> > > Universe = parallel
> > > Executable = ./lamscript
> > > machine_count = 2
> > > output = md_$(NODE).out
> > > error = md_$(NODE).err
> > > log = md.log
> > > arguments = /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md.sh
> > > +WantIOProxy = True
> > > should_transfer_files = yes
> > > when_to_transfer_output = on_exit
> > > Queue
> > > -------
> > >
> > > lamscript:
> > > -------
> > > #!/bin/sh
> > >
> > > _CONDOR_PROCNO=$_CONDOR_PROCNO
> > > _CONDOR_NPROCS=$_CONDOR_NPROCS
> > > _CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR
> > >
> > > SSHD_SH=`condor_config_val libexec`
> > > SSHD_SH=$SSHD_SH/sshd.sh
> > >
> > > CONDOR_SSH=`condor_config_val libexec`
> > > CONDOR_SSH=$CONDOR_SSH/condor_ssh
> > >
> > > # Set this to the bin directory of your lam installation
> > > # This also must be in your .cshrc file, so the remote side
> > > # can find it!
> > > export LAMDIR=/stathome/jiangsl/soft/lam-7.1.4
> > > export PATH=${LAMDIR}/bin:${PATH}
> > > export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:.:/opt/intel/compilers/lib
> > >
> > >
> > > . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
> > >
> > > # If not the head node, just sleep forever, to let the
> > > # sshds run
> > > if [ $_CONDOR_PROCNO -ne 0 ]
> > > then
> > > wait
> > > sshd_cleanup
> > > exit 0
> > > fi
> > >
> > > EXECUTABLE=$1
> > > shift
> > >
> > > # the binary is copied but the executable flag is cleared,
> > > # so the script has to restore it
> > > chmod +x $EXECUTABLE
> > >
> > > # to allow multiple lam jobs running on a single machine,
> > > # we have to give each a reasonably unique session suffix
> > > export LAM_MPI_SESSION_SUFFIX=$$
> > > export LAMRSH=$CONDOR_SSH
> > > # when a job is killed by the user, this script will get SIGTERM.
> > > # The script has to catch it and clean up the
> > > # lam environment
> > > finalize()
> > > {
> > > sshd_cleanup
> > > lamhalt
> > > exit
> > > }
> > > trap finalize TERM
> > >
> > > CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
> > > export CONDOR_CONTACT_FILE
> > > # The second field in the contact file is the machine name
> > > # that condor_ssh knows how to use. Note that this used to
> > > # say "sort -n +0 ...", but -n option is now deprecated.
> > > sort < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
> > >
> > > # start the lam environment
> > > # For older versions of lam you may need to remove the "-ssi boot rsh" flags
> > > lamboot -ssi boot rsh -ssi rsh_agent "$LAMRSH -x" machines
> > >
> > > if [ $? -ne 0 ]
> > > then
> > > echo "lamscript error booting lam"
> > > exit 1
> > > fi
> > >
> > > mpirun C -ssi rpi usysv -ssi coll_smp 1 $EXECUTABLE $@ &
> > >
> > > CHILD=$!
> > > # wait returns an exit status above 128 when interrupted by a
> > > # trapped signal; loop until the child has really exited
> > > TMP=130
> > > while [ $TMP -gt 128 ] ; do
> > > wait $CHILD
> > > TMP=$?
> > > done
> > >
> > > # clean up files
> > > sshd_cleanup
> > > /bin/rm -f machines
> > >
> > > # clean up lam
> > > lamhalt
> > >
> > > exit $TMP
> > > ----
> > >
> > > md.sh
> > > ----
> > > #!/bin/sh
> > > #running GROMACS
> > > /stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d \
> > > -s /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.tpr \
> > > -e /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.edr \
> > > -o /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.trr \
> > > -g /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.log \
> > > -c /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.gro
> > > -----
> > >
> > >
> > > Hsin-Lin
>