[gmx-users] gromacs, lam and condor

Hsin-Lin Chiang jiangsl at phys.sinica.edu.tw
Sun Apr 4 18:24:12 CEST 2010


Hi,

I tried to use 4 and 8 CPUs.

There are about 6000 atoms in my system.

The interconnect between our nodes is 1 Gb Ethernet, not optical fiber.

I'm sorry for my poor English; I didn't express my question well.

Every time I submit the parallel job, the nodes assigned to me are already at 100% load,

and the CPU share available to me is less than 10%.

I think there is something wrong with my submit script or executable script,

which I posted in my previous message.

How should I correct my script?
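
As a rough check (I am not sure this is the right way to look at it), I thought of adding a couple of lines to lamscript right after the lamboot command, so the job output shows which CPUs LAM actually booted and how busy those nodes already are:

lamnodes            # list the CPUs LAM booted
mpirun C uptime     # load average on each node that receives a process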

Hsin-Lin

> Hi, 
> 
> How many CPUs are you trying to use? How big is your system? What kind of 
> interconnect? Since you use Condor, it is probably a fairly slow one, and 
> then you can't expect it to scale to many CPUs. If you want to use many CPUs 
> for MD you need a faster interconnect. 
> 
> Roland 
> 
> 2010/4/2 Hsin-Lin Chiang <jiangsl at phys.sinica.edu.tw> 
> 
> > Hi, 
> > 
> > Does anyone here use GROMACS, LAM, and Condor together? 
> > I run GROMACS with LAM/MPI on a Condor system. 
> > Every time I submit a parallel job, 
> > I get nodes that are already occupied, and the CPU share available to each 
> > process is below 10%. 
> > How should I change the scripts? 
> > Below are one submit script and two executable scripts. 
> > 
> > condor_mpi: 
> > ---- 
> > #!/bin/bash 
> > Universe = parallel 
> > Executable = ./lamscript 
> > machine_count = 8
> > output = md_$(NODE).out 
> > error = md_$(NODE).err 
> > log = md.log 
> > arguments = /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md.sh 
> > +WantIOProxy = True 
> > should_transfer_files = yes 
> > when_to_transfer_output = on_exit 
> > Queue 
> > ------- 
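
(I submit this file with:

condor_submit condor_mpi

I am not sure whether the "#!/bin/bash" line at the top is needed at all; as far
as I understand, the submit description file is only read by condor_submit and
is never executed, so that line is probably just ignored.)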
> > 
> > lamscript: 
> > ------- 
> > #!/bin/sh 
> > 
> > _CONDOR_PROCNO=$_CONDOR_PROCNO 
> > _CONDOR_NPROCS=$_CONDOR_NPROCS 
> > _CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR 
> > 
> > SSHD_SH=`condor_config_val libexec` 
> > SSHD_SH=$SSHD_SH/sshd.sh 
> > 
> > CONDOR_SSH=`condor_config_val libexec` 
> > CONDOR_SSH=$CONDOR_SSH/condor_ssh 
> > 
> > # Set this to the bin directory of your lam installation 
> > # This also must be in your .cshrc file, so the remote side 
> > # can find it! 
> > export LAMDIR=/stathome/jiangsl/soft/lam-7.1.4 
> > export PATH=${LAMDIR}/bin:${PATH} 
> > export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:.:/opt/intel/compilers/lib 
> > 
> > 
> > . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS 
> > 
> > # If not the head node, just sleep forever, to let the 
> > # sshds run 
> > if [ $_CONDOR_PROCNO -ne 0 ] 
> > then 
> >         wait 
> >         sshd_cleanup 
> >         exit 0 
> > fi 
> > 
> > EXECUTABLE=$1 
> > shift 
> > 
> > # the binary is copied but the executable flag is cleared. 
> > # so the script has to take care of this 
> > chmod +x $EXECUTABLE 
> > 
> > # to allow multiple lam jobs running on a single machine, 
> > # we have to give it a somewhat unique value 
> > export LAM_MPI_SESSION_SUFFIX=$$ 
> > export LAMRSH=$CONDOR_SSH 
> > # when a job is killed by the user, this script will get sigterm 
> > # This script has to catch it and clean up the 
> > # lam environment 
> > finalize() 
> > { 
> > sshd_cleanup 
> > lamhalt 
> > exit 
> > } 
> > trap finalize TERM 
> > 
> > CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact 
> > export $CONDOR_CONTACT_FILE 
> > # The second field in the contact file is the machine name 
> > # that condor_ssh knows how to use. Note that this used to 
> > # say "sort -n +0 ...", but -n option is now deprecated. 
> > sort < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines 
> > 
> > # start the lam environment 
> > # For older versions of lam you may need to remove the -ssi boot rsh line 
> > lamboot -ssi boot rsh -ssi rsh_agent "$LAMRSH -x" machines 
> > 
> > if [ $? -ne 0 ] 
> > then 
> >         echo "lamscript error booting lam" 
> >         exit 1 
> > fi 
> > 
> > mpirun C -ssi rpi usysv -ssi coll_smp 1 $EXECUTABLE $@ & 
> > 
> > CHILD=$! 
> > TMP=130 
> > while [ $TMP -gt 128 ] ; do 
> >         wait $CHILD 
> >         TMP=$? 
> > done 
> > 
> > # clean up files 
> > sshd_cleanup 
> > /bin/rm -f machines 
> > 
> > # clean up lam 
> > lamhalt 
> > 
> > exit $TMP 
> > ---- 
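
(To rule out the LAM part, I think it can also be tested by hand outside Condor
with a hand-written machines file, roughly like this; node01 and node02 are only
example host names, plain ssh stands in for condor_ssh, and passwordless ssh
between the nodes is assumed:

echo node01 >  machines
echo node02 >> machines
lamboot -ssi boot rsh -ssi rsh_agent "ssh -x" machines
mpirun C hostname     # should print one line per CPU across both nodes
lamhalt

I have not confirmed that this matches what Condor sets up, though.)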
> > 
> > md.sh 
> > ---- 
> > #!/bin/sh 
> > #running GROMACS 
> > /stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d \ 
> > -s /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.tpr \ 
> > -e /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.edr \ 
> > -o /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.trr \ 
> > -g /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.log \ 
> > -c /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.gro 
> > ----- 
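
(As an aside, since all the files share the 200ns prefix, md.sh could probably
be shortened with -deffnm; this is only a sketch, assuming mdrun in 4.0.5
accepts -deffnm, which I believe it does.)

#!/bin/sh
# same run as above: cd into the md directory and let -deffnm
# pick 200ns.tpr / .trr / .edr / .log / .gro automatically
cd /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md || exit 1
exec /stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d -deffnm 200ns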
> > 
> > 
> > Hsin-Lin 
 