[gmx-users] Why does the -append option exist?
Roland Schulz
roland at utk.edu
Thu Jun 9 08:55:14 CEST 2011
Yes, that helps a lot. One more question: which filesystem on Hopper 2 are you
using for this test (home, scratch, or proj)? I want to see whether it is Lustre
or GPFS. And are you running the test on a login node or on a compute node?
On Wed, Jun 8, 2011 at 1:17 PM, Dimitar Pachov <dpachov at brandeis.edu> wrote:
> Hello,
> On Wed, Jun 8, 2011 at 4:21 AM, Sander Pronk <pronk at cbr.su.se> wrote:
>> Hi Dimitar,
>> Thanks for the bug report. Would you mind trying the test program I
>> attached on the same file system that you get the truncated files on?
>> Compile it with: gcc testje.c -o testio
> Yes, I tried it, but it shows no problem:
> ====
> [dpachov at login-0-0 NEWTEST]$ ./testio
> TEST PASSED: ftell gives: 46
> ====
> As for the other questions:
> HPC OS version:
> ====
> [dpachov at login-0-0 NEWTEST]$ uname -a
> Linux login-0-0.local 2.6.18-194.17.1.el5xen #1 SMP Mon Sep 20 07:20:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> [dpachov at login-0-0 NEWTEST]$ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
> ====
> GROMACS 4.5.4 built:
> ====
> module purge
> module load INTEL/intel-12.0
> module load OPENMPI/1.4.3_INTEL_12.0
> module load FFTW/2.1.5-INTEL_12.0 # not needed
> #####
> # GROMACS settings
> export CC=mpicc
> export F77=mpif77
> export CXX=mpic++
> export FC=mpif90
> export F90=mpif90
> make distclean
> echo "XXXXXXX building single prec XXXXXX"
> ./configure \
> --prefix=/home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE \
> --enable-mpi \
> --enable-shared \
> --program-prefix="" --program-suffix="" \
> --enable-float --disable-fortran \
> --with-fft=mkl \
> --with-external-blas \
> --with-external-lapack \
> --with-gsl \
> --without-x \
> CFLAGS="-O3 -funroll-all-loops" \
> FFLAGS="-O3 -funroll-all-loops" \
> LDFLAGS="-L${MPI_LIB} -L${MKL_LIB} -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5"
> make -j 8 && make install
> ====
> I just ran the same test on Hopper 2:
> http://www.nersc.gov/users/computational-systems/hopper/
> with the GROMACS 4.5.3 build installed there (gromacs/4.5.3(default)), and
> the result was the same as reported earlier. If you have access, you could
> run the test there as well and see what you get.
> Hope that helps a bit.
> Thanks,
> Dimitar
>> Sander
>> On Jun 7, 2011, at 23:21 , Dimitar Pachov wrote:
>> Hello,
>> Just a quick update after a few short tests that my colleague and I ran.
>> First, the suggestion
>> "*You can emulate this yourself by calling "sleep 10s" before mdrun and
>> see if that's long enough to solve the latency issue in your case.*"
>> doesn't work, mainly because this doesn't seem to be a latency issue, but
>> also because the load on a node is not affected by "sleep".
>> However, you can reproduce the behavior I have observed pretty easily. It
>> seems to be related to the values of the pointers to the *xtc, *trr, *edr,
>> etc. files that are written at the end of the checkpoint file after abrupt
>> crashes AND to how often those files are accessed (opened). How to test:
>> 1. In your input *mdp file, set a high output frequency for, say, the *xtc
>> file (every 10 steps, for example) and a low frequency for the *trr file
>> (every 10,000 steps, for example).
>> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run)
>> 3. Abruptly kill the run shortly after that (say, after 10-100 steps).
>> 4. You should have a few frames written in the *xtc file and only one
>> (the first) in the *trr file. The *cpt file should have nonzero
>> "file_offset_low" values for all of these files (the pointers have been
>> updated).
>> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>> 6. Abruptly kill the run shortly after that (say, after 10-100 steps). Make
>> sure that the step at which the *trr would next be written has not been
>> reached.
>> 7. You should have a few additional frames written in the *xtc file, while
>> the *trr will still have only 1 frame (the first). The *cpt file now has
>> updated "file_offset_low" values for all files, BUT the pointer to the *trr
>> has acquired a value of 0. Obviously, we already know what will happen if we
>> restart again from this last *cpt file.
>> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>> 9. Kill it.
>> 10. File *trr has size zero.
>> Therefore, if a run is killed before the files are accessed for writing
>> (depending on the chosen frequency), the file offset values reported in the
>> *cpt file don't seem to be updated accordingly, and hence a new restart
>> inevitably leads to overwritten output files.
>> Do you think this is fixable?
>> Thanks,
>> Dimitar
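The failure sequence in steps 1-10 above can be emulated without GROMACS. The sketch below is only an illustration: a plain file stands in for run.trr, a shell variable stands in for the offset stored in the *cpt file, and it assumes (as the report implies) that an appending restart truncates each output file to the offset recorded in the checkpoint. The helper names restart_from_offset and file_size and the 100-byte "frame" are made up for the example, and the truncate command from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Emulation only: no GROMACS code is involved. A plain file stands in for
# run.trr and a variable stands in for the offset stored in run.cpt.
workdir=$(mktemp -d)
trr="$workdir/run.trr"

# Hypothetical helper: cut an output file back to the offset recorded in the
# checkpoint, which is what an appending restart is assumed to do here.
restart_from_offset() {
    # $1 = output file, $2 = recorded offset in bytes
    truncate -s "$2" "$1"
}

# Hypothetical helper: print the size of file $1 in bytes, without padding.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Run 1: one frame (100 bytes here) is written; the checkpoint records 100.
head -c 100 /dev/zero > "$trr"
recorded_offset=100

# Run 2: restart, then kill before the next *trr write. The file survives
# the restart, but the new checkpoint is reported to record offset 0.
restart_from_offset "$trr" "$recorded_offset"
size_after_first_restart=$(file_size "$trr")
recorded_offset=0

# Run 3: restarting from that checkpoint empties the file.
restart_from_offset "$trr" "$recorded_offset"
size_after_second_restart=$(file_size "$trr")

echo "size after first restart:  $size_after_first_restart bytes"
echo "size after second restart: $size_after_second_restart bytes"
rm -rf "$workdir"
```

In this emulation, keeping the last valid offset (100) when no new *trr frame was written would leave the file intact on the second restart, which is the behavior one would expect from a fix.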
>> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <roland at utk.edu> wrote:
>>> Two comments about the discussion:
>>> 1) I agree that buffered output (kernel buffers, not application
>>> buffers) should not affect I/O. If it does, it should be filed as a bug
>>> against the OS. Maybe someone can write a short test application that tries
>>> to reproduce this: write to a file from one node, kill that test program,
>>> and immediately afterwards write to the same file from some other node.
>>> 2) We lock files, but only the log file. The idea is that we only need
>>> to guarantee that the set of files is accessed by one application at a
>>> time. This seems safe, but if someone sees a way in which the trajectory
>>> can be opened without the log file being opened, please file a bug.
>>> Roland
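A minimal sketch of the test suggested in point 1, written as a shell script. It is a single-machine stand-in, not the real cross-node test: one background process appends to a file, is killed with SIGKILL, and a second writer appends immediately afterwards. To exercise a shared filesystem, the two writers would have to run on different nodes against the same path. The file name and line contents are arbitrary choices for the example.

```shell
#!/bin/sh
# Single-machine stand-in for the proposed test; the file name is arbitrary.
workdir=$(mktemp -d)
out="$workdir/shared.dat"

# First writer: appends one line at a time until it is killed.
( while :; do echo "first writer" >> "$out"; done ) &
writer_pid=$!

# Give the first writer time to produce some output, then kill it abruptly.
sleep 1
kill -9 "$writer_pid"
wait "$writer_pid" 2>/dev/null

# Second writer: appends right after the kill.
echo "second writer" >> "$out"

# Count the lines left behind by the first writer and read the final line.
first_count=$(grep -c "first writer" "$out")
last_line=$(tail -n 1 "$out")

echo "lines from first writer: $first_count"
echo "last line in file:       $last_line"
rm -rf "$workdir"
```

On a healthy filesystem the second writer's line should come last, after the intact lines of the first writer. A truncated or reordered file in the cross-node version would point to the OS-level problem described in point 1.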
>>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>>>> On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>> Dear Dimitar,
>>>> I'm following the debate regarding:
>>>> The point was not "why" I was getting the restarts, but the fact
>>>> itself that I was getting restarts close in time, as I stated in my first
>>>> post. I actually also don't know whether jobs are deleted or suspended. I've
>>>> thought that a job returned back to the queue will basically start from the
>>>> beginning when later moved to an empty slot ... so don't understand the
>>>> difference from that perspective.
>>>> In the second mail you say:
>>>> Submitted by:
>>>> ========================
>>>> ii=1
>>>> ifmpi="mpirun -np $NSLOTS"
>>>> --------
>>>> if [ ! -f run${ii}-i.tpr ];then
>>>> cp run${ii}.tpr run${ii}-i.tpr
>>>> tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>>> fi
>>>> k=`ls md-${ii}*.out | wc -l`
>>>> outfile="md-${ii}-$k.out"
>>>> if [[ -f run${ii}.cpt ]]; then
>>>> $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v -deffnm run${ii} -npme 0 > $outfile 2>&1
>>>> fi
>>>> =========================
>>>> If I understand correctly, you are submitting the SERIAL mdrun. This
>>>> means that multiple instances of mdrun are running at the same time.
>>>> Each instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>>>> files, one for each instance (i.e. one for each CPU), are written at the
>>>> same time.
>>>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>>>> indicate the presence of multiple execution threads. Dynamic load balancing
>>>> gets turned on, and the DD is 4x2x1 for his 8 processors. Conventionally,
>>>> and by default in the installation process, the MPI-enabled binaries get an
>>>> "_mpi" suffix, but it isn't enforced - or enforceable :-)
>>>> Mark
>>>> --
>>>> gmx-users mailing list gmx-users at gromacs.org
>>>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>>>> Please search the archive at
>>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>>> Please don't post (un)subscribe requests to the list. Use the
>>>> www interface or send it to gmx-users-request at gromacs.org.
>>>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>> --
>>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>>> 865-241-1537, ORNL PO BOX 2008 MS6309
>> --
>> =====================================================
>> *Dimitar V Pachov*
>> PhD Physics
>> Postdoctoral Fellow
>> HHMI & Biochemistry Department Phone: (781) 736-2326
>> Brandeis University, MS 057 Email: dpachov at brandeis.edu
>> =====================================================
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309