[gmx-users] Why does the -append option exist?

Dimitar Pachov dpachov at brandeis.edu
Thu Jun 9 14:04:19 CEST 2011


Hi,

On Thu, Jun 9, 2011 at 2:55 AM, Roland Schulz <roland at utk.edu> wrote:

> Hi,
>
> yes that helps a lot. One more question. What filesystem on hopper 2 are
> you using for this test (home, scratch or proj, to see if it is Lustre or
> GPFS) ?
>

I used home.


> And are you running the test on the login node or on the compute node?
>

I did the test on the debug queue, so it was a compute node.

Let me know if you need more info.
Best,
Dimitar


>
> Thanks
> Roland
>
>
> On Wed, Jun 8, 2011 at 1:17 PM, Dimitar Pachov <dpachov at brandeis.edu> wrote:
>
>> Hello,
>>
>> On Wed, Jun 8, 2011 at 4:21 AM, Sander Pronk <pronk at cbr.su.se> wrote:
>>
>>> Hi Dimitar,
>>>
>>> Thanks for the bug report. Would you mind trying the test program I
>>> attached on the same file system that you get the truncated files on?
>>>
>>> compile it with gcc testje.c -o testio
>>>
>>
>> Yes, I tried it, and there was no problem:
>>
>> ====
>> [dpachov at login-0-0 NEWTEST]$ ./testio
>> TEST PASSED: ftell gives: 46
>> ====
>>
>> As for the other questions:
>>
>> HPC OS version:
>> ====
>> [dpachov at login-0-0 NEWTEST]$ uname -a
>> Linux login-0-0.local 2.6.18-194.17.1.el5xen #1 SMP Mon Sep 20 07:20:39
>> EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>> [dpachov at login-0-0 NEWTEST]$ cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
>> ====
>>
>> GROMACS 4.5.4 built:
>> ====
>> module purge
>> module load INTEL/intel-12.0
>> module load OPENMPI/1.4.3_INTEL_12.0
>> module load FFTW/2.1.5-INTEL_12.0 # not needed
>>
>> #####
>> # GROMACS settings
>>
>> export CC=mpicc
>> export F77=mpif77
>> export CXX=mpic++
>> export FC=mpif90
>> export F90=mpif90
>>
>> make distclean
>>
>> echo "XXXXXXX building single prec XXXXXX"
>>
>> ./configure \
>> --prefix=/home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE \
>> --enable-mpi \
>>  --enable-shared \
>> --program-prefix="" --program-suffix="" \
>> --enable-float --disable-fortran \
>> --with-fft=mkl \
>> --with-external-blas \
>> --with-external-lapack \
>> --with-gsl \
>> --without-x \
>> CFLAGS="-O3 -funroll-all-loops" \
>> FFLAGS="-O3 -funroll-all-loops" \
>> CPPFLAGS="-I${MPI_INCLUDE} -I${MKL_INCLUDE} " \
>> LDFLAGS="-L${MPI_LIB} -L${MKL_LIB} -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 "
>>
>> make -j 8 && make install
>> ====
>>
>> Just did the same test on Hopper 2:
>> http://www.nersc.gov/users/computational-systems/hopper/
>>
>> with the GROMACS 4.5.3 installed there (module gromacs/4.5.3(default)), and
>> the result was the same as reported earlier. You could run the test there as
>> well, if you have access, and see what you get.
>>
>> Hope that helps a bit.
>>
>> Thanks,
>> Dimitar
>>
>>
>>
>>
>>
>>>
>>> Sander
>>>
>>>
>>>
>>>
>>>
>>> On Jun 7, 2011, at 23:21 , Dimitar Pachov wrote:
>>>
>>> Hello,
>>>
>>> Just a quick update after a few short tests my colleague and I ran. First,
>>> using
>>>
>>> "*You can emulate this yourself by calling "sleep 10s" before mdrun and
>>> see if that's long enough to solve the latency issue in your case.*"
>>>
>>> doesn't work for a few reasons, mainly because it doesn't seem to be a
>>> latency issue, but also because the load on a node is not affected by
>>> "sleep".
>>>
>>> However, you can reproduce the behavior I have observed fairly easily. It
>>> seems to be related to the file offsets of the *.xtc, *.trr, *.edr, etc.
>>> files that are written at the end of the checkpoint file after an abrupt
>>> crash, AND to how frequently those files are opened for writing. How to
>>> test (a scripted sketch of these steps follows after step 10):
>>>
>>> 1. In your input *.mdp file, set a high output frequency for coordinates
>>> in the *.xtc file (e.g. nstxtcout = 10) and a low one for the *.trr file
>>> (e.g. nstxout = 10000).
>>> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run)
>>> 3. Abruptly kill the run shortly after that (say, after 10-100 steps).
>>> 4. You should have a few frames written in the *.xtc file, and only one
>>> (the first) in the *.trr file. The *.cpt file should have non-zero
>>> "file_offset_low" values for all of these files (i.e. the offsets have
>>> been updated).
>>>
>>> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>>> 6. Abruptly kill the run shortly after that (say, after 10-100 steps).
>>> Note that the write interval for the *.trr file has not been reached yet.
>>> 7. You should have a few additional frames written in the *.xtc file,
>>> while the *.trr still has only 1 frame (the first). The *.cpt file has
>>> again updated all "file_offset_low" values, BUT the offset for the *.trr
>>> has now acquired a value of 0. Obviously, we already know what will happen
>>> if we restart again from this last *.cpt file.
>>>
>>> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>>> 9. Abruptly kill it again.
>>> 10. The *.trr file now has size zero.
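>>>
>>> For convenience, here is a minimal shell sketch of the procedure above. It
>>> assumes a prepared run.tpr in the current directory and the 4.5.x tool
>>> names (mdrun, gmxcheck); the kill timing is arbitrary, and the
>>> gmxcheck/grep line is only a rough way of counting frames:
>>>
>>> ====
>>> #!/bin/bash
>>> # Step 1 is done in the .mdp before grompp, e.g.:
>>> #   nstxtcout = 10      ; write *.xtc frames often
>>> #   nstxout   = 10000   ; write *.trr frames rarely
>>>
>>> for attempt in 1 2 3; do
>>>     # Steps 2/5/8: (re)start, continuing from run.cpt if it exists
>>>     mdrun -s run.tpr -v -cpi -deffnm run &
>>>     pid=$!
>>>
>>>     # Steps 3/6/9: kill the run abruptly after a short while
>>>     sleep 30
>>>     kill -9 $pid
>>>     wait $pid 2>/dev/null
>>>
>>>     # Steps 4/7/10: inspect what survived
>>>     ls -l run.xtc run.trr run.cpt
>>>     gmxcheck -f run.xtc 2>&1 | grep -i frame
>>>     gmxcheck -f run.trr 2>&1 | grep -i frame
>>> done
>>> ====
>>>
>>> On the third pass the *.trr file should come out with size zero, as
>>> described in step 10.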
>>>
>>>
>>> Therefore, if a run is killed before a file is due to be opened for
>>> writing (depending on the chosen output frequency), the file offset values
>>> recorded in the *.cpt file don't seem to be updated accordingly, and hence
>>> a new restart inevitably leads to overwritten output files.
>>>
>>> Do you think this is fixable?
>>>
>>> Thanks,
>>> Dimitar
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <roland at utk.edu> wrote:
>>>
>>>> Two comments about the discussion:
>>>>
>>>> 1) I agree that buffered output (kernel buffers - not application
>>>> buffers) should not affect I/O. If it does, it should be filed as a bug
>>>> against the OS. Maybe someone can write a short test application which
>>>> tries to reproduce this: write to a file from one node and, immediately
>>>> after the test program is killed on that node, write to the same file
>>>> from some other node.
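>>>>
>>>> A rough sketch of such a test, assuming both nodes mount the same
>>>> filesystem; the script name (iotest.sh) and the shared path are
>>>> hypothetical. Run "writer" on node A, kill it with kill -9, then
>>>> immediately run "reader" on node B:
>>>>
>>>> ====
>>>> #!/bin/bash
>>>> # iotest.sh -- crude check of whether data appended from one node
>>>> # (possibly still sitting in that node's kernel buffers when the writer
>>>> # is killed) is visible from a second node right afterwards.
>>>> f=/path/to/shared/iotest.dat
>>>>
>>>> case "$1" in
>>>>   writer)
>>>>     : > "$f"                              # start with an empty file
>>>>     i=0
>>>>     while true; do
>>>>       i=$((i+1))
>>>>       printf 'line %08d\n' "$i" >> "$f"   # 14 bytes per record
>>>>       sleep 0.1
>>>>     done
>>>>     ;;
>>>>   reader)
>>>>     stat -c 'size seen on this node: %s bytes' "$f"
>>>>     tail -n 1 "$f"                        # last record visible here
>>>>     ;;
>>>> esac
>>>> ====
>>>>
>>>> If the size or last record seen on node B lags far behind what node A had
>>>> already written when it was killed, that would point at the OS/filesystem
>>>> rather than at mdrun.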
>>>>
>>>> 2) We do lock files, but only the log file. The idea is that we only need
>>>> to guarantee that the set of output files is accessed by one application
>>>> at a time. This seems safe, but if someone sees a way in which the
>>>> trajectory could be opened without the log file being opened, please file
>>>> a bug.
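>>>>
>>>> As a shell-level illustration of that guard idea (this is not mdrun's
>>>> internal mechanism; mdrun places its lock on the log file itself), one
>>>> could refuse to start a second job while another still holds a lock. The
>>>> file names and $NSLOTS follow the submission script quoted further below:
>>>>
>>>> ====
>>>> # flock(1) is from util-linux; -n makes it fail at once if the lock
>>>> # on run1.log is already held by another process.
>>>> if ! flock -n run1.log -c \
>>>>      "mpirun -np $NSLOTS mdrun -s run1.tpr -cpi run1.cpt -deffnm run1"
>>>> then
>>>>     echo "run1.log is locked by another job (or mdrun itself failed)"
>>>> fi
>>>> ====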
>>>>
>>>> Roland
>>>>
>>>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>>>>
>>>>>  On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>>>
>>>>> Dear Dimitar,
>>>>> I'm following the debate regarding:
>>>>>
>>>>>
>>>>>    The point was not "why" I was getting the restarts, but the fact
>>>>> itself that I was getting restarts close together in time, as I stated in
>>>>> my first post. I actually also don't know whether jobs are deleted or
>>>>> suspended. I had thought that a job returned to the queue would basically
>>>>> start from the beginning when later moved to an empty slot ... so I don't
>>>>> understand the difference from that perspective.
>>>>>
>>>>>
>>>>> In the second mail you say:
>>>>>
>>>>>  Submitted by:
>>>>> ========================
>>>>> ii=1
>>>>> ifmpi="mpirun -np $NSLOTS"
>>>>> --------
>>>>>    if [ ! -f run${ii}-i.tpr ];then
>>>>>        cp run${ii}.tpr run${ii}-i.tpr
>>>>>       tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>>>>    fi
>>>>>
>>>>>     k=`ls md-${ii}*.out | wc -l`
>>>>>    outfile="md-${ii}-$k.out"
>>>>>    if [[ -f run${ii}.cpt ]]; then
>>>>>
>>>>>       $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v \
>>>>>          -deffnm run${ii} -npme 0 > $outfile  2>&1
>>>>>
>>>>>     fi
>>>>>  =========================
>>>>>
>>>>>
>>>>> If I understand correctly, you are submitting the SERIAL mdrun. This
>>>>> means that multiple instances of mdrun are running at the same time.
>>>>> Each instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>>>>> files, one for each instance (i.e. one for each CPU), are written at the
>>>>> same time.
>>>>>
>>>>>
>>>>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>>>>> indicate the presence of multiple execution threads. Dynamic load balancing
>>>>> gets turned on, and the DD is 4x2x1 for his 8 processors. Conventionally,
>>>>> and by default in the installation process, the MPI-enabled binaries get an
>>>>> "_mpi" suffix, but it isn't enforced - or enforceable :-)
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>>>> 865-241-1537, ORNL PO BOX 2008 MS6309
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
>
>



-- 
=====================================================
Dimitar V Pachov

PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department        Phone: (781) 736-2326
Brandeis University, MS 057                Email: dpachov at brandeis.edu
=====================================================