[gmx-users] Build time/Build user mismatch, fatal error truncation of file *.xtc failed

Mark Abraham mark.j.abraham at gmail.com
Thu Jun 23 10:42:00 CEST 2016


Hi,

The only explanation is that the file is not in fact properly accessible when
rank 0 is placed on any node other than "compute-node," which means your
organization of the file system / Slurm / etc. isn't good enough for what
you're doing.
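
A quick way to test that (just a rough sketch, assuming Slurm and that the
job's working directory is the run directory) is to have every allocated node
report whether it can see and write the file before mdrun starts:

  srun --ntasks-per-node=1 bash -c 'hostname; ls -l md_gmx.xtc; touch md_gmx.xtc && echo writable'

If any node reports the file missing or not writable, appending (and hence the
truncation it needs) cannot work from that node. Running mdrun with -noappend,
which writes new part-numbered output files instead of truncating and
appending, would sidestep that step, but it does not fix the underlying
file-system setup.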

Mark

On Thu, Jun 23, 2016 at 10:15 AM Husen R <hus3nr at gmail.com> wrote:

> Hi,
>
> I am still unable to find the cause of the fatal error.
> Previously, GROMACS was installed separately on every node; that is why the
> Build time mismatch and Build user mismatch warnings appeared.
> The Build time mismatch and Build user mismatch issues are now solved by
> installing GROMACS in a shared directory.
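> I assume that sourcing the shared install's GMXRC in the sbatch script, along
> the lines of the sketch below, is enough for every node to resolve the same
> binary (the path just matches where my shared install lives):
>
>   source /mirror/source/gromacs/bin/GMXRC
>   mpirun gmx_mpi mdrun -cpi md_gmx.cpt -deffnm md_gmx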
>
> I have also tried installing GROMACS on only one node (not in a shared
> directory), but the error still appeared.
>
>
> This is the error message I get if I exclude compute-node
> ("--exclude=compute-node") from the node list in the Slurm sbatch script;
> excluding other nodes works fine.
>
>
>
> =========================================================================================
> GROMACS:      gmx mdrun, VERSION 5.1.2
> Executable:   /mirror/source/gromacs/bin/gmx_mpi
> Data prefix:  /mirror/source/gromacs
> Command line:
>   gmx_mpi mdrun -cpi md_gmx.cpt -deffnm md_gmx
>
>
> Running on 2 nodes with total 8 cores, 16 logical cores
>   Cores per node:            4
>   Logical cores per node:    8
> Hardware detected on host head-node (the node of MPI rank 0):
>   CPU info:
>     Vendor: GenuineIntel
>     Brand:  Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
>     SIMD instructions most likely to fit this hardware: AVX_256
>     SIMD instructions selected at GROMACS compile time: AVX_256
>
> Reading file md_gmx.tpr, VERSION 5.1.2 (single precision)
> Changing nstlist from 10 to 20, rlist from 1 to 1.03
>
>
> Reading checkpoint file md_gmx.cpt generated: Thu Jun 23 12:54:02 2016
>
>
>   #ranks mismatch,
>     current program: 16
>     checkpoint file: 24
>
>   #PME-ranks mismatch,
>     current program: -1
>     checkpoint file: 6
>
> GROMACS patchlevel, binary or parallel settings differ from previous run.
> Continuation is exact, but not guaranteed to be binary identical.
>
>
> -------------------------------------------------------
> Program gmx mdrun, VERSION 5.1.2
> Source code file:
> /home/necis/gromacsinstall/gromacs-5.1.2/src/gromacs/gmxlib/checkpoint.cpp,
> line: 2216
>
> Fatal error:
> Truncation of file md_gmx.xtc failed. Cannot do appending because of this
> failure.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
>
> ============================================================================================================
>
> On Thu, Jun 16, 2016 at 6:23 PM, Mark Abraham <mark.j.abraham at gmail.com>
> wrote:
>
> > Hi,
> >
> > On Thu, Jun 16, 2016 at 12:24 PM Husen R <hus3nr at gmail.com> wrote:
> >
> > > On Thu, Jun 16, 2016 at 4:01 PM, Mark Abraham <mark.j.abraham at gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > There's just nothing special about any node at run time.
> > > >
> > > > Your script looks like it is building GROMACS fresh each time - there's
> > > > no need to do that,
> > >
> > >
> > > Which part of my script?
> > >
> >
> > I can't tell how your script is finding its GROMACS installations, but the
> > advisory message says precisely that your runs are finding different
> > installations...
> >
> >   Build time mismatch,
> >     current program: Sel Apr  5 13:37:32 WIB 2016
> >     checkpoint file: Rab Apr  6 09:44:51 WIB 2016
> >
> >   Build user mismatch,
> >     current program: pro at head-node [CMAKE]
> >     checkpoint file: pro at compute-node [CMAKE]
> >
> > This reinforces my impression that the view of your file system available
> > at the start of the job script is varying with your choice of node subsets.
> >
> >
> > > I always use this command to restart from the checkpoint file:
> > > "mpirun gmx_mpi mdrun -cpi [name].cpt -deffnm [name]".
> > > As far as I know, the -cpi option is used to supply the checkpoint file as
> > > the input file.
> > > What do I have to change in my script?
> > >
> >
> > Nothing about that aspect. But clearly your first run and the restart
> > simulating loss of a node are finding different gmx_mpi binaries from their
> > respective environments. This is not itself a problem, but it's probably
> > not what you intend, and may be symptomatic of the same issue that leads to
> > md_test.xtc not being accessible.
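> >
> > A simple way to see that (a rough sketch, assuming Slurm) is to print which
> > binary each allocated node resolves before the run:
> >
> >   srun --ntasks-per-node=1 bash -c 'hostname; which gmx_mpi'
> >
> > If the paths differ between nodes, the job is not using the installation
> > you think it is.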
> >
> > Mark
> >
> >
> > >
> > > > but the fact that the node name is showing up in the check
> > > > that takes place when the checkpoint is read is not relevant to the
> > > > problem.
> > > >
> > > > Mark
> > > >
> > > > On Thu, Jun 16, 2016 at 9:46 AM Husen R <hus3nr at gmail.com> wrote:
> > > >
> > > > > On Thu, Jun 16, 2016 at 2:32 PM, Mark Abraham <mark.j.abraham at gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, Jun 16, 2016 at 9:30 AM Husen R <hus3nr at gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Thank you for your reply !
> > > > > > >
> > > > > > > md_test.xtc exists and is writable.
> > > > > > >
> > > > > >
> > > > > > OK, but it needs to be seen that way from the set of compute nodes
> > > > > > you are using, and organizing that is up to you and your job
> > > > > > scheduler, etc.
> > > > > >
> > > > > >
> > > > > > > I tried to restart from the checkpoint file while excluding a node
> > > > > > > other than compute-node, and it works.
> > > > > > >
> > > > > >
> > > > > > Go do that, then :-)
> > > > > >
> > > > >
> > > > > I'm building a simple system that can respond to node failure. If a
> > > > > failure occurs on node A, then the application has to be restarted and
> > > > > that node has to be excluded.
> > > > > This should apply to all nodes, including this 'compute-node'.
> > > > >
> > > > > >
> > > > > >
> > > > > > > Only '--exclude=compute-node' produces this error.
> > > > > > >
> > > > > >
> > > > > > Then there's something about that node that is special with respect
> > > > > > to the file system - there's nothing about any particular node that
> > > > > > GROMACS cares about.
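> > > > > >
> > > > > > One way to see what is special about it (a rough sketch, assuming
> > > > > > Slurm) is to compare how each allocated node mounts the run
> > > > > > directory:
> > > > > >
> > > > > >   srun --ntasks-per-node=1 bash -c 'hostname; df -hT "$PWD"'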
> > > > > >
> > > > >
> > > > > > Mark
> > > > > >
> > > > > >
> > > > > > > Does this have the same issue as this thread?
> > > > > > > http://comments.gmane.org/gmane.science.biology.gromacs.user/40984
> > > > > > >
> > > > > > > regards,
> > > > > > >
> > > > > > > Husen
> > > > > > >
> > > > > > > On Thu, Jun 16, 2016 at 2:20 PM, Mark Abraham <mark.j.abraham at gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > The stuff about different nodes or numbers of nodes doesn't
> > > > > > > > matter - it's merely an advisory note from mdrun. mdrun failed
> > > > > > > > when it tried to operate upon md_test.xtc, so perhaps you need to
> > > > > > > > consider whether the file exists, is writable, etc.
> > > > > > > >
> > > > > > > > Mark
> > > > > > > >
> > > > > > > > On Thu, Jun 16, 2016 at 6:48 AM Husen R <hus3nr at gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I got the following error message when I tried to restart a
> > > > > > > > > GROMACS simulation from a checkpoint file.
> > > > > > > > > I restarted the simulation using fewer nodes and processes, and
> > > > > > > > > I also excluded one node using the '--exclude=' option (in
> > > > > > > > > Slurm) for experimental purposes.
> > > > > > > > >
> > > > > > > > > I'm sure fewer nodes and processes are not the cause of this
> > > > > > > > > error, as I have already tested that.
> > > > > > > > > I have checked that the cause of this error is the '--exclude='
> > > > > > > > > usage. I excluded one node named 'compute-node' when restarting
> > > > > > > > > from the checkpoint (in the first run, I used all nodes,
> > > > > > > > > including 'compute-node').
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > It seems that in the first run, the submit job script was built
> > > > > > > > > on compute-node. So, at restart, the build user mismatch
> > > > > > > > > appeared because compute-node was not found (it was excluded).
> > > > > > > > >
> > > > > > > > > Am I right? Is this behavior normal?
> > > > > > > > > Or is there a way to avoid this, so I can freely restart from
> > > > > > > > > the checkpoint using any nodes, without limitation?
> > > > > > > > >
> > > > > > > > > thank you in advance
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Husen
> > > > > > > > >
> > > > > > > > > ==========================restart script=================
> > > > > > > > > #!/bin/bash
> > > > > > > > > #SBATCH -J ayo
> > > > > > > > > #SBATCH -o md%j.out
> > > > > > > > > #SBATCH -A necis
> > > > > > > > > #SBATCH -N 2
> > > > > > > > > #SBATCH -n 16
> > > > > > > > > #SBATCH --exclude=compute-node
> > > > > > > > > #SBATCH --time=144:00:00
> > > > > > > > > #SBATCH --mail-user=hus3nr at gmail.com
> > > > > > > > > #SBATCH --mail-type=begin
> > > > > > > > > #SBATCH --mail-type=end
> > > > > > > > >
> > > > > > > > > mpirun gmx_mpi mdrun -cpi md_test.cpt -deffnm md_test
> > > > > > > > > =====================================================
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ==================================output error========================
> > > > > > > > > Reading checkpoint file md_test.cpt generated: Wed Jun 15 16:30:44 2016
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >   Build time mismatch,
> > > > > > > > >     current program: Sel Apr  5 13:37:32 WIB 2016
> > > > > > > > >     checkpoint file: Rab Apr  6 09:44:51 WIB 2016
> > > > > > > > >
> > > > > > > > >   Build user mismatch,
> > > > > > > > >     current program: pro at head-node [CMAKE]
> > > > > > > > >     checkpoint file: pro at compute-node [CMAKE]
> > > > > > > > >
> > > > > > > > >   #ranks mismatch,
> > > > > > > > >     current program: 16
> > > > > > > > >     checkpoint file: 24
> > > > > > > > >
> > > > > > > > >   #PME-ranks mismatch,
> > > > > > > > >     current program: -1
> > > > > > > > >     checkpoint file: 6
> > > > > > > > >
> > > > > > > > > GROMACS patchlevel, binary or parallel settings differ from
> > > > > > > > > previous run.
> > > > > > > > > Continuation is exact, but not guaranteed to be binary identical.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -------------------------------------------------------
> > > > > > > > > Program gmx mdrun, VERSION 5.1.2
> > > > > > > > > Source code file:
> > > > > > > > > /home/pro/gromacs-5.1.2/src/gromacs/gmxlib/checkpoint.cpp,
> > > > > > > > > line: 2216
> > > > > > > > >
> > > > > > > > > Fatal error:
> > > > > > > > > Truncation of file md_test.xtc failed. Cannot do appending
> > > > > > > > > because of this failure.
> > > > > > > > > For more information and tips for troubleshooting, please check
> > > > > > > > > the GROMACS website at http://www.gromacs.org/Documentation/Errors
> > > > > > > > > -------------------------------------------------------
> > > > > > > > >
> > > > > > > > > ================================================================