[gmx-users] Two machines, same job, one fails

Justin A. Lemkul jalemkul at vt.edu
Wed Jan 26 00:53:25 CET 2011



TJ Mustard wrote:

<snip>

>  > 1. Do the systems in question crash immediately (i.e., step zero) or do
>  > they run for some time?
>  >
> 
> Step 0, every time.
> 
>  
> 
>  > 2. If they give you even a little bit of output, you can analyze which
>  > energy terms, etc. go haywire with the tips listed here:
>  >
> 
> All I have seen on these are LINCS errors and water molecules that could 
> not be settled.
> 
>  
> 
> But I will check this out right now, and email if I smell trouble.
> 
>  
> 
>  > http://www.gromacs.org/Documentation/Terminology/Blowing_Up#Diagnosing_an_Unstable_System
>  >
>  > That would help in tracking down any potential bug or error.
>  >
>  > 3. Is it just the production runs that are crashing, or everything?  If
>  > EM isn't even working, that smells even buggier.
> 
> Awesome question here, we have seen some weird stuff. Sometimes the 
> cluster will give us segmentation faults, then it will fail on our 
> machines, or sometimes not on our iMacs. I know, weird! If EM starts on 
> the cluster it will finish. Where we have issues is in positional 
> restraint (PR), MD, and MD/FEP. It doesn't matter whether FEP is on or off 
> in an MD run (although we are using SD for these MD/FEP runs).
> 
>  

Does "sometimes" refer to different simulations, or to multiple invocations of 
the same simulation system?  If you mean that system A works while system B 
doesn't, we're comparing apples and oranges, and it's irrelevant to the 
diagnosis (perhaps some systems simply require greater finesse or a different 
protocol).  If one simulation system consistently fails on one machine and 
works on another, that's what we need to be discussing.  Sorry if I've missed 
something, I'm just getting confused.
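
To make that comparison concrete, here is a minimal sketch, assuming you copy 
one .tpr built on the iMac over to the cluster and want to confirm the copy is 
intact before running it in both places.  The helper name same_input and the 
two paths are hypothetical; only cksum (POSIX) is relied upon.

```shell
#!/bin/sh
# Sketch: confirm that two run-input files are byte-identical before
# comparing how the two machines behave.

# same_input FILE1 FILE2 (hypothetical helper name)
# Returns 0 when both files have the same POSIX checksum and size,
# 1 when they differ, and 2 when either file is missing.
same_input() {
    [ -f "$1" ] && [ -f "$2" ] || return 2
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}

# Hypothetical paths: the .tpr built on the iMac and its copy on the
# cluster. Override them through the environment.
imac_tpr=${IMAC_TPR:-imac/RNAP-C_pr.tpr}
cluster_tpr=${CLUSTER_TPR:-cluster/RNAP-C_pr.tpr}

if same_input "$imac_tpr" "$cluster_tpr"; then
    echo "identical input: any difference in behavior comes from the machine or build"
else
    echo "inputs differ or are missing: compare like with like first"
fi
```

If the same file fails at step 0 on one machine and runs on the other, the 
input is ruled out and the build or hardware becomes the suspect.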

> 
>  >
>  > 4. Are the compilers the same on the iMac vs. AMD cluster?
> 
> No, I am using GCC 4.4.4 (x86_64-apple-darwin10) and the cluster is using 
> GCC 4.1.2 (x86_64-redhat-linux).
> 

Well, I know that for years weird behavior has been attributed to the gcc-4.1.x 
series, including the famous warning on the downloads page:

"WARNING: do not use the gcc 4.1.x set of compilers. They are broken. These 
compilers come with recent Linux distributions like Fedora 5/6 etc."

I don't know if those issues were ever resolved (some error in Gromacs that 
wasn't playing nice with gcc, or vice versa).
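
A quick way to see whether a given machine falls under that warning is 
sketched below.  The function name is_gcc_41x is made up for illustration; 
gcc -dumpversion is a standard gcc option that prints only the version number.

```shell
#!/bin/sh
# is_gcc_41x VERSION (hypothetical helper name)
# Succeeds when VERSION belongs to the gcc 4.1.x series.
is_gcc_41x() {
    case "$1" in
        4.1|4.1.*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Check the compiler that is first in PATH, if there is one.
if command -v gcc >/dev/null 2>&1; then
    ver=$(gcc -dumpversion)
    if is_gcc_41x "$ver"; then
        echo "WARNING: gcc $ver is in the 4.1.x series"
    else
        echo "gcc $ver is not in the 4.1.x series"
    fi
fi
```

This only reports the compiler currently in PATH; it does not tell you which 
compiler an existing Gromacs installation was built with.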

> I just did a quick yum search and there doesn't seem to be a newer GCC. 
> We know you are going to cmake but we have yet to get it implemented on 
> our cluster successfully.
> 

The build system is irrelevant.  You still need a reliable C compiler, whether 
using autoconf or cmake.
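
As a rough sketch of what I mean: both build systems take the compiler from 
the CC environment variable, so a newer gcc installed outside of yum can be 
used with either one.  The helper pick_cc and the install path below are 
assumptions for illustration, not something from your cluster.

```shell
#!/bin/sh
# pick_cc CANDIDATE (hypothetical helper name)
# Prints CANDIDATE if it is an executable file, otherwise the fallback "cc".
pick_cc() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "cc"
    fi
}

# Hypothetical location of a newer gcc built or installed by hand.
CC=$(pick_cc "${NEWER_GCC:-$HOME/opt/gcc/bin/gcc}")
export CC
echo "building with CC=$CC"

# Either build system then picks up the same compiler from the environment:
#   ./configure            (autoconf)
#   cmake /path/to/source  (cmake; CC is read on the first configure only)
```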

-Justin

>  
> 
> Thank you,
> 
> TJ Mustard
> 
>  
> 
>  >
>  > -Justin
>  >
>  > > 
>  > >
>  > > Now I understand that my iMac works, but it only has 2 CPUs and the
>  > > cluster has 320. Since we are running our jobs via Bennett Acceptance
>  > > Ratio FEP with 21 lambda windows, using just one 2-CPU machine would
>  > > take too long, especially since we wish to start pseudo high-throughput
>  > > drug testing.
>  > >
>  > > 
>  > >
>  > > 
>  > >
>  > > In my .mdp files now, the only changes are:
>  > >
>  > > (the default setting is on the right of the ";")
>  > >
>  > > 
>  > >
>  > > 
>  > >
>  > > define                   =     ; =
>  > >
>  > > ; RUN CONTROL PARAMETERS
>  > > integrator               = sd    ; = md
>  > > ; Start time and timestep in ps
>  > > tinit                    = 0    ; = 0
>  > > dt                       = 0.004    ; = 0.001
>  > > nsteps                   = 750000       ; = 0 (this one depends on the
>  > >                                         ; window and particular part of our job)
>  > >
>  > > ; OUTPUT CONTROL OPTIONS
>  > > ; Output frequency for coords (x), velocities (v) and forces (f)
>  > > nstxout                  = 10000    ; = 100 (to save on disk space)
>  > > nstvout                  = 10000    ; = 100
>  > >
>  > > 
>  > >
>  > > ; OPTIONS FOR ELECTROSTATICS AND VDW
>  > > ; Method for doing electrostatics
>  > > coulombtype              = PME    ; = Cutoff
>  > > rcoulomb-switch          = 0    ; = 0
>  > > rcoulomb                 = 1  ; = 1
>  > > ; Relative dielectric constant for the medium and the reaction field
>  > > epsilon_r                = 1    ; = 1
>  > > epsilon_rf               = 1    ; = 1
>  > > ; Method for doing Van der Waals
>  > > vdw-type                 = Cut-off    ; = Cut-off
>  > > ; cut-off lengths       
>  > > rvdw-switch              = 0    ; = 0
>  > > rvdw                     = 1  ; = 1
>  > > ; Spacing for the PME/PPPM FFT grid
>  > > fourierspacing           = 0.12    ; = 0.12
>  > > ; EWALD/PME/PPPM parameters
>  > > pme_order                = 4    ; = 4
>  > > ewald_rtol               = 1e-05    ; = 1e-05
>  > > ewald_geometry           = 3d    ; = 3d
>  > > epsilon_surface          = 0    ; = 0
>  > > optimize_fft             = yes    ; = no
>  > >
>  > > 
>  > >
>  > > ; OPTIONS FOR WEAK COUPLING ALGORITHMS
>  > > ; Temperature coupling 
>  > > tcoupl                   = v-rescale    ; = No
>  > > nsttcouple               = -1    ; = -1
>  > > nh-chain-length          = 10    ; = 10
>  > > ; Groups to couple separately
>  > > tc-grps                  = System    ; =
>  > > ; Time constant (ps) and reference temperature (K)
>  > > tau-t                    = 0.1    ; =
>  > > ref-t                    = 300    ; =
>  > > ; Pressure coupling     
>  > > Pcoupl                   = Parrinello-Rahman    ; = No
>  > > Pcoupltype               = Isotropic
>  > > nstpcouple               = -1    ; = -1
>  > > ; Time constant (ps), compressibility (1/bar) and reference P (bar)
>  > > tau-p                    = 1    ; = 1
>  > > compressibility          = 4.5e-5    ; =
>  > > ref-p                    = 1.0    ; =
>  > >
>  > > 
>  > >
>  > > ; OPTIONS FOR BONDS   
>  > > constraints              = all-bonds    ; = none
>  > > ; Type of constraint algorithm
>  > > constraint-algorithm     = Lincs    ; = Lincs
>  > >
>  > > 
>  > >
>  > > ; Free energy control stuff
>  > > free-energy              = yes    ; = no
>  > > init-lambda              = 0.00       ; = 0
>  > > delta-lambda             = 0    ; = 0
>  > > foreign_lambda           =        0.05 ; =
>  > > sc-alpha                 = 0.5    ; = 0
>  > > sc-power                 = 1.0    ; = 0
>  > > sc-sigma                 = 0.3    ; = 0.3
>  > > nstdhdl                  = 1    ; = 10
>  > > separate-dhdl-file       = yes    ; = yes
>  > > dhdl-derivatives         = yes    ; = yes
>  > > dh_hist_size             = 0    ; = 0
>  > > dh_hist_spacing          = 0.1    ; = 0.1
>  > > couple-moltype           = LGD    ; =
>  > > couple-lambda0           = vdw-q    ; = vdw-q
>  > > couple-lambda1           = none    ; = vdw-q
>  > > couple-intramol          = no     ;    = no
>  > >
>  > > 
>  > >
>  > > 
>  > >
>  > > Some of these change for positional restraint MD and energy
>  > > minimization.
>  > >
>  > > 
>  > >
>  > > All of these settings have come from either tutorials, papers or
>  > > people's advice.
>  > >
>  > > 
>  > >
>  > > If it would be advantageous I can post my entire energy minimization,
>  > > positional restraint, md, and FEP mdp files.
>  > >
>  > > 
>  > >
>  > > Thank you,
>  > >
>  > > TJ Mustard
>  > >
>  > > 
>  > >
>  > > 
>  > >
>  > >>> 
>  > >>>
>  > >>> Below is my command sequence:
>  > >>>
>  > >>> 
>  > >>>
>  > >>> echo ==============================================================================================================================
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-grompp -f em.mdp -c RNAP-C_b4em.gro -p RNAP-C.top -o
>  > >>> RNAP-C_em.tpr
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f em.mdp -c
>  > >>> RNAP-C_b4em.gro -p RNAP-C.top -o RNAP-C_em.tpr
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-mdrun -v -s RNAP-C_em.tpr -c RNAP-C_after_em.gro -g
>  > >>> emlog.log -cpo state_em.cpt -nt 2
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_em.tpr
>  > >>> -c RNAP-C_after_em.gro -g emlog.log -cpo state_em.cpt -nt 2
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-grompp -f pr.mdp -c RNAP-C_after_em.gro -p RNAP-C.top -o
>  > >>> RNAP-C_pr.tpr
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f pr.mdp -c
>  > >>> RNAP-C_after_em.gro -p RNAP-C.top -o RNAP-C_pr.tpr
>  > >>> echo g453s-mdrun -v -s RNAP-C_pr.tpr -e pr.edr -c RNAP-C_after_pr.gro
>  > >>> -g prlog.log -cpo state_pr.cpt -nt 2 -dhdl dhdl-pr.xvg
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_pr.tpr
>  > >>> -e pr.edr -c RNAP-C_after_pr.gro -g prlog.log -cpo state_pr.cpt -nt 2
>  > >>> -dhdl dhdl-pr.xvg
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-grompp -f md.mdp -c RNAP-C_after_pr.gro -p RNAP-C.top -o
>  > >>> RNAP-C_md.tpr
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f md.mdp -c
>  > >>> RNAP-C_after_pr.gro -p RNAP-C.top -o RNAP-C_md.tpr
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-mdrun -v -s RNAP-C_md.tpr -o RNAP-C_md.trr -c
>  > >>> RNAP-C_after_md.gro -g md.log -e md.edr -cpo state_md.cpt -nt 2 -dhdl
>  > >>> dhdl-md.xvg
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_md.tpr
>  > >>> -o RNAP-C_md.trr -c RNAP-C_after_md.gro -g md.log -e md.edr -cpo
>  > >>> state_md.cpt -nt 2 -dhdl dhdl-md.xvg
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-grompp -f FEP.mdp -c RNAP-C_after_md.gro -p RNAP-C.top -o
>  > >>> RNAP-C_fep.tpr
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f FEP.mdp -c
>  > >>> RNAP-C_after_md.gro -p RNAP-C.top -o RNAP-C_fep.tpr
>  > >>> date >>RNAP-C.joblog
>  > >>> echo g453s-mdrun -v -s RNAP-C_fep.tpr -o RNAP-C_fep.trr -c
>  > >>> RNAP-C_after_fep.gro -g fep.log -e fep.edr -cpo state_fep.cpt -nt 2
>  > >>> -dhdl dhdl-fep.xvg
>  > >>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_fep.tpr
>  > >>> -o RNAP-C_fep.trr -c RNAP-C_after_fep.gro -g fep.log -e fep.edr -cpo
>  > >>> state_fep.cpt -nt 2 -dhdl dhdl-fep.xvg
>  > >>>
>  > >>> 
>  > >>>
>  > >>> 
>  > >>>
>  > >>> I can add my .mdps but I do not think they are the problem since I
>  > >>> know it works on my personal iMac.
>  > >>>
>  > >>> 
>  > >>>
>  > >>> Thank you,
>  > >>>
>  > >>> TJ Mustard
>  > >>> Email: mustardt at onid.orst.edu <mailto:mustardt at onid.orst.edu>
>  > >>>
>  > >>
>  >
>  > --
>  > ========================================
>  >
>  > Justin A. Lemkul
>  > Ph.D. Candidate
>  > ICTAS Doctoral Scholar
>  > MILES-IGERT Trainee
>  > Department of Biochemistry
>  > Virginia Tech
>  > Blacksburg, VA
>  > jalemkul[at]vt.edu | (540) 231-9080
>  > http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>  >
>  > ========================================



