[gmx-users] Two machines, same job, one fails

Justin A. Lemkul jalemkul at vt.edu
Wed Jan 26 00:24:35 CET 2011



TJ Mustard wrote:
> 
> On January 25, 2011 at 2:08 PM Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
> 
>> On 26/01/2011 5:50 AM, TJ Mustard wrote:
>>>
>>> Hi all,
>>>
>>> I am running MD/FEP on a protein-ligand system with gromacs 4.5.3 and 
>>> FFTW 3.2.2.
>>>
>>> My iMac will run the job at 4 fs steps (over 4000 steps, until I 
>>> killed it). (I am using heavy H.)
>>>
>>> Once I put this on our group's AMD cluster, the jobs fail even with 
>>> 2 fs steps (with thousands of LINCS errors).
>>>
>>> We have recompiled the cluster's gromacs 4.5.3 build, with no change. 
>>> I know the system is the same, since I copied the job from the server 
>>> to my machine to rerun it.
>>>
>>> What is going on? Why can one machine run a job perfectly while the 
>>> other cannot? I also know there is adequate memory on both machines.
>>>
>>
>> You've posted this before, and I made a number of diagnostic 
>> suggestions. What did you learn?
>>
>> Mark
> 
> Mark and all,
> 
> First, thank you for all your help. What you suggested last time helped 
> considerably with our jobs/calculations. I have learned that the standard 
> mdp settings allow my heavy-H 4 fs jobs to run on my iMac (Intel), and I 
> have made them my new standard for future jobs. We chose the smaller 
> 0.8 nm PME/cutoff because of other papers/tutorials, but now we understand 
> why we need the standard settings. What I now see as our problem is that 
> our machines have some variable we cannot account for. If I am blind to my 
> error, please show me. I just don't understand why one computer works 
> while the other does not. We have recompiled gromacs 4.5.3 single 
> precision on our cluster, and we still have this problem.
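> 
> (For completeness: the heavy-H topology was built with pdb2gmx's 
> heavy-hydrogen option, roughly along these lines; the input structure 
> name here is a placeholder:)
> 
> # repartition hydrogen masses to 4 amu so that a 4 fs step is viable
> pdb2gmx -f RNAP-C.pdb -heavyh -o RNAP-C_b4em.gro -p RNAP-C.top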
> 

I know the feeling all too well.  PowerPC jobs crash instantly on our cluster, 
despite working beautifully on our lab machines.  There's a bug report about 
that one, but I haven't heard anything about AMD failures.  It remains a 
possibility that something beyond your control is going on.  To explore a bit 
further:

1. Do the systems in question crash immediately (i.e., step zero) or do they run 
for some time?

2. If they give you even a little bit of output, you can analyze which energy 
terms, etc. go haywire with the tips listed here:

http://www.gromacs.org/Documentation/Terminology/Blowing_Up#Diagnosing_an_Unstable_System

That would help in tracking down any potential bug or error; a couple of quick 
commands for this are sketched after the list.

3. Is it just the production runs that are crashing, or everything?  If EM isn't 
even working, that smells even buggier.

4. Are the compilers the same on the iMac vs. AMD cluster?
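
For points 1 and 2, something like the following usually narrows it down.  This 
is a minimal sketch using the 4.5.x tool names; the file names and paths are 
illustrative, not necessarily your actual ones:

# select the energy terms interactively; whichever term diverges first
# points at the problem interaction
g_energy -f md.edr -o energy.xvg

# compare the run inputs actually used on the two machines; any reported
# difference means the jobs were not identical after all
gmxcheck -s1 imac/RNAP-C_md.tpr -s2 cluster/RNAP-C_md.tpr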

-Justin

> 
> Now, I understand that my iMac works, but it only has 2 CPUs and the 
> cluster has 320. Since we are running our jobs via Bennett Acceptance 
> Ratio (BAR) FEP with 21 lambda windows, using just one 2-CPU machine 
> would take too long, especially since we wish to start pseudo 
> high-throughput drug testing.
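> 
> (Once all 21 windows finish, the free energy is assembled from the 
> per-window dhdl files with the 4.5.x BAR tool, along these lines; the 
> file glob is illustrative:)
> 
> # Bennett Acceptance Ratio estimate over the per-window dhdl output
> g_bar -f dhdl-fep.*.xvg -o bar.xvg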
> 
> In my .mdp files, the only changes are the ones listed below
> 
> (the default setting is shown to the right of the ";"):
> 
> define                   =     ; =
> 
> ; RUN CONTROL PARAMETERS
> integrator               = sd    ; = md
> ; Start time and timestep in ps
> tinit                    = 0    ; = 0
> dt                       = 0.004    ; = 0.001
> nsteps                   = 750000    ; = 0 (depends on the window and part of the job)
> 
> ; OUTPUT CONTROL OPTIONS
> ; Output frequency for coords (x), velocities (v) and forces (f)
> nstxout                  = 10000    ; = 100 (to save on disk space)
> nstvout                  = 10000    ; = 100
> 
> ; OPTIONS FOR ELECTROSTATICS AND VDW
> ; Method for doing electrostatics
> coulombtype              = PME    ; = Cutoff
> rcoulomb-switch          = 0    ; = 0
> rcoulomb                 = 1  ; = 1
> ; Relative dielectric constant for the medium and the reaction field
> epsilon_r                = 1    ; = 1
> epsilon_rf               = 1    ; = 1
> ; Method for doing Van der Waals
> vdw-type                 = Cut-off    ; = Cut-off
> ; cut-off lengths       
> rvdw-switch              = 0    ; = 0
> rvdw                     = 1  ; = 1
> ; Spacing for the PME/PPPM FFT grid
> fourierspacing           = 0.12    ; = 0.12
> ; EWALD/PME/PPPM parameters
> pme_order                = 4    ; = 4
> ewald_rtol               = 1e-05    ; = 1e-05
> ewald_geometry           = 3d    ; = 3d
> epsilon_surface          = 0    ; = 0
> optimize_fft             = yes    ; = no
> 
> ; OPTIONS FOR WEAK COUPLING ALGORITHMS
> ; Temperature coupling  
> tcoupl                   = v-rescale    ; = No
> nsttcouple               = -1    ; = -1
> nh-chain-length          = 10    ; = 10
> ; Groups to couple separately
> tc-grps                  = System    ; =
> ; Time constant (ps) and reference temperature (K)
> tau-t                    = 0.1    ; =
> ref-t                    = 300    ; =
> ; Pressure coupling     
> Pcoupl                   = Parrinello-Rahman    ; = No
> Pcoupltype               = Isotropic
> nstpcouple               = -1    ; = -1
> ; Time constant (ps), compressibility (1/bar) and reference P (bar)
> tau-p                    = 1    ; = 1
> compressibility          = 4.5e-5    ; =
> ref-p                    = 1.0    ; =
> 
> ; OPTIONS FOR BONDS    
> constraints              = all-bonds    ; = none
> ; Type of constraint algorithm
> constraint-algorithm     = Lincs    ; = Lincs
> 
> ; Free energy control stuff
> free-energy              = yes    ; = no
> init-lambda              = 0.00       ; = 0
> delta-lambda             = 0    ; = 0
> foreign_lambda           =        0.05 ; =
> sc-alpha                 = 0.5    ; = 0
> sc-power                 = 1.0    ; = 0
> sc-sigma                 = 0.3    ; = 0.3
> nstdhdl                  = 1    ; = 10
> separate-dhdl-file       = yes    ; = yes
> dhdl-derivatives         = yes    ; = yes
> dh_hist_size             = 0    ; = 0
> dh_hist_spacing          = 0.1    ; = 0.1
> couple-moltype           = LGD    ; =
> couple-lambda0           = vdw-q    ; = vdw-q
> couple-lambda1           = none    ; = vdw-q
> couple-intramol          = no     ;    = no
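> 
> (Each of the 21 windows uses its own copy of this file, differing only in 
> the lambda lines. A sketch of how the windows could be generated; the 
> @INIT@ tag and template name are placeholders:)
> 
> # lambda = 0.00, 0.05, ..., 1.00 across the 21 windows; foreign_lambda
> # would be substituted the same way
> for i in $(seq 0 20); do
>   l=$(awk -v i=$i 'BEGIN { printf "%.2f", i * 0.05 }')
>   sed "s/@INIT@/$l/" FEP.mdp.template > FEP.window$i.mdp
> done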
> 
> Some of these settings change for the position-restrained MD and energy minimization runs.
> 
> All of these settings came from tutorials, papers, or other people's 
> advice.
> 
> If it would be advantageous, I can post my entire energy minimization, 
> position-restraint, MD, and FEP mdp files.
> 
> Thank you,
> 
> TJ Mustard
> 
>>>
>>> Below is my command sequence:
>>>
>>> echo 
>>> ==============================================================================================================================
>>> date >>RNAP-C.joblog
>>> echo g453s-grompp -f em.mdp -c RNAP-C_b4em.gro -p RNAP-C.top -o 
>>> RNAP-C_em.tpr
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f em.mdp -c 
>>> RNAP-C_b4em.gro -p RNAP-C.top -o RNAP-C_em.tpr
>>> date >>RNAP-C.joblog
>>> echo g453s-mdrun -v -s RNAP-C_em.tpr -c RNAP-C_after_em.gro -g 
>>> emlog.log -cpo state_em.cpt -nt 2
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_em.tpr 
>>> -c RNAP-C_after_em.gro -g emlog.log -cpo state_em.cpt -nt 2
>>> date >>RNAP-C.joblog
>>> echo g453s-grompp -f pr.mdp -c RNAP-C_after_em.gro -p RNAP-C.top -o 
>>> RNAP-C_pr.tpr
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f pr.mdp -c 
>>> RNAP-C_after_em.gro -p RNAP-C.top -o RNAP-C_pr.tpr
>>> echo g453s-mdrun -v -s RNAP-C_pr.tpr -e pr.edr -c RNAP-C_after_pr.gro 
>>> -g prlog.log -cpo state_pr.cpt -nt 2 -dhdl dhdl-pr.xvg
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_pr.tpr 
>>> -e pr.edr -c RNAP-C_after_pr.gro -g prlog.log -cpo state_pr.cpt -nt 2 
>>> -dhdl dhdl-pr.xvg
>>> date >>RNAP-C.joblog
>>> echo g453s-grompp -f md.mdp -c RNAP-C_after_pr.gro -p RNAP-C.top -o 
>>> RNAP-C_md.tpr
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f md.mdp -c 
>>> RNAP-C_after_pr.gro -p RNAP-C.top -o RNAP-C_md.tpr
>>> date >>RNAP-C.joblog
>>> echo g453s-mdrun -v -s RNAP-C_md.tpr -o RNAP-C_md.trr -c 
>>> RNAP-C_after_md.gro -g md.log -e md.edr -cpo state_md.cpt -nt 2 -dhdl 
>>> dhdl-md.xvg
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_md.tpr 
>>> -o RNAP-C_md.trr -c RNAP-C_after_md.gro -g md.log -e md.edr -cpo 
>>> state_md.cpt -nt 2 -dhdl dhdl-md.xvg
>>> date >>RNAP-C.joblog
>>> echo g453s-grompp -f FEP.mdp -c RNAP-C_after_md.gro -p RNAP-C.top -o 
>>> RNAP-C_fep.tpr
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-grompp -f FEP.mdp -c 
>>> RNAP-C_after_md.gro -p RNAP-C.top -o RNAP-C_fep.tpr
>>> date >>RNAP-C.joblog
>>> echo g453s-mdrun -v -s RNAP-C_fep.tpr -o RNAP-C_fep.trr -c 
>>> RNAP-C_after_fep.gro -g fep.log -e fep.edr -cpo state_fep.cpt -nt 2 
>>> -dhdl dhdl-fep.xvg
>>> /share/apps/gromacs-4.5.3-single/bin/g453s-mdrun -v -s RNAP-C_fep.tpr 
>>> -o RNAP-C_fep.trr -c RNAP-C_after_fep.gro -g fep.log -e fep.edr -cpo 
>>> state_fep.cpt -nt 2 -dhdl dhdl-fep.xvg
>>>
>>> I can add my .mdp files, but I do not think they are the problem, 
>>> since I know the job works on my personal iMac.
>>>
>>> Thank you,
>>>
>>> TJ Mustard
>>> Email: mustardt at onid.orst.edu <mailto:mustardt at onid.orst.edu>
>>>
>>
> 
> TJ Mustard
> Email: mustardt at onid.orst.edu
> 

-- 
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


