[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
Bjørn Steen Sæthre
st01397 at student.uib.no
Mon Sep 29 15:21:34 CEST 2008
I am running some annealing trials on a Cray XT4. Although the
throughput is impressive, I have severe difficulties with the stability
of the code.
For my relatively small system of ~7500 atoms, the engine typically
crashes after ~500k steps.
I am using the bleeding-edge CVS version: mdrun.c (1.141) (the newest
one after Erik L.'s recent patch of the PME code)
I configure and compile exclusively on the compute nodes (not the
frontend), and the only compiler warnings I get are of the type:
"warning: Using 'getpwuid' in statically linked applications requires
at runtime the shared libraries from the glibc version used for linking"
After compiling, though, the code runs for ~20 min, producing sound
data, before stalling.
The error logs are very short and quite uninformative.
PBS .o:
Application 159316 exit codes: 137
Application 159316 exit signals: Killed
Application 159316 resources: utime 0, stime 0
--------------------------------------------------
Begin PBS Epilogue hexagon.bccs.uib.no
Date: Mon Sep 29 12:32:54 CEST 2008
Job ID: 65643.nid00003
Username: bjornss
Group: bjornss
Job Name: pmf_hydanneal_heatup_400K
Session: 10156
Limits: walltime=05:00:00
Resources:
cput=00:00:00,mem=4940kb,vmem=22144kb,walltime=00:20:31
Queue: batch
Account: fysisk
Base login-node: login5
End PBS Epilogue Mon Sep 29 12:32:54 CEST 2008
PBS .err:
_pmii_daemon(SIGCHLD): PE 0 exit signal Killed
[NID 702]Apid 159316: initiated application termination.
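A note on reading the logs above: under the usual shell convention, an exit code greater than 128 means the process was terminated by a signal whose number is the code minus 128. The sketch below assumes that this convention applies to the value reported by aprun; the variable names are illustrative only. The result is consistent with the "exit signals: Killed" line, but it does not say who sent the signal or why.

```shell
#!/bin/bash
# Decode the exit code from the PBS .o file, assuming the common
# "128 + signal number" convention (an assumption, not confirmed by aprun output).
exit_code=137                            # from "exit codes: 137"
signal_number=$((exit_code - 128))       # 137 - 128 = 9
signal_name=$(kill -l "$signal_number")  # look up the signal name for that number
echo "exit code $exit_code -> signal $signal_number (SIG$signal_name)"
```

Signal 9 is SIGKILL, which cannot be caught by the application; that would explain why mdrun leaves no error message of its own.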
As proper electrostatics is crucial to my modeling, I am using PME,
which accounts for a large part of my calculation cost: 35-50%.
In the most extreme case, I use the following startup script:
run.pbs:
#!/bin/bash
#PBS -A fysisk
#PBS -N pmf_hydanneal_heatup_400K
#PBS -o pmf_hydanneal.o
#PBS -e pmf.hydanneal.err
#PBS -l walltime=5:00:00,mppwidth=40,mppnppn=4
cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K
source $HOME/gmx_latest_290908/bin/GMXRC
aprun -n 40 parmdrun -s topol.tpr -maxh 5 -npme 20
exit $?
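To make the resource layout requested by this script explicit, here is a small arithmetic sketch. The numbers are taken from the script above; the variable names are illustrative, and the reading of mppwidth as the total number of processes and mppnppn as processes per node is the usual Cray PBS interpretation.

```shell
#!/bin/bash
# Rank layout implied by run.pbs (values copied from the script above).
total_ranks=40     # aprun -n 40, mppwidth=40
ranks_per_node=4   # mppnppn=4
pme_ranks=20       # mdrun -npme 20: ranks dedicated to PME

nodes=$((total_ranks / ranks_per_node))          # compute nodes requested
pp_ranks=$((total_ranks - pme_ranks))            # ranks left for particle-particle work
pme_percent=$((100 * pme_ranks / total_ranks))   # share of ranks doing only PME

echo "nodes=$nodes pp_ranks=$pp_ranks pme_ranks=$pme_ranks pme_share=${pme_percent}%"
```

So in this most extreme case half of the ranks are PME-only, at the upper end of the 35-50% PME cost quoted above.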
Now, apart from a significant reduction in the system dipole moment,
there are no large changes in the system, nor significant translations
of the molecules in the box.
I enclose the md.log and my parameter file. The run-topology (topol.tpr)
can be found at:
http://drop.io/mdanneal
Anyone who wants to try to replicate the crash on their local cluster
is welcome to do so.
If the error persists after such trials, I am willing to file a bug
report on Bugzilla.
If more information is needed, I will try to provide it upon request.
Regards and thanks for bothering
--
---------------------
Bjørn Steen Saethre
PhD-student
Theoretical and Energy Physics Unit
Institute of Physics and Technology
Allegt, 41
N-5020 Bergen
Norway
Tel(office) +47 55582869
-------------- next part --------------
A non-text attachment was scrubbed...
Name: md.log
Type: text/x-log
Size: 10161 bytes
Desc: not available
URL: <http://maillist.sys.kth.se/pipermail/gromacs.org_gmx-users/attachments/20080929/ceefb79e/attachment.bin>
-------------- next part --------------
title = heatup 400K structII - propan - tip4p/ice(rigid) - PME
cpp = /lib/cpp
integrator = md
define = -DPOSRES
include = -I/home/fi/bjornss/mytop
;Run ctrl
dt = 0.001
nsteps = 2000000
nstxout = 200000
nstvout = 200000
nstfout = 200000
nstenergy = 100
nstlog = 100000
nstxtcout = 1000
;Electrostatics/Neighboursearch
nstlist = 5
ns_type = grid
rlist = 0.9
coulombtype = PME
ewald_geometry = 3d
rcoulomb = 0.9
vdw-type = Cut-off
rvdw = 0.9
optimize_fft = yes
fourier_nx = 60
fourier_ny = 40
fourier_nz = 40
pme_order = 6
;Boundary conditions/constraints etc,
pbc = xyz
DispCorr = Ener
constraints = hbonds
constraint_algorithm = lincs
lincs_iter = 2
lincs_order = 6
;nwall = 0
;walltype = 9-3
;wall_r_linpot = -10
;wall_atomtype = opls_113 opls_113
;wall_density = 4.6 4.6
;wall_ewald_zfac = 2.4
;Temperature and pressure generation and coupling
gen_vel = no
;gen_temp = 350
;gen_seed = -1
tcoupl = berendsen
tc_grps = System
tau_t = 0.5
ref_t = 400
pcoupl = no
;pcoupltype = isotropic
;tau_p = 2
;ref_p = 10
;compressibility = 5e-6
unconstrained-start = no
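
For orientation, the run length implied by these parameters can be worked out as follows. This is only an arithmetic sketch: dt and nsteps come from the parameter file above, the crash point of ~500k steps is the approximate figure quoted in the message, and the variable names are illustrative.

```shell
#!/bin/bash
# Simulated time from the .mdp values, using integer femtoseconds
# to avoid floating-point arithmetic in the shell.
dt_fs=1              # dt = 0.001 ps = 1 fs
nsteps=2000000       # nsteps from the parameter file
crash_step=500000    # approximate step count at which the run is killed

total_ps=$((nsteps * dt_fs / 1000))      # planned simulation length in ps
crash_ps=$((crash_step * dt_fs / 1000))  # simulated time reached before the crash

echo "planned: ${total_ps} ps, reached before crash: about ${crash_ps} ps"
```

In other words, the job is killed roughly a quarter of the way through the planned 2 ns.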