[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??

Bjørn Steen Sæthre st01397 at student.uib.no
Mon Sep 29 15:21:34 CEST 2008


I am running some annealing trials on a Cray XT4, and although the
throughput is impressive, I am having severe difficulties with the
stability of the code.
For my relatively small system of ~7500 atoms, the engine typically
crashes after ~500k steps.

I am using the bleeding-edge CVS version: mdrun.c (1.141) (the newest
one after Erik L.'s recent patch of the PME code) 

I configure and compile on the compute nodes exclusively (not the
frontend) and the only compiler warning(s) I get are of the type:

"warning: Using 'getpwuid' in statically linked applications requires 
at runtime the shared libraries from the glibc version used for linking"

After compiling, though, the code executes and runs for ~20 min,
producing sound data before stalling.

The error logs are very short and quite uninformative.

PBS .o: 
Application 159316 exit codes: 137
Application 159316 exit signals: Killed
Application 159316 resources: utime 0, stime 0
--------------------------------------------------
Begin PBS Epilogue hexagon.bccs.uib.no
Date:             Mon Sep 29 12:32:54 CEST 2008
Job ID:           65643.nid00003
Username:         bjornss
Group:            bjornss
Job Name:         pmf_hydanneal_heatup_400K
Session:          10156
Limits:           walltime=05:00:00
Resources:
cput=00:00:00,mem=4940kb,vmem=22144kb,walltime=00:20:31
Queue:            batch
Account:          fysisk
Base login-node:  login5
End PBS Epilogue  Mon Sep 29 12:32:54 CEST 2008

PBS .err:
_pmii_daemon(SIGCHLD): PE 0 exit signal Killed
[NID 702]Apid 159316: initiated application termination.
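
For readers decoding the log above: by the usual shell convention, an exit code above 128 is 128 plus the number of the terminating signal, so 137 corresponds to signal 9 (SIGKILL), which agrees with the "Killed" line. A minimal sketch of that arithmetic follows. The helper name is hypothetical and not part of PBS or aprun, and it only identifies the signal, not who sent it or why.

```bash
#!/bin/bash
# Hypothetical helper (illustration only): map an exit code above 128
# to the name of the signal that terminated the process.
signal_from_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # 128 + N means "terminated by signal N"
        kill -l $((code - 128))
    else
        echo "none"
    fi
}

# Exit code reported in the PBS .o file
signal_from_exit_code 137
```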

As proper electrostatics is crucial to my modeling, I am using PME, which
comprises a large part of my calculation cost: 35-50%.
In the most extreme case, I use the following startup script:

run.pbs:

#!/bin/bash
#PBS -A fysisk
#PBS -N pmf_hydanneal_heatup_400K
#PBS -o pmf_hydanneal.o
#PBS -e pmf.hydanneal.err
#PBS -l walltime=5:00:00,mppwidth=40,mppnppn=4

cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K
source $HOME/gmx_latest_290908/bin/GMXRC

aprun -n 40 parmdrun -s topol.tpr -maxh 5 -npme 20
exit $?
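
As a rough sketch of what the script above requests, the rank layout can be worked out from the three numbers in run.pbs. The variable names below are illustrative only. The calculation assumes that every rank not given to PME via -npme does particle-particle work.

```bash
#!/bin/bash
# Values copied from run.pbs; variable names are illustrative.
mppwidth=40   # total MPI ranks requested (#PBS -l mppwidth, aprun -n)
mppnppn=4     # ranks per node (#PBS -l mppnppn)
npme=20       # ranks dedicated to PME (mdrun -npme)

nodes=$((mppwidth / mppnppn))          # compute nodes in use
npp=$((mppwidth - npme))               # ranks left for particle-particle work
pme_percent=$((100 * npme / mppwidth)) # share of ranks doing PME only

echo "nodes=$nodes pp_ranks=$npp pme_ranks=$npme pme_share=${pme_percent}%"
```

In this most extreme case half of the ranks are PME-only, which corresponds to the upper end of the 35-50% PME cost quoted above.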


Now, apart from a significant reduction in the system dipole moment,
there are no large changes in the system, nor significant translations
of the molecules in the box.

I enclose the md.log and my parameter file. The run-topology (topol.tpr)
can be found at:

http://drop.io/mdanneal

if anyone wants to try to replicate the crash on their local cluster,
they are welcome to.
If the error persists after such trials, I am willing to file a bug on
Bugzilla.


If more information is needed, I will try to provide it upon request.


Regards and thanks for bothering

-- 
---------------------
Bjørn Steen Saethre 
PhD-student
Theoretical and Energy Physics Unit
Institute of Physics and Technology
Allegt, 41
N-5020 Bergen
Norway

Tel(office) +47 55582869 


-------------- next part --------------
A non-text attachment was scrubbed...
Name: md.log
Type: text/x-log
Size: 10161 bytes
Desc: not available
URL: <http://maillist.sys.kth.se/pipermail/gromacs.org_gmx-users/attachments/20080929/ceefb79e/attachment.bin>
-------------- next part --------------
title                    = heatup 400K  structII - propan -  tip4p/ice(rigid) - PME
cpp			 = /lib/cpp
integrator		 = md
define			 =-DPOSRES
include			 = -I/home/fi/bjornss/mytop

;Run ctrl
dt			 = 0.001
nsteps                   = 2000000
nstxout			 = 200000
nstvout			 = 200000
nstfout			 = 200000
nstenergy		 = 100
nstlog			 = 100000
nstxtcout	         = 1000


;Electrostatics/Neighboursearch
nstlist			 = 5
ns_type			 = grid
rlist                    = 0.9
coulombtype              = PME
ewald_geometry		 = 3d
rcoulomb                 = 0.9
vdw-type                 = Cut-off
rvdw                     = 0.9
optimize_fft 		 = yes
fourier_nx		 = 60
fourier_ny		 = 40
fourier_nz		 = 40
pme_order		 = 6

;Boundary conditions/constraints etc,
pbc			 = xyz
DispCorr                 = Ener
constraints              = hbonds
constraint_algorithm	 = lincs
lincs_iter		 = 2
lincs_order		 = 6
;nwall			 = 0
;walltype		 = 9-3
;wall_r_linpot		 = -10
;wall_atomtype		 = opls_113 opls_113
;wall_density		 = 4.6 4.6 
;wall_ewald_zfac 	 = 2.4




;Temperature and pressure generation and coupling
gen_vel			 = no
;gen_temp		 = 350
;gen_seed		 = -1

tcoupl			 = berendsen
tc_grps		 	 = System
tau_t			 = 0.5
ref_t			 = 400

pcoupl			 = no
;pcoupltype		 = isotropic
;tau_p			 = 2 
;ref_p			 = 10
;compressibility	 = 5e-6

unconstrained-start      = no
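
To put the crash point in context, the run-control values above can be combined with the approximate figures quoted in the message (~500k steps, walltime 00:20:31). This is a back-of-the-envelope sketch with illustrative variable names. The crash step is the poster's estimate, so the results are approximate.

```bash
#!/bin/bash
# dt and nsteps come from the parameter file; the crash step and walltime
# are the approximate figures from the message and the PBS epilogue.
dt_fs=1                        # dt = 0.001 ps = 1 fs
nsteps=2000000                 # planned number of steps
crash_step=500000              # "~500k steps" (approximate)
walltime_s=$((20 * 60 + 31))   # walltime=00:20:31

planned_ps=$((nsteps * dt_fs / 1000))      # planned simulated time in ps
reached_ps=$((crash_step * dt_fs / 1000))  # simulated time reached in ps
steps_per_s=$((crash_step / walltime_s))   # integer division, rough throughput

echo "planned=${planned_ps} ps reached=${reached_ps} ps rate=${steps_per_s} steps/s"
```

On these numbers the job stops after roughly a quarter of the planned 2 ns.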

