[gmx-users] shortage of shared memory
chris.neale at utoronto.ca
Sun Jul 8 06:36:22 CEST 2007
I have a variety of systems (up to 500K atoms) that run in parallel without
ever hitting errors due to a shortage of shared memory. However, I sometimes
run into this problem with lipid bilayer systems of fewer than 30K atoms.
If I submit a job and get the shared memory error, the error occurs before
any simulation time has elapsed. What's more, if I resubmit the job it often
runs fine. However, one recent bilayer system set up by a colleague won't
ever run.
I am using openmpi_v1.2.1 and I can avoid using shared memory like this:
${OMPI}/mpirun --mca btl ^sm ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} -4
(etc...)
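(For reference, and assuming a standard Open MPI 1.2 install, the same BTL
selection can be made without editing the mpirun line, since Open MPI also
reads MCA parameters from OMPI_MCA_* environment variables; the value syntax
is the same as on the command line:

  export OMPI_MCA_btl=^sm
  ${OMPI}/mpirun ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} (etc...)

or, equivalently, by listing the transports to keep, e.g. --mca btl self,tcp.)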
That reliably fixes the error, but when I run that way the scaling to 4
processors is very poor, as judged both by the walltime and by the timing
summary at the end of the gromacs .log file.
This also confuses me, since my sysadmin tells me that gromacs doesn't use
shared memory.
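My understanding (and please correct me if this is wrong) is that the shared
memory in question belongs to Open MPI rather than to gromacs itself: the
"sm" byte-transfer layer uses a shared-memory segment for messages between
ranks on the same node, which would explain why excluding it with ^sm makes
the error disappear. The component and its tunables can be inspected with
ompi_info, e.g.:

  ${OMPI}/ompi_info --param btl sm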
I get two basic error messages. Sometimes it is this, printed to stderr:
[cn-r4-18][0,1,1][btl_sm_component.c:521:mca_btl_sm_component_progress] SM
faild to send message due to shortage of shared memory.
And sometimes it is a longer error message (see the end of this email for
the complete stderr from a run of that type).
I believe this to be a problem with our cluster, which probably makes this
the wrong mailing list for the question. Still, I am hoping that somebody
can help me clarify what is going on with shared memory usage in gromacs,
and perhaps why the error appears to be stochastic yet also specific to
bilayer systems.
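If it helps with the diagnosis: assuming (and this may be wrong) that Open
MPI 1.2 puts its shared-memory backing file in a per-job session directory
under /tmp (named something like openmpi-sessions-<user>@<host>_*), then a
quick check on the compute nodes would be whether /tmp is short on space or
cluttered with stale session directories left behind by killed jobs:

  df -h /tmp /dev/shm
  ls -ld /tmp/openmpi-sessions-* 2>/dev/null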
Our cluster is also having some problems with random xtc or trr file
corruption (roughly 1 run in 10 to 20), in case that seems related to the
shared memory issue. However, that is not the issue I am presenting in this
post.
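(In case it matters: the corruption should be detectable with gmxcheck,
which reads through the trajectory frame by frame and complains when it
hits a broken one, e.g.:

  gmxcheck -f bilayer_popc_md219.xtc

The filename above is just the -deffnm from the run shown below.)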
Thanks,
Chris.
########## Here is the stderr
########## Following this is the x0.log file, but that doesn't appear to have error indications in it
NNODES=4, MYRANK=1, HOSTNAME=cn-r1-27
NNODES=4, MYRANK=0, HOSTNAME=cn-r1-27
NODEID=0 argc=8
:-) G R O M A C S (-:
NODEID=1 argc=8
NNODES=4, MYRANK=2, HOSTNAME=cn-r1-27
NODEID=2 argc=8
NNODES=4, MYRANK=3, HOSTNAME=cn-r1-27
NODEID=3 argc=8
Green Red Orange Magenta Azure Cyan Skyblue
:-) VERSION 3.3.1 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-)
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2
(-:
 Option     Filename                   Type          Description
------------------------------------------------------------
      -s  bilayer_popc_md219.tpr   Input         Generic run input: tpr tpb tpa xml
      -o  bilayer_popc_md219.trr   Output        Full precision trajectory: trr trj
      -x  bilayer_popc_md219.xtc   Output, Opt.  Compressed trajectory (portable xdr format)
      -c  bilayer_popc_md219.gro   Output        Generic structure: gro g96 pdb xml
      -e  bilayer_popc_md219.edr   Output        Generic energy: edr ene
      -g  bilayer_popc_md219.log   Output        Log file
   -dgdl  bilayer_popc_md219.xvg   Output, Opt.  xvgr/xmgr file
  -field  bilayer_popc_md219.xvg   Output, Opt.  xvgr/xmgr file
  -table  bilayer_popc_md219.xvg   Input, Opt.   xvgr/xmgr file
 -tablep  bilayer_popc_md219.xvg   Input, Opt.   xvgr/xmgr file
  -rerun  bilayer_popc_md219.xtc   Input, Opt.   Generic trajectory: xtc trr trj gro g96 pdb
    -tpi  bilayer_popc_md219.xvg   Output, Opt.  xvgr/xmgr file
     -ei  bilayer_popc_md219.edi   Input, Opt.   ED sampling input
     -eo  bilayer_popc_md219.edo   Output, Opt.  ED sampling output
      -j  bilayer_popc_md219.gct   Input, Opt.   General coupling stuff
     -jo  bilayer_popc_md219.gct   Output, Opt.  General coupling stuff
  -ffout  bilayer_popc_md219.xvg   Output, Opt.  xvgr/xmgr file
 -devout  bilayer_popc_md219.xvg   Output, Opt.  xvgr/xmgr file
  -runav  bilayer_popc_md219.xvg   Output, Opt.  xvgr/xmgr file
     -pi  bilayer_popc_md219.ppa   Input, Opt.   Pull parameters
     -po  bilayer_popc_md219.ppa   Output, Opt.  Pull parameters
     -pd  bilayer_popc_md219.pdo   Output, Opt.  Pull data output
     -pn  bilayer_popc_md219.ndx   Input, Opt.   Index file
    -mtx  bilayer_popc_md219.mtx   Output, Opt.  Hessian matrix
     -dn  bilayer_popc_md219.ndx   Output, Opt.  Index file
Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-[no]X bool no Use dialog box GUI to edit command line options
-nice int 19 Set the nicelevel
-deffnm string bilayer_popc_md219 Set the default filename for all file
options
-[no]xvgr bool yes Add specific codes (legends etc.) in the output
xvg files for the xmgrace program
-np int 4 Number of nodes, must be the same as used for
grompp
-nt int 1 Number of threads to start on each node
-[no]v bool yes Be loud and noisy
-[no]compact bool yes Write a compact log file
-[no]sepdvdl bool no Write separate V and dVdl terms for each
interaction type and node to the log file(s)
-[no]multi bool no Do multiple simulations in parallel (only with
-np > 1)
-replex int 0 Attempt replica exchange every # steps
-reseed int -1 Seed for replica exchange, -1 is generate a seed
-[no]glas bool no Do glass simulation with special long range
corrections
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system
Getting Loaded...
Reading file bilayer_popc_md219.tpr, VERSION 3.3.1 (single precision)
Loaded with Money
[cn-r1-27:26937] *** Process received signal ***
[cn-r1-27:26937] Signal: Segmentation fault (11)
[cn-r1-27:26937] Signal code: Address not mapped (1)
[cn-r1-27:26937] Failing at address: 0x18
[cn-r1-27:26937] [ 0] /lib64/tls/libpthread.so.0 [0x2a969a0730]
[cn-r1-27:26937] [ 1]
/tools/openmpi/1.2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x6b)
[0x2a9a488c3b]
[cn-r1-27:26937] [ 2]
/tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x14d)
[0x2a9a1765ed]
[cn-r1-27:26937] [ 3]
/tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_process_pending+0x1c4)
[0x2a9a177af4]
[cn-r1-27:26937] [ 4]
/tools/openmpi/1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_match_completion_free+0x22e)
[0x2a9a17827e]
[cn-r1-27:26937] [ 5]
/tools/openmpi/1.2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x846)
[0x2a9a489fe6]
[cn-r1-27:26937] [ 6]
/tools/openmpi/1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2a)
[0x2a9a27e47a]
[cn-r1-27:26937] [ 7]
/tools/openmpi/1.2/lib/libopen-pal.so.0(opal_progress+0x4a)
[0x2a966405da]
[cn-r1-27:26937] [ 8]
/tools/openmpi/1.2/lib/libmpi.so.0(ompi_request_wait_all+0xad)
[0x2a9637adbd]
[cn-r1-27:26937] [ 9]
/tools/openmpi/1.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x2ab)
[0x2a9a9af39b]
[cn-r1-27:26937] [10]
/tools/openmpi/1.2/lib/libmpi.so.0(ompi_comm_nextcid+0x20f)
[0x2a9636b36f]
[cn-r1-27:26937] [11]
/tools/openmpi/1.2/lib/libmpi.so.0(ompi_comm_dup+0x94) [0x2a96369bd4]
[cn-r1-27:26937] [12]
/tools/openmpi/1.2/lib/libmpi.so.0(PMPI_Comm_dup+0x6f) [0x2a96390e0f]
[cn-r1-27:26937] [13]
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(gmx_parallel_3dfft_init+0x72)
[0x48f902]
[cn-r1-27:26937] [14]
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(mk_fftgrid+0xd1)
[0x467f11]
[cn-r1-27:26937] [15]
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(init_pme+0x4c0)
[0x460860]
[cn-r1-27:26937] [16]
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(mdrunner+0x84f)
[0x42d8ff]
[cn-r1-27:26937] [17]
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(main+0x237)
[0x42e357]
[cn-r1-27:26937] [18] /lib64/tls/libc.so.6(__libc_start_main+0xea)
[0x2a96ac4aaa]
[cn-r1-27:26937] [19]
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2(XmCreateOptionMenu+0x42)
[0x416d8a]
[cn-r1-27:26937] *** End of error message ***
mpirun noticed that job rank 0 with PID 26934 on node cn-r1-27 exited
on signal 15 (Terminated).
#######################
####################### Here is the log file from the head node
$cat bilayer_popclrlj_md1910.log
Log file opened on Sun May 20 18:12:34 2007
Host: cn-r4-29 pid: 9245 nodeid: 0 nnodes: 4
The Gromacs distribution was built Mon Mar 19 11:20:43 EDT 2007 by
cneale at cn-r1-3 (Linux 2.6.5-7.282-smp x86_64)
:-) G R O M A C S (-:
Gromacs Runs On Most of All Computer Systems
:-) VERSION 3.3.1 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2006, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-)
/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2/bin/mdrun_openmpi_v1.2
(-:
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
CPU= 0, lastcg= 5041, targetcg=15126, myshift= 2
CPU= 1, lastcg=10083, targetcg=20168, myshift= 2
CPU= 2, lastcg=15126, targetcg= 5042, myshift= 2
CPU= 3, lastcg=20168, targetcg=10084, myshift= 2
nsb->shift = 2, nsb->bshift= 0
Listing Scalars
nsb->nodeid: 0
nsb->nnodes: 4
nsb->cgtotal: 20169
nsb->natoms: 51236
nsb->shift: 2
nsb->bshift: 0
Nodeid index homenr cgload workload
0 0 12808 5042 5042
1 12808 12808 10084 10084
2 25616 12812 15127 15127
3 38428 12808 20169 20169
parameters of the run:
integrator = md
nsteps = 250000
init_step = 0
ns_type = Grid
nstlist = 10
ndelta = 2
bDomDecomp = FALSE
decomp_dir = 0
nstcomm = 1
comm_mode = Linear
nstcheckpoint = 1000
nstlog = 1000
nstxout = 250000
nstvout = 250000
nstfout = 250000
nstenergy = 5000
nstxtcout = 5000
init_t = 87300
delta_t = 0.002
xtcprec = 1000
nkx = 84
nky = 80
nkz = 60
pme_order = 4
ewald_rtol = 1e-05
ewald_geometry = 0
epsilon_surface = 0
optimize_fft = FALSE
ePBC = xyz
bUncStart = TRUE
bShakeSOR = FALSE
etc = Berendsen
epc = Berendsen
epctype = Semiisotropic
tau_p = 4
ref_p (3x3):
ref_p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref_p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref_p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
compress (3x3):
compress[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compress[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compress[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
andersen_seed = 815131
rlist = 0.9
coulombtype = PME
rcoulomb_switch = 0
rcoulomb = 0.9
vdwtype = Cut-off
rvdw_switch = 0
rvdw = 1.4
epsilon_r = 1
epsilon_rf = 1
tabext = 1
gb_algorithm = Still
nstgbradii = 1
rgbradii = 2
gb_saltconc = 0
implicit_solvent = No
DispCorr = EnerPres
fudgeQQ = 0.5
free_energy = no
init_lambda = 0
sc_alpha = 0
sc_power = 0
sc_sigma = 0.3
delta_lambda = 0
disre_weighting = Conservative
disre_mixed = FALSE
dr_fc = 1000
dr_tau = 0
nstdisreout = 100
orires_fc = 0
orires_tau = 0
nstorireout = 100
dihre-fc = 1000
dihre-tau = 0
nstdihreout = 100
em_stepsize = 0.01
em_tol = 10
niter = 20
fc_stepsize = 0
nstcgsteep = 1000
nbfgscorr = 10
ConstAlg = Lincs
shake_tol = 1e-04
lincs_order = 4
lincs_warnangle = 30
lincs_iter = 1
bd_fric = 0
ld_seed = 1993
cos_accel = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 33598.8 51892.2
ref_t: 310 310
tau_t: 0.1 0.1
anneal: No No
ann_npoints: 0 0
acc: 0 0 0
nfreeze: N N N
energygrp_flags[ 0]: 0
efield-x:
n = 0
efield-xt:
n = 0
efield-y:
n = 0
efield-yt:
n = 0
efield-z:
n = 0
efield-zt:
n = 0
bQMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
scalefactor = 1
qm_opts:
ngQM = 0
Max number of graph edges per atom is 4
Table routines are used for coulomb: TRUE
Table routines are used for vdw: FALSE
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's: NS: 0.9 Coulomb: 0.9 LJ: 1.4
System total charge: 0.000
Generated table with 1200 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1200 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1200 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 500 data points for 1-4 LJ12.
Tabscale = 500 points/nm
Enabling TIP4p water optimization for 8649 molecules.
Will do PME sum in reciprocal space.
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essman, L. Perela, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------