[gmx-users] mdrun mpi segmentation fault in high load situation
Wojtyczka, André
a.wojtyczka at fz-juelich.de
Thu Dec 23 12:01:13 CET 2010
Dear Gromacs Enthusiasts.
I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.
Problem:
This runs fine:
mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
This produces a segmentation fault:
mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
So the only difference is the number of cores I am using.
mdrun_mpi was compiled with the Intel compiler 11.1.072 against my own FFTW3 installation.
No errors came up during configure, make mdrun, or make install-mdrun.
Is there some issue with threading or MPI?
If someone has a clue, please give me a hint.
Parameters (.mdp):
integrator = md
dt = 0.004
nsteps = 25000000
nstxout = 0
nstvout = 0
nstlog = 250000
nstenergy = 250000
nstxtcout = 12500
xtc_grps = protein
energygrps = protein non-protein
nstlist = 2
ns_type = grid
rlist = 0.9
coulombtype = PME
rcoulomb = 0.9
fourierspacing = 0.12
pme_order = 4
ewald_rtol = 1e-5
rvdw = 0.9
pbc = xyz
periodic_molecules = yes
tcoupl = nose-hoover
nsttcouple = 1
tc-grps = protein non-protein
tau_t = 0.1 0.1
ref_t = 310 310
Pcoupl = no
gen_vel = yes
gen_temp = 310
gen_seed = 173529
constraints = all-bonds
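For orientation, here is a minimal Python sketch of what the step counts in these settings mean in simulated time. It is plain arithmetic on the values listed above, not a GROMACS tool; the helper name steps_to_ps is hypothetical, and dt is written in femtoseconds (0.004 ps = 4 fs) only to keep the arithmetic in integers.

```python
# Values copied from the .mdp settings above.
# dt = 0.004 ps is expressed as 4 fs so the multiplication stays exact.
DT_FS = 4
NSTEPS = 25_000_000
NSTXTCOUT = 12_500
NSTLOG = 250_000


def steps_to_ps(steps, dt_fs=DT_FS):
    """Convert a number of MD steps to simulated time in picoseconds."""
    return steps * dt_fs / 1000


total_ps = steps_to_ps(NSTEPS)
xtc_interval_ps = steps_to_ps(NSTXTCOUT)
log_interval_ps = steps_to_ps(NSTLOG)

print(f"total simulated time: {total_ps:.0f} ps ({total_ps / 1000:.0f} ns)")
print(f"xtc frame written every {xtc_interval_ps:.0f} ps")
print(f"log/energy written every {log_interval_ps:.0f} ps")
```

Under these settings the run covers 100 ns, with a compressed trajectory frame every 50 ps and a log and energy entry every 1 ns.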
Error:
Getting Loaded...
Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
Loaded with Money
NOTE: The load imbalance in PME FFT and solve is 48%.
For optimal PME load balancing
PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)
Step 0, time 0 (ps)
PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 96 exited on signal 6: Aborted
...
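The NOTE in the log states a divisibility condition between the PME grid (144) and the number of PME nodes along x (128). A small Python sketch of that check for the two core counts I tried follows. It only reproduces the arithmetic in the NOTE; the function name divides_evenly is hypothetical, and the sketch assumes the 72-core run also used all cores as PME nodes along x, as the 128-core log reports. It says nothing about whether the imbalance is related to the segmentation fault.

```python
# PME grid size along x and y, as reported in the log above.
PME_GRID = 144


def divides_evenly(grid, nodes):
    """Return True if the grid dimension splits evenly across the nodes."""
    return grid % nodes == 0


# Core counts from the two mpiexec commands (assumed to equal #PME_nodes_x).
for nodes in (72, 128):
    remainder = PME_GRID % nodes
    status = "divisible" if divides_evenly(PME_GRID, nodes) else "not divisible"
    print(f"{PME_GRID} grid points over {nodes} nodes: {status} (remainder {remainder})")
```

With 72 nodes the grid divides evenly (two grid lines per node), while with 128 nodes it does not, which matches the load-imbalance NOTE appearing in the 128-core run.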
P.S. For now I don't care about the imbalanced PME load, as long as it is unrelated to my problem.
Cheers
André