[gmx-users] mdrun mpi segmentation fault in high load situation

Wojtyczka, André a.wojtyczka at fz-juelich.de
Thu Dec 23 12:01:13 CET 2010


Dear Gromacs Enthusiasts.

I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.

Problem:
This runs fine:
mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

This produces a segmentation fault:
mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

So the only difference is the number of cores I am using.
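For comparison, I could also test the default domain decomposition instead of -pd; a rough sketch of what that would look like (the -npme value below is only a guess, not something I have run yet):

mpiexec -np 128 /../mdrun_mpi -s full031K_mdrun_ions.tpr -npme 32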

mdrun_mpi was compiled with the Intel compiler 11.1.072 against my own FFTW3 installation.

No errors came up during configure or during 'make mdrun' / 'make install-mdrun'.
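
For reference, the build went roughly like this (a sketch from memory; the FFTW path and compiler wrapper below are placeholders, not my exact settings):

export CC=mpicc                          # MPI wrapper around icc 11.1.072
export CPPFLAGS=-I/path/to/fftw3/include
export LDFLAGS=-L/path/to/fftw3/lib
./configure --enable-mpi --program-suffix=_mpi
make mdrun
make install-mdrun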

Is there some issue with threading or MPI?

If someone has a clue, please give me a hint.

My .mdp parameters:

integrator               = md
dt                       = 0.004
nsteps                   = 25000000
nstxout                  = 0
nstvout                  = 0
nstlog                   = 250000
nstenergy                = 250000
nstxtcout                = 12500
xtc_grps                 = protein
energygrps               = protein non-protein
nstlist                  = 2
ns_type                  = grid
rlist                    = 0.9
coulombtype              = PME
rcoulomb                 = 0.9
fourierspacing           = 0.12
pme_order                = 4
ewald_rtol               = 1e-5
rvdw                     = 0.9
pbc                      = xyz
periodic_molecules       = yes
tcoupl                   = nose-hoover
nsttcouple               = 1
tc-grps                  = protein non-protein
tau_t                    = 0.1 0.1
ref_t                    = 310 310
Pcoupl                   = no
gen_vel                  = yes
gen_temp                 = 310
gen_seed                 = 173529
constraints              = all-bonds
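
For completeness, the .tpr was generated from this .mdp in the usual way; the .mdp, structure, and topology file names below are just placeholders:

grompp -f full031K.mdp -c conf.gro -p topol.top -o full031K_mdrun_ions.tpr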



Error:
Getting Loaded...
Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
Loaded with Money


NOTE: The load imbalance in PME FFT and solve is 48%.
      For optimal PME load balancing
      PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
      and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)


Step 0, time 0 (ps)
PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 96 exited on signal 6: Aborted
...

PS: For now I don't care about the imbalanced PME load, unless it turns out to be related to my problem.
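
If a backtrace would help, I can try to capture core files from the failing run, roughly like this (assuming the batch environment allows core dumps; the core file name is just an example):

ulimit -c unlimited
mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
gdb /../mdrun_mpi core.<pid>    # inspect one of the resulting core files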

Cheers
André



