[gmx-users] mdrun mpi segmentation fault in high load situation
Mark Abraham
Mark.Abraham at anu.edu.au
Thu Dec 23 13:12:27 CET 2010
On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
> Dear Gromacs Enthusiasts.
>
> I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.
>
> Problem:
> This runs fine:
> mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>
> This produces a segmentation fault:
> mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
Unless you know you need it, don't use -pd (particle decomposition). Domain
decomposition, the default, will be faster and is probably better
bug-tested too.
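For example, the same 128-rank run with -pd simply dropped (mdrun then uses
domain decomposition automatically; adding -npme -1, the default, lets mdrun
guess a suitable number of dedicated PME ranks, which you can tune later):

```shell
# Same run, but with domain decomposition instead of -pd.
# /../mdrun_mpi is the truncated path from the original command.
mpiexec -np 128 /../mdrun_mpi -npme -1 -s full031K_mdrun_ions.tpr
```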
Mark
> So the only difference is the number of cores I am using.
>
> mdrun_mpi was compiled using the intel compiler 11.1.072 with my own fftw3 installation.
>
> Configure, make mdrun, and make install-mdrun all completed without
> errors.
>
> Is there some issue with threading or mpi?
>
> If someone has a clue, please give me a hint.
>
>
> integrator = md
> dt = 0.004
> nsteps = 25000000
> nstxout = 0
> nstvout = 0
> nstlog = 250000
> nstenergy = 250000
> nstxtcout = 12500
> xtc_grps = protein
> energygrps = protein non-protein
> nstlist = 2
> ns_type = grid
> rlist = 0.9
> coulombtype = PME
> rcoulomb = 0.9
> fourierspacing = 0.12
> pme_order = 4
> ewald_rtol = 1e-5
> rvdw = 0.9
> pbc = xyz
> periodic_molecules = yes
> tcoupl = nose-hoover
> nsttcouple = 1
> tc-grps = protein non-protein
> tau_t = 0.1 0.1
> ref_t = 310 310
> Pcoupl = no
> gen_vel = yes
> gen_temp = 310
> gen_seed = 173529
> constraints = all-bonds
>
>
>
> Error:
> Getting Loaded...
> Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
> Loaded with Money
>
>
> NOTE: The load imbalance in PME FFT and solve is 48%.
> For optimal PME load balancing
> PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
> and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)
>
>
> Step 0, time 0 (ps)
> PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
> PSIlogger: Child with rank 96 exited on signal 6: Aborted
> ...
>
> PS: For now I don't care about the imbalanced PME load, unless it is
> connected to my problem.
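[The PME note in the log above can be checked directly: with a 144-point FFT
grid, load balances evenly only for PME rank counts that divide 144. A quick
sketch (a hypothetical helper, not part of GROMACS) for choosing a value to
pass to -npme:]

```python
# Hypothetical helper: list PME rank counts that divide the FFT grid
# dimension with no remainder, so the "grid (144) should be divisible
# by #PME_nodes (128)" warning from the log is avoided.
def even_pme_rank_counts(grid_dim, max_ranks):
    """Rank counts up to max_ranks that divide grid_dim evenly."""
    return [n for n in range(1, max_ranks + 1) if grid_dim % n == 0]

# grid_x = grid_y = grid_z = 144 in the log above
print(even_pme_rank_counts(144, 32))  # -> [1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 24]
```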
>
> Cheers
> André
>
> ------------------------------------------------------------------------------------------------
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> ------------------------------------------------------------------------------------------------