Re: [gmx-users] mdrun mpi segmentation fault in high load situation

Mark Abraham Mark.Abraham at anu.edu.au
Thu Dec 23 22:34:42 CET 2010


On 24/12/2010 3:28 AM, Wojtyczka, André wrote:
>> On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
>>> Dear Gromacs Enthusiasts.
>>>
>>> I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.
>>>
>>> Problem:
>>> This runs fine:
>>> mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>>>
>>> This produces a segmentation fault:
>>> mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>> Unless you know you need it, don't use -pd. DD will be faster and is
>> probably better bug-tested too.
>>
>> Mark
> Hi Mark
>
> thanks for the push in that direction, but unfortunately I really need -pd:
> my system has long bonds, which is why it can be decomposed into only a
> small number of domains.

I'm not sure that PD has any advantage here. From memory, it has to
create a 128x1x1 grid, and you can request the same grid with DD
(mdrun's -dd option).
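
As a rough illustration of why the rank count matters for such a grid: the NOTE in the quoted output further down asks for the PME grid dimensions (144) to be divisible by the number of PME nodes along x (128). The Python sketch below only redoes that arithmetic; the function names are made up for illustration, and it says nothing about the cause of the segmentation fault itself.

```python
def divides_evenly(grid_points, nodes):
    """Return True if grid_points can be split into equal slabs across nodes."""
    return grid_points % nodes == 0


def even_node_counts(grid_points):
    """List every node count that splits grid_points without a remainder."""
    return [n for n in range(1, grid_points + 1) if grid_points % n == 0]


PME_GRID = 144  # grid_x and grid_y reported in the NOTE

# Compare the run that worked (72 ranks) with the one that failed (128 ranks).
for nodes in (72, 128):
    print(nodes, "ranks divide the grid evenly:", divides_evenly(PME_GRID, nodes))

# Node counts along x that would avoid the imbalance note for a 144-point grid.
print(even_node_counts(PME_GRID))
```

So 72 ranks split the 144-point grid into equal slabs while 128 do not, which is consistent with the imbalance note appearing only in the larger run.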

The contents of your .log file will be far more helpful than stdout in
diagnosing what condition led to the problem.
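
If the log is long, something like the following Python sketch can pull out the part worth posting. It assumes the log sits at mdrun's default name, md.log, and that interesting lines contain words such as "error", "warning", "note" or "fatal"; both the keyword list and the helper names are assumptions for illustration, not anything mdrun defines.

```python
from collections import deque

# Assumed markers of interesting lines; adjust to whatever your log contains.
KEYWORDS = ("error", "warning", "note", "fatal")


def last_lines(lines, n=40):
    """Return the final n lines of any iterable of lines (e.g. an open file)."""
    return list(deque(lines, maxlen=n))


def flagged(lines, keywords=KEYWORDS):
    """Keep only lines that mention one of the keywords, case-insensitively."""
    return [line for line in lines if any(k in line.lower() for k in keywords)]


# Usage sketch (md.log is an assumed path):
#
#     with open("md.log", errors="replace") as handle:
#         for line in last_lines(handle):
#             print(line, end="")
```

The tail of the log is usually where the state just before the crash is recorded, so posting those lines along with any flagged ones is a reasonable starting point.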

Mark

>>> So the only difference is the number of cores I am using.
>>>
>>> mdrun_mpi was compiled using the intel compiler 11.1.072 with my own fftw3 installation.
>>>
>>> No errors came up during configure, make mdrun, or make install-mdrun.
>>>
>>> Is there some issue with threading or mpi?
>>>
>>> If someone has a clue please give me a hint.
>>>
>>>
>>> integrator               = md
>>> dt                      = 0.004
>>> nsteps                  = 25000000
>>> nstxout                  = 0
>>> nstvout                  = 0
>>> nstlog                  = 250000
>>> nstenergy               = 250000
>>> nstxtcout               = 12500
>>> xtc_grps                 = protein
>>> energygrps               = protein non-protein
>>> nstlist                  = 2
>>> ns_type                  = grid
>>> rlist                    = 0.9
>>> coulombtype              = PME
>>> rcoulomb                 = 0.9
>>> fourierspacing           = 0.12
>>> pme_order                = 4
>>> ewald_rtol               = 1e-5
>>> rvdw                     = 0.9
>>> pbc                      = xyz
>>> periodic_molecules       = yes
>>> tcoupl                   = nose-hoover
>>> nsttcouple               = 1
>>> tc-grps                  = protein non-protein
>>> tau_t                    = 0.1 0.1
>>> ref_t                    = 310 310
>>> Pcoupl                   = no
>>> gen_vel                  = yes
>>> gen_temp                 = 310
>>> gen_seed                 = 173529
>>> constraints              = all-bonds
>>>
>>>
>>>
>>> Error:
>>> Getting Loaded...
>>> Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
>>> Loaded with Money
>>>
>>>
>>> NOTE: The load imbalance in PME FFT and solve is 48%.
>>>         For optimal PME load balancing
>>>         PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
>>>         and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)
>>>
>>>
>>> Step 0, time 0 (ps)
>>> PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
>>> PSIlogger: Child with rank 96 exited on signal 6: Aborted
>>> ...
>>>
>>> PS: for now I don't care about the imbalanced PME load unless it is related to my problem.
>>>
>>> Cheers
>>> André
>
> Forschungszentrum Juelich GmbH
> 52425 Juelich



