Re: [gmx-users] mdrun mpi segmentation fault in high load situation

Mark Abraham Mark.Abraham at anu.edu.au
Thu Dec 23 22:46:15 CET 2010


On 24/12/2010 8:34 AM, Mark Abraham wrote:
> On 24/12/2010 3:28 AM, Wojtyczka, André wrote:
>>> On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
>>>> Dear Gromacs Enthusiasts.
>>>>
>>>> I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem 
>>>> cluster.
>>>>
>>>> Problem:
>>>> This runs fine:
>>>> mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>>>>
>>>> This produces a segmentation fault:
>>>> mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>>> Unless you know you need it, don't use -pd. DD will be faster and is
>>> probably better bug-tested too.
>>>
>>> Mark
>> Hi Mark
>>
>> thanks for the push in that direction, but I am in the unfortunate
>> situation where I really need -pd: I have long bonds, so my large
>> system can only be decomposed into a small number of domains.
>
> I'm not sure that PD has any advantage here. From memory it has to 
> create a 128x1x1 grid, and you can direct that with DD also.

See mdrun -h -hidden for -dd.
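Concretely, a DD run forcing the same 128x1x1 layout that PD would use might look like this (a sketch only; the tpr name is taken from the thread, and this assumes all 128 ranks are PP ranks, i.e. no separate PME nodes):

```shell
mpiexec -np 128 /../mdrun_mpi -dd 128 1 1 -s full031K_mdrun_ions.tpr
```

Any -dd grid whose product matches the number of PP ranks is accepted, e.g. -dd 8 4 4, which gives larger cells per dimension and may accommodate long bonds better than 128 thin slabs.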

Mark

> The contents of your .log file will be far more helpful than stdout in 
> diagnosing what condition led to the problem.
>
> Mark
>
>>>> So the only difference is the number of cores I am using.
>>>>
>>>> mdrun_mpi was compiled with the Intel compiler 11.1.072 and my own
>>>> fftw3 installation.
>>>>
>>>> During configure and make mdrun / make install-mdrun, no errors came
>>>> up.
>>>>
>>>> Is there some issue with threading or MPI?
>>>>
>>>> If someone has a clue please give me a hint.
>>>>
>>>>
>>>> integrator               = md
>>>> dt                       = 0.004
>>>> nsteps                   = 25000000
>>>> nstxout                  = 0
>>>> nstvout                  = 0
>>>> nstlog                   = 250000
>>>> nstenergy                = 250000
>>>> nstxtcout                = 12500
>>>> xtc_grps                 = protein
>>>> energygrps               = protein non-protein
>>>> nstlist                  = 2
>>>> ns_type                  = grid
>>>> rlist                    = 0.9
>>>> coulombtype              = PME
>>>> rcoulomb                 = 0.9
>>>> fourierspacing           = 0.12
>>>> pme_order                = 4
>>>> ewald_rtol               = 1e-5
>>>> rvdw                     = 0.9
>>>> pbc                      = xyz
>>>> periodic_molecules       = yes
>>>> tcoupl                   = nose-hoover
>>>> nsttcouple               = 1
>>>> tc-grps                  = protein non-protein
>>>> tau_t                    = 0.1 0.1
>>>> ref_t                    = 310 310
>>>> Pcoupl                   = no
>>>> gen_vel                  = yes
>>>> gen_temp                 = 310
>>>> gen_seed                 = 173529
>>>> constraints              = all-bonds
>>>>
>>>>
>>>>
>>>> Error:
>>>> Getting Loaded...
>>>> Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
>>>> Loaded with Money
>>>>
>>>>
>>>> NOTE: The load imbalance in PME FFT and solve is 48%.
>>>>         For optimal PME load balancing
>>>>         PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
>>>>         and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)
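The divisibility condition in that NOTE is easy to check ahead of a run; a minimal sketch (the helper name is mine, the grid and node counts come from the output above):

```python
# Check whether a PME grid divides evenly across the 2D PME node grid,
# restating the condition GROMACS prints in the NOTE above.
def pme_balanced(grid_x, grid_y, grid_z, nodes_x, nodes_y):
    # grid_x and grid_y must be divisible by the PME node count in x;
    # grid_y and grid_z must be divisible by the PME node count in y.
    return (grid_x % nodes_x == 0 and grid_y % nodes_x == 0
            and grid_y % nodes_y == 0 and grid_z % nodes_y == 0)

# Values from the run above: a 144x144x144 PME grid on 128x1 PME nodes.
print(pme_balanced(144, 144, 144, 128, 1))  # prints False: 144 % 128 != 0
```

Picking a PME node count that divides 144 (e.g. 16 or 24) would remove this particular imbalance, though as noted below it is likely unrelated to the crash itself.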
>>>>
>>>>
>>>> Step 0, time 0 (ps)
>>>> PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
>>>> PSIlogger: Child with rank 96 exited on signal 6: Aborted
>>>> ...
>>>>
>>>> PS: for now I'm ignoring the imbalanced PME load, assuming it's
>>>> independent of my problem.
>>>>
>>>> Cheers
>>>> André
>>
>> ------------------------------------------------------------------------------------------------ 
>>
>> ------------------------------------------------------------------------------------------------ 
>>
>> Forschungszentrum Juelich GmbH
>> 52425 Juelich
>> Sitz der Gesellschaft: Juelich
>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>> Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
>> Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
>> Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>> Prof. Dr. Sebastian M. Schmidt
>> ------------------------------------------------------------------------------------------------ 
>>
>> ------------------------------------------------------------------------------------------------ 
>>
>
