[gmx-users] Improving scaling - Gromacs 4.0 RC2
Justin A. Lemkul
jalemkul at vt.edu
Thu Oct 2 13:37:18 CEST 2008
Carsten Kutzner wrote:
> Hi Justin,
>
> I have written a small gmx tool that systematically tries various
> PME/PP balances for a given number of nodes and then suggests the
> fastest combination. Although I plan to extend it with more
> functionality, it is already working, and I can send it to you if
> you'd like to try it.
>
I would like to try it; I think that it would help me get an empirical feel for
things.
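In the meantime, a crude by-hand scan along these lines is roughly what
I had pictured (just a sketch of the idea, not your tool; the binary
name, .tpr file, and node counts are from my own setup, and -deffnm is
only there to keep the outputs apart):

  #!/bin/bash
  # try a few PME node counts for a fixed total of 64 nodes
  for npme in 8 16 24 32; do
      mdrun_mpi -s test.tpr -np 64 -npme $npme -deffnm npme_${npme}
  done
  # then compare the ns/day reported on the "Performance" line of each log
  grep -H "Performance" npme_*.log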
> The performance loss due to load imbalance has two causes:
> 1. imbalance in the force calculation, which can be levelled out by
> using -dlb yes or -dlb auto, as David suggested
Indeed, using -dlb yes improves the % imbalance, but at the cost of throughput. If
we take my 64-core case, applying "-dlb yes" reduces my speed to around 5-6
ns/day (depending on the number of PME nodes: 16, 24, or 32).
It just seems peculiar to me that I can get great speed in terms of ns/day with
-np 64 -npme 32, yet the performance is reported as hampered by imbalance.
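For completeness, the pair of runs I'm comparing there differs only in
the -dlb setting (same .tpr as before; this is a sketch of the command
lines rather than a verbatim copy of my job script):

  mdrun_mpi -s test.tpr -np 64 -npme 32            # defaults: ~9.6 ns/day, 40.9% loss
  mdrun_mpi -s test.tpr -np 64 -npme 32 -dlb yes   # ~5-6 ns/day, better balance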
> 2. imbalance between the short-range and long-range force calculation,
> which can be levelled out by choosing the optimal PME/PP ratio. This is
> what the script should do for you.
>
> You might also want to check the md.log file for more detailed
> information about where your imbalance is coming from. My guess is
> that with 32 PME nodes and interleaved communication the PP (short
> range) nodes have to wait for the PME (long range) nodes, while with
> 16 PME nodes it is the other way around.
> The fact that you see less imbalance with pp_pme or cartesian
> communication probably only means that the PME communication is
> slower in this case - the 'smaller' performance loss from imbalance
> is a bit misleading here.
>
Ah, that makes a bit more sense. Thanks :)
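For what it's worth, this is roughly how I have been skimming md.log for
those numbers (the exact wording of the lines may differ between
versions, so treat the patterns as approximate):

  # pull the load-balance statistics out of the log
  grep -i -e "load imbalance" -e "pme mesh/force load" md.log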
-Justin
> Carsten
>
>
> On 01.10.2008 at 23:18, Justin A. Lemkul wrote:
>
>>
>> Hi,
>>
>> I've been playing around with the latest release candidate of version
>> 4.0, and I was hoping someone out there more knowledgeable than me
>> might tell me how to improve a bit on the performance I'm seeing. To
>> clarify, the performance I'm seeing is a ton faster than 3.3.x, but I
>> still seem to be getting bogged down with the PME/PP balance. I'm
>> using mostly the default options with the new mdrun:
>>
>> mdrun_mpi -s test.tpr -np 64 -npme 32
>>
>> The system contains about 150,000 atoms - a membrane protein
>> surrounded by several hundred lipids and solvent (water). The protein
>> parameters are GROMOS, lipids are Berger, and water is SPC. My .mdp
>> file (adapted from a generic 3.3.x file that I have always used for
>> such simulations) is attached at the end of this mail. It seems that
>> my system runs fastest on 64 CPUs; almost all tests with 128 or 256
>> run slower. The nodes are dual-core 2.3 GHz Xserve G5s, connected by
>> InfiniBand.
>>
>> Here's a summary of some of the tests I've run:
>>
>> -np  -npme  -ddorder    ns/day  % performance loss from imbalance
>>  64    16   interleave   5.760  19.6
>>  64    32   interleave   9.600  40.9
>>  64    32   pp_pme       5.252   3.9
>>  64    32   cartesian    5.383   4.7
>>
>> All other mdrun command line options are defaults.
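>> (Each row in the table above is just the base command with -npme and
>> -ddorder varied, e.g.
>>
>> mdrun_mpi -s test.tpr -np 64 -npme 32 -ddorder pp_pme
>>
>> sketched here from memory rather than copied from my job scripts.)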
>>
>> I get ~10.3 ns/day with -np 256 -npme 64, but since -np 64 -npme 32
>> gives almost the same performance, there is no compelling reason to
>> tie up that many nodes.
>>
>> Any hints on how to speed things up any more? Is it possible? Not
>> that I'm complaining...the same system under GMX 3.3.3 gives just
>> under 1 ns/day :) I'm really curious about the 40.9% performance loss
>> I'm seeing with -np 64 -npme 32, even though it gives the best overall
>> performance in terms of ns/day.
>>
>> Thanks in advance for your attention, and any comments.
>>
>> -Justin
>>
>> =======test.mdp=========
>> title = NPT simulation for a membrane protein
>> ; Run parameters
>> integrator = md
>> dt = 0.002
>> nsteps = 10000 ; 20 ps
>> nstcomm = 1
>> ; Output parameters
>> nstxout = 500
>> nstvout = 500
>> nstfout = 500
>> nstlog = 500
>> nstenergy = 500
>> ; Bond parameters
>> constraint_algorithm = lincs
>> constraints = all-bonds
>> continuation = no ; starting up
>> ; Twin-range cutoff scheme, parameters for Gromos96
>> nstlist = 5
>> ns_type = grid
>> rlist = 0.8
>> rcoulomb = 0.8
>> rvdw = 1.4
>> ; PME electrostatics parameters
>> coulombtype = PME
>> fourierspacing = 0.24
>> pme_order = 4
>> ewald_rtol = 1e-5
>> optimize_fft = yes
>> ; V-rescale temperature coupling is on in three groups
>> Tcoupl = V-rescale
>> tc_grps = Protein POPC SOL_NA+_CL-
>> tau_t = 0.1 0.1 0.1
>> ref_t = 310 310 310
>> ; Pressure coupling is on
>> Pcoupl = Berendsen
>> pcoupltype = semiisotropic
>> tau_p = 2.0
>> compressibility = 4.5e-5 4.5e-5
>> ref_p = 1.0 1.0
>> ; Generate velocities is on
>> gen_vel = yes
>> gen_temp = 310
>> gen_seed = 173529
>> ; Periodic boundary conditions are on in all directions
>> pbc = xyz
>> ; Long-range dispersion correction
>> DispCorr = EnerPres
>>
>> ========end test.mdp==========
>>
>
>
--
========================================
Justin A. Lemkul
Graduate Research Assistant
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================