[gmx-users] Re: Re: why Blue Gene/Q is so slow? (Mark Abraham)

Mark Abraham Mark.Abraham at anu.edu.au
Tue Jul 17 16:58:20 CEST 2012


On 17/07/2012 7:06 PM, DeChang Li wrote:
>> Date: Tue, 17 Jul 2012 18:40:05 +1000
>> From: Mark Abraham <Mark.Abraham at anu.edu.au>
>> Subject: Re: [gmx-users] why Blue Gene/Q is so slow?
>> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> Message-ID: <500524E5.9050402 at anu.edu.au>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>> On 17/07/2012 5:00 PM, DeChang Li wrote:
>>> Dear all,
>>>
>>>        I am running a 9000 atom system with GBSA (Gromacs 4.5.5) in a
>>> Blue Gene/Q cluster. I got the speed 1.002 ns/day with 8 cores.
>>> However, in my own workstation with 8 cores the same system can reach
>>> nearly 10 ns/day (Intel(R) Xeon(R) CPU E5620  @ 2.40GHz). Can anyone
>>> tell me what's wrong in my simulation? Any suggestion will be
>>> appreciated.
>> Your workstation is running highly optimized SSE loops. BlueGene/Q is
>> not using its multiple FPUs because that code hasn't been written (for
>> explicit or implicit solvation), and BlueGene's processors are
>> probably slower too.
>>
>> Mark
> Does that mean the code itself is why BlueGene/Q reaches only 10% of
> the speed of the Intel workstation?

You'd see a comparable decrease if you turned off the SSE optimization
on your workstation, though perhaps not as severe. There's art and skill
in making code run fast, and it's very rare that you don't need to
target a specific architecture to achieve it.

> Is there any way to improve the speed on BG/Q?

Write the optimized code ;-) Also, use more of the machine: 9000 atoms on
8 cores is about 1125 atoms/core, and you can probably get down to 500
atoms/core or below. There will be a limit beyond which it's impossible
to go (or be effective). You can also try simulating without cut-offs
(see parts of section 7.3 of the manual and the mailing list
discussions), which uses different all-vs-all inner loops, but your
system might be too large for that to be useful.
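
As a rough sketch of what a cut-off-free setup could look like for the
md.mdp quoted below: the values follow the pbc, nstlist and rgbradii
descriptions in the 4.5 manual, are untested on BG/Q, and should be
checked against your GROMACS version. Only the lines that would change
are shown; the rest of the file is assumed to stay as it is.

```
; Sketch only: cut-off-free (all-vs-all) GBSA settings, GROMACS 4.5.x
pbc         = no       ; unchanged from the original file
nstlist     = 0        ; build the neighbour list once, never update it
ns_type     = simple   ; manual suggests simple search without cut-offs
rlist       = 0        ; 0 means no cut-off
rcoulomb    = 0
rvdw        = 0
rgbradii    = 0        ; assumption: kept equal to rlist, as 4.5 requires
coulombtype = cut-off  ; unchanged; with rcoulomb = 0 nothing is truncated
```

The manual also suggests particle decomposition (mdrun -pd) rather than
domain decomposition for runs without cut-offs. The all-vs-all loops
scale with the square of the atom count, so for 9000 atoms this may well
turn out slower than the 1.2 nm cut-off.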

Mark

>
> Dechang
>
>>> Following is my md.mdp file:
>>>
>>> constraints            = hbonds
>>> constraint_algorithm   = LINCS
>>> lincs_order            = 4
>>> comm_mode              = Angular
>>> comm_grps              = system
>>> integrator             = sd
>>> ;annealing           = single single
>>> ;annealing_npoints   = 2 2
>>> ;annealing_time      = 0 500 0 500
>>> ;annealing_temp      = 200 300 200 300
>>> dt                     = 0.002 ; ps !
>>> nsteps                 = 5000000 ; total 10000 ps.
>>> nstcomm                = 10
>>> nstcalcenergy           = 10
>>> nstxout                = 10000 ; collect data every 20 ps
>>> nstenergy              = 10000
>>> nstvout                = 10000
>>> nstlog                 = 1000
>>> ;nstxtcout              = 50000
>>> ;xtc_grps               = system
>>> nstfout                = 0
>>> nstlist                = 10
>>> ns_type                = grid
>>> pbc                    = no
>>> rlist                  = 1.2
>>> coulombtype            = cut-off
>>> rcoulomb               = 1.2
>>> rvdw                   = 1.2
>>> fourierspacing         = 0.12
>>> fourier_nx             = 0
>>> fourier_ny             = 0
>>> fourier_nz             = 0
>>> pme_order              = 4
>>> ewald_rtol             = 1e-5
>>> optimize_fft           = yes
>>> ;energygrps             = alpha1 alpha2 alpha3 beta1 beta2 beta3 gamma
>>> ;DispCorr               = EnerPres
>>> ; Tcoupl is unset; temperature is controlled by the sd integrator (one group)
>>> Tcoupl                 =
>>> tau_t                  = 0.5
>>> tc-grps                = system
>>> ref_t                  = 300
>>> ; Pressure coupling is off
>>> Pcoupl                 = no ;berendsen
>>> tau_p                  = 1.0
>>> compressibility        = 4.5e-5
>>> ref_p                  = 1.0
>>> ; Velocity generation is on at 300 K.
>>> gen_vel                = yes
>>> gen_temp               = 300
>>> gen_seed               = -1
>>>
>>> implicit_solvent       = GBSA
>>> gb_algorithm           = OBC
>>> rgbradii               = 1.2
>>> sa_surface_tension     = 2.25936
>>>
>>>
>>>
>>> Here is the performance info:
>>>
>>>           M E G A - F L O P S   A C C O U N T I N G
>>>
>>>      RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>>>      T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>>>      NF=No Forces
>>>
>>>    Computing:                               M-Number         M-Flops  % Flops
>>> -----------------------------------------------------------------------------
>>>    Generalized Born Coulomb                61.482892        2951.179     0.4
>>>    GB Coulomb + LJ                       2565.481100      156494.347    19.4
>>>    Outer nonbonded loop                   152.268546        1522.685     0.2
>>>    1,4 nonbonded interactions             116.143224       10452.890     1.3
>>>    Born radii (HCT/OBC)                  2868.222234      524884.669    64.9
>>>    Born force chain rule                 2868.222234       43023.334     5.3
>>>    NS-Pairs                               516.814696       10853.109     1.3
>>>    Reset In Box                             4.464788          13.394     0.0
>>>    CG-CoM                                   4.482576          13.448     0.0
>>>    Bonds                                   22.174434        1308.292     0.2
>>>    Angles                                  80.586114       13538.467     1.7
>>>    Propers                                160.742142       36809.951     4.6
>>>    Virial                                   4.636254          83.453     0.0
>>>    Update                                  44.478894        1378.846     0.2
>>>    Stop-CM                                  4.455894          44.559     0.0
>>>    Calc-Ekin                               44.487788        1201.170     0.1
>>>    Lincs                                   44.951630        2697.098     0.3
>>>    Lincs-Mat                              261.822552        1047.290     0.1
>>>    Constraint-V                            44.951630         359.613     0.0
>>>    Constraint-Vir                           2.251163          54.028     0.0
>>> -----------------------------------------------------------------------------
>>>    Total                                                  808731.820   100.0
>>> -----------------------------------------------------------------------------
>>>
>>>
>>>       D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>>>
>>>    av. #atoms communicated per step for force:  2 x 660.5
>>>    av. #atoms communicated per step for LINCS:  2 x 34.3
>>>
>>>    Average load imbalance: 1.7 %
>>>    Part of the total run time spent waiting due to load imbalance: 1.4 %
>>>
>>>
>>>        R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>>
>>>    Computing:         Nodes     Number     G-Cycles    Seconds     %
>>> -----------------------------------------------------------------------
>>>    Domain decomp.         8        502       59.421       37.1     0.5
>>>    DD comm. load          8          8        0.004        0.0     0.0
>>>    Comm. coord.           8       5001       16.575       10.4     0.2
>>>    Neighbor search        8        502      136.093       85.1     1.2
>>>    Force                  8       5001     9744.582     6090.7    88.3
>>>    Wait + Comm. F         8       5001       90.905       56.8     0.8
>>>    Write traj.            8          2        0.954        0.6     0.0
>>>    Update                 8       5001       72.936       45.6     0.7
>>>    Constraints            8      10002      171.445      107.2     1.6
>>>    Comm. energies         8        502       10.427        6.5     0.1
>>>    Rest                   8                 732.742      458.0     6.6
>>> -----------------------------------------------------------------------
>>>    Total                  8               11036.086     6897.9   100.0
>>> -----------------------------------------------------------------------
>>>
>>>           Parallel run - timing based on wallclock.
>>>
>>>                  NODE (s)   Real (s)      (%)
>>>          Time:    862.243    862.243    100.0
>>>                          14:22
>>>                  (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
>>> Performance:      3.047    937.940      1.002     23.946
>>> Finished mdrun on node 0 Tue Jul 17 16:06:48 2012
>>




