[gmx-users] Re: Re: why Blue Gene/Q is so slow? (Mark Abraham)

DeChang Li li.dc06 at gmail.com
Tue Jul 17 11:06:40 CEST 2012


>------------------------------
>
>Message: 8
>Date: Tue, 17 Jul 2012 18:40:05 +1000
>From: Mark Abraham <Mark.Abraham at anu.edu.au>
>Subject: Re: [gmx-users] why Blue Gene/Q is so slow?
>To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>Message-ID: <500524E5.9050402 at anu.edu.au>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>On 17/07/2012 5:00 PM, DeChang Li wrote:
>> Dear all,
>>
>>       I am running a 9000-atom system with GBSA (Gromacs 4.5.5) on a
>> Blue Gene/Q cluster. I get 1.002 ns/day with 8 cores.
>> However, on my own workstation with 8 cores the same system reaches
>> nearly 10 ns/day (Intel(R) Xeon(R) CPU E5620  @ 2.40GHz). Can anyone
>> tell me what is wrong with my simulation? Any suggestions will be
>> appreciated.
>
>Your workstation is running highly optimized SSE loops.
>BlueGene/Q is not using its multiple FPUs because that code hasn't been
>written (for explicit or implicit solvation), and BlueGene's processors
>are probably slower too.
>
>Mark

Does that mean the code itself is the reason BlueGene/Q reaches only
about 10% of the speed of my Intel workstation? Is there any way to
improve the speed on BG/Q?
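
For reference, the figures in the log below can be cross-checked with a
short calculation. This is only a sketch: it assumes the benchmark run
covered 5001 steps (the count shown in the cycle accounting table) at
dt = 0.002 ps, and it takes "nearly 10 ns/day" for the workstation at
face value rather than from a measured log. The function name
ns_per_day is mine, not something from GROMACS.

```python
def ns_per_day(steps: int, dt_ps: float, wall_seconds: float) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    simulated_ns = steps * dt_ps / 1000.0   # ps -> ns
    return simulated_ns * 86400.0 / wall_seconds

# Values read from the mdrun log: 5001 steps, 2 fs step, 862.243 s wall time.
bgq = ns_per_day(steps=5001, dt_ps=0.002, wall_seconds=862.243)

# Workstation figure as reported in the original question (approximate).
workstation = 10.0

print(f"BG/Q: {bgq:.3f} ns/day ({24.0 / bgq:.3f} hour/ns)")
print(f"BG/Q relative to workstation: {bgq / workstation:.0%}")
```

The result agrees with the ns/day and hour/ns values printed by mdrun,
so the factor of roughly ten is a real difference in throughput and not
a reporting artifact.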


Dechang




>> Following is my md.mdp file:
>>
>> constraints            = hbonds
>> constraint_algorithm   = LINCS
>> lincs_order            = 4
>> comm_mode              = Angular
>> comm_grps              = system
>> integrator             = sd
>> ;annealing           = single single
>> ;annealing_npoints   = 2 2
>> ;annealing_time      = 0 500 0 500
>> ;annealing_temp      = 200 300 200 300
>> dt                     = 0.002 ; ps !
>> nsteps                 = 5000000 ; total 10000 ps (10 ns).
>> nstcomm                = 10
>> nstcalcenergy           = 10
>> nstxout                = 10000 ; collect data every 20 ps
>> nstenergy              = 10000
>> nstvout                = 10000
>> nstlog                 = 1000
>> ;nstxtcout              = 50000
>> ;xtc_grps               = system
>> nstfout                = 0
>> nstlist                = 10
>> ns_type                = grid
>> pbc                    = no
>> rlist                  = 1.2
>> coulombtype            = cut-off
>> rcoulomb               = 1.2
>> rvdw                   = 1.2
>> fourierspacing         = 0.12
>> fourier_nx             = 0
>> fourier_ny             = 0
>> fourier_nz             = 0
>> pme_order              = 4
>> ewald_rtol             = 1e-5
>> optimize_fft           = yes
>> ;energygrps             = alpha1 alpha2 alpha3 beta1 beta2 beta3 gamma
>> ;DispCorr               = EnerPres
>> ; Berendsen temperature coupling is on in two groups
>> Tcoupl                 =
>> tau_t                  = 0.5
>> tc-grps                = system
>> ref_t                  = 300
>> ; Pressure coupling is on
>> Pcoupl                 = no ;berendsen
>> tau_p                  = 1.0
>> compressibility        = 4.5e-5
>> ref_p                  = 1.0
>> ; Generate velocities is on at 300 K.
>> gen_vel                = yes
>> gen_temp               = 300
>> gen_seed               = -1
>>
>> implicit_solvent       = GBSA
>> gb_algorithm           = OBC
>> rgbradii               = 1.2
>> sa_surface_tension     = 2.25936
>>
>>
>>
>> Here is the performance info:
>>
>>          M E G A - F L O P S   A C C O U N T I N G
>>
>>     RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
>>     T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
>>     NF=No Forces
>>
>>   Computing:                               M-Number         M-Flops  % Flops
>> -----------------------------------------------------------------------------
>>   Generalized Born Coulomb                61.482892        2951.179     0.4
>>   GB Coulomb + LJ                       2565.481100      156494.347    19.4
>>   Outer nonbonded loop                   152.268546        1522.685     0.2
>>   1,4 nonbonded interactions             116.143224       10452.890     1.3
>>   Born radii (HCT/OBC)                  2868.222234      524884.669    64.9
>>   Born force chain rule                 2868.222234       43023.334     5.3
>>   NS-Pairs                               516.814696       10853.109     1.3
>>   Reset In Box                             4.464788          13.394     0.0
>>   CG-CoM                                   4.482576          13.448     0.0
>>   Bonds                                   22.174434        1308.292     0.2
>>   Angles                                  80.586114       13538.467     1.7
>>   Propers                                160.742142       36809.951     4.6
>>   Virial                                   4.636254          83.453     0.0
>>   Update                                  44.478894        1378.846     0.2
>>   Stop-CM                                  4.455894          44.559     0.0
>>   Calc-Ekin                               44.487788        1201.170     0.1
>>   Lincs                                   44.951630        2697.098     0.3
>>   Lincs-Mat                              261.822552        1047.290     0.1
>>   Constraint-V                            44.951630         359.613     0.0
>>   Constraint-Vir                           2.251163          54.028     0.0
>> -----------------------------------------------------------------------------
>>   Total                                                  808731.820   100.0
>> -----------------------------------------------------------------------------
>>
>>
>>      D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>>
>>   av. #atoms communicated per step for force:  2 x 660.5
>>   av. #atoms communicated per step for LINCS:  2 x 34.3
>>
>>   Average load imbalance: 1.7 %
>>   Part of the total run time spent waiting due to load imbalance: 1.4 %
>>
>>
>>       R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>>
>>   Computing:         Nodes     Number     G-Cycles    Seconds     %
>> -----------------------------------------------------------------------
>>   Domain decomp.         8        502       59.421       37.1     0.5
>>   DD comm. load          8          8        0.004        0.0     0.0
>>   Comm. coord.           8       5001       16.575       10.4     0.2
>>   Neighbor search        8        502      136.093       85.1     1.2
>>   Force                  8       5001     9744.582     6090.7    88.3
>>   Wait + Comm. F         8       5001       90.905       56.8     0.8
>>   Write traj.            8          2        0.954        0.6     0.0
>>   Update                 8       5001       72.936       45.6     0.7
>>   Constraints            8      10002      171.445      107.2     1.6
>>   Comm. energies         8        502       10.427        6.5     0.1
>>   Rest                   8                 732.742      458.0     6.6
>> -----------------------------------------------------------------------
>>   Total                  8               11036.086     6897.9   100.0
>> -----------------------------------------------------------------------
>>
>>          Parallel run - timing based on wallclock.
>>
>>                 NODE (s)   Real (s)      (%)
>>         Time:    862.243    862.243    100.0
>>                         14:22
>>                 (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
>> Performance:      3.047    937.940      1.002     23.946
>> Finished mdrun on node 0 Tue Jul 17 16:06:48 2012
>
>
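
To see how much of the work sits in the Generalized Born kernels, the
M-Flops column of the accounting table above can be summed per group.
The sketch below copies four rows and the total from that table; the
grouping of those rows as "GB kernels" is my own reading of the row
names, not something mdrun reports.

```python
# M-Flops values copied from the "M E G A - F L O P S" table in the log.
mflops = {
    "Generalized Born Coulomb": 2951.179,
    "GB Coulomb + LJ": 156494.347,
    "Born radii (HCT/OBC)": 524884.669,
    "Born force chain rule": 43023.334,
}
total = 808731.820  # "Total" row of the same table

# Share of each GB-related row, and of the four rows together.
shares = {name: value / total for name, value in mflops.items()}
gb_share = sum(mflops.values()) / total

for name, share in shares.items():
    print(f"{name:28s} {share:6.1%}")
print(f"{'All GB-related rows':28s} {gb_share:6.1%}")
```

The Born radii row alone accounts for about 65% of the counted flops,
and the four rows together for roughly 90%, which fits with the Force
entry taking 88.3% of the run time in the cycle accounting.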


