[gmx-users] Re: Re: why Blue Gene/Q is so slow? (Mark Abraham)
DeChang Li
li.dc06 at gmail.com
Tue Jul 17 11:06:40 CEST 2012
>Date: Tue, 17 Jul 2012 18:40:05 +1000
>From: Mark Abraham <Mark.Abraham at anu.edu.au>
>Subject: Re: [gmx-users] why Blue Gene/Q is so slow?
>To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>
>On 17/07/2012 5:00 PM, DeChang Li wrote:
>> Dear all,
>>
>> I am running a 9000-atom system with GBSA (Gromacs 4.5.5) on a
>> Blue Gene/Q cluster, and I get a speed of 1.002 ns/day on 8 cores.
>> However, on my own 8-core workstation (Intel(R) Xeon(R) CPU E5620 @
>> 2.40GHz) the same system reaches nearly 10 ns/day. Can anyone
>> tell me what is wrong with my simulation? Any suggestions would be
>> appreciated.
>
>Your workstation is running highly optimized SSE loops. Blue Gene/Q
>is not using its multiple FPUs because that code hasn't been written
>(for either explicit or implicit solvation), and Blue Gene's
>processors are probably slower too.
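>
>As a rough back-of-envelope check, here is a sketch of the expected
>gap (assuming the Xeon's 4-wide single-precision SSE kernels against
>scalar code on a ~1.6 GHz A2 core; the BG/Q clock speed is an
>assumption, and memory and pipeline effects are ignored):
>
>    xeon_clock = 2.4   # GHz, Intel Xeon E5620 (from the post)
>    bgq_clock = 1.6    # GHz, Blue Gene/Q PowerPC A2 (assumed)
>    sse_width = 4      # single-precision floats per SSE instruction
>    expected = (xeon_clock / bgq_clock) * sse_width
>    print("expected slowdown: ~%.0fx" % expected)        # ~6x
>    print("observed slowdown: ~%.0fx" % (10.0 / 1.002))  # ~10x
>
>The remainder plausibly comes from the A2's simpler in-order cores
>and less mature compiler code generation.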
>
>Mark
Does that mean the code itself is why Blue Gene/Q reaches only 10% of
the speed of the Intel workstation? Is there any way to improve the
speed on BG/Q?
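
(For reference, a quick Python check of the schedule the md.mdp below
implies, using the dt, nsteps, and nstxout values from the file:)

    dt = 0.002        # ps per step (from the .mdp)
    nsteps = 5000000
    nstxout = 10000
    print("total simulated time: %.0f ps" % (nsteps * dt))       # 10000 ps
    print("coordinates written every %.0f ps" % (nstxout * dt))  # 20 ps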
Dechang
>> Following is my md.mdp file:
>>
>> constraints = hbonds
>> constraint_algorithm = LINCS
>> lincs_order = 4
>> comm_mode = Angular
>> comm_grps = system
>> integrator = sd
>> ;annealing = single single
>> ;annealing_npoints = 2 2
>> ;annealing_time = 0 500 0 500
>> ;annealing_temp = 200 300 200 300
>> dt = 0.002 ; ps !
>> nsteps = 5000000 ; total 10000 ps.
>> nstcomm = 10
>> nstcalcenergy = 10
>> nstxout = 10000 ; collect data every 20 ps
>> nstenergy = 10000
>> nstvout = 10000
>> nstlog = 1000
>> ;nstxtcout = 50000
>> ;xtc_grps = system
>> nstfout = 0
>> nstlist = 10
>> ns_type = grid
>> pbc = no
>> rlist = 1.2
>> coulombtype = cut-off
>> rcoulomb = 1.2
>> rvdw = 1.2
>> fourierspacing = 0.12
>> fourier_nx = 0
>> fourier_ny = 0
>> fourier_nz = 0
>> pme_order = 4
>> ewald_rtol = 1e-5
>> optimize_fft = yes
>> ;energygrps = alpha1 alpha2 alpha3 beta1 beta2 beta3 gamma
>> ;DispCorr = EnerPres
>> ; Temperature coupling is handled by the sd integrator (one group)
>> Tcoupl =
>> tau_t = 0.5
>> tc-grps = system
>> ref_t = 300
>> ; Pressure coupling is off
>> Pcoupl = no ;berendsen
>> tau_p = 1.0
>> compressibility = 4.5e-5
>> ref_p = 1.0
>> ; Generate velocities is on at 300 K.
>> gen_vel = yes
>> gen_temp = 300
>> gen_seed = -1
>>
>> implicit_solvent = GBSA
>> gb_algorithm = OBC
>> rgbradii = 1.2
>> sa_surface_tension = 2.25936
>>
>>
>>
>> Here is the performance info:
>>
>> M E G A - F L O P S A C C O U N T I N G
>>
>> RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy
>> T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)
>> NF=No Forces
>>
>> Computing: M-Number M-Flops % Flops
>> -----------------------------------------------------------------------------
>> Generalized Born Coulomb 61.482892 2951.179 0.4
>> GB Coulomb + LJ 2565.481100 156494.347 19.4
>> Outer nonbonded loop 152.268546 1522.685 0.2
>> 1,4 nonbonded interactions 116.143224 10452.890 1.3
>> Born radii (HCT/OBC) 2868.222234 524884.669 64.9
>> Born force chain rule 2868.222234 43023.334 5.3
>> NS-Pairs 516.814696 10853.109 1.3
>> Reset In Box 4.464788 13.394 0.0
>> CG-CoM 4.482576 13.448 0.0
>> Bonds 22.174434 1308.292 0.2
>> Angles 80.586114 13538.467 1.7
>> Propers 160.742142 36809.951 4.6
>> Virial 4.636254 83.453 0.0
>> Update 44.478894 1378.846 0.2
>> Stop-CM 4.455894 44.559 0.0
>> Calc-Ekin 44.487788 1201.170 0.1
>> Lincs 44.951630 2697.098 0.3
>> Lincs-Mat 261.822552 1047.290 0.1
>> Constraint-V 44.951630 359.613 0.0
>> Constraint-Vir 2.251163 54.028 0.0
>> -----------------------------------------------------------------------------
>> Total 808731.820 100.0
>> -----------------------------------------------------------------------------
>>
>>
>> D O M A I N D E C O M P O S I T I O N S T A T I S T I C S
>>
>> av. #atoms communicated per step for force: 2 x 660.5
>> av. #atoms communicated per step for LINCS: 2 x 34.3
>>
>> Average load imbalance: 1.7 %
>> Part of the total run time spent waiting due to load imbalance: 1.4 %
>>
>>
>> R E A L C Y C L E A N D T I M E A C C O U N T I N G
>>
>> Computing: Nodes Number G-Cycles Seconds %
>> -----------------------------------------------------------------------
>> Domain decomp. 8 502 59.421 37.1 0.5
>> DD comm. load 8 8 0.004 0.0 0.0
>> Comm. coord. 8 5001 16.575 10.4 0.2
>> Neighbor search 8 502 136.093 85.1 1.2
>> Force 8 5001 9744.582 6090.7 88.3
>> Wait + Comm. F 8 5001 90.905 56.8 0.8
>> Write traj. 8 2 0.954 0.6 0.0
>> Update 8 5001 72.936 45.6 0.7
>> Constraints 8 10002 171.445 107.2 1.6
>> Comm. energies 8 502 10.427 6.5 0.1
>> Rest 8 732.742 458.0 6.6
>> -----------------------------------------------------------------------
>> Total 8 11036.086 6897.9 100.0
>> -----------------------------------------------------------------------
>>
>> Parallel run - timing based on wallclock.
>>
>> NODE (s) Real (s) (%)
>> Time: 862.243 862.243 100.0
>> 14:22
>> (Mnbf/s) (MFlops) (ns/day) (hour/ns)
>> Performance: 3.047 937.940 1.002 23.946
>> Finished mdrun on node 0 Tue Jul 17 16:06:48 2012
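
For what it is worth, the reported 1.002 ns/day is consistent with the
wall time above (a quick check, using the 5001 MD steps from the cycle
accounting and dt = 0.002 ps):

    steps = 5001
    dt = 0.002        # ps per step
    wall = 862.243    # seconds of wall time
    ns_per_day = steps * dt * 1e-3 / wall * 86400.0
    print("%.3f ns/day" % ns_per_day)   # ~1.002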