[gmx-users] cpu/gpu utilization

Mahmood Naderan nt_mahmood at yahoo.com
Fri Mar 2 11:01:23 CET 2018


With the command "gmx mdrun -nobackup -pme cpu -nb gpu -deffnm md_0_1", the log says:

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   16        501       0.972         55.965   0.8
 Launch GPU ops.        1   16      50001       2.141        123.301   1.7
 Force                  1   16      50001       4.019        231.486   3.1
 PME mesh               1   16      50001      40.695       2344.171  31.8
 Wait GPU NB local      1   16      50001      60.155       3465.079  47.0
 NB X/F buffer ops.     1   16      99501       7.342        422.902   5.7
 Write traj.            1   16         11       0.246         14.184   0.2
 Update                 1   16      50001       3.480        200.461   2.7
 Constraints            1   16      50001       5.831        335.878   4.6
 Rest                                           3.159        181.963   2.5
-----------------------------------------------------------------------------
 Total                                        128.039       7375.390 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1   16      50001      17.086        984.209  13.3
 PME gather             1   16      50001      12.534        722.007   9.8
 PME 3D-FFT             1   16     100002       9.956        573.512   7.8
 PME solve Elec         1   16      50001       0.779         44.859   0.6
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     2048.617      128.039     1600.0
                 (ns/day)    (hour/ns)
Performance:       67.481        0.356
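Some context for reading the first table. "Wait GPU NB local" is the time the CPU thread spends waiting for the GPU to return the local nonbonded forces. "PME mesh" runs on the CPU because of -pme cpu. Together these two rows take almost 80% of the wall time. The short Python sketch below recomputes the % column from the wall times. The numbers are copied by hand from the log, and nothing here calls GROMACS. Every timed row uses the same 16 threads, so the wall-time share matches the log's cycle-based percentage.

```python
# Wall times (s) copied by hand from the first log (-pme cpu -nb gpu).
RUN_GPU_NB = {
    "Neighbor search": 0.972,
    "Launch GPU ops.": 2.141,
    "Force": 4.019,
    "PME mesh": 40.695,
    "Wait GPU NB local": 60.155,
    "NB X/F buffer ops.": 7.342,
    "Write traj.": 0.246,
    "Update": 3.480,
    "Constraints": 5.831,
    "Rest": 3.159,
}
TOTAL_WALL = 128.039  # the "Total" row


def share(rows, total):
    """Percentage of the total wall time spent in each row."""
    return {name: 100.0 * t / total for name, t in rows.items()}


if __name__ == "__main__":
    # Print the rows from most to least expensive.
    for name, pct in sorted(share(RUN_GPU_NB, TOTAL_WALL).items(),
                            key=lambda kv: kv[1], reverse=True):
        print(f"{name:20s} {pct:5.1f} %")
```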






With the command "", I see that the GPU utilization is about 10%, and the log file says:

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 16 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   16       1251       6.912        398.128   2.3
 Force                  1   16      50001     210.689      12135.653  70.4
 PME mesh               1   16      50001      46.869       2699.656  15.7
 NB X/F buffer ops.     1   16      98751      22.315       1285.360   7.5
 Write traj.            1   16         11       0.216         12.447   0.1
 Update                 1   16      50001       4.382        252.386   1.5
 Constraints            1   16      50001       6.035        347.601   2.0
 Rest                                           1.666         95.933   0.6
-----------------------------------------------------------------------------
 Total                                        299.083      17227.165 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1   16      50001      21.505       1238.693   7.2
 PME gather             1   16      50001      12.089        696.333   4.0
 PME 3D-FFT             1   16     100002      11.627        669.705   3.9
 PME solve Elec         1   16      50001       0.965         55.598   0.3
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     4785.326      299.083     1600.0
                 (ns/day)    (hour/ns)
Performance:       28.889        0.831
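Judging only from the two logs, the second run appears to compute the nonbonded forces on the CPU. It has no "Launch GPU ops." or "Wait GPU NB local" rows, and "Force" grows from 4.0 s to 210.7 s. The minimal Python sketch below compares the two runs, assuming the figures are copied correctly from the logs. It also shows that the (hour/ns) column is simply 24 divided by (ns/day).

```python
# Figures copied by hand from the two logs above.
RUN_1 = {"ns_per_day": 67.481, "wall_s": 128.039}  # -pme cpu -nb gpu
RUN_2 = {"ns_per_day": 28.889, "wall_s": 299.083}  # second run (no GPU rows in its log)


def hours_per_ns(ns_per_day):
    """mdrun's (hour/ns) column: hours in a day divided by ns per day."""
    return 24.0 / ns_per_day


def speedup(fast, slow):
    """How many times more simulated time per day the faster run achieves."""
    return fast["ns_per_day"] / slow["ns_per_day"]


if __name__ == "__main__":
    for label, run in (("run 1", RUN_1), ("run 2", RUN_2)):
        print(f"{label}: {hours_per_ns(run['ns_per_day']):.3f} hour/ns")
    print(f"speedup: {speedup(RUN_1, RUN_2):.2f}x")
```

With these numbers the first run is about 2.3 times faster. This matches the ratio of total wall times for the same 50001 steps (299.083 s / 128.039 s).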




Using the GPU is still better than using the CPU alone. However, I see that while the GPU is in use, the CPU is also busy. So I suspect that the source code calls cudaDeviceSynchronize(), where the CPU enters a busy loop.
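The logs do not show whether mdrun actually spins inside a CUDA synchronization call. That is the hypothesis here. The CUDA runtime does let the host choose how it waits, through cudaSetDeviceFlags with cudaDeviceScheduleSpin, cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync. Which of these GROMACS 2018 uses is not confirmed here.

The sketch below is a generic illustration only: plain Python, not GROMACS or CUDA code. It compares the CPU time consumed by a busy-wait loop with that of a blocking wait of the same length. A spinning thread shows up as fully busy in top even though it does no useful work.

```python
import threading
import time


def spin_wait(seconds):
    """Busy-wait: keep polling the clock, burning CPU the whole time."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def blocking_wait(seconds):
    """Blocking wait: the OS suspends the thread on an Event nobody sets."""
    threading.Event().wait(seconds)


def cpu_seconds(fn, seconds):
    """CPU time (not wall time) consumed by this process while fn runs."""
    start = time.process_time()
    fn(seconds)
    return time.process_time() - start


if __name__ == "__main__":
    spin = cpu_seconds(spin_wait, 0.2)
    block = cpu_seconds(blocking_wait, 0.2)
    print(f"spin: {spin:.3f} s CPU, blocking: {block:.3f} s CPU")
```

Both calls take about 0.2 s of wall time. Only the spinning one accumulates a comparable amount of CPU time.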

Regards,
Mahmood 

    On Friday, March 2, 2018, 11:37:11 AM GMT+3:30, Magnus Lundborg <magnus.lundborg at scilifelab.se> wrote:  
 
 Have you tried the mdrun options:

-pme cpu -nb gpu
-pme cpu -nb cpu

Cheers,

Magnus
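To compare the suggested combinations on the same input, a hypothetical Python helper could run the original command line once per -pme/-nb pair. It assumes gmx is on PATH and md_0_1.tpr is in the working directory. With -nobackup, each run overwrites md_0_1.log, so the helper copies the log to a per-combination name after each run. That naming scheme is made up for this sketch. If gmx is not found, the helper only prints the commands.

```python
import shutil
import subprocess

# The two -pme/-nb combinations suggested in the reply above.
COMBOS = [("cpu", "gpu"), ("cpu", "cpu")]
DEFFNM = "md_0_1"


def mdrun_args(pme, nb, deffnm=DEFFNM):
    """Same command line as in the original post, with -pme/-nb varied."""
    return ["gmx", "mdrun", "-nobackup", "-pme", pme, "-nb", nb,
            "-deffnm", deffnm]


def log_copy_name(pme, nb, deffnm=DEFFNM):
    """Hypothetical per-combination name for keeping each run's log."""
    return f"{deffnm}_pme-{pme}_nb-{nb}.log"


def run_all():
    have_gmx = shutil.which("gmx") is not None
    for pme, nb in COMBOS:
        args = mdrun_args(pme, nb)
        print(" ".join(args))
        if have_gmx:
            subprocess.run(args, check=True)
            # -nobackup lets the next run overwrite <deffnm>.log, so copy it now.
            shutil.copy(f"{DEFFNM}.log", log_copy_name(pme, nb))


if __name__ == "__main__":
    run_all()
```

The performance sections of the saved logs can then be compared side by side.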

  

