[gmx-users] gromacs on GPU

Thu Jan 10 07:25:28 CET 2013

Szilárd,

thanks again for the explanation!

Today I ran some tests on my calmodulin-in-water system with different
cut-offs (all cut-offs set to 1.0, 0.9 and 0.8 nm, respectively).

Below you can see that the highest performance was obtained with the 0.8 nm
cut-offs.

all cut-offs 1.0
 Force evaluation time GPU/CPU: 6.134 ms/4.700 ms = 1.305

NOTE: The GPU has >20% more load than the CPU. This imbalance causes
      performance loss, consider using a shorter cut-off and a finer PME grid.

               Core t (s)   Wall t (s)        (%)
       Time:     1313.420      464.035      283.0
                 (ns/day)    (hour/ns)
Performance:        9.310        2.578
Finished mdrun on node 0 Thu Jan 10 09:39:23 2013

all cut-offs 0.9
Force evaluation time GPU/CPU: 4.951 ms/4.675 ms = 1.059

               Core t (s)   Wall t (s)        (%)
       Time:     2414.930      856.179      282.1
                 (ns/day)    (hour/ns)
Performance:       10.092        2.378
Finished mdrun on node 0 Thu Jan 10 10:09:52 2013

all cut-offs 0.8
 Force evaluation time GPU/CPU: 4.001 ms/4.659 ms = 0.859

               Core t (s)   Wall t (s)        (%)
       Time:     1166.390      413.598      282.0
                 (ns/day)    (hour/ns)
Performance:       10.445        2.298
Finished mdrun on node 0 Thu Jan 10 09:50:33 2013
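
The ratios and hour/ns figures in the three logs above can be re-derived with a few lines of Python. This is only a sketch: the numbers are copied by hand from the logs, the names `runs` and `summarize` are made up for illustration, and the "wait" estimate assumes a simplified model in which the CPU idles only for the time by which the GPU force evaluation exceeds the CPU one.

```python
# Numbers copied from the three mdrun logs above
# (cut-off in nm -> GPU force time, CPU force time, ns/day).
runs = {
    1.0: {"gpu_ms": 6.134, "cpu_ms": 4.700, "ns_day": 9.310},
    0.9: {"gpu_ms": 4.951, "cpu_ms": 4.675, "ns_day": 10.092},
    0.8: {"gpu_ms": 4.001, "cpu_ms": 4.659, "ns_day": 10.445},
}


def summarize(run):
    """Return GPU/CPU load ratio, estimated CPU wait per step (ms), and hour/ns."""
    ratio = run["gpu_ms"] / run["cpu_ms"]
    # Assumption (simplified model): the CPU waits only when the GPU is slower.
    wait_ms = max(0.0, run["gpu_ms"] - run["cpu_ms"])
    hours_per_ns = 24.0 / run["ns_day"]
    return ratio, wait_ms, hours_per_ns


baseline = runs[1.0]["ns_day"]
for cutoff, run in runs.items():
    ratio, wait_ms, hours_per_ns = summarize(run)
    gain = 100.0 * (run["ns_day"] / baseline - 1.0)
    print(f"{cutoff:.1f} nm: GPU/CPU = {ratio:.3f}, wait ~ {wait_ms:.3f} ms/step, "
          f"{hours_per_ns:.3f} hour/ns, {gain:+.1f}% vs 1.0 nm")
```

Going from 1.0 to 0.8 nm moves the GPU/CPU ratio from above 1 to below 1, and the throughput gain over the 1.0 nm run is roughly 12%.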

I also noticed that in the second and third cases the usage of CPU cores 2-4
was only 67%. Are there any other ways to increase performance through the
neighbor-search parameters (e.g. nstlist)?
Can such reduced cut-offs be used with force fields (e.g. charmm) for which
longer cut-offs were originally found to give better results? (E.g. with
charmm27 and gromos56 I always use 1.2 and 1.4 nm for rvdw, respectively.)
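
For anyone repeating this kind of scan, the cut-off related part of the .mdp file could be generated along the following lines. This is a rough sketch, not the exact input used for the runs above: the helper name `mdp_cutoff_fragment` is hypothetical, `nstlist = 10` is an assumed placeholder, and it assumes the Verlet cut-off scheme with PME electrostatics and rvdw equal to rcoulomb. It says nothing about whether a shorter cut-off is appropriate for a given force field.

```python
def mdp_cutoff_fragment(cutoff_nm, nstlist=10):
    """Build the cut-off related lines of an .mdp file for one scan point.

    Assumptions: Verlet cut-off scheme, PME electrostatics, and the same
    value for rcoulomb and rvdw; nstlist=10 is only a placeholder default.
    """
    if cutoff_nm <= 0:
        raise ValueError("cut-off must be positive")
    settings = {
        "cutoff-scheme": "Verlet",
        "nstlist": nstlist,
        "coulombtype": "PME",
        "rcoulomb": f"{cutoff_nm:.1f}",
        "rvdw": f"{cutoff_nm:.1f}",
    }
    # One "key = value" line per setting, keys padded for readability.
    return "\n".join(f"{key:<15}= {value}" for key, value in settings.items())


# Print one fragment per cut-off tested above (1.0, 0.9 and 0.8 nm).
for cutoff in (1.0, 0.9, 0.8):
    print(f"; --- all cut-offs {cutoff:.1f} nm ---")
    print(mdp_cutoff_fragment(cutoff))
```

Each printed fragment would still have to be merged into a complete .mdp file before running grompp.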

James

2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
> Hi James,
>
> The build looks mostly fine except that you are using fftw3 compiled with
> AVX which is slower than with only SSE (even on AVX-capable CPUs) - you
> should have been warned about this at configure-time.
>
> Now, performance-wise everything looks fine except that with a 1.2 nm
> cut-off your GPU is not able to keep up with the CPU and finish the
> non-bonded work before the CPU is done with Bonded + PME. That's why you
> see the "Wait GPU" taking 20% of the total time and that's also why you see
> some cores idling (because for 20% of the run-time thread 0 on core 0
> is blocked waiting for the GPU while the rest idle).
>
> As the suggestion at the end of the log file points out, you can consider
> using a shorter cut-off, which will push more work back to the PME on the
> CPU, but whether you can do this depends on your particular problem.
>
> There is one more alternative of running two MPI processes on the GPU
> (mpirun -np 2 mdrun -gpu_id 00) and using the -nb gpu_cpu mode which will
> execute part of the nonbonded on the CPU, but this might not help.
>
> Cheers,
>
> --
> Szilárd
>
>
> On Wed, Jan 9, 2013 at 8:27 PM, James Starlight <jmsstarlight at gmail.com> wrote:
>
>> Dear Szilárd, thanks for help again!
>>
>> 2013/1/9 Szilárd Páll <szilard.pall at cbr.su.se>:
>>
>> >
>> > There could be, but I/we can't tell without more information on what and
>> > how you compiled and ran. The minimum we need is a log file.
>> >
>> I compiled GROMACS 4.6-beta3 simply with:
>>
>>
>> cmake CMakeLists.txt -DGMX_GPU=ON \
>>   -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-5.0
>> make
>> sudo make install
>>
>> I did not pass any special parameters to grompp or mdrun.
>>
>> After that I ran a 100 ps test simulation of calmodulin in explicit
>> water (60k atoms) and obtained the following output:
>>
>> Host: starlight  pid: 21028  nodeid: 0  nnodes:  1
>> Gromacs version:    VERSION 4.6-beta3
>> Precision:          single
>> MPI library:        thread_mpi
>> OpenMP support:     enabled
>> GPU support:        enabled
>> invsqrt routine:    gmx_software_invsqrt(x)
>> CPU acceleration:   AVX_256
>> FFT library:        fftw-3.3.2-sse2-avx
>> Large file support: enabled
>> RDTSCP usage:       enabled
>> Built on:           Wed Jan  9 20:44:51 MSK 2013
>> Built by:           own at starlight [CMAKE]
>> Build OS/arch:      Linux 3.2.0-2-amd64 x86_64
>> Build CPU vendor:   GenuineIntel
>> Build CPU brand:    Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
>> Build CPU family:   6   Model: 58   Stepping: 9
>> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
>> mmx msr nonstop_tsc pcid pclmuldq pdcm popcnt pse rdrnd rdtscp sse2
>> sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>> C compiler:         /usr/bin/gcc GNU gcc (Debian 4.6.3-11) 4.6.3
>> C compiler flags:   -mavx  -Wextra -Wno-missing-field-initializers
>> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
>> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast  -O3
>> -DNDEBUG
>> C++ compiler:       /usr/bin/c++ GNU c++ (Debian 4.6.3-11) 4.6.3
>> C++ compiler flags: -mavx  -Wextra -Wno-missing-field-initializers
>> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
>> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast  -O3
>> -DNDEBUG
>> CUDA compiler:      nvcc: NVIDIA (R) Cuda compiler driver;Copyright
>> (c) 2005-2012 NVIDIA Corporation;Built on
>> Fri_Sep_21_17:28:58_PDT_2012;Cuda compilation tools, release 5.0,
>> V0.2.1221
>> CUDA driver:        5.0
>> CUDA runtime:       5.0
>>
>> ****************
>>
>>                Core t (s)   Wall t (s)        (%)
>>        Time:     2770.700     1051.927      263.4
>>                  (ns/day)    (hour/ns)
>> Performance:        8.214        2.922
>>
>> The full log can be found here: http://www.sendspace.com/file/inum84
>>
>>
>> Finally, when I checked the CPU usage I noticed that only one core was fully
>> loaded (100%) while cores 2-4 were loaded to only 60%, and nvidia-smi gave me
>> the strange result that the GPU is not used (I only monitored the temperature
>> of the video card and noticed it rise to 65 degrees):
>>
>> +------------------------------------------------------+
>> | NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  GeForce GTX 670          | 0000:02:00.0     N/A |                  N/A |
>> | 38%   63C  N/A     N/A /  N/A |   9%  174MB / 2047MB |     N/A      Default |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Compute processes:                                               GPU Memory |
>> |  GPU       PID  Process name                                     Usage      |
>> |=============================================================================|
>> |    0            Not Supported                                               |
>> +-----------------------------------------------------------------------------+
>>
>>
>> Thanks for help again,
>>
>> James
>> --
>> gmx-users mailing list    gmx-users at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>> * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> * Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-users-request at gromacs.org.
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>