[gmx-users] gromacs on GPU

James Starlight jmsstarlight at gmail.com
Fri Jan 11 07:20:26 CET 2013


Szilárd,

The setup with 4 cores + 0.8 nm cut-offs has still been the best.

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 1652 own       20   0 28.4g 135m  33m R 288.8  0.8   4:30.33 mdrun

 Force evaluation time GPU/CPU: 5.257 ms/5.187 ms = 1.013
For optimal performance this ratio should be close to 1!


               Core t (s)   Wall t (s)        (%)
       Time:      494.240      171.719      287.8
                 (ns/day)    (hour/ns)
Performance:       10.064        2.385
Finished mdrun on node 0 Fri Jan 11 09:38:38 2013

I've tried combinations of different core numbers, but the results were
the same (below is an example with 2 cores + GPU):
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 1578 own       20   0 28.3g 163m  33m R 170.7  1.0   1:50.68 mdrun


Also, cut-offs lower than 0.8 produced the same results. When I used a
cut-off of 0.1 the simulation crashed :) Finally, increasing nstlist up to
50 also gave slightly better results (CPU usage up to 295%), but I'm not
sure about the influence of such a large value on other aspects of the
simulation.
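
For reference, here is a minimal sketch of the non-bonded .mdp settings I am
varying in these tests (the parameter names are the standard GROMACS 4.6
Verlet-scheme options; the values shown are from the 0.8 nm / nstlist 50 run):

cutoff-scheme   = Verlet
nstlist         = 50     ; 50 gave ~295% CPU usage here
rcoulomb        = 0.8    ; also tested 0.9 and 1.0
rvdw            = 0.8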

For these tests I'm using an Intel Core i5-3570 CPU (3.4 GHz / 4 cores /
HD Graphics 2500 / 1+6 MB cache) as well as a GeForce GTX 670 GPU.

Also I want to point out that all simulations have been run from the Debian
GNOME desktop. Should I run simulations from console mode only, to
kill all hidden CPU-consuming processes?

By the way, I wonder whether it is possible to use 2 GPUs at the same
time (in SLI mode)? How much might that increase overall performance?
In the future I'd like to build a new workstation with an 8-core i7 CPU + 2
GPUs. What would the performance of such a workstation be (in
comparison to a typical cluster of several nodes with 8-12 CPUs each)?
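
If two cards can be used, I guess the launch would look roughly like the line
below, with one thread-MPI rank per GPU rather than SLI (this is just my guess
from the mdrun options, so please correct me if it is wrong; the thread counts
assume the future 8-core CPU and two GPUs visible as IDs 0 and 1):

mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -v -deffnm md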


Thanks for suggestions,
James

2013/1/11 Szilárd Páll <szilard.pall at cbr.su.se>:
> Hi,
>
> On Thu, Jan 10, 2013 at 8:30 PM, James Starlight <jmsstarlight at gmail.com> wrote:
>
>> Szilárd,
>>
>> There are no other CPU-consuming tasks. Below you can see the output
>> from top.
>>
>> 26553 own       20   0 28.4g 106m  33m S 285.6  0.7   2263:57 mdrun
>>
>
> This still shows that the average CPU utilization is only 285.6 instead of
> 400, which matches what mdrun's log shows. Try to run with a very short
> cut-off, one which leads to a <=1 GPU/CPU balance (i.e. no waiting), and if
> you still don't get 400, something weird is going on.
>
>
>>  1611 root      20   0  171m  65m  24m S   3.0  0.4   7:43.05 Xorg
>> 29647 own       20   0  381m  22m  17m S   3.0  0.1   0:01.77
>> mate-system-mon
>>  2344 own       20   0  358m  17m  11m S   1.3  0.1   0:33.76 mate-terminal
>> 29018 root      20   0     0    0    0 S   0.3  0.0   0:04.99 kworker/0:0
>> 29268 root      20   0     0    0    0 S   0.3  0.0   0:00.22 kworker/u:2
>> 29705 root      20   0     0    0    0 S   0.3  0.0   0:00.03 kworker/3:0
>> 29706 own       20   0 23284 1648 1188 R   0.3  0.0   0:00.05 top
>>     1 root      20   0  8584  872  736 S   0.0  0.0   0:02.34 init
>>     2 root      20   0     0    0    0 S   0.0  0.0   0:00.02 kthreadd
>>     3 root      20   0     0    0    0 S   0.0  0.0   0:00.57 ksoftirqd/0
>>     6 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/0
>>     7 root      rt   0     0    0    0 S   0.0  0.0   0:00.17 watchdog/0
>>     8 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/1
>>    10 root      20   0     0    0    0 S   0.0  0.0   0:00.43 ksoftirqd/1
>>    12 root      rt   0     0    0    0 S   0.0  0.0   0:00.17 watchdog/1
>>    13 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/2
>>    15 root      20   0     0    0    0 S   0.0  0.0   0:00.37 ksoftirqd/2
>>    16 root      rt   0     0    0    0 S   0.0  0.0   0:00.16 watchdog/2
>>    17 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/3
>>    19 root      20   0     0    0    0 S   0.0  0.0   0:00.38 ksoftirqd/3
>>    20 root      rt   0     0    0    0 S   0.0  0.0   0:00.16 watchdog/3
>>    21 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 cpuset
>>    22 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 khelper
>>    23 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kdevtmpfs
>>
>>
>> Usually I run my simulations with a simple mdrun -v -deffnm md.
>> Should I specify the number of cores manually with the -nt (or -ntmpi)
>>
>
> If you just want to run on the full machine, simply running like that
> should in most cases still be the optimal run configuration or very close
> to the optimal, i.e. in your case:
> mdrun
> <=>
> mdrun -ntmpi 1 -ntomp 4 -gpu_id 0 -pinht
>
>
>> flag? Also I noticed that the -pinht flag could give me Hyper-Threading
>> support. Is that reasonable for a simulation on CPU+GPU? What
>>
>
> Correctly using HT is also fully automatic and optimal as long as you are
> using the full machine.
>
>
>> other mdrun options should I consider? Finally, is it possible that the
>> problems are due to the OpenMP (4.7.2) or Open MPI (1.4.5)
>> libraries?
>>
>
> No, you are using the latest compiler versions, which is good. Other than
> my earlier suggestions, there isn't much you can do to eliminate the idling
> on the CPU (I assume that's what bugs you), except getting a faster GPU.
> Btw, have you tried the hybrid GPU-CPU mode (although I expect it not to be
> faster)?
>
> Cheers,
> --
> Szilárd
>
>
>
>>
>>
>> Thanks for help
>>
>> James
>>
>>
>> 2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
>> > On Thu, Jan 10, 2013 at 7:25 AM, James Starlight
>> > <jmsstarlight at gmail.com> wrote:
>> >
>> >> Szilárd,
>> >>
>> >> thanks again for the explanation!
>> >>
>> >> Today I've performed some tests on my calmodulin-in-water system with
>> >> different cut-offs (I've used cut-offs of 1.0, 0.9 and 0.8,
>> >> respectively).
>> >>
>> >> Below you can see that the highest performance was in the case of the
>> >> 0.8 cut-offs.
>> >>
>> >> all cut-offs 1.0
>> >>  Force evaluation time GPU/CPU: 6.134 ms/4.700 ms = 1.305
>> >>
>> >> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> >>       performance loss, consider using a shorter cut-off and a finer PME
>> >> grid.
>> >>
>> >>
>> >>                Core t (s)   Wall t (s)        (%)
>> >>        Time:     1313.420      464.035      283.0
>> >>                  (ns/day)    (hour/ns)
>> >> Performance:        9.310        2.578
>> >> Finished mdrun on node 0 Thu Jan 10 09:39:23 2013
>> >>
>> >>
>> >> all cut-offs 0.9
>> >> Force evaluation time GPU/CPU: 4.951 ms/4.675 ms = 1.059
>> >>
>> >>                Core t (s)   Wall t (s)        (%)
>> >>        Time:     2414.930      856.179      282.1
>> >>                  (ns/day)    (hour/ns)
>> >> Performance:       10.092        2.378
>> >> Finished mdrun on node 0 Thu Jan 10 10:09:52 2013
>> >>
>> >> all cut-offs 0.8
>> >>  Force evaluation time GPU/CPU: 4.001 ms/4.659 ms = 0.859
>> >>
>> >>                Core t (s)   Wall t (s)        (%)
>> >>        Time:     1166.390      413.598      282.0
>> >>                  (ns/day)    (hour/ns)
>> >> Performance:       10.445        2.298
>> >> Finished mdrun on node 0 Thu Jan 10 09:50:33 2013
>> >>
>> >> Also I've noticed that the usage of CPU cores 2-4 in the 2nd and 3rd
>> >> cases was only 67%. Are there any other ways to increase performance by
>> >> means of neighbour-search parameters (e.g. nstlist)?
>> >>
>> >
>> > You can tweak nstlist, and it often helps to increase it with GPUs,
>> > especially in parallel. However, as increasing nstlist requires a larger
>> > rlist and more non-bonded calculations, this will not help you. You can
>> > try to decrease it to 10-15, which will increase the NS cost but decrease
>> > the GPU time, but it won't change the performance dramatically.
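>> >
>> > (Just as a sketch in case it helps, and not a recommendation: nstlist is
>> > set in the .mdp and needs a grompp re-run to take effect, e.g.
>> >
>> > cutoff-scheme = Verlet   ; the GPU path requires the Verlet scheme anyway
>> > nstlist       = 10       ; try values in the 10-15 range and compare ns/day
>> >
>> > with the rest of the settings unchanged.)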
>> >
>> > What's strange is that your Core time/Wall time = (%) is quite low. If
>> > you're running on four threads on an otherwise empty machine, you should
>> > get close to 400 if the threads are not idling, e.g. waiting for the GPU.
>> > For instance, in the rc=0.8 case you can see that the GPU/CPU balance is
>> > <1.0, meaning that the GPU has less work than the CPU, in which case there
>> > should be no idling and you should be getting (%) = 400.
>> >
>> > Long story short: are you sure you're not running anything else on the
>> > computer while simulating? What do you get if you run on CPU only?
>> >
>> >> Might such a reduced cut-off be used with force fields (e.g. charmm)
>> >> where longer cut-offs have originally given better results
>> >> (e.g. in charmm27 and gromos56 I always use 1.2 and 1.4 nm for rvdw,
>> >> respectively)?
>> >>
>> >
>> > No, at least not without *carefully* checking whether a shorter LJ
>> > cut-off makes sense and that it does not break the physics of your
>> > simulation.
>> >
>> > Although we advise you to consider decreasing your cut-off (mostly because
>> > these days a large number of simulations are carried out with an overly
>> > long cut-off chosen by rule of thumb or folklore), you should always
>> > either make sure that this makes sense before doing it, or not do it at
>> > all.
>> >
>> > Cheers,
>> > --
>> > Szilárd
>> >
>> >
>> >>
>> >>
>> >> James
>> >>
>> >> 2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
>> >> > Hi James,
>> >> >
>> >> > The build looks mostly fine, except that you are using fftw3 compiled
>> >> > with AVX, which is slower than with only SSE (even on AVX-capable
>> >> > CPUs); you should have been warned about this at configure time.
>> >> >
>> >> > Now, performance-wise everything looks fine, except that with a 1.2 nm
>> >> > cut-off your GPU is not able to keep up with the CPU and finish the
>> >> > non-bonded work before the CPU is done with bonded + PME. That's why
>> >> > you see the "Wait GPU" taking 20% of the total time, and that's also
>> >> > why you see some cores idling (because for 20% of the run-time thread 0
>> >> > on core 0 is blocked waiting for the GPU while the rest idle).
>> >> >
>> >> > As the suggestion at the end of the log file points out, you can
>> >> > consider using a shorter cut-off, which will push more work back to the
>> >> > PME on the CPU, but whether you can do this depends on your particular
>> >> > problem.
>> >> >
>> >> > There is one more alternative: running two MPI processes on the GPU
>> >> > (mpirun -np 2 mdrun -gpu_id 00) and using the -nb gpu_cpu mode, which
>> >> > will execute part of the non-bonded work on the CPU, but this might not
>> >> > help.
>> >> >
>> >> > Cheers,
>> >> >
>> >> > --
>> >> > Szilárd
>> >> >
>> >> >
>> >> > On Wed, Jan 9, 2013 at 8:27 PM, James Starlight
>> >> > <jmsstarlight at gmail.com> wrote:
>> >> >
>> >> >> Dear Szilárd, thanks for the help again!
>> >> >>
>> >> >> 2013/1/9 Szilárd Páll <szilard.pall at cbr.su.se>:
>> >> >>
>> >> >> >
>> >> >> > There could be, but I/we can't tell without more information on
>> >> >> > what and how you compiled and ran. The minimum we need is a log
>> >> >> > file.
>> >> >> >
>> >> >> I've compiled GROMACS 4.6-beta3 with a simple
>> >> >>
>> >> >>
>> >> >> cmake CMakeLists.txt -DGMX_GPU=ON
>> >> >> -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-5.0
>> >> >> make
>> >> >> sudo make install
>> >> >>
>> >> >> I have not added any special parameters to grompp or mdrun.
>> >> >>
>> >> >> After that I ran a test simulation of calmodulin in explicit water
>> >> >> (60k atoms, 100 ps) and obtained the following output:
>> >> >>
>> >> >> Host: starlight  pid: 21028  nodeid: 0  nnodes:  1
>> >> >> Gromacs version:    VERSION 4.6-beta3
>> >> >> Precision:          single
>> >> >> MPI library:        thread_mpi
>> >> >> OpenMP support:     enabled
>> >> >> GPU support:        enabled
>> >> >> invsqrt routine:    gmx_software_invsqrt(x)
>> >> >> CPU acceleration:   AVX_256
>> >> >> FFT library:        fftw-3.3.2-sse2-avx
>> >> >> Large file support: enabled
>> >> >> RDTSCP usage:       enabled
>> >> >> Built on:           Wed Jan  9 20:44:51 MSK 2013
>> >> >> Built by:           own at starlight [CMAKE]
>> >> >> Build OS/arch:      Linux 3.2.0-2-amd64 x86_64
>> >> >> Build CPU vendor:   GenuineIntel
>> >> >> Build CPU brand:    Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
>> >> >> Build CPU family:   6   Model: 58   Stepping: 9
>> >> >> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
>> >> >> mmx msr nonstop_tsc pcid pclmuldq pdcm popcnt pse rdrnd rdtscp sse2
>> >> >> sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>> >> >> C compiler:         /usr/bin/gcc GNU gcc (Debian 4.6.3-11) 4.6.3
>> >> >> C compiler flags:   -mavx  -Wextra -Wno-missing-field-initializers
>> >> >> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
>> >> >> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast  -O3
>> >> >> -DNDEBUG
>> >> >> C++ compiler:       /usr/bin/c++ GNU c++ (Debian 4.6.3-11) 4.6.3
>> >> >> C++ compiler flags: -mavx  -Wextra -Wno-missing-field-initializers
>> >> >> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
>> >> >> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast  -O3
>> >> >> -DNDEBUG
>> >> >> CUDA compiler:      nvcc: NVIDIA (R) Cuda compiler driver;Copyright
>> >> >> (c) 2005-2012 NVIDIA Corporation;Built on
>> >> >> Fri_Sep_21_17:28:58_PDT_2012;Cuda compilation tools, release 5.0,
>> >> >> V0.2.1221
>> >> >> CUDA driver:        5.0
>> >> >> CUDA runtime:       5.0
>> >> >>
>> >> >> ****************
>> >> >>
>> >> >>                Core t (s)   Wall t (s)        (%)
>> >> >>        Time:     2770.700     1051.927      263.4
>> >> >>                  (ns/day)    (hour/ns)
>> >> >> Performance:        8.214        2.922
>> >> >>
>> >> >> full log can be found here http://www.sendspace.com/file/inum84
>> >> >>
>> >> >>
>> >> >> Finally, when I checked CPU usage I noticed that only 1 core was
>> >> >> fully loaded (100%) and cores 2-4 were loaded at only 60%, but I got
>> >> >> the strange result that the GPU is not used (see the nvidia-smi output
>> >> >> below; I only monitored the temperature of the video card and noticed
>> >> >> it rise up to 65 degrees).
>> >> >>
>> >> >> +------------------------------------------------------+
>> >> >> | NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
>> >> >> |-------------------------------+----------------------+----------------------+
>> >> >> | GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
>> >> >> | Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
>> >> >> |===============================+======================+======================|
>> >> >> |   0  GeForce GTX 670          | 0000:02:00.0     N/A |                  N/A |
>> >> >> | 38%   63C  N/A     N/A /  N/A |   9%  174MB / 2047MB |     N/A      Default |
>> >> >> +-------------------------------+----------------------+----------------------+
>> >> >>
>> >> >> +-----------------------------------------------------------------------------+
>> >> >> | Compute processes:                                               GPU Memory |
>> >> >> |  GPU       PID  Process name                                     Usage      |
>> >> >> |=============================================================================|
>> >> >> |    0            Not Supported                                               |
>> >> >> +-----------------------------------------------------------------------------+
>> >> >>
>> >> >>
>> >> >> Thanks for the help again,
>> >> >>
>> >> >> James