[gmx-users] gromacs on GPU

Szilárd Páll szilard.pall at cbr.su.se
Thu Jan 10 22:43:08 CET 2013


Hi,

On Thu, Jan 10, 2013 at 8:30 PM, James Starlight <jmsstarlight at gmail.com> wrote:

> Szilárd,
>
> There are no other CPU-intensive tasks running. Below you can see the output
> from top.
>
> 26553 own       20   0 28.4g 106m  33m S 285.6  0.7   2263:57 mdrun
>

This still shows that the average CPU utilization is only 285.6% instead of
400%, which matches what mdrun's log shows. Try a run with a very short
cut-off, one that gives a GPU/CPU balance <= 1 (i.e. no waiting on the GPU);
if you still don't get 400%, something strange is going on.
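
For a quick test like that, only the cut-offs in the .mdp need to change; a
minimal sketch, assuming the Verlet cut-off scheme and hypothetical file names
(0.8 nm is simply the shortest value you already tried, not a recommendation):

; short-cutoff test fragment for the .mdp file
cutoff-scheme = Verlet
rcoulomb      = 0.8
rvdw          = 0.8

grompp -f short.mdp -c md.gro -p topol.top -o short.tpr
mdrun -v -deffnm short

If the (%) column then still stays well below 400, the bottleneck is not the
GPU wait.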


>  1611 root      20   0  171m  65m  24m S   3.0  0.4   7:43.05 Xorg
> 29647 own       20   0  381m  22m  17m S   3.0  0.1   0:01.77
> mate-system-mon
>  2344 own       20   0  358m  17m  11m S   1.3  0.1   0:33.76 mate-terminal
> 29018 root      20   0     0    0    0 S   0.3  0.0   0:04.99 kworker/0:0
> 29268 root      20   0     0    0    0 S   0.3  0.0   0:00.22 kworker/u:2
> 29705 root      20   0     0    0    0 S   0.3  0.0   0:00.03 kworker/3:0
> 29706 own       20   0 23284 1648 1188 R   0.3  0.0   0:00.05 top
>     1 root      20   0  8584  872  736 S   0.0  0.0   0:02.34 init
>     2 root      20   0     0    0    0 S   0.0  0.0   0:00.02 kthreadd
>     3 root      20   0     0    0    0 S   0.0  0.0   0:00.57 ksoftirqd/0
>     6 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/0
>     7 root      rt   0     0    0    0 S   0.0  0.0   0:00.17 watchdog/0
>     8 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/1
>    10 root      20   0     0    0    0 S   0.0  0.0   0:00.43 ksoftirqd/1
>    12 root      rt   0     0    0    0 S   0.0  0.0   0:00.17 watchdog/1
>    13 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/2
>    15 root      20   0     0    0    0 S   0.0  0.0   0:00.37 ksoftirqd/2
>    16 root      rt   0     0    0    0 S   0.0  0.0   0:00.16 watchdog/2
>    17 root      rt   0     0    0    0 S   0.0  0.0   0:00.00 migration/3
>    19 root      20   0     0    0    0 S   0.0  0.0   0:00.38 ksoftirqd/3
>    20 root      rt   0     0    0    0 S   0.0  0.0   0:00.16 watchdog/3
>    21 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 cpuset
>    22 root       0 -20     0    0    0 S   0.0  0.0   0:00.00 khelper
>    23 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kdevtmpfs
>
>
> Usually I run my simulations with a simple mdrun -v -deffnm md.
> Should I specify the number of cores manually with the -nt (or -ntmpi)
>

If you just want to run on the full machine, simply launching it like that is
in most cases the optimal run configuration, or very close to it; i.e. in your
case:
mdrun
<=>
mdrun -ntmpi 1 -ntomp 4 -gpu_id 0 -pinht
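
The explicit form mainly matters when you do not want the whole machine; a
hedged sketch (the thread counts are illustrative only, and the file name md
is taken from your own command):

mdrun -v -deffnm md -ntmpi 1 -ntomp 2 -gpu_id 0

This would use only two OpenMP threads of the single thread-MPI rank and leave
the remaining cores free.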


> flag? Also I noticed that the -pinht flag could give me Hyper-Threading
> support. Is it reasonable to use in a simulation on CPU+GPU? What
>

Correctly using HT is also fully automatic and optimal as long as you are
using the full machine.


> other possible mdrun options should I consider? Finally, is it
> possible that the problems are due to the OpenMP (4.7.2) or Open MPI (1.4.5)
> libraries?
>

No, you are using the latest compiler versions, which is good. Other than my
earlier suggestions, there isn't much you can do to eliminate the idling on
the CPU (I assume that's what bothers you), except getting a faster GPU. By
the way, have you tried the hybrid GPU-CPU mode (although I don't expect it to
be faster)?
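
To be concrete, with your thread-MPI build the hybrid mode mentioned further
down in this thread would be launched roughly like this (a sketch; the 2x2
rank/thread split and the file name md are assumptions):

mdrun -ntmpi 2 -ntomp 2 -gpu_id 00 -nb gpu_cpu -v -deffnm md

This starts two ranks sharing GPU 0 and lets part of the non-bonded work run
on the CPU.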

Cheers,
--
Szilárd



>
>
> Thanks for help
>
> James
>
>
> 2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
> > On Thu, Jan 10, 2013 at 7:25 AM, James Starlight <jmsstarlight at gmail.com> wrote:
> >
> >> Szilárd,
> >>
> >> thanks again for the explanation!
> >>
> >> Today I've performed some tests on my calmodulin-in-water system with
> >> different cut-offs (I've used cut-offs of 1.0, 0.9 and 0.8 nm,
> >> respectively).
> >>
> >> Below you can see that the highest performance was with the 0.8 nm
> >> cut-off.
> >>
> >> all cut-offs 1.0
> >>  Force evaluation time GPU/CPU: 6.134 ms/4.700 ms = 1.305
> >>
> >> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> >>       performance loss, consider using a shorter cut-off and a finer PME
> >> grid.
> >>
> >>
> >>                Core t (s)   Wall t (s)        (%)
> >>        Time:     1313.420      464.035      283.0
> >>                  (ns/day)    (hour/ns)
> >> Performance:        9.310        2.578
> >> Finished mdrun on node 0 Thu Jan 10 09:39:23 2013
> >>
> >>
> >> all cut-offs 0.9
> >> Force evaluation time GPU/CPU: 4.951 ms/4.675 ms = 1.059
> >>
> >>                Core t (s)   Wall t (s)        (%)
> >>        Time:     2414.930      856.179      282.1
> >>                  (ns/day)    (hour/ns)
> >> Performance:       10.092        2.378
> >> Finished mdrun on node 0 Thu Jan 10 10:09:52 2013
> >>
> >> all cut-offs 0.8
> >>  Force evaluation time GPU/CPU: 4.001 ms/4.659 ms = 0.859
> >>
> >>                Core t (s)   Wall t (s)        (%)
> >>        Time:     1166.390      413.598      282.0
> >>                  (ns/day)    (hour/ns)
> >> Performance:       10.445        2.298
> >> Finished mdrun on node 0 Thu Jan 10 09:50:33 2013
> >>
> >> Also I've noticed that the usage of CPU cores 2-4 in the 2nd and 3rd cases
> >> was only 67%. Are there any other ways to increase performance via the
> >> neighbour-search parameters (e.g. nstlist etc.)?
> >>
> >
> > You can tweak nstlist, and with GPUs it often helps to increase it,
> > especially in parallel runs. However, as increasing nstlist requires a
> > larger rlist and more non-bonded calculations, this will not help you here.
> > You can try decreasing it to 10-15, which will increase the NS cost but
> > decrease the GPU time; it won't change the performance dramatically, though.
> >
> > What's strange is that your Core time/Wall time = (%) is quite low. If
> > you're running on four threads on an otherwise empty machine, you should
> > get close to 400 if the threads are not idling, e.g. waiting for the GPU.
> > For instance, in the rc=0.8 case you can see that the GPU/CPU balance is
> > <1.0, meaning that the GPU has less work than the CPU, in which case there
> > should be no idling and you should be getting (%) = 400.
> >
> > Long story short: are you sure you're not running anything else on the
> > computer while simulating? What do you get if you run on CPU only?
> >
> >> Might such a reduced cut-off be used with force fields (e.g. CHARMM)
> >> where using longer cut-offs has originally given better results
> >> (e.g. in charmm27 and gromos56 I always use 1.2 and 1.4 nm for rvdw,
> >> respectively)?
> >>
> >
> > No, at least not without *carefully* checking whether a shorter LJ cut-off
> > makes sense and does not break the physics of your simulation.
> >
> > Although we advise you to consider decreasing your cut-off - mostly because
> > these days a large number of simulations are carried out with overly long
> > cut-offs chosen by rule of thumb or folklore -, you should always either
> > make sure that this makes sense before doing it, or not do it at all.
> >
> > Cheers,
> > --
> > Szilárd
> >
> >
> >>
> >>
> >> James
> >>
> >> 2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
> >> > Hi James,
> >> >
> >> > The build looks mostly fine, except that you are using FFTW3 compiled
> >> > with AVX, which is slower than SSE-only (even on AVX-capable CPUs) - you
> >> > should have been warned about this at configure time.
> >> >
> >> > Now, performance-wise everything looks fine, except that with a 1.2 nm
> >> > cut-off your GPU is not able to keep up with the CPU and finish the
> >> > non-bonded work before the CPU is done with bonded + PME. That's why you
> >> > see "Wait GPU" taking 20% of the total time, and that's also why you see
> >> > some cores idling (for 20% of the run-time, thread 0 on core 0 is blocked
> >> > waiting for the GPU while the rest idle).
> >> >
> >> > As the suggestion at the end of the log file points out, you can
> >> > consider using a shorter cut-off, which will push more work back to PME
> >> > on the CPU, but whether you can do this depends on your particular
> >> > problem.
> >> >
> >> > There is one more alternative: running two MPI processes on the GPU
> >> > (mpirun -np 2 mdrun -gpu_id 00) and using the -nb gpu_cpu mode, which
> >> > will execute part of the non-bonded work on the CPU, but this might not
> >> > help.
> >> >
> >> > Cheers,
> >> >
> >> > --
> >> > Szilárd
> >> >
> >> >
> >> > On Wed, Jan 9, 2013 at 8:27 PM, James Starlight <jmsstarlight at gmail.com> wrote:
> >> >
> >> >> Dear Szilárd, thanks for help again!
> >> >>
> >> >> 2013/1/9 Szilárd Páll <szilard.pall at cbr.su.se>:
> >> >>
> >> >> >
> >> >> > There could be, but I/we can't tell without more information on what
> >> >> > and how you compiled and ran. The minimum we need is a log file.
> >> >> >
> >> >> I've compiled GROMACS 4.6-beta3 via a simple
> >> >>
> >> >>
> >> >> cmake CMakeLists.txt -DGMX_GPU=ON
> >> >> -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-5.0
> >> >> make
> >> >> sudo make install
> >> >>
> >> >> I have not added any special parameters to grompp or mdrun.
> >> >>
> >> >> After that I ran a test simulation of calmodulin in explicit water
> >> >> (60k atoms, 100 ps) and obtained the following output:
> >> >>
> >> >> Host: starlight  pid: 21028  nodeid: 0  nnodes:  1
> >> >> Gromacs version:    VERSION 4.6-beta3
> >> >> Precision:          single
> >> >> MPI library:        thread_mpi
> >> >> OpenMP support:     enabled
> >> >> GPU support:        enabled
> >> >> invsqrt routine:    gmx_software_invsqrt(x)
> >> >> CPU acceleration:   AVX_256
> >> >> FFT library:        fftw-3.3.2-sse2-avx
> >> >> Large file support: enabled
> >> >> RDTSCP usage:       enabled
> >> >> Built on:           Wed Jan  9 20:44:51 MSK 2013
> >> >> Built by:           own at starlight [CMAKE]
> >> >> Build OS/arch:      Linux 3.2.0-2-amd64 x86_64
> >> >> Build CPU vendor:   GenuineIntel
> >> >> Build CPU brand:    Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
> >> >> Build CPU family:   6   Model: 58   Stepping: 9
> >> >> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
> >> >> mmx msr nonstop_tsc pcid pclmuldq pdcm popcnt pse rdrnd rdtscp sse2
> >> >> sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> >> >> C compiler:         /usr/bin/gcc GNU gcc (Debian 4.6.3-11) 4.6.3
> >> >> C compiler flags:   -mavx  -Wextra -Wno-missing-field-initializers
> >> >> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
> >> >> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast  -O3
> >> >> -DNDEBUG
> >> >> C++ compiler:       /usr/bin/c++ GNU c++ (Debian 4.6.3-11) 4.6.3
> >> >> C++ compiler flags: -mavx  -Wextra -Wno-missing-field-initializers
> >> >> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
> >> >> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast  -O3
> >> >> -DNDEBUG
> >> >> CUDA compiler:      nvcc: NVIDIA (R) Cuda compiler driver;Copyright
> >> >> (c) 2005-2012 NVIDIA Corporation;Built on
> >> >> Fri_Sep_21_17:28:58_PDT_2012;Cuda compilation tools, release 5.0,
> >> >> V0.2.1221
> >> >> CUDA driver:        5.0
> >> >> CUDA runtime:       5.0
> >> >>
> >> >> ****************
> >> >>
> >> >>                Core t (s)   Wall t (s)        (%)
> >> >>        Time:     2770.700     1051.927      263.4
> >> >>                  (ns/day)    (hour/ns)
> >> >> Performance:        8.214        2.922
> >> >>
> >> >> full log can be found here http://www.sendspace.com/file/inum84
> >> >>
> >> >>
> >> >> Finally, when I checked the CPU usage I noticed that only 1 core was
> >> >> fully loaded (100%) and cores 2-4 were loaded at only 60%, and I got the
> >> >> strange result that the GPU is not shown as used (I've only monitored
> >> >> the temperature of the video card and noticed it increase up to 65
> >> >> degrees):
> >> >>
> >> >> +------------------------------------------------------+
> >> >> | NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
> >> >> |-------------------------------+----------------------+----------------------+
> >> >> | GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
> >> >> | Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
> >> >> |===============================+======================+======================|
> >> >> |   0  GeForce GTX 670          | 0000:02:00.0     N/A |                  N/A |
> >> >> | 38%   63C  N/A     N/A /  N/A |   9%  174MB / 2047MB |     N/A      Default |
> >> >> +-------------------------------+----------------------+----------------------+
> >> >>
> >> >> +-----------------------------------------------------------------------------+
> >> >> | Compute processes:                                               GPU Memory |
> >> >> |  GPU       PID  Process name                                     Usage      |
> >> >> |=============================================================================|
> >> >> |    0            Not Supported                                               |
> >> >> +-----------------------------------------------------------------------------+
> >> >>
> >> >>
> >> >> Thanks for help again,
> >> >>
> >> >> James


