[gmx-users] gromacs on GPU
James Starlight
jmsstarlight at gmail.com
Thu Jan 10 20:30:31 CET 2013
Szilárd,
There are no other CPU-consuming tasks running. Below you can see the output from top.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26553 own 20 0 28.4g 106m 33m S 285.6 0.7 2263:57 mdrun
1611 root 20 0 171m 65m 24m S 3.0 0.4 7:43.05 Xorg
29647 own 20 0 381m 22m 17m S 3.0 0.1 0:01.77 mate-system-mon
2344 own 20 0 358m 17m 11m S 1.3 0.1 0:33.76 mate-terminal
29018 root 20 0 0 0 0 S 0.3 0.0 0:04.99 kworker/0:0
29268 root 20 0 0 0 0 S 0.3 0.0 0:00.22 kworker/u:2
29705 root 20 0 0 0 0 S 0.3 0.0 0:00.03 kworker/3:0
29706 own 20 0 23284 1648 1188 R 0.3 0.0 0:00.05 top
1 root 20 0 8584 872 736 S 0.0 0.0 0:02.34 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.57 ksoftirqd/0
6 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
7 root rt 0 0 0 0 S 0.0 0.0 0:00.17 watchdog/0
8 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
10 root 20 0 0 0 0 S 0.0 0.0 0:00.43 ksoftirqd/1
12 root rt 0 0 0 0 S 0.0 0.0 0:00.17 watchdog/1
13 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
15 root 20 0 0 0 0 S 0.0 0.0 0:00.37 ksoftirqd/2
16 root rt 0 0 0 0 S 0.0 0.0 0:00.16 watchdog/2
17 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
19 root 20 0 0 0 0 S 0.0 0.0 0:00.38 ksoftirqd/3
20 root rt 0 0 0 0 S 0.0 0.0 0:00.16 watchdog/3
21 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 cpuset
22 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 khelper
23 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
Usually I run my simulations with a simple mdrun -v -deffnm md.
Should I specify the number of cores manually with the -nt (or -ntmpi)
flag? I also noticed that the -pinht flag could give me Hyper-Threading
support; is that reasonable for a simulation on CPU+GPU? What other
mdrun options should I consider? Finally, is it possible that the
problems are due to the OpenMP (4.7.2) or Open MPI (1.4.5) libraries?
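For instance, would something along these lines be a sensible way to launch it
(this is just my guess at the syntax, with my 4 cores and the GPU as device 0):
mdrun -v -deffnm md -ntmpi 1 -ntomp 4 -gpu_id 0
or simply mdrun -v -deffnm md -nt 4?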
Thanks for the help
James
2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
> On Thu, Jan 10, 2013 at 7:25 AM, James Starlight <jmsstarlight at gmail.com> wrote:
>
>> Szilárd,
>>
>> thanks again for the explanation!
>>
>> Today I've performed some tests on my calmodulin-in-water system with
>> different cut-offs (I've used cut-offs of 1.0, 0.9 and 0.8 nm,
>> respectively).
>>
>> Below you can see that the highest performance was with the 0.8 nm cut-offs.
>>
>> all cut-offs 1.0
>> Force evaluation time GPU/CPU: 6.134 ms/4.700 ms = 1.305
>>
>> NOTE: The GPU has >20% more load than the CPU. This imbalance causes
>> performance loss, consider using a shorter cut-off and a finer PME
>> grid.
>>
>>
>> Core t (s) Wall t (s) (%)
>> Time: 1313.420 464.035 283.0
>> (ns/day) (hour/ns)
>> Performance: 9.310 2.578
>> Finished mdrun on node 0 Thu Jan 10 09:39:23 2013
>>
>>
>> all cut-offs 0.9
>> Force evaluation time GPU/CPU: 4.951 ms/4.675 ms = 1.059
>>
>> Core t (s) Wall t (s) (%)
>> Time: 2414.930 856.179 282.1
>> (ns/day) (hour/ns)
>> Performance: 10.092 2.378
>> Finished mdrun on node 0 Thu Jan 10 10:09:52 2013
>>
>> all cut-offs 0.8
>> Force evaluation time GPU/CPU: 4.001 ms/4.659 ms = 0.859
>>
>> Core t (s) Wall t (s) (%)
>> Time: 1166.390 413.598 282.0
>> (ns/day) (hour/ns)
>> Performance: 10.445 2.298
>> Finished mdrun on node 0 Thu Jan 10 09:50:33 2013
>>
>> Also, I've noticed that the usage of CPU cores 2-4 in the 2nd and 3rd cases
>> was only 67%. Are there any other ways to increase performance via the
>> neighbour-search parameters (e.g. nstlist)?
>>
>
> You can tweak nstlist, and increasing it often helps with GPUs,
> especially in parallel. However, as a larger nstlist requires a larger
> rlist and therefore more non-bonded calculations, that will not help you
> here. You can instead try decreasing it to 10-15, which will increase the
> NS cost but decrease the GPU time; it won't change the performance
> dramatically, though.
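>
> Just as a sketch of what I mean (assuming you keep the Verlet cut-off
> scheme you are already using), you would set e.g.
>
> nstlist = 10
>
> in the .mdp file and re-run grompp; mdrun may still adjust it at startup.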
>
> What's strange is that your Core time/Wall time = (%) is quite low. If
> you're running on four threads on an otherwise empty machine, you should
> get close to 400 if the threads are not idling, e.g. waiting for the GPU.
> For instance, in the rc=0.8 case you can see that the GPU/CPU balance is
> <1.0, meaning that the GPU has less work than the CPU, in which case there
> should be no idling and you should be getting (%) = 400.
>
> Long story short: are you sure you're not running anything else on the
> computer while simulating? What do you get if you run on CPU only?
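>
> (A CPU-only comparison run can be as simple as
>
> mdrun -nb cpu -v -deffnm md
>
> which forces the non-bonded kernels onto the CPU; take this as a sketch,
> the rest of your options stay the same.)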
>
>> Might such a reduced cut-off be used with force fields (e.g. CHARMM)
>> where longer cut-offs have originally given better results
>> (e.g. with charmm27 and gromos56 I always use 1.2 and 1.4 nm for rvdw,
>> respectively)?
>>
>
> No, at least not without *carefully* checking whether a shorter LJ cut-off
> makes sense and that it does not break the physics of your simulation.
>
> Although we advise you to consider decreasing your cut-off (mostly because
> these days a large number of simulations are carried out with overly long
> cut-offs chosen by rule of thumb or folklore), you should always either
> make sure that this makes sense before doing it or not do it at all.
>
> Cheers,
> --
> Szilárd
>
>
>>
>>
>> James
>>
>> 2013/1/10 Szilárd Páll <szilard.pall at cbr.su.se>:
>> > Hi James,
>> >
>> > The build looks mostly fine, except that you are using fftw3 compiled
>> > with AVX, which is slower than with only SSE (even on AVX-capable CPUs);
>> > you should have been warned about this at configure time.
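>> >
>> > (One way around this, as a sketch: either rebuild FFTW yourself with
>> > --enable-sse2 but without --enable-avx, or let the GROMACS build compile
>> > its own FFTW, e.g.
>> >
>> > cmake CMakeLists.txt -DGMX_GPU=ON -DGMX_BUILD_OWN_FFTW=ON
>> >
>> > with the rest of your CMake options unchanged.)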
>> >
>> > Now, performance-wise everything looks fine except that with a 1.2 nm
>> > cut-off your GPU is not able to keep up with the CPU and finish the
>> > non-bonded work before the CPU is done with Bonded + PME. That's why you
>> > see the "Wait GPU" taking 20% of the total time and that's also why you
>> > see some cores idling (because for 20% of the run-time thread 0 on core 0
>> > is blocked waiting for the GPU while the rest idle).
>> >
>> > As the suggestion at the end of the log file points out, you can consider
>> > using a shorter cut-off, which will push more work back to PME on the
>> > CPU, but whether you can do this depends on your particular problem.
>> >
>> > There is one more alternative: running two MPI processes on the GPU
>> > (mpirun -np 2 mdrun -gpu_id 00) and using the -nb gpu_cpu mode, which will
>> > execute part of the non-bondeds on the CPU, but this might not help.
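>> >
>> > (With the thread-MPI build you have, the equivalent would be something
>> > along the lines of
>> >
>> > mdrun -ntmpi 2 -gpu_id 00 -nb gpu_cpu -v -deffnm md
>> >
>> > i.e. two ranks both mapped to GPU 0; treat this as a sketch rather than
>> > a recommendation.)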
>> >
>> > Cheers,
>> >
>> > --
>> > Szilárd
>> >
>> >
>> > On Wed, Jan 9, 2013 at 8:27 PM, James Starlight <jmsstarlight at gmail.com> wrote:
>> >
>> >> Dear Szilárd, thanks for the help again!
>> >>
>> >> 2013/1/9 Szilárd Páll <szilard.pall at cbr.su.se>:
>> >>
>> >> >
>> >> > There could be, but I/we can't tell without more information on what
>> >> > and how you compiled and ran. The minimum we need is a log file.
>> >> >
>> >> I've compiled GROMACS 4.6-beta3 with a simple
>> >>
>> >>
>> >> cmake CMakeLists.txt -DGMX_GPU=ON
>> >> -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-5.0
>> >> make
>> >> sudo make install
>> >>
>> >> I have not added any special parameters to grompp or mdrun.
>> >>
>> >> After that I ran a test simulation of calmodulin in explicit
>> >> water (60k atoms) for 100 ps and obtained the following output:
>> >>
>> >> Host: starlight pid: 21028 nodeid: 0 nnodes: 1
>> >> Gromacs version: VERSION 4.6-beta3
>> >> Precision: single
>> >> MPI library: thread_mpi
>> >> OpenMP support: enabled
>> >> GPU support: enabled
>> >> invsqrt routine: gmx_software_invsqrt(x)
>> >> CPU acceleration: AVX_256
>> >> FFT library: fftw-3.3.2-sse2-avx
>> >> Large file support: enabled
>> >> RDTSCP usage: enabled
>> >> Built on: Wed Jan 9 20:44:51 MSK 2013
>> >> Built by: own at starlight [CMAKE]
>> >> Build OS/arch: Linux 3.2.0-2-amd64 x86_64
>> >> Build CPU vendor: GenuineIntel
>> >> Build CPU brand: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
>> >> Build CPU family: 6 Model: 58 Stepping: 9
>> >> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm
>> >> mmx msr nonstop_tsc pcid pclmuldq pdcm popcnt pse rdrnd rdtscp sse2
>> >> sse3 sse4.1 sse4.2 ssse3 tdt x2apic
>> >> C compiler: /usr/bin/gcc GNU gcc (Debian 4.6.3-11) 4.6.3
>> >> C compiler flags: -mavx -Wextra -Wno-missing-field-initializers
>> >> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
>> >> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast -O3
>> >> -DNDEBUG
>> >> C++ compiler: /usr/bin/c++ GNU c++ (Debian 4.6.3-11) 4.6.3
>> >> C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers
>> >> -Wno-sign-compare -Wall -Wno-unused -Wunused-value
>> >> -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast -O3
>> >> -DNDEBUG
>> >> CUDA compiler: nvcc: NVIDIA (R) Cuda compiler driver;Copyright
>> >> (c) 2005-2012 NVIDIA Corporation;Built on
>> >> Fri_Sep_21_17:28:58_PDT_2012;Cuda compilation tools, release 5.0,
>> >> V0.2.1221
>> >> CUDA driver: 5.0
>> >> CUDA runtime: 5.0
>> >>
>> >> ****************
>> >>
>> >> Core t (s) Wall t (s) (%)
>> >> Time: 2770.700 1051.927 263.4
>> >> (ns/day) (hour/ns)
>> >> Performance: 8.214 2.922
>> >>
>> >> full log can be found here http://www.sendspace.com/file/inum84
>> >>
>> >>
>> >> Finally, when I checked CPU usage I noticed that only 1 core was fully
>> >> loaded (100%) and cores 2-4 were loaded at only 60%; nvidia-smi also gave
>> >> me the strange result that the GPU appears unused (I've only monitored
>> >> the temperature of the video card and noticed it rise up to 65 degrees):
>> >>
>> >> +------------------------------------------------------+
>> >> | NVIDIA-SMI 4.304.54   Driver Version: 304.54          |
>> >> |-------------------------------+----------------------+----------------------+
>> >> | GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
>> >> | Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
>> >> |===============================+======================+======================|
>> >> |   0  GeForce GTX 670          | 0000:02:00.0     N/A |                  N/A |
>> >> | 38%   63C    N/A  N/A /   N/A |   9%  174MB / 2047MB |      N/A     Default |
>> >> +-------------------------------+----------------------+----------------------+
>> >>
>> >> +-----------------------------------------------------------------------------+
>> >> | Compute processes:                                               GPU Memory |
>> >> |  GPU       PID  Process name                                     Usage      |
>> >> |=============================================================================|
>> >> |    0            Not Supported                                               |
>> >> +-----------------------------------------------------------------------------+
>> >>
>> >>
>> >> Thanks for the help again,
>> >>
>> >> James