[gmx-users] gromacs.org_gmx-users Digest, Vol 128, Issue 91

Tue Dec 30 12:55:47 CET 2014

Hi Szilárd,

Benchmarking is complete; here is a summary and a link to the full log files.

The system is a protein in water, 84,920 atoms in total, using the GROMOS96 54a7 force field.
Logs: https://drive.google.com/file/d/0BwAaTxAET7c5V1FqeldpdkhZTFk/view?usp=sharing

#########################
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 4 -gpu_id 0000 -maxh 0.5 -g benchmark_1.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 4 -npme 2 -gpu_id 00 -maxh 0.5 -g benchmark_2.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 8 -gpu_id 00000000 -maxh 0.5 -g benchmark_3.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 8 -npme 2 -gpu_id 000000 -maxh 0.5 -g benchmark_4.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 8 -npme 4 -gpu_id 0000 -maxh 0.5 -g benchmark_5.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 16 -gpu_id 0000000000000000 -maxh 0.5 -g benchmark_6.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 16 -npme 4 -gpu_id 000000000000 -maxh 0.5 -g benchmark_7.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 16 -npme 8 -gpu_id 00000000 -maxh 0.5 -g benchmark_8.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 16 -npme 8 -gpu_id 00000000 -maxh 0.5 -ddorder pp_pme -g benchmark_9.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 32 -npme 0 -gpu_id 00000000000000000000000000000000 -maxh 0.5 -g benchmark_10.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 32 -npme 8 -gpu_id 000000000000000000000000 -maxh 0.5 -g benchmark_11.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 32 -npme 16 -gpu_id 0000000000000000 -maxh 0.5 -g benchmark_12.log
gmx mdrun -deffnm md_4-5_3 -v -ntmpi 32 -npme 16 -gpu_id 0000000000000000 -maxh 0.5 -ddorder pp_pme -g benchmark_13.log
gmx mdrun -deffnm md_4-5_3 -v -maxh 0.5 -g benchmark_14.log
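
A note on the -gpu_id strings above: each one carries one digit per PP (non-PME) rank, so its length is ntmpi minus npme, and every digit is 0 because there is a single GPU. Below is a minimal Python sketch of how such command lines could be generated. The helper name mdrun_command is made up for illustration and is not part of GROMACS; it only formats a string and assumes one GPU with id 0.

```python
def mdrun_command(ntmpi, npme=None, log="benchmark.log", ddorder=None):
    """Format an mdrun command line like the ones listed above.

    Hypothetical helper for illustration only; it does not run anything.
    Assumes a single GPU (id 0) shared by all PP ranks.
    """
    # Ranks not reserved for PME do the particle-particle (PP) work.
    pp_ranks = ntmpi - (npme or 0)

    parts = ["gmx mdrun -deffnm md_4-5_3 -v", f"-ntmpi {ntmpi}"]
    if npme is not None:
        parts.append(f"-npme {npme}")
    # One GPU id digit per PP rank.
    parts.append(f"-gpu_id {'0' * pp_ranks}")
    parts.append("-maxh 0.5")
    if ddorder:
        parts.append(f"-ddorder {ddorder}")
    parts.append(f"-g {log}")
    return " ".join(parts)


print(mdrun_command(4, log="benchmark_1.log"))
print(mdrun_command(16, npme=8, log="benchmark_9.log", ddorder="pp_pme"))
```

The two calls reproduce the benchmark_1 and benchmark_9 command lines, which makes it easy to check a new rank/PME split before launching it.
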

#########################          ns/day     hours/ns
benchmark_1.log:Performance:       43.942        0.546
benchmark_2.log:Performance:       34.681        0.692
benchmark_3.log:Performance:       50.854        0.472
benchmark_4.log:Performance:       39.170        0.613
benchmark_5.log:Performance:       45.999        0.522
benchmark_6.log:Performance:       42.964        0.559
benchmark_7.log:Performance:       46.112        0.520
benchmark_8.log:Performance:       48.916        0.491
benchmark_9.log:Performance:       47.524        0.505
benchmark_11.log:Performance:      28.543        0.841
benchmark_12.log:Performance:      43.401        0.553
benchmark_13.log:Performance:      43.758        0.548
benchmark_14.log:Performance:      29.141        0.824
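
For anyone who wants to rank these numbers rather than read them by eye, here is a minimal Python sketch. It assumes the table is pasted in as text exactly as above (benchmark_10 is absent there as well) and treats benchmark_14, the run with no threading options, as the baseline. The function name parse_performance is just an illustrative choice.

```python
# Performance lines copied from the table above (ns/day, hours/ns).
results = """\
benchmark_1.log:Performance:       43.942        0.546
benchmark_2.log:Performance:       34.681        0.692
benchmark_3.log:Performance:       50.854        0.472
benchmark_4.log:Performance:       39.170        0.613
benchmark_5.log:Performance:       45.999        0.522
benchmark_6.log:Performance:       42.964        0.559
benchmark_7.log:Performance:       46.112        0.520
benchmark_8.log:Performance:       48.916        0.491
benchmark_9.log:Performance:       47.524        0.505
benchmark_11.log:Performance:       28.543        0.841
benchmark_12.log:Performance:       43.401        0.553
benchmark_13.log:Performance:       43.758        0.548
benchmark_14.log:Performance:       29.141        0.824
"""


def parse_performance(text):
    """Return a list of (log name, ns/day, hours/ns) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name = fields[0].split(":")[0]
        rows.append((name, float(fields[1]), float(fields[2])))
    return rows


rows = parse_performance(results)
# Highest ns/day first.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)
# Baseline: the run started without any -ntmpi/-npme/-gpu_id options.
default = {name: ns_day for name, ns_day, _ in rows}["benchmark_14.log"]

for name, ns_day, hours_ns in ranked:
    print(f"{name:18s} {ns_day:7.3f} ns/day  {ns_day / default:.2f}x default")
```

Sorted this way, benchmark_3 (8 ranks, no separate PME ranks) comes out on top at 50.854 ns/day, roughly 1.7 times the default run, with benchmark_8 and benchmark_9 (16 ranks, 8 of them PME) next.
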

Let me know if you want any other info. Thanks again for the help!

Best,
Jason

Jason Hill, Ph.D.
Wheat Lab
Zoologiska Institutionen
Stockholms Universitet
D-419 Svante Arrhenius v 18B
S-10691 Stockholm Sweden

> On Dec 20, 2014, at 2:51 AM, gromacs.org_gmx-users-request at maillist.sys.kth.se wrote:
> 
> Hi Jason,
> 
> I'm glad to hear the tips helped! Rogue processes can indeed get in
> the way - perhaps luckily GROMACS is quite sensitive to other
> processes interfering and the low performance is often noticeable
> enough.
> 
> Would you mind posting the log files of your benchmark runs for me
> (and others) to see some examples of how mdrun behaves on a 4 CPU
> + 1 GPU system?
> 
> Regarding the GPU application clocks, in the next release mdrun will
> be able to warn you about the suboptimal GPU clock or even increase it
> if the system setup allows!
> 
> Cheers,
> --
> Szilárd
> 
> 
> On Fri, Dec 19, 2014 at 1:50 PM, Jason Hill <jason.hill at zoologi.su.se> wrote:
>> Hi Szilard,
>> 
>> Here is an update for you and anyone following this. It turns out that the primary problem was a single hanging process that I had not noticed, which was taking up a core and throwing off the whole thing. Once I ensured that the server was completely cleared of other processes, the default parameters gave me 30 ns/day, up from 2 ns/day. I'm actually glad I made this mistake, because otherwise I would probably have been content with the speed and not found out about the best way to use the various threading parameters or the clock speed increase on the GPU, so thank you, and know that your assistance was not in vain! I then tried different combinations of the thread parameters and ddorder and settled on:
>> 
>> gmx mdrun -deffnm md_4-5_3 -v -ntmpi 16 -gpu_id 00000000 -npme 8 -ddorder pp_pme
>> 
>> This combination, along with the increased clock speed of my GPU, is churning out almost 50 ns/day now for my system. That's really fantastic, and I'm quite grateful to you for your help.
>> 
>> Best regards,
>> Jason
>> 
>>> Hi Jason,
>>> 
>>> Good point, separate PME ranks may very well be able to help in this
>>> case. I typically use half of the ranks for PME with AMD CPU-based
>>> machines (from  3-4 sockets and above).
>>> 
>>> However, based on your log file something is still not right, PME is
>>> barely faster than with 64 OpenMP threads (59 vs 87 ms/step) and it's
>>> most likely the lack of pinning that leads to bad performance.
>>> 
>>> Try the following:
>>> gmx mdrun -ntmpi  16 -ntomp 4 -npme 8
>>> gmx mdrun -ntmpi  32 -ntomp 2 -npme 16
>>> 
>>> And additionally, do try the -ddorder pp_pme option, this will bring
>>> your PME ranks closer to each other possibly even keep them within a
>>> socket.
>>> 
>>> Cheers,
>>> --
>>> Szilárd
>>> 
>>> 
>>> On Wed, Dec 17, 2014 at 2:20 PM, Jason Hill <jason.hill at zoologi.su.se> wrote:
>>>> Hi Szilard and list,
>>>> 
>>>> Thanks for the response. First, I experimented further with the MPI thread number. Optimal performance was reached when I used 24 MPI ranks and dedicated 12 of those to PME only. This resulted in fewer threads than logical cores, and in pinning being off. Even though I got a warning to that effect, performance still increased by 33% and I am now simulating ~3 ns/day on a 90,000 atom system. Using 16 or 32 MPI ranks cuts performance in half, and I notice that the automatic PME mesh size stays much larger. Can someone please explain how these might be related? If I try to set the domain decomposition manually through mdrun -dd, I can't choose a value that it seems to land on automatically; for example, 72 72 72 is said to be too small when the PME mesh ends up there automatically anyway. Am I misunderstanding PME mesh size vs domain decomposition?
>>>> 
>>>> Second, I increased the GPU clock speed to its maximum of 875 MHz, but saw no improvement. In fact, monitoring GPU usage showed that it never exceeded 20%! I'm somewhat at a loss for how I can further optimize my run and use my GPU more efficiently. Any further pointers here would be much appreciated.
>>>> 
>>>> The log file for the latest run is here: https://drive.google.com/file/d/0BwAaTxAET7c5bXE2bUtfWVp4dE0/view?usp=sharing
>>>> 
>>>> Best regards,
>>>> Jason
>>>> 
>>>> Jason Hill, Ph.D.
>>>> Wheat Lab
>>>> Zoologiska Institutionen
>>>> Stockholms Universitet
>>>> D-419 Svante Arrhenius v 18B
>>>> S-10691 Stockholm Sweden
>>>> 
>>>>> Date: Tue, 16 Dec 2014 19:05:52 +0100
>>>>> From: Szilárd Páll <pall.szilard at gmail.com>
>>>>> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>>>>> Subject: Re: [gmx-users] Trouble balancing GPU/CPU force calculation
>>>>>     load, ratio = 0.09
>>>>> Message-ID:
>>>>>     <CANnYEw5+=XzaZf8KadoLyHs=rFgZW+pP4W-myN9mgAaO6fgzvA at mail.gmail.com>
>>>>> Content-Type: text/plain; charset=UTF-8
>>>>> 
>>>>> Even 4 ranks x 16 threads is too much for AMD CPUs! In my experience
>>>>> the optimal is typically 2-8 threads/rank (depending on DD / imbalance
>>>>> behavior), so I suggest that you try these lower thread/rank counts.
>>>>> 
>>>>> Also, make sure that the application clocks are set to max on that
>>>>> K40, otherwise you're missing 20% GPU performance!
>>>>> 
>>>>> --
>>>>> Szilárd
>>>>> 
>>>>> 
>>>>> On Mon, Dec 15, 2014 at 12:57 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> from the log file it seems that you were actually using 64 OpenMP threads.
>>>>>> This is not very efficient, you could try to start mdrun with 4 thread-MPI
>>>>>> ranks (instead of 1), e.g.
>>>>>> 
>>>>>> mdrun -ntmpi 4 -gpu_id 0000 -s ...
>>>>>> 
>>>>>> Could it be that another process was running on your node while you
>>>>>> ran the simulation?
>>>>>> 
>>>>>> Carsten
>>>>>> 
>>>>>> 
>>>>>> On 15 Dec 2014, at 12:45, Jason Hill <jason.hill at zoologi.su.se> wrote:
>>>>>> 
>>>>>>> Hello list,
>>>>>>> 
>>>>>>> I am simulating a protein in water and am concerned that I am not using my hardware to the best of its abilities. Here (https://drive.google.com/file/d/0BwAaTxAET7c5VkZERkFsa1cyRlk/view?usp=sharing) is the log file from a 1 nanosecond simulation. The only piece of information missing from it that may be of use is that I am using the OPLS/AA force field. Additionally, GROMACS only seems to be using 8-12 cores of the 64 available despite its complaint that the GPU is being underutilized. Please take a look and, if you can, give me some advice about improving my simulation efficiency.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Jason
>>>>>>> 
> 
