[gmx-users] Trouble balancing GPU/CPU force calculation load, ratio = 0.09
jason.hill at zoologi.su.se
Wed Dec 17 14:20:23 CET 2014
Hi Szilard and list,
Thanks for the response. First, I experimented further with the number of MPI ranks. Optimal performance was reached when I used 24 MPI ranks and dedicated 12 of those to PME only. This resulted in fewer threads than logical cores, and in pinning being off. Even though I got a warning to that effect, performance still increased by 33%, and I am now simulating ~3 ns/day on a 90,000-atom system. Using 16 or 32 MPI ranks cuts performance in half, and I notice that the automatic PME mesh size stays much larger. Can someone please explain how these might be related? If I try to set the domain decomposition manually through mdrun -dd, I can’t choose a value that it seems to land on automatically; for example, 72 72 72 is said to be too small even though the PME mesh ends up there automatically anyway. Am I misunderstanding PME mesh size vs. domain decomposition?
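For readers following along, here is a minimal sketch of the kind of launch line described above. The rank counts (24 total, 12 PME-only) come from the message. The OpenMP thread count per rank, the input file name topol.tpr, and the 4 x 3 x 1 grid are assumptions chosen only for illustration. As far as I understand it, -dd expects the number of domain cells per dimension, and their product has to equal the number of particle-particle (PP) ranks, which is a different quantity from the PME Fourier grid size reported in the log.

```shell
# Values from the message: 24 thread-MPI ranks, 12 of them PME-only.
NTMPI=24
NPME=12
NTOMP=2                 # assumption: OpenMP threads per rank (not stated in the message)
NPP=$((NTMPI - NPME))   # PP ranks; these are the ranks that share the GPU

# Build the -gpu_id string: one digit per PP rank, all mapped to GPU 0.
GPU_ID=""
i=0
while [ "$i" -lt "$NPP" ]; do
    GPU_ID="${GPU_ID}0"
    i=$((i + 1))
done

# Hypothetical manual decomposition grid; its product must match the PP rank count.
DD_X=4; DD_Y=3; DD_Z=1
DD_CELLS=$((DD_X * DD_Y * DD_Z))

# topol.tpr is a placeholder input file name.
CMD="mdrun -ntmpi $NTMPI -npme $NPME -ntomp $NTOMP -gpu_id $GPU_ID -pin on -s topol.tpr"
CMD_DD="$CMD -dd $DD_X $DD_Y $DD_Z"

echo "automatic decomposition: $CMD"
echo "manual decomposition:    $CMD_DD"
echo "threads in use: $((NTMPI * NTOMP)); PP ranks: $NPP; DD cells: $DD_CELLS"
```

The script only prints the command lines so they can be inspected before running. With these assumed numbers, 48 threads are started, which is consistent with "fewer threads than logical cores" on a 64-core node.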
Second, I increased the GPU clock speed to its maximum of 875 MHz, but saw no improvement. In fact, monitoring GPU usage showed that it never exceeded 20%! I’m somewhat at a loss as to how I can further optimize my run and use my GPU more efficiently. Any further pointers here would be much appreciated.
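A hedged sketch of how the application clocks and utilization mentioned above can be checked with nvidia-smi. The 875 MHz graphics clock is from the message. The 3004 MHz memory clock and the GPU index 0 are assumptions, so the supported-clocks query should be consulted for the values the card actually offers.

```shell
GPU_INDEX=0          # assumption: the K40 is device 0
GRAPHICS_MHZ=875     # maximum graphics clock quoted in the message
MEM_MHZ=3004         # assumption: memory clock paired with it; verify below

if command -v nvidia-smi >/dev/null 2>&1; then
    # List the memory/graphics clock pairs the card supports.
    nvidia-smi -i "$GPU_INDEX" -q -d SUPPORTED_CLOCKS
    # Show the configured application clock, the current SM clock, and utilization.
    nvidia-smi -i "$GPU_INDEX" \
        --query-gpu=clocks.applications.graphics,clocks.sm,utilization.gpu \
        --format=csv
else
    echo "nvidia-smi not found; skipping GPU queries"
fi

# Setting application clocks usually needs root, so the command is only printed here.
SET_CLOCKS_CMD="nvidia-smi -i $GPU_INDEX -ac $MEM_MHZ,$GRAPHICS_MHZ"
echo "to set application clocks: $SET_CLOCKS_CMD"
```

Running the query while mdrun is active shows whether the SM clock actually reaches the configured value and how utilization changes with different rank and thread counts.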
The log file for the latest run is here: https://drive.google.com/file/d/0BwAaTxAET7c5bXE2bUtfWVp4dE0/view?usp=sharing
Jason Hill, Ph.D.
D-419 Svante Arrhenius v 18B
S-10691 Stockholm Sweden
> Date: Tue, 16 Dec 2014 19:05:52 +0100
> From: Szilárd Páll <pall.szilard at gmail.com>
> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
> Subject: Re: [gmx-users] Trouble balancing GPU/CPU force calculation
> load, ratio = 0.09
> Even 4 ranks x 16 threads is too much for AMD CPUs! In my experience
> the optimal is typically 2-8 threads/rank (depending on DD / imbalance
> behavior), so I suggest that you try these lower thread/rank counts.
> Also, make sure that the application clocks are set to max on that
> K40, otherwise you're missing 20% GPU performance!
> On Mon, Dec 15, 2014 at 12:57 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
>> from the log file it seems that you were actually using 64 OpenMP threads.
>> This is not very efficient; you could try to start mdrun with 4 thread-MPI
>> ranks (instead of 1), e.g.
>> mdrun -ntmpi 4 -gpu_id 0000 -s ?
>> Could it be that another process was running on your node while you
>> ran the simulation?
>> On 15 Dec 2014, at 12:45, Jason Hill <jason.hill at zoologi.su.se> wrote:
>>> Hello list,
>>> I am simulating a protein in water and am concerned that I am not using my hardware to the best of its abilities. Here (https://drive.google.com/file/d/0BwAaTxAET7c5VkZERkFsa1cyRlk/view?usp=sharing) is the log file from a 1 nanosecond simulation. The only piece of information missing from it that may be of use is that I am using the OPLS/AA force field. Additionally, GROMACS only seems to be using 8-12 cores of the 64 available despite its complaint that the GPU is being underutilized. Please take a look and, if you can, give me some advice about improving my simulation efficiency.
>>> Best regards,
>>> Jason Hill, Ph.D.
>>> Wheat Lab
>>> Zoologiska Institutionen
>>> Stockholms Universitet
>>> D-419 Svante Arrhenius v 18B
>>> S-10691 Stockholm Sweden