[gmx-users] Trouble balancing GPU/CPU force calculation load, ratio = 0.09

Szilárd Páll pall.szilard at gmail.com
Wed Dec 17 22:10:01 CET 2014

Hi Jason,

Good point: separate PME ranks may well help in this case. On AMD
CPU-based machines (from 3-4 sockets and above) I typically use half
of the ranks for PME.

However, based on your log file, something is still not right: PME is
barely faster than with 64 OpenMP threads (59 vs 87 ms/step). The most
likely cause of the bad performance is the lack of pinning.

Try the following:
gmx mdrun -ntmpi 16 -ntomp 4 -npme 8
gmx mdrun -ntmpi 32 -ntomp 2 -npme 16
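
Both layouts multiply out to 64 threads (16 x 4 and 32 x 2), i.e. one
thread per core on this node, with half of the ranks doing PME. The
sketch below only checks that arithmetic and prints candidate command
lines; it does not run mdrun. NCORES=64 is taken from the log under
discussion, threads_per_rank is a hypothetical helper (not part of
GROMACS), and the explicit "-pin on" is an assumption added here to
request pinning rather than rely on mdrun's automatic choice.

```shell
#!/bin/sh
# Assumption: the node exposes 64 cores, as in the log discussed here.
NCORES=64

# Hypothetical helper: print the OpenMP threads per rank that fill the
# node exactly, or fail if the rank count does not divide NCORES.
threads_per_rank() {
    if [ $(( NCORES % $1 )) -ne 0 ]; then
        return 1
    fi
    echo $(( NCORES / $1 ))
}

for ntmpi in 16 32 24; do
    if ntomp=$(threads_per_rank "$ntmpi"); then
        # Half of the ranks are dedicated to PME, as suggested above.
        echo "gmx mdrun -ntmpi $ntmpi -ntomp $ntomp -npme $(( ntmpi / 2 )) -pin on"
    else
        echo "# $ntmpi ranks do not divide $NCORES cores evenly; some cores stay idle"
    fi
done
```

For 16 and 32 ranks this prints the two command lines above with
"-pin on" appended; 24 ranks is flagged because 64 is not a multiple
of 24.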

Additionally, do try the -ddorder pp_pme option; this will bring
your PME ranks closer to each other, possibly even keep them within a


On Wed, Dec 17, 2014 at 2:20 PM, Jason Hill <jason.hill at zoologi.su.se> wrote:
> Hi Szilard and list,
> Thanks for the response. First, I experimented further with the number of thread-MPI ranks. Optimal performance was reached when I used 24 MPI ranks and defined 12 of those to be used for PME only. This resulted in fewer threads than logical cores, and pinning being off. Even though I got a warning to that effect, performance still increased 33% and now I am simulating ~3 ns/day on a 90,000 atom system. Using 16 or 32 MPI ranks cuts performance in half, and I notice that the automatic PME mesh size stays much larger. Can someone please explain how these might be related? If I try to set domain decomposition manually through mdrun -dd, I can't choose a value that it seems to land on automatically; for example, 72 72 72 is said to be too small when the PME mesh ends up there automatically anyway. Am I misunderstanding PME mesh size vs domain decomposition?
> Second, I increased the GPU clock speed to its maximum of 875MHz, but saw no improvement. In fact, monitoring GPU usage showed that it never exceeded 20%! I’m somewhat at a loss for how I can further optimize my run, and more efficiently use my GPU. Any further pointers here would be much appreciated.
> The log file for the latest run is here: https://drive.google.com/file/d/0BwAaTxAET7c5bXE2bUtfWVp4dE0/view?usp=sharing
> Best regards,
> Jason
> Jason Hill, Ph.D.
> Wheat Lab
> Zoologiska Institutionen
> Stockholms Universitet
> D-419 Svante Arrhenius v 18B
> S-10691 Stockholm Sweden
>> Date: Tue, 16 Dec 2014 19:05:52 +0100
>> From: Szilárd Páll <pall.szilard at gmail.com>
>> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> Subject: Re: [gmx-users] Trouble balancing GPU/CPU force calculation
>>       load, ratio = 0.09
>> Message-ID:
>>       <CANnYEw5+=XzaZf8KadoLyHs=rFgZW+pP4W-myN9mgAaO6fgzvA at mail.gmail.com>
>> Content-Type: text/plain; charset=UTF-8
>> Even 4 ranks x 16 threads is too much for AMD CPUs! In my experience
>> the optimal is typically 2-8 threads/rank (depending on DD / imbalance
>> behavior), so I suggest that you try these lower thread/rank counts.
>> Also, make sure that the application clocks are set to max on that
>> K40, otherwise you're missing 20% GPU performance!
>> --
>> Szilárd
>> On Mon, Dec 15, 2014 at 12:57 PM, Carsten Kutzner <ckutzne at gwdg.de> wrote:
>>> Hi,
>>> from the log file it seems that you were actually using 64 OpenMP threads.
>>> This is not very efficient; you could try starting mdrun with 4 thread-MPI
>>> ranks (instead of 1), e.g.
>>> mdrun -ntmpi 4 -gpu_id 0000 -s ?
>>> Could it be that another process was running on your node while you
>>> ran the simulation?
>>> Carsten
>>> On 15 Dec 2014, at 12:45, Jason Hill <jason.hill at zoologi.su.se> wrote:
>>>> Hello list,
>>>> I am simulating a protein in water and am concerned that I am not using my hardware to the best of its abilities. Here (https://drive.google.com/file/d/0BwAaTxAET7c5VkZERkFsa1cyRlk/view?usp=sharing) is the log file from a 1 nanosecond simulation. The only piece of information missing from it that may be of use is that I am using the OPLS/AA force field. Additionally, GROMACS only seems to be using 8-12 cores of the 64 available despite its complaint that the GPU is being underutilized. Please take a look and if you can, give me some advice about improving my simulation efficiency.
>>>> Best regards,
>>>> Jason
>>>> Jason Hill, Ph.D.
>>>> Wheat Lab
>>>> Zoologiska Institutionen
>>>> Stockholms Universitet
>>>> D-419 Svante Arrhenius v 18B
>>>> S-10691 Stockholm Sweden
