[gmx-users] GTX 960 vs Tesla K40
Alex
nedomacho at gmail.com
Sun Jun 24 01:23:56 CEST 2018
Hi Szilárd,
Thanks for the suggestion on removing that separate PME rank: 113 ns/day
instead of 90 ns/day. ;) This is running on pretty much a piece of
garbage, and that compares to 320 ns/day on a much more powerful box
with four GPUs.
I am fine with the general concept of ranks as units of execution; what
I am not comfortable with is how to select, e.g., the number of threads
per rank depending on the system size, or whether to use a separate PME
rank at all. Say I build a system that is 3-4 times larger in XY: do I
keep all the mdrun scripts as they are, or do I go through the tuning
again for every new system?
Some sort of guideline with examples would be nice, or some automation
on the mdrun side, or maybe a web form that asks about system size and
the number of GPUs/CPU cores and spits out a starting point for an
"optimal" set of mdrun options. I am mostly learning this by varying
things (e.g. offloading or not offloading the PME FFTs with -pmefft, or
trying your suggestions), but a deep understanding is still lacking.
Alex
On 6/21/2018 10:02 AM, Szilárd Páll wrote:
> On Mon, Jun 18, 2018 at 11:35 PM Alex <nedomacho at gmail.com> wrote:
>
>> Persistence is enabled so I don't have to overclock again.
>
> Sure, makes sense. Note that strictly speaking this is not an "overclock",
> but a manual "boost clock" (to use the terminology CPU vendors use). Consumer
> GPUs automatically scale their clock speeds above their nominal/base clock
> (just as CPUs do), but Tesla GPUs don't; instead they leave that choice to
> the user (or put the burden on the user, if we want to look at it differently).
>
>
>> To be honest, I
>> am still not entirely comfortable with the notion of ranks, after reading
>> the acceleration document a bunch of times.
>
> Feel free to ask if you need clarification.
> Briefly: ranks are the execution units, typically MPI processes, that tasks
> get assigned to when decomposing work across multiple compute units (nodes,
> processors). In general, data or tasks can be decomposed (also called
> data-/task-parallelization), and GROMACS does employ both, the former for
> the spatial domain decomposition, the latter for offloading PME work to a
> subset of the ranks.
>
>
>> Parts of log file below and I
>> will obviously appreciate suggestions/clarifications:
>>
> In the future, please share the full log by uploading it somewhere.
>
>
>> Command line:
>> gmx mdrun -nt 4 -ntmpi 2 -npme 1 -pme gpu -nb gpu -s run_unstretch.tpr -o
>> traj_unstretch.trr -g md.log -c unstretched.gro
>>
> As noted before, I doubt that you benefit from using a separate PME rank
> with a single GPU.
>
> I suggest that instead you simply run:
> gmx mdrun -ntmpi 1 -pme gpu -nb gpu
> optionally, you can pass -ntomp 4, but that's the default so it's not
> needed.
>
>
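Following up on the rank question, here is how I now read the two layouts
discussed above; this is my paraphrase using standard mdrun options, so
please correct me if I have it wrong:

# what I was running: 2 thread-MPI ranks with 2 OpenMP threads each
# (4 threads total from -nt 4), one of the ranks dedicated to PME
gmx mdrun -nt 4 -ntmpi 2 -npme 1 -pme gpu -nb gpu
# the suggestion: a single thread-MPI rank with 4 OpenMP threads that
# offloads both the nonbonded and the PME work to the one GPU
gmx mdrun -ntmpi 1 -ntomp 4 -pme gpu -nb gpu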