[gmx-users] performance

Thu Sep 21 17:54:29 CEST 2017

Hi,

A few remarks in no particular order:

1. Avoid domain decomposition (DD) unless necessary, especially in
CPU-bound runs and especially with PME: it has a non-negligible
overhead, and the largest jump in cost comes when going from no DD to
using DD. Running with multi-threading only typically performs better.
There are exceptions (your reaction-field runs could be one, but I'm
doubtful as the DD cost is significant). Hence, I suggest trying
1, 2, 4... ranks per simulation, i.e.
mpirun -np 1 gmx mdrun -ntomp N (single-run)
mpirun -np 2 gmx mdrun -ntomp N/2 (single-run)
mpirun -np 4 gmx mdrun -ntomp N/4 (single-run)
[...]
The multi-run equivalents of the above would simply use M ranks where
M=Nmulti * Nranks_per_run.

2. If you're aiming for the best throughput, place two or more
_independent_ runs on the same GPU. For example, assuming 4 GPUs + 40
cores (and that no DD turns out to be best), to run 2 sims/GPU you can do:
mpirun -np 8 gmx mdrun -multi 8 [-ntomp 5] [-gpu_id 00112233]
The last two args can be omitted, but you should verify that you get
this mapping, i.e. that sim #0/#1 use GPU #0, sim #2/#3 use GPU #1, etc.

3. Runs 2a,b are clearly off; my hypothesis is still that they get
pinned to the wrong cores. I suspect 6a,b are just lucky and happen not
to be placed too badly. Also, 6 uses 4 GPUs vs 7 only 2 GPUs, so that's
not a fair comparison (and probably explains the 350 vs 300 ns/day).

4. -pin on is faster than letting the scheduler place jobs (e.g. 3a,b
vs 4b), which is in line with what I would expect.

5. The strange asymmetry in 8a vs 8b is due to 8b having failed to pin
and running where it should not be (empty socket -> core turbo-ing?).
The 4a / 4b mismatch is strange; are those using the very same system
(tpr?) -- one of them reports higher load imbalance!
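
As a rough illustration of points 1 and 2, here is a small Python sketch
that enumerates the rank/thread splits and builds the GPU mapping string.
The helper names (rank_thread_splits, gpu_id_string) are made up for
this example and are not part of GROMACS; the 20-core count in the first
loop is an assumed value, and the one-digit-per-simulation -gpu_id form
is the one used in the command above, so it only covers GPU ids 0-9.

```python
def rank_thread_splits(cores):
    """(ranks, OpenMP threads per rank) for ranks = 1, 2, 4, ... that divide `cores`."""
    splits = []
    ranks = 1
    while ranks <= cores:
        if cores % ranks == 0:  # skip splits that would leave cores idle
            splits.append((ranks, cores // ranks))
        ranks *= 2
    return splits


def gpu_id_string(n_gpus, sims_per_gpu):
    """One digit per simulation; consecutive simulations share a GPU."""
    if not 1 <= n_gpus <= 10:
        raise ValueError("digit-string form only covers GPU ids 0-9")
    return "".join(str(gpu) * sims_per_gpu for gpu in range(n_gpus))


# Point 1: candidate single-run splits for an assumed 20-core allocation.
for ranks, threads in rank_thread_splits(20):
    print(f"mpirun -np {ranks} gmx mdrun -ntomp {threads}")

# Point 2: 4 GPUs + 40 cores, 2 sims per GPU, no DD (1 rank per sim).
gpu_ids = gpu_id_string(4, 2)
n_sims = len(gpu_ids)
print(f"sims={n_sims} ntomp={40 // n_sims} gpu_id={gpu_ids}")
```

The second print gives the values used in point 2 (8 simulations, 5
threads each, mapping 00112233); the mapping actually applied should
still be checked in the run output.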

Overall, I suggest starting over and first determining two things:
which DD setup is best, and how to lay out jobs in a node to get the
best throughput. Start by testing run configurations with -multi to
avoid pinning headaches, and fill at least half a node (or a full
node) with #concurrent simulations >= #GPUs.
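
One way to draw up that list of test configurations is sketched below in
Python. It assumes a full node is filled, that the number of concurrent
simulations is a multiple of the GPU count, and that only 1, 2 or 4
ranks per simulation are tried; candidate_layouts and its parameters are
hypothetical names for this example, not GROMACS options.

```python
def candidate_layouts(cores, gpus, max_sims_per_gpu=2, ranks_options=(1, 2, 4)):
    """List (n_sims, ranks_per_sim, ntomp) layouts that fill `cores` exactly.

    n_sims is always >= gpus, so every GPU gets at least one simulation.
    """
    layouts = []
    for sims_per_gpu in range(1, max_sims_per_gpu + 1):
        n_sims = gpus * sims_per_gpu
        for ranks_per_sim in ranks_options:
            total_ranks = n_sims * ranks_per_sim  # M = Nmulti * Nranks_per_run
            if total_ranks <= cores and cores % total_ranks == 0:
                layouts.append((n_sims, ranks_per_sim, cores // total_ranks))
    return layouts


# Example node from point 2: 40 cores and 4 GPUs.
for n_sims, ranks_per_sim, ntomp in candidate_layouts(40, 4):
    print(f"{n_sims} sims x {ranks_per_sim} rank(s) x {ntomp} threads")
```

Each layout printed is one benchmark to run; comparing aggregate ns/day
across them shows which DD setup and job layout gives the best
throughput on that node.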

Cheers,
--
Szilárd

On Mon, Sep 18, 2017 at 9:25 PM, gromacs query <gromacsquery at gmail.com> wrote:
> Hi Szilárd,
>
> {I had to trim the message as my message is put on hold because only 50kb
> allowed and this message has reached 58 KB! Not due to files attached as
> they are shared via dropbox}; Sorry seamless reading might be compromised
> for future readers.
>
> Thanks for your replies. I have shared log files here:
>
> https://www.dropbox.com/s/m9mqqans0jci873/test_logs.zip?dl=0
>
> Two self-describing name folders have all the test logs. The test_*.log
> file serial numbers correspond to my simulations briefly described here
> [with folder names].
>
> For quick look one can: grep Performance *.log
>
> Folder 2gpu_4np:
> Sr. no.  Remarks                                                  Performance (ns/day)
> 1.       only one job                                             345
> 2a,b.    two same jobs together (without pin on)                  16.1 and 15.9
> 3a,b.    two same jobs together (without pin on, with -multidir)  270 and 276
> 4a,b.    two same jobs together (pin on, pinoffset at 0 and 5)    160 and 301
>
> Folder 4gpu_16np:
> Sr. no.  Remarks                                                  Performance (ns/day)
> 5.       only one job                                             694
> 6a,b.    two same jobs together (without pin on)                  340 and 350
> 7a,b.    two same jobs together (without pin on, with -multidir)  302 and 304
> 8a,b.    two same jobs together (pin on, pinoffset at 0 and 17)   204 and 546