[gmx-users] performance

gromacs query gromacsquery at gmail.com
Thu Sep 21 23:25:54 CEST 2017


Hi Szilárd,

Thanks a lot for your time; please see my replies below. Your comments are
very useful, and I hope this long, carried-over discussion will serve
future users. (Could you also please look at my other email pointing out
possible errors/repetitions in the web documentation about performance?)

-multi/-multidir is not very helpful in my case because my simulations
sometimes crash, and restarting them would be a pain since there are many
(many!) of them. Also, on shared-node clusters one can never be sure
whether other users will use the -multi/-multidir option. I have read your
other email [tagged: the importance of process/thread affinity, especially
in node-sharing setups], where node sharing among different users can be
an issue that ultimately depends on the job scheduler.

My replies are inserted here:


On Thu, Sep 21, 2017 at 4:54 PM, Szilárd Páll <pall.szilard at gmail.com>
wrote:

> Hi,
>
> A few remarks in no particular order:
>
> 1. Avoid domain-decomposition unless necessary (especially in
> CPU-bound runs, and especially with PME), it has a non-negligible
> overhead (greatest when going from no DD to using DD). Running
> multi-threading only typically has better performance. There are
> exceptions (e.g. your case of reaction-field runs could be such a
> case, but I'm doubtful as the DD cost is significant). Hence, I
> suggest trying 1, 2, 4... ranks per simulation, i.e.
> mpirun -np 1 gmx mdrun -ntomp N (single-run)
> mpirun -np 2 gmx mdrun -ntomp N/2 (single-run)
> mpirun -np 4 gmx mdrun -ntomp N/4 (single-run)
> [...]
> The multi-run equivalents of the above would simply use M ranks where
> M=Nmulti * Nranks_per_run.
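
(Just to spell out the multi-run arithmetic for future readers, as I
understand it: with, say, 8 simulations at 2 ranks each, M = 8 * 2 = 16,
so the launch would look something like

mpirun -np 16 gmx mdrun -multidir sim0 sim1 sim2 sim3 sim4 sim5 sim6 sim7 -ntomp N/2

where the sim* directory names are just placeholders for my runs; please
correct me if I have that wrong.)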


Do you mean -dlb no? I did not change that setting, so it should be in
auto mode; I can try it though. And yes, I have indeed tried many other
cases where I vary -np gradually. I just shared one of the glitchy
performance cases [I have a wealth of such cases :)], which I now suspect
is a SLURM scheduler issue. I need to ask the admin whether core
affinities are set for a job.
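
In the meantime, I suppose I can check the affinity of a running job
myself. A rough sketch (the pattern passed to pgrep is just a guess for
how my jobs show up in the process list):

taskset -cp $(pgrep -f mdrun | head -n 1)   # print the core list the first mdrun process is bound to

and, if I read the SLURM documentation correctly, adding
--cpu-bind=verbose to srun should also report where each task is bound.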


> 2. If you're aiming for best throughput place two or more
> _independent_ runs on the same GPU, e.g. assuming 4 GPUs + 40 cores
> (and that no DD turns out to be best) to run 2 sim/GPU you can do:
> mpirun -np 8 -multi 8 gmx mdrun [-ntomp 5] [-gpu_id 00112233]
> The last two args can be omitted, but you should make sure that's what
> you get, i.e. that sim #0/#1 use GPU #0, sim #2/#3 use GPU#1, etc.
>

I am avoiding the -multi option, as explained above, but this is useful to
know.
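
For the record, to get a similar two-simulations-per-GPU layout without
-multi, I assume I could start independent runs with explicit pinning
instead. A rough sketch for your 4 GPUs + 40 cores example, using a
thread-MPI build of gmx (the sim* names are placeholders, and the offsets
assume 40 hardware threads with no hyperthreading, so I would double-check
them against the pinning report in the log):

for i in 0 1 2 3 4 5 6 7; do
  # sims 0,1 share GPU 0, sims 2,3 share GPU 1, and so on;
  # each run gets 5 OpenMP threads and its own pin offset
  gmx mdrun -deffnm sim$i -ntmpi 1 -ntomp 5 \
      -gpu_id $((i / 2)) -pin on -pinoffset $((i * 5)) > sim$i.out 2>&1 &
done
wait

Is that roughly what you would expect to match the -multi layout?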


> 3. 2a,b are clearly off, my hypothesis is still that they get pinned
> to the wrong cores. I suspect 6a,b are just lucky and happen to not be
> placed too badly. Plus, 6 uses 4 GPUs vs only 2 GPUs for 7, so that's not a
> fair comparison (and probably explains the 350 vs 300 ns/day).
>

Ah, sorry! Yes, my fault. I just checked: case 7 uses 2 GPUs; I forgot to
change the GPU numbers.


>
> 4. -pin on is faster than letting the scheduler place jobs (e.g. 3ab
> vs 4b) which is in line with what I would expect.
>


> 5. The strange asymmetry in 8a vs 8b is due to 8b having failed to pin
> and running where it should not be (empty socket -> core turbo-ing?).
> The 4a / 4b mismatch is strange; are those using the very same system
> (tpr?) -- one of them reports higher load imbalance!
>
>
>
Yes, all these jobs (cases 1 to 8) use the same tpr.
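
On that note, I will also compare the imbalance and pinning reports across
the runs, e.g. with something like

grep -i imbalance test_*.log
grep -i pinning test_*.log

(I am not sure of the exact wording mdrun uses in the log, hence the
case-insensitive grep), to see whether 4a/4b really report different load
imbalance and whether the affinity setting was overridden.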



> Overall, I suggest starting over and determining performance first by
> deciding: What DD setup is best and how to lay out jobs in a node to
> get best throughput. Start with run configs testing settings with
> -multi to avoid pinning headaches and fill at least half a node (or a
> full node) with #concurrent simulations >= #GPUs.
>
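
If I understand the suggestion correctly, then taking your 4 GPUs + 40
cores example, a starting configuration might look something like

mpirun -np 8 gmx mdrun -multidir md0 md1 md2 md3 md4 md5 md6 md7 -ntomp 5 -gpu_id 00112233 -pin on

i.e. eight concurrent simulations, one rank each, two per GPU, filling the
node (the md* directory names are placeholders).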

I will see if I can get a free node; I will need to wait.

Thanks for all the responses.

-J


> Cheers,
> --
> Szilárd
>
>
> On Mon, Sep 18, 2017 at 9:25 PM, gromacs query <gromacsquery at gmail.com>
> wrote:
> > Hi Szilárd,
> >
> > {I had to trim the message because it was put on hold: only 50 KB is
> > allowed and this message had reached 58 KB! This is not due to attached
> > files, as they are shared via Dropbox.} Sorry, seamless reading might be
> > compromised for future readers.
> >
> > Thanks for your replies. I have shared log files here:
> >
> > https://www.dropbox.com/s/m9mqqans0jci873/test_logs.zip?dl=0
> >
> > Two folders with self-describing names contain all the test logs. The
> > test_*.log serial numbers correspond to the simulations briefly described
> > below [with folder names].
> >
> > For a quick look one can run: grep Performance *.log
> >
> > Folder 2gpu_4np:
> > Sr. no.  Remarks                                                         Performance (ns/day)
> > 1.       only one job                                                    345
> > 2a,b.    two identical jobs run together (without -pin on)               16.1 and 15.9
> > 3a,b.    two identical jobs run together (without -pin on, -multidir)    270 and 276
> > 4a,b.    two identical jobs run together (-pin on, -pinoffset 0 and 5)   160 and 301
> >
> >
> >
> > Folder 4gpu_16np:
> > Sr. no.  Remarks                                                          Performance (ns/day)
> > 5.       only one job                                                     694
> > 6a,b.    two identical jobs run together (without -pin on)                340 and 350
> > 7a,b.    two identical jobs run together (without -pin on, -multidir)     302 and 304
> > 8a,b.    two identical jobs run together (-pin on, -pinoffset 0 and 17)   204 and 546

