mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)

Dwey Kauffman mpi566 at gmail.com
Tue Nov 12 21:28:49 CET 2013


Hi Mark and Szilard,

    Thanks to both of you for your suggestions; they are very helpful.

>
> Neither run had a PP-PME work distribution suitable for the hardware it
> was
> running on (and fixing that for each run requires opposite changes).
> Adding
> a GPU and hoping to see scaling requires that there be proportionately
> more
> GPU work available to do, *and* enough absolute work to do. mdrun tries to
> do this, and reports early in the log file, which is one of the reasons
> Szilard asked to see whole log files - please use a file sharing service
> to
> do that.
>

Since this task uses GPU acceleration, I had not looked at the PP-PME work
distribution; that is a good hint and a useful angle. My guess was that the
two GPUs finish their part quickly, or that there is simply not enough work
for them, which is in line with your explanation.
 
Please see logs below again.

#### ONE GPU ####

http://pastebin.com/B6bRUVSa

#### TWO GPUs ####
http://pastebin.com/SLAYnejP
 
>
> As you can see, this test was made on the same node regardless of
> > networking.  Can the performance be improved  say 50% more when 2 GPUs
> are
> > used on a general task ?  If yes, how ?
> >
> > >Indeed, as Richard pointed out, I was asking for *full* logs, these
> > >summaries can't tell much, the table above the summary entitled "R E A
> > >L   C Y C L E   A N D   T I M E   A C C O U N T I N G" as well as
> > >other reported information across the log file is what I need to make
> > >an assessment of your simulations' performance.
> >
> > Please see below.
> >
> > >>However, in your case I suspect that the
> > >>bottleneck is multi-threaded scaling on the AMD CPUs and you should
> > >>probably decrease the number of threads per MPI rank and share GPUs
> > >>between 2-4 ranks.
> >
> > After I test all three clusters, I found it may NOT be an issue of AMD
> > cpus.
> > Intel cpus has the SAME scaling issue.
> >
> > However, I am curious as to how you justify the setup of 2-4 ranks
> sharing
> > GPUs ? Can you please explain it a bit more ?
> >
>
> NUMA effects on multi-socket AMD processors are particularly severe; the
> way GROMACS uses OpenMP is not well suited to them. Using a rank (or two)
> per socket will greatly reduce those effects, but introduces different
> algorithmic overhead from the need to do DD and explicitly communicate
> between ranks. (You can see the latter in your .log file snippets below.)
> Also, that means the parcel of PP work available from a rank to give to
> the
> GPU is smaller, which is the opposite of what you'd like for GPU
> performance and/or scaling. We are working on a general solution for this
> and lots of related issues in the post-5.0 space, but there is a very hard
> limitation imposed by the need to amortize the cost of CPU-GPU transfer by
> having lots of PP work available to do.
>
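
If I understand the suggestion of sharing GPUs between 2-4 ranks correctly, on
a node like this it would mean launching mdrun with several thread-MPI ranks
per GPU, roughly like the line below (the rank and thread counts are only my
guess for an 8-core box, not something you recommended):

    mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011

i.e. four thread-MPI ranks with two OpenMP threads each, where -gpu_id 0011
maps two ranks onto GPU 0 and two onto GPU 1. Is that the kind of launch you
have in mind?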

Is this the reason why two GPUs do not scale, namely that the PP workload per
rank becomes smaller?
From that implication, I am wondering whether we can increase the PP workload
through parameters in the .mdp file. Which parameters are most closely
related to the PP workload? Would you please give more specific suggestions?
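
For reference, my current understanding is that the short-range (PP) work is
set mainly by the cut-offs, while fourierspacing sets the PME mesh that stays
on the CPU, so shifting work toward the GPUs would look roughly like the
fragment below (the numbers are only illustrative, not values I have
validated for this system):

    ; illustrative .mdp fragment: longer cut-offs give the GPU more
    ; short-range (PP) work; scaling fourierspacing up by the same factor
    ; keeps PME accuracy roughly constant while reducing CPU mesh work
    cutoff-scheme   = Verlet
    rcoulomb        = 1.2
    rvdw            = 1.2
    fourierspacing  = 0.16   ; coarser than the 0.12 nm default

Please correct me if these are the wrong knobs.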


>
> > NOTE: The GPU has >20% more load than the CPU. This imbalance causes
> >       performance loss, consider using a shorter cut-off and a finer PME
> > grid.
> >
>
> This note needs to be addressed before maximum throughput is achieved and
> the question of scaling is worth considering. Ideally, "Wait GPU local"
> should be nearly zero, achieved as suggested above. Note that
> launch+force+mesh+wait is the sum of gpu total! But much of the
> information
> needed is higher up the log file, and the whole question is constrained by
> things like rvdw.
>

From the note, the clear suggestion is a shorter cut-off and a finer PME grid.
I am not sure how to set up a finer PME grid, but I am able to set shorter
cut-offs. However, doing so seems risky based on others' reports.
 
Indeed, I see differences among the tests with 1 GPU.
Here, "cut-offs" refers to rlist, rvdw and rcoulomb.

I found that the smaller the cut-offs, the faster the computation.
The question is how small they can go, because it would be interesting to
know whether these different cut-offs generate equally good simulations.

As for two GPUs, when I set larger cut-offs, both GPUs in the same node were
kept very busy. However, the outcome of that configuration is worse in terms
of ns/day and wall time.

So what does "a finer PME grid" mean with respect to GPU workload?
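
My guess is that the PME grid is controlled by fourierspacing (or by
fourier-nx/ny/nz directly), so "finer" would mean something like this
(illustrative value only):

    ; my guess at what "a finer PME grid" means
    fourierspacing  = 0.10   ; smaller than the 0.12 nm default = more grid
                             ; points = more CPU PME work

combined with a correspondingly shorter rcoulomb, which, if I understand the
division of labour, moves work off the GPU's short-range kernel and onto the
CPU's PME mesh. Is that the right knob, or does mdrun's automatic PME tuning
already take care of this?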

You mention that launch + force + mesh + wait is the sum of the GPU total. I
thought the PME mesh was carried out by the CPU rather than the GPU; am I
missing something here?
My understanding is that the GPU is responsible for calculating the
short-range non-bonded forces, whereas the CPU handles the bonded and
long-range PME forces. Could you clarify this?

Also, does rvdw play an important role in improving the performance of the
GPU calculation?


> >
> Unfortunately you didn't copy the GPU timing stuff here! Roughly, all the
> performance gain you are seeing here is eliminating most of the single-GPU
> "wait gpu" term by throwing more hardware at it. To hope to see some
> scaling, you'd need to be able to drop the PME mesh time by about a factor
> of two (coarser grid, and compensating increase to rcoulomb), and hope
> there was enough PP work that using two GPUs for a single simulation is
> even worth considering. Achieving throughput-style scaling by running two
> independent simulations on the same node may be all that is practical (but
> I don't even know how many atoms you are simulating!)
>
> Mark

In the two-GPU configuration there is NO such GPU timing table, while the
1-GPU log does have one. See the logs.

Again, it would be interesting to know whether there is enough PP work for
two GPUs. Increasing the cut-offs does achieve this once the cut-off is
> 1.6 nm, but the total performance (ns/day) then drops severely. That is NOT
what I want, because for general purposes I would like to use cut-offs of
0.8, 1.0 or 1.2 nm. In this test case I am running the pull code from
Justin's umbrella sampling tutorial.

I should also mention that, using two GPUs with 12-core Intel CPUs, I work on
a protein system of 35,000 atoms including solvent and ions for general
purposes. Its performance increases by only 5-8% with 2 GPUs.
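
If a single simulation really cannot keep two GPUs busy, then following your
throughput suggestion I could run two independent simulations on the same
node, one per GPU, along these lines (the directory names are placeholders,
and the thread counts and pin offsets assume the 12-core Intel node, so they
may need adjusting):

    (cd run1 && mdrun -ntmpi 1 -ntomp 6 -gpu_id 0 -pin on -pinoffset 0) &
    (cd run2 && mdrun -ntmpi 1 -ntomp 6 -gpu_id 1 -pin on -pinoffset 6) &
    wait

Would that be the recommended way to use this hardware, rather than chasing
2-GPU scaling for a 35,000-atom system?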

Besides your hint about the PP workload, any further practical suggestions
are highly appreciated; I am happy to test them.

Thanks,
Dewey




