[gmx-users] Re: Reference structure for PCA.

Tsjerk Wassenaar tsjerkw at gmail.com
Tue Feb 12 14:12:44 CET 2013


Hey Ahmet,

1/(N-1) sum (x-mean)(x-mean)'

is the unbiased estimator of the true (population) covariance matrix,
provided the observations are mutually independent (!)

In most cases, N is quite large, so it doesn't actually matter, and
the eigenvectors and order (not size) of the eigenvalues are invariant
to scaling, so it doesn't actually matter. If the size of the
eigenvalues does matter, you better make sure that there is no
temporal correlation between the frames you use for the analysis.

Cheers,

Tsjerk

On Tue, Feb 12, 2013 at 1:53 PM, Ahmet yıldırım <ahmedo047 at gmail.com> wrote:
> Hi,
>
> S = 1/N sum (x - ref) (x - ref)'
>
> or
>
> S = 1/(N-1) sum (x - ref) (x - ref)'
>
> N: the number of frames
>
> Which one is right?
>
> 2013/2/12 Tsjerk Wassenaar <tsjerkw at gmail.com>
>
>> Hi Vivek,
>>
>> If you use the g_covar option -ref, you not only use the reference
>> structure for fitting, you use it for calculating the deviations. Your
>> covariance matrix is built as:
>>
>> S = 1/N sum (x - ref) (x - ref)'
>>
>> If you leave out the option -ref then the average structure will be
>> used for the covariance matrix, which is what you (should) want. In
>> that case the reference structure is only used for fitting. You might
>> want to redo your calculations to check the effect of fitting.
>>
>> Cheers,
>>
>> Tsjerk
>>
>> On Tue, Feb 12, 2013 at 9:02 AM, vivek modi <modi.vivek2009 at gmail.com>
>> wrote:
>> >>
>> >>
>> >> Message: 1
>> >> Date: Sun, 10 Feb 2013 21:32:15 +0000 (WET)
>> >> From: baptista at itqb.unl.pt
>> >> Subject: Re: [gmx-users] Reference structure for PCA.
>> >> To: Discussion list for GROMACS users <gmx-users at gromacs.org>
>> >> Message-ID: <alpine.DEB.2.00.1302102127430.8574 at simul36.itqb.unl.pt>
>> >> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>> >>
>> >
>> > Hello Antonio,
>> >
>> > Thank you very much for the reply. I have gone through the paper you
>> > mentioned. The central structure seems to be a good choice for the
>> > reference structure.
>> >
>> >
>> > Regards,
>> >
>> > -Vivek.
>> >
>> >
>> >>
>> >> Hi Vivek,
>> >>
>> >> There are two distinct steps involved: (1) the fit of your trajectory
>> to a
>> >> reference structure, which corresponds to choose a conformation space;
>> (2)
>> >> the use of the PCA method, which corresponds to find in that space a new
>> >> basis set whose ordered axes sequentially maximize dispersion (hopefully
>> >> capturing the distribution main features with only a few of the new
>> >> coordinates). The two steps just happen to be done by the same program.
>> >> The structure chosen for fitting is related to step 1, while the average
>> >> structure used to compute the covariance matrix is related to step 2 --
>> as
>> >> already pointed by Tjerk, the two structures are generally not the same.
>> >>
>> >> The aim of the fit is to get rid of the global translation and rotation
>> of
>> >> your protein in the simulation box, trying to place all the sampled
>> >> structures in a single 3D space that reflects "only" the conformational
>> >> differences. But this is necessarily approximate, because the
>> >> superimposition of any pair of structures after the global fit will be
>> >> always worse than you would get by making a pairwise fit of the two.
>> Thus,
>> >> you want to get a final dispersion around the reference as small as
>> >> possible. So, of the two average structures that you tried, you should
>> >> choose the one computed from the last 30 ns (it's not surprising that it
>> >> gives a smaller dispersion, because it refers to the segment you are
>> >> analyzing). Still, using an average structure as a reference is a
>> somewhat
>> >> illusory solution, because that average must itself be obtained after
>> >> fitting the trajectory to some reference... In a study of a small
>> flexible
>> >> peptide (where the choice of reference may have drastic effects), we
>> found
>> >> that a good reference seems to be the "central structure" of your
>> sample,
>> >> defined as the one that, when taken as a reference, leads to the lowest
>> >> overall dispersion (http://dx.doi.org/10.1021/jp902991u). The article
>> >> discusses the issues pointed above, so you may want to give it a look.
>> >>
>> >> You can also avoid the need of a reference by choosing a different
>> >> conformation space for PCA, a popular alternative being the phi and psi
>> >> dihedrals (look in the manual). Note that this dihedral space is a bit
>> >> different from the more usual one discussed above, each reflecting a
>> >> different kind of conformational proximity (this is also discussed in
>> the
>> >> article). It's up to you to decide which one better suits your problem.
>> >>
>> >> Hope this helps.
>> >> Cheers,
>> >> Antonio
>> >>
>> >> Thank you very much for your reply Tsjerk.
>> > I understand that the two reference structures are different. I had a
>> query
>> > because I found the method is very sensitive to the choice of the
>> reference
>> > structure for fitting. Most of the publications either do not mention the
>> > reference structure. Only in few papers I found initial structure for
>> > fitting. But the method gives different  results if initial structure is
>> > used; or average from complete trajectory;  or average over a certain
>> time
>> > window is used as the structure for reference.
>> > The average was calculated using the following command:
>> >
>> > g_rmsf -f *.xtc -s *.tpr -ox average70-100ns.pdb -b 70000 -e 100000
>> >
>> >
>> >
>> > Regards,
>> >
>> > -Vivek.
>> >
>> >
>> >
>> >> On Sat, 9 Feb 2013, Tsjerk Wassenaar wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > The commands would certainly help, including the commands for getting
>> the
>> >> > reference structure. Do note that the reference is the reference for
>> >> > fitting, which is 'external', i.e. provided by the user. This is not
>> the
>> >> > same as the structure used to calculate the deviations, which is the
>> >> > average structure of the frames selected.
>> >> >
>> >> > Cheers,
>> >> >
>> >> > Tsjerk
>> >> >
>> >> > On Sat, Feb 9, 2013 at 7:06 PM, bipin singh <bipinelmat at gmail.com>
>> >> wrote:
>> >> >
>> >> >> Hi vivek,
>> >> >>
>> >> >> I have few questions related to your query:
>> >> >>
>> >> >> During covariance matrix calculation, g_covar by default takes
>> average
>> >> >> structure of the trajectory as a reference structure then why you are
>> >> >> giving it average structure of your trajectory (0-100ns) manually.
>> >> >> Moreover without looking at your commands which you have used, it
>> would
>> >> be
>> >> >> difficult for anyone that why are you getting these surprising
>> results.
>> >> >> On Thu, Feb 7, 2013 at 1:26 PM, vivek modi <modi.vivek2009 at gmail.com
>> >
>> >> >> wrote:
>> >> >>
>> >> >>> Hello,
>> >> >>>
>> >> >>> I have troubled you with a similar question before also, but I
>> guess I
>> >> >> need
>> >> >>> some more clarification. My question is about the reference
>> structure
>> >> in
>> >> >>> PCA analysis.
>> >> >>> I have 100ns long protein simulation which I want to analyze using
>> PCA.
>> >> >> The
>> >> >>> RMSD shows fluctuations upto initial 25-30ns and then becomes very
>> >> >> stable.
>> >> >>> I have performed PCA on the last 30ns window of the simulation
>> where I
>> >> >>> assume the simulation has converged (I also did on other time
>> windows
>> >> as
>> >> >>> well).
>> >> >>>
>> >> >>> The question is this:
>> >> >>> I did the analysis on the last 30ns window in two ways by taking two
>> >> >>> different reference structures.
>> >> >>>
>> >> >>> a. I take the average structure of the trajectory (0-100ns) as
>> >> >>> the reference and then do the fitting and calculate covariance
>> matrix
>> >> for
>> >> >>> last 30ns. This is done because I suspect that the average structure
>> >> over
>> >> >>> full trajectory will reflect all the changes occurring in the
>> protein.
>> >> It
>> >> >>> also gives me low cosines (<0.1). The PCs show movement occurring in
>> >> >>> certain regions of the protein.
>> >> >>>
>> >> >>> b. I take the average structure from the same window (last 30ns)
>> then
>> >> do
>> >> >>> the fitting and calculate covariance matrix for the same. This is
>> done
>> >> >> with
>> >> >>> an assumption that the reference structure must reflect the
>> >> >>> equilibriated/stable part of the trajectory unlike the previous
>> case.
>> >> >>> Surprisingly it gives me high cosines (>0.5). Unlike the previous
>> case,
>> >> >>> this method shows very small movement in the protein (very low
>> RMSF).
>> >> >>>
>> >> >>> Both of these methods give me different RMSF for the PCs although
>> they
>> >> >> are
>> >> >>> done on the same part of the trajectory but the reference structure
>> is
>> >> >>> influencing the output.
>> >> >>>
>> >> >>>  Which protocol among the two is appropriate ?  And how can we
>> explain
>> >> >> high
>> >> >>> cosines in second case where the reference structure is the average
>> of
>> >> >> the
>> >> >>> same time window (there must not be large deviation) while I get low
>> >> >> cosine
>> >> >>> for the first case where deviations are calculated from the full
>> >> >> trajectory
>> >> >>> average (large deviation) ?
>> >> >>>
>> >> >>> Any help is appreciated.
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>> -Vivek Modi
>> >> >>> Graduate Student
>> >> >>> IITK.
>> >> >>> --
>> >> >>> gmx-users mailing list    gmx-users at gromacs.org
>> >> >>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>> >> >>> * Please search the archive at
>> >> >>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> >> >>> * Please don't post (un)subscribe requests to the list. Use the
>> >> >>> www interface or send it to gmx-users-request at gromacs.org.
>> >> >>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> *-----------------------
>> >> >> Thanks and Regards,
>> >> >> Bipin Singh*
>> >> >> --
>> >> >> gmx-users mailing list    gmx-users at gromacs.org
>> >> >> http://lists.gromacs.org/mailman/listinfo/gmx-users
>> >> >> * Please search the archive at
>> >> >> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> >> >> * Please don't post (un)subscribe requests to the list. Use the
>> >> >> www interface or send it to gmx-users-request at gromacs.org.
>> >> >> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Tsjerk A. Wassenaar, Ph.D.
>> >> >
>> >> > post-doctoral researcher
>> >> > Biocomputing Group
>> >> > Department of Biological Sciences
>> >> > 2500 University Drive NW
>> >> > Calgary, AB T2N 1N4
>> >> > Canada
>> >> > --
>> >> > gmx-users mailing list    gmx-users at gromacs.org
>> >> > http://lists.gromacs.org/mailman/listinfo/gmx-users
>> >> > * Please search the archive at
>> >> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> >> > * Please don't post (un)subscribe requests to the list. Use the
>> >> > www interface or send it to gmx-users-request at gromacs.org.
>> >> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> >> >
>> >> --
>> >> Antonio M. Baptista
>> >> Instituto de Tecnologia Quimica e Biologica, Universidade Nova de Lisboa
>> >> Av. da Republica - EAN, 2780-157 Oeiras, Portugal
>> >> phone: +351-214469619         email: baptista at itqb.unl.pt
>> >> fax:   +351-214411277         WWW:   http://www.itqb.unl.pt/~baptista
>> >>
>> --------------------------------------------------------------------------
>> >>
>> >>
>> >>
>> >> ------------------------------
>> >>
>> >>
>> >>
>> > --
>> > gmx-users mailing list    gmx-users at gromacs.org
>> > http://lists.gromacs.org/mailman/listinfo/gmx-users
>> > * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> > * Please don't post (un)subscribe requests to the list. Use the
>> > www interface or send it to gmx-users-request at gromacs.org.
>> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>>
>>
>> --
>> Tsjerk A. Wassenaar, Ph.D.
>>
>> post-doctoral researcher
>> Biocomputing Group
>> Department of Biological Sciences
>> 2500 University Drive NW
>> Calgary, AB T2N 1N4
>> Canada
>> --
>> gmx-users mailing list    gmx-users at gromacs.org
>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>> * Please search the archive at
>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>> * Please don't post (un)subscribe requests to the list. Use the
>> www interface or send it to gmx-users-request at gromacs.org.
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>
>
>
> --
> Ahmet Yıldırım
> --
> gmx-users mailing list    gmx-users at gromacs.org
> http://lists.gromacs.org/mailman/listinfo/gmx-users
> * Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
> * Please don't post (un)subscribe requests to the list. Use the
> www interface or send it to gmx-users-request at gromacs.org.
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists



-- 
Tsjerk A. Wassenaar, Ph.D.

post-doctoral researcher
Biocomputing Group
Department of Biological Sciences
2500 University Drive NW
Calgary, AB T2N 1N4
Canada



More information about the gromacs.org_gmx-users mailing list