[gmx-users] Reference structure for PCA.

Mon Feb 11 06:17:54 CET 2013

Dear all!

1) I'd also like to know more about how the reference structure is
chosen.

I usually run

g_covar -f md.trr -s md.tpr

for PCA of the MD trajectory (here md.tpr is the protein topology and
md.trr is the protein-only trajectory)

and

g_covar -f ensemble.pdb -s ref.pdb

for PCA of the X-ray data set, where ensemble.pdb contains all of my
PDB structures in NMR-like (multi-model) format and ref.pdb is a
randomly chosen structure (I know that such an assumption is wrong by
definition, but I don't really know how to calculate an average
structure for my 'pdb trajectory').
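
One common way to get an average structure for such an ensemble is to
iterate: fit all models to the current reference, average the fitted
coordinates, and use that average as the next reference until it stops
changing. Below is a minimal NumPy sketch of the idea. It assumes the
coordinates are already loaded as an array of shape (n_models, n_atoms, 3)
with the same atoms in the same order; the function names are illustrative
and not part of GROMACS. (g_covar can also write the average it computes
after fitting, via its -av option.)

```python
import numpy as np

def kabsch_fit(mobile, target):
    """Least-squares superpose mobile (n_atoms, 3) onto target; return the fitted copy."""
    mob_c = mobile - mobile.mean(axis=0)
    tgt_center = target.mean(axis=0)
    tgt_c = target - tgt_center
    # SVD of the 3x3 cross-covariance gives the optimal rotation (Kabsch algorithm)
    u, _, vt = np.linalg.svd(mob_c.T @ tgt_c)
    d = np.sign(np.linalg.det(u @ vt))          # guard against improper rotations
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return mob_c @ rot + tgt_center

def iterative_average(models, tol=1e-6, max_iter=50):
    """Average structure of an ensemble, refined by repeated fit-and-average."""
    ref = models[0]                              # arbitrary starting reference
    for _ in range(max_iter):
        fitted = np.array([kabsch_fit(m, ref) for m in models])
        new_ref = fitted.mean(axis=0)
        # RMS shift of the average between two iterations
        shift = np.sqrt(((new_ref - ref) ** 2).sum(axis=1).mean())
        ref = new_ref
        if shift < tol:
            break
    return ref, fitted
```

The returned average is only as good as the fit that produced it, and
an averaged structure is not a physical conformation, so it is better
used as a fitting reference than inspected as a model.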

Sometimes this produces a 'broken geometry' of my protein when I try to
fit the calculated trajectory to a NEW reference (or to the old
reference, md.tpr) with g_anaeig to produce filtered.xtc:

g_anaeig -v eigenvec.trr -f md.trr -s md.tpr -filt filtered.xtc

On visual inspection my protein looks 'compressed'. I don't know why
this happens, because the problem does not appear in every case, so I
suppose it lies either in the choice of the initial reference structure
in g_covar or in the fitting in g_anaeig.

2) Sometimes I want to project my MD trajectory (md_1) onto the
eigenvectors calculated from the X-ray data set (or from another MD
trajectory, md_2) of the same protein. I'd like to examine which
conformations of md_1 correspond to which positions in the
conformational space of the second trajectory, md_2. Which reference
structure should be chosen for such a PCA?
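
For such a projection to be meaningful, md_1 has to be expressed in the
same frame in which the eigenvectors were derived: the same atom
selection, a fit to the same reference structure that was used for the
covariance analysis of md_2, and subtraction of the same average
structure. A minimal NumPy sketch of that bookkeeping follows. It
assumes both sets of frames have already been least-squares fitted to
that common reference; the function and array names are illustrative,
not a GROMACS API.

```python
import numpy as np

def pca(fitted_frames):
    """Return (average, eigenvalues, eigenvectors) for frames of shape (n_frames, n_atoms, 3)."""
    x = fitted_frames.reshape(len(fitted_frames), -1)    # flatten to (n_frames, 3N)
    avg = x.mean(axis=0)
    cov = np.cov(x - avg, rowvar=False, bias=True)       # 3N x 3N covariance about the average
    evals, evecs = np.linalg.eigh(cov)                   # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]                      # largest variance first
    return avg, evals[order], evecs[:, order]

def project(fitted_frames, avg, evecs, n_components=2):
    """Project frames (fitted to the SAME reference) onto the first eigenvectors."""
    x = fitted_frames.reshape(len(fitted_frames), -1)
    return (x - avg) @ evecs[:, :n_components]

# Hypothetical usage, with md1_fitted and md2_fitted fitted to one reference:
#   avg2, evals2, evecs2 = pca(md2_fitted)
#   proj_md2 = project(md2_fitted, avg2, evecs2)   # md_2 in its own PC space
#   proj_md1 = project(md1_fitted, avg2, evecs2)   # md_1 in the PC space of md_2
```

According to the g_covar help text, the eigenvector file stores the fit
reference and the average structure together with the eigenvectors, so
the same frame can be reused when another trajectory is analysed
against it.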

3) I'm looking for a tutorial that explains step by step how to perform
PCA in dihedral space for an average-sized protein (800 backbone
atoms). As I understand it, I have to define each dihedral angle by
hand in the ndx file to provide it as input to g_covar. Does anyone
have a script to automate this?
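
Listing the backbone dihedrals is easy to script. The sketch below
reads the first model of a PDB file, collects the N, CA and C atoms per
residue, and writes one index group with a psi quadruplet (N, CA, C,
N+1) and a phi quadruplet (C-1, N, CA, C) for every pair of consecutive
residues. Its assumptions: standard fixed-column PDB records, atoms
numbered by order of appearance, consecutive residue numbers within a
chain, and a group name ("phi_psi") chosen only for illustration.

```python
def backbone_atoms(pdb_lines):
    """Map (chain, resseq) -> {'N': i, 'CA': i, 'C': i}, i = 1-based atom position."""
    residues = {}                       # dicts keep insertion (sequence) order
    index = 0
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break                       # first model only
        if not line.startswith(("ATOM", "HETATM")):
            continue
        index += 1                      # assumption: atoms numbered by order of appearance
        name = line[12:16].strip()
        if line.startswith("ATOM") and name in ("N", "CA", "C"):
            key = (line[21], int(line[22:26]))
            residues.setdefault(key, {})[name] = index
    return residues

def phi_psi_quadruplets(residues):
    """Return psi(i) and phi(i+1) atom quadruplets for consecutive residue pairs."""
    keys = list(residues)
    quads = []
    for prev, cur in zip(keys, keys[1:]):
        a, b = residues[prev], residues[cur]
        consecutive = prev[0] == cur[0] and cur[1] == prev[1] + 1
        complete = all(n in a and n in b for n in ("N", "CA", "C"))
        if consecutive and complete:
            quads.append((a["N"], a["CA"], a["C"], b["N"]))   # psi of residue prev
            quads.append((a["C"], b["N"], b["CA"], b["C"]))   # phi of residue cur
    return quads

def format_ndx(quads, group="phi_psi"):
    """Format the quadruplets as one index group, one dihedral per line."""
    lines = [f"[ {group} ]"]
    lines += [" ".join(f"{i:5d}" for i in q) for q in quads]
    return "\n".join(lines) + "\n"

# Hypothetical usage:
#   with open("ref.pdb") as fh:
#       ndx = format_ndx(phi_psi_quadruplets(backbone_atoms(fh)))
#   with open("dihedrals.ndx", "w") as out:
#       out.write(ndx)
```

Such an index would normally go to g_angle with -type dihedral rather
than straight to g_covar; its -or option writes the cos and sin of the
selected angles to a trajectory file that g_covar can then read. The
manual section on dihedral PCA describes the remaining steps.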

Thanks for suggestions,

James

2013/2/11  <baptista at itqb.unl.pt>:
> Hi Vivek,
>
> There are two distinct steps involved: (1) the fit of your trajectory to a
> reference structure, which corresponds to choosing a conformation space; (2)
> the use of the PCA method, which corresponds to finding in that space a new
> basis set whose ordered axes sequentially maximize dispersion (hopefully
> capturing the main features of the distribution with only a few of the new
> coordinates). The two steps just happen to be done by the same program. The
> structure chosen for fitting is related to step 1, while the average
> structure used to compute the covariance matrix is related to step 2 -- as
> already pointed out by Tsjerk, the two structures are generally not the same.
>
> The aim of the fit is to get rid of the global translation and rotation of
> your protein in the simulation box, trying to place all the sampled
> structures in a single 3D space that reflects "only" the conformational
> differences. But this is necessarily approximate, because the
> superimposition of any pair of structures after the global fit will
> always be worse than what you would get by making a pairwise fit of the two. Thus,
> you want to get a final dispersion around the reference as small as
> possible. So, of the two average structures that you tried, you should
> choose the one computed from the last 30 ns (it's not surprising that it
> gives a smaller dispersion, because it refers to the segment you are
> analyzing). Still, using an average structure as a reference is a somewhat
> illusory solution, because that average must itself be obtained after
> fitting the trajectory to some reference... In a study of a small flexible
> peptide (where the choice of reference may have drastic effects), we found
> that a good reference seems to be the "central structure" of your sample,
> defined as the one that, when taken as a reference, leads to the lowest
> overall dispersion (http://dx.doi.org/10.1021/jp902991u). The article
> discusses the issues pointed above, so you may want to give it a look.
>
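
The "central structure" criterion quoted above can be sketched with a
brute-force search: try every frame as the fit reference and keep the
one with the lowest dispersion. In this NumPy sketch, dispersion is
taken to be the summed squared deviation from the reference after
least-squares fitting. That is an assumption made for illustration, and
the cited article may define it differently. The function names are
hypothetical.

```python
import numpy as np

def kabsch_fit(mobile, target):
    """Least-squares superpose mobile (n_atoms, 3) onto target; return the fitted copy."""
    mob_c = mobile - mobile.mean(axis=0)
    tgt_center = target.mean(axis=0)
    tgt_c = target - tgt_center
    u, _, vt = np.linalg.svd(mob_c.T @ tgt_c)    # Kabsch: SVD of the cross-covariance
    d = np.sign(np.linalg.det(u @ vt))           # guard against improper rotations
    return mob_c @ (u @ np.diag([1.0, 1.0, d]) @ vt) + tgt_center

def dispersion_about(reference, frames):
    """Summed squared deviation from the reference after fitting every frame to it."""
    return sum(((kabsch_fit(f, reference) - reference) ** 2).sum() for f in frames)

def central_structure(frames):
    """Index of the frame that, used as the fit reference, gives the lowest dispersion."""
    scores = [dispersion_about(ref, frames) for ref in frames]
    return int(np.argmin(scores))
```

The search needs one fit per pair of frames, so for a long trajectory
it is usually run on a subsample of the frames.
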
> You can also avoid the need for a reference by choosing a different
> conformation space for PCA, a popular alternative being the phi and psi
> dihedrals (look in the manual). Note that this dihedral space is a bit
> different from the more usual one discussed above, each reflecting a
> different kind of conformational proximity (this is also discussed in the
> article). It's up to you to decide which one better suits your problem.
>
> Hope this helps.
> Cheers,
> Antonio
>
> On Sat, 9 Feb 2013, Tsjerk Wassenaar wrote:
>
>> Hi,
>>
>> The commands would certainly help, including the commands for getting the
>> reference structure. Do note that the reference is the reference for
>> fitting, which is 'external', i.e. provided by the user. This is not the
>> same as the structure used to calculate the deviations, which is the
>> average structure of the frames selected.
>>
>> Cheers,
>>
>> Tsjerk
>>
>> On Sat, Feb 9, 2013 at 7:06 PM, bipin singh <bipinelmat at gmail.com> wrote:
>>
>>> Hi vivek,
>>>
>>> I have a few questions related to your query:
>>>
>>> During the covariance matrix calculation, g_covar by default takes the
>>> average structure of the trajectory as the reference structure, so why are
>>> you giving it the average structure of your trajectory (0-100ns) manually?
>>> Moreover, without seeing the commands you used, it is difficult for anyone
>>> to say why you are getting these surprising results.
>>>
>>> On Thu, Feb 7, 2013 at 1:26 PM, vivek modi <modi.vivek2009 at gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have troubled you with a similar question before, but I guess I need
>>>> some more clarification. My question is about the reference structure in
>>>> PCA analysis.
>>>> I have a 100ns long protein simulation which I want to analyze using PCA.
>>>> The RMSD shows fluctuations up to the initial 25-30ns and then becomes
>>>> very stable. I have performed PCA on the last 30ns window of the
>>>> simulation, where I assume the simulation has converged (I also did it on
>>>> other time windows).
>>>>
>>>> The question is this:
>>>> I did the analysis on the last 30ns window in two ways, taking two
>>>> different reference structures.
>>>>
>>>> a. I take the average structure of the full trajectory (0-100ns) as
>>>> the reference, then do the fitting and calculate the covariance matrix
>>>> for the last 30ns. This is done because I suspect that the average
>>>> structure over the full trajectory will reflect all the changes occurring
>>>> in the protein. It also gives me low cosines (<0.1). The PCs show
>>>> movement occurring in certain regions of the protein.
>>>>
>>>> b. I take the average structure from the same window (last 30ns), then do
>>>> the fitting and calculate the covariance matrix for the same window. This
>>>> is done with the assumption that the reference structure must reflect the
>>>> equilibrated/stable part of the trajectory, unlike the previous case.
>>>> Surprisingly, it gives me high cosines (>0.5). Unlike the previous case,
>>>> this method shows very small movement in the protein (very low RMSF).
>>>>
>>>> Both of these methods give me different RMSF for the PCs; although they
>>>> are done on the same part of the trajectory, the reference structure is
>>>> influencing the output.
>>>>
>>>> Which protocol among the two is appropriate? And how can we explain the
>>>> high cosines in the second case, where the reference structure is the
>>>> average of the same time window (there must not be large deviation),
>>>> while I get low cosines for the first case, where deviations are
>>>> calculated from the full trajectory average (large deviation)?
>>>>
>>>> Any help is appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> -Vivek Modi
>>>> Graduate Student
>>>> IITK.
>>>> --
>>>> gmx-users mailing list    gmx-users at gromacs.org
>>>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>>>> * Please search the archive at
>>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>>> * Please don't post (un)subscribe requests to the list. Use the
>>>> www interface or send it to gmx-users-request at gromacs.org.
>>>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>>
>>>
>>>
>>>
>>> --
>>> *-----------------------
>>> Thanks and Regards,
>>> Bipin Singh*
>>>
>>
>>
>>
>> --
>> Tsjerk A. Wassenaar, Ph.D.
>>
>> post-doctoral researcher
>> Biocomputing Group
>> Department of Biological Sciences
>> 2500 University Drive NW
>> Calgary, AB T2N 1N4
>> Canada
>>
> --
> Antonio M. Baptista
> Instituto de Tecnologia Quimica e Biologica, Universidade Nova de Lisboa
> Av. da Republica - EAN, 2780-157 Oeiras, Portugal
> phone: +351-214469619         email: baptista at itqb.unl.pt
> fax:   +351-214411277         WWW:   http://www.itqb.unl.pt/~baptista
> --------------------------------------------------------------------------