[gmx-users] Principal Component Analysis

Tue Aug 14 01:14:56 CEST 2007

On Mon, 13 Aug 2007, Matthias Waegele wrote:

> Hello,
>
> I have a question concerning principal component analysis.
>
> In principal component analysis (PCA) it is assumed that the coordinates along
> each degree of freedom are Gaussianly distributed. If the data does not follow
> a normal distribution, PCA may not identify the correct principal modes since
> the largest variances do not correspond to the meaningful axes (e.g. J. Chem.
> Phys. (2006) 124, 024910).

This is a common misunderstanding. PCA does not assume anything about
the form of the distribution of your data, regardless of whether you
regard that data as a complete population or as just a sample. Assuming
a multivariate normal (Gaussian) distribution does lead to simpler
statistical and geometrical interpretations of the PCA method (as it
happens for so many other methods), meaning that many of the existing
mathematical results apply to that case. However, you can still get
some rigorous statistical and geometrical results using less
restrictive assumptions about your distribution (e.g., that it is
elliptical). In any case, many of the interesting features of PCA are
distribution-independent, particularly the fact that the principal
components are orthogonal and uncorrelated. This does not mean that
PCA is straightforward to interpret, nor that it can magically reveal
profound things about your data (as some studies seem to assume). I
think that a geometrical understanding of the method helps a lot and
keeps you from making the usual mistakes. I suggest you check a good
book on PCA (e.g., Jolliffe) or on general multivariate analysis (e.g.,
Rencher).
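
To make the distribution-independence point concrete, here is a minimal
sketch in Python with NumPy. It is only an illustration under stated
assumptions: the toy data (exponential and uniform variables mixed by an
arbitrary matrix) and all variable names are hypothetical and have
nothing to do with an actual MD trajectory or with any GROMACS tool. It
diagonalizes the covariance matrix of clearly non-Gaussian data and
checks that the principal axes are orthogonal and that the projections
onto them are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately non-Gaussian, correlated toy data (hypothetical example):
# skewed exponential and flat uniform variables, mixed linearly.
n_samples = 5000
raw = np.column_stack([
    rng.exponential(scale=1.0, size=n_samples),
    rng.uniform(-1.0, 1.0, size=n_samples),
    rng.exponential(scale=0.5, size=n_samples),
])
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.2, 0.0, 1.0]])
data = raw @ mixing

# PCA as an eigendecomposition of the covariance matrix of centered data.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# Sort the components by decreasing variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the principal axes.
projections = centered @ eigvecs
proj_cov = np.cov(projections, rowvar=False)

# Principal axes are orthonormal: V^T V = I.
orthogonal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))
# Projections are uncorrelated: their covariance is diag(eigenvalues).
uncorrelated = np.allclose(proj_cov, np.diag(eigvals))

print("axes orthogonal:         ", orthogonal)
print("projections uncorrelated:", uncorrelated)
```

Both checks hold here because they follow from the covariance matrix
being symmetric, not from any assumption about the shape of the
distribution. Note that uncorrelated does not imply independent: for
non-Gaussian data the projections can still be statistically dependent,
which is one reason the components are not always easy to interpret.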

> However, PCA is frequently applied to systems involving significant anharmonic
> motions. Even for native state simulations, anharmonic fluctuations are
> identified when projected along the principal axes (e.g. Proteins (1993) 17,
> 412-425). Some researchers applied the method to complete unfolding trajectories
> (e.g. J. Mol. Biol. (1999) 290, 283-304). Especially in the case of unfolding
> trajectories, I would expect that the coordinates corresponding to a certain
> degree of freedom do not follow a Gaussian distribution.
> My question is: Why can we (successfully) apply PCA to MD (unfolding)
> trajectories?
>
> Thank you for your help.
>
> -Matthias
>
> --------------------------------------------------------------------------------
> Matthias M. Waegele
> Graduate Student
> Gai Research Group http://gailab4.chem.upenn.edu/
> Department of Chemistry
> University of Pennsylvania
> 231 South 34th Street
> Philadelphia, PA 19104-6323
> --------------------------------------------------------------------------------

-- 
Antonio M. Baptista
Instituto de Tecnologia Quimica e Biologica, Universidade Nova de Lisboa
Av. da Republica, EAN, ITQB II, Apartado 127
2781-901 Oeiras, Portugal
phone: +351-214469619         email: baptista at itqb.unl.pt
fax:   +351-214411277         WWW:   http://www.itqb.unl.pt/~baptista
--------------------------------------------------------------------------