[gmx-users] eigenvalues and number of frames

Fri Apr 5 09:52:46 CEST 2002

Jose D Faraldo-Gomez wrote:

> 
> Thanks Anton/Bert; I think I can come to terms with this idea of the number
> of eigenvectors being equal to nframes-1 when nframes < 3N; though I'm not
> so sure how to apply this view when nframes > 3N, but in any case...
> 
> The thing that is bothering me is that I don't see how this condition
> appears in the diagonalization process in gromacs, so I must be not
> understanding either the algebra or the code.
> 
> For example, when I look at a 4ns interval using 201 and 101 frames, and I
> compare the covariance matrices (dumped by g_covar_d -debug), I don't find
> very large differences; for instance the ratio of the diagonal elements
> C(ii; 101f)/C(ii; 201 fr) is on average 0.98 +/- 0.08...
> 
> So where is the trick? Are the covariance matrices really different enough
> to give lists of precisely 100 and 200 eigenvalues? This is hard to believe
> (though I'm not very good at maths)...
> 

It may help to think of it as a multiple linear regression, which it actually is.
As in a simple 2D regression, you can have hundreds or thousands of points
through which you fit your straight line (eigenvector). This corresponds
to your nframes > 3N case, which is not a problem at all. In fact, the more frames the
better. It is also immediately clear why removing every second point doesn't
significantly affect the regression. When you have fewer frames than dimensions,
you will get at most (nframes-1) eigenvectors with non-zero eigenvalues, as
Anton explained before: it doesn't make sense to fit a plane through a dataset
consisting of two points. The diagonalisation in this case simply yields zeroes
for all the non-sampled dimensions.
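
To see the counting argument in numbers, here is a minimal sketch in Python with
NumPy. It is not what g_covar does internally: the data are random numbers rather
than a trajectory, ndim = 30 merely stands in for 3N, and the function name and
the tolerance are made up for the illustration. The normalisation of the
covariance matrix does not affect how many eigenvalues are non-zero.

```python
import numpy as np

def covariance_spectrum(coords):
    """Eigenvalues (largest first) of the covariance matrix of coords.

    coords has shape (nframes, ndim); ndim plays the role of 3N.
    """
    centered = coords - coords.mean(axis=0)         # subtract the average structure
    cov = centered.T @ centered / coords.shape[0]   # ndim x ndim covariance matrix
    return np.linalg.eigvalsh(cov)[::-1]            # eigvalsh returns ascending order

rng = np.random.default_rng(0)
ndim = 30                                # hypothetical stand-in for 3N
few = rng.normal(size=(11, ndim))        # nframes < ndim
many = rng.normal(size=(500, ndim))      # nframes > ndim

tol = 1e-10                              # assumed threshold for "non-zero"
n_few = int(np.sum(covariance_spectrum(few) > tol))
n_many = int(np.sum(covariance_spectrum(many) > tol))
print(n_few, n_many)                     # prints: 10 30
```

With 11 frames, subtracting the average leaves only 10 independent displacement
vectors, so 10 eigenvalues are non-zero. With 500 frames all 30 dimensions are
sampled and adding more frames cannot raise the count above ndim.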

Now to your example of a few hundred datapoints in a (I assume) few-thousand-dimensional
space. If you remove every second datapoint, your dataset will still be spread along
the dominant directions, so you'll still be able to extract the largest-eigenvalue
eigenvectors without much trouble (this explains your large overlap). This
will be different for the eigenvectors with smaller eigenvalues: they will be more
and more poorly defined. If you like, check the overlap between eigenvectors 90-100 for
both sets, and you'll see that they are very different. So it depends on what you need:
if you're only interested in the largest-amplitude modes, then a representative
set of 100 or so snapshots may suffice. If you're also interested in the smaller-
amplitude modes, then you should really feed g_covar as many frames as possible.
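
The same check can be sketched on synthetic data in Python with NumPy. Everything
here is an assumption made for the illustration: the frames are independent random
vectors (not a time-correlated trajectory), the amplitudes fall off as 10/k, and
the helper names and the overlap measure (summed squared inner products divided by
the number of vectors) are my own choice, not g_covar output.

```python
import numpy as np

def eigenvectors(coords):
    """Rows are covariance eigenvectors, largest eigenvalue first."""
    centered = coords - coords.mean(axis=0)
    cov = centered.T @ centered / coords.shape[0]
    _, vecs = np.linalg.eigh(cov)        # columns, ascending eigenvalue order
    return vecs[:, ::-1].T               # reorder to descending, one vector per row

def subspace_overlap(a, b):
    """Normalised overlap of two sets of orthonormal row vectors.

    1 means the sets span the same subspace, 0 means they are orthogonal.
    """
    return float(np.sum((a @ b.T) ** 2) / a.shape[0])

rng = np.random.default_rng(1)
ndim, nframes = 300, 201                          # hypothetical sizes, nframes < ndim
amplitudes = 10.0 / np.arange(1, ndim + 1)        # assumed decaying fluctuation amplitudes
frames = rng.normal(size=(nframes, ndim)) * amplitudes

full = eigenvectors(frames)                       # all 201 frames
half = eigenvectors(frames[::2])                  # every second frame, 101 frames

top = subspace_overlap(full[:5], half[:5])            # eigenvectors 1-5
tail = subspace_overlap(full[89:100], half[89:100])   # eigenvectors 90-100
print(f"overlap 1-5: {top:.2f}   overlap 90-100: {tail:.2f}")
```

On data like this the first few eigenvectors should overlap much more strongly
between the two sets than eigenvectors 90-100 do. The actual numbers for a real
trajectory will of course depend on the system and on the sampling.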

Bert

____________________________________________________________________________
Dr. Bert de Groot

Max Planck Institute for Biophysical Chemistry
Theoretical molecular biophysics group
Am Fassberg 11 
37077 Goettingen, Germany

tel: +49-551-2011306, fax: +49-551-2011089

email: bgroot at gwdg.de
http://www.mpibpc.gwdg.de/abteilungen/071/bgroot
____________________________________________________________________________