[gmx-users] Data and Knowledge Management Tools for Computational Chemists

Thu Jan 6 16:19:37 CET 2005

On Thu, Jan 06, 2005 at 02:40:18PM +0100, Marc Baaden wrote:
> 
> Dear All,
> 
> I hope this message is not too off-topic wrt Gromacs, but I think it
> directly relates to the production, treatment and dissemination of
> scientific results, eg those obtained with Gromacs.
> 
> I am looking for software, tools or general approaches to get hold of the
> wealth of information that accumulates (mostly) electronically. In particular
> emails, text/PDF/XML or similar documents, bookmarks to websites and 
> bibliographic references (but eventually also results from calculations,
> location of trajectories, ...).
> 
> The main request would be to be able to "store" information as is without
> having to enter it individually into a curated database. Filtering, indexing
> or cataloging through a script would be ok, though. A powerful search should be
> possible.
> 
> Some specific points:
> - concerning bibliographic references, there is a wide variety of formats
>   like Pubmed, email-alerts, quotes on websites, ... sometimes with a comment
>   by the person who sent the reference, sometimes with an URL link, ...
>   I would like to be able to gather all information in a first pass without
>   having to parse the format by hand (eg where are authors, title, etc).
> - concerning bookmarks, it would be nice to also have elimination of duplicates
>   and of dead links
> - taking it one step further, indexing the sites listed in the bookmarks might
>   also be an additional useful step
> 
> After some extensive search of the web, I could not come up with a fully
> satisfactory solution. My current best bet would be to index text and other
> files and email with a search engine like eg namazu. For bookmarks I'd ideally
> like to store them in XBEL format, but there seem to be only a limited number
> of tools, and none or very few that eliminate duplicates and dead links.
> A useful bookmark tool might be bookmarker.
> FramerD (a database) seems also an interesting possibility, but probably 
> requires quite some substantial coding.
> 
> In an ideal world, I'd also love to make use of some artificial intelligence
> code (eg Self-organizing maps, textual data mining,..) or some machine-learning
> tools, but my feeling is that those are not (yet) usable by non-experts.
> 
> My question is: what do other people in the field use? Are there any miraculous
> packages that would do all that I want? Are there other/better approaches?

We have built our own. Actually, the funny thing is that at the two institutions
I consult for, there are two separate projects which together would satisfy 
your requirements. 

First, at one institution I am building a knowledge management system which 
indexes laboratory data, presentations, posters and other types of multimedia. 
I am currently tackling some database issues as well as a common document
format. The lab members like to keep stuff in PowerPoint, but of course, the
web services run on Unix, so one of the things on the todo list is to explore
a PowerPoint/WMF-to-PDF conversion utility for the service. In any case, the 
architecture for the system is based on a centralized content management 
system coupled with a gallery-like library system. This allows storage of the
document with its contextual metadata, collaboration via the CMS, and revision 
control. The document is also linked to a database of laboratory experimental
protocols which are linked to the inventory/reagent control system. 
The majority of the interface is still internal-use only so I can't really 
show you any examples. This is built using BSD, Apache, PostgreSQL, and PHP.
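
Since I can't show the real interface, here is a rough sketch of the kind of
data model described above: a document stored with its contextual metadata,
a revision history, and a link to an experimental protocol. Everything in it
is an assumption made for illustration (table names, column names, the helper
functions), and it uses Python with SQLite as a stand-in rather than the
PostgreSQL/PHP stack mentioned above.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative only.
SCHEMA = """
CREATE TABLE protocol (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE document (
    id          INTEGER PRIMARY KEY,
    filename    TEXT NOT NULL,
    doc_type    TEXT NOT NULL,          -- e.g. poster, presentation
    author      TEXT,
    protocol_id INTEGER REFERENCES protocol(id)
);
CREATE TABLE revision (
    document_id INTEGER NOT NULL REFERENCES document(id),
    rev         INTEGER NOT NULL,
    note        TEXT,
    PRIMARY KEY (document_id, rev)
);
"""


def open_db(path=":memory:"):
    """Open a database and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, filename, doc_type, author, protocol_id=None):
    """Store a document's metadata and record revision 1."""
    cur = conn.execute(
        "INSERT INTO document (filename, doc_type, author, protocol_id) "
        "VALUES (?, ?, ?, ?)",
        (filename, doc_type, author, protocol_id),
    )
    doc_id = cur.lastrowid
    conn.execute(
        "INSERT INTO revision (document_id, rev, note) VALUES (?, 1, ?)",
        (doc_id, "initial upload"),
    )
    return doc_id


def add_revision(conn, doc_id, note):
    """Append the next revision number for a document and return it."""
    (last,) = conn.execute(
        "SELECT MAX(rev) FROM revision WHERE document_id = ?", (doc_id,)
    ).fetchone()
    rev = (last or 0) + 1
    conn.execute(
        "INSERT INTO revision (document_id, rev, note) VALUES (?, ?, ?)",
        (doc_id, rev, note),
    )
    return rev


def documents_for_protocol(conn, protocol_id):
    """List filenames of all documents linked to one protocol."""
    rows = conn.execute(
        "SELECT filename FROM document WHERE protocol_id = ? ORDER BY filename",
        (protocol_id,),
    )
    return [r[0] for r in rows]
```

The point of the link table layout is that a search can start from either
side: from a document to the protocol it used, or from a protocol to every
document produced with it.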

The main project at the other institution I do not work on directly, but that 
involves the automatic population of databases using both static text mining 
and natural language parsing with comparison to a user-defined knowledge-base. 
A framework for autopopulation of a database from genbank can be found here:
http://senselab.med.yale.edu/autoput/autopop.pl
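
To give a flavour of what "static text mining" into a database means, here is
a minimal sketch in Python. It is not the method used by autopop.pl (that is
written in perl and I am not reproducing it here); it only assumes the general
GenBank flat-file layout, where top-level keywords such as LOCUS, DEFINITION
and ACCESSION start at the left margin and continuation lines are indented.
The table name, function names and the choice of SQLite are assumptions made
for the example.

```python
import re
import sqlite3

# Top-level flat-file keywords this sketch extracts; everything else is skipped.
FIELDS = ("LOCUS", "DEFINITION", "ACCESSION")


def parse_header(record_text):
    """Pull the selected header fields out of one flat-file record."""
    fields = {}
    current = None
    for line in record_text.splitlines():
        # The header ends where the feature table or the sequence begins.
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break
        m = re.match(r"^([A-Z]+)\s+(.*)$", line)
        if m:
            # A new top-level keyword: track it only if we want it.
            current = m.group(1) if m.group(1) in FIELDS else None
            if current:
                fields[current] = m.group(2).strip()
        elif current and line.startswith(" "):
            # Indented line: continuation of the field being tracked.
            fields[current] += " " + line.strip()
    return fields


def make_db(path=":memory:"):
    """Create the (hypothetical) target table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, locus TEXT, definition TEXT)"
    )
    return conn


def populate(conn, record_text):
    """Parse one record and insert or update its row; return the accession."""
    f = parse_header(record_text)
    accession = f["ACCESSION"].split()[0]
    conn.execute(
        "INSERT OR REPLACE INTO entry (accession, locus, definition) VALUES (?, ?, ?)",
        (accession, f["LOCUS"].split()[0], f.get("DEFINITION", "")),
    )
    return accession
```

A real pipeline would also have to walk the feature table and compare terms
against a knowledge base; this sketch stops at the header fields.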

A framework for natural language processing is demonstrated here: 
http://senselab.med.yale.edu/textmine/neurotext.pl

(As you might be able to tell, these applications are written in perl and
interface with a web service written in ASP.NET using an Oracle back end.)
Finally, structural data/simulation movies are stored as objects alongside
the appropriate macromolecular entry in the database. For example, we have
a short simulation of olfactory receptor I7 in the entry for I7:
http://senselab.med.yale.edu/senselab/site/dbData/eavData.asp?o=559
This database is populated by the aforementioned autopop application.

Bibliographies are standardized using Endnote libraries at both institutions,
but citations are also kept in SenseLab. I don't think the two are mixed,
because SenseLab is more outward-facing and Endnote is more local, but
sharing citations between the two may be worthwhile, depending on the scope
of the research.

-- 
Peter C. Lai
University of Connecticut
Dept. of Molecular and Cell Biology | Rsrch. Spc.
Yale University School of Medicine
SenseLab | Research Assistant
http://cowbert.2y.net/