How to Get Information from the Databases

This BioCompanion copy is a demo version. This section introduces data retrieval. Data in molecular biology are produced in many different ways. Literature abstracts, disease data, metabolic pathways, and a wealth of other data types are collected, curated and redistributed. Within the context of sequence analysis, databases of, or related to, sequence data are the most important. Therefore, the considerations below emphasise sequence data libraries and collections.

Principle

Production of Major Databases

The collection and maintenance of primary public sequence data libraries is performed at centres like the EBI (European Bioinformatics Institute, an outstation of EMBL) or the NCBI (National Center for Biotechnology Information). Other centres are similarly active; these two serve only as examples. Commercial vendors are becoming increasingly important as sequence producers, and some databases with EST and other data are now available from different sources.

Sequence data collection, curation and dissemination are major tasks which require tremendous resources. Neither the local Bioinformatics resource nor the end user is expected to run the sophisticated software which the central data collection institutions use to collect, maintain, and curate data. Therefore, simpler procedures are required, and the structure of the data is modified so that less sophisticated software can access the sequence data as desired. After an export procedure to a so-called flat file, the data are distributed to the end users' sites in various formats. The main paradigm is that each biological sequence is described in an entry which has a title, the sequence data and associated reference information.

In a "real" database system, these data are accessible in a smooth and interlinked fashion. To benefit from the databases in their original form, however, users would need to install very expensive and staff-intensive database software (so-called relational database systems). During the export to flat files, a considerable part of the structuring information is lost and, therefore, auxiliary information must be written into each entry. The application software at the end user's site must follow a set of conventions (called a format) to present the information in a form as close as possible to the original comprehensive set.

Contents of a Sequence Database Entry

Each data set has the following fields:

Name (database-specific, one per entry)

Accession Number (universally valid, one or more per entry)

Title (usually similar between databases, one line of description)

Reference (the literature reference or location of the lab that produced the sequence)

Sequence (always starting with position 1, even in DNA).

Some data which serve administrative purposes, such as section information or dates of creation or updating, are not listed. Optionally, one or more of the following data are attached to an entry if known:

Organism Classification

Features of the sequence (reading frame coordinates, protein functional motifs, etc.)

Cross-References to other databases (DNA refers to protein entry, and vice versa)
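The fields listed above can be pictured as a small record that is filled while reading a flat-file entry line by line. The following is a minimal sketch only: the sample entry is invented, its two-letter line codes are loosely modelled on EMBL-style flat files, and the names `Entry` and `parse_entry` are hypothetical. Real flat files contain many more line types and should be read with an established parser.

```python
from dataclasses import dataclass, field

# Invented sample entry (not a real database record), loosely modelled on
# the two-letter line codes used by EMBL-style flat files.
SAMPLE_ENTRY = """\
ID   EXAMPLE1
AC   X00001; X00002;
DE   Hypothetical example gene for illustration
OS   Homo sapiens
DR   EXAMPLEDB; P00001
SQ   Sequence 12 BP;
     acgtacgtac gt        12
//
"""

@dataclass
class Entry:
    name: str = ""                                   # database-specific name
    accessions: list = field(default_factory=list)   # one or more per entry
    title: str = ""                                  # one line of description
    organism: str = ""                               # optional classification
    cross_refs: list = field(default_factory=list)   # optional (database, id) pairs
    sequence: str = ""                               # index 0 holds position 1

def parse_entry(text):
    """Fill an Entry from one flat-file record (simplified sketch)."""
    entry = Entry()
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:
            # Keep letters only; this drops spaces and position counters.
            entry.sequence += "".join(c for c in line if c.isalpha())
            continue
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "ID":
            entry.name = rest.split()[0]
        elif code == "AC":
            entry.accessions += [a.strip() for a in rest.split(";") if a.strip()]
        elif code == "DE":
            entry.title = (entry.title + " " + rest).strip()
        elif code == "OS":
            entry.organism = rest
        elif code == "DR":
            entry.cross_refs.append(tuple(p.strip() for p in rest.split(";")))
        elif code == "SQ":
            in_sequence = True             # following lines hold the sequence
    return entry

entry = parse_entry(SAMPLE_ENTRY)
print(entry.name, entry.accessions, len(entry.sequence))
```

Because the structure is carried only by line prefixes, the reading program has to know the format in advance; this is the information that is lost when the data leave the original database system.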

Scope of a Query

If you want to retrieve an entry from the database, it is important to decide what type of query will be most effective:

Query by reference, author, sequence name, etc.: This is a search in the annotation. The programs described in this chapter perform this type of search once they have been given the keywords to search for. The result is a list of entries which match the keywords exactly. The benefit of match accuracy is accompanied by a severe disadvantage: frequently, several keywords are required to define the question in sufficient detail. As keywords may be combined with AND, OR, or other logical operators, a query language is required to define the syntax of field combination. Unfortunately, this adds complexity to the programs, and frequently even simple questions become a bit tedious to formulate.

Query by sequence similarity: This is a search in the sequence data. Instead of keywords, similar strings are searched for. This produces a list of entries which match the query sequence closely, but not necessarily exactly. Typically, only a fragment of the query or database sequence will match. Several methods exist to find these "similarities". It is important to realize that the search method severely affects the result, and confusion about which program to use can easily arise. As a rule of thumb, the programs which search extremely fast are less sensitive, because they employ certain approximations (so-called heuristics). Those programs which take several hours to complete perform a more sensitive search (rigorous algorithms) but might not be needed if the question is a simple query for identities.
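How keywords are combined by logical operators in an annotation search can be illustrated with a small sketch. The records, the field names and the helper functions `has`, `AND`, `OR`, `NOT` and `search` are all invented for this illustration; they do not reproduce the query language of any particular retrieval program.

```python
# Hypothetical annotation records; names, titles and authors are invented.
ENTRIES = [
    {"name": "EX1", "title": "human insulin precursor", "author": "Smith"},
    {"name": "EX2", "title": "mouse insulin receptor", "author": "Jones"},
    {"name": "EX3", "title": "human globin", "author": "Smith"},
]

def has(field, keyword):
    """Predicate: keyword occurs as a whole word in the given field."""
    def test(entry):
        return keyword.lower() in entry.get(field, "").lower().split()
    return test

def AND(*preds):
    return lambda entry: all(p(entry) for p in preds)

def OR(*preds):
    return lambda entry: any(p(entry) for p in preds)

def NOT(pred):
    return lambda entry: not pred(entry)

def search(entries, pred):
    """Return the names of all entries whose annotation satisfies pred."""
    return [e["name"] for e in entries if pred(e)]

hits_and = search(ENTRIES, AND(has("title", "insulin"), has("title", "human")))
hits_or = search(ENTRIES, OR(has("title", "globin"), has("author", "Jones")))
hits_not = search(ENTRIES, AND(has("author", "Smith"), NOT(has("title", "insulin"))))
print(hits_and, hits_or, hits_not)
```

The matching here is exact at the word level: an entry annotated with "insulins" would not be found by the keyword "insulin", which illustrates why several keywords or alternatives are often needed.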

This section of the BioCompanion deals with keyword-based retrieval. The corresponding procedures for sequence similarity searching are described in a subsequent chapter.

Networks of Databases

Today's sequence databases have a significant number of cross-references to other databases. A protein sequence, for example, will have one or more references to the DNA sequence(s) coding for the protein, and possibly also pointers to databases describing protein motifs (such as the PROSITE database) or to organism-specific databases. Recently, the interest of researchers has focused on genome projects. Therefore, an entry might also contain information on the genetic locus and pointers to other databases which deal specifically with genomics. All these entries refer to publications which are described in the literature databases. Your computer does not necessarily have all these databases available within the application software used for sequence analysis (such as the GCG package), but browser programs, like the ENTREZ database browser or the SRS database browser, are capable of handling these complex networks of databases. Frequently, no specific programs are needed, because the general-purpose browsers of the World-Wide Web can be employed.

To make the best use of the widely available databases, you first need to find out which databases store the information you are looking for in the most comprehensive fashion. If you only search for a given accession number, you will be able to search all the sequence databases simultaneously. However, searching for the genetic locus of a disease, or for a protein motif associated with a specific protein function, will succeed more efficiently if you use one of the databases made specifically for this purpose. In the two examples mentioned, the databases of choice are OMIM and PROSITE, respectively. Once you encounter hits in one database, you should use this information to expand the search to other databases as well: once you have found one description of a sequence, your search is not finished.
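Expanding from one hit to related entries in other databases amounts to following cross-references step by step. The sketch below assumes a toy cross-reference table with invented database names and identifiers, and a hypothetical helper `expand`; it only illustrates the idea and does not reflect how ENTREZ or SRS store their links.

```python
from collections import deque

# Toy, invented cross-reference table: (database, identifier) -> linked entries.
CROSS_REFS = {
    ("PROTEIN", "P1"): [("DNA", "D1"), ("MOTIF", "M1")],
    ("DNA", "D1"): [("PROTEIN", "P1"), ("LITERATURE", "L1")],
    ("MOTIF", "M1"): [("LITERATURE", "L2")],
    ("LITERATURE", "L1"): [],
    ("LITERATURE", "L2"): [],
}

def expand(start, max_hops=2):
    """Breadth-first walk over cross-references, up to max_hops links away.

    Returns a dict mapping each reachable entry to its distance from start.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue                      # do not follow links any further
        for linked in CROSS_REFS.get(node, []):
            if linked not in seen:        # avoid cycles (DNA <-> protein)
                seen[linked] = seen[node] + 1
                queue.append(linked)
    return seen

reachable = expand(("PROTEIN", "P1"))
for (database, identifier), hops in reachable.items():
    print(hops, database, identifier)
```

Limiting the number of hops keeps the result manageable; since DNA and protein entries refer to each other, the `seen` table is what prevents the walk from looping.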

Computer Networks

Access to databases no longer necessarily takes place on the same computer where you usually do sequence analysis. Some programs operate exclusively via networks, such as the well-known ENTREZ or SRSWWW browsers. The sections below reflect this fact. It is, however, important to note that the retrieved sequences will be in specific formats. Depending on how the data are laid out, the software you want to use for further analysis may or may not interpret them correctly. Therefore, you must determine the formats of the entries you obtain via computer networks and apply appropriate reformatting procedures if the data are to be used in the GCG program package.
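Determining the format of a retrieved entry can often start from its first non-empty line. The following sketch assumes three widely used conventions (a FASTA header beginning with ">", a GenBank entry beginning with "LOCUS", and an EMBL-style entry beginning with "ID"); the function name `guess_format` is hypothetical, the sample lines are invented, and the check is deliberately crude. It does not recognise the GCG format itself and is no substitute for the reformatting tools of your analysis package.

```python
def guess_format(text):
    """Guess the flat-file format from the first non-empty line (heuristic)."""
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID   "):
            return "embl"
        return "unknown"                  # first real line matched nothing
    return "unknown"                      # empty input

# Invented sample inputs for illustration only.
print(guess_format(">EXAMPLE1 hypothetical sequence\nacgtacgt\n"))
print(guess_format("ID   EXAMPLE1\nAC   X00001;\n"))
print(guess_format("some text without a known header\n"))
```

A result of "unknown" is the signal to inspect the entry by hand before passing it to further analysis programs.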

SECURITY NOTICE: Once you use wide area computer networks, you will most probably access databases and computers which are not under local control. The usual assurances about information quality might therefore not apply. This consideration is particularly important for environments on either side of a firewall (commercial companies).

Be sure to understand the difference between an INTRANET (on-campus or company-owned) and the INTERNET (the international world-wide academic network). Whereas the latter is entirely outside local control (hence, insecure and unreliable), sources on the INTRANET can be expected to be maintained in a more reliable fashion.
