This BioCompanion copy is a demo version.
This section introduces data retrieval. Data
in molecular biology are produced in many different ways. Literature abstracts, disease data,
metabolic pathways, and a wealth of other data types are collected, curated and
redistributed. Within the context of sequence analysis, databases that hold or relate to sequence data
are the most important. Therefore, the considerations below emphasise sequence data libraries and
collections.
The primary public sequence data libraries are collected and maintained at centres
like the EBI (European Bioinformatics Institute, an outstation of EMBL) or the
NCBI (National Center for Biotechnology Information). Other centres are similarly
active; these two serve only as examples. Commercial vendors are becoming increasingly important
as sequence producers, and some databases with EST and other data are now available from different
sources.
Sequence data collection, curation and dissemination are major tasks which require tremendous
resources. Neither the local bioinformatics resource nor the end user is expected to run the
sophisticated software which the central data collection institutions use to collect, maintain,
and curate data. Therefore, simpler procedures are required, and the structure of the
data is modified so that less sophisticated software can access sequence data as desired. After
an export procedure to a so-called flat file, the data are
distributed to the end users' sites in various formats. The main paradigm is that each biological
sequence is described in an entry which has a title, the
sequence data and associated reference information. In a "real" database system, these data are
accessible in a smooth and interlinked fashion. To benefit from the databases in their original
form, however, users would need to install very expensive and staff-intensive database
software (so-called relational database systems). During the export to flat files, a considerable
part of the structuring information is lost and, therefore, auxiliary information must be written
into each entry. The application software at the end user's site must rely on various conventions
(called a format) to present the information as close to the original comprehensive
set as possible.
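As an illustration of how application software can recover structure from a flat file, the following minimal Python sketch reads entries whose lines start with a two-letter code, in the style of EMBL flat files (ID, AC, DE, SQ, and "//" as the end-of-entry marker). The sample entry, its identifier and its accession number are invented for this example, and the function name parse_entries is hypothetical; real entries carry many more line types.

```python
from collections import defaultdict

# Hypothetical entry, invented for illustration; only the layout follows
# the EMBL-style convention of two-letter line codes.
SAMPLE = """\
ID   DEMO1      standard; DNA; 24 BP.
AC   X00000;
DE   Invented demonstration sequence,
DE   not a real database record.
SQ   Sequence 24 BP;
     acgtacgtac gtacgtacgt acgt        24
//
"""

def parse_entries(text):
    """Yield one dict per entry: line code -> joined text, plus 'sequence'."""
    fields, seq, in_seq = defaultdict(list), [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End-of-entry marker: emit what was collected and start afresh.
            entry = {code: " ".join(values) for code, values in fields.items()}
            entry["sequence"] = "".join(seq)
            yield entry
            fields, seq, in_seq = defaultdict(list), [], False
        elif in_seq:
            # Sequence lines: keep the letters, drop blanks and position counts.
            seq.append("".join(ch for ch in line if ch.isalpha()))
        else:
            # Annotation lines: two-letter code, three blanks, then the value.
            code, value = line[:2], line[5:].strip()
            fields[code].append(value)
            if code == "SQ":
                in_seq = True

entries = list(parse_entries(SAMPLE))
print(entries[0]["AC"], len(entries[0]["sequence"]))
```

The line codes are the "auxiliary information" printed into each entry: they are what allows a simple line-by-line reader to rebuild the title, accession number and sequence without a database system.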
Each data set has the following fields: Some data which serve administrative purposes, such as
section information or dates of creation or updating, are not listed. Optionally, one or more
of the following data are attached to an entry if known:
If you want to retrieve an entry from the database, it is important to decide what type of
query will be most effective. This section of the BioCompanion deals with keyword-based retrieval. The corresponding
procedures for sequence similarity searching are described in
a subsequent chapter.
Today's sequence databases have a significant number of cross-references to other databases.
A protein sequence, for example, will have one or more references to the DNA sequence(s) coding
for the protein, and possibly also pointers to databases describing protein motifs (such as the
PROSITE database) or organism-specific databases. Recently,
the interest of researchers has focused on
genome projects. Therefore, information on the genetic locus might be contained in the database,
along with pointers to other databases which deal specifically with genomics. All these entries
will refer to publications which are described in the literature databases. Your
computer does not necessarily have all these databases available within the application software
used for sequence analysis (such as the GCG package), but browser programs, like the
ENTREZ database browser or the
SRS database browser, are capable of handling these complex
networks of databases. Frequently, no specific programs are needed and the general-purpose browsers
of the World Wide Web can be employed.
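To make the idea of cross-references concrete, here is a small Python sketch that collects the pointers an entry holds to other databases. It assumes the Swiss-Prot-style convention of "DR" lines (database name, then identifier, separated by semicolons). All identifiers in the sample are invented, and collect_cross_references is a hypothetical helper, not part of any package named in this section.

```python
# Hypothetical cross-reference lines, invented for illustration. The layout
# (line code "DR", database name, identifier, separated by ";") follows the
# convention used in Swiss-Prot-style flat files.
SAMPLE_LINES = [
    "DR   EMBL; X00000; DEMO1.",
    "DR   PROSITE; PS00000; DEMO_MOTIF.",
    "DR   EMBL; X00001; DEMO2.",
    "DE   A description line that is not a cross-reference.",
]

def collect_cross_references(lines):
    """Group the referenced identifiers by the database they point to."""
    refs = {}
    for line in lines:
        if not line.startswith("DR"):
            continue  # only DR lines carry cross-references
        parts = [p.strip().rstrip(".") for p in line[5:].split(";")]
        if len(parts) >= 2:
            database, identifier = parts[0], parts[1]
            refs.setdefault(database, []).append(identifier)
    return refs

cross_refs = collect_cross_references(SAMPLE_LINES)
for database, identifiers in cross_refs.items():
    print(database, identifiers)
```

A browser program essentially repeats this step for every entry it displays and turns each (database, identifier) pair into a link that can be followed.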
To make the best use of the widely available databases, you first need to find out which databases
store the information you are looking for in the most comprehensive fashion. If you only search
for a given accession number, you will be able to search all the sequence databases
simultaneously. However, searching for the genetic locus of a disease or a protein motif for a specific
protein function will succeed more efficiently if you use one of the databases specifically made
for this purpose. In the two examples mentioned, the databases of choice are OMIM and
PROSITE, respectively. Once you encounter hits in one database, you should use this information
to expand to other databases as well: once you have found one description of a sequence, your
search is not finished.
Access to databases no longer necessarily takes place
on the same computer where you usually do sequence analysis. Some programs operate via networks
exclusively, such as the well-known ENTREZ or
SRSWWW browsers. The sections below reflect this fact. It is, however, important to note
that the retrieved sequences will be in specific formats. The data will be laid out in a way that
the software you want to use for further analysis may or may not interpret correctly. Therefore,
you must determine the formats of the entries you get via computer networks
and apply appropriate procedures for reformatting
if the data are to
be used in the GCG program package.
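Determining the format of a retrieved entry can often be done from its first lines. The Python sketch below is a rough heuristic under stated assumptions: FASTA entries begin with ">", GenBank entries with "LOCUS", EMBL-style entries with "ID" followed by three blanks, and GCG single-sequence files contain a dividing line that ends in "..". The function name guess_format and all sample texts are invented for illustration; a production tool would check far more than this.

```python
def guess_format(text):
    """Return a best-guess format name for one retrieved entry."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith("ID   "):
        return "embl-style"
    # GCG files separate the free-text header from the sequence
    # by a line that ends in two dots.
    if any(ln.rstrip().endswith("..") for ln in lines):
        return "gcg"
    return "unknown"

# Hypothetical snippets, invented for illustration only.
samples = {
    "a": ">demo1 invented sequence\nACGTACGT\n",
    "b": "LOCUS       DEMO1          8 bp    DNA\n",
    "c": "ID   DEMO1      standard; DNA; 8 BP.\n",
    "d": "Invented header text\n DEMO1  Length: 8  Check: 0  ..\n 1 ACGTACGT\n",
    "e": "plain text with no recognisable markers\n",
}
guesses = {name: guess_format(text) for name, text in samples.items()}
print(guesses)
```

Once the format is known, the entry can be passed to the appropriate reformatting procedure before it is used with the analysis software.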
SECURITY NOTICE: Once you use
wide area computer networks, you will most probably access databases and computers which are
not under local control. The usual assurances of information quality, therefore, might not apply. This
consideration is particularly important for environments beyond or within firewalls (commercial
companies).
Be sure to understand the difference between an INTRANET (on-campus or company-owned) and the
INTERNET (the international world-wide academic network). Whereas the latter is entirely outside
of local control (hence, insecure and unreliable), sources on the INTRANET can be expected to be maintained
in a more reliable fashion.
JAMF source file: getseq.jam