This BioCompanion copy is a demo version . This section is dedicated to pattern searching. Programs to find patterns in DNA sequences were already mentioned in the single sequence analysis section . The programs mentioned there search for patterns using compositional analysis. However, restriction mapping programs do also use pattern approaches.
Be aware of the differences - patterns are not sequences! Details will be outlined in this chapter.
+------------------+---------------------+-------------------+ | Feature | Sequence | Pattern | +------------------+---------------------+-------------------+ | | | | | Creation | from raw data | from many seq's | | | | | | Tool | editor (seqed) | manual work | | | | | | Symbols per | | | | position | one | one or several | | | | | | Symbol | | | | repetition | explicit | by expression or | | | | as range of symbols | | | | | Any character | X or N | X or N, but | | | | range is allowed | | | | | | length (in GCG) | up to 350000 | up to 132 | | | | | | searches | sequence | sequence vs. | | | vs. seq. database | pattern library | | | | OR pattern vs. | | | | sequence database | | | | | | data type | full raw data | processed data | | | | | | used at | start of work | late in analysis | | | | | +------------------+---------------------+-------------------+
Computers are fast if the comparison of letters or numbers tests whether these are equal
or not. This is called "test for symbol identity". In order to apply this principle
to biology, we need to define this "matching" in a simple table. E.g., if we compare
two protein sequences and find a letter D,
this is a symbol for aspartic acid. The test for "equal" or "not equal",
therefore, reads in an example as
The main benefit of patterns is that defined substitutions can occur as the
result of a combination of examples - in contrast to a sequence comparison based
on a comparison matrix derived from a sequence-independent generalisation of alignments.
This example refers to a site fragment where we have
proven examples of sequences containing E or D at a given position,
as well as L or D . To write our pattern, we will allow either
of the two sequence characters to be a 'match' at this position. An example of this in a short
alignment stretch is printed below:
Our example requires that
we apply different criteria to different positions of a sequence alignment. In order to have
the position-specific comparison done by a computer, a sophisticated program is required which
will do this type of calculation for us. Note the difference to the
pairwise alignment programs : The comparison of symbols was position-independent there.
Now, using patterns, we do position-dependent comparison calculations. As we cannot expect to
get a specific program for each special pattern, we need to have a pattern matching program
which will define the rules of patterns in a pattern language.
The creation of such a convention to describe patterns in a flexible fashion is not as difficult
as one might assume because patterns are typically short in length
(a few amino acids up to some dozen, but rarely more than hundred symbols) if compared to a query
sequence which we want to screen for the occurrence of a pattern. By reading pattern definitions,
the pattern matching program will be able to search a given
sequence for this pattern defined in a specific language. There are various ways to define such
a language, and the "regular expressions" of some essential programs of the UNIX operating system
use such a language. The GCG software package searches with a straightforward
definition, the most important features are listed below.
The residues to be allowed in a given position are listed with a comma (,) as a separator and
are embraced by parenthesis ( and ). To specify our example
above, it were sufficient to have
The number of residues in a pattern might be repeated or vary
in any symbol in a certain range. The pattern language allows repetitions in the syntax {minimum,
maximum} which consists of two numbers in curled braces. Loops in proteins, for example,
might be found as
Occasionally, if the number of considered fragments is very large, the number of residues to
be allowed in a given position becomes fairly large as well. Then, it is easier to define which
residues are not found at this place. The pattern language uses a tilde (~)
in front of a character or a list embraced with parenthesis in order to define these undesired
residues. Whereas in the example below exclusion was used to define position 2, the explicit
listing of allowed residues in position 3 is more efficient. The pattern definition
as described above relies on regular expressions. Two major disadvantages are
specific for this technique:
Lacking Weighting
Regular expressions are a mathematical formulation of an all-or-nothing approach allowed for
the symbols mentioned. This is fine as long as we do not want to weight for
the occurrence of the options at a given position. Pattern searching on the basis of several
examples, however, will possibly rely on the occurrence of certain preferences. To make this
clearer, review the last example above:
Missing Combinatorial Representation
Some configurations of protein structures reflect interactions by charged amino acids. In the
comparison of many structures, it has been shown that the position of the charge does frequently
not matter, and a Lys-Asp interaction might be equivalent to a Glu-Arg pair in another structure.
Patterns cannot describe this exchange of interaction. A pattern approach of the two sequence
fragments would equally admit Lys OR Glu at one position, whereas its counterpart would allow
for Asp OR Arg. Searching a database with such a pattern would allow a Lys/Arg pairing without
penalties. Unfortunately, no straightforward programs are available today for the molecular biology
researcher to compensate for this problem. Rule-based approaches are known but
rarely implemented in the corresponding software. The problem gets specifically urgent if mismatches
of patterns are allowed and parts of the paired pattern are missing (e.g. Cystein bridges). As
a workaround, the result of pattern searching must be analysed manually or with user-written
small programs to filter for these omissions.
[next page] , or [overview] , or [table of contents]
D meeting D => match
meeting any other => mismatch
This is not very sophisticated, in particular,
as certain proteins with specific sites (such as calcium binding) tend to be ambivalent and allow
either aspartic or glutamic acid. We could use the
comparison matrices as mentioned in the pairwise sequence
comparison section. As already explained there, the comparison of sequences on the basis of matrices
is computationally expensive. The avoidance of numerical computation is not the main reason,
however, for using patterns. Example of Pattern Benefit
sequence fragment 1: GDRDD
sequence fragment 2: GERLD
The simple comparison based on identity would get us two matches, in position 1 and 5,
resp., for G and D. To compensate for the occurrence of several characters at a single position,
we will need to write a pattern which accommodates this fact. For the sake of simplicity, we
do this sequentially. Our sequence match table, therefore, must be extended to read differently
for aspartic acid to accommodate the above-mentioned identity of D and E at position 2:
D meeting D => match
meeting E => match
meeting any other => mismatch
A suitable matrix would allow this as well.
However, keep in mind that a matrix covers symbol pairings at any position,
and, if we would introduce a perfect match of D and E in such a table we would still not accommodate
the L in position 4, but even allow an E there which is not found at all in position 4. To make
this clearer, look at the fourth position of the sequence fragments above. If we have a leucine
residue at a position where we find aspartic acid usually, this were (for the symbol D in position
4)
D meeting D => match
meeting L => match
meeting any other => mismatch
Again, a suitable matrix would possibly cover this second exceptional occurrence, but
we would have a problem with our first case: As we now allowed L in parallel
to D, this were bad for our definition D or E earlier.
The matrix, therefore,would be extended with exceptions and soon or later become very unspecific.
The solution, therefore, is to define substitutions as a function of the sequence:
D meeting D => match (valid at any position)
meeting E at pos. 2 => match
meeting L at pos. 4 => match
meeting any other at
any position => mismatch
Definition of a Pattern Language
sequence fragment 1: GDRDD
sequence fragment 2: GERLD
pattern description: G(D,E)R(D,L)D
sequence 1 fragment: GDGTRD
sequence 2 fragment: GERL
sequence 3 fragment: GDMRD
pattern description: G(D,E)(X){0,2}R(D,L)
sequence 1 fragment fit: GDXXRD (2 X)
sequence 2 fragment fit: GE RL (0 X)
sequence 3 fragment fit: GDX RD (1 X)
sequence 1 fragment: GDD
sequence 2 fragment: GEN
sequence 3 fragment: GNN
sequence 4 fragment: GQD
sequence 5 fragment: GCN
sequence 6 fragment: GKN
sequence 7 fragment: GTD
sequence 8 fragment: GWE
sequence 9 fragment: GKQ
sequence 10 fragment: GLD
=
pattern description: G~(A,F,G,H,I,M,P,R,S,V,Y)(D,E,N,Q)
Drawbacks of patterns
sequence 1 fragment: GDD Position 1: 100 % Gly
sequence 2 fragment: GEN Position 2: 10% each of Asp, Glu, Asn, Gln,
sequence 3 fragment: GNN Cys,Lys,Thr,Trp,Lys,Leu
sequence 4 fragment: GQD Position 3: 40% each of Asp and Asn
sequence 5 fragment: GCN 10% each of Glu and Gln
sequence 6 fragment: GKN
sequence 7 fragment: GTD
sequence 8 fragment: GWE
sequence 9 fragment: GKQ
sequence 10 fragment: GLD
pattern description: G~(A,F,G,H,I,M,P,R,S,V,Y)(D,E,N,Q)
The description of the alternatives (D,E,N,Q) in position 3 reflects
the majority of Asp and Asn insufficiently. To achieve a weighting in alignment calculations,
the profile technique is required which is discussed
in the "sequence families" chapter. There,
position-specific comparison matrices are used to overcome this problem.
JAMF source file: pattern.jam
Next file in HTML:
'Creation of Patterns '