Searching Patterns

This BioCompanion copy is a demo version . This section is dedicated to pattern searching. Programs to find patterns in DNA sequences were already mentioned in the single sequence analysis section . The programs mentioned there search for patterns using compositional analysis. However, restriction mapping programs do also use pattern approaches.

Be aware of the differences - patterns are not sequences! Details will be outlined in this chapter.

 
  
+------------------+---------------------+-------------------+  
| Feature          |   Sequence          |   Pattern         |  
+------------------+---------------------+-------------------+  
|                  |                     |                   |  
| Creation         | from raw data       | from many seq's   |  
|                  |                     |                   |  
| Tool             | editor (seqed)      | manual work       |  
|                  |                     |                   |  
| Symbols per      |                     |                   |  
|        position  | one                 | one or several    |  
|                  |                     |                   |  
| Symbol           |                     |                   |  
|      repetition  | explicit            | by expression or  |  
|                  |                     | as range of symbols  
|                  |                     |                   |  
| Any character    | X or N              | X or N, but       |  
|                  |                     | range is allowed  |  
|                  |                     |                   |  
| length (in GCG)  | up to 350000        | up to 132         |  
|                  |                     |                   |  
| searches         | sequence            | sequence vs.      |  
|                  | vs. seq. database   | pattern library   |  
|                  |                     | OR pattern vs.    |  
|                  |                     | sequence database |   
|                  |                     |                   |   
| data type        | full raw data       | processed data    |  
|                  |                     |                   |  
| used at          | start of work       | late in analysis  |  
|                  |                     |                   |  
+------------------+---------------------+-------------------+

Pattern Principles

Computers are fast if the comparison of letters or numbers tests whether these are equal or not. This is called "test for symbol identity". In order to apply this principle to biology, we need to define this "matching" in a simple table. E.g., if we compare two protein sequences and find a letter D, this is a symbol for aspartic acid. The test for "equal" or "not equal", therefore, reads in an example as

 
  
       D    meeting D            => match   
            meeting any other    => mismatch

This is not very sophisticated, in particular, as certain proteins with specific sites (such as calcium binding) tend to be ambivalent and allow either aspartic or glutamic acid. We could use the comparison matrices as mentioned in the pairwise sequence comparison section. As already explained there, the comparison of sequences on the basis of matrices is computationally expensive. The avoidance of numerical computation is not the main reason, however, for using patterns.

The main benefit of patterns is that defined substitutions can occur as the result of a combination of examples - in contrast to a sequence comparison based on a comparison matrix derived from a sequence-independent generalisation of alignments.

Example of Pattern Benefit

This example refers to a site fragment where we have proven examples of sequences containing E or D at a given position, as well as L or D . To write our pattern, we will allow either of the two sequence characters to be a 'match' at this position. An example of this in a short alignment stretch is printed below:

 
       sequence fragment 1:     GDRDD  
       sequence fragment 2:     GERLD

The simple comparison based on identity would get us two matches, in position 1 and 5, resp., for G and D. To compensate for the occurrence of several characters at a single position, we will need to write a pattern which accommodates this fact. For the sake of simplicity, we do this sequentially. Our sequence match table, therefore, must be extended to read differently for aspartic acid to accommodate the above-mentioned identity of D and E at position 2:

 
       D    meeting D            => match   
            meeting E            => match   
            meeting any other    => mismatch

A suitable matrix would allow this as well. However, keep in mind that a matrix covers symbol pairings at any position, and, if we would introduce a perfect match of D and E in such a table we would still not accommodate the L in position 4, but even allow an E there which is not found at all in position 4. To make this clearer, look at the fourth position of the sequence fragments above. If we have a leucine residue at a position where we find aspartic acid usually, this were (for the symbol D in position 4)

 
       D    meeting D            => match   
            meeting L            => match   
            meeting any other    => mismatch

Again, a suitable matrix would possibly cover this second exceptional occurrence, but we would have a problem with our first case: As we now allowed L in parallel to D, this were bad for our definition D or E earlier. The matrix, therefore,would be extended with exceptions and soon or later become very unspecific. The solution, therefore, is to define substitutions as a function of the sequence:

 
       D    meeting D            => match (valid at any position)  
            meeting E at pos. 2  => match   
            meeting L at pos. 4  => match   
            meeting any other at   
                    any position => mismatch

Definition of a Pattern Language

Our example requires that we apply different criteria to different positions of a sequence alignment. In order to have the position-specific comparison done by a computer, a sophisticated program is required which will do this type of calculation for us. Note the difference to the pairwise alignment programs : The comparison of symbols was position-independent there. Now, using patterns, we do position-dependent comparison calculations. As we cannot expect to get a specific program for each special pattern, we need to have a pattern matching program which will define the rules of patterns in a pattern language.

The creation of such a convention to describe patterns in a flexible fashion is not as difficult as one might assume because patterns are typically short in length (a few amino acids up to some dozen, but rarely more than hundred symbols) if compared to a query sequence which we want to screen for the occurrence of a pattern. By reading pattern definitions, the pattern matching program will be able to search a given sequence for this pattern defined in a specific language. There are various ways to define such a language, and the "regular expressions" of some essential programs of the UNIX operating system use such a language. The GCG software package searches with a straightforward definition, the most important features are listed below.

Alternative Residues

The residues to be allowed in a given position are listed with a comma (,) as a separator and are embraced by parenthesis ( and ). To specify our example above, it were sufficient to have

 
       sequence fragment 1:     GDRDD  
       sequence fragment 2:     GERLD  
  
       pattern description:     G(D,E)R(D,L)D

Repetitions
The number of residues in a pattern might be repeated or vary in any symbol in a certain range. The pattern language allows repetitions in the syntax {minimum, maximum} which consists of two numbers in curled braces. Loops in proteins, for example, might be found as

 
       sequence 1 fragment:     GDGTRD  
       sequence 2 fragment:     GERL  
       sequence 3 fragment:     GDMRD  
  
       pattern description:     G(D,E)(X){0,2}R(D,L)  
  
       sequence 1 fragment fit:    GDXXRD   (2 X)  
       sequence 2 fragment fit:    GE  RL   (0 X)  
       sequence 3 fragment fit:    GDX RD   (1 X)

Alternative Residues, Exclusion
Occasionally, if the number of considered fragments is very large, the number of residues to be allowed in a given position becomes fairly large as well. Then, it is easier to define which residues are not found at this place. The pattern language uses a tilde (~) in front of a character or a list embraced with parenthesis in order to define these undesired residues. Whereas in the example below exclusion was used to define position 2, the explicit listing of allowed residues in position 3 is more efficient.

 
       sequence 1 fragment:     GDD  
       sequence 2 fragment:     GEN  
       sequence 3 fragment:     GNN  
       sequence 4 fragment:     GQD  
       sequence 5 fragment:     GCN  
       sequence 6 fragment:     GKN  
       sequence 7 fragment:     GTD  
       sequence 8 fragment:     GWE  
       sequence 9 fragment:     GKQ  
       sequence 10 fragment:    GLD  
=  
       pattern description:     G~(A,F,G,H,I,M,P,R,S,V,Y)(D,E,N,Q)

Drawbacks of patterns

The pattern definition as described above relies on regular expressions. Two major disadvantages are specific for this technique:

Lacking Weighting

Regular expressions are a mathematical formulation of an all-or-nothing approach allowed for the symbols mentioned. This is fine as long as we do not want to weight for the occurrence of the options at a given position. Pattern searching on the basis of several examples, however, will possibly rely on the occurrence of certain preferences. To make this clearer, review the last example above:

 
       sequence 1 fragment:     GDD    Position 1: 100 % Gly   
       sequence 2 fragment:     GEN    Position 2: 10% each of Asp, Glu, Asn, Gln,    
       sequence 3 fragment:     GNN                        Cys,Lys,Thr,Trp,Lys,Leu  
       sequence 4 fragment:     GQD    Position 3: 40% each of Asp and Asn   
       sequence 5 fragment:     GCN                10% each of Glu and Gln  
       sequence 6 fragment:     GKN  
       sequence 7 fragment:     GTD  
       sequence 8 fragment:     GWE  
       sequence 9 fragment:     GKQ  
       sequence 10 fragment:    GLD  
  
       pattern description:     G~(A,F,G,H,I,M,P,R,S,V,Y)(D,E,N,Q)

The description of the alternatives (D,E,N,Q) in position 3 reflects the majority of Asp and Asn insufficiently. To achieve a weighting in alignment calculations, the profile technique is required which is discussed in the "sequence families" chapter. There, position-specific comparison matrices are used to overcome this problem.

Missing Combinatorial Representation

Some configurations of protein structures reflect interactions by charged amino acids. In the comparison of many structures, it has been shown that the position of the charge does frequently not matter, and a Lys-Asp interaction might be equivalent to a Glu-Arg pair in another structure. Patterns cannot describe this exchange of interaction. A pattern approach of the two sequence fragments would equally admit Lys OR Glu at one position, whereas its counterpart would allow for Asp OR Arg. Searching a database with such a pattern would allow a Lys/Arg pairing without penalties. Unfortunately, no straightforward programs are available today for the molecular biology researcher to compensate for this problem. Rule-based approaches are known but rarely implemented in the corresponding software. The problem gets specifically urgent if mismatches of patterns are allowed and parts of the paired pattern are missing (e.g. Cystein bridges). As a workaround, the result of pattern searching must be analysed manually or with user-written small programs to filter for these omissions.

JAMF source file: pattern.jam
Next file in HTML: 'Creation of Patterns '

[next page] , or [overview] , or [table of contents]