Comparison of Two Sequences

This BioCompanion copy is a demo version . This section of the BioCompanion is dedicated to the pairwise sequence comparison. In order to compare sequences, the motivation of comparison must be known. Is it an identity of the two sequences which is questioned, is it a region of best local homology of a sequence fragment which shall be found, or do you wand to find internal repeats? The following two approaches are

Either of the two methods has its advantages and will be described in detail below.


Schematic Comparison

Principle of Sequence Alignment

Let us assume the following two sequences:

 
  
My1.seq       tgatggtcaagtaaactatgaagagttt  
unknown seq   atggtaatggcacaattgactttcctgaatttctga  
  
If we want to align those, we will try to write the two sequences in a way which allows a pairwise comparison of each sequence symbol. As you might guess, there are lots of possible options to do so, and the longer the sequences are the more options to align two sequences will exist. In order to find the best alignment, we need to judge the quality of the alignment. To allow computations and comparisons, this judgement shall result in a numerical value, which is called a score . The determination of this score relies on a symbol comparison table, where each symbol pairing gets a value assigned, in order to determine the overall score by adding up the comparison value of each observed pair in our alignment. These tables are very important in the protein field, but also used in DNA comparison. A typical, simple scoring table for nucleotides will give a value of 1 to a match (treating U and T as "match"), and assign a value of 0 to each mismatch. This matrix is perfectly symmetric and would be sufficient if printed as half-populated table.

One additional value is missing: If the two sequences have different size, some symbols of one sequence will never have a counterpart. The score of any symbol to "nothing" is therefore assumed to be 0.

To get started, we will write the two sequences amongst each other like in the painting above. However, we can either align the beginning, the end, or arrange the sequences arbitrarily. Figure A below shows our two sequences shifted by various positions. The score is determined according to the table above. Figure A: Sequence alignments produced by shifting. Scores are calculated using a Match value of 1 and a mismatch value of 0. Numbers in parenthesis refer to the calculation with mismatch values of -0.5. In the matrix below, this value would replace the '0' values.

 
            Match value: 1  
         Mismatch value: 0  
           
               +-----+-----+-----+-----+-----+  
               |  A  |  G  |  C  |  T  |  U  |  
         +-----+-----+-----+-----+-----+-----+  
         |  A  |  1  |  0  |  0  |  0  |  0  |  
         +-----+-----+-----+-----+-----+-----+  
         |  G  |  0  |  1  |  0  |  0  |  0  |  
         +-----+-----+-----+-----+-----+-----+  
         |  C  |  0  |  0  |  1  |  0  |  0  |  
         +-----+-----+-----+-----+-----+-----+  
         |  T  |  0  |  0  |  0  |  1  |  1  |  
         +-----+-----+-----+-----+-----+-----+  
         |  U  |  0  |  0  |  0  |  1  |  1  |  
         +-----+-----+-----+-----+-----+-----+  
  
  
tgatggtcaagtaaactatgaagagttt  
123  |  ||    || ||  |   |               shift  3: score 9   (+1.0)  
   atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
12||||| | |  |    ||                     shift  2: score 10  (+2.0)  
  atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
1   |     | | | |         |              shift  1: score 6   (-5.5)  
 atggtaatggcacaattgactttcctgaatttctga  
  
tgatggtcaagtaaactatgaagagttt  
                |        |               shift  0: score 2   (-11.0)  
atggtaatggcacaattgactttcctgaatttctga  
  
 tgatggtcaagtaaactatgaagagttt  
1||    |     ||   |                      shift -1: score 6   (-5.0)  
atggtaatggcacaattgactttcctgaatttctga  
  
  tgatggtcaagtaaactatgaagagttt  
12         |  |           |  |           shift -2: score 4   (-8.0)  
atggtaatggcacaattgactttcctgaatttctga  
  
   tgatggtcaagtaaactatgaagagttt  
123     | ||                 ||          shift -3: score 5   (-6.5)  
atggtaatggcacaattgactttcctgaatttctga  
  
    tgatggtcaagtaaactatgaagagttt  
1234| ||||   | |  |||     || |||         shift -4: score 15  (+8.5)  
atggtaatggcacaattgactttcctgaatttctga  
  
     tgatggtcaagtaaactatgaagagttt  
12345    |  ||| | |  |      | ||         shift -5: score 10  (+1.0)   
atggtaatggcacaattgactttcctgaatttctga  

Figure A allows to conclude elementary findings:

The scoring table, if written as best-score listing of the top four alignments, will read as:

 
  
       Score    Shift   Length   
       ------------------------  
         15      -4       29  
         10       2       27  
                 -5       29  
          9       3       26  
  
This means that one alignment with shift -4 is calculated to be "best" but the alignments with shift 2, -5 and 3 are of a similar score.

This type of scoring will favour long alignments and will produce higher scores the longer the alignments are. However, the mismatches are not penalised, which implies that long stretches of different sequences might be in the alignment. The result will be that the score gets better if the alignment gets longer, regardless the amount of mismatches encountered. In order to discriminate better between similar sequences and those which have accidental similarity on a long range of symbols, such as expected in G/C rich sequences, we need to change scoring to penalise mismatches. As an example, we use the scoring

 
  
            Match value: +1.0  
         Mismatch value: -0.5  
  
and recalculate scores. Figure A shows the values in parenthesis. The scoring table, if written as best-score listing of the top four alignments, will now read as:
 
  
       Score    Shift   Length   
       ------------------------  
        +8.5     -4       28  
        +2.0      2       26  
        +1.0     -5       28  
                  3       25  
  
The main benefit of this scoring schema is a quality discrimination: All alignments which have twice as much mismatches than matches will score negatively. This implies that we can now introduce a threshold and indicate a "reasonable" alignment to be of a "positive score". However, we have not tried all of the possible shifts, and it is not easily feasible to compare several kb of sequences this way. Therefore, we need an automatism which allows to judge sequence alignments after visual inspection.

Principle of Dotplots

The dotplot method allows visual inspection of all possible alignments in schematic fashion, and is shown in Figures B and C .

Figure B displays the very basic dotplot: Two sequences are plotted as a matrix, and identical symbols get an x (for technical reasons only, in graphic output this is a "dot"). As the two sequences are 28 and 36 base pairs in length, we will have 28 x 36 = 1008 positions to calculate. Our DNA alphabet is a 4-letter alphabet (as we treat T and U as equal), which means that the random chance of an identical symbol at any given position is 1/4 = 0.25. Therefore, if the two sequences are totally unrelated, we expect 0.25 x 1008 = 252 dots, or, as formula,

 
Number of possible dots =   
        (probability of pair) * (length of sequence A) * (length of Sequence B)  
Counting the x in Figure B gives 278 dots, which is fairly close to the expected value.

Figure B: Dot-plot created by painting a dot (x) at each match.

 
  
  t   x     x     x               x x       x x x     x       x x x   x  
  t   x     x     x               x x       x x x     x       x x x   x  
__t   x     x     x               x x       x x x     x       x x x   x  
25g     x x         x x               x                 x               x  
  a x         x x         x   x x       x                 x x             x  
  g     x x         x x               x                 x               x  
  a x         x x         x   x x       x                 x x             x  
__a x         x x         x   x x       x                 x x             x  
20g     x x         x x               x                 x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
  a x         x x         x   x x       x                 x x             x  
  t   x     x     x               x x       x x x     x       x x x   x  
__c                     x   x             x       x x               x  
15a x         x x         x   x x       x                 x x             x  
  a x         x x         x   x x       x                 x x             x  
  a x         x x         x   x x       x                 x x             x  
  t   x     x     x               x x       x x x     x       x x x   x  
__g     x x         x x               x                 x               x  
10a x         x x         x   x x       x                 x x             x  
  a x         x x         x   x x       x                 x x             x  
  c                     x   x             x       x x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
__g     x x         x x               x                 x               x  
5 g     x x         x x               x                 x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
  a x         x x         x   x x       x                 x x             x  
  g     x x         x x               x                 x               x  
  t   x     x     x               x x       x x x     x       x x x   x  
__  a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a  
  |         |5        |10       |15       |20       |25       |30       |35  
  
Looking at Figure B, we can draw several conclusions:

Obviously, we still need to improve the so-called signal - to - noise ratio. The signal is a line or otherwise visible pattern which we could use in the visual inspection. The noise is what we can expect from statistics: If we use a mathematical approximation to count the probability in a four-letter alphabet, we expect 25% of a random hit probability, which is too high if we ask for weak similarities - remember that the best score we had was

 
  
    tgatggtcaagtaaactatgaagagttt  
1234| ||||   | |  |||     || |||         shift -4: score 15  (+8.5)  
atggtaatggcacaattgactttcctgaatttctga  
  

Dotplot Principle - Improved

We need an improvement for the dotplots with the word method: In our example, on a length of 28 base pairs, we have only 15 matches which is approximately every second nucleotide. This is a relatively weak signal. Therefore, we use a first approximation: We do no longer point a dot in if each nucleotide matches, but we use oligomers (called words) and paint a dot if these words match. This reduces the chance of a random match. If we use di-nucleotides, accidental matches will be (1/4)*(1/4) = 1/16 = 6.25% which is already much lower than the 25% obtained earlier. The GCG program suite uses the default word size of 6 - this is (1/4) to the power of 6, which results in a random choice probability of 0.025%. There is, however, an undesired side-effect: the larger the word size, the lower is the probability that a given word matches in between two different sequences. Expressed as formula, this will read

 
Number of possible dots =   
        (probability of word) * (length of sequence A) * (length of Sequence B)  
Figure C: Dot-plot painting a dot (o) at each matching di-nucleotide - more details are described in the text.
 
  t                                o         o o               o o       
  t                                o         o o               o o       
__t        o                                                             
25g                                                                       
  a                                    o                 o                  
  g                                                                       
  a                            o                           o                
__a                                    o                 o               o  
20g    o           o                 o                 o                   
  t  o           o                                           o           
  a          o                       o                 o               o   
  t                                       .o         o               o   
__c                        o            .o                            
15a            o          .    o      .                    o                 
  a            o        .      o    .                      o                 
  a                   .           .                                          
  t        o        .           .                                        
__g               .           .                                          
10a            o.           .  o                           o                 
  a           .          o.  o                                               
  c         .           .                         o                 o   
  t       .o          .                                                 
__g     .o          .o                                                     
5 g   .o          .o                 o                 o                   
  t .o          .o               o                            o            
  a           .                                          o               o  
  g    o           o                 o                 o               o  
  t                                                                      
__  a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a     
  |         |5        |10       |15       |20       |25       |30       |35  
  
The result of the application in case of a di-nucleotide match (word size 2) is shown in Figure C:
In total, 65 dots (painted as o) are painted in the view of ((28 * 36) x (0.25 x 0.25) = ) 63 expected. In the figure, dots (.) have been painted in suggestively in order to show the position of the two best hits obtained in the alignments displayed in Figure A . The reason for the weak appearance is the low similarity of the two sequences.

Dotplot Principle - Improved Again

The problem with the sensitivity (too few subsequent identities in the case of low similarity) can be overcome with the permission of mismatches in a word. This is the already known windows technique : We select a window and request that a minimal number of matches within this window is obtained. The GCG programs call this a stringency. Therefore, using a window/stringency algorithm, we will be able to paint dots at the middle of a window rather than a word, which means that, given the values 9/5 for window/stringency, we obtain the plot as shown in Figure D.

Figure D: Dot-plot with a dot (+) at each matching window of 9 - with a minimum of 5 matches (stringency) per window.

 
  t                                                                      
  t                                                                      
__t                                                                      
25g                                       +                 +              
  a                                     +                 +                  
  g                                                           +            
  a                                                                          
__a                                                                          
20g                                                                        
  t                                                                      
  a                               +                                          
  t                                                               +      
__c                                       +                     +      
15a                                     +                                    
  a                                   +                                      
  a                                 + +                                      
  t                                 +                                    
__g                               +                                        
10a               +             +                                            
  a             +             +                                              
  c           +           + +                           +              
  t         +           +                                                
__g                   +                                                    
5 g                 +                       +                 +            
  t                                                                      
  a                                                                          
  g                                                                        
  t                                                                      
__  a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a     
  |         |5        |10       |15       |20       |25       |30       |35  
  
Three conclusions can be drawn from Figure D:

Interpretation of Dotplots

Looking at Figure D, two main diagonals can be identified:

 
                        vertical      horizontal                 
                        sequence       sequence                    
  
short diagonal           5-10            4-9  
long diagonal            5-15            9-21  
  
This is an important conclusion, as the vertical sequence has obviously a region in its beginning which is similar to the horizontal sequence in two different areas (4-9, and 9-15, respectively). However, be careful if you use window/stringency as the coordinates plotted as diagonals will be affected by the size of the window. The actual region of similarity, therefore, will need to be expanded by (window size/2). If we schematically write the sequences in a letter-by-letter format, however, it will become immediately obvious that the window/stringency algorithm averages tremendously:
 
  
tgatggtcaagtaaactatgaagagttt             vertical sequence   
12||||| | |  |    ||                     (short diagonal)  
  atggtaatggcacaattgactttcctgaatttctga   horizontal sequence  
  1234| ||||   | |  |||     || |||       (long diagonal)  
      tgatggtcaagtaaactatgaagagttt       vertical sequence  
  
In this case, experimental evidence will be required to consolidate the computer prediction of whether either the "short" or "long" diagonal are of biological relevance. The protein comparison will be valuable if available or possible.

Tip for the Interpretation of Dotplots:

Always try to write down diagonals of interest in the way as depicted above. If you need computerised assistance, use the 'gap' program of the GCG package with very high gap penalty values (e.g., 50). Explanations on 'gap' can be found in a later section of this chapter.


JAMF source file: schema.jam
Next file in HTML: 'GCG's Implementation of Schematic Comparison'

[next page] , or [overview] , or [table of contents]