This BioCompanion copy is a demo version .
This section of the BioCompanion is dedicated to the pairwise sequence
comparison. In order to compare sequences, the motivation of comparison must be known. Is it
an identity of the two sequences which is questioned, is it a region of best local homology
of a sequence fragment which shall be found, or do you wand to find internal repeats? The following
two approaches are
and
Either of the two methods has its advantages and will be described in detail below.
Let us assume the following two sequences:
One additional value is missing: If the two sequences have different size, some symbols of one
sequence will never have a counterpart. The score of any symbol to "nothing" is
therefore assumed to be 0.
To get started, we will write the two sequences amongst each other like in the painting above.
However, we can either align the beginning, the end, or arrange the sequences arbitrarily. Figure
A below shows our two sequences shifted by various
positions. The score is determined according to the table above. Figure
A: Sequence alignments produced by
shifting. Scores are calculated using a Match value of 1 and a mismatch value of 0. Numbers in
parenthesis refer to the calculation with mismatch values of -0.5. In the matrix below, this
value would replace the '0' values.
Figure A allows to conclude elementary findings: The scoring table, if written as best-score listing of the top four alignments, will read
as:
This type of scoring will favour long alignments and will produce higher scores
the longer the alignments are. However, the mismatches are not penalised, which implies that
long stretches of different sequences might be in the alignment. The result will be that the
score gets better if the alignment gets longer, regardless the amount of mismatches encountered.
In order to discriminate better between similar sequences and those which have accidental
similarity on a long range of symbols, such as expected in G/C rich sequences, we need
to change scoring to penalise mismatches. As an example, we use the scoring
The dotplot method allows
visual inspection of all possible alignments in schematic fashion, and is shown in Figures B and C
.
Figure B displays the very basic dotplot: Two sequences are plotted
as a matrix, and identical symbols
get an x
(for technical reasons only, in graphic output this is a "dot"). As the two sequences are 28
and 36 base pairs in length, we will have 28 x 36 = 1008 positions to calculate. Our DNA alphabet
is a 4-letter alphabet (as we treat T and U as equal), which means that the random chance
of an identical symbol at any given position is 1/4 = 0.25. Therefore, if the two sequences
are totally unrelated, we expect 0.25 x 1008 = 252 dots, or, as formula,
Figure B: Dot-plot created by painting a dot (x) at each
match. Obviously, we still need to improve the so-called signal - to - noise ratio.
The signal is a line or otherwise visible pattern which we could use in the
visual inspection. The noise is what we can expect from statistics: If we use
a mathematical approximation to count the probability in a four-letter alphabet, we expect 25%
of a random hit probability, which is too high if we ask for weak similarities - remember that
the best score we had was We need an improvement
for the dotplots with the word method:
In our example, on a length of 28 base pairs, we have only 15 matches which is approximately
every second nucleotide. This is a relatively weak signal. Therefore, we use a first approximation:
We do no longer point a dot in if each nucleotide matches, but we use oligomers (called
words) and paint a dot if these words match. This reduces the chance of a random
match. If we use di-nucleotides, accidental matches will be (1/4)*(1/4) = 1/16 = 6.25% which
is already much lower than the 25% obtained earlier. The GCG program suite uses the default
word size of 6 - this is (1/4) to the power of 6, which results
in a random choice probability of 0.025%. There is, however, an undesired side-effect: the larger
the word size, the lower is the probability that a given word matches in between two different
sequences. Expressed as formula, this will read
The problem with
the sensitivity (too few subsequent identities in the case of low similarity) can be overcome
with the permission of mismatches in a word. This is the already known
windows technique : We select a window and
request that a minimal number of matches within this window is obtained. The GCG programs call
this a stringency. Therefore, using a window/stringency
algorithm, we will be able to paint dots at the middle of a window
rather than a word, which means that, given the values 9/5 for window/stringency,
we obtain the plot as shown in Figure D.
Figure D:
Dot-plot with a dot (+) at each matching window of 9 - with a minimum of 5 matches (stringency)
per window.
Looking at Figure D, two main diagonals
can be identified:
Tip for the Interpretation of Dotplots:
Always try to write down diagonals of interest in the way as depicted above. If you need computerised
assistance, use the 'gap' program of the GCG package with very high gap penalty values (e.g., 50). Explanations
on 'gap' can be found in a later section of this chapter.
[next page] , or [overview] , or [table of contents]
Schematic Comparison
Principle of Sequence Alignment
My1.seq tgatggtcaagtaaactatgaagagttt
unknown seq atggtaatggcacaattgactttcctgaatttctga
If we want to align those, we will try to write the two sequences in
a way which allows a pairwise comparison of each sequence symbol. As you might guess, there are
lots of possible options to do so, and the longer the sequences are the more options to align
two sequences will exist. In order to find the best alignment, we need to judge
the quality of the alignment. To allow computations and comparisons, this judgement shall result
in a numerical value, which is called a score . The determination of this score
relies on a symbol comparison table, where each symbol pairing gets a value
assigned, in order to determine the overall score by adding up the comparison value of each observed
pair in our alignment. These tables are very
important in the protein field, but also used in DNA comparison. A typical, simple scoring table for nucleotides
will give a value of 1 to a match (treating U and T as "match"), and assign
a value of 0 to each mismatch. This matrix
is perfectly symmetric and would be sufficient if printed as half-populated table.
Match value: 1
Mismatch value: 0
+-----+-----+-----+-----+-----+
| A | G | C | T | U |
+-----+-----+-----+-----+-----+-----+
| A | 1 | 0 | 0 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| G | 0 | 1 | 0 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| C | 0 | 0 | 1 | 0 | 0 |
+-----+-----+-----+-----+-----+-----+
| T | 0 | 0 | 0 | 1 | 1 |
+-----+-----+-----+-----+-----+-----+
| U | 0 | 0 | 0 | 1 | 1 |
+-----+-----+-----+-----+-----+-----+
tgatggtcaagtaaactatgaagagttt
123 | || || || | | shift 3: score 9 (+1.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
12||||| | | | || shift 2: score 10 (+2.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
1 | | | | | | shift 1: score 6 (-5.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
| | shift 0: score 2 (-11.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
1|| | || | shift -1: score 6 (-5.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
12 | | | | shift -2: score 4 (-8.0)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
123 | || || shift -3: score 5 (-6.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
1234| |||| | | ||| || ||| shift -4: score 15 (+8.5)
atggtaatggcacaattgactttcctgaatttctga
tgatggtcaagtaaactatgaagagttt
12345 | ||| | | | | || shift -5: score 10 (+1.0)
atggtaatggcacaattgactttcctgaatttctga
Score Shift Length
------------------------
15 -4 29
10 2 27
-5 29
9 3 26
This means that one alignment with shift -4 is calculated to be "best"
but the alignments with shift 2, -5 and 3 are of a similar score.
Match value: +1.0
Mismatch value: -0.5
and recalculate scores. Figure A shows
the values in parenthesis. The scoring table, if written as best-score listing of the top four
alignments, will now read as:
Score Shift Length
------------------------
+8.5 -4 28
+2.0 2 26
+1.0 -5 28
3 25
The main benefit of this scoring schema is a quality
discrimination: All alignments which have twice as much mismatches than matches will
score negatively. This implies that we can now introduce a threshold
and indicate a "reasonable" alignment to be of a "positive
score". However, we have not tried all of the possible shifts, and it is not easily feasible
to compare several kb of sequences this way. Therefore, we need an automatism which allows to
judge sequence alignments after visual inspection. Principle of Dotplots
Number of possible dots =
(probability of pair) * (length of sequence A) * (length of Sequence B)
Counting the x in Figure B gives 278 dots, which is fairly close to the
expected value.
t x x x x x x x x x x x x x
t x x x x x x x x x x x x x
__t x x x x x x x x x x x x x
25g x x x x x x x
a x x x x x x x x x x
g x x x x x x x
a x x x x x x x x x x
__a x x x x x x x x x x
20g x x x x x x x
t x x x x x x x x x x x x x
a x x x x x x x x x x
t x x x x x x x x x x x x x
__c x x x x x x
15a x x x x x x x x x x
a x x x x x x x x x x
a x x x x x x x x x x
t x x x x x x x x x x x x x
__g x x x x x x x
10a x x x x x x x x x x
a x x x x x x x x x x
c x x x x x x
t x x x x x x x x x x x x x
__g x x x x x x x
5 g x x x x x x x
t x x x x x x x x x x x x x
a x x x x x x x x x x
g x x x x x x x
t x x x x x x x x x x x x x
__ a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
| |5 |10 |15 |20 |25 |30 |35
Looking at Figure B, we can draw several conclusions:
tgatggtcaagtaaactatgaagagttt
1234| |||| | | ||| || ||| shift -4: score 15 (+8.5)
atggtaatggcacaattgactttcctgaatttctga
Dotplot Principle - Improved
Number of possible dots =
(probability of word) * (length of sequence A) * (length of Sequence B)
Figure C: Dot-plot painting a dot (o) at each matching di-nucleotide
- more details are described in the text.
t o o o o o
t o o o o o
__t o
25g
a o o
g
a o o
__a o o o
20g o o o o
t o o o
a o o o o
t .o o o
__c o .o
15a o . o . o
a o . o . o
a . .
t o . .
__g . .
10a o. . o o
a . o. o
c . . o o
t .o .
__g .o .o
5 g .o .o o o
t .o .o o o
a . o o
g o o o o o
t
__ a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
| |5 |10 |15 |20 |25 |30 |35
The result of the application in case of a di-nucleotide match (word size 2) is shown
in Figure C: In total, 65 dots (painted as o)
are painted in the view of ((28 * 36) x (0.25 x 0.25) = ) 63 expected. In the figure,
dots (.) have been painted in suggestively in order to show the position of the two best
hits obtained in the alignments displayed in Figure A
. The reason for the weak appearance is the low similarity of the two sequences.
Dotplot Principle - Improved Again
t
t
__t
25g + +
a + +
g +
a
__a
20g
t
a +
t +
__c + +
15a +
a +
a + +
t +
__g +
10a + +
a + +
c + + + +
t + +
__g +
5 g + + +
t
a
g
t
__ a t g g t a a t g g c a c a a t t g a c t t t c c t g a a t t t c t g a
| |5 |10 |15 |20 |25 |30 |35
Three conclusions can be drawn from Figure D:
Interpretation of Dotplots
vertical horizontal
sequence sequence
short diagonal 5-10 4-9
long diagonal 5-15 9-21
This is an important conclusion, as the vertical sequence has obviously a region in its
beginning which is similar to the horizontal sequence in two different areas (4-9, and 9-15,
respectively). However, be careful if you use window/stringency as the coordinates
plotted as diagonals will be affected by the size of the window. The actual
region of similarity, therefore, will need to be expanded by (window size/2). If we schematically
write the sequences in a letter-by-letter format, however, it will become immediately obvious
that the window/stringency algorithm averages tremendously:
tgatggtcaagtaaactatgaagagttt vertical sequence
12||||| | | | || (short diagonal)
atggtaatggcacaattgactttcctgaatttctga horizontal sequence
1234| |||| | | ||| || ||| (long diagonal)
tgatggtcaagtaaactatgaagagttt vertical sequence
In this case, experimental evidence will be required to consolidate the
computer prediction of whether either the "short" or "long" diagonal are of biological relevance.
The protein comparison will be valuable if available or possible.
JAMF source file: schema.jam
Next file in HTML:
'GCG's Implementation of Schematic Comparison'