Homologies makes a table of the pair-wise distances within a group of aligned sequences.
This program was written by Jack A.M. Leunissen (E-mail: jackl@caos.kun.nl; Post: CAOS/CAMM Center, University of Nijmegen, 6525 ED Nijmegen, The Netherlands).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a sample session with Homologies:

% homologies

HOMOLOGIES uses any sequences

HOMOLOGIES of what sequence(s) ?  pileup.msf{*}

   pileup.msf{Hs70_Plafa}, len: 738
   pileup.msf{Hs70_Thean}, len: 738
   ////////////////////////////////
   pileup.msf{Hs77_Yeast}, len: 738
   pileup.msf{Dnak_Ecoli}, len: 738

Start (* 1 *) ?  31
End (* 738 *) ?  230

What is the threshold for a match (* 0.60 *) ?

How should gaps be handled:

   I)nclude gaps in the comparisons
   L)ength-independent gap inclusion
   E)xclude gaps from comparison
   N)one of the gaps in any sequence

Please choose one (* I *):

How should END-gaps be handled:

   I)nclude them in the comparison
   E)xclude them

Please choose one (* I *):

Divide the sum of the matches by

   R)esidues compared
   S)horter sequence length
   A)verage sequence length
   N)othing

Please choose one (* R *):

What should I call the output file (* pileup.homologies *) ?

%
Here is some of the output file of a session with Homologies:

HOMOLOGIES within: @gendocdata:pretty.list   May 23, 1995 14:21

    Number of sequences: 11
   First residue number: 1
    Last residue number: 349
Threshold of comparison: 0.60
Symbol Comparison Table: pepdistances.cmp
            Denominator: "Number of residues compared"
  Gap handling, General: "Including gaps"
  Gap handling, Termini: "Including terminal gaps"
            Gap penalty: 0.00

Default scoring matrix used by OLDDISTANCES for the comparison of protein
sequences. Dayhoff table (Schwartz, R. M. and Dayhoff, M. O. [1979] in Atlas
of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 353-358, National
Biomedical Research Foundation, Washington D.C.) rescaled by dividing each
value by the sum of its row and column, and normalizing to a mean . . .

Key for column and row indices:

    1  fa10.ugly   Length: 349   Length without gaps: 212
    2  fa12.ugly   Length: 349   Length without gaps: 213
    3  fo1k.ugly   Length: 349   Length without gaps: 213
    4  e.ugly      Length: 349   Length without gaps: 288
    5  p1m.ugly    Length: 349   Length without gaps: 302
    6  p1s.ugly    Length: 349   Length without gaps: 302
    7  p2s.ugly    Length: 349   Length without gaps: 301
    8  p3s.ugly    Length: 349   Length without gaps: 300
    9  cb3.ugly    Length: 349   Length without gaps: 288
   10  r14.ugly    Length: 349   Length without gaps: 289
   11  r2.ugly     Length: 349   Length without gaps: 289

Similarity Matrix   Part: 1

              1        2        3        4        5   ...
        ______________________________ ...
        ..
|   1 |    0.6000   0.5606   0.4605   0.1634   0.1787 ...
|   2 |             0.6000   0.4744   0.1677   0.1809 ...
|   3 |                      0.6000   0.1586   0.1809 ...
|   4 |                               0.6000   0.1694 ...

//////////////////////////////////////////////////////
Homologies calculates the pair-wise homology scores, or the distances within a group of previously aligned sequences. Optionally, the program creates an output file, suitable to be used as input for the PHYLIP programs.
Homologies can handle gaps in the input sequences in several ways: they can be included in the calculations or ignored. When gaps are included, either every gap position is treated as an individual mismatch, or each gap is treated as one single mismatch regardless of its length. Likewise, when gaps are ignored, they may either be ignored for each sequence pair individually, or any gap occurring in any sequence may be ignored in all sequences.
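The four gap-handling choices described above can be sketched in Python as follows. This is an illustration only, not the program's source: the function names are hypothetical, matching is simplified to plain identity, the gap character is assumed to be '.', and columns that are gapped in both sequences of a pair are assumed to be skipped.

```python
GAP = "."  # assumption: gaps are written as '.' in the aligned sequences


def compare_pair(a, b, mode="include"):
    """Return (matches, positions_compared) for two aligned sequences.

    mode 'include': every gap position counts as one mismatch
    mode 'length' : each contiguous gap run counts as one mismatch
    mode 'exclude': gap positions are skipped for this pair only
    """
    matches = compared = 0
    in_run = False  # True while we are inside a run of gap positions
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # assumption: a shared gap says nothing about this pair
        if x == GAP or y == GAP:
            if mode == "include" or (mode == "length" and not in_run):
                compared += 1  # counted as a compared position, never a match
            in_run = True
            continue
        in_run = False
        compared += 1
        if x == y:
            matches += 1
    return matches, compared


def drop_gapped_columns(seqs):
    """'None' choice: remove every column with a gap in any sequence."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != GAP for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]
```

For the pair "ACG...TA" and "ACGTTTTC", the three pairwise modes all find 4 matches but compare 8, 6, and 5 positions respectively, which is why the gap choice changes the resulting score.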
A special case is the treatment of end-gaps, i.e. gaps occurring at the beginning or end of the sequences where the termini do not align. They can be switched on or off separately (this, of course, does not apply when general gap handling is already switched off!).
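One way to picture the exclusion of end-gaps is to trim each pair to the region where both sequences have started and neither has ended. The sketch below assumes '.' as the gap character; the helper names are hypothetical and the actual program may implement this differently.

```python
def internal_span(seq, gap="."):
    """Return (start, end) slice bounds of seq without its terminal gaps."""
    start = len(seq) - len(seq.lstrip(gap))  # number of leading gap characters
    end = len(seq.rstrip(gap))               # index just past the last residue
    return start, end


def trim_end_gaps(a, b, gap="."):
    """Keep only the columns lying inside the termini of both sequences."""
    sa, ea = internal_span(a, gap)
    sb, eb = internal_span(b, gap)
    start, end = max(sa, sb), min(ea, eb)
    return a[start:end], b[start:end]
```

Internal gaps survive the trimming, so they are still subject to the general gap setting.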
The homology or mismatch value is usually expressed as the number of (mis)matches per residue. The number of matching characters - or the mismatch value - can therefore be divided by either the number of residues compared, the length of the smaller sequence, the average sequence length, or nothing. By using the -PERCent option, the sum of matching residues is (by default) divided by the number of residues compared, and multiplied by 100.
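The choice of denominator can be sketched as a small Python function. The name `homology` and its arguments are hypothetical, and the sequence lengths are assumed here to be the lengths without gaps; the text does not say which length the program uses.

```python
def homology(matches, compared, len_a, len_b,
             denominator="residues", percent=False):
    """Scale a match count by the chosen denominator.

    denominator: 'residues' (positions compared), 'shorter',
                 'average', or 'nothing' (raw match count).
    """
    if denominator == "residues":
        d = compared
    elif denominator == "shorter":
        d = min(len_a, len_b)
    elif denominator == "average":
        d = (len_a + len_b) / 2
    elif denominator == "nothing":
        d = 1
    else:
        raise ValueError(f"unknown denominator: {denominator!r}")
    value = matches / d
    return 100 * value if percent else value
```

With 4 matches over 8 compared positions, the default gives 0.5, and the same call with `percent=True` gives 50.0.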
Homologies can apply a correction known as augmentation. When distantly related sequences are compared, the number of mismatches (or evolutionary events) is usually underestimated because of "multiple hits": any homologous position in a pair of sequences may have undergone several substitutions, while only a single difference is observed at that position. The position may even be identical in both sequences although both lineages have undergone several substitutions since their common ancestor before arriving at their current (identical) state. To compensate for this underestimation of the number of substitutions, especially in distantly related sequences, several formulas ("augmentation" schemes) have been published. A number of them have been implemented in Homologies.
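As an illustration of such a scheme, the textbook Jukes-Cantor correction converts an observed fraction of differing positions p into an estimated number of substitutions per site. The sketch below uses the standard formula generalized to an alphabet of k symbols; whether Homologies uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def jukes_cantor(p, alphabet_size=4):
    """Jukes-Cantor distance for an observed difference fraction p.

    d = -b * ln(1 - p / b), with b = (k - 1) / k
    k = 4 for nucleotides, k = 20 for amino acids.
    """
    b = (alphabet_size - 1) / alphabet_size
    if not 0 <= p < b:
        # At p >= b the sequences look random and the estimate is undefined.
        raise ValueError("p must satisfy 0 <= p < (k - 1) / k")
    return -b * math.log(1 - p / b)
```

The corrected value is always at least as large as p, and the gap between the two widens as p grows, which matches the underestimation described above.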
The default settings in Homologies are by no means imperative; they merely reflect the author's preferences!
The input file for Homologies is a GCG MSF sequence file.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Syntax: % homologies [-INfile=]pileup.msf{*} -Default

Required Parameters:

-THReshold=1.0               minimum symbol comparison score for a match
-DENOMinator=Residue         divides the sum of the matches by:
                               Residues = Number of residues compared
                               Shorter  = Length of shorter sequence
                               Average  = Average length
                               Nothing  = Nothing
-GAPS=                         Include  = Include gaps in the comparison
                               Length   = length-independent gaps
                               Exclude  = Exclude gaps from the comparison
                               None     = Exclude EVERY gap in EVERY sequence
-ENDGaps=                      Include  = Include end-gaps in the comparison
                               Exclude  = Exclude end-gaps in the comparison
-GAPValue=                   Gap penalty
[-OUTfile=]pileup.distances  output file

Local Data Files:

-DATa=pepdistances.cmp       comparison table for peptide sequences
-DATa=dnadistances.cmp       comparison table for nucleotide sequences

Optional Parameters:

-DISTances                   calculate sequence differences
-NASscore                    Doolittle's NAS score
-PERCent                     print percentage homology
-AUGmentation=               correct sequences for multiple hits:
                               Jukes  = Jukes-Cantor
                               Kimura = Kimura's method
-SQRT                        take square root of distances
-PHYlip=PileUp.Phylip        output comparison matrix in PHYLIP format
-NAMELength=10               length of name-field in PHYLIP output
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-THReshold=1
defines the minimum symbol comparison score for a match.
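The role of the threshold can be sketched as follows: a pair of aligned symbols counts as a match when its score in the symbol comparison table reaches the threshold. The table values below are invented for illustration (the real scores come from pepdistances.cmp or dnadistances.cmp), the function names are hypothetical, and treating "minimum" as "greater than or equal to" is an assumption.

```python
# Hypothetical scores for illustration; NOT the contents of pepdistances.cmp.
TOY_TABLE = {("I", "L"): 0.8, ("I", "V"): 0.7, ("A", "W"): 0.1}


def score(x, y, table=TOY_TABLE):
    """Look up a symmetric comparison score; identical symbols score 1.0."""
    if x == y:
        return 1.0
    return table.get((x, y), table.get((y, x), 0.0))


def count_matches(a, b, threshold=1.0):
    """Count aligned positions whose score is at least the threshold."""
    return sum(1 for x, y in zip(a, b) if score(x, y) >= threshold)
```

With this toy table, "AIW" against "ALA" yields 2 matches at a threshold of 0.6 (A/A and I/L) but only 1 at a threshold of 1.0, since only identities reach that score.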
-DENOMinator=RESIDUES
divides the sum of the matches by: Residues: the number of residues compared; Shorter: the length of the shorter sequence; Average: the average sequence length; or Nothing: nothing.
-GAPS=
instructs the program how to handle gaps in the sequences. Include: include gaps in the comparison; Length: length-independent gaps, i.e. every gap is treated as one single mismatch, regardless of its length; Exclude: exclude gaps from the comparison; or None: exclude EVERY gap in EVERY sequence.
-ENDGaps=
tells the program how to operate on sequences of unequal length. Valid responses are: Include: include end-gaps in the comparison; or Exclude: exclude end-gaps in the comparison.
-GAPValue=
sets the gap penalty to a user-specified value.
-DISTances
calculate sequence differences instead of sequence similarities.
-NASscore
calculate Doolittle's NAS score.
-PERCent
print homology values as a percentage, rather than a fraction.
-AUGmentation=
correct sequences for multiple hits. Currently implemented methods are Jukes: Jukes-Cantor formula, and Kimura: Kimura's method.
-PHYlip=pileup.phylip
write the output comparison matrix in PHYLIP format.
-NAMELength=10
specify the length of the name-field in the PHYLIP output file. Changing this parameter usually requires you to also change this value in the PHYLIP source code, and to recompile these programs!
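To show why the name-field length matters, here is a minimal sketch of writing a distance matrix in the PHYLIP layout: a first line with the number of sequences, then one row per sequence whose name is truncated or padded to a fixed width. The function name is hypothetical, and the square (rather than triangular) layout and four decimal places are assumptions, not a description of what Homologies writes.

```python
def phylip_matrix(names, dist, name_length=10):
    """Format a square distance matrix as PHYLIP-style text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        # PHYLIP reads the name from a fixed-width field, so longer
        # names are cut and shorter ones are padded with blanks.
        field = name[:name_length].ljust(name_length)
        lines.append(field + " " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"


names = ["fa10", "a_very_long_name"]
dist = [[0.0, 0.4394], [0.4394, 0.0]]
print(phylip_matrix(names, dist), end="")
```

Because the reader of the file slices names at a fixed column, a writer and a reader that disagree on the width would misparse every row, which is the reason for the recompilation warning above.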
Printed: April 22, 1996 15:53 (1162)