Gelanalyze

GELANALYZE


FUNCTION

GelAnalyze reads a GelStatus report from a shotgun project, and produces project statistics by the method of Lander and Waterman.


DESCRIPTION

GelAnalyze reports the predicted number of contigs and the expected sizes of gaps, according to the method of Lander and Waterman (1988).

The effect of sequencing additional random clones can be estimated by asking GelAnalyze to predict the number of contigs remaining after more clones have been sequenced. This helps when deciding whether to sequence more random clones or to switch to primer-directed sequencing.

This method was used successfully during the sequencing of the human HPRT locus (Edwards et al., 1990; Genomics 6:593-608) to decide when to change strategy from random sequencing to primer-directed sequencing to close the last few gaps.


AUTHOR

This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is an example session with GelAnalyze:

  
  
  % gelanalyze
  
    GELANALYZE of what GelStatus report ? hul3.dat
  
    What should I call the output file (* hul3.ana *) ?
  
    What is the total size of the region (* 10000.0 *) ? 15000
  
    What is the minimum overlap accepted (* 30.0 *) ? 30
  
    Read 179 fragments in 12 contigs
  
  %
  


OUTPUT

Here is some of the output file.

  
  
  
  GELANALYZE of Gelstatus report Hul3.Dat, April 8, 1991  13:54
   Qualifiers: -OVERlap=  30. -SIZe= 15000.
   Gelstatus of project Hul3,    April 8, 1991  12:35
  
       Number of fragments:      179
   Average fragment length:    267.6
 Total length of fragments:   47,909
  
 Sigma mean: 0.861,   Sigma variance: 0.0089
  
   (i) Number of apparent contigs
   Actual/Expected: 12 / 12.03
  
   (ii) Number of apparent contigs of j fragments
    j     Actual     Expected
   ---    ------     --------
     1         3         0.81
     2         1         0.75
     3         1         0.70
     4         1         0.66
     5         2         0.61
     6         0         0.57
    //////////////////////////////////
  
   (ii') Number of apparent contigs of at least 2 fragments
   Actual/Expected: 9 / 11.22
  
   (iii) Number of clones in an apparent contig
   Actual/Expected: 14.92 / 14.88
  
   (iv) Length of an apparent contig
   Actual/Expected: 1822.67 / 1200.33
  
   (v) Number of contigs if overlapping is perfect
   Expected:   7.39
  
   (vi) Probability that a gap of given length occurs
  
   Length   Given Gap   Any Gap
   ------   ---------   -------
        0       0.70      1.00
       50       0.39      0.97
      100       0.21      0.83
      150       0.12      0.60
      200       0.06      0.39
    //////////////////////////////////
  
   (vi') Probability that a gap is real:  0.70
  
   Maximum number of contigs: 24.0
      occurs at redundancy (c) = 1.16
      when total fragment length sequenced is 17422 bp
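
The "Maximum number of contigs" lines can be checked by hand. The sketch below is an illustration only, not GelAnalyze's own code: it assumes the simple Lander and Waterman form, in which the expected number of contigs is (c*G/L) * e**(-c*Sigma) for region size G and mean fragment length L, and it ignores the variance adjustment. The function name peak_contigs is hypothetical; the input figures are taken from the sample report above.

```python
import math

def peak_contigs(region_size, mean_fragment_length, sigma_mean):
    """Return (c, total_length, contigs) at the peak of (c*G/L) * e**(-c*sigma).

    Setting the derivative of c * e**(-c*sigma) to zero gives c = 1/sigma.
    """
    c_peak = 1.0 / sigma_mean
    total_length = c_peak * region_size                  # bases sequenced at the peak
    n_fragments = total_length / mean_fragment_length    # fragments needed to get there
    contigs = n_fragments * math.exp(-c_peak * sigma_mean)
    return c_peak, total_length, contigs

# Region size, average fragment length and Sigma mean from the sample report.
c_peak, total_length, contigs = peak_contigs(15000, 267.6, 0.861)
print(f"c = {c_peak:.2f}, total length = {total_length:.0f} bp, contigs = {contigs:.1f}")
```

The results come out close to the 1.16, 17422 bp and 24.0 printed in the report, which suggests, but does not confirm, that the report is based on the same relation.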
  
  


RELATED PROGRAMS

The GCG Fragment Assembly System programs are used to enter and manipulate raw sequence data. GelStatus reads a GCG Fragment Assembly database, and produces a summary report of the quality of each contig. GelPicture reads a contig from the Fragment Assembly database and displays a diagram of the gel alignments and a printout of the aligned gel sequences and consensus. GelPicture has been modified to include the sequence direction in both sections of the output, and to mark with '=======' any consensus sequence that is correct (agrees with every fragment) and has been sequenced in both directions.


RESTRICTIONS

The GCG Fragment Assembly System must already be started (by running GelStart) before running GelAnalyze.

GelAnalyze is only applicable to "shotgun" sequencing projects.


ALGORITHM

The algorithms used by GelAnalyze were suggested by Lander and Waterman (1988; Genomics 2:231-239) for use in restriction mapping. The methods are equally applicable to the problems of fragment assembly.

The method is based on the assumption that all the clones in the database are selected at random. Given the minimum detectable overlap length, and the length distribution of the known clone sequences, it is possible to estimate how many overlaps should be detectable (and hence the expected number of contigs when all overlaps have been found).
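
As a minimal sketch of this estimate (an illustration, not GelAnalyze's own code), the Python below computes the redundancy c, the value sigma = 1 - overlap/length, and the Lander and Waterman expected number of contigs, N * e**(-c*sigma). The function names are hypothetical, the figures are taken from the sample report in this document, and a single average fragment length is assumed, so variation in fragment length is ignored.

```python
import math

def redundancy(total_fragment_length, region_size):
    """Coverage c: total bases sequenced divided by the size of the region."""
    return total_fragment_length / region_size

def sigma(mean_fragment_length, min_overlap):
    """1 minus the fraction of a fragment needed to detect an overlap."""
    return 1.0 - min_overlap / mean_fragment_length

def expected_contigs(n_fragments, c, sig):
    """Lander-Waterman expected number of apparent contigs, N * e**(-c*sigma)."""
    return n_fragments * math.exp(-c * sig)

# Sample report: 179 fragments, 47,909 bases in total, 15000 bp region,
# minimum overlap of 30.
c = redundancy(47909, 15000)
sig = sigma(47909 / 179, 30)
estimate = expected_contigs(179, c, sig)
print(f"c = {c:.2f}, sigma = {sig:.3f}, expected contigs = {estimate:.1f}")
```

This single-length form gives a somewhat lower estimate than the 12.03 in the sample output, which uses the Sigma mean and variance printed in the report.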

GelAnalyze reads a GelStatus report as input, using fragment lengths from the report as the basis for the calculations. The actual numbers of fragments and contigs are also reported by GelAnalyze for comparison.

To allow for the effect of fragment length variance, the value of

  
  E**(-c*Sigma)
  

is replaced throughout by

  
  E**(-c*Mean(Sigma)) * (1 + c*c*Var(Sigma)/2)
  

as described in the original paper.
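
The substitution can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's source: the function name adjusted_exp_term and its use_variance switch (loosely mirroring -NOVARiance) are hypothetical, and multiplying the term by the number of fragments to get the expected number of contigs follows the Lander and Waterman model. The input figures come from the sample report.

```python
import math

def adjusted_exp_term(c, sigma_mean, sigma_var, use_variance=True):
    """e**(-c*Mean(Sigma)), optionally times (1 + c*c*Var(Sigma)/2)."""
    term = math.exp(-c * sigma_mean)
    if use_variance:
        # Second-order allowance for the spread of Sigma between fragments.
        term *= 1.0 + c * c * sigma_var / 2.0
    return term

# Sample report: 179 fragments, 47,909 bases, 15000 bp region,
# Sigma mean 0.861, Sigma variance 0.0089.
n_fragments = 179
c = 47909 / 15000
with_var = n_fragments * adjusted_exp_term(c, 0.861, 0.0089)
without_var = n_fragments * adjusted_exp_term(c, 0.861, 0.0089, use_variance=False)
print(f"with variance: {with_var:.1f}, without: {without_var:.1f}")
```

With the variance term the estimate is about 12 contigs, close to the 12.03 in the sample output; without it the estimate is slightly lower.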


CONSIDERATIONS

GelAnalyze depends on a random selection of clones in the database. This assumption is invalid if there are, for example, duplicate runs of a single clone included in the project. In such cases, the GelStatus report should first be edited to remove the "duplicate" entries.

Be very cautious in interpreting the results of GelAnalyze and keep in mind that any non-random effects in the data will bias the results. Do not spend several hours looking for more overlaps just because fewer contigs were predicted. You may have some fragments with too many errors, or the statistical distribution of fragment lengths may not be fully allowed for.

If the actual number of contigs is close to the predicted number (whether above or below), you have probably found all the overlaps. If you have many more single-fragment contigs than predicted, recheck their sequence quality in case they contain many errors, and check the contig ends carefully for possible vector sequence.


SUGGESTIONS

GelAnalyze can be run with the "-NEWFRAGS=50" option to predict the effect of an additional 50 fragments on the number of contigs. Section (vi) of the output is particularly useful in combination with the "-NEWFRAGS" option when deciding whether to attempt gap closure by primer-directed sequencing.
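
How such a prediction might work can be sketched as below. This document does not describe how GelAnalyze implements -NEWFRAGS, so the sketch rests on assumptions: the extra fragments are taken to have the same mean length, Sigma mean and Sigma variance as the existing ones, and the expected number of contigs is taken from the Lander and Waterman model with the variance adjustment. All function names are hypothetical.

```python
import math

def expected_contigs(n_fragments, mean_length, region_size, sigma_mean, sigma_var):
    """N * e**(-c*Mean(Sigma)) * (1 + c*c*Var(Sigma)/2), with c = N*L/G."""
    c = n_fragments * mean_length / region_size
    return n_fragments * math.exp(-c * sigma_mean) * (1.0 + c * c * sigma_var / 2.0)

def effect_of_new_fragments(n_fragments, extra, mean_length, region_size,
                            sigma_mean, sigma_var):
    """Expected contigs before and after adding `extra` random fragments."""
    before = expected_contigs(n_fragments, mean_length, region_size,
                              sigma_mean, sigma_var)
    after = expected_contigs(n_fragments + extra, mean_length, region_size,
                             sigma_mean, sigma_var)
    return before, after

# Sample report figures, plus 50 hypothetical extra fragments.
before, after = effect_of_new_fragments(179, 50, 267.6, 15000, 0.861, 0.0089)
print(f"expected contigs: {before:.1f} now, {after:.1f} after 50 more fragments")
```

With the sample figures the estimate falls from about 12 contigs to about 7, which is the kind of comparison that informs the choice between more random clones and primer-directed gap closure.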


INPUT FILE

The input to GelAnalyze is a report from the GelStatus program. The report may be edited first to delete non-random sequence fragments (those from repeat runs, primer-directed sequencing, etc.).


LOCAL DATA FILES

None.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % gelanalyze  -Default
  
  Prompted Parameters:
  
  -OVERlap=30             Minimum detectable overlap
  -SIZe=10000             Total expected clone sequence length
  [-OUTfile=]myproj.ana     Output file
  
  Optional Parameters:
  
  -NEWfrags=50            Assume an additional 50 fragments
  
  


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-SIZe=10000

size of expected final sequence

-OVERlap=30

size of minimum accepted overlap (MINOVERLAP used in GelOverlap runs).

-NEWfrags=50

calculate assuming an extra 50 random sequence fragments to show effect on contig number, expected gap length, etc.

-NOVARiance

do not adjust calculation for variance in fragment length.


REFERENCES

Lander, E., and Waterman, M.S. (1988). Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231-239.

Edwards A., Voss H., Rice P., Civitello A., Stegemann J., Schwager C., Zimmermann J., Erfle H., Caskey C.T., Ansorge W. (1990). Automated DNA sequencing of the human HPRT locus. Genomics 6, 593-608.

Printed: April 22, 1996 15:53 (1162)