GelAnalyze reads a GelStatus report from a shotgun project, and produces project statistics by the method of Lander and Waterman.
GelAnalyze produces a report of predicted number of contigs and sizes of gaps according to the method of Lander and Waterman (1988).
The effect of sequencing additional random clones can be estimated by asking GelAnalyze to predict the number of contigs remaining after more clones have been sequenced. This helps when deciding whether to sequence more random clones or to switch to primer-directed sequencing.
This method was used with success during the sequencing of the human HPRT locus (Edwards et al. (1990); Genomics 6:593-608) to decide at which point to change strategy from random sequencing to using primers to close the last few gaps.
This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk; Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is an example session with GelAnalyze:

  % gelanalyze

   GELANALYZE of what GelStatus report ?  hul3.dat

   What should I call the output file (* hul3.ana *) ?

   What is the total size of the region (* 10000.0 *) ?  15000

   What is the minimum overlap accepted (* 30.0 *) ?  30

   Read 179 fragments in 12 contigs

  %
Here is some of the output file.
  GELANALYZE of Gelstatus report Hul3.Dat, April 8, 1991 13:54

  Qualifiers: -OVERlap= 30. -SIZe= 15000.

  Gelstatus of project Hul3, April 8, 1991 12:35

  Number of fragments: 179
  Average fragment length: 267.6
  Total length of fragments: 47,909
  Sigma mean: 0.861, Sigma variance: 0.0089

  (i) Number of apparent contigs
      Actual/Expected: 12 / 12.03

  (ii) Number of apparent contigs of j fragments

        j   Actual  Expected
       ---  ------  --------
        1      3      0.81
        2      1      0.75
        3      1      0.70
        4      1      0.66
        5      2      0.61
        6      0      0.57

  //////////////////////////////////

  (ii') Number of apparent contigs of at least 2 fragments
      Actual/Expected: 9 / 11.22

  (iii) Number of clones in an apparent contig
      Actual/Expected: 14.92 / 14.88

  (iv) Length of an apparent contig
      Actual/Expected: 1822.67 / 1200.33

  (v) Number of contigs if overlapping is perfect
      Expected: 7.39

  (vi) Probability that a gap of given length occurs

       Length  Given Gap  Any Gap
       ------  ---------  -------
          0       0.70      1.00
         50       0.39      0.97
        100       0.21      0.83
        150       0.12      0.60
        200       0.06      0.39

  //////////////////////////////////

  (vi') Probability that a gap is real: 0.70

  Maximum number of contigs: 24.0
  occurs at redundancy (c) = 1.16
  when total fragment length sequenced is 17422 bp
The GCG Fragment Assembly System programs are used to enter and manipulate raw sequence data. GelStatus reads a GCG Fragment Assembly database, and produces a summary report of the quality of each contig. GelPicture reads a contig from the Fragment Assembly database and displays a diagram of the gel alignments and a printout of the aligned gel sequences and consensus. GelPicture has been modified to include the sequence direction in both sections of the output, and to mark with '=======' any consensus sequence that is correct (agrees with every fragment) and has been sequenced in both directions.
The GCG Fragment Assembly System must already be started (by running GelStart) before running GelAnalyze.
GelAnalyze is only applicable to "shotgun" sequencing projects.
The algorithms used by GelAnalyze were suggested by Lander and Waterman (1988); Genomics 2:231-239 for use in restriction mapping. The methods are equally applicable to the problems of Fragment Assembly.
The method is based on the assumption that all the clones in the database are selected at random. Given the minimum detectable overlap length, and the length distribution of the known clone sequences, it is possible to estimate how many overlaps should be detectable (and hence the expected number of contigs when all overlaps have been found).
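The estimate described above can be sketched in a few lines. This is a minimal illustration of the basic Lander and Waterman formula, not GelAnalyze's own code: the function names are hypothetical, and sigma is taken here from the average fragment length only, without the variance adjustment.

```python
import math

def expected_contigs(n_fragments, mean_length, region_size, min_overlap):
    """Basic Lander-Waterman estimate: N * exp(-c * sigma).

    Hypothetical helper for illustration; not part of GelAnalyze.
    """
    c = n_fragments * mean_length / region_size  # redundancy (coverage)
    theta = min_overlap / mean_length            # fraction of a fragment needed to detect an overlap
    sigma = 1.0 - theta
    return n_fragments * math.exp(-c * sigma)

def peak_contigs(mean_length, region_size, sigma):
    """Redundancy at which N * exp(-c * sigma) is largest, and the contig count there.

    Substituting N = c * G / L and differentiating with respect to c
    gives the maximum at c = 1 / sigma.
    """
    c_peak = 1.0 / sigma
    n_peak = c_peak * region_size / mean_length
    return c_peak, n_peak * math.exp(-1.0)

if __name__ == "__main__":
    # Values from the example session: 179 fragments, mean length 267.6,
    # region size 15000, minimum overlap 30.
    print(expected_contigs(179, 267.6, 15000, 30))
    # Sigma mean 0.861 is taken from the sample output.
    print(peak_contigs(267.6, 15000, 0.861))
```

With the example session's values the first function gives about 10.5 contigs, lower than the 12.03 in the sample output, because the program uses the mean and variance of sigma over the fragments rather than a single sigma from the average length. With the reported Sigma mean of 0.861, the peak falls at a redundancy of 1/0.861, about 1.16, in agreement with the last lines of the sample output.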
GelAnalyze reads a GelStatus report as input, using fragment lengths from the report as the basis for the calculations. The actual numbers of fragments and contigs are also reported by GelAnalyze for comparison.
To allow for the effect of fragment length variance, the value of
E**(-c*Sigma)
is replaced throughout by
E**(-c*Mean(Sigma)) * (1 + c*c*Var(Sigma)/2)
as described in the original paper.
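The substitution can be written out as a short sketch. Two points here are assumptions rather than documented behaviour: that sigma is computed per fragment as 1 - overlap/length, and that the variance is the population variance. The function names are hypothetical.

```python
import math
from statistics import fmean, pvariance

def corrected_factor(c, mean_sigma, var_sigma):
    """E**(-c*Mean(Sigma)) * (1 + c*c*Var(Sigma)/2), as given in the text."""
    return math.exp(-c * mean_sigma) * (1.0 + c * c * var_sigma / 2.0)

def corrected_factor_from_lengths(lengths, region_size, min_overlap):
    """Apply the correction starting from a list of fragment lengths.

    Assumption: sigma for each fragment is 1 - min_overlap / length,
    and the population variance is used.
    """
    c = sum(lengths) / region_size
    sigmas = [1.0 - min_overlap / length for length in lengths]
    return corrected_factor(c, fmean(sigmas), pvariance(sigmas))

if __name__ == "__main__":
    # Rounded values from the sample output: total length 47,909,
    # region size 15000, Sigma mean 0.861, Sigma variance 0.0089.
    c = 47909 / 15000
    print(179 * corrected_factor(c, 0.861, 0.0089))
```

Multiplying the corrected factor by the 179 fragments of the example gives a value close to the 12.03 expected contigs in the sample output; an exact match is not expected because the report prints rounded values of the mean and variance.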
GelAnalyze depends on a random selection of clones in the database. This assumption is invalid if there are, for example, duplicate runs of a single clone included in the project. In such cases, the GelStatus report should first be edited to remove the "duplicate" entries.
Be very cautious in interpreting the results of GelAnalyze and keep in mind that any non-random effects in the data will bias the results. Do not spend several hours looking for more overlaps just because fewer contigs were predicted. You may have some fragments with too many errors, or the statistical distribution of fragment lengths may not be fully allowed for.
If the actual number of contigs is close to the predicted number (whether above or below), you have probably found all the overlaps. If you have many more single-fragment contigs than predicted, recheck their sequence quality in case they contain a high number of errors, and carefully check the contig ends for possible vector sequence.
GelAnalyze can be run with the "-NEWFRAGS=50" option to predict the effect of an additional 50 fragments on the number of contigs. Section (vi) of the output is particularly useful in combination with the "-NEWFRAGS" option when deciding whether to attempt gap closure by primer-directed sequencing.
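The kind of prediction involved can be sketched as follows. This is an illustration only, not the program's internals. It assumes that the extra fragments have the same average length as the existing ones, and it uses the Lander and Waterman result that an apparent contig is followed by a real gap of at least x bases with probability exp(-N(x + T)/G), where N is the number of fragments, T the minimum overlap, and G the region size. The function names are hypothetical and the variance adjustment is omitted.

```python
import math

def contigs_after_extra(n_fragments, mean_length, region_size, min_overlap, extra=50):
    """Expected contigs once `extra` more random fragments are added.

    Assumption: new fragments share the current average length.
    """
    n = n_fragments + extra
    c = n * mean_length / region_size
    sigma = 1.0 - min_overlap / mean_length
    return n * math.exp(-c * sigma)

def gap_at_least(n_fragments, region_size, min_overlap, gap_length):
    """Probability that the gap after an apparent contig is real and
    at least `gap_length` bases long: exp(-N * (x + T) / G)."""
    return math.exp(-n_fragments * (gap_length + min_overlap) / region_size)

if __name__ == "__main__":
    # Example session values: 179 fragments, mean length 267.6,
    # region size 15000, minimum overlap 30.
    print(contigs_after_extra(179, 267.6, 15000, 30, extra=50))
    for x in (0, 50, 100, 150, 200):
        print(x, round(gap_at_least(179, 15000, 30, x), 3))
```

For the example session, the gap probabilities from this sketch are close to the "Given Gap" column of section (vi), which suggests, but does not confirm, that the program uses a similar expression. Comparing the predicted contig count before and after the extra fragments shows how much further random sequencing is likely to help.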
The input to GelAnalyze is a report from the GelStatus program. The report may be edited first to delete non-random sequence fragments (those from repeat runs, primer-directed sequence, etc.).
None.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % gelanalyze -Default

Prompted Parameters:

-OVERlap=30             Minimum detectable overlap
-SIZe=1500              Total expected clone sequence length
[-OUTfile=]myproj.ana   Output file

Optional Parameters:

-NEWfrags=50            Assume an additional 50 fragments
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
size of expected final sequence
size of minimum accepted overlap (MINOVERLAP used in GelOverlap runs).
calculate assuming an extra 50 random sequence fragments to show effect on contig number, expected gap length, etc.
do not adjust calculation for variance in fragment length.
Lander, E., and Waterman, M.S. (1988). Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231-239.
Edwards A., Voss H., Rice P., Civitello A., Stegemann J., Schwager C., Zimmermann J., Erfle H., Caskey C.T., Ansorge W. (1990). Automated DNA sequencing of the human HPRT locus. Genomics 6, 593-608.
Printed: April 22, 1996 15:53 (1162)