EWindow is a version of Window with command line control. Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.
EWindow calculates the frequency of patterns within a window of a set length. A pattern is any short sequence such as GC, or R, or ATG. The output is a table of numbers suitable for input to the StatPlot program. The window is moved along the sequence by a shift increment, and the number of observations of the pattern at every window position is measured. The frequency can be reported as a fraction, a percent, or simply a number of observations. You can also ask to see the difference between the number of observations of the pattern and the expected number of observations for a random sequence of identical composition. This expectation can be based either on the composition within the window (local) or on the composition of the whole sequence range (global). Another statistic lets you see the difference in frequency between two patterns. The pattern frequencies measured by EWindow are for one strand only.
You define the window size and the shift increment. The shift increment is the amount the window is moved between measurements. From a menu of the eight possible measures, you may choose up to six. Each measure you choose makes a column in the output table. After choosing the measurements, you are prompted to enter the pattern you want measured. For each measurement you must choose a pattern when prompted with a question that reminds you of the kind of measurement and the column number.
This GCG program was modified by Jaakko Hattula (Tampere University of Technology, Finland) and Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session using EWindow to measure the frequencies of Cs, Gs, CGs, and GCs in the sequence gamma.seq. You can see from this experiment whether or not the frequency of the dinucleotide CG correlates well with the content of the nucleotides C and G (it doesn't). The output file from this session with EWindow is plotted as an example in the program StatPlot.
% ewindow

 EWINDOW uses any sequence data

 EWINDOW of what sequence ?  gamma.seq

                  Start (* 1 *) ?
                End (* 11375 *) ?  500
               Reverse (* No *) ?

 What window size (* 100 *) ?

 What shift increment (* 3 *) ?

 What should I call the output file (* gamma.wdw *) ?

 What functions do you want:

   a) number  of patterns observed
   b) percent of patterns observed
   c) fraction of patterns observed
   d) number  of observed - expected(local) patterns
   e) number  of observed - expected(global) patterns
   f) percent of observed - expected(local) patterns
   g) percent of observed - expected(global) patterns
   h) percent difference between two patterns
   q)uit

 Please select up to 6 functions (* ae *):  aaadad

 What is the pattern for the "a" stat in column 1 ?  c
 What is the pattern for the "a" stat in column 2 ?  g
 What is the pattern for the "a" stat in column 3 ?  cg
 What is the pattern for the "d" stat in column 4 ?  cg
 What is the pattern for the "a" stat in column 5 ?  gc
 What is the pattern for the "d" stat in column 6 ?  gc

%
Some of the output file is shown below. You can see the data plotted in the figure with the documentation for the StatPlot program.
WINDOW of: gamma.seq  check: 6474  from: 1  to: 500

Window: 100  Shift: 3  MatchType: Subset  MisMatch: 0

Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies, Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.
July 15, 1994 15:01

Position   C(obsrv)   G(obsrv)  CG(obsrv)  CG_ob-ex(l)  GC(obsrv)  GC_ob-ex(l)  ..

      50     17.000     30.000      1.000       -4.049      4.000       -1.049
      53     19.000     29.000      1.000       -4.455      5.000       -0.455
      56     17.000     30.000      1.000       -4.049      5.000       -0.049

////////////////////////////////////////////////////////////////

     443     31.000     14.000      0.000       -4.297      2.000       -2.297
     446     32.000     14.000      0.000       -4.435      2.000       -2.435
     449     32.000     13.000      0.000       -4.118      2.000       -2.118
EStatPlot is a version of StatPlot with command line control. StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.
The input sequence may not be more than 175,000 symbols long.
No more than six statistics can be tabulated. The shift increment cannot exceed the window size. Numbering in the Position column is for the forward strand even if the reverse strand is chosen.
Pattern definitions can only contain GCG sequence characters (see Appendix III). We could easily modify EWindow to find patterns using a pattern definition syntax like that used for FindPatterns. Contact us if you think this is a good idea!
Each observation of a pattern is stored in a logical array. This array has a true (pattern observed) or false (pattern not observed) value for every position in the original sequence.
After the observation array is assembled, the incidence of each pattern can be found simply by putting down the window as a mask over the array and counting the observations under the window. The window is moved along the array (sequence) by the set shift increment and the observations are counted again.
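The two steps above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function names `observation_array` and `window_counts` are hypothetical, matching is exact (no ambiguity codes), an observation is recorded at the position where the pattern starts, and windows are identified by their 0-based start index rather than by the value EWindow prints in its Position column.

```python
def observation_array(sequence, pattern):
    """Return one boolean per sequence position: True where the pattern starts there.

    Assumption: exact, case-insensitive matching; no ambiguity codes.
    """
    seq = sequence.upper()
    pat = pattern.upper()
    # Slices that run past the end are shorter than the pattern, so they compare False.
    return [seq[i:i + len(pat)] == pat for i in range(len(seq))]


def window_counts(observed, window, shift):
    """Lay the window over the array, count the True values, then shift and repeat.

    Returns a list of (window_start_index, count) pairs for every window
    that fits completely inside the array.
    """
    results = []
    for start in range(0, len(observed) - window + 1, shift):
        results.append((start, sum(observed[start:start + window])))
    return results


obs = observation_array("ACGCGTACGT", "cg")
print(window_counts(obs, window=4, shift=3))
```

Because the array is built once, each window position costs only a count over `window` booleans, whatever the length of the pattern.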
EWindow calculates the expected number of observations per window in the following manner. The fraction of each symbol in the pattern is measured, either in the window (local expectation) or in the whole sequence range (global expectation). The product of the fractions for each symbol in the pattern times the length of the window is the expected number of observations for the pattern in the window. Four of the measurements report the difference between the actual number of observations and the expected number.
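The expectation described above can be written as a short sketch. The function name `expected_count` and its arguments are hypothetical; pass the window's own symbols as `composition_region` for a local expectation, or the whole sequence range for a global one. The sketch assumes unambiguous pattern symbols.

```python
from collections import Counter


def expected_count(pattern, composition_region, window):
    """Expected observations of `pattern` in a window of length `window`.

    The fraction of each pattern symbol is measured in `composition_region`;
    the product of those fractions times the window length is returned.
    """
    region = composition_region.upper()
    counts = Counter(region)
    product = 1.0
    for symbol in pattern.upper():
        product *= counts[symbol] / len(region)
    return product * window


# A window of 100 symbols holding 17 Cs and 30 Gs (the rest As):
region = "C" * 17 + "G" * 30 + "A" * 53
print(expected_count("CG", region, window=100))
```

For this composition the sketch gives about 5.1, whereas the first row of the sample output (1.000 observed, -4.049 observed minus expected) implies about 5.049. The program evidently treats some detail of the calculation differently, so read the sketch as the formula in the text and not as a reproduction of EWindow's arithmetic.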
The percentage measures are simply the number of observations divided by the size of the window and multiplied by 100.
Fraction measures are the number of observations divided by the window size.
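As a small worked example of the three ways of reporting a count, the sketch below converts a number of observations into the percent and fraction measures. The function name `report` is hypothetical.

```python
def report(count, window):
    """Return the number, percent, and fraction measures for one window."""
    return {
        "number": float(count),
        "percent": 100.0 * count / window,   # observations / window size * 100
        "fraction": count / window,          # observations / window size
    }


# 17 observations of C in a window of 100 symbols
print(report(17, 100))
```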
For nucleic acid sequences, the ambiguity codes in Appendix III are searched for subset matches. For instance, if the pattern specified is 'RR' and the sequence contains an 'AG,' an observation is scored at the position of the A. If the pattern specified were 'AG' and the sequence contained an 'RG,' no match would be scored. The sequence symbols must be the same as or a subset of the nucleotides implied by the pattern symbols.
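The subset rule can be illustrated with a short sketch. The `IUPAC` table below uses the standard nucleotide ambiguity codes as a stand-in for the Appendix III table, which is an assumption; the function name `subset_match` is hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes (assumed to agree with Appendix III).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def subset_match(pattern, sequence, pos=0):
    """True if every sequence symbol from `pos` is the same as, or a subset of,
    the nucleotides implied by the corresponding pattern symbol."""
    pattern = pattern.upper()
    target = sequence.upper()[pos:pos + len(pattern)]
    if len(target) < len(pattern):
        return False  # the pattern would run off the end of the sequence
    return all(set(IUPAC[s]) <= set(IUPAC[p]) for p, s in zip(pattern, target))


print(subset_match("RR", "AG"))  # pattern RR against sequence AG
print(subset_match("AG", "RG"))  # pattern AG against sequence RG
```

The first call reports a match and the second does not, in line with the two examples in the text.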
If the sequence is a peptide sequence or if you have the option -PERfect on the command line, EWindow scores occurrences of patterns by finding perfect examples of the pattern in the sequence.
If you use the command line switch -ALL and your sequence is a nucleic acid sequence, the sequence can be an overlapping set of the pattern instead of only a subset. For instance, the pattern AG would match the sequence 'RR.' The pattern 'RA' would match the sequence 'MK.'
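An overlapping-set test can be sketched in the same spirit: a position matches when the pattern symbol and the sequence symbol share at least one nucleotide. The `IUPAC` table uses the standard ambiguity codes as a stand-in for Appendix III (an assumption), and the function name `overlap_match` is hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes (assumed to agree with Appendix III).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overlap_match(pattern, sequence, pos=0):
    """True if, at every position, the pattern symbol and the sequence symbol
    imply at least one nucleotide in common."""
    pattern = pattern.upper()
    target = sequence.upper()[pos:pos + len(pattern)]
    if len(target) < len(pattern):
        return False  # the pattern would run off the end of the sequence
    return all(set(IUPAC[p]) & set(IUPAC[s]) for p, s in zip(pattern, target))


print(overlap_match("AG", "RR"))  # A overlaps R, G overlaps R
```

With the standard meanings assumed here (M is A or C, K is G or T), this sketch reports the AG/RR example as a match but not RA/MK, because A and K share no nucleotide. That second example may therefore rest on a definition in Appendix III that differs from the table above.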
The cost of running EWindow is very low, but the output files can be very large. You should recognize that EWindow writes one line in the output file for every position of the window. Running EWindow on a sequence of length 10,000, with window size 100, shift increment 1, and using five measures will generate an output file with about 10,000 lines and about 60,000 numbers.
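The size estimate can be checked with a little arithmetic. The sketch assumes that only windows lying completely inside the sequence range are written and that each output line holds one position value plus one number per measure, as in the sample output; the function name `output_size` is hypothetical.

```python
def output_size(length, window, shift, measures):
    """Return (lines, numbers) for an output table.

    Assumes one line per complete window and, on each line,
    one position column plus one column per measure.
    """
    lines = (length - window) // shift + 1
    numbers = lines * (measures + 1)
    return lines, numbers


print(output_size(10000, window=100, shift=1, measures=5))  # the case in the text
print(output_size(500, window=100, shift=3, measures=6))    # the example session
```

The first call gives 9,901 lines and 59,406 numbers, consistent with the rounded figures in the text. The second gives 134 lines, which agrees with positions 50 through 449 in steps of 3 in the sample output.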
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % ewindow -INfile=gamma.seq -Default

Prompted Parameters:

-BEGin=1 -END=576     Range of interest
-REVerse              Use reverse strand
-WINdow=100           Calculation window size
-SHIFT=3              Shift in window position
-MENU=AE              Menu choice(s)
-PATTern1=cg          Pattern for first choice
-PATTern2=cg          Pattern for second choice (etc)
-OUTfile=gamma.wdw    Output file

Optional Parameters:

-ALL                  Allow overlapping sets to match
-PERfect              Only perfect matches, or protein sequence
-MISmatch             Subset matching
None.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-ALL

makes an overlapping-set search for patterns in nucleic acid sequences. If your sequence is rich in ambiguity, you can measure the frequency of potential examples of patterns.

-PERfect

Normally, EWindow searches for patterns using subset matching in nucleic acids and perfect matching in peptide sequences. You can override the subset default with the command line option -PERfect.

-MISmatch

prompts for the number of mismatches allowed in a subset search.
Printed: April 22, 1996 15:53 (1162)