Ewindow

EWINDOW


FUNCTION

EWindow is a version of Window with command line control. Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.


DESCRIPTION

EWindow calculates the frequency of patterns within a window of a set length. A pattern is any short sequence such as GC, or R, or ATG. The output is a table of numbers suitable for input to the StatPlot program. The window is moved along the sequence by a shift increment, and the number of observations of the pattern at every window position is measured. The frequency can be reported as a fraction, a percent, or simply a number of observations. You can also ask to see the difference between the number of observations of the pattern and the expected number of observations for a random sequence of identical composition. This expectation can be based either on the composition within the window (local) or on the composition of the whole sequence range (global). Another statistic lets you see the difference in frequency between two patterns. The pattern frequencies measured by EWindow are for one strand only.


PARAMETERS

You define the window size and the shift increment, which is the amount the window is moved between measurements. From a menu of eight possible measures, you may choose up to six; each measure you choose makes a column in the output table. After you choose the measures, EWindow prompts you for the pattern to use in each one, with a question that reminds you of the kind of measurement and the column number.


AUTHOR

This GCG program was modified by Jaakko Hattula (Tampere University of Technology, Finland) and Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using EWindow to measure the pattern of Cs, Gs, CGs, and GCs in the sequence gamma.seq. You can see from this experiment whether or not the frequency of the dinucleotide CG correlates well with the content of the nucleotides C and G (it doesn't). The output file from this session with EWindow is plotted as an example in the program StatPlot.

  
  
  % ewindow
  
   EWINDOW uses any sequence data
  
   EWINDOW of what sequence ?  gamma.seq
  
                  Start (* 1 *) ?
                End (* 11375 *) ?  500
               Reverse (* No *) ?
  
   What window size (* 100 *) ?
  
   What shift increment (* 3 *) ?
  
   What should I call the output file (* gamma.wdw *) ?
  
   What functions do you want:
  
   a) number   of patterns observed
   b) percent  of patterns observed
   c) fraction of patterns observed
   d) number   of observed - expected(local)  patterns
   e) number   of observed - expected(global) patterns
   f) percent  of observed - expected(local)  patterns
   g) percent  of observed - expected(global) patterns
   h) percent difference between two patterns
  
   q)uit
  
   Please select up to 6 functions (* ae *):  aaadad
  
   What is the pattern for the "a" stat in column 1 ?  c
   What is the pattern for the "a" stat in column 2 ?  g
   What is the pattern for the "a" stat in column 3 ?  cg
   What is the pattern for the "d" stat in column 4 ?  cg
   What is the pattern for the "a" stat in column 5 ?  gc
   What is the pattern for the "d" stat in column 6 ?  gc
  
  %
  
  


OUTPUT

Some of the output file is shown below. You can see the data plotted in the figure with the documentation for the StatPlot program.

  
  
   WINDOW of: gamma.seq  check: 6474  from: 1  to: 500
   Window: 100  Shift: 3  MatchType: Subset MisMatch: 0
  
  Human fetal beta globins G and A gamma
  from Shen, Slightom and Smithies,  Cell 26; 191-203.
  Analyzed by Smithies et al. Cell 26; 345-353.
  
                       July 15, 1994 15:01
  
  Position C(obsrv) G(obsrv) CG(obsrv) CG_ob-ex(l) GC(obsrv) GC_ob-ex(l) ..
  
   50   17.000   30.000     1.000      -4.049     4.000      -1.049
   53   19.000   29.000     1.000      -4.455     5.000      -0.455
   56   17.000   30.000     1.000      -4.049     5.000      -0.049
  
   ////////////////////////////////////////////////////////////////
  
  443   31.000   14.000     0.000      -4.297     2.000      -2.297
  446   32.000   14.000     0.000      -4.435     2.000      -2.435
  449   32.000   13.000     0.000      -4.118     2.000      -2.118
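
If you want to process a table like the one above outside StatPlot, the following Python sketch shows one way to read it. It is not part of EWindow. It assumes that the column headings are on the first line ending in ".." and that every later line made up only of numbers is a data row; the function name parse_window_table is hypothetical.

```python
def parse_window_table(text):
    """Split an EWindow-style table into column names and numeric rows.

    Assumption: the heading line is the first line that ends with '..',
    and data rows are the later lines whose fields are all numbers.
    """
    header = None
    rows = []
    for line in text.splitlines():
        if header is None:
            if line.rstrip().endswith(".."):
                # Drop the trailing '..' divider from the column names.
                header = line.split()[:-1]
            continue
        fields = line.split()
        if not fields:
            continue
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            # Skip anything that is not a pure numeric row.
            continue
    return header, rows
```

Each row comes back as a list of floats in the same order as the column names, so the Position value is always the first element.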
  


RELATED PROGRAMS

EStatPlot is a version of StatPlot with command line control. StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.


RESTRICTIONS

The input sequence may not be more than 175,000 symbols long.

No more than six statistics can be tabulated. The shift increment cannot exceed the window size. Numbering in the Position column is for the forward strand even if the reverse strand is chosen.

Pattern definitions can only contain GCG sequence characters (see Appendix III). We could easily modify EWindow to find patterns using a pattern definition syntax like that used for FindPatterns. Contact us if you think this is a good idea!


ALGORITHM

Each observation of a pattern is stored in a logical array. This array has a true (pattern observed) or false (pattern not observed) value for every position in the original sequence.

After the observation array is assembled, the incidence of each pattern can be found simply by putting down the window as a mask over the array and counting the observations under the window. The window is moved along the array (sequence) by the set shift increment and the observations are counted again.
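
The two steps described above can be sketched in Python as follows. This is an illustration only, not the program's source: the function names are hypothetical, matching is exact (no ambiguity codes), and an observation is assumed to count for a window when its first symbol lies under the window. The sketch labels each window by its first position, whereas the sample output appears to use a position near the window center (50 for the window that starts at 1).

```python
def observation_array(sequence, pattern):
    """Return one boolean per sequence position: True where pattern starts."""
    seq = sequence.upper()
    pat = pattern.upper()
    observed = [False] * len(seq)
    for i in range(len(seq) - len(pat) + 1):
        observed[i] = seq[i:i + len(pat)] == pat
    return observed


def window_counts(observed, window, shift):
    """Slide a window over the array and count the observations under it.

    Returns (first position of window, count) pairs, positions 1-based.
    """
    counts = []
    for start in range(0, len(observed) - window + 1, shift):
        counts.append((start + 1, sum(observed[start:start + window])))
    return counts
```

Because the array is built once, each additional window position costs only a count over the masked region, which fits the low running cost noted under CONSIDERATIONS.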

EWindow calculates the expected number of observations per window in the following manner. The fraction of each symbol in the pattern is measured, either in the window (local expectation) or in the whole sequence range (global expectation). The product of the fractions for each symbol in the pattern times the length of the window is the expected number of observations for the pattern in the window. Four of the measurements report the difference between the actual number of observations and the expected number.
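
As a rough illustration of this calculation, here is a Python sketch with hypothetical function names that handles unambiguous symbols only. Pass the window contents as the region for a local expectation, or the whole sequence range for a global one. The multiplier is left as a parameter because of an inference from the sample output, not a documented fact: for the first row shown under OUTPUT (17 Cs, 30 Gs, one CG in a window of 100), multiplying by 99 gives 5.049 and a difference of -4.049, which matches the table, while multiplying by 100 gives 5.1. This suggests the multiplier is the number of positions at which the pattern can start inside the window (window size minus pattern length plus 1).

```python
def symbol_fractions(region):
    """Fraction of each symbol in a region (a window or the whole range)."""
    region = region.upper()
    return {symbol: region.count(symbol) / len(region) for symbol in set(region)}


def expected_count(pattern, region, positions):
    """Product of the symbol fractions for the pattern, times `positions`.

    `positions` is the multiplier: the window length according to the text,
    or window - len(pattern) + 1 according to the sample output (assumption).
    """
    fractions = symbol_fractions(region)
    product = 1.0
    for symbol in pattern.upper():
        product *= fractions.get(symbol, 0.0)
    return product * positions


def observed_minus_expected(observed, pattern, region, positions):
    """Difference reported by the 'observed - expected' measures."""
    return observed - expected_count(pattern, region, positions)
```

For a single-symbol pattern the two choices of multiplier coincide, since the pattern can start at every position in the window.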

The percentage measures are simply the number of observations divided by the size of the window and multiplied by 100.

Fraction measures are the number of observations divided by the window size.
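
In Python terms, the percent and fraction measures amount to the two one-line conversions below. The function names are hypothetical and the sketch simply restates the definitions above.

```python
def as_percent(count, window):
    """Observations divided by the window size, multiplied by 100."""
    return 100.0 * count / window


def as_fraction(count, window):
    """Observations divided by the window size."""
    return count / window
```

With a window size of 100, as in the example session, the number and percent measures give the same value.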


SUBSET MATCHING

For nucleic acid sequences, the ambiguity codes in Appendix III are searched for subset matches. For instance, if the pattern specified is 'RR' and the sequence contains an 'AG,' an observation is scored at the position of the A. If the pattern specified were 'AG' and the sequence contained an 'RG,' no match would be scored. The sequence symbols must be the same as or a subset of the nucleotides implied by the pattern symbols.
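
The subset rule can be sketched in Python as shown below. The table uses the standard IUPAC nucleotide codes, which is an assumption; the GCG table in Appendix III may include further symbols. The function name subset_match is hypothetical.

```python
# Standard IUPAC nucleotide codes (assumed; see Appendix III for GCG's table).
IUPAC = {
    "A": set("A"), "C": set("C"), "G": set("G"), "T": set("T"),
    "R": set("AG"), "Y": set("CT"), "M": set("AC"), "K": set("GT"),
    "S": set("CG"), "W": set("AT"), "B": set("CGT"), "D": set("AGT"),
    "H": set("ACT"), "V": set("ACG"), "N": set("ACGT"),
}


def subset_match(pattern, sequence, pos=0):
    """True if every sequence symbol is the same as, or a subset of,
    the nucleotides implied by the corresponding pattern symbol."""
    pat = pattern.upper()
    seq = sequence.upper()[pos:pos + len(pat)]
    if len(seq) < len(pat):
        return False
    for p, s in zip(pat, seq):
        if p not in IUPAC or s not in IUPAC:
            return False
        if not IUPAC[s] <= IUPAC[p]:
            return False
    return True
```

With this sketch, subset_match("RR", "AG") is true and subset_match("AG", "RG") is false, as in the examples above.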


PERFECT MATCHING

If the sequence is a peptide sequence or if you have the option -PERfect on the command line, EWindow scores occurrences of patterns by finding perfect examples of the pattern in the sequence.


OVERLAPPING SET MATCHING

If you use the command line switch -ALL and your sequence is a nucleic acid sequence, the sequence can be an overlapping set of the pattern instead of only a subset. For instance, the pattern AG would match the sequence 'RR.' The pattern 'RA' would match the sequence 'MK.'
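
A Python sketch of the overlapping-set rule follows. As an assumption it uses the standard IUPAC nucleotide codes, which may differ from the table in Appendix III; the function name overlap_match is hypothetical. Under this table K stands for G or T, so the sketch does not report a match between 'RA' and 'MK', although it does match AG against 'RR'.

```python
# Standard IUPAC nucleotide codes (assumed; see Appendix III for GCG's table).
IUPAC = {
    "A": set("A"), "C": set("C"), "G": set("G"), "T": set("T"),
    "R": set("AG"), "Y": set("CT"), "M": set("AC"), "K": set("GT"),
    "S": set("CG"), "W": set("AT"), "B": set("CGT"), "D": set("AGT"),
    "H": set("ACT"), "V": set("ACG"), "N": set("ACGT"),
}


def overlap_match(pattern, sequence, pos=0):
    """True if each sequence symbol shares at least one nucleotide
    with the corresponding pattern symbol."""
    pat = pattern.upper()
    seq = sequence.upper()[pos:pos + len(pat)]
    if len(seq) < len(pat):
        return False
    for p, s in zip(pat, seq):
        if p not in IUPAC or s not in IUPAC:
            return False
        if not IUPAC[s] & IUPAC[p]:
            return False
    return True
```

Every subset match is also an overlapping-set match, so this rule can only add observations in sequences that contain ambiguity codes.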


CONSIDERATIONS

The cost of running EWindow is very low, but the output files can be very large. You should recognize that EWindow writes one line in the output file for every position of the window. Running EWindow on a sequence of length 10,000, with window size 100, shift increment 1, and using five measures will generate an output file with about 10,000 lines and about 60,000 numbers.
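
To estimate the size of an output file before a run, you can use a calculation like the Python sketch below. It assumes one output line for every position at which the whole window fits inside the sequence range, and one Position value plus one value per measure on each line; the function name is hypothetical.

```python
def output_size(seq_length, window, shift, measures):
    """Estimate (lines, numbers) written for a run.

    Assumption: the window is only placed where it fits entirely
    inside the range, and each line holds Position plus one value
    per measure.
    """
    lines = (seq_length - window) // shift + 1
    numbers = lines * (measures + 1)
    return lines, numbers
```

For the case described above, output_size(10000, 100, 1, 5) gives 9,901 lines and 59,406 numbers. For the example session (range 1 to 500, window 100, shift 3, six measures) it gives 134 lines, which agrees with the first and last Position values of 50 and 449 shown under OUTPUT.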


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % ewindow -INfile=gamma.seq -Default
  
  Prompted Parameters:
  
  -BEGin=1 -END=576       Range of interest
  -REVerse                Use reverse strand
  -WINdow=100             Calculation window size
  -SHIFT=3                Shift in window position
  -MENU=AE                Menu choice(s)
  -PATTern1=cg            Pattern for first choice
  -PATTern2=cg            Pattern for second choice (etc)
  -OUTfile=gamma.wdw      Output file
  
  Optional Parameters:
  
  -ALL                    Allow overlapping sets to match
  -PERfect                Only perfect matches, or protein sequence
  -MISmatch               Subset matching
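
  
As an illustration of how the qualifiers above fit together, this Python sketch assembles a command line equivalent to the interactive session under EXAMPLE. It only builds the text of the command and does not run EWindow. The helper name is hypothetical, and the use of -PATTern3 through -PATTern6 is an assumption based on the "(etc)" in the summary.

```python
def ewindow_command(infile, menu, patterns, outfile,
                    begin=None, end=None, window=None, shift=None):
    """Build an ewindow command line from the documented qualifiers."""
    if len(menu) != len(patterns):
        raise ValueError("give exactly one pattern per menu choice")
    if len(menu) > 6:
        raise ValueError("no more than six statistics can be tabulated")
    args = ["ewindow", f"-INfile={infile}"]
    if begin is not None:
        args.append(f"-BEGin={begin}")
    if end is not None:
        args.append(f"-END={end}")
    if window is not None:
        args.append(f"-WINdow={window}")
    if shift is not None:
        args.append(f"-SHIFT={shift}")
    args.append(f"-MENU={menu}")
    # One -PATTernN qualifier per chosen measure (numbering assumed).
    for number, pattern in enumerate(patterns, start=1):
        args.append(f"-PATTern{number}={pattern}")
    args.append(f"-OUTfile={outfile}")
    args.append("-Default")
    return " ".join(args)
```

Adding -CHEck to the resulting command lets you review the parameters before the program executes.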
  


LOCAL DATA FILES

None.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-ALL

makes an overlapping-set search for patterns in nucleic acid sequences. If your sequence is rich in ambiguity, you can measure the frequency of potential examples of patterns.

-PERfect

Normally, EWindow searches for patterns using subset matching in nucleic acids and perfect matching in peptide sequences. You can override the subset default with the command line option -PERfect.

-MISmatch

prompts for the number of mismatches allowed in a subset search.

Printed: April 22, 1996 15:53 (1162)