Wordcount

WORDCOUNT


FUNCTION

WordCount counts the most common words in a sequence and reports them in order of frequency, with words of equal frequency listed in sequence order.


DESCRIPTION

WordCount indexes all occurrences of words of a specified length in a nucleotide sequence, and reports the most frequent words.

WordCount can also report the complete list of all words in the sequence to a second output file for further statistical analysis.
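
The counting described above can be illustrated with a short Python sketch. It is not the WordCount source code: the function names count_words and top_words are hypothetical, the demo sequence is made up, and the sketch assumes that words are taken from every overlapping window of the sequence and that ties are listed alphabetically, as the sample output suggests.

```python
from collections import Counter


def count_words(sequence, word_size=6):
    """Count every overlapping window of word_size letters (assumed behaviour)."""
    seq = sequence.upper()
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))


def top_words(counts, list_size=100):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:list_size]


if __name__ == "__main__":
    demo = "GCGCGCGCAT"  # made-up sequence, not taken from em_ba:paamir
    for word, n in top_words(count_words(demo, word_size=3), list_size=3):
        print(f" {word}    {n}")
```

Passing a very large list_size to top_words returns every word, which corresponds loosely to the complete list written to the second output file.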


AUTHOR

This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a sample session with WordCount:

  
  
  % wordcount
  
   WORDCOUNT uses any sequence data
  
   WORDCOUNT of what sequence ?  em_ba:paamir
  
                Start (* 1 *) ?
                End (* 2167 *) ?
  
   What should I call the output file (* paamir.count *) ?
  
   What word size (* 6 *) ?
  
   What list size (* 100 *) ?
  
  %
  


OUTPUT

The output from WordCount is a simple report of hits in the sequence: each line gives a word followed by its frequency.

  
  
   CGGCGC    12
   GCCGCC    12
   GCGCCG    12
   GGCGGC    12
   CCAGCA    11
   CCGCCG    11
  
  ///////////////////////////////////////////////////////
  


INPUT FILE

The input file for WordCount is a GCG nucleotide sequence file.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % wordcount [-INfile1=]GGammaCod.Seq -Default
  
  Prompted parameters:
  
  -BEGin=1 -END=444          range of interest
  -WORdsize=6                word size or mask pattern
  -LIStsize=50               size of output list
  [-OUTfile=]ggammacod.count summary output file name
  
  Local Data Files: none
  
  Optional Parameters:
  
  -FULLfile=ggammacod.full   full output file name
  -MINscore=1                sets minimum score to be listed
  -ONEstrand                 calculate for forward direction only
  -TRIM                      remove lowest scores from hit list
  


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-FULLfile=ggammacod.full

Reports the full list of all words with their frequencies. This can be used by other programs or scripts to perform statistical analyses.

-MINscore=1

Sets the minimum frequency a word must reach to be listed, in addition to the cutoff imposed by the list size. This option can result in an empty output file.

-ONEstrand

Calculates word frequencies in the forward direction only.

-TRIM

Removes the lowest-scoring hits from the output list, as these are probably incomplete. This option can result in an empty output file.
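
The effect of the optional parameters can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's actual logic: the names count_words and hit_list are hypothetical, the default (without -ONEstrand) is assumed to add the counts from the reverse-complement strand, and -TRIM is assumed to drop every hit that shares the lowest score remaining in the list.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_words(seq, word_size=6, one_strand=False):
    """Count words on the forward strand, plus the reverse strand unless one_strand."""
    seq = seq.upper()
    # Assumption: both strands contribute to a single combined table of counts.
    strands = [seq] if one_strand else [seq, reverse_complement(seq)]
    counts = Counter()
    for strand in strands:
        for i in range(len(strand) - word_size + 1):
            counts[strand[i:i + word_size]] += 1
    return counts


def hit_list(counts, list_size=50, min_score=1, trim=False):
    """Apply the minimum score, then the list size, then the optional trim."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    hits = [(w, n) for w, n in ranked if n >= min_score][:list_size]
    if trim and hits:
        # Assumption: trimming removes all hits tied at the lowest listed score.
        lowest = hits[-1][1]
        hits = [(w, n) for w, n in hits if n > lowest]
    return hits


if __name__ == "__main__":
    counts = count_words("AAAC", word_size=2)  # made-up sequence
    print(hit_list(counts, list_size=3))
    print(hit_list(counts, list_size=3, trim=True))
    print(hit_list(counts, min_score=2, trim=True))
```

In this sketch the last call returns an empty list, because every word that passes the minimum score shares the same lowest score. That mirrors the note that these options can result in an empty output file.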

Printed: April 22, 1996 15:56 (1162)