Wordcount

FUNCTION

WordCount counts the most common words in a sequence and reports them in order of frequency, with words of equal frequency listed in sequence (alphabetical) order.

DESCRIPTION

WordCount indexes all occurrences of words of a specified length in a nucleotide sequence, and reports the most frequent words.

WordCount can also report the complete list of all words in the sequence to a second output file for further statistical analysis.
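
The counting described above can be illustrated with a minimal Python sketch. It is not the EGCG implementation: the function name count_words and its arguments are hypothetical, and it assumes that words are overlapping and that only the forward strand is read. The existence of the -ONEstrand option suggests the real program treats strands differently by default, which this sketch does not attempt to reproduce.

```python
from collections import Counter


def count_words(sequence, word_size=6, list_size=100):
    """Return the list_size most frequent words of length word_size.

    Assumption: every overlapping window on the forward strand is one word.
    """
    seq = sequence.upper()
    # Slide a window of word_size across the sequence, one base at a time.
    counts = Counter(
        seq[i:i + word_size] for i in range(len(seq) - word_size + 1)
    )
    # Highest count first; words with equal counts in alphabetical order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:list_size]


if __name__ == "__main__":
    for word, count in count_words("ACGTACGTAC", word_size=4, list_size=3):
        print(f"{word:<10}{count:>4}")
```

Keeping the full Counter rather than only the top of the list would correspond roughly to the complete word list that WordCount can write to its second output file.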

AUTHOR

This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a sample session with WordCount:

  
  
  % wordcount
  
   WORDCOUNT uses any sequence data
  
   WORDCOUNT of what sequence ?  em_ba:paamir
  
                Start (* 1 *) ?
                End (* 2167 *) ?
  
   What should I call the output file (* paamir.count *) ?
  
   What word size (* 6 *) ?
  
   What list size (* 100 *) ?
  
  %

OUTPUT

The output from WordCount is a simple report of hits in the sequence: each line gives a word followed by the number of times it occurs, most frequent first.

  
  
   CGGCGC    12
   GCCGCC    12
   GCGCCG    12
   GGCGGC    12
   CCAGCA    11
   CCGCCG    11
  
  ///////////////////////////////////////////////////////
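
For further processing, a report like the one above could be read back with a short Python sketch. This assumes each hit line consists of a word and an integer count separated by whitespace, as in the sample; the function name parse_count_report is hypothetical, and any line that does not fit this pattern (blank lines, the row of slashes, any header the real file may carry) is simply skipped.

```python
def parse_count_report(text):
    """Parse WordCount-style report text into a list of (word, count) pairs.

    Assumption: a hit line has exactly two whitespace-separated fields,
    the word and a non-negative integer count. Other lines are ignored.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            hits.append((fields[0], int(fields[1])))
    return hits


if __name__ == "__main__":
    sample = """
       CGGCGC    12
       GCCGCC    12
       CCAGCA    11
      ///////////////////////////////////////////////////////
    """
    for word, count in parse_count_report(sample):
        print(word, count)
```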

INPUT FILE

The input file for WordCount is a GCG nucleotide sequence file.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % wordcount [-INfile1=]GGammaCod.Seq -Default
  
  Prompted parameters:
  
  -BEGin=1 -END=444          range of interest
  -WORdsize=6                word size or mask pattern
  -LIStsize=50               size of output list
  [-OUTfile=]ggammacod.count summary output file name
  
  Local Data Files: none
  
  Optional Parameters:
  
  -FULLfile=ggammacod.full   full output file name
  -MINscore=1                sets minimum score to be listed
  -ONEstrand                 calculate for forward direction only
  -TRIM                      remove lowest scores from hit list
