WordCount counts the most common words in a sequence and reports them ordered by frequency and then by word sequence.
WordCount indexes all occurrences of words of a specified length in a nucleotide sequence, and reports the most frequent words.
WordCount can also report the complete list of all words in the sequence to a second output file for further statistical analysis.
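The counting described above can be sketched in a few lines of Python. This is an illustration only, not the program's own code: the function names `count_words` and `most_frequent` are hypothetical, words are assumed to be overlapping windows of the given size, only the forward strand is counted, and ties are assumed to be listed alphabetically (which is consistent with the sample output in this document).

```python
from collections import Counter


def count_words(sequence, word_size=6):
    """Count every overlapping word of length word_size (forward strand only)."""
    seq = sequence.upper()
    # A sequence shorter than word_size yields an empty range, hence no words.
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))


def most_frequent(counts, list_size=100):
    """Return up to list_size (word, count) pairs, most frequent first.

    Assumption: ties are broken alphabetically by word.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:list_size]


if __name__ == "__main__":
    for word, count in most_frequent(count_words("ACGACGACG", word_size=3), 2):
        print(word, count)
```

The complete `Counter` returned by `count_words` plays the role of the full word list, while `most_frequent` corresponds to the shorter summary report.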
This program was written by Peter Rice (E-mail: pmr@sanger.ac.uk; Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a sample session with WordCount:
% wordcount

  WORDCOUNT uses any sequence data

  WORDCOUNT of what sequence ? em_ba:paamir

                  Start (* 1 *) ?
                    End (* 2167 *) ?

  What should I call the output file (* paamir.count *) ?

  What word size (* 6 *) ?

  What list size (* 100 *) ?

%
The output from WordCount is a simple report of hits in the sequence.
CGGCGC     12
GCCGCC     12
GCGCCG     12
GGCGGC     12
CCAGCA     11
CCGCCG     11
///////////////////////////////////////////////////////
The input file for WordCount is a GCG nucleotide sequence file.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % wordcount [-INfile1=]GGammaCod.Seq -Default

Prompted parameters:

-BEGin=1 -END=444            range of interest
-WORdsize=6                  word size or mask pattern
-LIStsize=50                 size of output list
[-OUTfile=]ggammacod.count   summary output file name

Local Data Files: none

Optional Parameters:

-FULLfile=ggammacod.full     full output file name
-MINscore=1                  sets minimum score to be listed
-ONEstrand                   calculate for forward direction only
-TRIM                        remove lowest scores from hit list
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-FULLfile=ggammacod.full

Reports the full list of all words with their frequencies. This can be used by other programs or scripts to perform statistical analyses.

-MINscore=1

Sets a minimum frequency for a word to be included in the output list, in addition to the cutoff imposed by the list size. This option can result in an empty output file.

-ONEstrand

Calculates word frequencies in the forward direction only.

-TRIM

Removes the lowest scoring hits from the output list, as these are probably incomplete. This option can result in an empty output file.
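As a rough illustration of how the -ONEstrand and -MINscore options might interact with counting, here is a minimal Python sketch. It is not taken from the program: the function names are hypothetical, and it assumes that the default two-strand mode simply adds the words of the reverse complement to those of the forward strand, and that the minimum score is a plain threshold on the count. The sequence is assumed to contain only A, C, G and T.

```python
from collections import Counter

# Translation table for complementing unambiguous nucleotide codes.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_words_stranded(sequence, word_size=6, one_strand=False, min_score=1):
    """Count words on one or both strands and drop those below min_score.

    Assumptions (not confirmed by the documentation): both-strand counting
    sums forward and reverse-complement words; min_score is a simple cutoff.
    """
    strands = [sequence.upper()]
    if not one_strand:
        strands.append(reverse_complement(strands[0]))

    counts = Counter()
    for strand in strands:
        counts.update(
            strand[i:i + word_size] for i in range(len(strand) - word_size + 1)
        )

    # A high min_score can filter out every word, leaving an empty result.
    return {word: n for word, n in counts.items() if n >= min_score}


if __name__ == "__main__":
    print(count_words_stranded("AAAC", word_size=2))
    print(count_words_stranded("AAAC", word_size=2, one_strand=True))
    print(count_words_stranded("AAAC", word_size=2, min_score=3))
```

Under these assumptions a word and its reverse complement always receive the same count in two-strand mode, and a sufficiently high minimum score returns nothing at all, which matches the warning about empty output files.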
Printed: April 22, 1996 15:56 (1162)