Wordup

WORDUP

FUNCTION

WordUp is based on a first order Markov analysis and detects statistically significant oligonucleotide patterns from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length in the same analysis.

DESCRIPTION

WordUp is based on a first-order Markov analysis (Pesole et al., 1992) and detects statistically significant sequence motifs from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length within this range in the same analysis. The problem addressed is the singling out of short nucleic acid sequences with non-random statistical properties, which may thus be biologically active.

The key element in the processing is a statistical analysis based on a special chi-square test that compares the observed pattern frequencies with the frequencies expected from a probability calculation. The statistical significance of each pattern is determined by comparing the expected number of sequences containing a given pattern with the observed number. The results are then tested further to reject falsely significant patterns, which arise from partial overlap with the true biological signal being sought.

This problem was solved by starting from the analysis of shorter oligomers and, through successive iterations, checking that the oligomers found to be statistically significant are not actually components of longer biological signals.

The dataset to be used in the WordUp analysis needs to be purged of redundancies, which make the risk of assigning high significance to nonsignificant patterns very high. The CLEANUP program (Grillo et al., ..) is suited to removing redundancies (based on similarity and overlap percentage) from sequence collections.

AUTHOR

The method was devised by Graziano Pesole et al. The computer program was written by Giorgio Grillo (giorgio@area.ba.cnr.it) and Massimo Ianigro (massimo@area.ba.cnr.it), who should be contacted for support.

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a sample session with WordUp

  
  
  % wordup
  
   WORDUP uses nucleotide sequences
  
   WORDUP of what sequence(s) ? genembl:ecl*
  
   What base name for the output files (* ecl1 *) ?  ecl
  
   Chi-square threshold (negative value to disable)(* 4.0 *) ?
  
  %

OUTPUT

Here is the output of a WordUp session on a set of E. coli sequences from which redundancy has not been cleaned up. Part of the main search file ("file") is shown in (a) and part of the dynamic search file ("file".dyn) in (b).

  
  
    (a)
    =======================================================
         W O R D U P  -  MAIN SEARCH FILE
    =======================================================
                    OCCURRENCES
    -------------------------------------------------------
    WORDS           OBSERVED     EXPECTED       CHI-SQUARE
    -------------------------------------------------------
    GAAAAA             58        42.69480         5.48660
    AGAAAA             47        32.15242         6.85642
    TGAAAA             61        40.49997        10.37658
    CCCAAA              9        26.21809        11.30756
    GCCAAA             21        33.37346         4.58755
    TTCAAA             19        31.22186         4.78427
    AAGAAA             47        32.15242         6.85642
    ACGAAA             20        32.44691         4.77474
    CTGAAA             46        33.04326         5.08052
    .....................................................
    .....................................................
    =======================================================
  
    (b)
    =======================================================
         W O R D U P  -  DYNAMIC PATTERNS FILE
    =======================================================
                    OCCURRENCES
    -------------------------------------------------------
    WORDS           OBSERVED     EXPECTED       CHI-SQUARE
    -------------------------------------------------------
    CAGGTT             37        16.61286        25.01889
    GGCGCC              3        29.80980        24.11172
    CACGTG              1        25.64915        23.68813
    TACCAG             28        11.61083        23.13398
    TTGGAA              4        29.66267        22.20207
    CAGGTA             29        12.41553        22.15327
    AGCCAG             35        16.23141        21.70235
    TAACCC             34        15.89003        20.64003
    .....................................................
    .....................................................
    =======================================================
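
The CHI-SQUARE column in the listings above is consistent with the single-cell statistic (OBSERVED - EXPECTED)**2 / EXPECTED. The short Python sketch below recomputes it for a few of the listed rows. It only checks the arithmetic of the printed values; it is not taken from the WordUp source, and the function name chi_square is chosen here for illustration.

```python
def chi_square(observed, expected):
    """Single-cell chi-square statistic: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected


# (word, observed, expected, reported chi-square), copied from listings (a) and (b).
rows = [
    ("GAAAAA", 58, 42.69480, 5.48660),
    ("CCCAAA", 9, 26.21809, 11.30756),
    ("CAGGTT", 37, 16.61286, 25.01889),
    ("CACGTG", 1, 25.64915, 23.68813),
]

for word, obs, exp, reported in rows:
    computed = chi_square(obs, exp)
    print(f"{word}  computed={computed:.5f}  reported={reported:.5f}")
```

Note that both over-represented words (OBSERVED > EXPECTED, e.g. CAGGTT) and avoided words (OBSERVED < EXPECTED, e.g. CACGTG) can reach a high chi-square value.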

RELATED PROGRAMS

FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

RESTRICTIONS

Before doing any pattern search, WordUp converts the specified sequences into Pearson format and places them in a temporary file named wordup.tmp.seq. The program creates another temporary file named wordup.tmp that contains the generated patterns. The two files are usually created in the current directory. However, you can force the program to create them in another directory by setting the environment variable WORDUPTMP to the name of the directory that will hold these files.

For example (Cshell):

setenv WORDUPTMP /tmp

Another limitation is the maximum allowed pattern length of 9 nucleotides.

If the program aborts, you must remove the temporary files manually.

ALGORITHM

The WordUp algorithm is aimed at the identification of statistically significant nucleotide strings which are shared or avoided in a set of sequences that are functionally equivalent but not evolutionarily homologous (e.g. promoter regions, introns, etc.). The statistical significance of each oligonucleotide signal is determined through a chi-square test that compares the observed and the expected number of sequences containing that signal. The expected number is calculated assuming that oligonucleotides are Poisson distributed and that their occurrence probability follows a first-order Markov chain, i.e. depends on dinucleotide frequencies. The Poisson distribution is suitable for the description of rare events, such as the occurrence of oligomers of w or more nucleotides in sequences much shorter than 4**w nucleotides.
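
The exact formulas used by the program are not given in this document. The following Python sketch shows one plausible reading of the paragraph above: the word probability is the product of first-order Markov transition probabilities estimated from dinucleotide counts, and the chance that a sequence contains the word at least once is 1 - exp(-lambda), with lambda the Poisson mean for that sequence. All function names and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def markov1_word_probability(word, sequences):
    """Probability of `word` starting at a given position under a first-order
    Markov chain estimated from the mono- and dinucleotide counts."""
    mono, di, di_from = Counter(), Counter(), Counter()
    for seq in sequences:
        mono.update(seq)
        for a, b in zip(seq, seq[1:]):
            di[a + b] += 1
            di_from[a] += 1  # dinucleotides whose first base is a
    total = sum(mono.values())
    if total == 0 or not word:
        return 0.0
    prob = mono[word[0]] / total
    for a, b in zip(word, word[1:]):
        if di_from[a] == 0:
            return 0.0
        prob *= di[a + b] / di_from[a]
    return prob


def expected_sequences_with_word(word, sequences):
    """Expected number of sequences containing `word` at least once,
    treating the occurrences in each sequence as Poisson distributed."""
    p = markov1_word_probability(word, sequences)
    expected = 0.0
    for seq in sequences:
        positions = max(len(seq) - len(word) + 1, 0)
        lam = positions * p  # Poisson mean for this sequence
        expected += 1.0 - math.exp(-lam)
    return expected


def observed_sequences_with_word(word, sequences):
    """Number of sequences in which `word` occurs at least once."""
    return sum(1 for seq in sequences if word in seq)


# Toy data, far too small for a real analysis.
toy = ["ACGTGACGTT", "TTGACGTACA", "CCCCGGGGAA"]
obs = observed_sequences_with_word("GACGT", toy)
exp = expected_sequences_with_word("GACGT", toy)
if exp > 0:
    print(obs, exp, (obs - exp) ** 2 / exp)
```

Counting sequences rather than total occurrences follows the text, which compares the number of sequences containing a signal.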

It must be stressed that statistical significance, even though it does not establish biological significance, can provide a substantial clue to it. The list of statistically significant motifs, i.e. those having a chi-square value above a given threshold, constitutes a motif vocabulary that is specific for the biological function shared by the analysed sequences.

The starting word length, w, to be used in the analysis is defined as the shortest length for which the sequence oligomers can be taken to be Poisson distributed (i.e. Lseq << 4**w). In general, the best choice is thus a word length of six nucleotides, even if the actual length of the biological signals is unknown. In order to establish whether there are significant oligomers of length w+1, w+2, etc., we follow a dynamic elongation procedure that considers pairs of patterns overlapping by w-1 of their w nucleotides. If the (w+1)-mer is more significant than its two component w-mers, the (w+1)-mer replaces them in the vocabulary of significant motifs. If it is not, the less significant of the two overlapping w-mers is removed from the vocabulary, as its significance is likely to be due to the "overlapping effect". In general, this procedure can be used to determine the statistical significance of (w+k)-mers from overlapping (w+k-1)-mers.
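
One elongation step as described above can be sketched in Python as follows. This is a simplified illustration, not the program's actual code: the function elongate, the score callback that returns the chi-square of a (w+1)-mer, and all numbers in the example are hypothetical.

```python
def elongate(vocabulary, score):
    """Perform one elongation step.

    vocabulary: dict mapping each significant w-mer to its chi-square value.
    score: callable returning the chi-square value of a (w+1)-mer
           (hypothetical; in practice it would be computed from the data).
    Returns a new dict holding the surviving w-mers and the accepted (w+1)-mers.
    """
    result = dict(vocabulary)
    words = sorted(vocabulary)
    for left in words:
        for right in words:
            # The pair overlaps by w-1 nucleotides when the suffix of `left`
            # equals the prefix of `right`.
            if left[1:] != right[:-1]:
                continue
            longer = left + right[-1]
            longer_chi = score(longer)
            if longer_chi > max(vocabulary[left], vocabulary[right]):
                # The (w+1)-mer replaces both of its component w-mers.
                result.pop(left, None)
                result.pop(right, None)
                result[longer] = longer_chi
            elif left != right:
                # Otherwise drop the less significant of the two w-mers.
                weaker = left if vocabulary[left] < vocabulary[right] else right
                result.pop(weaker, None)
    return result


# Made-up chi-square values, for illustration only.
vocab = {"GAAAAA": 6.0, "AAAAAT": 3.0, "CCCAAA": 11.0}
toy_scores = {"GAAAAAT": 9.0}
print(elongate(vocab, lambda word: toy_scores.get(word, 0.0)))
```

Repeating the step on the output, with w replaced by w+1, corresponds to the general (w+k)-mer case mentioned in the text.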

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % wordup [-INfile=]@sequence_list  -Default
  
  Prompted Parameters:
  
  [-OUTfile=]wordup.out          the output file name
  -CHI=20                        sets the threshold chi-square value
  
  
  Local Data Files:
  
  -PATTern = wordup.pat          a file with a set of patterns from 6 to 9
                                 nucleotides long. Alternatively, using
                                 -PATTern=num, you can set the starting
                                 pattern length
  
  Optional Parameters:
  
  -PATTern=xxx    Defines the starting pattern length. xxx can be
                  a number representing the minimum pattern length
                  (between 6 and 9) or the name of the file
                  containing the patterns
  -[NO]DYN        [Don't] Carry out the dynamic pattern elongation
  -SEC            Store results for patterns with a chi-square
                  below the fixed threshold in OUTfile.sec
  -BATch          Submit the program to the batch queue for processing
                  after prompting you for all required user inputs
  -Help           Shows the parameters

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-PATTern

Defines the starting pattern length. A starting pattern length of six nucleotides is suggested to ensure that the pattern distributions obey a Poisson distribution. Using the dynamic elongation procedure, WordUp will also identify significant patterns longer than six nucleotides.
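
To see why short starting lengths violate the rare-event condition (Lseq << 4**w), the Python sketch below tabulates the expected number of occurrences of one specific w-mer in a single sequence. It assumes equiprobable, independent bases, which is a simplification of the Markov model used by WordUp, and the sequence length of 500 nucleotides is an arbitrary value chosen for illustration.

```python
def expected_hits_uniform(seq_len, w):
    """Expected occurrences of one specific w-mer in a sequence of seq_len
    nucleotides, assuming equiprobable and independent bases."""
    positions = max(seq_len - w + 1, 0)
    return positions / 4 ** w


SEQ_LEN = 500  # assumed sequence length, for illustration only

for w in range(4, 10):
    print(f"w={w}  4**w={4 ** w:>7}  expected hits={expected_hits_uniform(SEQ_LEN, w):.4f}")
```

The expected count falls by a factor of about four for each added nucleotide, so a given word becomes a rare event once 4**w is well above the sequence length.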

Using "-PATTern = wordup.pat" you can specify a file with a set of patterns, from 6 to 9 nucleotides long, to be searched. The search can thus be restricted to a subset of the 4**w patterns of length w; in this case the patterns have to be stored in a file.

The file containing the patterns is made up of strings (i.e. the patterns) separated by spacings. Space, new line, carriage return, horizontal and vertical tab, and form feed characters are all treated as spacings. Thus, the pattern file is a text file in which the pattern layout is freely chosen by the user. All patterns in the file must have the same length.
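
A pattern file with this layout can be checked before it is passed to the program. The Python sketch below splits the text on whitespace and enforces the two constraints stated above (a common length, between 6 and 9 nucleotides). The function name read_patterns is hypothetical, and how WordUp itself reports a malformed file is not described here.

```python
def read_patterns(text, min_len=6, max_len=9):
    """Split pattern-file text on whitespace and validate the pattern lengths."""
    # str.split() with no argument splits on runs of whitespace, including
    # space, new line, carriage return, tabs and form feed.
    patterns = text.split()
    if not patterns:
        raise ValueError("pattern file contains no patterns")
    length = len(patterns[0])
    if any(len(p) != length for p in patterns):
        raise ValueError("all patterns must have the same length")
    if not min_len <= length <= max_len:
        raise ValueError(f"pattern length must be between {min_len} and {max_len}")
    return patterns


# Free layout: several patterns per line, mixed separators.
example = "GAAAAA AGAAAA\tTGAAAA\nCCCAAA\r\nGCCAAA\fTTCAAA"
print(read_patterns(example))
```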

-[NO]DYN

[Don't] Carry out the dynamic pattern elongation. If the -NODYN option is set, WordUp evaluates the statistical significance of patterns of fixed length. In this case a pattern that overlaps a truly significant pattern w nucleotides long by w-1 (w-2, ..) positions will very likely also be considered significant. The dynamic elongation procedure reduces this pattern "overlapping" effect.

-CHI

Sets the threshold chi-square value. A value of 20 is suggested as a default. Given that nucleotide sequences are not random, it is not guaranteed that patterns with chi-square > 20 are truly biologically significant. It must also be considered that chi-square values depend on the size of the sequence collection used in the analysis. Very high chi-square values (e.g. > 50) are, however, likely to reflect actual significance.
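
As a rough guide to what a threshold means, the Python sketch below converts a chi-square value into an upper-tail probability. It assumes that the statistic follows a chi-square distribution with one degree of freedom, which this document does not state for WordUp's "special" test, so the figures are indicative only. The thresholds shown are the prompt default from the sample session (4.0) and the two values mentioned above (20 and 50).

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom (assumed here; not confirmed for WordUp's own test)."""
    return math.erfc(math.sqrt(x / 2.0))


for threshold in (4.0, 20.0, 50.0):
    print(f"chi-square > {threshold:5.1f}   tail probability = {chi2_sf_1df(threshold):.3g}")
```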

-[NO]SEC

[Do not] Print out the results for all the 4**w patterns of length w that are not included in the main output file.

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

REFERENCES

Pesole, G., Prunella, N., Liuni, S., Attimonelli, M. and Saccone, C. (1992) WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res. 20, 2871-2875.

Prunella, N., Liuni, S., Attimonelli, M. and Pesole, G. (1993) FASTPAT: a fast and efficient algorithm for string searching in DNA sequences. CABIOS 9, 541-545.

Liuni, S., Prunella, N., Pesole, G., D'Orazio, T., Stella, E. and Distante, A. (1993) SIMD parallelization of the WORDUP algorithm for detecting statistically significant patterns in DNA sequences. CABIOS 9, 701-707.

Pesole, G., Attimonelli, M. and Saccone, C. (1996) Linguistic analysis of nucleotide sequences: algorithms for pattern recognition and analysis of codon strategy. Methods in Enzymology 266, 281-294.

Printed: April 22, 1996 15:56 (1162)