WordUp is based on a first order Markov analysis and detects statistically significant oligonucleotide patterns from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length in the same analysis.
WordUp is based on a first order Markov analysis (Pesole et al., 1992) and detects statistically significant sequence motifs from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length in the same analysis. The problem addressed is the singling out of short nucleic acid sequences with non-random statistical properties, which may thus be biologically active.
The key element in the processing is a statistical analysis based on a special chi-square test between the observed pattern frequencies and the frequencies expected from a probability calculation. The statistical significance of each pattern is determined by comparing the expected number of sequences containing a given pattern with the observed number. The results are then tested further to reject falsely significant patterns caused by partial overlap with the true biological signal being searched for.
The problem was solved by starting from the analysis of shorter oligomers and, through subsequent iterations, checking that the oligomers found to be statistically significant are not actually components of longer biological signals.
The dataset to be used in the WordUp analysis needs to be cleaned of redundancies, which greatly increase the risk of assigning high significance to nonsignificant patterns. The CLEANUP program (Grillo et al., ..) is designed to remove redundancies (based on similarity and overlap %) from sequence collections.
The method has been written by Graziano Pesole et al. The computer program has been written by Giorgio Grillo (giorgio@area.ba.cnr.it) and Massimo Ianigro (massimo@area.ba.cnr.it), who should be contacted for support.
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a sample session with WordUp
% wordup

  WORDUP uses nucleotide sequences

  WORDUP of what sequence(s) ?  genembl:ecl*

  What base name for the output files (* ecl1 *) ?  ecl

  Chi-square threshold (negative value to disable)(* 4.0 *) ?

%
Here is a session using WordUp on a set of E.coli sequences from which redundancy has not been removed. Part of the output of the main search file ("file") is shown in (a), and part of the dynamic search file ("file".dyn) in (b).
(a)

=======================================================
         W O R D U P  -  MAIN SEARCH FILE
=======================================================
                     OCCURRENCES
-------------------------------------------------------
 WORDS      OBSERVED      EXPECTED      CHI-SQUARE
-------------------------------------------------------
 GAAAAA        58         42.69480        5.48660
 AGAAAA        47         32.15242        6.85642
 TGAAAA        61         40.49997       10.37658
 CCCAAA         9         26.21809       11.30756
 GCCAAA        21         33.37346        4.58755
 TTCAAA        19         31.22186        4.78427
 AAGAAA        47         32.15242        6.85642
 ACGAAA        20         32.44691        4.77474
 CTGAAA        46         33.04326        5.08052
 .....................................................
 .....................................................
=======================================================

(b)

=======================================================
       W O R D U P  -  DYNAMIC PATTERNS FILE
=======================================================
                     OCCURRENCES
-------------------------------------------------------
 WORDS      OBSERVED      EXPECTED      CHI-SQUARE
-------------------------------------------------------
 CAGGTT        37         16.61286       25.01889
 GGCGCC         3         29.80980       24.11172
 CACGTG         1         25.64915       23.68813
 TACCAG        28         11.61083       23.13398
 TTGGAA         4         29.66267       22.20207
 CAGGTA        29         12.41553       22.15327
 AGCCAG        35         16.23141       21.70235
 TAACCC        34         15.89003       20.64003
 .....................................................
 .....................................................
=======================================================
FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.
Before doing any pattern search, WordUp converts the specified sequences into Pearson format and places them in a temporary file named wordup.tmp.seq. The program creates another temporary file named wordup.tmp that contains the generated patterns. The two files are usually created in the current directory. However, you can force the program to create them in another directory by setting the environment variable WORDUPTMP to the name of the directory that will hold these files.
For example (Cshell):
setenv WORDUPTMP /tmp
A further limitation is the maximum allowed pattern length of 9 nucleotides.
If the program aborts, then you have to remove the temporary files (wordup.tmp.seq and wordup.tmp) manually.
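The manual clean-up after an abort could be scripted along the following lines. This is only a sketch: the two file names and the WORDUPTMP variable come from the text above, while the helper names `wordup_temp_dir` and `remove_leftover_temp_files` are hypothetical and not part of WordUp.

```python
import os

# Temporary file names as given in this document.
TEMP_NAMES = ("wordup.tmp.seq", "wordup.tmp")


def wordup_temp_dir(environ=None):
    """Return the directory assumed to hold WordUp's temporary files.

    Uses WORDUPTMP when it is set, otherwise the current directory.
    """
    environ = os.environ if environ is None else environ
    return environ.get("WORDUPTMP") or os.getcwd()


def remove_leftover_temp_files(environ=None):
    """Delete leftover temporary files and return the paths removed."""
    directory = wordup_temp_dir(environ)
    removed = []
    for name in TEMP_NAMES:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Calling `remove_leftover_temp_files()` with no argument reads the real environment; passing a dictionary makes the helper easy to try out on a scratch directory first.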
The WordUp algorithm is aimed at the identification of statistically significant nucleotide strings which are shared or avoided in a set of sequences that are functionally equivalent but not evolutionarily homologous (e.g. promoter regions, introns, etc.). The statistical significance of each oligonucleotide signal is simply determined through a chi-square test by comparing the actual and the expected number of sequences containing that given signal, calculated assuming that oligonucleotides are Poisson distributed and that their occurrence probability follows a first order Markov chain, i.e. depends on dinucleotide frequencies. The Poisson distribution is suitable for the description of rare events, such as the distribution of oligomers longer than w nucleotides in sequences much shorter than 4**w nucleotides.
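The test described above can be illustrated with a short Python sketch. The text states only that occurrences are treated as Poisson distributed with first order Markov probabilities, so the exact formulas used here are assumptions: the word probability is taken as the product of its dinucleotide frequencies divided by its inner mononucleotide frequencies, frequencies are pooled over the whole sequence set, and each sequence contributes 1 - exp(-lambda) to the expected count. All function names are hypothetical. The statistic (O - E)**2 / E does reproduce the values in the sample output, e.g. GAAAAA with 58 observed and 42.69480 expected gives 5.4866.

```python
import math
from collections import Counter


def frequencies(seqs):
    """Relative mono- and dinucleotide frequencies pooled over all sequences."""
    mono, di = Counter(), Counter()
    for s in seqs:
        mono.update(s)
        di.update(s[i:i + 2] for i in range(len(s) - 1))
    n1, n2 = sum(mono.values()), sum(di.values())
    return ({k: v / n1 for k, v in mono.items()},
            {k: v / n2 for k, v in di.items()})


def markov1_word_probability(word, mono, di):
    """Assumed form: P(w) = f(w1w2) * prod_i f(w_i w_i+1) / f(w_i)."""
    p = di.get(word[0:2], 0.0)
    for i in range(1, len(word) - 1):
        denom = mono.get(word[i], 0.0)
        if denom == 0.0:
            return 0.0
        p *= di.get(word[i:i + 2], 0.0) / denom
    return p


def expected_sequences(word, seqs, mono, di):
    """Expected number of sequences containing the word at least once."""
    p = markov1_word_probability(word, mono, di)
    total = 0.0
    for s in seqs:
        positions = max(len(s) - len(word) + 1, 0)
        total += 1.0 - math.exp(-p * positions)  # Poisson: P(at least one hit)
    return total


def observed_sequences(word, seqs):
    """Number of sequences containing the word at least once."""
    return sum(1 for s in seqs if word in s)


def chi_square(observed, expected):
    """Single-term chi-square, consistent with the sample output tables."""
    return (observed - expected) ** 2 / expected
```

Counting sequences rather than total occurrences means that a word repeated many times in a single sequence adds only one to the observed count.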
It must be stressed that statistical significance, even though it does not account for biological significance, can provide a substantial clue to it. The list of statistically significant motifs, i.e. those having a chi-square value above a given threshold, constitutes a motif vocabulary which is specific for the biological function shared by the analysed sequences.
The starting word length, w, to be used in the analysis has to be defined as the shortest length for which sequence oligomers can be assumed to be Poisson distributed (i.e. Lseq << 4**w). The best choice will thus, in general, be a word length of six nucleotides, even if we do not know the actual length of the biological signals. In order to establish whether there are significant oligomers of length w+1, w+2, etc., we follow a dynamic elongation procedure which considers pattern pairs overlapping by w-1 of their w nucleotides. If the w+1-mer is more significant than its two component w-mers, the w+1-mer replaces them in the vocabulary of significant motifs. If it is not, the less significant of the two overlapping w-mers is removed from the vocabulary, as its significance is likely to be due to the "overlapping effect". In general, this procedure can be used to determine the statistical significance of w+k-mers by considering overlapping w+k-1-mers.
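One pass of the elongation procedure described above might look like the following Python sketch. It is a minimal illustration, not the program's actual code: the function name `elongate_once`, the `score` callable (assumed to return the chi-square value of any word), the order in which pairs are visited, and the handling of ties are all assumptions.

```python
def elongate_once(vocab, score):
    """One elongation pass over a vocabulary of significant w-mers.

    vocab: dict mapping each significant w-mer to its chi-square value.
    score: hypothetical callable returning the chi-square of any word.
    Returns a new dict; the input vocabulary is left unchanged.
    """
    result = dict(vocab)
    for left in sorted(vocab):
        for right in sorted(vocab):
            # Keep only distinct pairs overlapping by w-1 nucleotides.
            if left == right or left[1:] != right[:-1]:
                continue
            # Skip words already replaced or removed in this pass.
            if left not in result or right not in result:
                continue
            longer = left + right[-1]
            s = score(longer)
            if s > vocab[left] and s > vocab[right]:
                # The (w+1)-mer replaces both component w-mers.
                del result[left]
                del result[right]
                result[longer] = s
            else:
                # Otherwise drop the less significant w-mer of the pair.
                weaker = left if vocab[left] < vocab[right] else right
                del result[weaker]
    return result
```

For example, with the made-up values `{"GAAAAA": 5.0, "AAAAAT": 6.0}` and a `score` that returns 10.0 for GAAAAAT, the pass yields a vocabulary containing only GAAAAAT. Repeating the pass on its own output would extend the search to w+2-mers and beyond.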
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % wordup [-INfile=]@sequence_list -Default

Prompted Parameters:

[-OUTfile=]wordup.out   the output file name
-CHI=20                 sets the threshold chi-square value

Local Data Files:

-PATTern = wordup.pat   a file with a set of patterns from 6 to 9
                        nucleotides long. Alternatively, using
                        -PATTern=num, you can set the starting
                        pattern length

Optional Parameters:

-PATTern=xxx   Defines the starting pattern length. xxx can be a
               number giving the minimum pattern length (between
               6 and 9) or the name of the file containing the
               patterns
-[NO]DYN       [Don't] Carry out the dynamic pattern elongation
-SEC           Store results for patterns with a chi-square below
               the fixed threshold in OUTfile.sec
-BATch         Submit the program to the batch queue for processing
               after prompting you for all required user inputs
-Help          Shows the parameters
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Allows the definition of the starting pattern length. A starting pattern length of six nucleotides is suggested to ensure that the pattern distributions obey a Poisson distribution. Using the dynamic elongation procedure, WordUp will also identify significant patterns longer than six nucleotides.
Using "-PATTern = wordup.pat" you can specify a file with a set of patterns, from 6 to 9 nucleotides long, to be searched. The search can thus be restricted to some of the 4**w patterns of length w; in this case they have to be stored in a file.
The file containing the patterns is made up of strings (i.e. the patterns) separated by spacings. Spaces, new lines, returns, horizontal and vertical tabs and page ends are all considered spacings. Thus, the pattern file is a text file where the pattern layout is freely chosen by the user. The pattern length in the file has to be the same for all patterns.
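The pattern file layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function name `parse_patterns` is hypothetical, and the length check (all patterns equal, between 6 and 9) simply restates the limits given in this document. Python's `str.split()` with no argument splits on runs of whitespace, which covers all the spacing characters listed (space, new line, return, horizontal and vertical tab, page end).

```python
def parse_patterns(text, min_len=6, max_len=9):
    """Split pattern-file text on whitespace and validate pattern lengths."""
    patterns = text.split()  # any run of whitespace acts as a separator
    if not patterns:
        raise ValueError("pattern file contains no patterns")
    lengths = {len(p) for p in patterns}
    if len(lengths) != 1:
        raise ValueError("all patterns must have the same length")
    length = lengths.pop()
    if not min_len <= length <= max_len:
        raise ValueError("pattern length must be between %d and %d"
                         % (min_len, max_len))
    return patterns
```

A file could be checked with `parse_patterns(open("wordup.pat").read())` before it is passed to the program.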
[Don't] Carry out the dynamic pattern elongation. If the -NODYN option is set, WordUp evaluates the statistical significance of patterns of fixed length. In this case a pattern overlapping a truly significant pattern w nucleotides long by w-1 (w-2, ..) positions will very likely also be considered significant. The dynamic elongation procedure reduces this pattern "overlapping" effect.
Set the threshold chi-square value. A value of 20 is suggested as a default. Of course, given that nucleotide sequences are not random, it is not guaranteed that patterns with chi-square>20 are truly biologically significant. It also needs to be considered that the chi-square values depend on the size of the sequence collection considered in the analysis. Of course, very high chi-square values (e.g. >50) will actually be significant.
[Do not] Print out the results for all the 4**w patterns of length w that are not included in the main output file.
Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
Pesole, G., Prunella, N., Liuni, S., Attimonelli, M. and Saccone, C. (1992) WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res. 20, 2871-2875.
Prunella, N., Liuni, S., Attimonelli, M. and Pesole, G. (1993) FASTPAT: a fast and efficient algorithm for string searching in DNA sequences. CABIOS 9, 541-545.
Liuni, S., Prunella, N., Pesole, G., D'Orazio, T., Stella, E. and Distante, A. (1993) SIMD parallelization of the WORDUP algorithm for detecting statistically significant patterns in DNA sequences. CABIOS 9, 701-707.
Pesole, G., Attimonelli, M. and Saccone, C. (1996) Linguistic analysis of nucleotide sequences: algorithms for pattern recognition and analysis of codon strategy. Methods in Enzymology 266, 281-294.
Printed: April 22, 1996 15:56 (1162)