FUNCTION
RFindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. The output is a series of files called r1.rfind, r2.rfind, and so on, each containing a single extracted sequence. These can be fed through Pileup or manipulated in other ways.
DESCRIPTION
RFindPatterns locates short sequence patterns. If you are trying to find a pattern in a sequence or if you know of a sequence that you think occurs somewhere within a larger one, you can find your place with RFindPatterns. RFindPatterns can look through large data sets for any short sequence patterns you specify. RFindPatterns can recognize patterns with some symbols mismatched but not with gaps. It supports the IUB-IUPAC nucleotide ambiguity codes (see Appendix III) for searching through nucleotide sequences.
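As a rough illustration of what searching with ambiguity codes means, the sketch below expands each IUB-IUPAC code into a regular-expression character class and reports 1-based coordinates. It is not the program's own algorithm: the names iupac_to_regex and find_sites are hypothetical, only simple patterns without parentheses or braces are handled, and treating X as "any base" is an assumption based on the RXY example under -PERFect.

```python
import re

# Standard IUB-IUPAC nucleotide ambiguity codes. Treating X like N (any base)
# is an assumption drawn from the RXY example in this entry.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Expand each ambiguity code into a regex character class."""
    return "".join("[" + IUPAC[symbol] + "]" for symbol in pattern.upper())


def find_sites(sequence: str, pattern: str) -> list[int]:
    """Return the 1-based start of every find, including overlapping ones."""
    # The lookahead makes each match zero-width, so overlapping finds are kept.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [match.start() + 1 for match in regex.finditer(sequence.upper())]


print(find_sites("AAGAATTCAA", "GAATTC"))
print(find_sites("CACGTATG", "YRYRYRYR"))
```

This sketch only covers an unambiguous sequence searched with an ambiguous pattern; it says nothing about how the program treats ambiguity codes in the sequence itself.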
RFindPatterns searches both strands of a nucleotide sequence if the patterns you specify are not identical on both strands. If your sequence is a peptide, RFindPatterns searches for a simple symbol match between your pattern and the peptide sequence.
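The both-strand rule can be pictured with a small sketch: a pattern needs a second search only when its reverse complement differs from the pattern itself. The function names are hypothetical, the sketch handles simple patterns only (no parentheses, braces, or ~), and leaving X unchanged under complementation is an assumption.

```python
# Complement table for the IUB-IUPAC nucleotide codes (X is left unchanged,
# which is an assumption made for this sketch).
COMPLEMENT = str.maketrans("ACGTRYMKSWBDHVNX", "TGCAYRKMSWVHDBNX")


def reverse_complement(pattern: str) -> str:
    """Complement every symbol, then reverse the result."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def strands_to_search(pattern: str) -> list[str]:
    """Return the pattern forms that need searching on the top strand.

    A self-complementary pattern such as GAATTC reads the same on both
    strands, so a single search covers both.
    """
    forward = pattern.upper()
    reverse = reverse_complement(forward)
    return [forward] if reverse == forward else [forward, reverse]


print(strands_to_search("GAATTC"))
print(strands_to_search("CCCC"))
```

Searching the top strand for the reverse complement of a pattern is equivalent to searching the bottom strand for the pattern itself, which is why only one sequence needs to be scanned.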
RFindPatterns names each file on the screen as it is searched. The output file shows only sequences where a pattern was found unless you use the command-line option -SHOw. By default, five symbols from the original sequence are shown on either side of each "find" (see -LWIDth and -RWIDth). The word /Rev appears if the reverse of the pattern is found. If you run RFindPatterns with the command-line option -NAMes, the output file is written in list file (formerly called a file of sequence names) format, which you can use as input to other Wisconsin Sequence Analysis Package(TM) programs that support indirect file specifications.
When RFindPatterns finishes searching for your patterns, it returns to the first prompt in
the program, RFINDPATTERNS of what sequence(s) ? If you simply press
RFindPatterns keeps writing its results in the same output file (or on the screen).
RFindPatterns prints a short summary on your screen and in the output file when the
entire session is over.
AUTHOR
This GCG program was modified by David Mathog (E-mail: MATHOG@seqaxp.bio.caltech.edu Post:
Sequence Analysis Facility, Biology Division, Caltech), and modified for EGCG by Peter Rice (E-mail:
pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge,
CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by
E-mail (egcg@embnet.org).
EXAMPLE
Here is a session using RFindPatterns to determine if there are any EcoRI or BamHI sites
in the human immunoglobulin sequences of the EMBL database (the program Fetch was
used first to make a copy of the file pattern.dat):
OUTPUT
Here is some of the output file:
If the pattern is a complex expression, it will be written above each find along with a simplification
of the pattern so that you can see what was actually found. In the above example, the Promoter
pattern CAT(N){20,30}TATTA is the pattern being searched, and CATN{29}TATTA is the pattern
actually found. Five symbols from the original sequence are shown on either side of the find. In the
example above, 104 is the coordinate of the first C in CATCGGG ... not of the G of the flanking
symbols GTCCC.
RELATED PROGRAMS
The GCG mapping programs Map, MapPlot and MapSort can be used to
mark finds in the context of a DNA restriction map. Motifs looks for sequence motifs by
searching through proteins for the patterns defined in the PROSITE Dictionary of Protein
Sites and Patterns. FindPatterns is the original GCG version of this program, with
fixed reporting of 5 flanking residues. These programs all use the same search algorithm and input
data file format as RFindPatterns.
RESTRICTIONS
Patterns typed in from the terminal may not be longer than 132 characters. Patterns from a data
file may not be longer than 350 characters.
RFindPatterns can search for a maximum of 5,000 patterns in a nucleotide sequence. If
your pattern.dat file contains more than 5,000 patterns, only the first 5,000 are used.
The restrictions specified with the -MINCuts and -MAXCuts command-line
options must be fulfilled on a single strand of a nucleotide sequence in order for the find to be
reported. For instance, if you use the command
% rfindpatterns -MINCuts=2 -PATterns=CCCC with the
sequence CCCCGGGG, no finds will be reported, even though there is one instance of the pattern on
each strand.
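The CCCCGGGG case can be checked by hand with a short sketch that counts finds separately on each strand. This is an illustration under assumptions, not the program's code: the function names are hypothetical, only literal A/C/G/T patterns are handled, and "fulfilled on a single strand" is read as "at least one strand reaches the -MINCuts count by itself".

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    return sequence.upper().translate(COMPLEMENT)[::-1]


def count_finds(sequence: str, pattern: str) -> int:
    """Count exact, possibly overlapping, occurrences of a literal pattern."""
    width = len(pattern)
    return sum(
        sequence[start:start + width] == pattern
        for start in range(len(sequence) - width + 1)
    )


def finds_per_strand(sequence: str, pattern: str) -> tuple[int, int]:
    """Return (top-strand finds, bottom-strand finds)."""
    top = sequence.upper()
    bottom = reverse_complement(top)
    return count_finds(top, pattern.upper()), count_finds(bottom, pattern.upper())


def passes_mincuts(sequence: str, pattern: str, mincuts: int) -> bool:
    """True only if one strand alone holds at least `mincuts` finds."""
    return max(finds_per_strand(sequence, pattern)) >= mincuts


print(finds_per_strand("CCCCGGGG", "CCCC"))
print(passes_mincuts("CCCCGGGG", "CCCC", 2))
```

Each strand holds one find, so the total of two across both strands never satisfies a per-strand minimum of two.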
LIST REFINEMENT
The database programs Names, StringSearch, FindPatterns, FastA, TFastA, and WordSearch can be
used for list refinement if you are looking for sequences with something in common. For instance,
you could identify human globin sequences with StringSearch. The output list could then be refined
with FindPatterns to show only those globin sequences containing EcoRI sites. You could then use
WordSearch to compare this output list to a sequence of your own that you think is similar to these
human, globin, EcoRI-containing sequences.
Adding Lists Together
You can add two lists together by simply appending one of the files to the other. It is better if
you use a text editor to modify the heading of the combined list so that the annotation in the
list correctly reflects what you have done. Remember to delete the text heading from the
second file so that it does not occur in the middle of the list.
Suppressing Items
Suppress any item in a list by typing an exclamation point (!) in front of the item. You can
also put comments into a list anywhere on a line by placing an exclamation point before the
comment.
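The exclamation-point convention can be sketched as a small filter over the body of a list file. The function name active_items and the sequence names in the usage lines are hypothetical, and the sketch assumes the text heading has already been removed.

```python
def active_items(lines):
    """Return the list items that are still active.

    Everything from the first exclamation point to the end of a line is
    treated as a comment, so a line that starts with "!" contributes no item.
    Assumes the list's text heading has already been stripped.
    """
    items = []
    for line in lines:
        text = line.split("!", 1)[0].strip()
        if text:
            items.append(text)
    return items


body = [
    "EM_HUM:EXAMPLE1   ! hypothetical entry with a trailing comment",
    "!EM_HUM:EXAMPLE2",
    "",
    "EM_HUM:EXAMPLE3",
]
print(active_items(body))
```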
DEFINING PATTERNS
FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions
that match many different sequences. The expressions can include any legal GCG sequence
character (see Appendix III). The expressions can also include several non-sequence characters,
which are used to specify OR matching, NOT matching, begin and end constraints, and repeat
counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of
any base, followed by ATG. Following is an explanation of the syntax for pattern specification.
Implied Sets and Repeat Counts
Parentheses () enclose one or more symbols that can be repeated some number of times.
Braces {} enclose numbers that tell how many times the symbols within the preceding
parentheses must be found.
Sometimes, you can leave out part of an expression. If braces appear without preceding
parentheses, the numbers in the braces define the number of repeats for the immediately
preceding symbol. One or both of the numbers within the braces may be missing. For
instance, the pattern GATG{2,}A means GAT, followed by G repeated from 2 to 350,000 times,
followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000
times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0
to 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it
cannot be repeated more than 2,000 times.)
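For readers more familiar with regular expressions, the brace rules above map closely onto regex repeat counts. The sketch below rewrites the braces into Python regex form; the function name is hypothetical, the program's upper limits (350,000 and 2,000) are not modelled, and ambiguity codes such as N are left as literal letters.

```python
import re


def repeats_to_regex(pattern: str) -> str:
    """Rewrite GCG-style repeat counts as explicit regex repeat counts.

    {}     -> {0,}    (both numbers missing: zero or more)
    {,2}   -> {0,2}   (missing lower bound defaults to 0)
    {2,}   -> {2,}    (missing upper bound is left open)
    {29}   -> {29}    (a single number is an exact count)
    """
    def fix(match):
        low, comma, high = match.groups()
        if not comma:
            return "{%s}" % low if low else "{0,}"
        return "{%s,%s}" % (low or "0", high)

    return re.sub(r"\{(\d*)(,?)(\d*)\}", fix, pattern)


for gcg_pattern in ("GATG{2,}A", "GATG{}A", "GAT(TG){,2}A"):
    print(gcg_pattern, "->", repeats_to_regex(gcg_pattern))
```

Parentheses around a plain run of symbols, as in (TG), already mean the same thing in a regular expression, so they are passed through unchanged.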
OR Matching
If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you
define any combination of G, A, T, or C. If you are searching proteins, you can specify any of
several symbol choices by enclosing the different choices in parentheses and separating the
choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed
by S. The length of choices need not be the same, and there can be up to 31 different choices
within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any
combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches
this pattern. There can be several parentheses in a pattern, but parentheses cannot be
nested.
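An OR expression behaves like a regular-expression alternation. The following sketch converts the comma-separated choices into that form; the function name is hypothetical, nesting is not handled (matching the restriction above), and the limit of 31 choices is not enforced.

```python
import re


def or_to_regex(pattern: str) -> str:
    """Turn each (choice1,choice2,...) set into a regex alternation."""
    def alternation(match):
        choices = match.group(1).split(",")
        return "(?:" + "|".join(re.escape(choice) for choice in choices) + ")"

    # [^()]* keeps the match inside one pair of parentheses (no nesting).
    return re.sub(r"\(([^()]*)\)", alternation, pattern)


print(or_to_regex("RGF(Q,A)S"))
print(or_to_regex("GAT(TG,T,G){1,4}A"))
print(bool(re.fullmatch(or_to_regex("GAT(TG,T,G){1,4}A"), "GATTGGA")))
```

GATTGGA matches because the repeated set can be read as TG followed by G, or as T, G, G; either reading stays within the 1 to 4 repeats allowed.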
NOT Matching
The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The
pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.
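NOT matching corresponds to a negated character class in a regular expression. The sketch below makes that translation; the function name is hypothetical and it assumes every excluded choice is a single symbol, as in the two examples above.

```python
import re


def not_to_regex(pattern: str) -> str:
    """Rewrite ~X and ~(X,Y,...) as negated regex character classes.

    Assumes each excluded choice is a single symbol.
    """
    # ~(A,T) -> [^AT]
    pattern = re.sub(
        r"~\(([^()]*)\)",
        lambda match: "[^" + match.group(1).replace(",", "") + "]",
        pattern,
    )
    # ~C -> [^C]
    return re.sub(r"~([A-Za-z])", r"[^\1]", pattern)


print(not_to_regex("GC~CAT"))
print(not_to_regex("GC~(A,T)CC"))
```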
Begin and End Constraints
The pattern
CONSIDERATIONS
RFindPatterns will not introduce gaps but it can tolerate mismatches when it is run with
the command-line option -MISmatch. Mismatched finds are shown in the output in
lowercase.
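Matching with mismatches but without gaps amounts to sliding the pattern along the sequence and counting the positions that differ. The sketch below shows the idea for a literal pattern; the function name is hypothetical, ambiguity codes are not expanded, and the output format (coordinate plus mismatch count) is chosen for illustration rather than copied from the program.

```python
def find_with_mismatches(sequence: str, pattern: str, max_mismatches: int = 0):
    """Return (1-based start, mismatch count) for every ungapped find.

    The window always has the same length as the pattern, so no gaps can
    be introduced; only substitutions are tolerated.
    """
    sequence, pattern = sequence.upper(), pattern.upper()
    width = len(pattern)
    finds = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            finds.append((start + 1, mismatches))
    return finds


print(find_with_mismatches("AAGAATTCAA", "GAATTC"))
print(find_with_mismatches("AAGACTTCAA", "GAATTC", max_mismatches=1))
```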
If you are entering patterns from the command line with the -PATterns qualifier, any
pattern containing a comma must be enclosed in double quotes; otherwise, the comma is assumed to
separate different patterns on the command line.
SPECIFYING SEQUENCES
There is information on specifying sets of sequences in Chapter 2, Using Sequences of the
User's Guide.
LARGE DATA SETS
RFindPatterns is one of the few programs in GCG or EGCG that can take more than a few
minutes to run. Large searches should probably be run in the batch queue. You can run this
program in the batch queue on many computers by using the command-line option -BATch.
Run this way, the program prompts you for all the required parameters and then automatically
submits itself to the batch or at queue. Batch jobs free your terminal for other work and may allow
the system manager to distribute the load on your computer more evenly. For more information, see
"Using the Batch Queue" in Chapter 3, Basic Concepts: Using Programs in the User's
Guide. Very large comparisons may exceed the CPU limit set by some systems.
Patterns that start with complicated OR or NOT expressions take longer to search than simple
expressions like GATTC.
INPUT FILE
You can put any patterns you want to search for into a file like the one below. The pattern data files
for RFindPatterns are modeled on the enzyme data files for the mapping programs
described in the Data Files manual. The names should not have more than eight letters.
The offset field is ignored by RFindPatterns, but the field should have a number in it to
make these files compatible with the files that are read by mapping programs.
The exact column used for each field does not matter, only the order of the fields in the line. You can
give several patterns the same name, but put all of the entries for that name on adjacent lines of the
file. The patterns may not be more than 350 characters long. Blank lines and lines that start with
an exclamation point (!) are ignored.
If the overhang field is a period (.) instead of a number, only the top strand of a nucleic acid sequence
is searched for the pattern. Any number implies that both strands are to be searched. The value of
the overhang number has no significance to RFindPatterns. Here is the pattern data file
used in the example above:
SEQUENCE TYPE
The function of RFindPatterns depends on whether your input sequence(s) are protein or nucleotide.
Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last
line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to
Appendix VI for information on how to change or set the type of a sequence.
COMMAND-LINE SUMMARY
All parameters for this program may be put on the command line. Use the option -CHEck
to see the summary below and to have a chance to add things to the command line before the
program executes. In the summary below, the capitalized letters in the qualifier names are the
letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose
qualifiers or parameter values that are optional. For more information, see "Using Program
Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
LOCAL DATA FILES
The files described below supply auxiliary data to this program. The program automatically reads
them from a public data directory unless you either 1) have a data file with exactly the same name in
your current working directory; or 2) name a file on the command line with an expression like
-DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the
User's Guide.
RFindPatterns can read the patterns you want to find from the file pattern.dat in your
working directory. If you don't have a file called pattern.dat in your directory,
RFindPatterns asks you to type in the patterns you want to find. If you want to use a
pattern data file with a name other than pattern.dat, include -DATa=filename on
the command line.
OPTIONAL PARAMETERS
The parameters and switches listed below can be set from the command line. For more information,
see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG
User's Guide.
-NAMes
writes the output file as a list file (formerly called a file of sequence names) suitable for input
to other Wisconsin Package programs that support indirect file specification (see Chapter 2,
Using Sequences of the User's Guide). All of the output showing the location of
the patterns found is suppressed when the output is written as a list file.
-SINce=6.90
limits the search to sequences that have been entered into the database or modified since June
1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases
support this feature.
-CIRcular
searches past the end of the sequence into the beginning of the sequence as if the molecule
were continuous. Patterns that span the origin can only be found if the search is
-CIRcular.
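One common way to picture a circular search is to extend the sequence with a copy of its own beginning, so that a pattern spanning the origin becomes contiguous. The sketch below uses that trick for a literal pattern; it is an assumption about how such a search can be done, not a description of the program's internals, and the function names are hypothetical.

```python
def find_linear(sequence: str, pattern: str) -> list[int]:
    """1-based starts of exact finds in a linear sequence."""
    return [
        start + 1
        for start in range(len(sequence) - len(pattern) + 1)
        if sequence.startswith(pattern, start)
    ]


def find_circular(sequence: str, pattern: str) -> list[int]:
    """1-based starts of exact finds when the sequence is treated as a circle."""
    if not 0 < len(pattern) <= len(sequence):
        return []
    # Append just enough of the beginning for a find to span the origin.
    extended = sequence + sequence[:len(pattern) - 1]
    return [
        start + 1
        for start in range(len(sequence))
        if extended.startswith(pattern, start)
    ]


# GAA at the end and TTC at the beginning form GAATTC across the origin.
print(find_linear("TTCAAAGAA", "GAATTC"))
print(find_circular("TTCAAAGAA", "GAATTC"))
```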
-ONEstrand
searches only the top strand of nucleotide sequences.
-SHOw
Normally, RFindPatterns shows that a file was searched only if there were one or
more finds in the sequence. With the -SHOw command-line option,
RFindPatterns shows every file searched whether or not a pattern was actually found
in it. (-SHOw is equivalent to setting -MINCuts=0.)
-TERminal
writes output on the terminal screen and suppresses the output file query. If you use
RFindPatterns often in this mode, you should assign a logical symbol that runs
RFindPatterns with terminal output as the default. Answering the output file query
with Term has the same effect on RFindPatterns as this command-line
option.
-BATch
submits the program to the batch queue for processing after prompting you for all required
user inputs. Any information that would normally appear on the screen while the program is
running is written into a log file. Whether that log file is deleted, printed, or saved to your
current directory depends on how your system manager has set up the command that submits
this program to the batch queue. All output files are written to your current directory, unless
you direct the output to another directory when you specify the output file.
-MONitor
This program normally monitors its progress on your screen. However, when you use the
-Default option to suppress all program interaction, you also suppress the monitor.
You can turn it back on with this option. If your program is running in batch, the monitor will
appear in the log file. If the monitor is slowing the program down, suppress it with
-NOMONitor.
The descriptions of the exclusionary options below were written for the Wisconsin Package mapping
programs. A find in these applications is referred to as a cut while a pattern is referred to as a
restriction enzyme recognition site.
The options -SIXbase, -ONCe, -MINCuts, -MAXCuts, and
-EXCLude all suppress the display of undesired enzymes. The list of excluded enzymes in
the program output includes both enzymes that cut within excluded ranges and enzymes that do not
cut the right number of times.
-SIXbase
searches only for enzymes with six or more bases in the recognition site. You can display the
cuts from any enzyme in the enzyme data file that you take the trouble to name individually,
but when you use * (meaning all), the program uses all of the other enzymes whose
recognition sites have six or more non-N, non-X bases.
-ONCe
excludes, from the set you have chosen, those enzymes that cut your sequence more than once.
-MINCuts=2
excludes enzymes that do not cut at least two times.
-MAXCuts=2
excludes enzymes that cut more than two times.
-EXCLude=n1,n2[,n3,n4,...]
excludes enzymes that cut anywhere within one or more ranges of the sequence. If an enzyme
is found within an excluded range, then the enzyme is not displayed. The list of excluded
enzymes includes enzymes that cut within excluded ranges. The ranges are defined with sets
of two numbers. The numbers are separated by commas. Spaces between numbers are not
allowed. The numbers must be integers that fall within the sequence beginning and ending
points you have chosen. The range may be circular if circular mapping is being done.
Exclusion is not done if there are any non-numeric characters in the numbers or numbers out
of range or if there is not an even number of integers next to the qualifier.
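The validation rules for -EXCLude can be summarized in a short sketch. The function names are hypothetical, and returning None to mean "exclusion is not done" is a choice made for this illustration. A range whose first number is larger than its second is treated as wrapping around the origin, on the assumption that this is what a circular range means.

```python
def parse_exclude(value: str, begin: int, end: int):
    """Parse "n1,n2,n3,n4,..." into a list of (start, stop) ranges.

    Returns None (no exclusion) if any field is non-numeric, if the count
    of numbers is odd, or if a number lies outside begin..end.
    """
    fields = value.split(",")
    # A field containing a space fails isdecimal(), so spaces are rejected.
    if len(fields) % 2 != 0 or not all(field.isdecimal() for field in fields):
        return None
    numbers = [int(field) for field in fields]
    if any(number < begin or number > end for number in numbers):
        return None
    return list(zip(numbers[0::2], numbers[1::2]))


def in_excluded_range(position: int, ranges) -> bool:
    """True if a find at `position` falls inside any excluded range."""
    for start, stop in ranges:
        if start <= stop:
            if start <= position <= stop:
                return True
        elif position >= start or position <= stop:  # range wraps the origin
            return True
    return False


print(parse_exclude("10,20,50,60", 1, 100))
print(parse_exclude("10,20,50", 1, 100))
```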
-MISmatch=1
causes the program to recognize sites that are like the recognition site but with one (or more)
mismatches. If you allow too many mismatches, you may get ridiculous results. The output
from most mapping programs distinguishes between real sites and sites with one or more
mismatches.
-PERFect
sets the program to look for a perfect alphabetic match between the site and the sequence.
Ambiguity codes are normally expanded so that the site RXY would find sequences like ACT
or GAC. With this switch the ambiguity codes are not expanded so the site RXY would only
match the sequence RXY. This switch is not the same as -MISmatch=0.
-ALL
makes an overlap set map instead of the usual subset map. If your sequence is very
ambiguous (as for instance a back-translated sequence would be) and you want to see where
restriction sites could be, then you should create an overlap-set map. Overlap-set and subset
pattern recognition are discussed in more detail in the Program Manual entry for
the Window program.
-APPend
appends the input enzyme data file to your output file.
-LWIDth=5
sets the number of residues to the left of the pattern to be included in the output file.
-RWIDth=5
sets the number of residues to the right of the pattern to be included in the output file.
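The effect of the two width options on the extracted sequence can be sketched as a slice around the find. The function name is hypothetical, coordinates are taken to be 1-based and inclusive as in the "Pattern location" line of the output, and clipping at the ends of the sequence is an assumption. Finds on the reverse strand are not handled here.

```python
def extract_find(sequence: str, start: int, end: int,
                 lwidth: int = 5, rwidth: int = 5) -> str:
    """Return the find at start..end (1-based, inclusive) with its flanks.

    Flanks that would run past either end of the sequence are clipped.
    """
    left = max(0, start - 1 - lwidth)
    right = min(len(sequence), end + rwidth)
    return sequence[left:right]


# A find at 6..11 with two symbols on the left and three on the right.
print(extract_find("AAAAAGAATTCTTTTT", 6, 11, lwidth=2, rwidth=3))

# A 38-symbol find (114 to 151) with the default widths gives 48 symbols.
print(len(extract_find("A" * 200, 114, 151)))
```

The second call reproduces the arithmetic of the sample output: a find from 114 to 151 is 38 symbols long, and five flanking symbols on each side bring the extracted length to 48.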
-REPlace=NNNNN
replaces the search pattern with NNNNN in each output file. Alternative replacement patterns can be
used: *N fills the match, whatever its size, with N, and "+" keeps the original sequence
at that position.
-DIRectory=.
sets a directory for each output file name.
-PREfix=r
sets the first (non numeric) part of the output file name.
-SUFfix=rfind
sets the extension part of the output file name.
Printed: April 22, 1996 15:55 (1162)
% rfindpatterns
RFINDPATTERNS uses any sequences
RFINDPATTERNS of what sequence(s) ? GenEmbl:Hsig*
Search patterns read from "pattern.dat"
What should I call the output file (* rfindpatterns.rfind *) ?
HUMIG22L len: 180
HUMIGACHSR len: 3,326
HUMIGAHA2 len: 789
//////////////////////////
HUMIGXJAA len: 69
HUMIGXJAB len: 69
HUMIGXPSA len: 237
RFINDPATTERNS of what sequence(s) ?
Total finds: 523
Total length: 647,962
Total sequences: 1,323
CPU time: 03:18.97
Output file: rfindpatterns.rfind
%
RFINDPATTERNS on: EM_NEW:HSIGH344P
Original file info: M99673 Human immunoglobulin heavy chain
variable region V3-4 4P (IGHV@) gene, exons 1-2. 8/95
Matching pattern: TAATA(N){20,30}ATG
Pattern location: 114 to 151
Lwidth: 5
Rwidth: 5
Match and extraction from the REVERSE strand.
r1 Length: 48 September 25, 1995 16:46 Type: N Check: 5227 ..
1 TTCTCTAATA TCCACTCACA AACAATATCT GTAGTTCTTC ATGAATCA
An example of a pattern data file for the program FINDPATTERNS.
Name Offset Pattern Overhang Documentation ..
BamHI 1 GGATCC 0 !
EcoRI 1 GAATTC 0 !
Promotor 1 TAATA(N){20,30}ATG 0 !
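A pattern data file like the one above can be read with a few lines of code. The sketch below is only an illustration of the rules given under INPUT FILE: the names PatternEntry and parse_pattern_file are hypothetical, the TopOnly entry is invented to show the period in the overhang field, and treating everything up to the line ending in ".." as heading text is an assumption based on the sample file.

```python
from typing import NamedTuple


class PatternEntry(NamedTuple):
    name: str
    offset: str          # ignored by the search, kept for compatibility
    pattern: str
    both_strands: bool   # False when the overhang field is a period
    documentation: str


def parse_pattern_file(text: str) -> list[PatternEntry]:
    entries = []
    in_heading = True
    for line in text.splitlines():
        stripped = line.strip()
        if in_heading:
            # Assumption: the heading ends with the line that ends in "..".
            if stripped.endswith(".."):
                in_heading = False
            continue
        if not stripped or stripped.startswith("!"):
            continue  # blank lines and comment lines are ignored
        fields = stripped.split(None, 4)  # field order matters, not columns
        if len(fields) < 4:
            continue
        name, offset, pattern, overhang = fields[:4]
        documentation = fields[4].lstrip("! ").strip() if len(fields) > 4 else ""
        entries.append(
            PatternEntry(name, offset, pattern, overhang != ".", documentation)
        )
    return entries


SAMPLE = """An example of a pattern data file.
Name Offset Pattern Overhang Documentation ..

! a comment line
BamHI 1 GGATCC 0 !
EcoRI 1 GAATTC 0 !
Promotor 1 TAATA(N){20,30}ATG 0 !
TopOnly 1 CCCC . ! hypothetical top-strand-only entry
"""

for entry in parse_pattern_file(SAMPLE):
    print(entry.name, entry.pattern, entry.both_strands)
```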
Minimal Syntax: rfindpatterns [^
Prompted Parameters:
Local Data Files:
Optional Parameters:
-d1.seq, temp/d2.seq and so on