Mapselect

MAPSELECT


FUNCTION

MapSelect selects restriction enzymes by name or by their ability to cut a given sequence, and writes them to a new enzyme file for use in other programs.


DESCRIPTION

MapSelect is a modified version of Map that saves a new enzyme file with the list of enzymes used. If no input sequence is given, the file simply contains all the enzymes selected. If a sequence is given, MapSelect allows a choice of all enzymes that cut, all that do not cut, or all that are excluded by a specified minimum or maximum number of cut sites.

MapSelect displays a sequence that is being assembled or analysed intensively. MapSelect asks you to enter the names of those enzymes whose restriction sites should be marked. If you do not answer this question, MapSelect generates a restriction map with a representative isoschizomer from all of the commercially available enzymes. You can choose to have your sequence translated in any of the six possible translation frames. You can also choose to have only the open reading frames translated.

After running MapSelect you may create a new sequence file with the peptide sequence from any frame of DNA translation by using the Extract program with the MapSelect output file.


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using MapSelect to pick all enzymes that cut a region of gamma.seq in one or two places.

  
  
  % mapselect -mincut=1 -maxcut=2
  
   MAPSELECT uses (optional) sequence data
  
   MAPSELECT using what sequence ?  gamma.seq
  
               Start (* 1 *) ?  2101
             End (* 11375 *) ?  2600
  
  Selection option:
(1) All enzymes selected (sequence ignored)
(2) All enzymes that cut (mincut and maxcut also checked)
(3) All enzymes that do not cut the sequence
(4) All enzymes excluded by mincut and maxcut
  
  Option  (* 1 *) ? 2
  
  
  
  
   Select the enzymes:  Type nothing or "*" to get all enzymes. Type "?"
   for help on which enzymes are available and how to select them.
  

  
  
                                       Enzyme(* * *):
  
   What protein translations do you want:
  
   a) frame 1   b) frame 2   c) frame 3
   d) frame 4   e) frame 5   f) frame 6
  
   t)hree forward frames   s)ix frames   o)pen frames only
  
   n)o protein translation   q)uit
  
   Please select (capitalize for 3-letter) (* t *):  n
  
   What should I call the output file (* gamma.map *) ?
  
   What should I call the new enzyme file (* gamma.select *) ?
  %
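
The four selection options in the session above can be sketched in Python as follows. This is only an illustration, not MapSelect's own code: the function name select_enzymes, the dictionary of cut counts, and the reading of option 4 (enzymes that cut, but fewer than the minimum or more than the maximum number of times) are assumptions.

```python
def select_enzymes(cut_counts, option, mincut=1, maxcut=None):
    """Return the enzyme names chosen under one of the four selection options.

    cut_counts maps an enzyme name to its number of cuts in the chosen range.
    """
    def in_range(n):
        return n >= mincut and (maxcut is None or n <= maxcut)

    if option == 1:   # all enzymes selected, sequence ignored
        return sorted(cut_counts)
    if option == 2:   # enzymes that cut, with mincut and maxcut also checked
        return sorted(e for e, n in cut_counts.items() if n > 0 and in_range(n))
    if option == 3:   # enzymes that do not cut the sequence
        return sorted(e for e, n in cut_counts.items() if n == 0)
    if option == 4:   # assumed: enzymes that cut but fail mincut or maxcut
        return sorted(e for e, n in cut_counts.items() if n > 0 and not in_range(n))
    raise ValueError("option must be 1, 2, 3 or 4")
```

With hypothetical counts such as {"AccI": 1, "AvaII": 2, "AluI": 5, "NotI": 0}, option 2 with mincut=1 and maxcut=2 would keep AccI and AvaII, which corresponds to the "one or two places" request in the example session.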
  


OUTPUT

Here is part of the new enzyme file, gamma.select:

  
  
  MAPSELECT enzyme pattern file     March 9, 1993  15:15
    using target sequence gamma.seq
  All enzymes that cut
  
   ..
  AccI        2       GT'mk_AC              2       ! Cuts in sequence: 1
  AlwI        9       GGATCnnnn'n_          1       ! Cuts in sequence: 1
  AvaII       1       G'GwC_C               3       ! Cuts in sequence: 2
  BamHI       1       G'GATC_C              4       ! Cuts in sequence: 1
  BanI        1       G'GyrC_C              4       ! Cuts in sequence: 2
  
    ////////////////////////////////////////////////////////////////
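
If you want to read an enzyme file like the one above in your own scripts, a small Python sketch might look like this. It assumes that each enzyme line holds four whitespace-separated fields (name, cut offset, recognition pattern, overhang) followed by an optional comment after "!"; the function name and the returned dictionary keys are hypothetical.

```python
def parse_enzyme_line(line):
    """Split one enzyme line into its fields; return None for any other line."""
    data, _, comment = line.partition("!")
    fields = data.split()
    if len(fields) != 4:
        return None
    name, offset, pattern, overhang = fields
    # Header and title lines are rejected because their numeric fields are missing.
    if not (offset.lstrip("-").isdigit() and overhang.lstrip("-").isdigit()):
        return None
    return {
        "name": name,
        "offset": int(offset),
        "pattern": pattern,
        # Recognition site with the cut markers ' and _ removed.
        "site": pattern.replace("'", "").replace("_", ""),
        "overhang": int(overhang),
        "comment": comment.strip(),
    }
```

Lines such as the title, the ".." divider, and the "All enzymes that cut" line return None under this sketch, so they can simply be skipped.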
  
  


RELATED PROGRAMS

MapSort, PlasmidMap, and MapPlot display restriction maps in other formats. Extract extracts the peptide sequence from any translation frame in the Map output file and puts it into a new sequence file. FindPatterns searches for short patterns like enzyme recognition sites in one or more sequences. Map displays both strands of a DNA sequence with restriction sites shown above the sequence and possible protein translations shown below. PeptideMap creates a peptide map of an amino acid sequence.


RESTRICTIONS

MapSelect does not treat your sequence as circular unless you use the command line option -CIRcular. MapSelect reads the Type: field on the divider line in the sequence file to determine whether your sequence is a nucleic acid or protein. You can insist that your sequence is a protein by placing -PROtein on the command line. The enzymes you name must be in the enzyme data file or you get an error message. You can have your system manager change the public enzyme data file to contain the enzymes most useful to your group, or you can maintain a private copy for your own use. (See the LOCAL DATA FILES topic below for more information.)

SUBSET, OVERLAP, AND PERFECT SEARCHES

This program normally requires that a sequence pattern be a subset of the enzyme recognition site. If the recognition pattern in the enzyme data file were GCRGC, then the pattern GCAGC in your sequence would be found, since A is within the set of bases defined by R (see Appendix III). If the pattern in the enzyme data file were GCAGC, then a GCRGC in your sequence would not be recognized. If your sequence is very ambiguous, as it might be if it were a backtranslated sequence, then it may be better to use the -ALL switch to do an overlap search. The overlap search would consider an R in your sequence to match an A in the recognition site.

The command-line option -PERFect causes this program to look for a perfect symbol match between your sequence and the recognition pattern -- GCRGC in the recognition pattern would only match a GCRGC in the sequence.

All searches are case insensitive (upper- or lowercase) for the letters in either the sequence or the enzyme recognition site.
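
The difference between the three kinds of search can be illustrated with a short Python sketch. It is not the program's implementation: the IUPAC table below uses the standard nucleotide ambiguity codes (with X treated like N, an assumption), and the names symbol_matches and find_sites are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",   # X treated like N (assumption)
}

def symbol_matches(seq_sym, site_sym, mode="subset"):
    """Compare one sequence symbol with one recognition-site symbol."""
    seq_sym, site_sym = seq_sym.upper(), site_sym.upper()
    if mode == "perfect":                 # like -PERFect: symbols must be identical
        return seq_sym == site_sym
    seq_set, site_set = set(IUPAC[seq_sym]), set(IUPAC[site_sym])
    if mode == "subset":                  # default: sequence set within site set
        return seq_set <= site_set
    if mode == "overlap":                 # like -ALL: the sets share a base
        return bool(seq_set & site_set)
    raise ValueError("unknown mode: " + mode)

def find_sites(sequence, site, mode="subset"):
    """Return 1-based positions where site matches sequence under the given mode."""
    width = len(site)
    return [
        i + 1
        for i in range(len(sequence) - width + 1)
        if all(symbol_matches(s, p, mode) for s, p in zip(sequence[i:i + width], site))
    ]
```

Under this sketch, GCAGC in a sequence is found by the site GCRGC in subset mode, while GCRGC in a sequence is found by the site GCAGC only in overlap mode.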


DISPLAY CONVENTIONS

Collisions

MapSelect identifies patterns by the positions where they occur in sequences. When a pattern cannot be shown at a particular position, it is shown at the next available position in the sequence. A '/' below the enzyme's name indicates that the name of the enzyme has been displaced to the right from the position where it should have been. When the number of finds is very great, the resolution of this kind of display is inadequate. If the display seems too full, you should restrict the number of enzymes chosen.

Potential Sites

When you search for potential restriction sites with either the -MISmatch or -SILent options, MapSelect differentiates the real sites from the potential sites by capitalizing the enzyme's name at the real sites.


SELECTING ENZYMES

The program presents you with an enzyme selection prompt that lets you enter enzymes individually or collectively. To get help with selecting enzymes, type a ? at the enzyme prompt. Here is what you see:

  
  
  Select enzymes:
  
  Type "*" to select all enzymes.
  Type "**" to select all enzymes including isoschizomers.
  Type individual names like "AluI" to select specific enzymes.
  Type "?" to see this message and all available enzymes.
  Type "??" to see the available enzymes AND their recognition sites.
  Type "?A*" to see what enzymes start with "A."
  Type "A*" to select all enzymes starting with "A."
  Type parts of names like "Al*" to select all enzymes starting with "AL."
  Type "~A*" to unselect all selected enzymes starting with "A."
  Type "/*" to see what enzymes you have selected so far.
  Type "#" to select no enzymes at all.
  
  Press <Return> after each selection.
  Press <Return> and nothing else to end your selections.
  Spaces are allowed and letter case is ignored.
  

We maintain our enzyme files with a semicolon (;) character in front of all but one member of a family of isoschizomers. (Isoschizomers are restriction endonucleases with the same recognition site.) The isoschizomers beginning with a semicolon are normally not displayed by our mapping programs unless you specifically select them by name or type "**" instead of "*" at the enzyme prompt.

There is more information on enzyme files in the Data Files manual.

A command-line expression like -ENZymes=AluI,EcoRII would choose AluI and EcoRII and suppress interactive enzyme selection.
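
The selection responses can be modelled roughly in Python as set operations, as in the sketch below. The function name, the data layout, and in particular the rule that wildcards skip semicolon-marked isoschizomers while exact names include them are assumptions made for illustration; only the selecting and unselecting responses are covered, not the "?" and "/" listing responses.

```python
from fnmatch import fnmatchcase

def apply_selection(selected, response, enzymes):
    """Return the updated selection after one response at the enzyme prompt.

    enzymes maps an enzyme name to True if it is a hidden isoschizomer
    (an entry written with a leading semicolon in the enzyme file).
    """
    text = response.replace(" ", "").lower()   # spaces allowed, case ignored
    if text == "#":                            # select no enzymes at all
        return set()
    if text == "**":                           # everything, isoschizomers included
        return selected | set(enzymes)
    unselect = text.startswith("~")
    pattern = text.lstrip("~")
    matched = set()
    for name, hidden in enzymes.items():
        exact = name.lower() == pattern
        # Assumption: wildcards skip hidden isoschizomers, exact names do not.
        if exact or (not hidden and fnmatchcase(name.lower(), pattern)):
            matched.add(name)
    return selected - matched if unselect else selected | matched
```

Calling the function once per response, starting from an empty set, builds up the selection in the same order in which you would type the responses.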


CHOOSING THE TRANSLATION FRAMES

The translation menu allows several responses. You can name the frames of interest individually with a response like abcf. You can use t or s to mean the three forward or all six possible translation frames. You can make all of the characters in your response upper case to get three-letter instead of one-letter amino acid symbols in the translation. You can add o to your response to get translation only between potential start codons and stop codons (o by itself gives open reading frame translation of all six translation frames).

You can use an expression like -MENu=abcf to choose translation frames a, b, c, and f from the command line.


OPEN READING FRAMES

You can select translation for open reading frames only. All of the frames are treated as open at the 5' end of each strand; these pseudo-open reading frames run to the first stop codon in that frame (see the Translation Tables section of the Data Reference Set). Thereafter, reading is turned on at each potential start codon and runs to the next stop codon. You can suppress the display of short open reading frames with an expression like -OPEn=20 on the command line.

Open reading frames are determined from the beginning and ending of the sequence in the file -- not from just the range you have chosen. The potential start codons and stop codons are defined in the local data file translate.txt.
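
The open reading frame rule can be sketched in Python for a single forward frame. This is an illustration under stated assumptions: the start and stop codons are hard-coded to the standard genetic code instead of being read from translate.txt, an open stretch that reaches the end of the sequence without a stop codon is kept, and the function name open_frames is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}   # assumed; the program reads these from translate.txt
STARTS = {"ATG"}                # assumed

def open_frames(seq, frame=0, min_codons=0):
    """Return (first_codon_index, codon_count) pairs for open stretches in one frame."""
    codons = [seq[i:i + 3].upper() for i in range(frame, len(seq) - 2, 3)]
    spans = []
    start = 0                   # the 5' end is treated as open
    for i, codon in enumerate(codons):
        if codon in STOPS:
            # Close the current open stretch, if any, and apply the size limit.
            if start is not None and i - start >= max(min_codons, 1):
                spans.append((start, i - start))
            start = None
        elif start is None and codon in STARTS:
            start = i           # reading is turned on at a potential start codon
    if start is not None and len(codons) - start >= max(min_codons, 1):
        spans.append((start, len(codons) - start))
    return spans
```

The min_codons argument plays the role of -OPEn: stretches shorter than that many codons are dropped from the result.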


POTENTIAL RESTRICTION SITES

To assist scientists doing site-directed mutagenesis, this program searches for places in your sequence where a restriction enzyme recognition site occurs with one or more mismatches. Use the command-line option -MISmatch=1 to identify positions where recognition could occur with one or fewer mismatches.
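
A mismatch search of this kind can be sketched in Python as a sliding-window comparison. The sketch assumes an unambiguous recognition site and compares symbols literally; the function name find_with_mismatches is hypothetical and the code is not taken from the program.

```python
def find_with_mismatches(sequence, site, max_mismatch=1):
    """Return (position, mismatches) for each window within max_mismatch of site."""
    sequence, site = sequence.upper(), site.upper()
    hits = []
    for i in range(len(sequence) - len(site) + 1):
        window = sequence[i:i + len(site)]
        mismatches = sum(1 for a, b in zip(window, site) if a != b)
        if mismatches <= max_mismatch:
            hits.append((i + 1, mismatches))   # positions are 1-based
    return hits
```

Hits with zero mismatches correspond to real sites, and hits with one mismatch correspond to potential sites that a single base change would create.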

Use the command-line option -SILent to find the places in your sequence where a restriction site could be introduced without changing the translation. Read more about this at -SILent under the OPTIONAL PARAMETERS topic below.


SEARCH FOR ANY SEQUENCE PATTERN

By changing the enzyme data file (see the LOCAL DATA FILES topic below), you can make this program search for any pattern. See the Data Files manual for notes on enzyme data files.


DEFINING PATTERNS

FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.

Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, the pattern GATG{2,}A means GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)

OR Matching

If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of choices need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.

Begin and End Constraints

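The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.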
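
If you are familiar with regular expressions, the repeat, OR, and NOT examples above can be compared with hand-written Python equivalents. The table below is an illustration only: the regular expressions are translations made for this sketch, not output of any GCG program, N is assumed to mean any of A, C, G, or T, and ambiguity symbols in the sequence itself are not handled.

```python
import re

# GCG pattern -> hand-translated Python regular expression (assumed equivalents)
EXAMPLES = {
    "TAATA(N){20,30}ATG": r"TAATA[ACGT]{20,30}ATG",
    "GATG{2,}A":          r"GATG{2,}A",
    "GATG{}A":            r"GATG*A",
    "GAT(TG){,2}A":       r"GAT(?:TG){0,2}A",
    "RGF(Q,A)S":          r"RGF(?:Q|A)S",
    "GAT(TG,T,G){1,4}A":  r"GAT(?:TG|T|G){1,4}A",
    "GC~CAT":             r"GC[^C]AT",
    "GC~(A,T)CC":         r"GC[^AT]CC",
}

def matches(gcg_pattern, sequence):
    """Return True if the whole sequence matches the translated pattern."""
    regex = EXAMPLES[gcg_pattern]
    return re.fullmatch(regex, sequence, re.IGNORECASE) is not None
```

For example, matches("GAT(TG,T,G){1,4}A", "GATTGGA") is true, in agreement with the statement above that the sequence GATTGGA matches that pattern.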


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % mapselect [-INfile=]gamma.seq -Default
  
  Prompted parameters:
  
  -BEGin=2101 -END=2600       range of interest
  -ENZymes=*[,...]            enzymes to display
  -MENu=s                     translation frames s=six, t=three, o=open
  [-OUTfile=]gamma.map        output file name
  [-ENZFile=]gamma.select     output enzyme pattern file name
  -SELect=1                   enzymes to select
  
  Local Data Files:
  
  -DATa=enzyme.dat          contains enzyme names and sites
  -TRANSlate=translate.txt  contains the translation scheme
  
  Optional Parameters:
  
  -MINBase=5      shows only enzymes with at least 5 bases in cut site
  -MAXBase=6      shows only enzymes with up to 6 bases in cut site
  -PAGe[=64]      keeps the clusters from crossing page boundaries
  -WIDth=60       sets width to something other than 60 bp/line
  -OPEn=20        sets minimum open reading frame size
  -SIXbase        shows enzymes with six or more bases in recognition site
  -ONCe           shows enzymes that cut only once
  -MINCuts=2      shows only enzymes that cut at least 2 times
  -MAXCuts=2      shows only enzymes that cut no more than 2 times
  -EXCLude=n1,n2  suppresses enzymes that cut between bases n1 and n2
  -ALL            finds "overlapping-set" matches
  -PERFect        looks only for perfect matches
  -PROtein        makes a peptide sequence map (local data = proenzyme.dat)
  -CIRcular       treats the sequence as circular
  -LINear         treats the sequence as linear (default)
  -APPend         appends the input data files to the output file
  -THReeletter    displays three letter symbols in translation
  -SILent         finds translationally silent potential restriction sites
  -MISmatch=1     finds potential sites with one or fewer mismatches
  -NOSEQline      suppresses the sequence display
  -NOSCALeline    suppresses the scale line
  -NOCOMPline     suppresses the complement sequence display
  


ACKNOWLEDGEMENT

MapSelect is simply a modified version of Map with an additional output file. The output format of Map was designed by John Schroeder and Frederick Blattner (NAR 10; 69-84 (1982), Figure 1). Map was written for the first release of the GCG Package by Paul Haeberli and John Devereux.


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

This program reads the public or local version of enzyme.dat to get the enzyme names, recognition sites, cut positions, and overhangs. You can use mapping programs to search for any sequence pattern by adding the pattern to the enzyme data file. If you use the command line option -APPend, this program appends the enzyme data file to the output file. (See the Data Files manual for more information about enzyme data files.)

If you run MapSelect with the command line option -PROtein, or if MapSelect finds Type: P on the divider line in the sequence file, it reads proteolytic cleavage data in the local data file proenzyme.dat.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in the Data Files manual.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-MINBase=5

shows only cut sites with at least 5 bases in the target site, counting partially ambiguous positions (R or Y for example) but not counting positions that the enzyme ignores completely (N and X).

-MAXBase=6

shows only cut sites with up to 6 bases in the target site, counting partially ambiguous positions (R or Y for example) but not counting positions that the enzyme ignores completely (N and X).
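
The counting rule behind -MINBase and -MAXBase can be sketched in Python as shown below. The function names are hypothetical, and the sketch assumes that the cut markers ' and _ in a recognition pattern are not counted as bases.

```python
def significant_bases(pattern):
    """Count site positions other than N and X, skipping the cut markers ' and _."""
    return sum(1 for ch in pattern.upper() if ch.isalpha() and ch not in "NX")

def passes_base_limits(pattern, minbase=None, maxbase=None):
    """Return True if the pattern satisfies the given -MINBase and -MAXBase limits."""
    count = significant_bases(pattern)
    if minbase is not None and count < minbase:
        return False
    if maxbase is not None and count > maxbase:
        return False
    return True
```

Under this rule the AlwI pattern GGATCnnnn'n_ counts as 5 bases and the BanI pattern G'GyrC_C counts as 6, so both would pass -MINBase=5 -MAXBase=6.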

-OPEn=20

restricts the display of open reading frame translations to frames with at least 20 residues.

-CIRcular

tells MapSelect to treat your sequence as circular. If a possible recognition site starts at the end and continues into the beginning of the sequence, the site is marked at the point where a circular molecule would be cut. For instance if your sequence ends in GAA and starts with TTC, MapSelect shows an EcoRI cut two bases before the end of the sequence. The sequence is only circularized at the ends found in the file, so if you want a subrange to be treated as circular you have to create a file in which the subrange is the entire sequence (see the Assemble program). Alternatively, you could try to convince us that the program should be changed to allow circularization of subranges.
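
One common way to find sites that span the ends of a circular sequence is to append the first few bases to the end before searching. The Python sketch below illustrates that idea with the EcoRI example; it is an assumption about how such a search can be done, not a description of the program's internals, and the function name is hypothetical.

```python
def find_circular_sites(sequence, site):
    """Return 1-based start positions of site in a sequence treated as circular."""
    sequence, site = sequence.upper(), site.upper()
    n = len(sequence)
    # Wrap the origin: copy enough leading bases to complete a site that
    # starts near the end of the sequence.
    extended = sequence + sequence[:len(site) - 1]
    return [i + 1 for i in range(n) if extended.startswith(site, i)]
```

For a 12-base sequence that starts with TTC and ends in GAA, the EcoRI site GAATTC is found starting at base 10. EcoRI cuts after the G, which is two bases before the end of the sequence.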

-LINear

is the opposite of -CIRcular. If you have defined a command that runs MapSelect with -CIRcular as the default, use the -LINear switch to make MapSelect treat your sequence as linear.

-PAGe=64

When you print the output from this program, it may cross from one page to another in a frustrating way -- especially when you print on individual sheets. This option adds form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after the -PAGe qualifier.

-WIDth=60

allows you to choose the number of bases shown on each line of output. The standard is 60, which can be shown on a terminal screen nicely, but 100 sequence symbols per line is very convenient for estimating the size of fragments between cuts.

-PROtein

creates a peptide fragment map for a peptide sequence. The map finds proteolytic cleavage points. The local data file for peptide mapping is called proenzyme.dat. The documentation for PeptideMap has an example of the output.

-THReeletter

sets the translation to show three-letter amino acid codes instead of the one-letter codes. Normally the case of the translation menu is sufficient to make the three-letter/one-letter distinction. However, when you run MapSelect from the command line, you must add -THReeletter to get three-letter amino acid codes.

-MISmatch=1

causes the program to recognize sites that are like the recognition site but with one or fewer mismatches. If you allow too many mismatches, you may get ridiculous results. The output from most mapping programs distinguishes between sites with no mismatches and sites with mismatches.

-SILent

shows the places where restriction sites can be introduced (by site-directed mutagenesis) without changing the peptide translation of the sequence. The -SILent switch assumes that the range you have chosen defines a coding region and reading frame precisely. Sites may be found that have any number of bases changed as long as the changes do not alter the translation. The silent frame is implied by the beginning coordinate you specify. The output from most mapping programs distinguishes between real sites and sites with one or more mismatches. The data file translate.txt defines the genetic code.

-PERFect

sets the program to look for a perfect alphabetic match between the site and the sequence. Ambiguity codes are normally translated so that the site RXY would find sequences like ACT or GAC. With this switch the ambiguity codes are not translated so the site RXY would only match the sequence RXY. This switch is not the same as -MISmatch=0!

-ALL

makes an overlap-set map instead of the usual subset map. If your sequence is very ambiguous (for instance, as a back-translated sequence would be) and you want to see where restriction sites could be, then an overlap-set map is for you. Overlap-set and subset pattern recognition is discussed in more detail in the Program Manual entry for Window.

-APPend

appends the input enzyme data file to your output file.

The options -SIXbase, -ONCe, -MINCuts, -MAXCuts, and -EXCLude all suppress the display of undesired enzymes. The list of excluded enzymes in the program output includes both enzymes that cut within excluded ranges and enzymes that do not cut the right number of times.

-SIXbase

searches only for enzymes with six or more bases in the recognition site. You can display the cuts from any enzyme in the enzyme data file that you take the trouble to name individually, but when you use * (meaning all), the program uses all of the other enzymes whose recognition sites have six or more non-N, non-X bases.

-ONCe

excludes, from the set you have chosen, those enzymes that cut your sequence more than once.

-MINCuts=2

excludes enzymes that do not cut at least two times.

-MAXCuts=2

excludes enzymes that cut more than two times.

-EXCLude=n1,n2[,n3,n4,...]

excludes enzymes that cut anywhere within one or more ranges of the sequence. If an enzyme is found within an excluded range, then the enzyme is not displayed. The list of excluded enzymes includes enzymes that cut within excluded ranges. The ranges are defined with sets of two numbers. The numbers are separated by commas. Spaces between numbers are not allowed. The numbers must be integers that fall within the sequence beginning and ending points you have chosen. The range may be circular if circular mapping is being done. Exclusion is not done if there are any non-numeric characters in the numbers or numbers out of range or if there is not an even number of integers next to the qualifier.
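
The validity rules for the -EXCLude value can be sketched in Python as follows. The function name parse_exclude and the use of None to mean "exclusion is not done" are assumptions for illustration; the checks themselves follow the rules stated above.

```python
def parse_exclude(value, begin, end):
    """Return a list of (n1, n2) ranges, or None if exclusion would not be done."""
    parts = value.split(",")
    # Every item must be a plain integer; spaces or other characters are rejected.
    if not all(p.isascii() and p.isdigit() for p in parts):
        return None
    # Ranges are given as pairs, so an even number of integers is required.
    if len(parts) % 2 != 0:
        return None
    numbers = [int(p) for p in parts]
    # Each number must fall within the chosen beginning and ending points.
    if any(n < begin or n > end for n in numbers):
        return None
    return list(zip(numbers[0::2], numbers[1::2]))
```

The sketch does not require n1 to be smaller than n2, since a range may wrap around the origin when circular mapping is being done.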

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This option allows you to use a translation table in a different file. (See the Data Files manual for information about translation tables.)

The center of the MapSelect display is the sequence, a scale, and the sequence's complement. These three switches let you suppress any of these lines.

-NOSEQline

suppresses the sequence display.

-NOSCALeline

suppresses the scale line between the sequence and its complement.

-NOCOMPline

suppresses complement sequence display.


REFERENCES

Schroeder, J. and Blattner, F. (1982). Formal description of a DNA oriented computer language. Nucleic Acids Res. 10, 69-84.

Printed: April 22, 1996 15:54 (1162)