Pepcorrupt

PEPCORRUPT


FUNCTION

PepCorrupt randomly introduces small numbers of substitutions, insertions, and deletions into protein sequence(s). Note that each substitution replaces one residue with another residue, and that back mutations to the original residue are allowed.


DESCRIPTION

PepCorrupt uses a random number generator to add errors to protein sequences. You can set the number of substitutions and the number of length errors independently. Length errors can be either insertions or deletions; these two changes are collectively referred to as indels in the literature of mathematical biology. The position of each error is picked at random somewhere within the range and on the strand that you chose. The length of each indel is chosen at random from one to the maximum indel size. If the indel is positive (an insertion), then the symbols added are also chosen at random.
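The indel step described above can be sketched in Python. This is only an illustration under stated assumptions, not the program's source: the function name apply_indels, the 20-letter alphabet, and the even choice between insertion and deletion are all hypothetical, and the range and strand selection is left out.

```python
import random

# Assumed alphabet: the 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def apply_indels(seq, n_indels, max_indel=3, rng=None):
    """Return (new_seq, net_length_change) after n_indels random length errors."""
    rng = rng or random.Random()  # unseeded generator, so each run differs
    residues = list(seq)
    net_change = 0
    for _ in range(n_indels):
        size = rng.randint(1, max_indel)  # indel length, 1..max_indel inclusive
        if rng.random() < 0.5:  # assumption: insertion and deletion equally likely
            pos = rng.randint(0, len(residues))  # may insert at either end
            residues[pos:pos] = [rng.choice(AMINO_ACIDS) for _ in range(size)]
            net_change += size
        else:
            if not residues:
                continue  # nothing left to delete
            pos = rng.randrange(len(residues))
            removed = residues[pos:pos + size]  # shorter if it runs off the end
            del residues[pos:pos + size]
            net_change -= len(removed)
    return "".join(residues), net_change
```

The second return value plays the role of the "Length change from indels" figure in the output file: insertions and deletions can cancel, so several indels may leave the length unchanged.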

The output files contain a complete record of the errors introduced. The chosen and actual numbers of substitutions may differ, since a substitution that happens to pick the residue already present does not change the sequence (see CONSIDERATIONS). The output file also shows the total amount of length added (or subtracted) when all of the indels are taken together. The current time is used to seed the random number generator, so each run of PepCorrupt yields different results.

If you give PepCorrupt a single input sequence, you can choose the range, strand, and output file name. Otherwise, PepCorrupt uses the whole sequence, the top strand, and names the output file with the sequence's name followed by the file name extension .corrupt.


AUTHOR

This GCG program was modified by David Mathog (E-mail: MATHOG@seqaxp.bio.caltech.edu Post: Sequence Analysis Facility, Biology Division, Caltech), and modified for EGCG by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using PepCorrupt to corrupt the first 200 residues of Sw:laci_ecoli:

  
  
  % pepcorrupt
  
   PEPCORRUPT uses protein sequences
  
   PEPCORRUPT of what sequence(s) ?  Sw:laci_ecoli
  
                 Start (* 1 *) ?
                 End (* 360 *) ?  200
              Reverse (* No *) ?
  
   How many substitutions do you want (* 1 *) ?  3
  
   How many length errors do you want (* 1 *) ?  3
  
  %
  


OUTPUT

The file laci_ecoli.corrupt would contain the corrupted contents of the first 200 symbols in Sw:laci_ecoli. Here is the output from this session:

  
  
   PEPCORRUPT of:   check: 1939  from: 1  to: 360
  
  
   Effective Substitutions: 360
  
   % remaining identity: 0.000000
  
   InDels: 3   Substitutions: 3   MaxIndel: 3
  
   Actual substitutions: 4  Length change from indels: 0
  
  laci_ecoli.corrupt  Length: 360  February 28, 1996 13:18  Type: P  Check: 716  ..
  
    1  MKPVTLYDVA EYAGVSYQTV SRVVNQASHV SAKTRDINEK VEAAMAELNY
  
   51  IPNRVAQQLG KQSLLDGVAT SSLALHAPSQ IVAAIKSRAD QLGASVVVSM
  
  101  VERGGVEACK AAVHNLLAQR VSGLIWNYPL DDQDAIAVEA ACTNVPALFL
  
  151  DVSDQTPINS IIFSHEDGTR LGVEHLVALG HQQIALLAGP LSSVSARLRL
  
  201  AGWHKYLTRN QIQPIAEREG DWSAMSGFQQ TMQMLNEGIV PTAMLVANDQ
  
  251  MALGAMRAIT ESGLRVGADI SVVGYDDTED SSCYIPPLTT IKQDFRLLGQ
  
  301  TSVDRLLQLS QGQAVKGNQL LPVSLVKRKT TLAPNTQTAS PRALADSLMQ
  
  351  LARQVLESGQ
  
  


RELATED PROGRAMS

Sample extracts sequence fragments randomly from sequence(s). You can set a sampling rate to determine how many fragments Sample extracts.

Shuffle randomizes the order of the symbols in a sequence without changing the composition.

SeqEd is an interactive editor for entering and modifying sequences and for assembling parts of existing sequences into new genetic constructs. You can enter sequences from the keyboard or from a digitizer.


RESTRICTIONS

PepCorrupt works on protein sequences. The output is renumbered to start at one.

If an indel is longer than 250 residues, only the first 250 residues of the indel are shown in the output file.


CONSIDERATIONS

PepCorrupt makes the substitutions first, followed by the insertions and deletions. The substitution algorithm is this: any of the standard amino acids is chosen at random and then placed at a randomly chosen position in the sequence. This means that, on average, about one in 26 substitutions will not change the residue.
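The substitution step can be sketched in Python as follows. This is a hedged illustration rather than the program's code: the function name apply_substitutions and the default 20-letter alphabet are assumptions made for the example. When the new symbol is drawn uniformly from an alphabet that contains the current residue, the chance that a substitution is silent is one over the alphabet size, so the one in 26 figure quoted above would correspond to a 26-symbol alphabet.

```python
import random

# Assumed alphabet: the 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def apply_substitutions(seq, n_subs, alphabet=AMINO_ACIDS, rng=None):
    """Return (new_seq, effective), where effective counts the substitutions
    that actually changed the residue present at that moment."""
    rng = rng or random.Random()
    residues = list(seq)
    if not residues:
        return seq, 0  # nothing to substitute in an empty sequence
    effective = 0
    for _ in range(n_subs):
        pos = rng.randrange(len(residues))  # random position in the sequence
        new = rng.choice(alphabet)          # random symbol, may equal the old one
        if new != residues[pos]:
            effective += 1
        residues[pos] = new
    return "".join(residues), effective
```

Because positions are drawn independently, the same position can be hit more than once, which is how a later substitution can mutate a residue back to the original.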


SUGGESTIONS

You may find what happened hard to understand if you make a lot of indels. The best way we know of to reconstruct a corruption is to start with the original sequence and, using SeqEd, make the changes in exactly the same order as they appear in the output file trace. You can use Gap to display the original and corrupted sequences next to one another.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.

  
  
  Minimal Syntax: % pepcorrupt [-INfile=]gamma.pep -Default
  
  Prompted Parameters (for single sequences only):
  
  -BEGin=1                  beginning of the range of interest
  -END=11375                end of the range of interest
  [-OUTfile=]gamma.corrupt  output file name
  
  Other Prompted Parameters:
  
  -SUBstitutions=1          number of substitutions to introduce
  -INDels=1                 number of length errors to introduce
  
  Local Data Files: None
  
  Optional Parameters:
  
  -MAXindel=3               size of maximum insertion/deletion
  -TRAce                    record residue changes in the output file
  -EXTension=.corrupt       sets the output file name extension
  -LIStfile[=corrupt.list]  writes a list file of output sequence names
  -NOMONitor                suppresses screen monitor (of input sequence
                            names)
  -NOSUMmary                suppresses the screen summary
  
  


LOCAL DATA FILES

None.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.

-MAXindel=3

sets the maximum size of an insertion or deletion. The maximum is three unless you change it with this option.

-NOTRAce

Normally PepCorrupt writes a complete record in the output file of each substitution, insertion, and deletion. You can suppress this information with -NOTRAce.

-EXTension=.corrupt

PepCorrupt creates output filenames by using the name of the input sequence for the base name and .corrupt for the filename extension. Use this option to choose some other filename extension.

-LIStfile=pepcorrupt.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequences, in the User's Guide). If you don't specify a filename, then PepCorrupt makes one up using pepcorrupt for the filename and .list for the filename extension.

-MONitor

This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default qualifier to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

Use this qualifier also to include a summary of the program's work in the log file for a program run in batch.

Printed: April 22, 1996 15:54 (1162)