PepCorrupt randomly introduces small numbers of substitutions, insertions, and deletions into protein sequence(s). Note that each substitution replaces a residue with a randomly chosen residue, and that back mutations to the original residue are allowed!
PepCorrupt uses a random number generator to add errors to protein sequences. You can set the number of substitutions and length errors independently. Length errors can either be insertions or deletions; these two changes are now collectively referred to as indels in the literature of mathematical biology. The position of each error is picked at random somewhere within the range and on the strand that you chose. The length of each indel is chosen at random from one to the maximum indel size. If the indel is positive (insertion), then the symbols added are also chosen at random.
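The indel step described above can be sketched roughly as follows. This is a minimal illustration, not PepCorrupt's actual code: the function name add_indels, the 20-letter residue alphabet, and the even split between insertions and deletions are all assumptions made for the example.

```python
import random

# Assumption: a 20-letter amino acid alphabet; the symbol set the program
# really draws from is not stated in this document.
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def add_indels(seq, n_indels, max_indel=3, rng=None):
    """Apply n_indels random insertions or deletions to seq.

    Returns the new sequence and the net length change.
    """
    rng = rng or random.Random()
    residues = list(seq)
    net_change = 0
    for _ in range(n_indels):
        # Indel length is drawn from one to the maximum indel size.
        size = rng.randint(1, max_indel)
        # Assumption: insertion and deletion are equally likely.
        if rng.random() < 0.5:
            pos = rng.randint(0, len(residues))
            # Inserted symbols are themselves chosen at random.
            residues[pos:pos] = [rng.choice(RESIDUES) for _ in range(size)]
            net_change += size
        elif residues:
            pos = rng.randrange(len(residues))
            removed = len(residues[pos:pos + size])
            del residues[pos:pos + size]
            net_change -= removed
    return "".join(residues), net_change
```

The net length change returned here plays the same role as the total length change from indels that the output file reports.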
The output files contain a complete record of the errors introduced. The chosen and actual number of substitutions may differ, since a substitution that happens to pick the residue already at that position does not change the sequence. The output file also shows the total amount of length added (or subtracted) when all of the indels are taken together. The current time is used to seed the random number generator, so each run of PepCorrupt yields different results.
If you give PepCorrupt a single input sequence, you can choose the range, strand, and output file name. Otherwise, PepCorrupt uses the whole sequence, the top strand, and names the output file with the sequence's name followed by the file name extension .corrupt.
This GCG program was modified by David Mathog (E-mail: MATHOG@seqaxp.bio.caltech.edu Post: Sequence Analysis Facility, Biology Division, Caltech), and modified for EGCG by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session using PepCorrupt to corrupt the first 200 residues of Sw:laci_ecoli:
% pepcorrupt

 PEPCORRUPT uses protein sequences

 PEPCORRUPT of what sequence(s) ?  Sw:laci_ecoli

                 Start (* 1 *) ?
                End (* 360 *) ?  200
              Reverse (* No *) ?

 How many substitutions do you want (* 1 *) ?  3
 How many length errors do you want (* 1 *) ?  3

%
The file laci_ecoli.corrupt would contain the corrupted contents of the first 200 symbols in Sw:laci_ecoli. Here is the output from this session:
PEPCORRUPT of:  check: 1939  from: 1  to: 360

Effective Substitutions: 360
% remaining identity: 0.000000
InDels: 3 Substitutions: 3 MaxIndel: 3
Actual substitutions: 4
Length change from indels: 0

 laci_ecoli.corrupt  Length: 360  February 28, 1996 13:18  Type: P  Check: 716  ..

       1  MKPVTLYDVA EYAGVSYQTV SRVVNQASHV SAKTRDINEK VEAAMAELNY
      51  IPNRVAQQLG KQSLLDGVAT SSLALHAPSQ IVAAIKSRAD QLGASVVVSM
     101  VERGGVEACK AAVHNLLAQR VSGLIWNYPL DDQDAIAVEA ACTNVPALFL
     151  DVSDQTPINS IIFSHEDGTR LGVEHLVALG HQQIALLAGP LSSVSARLRL
     201  AGWHKYLTRN QIQPIAEREG DWSAMSGFQQ TMQMLNEGIV PTAMLVANDQ
     251  MALGAMRAIT ESGLRVGADI SVVGYDDTED SSCYIPPLTT IKQDFRLLGQ
     301  TSVDRLLQLS QGQAVKGNQL LPVSLVKRKT TLAPNTQTAS PRALADSLMQ
     351  LARQVLESGQ
Sample extracts sequence fragments randomly from sequence(s). You can set a sampling rate to determine how many fragments Sample extracts. Shuffle randomizes the order of the symbols in a sequence without changing the composition. SeqEd is an interactive editor for entering and modifying sequences and for assembling parts of existing sequences into new genetic constructs. You can enter sequences from the keyboard or from a digitizer.
PepCorrupt works on protein sequences. The output is renumbered to start at one.
If an indel is longer than 250 residues, only the first 250 residues of the indel are shown in the output file.
PepCorrupt makes the substitutions first, followed by the insertions and deletions. The substitution algorithm is this: a residue is chosen at random from the standard amino acids and placed at a randomly chosen position in the sequence. This means that, on average, about one in 26 substitutions will not change the residue.
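A rough sketch of this substitution step, showing why the chosen and actual substitution counts can differ. The function name substitute is hypothetical, and the alphabet is an assumption: the 26 letters A to Z are used only because a uniform draw from 26 symbols matches the one-in-26 figure quoted above; the program's real symbol set is not given in this document.

```python
import random
import string

# Assumption: 26 equally likely symbols, picked to match the stated
# one-in-26 chance that a substitution leaves the residue unchanged.
ALPHABET = string.ascii_uppercase


def substitute(seq, n_subs, rng=None):
    """Make n_subs random substitutions in seq.

    Returns the new sequence and the number of substitutions that
    actually changed the residue at the chosen position.
    """
    rng = rng or random.Random()
    residues = list(seq)
    actual = 0
    for _ in range(n_subs):
        pos = rng.randrange(len(residues))   # any position in the sequence
        new = rng.choice(ALPHABET)           # any symbol, including the current one
        if new != residues[pos]:
            actual += 1
        residues[pos] = new
    return "".join(residues), actual
```

Because a later substitution can land on a position that was already changed, the number of positions that finally differ from the original can be smaller still than the count returned here.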
You may find what happened hard to understand if you make a lot of indels. The best way we know of to reconstruct a corruption is to start with the original sequence and, using SeqEd, make the changes in exactly the same order as they appear in the output file trace. You can use Gap to display the original and corrupted sequences next to one another.
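The replay idea can also be sketched in code. The record layout used here is hypothetical and is not the format of the PepCorrupt output file trace: each change is a tuple of kind, 1-based position in the sequence as it stands at that moment, and either the new residue(s) or a deletion length.

```python
def replay(original, trace):
    """Rebuild a corrupted sequence by applying changes in trace order.

    trace is a list of hypothetical records:
      ("sub", pos, residue)   replace the residue at pos
      ("ins", pos, residues)  insert residues before pos
      ("del", pos, length)    delete length residues starting at pos
    Positions are 1-based and refer to the sequence at that step.
    """
    residues = list(original)
    for kind, pos, arg in trace:
        i = pos - 1
        if kind == "sub":
            residues[i] = arg
        elif kind == "ins":
            residues[i:i] = list(arg)
        elif kind == "del":
            del residues[i:i + arg]
        else:
            raise ValueError(f"unknown change type: {kind!r}")
    return "".join(residues)


changes = [("sub", 2, "R"), ("ins", 4, "GG"), ("del", 1, 2)]
print(replay("MKPVTLYDVA", changes))            # PGGVTLYDVA
print(replay("MKPVTLYDVA", changes[::-1]))      # PRTGGLYDVA
```

The two printed results differ, which is why the changes have to be applied in exactly the order in which they were recorded.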
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.
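The capitalization convention can be illustrated with a small sketch. This is only a model of the rule as stated above, not the Wisconsin Package's own parser: the function names are hypothetical, and the rule that anything typed beyond the required letters must still spell out the full qualifier name is an assumption.

```python
def required_prefix(qualifier):
    """Return the leading capitalized letters of a qualifier name,
    for example 'CHE' for 'CHEck'."""
    n = 0
    while n < len(qualifier) and qualifier[n].isupper():
        n += 1
    return qualifier[:n]


def matches_qualifier(qualifier, typed):
    """True if typed is an acceptable abbreviation of qualifier.

    Assumption: matching ignores case, needs at least the capitalized
    letters, and any extra letters must follow the full name.
    """
    typed = typed.upper()
    return (len(typed) >= len(required_prefix(qualifier))
            and qualifier.upper().startswith(typed))


print(matches_qualifier("CHEck", "che"))        # True
print(matches_qualifier("CHEck", "ch"))         # False
print(matches_qualifier("MAXindel", "maxind"))  # True
```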
Minimal Syntax: % pepcorrupt [-INfile=]gamma.pep -Default

Prompted Parameters: (for single sequences only):

-BEGin=1                   beginning of the range of interest
-END=11375                 end of the range of interest
[-OUTfile=]gamma.corrupt   output file name

Other Prompted Parameters:

-SUBstitutions=1           number of substitutions to introduce
-INDels=1                  number of length errors to introduce

Local Data Files: None

Optional Parameters:

-MAXindel=3                size of maximum insertion/deletion
-TRAce                     record residue changes in the output file
-EXTension=.corrupt        sets the output file name extension
-LIStfile[=corrupt.list]   writes a list file of output sequence names
-NOMONitor                 suppresses screen monitor (of input sequence names)
-NOSUMmary                 suppresses the screen summary
None.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.
-MAXindel=3

sets the maximum size of an insertion or deletion. The maximum is three unless you change it with this option.

-TRAce

Normally PepCorrupt writes a complete record in the output file of each substitution, insertion, and deletion. You can suppress this information with -NOTRAce.

-EXTension=.corrupt

creates output filenames by using the original input filename for the base name and the program name for the name extension. Use this option to choose some other filename extension.

-LIStfile=corrupt.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequences of the User's Guide). If you don't specify a filename, then PepCorrupt makes one up using pepcorrupt for the filename and .list for the filename extension.

-MONitor

This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default qualifier to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

Use this qualifier also to include a summary of the program's work in the log file for a program run in batch.
Printed: April 22, 1996 15:54 (1162)