CReformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. For sequence files, a base range can be selected or excluded.
CReformat rewrites sequence files to make them usable by the Wisconsin Sequence Analysis Package(TM) or to alter their appearance. The following are some of the manipulations that CReformat can perform:
- converting sequence files that were prepared or edited with a text editor or transferred to your computer from another computer into GCG format.
- converting between multiple sequence file (MSF) format and individual sequences in GCG format.
- correcting the sequence type (protein or nucleic acid) of sequence files that have no type or that were incorrectly typed when they were created.
- converting nucleic acid sequences between DNA (T, t) and RNA (U, u) representations.
- converting peptide sequences between one-letter and three-letter amino acid representations.
- converting sequences to all uppercase or all lowercase characters.
- removing gap characters from sequence files.
CReformat can also rewrite MSF files that you have edited with a text editor back into GCG format.
In order to use CReformat on sequence files, the files must contain a heading, a dividing line, and a sequence, as described below. You can use a text editor to make your "foreign" sequence files conform to this arrangement.
The heading of a sequence file may contain any number of lines of text at the top of the file to describe the sequence. The heading must not contain two adjacent periods (..) anywhere within it.
The heading is followed by a dividing line: a line containing two adjacent periods (..). Any information on the line other than the two periods is lost during reformatting. The dividing line may be omitted if there is absolutely no heading. All GCG data files contain a dividing line to separate the data from a documentary heading.
After the dividing line comes the sequence in any format you wish. It is conventional to use uppercase letters for known parts of the sequence and lowercase letters for uncertain parts. As in the example below, the sequence may have documentary comments embedded within it. You may either use two adjacent slash characters (//) to mark the end of the sequence data or just allow the sequence to go on until the end of the file.
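The heading, dividing line, and sequence arrangement described above can be illustrated with a short Python sketch. It is not CReformat's own code: the function name split_sequence_file is hypothetical, and the sketch assumes that the first line containing two adjacent periods is the dividing line and that the first // ends the sequence data. It does not look inside embedded comments.

```python
def split_sequence_file(text):
    """Split the text of a sequence file into (heading, sequence_body).

    Assumptions (illustrative only, not taken from CReformat's source):
    - the first line containing '..' is the dividing line, and everything
      on that line is discarded;
    - if no line contains '..', the file has no heading;
    - the first '//' after the dividing line ends the sequence data.
    """
    lines = text.splitlines()
    divider = next((i for i, line in enumerate(lines) if ".." in line), None)
    if divider is None:
        heading_lines, body_lines = [], lines
    else:
        heading_lines, body_lines = lines[:divider], lines[divider + 1:]
    body = "\n".join(body_lines)
    end = body.find("//")
    if end != -1:
        body = body[:end]
    return "\n".join(heading_lines), body
```

For a file whose text is "My heading" followed by a ".." line and two sequence lines, the function returns the heading text and the two sequence lines, with anything after // dropped.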
The alphabet of legitimate sequence characters and their meanings are defined in Appendix III. Legitimate sequence characters include all uppercase and lowercase letters. Wisconsin Package(TM) programs support the IUB-IUPAC standard ambiguity codes for the representation of nucleic acid ambiguities and the standard one-letter amino acid codes. CReformat, like all other Wisconsin Package programs, ignores all characters that are not in the alphabet of legitimate sequence characters.
This GCG program was modified by David Mathog (E-mail: MATHOG@seqaxp.bio.caltech.edu Post: Sequence Analysis Facility, Biology Division, Caltech), and modified for EGCG by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session using CReformat to rewrite a sequence file prepared with a text editor (see the INPUT FILE topic below) to GCG format:
% creformat -begin=1289 -end=1879 -lookup="T/U"

 CREFORMAT what sequence file(s) ?  GenEmbl:paamir

 paamir.em_ba length: 591 bp

%
Here is part of the output file from the example above:
ID   PAAMIR     standard; DNA; PRO; 2167 BP.
XX
AC   X13776;
///////////////////////////////////////////////////////
SQ   Sequence 2167 BP; 363 A; 712 C; 730 G; 362 T; 0 other;

 REFORMAT of: em_ba:paamir, selected positions 1289 to 1879
 Original length 2167

 paamir.em_ba  Length: 591  September 21, 1995 20:45  Type: N  Check: 9909  ..

       1  AUGAGCGCCA ACUCGCUGCU CGGCAGCCUG CGCGAGUUGC AGGUGCUGGU

      51  CCUCAACCCG CCGGGGGAGG UCAGCGACGC CCUGGUCUUG CAGCUGAUCC
///////////////////////////////////////////////////////
     501  GCACCAGCAC CUGUCGCGGG AAGCGAUGAA GCGGCGCGAG CCGAUCCUGA

     551  AGAUCGCUCA GGAGUUGCUG GGAAACGAGC CGUCCGCCUG A
Here is part of the input file used for the example above:
 Human fetal Beta globin G gamma
 from Shen, Slightom and Smithies, Cell 26; 191-203.
 Analyzed by Smithies et al. Cell 26; 345-353.

 The region below is used to demonstrate REFORMAT. It
 starts at base 2051 of the sequence reported in Cell. ..

 AGGAAGCACC CTTCAGCAGT TCCACA >Cap (G gamma) >CACT CGCTT CTGGA
 ACGTCTGAGG TTATCAATAA GCTCCTAGTC CAGACGCC >coding (G gamma) >AT
 ////////////////////////////////////////////////////////
 GCTCACTGCC CATGATGCAG AGCTTTCAAG GATAGGCTTT ATTCTGCAAG CAATACAAAT
 AATAAATCTA TTCTGCTAAG AGATCAC< POLYA (G gamma)< ACATGGTTGTCTTCAGTTCTT
Reformat is the original GCG version of this program.
SeqEd is a general purpose sequence editor.
All Wisconsin Package programs that write sequence files, such as Assemble, BackTranslate, ExtractPeptide, FromStaden, GetSeq, PepData, PileUp, Reverse, SeqEd, Shuffle, and Translate, write their sequences in GCG format.
The programs FromEMBL, FromFastA, FromGenBank, FromIG, FromPIR, and FromStaden are designed to bring files from six popular formats into GCG format. In addition to reformatting the sequences, these specialized reformatting programs convert the sequence characters into the nearest IUB-IUPAC equivalent character (see Appendix III).
ChopUp reads files with lines up to 32,000 characters long. The file is rewritten to a new file that has lines no longer than 50 characters.
DataSet creates a GCG data library from any set of sequences in GCG format. ToBLAST combines any set of GCG sequences into a database that you can search with BLAST.
A sequence may not contain more than 350,000 sequence symbols. Embedded comments more than 125 characters long are truncated to 125 characters. Input lines may not be longer than 511 characters. ChopUp can convert a file with lines exceeding 511 characters to a file suitable for input to CReformat.
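If you want to find out before running CReformat whether a file needs to go through ChopUp first, a check along the following lines may help. This is a minimal Python sketch, not part of the Wisconsin Package; the function name find_long_lines is hypothetical, and the only figure taken from this manual is the 511-character line limit.

```python
MAX_LINE = 511  # input line limit stated in the restrictions above


def find_long_lines(text, limit=MAX_LINE):
    """Return the 1-based numbers of the lines that are longer than limit."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

An empty result means that no line exceeds the limit; otherwise the listed lines are the ones ChopUp would need to shorten.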
Nucleic acid and peptide sequences are generally named with the filename extensions .seq and .pep, respectively.
The command % seqformat Staden sets your process so that most programs accept sequences in the format used by the Staden programs directly without the need for reformatting. The command % seqformat GCG restores the system to expect sequences in GCG format.
You can use CReformat on Staden files (or any files that contain only sequence characters) without modification as long as all the sequence characters in the file belong to the IUB-IUPAC code representation. If your Staden file contains any of Staden's ambiguity codes, use the FromStaden program instead.
Multiple Sequence Format (MSF) Files
CReformat can be used to convert between MSF and individual sequence format files. All embedded comments are lost when converting from individual sequence to multiple sequence format. In addition, when the sequence files are specified using a list file, any weights present in the list file are lost during the conversion to the MSF file. (In Chapter 2, Using Sequences of the User's Guide, see the topic "Using Multiple Sequence Format (MSF) Files" for help in specifying sequences in MSF files, and the topic "Using List Files (formerly Files of Sequence Names)" for information about list files.)
Following are several examples of the commands you might type to convert between MSF and individual sequence format files. These examples use the files hsp70.msf and pretty.list, which can be copied to your local directory with the % fetch command.
To copy all of the sequences in hsp70.msf into separate sequence files, use
% creformat hsp70.msf{*}
To copy the sequence Hs70_Plafa from hsp70.msf into a separate sequence file, use
% creformat hsp70.msf{hs70_plafa}
To collect all of the sequences named in pretty.list into an MSF file, use
% creformat -MSF @pretty.list
To collect the mouse sequences in hsp70.msf into a separate MSF file, use
% creformat -MSF hsp70.msf{*mouse}
If you edit hsp70.msf with a text editor to manually adjust the alignment, you must use CReformat to rewrite the MSF file so that it can be used with Wisconsin Package programs:
% creformat -MSF hsp70.msf{*}
You can control the number of sequence characters per line, the number of characters in each block, and whether blank lines appear between sequence lines by setting parameters on the command line. CReformat defaults to groups of 10 characters in lines of 50, with a blank line between each sequence line.
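The default layout described above (blocks of 10 characters, 50 characters per line, one blank line between sequence lines) can be sketched in Python as follows. The function name format_sequence and the width of the numbering column are assumptions made for illustration; only the default values come from this manual.

```python
def format_sequence(seq, line_size=50, block_size=10, blank_lines=1, numbering=True):
    """Lay out seq in space-separated blocks, line_size characters per line.

    The 8-character right-aligned position column is an assumption chosen
    to resemble the sample output; it is not a documented width.
    """
    out = []
    for start in range(0, len(seq), line_size):
        chunk = seq[start:start + line_size]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        line = " ".join(blocks)
        if numbering:
            # Number each line with the position of its first character.
            line = f"{start + 1:>8}  {line}"
        out.append(line)
        out.extend([""] * blank_lines)
    return "\n".join(out).rstrip("\n")
```

The keyword arguments correspond loosely to -LINesize, -BLOcksize, -BLAnklines, and -NONUMbering (numbering=False).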
CReformat calculates a checksum based on the exact sequence in your file. CReformat always appends a line to the heading showing the filename, the date of reformatting, the length of the sequence, and the sequence's checksum. All Wisconsin Package programs that read sequences recalculate the checksum and compare it to the value written by CReformat to ensure the integrity of the data. If there is disagreement between the newly calculated and previously written values of checksum, the program stops and tells you to reformat the file. There is one chance in 10 thousand that two different sequences would have the same checksum.
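This manual does not give the checksum formula. The Python sketch below uses the position-weighted sum that is commonly attributed to GCG checksums: each character's uppercase ASCII value is multiplied by a weight that cycles from 1 to 57, and the total is taken modulo 10,000. Treat the formula and the function name gcg_checksum as assumptions, not as CReformat's confirmed implementation. The modulus is at least consistent with the one-in-10-thousand figure quoted above.

```python
def gcg_checksum(seq):
    """Position-weighted checksum in the range 0..9999.

    Assumed formula (not stated in this manual): weights cycle 1..57 over
    the sequence positions and letter case is ignored.
    """
    total = 0
    for index, char in enumerate(seq):
        weight = index % 57 + 1
        total += weight * ord(char.upper())
    return total % 10000
```

Because the weight depends on position, inserting or deleting a single character changes the contribution of every character after it, which is why an edited file usually fails the check until it is reformatted.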
You may embed comments of up to 125 characters within a sequence by enclosing them in special comment-delimiting characters. Comments are very helpful for documenting sequences, especially sequences assembled from several sources or sequences containing many genes.
Embedded comments can begin with one of the characters <, >, or $. Each comment must begin and end with the same character.
The embedded comments below seem useful for the sequences we have annotated.
>coding>       beginning of coding sequence
>Cap>          cap site
>IVS>          intervening sequence donor
>Transcript>   beginning of transcript
>Promoter>     promoter
>Ribosome>     ribosome binding site
Comments must start and end with the same delimiting character and may not be more than 125 characters long. Comments that are too long are truncated to 125 characters. CReformat searches through the whole file, if need be, for the second delimiting character that closes the field of a comment. CReformat prints a warning for unclosed comments, but not for comments that are too long.
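The comment rules above can be illustrated with a short Python sketch that separates embedded comments from sequence characters. It is not CReformat's code. The function name extract_comments is hypothetical, the 125-character limit is assumed to apply to the text between the delimiters, and an unclosed comment is assumed to run to the end of the input.

```python
MAX_COMMENT = 125   # comment length limit stated above
DELIMITERS = "<>$"  # characters that may open and close a comment


def extract_comments(body):
    """Return (sequence, comments, warnings) for the text after the dividing line."""
    sequence, comments, warnings = [], [], []
    i = 0
    while i < len(body):
        char = body[i]
        if char in DELIMITERS:
            # A comment closes at the next occurrence of the same delimiter.
            close = body.find(char, i + 1)
            if close == -1:
                warnings.append(f"unclosed comment starting at offset {i}")
                close = len(body)
            comments.append(body[i + 1:close].strip()[:MAX_COMMENT])
            i = close + 1
        else:
            if not char.isspace():
                sequence.append(char)
            i += 1
    return "".join(sequence), comments, warnings
```

As described above, the sketch warns about unclosed comments but truncates overlong ones silently.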
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimal Syntax: % creformat [-INfile=]creformat.txt -default

Prompted Parameters: None

Local Data Files:

-DATa=translate.txt        three-letter to one-letter codes

Optional Parameters:

-LINesize=50               sets number of characters per line
-BLOcksize=10              sets number of characters per block
-BLAnklines=1              puts blank lines between the sequence lines
-NONUMbering               suppresses numbering
-NOCOMments                suppresses comments
-DNA                       changes U into T
-RNA                       changes T into U
-UPPer                     makes all sequence characters uppercase
-LOWer                     makes all sequence characters lowercase
-LIStfile[=reformat.list]  writes a list file of output sequence names
-MSF                       reformats sequences into an MSF output file
-DEGap                     removes gap characters (.) from the sequence
-THReeintoone              translates three-letter peptides into one-letter
-ONEIntothree              translates one-letter peptides into three-letter
-COMparison                reformats a table instead of a sequence
-ENZymedata                reformats an enzyme data file instead of a sequence
                           (used with - protein enzyme data file)
-PROtein                   insists that the sequences are reformatted as
                           protein sequences
-NUCleotide                insists that the sequences are reformatted as
                           nucleic acid sequences
-PROFile                   reformats an old profile into the new profile format
-EXTension=.seq            defines a file name extension
-TRANSlate=filename.txt    lets you name the output translation table
[-OUTfile=]newseqname      lets you name the output file
-NOMONitor                 suppresses the screen trace showing each output file
-BEGin                     beginning of range, defaults to 1
-END                       end of range, defaults to maximum sequence length
                           Use these to extract a subsequence from a sequence
                           or MSF file.
-DELete                    delete the subsequence in the range, leave the rest
-REVerse                   return the reverse strand
-LOOKup="U.,TZ"            convert characters in first string to matching
                           character in second string.
After modifying a scoring matrix, you may want to reformat it to give it a nicer appearance. To use CReformat for this purpose, run the program with % creformat -COMparison. (See Chapter 4, Using Data Files of the User's Guide for more information about scoring matrices.)
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.
In the rare event that you are using CReformat to make a three-letter amino acid sequence into a one-letter sequence, CReformat looks for translate.txt as a local data file.
The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in the Data Files manual.
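To show what the one-letter and three-letter mapping amounts to, here is a minimal Python sketch. It does not read translate.txt, whose file format is not described here. Instead it hard-codes the 20 standard amino acid codes as a stand-in, and the function names three_to_one and one_to_three are hypothetical.

```python
# Standard amino acid codes, hard-coded as a stand-in for translate.txt.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def three_to_one(peptide):
    """Convert three-letter codes (with or without separators) to one-letter codes."""
    letters = [c for c in peptide if c.isalpha()]
    if len(letters) % 3 != 0:
        raise ValueError("three-letter sequence length is not a multiple of 3")
    codes = ["".join(letters[i:i + 3]).capitalize() for i in range(0, len(letters), 3)]
    return "".join(THREE_TO_ONE[code] for code in codes)


def one_to_three(peptide):
    """Convert one-letter codes to space-separated three-letter codes."""
    return " ".join(ONE_TO_THREE[c.upper()] for c in peptide)
```

Ambiguity codes and stop symbols are left out of this sketch; an unknown code raises a KeyError.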
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-COMparison
lets you reformat a scoring matrix that has been modified with a text editor.
-ENZymedata
lets you reformat a nucleic acid or protein enzyme data file used with Version 4 of the Wisconsin Package into the format used for Version 5 or greater. Use the -PROtein switch if the patterns are amino acids and not nucleotides. This tells CReformat to write a period (.) in the overhang field so that the mapping programs know not to search both strands of a sequence.
-PROtein
reformats the sequence as a protein sequence.
-NUCleotide
reformats the sequence as a nucleotide sequence.
-PROFile
lets you reformat an old (pre-Version 6.2) profile into the new profile format.
-DNA
substitutes T for U and t for u in the whole sequence.
-RNA
substitutes U for T and u for t in the whole sequence.
-UPPer
puts all sequence characters into uppercase.
-LOWer
puts all sequence characters into lowercase.
-MSF
reformats all input sequences into a multiple sequence format (MSF) output file. The default is to write each sequence into a separate output file.
-DEGap
removes gap characters (.) from the sequence.
-THReeintoone
changes a peptide sequence from three-letter codes into one-letter codes (see Appendix III). Wisconsin Package programs for peptide sequence analysis can use peptide sequences in one-letter codes only.
-ONEIntothree
changes a peptide sequence of one-letter codes into three-letter codes (see Appendix III). Wisconsin Package programs for peptide sequence analysis can use peptide sequences in one-letter codes only.
-LINesize=50
lets you set the number of sequence characters per line to any number between 1 and 120.
-BLOcksize=10
lets you set the number of sequence characters in each block to any number between 1 and the line size.
-BLAnklines=1
leaves zero or more blank lines between the sequence lines.
-NONUMbering
suppresses the numbering next to each sequence line.
-NOCOMments
suppresses any comments that may have been in the input sequence file.
-OUTfile=newseqname
selects an output filename other than the name of the input file.
-EXTension=.seq
selects a filename extension other than the input filename extension.
-LIStfile
writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequences of the User's Guide). If you don't specify a filename, then CReformat makes one up using creformat for the filename and .list for the filename extension. If -MSF is on the command line, this option is ignored and a list file will not be written.
-BEGin
sets the beginning of the range; defaults to 1. Use this option to extract (or delete) a subsequence from a sequence or MSF file.
-END
sets the end of the range; defaults to the maximum sequence length. Use this option to extract (or delete) a subsequence from a sequence or MSF file.
-REVerse
returns the reverse strand.
-LOOKup="U.,TZ"
converts each character in the first string to the matching character in the second string. The two strings, "U." and "TZ" in the example above, are separated by a comma.
-DELete
deletes the subsequence in the range and leaves the rest.
-TRANSlate=filename.txt
Usually, translation is based on the translation table in a default or local data file called translate.txt. This option allows you to use a translation table in a different file. (See the Data Files manual for information about translation tables.)
-MONitor
This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.
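The range options and the -LOOKup option lend themselves to a short illustration. The Python sketch below is not CReformat's code. The function names select_range and apply_lookup are hypothetical, positions are assumed to be 1-based and inclusive (which matches the example session, where positions 1289 to 1879 give 591 bases), and the lookup is assumed to be case-sensitive and to use the comma-separated form.

```python
def select_range(seq, begin=1, end=None, delete=False):
    """Extract positions begin..end (1-based, inclusive), or delete them."""
    if end is None:
        end = len(seq)
    if not 1 <= begin <= end <= len(seq):
        raise ValueError("range must satisfy 1 <= begin <= end <= len(seq)")
    if delete:
        # Keep everything outside the range, as -DELete is described.
        return seq[:begin - 1] + seq[end:]
    return seq[begin - 1:end]


def apply_lookup(seq, spec):
    """Apply a lookup such as 'U.,TZ': U becomes T and . becomes Z."""
    source, _, target = spec.partition(",")
    if len(source) != len(target):
        raise ValueError("both lookup strings must have the same length")
    return seq.translate(str.maketrans(source, target))
```

For example, select_range(seq, 1289, 1879) on a 2167-base sequence returns 591 characters, and apply_lookup("AU.G", "U.,TZ") returns "ATZG".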
Printed: April 22, 1996 15:52 (1162)