Creformat

CREFORMAT


FUNCTION

CReformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. For sequence files, a base range can be selected or excluded.


DESCRIPTION

CReformat rewrites sequence files to make them usable by the Wisconsin Sequence Analysis Package(TM) or to alter their appearance. The following are some of the manipulations that CReformat can perform:

- converting sequence files that were prepared or edited with a text editor or transferred to your computer from another computer into GCG format.

- converting between multiple sequence file (MSF) format and individual sequences in GCG format.

- correcting the sequence type (protein or nucleic acid) of sequence files that have no type or that were incorrectly typed when they were created.

- converting nucleic acid sequences between DNA (T, t) and RNA (U, u) representations.

- converting peptide sequences between one-letter and three-letter amino acid representations.

- converting sequences to all uppercase or all lowercase characters.

- removing gap characters from sequence files.

CReformat can also be used to rewrite into GCG format MSF files that you've edited with a text editor.

In order to use CReformat on sequence files, the files must contain a heading, a dividing line, and a sequence, as described below. You can use a text editor to make your "foreign" sequence files conform to this arrangement.


HEADING

The heading of a sequence file may contain any number of lines of text at the top of the file to describe the sequence. The heading must not contain two adjacent periods (..) anywhere within it.


DIVIDING LINE

The heading is followed by a dividing line: a line containing two adjacent periods (..). Any information on the line other than the two periods is lost during reformatting. The dividing line may be omitted if there is absolutely no heading. All GCG data files contain a dividing line to separate the data from a documentary heading.


SEQUENCE

After the dividing line comes the sequence in any format you wish. It is conventional to use uppercase letters for known parts of the sequence and lowercase letters for uncertain parts. As in the example below, the sequence may have documentary comments embedded within it. You may either use two adjacent slash characters (//) to mark the end of the sequence data or just allow the sequence to go on until the end of the file.
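The heading / dividing line / sequence arrangement described above can be sketched in a few lines of Python. This is only an illustration of the layout, not CReformat's own parser; the function name split_sequence_file is hypothetical, and the handling of text that shares a line with the ".." or "//" markers is a simplifying assumption.

```python
def split_sequence_file(text):
    """Split file text into (heading, sequence_body).

    Assumptions (not taken from CReformat itself):
    - the first line containing ".." is the dividing line and is discarded;
    - with no dividing line, the whole file is treated as sequence;
    - "//" ends the sequence data; anything after it is ignored.
    """
    lines = text.splitlines()
    heading_lines = []
    body_lines = lines
    for index, line in enumerate(lines):
        if ".." in line:
            heading_lines = lines[:index]
            body_lines = lines[index + 1:]
            break

    kept = []
    for line in body_lines:
        if "//" in line:
            # Keep whatever precedes the end marker, then stop reading.
            kept.append(line.split("//", 1)[0])
            break
        kept.append(line)
    return "\n".join(heading_lines), "\n".join(kept)


if __name__ == "__main__":
    sample = "My clone\nnotes ..\nACGT\nTTGA //\nignored"
    print(split_sequence_file(sample))
```

The returned body may still contain embedded comments and spacing; those are dealt with separately from the heading split.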


SEQUENCE CHARACTERS

The alphabet of legitimate sequence characters and their meanings are defined in Appendix III. Legitimate sequence characters include all uppercase and lowercase letters. Wisconsin Package(TM) programs support the IUB-IUPAC standard ambiguity codes for the representation of nucleic acid ambiguities and the standard one-letter amino acid codes. CReformat, like all other Wisconsin Package programs, ignores all characters that are not in the alphabet of legitimate sequence characters.


AUTHOR

This GCG program was modified by David Mathog (E-mail: MATHOG@seqaxp.bio.caltech.edu Post: Sequence Analysis Facility, Biology Division, Caltech), and modified for EGCG by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session using CReformat to extract bases 1289 through 1879 of the database entry GenEmbl:paamir, convert T to U, and write the result in GCG format:

  
  
  % creformat -begin=1289 -end=1879 -lookup="T/U"
  
   CREFORMAT what sequence file(s) ?  GenEmbl:paamir
  
 paamir.em_ba  length:  591 bp
  
  %
  


OUTPUT FILE

Here is part of the output file from the example above:

  
  
  ID   PAAMIR     standard; DNA; PRO; 2167 BP.
  XX
  AC   X13776;
  
  ///////////////////////////////////////////////////////
  
  SQ   Sequence 2167 BP; 363 A; 712 C; 730 G; 362 T; 0 other;
   REFORMAT of: em_ba:paamir, selected positions 1289 to 1879
           Original length 2167
  
  paamir.em_ba  Length: 591  September 21, 1995 20:45  Type: N  Check: 9909  ..
  
    1  AUGAGCGCCA ACUCGCUGCU CGGCAGCCUG CGCGAGUUGC AGGUGCUGGU
  
   51  CCUCAACCCG CCGGGGGAGG UCAGCGACGC CCUGGUCUUG CAGCUGAUCC
  
  ///////////////////////////////////////////////////////
  
  501  GCACCAGCAC CUGUCGCGGG AAGCGAUGAA GCGGCGCGAG CCGAUCCUGA
  
  551  AGAUCGCUCA GGAGUUGCUG GGAAACGAGC CGUCCGCCUG A
  
  


INPUT FILE

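Here is part of a sequence file prepared with a text editor, with a heading, a dividing line, and embedded comments, that CReformat can rewrite into GCG format: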

  
  
  Human fetal Beta globin G gamma
  from Shen, Slightom and Smithies,  Cell 26; 191-203.
  Analyzed by Smithies et al. Cell 26; 345-353.
  
  The region below is used to demonstrate REFORMAT.  It
  starts at base 2051 of the sequence reported in Cell.
  
                         ..
  
  AGGAAGCACC CTTCAGCAGT TCCACA >Cap (G gamma) >CACT CGCTT
  CTGGA ACGTCTGAGG
  TTATCAATAA GCTCCTAGTC CAGACGCC >coding (G gamma) >AT
  
  ////////////////////////////////////////////////////////
  
  GCTCACTGCC CATGATGCAG
  AGCTTTCAAG GATAGGCTTT ATTCTGCAAG CAATACAAAT AATAAATCTA
  TTCTGCTAAG AGATCAC< POLYA (G gamma)< ACATGGTTGTCTTCAGTTCTT
  


RELATED PROGRAMS

Reformat is the original GCG version of this program.

SeqEd is a general purpose sequence editor.

All Wisconsin Package programs that write sequence files, such as Assemble, BackTranslate, ExtractPeptide, FromStaden, GetSeq, PepData, PileUp, Reverse, SeqEd, Shuffle, and Translate, write their sequences in GCG format.

The programs FromEMBL, FromFastA, FromGenBank, FromIG, FromPIR, and FromStaden are designed to bring files from six popular formats into GCG format. These specialized reformatting programs, in addition to reformatting the sequences, also convert the sequence characters into the nearest IUB-IUPAC equivalent character (see Appendix III).

ChopUp reads files with lines up to 32,000 characters long. The file is rewritten to a new file that has lines no longer than 50 characters.

DataSet creates a GCG data library from any set of sequences in GCG format. ToBLAST combines any set of GCG sequences into a database that you can search with BLAST.


RESTRICTIONS

A sequence may not contain more than 350,000 sequence symbols. Embedded comments more than 125 characters long are truncated to 125 characters. Input lines may not be longer than 511 characters. ChopUp can convert a file with lines exceeding 511 characters to a file suitable for input to CReformat.


CONSIDERATIONS

Filename Extensions

Nucleic acid and peptide sequences are generally named with the filename extensions .seq and .pep, respectively.

Use Staden Format Directly

The command % seqformat Staden sets your process so that most programs accept sequences in the format used by the Staden programs directly without the need for reformatting. The command % seqformat GCG restores the system to expect sequences in GCG format.

You can use CReformat on Staden files (or any files that contain only sequence characters) without modification as long as all the sequence characters in the file belong to the IUB-IUPAC code representation. If your Staden file contains any of Staden's ambiguity codes, use the FromStaden program instead.

Multiple Sequence Format (MSF) Files

CReformat can be used to convert between MSF and individual sequence format files. All embedded comments are lost when converting from individual sequence to multiple sequence format. In addition, when the sequence files are specified using a list file, any weights present in the list file are lost during the conversion to the MSF file. (In Chapter 2, Using Sequences of the User's Guide, see the topic "Using Multiple Sequence Format (MSF) Files" for help in specifying sequences in MSF files, and the topic "Using List Files (formerly Files of Sequence Names)" for information about list files.)

Following are several examples of the commands you might type to convert between MSF and individual sequence format files. These examples use the files hsp70.msf and pretty.list, which can be copied to your local directory with the % fetch command.

To copy all of the sequences in hsp70.msf into separate sequence files, use

% creformat hsp70.msf{*}

To copy the sequence Hs70_Plafa from hsp70.msf into a separate sequence file, use

% creformat hsp70.msf{hs70_plafa}

To collect all of the sequences named in pretty.list into an MSF file, use

% creformat -MSF @pretty.list

To collect the mouse sequences in hsp70.msf into a separate MSF file, use

% creformat -MSF hsp70.msf{*mouse}

If you edit hsp70.msf with a text editor to manually adjust the alignment, you must use CReformat to rewrite the MSF file before it can be used with Wisconsin Package programs:

% creformat -MSF hsp70.msf{*}


FORMAT CONTROL

You can control the number of sequence characters per line, the number of characters in each block, and whether blank lines appear between sequence lines by setting parameters on the command line. CReformat defaults to groups of 10 characters in lines of 50, with a blank line between each sequence line.
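The default layout (blocks of 10 in lines of 50, one blank line between sequence lines) can be approximated as follows. This is a minimal sketch, not CReformat's code: the function name format_sequence and the number_width parameter (the column width of the position number) are assumptions made for illustration.

```python
def format_sequence(seq, line_size=50, block_size=10, blank_lines=1,
                    number_width=8):
    """Lay out a sequence in numbered lines split into blocks.

    line_size, block_size and blank_lines mirror -LINesize, -BLOcksize and
    -BLAnklines; number_width is a hypothetical layout choice.
    """
    out = []
    for start in range(0, len(seq), line_size):
        chunk = seq[start:start + line_size]
        blocks = [chunk[i:i + block_size]
                  for i in range(0, len(chunk), block_size)]
        # Each line is labelled with the 1-based position of its first symbol.
        out.append(f"{start + 1:>{number_width}}  " + " ".join(blocks))
        out.extend([""] * blank_lines)
    return "\n".join(out)


if __name__ == "__main__":
    print(format_sequence("ACGT" * 30))
```

Each line is labelled with the position of its first symbol, matching the 1, 51, ... 501, 551 numbering in the OUTPUT FILE example.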


CHECKSUM

CReformat calculates a checksum based on the exact sequence in your file. CReformat always appends a line to the heading showing the filename, the date of reformatting, the length of the sequence, and the sequence's checksum. All Wisconsin Package programs that read sequences recalculate the checksum and compare it to the value written by CReformat to ensure the integrity of the data. If the newly calculated and previously written checksum values disagree, the program stops and tells you to reformat the file. There is one chance in 10,000 that two different sequences would have the same checksum.
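This manual does not spell out the checksum algorithm. The sketch below uses the formula commonly documented for GCG-format files (each symbol's uppercase character code is weighted by a position counter that cycles from 1 to 57, and the sum is taken modulo 10,000); treat it as an assumption rather than a statement of what CReformat does internally. The function name gcg_checksum is hypothetical.

```python
def gcg_checksum(seq):
    """Checksum in the range 0..9999, using the commonly described GCG rule.

    Assumption: position weights cycle 1..57 and case is ignored.
    """
    total = 0
    for position, symbol in enumerate(seq):
        weight = position % 57 + 1
        total += weight * ord(symbol.upper())
    return total % 10000


if __name__ == "__main__":
    print(gcg_checksum("ACGT"))
```

Because the result is reduced modulo 10,000, only 10,000 distinct values are possible, which is consistent with the one-in-10,000 collision chance quoted above.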


EMBEDDED COMMENTS

You may embed comments of up to 125 characters within a sequence by enclosing them in special comment-delimiting characters. Comments are very helpful for documenting sequences, especially sequences assembled from several sources or sequences containing many genes.

Comment Delimiting Characters

Embedded comments can begin with one of the characters <, >, or $. Each comment must begin and end with the same character.

Suggestions

The embedded comments below seem useful for the sequences we have annotated.

  
  
     >coding>         beginning of coding sequence
     >Cap>            cap site
     >IVS>            intervening sequence donor
     >Transcript>     beginning of transcript
     >Promoter>       promoter
     >Ribosome>       ribosome binding site
  
  

Comment Limitations

Comments must start and end with the same delimiting character and may not be more than 125 characters long. Comments that are too long are truncated to 125 characters. CReformat searches through the whole file, if need be, for the second delimiting character that closes the field of a comment. CReformat prints a warning for unclosed comments, but not for comments that are too long.
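The comment rules can be illustrated with a short Python sketch that separates embedded comments from sequence symbols. It is not CReformat's implementation: the function name split_comments is hypothetical, keeping only letters as sequence symbols is a simplification of the Appendix III alphabet, and raising an error for an unclosed comment stands in for the warning that CReformat prints.

```python
DELIMITERS = "<>$"   # characters that may open and close a comment
MAX_COMMENT = 125    # longer comments are truncated to this length


def split_comments(body):
    """Return (sequence, comments) for a sequence body with embedded comments."""
    sequence = []
    comments = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in DELIMITERS:
            # The comment runs to the next occurrence of the SAME delimiter.
            close = body.find(ch, i + 1)
            if close == -1:
                raise ValueError(f"unclosed comment starting at offset {i}")
            comments.append(body[i + 1:close][:MAX_COMMENT])
            i = close + 1
        else:
            if ch.isalpha():  # simplification: letters only
                sequence.append(ch)
            i += 1
    return "".join(sequence), comments


if __name__ == "__main__":
    print(split_comments("ACG >coding> TTA"))
```

Whether the 125-character limit counts the delimiters themselves is not stated here, so the sketch applies it to the text between them.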


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % creformat [-INfile=]creformat.txt -default
  
  Prompted Parameters:  None
  
  Local Data Files:
  
  -DATa=translate.txt       three-letter to one-letter codes
  
  Optional Parameters:
  
  -LINesize=50              sets number of characters per line
  -BLOcksize=10             sets number of characters per block
  -BLAnklines=1             puts blank lines between the sequence lines
  -NONUMbering              suppresses numbering
  -NOCOMments               suppresses comments
  -DNA                      changes U into T
  -RNA                      changes T into U
  -UPPer                    makes all sequence characters uppercase
  -LOWer                    makes all sequence characters lowercase
  -LIStfile[=creformat.list] writes a list file of output sequence names
  -MSF                      reformats sequences into an MSF output file
  -DEGap                    removes gap characters (.) from the sequence
  -THReeintoone             translates three-letter peptides into one-letter
  -ONEIntothree             translates one-letter peptides into three-letter
  -COMparison               reformats a table instead of a sequence
  -ENZymedata               reformats an enzyme data file instead of a
                         sequence (use with -PROtein for a
                         protein enzyme data file)
  -PROtein                  insists that the sequences are reformatted as
                         protein sequences
  -NUCleotide               insists that the sequences are reformatted as
                         nucleic acid sequences
  -PROFile                  reformats an old profile into the new profile
                         format
  -EXTension=.seq           defines a file name extension
  -TRANSlate=filename.txt   lets you name the output translation table
  [-OUTfile=]newseqname     lets you name the output file
  -NOMONitor                suppresses the screen trace showing each output
                         file
  -BEGin                    beginning of range, defaults to 1
  -END                      end of range, defaults to maximum sequence length
                         (use -BEGin and -END to extract a subsequence
                         from a sequence or MSF file)
  -DELete                   deletes the subsequence in the range, leaves the
                         rest
  -REVerse                  returns the reverse strand
  -LOOKup="U.,TZ"           converts each character in the first string to
                         the matching character in the second string
  


SCORING MATRICES

After modifying a scoring matrix, you may want to reformat it to give it a nicer appearance. To use CReformat for this purpose, run the program with % creformat -COMparison. (See Chapter 4, Using Data Files of the User's Guide for more information about scoring matrices.)


LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

In the rare event that you are using CReformat to make a three-letter amino acid sequence into a one-letter sequence, CReformat looks for translate.txt as a local data file.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in the Data Files manual.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-COMparison

lets you reformat a scoring matrix that has been modified with a text editor.

-ENZymedata

lets you reformat a nucleic acid or protein enzyme data file used with Version 4 of the Wisconsin Package into the format used for Version 5 or greater. Use the -PROtein switch if the patterns are amino acids and not nucleotides. This will tell CReformat to write a period (.) in the overhang field so that the mapping programs will know not to try to search both strands of a sequence.

-PROtein

reformats the sequence as a protein sequence.

-NUCleotide

reformats the sequence as a nucleotide sequence.

-PROFile

lets you reformat an old (pre-Version 6.2) profile into the new profile format.

-DNA

substitutes T for U and t for u in the whole sequence.

-RNA

substitutes U for T and u for t in the whole sequence.

-UPPer

puts all sequence characters into uppercase.

-LOWer

puts all sequence characters into lowercase.

-MSF

reformats all input sequences into a multiple sequence format (MSF) output file. The default is to write each sequence into a separate output file.

-DEGap

removes gap characters (.) from the sequence.

-THReeintoone

changes a peptide sequence from three-letter codes into one-letter codes (see Appendix III). Wisconsin Package programs for peptide sequence analysis can use peptide sequences in one-letter codes only.

-ONEIntothree

changes a peptide sequence of one-letter codes into three-letter codes (see Appendix III). Wisconsin Package programs for peptide sequence analysis can use peptide sequences in one-letter codes only.

-LINesize=50

lets you set the number of sequence characters per line to any number between 1 and 120.

-BLOcksize=10

lets you set the number of sequence characters in each block to any number between 1 and the line size.

-BLAnklines=1

leaves zero or more blank lines between the sequence lines.

-NONUMbering

suppresses the numbering next to each sequence line.

-NOCOMments

suppresses any comments that may have been in the input sequence file.

-OUTfile=newseqname

selects an output filename other than the name of the input file.

-EXTension=.seq

selects a filename extension other than the input filename extension.

-LIStfile=creformat.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequences of the User's Guide). If you don't specify a filename, then CReformat makes one up using creformat for the filename and .list for the filename extension. If -MSF is on the command line, this option is ignored and a list file will not be written.

-BEGin=1

beginning of range, defaults to 1. Use this option to extract (or delete) a subsequence from a sequence or MSF file.

-END=9999

end of range, defaults to maximum sequence length. Use this option to extract (or delete) a subsequence from a sequence or MSF file.

-REVerse

returns the reverse strand.

-LOOKup="U.,TZ"

converts each character in the first string to the matching character in the second string. The two strings, "U." and "TZ" in the example above, are separated by a comma.
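The position-by-position mapping that -LOOKup describes can be sketched in Python as below. The function name apply_lookup is hypothetical, and two behaviours are assumptions rather than documented facts: matching is case-sensitive, and strings of unequal length are rejected.

```python
def apply_lookup(sequence, spec):
    """Apply a comma-separated lookup such as "U.,TZ" to a sequence.

    The character at position i of the first string is replaced by the
    character at position i of the second string.
    """
    source, target = spec.split(",", 1)
    if len(source) != len(target):
        raise ValueError("both lookup strings must have the same length")
    return sequence.translate(str.maketrans(source, target))


if __name__ == "__main__":
    print(apply_lookup("AUG.C", "U.,TZ"))
```

With the lookup "U.,TZ", every U becomes T and every gap character (.) becomes Z, while all other symbols pass through unchanged.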

-DELete

deletes the subsequence in the range and leaves the rest.
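The effect of -BEGin, -END, and -DELete can be sketched as follows. The range is treated as 1-based and inclusive, which agrees with the EXAMPLE session (positions 1289 to 1879 yield 591 bases). The function names are hypothetical, and rejecting out-of-range values with an error is an assumption about behaviour this manual does not describe.

```python
def _check_range(seq, begin, end):
    """Resolve the default end and validate a 1-based inclusive range."""
    if end is None:
        end = len(seq)
    if not 1 <= begin <= end <= len(seq):
        raise ValueError("range must satisfy 1 <= begin <= end <= length")
    return begin, end


def extract_range(seq, begin=1, end=None):
    """Keep only positions begin..end (the default behaviour)."""
    begin, end = _check_range(seq, begin, end)
    return seq[begin - 1:end]


def delete_range(seq, begin=1, end=None):
    """Remove positions begin..end and keep the rest (as with -DELete)."""
    begin, end = _check_range(seq, begin, end)
    return seq[:begin - 1] + seq[end:]


if __name__ == "__main__":
    print(extract_range("ACGTACGT", 3, 5))
    print(delete_range("ACGTACGT", 3, 5))
```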

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This option allows you to use a translation table in a different file. (See the Data Files manual for information about translation tables.)

-MONitor

This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.

Printed: April 22, 1996 15:52 (1162)