EFromFastA reformats one or more sequences from FastA format into individual files in GCG format.
Use EFromFastA when you want to convert sequences that are in FastA format into a format suitable for use with programs in the Wisconsin Sequence Analysis Package(TM). A FastA file may hold many sequences; in such a case EFromFastA writes many output files, one for each sequence in the FastA file. Each output file is named according to the first word (following the > character) on the documentation line just above the sequence data in the FastA file. The documentation line from the FastA input file(s) is preserved in the GCG output file(s). EFromFastA can convert sets of FastA sequence files that are specified with a file of filenames or with multiple file specification syntax.
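The naming rule described above can be illustrated with a short Python sketch. This is not EFromFastA's own code: the function name split_fasta is hypothetical, and lowercasing the file name is an assumption based only on the example session (the entry >EGMSMG is written to the file egmsmg).

```python
def split_fasta(text):
    """Yield (file_name, documentation_line, sequence) for each FastA record.

    Sketch only: lowercasing the name is an assumption taken from the
    example session, not a documented rule.
    """
    name, doc, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new documentation line closes the previous record, if any.
            if name is not None:
                yield name, doc, "".join(chunks)
            doc = line[1:].strip()
            # The first word after '>' becomes the output file name.
            name = doc.split()[0].lower() if doc else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, doc, "".join(chunks)
```

Each tuple yielded corresponds to one output file: the name, the documentation line carried over into the GCG file, and the sequence with its line breaks removed.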
This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is a session using EFromFastA to convert the FastA sequence file fasta.aa into separate sequence files in GCG format.
% efromfasta

 EFROMFASTA of what FastA sequence file(s) ?  fasta.aa

 egmsmg 1217 aa.
 hshua1 129 aa.
 lcbo 230 aa.
 mchu 149 aa.
 musplfm 224 aa.
 mwkw 1966 aa.
 mwrtc1 428 aa.
 gt87 217 aa.
 qrhuld 860 aa.

 Finished EFROMFASTA with 9 files written.
 5420 sequence characters were reformatted.

%
Here is part of the first output file, egmsmg, from the example above:
EGMSMG Epidermal growth factor precursor - Mouse

EGMSMG  Length: 1217  July 29, 1994 14:38  Type: P  Check: 9280  ..

       1  MPWGRRPTWL LLAFLLVFLK ISILSVTAWQ TGNCQPGPLE RSERSGTCAG

      51  PAPFLVFSQG KSISRIDPDG TNHQQLVVDA GISADMDIHY KKERLYWVDV

////////////////////////////////////////////////////////////

    1151  PHIDGMGTGQ SCWIPPSSDR GPQEIEGNSH LPSYRPVGPE KLHSLQSANG

    1201  SCHERAPDLP RQTEPVK
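The sequence part of the output shown above is laid out in blocks of 10 characters, 5 blocks per line, with each line prefixed by the position of its first residue. A minimal Python sketch of that layout follows. The function name gcg_sequence_lines is hypothetical, the column widths are an assumption, and the sketch does not produce the header line or the Check value.

```python
def gcg_sequence_lines(seq, block=10, per_line=5):
    """Return the sequence as numbered lines of space-separated blocks.

    Sketch only: the 8-column position field and two-space gap are
    assumptions about the exact spacing, not taken from a format spec.
    """
    width = block * per_line
    lines = []
    for start in range(0, len(seq), width):
        chunk = seq[start:start + width]
        # Cut the line's residues into blocks of `block` characters.
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        lines.append(f"{start + 1:>8}  {' '.join(blocks)}")
    return lines
```

Applied to the first 100 residues of EGMSMG, this gives two lines numbered 1 and 51 whose blocks match the output file above.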
The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.
DataSet creates a GCG data library from any set of sequences in GCG format. ToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.
Here is part of the input file used for the example above:
>EGMSMG Epidermal growth factor precursor - Mouse
MPWGRRPTWLLLAFLLVFLKISILSVTAWQTGNCQPGPLERSERSGTCAGPAPFLVFSQGKSISRIDPDG
TNHQQLVVDAGISADMDIHYKKERLYWVDVERQVLLRVFLNGTGLEKVCNVERKVSGLAIDWIDDEVLWV
DQQNGVITVTDMTGKNSRVLLSSLKHPSNIAVDPIERLMFWSSEVTGSLHRAHLKGVDVKTLLETGGISV
LTLDVLDKRLFWVQDSGEGSHAYIHSCDYEGGSVRLIRHQARHSLSSMAFFGDRIFYSVLKSKAIWIANK
HTGKDTVRINLHPSFVTPGKLMVVHPRAQPRTEDAAKDPDPELLKQRGRPCRFGLCERDPKSHSSACAEG
YTLSRDRKYCEDVNECATQNHGCTLGCENTPGSYHCTCPTGFVLLPDGKQCHELVS
CPGNVSKCSHGCVLTSDGPRCICPAGSVLGRDGKTCTGCSSPDNGGCSQICLPLRPGSWECDCFPGYDLQ
SDRKSCAASGPQPLLLFANSQDIRHMHFDGTDYKVLLSRQMGMVFALDYDPVESKIYFAQTALKWIERAN
MDGSQRERLITEGVDTLEGLALDWIGRRIYWTDSGKSVVGGSDLSGKHHRIIIQERISRPRGIAVHPRAR
RLFWTDVGMSPRIESASLQGSDRVLIASSNLLEPSGITIDYLTDTLYWCDTKRSVIEMANLDGSKRRRLI
QNDVGHPFSLAVFEDHLWVSDWAIPSVIRVNKRTGQNRVRLQGSMLKPSSLVVVHPLAKPGADPCLYRNG
GCEHICQESLGTARCLCREGFVKAWDGKMCLPQDYPILSGENADLSKEVTSLSNST
QAEVPDDDGTESSTLVAEIMVSGMNYEDDCGPGGCGSHARCVSDGETAECQCLKGFARDGNLCSDIDECV
LARSDCPSTSSRCINTEGGYVCRCSEGYEGDGISCFDIDECQRGAHNCAENAACTNTEGGYNCTCAGRPS
SPGRSCPDSTAPSLLGEDGHHLDRNSYPGCPSSYDGYCLNGGVCMHIESLDSYTCNCVIGYSGDRCQTRD
LRWWELRHAGYGQKHDIMVVAVCMVALVLLLLLGMWGTYYYRTRKQLSNPPKNPCDEPSGSVSSSGPDSS
SGAAVASCPQPWFVVLEKHQDPKNGSLPADGTNGAVVDAGLSPSLQLGSVHLTSWRQKPHIDGMGTGQSC
WIPPSSDRGPQEIEGNSHLPSYRPVGPEKLHSLQSANGSCHERAPDLPRQTEPVK

//////////////////////////////////////////////////////////////////////
FastA format does not differentiate peptide from nucleotide sequences, so to ensure that the output files are written with the correct sequence type, use the -PROtein or -NUCleotide command-line option when running EFromFastA.
FastA format is not rigorously defined, so FastA files from different sources may not have exactly the same format. Please contact us by e-mail at egcg@embnet.org if you encounter problems converting FastA sequences using EFromFastA.
Note: EFromFastA has not been tested thoroughly at the time of this writing, so please examine your results carefully.
When EFromFastA writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not fool-proof, so to ensure that the output files are written with the correct sequence type, use the -PROtein or -NUCleotide command-line option when running EFromFastA.
If EFromFastA is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.
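When many files have been written, the Type: field can also be checked with a small script. The Python sketch below is not part of the Wisconsin Package, and the function name gcg_sequence_type is hypothetical. It assumes the last heading line ends with two periods (..), as in the example output file.

```python
import re


def gcg_sequence_type(text):
    """Return 'N' or 'P' from the heading line that ends in '..', else None.

    Sketch only: assumes the heading line carries a 'Type: N' or
    'Type: P' field and is terminated by '..'.
    """
    for line in text.splitlines():
        if line.rstrip().endswith(".."):
            match = re.search(r"Type:\s*([NP])\b", line)
            if match:
                return match.group(1)
    return None
```

Reading each output file and passing its text to this function gives a quick way to spot files that were assigned bp when aa was expected, or the reverse.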
If the sequence type was incorrectly assigned, turn to Appendix VI for information on how to change or set the type of a sequence.
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimal Syntax: % efromfasta [-INfile=]fasta.aa -Default

Prompted Parameters: None

Local Data Files: None

Optional Switches:

-PROtein                    insists that the input sequences are proteins
-NUCleotide                 insists that the input sequences are nucleic acids
-LIStfile[=fromfasta.list]  writes a list file of output sequence names
EFromFastA uses no local data files.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.
-PROtein and -NUCleotide set the program to expect either protein or nucleic acid sequences. Normally, EFromFastA determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line options, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
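The composition test described above can be sketched in Python to show why it sometimes misclassifies a sequence. This is an illustration, not the program's actual code. The function name guess_sequence_type is hypothetical, and the set of codes is the standard IUPAC nucleotide alphabet, which is assumed here to match the list in Appendix III.

```python
# Assumption: the standard IUPAC nucleotide codes; Appendix III may differ.
IUPAC_NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN")


def guess_sequence_type(seq, sample_size=300):
    """Return 'N' if the first sample_size letters are all nucleotide codes, else 'P'."""
    # Only alphabetic characters count; digits, spaces and gaps are skipped.
    letters = [c.upper() for c in seq if c.isalpha()][:sample_size]
    if all(c in IUPAC_NUCLEOTIDE_CODES for c in letters):
        return "N"
    return "P"
```

Because every nucleotide code is also a valid amino acid code, a peptide whose first 300 residues happen to use only those letters would be classed as nucleic acid under this rule, which is the case the -PROtein switch is meant to cover.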
-LIStfile writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequences, of the User's Guide). If you don't specify a filename, then EFromFastA makes one up using efromfasta for the filename and .list for the filename extension.
Printed: April 22, 1996 15:52 (1162)