Efromfasta

EFROMFASTA*

FUNCTION

EFromFastA reformats one or more sequences from FastA format into individual files in GCG format.

Use EFromFastA when you want to convert sequences that are in FastA format into a format suitable for use with programs in the Wisconsin Sequence Analysis Package(TM). A FastA file may contain many sequences; in that case EFromFastA writes one output file for each sequence in the FastA file. Each output file is named according to the first word (following the > character) on the documentation line just above the sequence data in the FastA file. The documentation line from the FastA input file(s) is preserved in the GCG output file(s). EFromFastA can convert sets of FastA sequence files that are specified with a file of filenames or with multiple file specification syntax.
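
The splitting and naming rule described above can be illustrated with a short Python sketch. This is not the EFromFastA source code; the function name read_fasta is hypothetical, the lower-casing of names is inferred from the example session in this document, and the second sample record is invented purely for illustration.

```python
import io

def read_fasta(handle):
    """Yield (name, documentation, sequence) for each record in a FastA stream."""
    name, doc, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new documentation line closes the previous record, if any.
            if name is not None:
                yield name, doc, "".join(chunks)
            doc = line[1:].strip()
            # The first word after ">" names the record and its output file.
            # Lower-casing is an assumption based on the example session.
            name = doc.split()[0].lower() if doc else "unnamed"
            chunks = []
        elif name is not None:
            # Remove whitespace inside sequence lines; blank lines add nothing.
            chunks.append("".join(line.split()))
    if name is not None:
        yield name, doc, "".join(chunks)

# Illustrative input: the second record is made up for this sketch.
sample = io.StringIO(
    ">EGMSMG Epidermal growth factor precursor - Mouse\n"
    "MPWGRRPTWLLLAFLLVFLK\n"
    "ISILSVTAWQ\n"
    ">LCBO example second record\n"
    "ACDEFGHIK\n"
)
records = list(read_fasta(sample))
for name, doc, seq in records:
    print(name, len(seq), "aa.")
```

Each tuple carries the documentation line unchanged, so a caller could write it at the top of the corresponding output file, as EFromFastA does.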

AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).

EXAMPLE

Here is a session using EFromFastA to convert the FastA sequence file fasta.aa into separate sequence files in GCG format.

  
  
  %efromfasta
  
   EFROMFASTA of what FastA sequence file(s) ?  fasta.aa
  
   egmsmg  1217 aa.
   hshua1  129 aa.
   lcbo  230 aa.
   mchu  149 aa.
   musplfm  224 aa.
   mwkw  1966 aa.
   mwrtc1  428 aa.
   gt87  217 aa.
   qrhuld  860 aa.
  
   Finished EFROMFASTA with 9 files written.
   5420 sequence characters were reformatted.
  
  %

OUTPUT

Here is part of the first output file, egmsmg, from the example above:

  
  
  EGMSMG Epidermal growth factor precursor - Mouse
    EGMSMG  Length: 1217  July 29, 1994 14:38  Type: P  Check: 9280  ..
  
    1  MPWGRRPTWL LLAFLLVFLK ISILSVTAWQ TGNCQPGPLE RSERSGTCAG
  
   51  PAPFLVFSQG KSISRIDPDG TNHQQLVVDA GISADMDIHY KKERLYWVDV
  
 ////////////////////////////////////////////////////////////
  
 1151  PHIDGMGTGQ SCWIPPSSDR GPQEIEGNSH LPSYRPVGPE KLHSLQSANG
  
 1201  SCHERAPDLP RQTEPVK
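
The layout of the sequence lines shown above (50 residues per line, in blocks of 10, each line preceded by the position of its first residue) can be approximated with the Python sketch below. The function name gcg_sequence_lines and the width of the position field are assumptions made for illustration; the sketch does not write the heading line or compute the checksum, and it omits the blank lines between rows.

```python
def gcg_sequence_lines(seq, per_line=50, block=10, width=5):
    """Return sequence rows laid out like the GCG output example."""
    lines = []
    for start in range(0, len(seq), per_line):
        row = seq[start:start + per_line]
        # Split the row into space-separated blocks of `block` residues.
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        # Right-align the 1-based position of the first residue in the row.
        lines.append(f"{start + 1:>{width}}  " + " ".join(blocks))
    return lines

# First 70 residues of EGMSMG, taken from the input file in this document.
seq = "MPWGRRPTWLLLAFLLVFLKISILSVTAWQTGNCQPGPLERSERSGTCAGPAPFLVFSQGKSISRIDPDG"
lines = gcg_sequence_lines(seq)
for line in lines:
    print(line)
```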

RELATED PROGRAMS

The following programs convert sequences between other formats and GCG format: FromEMBL, FromGenBank, FromIG, FromPIR, FromStaden, FromFastA, ToIG, ToPIR, ToStaden and ToFastA.

DataSet creates a GCG data library from any set of sequences in GCG format. ToBLAST creates a database that can be searched by the BLAST program from any set of sequences in GCG format.

INPUT FILE

Here is part of the input file used for the example above:

  
  
  >EGMSMG Epidermal growth factor precursor - Mouse
  
  MPWGRRPTWLLLAFLLVFLKISILSVTAWQTGNCQPGPLERSERSGTCAGPAPFLVFSQGKSISRIDPDG
  
  TNHQQLVVDAGISADMDIHYKKERLYWVDVERQVLLRVFLNGTGLEKVCNVERKVSGLAIDWIDDEVLWV
  
  DQQNGVITVTDMTGKNSRVLLSSLKHPSNIAVDPIERLMFWSSEVTGSLHRAHLKGVDVKTLLETGGISV
  
  LTLDVLDKRLFWVQDSGEGSHAYIHSCDYEGGSVRLIRHQARHSLSSMAFFGDRIFYSVLKSKAIWIANK
  
  HTGKDTVRINLHPSFVTPGKLMVVHPRAQPRTEDAAKDPDPELLKQRGRPCRFGLCERDPKSHSSACAEG
  YTLSRDRKYCEDVNECATQNHGCTLGCENTPGSYHCTCPTGFVLLPDGKQCHELVS
  
  CPGNVSKCSHGCVLTSDGPRCICPAGSVLGRDGKTCTGCSSPDNGGCSQICLPLRPGSWECDCFPGYDLQ
  
  SDRKSCAASGPQPLLLFANSQDIRHMHFDGTDYKVLLSRQMGMVFALDYDPVESKIYFAQTALKWIERAN
  
  MDGSQRERLITEGVDTLEGLALDWIGRRIYWTDSGKSVVGGSDLSGKHHRIIIQERISRPRGIAVHPRAR
  
  RLFWTDVGMSPRIESASLQGSDRVLIASSNLLEPSGITIDYLTDTLYWCDTKRSVIEMANLDGSKRRRLI
  
  QNDVGHPFSLAVFEDHLWVSDWAIPSVIRVNKRTGQNRVRLQGSMLKPSSLVVVHPLAKPGADPCLYRNG
  GCEHICQESLGTARCLCREGFVKAWDGKMCLPQDYPILSGENADLSKEVTSLSNST
  
  QAEVPDDDGTESSTLVAEIMVSGMNYEDDCGPGGCGSHARCVSDGETAECQCLKGFARDGNLCSDIDECV
  
  LARSDCPSTSSRCINTEGGYVCRCSEGYEGDGISCFDIDECQRGAHNCAENAACTNTEGGYNCTCAGRPS
  
  SPGRSCPDSTAPSLLGEDGHHLDRNSYPGCPSSYDGYCLNGGVCMHIESLDSYTCNCVIGYSGDRCQTRD
  
  LRWWELRHAGYGQKHDIMVVAVCMVALVLLLLLGMWGTYYYRTRKQLSNPPKNPCDEPSGSVSSSGPDSS
  
  SGAAVASCPQPWFVVLEKHQDPKNGSLPADGTNGAVVDAGLSPSLQLGSVHLTSWRQKPHIDGMGTGQSC
  WIPPSSDRGPQEIEGNSHLPSYRPVGPEKLHSLQSANGSCHERAPDLPRQTEPVK
  
  //////////////////////////////////////////////////////////////////////

RESTRICTIONS

FastA format does not differentiate peptide from nucleotide sequences, so to ensure that the output files are written with the correct sequence type, use the -PROtein or -NUCleotide command-line option when running EFromFastA.

FastA format is not rigorously defined, so FastA files from different sources may not have exactly the same format. Please contact us by e-mail at egcg@embnet.org if you encounter problems converting FastA sequences using EFromFastA.

Note: EFromFastA has not been tested thoroughly at the time of this writing, so please examine your results carefully.

SEQUENCE TYPE

When EFromFastA writes GCG sequence files, it assigns the sequence type based on the composition of the sequence characters. This method is not fool-proof, so to ensure that the output files are written with the correct sequence type, use the -PROtein or -NUCleotide command-line option when running EFromFastA.

If EFromFastA is run interactively, you can watch the program monitor to see if the sequences are assigned the correct type. As each new file is written, its name and the number of bases (bp) or amino acids (aa) appear on the screen. If the wrong abbreviation appears (for example, bp appears for a protein sequence), the sequence file was assigned the wrong type. The sequence type also appears in the sequence file. Look on the last line of the text heading just above the sequence itself for Type: N or Type: P.

If the sequence type was incorrectly assigned, turn to Appendix VI for information on how to change or set the type of a sequence.
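
When many output files have to be checked, the Type: field can be read from each file with a few lines of Python. The sketch below assumes that the heading ends with the line that finishes in two periods (..), as in the output example in this document; the function name gcg_sequence_type is hypothetical and is not part of any GCG program.

```python
import re

def gcg_sequence_type(text):
    """Return "N", "P", or None from the heading line that ends in ".."."""
    for line in text.splitlines():
        # The last heading line is the one that ends with two periods.
        if line.rstrip().endswith(".."):
            match = re.search(r"\bType:\s*([NP])\b", line)
            return match.group(1) if match else None
    return None

# Heading copied from the output example in this document.
example = """EGMSMG Epidermal growth factor precursor - Mouse
  EGMSMG  Length: 1217  July 29, 1994 14:38  Type: P  Check: 9280  ..

    1  MPWGRRPTWL LLAFLLVFLK ISILSVTAWQ TGNCQPGPLE RSERSGTCAG
"""
print(gcg_sequence_type(example))
```

The function returns None when no heading line or no Type: field is found, so files that need a closer look are easy to pick out.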

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimal Syntax: % efromfasta [-INfile=]fasta.aa -Default
  
  Prompted Parameters: None
  
  Local Data Files: None
  
  Optional Switches:
  
  -PROtein                    insists that the input sequences are proteins
  -NUCleotide                 insists that the input sequences are nucleic acids
  -LIStfile[=efromfasta.list] writes a list file of output sequence names

LOCAL DATA FILES

None.

OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.

-PROtein and -NUCleotide

sets the program to expect either protein or nucleic acid sequences. Normally, EFromFastA determines whether an input sequence is protein or nucleic acid by looking at its composition. If the first 300 alphabetic characters in a sequence are composed entirely of IUB-IUPAC nucleotide codes (see Appendix III), it is reformatted as a nucleic acid sequence in GCG format; otherwise it is reformatted as a protein sequence. Using these command-line options, you can insist that your sequences are proteins (-PROtein) or nucleic acids (-NUCleotide).
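
The composition rule described above can be sketched in Python as follows. This is an illustration, not the program's own code. The set of codes is the standard IUPAC nucleotide alphabet and is an assumption here, since the table in Appendix III may differ; the function name guess_sequence_type and the handling of an empty sequence are likewise assumptions.

```python
# Standard IUPAC nucleotide codes (assumed; Appendix III may list others).
IUPAC_NUCLEOTIDE_CODES = set("ACGTUMRWSYKVHDBN")

def guess_sequence_type(seq, sample_size=300):
    """Return "N" if the first alphabetic characters are all nucleotide codes, else "P"."""
    # Keep only letters, then look at no more than the first `sample_size`.
    letters = [c.upper() for c in seq if c.isalpha()][:sample_size]
    # An empty sequence is treated as protein here; that choice is arbitrary.
    if letters and all(c in IUPAC_NUCLEOTIDE_CODES for c in letters):
        return "N"
    return "P"

print(guess_sequence_type("ACGTACGTNNRY"))
print(guess_sequence_type("MPWGRRPTWL"))
print(guess_sequence_type("GATTACA"))
```

The third call shows why the method is not fool-proof: a short peptide made up only of letters that are also nucleotide codes is classified as a nucleic acid, which is the case the -PROtein option is meant to cover.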

-LIStfile=efromfasta.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequences, in the User's Guide). If you don't specify a filename, EFromFastA makes one up using efromfasta for the filename and .list for the filename extension.

Printed: April 22, 1996 15:52 (1162)