Embltogcgsc

EMBLTOGCGSC


FUNCTION

EMBLToGCGSC is the Sanger Centre's modification of GCG's EMBLtoGCG, which reformats EMBL and SWISS-PROT flat sequence files into GCG data libraries.


DESCRIPTION

This is the utility the Sanger Centre uses to update EMBL-distributed data files. It accepts one or more EMBL flat distribution files and creates one new GCG data library for each distribution file.

EMBLToGCGSC normally writes the output files directly into directory EMBLDir. You can redirect the output from the command line with the option -DIRectory.

With this program you must supply a release number (such as 24.0), the month of the release, and the last two digits of the year of the release. These items can be set from the command line using the qualifiers -RELease, -MONth, and -YEAr, respectively.
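
As a rough sketch of how these three values fit together on one command line, the POSIX shell function below only assembles and prints a command; it does not run the program. The function name embl_cmd is hypothetical. The qualifiers are the ones documented in this manual, with -Default used to suppress the prompts.

```sh
# Hypothetical helper: print (do not run) a non-interactive EMBLToGCGSC command.
# Arguments: input file specification, release number, two-digit year, month.
embl_cmd() {
    printf 'embltogcgsc -INfile=%s -RELease=%s -YEAr=%s -MONth=%s -Default\n' \
        "$1" "$2" "$3" "$4"
}
```

For example, embl_cmd 'TapeDir:*.dat' 46.0 96 3 prints a command line that you can inspect before entering it yourself.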

This utility can also format the EMBL-distributed SWISS-PROT data bank when you use the command line option -SWISSprot.

Extensions to GCG's EMBLtoGCG are: (a) keeping the ".." syntax of the feature table so that SRS retrieval can be in EMBL format, (b) specifying an alternative name for swissprot updates so that the database file does not have to be called "swissprot", (c) changing the order of items in the definition line so that the species comes after the description and is visible in FASTA and BLAST output.

Data Libraries and Data Farms

GCG database formatting utilities create GCG data libraries, one for each data distribution file. A data library consists of seven files in the same directory. The seven files share the same base name and have the following extensions:

  
  .seq         contains the sequence data
  .ref         contains the sequence references
  .header      contains data library format information
  .seqcat      contains definitions for each sequence
  .offset      is an index file
  .names       is an index file
  .numbers     is an index file
  

Data libraries can be grouped together into a data farm. A data farm is a file that contains a list of the names of the data libraries in the farm. Sequences from any data library within a farm are referenced using the logical name for the data farm. The farm files distributed with the Wisconsin Package are in the directory GenRunData. All farm filenames end with the extension .farm.

Logical Names

To use a data library, you must have a defined logical name that translates to the complete file specification of each file without a filename extension. For example, if there is a data library in directory DataDir whose files share the base name Globin, there must be a defined logical name "Globin" that translates to "DataDir:Globin".

Logical names for all the GCG supplied data farms and data libraries are defined in the file GenScript:gcguserspecs. Site-specific data farms and data libraries should be defined in the file GenScript:siteuserspecs.

If a database distribution consists of more than one distribution file, each distribution file must be mapped, in the file GenRunData:dbnames.map, to both its data library name and the logical name of the farm file to which it belongs.
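
A first sanity check of the map can be sketched in POSIX shell as follows. The layout of dbnames.map is not described here, so this sketch assumes only that it is a plain-text file in which the base name of each distribution file appears literally. The function name check_map is hypothetical, and you pass it the path of the map file yourself.

```sh
# Hypothetical check: report distribution files whose base name does not
# appear anywhere in the map file. ASSUMPTION: the map is plain text.
# Usage: check_map MAPFILE FILE.dat [FILE.dat ...]
check_map() {
    map=$1
    shift
    missing=0
    for f in "$@"; do
        base=$(basename "$f" .dat)
        # -i: ignore case, -q: no output, -F: fixed string rather than a regex
        if ! grep -i -q -F -- "$base" "$map"; then
            echo "not in map: $base"
            missing=1
        fi
    done
    return $missing
}
```

Because this is a substring search, a short base name can match inside a longer word. A clean result therefore does not prove that the mapping is correct, only that the name is present.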


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is a session with EMBLToGCGSC for EMBL release 46.0 (March 1996). Note that before running EMBLToGCGSC, you must verify that the name of each flat file is mapped to a data library name and to a data farm logical name in the file GenRunData:dbnames.map.

  
  
  % embltogcgsc
  
    What EMBL flat files ?  TapeDir:*.dat
  
    What is the release number (* 1.0 *) ?  46.0
  
    What is the release year (* 90 *) ?  96
  
    What is the release month (* 8 *) ?  3
  
   Fun ..
  
   Input Entries:         3,154
   Output Entries:        3,154
     Total Length:    5,206,150
              CPU:     02:01.46
  
   Inv
  
    ////////
  
    EMBLToGCGSC complete:
  
  Input Entries:       3,155
 Output Entries:       3,155
   Total Length:   4,554,339
            CPU:    33:28:14
  
  
  %
  


OUTPUT

EMBLToGCGSC creates six files for each data library. Each file in a data library uses the same base name and has one of the following file name extensions: .seq, .ref, .names, .numbers, .offset, or .header. You cannot edit or change these files in any way.
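
After a run, you may want to confirm that all six files are present for a library. The POSIX shell sketch below does only that; it never opens or changes the files. The function name check_library is hypothetical, and the directory is given as an ordinary path.

```sh
# Hypothetical helper: verify that the six files listed above exist for
# one data library. Usage: check_library DIRECTORY BASENAME
check_library() {
    dir=$1
    base=$2
    rc=0
    for ext in seq ref names numbers offset header; do
        if [ ! -f "$dir/$base.$ext" ]; then
            echo "missing: $dir/$base.$ext"
            rc=1
        fi
    done
    return $rc
}
```

The function returns a nonzero status if any file is missing, so it can be used in a shell conditional.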


RELATED PROGRAMS

GenBankToGCG creates GCG data libraries from GenBank distribution files. PIRToGCG creates a GCG data library from a PIR sequence file. AccessionNumbers compiles accession numbers from sequences in either GenBank or EMBL format. DBIndex generates the index files needed to access entries in a GCG database. The input file specification to DBIndex is one or more database sequence (.seq) files, such as Globin.Seq. SeqCat creates files of definitions for the Data Files manual and for the definition search performed by StringSearch. The input consists of the .seq files for the data libraries you want to catalog.


RESTRICTIONS

The .seq, .names, .numbers, and .offset files are unformatted binary files that cannot be typed out on your terminal screen.

WARNING: Even though the .names, .numbers, and .offset files appear to be formatted files, they are not, and they must not be edited with a text editor!

The architecture of GCG databases requires that all of the files that make up a database be in the same directory and share the same base name. These files are differentiated only by their filename extensions.


EXCLUDING ENTRIES

Most of the data in EMBL and GenBank is the same; only the format is different. You can avoid 80 percent of this redundancy by using the utility AccessionNumbers to compile the accession numbers from either the EMBL or the GenBank data libraries into a file. You can then exclude the entries in the other library whose primary accession numbers collide with any of the accession numbers you compiled from the first library. Use the command line option -EXCLude to name the file of accession numbers you want to exclude.
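
The collision rule can be illustrated with a small POSIX shell filter. This is not how EMBLToGCGSC implements -EXCLude, and the format of the exclusion file is not specified here; the sketch assumes two plain-text files with one accession number per line. The function name keep_accessions is hypothetical.

```sh
# Hypothetical filter: print the accession numbers from CANDIDATES that are
# NOT present in EXCLUDE. ASSUMPTION: one accession number per line.
# Usage: keep_accessions EXCLUDE CANDIDATES
keep_accessions() {
    # -v: invert the match, -x: match whole lines only,
    # -F: fixed strings, -f: read the patterns from a file
    grep -v -x -F -f "$1" "$2"
}
```

Matching whole lines keeps an accession number from being dropped merely because a shorter excluded number is a prefix of it.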


CONSIDERATIONS

To use a data library, you must have a logical name defined that translates to the complete file specification of each file without the file name extension.

If only one input file is specified, EMBLToGCGSC does not read GenRunData:dbnames.map to determine the base file name for files in the data library. Instead, the base name of the files in the data library is the same as the base name of the input file.


SORTING

For each data library, if any of the sequence identifiers are out of alphabetical order, EMBLToGCGSC creates a process that sorts the .names file. Similarly, if any of the accession numbers are out of alphabetical order, EMBLToGCGSC creates a process that sorts the .numbers file.
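
To see in advance whether a list of identifiers is already in order, a check such as the POSIX shell sketch below may help. It applies to a plain-text list with one identifier per line, not to the .names or .numbers files themselves. The collating order that EMBLToGCGSC uses is not stated here, so plain byte order is an assumption, and the function name is_sorted is hypothetical.

```sh
# Hypothetical check: succeed only if the lines of FILE are already in
# ascending byte order. Usage: is_sorted FILE
is_sorted() {
    # -c checks the order without writing output;
    # LC_ALL=C selects plain byte order (an assumption, see above)
    LC_ALL=C sort -c "$1" 2>/dev/null
}
```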


NUMBER OF ENTRIES

If an input entry has more than 350,000 symbols, EMBLToGCGSC divides it into more than one output entry. Each additional output entry has a suffix ('-2', and so on) appended to the input entry's name. Because of this, the number of output entries may be greater than the number of input entries.
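
The effect on entry counts can be sketched in POSIX shell. The sketch assumes that each output entry holds at most 350,000 symbols; the manual states only the threshold, not the size of the pieces, so treat the piece count as illustrative. The function name split_names is hypothetical.

```sh
# Hypothetical illustration: print the output entry names for an input
# entry of a given length. ASSUMPTION: pieces hold at most 350,000 symbols.
# Usage: split_names NAME LENGTH
split_names() {
    name=$1
    len=$2
    limit=350000
    pieces=$(( (len + limit - 1) / limit ))   # ceiling division
    if [ "$pieces" -lt 1 ]; then
        pieces=1                              # an empty entry still counts once
    fi
    echo "$name"                              # the first piece keeps the name
    i=2
    while [ "$i" -le "$pieces" ]; do
        echo "${name}-${i}"                   # extra pieces get -2, -3, ...
        i=$((i + 1))
    done
}
```

Under that assumption, an entry of exactly 350,000 symbols yields one name, and an entry of 350,001 symbols yields two.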


SUGGESTIONS

Here is a simple protocol for installing EMBL:

1. Read the files from their distribution tape onto disk. (At the time of this writing, the only tapes available from EMBL were for VMS.)

2. Rename or delete any distribution files (they have a .dat file extension) that you don't want to format.

3. Use Fetch to retrieve file embl.csh from the GCG Package and change the mode of the file to be executable; for example, % chmod +x embl.csh

4. Modify this shell script to specify the sequence distribution file, the release number, the release month, and the release year.

5. Look in the file GenRunData:dbnames.map to verify that the name of each data distribution file being formatted appears, and that the name is mapped to both a data library logical name and a data farm logical name.

6. Run embl.csh in the background by entering % embl.csh >& embl.log &

7. Read the log file to make sure the process completed normally.

NOTE: To minimize the disk requirements, embl.csh deletes all of the files in directory EMBLDir and restores the new versions. Therefore, EMBL is unavailable during the time it takes EMBLToGCGSC to reformat the new flat files (approximately two hours).
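
For step 7 of the protocol above, a quick first look at the log can be sketched in POSIX shell. The check searches for the completion banner shown in the EXAMPLE session. Whether that banner reaches the log depends on how embl.csh runs the program (for instance, a summary suppressed by -Default may need -SUMmary to appear), so this is an assumption, and the function name run_completed is hypothetical.

```sh
# Hypothetical check for step 7: succeed if the log contains the completion
# banner shown in the EXAMPLE session. Usage: run_completed LOGFILE
run_completed() {
    # -q: no output, -F: fixed string rather than a regex
    grep -q -F 'EMBLToGCGSC complete' "$1"
}
```

A failed check is a reason to read the log in full, not proof that the run failed.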


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.

  
  
  Minimal Syntax: % embltogcgsc [-INfile=]TapeDir:*.dat -Default
  
  Prompted Parameters:
  
  -RELease=24.0             release number (defaults to 1.0)
  -YEAr=90                  release year (defaults to current year)
  -MONth=8                  release month (defaults to current month)
  
  Command Line Options:
  
  -DIRectory=DirName        writes in a directory other than EMBLDir
  -EXCLude=genbank.exclude  is an optional file of excluded accession numbers
  -SWISSprot                formats the SwissProt or SwissNew database
  -SWISSNAME=swissprot      sets the output name for protein data
  -PROtein                  sets the type of the database to protein
  -LN=EMBL                  defines the long name (defaults to output filename)
  -SN=EM                    defines the short name (defaults to long name)
  -NOBINary                 stores sequences as ASCII text in the .Seq file
  -NOMONitor                suppresses the screen monitor
  -NOSUMmary                suppresses the screen summary
  


LOCAL DATA FILES

None.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the User's Guide.

-EXCLude=GenBankDir:genbank.exclude

excludes entries in each data library whose primary accession number is represented in the file you have specified.

-DIRectory=test

directs the output to another directory. (Usually, EMBLToGCGSC writes the output data libraries into directory EMBLDir.)

-SWISSprot

formats the SWISS-PROT protein sequence database distribution files. By default, the GCG formatted data library files are placed in directory SwissProtDir.

-MONitor

This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default qualifier to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

Use this qualifier also to include a summary of the program's work in the log file for a program run in batch.

-NOBINary

forces sequences to be stored as ASCII text in the .seq file. The default is to store sequences in binary format, which reduces the amount of disk storage occupied by the data library.

-LN=Globin

defines the long logical name that refers to the output library. This option is only valid when formatting individual data libraries. By default, it is the same as the base name of the output library.

-SN=Gl

defines the short logical name that refers to the output library. This option is only valid when formatting individual data libraries. By default, it is the same as the long logical name.

Printed: April 22, 1996 15:53 (1162)