Short Descriptions

The Short Descriptions section tells you what tools are available in the EGCG Package and contains a one or two sentence description of every EGCG command.

You run GCG and EGCG programs by typing the name of the program next to the % (percent sign) UNIX prompt. The text next to the % in each description below is the command that you enter to run that program; you don't type the percent sign. In any of a program's qualifiers, the bold type (also called the typewriter font) indicates the minimum number of characters that you must type on the command line. You must type the full UNIX program name; partial names are ignored.

Whenever the word file(s) or sequence(s) appears in text, the (s) means that the program works on one or more sequences simultaneously. (See the Introduction section of the GCG users guide for more information about GCG documentation conventions.)

Initializing the EGCG programs

The % gcg command is used at login to initialize the GCG Package. The % egcg command is used in the same way to initialize the EGCG Package. The EGCG Package is ready to run when a banner appears on your screen that looks something like this:

  
  
         Welcome to the EGCG extensions to the WISCONSIN PACKAGE
                     Version 8.1, November 1995
                          Installed on irix
  
Copyright 1982, 1983, 1984, 1985, 1986, 1987, 1989, 1991, 1992, 1994
         Genetics Computer Group, Inc.  All rights reserved.
  
   Published research assisted by this software should always cite:
  
              Program Manual for the Wisconsin Package,
           Version 8, August 1994, Genetics Computer Group,
          575 Science Drive, Madison, Wisconsin, USA  53711
  
                and also for the Extension programs:
  
     Program Manual for the EGCG Package, Peter Rice, The Sanger Centre,
               Hinxton Hall, Cambridge, CB10 1RQ, England.
  
      Additional code by Peter Rice, The Sanger Centre, Hinxton, England
                 and other members of the EGCG team.
  
    Help is available with the commands % egenhelp and % egenmanual

EDITING

The NewFeatures program is used by the EMBL Data Library to edit feature tables and to update sequence database entries. It is also a useful tool for maintaining your own version of a feature table, or for exploring large feature tables.

% newfeatures

NewFeatures is an interactive editor for entering and modifying the feature table and for minor editing of the sequence itself.

FRAGMENT ASSEMBLY

The Fragment Assembly programs provide additional functions for the GCG Fragment Assembly system.

% gelstatus

GelStatus reads a GCG Fragment Assembly database, and produces a summary report of the quality of each contig.

% gelpicture

GelPicture reads a contig from the Fragment Assembly database and displays a diagram of the gel alignments and a printout of the aligned gel sequences and consensus. GelPicture has been modified to include the sequence direction in both sections of the output, and to mark with '=======' any consensus sequence that is correct (agrees with every fragment) and has been sequenced in both directions.

% gelfigure

GelFigure produces a graphical report of the status of a contig in a fragment assembly project.

% gelanalyze

GelAnalyze reads a GelStatus report from a shotgun project, and produces project statistics by the method of Lander and Waterman.

MAPPING

A new program makes the task of creating your own restriction enzyme file for GCG programs easy.

% mapselect

MapSelect selects restriction enzymes by name or by their ability to cut a given sequence, and writes them to a new enzyme file for use in other programs.

% efingerprint

EFingerPrint identifies the products of T1 ribonuclease digestion. EFingerPrint is a version of GCG's old FingerPrint with command line control.

COMPARISON

Command line control is added to GCG's program Diverge.

% ediverge

EDiverge is a version of Diverge with command line control. Diverge measures the percent divergence of two protein coding sequences using the method of Perler and Efstratiadis.

% eoverlap

EOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison. EOverlap is an extended version of GCG's Overlap for use in database nonredundancy checks, together with the FilterOverlap program.

% bigeoverlap

BigEOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison. BigEOverlap is a version of EOverlap with a very high limit on total sequence length for genome scale sequence analysis, but it is too large for general use on most systems. Like EOverlap, it is intended for use in database nonredundancy checks, together with the FilterOverlap program.

% filteroverlap

FilterOverlap reads the output file from EOverlap and filters out only those overlaps which meet specified values when the alignments are built. Output from GCG's Overlap program may also be used, but only if generated from a self comparison of a single database.

DATABASE SEARCHING

Output from (T)Fasta can be screened for significance. TWordSearch searches can compare a protein sequence to the nucleotide databases. The EQuickSearch program can run far faster with far smaller memory requirements, and output can be screened for the best hits using QuickMatch.

% fastacheck

FastaCheck selects significant alignments from a (T)Fasta output file.

% twordsearch

TWordSearch identifies DNA sequences similar to a protein query sequence using a six frame translation of the database and a Wilbur and Lipman-style search. The output is a list of significant diagonals whose alignments can be displayed with TSegments.

% tsegments

TSegments aligns and displays the segments of similarity found by TWordSearch.

% equicksearch

EQuickSearch rapidly identifies places where query sequence(s) occur in a nucleotide sequence database. The output is a file of overlaps that can be displayed with QuickMatch or EQuickShow. You can make up your own sequence database or use GenEMBL, which consists of GenBank and those sequences in EMBL that are not represented in GenBank (or the other way around at some sites).

% quickmatch

QuickMatch displays the overlaps found by EQuickSearch with either optimal alignments or dot-plots. The alignments can be selected by quality to discard poor matches. The dot-plots can be reviewed rapidly with a graphic screen.

% equickindex

EQuickIndex builds hash tables from sequence(s) in data libraries, and stores them as map sections. These tables make up the database that is searched by EQuickSearch.

% newfetch

NewFetch copies GCG sequences, fragments, or data files from the GCG database into your directory or displays them on your terminal screen. It also allows you to specify a sequence range.

% stssearch

StsSearch looks for primer pairs in a set of sequences.

% rfindpatterns

RFindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. The output is a series of files called r1.rfind, r2.rfind, and so on, each containing a single extracted sequence. These can be fed through Pileup or manipulated in other ways.

% patternplot

PatternPlot produces a graphical representation of the results of GCG's FindPatterns program.

MULTIPLE SEQUENCE ANALYSIS

The first four programs in this section allow you to display multiple sequence alignments. The last three programs are modified versions of the GCG profile programs, supporting automatic translation of nucleotide sequence database entries, and modifications to allow searches of far larger databases.

% prettyplot

PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment, it simply displays it.

% prettybox

PrettyBox displays multiple sequence alignments as shaded boxes in Postscript format (i.e., the output file must be printed and/or displayed on a Postscript-compatible device). PrettyBox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.

% plotalign

PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.

% pepallwindow

PepAllWindow plots measures of protein hydrophobicity according to the method of Kyte and Doolittle.

% eplotsimilarity

EPlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

% polydot

PolyDot compares two sets of sequences, draws a dotplot for each pair of sequences, and reports all identical matches of a specified length.

% tprofilesearch

TProfileSearch uses a profile (representing a group of aligned protein sequences) as a probe to search the nucleotide database for new sequences with possible protein products having some similarity to the group. The profile is created with the program ProfileMake.

% tprofilesegments

TProfileSegments makes optimal alignments showing the segments of similarity found by TProfileSearch.

% tprofilegap

TProfileGap makes an optimal alignment between a profile and a sequence.

% profileplot

ProfilePlot produces a graphical report of the frequency of patterns in a protein or nucleotide sequence.

% sortconsensus

SortConsensus identifies the strong consensus regions of an alignment in an MSF file and reports them in sorted order.

% elineup

ELineUp is a screen editor for editing multiple sequence alignments. You can edit up to 500 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

% multalign

MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.

% eclustalw

EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.

% clustree

ClusTree computes a phylogenetic tree according to the Neighbor-Joining Method of Saitou and Nei (1987). This is part of the original ClustalW distribution, modified for inclusion in EGCG. The tree will be displayed graphically.

% profalign

ProfAlign takes two existing alignments (or single sequences) and aligns them with each other. The result is one larger alignment. This is part of the original ClustalW distribution, modified for inclusion in EGCG.

EVOLUTIONARY ANALYSIS

These programs measure the pairwise homologies between a set of sequences, and provide a conversion to the format required by the Phylip program.

% homologies

Homologies makes a table of the pair-wise distances within a group of aligned sequences.

% tophylip

ToPhylip writes GCG sequences into a single file in Phylip format.

% phylip2tree

Phylip2Tree displays, in GCG style, trees computed with one of the PHYLIP programs.

PATTERN RECOGNITION

Searching for inverted repeats in DNA sequences is now provided. Some GCG programs in this section now have command line control added.

% palindrome

Palindrome searches for perfect inverted repeats in a nucleic acid sequence.
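
A perfect inverted repeat is a stretch of sequence followed, after an optional loop, by its own reverse complement. The following is a minimal sketch of such a search, not Palindrome's actual implementation. The function names and the `stem` and `max_loop` parameters are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; names and parameters are assumptions,
# not Palindrome's qualifiers.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def perfect_inverted_repeats(seq, stem=4, max_loop=3):
    """Find perfect inverted repeats with a fixed stem length.

    Returns (left_start, right_start, left_stem) tuples using 0-based
    positions. The loop between the two halves may be 0..max_loop bases.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for loop in range(max_loop + 1):
            j = i + stem + loop
            # A slice that runs off the end is shorter than the target,
            # so it can never compare equal.
            if seq[j:j + stem] == target:
                hits.append((i, j, left))
    return hits
```

For example, the EcoRI site GAATTC is reported as a 3-base stem with no loop by `perfect_inverted_repeats("GAATTC", stem=3, max_loop=0)`.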

% ewindow

EWindow is a version of Window with command line control. Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.

% estatplot

EStatPlot is a version of StatPlot with command line control. StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.

% ecodonfrequency

ECodonFrequency tabulates codon usage from sequences and/or existing codon usage tables. The output file is correctly formatted for input to the CodFish, CodonPreference, Correspond, and Frames programs.

ECodonFrequency is a modified version of GCG version 7's CodonFrequency with command line control added.

% econsensus

EConsensus calculates a consensus sequence for a set of pre-aligned short nucleic acid sequences by tabulating the percent of G, A, T, and C for each position in the set. GCG's FitConsensus uses the EConsensus output table as a probe to search for the best examples of the derived consensus in other nucleotide sequences.
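
The tabulation described above can be sketched as follows. This is an illustration under stated assumptions, not EConsensus itself: the function names are hypothetical, the sequences are assumed to be the same length, and ties are broken arbitrarily in G, A, T, C order.

```python
# Illustrative sketch only; function names are assumptions.
def consensus_table(aligned):
    """Percent of G, A, T, and C at each position of pre-aligned sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be pre-aligned to the same length")
    table = []
    for column in zip(*(s.upper() for s in aligned)):
        table.append(
            {base: 100.0 * column.count(base) / len(column) for base in "GATC"}
        )
    return table


def consensus_sequence(table):
    """Pick the most frequent base per position (ties go to the first of GATC)."""
    return "".join(max("GATC", key=lambda b: row[b]) for row in table)


sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
table = consensus_table(sites)
consensus = consensus_sequence(table)
```

Each row of `table` holds the four percentages for one position, which is the kind of table a consensus search could then use as a probe.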

% ecorrespond

ECorrespond looks for similar patterns of codon usage by comparing codon frequency tables.

% erepeat

ERepeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments. ERepeat is a version of GCG's old Repeat with command line control.

% eterminator

ETerminator searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov. ETerminator is a version of GCG's old Terminator with command line control.

NUCLEOTIDE ANALYSIS

Melting temperature and GC content of a sequence can be analyzed and displayed on a plot. The variation in di-nucleotide composition along a sequence can be plotted.

% melt

Melt calculates the melting temperature (Tm) and the percent G+C of a nucleic acid sequence using the algorithms described by Breslauer et al. Proc. Natl. Acad. Sci. USA 83, 3746-3750 and Baldino et al. Methods in Enzymol. 168, 761-777.

% meltplot

MeltPlot plots the melting curve for a nucleic acid sequence using the algorithms described by Breslauer et al. Proc. Natl. Acad. Sci. USA 83, 3746-3750 and Baldino et al. Methods in Enzymol. 168, 761-777.

% basepairplot

BasePairPlot plots the percentage occurrence and the observed over expected frequency of a di-nucleotide pair relative to its position in a nucleic acid sequence.

% cpgplot

CpGPlot plots the frequency of occurrence of CpG di-nucleotides and C and G percentage relative to their position in a sequence by the method described by Gardiner-Garden (1987).
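
As a rough illustration of the quantities involved, the sketch below computes the C+G percentage and an observed over expected CpG ratio in a sliding window. The ratio used here, (number of CpG x window length) / (number of C x number of G), is the commonly quoted form; it is an assumption that CpGPlot uses exactly this form. The function name, window size, and step are hypothetical.

```python
# Illustrative sketch only; not CpGPlot's code or defaults.
def cpg_window_stats(seq, window=100, step=1):
    """Return (start, percent_gc, obs_over_exp) for each window, 0-based start."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        c = win.count("C")
        g = win.count("G")
        cpg = win.count("CG")  # "CG" cannot overlap itself
        percent_gc = 100.0 * (c + g) / window
        # Report 0.0 when the expected count is zero (no C or no G).
        obs_over_exp = (cpg * window) / (c * g) if c and g else 0.0
        results.append((start, percent_gc, obs_over_exp))
    return results
```

Plotting the second and third values against the window start gives curves of the kind described above.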

% cpgreport

CpGReport looks for potential CpG islands in a nucleotide sequence.

% chaos

Chaos makes a CHAOS game representation of a nucleic acid sequence using the method of Jeffrey (1990) Nucleic Acids Research 18: 2163-2170.

% codfish

CODFISH calculates a set of codon usage statistics for a sequence using a specified codon usage table.

% wordcount

WordCount counts the commonest words in a sequence and reports them in order of frequency and sequence.

% wordup

WordUp is based on a first order Markov analysis and detects statistically significant oligonucleotide patterns from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length in the same analysis.

% poland

The program Poland simulates transition curves of double-stranded nucleic acids (DNA as well as RNA). Calculation is based on Poland, D. (1974) 'Recursion Relation Generation of Probability Profiles for Specific-Sequence Macromolecules with Long-Range Correlations'.

% genetrans

GeneTrans extracts and/or translates coding regions as defined in the feature table of sequences stored in the EMBL or GenBank databases.

% gapframe

GapFrame moves all gaps in a DNA sequence reading frame to be at codon boundaries.

% prima

Prima selects oligonucleotide primers for a template DNA sequence. The primers may be useful for the polymerase chain reaction (PCR) or for DNA sequencing. You can allow Prima to choose primers from the whole template or limit the choices to a particular set of primers listed in a file.

% quicktandem

QuickTandem scans for potential tandem repeats in a nucleotide sequence.

% tandem

Tandem looks for multiple tandem repeats of a given size in a nucleotide sequence.

% inverted

Inverted looks for imperfect inverted repeats in a nucleotide sequence.

% ecomposition

EComposition determines the composition of sequence(s). For nucleotide sequence(s), EComposition also determines dinucleotide and trinucleotide content.
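
A minimal sketch of this kind of tabulation follows. It assumes overlapping counts for dinucleotides and trinucleotides, which may differ from what EComposition actually does; the function name is hypothetical.

```python
# Illustrative sketch only; overlapping k-mer counting is an assumption.
from collections import Counter


def composition(seq, k=1):
    """Count every overlapping word of length k in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


mono = composition("ACGTAC", 1)
di = composition("ACGTAC", 2)
tri = composition("ACGTAC", 3)
```

Calling the function with k set to 1, 2, and 3 gives the base, dinucleotide, and trinucleotide content respectively.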

PROTEIN ANALYSIS

The first four programs provide graphical analyses of protein sequences. The first three provide different approaches to finding coiled-coil regions and amphipathic helices. PepWindow provides a general hydropathy plot. PepStats calculates physical properties of proteins. The last three programs look for specific sequence motifs: signal peptide cleavage sites, potential epitopes (antigenic surface regions), and helix turn helix DNA binding domains.

% pepcoil

PepCoil identifies potential coiled-coil regions of protein sequences using the algorithm of Lupas A, van Dyke M & Stock J (1991).

% pepnet

PepNet is a program to view the two-dimensional helical representation of protein sequences.

% pepwheel

PepWheel is a program to view the periodic distribution of amino acid residues in protein sequences.

% pepwindow

PepWindow plots measures of protein hydrophobicity according to the method of Kyte and Doolittle.
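
The Kyte and Doolittle method averages a per-residue hydropathy value over a window that slides along the protein. The sketch below illustrates the idea; the function name and the default window of 7 are assumptions and are not taken from PepWindow.

```python
# Illustrative sketch only; not PepWindow's code or defaults.
# Hydropathy values from the Kyte and Doolittle scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Return (center_index, mean_hydropathy) for each window, 0-based.

    Raises KeyError for residues that are not on the scale.
    """
    scores = [KYTE_DOOLITTLE[res] for res in protein.upper()]
    return [
        (start + window // 2, sum(scores[start:start + window]) / window)
        for start in range(len(scores) - window + 1)
    ]
```

Positive averages indicate hydrophobic stretches and negative averages indicate hydrophilic ones.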

% pepstats

PepStats gives a short statistical summary on the composition of a protein sequence and gives the molecular weight and isoelectric point.

% sigcleave

SigCleave uses the von Heijne method to locate signal sequences, and to identify the cleavage site. The method is 95% accurate in resolving signal sequences from non-signal sequences with a cutoff score of 3.5, and 75-80% accurate in identifying the cleavage site. The program reports all hits above a minimum value.

% antigenic

Antigenic looks for potential antigenic regions using the method of Kolaskar.

% helixturnhelix

HelixTurnHelix uses the method of Dodd and Egan to determine the significance of possible helix-turn-helix matches in protein sequences.

% dodayhoffstat

DoDayhoffStat compares the composition of a protein sequence against the Dayhoff statistic for protein composition. The closer the Dayhoff Stat value is to 1.0 the better the composition of the protein sequence fits with the theoretical value.

% pepcount

PepCount reports the number of occurrences of residues at a given position in protein sequences.

% epeptidesort

EPeptideSort shows the peptide fragments from a digest of an amino acid sequence. It sorts the peptides by weight, position, and HPLC retention at pH 2.1, and shows the composition of each peptide. It also prints a summary of the composition of the whole protein. EPeptideSort is a modified version of GCG's PeptideSort which has additional options to control output of peptides sorted by weight, retention and position.

TRANSLATION

One GCG program has command line control added. The second program translates aligned nucleic acid sequences into aligned protein sequences.

% etranslate

ETranslate is a version of GCG's old Translate program with command line control added.

% eextractpeptide

EExtractPeptide is a version of ExtractPeptide with command line control. ExtractPeptide writes a peptide sequence from one or more of the translation frames displayed in the output from Map. Translate supersedes ExtractPeptide for most applications.

% alltrans

AllTrans translates a set of aligned nucleotide sequences into protein.

% mytrans

MyTrans is a simple EGCG application that translates part of a nucleotide sequence into protein.
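
Translating part of a nucleotide sequence can be sketched as follows. This is not MyTrans itself: the function name, the 1-based inclusive `begin` and `end` arguments, the use of the standard genetic code, and the use of X for codons that cannot be translated are all assumptions made for illustration.

```python
# Illustrative sketch only; names and conventions are assumptions.
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_range(seq, begin=1, end=None):
    """Translate seq[begin..end] (1-based, inclusive) in the first frame.

    A trailing partial codon is ignored; unknown codons become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    end = len(seq) if end is None else end
    region = seq[begin - 1:end]
    usable = len(region) - len(region) % 3
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

For example, `translate_range("CCATGGCCTAAGG", begin=3, end=11)` translates the nine bases ATGGCCTAA.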

MANIPULATION

A GCG program has command line control added.

% eassemble

EAssemble is a version of GCG's old Assemble program with command line control added.

% ecomptable

ECompTable creates a scoring matrix using equivalences defined in a simplification scheme such as the one used for Simplify. ECompTable is a version of GCG's CompTable with command line control added.

% ereverse

EReverse reverses and/or complements a sequence. EReverse is a version of GCG's Reverse with command line control.
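
The three possible operations (reverse, complement, or both) can be sketched as follows. The function name and its keyword arguments are hypothetical and are not EReverse qualifiers. The complement table uses the IUPAC nucleotide ambiguity codes and maps U to A, which is an assumption about how RNA input should be handled.

```python
# Illustrative sketch only; not EReverse's code or options.
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTURYKMSWBDHVNacgturykmswbdhvn",
    "TGCAAYRMKSWVHDBNtgcaayrmkswvhdbn",
)


def reverse_and_complement(seq, reverse=True, complement=True):
    """Reverse and/or complement a nucleotide sequence, preserving case."""
    if complement:
        seq = seq.translate(IUPAC_COMPLEMENT)
    if reverse:
        seq = seq[::-1]
    return seq
```

Characters that are not in the table, such as gap symbols, pass through unchanged.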

% pepcorrupt

PepCorrupt randomly introduces small numbers of substitutions, insertions, and deletions into protein sequence(s). Note that substitutions are Residue to other Residue, and that back mutations to the original are allowed!

DISPLAY

EPublish is a version of Publish that allows command line control. No other Display programs are released, but there has been some interest in a modified version of Red to provide alternative forms of documentation.

% epublish

EPublish is a version of Publish with command line control. Publish arranges sequences for publication. It creates a text file that you can modify to your own needs with a text editor.

% elibgen

ELibGen creates formatted versions of EGCG documentation for the on-line help facilities egenhelp and egenmanual.

% redtohtml

RedToHtml is a modification of GCG's program Red to convert documentation source files into HTML documents.

SEQUENCE EXCHANGE

The first program converts any sequence to plain text. The next two programs provide a way to generate the original database entry format from a GenBank/EMBL entry in a GCG database. The ToPirAll program provides a way to extract a set of subsequences in PIR format. The last program produces input files for the Primer program.

% creformat

CReformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. For sequence files, a base range can be selected or excluded.

% totext

ToText converts a sequence into plain text format.

% togenbank

ToGenBank is a simple utility program that reads a GenBank entry from a GCG sequence database, and writes it out in GenBank flat file format.

% toembl

ToEmbl is a simple utility program that reads an EMBL entry from a GCG sequence database, and writes it out in EMBL flat file format.

% topirall

ToPirAll is a utility program that converts a list of sequences, or ranges of sequences, into PIR format for use in other non-GCG programs, especially CLUSTALV.

% toprimer

ToPrimer formats a GCG sequence file into a PRIMER compatible file.

% torelate

ToRelate creates an input file for the NBRF RELATE program.

% efromfasta

EFromFastA reformats one or more sequences from FastA format into individual files in GCG format.

% efromstaden

EFromStaden changes a sequence from Staden format into GCG format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EFromStaden is a version of GCG's old FromStaden with command line control.

% etostaden

EToStaden writes a GCG sequence into a file in Staden format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EToStaden is a version of GCG's ToStaden with command line control.

% egetseq

EGetSeq reads a sequence from a computer that is acting as a terminal and writes it into a new sequence file in GCG format on the computer running the Wisconsin Package. EGetSeq is a version of GCG's GetSeq with command line control.

FILE UTILITIES

These utilities act on text files.

% noreturn

NoReturn removes trailing carriage return or line feed control characters from text files.

% cppjl

CppJL converts EGCG VMS Fortran source code to Unix Fortran source code.

% crtolf

CRtoLF converts carriage return characters to linefeed characters in text files.
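
A conversion of this kind can be sketched in a few lines. The function name is hypothetical, and treating a carriage return followed by a linefeed as a single line ending is an assumption; CRtoLF's handling of that case is not described here.

```python
# Illustrative sketch only; CRLF handling is an assumption.
def cr_to_lf(data: bytes) -> bytes:
    """Convert carriage returns to linefeeds in raw file contents."""
    # Collapse CR+LF pairs first so they do not become two linefeeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Working on bytes rather than text avoids any automatic newline translation when the file is read or written.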

% addcomment

AddComment rewrites a text file with every line commented out.

% ecrypt

ECrypt writes an encrypted version of a file using a key word that you choose. Run ECrypt a second time with the same keyword to restore the encrypted output file to its original state.

% ecodesearch

ECodeSearch searches through FORTRAN source files for references to mnemonics such as procedure names. You must provide the mnemonics in a separate file. The default parameters show you some suitable inputs.

% eclsort

ECLSort sorts the output of ECodeSearch on the first argument of each procedure. The heading is lost. EGCG will use ECLSort to make up the command line dictionary in the Procedure Library chapter of the future EGCG System Support Manual.

MISCELLANEOUS

These programs do not fit the other categories.

% test

Test is provided as a skeleton for programmers to test ideas.

% ctest

CTest is provided as a skeleton for programmers to test ideas.

% keyfind

KeyFind reports the characters passed to the program by keys on the keyboard.

DATABASE MAINTENANCE

These programs are used at several sites to build additional databases in GCG format.

% dbstats

DbStats counts the number of entries and the total lengths of sequence entries in a GCG formatted database.

% gbonly

GbOnly creates a list of GenBank entries that have accession numbers not found in the latest release of the EMBL database.

% pironly

PirOnly and related programs select entries from PIR that are not included in the latest release of SwissProt.

% checklen

CheckLen calculates five checksums and the sequence length for each entry in a database, and writes them to a file for use in a quick cross check for identical sequences.

% checklencomp

CheckLenComp compares two sorted CheckLen output files, and produces a list of entries from the first file which are not found in the second.
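
Because both inputs are sorted, a comparison like this can be made in a single pass over the two lists. The sketch below shows the idea on plain sorted keys; the function name is hypothetical, and the real CheckLen record format is not assumed.

```python
# Illustrative sketch only; operates on any two sorted lists of keys.
def only_in_first(first_sorted, second_sorted):
    """Return keys from first_sorted that do not occur in second_sorted."""
    missing = []
    second = iter(second_sorted)
    current = next(second, None)
    for key in first_sorted:
        # Advance the second list until it catches up with this key.
        while current is not None and current < key:
            current = next(second, None)
        if current != key:
            missing.append(key)
    return missing
```

Each list is read once, so the cost grows with the combined length of the two files rather than with their product.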

% kabattogcg

KabatToGCG creates GCG data libraries from Kabat distribution files.

% seqdbtogcg

SeqDbToGcg converts the SeqDb database distribution files into a database in GCG format.

% convertenz

ConvertEnz reads lines extracted from the ENZYME database, and converts them to lower case.

% ig2nbrf

Ig2Nbrf is a utility program that converts an IG formatted file into an NBRF formatted database which PirToGcg can index.

% embltogcgsc

EMBLToGCGSC is the Sanger Centre's modification of GCG's EMBLtoGCG which reformats EMBL and SWISS-PROT flat sequence files into GCG data libraries.

ON-LINE HELP

% egenhelp

EGenHelp displays an index of all the programs in the EGCG Program Manual. To view the topics for an individual program on your screen, type in the program name. To select a topic for a program, type in the topic name (including any underscores). Program documentation always includes a picture of the screen for a typical session with the program.

% egenmanual

EGenManual displays an index of all the sections of the EGCG Program Manual. To view the programs in a section, type in the section name (including any underscores). To select a program, type in the program name. To select a topic for a program, type in the topic name (including any underscores).