SHORT DESCRIPTIONS
The Short Descriptions section tells you what tools are available in the EGCG Package and contains a one or two sentence description of every EGCG command.
Conventions
You run GCG and EGCG programs by typing the name of the program next to the % (percent sign) UNIX prompt. The text next to the % in the descriptions below is the command that you enter to run each program; you don't type the percent sign. The bold type (also called the typewriter font) in any of its qualifiers indicates the minimum number of characters that you must type on the command line to run the program. You must type the full UNIX program name; partial names are ignored.
Whenever the word file(s) or sequence(s) appears in text, the (s) means that the program works on one or more sequences simultaneously. (See the Introduction section of the GCG users guide for more information about GCG documentation conventions.)
Initializing the EGCG programs
The % gcg command is used at login to initialize the GCG Package. The % egcg command is used in the same way to initialize the EGCG Package. The EGCG Package is ready to run when a banner appears on your screen that looks something like this:
Welcome to the EGCG extensions to the WISCONSIN PACKAGE
Version 8.1, November 1995
Installed on irix
Copyright 1982, 1983, 1984, 1985, 1986, 1987, 1989, 1991, 1992, 1994
Genetics Computer Group, Inc. All rights reserved.
Published research assisted by this software should always cite:
Program Manual for the Wisconsin Package,
Version 8, August 1994, Genetics Computer Group,
575 Science Drive, Madison, Wisconsin, USA 53711
and also for the Extension programs:
Program Manual for the EGCG Package, Peter Rice, The Sanger Centre,
Hinxton Hall, Cambridge, CB10 1RQ, England.
Additional code by Peter Rice, The Sanger Centre, Hinxton, England
and other members of the EGCG team.
Help is available with the commands % egenhelp and % egenmanual
EDITING
The NewFeatures program is used by the EMBL Data Library to edit feature tables and to update sequence database entries. It is also a useful tool for maintaining your own version of a feature table, or for exploring large feature tables.
% newfeatures
NewFeatures is an interactive editor for entering and modifying the feature table and for minor editing of the sequence itself.
FRAGMENT ASSEMBLY
The Fragment Assembly programs provide additional functions for the GCG Fragment Assembly system.
% gelstatus
GelStatus reads a GCG Fragment Assembly database, and produces a summary report of the quality of each contig.
% gelpicture
GelPicture reads a contig from the Fragment Assembly database and displays a diagram of the gel alignments and a printout of the aligned gel sequences and consensus. GelPicture has been modified to include the sequence direction in both sections of the output, and to mark with '=======' any consensus sequence that is correct (agrees with every fragment) and has been sequenced in both directions.
% gelfigure
GelFigure produces a graphical report of the status of a contig in a fragment assembly project.
% gelanalyze
GelAnalyze reads a GelStatus report from a shotgun project, and produces project statistics by the method of Lander and Waterman.
MAPPING
A new program makes the task of creating your own restriction enzyme file for GCG programs easy.
% mapselect
MapSelect selects restriction enzymes by name or by their ability to cut a given sequence, and writes them to a new enzyme file for use in other programs.
% efingerprint
EFingerPrint identifies the products of T1 ribonuclease digestion. EFingerPrint is a version of GCG's old FingerPrint with command line control.
COMPARISON
Command line control is added to GCG's program Diverge.
% ediverge
EDiverge is a version of Diverge with command line control. Diverge measures the percent divergence of two protein coding sequences using the method of Perler and Efstratiadis.
% eoverlap
EOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison. EOverlap is an extended version of GCG's Overlap for use in database nonredundancy checks, together with the FilterOverlap program.
% bigeoverlap
BigEOverlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison. EOverlap is an extended version of GCG's Overlap for use in database nonredundancy checks, together with the FilterOverlap program. BigEOverlap has a very high limit on total sequence length for genome scale sequence analysis, but is too large for general use on most systems.
% filteroverlap
FilterOverlap reads the output file from EOverlap and filters out only those overlaps which meet specified values when the alignments are built. Output from GCG's Overlap program may also be used, but only if generated from a self comparison of a single database.
DATABASE SEARCHING
Output from (T)Fasta can be screened for significance. TWordSearch searches can compare a protein sequence to the nucleotide databases. The EQuickSearch program can run far faster with far smaller memory requirements, and output can be screened for the best hits using QuickMatch.
% fastacheck
FastaCheck selects significant alignments from a (T)Fasta output file.
% twordsearch
TWordSearch identifies DNA sequences similar to a protein query sequence using a six frame translation of the database and a Wilbur and Lipman-style search. The output is a list of significant diagonals whose alignments can be displayed with TSegments.
% tsegments
TSegments aligns and displays the segments of similarity found by TWordSearch.
% equicksearch
EQuickSearch rapidly identifies places where query sequence(s) occur in a nucleotide sequence database. The output is a file of overlaps that can be displayed with QuickMatch or EQuickShow. You can make up your own sequence database or use GenEMBL, which consists of GenBank and those sequences in EMBL that are not represented in GenBank (or the other way around at some sites).
% quickmatch
QuickMatch displays the overlaps found by EQuickSearch with either optimal alignments or dot-plots. The alignments can be selected by quality to discard poor matches. The dot-plots can be reviewed rapidly with a graphic screen.
% equickindex
EQuickIndex builds hash tables from sequence(s) in data libraries, and stores them as map sections. These tables make up the database that is searched by EQuickSearch.
% newfetch
NewFetch copies GCG sequences, fragments, or data files from the GCG database into your directory or displays them on your terminal screen. You can also specify a sequence range.
% stssearch
StsSearch looks for primer pairs in a set of sequences.
% rfindpatterns
RFindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. The output is a series of files called r1.rfind, r2.rfind, and so on, each containing a single extracted sequence. These can be fed through Pileup or manipulated in other ways.
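As an illustration only, the following Python sketch shows how an ambiguous pattern such as YRYRYRYR can be matched by expanding the standard IUPAC nucleotide codes. It is not the RFindPatterns source; the function name find_pattern is hypothetical, and mismatches are not handled here.

```python
import re

# IUPAC nucleotide ambiguity codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_pattern(sequence, pattern):
    """Return the 0-based start of every match, including overlapping ones.

    Hypothetical helper for illustration; exact matching only.
    """
    body = "".join(IUPAC[symbol] for symbol in pattern.upper())
    # A zero-width lookahead lets successive matches overlap.
    regex = re.compile("(?=" + body + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_pattern("CCGAATTCGG", "GAATTC"))
    print(find_pattern("CACACACAGG", "YRYRYRYR"))
```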
% patternplot
PatternPlot produces a graphical representation of the results of GCG's FindPatterns program.
MULTIPLE SEQUENCE ANALYSIS
The first four programs in this section allow you to display multiple sequence alignments. The last three programs are modified versions of the GCG profile programs, supporting automatic translation of nucleotide sequence database entries, and modifications to allow searches of far larger databases.
% prettyplot
PrettyPlot displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment, it simply displays it.
% prettybox
PrettyBox displays multiple sequence alignments as shaded boxes in PostScript format (i.e., the output file must be printed and/or displayed on a PostScript-compatible device). PrettyBox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.
% plotalign
PlotAlign takes a GCG format sequence alignment, and plots the mean and range of values for any amino acid parameter you supply. The "panel file" contains a list of parameters to be plotted. The main database of parameters is taken from Nakai et al. (1988), and the default panel file uses selected parameters from the 13 discrete clusters in that paper. This program is experimental. Any suggestions would be most welcome.
% pepallwindow
PepAllWindow plots measures of protein hydrophobicity according to the method of Kyte and Doolittle.
% eplotsimilarity
EPlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.
% polydot
PolyDot compares two sets of sequences, draws a dotplot for each pair of sequences, and reports all identical matches of a specified length.
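The idea of reporting identical matches of a specified length can be sketched in Python as follows. This is a minimal illustration for one pair of sequences, assuming a simple word index; the function name identical_matches is hypothetical and the sketch is not taken from PolyDot itself.

```python
from collections import defaultdict


def identical_matches(seq_a, seq_b, word=6):
    """Return sorted 0-based (pos_a, pos_b) pairs where both share a word.

    Hypothetical helper for illustration only.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    # Index every word of the first sequence by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word + 1):
        index[seq_a[i:i + word]].append(i)
    # Look up every word of the second sequence in that index.
    hits = []
    for j in range(len(seq_b) - word + 1):
        for i in index.get(seq_b[j:j + word], ()):
            hits.append((i, j))
    return sorted(hits)


if __name__ == "__main__":
    print(identical_matches("GATTACA", "TTAC", word=3))
```

Each (pos_a, pos_b) pair corresponds to one dot in a dotplot of the two sequences.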
% tprofilesearch
TProfileSearch uses a profile (representing a group of aligned protein sequences) as a probe to search the nucleotide database for new sequences with possible protein products having some similarity to the group. The profile is created with the program ProfileMake.
% tprofilesegments
TProfileSegments makes optimal alignments showing the segments of similarity found by TProfileSearch.
% tprofilegap
TProfileGap makes an optimal alignment between a profile and a sequence.
% profileplot
ProfilePlot produces a graphical report of the frequency of patterns in a protein or nucleotide sequence.
% sortconsensus
SortConsensus identifies the strong consensus regions of an alignment in an MSF file and reports them in sorted order.
% elineup
ELineUp is a screen editor for editing multiple sequence alignments. You can edit up to 500 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.
% multalign
MultAlign does a simultaneous alignment for two or more DNA or protein sequences. It introduces a certain number of gaps into either pairwise aligned sequences or groups of sequences to find a minimal global distance. The user can influence the result by defining the order in which the sequences will be aligned. The program is based on a generalization of the algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.
% eclustalw
EClustAlW calculates a multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is part of the original ClustalW distribution, modified for inclusion in EGCG.
% clustree
ClusTree computes a phylogenetic tree according to the Neighbor-Joining Method of Saitou and Nei (1987). This is part of the original ClustalW distribution, modified for inclusion in EGCG. The tree will be displayed graphically.
% profalign
ProfAlign takes two existing alignments (or single sequences) and aligns them with each other. The result is one larger alignment. This is part of the original ClustalW distribution, modified for inclusion in EGCG.
EVOLUTIONARY ANALYSIS
These programs measure the pairwise homologies between a set of sequences, and provide a conversion to the format required by the Phylip program.
% homologies
Homologies makes a table of the pair-wise distances within a group of aligned sequences.
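A table of pairwise distances can be sketched in Python as shown below. The distance measure used by Homologies is not stated in this section, so the sketch assumes the simplest one: the fraction of differing positions among the columns where neither sequence has a gap. The name pairwise_distances and the gap characters are assumptions made for illustration.

```python
GAPS = "-."  # assumed gap characters


def pairwise_distances(aligned):
    """Return a symmetric table of uncorrected distances (0.0 to 1.0).

    Hypothetical helper; gap-containing columns are skipped pair by pair.
    """
    n = len(aligned)
    table = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            pairs = [
                (x, y)
                for x, y in zip(aligned[a].upper(), aligned[b].upper())
                if x not in GAPS and y not in GAPS
            ]
            differences = sum(1 for x, y in pairs if x != y)
            distance = differences / len(pairs) if pairs else 0.0
            table[a][b] = table[b][a] = distance
    return table


if __name__ == "__main__":
    for row in pairwise_distances(["ACGT", "ACGA", "A-GT"]):
        print(" ".join(f"{value:5.3f}" for value in row))
```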
% tophylip
ToPhylip writes GCG sequences into a single file in Phylip format.
% phylip2tree
Phylip2Tree displays trees computed with one of the PHYLIP-programs in GCG style.
PATTERN RECOGNITION
Searching for inverted repeats in DNA sequences is now provided. Some GCG programs in this section now have command line control added.
% palindrome
Palindrome searches for perfect inverted repeats in a nucleic acid sequence.
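A perfect inverted repeat is a stretch of sequence followed, possibly after a short loop, by its own reverse complement. The Python sketch below illustrates that definition with a brute-force scan. It is not the Palindrome algorithm; the function names and the stem and max_loop parameters are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def perfect_inverted_repeats(seq, stem=4, max_loop=3):
    """Return (left_start, right_start, stem_sequence) tuples, 0-based.

    Hypothetical helper: reports stems of exactly `stem` bases whose
    reverse complement starts within `max_loop` bases of the stem end.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for loop in range(max_loop + 1):
            j = i + stem + loop
            if seq[j:j + stem] == target:
                hits.append((i, j, left))
    return hits


if __name__ == "__main__":
    print(perfect_inverted_repeats("GAATTC", stem=3, max_loop=0))
    print(perfect_inverted_repeats("ACGTTTTACGT", stem=4, max_loop=3))
```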
% ewindow
EWindow is a version of Window with command line control. Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.
% estatplot
EStatPlot is a version of StatPlot with command line control. StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.
% ecodonfrequency
ECodonFrequency tabulates codon usage from sequences and/or existing codon usage tables. The output file is correctly formatted for input to the CodFish, CodonPreference, Correspond, and Frames programs.
ECodonFrequency is a modified version of GCG version 7's CodonFrequency with command line control added.
% econsensus
EConsensus calculates a consensus sequence for a set of pre-aligned short nucleic acid sequences by tabulating the percent of G, A, T, and C for each position in the set. GCG's FitConsensus uses the EConsensus output table as a probe to search for the best examples of the derived consensus in other nucleotide sequences.
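The tabulation described for EConsensus can be sketched in Python as follows. This is an illustration under simple assumptions (equal-length, pre-aligned sequences; ties broken in G, A, T, C order); the function names are hypothetical and the output is not in the format FitConsensus expects.

```python
def base_percent_table(aligned):
    """Return one {base: percent} dictionary per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be pre-aligned to the same length")
    table = []
    for col in range(length):
        column = [seq[col].upper() for seq in aligned]
        table.append(
            {base: 100.0 * column.count(base) / len(column) for base in "GATC"}
        )
    return table


def consensus(table):
    """Pick the most frequent base per column (first in GATC order on ties)."""
    return "".join(max("GATC", key=row.get) for row in table)


if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
    table = base_percent_table(sites)
    print(consensus(table))
    print(table[0])
```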
% ecorrespond
ECorrespond looks for similar patterns of codon usage by comparing codon frequency tables.
% erepeat
ERepeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments. ERepeat is a version of GCG's old Repeat with command line control.
% eterminator
ETerminator searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov. ETerminator is a version of GCG's old Terminator with command line control.
NUCLEOTIDE ANALYSIS
Melting temperature and GC content of a sequence can be analyzed and displayed on a plot. The variation in di-nucleotide composition along a sequence can be plotted.
% melt
Melt calculates the melting temperature (Tm) and the percent G+C of a nucleic acid sequence using the algorithms described by Breslauer et al. Proc. Natl. Acad. Sci. USA 83, 3746-3750 and Baldino et al. Methods in Enzymol. 168, 761-777.
% meltplot
MeltPlot plots the melting curve for a nucleic acid sequence using the algorithms described by Breslauer et al. Proc. Natl. Acad. Sci. USA 83, 3746-3750 and Baldino et al. Methods in Enzymol. 168, 761-777.
% basepairplot
BasePairPlot plots the percentage occurrence and the observed over expected frequency of a di-nucleotide pair relative to its position in a nucleic acid sequence.
% cpgplot
CpGPlot plots the frequency of occurrence of CpG di-nucleotides and C and G percentage relative to their position in a sequence by the method described by Gardiner-Garden (1987).
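The kind of statistic plotted here can be sketched in Python as follows. The sketch assumes the commonly used observed/expected ratio, (number of CpG x window length) / (number of C x number of G), computed in a sliding window together with the percent G+C. The exact calculation and defaults used by CpGPlot are not given in this section, so the function name, window size, and shift are assumptions.

```python
def cpg_window_stats(seq, window=100, shift=1):
    """Return (start, percent_gc, obs_exp_cpg) per window; start is 1-based.

    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, shift):
        chunk = seq[start:start + window]
        c_count = chunk.count("C")
        g_count = chunk.count("G")
        cpg_count = chunk.count("CG")
        # The ratio is undefined without both C and G; report 0.0 then.
        if c_count and g_count:
            obs_exp = (cpg_count * window) / (c_count * g_count)
        else:
            obs_exp = 0.0
        percent_gc = 100.0 * (c_count + g_count) / window
        rows.append((start + 1, percent_gc, obs_exp))
    return rows


if __name__ == "__main__":
    for row in cpg_window_stats("CGCGATAT", window=4, shift=4):
        print(row)
```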
% cpgreport
CpGReport looks for potential CpG islands in a nucleotide sequence.
% chaos
Chaos makes a CHAOS game representation of a nucleic acid sequence using the method of Jeffrey (1990) Nucleic Acids Research 18: 2163-2170.
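In a chaos game representation, each base moves the current point halfway toward the corner of the unit square assigned to that base. The Python sketch below illustrates the iteration. The corner assignment (A, C, G, T placed counter-clockwise from the bottom left... here A bottom left, C top left, G top right, T bottom right) and the starting point at the centre are assumptions, and the function name is hypothetical; the Chaos program may differ in both.

```python
# Assumed corner assignment for the unit square.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def chaos_game_points(seq):
    """Return the list of (x, y) points, one per recognised base."""
    x, y = 0.5, 0.5  # assumed starting point: centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguity codes and gaps
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point to the base's corner.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(chaos_game_points("AG"))
```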
% codfish
CODFISH calculates a set of codon usage statistics for a sequence using a specified codon usage table.
% wordcount
WordCount counts the commonest words in a sequence and reports them in order of frequency and sequence.
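Counting words and reporting them by frequency and then by sequence can be sketched in Python as follows. This is an illustration only; the function name word_counts and the default word size are assumptions, not WordCount's actual interface.

```python
from collections import Counter


def word_counts(seq, size=2):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    seq = seq.upper()
    counts = Counter(seq[i:i + size] for i in range(len(seq) - size + 1))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for word, count in word_counts("GATATA", size=2):
        print(word, count)
```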
% wordup
WordUp is based on a first order Markov analysis and detects statistically significant oligonucleotide patterns from six to nine nucleotides long in the sequences under investigation. WordUp dynamically detects significant signals of any length in the same analysis.
% poland
The program Poland simulates transition curves of double-stranded nucleic acids (DNA as well as RNA). Calculation is based on Poland, D. (1974) 'Recursion Relation Generation of Probability Profiles for Specific-Sequence Macromolecules with Long-Range Correlations'.
% genetrans
GeneTrans extracts and/or translates coding regions as defined in the feature table of sequences stored in the EMBL or GenBank databases.
% gapframe
GapFrame moves all gaps in a DNA sequence reading frame to be at codon boundaries.
% prima
Prima selects oligonucleotide primers for a template DNA sequence. The primers may be useful for the polymerase chain reaction (PCR) or for DNA sequencing. You can allow Prima to choose primers from the whole template or limit the choices to a particular set of primers listed in a file.
% quicktandem
QuickTandem scans for potential tandem repeats in a nucleotide sequence.
% tandem
Tandem looks for multiple tandem repeats of a given size in a nucleotide sequence.
% inverted
Inverted looks for imperfect inverted repeats in a nucleotide sequence.
% ecomposition
EComposition determines the composition of sequence(s). For nucleotide sequence(s), EComposition also determines dinucleotide and trinucleotide content.
PROTEIN ANALYSIS
The first four programs provide graphical analyses of protein sequences. The first three provide different approaches to finding coiled-coil regions and amphipathic helices. PepWindow provides a general hydropathy plot. PepStats calculates physical properties of proteins. The last three programs look for specific sequence motifs: signal peptide cleavage sites, potential epitopes (antigenic surface regions), and helix turn helix DNA binding domains.
% pepcoil
PepCoil identifies potential coiled-coil regions of protein sequences using the algorithm of Lupas A, van Dyke M & Stock J (1991).
% pepnet
PepNet is a program to view the two-dimensional helical representation of protein sequences.
% pepwheel
PepWheel is a program to view the periodic distribution of amino acid residues in protein sequences.
% pepwindow
PepWindow plots measures of protein hydrophobicity according to the method of Kyte and Doolittle.
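A Kyte and Doolittle hydropathy profile is the average of per-residue hydropathy values in a window moved along the protein. The Python sketch below illustrates the calculation with the published Kyte-Doolittle scale; the function name, the default window of 7, and the reporting of the window's centre position are assumptions, not PepWindow's defaults.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Return (centre_position, mean_hydropathy) pairs; positions 1-based.

    Hypothetical helper; raises KeyError on non-standard residues.
    """
    protein = protein.upper()
    profile = []
    for start in range(len(protein) - window + 1):
        chunk = protein[start:start + window]
        mean = sum(KYTE_DOOLITTLE[residue] for residue in chunk) / window
        profile.append((start + window // 2 + 1, mean))
    return profile


if __name__ == "__main__":
    for position, value in hydropathy_profile("MKTLLILAVVAAALA", window=7):
        print(position, round(value, 2))
```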
% pepstats
PepStats gives a short statistical summary on the composition of a protein sequence and gives the molecular weight and isoelectric point.
% sigcleave
SigCleave uses the von Heijne method to locate signal sequences, and to identify the cleavage site. The method is 95% accurate in resolving signal sequences from non-signal sequences with a cutoff score of 3.5, and 75-80% accurate in identifying the cleavage site. The program reports all hits above a minimum value.
% antigenic
Antigenic looks for potential antigenic regions using the method of Kolaskar.
% helixturnhelix
HelixTurnHelix uses the method of Dodd and Egan to determine the significance of possible helix-turn-helix matches in protein sequences.
% dodayhoffstat
DoDayhoffStat compares the composition of a protein sequence against the Dayhoff statistic for protein composition. The closer the Dayhoff Stat value is to 1.0 the better the composition of the protein sequence fits with the theoretical value.
% pepcount
PepCount reports the number of occurrences of residues at a given position in protein sequences.
% epeptidesort
EPeptideSort shows the peptide fragments from a digest of an amino acid sequence. It sorts the peptides by weight, position, and HPLC retention at pH 2.1, and shows the composition of each peptide. It also prints a summary of the composition of the whole protein. EPeptideSort is a modified version of GCG's PeptideSort which has additional options to control output of peptides sorted by weight, retention and position.
TRANSLATION
One GCG program has command line control added. The second program translates aligned nucleic acid sequences into aligned protein sequences.
% etranslate
ETranslate is a version of GCG's old Translate program with command line control added.
% eextractpeptide
EExtractPeptide is a version of ExtractPeptide with command line control. ExtractPeptide writes a peptide sequence from one or more of the translation frames displayed in the output from Map. Translate supersedes ExtractPeptide for most applications.
% alltrans
AllTrans translates a set of aligned nucleotide sequences into protein.
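Translating aligned nucleotide sequences so that the protein sequences stay aligned can be sketched in Python as follows. The sketch assumes the standard genetic code, a reading frame starting at the first column, and gaps that already fall on codon boundaries; the function names are hypothetical and this is not the AllTrans source.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate_aligned(seq):
    """Translate one gapped sequence; '---' becomes '-', unknown codons 'X'."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon == "---":
            protein.append("-")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


def translate_alignment(aligned):
    """Translate every sequence of an alignment in the same frame."""
    return [translate_aligned(seq) for seq in aligned]


if __name__ == "__main__":
    print(translate_alignment(["ATGGCC---TAA", "ATG---GCCTAA"]))
```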
% mytrans
MyTrans is a simple EGCG application that translates part of a nucleotide sequence into protein.
MANIPULATION
A GCG program has command line control added.
% eassemble
EAssemble is a version of GCG's old Assemble program with command line control added.
% ecomptable
ECompTable creates a scoring matrix using equivalences defined in a simplification scheme such as the one used for Simplify. ECompTable is a version of GCG's CompTable with command line control added.
% ereverse
EReverse reverses and/or complements a sequence. EReverse is a version of GCG's Reverse with command line control.
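The three operations (reverse, complement, and both together) can be sketched in Python as follows, using the standard IUPAC complements for ambiguity codes. The function names are hypothetical and the sketch does not reproduce EReverse's file handling or options.

```python
# Standard IUPAC complements; U is complemented to A.
_SYMBOLS = "ACGTURYMKSWBVDHN"
_PARTNERS = "TGCAAYRKMSWVBHDN"
IUPAC_COMPLEMENT = str.maketrans(
    _SYMBOLS + _SYMBOLS.lower(), _PARTNERS + _PARTNERS.lower()
)


def reverse(seq):
    """Return the sequence written backwards."""
    return seq[::-1]


def complement(seq):
    """Return the complementary strand in the same orientation."""
    return seq.translate(IUPAC_COMPLEMENT)


def reverse_complement(seq):
    """Return the opposite strand read 5' to 3'."""
    return reverse(complement(seq))


if __name__ == "__main__":
    print(reverse("GATTACA"))
    print(complement("GATTACA"))
    print(reverse_complement("GATTACA"))
```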
% pepcorrupt
PepCorrupt randomly introduces small numbers of substitutions, insertions, and deletions into protein sequence(s). Note that substitutions are Residue to other Residue, and that back mutations to the original are allowed!
DISPLAY
EPublish is a version of Publish that allows command line control. No other Display programs are released, but there has been some interest in a modified version of Red to provide alternative forms of documentation.
% epublish
EPublish is a version of Publish with command line control. Publish arranges sequences for publication. It creates a text file that you can modify to your own needs with a text editor.
% elibgen
ELibGen creates formatted versions of EGCG documentation for the on-line help facilities egenhelp and egenmanual.
% redtohtml
RedToHtml is a modification of GCG's program Red to convert documentation source files into HTML documents.
SEQUENCE EXCHANGE
The first program converts any sequence to plain text. The next two programs provide a way to generate the original database entry format from a GenBank/EMBL entry in a GCG database. The ToPirAll program provides a way to extract a set of subsequences in PIR format. The last program produces input files for the Primer program.
% creformat
CReformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. For sequence files, a base range can be selected or excluded.
% totext
ToText converts a sequence into plain text format.
% togenbank
ToGenBank is a simple utility program that reads a GenBank entry from a GCG sequence database, and writes it out in GenBank flat file format.
% toembl
ToEmbl is a simple utility program that reads an EMBL entry from a GCG sequence database, and writes it out in EMBL flat file format.
% topirall
ToPirAll is a utility program that converts a list of sequences, or ranges of sequences, into PIR format for use in other non-GCG programs, especially CLUSTALV.
% toprimer
ToPrimer formats a GCG sequence file into a PRIMER compatible file.
% torelate
ToRelate creates an input file for the NBRF RELATE program.
% efromfasta
EFromFastA reformats one or more sequences from FastA format into individual files in GCG format.
% efromstaden
EFromStaden changes a sequence from Staden format into GCG format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EFromStaden is a version of GCG's old FromStaden with command line control.
% etostaden
EToStaden writes a GCG sequence into a file in Staden format. If the file contains a nucleotide sequence, the ambiguity codes are converted as shown in Appendix III of the GCG Program Manual. EToStaden is a version of GCG's ToStaden with command line control.
% egetseq
EGetSeq reads a sequence from a computer that is acting as a terminal and writes it into a new sequence file in GCG format on the computer running the Wisconsin Package. EGetSeq is a version of GCG's GetSeq with command line control.
FILE UTILITIES
These utilities act on text files.
% noreturn
NoReturn removes trailing carriage return or line feed control characters from text files.
% cppjl
CppJL converts EGCG VMS fortran source code to Unix fortran source code.
% crtolf
CRtoLF converts carriage return characters to linefeed characters in text files.
% addcomment
AddComment rewrites a text file with every line commented out.
% ecrypt
ECrypt writes an encrypted version of a file using a key word that you choose. Run ECrypt a second time with the same keyword to restore the encrypted output file to its original state.
% ecodesearch
ECodeSearch searches through FORTRAN source files for references to mnemonics such as procedure names. You must provide the mnemonics in a separate file. The default parameters show you some suitable inputs.
% eclsort
ECLSort sorts the output of ECodeSearch on the first argument of each procedure. The heading is lost. EGCG will use ECLSort to make up the command line dictionary in the Procedure Library chapter of the future EGCG System Support Manual.
MISCELLANEOUS
These programs do not fit the other categories.
% test
Test is provided as a skeleton for programmers to test ideas.
% ctest
CTest is provided as a skeleton for programmers to test ideas.
% keyfind
KeyFind reports the characters passed to the program by keys on the keyboard.
DATABASE MAINTENANCE
These programs are used at several sites to build additional databases in GCG format.
% dbstats
DbStats counts the number of entries and the total lengths of sequence entries in a GCG formatted database.
% gbonly
GbOnly creates a list of GenBank entries that have accession numbers not found in the latest release of the EMBL database.
% pironly
PirOnly and related programs select entries from PIR that are not included in the latest release of SwissProt.
% checklen
CheckLen calculates five checksums and the sequence length for each entry in a database, and writes them to a file for use in a quick cross check for identical sequences.
% checklencomp
CheckLenComp compares two sorted CheckLen output files, and produces a list of entries from the first file which are not found in the second.
% kabattogcg
KabatToGCG creates GCG data libraries from Kabat distribution files.
% seqdbtogcg
SeqDbToGcg converts the SeqDb database distribution files into a database in GCG format.
% convertenz
ConvertEnz reads lines extracted from the ENZYME database, and converts them to lower case.
% ig2nbrf
Ig2Nbrf is a utility program that converts an IG formatted file into an NBRF formatted database which PirToGcg can index.
% embltogcgsc
EMBLToGCGSC is the Sanger Centre's modification of GCG's EMBLtoGCG which reformats EMBL and SWISS-PROT flat sequence files into GCG data libraries.
ON-LINE HELP
% egenhelp
EGenHelp displays an index of all the programs in the EGCG Program Manual. To view the topics for an individual program on your screen, type in the program name. To select a topic for a program, type in the topic name (including any underscores). Program documentation always includes a picture of the screen for a typical session with the program.
% egenmanual
EGenManual displays an index of all the sections of the EGCG Program Manual. To view the programs in a section, type in the section name (including any underscores). To select a program, type in the program name. To select a topic for a program, type in the topic name (including any underscores).
EGCG COMMANDS
These commands are defined by EGCG for general use. Most are similar to GCG commands.
EGCGSUPPORT COMMANDS
These commands are defined by EGCGSUPPORT for system maintenance. Most are similar to GCGSUPPORT commands.
% egcghelpbuild
builds the complete help libraries for egenhelp and egenmanual from the documentation source files.
Printed: April 23, 1996 16:26 (1162)