EQuickIndex builds hash tables from sequence(s) in data libraries, and stores them as map sections. These tables make up the database that is searched by EQuickSearch.
NOTE: The EGCG Quick Searching System programs are now fully supported by the EGCG team. GCG distributed the original programs in the hope that users would make suggestions about their future development. This program is one such suggestion.
EQuickSearch gets its speed from the fact that all of the words in a set of sequences are first sorted into dictionaries so that they can be found rapidly. EQuickIndex is the tool that makes up those dictionaries. Such dictionaries are referred to in computing as "hash tables."
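The dictionary idea can be illustrated with a minimal Python sketch. It is not the EQuickIndex implementation: the function names, the input shape (a dict of sequence names to strings), and the in-memory result are assumptions made for illustration. The default word size of 20 follows the "20-Mers Hashed" line in the session summary.

```python
from collections import defaultdict

def build_word_index(sequences, word_size=20):
    """Map every word of length word_size to the places where it occurs.

    sequences: dict of name -> nucleotide string (hypothetical input shape).
    Returns a dict: word -> list of (sequence name, 0-based offset).
    """
    index = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()
        # Slide a window of word_size bases along the sequence.
        for offset in range(len(seq) - word_size + 1):
            index[seq[offset:offset + word_size]].append((name, offset))
    return dict(index)

def find_word(index, word):
    """Look up one word; a single dictionary access, no scan of the sequences."""
    return index.get(word.upper(), [])
```

Once the index exists, each lookup costs one dictionary access regardless of how many sequences were indexed, which is the property the text above attributes to hash tables.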
The input to EQuickIndex is one or more data library sequence files. These are the files in each library that have the file name extension ".seq"; they are NOT the same as individual GCG sequence files. If you don't know what a GCG sequence data library is, look at the documentation for DataSet in the Program Manual. The input file specification is a (possibly ambiguous) file specification for one or more ".seq" files.
The output from EQuickIndex is a set of four files that contain the hash table, indices for the hash table, the segment names, and a list of the libraries that were indexed. The files must all have the same name, but different file name extensions. The four output files must be kept together in the same directory.
This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).
All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).
Here is the session with EQuickIndex that we used to create the hash tables for GenEMBL1.
% equickindex

 EQUICKINDEX what GCG ".seq" file(s) (* GenEMBLDir:*.seq *) ?

 What base name for the output files (* genembl *) ?

   Em_pr: 1    10,703,734
   Em_vi: 2     7,663,081

 Sorting . . . .

             List size: 928,747

                 Input: GenEMBLDir:*.seq
             Sequences: 13,294
              Segments: 28,627
             Libraries: 2
          Total Length: 18,366,815
        20-Mers Hashed: 928,747
      Corrupted N-Mers: 588
  Seqs with <10 N-Mers: 309
    CPU Time (seconds): 224.47
          Output is in: genembl1.hashtable, .quickindex, .segments, .buckets
EQuickIndex writes four files, each with the same name and a different file name extension. The file of library names (.quickindex) is the only one of the four that can be displayed on your screen. All of the other files are binary and should not be used except as input to EQuickSearch.
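Because the four files must stay together, a quick check before moving or archiving an index can be useful. The following Python sketch is only an illustration: the helper name is hypothetical, and the extensions are taken from the "Output is in:" line of the example session.

```python
from pathlib import Path

# Extensions as listed in the example session's "Output is in:" line.
EXTENSIONS = (".hashtable", ".quickindex", ".segments", ".buckets")

def missing_index_files(directory, base_name):
    """Return the names of expected index files that are absent from directory."""
    directory = Path(directory)
    return [base_name + ext for ext in EXTENSIONS
            if not (directory / (base_name + ext)).is_file()]
```

An empty list means all four files with the given base name are present in the directory.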
EQuickSearch rapidly identifies places where query sequence(s) occur in a nucleotide sequence database. The output is a file of overlaps that can be displayed with QuickMatch or EQuickShow. You can make up your own sequence database or use GenEMBL, which consists of GenBank and those sequences in EMBL that are not represented in GenBank (or the other way around at some sites).
QuickMatch displays the overlaps found by EQuickSearch with either optimal alignments or dot-plots. The alignments can be selected by quality to discard poor matches. The dot-plots can be reviewed rapidly with a graphic screen.
DataSet creates a GCG data library from any set of sequences in GCG format.
Only sequences in data libraries can be used to make the hash tables. Sequences in GCG format can be assembled into data libraries with DataSet. Quick searching is only implemented for nucleotide sequences as this is being written.
The original version of QuickIndex collected indices in three large common blocks, and wrote these indices out to three large files. EQuickSearch then had to read these large files into its own common blocks, which is the part that takes several minutes to complete.
The VMS operating system provides a far more efficient way to do this. Reading data into common blocks uses large amounts of "virtual memory", which is in fact stored in the system's "page file". Instead, you can "map" the common block (or any "section" of memory) to a file of your own, which then acts as a private "page file". When the program accesses a location in the common block, the contents are read from the program's own private file rather than from the system "page file".
This method has many advantages. The main one is that the map section "page file" can be loaded with the common block contents for the selected database. In order to load the data into the common block, the program needs only to open the map section file and all the data is automatically available without being read first.
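The same idea exists on other systems as memory-mapped files. The Python sketch below is an analogy only: it uses Python's standard mmap module, not the VMS map section services, and the file layout (unsigned 32-bit little-endian integers) and function names are assumptions chosen for illustration.

```python
import mmap
import struct

def write_table(path, values):
    """Store unsigned 32-bit integers back to back in a binary file."""
    with open(path, "wb") as handle:
        handle.write(struct.pack(f"<{len(values)}I", *values))

def read_entry(path, position):
    """Map the file read-only and fetch one entry.

    Nothing is read up front; the operating system pages in only the
    part of the file that is actually touched.
    """
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return struct.unpack_from("<I", view, position * 4)[0]
```

Opening the mapping is cheap no matter how large the table is, which corresponds to the advantage described above: the data is available as soon as the file is mapped.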
For VMS system managers, there is a second great advantage. Normally with EQuickSearch each user reads the index files into memory and writes the memory out to the system page file. Because each user is reading and writing these memory locations, each user has his/her own copy in the page file, and the system page file can become very full, causing the system to run very slowly. The map section files are read-only, and all users share a single copy of the files.
The EGCG versions of EQuickIndex and EQuickSearch have simply been changed to use the map section technique to create and read the database indices. The only other change made to them is to allow an option in EQuickSearch to search all the available databases. At EMBL we have six databases indexed for Quick Searching. In addition to five roughly equal sized divisions of GenEMBL we have a database called GeNew which contains all the new entries since the last release of EMBL and GENBANK. These entries are available from the EMBL Network File Server, over the EMBnet network which is being set up throughout Europe, and by E-mail from GenBank.
EQuickIndex requires almost 50 megabytes of virtual memory. This is over three times as much as any other GCG program. If EQuickIndex stops with an error message implying insufficient memory, either wait until the system load is lower and try again, or see the section below for instructions on reducing the memory requirements for building the indices.
Note that EQuickIndex is used to create the index files, and is not intended for every GCG and EGCG user.
You can reduce the virtual memory requirements of EQuickIndex and EQuickSearch, at a cost in disk space, by creating smaller subsets of the databases. At the Sanger Centre, we index GenEmbl in nine parts of roughly equal size, and index the new entries (em_new) in a further set of indices. EQuickSearch by default searches all available indices, and runs much faster on smaller index files as the virtual memory requirements are closer to the available working set and free list sizes.
If you split the indices in this way, you should look at the .quickindex files in directory EQuickDir: and note the maximum values for "Segments" and "Offsets". You can then edit the file EGENINCLUDE:QUICKPARAMETERS.INC to set the value for "ListSize" to a little more than the highest "Offsets" value, and the value for "MaxSegments" to a little more than the highest "Segments" value. You can then recompile and relink the Quick programs (using EGCGBUILD EQUICKINDEX and EGCGBUILD EQUICKSEARCH) to use the new smaller versions.
You will need to increase these values each time a new database is indexed, unless you decide to split the indices even further.
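The arithmetic for choosing the new limits can be sketched as follows. This is an illustration under stated assumptions: the "Segments" and "Offsets" values are typed in by hand after reading each .quickindex file, the 5 percent margin is an arbitrary reading of "a little more", and the function name is hypothetical.

```python
def suggested_limits(index_stats, margin_percent=5):
    """Suggest (ListSize, MaxSegments) from per-index statistics.

    index_stats: list of (segments, offsets) pairs, one pair per index,
    copied by hand from the .quickindex files.
    Each result is the largest observed value plus margin_percent,
    rounded up, using integer arithmetic only.
    """
    max_segments = max(segments for segments, _ in index_stats)
    max_offsets = max(offsets for _, offsets in index_stats)

    def with_margin(value):
        return (value * (100 + margin_percent) + 99) // 100

    return with_margin(max_offsets), with_margin(max_segments)
```

For example, with two indices reporting (1000 segments, 50000 offsets) and (1200 segments, 40000 offsets), the sketch suggests a ListSize of 52500 and a MaxSegments of 1260.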
All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
Minimum syntax: % equickindex [-INfile=]GenEMBLDir:*.seq -Default

Prompted parameters:

[-OUTfile=]genembl   a name for the output files

Local Data Files: None

Optional Switches:

-DIRectory=mydir   writes files in a directory other than "EQuickDir:"
-MONitor           shows each sequence name as it is indexed
-NOSUMmary         suppresses the summary at the end of the run
QuickIndex was designed and implemented by John Devereux. This version was modified to create map section files by Peter Rice at EMBL, Heidelberg, Germany.
None.
The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.
-DIRectory=mydir
lets you direct the output files into a directory other than the directory whose logical name is "EQuickDir."
-MONitor
This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.
-SUMmary
writes a summary of the program's work to the screen when you've used the -Default qualifier to suppress all program interaction. A summary is normally displayed at the end of an interactive run; you can suppress it with -NOSUMmary.
Use this qualifier also to include a summary of the program's work in the log file for a program run in batch.
Printed: April 22, 1996 15:53 (1162)