Equickindex

EQUICKINDEX*


FUNCTION

EQuickIndex builds hash tables from sequence(s) in data libraries, and stores them as map sections. These tables make up the database that is searched by EQuickSearch.

NOTE: The EGCG Quick Searching System programs are now fully supported by the EGCG team. GCG distributed the original programs in the hope that users would make suggestions about their future development. This program is one such suggestion.


DESCRIPTION

EQuickSearch gets its speed from the fact that all of the words in a set of sequences are first sorted into dictionaries so that they can be found rapidly. EQuickIndex is the tool that makes up those dictionaries. Such dictionaries are referred to in computing as "hash tables."
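
As a rough illustration of the idea (not the actual EQuickIndex implementation), the sketch below sorts every overlapping word of a set of sequences into a Python dictionary keyed by the word. The function name build_word_table, the toy sequences, the short word length, and the rule of skipping words that contain ambiguous bases are all assumptions made for the example; the session shown under EXAMPLE reports 20-mers.

```python
from collections import defaultdict

def build_word_table(sequences, word_size=4):
    """Map each word to the (sequence name, offset) pairs where it occurs.

    Hypothetical helper for illustration only. word_size is kept small so
    that the toy data below produces readable output.
    """
    table = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()
        # Slide a window of word_size over the sequence, one base at a time.
        for offset in range(len(seq) - word_size + 1):
            word = seq[offset:offset + word_size]
            # Assumption: words containing anything other than A, C, G, T
            # are left out of the table.
            if set(word) <= set("ACGT"):
                table[word].append((name, offset))
    return table

toy_library = {"seq1": "ACGTACGT", "seq2": "TTACGTNA"}
word_table = build_word_table(toy_library)
print(word_table["ACGT"])  # [('seq1', 0), ('seq1', 4), ('seq2', 2)]
```

Once the table exists, finding every place a word occurs is a single dictionary lookup rather than a scan of all the sequences, which is where the speed comes from.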

The input to EQuickIndex is one or more data library sequence files. These are the files in each library that have the file name extension ".seq". These ".seq" files are NOT the same as individual GCG sequence files. If you don't know what a GCG sequence data library is, look at the documentation for DataSet in the Program Manual. The input file specification is a (possibly ambiguous) file specification for one or more ".seq" files.

The output from EQuickIndex is a set of four files that contain the hash table, indices for the hash table, the segment names, and a list of the libraries that were indexed. The files must all have the same name, but different file name extensions. The four output files must be kept together in the same directory.
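
Because the four files only work as a set, it can be handy to check that none has gone missing after copying or moving them. The sketch below is a hypothetical helper, not part of EGCG; the extensions are taken from the "Output is in:" line of the session under EXAMPLE, and the temporary directory and base name are stand-ins for a real index directory.

```python
import tempfile
from pathlib import Path

# Extensions as listed on the "Output is in:" line of the example session.
INDEX_EXTENSIONS = (".hashtable", ".quickindex", ".segments", ".buckets")

def missing_index_files(directory, base_name):
    """Return the extensions of any index files absent from directory."""
    directory = Path(directory)
    return [ext for ext in INDEX_EXTENSIONS
            if not (directory / (base_name + ext)).is_file()]

# Demonstration with empty placeholder files: only three of four are present.
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".hashtable", ".quickindex", ".segments"):
        (Path(tmp) / ("genembl" + ext)).touch()
    demo_missing = missing_index_files(tmp, "genembl")

print(demo_missing)  # ['.buckets']
```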


AUTHOR

This GCG program was modified by Peter Rice (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre, Hinxton Hall, Cambridge, CB10 1RQ, UK).

All EGCG programs are supported by the EGCG Support Team, who can be contacted by E-mail (egcg@embnet.org).


EXAMPLE

Here is the session with EQuickIndex that we used to create the hash tables for GenEMBL1.

  
  
  % equickindex
  
    EQUICKINDEX what GCG ".seq" file(s) (* GenEMBLDir:*.seq *) ?
  
    What base name for the output files (* genembl *) ?
  
      Em_pr: 1   10,703,734
      Em_vi: 2    7,663,081
  
    Sorting . . . . List size: 928,747
             Input: GenEMBLDir:*.seq
         Sequences: 13,294
          Segments: 28,627
         Libraries: 2
      Total Length: 18,366,815
    20-Mers Hashed: 928,747
  Corrupted N-Mers: 588
   Seqs with <10 N-Mers: 309
CPU Time (seconds): 224.47
  
      Output is in: genembl1.hashtable, .quickindex, .segments, .buckets
  


OUTPUT

EQuickIndex writes four files, each with the same name and a different file name extension. The file of library names (.quickindex) is the only one of the four that can be displayed on your screen. All of the other files are binary and should not be used except as input to EQuickSearch.


RELATED PROGRAMS

EQuickSearch rapidly identifies places where query sequence(s) occur in a nucleotide sequence database. The output is a file of overlaps that can be displayed with QuickMatch or EQuickShow. You can make up your own sequence database or use GenEMBL, which consists of GenBank and those sequences in EMBL that are not represented in GenBank (or the other way around at some sites).

QuickMatch displays the overlaps found by EQuickSearch with either optimal alignments or dot-plots. The alignments can be selected by quality to discard poor matches. The dot-plots can be reviewed rapidly with a graphic screen.

DataSet creates a GCG data library from any set of sequences in GCG format.


RESTRICTIONS

Only sequences in data libraries can be used to make the hash tables. Sequences in GCG format can be assembled into data libraries with DataSet. At the time of writing, quick searching is implemented only for nucleotide sequences.


MAPPED SECTIONS

The original version of QuickIndex collected indices in three large common blocks, and wrote these indices out to three large files. EQuickSearch then had to read these large files into its own common blocks, which is the part that takes several minutes to complete.

The VMS operating system provides a far more efficient way to do this. Instead of reading data into common blocks and using large amounts of "virtual memory" (which is in fact stored in the system's "page file"), you can "map" the common block (or any "section" of memory) to your own "page file". When the program accesses a location in the common block, the memory contents are read from the program's own private file rather than from the system "page file".

This method has many advantages. The main one is that the map section "page file" can be loaded in advance with the common block contents for the selected database. To load the data into the common block, the program needs only to open the map section file; all the data is then available without being read first.

For VMS system managers, there is a second great advantage. Normally with EQuickSearch each user reads the index files into memory and writes the memory out to the system page file. Because each user is reading and writing these memory locations, each user has a separate copy in the page file, and the system page file can become very full, causing the system to run very slowly. The map section files are read-only, and all users share a single copy of the files.
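
The same idea can be tried on other systems. The sketch below uses Python's standard mmap module as an analogue of a VMS mapped section; it is not how EQuickIndex or EQuickSearch are written. The file name demo.index and its layout (a flat run of 4-byte integers) are invented for the example.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct("<I")  # assumed layout: one 4-byte unsigned integer per entry

# Build a small stand-in "index file" holding 1000 integers.
path = os.path.join(tempfile.mkdtemp(), "demo.index")
with open(path, "wb") as f:
    for value in range(1000):
        f.write(ENTRY.pack(value * 2))

def read_entry(index_path, position):
    """Fetch one entry through a read-only mapping of the file."""
    with open(index_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # The file is not read up front; slicing the mapping touches
            # only the part of the file that holds this entry.
            start = position * ENTRY.size
            return ENTRY.unpack(mapped[start:start + ENTRY.size])[0]

print(read_entry(path, 750))  # 1500
```

Opening the mapping with ACCESS_READ mirrors the read-only nature of the map section files: the program can look values up but cannot modify the file.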

The EGCG versions of EQuickIndex and EQuickSearch have simply been changed to use the map section technique to create and read the database indices. The only other change made to them is to allow an option in EQuickSearch to search all the available databases. At EMBL we have six databases indexed for Quick Searching. In addition to five roughly equal-sized divisions of GenEMBL we have a database called GeNew which contains all the new entries since the last release of EMBL and GenBank. These entries are available from the EMBL Network File Server, over the EMBnet network which is being set up throughout Europe, and by E-mail from GenBank.


MEMORY REQUIREMENTS

EQuickIndex requires almost 50 megabytes of virtual memory. This is over three times as much as any other GCG program. If EQuickIndex stops with an error message implying insufficient memory, either wait until the system load is lower and try again, or follow the instructions later in this section for reducing the memory requirements for building the indices.

Note that EQuickIndex is used to create the index files, and is not intended for every GCG and EGCG user.

You can reduce the virtual memory requirements of EQuickIndex and EQuickSearch, at a cost in disk space, by creating smaller subsets of the databases. At the Sanger Centre, we index GenEMBL in nine parts of roughly equal size, and index the new entries (em_new) in a further set of indices. EQuickSearch by default searches all available indices, and runs much faster on smaller index files because the virtual memory requirements are closer to the available working set and free list sizes.

If you split the indices in this way, you should look at the .quickindex files in directory EQuickDir: and note the maximum values for "Segments" and "Offsets". You can then edit the file EGENINCLUDE:QUICKPARAMETERS.INC to set the value for "ListSize" to a little more than the highest "Offsets" value, and the value for "MaxSegments" to a little more than the highest "Segments" value. You can then recompile and relink the Quick programs (using EGCGBUILD EQUICKINDEX and EGCGBUILD EQUICKSEARCH) to use the new smaller versions.
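
The arithmetic for choosing the new values can be sketched as follows. The function suggested_limit, the 5 percent margin, and the "Segments" and "Offsets" figures are all made up for illustration; read the real figures from your own .quickindex files and pick whatever margin suits your site.

```python
def suggested_limit(observed_values, headroom_percent=5):
    """Return the largest observed value plus a margin, rounded up."""
    largest = max(observed_values)
    # Ceiling division in integer arithmetic avoids floating-point surprises.
    margin = -(-largest * headroom_percent // 100)
    return largest + margin

# Hypothetical figures noted from three .quickindex files.
observed_offsets = [928747, 901332, 950000]
observed_segments = [28627, 31250, 30110]

list_size = suggested_limit(observed_offsets)
max_segments = suggested_limit(observed_segments)
print("ListSize    >=", list_size)     # ListSize    >= 997500
print("MaxSegments >=", max_segments)  # MaxSegments >= 32813
```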

You will need to increase these values each time a new database is indexed, unless you decide to split the indices even further.


COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the option -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

  
  
  Minimum syntax: % equickindex [-INfile=]GenEMBLDir:*.seq -Default
  
  Prompted parameters:
  
  [-OUTfile=]genembl       a name for the output files
  
  Local Data Files: None
  
  Optional Switches:
  
  -DIRectory=mydir  writes files in a directory other than "EQuickDir:"
  -MONitor          shows each sequence name as it is indexed
  -NOSUMmary        suppresses the summary at the end of the run
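
The abbreviation rule described above (the capitalized letters are the minimum you must type) can be modelled with a few lines of Python. This is only a sketch of the rule as stated in this manual, not the parser used by the GCG or EGCG programs; the function names and the matching details are assumptions.

```python
def minimum_prefix(qualifier):
    """Return the leading capitalized letters, e.g. 'DIR' for 'DIRectory'."""
    prefix = ""
    for ch in qualifier:
        if not ch.isupper():
            break
        prefix += ch
    return prefix

def match_qualifier(typed, qualifiers):
    """Return the qualifier that typed abbreviates, or None if there is none."""
    typed = typed.lstrip("-").lower()
    for qualifier in qualifiers:
        required = minimum_prefix(qualifier).lower()
        # typed must include the required letters and must not run past
        # or diverge from the full qualifier name.
        if typed.startswith(required) and qualifier.lower().startswith(typed):
            return qualifier
    return None

# Qualifier names as spelled in the summary above.
QUALIFIERS = ["INfile", "OUTfile", "DIRectory", "MONitor",
              "NOSUMmary", "CHEck", "Default"]

print(match_qualifier("-dir", QUALIFIERS))      # DIRectory
print(match_qualifier("-MONITOR", QUALIFIERS))  # MONitor
print(match_qualifier("-di", QUALIFIERS))       # None
```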
  


ACKNOWLEDGMENT

QuickIndex was designed and implemented by John Devereux. This version was modified to create map section files by Peter Rice at EMBL, Heidelberg, Germany.


LOCAL DATA FILES

None.


OPTIONAL PARAMETERS

The parameters and switches listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the GCG User's Guide.

-DIRectory=[MyDir.Sequences]

lets you direct the output files into a directory other than the directory whose logical name is "EQuickDir."

-MONitor

This program normally monitors its progress on your screen. However, when you use the -Default option to suppress all program interaction, you also suppress the monitor. You can turn it back on with this option. If your program is running in batch, the monitor will appear in the log file. If the monitor is slowing the program down, suppress it with -NOMONitor.

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default qualifier to suppress all program interaction. A summary is normally displayed at the end of an interactive run; you can suppress it with -NOSUMmary.

Use this qualifier also to include a summary of the program's work in the log file for a program run in batch.

Printed: April 22, 1996 15:53 (1162)