The Fragment Assembly System (FAS)

If you are involved in a sequencing project, you will not obtain the full sequence of a cosmid or a YAC in a single sequencing run. Rather, you will need to subclone the sequence into several overlapping fragments and obtain reasonable coverage of the sequence. This way, a unique consensus can be determined which represents the original sequence as a whole. It is important to understand that even with sufficient coverage some ambiguities or contradictions will occur, and that larger gaps may require extensive additional sequencing.
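The idea of a consensus with ambiguities can be illustrated by a minimal Python sketch. It assumes that the fragments have already been placed at known offsets on one strand; the function name `consensus` and the symbols "N" (contradiction) and "-" (no coverage) are choices made for this illustration and are not part of the GCG software.

```python
from collections import Counter

def consensus(placed_fragments):
    """Majority-vote consensus for fragments given as (offset, sequence) pairs.

    Returns the consensus string and the coverage depth per position.
    Illustration only: assumes the offsets are already known.
    """
    length = max(offset + len(seq) for offset, seq in placed_fragments)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_fragments:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1

    result, depth = [], []
    for column in columns:
        depth.append(sum(column.values()))
        if not column:
            result.append("-")  # no fragment covers this position
            continue
        ranked = column.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result.append("N")  # contradiction that coverage cannot resolve
        else:
            result.append(ranked[0][0])
    return "".join(result), depth

sequence, depth = consensus([(0, "ACGT"), (2, "GAAA")])
print(sequence)  # ACGNAA
print(depth)     # [1, 1, 2, 2, 1, 1]
```

At the fourth position the two fragments disagree and each has one vote, so the sketch reports an ambiguity there. Only a third read covering that position could settle it.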

An alternative strategy is so-called shotgun sequencing, which produces a large number of sequence fragments that can potentially be assembled into a single unique sequence. It has been mentioned before that some software packages allow sophisticated manipulation of the trace files. This section of the BioCompanion explains the basic principles of fragment assembly as implemented in the GCG software, which expects that the sequence fragments have already been entered into the computer.

Principle of Fragment Assembly

The basic steps of a fragment assembly system (FAS) can be characterised as follows:

Note:

Commercial software for personal computer platforms is available in addition to the sequencing software that analyses the traces and converts them into sequences. Frequently, these products also include software to assemble the data into "contigs" as described in this section of the BioCompanion. Keep in mind that the fragment assembly process as described below is not trivial; personal computers can accomplish it only if the sequences do not exceed a certain size. Larger projects might easily exceed the scope or capacity of the software.

The Fragment Assembly Process

The fragments generated in a sequencing project need to be assembled into a common, single sequence. To achieve this goal, the computer analyses the sequence fragments and determines a mathematical solution for arranging the various data sets into a single data set. This is non-trivial for the following reasons:

The Fragment Assembly Data Structures

There are a couple of serious informatics issues to be resolved if you deal with fragment assembly:

Typically, the results, intermediate results, and raw data of sequence assembly projects are stored in databases. These databases differ greatly in structure from sequence databases, and the tools for handling them are very specific. Whereas sequence databases follow a "standard format" and can be used by various software packages, fragment assembly databases are specific to the software and typically cannot be used by other programs. This means that intermediate stages cannot easily be processed with software other than the package the project was started with, so take care in case the software does not work as expected.

NOTE:

Handle fragment assembly databases only with the tools intended for this purpose: if single files are modified or deleted, the entire project might become unusable.

GCG's implementation of the Fragment Assembly System

The software is based on software from Rodger Staden and William Gilbert.

Start of the Fragment Assembly System

gelstart will initialise a fragment assembly database. The name of the project will need to be given. If this is a new name, the required data structures will be created from scratch. If the name has been used before, the existing data will be reused.

 
The command

% gelstart -delete

will delete a fragment assembly project. As mentioned above, it is dangerous to delete single files. Depending on the set-up, deleting single files may even be prohibited by the operating system because of file protections.

Population of the Fragment Assembly Database

To load a fragment assembly database, use the command gelenter. The program will launch an editor similar to GCG's seqed, described elsewhere in the BioCompanion. If you have already entered the data using different software or seqed itself, you can enter many sequences at once using a "wildcard" specification, such as *.seq. For this purpose, use the command

 
% gelenter -enter=*.seq

Hint:

It might be unrealistic to assume that all the sequences you need are in a single directory, with no unrelated sequences present. Follow the procedure below if you have a directory full of sequences and want to enter them in a single run. Briefly, you create a directory, move the sequences you want to assemble into it (the -i option makes mv ask for confirmation before overwriting an existing file), change to this directory, initialise a new project, and enter the sequences there. Use your own names instead of 'myproject' as directory name and 'newproject' as project name.

% mkdir myproject

% mv -i *.seq myproject

% cd myproject

% gelstart newproject

% gelenter -enter=*.seq

To reuse the project later, use the 'cd' and 'gelstart' commands again.

Refer to the file handling commands described earlier for a reference regarding the non-GCG commands used above.

Merging fragments into contigs

The gelmerge program will assemble single sequences into blocks of overlapping sequences called contigs. This procedure must not be mistaken for multiple sequence alignment, as it reflects the overlap of sequence fragments at their 3' and 5' ends rather than an end-to-end alignment. There is an extended section in the GCG manual; use the command genhelp gelmerge algorithm for on-screen help. Briefly, overlaps are determined by a pairwise comparison of the sequence fragments. Refer to the considerations above for details on the assembly process.
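The principle of pairwise overlap detection can be sketched in a few lines of Python. This is not the algorithm used by gelmerge: it is a simplified greedy illustration which assumes exact matches and a single strand, whereas real assembly software has to tolerate sequencing errors and reverse-complemented fragments. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def best_overlap(fragments, min_overlap=3):
    """Compare every ordered pair; return (size, i, j) for the largest overlap."""
    best = (0, None, None)
    for i, left in enumerate(fragments):
        for j, right in enumerate(fragments):
            if i != j:
                size = overlap_length(left, right, min_overlap)
                if size > best[0]:
                    best = (size, i, j)
    return best

def greedy_merge(fragments, min_overlap=3):
    """Join the best-overlapping pair until no overlap remains; return contigs."""
    contigs = list(fragments)
    while len(contigs) > 1:
        size, i, j = best_overlap(contigs, min_overlap)
        if size == 0:
            break  # remaining contigs cannot be joined automatically
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

print(greedy_merge(["ACGTAC", "TACGGA", "GGATTC"]))  # ['ACGTACGGATTC']
```

Fragments that share no sufficiently long overlap remain as separate contigs, which corresponds to the situation where contigs have to be reviewed or joined manually.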

Assembling contigs to a consensus

The gelassemble program will call a multi-sequence editor to review the assembly of single sequences into one or several assembled sequences called contigs. This procedure can be used to validate the calculated overlaps of sequence fragments at their 3' and 5' ends. The software also allows you to join contigs manually if the automatic threshold detection in the assembly algorithm did not join them during the computation. The assembly editor works similarly to the lineup editor described in the sequence families chapter: overlaps are visualised graphically, and a character-by-character "window" allows manipulation. There is an extended section in the GCG manual; use the command genhelp gelassemble for on-screen help.

Assessment of the status of a Fragment Assembly project

The gelview program will show how the individual sequence fragments have been assembled into contigs.
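To give an impression of what such an overview conveys, the following Python sketch prints the fragments of one contig shifted by their offsets. It does not reproduce the actual output format of gelview; the function `render_layout`, the fragment names, and the offsets are invented for this illustration.

```python
def render_layout(contig_name, placed_fragments):
    """Return text lines showing each (name, offset, sequence) at its offset."""
    width = max(len(name) for name, _, _ in placed_fragments)
    lines = [f"Contig {contig_name}"]
    # Sort by offset so the fragments appear from left to right.
    for name, offset, seq in sorted(placed_fragments, key=lambda f: f[1]):
        lines.append(f"{name.ljust(width)}  {' ' * offset}{seq}")
    return lines

for line in render_layout("c1", [("f2", 3, "TACGGA"), ("f1", 0, "ACGTAC")]):
    print(line)
```

Each fragment is indented by its offset, so overlapping regions line up column by column.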

Reverting the Assembly process: Generating individual sequence fragments

Occasionally, the calculation of contigs will not result in a satisfactory solution. As all fragments are stored in a database, the sequences are not accessible directly (and should not be retrieved via the file system either, see above). The geldisassemble program will dissolve any contig and release the individual sequence fragments that had been assembled into it.

