The Fragment Assembly System (FAS)

If you are involved in sequencing projects, you will not obtain the full sequence of a cosmid or a YAC in a single sequencing run. Instead, you will need to subclone the sequence into several overlapping fragments and obtain reasonable coverage of it. From these fragments, a unique consensus can be determined that represents the original sequence as a whole. It is important to understand that even with sufficient coverage some ambiguities or other contradictions will occur, and that larger interruptions may require extensive additional sequencing.

An alternative strategy is so-called shotgun sequencing, which produces a large number of sequence fragments that can potentially be assembled into a single unique sequence. As mentioned before, some software packages allow sophisticated manipulation of the trace files. This section of the BioCompanion explains the basic principles of fragment assembly as implemented in the GCG software, which expects that the sequence fragments have already been entered into the computer.

Principle of Fragment Assembly

The basic steps of a fragment assembly system (FAS) can be characterised as follows:

Acquisition of the sequences. This can be done by manual typing in a sequence editor (the GCG version is seqed), which implies that you read the sequence from a sequencing gel. Alternatively, an automated sequencer gives you traces of fluorescent markers, which are converted into sequence characters by software supplied by the vendor of the sequencing machine.

Note:

In addition to the sequencing software, commercial software for personal computer platforms is available to analyse the traces and convert them into sequences. Frequently, these products also include software to assemble the data into "contigs" as described in this section of the BioCompanion. Keep in mind that the fragment assembly process described below is not trivial and can be accomplished on personal computers only if the sequences do not exceed a certain size. Larger projects might easily exceed the scope or capacity of the software.

Import of the sequences into a sequence assembly software package. This assigns a single name to each sequence fragment, but only loads the data rather than processing it.
Merging the data into "contigs" allows larger fragments to be created from single sequence fragments. This is achieved by superimposing identical ends of fragments in a sophisticated calculation (see below).
Review of the merging is necessary to approve the result of the automatic process, and also allows refinement and corrections in an editor-type program. Specifically, inconsistencies resulting from nearly but not exactly identical overlaps require a review of the original data (see above, the "enter" step). Once all problems have been resolved, a consensus sequence is established which represents the common sequence of all fragments.
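The last step, establishing a consensus and finding the positions that need review, can be illustrated with a minimal Python sketch. It is not the GCG implementation. It assumes that the fragments have already been placed relative to each other and padded with "." where they do not cover a position; the function name, the padding character and the use of "N" for undecided positions are assumptions made for this illustration only.

```python
from collections import Counter

PAD = "."  # assumed padding character for positions a fragment does not cover


def consensus(aligned):
    """Return (consensus string, list of columns where fragments disagree)."""
    width = max(len(row) for row in aligned)
    result, conflicts = [], []
    for col in range(width):
        bases = [row[col] for row in aligned if col < len(row) and row[col] != PAD]
        ranked = Counter(bases).most_common()
        if not ranked:
            result.append("N")            # no fragment covers this column
        elif len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append("N")            # tie between bases: leave undecided
            conflicts.append(col)
        else:
            result.append(ranked[0][0])   # majority base wins
            if len(ranked) > 1:
                conflicts.append(col)     # still flag the disagreement
    return "".join(result), conflicts


fragments = [
    "ACGTAC....",
    "..GTACGG..",
    "....AGGGTT",
]
sequence, to_review = consensus(fragments)
print(sequence)   # ACGTACGGTT
print(to_review)  # [5]
```

Column 5 is reported because one fragment reads G where the other two read C. This is the kind of position for which the original gel or trace would have to be consulted again.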

The Fragment Assembly Process

The fragments generated in a sequencing project need to be assembled into a common, single sequence. To achieve this goal, the computer analyses the sequence fragments and determines a mathematical solution for arranging the various data sets into a single data set. This is non-trivial for the following reasons:

Vector sequences which are frequently present are not part of the desired sequence and therefore inhibit suitable alignment. Typically, these sequences have to be removed (electronically) before the assembly calculation. This requires that the vector sequence is known (or at least the sequence close to the cloning site).

The orientation of the sequence is frequently unknown, and both strands must be analysed. However, only one of the orientations will fit. Depending on the number of fragments considered in a single computation, a significant computational effort will be required.

Many sequences contain errors. When sequences are arranged, errors in overlapping regions might affect the computation significantly. The resulting ambiguities need to be resolved before the sequence is "finished", which implies that the traces (or gels) must be re-analysed or the specific fragment must be re-sequenced.

The coverage of the data must be sufficient. If it is not, the calculation will be incomplete and result in multiple "contigs" rather than one, or parts of the assembled sequence will be covered on one strand only and therefore be determined less reliably than desirable.

Repeats in sequences can make it impossible to compute the desired assembled sequence. Depending on the sequence, it might be necessary to create smaller fragments, larger fragments, or change the sequencing strategy entirely.
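Two of the points above, vector removal and unknown orientation, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the method used by GCG. The vector flanks next to the cloning site are assumed to be known and to occur exactly (no sequencing errors), overlaps are exact matches, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def trim_vector(read, left_flank, right_flank):
    """Keep only the insert between two known vector flanks (exact matches)."""
    read = read.upper()
    start, end = 0, len(read)
    pos = read.find(left_flank.upper())
    if pos != -1:
        start = pos + len(left_flank)
    pos = read.find(right_flank.upper(), start)
    if pos != -1:
        end = pos
    return read[start:end]


def overlap_len(left, right, min_len=4):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def orient(contig, fragment, min_len=4):
    """Try both strands of `fragment` against the end of `contig`."""
    forward = overlap_len(contig, fragment.upper(), min_len)
    reverse = overlap_len(contig, reverse_complement(fragment), min_len)
    if forward == 0 and reverse == 0:
        return None, 0
    return ("forward", forward) if forward >= reverse else ("reverse", reverse)


raw = "TTGGATCCTGGCGTAAGAATTCAA"            # insert flanked by vector sequence
insert = trim_vector(raw, "GGATCC", "GAATTC")
print(insert)                                # TGGCGTAA
print(orient("AAGGCTTACG", insert))          # ('reverse', 5)
```

Every fragment has to be tested in both orientations, which doubles the number of comparisons. This is one reason why the computational effort grows quickly with the number of fragments.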

The Fragment Assembly Data Structures

There are a couple of serious informatics issues to be resolved if you deal with fragment assembly:

Raw data need to be entered.
Processed data (i.e. intermediate steps of the calculation) will need to be changed in order to resolve inconsistencies, and, after validation against the raw data, propagated to the raw data sets. This is a potential source of severe error as a modification history is rarely kept.
Refined data will be created from raw and processed data, respectively, and be much larger than a data set which can be validated manually. Depending on the size of the sequencing project, the assembly of various sequences will be numerically too complex to be obvious, and the user is dependent on the algorithm employed.
The assembly process might have more than one solution if the coverage of the sequences is insufficient. This is a potential trap, as the problems of assemblies based on ambiguous fragments might only become obvious in later steps, when a small number of contigs cannot be joined. In this case additional raw data have to be added (which means more biological experiments), and the contigs achieved so far have to be dissolved again before restarting.
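The danger of losing track of corrections can be illustrated with a small data structure that never changes the raw data and records every correction together with a reason. This is a sketch only and says nothing about how the GCG databases are organised internally. The class and field names are invented for this example, and only single-base substitutions are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    position: int   # 0-based position in the raw sequence
    old: str        # base before the correction
    new: str        # base after the correction
    reason: str     # why the change was made, e.g. "trace re-read"


@dataclass
class Fragment:
    name: str
    raw: str                                  # as entered; never modified
    edits: list = field(default_factory=list)

    def correct(self, position, new_base, reason):
        """Record a substitution instead of overwriting the raw data."""
        old = self.current()[position]
        self.edits.append(Edit(position, old, new_base, reason))

    def current(self):
        """Raw sequence with all recorded corrections applied in order."""
        seq = list(self.raw)
        for edit in self.edits:
            seq[edit.position] = edit.new
        return "".join(seq)


frag = Fragment("frag1", "ACGTNCGT")
frag.correct(4, "A", "trace re-read")
print(frag.raw)        # ACGTNCGT
print(frag.current())  # ACGTACGT
print(frag.edits[0])
```

Because the raw sequence stays untouched, any correction can be checked against the original data or undone later.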

Typically, the results, intermediate results, and raw data of sequence assembly projects are stored in databases. These databases have a very different structure from sequence databases, and the tools to deal with them are very specific. Whereas sequence databases follow a "standard format" and can be used by various software packages, fragment assembly databases are specific to the software and typically cannot be used by other programs. This means that intermediate stages cannot easily be processed with software other than the package used at the start, and therefore care has to be taken in case the software does not work as expected.

NOTE:

Handle fragment assembly databases only with the tools intended for this purpose. If single files are modified or deleted, the entire project might become unusable.

GCG's implementation of the Fragment Assembly System

The software is based on software from Roger Staden and William Gilbert.

Start of the Fragment Assembly System

gelstart will initialise a fragment assembly database. The name of the project will need to be given. If this is a new name, the required data structures will be created from scratch. If the name has been used before, the existing data will be reused.

 
 % gelstart -delete

will delete a fragment assembly project. As mentioned above, it is dangerous to delete single files. Depending on the set-up, it will even be prohibited by the operating system because of file protections.

Population of the Fragment Assembly Database

To load a fragment assembly database, use the command gelenter. The program will launch an editor similar to GCG's seqed described earlier. If you have already entered the data using different software or seqed itself, it is possible to enter many sequences at once using a "wildcard" specification, such as *.seq. For this purpose, use the command

 
 % gelenter -enter=*.seq

Hint:

It might be unrealistic to assume that all the sequences you need are in a single directory with no unrelated sequences present. Follow the procedure below if you have a directory full of sequences and want to enter a selection of them in a single run. Briefly, you create a directory, move the sequences you want into it (adjust the wildcard to select them; the -i option makes mv ask for confirmation before it overwrites an existing file), change to this directory, initialise a new project, and enter the sequences there. Use your own names instead of 'myproject' as directory name and 'newproject' as project name.

% mkdir myproject

% mv -i *.seq myproject

% cd myproject

% gelstart newproject

% gelenter -enter=*.seq

In order to reuse the project, use the 'cd' and 'gelstart' commands again.

Refer to the file handling commands described earlier for a reference regarding the non-GCG commands used above.

Merging fragments into contigs

The gelmerge program will assemble single sequences into blocks of overlapping sequences called contigs. This procedure must not be mistaken for multiple sequence alignment, as it determines the overlap of sequence fragments at their 3' and 5' ends rather than an end-to-end alignment. There is an extended section in the GCG manual; use the command genhelp gelmerge algorithm for on-screen help. Briefly, overlaps are determined by a pairwise comparison of all sequence fragments. Refer to the considerations above for details on the assembly process.
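The idea of a pairwise comparison of fragment ends can be sketched in Python as follows. This is a deliberately simplified illustration and not the gelmerge algorithm, which is documented in the GCG manual. Only exact overlaps on one strand are considered, and the function names and the minimum overlap of four bases are assumptions.

```python
from itertools import permutations


def overlap_len(left, right, min_len=4):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def pairwise_overlaps(fragments, min_len=4):
    """Compare every ordered pair of fragments; return hits, longest first."""
    hits = []
    for (a, seq_a), (b, seq_b) in permutations(fragments.items(), 2):
        n = overlap_len(seq_a, seq_b, min_len)
        if n:
            hits.append((n, a, b))   # the 3' end of a overlaps the 5' end of b
    return sorted(hits, reverse=True)


fragments = {
    "f1": "ACGTTGCAGG",
    "f2": "GCAGGTCCAT",
    "f3": "CCATAGGA",
}
for length, left, right in pairwise_overlaps(fragments):
    print(f"{left} -> {right}: {length} bases")
# f1 -> f2: 5 bases
# f2 -> f3: 4 bases
```

The number of comparisons grows with the square of the number of fragments, and real data additionally require both orientations and a tolerance for sequencing errors.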

Assembling contigs to a consensus

The gelassemble program will call a multi-sequence editor to review the assembly of single sequences into one or several assembled sequences called contigs. This procedure can be used to validate the calculated overlaps of sequence fragments at their 3' and 5' ends. The software also allows you to join contigs manually if the thresholds applied by the assembly algorithm did not permit joining them automatically. The assembly editor works similarly to the lineup editor described in the sequence families chapter. There is an extended section in the GCG manual; use the command genhelp gelassemble for on-screen help. Briefly, overlaps are visualised graphically, and a character-by-character "window" allows manipulation.

Assessment of the status of a Fragment Assembly project

The gelview program will show how the individual single sequence fragments have been assembled into contigs.
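To give an idea of what such an overview conveys, the following Python sketch prints fragments staggered by their position in a contig and counts how many fragments cover each position. It does not reproduce the output format of gelview. The tuple layout (name, offset, sequence) and the function names are assumptions made for this illustration.

```python
def layout(fragments):
    """Return one text line per fragment, indented by its offset in the contig."""
    width = max(len(name) for name, _, _ in fragments)
    ordered = sorted(fragments, key=lambda frag: frag[1])
    return [f"{name:<{width}}  {' ' * offset}{seq}" for name, offset, seq in ordered]


def depth(fragments):
    """Number of fragments covering each contig position."""
    length = max(offset + len(seq) for _, offset, seq in fragments)
    counts = [0] * length
    for _, offset, seq in fragments:
        for i in range(offset, offset + len(seq)):
            counts[i] += 1
    return counts


contig = [("f1", 0, "ACGTAC"), ("f2", 2, "GTACGG"), ("f3", 4, "ACGGTT")]
print("\n".join(layout(contig)))
# f1  ACGTAC
# f2    GTACGG
# f3      ACGGTT
print(depth(contig))  # [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
```

Positions with a depth of 1 are supported by a single fragment only and are candidates for additional sequencing.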

Reverting the Assembly process: Generating individual sequence fragments

Occasionally, the calculation of contigs will not result in a satisfactory solution. As all fragments are stored in a database, the sequences are not accessible directly (and should not be retrieved via the file system either, see above). The geldisassemble program will dissolve any contig and release the individual sequence fragments that had been assembled into it.
