If you are involved in sequencing projects, you will not obtain the full sequence of a cosmid
or a YAC in a single sequencing run. Rather, you will need to subclone the sequence into several
overlapping fragments and obtain reasonable coverage of the sequence. This way, a unique
consensus can be determined which represents the original whole sequence. It is important to
understand that even with sufficient coverage, some ambiguities or contradictions between
fragments will occur, and that larger interruptions may even require extensive additional
sequencing.
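To illustrate how a consensus can be derived from overlapping fragments, here is a minimal Python sketch. It assumes the fragments have already been placed at known offsets and simply takes a majority vote per position. The function name `consensus`, the `(offset, sequence)` input format and the use of "N" for unresolved positions are assumptions made for this illustration, not part of the GCG software.

```python
from collections import Counter


def consensus(placed_fragments, uncovered="-"):
    """Majority-vote consensus from (offset, sequence) pairs.

    Returns the consensus string and the list of ambiguous positions,
    i.e. positions where the two most frequent bases are tied.
    """
    length = max(offset + len(seq) for offset, seq in placed_fragments)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_fragments:
        for i, base in enumerate(seq.upper()):
            columns[offset + i][base] += 1

    result, ambiguous = [], []
    for pos, counts in enumerate(columns):
        if not counts:
            result.append(uncovered)  # no fragment covers this position
            continue
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
            result.append("N")  # tie: the fragments contradict each other
            ambiguous.append(pos)
        else:
            result.append(ranked[0][0])
    return "".join(result), ambiguous


# Three overlapping fragments with two disagreements (hypothetical data).
fragments = [(0, "ACGTACGT"), (4, "AGGTTTGA"), (6, "GATTGACC")]
sequence, ambiguous = consensus(fragments)
print(sequence)   # ACGTANGTTTGACC
print(ambiguous)  # [5]
```

Position 5 is covered by only two fragments that disagree, so it stays ambiguous, whereas position 7 is covered three times and the majority decides. This is why higher coverage resolves more contradictions.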
An alternative strategy is so-called shotgun sequencing, which produces
a large number of sequence fragments that can potentially be assembled into a single unique
sequence. It has been mentioned before that
some software packages allow sophisticated manipulation of the trace files.
This section of the BioCompanion explains the basic principles of fragment assembly as implemented
in the GCG software, which expects that the sequence fragments have already been entered into
the computer.
The basic steps of a fragment assembly system (FAS) can be characterised as follows:
Note:
Commercial software for personal computer platforms is available in addition to the sequencing
software to analyse the traces and convert these into sequences. Frequently, these products
will also include software to assemble the data into "contigs" as described in this section of
the BioCompanion. Keep in mind that the fragment assembly process as described below is not trivial
and can be accomplished by personal computers only if the sequences do not exceed a certain size.
Larger projects, however, might easily exceed the scope or capacity of the software.
The fragments generated in a sequencing project need to
be assembled into a common, single sequence. To achieve this goal, the computer analyses
the sequence fragments and computes how to arrange the various data
sets into a single data set. This is non-trivial for the following reasons:
There are a couple of serious informatics issues to be resolved if you deal with fragment
assembly. Typically, the results, intermediate results, and raw data of sequence assembly projects
are stored in a database. This database has a very
different structure from sequence databases, and the tools to deal with it
are very specific. Whereas sequence databases follow a "standard format" and can be used by various
software packages, fragment assembly databases are specific to the software and typically
cannot be used by other programs. This means that intermediate stages cannot easily be processed
with software other than the package used at the start, and therefore care has
to be taken in case the software does not work as expected.
NOTE:
Handle fragment assembly databases only with the tools intended for
this purpose: if single files are modified or deleted, the entire project might become unusable.
The software is based on software from Roger Staden and William Gilbert.
Start of the Fragment Assembly System
gelstart will initialise
a fragment assembly database. The name of the project will need to be given. If this is a new
name, the required data structures will be created from scratch. If the name has been used before,
the existing data will be reused.
Population of the Fragment Assembly Database
To load a fragment assembly database, use the command gelenter. The program
will launch an editor similar to GCG's seqed, as
described elsewhere. If you have already entered the data using different software or seqed
itself, it is possible to enter many sequences at once using a "wildcard" specification, such as *.seq.
For this purpose, use the command gelenter -enter=*.seq.
Hint:
It might be unrealistic to assume that all the sequences you need are in a single directory,
without any unrelated sequences present. Follow the procedure below if you have a directory
full of sequences and want to enter the sequences in a single run. Briefly, you create a directory,
move all the sequences you want to enter into it (the -i option makes mv ask for confirmation
before overwriting an existing file), change to this directory, initialise a new project, and enter
the sequences there. Use your own names instead of 'myproject' as directory name and 'newproject' as project name.
% mkdir myproject
% mv -i *.seq myproject
% cd myproject
% gelstart newproject
% gelenter -enter=*.seq
In order to reuse the project, use the 'cd' and 'gelstart' commands again. Refer to the file handling commands described
earlier for a reference regarding the non-GCG commands used above.
Merging fragments into contigs
The gelmerge program will assemble single sequences into blocks of overlapping
sequences called contigs. This procedure must not be mistaken for
multiple sequence alignment, as it matches the overlapping 3' and 5' ends of sequence
fragments rather than aligning them end to end. There is an extended section
in the GCG manual - use the command genhelp gelmerge algorithm for on-screen
help. Briefly, overlaps are determined by pairwise comparison of all sequence fragments. Refer
to the considerations above for details on the assembly process.
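The idea of finding overlaps by pairwise comparison can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the algorithm used by gelmerge: it only accepts exact matches between the end of one fragment and the start of another, ignores reverse-complement strands and sequencing errors, and merges greedily. The function names and the `min_overlap` parameter are assumptions made for this example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_merge(fragments, min_overlap=3):
    """Repeatedly join the pair of sequences with the longest exact overlap."""
    contigs = list(fragments)
    while True:
        best_size, best_i, best_j = 0, None, None
        # Pairwise comparison of every ordered pair of sequences.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical fragments: three overlap, one is unrelated.
fragments = ["ACGTACGA", "CGATTGCA", "TGCAGGTC", "TTTTTTTT"]
print(greedy_merge(fragments))  # ['TTTTTTTT', 'ACGTACGATTGCAGGTC']
```

The unrelated fragment remains a contig of its own, which mirrors the situation where a project ends up with several contigs that cannot be joined automatically.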
Assembling contigs to a consensus
The gelassemble program will call a
multi-sequence editor to review the assembly
of single sequences into one or several assembled sequences called contigs.
This procedure can be used to validate the computed overlaps of sequence fragments at their 3' and
5' ends. The software also allows you to join contigs manually if the automatic
threshold detection in the assembly algorithm did not join them during the computation.
The assembly editor works similarly to the lineup editor described in the sequence families chapter. There is an extended
section in the GCG manual - use the command genhelp gelassemble for on-screen
help. Briefly, overlaps are visualised graphically, and a character-by-character "window" allows
manipulation.
Assessment of the status of a Fragment Assembly project
The gelview program will show how the individual single
sequence fragments have been assembled into contigs.
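To give an impression of what such an overview conveys, the following Python sketch draws a schematic layout of fragments within one contig. It does not reproduce the actual output format of gelview; the function name `layout_lines` and the `(name, offset, length)` input format are assumptions made for this illustration.

```python
def layout_lines(placed_fragments):
    """Return one text line per fragment, drawn as a bar at its offset.

    placed_fragments: list of (name, offset, length) tuples.
    """
    width = max(len(name) for name, _, _ in placed_fragments)
    lines = []
    # Sort by offset so the fragments appear from left to right.
    for name, offset, length in sorted(placed_fragments, key=lambda f: f[1]):
        lines.append(f"{name:<{width}}  " + " " * offset + "-" * length)
    return lines


# Hypothetical contig built from three fragments of eight bases each.
contig = [("frag2", 4, 8), ("frag1", 0, 8), ("frag3", 6, 8)]
for line in layout_lines(contig):
    print(line)
```

Each bar shows where a fragment lies in the contig, so regions covered by only one fragment are easy to spot.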
Reverting the Assembly process: Generating individual sequence fragments
Occasionally, the calculation of contigs will not result in a satisfactory solution. As all fragments
are stored in a database, the sequences are not accessible directly (and should not be retrieved
via the file system either, see above). The geldisassemble
program will dissolve any contig and release the individual sequence fragments
that had been assembled into it.
Principle of Fragment Assembly
The Fragment Assembly Process
The Fragment Assembly Data Structures
GCG's implementation of the Fragment Assembly System
% gelstart -delete
will delete a fragment assembly project. As mentioned above, it is dangerous
to delete single files. Depending on the set-up, the operating system may even
prohibit this through file protections.
% gelenter -enter=*.seq