
International Conference on Mathematical Biology and

Annual Meeting of The Society for Mathematical Biology,

July 27-30, 2009

University of British Columbia, Vancouver


Program

CTD7a
Inanc Birol
British Columbia Genome Sciences Centre
Title De novo assembly of transcriptomes with ABySS
Abstract Second-generation sequencing technologies are now used routinely to investigate the genomes and transcriptomes of a wide variety of species. Although increasing read lengths and protocols for paired-end reads with various insert sizes have enabled de novo assemblies of genomes, analysis of whole transcriptome shotgun data has so far been carried out by aligning reads to a reference. As powerful as alignment-based analysis methods are, they are poorly suited to detecting novel events. In this study, we present a de novo assembly approach for the analysis of transcriptomes, using the ABySS assembler, to address such shortcomings. The problem of transcriptome assembly is substantially different from the problem of genome assembly. For example, whereas changes in coverage levels in a genome assembly may indicate repeat structures or otherwise be distributed randomly, in a transcriptome assembly we would expect coverage to swing widely, because it is affected by the different expression levels of the various transcripts. Similarly, whereas contig growth ambiguities in a genome assembly represent unresolved repeat structures, in a transcriptome assembly they correspond to isoform, gene family, or allelic variations, and thus harbor useful and important information. Because of these variations, and because of abundant small products, the contiguity of a transcriptome assembly will be low; unlike in a genome assembly, this is not an indication of the quality of the assembly. On the other hand, the capability to assemble transcriptomes opens up many opportunities, including the identification of novel transcripts and retained introns, and the resolution of isoforms, gene families, and allelic differences and their relative expression levels, which would elude detection by alignment-based analyses.
The ABySS algorithm is based on a de Bruijn digraph representation of sequence neighborhoods, in which each sequence read is decomposed into tiled sub-reads (k-mers), and k-mers that overlap by k-1 bases are connected by directed edges. This approach is amenable to distributed representation, and our parallel implementation relaxes the memory and computation time restrictions present in other available de novo assemblers. ABySS reports three levels of output: (1) the assembled contigs, (2) information on allelic differences, collapsed near-repeats and read errors, and (3) contig adjacencies through overlaps and connecting read pairs, if available. In this work, we also report on our assembly visualization tool, ABySSexplorer, which uses the output from ABySS and enables manual inspection and refinement of assemblies. Furthermore, it aids the incorporation of additional data and guides high-throughput assembly finishing work.
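The de Bruijn construction described above can be illustrated with a minimal Python sketch. This is not the ABySS implementation: the function name `build_de_bruijn` is hypothetical, the graph is held in a single in-memory dictionary rather than distributed, and reverse complements, read errors and coverage are ignored (all simplifying assumptions).

```python
def build_de_bruijn(reads, k):
    """Toy de Bruijn digraph: map each k-mer to its successor k-mers.

    Assumption: single-strand reads over the alphabet ACGT; reverse
    complements and sequencing errors are deliberately ignored.
    """
    # Decompose every read into its tiled, overlapping k-mers.
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])

    # Connect k-mer u to k-mer v when the last k-1 bases of u
    # equal the first k-1 bases of v (a directed edge u -> v).
    edges = {}
    for kmer in nodes:
        suffix = kmer[1:]
        edges[kmer] = sorted(
            suffix + base for base in "ACGT" if suffix + base in nodes
        )
    return edges


# Two overlapping reads that together spell ACGTCA.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", graph[kmer])
```

In this sketch, a k-mer with more than one successor marks the kind of contig growth ambiguity discussed in the abstract, which in a transcriptome may reflect isoform, gene family or allelic variation rather than an unresolved repeat.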
Coauthors Shaun Jackman, Cydney Nielsen, Jenny Qian, Marco Marra, Steven JM Jones
Location CHBE 102