Open Collections
UBC Theses and Dissertations
Copy number estimation for high-throughput short read shotgun sequencing de novo whole-genome assembly contigs
Lim, Yee Fay
Abstract
High-throughput short shotgun sequencing reads, also known as second-generation sequencing (SGS) reads, continue to be prevalent for de novo whole-genome assembly, whether alone or in combination with long-range information. Knowledge of contig multiplicity (copy number) is acknowledged to improve assembly correctness, contiguity, and coverage for SGS reads. Even so, a principled, general solution for contig copy number estimation in de novo whole-genome SGS assembly has been unavailable. In the literature, the problem is generally left unaddressed or given only heuristic treatment.
In this work, we introduce a novel, versatile, statistically informed contig copy number estimator, based on mixture models, for high-throughput short read shotgun sequencing de novo whole-genome assembly. In particular, this tool targets de Bruijn graph assembly, the dominant paradigm for de novo whole-genome SGS assembly. We show that it reliably resolves multiplicities up to low repeat copy numbers and is robust over a range of genome characteristics, sequencing coverage levels, and assembly settings. Moreover, it is far more versatile than the closest existing alternative tools and usually outperforms them, often by a wide margin. At the same time, somewhat reduced though still robust performance in a limited set of experiments using real sequencing data suggests fundamental limitations to its use of only length and read coverage data; incorporating other types of information, e.g. GC content, may be necessary to improve performance. Our code is publicly available at https://github.com/bcgsc/wgs-copynum-est; we hope this effort will provide a useful reference for similar future work.
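To make the idea of a mixture-model copy number estimator concrete, the following is a minimal illustrative sketch and not the estimator from the thesis or the code in the linked repository. It assumes that each contig's mean read coverage is drawn from a Gaussian whose mean and variance both scale linearly with copy number, and that most contigs are single-copy so the median coverage is a reasonable starting point. It ignores contig length, which the abstract names as a second input. The function names `posteriors` and `estimate_copy_numbers` are hypothetical.

```python
import math
from statistics import median


def posteriors(coverages, mu, var, weights):
    """E-step: posterior probability of each copy number k = 1..K per contig."""
    result = []
    for x in coverages:
        # Component k is assumed to have mean k*mu and variance k*var.
        logs = [
            math.log(w)
            - 0.5 * math.log(2 * math.pi * k * var)
            - (x - k * mu) ** 2 / (2 * k * var)
            for k, w in enumerate(weights, start=1)
        ]
        top = max(logs)  # subtract the maximum for numerical stability
        probs = [math.exp(v - top) for v in logs]
        total = sum(probs)
        result.append([p / total for p in probs])
    return result


def estimate_copy_numbers(coverages, max_copies=4, iterations=100):
    """Fit the constrained mixture by EM; return (single-copy coverage, copy numbers)."""
    n = len(coverages)
    mu = median(coverages)  # assumption: most contigs are single-copy
    var = max(mu, 1e-6)     # Poisson-like starting variance
    weights = [1.0 / max_copies] * max_copies
    for _ in range(iterations):
        resp = posteriors(coverages, mu, var, weights)
        # M-step: closed-form updates under the linear mean/variance constraint.
        mu = sum(coverages) / sum(
            r * k for row in resp for k, r in enumerate(row, start=1)
        )
        var = max(
            sum(
                r * (x - k * mu) ** 2 / k
                for x, row in zip(coverages, resp)
                for k, r in enumerate(row, start=1)
            ) / n,
            1e-6,
        )
        weights = [
            max(sum(row[j] for row in resp) / n, 1e-12) for j in range(max_copies)
        ]
    resp = posteriors(coverages, mu, var, weights)
    copies = [1 + max(range(max_copies), key=row.__getitem__) for row in resp]
    return mu, copies
```

Calling `estimate_copy_numbers` on a list of per-contig mean coverages returns the fitted single-copy coverage and one copy number per contig. A practical estimator would also need to handle contigs absent from the genome (copy number zero), short contigs with noisy coverage, and biases such as GC content, which the abstract identifies as a likely limitation of coverage-only approaches.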
Item Metadata
| Title | Copy number estimation for high-throughput short read shotgun sequencing de novo whole-genome assembly contigs |
| Creator | Lim, Yee Fay |
| Publisher | University of British Columbia |
| Date Issued | 2021 |
| Language | eng |
| Date Available | 2021-04-22 |
| Provider | Vancouver : University of British Columbia Library |
| Rights | Attribution-NonCommercial-ShareAlike 4.0 International |
| DOI | 10.14288/1.0396908 |
| Degree Grantor | University of British Columbia |
| Graduation Date | 2021-05 |
| Scholarly Level | Graduate |
| Aggregated Source Repository | DSpace |