Transcriptomics: Quantifying Non-Uniform Read Distribution Using MapReduce
Abstract
RNA-seq is a high-throughput next-generation sequencing (NGS) technique for estimating the concentration of all transcripts in a transcriptome. The method involves complex preparatory and post-processing steps that can introduce bias, and it produces a large amount of data [7, 19]. Two important challenges in processing RNA-seq data are therefore handling vast volumes of data and quantifying the bias in public RNA-seq datasets. We describe a novel analysis method, based on sequence motif correlations, that employs MapReduce on Apache Spark to quantify bias in NGS data at the deep exon level. Our implementation is designed specifically for processing large datasets, is scalable, and can be deployed on cloud service providers offering MapReduce. In investigating wild-type and mutant samples of D. melanogaster, we found that motifs containing runs of Gs (or their complement) exhibit low motif-pair correlations compared with other motif pairs. This effect is independent of the mean exon GC content in the wild-type data, but shows a mild dependence in the mutant data. Hence, whilst both datasets show the same trends, there is significant variation between the two samples.
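The abstract does not spell out how motif-pair correlations are computed, so the following is only a toy sketch of the general map/reduce pattern, not the authors' method. It assumes, purely for illustration, that each exon is a short DNA string, that a "motif" is any fixed-length k-mer, and that the correlation of a motif pair is the Pearson correlation of their per-exon occurrence counts. The function names (`map_exon`, `reduce_counts`, `motif_pair_correlation`) are hypothetical, and plain Python stands in for Spark.

```python
from collections import Counter
from functools import reduce
from math import sqrt


def map_exon(exon_id, sequence, k=3):
    """Map step: emit ((exon_id, motif), 1) for every k-mer in the sequence."""
    return [((exon_id, sequence[i:i + k]), 1)
            for i in range(len(sequence) - k + 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (exon_id, motif) key."""
    def merge(acc, key_value):
        key, value = key_value
        acc[key] += value
        return acc
    return reduce(merge, pairs, Counter())


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def motif_pair_correlation(exons, motif_a, motif_b, k=3):
    """Correlate the per-exon counts of two motifs (illustrative definition only)."""
    emitted = [kv for exon_id, seq in exons.items()
               for kv in map_exon(exon_id, seq, k)]
    counts = reduce_counts(emitted)
    exon_ids = sorted(exons)
    xs = [counts[(e, motif_a)] for e in exon_ids]  # a missing key counts as 0
    ys = [counts[(e, motif_b)] for e in exon_ids]
    return pearson(xs, ys)


# Made-up example data; real input would be exon sequences or read alignments.
exons = {"e1": "AAAA", "e2": "AAAT", "e3": "TTTT"}
r = motif_pair_correlation(exons, "AAA", "TTT")
```

On Spark, the map and reduce steps would roughly correspond to the RDD operations `flatMap` and `reduceByKey`, which is what allows this style of computation to be distributed across a cluster.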
References
- 1. S. Aerts, The S. Aerts Lab of Computational Biology (LCB) at KU Leuven, http://gbiomed.kuleuven.be/english/research/50000622/lcb/. (2012) [Online; accessed 4-January-2017].
- 2. Investigation into the annotation of protocol sequencing steps in the sequence read archive, GigaScience 4 (2015) 23.
- 3. Apache Software Foundation, Hadoop 2.7 documentation, http://hadoop.apache.org/docs/r2.7.2/. (2016) [Online; accessed 06-Jan-2017].
- 4. S. R. Archive, Overview of the Sequence Read Archive (SRA), https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi/. (2017) [Online; accessed 3-January-2017].
- 5. ArrayExpress, EMBL-EBI, ArrayExpress functional genomics data, https://www.ebi.ac.uk/arrayexpress/. (2017) [Online; accessed 3-January-2017].
- 6. Effects of GC bias in next-generation-sequencing data on de novo genome assembly, PLoS One 8(4) (2013) e62856.
- 7. A survey of best practices for RNA-seq data analysis, Genome Biology 17(1) (2016) 13.
- 8. Gene expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Research 30(1) (2002) 207–210.
- 9. On the computational complexity of MapReduce, in International Symposium on Distributed Computing (Springer, 2015), pp. 1–15.
- 10. Biases in Illumina transcriptome sequencing caused by random hexamer priming, Nucleic Acids Research 38(12) (2010) e131.
- 11. Why functional programming matters, The Computer Journal 32(2) (1989) 98–107.
- 12. RNA sequencing and analysis, Cold Spring Harbor Protocols 2015(11) (2015) pdb–top084970.
- 13. Google's MapReduce programming model revisited, Science of Computer Programming 70(1) (2008) 1–30.
- 14. The sequence read archive, Nucleic Acids Research 39(Database issue) (2011) D19–21.
- 15. The sequence alignment/map format and SAMtools, Bioinformatics 25(16) (2009) 2078–2079.
- 16. Modeling non-uniformity in short-read rates in RNA-seq data, Genome Biology 11(5) (2010) 1.
- 17. Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems, BMC Bioinformatics 15(1) (2014) 1.
- 18. Cut it out: Silencing of noise in the transcriptome, Nature Structural and Molecular Biology 13(10) (2006) 860.
- 19. Identification and correction of systematic error in high-throughput sequence data, BMC Bioinformatics 12(1) (2011) 451.
- 20. Identifying the impact of G-quadruplexes on Affymetrix 3’arrays using cloud computing, Journal of Integrative Bioinformatics 7(111) (2010).
- 21. Statistics for Engineering and the Sciences (Dellen Publishing Company, San Francisco, CA, USA, 1992).
- 22. Mapping and quantifying mammalian transcriptomes by RNA-seq, Nature Methods 5(7) (2008) 621–628.
- 23. Comparative motif discovery combined with comparative transcriptomics yields accurate targetome and enhancer predictions, Genome Research 23(1) (2013) 74–88.
- 24. Direct RNA sequencing, Nature 461(7265) (2009) 814–818.
- 25. A general model of codon bias due to GC mutational bias, PLoS One 5(10) (2010) e13431.
- 26. Identifying differential alternative splicing events from RNA sequencing data using RNASeq-MATS, Deep Sequencing Data Analysis (2013) 171–179.
- 27. Protocol dependence of sequencing-based gene expression measurements, PLoS One 6(5) (2011) e19287.
- 28. GC-content normalization for RNA-seq data, BMC Bioinformatics 12(1) (2011) 480.
- 29. Improving RNA-Seq expression estimates by correcting for fragment bias, Genome Biology 12(3) (2011) R22.
- 30. N. Sanger, Ensembl, GTFGFF file format specification, http://www.ensembl.org/info/website/upload/gff.html. (2012) [Online; accessed 15-November-2016].
- 31. Advanced sequencing technologies: Methods and goals, Nature Reviews Genetics 5(5) (2004) 335–344.
- 32. MPI: The Complete Reference, Vol. 1: The MPI Core (MIT Press, 1998).
- 33. Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips, BMC Bioinformatics 8(1) (2007) 1.
- 34. Big data: Astronomical or genomical? PLoS Biol. 13(7) (2015) 1–11.
- 35. Transcriptional noise and the fidelity of initiation by RNA polymerase II, Nature Structural and Molecular Biology 14(2) (2007) 103–105.
- 36. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, BMC Bioinformatics 11(Suppl 12) (2010) S1.
- 37. G-spots cause incorrect expression measurement in Affymetrix microarrays, BMC Genomics 9(1) (2008) 1.
- 38. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples, Theory in Biosciences 131(4) (2012) 281–285.
- 39. RNA-seq: A revolutionary tool for transcriptomics, Nature Reviews Genetics 10(1) (2009) 57–63.
- 40. Bias detection and correction in RNA-sequencing data, BMC Bioinformatics 12(1) (2011) 1.