World Scientific
Transcriptomics: Quantifying Non-Uniform Read Distribution Using MapReduce

    https://doi.org/10.1142/S0129054118430086
    Cited by: 2 (Source: Crossref)

    RNA-seq is a high-throughput next-generation sequencing technique for estimating the concentration of all transcripts in a transcriptome. The method involves complex preparatory and post-processing steps that can introduce bias, and it produces a large amount of data [7, 19]. Two important challenges in processing RNA-seq data are therefore handling vast amounts of data and quantifying the bias in public RNA-seq datasets. We describe a novel analysis method, based on analysing sequence motif correlations, that employs MapReduce on Apache Spark to quantify bias in next-generation sequencing (NGS) data at the deep exon level. Our implementation is designed specifically for processing large datasets and allows for scalability and deployment on cloud service providers offering MapReduce. In investigating wild-type and mutant organisms of the species D. melanogaster, we found that motifs with runs of Gs (or their complement) exhibit low motif-pair correlations in comparison with other motif pairs. This is independent of the mean exon GC content in the wild-type data, but there is a mild dependence in the mutant data. Hence, whilst both datasets show the same trends, there is nevertheless significant variation between the two samples.
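
The motif-correlation idea in the abstract can be illustrated with a minimal, single-machine sketch in plain Python. It is not the authors' Spark implementation. It assumes, purely for illustration, that reads arrive already assigned to exons as `(exon_id, sequence)` pairs, that a motif is a fixed-length k-mer, and that the motif-pair correlation is a Pearson correlation of per-exon motif counts. The function names `map_read`, `reduce_counts`, `motif_counts` and `motif_pair_correlations` are hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import combinations
from math import sqrt


def map_read(record, motifs):
    """Map step: emit ((exon_id, motif), 1) for each motif occurrence in one read."""
    exon_id, seq = record
    k = len(motifs[0])  # assumption: all motifs share the same length
    wanted = set(motifs)
    return [((exon_id, seq[i:i + k]), 1)
            for i in range(len(seq) - k + 1)
            if seq[i:i + k] in wanted]


def reduce_counts(acc, pair):
    """Reduce step: sum the emitted values for each (exon_id, motif) key."""
    key, value = pair
    acc[key] += value
    return acc


def motif_counts(reads, motifs):
    """Run the map step over every read, then fold the emitted pairs by key."""
    emitted = (pair for record in reads for pair in map_read(record, motifs))
    return reduce(reduce_counts, emitted, Counter())


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either series has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def motif_pair_correlations(reads, motifs):
    """Correlate per-exon counts for every pair of motifs (assumed measure)."""
    counts = motif_counts(reads, motifs)
    exons = sorted({exon_id for exon_id, _ in reads})
    return {
        (a, b): pearson([counts[(e, a)] for e in exons],
                        [counts[(e, b)] for e in exons])
        for a, b in combinations(motifs, 2)
    }


# Toy input: three exons, one read each (made-up sequences, not real data).
reads = [("e1", "GGGGA"), ("e2", "ACGTACGT"), ("e3", "GGGGACGT")]
correlations = motif_pair_correlations(reads, ["GGGG", "ACGT"])
```

In a distributed setting the same two steps would typically be expressed as a flat-map over reads followed by a reduce by key, which is what lets the counting scale across a cluster.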

    References

    • 1. S. Aerts, The S. Aerts Lab of Computational Biology (LCB) at KU Leuven, http://gbiomed.kuleuven.be/english/research/50000622/lcb/ (2012) [Online; accessed 4-January-2017].
    • 2. J. Alnasir and H. P. Shanahan, Investigation into the annotation of protocol sequencing steps in the sequence read archive, GigaScience 4 (2015) 23.
    • 3. Apache Software Foundation, Hadoop 2.7 documentation, http://hadoop.apache.org/docs/r2.7.2/ (2016) [Online; accessed 06-Jan-2017].
    • 4. S. R. Archive, Overview of the Sequence Read Archive (SRA), https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi/ (2017) [Online; accessed 3-January-2017].
    • 5. ArrayExpress, EMBL-EBI, ArrayExpress functional genomics data, https://www.ebi.ac.uk/arrayexpress/ (2017) [Online; accessed 3-January-2017].
    • 6. Y.-C. Chen et al., Effects of GC bias in next-generation-sequencing data on de novo genome assembly, PLoS One 8(4) (2013) e62856.
    • 7. A. Conesa et al., A survey of best practices for RNA-seq data analysis, Genome Biology 17(1) (2016) 13.
    • 8. R. Edgar, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Research 30(1) (2002) 207–210.
    • 9. B. Fish et al., On the computational complexity of MapReduce, in International Symposium on Distributed Computing (Springer, 2015), pp. 1–15.
    • 10. K. D. Hansen, S. E. Brenner and S. Dudoit, Biases in Illumina transcriptome sequencing caused by random hexamer priming, Nucleic Acids Research 38(12) (2010) e131.
    • 11. J. Hughes, Why functional programming matters, The Computer Journal 32(2) (1989) 98–107.
    • 12. K. R. Kukurba and S. B. Montgomery, RNA sequencing and analysis, Cold Spring Harbor Protocols 2015(11) (2015) pdb.top084970.
    • 13. R. Lämmel, Google's MapReduce programming model revisited, Science of Computer Programming 70(1) (2008) 1–30.
    • 14. R. Leinonen, H. Sugawara and M. Shumway, The sequence read archive, Nucleic Acids Research 39(Database issue) (2011) D19–21.
    • 15. H. Li et al., The sequence alignment/map format and SAMtools, Bioinformatics 25(16) (2009) 2078–2079.
    • 16. J. Li, H. Jiang and W. H. Wong, Modeling non-uniformity in short-read rates in RNA-seq data, Genome Biology 11(5) (2010) 1.
    • 17. R. Liu, A. E. Loraine and J. A. Dickerson, Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems, BMC Bioinformatics 15(1) (2014) 1.
    • 18. S. Lykke-Andersen and T. H. Jensen, Cut it out: Silencing of noise in the transcriptome, Nature Structural and Molecular Biology 13(10) (2006) 860.
    • 19. F. Meacham et al., Identification and correction of systematic error in high-throughput sequence data, BMC Bioinformatics 12(1) (2011) 451.
    • 20. F. N. Memon et al., Identifying the impact of G-quadruplexes on Affymetrix 3' arrays using cloud computing, Journal of Integrative Bioinformatics 7(111) (2010).
    • 21. W. Mendenhall and T. Sincich, Statistics for Engineering and the Sciences (Dellen Publishing Company, San Francisco, CA, USA, 1992).
    • 22. A. Mortazavi et al., Mapping and quantifying mammalian transcriptomes by RNA-seq, Nature Methods 5(7) (2008) 621–628.
    • 23. M. Naval-Sánchez et al., Comparative motif discovery combined with comparative transcriptomics yields accurate targetome and enhancer predictions, Genome Research 23(1) (2013) 74–88.
    • 24. F. Ozsolak et al., Direct RNA sequencing, Nature 461(7265) (2009) 814–818.
    • 25. G. A. Palidwor, T. J. Perkins and X. Xia, A general model of codon bias due to GC mutational bias, PLoS One 5(10) (2010) e13431.
    • 26. J. W. Park et al., Identifying differential alternative splicing events from RNA sequencing data using RNASeq-MATS, Deep Sequencing Data Analysis (2013) 171–179.
    • 27. T. Raz et al., Protocol dependence of sequencing-based gene expression measurements, PLoS One 6(5) (2011) e19287.
    • 28. D. Risso et al., GC-content normalization for RNA-seq data, BMC Bioinformatics 12(1) (2011) 480.
    • 29. A. Roberts et al., Improving RNA-seq expression estimates by correcting for fragment bias, Genome Biology 12(3) (2011) R22.
    • 30. N. Sanger, Ensembl, GTF/GFF file format specification, http://www.ensembl.org/info/website/upload/gff.html (2012) [Online; accessed 15-November-2016].
    • 31. J. Shendure et al., Advanced sequencing technologies: Methods and goals, Nature Reviews Genetics 5(5) (2004) 335–344.
    • 32. M. Snir, MPI: The Complete Reference: The MPI Core, Vol. 1 (MIT Press, 1998).
    • 33. M. A. Stalteri and A. P. Harrison, Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips, BMC Bioinformatics 8(1) (2007) 1.
    • 34. Z. D. Stephens et al., Big data: Astronomical or genomical?, PLoS Biol. 13(7) (2015) 1–11.
    • 35. K. Struhl, Transcriptional noise and the fidelity of initiation by RNA polymerase II, Nature Structural and Molecular Biology 14(2) (2007) 103–105.
    • 36. R. C. Taylor, An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, BMC Bioinformatics 11(Suppl 12) (2010) S1.
    • 37. G. J. Upton, W. B. Langdon and A. P. Harrison, G-spots cause incorrect expression measurement in Affymetrix microarrays, BMC Genomics 9(1) (2008) 1.
    • 38. G. P. Wagner, K. Kin and V. J. Lynch, Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples, Theory in Biosciences 131(4) (2012) 281–285.
    • 39. Z. Wang, M. Gerstein and M. Snyder, RNA-seq: A revolutionary tool for transcriptomics, Nature Reviews Genetics 10(1) (2009) 57–63.
    • 40. W. Zheng, L. M. Chung and H. Zhao, Bias detection and correction in RNA-sequencing data, BMC Bioinformatics 12(1) (2011) 1.