World Scientific
Compression for population genetic data through finite-state entropy

    https://doi.org/10.1142/S0219720021500268

    We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability in these data and recommend finite-state entropy algorithms as an arithmetic code naturally suited to compressing population genetic data. In computation and decompression tasks, we show speed and size improvements of between 10% and 40% over modern dictionary compression methods often used for population genetic data, such as Zstd and Zlib. We provide open-source prototype software for multi-phenotype GWAS with finite-state entropy compression, demonstrating significant space savings and speed comparable to the state of the art.
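
The argument for entropy coding over dictionary coding can be illustrated with a small experiment. The following is a minimal sketch, not the authors' software: it assumes genotypes at a single variant are drawn independently under Hardy-Weinberg proportions (a simplifying stand-in for exchangeable samples), stores one genotype per byte, and compares the zeroth-order Shannon entropy bound, which an entropy coder such as finite-state entropy aims to approach, with the output size of Python's standard `zlib` module. The function names, sample size, and minor allele frequency are hypothetical choices made for illustration.

```python
import math
import random
import zlib
from collections import Counter


def simulate_genotypes(n_samples, maf, seed=0):
    """Draw genotypes (0, 1 or 2 minor alleles) independently under
    Hardy-Weinberg proportions. This is an illustrative assumption."""
    rng = random.Random(seed)
    weights = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    return bytes(rng.choices([0, 1, 2], weights=weights, k=n_samples))


def empirical_entropy_bits(data):
    """Zeroth-order Shannon entropy of the symbol frequencies, in bits per symbol."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def compare(n_samples=100_000, maf=0.1, seed=0):
    """Return (entropy bound in bytes, zlib output size in bytes) for one variant."""
    genotypes = simulate_genotypes(n_samples, maf, seed)
    bound_bytes = empirical_entropy_bits(genotypes) * n_samples / 8
    zlib_bytes = len(zlib.compress(genotypes, 9))
    return bound_bytes, zlib_bytes


if __name__ == "__main__":
    bound, deflated = compare()
    print(f"entropy bound: {bound:.0f} bytes, zlib level 9: {deflated} bytes")
```

When sample order carries no information, the symbol frequencies are all a compressor can exploit, so the gap between the two numbers indicates how much room a dictionary method leaves for an entropy coder on data of this kind. Real genotype data also has correlation between neighbouring variants, which this sketch does not model.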
