Compression for population genetic data through finite-state entropy
Abstract
We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability of these data, recommending finite state entropy algorithms as an arithmetic code naturally suited for compression of population genetic data. We show between and speed and size improvements over modern dictionary compression methods that are often used for population genetic data such as Zstd and Zlib in computation and decompression tasks. We provide open source prototype software for multi-phenotype GWAS with finite state entropy compression demonstrating significant space saving and speed comparable to the state-of-the-art.
References
- 1. , 10 years of GWAS discovery: Biology, function, and translation, Am J Hum Genet 101(1) :5–22, 2017. Crossref, Medline, Google Scholar
- 2. Gailly J, Adler M, Zlib compression library, Available at http://www.dspace.cam.ac.uk/handle/1810/3486, 2004. Google Scholar
- 3. Collet Y, Skibiński P, Zstandard release v1.5.0. facebook/zstd, Available at https://github.com/facebook/zstd/releases/tag/v1.5.0, 2019. Google Scholar
- 4. Band G, Marchini J, BGEN: A binary file format for imputed genotype and haplotype data, bioRxiv preprint 101101/308296v2, 2018. Google Scholar
- 5. , Second-generation PLINK: Rising to the challenge of larger and richer datasets, GigaScience 4(1), 2015. Crossref, Google Scholar
- 6. , The UK Biobank resource with deep phenotyping and genomic data, Nature 562(7726) :203–209, 2018. Crossref, Medline, Google Scholar
- 7. Chang et al., PLINK 2 File format specification draft, Available at https://github.com/chrchang/plink-ng/tree/master/pgen_spec, 2019. Google Scholar
- 8. , A mathematical theory of communication, Bell Syst Tech J 27 :3–55, 1948. Crossref, Google Scholar
- 9. Collet Y, New generation entropy library, Claude Sammut and Geoffrey I. Webb (eds.). Available at https://github.com/Cyan4973/FiniteStateEntropy, 2019, pp. 81–89. Google Scholar
- 10. Duda J, Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding, arXiv preprint 13112540, 2013. Google Scholar
- 11. , A method for the construction of minimum-redundancy codes, Proc Inst Radio Eng 40(9) :1098–1101, 1952. Google Scholar
- 12. , Communication in the presence of noise, Proc Inst Radio Eng 37(1) :10–21, 1949. Google Scholar
- 13. Kelleher J, Wong Y, Albers P, Wohns AW, McVean G, Inferring the ancestry of everyone, bioRxiv preprint 101101/458067, 2018. Google Scholar
- 14. , DNA sequence compression using the Burrows-Wheeler transform, Proc IEEE Computer Society Bioinformatics Conf, pp. 303–313, 2002. Crossref, Google Scholar
- 15. , Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences, Bioinformatics 35(19) :3826–3828, 2019. Crossref, Medline, Google Scholar
- 16. Sweeten A, Accurate alignment-free inference of microbial phylogenies, PhD Thesis, Simon Fraser University, 2019. Google Scholar
- 17. , High-speed and high-ratio referential genome compression, Bioinformatics 33(21) :3364–3372, 2017. Crossref, Medline, Google Scholar
- 18. Holmes I, Modular non-repeating codes for DNA storage, arXiv preprint 160601799v2, 2016. Google Scholar
- 19. ,
Bayesian nonparametric models , in Encyclopedia of Machine Learning, Springer, 2010. Google Scholar - 20. , Population structure and eigenanalysis, PLOS Genet 2(12) :2074–2093, 2006. Crossref, Google Scholar
- 21. , Lossless Compression Handbook, Academic Press, 2012. Google Scholar
- 22. , Entropy lower bounds for dictionary compression, Proc 30th Annual Symp Combinatorial Pattern Matching, 2019. Google Scholar
- 23. , HAPGEN2: Simulation of multiple disease SNPs, Bioinformatics 27(16) :2304–2305, 2011. Crossref, Medline, Google Scholar
- 24. , Generating samples under a Wright-Fisher neutral model, Bioinformatics 18(2), 2002. Crossref, Medline, Google Scholar
- 25. , A flexible and accurate genotype imputation method for the next generation of genome-wide association studies, PLOS Genet 5(6) :1–15, 2009. Crossref, Google Scholar
- 26. , LDheatmap: An R function for graphical display of pairwise linkage disequilibria between SNPs, J Stat Softw 16(3) :1–10, 2006. Google Scholar
- 27.
All of Us Research Program Investigators , The “All of Us” research program, N. Engl. J. Med. 381(7) :668–676, 2019. Crossref, Medline, Google Scholar