Measuring consistency among gene set analysis methods: A systematic study
Abstract
Gene set analysis is a quantitative approach for generating biological insight from gene expression datasets. The abundance of gene set analysis methods speaks to their popularity, but it raises the question of how strongly results depend on the choice of method. Our systematic analysis of 13 popular methods on 6 datasets, of both DNA microarray and RNA-Seq origin, shows that this choice matters a great deal. The overall number of gene sets reported by each method differed by up to two orders of magnitude, and some methods were biased toward reporting large gene sets. Furthermore, the methods disagreed substantially on their 20 most statistically significant gene sets, and this disagreement persisted when the comparison was expanded to the 100 most significant gene sets. For different datasets of the same phenotype or condition, the top 20 and top 100 most significant results also showed little to no agreement, even when the same method was used. GAGE, PAGE, and ORA were the only methods that achieved relatively high reproducibility when comparing the 20 and 100 most statistically significant gene sets. Biological validation on a juvenile idiopathic arthritis (JIA) dataset showed wide variation in the relevance of the top 20 and top 100 most significant gene sets to the known biology of the disease: GAGE predicted the most relevant gene sets, followed by GSEA, ORA, and PAGE.
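As an illustration of what agreement between top-ranked gene sets can mean in practice, the sketch below computes the pairwise Jaccard overlap of the k most significant gene sets reported by each method. The abstract does not state which agreement measure the study used, so the Jaccard index, the function names, and the toy method and gene set labels are all assumptions made for this example.

```python
from itertools import combinations


def top_k(ranked_gene_sets, k):
    """Return the k most significant gene sets as a set.

    Assumes the input list is ordered from most to least significant.
    """
    return set(ranked_gene_sets[:k])


def jaccard(a, b):
    """Jaccard index of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_agreement(results, k):
    """Jaccard overlap of the top-k gene sets for every pair of methods.

    `results` maps a method name to its ranked list of gene set names.
    """
    tops = {method: top_k(ranked, k) for method, ranked in results.items()}
    return {
        (m1, m2): jaccard(tops[m1], tops[m2])
        for m1, m2 in combinations(sorted(tops), 2)
    }


# Hypothetical rankings, not data from the study.
toy_results = {
    "method_A": ["gs1", "gs2", "gs3", "gs4"],
    "method_B": ["gs2", "gs1", "gs5", "gs6"],
    "method_C": ["gs7", "gs8", "gs9", "gs1"],
}

agreement_top2 = pairwise_agreement(toy_results, k=2)
agreement_top4 = pairwise_agreement(toy_results, k=4)
```

The same function can be applied to one method run on two datasets of the same condition, which gives a simple reproducibility score in the sense described above. A value near 0 means the two top-k lists share almost nothing; a value of 1 means they are identical as sets, regardless of order.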