Protein–protein interaction site prediction using random forest proximity distance
Abstract
A front-end method based on random forest proximity distance (PD) is used to screen the test set and thereby improve protein–protein interaction site (PPIS) prediction. Distance metrics are assessed under the assumption that a higher-quality distance definition leads to higher classification performance. On an independent test set, numerical analysis based on statistical inference shows that PD has an advantage over the Mahalanobis and cosine distances. Because the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize it: the method adjusts the tree composition by adjusting the size of the training set. The iterative method yields two PD metrics, 75PD and 50PD. On two independent test sets, 75PD achieved a higher Matthews correlation coefficient and F1 score than the PD produced by the original training set, and the differences were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the predictor performs. These results indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
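The abstract does not state the PD formula, so the following is only a minimal sketch under stated assumptions. It uses the common random forest proximity (the fraction of trees in which two samples land in the same leaf) and takes the distance as one minus that proximity. The function names, the use of the minimum distance to the training set, and the threshold value are all hypothetical choices for illustration, not the authors' method. The per-tree leaf indices are assumed to be precomputed, for example with scikit-learn's `RandomForestClassifier.apply`.

```python
from typing import List, Sequence


def proximity(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Fraction of trees in which two samples fall into the same leaf."""
    if len(leaves_a) != len(leaves_b) or len(leaves_a) == 0:
        raise ValueError("leaf vectors must be non-empty and of equal length")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def proximity_distance(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Assumed distance definition: 1 - proximity (0 = always co-located)."""
    return 1.0 - proximity(leaves_a, leaves_b)


def distance_to_training_set(
    test_leaves: Sequence[int], train_leaves: Sequence[Sequence[int]]
) -> float:
    """Assumed aggregate: distance to the nearest training sample."""
    return min(proximity_distance(test_leaves, t) for t in train_leaves)


def screen(
    test_set: Sequence[Sequence[int]],
    train_leaves: Sequence[Sequence[int]],
    threshold: float,
) -> List[int]:
    """Indices of test samples whose distance to the training set is <= threshold."""
    return [
        i
        for i, t in enumerate(test_set)
        if distance_to_training_set(t, train_leaves) <= threshold
    ]


# Toy leaf indices for a hypothetical 4-tree forest (one entry per tree).
train_leaves = [[1, 4, 2, 7], [1, 5, 2, 6]]
test_leaves = [[1, 4, 2, 6], [3, 9, 8, 0]]

distances = [distance_to_training_set(t, train_leaves) for t in test_leaves]
kept = screen(test_leaves, train_leaves, threshold=0.5)  # hypothetical cutoff
print(distances, kept)
```

In this toy example the first test sample shares a leaf with each training sample in three of four trees, so it passes the screen. The second shares none and is filtered out as too far from the training data for its prediction to be considered reliable.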


