For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability

    In this paper, we study kernel ridgeless regression, including the case of interpolating solutions. We prove that maximizing the leave-one-out (CVloo) stability minimizes the expected error. We also prove that the minimum norm solution, to which gradient algorithms are known to converge, is the most stable solution. More precisely, we show that the minimum norm interpolating solution minimizes a bound on CVloo stability, which in turn is controlled by the smallest singular value, and hence the condition number, of the empirical kernel matrix. These quantities can be characterized in the asymptotic regime where both the dimension d and the cardinality n of the data go to infinity (with n/d → γ as d, n → ∞). Our results suggest that CVloo stability of the learning algorithm with respect to perturbations of the training set may provide a more general framework than the classical theory of Empirical Risk Minimization (ERM). While ERM was developed for the classical regime, in which the architecture of the learning network is fixed and n → ∞, the modern regime focuses on interpolating regressors and overparameterized models, where both d and n go to infinity. Since the stability framework is known to be equivalent to the classical theory in the classical regime, our results suggest that it may be worth extending it beyond kernel regression to other overparameterized algorithms such as deep networks.
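    As a rough illustration of the quantities discussed in the abstract, the following Python sketch (not taken from the paper; the Gaussian kernel, the bandwidth, and the synthetic data are illustrative assumptions) computes a minimum-norm interpolating solution via the pseudoinverse of the empirical kernel matrix, reports the matrix's smallest singular value and condition number, and compares predictions before and after removing one training point, in the spirit of CVloo stability.

```python
# Illustrative sketch only: minimum-norm kernel interpolation and the
# condition number of the empirical kernel matrix (synthetic data).
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    # K[i, j] = exp(-||x_i - z_j||^2 / (2 * bandwidth^2))
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
n, d = 200, 50                          # cardinality n, dimension d
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

K = gaussian_kernel(X, X)

# Minimum-norm interpolating solution: coefficients c = K^+ y (no ridge term).
c = np.linalg.pinv(K) @ y

# The stability bound in the paper is controlled by the smallest singular
# value, and hence the condition number, of the empirical kernel matrix.
s = np.linalg.svd(K, compute_uv=False)
print("smallest singular value:", s[-1])
print("condition number:", s[0] / s[-1])

# Leave-one-out perturbation: refit without point i, compare predictions at x_i.
i = 0
mask = np.arange(n) != i
c_loo = np.linalg.pinv(gaussian_kernel(X[mask], X[mask])) @ y[mask]
f_full = gaussian_kernel(X[i:i+1], X) @ c
f_loo = gaussian_kernel(X[i:i+1], X[mask]) @ c_loo
print("CVloo-type prediction change at point i:", abs(f_full[0] - f_loo[0]))
```

    On such synthetic data, shrinking the smallest singular value of K (for instance by making training points nearly coincide) inflates both the condition number and the leave-one-out prediction change, which is the qualitative relationship the paper's bound formalizes.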

    AMSC: 68Q32, 68T05
