Full error analysis for the training of deep neural networks
Abstract
Deep learning algorithms have been applied very successfully in recent years to a range of problems out of reach for classical solution paradigms. Nevertheless, there is no completely rigorous mathematical error and convergence analysis which explains the success of deep learning algorithms. The error of a deep learning algorithm can in many situations be decomposed into three parts: the approximation error, the generalization error, and the optimization error. In this work we estimate each of these three errors for a certain deep learning algorithm and combine the three error estimates to obtain an overall error analysis for the deep learning algorithm under consideration. In particular, we thereby establish convergence with a suitable convergence speed for the overall error of the deep learning algorithm under consideration. Our convergence speed analysis is far from optimal: the convergence speed that we establish is rather slow, increases exponentially in the dimensions, and, in particular, suffers from the curse of dimensionality. The main contribution of this work is, instead, to provide a full error analysis (i) which covers each of the three different sources of errors usually emerging in deep learning algorithms and (ii) which merges these three sources of errors into one overall error estimate for the considered deep learning algorithm.
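As a schematic illustration of the three-part decomposition mentioned above (this is a standard bias-variance-optimization style bound, not the precise statement proved in this work; the notation for the risk $\mathcal{E}$, the empirical risk $\hat{\mathcal{E}}_n$ based on $n$ samples, the considered class of neural networks $\mathcal{F}$, the trained network $\hat{f}\in\mathcal{F}$, and the target function $f^{*}$ is assumed here purely for illustration), the overall error can be split as

$$
\underbrace{\mathcal{E}(\hat f)-\mathcal{E}(f^{*})}_{\text{overall error}}
\;\le\;
\underbrace{\inf_{f\in\mathcal{F}}\mathcal{E}(f)-\mathcal{E}(f^{*})}_{\text{approximation error}}
\;+\;
\underbrace{2\sup_{f\in\mathcal{F}}\bigl|\mathcal{E}(f)-\hat{\mathcal{E}}_{n}(f)\bigr|}_{\text{generalization error}}
\;+\;
\underbrace{\hat{\mathcal{E}}_{n}(\hat f)-\inf_{f\in\mathcal{F}}\hat{\mathcal{E}}_{n}(f)}_{\text{optimization error}}.
$$

Bounding each of the three terms on the right-hand side, and then combining the bounds, yields an estimate for the overall error of the considered deep learning algorithm.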
Communicated by M. Roeckner