Full error analysis for the training of deep neural networks

    Deep learning algorithms have been applied very successfully in recent years to a range of problems out of reach for classical solution paradigms. Nevertheless, there is no completely rigorous mathematical error and convergence analysis which explains the success of deep learning algorithms. The error of a deep learning algorithm can in many situations be decomposed into three parts, the approximation error, the generalization error, and the optimization error. In this work we estimate for a certain deep learning algorithm each of these three errors and combine these three error estimates to obtain an overall error analysis for the deep learning algorithm under consideration. In particular, we thereby establish convergence with a suitable convergence speed for the overall error of the deep learning algorithm under consideration. Our convergence speed analysis is far from optimal and the convergence speed that we establish is rather slow, increases exponentially in the dimensions, and, in particular, suffers from the curse of dimensionality. The main contribution of this work is, instead, to provide a full error analysis (i) which covers each of the three different sources of errors usually emerging in deep learning algorithms and (ii) which merges these three sources of errors into one overall error estimate for the considered deep learning algorithm.
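    The three-way splitting referred to above can be illustrated by the following generic bound (a minimal sketch in standard empirical-risk-minimization notation, not necessarily the precise decomposition employed in the paper): writing $R$ for the true risk, $R_n$ for the empirical risk based on $n$ training samples, $\Theta$ for the parameter set of the considered network architecture, and $\hat{\theta}$ for the parameter vector returned by the training method, one typically estimates

    % Illustrative generic splitting (assumed notation, not the paper's exact statement):
    % R = true risk, R_n = empirical risk, \Theta = network parameter set,
    % \hat{\theta} = output of the optimization method.
    \begin{align*}
    R(\hat{\theta}) - \inf_{f} R(f)
      &\le \underbrace{\Big[\inf_{\theta \in \Theta} R(\theta) - \inf_{f} R(f)\Big]}_{\text{approximation error}}
       + \underbrace{2 \sup_{\theta \in \Theta} \big| R(\theta) - R_n(\theta) \big|}_{\text{generalization error}} \\
      &\quad + \underbrace{\Big[ R_n(\hat{\theta}) - \inf_{\theta \in \Theta} R_n(\theta) \Big]}_{\text{optimization error}} .
    \end{align*}

    The bound follows by inserting and subtracting $R_n(\hat{\theta})$, $\inf_{\theta \in \Theta} R_n(\theta)$, and $\inf_{\theta \in \Theta} R(\theta)$ and bounding the two resulting risk/empirical-risk differences by the uniform deviation; the paper's analysis estimates each of the three bracketed terms for the considered deep learning algorithm.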

    Communicated by M. Roeckner

    AMSC: 65G99, 68T07
