BIRS Workshop Lecture Videos

PDE approach to regularization in deep learning (Oberman, Adam)

Description

Deep neural networks have achieved significant success on a number of challenging engineering problems. There is consensus in the community that some form of smoothing (regularization) of the loss function is needed, and hundreds of papers and many conferences have addressed this topic in the past three years. However, so far there has been little analysis by mathematicians.

The fundamental tool in training deep neural networks is Stochastic Gradient Descent (SGD) applied to the ``loss'' function $f(x)$, which is high dimensional and nonconvex:
\begin{equation}\label{SGDintro}\tag{SGD} dx_t = -\nabla f(x_t)\, dt + dW_t. \end{equation}
Progress on regularization has been slow, in part because smoothing techniques such as convolution, which are useful in low dimensions, are computationally intractable in the high dimensional setting.

Two recent algorithms have shown promise in this direction. The first, \cite{zhang2015deep}, uses a mean field approach to perform SGD in parallel. The second, \cite{chaudhari2016entropy}, replaces $f$ in \eqref{SGDintro} with $f_\gamma(x)$, the \emph{local entropy} of $f$, which is defined using notions from statistical physics \cite{baldassi2016unreasonable}. We interpret both algorithms as replacing $f$ with $f_\gamma$, where $f_\gamma(x) = u(x,\gamma)$ and $u$ is the solution of the viscous Hamilton-Jacobi PDE
\[ u_t(x,t) = - \frac{1}{2} |\nabla u(x,t)|^2 + \beta^{-1} \Delta u(x,t), \qquad u(x,0) = f(x). \]
This interpretation provides theoretical validation for the empirical results. However, what is needed for \eqref{SGDintro} is $\nabla f_\gamma(x)$. Remarkably, for short times this vector can be computed efficiently by solving an auxiliary \emph{convex optimization} problem, which has much better convergence properties than the original nonconvex problem. Tools from optimal transportation \cite{santambrogio2016euclidean} are used to justify the fast convergence of the solution of the auxiliary problem. In practice, this algorithm has significantly improved the training time (speed of convergence) for deep networks in high dimensions. The algorithm can also be applied to other nonconvex minimization problems where \eqref{SGDintro} is used.
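For concreteness, one standard way to make the local entropy explicit (following the statistical-physics definition cited above; the precise constants depend on the chosen normalization of $\beta$ and $\gamma$, so the formulas below are a sketch rather than the speaker's exact convention) is via the Cole-Hopf transformation, which linearizes the viscous Hamilton-Jacobi equation and yields an integral representation of $f_\gamma$,
\[ f_\gamma(x) = -\beta^{-1} \log \int_{\mathbb{R}^d} \exp\Big(-\beta f(y) - \frac{\beta}{2\gamma}|x-y|^2\Big)\, dy + \text{const}, \]
together with the gradient identity
\[ \nabla f_\gamma(x) = \frac{1}{\gamma}\big(x - \mathbb{E}_{\rho}[\,y\,]\big), \qquad \rho(y;x) \propto \exp\Big(-\beta f(y) - \frac{\beta}{2\gamma}|x-y|^2\Big), \]
so computing $\nabla f_\gamma(x)$ reduces to estimating the mean of the Gibbs measure $\rho$, which is the quantity the auxiliary problem targets.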
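To illustrate how such a gradient might be estimated in practice, the short sketch below approximates $\mathbb{E}_\rho[y]$ with a few steps of unadjusted Langevin dynamics, in the spirit of the inner loop of \cite{chaudhari2016entropy}. This is not the speaker's implementation: the names and constants (grad_f, gamma, beta, eta, inner_steps) are illustrative choices only.

import numpy as np

def local_entropy_grad(x, grad_f, gamma=0.1, beta=1e3, inner_steps=20, eta=0.01, rng=None):
    """Estimate grad f_gamma(x) = (x - <y>) / gamma, where <y> is the mean of
    rho(y; x) ~ exp(-beta f(y) - beta |x - y|^2 / (2 gamma)).
    Sketch only: the expectation is approximated by a short run of unadjusted
    Langevin dynamics; all constants are illustrative, not tuned values."""
    rng = np.random.default_rng() if rng is None else rng
    y = x.copy()
    y_mean = x.copy()
    for k in range(1, inner_steps + 1):
        # Gradient of U(y) = f(y) + |x - y|^2 / (2 gamma); the target is rho ~ exp(-beta U).
        drift = grad_f(y) + (y - x) / gamma
        # Langevin step at temperature 1/beta.
        y = y - eta * drift + np.sqrt(2.0 * eta / beta) * rng.standard_normal(y.shape)
        # Running average of the iterates approximates <y> under rho.
        y_mean += (y - y_mean) / k
    return (x - y_mean) / gamma

# Usage on a toy nonconvex problem: f(x) = (x^2 - 1)^2, so grad f(x) = 4x^3 - 4x.
grad_f = lambda x: 4.0 * x**3 - 4.0 * x
x = np.array([2.0])
for _ in range(100):
    x = x - 0.05 * local_entropy_grad(x, grad_f)  # SGD step on f_gamma instead of f

The design point is that the outer iteration never needs $f_\gamma$ itself, only its gradient, which is available through the (well-conditioned) inner averaging problem.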

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International