BIRS Workshop Lecture Videos
PDE approach to regularization in deep learning
Oberman, Adam
Description
Deep neural networks have achieved significant success on a number of challenging engineering problems. There is consensus in the community that some form of smoothing of the loss function is needed, and hundreds of papers and many conferences in the past three years have addressed this topic; so far, however, there has been little analysis by mathematicians. The fundamental tool for training deep neural networks is Stochastic Gradient Descent (SGD) applied to the ``loss'' function $f(x)$, which is high dimensional and nonconvex:
\begin{equation}\label{SGDintro}\tag{SGD}
dx_t = -\nabla f(x_t)\, dt + dW_t.
\end{equation}
Progress on regularizing the loss has been slow, in part because smoothing techniques such as convolution, while useful in low dimensions, are computationally intractable in the high dimensional setting. Two recent algorithms have shown promise in this direction. The first, \cite{zhang2015deep}, uses a mean field approach to perform SGD in parallel. The second, \cite{chaudhari2016entropy}, replaces $f$ in \eqref{SGDintro} with $f_\gamma(x)$, the \emph{local entropy} of $f$, which is defined using notions from statistical physics \cite{baldassi2016unreasonable}. We interpret both algorithms as replacing $f$ with $f_\gamma$, where $f_\gamma(x) = u(x,\gamma)$ and $u$ is the solution of the viscous Hamilton-Jacobi PDE
\[
u_t(x,t) = -\frac{1}{2} |\nabla u(x,t)|^2 + \beta^{-1} \Delta u(x,t),
\]
with initial data $u(x,0) = f(x)$. This interpretation provides theoretical validation for the empirical results. However, what \eqref{SGDintro} requires is $\nabla f_\gamma(x)$. Remarkably, for short times this vector can be computed efficiently by solving an auxiliary \emph{convex optimization} problem, which has much better convergence properties than the original nonconvex problem.
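The identification of local entropy with the solution of the viscous Hamilton-Jacobi equation can be sketched via the Cole-Hopf transformation; the normalization below is one consistent choice of constants (the cited papers may differ by factors of two):
\[
v(x,t) = e^{-\beta u(x,t)/2}, \qquad u = -\frac{2}{\beta}\log v .
\]
If $v$ solves the heat equation $v_t = \beta^{-1}\Delta v$ with $v(x,0) = e^{-\beta f(x)/2}$, a direct computation gives
\[
u_t = -\frac{1}{2}|\nabla u|^2 + \beta^{-1}\Delta u, \qquad u(x,0) = f(x),
\]
so $f_\gamma(x) = u(x,\gamma)$ is, up to constants, the negative log of a Gaussian convolution of $e^{-\beta f/2}$. This representation is tractable to sample, but direct quadrature is intractable in high dimensions.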
Tools from optimal transportation \cite{santambrogio2016euclidean} justify the fast convergence of solutions of the auxiliary problem. In practice, the algorithm significantly improves the training time (speed of convergence) for deep networks in high dimensions. It can also be applied to other nonconvex minimization problems where \eqref{SGDintro} is used.
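As a small illustration of why $\nabla f_\gamma$ is accessible: under one common normalization of local entropy, differentiating under the integral reduces the gradient to $(x - \bar y)/\gamma$, where $\bar y$ is the mean of a Gibbs measure concentrated near $x$. In one dimension this identity can be checked directly by quadrature. The toy loss `f` and all constants below are illustrative choices, not taken from the talk:

```python
import numpy as np

def f(x):
    """Toy nonconvex 'loss' in one dimension (illustrative only)."""
    return np.sin(3.0 * x) + 0.5 * x ** 2

beta, gamma = 5.0, 0.5            # inverse temperature, smoothing time
y = np.linspace(-6.0, 6.0, 4001)  # quadrature grid
dy = y[1] - y[0]

def local_entropy(x):
    # f_gamma(x) = -(1/beta) log \int exp(-beta (f(y) + |x-y|^2/(2 gamma))) dy
    w = np.exp(-beta * (f(y) + (x - y) ** 2 / (2.0 * gamma)))
    return -np.log(np.sum(w) * dy) / beta

def grad_local_entropy(x):
    # Differentiating under the integral: grad f_gamma(x) = (x - ybar)/gamma,
    # where ybar is the mean of the Gibbs measure proportional to the weights w.
    w = np.exp(-beta * (f(y) + (x - y) ** 2 / (2.0 * gamma)))
    ybar = np.sum(y * w) / np.sum(w)
    return (x - ybar) / gamma

# Finite-difference check of the gradient identity at a sample point.
x0, h = 1.0, 1e-4
fd = (local_entropy(x0 + h) - local_entropy(x0 - h)) / (2.0 * h)
assert abs(fd - grad_local_entropy(x0)) < 1e-5
```

In high dimensions the quadrature is replaced by Langevin (MCMC) sampling of the Gibbs measure, which is the inner loop performed in practice in \cite{chaudhari2016entropy}.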
Item Metadata
Title |
PDE approach to regularization in deep learning
|
Creator |
Oberman, Adam
|
Publisher |
Banff International Research Station for Mathematical Innovation and Discovery
|
Date Issued |
2017-05-04T09:03
|
Description | |
Extent |
58 minutes
|
Subject | |
Type | |
File Format |
video/mp4
|
Language |
eng
|
Notes |
Author affiliation: McGill University
|
Series | |
Date Available |
2017-11-01
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0357416
|
URI | |
Affiliation | |
Peer Review Status |
Unreviewed
|
Scholarly Level |
Faculty
|
Rights URI | |
Aggregated Source Repository |
DSpace
|