BIRS Workshop Lecture Videos

PDE approach to regularization in deep learning (Oberman, Adam)

Description

Deep neural networks have achieved significant success on a number of challenging engineering problems. There is consensus in the community that some form of smoothing (regularization) of the loss function is needed, and hundreds of papers and many conferences have addressed this topic in the past three years. However, so far there has been little analysis by mathematicians.

The fundamental tool in training deep neural networks is Stochastic Gradient Descent (SGD) applied to the ``loss'' function $f(x)$, which is high dimensional and nonconvex:
\begin{equation}\label{SGDintro}\tag{SGD} dx_t = -\nabla f(x_t)\, dt + dW_t. \end{equation}
Progress on regularization has been slow, in part because smoothing techniques such as convolution, which are useful in low dimensions, are computationally intractable in the high dimensional setting.

Two recent algorithms have shown promise in this direction. The first, \cite{zhang2015deep}, uses a mean field approach to perform SGD in parallel. The second, \cite{chaudhari2016entropy}, replaces $f$ in \eqref{SGDintro} with $f_\gamma(x)$, the \emph{local entropy} of $f$, which is defined using notions from statistical physics \cite{baldassi2016unreasonable}. We interpret both algorithms as replacing $f$ with $f_\gamma$, where $f_\gamma(x) = u(x,\gamma)$ and $u$ is the solution of the viscous Hamilton-Jacobi PDE
\[ u_t(x,t) = - \frac{1}{2} |\nabla u(x,t)|^2 + \beta^{-1} \Delta u(x,t), \qquad u(x,0) = f(x). \]
This interpretation provides theoretical validation for the empirical results. However, what is needed for \eqref{SGDintro} is $\nabla f_\gamma(x)$. Remarkably, for short times this vector can be computed efficiently by solving an auxiliary \emph{convex optimization} problem, which has much better convergence properties than the original nonconvex problem. Tools from optimal transportation \cite{santambrogio2016euclidean} are used to justify the fast convergence of the solution of the auxiliary problem. In practice, this algorithm has significantly improved the training time (speed of convergence) for deep networks in high dimensions. The algorithm can also be applied to other nonconvex minimization problems where \eqref{SGDintro} is used.
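For concreteness, one standard way to make the local entropy explicit (following the statistical-physics definition cited above; the precise constants depend on the chosen normalization of $\beta$ and $\gamma$, so the formulas below are a sketch rather than the speaker's exact convention) is via the Cole-Hopf transformation, which linearizes the viscous Hamilton-Jacobi equation and yields an integral representation of $f_\gamma$,
\[ f_\gamma(x) = -\beta^{-1} \log \int_{\mathbb{R}^d} \exp\Big(-\beta f(y) - \frac{\beta}{2\gamma}|x-y|^2\Big)\, dy + \text{const}, \]
together with the gradient identity
\[ \nabla f_\gamma(x) = \frac{1}{\gamma}\big(x - \mathbb{E}_{\rho}[\,y\,]\big), \qquad \rho(y;x) \propto \exp\Big(-\beta f(y) - \frac{\beta}{2\gamma}|x-y|^2\Big), \]
so computing $\nabla f_\gamma(x)$ reduces to estimating the mean of the Gibbs measure $\rho$, which is the quantity the auxiliary problem targets.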
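To illustrate how such a gradient might be estimated in practice, the short sketch below approximates $\mathbb{E}_\rho[y]$ with a few steps of unadjusted Langevin dynamics, in the spirit of the inner loop of \cite{chaudhari2016entropy}. This is not the speaker's implementation: the names and constants (grad_f, gamma, beta, eta, inner_steps) are illustrative choices only.

import numpy as np

def local_entropy_grad(x, grad_f, gamma=0.1, beta=1e3, inner_steps=20, eta=0.01, rng=None):
    """Estimate grad f_gamma(x) = (x - <y>) / gamma, where <y> is the mean of
    rho(y; x) ~ exp(-beta f(y) - beta |x - y|^2 / (2 gamma)).
    Sketch only: the expectation is approximated by a short run of unadjusted
    Langevin dynamics; all constants are illustrative, not tuned values."""
    rng = np.random.default_rng() if rng is None else rng
    y = x.copy()
    y_mean = x.copy()
    for k in range(1, inner_steps + 1):
        # Gradient of U(y) = f(y) + |x - y|^2 / (2 gamma); the target is rho ~ exp(-beta U).
        drift = grad_f(y) + (y - x) / gamma
        # Langevin step at temperature 1/beta.
        y = y - eta * drift + np.sqrt(2.0 * eta / beta) * rng.standard_normal(y.shape)
        # Running average of the iterates approximates <y> under rho.
        y_mean += (y - y_mean) / k
    return (x - y_mean) / gamma

# Usage on a toy nonconvex problem: f(x) = (x^2 - 1)^2, so grad f(x) = 4x^3 - 4x.
grad_f = lambda x: 4.0 * x**3 - 4.0 * x
x = np.array([2.0])
for _ in range(100):
    x = x - 0.05 * local_entropy_grad(x, grad_f)  # SGD step on f_gamma instead of f

The design point is that the outer iteration never needs $f_\gamma$ itself, only its gradient, which is available through the (well-conditioned) inner averaging problem.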

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International