Training Neural Networks with LPs
TL;DR: This is an informal summary of our recent paper Principled Deep Neural Network Training through Linear Programming with Dan Bienstock and Gonzalo Muñoz, where we show that the computational complexity of approximate Deep Neural Network training depends polynomially on the data size for several architectures by means of constructing (relatively) small LPs.
What is the paper about and why you might care
Deep Learning has received significant attention due to its impressive performance in many state-of-the-art learning tasks. Unfortunately, while very powerful, Deep Learning is not well understood theoretically, and in particular only recently have results on the complexity of training deep neural networks been obtained. So why would we care, “as long as it works”? The reasons are manifold. First of all, understanding the complexity of training provides us with a general insight into how hard Deep Neural Network (DNN) training really is. Maybe it is generally a hard problem? Maybe it is actually easy? Moreover, we have to differentiate approaches that merely provide a “good solution” from those that actually solve the training problem to (near-)optimality; a discussion of whether or not the latter is desirable from a generalization point of view is for a different time, nonetheless, jumping ahead a bit, we do also establish generalization of models trained via LPs. Finally, once the complexity of training is understood, one can start to consider follow-up questions, such as how hard robust training is, a technique that has become important in the context of hardening DNNs against adversarial attacks; this is of ultimate importance if we ever really want to deploy these ML systems in the real world, on a large scale, possibly with human lives at stake (e.g., autonomous driving).
We show that training DNNs to $\varepsilon$-approximate optimality can be done via linear programs of relatively small size. I would like to stress though that our results are not about practical training but about characterizing the computational complexity of the DNN training problem.
Our results
In neural network training we are interested in solving the following Empirical Risk Minimization (ERM) problem:
\[\tag{ERM} \min_{\phi \in \Phi} \frac{1}{D} \sum_{i=1}^D \ell(f(\hat{x}^i,\phi), \hat{y}^i),\]where $\ell$ is some loss function, $(\hat{x}^i, \hat{y}^i)_{i=1}^D$ is an i.i.d. sample of size $D$ from some data distribution $\mathcal D$, and $f$ is a neural network architecture parameterized by $\phi \in \Phi$, with $\Phi$ being the parameter space of the considered architecture (e.g., the network weights). The empirical risk minimization problem is solved as a stand-in for the general risk minimization (GRM) problem of the form
\[\tag{GRM} \min_{\phi \in \Phi} \mathbb E_{(x,y) \sim \mathcal D} {\ell(f(x,\phi), y)},\]which we usually cannot solve due to not having explicit access to the true data distribution $\mathcal D$; one hopes that the (ERM) solution $\phi^*$ reasonably generalizes to the (GRM) solution, i.e., that it also roughly minimizes (GRM). To help generalization and prevent overfitting against the specific sample, we often solve the regularized ERM (rERM), which is of the form:
\[\tag{rERM} \min_{\phi \in \Phi} \frac{1}{D} \sum_{i=1}^D \ell(f(\hat{x}^i,\phi), \hat{y}^i) + \lambda R(\phi),\]where $R$ is a regularizer, typically a norm, and $\lambda > 0$ is a weight controlling the strength of the regularization.
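To make these objects concrete, here is a minimal numerical sketch (purely illustrative and not the setup of the paper) that evaluates the (ERM) and (rERM) objectives for a toy one-hidden-layer ReLU network with squared loss and an $\ell_2$ regularizer; the architecture, loss, regularizer, and data below are hypothetical choices just for illustration.

```python
import numpy as np

def forward(x, phi):
    """Toy one-hidden-layer ReLU network; phi = (W1, b1, w2, b2)."""
    W1, b1, w2, b2 = phi
    hidden = np.maximum(W1 @ x + b1, 0.0)   # ReLU activation
    return w2 @ hidden + b2                 # scalar output

def empirical_risk(phi, data, lam=0.0):
    """(ERM) objective: average squared loss over the sample; with lam > 0
    this becomes the (rERM) objective with an L2 regularizer."""
    loss = np.mean([(forward(x, phi) - y) ** 2 for x, y in data])
    reg = sum(np.sum(p ** 2) for p in phi)
    return loss + lam * reg

# Hypothetical sample of size D = 100 with 3-dimensional inputs in [-1, 1].
rng = np.random.default_rng(0)
data = [(rng.uniform(-1, 1, 3), rng.uniform(-1, 1)) for _ in range(100)]
phi = (rng.uniform(-1, 1, (5, 3)), rng.uniform(-1, 1, 5),
       rng.uniform(-1, 1, 5), rng.uniform(-1, 1))
print(empirical_risk(phi, data))            # (ERM) value
print(empirical_risk(phi, data, lam=0.1))   # (rERM) value
```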
Typically, in neural network training the (regularized) ERM is solved with some form of stochastic gradient descent, ultimately exploiting the finite-sum structure of the objective and the linearity of the derivative, which allows us to (batch-)sample from our data and compute stochastic gradients. It turns out that the same finite-sum structure induces an optimization problem of low treewidth, which allows us to formulate the ERM problem as a reasonably small linear program. Note that we make no assumptions on convexity etc. here; the complexity of the architecture and loss is captured via Lipschitzness. To keep the exposition simple here (and also in the paper), we assume that both the data and the parameter space are normalized to be bounded within a box of the form $[-1,1]^n$ of appropriate dimension.
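For contrast, here is an equally minimal sketch of the stochastic gradient viewpoint, showing how batches are sampled from the finite sum; to keep it short, the model is just a linear model with squared loss (a hypothetical stand-in for a deep network), so this is not how one would train an actual DNN, only an illustration of the batch-sampling of the sum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training sample of size D with inputs and targets in [-1, 1].
D, d = 1000, 5
X = rng.uniform(-1, 1, (D, d))
y = X @ rng.uniform(-1, 1, d) + 0.01 * rng.normal(size=D)

def stochastic_gradient(phi, batch_idx):
    """Gradient of the average squared loss over the sampled batch only;
    by linearity of the derivative this is an unbiased estimate of the
    gradient of the full finite sum."""
    Xb, yb = X[batch_idx], y[batch_idx]
    residual = Xb @ phi - yb
    return 2.0 * Xb.T @ residual / len(batch_idx)

phi = np.zeros(d)                                    # model parameters
for step in range(500):
    batch = rng.choice(D, size=32, replace=False)    # (batch-)sample the sum
    phi -= 0.1 * stochastic_gradient(phi, batch)     # SGD update
```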
The main result we obtain can be semi-formally stated as follows; we formulate the result for (ERM) but it immediately extends to (rERM):
Theorem (informal) [BMP]. Let $D$ be a given sample size and $\varepsilon > 0$. Then there exists a linear program with a polytope $P$ as feasible region with the following properties:
(a) Data-independent LP. The linear program has no more than $O( D \cdot (\mathcal L/\varepsilon)^K)$ variables. The construction is independent of any specific training data set.
(b) Solving ERM. For any given dataset $\hat D \doteq (\hat{x}^i, \hat{y}^i)_{i=1}^D$, there exists a face of $P$ such that optimizing over this face provides an $\varepsilon$-approximate solution to (ERM) for $\hat D$. This is equivalent, by Farkas’ lemma, to the existence of a linear objective, as a function of $\hat D$, that when optimized over $P$ yields an $\varepsilon$-approximate solution $\phi^* \in \Phi$ to (ERM) for $\hat D$. (As we require $\phi^* \in \Phi$, we are in the proper learning setting.)
Here $\mathcal L$ is an architecture-dependent Lipschitz constant and $K$ is the “size” of the network.
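To get a feeling for the scaling of the bound in (a), the snippet below simply evaluates $D \cdot (\mathcal L/\varepsilon)^K$ for a few sample sizes; the values of $\mathcal L$, $\varepsilon$, and $K$ are made up for illustration and are not constants from the paper.

```python
# Illustrative evaluation of the variable-count bound D * (L / eps)**K.
# L, eps, and K below are made-up values, not constants from the paper.
L, eps, K = 4.0, 0.1, 6

for D in (10**3, 10**4, 10**5):
    print(D, D * (L / eps) ** K)   # linear in D, exponential in the size K
```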
A few remarks are in order: what is special and maybe confusing at first is that the LP can be written down before the actual training data is revealed; the only input (w.r.t. the data) to the construction is the sample size $D$ as well as network-specific parameters. This is very subtle but highly important: if we only consider the (bit) size of a linear program as a measure of complexity, as we do here, then the actual time required to write down the linear program is irrelevant. As such, if we allowed the construction of the LP to depend on the actual training data, then we could always find a small LP that basically just outputs the optimal network configuration $\phi^*$, which would be nonsensical. Observe that we have a similar requirement for algorithms: they should work for a broad class of inputs and not for a single, specific input. What is different here is that the construction also depends on the sample size. This makes sense as the LP cannot “extend itself” after it has been constructed, whereas algorithms can cater to different input sizes. From a computational complexity perspective this phenomenon is well understood: LPs are more like circuits (e.g., the complexity class $\operatorname{P}/\operatorname{poly}$), whose construction also depends on the input size (the polynomial advice), than algorithms (e.g., the complexity class $\operatorname{P}$). This phenomenon is also well known from extended formulations, where often a uniform and a non-uniform model are distinguished, catering to exactly this issue (see e.g., [BFP]). In the language of extended formulations, we have a uniform model here, where the instances (e.g., different training data sets $\hat D$) are encoded only in the objective functions.
Once the LP is constructed, it can be solved for a specific training dataset by fixing some of its variables to appropriate values to obtain the face described in (b); from our construction it is clear that both fixing the variables and, equivalently, computing the desired linear objective can be done efficiently. When the actual LP is then solved, e.g., with the ellipsoid method (see e.g., [GLS]), we obtain the desired training algorithm with a running time polynomial in the size of the LP and hence in the sample size $D$. In particular, for fixed architectures we obtain polynomial-time training algorithms; this has to be taken with a grain of salt though, as what is really interesting and unexpected here is the fact that there exists a single polytope that basically encodes the ERM for all realizations of the training data for a specific architecture and sample size $D$.
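Schematically, the resulting two-phase procedure could look as follows; this is only a workflow sketch: the two constructor functions are hypothetical placeholders that do not implement the actual construction from the paper, and any off-the-shelf LP solver (here scipy’s linprog) stands in for the ellipsoid method.

```python
from scipy.optimize import linprog  # stand-in solver; the paper argues via the ellipsoid method

def build_data_independent_lp(architecture, sample_size, eps):
    """Phase 1 (once, before any data is seen): construct the polytope P,
    i.e., the constraint data (A_ub, b_ub) of the LP. Placeholder only."""
    raise NotImplementedError("stands in for the construction in the paper")

def objective_from_data(dataset):
    """Phase 2 (per dataset): compute the linear objective, as a function of
    the training data, whose optimization over P yields an eps-approximate
    ERM solution. Placeholder only."""
    raise NotImplementedError("stands in for the construction in the paper")

def train(architecture, dataset, eps):
    A_ub, b_ub = build_data_independent_lp(architecture, len(dataset), eps)
    c = objective_from_data(dataset)
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(-1, 1))
    return result.x  # encodes an eps-approximate parameter vector phi*
```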
The way our construction works is by observing that the ERM problem from above naturally admits a formulation as an optimization problem with low treewidth and, as discussed in an earlier post, this can be exploited to construct small linear programs: (1) reformulate ERM as an optimization problem of low treewidth, (2) discretize and reformulate as a binary optimization problem (this is where $\mathcal L$ comes in), (3) exploit the low treewidth to obtain a small LP formulation (basically a bit of convex relaxation and Sherali-Adams-like lifting). For the last step (3) we rely on an immediate generalization of a theorem of [BM], which exploits low treewidth to construct small LPs for polynomial optimization problems.
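The discretization in step (2) is where the Lipschitz constant enters: if the objective is $\mathcal L$-Lipschitz in the parameters and the parameters live in a $[-1,1]$-box, then restricting to a grid of spacing roughly $\varepsilon/\mathcal L$ loses at most $\varepsilon$ in objective value, at the price of roughly $(2\mathcal L/\varepsilon)^K$ grid points. The snippet below illustrates only this counting argument, by brute-force enumeration of such a grid for a toy Lipschitz function; it is not the LP construction, which exploits treewidth precisely to avoid this exponential enumeration.

```python
import itertools
import numpy as np

# Toy illustration of step (2): an L-Lipschitz objective over [-1, 1]^K,
# evaluated on an (eps / L)-grid, yields an eps-approximate minimizer.
L, eps, K = 2.0, 0.5, 3

def objective(phi):
    """A toy Lipschitz stand-in for the ERM objective (not a real network)."""
    return np.sum(np.abs(phi - 0.3)) * (L / K)

grid_1d = np.arange(-1.0, 1.0 + 1e-9, eps / L)       # ~ 2L/eps points per axis
best = min(itertools.product(grid_1d, repeat=K),
           key=lambda phi: objective(np.array(phi)))
print(len(grid_1d) ** K, objective(np.array(best)))  # ~ (2L/eps)^K evaluations
```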
Comparison to earlier results
We are not the first to think about the complexity of the training problem though, and in fact it was [ABMM] that inspired our work. In the following I will very briefly compare our results to earlier ones and refer the interested reader to the paper for a detailed discussion. Most closely related to our results are [ABMM], [GKKT], and [ZLJ]; however, it is hard to compare exact numbers directly as the setups differ. In fact, a fair comparison might prove impossible and one should probably think of these results as complementary.
- [GKKT] and [ZLJ] consider the improper learning setup, i.e., the constructed model is not from the same family of models $\Phi$ considered in the ERM. Their dependence on some of the input parameters is better, but they also only consider a limited class of architectures.
- [ABMM], on the other hand, considers proper learning but is limited to one hidden layer and output dimension one. Then again, they solve the ERM to global optimality (no $\varepsilon$’s here compared to us). In terms of complexity, their dependence on the sample size is much worse.
References
[BFP] Braun, G., Fiorini, S., & Pokutta, S. (2016). Average case polyhedral complexity of the maximum stable set problem. Mathematical Programming, 160(1-2), 407-431. journal arxiv
[GLS] Grötschel, M., Lovász, L., & Schrijver, A. (2012). Geometric algorithms and combinatorial optimization (Vol. 2). Springer Science & Business Media. google books
[BMP] Bienstock, D., Muñoz, G., & Pokutta, S. (2018). Principled Deep Neural Network Training through Linear Programming. arxiv
[BM] Bienstock, D., & Muñoz, G. (2018). LP Formulations for Polynomial Optimization Problems. SIAM Journal on Optimization, 28(2), 1121-1150. journal arxiv
[ABMM] Arora, R., Basu, A., Mianjy, P., & Mukherjee, A. (2016). Understanding deep neural networks with rectified linear units. Proceedings of ICLR 2018. arxiv
[GKKT] Goel, S., Kanade, V., Klivans, A., & Thaler, J. (2017, June). Reliably Learning the ReLU in Polynomial Time. In Conference on Learning Theory (pp. 1004-1042). arxiv
[ZLJ] Zhang, Y., Lee, J. D., & Jordan, M. I. (2016, June). l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (pp. 993-1001). jmlr