<h1>CINDy: Conditional gradient-based Identification of Non-linear Dynamics</h1>
<p><em>2021-01-16</em></p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2101.02630">CINDy: Conditional gradient-based Identification of Non-linear Dynamics – Noise-robust recovery</a> by <a href="https://alejandro-carderera.github.io/">Alejandro Carderera</a>, <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, <a href="https://www.mi.fu-berlin.de/en/math/groups/biocomputing/people/professors/christof_schuette.html">Christof Schütte</a> and <a href="https://www.zib.de/weiser/">Martin Weiser</a>, where we propose the use of a Conditional Gradient algorithm (more concretely the <a href="https://arxiv.org/abs/1805.07311">Blended Conditional Gradients</a> [BPTW] algorithm) for the sparse recovery of a dynamic. In the presence of noise, the proposed algorithm presents superior sparsity-inducing properties, while ensuring a higher recovery accuracy, compared to other existing methods in the literature, most notably the popular <a href="https://arxiv.org/abs/1509.03580">SINDy</a> [BPK] algorithm, based on a sequentially-thresholded least-squares approach.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>A large number of humankind’s scientific breakthroughs have been fueled by our ability to describe natural phenomena in terms of differential equations. These equations give us a condensed representation of the underlying dynamics and have helped build our understanding of natural phenomena in many scientific disciplines.</p>
<p>The modern age of Machine Learning and Big Data has heralded an age of data-driven models, in which
the phenomena we explain are described in terms of statistical relationships and data. Given sufficient
data, we are able to train neural networks to classify or predict with high accuracy, without the underlying
model having any apparent knowledge of how the data was generated or of its structure. This makes
classifying or predicting on out-of-sample data particularly challenging. As a consequence, there has
been a recent surge of interest in recovering the differential equations from which the data, often coming from
a physical system, were generated. This enables us to better understand how the data is generated and
to predict more reliably on out-of-sample data.</p>
<h2 id="learning-sparse-dynamics">Learning sparse dynamics</h2>
<p>Many physical systems can be described in terms of ordinary
differential equations of the form
$\dot{x}(t) = F\left(x(t)\right)$, where $x(t) \in \mathbb{R}^d$ denotes
the state of the system at time $t$ and $F: \mathbb{R}^d \rightarrow \mathbb{R}^d$ can usually be expressed as a linear combination of simpler <em>ansatz functions</em> $\psi_i: \mathbb{R}^d \rightarrow \mathbb{R}$ belonging to a dictionary \(\mathcal{D} = \left\{\psi_i \mid 1 \leq i \leq n
\right\}\). This allows us to express the dynamic followed by the
system as
$\dot{x}(t) = F\left(x(t)\right) = \Xi^T \bm{\psi}(x(t))$ where
$\Xi \in \mathbb{R}^{n \times d}$ is a – typically sparse – matrix
$\Xi = \left[\xi_1, \cdots, \xi_d \right]$ formed by column vectors
$\xi_i \in \mathbb{R}^n$ for $1 \leq i \leq d$ and
$\bm{\psi}(x(t)) = \left[ \psi_1(x(t)), \cdots, \psi_n(x(t))
\right]^T \in \mathbb{R}^{n}$. We can therefore write:</p>
\[\dot{x}(t) = \begin{bmatrix}
\rule[.5ex]{2.5ex}{0.5pt} & \xi_1 & \rule[.5ex]{2.5ex}{0.5pt}\\
& \vdots & \\
\rule[.5ex]{2.5ex}{0.5pt} & \xi_d & \rule[.5ex]{2.5ex}{0.5pt}
\end{bmatrix}
\begin{bmatrix}
\psi_1(x(t)) \\
\vdots \\
\psi_n(x(t))
\end{bmatrix}.\]
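<p>To make the notation concrete, here is a minimal numpy sketch (the dictionary and coefficients are hypothetical, chosen to encode a simple harmonic oscillator) of how a sparse coefficient matrix $\Xi$ and a dictionary $\bm{\psi}$ combine into the right-hand side $\dot{x}(t) = \Xi^T \bm{\psi}(x(t))$:</p>

```python
import numpy as np

def psi(x):
    """Dictionary of ansatz functions: [1, x_1, x_2, x_1*x_2] for d = 2."""
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

# Sparse coefficient matrix Xi in R^{n x d} (n = 4 basis functions, d = 2 states),
# encoding the dynamic x1' = -x2, x2' = x1 (a harmonic oscillator).
Xi = np.zeros((4, 2))
Xi[2, 0] = -1.0   # x1' uses the basis function x_2 with coefficient -1
Xi[1, 1] = 1.0    # x2' uses the basis function x_1 with coefficient +1

def F(x):
    """Right-hand side of the ODE: x'(t) = Xi^T psi(x(t))."""
    return Xi.T @ psi(x)

print(F(np.array([1.0, 0.0])))  # -> [0. 1.]
```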
<p>In the absence of noise, if we are given a series of data points from the physical system \(\left\{ x(t_i), \dot{x}(t_i) \right\}_{i=1}^m\), then we know that:</p>
\[\begin{bmatrix}
\rule[-1ex]{0.5pt}{2.5ex} & & \rule[-1ex]{0.5pt}{2.5ex}\\
\dot{x}(t_1) & \cdots & \dot{x}(t_m)\\
\rule[-1ex]{0.5pt}{2.5ex} & & \rule[-1ex]{0.5pt}{2.5ex}
\end{bmatrix} =
\begin{bmatrix}
\rule[.5ex]{2.5ex}{0.5pt} & \xi_1 & \rule[.5ex]{2.5ex}{0.5pt}\\
& \vdots & \\
\rule[.5ex]{2.5ex}{0.5pt} & \xi_d & \rule[.5ex]{2.5ex}{0.5pt}
\end{bmatrix}
\begin{bmatrix}
\rule[-1ex]{0.5pt}{2.5ex} & & \rule[-1ex]{0.5pt}{2.5ex}\\
\bm{\psi}\left(x(t_1)\right) & \cdots & \bm{\psi}\left(x(t_m)\right)\\
\rule[-1ex]{0.5pt}{2.5ex} & & \rule[-1ex]{0.5pt}{2.5ex}
\end{bmatrix}.\]
<p>If we collect the data in matrices $\dot{X} = \left[ \dot{x}(t_1),\cdots, \dot{x}(t_m)\right] \in\mathbb{R}^{d\times m}$, $\Psi\left(X\right) = \left[ \bm{\psi}(x(t_1)),\cdots, \bm{\psi}(x(t_m))\right]\in\mathbb{R}^{n\times m}$, we can try to recover the underlying sparse dynamic by attempting to solve:</p>
\[\min\limits_{\dot{X} = \Omega^T \Psi(X)} \left\| \Omega\right\|_0.\]
<p>Unfortunately, the aforementioned problem is a notoriously difficult, NP-hard
combinatorial problem, due to the presence of the $\ell_0$ norm in
the objective function. Moreover, if the
data points are contaminated by noise, leading to noisy matrices
$\dot{Y}$ and $\Psi(Y)$, then, depending on the
expressive power of the basis functions $\psi_i$ for
$1\leq i \leq n$, it may not even be possible (or desirable) to
satisfy $\dot{Y} = \Omega^T \Psi(Y)$ for any $\Omega \in \mathbb{R}^{n\times
d}$. Thus one can attempt to <em>convexify</em> the problem, replacing the $\ell_0$ norm (which is technically not a norm) with the $\ell_1$ norm. That is, solve for a suitably chosen $\epsilon >0$</p>
\[\tag{BPD}
\min\limits_{ \left\|\dot{Y} - \Omega^T \Psi(Y) \right\|^2_F \leq \epsilon } \left\|\Omega\right\|_{1,1} \label{eq:l1_minimization_noisy2}\]
<p>This leads us to a formulation, known as <em>Basis Pursuit Denoising</em> (BPD) [CDS], which was initially developed by the signal processing community, and is intimately tied to the <em>Least Absolute Shrinkage and Selection Operator</em> (LASSO) regression formulation [T], developed in the statistics community. The latter formulation, which we will use for this problem, takes the form:</p>
\[\tag{LASSO}
\min\limits_{ \left\|\Omega\right\|_{1,1} \leq \tau} \left\|\dot{Y} - \Omega^T \Psi(Y) \right\|^2_F\]
<p>Both problems shown in (BPD) and
(LASSO) have a convex objective function and
a convex feasible region, which allows us to use the powerful tools
and guarantees of convex optimization. Moreover, there is a significant body of theoretical
literature, both from the statistics and the signal processing
community, on the conditions for which we can successfully recover the
support of $\Xi$ (see e.g., [W]), the uniqueness of the
LASSO solutions (see e.g., [T2]), or the robust
reconstruction of phenomena from incomplete data
(see e.g., [CRT]), to name but a few results.</p>
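<p>To illustrate why the $\ell_1$ convexification is useful, here is a small scipy sketch (sizes and seed are illustrative, noiseless setting) that recovers a sparse vector by solving the equality-constrained basis pursuit problem $\min \|w\|_1$ s.t. $Aw = b$ as a linear program, via the standard split $w = u - v$ with $u, v \geq 0$; for random Gaussian measurements and a sufficiently sparse ground truth, the recovery succeeds with high probability:</p>

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 15, 30                      # fewer measurements than unknowns
A = rng.standard_normal((m, n))
w_true = np.zeros(n)
w_true[[3, 17]] = [2.0, -1.5]      # sparse ground truth
b = A @ w_true

# min ||w||_1  s.t.  A w = b, rewritten as an LP with w = u - v, u, v >= 0.
c = np.ones(2 * n)                 # objective: sum(u) + sum(v) = ||w||_1
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
w_hat = res.x[:n] - res.x[n:]
print(np.round(w_hat[[3, 17]], 3))  # should be close to [2.0, -1.5] on success
```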
<h3 id="incorporating-structure-into-the-learning-problem">Incorporating structure into the learning problem</h3>
<p>Conservation laws are a fundamental pillar of our understanding of physical
systems. Imposing these laws through (symmetry) constraints in our sparse regression problem can
potentially lead to better generalization performance under noise,
reduced sample complexity, and to learned dynamics that are consistent
with the symmetries present in the real world. In particular, there are two
large classes of structural constraints that can be easily encoded
into our learning problem as linear constraints:</p>
<ol>
<li>Conservation properties: We often observe in dynamical systems that certain relations hold between the elements of $\dot{x}(t)$. Such is the case in chemical reaction dynamics, where if we denote the rate of change of the $i$-th species by $\dot{x}_i(t)$, we might observe relations of the form $a_j\dot{x}_j(t) + a_k\dot{x}_k(t) = 0$ due to mass conservation, which relate the $j$-th and $k$-th species being studied.</li>
<li>Symmetry between variables: One of the key assumptions used in many-particle quantum systems is the fact the particles being studied are indistinguishable. And so it makes sense to assume that the effect that the $i$-th particle exerts on the $j$-th particle is the same as the effect that the $j$-th particle exerts on the $i$-th particle. The same can be said in classical mechanics for a collection of identical masses, where each mass is connected to all the other masses through identical springs. These restrictions can also be added to our learning problem as linear constraints.</li>
</ol>
<p>If we were to add $L$ additional linear constraints to the problem in (LASSO) to reflect the underlying structure of the dynamical system through symmetry and conservation, we would arrive at a polytope $\mathcal{P}$ of the form</p>
\[\mathcal{P} = \left\{ \Omega \in \mathbb{R}^{n \times d} \mid \left\|\Omega\right\|_{1,1} \leq \tau, \text{trace}( A_l^T \Omega ) \leq b_l, 1 \leq l \leq L \right\},\]
<p>for an appropriately chosen $A_l$ and $b_l$.</p>
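<p>As a sketch of how such a constraint arises (with made-up sizes and stoichiometric coefficients), consider a conservation law $a_j\dot{x}_j(t) + a_k\dot{x}_k(t) = 0$. Since $\dot{x}_j = \sum_i \Omega_{i,j}\psi_i(x)$, the law holds for all states exactly when $a_j\Omega_{i,j} + a_k\Omega_{i,k} = 0$ for every basis function $i$, i.e., one $\text{trace}(A_i^T\Omega) = 0$ constraint per row:</p>

```python
import numpy as np

n, d = 5, 3          # 5 basis functions, 3 state variables (illustrative)
a_j, a_k = 1.0, 2.0  # hypothetical stoichiometry: a_j * dx_j/dt + a_k * dx_k/dt = 0
j, k = 0, 1

# Each basis function i contributes one linear constraint trace(A_i^T Omega) = 0,
# where A_i carries a_j at entry (i, j) and a_k at entry (i, k).
constraints = []
for i in range(n):
    A = np.zeros((n, d))
    A[i, j], A[i, k] = a_j, a_k
    constraints.append(A)

# A coefficient matrix satisfying the law: column k = -(a_j / a_k) * column j.
Omega = np.zeros((n, d))
Omega[:, j] = np.arange(1.0, n + 1.0)
Omega[:, k] = -(a_j / a_k) * Omega[:, j]

print([np.trace(A.T @ Omega) for A in constraints])  # all residuals vanish
```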
<h3 id="blended-conditional-gradients">Blended Conditional Gradients</h3>
<p>The problem is that in the presence of noise many learning approaches see their sparsity-inducing properties quickly degrade, producing dense dynamics that are far from the true dynamic. This is often what happens with the sequentially-thresholded least-squares algorithm in [BPK], which underlies SINDy. Ideally, we want learning algorithms that are somewhat robust to the presence of noise. Moreover, it would also be advantageous if we could easily incorporate structural linear constraints into the learning problem, as described in the previous section, so that the learned dynamics are consistent with the true dynamic.</p>
<p>For the recovery of sparse dynamics from data, one of the most interesting algorithms in terms of sparsity is the <em>Fully-Corrective Conditional Gradient</em> (FCCG) algorithm. This algorithm picks up a vertex $V_k$ of the polytope $\mathcal{P}$ using a linear optimization oracle, and reoptimizes over the convex hull of $\mathcal{S}_{k} \cup \{V_k\}$, that is, over the vertices picked up in previous iterations together with the new vertex $V_k$. One of the key advantages of requiring a linear optimization oracle, instead of a projection oracle, is that for general polyhedral constraints there are efficient algorithms for solving linear optimization problems, whereas solving a quadratic problem to compute a projection can be computationally too expensive.</p>
<p class="mathcol"><strong>Fully-Corrective Conditional Gradient (FCCG) algorithm applied to (LASSO)</strong> <br />
<em>Input:</em> Initial point $\Omega_1 \in \mathcal{P}$ <br />
<em>Output:</em> Point $\Omega_{K+1} \in \mathcal{P}$ <br />
\(\mathcal{S}_{1} \leftarrow \emptyset\) <br />
For \(k = 1, \dots, K\) do: <br />
$\quad \nabla f \left( \Omega_k \right) \leftarrow -2 \Psi(Y) \left(\dot{Y} - \Omega_k^T\Psi(Y) \right)^T$ <br />
$\quad V_k \leftarrow \text{argmin}_{\Omega \in \mathcal{P}} \text{trace}\left(\Omega^T\nabla f \left( \Omega_k \right) \right)$ <br />
\(\quad \mathcal{S}_{k+1}\leftarrow \mathcal{S}_{k} \bigcup V_k\) <br />
\(\quad \Omega_{k+1} \leftarrow \min\limits_{\Omega \in \text{conv}\left( \mathcal{S}_{k+1} \right) } \left\|\dot{Y} - \Omega^T \Psi(Y)\right\|^2_F\) <br />
End For</p>
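<p>As an illustration of the scheme, here is a compact numpy/scipy sketch over the plain $\ell_{1,1}$ ball (no additional linear constraints). The function name is ours, and the fully-corrective step is solved naively over the barycentric weights with SLSQP rather than with a tailored solver; the LMO exploits the fact that the vertices of the $\ell_{1,1}$ ball are $\pm\tau$ times a single matrix entry:</p>

```python
import numpy as np
from scipy.optimize import minimize

def fccg_lasso(Y_dot, Psi, tau, K=25):
    """Sketch of the fully-corrective CG loop for
    min ||Y_dot - Omega^T Psi||_F^2  s.t.  ||Omega||_{1,1} <= tau."""
    d, m = Y_dot.shape
    n = Psi.shape[0]

    def f(Omega):
        return np.linalg.norm(Y_dot - Omega.T @ Psi) ** 2

    def grad(Omega):
        # Gradient of the Frobenius-norm objective, an n x d matrix.
        return -2.0 * Psi @ (Y_dot - Omega.T @ Psi).T

    def lmo(G):
        # Vertex of the l_{1,1} ball minimizing trace(V^T G):
        # +/- tau at the entry of G with the largest magnitude.
        i, j = np.unravel_index(np.argmax(np.abs(G)), G.shape)
        V = np.zeros((n, d))
        V[i, j] = -tau * np.sign(G[i, j])
        return V

    vertices = []
    Omega = np.zeros((n, d))  # feasible starting point
    for _ in range(K):
        vertices.append(lmo(grad(Omega)))
        # Fully-corrective step: reoptimize over conv(vertices), i.e. over
        # barycentric weights lam on the probability simplex.
        S = np.stack(vertices)
        s = len(vertices)
        res = minimize(lambda lam: f(np.tensordot(lam, S, axes=1)),
                       np.ones(s) / s, bounds=[(0.0, 1.0)] * s,
                       constraints={"type": "eq", "fun": lambda lam: lam.sum() - 1.0})
        Omega = np.tensordot(res.x, S, axes=1)
    return Omega
```

On a small noiseless instance with $\tau = \|\Xi\|_{1,1}$, a handful of iterations typically suffices to identify the support exactly, since the true $\Xi$ is itself a convex combination of a few vertices.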
<p>To get a better feel for the sparsity-inducing properties of the FCCG algorithm, if we assume that the starting point $\Omega_1$ is a vertex of the
polytope, then we know that the iterate $\Omega_k$ can be expressed
as a convex combination of at most $k$ vertices of
$\mathcal{P}$. This is due to the fact that the algorithm picks
up at most one vertex per iteration. Note that if $\mathcal{P}$ were the $\ell_1$ ball without any additional constraints, the FCCG algorithm
picks up at most one basis function in the $k$-th iteration, as
$V_k^T \bm{\psi}(x(t)) = \pm \tau \psi_i(x(t))$ for some
$1\leq i\leq n$. This means that if we use the
Frank-Wolfe algorithm to solve a problem over the $\ell_1$ ball, we
encourage sparsity not only through the regularization provided by
the $\ell_1$ ball, but also through the specific nature of the
Frank-Wolfe algorithm, independently of the size of the feasible
region. In practice, when using, e.g., early termination due to some
stopping criterion, this results in the Frank-Wolfe algorithm
producing sparser solutions than projection-based algorithms (such
as projected gradient descent, which typically uses dense updates).</p>
<p>Reoptimizing over the union of vertices picked up can be an expensive operation, especially if there are many such vertices. An alternative is to compute these reoptimizations to $\varepsilon_k$-optimality at iteration $k$. However, this leads to the question: How should we choose $\varepsilon_k$ at
each iteration $k$, if we want to find an $\varepsilon$-optimal
solution to (LASSO)?
Computing a solution to the problem to accuracy
$\varepsilon_k = \varepsilon$ at each iteration might be way too
computationally expensive. Conceptually, we need relatively inaccurate
solutions for early iterations where
\(\Omega^\esx \notin \text{conv} \left(\mathcal{S}_{k+1}\right)\), requiring only
accurate solutions when
\(\Omega^\esx \in \text{conv} \left(\mathcal{S}_{k+1}\right)\). At the same time we
do not know whether we have found \(\mathcal{S}_{k+1}\) so that
\(\Omega^\esx \in \text{conv} \left(\mathcal{S}_{k+1}\right)\).</p>
<p>The rationale behind the <em>Blended Conditional Gradient</em> (BCG)
algorithm [BPTW] is to provide an
explicit value of the accuracy $\varepsilon_k$ needed at each
iteration starting with rather large \(\varepsilon_k\) in early iterations
and progressively getting more accurate when approaching the optimal
solution; the process is controlled by an optimality gap
measure. In some sense one might think of BCG as a practical version
of FCCG with stronger convergence guarantees and much faster
real-world performance.</p>
<p class="mathcol"><strong>CINDy: Blended Conditional Gradient (BCG) algorithm variant applied to (LASSO) problem</strong> <br />
<em>Input:</em> Initial point $\Omega_0 \in \mathcal{P}$ <br />
<em>Output:</em> Point $\Omega_{K+1} \in \mathcal{P}$ <br />
\(\Omega_1 \leftarrow \text{argmin}_{\Omega \in \mathcal{P}} \text{trace}\left(\Omega^T\nabla f \left( \Omega_0 \right) \right)\) <br />
\(\Phi \leftarrow \text{trace} \left( \left( \Omega_0 - \Omega_1\right)^T \nabla f(\Omega_0)\right)/2\) <br />
\(\mathcal{S}_{1} \leftarrow \left\{ \Omega_1 \right\}\) <br />
For \(k = 1, \dots, K\) do: <br />
\(\quad\) Find $\Omega_{k+1} \in \operatorname{conv}(\mathcal{S}_{k})$ such that \(\max_{\Omega \in \mathcal{P}} \text{trace}\left((\Omega_{k+1} -\Omega )^T\nabla f \left( \Omega_{k+1} \right) \right) \leq \Phi\) <br />
\(\quad V_{k+1} \leftarrow \text{argmin}_{\Omega \in \mathcal{P}} \text{trace}\left(\Omega^T\nabla f \left( \Omega_{k+1} \right) \right)\) <br />
\(\quad\) If \(\left( \text{trace}\left( \left( \Omega_{k+1} -V_{k+1}\right)^T \nabla f(\Omega_{k+1})\right) \leq \Phi \right)\) <br />
\(\quad\quad \Phi \leftarrow \text{trace}\left( \left( \Omega_{k+1} -V_{k+1}\right)^T \nabla f(\Omega_{k+1})\right)/2\) <br />
\(\quad\quad \mathcal{S}_{k+1} \leftarrow \mathcal{S}_k\) <br />
\(\quad\quad \Omega_{k+1} \leftarrow \Omega_k\) <br />
\(\quad\) Else <br />
\(\quad\quad\mathcal{S}_{k+1} \leftarrow \mathcal{S}_k \bigcup V_{k+1}\) <br />
\(\quad\quad D_k \leftarrow V_{k + 1} - \Omega_k\) <br />
\(\quad\quad \gamma_k \leftarrow \min\left\{-\frac{1}{2}\text{trace} \left( D_k^T \nabla f \left(
\Omega_k \right) \right)/ \left\| D_k^T
\Psi(Y)\right\|_F^2,1\right\}\) <br />
\(\quad\quad \Omega_{k+1} \leftarrow \Omega_k + \gamma_k D_k\) <br />
\(\quad\) End If <br />
End For</p>
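<p>A detail worth sanity-checking numerically: the step size $\gamma_k$ in the else-branch is the exact line-search minimizer of the quadratic objective along $D_k$ (before clipping at $1$). A quick numpy check on random data, using the gradient $\nabla f(\Omega) = -2\Psi(Y)\left(\dot{Y} - \Omega^T\Psi(Y)\right)^T$ of the (LASSO) objective:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 3, 20
Psi = rng.standard_normal((n, m))    # stands in for Psi(Y)
Y_dot = rng.standard_normal((d, m))  # stands in for the noisy derivatives
Omega = rng.standard_normal((n, d))
D = rng.standard_normal((n, d))      # an arbitrary direction

f = lambda O: np.linalg.norm(Y_dot - O.T @ Psi) ** 2
grad = -2.0 * Psi @ (Y_dot - Omega.T @ Psi).T

# Unclipped step size from the BCG pseudocode above:
gamma = -0.5 * np.trace(D.T @ grad) / np.linalg.norm(D.T @ Psi) ** 2

# f restricted to gamma -> f(Omega + gamma * D) is a convex quadratic,
# so this gamma must beat any nearby step size:
print(f(Omega + gamma * D) <= min(f(Omega + (gamma + 1e-3) * D),
                                  f(Omega + (gamma - 1e-3) * D)))  # -> True
```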
<p>As we will show numerically in the next section, the CINDy algorithm not only produces sparser solutions to the learning problem, it also exhibits a higher robustness with respect to noise than other existing approaches. This is in keeping with the law of parsimony (also called <em>Occam’s Razor</em>), which states that the simplest explanation, in our case the sparsest, is usually the right one (or close to the right one!).</p>
<h2 id="numerical-experiments">Numerical experiments</h2>
<p>We benchmark the CINDy algorithm, applied to
the LASSO sparse recovery formulation, against the following
algorithms. Our main benchmark is the SINDy algorithm; however, we
also include two further popular optimization methods for
comparison, namely the interior-point method (IPM) in <em>CVXOPT</em> [ADLVSNW] and the FISTA algorithm.</p>
<p>We use CINDy (c) and CINDy to refer to the results achieved by the CINDy algorithm with and without the additional structural constraints arising e.g., from conservation laws. Likewise, we use IPM (c) and IPM to refer to the results achieved by the IPM algorithm with and without additional constraints. We have not added structural constraints to the formulation in the SINDy algorithm, as there is no straightforward way to include constraints in the original implementation, or the FISTA algorithm, as we would need to compute non-trivial proximal/projection operators, making the algorithm computationally too expensive.</p>
<p>To benchmark the algorithms we use two different metrics: the <em>recovery error</em>, defined as \(\mathcal{E}_{R} = \left\| \Omega - \Xi \right\|_F\), and the <em>number of extraneous terms</em>, defined as \(\mathcal{S}_E = \left| \left\{ (i,j) \mid \Omega_{i,j} \neq 0, \Xi_{i,j} = 0, 1 \leq i \leq n, 1 \leq j \leq d \right\} \right|\), i.e., the number of terms picked up that do not belong to the dynamic.</p>
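<p>Both metrics are straightforward to compute; a small numpy helper (the function name is ours) for reference:</p>

```python
import numpy as np

def recovery_metrics(Omega_hat, Xi):
    """Recovery error E_R (Frobenius norm) and number of extraneous terms S_E."""
    E_R = np.linalg.norm(Omega_hat - Xi)
    S_E = int(np.sum((Omega_hat != 0) & (Xi == 0)))  # terms not in the true dynamic
    return E_R, S_E

Xi = np.array([[1.0, 0.0], [0.0, -2.0]])
Omega_hat = np.array([[1.1, 0.3], [0.0, -1.9]])
print(recovery_metrics(Omega_hat, Xi))  # S_E counts the spurious entry at (0, 1)
```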
<h3 id="fermi-pasta-ulam-tsingou-model">Fermi-Pasta-Ulam-Tsingou model</h3>
<p>The Fermi-Pasta-Ulam-Tsingou model describes a one-dimensional system of $d$ identical particles, where neighboring particles are connected with springs, subject to a nonlinear forcing term [FPUT]. This computational model was used at Los Alamos to study the behaviour of complex physical systems over long time periods. The equations of motion that govern the particles, when subjected to cubic forcing terms, are given by
\(\ddot{x}_i = \left(x_{i+1} - 2 x_i + x_{i-1} \right) + \beta \left[ \left( x_{i+1} - x_i \right)^3 - \left( x_{i} - x_{i-1} \right)^3 \right],\)
where $1 \leq i \leq d$ and $x_{i}$ refers to the displacement of the $i$-th particle with respect to its equilibrium position. The exact dynamic $\Xi$ can be expressed using a dictionary of monomials of degree up to three.</p>
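<p>The equations of motion vectorize naturally; a minimal numpy sketch (the value of $\beta$ is a made-up example), with the two boundary particles held fixed at zero displacement, matching the non-oscillating extremal particles shown in the figures below:</p>

```python
import numpy as np

def fput_rhs(x, beta=0.7):
    """Accelerations of the d interior FPUT particles for displacement vector x.

    The boundary particles are fixed: x_0 = x_{d+1} = 0."""
    xp = np.concatenate(([0.0], x, [0.0]))  # pad with the fixed ends
    fwd = xp[2:] - xp[1:-1]                 # x_{i+1} - x_i
    bwd = xp[1:-1] - xp[:-2]                # x_i - x_{i-1}
    return (fwd - bwd) + beta * (fwd ** 3 - bwd ** 3)

print(fput_rhs(np.array([1.0, 0.0, 0.0])))  # -> [-3.4  1.7  0. ]
```

Integrating `fput_rhs` with a standard ODE solver then generates the training data \(\left\{ x(t_i), \dot{x}(t_i) \right\}\).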
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_FPUT_10_dim_v5.png" alt="fig1" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Sparse recovery of the Fermi-Pasta-Ulam-Tsingou dynamic with $d = 10$</p>
<p>As we can see in the images, there is a large difference between the CINDy and FISTA algorithms and the remaining algorithms, with the former being up to two orders of magnitude more accurate in terms of $\mathcal{E}_R$, while also being much sparser, as seen in the image that depicts $\mathcal{S}_E$.</p>
<p>However, what does this difference in recovery error translate to? We can see the difference in accuracy between the learned dynamics by simulating the dynamic learned by the CINDy algorithm and the one learned by the SINDy algorithm forward in time, and comparing both to the evolution of the true dynamic. The results in the next image show this comparison at different times for the dynamics learnt by the two algorithms with a noise level of $10^{-4}$ for the example of dimensionality $d = 10$. In keeping with the physical nature of the problem, we present the ten-dimensional phenomenon as a series of oscillators suffering a displacement on the vertical y-axis, in a similar fashion as was done in the original SINDy paper [BPK]. Note that we have added to the images the two extremal particles on the left and right that do not oscillate. While CINDy’s trajectory matches that of the real dynamic up to a very small error (and is also much smoother in time), the dynamic learned by SINDy is very far from the true dynamics, not even recovering essential features of the oscillation; the large number of additional terms deforms the essential structure of the dynamic.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/FPUT_animation.gif" alt="fig2" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Fermi-Pasta-Ulam-Tsingou dynamic: Simulation of learned trajectories vs true trajectory.</p>
<h3 id="kuramoto-model">Kuramoto model</h3>
<p>The Kuramoto model [K] describes a large collection of $d$ weakly coupled identical oscillators, that differ in their natural frequency $\omega_i$. This dynamic is often used to describe synchronization phenomena in physics. If we denote by $x_i$ the angular displacement of the $i$-th oscillator, then the governing equation with external forcing can be written as:
\(\dot{x}_i = \omega_i + \frac{K}{d}\sum_{j=1}^d \left[\sin \left( x_j \right) \cos \left( x_i \right) - \cos \left( x_j \right) \sin \left( x_i \right) \right]+ h\sin \left( x_i\right),\)
for $1 \leq i \leq d$, where $d$ is the number of oscillators (the dimensionality of the problem), $K$ is the coupling strength between the oscillators and $h$ is the external forcing parameter. The exact dynamic $\Xi$ can be expressed using a dictionary of basis functions formed by sine and cosine functions of $x_i$ for $1 \leq i \leq d$, and pairwise combinations of these functions, plus a constant term.</p>
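<p>Since the bracketed pairwise term is just the angle-difference identity $\sin(x_j)\cos(x_i) - \cos(x_j)\sin(x_i) = \sin(x_j - x_i)$, the right-hand side vectorizes nicely; a short numpy sketch with made-up values of $K$ and $h$:</p>

```python
import numpy as np

def kuramoto_rhs(x, omega, K=2.0, h=0.2):
    """dx_i/dt for the forced Kuramoto model; K and h are illustrative values.

    The pairwise coupling sum is sum_j sin(x_j - x_i)."""
    d = len(x)
    pairwise = np.sin(np.subtract.outer(x, x))   # entry (i, j) = sin(x_i - x_j)
    return omega - (K / d) * pairwise.sum(axis=1) + h * np.sin(x)
```

A quick cross-check against the expanded $\sin\cos$ form in the equation above confirms the two expressions agree.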
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_kuramoto_10_dim_v5.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Sparse recovery of the Kuramoto dynamic with $d = 10$.</p>
<p>All algorithms except the IPM algorithms exhibit similar performance with respect to $\mathcal{E}_R$ and $\mathcal{S}_E$ up to a noise level of $10^{-5}$; however, the performance of the FISTA and SINDy algorithms degrades for noise levels above $10^{-5}$, producing solutions that are both dense (see $\mathcal{S}_E$) and far away from the true dynamic (see $\mathcal{E}_R$). When we simulate the Kuramoto system from a
given initial position, the algorithms perform very
differently.</p>
<p>The next animation shows the results after simulating the dynamics learned by the CINDy and SINDy algorithms from the integral formulation for a Kuramoto model with $d = 10$ and a noise level of $10^{-3}$. In order to more easily see the differences between the algorithms and the positions of the oscillators, we have placed the $i$-th oscillator at a radius of $i$, for $1 \leq i\leq d$. Note that the same coloring and markers are used as in the previous section to depict the trajectory followed by the exact dynamic, the dynamic learned with CINDy, and the dynamic learned with SINDy. As before, while CINDy reproduces the correct trajectory up to a small error, the trajectory of SINDy’s learned dynamic is rather far away from the real dynamic.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/kuramoto_animation.gif" alt="fig4" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Kuramoto dynamic: Simulation of learned trajectories. Green is the true dynamic. Black is the dynamic learned via CINDy. Magenta is the dynamic learned via SINDy.</p>
<p>If we compare the CINDy and SINDy algorithms from the perspective of sample efficiency, that is, the evolution of the error as we vary the number of training samples made available to the algorithm and the noise levels, we can see that there is an additional benefit to using a CG-based algorithm for the recovery of the sparse dynamic, and that the inclusion of conservation laws can further improve sample efficiency and noise robustness.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_sample_efficiency_kuramoto5.png" alt="fig5" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 5.</strong> Kuramoto dynamic: Sample efficiency with $d = 5$.</p>
<p>If we focus, for example, on the bottom-right corner of each of the images, we can see that in the low-training-sample regime with higher noise levels, the SINDy algorithm outputs dynamics with a lower accuracy than the CINDy algorithm.</p>
<h3 id="michaelis-menten-model">Michaelis-Menten model</h3>
<p>The Michaelis-Menten model [MM] is used to describe enzyme reaction kinetics. We focus on the following derivation, in which an enzyme E combines with a substrate S to form an intermediate product ES with a reaction rate $k_{f}$. This reaction is reversible, in the sense that the intermediate product ES can decompose into E and S, with a reaction rate $k_{r}$. This intermediate product ES can also proceed to form a product P, and regenerate the free enzyme E. This can be expressed as</p>
\[S + E \rightleftharpoons ES \to E + P.\]
<p>If we assume that the rate for a given reaction depends proportionately on the concentration of the reactants, and we denote the concentration of E, S, ES and P as $x_{\text{E}}$, $x_{\text{S}}$, $x_{\text{ES}}$ and $x_{\text{P}}$, respectively, we can express the dynamics of the chemical reaction as:</p>
\[\begin{align*}
\dot{x}_{\text{E}} &= -k_f x_{\text{E}} x_{\text{S}} + k_r x_{\text{ES}} + k_{\text{cat}} x_{\text{ES}} \\
\dot{x}_{\text{S}} &= -k_f x_{\text{E}} x_{\text{S}} + k_r x_{\text{ES}} \\
\dot{x}_{\text{ES}} &= k_f x_{\text{E}} x_{\text{S}} - k_r x_{\text{ES}} - k_{\text{cat}} x_{\text{ES}} \\
\dot{x}_{P} &= k_{\text{cat}} x_{\text{ES}}.
\end{align*}\]
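<p>These equations can be written down directly as a mass-action right-hand side; a minimal numpy sketch with made-up rate constants. Note that it satisfies exactly the kind of linear conservation relations discussed earlier, here $\dot{x}_{\text{E}} + \dot{x}_{\text{ES}} = 0$ (total enzyme) and $\dot{x}_{\text{S}} + \dot{x}_{\text{ES}} + \dot{x}_{\text{P}} = 0$ (total substrate):</p>

```python
import numpy as np

def mm_rhs(x, k_f=1.0, k_r=0.5, k_cat=0.3):
    """Mass-action right-hand side for x = (x_E, x_S, x_ES, x_P);
    the rate constants are illustrative values."""
    xE, xS, xES, xP = x
    dE  = -k_f * xE * xS + k_r * xES + k_cat * xES
    dS  = -k_f * xE * xS + k_r * xES
    dES =  k_f * xE * xS - k_r * xES - k_cat * xES
    dP  =  k_cat * xES
    return np.array([dE, dS, dES, dP])

dx = mm_rhs(np.array([1.0, 2.0, 0.1, 0.0]))
# Conservation laws: d(x_E + x_ES)/dt = 0 and d(x_S + x_ES + x_P)/dt = 0.
print(dx[0] + dx[2], dx[1] + dx[2] + dx[3])  # -> 0.0 0.0
```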
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_MMeasy_4_dim_v6.png" alt="fig6" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 6.</strong> Sparse recovery of the Michaelis-Menten dynamic with $d = 4$. Left is recovery error in Frobenius norm. Right is number of extra terms picked up that do not belong to dynamic.</p>
<p>We can observe that for the lowest noise levels, the CINDy algorithm presents no advantage over the SINDy algorithm; however, as we crank up the noise levels, the performance of SINDy degrades, as the algorithm picks up more and more extra terms that are not present in the true dynamic. For low to moderately high noise levels the CINDy algorithm provides the best performance, with the lowest error in terms of $\mathcal{E}_R$ and the sparsest solutions in terms of $\mathcal{S}_E$. For very high noise levels, all the algorithms perform similarly in terms of $\mathcal{E}_R$, while CINDy’s recoveries are still significantly sparser than those of SINDy.</p>
<h3 id="references">References</h3>
<p>[BPK] Brunton, S.L., Proctor, J.L. , and Kutz, J.N. (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. In <em>Proceedings of the national academy of sciences</em> 113.15 : 3932-3937 <a href="https://www.pnas.org/content/pnas/113/15/3932.full.pdf">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditional gradients. In <em>International Conference on Machine Learning</em> (pp. 735-743). PMLR <a href="http://proceedings.mlr.press/v97/braun19a/braun19a.pdf">pdf</a></p>
<p>[CDS] Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. In <em>SIAM review</em>, 43(1), 129-159. <a href="https://web.stanford.edu/group/SOL/papers/BasisPursuit-SIGEST.pdf">pdf</a></p>
<p>[LZ] Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. In <em>SIAM Journal on Optimization</em> 26(2) (pp. 1379–1409). SIAM. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[T] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. In <em>Journal of the Royal Statistical Society: Series B</em> (Methodological), 58(1), 267-288 <a href="https://statweb.stanford.edu/~tibs/lasso/lasso.pdf">pdf</a></p>
<p>[W] Wainwright, M. J. (2009). Sharp thresholds for High-Dimensional and noisy sparsity recovery using $\ell_ {1} $-Constrained Quadratic Programming (Lasso). In <em>IEEE transactions on information theory</em>, 55(5), 2183-2202 <a href="https://people.eecs.berkeley.edu/~wainwrig/Papers/Wai09_Sharp_Journal.pdf">pdf</a></p>
<p>[T2] Tibshirani, R. J. (2013). The lasso problem and uniqueness. In <em>Electronic Journal of statistics</em>, 7, 1456-1490 <a href="https://arxiv.org/pdf/1206.0313.pdf">pdf</a></p>
<p>[CRT] Candès, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. In <em>IEEE Transactions on information theory</em>, 52(2), 489-509 <a href="https://arxiv.org/pdf/math/0409186.pdf">pdf</a></p>
<p>[ADLVSNW] Andersen, M., Dahl, J., Liu, Z., Vandenberghe, L., Sra, S., Nowozin, S., & Wright, S. J. (2011). Interior-point methods for large-scale cone programming. In <em>Optimization for machine learning</em>, 5583 <a href="http://www.imm.dtu.dk/~mskan/publications/mlbook.pdf">pdf</a></p>
<p>[K] Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear oscillators. In <em>International symposium on mathematical problems in theoretical physics</em> (pp. 420-422). Springer, Berlin, Heidelberg <a href="https://link.springer.com/chapter/10.1007%2FBFb0013365">pdf</a></p>
<p>[FPUT] Fermi, E., Pasta, P., Ulam, S., & Tsingou, M. (1955). Studies of the nonlinear problems (No. LA-1940). Los Alamos Scientific Lab., N. Mex. <a href="https://www.osti.gov/servlets/purl/4376203">pdf</a></p>
<p>[MM] Michaelis, L., Menten, M. L. (2007). Die kinetik der invertinwirkung [The kinetics of invertase action]. Universitätsbibliothek Johann Christian Senckenberg. <a href="https://path.upmc.edu/divisions/chp/PDF/Michaelis-Menten_Kinetik.pdf">pdf</a></p>

<h1>DNN Training with Frank–Wolfe</h1>
<p><em>2020-11-11</em></p>
<p><em>TL;DR: This is an informal discussion of our recent paper <a href="https://arxiv.org/abs/2010.07243">Deep Neural Network Training with Frank–Wolfe</a> by <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, <a href="http://www.christophspiegel.berlin/">Christoph Spiegel</a>, and Max Zimmer, where we study the general efficacy of using Frank–Wolfe methods for the training of Deep Neural Networks with constrained parameters. Summarizing the results, we (1) show the general feasibility of this markedly different approach for first-order based training of Neural Networks, (2) demonstrate that the particular choice of constraints can have a drastic impact on the learned representation, and (3) show that through appropriate constraints one can achieve performance exceeding that of unconstrained stochastic Gradient Descent, matching state-of-the-art results relying on $L^2$-regularization.</em>
<!--more--></p>
<p><em>Written by Christoph Spiegel.</em></p>
<h3 id="motivation">Motivation</h3>
<p>Despite its simplicity, stochastic Gradient Descent (SGD) is still the method of choice for training Neural Networks. Assuming the network is parameterized by some unconstrained weights $\theta$, the standard SGD update can simply be stated as</p>
\[\theta_{t+1} = \theta_t - \alpha \tilde{\,\nabla} L(\theta_t),\]
<p>for some given loss function $L$, its $t$-th batch gradient $\tilde{\,\nabla} L(\theta_t)$ and some learning rate $\alpha$. In practice, one of the more significant contributions to this approach for obtaining state-of-the-art performance has come in the form of adding an $L^2$-regularization term to the loss function. Motivated by this, we explored the efficacy of constraining the parameter space of Neural Networks to a suitable compact convex region ${\mathcal C}$. Standard SGD would require a projection step during each update to maintain the feasibility of the parameters in this constrained setting, that is, the update would be</p>
\[\theta_{t+1} = \Pi_{\mathcal C} \big( \theta_t - \alpha \tilde{\,\nabla} L(\theta_t) \big),\]
<p>where the projection function $\Pi_{\mathcal C}$ maps the input to its closest neighbor in ${\mathcal C}$. Depending on the particular feasible region, such a projection step can be very costly, so we instead explored a more appropriate alternative in the form of the (stochastic) Frank–Wolfe algorithm (SFW) [FW, LP]. Rather than relying on a projection step, SFW calls a linear minimization oracle (LMO) to determine</p>
\[v_t = \textrm{argmin}_{v \in \mathcal C} \langle \tilde{\,\nabla} L(\theta_t), v \rangle,\]
<p>and move in the direction of $v_t$ through the update</p>
\[\theta_{t+1} = \theta_t + \alpha ( v_t - \theta_t)\]
<p>where $\alpha \in [0,1]$. Feasibility is maintained since the update step takes the convex combination of two points in the convex feasible region. For a more in-depth look at Frank–Wolfe methods check out the <a href="http://www.pokutta.com/blog/research/2018/10/05/cheatsheet-fw.html">Frank-Wolfe and Conditional Gradients Cheat Sheet</a>. In the remainder of this post we will present some of the key findings from the paper.</p>
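To make the update concrete, a single SFW step can be sketched in a few lines of NumPy. The `sfw_step` helper, the `lmo_l2` oracle, and the toy values below are illustrative, not taken from the paper:

```python
import numpy as np

def sfw_step(theta, batch_grad, lmo, alpha):
    """One stochastic Frank-Wolfe update: no projection needed,
    feasibility follows from the convex combination."""
    v = lmo(batch_grad)                  # v_t = argmin_{v in C} <grad, v>
    return theta + alpha * (v - theta)   # theta_{t+1}

# Example: C is the L2 ball of radius tau (toy setup)
tau = 5.0
lmo_l2 = lambda g: -tau * g / np.linalg.norm(g)

theta = np.zeros(3)
grad = np.array([3.0, 4.0, 0.0])
theta_next = sfw_step(theta, grad, lmo_l2, alpha=0.1)
# moves toward v = -tau * g / ||g|| = (-3, -4, 0), so theta_next = (-0.3, -0.4, 0.0)
```

Since `theta_next` is a convex combination of two points of the ball, it stays feasible without any projection.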
<h3 id="how-to-regularize-neural-networks-through-constraints">How to regularize Neural Networks through constraints</h3>
<p>We have focused on the case of uniformly applying the same type of constraint, such as a bound on the $L^p$-norm, separately on the weight and bias parameters of each individual layer of the network to achieve a regularizing effect, varying only the diameter of that region. Let us consider some particular types of constraints.</p>
<p><strong>$L^2$-norm ball.</strong> Constraining the $L^2$-norm of weights and optimizing them using SFW is most comparable, both in theory and in practice, to SGD with weight decay. The output of the LMO is given by</p>
\[\textrm{argmin}_{v \in \mathcal{B}_2(\tau)} \langle v,x \rangle = -\tau \, x / \|x\|_2,\]
<p>that is, it is parallel to the gradient and so, as long as the current iterate of the weights is not close to the boundary of the $L^2$-norm ball, the update of the SFW algorithm is similar to that of SGD given an appropriate learning rate.</p>
<p><strong>Hypercube.</strong> Requiring each individual weight of a network or a layer to lie within a certain range, say in $[-\tau,\tau],$ is possibly an even more natural type of constraint. Here, however, the update step taken by SFW differs drastically from that taken by projected SGD: in the output of the LMO each parameter receives a value of equal magnitude, since</p>
\[\textrm{argmin}_{v \in \mathcal{B}_\infty(\tau)} \langle v,x \rangle = -\tau \, \textrm{sgn}(x),\]
<p>so to a degree all parameters are forced to receive a non-trivial update each step.</p>
<p><strong>$L^1$-norm ball and $K$-sparse polytopes.</strong> On the other end of the spectrum from the dense updates forced by the LMO of the hypercube are feasible regions whose LMOs return very sparse vectors. When for example constraining the $L^1$-norm of weights of a layer, the output of the LMO is given by the vector with a single non-zero entry equal to $-\tau \, \textrm{sign}(x)$ at a point where $|x|$ takes its maximum. As a consequence, only a single weight, that from which the most gain can be derived, will in fact increase in absolute value during the update step of the Frank–Wolfe algorithm while all other weights will decay and move towards zero. The $K$-sparse polytope of radius $\tau > 0$ is obtained as the intersection of the $L^1$-ball of radius $\tau K$ and the hypercube of radius $\tau$ and generalizes that principle by increasing the absolute value of the $K$ most important weights.</p>
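The four oracles discussed above can be sketched in a few lines of NumPy; the function names, the radius `tau`, and the toy gradient `g` are illustrative:

```python
import numpy as np

tau = 1.0  # radius of the feasible region (illustrative value)

def lmo_l2(x):
    # L2 ball: output is antiparallel to the gradient
    return -tau * x / np.linalg.norm(x)

def lmo_hypercube(x):
    # L-infinity ball: every coordinate gets a full-magnitude entry
    return -tau * np.sign(x)

def lmo_l1(x):
    # L1 ball: single non-zero entry at the largest |x_i|
    v = np.zeros_like(x)
    i = np.argmax(np.abs(x))
    v[i] = -tau * np.sign(x[i])
    return v

def lmo_ksparse(x, K=2):
    # K-sparse polytope of radius tau: the K largest |x_i| get +-tau
    v = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-K:]
    v[idx] = -tau * np.sign(x[idx])
    return v

g = np.array([0.5, -2.0, 1.0, 0.1])
print(lmo_hypercube(g))  # dense:  [-1.  1. -1. -1.]
print(lmo_l1(g))         # sparse: [ 0.  1.  0.  0.]
print(lmo_ksparse(g))    # [ 0.  1. -1.  0.]
```

The dense-versus-sparse distinction is visible directly in the outputs: the hypercube LMO touches every coordinate, while the $L^1$ and $K$-sparse LMOs only move the most informative ones.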
<h3 id="the-impact-of-constraints-on-learned-features">The impact of constraints on learned features</h3>
<p>Let us illustrate the impact that the choice of constraints has on the learned representations through a simple classifier trained on the MNIST dataset. The particular network chosen here, for the sake of exposition, has no hidden layers and no bias terms and the flattened input layer of size 784 is fully connected to the output layer of size 10. The weights of the network are therefore represented by a single 784 × 10 matrix, where each of the ten columns corresponds to the weights learned to recognize the ten digits 0 to 9. In Figure 1 we present a visualization of this network trained on the dataset with different types of constraints placed on the parameters. Each image interprets one of the columns of the weight matrix as an image of size 28 × 28 where red represents negative weights and green represents positive weights for a given pixel. We see that the choice of feasible region, and in particular the LMO associated with it, can have a drastic impact on the representations learned by the network when using the stochastic Frank–Wolfe algorithm. For completeness' sake we have included several commonly used adaptive variants of SGD in the comparison.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/nnfw/mnist-visualization_compact.png" alt="Visualizing learned features on MNIST" style="float:left; width:98%" />
</div>
<p class="figcap"><strong>Figure 1.</strong> <em>Visualization of the weights in a fully connected no-hidden-layer classifier trained on the MNIST dataset corresponding to the digits 0, 1 and 2. Red corresponds to negative and green to positive weights.</em></p>
<p>Further demonstrating the impact of constraints on the learned representations, we consider the sparsity of the weights of trained networks. Let a parameter of a network be <em>inactive</em> if its absolute value is smaller than that of its random initialization. To study the effect of constraining the parameters, we trained two different types of networks, a fully connected network with two hidden layers with a total of 26 506 parameters and a convolutional network with 93 322 parameters, on the MNIST dataset. In Figure 2 we see that regions spanned by sparse vectors, such as $K$-sparse polytopes, result in noticeably fewer active parameters in the network over the course of training, whereas regions whose LMO forces larger updates in each parameter, such as the Hypercube, result in more active weights.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/nnfw/sparseness_sparse-mnist-2.png" alt="Sparseness during training on MNIST" style="width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> <em>Number of active parameters in two different networks trained on the MNIST dataset.</em></p>
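The inactive-parameter criterion above is straightforward to compute. A sketch of such bookkeeping (the helper name and the toy weight arrays are hypothetical, not the paper's exact implementation):

```python
import numpy as np

def inactive_fraction(weights, init_weights):
    """Fraction of parameters whose absolute value dropped below that of
    their random initialization ('inactive' parameters)."""
    total = sum(w.size for w in weights)
    inactive = sum(int(np.sum(np.abs(w) < np.abs(w0)))
                   for w, w0 in zip(weights, init_weights))
    return inactive / total

# Toy check: two of the four weights shrank below their initialization
w0 = [np.array([1.0, -1.0, 0.5, -0.5])]
w  = [np.array([0.2, -2.0, 0.1, -0.9])]
print(inactive_fraction(w, w0))  # 0.5
```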
<h3 id="achieving-state-of-the-art-results">Achieving state-of-the-art results</h3>
<p>Finally, we demonstrate the feasibility of training even very deep Neural Networks using SFW. We trained several state-of-the-art Neural Networks on the CIFAR-10, CIFAR-100, and ImageNet datasets. In Table 1 we show the top-1 test accuracy attained by networks based on the DenseNet, WideResNet, GoogLeNet and ResNeXt architectures on the test sets of these datasets. Here we compare networks with unconstrained parameters trained using SGD with momentum both with and without weight decay as well as networks whose parameters are constrained in their $L^2$-norm or $L^\infty$-norm and which were trained using SFW with momentum added. We can observe that, when constraining the $L^2$-norm of the parameters, SFW attains performance exceeding that of standard SGD and matching the state-of-the-art performance of SGD with weight decay. When constraining the $L^\infty$-norm of the parameters, SFW does not quite achieve the same performance as SGD with weight decay, but a regularization effect through the constraints is nevertheless clearly present, as it still exceeds the performance of SGD without weight decay. We furthermore note that, due to the nature of the LMOs associated with these particular regions, runtimes were comparable.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/nnfw/dnn_fw_stoa.png" alt="Test accuracy of deep networks trained with SGD and SFW" style="float:left; width:98%" />
</div>
<p class="figcap"><strong>Table 1.</strong> <em>Test accuracy attained by several deep Neural Networks trained on the CIFAR-10, CIFAR-100, and ImageNet datasets. Parameters trained with SGD were unconstrained.</em></p>
<h3 id="reproducibility">Reproducibility</h3>
<p>We have made our implementations of the various stochastic Frank–Wolfe methods considered in the paper available online both for PyTorch and for TensorFlow under <a href="https://github.com/ZIB-IOL/StochasticFrankWolfe">github.com/ZIB-IOL/StochasticFrankWolfe</a>. There you will also find a list of Google Colab notebooks that allow you to recreate all the experimental results presented here.</p>
<h3 id="references">References</h3>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[LP] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>Christoph SpiegelTL;DR: This is an informal discussion of our recent paper Deep Neural Network Training with Frank–Wolfe by Sebastian Pokutta, Christoph Spiegel, and Max Zimmer, where we study the general efficacy of using Frank–Wolfe methods for the training of Deep Neural Networks with constrained parameters. Summarizing the results, we (1) show the general feasibility of this markedly different approach for first-order based training of Neural Networks, (2) demonstrate that the particular choice of constraints can have a drastic impact on the learned representation, and (3) show that through appropriate constraints one can achieve performance exceeding that of unconstrained stochastic Gradient Descent, matching state-of-the-art results relying on $L^2$-regularization.Projection-Free Adaptive Gradients for Large-Scale Optimization2020-10-21T01:00:00+02:002020-10-21T01:00:00+02:00http://www.pokutta.com/blog/research/2020/10/21/adasfw<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/pdf/2009.14114.pdf">Projection-Free Adaptive Gradients for Large-Scale Optimization</a> by <a href="https://cyrillewcombettes.github.io/">Cyrille Combettes</a>, <a href="http://www.christophspiegel.berlin/">Christoph Spiegel</a>, and <a href="http://www.pokutta.com/">Sebastian Pokutta</a>. We propose to improve the performance of state-of-the-art stochastic Frank-Wolfe algorithms via a better use of first-order information. This is achieved by blending in adaptive gradients, a method for setting entry-wise step-sizes that automatically adjust to the geometry of the problem. Computational experiments on convex and nonconvex objectives demonstrate the advantage of our approach.</em>
<!--more--></p>
<p><em>Written by Cyrille Combettes.</em></p>
<h3 id="introduction">Introduction</h3>
<p>We consider the family of stochastic Frank-Wolfe algorithms, addressing constrained finite-sum optimization problems</p>
\[\min_{x\in\mathcal{C}}\left\{f(x)\overset{\text{def}}{=}\frac{1}{m}\sum_{i=1}^mf_i(x)\right\},\]
<p>where $\mathcal{C}\subset\mathbb{R}^n$ is a compact convex set and $f_1,\ldots,f_m\colon\mathbb{R}^n\rightarrow\mathbb{R}$ are smooth convex functions. Their generic template is presented in Template <a href="#fw">1</a>. When $\tilde{\nabla}f(x_t)=\nabla f(x_t)$, we recover the original Frank-Wolfe algorithm (<a href="#fw56">Frank and Wolfe</a>, <a href="#fw56">1956</a>), a.k.a. conditional gradient algorithm (<a href="#levitin66">Levitin and Polyak</a>, <a href="#levitin66">1966</a>). It is a simple projection-free algorithm that computes a linear minimization at each iteration and moves in the direction of the solution $v_t$ returned, with a step-size $\gamma_t\in\left[0,1\right]$ ensuring that the new iterate $x_{t+1}=(1-\gamma_t)x_t+\gamma_tv_t\in\mathcal{C}$ is feasible by convex combination. Hence, it does not need to compute projections back onto $\mathcal{C}$.</p>
<hr />
<p><span id="fw"><strong>Template 1.</strong></span> Stochastic Frank-Wolfe</p>
<p><em>Input:</em> Start point $x_0\in\mathcal{C}$, step-sizes $\gamma_t\in\left[0,1\right]$.<br />
$\text{ }$ 1: $\text{ }$ <strong>for</strong> $t=0$ <strong>to</strong> $T-1$ <strong>do</strong><br />
$\text{ }$ <span id="fwest">2</span>: $\quad$ Update the gradient estimator $\tilde{\nabla}f(x_t)$<br />
$\text{ }$ 3: $\quad$ $v_t\leftarrow\underset{v\in\mathcal{C}}{\arg\min}\langle\tilde{\nabla}f(x_t),v\rangle$<br />
$\text{ }$ 4: $\quad$ $x_{t+1}\leftarrow x_t+\gamma_t(v_t-x_t)$<br />
$\text{ }$ 5: $\text{ }$ <strong>end for</strong></p>
<hr />
<p><br /></p>
<p>When $m$ is very large, querying exact first-order information from $f$ can be too expensive. Instead, stochastic Frank-Wolfe algorithms build a gradient estimator $\tilde{\nabla}f(x_t)$ with only approximate first-order information. For example, the Stochastic Frank-Wolfe algorithm (SFW) takes the average $\tilde{\nabla}f(x_t)\leftarrow(1/b_t)\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_t)$ over a minibatch $i_1,\ldots,i_{b_t}$ sampled uniformly at random from \(\{1,\ldots,m\}\). State-of-the-art stochastic Frank-Wolfe algorithms also include the Stochastic Variance-Reduced Frank-Wolfe algorithm (SVRF) (<a href="#hazan16">Hazan and Luo</a>, <a href="#hazan16">2016</a>), the Stochastic Path-Integrated Differential EstimatoR Frank-Wolfe algorithm (SPIDER-FW) (<a href="#yurtsever19">Yurtsever et al.</a>, <a href="#yurtsever19">2019</a>; <a href="#shen19">Shen et al.</a>, <a href="#shen19">2019</a>), the Online stochastic Recursive Gradient-based Frank-Wolfe algorithm (ORGFW) (<a href="#xie20">Xie et al.</a>, <a href="#xie20">2020</a>), and the Constant batch-size Stochastic Frank-Wolfe algorithm (CSFW) (<a href="#negiar20">Négiar et al.</a>, <a href="#negiar20">2020</a>). Their strategies are reported in Table <a href="#table">1</a>.</p>
<p><span id="table" style="font-size:95%">Table 1: Gradient estimator updates in stochastic Frank-Wolfe algorithms. The indices $i_1,\ldots,i_{b_t}$ are sampled i.i.d. uniformly at random from \(\{1,\ldots,m\}\).</span></p>
<table>
<thead>
<tr>
<th><strong>Algorithm</strong></th>
<th><strong>Update $\tilde{\nabla}f(x_t)$ in Line <a href="#fwest">2</a></strong></th>
<th><strong>Additional information</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>SFW</td>
<td>$\displaystyle\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_t)$</td>
<td>$\varnothing$</td>
</tr>
<tr>
<td>SVRF</td>
<td>\(\displaystyle\nabla f(\tilde{x}_t)+\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}(\nabla f_i(x_t)-\nabla f_i(\tilde{x}_t))\)</td>
<td>$\tilde{x}_t$ is the last snapshot iterate</td>
</tr>
<tr>
<td>SPIDER-FW</td>
<td>\(\displaystyle\nabla f(\tilde{x}_t)+\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}(\nabla f_i(x_t)-\nabla f_i(x_{t-1}))\)</td>
<td>$\tilde{x}_t$ is the last snapshot iterate</td>
</tr>
<tr>
<td>ORGFW</td>
<td>$\displaystyle\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_t)+(1-\rho_t)\left(\tilde{\nabla}f(x_{t-1})-\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_{t-1})\right)$</td>
<td>$\rho_t$ is the momentum parameter</td>
</tr>
<tr>
<td>CSFW</td>
<td>\(\displaystyle\tilde{\nabla}f(x_{t-1})+\sum_{i=i_1}^{i_{b_t}}\left(\frac{1}{m}f_i'(\langle a_i,x_t\rangle)-[\alpha_{t-1}]_i\right)a_i\) <br /> and \([\alpha_t]_i\leftarrow(1/m)f_i'(\langle a_i,x_t\rangle)\) if \(i\in\{i_1,\ldots,i_{b_t}\}\) else \([\alpha_{t-1}]_i\)</td>
<td>Assumes separability of $f$ as <br /> $\displaystyle f(x)=\frac{1}{m}\sum_{i=1}^mf_i(\langle a_i,x\rangle)$</td>
</tr>
</tbody>
</table>
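To make the first two rows of the table concrete, here is a sketch of the SFW and SVRF estimators on a toy finite sum (the function names and the quadratic toy objective are illustrative, not taken from the paper):

```python
import numpy as np

def sfw_estimator(grads_fi, x, batch):
    # SFW: plain minibatch average (1/b_t) * sum_i grad f_i(x_t)
    return np.mean([grads_fi[i](x) for i in batch], axis=0)

def svrf_estimator(grads_fi, full_grad_snap, x, x_snap, batch):
    # SVRF: full gradient at the snapshot, corrected on the minibatch
    corr = np.mean([grads_fi[i](x) - grads_fi[i](x_snap) for i in batch],
                   axis=0)
    return full_grad_snap + corr

# Toy finite sum: f_i(x) = (x - a_i)^2 / 2, so grad f_i(x) = x - a_i
a = [0.0, 1.0, 2.0, 3.0]
grads = [lambda x, ai=ai: x - ai for ai in a]
x_snap = np.array([1.0])
full = np.mean([g(x_snap) for g in grads], axis=0)   # exact gradient: [-0.5]

# At the snapshot the correction vanishes, so SVRF is exact on any batch
est = svrf_estimator(grads, full, x_snap, x_snap, batch=[0, 2])
print(est)  # [-0.5]
```

The variance-reduced estimator agrees with the exact gradient at the snapshot and only drifts as $x_t$ moves away from $\tilde{x}_t$, which is why the snapshot is refreshed periodically.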
<p>In our paper, we propose to improve the performance of this family of algorithms by using adaptive gradients.</p>
<h3 id="the-adaptive-gradient-algorithm">The Adaptive Gradient algorithm</h3>
<p>The Adaptive Gradient algorithm (AdaGrad) (<a href="#duchi11">Duchi et al.</a>, <a href="#duchi11">2011</a>; <a href="#mcmahan10">McMahan and Streeter</a>, <a href="#mcmahan10">2010</a>) is presented in Algorithm <a href="#adagrad">2</a>. The new iterate $x_{t+1}$ is obtained by solving a subproblem in Line <a href="#new">4</a>. The default value for the offset hyperparameter is $\delta\leftarrow10^{-8}$.</p>
<hr />
<p><span id="adagrad"><strong>Algorithm 2.</strong></span> Adaptive Gradient (AdaGrad)</p>
<p><em>Input:</em> Start point $x_0\in\mathcal{C}$, offset $\delta>0$, learning rate $\eta>0$.<br />
$\text{ }$ 1: $\text{ }$ <strong>for</strong> $t=0$ <strong>to</strong> $T-1$ <strong>do</strong><br />
$\text{ }$ 2: $\quad$ Update the gradient estimator $\tilde{\nabla}f(x_t)$<br />
$\text{ }$ <span id="h">3</span>: $\quad$ $H_t\leftarrow\operatorname{diag}\left(\delta1+\sqrt{\sum_{s=0}^t\tilde{\nabla}f(x_s)^2}\,\right)$<br />
$\text{ }$ <span id="new">4</span>: $\quad$ \(x_{t+1}\leftarrow\underset{x\in\mathcal{C}}{\arg\min}\,\eta\langle\tilde{\nabla}f(x_t),x\rangle+\frac{1}{2}\|x-x_t\|_{H_t}^2\)<br />
$\text{ }$ 5: $\text{ }$ <strong>end for</strong></p>
<hr />
<p><br /></p>
<p>The matrix $H_t\in\mathbb{R}^{n\times n}$ is diagonal and satisfies for all \(i,j\in\{1,\ldots,n\}\),</p>
\[[H_t]_{i,j}=\delta+\sqrt{\sum_{s=0}^t[\tilde{\nabla}f(x_s)]_i^2}\quad\text{if }i=j\quad\text{else } 0.\]
<p>To see why AdaGrad builds entry-wise step-sizes from past first-order information, note that the subproblem in Line <a href="#new">4</a> is equivalent to</p>
<div id="sub">
$$
x_{t+1}\leftarrow\underset{x\in\mathcal{C}}{\arg\min}\,\|x-(x_t-\eta H_t^{-1}\tilde{\nabla}f(x_t))\|_{H_t}\tag{1}
$$
</div>
<p>by the first-order optimality condition (<a href="#polyak87">Polyak</a>, <a href="#polyak87">1987</a>), where \(\|\cdot\|_{H_t}\colon u\in\mathbb{R}^n\mapsto\sqrt{\langle u,H_tu\rangle}\). Ignoring the constraint set $\mathcal{C}$ for ease of exposition, we obtain</p>
\[x_{t+1}\leftarrow x_t-\eta H_t^{-1}\tilde{\nabla}f(x_t),\]
<p>i.e., for every feature \(i\in\{1,\ldots,n\}\),</p>
\[[x_{t+1}]_i\leftarrow[x_t]_i-\frac{\eta[\tilde{\nabla}f(x_t)]_i}{\delta+\sqrt{\sum_{s=0}^t[\tilde{\nabla}f(x_s)]_i^2}}.\]
<p>Therefore, $\delta$ prevents division by zero and the step-sizes automatically adjust to the geometry of the problem. In particular, rare but potentially very informative features do not go unnoticed, as they receive a large step-size whenever they appear.</p>
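A minimal unconstrained sketch of this entry-wise update (the function name and the toy gradient sequence are illustrative):

```python
import numpy as np

def adagrad_step(x, grad, grad_sq_sum, eta=0.1, delta=1e-8):
    """One unconstrained AdaGrad update with a diagonal H_t:
    every coordinate gets its own step-size."""
    grad_sq_sum = grad_sq_sum + grad ** 2    # running sum of squared gradients
    H = delta + np.sqrt(grad_sq_sum)         # diagonal entries of H_t
    return x - eta * grad / H, grad_sq_sum

# Coordinate 1 appears only once (a "rare feature") yet still receives
# a full-size step eta when it finally shows up.
x, s = np.zeros(2), np.zeros(2)
for g in [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    x, s = adagrad_step(x, g, s)
# |x[1]| ~ eta = 0.1, while coordinate 0's later steps were damped
# by the accumulated sqrt(sum of squares).
```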
<h3 id="frank-wolfe-with-adaptive-gradients">Frank-Wolfe with adaptive gradients</h3>
<p>For constrained optimization, AdaGrad can be very expensive as it requires solving a constrained subproblem at each iteration (Line <a href="#new">4</a>), which, by (<a href="#sub">1</a>), can also be seen as a projection in the non-Euclidean norm \(\|\cdot\|_{H_t}\). Instead, we propose to solve the subproblems <em>very</em> incompletely, via a small and fixed number of iterations of the Frank-Wolfe algorithm. This approach is aimed at designing an efficient method in practice. In particular, contrary to <a href="#lan16">Lan and Zhou</a> (<a href="#lan16">2016</a>), we do not worry about the accuracy of the solutions to the subproblems. We present our method via a generic template in Template <a href="#adafw">3</a>.</p>
<hr />
<p><span id="adafw"><strong>Template 3.</strong></span> Frank-Wolfe with adaptive gradients</p>
<p><em>Input:</em> Start point $x_0\in\mathcal{C}$, number of inner iterations $K$, learning rate $\eta>0$.<br />
$\text{ }$ 1: $\text{ }$ <strong>for</strong> $t=0$ <strong>to</strong> $T-1$ <strong>do</strong><br />
$\text{ }$ <span id="adafwest">2</span>: $\quad$ Update the gradient estimator $\tilde{\nabla}f(x_t)$ <span style="float:right">$\triangleright$ as in any of Table <a href="#table">1</a></span><br />
$\text{ }$ 3: $\quad$ Update the diagonal matrix $H_t$ <span style="float:right">$\triangleright$ as in, e.g., Line <a href="#h">3</a> of Algorithm <a href="#adagrad">2</a></span><br />
$\text{ }$ <span id="start">4</span>: $\quad$ \(y_0^{(t)}\leftarrow x_t\)<br />
$\text{ }$ 5: $\quad$ <strong>for</strong> $k=0$ <strong>to</strong> $K-1$ <strong>do</strong><br />
$\text{ }$ 6: $\quad\quad$ \(\nabla Q_t(y_k^{(t)})\leftarrow\tilde{\nabla}f(x_t)+\frac{1}{\eta}H_t(y_k^{(t)}-x_t)\)<br />
$\text{ }$ 7: $\quad\quad$ \(v_k^{(t)}\leftarrow\underset{v\in\mathcal{C}}{\arg\min}\langle\nabla Q_t(y_k^{(t)}),v\rangle\)<br />
$\text{ }$ 8: $\quad\quad$ \(\gamma_k^{(t)}\leftarrow\min\left\{\eta\frac{\langle\nabla Q_t(y_k^{(t)}),y_k^{(t)}-v_k^{(t)}\rangle}{\|y_k^{(t)}-v_k^{(t)}\|_{H_t}^2},1\right\}\)<br />
$\text{ }$ 9: $\quad\quad$ \(y_{k+1}^{(t)}\leftarrow y_k^{(t)}+\gamma_k^{(t)}(v_k^{(t)}-y_k^{(t)})\)<br />
<span id="end">10</span>: $\quad$ <strong>end for</strong><br />
11: $\quad$ \(x_{t+1}\leftarrow y_K^{(t)}\)<br />
12: $\text{ }$ <strong>end for</strong></p>
<hr />
<p><br /></p>
<p>Lines <a href="#start">4</a>-<a href="#end">10</a> apply $K$ iterations of the Frank-Wolfe algorithm on</p>
\[\min_{x\in\mathcal{C}}\left\{Q_t(x)\overset{\text{def}}{=}f(x_t)+\langle\tilde{\nabla}f(x_t),x-x_t\rangle+\frac{1}{2\eta}\|x-x_t\|_{H_t}^2\right\},\]
<p>which is equivalent to the AdaGrad subproblem. The sequence of iterates is denoted by $y_0^{(t)},\ldots,y_K^{(t)}$, starting from $x_t=y_0^{(t)}$ and ending at $x_{t+1}=y_K^{(t)}$. In our experiments, we typically set $K\sim5$. The strategy in Line <a href="#adafwest">2</a> can be that of any of the variants SFW, SVRF, SPIDER-FW, ORGFW, or CSFW. When using variant X, the associated method is named AdaX.</p>
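The inner loop (Lines 4-10) can be sketched as follows, assuming $H_t$ is diagonal and stored as a vector; the `lmo` and the toy $L^2$-ball setup below are illustrative:

```python
import numpy as np

def adafw_inner(x_t, grad_est, H_diag, lmo, eta=1.0, K=5):
    """K Frank-Wolfe iterations on the quadratic model Q_t around x_t."""
    y = x_t.copy()
    for _ in range(K):
        gQ = grad_est + (H_diag * (y - x_t)) / eta       # gradient of Q_t
        v = lmo(gQ)                                      # linear minimization
        d = y - v
        denom = np.dot(d, H_diag * d)                    # ||y - v||_{H_t}^2
        if denom == 0:                                   # already optimal
            break
        gamma = min(eta * np.dot(gQ, d) / denom, 1.0)    # clipped step-size
        y = y + gamma * (v - y)
    return y  # x_{t+1}

# Toy check on the L2 unit ball with H_t = I: the unconstrained model
# minimizer -grad_est lies outside the ball, so the loop stops at (-1, 0).
tau = 1.0
lmo = lambda g: -tau * g / np.linalg.norm(g)
x_next = adafw_inner(np.zeros(2), np.array([2.0, 0.0]), np.ones(2), lmo)
print(x_next)  # [-1.  0.]
```

Each inner iteration costs one LMO call, so with $K\sim5$ the per-iteration cost stays close to that of the underlying stochastic Frank-Wolfe variant.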
<h3 id="computational-experiments">Computational experiments</h3>
<p>We compare our method to SFW, SVRF, SPIDER-FW, ORGFW, and CSFW. For the three experiments with convex objectives, we plot our method using the best performing stochastic Frank-Wolfe variant. For the three neural network experiments, CSFW is not applicable and we run AdaSFW only, as variance reduction may be ineffective in deep learning (<a href="#defazio19">Defazio and Bottou</a>, <a href="#defazio19">2019</a>). In addition, since momentum has become a key ingredient for neural network optimization, we demonstrate that AdaSFW also works very well with momentum. The method is named AdamSFW and $H_t$ is built as in <a href="#reddi18">Reddi et al.</a> (<a href="#reddi18">2018</a>).</p>
<p>The results are presented in Figure <a href="#fig">1</a>. One important observation is that none of the previous methods outperform the vanilla SFW on the nonconvex experiments, except on the MNIST dataset. On the IMDB dataset, AdaSFW yields the best test performance despite optimizing slowly over the training set, and AdamSFW reaches its maximum accuracy very fast which can be interesting if we consider using early stopping.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/adasfw/svm.png" alt="svm" style="float:left; width:48%" />
<img src="http://www.pokutta.com/blog/assets/adasfw/lin.png" alt="linear" style="float:right; width:48%" />
<p style="clear: both;"></p>
<img src="http://www.pokutta.com/blog/assets/adasfw/log.png" alt="logistic" style="float:left; width:48%" />
<img src="http://www.pokutta.com/blog/assets/adasfw/mnist.png" alt="MNIST" style="float:right; width:48%" />
<p style="clear: both;"></p>
<img src="http://www.pokutta.com/blog/assets/adasfw/imdb.png" alt="IMDB" style="float:left; width:48%" />
<img src="http://www.pokutta.com/blog/assets/adasfw/cifar.png" alt="CIFAR-10" style="float:right; width:48%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Computational experiments on convex and nonconvex objectives.</p>
<h4 id="references">References</h4>
<p><span id="defazio19" style="font-size:95%">A. Defazio and L. Bottou. <a href="https://arxiv.org/pdf/1812.04529.pdf">On the ineffectiveness of variance reduced optimization for deep learning</a>. In <em>Advances in Neural Information Processing Systems 32</em>, pages 1755–1765. 2019.</span></p>
<p><span id="duchi11" style="font-size:95%">J. C. Duchi, E. Hazan, and Y. Singer. <a href="https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">Adaptive subgradient methods for online learning and stochastic optimization</a>. <em>Journal of Machine Learning Research</em>, 12(61):2121–2159, 2011.</span></p>
<p><span id="fw56" style="font-size:95%">M. Frank and P. Wolfe. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">An algorithm for quadratic programming</a>. <em>Naval Research Logistics Quarterly</em>, 3(1-2):95–110, 1956.</span></p>
<p><span id="hazan16" style="font-size:95%">E. Hazan and H. Luo. <a href="https://arxiv.org/pdf/1602.02101.pdf">Variance-reduced and projection-free stochastic optimization</a>. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>, pages 1263–1271, 2016.</span></p>
<p><span id="lan16" style="font-size:95%">G. Lan and Y. Zhou. <a href="https://pdfs.semanticscholar.org/5b75/13ad8e8fb691f5243278965dce549dbcc827.pdf">Conditional gradient sliding for convex optimization</a>. <em>SIAM Journal on Optimization</em>, 26(2):1379–1409, 2016.</span></p>
<p><span id="levitin66" style="font-size:95%">E. S. Levitin and B. T. Polyak. <a href="https://www.sciencedirect.com/science/article/abs/pii/0041555366901145">Constrained minimization methods</a>. <em>USSR Computational Mathematics and Mathematical Physics</em>, 6(5):1–50, 1966.</span></p>
<p><span id="mcmahan10" style="font-size:95%">H. B. McMahan and M. Streeter. <a href="https://arxiv.org/pdf/1002.4908.pdf">Adaptive bound optimization for online convex optimization</a>. In <em>Proceedings of the 23rd Annual Conference on Learning Theory</em>, 2010.</span></p>
<p><span id="negiar20" style="font-size:95%">G. Négiar, G. Dresdner, A. Y.-T. Tsai, L. El Ghaoui, F. Locatello, R. M. Freund, and F. Pedregosa. <a href="https://arxiv.org/pdf/2002.11860.pdf">Stochastic Frank-Wolfe for constrained finite-sum minimization</a>. In <em>Proceedings of the 37th International Conference on Machine Learning</em>. 2020. To appear.</span></p>
<p><span id="polyak87" style="font-size:95%">B. T. Polyak. <em><a href="http://lab7.ipu.ru/files/polyak/polyak-optimizationintro-eng.zip">Introduction to Optimization</a></em>. Optimization Software, 1987.</span></p>
<p><span id="reddi18" style="font-size:95%">S. J. Reddi, S. Kale, and S. Kumar. <a href="https://arxiv.org/pdf/1904.09237.pdf">On the convergence of Adam and beyond</a>. In <em>Proceedings of the 6th International Conference on Learning Representations</em>, 2018.</span></p>
<p><span id="shen19" style="font-size:95%">Z. Shen, C. Fang, P. Zhao, J. Huang, and H. Qian. <a href="http://proceedings.mlr.press/v89/shen19b/shen19b.pdf">Complexities in projection-free stochastic non-convex minimization</a>. In <em>Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics</em>, pages 2868–2876, 2019.</span></p>
<p><span id="xie20" style="font-size:95%">J. Xie, Z. Shen, C. Zhang, H. Qian, and B. Wang. <a href="https://arxiv.org/pdf/1910.09396.pdf">Efficient projection-free online methods with stochastic recursive gradient</a>. In <em>Proceedings of the 34th AAAI Conference on Artificial Intelligence</em>, pages 6446–6453, 2020.</span></p>
<p><span id="yurtsever19" style="font-size:95%">A. Yurtsever, S. Sra, and V. Cevher. <a href="http://proceedings.mlr.press/v97/yurtsever19b/yurtsever19b.pdf">Conditional gradient methods via stochastic path-integrated differential estimator</a>. In <em>Proceedings of the 36th International Conference on Machine Learning</em>, pages 7282–7291. 2019.</span></p>Cyrille CombettesTL;DR: This is an informal summary of our recent paper Projection-Free Adaptive Gradients for Large-Scale Optimization by Cyrille Combettes, Christoph Spiegel, and Sebastian Pokutta. We propose to improve the performance of state-of-the-art stochastic Frank-Wolfe algorithms via a better use of first-order information. This is achieved by blending in adaptive gradients, a method for setting entry-wise step-sizes that automatically adjust to the geometry of the problem. Computational experiments on convex and nonconvex objectives demonstrate the advantage of our approach.Accelerating Domain Propagation via GPUs2020-09-20T01:00:00+02:002020-09-20T01:00:00+02:00http://www.pokutta.com/blog/research/2020/09/20/gpu-prob<p><em>TL;DR: This is an informal discussion of our recent paper <a href="https://arxiv.org/abs/2009.07785">Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices</a> by Boro Sofranac, Ambros Gleixner, and Sebastian Pokutta. In the paper, we present a new algorithm to perform domain propagation of linear constraints on GPUs efficiently. The results show that efficient implementations of Mixed-integer Programming (MIP) methods are possible on GPUs, even though the success of using GPUs in MIPs has traditionally been limited. Our algorithm is capable of performing domain propagation on the GPU exclusively, without the need for synchronization with the CPU, paving the way for the usage of this algorithm in a new generation of MIP methods that run on GPUs.</em>
<!--more--></p>
<p><em>Written by Boro Sofranac.</em>
</p>
<h2 id="the-motivation">The motivation</h2>
<p>
Since the advent of general-purpose, massively parallel GPU hardware, many fields of applied mathematics have actively sought out the design of specialized algorithms which exploit the unprecedented computational resources this new type of hardware has to offer. A prime example is Neural Networks, whose rise to prominence was fueled by the “Deep Learning revolution”, with Deep Learning methods running on GPUs. Still, such success is missing in many other fields which are not as amenable to the specialized, massively parallel programming model of GPUs. One such field is Mixed-integer Programming (MIP).</p>
<p>The development of the <em>Simplex</em> algorithm for solving Linear Programs (LPs) in the 1940s was followed by a plethora of methods and solvers to solve LPs and MIPs over the following decades. A unifying characteristic of these methods is a) that they exhibit non-uniform algorithmic behaviour, and b) they operate on highly irregular data (i.e., data structures are sparse and do not have a regular structure in general). Such characteristics make the implementation of these methods on massively parallel hardware challenging and have hindered the application of GPUs in the field.</p>
<p>Recognizing these challenges in a number of fields, GPU hardware development in recent years has been geared towards easier expression of non-uniform workflows in its programming model (for example, the <a href="https://developer.nvidia.com/blog/introduction-cuda-dynamic-parallelism/"><em>Dynamic Parallelism</em></a> feature of NVIDIA GPUs) and hardware (e.g., <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions">atomic operations</a>). At the same time, researchers working in some related fields have shown that with the right algorithmic design, GPUs can be efficiently used for workflows previously considered unsuitable due to challenges similar to those present in MIP methods. Consider, for example, <em>Numerical Linear Algebra</em> (LA): for years, dense LA has been one of the main beneficiaries of GPU computing and allowed for tremendous speedups. On the other hand, if the input data was sparse (and irregular), the same algorithms often exhibited disappointing performance. New algorithms, however, such as the <a href="https://doi.org/10.1109/SC.2014.68"><em>CSR-Adaptive</em> algorithm developed by Greathouse and Daga</a> to perform sparse matrix-vector products (SpMV), have shown very impressive performance gains for highly unstructured sparse matrices.</p>
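To see why sparse, irregular data is hard to map onto uniform parallel hardware, consider a plain CSR sparse matrix-vector product: the work per row is proportional to its number of non-zeros, so a naive one-thread-per-row GPU mapping is badly load-imbalanced on matrices with skewed row lengths. A minimal sequential sketch of CSR SpMV (the toy matrix is illustrative):

```python
import numpy as np

def spmv_csr(vals, col_idx, row_ptr, x):
    """y = A x for a matrix A stored in CSR format. The inner loop length
    varies per row -- exactly the irregularity discussed above."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[j] * x[col_idx[j]]
    return y

# A = [[2, 0, 1],
#      [0, 0, 0],
#      [0, 3, 0]]
vals, col_idx, row_ptr = [2.0, 1.0, 3.0], [0, 2, 1], [0, 2, 2, 3]
print(spmv_csr(vals, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))  # [3. 0. 3.]
```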
<p>Against this backdrop, we set out to investigate the applicability of massively parallel algorithms in MIP methods. This, however, is no simple task: massively parallel programming models bring with them a different algorithmic paradigm and complexity notions while MIP methods have historically been strongly sequential. Basic design decisions need to be rethought and/or new algorithms developed. So we took a core MIP method used by all state-of-the-art solvers, namely <em>Domain Propagation</em>, which is traditionally not amenable to efficient parallelization, and we show that with the right algorithmic design speedups seen in fields such as sparse Linear Algebra are also possible here. This new GPU algorithm for <em>Domain Propagation</em> and computational experiments assessing the performance are presented in <a href="https://arxiv.org/abs/2009.07785">our paper</a>; the code is available on our <a href="https://github.com/Sofranac-Boro/gpu-domain-propagator"><em>GitHub</em> page</a>.
</p>
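<p>To illustrate what domain propagation of a linear constraint computes, here is a minimal, sequential Python sketch of activity-based bound tightening for a single constraint $\sum_j a_j x_j \leq b$ (an illustration of the textbook technique, not the paper’s GPU code):</p>

```python
def propagate_constraint(a, lb, ub, b):
    """Tighten variable bounds for a single linear constraint sum_j a[j]*x[j] <= b.

    Returns new (lb, ub) lists; classic 'activity'-based bound tightening.
    """
    # minimal activity: every variable at the bound that minimizes its term
    min_act = sum(aj * (lb[j] if aj >= 0 else ub[j]) for j, aj in enumerate(a))
    new_lb, new_ub = list(lb), list(ub)
    for j, aj in enumerate(a):
        if aj == 0:
            continue
        # minimal activity of the constraint without variable j's contribution
        resid = min_act - aj * (lb[j] if aj >= 0 else ub[j])
        if aj > 0:
            new_ub[j] = min(ub[j], (b - resid) / aj)  # x_j <= (b - resid) / a_j
        else:
            new_lb[j] = max(lb[j], (b - resid) / aj)  # x_j >= (b - resid) / a_j
    return new_lb, new_ub
```

<p>The GPU challenge addressed in the paper is precisely that this computation has to be carried out over many sparse, irregular constraints at once.</p>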
<h2 id="sneak-peek-at-the-results">Sneak peek at the results</h2>
<p>
We use the <em>de-facto</em> standard <a href="https://miplib.zib.de/">MIPLIB2017</a> test set for MIPs to conduct our experiments. To better capture the response of our algorithm to the size of instances, we subdivided this set into 8 subsets with instances of increasing size, dubbed Set-1 to Set-8. For comparison, we use four algorithms:</p>
<ol>
<li><strong>cpu_seq</strong> is a sequential implementation of domain propagation which closely follows implementations in state-of-the-art solvers.</li>
<li><strong>cpu_omp</strong> is a shared-memory parallel version of domain propagation which runs on the CPU.</li>
<li><strong>gpu_atomic</strong> is our GPU implementation with atomic operations.</li>
<li><strong>gpu_reduction</strong> is our GPU implementation which avoids using atomic operations by using reductions in global memory.</li>
</ol>
<p>The algorithms are tested on the following hardware:</p>
<ol>
<li><strong>V100</strong> NVIDIA Tesla V100 PCIe 32GB (GPU)</li>
<li><strong>RTX</strong> NVIDIA Titan RTX 24GB (GPU)</li>
<li><strong>P400</strong> NVIDIA Quadro P400 2GB (GPU)</li>
<li><strong>amdtr</strong> 64-core AMD Ryzen Threadripper 3990X @ 3.30 GHz with 128 GB RAM (CPU)</li>
<li><strong>xeon</strong> 24-core Intel Xeon Gold 6246 @ 3.30GHz with 384 GB RAM (CPU)</li>
</ol>
<p>As baseline, we choose the execution of the <em>cpu_seq</em> algorithm on the <em>xeon</em> machine. As the metric for comparison, we report speedups of all other executions over this base case.</p>
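<p>For reference, the aggregation metric, the geometric mean of per-instance speedups, can be computed as follows (a generic illustration; the function name is ours):</p>

```python
import math

def geometric_mean_speedup(baseline_times, times):
    """Geometric mean of per-instance speedups t_baseline / t, the
    aggregation used when summarizing a test subset."""
    ratios = [tb / t for tb, t in zip(baseline_times, times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

<p>Unlike the arithmetic mean, this aggregation is not dominated by a few instances with extreme speedups.</p>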
<p>Figure 1-a shows the geometric mean of speedups of the four algorithms over the test subsets. Figure 1-b shows the distribution of speedups of the four algorithms over all the instances in the test set, sorted in ascending order by speedup.
</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/GPUprop/speedups-1.png" alt="img1" style="float:center; margin-right: 1%; width:100%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> a) Geometric means of speedups b) speedup distributions in ascending order.</p>
<p>We can see that the GPU algorithms on the high-end hardware <em>V100</em> and <em>RTX</em> drastically outperform the sequential and shared-memory executions. Additionally, the GPU algorithm with atomic operations outperforms the reduction version in all executions. The fastest combination is the <em>gpu_atomic</em> algorithm on <em>V100</em>. Its mean speedup is always greater than 1.6 and goes up to 46.9 in Set-8, following a roughly linear trend. Over the entire test set, the mean speedup is 6.1. For the top 5% of the instances, this execution achieves a speedup of at least 62.9 times. As we can see in Figure 1-b, speedups as high as 195x are possible.</p>
<p>The low-end, consumer-grade <em>P400</em> GPU, often found in home desktops, is also evaluated in the tests. It is evident from the plots that it cannot keep up with the two high-end GPUs. However, we can see that <em>gpu_atomic</em> running on <em>P400</em> is still competitive with the CPU implementation for about half the instances, where it achieves a small speedup over the benchmark CPU implementation. This result is interesting considering that GPUs are currently a resource that is not used by MIP solvers at all, opening up the possibility for using GPUs as co-processors for MIP solvers even on standard desktop machines.</p>
<p>Looking at the shared-memory parallel <em>cpu_omp</em> algorithm, we can see that it, too, is drastically outperformed by the GPU implementations on <em>V100</em> and <em>RTX</em>. Comparing it to the sequential base case, it underperforms in about half of the instances. The parallelism found in the domain propagation algorithm is relatively fine-grained, with low arithmetic intensity in the parallel units of work. This does not bode well for shared-memory parallelization on the CPU, where managing CPU threads is comparatively expensive, explaining why current state-of-the-art implementations of Domain Propagation are usually single-threaded.
</p>
<h2 id="conclusions">Conclusions</h2>
<p>In conclusion, the domain propagation algorithm on the GPU achieves ample speedups over its CPU counterparts, over the majority of practically relevant instances. While interesting in its own right, this result does not tell the whole story! The <em>throughput-based</em> GPU programming model differs significantly from the <em>latency-based</em> sequential model, and comparing the runtimes of parts of the solving process alone might not do justice to the GPU paradigm. Our algorithm runs entirely on the GPU, without the need for synchronization with the CPU, which paves the way to embedding this algorithm in future GPU-based MIP solvers/methods. Put differently, the massive amounts of parallelism on the GPU bring about a different paradigm, with the potential to achieve more than speeding up parts of an otherwise sequential workflow.</p>Boro SofranacTL;DR: This is an informal discussion of our recent paper Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices by Boro Sofranac, Ambros Gleixner, and Sebastian Pokutta. In the paper, we present a new algorithm to perform domain propagation of linear constraints on GPUs efficiently. The results show that efficient implementations of Mixed-integer Programming (MIP) methods are possible on GPUs, even though the success of using GPUs in MIPs has traditionally been limited. Our algorithm is capable of performing domain propagation on the GPU exclusively, without the need for synchronization with the CPU, paving the way for the usage of this algorithm in a new generation of MIP methods that run on GPUs.Join CO@Work and EWG-POR – online and for free!2020-08-30T07:00:00+02:002020-08-30T07:00:00+02:00http://www.pokutta.com/blog/news/2020/08/30/coatwork<p><em>TL;DR: Announcement for CO@WORK and EWG-POR. Fully online and participation is free.</em>
<!--more--></p>
<p><em>Written by Timo Berthold.</em></p>
<p>This year, a lot of scientific conferences and workshops had to be cancelled or postponed – many others took the unprecedented situation as a chance to experiment with new, exciting formats. It is great to see how virtual conferences make the latest research insights available to many more people than traditional on-site events do.</p>
<p>Researchers from <a href="https://www.zib.de/">ZIB</a> and its partners from the <a href="http://forschungscampus-modal.de/">Research Campus Modal</a> will join this latest trend towards online workshops and host two exciting meetings in September, one two-week summer school mainly targeting PhD students and one two-day meeting, addressing practitioners from all industries using Operations Research.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/Picture1.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<h2 id="combinatorial-optimization-at-work">Combinatorial Optimization At Work</h2>
<p><a href="http://co-at-work.zib.de/">CO@Work</a> is an institution. This two-week summer school takes place only every five years and has always brought together researchers, practitioners, and students from all over the world. It addresses PhD students and post-docs interested in the use of mathematical optimization in concrete practical applications. This year’s theme is “Algorithmic Intelligence in Practice”. This amazing event features more than 30 distinguished lecturers from all over the world, including developers and managers of FICO, Google, SAP, Siemens, SAS, Gurobi, Mosek, GAMS, NTT Data, Litic, as well as leading scientists from TU Berlin, FU Berlin, Polytechnique Montréal, RWTH Aachen, the Chinese Academy of Science, University of Southern California, University of Edinburgh, Sabancı Üniversitesi, Escuela Politécnica Nacional Quito, TU Darmstadt, University of Exeter, and many more.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/2015-coatwork.png" alt="img1" style="float:center; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap">CO@Work 2015 participants in front of ZIB.</p>
<p>All lectures will be made available on <a href="https://www.youtube.com/channel/UCphLz_BXrOAInHozAlTsigA">YouTube</a>, and there will be two Q&A sessions for each presentation – typically 11 hours apart, to cover all time zones worldwide. Similarly, there will be two practical exercise sessions each day, with hands-on experience on implementing optimization projects through the python interfaces of <a href="https://www.fico.com/en/products/fico-xpress-optimization">FICO Xpress</a> and <a href="https://www.scipopt.org/">SCIP</a>. Q&A and exercises will be hosted on Zoom. CO@Work will take place every weekday from September 14 to September 25.
Check out the meeting homepage and registration <a href="http://co-at-work.zib.de/">here</a>. Hurry up, registration closes on September 6!</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/ewg.png" alt="img1" style="float:center; width:80%" />
<p style="clear: both;"></p>
</div>
<h2 id="practice-of-operations-research--euro-working-group-meeting">Practice of Operations Research – EURO Working Group meeting</h2>
<p>The <a href="https://www.eventbrite.co.uk/e/challenges-in-the-deployment-of-or-projects-tickets-62398252854">EWG-POR virtual conference</a> is co-hosted by <a href="https://www.zib.de/">ZIB</a> and <a href="https://www.fico.com/en/products/fico-xpress-optimization">FICO</a>. It focusses on an issue that many of us have faced: “Challenges in the deployment of OR projects”. The conference features five keynote lecturers from various industries highlighting how difficulties with the implementation of optimization projects could be overcome. Adrian Zymolka from Axioma will speak about “<strong>Optimization in Finance</strong>”, Colin Silvester from Uniper will present on “<strong>Delivering OR Solutions for Everyday Operations in Energy Trading</strong>”, Steffen Klosterhalfen from BASF will talk about “<strong>Successful Value Chain Optimization at a Chemical Company</strong>”, Ralf Werner from OGE will report on “<strong>Operations Research supporting Germany’s energy transition</strong>”, and Baris Cem Sal from DHL will let us know how they were “<strong>Putting Operations Research into Operations in Deutsche Post DHL Group transition</strong>”.</p>
<p>Furthermore, there will be special 45-minute interactive discussion group sessions. All participants are invited to contribute to one of the four rounds on “Change management issues in practical projects”, “Promotion of OR”, “How to state requirements and project specifications at the beginning” and “Relationship/collaboration between academia and industry”. We are curious to see what comes out of these roundtables.</p>
<p>The event will be completed by a wonderful social entertainment session – join to find out more…</p>
<p>The EWG-POR main event will take place on the 28th and 29th of September. The following five Mondays (Oct 5 – Nov 2), there will be a one hour Webinar series, each with two contributed talks on the topic of “Challenges in the deployment of OR projects”.</p>
<p>Check out the <a href="https://www.euro-online.org/websites/or-in-practice/event/euro-working-group-practice-of-or-meeting-2020/">meeting homepage</a> and <a href="https://fico.zoom.us/webinar/register/9815939755759/WN_gb2az8OJR8u9y5FPl51RaA">register now</a> here.</p>Timo BertholdTL;DR: Announcement for CO@WORK and EWG-POR. Fully online and participation is free.Projection-Free Optimization on Uniformly Convex Sets2020-07-27T01:00:00+02:002020-07-27T01:00:00+02:00http://www.pokutta.com/blog/research/2020/07/27/uniform-convexity-fw<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2004.11053">Projection-Free Optimization on Uniformly Convex Sets</a> by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux.</em></p>
<p>We consider the following constrained optimization problem</p>
\[\underset{x\in\mathcal{C}}{\text{argmin}} f(x),\]
<p>where $f$ is a $L$-smooth convex function and $\mathcal{C}$ is a compact convex set in a Hilbert space. Frank-Wolfe algorithms form a classical family of first-order iterative methods for solving such problems. Each iteration requires in the worst case the solution of a linear minimization problem over the original feasible domain, a subset of the domain, or a reasonable modification of the domain, <em>i.e.</em> a change that does not implicitly amount to a proximal operation.</p>
<p>The understanding of the convergence rates of Frank-Wolfe algorithms in a variety of settings is an active field of research. For smooth constrained problems, the vanilla Frank-Wolfe algorithm (FW) enjoys a tight sublinear convergence rate of $\mathcal{O}(1/T)$ (see e.g., [J] for an in-depth discussion). There are known accelerated convergence regimes, as a function of the feasible region, only when $\mathcal{C}$ is a polytope or a strongly convex set.</p>
<p class="mathcol"><strong>Frank-Wolfe Algorithm</strong> <br />
<em>Input:</em> Start with $x_0 \in \mathcal{C}$, $L$ upper bound on the Lipschitz constant. <br />
<em>Output:</em> Sequence of iterates $x_t$ <br />
For $t=0, 1, \ldots, T $ do: <br />
$\qquad v_t \leftarrow \underset{v\in\mathcal{C}}{\text{argmax }} \langle -\nabla f(x_t); v - x_t\rangle$ $\qquad \triangleright $ FW vertex <br />
$\qquad \gamma_t \leftarrow \underset{\gamma\in[0,1]}{\text{argmin }} \gamma \langle \nabla f(x_t); v_t - x_t \rangle + \frac{\gamma^2}{2} L \norm{v_t - x_t}^2$ $\qquad \triangleright$ Short step <br />
$\qquad x_{t+1} \leftarrow (1 - \gamma_t)x_{t} + \gamma_t v_{t}$ $\qquad \triangleright$ Convex update</p>
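<p>The pseudocode above can be sketched in a few lines of Python. We use the closed-form linear minimization oracle over an $\ell_p$ ball implied by Hölder’s inequality; this is an illustrative implementation under the stated assumptions (an $L$-smooth objective and an exact LMO), not the code used for the paper’s experiments:</p>

```python
import numpy as np

def lmo_lp_ball(g, p, r=1.0):
    """Linear minimization oracle over the l_p ball of radius r (p > 1):
    argmin_{||v||_p <= r} <g, v>, in closed form via Hoelder's inequality."""
    q = p / (p - 1.0)                       # dual exponent, 1/p + 1/q = 1
    abs_g = np.abs(g)
    scale = np.sum(abs_g ** q) ** ((q - 1.0) / q)
    return -r * np.sign(g) * abs_g ** (q - 1.0) / scale

def frank_wolfe(grad, lmo, x0, L, T=200):
    """Vanilla Frank-Wolfe with the short-step rule from the pseudocode."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        g = grad(x)
        v = lmo(g)                          # FW vertex
        d = v - x
        gap = -g @ d                        # Frank-Wolfe gap
        if gap <= 1e-12:                    # (near-)optimal, stop early
            break
        gamma = min(1.0, gap / (L * (d @ d)))  # short step
        x = x + gamma * d                   # convex update
    return x
```

<p>For instance, minimizing $f(x) = \frac{1}{2}\norm{x - b}^2$ with $b = (3,4)$ over the unit $\ell_2$ ball drives the iterates to $b/\norm{b} = (0.6, 0.8)$.</p>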
<p>When $\mathcal{C}$ is a strongly convex set and \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), FW enjoys linear convergence rates [P, DR]. Recently, [GH] showed that the Frank-Wolfe algorithm converges in \(\mathcal{O}(1/T^2)\) when the objective function is strongly convex, without restriction on the position of the optimum $x^\esx$. Importantly, the conditioning of this sub-linear rate does not depend on the distance of the constrained optimum $x^\esx$ either from the boundary or the unconstrained optimum, as is the case in [P, DR], or [GM] when the optimum $x^\esx$ is in the interior of $\mathcal{C}$. Note that all these analyses require short steps as the step-size rule (or even stronger conditions such as line-search).</p>
<p>Finally, when $\mathcal{C}$ is a polytope and $f$ is strongly convex, <em>corrective</em> variants of Frank-Wolfe were recently shown to enjoy linear convergence rates, see [LJ]. No accelerated convergence rates are known for constraint sets that are neither polytopes nor strongly convex sets. We show here that uniformly convex sets, which non-trivially subsume strongly convex sets, systematically enjoy accelerated convergence rate in the respective settings of [P,DR], [GH], and [D].</p>
<h3 id="uniformly-convex-sets">Uniformly Convex Sets</h3>
<p>A closed set $\mathcal{C}\subset\mathbb{R}^d$ is $(\alpha, q)$-uniformly convex with respect to a norm $\norm{\cdot}$, if for any $x,y\in\mathcal{C}$, any $\eta\in[0,1]$, and any $z\in\mathbb{R}^d$ with $\norm{z} = 1$, we have</p>
\[\tag{1}
\eta x + (1-\eta)y + \eta (1 - \eta ) \alpha ||x-y||^q z \in \mathcal{C}.\]
<p>At a high-level, this property is a global quantification of the set curvature that subsumes strong convexity. There exist other equivalent definitions, see <em>e.g.</em> [GI] in the strongly convex case. In finite-dimensional spaces, the $\ell_p$ balls form classical and important examples of uniformly convex sets. For $0 < p < 1$, the $\ell_p$ balls are non-convex. For $p\in ]1,2]$, the $\ell_p$ balls are strongly convex (or $(\alpha,2)$-uniformly convex), while for $p>2$ they are $p$-uniformly convex but not strongly convex. The $p$-Schatten norms with $p>1$ or various group norms are also typical examples.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/uc-fw/list_ball.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Examples of $\ell_p$ balls.</p>
<p>Besides the quantification in (1), uniform convexity is a very classical notion in the study of normed spaces. Indeed, it allows one to refine the convex characterization of these spaces’ unit balls, leading to a plethora of interesting properties. For instance, the various uniform convexity types of Banach spaces have consequences, notably in learning theory [DDS], online learning [ST], or concentration inequalities [JN]. Here, we show that the Frank-Wolfe algorithm accelerates as a function of the uniform convexity of $\mathcal{C}$.</p>
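<p>This curvature scaling can also be observed numerically. The following sketch (ours, purely illustrative) estimates, for the unit $\ell_p$ ball in $\mathbb{R}^2$, how the midpoint deficiency $1 - \norm{(x+y)/2}_p$ of a chord scales with the chord length near the boundary point $(1,0)$; the fitted exponent is close to $p$ for $p \geq 2$, consistent with $q$-uniform convexity for $q = p$:</p>

```python
import numpy as np

def boundary_point(t, p):
    """A point on the boundary of the unit l_p ball in R^2 (p >= 1)."""
    return np.array([np.sign(np.cos(t)) * np.abs(np.cos(t)) ** (2.0 / p),
                     np.sign(np.sin(t)) * np.abs(np.sin(t)) ** (2.0 / p)])

def chord_and_deficiency(p, h):
    """Chord between boundary points at angles +h and -h (near the point (1, 0),
    the 'flattest' boundary point when p > 2).  Returns the chord length eps
    and the midpoint deficiency 1 - ||(x + y)/2||_p."""
    x, y = boundary_point(h, p), boundary_point(-h, p)
    eps = np.sum(np.abs(x - y) ** p) ** (1.0 / p)
    mid_norm = np.sum(np.abs((x + y) / 2.0) ** p) ** (1.0 / p)
    return eps, 1.0 - mid_norm

def estimated_exponent(p, h1=0.1, h2=0.01):
    """Fit q in deficiency ~ const * eps**q from two chord scales."""
    e1, d1 = chord_and_deficiency(p, h1)
    e2, d2 = chord_and_deficiency(p, h2)
    return np.log(d1 / d2) / np.log(e1 / e2)
```

<p>For $p = 2$ the estimated exponent is close to $2$ (strong convexity), while for $p = 4$ it is close to $4$.</p>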
<h3 id="convergence-analysis-for-frank-wolfe-with-uniformly-convex-sets">Convergence analysis for Frank-Wolfe with Uniformly Convex Sets</h3>
<h4 id="proof-sketch">Proof Sketch</h4>
<p>At iteration $t$, we have $x_{t+1} = x_t + \gamma_t (v_t - x_t)$ with $\gamma_t\in[0,1]$ chosen to optimize the quadratic upper-bound on $f$ implied by $L$-smoothness. Classically then, we obtain that for any $\gamma\in[0,1]$</p>
\[\tag{2}
f(x_{t+1}) - f(x^\esx) \leq f(x_t) - f(x^\esx) - \gamma \langle - \nabla f(x_t); v_t - x_t\rangle + \frac{\gamma^2}{2} L \norm{x_t - v_t}^2.\]
<p>The Frank-Wolfe gap $g_t = \langle - \nabla f(x_t); v_t - x_t\rangle\geq 0$ contributes to the primal decrease, counter-balanced by the right-hand term. Informally then, uniform convexity of the set ensures that the distance of the iterate to the Frank-Wolfe vertex $\norm{v_t - x_t}$ shrinks to zero at a specific rate that depends on $g_t$. The schema below illustrates why this is generally not the case when $\mathcal{C}$ is a polytope, and how various types of <em>curvature</em> influence the shrinking of $\norm{v_t - x_t}$ to zero.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/uc-fw/uniform_convexity_assumption_cropped.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> The uniform convexity assumption.</p>
<h4 id="scaling-inequalities">Scaling Inequalities</h4>
<p>The uniform convexity parameters quantify different trade-offs between the convergence to zero of $g_t$ and $\norm{v_t - x_t}$. In particular, $(\alpha, q)$-uniform convexity of $\mathcal{C}$ implies scaling inequalities of the form
\(\langle -\nabla f(x_t); v_t - x_t\rangle \geq \alpha/2 \norm{\nabla f(x_t)}_\esx \norm{v_t - x_t}^q,\)
where $\norm{\cdot}_\esx$ stands for the dual norm of $\norm{\cdot}$. Plugging this scaling inequality into (2) is then the basis for the various convergence results.</p>
<h4 id="convergence-results-with-global-uniform-convexity">Convergence results with global uniform convexity</h4>
<p>Assuming that $\mathcal{C}$ is $(\alpha,q)$-uniformly convex and $f$ is a strongly convex and $L$-smooth function, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-1/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the $\mathcal{O}(1/T^2)$ of [GH]. Note that we further generalize these results by relaxing the strong convexity of $f$ with $(\mu, \theta)$-Hölderian Error Bounds and the rates become $\mathcal{O}(1/T^{1/(1-2\theta/q)})$. For more details, see Theorem 2.10. of our paper or [KDP] for general error bounds in the context of the Frank-Wolfe algorithm.</p>
<p>Similarly, assuming \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), when $\mathcal{C}$ is $(\alpha,q)$-uniformly convex with $q>2$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [P, DR].</p>
<p>These two convergence regimes depend on the global uniform convexity parameters of the set $\mathcal{C}$. However, some sets, such as the $\ell_3$ balls, seem to exhibit varying degrees of curvature depending on the position of $x^\esx$ on the boundary $\partial\mathcal{C}$.</p>
<h4 id="a-simple-numerical-experiment">A simple numerical experiment</h4>
<p>We now numerically observe the convergence of Frank-Wolfe where $f$ is a quadratic and the feasible regions are $\ell_p$ balls with varying $p$. We provide two different plots where we vary the position of the optimal solution $x^\esx$. In both cases we use short steps.</p>
<p>In the right figure, $x^\esx$ is chosen near the intersection of the $\ell_p$-balls and the half-line generated by $\sum_{i=1}^{d}e_i$, where $(e_i)$ is the canonical basis. Informally, this corresponds to the <em>curvy</em> areas of the $\ell_p$ balls. In the left figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by one of the $(e_i)$, which corresponds to a <em>flat</em> area for large values of $p$.</p>
<p>We observe that when the optimum is near a <em>curvy</em> area, the convergence rates are asymptotically linear even for $\ell_p$ balls that are not strongly convex.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Optimum $x^\esx$ in flat area</th>
<th style="text-align: center">Optimum $x^\esx$ in curvy area</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="http://www.pokutta.com/blog/assets/uc-fw/500_iter_20000_ls_exact_bad_opt.jpg" alt="Explicative figure" /></td>
<td style="text-align: center"><img src="http://www.pokutta.com/blog/assets/uc-fw/500_iter_20000_ls_exact_good_opt.jpg" alt="Explicative figure" /></td>
</tr>
</tbody>
</table>
<p>This suggests that the <em>local behavior</em> of $\mathcal{C}$ around $x^\esx$ might even better explain the convergence rates. Providing such an analysis would be in line with [D] which proves linear convergence rates assuming only local strong convexity of $\mathcal{C}$ around $x^\esx$. We extend this result to a localized notion of uniform convexity.</p>
<h3 id="frank-wolfe-analysis-with-localized-uniform-convexity">Frank-Wolfe Analysis with Localized Uniform Convexity</h3>
<p>When $\mathcal{C}$ is not globally uniformly convex, the scaling inequality does not necessarily hold anymore. Instead, we assume the following localized version around $x^\esx$:</p>
\[\tag{3}
\langle -\nabla f(x^\esx); x^\esx - x\rangle \geq \alpha/2 \norm{\nabla f(x^\esx)}_\esx \norm{x^\esx - x}^q.\]
<p>A localized definition of uniform convexity in the form of (1) applied at $x^\esx \in \partial \mathcal{C}$ indeed implies the <em>local scaling inequality</em> (3). However, the local scaling inequality (3) holds in more general situations, <em>e.g.</em> in the case of the strong convexity analog for the local moduli of rotundity in [GI]. This condition was already identified by Dunn [D], yet without any convergence analysis as soon as $q>2$. In another blog post, we will delve into more details on the generality of (3) and related assumptions in optimization.</p>
<p>Assuming \(\text{inf}_{x\in\mathcal{C}} \norm{\nabla f(x)}_\esx > 0\) and a local scaling inequality at $x^\esx$ with parameters $(\alpha, q)$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ with $q>2$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [D].</p>
<p>Note that the sublinear rate obtained via the local scaling inequality is strictly worse than the one obtained via the global scaling inequality with the same $(\alpha, q)$ parameters. The sublinear rate $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ remains, however, always better than $\mathcal{O}(1/T)$ when $q>2$.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Our results show that in all regimes (for smooth constrained problems) where the strong convexity of the constraint set is known to accelerate Frank-Wolfe algorithms (see [P, DR], [D] or [GH]), the uniform convexity of the set leads to accelerated convergence rates with respect to $\mathcal{O}(1/T)$ as well.</p>
<p>We also show that the local uniform convexity of $\mathcal{C}$ around $x^\esx$ already induces accelerated convergence rates. In particular, this acceleration is achieved with the vanilla Frank-Wolfe algorithm which does not require any knowledge about these underlying structural assumptions and their respective parameters. As such, our results further shed light on the adaptive properties of Frank-Wolfe type of algorithms. For instance, see also [KDP] for adaptivity with respect to error bound conditions on the objective function, or [J, LJ] for affine invariant analyses of Frank-Wolfe algorithms when the constraint set are polytopes.</p>
<h3 id="references">References</h3>
<p>[D] Dunn, Joseph C. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17.2 (1979): 187-211. <a href="https://epubs.siam.org/doi/pdf/10.1137/0324071?casa_token=mV4qkf9aLskAAAAA:--jyeKNCSwAH5fejuzgJr1im_OXPyesfgPOU1fk-cfmBYZjTdrRSAHHfZEWjQRUaYSI0vNPB7NwY">pdf</a></p>
<p>[DR] Demyanov, V. F. ; Rubinov, A. M. Approximate methods in optimization problems. Modern Analytic and Computational Methods in Science and Mathematics, 1970.</p>
<p>[DDS] Donahue, M. J.; Darken, C.; Gurvits, L.; Sontag, E. (1997). Rates of convex approximation in non-Hilbert spaces. <em>Constructive Approximation</em>, <em>13</em>(2), 187-220.</p>
<p>[GH] Garber, D.; Hazan, E. Faster rates for the frank-wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015. <a href="https://arxiv.org/abs/1406.1305">pdf</a></p>
<p>[GI] Goncharov, V. V.; Ivanov, G. E. Strong and weak convexity of closed sets in a hilbert space. In Operations research engineering, and cyber security, pages 259–297. Springer, 2017.</p>
<p>[GM] Guélat, J.; Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119.</p>
<p>[HL] Huang, R.; Lattimore, T.; György, A.; Szepesvári, C. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. In Advances in Neural Information Processing Systems, pages 4970–4978, 2016. <a href="https://arxiv.org/abs/1702.03040">pdf</a></p>
<p>[J] Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th international conference on machine learning, ICML 2013. p. 427-435. <a href="http://www.jmlr.org/proceedings/papers/v28/jaggi13.pdf">pdf</a></p>
<p>[JN] Juditsky, A.; Nemirovski, A.S. Large deviations of vector-valued martingales in 2-smooth normed spaces. <a href="https://arxiv.org/abs/0809.0813">pdf</a></p>
<p>[KDP] Kerdreux, T.; d’Aspremont, A.; Pokutta, S. 2019. Restarting Frank-Wolfe. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1275-1283). <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>
<p>[LJ] Lacoste-Julien, S. ; Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In : Advances in neural information processing systems. 2015. p. 496-504. <a href="https://infoscience.epfl.ch/record/229239/files/nips15_paper_sup_camera_ready.pdf">pdf</a></p>
<p>[P] Polyak, B. T. Existence theorems and convergence of minimizing sequences for extremal problems with constraints. In Doklady Akademii Nauk, volume 166, pages 287–290. Russian Academy of Sciences, 1966.</p>
<p>[ST] Sridharan, K.; Tewari, A. (2010, June). Convex Games in Banach Spaces. In <em>COLT</em> (pp. 1-13). <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.304.5992&rep=rep1&type=pdf">pdf</a></p>Thomas KerdreuxTL;DR: This is an informal summary of our recent paper Projection-Free Optimization on Uniformly Convex Sets by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.Second-order Conditional Gradient Sliding2020-06-20T07:00:00+02:002020-06-20T07:00:00+02:00http://www.pokutta.com/blog/research/2020/06/20/socgs<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.08907">Second-order Conditional Gradient Sliding</a> by <a href="https://alejandro-carderera.github.io/">Alejandro Carderera</a> and <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, where we present a second-order analog of the Conditional Gradient Sliding algorithm [LZ] for smooth and strongly-convex minimization problems over polytopes. The algorithm combines Inexact Projected Variable-Metric (PVM) steps with independent Away-step Conditional Gradient (ACG) steps to achieve global linear convergence and local quadratic convergence in primal gap. 
The resulting algorithm outperforms other projection-free algorithms in applications where first-order information is costly to compute.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Consider a problem of the form:
\[\tag{minProblem}
\begin{align}
\label{eq:minimizationProblem}
\min\limits_{x \in \mathcal{X}} f(x),
\end{align}\]
where $\mathcal{X}$ is a polytope and $f(x)$ is a twice differentiable function that is strongly convex and smooth. We assume that solving an LP over $\mathcal{X}$ is easy, but projecting using the Euclidean norm (or any other norm) onto $\mathcal{X}$ is expensive. Moreover, we also assume that evaluating $f(x)$ is expensive, and so is computing the gradient and the Hessian of $f(x)$. An example of such an objective function can be found when solving an MLE problem to estimate the parameters of a Gaussian distribution modeled as a sparse undirected graph [BEA] (also known as the Graphical Lasso problem). Another example is the objective function used in logistic regression problems when the number of samples is high.</p>
<h2 id="projected-variable-metric-algorithms">Projected Variable-Metric algorithms</h2>
<p>Working with such unwieldy functions is often too expensive, and so a popular approach to tackling (minProblem) is to construct an approximation to the original function whose gradients are easier to compute. A linear approximation of $f(x)$ at $x_k$ using only first-order information will not contain any curvature information, giving us little to work with. Consider, on the other hand, a quadratic approximation of $f(x)$ at $x_k$, denoted by $\hat{f_k}(x)$, that is:</p>
\[\tag{quadApprox}
\begin{align}
\label{eq:quadApprox}
\hat{f_k}(x) = f(x_k) + \left\langle \nabla f(x_k), x - x_k \right\rangle + \frac{1}{2} \norm{x - x_k}_{H_k}^2,
\end{align}\]
<p>where $H_k$ is a positive definite matrix that approximates the Hessian $\nabla^2 f(x_k)$. Algorithms that minimize the quadratic approximation $\hat{f}_k(x)$ over $\mathcal{X}$ at each iteration and set</p>
\[x_{k+1} = x_k + \gamma_k (\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) - x_k)\]
<p>for some \(\gamma_k \in [0,1]\) are dubbed <em>Projected Variable-Metric</em> (PVM) algorithms. These algorithms are useful when the progress per unit time obtained by moving towards the minimizer of $\hat{f}_k(x)$ over $\mathcal{X}$ at each time step is greater than the progress per unit time obtained by taking a step of any other first-order algorithm that makes use of the original function (whose gradients are very expensive to compute). We define the scaled projection of $x$ onto $\mathcal{X}$ when we measure the distance in the $H$-norm as \(\Pi_{\mathcal{X}}^{H} (y) \stackrel{\mathrm{\scriptscriptstyle def}}{=} \text{argmin}_{x\in\mathcal{X}} \norm{x - y}_{H}\). This allows us to interpret the steps taken by PVM algorithms as:</p>
\[\tag{stepPVM}
\begin{align}
\label{eq:stepPVM}
\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) = \Pi_{\mathcal{X}}^{H_k} \left( x_k - H_k^{-1} \nabla f(x_k) \right).
\end{align}\]
<p>These algorithms owe their name to this interpretation: at each iteration, as $H_k$ varies, we change the metric (the norm) with which we perform the scaled-projections, and we deform the negative of the gradient using this metric. The next image gives a schematic overview of a step of the PVM algorithm. The polytope $\mathcal{X}$ is depicted with solid black lines, the contour lines of the original objective function $f(x)$ are depicted with solid blue lines, and the contour lines of the quadratic approximation \(\hat{f}_k(x)\) are depicted with dashed red lines. Note that \(x_k - H_k^{-1}\nabla f(x_k)\) is the unconstrained minimizer of the quadratic approximation \(\hat{f}_k(x)\). The iterate used in the PVM algorithm to define the directions along which we move, i.e., \(\text{argmin}_{x\in \mathcal{X}} \hat{f}_k(x)\), is simply the scaled projection of that unconstrained minimizer onto \(\mathcal{X}\) using the norm \(\norm{\cdot}_{H_k}\) defined by $H_k$.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/SchematicAlgorithm.png" alt="Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$." style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$.</p>
<p>Note that if we set $H_k = \nabla^2 f(x_k)$ the PVM algorithm is equivalent to the <em>Projected Newton</em> algorithm, and if we set $H_k = I^n$, where $I^n$ is the identity matrix, the algorithm is equal to the <em>Projected Gradient Descent</em> algorithm. Intuitively, when $H_k$ is a good approximation to the Hessian $\nabla^2 f(x_k)$ we can expect to make good progress when moving along these directions. In terms of convergence, the PVM algorithm has a <em>global</em> linear convergence rate in primal gap when using an exact line search [KSJ], although with a dependence on the condition number that is worse than that of Projected Gradient Descent or the <em>Away-step Conditional Gradient</em> (ACG) algorithms. Moreover, the algorithm has a <em>local</em> quadratic convergence rate with a unit step size when close to the optimum $x^\esx$ if the matrix $H_k$ becomes a better and better approximation to $\nabla^2 f(x_k)$ as we approach $x^\esx$ (which we also assume in our theoretical results).</p>
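<p>To make a single PVM step concrete, here is a minimal sketch (our own illustration, not code from the paper) for the special case where $\mathcal{X}$ is the box $[0,1]^n$ and $H_k$ is diagonal; in that case the scaled projection in (stepPVM) decouples per coordinate and reduces to clipping:</p>

```python
import numpy as np

def pvm_step_box(x, grad, h_diag, gamma=1.0):
    """One PVM step for minimizing f over the box [0,1]^n.

    Assumes a *diagonal* metric H_k (given by its diagonal h_diag), so the
    scaled projection onto the box is coordinate-wise clipping; a general
    polytope would require solving the quadratic subproblem, e.g., with ACG.
    """
    y = x - grad / h_diag           # unconstrained minimizer x_k - H_k^{-1} grad f(x_k)
    x_hat = np.clip(y, 0.0, 1.0)    # scaled projection onto [0,1]^n
    return x + gamma * (x_hat - x)  # x_{k+1} = x_k + gamma_k (argmin - x_k)
```

<p>With $H_k = I$ and $\gamma_k = 1$ this recovers a projected gradient step; e.g., for $f(x) = \frac{1}{2}\norm{x - c}^2$ a single step from any point lands on the target $c$ clipped to the box.</p>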
<h2 id="second-order-conditional-gradient-sliding-algorithm">Second-order Conditional Gradient Sliding algorithm</h2>
<p>Two questions arise:</p>
<ol>
<li>Can we achieve a global linear convergence rate on par with that of the Away-step Conditional Gradient algorithm?</li>
<li>Solving the problem shown in (stepPVM) to optimality is often too expensive. Can we solve the problem to some $\varepsilon_k$-optimality and keep the local quadratic convergence?</li>
</ol>
<p>The <em>Second-order Conditional Gradient Sliding</em> (SOCGS) algorithm is designed with these considerations in mind, providing global linear convergence in primal gap and local quadratic convergence in primal gap and distance to $x^\esx$. The algorithm couples an independent ACG step with line search with an Inexact PVM step with a unit step size. At the end of each iteration, we choose the step that provides the greatest primal progress. The independent ACG steps will ensure global linear convergence in primal gap, and the Inexact PVM steps will provide quadratic convergence. Moreover, the line search in the ACG step can be substituted with a step size strategy that requires knowledge of the $L$-smoothness parameter of $f(x)$ [PNAM].</p>
<p>We compute the PVM step inexactly using the (same) ACG algorithm with an exact line search, thereby making the SOCGS algorithm <em>projection-free</em>. As the function being minimized in the Inexact PVM steps is quadratic, there is a closed-form expression for the optimal step size. The scaled projection problem is solved to $\varepsilon_k$-optimality, using the Frank-Wolfe gap as a stopping criterion, as in the Conditional Gradient Sliding (CGS) algorithm [LZ]. The CGS algorithm uses the vanilla Conditional Gradient algorithm to find an approximate solution to the Euclidean projection problems that arise in <em>Nesterov’s Accelerated Gradient Descent</em> steps. In the SOCGS algorithm, we use the ACG algorithm to find an approximate solution to the scaled-projection problems that arise in PVM steps.</p>
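<p>To illustrate the inexact scaled projection, here is a minimal sketch (our own illustration, using the vanilla Frank-Wolfe algorithm rather than the ACG variant employed in the paper) that approximately minimizes a quadratic model over the probability simplex, using the closed-form exact line search available for quadratics and stopping once the Frank-Wolfe gap falls below a tolerance:</p>

```python
import numpy as np

def fw_scaled_projection(H, y, eps=1e-8, max_iter=10_000):
    """Approximately solve min_{x in simplex} 0.5*(x - y)^T H (x - y).

    Vanilla Frank-Wolfe with exact line search (closed form for quadratics);
    the Frank-Wolfe gap serves as the stopping criterion, mirroring the
    inexact PVM step of SOCGS.
    """
    n = len(y)
    x = np.ones(n) / n                  # start at the barycenter of the simplex
    for _ in range(max_iter):
        g = H @ (x - y)                 # gradient of the quadratic model
        v = np.zeros(n)
        v[np.argmin(g)] = 1.0           # LMO over the simplex: best vertex
        d = v - x
        gap = -g @ d                    # Frank-Wolfe gap <g, x - v> >= 0
        if gap <= eps:
            break
        curv = d @ H @ d
        gamma = 1.0 if curv <= 0.0 else min(1.0, gap / curv)
        x = x + gamma * d               # exact line-search step
    return x
```

<p>For instance, with $H = I$ this computes the Euclidean projection of $y$ onto the simplex.</p>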
<h3 id="accuracy-parameter-varepsilon_k">Accuracy Parameter $\varepsilon_k$.</h3>
<p>The accuracy parameter $\varepsilon_k$ in the SOCGS algorithm depends on a lower bound on the primal gap of (minProblem), which we denote by $lb\left( x_k \right)$ and which satisfies $lb\left( x_k \right) \leq f\left(x_k \right) - f\left(x^\esx \right)$.</p>
<p>In several machine learning applications, the value of $f(x^\esx)$ is known a priori, as is the case for the approximate Carathéodory problem (see the post <a href="/blog/research/2019/11/30/approxCara-abstract.html">Approximate Carathéodory via Frank-Wolfe</a>, where $f(x^\esx) = 0$). In other applications, estimating $f(x^\esx)$ is easier than estimating the strong convexity parameter (see [BTA] for an in-depth discussion). This allows for tight lower bounds on the primal gap in these cases.</p>
<p>If there is no easy way to estimate the value of $f(x^\esx)$, we can compute a lower bound on the primal gap at $x_k$ (bounded away from zero) using any CG variant that monotonically decreases the primal gap. It suffices to run an arbitrary number of steps $n \geq 1$ of the aforementioned variant to minimize $f(x)$ starting from $x_k$, resulting in $x_k^n$. Simply noting that $f(x_k^n) \geq f(x^\esx)$ allows us to conclude that $f(x_k) - f(x^\esx) \geq f(x_k) - f(x_k^n)$, and therefore a valid lower bound is $lb\left( x_k \right) = f(x_k) - f(x^n_k)$. The higher the number of CG steps performed from $x_k$, the tighter the resulting lower bound will be.</p>
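<p>This construction can be sketched in a few lines (our own illustration; here <code class="language-plaintext highlighter-rouge">cg_step</code> stands for one primal-monotone step of any CG variant):</p>

```python
def primal_gap_lower_bound(f, cg_step, x_k, n=5):
    """Valid lower bound on the primal gap f(x_k) - f(x*).

    f(x_k^n) >= f(x*) holds for any feasible iterate x_k^n, so the returned
    value never exceeds the true primal gap; running more primal-monotone
    CG steps (larger n) only tightens the bound.
    """
    x = x_k
    for _ in range(n):
        x = cg_step(x)  # one step of a CG variant that does not increase f
    return f(x_k) - f(x)
```

<p>Note that the bound is valid for any $n \geq 1$; the trade-off is purely between its tightness and the cost of the extra CG steps.</p>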
<h3 id="complexity-analysis">Complexity Analysis</h3>
<p>For the complexity analysis, we assume that we have at our disposal the tightest possible bound on the primal gap, which is $lb\left( x_k \right) = f(x_k) - f(x^\esx)$. A looser lower bound increases the number of linear minimization calls but does not increase the number of first-order or approximate Hessian oracle calls. As in the classical analysis of Projected Newton algorithms, after a finite number of iterations that is independent of the target accuracy $\varepsilon$ (iterations which in our case converge linearly in primal gap), the algorithm enters a regime of quadratic convergence in primal gap. Once in this phase the algorithm requires $\mathcal{O}\left( \log(1/\varepsilon) \log(\log 1/\varepsilon)\right)$ calls to a linear minimization oracle, and $\mathcal{O}\left( \log(\log 1/\varepsilon)\right)$ calls to a first-order and approximate Hessian oracle to reach an $\varepsilon$-optimal solution.</p>
<p>If we were to solve problem (minProblem) using the Away-step Conditional Gradient algorithm we would need $\mathcal{O}\left( \log(1/\varepsilon)\right)$ calls to a linear minimization and first-order oracle. Using the SOCGS algorithm makes sense if the linear minimization calls are not the computational bottleneck of the algorithm and the approximate Hessian oracle is about as expensive as the first-order oracle.</p>
<h3 id="computational-experiments">Computational Experiments</h3>
<p>We compare the performance of the SOCGS algorithm with that of other first-order projection-free algorithms in settings where computing first-order information is expensive (and computing Hessian information is just as expensive). We also compare the performance of our algorithm with the recent <em>Newton Conditional Gradient</em> algorithm [LCT] which minimizes a self-concordant function over a convex set by performing Inexact Newton steps (thereby requiring an exact Hessian oracle) using a Conditional Gradient algorithm to compute the scaled projections. After a finite number of iterations (independent of the target accuracy $\varepsilon$), the convergence rate of the NCG algorithm is linear in primal gap. Once inside this phase an $\varepsilon$-optimal solution is reached after $\mathcal{O}\left(\log 1/\varepsilon\right)$ exact Hessian and first-order oracle calls and $\mathcal{O}( 1/\varepsilon^{\nu})$ linear minimization oracle calls, where $\nu$ is a constant greater than one.</p>
<p>In the first experiment the Hessian information will be inexact (but subject to an asymptotic accuracy assumption), and so we will only compare to other first-order projection-free algorithms. In the second and third experiments, the Hessian oracle will be exact. For reference, the algorithms in the legend correspond to the vanilla Conditional Gradient (CG), the Away-step Conditional Gradient (ACG) [GM], the Lazy Away-step Conditional Gradient (ACG (L)) [BPZ], the Pairwise-step Conditional Gradient (PCG) [LJ], the Conditional Gradient Sliding (CGS) [LZ], the Stochastic Variance-Reduced Conditional Gradient (SVRCG) [HL], the Decomposition Invariant Conditional Gradient (DICG) [GM2] and the Newton Conditional Gradient (NCG) [LCT] algorithm. We also present an LBFGS version of SOCGS (SOCGS LBFGS). Note, however, that this variant, while performing well, does not formally satisfy our assumptions.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/Birkhoff_Experiments.png" alt="fig2" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Sparse coding over the Birkhoff polytope.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/GLassoPSD.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Inverse covariance estimation over the spectrahedron.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/LogReg.png" alt="fig4" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Structured logistic regression over $\ell_1$ unit ball.</p>
<h3 id="references">References</h3>
<p>[LZ] Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. In <em>SIAM Journal on Optimization</em> 26(2) (pp. 1379–1409). SIAM. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[BEA] Banerjee, O., & El Ghaoui, L. & d’Aspremont, A. (2008). Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. In <em>Journal of Machine Learning Research</em> 9 (2008) (pp. 485–516). JMLR. <a href="http://www.jmlr.org/papers/volume9/banerjee08a/banerjee08a.pdf">pdf</a></p>
<p>[KSJ] Karimireddy, S.P., & Stich, S.U. & Jaggi, M. (2018). Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. <em>arXiv preprint:1806.00413</em>. <a href="https://arxiv.org/pdf/1806.00413.pdf">pdf</a></p>
<p>[PNAM] Pedregosa, F., & Negiar, G. & Askari, A. & Jaggi, M. (2020). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. In <em>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</em>. <a href="http://proceedings.mlr.press/v108/pedregosa20a/pedregosa20a-supp.pdf">pdf</a></p>
<p>[BTA] Barré, M., & Taylor, A. & d’Aspremont, A. (2020). Complexity Guarantees for Polyak Steps with Momentum. <em>arXiv preprint:2002.00915</em>. <a href="https://arxiv.org/pdf/2002.00915.pdf">pdf</a></p>
<p>[LCT] Liu, D., & Cevher, V. & Tran-Dinh, Q. (2020). A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization. <em>arXiv preprint:2002.07003</em>. <a href="https://arxiv.org/pdf/2002.07003.pdf">pdf</a></p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. In <em>Mathematical Programming</em> 35(1) (pp. 110–119). Springer. <a href="http://www.iro.umontreal.ca/~marcotte/ARTIPS/1986_MP.pdf">pdf</a></p>
<p>[BPZ] Braun, G., & Pokutta, S. & Zink, D. (2017). Lazifying Conditional Gradient Algorithms. In <em>Proceedings of the 34th International Conference on Machine Learning</em>. <a href="http://proceedings.mlr.press/v70/braun17a/braun17a.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S. & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In <em>Advances in Neural Information Processing Systems</em> 2015 (pp. 496-504). <a href="https://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[HL] Hazan, E. & Luo, H. (2016). Variance-reduced and projection-free stochastic optimization. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>. <a href="https://arxiv.org/pdf/1602.02101.pdf">pdf</a></p>
<p>[GM2] Garber, D. & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In <em>Advances in Neural Information Processing Systems</em> 2016 (pp. 1001-1009). <a href="https://arxiv.org/pdf/1605.06492.pdf">pdf</a></p>
<h1 id="on-the-unreasonable-effectiveness-of-the-greedy-algorithm">On the unreasonable effectiveness of the greedy algorithm (2020-06-03)</h1>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.04063">On the Unreasonable Effectiveness of the Greedy Algorithm: Greedy Adapts to Sharpness</a> with <a href="https://www2.isye.gatech.edu/~msingh94/">Mohit Singh</a> and <a href="https://sites.google.com/view/atorrico">Alfredo Torrico</a>, where we adapt the sharpness concept from convex optimization to explain the effectiveness of the greedy algorithm for submodular function maximization.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>An important problem is the maximization of a non-negative monotone submodular set function $f: 2^V \rightarrow \RR_+$ subject to a cardinality constraint, i.e.,</p>
\[\tag{maxSub}
\max_{S \subseteq V, |S| \leq k} f(S).\]
<p>This problem naturally occurs in many contexts, such as, e.g., feature selection, sensor placement, and non-parametric learning. It is well known that in submodular function maximization with a single cardinality constraint we can compute a $(1-1/\mathrm{e})$-approximate solution by means of the greedy algorithm [NW], [NWF], while computing an exact solution is NP-hard. The greedy algorithm is extremely simple, selecting in each of its $k$ iterations the element with the largest <em>marginal gain</em> $\Delta_{S}(e) \doteq f(S \cup \setb{e}) - f(S)$:</p>
<p class="mathcol"><strong>Greedy Algorithm</strong> <br />
<em>Input:</em> Non-negative, monotone, submodular function $f$ and budget $k$ <br />
<em>Output:</em> Set $S_g \subseteq V$ <br />
$S_g \leftarrow \emptyset$ <br />
For $i = 1, \dots, k$ do: <br />
$\quad S_g \leftarrow S_g \cup \setb{\arg\max_{e \in V \setminus S_g} \Delta_{S_g}(e)}$<br /></p>
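<p>For illustration, here is a minimal runnable version of this greedy scheme (our own example, not from the paper) for the max-coverage function $f(S) = \lvert \bigcup_{e \in S} A_e \rvert$, which is non-negative, monotone, and submodular:</p>

```python
def greedy_max_coverage(sets, k):
    """Greedy maximization of the coverage function f(S) = |union of sets in S|.

    In each of the k iterations, pick the element with the largest marginal
    gain Delta_S(e) = f(S + e) - f(S), as in the pseudocode above.
    """
    chosen, covered = [], set()
    for _ in range(k):
        candidates = [e for e in range(len(sets)) if e not in chosen]
        best = max(candidates, key=lambda e: len(sets[e] - covered), default=None)
        if best is None or len(sets[best] - covered) == 0:
            break  # budget exhausted the ground set, or no element adds coverage
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

<p>Each iteration evaluates the marginal gain of every remaining element once, so the whole run needs $\mathcal{O}(k \lvert V \rvert)$ gain evaluations.</p>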
<p>Due to its simplicity and good real-world performance, the greedy algorithm is often the method of choice in large-scale tasks where more involved methods, such as, e.g., integer programming are computationally prohibitive. As mentioned before, the returned solution $S_g \subseteq V$ of the greedy algorithm satisfies [NW]</p>
\[f(S_g) \geq (1-1/\mathrm{e}) \ f(S^\esx),\]
<p>where $S^\esx \subseteq V$ is the optimal solution to problem (maxSub).</p>
<p>In practice, however, we often observe that the greedy algorithm performs much better than this conservative approximation guarantee suggests, and several concepts such as, e.g., <em>curvature</em> [CC] or <em>stability</em> [CRV] have been proposed as a means to explain the excess performance of the greedy algorithm beyond this worst-case bound. The reason one might be interested in this, beyond understanding greedy’s performance as a function of additional properties of $f$ (which is interesting in its own right), is that the problem instance of interest might be amenable to pre-processing in order to improve conditioning with respect to these additional structural properties and hence performance.</p>
<h2 id="our-results">Our results</h2>
<p>We focus on giving an alternative explanation for those instances in which the optimal solution clearly stands out over the rest of the feasible solutions. For this, we consider the concept of sharpness initially introduced in continuous optimization (see [BDL] and references contained therein) and we adapt it to submodular optimization. In convex optimization, roughly speaking, sharpness measures the behavior of the objective function around the set of optimal solutions and it translates to faster convergence rates. The way one should think about sharpness and similar parameters is as <em>data-dependent</em> quantities that are usually either hard to compute or inaccessible. As these quantities are (usually) unobservable and non-trivial to estimate, yet impact the convergence rate, we would like our algorithms to be <em>adaptive</em> to these parameters without requiring them as <em>input</em>, i.e., the algorithm automatically behaves better when the data is better conditioned.</p>
<p>We show that the greedy algorithm for submodular maximization also provides better approximation guarantees as (our submodular analog to) sharpness of the objective function increases. While surprising at first, this is actually quite natural, once we understand the greedy algorithm as a discrete analog of ascent algorithms in continuous optimization that is allowed to perform only a fixed number of steps: if the algorithm converges faster, then after a fixed number of steps ($k$ in the discrete case, to be precise) its achieved approximation guarantee will be better. The key challenge is then to identify a notion of sharpness that is meaningful in the context of submodular function maximization. We also show that the greedy algorithm automatically adapts to the submodular function’s sharpness.</p>
<p>The most basic form of <em>sharpness for submodular functions</em> that we define is the following notion of <em>monotone sharpness</em>: There exists an optimal solution $S^\esx \subseteq V$ such that for all $S \subseteq V$ it holds:</p>
\[\tag{monSharp}
\sum_{e \in S^\esx \setminus S} \Delta_S(e) \geq \left( \frac{|S^\esx \setminus S|}{kc} \right)^{1/\theta} f(S^\esx),\]
<p>which then leads to a guarantee of the form:</p>
\[f(S_g) \geq \left(1- \left(1-\frac{\theta}{c}\right)^{1/\theta}\right) f(S^\esx),\]
<p>which interpolates between the worst-case approximation factor $(1-1/\mathrm{e})$ and the best-case approximation factor $1$.</p>
<p>We also define tighter notions of sharpness that explain more of greedy’s performance; however, their definitions are slightly more involved and beyond the scope of this summary. In the following figure we depict the performance of the greedy algorithm on three different tasks as well as how much of its performance is explained by various data-dependent measures; it can be seen that our most advanced notion of sharpness, called <em>dynamic submodular sharpness</em>, explains a significant portion of greedy’s performance.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/1_image_clustering.png" alt="img1" style="float:center; margin-right: 1%; width:27%" />
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/2_fac_loc.png" alt="img2" style="float:center; margin-right: 1%; width:27%" />
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/3_parkison_tele.png" alt="img3" style="float:center; margin-right: 1%; width:39%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Image Clustering (left), Facility Location (middle), Parkinson Telemonitoring (right). For each example we computed both the sharpness parameters and optimal solutions to compare predicted vs actual performance. In all three examples sharpness explains a significant portion of the greedy algorithm’s excess performance.</p>
<p>One final but important question that one might ask is: how many functions actually do satisfy sharpness? In convex optimization, by the <em>Łojasiewicz Factorization Lemma</em> (see [BDL] and references contained therein), basically almost all functions exhibit non-trivial sharpness; the same is true for the submodular case here, albeit in somewhat weaker form.</p>
<h3 id="references">References</h3>
<p>[NWF] Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical programming, 14(1), 265-294. <a href="https://link.springer.com/content/pdf/10.1007/BF01588971.pdf">pdf</a></p>
<p>[NW] Nemhauser, G. L., & Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of operations research, 3(3), 177-188. <a href="https://www.jstor.org/stable/pdf/3689488.pdf">pdf</a></p>
<p>[CC] Conforti, M., & Cornuéjols, G. (1984). Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete applied mathematics, 7(3), 251-274. <a href="https://www.sciencedirect.com/science/article/pii/0166218X84900039">pdf</a></p>
<p>[CRV] Chatziafratis, V., Roughgarden, T., & Vondrák, J. (2017). Stability and recovery for independence systems. arXiv preprint arXiv:1705.00127. <a href="https://arxiv.org/abs/1705.00127">pdf</a></p>
<p>[BDL] Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205-1223. <a href="https://epubs.siam.org/doi/pdf/10.1137/050644641">pdf</a></p>
<p><em>Written by Sebastian Pokutta.</em></p>
<h1 id="an-update-on-scip">An update on SCIP (2020-05-15)</h1>
<p><em>TL;DR: A quick update on what is on the horizon for SCIP.</em>
<!--more--></p>
<p>SCIP has been a cornerstone of ZIB’s mathematical optimization department for many years. It is probably (one of) the fastest and most comprehensive academic solvers for MIPs and several related optimization paradigms. Certainly it is the fastest MIP and MINLP solver that is fully transparent and accessible in source code. This impressive effort is due to a great team of researchers and developers, both at ZIB and throughout the world, that has been pushing SCIP to the cutting-edge.</p>
<p>Over the last 5 years, two people have strongly shaped the progress of SCIP at ZIB: Thorsten Koch on the organizational side and Ambros Gleixner as head of technical research & development. In Fall of 2019 I moved to ZIB. With this move I also took the lead of the overall SCIP project among several other new responsibilities and I would like to take the opportunity to thank Thorsten Koch for his great leadership of the SCIP project over the last years. I am quite excited to have the opportunity to shape the future of SCIP together with the rest of the SCIP team and in view of this I would like to share some updates. In a nutshell these changes can be summarized as follows:</p>
<ol>
<li>Making SCIP more open</li>
<li>Making SCIP more accessible</li>
<li>Making SCIP more inclusive</li>
</ol>
<p>While not everything can be achieved in one step, this overview might give you an idea of what is on the horizon.</p>
<p>We also have some very exciting new research directions and results, however I am going to talk about some of that work elsewhere in a more research-focused post.</p>
<h2 id="scip-7-release">SCIP 7 release</h2>
<p>Before I am going to talk about some upcoming things, I wanted to briefly mention the recent release of <a href="http://www.optimization-online.org/DB_HTML/2020/03/7705.html">SCIP 7</a>, with many new features. Just to name two, there is a new parallel preprocessing library <em>PaPILO</em> and we now have <a href="http://www.optimization-online.org/DB_HTML/2020/04/7722.html">tree-size prediction</a> built-in:</p>
<blockquote>
<p>On average, the best method estimates B&B tree sizes within a factor of 3 on the set of unseen test instances even during the early stage of the search, and improves in accuracy as the search progresses. It also achieves a factor 2 over the entire search on each out of six additional sets of homogeneous instances we have tested.</p>
</blockquote>
<p><em>#firstSeenInSCIP</em></p>
<p>Both for MIP and MINLP, SCIP 7 is on average 1.36x faster than SCIP 6 on hard instances, i.e., on instances that take at least 100 seconds to solve. You can check out the latest release on the <a href="http://scip.zib.de">SCIP homepage</a>.</p>
<h2 id="interfaces">Interfaces</h2>
<p>SCIP already supports a wide variety of interfaces. In the future we will further integrate SCIP with those interfaces and in particular we will improve integration with Python through <a href="https://github.com/SCIP-Interfaces/PySCIPOpt">PySCIPOpt</a> and Julia through <a href="https://github.com/SCIP-Interfaces/SCIP.jl">SCIP.jl</a>. These two will become true first-class interfaces. Moreover, we will maintain several <a href="https://github.com/SCIP-Interfaces">other interfaces</a> depending on demand etc.</p>
<h2 id="distribution">Distribution</h2>
<p>We intend to extend the distribution mechanisms for SCIP. One very high priority is distribution through the conda package manager, so that the SCIP optimization suite and PySCIPOpt can be basically installed with a simple <code class="language-plaintext highlighter-rouge">conda install pyscipopt</code>. We are also exploring to make SCIP available in <a href="https://colab.research.google.com/">Google Colab</a>; the conda integration might make this a trivial exercise.</p>
<h2 id="tutorials">Tutorials</h2>
<p>Many of you have experienced that SCIP is a very complex piece of software and that getting started can be a nontrivial endeavor, simply because of its high flexibility as a framework, which is fully exposed through its API. At the same time, SCIP can be used out-of-the-box as a powerful black-box solver. However, many of you have suffered from the current lack of good entry-level documentation. To alleviate this in the short term, we are in the process of writing a tutorial specifically targeting the “black box user + SCIP” via PySCIPOpt. In the mid term we will try to offer more resources to people who use SCIP mainly as a black-box solver; see the Website section below.</p>
<h2 id="new-platforms">New platforms</h2>
<p>We intend to support several new platforms for the SCIP Optimization Suite. As you might have already seen from <a href="/blog/random/2019/09/29/scipberry.html">a post sometime back</a> one such platform is ARM. This includes the RaspberryPi but also many cell phone and mobile architectures that then can potentially run SCIP. Moreover, we also plan a dockerized version of SCIP for deployment in cloud computing environments. In fact if you want to give a preliminary build a spin: <code class="language-plaintext highlighter-rouge">docker pull scipoptsuite/scipoptsuite:7.0.0</code>; SCIP Optimization Suite 7.0.0 and PySCIPOpt 3.0.0 on slim buster with Python 3.7—feedback appreciated.</p>
<p>A little further down the road, we will be likely also supporting RISC-V once stable development systems are available and we are currently evaluating Microsoft’s <a href="https://docs.microsoft.com/en-us/windows/wsl/wsl2-install">WSL</a> in particular together with <a href="https://ubuntu.com/wsl">Ubuntu on WSL</a> as an alternative deployment mode for Windows.</p>
<h2 id="decentralized-development">Decentralized Development</h2>
<p>SCIP has had a strong decentralized development component and this trend is likely to increase further in the future with many more non-ZIB developers contributing to SCIP. In Germany alone, we have 4 development centers with FAU Erlangen-Nürnberg, TU Darmstadt, RWTH Aachen, and the Zuse Institute Berlin. On top of that we have a large number of international contributors.</p>
<p>This decentralized development setup with many stakeholders and core developers outside ZIB will be also more strongly reflected in SCIP’s governance; more on this soon.</p>
<h2 id="website">Website</h2>
<p>SCIP will move to <a href="http://www.scipopt.org">http://www.scipopt.org</a> as a new home and one-stop-shop; should be online in a few days. Moreover, we will also separate the web site into two parts in the next few months: one for SCIP users and one for SCIP developers.</p>
<h2 id="licensing">Licensing</h2>
<p>There are some license changes on the horizon as well. The short version is that we intend SCIP to be free for non-commercial use in general, and we are currently discussing how to deal with commercial use. One model might be to have a community edition under some permissive open-source license and a professional edition for commercial use. Obviously this is quite a complicated matter and it will take some time to iron out all the details and settle on a final setup.</p>
<p>In the meantime, if you want to use SCIP, send an email to <a href="mailto:licenses@zib.de">licenses@zib.de</a> and we will work something out in the spirit of the above.</p>
<h2 id="hiring">Hiring</h2>
<p>We are looking to grow the SCIP developer team. If you want to contribute to the future development of SCIP, please get in touch.</p>
<p>Sebastian Pokutta</p>
<p><em>TL;DR: A quick update on what is on the horizon for SCIP.</em></p>
<h1 id="psychedelic-style-transfer">Psychedelic Style Transfer</h1>
<p>2020-04-09 – <a href="http://www.pokutta.com/blog/research/2020/04/09/ai-art">http://www.pokutta.com/blog/research/2020/04/09/ai-art</a></p>
<p><em>TL;DR: We show how to make psychedelic animations from instabilities usually discarded in neural style transfer. This post builds upon a remark we made in our recent paper <a href="https://arxiv.org/abs/2003.06659">Interactive Neural Style Transfer with Artists</a>, in which we questioned several simple evaluation aspects of neural style transfer methods. It is also our second series of interactive painting experiments in which style transfer outputs constantly influence a painter; see the other series <a href="https://arxiv.org/abs/1910.04386">here</a>, as well as our Medium <a href="https://medium.com/@human.aimachine.art/psychedelic-style-transfer-5744b700fc3e">post</a>.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux and Louis Thiry.</em> <br /></p>
<div class="paddingContainer">
<div class="iframe-container center">
<iframe width="100%" height="100%" src="https://www.youtube-nocookie.com/embed/1jg6CqMEbcQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</div>
<p>The first frame of the video, a watercolor by my grandfather, is progressively stirred into a plethora of curvy and colorful patches. It then metamorphoses into a purplish, phantasmal coral reef, which is itself slowly submerged by an angry puce ocean. The water then calms down as the coral reef disappears, and it ends up perfectly still. How is this psychedelic animation related to style transfer methods?</p>
<p>Neural style transfer methods are rendering techniques – mostly for images – that seek to stylize a content image with the style of another; see the figure below. More precisely, these algorithms are designed to extract a style representation from one image and a semantic content representation from another, and then cleverly construct a new picture from the two.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Heard Island in Antarctica</th>
<th style="text-align: center">Maxime Maufra’s painting</th>
<th style="text-align: center">Style Transfer Output using STROTSS [KS]</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_content.jpg" style="zoom:415%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_style.jpg" style="zoom:415%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_output.jpg" style="zoom:485%;" /></td>
</tr>
</tbody>
</table>
<p>While designing new evaluation techniques for style transfer methods in [KT], we made an uncomplicated but crucial observation: <strong>style transfer applied to the same image as style and content should reasonably output the image itself</strong>. However, we observed that many style transfer algorithms do not satisfy this property; no one ever bothered to hard-code this fundamental behavior. Here, we show how, by leveraging this instability, we produce animations like the one above.</p>
<table>
<thead>
<tr>
<th style="text-align: center">MST first iteration</th>
<th style="text-align: center">MST second iteration</th>
<th style="text-align: center">MST third iteration</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_0.jpg" style="zoom:400%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_1.jpg" style="zoom:400%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_2.jpg" style="zoom:400%;" /></td>
</tr>
</tbody>
</table>
<p>Formally, a style transfer method is simply a function \(f\) that takes a style image \(s\) and a content image \(c\) and outputs a new image \(f(s,c)\). Our observation is that for some style transfer methods \(f\) and initial images \(x_0\), the equality \(f(x_0, x_0) = x_0\) is not satisfied. The output image adds a slightly perceptible flicker, blur, or blemish to the initial image \(x_0\). These instability patterns differ from one method to another, but are experimentally the same when starting from different images \(x_0\).</p>
<p>Yet these effects are hardly perceptible. Hence to better understand the phenomenon, we need to amplify them. We simply repeat the process: start from an initial image \(x_0\) and reiterate the style transfer operation</p>
\[\begin{align*}
x_{t+1} = f(x_t, x_t).
\end{align*}\]
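<p>As a minimal sketch of this feedback loop (the <code class="language-plaintext highlighter-rouge">style_transfer</code> callable and the toy <code class="language-plaintext highlighter-rouge">noisy_identity</code> stand-in below are hypothetical placeholders, not code from the MST or WCT repositories):</p>

```python
def iterate_style_transfer(style_transfer, x0, num_steps):
    """Repeatedly feed an image to a style transfer method as both
    style and content, collecting every iterate x_t for the animation.

    `style_transfer(style, content)` stands in for any method f; the
    collected frames trace the sequence x_{t+1} = f(x_t, x_t).
    """
    frames = [x0]
    x = x0
    for _ in range(num_steps):
        x = style_transfer(x, x)  # same image as style and content
        frames.append(x)
    return frames

# Toy stand-in for f with a small systematic instability, so that
# f(x, x) != x and the deviation compounds over the iterations.
def noisy_identity(style, content):
    return [0.99 * v + 0.01 for v in content]

frames = iterate_style_transfer(noisy_identity, [0.5, 0.2], num_steps=3)
```

<p>With a real feed-forward method plugged in for <code class="language-plaintext highlighter-rouge">style_transfer</code>, the collected frames are exactly the images shown in the tables above.</p>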
<p>In the figure above, after a few iterations, the effects become perceptible and particularly stylish. For instance, when taking the MST style transfer method [MST] (with this <a href="https://github.com/irasin/Pytorch_MST">code</a>), the iterates become tessellated versions of the initial image. The instabilities amplify all the lines of the picture; on portraits, they reveal every wrinkle. When taking another algorithm like WCT [WCT] (with <a href="https://github.com/irasin/Pytorch_WCT">this code</a>), the effects are different: the goblin is slowly dematerialized by the devilish style transfer instabilities, see the figure below.</p>
<table>
<thead>
<tr>
<th style="text-align: center">WCT first iteration</th>
<th style="text-align: center">WCT second iteration</th>
<th style="text-align: center">WCT fourth iteration</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_0.jpg" style="zoom:200%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_1.jpg" style="zoom:200%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_4.jpg" style="zoom:200%;" /></td>
</tr>
</tbody>
</table>
<p>So far, we have simply shown the outputs of the first iterations of this repeated process. The animation above collects all the images of the sequence \((x_t)\). For many different pictures and methods, we observe this asymptotic type of divergence, which we name the <em>psychedelic regime</em>. Indeed, once the algorithm loses track of the initial image, it starts raving, feeding on its own increasingly delusional outputs without ever going back to our reality! The raving differs from one method to another, but experimentally seems not to depend on the initial image.</p>
<p>This playfully shows what a machine can do when it forgets about human input or the non-numerical reality. Metaphorically, this also happens in many practical uses of algorithms. For instance, collaborative-filtering recommender systems use new data that come from humans interacting with the algorithm; we can no longer assess the choices humans would have made without ever being influenced by algorithms. We have lost this initial input!</p>
<p>[R] and [G] studied instabilities of style transfer methods in the case of real-time style transfer for videos: the style transfer output may differ significantly from one frame to the next even though consecutive input frames are perceptibly the same, resulting in an unpleasant flickering effect in style-transferred videos. Similarly to the adversarial-examples literature, their main focus is to study the instabilities in order to detect, correct, and remove them. Here, we outlined instabilities stemming from another type of inconsistency and took advantage of them.</p>
<p>Also, note that MST and WCT are feed-forward approaches to style transfer, i.e., the function \(f\) is a neural network [JA,GL,LW]. In fact, the first approach to neural style transfer was optimization-based [G]. In particular, when considering the same image as style and content, that image is the global optimum of the loss, so the method satisfies \(f(x,x)=x\) if properly initialized. Actually, even when starting the optimization from a random image, we observed that the iterates converge to the style-and-content image, <em>i.e.</em>, the global minimum of a non-convex loss. Note, though, that some optimization-based methods like STROTSS may still not satisfy this stability property because of randomization and the re-parametrization of the image with its Laplacian pyramid.</p>
<p>Finally, if you are interested in making your own psychedelic videos, the take-home message is that almost certainly any feed-forward neural style transfer approach will give a different <em>psychedelic regime</em>. Below, we show one using the WCT method, and the first iterations when using the STROTSS optimization-based method (our <a href="https://github.com/human-aimachine-art/pytorch-STROTSS-improved">code</a>).</p>
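<p>To assemble the collected iterates into an animation, one stdlib-only option (a sketch, not the pipeline we actually used) is to dump each grayscale frame as a binary PGM file and stitch the numbered files together afterwards with a standard tool such as ffmpeg; the file naming and frame format below are just one possible choice:</p>

```python
import os
import tempfile

def save_frames_as_pgm(frames, out_dir):
    """Write each grayscale frame (a list of rows of 0-255 ints) as a
    binary PGM (P5) file, numbered so a video tool can assemble them
    afterwards, e.g. `ffmpeg -i frame_%04d.pgm psychedelic.mp4`."""
    paths = []
    for t, frame in enumerate(frames):
        height, width = len(frame), len(frame[0])
        path = os.path.join(out_dir, "frame_%04d.pgm" % t)
        with open(path, "wb") as fh:
            # PGM header: magic number, dimensions, maximum gray value.
            fh.write(b"P5\n%d %d\n255\n" % (width, height))
            # Raster: one byte per pixel, rows top to bottom.
            fh.write(bytes(v for row in frame for v in row))
        paths.append(path)
    return paths

# Two tiny 2x2 frames standing in for the iterates (x_t).
out_dir = tempfile.mkdtemp()
paths = save_frames_as_pgm(
    [[[0, 64], [128, 255]], [[10, 70], [140, 250]]], out_dir)
```

<p>For color output, one would write PPM (P6) files instead, with three bytes per pixel; the stitching step is unchanged.</p>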
<div class="paddingContainer">
<div class="iframe-container center">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/lyyAFlmNjIg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</div>
<table>
<thead>
<tr>
<th style="text-align: center">STROTSS first iteration</th>
<th style="text-align: center">STROTSS several iterations later…</th>
<th style="text-align: center">STROTSS several iterations later…</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_0.jpg" style="zoom:300%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_5.jpg" style="zoom:300%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_15.jpg" style="zoom:300%;" /></td>
</tr>
</tbody>
</table>
<h3 id="references">References</h3>
<p>[CKT] Cabannes, V., Kerdreux, T., Thiry, L., Campana, T., & Ferrandes, C. (2019). Dialog on a Canvas with a Machine. Third Workshop of Creativity and Design at NeurIPS 2019. <a href="https://arxiv.org/abs/1910.04386">pdf</a></p>
<p>[JA] Johnson, J., Alahi, A., &amp; Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 694–711. Springer. <a href="https://arxiv.org/abs/1603.08155">pdf</a></p>
<p>[G] Gatys, L. A., Ecker, A. S., &amp; Bethge, M. (2015). A neural algorithm of artistic style. <a href="https://arxiv.org/abs/1508.06576">pdf</a></p>
<p>[GL] Ghiasi, G.; Lee, H.; Kudlur, M.; Dumoulin, V.; and Shlens, J. (2017). Exploring the structure of a real-time, arbitrary neural artistic stylization network. <a href="https://arxiv.org/abs/1705.06830">pdf</a></p>
<p>[G] Gupta, A., Johnson, J., Alahi, A., &amp; Fei-Fei, L. (2017). Characterizing and improving stability in neural style transfer. In Proceedings of the IEEE International Conference on Computer Vision, 4067–4076. <a href="https://arxiv.org/abs/1705.02092">pdf</a></p>
<p>[KT] Kerdreux, T., Thiry, L., Kerdreux, E. (2020). Interactive Neural Style Transfer with Artists. <a href="https://arxiv.org/abs/2003.06659">pdf</a></p>
<p>[KS] Kolkin, N., Salavon, J., Shakhnarovich G. (2019). Style Transfer by Relaxed Optimal Transport and Self-Similarity. <a href="https://arxiv.org/abs/1904.12785">pdf</a></p>
<p>[LW] Li, C., and Wand, M. (2016). Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, 702–716 Springer. <a href="https://arxiv.org/abs/1604.04382">pdf</a></p>
<p>[MST] Zhang, Y., Fang, C., Wang, Y., Wang, Z., Lin, Z., Fu, Y., &amp; Yang, J. (2019). Multimodal Style Transfer via Graph Cuts. In Proceedings of the IEEE International Conference on Computer Vision, 5943–5951. <a href="https://arxiv.org/abs/1904.04443">pdf</a></p>
<p>[WCT] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M. H. (2017). Universal style transfer via feature transforms. In Advances in neural information processing systems (pp. 386–396). <a href="https://arxiv.org/abs/1705.08086">pdf</a></p>
<p>[R] Risser, E., Wilmot, P., &amp; Barnes, C. (2017). Stable and controllable neural texture synthesis and style transfer using histogram losses. <a href="https://arxiv.org/abs/1701.08893">pdf</a></p>