<h2>Learning to Schedule Heuristics in Branch and Bound</h2>
<p><em>May 6, 2021</em></p>
<p><em>TL;DR: This is an informal discussion of our recent paper <a href="https://arxiv.org/abs/2103.10294">Learning to Schedule Heuristics in Branch and Bound</a> by Antonia Chmiela, Elias Khalil, Ambros Gleixner, Andrea Lodi, and Sebastian Pokutta. In this paper, we propose the first data-driven framework for scheduling heuristics in a MIP solver. By learning from data describing the performance of primal heuristics, we obtain a problem-specific schedule of heuristics that collectively find many solutions at minimal cost. We provide a formal description of the problem and propose an efficient algorithm for computing such a schedule.</em>
<!--more--></p>
<p><em>Written by Antonia Chmiela.</em></p>
<h3 id="motivation">Motivation</h3>
<p>Primal heuristics play a crucial role in exact solvers for Mixed Integer Programming (MIP). For instance, Berthold [1] showed that the primal bound improved on average by around 80% when primal heuristics were used. While solvers are guaranteed to find optimal solutions given sufficient time, real-world applications typically require finding good solutions early on in the search to enable fast decision-making. Even though much of MIP research focuses on designing effective heuristics, the question of how to manage multiple MIP heuristics in a solver has not received equal attention.</p>
<p>Generally, a solver has a variety of primal heuristics implemented, where each class exploits a different idea to find good solutions. During Branch and Bound (B&B), these heuristics are executed successively at each node of the search tree, and improved solutions are reported back to the solver if found. Since most heuristics can be very costly, it is necessary to be strategic about the order in which the heuristics are executed and the number of iterations allocated to each, with the ultimate goal of obtaining good primal performance overall. Such decisions are often made by following hard-coded rules derived from testing on broad benchmark test sets. While these static settings yield good performance on average, their performance can be far from optimal when considering specific families of instances.</p>
<p>In this paper, we propose a data-driven approach to systematically improve the use of primal heuristics in B&B. By learning from data about the duration and success of every heuristic call for a set of training instances, we construct a schedule of heuristics deciding when and for how long a certain heuristic should be executed to obtain good primal solutions early on. As a result, we are able to significantly improve the use of primal heuristics.</p>
<h3 id="contributions">Contributions</h3>
<p>Our main contributions can be summarized as follows:</p>
<ol>
<li>We <em>formalize the learning task</em> of finding an effective, cost-efficient heuristic schedule on a training dataset as a Mixed Integer Quadratic Program;</li>
<li>We propose an <em>efficient heuristic</em> for solving the training (scheduling) problem and a <em>scalable data collection</em> strategy;</li>
<li>We perform <em>extensive computational experiments</em> on two classes of challenging instances and <em>demonstrate the benefits of our approach</em>.</li>
</ol>
<h3 id="obtaining-a-heuristic-schedule">Obtaining a Heuristic Schedule</h3>
<p>We consider the following practically relevant setting. We are given a set of heuristics $\mathcal{H}$ and a homogeneous set of training instances $\mathcal{X}$ from the same problem class that we want to solve in practice. In a data collection phase, we are allowed to execute the B&B algorithm on the training instances, observing how each heuristic performs at each node of each search tree. At a high level, our goal is then to leverage this data to obtain a schedule of heuristics that minimizes a primal performance metric.</p>
<p>A heuristic schedule controls two important aspects: the <em>order</em> in which a set of applicable heuristics $\mathcal{H}$ is executed and the maximal <em>duration</em> of each heuristic run. To find primal solutions, the solver executes a heuristic loop that iterates over the heuristics in decreasing priority. The loop is terminated if a heuristic finds a new incumbent solution. As such, an ordering that prioritizes effective heuristics can lead to time savings without sacrificing primal performance. Furthermore, solvers use working limits to control the computational effort spent on heuristics. By allowing a heuristic to be more expensive, i.e., to run for longer, we also increase the likelihood of it finding an integer feasible solution. Hence, a heuristic schedule is defined as follows. For a heuristic $h \in \mathcal{H}$, let $\tau \in \mathbb{R}_{>0}$ denote $h$’s time budget. Then, we are interested in finding a schedule $S$ defined by</p>
\[S := \langle (h_1, \tau_1), \dots, (h_k, \tau_k) \rangle, h_i \in \mathcal{H}.\]
<p>We also refer to $\tau_i$ as the maximal number of iterations allocated to $h_i$ in schedule $S$.</p>
<p>Furthermore, let us denote by $\mathcal{N}_{\mathcal{X}}$ the collection of search tree nodes that appear when solving the instances in $\mathcal{X}$ with B&B. Recall that our objective is to optimize the use of primal heuristics such that we find feasible solutions fast. To achieve this, we learn from the data and construct a schedule $S$ that finds feasible solutions for a large fraction of the nodes in $\mathcal{N}_{\mathcal{X}}$, while also minimizing the number of iterations spent by schedule $S$. Hence, the heuristic scheduling problem we consider is given by</p>
\[\begin{equation} \tag{$P_{\mathcal{S}}$}
\underset{S \in \mathcal{S}}{\text{min}}
\sum_{N \in \mathcal{N}_{\mathcal{X}}} T(S,N)
\;\text{ s.t. }\;
|\mathcal{N}_S| \geq \alpha |\mathcal{N}_{\mathcal{X}}|.
\end{equation}\]
<p>Here $T(S,N)$ denotes the number of iterations schedule $S$ needs to solve node $N$, and $\mathcal{N}_S$ is the set of nodes at which schedule $S$ is successful in finding a solution. The parameter $\alpha \in [0,1]$ denotes the minimum fraction of nodes at which we want the schedule to find a solution. Problem ($P_{\mathcal{S}}$) can be formulated as a Mixed-Integer Quadratic Program (MIQP).</p>
<p>To find such a schedule, we need to know the number of iterations it takes heuristic $h$ to solve node $N$ for all heuristics in $\mathcal{H}$ and nodes in $\mathcal{N}_{\mathcal{X}}$. Hence, when collecting data for the instances in the training set $\mathcal{X}$, we track, for every B&B node $N$ at which a heuristic $h$ was called, the number of iterations $\tau^h_N$ it took $h$ to find a feasible solution. We propose an efficient data collection framework that uses a specially crafted version of a MIP solver to collect multiple reward signals for the execution of multiple heuristics per single MIP evaluation during the training phase. As a result, we obtain a large number of data points that scales with the running time of the MIP solves.</p>
<p>Unfortunately, problem ($P_{\mathcal{S}}$) is $\mathcal{NP}$-hard and too expensive to solve in practice, so we direct our attention towards designing an efficient heuristic algorithm. The approach we propose follows a greedy tactic, and the basic idea can be summarized as follows: A schedule $G$ is built by successively adding the action $(h,\tau)$ to $G$ that maximizes the ratio of the marginal increase in the number of nodes solved to the cost (w.r.t. a cost function $c$) of including $(h,\tau)$. In other words, we start with an empty schedule $G_0 = \langle \rangle$ and successively add actions</p>
\[\begin{equation*}
\begin{aligned}
g_j = \underset{(h,\tau)}{\text{argmax}}
\frac{|\{
N \in \mathcal{N}_{\mathcal{X}}\setminus \mathcal{N}_{\mathcal{G}_{j-1}} \mid \tau_N^h \leq
\tau\}|}{c_{j-1}(h,\tau)},
\end{aligned}
\end{equation*}\]
<p>until either all nodes in $\mathcal{N}_{\mathcal{X}}$ are solved by $G_j$ or all heuristics are already contained in the schedule. A more detailed description as well as the pseudo-code can be found in our paper. Our code can be found <a href="https://github.com/antoniach/heuristic-scheduling">here</a>.</p>
<h3 id="sneak-peak-at-the-results">Sneak Peak at the Results</h3>
<p>A comprehensive experimental evaluation shows that our approach consistently learns heuristic schedules with better primal performance than the default settings of the state-of-the-art academic MIP solver <a href="https://www.scipopt.org">SCIP</a>. For instance, we are able to reduce the average primal integral by up to 49% on two classes of challenging instances – namely the <em>Generalized Independent Set Problem (GISP)</em> [2] and <em>Fixed-Charge Multicommodity Network Flow Problem (FCMNF)</em> [3]. A brief comparison of the average primal integral over time is shown in the following figure.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/learningHeuristics/average_primal.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p>The average primal integral is not the only performance metric for which we observed a significant improvement. On average, the instances solved with the schedule terminated with a smaller primal-dual gap, a better primal bound (for instances that hit the time limit) and overall found more solutions during the solving process.</p>
<h3 id="references">References</h3>
<p>[1] Timo Berthold. Measuring the impact of primal heuristics. Operations Research Letters 41.6 (2013): 611-614. <a href="https://opus4.kobv.de/opus4-zib/frontdoor/index/index/docId/4766">[pdf]</a></p>
<p>[2] Marco Colombi, Renata Mansini, Martin Savelsbergh. The generalized independent set problem: Polyhedral analysis and solution approaches. European Journal of Operational Research 260.1 (2017): 41-55. <a href="https://doi.org/10.1016/j.ejor.2016.11.050">[pdf]</a></p>
<p>[3] Lluís-Miquel Munguía, Shabbir Ahmed, David A. Bader, George L. Nemhauser, Vikas Goel, Yufen Shao. A parallel local search framework for the Fixed-Charge Multicommodity Network Flow problem. Computers & Operations Research 77 (2017): 44-57. <a href="https://doi.org/10.1016/j.cor.2016.07.016">[pdf]</a></p>
<h2>FrankWolfe.jl: A high-performance and flexible toolbox for Conditional Gradients</h2>
<p><em>April 20, 2021</em></p>
<p><em>TL;DR: We present <a href="https://github.com/ZIB-IOL/FrankWolfe.jl">$\texttt{FrankWolfe.jl}$</a>, an open-source implementation in Julia of several popular Frank-Wolfe and Conditional Gradients variants for first-order constrained optimization. The package is designed with flexibility and high-performance in mind, allowing for easy extension and relying on few assumptions regarding the user-provided functions. It supports Julia’s unique multiple dispatch feature, and interfaces smoothly with generic linear optimization formulations using $\texttt{MathOptInterface.jl}$.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera and Mathieu Besançon.</em></p>
<h2 id="what-does-the-package-do">What does the package do?</h2>
<p>The $\texttt{FrankWolfe.jl}$ Julia package aims at solving problems of the form:
\(\tag{minProblem}
\begin{align}
\label{eq:minimizationProblem}
\min\limits_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x}),
\end{align}\)
where $\mathcal{C} \subseteq \mathbb{R}^d$ is a convex compact set and $f(\mathbf{x})$ is a differentiable function, through the use of <em>Frank-Wolfe</em> [FW] (also known as <em>Conditional Gradient</em> [LP]) algorithm variants. The two main ingredients that the package uses to solve this problem are:</p>
<ol>
<li>A First-Order Oracle (FOO): Given $\mathbf{x} \in \mathcal{C}$, the oracle returns $\nabla f(\mathbf{x})$.</li>
<li>A Linear Minimization Oracle (LMO): Given a direction $\mathbf{d}\in \mathbb{R}^d$, the oracle returns $\mathbf{v} \in \operatorname{argmin}_{\mathbf{x} \in \mathcal{C}} \langle \mathbf{d}, \mathbf{x}\rangle$.</li>
</ol>
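To illustrate how these two oracles interact, here is a minimal, self-contained Frank-Wolfe loop in plain Julia (a sketch of the pattern, not the package's API), minimizing $f(\mathbf{x}) = \lVert \mathbf{x} - \mathbf{b}\rVert^2$ over the probability simplex, whose LMO simply returns the vertex $e_i$ with $i = \operatorname{argmin}_i d_i$:

```julia
# FOO: gradient of f(x) = ||x - b||^2
foo(x, b) = 2 .* (x .- b)

# LMO for the unit probability simplex: the minimizer of <d, x> over the
# simplex is the vertex e_i with i = argmin_i d_i
function lmo_simplex(d)
    v = zeros(length(d))
    v[argmin(d)] = 1.0
    return v
end

function frank_wolfe_sketch(b; iters=500)
    x = fill(1 / length(b), length(b))   # start at the simplex barycenter
    for t in 0:iters-1
        g = foo(x, b)
        v = lmo_simplex(g)               # extreme point minimizing <g, .>
        γ = 2 / (t + 2)                  # agnostic step size
        x .= (1 - γ) .* x .+ γ .* v      # convex combination stays feasible
    end
    return x
end
```

Since each iterate is a convex combination of simplex vertices, feasibility is maintained for free, with no projection ever computed.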
<p>This bypasses the need to use projection oracles onto $\mathcal{C}$, which can be extremely advantageous as solving an LP over $\mathcal{C}$ can be much cheaper than solving a quadratic (projection) problem over the same set.
Such is the case for the nuclear norm ball: solving an LP over this convex set simply requires computing the left and right singular vectors associated with the largest singular value, whereas projecting onto this feasible region requires computing a full singular value decomposition.
See [CP] for more examples; for more background information on the package, see also our software overview [BCP].</p>
<h2 id="how-do-i-get-started">How do I get started</h2>
<p>In a Julia session, type <code class="language-plaintext highlighter-rouge">]</code> to switch to package mode:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="x">]</span>
<span class="x">(</span><span class="nd">@v1.6</span><span class="x">)</span> <span class="n">pkg</span><span class="o">></span>
</code></pre></div></div>
<p>See the <a href="https://docs.julialang.org/en/v1/stdlib/Pkg/">Julia documentation</a>
for more examples and advanced usage of the package manager.
You can then add the package with the <code class="language-plaintext highlighter-rouge">add</code> command:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="x">(</span><span class="nd">@v1.6</span><span class="x">)</span> <span class="n">pkg</span><span class="o">></span> <span class="n">add</span> <span class="n">https</span><span class="o">://</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">ZIB</span><span class="o">-</span><span class="n">IOL</span><span class="o">/</span><span class="n">FrankWolfe</span><span class="o">.</span><span class="n">jl</span>
</code></pre></div></div>
<p>Soon it will also be directly available through the package manager.</p>
<h2 id="and-why-should-you-care">And why should you care?</h2>
<p>Although the Frank-Wolfe algorithm and its variants have been studied for more than half a century and have gained a lot of attention due to their favorable theoretical and computational properties, no de facto standard implementation exists. The goal of the package is to become a reference open-source implementation for practitioners in need of a flexible and efficient first-order method and for researchers developing and comparing new approaches on similar classes of problems.</p>
<h2 id="algorithm-variants-included-in-the-package">Algorithm variants included in the package</h2>
<p>We summarize below the central ideas of the variants implemented in the package
and highlight in Table 1 key properties that can drive the choice of a variant on a given use case. More information about the variants can be found in the references provided. We mention briefly that most variants also work for the nonconvex case, providing some locally optimal solution in this case.</p>
<p><strong>Standard Frank-Wolfe.</strong> The simplest Frank-Wolfe variant is included in the package. It has the lowest memory requirements of all the variants, as in its simplest form it only requires keeping track of the current iterate. As such, it is suited for extremely large problems.
However, in certain cases, this comes at the cost of speed of convergence in terms of iteration count, when compared to other variants. As an example, when minimizing a strongly convex and smooth function over a polytope this algorithm might converge sublinearly, whereas the three variants that will be presented next converge linearly.</p>
<p><strong>Away-step Frank-Wolfe.</strong> One of the most popular Frank-Wolfe variants is the Away-step Frank-Wolfe (AFW) algorithm [GM, LJ]. While the standard FW algorithm can only move <em>towards</em> extreme points of $\mathcal{C}$, the AFW can move <em>away</em> from some extreme points of $\mathcal{C}$, hence the name of the algorithm. To be more specific, the AFW algorithm moves away from vertices in its active set at iteration $t$, denoted by $\mathcal{S}_t$, which contains the set of vertices $\mathbf{v}_k$ for $k<t$ that allow us to recover the current iterate as a convex combination. This algorithm expands the range of directions that the FW algorithm can move along, at the expense of having to explicitly maintain the current iterate as a convex decomposition of extreme points.</p>
<p><strong>Lazifying Frank-Wolfe variants.</strong> One running assumption for the two previous variants is that calling the LMO is cheap. There are many applications where calling the LMO in absolute terms is costly (but is cheap in relative terms when compared to performing a projection). In such cases, one can attempt to <em>lazify FW</em> algorithms,
to avoid having to compute $\operatorname{argmin}_{\mathbf{v}\in\mathcal{C}}\left\langle\nabla f(\mathbf{x}_t),\mathbf{v} \right\rangle$ by calling the LMO, settling for solutions that guarantee enough progress [BPZ]. This allows us to substitute the LMO by a <em>Weak Separation Oracle</em> while maintaining essentially the same convergence rates. In practice, these algorithms search for appropriate vertices among the vertices in a cache, or the vertices in the active set $\mathcal{S}_t$, and can be much faster in wall-clock time. In the package, both AFW and FW have lazy variants while the BCG algorithm is lazified by design.</p>
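The caching idea behind lazification can be sketched as follows. This is an illustrative standalone snippet (not the package's internal code); the names <code>lazy_extreme_point</code> and <code>lmo_fn</code> are our own:

```julia
# Illustrative sketch of lazification via caching: before calling the
# expensive LMO, check whether a previously computed vertex already
# guarantees enough progress in the current direction.
function lazy_extreme_point(cache::Vector{Vector{Float64}}, lmo_fn, direction;
                            threshold::Float64)
    for v in cache
        if sum(direction .* v) <= threshold
            return v                # cache hit: acceptable vertex, no LMO call
        end
    end
    v = lmo_fn(direction)           # cache miss: call the true (expensive) LMO
    push!(cache, v)                 # remember the new vertex for later
    return v
end
```

The threshold plays the role of the weak-separation criterion: any cached vertex with small enough inner product with the gradient makes sufficient progress, so the exact minimizer is not needed.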
<p><strong>Blended Conditional Gradients.</strong> The FW and AFW algorithms, and their lazy variants share one feature: they attempt to make primal progress over a reduced set of vertices.
The AFW algorithm does this through away steps (which do not increase the cardinality of the active set), and the lazy variants do this through the use of previously exploited vertices.
A third strategy that one can follow is to explicitly <em>blend</em> Frank-Wolfe steps with gradient descent steps over the convex hull of the active set (note that this can be done without requiring a projection oracle over $\mathcal{C}$, thus making the algorithm projection-free). This results in the Blended Conditional Gradients (BCG) algorithm [BPTW], which attempts to make as much progress as possible over the convex hull of the current active set $\mathcal{S}_t$ until it automatically detects that, in order to make further progress, it requires additional calls to the LMO and new atoms.</p>
<p><strong>Stochastic Frank-Wolfe.</strong> In many problem instances, evaluating the FOO at a given point is prohibitively expensive. In such cases, one usually has access to a <em>Stochastic First-Order Oracle (SFOO)</em>, from which one can build a gradient estimator.
This idea, which has powered much of the success of deep learning, can also be applied to the Frank-Wolfe algorithm [HL], resulting in the Stochastic Frank-Wolfe (SFW) algorithm and its variants.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/JuliaFrankWolfe/comparison_table.png" alt="fig1" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Table 1.</strong> Schematic comparison of the different algorithmic variants.</p>
<h2 id="package-design-characteristics">Package design characteristics</h2>
<p>Unlike disciplined convex frameworks or algebraic modeling languages such as
$\texttt{Convex.jl}$ or $\texttt{JuMP.jl}$,
our framework allows for arbitrary Julia functions defined outside of a Domain-Specific Language.
Users can provide their gradient implementation or leverage one of the many automatic differentiation
packages available in Julia.</p>
<p>One central design principle of $\texttt{FrankWolfe.jl}$ is to rely on few assumptions
regarding the user-provided functions, the atoms returned by the LMO, and
their implementation. The package works for instance out of the box when the LMO
returns Julia subtypes of <em>AbstractArray</em>, representing finite-dimensional
vectors, matrices or higher-order arrays.</p>
<p>Another design principle has been to favor in-place operations and reduce memory allocations where possible, since these can become expensive when repeated at every iteration. This is reflected in the memory-emphasis mode (the default mode for all algorithms), in which as many computations as possible are performed in place, and in the gradient interface, where the gradient function is provided with a variable to write into rather than reallocating every time a gradient is computed. The performance difference can be quite pronounced for problems in large dimensions; for example, passing a gradient of size 7.5GB on a state-of-the-art machine is about 8 times slower than an in-place update.</p>
<p>Finally, default parameters are chosen to make all algorithms as robust as possible out of the box, while allowing extension and fine tuning for advanced users.
For example, the default step-size strategy for all variants (except the stochastic one) is the adaptive step-size rule of [PNAM], which in our computations not only usually outperforms both line search and the short-step rule by dynamically estimating the Lipschitz constant, but also overcomes several issues with the limited additive accuracy of traditional line search rules.
Similarly, the BCG variant automatically upgrades the numerical precision for certain subroutines if numerical instabilities are detected.</p>
<h3 id="linear-minimization-oracle-interface">Linear minimization oracle interface</h3>
<p>One key step of FW algorithms is the linear minimization step which, given first-order information
at the current iterate, returns an extreme point of the feasible region that minimizes the linear approximation
of the function. It is defined in $\texttt{FrankWolfe.jl}$ using a single function:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> compute_extreme_point</span><span class="x">(</span><span class="n">lmo</span><span class="o">::</span><span class="n">LMO</span><span class="x">,</span> <span class="n">direction</span><span class="o">::</span><span class="n">D</span><span class="x">;</span> <span class="n">kwargs</span><span class="o">...</span><span class="x">)</span><span class="o">::</span><span class="n">V</span>
<span class="c"># ...</span>
<span class="k">end</span>
</code></pre></div></div>
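To make the dispatch mechanism concrete, here is a standalone sketch that mirrors (rather than uses) the package's interface: a custom LMO type for the $L_\infty$-norm ball $[-r, r]^d$, with `compute_extreme_point` dispatched on that type. The type names here are our own, for illustration:

```julia
# Standalone sketch mirroring the package's LMO interface (illustrative only).
abstract type LinearMinimizationOracle end

struct LInfNormBallLMO <: LinearMinimizationOracle
    radius::Float64
end

# For the box [-r, r]^d, the minimizer of <direction, x> sets each coordinate
# to -r or r, opposite to the sign of the corresponding direction entry.
function compute_extreme_point(lmo::LInfNormBallLMO, direction::AbstractVector)
    return [d > 0 ? -lmo.radius : lmo.radius for d in direction]
end
```

A second LMO type with its own `compute_extreme_point` method would coexist with this one, and the algorithm code calling `compute_extreme_point(lmo, g)` would remain unchanged; that is the multiple-dispatch pattern the package relies on.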
<p>The first argument $\texttt{lmo}$ represents the linear minimization oracle for the specific problem.
It encodes the feasible region $\mathcal{C}$, but also some algorithmic parameters or state.
This is especially useful for the lazified FW variants, as in these cases the LMO types can
take advantage of caching, by storing the extreme
vertices that have been computed in previous iterations and then looking up vertices
from the cache before computing a new one.</p>
<p>The package implements LMOs for commonly encountered feasible regions including $L_p$-norm balls,
$K$-sparse polytopes, the Birkhoff polytope, and the nuclear norm ball for matrix spaces, leveraging known
closed forms of extreme points. The multiple dispatch mechanism allows for different implementations
of a single LMO with multiple direction types. The type $\texttt{V}$ used to represent the computed
vertex is also specialized to leverage the properties of extreme vertices of the feasible region.
For instance, although the Birkhoff polytope is the convex hull of all doubly stochastic matrices of a given
dimension, its extreme vertices are permutation matrices that are much sparser in nature.
We also leverage sparsity outside of the traditional sense of nonzero entries.
When the feasible region is the nuclear norm ball in $\mathbb{R}^{N\times M}$, the vertices are rank-one matrices.
Even though these vertices are dense, they can be represented as the outer product of two vectors
and thus be stored with $\mathcal{O}(N+M)$ entries instead of $\mathcal{O}(N\times M)$ for the equivalent
dense matrix representation. The Julia abstract matrix representation allows the user and the library to interact
with these rank-one matrices with the same API as standard dense and sparse matrices.</p>
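The factored storage of such a rank-one vertex can be sketched as follows; the type name `RankOneMatrix` here is illustrative (the package has its own implementation):

```julia
# Sketch: storing a rank-one vertex u·vᵀ by its factors, O(N + M) memory
# instead of O(N × M) for the dense matrix.
struct RankOneMatrix
    u::Vector{Float64}   # left factor (the radius can be absorbed here)
    v::Vector{Float64}   # right factor
end

# (u vᵀ) x = u (vᵀ x): one dot product and one scaling, O(N + M) work.
matvec(A::RankOneMatrix, x::Vector{Float64}) = A.u .* sum(A.v .* x)

# Dense materialization, only for comparison/debugging.
dense(A::RankOneMatrix) = [A.u[i] * A.v[j] for i in eachindex(A.u), j in eachindex(A.v)]
```

Wrapping such a type in Julia's `AbstractMatrix` interface (as the package does) lets the rest of the algorithm treat it like any other matrix.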
<p>In some cases, users may want to define a custom feasible region that does not admit a closed-form
linear minimization solution. We implement a generic LMO based on $\texttt{MathOptInterface.jl}$,
thus allowing users on the one hand to select any off-the-shelf LP, MILP, or conic solver suitable for their problem,
and on the other hand to formulate the constraints of the feasible domain using the $\texttt{JuMP.jl}$ or $\texttt{Convex.jl}$ DSL.
Furthermore, the interface is naturally extensible by users who can define their own LMO and implement the corresponding $\texttt{compute_extreme_point}$ method.</p>
<h3 id="numeric-type-genericity">Numeric type genericity</h3>
<p>The package was designed from the start to be generic over both the used numeric types and data structures.
Numeric type genericity allows running the algorithms in extended fixed or arbitrary precision, e.g., the package works out-of-the-box with $\texttt{Double64}$ and $\texttt{BigFloat}$ types.
Extended precision is essential for ill-conditioned, high-dimensional problems in which quantities such as the condition numbers of the computed gradients become too large.
For some well-conditioned problems, reduced precision is sometimes sufficient to achieve the desired tolerance.
Furthermore, it opens the possibility of gradient computation and LMO steps on hardware accelerators such as GPUs.</p>
<h2 id="examples">Examples</h2>
<p>We will now present a few examples that highlight specific features of the package. The full code of each example (and several more) can be found in the examples folder of the repository.</p>
<h3 id="matrix-completion">Matrix completion</h3>
<p>Missing data imputation is a key topic in data science.
Given a set of observed entries from a matrix
$Y \in \mathbb{R}^{m\times n}$, we want to compute
a matrix $X \in \mathbb{R}^{m\times n}$ that minimizes
the sum of squared errors on the observed entries. As it stands
this problem formulation is not well-defined or useful, as one could minimize the objective function simply by setting the observed entries of $X$ to match
those of $Y$, and setting the remaining entries of
$X$ arbitrarily. However, this
would not result in any meaningful information regarding the unobserved entries
in $Y$, which is one of the key tasks in missing data imputation. A common way
to solve this problem is to reduce the degrees of freedom of the problem in order
to recover the matrix $Y$ from a small subset of its entries,
e.g., by assuming that the matrix $Y$ has low rank. Note that even though the matrix $Y$ has $m\times n$ coefficients, if it has rank $r$, it can be expressed using only $(m + n - r)r$ coefficients through its singular value decomposition.
Finding the matrix $X \in \mathbb{R}^{m\times n}$ with minimum rank whose observed entries are equal to those of $Y$ is a non-convex problem that is $\exists \mathbb{R}$-hard.
A common proxy for rank constraints is the use of constraints on the nuclear norm
of a matrix, which is equal to the sum of its singular values, and can model the convex envelope of matrices of a given rank. Using this property, one of the most common ways to tackle matrix completion problems is to solve:
\(\begin{align}
\min_{\|X\|_{*} \leq \tau} \sum_{(i,j)\in \mathcal{I}} \left( X_{i,j} - Y_{i,j}\right)^2, \label{Prob:matrix_completion}
\end{align}\)
where $\tau>0$ and $\mathcal{I}$ denotes the indices of the observed entries of $Y$. In this example, we compare the Frank-Wolfe implementation from the package
with a Projected Gradient Descent (PGD) algorithm which, after each gradient descent step, projects the iterates back onto the nuclear norm ball.
We use one of the movielens datasets to compare the two methods.
The code required to reproduce the full example can be found in the repository.</p>
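For concreteness, the objective restricted to the observed entries and its in-place gradient (in the style of the package's gradient interface) can be sketched as follows. This is an illustrative snippet with our own names, not the repository's exact code:

```julia
# Objective over the observed index pairs `obs`: sum of squared errors.
f_obs(X, Y, obs) = sum((X[i, j] - Y[i, j])^2 for (i, j) in obs)

# In-place gradient: unobserved entries have zero gradient, observed entries
# get 2 (X_ij - Y_ij); writing into G avoids reallocating an m×n matrix.
function grad_obs!(G, X, Y, obs)
    fill!(G, 0.0)
    for (i, j) in obs
        G[i, j] = 2 * (X[i, j] - Y[i, j])
    end
    return G
end
```

Note the gradient is itself sparse, with support only on the observed entries, which keeps each FOO call cheap even for large matrices.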
<div class="center">
<img src="http://www.pokutta.com/blog/assets/JuliaFrankWolfe/movielens_result.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Movielens results.</p>
<p>The results are presented in Figure 2. We can clearly observe that the computational cost
of a single PGD iteration is much higher than the cost of a FW variant step. The FW variants tested
complete $10^3$ iterations in around $120$ seconds, while the PGD algorithm only completes $10^2$ iterations
in a similar time frame. We also observe that the progress per iteration made by each
projection-free variant is smaller than the progress made by PGD, as expected.
Note that minimizing a linear function over the nuclear norm ball (i.e., the LMO computation) amounts to computing the left and right singular vectors associated with the largest singular value, which we do using the $\texttt{ARPACK}$ Julia wrapper in the current example.
On the other hand, projecting onto the nuclear norm ball requires computing a full singular value decomposition. The underlying linear solver can be switched by users developing their own LMO.</p>
<p>The top two figures in Figure 2 present the primal gap of the matrix completion problem objective function in terms of iteration count and wall-clock time.
The two bottom figures show the performance on a test set of entries. Note that the test error stagnates for all methods, as expected.
Even though the training error decreases linearly for PGD for all iterations,
the test error stagnates quickly. The final test error of PGD is about $6\%$ higher than that of the standard FW algorithm, which in turn is about $2\%$ smaller than that of the lazy FW algorithm. We would like to stress, though, that the intention here is primarily to showcase the algorithms; the results are illustrative in nature rather than a proper evaluation with careful hyper-parameter tuning.</p>
<p>Another key aspect of FW algorithms is the sparsity of the provided solutions.
Sparsity in this context refers to a matrix being low-rank.
Although each solution is a dense matrix in terms of non-zeros,
it can be decomposed as a sum of a small number of rank-one terms,
each represented as a pair of left and right vectors.
At each iteration, FW algorithms add at most one rank-one term to the iterate,
thus resulting in a low-rank solution by design.
In our example here, the final FW solution is of rank at most $95$ while the lazified version provides a sparser solution of rank at most $80$. The lower rank of the lazified FW is due to the fact that this algorithm sometimes avoids calling
the LMO if there already exists an atom (here rank-1 factor) in the cache that guarantees enough progress; the higher sparsity might help with interpretability and robustness to noise. In contrast, the solution computed by PGD is of full column rank and even after truncating the spectrum, removing factors with small singular values, it is still of much higher rank than the FW solutions.</p>
<h3 id="exact-optimization-with-rational-arithmetic">Exact optimization with rational arithmetic</h3>
<p>The package allows for exact optimization with rational arithmetic.
For this, it suffices to set up the LMO to be rational and choose
an appropriate step-size rule as detailed below.
For the LMOs included in the package, this simply means initializing the
radius with a rational-compatible element type, e.g., $\texttt{1}$,
rather than a floating-point number, e.g., $\texttt{1.0}$.
Given that numerators and denominators can become quite large in rational arithmetic,
it is strongly advised to base the rationals on extended-precision integer types such as $\texttt{BigInt}$, i.e., to use $\texttt{Rational{BigInt}}$.
For the probability simplex LMO with a rational radius of $\texttt{1}$, the LMO would be created as follows:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lmo</span> <span class="o">=</span> <span class="n">FrankWolfe</span><span class="o">.</span><span class="n">ProbabilitySimplexOracle</span><span class="x">{</span><span class="kt">Rational</span><span class="x">{</span><span class="kt">BigInt</span><span class="x">}}(</span><span class="mi">1</span><span class="x">)</span>
</code></pre></div></div>
<p>As mentioned before, the second requirement ensuring that the computation runs in rational
arithmetic is a rational-compatible step-size rule.
The most basic step-size rule compatible with rational optimization is the $\texttt{agnostic}$ step-size
rule with $\gamma_t = 2/(2+t)$.
With this step-size rule, the gradient does not even need to be rational as long as the atom computed by the LMO
is of a rational type.
Assuming these requirements are met, all iterates and the computed solution will then be rational:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">fill</span><span class="x">(</span><span class="n">big</span><span class="x">(</span><span class="mi">1</span><span class="x">)</span><span class="o">//</span><span class="mi">100</span><span class="x">,</span> <span class="n">n</span><span class="x">)</span>
<span class="c"># equivalent to { 1/100 }^100</span>
</code></pre></div></div>
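To illustrate why the agnostic rule preserves rationality, here is a minimal stdlib-only sketch of a single FW step $x \gets x + \gamma_t(v - x)$ carried out in $\texttt{Rational{BigInt}}$ arithmetic; the vertex $v$ stands in for an atom returned by a rational LMO:

```julia
# One agnostic FW step in exact Rational{BigInt} arithmetic (stdlib only).
n = 100
x = fill(big(1)//100, n)                            # uniform point on the simplex
v = [i == 1 ? big(1)//1 : big(0)//1 for i in 1:n]   # a simplex vertex (atom)
t = 1
γ = big(2)//(2 + t)                                 # agnostic step size: γ = 2//3
x = x + γ * (v - x)                                 # iterate stays exactly rational
sum(x)                                              # exactly 1//1
```

No rounding occurs anywhere in the update, so the computed solution is an exact convex combination of the atoms.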
<p>Another possible step-size rule is $\texttt{rationalshortstep}$ which computes the step size
by minimizing the smoothness inequality as
$\gamma_t = \frac{\langle \nabla f(\mathbf{x}_t), \mathbf{x}_t - \mathbf{v}_t\rangle}{2 L \norm{\mathbf{x}_t - \mathbf{v}_t}^2}$.
However, as this step size depends on an upper bound on the Lipschitz constant $L$ as well as the
inner product with the gradient $\nabla f(\mathbf{x}_t)$, both have to be of a rational type.</p>
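As a toy illustration of the exactness of this rule, one can evaluate the short step rationally for the quadratic $f(x) = \norm{x}^2/2$, for which $\nabla f(x) = x$ and $L = 1$ is a valid smoothness constant; the atom $v$ below is the vertex a simplex LMO would return for this gradient (a hand-constructed example, not package code):

```julia
using LinearAlgebra

# Rational short step γ_t = ⟨∇f(x), x − v⟩ / (2L‖x − v‖²) for f(x) = ‖x‖²/2.
x = [big(3)//4, big(1)//4]
v = [big(0)//1, big(1)//1]            # vertex minimizing ⟨∇f(x), ·⟩ over the simplex
g = x                                 # ∇f(x) = x
L = big(1)//1                         # smoothness constant of f
d = x - v
γ = dot(g, d) / (2 * L * dot(d, d))   # exact rational step size: γ == 1//6
```

Since `g`, `L`, and the atom are all rational, every quantity in the fraction — and hence the step size itself — is computed exactly.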
<h3 id="doubly-stochastic-matrices">Doubly stochastic matrices</h3>
<p>The set of doubly stochastic matrices, or Birkhoff polytope, appears in various combinatorial
problems including matching and ranking.
It is the convex hull of permutation matrices, a property of interest for FW algorithms
because the individual atoms returned by the LMO only have $n$ non-zero entries for
$n\times n$ matrices.
A linear function can be minimized over the Birkhoff polytope using the Hungarian algorithm.
This LMO is substantially more expensive than minimizing a linear function over the $\ell_1$ norm ball,
and thus the algorithm performance benefits from lazification.
We compare the performance of several FW variants in the following example
on $200\times 200$ matrices. The results are presented in Figure 3.</p>
<p>The per-iteration primal value evolution is nearly identical for FW and the lazy cache variants.
We can observe a slower decrease in the first 10 iterations of BCG, for both the primal value
and the dual gap. This initial overhead is, however, compensated after the first iterations:
BCG is the only algorithm terminating with the desired dual gap of $10^{-7}$ rather than with the
iteration limit. In terms of runtime, all lazified variants outperform the standard FW;
the overhead of allocating and managing the cache is compensated by the reduced number of calls to the LMO.</p>
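The caching mechanism behind lazification can be sketched as follows. The names and the progress threshold $\Phi/K$ are illustrative conventions, not the exact $\texttt{FrankWolfe.jl}$ internals:

```julia
using LinearAlgebra

# Lazification sketch: before calling the expensive LMO, scan a cache of
# previously returned atoms and reuse any that already guarantees enough
# progress ⟨∇f(x), x − v⟩ ≥ Φ/K relative to the dual-gap estimate Φ.
function lazy_lmo!(cache, grad, x, Φ, true_lmo; K=2.0)
    for v in cache                       # cheap pass over cached atoms
        if dot(grad, x - v) >= Φ / K
            return v, Φ                  # cache hit: LMO call avoided
        end
    end
    v = true_lmo(grad)                   # cache miss: one real LMO call
    push!(cache, v)
    gap = dot(grad, x - v)
    return v, min(Φ, gap)                # tighten the dual-gap estimate
end

# Toy usage over the 2-simplex: the LMO returns the vertex minimizing ⟨g, ·⟩.
simplex_lmo(g) = (v = zeros(length(g)); v[argmin(g)] = 1.0; v)
cache = Vector{Vector{Float64}}()
x = [0.5, 0.3, 0.2]
g = [1.0, 2.0, 3.0]
v, Φ = lazy_lmo!(cache, g, x, 1.0, simplex_lmo)
length(cache)  # 1 (the first call misses the empty cache)
```

A second call with the same gradient is served from the cache, which is the source of the runtime advantage reported above when the LMO is expensive.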
<div class="center">
<img src="http://www.pokutta.com/blog/assets/JuliaFrankWolfe/lcg_expensive.png" alt="fig6" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Doubly stochastic matrices results.</p>
<h3 id="references">References</h3>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. In <em>Naval research logistics quarterly</em>, 3(1-2), 95-110.</p>
<p>[LP] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. In <em>USSR Computational mathematics and mathematical physics</em>, 6(5), 1-50.</p>
<p>[CP] Combettes, C. W., & Pokutta, S. (2021). Complexity of linear minimization and projection on some sets. In <em>arXiv preprint arXiv:2101.10040</em>. <a href="https://arxiv.org/pdf/2101.10040.pdf">pdf</a></p>
<p>[BCP] Besançon, M., Carderera, A., & Pokutta, S. (2021). FrankWolfe.jl: a high-performance and flexible toolbox for Frank-Wolfe algorithms and Conditional Gradients. In <em>arXiv preprint arXiv:2104.06675</em>. <a href="https://arxiv.org/abs/2104.06675">pdf</a></p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. In <em>Mathematical Programming</em> 35(1) (pp. 110–119). Springer. <a href="http://www.iro.umontreal.ca/~marcotte/ARTIPS/1986_MP.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In <em>Advances in Neural Information Processing Systems</em> 2015 (pp. 496-504). <a href="https://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[BPZ] Braun, G., Pokutta, S., & Zink, D. (2017). Lazifying Conditional Gradient Algorithms. In <em>Proceedings of the 34th International Conference on Machine Learning</em>. <a href="http://proceedings.mlr.press/v70/braun17a/braun17a.pdf">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditional gradients. In <em>International Conference on Machine Learning</em> (pp. 735-743). PMLR. <a href="https://arxiv.org/abs/1805.07311">pdf</a></p>
<p>[HL] Hazan, E. & Luo, H. (2016). Variance-reduced and projection-free stochastic optimization. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>. <a href="https://arxiv.org/pdf/1602.02101.pdf">pdf</a></p>
<p>[PNAM] Pedregosa, F., Negiar, G., Askari, A., & Jaggi, M. (2020). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. In <em>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</em>. <a href="http://proceedings.mlr.press/v108/pedregosa20a/pedregosa20a-supp.pdf">pdf</a></p>
<p><em>Written by Alejandro Carderera and Mathieu Besançon.</em></p>
<h2 id="linear-bandits-on-uniformly-convex-sets">Linear Bandits on Uniformly Convex Sets (2021-04-03)</h2>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2103.05907">Linear Bandits on Uniformly Convex Sets</a> by Thomas Kerdreux, Christophe Roux, Alexandre d’Aspremont, and Sebastian Pokutta. We show that the strong convexity of the action set $\mathcal{K}\subset\mathbb{R}^n$ in the context of linear bandits leads to a gain of a factor of $\sqrt{n}$ in the pseudo-regret bounds. This improvement was previously known in only two settings: when $\mathcal{K}$ is the simplex or an $\ell_p$ ball with $p\in]1,2]$ [BCY]. When the action set is $q$-uniformly convex (with $q\geq 2$) but not necessarily strongly convex, we obtain pseudo-regret bounds of the form $\mathcal{O}(n^{1/q}T^{1/p})$ (with $1/p+1/q=1$), i.e., with a dimension dependency smaller than $\sqrt{n}$.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux.</em></p>
<p>In this post, we continue our journey toward analyzing Machine Learning algorithms according to the constraint sets’ (in the context of optimization) or action sets’ (in the context of online learning or bandits) structural properties. In our recent paper [KDPa], we focused on the case of projection-free optimization and proved accelerated convergence rates when the set is (locally or globally) uniformly convex. This allowed us to understand better how to manipulate such structures and come up with local set assumptions. We detail these structural assumptions in [KDPb] and provide other connections between the uniform convexity of the set and problems in Machine Learning, <em>e.g.</em>, with generalization bounds. Here we focus on the linear bandit setting. We design and analyze algorithms that are not projection-free. Let us now first recall the linear bandit setting.</p>
<h3 id="linear-bandits">Linear Bandits</h3>
<p class="mathcol"><strong>Linear Bandit Setting</strong> <br />
<em>Input:</em> Consider a compact convex action set $\mathcal{K}\subset\mathbb{R}^n$. <br />
For $t=0, 1, \ldots, T $ do: <br />
$\qquad$ Nature decides on a loss vector $c_t\in\mathcal{K}^\circ$. <br />
$\qquad$ The bandit algorithm picks an action $a_t\in\mathcal{K}$. <br />
$\qquad$ The bandit observes the cost $\langle c_t; a_t\rangle$ of its action but not $c_t$. <br /></p>
<p>The goal of the bandit algorithm is to incur the smallest cumulative cost $\sum_{t=1}^{T}\langle c_t; a_t\rangle$. The bandit framework is also known as the <em>partial-information</em> setting (as opposed to the <em>full-information</em> setting of online learning) because the algorithm does not have access to the entire $c_t$ to update its strategy. The performance of a bandit algorithm is then measured via the regret $R_T$,</p>
\[\tag{Regret}
R_T = \sum_{t=1}^{T} \langle c_t; a_t\rangle - \underset{a\in\mathcal{K}}{\text{min }} \sum_{t=1}^{T} \langle c_t; a_t\rangle,\]
<p>where the second term represents the cost of playing the <em>single best action in hindsight</em>. Theoretical upper bounds on the regret $R_T$ then serve as a design criterion for bandit algorithms. Since bandit algorithms all employ internal randomization procedures, we consider expected or high-probability regret bounds with respect to this internal randomness. However, these types of bounds remain challenging to obtain, and <em>pseudo-regret</em> bounds are often considered an important first step. Denoting by $\mathbb{E}$ the expectation with respect to the bandit randomness, the pseudo-regret $\bar{R}_T$ is defined as</p>
\[\tag{Pseudo-Regret}
\bar{R}_T = \mathbb{E}\Big(\sum_{t=1}^{T}\langle c_t; a_t\rangle\Big) - \underset{a\in\mathcal{K}}{\text{min }} \mathbb{E}\Big(\sum_{t=1}^{T} \langle c_t; a_t\rangle\Big).\]
<p>In the <em>linear</em> bandit setting, the algorithm can solely leverage the structure of the constraint set to achieve accelerated regret bounds. Indeed, the loss is linear so that there is no functional lower-curvature assumption, <em>e.g.</em>, no strong convexity. For a general compact convex set, the pseudo-regret bound is $\tilde{\mathcal{O}}(n\sqrt{T})$, and we are aware of better pseudo-regret bounds of $\tilde{\mathcal{O}}(\sqrt{nT})$ only when the action set $\mathcal{K}$ is the simplex or an $\ell_p$ ball with $p\in]1,2]$ [BCB,BCY]. Is there a more general mechanism that explains the accelerated regret bounds for the $\ell_p$ balls? What happens when $p>2$?</p>
<h3 id="preliminaries">Preliminaries</h3>
<p>We need to introduce a few simple notions of functional analysis and convex geometry. Although we do not enter the proof’s technical details in this post, we try to convey the core ingredient on which the proof relies. For $p\in]1,2]$, a convex differentiable function $f$ is <em>$(L, p)$-Hölder smooth</em> on $\mathcal{K}$ with respect to a norm $\norm{\cdot}$ if and only if for any $(x,y)\in\mathcal{K}\times\mathcal{K}$</p>
\[\tag{Hölder-Smoothness}
f(y) \leq f(x) + \langle \nabla f(x); y-x \rangle + \frac{L}{p} \norm{y-x}^p.\]
<p>It generalizes the classical notion of $L$-smoothness of the function. This assumption will play a role because it is dual to uniform convexity, <em>i.e.</em>, when a function is strongly convex then its Fenchel conjugate is smooth. The <em>Bregman divergence</em> of $F: \mathcal{D}\rightarrow\mathbb{R}$ is defined for $(x,y)\in\bar{\mathcal{D}}\times\mathcal{D}$ by</p>
\[\tag{Bregman-Divergence}
D_F(x,y) = F(x) - F(y) - \langle x-y ; \nabla F(y)\rangle.\]
<p>The key to our pseudo-regret upper bound analysis is to relate the set’s uniform convexity to the Hölder smoothness of a function related to $\mathcal{K}$. This then allows us to upper bound a Bregman divergence that naturally arises in the computations.</p>
<p>Before going on with introducing the uniform convexity properties of the set, let us first explain the requirement that $c\in\mathcal{K}^\circ$. We write $\mathcal{K}^\circ$ for the <em>polar</em> of $\mathcal{K}$ and it is defined as follows</p>
\[\tag{Polar}
\mathcal{K}^\circ := \big\{ c\in\mathbb{R}^n ~|~ \langle c; a\rangle \leq 1~\forall a\in\mathcal{K} \big\}.\]
<p>In other words, constraining Nature to pick a loss vector $c\in\mathcal{K}^\circ$ ensures that whatever the action $a\in\mathcal{K}$ taken by the bandit, it will incur a bounded loss $\langle c; a\rangle\leq 1$.</p>
<p>Finally, both the algorithm and the analysis of our pseudo-regret bounds rely on the notion of gauge, which is essentially an extension of a norm. For a compact convex set $\mathcal{K}$, the <em>gauge</em> $\norm{\cdot}_{\mathcal{K}}$ of $\mathcal{K}$ is defined at $x\in\mathbb{R}^n$ as</p>
\[\tag{Gauge of $\mathcal{K}$}
\|x\|_\mathcal{K} := \text{inf}\{\lambda>0~|~ x\in\lambda\mathcal{K}\}.\]
<p>When $\mathcal{K}$ is centrally symmetric and contains $0$ in its interior, then the gauge of $\mathcal{K}$ is a norm and $\mathcal{K}$ is the unit ball of its norm (indeed $\norm{x}_\mathcal{K} \leq 1 \Leftrightarrow x\in\mathcal{K}$). The gauge function is hence a natural way to associate a function to a set.</p>
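For norm balls the gauge has a closed form: if $\mathcal{K}$ is the $\ell_1$ ball of radius $r$, then $\norm{x}_{\mathcal{K}} = \norm{x}_1/r$. A one-line stdlib sketch (the function name is ours):

```julia
using LinearAlgebra

# Gauge of the ℓ1 ball of radius r: inf{λ > 0 : x ∈ λK} is attained at ‖x‖₁/r.
gauge_l1_ball(x, r) = norm(x, 1) / r

x = [1.0, -2.0, 3.0]
gauge_l1_ball(x, 2.0)        # 3.0: x lies on the boundary of 3K, so x ∉ K
```

The membership test $\norm{x}_{\mathcal{K}} \leq 1 \Leftrightarrow x \in \mathcal{K}$ mentioned above then reduces to comparing this value with $1$.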
<h3 id="action-set-assumptions">Action Set Assumptions</h3>
<p>A closed set $\mathcal{C}\subset\mathbb{R}^d$ is <em>$(\alpha, q)$-uniformly convex</em> with respect to a norm $\norm{\cdot}$, if for any $x,y \in \mathcal{C}$, any $\eta\in[0,1]$ and any $z\in\mathbb{R}^d$ with $\norm{z} = 1$, we have</p>
\[\tag{Uniform Convexity}
\eta x + (1-\eta) y + \eta (1 - \eta ) \alpha \norm{x-y}^q z \in \mathcal{C}.\]
<p>At a high level, this property is a global quantification of the set’s curvature that subsumes strong convexity. In finite-dimensional spaces, the $\ell_p$ balls are a fundamental example: for $p\in ]1,2]$, the $\ell_p$ balls are strongly convex (i.e., $(\alpha,2)$-uniformly convex), while for $p>2$ they are $p$-uniformly convex but not strongly convex. $p$-Schatten norms with $p>1$ and various group norms are also typical examples.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/linearBandit/list_ball.png" alt="various l_p balls" style="float:center; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Examples of $\ell_p$ balls.</p>
<p>As outlined in [KDPa,KDPb], scaling inequalities offer an interesting equivalent definition of $(\alpha,q)$-uniform convexity. Namely, $\mathcal{K}$ is $(\alpha, q)$-uniformly convex with respect to $\norm{\cdot}$ if and only if for any $x\in\partial\mathcal{K}$, \(c \in N_{\mathcal{K}}(x) := \big\{c\in\mathbb{R}^n \mid \langle c; x-y\rangle \geq 0 \ \forall y\in\mathcal{K}\big\}\) (the <em>normal cone</em>) and $y\in\mathcal{K}$, we have</p>
\[\tag{Scaling Inequality}
\langle c; x- y\rangle \geq \alpha \|c\|_\star \|x-y\|^q.\]
<p>A natural question arises when using this <em>less studied</em> notion of uniform convexity for sets (as opposed to its functional counterpart): does the uniform convexity of the set translate into a uniform convexity property of the gauge function?</p>
<p>It does and the result is quite classical. We survey such results in [KDPb]. Let us recall it here:</p>
\[\mathcal{K} \text{ is } (\alpha, q)\text{-uniformly convex} \Leftrightarrow \norm{\cdot}^q_{\mathcal{K}} \text{ is } (\alpha^\prime, q)\text{-uniformly convex for some } \alpha^\prime>0.\]
<p>Note also that the choice to constrain the loss vector $c_t\in\mathcal{K}^\circ$ now allows us to easily manipulate the gauge function and its dual. Indeed, we have that \(\norm{\cdot}_{\mathcal{K}}^\star = \norm{\cdot}_{\mathcal{K}^\circ}\). We can then also manipulate the Fenchel conjugate function (beware, however, that dual norm and Fenchel conjugate of the norm are not equal) of \(\norm{\cdot}^q_{\mathcal{K}}\) to link it with a power of \(\norm{\cdot}_{\mathcal{K}^\circ}\). Without entering into technical details, the high-level idea is that the uniform convexity of $\mathcal{K}$ ensures the uniform convexity of a power of the gauge function of \(\norm{\cdot}_{\mathcal{K}}\) and ultimately the Hölder Smoothness of the Fenchel conjugate of a power of the gauge function of \(\norm{\cdot}_{\mathcal{K}^\circ}\).</p>
<h3 id="bandit-algorithm-on-uniformly-convex-sets-and-pseudo-regret-bounds">Bandit Algorithm on Uniformly Convex Sets and Pseudo-Regret Bounds</h3>
<p>Similarly to [BCB,BCY], we apply a bandit version of Online Stochastic Mirror Descent. The sole difference is that we consider a specific barrier function $F_{\mathcal{K}}$ for the uniformly convex action set $\mathcal{K}$, and we account for the reference radius, <em>i.e.</em>, the $r>0$ such that $\ell_1(r)\subset\mathcal{K}$. For $x\in\text{Int}(\mathcal{K})$ the barrier function is defined as follows</p>
\[\tag{Barrier Function}
F_{\mathcal{K}}(x) := - \ln(1-\norm{x}_{\mathcal{K}}) - \norm{x}_{\mathcal{K}}.\]
<p>Here, we do not detail the action’s sampling scheme; see the details in [KCDP]. Note that the algorithm is adaptive because it does not require knowledge of the parameter of uniform convexity.</p>
<p class="mathcol"><strong>Algorithm 1: Linear Bandit Mirror Descent</strong> <br />
<em>Input:</em> $\eta>0$, $\gamma\in]0,1[$, $\mathcal{K}$ smooth and strictly convex such that \(\ell_1(r)\subset\mathcal{K}\). <br />
<em>Initialize:</em> \(x_1\in\text{argmin}_{x\in(1-\gamma)\mathcal{K}}F_{\mathcal{K}}(x)\) <br />
For $t=1, \ldots, T$ do: <br />
$\qquad$ Sample $a_t\in\mathcal{K}$ $\qquad \vartriangleright$ Bandit internal randomization. <br />
$\qquad$ \(\tilde{c}_t \gets \frac{n}{r^2} (1-\xi_t) \frac{\langle a_t; c_t\rangle }{1-\norm{x_t}_{\mathcal{K}}}a_t\) $\qquad \vartriangleright$ Estimate loss vector <br />
$\qquad$ \(x_{t+1} \gets \underset{y\in(1-\gamma)\mathcal{K}}{\text{argmin }} D_{F_{\mathcal{K}}}\big(y, \nabla F_{\mathcal{K}}^*(\nabla F_{\mathcal{K}}(x_t)- \eta \tilde{c}_t)\big)\) $\qquad \vartriangleright$ Mirror Descent step</p>
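To illustrate the last update, here is a minimal sketch of a mirror-descent step instantiated with the textbook negative-entropy mirror map on the probability simplex — an illustrative substitution, since the paper uses the gauge barrier $F_{\mathcal{K}}$, whose Bregman projection does not have this closed form. With negative entropy, applying $\nabla F^*(\nabla F(x_t) - \eta \tilde{c}_t)$ followed by the Bregman projection reduces to a multiplicative-weights step:

```julia
# Mirror-descent step with the negative-entropy mirror map F(x) = Σᵢ xᵢ log xᵢ
# on the simplex; chosen for illustration, NOT the gauge barrier of the paper.
function mirror_descent_step(x, ct, η)
    y = x .* exp.(-η .* ct)   # ∇F*(∇F(x) − η·ct), up to normalization
    return y ./ sum(y)        # Bregman projection back onto the simplex
end

x = fill(1/3, 3)              # uniform initial action distribution
ct = [1.0, 0.0, 0.0]          # estimated loss penalizing the first action
x = mirror_descent_step(x, ct, 0.5)
# probability mass shifts away from the penalized action: x[1] < x[2] == x[3]
```

The structure of Algorithm 1 is identical; only the mirror map (and hence the geometry the update adapts to) differs.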
<p>We now state our pseudo-regret bound for Algorithm 1 when the set $\mathcal{K}$ is strongly convex.</p>
<p class="mathcol"><strong>Theorem 1: Linear Bandit on Strongly Convex Set</strong> <br />
Consider a compact convex set $\mathcal{K}$ that is centrally symmetric with non-empty interior. Assume $\mathcal{K}$ is a smooth and $\alpha$-strongly convex set with respect to \(\norm{\cdot}_{\mathcal{K}}\) and that \(\ell_2(r)\subset \mathcal{K} \subset \ell_{\infty}(R)\) for some \(r,R>0\).
Consider running Algorithm 1 with the barrier function \(F_{\mathcal{K}}(x)=-\ln\big(1-\norm{x}_{\mathcal{K}}\big) - \norm{x}_{\mathcal{K}}\), and \(\eta=\frac{1}{\sqrt{nT}}\), \(\gamma=\frac{1}{\sqrt{T}}\). Then, for \(T\geq 4n\big(\frac{R}{r}\big)^2\) we have
\[
\bar{R}_T \leq \sqrt{T} + \sqrt{nT}\ln(T)/2 + L\sqrt{nT} = \tilde{\mathcal{O}}(\sqrt{nT}),
\]
where $L=(R/r)^2(5\alpha + 4)/\alpha$.</p>
<p>In this blog post, we do not detail the technical proof. The core idea is to leverage the strong convexity of \(\mathcal{K}\) by noting that it implies the smoothness of \(\frac{1}{2}\norm{\cdot}_{\mathcal{K}^\circ}^2\) on $\mathcal{K}$. This provides an upper bound on the Bregman divergence of \(\frac{1}{2}\norm{\cdot}_{\mathcal{K}^\circ}^2\), a crucial term that emerges when we carefully factorize the terms in the upper bound on the progress of a Mirror Descent step in Algorithm 1. We hence obtain pseudo-regret bounds in \(\tilde{\mathcal{O}}(\sqrt{nT})\) for a generic family of sets. Such accelerated pseudo-regret bounds were previously known only in the case of the simplex or the $\ell_p$ ball with $p\in]1,2]$.</p>
<p>We obtain a more generic version when the action set is $(\alpha,q)$-uniformly convex with $q \geq 2$.</p>
<p class="mathcol"><strong>Theorem 2: Linear Bandit on Uniformly Convex Set</strong> <br />
Let $\alpha>0$, $q\geq 2$, and $p\in]1,2]$ such that $1/p + 1/q=1$, and consider a compact convex set $\mathcal{K}$ that is centrally symmetric with non-empty interior. Assume $\mathcal{K}$ is a smooth and $(\alpha, q)$-uniformly convex set with respect to \(\norm{\cdot}_{\mathcal{K}}\) and that \(\ell_q(r)\subset \mathcal{K} \subset \ell_{\infty}(R)\) for some $r,R>0$. Consider running Algorithm 1 with the barrier function \(F_{\mathcal{K}}(x)=-\ln\big(1-\norm{x}_{\mathcal{K}}\big) - \norm{x}_{\mathcal{K}}\), and $\eta=1/(n^{1/q}T^{1/p})$, $\gamma=1/\sqrt{T}$. Then for $T\geq 2^p n \big(\frac{R}{r}\big)^p$ we have
\[
\bar{R}_T \leq \sqrt{T} + n^{1/q} T^{1/p} \ln(T)/2 + ((1/2)^{2-p} + L) \Big(\frac{R}{r}\Big)^p n^{1/q} T^{1/p} = \tilde{\mathcal{O}}(n^{1/q} T^{1/p}),
\]
where $L=2p(1 + (q/(2\alpha))^{1/(q-1)})$.</p>
<p>Here, the rate for uniformly convex sets is not an interpolation between the $\tilde{\mathcal{O}}(\sqrt{nT})$ rate for strongly convex sets and the $\tilde{\mathcal{O}}(n\sqrt{T})$ rate for general compact convex sets. Another trade-off appears: the dimension dependence of the pseudo-regret bound can be arbitrarily smaller than $\sqrt{n}$, while the rate in terms of the number of iterations can get arbitrarily close to $\mathcal{O}(T)$.</p>
<h2 id="conclusion">Conclusion</h2>
<p>When the action set is strongly convex, we design a barrier function leading to a bandit algorithm with pseudo-regret in $\tilde{\mathcal{O}}(\sqrt{nT})$. We hence drastically extend the family of action sets for which such pseudo-regret bounds hold, answering an open question of [BCB]. To our knowledge, a $\tilde{\mathcal{O}}(\sqrt{nT})$ bound was known only when the action set is a simplex or an $\ell_p$ ball with $p\in]1,2]$.</p>
<p>When the set is $(\alpha, q)$-uniformly convex with $q\geq 2$, we assume in Theorem 1 and 2 that $\ell_q(r)$ is contained in the action set $\mathcal{K}$. It is restrictive but allows us to first prove improved pseudo-regret bounds outside the explicit $\ell_p$ case. Removing this assumption is an interesting research direction.
However, it is not clear that the current classical algorithmic scheme with a barrier function is best adapted to leverage the strong convexity of the action set. Indeed, in the case of online linear learning, [HLGS] show that the simple Follow-The-Leader (FTL) strategy already obtains accelerated regret bounds.</p>
<p>At a high level, this work is an example of the favorable dimension dependency afforded by uniform convexity assumptions on the set in pseudo-regret bounds, which is crucial for large-scale machine learning.
Besides, uniform convexity structures for sets are much less developed and understood than their functional counterparts; see, <em>e.g.</em>, [KDPb]. Arguably, this stems from a tendency in machine learning to treat constraints as theoretically interchangeable with penalization. This is often not quite accurate in terms of convergence results, and the algorithmic strategies developed for the two differ. The linear bandit setting is a simple example where this symmetry structurally breaks down.</p>
<h3 id="references">References</h3>
<p>[BCB] Bubeck, Sébastien, Cesa-Bianchi, Nicolò. “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems”. <em>Foundations & Trends in Machine Learning</em>. 2012. <a href="https://arxiv.org/abs/1204.5721">pdf</a></p>
<p>[BCY] Bubeck, Sébastien, Michael Cohen, and Yuanzhi Li. “Sparsity, variance and curvature in multi-armed bandits.” <em>Algorithmic Learning Theory</em>. PMLR, 2018. <a href="https://arxiv.org/abs/1711.01037">pdf</a></p>
<p>[HLGS] Huang, R., Lattimore, T., György, A., and Szepesvári, C. “Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities”. <em>The Journal of Machine Learning Research</em>, <em>18</em>(1), 2017. <a href="https://jmlr.org/papers/v18/17-079.html">pdf</a></p>
<p>[KCDP] Kerdreux, Thomas, Christophe Roux, Alexandre d’Aspremont, and Sebastian Pokutta. “Linear Bandit on uniformly convex sets.” <a href="https://arxiv.org/abs/2103.05907">pdf</a></p>
<p>[KDPa] Kerdreux, Thomas, Alexandre d’Aspremont, and Sebastian Pokutta. “Projection-free optimization on uniformly convex sets.” <em>AISTATS</em>. 2021. <a href="https://arxiv.org/abs/2004.11053">pdf</a></p>
<p>[KDPb] Kerdreux, Thomas, Alexandre d’Aspremont, and Sebastian Pokutta. “Local and Global Uniform Convexity Conditions.” 2021. <a href="https://arxiv.org/abs/2102.05134">pdf</a></p>
<h2 id="cindy-conditional-gradient-based-identification-of-non-linear-dynamics">CINDy: Conditional gradient-based Identification of Non-linear Dynamics (2021-01-16)</h2>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2101.02630">CINDy: Conditional gradient-based Identification of Non-linear Dynamics – Noise-robust recovery</a> by <a href="https://alejandro-carderera.github.io/">Alejandro Carderera</a>, <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, <a href="https://www.mi.fu-berlin.de/en/math/groups/biocomputing/people/professors/christof_schuette.html">Christof Schütte</a> and <a href="https://www.zib.de/weiser/">Martin Weiser</a>, in which we propose the use of a Conditional Gradient algorithm (more concretely the <a href="https://arxiv.org/abs/1805.07311">Blended Conditional Gradients</a> [BPTW] algorithm) for the sparse recovery of a dynamic.
In the presence of noise, the proposed algorithm presents superior sparsity-inducing properties, while ensuring a higher recovery accuracy, compared to other existing methods in the literature, most notably the popular <a href="https://arxiv.org/abs/1509.03580">SINDy</a> [BPK] algorithm, based on a sequentially-thresholded least-squares approach.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>A large number of humankind’s scientific breakthroughs have been fueled by our ability to describe natural phenomena in terms of differential equations. These equations give us a condensed representation of the underlying dynamics and have helped build our understanding of natural phenomena in many scientific disciplines.</p>
<p>The modern age of Machine Learning and Big Data has heralded an era of data-driven models, in which
the phenomena we explain are described in terms of statistical relationships and data. Given sufficient
data, we can train neural networks to classify or to predict with high accuracy, without the underlying
model having any apparent knowledge of how the data was generated or of its structure. This makes
classifying or predicting on out-of-sample data particularly challenging. Due to this, there has
been a recent surge of interest in recovering the differential equations from which the data, often coming from
a physical system, have been generated. This enables us to better understand how the data is generated and
to predict more accurately on out-of-sample data.</p>
<h2 id="learning-sparse-dynamics">Learning sparse dynamics</h2>
<p>Many physical systems can be described in terms of ordinary
differential equations of the form
$\dot{x}(t) = F\left(x(t)\right)$, where $x(t) \in \mathbb{R}^d$ denotes
the state of the system at time $t$ and $F: \mathbb{R}^d \rightarrow \mathbb{R}^d$ can usually be expressed as a linear combination of simpler <em>ansatz functions</em> $\psi_i: \mathbb{R}^d \rightarrow \mathbb{R}$ belonging to a dictionary \(\mathcal{D} = \left\{\psi_i \mid 1 \leq i \leq n
\right\}\). This allows us to express the dynamic followed by the
system as
$\dot{x}(t) = F\left(x(t)\right) = \Xi^T \bm{\psi}(x(t))$ where
$\Xi \in \mathbb{R}^{n \times d}$ is a – typically sparse – matrix
$\Xi = \left[\xi_1, \cdots, \xi_d \right]$ formed by column vectors
$\xi_i \in \mathbb{R}^n$ for $1 \leq i \leq d$ and
$\bm{\psi}(x(t)) = \left[ \psi_1(x(t)), \cdots, \psi_n(x(t))
\right]^T \in \mathbb{R}^{n}$. We can therefore write:</p>
\[\dot{x}(t) = \begin{bmatrix}
\rule{.5ex}{2.5ex}{0.5pt} & \xi_1 & \rule{.5ex}{2.5ex}{0.5pt}\\
& \vdots & \\
\rule{.5ex}{2.5ex}{0.5pt} & \xi_d & \rule{.5ex}{2.5ex}{0.5pt}
\end{bmatrix}
\begin{bmatrix}
\psi_1(x(t)) \\
\vdots \\
\psi_n(x(t))
\end{bmatrix}.\]
<p>In the absence of noise, if we are given a series of data points from the physical system \(\left\{ x(t_i), \dot{x}(t_i) \right\}_{i=1}^m\), then we know that:</p>
\[\begin{bmatrix}
\rule{-1ex}{0.5pt}{2.5ex} & & \rule{-1ex}{0.5pt}{2.5ex}\\
\dot{x}(t_1) & \cdots & \dot{x}(t_m)\\
\rule{-1ex}{0.5pt}{2.5ex} & & \rule{-1ex}{0.5pt}{2.5ex}
\end{bmatrix} =
\begin{bmatrix}
\rule{.5ex}{2.5ex}{0.5pt} & \xi_1 & \rule{.5ex}{2.5ex}{0.5pt}\\
& \vdots & \\
\rule{.5ex}{2.5ex}{0.5pt} & \xi_d & \rule{.5ex}{2.5ex}{0.5pt}
\end{bmatrix}
\begin{bmatrix}
\rule{-1ex}{0.5pt}{2.5ex} & & \rule{-1ex}{0.5pt}{2.5ex}\\
\bm{\psi}\left(x(t_1)\right) & \cdots & \bm{\psi}\left(x(t_m)\right)\\
\rule{-1ex}{0.5pt}{2.5ex} & & \rule{-1ex}{0.5pt}{2.5ex}
\end{bmatrix}.\]
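As a concrete instance of this linear-in-dictionary form, consider the harmonic oscillator $\dot{x}_1 = x_2$, $\dot{x}_2 = -x_1$ with the dictionary $\mathcal{D} = \{\psi_1(x) = x_1, \psi_2(x) = x_2\}$, for which the true coefficient matrix $\Xi$ is sparse (a hand-constructed toy example):

```julia
# Sketch of ẋ = Ξᵀ ψ(x) for the harmonic oscillator ẋ₁ = x₂, ẋ₂ = −x₁
# with dictionary D = {ψ₁ = x₁, ψ₂ = x₂}, so Ξ = [0 −1; 1 0].
ψ(x) = [x[1], x[2]]                  # dictionary evaluation
Ξ = [0.0 -1.0;
     1.0  0.0]
F(x) = Ξ' * ψ(x)                     # reconstructed dynamic ẋ = F(x)

x = [0.6, 0.8]
F(x)                                  # [0.8, -0.6], matching ẋ₁ = x₂, ẋ₂ = −x₁
```

Sparse recovery of the dynamic amounts to identifying the two non-zero entries of $\Xi$ from samples $(x(t_i), \dot{x}(t_i))$.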
<p>If we collect the data in matrices $\dot{X} = \left[ \dot{x}(t_1),\cdots, \dot{x}(t_m)\right] \in\mathbb{R}^{d\times m}$, $\Psi\left(X\right) = \left[ \bm{\psi}(x(t_1)),\cdots, \bm{\psi}(x(t_m))\right]\in\mathbb{R}^{n\times m}$, we can try to recover the underlying sparse dynamic by attempting to solve:</p>
\[\min\limits_{\dot{X} = \Omega^T \Psi(X)} \left\| \Omega\right\|_0.\]
<p>Unfortunately, the aforementioned problem is a notoriously difficult NP-hard
combinatorial problem, due to the presence of the $\ell_0$ norm in
the objective function of the problem. Moreover, if the
data points are contaminated by noise, leading to noisy matrices
$\dot{Y}$ and $\Psi(Y)$, depending on the
expressive power of the basis functions $\psi_i$ for
$1\leq i \leq n$, it may not even be possible (or desirable) to
satisfy $\dot{Y} = \Omega^T \Psi(Y)$ for any $\Omega \in \mathbb{R}^{n\times
d}$. Thus one can attempt to <em>convexify</em> the problem, replacing the $\ell_0$ norm (which is technically not a norm) with the $\ell_1$ norm. That is, we solve, for a suitably chosen $\epsilon >0$,</p>
\[\tag{BPD}
\min\limits_{ \left\|\dot{Y} - \Omega^T \Psi(Y) \right\|^2_F \leq \epsilon } \left\|\Omega\right\|_{1,1}. \label{eq:l1_minimization_noisy2}\]
<p>This formulation is known as <em>Basis Pursuit Denoising</em> (BPD) [CDS]; it was initially developed in the signal processing community and is intimately tied to the <em>Least Absolute Shrinkage and Selection Operator</em> (LASSO) regression formulation [T] developed in the statistics community. The latter formulation, which we will use for this problem, takes the form:</p>
\[\tag{LASSO}
\min\limits_{ \left\|\Omega\right\|_{1,1} \leq \tau} \left\|\dot{Y} - \Omega^T \Psi(Y) \right\|^2_F.\]
<p>Both problems shown in (BPD) and
(LASSO) have a convex objective function and
a convex feasible region, which allows us to use the powerful tools
and guarantees of convex optimization. Moreover, there is a significant body of theoretical
literature, both from the statistics and the signal processing
community, on the conditions for which we can successfully recover the
support of $\Xi$ (see e.g., [W]), the uniqueness of the
LASSO solutions (see e.g., [T2]), or the robust
reconstruction of phenomena from incomplete data
(see e.g., [CRT]), to name but a few results.</p>
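<p>As an illustration of the convexified problem, the following numpy sketch applies a proximal-gradient (ISTA-style) method, a close relative of the FISTA algorithm benchmarked later in this post, to the penalized form \(\min_\Omega \|\dot{Y} - \Omega^T\Psi(Y)\|_F^2 + \lambda\|\Omega\|_{1,1}\) rather than the constrained one; the toy system and the value of $\lambda$ are illustrative assumptions.</p>

```python
import numpy as np

def soft_threshold(W, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_{1,1}."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def ista_sparse_dynamics(Psi, X_dot, lam=1e-3, iters=5000):
    """Proximal-gradient (ISTA-style) sketch for the penalized problem
    min_Omega ||X_dot - Omega^T Psi||_F^2 + lam * ||Omega||_{1,1}."""
    n, d = Psi.shape[0], X_dot.shape[0]
    L = 2.0 * np.linalg.norm(Psi @ Psi.T, 2)        # Lipschitz constant of the gradient
    Omega = np.zeros((n, d))
    for _ in range(iters):
        grad = -2.0 * Psi @ (X_dot - Omega.T @ Psi).T
        Omega = soft_threshold(Omega - grad / L, lam / L)
    return Omega

# Recover the toy dynamic x_dot = -x^3 from slightly noisy data:
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
Psi = np.vstack([np.ones_like(x), x, x**2, x**3])
X_dot = (-x**3 + 1e-3 * rng.standard_normal(x.size))[None, :]
Omega = ista_sparse_dynamics(Psi, X_dot)            # dominant entry: the x^3 coefficient
```

<p>With a small penalty the recovered coefficient matrix is close to the ordinary least-squares solution; larger $\lambda$ trades accuracy for sparsity.</p>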
<h3 id="incorporating-structure-into-the-learning-problem">Incorporating structure into the learning problem</h3>
<p>Conservation laws are a fundamental pillar of our understanding of physical
systems. Imposing these laws through (symmetry) constraints in our sparse regression problem can
potentially lead to better generalization performance under noise,
reduced sample complexity, and to learned dynamics that are consistent
with the symmetries present in the real world. In particular, there are two
large classes of structural constraints that can be easily encoded
into our learning problem as linear constraints:</p>
<ol>
<li>Conservation properties: We often observe in dynamical systems that certain relations hold between the elements of $\dot{x}(t)$. Such is the case in chemical reaction dynamics, where if we denote the rate of change of the $i$-th species by $\dot{x}_i(t)$, we might observe relations of the form $a_j\dot{x}_j(t) + a_k\dot{x}_k(t) = 0$ due to mass conservation, which relate the $j$-th and $k$-th species being studied.</li>
<li>Symmetry between variables: One of the key assumptions used in many-particle quantum systems is the fact that the particles being studied are indistinguishable. It therefore makes sense to assume that the effect that the $i$-th particle exerts on the $j$-th particle is the same as the effect that the $j$-th particle exerts on the $i$-th particle. The same can be said in classical mechanics for a collection of identical masses, where each mass is connected to all the other masses through identical springs. These restrictions can also be added to our learning problem as linear constraints.</li>
</ol>
<p>If we were to add $L$ additional linear constraints to the problem in (LASSO) to reflect the underlying structure of the dynamical system through symmetry and conservation, we would arrive at a polytope $\mathcal{P}$ of the form</p>
\[\mathcal{P} = \left\{ \Omega \in \mathbb{R}^{n \times d} \mid \left\|\Omega\right\|_{1,1} \leq \tau, \text{trace}( A_l^T \Omega ) \leq b_l, 1 \leq l \leq L \right\},\]
<p>for an appropriately chosen $A_l$ and $b_l$.</p>
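<p>As a toy illustration of how such constraints arise, suppose $d = 2$ species satisfy the mass-conservation relation $\dot{x}_1 + \dot{x}_2 = 0$. Since $\dot{x} = \Omega^T \bm{\psi}(x)$, enforcing the relation for all $x$ amounts to requiring that the two columns of $\Omega$ sum to zero, i.e., one trace constraint per dictionary element. The dimensions and values below are made up for illustration.</p>

```python
import numpy as np

# d = 2 species with mass conservation x1_dot + x2_dot = 0 and an n = 3 term
# dictionary. For x_dot = Omega^T psi(x) to conserve mass for every state x,
# the columns of Omega must satisfy Omega[:, 0] + Omega[:, 1] = 0, i.e.,
# n linear constraints of the form trace(A_i^T Omega) = 0.
n, d = 3, 2
A = []
for i in range(n):
    A_i = np.zeros((n, d))
    A_i[i, 0] = 1.0   # picks out Omega[i, 0]
    A_i[i, 1] = 1.0   # picks out Omega[i, 1]
    A.append(A_i)

# A mass-conserving coefficient matrix satisfies all of the constraints:
Omega = np.array([[0.5, -0.5], [2.0, -2.0], [0.0, 0.0]])
assert all(abs(np.trace(A_i.T @ Omega)) < 1e-12 for A_i in A)
```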
<h3 id="blended-conditional-gradients">Blended Conditional Gradients</h3>
<p>The problem is that in the presence of noise many learning approaches see their sparsity-inducing properties quickly degrade, producing dense dynamics that are far from the true dynamic. This is often what happens with the sequentially-thresholded least-squares algorithm in [BPK], which underlies SINDy. Ideally, we want learning algorithms that are somewhat robust to the presence of noise. Moreover, it would also be advantageous if we could easily incorporate structural linear constraints into the learning problem, as described in the previous section, so that the learned dynamics are consistent with the true dynamic.</p>
<p>For the recovery of sparse dynamics from data, one of the most interesting algorithms in terms of sparsity is the <em>Fully-Corrective Frank–Wolfe</em> (FCFW) algorithm, also known as the Fully-Corrective Conditional Gradient algorithm. At iteration $k$, this algorithm picks up a vertex $V_k$ of the polytope $\mathcal{P}$ using a linear optimization oracle, and reoptimizes over the convex hull of $\mathcal{S}_{k} \cup \{ V_k \}$, the set of vertices picked up in previous iterations together with the new vertex $V_k$. One of the key advantages of requiring a linear optimization oracle, instead of a projection oracle, is that for general polyhedral constraints there are efficient algorithms for solving linear optimization problems, whereas solving the quadratic problem required to compute a projection can be computationally too expensive.</p>
<p class="mathcol"><strong>Fully-Corrective Conditional Gradient (CG) algorithm applied to (LASSO)</strong> <br />
<em>Input:</em> Initial point $\Omega_1 \in \mathcal{P}$ <br />
<em>Output:</em> Point $\Omega_{K+1} \in \mathcal{P}$ <br />
\(\mathcal{S}_{1} \leftarrow \emptyset\) <br />
For \(k = 1, \dots, K\) do: <br />
$\quad \nabla f \left( \Omega_k \right) \leftarrow -2 \Psi(Y) \left(\dot{Y} - \Omega_k^T\Psi(Y) \right)^T$ <br />
$\quad V_k \leftarrow \text{argmin}\limits_{\Omega \in \mathcal{P}} \text{trace}\left(\Omega^T\nabla f \left( \Omega_k \right) \right)$ <br />
\(\quad \mathcal{S}_{k+1}\leftarrow \mathcal{S}_{k} \cup \{ V_k \}\) <br />
\(\quad \Omega_{k+1} \leftarrow \text{argmin}\limits_{\Omega \in \text{conv}\left( \mathcal{S}_{k+1} \right) } \left\|\dot{Y} - \Omega^T \Psi(Y)\right\|^2_F\) <br />
End For</p>
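<p>The following numpy sketch implements a plain Frank–Wolfe variant of the scheme above, with exact line search over the $\ell_{1,1}$ ball; for brevity it drops the fully-corrective reoptimization over \(\text{conv}(\mathcal{S}_{k+1})\), but it still illustrates the rank-one, one-basis-function-per-iteration nature of the LMO. The toy system and the radius $\tau$ are illustrative choices.</p>

```python
import numpy as np

def fw_lasso(Psi, X_dot, tau, iters=50000):
    """Plain Frank-Wolfe with exact line search for
    min_{||Omega||_{1,1} <= tau} ||X_dot - Omega^T Psi||_F^2.
    The fully-corrective reoptimization over conv(S_k) is omitted; this
    sketch only illustrates the sparse, rank-one vertices the LMO returns."""
    n, d = Psi.shape[0], X_dot.shape[0]
    Omega = np.zeros((n, d))
    for _ in range(iters):
        grad = -2.0 * Psi @ (X_dot - Omega.T @ Psi).T           # n x d gradient
        i, j = np.unravel_index(np.argmax(np.abs(grad)), grad.shape)
        V = np.zeros((n, d))
        V[i, j] = -tau * np.sign(grad[i, j])                    # vertex of the l1,1 ball
        D = V - Omega
        denom = 2.0 * np.linalg.norm(D.T @ Psi, "fro") ** 2
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1]:
        gamma = np.clip(-np.trace(D.T @ grad) / denom, 0.0, 1.0)
        Omega = Omega + gamma * D
    return Omega

# Recover the toy dynamic x_dot = -x^3 from noise-free data:
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
Psi = np.vstack([np.ones_like(x), x, x**2, x**3])
X_dot = (-x**3)[None, :]
Omega = fw_lasso(Psi, X_dot, tau=2.0)
```

<p>Every iterate is a convex combination of the rank-one vertices picked up so far, which is precisely the mechanism that encourages sparse recovered dynamics.</p>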
<p>To get a better feel for the sparsity-inducing properties of the FCFW algorithm, note that if the starting point $\Omega_1$ is a vertex of the polytope, then the iterate $\Omega_k$ can be expressed as a convex combination of at most $k$ vertices of $\mathcal{P}$, since the algorithm picks up at most one vertex per iteration. If $\mathcal{P}$ were the $\ell_1$ ball without any additional constraints, the FCFW algorithm would pick up at most one basis function in the $k$-th iteration, as $V_k^T \bm{\psi}(x(t)) = \pm \tau \psi_i(x(t))$ for some $1\leq i\leq n$. This means that if we use a Frank–Wolfe algorithm to solve a problem over the $\ell_1$ ball, we encourage sparsity not only through the regularization provided by the $\ell_1$ ball, but also through the specific nature of the Frank–Wolfe updates, independently of the size of the feasible region. In practice, when using, e.g., early termination due to some stopping criterion, this results in the Frank–Wolfe algorithm producing sparser solutions than projection-based algorithms (such as projected gradient descent, which typically performs dense updates).</p>
<p>Reoptimizing over the set of vertices picked up so far can be an expensive operation, especially if there are many such vertices. An alternative is to carry out these reoptimizations only to $\varepsilon_k$-optimality at iteration $k$. However, this raises the question: how should we choose $\varepsilon_k$ at each iteration $k$ if we want to find an $\varepsilon$-optimal solution to (LASSO)? Computing a solution to accuracy $\varepsilon_k = \varepsilon$ at each iteration might be far too computationally expensive. Conceptually, relatively inaccurate solutions suffice in early iterations, where \(\Omega^{\ast} \notin \text{conv} \left(\mathcal{S}_{k+1}\right)\), and accurate solutions are only required once \(\Omega^{\ast} \in \text{conv} \left(\mathcal{S}_{k+1}\right)\). At the same time, we do not know whether we have already found an \(\mathcal{S}_{k+1}\) such that \(\Omega^{\ast} \in \text{conv} \left(\mathcal{S}_{k+1}\right)\).</p>
<p>The rationale behind the <em>Blended Conditional Gradients</em> (BCG) algorithm [BPTW] is to provide an explicit value for the accuracy $\varepsilon_k$ needed at each iteration, starting with a rather large \(\varepsilon_k\) in early iterations and progressively becoming more accurate when approaching the optimal solution; the process is controlled by an optimality gap measure. In some sense, one can think of BCG as a practical version of FCFW with stronger convergence guarantees and much faster real-world performance.</p>
<p class="mathcol"><strong>CINDy: Blended Conditional Gradient (BCG) algorithm variant applied to (LASSO) problem</strong> <br />
<em>Input:</em> Initial point $\Omega_0 \in \mathcal{P}$ <br />
<em>Output:</em> Point $\Omega_{K+1} \in \mathcal{P}$ <br />
\(\Omega_1 \leftarrow \text{argmin}_{\Omega \in \mathcal{P}} \text{trace}\left(\Omega^T\nabla f \left( \Omega_0 \right) \right)\) <br />
\(\Phi \leftarrow \text{trace} \left( \left( \Omega_0 - \Omega_1\right)^T \nabla f(\Omega_0)\right)/2\) <br />
\(\mathcal{S}_{1} \leftarrow \left\{ \Omega_1 \right\}\) <br />
For \(k = 1, \dots, K\) do: <br />
\(\quad\) Find $\Omega_{k+1} \in \operatorname{conv}(\mathcal{S}_{k})$ such that \(\max_{\Omega \in \mathcal{P}} \text{trace}\left((\Omega_{k+1} -\Omega )^T\nabla f \left( \Omega_{k+1} \right) \right) \leq \Phi\) <br />
\(\quad V_{k+1} \leftarrow \text{argmin}_{\Omega \in \mathcal{P}} \text{trace}\left(\Omega^T\nabla f \left( \Omega_{k+1} \right) \right)\) <br />
\(\quad\) If \(\left( \text{trace}\left( \left( \Omega_{k+1} -V_{k+1}\right)^T \nabla f(\Omega_{k+1})\right) \leq \Phi \right)\) <br />
\(\quad\quad \Phi \leftarrow \text{trace}\left( \left( \Omega_{k+1} -V_{k+1}\right)^T \nabla f(\Omega_{k+1})\right)/2\) <br />
\(\quad\quad \mathcal{S}_{k+1} \leftarrow \mathcal{S}_k\) <br />
\(\quad\quad \Omega_{k+1} \leftarrow \Omega_k\) <br />
\(\quad\) Else <br />
\(\quad\quad\mathcal{S}_{k+1} \leftarrow \mathcal{S}_k \cup \{ V_{k+1} \}\) <br />
\(\quad\quad D_k \leftarrow V_{k + 1} - \Omega_k\) <br />
\(\quad\quad \gamma_k \leftarrow \min\left\{-\frac{1}{2}\text{trace} \left( D_k^T \nabla f \left(
\Omega_k \right) \right)/ \left\| D_k^T
\Psi(Y)\right\|_F^2,1\right\}\) <br />
\(\quad\quad \Omega_{k+1} \leftarrow \Omega_k + \gamma_k D_k\) <br />
\(\quad\) End If <br />
End For</p>
<p>As we will show numerically in the next section, the CINDy algorithm not only produces sparser solutions to the learning problem, but also exhibits higher robustness with respect to noise than other existing approaches. This is in keeping with the law of parsimony (also called <em>Occam’s Razor</em>), which states that the simplest explanation, in our case the sparsest, is usually the right one (or close to the right one!).</p>
<h2 id="numerical-experiments">Numerical experiments</h2>
<p>We benchmark the CINDy algorithm, applied to the LASSO sparse recovery formulation, against the following algorithms. Our main benchmark is the SINDy algorithm; however, we also include two further popular optimization methods for comparison, namely the interior-point method (IPM) in <em>CVXOPT</em> [ADLVSNW] and the FISTA algorithm.</p>
<p>We use CINDy (c) and CINDy to refer to the results achieved by the CINDy algorithm with and without the additional structural constraints arising, e.g., from conservation laws. Likewise, we use IPM (c) and IPM to refer to the results achieved by the IPM algorithm with and without additional constraints. We have not added structural constraints to the formulation in the SINDy algorithm, as there is no straightforward way to include constraints in the original implementation, or to the FISTA algorithm, as we would need to compute non-trivial proximal/projection operators, making the algorithm computationally too expensive.</p>
<p>To benchmark the algorithms we use two different metrics: the <em>recovery error</em>, defined as \(\mathcal{E}_{R} = \left\| \Omega - \Xi \right\|_F\), and the <em>number of extraneous terms</em>, defined as \(\mathcal{S}_E = \left| \left\{ (i,j) \mid \Omega_{i,j} \neq 0,\, \Xi_{i,j} = 0,\, 1 \leq i \leq n,\, 1 \leq j \leq d \right\} \right|\), i.e., the number of terms picked up that do not belong to the dynamic.</p>
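<p>In numpy, the two metrics might be computed as follows; the tolerance for treating an entry as numerically zero is an illustrative choice.</p>

```python
import numpy as np

def recovery_metrics(Omega, Xi, tol=1e-8):
    """Recovery error E_R and number of extraneous terms S_E of a learned
    coefficient matrix Omega with respect to the true dynamic Xi."""
    e_r = np.linalg.norm(Omega - Xi)   # Frobenius norm of the difference
    # Extraneous terms: nonzero in Omega but zero in the true dynamic Xi.
    s_e = int(np.sum((np.abs(Omega) > tol) & (np.abs(Xi) <= tol)))
    return e_r, s_e

Xi = np.array([[0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]])
Omega = np.array([[0.0, 0.9], [0.05, 0.0], [-1.1, 0.0]])
e_r, s_e = recovery_metrics(Omega, Xi)   # s_e counts the spurious 0.05 entry
```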
<h3 id="fermi-pasta-ulam-tsingou-model">Fermi-Pasta-Ulam-Tsingou model</h3>
<p>The Fermi-Pasta-Ulam-Tsingou model describes a one-dimensional system of $d$ identical particles, where neighboring particles are connected by springs and subject to a nonlinear forcing term [FPUT]. This computational model was used at Los Alamos to study the behaviour of complex physical systems over long time periods. The equations of motion that govern the particles, when subjected to cubic forcing terms, are given by
\(\ddot{x}_i = \left(x_{i+1} - 2 x_i + x_{i-1} \right) + \beta \left[ \left( x_{i+1} - x_i \right)^3 - \left( x_{i} - x_{i-1} \right)^3 \right],\)
where $1 \leq i \leq d$ and $x_{i}$ denotes the displacement of the $i$-th particle with respect to its equilibrium position. The exact dynamic $\Xi$ can be expressed using a dictionary of monomials of degree up to three.</p>
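<p>A minimal numpy sketch of the FPUT right-hand side and a few explicit-Euler steps; $\beta$, the step size, and the initial condition are illustrative, and a higher-order integrator would be used in practice.</p>

```python
import numpy as np

def fput_accel(x, beta=0.7):
    """Accelerations of the FPUT chain with cubic forcing; the two fixed
    boundary particles are modeled by zero-padding the displacements."""
    xp = np.concatenate(([0.0], x, [0.0]))               # fixed ends
    lin = xp[2:] - 2.0 * xp[1:-1] + xp[:-2]              # linear spring term
    cub = (xp[2:] - xp[1:-1]) ** 3 - (xp[1:-1] - xp[:-2]) ** 3
    return lin + beta * cub

# A few explicit-Euler steps on (x, v) with x_ddot = fput_accel(x):
d, dt = 10, 1e-3
x = np.sin(np.pi * np.arange(1, d + 1) / (d + 1))        # first normal mode
v = np.zeros(d)
for _ in range(1000):
    x, v = x + dt * v, v + dt * fput_accel(x)
```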
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_FPUT_10_dim_v5.png" alt="fig1" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Sparse recovery of the Fermi-Pasta-Ulam-Tsingou dynamic with $d = 10$.</p>
<p>As we can see in the images, there is a large difference between the CINDy and FISTA algorithms on the one hand, and the remaining algorithms on the other, with the former being up to two orders of magnitude more accurate in terms of $\mathcal{E}_R$, while also producing much sparser solutions, as seen in the image that depicts $\mathcal{S}_E$.</p>
<p>However, what does this difference in recovery error translate to? We can see the difference in accuracy between the learned dynamics by simulating forward in time the dynamic learned by the CINDy algorithm and the one learned by the SINDy algorithm, and comparing both to the evolution of the true dynamic. The results in the next image show this comparison at different times for the dynamics learned by the two algorithms with a noise level of $10^{-4}$ for the example of dimensionality $d = 10$. In keeping with the physical nature of the problem, we present the ten-dimensional phenomenon as a series of oscillators suffering a displacement along the vertical y-axis, in a similar fashion as was done in the original SINDy paper [BPK]. Note that we have added to the images the two extremal particles on the left and right that do not oscillate. While CINDy’s trajectory matches that of the real dynamic up to very small error (it is also much smoother in time), the learned dynamic of SINDy is very far away from the true dynamics, not even recovering essential features of the oscillation; the large number of additional terms deforms the essential structure of the dynamic.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/FPUT_animation.gif" alt="fig2" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Fermi-Pasta-Ulam-Tsingou dynamic: Simulation of learned trajectories vs true trajectory.</p>
<h3 id="kuramoto-model">Kuramoto model</h3>
<p>The Kuramoto model [K] describes a large collection of $d$ weakly coupled oscillators that differ only in their natural frequencies $\omega_i$. This dynamic is often used to describe synchronization phenomena in physics. If we denote by $x_i$ the angular displacement of the $i$-th oscillator, then the governing equation with external forcing can be written as:
\(\dot{x}_i = \omega_i + \frac{K}{d}\sum_{j=1}^d \left[\sin \left( x_j \right) \cos \left( x_i \right) - \cos \left( x_j \right) \sin \left( x_i \right) \right]+ h\sin \left( x_i\right),\)
for $1 \leq i \leq d$, where $d$ is the number of oscillators (the dimensionality of the problem), $K$ is the coupling strength between the oscillators, and $h$ is the external forcing parameter. The exact dynamic $\Xi$ can be expressed using a dictionary of basis functions formed by sine and cosine functions of $x_i$ for $1 \leq i \leq d$, and pairwise combinations of these functions, plus a constant term.</p>
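<p>The governing equation can be sketched in numpy as follows; note that the bracketed pairwise term is just \(\sin(x_j - x_i)\) by the angle-difference identity, which the snippet verifies. The parameter values are illustrative.</p>

```python
import numpy as np

def kuramoto_rhs(x, omega, K=0.5, h=0.1):
    """Forced Kuramoto right-hand side, using sin(x_j - x_i) directly."""
    coupling = np.sin(x[None, :] - x[:, None]).sum(axis=1)   # sum_j sin(x_j - x_i)
    return omega + (K / x.size) * coupling + h * np.sin(x)

rng = np.random.default_rng(3)
d = 10
omega = rng.uniform(0.5, 1.5, size=d)        # natural frequencies
x = rng.uniform(0.0, 2.0 * np.pi, size=d)    # angular displacements

# The pairwise sine/cosine form used in the dictionary gives the same field:
pairwise = (np.sin(x)[None, :] * np.cos(x)[:, None]
            - np.cos(x)[None, :] * np.sin(x)[:, None]).sum(axis=1)
assert np.allclose(kuramoto_rhs(x, omega), omega + 0.5 / d * pairwise + 0.1 * np.sin(x))
```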
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_kuramoto_10_dim_v5.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Sparse recovery of the Kuramoto dynamic with $d = 10$.</p>
<p>All algorithms except the IPM variants exhibit similar performance with respect to $\mathcal{E}_R$ and $\mathcal{S}_E$ up to a noise level of $10^{-5}$. However, the performance of the FISTA and SINDy algorithms degrades for noise levels above $10^{-5}$, producing solutions that are both dense (see $\mathcal{S}_E$) and far away from the true dynamic (see $\mathcal{E}_R$). When we simulate the Kuramoto system from a given initial position, the algorithms perform very differently.</p>
<p>The next animation shows the results of simulating the dynamics learned by the CINDy and SINDy algorithms from the integral formulation for a Kuramoto model with $d = 10$ and a noise level of $10^{-3}$. To make the differences between the algorithms and the positions of the oscillators easier to see, we have placed the $i$-th oscillator at a radius of $i$, for $1 \leq i\leq d$. Note that the same coloring and markers are used as in the previous section to depict the trajectory followed by the exact dynamic, the dynamic learned with CINDy, and the dynamic learned with SINDy. As before, while CINDy reproduces the correct trajectory up to a small error, the trajectory of SINDy’s learned dynamic is rather far away from the real dynamic.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/kuramoto_animation.gif" alt="fig4" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Kuramoto dynamic: Simulation of learned trajectories. Green is the true dynamic. Black is the dynamic learned via CINDy. Magenta is the dynamic learned via SINDy.</p>
<p>If we compare the CINDy and SINDy algorithms from the perspective of sample efficiency, that is, the evolution of the error as we vary the number of training samples made available to the algorithm and the noise levels, we can see that there is an additional benefit to using a CG-based algorithm for the recovery of the sparse dynamic, and that the inclusion of conservation laws can further improve sample efficiency and noise robustness.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_sample_efficiency_kuramoto5.png" alt="fig5" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 5.</strong> Kuramoto dynamic: Sample efficiency with $d = 5$.</p>
<p>If we focus, for example, on the bottom right corner of each of the images, we can see that the SINDy algorithm outputs dynamics with lower accuracy than the CINDy algorithm in the low-training-sample regime at higher noise levels.</p>
<h3 id="michaelis-menten-model">Michaelis-Menten model</h3>
<p>The Michaelis-Menten model [MM] is used to describe enzyme reaction kinetics. We focus on the following derivation, in which an enzyme E combines with a substrate S to form an intermediate product ES with a reaction rate $k_{f}$. This reaction is reversible, in the sense that the intermediate product ES can decompose into E and S with a reaction rate $k_{r}$. The intermediate product ES can also proceed to form a product P and regenerate the free enzyme E, with a reaction rate $k_{\text{cat}}$. This can be expressed as</p>
\[S + E \rightleftharpoons ES \to E + P.\]
<p>If we assume that the rate for a given reaction depends proportionately on the concentration of the reactants, and we denote the concentration of E, S, ES and P as $x_{\text{E}}$, $x_{\text{S}}$, $x_{\text{ES}}$ and $x_{\text{P}}$, respectively, we can express the dynamics of the chemical reaction as:</p>
\[\begin{align*}
\dot{x}_{\text{E}} &= -k_f x_{\text{E}} x_{\text{S}} + k_r x_{\text{ES}} + k_{\text{cat}} x_{\text{ES}} \\
\dot{x}_{\text{S}} &= -k_f x_{\text{E}} x_{\text{S}} + k_r x_{\text{ES}} \\
\dot{x}_{\text{ES}} &= k_f x_{\text{E}} x_{\text{S}} - k_r x_{\text{ES}} - k_{\text{cat}} x_{\text{ES}} \\
\dot{x}_{P} &= k_{\text{cat}} x_{\text{ES}}.
\end{align*}\]
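<p>A quick numpy sketch of this right-hand side also makes the conservation structure from the earlier section explicit: the total enzyme \(x_E + x_{ES}\) and the total substrate \(x_S + x_{ES} + x_P\) are conserved. The rate constants and initial concentrations below are illustrative.</p>

```python
import numpy as np

def mm_rhs(z, kf=1.0, kr=0.5, kcat=0.3):
    """Michaelis-Menten reaction dynamics for z = (x_E, x_S, x_ES, x_P)."""
    xE, xS, xES, _ = z
    return np.array([
        -kf * xE * xS + kr * xES + kcat * xES,   # x_E
        -kf * xE * xS + kr * xES,                # x_S
        kf * xE * xS - kr * xES - kcat * xES,    # x_ES
        kcat * xES,                              # x_P
    ])

z = np.array([0.4, 1.0, 0.1, 0.0])
dz = mm_rhs(z)
assert abs(dz[0] + dz[2]) < 1e-12          # d/dt (x_E + x_ES) = 0
assert abs(dz[1] + dz[2] + dz[3]) < 1e-12  # d/dt (x_S + x_ES + x_P) = 0
```

<p>These two identities are exactly the kind of linear relations that can be imposed on $\Omega$ as trace constraints in the constrained formulation.</p>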
<div class="center">
<img src="http://www.pokutta.com/blog/assets/cindy/blog_post_MMeasy_4_dim_v6.png" alt="fig6" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 6.</strong> Sparse recovery of the Michaelis-Menten dynamic with $d = 4$. Left is the recovery error in Frobenius norm. Right is the number of extra terms picked up that do not belong to the dynamic.</p>
<p>We can observe that for the lowest noise levels, the CINDy algorithm presents no advantage over the SINDy algorithm, however, as we crank up the noise levels, the performance of SINDy degrades, as the algorithm picks up more and more extra terms that are not present in the true dynamic. For low to moderately high noise levels the CINDy algorithm provides the best performance, with the lowest error in terms of $\mathcal{E}_R$, and the sparsest solutions in terms of $\mathcal{S}_E$. For very high noise levels, all the algorithms perform similarly in terms of $\mathcal{E}_R$, while CINDy’s recoveries are still significantly sparser than those of SINDy.</p>
<h3 id="references">References</h3>
<p>[BPK] Brunton, S. L., Proctor, J. L., & Kutz, J. N. (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. In <em>Proceedings of the National Academy of Sciences</em>, 113(15), 3932-3937 <a href="https://www.pnas.org/content/pnas/113/15/3932.full.pdf">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditional gradients. In <em>International Conference on Machine Learning</em> (pp. 735-743). PMLR <a href="http://proceedings.mlr.press/v97/braun19a/braun19a.pdf">pdf</a></p>
<p>[CDS] Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. In <em>SIAM Review</em>, 43(1), 129-159 <a href="https://web.stanford.edu/group/SOL/papers/BasisPursuit-SIGEST.pdf">pdf</a></p>
<p>[LZ] Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. In <em>SIAM Journal on Optimization</em> 26(2) (pp. 1379–1409). SIAM. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[T] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. In <em>Journal of the Royal Statistical Society: Series B</em> (Methodological), 58(1), 267-288 <a href="https://statweb.stanford.edu/~tibs/lasso/lasso.pdf">pdf</a></p>
<p>[W] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). In <em>IEEE Transactions on Information Theory</em>, 55(5), 2183-2202 <a href="https://people.eecs.berkeley.edu/~wainwrig/Papers/Wai09_Sharp_Journal.pdf">pdf</a></p>
<p>[T2] Tibshirani, R. J. (2013). The lasso problem and uniqueness. In <em>Electronic Journal of statistics</em>, 7, 1456-1490 <a href="https://arxiv.org/pdf/1206.0313.pdf">pdf</a></p>
<p>[CRT] Candès, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. In <em>IEEE Transactions on information theory</em>, 52(2), 489-509 <a href="https://arxiv.org/pdf/math/0409186.pdf">pdf</a></p>
<p>[ADLVSNW] Andersen, M., Dahl, J., Liu, Z., Vandenberghe, L., Sra, S., Nowozin, S., & Wright, S. J. (2011). Interior-point methods for large-scale cone programming. In <em>Optimization for machine learning</em>, 5583 <a href="http://www.imm.dtu.dk/~mskan/publications/mlbook.pdf">pdf</a></p>
<p>[K] Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear oscillators. In <em>International symposium on mathematical problems in theoretical physics</em> (pp. 420-422). Springer, Berlin, Heidelberg <a href="https://link.springer.com/chapter/10.1007%2FBFb0013365">pdf</a></p>
<p>[FPUT] Fermi, E., Pasta, J., Ulam, S., & Tsingou, M. (1955). Studies of nonlinear problems (No. LA-1940). Los Alamos Scientific Lab., N. Mex. <a href="https://www.osti.gov/servlets/purl/4376203">pdf</a></p>
<p>[MM] Michaelis, L., & Menten, M. L. (2007). Die Kinetik der Invertinwirkung. Universitätsbibliothek Johann Christian Senckenberg. <a href="https://path.upmc.edu/divisions/chp/PDF/Michaelis-Menten_Kinetik.pdf">pdf</a></p>
<p><em>Written by Alejandro Carderera.</em></p>
<p><em>TL;DR: This is an informal summary of our recent paper CINDy: Conditional gradient-based Identification of Non-linear Dynamics – Noise-robust recovery by Alejandro Carderera, Sebastian Pokutta, Christof Schütte, and Martin Weiser, where we propose the use of a conditional gradient algorithm (more concretely, the Blended Conditional Gradients [BPTW] algorithm) for the sparse recovery of a dynamic. In the presence of noise, the proposed algorithm exhibits superior sparsity-inducing properties, while ensuring higher recovery accuracy, compared to other existing methods in the literature, most notably the popular SINDy [BPK] algorithm, which is based on a sequentially thresholded least-squares approach.</em></p>
<h1 id="dnn-training-with-frank-wolfe">DNN Training with Frank–Wolfe</h1>
<p><em>Posted 2020-11-11 at <a href="http://www.pokutta.com/blog/research/2020/11/11/NNFW">http://www.pokutta.com/blog/research/2020/11/11/NNFW</a>.</em></p>
<p><em>TL;DR: This is an informal discussion of our recent paper <a href="https://arxiv.org/abs/2010.07243">Deep Neural Network Training with Frank–Wolfe</a> by <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, <a href="http://www.christophspiegel.berlin/">Christoph Spiegel</a>, and Max Zimmer, where we study the general efficacy of using Frank–Wolfe methods for the training of Deep Neural Networks with constrained parameters. Summarizing the results, we (1) show the general feasibility of this markedly different approach for first-order based training of Neural Networks, (2) demonstrate that the particular choice of constraints can have a drastic impact on the learned representation, and (3) show that through appropriate constraints one can achieve performance exceeding that of unconstrained stochastic Gradient Descent, matching state-of-the-art results relying on $L^2$-regularization.</em>
<!--more--></p>
<p><em>Written by Christoph Spiegel.</em></p>
<h3 id="motivation">Motivation</h3>
<p>Despite its simplicity, stochastic Gradient Descent (SGD) is still the method of choice for training Neural Networks. Assuming the network is parameterized by some unconstrained weights $\theta$, the standard SGD update can simply be stated as</p>
\[\theta_{t+1} = \theta_t - \alpha \tilde{\,\nabla} L(\theta_t),\]
<p>for some given loss function $L$, its $t$-th batch gradient $\tilde{\,\nabla} L(\theta_t)$, and some learning rate $\alpha$. In practice, one of the more significant contributions to this approach for obtaining state-of-the-art performance has come in the form of adding an $L^2$-regularization term to the loss function. Motivated by this, we explored the efficacy of constraining the parameter space of Neural Networks to a suitable compact convex region ${\mathcal C}$. Standard SGD would require a projection step during each update to maintain the feasibility of the parameters in this constrained setting, that is, the update would be</p>
\[\theta_{t+1} = \Pi_{\mathcal C} \big( \theta_t - \alpha \tilde{\,\nabla} L(\theta_t) \big),\]
<p>where the projection function $\Pi_{\mathcal C}$ maps the input to its closest neighbor in ${\mathcal C}$. Depending on the particular feasible region, such a projection step can be very costly, so we instead explored a more appropriate alternative in the form of the (stochastic) Frank–Wolfe algorithm (SFW) [FW, LP]. Rather than relying on a projection step, SFW calls a linear minimization oracle (LMO) to determine</p>
\[v_t = \textrm{argmin}_{v \in \mathcal C} \langle \tilde{\,\nabla} L(\theta_t), v \rangle,\]
<p>and moves in the direction of $v_t$ through the update</p>
\[\theta_{t+1} = \theta_t + \alpha ( v_t - \theta_t)\]
<p>where $\alpha \in [0,1]$. Feasibility is maintained since the update is a convex combination of two points in the convex feasible region. For a more in-depth look at Frank–Wolfe methods, check out the <a href="http://www.pokutta.com/blog/research/2018/10/05/cheatsheet-fw.html">Frank-Wolfe and Conditional Gradients Cheat Sheet</a>. In the remainder of this post we will present some of the key findings from the paper.</p>
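<p>A single SFW update can be sketched in a few lines of numpy; the $L^2$-ball LMO used here is one illustrative choice of feasible region, and the gradient vector is made up for the example.</p>

```python
import numpy as np

def sfw_step(theta, grad, lmo, alpha):
    """One stochastic Frank-Wolfe update: move toward the LMO output."""
    v = lmo(grad)
    return theta + alpha * (v - theta)

def l2_ball_lmo(grad, tau=1.0):
    """LMO of the L2-ball of radius tau: argmin_{||v||_2 <= tau} <grad, v>."""
    return -tau * grad / np.linalg.norm(grad)

theta = np.zeros(5)                                   # feasible starting point
grad = np.array([1.0, -2.0, 0.0, 0.5, 1.5])           # a (stochastic) gradient
theta = sfw_step(theta, grad, l2_ball_lmo, alpha=0.1)
assert np.linalg.norm(theta) <= 1.0 + 1e-12           # feasibility is preserved
```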
<h3 id="how-to-regularize-neural-networks-through-constraints">How to regularize Neural Networks through constraints</h3>
<p>We have focused on the case of uniformly applying the same type of constraint, such as a bound on the $L^p$-norm, separately on the weight and bias parameters of each individual layer of the network to achieve a regularizing effect, varying only the diameter of that region. Let us consider some particular types of constraints.</p>
<p><strong>$L^2$-norm ball.</strong> Constraining the $L^2$-norm of weights and optimizing them using SFW is most comparable, both in theory and in practice, to SGD with weight decay. The output of the LMO is given by</p>
\[\textrm{argmin}_{v \in \mathcal{B}_2(\tau)} \langle v,x \rangle = -\tau \, x / \|x\|_2,\]
<p>that is, it points in the direction of the negative gradient, and so, as long as the current iterate of the weights is not close to the boundary of the $L^2$-norm ball, the update of the SFW algorithm is similar to that of SGD given an appropriate learning rate.</p>
<p><strong>Hypercube.</strong> Requiring each individual weight of a network or a layer to lie within a certain range, say in $[-\tau,\tau],$ is possibly an even more natural type of constraint. Here the update step taken by SFW however differs drastically from that taken by projected SGD: in the output of the LMO each parameter receives a value of equal magnitude, since</p>
\[\textrm{argmin}_{v \in \mathcal{B}_\infty(\tau)} \langle v,x \rangle = -\tau \, \textrm{sgn}(x),\]
<p>so to a degree all parameters are forced to receive a non-trivial update each step.</p>
<p><strong>$L^1$-norm ball and $K$-sparse polytopes.</strong> On the other end of the spectrum from the dense updates forced by the LMO of the hypercube are feasible regions whose LMOs return very sparse vectors. When, for example, constraining the $L^1$-norm of the weights of a layer, the output of the LMO is the vector with a single non-zero entry equal to $-\tau \, \textrm{sgn}(x_i)$ at a coordinate $i$ where $|x_i|$ is maximal. As a consequence, only a single weight, the one from which the most gain can be derived, will in fact increase in absolute value during the update step of the Frank–Wolfe algorithm, while all other weights will decay and move towards zero. The $K$-sparse polytope of radius $\tau > 0$ is obtained as the intersection of the $L^1$-ball of radius $\tau K$ and the hypercube of radius $\tau$, and generalizes this principle by increasing the absolute value of the $K$ most important weights.</p>
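<p>The three LMOs discussed above can be sketched in numpy as follows; $\tau$ and the test vector are illustrative.</p>

```python
import numpy as np

def lmo_hypercube(x, tau):
    """argmin over B_inf(tau): every coordinate receives magnitude tau."""
    return -tau * np.sign(x)

def lmo_l1_ball(x, tau):
    """argmin over B_1(tau): a single nonzero entry at the largest |x_i|."""
    v = np.zeros_like(x)
    i = np.argmax(np.abs(x))
    v[i] = -tau * np.sign(x[i])
    return v

def lmo_k_sparse(x, tau, K):
    """K-sparse polytope of radius tau: magnitude tau on the K largest |x_i|."""
    v = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-K:]
    v[idx] = -tau * np.sign(x[idx])
    return v

x = np.array([0.2, -1.5, 0.7, 3.0])
assert np.array_equal(lmo_l1_ball(x, 1.0), [0.0, 0.0, 0.0, -1.0])
assert np.count_nonzero(lmo_k_sparse(x, 1.0, 2)) == 2
assert np.array_equal(lmo_hypercube(x, 1.0), [-1.0, 1.0, -1.0, -1.0])
```

<p>Note how the outputs range from fully dense (hypercube) to maximally sparse ($L^1$ ball), with the $K$-sparse polytope interpolating between the two.</p>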
<h3 id="the-impact-of-constraints-on-learned-features">The impact of constraints on learned features</h3>
<p>Let us illustrate the impact that the choice of constraints has on the learned representations through a simple classifier trained on the MNIST dataset. The particular network chosen here, for the sake of exposition, has no hidden layers and no bias terms, and the flattened input layer of size 784 is fully connected to the output layer of size 10. The weights of the network are therefore represented by a single 784 × 10 matrix, where each of the ten columns corresponds to the weights learned to recognize one of the ten digits 0 to 9. In Figure 1 we present a visualization of this network trained on the dataset with different types of constraints placed on the parameters. Each image interprets one of the columns of the weight matrix as an image of size 28 × 28, where red represents negative weights and green represents positive weights for a given pixel. We see that the choice of feasible region, and in particular the LMO associated with it, can have a drastic impact on the representations learned by the network when using the stochastic Frank–Wolfe algorithm. For completeness’ sake, we have included several commonly used adaptive variants of SGD in the comparison.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/nnfw/mnist-visualization_compact.png" alt="Visualizing learned features on MNIST" style="float:left; width:98%" />
</div>
<p class="figcap"><strong>Figure 1.</strong> <em>Visualization of the weights in a fully connected no-hidden-layer classifier trained on the MNIST dataset corresponding to the digits 0, 1 and 2. Red corresponds to negative and green to positive weights.</em></p>
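The column-to-image interpretation used in Figure 1 amounts to a single reshape. A small illustrative sketch, where the weight matrix `W` is a zero-initialized stand-in rather than trained weights:

```python
import numpy as np

# stand-in for the 784 x 10 weight matrix of the bias-free classifier
W = np.zeros((784, 10))

def predict(x_flat):
    # x_flat: a flattened 28x28 MNIST image; the largest logit wins
    return int(np.argmax(x_flat @ W))

def weight_image(digit):
    # column `digit` of W, reinterpreted as a 28x28 image as in Figure 1
    return W[:, digit].reshape(28, 28)
```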
<p>Further demonstrating the impact of constraints on the learned representations, we consider the sparsity of the weights of trained networks. We call a parameter of a network <em>inactive</em> if its absolute value is smaller than that at its random initialization. To study the effect of constraining the parameters, we trained two different types of networks on the MNIST dataset: a fully connected network with two hidden layers and a total of 26 506 parameters, and a convolutional network with 93 322 parameters. In Figure 2 we see that regions spanned by sparse vectors, such as $K$-sparse polytopes, result in noticeably fewer active parameters in the network over the course of training, whereas regions whose LMO forces larger updates in each parameter, such as the hypercube, result in more active weights.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/nnfw/sparseness_sparse-mnist-2.png" alt="Sparseness during training on MNIST" style="width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> <em>Number of active parameters in two different networks trained on the MNIST dataset.</em></p>
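Under this definition, tracking the sparsity plotted in Figure 2 reduces to comparing current and initial magnitudes. A hedged one-liner (the function name is ours):

```python
import numpy as np

def active_fraction(weights, init_weights):
    # a parameter is "inactive" once its magnitude drops below its value
    # at random initialization; we report the fraction still active
    return float(np.mean(np.abs(weights) >= np.abs(init_weights)))
```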
<h3 id="achieving-state-of-the-art-results">Achieving state-of-the-art results</h3>
<p>Finally, we demonstrate the feasibility of training even very deep Neural Networks using SFW. We trained several state-of-the-art Neural Networks on the CIFAR-10, CIFAR-100, and ImageNet datasets. In Table 1 we show the top-1 test accuracy attained by networks based on the DenseNet, WideResNet, GoogLeNet, and ResNeXt architectures on these datasets. Here we compare networks with unconstrained parameters trained using SGD with momentum, both with and without weight decay, to networks whose parameters are constrained in their $L^2$-norm or $L^\infty$-norm and which were trained using SFW with momentum added. We observe that, when constraining the $L^2$-norm of the parameters, SFW attains performance exceeding that of standard SGD and matching the state-of-the-art performance of SGD with weight decay. When constraining the $L^\infty$-norm of the parameters, SFW does not quite achieve the same performance as SGD with weight decay, but a regularization effect of the constraints is nevertheless clearly present, as SFW still exceeds the performance of SGD without weight decay. We furthermore note that, due to the nature of the LMOs associated with these particular regions, runtimes were comparable.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/nnfw/dnn_fw_stoa.png" alt="Test accuracy of deep networks trained with SFW and SGD" style="float:left; width:98%" />
</div>
<p class="figcap"><strong>Table 1.</strong> <em>Test accuracy attained by several deep Neural Networks trained on the CIFAR-10, CIFAR-100, and ImageNet datasets. Parameters trained with SGD were unconstrained.</em></p>
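The $L^2$-ball and $L^\infty$-ball LMOs used for Table 1 have closed-form solutions, which is consistent with the comparable runtimes. A minimal sketch under our own naming:

```python
import numpy as np

def lmo_l2_ball(grad, tau):
    # argmin of <grad, v> over the L2-ball of radius tau:
    # the rescaled negative gradient
    norm = np.linalg.norm(grad)
    return np.zeros_like(grad) if norm == 0.0 else -tau * grad / norm

def lmo_linf_ball(grad, tau):
    # argmin over the hypercube [-tau, tau]^n: a dense sign vector,
    # which forces an update in every coordinate
    return -tau * np.sign(grad)
```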
<h3 id="reproducibility">Reproducibility</h3>
<p>We have made our implementations of the various stochastic Frank–Wolfe methods considered in the paper available online both for PyTorch and for TensorFlow under <a href="https://github.com/ZIB-IOL/StochasticFrankWolfe">github.com/ZIB-IOL/StochasticFrankWolfe</a>. There you will also find a list of Google Colab notebooks that allow you to recreate all the experimental results presented here.</p>
<h3 id="references">References</h3>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[LP] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<h2>Projection-Free Adaptive Gradients for Large-Scale Optimization (2020-10-21)</h2>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/pdf/2009.14114.pdf">Projection-Free Adaptive Gradients for Large-Scale Optimization</a> by <a href="https://cyrillewcombettes.github.io/">Cyrille Combettes</a>, <a href="http://www.christophspiegel.berlin/">Christoph Spiegel</a>, and <a href="http://www.pokutta.com/">Sebastian Pokutta</a>. We propose to improve the performance of state-of-the-art stochastic Frank-Wolfe algorithms via a better use of first-order information. This is achieved by blending in adaptive gradients, a method for setting entry-wise step-sizes that automatically adjust to the geometry of the problem. Computational experiments on convex and nonconvex objectives demonstrate the advantage of our approach.</em>
<!--more--></p>
<p><em>Written by Cyrille Combettes.</em></p>
<h3 id="introduction">Introduction</h3>
<p>We consider the family of stochastic Frank-Wolfe algorithms, addressing constrained finite-sum optimization problems</p>
\[\min_{x\in\mathcal{C}}\left\{f(x)\overset{\text{def}}{=}\frac{1}{m}\sum_{i=1}^mf_i(x)\right\},\]
<p>where $\mathcal{C}\subset\mathbb{R}^n$ is a compact convex set and $f_1,\ldots,f_m\colon\mathbb{R}^n\rightarrow\mathbb{R}$ are smooth convex functions. Their generic template is presented in Template <a href="#fw">1</a>. When $\tilde{\nabla}f(x_t)=\nabla f(x_t)$, we recover the original Frank-Wolfe algorithm (<a href="#fw56">Frank and Wolfe</a>, <a href="#fw56">1956</a>), a.k.a. the conditional gradient algorithm (<a href="#levitin66">Levitin and Polyak</a>, <a href="#levitin66">1966</a>). It is a simple projection-free algorithm that computes a linear minimization at each iteration and moves in the direction of the returned solution $v_t$, with a step-size $\gamma_t\in\left[0,1\right]$ ensuring that the new iterate $x_{t+1}=(1-\gamma_t)x_t+\gamma_tv_t\in\mathcal{C}$ is feasible by convex combination. Hence, it does not need to compute projections back onto $\mathcal{C}$.</p>
<hr />
<p><span id="fw"><strong>Template 1.</strong></span> Stochastic Frank-Wolfe</p>
<p><em>Input:</em> Start point $x_0\in\mathcal{C}$, step-sizes $\gamma_t\in\left[0,1\right]$.<br />
$\text{ }$ 1: $\text{ }$ <strong>for</strong> $t=0$ <strong>to</strong> $T-1$ <strong>do</strong><br />
$\text{ }$ <span id="fwest">2</span>: $\quad$ Update the gradient estimator $\tilde{\nabla}f(x_t)$<br />
$\text{ }$ 3: $\quad$ $v_t\leftarrow\underset{v\in\mathcal{C}}{\arg\min}\langle\tilde{\nabla}f(x_t),v\rangle$<br />
$\text{ }$ 4: $\quad$ $x_{t+1}\leftarrow x_t+\gamma_t(v_t-x_t)$<br />
$\text{ }$ 5: $\text{ }$ <strong>end for</strong></p>
<hr />
<p><br /></p>
<p>When $m$ is very large, querying exact first-order information from $f$ can be too expensive. Instead, stochastic Frank-Wolfe algorithms build a gradient estimator $\tilde{\nabla}f(x_t)$ with only approximate first-order information. For example, the Stochastic Frank-Wolfe algorithm (SFW) takes the average $\tilde{\nabla}f(x_t)\leftarrow(1/b_t)\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_t)$ over a minibatch $i_1,\ldots,i_{b_t}$ sampled uniformly at random from \(\{1,\ldots,m\}\). State-of-the-art stochastic Frank-Wolfe algorithms also include the Stochastic Variance-Reduced Frank-Wolfe algorithm (SVRF) (<a href="#hazan16">Hazan and Luo</a>, <a href="#hazan16">2016</a>), the Stochastic Path-Integrated Differential EstimatoR Frank-Wolfe algorithm (SPIDER-FW) (<a href="#yurtsever19">Yurtsever et al.</a>, <a href="#yurtsever19">2019</a>; <a href="#shen19">Shen et al.</a>, <a href="#shen19">2019</a>), the Online stochastic Recursive Gradient-based Frank-Wolfe algorithm (ORGFW) (<a href="#xie20">Xie et al.</a>, <a href="#xie20">2020</a>), and the Constant batch-size Stochastic Frank-Wolfe algorithm (CSFW) (<a href="#negiar20">Négiar et al.</a>, <a href="#negiar20">2020</a>). Their strategies are reported in Table <a href="#table">1</a>.</p>
<p><span id="table" style="font-size:95%">Table 1: Gradient estimator updates in stochastic Frank-Wolfe algorithms. The indices $i_1,\ldots,i_{b_t}$ are sampled i.i.d. uniformly at random from \(\{1,\ldots,m\}\).</span></p>
<table>
<thead>
<tr>
<th><strong>Algorithm</strong></th>
<th><strong>Update $\tilde{\nabla}f(x_t)$ in Line <a href="#fwest">2</a></strong></th>
<th><strong>Additional information</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>SFW</td>
<td>$\displaystyle\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_t)$</td>
<td>$\varnothing$</td>
</tr>
<tr>
<td>SVRF</td>
<td>\(\displaystyle\nabla f(\tilde{x}_t)+\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}(\nabla f_i(x_t)-\nabla f_i(\tilde{x}_t))\)</td>
<td>$\tilde{x}_t$ is the last snapshot iterate</td>
</tr>
<tr>
<td>SPIDER-FW</td>
<td>\(\displaystyle\nabla f(\tilde{x}_t)+\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}(\nabla f_i(x_t)-\nabla f_i(x_{t-1}))\)</td>
<td>$\tilde{x}_t$ is the last snapshot iterate</td>
</tr>
<tr>
<td>ORGFW</td>
<td>$\displaystyle\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_t)+(1-\rho_t)\left(\tilde{\nabla}f(x_{t-1})-\frac{1}{b_t}\sum_{i=i_1}^{i_{b_t}}\nabla f_i(x_{t-1})\right)$</td>
<td>$\rho_t$ is the momentum parameter</td>
</tr>
<tr>
<td>CSFW</td>
<td>\(\displaystyle\tilde{\nabla}f(x_{t-1})+\sum_{i=i_1}^{i_{b_t}}\left(\frac{1}{m}f_i'(\langle a_i,x_t\rangle)-[\alpha_{t-1}]_i\right)a_i\) <br /> and \([\alpha_t]_i\leftarrow(1/m)f_i'(\langle a_i,x_t\rangle)\) if \(i\in\{i_1,\ldots,i_{b_t}\}\) else \([\alpha_{t-1}]_i\)</td>
<td>Assumes separability of $f$ as <br /> $\displaystyle f(x)=\frac{1}{m}\sum_{i=1}^mf_i(\langle a_i,x\rangle)$</td>
</tr>
</tbody>
</table>
<p>In our paper, we propose to improve the performance of this family of algorithms by using adaptive gradients.</p>
<h3 id="the-adaptive-gradient-algorithm">The Adaptive Gradient algorithm</h3>
<p>The Adaptive Gradient algorithm (AdaGrad) (<a href="#duchi11">Duchi et al.</a>, <a href="#duchi11">2011</a>; <a href="#mcmahan10">McMahan and Streeter</a>, <a href="#mcmahan10">2010</a>) is presented in Algorithm <a href="#adagrad">2</a>. The new iterate $x_{t+1}$ is obtained by solving a subproblem in Line <a href="#new">4</a>. The default value for the offset hyperparameter is $\delta\leftarrow10^{-8}$.</p>
<hr />
<p><span id="adagrad"><strong>Algorithm 2.</strong></span> Adaptive Gradient (AdaGrad)</p>
<p><em>Input:</em> Start point $x_0\in\mathcal{C}$, offset $\delta>0$, learning rate $\eta>0$.<br />
$\text{ }$ 1: $\text{ }$ <strong>for</strong> $t=0$ <strong>to</strong> $T-1$ <strong>do</strong><br />
$\text{ }$ 2: $\quad$ Update the gradient estimator $\tilde{\nabla}f(x_t)$<br />
$\text{ }$ <span id="h">3</span>: $\quad$ $H_t\leftarrow\operatorname{diag}\left(\delta\mathbf{1}+\sqrt{\sum_{s=0}^t\tilde{\nabla}f(x_s)^2}\,\right)$<br />
$\text{ }$ <span id="new">4</span>: $\quad$ \(x_{t+1}\leftarrow\underset{x\in\mathcal{C}}{\arg\min}\,\eta\langle\tilde{\nabla}f(x_t),x\rangle+\frac{1}{2}\|x-x_t\|_{H_t}^2\)<br />
$\text{ }$ 5: $\text{ }$ <strong>end for</strong></p>
<hr />
<p><br /></p>
<p>The matrix $H_t\in\mathbb{R}^{n\times n}$ is diagonal and satisfies for all \(i,j\in\{1,\ldots,n\}\),</p>
\[[H_t]_{i,j}=\delta+\sqrt{\sum_{s=0}^t[\tilde{\nabla}f(x_s)]_i^2}\quad\text{if }i=j\quad\text{else } 0.\]
<p>To see why AdaGrad builds entry-wise step-sizes from past first-order information, note that the subproblem in Line <a href="#new">4</a> is equivalent to</p>
<div id="sub">
$$
x_{t+1}\leftarrow\underset{x\in\mathcal{C}}{\arg\min}\,\|x-(x_t-\eta H_t^{-1}\tilde{\nabla}f(x_t))\|_{H_t}\tag{1}
$$
</div>
<p>by first-order optimality condition (<a href="#polyak87">Polyak</a>, <a href="#polyak87">1987</a>), where \(\|\cdot\|_{H_t}\colon u\in\mathbb{R}^n\mapsto\sqrt{\langle u,H_tu\rangle}\). Ignoring the constraint set $\mathcal{C}$ for ease of exposition, we obtain</p>
\[x_{t+1}\leftarrow x_t-\eta H_t^{-1}\tilde{\nabla}f(x_t),\]
<p>i.e., for every feature \(i\in\{1,\ldots,n\}\),</p>
\[[x_{t+1}]_i\leftarrow[x_t]_i-\frac{\eta[\tilde{\nabla}f(x_t)]_i}{\delta+\sqrt{\sum_{s=0}^t[\tilde{\nabla}f(x_s)]_i^2}}.\]
<p>Therefore, $\delta$ prevents division by zero and the step-sizes automatically adjust to the geometry of the problem. In particular, rare but potentially very informative features do not go unnoticed, as they receive a large step-size whenever they appear.</p>
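The unconstrained entry-wise update above takes only a few lines. A sketch with our own helper name (the default $\eta$ is chosen arbitrarily):

```python
import numpy as np

def adagrad_step(x, grad, accum, eta=0.1, delta=1e-8):
    # accum carries the running sum of squared gradient entries; each
    # coordinate i gets its own step-size eta / (delta + sqrt(accum_i))
    accum = accum + grad ** 2
    x = x - eta * grad / (delta + np.sqrt(accum))
    return x, accum
```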
<h3 id="frank-wolfe-with-adaptive-gradients">Frank-Wolfe with adaptive gradients</h3>
<p>For constrained optimization, AdaGrad can be very expensive as it requires solving a constrained subproblem at each iteration (Line <a href="#new">4</a>), which, by (<a href="#sub">1</a>), can also be seen as a projection in the non-Euclidean norm \(\|\cdot\|_{H_t}\). Instead, we propose to solve the subproblems <em>very</em> incompletely, via a small and fixed number of iterations of the Frank-Wolfe algorithm. This approach is aimed at designing an efficient method in practice. In particular, contrary to <a href="#lan16">Lan and Zhou</a> (<a href="#lan16">2016</a>), we do not worry about the accuracy of the solutions to the subproblems. We present our method via a generic template in Template <a href="#adafw">3</a>.</p>
<hr />
<p><span id="adafw"><strong>Template 3.</strong></span> Frank-Wolfe with adaptive gradients</p>
<p><em>Input:</em> Start point $x_0\in\mathcal{C}$, number of inner iterations $K$, learning rate $\eta>0$.<br />
$\text{ }$ 1: $\text{ }$ <strong>for</strong> $t=0$ <strong>to</strong> $T-1$ <strong>do</strong><br />
$\text{ }$ <span id="adafwest">2</span>: $\quad$ Update the gradient estimator $\tilde{\nabla}f(x_t)$ <span style="float:right">$\triangleright$ as in any of Table <a href="#table">1</a></span><br />
$\text{ }$ 3: $\quad$ Update the diagonal matrix $H_t$ <span style="float:right">$\triangleright$ as in, e.g., Line <a href="#h">3</a> of Algorithm <a href="#adagrad">2</a></span><br />
$\text{ }$ <span id="start">4</span>: $\quad$ \(y_0^{(t)}\leftarrow x_t\)<br />
$\text{ }$ 5: $\quad$ <strong>for</strong> $k=0$ <strong>to</strong> $K-1$ <strong>do</strong><br />
$\text{ }$ 6: $\quad\quad$ \(\nabla Q_t(y_k^{(t)})\leftarrow\tilde{\nabla}f(x_t)+\frac{1}{\eta}H_t(y_k^{(t)}-x_t)\)<br />
$\text{ }$ 7: $\quad\quad$ \(v_k^{(t)}\leftarrow\underset{v\in\mathcal{C}}{\arg\min}\langle\nabla Q_t(y_k^{(t)}),v\rangle\)<br />
$\text{ }$ 8: $\quad\quad$ \(\gamma_k^{(t)}\leftarrow\min\left\{\eta\frac{\langle\nabla Q_t(y_k^{(t)}),y_k^{(t)}-v_k^{(t)}\rangle}{\|y_k^{(t)}-v_k^{(t)}\|_{H_t}^2},1\right\}\)<br />
$\text{ }$ 9: $\quad\quad$ \(y_{k+1}^{(t)}\leftarrow y_k^{(t)}+\gamma_k^{(t)}(v_k^{(t)}-y_k^{(t)})\)<br />
<span id="end">10</span>: $\quad$ <strong>end for</strong><br />
11: $\quad$ \(x_{t+1}\leftarrow y_K^{(t)}\)<br />
12: $\text{ }$ <strong>end for</strong></p>
<hr />
<p><br /></p>
<p>Lines <a href="#start">4</a>-<a href="#end">10</a> apply $K$ iterations of the Frank-Wolfe algorithm on</p>
\[\min_{x\in\mathcal{C}}\left\{Q_t(x)\overset{\text{def}}{=}f(x_t)+\langle\tilde{\nabla}f(x_t),x-x_t\rangle+\frac{1}{2\eta}\|x-x_t\|_{H_t}^2\right\},\]
<p>which is equivalent to the AdaGrad subproblem. The sequence of iterates is denoted by $y_0^{(t)},\ldots,y_K^{(t)}$, starting from $x_t=y_0^{(t)}$ and ending at $x_{t+1}=y_K^{(t)}$. In our experiments, we typically set $K\sim5$. The strategy in Line <a href="#adafwest">2</a> can be that of any of the variants SFW, SVRF, SPIDER-FW, ORGFW, or CSFW. When using variant X, the associated method is named AdaX.</p>
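A sketch of the inner loop (Lines 4-10) for a diagonal $H_t$ stored as a vector `h`, using an $L^1$-ball feasible region as an example; the function names are ours and this is a schematic, not the paper's implementation:

```python
import numpy as np

def lmo_l1(grad, tau):
    # linear minimization over the L1-ball of radius tau
    v = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    v[i] = -tau * np.sign(grad[i])
    return v

def adafw_inner(x, g, h, eta, tau, K=5):
    # K Frank-Wolfe steps on the quadratic model Q_t around x, where g is
    # the gradient estimator and h the diagonal of H_t
    y = x.copy()
    for _ in range(K):
        grad_q = g + (h / eta) * (y - x)                  # Line 6
        v = lmo_l1(grad_q, tau)                           # Line 7
        d = y - v
        denom = d @ (h * d)                               # ||y - v||_{H_t}^2
        if denom <= 0.0:
            break                                         # y coincides with v
        gamma = min(eta * (grad_q @ d) / denom, 1.0)      # Line 8
        y = y + gamma * (v - y)                           # Line 9
    return y                                              # becomes x_{t+1}
```

The step-size in Line 8 is the exact minimizer of the quadratic model along the segment from $y_k^{(t)}$ to $v_k^{(t)}$, clipped to $[0,1]$.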
<h3 id="computational-experiments">Computational experiments</h3>
<p>We compare our method to SFW, SVRF, SPIDER-FW, ORGFW, and CSFW. For the three experiments with convex objectives, we plot our method using the best performing stochastic Frank-Wolfe variant. For the three neural network experiments, CSFW is not applicable and we run AdaSFW only, as variance reduction may be ineffective in deep learning (<a href="#defazio19">Defazio and Bottou</a>, <a href="#defazio19">2019</a>). In addition, since momentum has become a key ingredient for neural network optimization, we demonstrate that AdaSFW also works very well with momentum. The method is named AdamSFW and $H_t$ is built as in <a href="#reddi18">Reddi et al.</a> (<a href="#reddi18">2018</a>).</p>
<p>The results are presented in Figure <a href="#fig">1</a>. One important observation is that none of the previous methods outperform the vanilla SFW on the nonconvex experiments, except on the MNIST dataset. On the IMDB dataset, AdaSFW yields the best test performance despite optimizing slowly over the training set, and AdamSFW reaches its maximum accuracy very fast which can be interesting if we consider using early stopping.</p>
<div id="fig" class="center" style="margin-top:5mm">
<img src="http://www.pokutta.com/blog/assets/adasfw/svm.png" alt="svm" style="float:left; width:48%" />
<img src="http://www.pokutta.com/blog/assets/adasfw/lin.png" alt="linear" style="float:right; width:48%" />
<p style="clear: both;"></p>
<img src="http://www.pokutta.com/blog/assets/adasfw/log.png" alt="logistic" style="float:left; width:48%" />
<img src="http://www.pokutta.com/blog/assets/adasfw/mnist.png" alt="MNIST" style="float:right; width:48%" />
<p style="clear: both;"></p>
<img src="http://www.pokutta.com/blog/assets/adasfw/imdb.png" alt="IMDB" style="float:left; width:48%" />
<img src="http://www.pokutta.com/blog/assets/adasfw/cifar.png" alt="CIFAR-10" style="float:right; width:48%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> <em>Computational experiments on convex and nonconvex objectives.</em></p>
<h4 id="references">References</h4>
<p><span id="defazio19" style="font-size:95%">A. Defazio and L. Bottou. <a href="https://arxiv.org/pdf/1812.04529.pdf">On the ineffectiveness of variance reduced optimization for deep learning</a>. In <em>Advances in Neural Information Processing Systems 32</em>, pages 1755–1765. 2019.</span></p>
<p><span id="duchi11" style="font-size:95%">J. C. Duchi, E. Hazan, and Y. Singer. <a href="https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">Adaptive subgradient methods for online learning and stochastic optimization</a>. <em>Journal of Machine Learning Research</em>, 12(61):2121–2159, 2011.</span></p>
<p><span id="fw56" style="font-size:95%">M. Frank and P. Wolfe. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">An algorithm for quadratic programming</a>. <em>Naval Research Logistics Quarterly</em>, 3(1-2):95–110, 1956.</span></p>
<p><span id="hazan16" style="font-size:95%">E. Hazan and H. Luo. <a href="https://arxiv.org/pdf/1602.02101.pdf">Variance-reduced and projection-free stochastic optimization</a>. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>, pages 1263–1271, 2016.</span></p>
<p><span id="lan16" style="font-size:95%">G. Lan and Y. Zhou. <a href="https://pdfs.semanticscholar.org/5b75/13ad8e8fb691f5243278965dce549dbcc827.pdf">Conditional gradient sliding for convex optimization</a>. <em>SIAM Journal on Optimization</em>, 26(2):1379–1409, 2016.</span></p>
<p><span id="levitin66" style="font-size:95%">E. S. Levitin and B. T. Polyak. <a href="https://www.sciencedirect.com/science/article/abs/pii/0041555366901145">Constrained minimization methods</a>. <em>USSR Computational Mathematics and Mathematical Physics</em>, 6(5):1–50, 1966.</span></p>
<p><span id="mcmahan10" style="font-size:95%">H. B. McMahan and M. Streeter. <a href="https://arxiv.org/pdf/1002.4908.pdf">Adaptive bound optimization for online convex optimization</a>. In <em>Proceedings of the 23rd Annual Conference on Learning Theory</em>, 2010.</span></p>
<p><span id="negiar20" style="font-size:95%">G. Négiar, G. Dresdner, A. Y.-T. Tsai, L. El Ghaoui, F. Locatello, R. M. Freund, and F. Pedregosa. <a href="https://arxiv.org/pdf/2002.11860.pdf">Stochastic Frank-Wolfe for constrained finite-sum minimization</a>. In <em>Proceedings of the 37th International Conference on Machine Learning</em>. 2020. To appear.</span></p>
<p><span id="polyak87" style="font-size:95%">B. T. Polyak. <em><a href="http://lab7.ipu.ru/files/polyak/polyak-optimizationintro-eng.zip">Introduction to Optimization</a></em>. Optimization Software, 1987.</span></p>
<p><span id="reddi18" style="font-size:95%">S. J. Reddi, S. Kale, and S. Kumar. <a href="https://arxiv.org/pdf/1904.09237.pdf">On the convergence of Adam and beyond</a>. In <em>Proceedings of the 6th International Conference on Learning Representations</em>, 2018.</span></p>
<p><span id="shen19" style="font-size:95%">Z. Shen, C. Fang, P. Zhao, J. Huang, and H. Qian. <a href="http://proceedings.mlr.press/v89/shen19b/shen19b.pdf">Complexities in projection-free stochastic non-convex minimization</a>. In <em>Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics</em>, pages 2868–2876, 2019.</span></p>
<p><span id="xie20" style="font-size:95%">J. Xie, Z. Shen, C. Zhang, H. Qian, and B. Wang. <a href="https://arxiv.org/pdf/1910.09396.pdf">Efficient projection-free online methods with stochastic recursive gradient</a>. In <em>Proceedings of the 34th AAAI Conference on Artificial Intelligence</em>, pages 6446–6453, 2020.</span></p>
<p><span id="yurtsever19" style="font-size:95%">A. Yurtsever, S. Sra, and V. Cevher. <a href="http://proceedings.mlr.press/v97/yurtsever19b/yurtsever19b.pdf">Conditional gradient methods via stochastic path-integrated differential estimator</a>. In <em>Proceedings of the 36th International Conference on Machine Learning</em>, pages 7282–7291. 2019.</span></p>
<h2>Accelerating Domain Propagation via GPUs (2020-09-20)</h2>
<p><em>TL;DR: This is an informal discussion of our recent paper <a href="https://arxiv.org/abs/2009.07785">Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices</a> by Boro Sofranac, Ambros Gleixner, and Sebastian Pokutta. In the paper, we present a new algorithm to perform domain propagation of linear constraints on GPUs efficiently. The results show that efficient implementations of Mixed-integer Programming (MIP) methods are possible on GPUs, even though the success of using GPUs in MIPs has traditionally been limited. Our algorithm is capable of performing domain propagation on the GPU exclusively, without the need for synchronization with the CPU, paving the way for the usage of this algorithm in a new generation of MIP methods that run on GPUs.</em>
<!--more--></p>
<p><em>Written by Boro Sofranac.</em>
</p>
<h2 id="the-motivation">The motivation</h2>
<p>
Since the advent of general-purpose, massively parallel GPU hardware, many fields of applied mathematics have actively sought to design specialized algorithms that exploit the unprecedented computational resources this new type of hardware has to offer. A prime example is Neural Networks, whose rise to prominence was fueled by the “Deep Learning revolution”, with Deep Learning methods running on GPUs. Still, such success is missing in many other fields that are not as amenable to the specialized, massively parallel programming model of GPUs. One such field is Mixed-integer Programming (MIP).</p>
<p>The development of the <em>Simplex</em> algorithm for solving Linear Programs (LPs) in the 1940s was followed by a plethora of methods and solvers for LPs and MIPs over the following decades. A unifying characteristic of these methods is that a) they exhibit non-uniform algorithmic behaviour, and b) they operate on highly irregular data (i.e., data structures are sparse and, in general, have no regular structure). Such characteristics make the implementation of these methods on massively parallel hardware challenging and have hindered the application of GPUs in the field.</p>
<p>Recognizing these challenges in a number of fields, GPU hardware development in recent years has been geared towards easier expression of non-uniform workflows in its programming model (for example, the <a href="https://developer.nvidia.com/blog/introduction-cuda-dynamic-parallelism/"><em>Dynamic Parallelism</em></a> feature of NVIDIA GPUs) and hardware (e.g., <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions">atomic operations</a>). At the same time, researchers working in some related fields have shown that, with the right algorithmic design, GPUs can be efficiently used for workflows previously considered unsuitable due to challenges similar to those present in MIP methods. Consider, for example, <em>Numerical Linear Algebra</em> (LA): for years, dense LA has been one of the main beneficiaries of GPU computing, allowing for tremendous speedups. On the other hand, if the input data was sparse (and irregular), the same algorithms often exhibited disappointing performance. New algorithms, however, such as the <a href="https://doi.org/10.1109/SC.2014.68"><em>CSR-Adaptive</em> algorithm developed by Greathouse and Daga</a> for sparse matrix-vector products (SpMV), have shown very impressive performance gains for highly unstructured sparse matrices.</p>
<p>Against this backdrop, we set out to investigate the applicability of massively parallel algorithms in MIP methods. This, however, is no simple task: massively parallel programming models bring with them a different algorithmic paradigm and complexity notions while MIP methods have historically been strongly sequential. Basic design decisions need to be rethought and/or new algorithms developed. So we took a core MIP method used by all state-of-the-art solvers, namely <em>Domain Propagation</em>, which is traditionally not amenable to efficient parallelization, and we show that with the right algorithmic design speedups seen in fields such as sparse Linear Algebra are also possible here. This new GPU algorithm for <em>Domain Propagation</em> and computational experiments assessing the performance are presented in <a href="https://arxiv.org/abs/2009.07785">our paper</a>; the code is available on our <a href="https://github.com/Sofranac-Boro/gpu-domain-propagator"><em>GitHub</em> page</a>.
</p>
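As background for what follows: domain propagation tightens variable bounds using the activity of each linear constraint. A schematic, dense sketch of one propagation round for a single constraint $\mathrm{lhs} \leq a^\top x \leq \mathrm{rhs}$ — our simplified formulation for illustration only; the actual algorithm operates on sparse data structures on the GPU:

```python
import numpy as np

def propagate_linear(a, lhs, rhs, lb, ub):
    # minimal and maximal activity of the constraint under current bounds
    min_act = np.sum(np.where(a > 0, a * lb, a * ub))
    max_act = np.sum(np.where(a > 0, a * ub, a * lb))
    new_lb, new_ub = lb.copy(), ub.copy()
    for j in range(len(a)):
        if a[j] == 0:
            continue
        if a[j] > 0:
            # residual activities with variable j removed
            mn = min_act - a[j] * lb[j]
            mx = max_act - a[j] * ub[j]
            new_ub[j] = min(new_ub[j], (rhs - mn) / a[j])
            new_lb[j] = max(new_lb[j], (lhs - mx) / a[j])
        else:
            mn = min_act - a[j] * ub[j]
            mx = max_act - a[j] * lb[j]
            # dividing by a negative coefficient flips the inequalities
            new_ub[j] = min(new_ub[j], (lhs - mx) / a[j])
            new_lb[j] = max(new_lb[j], (rhs - mn) / a[j])
    return new_lb, new_ub
```

The GPU algorithm parallelizes this kind of bound tightening across constraints and variables; the sketch above only conveys the sequential semantics.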
<h2 id="sneak-peek-at-the-results">Sneak peek at the results</h2>
<p>
We use the <em>de-facto</em> standard <a href="https://miplib.zib.de/">MIPLIB2017</a> test set for MIPs to conduct our experiments. To better capture the response of our algorithm to the size of instances, we subdivided this set into 8 subsets with instances of increasing size, dubbed Set-1 to Set-8. For comparison, we use four algorithms:</p>
<ol>
<li><strong>cpu_seq</strong> is a sequential implementation of domain propagation which closely follows implementations in state-of-the-art solvers.</li>
<li><strong>cpu_omp</strong> is a shared-memory parallel version of domain propagation which runs on the CPU.</li>
<li><strong>gpu_atomic</strong> is our GPU implementation with atomic operations.</li>
<li><strong>gpu_reduction</strong> is our GPU implementation which avoids using atomic operations by using reductions in global memory.</li>
</ol>
<p>The algorithms are tested on the following hardware:</p>
<ol>
<li><strong>V100</strong> NVIDIA Tesla V100 PCIe 32GB (GPU)</li>
<li><strong>RTX</strong> NVIDIA Titan RTX 24GB (GPU)</li>
<li><strong>P400</strong> NVIDIA Quadro P400 2GB (GPU)</li>
<li><strong>amdtr</strong> 64-core AMD Ryzen Threadripper 3990X @ 3.30 GHz with 128 GB RAM (CPU)</li>
<li><strong>xeon</strong> 24-core Intel Xeon Gold 6246 @ 3.30GHz with 384 GB RAM (CPU)</li>
</ol>
<p>As the baseline, we choose the execution of the <em>cpu_seq</em> algorithm on the <em>xeon</em> machine. As the metric for comparison, we report speedups of all other executions over this base case.</p>
<p>Figure 1-a shows the geometric mean of speedups of the four algorithms over the test subsets. Figure 1-b shows the distribution of speedups of the four algorithms over all the instances in the test set, sorted in ascending order by speedup.
</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/GPUprop/speedups-1.png" alt="img1" style="float:center; margin-right: 1%; width:100%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> a) Geometric means of speedups b) speedup distributions in ascending order.</p>
<p>We can see that GPU algorithms on high-end hardware <em>V100</em> and <em>RTX</em> drastically outperform the sequential and shared memory executions. Additionally, the GPU algorithms with atomic operations outperform the reduction version in all executions. The fastest combination is the <em>gpu_atomic</em> algorithm on <em>V100</em>. Its mean speedup is always greater than 1.6 and goes up to 46.9 in Set-8, following a roughly linear trend. Over the entire test set, the mean speedup is 6.1. For the top 5% of the instances, this execution achieves a speedup of at least 62.9 times. As we can see in Figure 1-b, speedups as high as 195x are possible.</p>
<p>A low-end, consumer-grade GPU, the <em>P400</em>, often found in home-use desktops, is also evaluated in the tests. It is evident from the plots that it cannot keep up with the two high-end GPUs. However, we can see that <em>gpu_atomic</em> running on the <em>P400</em> is still competitive with the CPU implementation for about half of the instances, achieving a small speedup over the sequential baseline. This result is interesting because GPUs are currently a resource not used by MIP solvers at all, opening up the possibility of using GPUs as co-processors for MIP solvers even on standard desktop machines.</p>
<p>Looking at the shared-memory parallel <em>cpu_omp</em> algorithm, we can see that it, too, is drastically outperformed by the GPU implementations on <em>V100</em> and <em>RTX</em>. Compared to the sequential base case, it underperforms on about half of the instances. The parallelism found in the domain propagation algorithm is relatively fine-grained, with low arithmetic intensity in the parallel units of work. This does not bode well for shared-memory parallelization on the CPU, where managing CPU threads is comparatively expensive, and explains why current state-of-the-art implementations of Domain Propagation are usually single-threaded.
</p>
<h2 id="conclusions">Conclusions</h2>
<p>In conclusion, the domain propagation algorithm on the GPU achieves ample speedups over its CPU counterparts on the majority of practically relevant instances. While interesting in its own right, this comparison does not tell the whole story! The <em>throughput-based</em> GPU programming model differs significantly from the <em>latency-based</em> sequential model, and comparing the runtimes of parts of the solving process alone might not do justice to the GPU paradigm. Our algorithm runs entirely on the GPU, without the need for synchronization with the CPU, which paves the way to embedding this algorithm in future GPU-based MIP solvers/methods. Put differently, the massive amounts of parallelism on the GPU bring about a different paradigm, with the potential to achieve more than speeding up parts of an otherwise sequential workflow.</p>Boro SofranacTL;DR: This is an informal discussion of our recent paper Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices by Boro Sofranac, Ambros Gleixner, and Sebastian Pokutta. In the paper, we present a new algorithm to perform domain propagation of linear constraints on GPUs efficiently. The results show that efficient implementations of Mixed-integer Programming (MIP) methods are possible on GPUs, even though the success of using GPUs in MIPs has traditionally been limited. Our algorithm is capable of performing domain propagation on the GPU exclusively, without the need for synchronization with the CPU, paving the way for the usage of this algorithm in a new generation of MIP methods that run on GPUs.Join CO@Work and EWG-POR – online and for free!2020-08-30T07:00:00+02:002020-08-30T07:00:00+02:00http://www.pokutta.com/blog/news/2020/08/30/coatwork<p><em>TL;DR: Announcement for CO@WORK and EWG-POR. Fully online and participation is free.</em>
<!--more--></p>
<p><em>Written by Timo Berthold.</em></p>
<p>This year, a lot of scientific conferences and workshops had to be cancelled or postponed – many others took the unprecedented situation as a chance to experiment with new, exciting formats. It is great to see how virtual conferences make the latest research insights available to many more people than traditional on-site events do.</p>
<p>Researchers from <a href="https://www.zib.de/">ZIB</a> and its partners from the <a href="http://forschungscampus-modal.de/">Research Campus Modal</a> will join this latest trend towards online workshops and host two exciting meetings in September, one two-week summer school mainly targeting PhD students and one two-day meeting, addressing practitioners from all industries using Operations Research.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/Picture1.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<h2 id="combinatorial-optimization-at-work">Combinatorial Optimization At Work</h2>
<p><a href="http://co-at-work.zib.de/">CO@Work</a> is an institution. This two-week summer school takes only place every five years and has always brought together researchers, practitioners, and students from all over the world. It addresses PhD students and post-docs interested in the use mathematical optimizations in concrete practical applications. This year’s theme is “Algorithmic Intelligence in Practice”. This amazing event features more than 30 distinguished lecturers from all over the world, including developers and managers of FICO, Google, SAP, Siemens, SAS, Gurobi, Mosek, GAMS, NTT Data, Litic, as well as leading scientists from TU Berlin, FU Berlin, Polytechnique Montréal, RWTH Aachen, the Chinese Academy of Science, University of Southern California, University of Edinburgh, Sabancı Üniversitesi, Escuela Politécnica Nacional Quito, TU Darmstadt, University of Exeter, and many more.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/2015-coatwork.png" alt="img1" style="float:center; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap">CO@Work 2015 participants in front of ZIB.</p>
<p>All lectures will be made available on <a href="https://www.youtube.com/channel/UCphLz_BXrOAInHozAlTsigA">YouTube</a>, and there will be two Q&A sessions for each presentation – typically 11 hours apart, to cover all time zones worldwide. Similarly, there will be two practical exercise sessions each day, with hands-on experience on implementing optimization projects through the Python interfaces of <a href="https://www.fico.com/en/products/fico-xpress-optimization">FICO Xpress</a> and <a href="https://www.scipopt.org/">SCIP</a>. Q&A and exercises will be hosted on Zoom. CO@Work will take place every weekday from September 14 to September 25.
Check out the meeting homepage and register <a href="http://co-at-work.zib.de/">here</a>. Hurry up, registration closes on September 6!</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/ewg.png" alt="img1" style="float:center; width:80%" />
<p style="clear: both;"></p>
</div>
<h2 id="practice-of-operations-research--euro-working-group-meeting">Practice of Operations Research – EURO Working Group meeting</h2>
<p>The <a href="https://www.eventbrite.co.uk/e/challenges-in-the-deployment-of-or-projects-tickets-62398252854">EWG-POR virtual conference</a> is co-hosted by <a href="https://www.zib.de/">ZIB</a> and <a href="https://www.fico.com/en/products/fico-xpress-optimization">FICO</a>. It focusses on an issue that many of us have faced: “Challenges in the deployment of OR projects”. The conference features five keynote lecturers from various industries highlighting how difficulties with the implementation of optimization projects could be overcome. Adrian Zymolka from Axioma will speak about “<strong>Optimization in Finance</strong>”, Colin Silvester from Uniper will present on “<strong>Delivering OR Solutions for Everyday Operations in Energy Trading</strong>”, Steffen Klosterhalfen from BASF will talk about “<strong>Successful Value Chain Optimization at a Chemical Company</strong>”, Ralf Werner from OGE will report on “<strong>Operations Research supporting Germany’s energy transition</strong>”, and Baris Cem Sal from DHL will let us know how they were “<strong>Putting Operations Research into Operations in Deutsche Post DHL Group transition</strong>”.</p>
<p>Furthermore, there will be special 45-minute interactive discussion group sessions. All participants are invited to contribute to one of the four rounds on “Change management issues in practical projects”, “Promotion of OR”, “How to state requirements and project specifications at the beginning” and “Relationship/collaboration between academia and industry”. We are curious to see what comes out of these roundtables.</p>
<p>The event will be completed by a wonderful social entertainment session – join to find out more…</p>
<p>The EWG-POR main event will take place on the 28th and 29th of September. The following five Mondays (Oct 5 – Nov 2), there will be a one hour Webinar series, each with two contributed talks on the topic of “Challenges in the deployment of OR projects”.</p>
<p>Checkout the <a href="https://www.euro-online.org/websites/or-in-practice/event/euro-working-group-practice-of-or-meeting-2020/">meeting homepage</a> and <a href="https://fico.zoom.us/webinar/register/9815939755759/WN_gb2az8OJR8u9y5FPl51RaA">register now</a> here.</p>Timo BertholdTL;DR: Announcement for CO@WORK and EWG-POR. Fully online and participation is free.Projection-Free Optimization on Uniformly Convex Sets2020-07-27T01:00:00+02:002020-07-27T01:00:00+02:00http://www.pokutta.com/blog/research/2020/07/27/uniform-convexity-fw<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2004.11053">Projection-Free Optimization on Uniformly Convex Sets</a> by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux.</em></p>
<p>We consider the following constrained optimization problem</p>
\[\underset{x\in\mathcal{C}}{\text{argmin}} f(x),\]
<p>where $f$ is a $L$-smooth convex function and $\mathcal{C}$ is a compact convex set in a Hilbert space. Frank-Wolfe algorithms form a classical family of first-order iterative methods for solving such problems. Each iteration requires in the worst case the solution of a linear minimization problem over the original feasible domain, a subset of the domain, or a reasonable modification of the domain, <em>i.e.</em> a change that does not implicitly amount to a proximal operation.</p>
<p>The understanding of the convergence rates of Frank-Wolfe algorithms in a variety of settings is an active field of research. For smooth constrained problems, the vanilla Frank-Wolfe algorithm (FW) enjoys a tight sublinear convergence rate of $\mathcal{O}(1/T)$ (see e.g., [J] for an in-depth discussion). There are known accelerated convergence regimes, as a function of the feasible region, only when $\mathcal{C}$ is a polytope or a strongly convex set.</p>
<p class="mathcol"><strong>Frank-Wolfe Algorithm</strong> <br />
<em>Input:</em> Start with $x_0 \in \mathcal{C}$, $L$ upper bound on the Lipschitz constant. <br />
<em>Output:</em> Sequence of iterates $x_t$ <br />
For $t=0, 1, \ldots, T $ do: <br />
$\qquad v_t \leftarrow \underset{v\in\mathcal{C}}{\text{argmax }} \langle -\nabla f(x_t); v - x_t\rangle$ $\qquad \triangleright $ FW vertex <br />
$\qquad \gamma_t \leftarrow \underset{\gamma\in[0,1]}{\text{argmin }} \gamma \langle \nabla f(x_t); v_t - x_t \rangle + \frac{\gamma^2}{2} L \norm{v_t - x_t}^2$ $\qquad \triangleright$ Short step <br />
$\qquad x_{t+1} \leftarrow (1 - \gamma_t)x_{t} + \gamma_t v_{t}$ $\qquad \triangleright$ Convex update</p>
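<p>In code, the loop above can be sketched in a few lines. This is our own minimal Python sketch, not code from the paper; the $\ell_1$-ball oracle below is just one illustrative example of a cheap linear minimization oracle:</p>

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, L, T=500):
    """Vanilla Frank-Wolfe with the short-step rule."""
    x = x0.astype(float)
    for _ in range(T):
        g = grad_f(x)
        v = lmo(g)                      # FW vertex: argmin_{v in C} <g, v>
        d = v - x
        gap = -g @ d                    # Frank-Wolfe gap g_t >= 0
        if gap <= 1e-12:
            break
        # short step: argmin_{gamma in [0,1]} gamma*<g, d> + (gamma^2/2)*L*||d||^2
        gamma = min(1.0, gap / (L * (d @ d)))
        x = x + gamma * d
    return x

def l1_ball_lmo(g, radius=1.0):
    """Linear minimization over the l1 ball: a signed, scaled coordinate vertex."""
    i = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g)
    v[i] = -radius * np.sign(g[i])
    return v

# Example: minimize f(x) = 0.5*||x - b||^2 (so L = 1) over the unit l1 ball.
b = np.array([2.0, 0.5])
x_T = frank_wolfe(lambda x: x - b, l1_ball_lmo, np.zeros(2), L=1.0)
```

<p>On this toy instance the short-step rule reaches the constrained minimizer $(1, 0)$ of the $\ell_1$ ball; only the <code>lmo</code> argument needs to change to run the same loop over other feasible regions.</p>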
<p>When $\mathcal{C}$ is a strongly convex set and \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), FW enjoys linear convergence rates [P, DR]. Recently, [GH] showed that the Frank-Wolfe algorithm converges in \(\mathcal{O}(1/T^2)\) when the objective function is strongly convex, without restriction on the position of the optimum $x^\esx$. Importantly, the conditioning of this sub-linear rate does not depend on the distance of the constrained optimum $x^\esx$ either from the boundary or from the unconstrained optimum, as is the case in [P, DR], or in [GM] when the optimum $x^\esx$ is in the interior of $\mathcal{C}$. Note that all these analyses require short steps as the step-size rule (or even stronger conditions such as line-search).</p>
<p>Finally, when $\mathcal{C}$ is a polytope and $f$ is strongly convex, <em>corrective</em> variants of Frank-Wolfe were recently shown to enjoy linear convergence rates, see [LJ]. No accelerated convergence rates are known for constraint sets that are neither polytopes nor strongly convex sets. We show here that uniformly convex sets, which non-trivially subsume strongly convex sets, systematically enjoy accelerated convergence rate in the respective settings of [P,DR], [GH], and [D].</p>
<h3 id="uniformly-convex-sets">Uniformly Convex Sets</h3>
<p>A closed set $\mathcal{C}\subset\mathbb{R}^d$ is $(\alpha, q)$-uniformly convex with respect to a norm $\norm{\cdot}$, if for any $x,y\in\mathcal{C}$, any $\eta\in[0,1]$, and any $z\in\mathbb{R}^d$ with $\norm{z} = 1$, we have</p>
\[\tag{1}
\eta x + (1-\eta)y + \eta (1 - \eta ) \alpha ||x-y||^q z \in \mathcal{C}.\]
<p>At a high level, this property is a global quantification of the set curvature that subsumes strong convexity. Other equivalent definitions exist; see <em>e.g.</em> [GI] for the strongly convex case. In finite-dimensional spaces, the $\ell_p$ balls form classical and important examples of uniformly convex sets. For $0 < p < 1$, the $\ell_p$ balls are non-convex. For $p\in ]1,2]$, the $\ell_p$ balls are strongly convex (i.e., $(\alpha,2)$-uniformly convex), while for $p>2$ they are $p$-uniformly convex but not strongly convex. The unit balls of the $p$-Schatten norms with $p>1$ and of various group norms are also typical examples.</p>
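<p>To make definition (1) concrete, the following randomized check (our own illustration) verifies it for the unit Euclidean ball with $q = 2$ and $\alpha = 1/2$, a pair of constants that can be shown to work for this set:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, q = 0.5, 2  # constants for which (1) holds for the unit l2 ball

def in_unit_ball(point, tol=1e-12):
    return np.linalg.norm(point) <= 1.0 + tol

for _ in range(10_000):
    x = rng.normal(size=3); x /= max(1.0, np.linalg.norm(x))  # x, y in the ball
    y = rng.normal(size=3); y /= max(1.0, np.linalg.norm(y))
    z = rng.normal(size=3); z /= np.linalg.norm(z)            # arbitrary unit vector
    eta = rng.uniform()
    # the perturbed convex combination from definition (1):
    point = (eta * x + (1 - eta) * y
             + eta * (1 - eta) * alpha * np.linalg.norm(x - y) ** q * z)
    assert in_unit_ball(point)
```

<p>For a polytope the analogous check fails immediately: perturbing the midpoint of an edge in an outward direction $z$ leaves the set, which is exactly the absence of curvature discussed below.</p>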
<div class="center">
<img src="http://www.pokutta.com/blog/assets/uc-fw/list_ball.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Example of $\ell_q$ balls.</p>
<p>Besides the quantification in (1), uniform convexity is a very classical notion in the study of normed spaces. Indeed, it allows one to refine the convex characterization of these spaces’ unit balls, leading to a plethora of interesting properties. For instance, the various uniform convexity types of Banach spaces have consequences, notably in learning theory [DDS], online learning [ST], and concentration inequalities [JN]. Here, we show that the Frank-Wolfe algorithm accelerates as a function of the uniform convexity of $\mathcal{C}$.</p>
<h3 id="convergence-analysis-for-frank-wolfe-with-uniformly-convex-sets">Convergence analysis for Frank-Wolfe with Uniformly Convex Sets</h3>
<h4 id="proof-sketch">Proof Sketch</h4>
<p>At iteration $t$, we have $x_{t+1} = x_t + \gamma_t (v_t - x_t)$ with $\gamma_t\in[0,1]$ chosen to optimize the quadratic upper-bound on $f$ implied by $L$-smoothness. Classically then, we obtain that for any $\gamma\in[0,1]$</p>
\[\tag{2}
f(x_{t+1}) - f(x^\esx) \leq f(x_t) - f(x^\esx) - \gamma \langle - \nabla f(x_t); v_t - x_t\rangle + \frac{\gamma^2}{2} L \norm{x_t - v_t}^2.\]
<p>The Frank-Wolfe gap $g_t = \langle - \nabla f(x_t); v_t - x_t\rangle\geq 0$ contributes to the primal decrease, counter-balanced by the quadratic term on the right-hand side. Informally then, uniform convexity of the set ensures that the distance of the iterate to the Frank-Wolfe vertex $\norm{v_t - x_t}$ shrinks to zero at a specific rate that depends on $g_t$. The schema below illustrates that this is generally not the case when $\mathcal{C}$ is a polytope, and how various types of <em>curvature</em> influence the shrinking of $\norm{v_t - x_t}$ to zero.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/uc-fw/uniform_convexity_assumption_cropped.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> The uniform convexity assumption.</p>
<h4 id="scaling-inequalities">Scaling Inequalities</h4>
<p>The uniform convexity parameters quantify different trade-offs between the convergence to zero of $g_t$ and of $\norm{v_t - x_t}$. In particular, $(\alpha, q)$-uniform convexity of $\mathcal{C}$ implies scaling inequalities of the form
\(\langle -\nabla f(x_t); v_t - x_t\rangle \geq \alpha/2 \norm{\nabla f(x_t)}_\esx \norm{v_t - x_t}^q,\)
where $\norm{\cdot}_\esx$ stands for the dual norm to $\norm{\cdot}$. Plugging this scaling inequality into (2) is then the basis for the various convergence results.</p>
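<p>For the unit $\ell_2$ ball, whose dual norm is again the $\ell_2$ norm and for which $\alpha = 1/2$, $q = 2$ can be used, this scaling inequality is easy to verify numerically. The snippet below is our own sanity check, with a random vector standing in for $\nabla f(x_t)$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, q = 0.5, 2

for _ in range(10_000):
    g = rng.normal(size=4)                    # stands in for grad f(x_t)
    x = rng.normal(size=4)
    x /= max(1.0, np.linalg.norm(x))          # iterate x_t in the unit ball
    v = -g / np.linalg.norm(g)                # FW vertex over the unit l2 ball
    lhs = -g @ (v - x)                        # Frank-Wolfe gap
    rhs = alpha / 2 * np.linalg.norm(g) * np.linalg.norm(v - x) ** q
    assert lhs >= rhs - 1e-10
```
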
<h4 id="convergence-results-with-global-uniform-convexity">Convergence results with global uniform convexity</h4>
<p>Assuming that $\mathcal{C}$ is $(\alpha,q)$-uniformly convex and $f$ is a strongly convex and $L$-smooth function, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-1/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the $\mathcal{O}(1/T^2)$ rate of [GH]. Note that we further generalize these results by relaxing the strong convexity of $f$ to $(\mu, \theta)$-Hölderian Error Bounds, in which case the rates become $\mathcal{O}(1/T^{1/(1-2\theta/q)})$. For more details, see Theorem 2.10 of our paper, or [KDP] for general error bounds in the context of the Frank-Wolfe algorithm.</p>
<p>Similarly, assuming \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), when $\mathcal{C}$ is $(\alpha,q)$-uniformly convex with $q>2$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [P, DR].</p>
<p>These two convergence regimes depend on the global uniform convexity parameters of the set $\mathcal{C}$. However, some sets, for example the $\ell_3$ balls, seem to exhibit various degrees of curvature depending on the position of $x^\esx$ on the boundary $\partial\mathcal{C}$.</p>
<h4 id="a-simple-numerical-experiment">A simple numerical experiment</h4>
<p>We now numerically observe the convergence of Frank-Wolfe where $f$ is a quadratic and the feasible regions are $\ell_p$ balls with varying $p$. We provide two different plots where we vary the position of the optimal solution $x^\esx$. In both cases we use short steps.</p>
<p>In the right figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by $\sum_{i=1}^{d}e_i$, where $(e_i)$ is the canonical basis. Informally, this corresponds to the <em>curvy</em> areas of the $\ell_p$ balls. In the left figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by one of the $e_i$, which corresponds to a <em>flat</em> area for large values of $p$.</p>
<p>We observe that when the optimum is near a <em>curvy</em> area, the convergence rates are asymptotically linear even for $\ell_p$ balls that are not strongly convex.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Optimum $x^\esx$ in flat area</th>
<th style="text-align: center">Optimum $x^\esx$ in curvy area</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="http://www.pokutta.com/blog/assets/uc-fw/500_iter_20000_ls_exact_bad_opt.jpg" alt="Explicative figure" /></td>
<td style="text-align: center"><img src="http://www.pokutta.com/blog/assets/uc-fw/500_iter_20000_ls_exact_good_opt.jpg" alt="Explicative figure" /></td>
</tr>
</tbody>
</table>
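<p>As an aside, linear minimization over an $\ell_p$ ball, as used in the experiment above, admits a closed form via the equality case of Hölder's inequality, which is what makes such experiments cheap. The helper below is our own sketch, not the authors' experimental code:</p>

```python
import numpy as np

def lp_ball_lmo(g, p, radius=1.0):
    """argmin of <g, v> over the lp ball of the given radius (assumes g != 0).

    By Hoelder's inequality the optimal value is -radius * ||g||_{p*},
    with p* = p/(p-1) the dual exponent, attained at the point below.
    """
    w = np.abs(g) ** (1.0 / (p - 1.0))        # |g_i|^{p* - 1}
    v = -radius * np.sign(g) * w / np.linalg.norm(w, ord=p)
    return v
```

<p>For $p = 2$ this recovers $-\text{radius} \cdot g / \norm{g}_2$, and for large $p$ the output flattens toward a sign vector, matching the <em>flat</em> regions of the $\ell_p$ balls seen in the experiment.</p>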
<p>This suggests that the <em>local behavior</em> of $\mathcal{C}$ around $x^\esx$ might even better explain the convergence rates. Providing such an analysis would be in line with [D] which proves linear convergence rates assuming only local strong convexity of $\mathcal{C}$ around $x^\esx$. We extend this result to a localized notion of uniform convexity.</p>
<h3 id="frank-wolfe-analysis-with-localized-uniform-convexity">Frank-Wolfe Analysis with Localized Uniform Convexity</h3>
<p>When $\mathcal{C}$ is not globally uniformly convex, the scaling inequality does not necessarily hold anymore. We rather assume the following localized version around $x^\esx$:</p>
\[\tag{3}
\langle -\nabla f(x^\esx); x^\esx - x\rangle \geq \alpha/2 \norm{\nabla f(x^\esx)}_\esx \norm{x^\esx - x}^q.\]
<p>A localized definition of uniform convexity in the form of (1) applied at $x^\esx \in \partial \mathcal{C}$ indeed implies the <em>local scaling inequality</em> (3). However, the local scaling inequality (3) holds in more general situations, <em>e.g.</em> in the case of the strong convexity analog for the local moduli of rotundity in [GI]. This condition was already identified by Dunn [D], yet without any convergence analysis as soon as $q>2$. In another blog post, we will delve into more details on the generality of (3) and related assumptions in optimization.</p>
<p>Assuming \(\text{inf}_{x\in\mathcal{C}} \norm{\nabla f(x)}_\esx > 0\) and a local scaling inequality at $x^\esx$ with parameters $(\alpha, q)$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ with $q>2$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [D].</p>
<p>Note that the sublinear rate obtained via the local scaling inequality is strictly worse than the one obtained via the global scaling inequality with the same $(\alpha, q)$ parameters. The sublinear rate $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ nevertheless remains better than $\mathcal{O}(1/T)$ when $q>2$.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Our results show that in all regimes (for smooth constrained problems) where the strong convexity of the constraint set is known to accelerate Frank-Wolfe algorithms (see [P, DR], [D] or [GH]), the uniform convexity of the set leads to accelerated convergence rates with respect to $\mathcal{O}(1/T)$ as well.</p>
<p>We also show that the local uniform convexity of $\mathcal{C}$ around $x^\esx$ already induces accelerated convergence rates. In particular, this acceleration is achieved with the vanilla Frank-Wolfe algorithm, which does not require any knowledge about these underlying structural assumptions and their respective parameters. As such, our results further shed light on the adaptive properties of Frank-Wolfe-type algorithms. For instance, see also [KDP] for adaptivity with respect to error bound conditions on the objective function, or [J, LJ] for affine invariant analyses of Frank-Wolfe algorithms when the constraint sets are polytopes.</p>
<h3 id="references">References</h3>
<p>[D] Dunn, Joseph C. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17.2 (1979): 187-211. <a href="https://epubs.siam.org/doi/pdf/10.1137/0324071?casa_token=mV4qkf9aLskAAAAA:--jyeKNCSwAH5fejuzgJr1im_OXPyesfgPOU1fk-cfmBYZjTdrRSAHHfZEWjQRUaYSI0vNPB7NwY">pdf</a></p>
<p>[DR] Demyanov, V. F. ; Rubinov, A. M. Approximate methods in optimization problems. Modern Analytic and Computational Methods in Science and Mathematics, 1970.</p>
<p>[DDS] Donahue, M. J.; Darken, C.; Gurvits, L.; Sontag, E. (1997). Rates of convex approximation in non-Hilbert spaces. <em>Constructive Approximation</em>, <em>13</em>(2), 187-220.</p>
<p>[GH] Garber, D.; Hazan, E. Faster rates for the frank-wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015. <a href="https://arxiv.org/abs/1406.1305">pdf</a></p>
<p>[GI] Goncharov, V. V.; Ivanov, G. E. Strong and weak convexity of closed sets in a hilbert space. In Operations research engineering, and cyber security, pages 259–297. Springer, 2017.</p>
<p>[GM] Guélat, J.; Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119.</p>
<p>[HL] Huang, R.; Lattimore, T.; György, A.; Szepesvári, C. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. In Advances in Neural Information Processing Systems, pages 4970–4978, 2016. <a href="https://arxiv.org/abs/1702.03040">pdf</a></p>
<p>[J] Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th international conference on machine learning, ICML 2013. p. 427-435. <a href="http://www.jmlr.org/proceedings/papers/v28/jaggi13.pdf">pdf</a></p>
<p>[JN] Juditsky, A.; Nemirovski, A.S. Large deviations of vector-valued martingales in 2-smooth normed spaces. <a href="https://arxiv.org/abs/0809.0813">pdf</a></p>
<p>[KDP] Kerdreux, T.; d’Aspremont, A.; Pokutta, S. 2019. Restarting Frank-Wolfe. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1275-1283). <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>
<p>[LJ] Lacoste-Julien, S. ; Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In : Advances in neural information processing systems. 2015. p. 496-504. <a href="https://infoscience.epfl.ch/record/229239/files/nips15_paper_sup_camera_ready.pdf">pdf</a></p>
<p>[P] Polyak, B. T. Existence theorems and convergence of minimizing sequences for extremal problems with constraints. In Doklady Akademii Nauk, volume 166, pages 287–290. Russian Academy of Sciences, 1966.</p>
<p>[ST] Sridharan, K.; Tewari, A. (2010, June). Convex Games in Banach Spaces. In <em>COLT</em> (pp. 1-13). <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.304.5992&rep=rep1&type=pdf">pdf</a></p>Thomas KerdreuxTL;DR: This is an informal summary of our recent paper Projection-Free Optimization on Uniformly Convex Sets by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.Second-order Conditional Gradient Sliding2020-06-20T07:00:00+02:002020-06-20T07:00:00+02:00http://www.pokutta.com/blog/research/2020/06/20/socgs<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.08907">Second-order Conditional Gradient Sliding</a> by <a href="https://alejandro-carderera.github.io/">Alejandro Carderera</a> and <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, where we present a second-order analog of the Conditional Gradient Sliding algorithm [LZ] for smooth and strongly-convex minimization problems over polytopes. The algorithm combines Inexact Projected Variable-Metric (PVM) steps with independent Away-step Conditional Gradient (ACG) steps to achieve global linear convergence and local quadratic convergence in primal gap. 
The resulting algorithm outperforms other projection-free algorithms in applications where first-order information is costly to compute.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Consider a problem of the form:
\[\tag{minProblem}
\begin{align}
\label{eq:minimizationProblem}
\min\limits_{x \in \mathcal{X}} f(x),
\end{align}\]
where $\mathcal{X}$ is a polytope and $f(x)$ is a twice differentiable function that is strongly convex and smooth. We assume that solving an LP over $\mathcal{X}$ is easy, but projecting using the Euclidean norm (or any other norm) onto $\mathcal{X}$ is expensive. Moreover, we also assume that evaluating $f(x)$ is expensive, and so is computing the gradient and the Hessian of $f(x)$. An example of such an objective function can be found when solving an MLE problem to estimate the parameters of a Gaussian distribution modeled as a sparse undirected graph [BEA] (also known as the Graphical Lasso problem). Another example is the objective function used in logistic regression problems when the number of samples is high.</p>
<h2 id="projected-variable-metric-algorithms">Projected Variable-Metric algorithms</h2>
<p>Working with such unwieldy functions is often too expensive, and so a popular approach to tackling (minProblem) is to construct an approximation to the original function whose gradients are easier to compute. A linear approximation of $f(x)$ at $x_k$ using only first-order information will not contain any curvature information, giving us little to work with. Consider, on the other hand, a quadratic approximation of $f(x)$ at $x_k$, denoted by $\hat{f_k}(x)$, that is:</p>
\[\tag{quadApprox}
\begin{align}
\label{eq:quadApprox}
\hat{f_k}(x) = f(x_k) + \left\langle \nabla f(x_k), x - x_k \right\rangle + \frac{1}{2} \norm{x - x_k}_{H_k}^2,
\end{align}\]
<p>where $H_k$ is a positive definite matrix that approximates the Hessian $\nabla^2 f(x_k)$. Algorithms that minimize the quadratic approximation $\hat{f}_k(x)$ over $\mathcal{X}$ at each iteration and set</p>
\[x_{k+1} = x_k + \gamma_k (\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) - x_k)\]
<p>for some \(\gamma_k \in [0,1]\) are dubbed <em>Projected Variable-Metric</em> (PVM) algorithms. These algorithms are useful when the progress per unit time obtained by moving towards the minimizer of $\hat{f}_k(x)$ over $\mathcal{X}$ at each time step is greater than the progress per unit time obtained by taking a step of any other first-order algorithm that makes use of the original function (whose gradients are very expensive to compute). We define the scaled projection of $x$ onto $\mathcal{X}$ when we measure the distance in the $H$-norm as \(\Pi_{\mathcal{X}}^{H} (y) \stackrel{\mathrm{\scriptscriptstyle def}}{=} \text{argmin}_{x\in\mathcal{X}} \norm{x - y}_{H}\). This allows us to interpret the steps taken by PVM algorithms as:</p>
\[\tag{stepPVM}
\begin{align}
\label{eq:stepPVM}
\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) = \Pi_{\mathcal{X}}^{H_k} \left( x_k - H_k^{-1} \nabla f(x_k) \right).
\end{align}\]
<p>These algorithms owe their name to this interpretation: at each iteration, as $H_k$ varies, we change the metric (the norm) with which we perform the scaled-projections, and we deform the negative of the gradient using this metric. The next image gives a schematic overview of a step of the PVM algorithm. The polytope $\mathcal{X}$ is depicted with solid black lines, the contour lines of the original objective function $f(x)$ are depicted with solid blue lines, and the contour lines of the quadratic approximation \(\hat{f}_k(x)\) are depicted with dashed red lines. Note that \(x_k - H_k^{-1}\nabla f(x_k)\) is the unconstrained minimizer of the quadratic approximation \(\hat{f}_k(x)\). The iterate used in the PVM algorithm to define the directions along which we move, i.e., \(\text{argmin}_{x\in \mathcal{X}} \hat{f}_k(x)\), is simply the scaled projection of that unconstrained minimizer onto \(\mathcal{X}\) using the norm \(\norm{\cdot}_{H_k}\) defined by $H_k$.</p>
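<p>The identity (stepPVM) is a completing-the-square computation: up to an additive constant that does not depend on $x$, $\hat{f}_k(x)$ equals $\frac{1}{2}\norm{x - (x_k - H_k^{-1}\nabla f(x_k))}_{H_k}^2$, so the two problems share the same minimizer over $\mathcal{X}$. A small numerical check of this identity (our own illustration, with random data standing in for $H_k$ and $\nabla f(x_k)$):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)          # a positive definite H_k
x_k = rng.normal(size=n)
g = rng.normal(size=n)               # stands in for grad f(x_k)
y = x_k - np.linalg.solve(H, g)      # unconstrained minimizer of the model

def f_hat(x):
    """Quadratic model (quadApprox), dropping the constant f(x_k)."""
    return g @ (x - x_k) + 0.5 * (x - x_k) @ H @ (x - x_k)

def half_sq_dist_H(x):
    """0.5 * ||x - y||_H^2."""
    return 0.5 * (x - y) @ H @ (x - y)

# The two objectives differ only by a constant, hence share their minimizer
# over any feasible set X.
x1, x2 = rng.normal(size=n), rng.normal(size=n)
assert np.isclose(f_hat(x1) - half_sq_dist_H(x1), f_hat(x2) - half_sq_dist_H(x2))
```
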
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/SchematicAlgorithm.png" alt="Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$." style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$.</p>
<p>Note that if we set $H_k = \nabla^2 f(x_k)$ the PVM algorithm is equivalent to the <em>Projected Newton</em> algorithm, and if we set $H_k = I^n$, where $I^n$ is the identity matrix, the algorithm is equal to the <em>Projected Gradient Descent</em> algorithm. Intuitively, when $H_k$ is a good approximation to the Hessian $\nabla^2 f(x_k)$ we can expect to make good progress when moving along these directions. In terms of convergence, the PVM algorithm has a <em>global</em> linear convergence rate in primal gap when using an exact line search [KSJ], although with a dependence on the condition number that is worse than that of Projected Gradient Descent or the <em>Away-step Conditional Gradient</em> (ACG) algorithms. Moreover, the algorithm has a <em>local</em> quadratic convergence rate with a unit step size when close to the optimum $x^\esx$ if the matrix $H_k$ becomes a better and better approximation to $\nabla^2 f(x_k)$ as we approach $x^\esx$ (which we also assume in our theoretical results).</p>
<h2 id="second-order-conditional-gradient-sliding-algorithm">Second-order Conditional Gradient Sliding algorithm</h2>
<p>Two questions arise:</p>
<ol>
<li>Can we achieve a global linear convergence rate on par with that of the Away-step Conditional Gradient algorithm?</li>
<li>Solving the problem shown in (stepPVM) to optimality is often too expensive. Can we solve the problem to some $\varepsilon_k$-optimality and keep the local quadratic convergence?</li>
</ol>
<p>The <em>Second-order Conditional Gradient Sliding</em> (SOCGS) algorithm is designed with these considerations in mind, providing global linear convergence in primal gap and local quadratic convergence in primal gap and distance to $x^\esx$. The algorithm couples an independent ACG step with line search with an Inexact PVM step with a unit step size. At the end of each iteration, we choose the step that provides the greatest primal progress. The independent ACG steps will ensure global linear convergence in primal gap, and the Inexact PVM steps will provide quadratic convergence. Moreover, the line search in the ACG step can be substituted with a step size strategy that requires knowledge of the $L$-smoothness parameter of $f(x)$ [PNAM].</p>
<p>We compute the PVM step inexactly using the (same) ACG algorithm with an exact line search, thereby making the SOCGS algorithm <em>projection-free</em>. As the function being minimized in the Inexact PVM steps is quadratic, there is a closed-form expression for the optimal step size. The scaled projection problem is solved to $\varepsilon_k$-optimality, using the Frank-Wolfe gap as a stopping criterion, as in the Conditional Gradient Sliding (CGS) algorithm [LZ]. The CGS algorithm uses the vanilla Conditional Gradient algorithm to find approximate solutions to the Euclidean projection problems that arise in <em>Nesterov’s Accelerated Gradient Descent</em> steps; in the SOCGS algorithm, we use the ACG algorithm to find approximate solutions to the scaled projection problems that arise in PVM steps.</p>
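<p>Both ingredients, the closed-form exact line search for a quadratic and the Frank-Wolfe gap as a stopping criterion, are easy to sketch. Below is an illustrative example over the probability simplex (the function and oracle names are ours); for a quadratic $q$ the exact step size along direction $d$ from $y$ is $\gamma = -\nabla q(y)^\top d / (d^\top H d)$ clipped to $[0,1]$, and the Frank-Wolfe gap upper-bounds the primal gap by convexity, so it certifies $\varepsilon_k$-optimality.</p>

```python
import numpy as np

def quad_exact_step(grad_y, d, H):
    """Exact line search for a quadratic q along direction d:
    argmin over gamma in [0, 1] of q(y + gamma * d) has the closed form
    gamma = -grad_y^T d / (d^T H d), clipped to [0, 1]."""
    denom = float(d @ H @ d)
    if denom <= 0.0:
        return 1.0
    return float(np.clip(-(grad_y @ d) / denom, 0.0, 1.0))

def fw_gap_simplex(grad_y, y):
    """Frank-Wolfe gap over the probability simplex, max_v grad^T (y - v):
    the maximizing vertex is the coordinate of the smallest gradient entry."""
    v = np.zeros_like(y)
    v[np.argmin(grad_y)] = 1.0
    return float(grad_y @ (y - v))

def cg_until_gap(H, c, eps, y0, max_iter=10000):
    """Vanilla CG on q(y) = 0.5 (y - c)^T H (y - c) over the simplex,
    stopped once the Frank-Wolfe gap certifies eps-optimality (the same
    kind of certificate used for the scaled projection subproblems)."""
    y = y0.astype(float)
    for _ in range(max_iter):
        g = H @ (y - c)
        if fw_gap_simplex(g, y) <= eps:
            break
        v = np.zeros_like(y)
        v[np.argmin(g)] = 1.0
        d = v - y
        y = y + quad_exact_step(g, d, H) * d
    return y
```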
<h3 id="accuracy-parameter-varepsilon_k">Accuracy Parameter $\varepsilon_k$.</h3>
<p>The accuracy parameter $\varepsilon_k$ in the SOCGS algorithm depends on a lower bound on the primal gap of (minProblem), which we denote by $lb\left( x_k \right)$ and which satisfies $lb\left( x_k \right) \leq f\left(x_k \right) - f\left(x^\esx \right)$.</p>
<p>In several machine learning applications, the value of $f(x^\esx)$ is known a priori, as is the case for the approximate Carathéodory problem (see the post <a href="/blog/research/2019/11/30/approxCara-abstract.html">Approximate Carathéodory via Frank-Wolfe</a>, where $f(x^\esx) = 0$). In other applications, estimating $f(x^\esx)$ is easier than estimating the strong convexity parameter (see [BTA] for an in-depth discussion). In these cases, this knowledge allows for tight lower bounds on the primal gap.</p>
<p>If there is no easy way to estimate the value of $f(x^\esx)$, we can compute a lower bound on the primal gap at $x_k$ that is bounded away from zero using any CG variant that monotonically decreases the primal gap. It suffices to run an arbitrary number of steps $n \geq 1$ of such a variant to minimize $f(x)$ starting from $x_k$, resulting in $x_k^n$. Noting that $f(x_k^n) \geq f(x^\esx)$ allows us to conclude that $f(x_k) - f(x^\esx) \geq f(x_k) - f(x_k^n)$, and therefore $lb\left( x_k \right) = f(x_k) - f(x^n_k)$ is a valid lower bound. The more CG steps we perform from $x_k$, the tighter the resulting lower bound becomes.</p>
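<p>This lower-bounding procedure is straightforward to sketch (an illustrative simplification: the agnostic step-size rule and the improving-step acceptance test below are ours, not the paper's; the acceptance test enforces the monotone decrease that keeps the bound nonnegative):</p>

```python
import numpy as np

def primal_gap_lower_bound(f, grad_f, lmo, x_k, n_steps=5):
    """Valid lower bound on the primal gap at x_k: run a few monotone
    vanilla CG steps from x_k to obtain x_k^n, then use
        lb(x_k) = f(x_k) - f(x_k^n) <= f(x_k) - f(x*),
    which holds because f(x_k^n) >= f(x*).
    `lmo` is a linear minimization oracle returning argmin_{v in P} <g, v>."""
    x = x_k.copy()
    for t in range(n_steps):
        d = lmo(grad_f(x)) - x
        x_new = x + (2.0 / (t + 2.0)) * d
        if f(x_new) < f(x):  # accept only improving steps (monotone decrease)
            x = x_new
    return f(x_k) - f(x)
```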
<h3 id="complexity-analysis">Complexity Analysis</h3>
<p>For the complexity analysis, we assume that we have at our disposal the tightest possible bound on the primal gap, namely $lb\left( x_k \right) = f(x_k) - f(x^\esx)$. A looser lower bound increases the number of linear minimization calls but does not increase the number of first-order or approximate Hessian oracle calls. As in the classical analysis of Projected Newton algorithms, after a finite number of iterations that is independent of the target accuracy $\varepsilon$ (iterations during which the algorithm converges linearly in primal gap), the algorithm enters a regime of quadratic convergence in primal gap. Once in this phase, the algorithm requires $\mathcal{O}\left( \log(1/\varepsilon) \log(\log 1/\varepsilon)\right)$ calls to a linear minimization oracle and $\mathcal{O}\left( \log(\log 1/\varepsilon)\right)$ calls to a first-order and approximate Hessian oracle to reach an $\varepsilon$-optimal solution.</p>
<p>If we were to solve problem (minProblem) using the Away-step Conditional Gradient algorithm we would need $\mathcal{O}\left( \log(1/\varepsilon)\right)$ calls to a linear minimization and first-order oracle. Using the SOCGS algorithm makes sense if the linear minimization calls are not the computational bottleneck of the algorithm and the approximate Hessian oracle is about as expensive as the first-order oracle.</p>
<h3 id="computational-experiments">Computational Experiments</h3>
<p>We compare the performance of the SOCGS algorithm with that of other projection-free first-order algorithms in settings where computing first-order information is expensive (and computing Hessian information is just as expensive). We also compare our algorithm with the recent <em>Newton Conditional Gradient</em> (NCG) algorithm [LCT], which minimizes a self-concordant function over a convex set by performing Inexact Newton steps (thereby requiring an exact Hessian oracle), using a Conditional Gradient algorithm to compute the scaled projections. After a finite number of iterations (independent of the target accuracy $\varepsilon$), the convergence rate of the NCG algorithm is linear in primal gap. Once inside this phase, an $\varepsilon$-optimal solution is reached after $\mathcal{O}\left(\log 1/\varepsilon\right)$ exact Hessian and first-order oracle calls and $\mathcal{O}( 1/\varepsilon^{\nu})$ linear minimization oracle calls, where $\nu$ is a constant greater than one.</p>
<p>In the first experiment, the Hessian information will be inexact (but subject to an asymptotic accuracy assumption), so we only compare to other first-order projection-free algorithms. In the second and third experiments, the Hessian oracle will be exact. For reference, the algorithms in the legend correspond to the vanilla Conditional Gradient (CG), the Away-step Conditional Gradient (ACG) [GM], the Lazy Away-step Conditional Gradient (ACG (L)) [BPZ], the Pairwise-step Conditional Gradient (PCG) [LJ], the Conditional Gradient Sliding (CGS) [LZ], the Stochastic Variance-Reduced Conditional Gradient (SVRCG) [HL], the Decomposition-Invariant Conditional Gradient (DICG) [GM2], and the Newton Conditional Gradient (NCG) [LCT] algorithms. We also present an LBFGS version of SOCGS (SOCGS LBFGS); note, however, that this variant, while performing well, does not formally satisfy our assumptions.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/Birkhoff_Experiments.png" alt="fig2" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Sparse coding over the Birkhoff polytope.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/GLassoPSD.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Inverse covariance estimation over the spectrahedron.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/LogReg.png" alt="fig4" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Structured logistic regression over $\ell_1$ unit ball.</p>
<h3 id="references">References</h3>
<p>[LZ] Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. In <em>SIAM Journal on Optimization</em> 26(2) (pp. 1379–1409). SIAM. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[BEA] Banerjee, O., & El Ghaoui, L. & d’Aspremont, A. (2008). Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. In <em>Journal of Machine Learning Research</em> 9 (2008) (pp. 485–516). JMLR. <a href="http://www.jmlr.org/papers/volume9/banerjee08a/banerjee08a.pdf">pdf</a></p>
<p>[KSJ] Karimireddy, S.P., & Stich, S.U. & Jaggi, M. (2018). Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. <em>arXiv preprint:1806.00413</em>. <a href="https://arxiv.org/pdf/1806.00413.pdf">pdf</a></p>
<p>[PNAM] Pedregosa, F., & Negiar, G. & Askari, A. & Jaggi, M. (2020). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. In <em>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</em>. <a href="http://proceedings.mlr.press/v108/pedregosa20a/pedregosa20a-supp.pdf">pdf</a></p>
<p>[BTA] Barré, M., & Taylor, A. & d’Aspremont, A. (2020). Complexity Guarantees for Polyak Steps with Momentum. <em>arXiv preprint:2002.00915</em>. <a href="https://arxiv.org/pdf/2002.00915.pdf">pdf</a></p>
<p>[LCT] Liu, D., & Cevher, V. & Tran-Dinh, Q. (2020). A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization. <em>arXiv preprint:2002.07003</em>. <a href="https://arxiv.org/pdf/2002.07003.pdf">pdf</a></p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. In <em>Mathematical Programming</em> 35(1) (pp. 110–119). Springer. <a href="http://www.iro.umontreal.ca/~marcotte/ARTIPS/1986_MP.pdf">pdf</a></p>
<p>[BPZ] Braun, G., & Pokutta, S. & Zink, D. (2017). Lazifying Conditional Gradient Algorithms. In <em>Proceedings of the 34th International Conference on Machine Learning</em>. <a href="http://proceedings.mlr.press/v70/braun17a/braun17a.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S. & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In <em>Advances in Neural Information Processing Systems</em> 2015 (pp. 496-504). <a href="https://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[HL] Hazan, E. & Luo, H. (2016). Variance-reduced and projection-free stochastic optimization. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>. <a href="https://arxiv.org/pdf/1602.02101.pdf">pdf</a></p>
<p>[GM2] Garber, D. & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In <em>Advances in Neural Information Processing Systems</em> 2016 (pp. 1001–1009). <a href="https://arxiv.org/pdf/1605.06492.pdf">pdf</a></p>
<p><em>Written by Alejandro Carderera.</em></p>