<p><em>One trivial observation at a time – Everything Mathematics, Optimization, Machine Learning, and Artificial Intelligence</em></p>
<h1 id="accelerating-domain-propagation-via-gpus">Accelerating Domain Propagation via GPUs</h1>
<p><em>Posted on 2020-09-20 at <a href="http://www.pokutta.com/blog/research/2020/09/20/gpu-prob">pokutta.com</a>.</em></p>
<p><em>TL;DR: This is an informal discussion of our recent paper <a href="https://arxiv.org/abs/2009.07785">Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices</a> by Boro Sofranac, Ambros Gleixner, and Sebastian Pokutta. In the paper, we present a new algorithm to perform domain propagation of linear constraints on GPUs efficiently. The results show that efficient implementations of Mixed-integer Programming (MIP) methods are possible on GPUs, even though the success of using GPUs in MIPs has traditionally been limited. Our algorithm is capable of performing domain propagation on the GPU exclusively, without the need for synchronization with the CPU, paving the way for the usage of this algorithm in a new generation of MIP methods that run on GPUs.</em>
<!--more--></p>
<p><em>Written by Boro Sofranac.</em>
</p>
<h2 id="the-motivation">The motivation</h2>
<p>
Since the advent of general-purpose, massively parallel GPU hardware, many fields of applied mathematics have actively sought to design specialized algorithms that exploit the unprecedented computational resources this new type of hardware offers. A prime example is neural networks, whose rise to prominence was fueled by the “Deep Learning revolution”, with deep learning methods running on GPUs. Such success is still missing in many other fields that are less amenable to the specialized, massively parallel programming model of GPUs. One such field is Mixed-integer Programming (MIP).</p>
<p>The development of the <em>Simplex</em> algorithm for solving Linear Programs (LPs) in the 1940s was followed by a plethora of methods and solvers for LPs and MIPs over the following decades. A unifying characteristic of these methods is a) that they exhibit non-uniform algorithmic behaviour, and b) that they operate on highly irregular data (i.e., the data structures are sparse and in general lack regular structure). Such characteristics make the implementation of these methods on massively parallel hardware challenging and have hindered the application of GPUs in the field.</p>
<p>Recognizing these challenges in a number of fields, GPU hardware development in recent years has been geared towards easier expression of non-uniform workflows in its programming model (for example, the <a href="https://developer.nvidia.com/blog/introduction-cuda-dynamic-parallelism/"><em>Dynamic Parallelism</em></a> feature of NVIDIA GPUs) and hardware (e.g., <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions">atomic operations</a>). At the same time, researchers in related fields have shown that with the right algorithmic design GPUs can be used efficiently for workflows previously considered unsuitable due to challenges similar to those present in MIP methods. Consider for example <em>Numerical Linear Algebra</em> (LA): for years, dense LA has been one of the main beneficiaries of GPU computing, with tremendous speedups. On the other hand, if the input data was sparse (and irregular), the same algorithms often exhibited disappointing performance. New algorithms, however, such as the <a href="https://doi.org/10.1109/SC.2014.68"><em>CSR-Adaptive</em> algorithm developed by Greathouse and Daga</a> for sparse matrix-vector products (SpMV), have shown very impressive performance gains for highly unstructured sparse matrices.</p>
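<p>To make the irregularity concrete, recall the CSR (compressed sparse row) format that such SpMV kernels operate on. The following plain-Python sketch (ours, for illustration only) computes one matrix-vector product row by row; on a GPU, a naive one-thread-per-row mapping makes each thread's work proportional to its row length, and wildly varying row lengths then cause exactly the load imbalance that algorithms like CSR-Adaptive are designed to mitigate.</p>

```python
def csr_spmv(indptr, indices, data, x):
    """y = A @ x for A in CSR format: row i's nonzeros live in
    data[indptr[i]:indptr[i+1]] at columns indices[indptr[i]:indptr[i+1]]."""
    y = []
    for i in range(len(indptr) - 1):
        # work per row is proportional to the row's nonzero count,
        # which is highly non-uniform for typical MIP constraint matrices
        y.append(sum(data[k] * x[indices[k]]
                     for k in range(indptr[i], indptr[i + 1])))
    return y
```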
<p>Against this backdrop, we set out to investigate the applicability of massively parallel algorithms in MIP methods. This, however, is no simple task: massively parallel programming models bring with them a different algorithmic paradigm and different complexity notions, while MIP methods have historically been strongly sequential. Basic design decisions need to be rethought and new algorithms developed. We therefore took a core MIP method used by all state-of-the-art solvers, namely <em>Domain Propagation</em>, which is traditionally not amenable to efficient parallelization, and showed that with the right algorithmic design the speedups seen in fields such as sparse linear algebra are also possible here. This new GPU algorithm for <em>Domain Propagation</em> and computational experiments assessing its performance are presented in <a href="https://arxiv.org/abs/2009.07785">our paper</a>; the code is available on our <a href="https://github.com/Sofranac-Boro/gpu-domain-propagator"><em>GitHub</em> page</a>.
</p>
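<p>For readers unfamiliar with domain propagation: given a linear constraint $\text{lhs} \leq a^\top x \leq \text{rhs}$ and variable bounds, one computes the constraint's minimal and maximal activities and uses the residual activity (the activity with one variable removed) to tighten that variable's bounds. The sketch below is a deliberately simplified sequential Python version for a single constraint (our illustration; the implementations in the paper follow state-of-the-art solvers and additionally handle integrality, numerical tolerances, and iteration over many constraints until a fixed point is reached):</p>

```python
def propagate_row(a, lhs, rhs, lb, ub):
    """One round of bound tightening for a single linear constraint
    lhs <= a^T x <= rhs, given variable bounds lb <= x <= ub."""
    # per-variable minimal/maximal contributions to the constraint activity
    mins = [ai * (lbi if ai > 0 else ubi) for ai, lbi, ubi in zip(a, lb, ub)]
    maxs = [ai * (ubi if ai > 0 else lbi) for ai, lbi, ubi in zip(a, lb, ub)]
    minact, maxact = sum(mins), sum(maxs)
    new_lb, new_ub = list(lb), list(ub)
    for j, aj in enumerate(a):
        if aj == 0:
            continue
        res_min = minact - mins[j]  # minimal activity of the other variables
        res_max = maxact - maxs[j]  # maximal activity of the other variables
        if aj > 0:
            new_ub[j] = min(new_ub[j], (rhs - res_min) / aj)
            new_lb[j] = max(new_lb[j], (lhs - res_max) / aj)
        else:
            new_lb[j] = max(new_lb[j], (rhs - res_min) / aj)
            new_ub[j] = min(new_ub[j], (lhs - res_max) / aj)
    return new_lb, new_ub
```

For example, with $2x + 3y \leq 12$ and $x, y \in [0, 10]$, the residual activities yield $x \leq 6$ and $y \leq 4$.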
<h2 id="sneak-peek-at-the-results">Sneak peek at the results</h2>
<p>
We use the <em>de-facto</em> standard <a href="https://miplib.zib.de/">MIPLIB2017</a> test set for MIPs to conduct our experiments. To better capture the response of our algorithm to the size of instances, we subdivided this set into 8 subsets with instances of increasing size, dubbed Set-1 to Set-8. For comparison, we use four algorithms:</p>
<ol>
<li><strong>cpu_seq</strong> is a sequential implementation of domain propagation which closely follows implementations in state-of-the-art solvers.</li>
<li><strong>cpu_omp</strong> is a shared-memory parallel version of domain propagation which runs on the CPU.</li>
<li><strong>gpu_atomic</strong> is our GPU implementation with atomic operations.</li>
<li><strong>gpu_reduction</strong> is our GPU implementation which avoids using atomic operations by using reductions in global memory.</li>
</ol>
<p>The algorithms are tested on the following hardware:</p>
<ol>
<li><strong>V100</strong> NVIDIA Tesla V100 PCIe 32GB (GPU)</li>
<li><strong>RTX</strong> NVIDIA Titan RTX 24GB (GPU)</li>
<li><strong>P400</strong> NVIDIA Quadro P400 2GB (GPU)</li>
<li><strong>amdtr</strong> 64-core AMD Ryzen Threadripper 3990X @ 3.30 GHz with 128 GB RAM (CPU)</li>
<li><strong>xeon</strong> 24-core Intel Xeon Gold 6246 @ 3.30GHz with 384 GB RAM (CPU)</li>
</ol>
<p>As the baseline, we choose the execution of the <em>cpu_seq</em> algorithm on the <em>xeon</em> machine. As the metric for comparison, we report speedups of all other executions over this base case.</p>
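<p>Since speedups are ratios, the figures below aggregate them with the geometric mean rather than the arithmetic mean. As a reminder of this aggregation, here is a small helper (ours, not the paper's evaluation code):</p>

```python
import math

def geo_mean_speedup(baseline_times, times):
    """Geometric mean of per-instance speedups baseline_times[i] / times[i]."""
    ratios = [b / t for b, t in zip(baseline_times, times)]
    # computed in log space, which is numerically safer for many instances
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```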
<p>Figure 1-a shows the geometric mean of speedups of the four algorithms over the test subsets. Figure 1-b shows the distribution of speedups of the four algorithms over all the instances in the test set, sorted in ascending order by speedup.
</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/GPUprop/speedups-1.png" alt="img1" style="float:center; margin-right: 1%; width:100%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> a) Geometric means of speedups b) speedup distributions in ascending order.</p>
<p>We can see that GPU algorithms on high-end hardware <em>V100</em> and <em>RTX</em> drastically outperform the sequential and shared memory executions. Additionally, the GPU algorithms with atomic operations outperform the reduction version in all executions. The fastest combination is the <em>gpu_atomic</em> algorithm on <em>V100</em>. Its mean speedup is always greater than 1.6 and goes up to 46.9 in Set-8, following a roughly linear trend. Over the entire test set, the mean speedup is 6.1. For the top 5% of the instances, this execution achieves a speedup of at least 62.9 times. As we can see in Figure 1-b, speedups as high as 195x are possible.</p>
<p>A low-end, consumer-grade GPU, the <em>P400</em>, often found in home-use desktops, is also evaluated in the tests. It is evident from the plots that it cannot keep up with the two high-end GPUs. However, <em>gpu_atomic</em> running on the <em>P400</em> still achieves a small speedup over the sequential CPU baseline on about half of the instances. This result is interesting considering that GPUs are currently a resource not used by MIP solvers at all, opening up the possibility of using GPUs as co-processors for MIP solvers even on standard desktop machines.</p>
<p>Looking at the shared-memory parallel <em>cpu_omp</em> algorithm, we can see that it, too, is drastically outperformed by the GPU implementations on <em>V100</em> and <em>RTX</em>. Compared to the sequential base case, it underperforms on about half of the instances. The parallelism found in the domain propagation algorithm is relatively fine-grained, with low arithmetic intensity in the parallel units of work. This does not bode well for shared-memory parallelization on the CPU, where managing CPU threads is comparatively expensive, and explains why current state-of-the-art implementations of domain propagation are usually single-threaded.
</p>
<h2 id="conclusions">Conclusions</h2>
<p>In conclusion, the domain propagation algorithm on the GPU achieves ample speedups over its CPU counterparts on the majority of practically relevant instances. While interesting in its own right, this result does not tell the whole story! The <em>throughput-based</em> GPU programming model differs significantly from the <em>latency-based</em> sequential model, and comparing the runtimes of parts of the solving process alone might not do justice to the GPU paradigm. Our algorithm runs entirely on the GPU, without the need for synchronization with the CPU, which paves the way to embedding it in future GPU-based MIP solvers and methods. Put differently, the massive amounts of parallelism on the GPU bring about a different paradigm, with the potential to achieve more than speeding up parts of an otherwise sequential workflow.</p>
<h1 id="join-coatwork-and-ewg-por">Join CO@Work and EWG-POR – online and for free!</h1>
<p><em>Posted on 2020-08-30 at <a href="http://www.pokutta.com/blog/news/2020/08/30/coatwork">pokutta.com</a>.</em></p>
<p><em>TL;DR: Announcement for CO@WORK and EWG-POR. Fully online and participation is free.</em>
<!--more--></p>
<p><em>Written by Timo Berthold.</em></p>
<p>This year, a lot of scientific conferences and workshops had to be cancelled or postponed – many others took the unprecedented situation as a chance to experiment with new, exciting formats. It is great to see how virtual conferences make the latest research insights available to many more people than traditional on-site events do.</p>
<p>Researchers from <a href="https://www.zib.de/">ZIB</a> and its partners from the <a href="http://forschungscampus-modal.de/">Research Campus Modal</a> will join this latest trend towards online workshops and host two exciting meetings in September, one two-week summer school mainly targeting PhD students and one two-day meeting, addressing practitioners from all industries using Operations Research.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/Picture1.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<h2 id="combinatorial-optimization-at-work">Combinatorial Optimization At Work</h2>
<p><a href="http://co-at-work.zib.de/">CO@Work</a> is an institution. This two-week summer school takes place only every five years and has always brought together researchers, practitioners, and students from all over the world. It addresses PhD students and post-docs interested in the use of mathematical optimization in concrete practical applications. This year’s theme is “Algorithmic Intelligence in Practice”. This amazing event features more than 30 distinguished lecturers from all over the world, including developers and managers of FICO, Google, SAP, Siemens, SAS, Gurobi, Mosek, GAMS, NTT Data, Litic, as well as leading scientists from TU Berlin, FU Berlin, Polytechnique Montréal, RWTH Aachen, the Chinese Academy of Science, University of Southern California, University of Edinburgh, Sabancı Üniversitesi, Escuela Politécnica Nacional Quito, TU Darmstadt, University of Exeter, and many more.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/2015-coatwork.png" alt="img1" style="float:center; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap">CO@Work 2015 participants in front of ZIB.</p>
<p>All lectures will be made available on <a href="https://www.youtube.com/channel/UCphLz_BXrOAInHozAlTsigA">YouTube</a>, and there will be two Q&A sessions for each presentation – typically 11 hours apart, to cover all time zones worldwide. Similarly, there will be two practical exercise sessions each day, with hands-on experience on implementing optimization projects through the python interfaces of <a href="https://www.fico.com/en/products/fico-xpress-optimization">FICO Xpress</a> and <a href="https://www.scipopt.org/">SCIP</a>. Q&A and exercises will be hosted on Zoom. CO@Work will take place every weekday from September 14 to September 25.
Check out the meeting homepage and registration <a href="http://co-at-work.zib.de/">here</a>. Hurry up: registration closes on September 6!</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/coatwork/ewg.png" alt="img1" style="float:center; width:80%" />
<p style="clear: both;"></p>
</div>
<h2 id="practice-of-operations-research--euro-working-group-meeting">Practice of Operations Research – EURO Working Group meeting</h2>
<p>The <a href="https://www.eventbrite.co.uk/e/challenges-in-the-deployment-of-or-projects-tickets-62398252854">EWG-POR virtual conference</a> is co-hosted by <a href="https://www.zib.de/">ZIB</a> and <a href="https://www.fico.com/en/products/fico-xpress-optimization">FICO</a>. It focuses on an issue that many of us have faced: “Challenges in the deployment of OR projects”. The conference features five keynote lecturers from various industries highlighting how difficulties with the implementation of optimization projects could be overcome. Adrian Zymolka from Axioma will speak about “<strong>Optimization in Finance</strong>”, Colin Silvester from Uniper will present on “<strong>Delivering OR Solutions for Everyday Operations in Energy Trading</strong>”, Steffen Klosterhalfen from BASF will talk about “<strong>Successful Value Chain Optimization at a Chemical Company</strong>”, Ralf Werner from OGE will report on “<strong>Operations Research supporting Germany’s energy transition</strong>”, and Baris Cem Sal from DHL will let us know how they were “<strong>Putting Operations Research into Operations in Deutsche Post DHL Group</strong>”.</p>
<p>Furthermore, there will be special 45-minute interactive discussion group sessions. All participants are invited to contribute to one of the four rounds on “Change management issues in practical projects”, “Promotion of OR”, “How to state requirements and project specifications at the beginning” and “Relationship/collaboration between academia and industry”. We are curious to see what comes out of these roundtables.</p>
<p>The event will be completed by a wonderful social entertainment session – join to find out more…</p>
<p>The EWG-POR main event will take place on the 28th and 29th of September. On each of the following five Mondays (Oct 5 – Nov 2), there will be a one-hour webinar with two contributed talks on the topic of “Challenges in the deployment of OR projects”.</p>
<p>Check out the <a href="https://www.euro-online.org/websites/or-in-practice/event/euro-working-group-practice-of-or-meeting-2020/">meeting homepage</a> and <a href="https://fico.zoom.us/webinar/register/9815939755759/WN_gb2az8OJR8u9y5FPl51RaA">register now</a>.</p>
<h1 id="projection-free-optimization-on-uniformly-convex-sets">Projection-Free Optimization on Uniformly Convex Sets</h1>
<p><em>Posted on 2020-07-27 at <a href="http://www.pokutta.com/blog/research/2020/07/27/uniform-convexity-fw">pokutta.com</a>.</em></p>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2004.11053">Projection-Free Optimization on Uniformly Convex Sets</a> by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux.</em></p>
<p>We consider the following constrained optimization problem</p>
\[\underset{x\in\mathcal{C}}{\text{argmin}} f(x),\]
<p>where $f$ is a $L$-smooth convex function and $\mathcal{C}$ is a compact convex set in a Hilbert space. Frank-Wolfe algorithms form a classical family of first-order iterative methods for solving such problems. Each iteration requires in the worst case the solution of a linear minimization problem over the original feasible domain, a subset of the domain, or a reasonable modification of the domain, <em>i.e.</em> a change that does not implicitly amount to a proximal operation.</p>
<p>The understanding of the convergence rates of Frank-Wolfe algorithms in a variety of settings is an active field of research. For smooth constrained problems, the vanilla Frank-Wolfe algorithm (FW) enjoys a tight sublinear convergence rate of $\mathcal{O}(1/T)$ (see e.g., [J] for an in-depth discussion). There are known accelerated convergence regimes, as a function of the feasible region, only when $\mathcal{C}$ is a polytope or a strongly convex set.</p>
<p class="mathcol"><strong>Frank-Wolfe Algorithm</strong> <br />
<em>Input:</em> Start with $x_0 \in \mathcal{C}$, $L$ upper bound on the Lipschitz constant. <br />
<em>Output:</em> Sequence of iterates $x_t$ <br />
For $t=0, 1, \ldots, T $ do: <br />
$\qquad v_t \leftarrow \underset{v\in\mathcal{C}}{\text{argmax }} \langle -\nabla f(x_t); v - x_t\rangle$ $\qquad \triangleright $ FW vertex <br />
$\qquad \gamma_t \leftarrow \underset{\gamma\in[0,1]}{\text{argmin }} \gamma \langle \nabla f(x_t); v_t - x_t \rangle + \frac{\gamma^2}{2} L \norm{v_t - x_t}^2$ $\qquad \triangleright$ Short step <br />
$\qquad x_{t+1} \leftarrow (1 - \gamma_t)x_{t} + \gamma_t v_{t}$ $\qquad \triangleright$ Convex update</p>
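<p>The pseudocode above can be sketched as a short NumPy implementation (ours; the instance, a quadratic over the unit $\ell_2$ ball, is an illustrative choice and not from the paper):</p>

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, L, T=500):
    """Vanilla Frank-Wolfe with the short-step rule of the pseudocode above."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        g = grad(x)
        v = lmo(-g)                   # FW vertex: argmax_{v in C} <-grad f(x), v>
        d = v - x
        gap = -g @ d                  # Frank-Wolfe gap g_t >= 0
        if gap <= 1e-12:
            break
        gamma = min(1.0, gap / (L * (d @ d)))  # short step
        x = x + gamma * d             # convex update
    return x

# Illustrative instance: minimize f(x) = 0.5 * ||x - b||^2 over the unit l2 ball,
# a strongly convex set with the gradient bounded away from zero on it.
b = np.array([2.0, 0.0])
x_star = frank_wolfe(grad=lambda x: x - b,
                     lmo=lambda c: c / np.linalg.norm(c),  # LMO over the l2 ball
                     x0=[0.0, 1.0], L=1.0)
```

Here the iterates converge to $b/\norm{b} = (1, 0)$, the projection of $b$ onto the ball.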
<p>When $\mathcal{C}$ is a strongly convex set and \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), FW enjoys linear convergence rates [P, DR]. Recently, [GH] showed that the Frank-Wolfe algorithm converges in \(\mathcal{O}(1/T^2)\) when the objective function is strongly convex, without restriction on the position of the optimum $x^\esx$. Importantly, the conditioning of this sub-linear rate does not depend on the distance of the constrained optimum $x^\esx$ from the boundary or from the unconstrained optimum, as is the case in [P, DR], or in [GM] when the optimum $x^\esx$ lies in the interior of $\mathcal{C}$. Note that all these analyses require short steps as the step-size rule (or even stronger conditions such as line-search).</p>
<p>Finally, when $\mathcal{C}$ is a polytope and $f$ is strongly convex, <em>corrective</em> variants of Frank-Wolfe were recently shown to enjoy linear convergence rates, see [LJ]. No accelerated convergence rates are known for constraint sets that are neither polytopes nor strongly convex sets. We show here that uniformly convex sets, which non-trivially subsume strongly convex sets, systematically enjoy accelerated convergence rates in the respective settings of [P, DR], [GH], and [D].</p>
<h3 id="uniformly-convex-sets">Uniformly Convex Sets</h3>
<p>A closed set $\mathcal{C}\subset\mathbb{R}^d$ is $(\alpha, q)$-uniformly convex with respect to a norm $\norm{\cdot}$, if for any $x,y\in\mathcal{C}$, any $\eta\in[0,1]$, and any $z\in\mathbb{R}^d$ with $\norm{z} = 1$, we have</p>
\[\tag{1}
\eta x + (1-\eta)y + \eta (1 - \eta ) \alpha ||x-y||^q z \in \mathcal{C}.\]
<p>At a high-level, this property is a global quantification of the set curvature that subsumes strong convexity. There exist other equivalent definitions, see <em>e.g.</em> [GI] for the strongly convex case. In finite-dimensional spaces, the $\ell_p$ balls form classical and important examples of uniformly convex sets. For $0 < p < 1$, the $\ell_p$ balls are non-convex. For $p\in ]1,2]$, they are strongly convex (i.e., $(\alpha,2)$-uniformly convex), while for $p>2$ they are $p$-uniformly convex but not strongly convex. The $p$-Schatten norms with $p>1$ and various group norms provide further typical examples.</p>
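<p>One reason $\ell_p$ balls are convenient feasible regions for Frank-Wolfe is that their linear minimization oracle has a closed form via Hölder's inequality. A small NumPy sketch (our illustration, assuming $p > 1$; note the dual exponent is denoted $q$ here, not the uniform convexity exponent of the text):</p>

```python
import numpy as np

def lmo_lp_ball(c, p):
    """argmax of <c, v> over the unit l_p ball (p > 1), via Hoelder duality:
    the maximizer is v_i = sign(c_i) |c_i|^(q-1) / ||c||_q^(q-1),
    where q = p / (p - 1) is the dual exponent."""
    q = p / (p - 1.0)
    w = np.sign(c) * np.abs(c) ** (q - 1.0)
    return w / np.linalg.norm(c, q) ** (q - 1.0)
```

For $p = 2$ this reduces to normalizing $c$; for general $p > 1$ the output lies on the sphere $\norm{v}_p = 1$ and attains $\langle c, v\rangle = \norm{c}_q$.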
<div class="center">
<img src="http://www.pokutta.com/blog/assets/uc-fw/list_ball.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Example of $\ell_q$ balls.</p>
<p>Besides the quantification in (1), uniform convexity is a very classical notion in the study of normed spaces. Indeed, it allows one to refine the convex characterization of these spaces’ unit balls, leading to a plethora of interesting properties. For instance, the various uniform convexity types of Banach spaces have consequences notably in learning theory [DDS], online learning [ST], and concentration inequalities [JN]. Here, we show that the Frank-Wolfe algorithm accelerates as a function of the uniform convexity of $\mathcal{C}$.</p>
<h3 id="convergence-analysis-for-frank-wolfe-with-uniformly-convex-sets">Convergence analysis for Frank-Wolfe with Uniformly Convex Sets</h3>
<h4 id="proof-sketch">Proof Sketch</h4>
<p>At iteration $t$, we have $x_{t+1} = x_t + \gamma_t (v_t - x_t)$ with $\gamma_t\in[0,1]$ chosen to optimize the quadratic upper-bound on $f$ implied by $L$-smoothness. Classically then, we obtain that for any $\gamma\in[0,1]$</p>
\[\tag{2}
f(x_{t+1}) - f(x^\esx) \leq f(x_t) - f(x^\esx) - \gamma \langle - \nabla f(x_t); v_t - x_t\rangle + \frac{\gamma^2}{2} L \norm{x_t - v_t}^2.\]
<p>The Frank-Wolfe gap $g_t = \langle - \nabla f(x_t); v_t - x_t\rangle\geq 0$ drives the primal decrease, counter-balanced by the quadratic term on the right-hand side. Informally, uniform convexity of the set ensures that the distance $\norm{v_t - x_t}$ between the iterate and the Frank-Wolfe vertex shrinks to zero at a specific rate that depends on $g_t$. The schema below illustrates that this is generally not the case when $\mathcal{C}$ is a polytope, and how various types of <em>curvature</em> influence the shrinking of $\norm{v_t - x_t}$ to zero.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/uc-fw/uniform_convexity_assumption_cropped.png" alt="img1" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> The uniform convexity assumption.</p>
<h4 id="scaling-inequalities">Scaling Inequalities</h4>
<p>The uniform convexity parameters quantify different trade-offs between the convergence to zero of $g_t$ and of $\norm{v_t - x_t}$. In particular, $(\alpha, q)$-uniform convexity of $\mathcal{C}$ implies scaling inequalities of the form
\(\langle -\nabla f(x_t); v_t - x_t\rangle \geq \alpha/2 \norm{\nabla f(x_t)}_\esx \norm{v_t - x_t}^q,\)
where $\norm{\cdot}_\esx$ denotes the dual norm of $\norm{\cdot}$. Plugging this scaling inequality into (2) is then the basis for the various convergence results.</p>
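<p>Concretely (a one-step sketch in the notation above): rearranging the scaling inequality gives $\norm{v_t - x_t}^q \leq 2 g_t / (\alpha \norm{\nabla f(x_t)}_\esx)$, and substituting this bound into (2) yields</p>

\[f(x_{t+1}) - f(x^\esx) \leq f(x_t) - f(x^\esx) - \gamma g_t + \frac{\gamma^2 L}{2}\left(\frac{2 g_t}{\alpha \norm{\nabla f(x_t)}_\esx}\right)^{2/q},\]

<p>so that optimizing over $\gamma\in[0,1]$ trades the primal decrease $\gamma g_t$ against a curvature term that now grows only like $g_t^{2/q}$; see the paper for the full derivations.</p>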
<h4 id="convergence-results-with-global-uniform-convexity">Convergence results with global uniform convexity</h4>
<p>Assuming that $\mathcal{C}$ is $(\alpha,q)$-uniformly convex and $f$ is a strongly convex and $L$-smooth function, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-1/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the $\mathcal{O}(1/T^2)$ of [GH]. Note that we further generalize these results by relaxing the strong convexity of $f$ to $(\mu, \theta)$-Hölderian Error Bounds, in which case the rates become $\mathcal{O}(1/T^{1/(1-2\theta/q)})$. For more details, see Theorem 2.10 of our paper or [KDP] for general error bounds in the context of the Frank-Wolfe algorithm.</p>
<p>Similarly, assuming \(\text{inf}_{x\in\mathcal{C}}\norm{\nabla f(x)}_\esx > 0\), when $\mathcal{C}$ is $(\alpha,q)$-uniformly convex with $q>2$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/q)})$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [P, DR].</p>
<p>These two convergence regimes depend on the global uniform convexity parameters of the set $\mathcal{C}$. However, some sets, for example the $\ell_3$ ball, seem to exhibit varying degrees of curvature depending on the position of $x^\esx$ on the boundary $\partial\mathcal{C}$.</p>
<h4 id="a-simple-numerical-experiment">A simple numerical experiment</h4>
<p>We now numerically observe the convergence of Frank-Wolfe where $f$ is a quadratic and the feasible regions are $\ell_p$ balls with varying $p$. We provide two different plots in which we vary the position of the optimal solution $x^\esx$. In both cases we use short steps.</p>
<p>In the right figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by $\sum_{i=1}^{d}e_i$, where $(e_i)$ is the canonical basis. Informally, this corresponds to the <em>curvy</em> areas of the $\ell_p$ balls. In the left figure, $x^\esx$ is chosen near the intersection of the $\ell_p$ balls and the half-line generated by one of the $e_i$, which corresponds to a <em>flat</em> area for large values of $p$.</p>
<p>We observe that when the optimum is near a <em>curvy</em> area, the convergence rates are asymptotically linear even for $\ell_p$ balls that are not strongly convex.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Optimum $x^\esx$ in flat area</th>
<th style="text-align: center">Optimum $x^\esx$ in curvy area</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="http://www.pokutta.com/blog/assets/uc-fw/500_iter_20000_ls_exact_bad_opt.jpg" alt="Explicative figure" /></td>
<td style="text-align: center"><img src="http://www.pokutta.com/blog/assets/uc-fw/500_iter_20000_ls_exact_good_opt.jpg" alt="Explicative figure" /></td>
</tr>
</tbody>
</table>
<p>This suggests that the <em>local behavior</em> of $\mathcal{C}$ around $x^\esx$ might even better explain the convergence rates. Providing such an analysis would be in line with [D] which proves linear convergence rates assuming only local strong convexity of $\mathcal{C}$ around $x^\esx$. We extend this result to a localized notion of uniform convexity.</p>
<h3 id="frank-wolfe-analysis-with-localized-uniform-convexity">Frank-Wolfe Analysis with Localized Uniform Convexity</h3>
<p>When $\mathcal{C}$ is not globally uniformly convex, the scaling inequality does not necessarily hold anymore. We rather assume the following localized version around $x^\esx$:</p>
\[\tag{3}
\langle -\nabla f(x^\esx); x^\esx - x\rangle \geq \alpha/2 \norm{\nabla f(x^\esx)}_\esx \norm{x^\esx - x}^q.\]
<p>A localized definition of uniform convexity in the form of (1) applied at $x^\esx \in \partial \mathcal{C}$ indeed implies the <em>local scaling inequality</em> (3). However, the local scaling inequality (3) holds in more general situations, <em>e.g.</em> in the case of the strong convexity analog for the local moduli of rotundity in [GI]. This condition was already identified by Dunn [D], yet without any convergence analysis as soon as $q>2$. In another blog post, we will delve into more details on the generality of (3) and related assumptions in optimization.</p>
<p>Assuming \(\text{inf}_{x\in\mathcal{C}} \norm{\nabla f(x)}_\esx > 0\) and a local scaling inequality at $x^\esx$ with parameters $(\alpha, q)$, we obtain convergence rates of $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ with $q>2$ that interpolate between the general sub-linear rate of $\mathcal{O}(1/T)$ and the linear convergence rates of [D].</p>
<p>Note that the sublinear rate obtained via the local scaling inequality is strictly worse than the one obtained via the global scaling inequality with the same $(\alpha, q)$ parameters. However, the sublinear rate $\mathcal{O}(1/T^{1/(1-2/(q(q-1)))})$ always remains better than $\mathcal{O}(1/T)$ when $q>2$.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Our results show that in all regimes (for smooth constrained problems) where the strong convexity of the constraint set is known to accelerate Frank-Wolfe algorithms (see [P, DR], [D] or [GH]), the uniform convexity of the set leads to accelerated convergence rates with respect to $\mathcal{O}(1/T)$ as well.</p>
<p>We also show that the local uniform convexity of $\mathcal{C}$ around $x^\esx$ already induces accelerated convergence rates. In particular, this acceleration is achieved with the vanilla Frank-Wolfe algorithm which does not require any knowledge about these underlying structural assumptions and their respective parameters. As such, our results further shed light on the adaptive properties of Frank-Wolfe type of algorithms. For instance, see also [KDP] for adaptivity with respect to error bound conditions on the objective function, or [J, LJ] for affine invariant analyses of Frank-Wolfe algorithms when the constraint set are polytopes.</p>
<h3 id="references">References</h3>
<p>[D] Dunn, Joseph C. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17.2 (1979): 187-211. <a href="https://epubs.siam.org/doi/pdf/10.1137/0324071?casa_token=mV4qkf9aLskAAAAA:--jyeKNCSwAH5fejuzgJr1im_OXPyesfgPOU1fk-cfmBYZjTdrRSAHHfZEWjQRUaYSI0vNPB7NwY">pdf</a></p>
<p>[DR] Demyanov, V. F.; Rubinov, A. M. Approximate methods in optimization problems. Modern Analytic and Computational Methods in Science and Mathematics, 1970.</p>
<p>[DDS] Donahue, M. J.; Darken, C.; Gurvits, L.; Sontag, E. (1997). Rates of convex approximation in non-Hilbert spaces. <em>Constructive Approximation</em>, <em>13</em>(2), 187-220.</p>
<p>[GH] Garber, D.; Hazan, E. Faster rates for the frank-wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015. <a href="https://arxiv.org/abs/1406.1305">pdf</a></p>
<p>[GI] Goncharov, V. V.; Ivanov, G. E. Strong and weak convexity of closed sets in a hilbert space. In Operations research engineering, and cyber security, pages 259–297. Springer, 2017.</p>
<p>[GM] Guélat, J.; Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119.</p>
<p>[HL] Huang, R.; Lattimore, T.; György, A.; Szepesvári, C. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. In Advances in Neural Information Processing Systems, pages 4970–4978, 2016. <a href="https://arxiv.org/abs/1702.03040">pdf</a></p>
<p>[J] Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th international conference on machine learning, ICML 2013. p. 427-435. <a href="http://www.jmlr.org/proceedings/papers/v28/jaggi13.pdf">pdf</a></p>
<p>[JN] Juditsky, A.; Nemirovski, A.S. Large deviations of vector-valued martingales in 2-smooth normed spaces. <a href="https://arxiv.org/abs/0809.0813">pdf</a></p>
<p>[KDP] Kerdreux, T.; d’Aspremont, A.; Pokutta, S. 2019. Restarting Frank-Wolfe. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 1275-1283). <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>
<p>[LJ] Lacoste-Julien, S.; Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015. p. 496-504. <a href="https://infoscience.epfl.ch/record/229239/files/nips15_paper_sup_camera_ready.pdf">pdf</a></p>
<p>[P] Polyak, B. T. Existence theorems and convergence of minimizing sequences for extremal problems with constraints. In Doklady Akademii Nauk, volume 166, pages 287–290. Russian Academy of Sciences, 1966.</p>
<p>[ST] Sridharan, K.; Tewari, A. (2010, June). Convex Games in Banach Spaces. In <em>COLT</em> (pp. 1-13). <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.304.5992&rep=rep1&type=pdf">pdf</a></p>Thomas KerdreuxTL;DR: This is an informal summary of our recent paper Projection-Free Optimization on Uniformly Convex Sets by Thomas Kerdreux, Alexandre d’Aspremont, and Sebastian Pokutta. We present convergence analyses of the Frank-Wolfe algorithm in settings where the constraint sets are uniformly convex. Our results generalize different analyses of [P], [DR], [D], and [GH] when the constraint sets are strongly convex. For instance, the $\ell_p$ balls are uniformly convex for all $p > 1$, but strongly convex for $p\in]1,2]$ only. We show in these settings that uniform convexity of the feasible region systematically induces accelerated convergence rates of the Frank-Wolfe algorithm (with short steps or exact line-search). This shows that the Frank-Wolfe algorithm is not just adaptive to the sharpness of the objective [KDP] but also to the feasible region.Second-order Conditional Gradient Sliding2020-06-20T07:00:00+02:002020-06-20T07:00:00+02:00http://www.pokutta.com/blog/research/2020/06/20/socgs<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.08907">Second-order Conditional Gradient Sliding</a> by <a href="https://alejandro-carderera.github.io/">Alejandro Carderera</a> and <a href="http://www.pokutta.com/">Sebastian Pokutta</a>, where we present a second-order analog of the Conditional Gradient Sliding algorithm [LZ] for smooth and strongly-convex minimization problems over polytopes. The algorithm combines Inexact Projected Variable-Metric (PVM) steps with independent Away-step Conditional Gradient (ACG) steps to achieve global linear convergence and local quadratic convergence in primal gap. 
The resulting algorithm outperforms other projection-free algorithms in applications where first-order information is costly to compute.</em>
<!--more--></p>
<p><em>Written by Alejandro Carderera.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Consider a problem of the sort:
\(\tag{minProblem}
\begin{align}
\label{eq:minimizationProblem}
\min\limits_{x \in \mathcal{X}} f(x),
\end{align}\)
where $\mathcal{X}$ is a polytope and $f(x)$ is a twice differentiable function that is strongly convex and smooth. We assume that solving an LP over $\mathcal{X}$ is easy, but projecting using the Euclidean norm (or any other norm) onto $\mathcal{X}$ is expensive. Moreover, we also assume that evaluating $f(x)$ is expensive, and so is computing the gradient and the Hessian of $f(x)$. An example of such an objective function can be found when solving an MLE problem to estimate the parameters of a Gaussian distribution modeled as a sparse undirected graph [BEA] (also known as the Graphical Lasso problem). Another example is the objective function used in logistic regression problems when the number of samples is high.</p>
<h2 id="projected-variable-metric-algorithms">Projected Variable-Metric algorithms</h2>
<p>Working with such unwieldy functions is often too expensive, and so a popular approach to tackling (minProblem) is to construct an approximation to the original function whose gradients are easier to compute. A linear approximation of $f(x)$ at $x_k$ using only first-order information will not contain any curvature information, giving us little to work with. Consider, on the other hand, a quadratic approximation of $f(x)$ at $x_k$, denoted by $\hat{f_k}(x)$, that is:</p>
\[\tag{quadApprox}
\begin{align}
\label{eq:quadApprox}
\hat{f_k}(x) = f(x_k) + \left\langle \nabla f(x_k), x - x_k \right\rangle + \frac{1}{2} \norm{x - x_k}_{H_k}^2,
\end{align}\]
<p>where $H_k$ is a positive definite matrix that approximates the Hessian $\nabla^2 f(x_k)$. Algorithms that minimize the quadratic approximation $\hat{f}_k(x)$ over $\mathcal{X}$ at each iteration and set</p>
\[x_{k+1} = x_k + \gamma_k (\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) - x_k)\]
<p>for some \(\gamma_k \in [0,1]\) are dubbed <em>Projected Variable-Metric</em> (PVM) algorithms. These algorithms are useful when the progress per unit time obtained by moving towards the minimizer of $\hat{f}_k(x)$ over $\mathcal{X}$ at each time step is greater than the progress per unit time obtained by taking a step of any other first-order algorithm that makes use of the original function (whose gradients are very expensive to compute). We define the scaled projection of $x$ onto $\mathcal{X}$ when we measure the distance in the $H$-norm as \(\Pi_{\mathcal{X}}^{H} (y) \stackrel{\mathrm{\scriptscriptstyle def}}{=} \text{argmin}_{x\in\mathcal{X}} \norm{x - y}_{H}\). This allows us to interpret the steps taken by PVM algorithms as:</p>
\[\tag{stepPVM}
\begin{align}
\label{eq:stepPVM}
\operatorname{argmin}_{x\in \mathcal{X}} \hat{f_k}(x) = \Pi_{\mathcal{X}}^{H_k} \left( x_k - H_k^{-1} \nabla f(x_k) \right).
\end{align}\]
<p>These algorithms owe their name to this interpretation: at each iteration, as $H_k$ varies, we change the metric (the norm) with which we perform the scaled-projections, and we deform the negative of the gradient using this metric. The next image gives a schematic overview of a step of the PVM algorithm. The polytope $\mathcal{X}$ is depicted with solid black lines, the contour lines of the original objective function $f(x)$ are depicted with solid blue lines, and the contour lines of the quadratic approximation \(\hat{f}_k(x)\) are depicted with dashed red lines. Note that \(x_k - H_k^{-1}\nabla f(x_k)\) is the unconstrained minimizer of the quadratic approximation \(\hat{f}_k(x)\). The iterate used in the PVM algorithm to define the directions along which we move, i.e., \(\text{argmin}_{x\in \mathcal{X}} \hat{f}_k(x)\), is simply the scaled projection of that unconstrained minimizer onto \(\mathcal{X}\) using the norm \(\norm{\cdot}_{H_k}\) defined by $H_k$.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/SchematicAlgorithm.png" alt="Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$." style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Minimization of $\hat{f_k}(x)$ over $\mathcal{X}$.</p>
<p>Note that if we set $H_k = \nabla^2 f(x_k)$, the PVM algorithm is equivalent to the <em>Projected Newton</em> algorithm, and if we set $H_k = I^n$, where $I^n$ is the identity matrix, the algorithm reduces to the <em>Projected Gradient Descent</em> algorithm. Intuitively, when $H_k$ is a good approximation to the Hessian $\nabla^2 f(x_k)$, we can expect to make good progress when moving along these directions. In terms of convergence, the PVM algorithm has a <em>global</em> linear convergence rate in primal gap when using an exact line search [KSJ], although with a dependence on the condition number that is worse than that of Projected Gradient Descent or the <em>Away-step Conditional Gradient</em> (ACG) algorithm. Moreover, the algorithm has a <em>local</em> quadratic convergence rate with a unit step size when close to the optimum $x^\esx$, provided the matrix $H_k$ becomes an increasingly accurate approximation of $\nabla^2 f(x_k)$ as we approach $x^\esx$ (which we also assume in our theoretical results).</p>
<h2 id="second-order-conditional-gradient-sliding-algorithm">Second-order Conditional Gradient Sliding algorithm</h2>
<p>Two questions arise:</p>
<ol>
<li>Can we achieve a global linear convergence rate on par with that of the Away-step Conditional Gradient algorithm?</li>
<li>Solving the problem shown in (stepPVM) to optimality is often too expensive. Can we solve the problem to some $\varepsilon_k$-optimality and keep the local quadratic convergence?</li>
</ol>
<p>The <em>Second-order Conditional Gradient Sliding</em> (SOCGS) algorithm is designed with these considerations in mind, providing global linear convergence in primal gap and local quadratic convergence in primal gap and distance to $x^\esx$. The algorithm couples an independent ACG step with line search with an Inexact PVM step with a unit step size. At the end of each iteration, we choose the step that provides the greatest primal progress. The independent ACG steps will ensure global linear convergence in primal gap, and the Inexact PVM steps will provide quadratic convergence. Moreover, the line search in the ACG step can be substituted with a step size strategy that requires knowledge of the $L$-smoothness parameter of $f(x)$ [PNAM].</p>
<p>We compute the PVM step inexactly using the (same) ACG algorithm with an exact line search, thereby making the SOCGS algorithm <em>projection-free</em>. As the function being minimized in the Inexact PVM steps is quadratic, there is a closed-form expression for the optimal step size. The scaled projection problem is solved to $\varepsilon_k$-optimality, using the Frank-Wolfe gap as a stopping criterion, as in the Conditional Gradient Sliding (CGS) algorithm [LZ]. The CGS algorithm uses the vanilla Conditional Gradient algorithm to find an approximate solution to the Euclidean projection problems that arise in <em>Nesterov’s Accelerated Gradient Descent</em> steps. In the SOCGS algorithm, we use the ACG algorithm to find an approximate solution to the scaled-projection problems that arise in PVM steps.</p>
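<p>To make the inner loop concrete, the following sketch (illustrative only; it uses vanilla Frank-Wolfe instead of the ACG variant from the paper, takes the probability simplex as the polytope, and the toy instance at the bottom is made up) performs one Inexact PVM step: the quadratic model is minimized by Frank-Wolfe with its closed-form exact line search, stopping once the Frank-Wolfe gap of the model drops below $\varepsilon_k$:</p>

```python
import numpy as np

def lmo_simplex(c):
    # LP over the probability simplex: the optimum is the vertex e_i, i = argmin_i c_i
    v = np.zeros_like(c)
    v[np.argmin(c)] = 1.0
    return v

def inexact_pvm_step(grad_f, H, x_k, eps_k, max_iter=50000):
    """Approximately minimize the quadratic model
    f_hat_k(x) = f(x_k) + <grad f(x_k), x - x_k> + 1/2 ||x - x_k||_H^2
    over the simplex with vanilla Frank-Wolfe, stopping once the
    Frank-Wolfe gap of the model drops below eps_k."""
    g_k = grad_f(x_k)
    x = x_k.copy()
    for _ in range(max_iter):
        model_grad = g_k + H @ (x - x_k)
        v = lmo_simplex(model_grad)
        gap = model_grad @ (x - v)  # Frank-Wolfe gap of the model at x
        if gap <= eps_k:
            break
        d = v - x
        # exact line search on the quadratic model has a closed form
        denom = d @ H @ d
        gamma = 1.0 if denom <= 0.0 else min(1.0, gap / denom)
        x = x + gamma * d
    return x  # the outer PVM step then uses a unit step size

# Made-up toy instance: f(x) = 1/2 ||x - b||^2, with H the exact Hessian (identity).
b = np.array([0.6, 0.3, 0.1])
x_k = np.ones(3) / 3.0
x_next = inexact_pvm_step(lambda x: x - b, np.eye(3), x_k, eps_k=1e-3)
```

<p>Since every update is a convex combination of simplex points, the iterates remain feasible, and the stopping criterion bounds the primal gap of the model by $\varepsilon_k$ without ever solving the scaled projection exactly.</p>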
<h3 id="accuracy-parameter-varepsilon_k">Accuracy Parameter $\varepsilon_k$.</h3>
<p>The accuracy parameter $\varepsilon_k$ in the SOCGS algorithm depends on a lower bound on the primal gap of (minProblem), which we denote by $lb\left( x_k \right)$ and which satisfies $lb\left( x_k \right) \leq f\left(x_k \right) - f\left(x^\esx \right)$.</p>
<p>In several machine learning applications, the value of $f(x^\esx)$ is known a priori, as is the case for the approximate Carathéodory problem (see the post <a href="/blog/research/2019/11/30/approxCara-abstract.html">Approximate Carathéodory via Frank-Wolfe</a>, where $f(x^\esx) = 0$). In other applications, estimating $f(x^\esx)$ is easier than estimating the strong convexity parameter (see [BTA] for an in-depth discussion). This allows for tight lower bounds on the primal gap in these cases.</p>
<p>If there is no easy way to estimate the value of $f(x^\esx)$, we can compute a lower bound on the primal gap at $x_k$ (bounded away from zero) using any CG variant that monotonically decreases the primal gap. It suffices to run an arbitrary number of steps $n \geq 1$ of the aforementioned variant to minimize $f(x)$ starting from $x_k$, resulting in $x_k^n$. Simply noting that $f(x_k^n) \geq f(x^\esx)$ allows us to conclude that $f(x_k) - f(x^\esx) \geq f(x_k) - f(x_k^n)$, and therefore a valid lower bound is $lb\left( x_k \right) = f(x_k) - f(x^n_k)$. The higher the number of CG steps performed from $x_k$, the tighter the resulting lower bound will be.</p>
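<p>This construction is easy to sketch in code (a made-up quadratic objective over the probability simplex; any CG variant that monotonically decreases the objective can play the role of the inner algorithm):</p>

```python
import numpy as np

def lmo_simplex(c):
    v = np.zeros_like(c)
    v[np.argmin(c)] = 1.0
    return v

def primal_gap_lower_bound(f, grad_f, x_k, n=10):
    """Run n monotone Frank-Wolfe steps (with exact line search, which is
    closed-form for this quadratic f) starting from x_k.  Since
    f(x_k^n) >= f(x*), the decrease f(x_k) - f(x_k^n) is a valid lower
    bound on the primal gap at x_k."""
    x = x_k.copy()
    for _ in range(n):
        g = grad_f(x)
        d = lmo_simplex(g) - x
        gap = -(g @ d)
        if gap <= 0.0:
            break
        gamma = min(1.0, gap / (d @ d))  # exact line search for f = 1/2 ||x - b||^2
        x = x + gamma * d
    return f(x_k) - f(x)

# Made-up instance: b lies outside the simplex, so f(x*) > 0 is not known a priori.
b = np.array([1.0, 0.5, 0.0])
f = lambda x: 0.5 * np.dot(x - b, x - b)
x_k = np.ones(3) / 3.0
lb = primal_gap_lower_bound(f, lambda x: x - b, x_k)
```

<p>By construction $0 \leq lb(x_k) \leq f(x_k) - f(x^\esx)$, and running more inner steps only tightens the bound.</p>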
<h3 id="complexity-analysis">Complexity Analysis</h3>
<p>For the complexity analysis, we assume that we have at our disposal the tightest possible bound on the primal gap, which is $lb\left( x_k \right) = f(x_k) - f(x^\esx)$. A looser lower bound increases the number of linear minimization calls but does not increase the number of first-order or approximate Hessian oracle calls. As in the classical analysis of Projected Newton algorithms, after a finite number of iterations that is independent of the target accuracy $\varepsilon$ (iterations which, in our case, are linearly convergent in primal gap), the algorithm enters a regime of quadratic convergence in primal gap. Once in this phase, the algorithm requires $\mathcal{O}\left( \log(1/\varepsilon) \log(\log 1/\varepsilon)\right)$ calls to a linear minimization oracle, and $\mathcal{O}\left( \log(\log 1/\varepsilon)\right)$ calls to a first-order and approximate Hessian oracle to reach an $\varepsilon$-optimal solution.</p>
<p>If we were to solve problem (minProblem) using the Away-step Conditional Gradient algorithm we would need $\mathcal{O}\left( \log(1/\varepsilon)\right)$ calls to a linear minimization and first-order oracle. Using the SOCGS algorithm makes sense if the linear minimization calls are not the computational bottleneck of the algorithm and the approximate Hessian oracle is about as expensive as the first-order oracle.</p>
<h3 id="computational-experiments">Computational Experiments</h3>
<p>We compare the performance of the SOCGS algorithm with that of other first-order projection-free algorithms in settings where computing first-order information is expensive (and computing Hessian information is just as expensive). We also compare the performance of our algorithm with the recent <em>Newton Conditional Gradient</em> algorithm [LCT], which minimizes a self-concordant function over a convex set by performing Inexact Newton steps (thereby requiring an exact Hessian oracle) using a Conditional Gradient algorithm to compute the scaled projections. After a finite number of iterations (independent of the target accuracy $\varepsilon$), the convergence rate of the NCG algorithm is linear in primal gap. Once inside this phase, an $\varepsilon$-optimal solution is reached after $\mathcal{O}\left(\log 1/\varepsilon\right)$ exact Hessian and first-order oracle calls and $\mathcal{O}( 1/\varepsilon^{\nu})$ linear minimization oracle calls, where $\nu$ is a constant greater than one.</p>
<p>In the first experiment, the Hessian information will be inexact (but subject to an asymptotic accuracy assumption), and so we will only compare to other first-order projection-free algorithms. In the second and third experiments, the Hessian oracle will be exact. For reference, the algorithms in the legend correspond to the vanilla Conditional Gradient (CG), the Away-step Conditional Gradient (ACG) [GM], the Lazy Away-step Conditional Gradient (ACG (L)) [BPZ], the Pairwise-step Conditional Gradient (PCG) [LJ], the Conditional Gradient Sliding (CGS) [LZ], the Stochastic Variance-Reduced Conditional Gradient (SVRCG) [HL], the Decomposition Invariant Conditional Gradient (DICG) [GM2] and the Newton Conditional Gradient (NCG) [LCT] algorithm. We also present an LBFGS version of SOCGS (SOCGS LBFGS). Note, however, that this algorithm, while performing well, does not formally satisfy our assumptions.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/Birkhoff_Experiments.png" alt="fig2" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Sparse coding over the Birkhoff polytope.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/GLassoPSD.png" alt="fig3" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Inverse covariance estimation over the spectrahedron.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/socgs/LogReg.png" alt="fig4" style="float:center; margin-right: 1%; width:99%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Structured logistic regression over $\ell_1$ unit ball.</p>
<h3 id="references">References</h3>
<p>[LZ] Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. In <em>SIAM Journal on Optimization</em> 26(2) (pp. 1379–1409). SIAM. <a href="http://www.optimization-online.org/DB_FILE/2014/10/4605.pdf">pdf</a></p>
<p>[BEA] Banerjee, O., & El Ghaoui, L. & d’Aspremont, A. (2008). Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. In <em>Journal of Machine Learning Research</em> 9 (2008) (pp. 485–516). JMLR. <a href="http://www.jmlr.org/papers/volume9/banerjee08a/banerjee08a.pdf">pdf</a></p>
<p>[KSJ] Karimireddy, S.P., & Stich, S.U. & Jaggi, M. (2018). Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. <em>arXiv preprint:1806.00413</em>. <a href="https://arxiv.org/pdf/1806.00413.pdf">pdf</a></p>
<p>[PNAM] Pedregosa, F., & Negiar, G. & Askari, A. & Jaggi, M. (2020). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. In <em>Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics</em>. <a href="http://proceedings.mlr.press/v108/pedregosa20a/pedregosa20a-supp.pdf">pdf</a></p>
<p>[BTA] Barré, M., & Taylor, A. & d’Aspremont, A. (2020). Complexity Guarantees for Polyak Steps with Momentum. <em>arXiv preprint:2002.00915</em>. <a href="https://arxiv.org/pdf/2002.00915.pdf">pdf</a></p>
<p>[LCT] Liu, D., & Cevher, V. & Tran-Dinh, Q. (2020). A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization. <em>arXiv preprint:2002.07003</em>. <a href="https://arxiv.org/pdf/2002.07003.pdf">pdf</a></p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. In <em>Mathematical Programming</em> 35(1) (pp. 110–119). Springer. <a href="http://www.iro.umontreal.ca/~marcotte/ARTIPS/1986_MP.pdf">pdf</a></p>
<p>[BPZ] Braun, G., & Pokutta, S. & Zink, D. (2017). Lazifying Conditional Gradient Algorithms. In <em>Proceedings of the 34th International Conference on Machine Learning</em>. <a href="http://proceedings.mlr.press/v70/braun17a/braun17a.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S. & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In <em>Advances in Neural Information Processing Systems</em> 2015 (pp. 496-504). <a href="https://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[HL] Hazan, E. & Luo, H. (2016). Variance-reduced and projection-free stochastic optimization. In <em>Proceedings of the 33rd International Conference on Machine Learning</em>. <a href="https://arxiv.org/pdf/1602.02101.pdf">pdf</a></p>
<p>[GM2] Garber, D. & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In <em>Advances in Neural Information Processing Systems</em> 2016 (pp. 1001-1009). <a href="https://arxiv.org/pdf/1605.06492.pdf">pdf</a></p>Alejandro CardereraOn the unreasonable effectiveness of the greedy algorithm2020-06-03T07:00:00+02:002020-06-03T07:00:00+02:00http://www.pokutta.com/blog/research/2020/06/03/unreasonable-abstract<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.04063">On the Unreasonable Effectiveness of the Greedy Algorithm: Greedy Adapts to Sharpness</a> with <a href="https://www2.isye.gatech.edu/~msingh94/">Mohit Singh</a> and <a href="https://sites.google.com/view/atorrico">Alfredo Torrico</a>, where we adapt the sharpness concept from convex optimization to explain the effectiveness of the greedy algorithm for submodular function maximization.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>An important problem is the maximization of a non-negative monotone submodular set function $f: 2^V \rightarrow \RR_+$ subject to a cardinality constraint, i.e.,</p>
\[\tag{maxSub}
\max_{S \subseteq V, |S| \leq k} f(S).\]
<p>This problem naturally occurs in many contexts, such as, e.g., feature selection, sensor placement, and non-parametric learning. It is well known that in submodular function maximization with a single cardinality constraint we can compute a $(1-1/\mathrm{e})$-approximate solution by means of the greedy algorithm [NW], [NWF], while computing an exact solution is NP-hard. The greedy algorithm is extremely simple, selecting in each of its $k$ iterations the element with the largest <em>marginal gain</em> $\Delta_{S}(e) \doteq f(S \cup \setb{e}) - f(S)$:</p>
<p class="mathcol"><strong>Greedy Algorithm</strong> <br />
<em>Input:</em> Non-negative, monotone, submodular function $f$ and budget $k$ <br />
<em>Output:</em> Set $S_g \subseteq V$ <br />
$S_g \leftarrow \emptyset$ <br />
For $i = 1, \dots, k$ do: <br />
$\quad S_g \leftarrow S_g \cup \setb{\arg\max_{e \in V \setminus S_g} \Delta_{S_g}(e)}$<br /></p>
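<p>For concreteness, here is a direct implementation of this pseudocode for a coverage function, a standard example of a non-negative monotone submodular function (the instance below is made up):</p>

```python
def greedy_max_coverage(cover, V, k):
    """Greedy for max f(S) = |union of cover[e] for e in S| s.t. |S| <= k.
    cover maps each ground element e in V to the set it covers; coverage
    functions are non-negative, monotone, and submodular."""
    S, covered = [], set()
    for _ in range(k):
        # marginal gain Delta_S(e) = |cover[e] \ covered|
        e = max((e for e in V if e not in S),
                key=lambda e: len(cover[e] - covered))
        S.append(e)
        covered |= cover[e]
    return S, len(covered)

# Made-up instance with four candidate sets and budget k = 2.
cover = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}
S, value = greedy_max_coverage(cover, list(cover), 2)
```

<p>On this instance greedy picks "a" (gain 3) and then "c" (gain 3), covering all six items, which here happens to coincide with the optimum; the $(1-1/\mathrm{e})$ guarantee below bounds how far greedy can fall short in general.</p>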
<p>Due to its simplicity and good real-world performance, the greedy algorithm is often the method of choice in large-scale tasks where more involved methods, such as, e.g., integer programming, are computationally prohibitive. As mentioned before, the returned solution $S_g \subseteq V$ of the greedy algorithm satisfies [NW]</p>
\[f(S_g) \geq (1-1/\mathrm{e}) \ f(S^\esx),\]
<p>where $S^\esx \subseteq V$ is the optimal solution to problem (maxSub).</p>
<p>In practice however, we often observe that the greedy algorithm performs much better than this conservative approximation guarantee suggests and several concepts such as, e.g., <em>curvature</em> [CC] or <em>stability</em> [CRV] have been proposed as a means to explain the excess performance of the greedy algorithm beyond this worst-case bound. The reason one might be interested in this, beyond understanding greedy’s performance as a function of additional properties of $f$ (which is interesting in its own right), is that the problem instance of interest might be amenable to pre-processing in order to improve conditioning with respect to these additional structural properties and hence performance.</p>
<h2 id="our-results">Our results</h2>
<p>We focus on giving an alternative explanation for those instances in which the optimal solution clearly stands out over the remaining feasible solutions. For this, we consider the concept of sharpness initially introduced in continuous optimization (see [BDL] and references contained therein) and adapt it to submodular optimization. In convex optimization, roughly speaking, sharpness measures the behavior of the objective function around the set of optimal solutions, and it translates into faster convergence rates. The way one should think about sharpness and similar parameters is as <em>data-dependent</em> quantities that are usually either hard to compute or inaccessible. Since these quantities are (usually) unobservable and non-trivial to estimate, yet impact the convergence rate, we would like our algorithms to be <em>adaptive</em> to these parameters without requiring them as <em>input</em>, i.e., the algorithm automatically behaves better when the data is better conditioned.</p>
<p>We show that the greedy algorithm for submodular maximization also provides better approximation guarantees as (our submodular analog to) sharpness of the objective function increases. While surprising at first, this is actually quite natural once we understand the greedy algorithm as a discrete analog of ascent algorithms in continuous optimization that is allowed to perform only a fixed number of steps: if the algorithm converges faster, then after a fixed number of steps ($k$ in the discrete case, to be precise) its achieved approximation guarantee will be better. The key challenge is then to identify a notion of sharpness that is meaningful in the context of submodular function maximization. We also show that the greedy algorithm automatically adapts to the submodular function’s sharpness.</p>
<p>The most basic notion of <em>sharpness for submodular functions</em> that we define is the following notion of <em>monotone sharpness</em>: there exists an optimal solution $S^\esx \subseteq V$ such that for all $S \subseteq V$ it holds:</p>
\[\tag{monSharp}
\sum_{e \in S^\esx \setminus S} \Delta_S(e) \geq \left( \frac{|S^\esx \setminus S|}{kc} \right)^{1/\theta} f(S^\esx),\]
<p>which then leads to a guarantee of the form:</p>
\[f(S_g) \geq \left(1- \left(1-\frac{\theta}{c}\right)^{1/c}\right) f(S^\esx),\]
<p>which interpolates between the worst-case approximation factor $(1-1/\mathrm{e})$ and the best-case approximation factor $1$.</p>
<p>We also define tighter notions of sharpness that explain more of greedy’s performance, however their definitions are slightly more involved and beyond the scope of this summary. In the following figure we depict the performance of the greedy algorithm on three different tasks as well as how much of its performance is explained by various data-dependent measures - it can be seen that our most advanced notion of sharpness, called <em>dynamic submodular sharpness</em>, explains a significant portion of greedy’s performance.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/1_image_clustering.png" alt="img1" style="float:center; margin-right: 1%; width:27%" />
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/2_fac_loc.png" alt="img2" style="float:center; margin-right: 1%; width:27%" />
<img src="http://www.pokutta.com/blog/assets/sharpSubmodular/3_parkison_tele.png" alt="img3" style="float:center; margin-right: 1%; width:39%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Image Clustering (left), Facility Location (middle), Parkinson Telemonitoring (right). For each example we computed both the sharpness parameters and optimal solutions to compare predicted vs. actual performance. In all three examples sharpness explains a significant portion of the greedy algorithm’s excess performance.</p>
<p>One final but important question that one might ask is: how many functions actually do satisfy sharpness? In convex optimization, by the <em>Łojasiewicz Factorization Lemma</em> (see [BDL] and references contained therein), basically almost all functions exhibit non-trivial sharpness; the same is true for the submodular case here, albeit in somewhat weaker form.</p>
<h3 id="references">References</h3>
<p>[NWF] Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical programming, 14(1), 265-294. <a href="https://link.springer.com/content/pdf/10.1007/BF01588971.pdf">pdf</a></p>
<p>[NW] Nemhauser, G. L., & Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of operations research, 3(3), 177-188. <a href="https://www.jstor.org/stable/pdf/3689488.pdf">pdf</a></p>
<p>[CC] Conforti, M., & Cornuéjols, G. (1984). Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete applied mathematics, 7(3), 251-274. <a href="https://www.sciencedirect.com/science/article/pii/0166218X84900039">pdf</a></p>
<p>[CRV] Chatziafratis, V., Roughgarden, T., & Vondrák, J. (2017). Stability and recovery for independence systems. arXiv preprint arXiv:1705.00127. <a href="https://arxiv.org/abs/1705.00127">pdf</a></p>
<p>[BDL] Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205-1223. <a href="https://epubs.siam.org/doi/pdf/10.1137/050644641">pdf</a></p>Sebastian PokuttaAn update on SCIP2020-05-15T07:00:00+02:002020-05-15T07:00:00+02:00http://www.pokutta.com/blog/news/2020/05/15/scip-update<p><em>TL;DR: A quick update on what is on the horizon for SCIP.</em>
<!--more--></p>
<p>SCIP has been a cornerstone of ZIB’s mathematical optimization department for many years. It is probably (one of) the fastest and most comprehensive academic solvers for MIPs and several related optimization paradigms. Certainly it is the fastest MIP and MINLP solver that is fully transparent and accessible in source code. This impressive effort is due to a great team of researchers and developers, both at ZIB and throughout the world, that has been pushing SCIP to the cutting-edge.</p>
<p>Over the last 5 years, two people have strongly shaped the progress of SCIP at ZIB: Thorsten Koch on the organizational side and Ambros Gleixner as head of technical research & development. In Fall of 2019 I moved to ZIB. With this move I also took the lead of the overall SCIP project among several other new responsibilities and I would like to take the opportunity to thank Thorsten Koch for his great leadership of the SCIP project over the last years. I am quite excited to have the opportunity to shape the future of SCIP together with the rest of the SCIP team and in view of this I would like to share some updates. In a nutshell these changes can be summarized as follows:</p>
<ol>
<li>Making SCIP more open</li>
<li>Making SCIP more accessible</li>
<li>Making SCIP more inclusive</li>
</ol>
<p>While not everything can be achieved in one step, this overview might give you an idea of what is on the horizon.</p>
<p>We also have some very exciting new research directions and results, however I am going to talk about some of that work elsewhere in a more research-focused post.</p>
<h2 id="scip-7-release">SCIP 7 release</h2>
<p>Before I am going to talk about some upcoming things, I wanted to briefly mention the recent release of <a href="http://www.optimization-online.org/DB_HTML/2020/03/7705.html">SCIP 7</a>, with many new features. Just to name two, there is a new parallel preprocessing library <em>PaPILO</em> and we now have <a href="http://www.optimization-online.org/DB_HTML/2020/04/7722.html">tree-size prediction</a> built-in:</p>
<blockquote>
<p>On average, the best method estimates B&B tree sizes within a factor of 3 on the set of unseen test instances even during the early stage of the search, and improves in accuracy as the search progresses. It also achieves a factor 2 over the entire search on each out of six additional sets of homogeneous instances we have tested.</p>
</blockquote>
<p><em>#firstSeenInSCIP</em></p>
<p>Both for MIP and MINLP, SCIP 7 is on average 1.36x faster than SCIP 6 on hard instances, i.e., on instances that take at least 100 seconds to solve. You can check out the latest release on the <a href="http://scip.zib.de">SCIP homepage</a>.</p>
<h2 id="interfaces">Interfaces</h2>
<p>SCIP already supports a wide variety of interfaces. In the future we will further integrate SCIP with those interfaces and in particular we will improve integration with Python through <a href="https://github.com/SCIP-Interfaces/PySCIPOpt">PySCIPOpt</a> and Julia through <a href="https://github.com/SCIP-Interfaces/SCIP.jl">SCIP.jl</a>. These two will become true first-class interfaces. Moreover, we will maintain several <a href="https://github.com/SCIP-Interfaces">other interfaces</a> depending on demand etc.</p>
<h2 id="distribution">Distribution</h2>
<p>We intend to extend the distribution mechanisms for SCIP. One very high priority is distribution through the conda package manager, so that the SCIP Optimization Suite and PySCIPOpt can basically be installed with a simple <code class="highlighter-rouge">conda install pyscipopt</code>. We are also exploring making SCIP available in <a href="https://colab.research.google.com/">Google Colab</a>; the conda integration might make this a trivial exercise.</p>
<h2 id="tutorials">Tutorials</h2>
<p>Many of you have experienced that SCIP is a very complex piece of software and that getting started can be a nontrivial endeavor, simply because of its high flexibility as a framework, which is fully exposed through its API. At the same time, SCIP can be used out-of-the-box as a powerful black-box solver. However, many of you have suffered from the current lack of good entry-level documentation. To alleviate this in the short term, we are in the process of writing a tutorial specifically targeting the “black-box user + SCIP” via PySCIPOpt. In the mid term we will try to offer more resources to people who use SCIP mainly as a black-box solver; see the Website section below.</p>
<h2 id="new-platforms">New platforms</h2>
<p>We intend to support several new platforms for the SCIP Optimization Suite. As you might have already seen from <a href="/blog/random/2019/09/29/scipberry.html">a post sometime back</a>, one such platform is ARM. This includes the Raspberry Pi, but also many cell phone and mobile architectures that can then potentially run SCIP. Moreover, we also plan a dockerized version of SCIP for deployment in cloud computing environments. In fact, if you want to give a preliminary build a spin: <code class="highlighter-rouge">docker pull scipoptsuite/scipoptsuite:7.0.0</code>; SCIP Optimization Suite 7.0.0 and PySCIPOpt 3.0.0 on slim buster with Python 3.7—feedback appreciated.</p>
<p>A little further down the road, we will likely also support RISC-V once stable development systems are available, and we are currently evaluating Microsoft’s <a href="https://docs.microsoft.com/en-us/windows/wsl/wsl2-install">WSL</a>, in particular together with <a href="https://ubuntu.com/wsl">Ubuntu on WSL</a>, as an alternative deployment mode for Windows.</p>
<h2 id="decentralized-development">Decentralized Development</h2>
<p>SCIP has had a strong decentralized development component and this trend is likely to increase further in the future, with many more non-ZIB developers contributing to SCIP. In Germany alone, we have four development centers: FAU Erlangen-Nürnberg, TU Darmstadt, RWTH Aachen, and the Zuse Institute Berlin. On top of that, we have a large number of international contributors.</p>
<p>This decentralized development setup, with many stakeholders and core developers outside ZIB, will also be reflected more strongly in SCIP’s governance; more on this soon.</p>
<h2 id="website">Website</h2>
<p>SCIP will move to <a href="http://www.scipopt.org">http://www.scipopt.org</a> as its new home and one-stop shop; the site should be online in a few days. Moreover, we will also separate the website into two parts in the next few months: one for SCIP users and one for SCIP developers.</p>
<h2 id="licensing">Licensing</h2>
<p>There are some license changes on the horizon as well. The short version is that we intend SCIP to be free for non-commercial use in general, and we are currently discussing how to deal with commercial use. One model might be to have a community edition under some permissive open-source license and a professional edition for commercial use. Obviously, this is quite a complicated matter and it will take some time to iron out all the details and settle on a final setup.</p>
<p>In the meantime, if you want to use SCIP, send an email to <a href="mailto:licenses@zib.de">licenses@zib.de</a> and we will work something out in the spirit of the above.</p>
<h2 id="hiring">Hiring</h2>
<p>We are looking to grow the SCIP developer team. If you want to contribute to the future development of SCIP and want to get involved please get in touch.</p>Sebastian PokuttaTL;DR: A quick update on what is on the horizon for SCIP.Psychedelic Style Transfer2020-04-09T01:00:00+02:002020-04-09T01:00:00+02:00http://www.pokutta.com/blog/research/2020/04/09/ai-art<p><em>TL;DR: We point out how to make psychedelic animations from discarded instabilities in neural style transfer. This post builds upon a remark we made in our recent paper <a href="https://arxiv.org/abs/2003.06659">Interactive Neural Style Transfer with Artists</a>. In this paper, we questioned several simple evaluation aspects of neural style transfer methods. Also, it is our second series of interactive painting experiments where style transfer outputs constantly influence a painter, see the other series <a href="https://arxiv.org/abs/1910.04386">here</a>. See also our medium <a href="https://medium.com/@human.aimachine.art/psychedelic-style-transfer-5744b700fc3e">post</a>.</em>
<!--more--></p>
<p><em>Written by Thomas Kerdreux and Louis Thiry.</em> <br /></p>
<div class="paddingContainer">
<div class="iframe-container center">
<iframe width="100%" height="100%" src="https://www.youtube-nocookie.com/embed/1jg6CqMEbcQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</div>
<p>The first frame of the video, a watercolor from my grandfather, is progressively stirred into a plethora of curvy and colorful patches. It then metamorphoses into a purplish phantasmal coral reef that is itself slowly submerged by an angry puce ocean. The water then calms down as the coral reef disappears and ends up perfectly still. How is this psychedelic animation related to style transfer methods?</p>
<p>Neural style transfers are rendering techniques – for images mostly – that seek to stylize a content image with the style of another, see figure below. More precisely, the algorithms are designed to extract a style representation of an image and a representation of the semantic content of another and then cleverly construct a new picture from these.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Heard Island in Antarctica</th>
<th style="text-align: center">Maxime Maufra’s painting</th>
<th style="text-align: center">Style Transfer Output using STROTSS [KS]</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_content.jpg" style="zoom:415%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_style.jpg" style="zoom:415%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/example_ST_output.jpg" style="zoom:485%;" /></td>
</tr>
</tbody>
</table>
<p>While designing new evaluation techniques for style transfer methods in [KT], we made an uncomplicated but crucial observation. <strong>Style transfer applied to the same image as style and content should reasonably output the image itself</strong>. However, we observed that many style transfer algorithms do not satisfy this property. No one ever cared to hard-code this fundamental property. Here, we show how, leveraging that instability, we produce animations like the one above.</p>
<table>
<thead>
<tr>
<th style="text-align: center">MST first iteration</th>
<th style="text-align: center">MST second iteration</th>
<th style="text-align: center">MST third iteration</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_0.jpg" style="zoom:400%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_1.jpg" style="zoom:400%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/MST_2.jpg" style="zoom:400%;" /></td>
</tr>
</tbody>
</table>
<p>Formally, a style transfer method is simply a function \(f\) that takes a style image \(s\) and a content image \(c\) and outputs a new image \(f(s,c)\). Our observation is that for some style transfer methods \(f\) and an initial image \(x_0\), the equality $f(x_0, x_0) = x_0$ is not satisfied. The output image adds a slightly perceptible flicker, blur, or blemish to the initial image \(x_0\). These instability patterns differ from one method to another but are experimentally the same when starting from different images \(x_0\).</p>
<p>Yet these effects are hardly perceptible. Hence, to better understand the phenomenon, we need to amplify them. We simply repeat the process: start from an initial image \(x_0\) and reiterate the style transfer operation</p>
\[\begin{align*}
x_{t+1} = f(x_t, x_t)
\end{align*}.\]
<p>In the figure above, after a few iterations, the effects become perceptible and particularly stylish. For instance, when taking the MST style transfer method [MST] (with this <a href="https://github.com/irasin/Pytorch_MST">code</a>), the iterates become tessellated versions of the initial image. The instabilities amplify all the lines of the pictures. On portraits, they reveal all wrinkles. When taking another algorithm like WCT [WCT] (with <a href="https://github.com/irasin/Pytorch_WCT">this code</a>), the effects are different. The goblin is slowly dematerialized by the devilish style transfer instabilities, see the figure below.</p>
<table>
<thead>
<tr>
<th style="text-align: center">WCT first iteration</th>
<th style="text-align: center">WCT second iteration</th>
<th style="text-align: center">WCT fourth iteration</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_0.jpg" style="zoom:200%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_1.jpg" style="zoom:200%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/WCT_diablotin_4.jpg" style="zoom:200%;" /></td>
</tr>
</tbody>
</table>
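The repeated application \(x_{t+1} = f(x_t, x_t)\) can be sketched in a few lines. The sketch below is purely illustrative: `noisy_identity` is a hypothetical stand-in for a real feed-forward style transfer network, mimicking the instability \(f(x,x) \neq x\) with a tiny fixed perturbation, so that one can see how a barely perceptible per-step error accumulates over the sequence of frames.

```python
import numpy as np

def iterate_style_transfer(f, x0, num_steps):
    """Collect the sequence x_0, x_1, ... with x_{t+1} = f(x_t, x_t)."""
    frames = [x0]
    x = x0
    for _ in range(num_steps):
        x = f(x, x)  # same image used as style and content
        frames.append(x)
    return frames

def noisy_identity(style, content):
    """Hypothetical stand-in for a feed-forward style transfer network:
    the identity plus a tiny fixed perturbation, so that f(x, x) != x."""
    rng = np.random.default_rng(0)  # fixed seed: same perturbation each call
    return content + 0.01 * rng.standard_normal(content.shape)

x0 = np.zeros((64, 64, 3))  # placeholder "image"
frames = iterate_style_transfer(noisy_identity, x0, num_steps=10)
drift = [float(np.abs(x - x0).max()) for x in frames]
# the barely perceptible per-step error accumulates across the frames
```

With a real network in place of `noisy_identity`, saving each element of `frames` as an image yields exactly the kind of animation shown in the video above.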
<p>So far, we have simply shown the outputs of the first iterations of the repeated process. Actually, the animation above basically collects all the images of the sequence \((x_t)\). For many different pictures and methods, we observe this asymptotic type of divergence, which we name the <em>psychedelic regime</em>. Indeed, once the algorithm loses track of the initial image, it starts raving, feeding itself with its own slowly delusional outputs, without ever going back to our reality! The raving differs from one method to another but experimentally seems not to depend on the initial image.</p>
<p>This playfully shows what a machine can do when it forgets about the human inputs or the non-numerical reality. In fact, metaphorically, this is also happening in many practical uses of algorithms. For instance, collaborative-filtering recommender systems use new data that come from humans interacting with the algorithm. We can no longer assess the choices humans would have made without ever being influenced by algorithms. We have lost this initial input!</p>
<p>[R] and [G] studied instabilities of style transfer methods in the case of real-time style transfer for videos. The style transfer output may differ significantly from one frame to another while the initial consecutive frames are perceptibly the same. This results in an unpleasant flickering effect in style-transferred videos. Similarly to the adversarial examples literature, the main focus there is to study the instabilities in order to detect, correct, and remove them. Here we outlined instabilities stemming from another type of inconsistency and took advantage of them.</p>
<p>Also, note that MST and WCT are feed-forward approaches to style transfer, i.e., the function $f$ is a neural network [JA,GL,LW]. In fact, the first approach to neural style transfer was optimization-based [G]. In particular, when considering the same image as style and content, the image is the global optimum of the loss. Hence the method satisfies $f(x,x)=x$ if properly initialized. Actually, even when choosing a random initial image, we observed that the image iterate converges to the initial image, <em>i.e.</em> the global minimum of a non-convex loss. Note though that some optimization-based methods like STROTSS may still not satisfy this stability property because of some randomization and re-parametrization of the image with its Laplacian pyramid.</p>
<p>Finally, if you are interested in making your own psychedelic videos, the take-home message is that most certainly any feed-forward neural style transfer approach will give a different <em>psychedelic regime</em>. Below we show one using the WCT method, as well as the first iterations when using the optimization-based STROTSS method (our <a href="https://github.com/human-aimachine-art/pytorch-STROTSS-improved">code</a>).</p>
<div class="paddingContainer">
<div class="iframe-container center">
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/lyyAFlmNjIg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>
</div>
<table>
<thead>
<tr>
<th style="text-align: center">STROTSS first iteration</th>
<th style="text-align: center">STROTSS, several iterations later…</th>
<th style="text-align: center">STROTSS, many iterations later…</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_0.jpg" style="zoom:300%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_5.jpg" style="zoom:300%;" /></td>
<td style="text-align: center"><img src="/blog/assets/style_transfer/STROTSS_15.jpg" style="zoom:300%;" /></td>
</tr>
</tbody>
</table>
<h3 id="references">References</h3>
<p>[CKT] Cabannes, V., Kerdreux, T., Thiry, L., Campana, T., & Ferrandes, C. (2019). Dialog on a Canvas with a Machine. Third Workshop of Creativity and Design at NeurIPS 2019. <a href="https://arxiv.org/abs/1910.04386">pdf</a></p>
<p>[JA] Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, 694–711. Springer. <a href="https://arxiv.org/abs/1603.08155">pdf</a></p>
<p>[G] Gatys, L. A., Ecker, Alexander S., et Bethge, M.. A neural algorithm of artistic style. (2015). <a href="https://arxiv.org/abs/1508.06576">pdf</a></p>
<p>[GL] Ghiasi, G.; Lee, H.; Kudlur, M.; Dumoulin, V.; and Shlens, J. (2017). Exploring the structure of a real-time, arbitrary neural artistic stylization network. <a href="https://arxiv.org/abs/1705.06830">pdf</a></p>
<p>[G] Gupta, A.; Johnson, J.; Alahi, A.; and Fei-Fei, L. 2017. Characterizing and improving stability in neural style transfer. In Proceedings of the IEEE International Conference on Computer Vision, 4067–4076. <a href="https://arxiv.org/abs/1705.02092">pdf</a></p>
<p>[KT] Kerdreux, T., Thiry, L., Kerdreux, E. (2020). Interactive Neural Style Transfer with Artists. <a href="https://arxiv.org/abs/2003.06659">pdf</a></p>
<p>[KS] Kolkin, N., Salavon, J., Shakhnarovich G. (2019). Style Transfer by Relaxed Optimal Transport and Self-Similarity. <a href="https://arxiv.org/abs/1904.12785">pdf</a></p>
<p>[LW] Li, C., and Wand, M. (2016). Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, 702–716 Springer. <a href="https://arxiv.org/abs/1604.04382">pdf</a></p>
<p>[MST] Zhang, Y., Fang, C., Wang, Y., Wang, Z., Lin, Z., Fu, Y., Yang, J. (2017). Multimodal Style Transfer via Graph Cuts. In Proceedings of the IEEE International Conference on Computer Vision. 2019. p. 5943–5951. <a href="https://arxiv.org/abs/1904.04443">pdf</a></p>
<p>[WCT] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M. H. (2017). Universal style transfer via feature transforms. In Advances in neural information processing systems (pp. 386–396). <a href="https://arxiv.org/abs/1705.08086">pdf</a></p>
<p>[R] Risser, E.; Wilmot, P.; and Barnes, C. 2017. Stable and controllable neural texture synthesis and style transfer using histogram losses. <a href="https://arxiv.org/abs/1701.08893">pdf</a></p>Thomas Kerdreux, Louis ThiryTL;DR: We point out how to make psychedelic animations from discarded instabilities in neural style transfer. This post builds upon a remark we made in our recent paper Interactive Neural Style Transfer with Artists. In this paper, we questioned several simple evaluation aspects of neural style transfer methods. Also, it is our second series of interactive painting experiments where style transfer outputs constantly influence a painter, see the other series here. See also our medium post.Boosting Frank-Wolfe by Chasing Gradients2020-03-16T00:00:00+01:002020-03-16T00:00:00+01:00http://www.pokutta.com/blog/research/2020/03/16/boostFW<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/pdf/2003.06369.pdf">Boosting Frank-Wolfe by Chasing Gradients</a> by <a href="https://cyrillewcombettes.github.io">Cyrille Combettes</a> and <a href="http://www.pokutta.com">Sebastian Pokutta</a>, where we propose to speed-up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient. This is achieved by chasing the negative gradient direction in a matching pursuit-style, while still remaining projection-free. Although the idea is reasonably natural, it produces very significant results.</em></p>
<!--more-->
<p><em>Written by Cyrille Combettes. <a href="https://www.youtube.com/watch?v=BfyV0C5FRbE">ICML video</a></em></p>
<h2 id="motivation">Motivation</h2>
<p>The Frank-Wolfe algorithm (FW) [FW, CG] is a simple projection-free algorithm addressing problems of the form</p>
\[\begin{align*}
\min_{x\in\mathcal{C}}f(x)
\end{align*}\]
<p>where $f$ is a smooth convex function and $\mathcal{C}$ is a compact convex set. At each iteration, FW performs a linear minimization $v_t\leftarrow\arg\min_{v\in\mathcal{C}}\langle\nabla f(x_t),v\rangle$ and updates $x_{t+1}\leftarrow x_t+\gamma_t(v_t-x_t)$. That is, it searches for a vertex $v_t$ minimizing the linear approximation of $f$ at $x_t$ over $\mathcal{C}$, i.e., $\arg\min_{v\in\mathcal{C}}f(x_t)+\langle\nabla f(x_t),v-x_t\rangle$, and moves in that direction. Thus, by imposing a step-size $\gamma_t\in\left[0,1\right]$, it ensures that $x_{t+1}=(1-\gamma_t)x_t+\gamma_tv_t\in\mathcal{C}$ is feasible by convex combination, and hence there is no need to use projections back onto $\mathcal{C}$. This property is very useful when projections onto $\mathcal{C}$ are much more expensive than linear minimizations over $\mathcal{C}$.</p>
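The FW update just described can be sketched in a few lines. The following is an illustrative example (not the authors' code) on the probability simplex, where the linear minimization oracle is trivial: \(\arg\min_{v}\langle g,v\rangle\) over the simplex is attained at the vertex \(e_i\) with the smallest gradient coordinate.

```python
import numpy as np

def fw_simplex(grad, x0, num_iters):
    """Vanilla Frank-Wolfe over the probability simplex.

    The linear minimization argmin_{v in C} <grad f(x_t), v> is attained
    at the vertex e_i whose gradient coordinate is smallest, so no
    projection onto C is ever needed.
    """
    x = x0.copy()
    for t in range(num_iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0     # LMO: best vertex of the simplex
        gamma = 2.0 / (t + 2.0)   # classical step-size rule
        x = x + gamma * (v - x)   # convex combination stays feasible
    return x

# minimize f(x) = 0.5 * ||x - b||^2 over the simplex (b lies in the simplex)
b = np.array([0.1, 0.7, 0.2])
x = fw_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]), num_iters=2000)
# x remains feasible throughout and approaches the optimum b
```

Every iterate is a convex combination of simplex vertices, so feasibility is maintained for free; this is the projection-free property the post emphasizes.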
<p>The main drawback of FW, however, lies in its convergence rate, which can be excessively slow when the descent directions $v_t-x_t$ are inadequate. This motivated the Away-Step Frank-Wolfe algorithm (AFW) [W, LJ]. Figure 1 illustrates the zig-zagging phenomenon that can arise in FW and how it is solved by AFW.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/fig1.png" alt="img1" style="float:center; margin-right: 1%; width:45%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig2.png" alt="img2" style="float:center; margin-right: 1%; width:45%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 1.</strong> Trajectory of the iterates of FW and AFW to minimize $f(x)=\norm{x}_2^2/2$ over the convex hull of $\setb{(-1,0),(0,1),(1,0)}$, starting from $x_0=(0,1)$. The solution is $x^\esx=(0,0)$. FW tries to reach $x^\esx$ by moving <em>towards</em> vertices, which is not efficient here as the directions $v_t-x_t$ become more and more orthogonal to $x^\esx-x_t$. AFW solves this issue by adding the option to move <em>away</em> from vertices; here, $x_4$ is obtained by moving away from $x_0$ and enables $x_5=x^\esx$.</p>
<p>However, the descent directions of AFW might still not be as favorable as those of gradient descent. Furthermore, in order to decide whether to move towards vertices or away from vertices and, when appropriate, to perform an away step, AFW needs to maintain the decomposition of the iterates into vertices of $\mathcal{C}$. This can become very costly in both memory usage and computation time [GM]. Thus, we propose to directly estimate the gradient descent direction $-\nabla f(x_t)$ by sequentially picking up vertices in matching pursuit style [MZ]. By doing so, we can descend in directions better aligned with those of the negative gradients while still remaining projection-free.</p>
<h2 id="boosting-via-gradient-pursuit">Boosting via gradient pursuit</h2>
<p>At each iteration, we perform a sequence of rounds <em>chasing</em> the direction $-\nabla f(x_t)$. We initialize our direction estimate as $d_0\leftarrow0$. At round $k$, the residual is $r_k\leftarrow-\nabla f(x_t)-d_k$ and we aim at maximally reducing it by subtracting its maximum component among the vertex directions. We update $d_{k+1}\leftarrow d_k+\lambda_ku_k$ where $u_k\leftarrow v_k-x_t$, $v_k\leftarrow\arg\max_{v\in\mathcal{C}}\langle r_k,v\rangle$, and $\lambda_k\leftarrow\frac{\langle r_k,u_k\rangle}{\norm{u_k}^2}$: $\lambda_ku_k$ is the projection of $r_k$ onto its maximum component $u_k$. Note that this “projection” is actually closed-form and very cheap. The new residual is $r_{k+1}\leftarrow-\nabla f(x_t)-d_{k+1}=r_k-\lambda_ku_k$. We stop the procedure whenever the improvement in alignment between rounds $k$ and $k+1$ is not <em>sufficient</em>, i.e., whenever</p>
\[\begin{align*}
\frac{\langle-\nabla f(x_t),d_{k+1}\rangle}{\norm{\nabla f(x_t)}\norm{d_{k+1}}}-\frac{\langle-\nabla f(x_t),d_k\rangle}{\norm{\nabla f(x_t)}\norm{d_k}}<\delta
\end{align*}\]
<p>for some $\delta\in\left]0,1\right[$. In our experiments, we typically set $\delta=10^{-3}$.</p>
<p>We stress that $d_k$ can be well aligned with $-\nabla f(x_t)$ even when $\norm{-\nabla f(x_t)-d_k}$ is arbitrarily large: we aim at estimating <em>the direction of</em> $-\nabla f(x_t)$ and not the vector $-\nabla f(x_t)$ itself (which would require many more rounds). Once the procedure is completed, we use $g_t\leftarrow d_k/\sum_{\ell=0}^{k-1}\lambda_\ell$ as descent direction. This normalization ensures that the entire segment $[x_t,x_t+g_t]$ is in the feasible region $\mathcal{C}$. Hence, by updating $x_{t+1}\leftarrow x_t+\gamma_tg_t$ with a step-size $\gamma_t\in\left[0,1\right]$, we remain projection-free while descending in the direction $g_t$ better aligned with $-\nabla f(x_t)$. This is the design of our <em>Boosted Frank-Wolfe</em> algorithm (BoostFW). Figure 2 illustrates the procedure.</p>
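The gradient pursuit rounds can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses the probability simplex (trivial LMO), always accepts the first round, and omits edge cases the full algorithm handles (e.g., non-positive step weights \(\lambda_k\)).

```python
import numpy as np

def lmo_simplex(c):
    """argmax_{v in simplex} <c, v>: the vertex of c's largest coordinate."""
    v = np.zeros_like(c)
    v[np.argmax(c)] = 1.0
    return v

def boost_direction(neg_grad, x, delta=1e-3, max_rounds=100):
    """Chase -grad f(x_t) with vertex directions (gradient pursuit rounds)."""
    d = np.zeros_like(x)      # direction estimate d_0 = 0
    r = neg_grad.copy()       # residual r_0 = -grad f(x_t) - d_0
    lam_sum = 0.0
    align = -np.inf           # alignment of d_k with -grad f(x_t)
    for _ in range(max_rounds):
        u = lmo_simplex(r) - x                 # u_k = v_k - x_t
        lam = np.dot(r, u) / np.dot(u, u)      # cheap closed-form "projection"
        d_new = d + lam * u
        align_new = np.dot(neg_grad, d_new) / (
            np.linalg.norm(neg_grad) * np.linalg.norm(d_new))
        if align_new - align < delta:          # improvement too small: stop
            break
        d, r, align = d_new, r - lam * u, align_new
        lam_sum += lam
    return d / lam_sum        # normalize so [x_t, x_t + g_t] stays in C

x = np.full(3, 1.0 / 3.0)     # current iterate, interior of the simplex
b = np.array([0.1, 0.7, 0.2])
neg_grad = b - x              # -grad f(x) for f(x) = 0.5 * ||x - b||^2
g = boost_direction(neg_grad, x)
# by construction, g is at least as well aligned with -grad f(x)
# as the plain FW direction v_0 - x
```

Since the first pursuit round picks exactly the FW vertex and later rounds are only accepted when the alignment improves by at least \(\delta\), the returned direction is never worse aligned than the plain FW direction.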
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/fig4.png" alt="img4" style="float:center; margin-right: 1%; width:22%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig5.png" alt="img5" style="float:center; margin-right: 1%; width:22%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig6.png" alt="img6" style="float:center; margin-right: 1%; width:22%" />
<img src="http://www.pokutta.com/blog/assets/boostfw/fig7.png" alt="img7" style="float:center; margin-right: 1%; width:22%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 2.</strong> Illustration of the gradient pursuit procedure. It builds a descent direction $g_t$ better aligned with the gradient descent direction $-\nabla f(x_t)$. We have $g_t=d_2/(\lambda_0+\lambda_1)$ where $d_2=\lambda_0u_0+\lambda_1u_1$, $u_0=v_0-x_t$, and $u_1=v_1-x_t$. Furthermore, note that $[x_t,x_t+d_2]\not\subset\mathcal{C}$ but $[x_t,x_t+g_t]\subset\mathcal{C}$. Hence, moving along the segment $[x_t,x_t+g_t]$ ensures feasibility of the new iterate $x_{t+1}$.</p>
<h2 id="computational-results">Computational results</h2>
<p>Observe that BoostFW likely performs multiple linear minimizations per iteration, while FW only performs $1$ and AFW performs $\sim2$ ($1$ for the FW vertex and $\sim1$ for the away vertex). Thus, one might wonder if the progress obtained by the gradient pursuit procedure is washed away by the higher number of linear minimizations. We conducted a series of computational experiments demonstrating that this is not the case and that the advantage is quite substantial. We compared BoostFW to AFW, DICG [GM], and BCG [BPTW] on various tasks: sparse signal recovery, sparsity-constrained logistic regression, traffic assignment, collaborative filtering, and video-colocalization. Figures 3-6 show that BoostFW outperforms the other algorithms both per iteration and in CPU time, although it calls the linear minimization oracle more often: BoostFW makes better use of its oracle calls.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/lasso-xkcd.png" alt="lasso" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 3.</strong> Sparse signal recovery.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/gisette-all-ls-noafwl-xkcd.png" alt="gisette" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 4.</strong> Sparse logistic regression on the Gisette dataset.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/traffic-xkcd.png" alt="traffic" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 5.</strong> Traffic assignment. DICG is not applicable here.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/collabo-xkcd.png" alt="collabo" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 6.</strong> Collaborative filtering on the MovieLens 100k dataset. DICG is not applicable here.</p>
<p>Lastly, we present a preliminary extension of our boosting procedure to DICG. DICG is known to perform particularly well on the video-colocalization experiment of [JTF]. The comparison is made in duality gap, in line with [GM]. Figure 7 shows promising results for BoostDICG.</p>
<div class="center">
<img src="http://www.pokutta.com/blog/assets/boostfw/video-gaps-xkcd.png" alt="video" style="float:center; margin-right: 1%; width:80%" />
<p style="clear: both;"></p>
</div>
<p class="figcap"><strong>Figure 7.</strong> Video-colocalization on the YouTube-Objects dataset.</p>
<h3 id="references">References</h3>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[W] Wolfe, P. (1970). Convergence theory in nonlinear programming. Integer and nonlinear programming, 1-36.</p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[GM] Garber, D., & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems (pp. 1001-1009). <a href="http://papers.nips.cc/paper/6115-linear-memory-and-decomposition-invariant-linearly-convergent-conditional-gradient-algorithm-for-structured-polytopes">pdf</a></p>
<p>[MZ] Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing, 41(12), 3397-3415. <a href="https://pdfs.semanticscholar.org/0b6e/98a6a8cf8283fd76fe1100b23f11f4cfa711.pdf">pdf</a></p>
<p>[BPTW] Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditional gradients: the unconditioning of conditional gradients. Proceedings of ICML. <a href="https://arxiv.org/abs/1805.07311">pdf</a></p>
<p>[JTF] Joulin, A., Tang, K., & Fei-Fei, L. (2014, September). Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision (pp. 253-268). Springer, Cham. <a href="https://link.springer.com/chapter/10.1007/978-3-319-10599-4_17">pdf</a></p>Cyrille CombettesTL;DR: This is an informal summary of our recent paper Boosting Frank-Wolfe by Chasing Gradients by Cyrille Combettes and Sebastian Pokutta, where we propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient. This is achieved by chasing the negative gradient direction in a matching pursuit-style, while still remaining projection-free. Although the idea is reasonably natural, it produces very significant results.Non-Convex Boosting via Integer Programming2020-02-13T06:00:00+01:002020-02-13T06:00:00+01:00http://www.pokutta.com/blog/research/2020/02/13/ipboost-abstract<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/2002.04679">IPBoost – Non-Convex Boosting via Integer Programming</a> with <a href="https://www2.mathematik.tu-darmstadt.de/~pfetsch/">Marc Pfetsch</a>, where we present a non-convex boosting procedure that relies on integer programming. Rather than solving a convex proxy problem, we solve the actual classification problem with discrete decisions. The resulting procedure achieves performance on par with or better than AdaBoost; however, it is robust to label noise that can defeat convex potential boosting procedures.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Boosting (see <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)">Wikipedia</a>) is an important (and by now standard) technique in classification to combine several “low accuracy” learners, so-called <em>base learners</em>, into a “high accuracy” learner, a so-called <em>boosted learner</em>. Since the pioneering AdaBoost approach of [FS], there has been extensive work in recent decades on boosting procedures and analyses of their limitations. In a nutshell, boosting procedures are (typically) iterative schemes that, for $t = 1, \dots, T$, roughly do the following:</p>
<ol>
<li>Train a learner $\mu_t$ from a given class of base learners on
the data distribution $\mathcal D_t$.</li>
<li>Evaluate performance of $\mu_t$ by computing its loss.</li>
<li>Push weight of the data distribution $\mathcal D_t$ towards misclassified examples leading to $\mathcal D_{t+1}$.</li>
</ol>
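As a concrete instance of this template, here is a minimal AdaBoost-style sketch (illustrative only; this is not the IPBoost method of the paper) with one-feature threshold stumps as base learners. The exponential reweighting in step 3 is AdaBoost's particular way of pushing the distribution towards misclassified examples.

```python
import numpy as np

def adaboost_stumps(X, y, T):
    """AdaBoost with one-feature threshold stumps as base learners."""
    n = len(y)
    D = np.full(n, 1.0 / n)  # data distribution D_1
    learners = []
    for _ in range(T):
        # 1. train a base learner on D_t: the stump (feature, threshold,
        #    sign) with the smallest weighted error
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        # 2. evaluate: the learner's vote weight from its weighted loss
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        # 3. push weight of D_t towards misclassified examples
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        learners.append((alpha, j, thr, sign))
    return learners

def predict(learners, X):
    """Combine the base learners by weighted voting, then threshold."""
    score = sum(a * s * np.where(X[:, j] <= t, 1, -1)
                for a, j, t, s in learners)
    return np.sign(score)

# toy 1-d data that no single stump can fit; a few boosting rounds fit it
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, 1, -1])
learners = adaboost_stumps(X, y, T=5)
```

On this toy set the best single stump errs on one of the four points, yet after a handful of rounds the weighted vote classifies all points correctly — exactly the "weak learners into a strong learner" effect described above.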
<p>Finally, the learners are combined by some form of voting (e.g., soft or hard voting, averaging, thresholding). A close inspection of most (but not all) boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as <em>convex potential boosters</em>. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy $1$ (see, e.g., [AHK]); however, such a learner might not necessarily generalize well. Boosted learners can generate quite complicated decision boundaries, much more complicated than those of the base learners. Here is an example from <a href="https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/">Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook</a>. Here, data is generated online according to some process whose optimal decision boundary is represented by the dotted line, and <a href="https://xgboost.ai/">XGBoost</a> was used to learn a classifier:</p>
<p class="minimg"><img src="https://paulvanderlaken.files.wordpress.com/2020/01/xgboost.gif?w=518&zoom=2" alt="XGBoost example" /></p>
<p><br /></p>
<h3 id="label-noise">Label noise</h3>
<p>In reality we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. However, if we revisit the general boosting template from above, we might suspect trouble as soon as a certain fraction of training examples is mislabeled: these examples can never be classified correctly, so the procedure shifts more and more weight towards them. This eventually leads to a strong learner that perfectly predicts the (flawed) training data but no longer generalizes well. This intuition has been formalized by [LS], who construct a “hard” training data distribution where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in the performance of such boosted learners; see the results below. The more technical reason for this problem is the convexity of the loss function that is minimized by the boosting procedure. One can of course resort to all kinds of “tricks”, such as <em>early stopping</em>, but at the end of the day these do not solve the fundamental problem.</p>
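To make the noise model concrete, random classification noise of the kind studied in [LS] can be injected as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def add_label_noise(y, eta, seed=0):
    """Random classification noise: each label in {-1, +1} is flipped
    independently with probability eta, as in the hard instances of [LS]."""
    rng = np.random.default_rng(seed)
    flip = rng.random(len(y)) < eta
    return np.where(flip, -np.asarray(y), np.asarray(y))
```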
<h2 id="our-results">Our results</h2>
<p>To combat the problem that convex potential boosters are susceptible to label noise, rather than optimizing some convex proxy loss, why not consider the classification problem with the actual <em>misclassification loss function</em>:</p>
\[\tag{classify}
\begin{align}
\label{eq:trueLoss}
\ell(\theta,D) \doteq \sum_{i \in I} \mathbb I[h_\theta(x_i) \neq
y_i],
\end{align}\]
<p>where $h_\theta$ is a learner parameterized by $\theta$ and $D = \setb{(x_i,y_i) \mid i \in I}$ is the training data? This loss function counts the number of misclassifications, which seems a natural quantity to minimize. Unfortunately, it is non-convex; however, the resulting optimization problem can be rather naturally phrased as an <em>Integer Program (IP)</em>. In fact, our basic boosting model is captured by the following integer programming problem:</p>
\[\tag{basicBoost}
\begin{align*}
\min\; & \sum_{i=1}^N z_i \\
& \sum_{j=1}^L \eta_{ij}\, \lambda_j + (1 + \rho) z_i \geq
\rho\quad\forall\, i \in [N],\\
& \sum_{j=1}^L \lambda_j = 1,\; \lambda \geq 0,\\
& z \in \{0,1\}^N,
\end{align*}\]
<p>where the matrix $\eta_{ij}$ encodes the predictions of learner $j$ on example $i$. The boosting part comes naturally into play here as the number of base learners is potentially huge (sometimes even infinite) and we have to generate these learners with some procedure. We do this by means of <em>column generation</em>, where we add base learners via a pricing problem that essentially generates an acceptable base learner for the modified data distribution encoded in the dual variables of the relaxed problem (basicBoost). This is somewhat similar to the (LP-based) LPBoost approach of [DKS]; however, we consider an integer program here, where column generation is significantly more involved. The dual problem is of the following form:</p>
\[\begin{align*}
\tag{dualProblem}
\max\; & \rho \sum_{i=1}^N w_i + v - \sum_{i=1}^N u_i\\
& \sum_{i=1}^N \eta_{ij}\, w_i + v \leq 0 \quad\forall\, j \in \mathcal{L},\\
& \;(1 + \rho) w_i - u_i \leq 1\quad\forall\, i \in [N],\\
& w \geq 0,\; u \geq 0,\; v \text{ free},
\end{align*}\]
<p>and, after some cleanup, the pricing constraint that needs to be satisfied is:</p>
\[\tag{pricing}
\begin{equation}\label{eq:PricingProb}
\sum_{i=1}^N \eta_{ij}\, w_i^\esx + v^\esx > 0
\end{equation},\]
<p>and we ask whether there exists a base learner $h_j \in \Omega$ such
that (pricing) holds. Here, the $w_i^\esx$ can be seen
as weights over the points $x_i$ with $i \in [N]$, and we have to
classify the points according to these weights. This pricing problem is solved within a branch-and-cut-and-price framework, which complicates things significantly compared to column generation in the LP case.</p>
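As an illustration of pricing over a very simple base class, the sketch below searches one-dimensional threshold stumps for a violating column. It assumes the common margin convention $\eta_{ij} = y_i h_j(x_i) \in \{-1,+1\}$; the actual pricing in the paper runs over a generic base learner class inside branch-and-cut-and-price:

```python
import numpy as np

def price_stump(X, y, w, v):
    """Search threshold stumps for one violating (pricing), i.e. with
    sum_i eta_ij * w_i + v > 0, assuming eta_ij = y_i * h_j(x_i).

    Returns (stump, score); stump is None if no improving learner exists.
    """
    best_score, best_stump = -np.inf, None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, f] - thr) >= 0, 1, -1)
                score = float(np.dot(w, y * pred) + v)
                if score > best_score:
                    best_score, best_stump = score, (f, thr, sign)
    if best_score > 0:                  # violated dual constraint: add column
        return best_stump, best_score
    return None, best_score             # pricing certifies optimality
```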
<h3 id="computations">Computations</h3>
<p>Solving an IP is computationally much more expensive than running traditional boosting approaches or LPBoost. However, what we gain is robustness and stability. For the hard distribution of [LS] we significantly outperform AdaBoost and gain moderately compared to LPBoost, which is already more robust towards label noise as it re-solves the optimization in each round, albeit for a proxy loss function. The reported accuracy is test accuracy over multiple runs for various parameters, and $L$ denotes the (average) number of learners generated in order to construct the strong learner.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/ipboost/hardDistPerf.png" alt="Results on hard distribution" /></p>
<p>Note that the hard distribution is for a binary classification problem, so that $50\%$ accuracy is random guessing. Therefore, the improvement from $53.27\%$ for AdaBoost, which is basically random guessing, to $69.03\%$ is quite significant.</p>
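For a small, fixed pool of base learners, the (basicBoost) model can be written down directly with an off-the-shelf MIP solver. The sketch below uses SciPy's MILP interface and deliberately omits column generation, which the paper handles via branch-and-cut-and-price:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def basic_boost_ip(eta, rho=0.1):
    """Solve (basicBoost) for a fixed pool of base learners.

    eta : (N, L) array, eta[i, j] = +1 if learner j classifies example i
          correctly and -1 otherwise; rho is the target margin.
    Returns (lam, z): simplex weights over learners and 0/1 violation flags.
    """
    N, L = eta.shape
    c = np.concatenate([np.zeros(L), np.ones(N)])            # min sum_i z_i
    # sum_j eta_ij lambda_j + (1 + rho) z_i >= rho   for all i
    margin = LinearConstraint(
        np.hstack([eta, (1.0 + rho) * np.eye(N)]), lb=rho * np.ones(N))
    # sum_j lambda_j = 1
    simplex = LinearConstraint(
        np.concatenate([np.ones(L), np.zeros(N)])[None, :], lb=1.0, ub=1.0)
    integrality = np.concatenate([np.zeros(L), np.ones(N)])  # z binary
    bounds = Bounds(np.zeros(L + N),
                    np.concatenate([np.full(L, np.inf), np.ones(N)]))
    res = milp(c=c, constraints=[margin, simplex],
               integrality=integrality, bounds=bounds)
    lam, z = res.x[:L], np.round(res.x[L:])
    return lam, z
```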
<h3 id="references">References</h3>
<p>[FS] Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In <em>European conference on computational learning theory</em> (pp. 23-37). Springer, Berlin, Heidelberg. <a href="https://pdfs.semanticscholar.org/5fb5/f7b545a5320f2a50b30af599a9d9a92a8216.pdf">pdf</a></p>
<p>[AHK] Arora, S., Hazan, E., & Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1), 121-164. <a href="http://www.theoryofcomputing.org/articles/v008a006/v008a006.pdf">pdf</a></p>
<p>[LS] Long, P. M., & Servedio, R. A. (2010). Random classification noise defeats all convex potential boosters. Machine learning, 78(3), 287-304. <a href="http://www.machinelearning.org/archive/icml2008/papers/258.pdf">pdf</a></p>
<p>[DKS] Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1-3), 225-254. <a href="https://link.springer.com/content/pdf/10.1023/A:1012470815092.pdf">pdf</a></p>
<p><em>Sebastian Pokutta. TL;DR: This is an informal summary of our recent paper IPBoost – Non-Convex Boosting via Integer Programming with Marc Pfetsch, where we present a non-convex boosting procedure that relies on integer programming. Rather than solving a convex proxy problem, we solve the actual classification problem with discrete decisions. The resulting procedure achieves performance on par with or better than AdaBoost while being robust to label noise that can defeat convex potential boosting procedures.</em></p>
<h1 id="approximate-caratheodory-via-frank-wolfe"><a href="http://www.pokutta.com/blog/research/2019/11/30/approxCara-abstract">Approximate Carathéodory via Frank-Wolfe</a> (2019-11-30)</h1>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/pdf/1911.04415.pdf">Revisiting the Approximate Carathéodory Problem via the Frank-Wolfe Algorithm</a> with <a href="https://www.linkedin.com/in/cyrille-combettes/">Cyrille W Combettes</a>. We show that the Frank-Wolfe algorithm constitutes an intuitive and efficient method to obtain a solution to the approximate Carathéodory problem and that it also provides improved cardinality bounds in particular scenarios.</em>
<!--more--></p>
<p><em>Written by Cyrille W Combettes.</em></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Let $\mathcal{V}\subset\mathbb{R}^n$ be a compact set and denote by $\mathcal{C} \doteq \operatorname{conv}(\mathcal{V})$ its convex hull. Slightly abusing notation, we will refer to any point in $\mathcal{V}$ as a <em>vertex</em>. Let the <em>cardinality</em> of a point $x\in\operatorname{conv}(\mathcal{V})$ be the minimum number of vertices necessary to form $x$ as a convex combination. Then Carathéodory’s theorem [C] states that every point $x\in\mathcal{C}$ has cardinality at most $n+1$, and this bound is tight. However, if we can afford an $\epsilon$-approximation with respect to some norm, can we improve this bound?</p>
<p>The approximate Carathéodory theorem states that if $p \geq 2$, then for every $x^\esx \in \mathcal{C}$ there exists $x \in \mathcal{C}$ of cardinality $\mathcal{O}(pD_p^2/\epsilon^2)$ satisfying $\norm{x-x^\esx}_p \leq \epsilon$, where $D_p$ is the diameter of $\mathcal{V}$ in $\ell_p$-norm. This result is independent of the dimension $n$ and is therefore particularly significant in high dimensional spaces. Furthermore, [MLVW] showed that this bound is tight.</p>
<p>Let $p\geq2$. A natural way to think about the approximate Carathéodory problem is to minimize $f(x)=\norm{x-x^\esx}_p$ by sequentially picking up vertices, starting from an arbitrary vertex. By doing so, we hope to converge fast enough to $x^\esx$ so as to keep the number of iterations low, hence, to pick up as few vertices as possible. This is precisely the Frank-Wolfe algorithm [FW], a.k.a. conditional gradient algorithm [CG]. At each iteration, it selects a vertex via the following linear minimization problem:</p>
\[\begin{align*}
v_t\leftarrow\arg\min_{v\in\mathcal{V}}\langle\nabla f(x_t),v\rangle
\end{align*}\]
<p>and then moves towards that vertex, i.e., in the direction $v_t-x_t$:</p>
\[\begin{align*}
x_{t+1}\leftarrow x_t+\gamma_t(v_t-x_t)
\end{align*},\]
<p>where $\gamma_t \in [0,1]$. Note that this amounts to selecting the direction formed from the current iterate $x_t$ to a vertex $v_t$ that is most aligned with the gradient descent direction $-\nabla f(x_t)$, up to a normalization factor as measured by the inner product. Thus, FW “approximates” gradient descent with sparse directions ensuring that at most $1$ new vertex is added to the convex decomposition of the iterate $x_t$. Therefore, if $T$ is the number of iterations necessary to achieve $\norm{x_T-x^\esx}_p \leq \epsilon$, then $x_T$ is an $\epsilon$-approximate solution with cardinality $T+1$.</p>
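The iteration just described can be sketched directly. This is a minimal Frank-Wolfe loop on $f(x)=\norm{x-x^\esx}_p^2$ with the standard open-loop step size (the paper's analysis also covers other step-size rules; the function name is ours):

```python
import numpy as np

def fw_caratheodory(V, x_star, p=2, eps=1e-3, max_iter=100000):
    """Frank-Wolfe on f(x) = ||x - x*||_p^2 over conv(columns of V).

    Returns the final iterate x and its convex weights over the vertices;
    the number of nonzero weights bounds the cardinality of x.
    """
    n, m = V.shape
    weights = np.zeros(m)
    weights[0] = 1.0                        # start at an arbitrary vertex
    x = V[:, 0].astype(float).copy()
    for t in range(max_iter):
        r = x - x_star
        nrm = np.linalg.norm(r, ord=p)
        if nrm <= eps:
            break
        # gradient of ||r||_p^2 is 2 ||r||_p^{2-p} sign(r) |r|^{p-1}
        grad = 2.0 * nrm ** (2 - p) * np.sign(r) * np.abs(r) ** (p - 1)
        j = int(np.argmin(grad @ V))        # linear minimization oracle
        gamma = 2.0 / (t + 2.0)             # standard open-loop step size
        x = x + gamma * (V[:, j] - x)       # move towards the vertex v_t
        weights = (1.0 - gamma) * weights   # track the convex decomposition
        weights[j] += gamma
    return x, weights
```

Since each iteration adds at most one new vertex to the decomposition, the returned weights directly certify an approximate Carathéodory representation.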
<p>We can estimate $T$ using convergence results for FW. These often require convexity and smoothness of the objective function, and sometimes also strong convexity. In the case of $f(x)=\norm{x-x^\esx}_p^2$ (note that we squared the norm to obtain the following properties), we can verify that $f$ is convex and smooth, but it is strongly convex only for $p \in \left]1,2\right]$ (with respect to the $\ell_p$-norm). However, we can replace the strong convexity requirement with a weaker one satisfied by $f$, namely the Polyak-Łojasiewicz (PL) inequality [P], [L]:</p>
\[\begin{align*}
f(x)-\min_{\mathbb{R}^n}f
\leq\frac{1}{2\mu}\|\nabla f(x)\|_*^2.
\end{align*}\]
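As a quick sanity check (ours, not from the paper), for $p=2$ we have $\nabla f(x) = 2(x-x^\esx)$ and $\min_{\mathbb{R}^n} f = 0$, so with the $\ell_2$-norm being self-dual the PL inequality holds with $\mu = 2$:

\[\begin{align*}
f(x)-\min_{\mathbb{R}^n}f
= \|x-x^\esx\|_2^2
= \frac{1}{4}\,\|2(x-x^\esx)\|_2^2
= \frac{1}{2\cdot 2}\,\|\nabla f(x)\|_*^2.
\end{align*}\]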
<p>Now, by using some existing convergence results using the PL condition for FW available in [LP], [GM], [J], [GH], we can directly deduce cardinality bounds in different scenarios. In particular, the approximate Carathéodory bound $\mathcal{O}(pD_p^2/\epsilon^2)$ is achieved and FW constitutes a very intuitive method to obtain a solution to the approximate Carathéodory problem.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Assumptions</th>
<th style="text-align: right">FW Rate</th>
<th style="text-align: right">Cardinality Bound</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">—</td>
<td style="text-align: right">$\frac{4(p-1)D_p^2}{t+2}$</td>
<td style="text-align: right">$\frac{4(p-1)D_p^2}{\epsilon^2}=\mathcal{O}\left(\frac{pD_p^2}{\epsilon^2}\right)$</td>
</tr>
<tr>
<td style="text-align: left">$\mathcal{C}$ is $S_p$-strongly convex</td>
<td style="text-align: right">$\frac{\max\{9(p-1)D_p^2,1152(p-1)^2/S_p^2\}}{(t+2)^2}$</td>
<td style="text-align: right">$\mathcal{O}\left(\frac{\sqrt{p}D_p+p/S_p}{\epsilon}\right)$</td>
</tr>
<tr>
<td style="text-align: left">$x^\esx \in \operatorname{relint}_p(\mathcal{C})$ with radius $r_p$</td>
<td style="text-align: right">$\left(1-\frac{1}{p-1}\frac{r_p^2}{D_p^2}\right)^t\epsilon_0$</td>
<td style="text-align: right">$\mathcal{O}\left(\frac{pD_p^2}{r_p^2}\ln\left(\frac{1}{\epsilon}\right)\right)$</td>
</tr>
</tbody>
</table>
<p>Let $H_n$ be a Hadamard matrix of dimension $n$ and $\mathcal{C} \doteq \operatorname{conv}(H_n/n^{1/p})$ be the convex hull of its columns normalized with respect to the $\ell_p$-norm. Suppose we want to approximate the convex decomposition of $x^\esx \doteq (H_n/n^{1/p})\mathbf{1}/n=e_1/n^{1/p}$; this is the lower bound instance from [MLVW]. Below we plot the performance of FW and two variants, Away-Step Frank-Wolfe (AFW) and Fully-Corrective Frank-Wolfe (FCFW), for the approximate Carathéodory problem with $p=7$, as well as (a minor correction to) the corresponding lower bound stated by [MLVW].</p>
<p class="minimg"><img src="http://www.pokutta.com/blog/assets/approxCara/lower7-card.png" alt="p7 norm" /></p>
<p>We see that AFW performs better than FW and that FCFW almost matches the lower bound. It remains an open question, however, to derive a precise convergence rate for FCFW in this setting rather than simply inheriting the rate of AFW via [LJ], which seems to be too loose here.</p>
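For reference, the [MLVW] lower-bound instance used above can be generated as follows (a sketch using SciPy's Sylvester-type Hadamard construction, so $n$ must be a power of two; the function name is ours):

```python
import numpy as np
from scipy.linalg import hadamard

def mlvw_instance(n, p):
    """Lower-bound instance of [MLVW]: vertices are the columns of a
    Hadamard matrix normalized in l_p-norm, and the target point is
    their average (H_n / n^{1/p}) 1 / n = e_1 / n^{1/p}."""
    H = hadamard(n).astype(float)        # n must be a power of 2
    V = H / n ** (1.0 / p)               # columns have unit l_p norm
    x_star = V @ (np.ones(n) / n)        # equals e_1 / n^{1/p}
    return V, x_star
```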
<h3 id="references">References</h3>
<p>[C] Carathéodory, C. (1907). Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1), 95-115. <a href="https://link.springer.com/content/pdf/10.1007/BF01449883.pdf">pdf</a></p>
<p>[MLVW] Mirrokni, V., Leme, R. P., Vladu, A., & Wong, S. C. W. (2017, August). Tight bounds for approximate Carathéodory and beyond. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2440-2448). <a href="https://arxiv.org/pdf/1512.08602.pdf">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[P] Polyak, B. T. (1963). Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4), 864-878. <a href="https://www.researchgate.net/profile/Boris_Polyak2/publication/243648552_Gradient_methods_for_the_minimisation_of_functionals/links/5a608e09aca272328103d55e/Gradient-methods-for-the-minimisation-of-functionals.pdf">pdf</a></p>
<p>[L] Lojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87-89.</p>
<p>[LP] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5), 1-50.</p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119. <a href="https://link.springer.com/content/pdf/10.1007/BF01589445.pdf">pdf</a></p>
<p>[J] Jaggi, M. (2013, June). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML (1) (pp. 427-435). <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">pdf</a></p>
<p>[GH] Garber, D., & Hazan, E. (2014). Faster rates for the Frank-Wolfe method over strongly-convex sets. arXiv preprint arXiv:1406.1305. <a href="http://proceedings.mlr.press/v37/garbera15-supp.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>