<h1>Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</h1> <p><em>2019-02-27</em></p> <p><em>TL;DR: Cheat Sheet for non-smooth convex optimization: subgradient descent, mirror descent, and online learning. Long and technical.</em> <!--more--></p> <p><em>Posts in this series (so far).</em></p> <ol> <li><a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a></li> <li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li> <li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li> <li><a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li> <li><a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a></li> </ol> <p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p> <p>This time we will consider non-smooth convex optimization. Our starting point is a very basic argument that is used to prove convergence of <em>Subgradient Descent (SG)</em>. From there we will consider the projected variants in the constrained setting and naturally arrive at <em>Mirror Descent (MD)</em> of [NY]; we follow the proximal point of view as presented in [BT].
We will also see that online learning algorithms such as <em>Online Gradient Descent (OGD)</em> of [Z] or <em>Online Mirror Descent (OMD)</em> and the special case of the <em>Multiplicative Weights Update (MWU)</em> algorithm arise as natural consequences.</p> <p>This time we will consider a convex function $f: \RR^n \rightarrow \RR$ and we want to solve</p> <script type="math/tex; mode=display">\min_{x \in K} f(x),</script> <p>where $K$ is some convex feasible region, e.g., $K = \RR^n$ in the unconstrained case. However, compared to previous posts, we will now consider the <em>non-smooth</em> case. As before we assume that we only have <em>first-order access</em> to the function, via a so-called <em>first-order oracle</em>, which in the non-smooth case returns subgradients:</p> <p class="mathcol"><strong>First-Order oracle for $f$</strong> <br /> <em>Input:</em> $x \in \mathbb R^n$ <br /> <em>Output:</em> $\partial f(x)$ and $f(x)$</p> <p>In the above $\partial f(x)$ denotes a subgradient of the (convex!) function $f$ at point $x$. Recall that a <em>subgradient at $x \in \operatorname{dom}(f)$</em> is any vector $\partial f(x)$ such that $f(z) \geq f(x) + \langle \partial f(x), z-x \rangle$ holds for all $z \in \operatorname{dom}(f)$. This is basically the same inequality that we obtain from convexity for smooth functions, except that in the general non-smooth case there might be more than one vector satisfying this condition. In contrast, for convex and smooth (i.e., differentiable) functions there exists only one subgradient at $x$, which is the gradient, i.e., $\partial f(x) = \nabla f(x)$ in this case.
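</p>

<p>To make the oracle concrete, here is a minimal sketch (my own toy example, not from the post) of a first-order oracle for the non-smooth function $f(x) = \norm{x}_1$, together with a numeric check of the subgradient inequality:</p>

```python
import random

def f(x):
    # f(x) = ||x||_1: convex, but non-smooth wherever a coordinate is zero
    return sum(abs(xi) for xi in x)

def subgradient(x):
    # one valid subgradient of the l1-norm: sign(x_i) coordinate-wise;
    # any value in [-1, 1] would also be valid where x_i = 0 (we pick 0)
    return [(1 if xi > 0 else -1 if xi < 0 else 0) for xi in x]

# check the subgradient inequality f(z) >= f(x) + <subgradient(x), z - x>
random.seed(0)
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(3)]
    z = [random.uniform(-1, 1) for _ in range(3)]
    g = subgradient(x)
    linearization = f(x) + sum(gi * (zi - xi) for gi, zi, xi in zip(g, z, x))
    assert f(z) >= linearization - 1e-12
```

<p>Note that the check only verifies one particular selection from the subdifferential; at non-smooth points there is a whole set of valid subgradients.</p>

<p>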
In the following we will use the notation $[n] \doteq \setb{1,\dots, n}$.</p> <h2 id="a-basic-argument">A basic argument</h2> <p>We will first consider gradient descent-like algorithms of the form</p> <p>$\tag{dirStep} x_{t+1} \leftarrow x_t - \eta_t d_t,$</p> <p>where we choose $d_t \doteq \partial f(x_t)$, and we show how to establish convergence of the above scheme to an (approximately) optimal solution of $\min_{x \in K} f(x)$ in the case $K = \RR^n$; we will choose the step length $\eta_t$ later. For completeness, the full algorithm looks like this:</p> <p class="mathcol"><strong>Subgradient Descent Algorithm.</strong> <br /> <em>Input:</em> Convex function $f$ with first-order oracle access and some initial point $x_0 \in \RR^n$ <br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$<br /></p> <p>In this section we will assume that $\norm{\cdot}$ is the $\ell_2$-norm; note, however, that later we will allow for other norms. Let $x^\esx$ be an optimal solution to $\min_{x \in K} f(x)$ and consider the following using (dirStep) and $d_t \doteq \partial f(x_t)$.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - x^\esx}^2 - 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle + \eta_t^2 \norm{\partial f(x_t)}^2. \end{align*} %]]></script> <p>This can be rearranged to</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \tag{basic} 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \eta_t^2 \norm{\partial f(x_t)}^2, \end{align*} %]]></script> <p>as we aim to later estimate $f(x_t) - f(x^\esx) \leq \langle \partial f(x_t), x_t - x^\esx\rangle$, as $\partial f(x_t)$ is a subgradient.
However, in view of setting out to provide a unified perspective on various settings, including online learning, we will do this substitution only at the very end. Adding up these equations until iteration $T-1$ we obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t = 0}^{T-1} 2\eta_t \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_0 - x^\esx}^2 - \norm{x_{T} - x^\esx}^2 + \sum_{t = 0}^{T-1} \eta_t^2 \norm{\partial f(x_t)}^2 \\ & \leq \norm{x_0 - x^\esx}^2 + \sum_{t = 0}^{T-1} \eta_t^2 \norm{\partial f(x_t)}^2. \end{align*} %]]></script> <p>Let us further assume that $\norm{\partial f(x_t)} \leq G$ for all $t = 0, \dots, T-1$ for some $G \in \RR$, and to simplify the exposition let us choose a constant step size $\eta_t \doteq \eta &gt; 0$ for all $t$, with $\eta$ to be chosen later. We obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} 2\eta \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \norm{x_0 - x^\esx}^2 + \eta^2 T G^2 \\ \Leftrightarrow \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta} + \frac{\eta}{2} T G^2, \end{align*} %]]></script> <p>where the right-hand side is minimized for</p> <script type="math/tex; mode=display">\eta \doteq \frac{\norm{x_0 - x^\esx}}{G} \sqrt{\frac{1}{T}},</script> <p>leading to</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \tag{regretBound} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}. \end{align*} %]]></script> <p>We will later see that (regretBound) can be used as a starting point to develop online learning algorithms; for now, however, we will derive our convergence guarantee from this.
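</p>

<p>As a sanity check (my own toy example), the subgradient step with the fixed step size $\eta = \norm{x_0 - x^\esx}/(G\sqrt{T})$ can be run on $f(x) = \norm{x}_1$, whose minimizer is $x^\esx = 0$; the primal gap of the averaged iterate indeed stays below $G \norm{x_0 - x^\esx}/\sqrt{T}$:</p>

```python
import math

def f(x):
    # f(x) = ||x||_1 with minimizer x* = 0
    return sum(abs(xi) for xi in x)

def subgrad(x):
    return [(1 if xi > 0 else -1 if xi < 0 else 0) for xi in x]

T = 1000
x0 = [1.0, -1.0]
G = math.sqrt(len(x0))                 # ||subgrad(x)||_2 <= sqrt(n)
R = math.sqrt(sum(v * v for v in x0))  # ||x_0 - x*||_2 with x* = 0
eta = R / (G * math.sqrt(T))           # fixed step size from the analysis

x = list(x0)
avg = [0.0] * len(x0)
for _ in range(T):
    avg = [a + xi / T for a, xi in zip(avg, x)]  # running average of x_0 .. x_{T-1}
    x = [xi - eta * gi for xi, gi in zip(x, subgrad(x))]

gap = f(avg)                  # primal gap of the averaged iterate, since f(x*) = 0
bound = G * R / math.sqrt(T)  # the O(1/sqrt(T)) guarantee
assert gap <= bound + 1e-12
```

<p>Note that the individual iterates oscillate around the optimum; only the averaged iterate carries the guarantee, in line with the remarks below.</p>

<p>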
To this end we divide both sides by $T$ and use convexity together with the subgradient property to conclude:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \tag{convergenceSG} f(\bar x) - f(x^\esx) & \leq \frac{1}{T} \sum_{t = 0}^{T-1} \left( f(x_t) - f(x^\esx) \right) \\ & \leq \frac{1}{T} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle \\ & \leq G \norm{x_0 - x^\esx} \frac{1}{\sqrt{T}}, \end{align*} %]]></script> <p>where $\bar x \doteq \frac{1}{T} \sum_{t=0}^{T-1} x_t$ is the average of all iterates. As such we obtain an $O(1/\sqrt{T})$ convergence rate for our algorithm. It is useful to observe that what the algorithm does is to minimize the average of the dual gaps at the points $x_t$, given by $\langle \partial f(x_t), x_t - x^\esx\rangle$; since the average of the dual gaps upper bounds the primal gap of the average point, convergence follows.</p> <p>This basic analysis is the standard analysis for <em>subgradient descent</em> and will serve as a starting point for what follows.</p> <p>Before we continue, the following remarks are in order:</p> <ol> <li>An important observation is that in the argument above we never used that $x^\esx$ is an optimal solution, and in fact the arguments hold for <em>any</em> point $u$; in particular, for some choices of $u$ the left-hand side of (regretBound) <em>can be negative</em> (in which case a guarantee like (convergenceSG) becomes vacuous). We will see the implications of this very soon below in the online learning section. Ultimately, subgradient descent (and also mirror descent, as we will see later) is a <em>dual method</em> in the sense that it directly minimizes the duality gap or, equivalently, maximizes the dual.
That is where the strong guarantees with respect to <em>all points</em> $u$ come from.</li> <li>Another important insight is that the argument from above does not provide a <em>descent algorithm</em>, i.e., it is <em>not guaranteed</em> that we make progress in terms of primal function value decrease in each iteration. However, what we show is that picking $\eta$ as above ensures that the average point $\bar x$ converges to an optimal solution: we make progress on average.</li> <li>In the current form as stated above, the choice of $\eta$ requires prior knowledge of the total number of iterations $T$, and the guarantee <em>only</em> applies to the point obtained from averaging over all $T$ iterations. However, this can be remedied in various ways. The poor man’s approach is to simply run the algorithm with a small $T$ and, whenever $T$ is reached, to double $T$ and restart the algorithm. This is usually referred to as the <em>doubling trick</em>; it at most doubles the number of performed iterations, but we no longer need prior knowledge of $T$, and we obtain guarantees at iterations of the form $t = 2^\ell$ for $\ell = 1,2, \dots$. The smarter way is to use a variable step size, as we will show later. This requires, however, that $\norm{x_t - x^\esx} \leq D$ holds for all iterates for some constant $D$, which might be hard to ensure in general but which can be safely assumed in the compact constrained case by choosing $D$ to be the diameter; the guarantees will depend on that parameter.</li> </ol> <h3 id="an-optimal-update">An “optimal” update</h3> <p>Similar to the descent approach using smoothness, as done in several previous posts, such as, e.g., <a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>, we might try to pick $\eta_t$ in each step to maximize progress.
Our starting point is</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{expand} \begin{align*} \norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - x^\esx}^2 - 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle + \eta_t^2 \norm{\partial f(x_t)}^2, \end{align*} %]]></script> <p>from above, and we want to choose $\eta_t$ to maximize progress in terms of $\norm{x_{t+1} - x^\esx}$ vs. $\norm{x_t - x^\esx}$, i.e., the decrease in distance to the optimal solution. Observe that the right-hand side is convex in $\eta_t$ and optimizing over $\eta_t$ leads to</p> <script type="math/tex; mode=display">\eta_t^\esx \doteq \frac{\langle \partial f(x_t), x_t - x^\esx\rangle}{\norm{\partial f(x_t)}^2}.</script> <p>This choice of $\eta_t^\esx$ looks very similar to the choice that we have seen before for, e.g., gradient descent in the smooth case (see, e.g., <a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>), with some important differences, however: we cannot compute the above step length as we do not know $x^\esx$; we ignore this for now.</p> <p>Plugging the step length back into (expand), we obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \norm{x_{t+1} - x^\esx}^2 & = \norm{x_t - x^\esx}^2 - \frac{\langle \partial f(x_t), x_t - x^\esx\rangle^2}{\norm{\partial f(x_t)}^2}. \end{align*} %]]></script> <p>This shows that the squared distance to the optimal solution decreases by $\frac{\langle \partial f(x_t), x_t - x^\esx\rangle^2}{\norm{\partial f(x_t)}^2}$ per step, i.e., the better aligned the subgradient is with the idealized direction $x_t - x^\esx$, which points towards an optimal solution $x^\esx$, the faster the progress. In particular, if the alignment is perfect, then <em>one step</em> suffices. Note, however, that this is <em>only</em> hypothetical, as the computation of the optimal step length requires knowledge of an optimal solution.
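</p>

<p>A quick numeric illustration (my own example, cheating by using the optimum $x^\esx = 0$): for $f(x) = \norm{x}_2$ the subgradient at $x \neq 0$ is perfectly aligned with the idealized direction $x_t - x^\esx$, so the hypothetical step length $\eta_t^\esx$ reaches the optimum in one step:</p>

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

# f(x) = ||x||_2 with minimizer x* = 0; for x != 0 its (sub)gradient is x / ||x||,
# perfectly aligned with the idealized direction x - x*
x = [3.0, -4.0]
g = [xi / norm(x) for xi in x]

# the "optimal" (hypothetical) step length: requires knowing x* = 0
eta_star = sum(gi * xi for gi, xi in zip(g, x)) / sum(gi * gi for gi in g)

x_next = [xi - eta_star * gi for xi, gi in zip(x, g)]
assert norm(x_next) < 1e-12  # perfect alignment: one step lands on the optimum

# the identity ||x+ - x*||^2 = ||x - x*||^2 - <g, x - x*>^2 / ||g||^2, numerically
lhs = norm(x_next) ** 2
rhs = norm(x) ** 2 - sum(gi * xi for gi, xi in zip(g, x)) ** 2 / sum(gi * gi for gi in g)
assert abs(lhs - rhs) < 1e-9
```

<p>Here $\eta_t^\esx = \norm{x}$, exactly the distance to the optimum along the (unit-length) subgradient direction.</p>

<p>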
This is simply to demonstrate that a “non-deterministic” version would only require one step. This is in contrast to e.g., gradient step progress in function value exploiting smoothness (see <a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>). In that case, only using first order information and smoothness we <em>naturally</em> obtain, e.g., a $O(1/t)$-rate for the smooth case, even for the non-deterministic idealized algorithm, where we guess as direction $x_t - x^\esx$ pointing towards the optimum. This is a subtle but important difference.</p> <p>Finally, to add slightly more to the confusion (for now) compare the rearranged (expand) which captures progress <em>in the distance</em></p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 & = 2 \eta_t \langle \partial f(x_t), x_t - x^\esx\rangle - \eta_t^2 \norm{\partial f(x_t)}^2, \end{align*} %]]></script> <p>to the smoothness induced progress <em>in function value</em> (or primal gap) for the idealized $d \doteq x_t - x^\esx$ (see, e.g., <a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a>):</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} f(x_{t}) - f(x_{t+1}) & \geq \eta_t \langle\nabla f(x_t), x_t - x^\esx \rangle - \eta_t^2 \frac{L}{2}\norm{x_t - x^\esx}^2. \end{align*} %]]></script> <p>These two progress-inducing (in-)equalities are very similar. In particular, in the smooth case, for $\eta_t$ tiny, the progress is identical up to the linear factor $2$ and lower order terms; this is for a good reason as I will discuss sometime in the future when we look at the continuous time versions.</p> <h3 id="online-learning">Online Learning</h3> <p>In the following we will discuss the connection of the above to online learning. 
In <em>online learning</em> we typically consider the following setup; I have simplified the setup slightly for exposition, and the exact requirements will become clear from the actual algorithm that we will use.</p> <p>We consider two players: the <em>adversary</em> and the <em>player</em>. We then play a game over $T$ rounds of the following form:</p> <p class="mathcol"><strong>Game.</strong> For $t = 0, \dots, T-1$ do: <br /> (1) Player chooses an action $x_t$ <br /> (2) Adversary picks a (convex) function $f_t$, reveals $\partial f_t(x_t)$ and $f_t(x_t)$ <br /> (3) Player updates/learns via $\partial f_t(x_t)$ and incurs cost $f_t(x_t)$ <br /></p> <p>The goal of the game is to minimize the so-called <em>regret</em>, which is defined as:</p> <script type="math/tex; mode=display">\tag{regret} \sum_{t = 0}^{T-1} f_t(x_t) - \min_{x} \sum_{t = 0}^{T-1} f_t(x),</script> <p>which measures how well our <em>dynamic strategy</em> $x_0, \dots, x_{T-1}$ compares to the <em>single best decision in hindsight</em>, i.e., a <em>static strategy</em> given perfect information.</p> <p>Although surprising at first, it turns out that one can show that there exists an algorithm that generates a strategy $x_0, \dots, x_{T-1}$, so that (regret) grows sublinearly, in fact typically of the order $O(\sqrt{T})$, i.e., something of the following form holds:</p> <script type="math/tex; mode=display">\sum_{t = 0}^{T-1} f_t(x_t) - \min_{x} \sum_{t = 0}^{T-1} f_t(x) \leq O(\sqrt{T}).</script> <p>What does this mean? If we divide both sides by $T$, we obtain the so-called <em>average regret</em> and the bound becomes:</p> <script type="math/tex; mode=display">\frac{1}{T} \sum_{t = 0}^{T-1} f_t(x_t) - \min_{x} \frac{1}{T} \sum_{t = 0}^{T-1} f_t(x) \leq O\left(\frac{1}{\sqrt{T}}\right),</script> <p>showing that the average mistake that we make per round, in the long run, tends to $0$ at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$.</p> <p>Now it is time to wonder what this has to do with what we have seen so far.
It turns out that our basic analysis from above already provides a bound for the most basic unconstrained case for a given time horizon $T$. To this end recall the inequality (regretBound) that we established above:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}. \end{align*} %]]></script> <p>A careful look at the argument that we used to establish the inequality (regretBound) reveals that it actually does not depend on $f$ being the same in each iteration and also that we can replace $x^\esx$ by any other feasible solution $u$ (as discussed before), so that we also proved:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t = 0}^{T-1} \langle \partial f_t(x_t), x_t - u\rangle & \leq G \norm{x_0 - u} \sqrt{T}, \end{align*} %]]></script> <p>with $G$ now being a bound on the subgradients across the rounds, i.e., $\norm{\partial f_t(x_t)} \leq G$. Now, using the fact that $\partial f_t(x_t)$ is a subgradient, we obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t = 0}^{T-1} \left (f_t(x_t) - f_t(u) \right) \leq \sum_{t = 0}^{T-1} \langle \partial f_t(x_t), x_t - u \rangle & \leq G \norm{x_0 - u} \sqrt{T}, \end{align*} %]]></script> <p>and in particular this holds for $u$ chosen as a minimizer of $\sum_{t = 0}^{T-1} f_t$:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{regretSG} \begin{align*} \sum_{t = 0}^{T-1} f_t(x_t) - \min_x \sum_{t = 0}^{T-1} f_t(x) \leq \sum_{t = 0}^{T-1} \langle \partial f_t(x_t), x_t - u \rangle & \leq G \norm{x_0 - u} \sqrt{T}, \end{align*} %]]></script> <p>which establishes sublinear regret for the actions played by the player according to:</p> <p>$x_{t+1} \leftarrow x_t - \eta_t \partial f_t(x_t),$ with the step length $\eta \doteq \frac{\norm{x_0 - u}}{G} \sqrt{\frac{1}{T}}$, which in this context is also often referred to as the <em>learning rate</em>.
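</p>

<p>A small simulation (my own setup, not from the post) of this online subgradient scheme: one-dimensional losses $f_t(x) = |x - z_t|$, so $G = 1$, and with $x_0 = 0$ and $z_t \in [0,1]$ the best static action $u$ satisfies $\norm{x_0 - u} \leq 1$; for the (assumed) learning rate $\eta = 1/\sqrt{T}$ the analysis then bounds the regret by $\sqrt{T}$:</p>

```python
import math, random

random.seed(7)
T = 2000
zs = [random.random() for _ in range(T)]  # adversary's targets, revealed one per round

# losses f_t(x) = |x - z_t| with subgradient sign(x - z_t), so G = 1
eta = 1.0 / math.sqrt(T)
x = 0.0
player_loss = 0.0
for z in zs:
    player_loss += abs(x - z)
    g = 1.0 if x > z else (-1.0 if x < z else 0.0)
    x -= eta * g

# best fixed decision in hindsight: a median of the z_t minimizes sum_t |u - z_t|
u = sorted(zs)[T // 2]
best_loss = sum(abs(u - z) for z in zs)

regret = player_loss - best_loss
# regret <= |x0 - u|^2 / (2*eta) + eta * T * G^2 / 2 <= sqrt(T) in this setup
assert regret <= math.sqrt(T) + 1e-9
```

<p>The player never sees $z_t$ before committing to $x_t$, yet the accumulated loss trails the best fixed action only sublinearly.</p>

<p>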
This setting requires knowledge of $T$ ahead of time. As discussed earlier this can be overcome, either with the doubling-trick or via the variable step length approach that we discuss further below; the cost in terms of regret is a $\sqrt{2}$-factor for the latter.</p> <p>So what is our algorithm doing when deployed in an online setting? For this it is helpful to consider the update in iteration $t$ through (basic), rearranged for convenience:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \norm{x_{t+1} - u}^2 & = \norm{x_t - u}^2 + \eta_t^2 \norm{\partial f_t(x_t)}^2 - 2 \eta_t \langle \partial f_t(x_t), x_t - u\rangle, \end{align*} %]]></script> <p>so for $x_{t+1}$ to move closer to a given $u$, it is necessary that</p> <script type="math/tex; mode=display">% <![CDATA[ \eta_t^2 \norm{\partial f_t(x_t)}^2 - 2 \eta_t \langle \partial f_t(x_t), x_t - u\rangle < 0, %]]></script> <p>or equivalently, that</p> <script type="math/tex; mode=display">% <![CDATA[ \frac{\eta_t}{2} \norm{\partial f_t(x_t)}^2 < \langle \partial f_t(x_t), x_t - u\rangle, %]]></script> <p>i.e., the <em>potential gain</em>, measured by the dual gap $\langle \partial f_t(x_t), x_t - u\rangle$ must be larger than $\frac{\eta_t}{2} \norm{\partial f_t(x_t)}^2$, where $\norm{\partial f_t(x_t)}^2$ is the maximally possible gain (the scalar product is maximized at $\partial f_t(x_t)$, e.g., by Cauchy-Schwarz). As such we require an $\frac{\eta_t}{2}$ fraction of the total possible gain to move closer to $u$.</p> <p>We will later see that other online learning variants naturally arise the same way by ‘short-cutting’ the convergence proof as we have done here. 
In particular, we will see that the famous <em>Multiplicative Weight Update</em> algorithm is basically obtained from short-cutting the Mirror Descent convergence proof for the probability simplex with the relative entropy as Bregman divergence; more on this later.</p> <h2 id="the-constrained-setting-projected-subgradient-descent">The constrained setting: projected subgradient descent</h2> <p>We will now move to the constrained setting, where we require that the iterates $x_t$ are contained in some convex set $K$, i.e., $x_t \in K$. As in the above, our starting point is the <em>poor man’s identity</em> arising from expanding the norm. To this end we write:</p> <script type="math/tex; mode=display">\norm{x_{t+1} - x^\esx}^2 = \norm{x_t - x^\esx}^2 - 2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle + \norm{x_t - x_{t+1}}^2,</script> <p>or in a more convenient form (by rearranging) as:</p> <script type="math/tex; mode=display">\tag{normExpand} 2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{x_t - x_{t+1}}^2.</script> <p>In the basic analysis of subgradient descent, we then used the specific form of the update $x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$ and then summed and telescoped out. Now, things are different. A hypothetical update $x_{t+1} \leftarrow x_t - \eta_t \partial f(x_t)$ might lead outside of $K$, i.e., $x_{t+1} \not\in K$ might happen. Observe though that (normExpand) still telescopes as before by simply adding up over the iterations; however, we have no idea how $\langle x_t - x_{t+1}, x_t - x^\esx \rangle$ relates to our function $f$ of interest, and clearly this has to depend on the actual step we take, i.e., on the properties of $x_{t+1}$.
A natural, but slightly too optimistic thing to hope for is to find a step $x_{t+1}$, such that</p> <script type="math/tex; mode=display">\tag{optimistic} \langle \eta_t \partial f(x_{t}), x_t - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_t - x^\esx \rangle,</script> <p>holds, as this actually held even with equality in the unconstrained case. However, suppose we can show the following:</p> <script type="math/tex; mode=display">\tag{lookAhead} \langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle.</script> <p>Note the subtle difference in the indices in the $x_{t+1} - x^\esx$ term. It is much easier to show (lookAhead), as we will do further below, because the point $x_{t+1}$, which we choose as a function of $\partial f(x_t)$ and $x_t$, is under our control; in comparison, $x_t$ is already chosen at time $t$. However, this is not yet good enough to telescope out the sums due to the mismatch in indices. The following observation remedies the situation by undoing the index shift and quantifying the change:</p> <p class="mathcol"><strong>Observation.</strong> If $\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle$, then <script type="math/tex; mode=display">\tag{lookAheadIneq} \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_t - x_{t+1}}^2 + \frac{1}{2}\norm{\eta_t \partial f(x_t)}^2.</script></p> <p>Before proving the observation, observe that in the unconstrained case, where we choose $x_{t+1} = x_t - \eta_t \partial f(x_t)$, the inequality in the observation reduces to (optimistic), holding even with equality, and plugging this back into our poor man’s identity exactly recovers the basic argument from the beginning of the post. This is good news, as it indicates that the observation reduces to what we know already in the unconstrained case.
As such we might want to think of the observation as relating the step $x_{t+1} - x_t$ that we take with $\partial f(x_t)$, assuming that we can choose $x_{t+1}$ to satisfy (lookAhead).</p> <p><em>Proof (of observation).</em> Our starting point is the inequality (lookAhead), whose validity we establish a little later:</p> <script type="math/tex; mode=display">\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle.</script> <p>We will simply brute-force rewrite the inequality into the desired form and collect the error terms in the process. The above inequality is equivalent to:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} & \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle + \langle \eta_t \partial f(x_{t}), x_{t+1} - x_t \rangle \\ \leq\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle + \langle x_t - x_{t+1}, x_{t+1} - x_t \rangle. \end{align*} %]]></script> <p>Rewriting, we obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} & \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \\ \leq\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle -\norm{x_{t+1} - x_t}^2 - \langle \eta_t \partial f(x_{t}), x_{t+1} - x_t \rangle \\ =\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_{t+1} - x_t}^2 - \frac{1}{2}\left(\norm{x_{t+1} - x_t}^2 + 2 \langle \eta_t \partial f(x_{t}), x_{t+1} - x_t \rangle\right) \\ \leq\ & \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_{t+1} - x_t}^2 + \frac{1}{2} \norm{ \eta_t \partial f(x_{t})}^2, \end{align*} %]]></script> <p>where the last inequality uses $\norm{a+b}^2 = \norm{a}^2 + 2\langle a, b \rangle + \norm{b}^2 \geq 0$ with $a \doteq x_{t+1} - x_t$ and $b \doteq \eta_t \partial f(x_t)$, and hence $-\left(\norm{a}^2 + 2\langle a, b \rangle\right) \leq \norm{b}^2$.</p> <p>$\qed$</p> <p>With the observation we can immediately conclude our convergence proof and the argument becomes identical to the basic case from above.
Recall that our starting point is (normExpand):</p> <script type="math/tex; mode=display">2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{x_t - x_{t+1}}^2.</script> <p>Now we can estimate the term on the left-hand side using our observation. This leads to:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} & 2 \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle + \norm{x_t - x_{t+1}}^2 - \norm{\eta_t \partial f(x_t)}^2\\ & \leq 2 \langle x_t - x_{t+1}, x_t - x^\esx \rangle \\ & = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{x_t -x_{t+1}}^2, \end{align*} %]]></script> <p>and after subtracting $\norm{x_t - x_{t+1}}^2$ and adding $\norm{\eta_t \partial f(x_t)}^2$, we obtain:</p> <script type="math/tex; mode=display">2 \langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \norm{\eta_t \partial f(x_t)}^2,</script> <p>which is exactly (basic) as above and we can conclude the argument the same way: summing up and telescoping and then optimizing $\eta_t$. In particular, the convergence rate (convergenceSG) and regret bound (regretBound) stay the same with no deterioration due to constraints or projections:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t = 0}^{T-1} \langle \partial f(x_t), x_t - x^\esx\rangle & \leq G \norm{x_0 - x^\esx} \sqrt{T}, \end{align*} %]]></script> <p>and</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} f(\bar x) - f(x^\esx) & \leq G \norm{x_0 - x^\esx} \frac{1}{\sqrt{T}} \end{align*} %]]></script> <p>So the key is really establishing (lookAhead) as it immediately implies all we need to establish convergence in the constrained case. 
This is what we will do now, which will also, finally, specify our choice of $x_{t+1}$.</p> <h3 id="using-optimization-to-prove-what-you-want">Using optimization to prove what you want</h3> <p>Have you ever wondered why people add these weird 2-norms to their optimization problem to “regularize” the problem, i.e., they solve problems of the form $\min_{x} f(x) + \lambda \norm{x - z}^2$? Then this section might provide some insight into this. We will see that it is actually not about the “problem” that is solved, but about what an optimal solution might guarantee; bear with me.</p> <p>So what we want to establish is inequality (lookAhead), i.e.,</p> <script type="math/tex; mode=display">\langle \eta_t \partial f(x_{t}), x_{t+1} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - x^\esx \rangle,</script> <p>or, stated slightly more generally, as our proof will work for <em>all</em> $u \in K$ (and in particular the choice $u = x^\esx$):</p> <script type="math/tex; mode=display">\langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle \leq \langle x_t - x_{t+1}, x_{t+1} - u \rangle.</script> <p>Rearranging the above we obtain:</p> <script type="math/tex; mode=display">\tag{optCon} \langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle - \langle x_t - x_{t+1}, x_{t+1} - u \rangle \leq 0.</script> <p>What we will do now is to interpret the above as an <em>optimality condition</em> of some smooth convex optimization problem of the form $\min_{x \in K} g(x)$, where $g(x)$ is some smooth and convex function. Recall from the previous posts, e.g., <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a>, that the first-order optimality condition states that for all $u \in K$ it holds:</p> <script type="math/tex; mode=display">\langle \nabla g(x), x - u \rangle \leq 0,</script> <p>provided that $x \in K$ is an optimal solution, as otherwise we would be able to make progress via, e.g., a gradient step or a Frank-Wolfe step.
By simply reverse engineering (aka remembering how we differentiate), we guess</p> <script type="math/tex; mode=display">\tag{proj} g(x) \doteq \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2,</script> <p>so that its optimality condition produces (optCon). We now simply choose</p> <script type="math/tex; mode=display">\tag{constrainedStep}x_{t+1} \doteq \arg\min_{x \in K} \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2,</script> <p>and (just to be sure) we inspect the optimality condition, which states:</p> <script type="math/tex; mode=display">\begin{align*} \langle \eta_t \partial f(x_{t}), x_{t+1} - u \rangle - \langle x_t - x_{t+1}, x_{t+1} - u \rangle = \langle \nabla g(x_{t+1}), x_{t+1} - u \rangle \leq 0, \end{align*}</script> <p>which is exactly (lookAhead). This step then ensures convergence with a rate (maybe surprisingly) identical to the unconstrained case. The resulting algorithm is often referred to as <em>projected subgradient descent</em>, and the problem whose optimal solution defines $x_{t+1}$ is the projection problem. We provide the <em>projected subgradient descent</em> algorithm below:</p> <p class="mathcol"><strong>Projected Subgradient Descent Algorithm.</strong> <br /> <em>Input:</em> Convex function $f$ with first-order oracle access and some initial point $x_0 \in K$<br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2$<br /></p> <h3 id="variable-step-length">Variable step length</h3> <p>We will now briefly explain how to replace the constant step length from before, which requires a priori knowledge of $T$, by a variable step length, so that the convergence guarantee holds for any iterate $x_t$. To this end let $D \geq 0$ be a constant so that $\max_{x,y \in K} \norm{x-y} \leq D$.
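</p>

<p>As an aside: completing the square shows that the step (constrainedStep) is exactly the Euclidean projection of the plain subgradient step $x_t - \eta_t \partial f(x_t)$ onto $K$; a quick numeric check (my own toy example, with $K$ the Euclidean unit ball):</p>

```python
import math, random

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def proj_ball(y):
    # Euclidean projection onto the unit ball K = {x : ||x||_2 <= 1}
    n = norm(y)
    return [yi / n for yi in y] if n > 1 else list(y)

def obj(x, xt, eta_g):
    # the objective <eta*g, x> + (1/2) ||x - x_t||^2 from (constrainedStep)
    return sum(gi * xi for gi, xi in zip(eta_g, x)) + 0.5 * sum((a - b) ** 2 for a, b in zip(x, xt))

random.seed(1)
xt = [0.6, -0.3]
eta_g = [1.5, 0.8]  # eta * subgradient, chosen so the plain step leaves K

# completing the square: the argmin equals the projection of the plain step x_t - eta*g
x_next = proj_ball([a - b for a, b in zip(xt, eta_g)])
assert norm(x_next) <= 1 + 1e-12

# sanity check against random feasible points: none beats the projected point
for _ in range(2000):
    ang = random.uniform(0, 2 * math.pi)
    r = math.sqrt(random.random())
    cand = [r * math.cos(ang), r * math.sin(ang)]
    assert obj(x_next, xt, eta_g) <= obj(cand, xt, eta_g) + 1e-9
```

<p>For more complicated feasible regions (e.g., the simplex) the same argmin problem is solved by the corresponding projection operator.</p>

<p>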
We now choose $\eta_t \doteq \tau \sqrt{\frac{1}{t+1}}$, where we will specify the constant $\tau \geq 0$ soon.</p> <p class="mathcol"><strong>Observation.</strong> For $\eta_t$ as above it holds: $\sum_{t = 0}^{T-1} \eta_t \leq \tau\left(2 \sqrt{T} - 1\right).$</p> <p><em>Proof.</em> There are various ways of showing the above. We follow the argument in [Z]. We have: <script type="math/tex">% <![CDATA[ \begin{align*} \sum_{t = 0}^{T-1} \eta_t & = \tau \sum_{t = 0}^{T-1} \frac{1}{\sqrt{t+1}} \\ & \leq \tau \left(1 + \int_{0}^{T-1}\frac{dt}{\sqrt{t+1}}\right) \\ & \leq \tau \left(1 + \left[2 \sqrt{t+1}\right]_0^{T-1} \right) = \tau (2\sqrt{T}-1) \qed \end{align*} %]]></script></p> <p>Now we restart from inequality (basic) from earlier:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} 2\eta_t \langle \partial f(x_t), x_t - x^\esx\rangle & = \norm{x_t - x^\esx}^2 - \norm{x_{t+1} - x^\esx}^2 + \eta_t^2 \norm{\partial f(x_t)}^2, \end{align*} %]]></script> <p>however before we sum up and telescope we first divide by $2\eta_t$, i.e.,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t=0}^{T-1}\langle \partial f(x_t), x_t - x^\esx\rangle & = \sum_{t=0}^{T-1} \left(\frac{\norm{x_t - x^\esx}^2}{2\eta_t} - \frac{\norm{x_{t+1} - x^\esx}^2}{2\eta_t} + \frac{\eta_t}{2} \norm{\partial f(x_t)}^2\right) \\ & \leq \frac{\norm{x_0 - x^\esx}^2}{2\eta_{0}} - \frac{\norm{x_{T} - x^\esx}^2}{2\eta_{T-1}} \\ & \qquad + \frac{1}{2} \sum_{t=1}^{T-1} \left(\frac{1}{\eta_{t}} - \frac{1}{\eta_{t-1}} \right) \norm{x_t - x^\esx}^2 + \sum_{t=0}^{T-1}\frac{\eta_t}{2} \norm{\partial f(x_t)}^2 \\ & \leq D^2 \left(\frac{1}{2\eta_0} + \frac{1}{2} \sum_{t=1}^{T-1} \left(\frac{1}{\eta_{t}} - \frac{1}{\eta_{t-1}} \right) \right) + \sum_{t=0}^{T-1}\frac{\eta_t}{2} G^2 \\ & \leq \frac{D^2 }{2\eta_{T-1}} + \sum_{t=0}^{T-1}\frac{\eta_t}{2} G^2 \\ & \leq \frac{1}{2}\left(\frac{D^2 }{\tau} \sqrt{T} + 2 G^2 \tau \sqrt{T} \right) = DG\sqrt{2T}, \end{align*} 
%]]></script> <p>where we applied the observation from above in the last but one inequality, plugged in the definition of $\eta_t$, and used the choice $\tau \doteq \frac{D}{G\sqrt{2}}$ in the last equation, which minimizes the term in the brackets in the last inequality. In summary, we have shown:</p> <script type="math/tex; mode=display">\tag{regretBoundAnytime} \sum_{t=0}^{T-1}\langle \partial f(x_t), x_t - x^\esx\rangle \leq DG\sqrt{2T}.</script> <p>From (regretBoundAnytime) we can now derive convergence rates as usual: dividing by $T$, averaging the iterates, and using convexity.</p> <p>It is useful to compare (regretBoundAnytime) to the case with fixed step length, which is given in (regretBound): using a variable step length costs us a factor of $\sqrt{2}$; however, the above bound in (regretBoundAnytime) now holds for <em>all</em> $T$ and a priori knowledge of $T$ is not required. Such regret bounds are sometimes referred to as <em>anytime regret bounds</em>.</p> <h3 id="online-sub-gradient-descent">Online (Sub-)Gradient Descent</h3> <p>Starting from (regretBoundAnytime), we can also follow the same path as in the online learning section from above. 
This recovers the Online (Sub-)Gradient Descent algorithm of [Z]: Consider the online learning setting from before and choose</p> <script type="math/tex; mode=display">x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f_t(x_{t}), x \rangle + \frac{1}{2}\norm{x-x_t}^2.</script> <p>Then, we obtain the regret bound</p> <script type="math/tex; mode=display">\tag{regretOGDanytime} \sum_{t=0}^{T-1} f_t(x_t) - \min_{x \in K} \sum_{t=0}^{T-1} f_t(x) \leq \sum_{t=0}^{T-1}\langle \partial f_t(x_t), x_t - x^\esx\rangle \leq DG\sqrt{2T},</script> <p>in the anytime setting and</p> <script type="math/tex; mode=display">\tag{regretOGD} \sum_{t=0}^{T-1} f_t(x_t) - \min_{x \in K} \sum_{t=0}^{T-1} f_t(x) \leq \sum_{t=0}^{T-1}\langle \partial f_t(x_t), x_t - x^\esx\rangle \leq DG\sqrt{T},</script> <p>when $T$ is known ahead of time, where $x^\esx \in \arg\min_{x \in K} \sum_{t=0}^{T-1} f_t(x)$ and where $D$ and $G$ are bounds on the diameter of the feasible domain and the norm of the (sub-)gradients, respectively, as before.</p> <h2 id="mirror-descent">Mirror Descent</h2> <p>We will now derive Nemirovski’s Mirror Descent algorithm (see e.g., [NY]) and we will be following somewhat the proximal perspective as outlined in [BT]. Simplifying and running the risk of attracting the wrath of the optimization titans, <em>Mirror Descent</em> arises from subgradient descent by replacing the $\ell_2$-norm with a “generalized distance” that satisfies the inequalities that we needed in the basic argument from above.</p> <p>Why would you want to do that? Adjusting the distance function will allow us to fine-tune the iterates and the resulting dimension-dependent term for the geometry under consideration.</p> <p>In the following, as we move away from the $\ell_2$-norm, which is self-dual, we will need the definition of the <em>dual norm</em> defined as <script type="math/tex">\norm{w}_\esx \doteq \max\setb{\langle w , x \rangle : \norm{x} = 1}</script>. Note that for the $\ell_p$-norm the $\ell_q$-norm is dual if $\frac{1}{p} + \frac{1}{q} = 1$. 
For the $\ell_1$-norm the dual norm is $\ell_\infty$. We will also need the <em>(generalized) Cauchy-Schwarz inequality</em> or <em>Hölder inequality</em>: $\langle y , x \rangle \leq \norm{y}_\esx \norm{x}$. A very useful consequence of this inequality is:</p> <script type="math/tex; mode=display">\tag{genBinomial} \norm{a}^2 - 2 \langle a , b \rangle + \norm{b}^2_\esx \geq 0,</script> <p>which follows from</p> <script type="math/tex; mode=display">\begin{align*} \norm{a}^2 - 2 \langle a , b \rangle + \norm{b}^2_\esx \geq \norm{a}^2 - 2 \norm{a} \norm{b}_\esx + \norm{b}^2_\esx = (\norm{a} - \norm{b}_\esx)^2 \geq 0. \end{align*}</script> <h3 id="generalized-distance-aka-bregman-divergence">“Generalized Distance” aka Bregman divergence</h3> <p>We will first introduce the generalization of norms that we will be working with. To this end, let us first collect the desired properties that we needed in the proof of the basic argument; I will already suggestively use the final notation to avoid notational overload. Let our desired function be called $V_x(y)$ and let us further assume in a first step the choice $V_x(y) = \frac{1}{2} \norm{x-y}^2$; note the factor $\frac{1}{2}$ is only used to make the proofs cleaner.</p> <p>In the very first step we used the expansion of the $\ell_2$-norm. As $x^\esx$ plays no special role, we write everything with respect to any feasible $u$:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \norm{x_{t+1} - u}^2 & = \norm{x_t - u}^2 - 2 \langle x_{t} - x_{t+1}, x_t - u\rangle + \norm{x_{t+1} - x_t}^2. 
\end{align*} %]]></script> <p>Rescaling and substituting $V_x(y) = \frac{1}{2} \norm{x-y}^2$, and observing that $\nabla V_x(y) = y - x$, we obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{req-1} \begin{align*} V_{x_{t+1}}(u) & = V_{x_{t}}(u) - \langle \nabla V_{x_{t+1}}(x_{t}), x_t - u\rangle + V_{x_{t+1}}(x_{t}), \end{align*} %]]></script> <p>where the last term $V_{x_{t+1}}(x_{t})$ involves an arbitrary choice: by symmetry of the $\ell_2$-norm, $V_{x_{t}}(x_{t+1})$ would have been equally feasible.</p> <p>We will need another inequality if we aim to mimic the same proof in the constrained case. Recall that we needed (lookAheadIneq)</p> <p>$\langle \eta_t \partial f(x_{t}), x_{t} - x^\esx \rangle \leq \langle x_t - x_{t+1}, x_{t} - x^\esx \rangle - \frac{1}{2}\norm{x_t - x_{t+1}}^2+\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2,$</p> <p>to relate the step $x_{t+1} - x_t$ that we take with $\partial f(x_t)$, assuming that $x_{t+1}$ was chosen appropriately. In the proof the term $\frac{1}{2}\norm{x_t - x_{t+1}}^2$ simply arose from the mechanics of the (standard) scalar product, which is inherently linked to the $\ell_2$-norm. Slightly jumping ahead (the later proof will make this requirement natural), we additionally require</p> <script type="math/tex; mode=display">\tag{req-2} \begin{align*} V_x(y) \geq \frac{1}{2}\norm{x-y}^2. \end{align*}</script> <p>Moreover, the term $\frac{1}{2}\norm{\eta_t \partial f(x_t)}^2$ in (lookAheadIneq) is actually using the <em>dual norm</em>, which we did not have to pay attention to as the $\ell_2$-norm is self-dual. We will redo the full argument in the next sections with the correct distinctions for completeness. First, however, we will complete the definition of $V_x(y)$.</p> <p>There is a natural class of functions that satisfy (req-1) and (req-2), so-called <em>Bregman divergences</em>, which are defined through <em>Distance Generating Functions (DGFs)</em>. 
Let us choose some norm $\norm{\cdot}$, which is not necessarily the $\ell_2$-norm.</p> <p class="mathcol"><strong>Definition. (DGF and Bregman Divergence)</strong> Let $K \subseteq \RR^n$ be a closed convex set. Then $\phi: K \rightarrow \RR$ is called a <em>Distance Generating Function (DGF)</em> if $\phi$ is $1$-strongly convex with respect to $\norm{\cdot}$, i.e., for all $x \in K \setminus \partial K, y \in K$ we have $\phi(y) \geq \phi(x) + \langle \nabla \phi(x), y-x \rangle + \frac{1}{2}\norm{x-y}^2$. The <em>Bregman divergence (induced by $\phi$)</em> is defined as $V_x(y) \doteq \phi(y) - \langle \nabla \phi(x), y - x \rangle - \phi(x),$ $x \in K \setminus \partial K, y \in K$.</p> <p>Observe that the strong convexity requirement of the DGF is with respect to the chosen norm. This is important as it allows us to “fine-tune” our geometry. Before we establish some basic properties of Bregman divergences, here are two common examples:</p> <p class="mathcol"><strong>Examples. (Bregman Divergences)</strong> <br /> (a) Let $\norm{x} \doteq \norm{x}_2$ be the $\ell_2$-norm and $\phi(x) \doteq \frac{1}{2} \norm{x}^2$. Clearly, $\phi(x)$ is $1$-strongly convex with respect to $\norm{\cdot}$ (for any $K$). The resulting Bregman divergence is $V_x(y) = \frac{1}{2}\norm{x-y}^2$, which is the choice used for (projected) subgradient descent above. <br /> (b) Let <script type="math/tex">\norm{x} \doteq \norm{x}_1</script> be the $\ell_1$-norm and <script type="math/tex">\phi(x) \doteq \sum_{i \in [n]} x_i \log x_i</script> be the (negative) entropy. Then $\phi(x)$ is $1$-strongly convex with respect to <script type="math/tex">\norm{\cdot}_1</script> for all <script type="math/tex">K \subseteq \Delta_n \doteq \setb{x \geq 0 \mid \sum_{i \in [n]}x_i = 1}</script>, the <em>probability simplex</em>. 
The resulting Bregman divergence is <script type="math/tex">V_x(y) = \sum_{i \in [n]} y_i \log \frac{y_i}{x_i} = D(y \| x)</script>, which is the <em>Kullback-Leibler divergence</em> or <em>relative entropy</em>.</p> <p>We will now establish some basic properties for $V_x(y)$ and show that $V_x(y)$ satisfies the required properties:</p> <p class="mathcol"><strong>Lemma. (Properties of the Bregman Divergence)</strong> Let $V_x(y)$ be a Bregman divergence defined via some DGF $\phi$. Then the following holds: <br /> (a) Point-separating: $V_x(x) = 0$ (and also $V_x(y) = 0 \Leftrightarrow x = y$ via (b))<br /> (b) Compatible with norm: $V_x(y) \geq \frac{1}{2} \norm{x-y}^2 \geq 0$<br /> (c) $\Delta$-Inequality: $\langle - \nabla V_x(y), y - u \rangle = V_x(u) - V_y(u) - V_x(y)$</p> <p><em>Proof.</em> Property (a) follows directly from the definition and (b) follows from $\phi$ in the definition of $V_x(y)$ being $1$-strongly convex. Property (c) follows from straightforward expansion and computation:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \langle - \nabla V_x(y), y - u \rangle & = \langle\nabla\phi(x) -\nabla \phi(y) , y-u \rangle \\ & = (\phi(u) - \phi(x) - \langle \nabla \phi(x), u -x \rangle) \\ & - (\phi(u) - \phi(y) - \langle \nabla \phi(y), u - y \rangle) \\ & - (\phi(y) - \phi(x) - \langle \nabla \phi(x), y - x \rangle) \\ & = V_x(u) - V_y(u) - V_x(y). \end{align*} %]]></script> <script type="math/tex; mode=display">\qed</script> <h3 id="back-to-basics">Back to basics</h3> <p>In a first step we will redo our basic argument from the beginning of the post with a Bregman divergence instead of the expansion of the $\ell_2$-norm. To this end let $K \subseteq \RR^n$ (possibly $K = \RR^n$) be a closed convex set. We consider a generic algorithm that produces iterates $x_1, \dots, x_t, \dots$. We will define the choice of the iterates later. 
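These properties are easy to sanity-check numerically for the relative-entropy example from above. A small sketch, where the random points and tolerances are arbitrary illustration choices, and where the $\Delta$-inequality check uses $\nabla_y V_x(y) = \nabla \phi(y) - \nabla \phi(x) = \log y - \log x$:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(x, y):
    """Bregman divergence of the negative entropy: V_x(y) = D(y || x)."""
    return float(np.sum(y * np.log(y / x)))

# three random points in the interior of the probability simplex
x, y, u = (v / v.sum() for v in rng.random((3, 6)) + 0.1)

# (a) point-separating
prop_a = kl(x, x)                                        # should be 0
# (b) compatibility with the l1-norm (Pinsker's inequality)
prop_b = kl(x, y) - 0.5 * np.linalg.norm(x - y, 1) ** 2  # should be >= 0
# (c) Delta-inequality, with grad_y V_x(y) = log(y) - log(x)
lhs = float(-(np.log(y) - np.log(x)) @ (y - u))
rhs = kl(x, u) - kl(y, u) - kl(x, y)                     # lhs should equal rhs
```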
Our starting point is the $\Delta$-inequality of the Bregman divergence with the choices $y \leftarrow x_{t+1}$, $x \leftarrow x_t$, and $u \in K$ arbitrary:</p> <script type="math/tex; mode=display">\tag{basicBreg} \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_t}(x_{t+1}).</script> <p>We could now try the same strategy, summing up and telescoping out:</p> <script type="math/tex; mode=display">\sum_{t = 0}^{T-1} \langle - \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle = V_{x_0}(u) - V_{x_{T}}(u) - \sum_{t = 0}^{T-1} V_{x_t}(x_{t+1}).</script> <p>But how to continue? First observe that in contrast to the telescoping of the $\ell_2$-norm expansion we have a <em>negative</em> term on the right-hand side (this is technical and could have been done the same way for the $\ell_2$-norm) and the left-hand side, as of now, has no relation to the function $f$; clearly, we have not even defined our step yet. So let us try the obvious first-order guess, i.e., replacing the $\ell_2$-norm with the Bregman divergence.</p> <p>To this end, we define <script type="math/tex">\tag{IteratesMD} x_{t+1} \doteq \arg\min_{x \in K} \langle \eta_t \partial f(x_t), x \rangle + V_{x_t}(x).</script></p> <p>Mimicking the approach we took for projected gradient descent, let us inspect the optimality condition of this problem. 
For all $u \in K$ it holds:</p> <script type="math/tex; mode=display">\tag{optConBreg} \langle \eta_t \partial f(x_t),x_{t+1} - u \rangle + \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle \leq 0</script> <p>or equivalently we obtain:</p> <script type="math/tex; mode=display">\tag{lookaheadMD} \langle \eta_t \partial f(x_t),x_{t+1} - u \rangle \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle</script> <p>as before, we now have to fix the index mismatch on the left:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \langle \eta_t \partial f(x_t),x_{t+1} - u \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle \\ \Leftrightarrow \langle \eta_t \partial f(x_t),x_{t} - u \rangle + \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle \end{align*} %]]></script> <p>This we can then plug back into (basicBreg) to obtain the key inequality for Mirror Descent:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{basicMD} \begin{align*} \langle \eta_t \partial f(x_t),x_{t} - u \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - u \rangle - \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle \\ & = V_{x_t}(u) - V_{x_{t+1}}(u) - V_{x_t}(x_{t+1}) - \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle \\ & \leq V_{x_t}(u) - V_{x_{t+1}}(u) - \frac{1}{2} \norm{x_t - x_{t+1}}^2 - \langle \eta_t \partial f(x_t),x_{t+1} - x_t \rangle \\ & \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \left (\langle \eta_t \partial f(x_t),x_t - x_{t+1} \rangle- \frac{1}{2} \norm{x_t - x_{t+1}}^2 \right) \\ & \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2, \end{align*} %]]></script> <p>where the last inequality follows via (genBinomial). 
We can now simply sum up and telescope out to obtain the generic regret bound for Mirror Descent:</p> <script type="math/tex; mode=display">\tag{regretBoundMD} \begin{align*} \sum_{t=0}^{T-1} \langle \eta_t \partial f(x_t),x_{t} - u \rangle \leq V_{x_0}(u) + \sum_{t=0}^{T-1} \frac{\eta_t^2}{2} \norm{\partial f(x_t)}_\esx^2, \end{align*}</script> <p>and further we can again use convexity, averaging of the iterates, and picking $\eta_t = \eta \doteq \sqrt{\frac{2M}{G^2T}}$ (by optimizing out) to arrive at the convergence rate of Mirror Descent:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{convergenceMD} \begin{align*} f(\bar x) - f(x^\esx) & \leq \frac{1}{T} \sum_{t=0}^{T-1} \left (f(x_t) - f(x^\esx) \right) \\ & \leq \frac{1}{T} \sum_{t=0}^{T-1} \langle \partial f(x_t),x_{t} - x^\esx \rangle \leq \frac{M}{\eta T} + \frac{\eta G^2}{2} \\ & \leq \sqrt{\frac{2M G^2}{T}}, \end{align*} %]]></script> <p>where $\norm{\partial f(x_t)}_\esx \leq G$ and <script type="math/tex">V_{x_0}(u) \leq M</script> for all $u \in K$.</p> <p>For completeness, the Mirror Descent algorithm is specified below:</p> <p class="mathcol"><strong>Mirror Descent Algorithm.</strong> <br /> <em>Input:</em> Convex function $f$ with first-order oracle access and some initial point $x_0 \in K$<br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad x_{t+1} \leftarrow \arg\min_{x \in K} \langle \eta_t \partial f(x_t), x \rangle + V_{x_t}(x)$</p> <h3 id="online-mirror-descent-and-multiplicative-weights">Online Mirror Descent and Multiplicative Weights</h3> <p>Alternatively, starting from (regretBoundMD) we can yet again observe that one could use a different function $f_t$ in each iteration $t$, which leads us to <em>Online Mirror Descent</em> as we will briefly discuss in this section. 
From (regretBoundMD) we have with $\eta_t = \eta$ chosen below:</p> <script type="math/tex; mode=display">\begin{align*} \sum_{t=0}^{T-1} \langle \eta \partial f_t(x_t),x_{t} - u \rangle \leq V_{x_0}(u) + \frac{\eta^2}{2} \sum_{t=0}^{T-1} \norm{\partial f_t(x_t)}_\esx^2. \end{align*}</script> <p>Rearranging with $\norm{\partial f_t(x_t)}_\esx \leq G$ and <script type="math/tex">V_{x_0}(u) \leq M</script> for all $u \in K$, gives</p> <script type="math/tex; mode=display">\begin{align*} \sum_{t=0}^{T-1} \langle \partial f_t(x_t),x_{t} - u \rangle \leq \frac{M}{\eta} + \frac{\eta T}{2} G^2. \end{align*}</script> <p>With the (optimal) choice $\eta = \sqrt{\frac{2M}{G^2T}}$ and using the subgradient property, we obtain the online learning regret bound for Mirror Descent:</p> <script type="math/tex; mode=display">\tag{regretMD} \begin{align*} \sum_{t = 0}^{T-1} f_t(x_t) - \min_x \sum_{t = 0}^{T-1} f_t(x) \leq \max_{x \in K} \sum_{t=0}^{T-1} \langle \partial f_t(x_t),x_{t} - x \rangle \leq \sqrt{2M G^2T}, \end{align*}</script> <p>and, paying another factor $\sqrt{2}$, we can make this bound <em>anytime</em>.</p> <p>We will now consider the important special case of $K = \Delta_n$ being the probability simplex and <script type="math/tex">V_x(y) = \sum_{i \in [n]} y_i \log \frac{y_i}{x_i} = D(y \| x)</script> being the <em>relative entropy</em>, which will lead to (an alternative proof of) the <em>Multiplicative Weight Update (MWU)</em> algorithm; this argument is folklore and widely known among experts, see e.g., [BT, AO]. In particular, it generalizes immediately to the matrix case in contrast to other proofs of the MWU algorithm. 
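To make this concrete, here is a minimal sketch of the entropic mirror step in a toy online setting with linear losses; the loss vectors, dimension, and horizon are made-up illustration choices:

```python
import numpy as np

def mwu_step(x, g, eta):
    """Entropic mirror step over the simplex: argmin_x <eta g, x> + D(x || x_t)
    has the closed-form multiplicative update below."""
    w = x * np.exp(-eta * g)
    return w / w.sum()

# toy online setting with linear losses f_t(x) = <g_t, x> over the simplex
rng = np.random.default_rng(2)
n, T, G = 10, 200, 1.0
eta = np.sqrt(2 * np.log(n) / (G ** 2 * T))  # the fixed step from above (M = log n)
x = np.full(n, 1.0 / n)                      # uniform start
losses, cumulative = 0.0, np.zeros(n)
for g in rng.uniform(0.0, G, size=(T, n)):
    losses += float(g @ x)
    cumulative += g
    x = mwu_step(x, g, eta)
regret = losses - cumulative.min()           # should obey the bound (regretMD)
```

With these choices the realized regret stays below the theoretical bound $\sqrt{2 \log(n) G^2 T}$, typically with a large margin for such random losses.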
We refer the interested reader to [AHK] for an overview of the many applications of the MWU algorithm or equivalently Mirror Descent over the probability simplex with relative entropy as Bregman divergence.</p> <p>Via information-theoretic inequalities or just by-hand calculations (see [BT]), it can be easily seen that the negative entropy is $1$-strongly convex with respect to <script type="math/tex">\norm{.}_1</script> over $\Delta_n$ (equivalently, Pinsker’s inequality <script type="math/tex">D(y\| x) \geq \frac{1}{2}\norm{x-y}_1^2</script> holds); the dual norm of <script type="math/tex">\norm{.}_1</script> is <script type="math/tex">\norm{.}_\infty</script>. Moreover, <script type="math/tex">D(x \| x_0) \leq \log n</script> for all $x \in \Delta_n$, with $x_0 = (1/n, \dots, 1/n)$ being the uniform distribution.</p> <p>Recall that the iterates are defined via (IteratesMD), which in our case becomes</p> <script type="math/tex; mode=display">\tag{IteratesMWU} x_{t+1} \doteq \arg\min_{x \in K} \langle \eta_t \partial f_t(x_t), x \rangle + D(x \| x_t),</script> <p>and making this explicit amounts to updates of the form (to be read coordinate-wise):</p> <script type="math/tex; mode=display">\tag{IteratesMWUExp} x_{t+1} \leftarrow x_t \cdot \frac{e^{-\eta_t \partial f_t(x_t)}}{K_t},</script> <p>where $K_t$ is chosen such that $\norm{x_{t+1}}_1 = 1$, i.e., $K_t = \norm{x_t \cdot e^{-\eta_t \partial f_t(x_t)}}_1$, which is precisely the Multiplicative Weight Update algorithm.</p> <p>With the bounds $M = \log n$ and $\norm{\partial f_t(x_t)}_\infty \leq G$ for all $t = 0, \dots, T-1$, the regret bound in this case becomes:</p> <script type="math/tex; mode=display">\tag{regretMWU} \begin{align*} \sum_{t = 0}^{T-1} f_t(x_t) - \min_x \sum_{t = 0}^{T-1} f_t(x) \leq \max_{x \in \Delta_n} \sum_{t=0}^{T-1} \langle \partial f_t(x_t),x_{t} - x \rangle \leq \sqrt{2 \log(n) G^2 T}, \end{align*}</script> <p>for the variant with known $T$ and we can pay another factor $\sqrt{2}$ to make this bound an anytime guarantee.</p> <h3 id="mirror-descent-vs-gradient-descent">Mirror Descent vs. 
Gradient Descent</h3> <p>One of the key questions of course is whether the improvement in convergence rate through fine-tuning against the geometry materializes in actual computations or whether it is just an improvement on paper. Following [AO], some comments are helpful: if $V_x(y) \doteq \frac{1}{2}\norm{x-y}^2$, then Mirror Descent and (Sub-)Gradient Descent produce identical iterates. If on the other hand we, e.g., pick $K = \Delta_n$ the probability simplex in $\RR^n$, then we can pick <script type="math/tex">V_x(y) \doteq D(y\|x)</script> to be the relative entropy, and the iterates from Gradient Descent with the $\ell_2$-norm and Mirror Descent with relative entropy will be very different; and so will be the convergence behavior. Mirror Descent provides a guarantee of</p> <script type="math/tex; mode=display">\begin{align*} f(\bar x) - f(x^\esx) \leq \frac{\sqrt{2 \log(n) G^2}}{\sqrt{T}}, \end{align*}</script> <p>where $\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t$ and $\norm{\partial f(x_t)}_\infty \leq G$ for all $t = 0, \dots, T-1$ in this case.</p> <p>Below we compare Mirror Descent and Gradient Descent over $K = \Delta_n$ with $n = 10000$ (left) and across different values of $n$ (right) for some randomly generated functions; both plots are log-log plots. As can be seen, Mirror Descent can scale much better than Gradient Descent by choosing a Bregman divergence that is optimized for the geometry. For $K = \Delta_n$, the dependence on $n$ for Mirror Descent (with relative entropy and $\ell_1$-norm) is only logarithmic, whereas for Gradient Descent it is linear in the dimension $n$. This logarithmic dependency makes Mirror Descent well suited for large-scale applications in this case.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/md/MD-arranged.png" alt="MD vs. GD" /></p> <h2 id="extensions">Extensions</h2> <p>Finally I will talk about some natural extensions. 
While the full arguments will be beyond the scope, the interested reader might consult [Z2] for proofs. Also, there are various natural extensions in the online learning case, e.g., where we compare to slowly changing strategies; see [Z] for details.</p> <h3 id="stochastic-versions">Stochastic versions</h3> <p>It is relatively easy to see that the above bounds can be transferred to the stochastic setting, where we have an unbiased gradient estimator only. We then obtain basically the same convergence rates and regret bounds <em>in expectation</em>. With the usual Markov trick etc., we can also obtain high probability bounds, say with probability $1-\delta$, paying a $\log \frac{1}{\delta}$ factor in convergence bound and regret.</p> <h3 id="smooth-case">Smooth case</h3> <p>When $f$ is smooth we can modify (basicMD) to obtain the improved $O(1/t)$ rate. Recall that $f$ is $L$-smooth with respect to $\norm{\cdot}$ if:</p> <script type="math/tex; mode=display">\tag{smooth} f(y) - f(x) \leq \langle \nabla f(x), y-x \rangle + \frac{L}{2} \norm{x-y}^2,</script> <p>for all $x,y \in \mathbb R^n$. Choosing $x \leftarrow x_t$ and $y \leftarrow x_{t+1}$ we obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \langle \nabla f(x_t), x_{t+1} - x^\esx \rangle & = \langle \nabla f(x_t), x_{t} - x^\esx \rangle + \langle \nabla f(x_t), x_{t+1} - x_t \rangle \\ & \geq f(x_t) - f(x^\esx) + f(x_{t+1}) - f(x_t) - \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\ & = f(x_{t+1}) - f(x^\esx) - \frac{L}{2} \norm{x_{t+1}-x_t}^2 \end{align*} %]]></script> <p>We now modify (basicMD) with $u = x^\esx$ as follows:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{basicMDSmooth} \begin{align*} \langle \eta_t \nabla f(x_t),x_{t+1} - x^\esx \rangle & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - x^\esx \rangle \\ & = V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - V_{x_t}(x_{t+1}). 
\end{align*} %]]></script> <p>Chaining in the inequality we obtained from smoothness:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \eta_t (f(x_{t+1}) - f(x^\esx)) & \leq \langle \eta_t \nabla f(x_t),x_{t+1} - x^\esx \rangle + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\  & \leq - \langle \nabla V_{x_t}(x_{t+1}), x_{t+1} - x^\esx \rangle + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\ & = V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - V_{x_t}(x_{t+1}) + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2 \\ & \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) - \frac{1}{2} \norm{x_{t+1}-x_t}^2 + \eta_t \frac{L}{2} \norm{x_{t+1}-x_t}^2, \end{align*} %]]></script> <p>where the last inequality used the compatibility of the Bregman divergence with the norm. Picking $\eta_t = \frac{1}{L}$ results in</p> <script type="math/tex; mode=display">\frac{1}{L} (f(x_{t+1}) - f(x^\esx)) \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx),</script> <p>which we telescope out to</p> <script type="math/tex; mode=display">\sum_{t = 0}^{T-1} (f(x_{t+1}) - f(x^\esx)) \leq L V_{x_0}(x^\esx),</script> <p>and by convexity the average $\bar x = \frac{1}{T} \sum_{t = 1}^{T} x_t$ satisfies:</p> <script type="math/tex; mode=display">f(\bar x) - f(x^\esx) \leq \frac{L V_{x_0}(x^\esx)}{T},</script> <p>which is the expected rate for the smooth case. Note that this improvement does not translate to the online case.</p> <h3 id="strongly-convex-case">Strongly convex case</h3> <p>Finally we will show that if $f$ is $\mu$-strongly convex <em>with respect to $V_x(y)$</em> (not necessarily smooth though), then we can also obtain improved rates. This improvement translates also to the online learning case, i.e., we get the corresponding improvement in regret. Recall that a function is $\mu$-strongly convex with respect to $V_x(y)$ if:</p> <script type="math/tex; mode=display">f(y) - f(x) \geq \langle \nabla f(x),y-x \rangle + \mu V_x(y),</script> <p>holds for all $x,y \in \mathbb R^n$. 
Choosing $x \leftarrow x_t$ and $y \leftarrow x^\esx$, we obtain:</p> <script type="math/tex; mode=display">\langle \nabla f(x_t), x_t - x^\esx \rangle \geq f(x_t) - f(x^\esx) + \mu V_{x_t}(x^\esx).</script> <p>Again we start with (basicMD), which we will modify:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \langle \eta_t \nabla f(x_t),x_{t} - x^\esx \rangle & \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2. \end{align*} %]]></script> <p>We now plug in the bound from strong convexity to obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \eta_t (f(x_t) - f(x^\esx) + \mu V_{x_t}(x^\esx)) & \leq \langle \eta_t \nabla f(x_t),x_{t} - x^\esx \rangle \\ & \leq V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2, \end{align*} %]]></script> <p>which can be simplified, by moving the term $\eta_t \mu V_{x_t}(x^\esx)$ to the right-hand side, to</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{basicMDSC} \begin{align*} \eta_t (f(x_t) - f(x^\esx)) & \leq \left(1- \eta_t \mu\right) V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{\eta_t^2}{2}\norm{\partial f(x_t)}_\esx^2. 
\end{align*} %]]></script> <p>Choosing $\eta_t = \frac{1}{\mu t}$, we now obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \frac{1}{\mu t} (f(x_t) - f(x^\esx)) & \leq \left(1- \frac{1}{t}\right) V_{x_t}(x^\esx) - V_{x_{t+1}}(x^\esx) + \frac{1}{2\mu^2t^2}\norm{\partial f(x_t)}_\esx^2 \\ \Leftrightarrow \frac{1}{\mu} (f(x_t) - f(x^\esx)) & \leq \left(t- 1\right) V_{x_t}(x^\esx) - t V_{x_{t+1}}(x^\esx) + \frac{1}{2\mu^2t}\norm{\partial f(x_t)}_\esx^2, \end{align*} %]]></script> <p>which we can finally sum up (starting at $t=1$), multiply by $\mu$, and telescope out to arrive at:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{t = 1}^{T} (f(x_t) - f(x^\esx)) & \leq - T \mu V_{x_{T+1}}(x^\esx) + \frac{G^2}{2\mu} \sum_{t = 1}^{T} \frac{1}{t} \leq \frac{G^2 (1 + \log T)}{2\mu}, \end{align*} %]]></script> <p>and with the usual averaging $\bar x = \frac{1}{T}\sum_{t=1}^{T} x_t$ and using convexity we obtain:</p> <script type="math/tex; mode=display">f(\bar x) - f(x^\esx) \leq \frac{G^2 (1 + \log T)}{2\mu T}</script> <p>for the convergence rate and</p> <script type="math/tex; mode=display">\sum_{t = 0}^{T-1} f_t(x_t) - \min_x \sum_{t = 0}^{T-1} f_t(x) \leq \frac{G^2 }{2\mu} (1 + \log T),</script> <p>for the regret; note that this bound is already anytime. In order to obtain the regret bound, simply replace $x^\esx$ by an arbitrary $u$ and $f(x_t)$ by $f_t(x_t)$. Note however, that this time the argument is directly on the primal difference $f_t(x_t) - f_t(u)$, rather than the dual gaps, i.e., after plugging-in the strong convexity inequality we start from:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \eta_t (f_t(x_t) - f_t(u) + \mu V_{x_t}(u)) & \leq \langle \eta_t \nabla f_t(x_t), x_{t} - u \rangle \\ & \leq V_{x_t}(u) - V_{x_{t+1}}(u) + \frac{\eta_t^2}{2}\norm{\partial f_t(x_t)}_\esx^2, \end{align*} %]]></script> <p>and continue the same way.</p> <h3 id="references">References</h3> <p>[NY] Nemirovsky, A. S., &amp; Yudin, D. B. (1983). 
Problem complexity and method efficiency in optimization.</p> <p>[BT] Beck, A., &amp; Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167-175. <a href="https://web.iem.technion.ac.il/images/user-files/becka/papers/3.pdf">pdf</a></p> <p>[Z] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 928-936). <a href="http://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf">pdf</a></p> <p>[AO] Allen-Zhu, Z., &amp; Orecchia, L. (2014). Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537. <a href="https://arxiv.org/abs/1407.1537">pdf</a></p> <p>[AHK] Arora, S., Hazan, E., &amp; Kale, S. (2012). The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1), 121-164. <a href="http://www.theoryofcomputing.org/articles/v008a006/v008a006.pdf">pdf</a></p> <p>[Z2] Zhang, X. Bregman Divergence and Mirror Descent. <a href="http://users.cecs.anu.edu.au/~xzhang/teaching/bregman.pdf">pdf</a></p> <p><br /></p> <h4 id="changelog">Changelog</h4> <p>03/02/2019: Fixed several typos and added clarifications as pointed out by Matthieu Bloch.</p> <p>03/04/2019: Fixed several typos and a norm/divergence mismatch in the strongly convex case as pointed out by Cyrille Combettes.</p>Sebastian PokuttaTL;DR: Cheat Sheet for non-smooth convex optimization: subgradient descent, mirror descent, and online learning. 
Long and technical.Mixing Frank-Wolfe and Gradient Descent2019-02-18T00:00:00-05:002019-02-18T00:00:00-05:00http://www.pokutta.com/blog/research/2019/02/18/bcg-abstract<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/1805.07311">Blended Conditional Gradients</a> with <a href="https://users.renyi.hu/~braung/">Gábor Braun</a>, <a href="https://www.linkedin.com/in/dan-tu/">Dan Tu</a>, and <a href="http://pages.cs.wisc.edu/~swright/">Stephen Wright</a>, showing how mixing Frank-Wolfe and Gradient Descent gives a new, very fast, projection-free algorithm for constrained smooth convex minimization.</em> <!--more--></p> <h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2> <p>Frank-Wolfe methods [FW] (also called conditional gradient methods [CG]) have been very successful in solving <em>constrained smooth convex minimization</em> problems of the form:</p> <script type="math/tex; mode=display">\min_{x \in P} f(x),</script> <p>where $P$ is some compact and convex feasible region; you might want to think of, e.g., $P$ being a polytope, which is one of the most common cases. We assume so-called <em>first-order access</em> to the objective function $f$, i.e., we have an oracle that returns function evaluation $f(x)$ and gradient information $\nabla f(x)$ for a provided point $x \in P$. Moreover, we assume that we have access to the feasible region $P$ by means of a so-called <em>linear optimization oracle</em>, which upon being presented with a linear objective $c \in \RR^n$ returns $\arg\min_{x \in P} \langle c, x \rangle$. The basic Frank-Wolfe algorithm looks like this:</p> <p class="mathcol"><strong>Frank-Wolfe Algorithm [FW]</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$. 
<br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad v_t \leftarrow \arg\min_{x \in P} \langle \nabla f(x_{t}), x \rangle$ <br /> $\quad x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$</p> <p>The Frank-Wolfe algorithm has a couple of important advantages:</p> <ol> <li>It is very easy to implement.</li> <li>It does not require projections (as projected gradient descent does).</li> <li>It maintains iterates as a reasonably sparse convex combination of vertices.</li> </ol> <p>Generally, one can expect $O(1/t)$ convergence for the general convex smooth case and linear convergence for strongly convex functions with appropriate modifications of the Frank-Wolfe algorithm. The interested reader might check out <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a> and <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a> for an extensive overview.</p> <p>In the context of Frank-Wolfe methods a key assumption is that linear optimization is <em>cheap</em>. Compared to the projections one would have to perform for, say, projected gradient descent, this is almost always true (except for very simple feasible regions where projection is trivial). As such, traditionally one accounts for the linear optimization oracle call with an $O(1)$ cost and disregards it in the analysis. 
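For illustration, the algorithm above is only a few lines of code. A minimal sketch over the probability simplex, where the linear optimization oracle just picks the best vertex; the objective, step-size rule, and horizon are illustrative choices, not those of the paper:

```python
import numpy as np

def frank_wolfe(grad, lp_oracle, x0, T):
    """Vanilla Frank-Wolfe: x_{t+1} = (1 - gamma_t) x_t + gamma_t v_t."""
    x = x0
    for t in range(T):
        v = lp_oracle(grad(x))        # linear optimization oracle call
        gamma = 2.0 / (t + 2)         # standard agnostic step size
        x = (1 - gamma) * x + gamma * v
    return x

def simplex_lp_oracle(c):
    """argmin over the simplex of <c, x> is attained at a vertex (a coordinate)."""
    v = np.zeros_like(c)
    v[np.argmin(c)] = 1.0
    return v

# Example: min ||x - b||^2 over the probability simplex (optimum: x = b)
n = 50
b = np.full(n, 1.0 / n)
grad = lambda x: 2.0 * (x - b)
x0 = np.zeros(n); x0[0] = 1.0         # start at a vertex
x = frank_wolfe(grad, simplex_lp_oracle, x0, 2000)
```

Note how the iterate stays a convex combination of the vertices returned by the oracle, which is exactly the sparsity property from the list above.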
However, if the feasible region is complex (e.g., arising from an integer program or just being a really large linear program), this assumption is not warranted anymore and one might ask a few natural questions:</p> <ol> <li>Do we really have to call the (expensive) linear programming oracle in each iteration?</li> <li>Do we really need to compute (approximately) optimal solutions to the LP or does something completely different suffice?</li> <li>More generally, can we reuse information?</li> </ol> <p>It turns out that one can replace the linear programming oracle by what we call a <em>weak separation oracle</em>; see [BPZ] for more details. Without going into full detail (I will have a dedicated post about <em>lazification</em>, as we dubbed this technique), the oracle basically wraps around the actual linear programming oracle: before calling it, the weak separation oracle may answer by reusing previous oracle answers (caching). Moreover, one does not have to solve the LPs to (approximate) optimality; it suffices to find a point with a certain minimal improvement, which is compatible with the to-be-achieved convergence rate. In particular, we do not need any optimality proofs. One can then show that the same convergence rates as for the respective Frank-Wolfe variant with the linear programming oracle are <em>maintained</em> when using the weak separation oracle, while drastically reducing the number of LP oracle calls.</p> <h2 id="our-results">Our results</h2> <p>In practice, while lazification can provide huge speedups for Frank-Wolfe type methods when the LPs are hard to solve, this technique loses its advantage when the LPs are simple. The reason for this is that at the end of the day, there is a trade-off between the quality of the computed directions in terms of providing progress vs.
how hard they are to compute: the weak-separation oracle computes potentially worse approximations but does so very fast.</p> <p>However, what we show in our <em>Blended Conditional Gradients</em> paper is that one can:</p> <ol> <li>Cut out a <em>huge fraction of LP oracle calls</em> (sometimes less than 1% of the iterations require an actual LP oracle call)</li> <li>While working with <em>actual gradients</em> as descent directions, providing much better progress than traditional Frank-Wolfe directions and</li> <li>Staying fully <em>projection-free</em>.</li> </ol> <p>This is achieved by <em>blending together</em> conditional gradient descent steps and gradient steps in a special way. The resulting algorithm has a per-iteration cost that is very comparable to gradient descent in most of the steps, and when the LP oracle is called, the per-iteration cost is comparable to that of the standard Frank-Wolfe algorithm. In terms of progress per iteration, though, our algorithm, which we call <em>Blended Conditional Gradients (BCG)</em> (see [BPTW] for details), typically outperforms Away-Step and Pairwise Conditional Gradients (the current state-of-the-art methods). We are often even faster in wall-clock performance as we eschew most LP oracle calls. Naturally, we maintain worst-case convergence rates that match those of Away-Step Frank-Wolfe and Pairwise Conditional Gradients; the known lower bounds only assume first-order oracle access and LP oracle access and are unconditional.</p> <p>Rather than stating the algorithm’s (worst-case) convergence rates for the various cases, which are identical to the ones for Away-Step Frank-Wolfe and Pairwise Conditional Gradients, achieving $O(1/\varepsilon)$-convergence for general smooth and convex functions and $O(\log 1/\varepsilon)$-convergence for smooth and strongly convex functions (see [LJ] and [BPTW] for details), I present some computational results, as they highlight the typical behavior.
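Before turning to the results, the caching idea behind the weak separation oracle described earlier can be sketched as follows. This is a hypothetical illustration (the names, the acceptance test, and the toy simplex oracle are my own simplifications); see [BPZ] for the actual definition:

```python
def make_weak_separation_oracle(lp_oracle):
    """Wrap an expensive LP oracle: reuse cached vertices whenever one
    already guarantees enough progress; only otherwise call the true LP."""
    cache = []

    def oracle(c, x, phi):
        # c: current gradient, x: current iterate, phi: progress threshold
        for v in cache:  # caching: try previously returned vertices first
            if sum(ci * (xi - vi) for ci, xi, vi in zip(c, x, v)) >= phi:
                return v, False          # good enough, no true LP call
        v = lp_oracle(c)                 # fall back to the expensive oracle
        cache.append(v)
        return v, True

    return oracle

# Tiny demo with a toy simplex LP oracle (illustrative only):
lp_calls = []
def toy_lp(c):
    lp_calls.append(1)
    v = [0.0] * len(c)
    v[min(range(len(c)), key=lambda i: c[i])] = 1.0
    return v

oracle = make_weak_separation_oracle(toy_lp)
v1, fresh1 = oracle([1.0, -1.0], [0.5, 0.5], 0.1)  # needs a true LP call
v2, fresh2 = oracle([1.0, -1.0], [0.5, 0.5], 0.1)  # served from the cache
```

A cached vertex is accepted as soon as it guarantees improvement of at least phi in the linear model, which is exactly why no optimality proofs are needed.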
The following graphics provide a pretty representative overview of the computational performance. Everything is in log-scale and we ran each algorithm with a fixed time limit; we refer the reader to [BPTW] for more details.</p> <p>The first example is a benchmark of BCG vs. Away-Step Frank-Wolfe (AFW), Pairwise Conditional Gradients (PCG), and Vanilla Frank-Wolfe (FW) on a LASSO instance. BCG significantly outperforms the other variants and in fact the empirical convergence rate of BCG is much higher than the rates of the other algorithms; recall we are in log-scale and we expect linear convergence for all variants due to the characteristics of the instance (optimal solution in strict interior).</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg4.png" alt="BCG vs. normal" /></p> <p>One might say that the above is not completely unexpected in particular because BCG uses also the lazification technique from our previous work in [BPZ]. So let us see how we compare to lazified variants of Frank-Wolfe. In the next graph we compare BCG vs. LPCG vs. PCG. The problem we solve here is a structured regression problem over a spanning tree polytope. Clearly, while LPCG is faster than PCG, BCG is significantly faster than either of those, both in iterations and wall-clock time.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg3.png" alt="BCG vs. lazy" /></p> <p>To better understand what is going on let us see how often the LP oracle is actually called throughout the iterations. In the next graph we plot iterations vs. cumulative number of calls to the (true) LP oracle. 
Here we also added Fully-Corrective Frank-Wolfe (FCFW) variants that fully optimize over the active set and hence should have the lowest number of required LP calls; we implemented two variants: one that optimizes over the active set for a fixed number of iterations (the faster one in grey) and one that optimizes to a specific accuracy (the slower one in orange). The next plot shows two instances: LASSO (left) and structured regression over a <em>netgen</em> instance (right); for the former, lazification is not helpful as the LP oracle is too simple, while for the latter it is. As expected, for the non-lazy variants such as FW, AFW, and PCG we see a straight line, as we perform one LP call per iteration. For LPCG, BCG, and the two FCFW variants we obtain a significant reduction in actual calls to the LP oracle, with BCG sitting right between the (non-)lazy variants and the FCFW variants. BCG attains a large fraction of the reduction in calls of the much slower FCFW variants while being extremely fast compared to FCFW and all other variants.
As can be seen in primal and dual progress, BCG is using LP information in a much more aggressive way, while maintaining a very high speed—the two FCFW variants that would use the information from the LP calls even more aggressively only performed a handful of iterations though as they are extremely slow (see the grey and orange line right at the beginning of the red line). As a measure we depict primal and dual progress vs (true) LP oracle call.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/bcg/bcg2.png" alt="Progress per LP call" /></p> <h3 id="bcg-code">BCG Code</h3> <p>If you are interested in using BCG, we made a preliminary version of our code available on <a href="https://github.com/pokutta/bcg">github</a>; a significant update with more options and additional algorithms is coming soon.</p> <h3 id="references">References</h3> <p>[FW] Frank, M., &amp; Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p> <p>[CG] Levitin, E. S., &amp; Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&amp;jrnid=zvmmf&amp;paperid=7415&amp;option_lang=eng">pdf</a></p> <p>[BPZ] Braun, G., Pokutta, S., &amp; Zink, D. (2017, August). Lazifying conditional gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 566-575). JMLR. org. <a href="https://arxiv.org/abs/1610.05120">pdf</a></p> <p>[BPTW] Braun, G., Pokutta, S., Tu, D., &amp; Wright, S. (2018). Blended Conditional Gradients: the unconditioning of conditional gradients. arXiv preprint arXiv:1805.07311. <a href="https://arxiv.org/abs/1805.07311">pdf</a></p> <p>[LJ] Lacoste-Julien, S., &amp; Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. 
In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>Sebastian PokuttaTL;DR: This is an informal summary of our recent paper Blended Conditional Gradients with Gábor Braun, Dan Tu, and Stephen Wright, showing how mixing Frank-Wolfe and Gradient Descent gives a new, very fast, projection-free algorithm for constrained smooth convex minimization.The Zeroth World2019-02-05T23:06:41-05:002019-02-05T23:06:41-05:00http://www.pokutta.com/blog/random/2019/02/05/zeroth-world<p><em>TL;DR: On the impact of AI on society and economy and its potential to enable a zeroth world with unprecedented economic output.</em> <!--more--></p> <p>In this post I want to talk about the impact that artificial intelligence might have on society and economy; not because of the “terminator scenario” but because of what it already can achieve <em>right now</em>. Over the last few months I have had many such discussions within industry, academia, and government and this is a summary of what I think; as always <em>biased and incomplete</em>.</p> <p>Before delving into the actual discussion, I would like to clarify what I consider artificial intelligence (AI) as this is a very elusive term that has been overloaded several times to suit various narratives. When I talk about <em>artificial intelligence (AI)</em>, what I am talking about is <em>any technology, technology complex, or system</em>, that: 1) (Sensing) gathers information through direct input, sensors, etc. 2) (Learning) processes information with the explicit or implicit aim of forming an evaluation of its environment. 3) (Deciding) Decides on a course of action. 
4) (Acting) Informs or implements that course of action.</p> <p>For those familiar, this is quite similar to the <a href="https://en.wikipedia.org/wiki/OODA_loop">OODA loop</a>, an abstraction that captures dynamic decision-making with feedback. The (minor) difference here is that we (a) consider broader systems and (b) do not necessarily require feedback. In terms of (Acting) we also assume some form of autonomy; however, the action might be either only suggested by the system or directly executed. The purpose of this “definition” is not to add yet another definition to the mix but to make precise, <em>for the purpose of this post</em>, what we will be talking about. For simplicity, from now on we will refer to such systems as AI or AI systems. We will also refer to larger systems as AI systems if they contain such technology at their core.</p> <p>Examples of where such AI systems are used or appear are:</p> <ul> <li>Credit ratings</li> <li>Amazon’s “people also bought”</li> <li>Autonomous vehicles</li> <li>Medical decision-support systems</li> <li>Facial recognition</li> <li>…</li> </ul> <p>Also, note that I chose the term “AI systems” vs. many other equally fitting terms as it seems to be more “accessible” than some of the more technical ones, such as <em>Machine Learning</em> or <em>Decision-Support Systems</em>. Otherwise this choice is really arbitrary; let’s not make it about the choice of words.</p> <h2 id="impact-through-hybridization">Impact through Hybridization</h2> <p>A lot of the current discussion has been centered around the direct substitution of technology, workers, etc. by AI systems, as in <em>robot-in-human-out</em>. I believe, however, that this is not the likely scenario in the short to mid term, as it would require a very high maturity level of current AI and machine learning technology that seems far away.
Those wary of AI would argue that the <em>singularity</em>, where basically AI systems improve themselves, will drive maturity exponentially fast. Whether this is likely to happen I do not know, as predictions of this type are tough. Most of those voices wary of AI seem to argue from a utilitarian perspective a la Bernoulli and rather want to err on the safe side; from a risk management perspective not necessarily a bad approach. Most of those unconcerned argue that we have not figured out some very basic challenges and as such there is no real risk.</p> <p class="center"><img src="https://imgs.xkcd.com/comics/skynet.png" alt="XKCD: Skynet" /> <a href="https://xkcd.com/1046/">[Source: XKCD]</a></p> <p>While this discourse might be important in its own right, I want to focus more on the <em>(relatively) immediate, short-term</em> impact: timelines on the order of 10-20 years, which is really short compared to the speed with which societies and economic systems adapt.</p> <h3 id="scaling-and-enabling-through-ai">Scaling and Enabling through AI</h3> <p>In order for AI to have a disruptive impact on society, full maturity is not required; neither is <em>explainability</em>, although this might be desirable. The reason for this is that we can simply “pair up a human with an AI”, which I refer to as <em>Hybridization</em>, forming a symbiotic system in a more Xenoblade-esque fashion. The basic principle is that 90% of the basics can be performed efficiently and faster by an AI and for the remaining 10% we have human override.
This will (1) enable an individual to perform tasks that were out of reach at unprecedented speed and (2) allow an individual to aggressively scale up her/his operations by operating on a higher level, letting the AI take care of the basics.</p> <p>While this sounds Sci-Fi at first, a closer look reveals that we have been operating like this for many decades: we build tools to automate basic tasks (where basic is relative to the current level). This leads to an <em>automate-and-elevate</em> paradigm or cycle: automate the basics (e.g., via a machine or computer) and then go to the next level. A couple of examples:</p> <ul> <li>Driver + Google Maps</li> <li>Engineer + finite elements software</li> <li>Vlogger + Camera + Final Cut Pro</li> <li>MD + X-Ray</li> </ul> <p>I am sure you can come up with hundreds of other examples. What all these examples have in common is (1) an enabling factor and (2) a scale-up factor. Take the “Engineer + finite elements software” example: The engineer can suddenly compute and test designs that were impossible to verify by himself beforehand and that required a larger number of other people to be involved. However, with this tool, the number of involved people can be significantly reduced (the individual’s productivity skyrockets) and completely new, previously unthinkable things can suddenly be done.</p> <p>What AI systems bring to the mix is that they suddenly allow us to (at least partially) tool and automate tasks that were out of reach so far because of “messy inputs”, i.e., these AI systems allow us to redefine what we consider “basic”.</p> <h3 id="an-example">An example</h3> <p>Let us consider the example of autonomous driving. Not because I like it particularly but because most of us have a pretty good idea about driving. Also, today’s cars already have very basic automation, such as “cruise control” and “lane assist” systems, so that the idea is not that foreign. Traditionally, a car has one driver.
While AI for autonomous driving seems far from being completely there yet, we <em>do not need this</em> to achieve disruptive improvements. Here are two use cases:</p> <p>Use case 1: Let the AI take care of the basic driving tasks. Whenever a situation is unclear, the controls are transferred to a centralized control center, where professional drivers take over for the duration of the “complex task” and then the controls are passed back to the car. This might allow a single driver, together with AI subsystems, to operate 4-10 cars at a time; the range is arbitrary but seems reasonable: not correcting for correlation and tail risks, a 4x factor would require the AI to tackle 75% of the driven miles autonomously and a factor of 10x would require 90% of the driven miles being handled autonomously. Current disengagement rates of Waymo seem to be far better than that.</p> <p>Use case 2: Long-haul trucking. Highway autonomy is much easier than intracity operations. Have truck drivers drive the truck to a “handover point” on a highway. The truck driver gets off the truck, and the truck drives autonomously via the highway network to the handover point close to its destination. A human truck driver “picks up” the truck for last-mile intracity driving. If you now consider the ratio between the intracity portions and the highway portions of the trip, the number of required drivers can be reduced significantly; a 10x factor seems conservative. Moreover, rest times etc. can be cut out as well.</p> <p>Clearly, we can also combine use cases 1 and 2 for extra safety with minimal extra cost. What we see, however, from this basic example is that AI systems can scale up what a single human can do by significant multiples. Also, in the long-haul example from above, the quality of life of the drivers goes up, e.g., less time spent away from family (that is, for those that keep their job).
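The back-of-the-envelope numbers above follow from a one-line calculation: if one human supervises k vehicles, the AI must handle a fraction 1 - 1/k of the driven miles (again, ignoring correlation and tail risks, as in the estimate above):

```python
def required_autonomy(scale):
    """Fraction of miles the AI must drive so that one human can
    supervise `scale` vehicles (naive, no correlation/tail-risk correction)."""
    return 1 - 1 / scale

for k in (4, 10):
    print(f"{k}x scaling needs {required_autonomy(k):.0%} autonomous miles")
```

Plugging in the values from the text gives 75% at 4x and 90% at 10x, matching the figures quoted above.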
However, the <em>very important</em> flip-side of this hybridization is that it threatens to displace a huge fraction of jobs: at a scaling of 10x about 90% of the jobs might be at risk; this is of course a naive estimate.</p> <p>Other tasks that might become “basic” are:</p> <ul> <li><em>Call center operations:</em> We already have call systems handling large portions of the call until being passed to an operator. AI-based systems bring this to another level. Think: <a href="https://www.theverge.com/2018/12/5/18123785/google-duplex-how-to-use-reservations">Google Duplex</a></li> <li><em>Checking NDAs and contracts:</em> Time-consuming and not value-adding. There are several systems (I have not verified their accuracy) that offer automatic review, e.g., <a href="https://www.ndalynn.com/">NDALynn</a>, <a href="https://www.lawgeex.com/">LawGeex</a> (see also <a href="https://www.techspot.com/news/77189-machine-learning-algorithm-beats-20-lawyers-nda-legal.html">TechSpot</a>).</li> <li><em>Managing investment portfolios:</em> Robo-advisors in the retail space deliver similar or better performance than traditional and costly (and often subpar) investment advisors; after all, the hot shots are mostly working for funds or UHNWIs (see <a href="http://money.com/money/5330932/best-robo-advisors-beginner-advanced-2018/">here</a> and <a href="https://www.barrons.com/articles/the-top-robo-advisors-an-exclusive-ranking-1532740937">here</a>).</li> <li><em>Design of (simple) machine learning solutions:</em> Google’s <a href="https://cloud.google.com/automl/">AutoML</a> automates the creation of high-performance machine learning models. Upload your data and get a deployment-ready model with REST API etc.
No data scientist required.</li> <li>I know of other large companies using AI systems to automate the RFP process by sifting through thousands of pages of specifications to determine a product offering.</li> </ul> <p>Of course, just to be clear, all of the above also come with certain usage risks if not used properly or without the necessary expertise.</p> <h2 id="the-bigger-picture-learning-rate-and-discovery-rate">The bigger picture: learning rate and discovery rate</h2> <p>What this all might lead to is a <em>Zeroth World</em> whose advantage (broadly speaking in terms of development: economic, educational, societal, etc.) over the First World might be as large as the advantage of the First World over the Third World.</p> <h3 id="gdp-per-employed-person">GDP per employed person</h3> <p>A very skewed but still informative metric is GDP per person employed. It generally gives a good idea of the productivity levels achieved <em>on average</em>. There are a couple of special cases, for example China with an extremely high variance.
Nonetheless, in the graphics below generated from <a href="https://www.google.com/publicdata/explore?ds=d5bncppjof8f9_&amp;ctype=l&amp;met_y=sl_gdp_pcap_em_kd#!ctype=l&amp;strail=false&amp;bcs=d&amp;nselm=h&amp;met_y=sl_gdp_pcap_em_kd&amp;scale_y=log&amp;ind_y=false&amp;rdim=region&amp;idim=region:NAC&amp;idim=country:SGP:JPN:DEU:CHN:FRA:CMR:COG:ETH:GHA:KEN:NGA:SDN:ECU&amp;ifdim=region&amp;tdim=true&amp;hl=en_US&amp;dl=en_US&amp;ind=false">Google’s dataset</a> you can see a strict separation between (some) First World countries and (some) Third World countries; note that the scale is logarithmic:</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/gdp-employed-person-comp.png" alt="GDP per employed person" /> <a href="https://www.google.com/publicdata/explore?ds=d5bncppjof8f9_&amp;ctype=l&amp;met_y=sl_gdp_pcap_em_kd#!ctype=l&amp;strail=false&amp;bcs=d&amp;nselm=h&amp;met_y=sl_gdp_pcap_em_kd&amp;scale_y=log&amp;ind_y=false&amp;rdim=region&amp;idim=region:NAC&amp;idim=country:SGP:JPN:DEU:CHN:FRA:CMR:COG:ETH:GHA:KEN:NGA:SDN:ECU&amp;ifdim=region&amp;tdim=true&amp;hl=en_US&amp;dl=en_US&amp;ind=false">[Source: Google’s dataset]</a></p> <p>Now imagine that some countries, upon leveraging AI systems achieve a 10x gain in output per employed person. That will be the <em>Zeroth World</em>: people operating at 10x of their First World productivity levels. Hard to imagine, but that is roughly the separation between the US and Ghana for example.</p> <p>The graph above is very compatible with well-known trends, e.g., <a href="https://www.reuters.com/article/us-singapore-semiconductors-analysis/singapores-automation-incentives-draw-tech-firms-boost-economy-idUSKBN17T3DX">Singapore strongly investing in automation</a> or China being the country with <a href="https://www.dbs.com/aics/templatedata/article/generic/data/en/GR/042018/180409_insights_understanding_china_automation_drive_is_essential_and_welcome.xml">largest number of industrial robots going online</a>. 
JP Morgan estimates that automation could add <a href="https://www.businessinsider.com/automation-one-trillion-dollars-global-economy-jpmam-report-2017-11">up to $1.1 trillion</a> to the global economy over the next 10-15 years. While this is an overall boost in global GDP of only 1-1.5%, in actuality the effect might be much more pronounced as it will be concentrated in few countries, leading to a much stronger separation; even if the whole boost were attributed to the US alone, it would still be just about 5%. But AI systems go beyond mere manufacturing automation and it is hard to estimate the cumulative effect. To put things into context, in manufacturing an extreme shift happened around the 2000s when the first wave of strong automation kicked in. Over the last 30 or so years we roughly doubled manufacturing output and close to halved the number of people; see the graphics from <a href="https://www.businessinsider.com/manufacturing-output-versus-employment-chart-2016-12">Business Insider</a>:</p> <p class="center"><img src="https://amp.businessinsider.com/images/584b0056ca7f0c5c008b4a92-960-720.png" alt="Manufacturing output vs. automation" /> <a href="https://www.businessinsider.com/manufacturing-output-versus-employment-chart-2016-12">[Source: Business Insider]</a></p> <p>That is 4x in about 30 years in a physical space, with large, tangible assets and more generally with lots of overall inertia in the system. It is quite likely that AI systems will have an even more pronounced effect because they are more widely deployable, so that the 10x scenario is <em>not that</em> ambitious.</p> <h3 id="learning-rate-vs-discovery-rate">Learning rate vs discovery rate</h3> <p>To better understand what AI systems reasonably can and cannot do, without making strong predictions about the future, we need to differentiate between the <em>learning rate</em> and the <em>discovery rate</em> of a technology.
In a nutshell, the learning rate captures how fast, e.g., prices or required resources fell over time for an <em>existing</em> solution or product, i.e., by how much flying got cheaper over time. This captures various improvements over time in deploying a given technology. The learning rate makes no statement about new discoveries or overcoming fundamental roadblocks. That is exactly what the <em>discovery rate</em> captures. While the learning rate tends to be quite observable and often follows a relatively stable trend over time, the discovery rate is much more unpredictable (due to its nature), and that is where speculation about the future and its various scenarios often comes into play. I will not go there: the learning rate alone can provide us with some insights. Note that we refer to those two as “rates” as it is very insightful to consider the world in logarithmic scale, e.g., measuring time to double or halve. Let us consider the example of <a href="https://aiimpacts.org/wikipedia-history-of-gflops-costs/">historical prices for GFlops</a>:</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/History-of-GFLOPS-prices.png" alt="Learning rate GFlops" /></p> <p class="center"><a href="https://aiimpacts.org/wikipedia-history-of-gflops-costs/">[Source: AIImpacts.org]</a></p> <p>We can find a very similar trend in <a href="https://jcmit.net/memoryprice.htm">historical prices for storage</a>:</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/MemoryDiskPriceGraph-2018Dec.jpg" alt="Learning rate storage" /> <a href="https://jcmit.net/memoryprice.htm">[Source: jcmit.net]</a></p> <p>These two are probably pretty much expected as they roughly follow <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore’s law</a>; however, there are many similar examples in other industries with different rates.
For example, <a href="https://www.vox.com/2016/8/24/12620920/us-solar-power-costs-falling">historical prices for solar panels</a> or <a href="https://www.theatlantic.com/business/archive/2013/02/how-airline-ticket-prices-fell-50-in-30-years-and-why-nobody-noticed/273506/">flights</a>. Now let us compare this to the recent increase in <a href="https://blog.openai.com/ai-and-compute/">compute deployed for training AI systems</a>:</p> <p class="center"><img src="https://blog.openai.com/content/images/2018/05/compute_diagram-log@2x-3.png" alt="Compute used for training AI systems" /> <a href="https://blog.openai.com/ai-and-compute/">[Source: OpenAI Blog]</a></p> <p>Compared to Moore’s law with a doubling rate of roughly every 18 months (so far), the doubling rate of the compute deployed here is much higher, at only about 3.5 months (so far). Clearly, neither can continue forever at such aggressive rates; however, this example points at two things: (a) we are moving <em>much faster</em> than anything that we have seen so far and (b) with the deployment of more compute usually a roughly similar increase in required data comes along (the reason being that training algorithms, usually based on variants of stochastic gradient descent, can only make so many passes over the data before overfitting). Notably, those applications in the graph with the highest compute are not relying on labeled data (except for maybe Neural Machine Translation to some extent; not sure) but are reinforcement learning systems, where training data is generated through simulation and (self-)play. For more details see the <a href="https://blog.openai.com/ai-and-compute/">AI and Compute</a> post on OpenAI’s blog. The graph above is not exactly the learning rate as it lacks the relation to, e.g., price; however, it clearly shows how fast we are progressing. It is not hard to imagine that with new hardware architectures, in a not too distant future that type of power will be available on your cell phone.
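To see what the gap between an 18-month and a 3.5-month doubling time amounts to, a quick back-of-the-envelope computation helps (the six-year horizon is my arbitrary choice for illustration):

```python
def growth_factor(months, doubling_months):
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2 ** (months / doubling_months)

horizon = 72  # six years, arbitrary illustrative horizon
moore = growth_factor(horizon, 18)   # Moore's-law pace: 2^4 = 16x
ai = growth_factor(horizon, 3.5)     # AI-compute pace: roughly 1.5 million x
```

Over the same six years, the 3.5-month doubling time yields a factor roughly five orders of magnitude larger than Moore's law, which is why the trend cannot continue indefinitely.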
So even <em>without new discoveries</em>, just following the natural learning rate of the industry and making the current state of the art cheaper will have profound impact. For example, just a few days ago Google’s DeepMind <a href="https://blog.usejournal.com/an-analysis-on-how-deepminds-starcraft-2-ai-s-superhuman-speed-could-be-a-band-aid-fix-for-the-1702fb8344d6">(not completely uncontroversially)</a> won against pro players at playing <a href="https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/">StarCraft 2</a> (see also <a href="https://www.theverge.com/2019/1/24/18196135/google-deepmind-ai-starcraft-2-victory">here</a>). The training of this system required an enormous amount of computational resources. Even in light of the controversy, this is still an important achievement in terms of scaling technology, large-scale training with multiple agents, demonstrating that well-designed reinforcement learning systems <em>can</em> learn very complex tasks, and more generally “making it work”; whether reinforcement learning in general is the right approach to such problems is left for another discussion. In a few years we will teach building such integrated large-scale systems at universities, end-to-end, as a senior-design type of project, and a few years after that you will be able to download such a bot in the <em>App Store</em>. Crazy? Think of <em>neural style transfer</em> a few years back. You can now get <a href="https://prisma-ai.com/">Prisma</a> on your cell phone. Sure it might offload the computation to the cloud—at least previous versions did so—but that is not the point. The point is that complex AI system designs at the cutting edge are made available to the broader public only <em>a few years</em> after their inception. <a href="https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html">Google Duplex</a> is another such example, making restaurant etc. reservations for you.
To be clear, I am also very well aware of the limitations etc., but at the same time I fail to see a <em>fundamental roadblock</em>, and existing limitations might be removed quickly with good engineering and research.</p> <h3 id="impact-on-society-and-economy">Impact on society and economy</h3> <p>In a nutshell: we are moving very fast. In fact, so fast that the consequences are unclear. Forget about the “terminator scenario” as a threat to society. Not because it might or might not happen but rather because already the <em>current technology</em>, just following its natural learning rate cycle, poses a much more immediate challenge with the potential to lead to <em>huge</em> disruptions, both positive and negative.</p> <p>One very critical impact to think about is the workforce. If AI enables people to be more productive, then either the economic output increases or the number of people required to achieve a given output level will decrease; these are two sides of the same coin. The reality is that while there (likely) will be significant improvements in terms of economic output, there is only so much increase the “world” can absorb in a short period of time: at an economic world growth of about 2-3% per year, the time it takes to 10x the output is roughly 80-100 years; even with significantly improved efficiency due to AI systems you can only push the output so far. What this means is that we might be facing a transitory period where efficiency improvements will drastically impact employment levels and it will take considerable time for the workforce to adjust to these changes. In light of this one might actually contemplate whether populations in several developed countries are shrinking in early anticipation of the times ahead.</p> <p>The other critical thing to think about is the concentration of power and wealth that might accompany these shifts.
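The 80-100 year figure for 10x-ing world output is just compounding arithmetic; a quick sketch:

```python
from math import log

def years_to_multiply(factor, annual_growth):
    """Years until output grows by `factor` at a constant annual growth rate."""
    return log(factor) / log(1 + annual_growth)

at_3pct = years_to_multiply(10, 0.03)  # ~78 years at 3% world growth
at_2pct = years_to_multiply(10, 0.02)  # ~116 years at 2% world growth
```

The 2-3% growth assumption brackets a range of roughly 78 to 116 years, consistent with the rough 80-100 year figure in the text.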
Already today, we see that tech companies accumulate wealth and capital at unprecedented rates, leveraging the network effects of the internet. Yet, being still somewhat tied to the physical world, e.g., through their users, there is still <em>some limit</em> to their growth. It is easily imaginable, however, that the next “category of scale” will be defined by AI companies, with an insane concentration of resources, wealth, and power that makes current concentration levels in the valley pale in comparison.</p> <p>We will likely also see the empowering of individuals beyond what we could imagine just a few years back, by (a) multiplying the sheer output of an individual through scaling but also (b) enabling the individual to do new things leveraging AI support systems. Then the “best” will dominate, and technology will enable that individual to act globally, removing the last of the geographic entry barriers. As a simple example, take the recent “vlog” phenomenon, where one-person video productions can achieve a level of professionalism that rivals that of large-scale productions, executed from any place in the world and distributed worldwide through YouTube. Moreover, the individual can directly “sell” to her/his target audience, cutting out the middleman. This might provide greater diversity and also a democratization of such disciplines, but at the same time it might also remove a useful filter in some cases.</p> <p>These shifts, brought about by AI systems and the resulting technology, come with a lot of (potential) positives and negatives, and the promises of AI systems are great. Being high on the possibilities of this new paradigm, it is easy to forget though that there might be severe unintended consequences with potentially critical impact on our societies and economies. 
In order to enable sustainable progress we need to not just be aware of these new technologies but to prepare for and actively shape their use.</p>Sebastian PokuttaTL;DR: On the impact of AI on society and economy and its potential to enable a zeroth world with unprecedented economic output.Toolchain Tuesday No. 52018-12-22T19:00:00-05:002018-12-22T19:00:00-05:00http://www.pokutta.com/blog/random/2018/12/22/toolchain-5<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em> <!--more--></p> <p>This is the fifth installment of a series of posts; the <a href="/blog/pages/toolchain.html">full list</a> is expanding over time. This installment is about modeling languages for optimization problems.</p> <h2 id="software">Software:</h2> <h3 id="cvxopt">CVXOPT</h3> <p>Low-level <code class="highlighter-rouge">Python</code> interface for convex optimization.</p> <p><em>Learning curve: ⭐️⭐️⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://cvxopt.org">https://cvxopt.org</a></em></p> <p><code class="highlighter-rouge">CVXOPT</code> is basically a <code class="highlighter-rouge">Python</code> interface to various optimization solvers, providing an intermediate, relatively low-level, matrix-based interface. This is in contrast to some of the modeling languages below that provide a higher level of abstraction; these modeling languages generate the matrix structure by transcribing the statements of the model. 
Nonetheless, <code class="highlighter-rouge">CVXOPT</code> is a great tool to solve optimization problems in <code class="highlighter-rouge">Python</code>.</p> <p>From <a href="https://cvxopt.org">https://cvxopt.org</a>:</p> <blockquote> <p>CVXOPT is a free software package for convex optimization based on the Python programming language. It can be used with the interactive Python interpreter, on the command line by executing Python scripts, or integrated in other software via Python extension modules. Its main purpose is to make the development of software for convex optimization applications straightforward by building on Python’s extensive standard library and on the strengths of Python as a high-level programming language.</p> </blockquote> <p>Here is sample code from <a href="https://cvxopt.org">https://cvxopt.org</a> to give you an idea:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Risk-return trade-off.</span> <span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">sqrt</span> <span class="kn">from</span> <span class="nn">cvxopt</span> <span class="kn">import</span> <span class="n">matrix</span> <span class="kn">from</span> <span class="nn">cvxopt.blas</span> <span class="kn">import</span> <span class="n">dot</span> <span class="kn">from</span> <span class="nn">cvxopt.solvers</span> <span class="kn">import</span> <span class="n">qp</span><span class="p">,</span> <span class="n">options</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">4</span> <span class="n">S</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span> <span class="p">[[</span> <span class="mf">4e-2</span><span class="p">,</span> <span class="mf">6e-3</span><span class="p">,</span> <span class="o">-</span><span class="mf">4e-3</span><span class="p">,</span> <span class="mf">0.0</span> <span class="p">],</span> <span 
class="p">[</span> <span class="mf">6e-3</span><span class="p">,</span> <span class="mf">1e-2</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span> <span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mf">4e-3</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.5e-3</span><span class="p">,</span> <span class="mf">0.0</span> <span class="p">],</span> <span class="p">[</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span> <span class="p">]]</span> <span class="p">)</span> <span class="n">pbar</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">([</span><span class="o">.</span><span class="mi">12</span><span class="p">,</span> <span class="o">.</span><span class="mi">10</span><span class="p">,</span> <span class="o">.</span><span class="mo">07</span><span class="p">,</span> <span class="o">.</span><span class="mo">03</span><span class="p">])</span> <span class="n">G</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="n">n</span><span class="p">))</span> <span class="n">G</span><span class="p">[::</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="n">h</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span> <span class="n">A</span> <span class="o">=</span> <span 
class="n">matrix</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">n</span><span class="p">))</span> <span class="n">b</span> <span class="o">=</span> <span class="n">matrix</span><span class="p">(</span><span class="mf">1.0</span><span class="p">)</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">100</span> <span class="n">mus</span> <span class="o">=</span> <span class="p">[</span> <span class="mi">10</span><span class="o">**</span><span class="p">(</span><span class="mf">5.0</span><span class="o">*</span><span class="n">t</span><span class="o">/</span><span class="n">N</span><span class="o">-</span><span class="mf">1.0</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">)</span> <span class="p">]</span> <span class="n">options</span><span class="p">[</span><span class="s">'show_progress'</span><span class="p">]</span> <span class="o">=</span> <span class="bp">False</span> <span class="n">xs</span> <span class="o">=</span> <span class="p">[</span> <span class="n">qp</span><span class="p">(</span><span class="n">mu</span><span class="o">*</span><span class="n">S</span><span class="p">,</span> <span class="o">-</span><span class="n">pbar</span><span class="p">,</span> <span class="n">G</span><span class="p">,</span> <span class="n">h</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">b</span><span class="p">)[</span><span class="s">'x'</span><span class="p">]</span> <span class="k">for</span> <span class="n">mu</span> <span class="ow">in</span> <span class="n">mus</span> <span class="p">]</span> <span class="n">returns</span> <span class="o">=</span> <span class="p">[</span> <span class="n">dot</span><span class="p">(</span><span 
class="n">pbar</span><span class="p">,</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">xs</span> <span class="p">]</span> <span class="n">risks</span> <span class="o">=</span> <span class="p">[</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">dot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">S</span><span class="o">*</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">xs</span> <span class="p">]</span> </code></pre></div></div> <h3 id="pyomo">Pyomo</h3> <p><code class="highlighter-rouge">Pyomo</code> is a python-based open-source optimization modeling language supporting a wide range of optimization paradigms and solvers.</p> <p><em>Learning curve: ⭐️⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="http://www.pyomo.org/">http://www.pyomo.org/</a></em></p> <p><code class="highlighter-rouge">Pyomo</code> is a Python-based open-source optimization modeling language. It supports a variety of different optimization paradigms and integrates with a wide range of solvers including <code class="highlighter-rouge">BARON</code>, <code class="highlighter-rouge">CBC</code>, <code class="highlighter-rouge">CPLEX</code>, <code class="highlighter-rouge">Gurobi</code>, and <code class="highlighter-rouge">glpsol</code>; check the <a href="https://pyomo.readthedocs.io/en/latest/index.html">Pyomo Manual</a>. 
Another great resource with examples is the <a href="https://github.com/jckantor/ND-Pyomo-Cookbook">Pyomo Cookbook</a>.</p> <p>What sets <code class="highlighter-rouge">Pyomo</code> apart from <code class="highlighter-rouge">MathProg</code> and <code class="highlighter-rouge">CVXOPT</code> is that it is a relatively high-level modeling language (compared to <code class="highlighter-rouge">CVXOPT</code>) while being written in <code class="highlighter-rouge">Python</code> (compared to <code class="highlighter-rouge">MathProg</code>) allowing for easy integration with a plethora of other packages.</p> <p>From <a href="http://www.pyomo.org/">http://www.pyomo.org/</a>:</p> <blockquote> <p>A core capability of Pyomo is modeling structured optimization applications. Pyomo can be used to define general symbolic problems, create specific problem instances, and solve these instances using commercial and open-source solvers. Pyomo’s modeling objects are embedded within a full-featured high-level programming language providing a rich set of supporting libraries, which distinguishes Pyomo from other algebraic modeling languages like AMPL, AIMMS and GAMS.</p> </blockquote> <p>Supported problem types include:</p> <ul> <li>Linear programming</li> <li>Quadratic programming</li> <li>Nonlinear programming</li> <li>Mixed-integer linear programming</li> <li>Mixed-integer quadratic programming</li> <li>Mixed-integer nonlinear programming</li> <li>Stochastic programming</li> <li>Generalized disjunctive programming</li> <li>Differential algebraic equations</li> <li>Bilevel programming</li> <li>Mathematical programs with equilibrium constraints</li> </ul> <p>Here is an example of a model:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyomo.environ</span> <span class="kn">import</span> <span class="o">*</span> <span class="n">model</span> <span class="o">=</span> <span 
class="n">ConcreteModel</span><span class="p">()</span> <span class="c"># declare decision variables</span> <span class="n">model</span><span class="o">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">Var</span><span class="p">(</span><span class="n">domain</span><span class="o">=</span><span class="n">NonNegativeReals</span><span class="p">)</span> <span class="c"># declare objective</span> <span class="n">model</span><span class="o">.</span><span class="n">profit</span> <span class="o">=</span> <span class="n">Objective</span><span class="p">(</span> <span class="n">expr</span> <span class="o">=</span> <span class="mi">30</span><span class="o">*</span><span class="n">model</span><span class="o">.</span><span class="n">y</span><span class="p">,</span> <span class="n">sense</span> <span class="o">=</span> <span class="n">maximize</span><span class="p">)</span> <span class="c"># declare constraints</span> <span class="n">model</span><span class="o">.</span><span class="n">laborA</span> <span class="o">=</span> <span class="n">Constraint</span><span class="p">(</span><span class="n">expr</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">y</span> <span class="o">&lt;=</span> <span class="mi">80</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">laborB</span> <span class="o">=</span> <span class="n">Constraint</span><span class="p">(</span><span class="n">expr</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">y</span> <span class="o">&lt;=</span> <span class="mi">100</span><span class="p">)</span> <span class="c"># solve</span> <span class="n">SolverFactory</span><span class="p">(</span><span class="s">'glpk'</span><span class="p">)</span><span class="o">.</span><span class="n">solve</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="o">.</span><span 
class="n">write</span><span class="p">()</span> </code></pre></div></div> <h3 id="mathprog">MathProg</h3> <p><code class="highlighter-rouge">MathProg</code> (aka <code class="highlighter-rouge">GMPL</code>) is a modeling language for Mixed-Integer Linear Programs.</p> <p><em>Learning curve: ⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://www.gnu.org/software/glpk/">https://www.gnu.org/software/glpk/</a></em></p> <p><code class="highlighter-rouge">MathProg</code>, also known as <code class="highlighter-rouge">GMPL</code>, is a modeling language for Mixed-Integer Linear Programs (MILP). It is included with <code class="highlighter-rouge">glpk</code>, the <em>GNU Linear Programming Kit</em>, and it supports reading from and writing to data sources (databases via <code class="highlighter-rouge">ODBC</code> or <code class="highlighter-rouge">JDBC</code>). Apart from <code class="highlighter-rouge">glpsol</code>, which is <code class="highlighter-rouge">glpk</code>’s own MILP solver, it supports various other solvers such as <code class="highlighter-rouge">CPLEX</code>, <code class="highlighter-rouge">Gurobi</code>, or <code class="highlighter-rouge">SCIP</code> through LP-files. From the <a href="https://en.wikibooks.org/wiki/GLPK/GMPL_(MathProg)">GLPK wikibook</a>, which is also a great resource for <code class="highlighter-rouge">GMPL</code>:</p> <blockquote> <p>GNU MathProg is a high-level language for creating mathematical programming models. MathProg is specific to GLPK, but resembles a subset of AMPL. MathProg can also be referred to as GMPL (GNU Mathematical Programming Language), the two terms being interchangeable.</p> </blockquote> <p><code class="highlighter-rouge">MathProg</code> is particularly great for fast prototyping. Unfortunately, it does not directly support other IP solvers through its built-in interface but requires going through LP-files. 
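This LP-file route is easy to script. Here is a minimal sketch, assuming <code class="highlighter-rouge">glpsol</code> is installed and on the path; the file names (<code class="highlighter-rouge">model.mod</code>, <code class="highlighter-rouge">data.dat</code>, <code class="highlighter-rouge">model.lp</code>) are hypothetical placeholders:

```python
# Sketch: translate a MathProg model into an LP file via glpsol, without solving.
# The file names are hypothetical placeholders.
import os
import shutil
import subprocess

def glpsol_translate_cmd(model="model.mod", data="data.dat", lp_out="model.lp"):
    # --check: translate and check the model only, do not solve;
    # --wlp:   write the generated problem in CPLEX LP format.
    return ["glpsol", "-m", model, "-d", data, "--check", "--wlp", lp_out]

cmd = glpsol_translate_cmd()
# Only execute when glpsol and the model file are actually present.
if shutil.which("glpsol") and os.path.exists("model.mod"):
    subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

The generated LP file can then be handed to another solver, e.g., Gurobi’s command line (<code class="highlighter-rouge">gurobi_cl model.lp</code>), and the solution parsed back in Python.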
As a consequence, the relatively powerful output post-processing of <code class="highlighter-rouge">MathProg</code> cannot be used in that case, and <code class="highlighter-rouge">glpk</code>’s own solver <code class="highlighter-rouge">glpsol</code>, which integrates with <code class="highlighter-rouge">MathProg</code> natively, is only able to handle small to midsize problems. This limits the use case (therefore the ⭐️⭐️⭐️-rating on usefulness). Probably the natural course of things is to ‘graduate’ to <code class="highlighter-rouge">Pyomo</code> over time. Despite all of this, the actual <code class="highlighter-rouge">MathProg</code> language is very useful. I have used it <em>very often</em> for the actual modeling task, to separate (supporting) code from the model. I then generate an LP file from that model and its data sources, which I solve with, e.g., <code class="highlighter-rouge">Gurobi</code>. Finally, I parse the output back in, mostly using <code class="highlighter-rouge">Python</code>; there are other <code class="highlighter-rouge">Python</code> packages for <code class="highlighter-rouge">glpsol</code> that handle this parsing for you if you do not want to implement it yourself (though it is relatively easy). <code class="highlighter-rouge">Pyomo</code>, for example, also goes through the LP-file route to interface with <code class="highlighter-rouge">glpsol</code>.</p> <p>See <a href="https://en.wikibooks.org/wiki/GLPK/Obtaining_GLPK">here</a> on how to obtain <code class="highlighter-rouge">glpk</code>. To give you an idea of the syntax, check out this example that also includes some solution post-processing (similar to what, e.g., <code class="highlighter-rouge">OPL</code> can do):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A TRANSPORTATION PROBLEM
#
# This problem finds a least cost shipping schedule that meets
# requirements at markets and supplies at factories. 
#
# References:
# Dantzig G B, "Linear Programming and Extensions."
# Princeton University Press, Princeton, New Jersey, 1963,
# Chapter 3-3.

set I; /* canning plants */
set J; /* markets */

param a{i in I}; /* capacity of plant i in cases */
param b{j in J}; /* demand at market j in cases */
param d{i in I, j in J}; /* distance in thousands of miles */
param f; /* freight in dollars per case per thousand miles */
param c{i in I, j in J} := f * d[i,j] / 1000; /* transport cost in thousands of dollars per case */

var x{i in I, j in J} &gt;= 0; /* shipment quantities in cases */

minimize cost: sum{i in I, j in J} c[i,j] * x[i,j]; /* total transportation costs in thousands of dollars */

s.t. supply{i in I}: sum{j in J} x[i,j] &lt;= a[i]; /* observe supply limit at plant i */
s.t. demand{j in J}: sum{i in I} x[i,j] &gt;= b[j]; /* satisfy demand at market j */

solve;

# Report / Result Section (Optional)
printf '#################################\n';
printf 'Transportation Problem / LP Model Result\n';
printf '\n';
printf 'Minimum Cost = %.2f\n', cost;
printf '\n';
printf '\n';
printf 'Variables (i.e. shipment quantities in cases ) \n';
printf 'Shipment quantities in cases\n';
printf 'Canning Plants Markets Solution (Cases) \n';
printf{i in I, j in J}:'%14s %10s %11s\n',i,j, x[i,j];
printf '\n';
printf 'Constraints\n';
printf '\n';
printf 'Observe supply limit at plant i\n';
printf 'Canning Plants Solution Sign Required\n';
for {i in I} {
  printf '%14s %10.2f &lt;= %.3f\n', i, sum {j in J} x[i,j], a[i];
}
printf '\n';
printf 'Satisfy demand at market j\n';
printf 'Market Solution Sign Required\n';
for {j in J} {
  printf '%5s %10.2f &gt;= %.3f\n', j, sum {i in I} x[i,j], b[j];
}

data;

set I := Seattle San-Diego;
set J := New-York Chicago Topeka;

param a := Seattle 350 San-Diego 600;
param b := New-York 325 Chicago 300 Topeka 275;

param d : New-York Chicago Topeka :=
  Seattle    2.5   1.7   1.8
  San-Diego  2.5   1.8   1.4 ;

param f := 90;

end; </code></pre></div></div>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see here.Cheat Sheet: Smooth Convex Optimization2018-12-06T23:00:00-05:002018-12-06T23:00:00-05:00http://www.pokutta.com/blog/research/2018/12/06/cheatsheet-smooth-idealized<p><em>TL;DR: Cheat Sheet for smooth convex optimization and analysis via an idealized gradient descent algorithm. While technically a continuation of the Frank-Wolfe series, this should have been the very first post and this post will become the Tour d’Horizon for this series. 
Long and technical.</em> <!--more--></p> <p><em>Posts in this series (so far).</em></p> <ol> <li><a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a></li> <li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li> <li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li> <li><a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li> <li><a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a></li> </ol> <p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p> <p>In this fourth installment of the series on Conditional Gradients, which actually should have been the very first post, I will talk about an idealized gradient descent algorithm for smooth convex optimization, which allows us to obtain convergence rates and from which we can instantiate several known algorithms, including gradient descent and Frank-Wolfe variants. This post will become a Tour d’Horizon of the various results from this series. To be clear, the focus is on <em>projection-free</em> methods in the <em>constrained</em> case; however, I will also discuss other approaches to complement the exposition.</p> <p>While I will use notation that is compatible with previous posts, in particular the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a>, I will make this post as self-contained as possible with few forward references, so that this will become “Post Zero”. As before I will use Frank-Wolfe [FW] and Conditional Gradients [CG] interchangeably.</p> <p>Our setup will be as follows. 
We will consider a convex function $f: \RR^n \rightarrow \RR$ and we want to solve</p> <script type="math/tex; mode=display">\min_{x \in K} f(x),</script> <p>where $K$ is some feasible region, e.g., $K = \RR^n$ is the unconstrained case. We will in particular consider smooth functions as detailed further below and we assume that we only have <em>first-order access</em> to the function, via a so-called <em>first-order oracle</em>:</p> <p class="mathcol"><strong>First-Order oracle for $f$</strong> <br /> <em>Input:</em> $x \in \mathbb R^n$ <br /> <em>Output:</em> $\nabla f(x)$ and $f(x)$</p> <p>For now we disregard how we can access the feasible region $K$ as there are various access models and we will specify the model based on the algorithmic class that we target later. For the sake of simplicity, we will be using the $\ell_2$-norm but the arguments can be easily extended to other norms, e.g., replacing Cauchy-Schwarz inequalities by Hölder inequalities and using dual norms.</p> <h2 id="an-idealized-gradient-descent-algorithm">An idealized gradient descent algorithm</h2> <p>In a first step we will devise an idealized gradient descent algorithm, for which we will then derive convergence guarantees under different assumptions on the function $f$ under consideration. We will then show how known guarantees can be easily obtained from this idealized gradient descent algorithm.</p> <p>Let $f: \RR^n \rightarrow \RR$ be a convex function and $K$ be some feasible region. We are interested in studying ‘gradient descent-like’ algorithms. To this end let $x_t \in K$ be some point and we consider updates of the form</p> <p>$\tag{dirStep} x_{t+1} \leftarrow x_t - \eta_t d_t,$</p> <p>for some direction $d_t \in \RR^n$ and step size $\eta_t \in \RR$ for each $t$. 
For example, we would obtain standard gradient descent by choosing $d_t \doteq \nabla f(x_t)$ and $\eta_t = \frac{1}{L}$, where $L$ is the smoothness constant of $f$, i.e., the Lipschitz constant of its gradient.</p> <h3 id="measures-of-progress">Measures of progress</h3> <p>We will consider two important measures that drive the overall convergence rate. The first is a <em>measure of progress</em>, which in our context will be provided by the smoothness of the function. This will be the only measure of progress that we will consider, but there are many others for different setups. Note that the arguments here using smoothness do not rely on the convexity of the function; something to remember for later.</p> <p>Let us recall the definition of smoothness:</p> <p class="mathcol"><strong>Definition (smoothness).</strong> A convex function $f$ is said to be <em>$L$-smooth</em> if for all $x,y \in \mathbb R^n$ it holds: $f(y) - f(x) \leq \langle \nabla f(x), y-x \rangle + \frac{L}{2} \norm{x-y}^2.$</p> <p>There are two things to remember about smoothness:</p> <ol> <li>If $x$ is an optimal solution to (the unconstrained) $f$, then $\nabla f(x) = 0$, so that smoothness provides an <em>upper bound</em> on the distance to optimality: $f(x) - f(x^\esx) \leq \frac{L}{2} \norm{x-x^\esx}^2$.</li> <li>More generally it provides an upper bound on the change of the function by means of a quadratic.</li> </ol> <p>The <em>most important thing</em> however is that <em>smoothness induces progress</em> in schemes such as (dirStep). For this let us consider the smoothness inequality at two iterates $x_t$ and $x_{t+1}$ in the scheme from above. Plugging in the definition of (dirStep) we obtain</p> <script type="math/tex; mode=display">\underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \eta \langle\nabla f(x_t),d\rangle - \eta^2 \frac{L}{2} \|d\|^2</script> <p>Note that the function on the right is concave in $\eta$ and so we can maximize the right-hand side to obtain a lower bound on the progress. 
Taking the derivative of the right-hand side and setting it to zero we obtain:</p> <script type="math/tex; mode=display">\langle\nabla f(x_t),d\rangle - \eta L \norm{d}^2 = 0,</script> <p>which leads to the optimal choice $\eta^\esx \doteq \frac{\langle\nabla f(x_t),d\rangle}{L \norm{d}^2}$. This induces a progress lower bound of:</p> <p class="mathcol"><strong>Progress induced by smoothness (for $d$).</strong> $\begin{equation} \tag{Progress from d} \underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \frac{\langle\nabla f(x_t),d\rangle^2}{2L \norm{d}^2}. \end{equation}$</p> <p>We will now formulate our <em>idealized gradient descent</em> by using the <em>(normalized) idealized direction</em> $d \doteq \frac{x_t - x^\esx}{\norm{ x_t - x^\esx }}$, where we basically make steps in the direction of the optimal solution $x^\esx$; note that in general there might be multiple optimal solutions, in which case we choose one arbitrarily but keep it fixed.</p> <p class="mathcol"><strong>Idealized Gradient Descent (IGD)</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access and smoothness parameter $L$. <br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad x_{t+1} \leftarrow x_t - \eta_t \frac{x_t - x^\esx}{\norm{ x_t - x^\esx }}$ with $\eta_t = \frac{\langle\nabla f(x_t),\frac{x_t - x^\esx}{\norm{ x_t - x^\esx }}\rangle}{L}$</p> <p>It is important to note that in reality we <em>do not</em> have access to this idealized direction. Moreover, if we did have access, we could perform line search along this direction to reach the optimal solution $x^\esx$ in a <em>single</em> step. 
However, what we assume here is that the <em>algorithm does not know that this is an optimal direction</em> and hence, having only first-order access, the smoothness condition, and assuming that we do not do line search etc., the best the algorithm can do is to use the optimal step length from smoothness, which is exactly how we choose $\eta_t$. Also note that we could have defined $d$ as the unnormalized idealized direction $x_t - x^\esx$; however, the normalization simplifies the exposition.</p> <p>Let us now briefly establish the progress guarantees for IGD. For the sake of brevity let $h_t \doteq h(x_t) \doteq f(x_t) - f(x^\esx)$ denote the <em>primal gap (at $x_t$)</em>. Plugging the parameters into the progress inequality, we obtain</p> <p class="mathcol"><strong>Progress guarantee for IGD.</strong> $\begin{equation} \tag{IGD Progress} \underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} = h_{t} - h_{t+1} \geq \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2L \norm{x_t - x^\esx}^2}. \end{equation}$</p> <h3 id="measures-of-optimality">Measures of optimality</h3> <p>We will now introduce <em>measures of optimality</em> that together with (IGD Progress) induce convergence rates for IGD. These rates are <em>idealized rates</em> as they depend on the idealized direction; nonetheless, we will see in the following section that actual rates for known algorithms follow almost immediately from here. We will start with some basic measures first; I might expand this list over time if I come across other measures that can be explained relatively easily.</p> <p>In order to establish (idealized) convergence rates, we have to relate $\langle \nabla f(x_t),x_t - x^\esx \rangle$ with $f(x_t) - f(x^\esx)$. 
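The progress bound from smoothness together with the optimal step size $\eta^\esx$ can also be checked numerically. A minimal sketch on a toy separable quadratic (the function and all constants are illustrative, not from the post):

```python
# Check the smoothness progress bound f(x_t) - f(x_{t+1}) >= <g,d>^2 / (2L||d||^2)
# for f(x) = (x1^2 + 4*x2^2)/2, which is L-smooth with L = 4.

def f(x):
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)

def grad(x):
    return [x[0], 4.0 * x[1]]

L = 4.0
x = [1.0, 1.0]
for _ in range(5):
    g = grad(x)
    d = g                                    # plain gradient direction
    inner = sum(gi * di for gi, di in zip(g, d))
    norm2 = sum(di * di for di in d)
    eta = inner / (L * norm2)                # optimal step from smoothness
    x_next = [xi - eta * di for xi, di in zip(x, d)]
    guaranteed = inner ** 2 / (2.0 * L * norm2)   # guaranteed primal progress
    assert f(x) - f(x_next) >= guaranteed - 1e-12
    x = x_next
```

The actual progress per step is at least the guaranteed amount; for this quadratic it is typically strictly larger, since the smoothness bound is only an upper quadratic model.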
There are many different such relations that we refer to as <em>measures of optimality</em>, as they effectively provide a guarantee on the primal gap $h_t$ via dual information, as will become clear soon.</p> <p>To put things into perspective, smoothness provides a <em>quadratic</em> upper bound on $f(x)$, while convexity provides a <em>linear</em> lower bound on $f(x)$ and strong convexity provides a <em>quadratic</em> lower bound on $f(x)$. The HEB condition, which will be one of the considered measures of optimality, basically interpolates between linear and quadratic lower bounds by capturing how sharply the function curves around the optimal solution(s). The following graphic shows the relation between convexity, strong convexity, and smoothness on the left, and functions with different $\theta$-values in the HEB condition (as explained further below) are depicted on the right.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/heb-conv-over.png" alt="Convexity HEB overview" /></p> <h4 id="convexity">Convexity</h4> <p>Our first measure of optimality is <em>convexity</em>.</p> <p class="mathcol"><strong>Definition (convexity).</strong> A differentiable function $f$ is said to be <em>convex</em> if for all $x,y \in \mathbb R^n$ it holds: $f(y) - f(x) \geq \langle \nabla f(x), y-x \rangle.$</p> <p>From this we can derive a very basic guarantee on the primal gap $h_t$ by choosing $y \leftarrow x^\esx$ and $x \leftarrow x_t$, and we obtain:</p> <p class="mathcol"><strong>Primal Bound (convexity).</strong> At an iterate $x_t$ convexity induces a primal bound of the form: $\tag{PB-C} f(x_t) - f(x^\esx) \leq \langle \nabla f(x_t),x_t - x^\esx \rangle.$</p> <p>Combining (PB-C) with (IGD Progress) we obtain:</p> <script type="math/tex; mode=display">h_{t} - h_{t+1} \geq \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2L \norm{x_t - x^\esx}^2} \geq \frac{h_t^2}{2L \norm{x_t - x^\esx}^2} \geq \frac{h_t^2}{2L \norm{x_0 - x^\esx}^2},</script> <p>where the last inequality is 
not immediate but also not hard to show. Rearranging things we obtain:</p> <p class="mathcol"><strong>IGD contraction (convexity).</strong> Assuming convexity the primal gap $h_t$ contracts as: $\tag{Rec-C} h_{t+1} \leq h_t \left(1 - \frac{h_t}{2L \norm{x_0 - x^\esx}^2}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-C} h_T \leq \frac{2L \norm{x_0 - x^\esx}^2}{T+4}.$</p> <h4 id="strong-convexity">Strong Convexity</h4> <p>Our second measure of optimality is <em>strong convexity</em>.</p> <p class="mathcol"><strong>Definition (strong convexity).</strong> A convex function $f$ is said to be <em>$\mu$-strongly convex</em> if for all $x,y \in \mathbb R^n$ it holds: $f(y) - f(x) \geq \langle \nabla f(x),y-x \rangle + \frac{\mu}{2} \norm{x-y}^2.$</p> <p>The strong convexity inequality is basically the reverse of the smoothness inequality and we can use an argument similar to the one we used for the progress bound. For this we choose $x \leftarrow x_t$ and $y \leftarrow x_t - \eta e_t$ with $e_t \doteq x_t - x^\esx = d_t \norm{x_t-x^\esx}$ being the unnormalized idealized direction to obtain:</p> <script type="math/tex; mode=display">f(x_t - \eta e_t) - f(x_t) \geq - \eta \langle\nabla f(x_t),e_t\rangle + \eta^2\frac{\mu}{2} \| e_t \|^2.</script> <p>Now we minimize the right-hand side over $\eta$ and obtain that the minimum is achieved for the choice $\eta^\esx \doteq \frac{\langle\nabla f(x_t), e_t\rangle}{\mu \norm{e_t}^2}$; this is basically the same form as the $\eta^\esx$ from above.
Plugging this back in, we obtain</p> <script type="math/tex; mode=display">f(x_t) - f(x_t - \eta e_t) \leq \frac{\langle\nabla f(x_t),e_t\rangle^2}{2 \mu \norm{e_t}^2},</script> <p>and as the right-hand side is now independent of $\eta$, we can in particular choose $\eta = 1$ and obtain:</p> <p class="mathcol"><strong>Primal Bound (strong convexity).</strong> At an iterate $x_t$ strong convexity induces a primal bound of the form: $\tag{PB-SC} f(x_t) - f(x^\esx) \leq \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2\mu \norm{x_t - x^\esx}^2}.$</p> <p>Combining (PB-SC) with (IGD-Progress) we obtain:</p> <script type="math/tex; mode=display">h_{t} - h_{t+1} \geq \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2L \norm{x_t - x^\esx}^2} \geq \frac{\mu}{L} h_t.</script> <p class="mathcol"><strong>IGD contraction (strong convexity).</strong> Assuming strong convexity the primal gap $h_t$ contracts as: $\tag{Rec-SC} h_{t+1} \leq h_t \left(1 - \frac{\mu}{L}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-SC} h_T \leq \left(1 - \frac{\mu}{L}\right)^T h_0 \leq e^{-\frac{\mu}{L}T}h_0,$ or equivalently, $h_T \leq \varepsilon$ for $T \geq \frac{L}{\mu} \log \frac{h_0}{\varepsilon}.$</p> <h4 id="hölder-error-bound-heb-condition">Hölder Error Bound (HEB) Condition</h4> <p>One might wonder whether there are rates between those induced by convexity and those induced by strong convexity. This brings us to the Hölder Error Bound (HEB) condition that interpolates smoothly between the two regimes. Here we will confine the discussion to the basics that induce the bounds that we need; for an in-depth discussion and the relation to, e.g., the dominated gradient property, see the <a href="/blog/research/2018/11/11/heb-conv.html">HEB post</a> in this series.
Let $K^\esx$ denote the set of optimal solutions to $\min_{x \in K} f(x)$ and let $f^\esx \doteq f(x)$ for some $x \in K^\esx$.</p> <p class="mathcol"><strong>Definition (Hölder Error Bound (HEB) condition).</strong> A convex function $f$ satisfies the <em>Hölder Error Bound (HEB) condition on $K$</em> with parameters $0 &lt; c &lt; \infty$ and $\theta \in [0,1]$ if for all $x \in K$ it holds: $c (f(x) - f^\esx)^\theta \geq \min_{y \in K^\esx} \norm{x-y}.$</p> <p>Note that in contrast to convexity and strong convexity the HEB condition is a <em>local</em> condition as can be seen from its definition. As we assume that our functions are smooth it follows that $\theta \leq 1/2$ (see the <a href="/blog/research/2018/11/11/heb-conv.html">HEB post</a> for details). We can now combine (HEB) for any $x^\esx \in K^\esx$ with convexity to obtain:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} f(x) - f^\esx & = f(x) - f(x^\esx) \leq \langle \nabla f(x), x - x^\esx \rangle \\ & = \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} \norm{x - x^\esx} \\ & \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} c (f(x) - f^\esx)^\theta.
\end{align*} %]]></script> <p>Via rearranging we derive: $\frac{1}{c}(f(x) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}.$</p> <p class="mathcol"><strong>Primal Bound (HEB).</strong> At an iterate $x_t$ HEB induces a primal bound of the form: $\tag{PB-HEB} \frac{1}{c}(f(x_t) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}$ for any $x^\esx \in K^\esx$.</p> <p>Combining (PB-HEB) with (IGD-Progress) we obtain:</p> <script type="math/tex; mode=display">h_t - h_{t+1} \geq \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle^2}{2L \norm{x_t - x^\esx}^2} \geq \frac{\left(\frac{1}{c}h_t^{1-\theta} \right)^2}{2L},</script> <p>which can be rearranged to:</p> <script type="math/tex; mode=display">h_{t+1} \leq h_t - \frac{\frac{1}{c^2}h_t^{2-2\theta}}{2L} \leq h_t \left(1 - \frac{1}{2Lc^2} h_t^{1-2\theta}\right).</script> <p class="mathcol"><strong>IGD contraction (HEB).</strong> Assuming HEB the primal gap $h_t$ contracts as: $\tag{Rec-HEB} h_{t+1} \leq h_t \left(1 - \frac{1}{2Lc^2} h_t^{1-2\theta}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-HEB} h_T \leq \begin{cases} \left(1 - \frac{1}{2Lc^2}\right)^T h_0 &amp; \text{if } \theta = 1/2 \newline O(1) \left(\frac{1}{T} \right)^\frac{1}{1-2\theta} &amp; \text{if } \theta &lt; 1/2 \end{cases}$ or equivalently for the latter case, to ensure $h_T \leq \varepsilon$ it suffices to choose $T \geq \Omega\left(\frac{1}{\varepsilon^{1 - 2\theta}}\right)$. Note that the $O(1)$ term hides the dependence on $h_0$ for simplicity of exposition.</p> <h2 id="obtaining-known-algorithms">Obtaining known algorithms</h2> <p>We will now derive several known algorithms and results using IGD from above. The basic task that we have to accomplish is always the same.
We show that the direction $d_t$ that our algorithm under consideration takes in iteration $t$ satisfies:</p> <p>$\tag{Scaling} \frac{\langle \nabla f(x_t),d_t \rangle}{\norm{d_t}} \geq \alpha_t \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}},$</p> <p>for some $\alpha_t \geq 0$. The reason why we want to show (Scaling) is that, assuming that we use the optimal step length $\eta_t^\esx = \frac{\langle\nabla f(x_t),d_t\rangle}{L \norm{d_t}^2}$ from the smoothness equation, this ensures that the progress from our step satisfies:</p> <p>$\tag{ProgressApprox} h_t - h_{t+1} \geq \frac{\langle \nabla f(x_t),d_t \rangle^2}{2L\norm{d_t}^2} \geq \alpha_t^2 \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2L\norm{x_t - x^\esx}^2},$</p> <p>so that we lose the approximation factor $\alpha_t^2$ in the primal progress inequality. Usually, we will see that we can compute a constant $\alpha_t = \alpha &gt; 0$ for all $t$. This allows us to immediately apply all previous convergence bounds derived for IGD, corrected by the approximation factor $\alpha^2$ that we (might) lose now in each iteration.</p> <p>Note that for several of the algorithms presented below accelerated variants can be obtained, so that the presented rates are not optimal; I will address this and talk about acceleration in a future post. In general the method via IGD might not necessarily provide the sharpest constants etc. but rather favors simplicity of exposition.</p> <h3 id="gradient-descent">Gradient Descent</h3> <p>We will start with the (vanilla) <em>Gradient Descent (GD)</em> algorithm in the unconstrained setting, i.e., $K = \RR^n$.</p> <p class="mathcol"><strong>(Vanilla) Gradient Descent (GD)</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access, initial point $x_0 \in \RR^n$.
<br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad x_{t+1} \leftarrow x_t - \gamma_t \nabla f(x_t)$</p> <p>In order to show (Scaling) for $d_t \doteq \nabla f(x_t)$ consider: $\tag{ScalingGD} \frac{\langle \nabla f(x_t),\nabla f(x_t) \rangle}{\norm{\nabla f(x_t)}} = \norm{\nabla f(x_t)} \geq \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}},$ which holds by Cauchy-Schwarz, so that we can choose $\alpha_t = 1$ for all $t \in [T]$. In order to obtain (ProgressApprox) we pick the optimal step length $\gamma_t^\esx = \frac{\langle\nabla f(x_t),d_t\rangle}{L \norm{d_t}^2} = \frac{1}{L}$.</p> <p>We now obtain the convergence rates by simply combining the approximation from above with the IGD convergence rates. These bounds readily follow from plugging in and we only restate them here for completeness.</p> <h4 id="general-convergence-for-smooth-functions">General convergence for smooth functions</h4> <p>For the (general) smooth case we obtain:</p> <p class="mathcol"><strong>GD contraction (convexity).</strong> Assuming convexity the primal gap $h_t$ contracts as: $\tag{GD-Rec-C} h_{t+1} \leq h_t \left(1 - \frac{h_t}{2L \norm{x_0 - x^\esx}^2}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{GD-Rate-C} h_T \leq \frac{2L \norm{x_0 - x^\esx}^2}{T+4}.$</p> <h4 id="linear-convergence-for-strongly-convex-functions">Linear convergence for strongly convex functions</h4> <p>For smooth and strongly convex functions we obtain:</p> <p class="mathcol"><strong>GD contraction (strong convexity).</strong> Assuming strong convexity the primal gap $h_t$ contracts as: $\tag{GD-Rec-SC} h_{t+1} \leq h_t \left(1 - \frac{\mu}{L}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{GD-Rate-SC} h_T \leq \left(1 - \frac{\mu}{L}\right)^T h_0 \leq e^{-\frac{\mu}{L}T}h_0,$ or equivalently, $h_T \leq \varepsilon$ for $T \geq \frac{L}{\mu} \log \frac{h_0}{\varepsilon}.$</p> <h4
id="heb-rates">HEB rates</h4> <p>And for smooth functions satisfying the HEB condition we obtain:</p> <p class="mathcol"><strong>GD contraction (HEB).</strong> Assuming HEB the primal gap $h_t$ contracts as: $\tag{GD-Rec-HEB} h_{t+1} \leq h_t \left(1 - \frac{1}{2Lc^2} h_t^{1-2\theta}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{GD-Rate-HEB} h_T \leq \begin{cases} \left(1 - \frac{1}{2Lc^2}\right)^T h_0 &amp; \text{if } \theta = 1/2 \newline O(1) \left(\frac{1}{T} \right)^\frac{1}{1-2\theta} &amp; \text{if } \theta &lt; 1/2 \end{cases}$ or equivalently for the latter case, to ensure $h_T \leq \varepsilon$ it suffices to choose $T \geq \Omega\left(\frac{1}{\varepsilon^{1 - 2\theta}}\right)$. Note that the $O(1)$ term hides the dependence on $h_0$ for simplicity of exposition.</p> <h4 id="projected-gradient-descent">Projected Gradient Descent</h4> <p>The route through IGD is flexible enough to also accommodate the constrained case. Now we have to project back into the feasible region $K$, and <em>Projected Gradient Descent (PGD)</em> employs a projection $\Pi_K$ that maps a point $x \in \RR^n$ back into the feasible region $K$ (note that $\Pi_K$ has to satisfy certain properties to be admissible):</p> <p class="mathcol"><strong>Projected Gradient Descent (PGD)</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access, initial point $x_0 \in K$. <br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad x_{t+1} \leftarrow \Pi_K(x_t - \gamma_t \nabla f(x_t))$</p> <p>Without going into details we obtain (Scaling) here in a similar way due to the properties of the projection; I might explicitly consider projection-based methods in a later post.</p> <h3 id="frank-wolfe-variants">Frank-Wolfe Variants</h3> <p>We will now discuss how Frank-Wolfe variants fit into the IGD framework laid out above.
For this, in addition to the first-order access to the function $f$, we now need to specify access to the feasible region $K$, which will be through a <em>linear programming oracle</em>:</p> <p class="mathcol"><strong>Linear Programming oracle</strong> <br /> <em>Input:</em> $c \in \mathbb R^n$ <br /> <em>Output:</em> $\arg\min_{x \in K} \langle c, x \rangle$</p> <p>With this we can formulate the (vanilla) Frank-Wolfe algorithm:</p> <p class="mathcol"><strong>Frank-Wolfe Algorithm [FW]</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $K$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in K$. <br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad v_t \leftarrow \arg\min_{x \in K} \langle \nabla f(x_{t}), x \rangle$ <br /> $\quad x_{t+1} \leftarrow (1-\eta_t) x_t + \eta_t v_t$</p> <p>The Frank-Wolfe algorithm [FW] (also known as Conditional Gradients [CG]) has many advantages, with its projection-freeness being one of the most important; see <a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a> for an in-depth discussion.</p> <p>Before we continue, we need to address a small technicality: In the argumentation so far we did not have any restriction on choosing the step length $\eta$. However, in the case of Frank-Wolfe, as we are forming convex combinations, we have $0\leq \eta \leq 1$ to ensure feasibility. Formally, we would have to distinguish two cases, namely $\eta^\esx = \frac{\langle \nabla f(x_t), x_t - v_t\rangle}{L \norm{x_t - v_t}^2} \geq 1$ and $\eta^\esx &lt; 1$; note that we always have nonnegativity as $\langle \nabla f(x_t), x_t - v_t\rangle \geq 0$. We will purposefully disregard the former case, because in this regime we have linear convergence (the best we can hope for) anyway, and as such it is really the iterations with $\eta &lt; 1$ that determine the convergence rate.
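<p>As a small illustration of the algorithm above (not from the post; all names and parameter choices below are hypothetical), the following sketch runs vanilla Frank-Wolfe over the probability simplex for $f(x) = \frac{1}{2}\norm{x-b}^2$, using the optimal smoothness step length clipped to $[0,1]$ as just discussed:</p>

```python
import numpy as np

def frank_wolfe_simplex(b, T=50):
    """Vanilla Frank-Wolfe for f(x) = 0.5*||x - b||^2 over the probability simplex.
    The LMO over the simplex returns the vertex e_i with the smallest gradient entry."""
    n = len(b)
    L = 1.0                        # f is 1-smooth
    x = np.zeros(n); x[0] = 1.0    # start at a vertex
    for t in range(T):
        grad = x - b
        v = np.zeros(n); v[np.argmin(grad)] = 1.0    # LMO: Frank-Wolfe vertex
        d = x - v
        gap = grad @ d                               # Frank-Wolfe gap <grad(x), x - v>
        if gap <= 1e-14:
            break
        eta = min(1.0, gap / (L * (d @ d)))          # optimal step, clipped to [0,1]
        x = (1 - eta) * x + eta * v                  # convex combination: stays feasible
    return x

b = np.array([0.2, 0.3, 0.5])      # b lies in the simplex, so x* = b and f* = 0
x_T = frank_wolfe_simplex(b)
h_T = 0.5 * np.sum((x_T - b) ** 2) # primal gap
```

<p>Since the iterates are convex combinations of vertices, feasibility is maintained without any projection, which is exactly the appeal of the method.</p>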
Before we continue, we briefly provide a proof of linear convergence when $\eta \geq 1$, in which case we simply choose $\eta \doteq 1$; moreover, we will also establish that this case typically only happens once. By smoothness and using that in this case it holds $\langle \nabla f(x_t), x_t - v_t\rangle \geq L \norm{x_t - v_t}^2$ we have:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{LongStep} \begin{align*} \underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} & \geq \langle\nabla f(x_t),x_t - v_t\rangle - \frac{L}{2} \norm{x_t - v_t}^2 \newline & \geq \frac{1}{2} \langle\nabla f(x_t),x_t - v_t\rangle \newline & \geq \frac{1}{2} h_t, \end{align*} %]]></script> <p>so that in this regime we contract as</p> <script type="math/tex; mode=display">h_{t+1} \leq \frac{1}{2} h_t.</script> <p>This can happen only a logarithmic number of times until $\eta^\esx &lt; 1$ has to hold. In fact, the analysis can be slightly improved to show that this can happen <em>at most once</em> if we argue directly via the primal gap $h_t$. Suppose that $h_0 &gt; L \norm{x_0 - v_0}^2$. Then</p> <script type="math/tex; mode=display">\underbrace{f(x_{0}) - f(x_{1})}_{\text{primal progress}} \geq \langle\nabla f(x_0),x_0 - v_0\rangle - \frac{L}{2} \norm{x_0 - v_0}^2 \geq h_0 - \frac{L}{2} \norm{x_0 - v_0}^2.</script> <p>Thus after a single iteration we have $h_1 \leq h_0 - (h_0 - \frac{L}{2} \norm{x_0 - v_0}^2) = \frac{L}{2} \norm{x_0 - v_0}^2$.</p> <p>In the following let $D$ denote the diameter of $K$ with respect to $\norm{\cdot}$.</p> <h4 id="convergence-for-smooth-convex-functions">Convergence for Smooth Convex Functions</h4> <p>We will now first establish the convergence rate in the (general) smooth case.
For this it suffices to observe that:</p> <script type="math/tex; mode=display">\langle\nabla f(x_t),x_t - v_t\rangle \geq \langle\nabla f(x_t),x_t - x^\esx\rangle,</script> <p>as $v_t = \arg\min_{x \in K} \langle \nabla f(x_{t}), x \rangle$, and we can rearrange this to:</p> <script type="math/tex; mode=display">\tag{ScalingFW} \frac{\langle\nabla f(x_t),x_t - v_t\rangle}{\norm{x_t - v_t}} \geq \frac{\norm{x_t - x^\esx}}{D} \cdot \frac{\langle\nabla f(x_t),x_t - x^\esx\rangle}{\norm{x_t - x^\esx}},</script> <p>so that the progress per iteration, with $\alpha_t = \frac{\norm{x_t - x^\esx}}{D}$, can be lower bounded by:</p> <script type="math/tex; mode=display">% <![CDATA[ \tag{ProgressApproxFW} \begin{align*} h_t - h_{t+1} & \geq \frac{\langle \nabla f(x_t),x_t - v_t \rangle^2}{2L\norm{x_t - v_t}^2} \\ & \geq \alpha_t^2 \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2L\norm{x_t - x^\esx}^2} \\ & \geq \frac{\langle \nabla f(x_t),x_t - x^\esx \rangle^2}{2LD^2}. \end{align*} %]]></script> <p>We obtain for the (general) smooth case:</p> <p class="mathcol"><strong>FW contraction (convexity).</strong> Assuming convexity the primal gap $h_t$ contracts as: $\tag{FW-Rec-C} h_{t+1} \leq h_t \left(1 - \frac{h_t}{2L D^2}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{FW-Rate-C} h_T \leq \frac{2L D^2}{T+4}.$</p> <h4 id="linear-convergence-for-xesx-in-relative-interior">Linear convergence for $x^\esx$ in relative interior</h4> <p>Next, we will demonstrate that if $x^\esx$ lies in the relative interior of $K$, then already the vanilla Frank-Wolfe algorithm achieves linear convergence when $f$ is strongly convex.
For this we use the following lemma proven in [GM]:</p> <p class="mathcol"><strong>Lemma [GM].</strong> If $x^\esx$ is contained $2r$-deep in the relative interior of $K$, i.e., $B(x^\esx,2r) \cap \operatorname{aff}(K) \subseteq K$ for some $r &gt; 0$, then there exists some $t’$ so that for all $t\geq t’$ it holds $\frac{\langle \nabla f(x_t),x_t - v\rangle}{\norm{x_t - v}} \geq \frac{r}{D} \norm{\nabla f(x_t)}.$</p> <p>The lemma establishes (Scaling) with $\alpha_t \doteq \frac{r}{D}$:</p> <script type="math/tex; mode=display">\tag{ScalingFWint} \frac{\langle \nabla f(x_t),x_t - v\rangle}{\norm{x_t - v}} \geq \frac{r}{D} \norm{\nabla f(x_t)} \geq \frac{r}{D} \frac{\langle\nabla f(x_t),x_t - x^\esx\rangle}{\norm{x_t - x^\esx}}.</script> <p>Plugging this into the formula for strongly convex functions and ignoring the initial burn-in phase until we reach $t’$ we obtain:</p> <p class="mathcol"><strong>FW contraction (strong convexity and $x^\esx$ in rel. int.).</strong> Assuming strong convexity of $f$ and $x^\esx$ being in the relative interior of $K$ with depth $2r$, the primal gap $h_t$ contracts as: $\tag{Rec-SC-Int} h_{t+1} \leq h_t \left(1 - \frac{r^2}{D^2} \frac{\mu}{L}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-SC-Int} h_T \leq \left(1 - \frac{r^2}{D^2} \frac{\mu}{L}\right)^T h_0 \leq e^{-\frac{r^2}{D^2} \frac{\mu}{L}T}h_0,$ or equivalently, $h_T \leq \varepsilon$ for $T \geq \frac{D^2}{r^2} \frac{L}{\mu} \log \frac{h_0}{\varepsilon}.$</p> <p>Note that it is fine to ignore the burn-in phase before we reach $t’$: for a function family with optima $x^\esx$ being $r$-deep in the relative interior of $K$, smoothness parameter $L$, and strong convexity parameter $\mu$, using $\nabla f(x^\esx) = 0$ and strong convexity, we need $\norm{x_t-x^\esx}^2 \leq \frac{2}{\mu} h_t \leq r^2$ and hence $h_t \leq \frac{\mu}{2} r^2$, which is satisfied after at most $O(\frac{4 LD^2}{\mu r^2})$ iterations; this is a constant for any family satisfying those parameters.</p>
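<p>The arithmetic in the box above is easy to sanity-check by simulating the worst case of the contraction (Rec-SC-Int) with some made-up parameter values (the numbers below are purely illustrative):</p>

```python
import math

# Hypothetical problem parameters, chosen only to exercise the bound.
L, mu, D, r = 10.0, 1.0, 2.0, 0.5
h0, eps = 100.0, 1e-6

q = (r**2 / D**2) * (mu / L)   # per-iteration contraction factor from (Rec-SC-Int)
T = math.ceil((D**2 / r**2) * (L / mu) * math.log(h0 / eps))  # iteration bound

h = h0
for _ in range(T):
    h *= 1 - q                  # worst case of the contraction (Rec-SC-Int)
# after T iterations, (Rate-SC-Int) guarantees h <= eps
```

<p>The guarantee follows from $(1-q)^T \leq e^{-qT}$, so the check succeeds for any admissible parameter choice.</p>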
<p>The above is the best we can hope for using the vanilla Frank-Wolfe algorithm. In particular, if $x^\esx$ is on the boundary, linear convergence for strongly convex functions cannot be achieved in general with the vanilla Frank-Wolfe algorithm. Rather, it requires a modification of the Frank-Wolfe algorithm that we will discuss further below. For more details, and in particular the lower bound for the case with $x^\esx$ being on the boundary, see <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a>.</p> <h4 id="improved-convergence-for-strongly-convex-feasible-regions">Improved convergence for strongly convex feasible regions</h4> <p>We will now show that if the feasible region $K$ is strongly convex and the function $f$ is strongly convex, then we can also improve over the standard $O(1/t)$ convergence rate of conditional gradients; however, it is not known (to the best of my knowledge) whether we can achieve linear convergence in that case. Note that we make no assumption here about the location of $x^\esx$. The original result is due to [GH]; however, the exposition here will be different to fit into our IGD framework.</p> <p>Before we continue, we need to briefly recall <em>strong convexity of a set</em>:</p> <p class="mathcol"><strong>Definition (Strongly convex set).</strong> A convex set $K$ is <em>$\alpha$-strongly convex</em> with respect to $\norm{\cdot}$ if for any $x,y \in K$, $\gamma \in [0,1]$, and $z \in \RR^n$ with $\norm{z} = 1$ it holds: $\gamma x + (1-\gamma) y + \gamma(1-\gamma)\frac{\alpha}{2}\norm{x-y}^2z \in K.$</p> <p>So what this really means is that if you take the line segment between two points, then for any point on that line segment you can squeeze a ball around that point into $K$, where the radius depends on where you are on the line.
We will apply the above definition to the midpoint of $x$ and $y$, so that the definition ensures that for any $x,y \in K$</p> <script type="math/tex; mode=display">\tag{SCmidpoint} \frac{1}{2} (x + y) + \frac{\alpha}{8}\norm{x-y}^2z \in K,</script> <p>where $z$ is a norm-$1$ direction, as shown in the following graphic:</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/scbody.png" alt="Strongly Convex body" /></p> <p>With this we can easily establish the following variant of (Scaling):</p> <p class="mathcol"><strong>Lemma (Scaling for Strongly Convex Body (SCB)).</strong> Let $K$ be a strongly convex set with parameter $\alpha$. Then it holds: $\tag{ScalingSCB} \frac{\langle \nabla f(x_t), x_t - v_t \rangle}{\norm{x_t - v_t}^2} \geq \frac{\alpha}{4} \norm{\nabla f(x_t)},$ where $v_t$ is the Frank-Wolfe point from the algorithm.</p> <p><em>Proof.</em> Let $m \doteq \frac{1}{2} (x_t + v_t) + \frac{\alpha}{8}\norm{x_t-v_t}^2z$, where $z = \arg\min_{w \in \RR^n, \norm{w} = 1} \langle \nabla f(x_t), w \rangle$. Note that $\langle \nabla f(x_t), z \rangle = - \norm{\nabla f(x_t)}$. Now we have:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \langle \nabla f(x_t), x_t - v_t \rangle & \geq \langle \nabla f(x_t), x_t - m \rangle \\ & = \frac{1}{2} \langle \nabla f(x_t), x_t - v_t \rangle - \frac{\alpha}{8} \norm{x_t - v_t}^2 \langle \nabla f(x_t), z \rangle \\ & = \frac{1}{2} \langle \nabla f(x_t), x_t - v_t \rangle + \frac{\alpha}{8} \norm{x_t - v_t}^2 \norm{\nabla f(x_t)}, \end{align*} %]]></script> <p>where the first inequality follows from the optimality of the Frank-Wolfe point. From this the statement follows by simply rearranging. $\qed$</p> <p>This lemma is very much in the spirit of the proof of [GM] for $x^\esx$ being in the relative interior of $K$. However, the bound of [GM] is stronger: (ScalingSCB) is not exactly what we need, as we are missing a square around the scalar product in the numerator.
This seems subtle, but it is actually the reason why we do not obtain linear convergence by straightforward plugging in. In fact, we have to conclude the convergence rate in this case slightly differently by “mixing” the bound from (standard) convexity and (ScalingSCB). Observe that so far we have <em>not</em> used strong convexity of $f$ yet. Our starting point is the progress inequality from smoothness for the Frank-Wolfe direction $d = x_t - v_t$ and we continue as follows:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} f(x_t) - f(x_{t+1}) & \geq \frac{\langle \nabla f(x_t), x_t - v_t \rangle^2}{2L \norm{x_t - v_t}^2} \\ & \geq \langle \nabla f(x_t), x_t - v_t \rangle \cdot \frac{\langle \nabla f(x_t), x_t - v_t \rangle}{2L \norm{x_t - v_t}^2} \\ & \geq h_t \cdot \frac{\alpha}{8L} \norm{\nabla f(x_t)}. \end{align*} %]]></script> <p>This leads to a contraction of the form:</p> <script type="math/tex; mode=display">\tag{Rec-SCB-C} h_{t+1} \leq h_t (1- \frac{\alpha}{8L}\norm{\nabla f(x_t)}),</script> <p>and together with strong convexity, which ensures</p> <script type="math/tex; mode=display">h_t \leq \frac{\norm{\nabla f(x_t)}^2}{2\mu},</script> <p>we get:</p> <p class="mathcol"><strong>FW contraction (strong convexity and strongly convex body).</strong> Assuming strong convexity of $f$ and $K$ being a strongly convex set with parameter $\alpha$, the primal gap $h_t$ contracts as: $\tag{Rec-SC-SCB} h_{t+1} \leq h_t \left(1 - \frac{\alpha}{8L}\sqrt{2\mu h_t}\right),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-SC-SCB} h_T \leq O\left(1/T^2\right),$ where the $O(.)$ term hides the dependency on the parameters $L$, $\mu$, and $\alpha$.</p> <h4 id="linear-convergence-for-normnabla-fx--c">Linear convergence for $\norm{\nabla f(x)} &gt; c$</h4> <p>As mentioned above, (Rec-SCB-C) does not make any assumptions regarding the strong convexity of the function and in fact we can use this contraction to obtain linear convergence over
strongly convex bodies, whenever the <em>lower-bounded gradient assumption</em> holds, i.e., for all $x \in K$ we require $\norm{\nabla f(x)} \geq c &gt; 0$. With this, (Rec-SCB-C) immediately implies:</p> <p class="mathcol"><strong>FW contraction (strongly convex body and lower-bounded gradient).</strong> Assuming strong convexity of $K$ and $\norm{\nabla f(x)} \geq c &gt; 0$ for all $x \in K$: $\tag{Rec-SCB-LBG} h_{t+1} \leq h_t (1- \frac{\alpha c}{8L}),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-SCB-LBG} h_T \leq \left(1 - \frac{\alpha c}{8L}\right)^T h_0 \leq e^{-\frac{\alpha c}{8L}T}h_0,$ or equivalently, $h_T \leq \varepsilon$ for $T \geq \frac{8L}{\alpha c}\log \frac{h_0}{\varepsilon}.$</p> <h4 id="linear-convergence-over-polytopes">Linear convergence over polytopes</h4> <p>Next up is linear convergence of Frank-Wolfe over polytopes for strongly convex functions. First of all, it is important to note that the vanilla Frank-Wolfe algorithm <em>cannot</em> achieve linear convergence in general in this case; see <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a> for details. Rather, we need to consider a modification of the Frank-Wolfe algorithm that introduces so-called <em>away steps</em>, which basically add additional feasible directions to the Frank-Wolfe algorithm. Here we will only provide a very compressed discussion and refer the interested reader to <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a> for more details. Let us first recall the <em>Away-step Frank-Wolfe Algorithm</em>:</p> <p class="mathcol"><strong>Away-step Frank-Wolfe (AFW) Algorithm [W]</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $K$ with linear optimization oracle access, initial vertex $x_0 \in K$ and initial active set $S_0 = \setb{x_0}$.
<br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad v_t \leftarrow \arg\min_{x \in K} \langle \nabla f(x_{t}), x \rangle \quad \setb{\text{FW direction}}$ <br /> $\quad a_t \leftarrow \arg\max_{x \in S_t} \langle \nabla f(x_{t}), x \rangle \quad \setb{\text{Away direction}}$ <br /> $\quad$ If $\langle \nabla f(x_{t}), x_t - v_t \rangle &gt; \langle \nabla f(x_{t}), a_t - x_t \rangle: \quad \setb{\text{FW vs. Away}}$ <br /> $\quad \quad x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$ with $\gamma_t \in [0,1]\quad \setb{\text{Perform FW step}}$ <br /> $\quad$ Else: <br /> $\quad \quad x_{t+1} \leftarrow (1+\gamma_t) x_t - \gamma_t a_t$ with $\gamma_t \in [0,\frac{\lambda_{a_t}}{1-\lambda_{a_t}}]\quad \setb{\text{Perform Away step}}$ <br /> $\quad S_{t+1} \leftarrow \operatorname{ActiveSet}(x_{t+1})$</p> <p>The important term here is $\langle \nabla f(x_{t}), a_t - v_t \rangle$, which we refer to as the <em>strong Wolfe gap</em>; the name will become apparent in a few minutes. First, however, observe that whichever of the two steps we take, at least one of the two directions has to recover $1/2$ of $\langle \nabla f(x_{t}), a_t - v_t \rangle$, i.e., either</p> <script type="math/tex; mode=display">\langle \nabla f(x_{t}), x_t - v_t \rangle \geq 1/2 \ \langle \nabla f(x_{t}), a_t - v_t \rangle</script> <p>or</p> <script type="math/tex; mode=display">\langle \nabla f(x_{t}), a_t - x_t \rangle \geq 1/2 \ \langle \nabla f(x_{t}), a_t - v_t \rangle.</script> <p>Why? If neither held, adding up the two reversed inequalities would yield a contradiction, since the two left-hand sides sum to exactly $\langle \nabla f(x_{t}), a_t - v_t \rangle$. It is easy to see that $\langle \nabla f(x_{t}), x_t - v_t \rangle \leq \langle \nabla f(x_{t}), a_t - v_t \rangle$, so at first one may think of the strong Wolfe gap as being <em>weaker</em> than the Wolfe gap.
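<p>The “at least half” observation above is a one-line identity and can be checked numerically; the following toy check (random data over the simplex, all names made up for illustration) exploits that the FW part and the away part sum to exactly the strong Wolfe gap:</p>

```python
import numpy as np

def half_gap_holds(g, lam):
    """Either the FW part or the away part recovers half the strong Wolfe gap."""
    n = len(g)
    x = lam                          # iterate with full support, so the active set is all e_i
    v = np.eye(n)[np.argmin(g)]      # FW vertex: minimizes <g, .> over the simplex
    a = np.eye(n)[np.argmax(g)]      # away vertex: maximizes <g, .> over the active set
    fw_part = g @ (x - v)            # <g, x - v>
    away_part = g @ (a - x)          # <g, a - x>
    # fw_part + away_part == <g, a - v>, hence their max is at least half of it
    return max(fw_part, away_part) >= 0.5 * (g @ (a - v)) - 1e-12

rng = np.random.default_rng(0)
ok = all(half_gap_holds(rng.normal(size=5), rng.dirichlet(np.ones(5)))
         for _ in range(1000))
```

<p>The check never fails, exactly because the two inequalities cannot be violated simultaneously.</p>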
However, what Lacoste-Julien and Jaggi showed in [LJ] is that <em>in the case of $K$ being a polytope</em> there exists the magic scalar $\alpha_t$ that we have been using before for (Scaling), relative to the strong Wolfe gap $\langle \nabla f(x_{t}), a_t - v_t \rangle$. More precisely, they showed the existence of a geometric constant $w(K)$, the so-called <em>pyramidal width</em>, that <em>only</em> depends on the polytope $K$ so that</p> <script type="math/tex; mode=display">\tag{ScalingAFW} \langle \nabla f(x_{t}), a_t - v_t \rangle \geq w(K) \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}.</script> <p>Note that the missing normalization term $\norm{a_t - v_t}$ can be absorbed in various ways if the feasible region is bounded, e.g., we can simply replace it by the diameter and absorb it into $w(K)$ or use the affine-invariant definition of curvature. Now it also becomes clear why the name <em>strong Wolfe gap</em> makes sense for $\langle \nabla f(x_{t}), a_t - v_t \rangle$: we can combine (ScalingAFW) with the strong convexity of $f$ and obtain:</p> <script type="math/tex; mode=display">h_t \leq \frac{\langle \nabla f(x_{t}), a_t - v_t \rangle^2}{2 \mu w(K)^2},</script> <p>i.e., we obtain a strong upper bound on the primal gap $h_t$, in spirit similar to the bound induced by strong convexity.
Similarly, combining (ScalingAFW) with our IGD arguments, we immediately obtain:</p> <p class="mathcol"><strong>AFW contraction (strong convexity and $K$ polytope).</strong> Assuming strong convexity of $f$ and $K$ being a polytope, the primal gap $h_t$ contracts as: $\tag{Rec-AFW-SC} h_{t+1} \leq h_t \left(1 - \frac{\mu}{L} w(K)^2 \right),$ which leads to a convergence rate after solving the recurrence of $\tag{Rate-AFW-SC} h_T \leq \left(1 - \frac{\mu}{L} w(K)^2\right)^T h_0 \leq e^{-\frac{\mu}{L} w(K)^2T}h_0,$ or equivalently, $h_T \leq \varepsilon$ for $T \geq \frac{L}{w(K)^2\mu} \log \frac{h_0}{\varepsilon}.$</p> <p>On a final note for this section, the reason why we need to assume that $K$ is a polytope is that $w(K)$ can tend to zero for general convex bodies, so that no reasonable bound can be obtained; in fact, $w(K)$ is a minimum over certain subsets of vertices and this list is only finite in the polyhedral case.</p> <h4 id="heb-rates-1">HEB rates</h4> <p>We can also further combine (ScalingAFW) with the HEB condition to obtain HEB rates for a variant of AFW that employs restarts. This follows exactly the template of the section before, relying on (ScalingAFW), and we thus skip it here and refer the interested reader to <a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a>, where we provide a full derivation including the restart variant of AFW.</p> <h4 id="a-note-on-affine-invariant-constants">A note on affine-invariant constants</h4> <p>Note that the Frank-Wolfe algorithm and its variants can be formulated as affine-invariant algorithms, while I purposefully opted for an affine-variant exposition. While from a theoretical perspective the affine-invariant versions are certainly nicer (basically $LD^2$ is replaced by a much sharper quantity $C$), from a practical perspective, when we actually have to choose step lengths, the affine-variant versions often perform much better.
For this let us compare the <em>affine-invariant progress bound</em></p> <script type="math/tex; mode=display">\tag{ProgressAI} f(x_t) - f(x_{t+1}) \geq \frac{\langle\nabla f(x_t),d\rangle^2}{2C},</script> <p>with optimal choice $\eta^\esx_{AI} \doteq \frac{\langle\nabla f(x_t),d\rangle}{C}$, versus the <em>affine-variant progress bound</em></p> <script type="math/tex; mode=display">\tag{ProgressAV} f(x_t) - f(x_{t+1}) \geq \frac{\langle\nabla f(x_t),d\rangle^2}{2L \norm{d}^2},</script> <p>with optimal choice $\eta_{AV}^\esx \doteq \frac{\langle\nabla f(x_t),d\rangle}{L \norm{d}^2}$.</p> <p>Combining the two, we have</p> <script type="math/tex; mode=display">\frac{\eta_{AV}^\esx}{\eta_{AI}^\esx} = \frac{C}{L \norm{d}^2},</script> <p>and in particular, when $\norm{d}^2$ is small, then $\eta_{AV}^\esx$ gets larger and we make longer steps. While this is not important for the theoretical analysis, it does make a difference for actual implementations, as has been observed before, e.g., by [PANJ]:</p> <blockquote> <p>We also note that this algorithm is not affine invariant, i.e., the iterates are not invariant by affine transformations of the variable, as is the case for some FW variants [J]. It is possible to derive a similar affine invariant algorithm by replacing $L_t d_t^2$ by $C_t$ in Line 6 and (1), and estimate $C_t$ instead of $L_t$. However, we have found that this variant performs empirically worse than AdaFW and did not consider it further.</p> </blockquote> <h3 id="references">References</h3> <p>[CG] Levitin, E. S., &amp; Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&amp;jrnid=zvmmf&amp;paperid=7415&amp;option_lang=eng">pdf</a></p> <p>[FW] Frank, M., &amp; Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. 
<a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p> <p>[GM] Guélat, J., &amp; Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119. <a href="https://link.springer.com/content/pdf/10.1007/BF01589445.pdf">pdf</a></p> <p>[GH] Garber, D., &amp; Hazan, E. (2014). Faster rates for the Frank-Wolfe method over strongly-convex sets. arXiv preprint arXiv:1406.1305. <a href="http://proceedings.mlr.press/v37/garbera15-supp.pdf">pdf</a></p> <p>[W] Wolfe, P. (1970). Convergence theory in nonlinear programming. Integer and nonlinear programming, 1-36.</p> <p>[LJ] Lacoste-Julien, S., &amp; Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p> <p>[PANJ] Pedregosa, F., Askari, A., Negiar, G., &amp; Jaggi, M. (2018). Step-Size Adaptivity in Projection-Free Optimization. arXiv preprint arXiv:1806.05123. <a href="https://arxiv.org/abs/1806.05123">pdf</a></p>
Sebastian Pokutta
Toolchain Tuesday No. 4 (2018-12-03, http://www.pokutta.com/blog/random/2018/12/03/toolchain-4)
<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. 
For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em> <!--more--></p> <p>This is the fourth installment of a series of posts; the <a href="/blog/pages/toolchain.html">full list</a> is expanding over time. This time around it will be about <code class="highlighter-rouge">git</code>, which enables version control and distributed, asynchronous collaboration. <code class="highlighter-rouge">Git</code> is probably the single most useful tool in my workflow.</p> <h2 id="software">Software:</h2> <h3 id="git">Git</h3> <p>Decentralized version control for coding, LaTeX documents, and much more.</p> <p><em>Learning curve: ⭐️⭐️⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://github.com/git/git">https://github.com/git/git</a></em></p> <p><code class="highlighter-rouge">Git</code> is the single most useful tool in my whole workflow. Think of it as the operating system that underlies almost everything. Basically everything from writing papers, coding, all my markdown documents, and even my <code class="highlighter-rouge">Jekyll</code>-driven sites is managed in a <code class="highlighter-rouge">git</code> repository. So what is <code class="highlighter-rouge">git</code>? From <a href="https://en.wikipedia.org/wiki/Git">[wikipedia]</a>:</p> <blockquote> <p>Git (/ɡɪt/) is a version-control system for tracking changes in computer files and coordinating work on those files among multiple people. It is primarily used for source-code management in software development, but it can be used to keep track of changes in any set of files. As a distributed revision-control system, it is aimed at speed, data integrity, and support for distributed, non-linear workflows. 
[…] As with most other distributed version-control systems, and unlike most client–server systems, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access or a central server.</p> </blockquote> <p>So what do these features come down to in the hard reality of day-to-day life?</p> <ol> <li> <p><em>Collaboration.</em> Working with others <em>without</em> having to worry about ‘tokens’ and other concepts solely created to implement file locks through human behavior. <code class="highlighter-rouge">Git</code> provides capabilities for <em>distributed</em> and <em>asynchronous</em> collaboration. In terms of how awesome <code class="highlighter-rouge">git</code> really is, let the numbers speak: Microsoft just <a href="https://news.microsoft.com/2018/06/04/microsoft-to-acquire-github-for-7-5-billion/">paid</a> $7.5 billion for <code class="highlighter-rouge">github</code>, one of the main <code class="highlighter-rouge">git</code> repository platforms, for a reason… With <code class="highlighter-rouge">git</code> any number of people can work on the same files, code, projects, etc., and <code class="highlighter-rouge">git</code> will automatically merge their changes provided they do not overlap; if they do overlap, they can be merged relatively easily by hand with the help of <code class="highlighter-rouge">git</code>. Also, nothing is ever lost! Remember when you shared files on Dropbox and someone overwrote your file after you edited it painstakingly just to fix a comma? With <code class="highlighter-rouge">git</code> this cannot happen.</p> </li> <li> <p><em>Backup and full history.</em> Every copy of the repository on any machine contains the <em>full</em> version history. 
This provides incredible redundancy <em>and</em> if you <code class="highlighter-rouge">push</code> to a remote repository then you have a remote backup that you can <code class="highlighter-rouge">pull</code> from basically any location with an internet connection. For repository space check out, e.g., <a href="https://bitbucket.org/">bitbucket.org</a> and <a href="https://github.com">github.com</a>.</p> </li> <li> <p><em>Different version branches.</em> Another powerful feature of <code class="highlighter-rouge">git</code> is to maintain and synchronize different versions of a product through <code class="highlighter-rouge">branches</code>. One of the most common use cases for me is, for example, when we have an arxiv version and a conference version of a paper, which need to be kept synchronized. With <code class="highlighter-rouge">git</code> you can easily track changes between these versions and <code class="highlighter-rouge">cherry pick</code> those that you want to synchronize.</p> </li> </ol> <p>Unfortunately, the learning curve of <code class="highlighter-rouge">git</code> is quite steep, in particular if you want to do something slightly more advanced. For most users I highly recommend a <code class="highlighter-rouge">git</code> gui as it makes merging etc. much easier. I will mention two choices below. There are tons of git tutorials online and good starting points are <a href="https://try.github.io/">[here]</a> and <a href="https://git-scm.com/docs/gittutorial">[here]</a>. (Ping me for your favorite one; happy to add links.)</p> <h3 id="sourcetree">Sourcetree</h3> <p>Great and free <code class="highlighter-rouge">git</code> gui for mac os x and windows.</p> <p><em>Learning curve: ⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://www.sourcetreeapp.com/">https://www.sourcetreeapp.com/</a></em></p> <p><code class="highlighter-rouge">Sourcetree</code> is a great graphical <code class="highlighter-rouge">git</code> client. 
It has full support for <code class="highlighter-rouge">git</code>, comes with many useful features, and is free. Not much to say otherwise: the power of <code class="highlighter-rouge">git</code> accessible through a great user interface.</p> <h3 id="smartgit">SmartGit</h3> <p>Great <code class="highlighter-rouge">git</code> gui for mac os x and windows.</p> <p><em>Learning curve: ⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://www.syntevo.com/smartgit/">https://www.syntevo.com/smartgit/</a></em></p> <p><code class="highlighter-rouge">SmartGit</code> is another great graphical <code class="highlighter-rouge">git</code> client and it is free for non-commercial use. Otherwise the same as for <code class="highlighter-rouge">Sourcetree</code> applies here; both <code class="highlighter-rouge">Sourcetree</code> and <code class="highlighter-rouge">SmartGit</code> are great and it comes down to personal preference.</p>
Emulating the Expert (2018-11-25, http://www.pokutta.com/blog/research/2018/11/25/expertLearning-abstract)
<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/1810.12997">An Online-Learning Approach to Inverse Optimization</a> with <a href="http://www.am.uni-erlangen.de/index.php?id=229">Andreas Bärmann</a>, <a href="https://www.am.uni-erlangen.de/?id=199">Alexander Martin</a>, and <a href="https://www.mso.math.fau.de/edom/team/schneider-oskar/oskar-schneider/">Oskar Schneider</a>, where we show how methods from online learning can be used to learn a hidden objective of a decision-maker in the context of Mixed-Integer Programs and more general (not necessarily convex) optimization problems.</em> <!--more--></p> <h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2> <p>We often face the situation in which we observe a decision-maker—let’s call her Alice—who is making “reasonably optimal” decisions with respect to some private objective function, and another party—let’s call him Bob—would like to make decisions that emulate Alice’s decisions in terms of quality with respect to <em>Alice’s private objective function</em>. Classical applications where this naturally occurs include learning customer preferences from observed behavior in order to recommend new products that match the customer’s preference, or dynamic routing, where we observe routing decisions of individual participants but cannot directly observe, e.g., travel times. The formal name for the problem that we consider is <em>inverse optimization</em>; informally speaking, we simply want to <em>emulate the expert</em>. 
For completeness: in reinforcement learning, what we want to achieve would be referred to as <em>inverse reinforcement learning</em>.</p> <p>The following graph lays out the basic setup that we consider:</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/setup-inverse-opt.png" alt="Setup Emulating Expert" /></p> <p>In summary, Alice is solving</p> <script type="math/tex; mode=display">x_t \doteq \arg \min_{x \in P_t} c_{true}^\intercal x,</script> <p>and Bob can solve</p> <script type="math/tex; mode=display">\bar x_t \doteq \arg \min_{x \in P_t} c_t^\intercal x,</script> <p>for some guessed objective $c_t$, and after Bob played his decision $\bar x_t$, he observes Alice’s decision $x_t$ taken with respect to her <em>private</em> objective $c_{true}$. For each time step $t \in [T]$, the set $P_t$ is some feasible set of decisions over which Alice and Bob can optimize their respective (linear) objective functions; the interesting case is where $P_t$ varies over time, so that Alice’s decision $x_t$ is round-dependent. Note that we can accommodate basically arbitrary (potentially non-linear) function families as long as we have a reasonable “basis” for this family; the interested reader might check the paper for details.</p> <p>Learning to emulate Alice’s decisions $x_t$ seems to be almost impossible to accomplish at first:</p> <ol> <li>We potentially obtain very little information from Alice’s decision $x_t$ alone.</li> <li>The objective that explains Alice’s decisions might not be unique.</li> </ol> <p>However, it turns out that under reasonable assumptions, namely that Alice’s decisions are reasonably close to the optimal ones with respect to $c_{true}$ and that we see a number of examples that are “diverse enough” as necessitated by the specifics of the instance, we in fact <em>can</em> learn an <em>equivalent</em> objective that renders Alice’s solutions basically optimal w.r.t. this learned proxy objective. 
In fact, one well-known way to solve an offline variant of this problem and obtain such a proxy objective is via dualization or KKT-system approaches. For example, in the case of <em>linear programs</em> this can be done as follows:</p> <p class="mathcol"><strong>Remark (LP case).</strong> Suppose that $P_t \doteq \setb{x \in \RR^n \mid A_t x \leq b_t}$ for $t \in [T]$ and assume that we have a polyhedral feasible region $F = \setb{c \in \RR^n \mid Bc \leq d}$ for the candidate objectives. Then the linear program $\min \sum_{t = 1}^T (b_t^\intercal y_t - c^\intercal x_t) \qquad$ $A_t^\intercal y_t = c \qquad \forall t \in [T]$ $y_t \geq 0 \qquad \forall t \in [T]$ $Bc \leq d,$ where $c$ and the $y_t$ are variables and the rest is input data, computes a linear objective $c$ (if feasible and bounded) so that for all $t \in [T]$ it holds that $c^\intercal x_t = \max_{x \in P_t} c^\intercal x.$</p> <p>While the above can also be reasonably extended to convex programs via solving the KKT system instead, it has two disadvantages:</p> <ol> <li>It is an <em>offline</em> approach: first collect data and <em>then</em> regress out a proxy objective, i.e., <em>first-learn-then-optimize</em>, which might be problematic in many applications.</li> <li>Additionally, and no less severe, this <em>only</em> works for linear programs (convex programs) and not for Mixed-Integer Programs or more general optimization problems, as, due to non-convexity, the KKT system or the dual program is not defined/available in this case.</li> </ol> <h2 id="our-results">Our results</h2> <p>Our method alleviates both of the above shortcomings by providing an <em>online learning algorithm</em>, where we learn a proxy objective equivalent to Alice’s objective <em>while</em> we are participating in the decision-making process, i.e., our algorithm is an online algorithm. 
Moreover, our approach is general enough to apply to a wide variety of optimization problems (including MIPs etc.) as it only relies on standard regret guarantees and (approximate) optimality of Alice’s decisions. More precisely, we provide an online learning algorithm—using either Multiplicative Weights Updates (MWU) or Online Gradient Descent (OGD) as a black box—that ensures the following guarantee.</p> <p class="mathcol"><strong>Theorem [BMPS, BPS].</strong> With the notation from above, the online learning algorithm ensures $0 \leq \frac{1}{T} \sum_{t = 1}^T (c_t - c_{true})^\intercal (\bar x_t - x_t) \leq O\left(\sqrt{\frac{1}{T}}\right),$ where the constant hidden in the $O$-notation depends on the used algorithm (either MWU or OGD) and the (maximum) diameter of the feasible regions $P_t$.</p> <p>In particular, note that in the above</p> <script type="math/tex; mode=display">(c_t - c_{true})^\intercal (\bar x_t - x_t) = \underbrace{c_t^\intercal (\bar x_t - x_t)}_{\geq 0} + \underbrace{c_{true}^\intercal (x_t - \bar x_t)}_{\geq 0},</script> <p>where the nonnegativity arises from the optimality of $x_t$ w.r.t. $c_{true}$ and the optimality of $\bar x_t$ w.r.t. $c_t$. We therefore obtain in particular that</p> <p>$0 \leq \frac{1}{T} \sum_{t = 1}^T c_t^\intercal (\bar x_t - x_t) \leq O\left(\sqrt{\frac{1}{T}}\right),$</p> <p>and</p> <p>$0 \leq \frac{1}{T} \sum_{t = 1}^T c_{true}^\intercal (x_t - \bar x_t) \leq O\left(\sqrt{\frac{1}{T}}\right),$</p> <p>hold, where the right-hand sides tend to $0$ for $T \rightarrow \infty$. Thus Bob’s decisions $\bar x_t$ converge to decisions that, on average, are close in cost to Alice’s decisions $x_t$ not only w.r.t. $c_t$ but <em>also</em> w.r.t. $c_{true}$, although we might never actually observe $c_{true}$. In the paper we also consider special cases under which we can ensure to recover $c_{true}$ itself and not just an equivalent function. 
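To make the feedback loop tangible, here is a small self-contained toy sketch of the scheme with MWU as the black box. This is not the exact algorithm or setup from the paper: for illustration we assume $c_{true}$ lies in the probability simplex, each $P_t$ is a random finite point set in $[0,1]^n$, and both players maximize their linear objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 500
c_true = rng.random(n)
c_true /= c_true.sum()                       # toy assumption: Alice's private objective lies in the simplex
c = np.full(n, 1.0 / n)                      # Bob's initial guess: uniform
eta = np.sqrt(np.log(n) / T)                 # standard MWU step size

errors = []
for t in range(T):
    P_t = rng.random((20, n))                # round-t decision set: 20 random points in [0,1]^n
    x_alice = P_t[np.argmax(P_t @ c_true)]   # Alice maximizes her private objective
    x_bob = P_t[np.argmax(P_t @ c)]          # Bob maximizes his current guess
    errors.append((c - c_true) @ (x_bob - x_alice))   # per-round total error, always >= 0
    c = c * np.exp(-eta * (x_bob - x_alice)) # MWU update with loss vector (Bob's choice - Alice's choice)
    c /= c.sum()

avg = sum(errors) / T
print(avg)                                   # averaged total error, O(sqrt(log(n)/T)) by the MWU regret bound
```

With the loss vector $\bar x_t - x_t$, the MWU regret bound against the fixed comparator $c_{true}$ is exactly the averaged total error from the theorem, and each per-round term is nonnegative by the two optimality properties discussed above.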
One way of thinking about our online learning algorithm is that it provides an approximate solution to the (inaccessible) KKT system that we would like to solve. In fact, in the case of, e.g., LPs it can be shown that our algorithm solves a dual program similar to the one from above by means of gradient descent (or mirror descent).</p> <p>The key question, of course, is whether our algorithm actually works in practice. And the answer is <em>yes</em>. The left plot shows the convergence of the total error $(c_t - c_{true})^\intercal (\bar x_t - x_t)$ over $t \in [T]$ in each round (red dots) as well as the cumulative average error up to that point (blue line) for an integer knapsack problem with $n = 1000$ items over $T = 1000$ rounds, using MWU as the black-box algorithm. The proposed algorithm is also rather stable and consistent across instances in terms of convergence, as can be seen in the right plot, where we consider the statistics of the total error over $500$ runs for a linear knapsack problem with $n = 50$ items over $T = 500$ rounds. Here we depict the mean total error averaged up to time $\ell$, i.e., $\frac{1}{\ell} \sum_{t = 1}^{\ell} (c_t - c_{true})^\intercal (\bar x_t - x_t)$, and associated error bands.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/online-learning-comp.png" alt="Convergence of Total Error and Statistics" /></p> <h3 id="a-note-on-generalization">A note on generalization</h3> <p>If the varying decision environments $P_t$ are drawn i.i.d. from some distribution $\mathcal D$, then also a reasonable form of generalization to unseen realizations of the decision environment $P_t$ drawn from distribution $\mathcal D$ can be shown, provided that we have seen enough examples within the learning process. 
For this one can show that after a sufficient number of samples $T$ it holds that</p> <p>$\frac{1}{T} \sum_{t = 1}^T c_{true}^\intercal x_t \approx \mathbb E_{\mathcal D} [c_{true}^\intercal \tilde x],$</p> <p>where $\tilde x = \arg \max_{x \in P} c_{true}^\intercal x$ for $P \sim \mathcal D$; one then applies the regret bound, which provides</p> <p>$\frac{1}{T} \sum_{t = 1}^T c_{true}^\intercal x_t \approx \frac{1}{T} \sum_{t = 1}^T c_{true}^\intercal \bar x_t,$</p> <p>so that roughly</p> <p>$\frac{1}{T} \sum_{t = 1}^T c_{true}^\intercal \bar x_t \approx \mathbb E_{\mathcal D} [c_{true}^\intercal \tilde x],$</p> <p>follows. This can be made precise by working out the number of samples needed so that the approximation errors above are of the order of a given $\varepsilon &gt; 0$.</p> <h3 id="references">References</h3> <p>[BMPS] Bärmann, A., Martin, A., Pokutta, S., &amp; Schneider, O. (2018). An Online-Learning Approach to Inverse Optimization. arXiv preprint arXiv:1810.12997. <a href="https://arxiv.org/abs/1810.12997">arxiv</a></p> <p>[BPS] Bärmann, A., Pokutta, S., &amp; Schneider, O. (2017, July). Emulating the Expert: Inverse Optimization through Online Learning. In International Conference on Machine Learning (pp. 400-410). <a href="http://proceedings.mlr.press/v70/barmann17a.html">pdf</a></p>
Sebastian Pokutta
Toolchain Tuesday No. 
3 (2018-11-19, http://www.pokutta.com/blog/random/2018/11/19/toolchain-3)
<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em> <!--more--></p> <p>This is the third installment of a series of posts; the <a href="/blog/pages/toolchain.html">full list</a> is expanding over time. This time around it will be <code class="highlighter-rouge">Markdown</code>-heavy.</p> <h2 id="software">Software:</h2> <h3 id="jekyll">Jekyll</h3> <p>Static website and blog generator.</p> <p><em>Learning curve: ⭐️⭐️⭐️⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://jekyllrb.com/">https://jekyllrb.com/</a></em></p> <p><code class="highlighter-rouge">Jekyll</code> is an extremely useful piece of software. Once set up, it basically allows you to turn markdown documents into webpages or blog posts: think <em>compiling</em> your webpage similarly to how you would use <code class="highlighter-rouge">LaTeX</code>. It supports all types of plugins and can be customized extensively. However, the learning curve is <em>steep</em>: lots of moving pieces such as <code class="highlighter-rouge">css</code> templates, <code class="highlighter-rouge">ruby</code> code, <code class="highlighter-rouge">yaml</code> (data-oriented markup <a href="https://en.wikipedia.org/wiki/YAML">[link]</a>), and <code class="highlighter-rouge">liquid</code> (template language <a href="https://github.com/Shopify/liquid/wiki">[link]</a>). Nonetheless, it is definitely worth it and I highly recommend investing the time next time you need to redo your webpage. I am running both my homepage <em>and</em> blog via <code class="highlighter-rouge">Jekyll</code>. 
Just to give you an example of how easy things are once set up:</p> <figure class="highlight"><pre><code class="language-liquid" data-lang="liquid">--- layout: landing author_profile: true title: "Publications of Sebastian Pokutta" --- **In Preparation / Articles Pending Review.** <span class="p">{%</span><span class="w"> </span><span class="nt">bibliography</span><span class="w"> </span>--query<span class="w"> </span>@*[ptype<span class="w"> </span><span class="na">~</span><span class="o">=</span><span class="w"> </span>preprint]<span class="w"> </span><span class="p">%}</span> **Refereed Conference Proceedings.** <span class="p">{%</span><span class="w"> </span><span class="nt">bibliography</span><span class="w"> </span>--query<span class="w"> </span>@*[ptype<span class="w"> </span><span class="na">~</span><span class="o">=</span><span class="w"> </span>conference]<span class="w"> </span><span class="p">%}</span> **Refereed Journals.** <span class="p">{%</span><span class="w"> </span><span class="nt">bibliography</span><span class="w"> </span>--query<span class="w"> </span>@*[ptype<span class="w"> </span><span class="na">~</span><span class="o">=</span><span class="w"> </span>journal]<span class="w"> </span><span class="p">%}</span> **Unpublished Manuscripts.** <span class="p">{%</span><span class="w"> </span><span class="nt">bibliography</span><span class="w"> </span>--query<span class="w"> </span>@*[ptype<span class="w"> </span><span class="na">~</span><span class="o">=</span><span class="w"> </span>unpublished]<span class="w"> </span><span class="p">%}</span> **Other.** <span class="p">{%</span><span class="w"> </span><span class="nt">bibliography</span><span class="w"> </span>--query<span class="w"> </span>@*[ptype<span class="w"> </span><span class="na">~</span><span class="o">=</span><span class="w"> </span>other]<span class="w"> </span><span class="p">%}</span></code></pre></figure> <p>These few lines of <code class="highlighter-rouge">markdown</code> and 
<code class="highlighter-rouge">liquid</code> generate <a href="http://www.pokutta.com/publications/">my publication list</a> from a bibtex file, pulling out the URLs and adding links to (1) arxiv, (2) the journal, and (3) summaries on the blog here (if they exist).</p> <p>Also all the formulas in the posts are handled via <code class="highlighter-rouge">jekyll</code> + <code class="highlighter-rouge">markdown</code> + <code class="highlighter-rouge">mathjax</code>:</p> <div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gs">**Typesetting formulas with markdown:**</span> $$<span class="se">\m</span>in_{v <span class="se">\i</span>n P} <span class="se">\l</span>angle <span class="se">\n</span>abla f(x), v <span class="se">\r</span>angle.$$ </code></pre></div></div> <p>gives:</p> <p><strong>Typesetting formulas with markdown:</strong></p> <script type="math/tex; mode=display">\min_{v \in P} \langle \nabla f(x), v \rangle.</script> <p>In fact, <a href="https://pages.github.com/">GitHub pages</a> is driven by <code class="highlighter-rouge">Jekyll</code>, so that you most likely have already encountered it without knowing.</p> <h3 id="markdown">Markdown</h3> <p>Versatile plain text format that can be converted into almost anything.</p> <p><em>Learning curve: ⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://en.wikipedia.org/wiki/Markdown">https://en.wikipedia.org/wiki/Markdown</a></em></p> <p>From <a href="https://en.wikipedia.org/wiki/Markdown">Wikipedia</a>:</p> <blockquote> <p>Markdown is a lightweight markup language with plain text formatting syntax. 
Its design allows it to be converted to many output formats.</p> </blockquote> <p>By now I use Markdown for a variety of things as it can be converted into almost any output format by either dedicated converters (e.g., <code class="highlighter-rouge">Jekyll</code> discussed above to turn it into HTML) or universal converters such as <code class="highlighter-rouge">pandoc</code> (see below).</p> <h3 id="pandoc">Pandoc</h3> <p>Universal document converter. Great together with Markdown.</p> <p><em>Learning curve: ⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://pandoc.org/">https://pandoc.org/</a></em></p> <p><code class="highlighter-rouge">pandoc</code> (among other output converters) is what makes Markdown extremely powerful. Consider this Markdown file named <code class="highlighter-rouge">SLIDES</code>:</p> <div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Eating Habits % John Doe % March 22, 2005 <span class="gh"># In the morning</span> <span class="p"> -</span> Eat eggs <span class="p">-</span> Drink coffee <span class="gh"># In the evening</span> <span class="p"> -</span> Eat spaghetti <span class="p">-</span> Drink wine <span class="gh"># Conclusion</span> <span class="p"> -</span> And the answer is... <span class="p">-</span> $f(x)=<span class="se">\s</span>um_{n=0}^<span class="se">\i</span>nfty<span class="se">\f</span>rac{f^{(n)}(a)}{n!}(x-a)^n$ </code></pre></div></div> <p>A simple <code class="highlighter-rouge">pandoc -s --mathml -i -t dzslides SLIDES -o example16a.html</code> turns this into <a href="https://pandoc.org/demo/example16a.html">HTML slides</a>. Want a different slide format? 
Try, e.g., <code class="highlighter-rouge">pandoc -s --webtex -i -t slidy SLIDES -o example16b.html</code> with those <a href="https://pandoc.org/demo/example16b.html">HTML slides</a> or <code class="highlighter-rouge">pandoc -s --mathjax -i -t revealjs SLIDES -o example16d.html</code> with those <a href="https://pandoc.org/demo/example16d.html">HTML slides</a>. Don’t like HTML slides and want <code class="highlighter-rouge">beamer</code> instead? <code class="highlighter-rouge">pandoc -t beamer SLIDES -o example8.pdf</code> does the job and you get <a href="https://pandoc.org/demo/example8.pdf">this</a>. Just want a regular pdf? <code class="highlighter-rouge">pandoc SLIDES --pdf-engine=xelatex -o example13.pdf</code>. Want a <code class="highlighter-rouge">Microsoft Word</code> file (including the formulas)? <code class="highlighter-rouge">pandoc -s SLIDES -o example29.docx</code>… the possibilities are endless and all derived from a <em>single</em> original format. (Near) perfect separation of content and presentation.</p>
Cheat Sheet: Hölder Error Bounds for Conditional Gradients (2018-11-11, http://www.pokutta.com/blog/research/2018/11/11/heb-conv)
<p><em>TL;DR: Cheat Sheet for convergence of Frank-Wolfe algorithms (aka Conditional Gradients) under the Hölder Error Bound (HEB) condition, or how to interpolate between convex and strongly convex convergence rates. Continuation of the Frank-Wolfe series. 
Long and technical.</em> <!--more--></p> <p><em>Posts in this series (so far).</em></p> <ol> <li><a href="/blog/research/2018/12/06/cheatsheet-smooth-idealized.html">Cheat Sheet: Smooth Convex Optimization</a></li> <li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li> <li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li> <li><a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li> <li><a href="/blog/research/2019/02/27/cheatsheet-nonsmooth.html">Cheat Sheet: Subgradient Descent, Mirror Descent, and Online Learning</a></li> </ol> <p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p> <p>In this third installment of the series on Conditional Gradients, I will talk about the <em>Hölder Error Bound (HEB) condition</em>. This post is going to be slightly different from the previous ones, as the conditional gradients part will be basically a simple corollary to our discussion of the general (constrained or unconstrained) case here. The HEB condition is extremely useful in general for establishing convergence rates, and I will first talk about how it compares to, e.g., strong convexity, when it holds, etc. All these aspects will be independent of Frank-Wolfe per se. Going from the general case to Frank-Wolfe is then basically a simple corollary, except for some non-trivial technical challenges; but those are really just that: technical challenges.</p> <p>I will stick to the notation from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> and will refer to it frequently, so you might want to give it a quick refresher or read. 
As before I will use Frank-Wolfe [FW] and Conditional Gradients [CG] interchangeably.</p> <h2 id="the-hölder-error-bound-heb-condition">The Hölder Error Bound (HEB) condition</h2> <p>We have seen that in general (without acceleration), we can obtain a rate of basically $O(1/\varepsilon)$ for the smooth and convex case and a rate of basically $O(\log 1/\varepsilon)$ in the smooth and strongly convex case. A natural question to ask is what happens in between these two extremes, i.e., are there functions that converge with a rate of, e.g., $O(1/\varepsilon^p)$? The answer is <em>yes</em> and the HEB condition basically allows us to interpolate smoothly between the two regimes, depending of course on the properties of the function under consideration.</p> <p>For the sake of continuity we work here assuming the constrained case as we will aim for applications to Frank-Wolfe later; however, the discussion holds more broadly for the unconstrained case as well; simply replace $P$ with $\RR^n$. In the following let $\Omega^\esx$ denote the set of optimal solutions to $\min_{x \in P} f(x)$ (there might be multiple) and let $f^\esx \doteq \min_{x \in P} f(x)$. Moreover, we will always assume that $x^\esx \in \Omega^\esx$.</p> <p class="mathcol"><strong>Definition (Hölder Error Bound (HEB) condition).</strong> A convex function $f$ satisfies the <em>Hölder Error Bound (HEB) condition on $P$</em> with parameters $0 &lt; c &lt; \infty$ and $\theta \in [0,1]$ if for all $x \in P$ it holds: $c (f(x) - f^\esx)^\theta \geq \min_{y \in \Omega^\esx} \norm{x-y}.$</p> <p>Note that to simplify the exposition we assume here that the condition holds for all $x \in P$.
Usually this is only assumed for a compact convex subset $K$ with $\Omega^\esx \subseteq K \subseteq P$, requiring an initial burn-in phase of the algorithm until the condition is satisfied; we ignore this subtlety here.</p> <p>As far as I can see, this condition basically goes back to [L] and has been studied extensively since then, see e.g., [L2] and [BLO]; if anyone has more accurate information please ping me. What this condition measures is how <em>sharply</em> the function increases around the (set of) optimal solution(s), which is why this condition is sometimes also referred to as the <em>sharpness condition</em>. It is also important to note that the definition here depends on $P$ and the set of minimizers $\Omega^\esx$, whereas e.g., strong convexity is a <em>global</em> property of the function <em>independent</em> of $P$. Before delving further into HEB, we might wonder whether there are functions that satisfy this condition that are not strongly convex.</p> <p class="mathcol"><strong>Example.</strong> A simple optimization problem with a function that satisfies the HEB condition with non-trivial parameters is, e.g., $\min_{x \in P} \norm{x-\bar x}_2^\alpha,$ where $\bar x \in \RR^n$ and $\alpha \geq 2$. In this case we obtain $\theta = 1/\alpha$. The function to be minimized is not strongly convex for $\alpha &gt; 2$.</p> <p>So the HEB condition is <em>more general</em> than strong convexity and, as we will see further below, it is also much weaker: it requires <em>less</em> from a <em>given</em> function (compared to strong convexity) and at the same time works for functions that are not covered by strong convexity.</p> <p>The following graph depicts functions with varying $\theta$. All functions with $\theta &lt; 1/2$ are <em>not</em> strongly convex.
Those with $\theta &gt; 1/2$ are only depicted for illustration here: they curve faster than the power of the (standard) smoothness that we use (as we will discuss briefly below) and we will therefore be limited to functions with $0 \leq \theta \leq 1/2$, where $\theta = 0$ does not provide any additional information beyond what we get from the basic convexity assumption and $\theta = 1/2$ will essentially provide information very similar to the strongly convex case (and will lead to similar rates). If $\theta &gt; 1/2$ is desired, then the notion of smoothness has to be adjusted as well, as briefly outlined in the <em>Hölder smoothness</em> section.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/heb-functions.png" alt="HEB examples" /></p> <p class="mathcol"><strong>Remark (Smoothness limits $\theta$).</strong> We only consider smooth functions as we aim for applying HEB to conditional gradient methods later. This implies that the case $\theta &gt; 1/2$ is impossible in general: suppose that $x^\esx$ is an optimal solution in the relative interior of $P$. Then $\nabla f(x^\esx) = 0$ and by smoothness we have $f(x) - f(x^\esx) \leq \frac{L}{2} \norm{x- x^\esx}^2$ and via HEB we have $\frac{1}{c^{1/\theta}} \norm{x - x^\esx}^{1/\theta} \leq f(x) - f(x^\esx)$, so that we obtain: $\frac{1}{c^{1/\theta}} \norm{x - x^\esx}^{1/\theta} \leq f(x) - f(x^\esx) \leq \frac{L}{2} \norm{x- x^\esx}^2,$ and hence $K \leq \norm{x- x^\esx}^{2\theta-1}$ for some constant $K&gt; 0$. If now $\theta &gt; 1/2$ this inequality cannot hold as $x \rightarrow x^\esx$. However, in the <em>non-smooth</em> case, the HEB condition with, e.g., $\theta = 1$ might easily hold, as seen for example by choosing $f(x) = \norm{x}$.
By a similar argument applied in reverse, we can see that $0 \leq \theta &lt; 1/2$ can only be expected to hold on a bounded set in general: using $K \leq \norm{x- x^\esx}^{2\theta-1}$ from above now with $2 \theta &lt; 1$ it follows that $\norm{x- x^\esx}^{2\theta-1} \rightarrow 0$, when $x$ follows an unbounded direction with $\norm{x} \rightarrow \infty$.</p> <h3 id="from-heb-to-primal-gap-bounds">From HEB to primal gap bounds</h3> <p>The ultimate reason why we care for the HEB condition is that it immediately provides a bound on the primal optimality gap by a straightforward combination with convexity:</p> <p class="mathcol"><strong>Lemma (HEB primal gap bounds).</strong> Let $f$ satisfy the HEB condition on $P$ with parameters $c$ and $\theta$. Then it holds: $\tag{HEB primal bound} f(x) - f^\esx \leq c^{\frac{1}{1-\theta}} \left(\frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}\right)^{\frac{1}{1-\theta}},$ or equivalently, $\tag{HEB primal bound} \frac{1}{c}(f(x) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}$ for any $x^\esx \in P$ with $f(x^\esx) = f^\esx$. 
<br /></p> <p><em>Proof.</em> By first applying convexity and then the HEB condition for any $x^\esx \in \Omega^\esx$ with $f(x^\esx) = f^\esx$ it holds:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} f(x) - f^\esx & = f(x) - f(x^\esx) \leq \langle \nabla f(x), x - x^\esx \rangle \\ & = \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} \norm{x - x^\esx} \\ & \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} c (f(x) - f^\esx)^\theta, \end{align*} %]]></script> <p>so we obtain $\frac{1}{c}(f(x) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}},$ or equivalently $f(x) - f^\esx \leq c^{\frac{1}{1-\theta}} \left(\frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}\right)^{\frac{1}{1-\theta}}.$ $\qed$</p> <p class="mathcol"><strong>Remark (Relation to the gradient dominated property).</strong> Estimating $\frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} \leq \norm{\nabla f(x)}$, we obtain the weaker condition: $f(x) - f^\esx \leq c^{\frac{1}{1-\theta}} \norm{\nabla f(x)}^{\frac{1}{1-\theta}},$ which is known as the <em>gradient dominated property</em> introduced in [P]. If $\Omega^\esx \subseteq \operatorname{rel.int}(P)$, then the two conditions are equivalent and for simplicity we will use the weaker version below in our example where we show that the Scaling Frank-Wolfe algorithm adapts dynamically to the HEB bound <em>if</em> the optimal solution(s) are contained in the (strict) relative interior. 
However, if the optimal solution(s) are on the boundary of $P$, as is not infrequently the case, then the two conditions <em>are not</em> equivalent as $\norm{\nabla f(x)}$ might not vanish for $x \in \Omega^\esx$, whereas $\langle \nabla f(x), x - x^\esx \rangle$ does, i.e., (HEB primal bound) is tighter than the one induced by the gradient dominated property; we have seen this difference before when we analyzed linear convergence in the <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">last post</a>.</p> <h3 id="heb-and-strong-convexity">HEB and strong convexity</h3> <p>We will now show that strong convexity implies the HEB condition, which together with (HEB primal bound) provides a bound on the primal gap, albeit a slightly weaker one than if we had directly used strong convexity to obtain the bound. We briefly recall the definition of strong convexity.</p> <p class="mathcol"><strong>Definition (strong convexity).</strong> A convex function $f$ is said to be <em>$\mu$-strongly convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \nabla f(x)(y-x) + \frac{\mu}{2} \norm{x-y}^2</script>.</p> <p>Plugging in $x \doteq x^\esx$ with $x^\esx \in \Omega^\esx$ in the above, we obtain $\nabla f(x^\esx)(y-x^\esx) \geq 0$ for all $y \in P$ by first-order optimality and therefore the condition</p> <p>$f(y) - f(x^\esx) \geq \frac{\mu}{2} \norm{x^\esx -y}^2,$</p> <p>for all $y \in P$ and rearranging leads to</p> <p>$\tag{HEB-SC} \left(\frac{2}{\mu}\right)^{1/2} (f(y) - f(x^\esx))^{1/2} \geq \norm{x^\esx -y},$</p> <p>for all $y \in P$, which is the HEB condition with specific parameterization $\theta = 1/2$ and $c=\left(2/\mu\right)^{1/2}$. However, here and in the HEB condition we <em>only</em> require this behavior around the optimal solution $x^\esx \in \Omega^\esx$ (which is unique in the case of strong convexity).
The strong convexity condition, however, is a global condition, required for <em>all</em> $x,y \in \mathbb R^n$ (and not just $x = x^\esx \in \Omega^\esx$).</p> <p>If we now plug in the parameters from (HEB-SC) into (HEB primal bound), we obtain:</p> <script type="math/tex; mode=display">f(x) - f(x^\esx) \leq 2 \frac{\langle \nabla f(x), x - x^\esx \rangle^2}{\mu \norm{x - x^\esx}^2}.</script> <p>Note that the strong convexity induced bound obtained this way is a factor of $4$ weaker than the bound obtained in the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post in this series</a> via optimizing out the strong convexity inequality. On the other hand we have used a simpler estimation here not relying on <em>any</em> gradient information as compared to the stronger bound. This weaker estimation will lead to slightly weaker convergence rate bounds: basically we lose the same $4$ in the rate.</p> <h3 id="when-does-the-heb-condition-hold">When does the HEB condition hold</h3> <p>In fact, it turns out that the HEB condition holds almost always with some (potentially bad) parameterization for reasonably well-behaved functions (those that we usually encounter). For example, if $P$ is compact, $\theta = 0$ and $c$ large enough will always work and the condition becomes trivial. However, HEB often also holds with <em>non-trivial</em> parameterization and for wide classes of functions; the interested reader is referred to [BDL] and references contained therein for an in-depth discussion. Just to give a glimpse, at the core of those arguments are variants of the <em>Łojasiewicz Inequality</em> and the <em>Łojasiewicz Factorization Lemma</em>.</p> <p class="mathcol"><strong>Lemma (Łojasiewicz Inequality; see [L] and [BDL]).</strong> Let $f: \operatorname{dom} f \subseteq \RR^n \rightarrow \RR$ be a lower semi-continuous and subanalytic function.
Then for any compact set $C \subseteq \operatorname{dom} f$ there exist $c, \theta &gt; 0$, so that $c (f(x) - f^\esx)^\theta \geq \min_{y \in \Omega^\esx} \norm{x-y}$ for all $x \in C$.</p> <h3 id="hölder-smoothness">Hölder smoothness</h3> <p>Without going into any detail here, I would like to remark that the smoothness condition can also be weakened in a similar fashion, basically requiring (only) Hölder continuity of the gradients, i.e.,</p> <p class="mathcol"><strong>Definition (Hölder smoothness).</strong> A convex function $f$ is said to be <em>$(s,L)$-Hölder smooth</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \leq \nabla f(x)(y-x) + \frac{L}{s} \| x-y \|^s</script>.</p> <p>Using this more general definition of smoothness, an analogous discussion with the obvious modifications applies; e.g., the progress guarantee from smoothness now has to be adapted. The interested reader is referred to [RA] for more details and the relationship between $s$ and $\theta$.</p> <h2 id="faster-rates-via-heb">Faster rates via HEB</h2> <p>We will now show how HEB can be used to obtain faster rates. We will first consider the impact of HEB from a theoretical perspective and then we will discuss how faster rates via HEB can be obtained in practice.</p> <h3 id="theoretically-faster-rates">Theoretically faster rates</h3> <p>Let us assume that we run a hypothetical first-order algorithm with updates of the form $x_{t+1} \leftarrow x_t - \eta_t d_t$ for some step length $\eta_t$ and direction $d_t$.
To this end, recall from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> that the progress at some point $x$ induced by smoothness for a direction $d$ is given by (via a short step)</p> <p class="mathcol"><strong>Progress induced by smoothness:</strong> $f(x_{t}) - f(x_{t+1}) \geq \frac{\langle \nabla f(x_t), d\rangle^2}{2L \norm{d}^2},$</p> <p>and in particular for the direction pointing towards the optimal solution $d \doteq \frac{x_t - x^\esx}{\norm{x_t - x^\esx}}$ this becomes:</p> <script type="math/tex; mode=display">\underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle^2}{2L \norm{x_t - x^\esx}^2}.</script> <p>At the same time, via (HEB primal bound) we have</p> <p class="mathcol"><strong>Primal bound via HEB:</strong> $\frac{1}{c}(f(x_t) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}.$</p> <p>Chaining these two inequalities together we obtain</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} f(x_{t}) - f(x_{t+1}) & \geq \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle^2}{2L \norm{x_t - x^\esx}^2} \\ & \geq \frac{\left(\frac{1}{c}(f(x_t) - f^\esx)^{1-\theta} \right)^2}{2L}. \end{align*} %]]></script> <p>and so, rearranging in terms of $h_t \doteq f(x_t) - f(x^\esx)$, we get</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} h_{t+1} & \leq h_t - \frac{\frac{1}{c^2}h_t^{2-2\theta}}{2L} \\ & = h_t \left(1 - \frac{1}{2Lc^2} h_t^{1-2\theta}\right). \end{align*} %]]></script> <p>If $\theta = 1/2$, then we obtain linear convergence with the usual arguments.
Otherwise, whenever we have a contraction of the form $h_{t+1} \leq h_t \left(1 - Mh_t^{\alpha}\right)$ with $\alpha &gt; 0$, it can be shown by induction plus some estimations that $h_t \leq O(1) \left(\frac{1}{t} \right)^\frac{1}{\alpha}$, so that we obtain</p> <script type="math/tex; mode=display">h_t \leq O(1) \left(\frac{1}{t} \right)^\frac{1}{1-2\theta},</script> <p>or equivalently, to achieve $h_T \leq \varepsilon$, we need roughly $T \geq \Omega\left(\frac{1}{\varepsilon^{1 - 2\theta}}\right)$.</p> <p>Then, as we have done before, in an actual algorithm we use a direction $\hat d_t$ that ensures progress at least as good as from the direction $d_t = \frac{x_t - x^\esx}{\norm{x_t - x^\esx}}$ pointing towards the optimal solution by means of an inequality of the form:</p> <script type="math/tex; mode=display">\frac{\langle \nabla f(x_t), \hat d_t\rangle}{\norm{\hat d_t}} \geq \alpha \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle}{\norm{x_t - x^\esx}},</script> <p>and the argument for a specific algorithm is concluded as we have done before several times.</p> <h3 id="practically-faster-rates">Practically faster rates</h3> <p>If the HEB condition almost always holds with <em>some</em> parameters and we generally can expect faster rates, why is it rather seldom referred to or used (compared to e.g., strong convexity)? The reason for this is that the improved bounds are <em>only</em> useful in practice if the HEB parameters are known in advance, as only then do we know when we can legitimately stop with a guaranteed accuracy. The key to getting around this issue is to use <em>robust restarts</em>, which basically allow us to achieve the rate implied by HEB <em>without</em> requiring knowledge of the parameters; this costs only a constant factor in the convergence rate compared to exactly knowing the parameters. If no error bound criterion is known, then these robust scheduled restarts rely on a grid search over a grid of logarithmic size.
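As an aside, the rate implied by the contraction $h_{t+1} \leq h_t(1 - Mh_t^{\alpha})$ derived above is easy to sanity-check numerically; here is a quick sketch, where the constants $M$, $\alpha$, and $h_0$ are arbitrary illustrative choices and not taken from the analysis:

```python
# Iterate the contraction h_{t+1} = h_t * (1 - M * h_t**alpha) and check the
# predicted rate h_t = O(t**(-1/alpha)). M, alpha, h0 are arbitrary
# illustrative choices with M * h0**alpha < 1, so the iterates stay positive.
M, alpha, h0 = 0.5, 0.5, 1.0
T = 100_000

h = h0
hs = [h]
for _ in range(T):
    h *= 1.0 - M * h**alpha
    hs.append(h)

# If h_t = O(t**(-1/alpha)), then t**(1/alpha) * h_t stays bounded;
# for these parameters it approaches (alpha * M)**(-1/alpha) = 16.
ratio = T ** (1.0 / alpha) * hs[-1]
```

With $\alpha = 1 - 2\theta$ this is exactly the recursion from above, and the bounded ratio confirms the claimed $h_t \leq O(1)\left(\frac{1}{t}\right)^{\frac{1}{\alpha}}$ behavior.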
If an error bound criterion is available, such as the Wolfe gap in our case, then no grid search is required and it basically suffices to restart the algorithm whenever it has closed a (constant) multiplicative fraction of the residual primal gap. The overall complexity bound then arises from estimating how long each such restart takes. Coincidentally, this is exactly what the Scaling Frank-Wolfe algorithm from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> does and we will analyze the algorithm in the next section. For an in-depth discussion, the interested reader is referred to [RA] for the (smooth) unconstrained case and [KDP] for the (smooth) constrained case via Conditional Gradients.</p> <h3 id="a-heb-fw-for-optimal-solutions-in-relative-interior">A HEB-FW for optimal solutions in relative interior</h3> <p>As an application of the above, we will now show that the <em>Scaling Frank-Wolfe algorithm</em> from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> dynamically adjusts to the HEB condition and achieves a HEB-optimal rate up to constant factors (see p.6 of [NN] for the matching lower bound) provided that the optimal solution is contained in the strict interior of $P$; for the general case see [KDP], where we need to employ away steps. Recall from the last post that the reason why we do not need away steps if $x^\esx \in \operatorname{rel.int}(P)$ is that in this case it holds</p> <p>$\frac{\langle \nabla f(x),x - v\rangle}{\norm{x - v}} \geq \alpha \norm{\nabla f(x)},$</p> <p>for some $\alpha &gt; 0$, whenever $v \doteq \arg\min_{x \in P} \langle \nabla f(x), x \rangle$ is the Frank-Wolfe vertex, so that the standard FW direction provides a sufficient approximation of $\norm{\nabla f(x)}$; see the <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">second post</a> for details.
This can be weakened to</p> <p>$\tag{norm approx} \langle \nabla f(x),x - v\rangle \geq \frac{\alpha}{D} \norm{\nabla f(x)},$</p> <p>where $D$ is the diameter of $P$, which is sufficient for our purposes in the following. From this we can derive the operational primal gap bound that we will be working with, by combining (norm approx) with (HEB primal bound):</p> <p>$\tag{HEB-FW PB} f(x) - f^\esx \leq \left(\frac{cD}{\alpha}\right)^{\frac{1}{1-\theta}} \langle \nabla f(x),x - v\rangle^{\frac{1}{1-\theta}}.$</p> <p>Furthermore, let us recall the Scaling Frank-Wolfe algorithm:</p> <p class="mathcol"><strong>Scaling Frank-Wolfe Algorithm [BPZ]</strong> <br /> <em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$. <br /> <em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br /> Compute initial dual gap: $\Phi_0 \leftarrow \max_{v \in P} \langle \nabla f(x_0), x_0 - v \rangle$ <br /> For $t = 0, \dots, T-1$ do: <br /> $\quad$ Find $v_t$ vertex of $P$ such that: $\langle \nabla f(x_t), x_t - v_t \rangle &gt; \Phi_t/2$ <br /> $\quad$ If no such vertex $v_t$ exists: $x_{t+1} \leftarrow x_t$ and $\Phi_{t+1} \leftarrow \Phi_t/2$ <br /> $\quad$ Else: $x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$ and $\Phi_{t+1} \leftarrow \Phi_t$</p> <p>As remarked earlier, the Scaling Frank-Wolfe Algorithm can be seen as a certain variant of a restart scheme, where we ‘restart’ whenever we update $\Phi_{t+1} \leftarrow \Phi_t/2$.
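For concreteness, here is a minimal Python sketch of the algorithm above, with two simplifying assumptions on my part: instead of line search we use the short step $\gamma_t = \min\left(1, \langle \nabla f(x_t), x_t - v_t\rangle / (L \norm{x_t - v_t}^2)\right)$ with an assumed smoothness estimate, and we only test the Frank-Wolfe vertex, which maximizes $\langle \nabla f(x_t), x_t - v \rangle$ over $P$ and hence suffices for the existence check:

```python
import numpy as np

def scaling_frank_wolfe(grad, lmo, x0, T, L_est):
    """Sketch of the Scaling Frank-Wolfe algorithm.

    grad : first-order oracle returning the gradient at x
    lmo  : linear minimization oracle, lmo(c) = argmin over P of <c, v>
    x0   : initial vertex; T: number of iterations
    L_est: assumed smoothness estimate, used for the short step rule
    """
    x = x0
    # initial dual gap: Phi_0 = max over v in P of <grad(x0), x0 - v>
    phi = float(np.dot(grad(x0), x0 - lmo(grad(x0))))
    for _ in range(T):
        g = grad(x)
        v = lmo(g)  # the FW vertex maximizes <g, x - v> over P
        progress = float(np.dot(g, x - v))
        if progress > phi / 2:
            # primal progress step (short step, clipped to stay feasible)
            d = x - v
            gamma = min(1.0, progress / (L_est * float(np.dot(d, d))))
            x = (1 - gamma) * x + gamma * v
        else:
            # dual update step: no vertex achieves progress > Phi/2
            phi /= 2
    return x, phi

# Usage: minimize f(x) = ||x - c||^2 over the probability simplex, whose LMO
# returns the standard basis vector with the smallest gradient entry.
c = np.array([0.1, 0.5, 0.4])        # the optimum, lies in the simplex
grad = lambda x: 2.0 * (x - c)       # f is L-smooth with L = 2
lmo = lambda g: np.eye(len(g))[np.argmin(g)]
x_T, phi_T = scaling_frank_wolfe(grad, lmo, np.array([1.0, 0.0, 0.0]),
                                 T=2000, L_est=2.0)
```

Note that the iterates stay feasible automatically, since every primal step is a convex combination of the current iterate and a vertex.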
The key is that the algorithm is parameter-free (when run with line search), does not require the estimation of HEB parameters, and is essentially optimal; we skip optimizing the factor in the update $\Phi_{t+1} \leftarrow \Phi_t/2$ here, which affects the rate only by a constant factor (in the exponent).</p> <p>We will now show the following lemma, which is a straightforward adaptation from <a href="/blog/research/2018/10/05/cheatsheet-fw.html">the first post</a> incorporating (HEB-FW PB) instead of the vanilla convexity estimation.</p> <p class="mathcol"><strong>Lemma (Scaling Frank-Wolfe HEB convergence).</strong> Let $f$ be a smooth convex function satisfying HEB with parameters $c$ and $\theta$. Then the Scaling Frank-Wolfe algorithm ensures: $h(x_T) \leq \varepsilon \qquad \text{for} \qquad \begin{cases} T \geq (1+K) \left(\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1\right) &amp; \text{ if } \theta = 1/2 \\ T \geq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + \frac{K 4^{-\tau}}{\left(\frac{1}{2^\tau}\right) - 1} \left(\frac{1}{\varepsilon}\right)^{-\tau} &amp; \text{ if } \theta &lt; 1/2 \end{cases},$ where $K \doteq \left(\frac{cD}{2\alpha}\right)^{\frac{1}{1-\theta}} 8LD^2$, $\tau \doteq {\frac{1}{1-\theta}-2}$, and the $\log$ is taken to base $2$.</p> <p><em>Proof.</em> We consider two types of steps: (a) primal progress steps, where $x_t$ is changed, and (b) dual update steps, where $\Phi_t$ is changed. <br /> <br /> Let us start with the dual update step (b).
In such an iteration we know that $\langle \nabla f(x_t), x_t - v \rangle \leq \Phi_t/2$ holds for all $v \in P$, in particular for the Frank-Wolfe vertex, so that (HEB-FW PB) implies $h_t \leq \left(\frac{cD}{\alpha}\right)^{\frac{1}{1-\theta}} (\Phi_t/2)^{\frac{1}{1-\theta}}.$ For a primal progress step (a), we have by the same arguments as before $f(x_t) - f(x_{t+1}) \geq \frac{\Phi_t^2}{8LD^2}.$ From these two inequalities we can conclude the proof as follows: Clearly, to achieve accuracy $\varepsilon$, it suffices to halve $\Phi_0$ at most $\lceil \log \frac{\Phi_0}{\varepsilon}\rceil$ times. Next we bound how many primal progress steps of type (a) we can do between two steps of type (b); we call this a <em>scaling phase</em>. After accounting for the halving at the beginning of the iteration and observing that $\Phi_t$ does not change between any two iterations of type (b), by simply dividing the upper bound on the residual gap by the lower bound on the progress, the number of required steps can be at most $\left(\frac{cD}{\alpha}\right)^{\frac{1}{1-\theta}} (\Phi/2)^{\frac{1}{1-\theta}} \cdot \frac{8LD^2}{\Phi^2} = \underbrace{\left(\frac{cD}{2\alpha}\right)^{\frac{1}{1-\theta}} 8LD^2}_{\doteq K} \cdot \Phi^{\frac{1}{1-\theta}-2},$ where $\Phi$ is the estimate valid for these iterations of type (a). Thus, with $\tau \doteq {\frac{1}{1-\theta}-2}$, the total number of iterations $T$ required to achieve $\varepsilon$-accuracy can be bounded by</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(1 + K (\Phi_0/2^\ell)^\tau \right) & = \underbrace{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil}_{\text{Type (b)}} + \underbrace{K \Phi_0^\tau \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(\frac{1}{2^\tau}\right)^\ell}_{\text{Type (a)}}, \end{align*} %]]></script> <p>where we distinguish two cases. First let $\tau = 0$, and hence $\theta = 1/2$.
This corresponds to the case where we obtain linear convergence as now ${\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + {K \Phi_0^\tau \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(\frac{1}{2^\tau}\right)^\ell} \leq (1+K) \left(\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1\right).$ Now let $\tau &lt; 0$, i.e., $\theta &lt; 1/2$. Then</p> <p><script type="math/tex">% <![CDATA[ \begin{align*} {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + {K \Phi_0^\tau \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(\frac{1}{2^\tau}\right)^\ell} & = {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + K \Phi_0^\tau \frac{1-\left(\frac{1}{2^\tau}\right)^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1}}{1 - \left(\frac{1}{2^\tau}\right)} \\ & \leq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + \frac{K \Phi_0^\tau}{\left(\frac{1}{2^\tau}\right)-1} \left(\frac{1}{2^\tau}\right)^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1} \\ & \leq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + \frac{K \Phi_0^\tau}{\left(\frac{1}{2^\tau}\right)-1} \left(\frac{4\Phi_0}{\varepsilon}\right)^{-\tau} \\ & \leq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + \frac{K 4^{-\tau}}{\left(\frac{1}{2^\tau}\right) - 1} \left(\frac{1}{\varepsilon}\right)^{-\tau} \end{align*} %]]></script> $\qed$</p> <p>So we obtain the following convergence rate regimes:</p> <ol> <li>If $\theta = 1/2$, we obtain linear convergence with a convergence rate that is similar to the rate achieved in the strongly convex case up to a small constant factor, as expected from the discussion before.</li> <li>If $\theta = 0$, then $\tau = -1$ and we obtain the standard rate relying only on smoothness and convexity, namely $O\left(\frac{1}{\varepsilon^{-\tau}}\right) = O\left(\frac{1}{\varepsilon}\right)$.</li> <li>If $0 &lt; \theta &lt; 1/2$, we have with $\tau = {\frac{1}{1-\theta}-2}$ that $0 &lt; 2-\frac{1}{1-\theta} &lt; 1$ and a rate of $O\left(\frac{1}{\varepsilon^{-\tau}}\right) =
O\left(\frac{1}{\varepsilon^{2-\frac{1}{1-\theta}}}\right) = o\left(\frac{1}{\varepsilon}\right)$. This is strictly better than the rate obtained only from convexity and smoothness.</li> </ol> <p>It is helpful to compare the rate $O\left(\frac{1}{\varepsilon^{2-\frac{1}{1-\theta}}}\right)$ with the rate $O\left(\frac{1}{\varepsilon^{1 - 2\theta}}\right)$ that we derived above directly from the contraction. For this we rewrite $2-\frac{1}{1-\theta} = \frac{1-2\theta}{1-\theta}$, so that we have $\varepsilon^{-(1 - 2\theta)}$ vs. $\varepsilon^{- \frac{1 - 2\theta}{1-\theta}}$; maximizing the error in the exponent over $\theta$, we obtain <script type="math/tex">\varepsilon^{-(1 - 2\theta)} \cdot \varepsilon^{-(3-2\sqrt{2})} \geq \varepsilon^{- \frac{1 - 2\theta}{1-\theta}},</script> so that the error in the rate is $\varepsilon^{-(3-2\sqrt{2})} \approx \varepsilon^{-0.17157}$, which is achieved for $\theta = 1- \frac{1}{\sqrt{2}} \approx 0.29289$. This discrepancy arises from the scaling of the dual gap estimate; optimizing the factor $\gamma$ in the update $\Phi_{t+1} \leftarrow \Phi_t/\gamma$ can reduce this further to a constant factor error (rather than a constant exponent error).</p> <p class="mathcol"><strong>Remark (HEB rates for vanilla FW).</strong> Similar HEB rate adaptivity can be shown for the vanilla Frank-Wolfe algorithm in a relatively straightforward way; e.g., a direct adaptation of the proof of [XY] will work. I opted for a proof for the Scaling Frank-Wolfe as I believe it is more straightforward and the Scaling Frank-Wolfe algorithm retains all the advantages discussed in <a href="/blog/research/2018/10/05/cheatsheet-fw.html">the first post</a> under the HEB condition.</p> <p>Finally, a graph showing the behavior of Frank-Wolfe under HEB on the probability simplex of dimension $30$ and function $\norm{x}_2^{1/\theta}$.
As we can see, for $\theta = 1/2$, we observe linear convergence as expected, while for the other values of $\theta$ we observe various degrees of sublinear convergence of the form $O(1/\varepsilon^p)$ with $p \geq 1$. The difference in slope is not quite as pronounced as I had hoped for but, again, the bounds are only upper bounds on the convergence rates.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/heb-simplex-30-noLine.png" alt="HEB with approx minimizer" /></p> <p>Interestingly, when using line search it seems that we still achieve linear convergence and in fact the sharper functions converge <em>faster</em>; note that this can only be a spurious phenomenon or even some bug, given the matching lower bound for our rates in [NN]. This phenomenon <em>might be</em> due to the fact that the progress from smoothness is only an underestimator of the achievable progress and the specific (as in simple) structure of our functions. If time permits I might try to compute the actual optimal progress and see whether faster convergence can be proven. Here is a graph to demonstrate the difference: Frank-Wolfe run on the probability simplex for $n = 100$ and function $\norm{x}_2^{1/\theta}$.</p> <p class="center"><img src="http://www.pokutta.com/blog/assets/heb-simplex-100-comp.png" alt="HEB with line search" /></p> <h3 id="references">References</h3> <p>[CG] Levitin, E. S., &amp; Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&amp;jrnid=zvmmf&amp;paperid=7415&amp;option_lang=eng">pdf</a></p> <p>[FW] Frank, M., &amp; Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p> <p>[L] Łojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels.
Les équations aux dérivées partielles, 117, 87-89.</p> <p>[L2] Łojasiewicz, S. (1993). Sur la géométrie semi-et sous-analytique. Ann. Inst. Fourier, 43(5), 1575-1595. <a href="http://www.numdam.org/article/AIF_1993__43_5_1575_0.pdf">pdf</a></p> <p>[BLO] Burke, J. V., Lewis, A. S., &amp; Overton, M. L. (2002). Approximating subdifferentials by random sampling of gradients. Mathematics of Operations Research, 27(3), 567-584. <a href="https://www.jstor.org/stable/pdf/3690452.pdf?casa_token=WG5QKXxjgU8AAAAA:USOjl9WVAlwxXujFadFmzAmEH5J1JEX5fTr5tikcZPokBgqI6CU6UdMP6gb1Nh771ucW3lAjDD2RZWn5zqlfYPgSbePz1zr8R6dPnPYe7ftU4azql_k">pdf</a></p> <p>[P] Polyak, B. T. (1963). Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4), 643-653.</p> <p>[BDL] Bolte, J., Daniilidis, A., &amp; Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205-1223. <a href="https://epubs.siam.org/doi/pdf/10.1137/050644641?casa_token=FJQHJsH8X7QAAAAA%3AKsy_oqj_H1BsF3MOlJsvVXoGHTGuLCiPXnSFhuWA22CpZ4aZGOpJao-vPuBWzuptLKNQqkDPiA&amp;">pdf</a></p> <p>[RA] Roulet, V., &amp; d’Aspremont, A. (2017). Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems (pp. 1119-1129). <a href="http://papers.nips.cc/paper/6712-sharpness-restart-and-acceleration">pdf</a></p> <p>[KDP] Kerdreux, T., d’Aspremont, A., &amp; Pokutta, S. (2018). Restarting Frank-Wolfe. <a href="https://arxiv.org/abs/1810.02429">pdf</a></p> <p>[XY] Xu, Y., &amp; Yang, T. (2018). Frank-Wolfe Method is Automatically Adaptive to Error Bound Condition. arXiv preprint arXiv:1810.04765. <a href="https://arxiv.org/pdf/1810.04765.pdf">pdf</a></p> <p>[BPZ] Braun, G., Pokutta, S., &amp; Zink, D. (2017, July). Lazifying Conditional Gradient Algorithms. In International Conference on Machine Learning (pp. 566-575). 
<a href="https://arxiv.org/abs/1610.05120">pdf</a></p> <p>[NN] Nemirovskii, A. &amp; Nesterov, Y. E. (1985), Optimal methods of smooth convex minimization, USSR Computational Mathematics and Mathematical Physics 25(2), 21–30.</p>Sebastian Pokutta. TL;DR: Cheat Sheet for convergence of Frank-Wolfe algorithms (aka Conditional Gradients) under the Hölder Error Bound (HEB) condition, or how to interpolate between convex and strongly convex convergence rates. Continuation of the Frank-Wolfe series. Long and technical.Toolchain Tuesday No. 22018-10-23T00:00:00-04:002018-10-23T00:00:00-04:00http://www.pokutta.com/blog/random/2018/10/23/toolchain-2<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em> <!--more--></p> <p>This is the second installment of a series of posts; the <a href="/blog/pages/toolchain.html">full list</a> is expanding over time. This time around it will be about the Python environment that I am using. Python has become my go-to language for rapid prototyping.
In some sense these tools are among the most fundamental ones, yet they do not provide utility by solving a specific problem directly, but rather by <em>accelerating</em> problem solving in general.</p> <h2 id="python-libraries-and-distributions">Python Libraries and Distributions</h2> <h3 id="anaconda">Anaconda</h3> <p>Python distribution geared towards scientific computing and data science applications.</p> <p><em>Learning curve: ⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://www.anaconda.com">https://www.anaconda.com</a></em></p> <p><code class="highlighter-rouge">Anaconda</code> is a very comprehensive and well-maintained Python distribution geared towards scientific computing and data science applications. It uses the <code class="highlighter-rouge">conda</code> package manager, which makes managing packages as well as creating separate environments with different Python versions exceptionally convenient. Learning curve only got ⭐️⭐️ because it is no harder to use than any other Python distribution.</p> <h2 id="software">Software</h2> <h3 id="pycharm">PyCharm</h3> <p>Extremely powerful integrated development environment (IDE) for Python.</p> <p><em>Learning curve: ⭐️⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="https://www.jetbrains.com/pycharm/">https://www.jetbrains.com/pycharm/</a></em></p> <p>Excellent support for coding, from simple things such as syntax highlighting to more complex operations such as refactoring. Support for managing different build/run environments, remote kernels, etc.
Also great for managing larger-scale projects.</p> <h3 id="jupyter">Jupyter</h3> <p>Interactive Python computing.</p> <p><em>Learning curve: ⭐️⭐️⭐️</em> <em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br /> <em>Site: <a href="http://jupyter.org/">http://jupyter.org/</a></em></p> <p>While <code class="highlighter-rouge">PyCharm</code> is great for more traditional development (write code, run, debug, iterate), <code class="highlighter-rouge">Jupyter</code> provides an <em>interactive (Python) computing</em> environment in a web browser (for those in the know, it is basically <code class="highlighter-rouge">IPython</code> on steroids). What it allows you to do is work with data interactively and in real time: cells of code can be executed individually, and results can be reviewed and plotted directly, without having to re-run the whole program. Great, for example, for exploratory data analysis. It allows for significantly faster tinkering with code and data, and once the code is stable, it can easily be transferred into a more traditional Python setup.</p> <p>A typical process that I regularly follow is to first write a library that provides black-box functions for some task and then use Jupyter for very high-level tinkering and modifications.
My Jupyter notebook might look like this:</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">tools</span> <span class="c"># load graph</span> <span class="n">g</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">loadGraph</span><span class="p">(</span><span class="s">"downtown-SF"</span><span class="p">)</span> <span class="c"># compute distance matrix</span> <span class="n">dist</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">shortestPathDistances</span><span class="p">(</span><span class="n">g</span><span class="p">)</span> <span class="c"># solve configurations</span> <span class="n">resCF</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">optimizeFlow</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="n">dist</span><span class="p">,</span><span class="n">congestion</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">resCT</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">optimizeFlow</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="n">dist</span><span class="p">,</span><span class="n">congestion</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c"># compare</span> <span class="n">tools</span><span class="o">.</span><span class="n">plotComparison</span><span class="p">(</span><span class="n">resCF</span><span class="p">,</span><span class="n">resCT</span><span class="p">)</span></code></pre></figure> <p>The <code class="highlighter-rouge">tools</code> library does all the heavy lifting behind the scenes and I use the <code class="highlighter-rouge">Jupyter</code> notebook for very high-level manipulations and
tests.</p>
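<p>To make the black-box pattern concrete, here is a minimal, self-contained sketch of what such a <code class="highlighter-rouge">tools</code> module could look like. The actual library is not shown in this post, so everything below is an illustrative assumption: only the function names mirror the notebook snippet, while the graph format (a plain adjacency dict) and the algorithm choice (Floyd-Warshall for the distance matrix) are stand-ins, and <code class="highlighter-rouge">optimizeFlow</code> and <code class="highlighter-rouge">plotComparison</code> are omitted.</p>

```python
# Hypothetical sketch of the black-box tools library used in the notebook.
# Only loadGraph and shortestPathDistances are sketched; the graph format
# (adjacency dict) and the algorithm choice are illustrative assumptions.

def loadGraph(name):
    """Return a graph as an adjacency dict {node: {neighbor: weight}}.

    A real implementation would read the named graph from disk; here we
    ignore the name and return a small toy graph for illustration.
    """
    return {
        "a": {"b": 1.0, "c": 4.0},
        "b": {"a": 1.0, "c": 1.5},
        "c": {"a": 4.0, "b": 1.5},
    }


def shortestPathDistances(g):
    """All-pairs shortest-path distances via Floyd-Warshall."""
    nodes = list(g)
    # Initialize with direct edge weights (infinity where no edge exists).
    dist = {
        u: {v: 0.0 if u == v else g[u].get(v, float("inf")) for v in nodes}
        for u in nodes
    }
    # Relax paths through each intermediate node k.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

<p>With the heavy lifting hidden behind functions like these, the notebook itself stays a handful of high-level calls, which is exactly what makes the interactive workflow fast.</p>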