<h1>Cheat Sheet: Hölder Error Bounds for Conditional Gradients</h1>
<p><em>2018-11-11</em></p>
<p><em>TL;DR: Cheat Sheet for convergence of Frank-Wolfe algorithms (aka Conditional Gradients) under the Hölder Error Bound (HEB) condition, or how to interpolate between convex and strongly convex convergence rates. Continuation of the Frank-Wolfe series. Long and technical.</em>
<!--more--></p>
<p><em>Posts in this series (so far).</em></p>
<ol>
<li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li>
<li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li>
<li><a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li>
</ol>
<p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p>
<p>In this third installment of the series on Conditional Gradients, I will talk about the <em>Hölder Error Bound (HEB) condition</em>. This post will be slightly different from the previous ones, as the conditional gradients part is basically a simple corollary to our discussion of the general (constrained or unconstrained) case. The HEB condition is extremely useful for establishing convergence rates in general, and I will first talk about how it compares to, e.g., strong convexity, and when it holds; all these aspects are independent of Frank-Wolfe per se. Going from the general case to Frank-Wolfe is then essentially a simple corollary, apart from some non-trivial technical challenges; but those are really just that: technical challenges.</p>
<p>I will stick to the notation from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> and will refer to it frequently, so you might want to give it a quick refresher or read. As before I will use Frank-Wolfe [FW] and Conditional Gradients [CG] interchangeably.</p>
<h2 id="the-hölder-error-bound-heb-condition">The Hölder Error Bound (HEB) condition</h2>
<p>We have seen that in general (without acceleration), we can obtain a rate of basically $O(1/\varepsilon)$ in the smooth and convex case and a rate of basically $O(\log 1/\varepsilon)$ in the smooth and strongly convex case. A natural question to ask is what happens in between these two extremes, i.e., are there functions that converge with a rate of, e.g., $O(1/\varepsilon^p)$? The answer is <em>yes</em>, and the HEB condition basically allows us to interpolate smoothly between the two regimes, depending of course on the properties of the function under consideration.</p>
<p>For the sake of continuity we work in the constrained case here, as we will aim for applications to Frank-Wolfe later; however, the discussion also holds more broadly for the unconstrained case: simply replace $P$ with $\RR^n$. In the following let $\Omega^\esx$ denote the set of optimal solutions to $\min_{x \in P} f(x)$ (there might be multiple) and let $f^\esx \doteq \min_{x \in P} f(x)$. We will always assume that $x^\esx \in \Omega^\esx$.</p>
<p class="mathcol"><strong>Definition (Hölder Error Bound (HEB) condition).</strong> A convex function $f$ satisfies the <em>Hölder Error Bound (HEB) condition on $P$</em> with parameters $0 < c < \infty$ and $\theta \in [0,1]$ if for all $x \in P$ it holds:
\[
c (f(x) - f^\esx)^\theta \geq \min_{y \in \Omega^\esx} \norm{x-y}.
\]</p>
<p>As far as I can see, this condition basically goes back to [L] and has been studied extensively since then; see, e.g., [L2] and [BLO] (if anyone has more accurate information please ping me). What this condition measures is how <em>sharply</em> the function increases around the (set of) optimal solution(s), which is why it is sometimes also referred to as a <em>sharpness condition</em>. It is also important to note that the definition here depends on $P$ and the set of minimizers $\Omega^\esx$, whereas, e.g., strong convexity is a <em>global</em> property of the function, <em>independent</em> of $P$. Before delving further into HEB, we might wonder whether there are functions that satisfy this condition but are not strongly convex.</p>
<p class="mathcol"><strong>Example.</strong>
A simple optimization problem with a function that satisfies the HEB condition with non-trivial parameters is, e.g.,
\[
\min_{x \in P} \norm{x-\bar x}_2^\alpha,
\]
where $\bar x \in \RR^n$ and $\alpha \geq 2$; the resulting $\theta = 1/\alpha$. The function to be minimized is not strongly convex for $\alpha > 2$.</p>
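<p>As a quick numeric sanity check (my own snippet, not from the original post): for $f(x) = \norm{x-\bar x}_2^\alpha$ the HEB condition holds with $c = 1$ and $\theta = 1/\alpha$, in fact with equality, since $(\norm{x-\bar x}^\alpha)^{1/\alpha} = \norm{x-\bar x}$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = ||x - x_bar||_2**alpha is minimized at x_bar with f* = 0.
# With c = 1 and theta = 1/alpha the HEB condition holds with equality:
#   c * (f(x) - f*)**theta = (||x - x_bar||**alpha)**(1/alpha) = ||x - x_bar||.
x_bar = rng.normal(size=5)
violations = []
for alpha in (2.0, 3.0, 6.0):
    theta = 1.0 / alpha
    for _ in range(1000):
        x = x_bar + rng.normal(size=5)
        dist = np.linalg.norm(x - x_bar)
        f_gap = dist ** alpha                      # f(x) - f*
        violations.append(dist - f_gap ** theta)   # should be ~0

max_violation = max(violations)
print(max_violation)  # essentially 0: HEB holds with equality here
```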
<p>So the HEB condition is <em>more general</em> than strong convexity and, as we will see further below, it is also much weaker: it requires <em>less</em> from a <em>given</em> function (compared to strong convexity) and at the same time works for functions that are not covered by strong convexity.</p>
<p>The following graph depicts functions with varying $\theta$. All functions with $\theta < 1/2$ are <em>not</em> strongly convex. Those with $\theta > 1/2$ are depicted for illustration only, as they curve faster than the power of the (standard) smoothness that we use (as we will discuss briefly below). We will therefore be limited to functions with $0 \leq \theta \leq 1/2$, where $\theta = 0$ does not provide any additional information beyond what we get from the basic convexity assumption, and $\theta = 1/2$ provides information very similar to the strongly convex case (and will lead to similar rates). If $\theta > 1/2$ is desired, then the notion of smoothness has to be adjusted as well, as briefly outlined in the <em>Hölder smoothness</em> section.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/heb-functions.png" alt="HEB examples" /></p>
<p class="mathcol"><strong>Remark (Smoothness limits $\theta$).</strong>
We only consider smooth functions as we aim for applying HEB to conditional gradient methods later. This implies that the case $\theta > 1/2$ is impossible in general: suppose that $x^\esx$ is an optimal solution in the relative interior of $P$. Then $\nabla f(x^\esx) = 0$ and by smoothness we have $f(x) - f(x^\esx) \leq \frac{L}{2} \norm{x- x^\esx}^2$ and via HEB we have $\frac{1}{c^{1/\theta}} \norm{x - x^\esx}^{1/\theta} \leq f(x) - f(x^\esx)$, so that we obtain:
\[\frac{1}{c^{1/\theta}} \norm{x - x^\esx}^{1/\theta} \leq f(x) - f(x^\esx) \leq \frac{L}{2} \norm{x- x^\esx}^2,
\]
and hence
\[
K \leq \norm{x- x^\esx}^{2\theta-1}
\]
for some constant $K> 0$. If now $\theta > 1/2$ this inequality cannot hold as $x \rightarrow x^\esx$. However, in the <em>non-smooth</em> case, the HEB condition with, e.g., $\theta = 1$ might easily hold, as seen for example by choosing $f(x) = \norm{x}$. By a similar argument applied in reverse, we can see that $0 \leq \theta < 1/2$ can only be expected to hold on a bounded set in general: using $K \leq \norm{x- x^\esx}^{2\theta-1}$ from above
now with $2 \theta < 1$ it follows that $\norm{x- x^\esx}^{2\theta-1} \rightarrow 0$, when $x$ follows an unbounded direction with $\norm{x} \rightarrow \infty$.</p>
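<p>The blow-up for $\theta > 1/2$ can also be seen numerically on a toy example of my own choosing: for the smooth function $f(x) = x^2$ with $x^\esx = 0$, the smallest constant $c$ for which HEB holds at a point $x$ is $|x|^{1-2\theta}$, which diverges as $x \rightarrow x^\esx$ whenever $\theta > 1/2$.</p>

```python
import numpy as np

# For the smooth function f(x) = x**2 (minimizer x* = 0, f* = 0), HEB reads
#   c * (x**2)**theta >= |x|,  i.e.  c >= |x|**(1 - 2*theta).
# For theta > 1/2 this required c blows up as x -> x*, so no finite c works;
# for theta <= 1/2 it stays bounded near the minimizer.
xs = 10.0 ** -np.arange(1, 9)   # points approaching the minimizer x* = 0

def c_needed(theta):
    return np.abs(xs) ** (1.0 - 2.0 * theta)

print(c_needed(0.75).max())     # ~1e4: diverges as x -> 0
print(c_needed(0.5).max())      # 1.0: bounded
```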
<h3 id="from-heb-to-primal-gap-bounds">From HEB to primal gap bounds</h3>
<p>The ultimate reason why we care for the HEB condition is that it immediately provides a bound on the primal optimality gap by a straightforward combination with convexity:</p>
<p class="mathcol"><strong>Lemma (HEB primal gap bounds).</strong> Let $f$ satisfy the HEB condition on $P$ with parameters $c$ and $\theta$. Then it holds:
\[
\tag{HEB primal bound} f(x) - f^\esx \leq c^{\frac{1}{1-\theta}} \left(\frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}\right)^{\frac{1}{1-\theta}},
\]
or equivalently,
\[
\tag{HEB primal bound}
\frac{1}{c}(f(x) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}
\]
for any $x^\esx \in P$ with $f(x^\esx) = f^\esx$. <br /></p>
<p><em>Proof.</em> By first applying convexity and then the HEB condition for any $x^\esx \in \Omega^\esx$ with $f(x^\esx) = f^\esx$ it holds:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(x) - f^\esx & = f(x) - f(x^\esx) \leq \langle \nabla f(x), x - x^\esx \rangle \\
& = \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} \norm{x - x^\esx} \\
& \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} c (f(x) - f^\esx)^\theta,
\end{align*} %]]></script>
<p>so we obtain
\[
\frac{1}{c}(f(x) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}},
\]
or equivalently
\[
f(x) - f^\esx \leq c^{\frac{1}{1-\theta}} \left(\frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}}\right)^{\frac{1}{1-\theta}}.
\]
\[\qed\]</p>
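<p>The lemma can be checked numerically on a simple instance (my own example): $f(x) = \norm{x}_2^2$ with minimizer $x^\esx = 0$ satisfies HEB with $c = 1$ and $\theta = 1/2$, and (HEB primal bound) then asserts $\norm{x} \leq \langle \nabla f(x), x \rangle / \norm{x} = 2\norm{x}$.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x) = ||x||_2**2 on R^n: minimizer x* = 0, f* = 0, gradient 2x.
# It satisfies HEB with c = 1, theta = 1/2 (since f(x)**0.5 = ||x||),
# so the lemma predicts
#   (1/c) * (f(x) - f*)**(1 - theta) <= <grad f(x), x - x*> / ||x - x*||.
c, theta = 1.0, 0.5
for _ in range(1000):
    x = rng.normal(size=8)
    lhs = (x @ x) ** (1.0 - theta) / c          # = ||x||
    rhs = (2.0 * x) @ x / np.linalg.norm(x)     # = 2 * ||x||
    assert lhs <= rhs + 1e-12
print("HEB primal gap bound verified on random samples")
```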
<p class="mathcol"><strong>Remark (Relation to gradient dominated property).</strong> Estimating $\frac{\langle \nabla f(x), x - x^\esx \rangle}{\norm{x - x^\esx}} \leq \norm{\nabla f(x)}$, we obtain the weaker condition:
\[
f(x) - f^\esx \leq c^{\frac{1}{1-\theta}} \norm{\nabla f(x)}^{\frac{1}{1-\theta}},
\]
which is known as the <em>gradient dominated property</em> introduced in [P]. If $\Omega^\esx \subseteq \operatorname{rel.int}(P)$, then the two conditions are equivalent and for simplicity we will use the weaker version below in our example where we show that the Scaling Frank-Wolfe algorithm adapts dynamically to the HEB bound <em>if</em> the optimal solution(s) are contained in the (strict) relative interior. However, if the optimal solution(s) are on the boundary of $P$ as is not infrequently the case, then the two conditions <em>are not</em> equivalent as $\norm{\nabla f(x)}$ might not vanish for $x \in \Omega^\esx$, whereas $\langle \nabla f(x), x - x^\esx \rangle$ does, i.e., (HEB primal bound) is tighter than the one induced by the gradient dominated property; we have seen this difference before when we analyzed linear convergence in the <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">last post</a>.</p>
<h3 id="heb-and-strong-convexity">HEB and strong convexity</h3>
<p>We will now show that strong convexity implies the HEB condition, which together with (HEB primal bound) provides a bound on the primal gap, albeit a slightly weaker one than if we had used strong convexity directly. We briefly recall the definition of strong convexity.</p>
<p class="mathcol"><strong>Definition (strong convexity).</strong> A convex function $f$ is said to be <em>$\mu$-strongly convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \langle \nabla f(x), y-x \rangle + \frac{\mu}{2} \norm{x-y}^2</script>.</p>
<p>Plugging in $x \doteq x^\esx$ with $x^\esx \in \Omega^\esx$ above, we obtain $\langle \nabla f(x^\esx), y-x^\esx \rangle \geq 0$ for all $y \in P$ by first-order optimality, and therefore the condition</p>
<p>\[
f(y) - f(x^\esx) \geq \frac{\mu}{2} \norm{x^\esx -y}^2,
\]</p>
<p>for all $y \in P$ and rearranging leads to</p>
<p>\[
\tag{HEB-SC}
\left(\frac{2}{\mu}\right)^{1/2} (f(y) - f(x^\esx))^{1/2} \geq \norm{x^\esx -y},
\]</p>
<p>for all $y \in P$, which is the HEB condition with specific parameterization $\theta = 1/2$ and $c=2/\mu$. However, here and in the HEB condition we <em>only</em> require this behavior around the optimal solution $x^\esx \in \Omega^\esx$ (which is unique in the case of strong convexity). The strong convexity condition is a global condition however, required for <em>all</em> $x,y \in \mathbb R^n$ (and not just $x = x^\esx \in \Omega^\esx$).</p>
<p>If we now plug the parameters from (HEB-SC) into (HEB primal bound), we obtain:</p>
<script type="math/tex; mode=display">f(x) - f(x^\esx) \leq 2 \frac{\langle \nabla f(x), x - x^\esx \rangle^2}{\mu \norm{x - x^\esx}^2}.</script>
<p>Note that the strong convexity induced bound obtained this way is a factor of $4$ weaker than the bound obtained in the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post in this series</a> via optimizing out the strong convexity inequality. On the other hand, we used a simpler estimation here that does not rely on <em>any</em> gradient information, in contrast to the stronger bound. This weaker estimation will lead to slightly weaker convergence rate bounds: basically we lose the same factor of $4$ in the rate.</p>
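<p>A short numeric check of (HEB-SC) on a toy instance of my own choosing: for the $\mu$-strongly convex function $f(x) = \frac{\mu}{2}\norm{x}_2^2$, the derived parameterization $\theta = 1/2$ and $c = (2/\mu)^{1/2}$ holds with equality.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# f(x) = (mu/2) * ||x||**2 is mu-strongly convex with x* = 0 and f* = 0.
# (HEB-SC) predicts (2/mu)**0.5 * (f(y) - f*)**0.5 >= ||y - x*||,
# i.e., HEB with theta = 1/2 and c = (2/mu)**0.5 -- here with equality.
mu = 4.0
c = (2.0 / mu) ** 0.5
gaps = []
for _ in range(1000):
    y = rng.normal(size=6)
    f_gap = 0.5 * mu * (y @ y)                       # f(y) - f*
    gaps.append(c * f_gap ** 0.5 - np.linalg.norm(y))  # should be ~0

print(max(abs(g) for g in gaps))  # equality up to rounding
```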
<h3 id="when-does-the-heb-condition-hold">When does the HEB condition hold</h3>
<p>In fact, it turns out that the HEB condition almost always holds with some (potentially bad) parameterization for reasonably well-behaved functions (those that we usually encounter). For example, if $P$ is compact, $\theta = 0$ and $c$ large enough will always work and the condition becomes trivial. However, HEB often also holds with <em>non-trivial</em> parameterization and for wide classes of functions; the interested reader is referred to [BDL] and the references contained therein for an in-depth discussion. Just to give a glimpse: at the core of those arguments are variants of the <em>Łojasiewicz Inequality</em> and the <em>Łojasiewicz Factorization Lemma</em>.</p>
<p class="mathcol"><strong>Lemma (Łojasiewicz Inequality; see [L] and [BDL]).</strong> Let $f: \operatorname{dom} f \subseteq \RR^n \rightarrow \RR$ be a lower semi-continuous and subanalytic function. Then for any compact set $C \subseteq \operatorname{dom} f$ there exist $c, \theta > 0$ such that
\[
c (f(x) - f^\esx)^\theta \geq \min_{y \in \Omega^\esx} \norm{x-y}
\]
for all $x \in C$.</p>
<h3 id="hölder-smoothness">Hölder smoothness</h3>
<p>Without going into any detail here, I would like to remark that the smoothness condition can also be weakened in a similar fashion, basically requiring (only) Hölder continuity of the gradients, i.e.,</p>
<p class="mathcol"><strong>Definition (Hölder smoothness).</strong> A convex function $f$ is said to be <em>$(s,L)$-Hölder smooth</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \leq \langle \nabla f(x), y-x \rangle + \frac{L}{s} \| x-y \|^s</script>.</p>
<p>Using this more general definition of smoothness, an analogous discussion with the obvious modifications applies; e.g., the progress guarantee from smoothness now has to be adapted. The interested reader is referred to [RA] for more details and the relationship between $s$ and $\theta$.</p>
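<p>To make the definition concrete, here is a small numeric check on a toy function of my own choosing: $f(x) = \frac{2}{3}|x|^{3/2}$ has a $(1/2)$-Hölder continuous gradient and hence is $(s,L)$-Hölder smooth with $s = 3/2$ ($L = 2$ suffices on the tested range), although it is not smooth in the classical $s = 2$ sense at the origin.</p>

```python
import numpy as np

# Numeric check (my own toy example): f(x) = (2/3) * |x|**1.5 has the
# (1/2)-Hölder continuous gradient sign(x) * |x|**0.5, so it should be
# (s, L)-Hölder smooth with s = 3/2 and some finite L (L = 2 works here).
f = lambda x: (2.0 / 3.0) * np.abs(x) ** 1.5
df = lambda x: np.sign(x) * np.abs(x) ** 0.5

s, L = 1.5, 2.0
xs = np.linspace(-2.0, 2.0, 401)
X, Y = np.meshgrid(xs, xs)
residual = f(Y) - f(X) - df(X) * (Y - X)   # error of the linearization
bound = (L / s) * np.abs(X - Y) ** s
print(np.max(residual - bound))  # <= 0: the Hölder smoothness bound holds
```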
<h2 id="faster-rates-via-heb">Faster rates via HEB</h2>
<p>We will now show how HEB can be used to obtain faster rates. We will first consider the impact of HEB from a theoretical perspective and then we will discuss how faster rates via HEB can be obtained in practice.</p>
<h3 id="theoretically-faster-rates">Theoretically faster rates</h3>
<p>Let us assume that we run a hypothetical first-order algorithm with updates of the form $x_{t+1} \leftarrow x_t - \eta_t d_t$ for some step length $\eta_t$ and direction $d_t$. To this end, recall from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> that the progress at a point $x_t$ induced by smoothness when moving in a direction $d$ (via a short step) is given by</p>
<p class="mathcol"><strong>Progress induced by smoothness:</strong>
\[
f(x_{t}) - f(x_{t+1}) \geq \frac{\langle \nabla f(x_t), d\rangle^2}{2L \norm{d}^2},
\]</p>
<p>and in particular for the direction pointing towards the optimal solution $d \doteq \frac{x_t - x^\esx}{\norm{x_t - x^\esx}}$ this becomes:</p>
<script type="math/tex; mode=display">\underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle^2}{2L \norm{x_t - x^\esx}^2}.</script>
<p>At the same time, via (HEB primal bound) we have</p>
<p class="mathcol"><strong>Primal bound via HEB:</strong>
\[
\frac{1}{c}(f(x_t) - f^\esx)^{1-\theta} \leq \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}.
\]</p>
<p>Chaining these two inequalities together we obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(x_{t}) - f(x_{t+1}) & \geq \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle^2}{2L \norm{x_t - x^\esx}^2} \\
& \geq \frac{\left(\frac{1}{c}(f(x_t) - f^\esx)^{1-\theta} \right)^2}{2L}.
\end{align*} %]]></script>
<p>and so, writing $h_t \doteq f(x_t) - f(x^\esx)$ and rearranging, we obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
h_{t+1} & \leq h_t - \frac{\frac{1}{c^2}h_t^{2-2\theta}}{2L} \\
& \leq h_t \left(1 - \frac{1}{2Lc^2} h_t^{1-2\theta}\right).
\end{align*} %]]></script>
<p>If $\theta = 1/2$, then we obtain linear convergence with the usual arguments. Otherwise, whenever we have a contraction of the form $h_{t+1} \leq h_t \left(1 - Mh_t^{\alpha}\right)$ with $\alpha > 0$, it can be shown by induction plus some estimations that
$h_t \leq O(1) \left(\frac{1}{t} \right)^\frac{1}{\alpha}$, so that we obtain</p>
<script type="math/tex; mode=display">h_t \leq O(1) \left(\frac{1}{t} \right)^\frac{1}{1-2\theta},</script>
<p>or equivalently, to achieve $h_T \leq \varepsilon$, we need roughly $T \geq \Omega\left(\frac{1}{\varepsilon^{1 - 2\theta}}\right)$.</p>
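<p>The claimed rate can be sanity-checked by simply iterating the contraction numerically (my own snippet; the constants $M$ and $h_0$ are arbitrary):</p>

```python
import numpy as np

# Iterate the contraction h_{t+1} = h_t * (1 - M * h_t**alpha) for
# alpha = 1 - 2*theta and check the claimed O((1/t)**(1/alpha)) decay,
# i.e., that t**(1/alpha) * h_t stays bounded.  M and h0 are made up.
def simulate(theta, M=0.1, h0=1.0, T=100000):
    alpha = 1.0 - 2.0 * theta
    h, hs = h0, [h0]
    for _ in range(T):
        h = h * (1.0 - M * h ** alpha)
        hs.append(h)
    return alpha, np.array(hs)

for theta in (0.0, 0.25, 0.4):
    alpha, hs = simulate(theta)
    t = np.arange(1, len(hs))
    ratio = t ** (1.0 / alpha) * hs[1:]
    print(theta, ratio.max())  # bounded constant, as the induction predicts
```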
<p>Then, as we have done before, in an actual algorithm we use a direction $\hat d_t$ that ensures progress at least as good as from the direction $d_t = \frac{x_t - x^\esx}{\norm{x_t - x^\esx}}$ pointing towards the optimal solution by means of an inequality of the form:</p>
<script type="math/tex; mode=display">\frac{\langle \nabla f(x_t), \hat d_t\rangle}{\norm{\hat d_t}} \geq \alpha \frac{\langle \nabla f(x_t), x_t - x^\esx\rangle}{\norm{x_t - x^\esx}},</script>
<p>and the argument for a specific algorithm is then concluded as we have done several times before.</p>
<h3 id="practically-faster-rates">Practically faster rates</h3>
<p>If the HEB condition almost always holds with <em>some</em> parameters and we can generally expect faster rates, why is it referred to or used rather seldom (compared to, e.g., strong convexity)? The reason is that the improved bounds are <em>only</em> useful in practice if the HEB parameters are known in advance, as only then do we know when we can legitimately stop with a guaranteed accuracy. The key to getting around this issue is to use <em>robust restarts</em>, which basically allow us to achieve the rate implied by HEB <em>without</em> requiring knowledge of the parameters; this costs only a constant factor in the convergence rate compared to exactly knowing the parameters. If no error bound criterion is known, these robust scheduled restarts rely on a grid search over a grid of logarithmic size. If an error bound criterion is available, such as the Wolfe gap in our case, then no grid search is required and it basically suffices to restart the algorithm whenever it has closed a (constant) multiplicative fraction of the residual primal gap. The overall complexity bound then arises from estimating how long each such restart takes. Coincidentally, this is exactly what the Scaling Frank-Wolfe algorithm from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> does, and we will analyze this algorithm in the next section. For an in-depth treatment, the interested reader is referred to [RA] for the (smooth) unconstrained case and [KDP] for the (smooth) constrained case via Conditional Gradients.</p>
<h3 id="a-heb-fw-for-optimal-solutions-in-relative-interior">A HEB-FW for optimal solutions in relative interior</h3>
<p>As an application of the above, we will now show that the <em>Scaling Frank-Wolfe algorithm</em> from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> dynamically adjusts to the HEB condition and achieves a HEB-optimal rate up to constant factors (see p.6 of [NN] for the matching lower bound), provided that the optimal solution is contained in the relative interior of $P$; for the general case see [KDP], where away steps need to be employed. Recall from the last post that the reason we do not need away steps if $x^\esx \in \operatorname{rel.int}(P)$ is that in this case it holds</p>
<p>\[
\frac{\langle \nabla f(x),x - v\rangle}{\norm{x - v}} \geq \alpha \norm{\nabla f(x)},
\]</p>
<p>for some $\alpha > 0$, whenever $v \doteq \arg\min_{y \in P} \langle \nabla f(x), y \rangle$ is the Frank-Wolfe vertex, so that the standard FW direction provides a sufficient approximation of $\norm{\nabla f(x)}$; see the <a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">second post</a> for details. This can be weakened to</p>
<p>\[
\tag{norm approx}
\langle \nabla f(x),x - v\rangle \geq \frac{\alpha}{D} \norm{\nabla f(x)},
\]</p>
<p>where $D$ is the diameter of $P$, which is sufficient for our purposes in the following. From this we can derive our operational primal gap bound that we will be working with by combining (gradient norm approx) with (HEB primal bound):</p>
<p>\[
\tag{HEB-FW PB}
f(x) - f^\esx \leq \left(\frac{cD}{\alpha}\right)^{\frac{1}{1-\theta}} \langle \nabla f(x),x - v\rangle^{\frac{1}{1-\theta}}.
\]</p>
<p>Furthermore, let us recall the Scaling Frank-Wolfe algorithm:</p>
<p class="mathcol"><strong>Scaling Frank-Wolfe Algorithm [BPZ]</strong> <br />
<em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$. <br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
Compute initial dual gap: $\Phi_0 \leftarrow \max_{v \in P} \langle \nabla f(x_0), x_0 - v \rangle$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad$ Find $v_t$ vertex of $P$ such that: $\langle \nabla f(x_t), x_t - v_t \rangle > \Phi_t/2$ <br />
$\quad$ If no such vertex $v_t$ exists: $x_{t+1} \leftarrow x_t$ and $\Phi_{t+1} \leftarrow \Phi_t/2$ <br />
$\quad$ Else: $x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$ and $\Phi_{t+1} \leftarrow \Phi_t$</p>
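<p>To make the pseudo-code concrete, here is a minimal Python sketch of Scaling Frank-Wolfe (my own illustrative implementation, not the post's reference code, and using a short-step rule instead of line search) on the probability simplex, with a strongly convex test function whose optimum lies in the relative interior:</p>

```python
import numpy as np

# A minimal sketch of the Scaling Frank-Wolfe algorithm on the probability
# simplex, minimizing f(x) = ||x - x_bar||_2**2 (smoothness L = 2) with
# x_bar in the interior, so the optimum lies in rel.int(P).
def scaling_fw(grad, lmo, x0, L, T):
    x = x0
    v = lmo(grad(x))
    phi = grad(x) @ (x - v)            # initial dual gap estimate Phi_0
    for _ in range(T):
        g = grad(x)
        v = lmo(g)                     # best vertex for the current gradient
        gap = g @ (x - v)
        if gap <= phi / 2:             # no vertex with enough progress:
            phi = phi / 2              # halve the estimate (a "restart")
        else:                          # FW step with the short-step rule
            d = x - v
            gamma = min(1.0, gap / (L * (d @ d)))
            x = x - gamma * d          # = (1 - gamma) * x + gamma * v
    return x, phi

n = 5
x_bar = np.full(n, 1.0 / n)              # optimum in the relative interior
grad = lambda x: 2.0 * (x - x_bar)
lmo = lambda g: np.eye(n)[np.argmin(g)]  # LMO over the simplex: best e_i
x0 = np.eye(n)[0]                        # start at a vertex

x, phi = scaling_fw(grad, lmo, x0, L=2.0, T=5000)
print(np.linalg.norm(x - x_bar) ** 2)    # primal gap; decays linearly here
```

<p>Here the linear optimization oracle over the simplex just returns the standard basis vector minimizing the gradient coordinate, and for this quadratic the short step coincides with exact line search.</p>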
<p>As remarked earlier, the Scaling Frank-Wolfe Algorithm can be seen as a certain variant of a restart scheme, where we ‘restart’ whenever we update $\Phi_{t+1} \leftarrow \Phi_t/2$. The key is that the algorithm is parameter-free (when run with line search), does not require the estimation of HEB parameters, and is essentially optimal; we skip optimizing the update $\Phi_{t+1} \leftarrow \Phi_t/2$ over different halving factors here, which affects the rate only by a constant factor in the exponent.</p>
<p>We will now show the following lemma, which is a straightforward adaptation of the argument from <a href="/blog/research/2018/10/05/cheatsheet-fw.html">the first post</a>, incorporating (HEB-FW PB) instead of the vanilla convexity estimation.</p>
<p class="mathcol"><strong>Lemma (Scaling Frank-Wolfe HEB convergence).</strong>
Let $f$ be a smooth convex function satisfying HEB with parameters $c$ and $\theta$. Then the Scaling Frank-Wolfe algorithm ensures:
\[
h(x_T) \leq \varepsilon \qquad \text{for} \qquad
\begin{cases}
T \geq (1+K) \left(\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1\right) & \text{ if } \theta = 1/2 \\ T \geq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} +
\frac{K 4^{-\tau}}{\left(\frac{1}{2^\tau}\right) - 1} \left(\frac{1}{\varepsilon}\right)^{-\tau} & \text{ if } \theta < 1/2
\end{cases},
\]
where $K \doteq \left(\frac{cD}{2\alpha}\right)^{\frac{1}{1-\theta}} 8LD^2$, $\tau \doteq {\frac{1}{1-\theta}-2}$, and $\log$ is taken to base $2$.</p>
<p><em>Proof.</em>
We consider two types of steps: (a) primal progress steps, where $x_t$ is changed, and (b) dual update steps, where $\Phi_t$ is changed. <br /> <br /> Let us start with the dual update step (b). In such an iteration we know that $\langle \nabla f(x_t), x_t - v \rangle \leq \Phi_t/2$ holds for all $v \in P$, in particular for $v = x^\esx$, and by (HEB-FW PB) this implies
\[h_t \leq \left(\frac{cD}{\alpha}\right)^{\frac{1}{1-\theta}} (\Phi_t/2)^{\frac{1}{1-\theta}}.\]
For a primal progress step (a), we have by the same arguments as before
\[f(x_t) - f(x_{t+1}) \geq \frac{\Phi_t^2}{8LD^2}.\]
From these two inequalities we can conclude the proof as follows: Clearly, to achieve accuracy $\varepsilon$, it suffices to halve $\Phi_0$ at most $\lceil \log \frac{\Phi_0}{\varepsilon}\rceil$ times. Next we bound how many primal progress steps of type (a) we can do between two steps of type (b); we call this a <em>scaling phase</em>. After accounting for the halving at the beginning of the iteration and observing that $\Phi_t$ does not change between any two iterations of type (b), by simply dividing the upper bound on the residual gap by the lower bound on the progress, the number of required steps can be at most
\[\left(\frac{cD}{\alpha}\right)^{\frac{1}{1-\theta}} (\Phi/2)^{\frac{1}{1-\theta}} \cdot \frac{8LD^2}{\Phi^2} = \underbrace{\left(\frac{cD}{2\alpha}\right)^{\frac{1}{1-\theta}} 8LD^2}_{\doteq K} \cdot \Phi^{\frac{1}{1-\theta}-2},\]
where $\Phi$ is the estimate valid for these iterations of type (a). Thus, with $\tau \doteq {\frac{1}{1-\theta}-2}$, the total number of iterations $T$ required to achieve $\varepsilon$-accuracy can be bounded by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(1 + K (\Phi_0/2^\ell)^\tau \right) & = \underbrace{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil}_{\text{Type (b)}} + \underbrace{K \Phi_0^\tau \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(\frac{1}{2^\tau}\right)^\ell}_{\text{Type (a)}},
\end{align*} %]]></script>
<p>where we distinguish two cases. First let $\tau = 0$, and hence $\theta = 1/2$. This corresponds to the case where we obtain linear convergence, as now
\[
{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + {K \Phi_0^\tau \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(\frac{1}{2^\tau}\right)^\ell} \leq (1+K) \left(\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1\right).
\]
Now let $\tau < 0$, i.e., $\theta < 1/2$. Then</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + {K \Phi_0^\tau \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(\frac{1}{2^\tau}\right)^\ell}
& = {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} + K \Phi_0^\tau
\frac{1-\left(\frac{1}{2^\tau}\right)^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1}}{1 - \left(\frac{1}{2^\tau}\right)} \\
& \leq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} +
\frac{K \Phi_0^\tau}{\left(\frac{1}{2^\tau}\right)-1} \left(\frac{1}{2^\tau}\right)^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + 1} \\
& \leq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} +
\frac{K \Phi_0^\tau}{\left(\frac{1}{2^\tau}\right)-1} \left(\frac{4\Phi_0}{\varepsilon}\right)^{-\tau} \\
& \leq {\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} +
\frac{K 4^{-\tau}}{\left(\frac{1}{2^\tau}\right) - 1} \left(\frac{1}{\varepsilon}\right)^{-\tau}
\end{align*} %]]></script>
\[\qed\]</p>
<p>So we obtain the following convergence rate regimes:</p>
<ol>
<li>If $\theta = 1/2$, we obtain linear convergence with a convergence rate that is similar to the rate achieved in the strongly convex case up to a small constant factor, as expected from the discussion before.</li>
<li>If $\theta = 0$, then $\tau = -1$ and we obtain the standard rate relying only on smoothness and convexity, namely $O\left(\frac{1}{\varepsilon^{-\tau}}\right) = O\left(\frac{1}{\varepsilon}\right)$.</li>
<li>If $0 < \theta < 1/2$, we have with $\tau = {\frac{1}{1-\theta}-2}$ that $0 < 2-\frac{1}{1-\theta} < 1$ and a rate of $O\left(\frac{1}{\varepsilon^{-\tau}}\right) = O\left(\frac{1}{\varepsilon^{2-\frac{1}{1-\theta}}}\right) = o\left(\frac{1}{\varepsilon}\right)$. This is strictly better than the rate obtained only from convexity and smoothness.</li>
</ol>
<p>It is helpful to compare the rate $O\left(\frac{1}{\varepsilon^{2-\frac{1}{1-\theta}}}\right)$ with the rate $O\left(\frac{1}{\varepsilon^{1 - 2\theta}}\right)$ that we derived above directly from the contraction. For this we rewrite $2-\frac{1}{1-\theta} = \frac{1-2\theta}{1-\theta}$, so that we have $\varepsilon^{-(1 - 2\theta)}$ vs. $\varepsilon^{- \frac{1 - 2\theta}{1-\theta}}$ and maximizing out the error in the exponent over $\theta$, we obtain
<script type="math/tex">\varepsilon^{-(1 - 2\theta)} \cdot \varepsilon^{-(3-2\sqrt{2})} \geq \varepsilon^{- \frac{1 - 2\theta}{1-\theta}},</script>
so that the error in rate is $\varepsilon^{-(3-2\sqrt{2})} \approx \varepsilon^{-0.17157}$, which is achieved for $\theta = 1- \frac{1}{\sqrt{2}} \approx 0.29289$. This discrepancy arises from the scaling of the dual gap estimate and optimizing the factor $\gamma$ in the update $\Phi_{t+1} \leftarrow \Phi_t/\gamma$ can reduce this further to a constant factor error (rather than a constant exponent error).</p>
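<p>The little optimization over $\theta$ at the end can be reproduced numerically (my own snippet):</p>

```python
import numpy as np

# The gap between the two exponents 1 - 2*theta (direct contraction) and
# (1 - 2*theta)/(1 - theta) (Scaling Frank-Wolfe) is
#   g(theta) = (1 - 2*theta) * theta / (1 - theta),
# maximized over [0, 1/2] at theta = 1 - 1/sqrt(2), value 3 - 2*sqrt(2).
thetas = np.linspace(0.0, 0.5, 1000001)
g = (1.0 - 2.0 * thetas) / (1.0 - thetas) - (1.0 - 2.0 * thetas)

print(thetas[np.argmax(g)])   # ~0.29289 = 1 - 1/sqrt(2)
print(g.max())                # ~0.17157 = 3 - 2*sqrt(2)
```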
<p class="mathcol"><strong>Remark (HEB rates for vanilla FW).</strong>
Similar HEB rate adaptivity can be shown for the vanilla Frank-Wolfe algorithm in a relatively straightforward way; e.g., a direct adaptation of the proof of [XY] will work. I opted for a proof for the Scaling Frank-Wolfe as I believe it is more straightforward and the Scaling Frank-Wolfe algorithm retains all the advantages discussed in <a href="/blog/research/2018/10/05/cheatsheet-fw.html">the first post</a> under the HEB condition.</p>
<p>Finally, a graph showing the behavior of Frank-Wolfe under HEB on the probability simplex of dimension $30$ and function $\norm{x}_2^{1/\theta}$. As we can see, for $\theta = 1/2$ we observe linear convergence as expected, while for the other values of $\theta$ we observe various degrees of sublinear convergence of the form $O(1/\varepsilon^p)$ with $0 < p \leq 1$. The difference in slope is not quite as pronounced as I had hoped for but, again, the bounds are only upper bounds on the convergence rates.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/heb-simplex-30-noLine.png" alt="HEB with approx minimizer" /></p>
<p>Interestingly, when using line search it seems we still achieve linear convergence, and in fact the sharper functions converge <em>faster</em>; note that, due to the matching lower bound on our rates in [NN], this can only be a spurious phenomenon (or even some bug). It <em>might be</em> due to the fact that the progress from smoothness is only an underestimator of the achievable progress, together with the specific (as in simple) structure of our functions. If time permits I might try to compute the actual optimal progress and see whether faster convergence can be proven. Here is a graph demonstrating the difference: Frank-Wolfe run on the probability simplex for $n = 100$ and function $\norm{x}_2^{1/\theta}$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/heb-simplex-100-comp.png" alt="HEB with line search" /></p>
<h3 id="references">References</h3>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[L] Łojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87-89.</p>
<p>[L2] Łojasiewicz, S. (1993). Sur la géométrie semi-et sous-analytique. Ann. Inst. Fourier, 43(5), 1575-1595. <a href="http://www.numdam.org/article/AIF_1993__43_5_1575_0.pdf">pdf</a></p>
<p>[BLO] Burke, J. V., Lewis, A. S., & Overton, M. L. (2002). Approximating subdifferentials by random sampling of gradients. Mathematics of Operations Research, 27(3), 567-584. <a href="https://www.jstor.org/stable/pdf/3690452.pdf?casa_token=WG5QKXxjgU8AAAAA:USOjl9WVAlwxXujFadFmzAmEH5J1JEX5fTr5tikcZPokBgqI6CU6UdMP6gb1Nh771ucW3lAjDD2RZWn5zqlfYPgSbePz1zr8R6dPnPYe7ftU4azql_k">pdf</a></p>
<p>[P] Polyak, B. T. (1963). Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4), 643-653.</p>
<p>[BDL] Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205-1223. <a href="https://epubs.siam.org/doi/pdf/10.1137/050644641?casa_token=FJQHJsH8X7QAAAAA%3AKsy_oqj_H1BsF3MOlJsvVXoGHTGuLCiPXnSFhuWA22CpZ4aZGOpJao-vPuBWzuptLKNQqkDPiA&">pdf</a></p>
<p>[RA] Roulet, V., & d’Aspremont, A. (2017). Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems (pp. 1119-1129). <a href="http://papers.nips.cc/paper/6712-sharpness-restart-and-acceleration">pdf</a></p>
<p>[KDP] Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank-Wolfe. <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>
<p>[XY] Xu, Y., & Yang, T. (2018). Frank-Wolfe Method is Automatically Adaptive to Error Bound Condition. arXiv preprint arXiv:1810.04765. <a href="https://arxiv.org/pdf/1810.04765.pdf">pdf</a></p>
<p>[BPZ] Braun, G., Pokutta, S., & Zink, D. (2017, July). Lazifying Conditional Gradient Algorithms. In International Conference on Machine Learning (pp. 566-575). <a href="https://arxiv.org/abs/1610.05120">pdf</a></p>
<p>[NN] Nemirovskii, A. & Nesterov, Y. E. (1985), Optimal methods of smooth convex minimization, USSR Computational Mathematics and Mathematical Physics 25(2), 21–30.</p>Sebastian PokuttaTL;DR: Cheat Sheet for convergence of Frank-Wolfe algorithms (aka Conditional Gradients) under the Hölder Error Bound (HEB) condition, or how to interpolate between convex and strongly convex convergence rates. Continuation of the Frank-Wolfe series. Long and technical.Toolchain Tuesday No. 22018-10-23T00:00:00-04:002018-10-23T00:00:00-04:00http://www.pokutta.com/blog/random/2018/10/23/toolchain-2<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em>
<!--more--></p>
<p>This is the second installment of a series of posts; the <a href="/blog/pages/toolchain.html">full list</a> is expanding over time. This time around it will be about the Python environment that I am using. Python has become my go-to language for rapid prototyping. In some sense these tools are among the most fundamental ones, but at the same time they do not provide direct utility by solving a specific problem; rather, they <em>accelerate</em> problem solving.</p>
<h2 id="software">Software:</h2>
<h3 id="pycharm">PyCharm</h3>
<p>Extremely powerful integrated development environment (IDE) for python.</p>
<p><em>Learning curve: ⭐️⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.jetbrains.com/pycharm/">https://www.jetbrains.com/pycharm/</a></em></p>
<p>Excellent support for coding including simple things such as syntax highlighting and more complex refactoring. Support for managing different build/run environments, remote kernels, etc. Also great for managing larger scale projects.</p>
<h3 id="anaconda">Anaconda</h3>
<p>Python distribution geared towards scientific computing and data science applications.</p>
<p><em>Learning curve: ⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.anaconda.com">https://www.anaconda.com</a></em></p>
<p><code class="highlighter-rouge">Anaconda</code> is a very comprehensive and well-maintained python distribution geared towards scientific computing and data science applications. It uses the <code class="highlighter-rouge">conda</code> package manager, which makes package management, as well as creating different environments with different python versions, exceptionally convenient. The learning curve only got ⭐️⭐️ as it is not harder than any other python distribution.</p>
<h3 id="jupyter">Jupyter</h3>
<p>Interactive python computing.</p>
<p><em>Learning curve: ⭐️⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="http://jupyter.org/">http://jupyter.org/</a></em></p>
<p>While <code class="highlighter-rouge">PyCharm</code> is great for more traditional development (write code, run, debug, iterate), <code class="highlighter-rouge">Jupyter</code> provides an <em>interactive (python) computing</em> environment in a web browser (for those in the know, it is basically <code class="highlighter-rouge">IPython</code> on steroids). Essentially, it allows you to work with data interactively and in real time through partial execution of code, direct inspection of results, plotting, etc., without having to fully re-run the code; great, for example, for exploratory data analysis. This allows for significantly faster tinkering with code and data, and once things are stable the code can easily be transferred into a more traditional python setup.</p>
<p>A typical process that I regularly follow is first writing a library that provides black box functions for some tasks and then I use Jupyter to do very high level tinkering/modifications. My Jupyter notebook might look like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">tools</span>
<span class="c"># load graph</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">loadGraph</span><span class="p">(</span><span class="s">"downtown-SF"</span><span class="p">)</span>
<span class="c"># compute distance matrix</span>
<span class="n">dist</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">shortestPathDistances</span><span class="p">(</span><span class="n">g</span><span class="p">)</span>
<span class="c"># solve configurations</span>
<span class="n">resCF</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">optimizeFlow</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="n">dist</span><span class="p">,</span><span class="n">congestion</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">resCT</span> <span class="o">=</span> <span class="n">tools</span><span class="o">.</span><span class="n">optimizeFlow</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="n">dist</span><span class="p">,</span><span class="n">congestion</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># compare</span>
<span class="n">tools</span><span class="o">.</span><span class="n">plotComparison</span><span class="p">(</span><span class="n">resCF</span><span class="p">,</span><span class="n">resCT</span><span class="p">)</span></code></pre></figure>
<p>The <code class="highlighter-rouge">tools</code> library does all the heavy lifting behind the scenes and I use the <code class="highlighter-rouge">Jupyter</code> notebook for very high level manipulations and tests.</p>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see here.Cheat Sheet: Linear convergence for Conditional Gradients2018-10-19T00:50:00-04:002018-10-19T00:50:00-04:00http://www.pokutta.com/blog/research/2018/10/19/cheatsheet-fw-lin-conv<p><em>TL;DR: Cheat Sheet for linearly convergent Frank-Wolfe algorithms (aka Conditional Gradients). What does linear convergence mean for Frank-Wolfe and how to achieve it? Continuation of the Frank-Wolfe series. Long and technical.</em>
<!--more--></p>
<p><em>Posts in this series (so far).</em></p>
<ol>
<li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li>
<li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li>
<li><a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li>
</ol>
<p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p>
<p>In the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> of this series we have looked at the basic mechanics of Conditional Gradients algorithms; as mentioned in my last post I will use Frank-Wolfe [FW] and Conditional Gradients [CG] interchangeably. In this installment we will look at linear convergence of these methods and work through the many subtleties that can easily cause confusion. I will stick to the notation from the <a href="/blog/research/2018/10/05/cheatsheet-fw.html">first post</a> and will refer to it frequently, so you might want to give it a quick refresher or read.</p>
<h2 id="what-is-linear-convergence-and-can-it-be-achieved">What is linear convergence and can it be achieved?</h2>
<p>I am purposefully vague here for the time being, for reasons that will become clear further below. Let us consider the convex optimization problem</p>
<script type="math/tex; mode=display">\min_{x \in P} f(x),</script>
<p>where $f$ is a differentiable convex function and $P$ is some compact and convex feasible region. In a nutshell linear convergence of an optimization method $\mathcal A$ (producing iterates $x_1, \dots, x_t, \dots$) asserts that given $\varepsilon > 0$, in order to achieve</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \varepsilon,</script>
<p>where $x^\esx$ is an optimal solution, it suffices to choose $t \geq \Omega(\log 1/\varepsilon)$, i.e., the number of required iterations is logarithmic in the reciprocal of the error, or put differently, the algorithm converges “exponentially fast”; this is called <em>linear convergence</em> in convex optimization. Frankly, I am not perfectly sure where the name <em>linear convergence</em> originates from. The best explanation I have so far is to consider the iterations needed to achieve “the next significant digit” (i.e., powers of $10$): $k$ more significant digits require $\operatorname{linear}(k)$ iterations. Now, in the statement $t \geq \Omega(\log 1/\varepsilon)$ above I brushed many “constants” under the rug, and it is precisely here that we need to be extra careful to understand what is happening; minor spoiler: maybe some of the constants are not that constant after all.</p>
<p>Linear convergence can be typically achieved for strongly convex functions as shown in the unconstrained case <a href="/blog/research/2018/10/05/cheatsheet-fw.html">last time</a>. Now let us consider the following example that we have also already encountered in the last post, which comes from [J].</p>
<p class="mathcol"><strong>Example:</strong> For linear optimization oracle-based first-order methods, a rate of $O(1/t)$ is the best possible. Consider the function $f(x) \doteq \norm{x}^2$, which is strongly convex, and the polytope $P = \operatorname{conv}\setb{e_1,\dots, e_n} \subseteq \RR^n$, the probability simplex in dimension $n$. We want to solve $\min_{x \in P} f(x)$. Clearly, the optimal solution is $x^\esx = (\frac{1}{n}, \dots, \frac{1}{n})$. Whenever we call the linear programming oracle, on the other hand, we obtain one of the $e_i$ vectors, and in the absence of any information beyond convexity of the feasible region, we can only form convex combinations of those. Thus after $k$ iterations, the best we can produce as a convex combination is a vector with support $k$; the minimizer of $f(x)$ among such vectors is, e.g., $x_k = (\frac{1}{k}, \dots,\frac{1}{k},0,\dots,0)$ with $k$ entries equal to $1/k$, so that we obtain a gap
<script type="math/tex">h(x_k) \doteq f(x_k) - f(x^\esx) = \frac{1}{k}-\frac{1}{n},</script>
which after requiring $\frac{1}{k}-\frac{1}{n} < \varepsilon$ implies $k > \frac{1}{\varepsilon - 1/n} \approx \frac{1}{\varepsilon}$ for $n$ large. In particular it holds for $k \leq \lfloor n/2 \rfloor$:
\[h(x_k) \geq \frac{1}{k}-\frac{1}{n} \geq \frac{1}{2k}.\]</p>
<p>Letting $n$ be large, this example basically shows that <em>any</em> first-order method based on a linear optimization oracle cannot beat $O(1/\varepsilon)$ convergence: linear convergence is a hoax. Or is it? The problem is of course the ordering of the quantifiers here: in the definition of linear convergence, we first choose the instance $\mathcal I$ with its parameters, and then for any $\varepsilon > 0$ we need a dependence $h(x_t) \leq e^{- r(\mathcal I)t}$, where the rate $r(\mathcal I)$ is a constant that can (and usually will) depend on the instance $\mathcal I$; this is a good reminder that quantifier ordering <em>does matter a lot</em>. In fact, it turns out that this example (and modifications of it) is one of the most illustrative examples for understanding what linear convergence (for Conditional Gradients) really means.</p>
<p class="mathcol"><strong>Definition (linear convergence).</strong> Let $f$ be a convex function and $P$ be some feasible region. An algorithm that produces iterates $x_1, \dots, x_t, \dots$ <em>converges linearly</em> to $f^\esx \doteq \min_{x \in P} f(x)$, if there exists an $r > 0$, so that
\[
f(x_t) - f^\esx \leq e^{-r t}.
\]</p>
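To make the logarithmic dependence concrete, here is a quick numeric sketch (the helper name is mine): with rate $r$, reaching accuracy $\varepsilon$ takes roughly $\log(1/\varepsilon)/r$ iterations, so each additional significant digit costs the same fixed number of extra iterations.

```python
import math

def iterations_needed(r, eps):
    """Smallest t with e^(-r*t) <= eps, i.e., t = ceil(log(1/eps) / r)."""
    return math.ceil(math.log(1.0 / eps) / r)

# doubling the number of significant digits only doubles the iteration count
t3 = iterations_needed(0.1, 1e-3)   # 3 digits
t6 = iterations_needed(0.1, 1e-6)   # 6 digits
```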
<p>So let us get back to our example. One of my favorite things about convex optimization is: “when in doubt, compute.” So let us do exactly this. Before we go there let us also recall the notion of smoothness and the convergence rate of the standard Frank-Wolfe method:</p>
<p class="mathcol"><strong>Definition (smoothness).</strong> A convex function $f$ is said to be <em>$L$-smooth</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \leq \nabla f(x)(y-x) + \frac{L}{2} \norm{x-y}^2</script>.</p>
<p>With this we can formulate the convergence rate for the Frank-Wolfe algorithm:</p>
<p class="mathcol"><strong>Theorem (Convergence of Frank-Wolfe [FW], see also [J]).</strong> Let $f$ be a convex <em>$L$-smooth</em> function. The standard Frank-Wolfe algorithm with step size rule $\gamma_t \doteq \frac{2}{t+2}$ produces iterates $x_t$ that satisfy:
\[f(x_t) - f(x^\esx) \leq \frac{LD^2}{t+2},\]
where $D$ is the diameter of $P$ in the considered norm (in smoothness) and $L$ the Lipschitz constant.</p>
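Before plugging in the constants, here is a minimal sketch of this algorithm on the running example (assuming <code class="highlighter-rouge">numpy</code>; names are mine). For the test we check the slightly weaker classical guarantee $h(x_t) \leq \frac{2LD^2}{t+2}$, which is what the standard induction $h_{t+1} \leq (1-\gamma_t)h_t + \frac{\gamma_t^2 L D^2}{2}$ directly yields.

```python
import numpy as np

def fw_standard(n=50, iters=500):
    """Standard Frank-Wolfe with the agnostic step size 2/(t+2) for
    f(x) = ||x||^2 over the probability simplex; returns primal gaps."""
    x = np.zeros(n); x[0] = 1.0
    fstar = 1.0 / n                    # f at the barycenter x* = (1/n, ..., 1/n)
    gaps = []
    for t in range(iters):
        grad = 2.0 * x
        v = np.zeros(n); v[np.argmin(grad)] = 1.0   # LP oracle: best vertex
        x = x + 2.0 / (t + 2) * (v - x)
        gaps.append(x @ x - fstar)     # h(x_{t+1})
    return np.array(gaps)

gaps = fw_standard()
```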
<p>Let us consider the $\ell_2$-norm as the norm for smoothness (and hence for the diameter) and apply this bound to the example above. Observe that we have $D = \sqrt{2}$ for the probability simplex. Moreover, $L \doteq 2$ is a feasible choice. With $f(x) = \norm{x}^2$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\norm{y}^2 - \norm{x}^2 & \leq \nabla \norm{x}^2(y-x) + \frac{L}{2} \norm{x-y}^2 \\
& = 2\langle x,y\rangle -2 \norm{x}^2 + \frac{L}{2} \norm{x-y}^2 \\
& = 2\langle x,y\rangle -2 \norm{x}^2 + \frac{L}{2} \norm{x}^2 + \frac{L}{2} \norm{y}^2 - L \langle x,y\rangle.
\end{align} %]]></script>
<p>For $L \doteq 2$, this simplifies to</p>
<script type="math/tex; mode=display">0 \leq - \norm{y}^2 + \norm{x}^2 - 2 \norm{x}^2 + \norm{x}^2 + \norm{y}^2 = 0,</script>
<p>which is also the optimal choice; as both sides above are $0$, we also have that the strong convexity constant $\mu \doteq 2$, which we will use later. As such the convergence of the Frank-Wolfe algorithm becomes</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \frac{4}{t+2}.</script>
<p>Note that this has a couple of implications. First of all, this guarantee for our example is <em>independent</em> of the dimension of the probability simplex that we are using. Moreover, we also have</p>
<script type="math/tex; mode=display">\underbrace{\frac{1}{t} - \frac{1}{n}}_{\text{lower bound}} \leq \underbrace{\frac{4}{t+2}}_{\text{upper bound}},</script>
<p>i.e., a very tight band in which the Frank-Wolfe algorithm has to move. Now it is time to do some actual computations and look at the plots. Note that the plots are in log-log scale (mapping $f(x) \mapsto \log f(e^x)$), which is helpful to identify super-polynomial behavior, effectively turning:</p>
<ol>
<li>(inverse) polynomials into linear functions: degree of polynomial affecting the slope and multiplicative factors affecting the shift</li>
<li>any super polynomial function into a non-linearity.</li>
</ol>
<p>Hence we roughly have that the upper bound is an additive shift of the lower bound. Let us now look at actual computations. In the figure below we depict the convergence of the Frank-Wolfe algorithm using the step size rule $\gamma_t = \frac{\langle \nabla f(x_{t-1}), x_{t-1} - v_t \rangle}{L \norm{x_{t-1} - v_t}^2}$, where $v_t$ is the Frank-Wolfe vertex of the respective round; this is the analog of the short step for Frank-Wolfe (see <a href="/blog/research/2018/10/05/cheatsheet-fw.html">last post</a>). We depict convergence on probability simplices of sizes $n \in \setb{30,40,50}$ as well as the upper bound function $\operatorname{ub(t)} \doteq \frac{4}{t+2}$ and the lower bound function $\operatorname{lb(t)} \doteq \frac{1}{2t}$. On the left, we see the first $100$ iterations only and then on the right the first $5000$ iterations. It is important to note that we plot the <em>primal gap</em> $h(x_t)$ here and not the <em>primal function value</em> $f(x_t)$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/fw-lin-inner-threshold_100_5000.png" alt="Convergence for different simplices" /></p>
<p>As we can see, for all three instances the primal gaps stay neatly within the upper and lower bounds, but then suddenly they break out below the lower bound curve, and we can see from the plot that the primal gap drops super-polynomially fast. To avoid confusion, keep in mind that we only established the validity of the lower bound up to $\lfloor n/2 \rfloor$ iterations, where $n$ is the dimension of the simplex. Now let us have a closer look at what is really happening. In the next graph we only consider the instance $n=30$ for the first $200$ iterations.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/lin-conv-explanation-simplex.png" alt="Three regimes" /></p>
<p>We have three distinct regimes. In regime $R_1$, for $t \in [1, n/2]$, we see that $\operatorname{lb(t)} \leq h(x_t) \leq \operatorname{ub(t)}$. In the second regime $R_2$, for $t \in [n/2 + 1, n]$, we see that $h(x_t)$ crosses the lower bound; we can also see that in regimes $R_1$ and $R_2$ the gap $h(x_t)$ drops super-polynomially. Then at $t = n$, where regime $R_3$ begins, the convergence rate abruptly slows down; however, it continues to drop super-polynomially, as we have seen in the graphs above.</p>
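The three-regime picture can be reproduced with a short sketch (assuming <code class="highlighter-rouge">numpy</code>; names are mine). One caveat: the trajectory in regime $R_3$ depends on how the LP oracle breaks ties among the zero gradient coordinates; with the deterministic first-index choice below, the iterates are exactly the optimal support-$k$ points, so the gap collapses to machine precision right around $t = n$.

```python
import numpy as np

def fw_short_step(n=30, iters=200):
    """Frank-Wolfe with the short step rule for f(x) = ||x||^2 (so L = 2)
    over the probability simplex; returns the primal gaps h(x_t)."""
    x = np.zeros(n); x[0] = 1.0
    fstar = 1.0 / n
    gaps = []
    for _ in range(iters):
        grad = 2.0 * x
        v = np.zeros(n); v[np.argmin(grad)] = 1.0   # first-index tie-breaking
        d = x - v
        gamma = min(max(grad @ d / (2.0 * (d @ d)), 0.0), 1.0)   # short step
        x = x - gamma * d
        gaps.append(x @ x - fstar)
    return np.array(gaps)

n = 30
gaps = fw_short_step(n)
```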
<h3 id="quantifying-convergence">Quantifying convergence</h3>
<p>Next, let us try to put some actual numbers, beyond intuition, to what is happening in our example. For the sake of exposition we will favor simplicity over sharpness of the derived rates. In fact the obtained rates are not optimal as we can see by comparing them to the figures above. Recall the definition of strong convexity:</p>
<p class="mathcol"><strong>Definition (strong convexity).</strong> A convex function $f$ is said to be <em>$\mu$-strongly convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \nabla f(x)(y-x) + \frac{\mu}{2} \| x-y \|^2</script>.</p>
<p>For the analysis we will need the following bound on the primal gap induced by strong convexity:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \frac{\langle\nabla f(x_t),x_t - x^\esx\rangle^2}{2 \mu \norm{x_t - x^\esx}^2},</script>
<p>as well as the progress induced by smoothness (using e.g., the short step rule):</p>
<script type="math/tex; mode=display">f(x_{t}) - f(x_{t+1}) \geq \frac{\langle \nabla f(x_t), d\rangle^2}{2L \norm{d}^2},</script>
<p>where $d$ is some direction that we consider (see <a href="/blog/research/2018/10/05/cheatsheet-fw.html">last post</a> for both derivations). If we could now (non-deterministically) choose $d \doteq x_t - x^\esx$, then we can immediately combine these two inequalities to obtain:</p>
<script type="math/tex; mode=display">f(x_{t}) - f(x_{t+1}) \geq \frac{\mu}{L} h(x_t),</script>
<p>and iterating this inequality we obtain linear convergence. However, usually we do not have access to $d \doteq x_t - x^\esx$, so we somehow have to relate the step that we actually take, i.e., the Frank-Wolfe step, to this “optimal” direction. To simplify things we will relate the Frank-Wolfe step to $\norm{\nabla f(x_t)}$ and then use that $\langle \nabla f(x_t), \frac{x_t - x^\esx}{\norm{x_t - x^\esx}}\rangle \leq \norm{\nabla f(x_t)}$ by Cauchy-Schwarz. In particular, we want to show that there exists $1 \geq \alpha > 0$, so that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
f(x_{t}) - f(x_{t+1}) & \geq \frac{\langle \nabla f(x_t), x_t - v \rangle^2}{2L \norm{x_t - v}^2}
\\ & \geq \alpha^2 \frac{\norm{\nabla f(x_t)}^2}{2L}
\\ & \geq \alpha^2 \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle^2}{2L \norm{x_t - x^\esx}^2},
\end{align} %]]></script>
<p>by means of showing $\frac{\langle \nabla f(x_t), x_t - v \rangle}{\norm{x_t - v}} \geq \alpha \norm{\nabla f(x_t)}$. We can then complete the argument as before simply losing the multiplicative factor $\alpha^2$ and obtain:</p>
<script type="math/tex; mode=display">f(x_{t}) - f(x_{t+1}) \geq \alpha^2 \frac{\mu}{L} h(x_t),</script>
<p>or equivalently,</p>
<script type="math/tex; mode=display">h(x_{t+1}) \leq \left(1 - \alpha^2 \frac{\mu}{L}\right) h(x_t).</script>
<p>To get slightly sharper bounds, we can estimate $\alpha$ separately in each iteration, which we will do now:</p>
<p class="mathcol"><strong>Observation.</strong> For $t \leq n$ the scaling factor $\alpha_t$ satisfies $\alpha_t \geq \sqrt{\frac{1}{2t}}$. <br /></p>
<p><em>Proof.</em>
Our starting point is the inequality $\frac{\langle \nabla f(x_t), x_t - v \rangle}{\norm{x_t - v}} \geq \alpha_t \norm{\nabla f(x_t)}$, for which we want to determine a suitable $\alpha_t$. Observe that for $f(x) = \norm{x}^2$, we have $\nabla f(x) = 2x$. Thus the inequality becomes
\[
\frac{\langle 2 x_t, x_t - v \rangle}{\norm{x_t - v}} \geq \alpha_t \norm{2 x_t}.
\]
Now observe that if we pick $v = \arg \max_{v \in P} \langle \nabla f(x_t), x_t - v \rangle$, then for rounds $t \leq n$ there exists at least one basis vector $e_i$ that is not yet in the support of $x_t$ (where the <em>support</em> is simply the set of vertices appearing in the convex combination that forms $x_t$), so that $\langle x_t, v \rangle = 0$. Thus the above can be further simplified to
\[
\frac{2 \norm{x_t}^2}{\norm{x_t - v}} \geq \alpha_t 2 \norm{ x_t} \quad \Leftrightarrow \quad \frac{\norm{x_t}}{\norm{x_t - v}} \geq \alpha_t.
\]
Moreover, $\norm{x_t}\geq \sqrt{\frac{1}{t}}$ and $\norm{x_t - v} \leq \sqrt{2}$, so that we obtain a choice $\alpha_t \doteq \sqrt{\frac{1}{2t}}$.
\[\qed\]</p>
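The observation is easy to verify numerically along the short-step run (a sketch assuming <code class="highlighter-rouge">numpy</code>; we measure up to $t = n-1$ so that a coordinate outside the support is guaranteed to exist):

```python
import numpy as np

def measure_alphas(n=50):
    """Run short-step Frank-Wolfe for f(x) = ||x||^2 on the simplex and record
    alpha_t = <grad f(x_t), x_t - v> / (||x_t - v|| * ||grad f(x_t)||)."""
    x = np.zeros(n); x[0] = 1.0
    alphas = []
    for _ in range(1, n):
        grad = 2.0 * x
        v = np.zeros(n); v[np.argmin(grad)] = 1.0
        d = x - v
        alphas.append((grad @ d) / (np.linalg.norm(d) * np.linalg.norm(grad)))
        gamma = min(grad @ d / (2.0 * (d @ d)), 1.0)   # short step with L = 2
        x = x - gamma * d
    return alphas

alphas = measure_alphas()
```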
<p>Combining this with the above, recalling that $L = \mu = 2$ in our example, we obtain up to iteration $t\leq n$, a contraction of the form</p>
<script type="math/tex; mode=display">h(x_n) \leq h(x_0) \prod_{t = 2}^n \left(1-\alpha_t^2\right) = h(x_0) \prod_{t = 2}^n \left(1-\frac{1}{2t}\right) \leq \prod_{t = 2}^n \left(1-\frac{1}{2t}\right),</script>
<p>as $h(x_0) \leq 1$. In fact I strongly suspect that the $2$ in the above can be shaved off as well, as then we would obtain a contraction of the form:</p>
<script type="math/tex; mode=display">\tag{estimatedConv}
h(x_n) \leq \prod_{t = 2}^n \left(1-\frac{1}{t}\right) = \frac{1}{n},</script>
<p>which would be in line with the observed rates in the following graphic; note that the product telescopes, as each factor $1-\frac{1}{t} = \frac{t-1}{t}$ cancels the previous denominator. The factor $2$ that arose from estimating $\norm{x_t -v} \leq \sqrt{2}$ can be at least partially improved: if we used line search, this would actually reduce to $\norm{x_t -v } = \sqrt{1+ \frac{1}{t}}$, and the short step rule should be pretty close to the line search step as $f$ is actually a quadratic (I might update the computation at a later time to see whether this can be made precise). Assuming that we are “close enough” to the line search guarantee of $\norm{x_t -v } = \sqrt{1+ \frac{1}{t}}$, we obtain the desired bound, as now</p>
<p>\[
\alpha_t \leq \sqrt{\frac{1}{t+1}} = \sqrt{\frac{1}{t (1+ \frac{1}{t})}} \leq \frac{\norm{x_t}}{\norm{x_t - v}},
\]</p>
<p>and we can choose $\alpha_t \doteq \sqrt{\frac{1}{t+1}}$, so that</p>
<script type="math/tex; mode=display">h(x_n) \leq h(x_0) \prod_{t = 2}^n \left(1-\alpha_t^2\right) = h(x_0) \prod_{t = 2}^n \left(1-\frac{1}{t+1}\right) \leq \prod_{t = 2}^n \left(1-\frac{1}{t+1}\right),</script>
<p>where the $t+1$ vs. $t$ offset is due to index shifting and we obtain the desired form (estimatedConv); in the worst-case within a factor of $2$.</p>
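Both contraction products are easy to check numerically (a quick sketch; the helper name is mine): the product $\prod_{t=2}^n (1-\frac{1}{t})$ telescopes to exactly $\frac{1}{n}$, while keeping the factor $2$ from the estimate $\norm{x_t - v} \leq \sqrt{2}$ only yields roughly $n^{-1/2}$.

```python
def contraction_products(n):
    """prod_{t=2}^n (1 - 1/t) telescopes to 1/n; prod_{t=2}^n (1 - 1/(2t))
    only decays like n^(-1/2), illustrating the cost of the factor 2."""
    telescoped, with_two = 1.0, 1.0
    for t in range(2, n + 1):
        telescoped *= 1.0 - 1.0 / t
        with_two *= 1.0 - 1.0 / (2.0 * t)
    return telescoped, with_two
```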
<p>Note, in the following graph the upper bound is now $1/t$ as function of $t$ and lower bound is $1/t - 1/n$ plotted for $n=30$ as a function of $t$. Clearly, the lower bound is only valid for $t\leq n$. Again we depict two regimes: left for the first $200$ iterations, right for the first $2000$ iterations.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/fw-lin-speedup.png" alt="Linear vs. Sublinear Convergence" /></p>
<p>In fact, after closer inspection, it seems that locally we are actually doing slightly better than $\alpha_t = \sqrt{\frac{1}{t}}$, but for the sake of argument of linear convergence we could safely estimate $\alpha_t \geq \sqrt{\frac{1}{2n}}$ to show linear convergence up to $t \leq n$; as a side note, we can always do this: bounding the progress with the worst-case progress per round and calling it “linear convergence”. The key however is that we can show that there is a reasonable lower bound <em>independent of $\varepsilon > 0$</em>.</p>
<p>To this end, we will now analyze the sudden change of slope for $t > n$, and we will show that even after that change of slope we still have a reasonable lower bound on the $\alpha_t$ <em>independent</em> of $t$ or $\varepsilon$. Intuitively, the sudden change in slope makes sense: from iteration $n+1$ onwards we cannot use $\langle x_t, v \rangle = 0$ anymore, as all vertices have been picked up, and the estimation from above gets much weaker. However, we will now see that we can still bound $\norm{\nabla f(x_t)}$ in a similar fashion; this argument is originally due to [GM] and we will revisit it in isolated and more general form in the next section.</p>
<p class="mathcol"><strong>Observation.</strong> There exists $t’ \geq n$ so that for all $t \geq t’$ the scaling factor $\alpha_t$ satisfies $\alpha_t \geq \sqrt{\frac{1}{8n}}$. <br /></p>
<p><em>Proof.</em>
Suppose that we are in iteration $t > n$ and let $H$ be the affine space that contains $P$. Suppose that there exists a ball $B(x^\esx, 2 r ) \cap H \subseteq P$ of radius $2r$ around the optimal solution that is contained in the relative interior of $P$. If now the primal gap satisfies $h(x_{t’}) \leq r^2$ for some $t’$, it follows by strong convexity that $\norm{x_{t} - x^\esx}^2 \leq h(x_{t}) \leq h(x_{t’}) \leq r^2$ for $t \geq t’$, as $h(x_t)$ is monotonically decreasing (by the choice of the short step) and $\mu=2$. Thus for $t \geq t’$ it holds $\norm{x_t - x^\esx} \leq r$. For the remainder of the argument let us assume that the gradient $\nabla f(x_t)$ is already projected onto the linear space $H$. Therefore $x_t - r \frac{\nabla f(x_t)}{\norm{\nabla f(x_t)}} \in B(x^\esx, 2 r ) \cap H \subseteq P$ and as such $d \doteq r \frac{\nabla f(x_t)}{\norm{\nabla f(x_t)}}$ is a valid direction (take $v = x_t - d$) and we have
\[
\max_{v \in P} \langle \nabla f(x_t),x_t - v\rangle \geq \langle \nabla f(x_t), d\rangle = r \norm{\nabla f(x_t)},
\]
and in particular
\[
\frac{\langle \nabla f(x_t),x_t - v\rangle}{\norm{x_t - v}} \geq \frac{r}{\norm{x_t - v}} \norm{\nabla f(x_t)} \geq \frac{r}{\sqrt{2}} \norm{\nabla f(x_t)}.
\]
For the choice $r \doteq \frac{1}{2\sqrt{n}}$, we have $B(x^\esx, 2 r ) \cap H \subseteq P$, so that we obtain a choice of $\alpha_t \doteq \frac{1}{2\sqrt{2n}}$ and a contraction of the form
\[
h(x_{t+1}) \leq h(x_t) \left(1 - \frac{1}{8n} \right).
\]
\[\qed\]</p>
<p>To finish off this exercise, let us briefly derive a lower bound for any linear rate. To this end, recall that $h(x_{n/2}) \geq 1/n$. Moreover, we have $h(x_0) \leq 1$. Suppose we have a linear rate with constant $\beta$, then</p>
<script type="math/tex; mode=display">\frac{1}{n} \leq h(x_0) \left(1-\beta\right)^{n/2} \leq \left(1-\beta\right)^{n/2},</script>
<p>and as such we have $- \ln n \leq (n/2) \ln(1-\beta)$ or equivalently</p>
<script type="math/tex; mode=display">- \frac{2 \ln n}{n} \leq \ln (1 - \beta) \leq - \beta,</script>
<p>so that $\beta \leq \frac{2 \ln n}{n}$ follows, or put differently, any linear rate <em>has to depend</em> on the dimension $n$.</p>
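This dimension dependence is easy to sanity-check numerically (the helper name is mine): the largest $\beta$ consistent with $(1-\beta)^{n/2} \geq \frac{1}{n}$ is $\beta = 1 - n^{-2/n}$, which indeed stays below $\frac{2 \ln n}{n}$ and shrinks as $n$ grows.

```python
import math

def max_linear_rate(n):
    """Largest beta with (1 - beta)^(n/2) >= 1/n, i.e., beta = 1 - n^(-2/n)."""
    return 1.0 - n ** (-2.0 / n)

# any linear rate must degrade with the dimension n
rates = [(n, max_linear_rate(n), 2.0 * math.log(n) / n) for n in (10, 100, 1000)]
```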
<h3 id="impact-of-the-step-size-rule">Impact of the step size rule</h3>
<p>For completeness: the step size rule is important for achieving linear convergence. In particular, the standard Frank-Wolfe step size rule of $\frac{2}{t+2}$ does not induce linear convergence, as we can see in the graph below. On the left is the standard Frank-Wolfe step size rule and on the right the short step rule from above for comparison. The key difference is that the short step rule roughly maximizes progress via the smoothness inequality. Note, however, that from the graph we can see that the standard Frank-Wolfe step size rule still induces a primal gap of $O(1/t^p)$ for some $p > 1$, i.e., outperforming the $O(1/t)$ rate for iterations $t \geq n$.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/compare-subLin-Lin.png" alt="Comparison different step size rules" /></p>
<h2 id="linear-convergence-for-frank-wolfe">Linear Convergence for Frank-Wolfe</h2>
<p>After having worked through the example to hopefully get some intuition for what is going on, let us now turn to the general setup. We consider the problem:</p>
<script type="math/tex; mode=display">\min_{x \in P} f(x),
\tag{P}</script>
<p>where $P$ is a polytope and $f$ is a strongly convex function. Note that we restrict ourselves to polytopes here not merely for exposition but because the involved quantities that we will need are only defined for the polyhedral case and in fact these quantities can approach $0$ for general compact convex sets.</p>
<p>Before we consider the general setup, observe that the arguments in the example above do not cleanly separate out the contribution of the geometry from the contribution of the strong convexity of the function. While this helped a lot with simplifying the arguments, this is highly undesirable and we will work towards a clean separation of the contribution of strong convexity of the function and the contribution of the geometry of the polytope towards the rate of linear convergence. Ultimately, we have already seen what we have to show for the general case: there exists $\alpha > 0$, so that for any iterate $x_t$, our algorithms (Frank-Wolfe or modifications of such) provide a direction $d$, so that</p>
<script type="math/tex; mode=display">\frac{\langle \nabla f(x_t), d \rangle}{\norm{d}} \geq \alpha \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}.
\tag{Scaling}</script>
<p>This condition should serve as a guide post throughout the following discussion: if such an $\alpha$ exists, we basically obtain a linear rate of $\alpha^2 \frac{\mu}{L}$, i.e., $h(x_t) \leq h(x_0) \left(1- \alpha^2 \frac{\mu}{L}\right)^t$, exactly as done above.</p>
<h3 id="the-simple-case-xesx-in-strict-relative-interior">The simple case: $x^\esx$ in strict relative interior</h3>
<p>In the example above we have actually proven something stronger, which is due to [GM] (and holds more generally for compact convex sets):</p>
<p class="mathcol"><strong>Theorem (Linear convergence for $x^\esx$ in relative interior [GM]).</strong> Let $f$ be a smooth strongly convex function with smoothness $L$ and strong convexity parameter $\mu$. Further let $P$ be a compact convex set. If $B(x^\esx,2\varepsilon) \cap \operatorname{aff}(P) \subseteq P$ with $x^\esx \doteq \arg\min_{x \in P} f(x)$ for some $\varepsilon > 0$, then there exists $t’$ such that for all $t \geq t’$ it holds
\[
h(x_t) \leq \left(1 - \frac{\varepsilon^2}{D^2} \frac{\mu}{L}\right)^{t-t’} h(x_{t’}),
\]
where $D$ is the diameter of $P$. <br /></p>
<p><em>Proof.</em>
We basically gave the proof above already. The key insight is that if $x^\esx$ is contained $2\varepsilon$-deep in the relative interior, then we can show the existence of some $t’$ so that for all $t\geq t’$ it holds
\[
\frac{\langle \nabla f(x_t),x_t - v\rangle}{\norm{x_t - v}} \geq \frac{\varepsilon}{D} \norm{\nabla f(x_t)},
\]
and we then plug this into the formula from the example with $\alpha = \frac{\varepsilon}{D}$. $\qed$</p>
<p>The careful reader will have observed that in the example as well as in the proof above, we realize a bound</p>
<script type="math/tex; mode=display">\frac{\langle \nabla f(x_t), d \rangle}{\norm{d}} \geq \alpha \norm{\nabla f(x_t)},</script>
<p>which is stronger than what (Scaling) requires, since $\frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}} \leq \norm{\nabla f(x_t)}$. This stronger condition cannot be satisfied in general: if $x^\esx$ lies on the boundary of $P$, then $\norm{\nabla f(x_t)}$ does not vanish (while $\langle \nabla f(x_t), x_t - x^\esx \rangle$ does), and the condition is therefore unsatisfiable, as via smoothness it would guarantee infinite progress. This alone is not the full obstruction though, as we could instead aim to establish (Scaling) directly. However, it turns out that the standard Frank-Wolfe algorithm <em>cannot</em> achieve linear convergence when the optimal solution $x^\esx$ lies on the (relative) boundary of $P$, as the following theorem shows:</p>
<p class="mathcol"><strong>Theorem (FW converges sublinearly for $x^\esx$ on the boundary [W]).</strong> Suppose that the (unique) optimal solution $x^\esx$ lies on the boundary of the polytope $P$ and is not an extreme point of $P$. Further suppose that there exists an iterate $x_t$ that is not already contained in the same minimal face as $x^\esx$. Then for any constant $\delta > 0$, the relation
\[
f(x_t) - f(x^\esx) \geq \frac{1}{t^{1+\delta}},
\]
holds for infinitely many indices $t$.</p>
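<p>The following sketch (the instance and all constants are my choice for illustration, not taken from [W]) shows this behavior for vanilla Frank-Wolfe with the standard step size $\gamma_t = \frac{2}{t+2}$ on the unit square, where the optimum lies in the middle of a facet:</p>

```python
# A hedged numerical sketch (instance and constants are my choice, not from
# [W]): vanilla Frank-Wolfe with step size gamma_t = 2 / (t + 2) on the unit
# square, where the optimum x* = (0.5, 0) lies on a facet but is no vertex.
import numpy as np

vertices = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
x_star = np.array([0.5, 0.])  # on the boundary, not an extreme point

def f(x):
    return float((x - x_star) @ (x - x_star))

x = np.array([0., 1.])  # start at a vertex
gaps = []
for t in range(1000):
    g = 2.0 * (x - x_star)                 # gradient of f
    v = vertices[np.argmin(vertices @ g)]  # LP oracle over the square
    gamma = 2.0 / (t + 2.0)
    x = (1 - gamma) * x + gamma * v
    gaps.append(f(x))                      # primal gap h(x_t), as f(x*) = 0
```

<p>Plotting <code class="highlighter-rouge">gaps</code> on a log scale shows a slow, clearly non-geometric decay: the FW vertex keeps alternating between $(0,0)$ and $(1,0)$ instead of reaching the optimal facet.</p>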
<h3 id="introducing-away-steps">Introducing Away-Steps</h3>
<p>So what is the fundamental reason that we basically cannot beat $\Omega(1/t)$ rates if $x^\esx$ is on the boundary? The problem lies in the scaling condition that we want to satisfy via Frank-Wolfe steps, i.e.,</p>
<script type="math/tex; mode=display">\frac{\langle \nabla f(x_t), x_t - v \rangle}{\norm{x_t - v}} \geq \alpha \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}.</script>
<p>The closer we get to the boundary, the smaller $\langle \nabla f(x_t), x_t - v \rangle$ gets: the direction $\frac{x_t - v}{\norm{x_t - v}}$ approximates the gradient $\nabla f(x_t)$ worse and worse compared to the direction $\frac{x_t - x^\esx}{\norm{x_t - x^\esx}}$. This is basically also how the proof of the theorem above works: we need that $x^\esx$ is not an extreme point, as otherwise for $x^\esx = v$ the approximation cannot become arbitrarily bad, and we need that the iterates never reach the minimal face of the optimal solution, as otherwise we are back in the case of the relative interior. As a result of this flattening of the gradient approximations we observe the (relatively well-known) zig-zagging phenomenon (see figure further below).</p>
<p>The ultimate reason for the zig-zagging is that we lack directions that guarantee (Scaling). The first ones to overcome this challenge in the general case were Garber and Hazan [GH]. At the risk of oversimplifying their beautiful result, the main idea is to define a new oracle that performs linear optimization not over $P$ but over $P \cap \tilde B(x_t,\varepsilon)$ for some notion of “ball” $\tilde B$, so that the point $p = \arg \min_{x \in P \cap \tilde B(x_t,\varepsilon)} \langle \nabla f(x_t), x \rangle$ (usually not a vertex) satisfies</p>
<script type="math/tex; mode=display">\frac{\langle \nabla f(x_t), x_t - p \rangle}{\norm{x_t - p}} \geq \alpha \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}}</script>
<p>for some $\alpha$. This is “trivial” if $\tilde B$ is the euclidean ball, as then $x_t - p = \varepsilon \frac{\nabla f(x_t)}{\norm{\nabla f(x_t)}}$ if $\varepsilon$ is small enough; we are then basically in the case of the interior solution. The key insight in [GH] however is that you can define a notion of ball $\tilde B$, so that you can solve this modified oracle with a <em>single</em> call to the original LP oracle and <em>still</em> ensure (Scaling). What this really comes down to is adding many more directions that ultimately provide better approximations of the direction $\frac{x_t - x^\esx}{\norm{x_t - x^\esx}}$. Unfortunately, the resulting algorithm is extremely hard to implement and not practical due to exponentially sized constants.</p>
<p>What we will consider in the following is an alternative approach to add more directions, which is due to [W]. Suppose we have an iterate $x_t$ obtained via a couple of Frank-Wolfe iterations. Then $x_t = \sum_{i \in [t]} \lambda_i v_i$, where the $v_i$ with $i \in [t]$ are extreme points of $P$, $\lambda_i \geq 0$ for $i \in [t]$, and $\sum_{i \in [t]} \lambda_i = 1$. We call the set $S_t \doteq \setb{v_i \mid \lambda_i > 0, i \in [t] }$ the <em>active set</em>. In addition to Frank-Wolfe directions of the form $x_t - v$, we can consider
$a \doteq \arg\max_{v \in S_t} \langle \nabla f(x_t), v \rangle$ (an $\arg\max$, as opposed to the $\arg\min$ for Frank-Wolfe directions) and the resulting <em>away direction</em> $a - x_t$, which does not add a new vertex but <em>removes</em> weight from a previously added vertex; since we maintain the decomposition, we know exactly how much weight can be removed while staying feasible. The reason why this is useful is that it adds not just some additional directions, but directions that intuitively make sense: slow convergence happens because we cannot enter the optimal face containing $x^\esx$ fast enough, namely when a vertex in the convex combination keeps the iterates from reaching that face; with Frank-Wolfe steps we can only slowly wash out such a blocking vertex (basically at a rate of $1/t$). An <em>away step</em>, which follows an away direction, can potentially remove the same blocking vertex in a <em>single</em> iteration. Let us consider the following figure, where on the left we see the normal Frank-Wolfe behavior and on the right the behavior with an away step; in the example the depicted polytope is $\operatorname{conv}(S_t)$ (see [LJ] for a nicer illustration).</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/fw-away.png" alt="Away-step of FW" /></p>
<p>With this improvement we can formulate:</p>
<p class="mathcol"><strong>Away-step Frank-Wolfe (AFW) Algorithm [W]</strong> <br />
<em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial vertex $x_0 \in P$ and initial active set $S_0 = \setb{x_0}$. <br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad v_t \leftarrow \arg\min_{x \in P} \langle \nabla f(x_{t}), x \rangle \quad \setb{\text{FW direction}}$ <br />
$\quad a_t \leftarrow \arg\max_{x \in S_t} \langle \nabla f(x_{t}), x \rangle \quad \setb{\text{Away direction}}$ <br />
$\quad$ If $\langle \nabla f(x_{t}), x_t - v_t \rangle > \langle \nabla f(x_{t}), a_t - x_t \rangle: \quad \setb{\text{FW vs. Away}}$<br />
$\quad \quad x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$ with $\gamma_t \in [0,1]$ $\quad \setb{\text{Perform FW step}}$ <br />
$\quad$ Else: <br />
$\quad \quad x_{t+1} \leftarrow (1+\gamma_t) x_t - \gamma_t a_t$ with $\gamma_t \in [0,\frac{\lambda_{a_t}}{1-\lambda_{a_t}}]$ $\quad \setb{\text{Perform Away step}}$ <br />
$\quad S_{t+1} \leftarrow \operatorname{ActiveSet}(x_{t+1})$</p>
<p>In the above, $\lambda_{a_t}$ is the weight of vertex ${a_t}$ in the decomposition of $x_t$ in iteration $t$. Moreover, by the same smoothness argument that we have used multiple times by now, the progress of an away step, provided it did not hit the upper bound $\frac{\lambda_{a_t}}{1-\lambda_{a_t}}$, is at least</p>
<script type="math/tex; mode=display">f(x_t) - f(x_{t+1}) \geq \frac{\langle \nabla f(x_{t}), a_t - x_t \rangle^2}{2L \norm{a_t - x_t}^2},</script>
<p>i.e., the same type of progress that we have for the FW steps. If we hit the upper bound then vertex $a_t$ is removed from the convex combination / active set and we call this a <em>drop step</em>.</p>
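<p>For concreteness, here is a compact Python sketch of the algorithm above with exact line search for the quadratic $f(x) = \norm{x - x^\esx}^2$ over the unit square; the instance is a toy example of my choosing and the implementation is purely illustrative:</p>

```python
# Hedged sketch of the AFW algorithm above with exact line search for the
# quadratic f(x) = ||x - x*||^2 over the unit square (same illustrative toy
# instance as before; not an optimized implementation).
import numpy as np

vertices = [np.array(v) for v in [(0., 0.), (1., 0.), (0., 1.), (1., 1.)]]
x_star = np.array([0.5, 0.])
weights = {(0., 1.): 1.0}  # active set S_t as a map vertex -> weight
x = np.array([0., 1.])

for t in range(200):
    g = 2.0 * (x - x_star)
    v = min(vertices, key=lambda u: float(g @ u))           # FW vertex
    a = max(weights, key=lambda u: float(g @ np.array(u)))  # away vertex
    a_vec = np.array(a)
    if float(g @ (x - v)) >= float(g @ (a_vec - x)):        # FW vs. Away
        d, gamma_max, fw_step = v - x, 1.0, True
    else:
        d = x - a_vec
        gamma_max = weights[a] / max(1.0 - weights[a], 1e-16)
        fw_step = False
    if float(d @ d) < 1e-32:
        break
    # exact line search for gamma -> ||x + gamma*d - x*||^2 on [0, gamma_max]
    gamma = min(max(float((x_star - x) @ d) / float(d @ d), 0.0), gamma_max)
    if fw_step:
        weights = {u: (1 - gamma) * w for u, w in weights.items()}
        key = tuple(v)
        weights[key] = weights.get(key, 0.0) + gamma
    else:
        weights = {u: (1 + gamma) * w for u, w in weights.items()}
        weights[a] -= gamma
    weights = {u: w for u, w in weights.items() if w > 1e-12}  # drop steps
    x = x + gamma * d
```

<p>Unlike vanilla Frank-Wolfe on the same instance, the iterates here reach the optimal facet quickly: the away steps remove the weight of the blocking vertex at a geometric rate instead of washing it out at a rate of $1/t$.</p>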
<p class="mathcol"><strong>Observation (A case for FW).</strong>
We will see in the next section that the Away-step Frank-Wolfe algorithm achieves linear convergence (for strongly convex functions) even for optimal solutions on the boundary. Moreover, it has been widely observed empirically that AFW makes more progress per iteration than FW. So irrespective of the convergence rate proof, why not always use the AFW variant, given that the additional computational overhead is small? In some cases vanilla FW has a huge advantage over AFW: it does not have to maintain the active set for the decomposition, and for some problems, e.g., matrix completion, this matters a lot. For example, some of the experiments in [PANJ] could not be performed for variants other than FW that need to maintain the active set. Now, there are special cases (see, e.g., [GM2]), or settings assuming the existence of a rather strong <em>away oracle</em> (see, e.g., [BZ]), in which we do not need to maintain active sets. For completeness and slightly simplifying, the away oracle solves $\max_{x \in F} \langle \nabla f(x_t), x \rangle$, where $F$ is the minimal face of $P$ that contains $x_t$. One can then easily verify that the optimal solution is an away vertex for <em>some</em> decomposition of $x_t$ and in fact it will induce the largest progress (provided it is not a drop step); see [BZ] for details.</p>
<h3 id="pyramidal-width-and-linear-convergence-for-afw">Pyramidal width and linear convergence for AFW</h3>
<p>So how do we obtain linear convergence with the help of away steps? The key insight here is due to Lacoste-Julien and Jaggi [LJ], who showed that there exists a geometric constant $w(P)$, the so-called <em>pyramidal width</em>, that depends <em>only</em> on the polytope $P$. While the full derivation of the pyramidal width would be too tedious here, it provides the following crucial strong convexity bound:</p>
<script type="math/tex; mode=display">h(x_t) \leq \frac{\langle \nabla f(x_{t}), a_t - v_t \rangle^2}{2 \mu w(P)^2},</script>
<p>where $\mu$ is the strong convexity constant of the function $f$. If we plug this back into the standard progress equation (as done before), we obtain:</p>
<script type="math/tex; mode=display">h(x_{t+1}) \leq h_t \left(1 - \frac{\mu}{L} w(P)^2 \right),</script>
<p>i.e., we obtain linear convergence. Note that I have cheated slightly here by not accounting for the drop steps (of which there are no more than genuine FW steps). Rephrasing the provided bound in our language here (see Theorem 3 in [LJ]), it holds:</p>
<script type="math/tex; mode=display">\langle \nabla f(x_{t}), a_t - v_t \rangle > w(P) \frac{\langle \nabla f(x_t), x_t - x^\esx \rangle}{\norm{x_t - x^\esx}},</script>
<p>where the missing term $\norm{a_t - v_t}$ can be absorbed in various ways, e.g., by bounding it via the diameter of $P$ and absorbing it into $w(P)$ itself, or by absorbing it into the definition of curvature in the case of the affine-invariant version of AFW.</p>
<h3 id="final-comments">Final comments</h3>
<p>I would like to end this post with a few comments:</p>
<ol>
<li>
<p>The Away-step Frank-Wolfe algorithm can be further improved by not choosing either a FW step or an away step but by directly combining the two into a direction $d \doteq a_t - v_t$. This leads to the <em>Pairwise Conditional Gradients</em> algorithm, which is typically faster but harder to analyze due to so-called <em>swap steps</em>, in which one vertex leaves the active set and another one enters at the same time.</p>
</li>
<li>
<p>Recently, in [PR] and [GP], the notion of pyramidal width has been further simplified and generalized.</p>
</li>
<li>
<p>There is also a very beautiful way of achieving (Scaling) in the decomposition-invariant case where no active set has to be maintained. The initial key insight here is due to [GM2], where they show that if $\norm{x_t - x^\esx}$ is small, then so is the amount of weight that needs to be shifted around in the convex combination of $x_t$ to represent $x^\esx$. This can then be directly combined with the away steps to obtain (Scaling). In [GM2] the construction works for certain structured polytopes only and this has been recently extended in [BZ] to the general case.</p>
</li>
</ol>
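<p>To make point 1 above concrete, here is a hedged sketch of the pairwise step on a toy quadratic over the unit square (the instance and the line search are my choices for illustration): all weight moves directly from the away vertex $a_t$ to the Frank-Wolfe vertex $v_t$, and a swap step occurs exactly when the step size hits the weight of $a_t$:</p>

```python
# Hedged sketch of a pairwise step (point 1 above) on an illustrative
# quadratic f(x) = ||x - x*||^2 over the unit square: all weight moves
# directly from the away vertex a_t to the FW vertex v_t along v_t - a_t.
import numpy as np

vertices = [np.array(v) for v in [(0., 0.), (1., 0.), (0., 1.), (1., 1.)]]
x_star = np.array([0.5, 0.])
weights = {(0., 1.): 1.0}  # active set as a map vertex -> weight
x = np.array([0., 1.])

for t in range(300):
    g = 2.0 * (x - x_star)
    v = min(vertices, key=lambda u: float(g @ u))           # FW vertex
    a = max(weights, key=lambda u: float(g @ np.array(u)))  # away vertex
    d = v - np.array(a)                                     # pairwise direction
    if float(d @ d) < 1e-32:
        break
    # exact line search, capped at the weight of the away vertex
    gamma = min(max(float((x_star - x) @ d) / float(d @ d), 0.0), weights[a])
    if gamma > 0.0:
        weights[a] -= gamma                           # shift weight from a_t ...
        key = tuple(v)
        weights[key] = weights.get(key, 0.0) + gamma  # ... to v_t
        if weights[a] <= 1e-12:
            del weights[a]                # swap step: a_t leaves, v_t enters
        x = x + gamma * d
```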
<h3 id="references">References</h3>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[J] Jaggi, M. (2013, June). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML (1) (pp. 427-435). <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">pdf</a></p>
<p>[GM] Guélat, J., & Marcotte, P. (1986). Some comments on Wolfe’s ‘away step’. Mathematical Programming, 35(1), 110-119. <a href="https://link.springer.com/content/pdf/10.1007/BF01589445.pdf">pdf</a></p>
<p>[W] Wolfe, P. (1970). Convergence theory in nonlinear programming. Integer and nonlinear programming, 1-36.</p>
<p>[GH] Garber, D., & Hazan, E. (2013). A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666. <a href="https://arxiv.org/abs/1301.4666">pdf</a></p>
<p>[PANJ] Pedregosa, F., Askari, A., Negiar, G., & Jaggi, M. (2018). Step-Size Adaptivity in Projection-Free Optimization. arXiv preprint arXiv:1806.05123. <a href="https://arxiv.org/abs/1806.05123">pdf</a></p>
<p>[GM2] Garber, D., & Meshi, O. (2016). Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems (pp. 1001-1009). <a href="http://papers.nips.cc/paper/6115-linear-memory-and-decomposition-invariant-linearly-convergent-conditional-gradient-algorithm-for-structured-polytopes">pdf</a></p>
<p>[BZ] Bashiri, M. A., & Zhang, X. (2017). Decomposition-Invariant Conditional Gradient for General Polytopes with Line Search. In Advances in Neural Information Processing Systems (pp. 2690-2700). <a href="http://papers.nips.cc/paper/6862-decomposition-invariant-conditional-gradient-for-general-polytopes-with-line-search">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[PR] Pena, J., & Rodriguez, D. (2018). Polytope conditioning and linear convergence of the Frank–Wolfe algorithm. Mathematics of Operations Research. <a href="https://arxiv.org/pdf/1512.06142.pdf">pdf</a></p>
<p>[GP] Gutman, D. H., & Pena, J. F. (2018). The condition of a function relative to a polytope. arXiv preprint arXiv:1802.00271. <a href="https://arxiv.org/pdf/1802.00271.pdf">pdf</a></p>
<h1 id="training-neural-networks-with-lps">Training Neural Networks with LPs (2018-10-12)</h1>
<p><em>TL;DR: This is an informal summary of our recent paper <a href="http://arxiv.org/abs/1810.03218">Principled Deep Neural Network Training through Linear Programming</a> with <a href="http://www.columbia.edu/~dano/">Dan Bienstock</a> and <a href="http://cerc-datascience.polymtl.ca/person/gonzalo-munoz/">Gonzalo Muñoz</a>, where we show that the computational complexity of approximate Deep Neural Network training depends polynomially on the data size for several architectures by means of constructing (relatively) small LPs.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>Deep Learning has received significant attention due to its impressive
performance in many state-of-the-art learning tasks. Unfortunately, while very powerful, Deep Learning is not well understood theoretically, and in particular only recently have results on the complexity of training deep neural networks been obtained. So why would we care, “as long as it works”? The reasons are several. First of all, understanding the complexity of training provides us with a <em>general</em> insight into how hard Deep Neural Network (DNN) training really is. Maybe it is generally a hard problem? Maybe it is actually easy? Moreover, we have to differentiate approaches that merely provide a “good solution” from those that actually solve the training problem to (near-)optimality; a discussion of whether or not the latter is desirable from a generalization point of view is for a different time; nonetheless, jumping ahead a bit, we <em>do</em> also establish generalization of models trained via LPs. Finally, once the complexity of training is understood, one can start to consider follow-up questions, such as how hard <em>robust</em> training is, a technique that has become important for hardening DNNs against adversarial attacks and that is of ultimate importance if we ever really want to deploy these ML systems in the real world, on a large scale, possibly with human lives at stake (e.g., autonomous driving).</p>
<p>We show that training DNNs to $\varepsilon$-approximate optimality can be done via linear programs of relatively small size. I would like to stress though that our results are <em>not about practical training</em> but about characterizing the <em>computational complexity</em> of the DNN training problem.</p>
<h2 id="our-results">Our results</h2>
<p>In neural network training we are interested in solving the following <em>Empirical Risk Minimization (ERM)</em> problem:</p>
<script type="math/tex; mode=display">\tag{ERM}
\min_{\phi \in \Phi} \frac{1}{D} \sum_{i=1}^D
\ell(f(\hat{x}^i,\phi), \hat{y}^i),</script>
<p>where $\ell$ is some <em>loss function</em>,
$(\hat{x}^i, \hat{y}^i)_{i=1}^D$ is an i.i.d. sample from some data
distribution $\mathcal D$ of sample size $D$, and $f$ is a neural network architecture
parameterized by $\phi \in \Phi$ with $\Phi$ being the parameter
space of the considered architecture (e.g., network weights). The empirical risk minimization problem is solved as a stand-in for the <em>general risk minimization (GRM)</em> problem of the form</p>
<script type="math/tex; mode=display">\tag{GRM}
\min_{\phi \in \Phi} \mathbb E_{(x,y) \in \mathcal D} {\ell(f(x,\phi),
y)}</script>
<p>which we usually cannot solve because we do not have explicit access to the true data distribution $\mathcal D$; one hopes that the (ERM) solution $\phi^\esx$ reasonably generalizes to the (GRM) solution, i.e., that it also roughly minimizes (GRM). To help generalization and prevent overfitting against the specific sample, we often solve the <em>regularized ERM (rERM)</em>, which is of the form:</p>
<script type="math/tex; mode=display">\tag{rERM}
\min_{\phi \in \Phi} \frac{1}{D} \sum_{i=1}^D
\ell(f(\hat{x}^i,\phi), \hat{y}^i) + \lambda R(\phi),</script>
<p>where $R$ is a <em>regularizer</em>, typically a norm, and $\lambda > 0$ is a weight controlling the strength of the regularization.</p>
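<p>To fix ideas, here is a small numerical sketch of the (rERM) objective for a hypothetical one-hidden-layer ReLU network with squared loss and an $\ell_1$ regularizer; the architecture, loss function, and data below are stand-ins of my choosing and have nothing to do with the LP construction itself:</p>

```python
# Illustrative only: evaluating the (rERM) objective for a hypothetical
# one-hidden-layer ReLU network with squared loss and an l1 regularizer
# R(phi). Architecture, loss, and data are stand-ins, not the paper's setup.
import numpy as np

def relu_net(x, phi):
    W1, b1, W2 = phi  # the parameters phi in the parameter space Phi
    return float(np.maximum(W1 @ x + b1, 0.0) @ W2)

def rerm_objective(phi, data, lam=0.1):
    W1, b1, W2 = phi
    emp_risk = np.mean([(relu_net(x, phi) - y) ** 2 for x, y in data])
    reg = np.abs(W1).sum() + np.abs(b1).sum() + np.abs(W2).sum()  # R(phi)
    return float(emp_risk + lam * reg)

rng = np.random.default_rng(0)  # data and parameters normalized to [-1, 1]
data = [(rng.uniform(-1, 1, 3), float(rng.uniform(-1, 1))) for _ in range(8)]
phi = (rng.uniform(-1, 1, (4, 3)), rng.uniform(-1, 1, 4), rng.uniform(-1, 1, 4))
value = rerm_objective(phi, data)
```

<p>Setting $\lambda = 0$ recovers the plain (ERM) objective.</p>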
<p>Typically, the (regularized) ERM is solved with some form of stochastic gradient descent in neural network training, ultimately exploiting the <em>finite sum structure</em> of the objective and linearity of the derivative allowing us to (batch) sample from our data sample and compute stochastic gradients. It turns out that the same finite sum structure induces an optimization problem of low treewidth, which allows us to formulate the ERM problem as a reasonably small linear program. Note that we make no assumptions on convexity etc. here and complexity of the architecture and loss will be captured by Lipschitzness. To keep the exposition simple here (and also in the paper), we assume that both the data and the parameter space are normalized to be bounded within an appropriate-dimensional box of the form $[-1,1]^\esx$.</p>
<p>The main result we obtain can be semi-formally stated as follows; we formulate the result for (ERM) but it immediately extends to (rERM):</p>
<p class="mathcol"><strong>Theorem (informal) [BMP].</strong> Let $D$ be a given sample size and $\varepsilon > 0$, then there exists a linear program with a polytope $P$ as a feasible region with the following properties: <br /> <br />
(a) <em>Data-independent LP.</em> The linear program has no more than $O( D \cdot (\mathcal L/\varepsilon)^K)$ variables. The construction is <em>independent</em> of any specific training data set. <br /> <br />
(b) <em>Solving ERM.</em> For any given dataset $\hat D \doteq (\hat{x}^i, \hat{y}^i)_{i=1}^D$, there exists a face of $P$, so that optimizing over this face provides an $\varepsilon$-approximate solution to (ERM) for $\hat D$. This is equivalent, by Farkas’ lemma, to the existence of a <em>linear objective</em> as a function of $\hat D$, that when optimized over $P$ yields an $\varepsilon$-approximate solution $\phi^\esx \in \Phi$ to (ERM) for $\hat D$. (as we require $\phi^\esx \in \Phi$ we are in the <em>proper learning</em> setting) <br /> <br />
Here $\mathcal L$ is an architecture dependent Lipschitz constant and $K$ is the “size” of the network.</p>
<p>A few remarks are in order: what is special and maybe confusing at first is that the LP can be written down <em>before</em> the actual training data is revealed; the only input (w.r.t. the data) for the construction is the <em>sample size $D$</em> as well as network-specific parameters. This is very subtle but highly important: if we talk only about the <em>(bit) size of a linear program</em> as a measure of complexity, as we do here, then the actual time required to write down the linear program is irrelevant. As such, if we allowed the construction of the LP to depend on the actual training data, then we could always find a small LP that basically just outputs the optimal network configuration $\phi^\esx$, which would be nonsensical. Observe that we have a similar requirement for algorithms: they should work for a <em>broad class of inputs</em> and not for a <em>single, specific</em> input. What is different here is that the construction also depends on the sample size. This makes sense as the LP cannot “extend itself” after it has been constructed, whereas algorithms can cater to different input sizes. From a computational complexity perspective this phenomenon is well understood: LPs are more like circuits (e.g., the complexity class $\operatorname{P}/\operatorname{poly}$), whose construction also depends on the input size (the polynomial advice), than algorithms (e.g., the complexity class $\operatorname{P}$). This phenomenon is also well known from <em>extended formulations</em>, where often a <em>uniform</em> and a <em>non-uniform model</em> are distinguished, catering to this issue (see, e.g., [BFP]). In the language of extended formulations, we have a uniform model here, where the instances (e.g., different training data sets $\hat D$) are only encoded in the objective functions.</p>
<p>Once the LP is constructed, it can be solved for a <em>specific training dataset</em> by fixing some of its variables to appropriate values to obtain the face described in (b)—from our construction it is clear that both fixing the variables or equivalently computing the desired linear objective can be done efficiently. When the actual LP is then solved, e.g., with the Ellipsoid method (see e.g., [GLS]) we obtain the desired training algorithm with a running time polynomial in the size of the LP and hence the sample size $D$. In particular, for fixed architectures we obtain training algorithms that are polynomial time; this has to be taken with a grain of salt though, as it is really the fact that there exists a <em>single</em> polytope that basically encodes the ERM for <em>all realizations of the training data</em> for a specific architecture and sample size $D$ that is interesting and unexpected here.</p>
<p>The way our construction works is by observing that the ERM problem from above naturally admits a formulation as an optimization problem of low treewidth and, as discussed in an <a href="/blog/research/2018/09/22/treewidth-abstract.html">earlier post</a>, this can be exploited to construct small linear programs: (1) reformulate ERM as an optimization problem of low treewidth, (2) discretize and reformulate as a binary optimization problem (this is where $\mathcal L$ comes in), (3) exploit low treewidth to obtain a small LP formulation (basically a bit of convex relaxation and Sherali-Adams-like lifting). For the last step (3) we rely on an immediate generalization of a theorem of [BM] that exploits low treewidth to construct small LPs for polynomial optimization problems.</p>
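<p>As a back-of-the-envelope illustration of where the $(\mathcal L/\varepsilon)^K$ factor in the theorem above comes from, consider discretizing a parameter box $[-1,1]^K$ to accuracy $\varepsilon/\mathcal L$ in step (2); the following is purely illustrative arithmetic, not the actual construction:</p>

```python
# Back-of-the-envelope arithmetic only (not the actual construction): the
# number of points of a grid with spacing eps / L on the box [-1, 1]^K,
# which is where the (L / eps)^K shape of the variable bound comes from.
import math

def grid_points(L, eps, K):
    """Points of an (eps / L)-spaced grid on [-1, 1]^K."""
    per_dim = math.floor(2 * L / eps) + 1
    return per_dim ** K

def lp_variable_shape(D, L, eps, K):
    """Matches the O(D * (L / eps)^K) shape from the theorem above."""
    return D * grid_points(L, eps, K)
```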
<h3 id="comparison-to-earlier-results">Comparison to earlier results</h3>
<p>We are not the first ones to think about the complexity of the training problem though and in fact it was [ABMM] that inspired our work. As such, in the following I will <em>very briefly</em> compare our results to earlier ones and refer the interested reader to the paper for a detailed discussion. Most closely related to our results are [ABMM], [GKKT], and [ZLJ], however it is hard to compare exact numbers directly as the setups differ. In fact a fair comparison might prove impossible and one should probably think of these results as complementary.</p>
<ol>
<li>
<p>[GKKT] and [ZLJ] consider the <em>improper learning</em> setup, i.e., the constructed model is not from the same family of models $\Phi$ considered in the ERM. Their dependence on some of the input parameters is better but they also only consider a limited class of architectures.</p>
</li>
<li>
<p>[ABMM] on the other hand considers <em>proper learning</em> but is limited to one hidden layer and output dimension one. Then again, they solve the ERM to <em>global optimality</em> (no $\varepsilon$’s here compared to us). In terms of complexity their dependence on the sample size is much worse.</p>
</li>
</ol>
<h3 id="references">References</h3>
<p>[BFP] Braun, G., Fiorini, S., & Pokutta, S. (2016). Average case polyhedral complexity of the maximum stable set problem. Mathematical Programming, 160(1-2), 407-431. <a href="https://link.springer.com/article/10.1007/s10107-016-0989-3">journal</a> <a href="https://arxiv.org/abs/1311.4001">arxiv</a></p>
<p>[GLS] Grötschel, M., Lovász, L., & Schrijver, A. (2012). Geometric algorithms and combinatorial optimization (Vol. 2). Springer Science & Business Media. <a href="https://books.google.com/books?id=x1zmCAAAQBAJ&lpg=PA1&ots=QZkUnX8MZu&dq=Geometric%20algorithms%20and%20combinatorial%20optimization%20vol%202&lr&pg=PA1#v=onepage&q=Geometric%20algorithms%20and%20combinatorial%20optimization%20vol%202&f=false">google books</a></p>
<p>[BMP] Bienstock, D., Mun͂oz, G., & Pokutta, S. (2018). Principled Deep Neural Network Training through Linear Programming <a href="http://arxiv.org/abs/1810.03218">arxiv</a></p>
<p>[BM] Bienstock, D., & Mun͂oz, G. (2018). LP Formulations for Polynomial Optimization Problems. SIAM Journal on Optimization, 28(2), 1121-1150. <a href="https://epubs.siam.org/doi/10.1137/15M1054079">journal</a> <a href="https://arxiv.org/abs/1501.00288">arxiv</a></p>
<p>[ABMM] Arora, R., Basu, A., Mianjy, P., & Mukherjee, A. (2016). Understanding deep neural networks with rectified linear units. Proceedings of ICLR 2018. <a href="https://arxiv.org/abs/1611.01491">arxiv</a></p>
<p>[GKKT] Goel, S., Kanade, V., Klivans, A., & Thaler, J. (2017, June). Reliably Learning the ReLU in Polynomial Time. In Conference on Learning Theory (pp. 1004-1042). <a href="https://arxiv.org/abs/1611.10258">arxiv</a></p>
<p>[ZLJ] Zhang, Y., Lee, J. D., & Jordan, M. I. (2016, June). l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (pp. 993-1001). <a href="http://proceedings.mlr.press/v48/zhangd16.pdf">jmlr</a></p>
<h1 id="toolchain-tuesday-no-1">Toolchain Tuesday No. 1 (2018-10-09)</h1>
<p><em>TL;DR: Part of a series of posts about tools, services, and packages that I use in day-to-day operations to boost efficiency and free up time for the things that really matter. Use at your own risk - happy to answer questions. For the full, continuously expanding list so far see <a href="/blog/pages/toolchain.html">here</a>.</em>
<!--more--></p>
<p>This is the first installment of a series of posts; the <a href="/blog/pages/toolchain.html">full list</a> is expanding over time.</p>
<h2 id="software">Software:</h2>
<h3 id="atom">Atom</h3>
<p>Multi-purpose, highly-extensible text editor.</p>
<p><em>Learning curve: ⭐️⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://atom.io/">https://atom.io/</a></em></p>
<p>I stumbled upon <code class="highlighter-rouge">atom</code> by accident because one of my students was using it and it has become (one of) the crucial infrastructure piece(s) for me. <code class="highlighter-rouge">Atom</code> is hands-down the best text editor that is currently out there. I would even go as far as saying that it is today what <code class="highlighter-rouge">emacs</code> was many years back. It goes way beyond text editing due to an extensive package library that allows you to customize and extend <code class="highlighter-rouge">atom</code> in infinitely many different ways. It takes a few days getting used to it but it is worth it. In fact it is open <em>constantly</em> on my machine. Moreover, it is available for basically all platforms and you can simply move configurations between machines to make sure it is the same everywhere. Here are a few examples what I use <code class="highlighter-rouge">atom</code> for:</p>
<ol>
<li>Markdown editor and previewer</li>
<li>LaTeX typesetting environment</li>
<li>Development environment (when I don’t need the full power of <code class="highlighter-rouge">PyCharm</code>; more on this later)</li>
<li>Collaboration environment (see an <a href="/blog/random/2018/08/20/atom-markdown.html">older post here</a>)</li>
<li>Interactive execution of <code class="highlighter-rouge">python</code>, <code class="highlighter-rouge">julia</code>, and <code class="highlighter-rouge">R</code> code with the <code class="highlighter-rouge">hydrogen</code> package.</li>
</ol>
<h3 id="docker">Docker</h3>
<p>Deploy code in a self-contained mini-virtual machine.</p>
<p><em>Learning curve: ⭐️⭐️⭐️⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://www.docker.com/">https://www.docker.com/</a></em></p>
<p><code class="highlighter-rouge">Docker</code> became an essential tool for me for deploying software. The way to think about <code class="highlighter-rouge">docker</code> is as a lightweight virtual machine in which your code runs and which you deploy, a so-called <em>container</em>. The problems that <code class="highlighter-rouge">Docker</code> solves are:</p>
<ol>
<li><em>Platform independence:</em> Easy deployment without having to worry about the target system at all, except that it must run the Docker service. No worrying about dependencies and correct versions on the target system: if it runs in the container on your machine, it will run on the target system.</li>
<li><em>Shorter time to test/production</em>: Significantly relaxes deployment requirements for prototypical code: just run it in a container and have the infrastructure around it handle the security piece. This allows for significantly shorter turnarounds when putting things into test and production, e.g., for A/B testing. In several of my projects it cut down deployment from several months to a few days.</li>
<li><em>Non-Persistency</em>: Changing dependencies and libraries on the host system does not affect the container. Moreover, restarting the container resets it to its initial state: you cannot break a container.</li>
<li><em>Scalability</em>: Need more throughput? Just spawn multiple instances of the container.</li>
<li><em>Sandboxing:</em> I also use <code class="highlighter-rouge">docker</code> for sandboxing on my own machine. I prefer not to install every new tool on the horizon and mess up my system configuration, rather I test it in a container.</li>
</ol>
<p>In terms of performance, containerization costs you some overhead, but it is usually acceptable for most applications. <code class="highlighter-rouge">Docker</code> is not trivial to set up though and will require some time to get right.</p>
<h2 id="python-libraries">Python Libraries:</h2>
<h3 id="tqdm">TQDM</h3>
<p>Progress bar for python with automatic timing, ETA, etc. for loops and enumerations.</p>
<p><em>Learning curve: ⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://github.com/tqdm/tqdm">https://github.com/tqdm/tqdm</a></em></p>
<p>Have you ever written a loop in python, e.g., sifting through a large data set, with no idea how long it will take to complete or how fast an iteration is? This is where <code class="highlighter-rouge">TQDM</code> comes in handy. Simply wrap it around the enumerator, and when running the code you get a progress bar with all that information:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">)):</span>
<span class="o">...</span></code></pre></figure>
<p>Output:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="mi">76</span><span class="o">%|</span><span class="err">████████████████████████████</span> <span class="o">|</span> <span class="mi">7568</span><span class="o">/</span><span class="mi">10000</span> <span class="p">[</span><span class="mo">00</span><span class="p">:</span><span class="mi">33</span><span class="o"><</span><span class="mo">00</span><span class="p">:</span><span class="mi">10</span><span class="p">,</span> <span class="mf">229.00</span><span class="n">it</span><span class="o">/</span><span class="n">s</span><span class="p">]</span></code></pre></figure>
<h2 id="services">Services:</h2>
<h3 id="trello">Trello</h3>
<p>Manage lists (e.g., todo lists) online, across various platforms with various plugins.</p>
<p><em>Learning curve: ⭐️⭐️</em>
<em>Usefulness: ⭐️⭐️⭐️⭐️</em> <br />
<em>Site: <a href="https://trello.com/">https://trello.com/</a></em></p>
<p><code class="highlighter-rouge">Trello</code> became my go-to solution for todo lists etc. It works on all possible devices, integrates with <code class="highlighter-rouge">Evernote</code>, <code class="highlighter-rouge">google drive</code>, and my calendar. It has notifications and allows for collaboration/sharing. It is also extremely useful for developing software, e.g., for scrum boards.</p>
<p><strong>Cheat Sheet: Frank-Wolfe and Conditional Gradients</strong> (2018-10-05, <a href="http://www.pokutta.com/blog/research/2018/10/05/cheatsheet-fw">permalink</a>)</p>
<p><em>TL;DR: Cheat Sheet for Frank-Wolfe and Conditional Gradients. Basic mechanics and results; this is a rather long post and the start of a series of posts on this topic.</em>
<!--more--></p>
<p><em>Posts in this series (so far).</em></p>
<ol>
<li><a href="/blog/research/2018/10/05/cheatsheet-fw.html">Cheat Sheet: Frank-Wolfe and Conditional Gradients</a></li>
<li><a href="/blog/research/2018/10/19/cheatsheet-fw-lin-conv.html">Cheat Sheet: Linear convergence for Conditional Gradients</a></li>
<li><a href="/blog/research/2018/11/11/heb-conv.html">Cheat Sheet: Hölder Error Bounds (HEB) for Conditional Gradients</a></li>
</ol>
<p><em>My apologies for incomplete references—this should merely serve as an overview.</em></p>
<p>One of my favorite topics that I am currently interested in is constrained smooth convex optimization and in particular projection-free first-order methods, such as the <em>Frank-Wolfe Method [FW]</em> aka <em>Conditional Gradients [CG]</em>. In this post, I will provide a basic overview of these methods, how they work, and my perspective, which will be the basis for some of the future posts.</p>
<h2 id="the-goal-smooth-constraint-convex-minimization">The goal: Smooth Constrained Convex Minimization</h2>
<p>The task that we will be considering here is to solve <em>constrained smooth convex optimization</em> problems of the form</p>
<script type="math/tex; mode=display">\min_{x \in P} f(x),</script>
<p>where $f$ is a differentiable convex function and $P$ is some compact and convex feasible region; you might want to think of $P$ as being, e.g., a polytope, which is one of the most common cases. As such, for the sake of exposition we will assume that $P \subseteq \mathbb R^n$ is a polytope. We are interested in general purpose methods and not methods specialized to specific problem configurations.</p>
<p>First, we need to agree on how we can access the function $f$ and the feasible region $P$. For the feasible region $P$ we assume access by means of a so-called <em>linear programming oracle</em>:</p>
<p class="mathcol"><strong>Linear Programming oracle</strong> <br />
<em>Input:</em> $c \in \mathbb R^n$ <br />
<em>Output:</em> $\arg\min_{x \in P} \langle c, x \rangle$</p>
<p>We further assume that we can access $f$ by means of a so-called <em>first-order oracle</em>:</p>
<p class="mathcol"><strong>First-Order oracle</strong> <br />
<em>Input:</em> $x \in \mathbb R^n$ <br />
<em>Output:</em> $\nabla f(x)$ and $f(x)$</p>
<p>Many problems of interest in optimization and machine learning can be naturally cast in this setting, such as, e.g., <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)">Linear Regression with LASSO</a>:</p>
<p class="mathcol"><strong>Example:</strong> LASSO Regression <br />
Linear Regression with LASSO regularization can be formulated as
minimizing a quadratic loss function over a (rescaled) $\ell_1$-ball:
<script type="math/tex">\min_{\beta, \beta_0} \{\frac{1}{N} \|y- \beta_0 1_N - X\beta\|_2^2 \ \mid \ \|\beta\|_1 \leq t\}.</script></p>
<h3 id="the-workhorses-convexity-and-smoothness">The workhorses: convexity and smoothness</h3>
<p>In smooth convex optimization we have two key concepts (and variations of those) that drive most results: <em>smoothness</em> and <em>convexity</em>.</p>
<p><em>Convexity</em> provides an under-estimator for the change of the function $f$ by means of a linear function, namely the first-order Taylor approximation.</p>
<p class="mathcol"><strong>Definition (convexity).</strong> A differentiable function $f$ is said to be <em>convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \nabla f(x)(y-x)</script>.</p>
<p><em>Smoothness</em> provides an over-estimator of the change of the function $f$ by means of a quadratic function; in fact the smoothness inequality works in reverse compared to convexity. For the sake of simplicity we will work with the affine-variant versions; however, similar affine-invariant versions exist.</p>
<p class="mathcol"><strong>Definition (smoothness).</strong> A convex function $f$ is said to be <em>$L$-smooth</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \leq \nabla f(x)(y-x) + \frac{L}{2} \| x-y \|^2</script>.</p>
<p>Finally, we have <em>strong convexity</em>, which provides a quadratic under-estimator for the change of the function $f$, again obtained from the Taylor expansion. We will not exploit strong convexity today, except for the warm-up below, but it is helpful for understanding the overall context. Note that strong convexity is basically the reverse inequality of smoothness (and provides stronger bounds than simple convexity).</p>
<p class="mathcol"><strong>Definition (strong convexity).</strong> A convex function $f$ is said to be <em>$\mu$-strongly convex</em> if for all $x,y \in \mathbb R^n$ it holds: <script type="math/tex">f(y) - f(x) \geq \nabla f(x)(y-x) + \frac{\mu}{2} \| x-y \|^2</script>.</p>
<p>The following graphic relates the various concepts with each other:</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/convexity.png" alt="Convexity and smoothness" /></p>
<h4 id="warmup-gradient-descent-from-smoothness-and-strong-convexity">Warmup: Gradient Descent from smoothness and (strong) convexity</h4>
<p>As a warmup we will now establish convergence of gradient descent in the smooth <em>unconstrained</em> case. An important consequence of smoothness is that we can use it to <em>lower bound the progress</em> of a typical gradient step. Let $x_t \in \mathbb R^n$ be an (arbitrary) point and let $x_{t+1} = x_t - \eta \cdot d$, where $d \in \mathbb R^n$ is some <em>direction</em>. Using smoothness we obtain:</p>
<script type="math/tex; mode=display">\underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \eta \langle\nabla f(x_t),d\rangle - \eta^2 \frac{L}{2} \|d\|^2</script>
<p>Optimizing the right-hand side for $\eta$ leads to $\eta^* = \frac{\langle\nabla f(x_t),d\rangle}{L \norm{d}^2}$, which, upon plugging back into the above, with the usual choice $d \doteq \nabla f(x_t)$, leads to:</p>
<p class="mathcol"><strong>Progress induced by smoothness:</strong>
\[
\begin{equation}
\underbrace{f(x_{t}) - f(x_{t+1})}_{\text{primal progress}} \geq \frac{\norm{\nabla f(x_t)}^2}{2L}.
\end{equation}
\]</p>
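<p>To make the progress bound concrete, here is a tiny numeric sanity check; the toy quadratic $f(x,y) = (3x^2+y^2)/2$ with $L=3$ and all variable names are our own illustration, not from the post:</p>

```python
# Sanity check of: f(x_t) - f(x_{t+1}) >= ||grad f(x_t)||^2 / (2L)
# for x_{t+1} = x_t - (1/L) grad f(x_t), on the toy quadratic
# f(x, y) = (3*x**2 + y**2)/2, which is L-smooth with L = 3.
L = 3.0

def f(x, y):
    return (3 * x**2 + y**2) / 2

def grad(x, y):
    return (3 * x, y)

x, y = 2.0, -1.5
gx, gy = grad(x, y)
# one gradient step with the derived step size eta* = 1/L for d = grad f
x1, y1 = x - gx / L, y - gy / L

progress = f(x, y) - f(x1, y1)            # actual primal progress
guaranteed = (gx**2 + gy**2) / (2 * L)    # the smoothness guarantee
assert progress >= guaranteed
```

The inequality holds with some slack here because the quadratic's curvature varies between $\mu = 1$ and $L = 3$ across directions.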
<p>We will first complete the argument using strong convexity now as it is significantly simpler than using convexity only. While smoothness provides a lower bound on the primal progress of a typical gradient step, we can use strong convexity to obtain <em>an upper bound</em> on the <em>primal optimality gap</em> <script type="math/tex">h(x_t) \doteq f(x_t) - f(x^*)</script> by means of the norm of the gradient. The argument is very similar to the argument employed for the progress bound induced by smoothness. We start from the strong convexity inequality and apply it to the points $x = x_t$ and $y = x_t - \eta e_t$, where <script type="math/tex">e_t \doteq x^*-x_t</script> is the direction pointing towards the optimal solution $x^*$; note that in the case of strong convexity the latter is unique. We have:</p>
<script type="math/tex; mode=display">f(x_t - \eta e_t) - f(x_t) \geq - \eta \langle\nabla f(x_t),e_t\rangle + \eta^2\frac{\mu}{2} \| e_t \|^2.</script>
<p>If we now minimize the right-hand side of the above inequality over $\eta$, with an argument identical to the ones above the minimum is achieved for the
choice $\eta^\esx \doteq \frac{\langle\nabla f(x_t), e_t\rangle}{\mu \norm{e_t}^2}$; note that it has the same form as the $\eta^*$ we derived via smoothness, however this time the inequality is reversed. Plugging this back in, we obtain</p>
<script type="math/tex; mode=display">f(x_t) - f(x_t - \eta e_t) \leq \frac{\langle\nabla f(x_t),e_t\rangle^2}{2 \mu \norm{e_t}^2},</script>
<p>and as the right-hand side is now independent of $\eta$ we can choose $\eta = 1$ and observe that $\frac{\langle\nabla f(x_t),e_t\rangle^2}{\norm{e_t}^2} \leq \norm{\nabla f(x_t)}^2$, via the Cauchy-Schwarz inequality. We arrive at the actual upper bound from strong convexity that we care about.</p>
<p class="mathcol"><strong>Upper bound on primal gap induced by strong convexity:</strong>
\[
\begin{equation}
f(x_t) - f(x^\esx) \leq \frac{\norm{\nabla f(x_t)}^2}{2 \mu}.
\end{equation}
\]</p>
<p>From these two bounds we immediately obtain <em>linear convergence</em> in the case of strongly convex functions: We have
<script type="math/tex">f(x_{t}) - f(x_{t+1}) \geq \frac{\mu}{L} (f(x_t) - f(x^\esx)),</script>
i.e., in each iteration we recover a $\mu/L$-fraction of the residual primal gap. Simply adding $f(x^\esx)$ on both sides and rearranging gives the desired bound per iteration</p>
<script type="math/tex; mode=display">f(x_{t+1}) - f(x^\esx) \leq \left(1-\frac{\mu}{L}\right)(f(x_t) - f(x^\esx)),</script>
<p>which we iterate to obtain</p>
<script type="math/tex; mode=display">f(x_T) - f(x^\esx) \leq \left(1-\frac{\mu}{L}\right)^T(f(x_0) - f(x^\esx)).</script>
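<p>The contraction above is easy to observe numerically. The following sketch runs gradient descent with step size $1/L$ on the same toy quadratic $f(x,y) = (3x^2+y^2)/2$ ($\mu = 1$, $L = 3$, minimizer at the origin, $f^* = 0$); the setup and names are our own illustration:</p>

```python
# Gradient descent with step size 1/L on a mu-strongly convex,
# L-smooth quadratic; verify the per-step contraction (1 - mu/L).
mu, L = 1.0, 3.0

def f(x, y):
    return (3 * x**2 + y**2) / 2   # f* = 0 at the origin

x, y = 5.0, -4.0
gaps = [f(x, y)]                   # primal gaps h(x_t) = f(x_t) - f*
for _ in range(30):
    # x_{t+1} = x_t - (1/L) grad f(x_t), with grad f = (3x, y)
    x, y = x - 3 * x / L, y - y / L
    gaps.append(f(x, y))

# each step recovers at least a mu/L fraction of the residual gap
for h_t, h_next in zip(gaps, gaps[1:]):
    assert h_next <= (1 - mu / L) * h_t + 1e-12
```

After 30 steps the gap has shrunk by at least $(1-\mu/L)^{30} = (2/3)^{30}$, i.e., by more than five orders of magnitude.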
<p>If we have only smooth and not necessarily strongly convex function the last part of the argument changes a little. Rather than plugging in the bound from strong convexity, we use convexity and estimate:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \langle \nabla f(x_t) , x_t - x^\esx \rangle \leq \norm{\nabla f(x_t)} \norm{(x_t - x^*)}.</script>
<p>This estimation is much weaker than the one from strong convexity and if we plug this into the progress inequality we only obtain:</p>
<script type="math/tex; mode=display">f(x_t) - f(x_{t+1}) \geq \frac{(f(x_t) - f(x^\esx))^2}{2L \, \norm{x_t - x^*}^2} \geq \frac{(f(x_t) - f(x^\esx))^2}{2L \, \norm{x_0 - x^*}^2},</script>
<p>where the last inequality is not immediate but also not terribly hard to show. Now with some induction one can show the standard rate of roughly</p>
<script type="math/tex; mode=display">f(x_T) - f(x^\esx) \leq \frac{2L \, \norm{(x_0 - x^*)}^2}{T+4}.</script>
<p>We skip the details here as we will see a similar induction below for the Frank-Wolfe algorithm.</p>
<h3 id="projections-and-the-frank-wolfe-method">Projections and the Frank-Wolfe method</h3>
<p>So far we have considered <em>unconstrained</em> smooth convex minimization in the example above. At first sight, in the constrained case, the situation does not dramatically change: as soon as we have constraints, we basically have to augment the methods from above to ‘project back’ into the feasible region after each step (otherwise a gradient step might lead us outside of the feasible region). To this end, let $P$ be a compact convex feasible region and $\Pi_P$ be an appropriate projection onto $P$; then we simply modify the iterates of our gradient descent scheme to be</p>
<script type="math/tex; mode=display">x_{t+1} \leftarrow \Pi_P(x_t - \eta \nabla f(x_t)).</script>
<p>This projection can be performed relatively easily for some domains and norms, e.g., the simplex (probability simplex), the $\ell_1$-ball, the $\ell_2$-ball, and more general permutahedra. However, as soon as the projection problem is no longer easy to solve, the projection itself gives rise to another optimization problem that can be expensive to solve.</p>
<p>This is where the <em>Frank-Wolfe algorithm [FW]</em> or <em>Conditional Gradients [CG]</em> come into play as a <em>projection-free</em> first-order method for constrained smooth minimization: these algorithms maintain feasibility of all iterates $x_t$ throughout by merely forming convex combinations of the current iterate and a new point $v \in P$. But before we talk more about the why-you-should-care factor, let us first have a look at the (most basic variant of the) algorithm:</p>
<p class="mathcol"><strong>Frank-Wolfe Algorithm [FW]</strong> <br />
<em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$. <br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad v_t \leftarrow \arg\min_{x \in P} \langle \nabla f(x_t), x \rangle$ <br />
$\quad x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$</p>
<p>In the algorithm above the step size $\gamma_t$ can be set in various ways:</p>
<ol>
<li>$\gamma_t = \frac{2}{t+2}$ (the original step size) <br />
Assumes a worst-case upper bound on the function $f$. Does not require knowledge of $L$, although the convergence bound will depend on it. Does not ensure monotone progress. Basically, what this step size rule ensures is that $x_t$ is a weighted average of the vertices obtained up to that point from the linear optimization oracle.</li>
<li>$\gamma_t = \frac{\langle \nabla f(x_t), x_t - v_t \rangle}{L \norm{x_t - v_t}^2}$ (the <em>short step</em>; the analog to the optimal step size above in the unconstrained case) <br />
Approximately minimizes the quadratic upper bound from smoothness, as we have seen in the example above. Ensures monotone progress but requires approximate knowledge of $L$, or a search for it. Note that the <em>magnitude</em> of progress does not have to be monotone across iterations.</li>
<li>$\gamma_t$ via line search to maximize function decrease <br />
Does not require any knowledge about $L$ and is also monotone, however requires several function evaluations to find the minimum. There has been some recent work on this specifically for Conditional Gradients algorithms that provides a reasonable tradeoff [PANJ].</li>
</ol>
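<p>The loop is short enough to sketch in a few lines. The following illustration (our own code, not from the post) runs Frank-Wolfe with the agnostic step size $\frac{2}{t+2}$ over the probability simplex, where the linear optimization oracle simply returns the vertex $e_i$ with the smallest gradient coordinate:</p>

```python
def fw_simplex(grad, n, steps):
    """Frank-Wolfe over the probability simplex with step size 2/(t+2)."""
    x = [0.0] * n
    x[0] = 1.0                                  # start at the vertex e_1
    gap = float("inf")
    for t in range(steps):
        g = grad(x)
        i = min(range(n), key=g.__getitem__)    # LMO: argmin_{v in P} <g, v>
        # Wolfe gap g(x_t) = <g, x_t> - min_i g_i
        gap = sum(gj * xj for gj, xj in zip(g, x)) - g[i]
        gamma = 2 / (t + 2)
        x = [(1 - gamma) * xj for xj in x]      # x_{t+1} = (1-gamma) x_t + gamma e_i
        x[i] += gamma
    return x, gap

# minimize f(x) = 0.5 * ||x - b||^2 over the simplex; the optimum is b itself
b = [0.5, 0.3, 0.2]
grad_f = lambda x: [xj - bj for xj, bj in zip(x, b)]
x, gap = fw_simplex(grad_f, n=3, steps=2000)
f_val = 0.5 * sum((xj - bj) ** 2 for xj, bj in zip(x, b))
assert f_val < 4e-3 and gap >= 0   # within the O(LD^2/t) guarantee
```

Note that every iterate stays a convex combination of vertices, so feasibility is free, and the Wolfe gap computed inside the loop upper bounds the primal gap, as discussed below.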
<p>The Frank-Wolfe algorithm has many appealing features. Just to name a few:</p>
<ol>
<li>Extremely easy to implement</li>
<li>No complicated data structures to maintain, which makes it quite memory-efficient</li>
<li>No projections</li>
<li>Iterates are maintained as convex combinations. This can be useful when we interpret the final solution as a distribution over vertices.</li>
</ol>
<p>Especially the last point is useful in many applications. Jumping slightly ahead, as we will prove below, for a general smooth convex function, the Frank-Wolfe algorithm achieves:</p>
<script type="math/tex; mode=display">f(x_t) - f(x^\esx) \leq \frac{LD^2}{t+2} = O(1/t),</script>
<p>where $D$ is the diameter of $P$ in the used norm (in smoothness) and $L$ the Lipschitz constant from above.</p>
<p class="mathcol"><strong>Example:</strong> Approximate Carathéodory <br />
Let $\hat x \in P$. Goal: find a convex combination $\tilde x$ of vertices of $P$, so that $\norm{\hat x-\tilde x} \leq \varepsilon$. This can be solved with the Frank-Wolfe algorithm via solving $\min_{x \in P} \norm{\hat x - x}^2$. We need an accuracy $\norm{\hat x - x}^2 \leq \varepsilon^2$, so that we roughly need $\frac{LD^2}{\varepsilon^2}$ iterations to achieve this approximation. <br />
(For completeness: if we are willing to have a bound that explicitly depends on the dimension of $P$, then we can exploit the strong convexity of our objective and we need only roughly $O(n \log 1/\varepsilon)$ iterations. We will come back to the strongly convex case in a later post).</p>
<p>The following graph shows typical convergence of the Frank-Wolfe algorithm for the three step size rules above. As mentioned above, the step size rule $\frac{2}{t+2}$ does not guarantee monotone progress. Moreover, as we can see, for our choice of $L$ as a guess of the true Lipschitz constant, we converge to a suboptimal solution as $L$ was chosen (on purpose) to be too small. This is one of the main issues with estimated Lipschitz constants: when the guess is too small we might not converge. From the progress argument above one might be willing to opt for a much more conservative (i.e., smaller) step size, but this immediately and proportionally affects the progress: using $\alpha / L$ with $0 < \alpha < 1$ slows progress down by a factor of roughly $\alpha$. Also, note that while the $\frac{2}{t+2}$ rule makes less progress per iteration, in wall-clock time it is actually pretty fast here as we save on the function evaluations in the (admittedly simple) line search; one can do much better, see [PANJ]. We also depict a dual gap, which upper bounds $f(x_t) - f(x^\esx)$ and which we will explore in more detail in the next section.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/fw.png" alt="FW Algorithm" /></p>
<p>So what does the Frank-Wolfe algorithm actually do? As shown in the figure below, rather than using the negative gradient direction $-\nabla f(x_t)$ for descent, it uses a replacement direction $d = v - x_t$ with potentially weaker progress: $\frac{\norm{\nabla f(x_t)}^2}{2L}$ vs. $\frac{\langle \nabla f(x_t), x_t-v \rangle^2}{2LD^2}$. At the same time it is much easier to ensure feasibility for this direction as the next iterate $x_{t+1}$ is simply a convex combination of the current iterate $x_t$ and the vertex $v$ of $P$. Thus we do not have to do projection but we pay for this in (potentially) less progress per iteration—this is the tradeoff that Frank-Wolfe makes.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/fw-dir-approx.png" alt="The Frank-Wolfe direction" /></p>
<h4 id="dual-gaps-in-constraint-convex-optimization">Dual gaps in constrained convex optimization</h4>
<p>In the case of unconstrained gradient descent we had two types of dual gaps, i.e., quantities that bound $f(x_t) - f(x^\esx)$ as a function of the gradient. We will now obtain a similar dual gap that will be central to the Frank-Wolfe algorithm. Recall that we defined $h(x) \doteq f(x) - f(x^\esx)$ as the <em>primal (optimality) gap $h(\cdot)$</em> above. In the context of constrained convex optimization we can define the following <em>dual (optimality) gap $g(\cdot)$</em>, which is often referred to as the <em>Wolfe gap</em>. To this end, observe that by convexity we have:</p>
<script type="math/tex; mode=display">h_t = f(x_t) - f(x^\esx) \leq \langle \nabla f(x_t), x_t - x^\esx \rangle \leq \max_{x \in P} \langle \nabla f(x_t), x_t - x \rangle \doteq g(x_t).</script>
<p>There are two ways of thinking about the dual gap. First of all, it upper bounds the primal gap, but we can also understand $g(x_t)$ as computing the “best possible” approximation to $\nabla f(x_t)$. Of course, closer inspection reveals that this is flawed as stated, as it measures the quality of approximation scaled with the length of the line segment $x_t - x$:</p>
<script type="math/tex; mode=display">g(x_t) = \max_{x \in P} \langle \nabla f(x_t), x_t - x \rangle = \max_{x \in P} \frac{\langle \nabla f(x_t), x_t - x \rangle}{\norm{x_t - x}} \cdot \norm{x_t - x}.</script>
<p>This subtlety is worth keeping in mind. Moreover, the dual gap also serves as an optimality certificate.</p>
<p class="mathcol"><strong>The Wolfe gap</strong> <br />
We have $0 \leq h(x_t) \leq g(x_t)$. Moreover, $g(x) = 0 \Leftrightarrow f(x) - f(x^\esx) = 0$.</p>
<p>The proof of the above is straightforward: Clearly, if $g(x_t) = 0$, then $h(x_t) = 0$. For the converse, we go the lazy route through smoothness. We showed above that $f(x_{t}) - f(x_{t+1}) \geq \eta \langle\nabla f(x_t),d\rangle - \eta^2 \frac{LD^2}{2}$. In the case of Frank-Wolfe we have $d= v_t - x_t$ for the maximizing vertex $v_t$ as $x_{t +1} = (1-\gamma_t) x_t + \gamma_t v_t$. Choosing $\gamma_t = \min\setb{\frac{g(x_t)}{LD^2},1}$ yields:</p>
<script type="math/tex; mode=display">f(x_t) - f(x_{t+1}) \geq \min\setb{g(x_t)/2,\frac{g(x_t)^2}{2LD^2}},</script>
<p>and since we were optimal the left-hand side is $0$. Thus $g(x_t) = 0$ follows.</p>
<p class="mathcol"><strong>Lemma (Frank-Wolfe convergence).</strong>
After $t$ iterations the Frank-Wolfe algorithm ensures:
\[h(x_t) \leq \frac{LD^2}{t+2}.\]</p>
<p><em>Proof.</em>
Combining the Wolfe gap with the progress induced by smoothness, we immediately obtain:
\[h(x_{t+1}) \leq h(x_t) - \frac{h(x_t)^2}{2LD^2} = h(x_t) \left(1-\frac{h(x_t)}{2LD^2}\right),\]
and with the standard induction, this leads to:
\[h(x_t) \leq \frac{LD^2}{t+2}. \qed \]</p>
<p>A similar result can also be shown for the Wolfe gap $g(x_t)$, however the statement is slightly more involved; see [J] for details. Note that the convergence rate of the Frank-Wolfe Algorithm is <em>not</em> optimal for smooth constrained convex optimization, as accelerated methods can achieve a rate of $O(1/t^2)$; however, for first-order methods accessing a linear optimization oracle, this is as good as it gets:</p>
<p class="mathcol"><strong>Example:</strong> For linear optimization oracle-based first-order methods, <em>even for strongly convex functions, as long as a rate independent of the dimension $n$ is to be derived</em>, a rate of $O(1/t)$ is the best possible (for more details see [J]). Consider the function $f(x) \doteq \norm{x}^2$, which is strongly convex, and the polytope $P = \operatorname{conv}\setb{e_1,\dots, e_n} \subseteq \RR^n$, the probability simplex in dimension $n$. We want to solve $\min_{x \in P} f(x)$. Clearly, the optimal solution is $x^\esx = (\frac{1}{n}, \dots, \frac{1}{n})$. Whenever we call the linear programming oracle, on the other hand, we will obtain one of the $e_i$ vectors, and in the absence of any other information except that the feasible region is convex, we can only form convex combinations of those. Thus after $k$ iterations, the best we can produce as a convex combination is a vector with support $k$, where the minimizer of such vectors for $f(x)$ is, e.g., $x_k = (\frac{1}{k}, \dots,\frac{1}{k},0,\dots,0)$ with $k$ times $1/k$ entries, so that we obtain a gap
<script type="math/tex">f(x_k) - f(x^\esx) = \frac{1}{k}-\frac{1}{n},</script>
which after requiring $\frac{1}{k}-\frac{1}{n} < \varepsilon$ implies $k > \frac{1}{\varepsilon + 1/n} \approx \frac{1}{\varepsilon}$ for $n$ large.</p>
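<p>The arithmetic behind this example is easy to verify exactly (our own toy check, using exact rational arithmetic):</p>

```python
# Check the lower-bound example: on the probability simplex in R^n,
# the best k-sparse convex combination of vertices has ||x||^2 = 1/k,
# while the optimum (the barycenter) has ||x*||^2 = 1/n.
from fractions import Fraction

n = 10
for k in range(1, n + 1):
    x = [Fraction(1, k)] * k + [0] * (n - k)   # uniform weight on k vertices
    f_x = sum(c * c for c in x)                # ||x||^2, exact
    assert f_x == Fraction(1, k)
# e.g., the gap after k = 3 oracle calls: 1/k - 1/n
assert Fraction(1, 3) - Fraction(1, n) == Fraction(7, 30)
```

That the uniform weight $1/k$ minimizes $\norm{x}^2$ among $k$-sparse convex combinations follows from Cauchy-Schwarz.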
<h2 id="a-slightly-different-interpretation">A slightly different interpretation</h2>
<p>In the following we will slightly change the interpretation of what is happening: we will directly analyze the Frank-Wolfe algorithm by means of a scaling argument; we will consider the general convex case here and consider strong convexity in a later post. This perspective is inspired by scaling algorithms in discrete optimization and in particular flow algorithms and comes in very handy here (see [SW], [SWZ], [LPPP] for the use in discrete optimization).</p>
<h3 id="driving-progress-and-bounding-the-gap">Driving progress and bounding the gap</h3>
<p>We have seen in various forms above that smoothness induces progress and in the case of the Frank-Wolfe algorithm it implies:</p>
<script type="math/tex; mode=display">f(x_t) - f(x_{t+1}) \geq \min\setb{g(x_t)/2,\frac{g(x_t)^2}{2LD^2}},</script>
<p>i.e., the ensured primal progress is quadratic in the dual gap; the case with progress $g(x_t)/2$ is irrelevant as it only appears in the very first iteration of the Frank-Wolfe Algorithm. At the same time we have</p>
<script type="math/tex; mode=display">h(x_t) \leq g(x_t),</script>
<p>by convexity and the definition of the Wolfe gap. Following the idea of the aforementioned scaling algorithms, this gives rise to the following variant of Frank-Wolfe (see [BPZ] for details):</p>
<p class="mathcol"><strong>Scaling Frank-Wolfe Algorithm [BPZ]</strong> <br />
<em>Input:</em> Smooth convex function $f$ with first-order oracle access, feasible region $P$ with linear optimization oracle access, initial point (usually a vertex) $x_0 \in P$. <br />
<em>Output:</em> Sequence of points $x_0, \dots, x_T$ <br />
Compute initial dual gap: $\Phi_0 \leftarrow \max_{v \in P} \langle \nabla f(x_0), x_0 - v \rangle$ <br />
For $t = 0, \dots, T-1$ do: <br />
$\quad$ Find a vertex $v_t$ of $P$ such that: $\langle \nabla f(x_t), x_t - v_t \rangle > \Phi_t/2$ <br />
$\quad$ If no such vertex $v_t$ exists: $x_{t+1} \leftarrow x_t$ and $\Phi_{t+1} \leftarrow \Phi_t/2$ <br />
$\quad$ Else: $x_{t+1} \leftarrow (1-\gamma_t) x_t + \gamma_t v_t$ and $\Phi_{t+1} \leftarrow \Phi_t$</p>
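<p>The loop above can be sketched compactly for a concrete instance. The following illustration (our own code with made-up names, not the implementation from [BPZ]) runs it for $f(x) = \frac{1}{2}\norm{x - b}^2$ over the probability simplex, with exact line search for this quadratic:</p>

```python
def scaling_fw(b, n, iters):
    """Scaling Frank-Wolfe for f(x) = 0.5*||x - b||^2 over the simplex."""
    x = [0.0] * n
    x[0] = 1.0
    grad = lambda x: [xj - bj for xj, bj in zip(x, b)]
    g = grad(x)
    phi = sum(gj * xj for gj, xj in zip(g, x)) - min(g)   # initial dual gap Phi_0
    for _ in range(iters):
        g = grad(x)
        i = min(range(n), key=g.__getitem__)              # single LMO call
        improvement = sum(gj * xj for gj, xj in zip(g, x)) - g[i]
        if improvement <= phi / 2:       # no vertex beats Phi/2:
            phi /= 2                     # dual update step (b)
            continue
        d = [x[j] - (1.0 if j == i else 0.0) for j in range(n)]  # x_t - v_t
        # exact line search for this quadratic: gamma* = <grad, d> / ||d||^2
        gamma = min(1.0, improvement / sum(dj * dj for dj in d))
        x = [xj - gamma * dj for xj, dj in zip(x, d)]     # primal step (a)
    return x, phi

b = [0.6, 0.3, 0.1]
x, phi = scaling_fw(b, n=3, iters=2000)
err2 = sum((xj - bj) ** 2 for xj, bj in zip(x, b))
assert err2 < 0.05    # comfortably within the lemma's budget for this ε
```

Note how the LMO answer is only compared against the threshold $\Phi_t/2$; this is exactly the slack that lets weaker or cached oracle answers stand in for an exact LP solve.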
<p>This algorithm has many advantages computationally but before we talk about those, let us first show that this algorithm recovers the same convergence guarantee as the Frank-Wolfe algorithm up to a small constant factor. The way to think of $\Phi_t$ is as an estimate of the dual gap and/or the primal progress. For simplicity of the proof let us assume that we choose the $\gamma_t$ with line search and that $h(x_0) \leq LD^2$, which holds after a single Frank-Wolfe iteration. This ensures that in the line search $\gamma_t < 1$ and $f(x_t) - f(x_{t+1}) \geq \frac{\langle \nabla f(x_t), x_t - v \rangle^2}{2LD^2}$, where $v =\arg\min_{x \in P} \langle \nabla f(x_{t}), x \rangle$.</p>
<p class="mathcol"><strong>Lemma (Scaling Frank-Wolfe convergence).</strong>
The Scaling Frank-Wolfe algorithm ensures:
\[h(x_T) \leq \varepsilon \qquad \text{for} \qquad T \geq\lceil \log \frac{\Phi_0}{\varepsilon}\rceil + \frac{16LD^2}{\varepsilon},\]
where the $\log$ is to the basis of $2$. <br /> <br /></p>
<p><em>Proof.</em>
We consider two types of steps: (a) primal progress steps, where $x_t$ is changed, and (b) dual update steps, where $\Phi_t$ is changed. <br /> <br /> Let us start with the dual update step (b). In such an iteration we know that for all $v \in P$ it holds that $\langle \nabla f(x_t), x_t - v \rangle \leq \Phi_t/2$, in particular for $v = x^\esx$, and by convexity this implies
\[h_t \leq \Phi_t/2.\]
Now for a primal progress step (a), we have by the same arguments as before
\[f(x_t) - f(x_{t+1}) \geq \frac{\Phi_t^2}{8LD^2}.\]
From these two inequalities we can conclude the proof as follows: Clearly, to achieve accuracy $\varepsilon$, it suffices to halve $\Phi_0$ at most $\lceil \log \frac{\Phi_0}{\varepsilon}\rceil$ times. Next we bound how many primal progress steps of type (a) we can do between two steps of type (b); we call this a <em>scaling phase</em>. After accounting for the halving at the beginning of the iteration and observing that $\Phi_t$ does not change between any two iterations of type (b), by simply dividing the upper bound on the residual gap by the lower bound on the progress, this can be at most
\[\Phi \cdot \frac{8LD^2}{\Phi^2} = \frac{8LD^2}{\Phi},\]
where $\Phi$ is the estimate valid for these iterations of type (a). Thus the total number of iterations $T$ required to achieve $\varepsilon$-accuracy can be bounded as:
\[\sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} \left(1 + \frac{8LD^2}{\Phi_0 / 2^\ell}\right) = \underbrace{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil}_{\text{Type (b)}} + \underbrace{\frac{8LD^2}{\Phi_0} \sum_{\ell = 0}^{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil} 2^\ell}_{\text{Type (a)}} \leq \underbrace{\lceil \log \frac{\Phi_0}{\varepsilon}\rceil}_{\text{Type (b)}} + \underbrace{\frac{16LD^2}{\varepsilon}}_{\text{Type (a)}}.\]
\[\qed\]</p>
<p>Note that finding a vertex $v_t$ of $P$ with $\langle \nabla f(x_t), x_t - v_t \rangle > \Phi_t/2$ can be achieved by a single call to the linear optimization oracle if such a vertex exists; otherwise the linear optimization call will certify non-existence. So what are the advantages of the Scaling Frank-Wolfe algorithm when it basically has the same (worst-case) convergence rate as the Frank-Wolfe algorithm? The key advantages are twofold:</p>
<ol>
<li>In many cases checking the existence of the vertex is much easier than solving the actual LP to optimality. In fact, the argument shows that an LP oracle with multiplicative error with respect to $\langle \nabla f(x_t), x_t - v_t \rangle$ would be good enough, but even weaker oracles are possible.</li>
<li>Before even calling the LP, we can check whether any of the previously computed vertices satisfies the condition and in that case simply use it.</li>
</ol>
<p>These two properties can lead to significant real-world speedups in the computation. Moreover, as we will see in upcoming follow-up posts, it is this scaling perspective that allows us to derive many other efficient variants of Frank-Wolfe with linear convergence and other desirable properties.</p>
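<p>To make the scaling scheme concrete, here is a minimal Python sketch over the 0/1 box; the toy quadratic objective, the exact line-search step size (valid for $L = 1$), and all function names are illustrative assumptions, not the implementation from the post:</p>

```python
def lmo_box(grad):
    """LP oracle over the 0/1 box: returns a vertex minimizing <grad, v>."""
    return [0.0 if g > 0 else 1.0 for g in grad]

def scaling_frank_wolfe(grad_f, x0, lmo, phi0, eps, max_iter=100_000):
    """Sketch of Scaling Frank-Wolfe: take a primal step whenever a vertex
    certifies gap > phi / 2, otherwise halve the gap estimate phi."""
    x, phi = list(x0), phi0
    for _ in range(max_iter):
        if phi <= eps:
            break
        g = grad_f(x)
        v = lmo(g)
        gap = sum(gi * (xi - vi) for gi, xi, vi in zip(g, x, v))
        if gap > phi / 2:  # primal progress step (a)
            # exact line search for this quadratic (L = 1), clipped to [0, 1]
            gamma = min(1.0, gap / sum((xi - vi) ** 2 for xi, vi in zip(x, v)))
            x = [xi + gamma * (vi - xi) for xi, vi in zip(x, v)]
        else:              # dual update step (b)
            phi /= 2
    return x, phi

# Toy instance: f(x) = 0.5 * ||x - b||^2 over [0,1]^2, optimum at (1, 0.3).
b = [1.5, 0.3]
grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]
x, phi = scaling_frank_wolfe(grad_f, [0.0, 0.0], lmo_box, phi0=4.0, eps=1e-3)
print([round(xi, 1) for xi in x])  # prints [1.0, 0.3]
```

<p>Note that the final dual update guarantees $h(x) \leq \Phi \leq \varepsilon$ upon termination, which is exactly the certificate used in the proof above.</p>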
<p><em>Note</em>: I recently learned that a similar perspective arises from <em>restarting</em> (see, e.g., [RA] and the references contained therein) and it turns out that Frank-Wolfe can also be restarted to get improved rates (see [KDP]; a post on this will follow a little later).</p>
<h3 id="references">References</h3>
<p>[CG] Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 6(5), 787-823. <a href="http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=zvmmf&paperid=7415&option_lang=eng">pdf</a></p>
<p>[FW] Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval research logistics quarterly, 3(1‐2), 95-110. <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800030109">pdf</a></p>
<p>[PANJ] Pedregosa, F., Askari, A., Negiar, G., & Jaggi, M. (2018). Step-Size Adaptivity in Projection-Free Optimization. arXiv preprint arXiv:1806.05123. <a href="https://arxiv.org/abs/1806.05123">pdf</a></p>
<p>[J] Jaggi, M. (2013, June). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML (1) (pp. 427-435). <a href="http://proceedings.mlr.press/v28/jaggi13-supp.pdf">pdf</a></p>
<p>[LJ] Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (pp. 496-504). <a href="http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf">pdf</a></p>
<p>[SW] Schulz, A. S., & Weismantel, R. (2002). The complexity of generic primal algorithms for solving general integer programs. Mathematics of Operations Research, 27(4), 681-692. <a href="https://www.jstor.org/stable/pdf/3690461.pdf?casa_token=mJjcpfOA8sYAAAAA:Jjd7I5U46kCEIxiouP8czPrimzHzFeTHVlkxKksZLcPnBelGIlbj7dyErlE4igyzM6Jxxt019AL27DMCp5_vkOsLx7UxrwxBzOVcj_n88JDruTRpxZ_wAA">pdf</a></p>
<p>[SWZ] Schulz, A. S., Weismantel, R., & Ziegler, G. M. (1995, September). 0/1-integer programming: Optimization and augmentation are equivalent. In European Symposium on Algorithms (pp. 473-483). Springer, Berlin, Heidelberg. <a href="https://opus4.kobv.de/opus4-zib/files/174/SC-95-08.pdf">pdf</a></p>
<p>[LPPP] Le Bodic, P., Pavelka, J. W., Pfetsch, M. E., & Pokutta, S. (2018). Solving MIPs via scaling-based augmentation. Discrete Optimization, 27, 1-25. <a href="https://arxiv.org/pdf/1509.03206.pdf">pdf</a></p>
<p>[BPZ] Braun, G., Pokutta, S., & Zink, D. (2017, July). Lazifying Conditional Gradient Algorithms. In International Conference on Machine Learning (pp. 566-575). <a href="https://arxiv.org/abs/1610.05120">pdf</a></p>
<p>[RA] Roulet, V., & d’Aspremont, A. (2017). Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems (pp. 1119-1129). <a href="http://papers.nips.cc/paper/6712-sharpness-restart-and-acceleration">pdf</a></p>
<p>[KDP] Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank-Wolfe. <a href="https://arxiv.org/abs/1810.02429">pdf</a></p>Sebastian PokuttaTL;DR: Cheat Sheet for Frank-Wolfe and Conditional Gradients. Basic mechanics and results; this is a rather long post and the start of a series of posts on this topic.Tractability limits of small treewidth2018-09-22T09:50:00-04:002018-09-22T09:50:00-04:00http://www.pokutta.com/blog/research/2018/09/22/treewidth-abstract<p><em>TL;DR: This is an informal summary of our recent paper <a href="https://arxiv.org/abs/1807.02551">Limits of Treewidth-based tractability in Optimization</a> with <a href="https://ieor.columbia.edu/faculty/yuri-faenza">Yuri Faenza</a> and <a href="http://cerc-datascience.polymtl.ca/person/gonzalo-munoz/">Gonzalo Muñoz</a>, where we provide almost matching lower bounds for extended formulations that exploit small treewidth to obtain smaller formulations. We also show that treewidth in some sense is the only graph-theoretic notion that appropriately captures sparsity and tractability in a broader algorithmic setting.</em>
<!--more--></p>
<h2 id="what-is-the-paper-about-and-why-you-might-care">What is the paper about and why you might care</h2>
<p>It is well known that many problems on graphs (e.g., optimization problems, but also problems in the context of graphical models) that are hard to solve in full generality admit fast algorithms when the underlying graph $G$ exhibits <em>small treewidth</em>. Without going into detail here, <em>treewidth</em> basically measures how close a graph is to a tree, and it has been used extensively as a concept to model and capture sparsity in problems. Typically, we can then obtain algorithms whose running time is super-polynomial in the treewidth but polynomial in the problem dimension, where the notion of <em>dimension</em> depends on the context. One such example is combinatorial problems on graphs, where <em>Dynamic Programming</em> can be used to obtain fast algorithms (with a non-polynomial dependence on the treewidth); we refer the interested reader to the actual paper for an extensive list of references. As such, one might think of <em>treewidth</em> as a complexity parameter leading to some form of parametrized complexity.</p>
<p>It is also known that such problems often admit linear programming or semidefinite programming formulations parametrized by treewidth (e.g., via Sherali-Adams and Lasserre hierarchy approaches). More recently, a very comprehensive result of Bienstock and Muñoz [BM] extended this to the general case of mixed-integer polynomial optimization, basically showing the following:</p>
<p class="mathcol"><strong>Theorem (informal) [Bienstock and Muñoz].</strong> Consider a polynomial optimization problem of the form
<script type="math/tex">\min \{ c^\intercal x \mid f_i(x) \geq 0\quad \forall i \in [m], x \in \{0,1\}^p \times [0,1]^{n-p}\},</script>
where $f_i(x)$ are polynomials of degree at most $\rho$. If the underlying constraint intersection graph $\Gamma$ (the graph that has a vertex for each variable of the problem and an edge between two variables if they appear in a common constraint) has treewidth at most $\omega$, then there is a linear program of size $O((2\rho/\varepsilon)^{\omega + 1}n \log 1/\varepsilon)$ that computes an $\varepsilon$-approximate solution to the polynomial optimization problem.</p>
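<p>As a small illustration of the constraint intersection graph $\Gamma$ appearing in the theorem, the following sketch builds its edge set from made-up constraint supports; the chain-like instance is chosen so that the treewidth is visibly small:</p>

```python
from itertools import combinations

# Hypothetical constraints, each given by the set of variables it mentions,
# e.g. f_1(x) = x1*x2 - x3 >= 0 mentions {1, 2, 3}.
constraints = [{1, 2, 3}, {3, 4}, {4, 5}, {5, 6}]

# Intersection graph: one vertex per variable, an edge whenever two
# variables appear together in some constraint.
edges = set()
for support in constraints:
    for u, v in combinations(sorted(support), 2):
        edges.add((u, v))

print(sorted(edges))
# prints [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
# This chain-like graph has treewidth 2, witnessed by the bags
# {1,2,3}, {3,4}, {4,5}, {5,6} arranged along a path.
```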
<p>This basically matches the type of bounds that have previously been obtained for other types of problems. With this, a couple of natural questions arise:</p>
<ol>
<li>Are bounds of the form $O(n 2^\omega)$ the best one can hope for with linear or semidefinite programming formulations?</li>
<li>Are there other graph-theoretic concepts of sparsity apart from treewidth that can be used similarly to obtain parametrized complexity measures?</li>
</ol>
<h2 id="our-results">Our results</h2>
<p>We basically answer both of the questions above. It is important to point out though that the results for linear programming and semidefinite programming are somewhat different from the algorithmic ones: the algorithmic statements hold for general algorithms but are conditional on the usual $\operatorname{P} \neq \operatorname{NP}$ assumption (in fact, we use the complexity-theoretic assumption $\operatorname{NP} \not\subseteq \operatorname{BPP}$), whereas the linear programming and semidefinite programming statements hold unconditionally but only apply to those two optimization paradigms.</p>
<p>First, we show that there is no other graph-theoretic structure that yields tractability in the same way as treewidth does (for optimization problems). In a nutshell:</p>
<p class="mathcol">Unbounded treewidth can yield intractability.</p>
<p>This result relies on the commonly believed complexity-theoretic assumption $\operatorname{NP} \not\subseteq \operatorname{BPP}$ and the grid-minor hypothesis, which was recently shown to be true in a breakthrough result by Chekuri and Chuzhoy [CC]. Our proof works via a reduction and is the analog of a similar result known for graphical models; see Chandrasekaran et al. [CSH].</p>
<p>Second, we show that the upper bounds parametrized by treewidth obtained for linear programming formulations as well as semidefinite programming formulations are essentially optimal (up to minor losses):</p>
<p class="mathcol">Linear programming and semidefinite programming formulations of size of the form $O(n 2^\omega)$ are basically the best one can hope for.</p>
<p><em>Finally, I would also like to mention that independent of our work Aboulker et al. showed in [A+] a similar result for the linear extension complexity case based on analyzing faces of the correlation polytope.</em></p>
<h3 id="references">References</h3>
<p>[FMP] Faenza, Y., Muñoz, G., & Pokutta, S. (2018). Limits of Treewidth-based tractability in Optimization. arXiv preprint arXiv:1807.02551. <a href="https://arxiv.org/abs/1807.02551">arxiv</a></p>
<p>[BM] Bienstock, D., & Mun͂oz, G. (2018). LP Formulations for Polynomial Optimization Problems. SIAM Journal on Optimization, 28(2), 1121-1150. <a href="https://epubs.siam.org/doi/10.1137/15M1054079">journal</a> <a href="https://arxiv.org/abs/1501.00288">arxiv</a></p>
<p>[CC] Chekuri, C., & Chuzhoy, J. (2016). Polynomial bounds for the grid-minor theorem. Journal of the ACM (JACM), 63(5), 40. <a href="https://dl.acm.org/citation.cfm?id=2820609">journal</a> <a href="https://arxiv.org/abs/1305.6577">arxiv</a></p>
<p>[CSH] Chandrasekaran, V., Srebro, N., & Harsha, P. (2012). Complexity of inference in graphical models. arXiv preprint arXiv:1206.3240. <a href="https://arxiv.org/abs/1206.3240">arxiv</a></p>
<p>[A+] Aboulker, P., Fiorini, S., Huynh, T., Macchia, M., & Seif, J. (2018). Extension Complexity of the Correlation Polytope. arXiv preprint arXiv:1806.00541. <a href="https://arxiv.org/abs/1806.00541">arxiv</a></p>Sebastian PokuttaTL;DR: This is an informal summary of our recent paper Limits of Treewidth-based tractability in Optimization with Yuri Faenza and Gonzalo Muñoz, where we provide almost matching lower bounds for extended formulations that exploit small treewidth to obtain smaller formulations. We also show that treewidth in some sense is the only graph-theoretic notion that appropriately captures sparsity and tractability in a broader algorithmic setting.On the relevance of AI and ML research in academia2018-09-15T09:50:00-04:002018-09-15T09:50:00-04:00http://www.pokutta.com/blog/random/2018/09/15/ai-academia<p><em>TL;DR: Is AI and ML research in academia relevant and necessary? Yes.</em>
<!--more--></p>
<p>Over the last few months (and at our very recent faculty retreat) I had various discussions about the role of Artificial Intelligence and Machine Learning research (short: ML research) in academia and its relevance in light of various large companies, such as Google, Facebook, Microsoft, (Google’s) Deepmind, OpenAI, and Amazon, pursuing their own research efforts in that space at an unprecedented scale and with a resource backing that no academic institution will ever be able to replicate. A naïve first assessment might lead to the conclusion: we are done - let the industry guys take it from here. However, a more realistic assessment points to a synergetic relationship between industry and academia, with the two located at very different stages in the <em>research-to-product</em> pipeline. Clearly, this post is (highly) biased; I am in academia after all (though I have worked in industry at various stages).</p>
<h2 id="industry-ml-research-is-valuable-and-important">Industry ML research is valuable and important</h2>
<p>ML research conducted in industry has had a huge impact over the last few years, with various high-profile examples including <a href="https://deepmind.com/research/alphago/">AlphaGo’s success</a> in playing go as well as, more generally, autonomous vehicles—although the latter recently came under heightened scrutiny.</p>
<h3 id="transition-to-scale">Transition-to-scale</h3>
<p>Often these successes are not necessarily about <em>fundamental</em> advancements in the underlying methodology but impressive demonstrations of <em>transition-to-scale</em>. In fact several of these recent high-profile successes are made possible by an <a href="https://blog.openai.com/ai-and-compute/">insane amount of compute power for training</a> but the underlying methodology (e.g., policy gradients) has not fundamentally improved. This is good and bad at the same time. First of all, it demonstrates the capabilities of methodology that we have <em>in the limit</em>. That’s good as it helps us understand whether there is an inherent shortcoming within the methodology or whether, e.g., it is just not scaling. At the same time it is bad as it might negatively impact the perceived need for improving the underlying methodology, as we can somehow make it work.</p>
<h3 id="access-to-data-and-infrastructure">Access to data and infrastructure</h3>
<p>Another big advantage of ML research through industry is that industry often has access to data and infrastructure that is not available in an academic setting. This allows industry to build ML systems that cannot be realized within an academic context, e.g., large-scale machine translations, systems such as Google’s assistant etc. Moreover these systems can be integrated into other large-scale systems offering value to the user and society at large—not necessarily for free though. The impact of these systems on day-to-day life can be quite impressive.</p>
<h2 id="academic-ml-research-is-essential-to-society">Academic ML research is essential to society</h2>
<p>I believe that academic ML research does/should/can/will play a very different role and can serve societal needs that are beyond the scope and interest of industry-driven ML research, as they do not bear an immediate or mid-term profit opportunity. I would like to stress a few things first though:</p>
<ul>
<li>This applies to a large extent beyond the specifics of ML research; however, the current representation of large global industry players so close to academic entities in ML is (arguably) unprecedented.</li>
<li>The following topics are not exclusive to academic research, although, in my experience, they have been much more strongly represented in academia. Moreover, these topics are <em>on top</em> of the <em>foundational research agendas</em> in ML and AI that are found throughout top academic institutions and that I deem essential for the academic pursuit as a whole.</li>
</ul>
<h3 id="conducting-curiosity-driven-high-risk-research">Conducting curiosity-driven high-risk research</h3>
<p>In general, academic research is situated very differently. Not having the need to serve a company’s agenda and ultimately a profit motive, it allows for exploration of fundamentally new methodologies and ideas that are more high risk but might ultimately lead to revolutionary approaches. After all, basic ideas of the approaches that we are riding on right now date back to about the 1940s and 1950s, but back then these ideas were considered crazy, unrealistic, or simply crackpot. It is precisely this curiosity-driven research that academia can provide and that is essential to society. Andrew Odlyzko provided an interesting and nuanced perspective on this in 1995 when he was at Bell labs in <a href="http://www.dtc.umn.edu/~odlyzko/doc/decline.txt">“The decline of unfettered research”</a>:</p>
<blockquote>
<p>We are going through a period of technological change that is unprecedented in extent and speed. The success of corporations and even nations depends more than ever on rapid adoption of new technologies and operating methods. It is widely acknowledged that science made this transformation possible. At the same time, scientific research is under stress, with pressures to change, to turn away from investigation of fundamental scientific problems, and to focus on short-term projects. The aim of this essay is to discuss the reasons for this paradox, and especially for the decline of unfettered research.</p>
</blockquote>
<h3 id="open-transparent-and-falsifiable">Open, transparent, and falsifiable</h3>
<p>In contrast to industry research that is often proprietary and only available in watered-down versions (lacking details, data, or both), academic research is typically made available to the public including enough details to <em>falsify</em> a proposed approach. Staying true to Popper, this tiny detail is extremely important as it allows for scientific discourse, where a hypothesis can be tested and rejected and as such ultimately advances science and is highly relevant in the context of the current “alchemy-discussion” in ML research (see here for <a href="https://www.youtube.com/watch?v=ORHFOnaEzPc">Ali Rahimi’s talk at NIPS</a>, <a href="https://www2.isye.gatech.edu/~tzhao80/Yann_Response.pdf">Yann LeCun’s response</a>, and some background <a href="https://syncedreview.com/2017/12/12/lecun-vs-rahimi-has-machine-learning-become-alchemy/">here</a>, <a href="http://www.sciencemag.org/news/2018/05/ai-researchers-allege-machine-learning-alchemy">here</a>, and <a href="http://www.argmin.net/2017/12/11/alchemy-addendum/">an Addendum on Ben’s blog</a>). I am with Ali, Ben, et al on this one, especially if we really plan on deploying ML-based systems in the physical world and putting them at the center of life and death decisions, as e.g., in autonomous vehicles… but that discussion is for some other time.</p>
<h3 id="tackling-societal-challenges">Tackling societal challenges</h3>
<p>Academic (ML) research allows for tackling societal challenges that I believe deserve our attention even if they do not bear a short or mid-term profit opportunity. These include:</p>
<ul>
<li><strong>Infrastructure management.</strong> E.g., improving power systems, transportation systems, etc., especially given that many of those systems are beyond end-of-life or highly strained. One example that comes to mind is <a href="https://robohub.org/talking-machines-machine-learning-and-the-flint-water-crisis-with-jake-abernethy/">Jake Abernethy’s ML/data approach</a> in the context of the Flint water crisis (see also <a href="http://www.mywater-flint.com/">here</a> and <a href="https://www.ic.gatech.edu/news/610023/using-data-science-fix-flint-water-crisis">here</a>)—this is also a great example for synergies between industry and academia as Google <a href="https://news.umich.edu/google-funded-flint-water-app-helps-residents-find-lead-risk-resources/">funded the research with $150k</a>.</li>
<li><strong>Healthcare.</strong> (Beyond longevity), e.g., support systems for elderly healthcare, and systems for improving health-related outcomes in resource-limited settings. These can for example include systems for the early detection of cognitive decline as done at Riken AIP’s <a href="http://www.riken.jp/en/research/labs/aip/goalorient_tech/cogn_behav_assist_tech/">Cognitive Behavioral Assistive Technology Team</a>.</li>
<li><strong>Human impact on the earth.</strong> This includes understanding the human impact on global weather change, as well as mitigating the effects of severe weather events (through AI-based early warning systems) and potentially reversing them through a holistic understanding of the causal chains.</li>
<li><strong>Broad societal challenges.</strong> Including, mitigating societal impact from unequal wealth distribution and workforce impact through ever faster technology cycles.</li>
<li><strong>Education.</strong> Having reached a point in time where technological cycles are so short that life-long learning is a necessity, ML approaches in teaching might significantly improve and speed-up learning outcomes. This goes hand in hand with the sustainability question of online education and the resulting challenges from such decentralized approaches.</li>
<li><strong>Protecting society against manipulation through data and ML.</strong> This includes things such as, detecting <a href="http://fortune.com/2018/09/11/deep-fakes-obama-video/">deep fakes (now available as an app)</a> (<a href="https://www.iflscience.com/technology/deep-fake-videos-created-by-ai-just-got-even-more-terrifying/">here is the SIGGRAPH video—check it out! </a>), detecting manipulated news, as well as the detection of broader exposition to information bias, and many more.</li>
</ul>
<h3 id="working-with-smaller-noisy-data-sets-and-unbounded-losses">Working with smaller, noisy data sets, and unbounded losses</h3>
<p>I believe another important challenge in learning that has received relatively little consideration in industry ML research is working with small, noisy, and potentially unlabelled data sets. Working with such data, which is often at the core of real-world applications, requires new approaches, interpolating dynamically between model-based approaches (regularly incorporating deep domain knowledge) and model-free approaches, where the system dynamics are learned directly from data. For example:</p>
<ul>
<li><strong>Medical applications.</strong> Often it is hard (to impossible) to obtain the necessary amount of data for data-intensive learning approaches (such as, e.g., deep learning). Typically, obtaining or ‘generating’ such data follows very complex and time-intensive processes requiring complex IRBs, and even if all formal requirements are met, often, say, a condition one would like to obtain data about is so rare that the overall data availability and throughput is very limited.</li>
<li><strong>Physical systems.</strong> Here the main challenges are that physical systems are bound to the limitations of physics, and as such the generation of data is often slow and expensive. To make things a bit more tangible, say you would like to build a reinforcement learning-based system for inventory management in a highly dynamic environment. For the necessary data collection, you either have to wait a long time as you actually have to observe the system to obtain the data (<em>reality-in-the-loop</em> approach), apart from the fact that <em>testing</em> in such systems is nearly impossible, or you have to run a simulation, but then you are likely to run into model-mismatch issues as the simulation model does not quite match reality.</li>
<li><strong>Unbounded losses.</strong> A standard approach for many learning problems is based on (regularized) empirical risk minimization (ERM), where we solve problems of the form $\min_{\theta} \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i,\theta),y_i) + R(\theta)$ and then $\theta$ is the parametrization of the learned model. ‘Getting it right on average’ (or some other form of probabilistic statement or risk measure) however is often not good enough for real-world applications. A great example is (again) autonomous driving: we do not want to learn that crashing into a wall is not good by actually crashing into the wall; a typical scenario where losses are essentially unbounded but in the ERM problem their contribution would be limited. These applications require either a very different learning approach (here is a nice example from the <a href="http://www.mpc.berkeley.edu/research/adaptive-and-learning-predictive-control">MPC Lab @ Berkeley</a> for safe learning; check out the video!)—or explicit consideration of the maximal loss (see this work of <a href="https://arxiv.org/abs/1602.01690">Shalev-Shwartz and Wexler</a>; can be nicely incorporated into many ERM approaches) if an ERM formulation is desired or required.</li>
</ul>
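<p>To make the last point tangible, the toy numbers below (entirely made up) contrast the ERM objective with the maximal loss: a single catastrophic sample barely moves the average while dominating the maximum:</p>

```python
# Toy illustration: average loss (ERM) vs. maximum loss on the same errors.
# 100 well-handled samples plus one "crash into the wall" sample.
losses = [0.1] * 100 + [50.0]

erm = sum(losses) / len(losses)   # empirical risk: about 0.594
worst = max(losses)               # maximal loss:   50.0

print(round(erm, 3), worst)  # prints 0.594 50.0
```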
<h2 id="synergetic-relationship-between-academia-and-industry">Synergetic relationship between Academia and Industry</h2>
<p>So how does this all come together? I strongly believe that the relationship between Academia and Industry has to be synergetic. Never has an ‘academic skillset’ had such a direct translation into an industry context. While this direct translation is a root cause of the current debate on the relevance of academic ML research, it is also an opportunity for doing something together. Rather than outlining a very limited model of what one could do, I’d rather mention two current themes that I think are <em>not helpful</em> for achieving synergies—of course, as always, there are exceptions.</p>
<h3 id="the-false-god-of-co-employment">The false god of co-employment</h3>
<p>It is impossible to serve two masters with vastly different objectives. Some time back I talked to a researcher with a co-employment deal (similar to the one that Facebook would like to see) and I asked him about publishing. He told me about an argument that he had had with one of his superiors. Upon requesting time to publish (a pretty substantial methodological advance) he got the following answer (paraphrasing): “If it creates value for the company, why do you want to make it available to the public? If it does not create value, why do you waste time writing it up?” (For a much more nuanced and detailed discussion you should read: <a href="http://www.argmin.net/2018/08/09/co-employment/">“You Cannot Serve Two Masters: The Harms of Dual Affiliation”</a>)</p>
<h3 id="tapping-the-talent-pool-disguised-as-university-partnerships">Tapping the talent pool disguised as university partnerships</h3>
<p>Current interactions between academia and established industry players in the ML field are often reduced to treating the academic institution as a talent pool. This comes with many complications. Given the strong demand for ML talent, companies are vacuuming up whatever comes their way, including students that would have been brilliant academic researchers and are much less suited for an industry R&D type role. Often it is pure compensation numbers that lure students away, and while there might be a short-term benefit for industry, eventually this is akin to killing the golden goose. This is not to say that university partnerships with industry cannot be successful—I believe it is quite the opposite actually—but the <em>current predominant model</em> is harmful to academic institutions (and society at large) and does not satisfy the <em>equal partners</em> requirement; you know how the saying goes: <em>if you cannot spot the sucker in the room, it is you.</em></p>Sebastian PokuttaTL;DR: Is AI and ML research in academia relevant and necessary? Yes.Collaborating online, in real-time, with math-support and computations2018-08-20T11:20:04-04:002018-08-20T11:20:04-04:00http://www.pokutta.com/blog/random/2018/08/20/atom-markdown<p><em>TL;DR: Using atom + teletype + markdown as real-time math collaboration environment.</em>
<!--more--></p>
<p>One of the biggest challenges to overcome in (research) collaborations is often that people are not in the same place and do not have a common “space” available where one can exchange and discuss, including being able to write math equations, in real-time: <em>a shared digital whiteboard</em>. Recently, I finally found a solution that works well (for my needs) and I thought it might be helpful for others as well.
I have been through numerous iterations, including onenote, overleaf, rudel, google docs, gobby-based chats and many other tools and constructs but all of them fell short in at least one of the following dimensions:</p>
<ol>
<li>Real-time collaboration</li>
<li>Support for LaTeX-like math equations</li>
<li>Minimal setup time and cost</li>
<li>Compatible with other tools (e.g., integration with git)</li>
<li>Easily parseable for computations (e.g., for verification purposes)</li>
</ol>
<p><em>Disclaimer.</em> Since nobody has time to read and even less time to write: The following is really only a quick summary of what is needed in terms of software and packages. Go out and explore - it is super easy to use… and feel free to ask questions!!</p>
<h2 id="what-do-youget">What do you get?</h2>
<p>What you will get is a text editor with
Markdown support. Markdown in a nutshell:</p>
<blockquote>
<p>“Markdown is a lightweight markup language with plain text formatting syntax. It is designed so that it can be converted to HTML and many other formats using a tool by the same name” <a href="https://en.wikipedia.org/wiki/Markdown">[wikipedia]</a></p>
</blockquote>
<p>See <a href="https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet">[here]</a> and <a href="https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf">[here]</a> for a markdown cheatsheet and <a href="https://github.com/burodepeper/language-markdown/blob/master/FAQ.md">[here]</a> for some common questions.</p>
<p><strong>Real-time preview.</strong>
Think of instantaneous typesetting (see animation below) for all participants of the session.</p>
<p><strong>Math support.</strong> Simply type LaTeX code and have it typeset in real-time. Supported LaTeX commands etc are limited but good enough for most applications; see <a href="https://github.com/atom-community/markdown-preview-plus/blob/master/docs/math.md">[here]</a> and <a href="http://docs.mathjax.org/en/latest/tex.html#supported-latex-commands">[here]</a> for details.</p>
<p><strong>Real-time collaboration.</strong> Share a file in real-time and work on it together, similar to google docs but with the interactivity of a jupyter notebook and the readability of a document with typeset formulae.</p>
<p><strong>Interactive code.</strong> Execute python, julia, R, etc. in place via hydrogen (see below). Highlight code. shift-enter. And it runs. Right in your editor.</p>
<p><strong>Apart from that.</strong> Atom is also great for coding, LaTeX typesetting and much more… but that’s for some other time.</p>
<p class="center"><img src="http://www.pokutta.com/blog/assets/atomMD.gif" alt="Atom + markdown + math" class="align-center" /></p>
<h2 id="what-do-you-need-aka-installation">What do you need? aka Installation</h2>
<ol>
<li>Atom text editor from <a href="https://atom.io/">[atom.io]</a></li>
<li>Within atom install the following packages with its package manager:
<ul>
<li>markdown-preview-plus <a href="https://atom.io/packages/markdown-preview-plus">[link]</a></li>
<li>language-markdown <a href="https://atom.io/packages/language-markdown">[link]</a></li>
<li>teletype <a href="https://teletype.atom.io/">[link]</a></li>
<li>hydrogen <a href="https://atom.io/packages/hydrogen">[link]</a> (optional: for interactive code execution)</li>
</ul>
</li>
<li>Remarks:
<ul>
<li>You will need a github account to use teletype. Teletype uses github for data exchange between participants.</li>
<li>You might want to activate <em>“Enable Math Rendering By Default”</em> in the settings of the <em>markdown-preview-plus</em> package</li>
<li>You can activate the markdown preview with ctrl-shift-M</li>
<li>You can activate the (atom) command window with cmd/windows-shift-p, which is helpful to access <em>hydrogen</em> commands.</li>
</ul>
</li>
</ol>
<h2 id="adding-hydrogen-to-themix">Adding hydrogen to the mix.</h2>
<p>What is <a href="https://github.com/nteract/hydrogen#multiple-kernels-inside-one-rich-document">hydrogen</a>:</p>
<blockquote>
<p>Hydrogen was inspired by Bret Victor’s ideas about the power of instantaneous feedback and the design of Light Table. Running code inline and in real time is a more natural way to develop. By bringing the interactive style of Light Table to the rock-solid usability of Atom, Hydrogen makes it easy to write code the way you want to.</p>
</blockquote>
<p>In short: run your code out of atom like jupyter (including plotting etc) and collaborate like a boss. Wanna verify computations of a limit or integral? Write it in sympy and run it <strong>right there</strong>.</p>
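<p>For instance, here is the kind of quick check one can run inline via hydrogen (plain standard library here; sympy would give the symbolic answer instead):</p>

```python
import math

# Numerically approach lim_{x -> 0} sin(x)/x = 1 by shrinking x.
values = [math.sin(10.0 ** -k) / 10.0 ** -k for k in range(1, 6)]
print(values[-1])  # approaches 1.0
```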
<p class="center"><img src="http://www.pokutta.com/blog/assets/hydrogen.gif" alt="from the hydrogen site for illustration only" class="align-center" /></p>
<p><a href="https://github.com/nteract/hydrogen#multiple-kernels-inside-one-rich-document">[Image from the hydrogen site for illustration only]</a></p>
<p>…and works especially well together with the setup from above. <a href="https://blog.nteract.io/hydrogen-interactive-computing-in-atom-89d291bcc4dd">[Further read]</a></p>Sebastian PokuttaTL;DR: Using atom + teletype + markdown as real-time math collaboration environment.