A Secant Method Line Search for Frank-Wolfe algorithms
TL;DR: This is a brief post about a line search based on the secant method for Frank-Wolfe algorithms. This line search is now available in the FrankWolfe.jl package.
Introduction
The Secant Method is a well-known method for finding a root of a function $f$. It is based on the recursion:
\[\tag{secant} x_{n+1} \leftarrow x_{n} - f(x_{n}) \cdot \frac{x_{n} - x_{n-1}}{f(x_{n}) - f(x_{n-1})},\]with initial values $x_0, x_1$ chosen appropriately. We might think of the secant method as an approximate version of Newton's method, which uses the update
\[x_{n+1} \leftarrow x_{n} - f(x_n) \cdot \frac{1}{f'(x_{n})},\]so that we effectively approximate
\[f'(x_{n}) \approx \frac{f(x_{n}) - f(x_{n-1})}{x_{n} - x_{n-1}}.\]Note however that the secant method is much older: “However, the secant method predates Newton’s method by over 3000 years” as Wikipedia states; we refer the interested reader to [PT] for an overview.
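To make the recursion concrete, here is a minimal, self-contained sketch of (secant) in Julia; the function name `secant_root`, the starting points, and the tolerance are illustrative choices and not part of any particular package.

```julia
# Minimal sketch of the (secant) recursion; one new function evaluation per iteration.
function secant_root(f, x0, x1; tol=1e-12, maxiter=100)
    f0, f1 = f(x0), f(x1)
    for _ in 1:maxiter
        f1 == f0 && break                      # flat secant: no further progress possible
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # the (secant) update
        x0, f0 = x1, f1                        # reuse the previously computed function value
        x1, f1 = x2, f(x2)
        abs(f1) < tol && break
    end
    return x1
end

secant_root(x -> x^2 - 2, 1.0, 2.0)  # ≈ 1.4142135623730951, i.e., √2
```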
So what does this have to do with line searches and Frank-Wolfe algorithms? Usually in those algorithms we update
\[x_{t+1} \leftarrow x_t - \lambda (x_t - v_t),\]with $\lambda \in [0, 1]$ and it makes sense to choose $\lambda$ to maximize progress, i.e., to solve
\[\lambda \leftarrow \arg\min_{\lambda \in [0,1]} f(x_t - \lambda (x_t - v_t)),\]which is the standard line search problem. Finding such a $\lambda$, under suitable assumptions as we discuss later, can be achieved by solving
\[\frac{\partial}{\partial \lambda} f(x_t - \lambda d) = 0,\]i.e., we are looking for the root of the optimality condition for the optimization problem from above.
A convergence proof for the Secant Method
Before we consider the actual line search problem, we will first derive the convergence rate of the secant method once we are close enough to the root; we will talk about this assumption later. The proof, which follows [G], is relatively straightforward, relying on a Taylor expansion and basic algebraic manipulations. Suppose that \(f(a) = 0\), i.e., $a$ is a root. For convenience let us rewrite the iterates as \(x_n = a + \epsilon_n\). With this we obtain:
\[\tag{recursion} \epsilon_{n+1} = \epsilon_{n} - f(a + \epsilon_n) \cdot \frac{\epsilon_n - \epsilon_{n-1}}{f(a + \epsilon_n) - f(a + \epsilon_{n-1})}.\]If $f$ is now twice differentiable with \(f'(a) \neq 0\) and \(f''(a) \neq 0\) (for roots with higher multiplicity the reader is referred to [D]), we can compute the degree-$2$ Taylor expansion, using that $f(a) = 0$ and that $\epsilon$ is “small” (i.e., we are close enough to $a$):
\[\tag{taylor}f(a + \epsilon) \approx f'(a) \epsilon + \frac{f''(a)}{2} \epsilon^2 = \epsilon f'(a) (1 + M \epsilon),\]with \(M \doteq \frac{f''(a)}{2 f'(a)}\). We can now use (taylor) to rewrite (recursion) to obtain:
\[\epsilon_{n+1} \approx \epsilon_n - \frac{\epsilon_n (1 + M \epsilon_n)}{1+M(\epsilon_n + \epsilon_{n-1})} = \frac{\epsilon_{n-1} \epsilon_n M}{1+M(\epsilon_n + \epsilon_{n-1})} \approx \epsilon_{n-1} \epsilon_n M,\]i.e.,
\[\epsilon_{n+1} \approx \frac{f''(a)}{2 f'(a)} \epsilon_{n-1} \epsilon_n,\]which is quite close to the contraction \(\epsilon_{n+1} \approx \frac{f''(a)}{2 f'(a)}\epsilon_n^2\) obtained for Newton's method with a similar reasoning. This difference, however, is the reason for the reduced convergence rate compared to Newton's method, as we will see now.
We claim that \(\abs{\epsilon_{n+1}} \approx C \abs{\epsilon_n}^p\) and will try to estimate $p$ by unrolling one iteration. We have:
\[C \abs{\epsilon_n}^p \approx \abs{\epsilon_{n+1}} \approx \abs{M} \abs{\epsilon_n} \abs{\epsilon_{n-1}},\]so that rearranging gives
\[\abs{\epsilon_n} \approx \left(\frac{\abs{M}}{C}\right)^{1/(p-1)} \abs{\epsilon_{n-1}}^{1/(p-1)}.\]Thus we can set $C = \left(\frac{\abs{M}}{C}\right)^{1/(p-1)}$ and choose $p > 0$ such that $p = 1/(p-1)$. Solving the arising quadratic equation gives $p = \frac{1+\sqrt{5}}{2}$, i.e., the Golden ratio, and we can further resolve $C$ to \(C = \abs{M}^{1/p} = \abs{\frac{f''(a)}{2 f'(a)}}^{p-1}\).
Thus we obtain the final form:
\[\tag{secant-conv} \abs{x_{n+1} - a} \approx \left|\frac{f''(a)}{2 f'(a)}\right|^{\frac{\sqrt{5}-1}{2}} \abs{x_{n} - a}^{\frac{1+\sqrt{5}}{2}}.\]Assumptions.
- $f$ is twice differentiable
- \(f'(a) \neq 0\) and \(f''(a) \neq 0\)
- We have to start close enough to $a$, so that we can assume that $\epsilon$ is small enough
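As a rough sanity check of (secant-conv), one can run the recursion in extended precision and look at the ratios $\log\abs{\epsilon_{n+1}} / \log\abs{\epsilon_n}$, which should approach the Golden ratio. The test function $f(x) = x^3 - 2$, the starting points, and the precision below are arbitrary choices for illustration only.

```julia
# Empirical order of convergence of the secant method (illustrative test function).
setprecision(BigFloat, 512) do
    f(x) = x^3 - 2
    a = cbrt(big(2.0))                  # the root
    x0, x1 = BigFloat(1), BigFloat(3) / 2
    errs = BigFloat[abs(x0 - a), abs(x1 - a)]
    for _ in 1:8
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        x0, x1 = x1, x2
        push!(errs, abs(x1 - a))
    end
    # the ratios log|ε_{n+1}| / log|ε_n| approach (1 + √5)/2 ≈ 1.618
    println([Float64(log(errs[i + 1]) / log(errs[i])) for i in 3:length(errs) - 1])
end
```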
Using the mean value theorem we can also derive a convergence test using \(f(a) - f(x_n) = f'(c_n) (a-x_n)\), for some value $c_n$ between $a$ and $x_n$. Again, if $x_n$ is close to $a$, then $c_n \approx x_n$ and we have
\[a-x_n = - \frac{f(x_n)}{f'(c_n)} \approx -\frac{f(x_n)}{f'(x_n)} \approx - f(x_n) \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})},\]and by the definition of the secant method, the last term is exactly equal to $x_{n+1} - x_n$. Therefore we obtain
\[a-x_n \approx x_{n+1} - x_n,\]which we can use to dynamically estimate the error in order to stop the search. Alternatively, we can simply check whether $f(x_{n}) \approx 0$, as we compute $f(x_{n})$ anyway in each iteration; however, that error is measured in the function value and not in the distance to the root.
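The following tiny experiment illustrates this error estimate on $f(x) = x^2 - 2$, whose root $a = \sqrt{2}$ we know explicitly; again, this is purely illustrative and not package code.

```julia
# Compare the true error a - x_n with the estimate x_{n+1} - x_n (illustrative only).
let f = x -> x^2 - 2, a = sqrt(2)
    x0, x1 = 1.0, 2.0
    for n in 1:6
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        println("n = $n:  true error = ", a - x1, "   estimate = ", x2 - x1)
        x0, x1 = x1, x2
    end
end
```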
Note. The secant method only uses function evaluations, and in fact only one per iteration if we reuse already computed values. This is in contrast to Newton's method, which requires both one function evaluation and one gradient computation per iteration. Thus if we compare cost and not iterations, and assuming that a gradient evaluation is roughly as expensive as a function evaluation, we can do two secant method steps for each Newton step, leading to an order of convergence of roughly $1.618^2 \approx 2.618$ vs. $2$. This suggests that the secant method can be faster than Newton's method in many practically relevant cases.
Observation (Estimation of iterations). Let \(\tau = \abs{a - x_0}\) and $\epsilon > 0$. Under the assumptions from above, after no more than \(n \approx \lceil 2 \log \log \frac{1}{\epsilon \abs{M}} - 2\log \log \frac{1}{\tau \abs{M}} \rceil\) iterations we have \(\abs{a - x_n} < \epsilon\).
Proof. With the parameter choice as above, we have the recursion
\[\abs{x_{n+1} - a} \approx \left| M\right|^{p-1} \abs{x_{n} - a}^{p},\]so that we obtain
\[|x_n - a| \approx \left(\left| M\right|^{p-1}\right)^{\frac{p^n-1}{p-1}} |x_0 - a |^{p^n} = \left| M\right|^{p^n-1} |x_0 - a |^{p^n}.\]Thus we estimate
\[\epsilon \approx \left| M\right|^{p^n-1} \tau^{p^n},\]so that
\[\log (\epsilon \abs{M}) \approx p^n \log (\tau \abs{M}),\]and hence
\[\frac{\log \frac{1}{\epsilon \abs{M}}}{\log \frac{1}{\tau \abs{M}}} \approx p^n.\]Assuming that $\epsilon \abs{M} < 1$ and $\tau \abs{M} < 1$, so that both logarithms are positive, taking one further log leads to
\[\frac{\log \log \frac{1}{\epsilon \abs{M}} - \log \log \frac{1}{\tau \abs{M}}}{\log p} \approx n,\]and using $\log p \approx 0.48 < 1/2$, this can be simplified to
\[\tag{iterations}2 \log \log \frac{1}{\epsilon \abs{M}} - 2\log \log \frac{1}{\tau \abs{M}} \approx n.\] \[\qed\]For example, with $\abs{M} = 1$, $\tau = 0.1$, and $\epsilon = 10^{-8}$, the estimate (iterations) suggests roughly $\lceil 4.2 \rceil = 5$ iterations, which matches the small iteration counts reported in the implementation notes below.
The Secant Method for line search
With this we are ready to consider the secant method for our line search problem from above. Note that this is only guaranteed to work for functions that satisfy the necessary requirements of the secant method. The algorithms we consider in the FrankWolfe.jl package are all Frank-Wolfe algorithms that update their iterates via
\[x_{t+1} \leftarrow x_t - \lambda d,\]with $\lambda \in [0, \lambda_\max]$ where, e.g., $d = x_t - v_t$ and $\lambda_\max = 1$ for the standard Frank-Wolfe algorithm. As such, it makes sense to choose $\lambda$ to maximize progress, i.e., to solve
\[\lambda \leftarrow \arg\min_{\lambda \in [0,\lambda_\max]} f(x_t - \lambda d),\]which is the standard line search problem. There are many ways to solve it, e.g., via backtracking, Golden-section search, etc. Another way is to solve its optimality system, i.e., to find $\lambda$, under suitable assumptions, such that
\[\tag{ls-opt} \frac{\partial}{\partial \lambda} f(x_t - \lambda d) = 0 \Leftrightarrow \langle \nabla f(x_t - \lambda d), d \rangle = 0,\]using \(\frac{\partial}{\partial \lambda} f(x_t - \lambda d) = \langle \nabla f(x_t - \lambda d), -d \rangle\); note that we dropped the minus as we solve against $0$, and we use \(\langle \nabla f(x_t - \lambda d), d \rangle\) in the definition of (ls-secant), as the minus cancels and this keeps things clean. Finding the root of that system can be done with the secant method, which has the big advantage that, compared to Newton's method, we do not need to compute Hessians. Applying the secant method to (ls-opt), we obtain the recursion
\[\tag{ls-secant} \lambda_{n+1} \leftarrow \lambda_{n} - \langle \nabla f(x_t - \lambda_n d), d \rangle \cdot \frac{\lambda_{n} - \lambda_{n-1}}{\langle \nabla f(x_t - \lambda_n d), d \rangle - \langle \nabla f(x_t - \lambda_{n-1} d), d \rangle}.\]Note that we have deliberately chosen the expanded form in the denominator, as this form makes it easier to see how information can be reused: the computation of \(\langle \nabla f(x_t - \lambda_n d), d \rangle\) can be reused in the following iteration.
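A minimal sketch of this recursion in Julia might look as follows; `grad`, `x`, `d`, the starting points, and the tolerance are placeholders for illustration and do not reflect the actual FrankWolfe.jl implementation.

```julia
using LinearAlgebra: dot

# Sketch of the (ls-secant) recursion with reuse of the inner product ⟨∇f(x - λ d), d⟩.
function secant_linesearch(grad, x, d, λ0, λ1; tol=1e-8, maxiter=40)
    g0 = dot(grad(x - λ0 * d), d)
    g1 = dot(grad(x - λ1 * d), d)
    for _ in 1:maxiter
        g1 == g0 && break                       # flat secant: stop
        λ2 = λ1 - g1 * (λ1 - λ0) / (g1 - g0)    # the (ls-secant) update
        λ0, g0 = λ1, g1                         # reuse ⟨∇f(x - λ_n d), d⟩ in the next step
        λ1, g1 = λ2, dot(grad(x - λ2 * d), d)   # one new gradient evaluation per iteration
        abs(g1) < tol && break                  # first-order optimality of the line search
    end
    return λ1
end
```

Just as in the plain secant method, only one new gradient evaluation is needed per iteration, since the previous inner product is carried over.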
For the convergence we have the same assumptions as before. The constant $M$ is now one order higher in terms of derivatives, i.e.,
\[M = \frac{\frac{\partial^3}{\partial \lambda^3} f(x_t - \lambda d)(\lambda_a)}{2 \frac{\partial^2}{\partial \lambda^2} f(x_t - \lambda d)(\lambda_a)},\]where $\lambda_a$ denotes the optimal $\lambda$. This can be simplified in the case when $f$ is self-concordant to
\[\abs{M} \leq \sqrt{\frac{\partial^2}{\partial \lambda^2} f(x_t - \lambda d)(\lambda_a)} = \sqrt{d^T \nabla^2 f(x_t - \lambda_a d) d},\]using \(\frac{\partial^2}{\partial \lambda^2} f(x_t - \lambda d)(\lambda_a) = (-d)^T \nabla^2 f(x_t - \lambda_a d) (-d) = d^T \nabla^2 f(x_t - \lambda_a d) d\).
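For completeness, writing $\varphi(\lambda) \doteq f(x_t - \lambda d)$ (our shorthand), the standard self-concordance inequality \(\abs{\varphi'''(\lambda)} \leq 2 \varphi''(\lambda)^{3/2}\) gives the bound above in one line:
\[\abs{M} = \frac{\abs{\varphi'''(\lambda_a)}}{2\, \varphi''(\lambda_a)} \leq \frac{2\, \varphi''(\lambda_a)^{3/2}}{2\, \varphi''(\lambda_a)} = \sqrt{\varphi''(\lambda_a)} = \sqrt{d^T \nabla^2 f(x_t - \lambda_a d)\, d}.\]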
This approach has been implemented in the FrankWolfe.jl package as the Secant line search method, with the following implementation specifics:
Implementation Notes.
- The number of required iterations to reach an $\epsilon$-accuracy of say $\epsilon = 10^{-8}$ is very low, often being only 4-5 iterations; this is further improved with warm-starting (see below).
- Warm-starting: we use the $\lambda$ from the last run of the line search as a starting point for the next one.
- We use $\abs{\langle \nabla f(x_t - \lambda_n d), d \rangle} < \epsilon$ as stopping criterion, i.e., we test first-order optimality of the line search condition.
- We added a domain oracle for application to generalized self-concordant functions.
- If $\lambda_n \not \in [0, \lambda_\max]$, we clip $\lambda_n$ accordingly. If this happens twice in a row, we stop, as then $\lambda_n - \lambda_{n+1} = 0$ and no further updates are performed.
- For quadratic objectives the gradient is a linear function, so that the secant line search converges in one iteration (up to numerics etc.); a small toy check of this is shown right after this list.
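As a toy check of the last point, consider a small quadratic: one secant step on the directional derivative already lands on the exact minimizer. The data below (`Q`, `b`, `x`, `d`) is arbitrary and only serves as an illustration, not as package code.

```julia
using LinearAlgebra: dot

# One (ls-secant) step is exact for a quadratic f(y) = ½ yᵀQy - bᵀy (illustrative data).
Q = [2.0 0.5; 0.5 1.0]
b = [1.0, -1.0]
grad(y) = Q * y - b
x = [1.0, 1.0]
d = [1.0, -0.5]

g(λ) = dot(grad(x - λ * d), d)                   # the directional derivative from (ls-opt)
λ0, λ1 = 0.0, 1.0
λ2 = λ1 - g(λ1) * (λ1 - λ0) / (g(λ1) - g(λ0))    # a single secant step
@show g(λ2)                                      # ≈ 0: the line search is already solved
@show dot(grad(x), d) / dot(Q * d, d)            # closed-form optimal λ for comparison
```

Since the directional derivative of a quadratic is affine in $\lambda$, the secant through any two points coincides with it, so the computed $\lambda_2$ equals the closed-form optimum up to floating-point error.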
Computational Tests
The following graphic depicts computational experiments on various instances of a portfolio optimization problem; see examples/portfolio_opt.jl and examples/portfolio_opt_large.jl, and see also examples/optimal_experiment_design.jl for another example in the FrankWolfe.jl package.
Figure 1. Comparing four step-size strategies from the package: Agnostic, Adaptive, Monotonic, and Secant. For these problems the secant line search is superior in progress per iteration, as it solves the line search problem to much higher precision, and it is often also faster in wall-clock time.
Acknowledgement
I would like to thank David Martínez-Rubio for the initial idea of using an approximate Newton method to solve the line search problem for self-concordant functions.
References
[PT] Papakonstantinou, J. M., & Tapia, R. A. (2013). Origin and evolution of the secant method in one dimension. The American Mathematical Monthly, 120(6), 500-517. pdf
[G] Grinshpan, A. Order of convergence of the secant method. pdf
[D] Díez, P. (2003). A note on the convergence of the secant method for simple and multiple roots. Applied Mathematics Letters, 16(8), 1211-1215. pdf