TL;DR: This is an informal discussion of our recent paper Conditional Gradients for the Approximately Vanishing Ideal by Elias Wirth and Sebastian Pokutta. In the paper, we present a new algorithm, the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI), for the construction of a set of generators of the approximately vanishing ideal of a finite data set $X \subseteq \mathbb{R}^n$. The novelty of our approach is that CGAVI constructs the set of generators by solving instances of convex optimization problems with the Pairwise Frank-Wolfe algorithm (PFW).

Written by Elias Wirth.

Introduction

The accuracy of classification algorithms relies on the quality of the available features. Here, we focus on feature transformations for a linear kernel Support Vector Machine (SVM) [SV], an algorithm that relies on the linear separability of the different classes to achieve high classification accuracy. Our approach is based on the idea that a given set of data points $X = \lbrace x_1, \ldots, x_m\rbrace\subseteq \mathbb{R}^n$ can be succinctly described by the vanishing ideal over $X$, i.e., the set of polynomials vanishing over $X$:

\[\mathcal{I}_X = \lbrace f\in \mathcal{P} \mid f(x) = 0 \text{ for all } x \in X\rbrace,\]

where $\mathcal{P}$ denotes the polynomial ring in $n$ variables.

The set $\mathcal{I}_X$ contains infinitely many polynomials, but, by Hilbert’s basis theorem [CLO], there exist finitely many polynomials $g_1, \ldots, g_k \in \mathcal{I}_X$, $k\in \mathbb{N}$, referred to as generators, such that for any $f\in \mathcal{I}_X$, there exist polynomials $h_1, \ldots, h_k \in \mathcal{P}$ with

\[f = \sum_{i = 1}^kg_ih_i.\]

Thus, the set of generators is a finite representation of the ideal $\mathcal{I}_X$, and, as we explain below, can be used to create a linearly separable representation of the data set.
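
As a small illustration (this toy example is not taken from the paper), consider the four points

\[X = \lbrace (1, 0), (0, 1), (-1, 0), (0, -1)\rbrace \subseteq \mathbb{R}^2.\]

Writing $y_1, y_2$ for the two coordinates, the polynomials $g_1 = y_1^2 + y_2^2 - 1$ and $g_2 = y_1y_2$ both vanish on all of $X$, and one can check that they generate $\mathcal{I}_X$: every polynomial vanishing on these four points can be written as $h_1g_1 + h_2g_2$ for suitable $h_1, h_2\in\mathcal{P}$.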

How can generators be used for classification?

We now explain how sets of generators can be employed to create a linearly separable representation of the data: Consider a set of data points $X = \lbrace x_1, \ldots, x_m\rbrace \subseteq \mathbb{R}^n$ with associated label vector $Y \in \lbrace -1, 1 \rbrace ^m$. The goal is to train a linear classifier that assigns the correct label to each data point. Let $X^{-1}\subseteq X$ and $X^{1}\subseteq X$ denote the subsets of feature vectors corresponding to data points with labels $-1$ and $1$, respectively. With access to an algorithm that can construct a set of generators for a data set $X\subseteq \mathbb{R}^n$, we construct a set of generators $\mathcal{G}^{-1} = \lbrace g_1, \ldots, g_k \rbrace$ of the vanishing ideal corresponding to $X^{-1}$, such that for all $g\in \mathcal{G}^{-1}$ it holds that

\[g(x) \begin{cases} = 0, & x \in X^{-1}\\ \neq 0, & x \in X^{1}. \end{cases}\]

Similarly, we construct a set of generators $\mathcal{G}^{1} = \lbrace h_1, \ldots, h_l \rbrace$ of the vanishing ideal corresponding to $X^{1}$, such that for all $h\in \mathcal{G}^{1}$ it holds that

\[h(x) \begin{cases} \neq 0, & x \in X^{-1}\\ = 0, & x \in X^{1}. \end{cases}\]

Let $\mathcal{G} := \mathcal{G}^{-1} \cup \mathcal{G}^{1} = \lbrace g_1, \ldots, g_k, h_1, \ldots, h_l\rbrace$ and consider the associated feature transformation:

\[x \mapsto \tilde{x} = \left(|g_1(x)|, \ldots, |g_k(x)|, |h_1(x)|, \ldots, |h_l(x)|\right)^\intercal\in \mathbb{R}^{k+l}.\]
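
As a minimal sketch of this transformation (with hand-picked, hypothetical generators rather than ones computed by CGAVI), assume each generator is available as a callable mapping $\mathbb{R}^n$ to $\mathbb{R}$:

```python
import numpy as np

def transform(x, generators_neg, generators_pos):
    """Map x to (|g_1(x)|, ..., |g_k(x)|, |h_1(x)|, ..., |h_l(x)|)."""
    feats = [abs(g(x)) for g in generators_neg] + [abs(h(x)) for h in generators_pos]
    return np.array(feats)

# Hand-picked, hypothetical generators for illustration: class -1 is assumed to
# lie on the unit circle, class 1 to sit at the origin.
def g1(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0  # vanishes on the unit circle

def h1(x):
    return x[0] ** 2 + x[1] ** 2        # vanishes at the origin

print(transform(np.array([0.0, 1.0]), [g1], [h1]))  # [0. 1.]  (a class -1 point)
print(transform(np.array([0.0, 0.0]), [g1], [h1]))  # [1. 0.]  (a class  1 point)
```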

Under mild assumptions [L], it then holds that for $x\in X^{-1}$,

\[\tilde{x}_i \begin{cases} = 0, & i \in \lbrace 1, \ldots, k\rbrace\\ > 0, & i \in \lbrace k + 1, \ldots, k + l\rbrace, \end{cases}\]

and for $x\in X^{1}$,

\[\tilde{x}_i \begin{cases} > 0, & i \in \lbrace 1, \ldots, k\rbrace\\ = 0, & i \in \lbrace k + 1, \ldots, k + l\rbrace. \end{cases}\]

The transformed data is now linearly separable. Indeed, let

\[w := (1, \ldots, 1, -1, \ldots, -1)^\intercal \in \mathbb{R}^{k + l},\]

where the first $k$ entries are $1$ and the last $l$ entries are $-1$. Then,

\[w^\intercal \tilde{x} \begin{cases} < 0, & x\in X^{-1}\\ > 0, & x\in X^{1}, \end{cases}\]

and we can perfectly classify all $x \in X$. In practice, instead of classifying with the fixed weight vector $w$, we train a linear kernel SVM [SV] on the transformed data.
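
Continuing the sketch above (the transformed feature values below are made up for illustration), the fixed weight vector separates the classes, and so does a linear SVM trained on the transformed data, here scikit-learn's LinearSVC:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Transformed features: the first k = 1 coordinate comes from G^{-1}, the last
# l = 1 coordinate from G^{1}; the numeric values are hypothetical.
X_tilde = np.array([[0.0, 1.0],   # x in X^{-1}: generators of G^{-1} vanish
                    [0.0, 0.8],   # x in X^{-1}
                    [1.0, 0.0],   # x in X^{1}: generators of G^{1} vanish
                    [1.3, 0.0]])  # x in X^{1}
y = np.array([-1, -1, 1, 1])

w = np.array([1.0, -1.0])    # first k entries 1, last l entries -1, as above
print(np.sign(X_tilde @ w))  # [-1. -1.  1.  1.]

svm = LinearSVC().fit(X_tilde, y)
print(svm.score(X_tilde, y))  # expect 1.0 on this toy data
```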

Noisy data: The vanishing ideal is highly susceptible to noise in the data. Thus, in practice, instead of constructing generators of the vanishing ideal, we construct generators of the approximately vanishing ideal, that is, the set of polynomials $g\in \mathcal{P}$ such that $g(x)\approx 0$ for all $x\in X$. For details on the switch to the approximately vanishing ideal, we refer the interested reader to the full paper.
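
One way to make "approximately vanishing" precise (the exact definition and normalization used in the paper may differ) is to fix a tolerance $\psi \geq 0$ and require

\[\frac{1}{|X|}\sum_{x\in X} g(x)^2 \leq \psi,\]

so that the case $\psi = 0$ recovers the vanishing ideal $\mathcal{I}_X$.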

Contributions

Our main contribution is the introduction of a new algorithm for the construction of a finite set of generators of the approximately vanishing ideal of a data set $X\subseteq\mathbb{R}^n$, the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI). The novelty of our approach lies in the way CGAVI constructs these generators: by solving (constrained) convex optimization problems (CCOPs). In CGAVI, these CCOPs are solved using the Pairwise Frank-Wolfe algorithm (PFW) [LJ], whereas related methods such as the Approximate Vanishing Ideal algorithm (AVI) [H] and Vanishing Component Analysis (VCA) [L] employ Singular Value Decompositions (SVDs) to construct generators. As we demonstrate in our paper, our approach admits the following attractive properties when the CCOP is the LASSO and is solved with PFW (a minimal sketch of PFW applied to the LASSO follows the list below):

  1. Generalization bounds: Under mild assumptions, the generators constructed with CGAVI provably vanish on out-of-sample data, and the combined approach of constructing generators with CGAVI to transform features for a linear kernel SVM inherits the margin bound of the SVM. To the best of our knowledge, these results cannot be extended to AVI or VCA.
  2. Sparse generators: PFW is known to construct sparse iterates [LJ], which then leads to the construction of sparse generators with CGAVI.
  3. Blueprint: Even though we propose to solve the CCOP with PFW, it is possible to replace PFW with any solver of (constrained) convex optimization problems. Thus, our approach gives rise to a family of procedures for the construction of generators of the approximately vanishing ideal.
  4. Empirical results: In practical experiments, we observe that CGAVI tends to construct fewer and sparser generators than AVI or VCA. For the combined approach of constructing generators to transform features for a linear kernel SVM, generators constructed with CGAVI lead to test set classification errors and evaluation times comparable to or better than generators constructed with related methods such as AVI or VCA.
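
For concreteness, the following is a minimal sketch of PFW applied to the LASSO in its constrained form, i.e., least squares over an $\ell_1$-ball. It illustrates the solver only and is not the paper's implementation of CGAVI; the function name, the radius parameter, and the synthetic data are hypothetical.

```python
import numpy as np

def pairwise_frank_wolfe(A, b, radius=1.0, iterations=500):
    """Sketch of Pairwise Frank-Wolfe for min 0.5*||Ax - b||^2 s.t. ||x||_1 <= radius.

    The vertices of the feasible region are +-radius * e_i; the iterate is kept
    as a convex combination of active vertices, stored in `weights`.
    """
    n = A.shape[1]
    x = np.zeros(n)
    x[0] = radius
    weights = {(0, 1.0): 1.0}  # active set: (coordinate, sign) -> convex weight

    for _ in range(iterations):
        grad = A.T @ (A @ x - b)

        # Frank-Wolfe vertex: minimizes <grad, v> over the l1 ball.
        i = int(np.argmax(np.abs(grad)))
        fw = (i, -1.0 if grad[i] > 0 else 1.0)

        # Away vertex: maximizes <grad, v> over the current active set.
        away = max(weights, key=lambda v: v[1] * grad[v[0]])

        # Pairwise direction d = v_fw - v_away.
        d = np.zeros(n)
        d[fw[0]] += radius * fw[1]
        d[away[0]] -= radius * away[1]

        Ad = A @ d
        if np.dot(Ad, Ad) <= 1e-16:  # FW and away vertex coincide
            break

        # Exact line search for the quadratic objective, clipped to the maximal
        # pairwise step size (the weight currently on the away vertex).
        gamma = np.clip(-np.dot(grad, d) / np.dot(Ad, Ad), 0.0, weights[away])

        x += gamma * d
        weights[fw] = weights.get(fw, 0.0) + gamma
        weights[away] -= gamma
        if weights[away] <= 1e-12:  # drop step: remove the exhausted vertex
            del weights[away]

    return x

# Tiny usage example with synthetic data (hypothetical, for illustration only):
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = A @ np.array([1.0, -0.5] + [0.0] * 8)
print(np.round(pairwise_frank_wolfe(A, b, radius=1.5), 3))  # sparse solution
```

Since each vertex of the $\ell_1$-ball is supported on a single coordinate, an iterate built from few active vertices is itself sparse, which is the property referred to in item 2 above.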

Conclusion

From a high-level perspective, we reformulate the construction of generators as a (constrained) convex optimization problem, thus motivating the replacement of the SVD-based approach prevalent in most generator construction algorithms. Our approach enjoys theoretically appealing properties: for example, we derive two generalization bounds that do not hold for SVD-based approaches, and since the solver of the CCOP can be chosen freely, CGAVI is highly modular. Practically, CGAVI can compete with and sometimes outperform SVD-based approaches and produces fewer and sparser generators than AVI or VCA.

References

[CLO] Cox, D., Little, J., and O’Shea, D. (2013). Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra. Springer Science & Business Media.

[F] Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110.

[H] Heldt, D., Kreuzer, M., Pokutta, S., and Poulisse, H. (2009). Approximate computation of zero-dimensional polynomial ideals. Journal of Symbolic Computation, 44(11):1566–1591.

[LJ] Lacoste-Julien, S. and Jaggi, M. (2015). On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504.

[L] Livni, R., Lehavi, D., Schein, S., Nachliely, H., Shalev-Shwartz, S., and Globerson, A. (2013). Vanishing component analysis. In International Conference on Machine Learning, pages 597–605.

[SV] Suykens, J. A. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300.