Support Vector Machines

This software accompanies the paper Support vector machine training using matrix completion techniques by Martin Andersen and Lieven Vandenberghe. The code can be downloaded as a zip file and requires the Python extensions CVXOPT and CHOMPACK 2.3.1 or later.

Feedback and bug reports

We welcome feedback, and bug reports are much appreciated. Please email bug reports to martin.skovgaard.andersen@gmail.com.

Overview

This software provides two routines for soft-margin support vector machine training. Both routines use the CVXOPT QP solver which implements an interior-point method.

The routine softmargin() solves the standard SVM QP. It computes and stores the entire kernel matrix, and hence it is only suited for small problems.

The routine softmargin_appr() solves an approximate problem in which the (generally dense) kernel matrix is replaced by a positive definite approximation (the maximum determinant positive definite completion of a partially specified kernel matrix) whose inverse is sparse. This can be exploited in interior-point methods, and the technique is implemented as a custom KKT solver for the CVXOPT QP solver. As a consequence, softmargin_appr() can handle much larger problems than softmargin().

Documentation

softmargin(X, d, gamma, kernel='linear', sigma=1.0, degree=1, theta=1.0)

Solves the ‘soft-margin’ SVM problem

\[\begin{split}\begin{array}{ll} \mbox{maximize} & -(1/2) z^T Q z + d^Tz \\ \mbox{subject to} & 0 \preceq \mathrm{diag}(d) z \preceq \gamma \mathbf{1} \\ & \mathbf{1}^T z =0 \end{array}\end{split}\]

(with variables \(z\)), and its dual problem

\[\begin{split}\begin{array}{ll} \mbox{minimize} & (1/2) y^T Q^{-1} y + \gamma \mathbf{1}^Tv \\ \mbox{subject to} & \mathrm{diag}(d) (y + b \mathbf{1}) \succeq \mathbf{1} - v \\ & v \succeq 0 \end{array}\end{split}\]

(with variables \(y\), \(v\), \(b\)).

The \(m \times m\) kernel matrix \(Q\) is given by \(Q_{ij} = h(x_i, x_j)\), where \(h\) is a kernel function and \(x_i^T\) is the i’th row of the \(m \times n\) data matrix \(X\). The \(m\)-vector \(d\) contains the labels (i.e. \(d_i \in \{ -1,1\}\)). If \(Q\) is singular, we replace \(Q^{-1}\) in the dual with its pseudo-inverse and add the constraint \(y \in \mathrm{Range}(Q)\).
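As an illustration, the primal QP above maps directly onto the standard form accepted by cvxopt.solvers.qp. The following minimal sketch shows this mapping on a small random instance with a linear kernel; it is not the code used inside softmargin(), and the variable names are ours.

from cvxopt import matrix, spmatrix, normal, solvers

# small random instance: m points in R^2, with label d_i the sign of the first feature
m, gamma = 50, 2.0
X = normal(m, 2)
d = matrix([2.0*(X[i,0] >= 0) - 1.0 for i in range(m)], (m,1))
Q = X*X.trans()                             # linear kernel matrix (sigma = 1)

# maximize -(1/2) z^T Q z + d^T z  is equivalent to  minimize (1/2) z^T Q z - d^T z
P, q = Q, -d
# stack the inequalities -diag(d) z <= 0 and diag(d) z <= gamma*1
G = spmatrix(list(-d) + list(d), list(range(2*m)), 2*list(range(m)))
h = matrix([0.0]*m + [gamma]*m, (2*m, 1))
# equality constraint 1^T z = 0
A, b = matrix(1.0, (1, m)), matrix(0.0)

z = solvers.qp(P, q, G, h, A, b)['x']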

Valid kernel functions are:

'linear'

the linear kernel: \(h(x_i,x_j) = x_i^Tx_j/\sigma\)

'poly'

the polynomial kernel: \(h(x_i,x_j) = (x_i^Tx_j/\sigma)^d\)

'rbf'

the radial basis function: \(h(x_i,x_j) = \exp\{ -\|x_i-x_j\|^2/(2\sigma)\}\)

'tanh'

the sigmoid kernel: \(h(x_i,x_j) = \tanh(x_i^Tx_j/\sigma - \theta)\)

The kernel parameters \(\sigma\), the polynomial degree \(d\), and \(\theta\) are specified using the input arguments sigma, degree, and theta, respectively.
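For reference, the sketch below shows how a dense kernel matrix with these entries could be formed. The function is our own illustration and is not part of the SVMCMPL interface.

import math
from cvxopt import matrix

def kernel_matrix(X, kernel = 'linear', sigma = 1.0, degree = 1, theta = 1.0):
    # X is an m x n CVXOPT matrix with the training vectors x_i^T as its rows
    m, n = X.size
    Q = matrix(0.0, (m, m))
    for i in range(m):
        for j in range(m):
            u = sum(X[i,k]*X[j,k] for k in range(n))            # x_i^T x_j
            if kernel == 'linear':
                Q[i,j] = u/sigma
            elif kernel == 'poly':
                Q[i,j] = (u/sigma)**degree
            elif kernel == 'rbf':
                r2 = sum((X[i,k] - X[j,k])**2 for k in range(n))
                Q[i,j] = math.exp(-r2/(2.0*sigma))
            elif kernel == 'tanh':
                Q[i,j] = math.tanh(u/sigma - theta)
    return Q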

softmargin() returns a dictionary with the following keys:

'classifier'

a Python function object that takes an \(M \times n\) matrix with test vectors as rows and returns a vector with labels

'z'

a sparse \(m\)-vector

'cputime'

a tuple (\(T_{\rm tot}\), \(T_{\rm qp}\), \(T_{\rm ker}\)) where \(T_{\rm tot}\) is the total CPU time, \(T_{\rm qp}\) is the CPU time spent solving the QP, and \(T_{\rm ker}\) is the CPU time spent computing the kernel matrix

'iterations'

the number of interior point iterations

'misclassified'

a tuple (L1, L2) where L1 is a list of indices of misclassified training vectors from class 1, and L2 is a list of indices of misclassified training vectors from class 2
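A short usage sketch, assuming a data matrix X, a label vector d, a parameter gamma, and an M x n test matrix Xtest have already been defined (Xtest is a placeholder name, not part of the interface):

import svmcmpl

sol = svmcmpl.softmargin(X, d, gamma, kernel = 'rbf', sigma = 1.0)
labels = sol['classifier'](Xtest)         # predicted labels for the rows of Xtest
Ttot, Tqp, Tker = sol['cputime']          # total, QP, and kernel CPU times
mis1, mis2 = sol['misclassified']         # misclassified training vectors, per class
print("%d interior-point iterations" % sol['iterations'])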

softmargin_appr(X, d, gamma, width, kernel='linear', sigma=1.0, degree=1, theta=1.0)

Solves the ‘soft-margin’ SVM problem

\[\begin{split}\begin{array}{ll} \mbox{maximize} & -(1/2) z^T \tilde{Q} z + d^Tz \\ \mbox{subject to} & 0 \preceq \mathrm{diag}(d) z \preceq \gamma \mathbf{1} \\ & \mathbf{1}^T z =0 \end{array}\end{split}\]

(with variables \(z\)), and its dual problem

\[\begin{split}\begin{array}{ll} \mbox{minimize} & (1/2) y^T \tilde{Q}^{-1} y + \gamma \mathbf{1}^Tv \\ \mbox{subject to} & \mathrm{diag}(d) (y + b \mathbf{1}) \succeq \mathbf{1} - v \\ & v \succeq 0 \end{array}\end{split}\]

(with variables \(y\), \(v\), \(b\)).

The \(m \times m\) kernel matrix \(\tilde{Q}\) is the maximum determinant completion of the projection of \(Q\) on a band with bandwidth \(2w+1\). Here \(Q_{ij} = h(x_i, x_j)\) where \(h\) is one of the kernel functions defined under softmargin() and \(x_i^T\) is the i’th row of the \(m \times n\) data matrix \(X\). The \(m\)-vector \(d\) is a vector with labels (i.e. \(d_i \in \{ -1,1\}\)). The half-bandwidth parameter \(w\) is set using the input argument width.
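The sketch below illustrates what the band projection keeps: only the entries \(Q_{ij}\) with \(|i-j| \leq w\) are formed, and the remaining entries are treated as unspecified and filled in by the maximum determinant completion (computed internally via CHOMPACK). The helper only builds the banded pattern and is our own illustration, not SVMCMPL code.

from cvxopt import spmatrix

def band_projection(Q, w):
    # keep only the entries Q[i,j] with |i - j| <= w; all other entries are unspecified
    m = Q.size[0]
    I, J, V = [], [], []
    for j in range(m):
        for i in range(max(0, j - w), min(m, j + w + 1)):
            I.append(i); J.append(j); V.append(Q[i,j])
    return spmatrix(V, I, J, (m, m))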

softmargin_appr() returns a dictionary that contains the same keys as the dictionary returned by softmargin(). In addition to these keys, the dictionary returned by softmargin_appr() contains a second classifier:

'completion classifier'

a Python function object that takes an \(M \times n\) matrix with test vectors as rows and returns a vector with labels

Example 1

As a toy example, consider the following classification problem with two (nonlinearly) separable classes. We use as training set \(m\) points in \(\mathbf{R}^2\) generated according to a uniform distribution over the box \(\mathcal{B} = \{ x\,|\,\|x\|_\infty \leq 1 \}\). We assign labels using the function \(\mathrm{sign}(f(x)) = \mathrm{sign}(x_1 x_2)\), i.e., points in the first and third quadrants belong to class 1 and points in the second and fourth quadrants belong to class 2. We remark that in this simple example, the degree 2 polynomial kernel can separate the two classes.

The following Python code illustrates how to solve this classification problem using each of the two routines provided in SVMCMPL. In this example we solve a problem instance with 2,000 training points, and we use \(\gamma = 2.0\) and the RBF kernel with \(\sigma = 1.0\).

import cvxopt, svmcmpl

# m training points, uniformly distributed over the box [-1,1] x [-1,1]
m = 2000
X = 2.0*cvxopt.uniform(m,2) - 1.0
# labels d_i = sign(x_1*x_2): +1 in the first/third quadrants, -1 otherwise
d = cvxopt.matrix([2*int(v >= 0) - 1 for v in cvxopt.mul(X[:,0], X[:,1])], (m,1))
gamma = 2.0; kernel = 'rbf'; sigma = 1.0; width = 20
sol1 = svmcmpl.softmargin(X, d, gamma, kernel, sigma)               # standard (dense) SVM QP
sol2 = svmcmpl.softmargin_appr(X, d, gamma, width, kernel, sigma)   # approximation QP

Solving the standard (dense) SVM problem produced 445 support vectors, marked with white dots in the plot below:

[Figure: xor_N2000_rbf.png]

The solid curve marks the decision boundary, whereas the dashed curves are the -1 and +1 contours of \(g(x)\), where \(\mathrm{sign}(g(x))\) is the decision function.

Solving the approximation problem with half-bandwidth \(w = 10\) produced 1,054 support vectors.

[Figures: xor_N2000_rbf_skc_w10.png, xor_N2000_rbf_ckc_w10.png]

In this example, the standard kernel classifier is clearly better than the completion kernel classifier at this bandwidth. Increasing the half-bandwidth to \(w = 20\) produced 467 support vectors.

[Figures: xor_N2000_rbf_skc_w20.png, xor_N2000_rbf_ckc_w20.png]

Notice that both the standard kernel and completion kernel classifiers are now nearly identical to the classifier obtained by solving the standard SVM QP.

Solving the dense SVM QP required 7.0 seconds whereas the approximation QPs required 0.3 seconds and 0.7 seconds for \(w = 10\) and \(w = 20\), respectively.
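Continuing the example, the classifiers returned by the two calls can be compared on fresh test points drawn from the same box. The evaluation code below is our own addition and uses only the dictionary keys documented above.

M = 500
Xtest = 2.0*cvxopt.uniform(M, 2) - 1.0
dtest = cvxopt.matrix([2*int(v >= 0) - 1 for v in cvxopt.mul(Xtest[:,0], Xtest[:,1])], (M,1))

labels1 = sol1['classifier'](Xtest)                 # standard SVM classifier
labels2 = sol2['classifier'](Xtest)                 # standard kernel classifier (approximation QP)
labels3 = sol2['completion classifier'](Xtest)      # completion kernel classifier

errors = lambda lab: sum(1 for k in range(M) if lab[k] != dtest[k])
print([errors(lab) for lab in (labels1, labels2, labels3)])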

Example 2

The following example demonstrates the approximate SVM method on the MNIST database of handwritten digits. In the example we use the Python module mnist.py to read the database files. The following code trains a binary classifier on a training set consisting of 4,000 examples of the digit ‘0’ (class 1) and 4,000 examples of the digit ‘1’ (class 2).

import mnist, svmcmpl, cvxopt, random

digits1 = [ 0 ]
digits2 = [ 1 ]

m1 = 4000; m2 = 4000

# read training data
images, labels = mnist.read(digits1 + digits2, dataset = "training", path = "data/mnist")
images = images / 256.

# indices of the training examples belonging to each class
C1 = [ k for k in range(len(labels)) if labels[k] in digits1 ]
C2 = [ k for k in range(len(labels)) if labels[k] in digits2 ]

random.seed()
random.shuffle(C1)
random.shuffle(C2)

train = C1[:m1] + C2[:m2]
random.shuffle(train)
X = images[train,:]
d = cvxopt.matrix([ 2*(k in digits1) - 1 for k in labels[train] ])

gamma = 4.0
sol = svmcmpl.softmargin_appr(X, d, gamma, width = 50, kernel = 'rbf', sigma = 2**5)

In this example, both the standard kernel classifier and the completion kernel classifier misclassified 8 out of 2,115 test examples (digits ‘0’ and ‘1’ from the MNIST test set):

[Figure: mnist_miscls.png]
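A sketch of how this test-set evaluation might be reproduced, assuming that mnist.read() also accepts dataset = "testing" analogously to the training call above:

# read the test data and build the test matrix and label vector as for the training set
images_t, labels_t = mnist.read(digits1 + digits2, dataset = "testing", path = "data/mnist")
Xtest = images_t / 256.
dtest = cvxopt.matrix([ 2*(k in digits1) - 1 for k in labels_t ])

for key in ['classifier', 'completion classifier']:
    predicted = sol[key](Xtest)
    nerr = sum(1 for i in range(len(dtest)) if predicted[i] != dtest[i])
    print("%s: %d misclassified out of %d" % (key, nerr, len(dtest)))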