Separable Data
You can use a support vector machine (SVM) when your data has exactly two classes. An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM is the one with the largest margin between the two classes, where the margin is the maximal width of the slab, parallel to the hyperplane, that contains no interior data points.
The support vectors are the data points that are closest to the separating hyperplane; these points are on the boundary of the slab. The following figure illustrates these definitions, with + indicating data points of type 1, and – indicating data points of type –1.
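As a concrete illustration of these definitions, the sketch below fits a linear SVM to a small, hand-made separable data set and inspects the support vectors and the hyperplane parameters. It assumes scikit-learn is available; the data and the large `C` value (to approximate a hard margin) are choices made here for illustration, not part of the original text.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data: class +1 clustered near (2, 2), class -1 near (-2, -2).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -1.5], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin SVM for separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The support vectors lie on the boundary of the margin slab.
print("support vectors:\n", clf.support_vectors_)
print("hyperplane normal (beta):", clf.coef_[0])
print("bias (b):", clf.intercept_[0])
```

The fitted `coef_` and `intercept_` play the roles of β and b in the hyperplane equation below.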

Mathematical Formulation: Primal. This discussion follows Hastie, Tibshirani, and Friedman [1] and Christianini and Shawe-Taylor [2].
The data for training is a set of points (vectors) xj along with their categories yj. For some dimension d, xj ∈ Rd and yj = ±1. The equation of a hyperplane is
f(x)=x′β+b=0
where β ∈ Rd and b is a real number.
The following problem defines the best separating hyperplane (i.e., the decision boundary). Find β and b that minimize ‖β‖ such that for all data points (xj,yj),
yjf(xj)≥1.
The support vectors are the xj on the boundary, those for which yjf(xj)=1.
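To make the primal constraints concrete, the following numpy sketch checks yjf(xj) ≥ 1 for a candidate hyperplane and flags the points where the constraint is tight. The hyperplane (β, b) and data points here are hand-picked assumptions for illustration.

```python
import numpy as np

# Candidate hyperplane f(x) = x'beta + b (values chosen by hand for illustration).
beta = np.array([1.0, 0.0])
b = 0.0

X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

scores = X @ beta + b          # f(x_j) for each point
margins = y * scores           # y_j f(x_j)

# Feasible for the primal problem iff every margin is >= 1.
print("feasible:", bool((margins >= 1).all()))

# Support vectors are the points where the constraint is tight: y_j f(x_j) == 1.
support = np.isclose(margins, 1.0)
print("support vectors:\n", X[support])
```

Here (1, 0) and (−1, 0) satisfy the constraint with equality, so they are the support vectors of this candidate hyperplane.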
For mathematical convenience, the problem is usually given as the equivalent problem of minimizing (1/2)β′β. This is a quadratic programming problem. The optimal solution (β̂, b̂) enables classification of a vector z as follows:
class(z) = sign(z′β̂ + b̂) = sign(f̂(z)).
f̂(z) is the classification score and represents the distance z is from the decision boundary.
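The classification rule is a one-liner once (β̂, b̂) are known. A minimal numpy sketch, with assumed fitted parameters:

```python
import numpy as np

# Assumed fitted parameters (beta_hat, b_hat), for illustration only.
beta_hat = np.array([1.0, 0.0])
b_hat = 0.0

def classify(z):
    # Classification score f_hat(z) = z'beta_hat + b_hat; its sign is the class.
    score = z @ beta_hat + b_hat
    return np.sign(score), score

label, score = classify(np.array([0.5, 3.0]))
print(label, score)   # class +1, with score 0.5
```

The magnitude of the score, not just its sign, indicates how far z is from the decision boundary.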
Mathematical Formulation: Dual. It is computationally simpler to solve the dual quadratic programming problem. To obtain the dual, take positive Lagrange multipliers αj multiplied by each constraint, and subtract from the objective function:
LP = (1/2)β′β − ∑j αj(yj(xj′β + b) − 1),
where you look for a stationary point of LP over β and b. Setting the gradient of LP to 0, you get
| β = ∑j αj yj xj,  0 = ∑j αj yj. | (1) |
Substituting into LP, you get the dual LD:
LD = ∑j αj − (1/2)∑j∑k αj αk yj yk xj′xk,
which you maximize over αj ≥ 0. In general, many αj are 0 at the maximum. The nonzero αj in the solution to the dual problem define the hyperplane, as seen in Equation 1, which gives β as the sum of αjyjxj. The data points xj corresponding to nonzero αj are the support vectors.
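Equation 1 can be verified on a problem small enough to solve by hand. For the two points x1 = (1, 0), y1 = +1 and x2 = (−1, 0), y2 = −1 (a toy example constructed here, not from the original text), the dual reduces to LD = α1 + α2 − (1/2)(α1 + α2)² with α1 = α2, which is maximized at α1 = α2 = 1/2:

```python
import numpy as np

# Two-point toy problem: x1 = (1,0) with y1 = +1, x2 = (-1,0) with y2 = -1.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

# Hand-worked dual optimum for this problem: alpha1 = alpha2 = 1/2.
alpha = np.array([0.5, 0.5])

# Equation 1: beta is the alpha-weighted sum of y_j x_j.
beta = (alpha * y) @ X
print("beta:", beta)                      # (1, 0)
print("sum of alpha_j y_j:", alpha @ y)   # 0, as Equation 1 requires
```

Both αj are nonzero, so both points are support vectors, and β = (1, 0) is indeed the maximum-margin normal for this data.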
The derivative of LD with respect to a nonzero αj is 0 at an optimum. This gives
yjf(xj) − 1 = 0.
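This condition also recovers the bias b once β is known: at any support vector, yj(xj′β + b) = 1, so b = yj − xj′β (using 1/yj = yj since yj = ±1). A numpy sketch with assumed toy values:

```python
import numpy as np

# Assumed toy values: support vector x_j = (1, 0) with y_j = +1, and beta = (1, 0).
beta = np.array([1.0, 0.0])
x_j = np.array([1.0, 0.0])
y_j = 1.0

# Solve y_j (x_j' beta + b) = 1 for b; note 1/y_j = y_j since y_j = +-1.
b = y_j - x_j @ beta
print("b:", b)   # 0.0

# Check: the support vector sits exactly on the margin boundary.
print("y_j * f(x_j):", y_j * (x_j @ beta + b))   # 1.0
```

In practice, averaging this estimate of b over all support vectors improves numerical robustness.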