11.3. The Non-separable Case
11.3.1. Non-separable Data
So far, we have covered Linear Support Vector Machines (SVM) for separable data. For example, in the image on the left, we have two groups of data points that can be easily separated by a solid blue line. However, what if the data is not separable, meaning there is no single solid blue line that can perfectly separate the two groups? In such cases, we can extend the hard margin formulation in two ways.

One approach is known as the ‘soft margin,’ where we still aim to create a wide margin between the groups but allow some data points to be on the ‘wrong’ side of the dashed line. This introduces an objective function that balances maximizing the margin and minimizing the errors.
Another extension is to consider nonlinear decision boundaries. While a linear function may not work for some data configurations, we can explore nonlinear functions. This can be achieved by expanding the feature space to a higher dimension, effectively applying a ‘kernel trick’ without the need to explicitly visit the high-dimensional feature space. Nonlinear SVMs are often referred to as ‘kernel machines.’
Though these extensions are primarily designed for handling non-separable data, we can still use them for separable data scenarios.
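As a quick illustration of both extensions, here is a minimal scikit-learn sketch (the SVC estimator, the two-moons toy dataset, and the parameter values are assumptions for illustration, not part of the lecture):

    # A soft-margin linear SVM versus a kernel machine on non-separable data.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Soft margin with a linear boundary: C trades off margin width against errors.
    linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

    # Kernel machine: the RBF kernel implicitly expands the feature space,
    # giving a nonlinear boundary without ever visiting that space explicitly.
    kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    print("linear training accuracy:", linear_svm.score(X, y))
    print("RBF training accuracy:", kernel_svm.score(X, y))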
11.3.2. The Soft-Margin Problem
Now, let’s examine a data configuration (below) where it’s impossible to separate the two groups of data points using a linear function. Nevertheless, we’ll still introduce a wide margin, represented by the white space between the two groups of data points. Some data points, specifically two green points, fall on the wrong side of the dashed line.

Here’s how we’ll formulate the problem:
For data points on the correct side of the dashed line, we know they satisfy the condition $y_i(x_i^T \beta + \beta_0) \ge 1$. But for the two green dots on the wrong side, we introduce a slack variable, denoted as $\xi_i$, and relax the constraint to $y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i$. These $\xi_i$ values represent the degree of error, and they are non-negative because a point that already satisfies the margin requirement needs no slack ($\xi_i = 0$), while a violation corresponds to $\xi_i > 0$. Consequently, the objective function comprises two terms: one focusing on maximizing the margin (that is, minimizing $\tfrac{1}{2}\|\beta\|^2$) and another aiming to minimize the errors (measured by $\sum_{i=1}^n \xi_i$):

$$\min_{\beta,\, \beta_0,\, \xi}\ \frac{1}{2}\|\beta\|^2 + \gamma \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n.$$

This optimization problem is still convex, with a quadratic objective and linear constraints, so, as in the hard-margin case, we can work with its Lagrangian dual.
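To make the primal concrete, here is a minimal sketch that hands it directly to the cvxpy modeling library (the library choice, the synthetic data, and the value of $\gamma$ are assumptions for illustration; dedicated SVM solvers are far more efficient in practice):

    import cvxpy as cp
    import numpy as np

    # Synthetic, non-separable 2-D data with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    n, p = 60, 2
    X = np.vstack([rng.normal(-1, 1, (n // 2, p)), rng.normal(1, 1, (n // 2, p))])
    y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

    gamma = 1.0                  # cost attached to the slack variables
    beta = cp.Variable(p)        # normal vector of the separating hyperplane
    beta0 = cp.Variable()        # intercept
    xi = cp.Variable(n)          # one slack variable per observation

    # Soft-margin primal: maximize the margin (minimize ||beta||^2 / 2)
    # while penalizing margin violations through gamma * sum(xi).
    objective = cp.Minimize(0.5 * cp.sum_squares(beta) + gamma * cp.sum(xi))
    constraints = [cp.multiply(y, X @ beta + beta0) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()

    print("beta:", beta.value, " beta0:", beta0.value)
    print("violations (xi > 1e-6):", int(np.sum(xi.value > 1e-6)))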
11.3.3. The KKT Conditions
The KKT conditions play a critical role in solving this problem. There are four groups of these conditions, each color-coded on the slide above. Let’s go through them one by one.
- The first group comes from taking the derivatives of the Lagrangian function with respect to $\beta$, $\beta_0$, and $\xi_i$ and setting them to zero. These are similar to the hard-margin case.
- The second group enforces that all Lagrange multipliers ($\lambda_i$ and $\mu_i$) must be non-negative.
- The third group simply restates the primal constraints.
- The fourth group involves complementary slackness: a multiplier ($\lambda_i$ or $\mu_i$) and the slack in its corresponding constraint cannot both be nonzero simultaneously.

The four groups are written out below.
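For reference, here is one standard way to write all four groups for the soft-margin problem above, using $\lambda_i$ for the multipliers on the margin constraints and $\mu_i$ for those on $\xi_i \ge 0$ (the symbols and ordering on the slide may differ):

$$
\begin{aligned}
&\text{Stationarity:} && \beta = \sum_{i=1}^n \lambda_i y_i x_i, \qquad \sum_{i=1}^n \lambda_i y_i = 0, \qquad \gamma - \lambda_i - \mu_i = 0,\\
&\text{Non-negativity:} && \lambda_i \ge 0, \qquad \mu_i \ge 0,\\
&\text{Primal feasibility:} && y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\\
&\text{Complementary slackness:} && \lambda_i \big[\, y_i(x_i^T \beta + \beta_0) - 1 + \xi_i \,\big] = 0, \qquad \mu_i\, \xi_i = 0.
\end{aligned}
$$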
Once these KKT conditions are satisfied, we can solve the dual problem. The dual is obtained by minimizing the Lagrangian function with respect to the primal variables $\beta$, $\beta_0$, and $\xi$, and then maximizing the result over the Lagrange multipliers.
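Substituting the stationarity conditions back into the Lagrangian gives the dual in its usual form, stated here for reference:

$$
\max_{\lambda}\ \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j
\quad \text{subject to} \quad 0 \le \lambda_i \le \gamma, \qquad \sum_{i=1}^n \lambda_i y_i = 0.
$$

Compared with the hard-margin dual, the only change is the upper bound $\lambda_i \le \gamma$, which follows from $\gamma - \lambda_i - \mu_i = 0$ together with $\mu_i \ge 0$.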
Support Vectors
Support vectors are crucial in SVM, and they are the data points with nonzero $\lambda_i$. The KKT conditions sort the data into three cases:
- Data points strictly on the correct side of the dashed line, where $y_i(x_i^T \beta + \beta_0) > 1$: complementary slackness forces $\lambda_i = 0$, making them non-support vectors.
- Data points on the dashed line, where $y_i(x_i^T \beta + \beta_0) = 1$: here $\lambda_i$ could be nonzero, indicating potential support vectors.
- Data points on the wrong side of the dashed line, where $\xi_i > 0$: then $\mu_i = 0$, making $\lambda_i = \gamma$ nonzero, and they are support vectors.
In summary, this soft margin formulation leads to a solution that depends only on support vectors, which include data points on the dashed lines and those on the wrong side of the dashed lines. For this particular example, there are four support vectors.
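In code, one can inspect the fitted multipliers directly; the sketch below uses scikit-learn's SVC, whose C argument plays the role of $\gamma$ and whose dual_coef_ attribute stores $y_i \lambda_i$ for the support vectors (the toy data are an assumption):

    import numpy as np
    from sklearn.svm import SVC

    # Toy non-separable data with labels in {-1, +1}.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
    y = np.hstack([-np.ones(30), np.ones(30)])

    cost = 1.0                                # the gamma of these notes, C in scikit-learn
    clf = SVC(kernel="linear", C=cost).fit(X, y)

    lam = np.abs(clf.dual_coef_.ravel())      # lambda_i for each support vector
    print("number of support vectors:", len(clf.support_))

    # 0 < lambda_i < gamma: support vectors sitting exactly on the dashed lines;
    # lambda_i == gamma: support vectors inside the margin or on the wrong side.
    on_margin = int(np.sum((lam > 1e-8) & (lam < cost - 1e-8)))
    violators = int(np.sum(np.isclose(lam, cost)))
    print("on the margin:", on_margin, " inside or violating:", violators)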
11.3.4. The Gamma Parameter
An additional parameter in the soft-margin formulation is $\gamma$, the weight attached to the total error $\sum_{i=1}^n \xi_i$ in the objective.
Gamma’s value is specified by the user or can be chosen through cross-validation. It dictates our preference for a wider margin versus reducing classification errors. The larger the value of $\gamma$, the more heavily errors are penalized, which yields a narrower margin with fewer violations; a smaller $\gamma$ tolerates more violations in exchange for a wider margin.

In the soft-margin formulation, $\gamma$ also appears in the dual as the upper bound on the Lagrange multipliers, $0 \le \lambda_i \le \gamma$; letting $\gamma \to \infty$ leaves no tolerance for errors and, when the data are separable, recovers the hard-margin problem.
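A sketch of choosing this parameter by cross-validation with scikit-learn follows; note the naming clash: scikit-learn calls this cost parameter C, while its own gamma argument is an RBF kernel coefficient unrelated to the $\gamma$ above (the grid and data are assumptions):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    # Larger C penalizes slack more heavily (narrower margin, fewer violations);
    # smaller C favors a wider margin at the cost of more violations.
    search = GridSearchCV(
        SVC(kernel="linear"),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
        cv=5,
    )
    search.fit(X, y)
    print("best C:", search.best_params_["C"])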
11.3.5. Loss + Penalty
We can also view the primal problem as a special case of the Loss + Penalty framework. The hinge loss function, represented by the blue curve in the graph, measures how far the margin quantity $y_i f(x_i) = y_i(x_i^T \beta + \beta_0)$ of our classifier falls short of 1: the loss $[1 - y_i f(x_i)]_+$ is zero once $y_i f(x_i) \ge 1$ and grows linearly as the point moves further onto the wrong side.

In the context of SVM, we employ the hinge loss in place of the 0/1 misclassification loss: it is a convex upper bound on the 0/1 loss, so the resulting problem remains tractable while the loss still vanishes for points that are safely on the correct side of the margin.
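Written out under the formulation above, the equivalent Loss + Penalty form is

$$
\min_{\beta,\, \beta_0}\ \sum_{i=1}^n \big[\, 1 - y_i(x_i^T \beta + \beta_0) \,\big]_+ \;+\; \frac{1}{2\gamma} \|\beta\|^2 ,
$$

which is exactly the soft-margin primal with each $\xi_i$ replaced by its optimal value $[1 - y_i(x_i^T \beta + \beta_0)]_+$ and the whole objective divided by $\gamma$.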