12.4. AdaBoosting

Many of you are already familiar with how decision trees work, the underlying concepts, and the criteria used to split a tree. You can grow a decision tree, let it reach a certain size, and then apply pruning techniques to obtain your final classification model.

However, a single tree often doesn’t perform very well on its own. That’s where ensemble methods come into play, such as Random Forests, which build on the tree-based principles we’ve already covered. Another powerful ensemble method is boosting, which is the focus of this section.

Boosting, specifically AdaBoost, was introduced in the context of classification. We’ll explore what AdaBoost does and what we can infer about the final classifier from this boosting algorithm. It’s worth noting that AdaBoost can be viewed as a stagewise, gradient-style procedure that fits an additive model under an exponential loss function.
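
To make the loss-function view concrete, the objective behind this stagewise fitting can be written as follows (a standard formulation, with labels coded as \(\pm 1\) as in the next subsection):

\[\min_f \; \sum_{i=1}^n \exp \big ( - y_i f(x_i) \big ), \qquad f(x) = \sum_{t=1}^T \alpha_t g_t(x).\]

At stage \(t\), AdaBoost greedily adds the pair \((\alpha_t, g_t)\) while keeping the earlier terms fixed.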

12.4.1. The Algorithm

When discussing the AdaBoost algorithm, we need to switch to a notation where binary labels are represented as +1 or -1, rather than 1 and 0. The goal of AdaBoost is to combine weak classifiers, which are classifiers that don’t need to be exceptionally accurate, as long as they perform slightly better than random guessing. In fact, we can even use classifiers that perform worse than random guessing, meaning their error rate is greater than 50%, because we can simply flip their predictions to improve accuracy.

The core idea of the algorithm is to iteratively modify the weights assigned to the training data. Let’s outline the algorithm:

../_images/w12_adaboosting_alg.png

At each iteration, marked by \(t\), we select a weak classifier \(g_t\) (it need not be chosen optimally; it can even be picked arbitrarily), calculate its weighted training error \(\epsilon_t\), and determine a classifier weight \(\alpha_t\). Then, we update the data weights based on the classifier’s performance, increasing the weight of the points it misclassifies. The algorithm continues this process for a predefined number of iterations, and at the end, we sum the weighted classifiers, each scaled by \(\alpha_t\), to form the final classifier.

To make predictions on new data, we apply this final classifier and check the sign of the output: if it’s positive, we classify as +1, and if it’s negative, we classify as -1.
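
As a concrete illustration, below is a minimal Python sketch of these steps, assuming depth-1 decision trees (stumps) from scikit-learn as the weak learners and the usual weight formula \(\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}\). The function names adaboost_fit and adaboost_predict are ours for illustration, not part of any library.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; labels in y are assumed to be -1/+1."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                   # initial data weights w_i^(1) = 1/n
    learners, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # weak classifier g_t fit on weighted data
        pred = stump.predict(X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)  # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps) # classifier weight alpha_t
        w = w * np.exp(-alpha * y * pred)     # up-weight misclassified points
        w = w / np.sum(w)                     # normalize (divide by Z_t)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Final classifier: sign of the weighted vote sum_t alpha_t * g_t(x)."""
    score = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(score)

In practice, scikit-learn’s AdaBoostClassifier implements essentially this scheme for binary problems, so you would typically use it rather than rolling your own.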

12.4.2. Proof

Next we demonstrate that the training error can be driven to zero as the number of iterations \(T\) approaches infinity, even when the individual classifiers are weak.

To prove this, we derive an upper bound on the training error and show that it shrinks with each iteration. (The per-iteration errors \(\epsilon_t\) that appear below are computed on the weighted data, so they always lie between 0 and 1.) By upper-bounding the indicator function used to measure classification error,

\[I (z < 0) < e^{-z}, z \in \mathbb{R},\]

we can bound the training error by the exponential loss:

\[\begin{split}\text{Training-Err}(G_T) &= \sum_i \frac{1}{n} I \Big (y_i \ne \text{sign} \big ( \sum_{t=1}^T \alpha_t g_t(x_i) \big ) \Big ) \\ &= \sum_i \frac{1}{n} I \Big ( \sum_{t=1}^T y_i \alpha_t g_t(x_i) < 0 \Big ) \\ & \le \sum_i \frac{1}{n} \exp \Big ( - \sum_{t=1}^T \alpha_t y_i g_t(x_i) \Big ) \\ &\le \prod_{t=1}^{T} Z_t.\end{split}\]

The last inequality is in fact an equality; it follows from the weight-update rule \(w_i^{(t+1)} = w_i^{(t)} \exp \big ( - \alpha_t y_i g_t(x_i) \big ) / Z_t\), together with \(w_i^{(1)} = 1/n\) and the fact that the weights sum to one at every iteration:

\[\begin{split}& \sum_{i=1}^n \frac{1}{n} \exp \Big ( - \sum_{t=1}^T \alpha_t y_i g_t(x_i) \Big ) \\ =& \sum_i \frac{1}{n} \prod_{t=1}^T \exp \Big ( - \alpha_t y_i g_t(x_i) \Big) \\ =& \sum_i w^{(1)}_i \prod_{t=1}^T \frac{w_i^{(t+1)}}{w_i^{(t)}} Z_t \\ =& \sum_i w_i^{(1)} \frac{w_i^{(2)}}{w_i^{(1)}} \cdots \frac{w_i^{(T)}}{w_i^{(T-1)}}\frac{w_i^{(T+1)}}{w_i^{(T)}} \Big ( \prod_{t=1}^T Z_t \Big ) \\ =& \Big ( \prod_{t=1}^T Z_t \Big ) \sum_i w_i^{(T+1)} \\ =& \prod_{t=1}^T Z_t.\end{split}\]

Finally, plugging in \(\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}\), we conclude that each factor \(Z_t\) is strictly less than 1, so the upper bound on the training error decreases as \(T\) increases, provided that no classifier achieves an error rate exactly equal to 50%:

\[\begin{split}Z_t &= \sum_i w_i^{(t)} \exp \big (- \alpha_t y_i g_t(x_i) \big ) \\ &= \sum_{i: y_i g_t(x_i) =1} w_i^{(t)} \exp \big (- \alpha_t \big ) + \\ & \sum_{i: y_i g_t(x_i) = -1} w_i^{(t)} \exp \big ( \alpha_t \big ) \\ &=(1- \epsilon_t) \exp \big ( - \alpha_t \big ) + \epsilon_t \exp \big (\alpha_t \big ) \\ &= (1-\epsilon_t) \sqrt{\frac{\epsilon_t}{1-\epsilon_t} } + \epsilon_t \sqrt{\frac{1-\epsilon _t}{\epsilon_t}} \\ &= 2 \sqrt{\epsilon_t ( 1 - \epsilon_t)} \\ & < 1.\end{split}\]
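
For a concrete sense of the rate, suppose for illustration that every weak classifier attains \(\epsilon_t = 0.4\). Then each factor is \(Z_t = 2 \sqrt{0.4 \times 0.6} \approx 0.98\), so after \(T\) rounds the training error is at most roughly \(0.98^T\): about 0.13 after 100 rounds and about 0.002 after 300 rounds.
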
  • It’s important to note that the training error of the combined classifier \(G_T\) generated by AdaBoost is not guaranteed to be monotonically decreasing with respect to the number of iterations. Instead, after each iteration, AdaBoost decreases a specific upper bound on the 0/1 training error. Over the long run, this continual re-weighting pushes the training error towards zero; the numerical sketch after this list illustrates the gap between the error and its bound.

  • AdaBoost can utilize weak classifiers, denoted by \(g_t(x)\), even if their error rate is worse than random guessing (i.e., \(\epsilon_t > 1/2\)). In such cases, AdaBoost adapts by assigning a negative weight \(\alpha_t\), effectively using \(-g_t(x)\) to improve overall performance.

  • The final classifier obtained from AdaBoost may not perform well on test data, as AdaBoost aims to minimize training error, potentially leading to overfitting. Proper regularization or early stopping is often necessary to ensure good generalization to unseen data.
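
The first point above is easy to verify numerically. Below is a short sketch, assuming a synthetic dataset from scikit-learn’s make_classification purely for illustration, that tracks both the 0/1 training error of the running classifier and the bound \(\prod_{s \le t} Z_s\):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; recode the 0/1 labels as -1/+1.
X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y01 - 1

n = len(y)
w = np.full(n, 1.0 / n)      # data weights w_i^(t)
score = np.zeros(n)          # running sum_t alpha_t g_t(x_i)
bound = 1.0                  # running product of the Z_t
for t in range(1, 101):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - eps) / eps)
    unnormalized = w * np.exp(-alpha * y * pred)
    Z = unnormalized.sum()   # normalizing constant Z_t
    w = unnormalized / Z
    score += alpha * pred
    bound *= Z
    train_err = np.mean(np.sign(score) != y)
    if t % 20 == 0:
        # train_err may wiggle up and down, but it never exceeds `bound`,
        # and `bound` shrinks geometrically toward zero.
        print(f"t={t:3d}  training error={train_err:.3f}  bound={bound:.3f}")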

In summary, AdaBoost is a simple yet powerful boosting algorithm that combines weak classifiers to achieve impressive results in classification tasks. It may not always perform well on test data, so careful consideration and fine-tuning are essential when applying it to real-world problems.