Often, a team performs much better than individuals. Each member in the team can contribute a unique aspect to the final product. Each provides a different view point and if one can productively combine each of these, we can end up with a wonderful solution.
Ensemble Learning is based on this concept. Instead of using just one algorithm to come up with just one model, we can use a combination of multiple types of algorithms and models. If we can do this correctly, a lot of the common problems of learning models can be addressed very easily.
The implementation for Ensemble learning is quite intuitive. We train and use multiple models. We combine their outputs using some specific method to get the final output. Of course, we have left a lot of open space out here. There are many different ways we can train multiple models. And there are many different ways to combine their outputs. Depending upon the requirement and the data and the computational power at hand, we can select some of these techniques. Let have a look at them, starting with the simplest ones.
Voting is typically used for classification problems. As one can guess, here, the outcome of all the different trained model is picked up and the decision of the majority is considered correct. For Voting, we need to train each model independently. Once we have trained each model, we can put them together to form the ensemble.
Voting is very good for classification problems. But when it comes to regression, we can never get anything meaningful from voting. Here we use averaging.
Again, we train each model independently and once the individual models are ready, we can put them together to form an ensemble. The output of each model is picked individually and the final prediction is the average of all the individual models.
Now, we can have a problem here - in voting as well as averaging. It is quite possible that some of the individual models are intrinsically more accurate than the others. In such a case, we have to be careful that the ensemble does not degrade them. The whole purpose is to add value rather than degrading individual models. With this in mind, we can assign weights to individual models.
In order to address such a problem, we can use weights when we identify the average or votes. In any ensemble, we can have some models that we trust more than the others. In this case, we can assign higher weights to these trusted models.
Blending and Stacking are two different techniques based on similar principles. The techniques we saw above, created an ensemble layer where each model contributes directly to the output prediction. An ensemble can also contain multiple layers - where the output of one layer is the input to the next layer. This way, each layer can add to what the previous layers did.
We can do this technique to implement such layers. Note that there are many ways of implementing layers. But Blending/Stacking typically do this. To start with, we train a few different models based on the data we have. Then we try to predict the outcome by combining these models. Of course, there is a gap. The predicted Y is similar and related to the actual Y, but we do not know the relation. Hence the problem.
In order to identify this relation, we build another layer that takes the prediction of the first layer as the input features, and we train it to get the actual Y as output. Naturally, this generates an output that is a lot more closer to the actual Y.
There is a difference between blending and stacking in the way they train the individual models and the layers. In Blending, we split the available data into three sets - training/validation/test sets. As one would want, we use the training set to build the model and then the validation and test sets to evaluate each model and then the entire layer as a combination of these individual models. Beyond this, we feed the Y predicted out of the validation and test sets as an input to the next layer... and so on.
Note here that the amount of data available for training reduces for each layer. This is avoided in the Stacking technique. Here, we split the available data into N (say 10) parts. Then we train the all the models in the first layer with 9 of these 10 parts and then validate it using the tenth. We do this 10 times, each time, dropping one of these sets. Thus, we have a model that is trained 10 times using 9 of these sets at a time.
Naturally, Stacking is much more efficient in using the available data, but it looses out on processing efficiency.
So far, we have looked into ensembles of different types of models - that bring in the strengths of different model; in order to get something much better. But what about building an ensemble of many instances of models of the same type? For example, can we club together several instances of Logistic Regression models? Will that help?
If we think of this in detail, we can see that all of these models - trained with identical data would be identical (ignoring the minor effect of random initialization). Then what's the point in inviting all the overheads of creating an ensemble out of these? None! True that there is no use of creating an ensemble of identical models.
But, if we can somehow get the required disparity in each of these models, it would be a lot better in the training and also the performance. There are two essential ways of doing this - splitting the data and splitting the features.
We can split the data into multiple sets and train each individual model based on a small subset of the data. All these models can then form a meaningful ensemble - because each model has a unique contribution of its own. Or we can split the data into different sets of features and feed each model using a different set of features. Either approach has its own advantages.
One might be anxious about splitting the data into different subsets. Can this lead to overfitting on each of those models? Quite possible. But even if we have an overfitting on individual model, we can expect that each individual model will have a different kind of overfitting - leading to a good outcome. Also, the fact is that we go in for ensembles when we know that individual models are not enough to hold all the data we have. When we know that training each model with all the data is not going to help it. In such a case, splitting the data really makes sense.
Among the features, we often have some features that dominate over the others. In such a case, the suppressed features are really not valued enough to be able to make a meaningful contribution to the model. If we train different models over different combinations of subsets, it is much more likely that these suppressed features get a chance to show up and do their job.
This is another interesting ensemble technique. To understand this technique, we should first try to understand why some data points are missed in a classification.
To do this, let us consider points in a two dimensional space that we want to classify using a straight line. Now, this is bound to fail if the points are mingled together. A high order curve may be able to separate them, but that would lead to overfitting. Thus, we always have a tradeoff between loss in accuracy and the risk of overfitting - when we work with a single model.
Boosting tries to fix this problem. It works by assigning weights to data points as it trains the models. We start by training the first model with equal weights to all the data points. Once we are done, we evaluate the model to identify the data points that are not doing well in the classification. Now, we give higher weights to these points as we train the second model.
This goes on and we create multiple models, each attempting to correct the errors of the previous models. When we combine these together (with appropriate weights), we get a model that is a lot more richer than each individual model. As in the example above, if we have one dimensional models in two dimensional space, we would end up with many single dimensional models - many straight lines that combine together to form a curve. This takes care of avoiding overfitting while increasing the accuracy.
The individual models are called weak learners, while the combined model is called the strong learner.
The above techniques provide us the basic principles for using ensembles. There are many specific learning algorithms that implement these techniques to improve the efficiency and accuracy. The most important ones are
Each of these applies the essential principles in a unique way to improve on some of the primary algorithms.