Decision Tree is an interesting concept that mimics a very common way our mind approaches a classification problem.

Suppose our training data set has n features, we can take up one feature at a time and classify data elements of that feature. The two nodes thus obtained can be classified further based on the remaining features. Thus, we get a tree structure where each node representing a given data feature. As we perform this classification, we expect to reach a state where each node has true on one side and false on the other - thus completing our model training.

Consider for example, our old breast cancer data set. We can classify the data on one feature at a time. Say we start with the size. We can define a threshold and anything less than the threshold goes to the lower node and anything higher than the threshold goes to the upper node. Then we look at the weight. And so on.. Over time, we have a Tree of nodes - each classifying the data based on a threshold of a given feature. If our thresholds are wisely chosen, then we can expect that each leaf node of a well trained tree will be able to correctly classify into a positive or negative.

Of course, we have two hyperparameters going into our assumption - the choice of order of features and the "wisely chosen" thresholds. There are several algorithms for identifying these. But, once they are chosen, the classification is not a difficult task.

The essential concept of decision tree is based on splitting the tree on the appropriate features at appropriate thresholds. But identifying these features and thresholds is very important. And overfitting, of course is the single dominant evil in any machine learning scenario. Researchers have identified several ways around these problems. Below is an introduction to the most important ones.

When we have different algorithms for the same task, a natural question in any mind is - how do they compare? Which one is better? Which one should I use?

Well, the fact is that both are good at their own set of problems. There are some problems that fit much better in a linear model and there are some others that fit much better in a tree model. Intuitively, we can say that if the correlation between the input features and the output is simple and linear (in the sense that one increases/decreases uniformly with the other), then a Linear model would work much better. But if the correlation is pretty complex and not linear, then a Tree model has a better chance of working out.

Also, compared to Linear models, a Tree model is a lot easier to grasp intuitively. So, if you need humans to understand the model, then a Tree model is far better.

The first step to preparing for the Decision Tree is to identify the right set of features and split. Below are some of the important techniques to help you with that.

This works with categorical target variable, only Binary splits. The Gini Index is based on the concept of Purity of Population. A Population is pure if the probability of two random samples belonging to the same class is 1. You should identify the Gini index for different possible splits. The one with highest Gini Index should be the starting point.

This is another way of identifying the more significant component. Chi-Square is computed from sum of squares of standardized differences between observed and expected frequencies of target variable. Specifically, the Chi-Square value is calculated as

`Chi-Square = ((Actual Positive Count - Expected Positive Count)^2 / (Expected Positive Count)) ^ (1/2)`

A low value of Chi-Square suggests that the Actual and Expected counts are similar. That means, the component hardly impacts the decision - the output is statistical. High difference between the Actual and Expected counts means that the component has enough influence to drive the decision significantly away from its statistical value. This more significant component should be used to start the decision.

Entropy has always been the known distraction in any walk of life. Most of the processes are targeted towards minimizing the entropy. In terms of information theory, higher the entropy, lower is the information. Most of our computational effort is in an attempt to gain information - so we try to minimize the entropy. The entropy of any split is computed as

`entropy = - p * (log2 q) - q * (log2 p)`

Where p and q are the probability of success and failure on that node. If both are 0.5 - meaning the node is redundant - the entropy is 1. On the other hand, if both are close to 0 - meaning we have almost no false negatives or false positives, this is 'the' decision node - the entropy is very low. Surely, it should be the first choice

Next, we look at the techniques for avoiding over fitting. When training a decision tree, it is possible that the tree grows huge to the extent that there is a node for each data element - or perhaps even more. In such a case, we have 100% accuracy for the training data and of course, it fails miserably outside the set. Thus, the fundamental reason for over fitting is the tree growing too big. The below techniques avoid this situation to in order to avoid over fitting. Logically, there are two ways to do this - don't let the tree grow and cut it down. That is what the below techniques do.

The tree can be constrained by defining individual constraints on its growth. This is done by defining limits on the following:

Minimum samples for a node split

- Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
- Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- Too high values can lead to under-fitting hence, it should be tuned using CV.

Minimum samples for a terminal node (leaf)

- Defines the minimum samples (or observations) required in a terminal node or leaf.
- Used to control over-fitting similar to min_samples_split.
- Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.

Maximum depth of tree (vertical depth)

- The maximum depth of a tree.
- Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
- Should be tuned using CV.

Maximum number of terminal nodes

- The maximum number of terminal nodes or leaves in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of ānā would produce a maximum of 2^n leaves.

Maximum features to consider for split

- The number of features to consider while searching for a best split. These will be randomly selected.
- As a thumb-rule, square root of the total number of features works great but we should check upto 30-40% of the total number of features.
- Higher values can lead to over-fitting but depends on case to case.

Constraining the tree works reactively - while training. That means, the tree is constrained without checking up the full data. This may be quicker, but mathematically, this is not the best way. Pruning on the other hand lets the tree grow really huge. Then tries to reduce the tree by pruning low performing leaves - bottom up. Thus, the process is a lot more consistent and results in a much better model.

Imperfection manifests in any model as Bias and Variance. These are two metrics that show how good or bad is your model. In simple terms, Bias tells us if the entire set is offset from what it is meant to be. On the other hand, Variance tells us how good or bad is the compliance within the data set. It is possible that the average of the entire set matches what it is supposed to be.

But, that is not enough if individual points do not match. This is defined by the variance. On the other hand, it is possible that none of the points match, but all of them are shifted by a constant value. This would be denoted by high bias and low variance. A small tree has high Bias and low Variance. On the other hand, a big tree tends to produce low Bias but high Variance. Naturally, we would like to find the minimum value in this U shaped error curve.

Bagging is one of the techniques used to reduce the variance of our predictions. Essentially it combines multiple classifier models trained on different sub-samples of the given data set. The training data set is sampled multiple times to get multiple subsets. Once this is available, each is used to train a model independently. The models are then merged to get the final model. Thus, the steps for bagging are:

Create Multiple DataSets:

- Sampling is done with replacement on the original data and new datasets are formed.
- The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model
- Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting

Build Multiple Classifiers:

- Classifiers are built on each data set.
- Generally the same classifier is modeled on each data set and predictions are made.

Combine Classifiers:

- The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
- The combined values are generally more robust than a single model.

There are many different implementations of this concept of Bagging. Random Forest is one of them