
DECISION TREES

OVERVIEW

Decision trees are a type of machine learning algorithm used for both classification and regression tasks. They represent a flowchart-like structure where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents a class label or a numerical value.

The decision tree algorithm works by recursively partitioning the training data into smaller subsets based on the values of different input features until the subsets become as homogeneous as possible with respect to the target variable. The tree is constructed in a top-down manner, starting from the root node, where the feature that provides the maximum information gain is selected as the first decision point.

The process of constructing a Decision Tree involves the following steps:

 

  • Select a feature: The algorithm starts by selecting the feature that best splits the data into two or more subsets, such that each subset contains data points that are as homogeneous as possible with respect to the target variable.

  • Define a split: The algorithm then defines a split based on the selected feature, such that data points with a value greater than or equal to the split value go to one child node, and data points with a value less than the split value go to the other child node.

  • Recurse: The algorithm then recursively applies the same procedure to each child node, selecting a new feature and defining a new split, until a stopping criterion is met, such as a maximum tree depth, a minimum number of data points in a leaf node, or a minimum required reduction in impurity.

  • Assign labels: Once the tree is constructed, each leaf node is assigned a label based on the majority class (for classification) or the mean value of the target variable (for regression) in that node.


To make a prediction using a decision tree, we start at the root node and follow the branches of the tree based on the values of the input features until we reach a leaf node. The label assigned to that leaf node is then used as the prediction for the input data point.
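
As an illustration of this fit-then-traverse workflow, the following is a minimal sketch (written for this page, not the project's own code) that trains a classification tree with scikit-learn on the tiny age/income example used later in this section and then follows the learned splits to predict a label for a new point:

    # A minimal sketch using the toy age/income data from the example below.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features: [age, income]; target: bought the product (1) or not (0).
    X = [[25, 35000], [30, 50000], [35, 22000], [40, 70000], [45, 90000]]
    y = [0, 1, 0, 1, 1]

    # Grow a tree by recursively choosing the split with the largest impurity reduction.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X, y)

    # Prediction walks from the root to a leaf and returns that leaf's majority label.
    print(tree.predict([[38, 60000]]))                          # e.g. array([1])
    print(export_text(tree, feature_names=["age", "income"]))   # the learned rules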

GINI, ENTROPY & INFORMATION GAIN

Gini impurity, entropy, and information gain are commonly used metrics in decision tree algorithms for selecting the best split at each node of the tree.
 

  • Gini:
    The Gini index is a measure of impurity, or randomness, in a dataset. It is calculated by summing the squared probabilities of each class and subtracting the result from 1, so it equals 0 for a completely pure node and grows toward 1 as the classes become more evenly mixed (for a binary problem its maximum is 0.5). In a decision tree, the split that results in the lowest weighted Gini index of the child nodes is chosen as the best split.

  • Entropy:
    Entropy is another measure of impurity used in decision trees. It is calculated by summing, over all classes, each class probability multiplied by the negative base-2 logarithm of that probability: Entropy = - sum_i p_i log2(p_i). It equals 0 for a completely pure node and reaches its maximum (1 for a binary problem, log2(k) for k classes) when the classes are evenly mixed. In a decision tree, the split that results in the lowest weighted entropy of the child nodes is chosen as the best split.

  • Information Gain:
    Information gain is a measure of the reduction in entropy (or Gini index) that results from splitting a dataset on a particular feature. It is calculated by subtracting the weighted average of the entropy (or Gini index) of the child nodes from the entropy (or Gini index) of the parent node. The feature and split that yield the highest information gain are chosen at each node of the decision tree.


These metrics matter because they guide the decision tree algorithm in selecting the optimal feature and split at each level of the tree. By choosing, at every node, the split that most reduces impurity, the algorithm increases the homogeneity of the resulting subsets while keeping the tree simple and avoiding overfitting, which helps the tree generalize to new, unseen data.

In summary, Gini impurity, entropy, and information gain are important metrics in decision tree algorithms because they allow the algorithm to evaluate the quality of a candidate split and to select the best feature for the next level of the tree. By selecting the optimal feature and split at each node, the algorithm can build a tree that makes accurate predictions on new data.
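
To make these definitions concrete, here is a small, self-contained sketch in plain Python (written for illustration here, not taken from the project's code) that computes Gini impurity, entropy, and the information gain of a candidate split from lists of class labels:

    from collections import Counter
    from math import log2

    def gini(labels):
        """Gini impurity: 1 minus the sum of squared class probabilities."""
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    def entropy(labels):
        """Entropy: -sum(p * log2(p)) over the classes present in labels."""
        n = len(labels)
        return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

    def information_gain(parent, children, impurity=entropy):
        """Parent impurity minus the size-weighted average impurity of the children."""
        n = len(parent)
        weighted = sum(len(child) / n * impurity(child) for child in children)
        return impurity(parent) - weighted

    # Labels from the worked example below: 2 "No" and 3 "Yes".
    labels = ["No", "Yes", "No", "Yes", "Yes"]
    print(round(gini(labels), 3))      # 0.48
    print(round(entropy(labels), 3))   # 0.971
    print(round(information_gain(labels, [["No", "Yes", "No"], ["Yes", "Yes"]]), 3))  # 0.42

Note that the split with the lowest weighted child impurity is exactly the split with the highest information gain, so either view leads to the same choice.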


Figure 1 & 2 - GINI, Entropy and Information Gain

EXAMPLE

Let's consider a binary classification problem where we want to predict whether a customer will buy a product or not based on their age and income. We have the following dataset:

Age    Income    Bought
25     35,000    No
30     50,000    Yes
35     22,000    No
40     70,000    Yes
45     90,000    Yes

Let's assume that we want to build a decision tree to predict whether a customer will buy a product or not based on their age and income. We can use entropy and information gain to determine the best feature to split the data on.

 


First, let's calculate the entropy of the initial dataset:

Entropy(S) = - (2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971

Now, let's calculate the information gain for splitting on age, using the split Age < 40, which separates {25, 30, 35} from {40, 45}:

Information Gain(S, Age) = Entropy(S) - [(3/5) * Entropy({25, 30, 35}) + (2/5) * Entropy({40, 45})]

Entropy({25, 30, 35}): the labels are {No, Yes, No}, so Entropy = - (2/3) log2 (2/3) - (1/3) log2 (1/3) = 0.918
Entropy({40, 45}): the labels are {Yes, Yes}, so Entropy = - (1) log2 (1) = 0

Information Gain(S, Age) = 0.971 - [(3/5) * 0.918 + (2/5) * 0] = 0.420

Next, let's calculate the information gain for splitting on income, using the split Income < 70,000, which separates {35,000; 50,000; 22,000} from {70,000; 90,000}:

Information Gain(S, Income) = Entropy(S) - [(3/5) * Entropy({35,000; 50,000; 22,000}) + (2/5) * Entropy({70,000; 90,000})]

Entropy({35,000; 50,000; 22,000}): the labels are {No, Yes, No}, so Entropy = 0.918
Entropy({70,000; 90,000}): the labels are {Yes, Yes}, so Entropy = 0

Information Gain(S, Income) = 0.971 - [(3/5) * 0.918 + (2/5) * 0] = 0.420

In this case, the information gain for splitting on age and income is the same, because both splits separate the data into exactly the same two groups of customers, so we can choose either feature to split the data on. If there were a difference, we would choose the feature with the higher information gain as the best feature to split the data on.
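
These numbers can be double-checked with a few lines of Python (a quick verification sketch, independent of any project code):

    from math import log2

    def entropy(pos, neg):
        """Entropy of a node containing pos positive and neg negative labels."""
        total = pos + neg
        result = 0.0
        for count in (pos, neg):
            if count:                      # treat 0 * log2(0) as 0
                p = count / total
                result -= p * log2(p)
        return result

    parent = entropy(3, 2)                 # 3 "Yes", 2 "No"   -> 0.971
    left   = entropy(1, 2)                 # ages 25, 30, 35   -> 0.918
    right  = entropy(2, 0)                 # ages 40, 45       -> 0.0

    gain = parent - (3/5 * left + 2/5 * right)
    print(round(parent, 3), round(gain, 3))   # 0.971 0.42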

APPLICATIONS AND MORE

Decision trees are easy to interpret and visualize, making them useful for explaining the reasoning behind the predictions they produce. They can also handle a mixture of categorical and numerical features and are robust to missing data. However, decision trees can suffer from overfitting if not properly pruned, and they may not generalize well to new data. To overcome these issues, ensemble methods such as Random Forest and Boosted Trees are often used to improve the performance of decision trees.
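
To illustrate that last point, here is a hedged sketch (using synthetic data rather than this project's dataset) that compares the cross-validated accuracy of a single pruned tree with that of a random forest; on many datasets the ensemble scores higher:

    # Illustrative comparison on synthetic data (not the soccer dataset used here).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=15, random_state=0)

    single_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0)

    # 5-fold cross-validated accuracy for each model.
    print("tree  :", cross_val_score(single_tree, X, y, cv=5).mean())
    print("forest:", cross_val_score(forest, X, y, cv=5).mean())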

Decision trees can be used for a variety of tasks in machine learning, including:

 

  • Classification: Decision trees can be used to classify data into different categories or classes. For example, a decision tree can be used to classify whether an email is spam or not based on different features such as the presence of certain words, the sender's address, etc.


  • Regression: Decision trees can also be used for regression tasks where the goal is to predict a continuous numerical value. For example, a decision tree can be used to predict the price of a house based on its size, location, etc.


  • Feature selection: Decision trees can be used to identify the most important features in a dataset. By analyzing the structure of the tree and the importance of each feature in making the decision, we can determine which features are most relevant to the task at hand.


  • Anomaly detection: Decision trees can be used to identify unusual or anomalous data points that do not fit the typical patterns in the dataset.


  • Decision making: Decision trees can be used to support decision making in various domains such as finance, healthcare, and marketing. For example, a decision tree can be used to help a doctor diagnose a patient based on their symptoms and medical history.


It is generally possible to create an infinite number of decision trees because there is no limit to the number of ways that a dataset can be split into subsets based on different features. In a decision tree algorithm, the goal is to create a tree that accurately predicts the target variable on new, unseen data. However, there are many different ways that a tree can be built to achieve this goal, and different trees may have different levels of accuracy or generalization performance.

One factor that contributes to this potential is that decision trees can be built using different splitting criteria and algorithms. For example, we could use the Gini index or entropy to measure the impurity of a dataset, or we could grow many different trees from the same data by randomizing the features and samples considered at each split (as random forests do) or by fitting trees sequentially to the errors of earlier ones (as gradient boosting does). Each of these approaches can result in a different decision tree.

Furthermore, the structure of the decision tree itself is determined by the order in which the features are selected and how the data is split at each node. There are many possible orders and splits that could result in a decision tree, so the potential for an infinite number of trees exists.

While it is possible to create an infinite number of decision trees, not all of them will be useful or effective for a particular problem. The goal of a decision tree algorithm is to find the tree that achieves the best performance on the given dataset, while also being able to generalize well to new, unseen data. This is typically achieved through techniques like pruning and cross-validation, which help to select the best tree from a large set of possible trees.
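
As a sketch of how such a selection can be carried out in practice (hypothetical parameter values, not this project's exact procedure), the maximum depth and the cost-complexity pruning strength of a scikit-learn tree can be chosen by cross-validation:

    # Illustrative sketch: selecting one tree out of many candidates via cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=15, random_state=0)

    param_grid = {
        "max_depth": [2, 3, 4, 5, None],   # how deep the tree may grow
        "ccp_alpha": [0.0, 0.001, 0.01],   # cost-complexity pruning strength
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_)   # the depth/pruning combination that generalized best
    print(search.best_score_)    # its mean cross-validated accuracy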


Figure 3- Decision Tree Example | Do I want a donut?

DATA PREP AND CODE

Labeled data is necessary for supervised learning because it allows the model to learn the relationship between the input and output variables. In supervised learning, we train the model on labeled data, where each data point is associated with a label or target variable. This means that we know the expected output or response variable for each input or feature set.

The model uses the labeled data to learn the underlying patterns or relationships between the input and output variables. It then uses this knowledge to make predictions on new, unseen data. Without labeled data, the model cannot learn the relationship between the input and output variables, and hence cannot make accurate predictions on new data.

The goal of supervised learning is to learn the underlying patterns or relationships between the input and output variables so that we can make accurate predictions on new, unseen data.

 

Figures 4 and 5 below give an idea of the raw data available to us for analysis and of its transformation into clean data.

We have five such datasets (Figure 4), one for each of the big five leagues in Europe. These raw datasets are not clean or formatted in a way that allows decision trees to be applied to them directly, so we dropped features, resolved data type mismatches, and modified column values to make the data suitable for modeling.

The final dataset (Figure 5) is a combination of the five cleaned league datasets. As can be seen, the combined dataset has dimensions 1078 x 15.
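
In code, the cleaning and merging described above could look roughly like the sketch below. The file names, column names, and dtype fixes are placeholders for illustration only, since the project's actual schema and scripts are not reproduced on this page:

    # Rough sketch of the cleaning/merging steps; all names below are placeholders.
    import pandas as pd

    league_files = ["premier_league.csv", "la_liga.csv", "serie_a.csv",
                    "bundesliga.csv", "ligue_1.csv"]             # one raw file per league

    cleaned = []
    for path in league_files:
        df = pd.read_csv(path)
        df = df.drop(columns=["Rk", "Notes"], errors="ignore")   # drop unneeded features
        df["Age"] = pd.to_numeric(df["Age"], errors="coerce")    # fix data type mismatches
        df = df.dropna()                                         # drop rows left unusable
        cleaned.append(df)

    # Stack the five cleaned league tables into one modeling dataset.
    data = pd.concat(cleaned, ignore_index=True)
    print(data.shape)   # for this project the final cleaned dataset was 1078 x 15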


Now that our dataset is ready, we need to split it into training and testing datasets.

Splitting data into training and testing sets is necessary in supervised learning to evaluate the performance of the model. The purpose of training a model is to make it learn from the given data so that it can make accurate predictions on unseen data. However, if the model is overfitted, it will perform well on the training data but poorly on the testing data, which defeats the purpose of creating a model in the first place.

To avoid overfitting, the data is split into two sets: the training set and the testing set. The model is trained on the training set and evaluated on the testing set. This way, the model can be tested on data it has not seen before, and the performance on the testing set can be used to estimate the performance of the model on new, unseen data.

Splitting data into training and testing sets is also useful for hyperparameter tuning. Hyperparameters are settings that are chosen before training a model, and they can significantly impact the performance of the model. By testing the model on the testing set, different hyperparameters can be evaluated, and the ones that lead to the best performance can be selected.
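
A minimal sketch of such a split with scikit-learn is shown below; the placeholder data frame merely stands in for the cleaned 1078 x 15 soccer dataset, whose real column names are not listed here:

    # Illustrative sketch: a disjoint, reproducible train/test split.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Placeholder frame standing in for the cleaned soccer dataset.
    data = pd.DataFrame({"feature_a": range(100),
                         "feature_b": range(100, 200),
                         "target": [0, 1] * 50})

    X = data.drop(columns=["target"])
    y = data["target"]

    # Hold out 20% of the rows as a test set the model never sees during training;
    # stratifying keeps the class proportions similar in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    print(X_train.shape, X_test.shape)   # (80, 2) (20, 2)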


Creating a disjoint split between the training and test sets is important for several reasons:

  • Preventing overfitting: When a model is trained on a dataset, it may learn to memorize the specific data points and relationships within that dataset rather than learning more generalizable patterns. This leads to overfitting, where the model performs well on the training data but poorly on new data. By keeping the test set disjoint from the training set, we can evaluate the model's ability to generalize to new data and detect overfitting.

  • Evaluating model performance: When testing a model, we want to know how well it will perform on new, unseen data. Evaluating on a disjoint test set gives a more accurate estimate of how the model will perform in the real world.

  • Improving model selection: When comparing the performance of different models, we want to ensure that they are evaluated on the same data. A disjoint split guarantees that all models are evaluated on the same held-out test set, which allows for a fair comparison of their performance.

Overall, a disjoint train/test split is essential for evaluating and comparing machine learning models, as it tests their ability to generalize to new data and guards against overfitting.


Figure 4 - Raw data before transformation

Figure 5 - Cleaned Data


Figure 6 - Training dataset

Figure 7 - Testing Dataset

RESULTS

The results of decision trees can provide valuable insights into the relationships between variables in a dataset and how they contribute to predicting a target variable. One of the key metrics used to evaluate the performance of a decision tree model is accuracy, which measures the proportion of correctly predicted instances in the dataset.


For the dataset at hand, we created three separate decision trees to gain insights into different features of the data. The three decision trees are as follows:

  • The first tree is a measure of overall team performance based on expected goals scored over a game. To simplify, the outcome is either Positive or Negative.

  • The second and third trees classify squads as below or above the global average on a specific feature of importance in the dataset: Decision Tree 2 uses attacking touches made in the opposition half and classifies squads as either below or above the global average.

  • Decision Tree 3 uses possession carries in the opposition's defensive area (the final third and the penalty box) and classifies squads as either below or above the global average.

Now, let us have a look at the performance of the decision tree algorithm.


Decision Tree 1 (Figure 8 - Decision Tree 1 Statistics)

The overall accuracy of the model is 0.75, indicating that the model correctly predicted the target variable in about 75% of the cases. The 95% confidence interval for the accuracy is (0.6801, 0.8114), which gives a range of plausible values for the model's true accuracy. The confusion matrix shows that the model correctly predicted 117 instances of the negative class but incorrectly predicted 29 instances of the negative class as positive. Similarly, the model correctly predicted 18 instances of the positive class but incorrectly predicted 16 instances of the positive class as negative.

The sensitivity of the model indicates that it is good at detecting positive instances, while the specificity indicates that it is poor at detecting negative instances. The positive predictive value of 0.8014 indicates that when the model predicts a positive instance, it is correct about 80% of the time. The negative predictive value of 0.5294 indicates that when the model predicts a negative instance, it is correct about 53% of the time.

Decision Tree 2 (Figure 9 - Decision Tree 2 Statistics)

The overall accuracy of the model is 0.8333, indicating that the model correctly predicted the target variable in about 83% of the cases. The 95% confidence interval for the accuracy is (0.7362, 0.9058), which gives a range of plausible values for the model's true accuracy. The confusion matrix shows that the model correctly predicted 35 instances of the negative class but incorrectly predicted 10 instances of the negative class as positive. Similarly, the model correctly predicted 35 instances of the positive class but incorrectly predicted 4 instances of the positive class as negative.

The sensitivity of the model indicates that it is good at detecting positive instances, and the specificity indicates that it is reasonably good at detecting negative instances. The positive predictive value of 0.7778 indicates that when the model predicts a positive instance, it is correct about 78% of the time. The negative predictive value of 0.8974 indicates that when the model predicts a negative instance, it is correct about 90% of the time.

Decision Tree 3 (Figure 10 - Decision Tree 3 Statistics)

The overall accuracy of the model is 0.8732, indicating that the model correctly predicted the target variable in about 87% of the cases. The 95% confidence interval for the accuracy is (0.773, 0.9404), which gives a range of plausible values for the model's true accuracy. The confusion matrix shows that the model correctly predicted 38 instances of the negative class but incorrectly predicted 4 instances of the negative class as positive. Similarly, the model correctly predicted 24 instances of the positive class but incorrectly predicted 5 instances of the positive class as negative.

The sensitivity of the model indicates that it is good at detecting positive instances, and the specificity indicates that it detects negative instances about as well as it detects positive instances. The positive predictive value of 0.9048 indicates that when the model predicts a positive instance, it is correct about 90% of the time. The negative predictive value of 0.8276 indicates that when the model predicts a negative instance, it is correct about 83% of the time.

Figures 11 to 13 - Decision Tree Visualizations

These visualizations illustrate the different inferences that we could draw from the results of our models.
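
All of the statistics reported above (accuracy, sensitivity, specificity, and the positive and negative predictive values) can be derived directly from a confusion matrix. The sketch below computes them with scikit-learn from made-up labels; it is an illustration of the formulas, not the tooling that produced Figures 8 to 10:

    # Illustrative sketch: deriving the reported statistics from a confusion matrix.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # placeholder ground-truth labels
    y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]   # placeholder model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv         = tp / (tp + fp)   # positive predictive value (precision)
    npv         = tn / (tn + fn)   # negative predictive value

    print(accuracy, sensitivity, specificity, ppv, npv)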

CONCLUSION

In conclusion, the decision tree classifier was applied to our soccer dataset to predict three different quantities, each of which contributes to the overall success of a squad. The overall accuracy of the three decision tree models lies in the range of roughly 75% to 87%, which is decent.

However, the sensitivity for the positive class for Decision Tree 1 was low, suggesting that the model had difficulty correctly identifying the positive class, i.e., correctly predicting wins. The specificity of the same classifier was also low, suggesting a high rate of false positives, i.e., predicting wins when there were none.

We were able to determine certain factors that strongly affect the overall performance of a team, such as expected goals scored based on recent form per 90 minutes, how frequently a squad enters the opposition's final third, and whether a squad's ball possession is offensive or defensive in nature.

Despite these limitations, the decision tree model proved to be a useful tool in analyzing the soccer dataset, providing insights into the factors that are most important in determining the outcome of a match, such as goals scored and shots on target.

© 2023 by ssawhney.
