
NAIVE BAYES
OVERVIEW
Naive Bayes is a classification algorithm based on Bayes' theorem, so let us first look at the theorem itself.
 
Bayes' Theorem states that the conditional probability of an event A, given that another event B has occurred, is equal to the likelihood of B given A multiplied by the probability of A, divided by the probability of B.
 
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent of one another once the class is known. This rarely holds exactly in real data, which is why the method is called naive.
In simple terms, Naive Bayes assumes that the presence of a particular feature in a class is independent of the presence of any other feature in that class.
The algorithm works by calculating the probability of each class for a given set of features and then selecting the class with the highest probability as the predicted class. To do this, it needs to calculate the probability of each feature occurring in each class, as well as the probability of each class occurring in the dataset.

Bayes' Theorem Formula: P(A | B) = P(B | A) × P(A) / P(B)
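As a small worked illustration of the theorem, the sketch below computes a posterior probability for a spam example. All numbers are made up for illustration, and the function name `posterior` is hypothetical rather than taken from the project.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers: 20% of messages are spam; the word "offer"
# appears in 50% of spam messages and 5% of non-spam messages.
p_spam = 0.2
p_offer_given_spam = 0.5
p_offer_given_ham = 0.05

# P(B) comes from the law of total probability over both classes.
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

p_spam_given_offer = posterior(p_spam, p_offer_given_spam, p_offer)
print(round(p_spam_given_offer, 3))
```

Seeing the word raises the probability of spam from the 20% prior to roughly 71%, because the word is ten times more likely in spam than in non-spam under these assumed figures.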
MULTINOMIAL NAIVE BAYES
Multinomial Naive Bayes (NB) is a variant of the Naive Bayes algorithm that is commonly used for text classification problems, such as sentiment analysis, spam detection, and topic classification.
The "Multinomial" part of the name refers to the fact that this algorithm is designed to work with count data, such as word counts. In the Multinomial NB algorithm, each word in a message is treated as a separate feature, and the frequency of each word is counted for each class.
Multinomial NB is called "Naive" because it makes the simplifying assumption that the presence of each word in a message is independent of the presence of other words. This assumption allows the algorithm to calculate probabilities more easily, but it may not hold true in all cases.
Here's a simple explanation of how the Multinomial NB algorithm works:
 
- Training the model: The Multinomial NB algorithm is a supervised learning algorithm, which means that it needs to be trained on labeled data to learn how to make predictions. During the training phase, the algorithm analyzes the frequency of each word in each class of the training data. For example, if we're building a spam detection model, the algorithm will count how many times each word appears in the spam messages and how many times it appears in the non-spam messages. This is done for all the words in the training data.
 
- Calculating probabilities: Once the algorithm has analyzed the training data, it uses Bayes' theorem to calculate the probability of a new message belonging to a particular class (e.g. spam or non-spam). Bayes' theorem states that the probability of a hypothesis (in this case, a message belonging to a particular class) given the evidence is proportional to the prior probability of the hypothesis (the probability of that class in the training data) multiplied by the likelihood of the evidence (the probability of seeing the words in the message under that class, estimated from the word frequencies counted during training).
 
- Making predictions: To make a prediction for a new message, the algorithm calculates the probability of the message belonging to each class, and then selects the class with the highest probability as the prediction. For example, if the algorithm calculates that the probability of a new message being spam is 0.7 and the probability of it being non-spam is 0.3, it will predict that the message is spam.
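The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used for the project: the toy messages, the function names, and the choice of add-one smoothing (`alpha=1.0`, used so that no logarithm of zero occurs) are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Count words per class; return log-priors, log-likelihoods, vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # alpha is added to every word count (additive smoothing).
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(doc, log_prior, log_likelihood, vocab):
    """Return the class with the highest log-score, plus all scores."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in doc.lower().split():
            if w in vocab:  # words never seen in training are skipped
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get), scores


# Toy training data, invented for illustration.
train_docs = ["win money now", "free money offer",
              "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(train_docs, train_labels)
predicted, scores = predict("free money", *model)
print(predicted)
```

Scores are summed in log space rather than multiplied as raw probabilities, which avoids numerical underflow when a message contains many words.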

Figure 1 - Naive Bayes Classifier
SMOOTHING
Smoothing is required for Naive Bayes (NB) models to avoid zero probabilities, which can cause problems when making predictions.
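In NB models, the algorithm calculates the conditional probabilities of each feature (e.g. word) given each class based on the frequency of that feature in the training data. However, if a feature does not appear in the training data for a particular class, the conditional probability of that feature given that class will be zero. Because the score for a class is a product of these per-feature probabilities, a single zero wipes out the whole product, and the class is ruled out no matter how strongly the other features support it. This is a particular problem when a feature appears in the test data but not in the training data. Smoothing avoids it by adding a small constant to every count, so that no estimated probability is exactly zero.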
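One common way to do this is additive (Laplace) smoothing. The source does not say which smoothing scheme was used, so the sketch below is only an assumption about the general technique, with made-up counts and a hypothetical function name.

```python
def word_probability(count_in_class, total_words_in_class, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    alpha=0 gives the raw relative frequency; alpha=1 is Laplace smoothing.
    """
    return (count_in_class + alpha) / (total_words_in_class + alpha * vocab_size)


# Hypothetical class with 100 word occurrences and a 50-word vocabulary,
# in which the word of interest never appeared during training.
unsmoothed = word_probability(0, 100, 50, alpha=0.0)
smoothed = word_probability(0, 100, 50, alpha=1.0)

print(unsmoothed)          # exactly zero: this would veto the class
print(round(smoothed, 4))  # small but non-zero
```

With smoothing, an unseen word lowers the score for a class without forcing it to zero, so the remaining words in the message still influence the prediction.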

Figure 2 - Smoothing helps eliminate zero probability encounters
BERNOULLI NAIVE BAYES
Types
There are two other main types of Naive Bayes classification, which differ in the way they model the likelihood of the input features given the class label:
- Bernoulli Naive Bayes: It models the likelihood of each feature being present in an example of each class using a Bernoulli distribution, which assigns a probability of success (i.e., the feature being present) to each feature.
- Gaussian Naive Bayes: This type of Naive Bayes assumes that the input features are continuous (i.e., they can take on any real value). It models the likelihood of each feature as a normal (Gaussian) distribution with a mean and variance specific to each class. Like the other variants, it is a classifier and predicts a class label, not a continuous value.
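To make the Gaussian variant concrete, here is a minimal sketch that fits a per-class mean and variance for each feature and scores a new example. The two features (loosely, points per game and goal difference per game) and every value in `rows` are invented for illustration and are not taken from the project dataset; the function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def gaussian_log_pdf(x, mu, var):
    """Log-density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def fit_gaussian_nb(rows, labels):
    """Return {class: (prior, [(mean, variance) for each feature])}."""
    params = {}
    for c in set(labels):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        # A tiny constant keeps the variance strictly positive.
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*class_rows)]
        params[c] = (len(class_rows) / len(rows), stats)
    return params


def predict_gaussian_nb(row, params):
    """Pick the class with the highest log-prior plus summed log-likelihoods."""
    def score(c):
        prior, stats = params[c]
        return math.log(prior) + sum(
            gaussian_log_pdf(x, mu, var) for x, (mu, var) in zip(row, stats)
        )
    return max(params, key=score)


# Made-up training examples: [points per game, goal difference per game].
rows = [[2.1, 1.0], [2.3, 1.2], [1.9, 0.8],
        [0.8, -0.9], [0.9, -0.7], [0.7, -1.1]]
labels = ["UCL", "UCL", "UCL", "Relegated", "Relegated", "Relegated"]

params = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb([2.0, 0.9], params))
```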
Taking a deeper look into Bernoulli Naive Bayes:
Bernoulli Naive Bayes is another variant of the Naive Bayes algorithm that is commonly used for text classification, similar to Multinomial Naive Bayes. However, Bernoulli NB is used when each feature (e.g. word) is binary, that is, it can take on one of two values (e.g. present or absent). The number of classes is not restricted to two.
In Bernoulli NB, each feature (word) is treated as a binary variable that can either be present or absent in the document. The algorithm builds a probability model by calculating the conditional probabilities of each feature (word) being present or absent, given each class. During training, the algorithm counts the number of documents in each class that contain each feature (word) and from these counts calculates the probability of each feature being present or absent in each class.
When making a prediction for a new document, the algorithm calculates the probability of each class given the presence or absence of each feature in the document, and then selects the class with the highest probability as the predicted class.
Bernoulli Naive Bayes is a simple and fast algorithm that can be trained on large datasets. It is particularly well-suited for problems where the features are binary (i.e., present or absent). However, like the standard Multinomial Naive Bayes algorithm, it makes the strong assumption of feature independence, which may not always hold in practice. Additionally, the algorithm does not take into account the frequency of the features in the document, which can limit its accuracy in certain types of text classification tasks.
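The sketch below illustrates the two points made above: absent words also contribute to the score, and repeated words count only once. The per-class probabilities are hypothetical values assumed to have been estimated (and smoothed) beforehand.

```python
import math


def bernoulli_log_likelihood(present_words, p_present_given_class):
    """Sum log P(x_i | class) over the whole vocabulary.

    A word that is present contributes log(p); an absent word
    contributes log(1 - p).
    """
    total = 0.0
    for word, p in p_present_given_class.items():
        total += math.log(p) if word in present_words else math.log(1.0 - p)
    return total


# Hypothetical probabilities that a word is present in a document of each class.
p_spam = {"free": 0.8, "money": 0.7, "meeting": 0.1}
p_ham = {"free": 0.1, "money": 0.2, "meeting": 0.6}

# Converting to a set discards word frequency: "free" three times counts once.
doc_words = set("free free free money".split())

spam_ll = bernoulli_log_likelihood(doc_words, p_spam)
ham_ll = bernoulli_log_likelihood(doc_words, p_ham)
print(spam_ll > ham_ll)
```

Here the absence of "meeting" counts as mild evidence for spam, which a multinomial model would ignore.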

Figure 3 - Sample implementation of Naive Bayes in case of Sentiment Analysis
Naive Bayes is often used in text classification tasks, such as spam filtering or sentiment analysis. It can also be used for other types of classification tasks, such as image classification or medical diagnosis. The algorithm works by calculating the probability of each class for a given set of features and then selecting the class with the highest probability as the predicted class.
One advantage of Naive Bayes is that it is a simple algorithm that is easy to implement and can work well even with small datasets. It is also computationally efficient, making it a good choice for large-scale classification tasks. Another advantage is that it can handle both binary and multi-class classification problems.
One limitation of Naive Bayes is that it assumes that the features are independent of each other, which may not be true in all cases. This can result in reduced accuracy if there are strong correlations between the features. However, in many cases, Naive Bayes can still perform well despite this limitation.
Naive Bayes is a simple and effective classification algorithm that can be used for a variety of tasks, especially in text classification. While it has some limitations, it is often a good choice for small to medium-sized datasets and can provide accurate results with relatively little computational cost.
DATA PREP AND CODE
Labelled data is necessary for supervised learning because it allows the model to learn the relationship between the input and output variables. In supervised learning, we train the model on labeled data, where each data point is associated with a label or target variable. This means that we know the expected output or response variable for each input or feature set.
The model uses the labeled data to learn the underlying patterns or relationships between the input and output variables. It then uses this knowledge to make predictions on new, unseen data. Without labeled data, the model cannot learn the relationship between the input and output variables, and hence cannot make accurate predictions on new data.
The goal of supervised learning is to learn the underlying patterns or relationships between the input and output variables so that we can make accurate predictions on new, unseen data.
Figures 4 and 5 below give us an idea about the RAW data available to us for analysis, and its transformation to clean data.
We have five such datasets (Figure 4), one for each of the big five leagues in Europe. The raw data is not clean, nor is it formatted in a way that Naive Bayes can be applied to directly. We dropped features, addressed data type mismatches, and created a new feature "Notes" to make the dataset suitable for the model.
The final dataset (Figure 5) is a combination of the five cleaned datasets, one per league. As can be seen, the combined dataset has dimensions 1078x15.
 
Now that our dataset is ready, we need to split it into training and testing datasets.
Splitting data into training and testing sets is necessary in supervised learning to evaluate the performance of the model. The purpose of training a model is to make it learn from the given data so that it can make accurate predictions on unseen data. However, if the model is overfitted, it will perform well on the training data but poorly on the testing data, which defeats the purpose of creating a model in the first place.
To detect overfitting, the data is split into two sets: the training set and the testing set. The model is trained on the training set and evaluated on the testing set. This way, the model can be tested on data it has not seen before, and the performance on the testing set can be used to estimate the performance of the model on new, unseen data.
Holding data out is also useful for hyperparameter tuning. Hyperparameters are settings that are chosen before training a model, and they can significantly impact the performance of the model. Candidate settings should be compared on a separate validation set carved out of the training data (or through cross-validation), not on the testing set itself: once the testing set guides these choices, its score no longer gives an unbiased estimate of performance on unseen data.
 
Creating a disjoint split when creating test and train splits is important for several reasons:
- Detecting overfitting: When a model is trained on a dataset, it may learn to memorize the specific data points and relationships within that dataset, rather than learning more generalizable patterns. This can lead to overfitting, where the model performs well on the training data but poorly on new data. By creating a disjoint split where the test set contains data that the model has not seen during training, we can evaluate the model's ability to generalize to new data and detect overfitting.
- Evaluating model performance: When testing a model's performance, we want to know how well it will perform on new, unseen data. By creating a disjoint split, we can evaluate the model's performance on a set of data that it has not seen during training, which gives us a more accurate estimate of how the model will perform in the real world.
- Improving model selection: When comparing the performance of different models, we want to ensure that they are being evaluated on the same set of data. By creating a disjoint split, we can ensure that all models are being evaluated on the same set of test data, which allows for a fair comparison of their performance.
Overall, creating a disjoint split when creating test and train splits is essential for evaluating and comparing machine learning models, as it allows us to test their ability to generalize to new data and to detect overfitting.
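A disjoint split can be sketched as follows. The source does not state the split ratio or whether the split was stratified, so the 80/20 ratio, the per-class sampling, the seed, and the toy label counts below are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return disjoint (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])   # first slice goes to the test set
        train_idx.extend(indices[n_test:])  # the remainder goes to training
    return sorted(train_idx), sorted(test_idx)


# Toy labels, invented for illustration.
labels = ["Regular"] * 20 + ["UCL"] * 5 + ["Relegated"] * 5
train_idx, test_idx = stratified_split(labels)

# No row index appears in both sets.
print(set(train_idx).isdisjoint(test_idx))
```

Sampling each class separately keeps the class proportions similar in both sets, which matters when one class is much rarer than the others.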
Figure 4 - Raw data before transformation


Figure 5 - Cleaned Data


Figure 6 - Training dataset (Scaled)
Figure 7 - Test dataset (Scaled)
RESULTS
The output of Naive Bayes is a prediction or classification of the input data into one of several predefined categories. It can also report the estimated probability that a particular data point belongs to each class.
For our model, we wanted to classify each of our teams, across all five leagues, into one of the three classes below:
- UCL
- Regular
- Relegated
UCL refers to the UEFA Champions League, the most prestigious club competition in Europe. Only the top teams from across Europe qualify and compete to determine who rules Europe.
Regular refers to a club remaining in the top league it currently plays in, without qualifying for the UCL.
Relegated means that the club has been demoted to a league below its current one.
Now let's have a look at how well Naive Bayes performs.
Figure 8 shows the confusion matrix for our model. A confusion matrix is a table used to evaluate the performance of a classification algorithm by comparing the predicted values of a model with the actual values of the test data set. The table is structured in rows and columns, with each row representing the instances in a predicted class and each column representing the instances in an actual class.
The four possible outcomes of a binary classification problem are true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A true positive is when the model correctly predicts the positive class, a false positive is when the model predicts the positive class but the actual class is negative, a true negative is when the model correctly predicts the negative class, and a false negative is when the model predicts the negative class but the actual class is positive. With three classes, as here, these counts are obtained per class by treating that class as positive and the other two as negative.
It can be seen that our model performs reasonably well on the first attempt, with relatively few false positives and false negatives.
Now, let us take a more detailed statistical look at whether this can be considered a good model.
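As an illustration of how such a table is built for three classes, the sketch below tallies predicted against actual labels and derives TP, FP, TN and FN for one class by treating the others as negative. The six labels are invented and are not the Figure 8 results; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are predicted classes, columns are actual classes."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in classes] for p in classes]


def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, TN, FN) for class index k against all other classes."""
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # predicted k, actually another class
    fn = sum(row[k] for row in matrix) - tp   # actually k, predicted another class
    tn = sum(map(sum, matrix)) - tp - fp - fn
    return tp, fp, tn, fn


classes = ["Relegated", "Regular", "UCL"]
# Invented labels for six teams.
actual = ["Relegated", "Regular", "Regular", "UCL", "Regular", "Relegated"]
predicted = ["Relegated", "Regular", "Relegated", "UCL", "Regular", "Regular"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)

print(one_vs_rest_counts(matrix, classes.index("Regular")))
```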
Figure 9 gives a more detailed statistical view of the model evaluation. The model achieved an overall accuracy of 0.8611, meaning that it correctly predicted the class of 86.11% of the instances in the test set. The 95% confidence interval reported with it indicates how much uncertainty surrounds this estimate. The No Information Rate (NIR) is the accuracy that would be achieved by always predicting the most frequent class, which is 0.6667 for this dataset. The P-value for the "Accuracy higher than NIR" test is very small (9.225e-08), indicating that the model's accuracy is significantly higher than the NIR. The Kappa coefficient, a measure of agreement between the predictions and the actual classes that corrects for agreement expected by chance, is 0.7435, which indicates good agreement.
Analyzing the statistics by class, we can see that the model performs well in predicting the Regular and Relegated classes, with sensitivities of 0.8333 and 0.8261, respectively. It performs best on the UCL class, with a sensitivity of 1.0000, which means that every test instance that actually belongs to this class was identified as such.
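For readers who want to see how accuracy, the No Information Rate and Kappa relate to a confusion matrix, here is a minimal sketch. The matrix is made up and does not reproduce the Figure 8 or Figure 9 values; rows are taken to be predicted classes and columns actual classes.

```python
def overall_stats(matrix):
    """Return (accuracy, no-information rate, Cohen's kappa)."""
    n = sum(map(sum, matrix))
    k = len(matrix)
    accuracy = sum(matrix[i][i] for i in range(k)) / n

    actual_totals = [sum(row[j] for row in matrix) for j in range(k)]
    predicted_totals = [sum(row) for row in matrix]

    # Accuracy of always guessing the most frequent actual class.
    nir = max(actual_totals) / n

    # Agreement expected by chance, given the row and column totals.
    expected = sum(a * p for a, p in zip(actual_totals, predicted_totals)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, nir, kappa


# Made-up 3x3 matrix for 36 test instances (not the project results).
example = [[8, 2, 0],
           [1, 20, 1],
           [0, 0, 4]]

accuracy, nir, kappa = overall_stats(example)
print(accuracy > nir)
```

Kappa is lower than raw accuracy because it discounts the share of correct predictions that would be expected from the class frequencies alone.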
The specificity of the model is also high for all three classes, indicating that it is good at identifying negative instances. The positive predictive value (PPV) is high for the Regular and UCL classes, indicating that the model has a high likelihood of correctly identifying instances belonging to these classes. However, the PPV for the Relegated class is relatively low, indicating that the model may have some difficulty correctly identifying instances belonging to this class.
The negative predictive value (NPV) is high for the Relegated and UCL classes, indicating that the model is good at correctly identifying negative instances for these classes. The prevalence of the Regular class is the highest among the three classes, with a prevalence of 0.6667. The balanced accuracy for all three classes is high, indicating that the model is performing well in all three classes. Overall, the model appears to be performing well, with high accuracy and sensitivity, and a low false negative rate for the UCL class.
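The per-class statistics discussed above all derive from the four one-vs-rest counts for a class. The sketch below shows the definitions; the counts passed in are hypothetical and are not taken from Figure 9.

```python
def class_metrics(tp, fp, tn, fn):
    """Per-class statistics from one-vs-rest counts.

    Assumes each denominator is non-zero, i.e. the class occurs in the
    test set and is predicted at least once.
    """
    sensitivity = tp / (tp + fn)   # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),     # how often a positive prediction is right
        "npv": tn / (tn + fn),     # how often a negative prediction is right
        "prevalence": (tp + fn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for one class in a 36-instance test set.
metrics = class_metrics(tp=4, fp=0, tn=31, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy averages sensitivity and specificity, so it is less flattering than plain accuracy when a class is rare.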


Figure 8 - Confusion Matrix
Figure 9 - Overall Statistics

Figures 10 to 15 - Visualizations
The above visualizations showcase different inferences that we could draw from the results of our model.
Note: The predicted class numbers correspond to the following:
Class 1 = Relegated
Class 2 = Regular
Class 3 = UCL
CONCLUSION
One of the main strengths of Naive Bayes is its simplicity and efficiency, making it suitable for large datasets with many features. This is particularly useful in soccer analytics, where there are many variables that can influence match outcomes, such as player positions, formations, and playing styles. Naive Bayes also works well with categorical data, which is common in soccer analytics, where variables such as player positions and match outcomes are often binary or multi-class.
Another advantage of Naive Bayes is its ability to handle missing data and noisy features, which can be a problem in real-world datasets. This allows analysts to work with incomplete datasets, which is useful when working with data from multiple sources or when dealing with data that has not been properly cleaned or preprocessed.
However, Naive Bayes does have some limitations that analysts should be aware of. One of the main limitations is its assumption of independence between features, which may not hold in real-world data. In soccer analytics, for example, the number of shots on target and the number of goals scored are likely to be correlated, which violates the independence assumption. This can lead to biased predictions and inaccurate estimates of feature importance.
Another limitation of Naive Bayes is its sensitivity to imbalanced datasets, where one class is much more prevalent than others. In soccer analytics, for example, the Home Win class is likely to be more prevalent than the Away Win or Draw classes, which can lead to biased predictions and inaccurate estimates of model performance. To address this issue, analysts can use techniques such as oversampling or undersampling to balance the dataset.
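Random oversampling, one of the techniques mentioned above, can be sketched as follows. The rows and labels are placeholders, the function name is hypothetical, and in practice resampling would be applied to the training set only so that the test set keeps its original class balance.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for c, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == c]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampled with replacement
            out_labels.append(c)
    return out_rows, out_labels


# Placeholder training data: four rows of one class, one row of each other class.
rows = [[1], [2], [3], [4], [5], [6]]
labels = ["Regular", "Regular", "Regular", "Regular", "UCL", "Relegated"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```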
In conclusion, Naive Bayes is a useful algorithm for soccer analytics that can provide valuable insights into the performance of teams and players. It is particularly useful for analyzing categorical variables such as player positions and match outcomes. In our soccer dataset, Naive Bayes performed well, with an accuracy of 0.8611 and a low false negative rate for the UCL class (sensitivity of 1.0000). This suggests that the model is able to accurately predict how a team's season ends, especially for teams that go on to qualify for the UCL. Naive Bayes also allowed us to analyze the importance of different features in predicting these outcomes. Overall, Naive Bayes can be a useful tool for soccer analysts looking to gain insights into the factors that influence team outcomes and player performance.







