
SUPPORT VECTOR MACHINE
OVERVIEW
Support Vector Machines (SVMs) are a type of machine learning algorithm used for classification and regression analysis. SVMs are called "support vector" machines because they rely on a set of data points called "support vectors" to define the decision boundary between different classes or groups of data.
SVMs work by finding the hyperplane that best separates the data into different classes. The hyperplane is defined as the line or plane that maximizes the margin between the closest data points of each class. The margin is the distance between the hyperplane and the closest data points, and the goal of the SVM algorithm is to maximize this distance.
In some cases, the data may not be linearly separable, which means that there is no hyperplane that can perfectly separate the classes. In these cases, the SVM algorithm uses a kernel function to map the data into a higher-dimensional space where it may be linearly separable. There are several types of kernel functions, including linear, polynomial, radial basis function (RBF), and sigmoid, among others.
SVMs can be used for both binary and multi-class classification problems. In binary classification, the SVM algorithm assigns each data point to one of two classes based on which side of the decision boundary it falls on. In multi-class classification, the algorithm typically combines several binary classifiers (for example, one-vs-one or one-vs-rest) and assigns each data point to a class based on multiple decision boundaries.
SVMs are a popular and powerful machine learning algorithm that can be used in a wide range of applications, including text classification, image classification, and bioinformatics. SVMs are known for their ability to handle high-dimensional data, and they are often used in situations where the number of features is much larger than the number of samples. However, SVMs can be sensitive to the choice of kernel function and the parameters used in the algorithm, and tuning these parameters can require significant computational resources.
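As a concrete illustration (the report's own code is not reproduced here), a minimal scikit-learn sketch of fitting an SVM classifier on a labeled numeric dataset might look like the following; the dataset and parameters are illustrative, not the project's.

# Minimal illustrative sketch: fitting an SVM classifier with scikit-learn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)              # any labeled numeric dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0)                           # C is the cost parameter
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)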
 

Figure 1 - Sample SVM
LINEAR SEPARATORS & KERNELS
SVMs are called linear separators because they attempt to find a linear decision boundary that separates the data points into two or more classes. A linear decision boundary is a straight line or hyperplane that can be represented by a linear equation, such as y = mx + b in two dimensions or Ax + By + Cz + D = 0 in three.
The goal of the SVM algorithm is to find the hyperplane that maximizes the margin between the closest data points of each class. The margin is the distance between the hyperplane and the closest data points, and by maximizing this distance, the SVM algorithm can create a decision boundary that is as far away from the data points as possible.
When the data is linearly separable, a hyperplane can perfectly separate the data points into different classes. In this case, the SVM algorithm can find the optimal hyperplane that maximizes the margin between the closest data points of each class.
However, when the data is not linearly separable, the SVM algorithm uses a kernel function to map the data into a higher-dimensional space where it may be linearly separable. The kernel function allows the SVM algorithm to transform the original data points into a new feature space where a linear decision boundary may be able to separate the data points into different classes.
Kernels play a critical role in Support Vector Machines (SVMs) by allowing the algorithm to find non-linear decision boundaries.
A kernel function maps the input data points into a higher-dimensional space, where the data points may be more easily separated by a linear decision boundary. Kernels essentially transform the input data points into a new feature space, where the data points are represented by a set of new features. These new features can be used to find a linear decision boundary that separates the data points into different classes.
There are several types of kernel functions that can be used in SVMs, including linear, polynomial, radial basis function (RBF), and sigmoid, among others. The choice of kernel function depends on the nature of the data and the problem being solved.
Linear kernels simply perform a dot product between the input data points, resulting in a linear decision boundary. Polynomial kernels transform the data points into a higher-dimensional space using a polynomial function. RBF kernels transform the data points into an infinite-dimensional space, where the distance between the data points is measured using a Gaussian function. Sigmoid kernels use a sigmoid function to transform the data points into a new feature space.
Once the data points have been transformed into a higher-dimensional space, the SVM algorithm can find a linear decision boundary that separates the data points into different classes. This is done by finding the hyperplane that maximizes the margin between the closest data points of each class.
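To make the role of the kernel concrete, the short illustrative sketch below (scikit-learn, synthetic data; not part of the project's code) compares a linear and an RBF kernel on two concentric circles, a dataset that no straight line can separate.

# Illustrative sketch: a kernel turning a non-linearly-separable problem into a separable one
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)  # concentric circles

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # roughly chance level: no separating line exists
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # near 1.0: separable after the implicit mapping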

Figure 2 - Various SVM Kernel Formulae
DOT PRODUCT IMPORTANCE
The dot product is a critical component of the kernel function in Support Vector Machines (SVMs) because it allows the algorithm to measure the similarity or dissimilarity between pairs of input data points.
The dot product is used to calculate the inner product between two vectors, which is a measure of the similarity between the vectors. When the dot product between two vectors is high, it indicates that the vectors are similar or have similar features. Conversely, when the dot product between two vectors is low, it indicates that the vectors are dissimilar or have different features.
In SVMs, the dot product is used to measure the similarity between pairs of input data points that have been transformed into a higher-dimensional space using a kernel function. By measuring the similarity between the data points in the new feature space, the SVM algorithm can find a linear decision boundary that separates the data points into different classes.
For example, in a linear kernel, the dot product is used to calculate the inner product between pairs of input data points, and the resulting value represents the similarity between the data points in the original feature space. A high dot product between two data points suggests that their features are similar, making it more likely that they fall on the same side of the decision boundary; a low (or negative) dot product suggests that the points are dissimilar.
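A small NumPy sketch makes this similarity interpretation concrete (the vectors here are purely illustrative):

# Illustrative sketch: the dot product as a similarity measure between feature vectors
import numpy as np

a = np.array([1.0, 3.0])
b = np.array([1.2, 2.8])    # a vector close to a
c = np.array([-1.0, -3.0])  # a vector pointing in the opposite direction

print(np.dot(a, b))   # 9.6  -> large positive value: similar vectors
print(np.dot(a, c))   # -10.0 -> negative value: dissimilar vectors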
KERNEL FUNCTIONS
The polynomial and RBF (Radial Basis Function) kernel functions are two commonly used types of kernels in Support Vector Machines (SVMs).
The polynomial kernel function transforms the input data points into a higher-dimensional space using a polynomial function. The polynomial kernel function has a parameter called the degree, which determines the degree of the polynomial used in the transformation. The polynomial kernel function is defined as:
K(x, y) = (x · y + c)^d
where x · y is the dot product of the two input data points, c is a constant, and d is the degree of the polynomial.
The RBF kernel function transforms the input data points into an infinite-dimensional space using a Gaussian function. The RBF kernel function has a parameter called the gamma, which determines the width of the Gaussian function. The RBF kernel function is defined as:
K(x, y) = exp(-gamma * ||x - y||^2)
where x and y are input data points, gamma is a constant, and ||x - y||^2 is the squared Euclidean distance between the two points.
Both the polynomial and RBF kernel functions are used to map the input data points into a higher-dimensional space where they can be more easily separated by a linear decision boundary. The choice of kernel function depends on the nature of the data and the problem being solved. The polynomial kernel function is suitable for data that has polynomial relationships between the input features, while the RBF kernel function is suitable for data that has non-linear relationships between the input features.
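Written directly from the formulas above, the two kernels can be sketched in NumPy as follows; the constants c, d, and gamma are illustrative choices rather than values used in the project.

# Illustrative sketch: polynomial and RBF kernel functions written from their definitions
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    # K(x, y) = (x . y + c)^d
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 3.0])
y = np.array([2.0, 0.0])
print(polynomial_kernel(x, y))  # (1*2 + 3*0 + 1)^2 = 9.0
print(rbf_kernel(x, y))         # exp(-0.5 * ((1-2)^2 + (3-0)^2)) = exp(-5.0)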
 
EXAMPLE
 Let's say we have a 2D point (1, 3) and we want to use a polynomial kernel with r = 1 and d = 2 to transform it into a higher-dimensional space. The polynomial kernel function is defined as:
K(x, y) = (x · y + r)^d
where x · y is the dot product of the two input data points, r is a constant, and d is the degree of the polynomial.
To apply the polynomial kernel to our 2D point, we first calculate the dot product of the point with itself; the kernel takes a pair of points as input, and here we evaluate it with both arguments set to the same point.
So, the dot product of (1, 3) with itself is:
(1, 3) * (1, 3) = 1 * 1 + 3 * 3 = 10
Now we can use this dot product and the parameters r and d to calculate the transformed point. Substituting in r = 1 and d = 2, we get:
K((1, 3), (1, 3)) = (10 + 1)^2 = 121
The resulting value, 121, is not a new coordinate for the point; it is the kernel's output, which equals the dot product of the two (identical) points after they have been implicitly mapped into a higher-dimensional space. For a degree-2 polynomial kernel with r = 1, that implicit mapping sends (x1, x2) to (x1^2, x2^2, √2·x1·x2, √2·x1, √2·x2, 1), and the dot product of the mapped point with itself is exactly 121.
Evaluating the kernel for every pair of points in the dataset produces the kernel (Gram) matrix, and the SVM uses these values to find the optimal decision boundary in the higher-dimensional space without ever computing the mapped coordinates explicitly.
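The worked example can be reproduced in a few lines of NumPy, including the explicit degree-2 feature map that this kernel implicitly corresponds to:

# Illustrative sketch: the worked polynomial-kernel example, checked against the explicit feature map
import numpy as np

x = np.array([1.0, 3.0])

# Kernel value computed directly: K(x, x) = (x . x + r)^d with r = 1, d = 2
k = (np.dot(x, x) + 1) ** 2
print(k)  # (10 + 1)^2 = 121.0

# The same value via the explicit mapping phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
phi = np.array([x[0]**2, x[1]**2,
                np.sqrt(2) * x[0] * x[1],
                np.sqrt(2) * x[0], np.sqrt(2) * x[1], 1.0])
print(np.dot(phi, phi))  # also 121.0: the kernel equals a dot product in the mapped space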
DATA PREP AND CODE
Labeled data is necessary for supervised learning because it allows the model to learn the relationship between the input and output variables. In supervised learning, we train the model on labeled data, where each data point is associated with a label or target variable. This means that we know the expected output or response variable for each input or feature set.
The model uses the labeled data to learn the underlying patterns or relationships between the input and output variables. It then uses this knowledge to make predictions on new, unseen data. Without labeled data, the model cannot learn the relationship between the input and output variables, and hence cannot make accurate predictions on new data.
The goal of supervised learning is to learn the underlying patterns or relationships between the input and output variables so that we can make accurate predictions on new, unseen data.
Figures 3 and 4 below give an idea of the raw data available to us for analysis and its transformation into clean data.
The raw dataset is neither clean nor formatted in a way that allows the SVM algorithm to be applied to it directly. We dropped features, resolved data-type mismatches, and modified column values to make the dataset suitable for the model.
The final dataset (Figure 4) is a combination of six of these cleaned datasets, one for each league. As can be seen, the total dataset has dimensions 174x10.
 
Now, since our dataset is ready, next, we need to split it into training and testing datasets.
Splitting data into training and testing sets is necessary in supervised learning to evaluate the performance of the model. The purpose of training a model is to make it learn from the given data so that it can make accurate predictions on unseen data. However, if the model is overfitted, it will perform well on the training data but poorly on the testing data, which defeats the purpose of creating a model in the first place.
To avoid overfitting, the data is split into two sets: the training set and the testing set. The model is trained on the training set and evaluated on the testing set. This way, the model can be tested on data it has not seen before, and the performance on the testing set can be used to estimate the performance of the model on new, unseen data.
Splitting data into training and testing sets is also useful for hyperparameter tuning. Hyperparameters are settings that are chosen before training a model, and they can significantly impact the performance of the model. By testing the model on the testing set, different hyperparameters can be evaluated, and the ones that lead to the best performance can be selected.
Creating a disjoint split when creating test and train splits is important for several reasons:
Preventing overfitting: When a model is trained on a dataset, it may learn to memorize the specific data points and relationships within that dataset, rather than learning more generalizable patterns. This can lead to overfitting, where the model performs well on the training data but poorly on new data. By creating a disjoint split where the test set contains data that the model has not seen during training, we can evaluate the model's ability to generalize to new data and prevent overfitting.
Evaluating model performance: When testing a model's performance, we want to know how well it will perform on new, unseen data. By creating a disjoint split, we can evaluate the model's performance on a set of data that it has not seen during training, which gives us a more accurate estimate of how the model will perform in the real world.
Improving model selection: When comparing the performance of different models, we want to ensure that they are being evaluated on the same set of data. By creating a disjoint split, we can ensure that all models are being evaluated on the same set of test data, which allows for a fair comparison of their performance.
Overall, creating a disjoint split when creating test and train splits is essential for evaluating and comparing machine learning models, as it allows us to test their ability to generalize to new data and prevent overfitting.
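A minimal sketch of such a disjoint split is shown below; scikit-learn is assumed, and the synthetic arrays merely stand in for the cleaned 174-row match dataset.

# Illustrative sketch: a disjoint train/test split
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 9))        # stand-in for 9 numeric feature columns
y = rng.integers(0, 2, size=174)     # stand-in labels: 0 = loss, 1 = win (hypothetical encoding)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)   # the two sets contain disjoint rows of the original data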
Lastly, SVMs require labeled numeric data because they rely on mathematical calculations to classify and separate data points. When training an SVM model, the algorithm needs to find the optimal hyperplane that best separates the labeled data points into their respective categories. To do this, the SVM algorithm calculates the dot product between pairs of data points, which requires the data to be in numerical form. The dot product measures the similarity between two vectors, and it's a fundamental operation in many machine learning algorithms, including SVMs.
Furthermore, the labels must also be numeric because they enter the optimization directly: the standard formulation encodes the two classes as +1 and -1, and the signed distance of a data point from the decision boundary determines which side of the hyperplane it falls on and therefore which label it is assigned.
Therefore, without numeric data and labels, SVMs cannot perform the mathematical calculations necessary to learn from the data and make predictions.
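For illustration, a hypothetical 'Result' column of W/L outcomes could be converted to numeric labels as sketched below; the column name and encoding are assumptions for the example, not the report's actual schema.

# Illustrative sketch: converting text labels to the numeric form an SVM needs
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Result": ["W", "L", "W", "W", "L"]})    # hypothetical outcome column
encoder = LabelEncoder()
df["Result_numeric"] = encoder.fit_transform(df["Result"])  # e.g. L -> 0, W -> 1
print(df)
print(encoder.classes_)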
Figure 3 - Raw data before transformation
Figure 4 - Cleaned Data
Figure 5 - Training dataset




Figure 6 - Testing Dataset
RESULTS
SVM Kernel = Linear
The results of SVM applied to soccer analytics can provide valuable insights into the relationships between variables in a dataset and how they contribute to predicting a target variable. Two of the key tools used to evaluate the performance of an SVM model are accuracy, which measures the proportion of correctly predicted instances in the dataset, and the confusion matrix, which breaks the predictions down by class.
For the dataset at hand, we attempted to make use of three different SVM kernels: linear, polynomial, and radial, each with different values of the cost function. The data we have is for the club 'Manchester United', and we have analyzed their match outcome trends over the past six seasons.
Now, let us have a look at the performance of various SVM kernels with different cost functions on the said dataset.

Figure 7 - Linear SVM Performance
The above results are for a linear SVM model with three different values of the cost function. The accuracy of the model is 100%, indicating that it correctly classified all instances of wins and losses. The balanced accuracy is also 1.0, indicating that the model performs equally well in predicting both wins and losses. Notice also that the linear SVM's accuracy does not change as the cost function value is varied. Overall, these results suggest that the linear SVM model is very effective at predicting the outcomes of soccer matches.
SVM Kernel = Radial

Figure 8 - Radial SVM Performance
The above results are for a radial SVM model with three different values of the cost function. The accuracy of the model varies from 70% to 100% as the cost function value changes. With higher values of the cost function (1 and 10), the model correctly classifies all instances of wins and losses, and the balanced accuracy for these runs is also 1.0, indicating that the model performs equally well in predicting both outcomes. For a lower value of the cost function (0.1), the accuracy drops to about 70%, with the balanced accuracy dropping to about 58%. Overall, with relatively higher values of the cost function, it is safe to conclude that the radial SVM performs well on the soccer dataset.
SVM Kernel = Polynomial

Figure 9 - Polynomial SVM Performance
The above results are for a polynomial SVM model with three different values of the cost function. As with the radial SVM, the accuracy of the model varies from 70% to 100% as the cost function value changes. With higher values of the cost function (1 and 10), the model correctly classifies all instances of wins and losses, and the balanced accuracy for these runs is also 1.0, indicating that the model performs equally well in predicting both outcomes. For a lower value of the cost function (0.1), the accuracy drops to about 70%, with the balanced accuracy dropping to about 58%. Overall, with relatively higher values of the cost function, we can say that the polynomial SVM performs quite well on the soccer dataset.
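A sketch of how such a kernel-and-cost comparison could be run is shown below; scikit-learn is assumed, the train/test split from the earlier sketch is reused, and the original analysis may well have used different tooling.

# Illustrative sketch: comparing SVM kernels and cost values on the held-out test set
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

for kernel in ["linear", "rbf", "poly"]:   # 'rbf' is the radial kernel, 'poly' the polynomial kernel
    for cost in [0.1, 1, 10]:              # the cost values compared above
        model = SVC(kernel=kernel, C=cost).fit(X_train, y_train)
        preds = model.predict(X_test)
        print(kernel, cost,
              "accuracy:", accuracy_score(y_test, preds),
              "balanced accuracy:", balanced_accuracy_score(y_test, preds))
        print(confusion_matrix(y_test, preds))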

Figure 10 to 13 - SVM Visualizations
The above visualizations showcase different inferences that we could draw from the results of our model.
CONCLUSION
In this study, we utilized Support Vector Machines (SVM) to predict the outcome of soccer matches based on a dataset containing various match statistics. We used three different kernels, namely linear, polynomial, and radial basis function (RBF), with different cost values to train and test our SVM models. The linear kernel performed the best, with an accuracy of 100% and a kappa value of 1, indicating perfect prediction. The polynomial and RBF kernels had lower accuracies, but still performed well.
The confusion matrix for each kernel showed that the models had perfect predictions for the linear kernel, while the other two kernels had some misclassifications. The sensitivity and specificity values were both 1 for the linear kernel, indicating perfect performance. For the polynomial and RBF kernels, the sensitivity and specificity values were lower but still relatively high.
We were also able to determine the importance of the features used in our dataset. The SVM algorithm showed us that not all features contribute equally to predicting the outcome of a game. Attributes like goals scored, goals conceded, venue, and possession do help in predicting the outcome of a match, but attributes like expected goals or expected goals conceded are not as reliable predictors of the match outcome.
Overall, the results suggest that SVM can be a powerful tool for predicting the outcome of soccer matches based on various statistics. The linear kernel, in particular, seems to be a highly accurate predictor for this type of data. However, further research can be done to investigate the impact of different features on the prediction accuracy and to explore other machine learning algorithms for soccer analytics.





