At the Grocery Shop

ASSOCIATION RULE MINING

OVERVIEW

Association rule mining is a data mining technique used to discover the relationships between variables or items in large datasets. The technique involves analyzing transactional data to find associations or relationships between items. It is based on the concept of a transaction, which is a set of items purchased by a customer. This method attempts to identify the items that are frequently purchased together and the degree of association between them.

The output of association rule mining is a set of rules that indicate the presence of strong associations or relationships between items. These rules are expressed in the form of "if-then" statements, where the "if" part represents the antecedent, or the set of items that are being considered, and the "then" part represents the consequent, or the item that is being predicted.

The strength of an association is typically measured in terms of support, which is the frequency with which the antecedent and consequent occur together, and confidence, which is the probability of the consequent occurring given the antecedent.

SUPPORT, LIFT & CONFIDENCE

Support, lift, and confidence are important measures used in association rule mining to evaluate the strength of relationships between items in a dataset.
 

  • Support:

Support is a measure of the frequency with which a specific set of items occurs together in a dataset. It is calculated as the proportion of transactions that contain all the items in the set. The support value indicates how popular or frequent a particular itemset is in the dataset. A higher support value means that the itemset is more popular, and a lower support value means that the itemset is less frequent.
 

  • Confidence:

Confidence is a measure of the strength of the relationship between the antecedent (if) and consequent (then) in an association rule. It is calculated as the proportion of transactions that contain both the antecedent and the consequent, divided by the proportion of transactions that contain the antecedent. The confidence value indicates the likelihood that the consequent will occur when the antecedent is present. A higher confidence value means that the relationship between the antecedent and the consequent is stronger.
 

  • Lift:

Lift is a measure of the strength of association between the antecedent and the consequent in an association rule, while controlling for the frequency of the consequent. It is calculated as the ratio of the observed support of the antecedent and consequent together to the expected support of the antecedent and consequent if they were statistically independent. The lift value indicates whether the antecedent and the consequent are associated more frequently than would be expected by chance. A lift value greater than 1 indicates a positive association between the antecedent and consequent, while a value less than 1 indicates a negative association.

In summary, support measures the frequency of an itemset, confidence measures the strength of the relationship between the antecedent and consequent, and lift measures the strength of the association between the antecedent and consequent while controlling for the frequency of the consequent. These measures are important in association rule mining to identify significant relationships between items in a dataset and to make decisions based on the discovered patterns.
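To make these measures concrete, below is a minimal sketch in base R that computes support, confidence, and lift for one hypothetical rule {bread} => {butter}. The baskets and numbers are purely illustrative, not taken from the project data:

```r
# Toy transactions: each element is one customer's basket (illustrative only)
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk",  "butter"),
  c("bread", "butter", "jam")
)

# Helper: proportion of transactions containing every item in `itemset`
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

# Rule: {bread} => {butter}
sup_rule   <- support(c("bread", "butter"))   # P(bread AND butter)
confidence <- sup_rule / support("bread")     # P(butter | bread)
lift       <- confidence / support("butter")  # confidence adjusted for butter's own frequency

cat("support:", sup_rule, " confidence:", confidence, " lift:", lift, "\n")
```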

sup-con-lft.png

Figure 1 - Support, Confidence and Lift

ASSOCIATION RULES

In association rule mining, "rules" are statements that describe the relationships between items or variables in a dataset. These rules are formed by analyzing the frequency of co-occurrence of items in transactions or records in the dataset.

A rule is typically expressed in the form of "If A then B" or "A implies B", where A and B are sets of items. A is known as the "antecedent" or "left-hand side" of the rule, while B is known as the "consequent" or "right-hand side" of the rule. The rule can be interpreted as "if itemset A occurs, then itemset B is likely to occur as well".

To generate rules, association rule mining algorithms first identify all frequent itemsets, i.e., sets of items that occur together in a sufficient number of transactions in the dataset. Then, the algorithms generate rules by applying certain threshold measures such as support, confidence, and lift to the frequent itemsets.
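As a rough illustration of this second step, the sketch below enumerates the candidate rules hidden inside a single frequent itemset and keeps only those whose confidence clears a threshold. The baskets, the itemset {bread, butter, jam}, and the 0.5 cutoff are all assumptions made for the example:

```r
# Illustrative baskets and a support helper (proportion of baskets containing the itemset)
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk",  "butter"),
  c("bread", "butter", "jam")
)
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

itemset  <- c("bread", "butter", "jam")  # assume this itemset was found to be frequent
min_conf <- 0.5                          # minimum confidence threshold

# Split the itemset into every antecedent/consequent pair and filter by confidence
for (k in 1:(length(itemset) - 1)) {
  for (lhs in combn(itemset, k, simplify = FALSE)) {
    rhs  <- setdiff(itemset, lhs)
    conf <- support(itemset) / support(lhs)  # P(rhs | lhs)
    if (conf >= min_conf) {
      cat("{", paste(lhs, collapse = ", "), "} => {",
          paste(rhs, collapse = ", "), "}  confidence:", round(conf, 2), "\n")
    }
  }
}
```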


 

mba.jpeg


Figure 2 - Market Basket Analysis example

APRIORI ALGORITHM

The Apriori algorithm is a popular algorithm for association rule mining in large datasets. It was introduced by R. Agrawal and R. Srikant in 1994 and is based on the concept of frequent itemsets.

The algorithm works in two steps:


  • Generating frequent itemsets:

The Apriori algorithm uses a "bottom-up" approach to find frequent itemsets, starting with individual items and progressively combining them into larger itemsets. The algorithm scans the dataset to identify the support of each item (i.e., the frequency of occurrence in the dataset) and selects the frequent items that meet a specified minimum support threshold. These frequent items are used to generate candidate itemsets of size 2, which are then pruned to remove those that do not meet the minimum support threshold. The process is repeated to generate candidate itemsets of size k (k > 2), which are pruned until only frequent itemsets remain.
 

  • Generating association rules:

Once the frequent itemsets have been identified, the Apriori algorithm generates association rules by selecting the itemsets that meet a specified minimum confidence threshold. For each frequent itemset, the algorithm generates all possible subsets of items and calculates the confidence of the corresponding rules. Only the rules that meet the minimum confidence threshold are retained.

The Apriori algorithm has several advantages, such as its simplicity, its scalability to large datasets, and its use of configurable minimum support and confidence thresholds. However, it also has limitations: the number of candidate itemsets it generates can grow very large, and the repeated scans of the dataset can lead to large search spaces and slow processing times.

Overall, the Apriori algorithm is a powerful and widely used algorithm for association rule mining that has enabled many important applications in various fields, such as market basket analysis, customer behavior modeling, and more.


Figure 3 below shows the basic implementation of the Apriori Algorithm

algo.png

Figure 3 - Apriori Algorithm
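Complementing Figure 3, a minimal sketch of running Apriori with the arules package in R could look like the following. The baskets and thresholds here are illustrative placeholders, not the actual project data:

```r
library(arules)

# Illustrative baskets coerced into the sparse "transactions" format used by arules
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk",  "butter")
)
trans <- as(baskets, "transactions")

# Apriori first finds the frequent itemsets, then derives the rules that
# satisfy the minimum support and confidence thresholds
rules <- apriori(trans,
                 parameter = list(supp = 0.25, conf = 0.5, minlen = 2))

inspect(rules)   # print the generated rules
summary(rules)   # distribution of support, confidence and lift
```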

DATA PREP

Unlike some other machine learning techniques that require labeled data for learning, ARM only requires unlabeled transaction data to identify relationships between items.


Transaction data typically consists of records that contain a set of items purchased or observed together by a customer, or any other type of event that is represented as a transaction. For example, a transaction may contain a list of items purchased in a supermarket, or a list of web pages visited by a user during a session.


Using this transaction data, ARM can identify frequent itemsets and association rules, which provide valuable insights into the relationships between items in the transactions. Frequent itemsets are groups of items that frequently appear together in transactions, while association rules express the likelihood of one item being associated with another item in a transaction.

raw-data.png

Figure 4 - Raw data before transformation

Figure 4 above shows the raw data, which was gathered using APIs. The dataset contains the nationalities of players across the English Premier League, along with their statistics for the season. As can be seen, this dataset is not suitable for Association Rule Mining in its raw form: most columns are numerical, values are missing, zeros are present, and labels are included.


Hence, unnecessary columns were removed, the column with an improper data type was rectified with the help of the gsub function, and only the fields containing textual data were retained for mining.
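A rough sketch of that cleaning step is shown below; the data frame player_df and the column names Nationality, Position, and Club are placeholders rather than the actual field names returned by the API:

```r
library(arules)

# Drop the numerical/stat columns and keep only the categorical fields of interest
# (player_df and these column names are placeholders for the actual API fields)
keep_cols <- c("Nationality", "Position", "Club")
clean_df  <- player_df[, keep_cols]

# Fix an improperly typed column, e.g. strip stray characters with gsub
clean_df$Nationality <- gsub("[^A-Za-z ]", "", clean_df$Nationality)

# Convert every column to a factor so the data frame can be coerced into the
# transactions format expected by arules (each row becomes one transaction)
clean_df[] <- lapply(clean_df, factor)
trans <- as(clean_df, "transactions")
```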


Figure 5 below is a snippet of the clean and transformed dataset, ready for use!

final-data.png

Figure 5 - Clean Data after transformation

RESULTS

The results of ARM are typically presented in the form of association rules, which consist of an antecedent (a set of items) and a consequent (another item) and are often expressed as "If antecedent, then consequent".

The results of ARM can provide insights into the relationships between items in the transactions and can be used for a variety of purposes, such as market basket analysis, recommender systems, and fraud detection.

Top 15 rules sorted according to Confidence
Top 15 rules sorted according to Lift
Top 15 rules sorted according to Support

Figures 6 to 8 - Association rules with different parameters

The visualizations above showcase the discovered association rules sorted by different parameters: the first figure shows the top 15 rules sorted by confidence, the second sorts them by lift, and the third sorts them by support.

The quality of the results of ARM depends on the quality of the transaction data, the choice of parameters such as minimum support and minimum confidence, and the choice of algorithms used for mining the association rules. It is important to interpret the results of ARM with care and to validate them using domain knowledge and statistical methods. For optimal results, we set the minimum support to 0.01 and the minimum confidence to 0.1.

With these values, we obtained a reasonable number of association rules that were both strong and interpretable.
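A hedged sketch of how such rules can be mined and ranked in R with the arules package, using the thresholds quoted above, is given below; the trans object is assumed to be the cleaned transactions produced in the data prep step:

```r
library(arules)

# `trans` is assumed to be the transactions object built during data prep
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.1, minlen = 2))

# Top 15 rules by each quality measure, as in Figures 6 to 8
inspect(head(sort(rules, by = "confidence"), 15))
inspect(head(sort(rules, by = "lift"), 15))
inspect(head(sort(rules, by = "support"), 15))
```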

apr-4.png

Figure 9 - Visualization of Association Rules

Figures 9 and 10 give us detailed insights into the association rules that have been formed. Figure 9 is vertical in orientation and shows the LHS and RHS of the association rules with their respective support values, shaded according to the strength of their lift. Figure 10 displays the strength of the confidence values between the LHS and RHS of the association rules. While some rules show a strong association, ARM also establishes weaker rules; it is important to note that further analysis is needed to validate whether a rule can be dropped based on its strength.

arm.png

Figure 10 - Visualization of Association Rules
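Plots along the lines of Figures 9 and 10 are typically produced with the arulesViz package. The sketch below is an assumption about how such plots could be generated, not the exact code behind the figures:

```r
library(arulesViz)

# Keep a manageable subset of rules for plotting, e.g. the 20 with the highest lift
# (`rules` is assumed to be the rule set mined above)
top_rules <- head(sort(rules, by = "lift"), 20)

# Grouped matrix of LHS vs RHS, with point size and colour reflecting support
# and lift (similar in spirit to Figure 9)
plot(top_rules, method = "grouped")

# Network graph of rules, useful for spotting shared antecedents and
# consequents (similar in spirit to Figure 10)
plot(top_rules, method = "graph")
```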

CONCLUSION

Association Rule Mining (ARM) can be applied to a soccer dataset to discover interesting and useful patterns in the data. For example, ARM can be used to identify frequent itemsets of player attributes or team statistics that co-occur in matches or tournaments. ARM can also be used to generate association rules that express the relationships between player attributes or team statistics and match outcomes, such as "If a team has a high possession rate, then they are more likely to win the match".

The quality of the results of ARM on a soccer dataset depends on the quality of the transaction data and the choice of parameters and algorithms used for mining the association rules. It is important to interpret the results of ARM with care and to validate them using statistical methods and domain knowledge.

For our association rules, we primarily focused on the players' nationalities, the position they play in, and the club they play for. With the help of ARM, we were able to discover trends and insights that were not expected. For example, the English club Brentford is dominated by Danish players, England as a nation produces a lot of defenders and midfielders, and Everton's squad has a lot of English defenders.


A lot of these insights can be effectively used to scout prospective young talent, judging by the nation they belong to or the position they play in. This is just one area where such analysis can prove helpful. Overall, ARM can be a valuable technique for uncovering patterns and relationships in soccer data, which can provide insights into the factors that contribute to team success and player performance.

© 2023 by ssawhney.
