A Comprehensive Guide on Market Basket Analysis
- by user1
- 20 March, 2022
This article was published as a part of the Data Science Blogathon
Overview
- This comprehensive guide will introduce you to the world of Market Basket Analysis, along with an implementation in Python on a dataset.
- Market Basket Analysis will help you to design different store Layouts.
Introduction
Nowadays, machine learning is helping the retail industry in many different ways. From forecasting sales performance to identifying buyers, there are many applications of machine learning (ML) in retail. “Market Basket Analysis” is one of the best applications of machine learning in the retail industry. By analyzing the past buying behavior of customers, we can find out which products customers frequently buy together.
Image 1
In this article, we cover Market Basket Analysis and its components, then dive into how to perform it in Python on a real-world dataset.
Table of Contents
- What is Market Basket Analysis?
- What is Association Rule?
- Algorithms used in Market Basket Analysis
- Advantages of Market Basket Analysis
- How does Market Basket Analysis look from Customer’s Perspective?
- Implementing Market Basket Analysis from scratch in Python
- End Notes
What is Market Basket Analysis?
Frequent itemset mining leads to the discovery of associations and correlations between items in huge transactional or relational datasets. With vast amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. The discovery of “Correlation Relationships” among huge amounts of transaction records can help in many decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.
A popular example of frequent itemset mining is Market Basket Analysis. This process identifies customer buying habits by finding associations between the different items that customers place in their “shopping baskets”, as you can see in the following figure. The discovery of this kind of association helps retailers and marketers develop marketing strategies by gaining insight into which items are frequently bought together by customers.
For example, if customers are buying milk, how likely are they to also buy bread (and which kind of bread) on the same trip to the supermarket? This information may lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Image 2
Think of the universe as the set of items available at the store. Each item then has a Boolean variable that represents the presence or absence of that item, and each basket can be represented by a Boolean vector of the values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or bought together. Such patterns can be represented in the form of association rules.
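The Boolean-vector view of a basket can be sketched in a few lines of Python. The baskets and item names below are made up for illustration and do not come from the dataset used later in this article.

```python
# Toy baskets; the item names are hypothetical, for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "butter", "bread"},
    {"eggs"},
]

# The "universe" of items, in a fixed (sorted) order so every vector lines up.
items = sorted(set().union(*baskets))

# One Boolean vector per basket: True where the item is present in the basket.
vectors = [[item in basket for item in items] for basket in baskets]

print(items)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one item of the universe, so two baskets can be compared position by position.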
What is Association Rule for Market Basket Analysis?
Let I = {I1, I2, …, Im} be the set of all items. Let D, the data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an identifier, called a TID (or Tid). Let A be a set of items (an itemset). A transaction T is said to contain A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (that is, the union of set A and set B, or, both A and B). This is taken as the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is,
support(A⇒ B) =P(A ∪ B)
confidence(A⇒ B) =P(B|A)
Rules that satisfy both a minimum support threshold (called min sup) and a minimum confidence threshold (called min conf ) are called “Strong”.
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support count(A ∪ B) / support count(A)
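As a minimal sketch of these definitions, the snippet below computes support count, support, and confidence on a handful of toy transactions. The transactions and the function names are assumptions made for illustration; they are not part of the dataset or library used later.

```python
# Toy transactions (hypothetical items, not the Kaggle dataset).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(a, b, transactions):
    """support count(A ∪ B) / support count(A) for the rule A => B."""
    return support_count(a | b, transactions) / support_count(a, transactions)


print(support({"milk", "bread"}, transactions))       # 3 of 5 transactions
print(confidence({"milk"}, {"bread"}, transactions))  # 3 of the 4 milk baskets
```

Confidence is computed from counts rather than from the two support fractions, which avoids a small floating-point rounding difference.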
Generally, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-established minimum support count, min sup.
2. Generate association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
Association rule mining is primarily used when you want to identify associations between different items in a set and find frequent patterns in a transactional database or a relational database (RDBMS).
The following image shows a typical example of such an association.
Image 3
Algorithms used in Market Basket Analysis
Multiple techniques and algorithms are used in Market Basket Analysis. One of the important objectives is “to predict the probability of items that are being bought together by customers”.
- AIS
- SETM Algorithm
- Apriori Algorithm
- FP Growth
> Apriori Algorithm:
The Apriori algorithm is a well-known and widely used association rule algorithm, and a popular choice for market basket analysis. It is also considered accurate and is regarded as outperforming the AIS and SETM algorithms. It finds frequent itemsets in transactions and identifies association rules between these items. Its main limitation is frequent itemset generation: it needs to scan the database many times, which is computationally costly on a huge database and so increases run time and reduces performance. It relies on the concepts of support and confidence.
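The level-wise search behind Apriori can be sketched as follows. This is a simplified illustration on toy data, not the implementation used by any particular library: the function name and the transactions are assumptions, and support is recomputed with a full scan for every candidate, which is exactly the repeated-scan cost described above.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One scan of all transactions per candidate itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: count single items and keep the frequent ones.
    level = {}
    for item in {i for t in transactions for i in t}:
        candidate = frozenset([item])
        s = support(candidate)
        if s >= min_support:
            level[candidate] = s

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
    return frequent


# Toy transactions (hypothetical items).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate is only counted if all of its smaller subsets were already found to be frequent.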
> AIS Algorithm:
The AIS algorithm makes multiple passes over the entire database of transactions, scanning all transactions during every pass. In the first pass, it counts the support of individual items and determines which of them are frequent in the database. In each later pass, the large itemsets of the previous pass are extended to generate candidate itemsets: after a transaction is scanned, the itemsets common to the large itemsets of the previous pass and the items of this transaction are determined. AIS was the first published algorithm developed to generate all large itemsets in a transactional database. It focused on enhancing databases with the performance needed to process decision support. This technique is limited to rules with only one item in the consequent.
Advantage: The AIS algorithm can be used to find whether or not there is an association between items.
Disadvantage: The main disadvantage of the AIS algorithm is that it generates too many candidate itemsets that later turn out to be small. In addition, the data structure for these candidates has to be maintained.
> SETM Algorithm:
This algorithm is quite similar to the AIS algorithm. The SETM algorithm also makes multiple passes over the database. In the first pass, it counts the support of single items and determines which of them are frequent in the database. Then, it generates the candidate itemsets by extending the large itemsets of the previous pass. In addition, the SETM algorithm records the TIDs (transaction IDs) of the generating transactions along with the candidate itemsets.
Advantage: While generating candidate itemsets, the SETM algorithm stores the candidate itemsets together with their TIDs (transaction IDs) in a sequential manner.
Disadvantage: Every itemset is associated with a TID, so more space is required to store the huge number of TIDs.
> FP Growth
FP Growth stands for Frequent Pattern Growth. The FP Growth algorithm represents the data in the form of an FP tree, or Frequent Pattern tree, and is therefore a method of mining frequent itemsets. This algorithm is an improvement on the Apriori algorithm: there is no need for candidate generation to find the frequent patterns. The frequent pattern tree structure maintains the association between the itemsets.
A Frequent Pattern Tree is a tree structure built from the initial itemsets of the data. The main purpose of the FP tree is to mine the most frequent patterns. Every node of the FP tree represents an item of an itemset. The root node represents the null value, whereas the lower nodes represent the itemsets of the data. The association between a node and the nodes below it, that is, between itemsets, is maintained while creating the tree.
For Example:
Image 4
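To make the tree structure concrete, here is a minimal sketch of how an FP tree could be built. It covers only tree construction; the header table, node links, and the mining step of FP Growth are left out. The class name, function name, toy transactions, and the tie-breaking rule (alphabetical order for equally frequent items) are assumptions made for this illustration.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item, a count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item          # None for the root (the "null" node)
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second scan: insert each transaction as a path from the root.
    for t in transactions:
        # Keep frequent items only, most frequent first (ties: alphabetical).
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1      # shared prefixes only increase the count
            node = child
    return root


def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for item in sorted(node.children):
        child = node.children[item]
        lines.append("  " * depth + f"{item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines


# Toy transactions (hypothetical items).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
root = build_fp_tree(transactions, min_count=2)
print("\n".join(dump(root)))
```

Transactions that share a prefix of frequent items share a path in the tree, which is how the structure compresses the data without generating candidate itemsets.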
Advantages of Market Basket Analysis
There are many advantages to implementing Market Basket Analysis in marketing. Market Basket Analysis (MBA) can be applied to customer data from point of sale (PoS) systems.
It helps retailers by:
- Increasing customer engagement
- Boosting sales and increasing RoI
- Improving customer experience
- Optimizing marketing strategies and campaigns
- Helping to understand customers better
- Identifying customer behavior and patterns
How does Market Basket Analysis look from the Customer’s perspective?
Let us take an example from Amazon, the world’s largest eCommerce platform. From a customer’s perspective, Market Basket Analysis is like shopping at a supermarket. It observes all the items that customers buy together in a single purchase, then shows the most closely related products that customers tend to buy in one purchase.
Image 5
Implementing Market Basket Analysis from Scratch in Python
The Apriori algorithm works in the following steps:
- First, define the minimum support and confidence for the association rules.
- Find all the itemsets (subsets of items) in the transactions with higher support than the minimum support.
- Find all the rules for these itemsets with higher confidence than the minimum confidence.
- Sort these association rules in decreasing order of a chosen measure, such as lift.
- Analyze the rules along with their confidence and support.
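The steps above can be sketched end to end on toy data. This sketch makes several assumptions: the transactions and thresholds are made up, frequent itemsets are found by brute-force enumeration rather than Apriori pruning (fine for a handful of items, not for real data), and the rules are sorted by lift, defined as confidence(A ⇒ B) / support(B), because the steps do not name a sort key.

```python
from itertools import combinations

# Toy transactions and thresholds (hypothetical values).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
MIN_SUPPORT = 0.4       # step 1: minimum support
MIN_CONFIDENCE = 0.7    # step 1: minimum confidence
n = len(transactions)


def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Step 2: itemsets whose support reaches the minimum (brute force).
items = sorted({i for t in transactions for i in t})
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if count(frozenset(c)) / n >= MIN_SUPPORT
]

# Step 3: rules A => B from each frequent itemset with enough confidence.
rules = []
for itemset in frequent:
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            b = itemset - a
            conf = count(itemset) / count(a)
            if conf >= MIN_CONFIDENCE:
                rules.append({
                    "antecedent": a,
                    "consequent": b,
                    "support": count(itemset) / n,
                    "confidence": conf,
                    "lift": conf / (count(b) / n),
                })

# Step 4: sort in decreasing order (here: by lift, an assumption).
rules.sort(key=lambda rule: rule["lift"], reverse=True)

# Step 5: inspect the rules with their support and confidence.
for rule in rules:
    print(sorted(rule["antecedent"]), "->", sorted(rule["consequent"]),
          round(rule["support"], 3), round(rule["confidence"], 3),
          round(rule["lift"], 3))
```

Each rule keeps its antecedent and consequent as separate sets, so the direction of the rule is never ambiguous when it is printed.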
The Dataset
In this implementation, we use the Store Data dataset, which is publicly available on Kaggle. This dataset contains a total of 7501 transaction records, where every record lists the items sold in one transaction.
Market Basket Analysis using the Apriori method
First, we import the necessary libraries. The apyori package provides an implementation of the Apriori algorithm, which we import in order to run it.
import pandas as pd
import numpy as np
from apyori import apriori
Next, we read the dataset downloaded from Kaggle. The file has no header row, so its first row is the first transaction; that is why we pass header=None here.
st_df = pd.read_csv("store_data.csv", header=None)
st_df
Output:
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
7496 | butter | light mayo | fresh bread | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7497 | burgers | frozen vegetables | eggs | french fries | magazines | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7498 | chicken | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7499 | escalope | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7500 | eggs | frozen smoothie | yogurt cake | low fat yogurt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7501 rows × 20 columns
Once we have read the dataset, we need the list of items in every transaction. We run two loops: one over the transactions and one over the columns of each transaction. The resulting list of lists works as the training set from which we generate the association rules.
# converting the dataframe into a list of lists
# Note: range(1, 7501) starts at row 1, so the first transaction (row 0) is not
# included, and str() turns missing cells (NaN) into the literal item "nan".
l = []
for i in range(1, 7501):
    l.append([str(st_df.values[i, j]) for j in range(0, 20)])
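As an alternative sketch, the transactions can be loaded without the placeholder item "nan" by dropping empty cells while reading. This version uses the standard-library csv module instead of pandas and reads every row, so results obtained from it would differ from the outputs shown in this article. The function name and the in-memory sample are assumptions; the sample rows mirror the first rows of the preview above.

```python
import csv
import io


def load_transactions(lines):
    """Read one transaction per CSV row, dropping empty cells and empty rows."""
    transactions = []
    for row in csv.reader(lines):
        items = [cell.strip() for cell in row if cell.strip()]
        if items:
            transactions.append(items)
    return transactions


# Stand-in for the file contents; trailing commas play the role of missing cells.
sample = "burgers,meatballs,eggs\nchutney,,\nturkey,avocado,\n"
transactions = load_transactions(io.StringIO(sample))
print(transactions)

# With the real file, the same function could be used as:
# with open("store_data.csv", newline="") as f:
#     transactions = load_transactions(f)
```

Because no "nan" strings are created, the mined itemsets would contain only real products.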
With the list of items ready, we run the Apriori algorithm, which learns the association rules from this list. We set the minimum support to 0.0045 and the minimum confidence to 0.2. We take a minimum lift of 3, and a minimum length of 2 because we want to find associations among at least two items.
# applying the apriori algorithm
association_rules = apriori(l, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
After running the above code, we have the list of association rules between the items. To see the itemsets behind these rules, run the code below.
for result in association_results:
    print(result[0])
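Lift is used as a threshold above but is not defined in this article. Under the usual definition, lift(A ⇒ B) = support(A ∪ B) / (support(A) × support(B)), which equals confidence(A ⇒ B) / support(B); a value above 1 means A and B occur together more often than they would if they were independent. The sketch below shows the arithmetic. The function names and the toy numbers are assumptions; the last check uses the support, confidence, and lift values reported for the first rule in the output further below.

```python
import math


def lift(support_ab, support_a, support_b):
    """Lift of A => B from the three support values."""
    return support_ab / (support_a * support_b)


def is_strong(support, confidence, lift_value,
              min_support=0.0045, min_confidence=0.2, min_lift=3):
    """Check a rule against the three thresholds used in this article."""
    return (support >= min_support
            and confidence >= min_confidence
            and lift_value >= min_lift)


# Toy numbers (made up): A in 80% of baskets, B in 80%, both in 60%.
toy = lift(0.6, 0.8, 0.8)
print(round(toy, 4))  # below 1: A and B co-occur less than independence predicts

# Values reported for the first rule in the article's output.
support_rule = 0.004533333333333334
confidence_rule = 0.2905982905982906
lift_rule = 4.843304843304844
print(is_strong(support_rule, confidence_rule, lift_rule))

# Since lift = confidence / support(consequent), the consequent's support
# can be backed out from the reported numbers.
implied_support_consequent = confidence_rule / lift_rule
print(round(implied_support_consequent, 4))
```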
Output:
frozenset({'light cream', 'chicken'})
frozenset({'mushroom cream sauce', 'escalope'})
frozenset({'pasta', 'escalope'})
frozenset({'herb & pepper', 'ground beef'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'light cream', 'chicken'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'cooking oil', 'ground beef'})
frozenset({'mushroom cream sauce', 'nan', 'escalope'})
frozenset({'nan', 'pasta', 'escalope'})
frozenset({'spaghetti', 'frozen vegetables', 'ground beef'})
frozenset({'olive oil', 'frozen vegetables', 'milk'})
frozenset({'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'olive oil', 'frozen vegetables'})
frozenset({'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'grated cheese', 'ground beef'})
frozenset({'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'ground beef'})
frozenset({'spaghetti', 'herb & pepper', 'ground beef'})
frozenset({'olive oil', 'milk', 'ground beef'})
frozenset({'nan', 'tomato sauce', 'ground beef'})
frozenset({'spaghetti', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'olive oil', 'milk'})
frozenset({'soup', 'olive oil', 'mineral water'})
frozenset({'whole wheat pasta', 'nan', 'olive oil'})
frozenset({'nan', 'shrimp', 'pasta'})
frozenset({'spaghetti', 'olive oil', 'pancakes'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'nan', 'cooking oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'ground beef'})
frozenset({'spaghetti', 'frozen vegetables', 'milk', 'mineral water'})
frozenset({'nan', 'frozen vegetables', 'milk', 'olive oil'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'olive oil'})
frozenset({'spaghetti', 'nan', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'nan', 'grated cheese', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'spaghetti', 'nan', 'herb & pepper', 'ground beef'})
frozenset({'nan', 'milk', 'olive oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'nan', 'milk', 'olive oil'})
frozenset({'soup', 'nan', 'olive oil', 'mineral water'})
frozenset({'spaghetti', 'nan', 'olive oil', 'pancakes'})
frozenset({'spaghetti', 'milk', 'mineral water', 'nan', 'frozen vegetables'})
Next, we display the rule, support, confidence, and lift for each of the above association rules using a for loop.
for item in association_results:
    # item[0] is the frozenset of items in the itemset. Set order is arbitrary,
    # so items[0] -> items[1] shows only two members of the itemset and is not
    # necessarily the antecedent -> consequent of the rule.
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # item[1] is the support of the itemset.
    print("Support: " + str(item[1]))

    # item[2] is the list of ordered statistics; take its first entry and read
    # the confidence (index 2) and the lift (index 3).
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("-----------------------------------------------------")
Output:
Rule: light cream -> chicken
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: mushroom cream sauce -> escalope
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: pasta -> escalope
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: herb & pepper -> ground beef
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: whole wheat pasta -> olive oil
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: nan -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> cooking oil
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: mushroom cream sauce -> nan
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: nan -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: olive oil -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> grated cheese
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: herb & pepper -> mineral water
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: spaghetti -> herb & pepper
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: olive oil -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: nan -> tomato sauce
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> olive oil
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: whole wheat pasta -> nan
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
Rule: nan -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: nan -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> nan
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: spaghetti -> milk
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
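Many of the rules above involve the placeholder item "nan", and the printed pair is simply the first two members of each itemset. A possible way to print clearer rules is to read the antecedent and consequent from each ordered statistic and to skip itemsets that contain the placeholder. The sketch below assumes that apyori's result records expose the fields items, support, and ordered_statistics, and that each ordered statistic exposes items_base, items_add, confidence, and lift; stand-in namedtuples with those field names are defined here so the example runs without the library. The sample record reuses the numbers of the first rule above, with the direction light cream -> chicken assumed for illustration.

```python
from collections import namedtuple

# Stand-ins mirroring the assumed field names of apyori's result records.
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)


def format_rules(results, placeholder="nan"):
    """Return one line per rule, skipping itemsets that contain `placeholder`."""
    lines = []
    for record in results:
        if placeholder in record.items:
            continue  # artifact of missing cells, not a real product
        for stat in record.ordered_statistics:
            base = ", ".join(sorted(stat.items_base))
            add = ", ".join(sorted(stat.items_add))
            lines.append(
                f"{base} -> {add} | support={record.support:.4f} "
                f"confidence={stat.confidence:.4f} lift={stat.lift:.4f}"
            )
    return lines


# Hand-built sample records (values taken from the output above).
sample_results = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004533333333333334,
        ordered_statistics=[
            OrderedStatistic(
                frozenset({"light cream"}), frozenset({"chicken"}),
                0.2905982905982906, 4.843304843304844,
            )
        ],
    ),
    RelationRecord(
        items=frozenset({"nan", "light cream", "chicken"}),
        support=0.004533333333333334,
        ordered_statistics=[
            OrderedStatistic(
                frozenset({"light cream"}), frozenset({"nan", "chicken"}),
                0.2905982905982906, 4.843304843304844,
            )
        ],
    ),
]

for line in format_rules(sample_results):
    print(line)
```

With real results, `format_rules(association_results)` would be the corresponding call, provided the field names assumed here match the installed version of the library.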
End Notes
In this article, we discussed Market Basket Analysis and walked through a step-by-step implementation in Python.
We implemented Market Basket Analysis using the Apriori algorithm. The FP Growth and AIS algorithms can also be used.
If you have any questions or feedback, feel free to share them in the comments section below.
Thank You!
Image Sources-
- Image 1 – https://miro.medium.com/max/1200/1*ZqZCewk1zZOghuIDqsp6VA.png
- Image 2 – https://www.sciencedirect.com/topics/computer-science/market-basket-analysis
- Image 3 – https://rb.gy/1s7fbc
- Image 4 – https://i.imgur.com/G9svOR8.png
- Image 5 – https://datamathstat.files.wordpress.com/2018/02/untitled.png
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Author: Amruta Kadlaskar Data source: https://www.analyticsvidhya.com/