Site icon Zdataset

A Comprehensive Guide on Market Basket Analysis

This article was published as a part of the Data Science Blogathon

Overview

Introduction

Nowadays Machine Learning is helping the Retail Industry in many different ways. You can imagine that from forecasting the performance of sales to identify the buyers, there are many applications of machine learning(ML) in the retail industry. “Market Basket Analysis” is one of the best applications of machine learning in the retail industry. By analyzing the past buying behavior of customers, we can find out which are the products that are bought frequently together by the customers.

Image 1

In this article, we will cover a hands-on guide on Market Basket Analysis, its components comprehensively and then deep dive into Market Basket Analysis including how to perform it in Python on a real-world dataset.

Table of Contents

  1. What is Market Basket Analysis?
  2. What is Association Rule?
  3. Algorithms used in Market Basket Analysis
  4. Advantages of Market Basket Analysis
  5. How does Market Basket Analysis look from Customer’s Perspective?
  6. Implementing Market Basket Analysis from scratch in Python
  7. End Notes

What is Market Basket Analysis?

Frequent itemset mining leads to the discovery of associations and correlations between items in huge transactional or relational datasets. With vast amounts of data continuously being collected and stored, many industries are becoming interested in mining such kinds of patterns from their databases. The disclosure of “Correlation Relationships” among huge amounts of transaction records can help in many decision-making processes such as the design of catalogs, cross-marketing, and behavior customer shopping Analysis.

 A popular example of frequent itemset mining is Market Basket Analysis. This process identifies customer buying habits by finding associations between the different items that customers place in their “shopping baskets” as you can see in the following fig. The discovery of this kind of association will be helpful for  retailers or marketers to develop marketing strategies by gaining insight into which items
are frequently bought together by customers.

For example, if customers are buying milk, how probably are they to also buy bread (and which kind of bread) on the same trip to the supermarket? This information may lead to increase sales by helping retailers to do selective marketing and plan their ledge space.

Image 2

Suppose just think of the universe as the set of items available at the store, then each item has a Boolean variable that represents the presence or absence of that item. Now each basket can then be represented by a Boolean vector of values that are assigned to these variables. The Boolean vectors can be analyzed of buying patterns that reflect items that are frequently associated or bought together. Such patterns will be represented in the form of association rules.

What is Association Rule for Market basket Analysis?

Let I = {I1, I2,…, Im} be an itemset. Let D, the data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an identifier, called a TID(or Tid). Let A be a set of items(itemset). T is the Transaction which is said to contain A if A ⊆ T. An Association Rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I,  and A ∩B = φ.

The rule A ⇒ B holds in the data set(transactions) D with supports, where ‘s’ is the percentage of transactions in D that contain A ∪ B (that is the union of set A and set B, or, both A and B). This is taken as the probability, P(A ∪ B). Rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contains B. This is taken to be the conditional probability, like P(B|A). That is,

support(A⇒ B) =P(A ∪  B) 

confidence(A⇒ B) =P(B|A)

Rules that satisfy both a minimum support threshold (called min sup) and a minimum confidence threshold (called min conf ) are called “Strong”.

Confidence(A⇒ B) = P(B|A) =
support(A ∪ B) /support(A) =
support count(A ∪ B) / support count(A)

Generally, Association Rule Mining can be viewed in a two-step process:-

1. Find all Frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a pre-established minimum support count, min sup
.

2. Generate Association Rules from the Frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.

Association Rule Mining is primarily used when you want to identify an association between different items in a set, then find frequent patterns in a transactional database, relational databases(RDBMS).

The best example of the association is as you can see in the following image.

Image 3

Algorithms used in Market Basket Analysis

There are Multiple Techniques and Algorithms are used in Market Basket Analysis. One of the important objectives is “to predict the probability of items that are being bought together by customers”.

> Apriori Algorithm:

Apriori Algorithm is a widely-used and well-known Association Rule algorithm and is a popular algorithm used in market basket analysis. It is also considered accurate and overtop AIS and SETM algorithms. It helps to find frequent itemsets in transactions and identifies association rules between these items. The limitation of the Apriori Algorithm is frequent itemset generation. It needs to scan the database many times which leads to increased time and reduce performance as it is a computationally costly step because of a huge database. It uses the concept of Confidence, Support.

> AIS Algorithm:

The AIS algorithm creates multiple passes on the entire database or transactional data. During every pass, it scans all
transactions. As you can see, in the first pass, it counts the support of separate items and determines then which of them are frequent in the database. Huge itemsets of every pass are enlarged to generate candidate itemsets. After each scanning of a transaction, the common itemsets between these itemsets of the previous pass and then items of this transaction are
determined. This algorithm was the first published algorithm which is developed to generate all large itemsets in a
transactional database. It was focusing on the enhancement of databases with the necessary performance to process
decision support. This technique is bounded to
only one item in the consequent.

Advantage: The AIS algorithm was used to find whether there was an association between items or not. 

Disadvantage: The main disadvantage of the AIS algorithm is that it generates too many candidates set that after turn out to be small. As well as the data structure is to be maintained.

> SETM Algorithm:

This Algorithm is quite similar to the AIS algorithm. The SETM algorithm creates collective passes over the database. As you can see, in the first pass, it counts the support of single items and then determines which of them are frequent in the
database. Then, it also generates the candidate itemsets by enlarging large itemsets of the previous pass. In addition to this, the SETM algorithm recalls the TIDs(transaction ids) of the generating transactions with the candidate itemsets.

Advantage: While generating candidate itemsets, SETM algorithm arranges candidate itemsets together with the TID(transaction Id) in a sequential manner.

Disadvantage: For every item set, there is an association with Tid, hence it requires more space to store a huge number of TIDs.

> FP Growth

FP Growth is known as Frequent Pattern Growth Algorithm. FP growth algorithm is a concept of representing the data in the form of an FP tree or Frequent Pattern. Hence FP Growth is a method of Mining Frequent Itemsets. This algorithm is an advancement to the Apriori Algorithm. There is no need for candidate generation to generate the frequent pattern. This frequent pattern tree structure maintains the association between the itemsets.

A Frequent Pattern Tree is a tree structure that is made with the earlier itemsets of the data. The main purpose of the FP tree is to mine the most frequent patterns. Every node of the FP tree represents an item of that itemset. The root node represents the null value whereas the lower nodes represent the itemsets of the data. The association of these nodes with the lower nodes that is between itemsets is maintained while creating the tree.

For Example:

Image 4

Advantages of Market Basket Analysis

There are many advantages to implementing Market Basket Analysis in marketing. Market basket Analysis(MBA) can be applied to data of customers from the point of sale (PoS) systems.

It helps retailers with:

How does Market Basket Analysis look from the Customer’s perspective?

Let us take an example from Amazon, the world’s largest eCommerce platform. From a customer’s perspective, Market Basket Analysis is like shopping at a supermarket. Generally, it observes all items bought by customers together in a single purchase. Then it shows the most related products together that customers will tend to buy in one purchase.

Image 5

Implementing Market Basket Analysis from Scratch in Python

The steps of working of the apriori algorithm can be given as:-

  1. First, define the minimum support and confidence for the association rule.
  2. Find out all the subsets in the transactions with higher support(sup) than the minimum support.
  3. Find all the rules for these subsets with higher confidence than minimum confidence.
  4. Sort these association rules in decreasing order.
  5. Analyze the rules along with their confidence and support.

The Dataset

In this implementation, we have to use the Store Data dataset that is publicly available on Kaggle. This dataset contains a total of 7501 transaction records where every record consists of the list of items sold in just one transaction.

Market Basket Analysis using the Apriori method

We are required to import the necessary libraries. Python provides the apyori as an API that is required to be imported to run the Apriori Algorithm.

import pandas as pd
import numpy as np
from apyori import apriori

Now we want to read the dataset that is downloaded from Kaggle. There is no header in the dataset and hence the first row contains the first transaction so that we have mentioned header = None here.

st_df=pd.read_csv("store_data.csv",header=None)
st_df

Output:

012345678910111213141516171819
0shrimpalmondsavocadovegetables mixgreen grapeswhole weat flouryamscottage cheeseenergy drinktomato juicelow fat yogurtgreen teahoneysaladmineral watersalmonantioxydant juicefrozen smoothiespinacholive oil
1burgersmeatballseggsNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2chutneyNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3turkeyavocadoNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
4mineral watermilkenergy barwhole wheat ricegreen teaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7496butterlight mayofresh breadNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7497burgersfrozen vegetableseggsfrench friesmagazinesgreen teaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7498chickenNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7499escalopegreen teaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7500eggsfrozen smoothieyogurt cakelow fat yogurtNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

7501 rows × 20 columns

Once we have read the dataset completely, we are required to get the list of items in every transaction. So we are going to run two loops. One will be for the total number of transactions, and the other will be for the total number of columns in every transaction. The list will work as a training set from where we can generate the list of Association Rules.

#converting dataframe into list of lists
l=[]
for i in range(1,7501):
    l.append([str(st_df.values[i,j]) for j in range(0,20)])

So we are ready with the list of items in our training set then we need to run the apriori algorithm which will learn the list of association rules from the training set i.e list. So, the minimum support here will be  0.0045 which is taken here as support. Now let us see that we have kept 0.2 as the min confidence. The minimum lift =3 is taken and the minimum length is considered as 2 because we have to find an association among a minimum of two items.

#applying apriori algorithm
association_rules = apriori(l, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)

After running the above line of code, we have generated the list of association rules between the items. So to see these rules, the below line of code needs to be run.

for i in range(0, len(association_results)):
    print(association_results[i][0])

Output:

frozenset({'light cream', 'chicken'})
frozenset({'mushroom cream sauce', 'escalope'})
frozenset({'pasta', 'escalope'})
frozenset({'herb & pepper', 'ground beef'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'light cream', 'chicken'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'cooking oil', 'ground beef'})
frozenset({'mushroom cream sauce', 'nan', 'escalope'})
frozenset({'nan', 'pasta', 'escalope'})
frozenset({'spaghetti', 'frozen vegetables', 'ground beef'})
frozenset({'olive oil', 'frozen vegetables', 'milk'})
frozenset({'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'olive oil', 'frozen vegetables'})
frozenset({'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'grated cheese', 'ground beef'})
frozenset({'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'ground beef'})
frozenset({'spaghetti', 'herb & pepper', 'ground beef'})
frozenset({'olive oil', 'milk', 'ground beef'})
frozenset({'nan', 'tomato sauce', 'ground beef'})
frozenset({'spaghetti', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'olive oil', 'milk'})
frozenset({'soup', 'olive oil', 'mineral water'})
frozenset({'whole wheat pasta', 'nan', 'olive oil'})
frozenset({'nan', 'shrimp', 'pasta'})
frozenset({'spaghetti', 'olive oil', 'pancakes'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'nan', 'cooking oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'ground beef'})
frozenset({'spaghetti', 'frozen vegetables', 'milk', 'mineral water'})
frozenset({'nan', 'frozen vegetables', 'milk', 'olive oil'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'olive oil'})
frozenset({'spaghetti', 'nan', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'nan', 'grated cheese', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'spaghetti', 'nan', 'herb & pepper', 'ground beef'})
frozenset({'nan', 'milk', 'olive oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'nan', 'milk', 'olive oil'})
frozenset({'soup', 'nan', 'olive oil', 'mineral water'})
frozenset({'spaghetti', 'nan', 'olive oil', 'pancakes'})
frozenset({'spaghetti', 'milk', 'mineral water', 'nan', 'frozen vegetables'})

Here we are going to display Rule, Support, and lift ratio for every above association rule by using for loop.

for item in association_results:
    # first index of the inner list
    # Contains base item and add item
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # second index of the inner list
    print("Support: " + str(item[1]))
    # third index of the list located at 0th position
    # of the third index of the inner list
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("-----------------------------------------------------")

Output:

Rule: light cream -> chicken
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: mushroom cream sauce -> escalope
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: pasta -> escalope
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: herb & pepper -> ground beef
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: whole wheat pasta -> olive oil
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: nan -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> cooking oil
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: mushroom cream sauce -> nan
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: nan -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: olive oil -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> grated cheese
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: herb & pepper -> mineral water
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: spaghetti -> herb & pepper
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: olive oil -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: nan -> tomato sauce
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> olive oil
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: whole wheat pasta -> nan
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
Rule: nan -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: nan -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> nan
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: spaghetti -> milk
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------

End Notes

In this article, we have discussed  Market Basket Analysis. We have implemented it from scratch and looked at step-by-step implementation by using Python.

Finally, we implemented Market Basket Analysis using Apriori Algorithm. We can also use FP Growth and AIS algorithms.

If you have any doubts or feedback, feel free to share it in the comments section below.

Thank You!

Image Sources-

  1. Image 1 – https://miro.medium.com/max/1200/1*ZqZCewk1zZOghuIDqsp6VA.png
  2. Image 2 – https://www.sciencedirect.com/topics/computer-science/market-basket-analysis
  3. Image 3 – https://rb.gy/1s7fbc
  4. Image 4 – https://i.imgur.com/G9svOR8.png
  5. Image 5 – https://datamathstat.files.wordpress.com/2018/02/untitled.png

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Exit mobile version