What Are n-grams and How to Implement Them in Python?

  • by user1
  • 20 March, 2022

This article was published as a part of the Data Science Blogathon

Dear readers,

In this blog, we will learn what n-grams are and explore them on text data in Python. It’s completely alright even if you have never heard of the term “n-grams” before. We will study and implement n-grams right from scratch!

The objective of the blog is to analyze different types of n-grams on the given text data and hence decide which n-gram works the best for our data.

So, let’s begin…

Agenda

  • What are n-grams?
  • How are n-grams classified?
  • An example of n-grams
  • Step-by-step implementation of n-grams in Python
  • Results
  • Conclusion

What are n-grams?

N-grams are contiguous sequences of words, symbols, or tokens in a document. In technical terms, they can be defined as the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.

That’s alright! But aren’t you curious what these items are? They are nothing fancy; they simply refer to words, letters, or symbols.
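To make the idea of “items” concrete, here is a minimal sketch that slides the same window over words and over letters. The helper name sliding_ngrams is made up for this illustration and is not part of any library.

```python
def sliding_ngrams(items, n):
    """Return every run of n neighbouring items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Items as words: split the sentence on whitespace first.
word_bigrams = sliding_ngrams("I reside in Bengaluru".split(), 2)

# Items as letters: treat the string as a list of characters.
char_trigrams = sliding_ngrams(list("gram"), 3)

print(word_bigrams)
print(char_trigrams)
```

The window logic is identical in both cases; only the choice of item changes.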

How are n-grams classified?

Did you notice the ‘n’ in the term “n-grams”? Can you guess what this ‘n’ possibly is?

Remember the days we learned how to input an array by first inputting its size (n), or even a number, from the user? Generally, we used to store such values in a variable declared as ‘n’! Apart from programming, you must have extensively encountered ‘n’ in the formulae for the sum of a series and so on. What do you think ‘n’ was over there?

Summing up, ‘n’ is just a variable that can take positive integer values such as 1, 2, 3, and so on. Here, ‘n’ is the number of items in each sequence.

Thinking along the same lines, n-grams are classified into the following types, depending on the value that ‘n’ takes.

n    Term
1    Unigram
2    Bigram
3    Trigram
n    n-gram

As clearly depicted in the table above, when n=1, it is said to be a unigram. When n=2, it is said to be a bigram and so on.

Now, you must be wondering why we need so many different types of n-grams. This is because different types of n-grams suit different applications. You should try different n-grams on your data before concluding which one works best for your text analysis. For instance, trigrams and 4-grams have been reported to work well for spam filtering.

An example of n-grams

Let’s understand n-grams practically with the help of the following sentence:

“I reside in Bengaluru”

Sl. No.    Type of n-gram    Generated n-grams
1          Unigram           [“I”, “reside”, “in”, “Bengaluru”]
2          Bigram            [“I reside”, “reside in”, “in Bengaluru”]
3          Trigram           [“I reside in”, “reside in Bengaluru”]

For the time being, let’s not consider the removal of stop-words 🙂

From the table above, it’s clear that unigram means taking only one word at a time, bigram means taking two words at a time, and trigram means taking three words at a time. We will be implementing only up to trigrams in this blog. Feel free to go further and explore 4-grams, 5-grams, and so on using your takeaways from the blog!
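The table can be reproduced in a few lines. This is only a sketch with a made-up helper, word_ngrams; it also shows that a sentence of m words yields m - n + 1 n-grams, which is why the lists get shorter as n grows.

```python
sentence = "I reside in Bengaluru"
words = sentence.split()

def word_ngrams(words, n):
    # Join each window of n neighbouring words back into one string.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

table = {n: word_ngrams(words, n) for n in (1, 2, 3)}
for n, grams in table.items():
    # A sentence of m words yields m - n + 1 n-grams (for n <= m).
    print(n, len(grams), grams)
```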

Step-by-step implementation of n-grams in Python

And here comes the most interesting section of the blog! Unless we practically implement what we learn, there is absolutely no fun in learning it! So, let’s proceed to code and generate n-grams on Google Colab in Python.

Steps:

  1. Explore the dataset
  2. Feature extraction
  3. Train-test split
  4. Basic pre-processing
  5. Code to generate N-grams
  6. Creating unigrams
  7. Creating bigrams
  8. Creating trigrams

1. Explore the dataset:

I will be using the “Sentiment Analysis for Financial News” dataset. The sentiments are from the perspective of retail investors. It is an open-source Kaggle dataset. Download it (see the dataset link in the References) before moving ahead.

Let’s begin, as usual, by importing the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
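#on Matplotlib 3.6 and later this style is named 'seaborn-v0_8' (plain 'seaborn' was removed in 3.8)
plt.style.use(style='seaborn')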
%matplotlib inline

Now, let’s read the dataset and understand it using the pandas library:

df=pd.read_csv('all-data.csv',encoding = "ISO-8859-1")
df.head()
df.info()

You can see that the dataset has 4846 rows and two columns, namely, ‘Sentiment’ and ‘News Headline’.

NOTE:
When you download the dataset directly from Kaggle, you will notice that the columns have no names! So, I added the column names to the all-data.csv file before reading it with pandas. Ensure that you do not miss this step.
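If you would rather not edit the file by hand, pandas can attach the column names while reading. The sketch below is an alternative to the step described in the note, not what I did originally, and it uses two made-up rows in place of the real file.

```python
import io
import pandas as pd

# Two made-up rows standing in for the header-less all-data.csv file.
sample_csv = io.StringIO(
    'neutral,"Sample headline one"\n'
    'positive,"Sample headline two"\n'
)

# header=None tells pandas the first row is data; names supplies the labels.
df_sample = pd.read_csv(sample_csv, header=None,
                        names=["Sentiment", "News Headline"])
print(df_sample.columns.tolist())
```

For the real file, you would pass 'all-data.csv' and encoding="ISO-8859-1" instead of the in-memory sample.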

df.isna().sum()

The data is just perfect with absolutely no missing values at all! That’s our luck indeed!

df['Sentiment'].value_counts()

We can undoubtedly infer that the dataset includes three categories of sentiments:

  • Neutral
  • Positive
  • Negative

Out of 4846 sentiments, 2879 have been found to be neutral, 1363 positive, and the remaining 604 negative.
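To see how the three categories compare, the counts reported above can be turned into percentage shares. This is plain arithmetic on those numbers; the variable names are made up for the illustration.

```python
# Counts as reported above (negative = 4846 - 2879 - 1363).
sentiment_counts = {"neutral": 2879, "positive": 1363, "negative": 604}

total = sum(sentiment_counts.values())
# Share of each sentiment, in percent, rounded to one decimal place.
shares = {label: round(100 * count / total, 1)
          for label, count in sentiment_counts.items()}
print(total, shares)
```

On the real data frame, df['Sentiment'].value_counts(normalize=True) gives the same shares as fractions.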

2. Feature extraction:

Our objective is to predict the sentiment of a given news headline. Obviously, the ‘News Headline’ column is our only feature and the ‘Sentiment’ column is our target variable.

y=df['Sentiment'].values
y.shape

x=df['News Headline'].values
x.shape

Both outputs return a shape of (4846,), which means each is a one-dimensional array of 4846 values: one news headline per row in x and one sentiment label per row in y.
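A shape with a single number and a trailing comma can look odd at first. Here is a tiny sketch with made-up values showing how it differs from a two-dimensional array that has one column.

```python
import numpy as np

# Three made-up headlines standing in for the real column values.
arr = np.array(["headline a", "headline b", "headline c"])

flat_shape = arr.shape                    # one axis only
column_shape = arr.reshape(-1, 1).shape   # rows plus an explicit column axis
print(flat_shape, column_shape)
```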

3. Train-test split:

In any machine learning, deep learning, or NLP (Natural Language Processing) task, splitting the data into train and test sets is a crucial step. The train_test_split() function provided by sklearn is widely used for this. So, let’s begin by importing it:

from sklearn.model_selection import train_test_split

I have split the data this way: 60% for train and the remaining 40% for test. I started with 20% for the test set and kept playing with the test_size parameter, only to realize that the 60-40 split gives more useful and meaningful insights from the trigrams generated. Don’t worry, we will be looking at trigrams in just a while.

(x_train,x_test,y_train,y_test)=train_test_split(x,y,test_size=0.4)
x_train.shape
y_train.shape
x_test.shape
y_test.shape

On executing the code above, you will observe that 2907 rows have been used as train data and the remaining 1939 rows as test data.
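Where do 2907 and 1939 come from? A quick check, assuming the test share is rounded up to a whole number of rows (which matches the numbers reported above):

```python
import math

n_samples = 4846
test_size = 0.4

# 40% of 4846 is 1938.4, which is rounded up to whole rows.
n_test = math.ceil(test_size * n_samples)
# Every remaining row goes to the train set.
n_train = n_samples - n_test
print(n_train, n_test)
```

The rows themselves are picked at random, so your train and test sets will differ from mine on every run unless you also pass a fixed random_state to train_test_split().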

Our next step is to convert these NumPy arrays to Pandas data frames and thus create two data frames, namely, df_train and df_test. The former is created by concatenating the x_train and y_train arrays. The latter is created by concatenating the x_test and y_test arrays. This is necessary to count the number of positive, negative, and neutral sentiments in both the train and test datasets, which we will be doing in a while.

df1=pd.DataFrame(x_train)
df1=df1.rename(columns={0:'news'})
df2=pd.DataFrame(y_train)
df2=df2.rename(columns={0:'sentiment'})
df_train=pd.concat([df1,df2],axis=1)
df_train.head()
df3=pd.DataFrame(x_test)
df3=df3.rename(columns={0:'news'})
df4=pd.DataFrame(y_test)
df4=df4.rename(columns={0:'sentiment'})
df_test=pd.concat([df3,df4],axis=1)
df_test.head()
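The same data frames can also be built in one call each by passing a dictionary of columns. This is a shorter alternative sketch, shown here on made-up stand-ins for x_train and y_train.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-ins for x_train and y_train.
x_demo = np.array(["Profit rose", "Sales fell"])
y_demo = np.array(["positive", "negative"])

# One call builds both columns with their final names.
df_demo = pd.DataFrame({"news": x_demo, "sentiment": y_demo})
print(df_demo)
```

Building both columns together also avoids mixing up the intermediate data frames.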

4. Basic pre-processing of train and test data

Here, in order to pre-process our text data, we will remove punctuation from the ‘news’ column of the train and test data using the punctuation constant provided by the string library.

#removing punctuation
#string.punctuation holds the ASCII punctuation characters
import string
string.punctuation
#defining the function to remove punctuation
def remove_punctuation(text):
  #leave float values such as NaN untouched
  if isinstance(text,float):
    return text
  #keep only the characters that are not punctuation
  return "".join(ch for ch in text if ch not in string.punctuation)
#overwriting the news column with the punctuation-free text
df_train['news']=df_train['news'].apply(remove_punctuation)
df_test['news']=df_test['news'].apply(remove_punctuation)
df_train.head()
#punctuation is removed from the news column in the train dataset

Compare the above output with the previous output of df_train. You can observe that punctuation has been successfully removed from the text in the feature column (news column) of the training dataset. The same code removes punctuation from the news column of the test data frame as well. You can optionally view df_test.head() to check it.
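Python’s built-in str.translate() can do the same clean-up without a character-by-character loop. The sketch below is an alternative to remove_punctuation(), with a made-up function name and a made-up sample sentence.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    # Leave non-string values (for example NaN) untouched.
    if not isinstance(text, str):
        return text
    return text.translate(_DROP_PUNCT)

cleaned = strip_punctuation("Profit rose 5%, the company said.")
print(cleaned)
```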

As a next step, we have to remove stopwords from the news column. For this, let’s use the stopwords provided by nltk as follows:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

We will be using this to generate n-grams in the very next step.

5. Code to generate n-grams

Let’s code a custom function to generate n-grams for a given text as follows:

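#function to generate n-grams
#params:
#text - the text for which we have to generate n-grams
#ngram - the number of words in each n-gram (1,2,3,4 etc., default value=1)
def generate_N_grams(text,ngram=1):
  #build the stop-word set once per call instead of once per word
  stop_words=set(stopwords.words('english'))
  #the comparison is case-sensitive, so capitalised words such as "The" are kept
  words=[word for word in text.split(" ") if word not in stop_words]
  print("Sentence after removing stopwords:",words)
  #zip the word list with copies of itself shifted by 1..ngram-1 positions
  temp=zip(*[words[i:] for i in range(0,ngram)])
  ans=[' '.join(gram) for gram in temp]
  return ans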

The above function takes two parameters, text and ngram: the text for which we want to generate n-grams and the number of words in each n-gram, respectively. First, the text is split into words and the stop words are dropped. The example section showed how to generate n-grams manually for a given text, and generate_N_grams() codes that very same logic: it takes n neighbouring words at a time from the text, where n is the value of the ngram parameter.

Let’s check the working of the function with the help of a simple example to create bigrams as follows:

#sample!
generate_N_grams("The sun rises in the east",2)
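If you want to try the zip-based windowing without installing NLTK, here is a self-contained sketch. DEMO_STOP_WORDS is a hypothetical two-word list made up for this example (it is not the NLTK list), and demo_ngrams is likewise a made-up name.

```python
# Hypothetical mini stop-word list so the sketch runs without NLTK.
DEMO_STOP_WORDS = {"in", "the"}

def demo_ngrams(text, n=1, stop_words=DEMO_STOP_WORDS):
    words = [w for w in text.split() if w not in stop_words]
    # words[0:], words[1:], ..., words[n-1:] are shifted copies of the list;
    # zip pairs them up position by position and stops at the shortest copy.
    shifted = [words[i:] for i in range(n)]
    return [" ".join(gram) for gram in zip(*shifted)]

bigrams = demo_ngrams("The sun rises in the east", 2)
lowered_bigrams = demo_ngrams("The sun rises in the east".lower(), 2)
print(bigrams)
print(lowered_bigrams)
```

Notice that the capitalised “The” survives in the first call because the stop-word comparison is case-sensitive; lower-casing the text first removes it.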

Great!!!

We are now set to proceed!!!

6. Creating unigrams:

Let’s follow the steps below to create unigrams for the news column of the df_train data frame:

  1. Create unigrams for each of the news records belonging to each of the three categories of sentiments
  2. Store the word and its count in the corresponding dictionaries
  3. Convert these dictionaries to corresponding data frames
  4. Fetch the top 10 most frequently used words
  5. Visualize the most frequently used words for all the 3 categories: positive, negative and neutral.

Have a look at the code below to understand the steps better.

from collections import defaultdict
positiveValues=defaultdict(int)
negativeValues=defaultdict(int)
neutralValues=defaultdict(int)
#count the unigrams in the news column of df_train, separately for each sentiment
#unigram counts where sentiment="positive"
for text in df_train[df_train.sentiment=="positive"].news:
  for word in generate_N_grams(text):
    positiveValues[word]+=1
#unigram counts where sentiment="negative"
for text in df_train[df_train.sentiment=="negative"].news:
  for word in generate_N_grams(text):
    negativeValues[word]+=1
#unigram counts where sentiment="neutral"
for text in df_train[df_train.sentiment=="neutral"].news:
  for word in generate_N_grams(text):
    neutralValues[word]+=1
#focus on the more frequently occurring words for every sentiment=>
#sort in descending order of count (2nd element) in each of positiveValues,negativeValues and neutralValues
df_positive=pd.DataFrame(sorted(positiveValues.items(),key=lambda x:x[1],reverse=True))
df_negative=pd.DataFrame(sorted(negativeValues.items(),key=lambda x:x[1],reverse=True))
df_neutral=pd.DataFrame(sorted(neutralValues.items(),key=lambda x:x[1],reverse=True))
pd1=df_positive[0][:10]
pd2=df_positive[1][:10]
ned1=df_negative[0][:10]
ned2=df_negative[1][:10]
nud1=df_neutral[0][:10]
nud2=df_neutral[1][:10]
plt.figure(1,figsize=(16,4))
plt.bar(pd1,pd2, color ='green',
        width = 0.4)
plt.xlabel("Words in positive dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in positive dataframe-UNIGRAM ANALYSIS")
plt.savefig("positive-unigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(ned1,ned2, color ='red',
        width = 0.4)
plt.xlabel("Words in negative dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in negative dataframe-UNIGRAM ANALYSIS")
plt.savefig("negative-unigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(nud1,nud2, color ='yellow',
        width = 0.4)
plt.xlabel("Words in neutral dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in neutral dataframe-UNIGRAM ANALYSIS")
plt.savefig("neutral-unigram.png")
plt.show()
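The counting and top-10 steps above can also be sketched with collections.Counter from the standard library, which keeps the tally and the ranking in one object. The headlines below are made up and stand in for one sentiment class; this is an alternative to the defaultdict approach, not a replacement for it.

```python
from collections import Counter

# Made-up headlines standing in for the news of one sentiment class.
demo_headlines = ["profit rose", "profit fell", "sales rose"]

counts = Counter()
for headline in demo_headlines:
    # For unigrams, each word is one item to count.
    counts.update(headline.split())

# most_common(k) returns the k most frequent (word, count) pairs.
top = counts.most_common(2)
print(top)
```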

7. Creating bigrams:

Repeat the same steps which we followed to analyze our data using unigrams, except that you have to pass 2 as the ngram argument while invoking the generate_N_grams() function. You can optionally consider changing the names of the data frames, which I have done.

positiveValues2=defaultdict(int)
negativeValues2=defaultdict(int)
neutralValues2=defaultdict(int)
#count the bigrams in the news column of df_train, separately for each sentiment
#bigram counts where sentiment="positive"
for text in df_train[df_train.sentiment=="positive"].news:
  for word in generate_N_grams(text,2):
    positiveValues2[word]+=1
#bigram counts where sentiment="negative"
for text in df_train[df_train.sentiment=="negative"].news:
  for word in generate_N_grams(text,2):
    negativeValues2[word]+=1
#bigram counts where sentiment="neutral"
for text in df_train[df_train.sentiment=="neutral"].news:
  for word in generate_N_grams(text,2):
    neutralValues2[word]+=1
#focus on the more frequently occurring bigrams for every sentiment=>
#sort in descending order of count (2nd element) in each of positiveValues2,negativeValues2 and neutralValues2
df_positive2=pd.DataFrame(sorted(positiveValues2.items(),key=lambda x:x[1],reverse=True))
df_negative2=pd.DataFrame(sorted(negativeValues2.items(),key=lambda x:x[1],reverse=True))
df_neutral2=pd.DataFrame(sorted(neutralValues2.items(),key=lambda x:x[1],reverse=True))
pd1bi=df_positive2[0][:10]
pd2bi=df_positive2[1][:10]
ned1bi=df_negative2[0][:10]
ned2bi=df_negative2[1][:10]
nud1bi=df_neutral2[0][:10]
nud2bi=df_neutral2[1][:10]
plt.figure(1,figsize=(16,4))
plt.bar(pd1bi,pd2bi, color ='green',width = 0.4)
plt.xlabel("Words in positive dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in positive dataframe-BIGRAM ANALYSIS")
plt.savefig("positive-bigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(ned1bi,ned2bi, color ='red',
        width = 0.4)
plt.xlabel("Words in negative dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in negative dataframe-BIGRAM ANALYSIS")
plt.savefig("negative-bigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(nud1bi,nud2bi, color ='yellow',
        width = 0.4)
plt.xlabel("Words in neutral dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in neutral dataframe-BIGRAM ANALYSIS")
plt.savefig("neutral-bigram.png")
plt.show()

8. Creating trigrams:

Repeat the same steps which we followed to analyze our data using unigrams, except that you have to pass 3 as the ngram argument while invoking the generate_N_grams() function. You can optionally consider changing the names of the data frames, which I have done.

positiveValues3=defaultdict(int)
negativeValues3=defaultdict(int)
neutralValues3=defaultdict(int)
#count the trigrams in the news column of df_train, separately for each sentiment
#trigram counts where sentiment="positive"
for text in df_train[df_train.sentiment=="positive"].news:
  for word in generate_N_grams(text,3):
    positiveValues3[word]+=1
#trigram counts where sentiment="negative"
for text in df_train[df_train.sentiment=="negative"].news:
  for word in generate_N_grams(text,3):
    negativeValues3[word]+=1
#trigram counts where sentiment="neutral"
for text in df_train[df_train.sentiment=="neutral"].news:
  for word in generate_N_grams(text,3):
    neutralValues3[word]+=1
#focus on the more frequently occurring trigrams for every sentiment=>
#sort in descending order of count (2nd element) in each of positiveValues3,negativeValues3 and neutralValues3
df_positive3=pd.DataFrame(sorted(positiveValues3.items(),key=lambda x:x[1],reverse=True))
df_negative3=pd.DataFrame(sorted(negativeValues3.items(),key=lambda x:x[1],reverse=True))
df_neutral3=pd.DataFrame(sorted(neutralValues3.items(),key=lambda x:x[1],reverse=True))
pd1tri=df_positive3[0][:10]
pd2tri=df_positive3[1][:10]
ned1tri=df_negative3[0][:10]
ned2tri=df_negative3[1][:10]
nud1tri=df_neutral3[0][:10]
nud2tri=df_neutral3[1][:10]
plt.figure(1,figsize=(16,4))
plt.bar(pd1tri,pd2tri, color ='green',
        width = 0.4)
plt.xlabel("Words in positive dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in positive dataframe-TRIGRAM ANALYSIS")
plt.savefig("positive-trigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(ned1tri,ned2tri, color ='red',
        width = 0.4) 
plt.xlabel("Words in negative dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in negative dataframe-TRIGRAM ANALYSIS")
plt.savefig("negative-trigram.png")
plt.show()
plt.figure(1,figsize=(16,4))
plt.bar(nud1tri,nud2tri, color ='yellow',
        width = 0.4) 
plt.xlabel("Words in neutral dataframe")
plt.ylabel("Count")
plt.title("Top 10 words in neutral dataframe-TRIGRAM ANALYSIS")
plt.savefig("neutral-trigram.png")
plt.show()
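Since the unigram, bigram and trigram sections repeat the same counting logic, it can be folded into one loop over the sentiments and the values of n. The sketch below uses made-up (news, sentiment) pairs in place of the df_train rows and a made-up helper, window_ngrams, with no stop-word removal.

```python
from collections import Counter

# Made-up (news, sentiment) pairs standing in for df_train rows.
rows = [
    ("profit rose sharply", "positive"),
    ("profit rose again", "positive"),
    ("sales fell sharply", "negative"),
]

def window_ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# counts[(sentiment, n)] maps each n-gram to its frequency.
counts = {}
for news, sentiment in rows:
    for n in (1, 2, 3):
        counts.setdefault((sentiment, n), Counter()).update(window_ngrams(news, n))

top_positive_bigram = counts[("positive", 2)].most_common(1)
print(top_positive_bigram)
```

Keying the counters by (sentiment, n) means a 4-gram or 5-gram analysis only needs another value in the inner loop.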

Results

From the above graphs, we can conclude that trigrams perform the best on our train data. This is because their most frequent entries are more informative: profit rose EUR, year earlier for the positive data frame; corresponding period, period 2007, and names of companies such as HEL for the negative data frame; and Finland, company said, and again names of companies such as HEL, OMX Helsinki and so on for the neutral data frame.

Conclusion

Therefore, in this blog, we have successfully learned what n-grams are and how we can generate any number of n-grams for a given text dataset easily in Python and thus analyze our dataset thoroughly.

You can find the entire code from here.

Hope you found my blog useful!

Thanks for reading!

References:

1. Dataset – https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news

2. Video on n-grams – https://youtu.be/MZIm_5NN3MY

About Me:

I am Nithyashree V, a final year BTech Computer Science and Engineering student. I love learning such cool technologies and putting them into practice, especially to observe how they help us solve society’s challenging problems. My areas of interest include Artificial Intelligence, Data Science, and Natural Language Processing.



Author: Nithyashree V. Source: https://www.analyticsvidhya.com/