Quick Hacks To Save Machine Learning Model using Pickle and Joblib

by user1
21 March, 2022

This article was published as a part of the Data Science Blogathon

Introduction
Loading dataset and creating our model
Saving model
1. Using Pickle
2. Using Sklearn Joblib
Conclusion
References

INTRODUCTION

Saving models is a crucial part of the realm of model development. To understand what it means let’s understand it with a very simple example:

Suppose you are working on a practice problem related to house rent given lots of data points and input features. It’s quite common to perform EDA, Preprocessing(may need to create additional features), and feeding our data to our model. In this scenario even if we use the simplest Linear Regression Model (multiple variables) it may become huge in size due to all the input_features and all the parameters which will be time-consuming to re-train again and again for use.

So the simplest thing to do is to save our model and later load it for inference or prediction at a later time. While Keras models API provides the [model.save()] functionality for saving our deep learning model is limited to the realm of deep learning and for most beginners, in ML it’s quite confusing to save their model. Also due to estimators having a huge number of parameters, it is quite advisable to save them. So in this article, we will look into few small hacks to save our model

Loading Dataset And Creating Our Model

We are going to use a house price prediction dataset with a single feature area(for demonstration purposes). Our job will be to predict the price given the area. For keeping things simple we will have only 4-5 data points and the model we will be using will be a Linear Regression Model which just fits a straight line to our dataset and calculates the square of predicted difference from actual differences over all data points*

The square in cost function ensures that negative values are nullified

Creating Model Files

We are now quickly going to create our model file in 5 steps which we will be saving for later use.

1. We will start by loading all the required dependencies.

# loading dependencies
import pandas as pd
import numpy as np
from sklearn import linear_model

2. Now we will be loading our data using pd.read_csv() function into a pandas dataframe(train_df) and use df.head() method to print first 5 rows.

# loading our data
train_df = pd.read_csv('/content/train.csv')
# viewing few files
train_df.head()

	area	price
0	2600	550000
1	3000	565000
2	3200	610000
3	3600	680000
4	4000	725000

3. To create our model we will be first creating a model object which will be actually a LinearRegression classifier and then fit our model with our training samples and training labels for which our model job will be to find the best straight line fit.

# creating the model object
model = linear_model.LinearRegression() # y = mx+b

# fitting model with X_train - area, y_train - price
model.fit(train_df[['area']],train_df.price)

After executing the above code output will look a bit like this

>> LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

4. As we know a straight line has a coefficient and an intercept in the equation, so we should check out those values as sklearn provides some handy attributes. These can be checked as

# checking coeffiecent - m
model.coef_

>> array([135.78767123])

# checking intercept - b
model.intercept_

>> 180616.43835616432

5. Finally for completeness sake one can test the model for predicting the price for a 5000sqft area house.

# predict model values - area = 5000
model.predict([[5000]])

>> array([859554.79452055])

Saving Model

It’s now time to save our created model. We are going to look into 2 quick hacks for the saving model. Also as a bonus, I will be providing guidelines on where to use which method.

Method 1 – Pickle – 2 Steps

Many of you will be familiar with the pickle module, however, if not it’s good to know that the pickle module allows you to pickle a file using de-serialization which means simply breaking down an object into its constituting components. For e.g, our model files attribute like the one we saw.

To save a file using pickle one needs to open a file, load it under some alias name and dump all the info of the model. This can be achieved using below code:

# loading library
import pickle

# create an iterator object with write permission - model.pkl
with open('model_pkl', 'wb') as files:
    pickle.dump(model, files)

After the above steps, one can see a file with the name model_pkl in the directory, and opening it will show something like this:

Directory As Shown In Google Collab

inside model_pkl file

One can load this file back again into a model using the same logic, here we are using the lr variable for referencing the model and then using it to predict the price for 5000sqft:

# load saved model
with open('model_pkl' , 'rb') as f:
    lr = pickle.load(f)

# check prediction

lr.predict([[5000]]) # similar

>> array([859554.79452055])

Benefits:

The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again, thus allowing for faster execution time.
Allows saving model in very little time.
Good For small models with fewer parameters like the one we used.

Method 2 – Joblib – 2 Steps

Joblib is an alternative to model saving in a way that it can operate on objects with large NumPy arrays/data as a backend with many parameters. It can be used as an individual module(refer here) or using the Sci-Kit Learn library. For simplicity’s sake, we will be using the second method.

-> First, we will import joblib from sklearn’s external class

# loading dependency
from sklearn.externals import joblib

To save the model we will use its dump functionality to save the model to the model_jlib file.

# saving our model # model - model , filename-model_jlib
joblib.dump(model , 'model_jlib')

After running the above code a file will be created with a filename and contents will be similar to the pickle file.

The directory

inside model_jlib file

Note: We didn’t use an iterator as the module saves the data onto disk rather than string-names. However, it accepts file-like objects.

To load the model we will be providing file-path or file object to the load function and storing it in the m_jlib variable, which we can later use for prediction.

# opening the file- model_jlib
m_jlib = joblib.load('model_jlib')

Finally for predicting we can call predict method on m_jlib and pass it a 2d array with values as 5000.

# check prediction
m_jlib.predict([[5000]]) # similar

>> array([859554.79452055])

Note predict methods assumes you provide data in a 2d format so we used [[5000]] meaning 5000 as an 2d array

Benefits:

Ideal for the large models having many parameters and can have large NumPy arrays in the backend.
Can only save the file to disk and not to a string.
Works similar to pickle `dump` and `load`
Most fitted for sklearn estimators.

Conclusion

Due to the time complexity involved in training large models, saving is becoming a crucial part of the data-science realm and with this article, I tried to introduce few quick ways to save them. However, it must be noted both the process works on the same concept of serialization(saving of data into its component form) and deserialization(restoring of data from the serialized chunks), thus it is advised to pickle or joblib the model from a trusted source.

Also for simplicity sake, we have used a Linear Regression model, but the same can be used to save models of different types like Logistic Regression, Decision Trees, SVM’s, and a lot more:)

Hope you have enjoyed reading the article and learned something in the process. Those who want to dive deeper can refer to the reference section and work along.

References

Collab Notebook: – File containing all the codes can be found here.

Job Lib:- For those who want to learn more about joblib, refer here.

Inspiration:- A humble and respectful thanks to code basics which inspire me to write the content.

CSV File – The training data used can be downloaded from hereThe media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Size: Unknown Price: Free Author: Purnendu Shukla Data source: https://www.analyticsvidhya.com/