Guide for building an End-to-End Logistic Regression Model

user1

3 years ago

This article was published as a part of the Data Science Blogathon

In this blog, we’ll go over everything you need to know about Logistic Regression to get started and build a model in Python. If you’re new to machine learning and have never built a model before, don’t worry; after reading this, I’m confident you’ll be able to do so.

For those who are new to this, let’s start with a basic understanding of machine learning before moving on to Logistic Regression.

What is Machine Learning?

In simple terms, the Machine learning model uses algorithms in which the machine learns from the data just like humans learn from their experiences. Machine learning allows computers to find hidden insights without being explicitly programmed.

Types of Machine Learning algorithms

Based on the output type and task done, machine learning models are classified into the following types

Logistic Regression falls under the Supervised Learning type. Let’s learn more about it.

Supervised Learning

It’s a type of Machine Learning that uses labelled data from the past. Models are trained using already labelled samples.

Example: You have past data of the football premier league and based on that data and previous match results you predict which team will win the next game.

Supervised learning is further divided into two types-

Regression – target/output variable is continuous.
Classification – target/output variable is categorical.

Logistic Regression is a Classification model. It helps to make predictions where the output variable is categorical. With this let’s understand Logistic Regression in detail.

What is Logistic Regression?

As previously stated, Logistic Regression is used to solve classification problems. Models are trained on historical labelled datasets and aim to predict which category new observations will belong to.

Below are few examples of binary classification problems which can be solved using logistic regression-

The probability of a political candidate winning or losing the next election.
Whether a machine in manufacturing will stop running in a few days or not.
Filtering email as spam or not spam.

Logistic regression is well suited when we need to predict a binary answer (only 2 possible values like yes or no).

The term logistic regression comes from “Logistic Function,” which is also known as “Sigmoid Function”. Let us learn more about it.

Logistic/Sigmoid Function

The sigmoid function, commonly known as the logistic function, predicts the likelihood of a binary outcome occurring. The function takes any value and converts it to a number between 0 and 1. The Sigmoid Function is a machine learning activation function that is used to introduce non-linearity to a machine learning model.

The formula of Logistic Function is:

When we plot the above equation, we get S shape curve like below.

The key point from the above graph is that no matter what value of x we use in the logistic or sigmoid function, the output along the vertical axis will always be between 0 and 1.

When the result of the sigmoid function is greater than 0.5, we classify the label as class 1 or positive class; if it’s less than 0.5, we can classify it as a negative class or 0.

In Logistic Regression, iterative optimization algorithms like Gradient Descent or probabilistic methods like Maximum Likelihood are used to get the “best fit” S curve.

Let’s understand the mathematics behind the sigmoid function.

Logistic regression is derived from Linear regression bypassing its output value to the sigmoid function and the equation for the Linear Regression is –

In Linear Regression we try to find the best-fit line by changing m and c values from the above equation and y (output) can take any values from -infinity to +infinity. But, Logistic regression predicts the probability of outcome which can be between 0 to 1. So, to convert those values between 0 to 1 we use the sigmoid function.

after getting our output value we need to see how our model works, for that, we need to calculate the loss function. The loss function tells us how much our predicted output differ from the actual output. A good model should have less loss value. Let’s see how to calculate the loss function.

When y=1, the predicted y value should be close to 1 to reduce the loss. Now Let’s see when our actual output value is 0.

When y=0, the predicted y value should be close to 0 to reduce the loss.

Let’s move on to the implementation of the Logistic Regression model now that we’ve covered the basics.

Step by step implementation of Logistic Regression Model in Python

Based on parameters in the dataset, we will build a Logistic Regression model in Python to predict whether an employee will be promoted or not.

For everyone, promotion or appraisal cycles are the most exciting times of the year. Final promotions are only disclosed after employees have been evaluated on a variety of criteria, which causes a delay in transitioning to new responsibilities. We will build a Machine Learning model to predict who is qualified for promotion to speed up the process.

You can get more understanding of the problem statement and download the dataset from Supervised learning.

Importing Libraries

We’ll begin by loading the necessary libraries for creating a Logistic Regression model.

import numpy as np
import pandas as pd
#Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#We will use sklearn for building logistic regression model
from sklearn.linear_model import LogisticRegression

Loading Dataset

We’ll
use the HR Analytics dataset from the link above. We’ll start by loading the
dataset from the downloaded CSV file with the code below.

# Loading dataset from CSV file
data= pd.read_csv("train_LZdllcl.csv",sep=",")

Understanding the Data for Logistic Regression

It’s always a good idea to learn more about data after loading it, such as the shape of the data and statistical information about the columns in a dataset. We can achieve all of this with the code below :

#shape of dataset
print("shape of dataframe is : ", data.shape)
#Let's look at first 5 rows of dataset
data.head(5)
# summary of data
data.info()
#Get Statistical details of data
data.describe()

There are a total of 14 variables in this dataset, with a total of 54808 observations. “is_promoted” is our Target Variable, which has two categories encoded as 1 (promoted) and 0 (not promoted) rest all are input features. In addition, we can observe that our dataset contains both numerical and Categorical features.

Data Cleaning

Data cleaning is a crucial stage in the data preprocessing process. We’ll remove columns with only one unique value because their variance will be 0 and they won’t help us anticipate anything.

Let’s see whether there are any columns that
only have one unique value.

#Checking the unique value counts in columns
featureValues={}
for d in data.columns.tolist():
    count=data[d].nunique()
    if count==1:
        featureValues[d]=count
# List of columns having same 1 unique value        
cols_to_drop= list(featureValues.keys())
print("Columns having 1 unique value are :n",cols_to_drop)

This signifies that there isn’t any column having only 1 unique value.

We’ll now drop the employee_id column because it’s merely a unique identifier, and
then verify each field in the dataset for null value percentages.

#Drop employee_id column as it is just a unique id
data.drop("employee_id",inplace=True,axis=1)
#Checking null percentage
data.isnull().mean()*100

previous_year_rating and education both features have null values. As a result, we will impute those null values instead of dropping them. Following our examination of those columns, we discovered that –

For rows with a null previous_year_rating, we can see that their length of service is 1, which could be why they don’t have a previous year rating. As a result, we’ll use 0 to impute null values.
For the education column, we will impute null values with mode.

#fill missing value
data["previous_year_rating"]= data["previous_year_rating"].fillna(0)
#change type to int
data["previous_year_rating"]= data["previous_year_rating"].astype("int")
#Find out mode value for education
data["education"].mode()
#fill missing value with mode 
data["education"]= data["education"].fillna("Bachelor's")

Now, we do not have any null or missing values in our data. So, let’s
proceed to our next step. There are no null or
missing values in our data now. So, let’s go on to the next step.

Exploratory Data Analysis before creating a Logistic Regression Model

Getting insights from data and visualizing them is an important stage in machine learning since it provides us with a better view of features and their relationships.

Let’s look at the target variable’s distribution in the dataset.

# cchart for distribution of target variable
fig= plt.figure(figsize=(10,3) )
fig.add_subplot(1,2,1)
a= data["is_promoted"].value_counts(normalize=True).plot.pie()
fig.add_subplot(1,2,2)
churnchart=sns.countplot(x=data["is_promoted"])
plt.tight_layout()
plt.show()

We can observe from the above charts that, promoted employee data is less than non-promoted employee data, indicating that there is a class imbalance because class 0 has more data points or observations than class.

Let’s visualize if there is any relationship between the target variable and other variables.

# Visualize relationship between promoted and other features
fig= plt.figure(figsize=(10,5) )
fig.add_subplot(1,3,1)
ar_6=sns.boxplot(x=data["is_promoted"],y=data["length_of_service"])
fig.add_subplot(1,3,2)
ar_6=sns.boxplot(x=data["is_promoted"],y=data["avg_training_score"])
fig.add_subplot(1,3,3)
ar_6=sns.boxplot(x=data["is_promoted"],y=data["previous_year_rating"])
plt.tight_layout()
plt.show()

For an employee If the avg_training_score value is higher then the chances of getting promoted are more.

We will plot correlations between different variables using a heatmap.

#correlation between features
corr_plot = sns.heatmap(data.corr(),annot = True,linewidths=3 )
plt.title("Correlation plot")
plt.show()

None of the features is highly correlated with each other except age and length of
the service.

Feature Engineering

In feature engineering, we apply domain expertise to produce new features from raw data, or we convert or encode features. We’ll encode categorical features or make dummy features out of them in this section.

#Converting Categorical columns into one hot encoding
data["gender"]=data["gender"].apply(lambda x: 1 if x=="m" else 0)
#list of columns
cols = data.select_dtypes(["object"]).columns
#Create dummy variables
ds=pd.get_dummies(data[cols],drop_first=True)
ds
#concat newly created columns with original dataframe
data=pd.concat([data,ds],axis=1)
#Drop original columns
data.drop(cols,axis=1,inplace=True)

Train-Test Split

We will divide the dataset into two subsets: train and test. To perform the train-test split, we’ll use Scikit-learn machine learning.

Train subset – we will use this subset to fit/train the model
Test subset – we will use this subset to evaluate our model

from sklearn.model_selection import train_test_split
#split data into dependent variables(X) and independent variable(y) that we would predict
y = data.pop("is_promoted")
X = data
#Let’s split X and y using Train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,train_size=0.8)
#get shape of train and test data
print("train size X : ",X_train.shape)
print("train size y : ",y_train.shape)
print("test size X : ",X_test.shape)
print("test size y : ",y_test.shape)

After splitting the dataset, we have 43846 observations in the training subset and 10962 in the test subset.

After diving into the dataset let’s move on to the next phase of feature scaling.

Feature Scaling/Normalization

Why Feature scaling is important?

As previously stated, Logistic Regression uses Gradient Descent as one of the approaches for obtaining the best result, and feature scaling helps to speed up the Gradient Descent convergence process. When we have features that vary greatly in magnitude, the algorithm assumes that features with a large magnitude are more relevant than those with a small magnitude. As a result, when we train the model, those characteristics become more important.

Because of this feature scaling is required to put all features into the same range, regardless of their relevance.

Feature Scaling Techniques

We bring all the features into the same range using feature scaling. There are many ways to do feature scaling like normalization, standardization, robust scaling, min-max scaling, etc. But here we will discuss the Standardization technique that we are going to apply to our features.

In standardization, features will be scaled to have a mean of 0 and a standard deviation of 1. It does not scale to a preset range. The features are scaled using the formula below:

z = (x – u) / s

where u is the mean of the training samples and s is a standard deviation of the training samples.

Let’s see how to do feature scaling in python using Scikit-learn.

#Feature scaling
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

Class Imbalance

What is the class imbalance?

when a dataset has more data points or
observations belonging to one category and very few for another, we call
it a class imbalance problem since the distribution of class labels is
not balanced and skewed. Let’s see whether we have a class imbalance problem.

#check for distribution of labels
y_train.value_counts(normalize=True)

We can observe that the majority of the labels are from class 0 and only a few are from class 1.

If we use this distribution to develop our model, it may become biased towards predicting the majority class since there will be insufficient data to learn minority class patterns. The model will start predicting every new observation as 0 or majority class. (In our problem employee is not promoted). We’ll get more model accuracy here, but it won’t be a decent model because it won’t predict class 1 or minority class, which is a crucial class.

As a result, we must consider class imbalance when developing our Logistic Regression model.

How to Handle Class Imbalance?

There are a variety of approaches to dealing with class imbalance, such as increasing minority class samples or decreasing majority class samples to ensure that both classes have the same distribution.

Because we’re using the Scikit-learn machine library to create the model, it has a logistic regression implementation that supports class weighting. We will use the inbuilt parameter “class_weight” while creating an instance of the Logistic Regression model.

Both the majority and minority classes will be given separate weights. During the training phase, the weight differences will influence the classification of the classes.

The purpose of adding class weights is to penalize the minority class for misclassification by setting a higher class weight while decreasing the weight for the majority class.

Build and Train Logistic Regression model in Python

o implement Logistic Regression, we will use the Scikit-learn library. We’ll start by building a base model with default parameters, then look at how to improve it with Hyperparameter Tuning.

As previously stated, we will use the “class_weight” parameter to address the problem of class imbalance. Let’s start by creating our base model with the code below.

#import library
from sklearn.linear_model import LogisticRegression
#make instance of model with default parameters except class weight
#as we will add class weights due to class imbalance problem
lr_basemodel =LogisticRegression(class_weight={0:0.1,1:0.9})
# train model to learn relationships between input and output variables
lr_basemodel.fit(X_train,y_train)

After training our model on the training dataset, we used our model to predict values for the test dataset and recorded them in the y_pred_basemodel variable.

Let’s look at which metrics to use and how to evaluate our base model.

Model Evaluation Metrics

To evaluate performance or our model we will be using “f1 score” as this is a class imbalance problem using accuracy as a performance metrics is not good also, we can say that f1 score is the go-to metric when we have a class imbalance problem. The formula for calculating the F1 score is as follows:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

Precision is the ratio of accurately predicted positive observations to the total predicted positive observations.

Precision = TP/TP+FP

Recall is the ratio of accurately predicted positive observations to all observations in actual class – yes.

Recall = TP/TP+FN

F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.

Let’s evaluate our base model using the f1 score.

from sklearn.metrics import f1_score
print("f1 score for base model is : " , f1_score(y_test,y_pred_basemodel))

We got a 0.37 f1 score on our base model created using default parameters.

Up to this point, we saw how to create a logistic regression model using default parameters.

Now let’s increase model performance and evaluate it again after tuning hyperparameters of the model.

Hyperparameter Optimization for the Logistic Regression Model

Model parameters (such as weight, bias, and so on) are learned from data, whereas hyperparameters specify how our model should be organized. The process of finding the optimum fit or ideal model architecture is known as hyperparameter tuning. Hyperparameters control the overfitting or underfitting of the model. Hyperparameter tuning can be done using algorithms like Grid Search or Random Search.

We will use Grid Search which is the most basic method of searching optimal values for hyperparameters. To tune hyperparameters, follow the steps below:

Create a model instance of the Logistic Regression class
Specify hyperparameters with all possible values
Define performance evaluation metrics
Apply cross-validation
Train the model using the training dataset
Determine the best values for the hyperparameters given.

We can use the below code to implement hyperparameter tuning in python using the Grid Search method.

#Hyperparameter tuning
# define model/create instance
lr=LogisticRegression()
#tuning weight for minority class then weight for majority class will be 1-weight of minority class
#Setting the range for class weights
weights = np.linspace(0.0,0.99,500)
#specifying all hyperparameters with possible values
param= {'C': [0.1, 0.5, 1,10,15,20], 'penalty': ['l1', 'l2'],"class_weight":[{0:x ,1:1.0 -x} for x in weights]}
# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
#Gridsearch for hyperparam tuning
model= GridSearchCV(estimator= lr,param_grid=param,scoring="f1",cv=folds,return_train_score=True)
#train model to learn relationships between x and y
model.fit(X_train,y_train)

After fitting the model, we will extract the best fit values for all specified hyperparameters.

# print best hyperparameters
print("Best F1 score: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)

We will now build our Logistic Regression model using the above values we got by tuning Hyperparameters.

Build Model using optimal values of Hyperparameters

Let’s use the below code to build our model again.

#Building Model again with best params
lr2=LogisticRegression(class_weight={0:0.27,1:0.73},C=20,penalty="l2")
lr2.fit(X_train,y_train)

After training our final model it’s time to evaluate our Logistic Regression model using chosen metrics.

Model Evaluation

We will evaluate our model on Test Dataset. First, we will predict values on the Test dataset.

We chose “f1 score” as our performance metric above, but let’s look at the scores for all of the metrics, including confusion metrics, precision, recall, ROC-AUC score, and ultimately f1 score, for learning purposes.

Then, we’ll compare our final model’s f1 score to our base model to see if it’s improved.

We’ll use the code below to calculate the score
for various metrics:

# predict probabilities on Test and take probability for class 1([:1])
y_pred_prob_test = lr2.predict_proba(X_test)[:, 1]
#predict labels on test dataset
y_pred_test = lr2.predict(X_test)
# create onfusion matrix
cm = confusion_matrix(y_test, y_pred_test)
print("confusion Matrix is :nn",cm)
print("n")
# ROC- AUC score
print("ROC-AUC score  test dataset:  t", roc_auc_score(y_test,y_pred_prob_test))
#Precision score
print("precision score  test dataset:  t", precision_score(y_test,y_pred_test))
#Recall Score
print("Recall score  test dataset:  t", recall_score(y_test,y_pred_test))
#f1 score
print("f1 score  test dataset :  t", f1_score(y_test,y_pred_test))

We can see that by tuning hyperparameters, we were able to improve the performance of our model since our F1 Score for the final model (0.43) is higher than that of the base model (0.37). After the hyperparameter tuning model got a 0.88 ROC-AUC score.

With this, we were able to construct our logistic regression model and test it on the Test dataset. More feature engineering, hyperparameter optimization, and cross-validation techniques can improve its performance even more.

Conclusion

We began our learning journey by understanding the basics of machine learning and logistic regression. Then we moved on to the implementation of a Logistic Regression model in Python. We learned key steps in Building a Logistic Regression model like Data cleaning, EDA, Feature engineering, feature scaling, handling class imbalance problems, training, prediction, and evaluation of model on the test dataset. Apart from that, we learned how to use Hyperparameter Tuning to improve the performance of our model and avoid overfitting and underfitting.

I hope you find this information useful and will try it out.

Connect with me on LinkedIn.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.