Complete Guide on How to Use LightGBM in Python

  • by user1
  • 21 March, 2022

This article was published as a part of the Data Science Blogathon

Introduction

Gradient Boosted Decision Trees (GBDT) is a very popular machine learning algorithm with effective implementations such as XGBoost, and many optimization techniques build on it. However, its efficiency and scalability suffer when the data has many features. The major reason is that, for each feature, all of the data instances have to be scanned to estimate the gain of every possible split point, which is very time-consuming.


To solve this problem, LightGBM (Light Gradient Boosting Machine) uses two techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS excludes a significant portion of the data instances that have small gradients and uses only the remaining data to estimate the overall information gain. Since instances with large gradients play a greater role in the information-gain computation, GOSS can obtain an accurate estimate despite using a much smaller dataset than other models.
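To make the idea concrete, here is a minimal numpy sketch of GOSS-style sampling; the function name goss_sample and the ratios a and b are illustrative choices of ours, not LightGBM internals:

import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=42):
    # Keep the top-a fraction of instances by |gradient|, randomly
    # sample a b fraction of the rest, and give the sampled
    # small-gradient instances weight (1 - a) / b so that gradient
    # sums remain approximately unbiased.
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # descending by |gradient|
    top_k, rand_k = int(a * n), int(b * n)
    kept = order[:top_k]
    sampled = rng.choice(order[top_k:], size=rand_k, replace=False)
    idx = np.concatenate([kept, sampled])
    weights = np.ones(idx.size)
    weights[top_k:] = (1 - a) / b
    return idx, weights

idx, w = goss_sample(np.random.default_rng(0).normal(size=1000))
print(idx.size)  # 300 instances retained out of 1000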

With EFB, mutually exclusive features, i.e. features that rarely take non-zero values at the same time, are bundled together to reduce the number of features. This amounts to an effective form of feature reduction without compromising the accuracy of split-point determination.
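As a toy illustration of the bundling idea (not LightGBM's actual greedy bundling algorithm; f1 and f2 are made-up arrays), two features that never take non-zero values on the same row can be merged into one by shifting the value range of the second:

import numpy as np

# Two mutually exclusive sparse features: where one is non-zero, the other is zero.
f1 = np.array([0, 3, 0, 1, 0])
f2 = np.array([2, 0, 5, 0, 0])

# Shift f2 past f1's value range so splits on the bundled feature
# can still separate the two original features.
offset = f1.max() + 1
bundle = np.where(f2 != 0, f2 + offset, f1)
print(bundle)  # [6 3 9 1 0]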

Combined, these two techniques can speed up the training of a conventional GBDT by up to 20 times. LightGBM can thus be thought of as gradient boosting trees plus GOSS and EFB. You can access the official documentation here: https://lightgbm.readthedocs.io/.

The main features of the LGBM model are as follows:

  • Higher accuracy and faster training speed
  • Low memory utilization
  • Better accuracy than other boosting algorithms, with better handling of overfitting on smaller datasets
  • Parallel learning support
  • Compatibility with both small and large datasets

With the features and advantages listed above, LGBM has become a go-to algorithm in machine learning competitions involving tabular data, for both regression and classification problems.

Demystifying the Maths behind LGBM

We use decision trees to learn a function from the input space X to the gradient space G. Assume a training set with n instances x1, x2, …, xn, where each xi is a vector with s dimensions in the space X. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted g1, g2, …, gn. The decision tree splits each node at the most informative feature, the one that yields the largest information gain, and this gain is measured by the variance after splitting. The boosted model itself is a sum of trees fitted to these gradients, scaled by a learning rate lr:

Y = Base_tree(X) - lr * Tree1(X) - lr * Tree2(X) - lr * Tree3(X) - …

Formally, let O be the training dataset on a fixed node of the decision tree. The variance gain of splitting feature j at point d for this node is defined as:

$$V_{j|O}(d) = \frac{1}{n_O}\left(\frac{\left(\sum_{\{x_i \in O:\, x_{ij} \le d\}} g_i\right)^2}{n^j_{l|O}(d)} + \frac{\left(\sum_{\{x_i \in O:\, x_{ij} > d\}} g_i\right)^2}{n^j_{r|O}(d)}\right)$$

where $n_O = \sum I[x_i \in O]$, $n^j_{l|O}(d) = \sum I[x_i \in O : x_{ij} \le d]$ and $n^j_{r|O}(d) = \sum I[x_i \in O : x_{ij} > d]$.

(Source: https://ejmcm.com/article_9403_15c24bd9c676c28d90c3fc5fad8b42ea.pdf)
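In code, this gain can be computed directly from the definition; variance_gain below is a plain transcription of the formula into numpy, an illustrative helper rather than anything from the LightGBM API:

import numpy as np

def variance_gain(feature_values, gradients, d):
    # Variance gain of splitting the feature values at threshold d:
    # squared gradient sums on each side, normalized by the side
    # counts and by the node size.
    left = feature_values <= d
    right = ~left
    n = len(feature_values)
    n_l, n_r = left.sum(), right.sum()
    if n_l == 0 or n_r == 0:
        return 0.0  # all instances fall on one side: no gain
    return (gradients[left].sum() ** 2 / n_l
            + gradients[right].sum() ** 2 / n_r) / n

# toy example with made-up values
x_j = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.2, 0.3, 2.0])
print(variance_gain(x_j, g, d=2.0))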

Gradient-based One-Side Sampling (GOSS) keeps every instance with a large gradient and performs random sampling on the instances with small gradients: it retains the top fraction a of instances ranked by gradient magnitude (the set A) and randomly samples a fraction b of the remaining instances (the set B). The estimated variance gain of splitting feature j at point d is then computed over this smaller sample:

$$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n^j_l(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n^j_r(d)}\right)$$

where $A_l = \{x_i \in A : x_{ij} \le d\}$, $A_r = \{x_i \in A : x_{ij} > d\}$, similarly for $B_l$ and $B_r$, and the counts are taken over the sampled instances; the coefficient $\frac{1-a}{b}$ re-weights the sampled small-gradient instances.

(Source: https://ejmcm.com/article_9403_15c24bd9c676c28d90c3fc5fad8b42ea.pdf)

This is how GOSS estimates the split gain within LightGBM models.
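Tying the two sketches together, the estimated gain can be computed on a GOSS sample by carrying the (1 - a) / b weights into the gradient sums; goss_variance_gain below is again an illustrative helper, not LightGBM's implementation:

import numpy as np

def goss_variance_gain(feature_values, gradients, weights, d):
    # Weighted variant of the variance gain: the sampled small-gradient
    # instances carry the (1 - a) / b weight (e.g. from goss_sample
    # above) inside the gradient sums.
    left = feature_values <= d
    right = ~left
    n_l, n_r = left.sum(), right.sum()
    if n_l == 0 or n_r == 0:
        return 0.0  # degenerate split, no gain
    g_l = (weights[left] * gradients[left]).sum()
    g_r = (weights[right] * gradients[right]).sum()
    return (g_l ** 2 / n_l + g_r ** 2 / n_r) / len(feature_values)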

Coding an LGBM in Python

The LGBM model can be installed with pip; the command is “pip install lightgbm”. LightGBM also offers a scikit-learn-compatible API through which we can implement both classifier and regressor models, and both operate in a similar fashion. The dataset used here is the Titanic passengers dataset, which is used in the code below and can be found in my drive at this location.

( Dataset Link: https://drive.google.com/file/d/1uuFe0f2gjEE77-PL9LhMrEd6n-1L5vfW/view?usp=sharing )
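In a Colab or Jupyter notebook, the install command can be run directly in a cell (the leading “!” passes it to the shell):

!pip install lightgbm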

Code :

Importing all the libraries

import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics

Loading the data:

data = pd.read_csv('/content/SVMtrain.csv')
data.head()

Output :

Here we can see that there are 8 columns, of which PassengerId will be dropped and Embarked will be used as the target variable for this classification challenge.

Loading the variables:

# To define the input and output feature
x = data.drop(['Embarked','PassengerId'],axis=1)
y = data.Embarked
# train and test split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.33,random_state=42)

Loading and fitting the model:

Initializing the model is very similar to initializing any other scikit-learn model; the main difference is that LightGBM exposes many more parameter settings. We will set max_depth, learning_rate and random_state in the following code. In the fit call, we pass eval_metric and eval_set so that the model is evaluated during training itself.

Code :

model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42)
# note: a non-positive max_depth means no depth limit in LightGBM
model.fit(x_train, y_train, eval_set=[(x_test, y_test), (x_train, y_train)],
          verbose=20, eval_metric='logloss')

Output:

Since our dataset has very few instances, we first need to check for overfitting with the following code before proceeding to the next steps:

Code :

print('Training accuracy {:.4f}'.format(model.score(x_train,y_train)))
print('Testing accuracy {:.4f}'.format(model.score(x_test,y_test)))

Output :

Training accuracy 0.9647
Testing accuracy 0.8163

There is a noticeable gap between the training and testing accuracy, which suggests some overfitting on this small dataset, but the testing accuracy is still reasonable. We will see below how tuning can reduce this gap.

LGBM also comes with additional plotting utilities, such as plotting the feature importances, the metric evaluation, and the tree structure.

Code :

lgb.plot_importance(model)

Output :

If you do not pass eval_set during fitting, you will get an error when plotting the metric evaluation:

Code :

lgb.plot_metric(model)

Output

As you can clearly see here, the validation loss starts to increase after about the 100th iteration, a sign of overfitting. This can be addressed by tuning the model's hyperparameters, for example with early stopping as sketched below.
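As one illustration (assuming LightGBM ≥ 3.3, where the early_stopping callback is available; stopping_rounds=20 is an arbitrary, untuned choice), training can be stopped automatically once the validation loss stops improving:

model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42)
model.fit(x_train, y_train,
          eval_set=[(x_test, y_test)],
          eval_metric='logloss',
          callbacks=[lgb.early_stopping(stopping_rounds=20)])

We can also plot the tree using a built-in function.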

Code:

lgb.plot_tree(model,figsize=(30,40))

Output:

Now we will plot a few metrics using the sklearn library:

Code :

metrics.plot_confusion_matrix(model,x_test,y_test,cmap='Blues_r')

Output :
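Note that plot_confusion_matrix has been removed from recent scikit-learn releases; if the call above fails on your version, the newer equivalent is:

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test, cmap='Blues_r')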

Code :

print(metrics.classification_report(y_test,model.predict(x_test)))

Output :

From the confusion matrix combined with the classification report, we can clearly see that the model struggles to predict class 1 because of the few instances available for it; still, compared with the other ensemble algorithms, LGBM performs best here. We can perform the same process for a regression task by swapping the estimator for LGBMRegressor(), as sketched below.
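For completeness, here is a minimal regressor sketch under the same settings; y_train_reg and y_test_reg are hypothetical placeholders for a numeric target (the Titanic Embarked column is categorical, so a different dataset or column would be needed):

reg = lgb.LGBMRegressor(learning_rate=0.09, max_depth=-5, random_state=42)
reg.fit(x_train, y_train_reg,                      # y_train_reg: hypothetical numeric target
        eval_set=[(x_test, y_test_reg)],
        eval_metric='l2')
print('R^2 score: {:.4f}'.format(reg.score(x_test, y_test_reg)))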

End Notes:

In this article, we saw how to use an LGBM model and how it tackles the efficiency problem using GOSS and EFB. We then applied it to a real-life classification problem, where the overall process is very similar to that of other ML algorithms. The built-in plotting functionality also makes the library much more attractive and reduces the overall effort on the evaluation side.

Stay Safe and get vaccinated everyone.

Arnab Mondal

Data Engineer | Python Developer

https://www.linkedin.com/in/arnab1408/

Collab Notebook Link :

https://colab.research.google.com/drive/1KJAGA3NRyy7wMPEWPnBo1ooMOJtvKcrG?usp=sharing

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
