End-to-End Introduction to Evaluating Regression Models

This article was published as a part of the Data Science Blogathon.

The objective of any machine learning model is to learn patterns from data that can then be used to make predictions, answer questions, or simply reveal structure that is not otherwise evident. Most of the time, learning is iterative: a model learns some patterns from the data, we test it against new data that it did not encounter during training, we see how well it did, we tweak some parameters, and then we test it again. This process is repeated until we have a model that is good enough (although some real-world models that are merely satisfactory can still make a world of difference). The step in which we evaluate and test the model is where loss functions come into play. Evaluation metrics are an integral part of building regression models.

Loss functions take the model’s predicted values and compare them against the actual values. They estimate how well (or how poorly) the model maps the relationship between X (a feature, or independent variable, or predictor variable) and Y (the target, or dependent variable, or response variable). Sometimes just knowing that the model performs badly is not enough; we also need to know how far its predictions are from the actual values. By knowing the amount of deviation between the predicted value and the actual value, we can train our model accordingly. This difference between the actual value and the predicted value is called the loss. A high loss value means the model has poor performance.

There are many loss functions for evaluating regression models.

There is no “one function to rule them all”.

Choosing an appropriate loss function is crucial, and what makes one desirable depends on the data at hand. Every function has its own properties. Many factors contribute to the choice, such as the algorithm used, outliers in the data, and whether you need the function to be differentiable.

This article aims to present a list of loss functions for regression with their pros and cons. Although all of them can be implemented using libraries such as SciPy, PyTorch, scikit-learn, and Keras, I have implemented the code using NumPy, as it helps in gaining a better understanding of what is happening under the hood.

Without further ado, let’s get started.

Loss function vs Cost function

A function that calculates loss for 1 data point is called the loss function.

A loss function

A function that calculates loss for the entire data being used is called the cost function.

A cost function
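
The distinction can be sketched in a few lines of NumPy. This is only an illustration: the absolute error is used as the per-point loss, the mean is used as the aggregation, and the function names and sample values are made up for the example.

```python
import numpy as np

def absolute_loss(y_true, y_pred):
    # Loss: one value per data point (here, the absolute error).
    return np.abs(y_true - y_pred)

def cost(y_true, y_pred):
    # Cost: the per-point losses aggregated over the whole dataset
    # (here, by taking their mean).
    return np.mean(absolute_loss(y_true, y_pred))

# Illustrative values, not taken from any real dataset.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

per_point = absolute_loss(y_true, y_pred)  # array with one loss per sample
total_cost = cost(y_true, y_pred)          # single scalar for the dataset
```

The loss returns an array with the same shape as the inputs, while the cost collapses it into the single number an optimizer would minimize.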

Evaluation Metrics

Mean Absolute Error (MAE)

Mean absolute error, also known as L1 loss, is one of the simplest loss functions and an easy-to-understand evaluation metric. It is calculated by taking the absolute difference between the predicted values and the actual values and averaging it across the dataset. Mathematically speaking, it is the arithmetic average of absolute errors. MAE measures only the magnitude of the errors and doesn’t concern itself with their direction. The lower the MAE, the higher the accuracy of a model.

Mathematically, MAE can be expressed as follows,

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where y_i = actual value, \hat{y}_i = predicted value, n = sample size

import numpy as np

def mean_absolute_error(true, pred):
    abs_error = np.abs(true - pred)
    sum_abs_error = np.sum(abs_error)
    mae_loss = sum_abs_error / true.size
    return mae_loss

Pros of the Evaluation Metric:

 

Cons of the evaluation metric:

Mean Bias Error (MBE)

Bias in “Mean Bias Error” is the tendency of a measurement process to overestimate or underestimate the value of a parameter. Bias has only one direction, which can be either positive or negative. Mean Bias Error (MBE) is the mean of the signed difference between the actual values and the predicted values, so it quantifies the overall bias in the predictions. Which sign corresponds to overestimation depends on the order of subtraction: with the actual-minus-predicted convention used in the code below, a positive MBE means the model underestimates on average and a negative MBE means it overestimates. MBE is very similar to MAE, the only difference being that the absolute value is not taken. This evaluation metric should be handled carefully, as positive and negative errors can cancel each other out.

The formula for MBE,

$$\mathrm{MBE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)$$

where y_i is the actual value, \hat{y}_i is the predicted value, and n is the sample size.

def mean_bias_error(true, pred):
    # Signed errors: the absolute value is deliberately not taken
    bias_error = true - pred
    mbe_loss = np.mean(bias_error)
    return mbe_loss
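
The cancellation effect is easy to see on a small example. The sketch below uses made-up values in which the model is off by 5 on every point, alternating between too high and too low; the errors are taken as actual minus predicted.

```python
import numpy as np

# Illustrative values: every prediction is off by exactly 5,
# alternating between overestimation and underestimation.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([15.0, 15.0, 35.0, 35.0])

errors = y_true - y_pred           # [-5, 5, -5, 5]
mbe = np.mean(errors)              # signed errors cancel out
mae = np.mean(np.abs(errors))      # magnitudes do not cancel
```

Here the MBE is 0 even though the MAE is 5, so an MBE near zero says the model has no systematic bias, not that it is accurate.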

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Relative Absolute Error (RAE)

Relative absolute error is computed by taking the total absolute error and dividing it by the total absolute difference between the actual values and their mean.

RAE is expressed as,

$$\mathrm{RAE} = \frac{\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{n} \left| y_i - \bar{y} \right|}$$

where \bar{y} is the mean of the n actual values.

RAE measures the performance of a predictive model and is expressed as a ratio. It is never negative, and a good model will have a value close to zero, with zero being the best value. A value of one means the model does no better than always predicting the mean, and values above one mean it does worse. This error shows how the mean residual relates to the mean deviation of the target from its mean.

def relative_absolute_error(true, pred):
    true_mean = np.mean(true)
    # Total absolute error of the model
    abs_error_num = np.sum(np.abs(true - pred))
    # Total absolute error of a baseline that always predicts the mean
    abs_error_den = np.sum(np.abs(true - true_mean))
    rae_loss = abs_error_num / abs_error_den
    return rae_loss
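
To get a feel for the scale of RAE, it helps to compare a model against the baseline in the denominator. The following is a minimal sketch with made-up values; the helper name `rae` and the three sets of predictions are assumptions for illustration only.

```python
import numpy as np

def rae(y_true, y_pred):
    # Total absolute error divided by the total absolute error of a
    # baseline that always predicts the mean of the actual values.
    numerator = np.sum(np.abs(y_true - y_pred))
    denominator = np.sum(np.abs(y_true - np.mean(y_true)))
    return numerator / denominator

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean_baseline = np.full_like(y_true, np.mean(y_true))  # always predicts 3.0
good_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])        # close to the truth
bad_pred = np.array([5.0, 4.0, 3.0, 2.0, 1.0])         # reversed order

rae_baseline = rae(y_true, mean_baseline)
rae_good = rae(y_true, good_pred)
rae_bad = rae(y_true, bad_pred)
```

The mean baseline scores exactly 1, the close predictions score well below 1, and the reversed predictions score 2, which shows that the ratio is not capped at one.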

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Mean Absolute Percentage Error (MAPE)

Mean absolute percentage error is calculated by taking the difference between the actual value and the predicted value and dividing it by the actual value. The absolute value of this ratio is taken, averaged across the dataset, and expressed as a percentage. MAPE is also known as Mean Absolute Percentage Deviation (MAPD). It increases linearly with an increase in error. The smaller the MAPE, the better the model performance. Because the actual value appears in the denominator, MAPE is undefined when any actual value is zero.

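def mean_absolute_percentage_error(true, pred):
    # Take the absolute value of the whole ratio so that negative actual
    # values do not flip the sign; undefined if any actual value is zero
    abs_error = np.abs((true - pred) / true)
    sum_abs_error = np.sum(abs_error)
    mape_loss = (sum_abs_error / true.size) * 100
    return mape_loss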
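
One practical consequence of dividing by the actual value is that MAPE does not depend on the units of the target. The sketch below checks this with made-up values; the helper name `mape` and the factor of 1000 are arbitrary choices for the illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean of the absolute relative errors, expressed as a percentage.
    # Assumes no actual value is zero.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Illustrative values: relative errors are 10%, 10% and 0%.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

mape_original = mape(y_true, y_pred)
# Rescaling both arrays (for example, changing units) leaves MAPE unchanged.
mape_rescaled = mape(y_true * 1000, y_pred * 1000)
```

Both calls give the same percentage, which is what makes MAPE convenient for comparing models across targets measured on different scales.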

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Mean Squared Error (MSE)

MSE is one of the most common regression loss functions. In Mean Squared Error, also known as L2 loss, we calculate the error by squaring the difference between the predicted value and the actual value and averaging it across the dataset. MSE is also known as quadratic loss, as the penalty is proportional not to the error but to the square of the error. Squaring gives higher weight to outliers, while the gradient stays smooth and shrinks as the error approaches zero. Optimization algorithms benefit from this penalization of large errors, as it helps in finding the optimum values for parameters. MSE is never negative since the errors are squared, so its value ranges from zero to infinity. MSE increases quadratically with an increase in error. A good model will have an MSE value closer to zero.

def mean_squared_error(true, pred):
    squared_error = np.square(true - pred) 
    sum_squared_error = np.sum(squared_error)
    mse_loss = sum_squared_error / true.size
    return mse_loss
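
The effect of squaring on outliers can be made concrete by comparing MSE with MAE on the same predictions. This is a small sketch with made-up values, in which one prediction out of five is replaced by a large miss; the helper names are local to the example.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean(np.square(y_true - y_pred))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
clean_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])    # every error has size 1
outlier_pred = np.array([11.0, 11.0, 12.0, 12.0, 22.0])  # last error has size 10

mae_clean, mse_clean = mae(y_true, clean_pred), mse(y_true, clean_pred)
mae_outlier, mse_outlier = mae(y_true, outlier_pred), mse(y_true, outlier_pred)
```

With a single large miss, MAE grows from 1 to 2.8 while MSE grows from 1 to 20.8, so the one outlier dominates the squared metric.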

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Root Mean Squared Error (RMSE)

RMSE is computed by taking the square root of MSE. RMSE is also called the Root Mean Square Deviation. It measures the average magnitude of the errors and is concerned with the deviations from the actual value. An RMSE of zero indicates that the model has a perfect fit. The lower the RMSE, the better the model and its predictions. A higher RMSE indicates a large deviation between the predictions and the ground truth. RMSE can be used to compare models built with different features, as it helps in figuring out whether a feature is improving the model’s predictions or not.

def root_mean_squared_error(true, pred):
    squared_error = np.square(true - pred) 
    sum_squared_error = np.sum(squared_error)
    rmse_loss = np.sqrt(sum_squared_error / true.size)
    return rmse_loss

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Relative Squared Error (RSE)

To calculate Relative Squared Error, you take the sum of squared errors of the model and divide it by the sum of squared differences between the actual values and their mean. In other words, we divide the MSE of our model by the MSE of a model that always uses the mean as the predicted value.

def relative_squared_error(true, pred):
    true_mean = np.mean(true)
    squared_error_num = np.sum(np.square(true - pred))
    squared_error_den = np.sum(np.square(true - true_mean))
    rse_loss = squared_error_num / squared_error_den
    return rse_loss

The output value of RSE is expressed as a ratio. It is never negative, and a good model should have a value close to zero. A value of one means the model is no better than always predicting the mean, and a model with a value greater than one is not reasonable.

Pros of the Evaluation Metric:

Normalized Root Mean Squared Error (NRMSE)

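The Normalized RMSE is generally computed by dividing the RMSE by a scalar value. The normalizer can be chosen in different ways, such as the mean, the range (maximum minus minimum), the standard deviation, or the interquartile range of the actual values.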


# implementation of NRMSE with standard deviation
def normalized_root_mean_squared_error(true, pred):
    squared_error = np.square((true - pred))
    sum_squared_error = np.sum(squared_error)
    rmse = np.sqrt(sum_squared_error / true.size)
    # Normalize by the spread of the actual values, not of the predictions
    nrmse_loss = rmse / np.std(true)
    return nrmse_loss

Sometimes choosing the interquartile range may be the best bet, as the other normalizers are sensitive to outliers. NRMSE is a good measure when you want to compare models of different dependent variables or when the dependent variables are modified (log-transformed or standardized). It overcomes the scale dependency and eases comparison between models of different scales or even between datasets.
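
The choice of normalizer can change the result considerably when the target contains an outlier. The sketch below is one possible way to expose the choice as a parameter; the function name `nrmse`, the `method` argument, and its accepted values are assumptions for this example rather than an established API, and all normalizers are computed from the actual values.

```python
import numpy as np

def nrmse(y_true, y_pred, method="std"):
    # RMSE divided by a scale computed from the actual values.
    rmse = np.sqrt(np.mean(np.square(y_true - y_pred)))
    if method == "std":
        scale = np.std(y_true)
    elif method == "mean":
        scale = np.mean(y_true)
    elif method == "range":
        scale = np.max(y_true) - np.min(y_true)
    elif method == "iqr":
        q75, q25 = np.percentile(y_true, [75, 25])
        scale = q75 - q25
    else:
        raise ValueError(f"unknown method: {method}")
    return rmse / scale

# Illustrative target with one extreme value; every prediction is off by 1.
y_true = np.array([2.0, 4.0, 6.0, 8.0, 100.0])
y_pred = y_true + 1.0

by_range = nrmse(y_true, y_pred, method="range")  # scale inflated by the outlier
by_iqr = nrmse(y_true, y_pred, method="iqr")      # scale ignores the outlier
```

The single value of 100 stretches the range to 98, which makes the same errors look tiny, while the interquartile range stays at 4 and gives a far less flattering ratio of 0.25.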

Relative Root Mean Squared Error (RRMSE)

RRMSE is a dimensionless form of RMSE. Relative Root Mean Square Error (RRMSE) is the root mean squared error normalized by the root mean square value, where each residual is scaled against the actual value. While RMSE is restricted by the scale of the original measurements, RRMSE can be used to compare different measurement techniques. Inaccurate predictions result in an increased RRMSE. RRMSE expresses the error in relative terms or as a percentage. A commonly cited rule of thumb grades model accuracy as excellent when RRMSE is below 10%, good between 10% and 20%, fair between 20% and 30%, and poor above 30%.

def relative_root_mean_squared_error(true, pred):
    num = np.sum(np.square(true - pred))
    den = np.sum(np.square(pred))
    squared_error = num/den
    rrmse_loss = np.sqrt(squared_error)
    return rrmse_loss

Root Mean Squared Logarithmic Error (RMSLE)

Root Mean Squared Logarithmic Error is calculated by applying log to the actual and the predicted values and then taking their differences. Because the log compresses large values, RMSLE is robust to outliers and measures relative rather than absolute error, so small and large values are treated evenly.

It penalizes the model more if the predicted value is less than the actual value, while the model is penalized less if the predicted value is more than the actual value. Because of the log, it does not heavily penalize large errors. Hence the model receives a larger penalty for underestimation than for overestimation. This can be helpful in situations where we are not bothered by overestimation but underestimation is not acceptable.

def root_mean_squared_log_error(true, pred):
    square_error = np.square((np.log(true + 1) - np.log(pred + 1)))
    mean_square_log_error = np.mean(square_error)
    rmsle_loss = np.sqrt(mean_square_log_error)
    return rmsle_loss
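
The asymmetry between underestimation and overestimation can be checked directly. The following sketch uses a single made-up actual value of 100 and two predictions that miss it by the same absolute amount in opposite directions; `np.log1p(x)` is used as an equivalent of `np.log(x + 1)`.

```python
import numpy as np

def rmsle(y_true, y_pred):
    # np.log1p(x) computes log(1 + x); inputs are assumed to be non-negative.
    log_diff = np.log1p(y_true) - np.log1p(y_pred)
    return np.sqrt(np.mean(np.square(log_diff)))

y_true = np.array([100.0])

# Both predictions are off by 40 in absolute terms.
under = rmsle(y_true, np.array([60.0]))    # prediction below the actual value
over = rmsle(y_true, np.array([140.0]))    # prediction above the actual value
```

Although both predictions have the same absolute error, the underestimate receives the larger RMSLE, because the ratio between actual and predicted value is further from one.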

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Huber Loss

What if you want a function that learns from the outliers without being dominated by them? Well, Huber loss is the one for you. Huber loss is a combination of both linear and quadratic scoring methods. It has a hyperparameter delta (𝛿) which can be tuned according to the data. The loss is linear (like L1 loss) for absolute errors above delta and quadratic (like L2 loss) for absolute errors below it. It balances and combines good properties of both MAE (Mean Absolute Error) and MSE (Mean Squared Error). The choice of delta (𝛿) is critical because it defines what we treat as an outlier. Huber loss reduces the weight we put on outliers by growing only linearly for large errors, while for small errors it keeps the quadratic behavior of MSE.

def huber_loss(true, pred, delta):
    # Quadratic branch, used when the absolute error is at most delta
    huber_mse = 0.5 * np.square(true - pred)
    # Linear branch: delta * (|error| - 0.5 * delta)
    huber_mae = delta * (np.abs(true - pred) - 0.5 * delta)
    return np.where(np.abs(true - pred) <= delta, huber_mse, huber_mae)
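
To see the two regimes, the sketch below evaluates the standard Huber definition, 0.5·e² for |e| ≤ 𝛿 and 𝛿·(|e| − 0.5·𝛿) otherwise, on a few made-up error values. The helper takes the error directly instead of actual and predicted values, which is a simplification for this illustration.

```python
import numpy as np

def huber(error, delta=1.0):
    # Standard Huber loss on a raw error value (actual minus predicted).
    abs_err = np.abs(error)
    quadratic = 0.5 * np.square(error)        # used when |error| <= delta
    linear = delta * (abs_err - 0.5 * delta)  # used when |error| > delta
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 3.0, 10.0])
losses = huber(errors, delta=1.0)             # [0.125, 0.5, 2.5, 9.5]
squared = 0.5 * np.square(errors)             # [0.125, 0.5, 4.5, 50.0]

# At |error| == delta both branches give 0.5 * delta**2, so the loss is continuous.
at_threshold = huber(np.array([2.0]), delta=2.0)
```

For the error of 10, the Huber loss is 9.5 where the plain squared loss would be 50, which is how a large miss is kept from dominating the total.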

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Log Cosh Loss

Log cosh calculates the logarithm of the hyperbolic cosine of the error. It works like MSE for small errors but is much less affected by large prediction errors: log(cosh(x)) is approximately x²/2 when x is small and approximately |x| - log(2) when x is large. It is quite similar to Huber loss in the sense that it combines linear and quadratic behavior, but unlike Huber loss it is twice differentiable everywhere.

def log_cosh(true, pred):
    logcosh = np.log(np.cosh(pred - true))
    logcosh_loss = np.sum(logcosh)
    return logcosh_loss
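
One practical caveat: `np.cosh` overflows in float64 once the absolute error is much above 700, so a direct `np.log(np.cosh(x))` returns infinity for very large errors. A possible workaround, sketched here as an assumption rather than as part of the original implementation, uses the identity log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2), which never exponentiates a large positive number.

```python
import numpy as np

def log_cosh_stable(error):
    # Uses log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2).
    # exp(-2|x|) is always in (0, 1], so nothing overflows.
    abs_err = np.abs(error)
    return abs_err + np.log1p(np.exp(-2.0 * abs_err)) - np.log(2.0)

# Illustrative errors: tiny, moderate, and far beyond the overflow point.
errors = np.array([0.01, 1.0, 1000.0])
stable = log_cosh_stable(errors)

small_error_approx = 0.5 * errors[0] ** 2      # quadratic regime: x**2 / 2
large_error_approx = errors[2] - np.log(2.0)   # linear regime: |x| - log(2)
```

The first entry is close to the quadratic approximation and the last is close to the linear one, which also illustrates the MSE-like and MAE-like regimes of the loss.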

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Quantile Loss

A quantile regression loss function is applied to predict quantiles. A quantile is the value below which a given fraction of the values in a group fall. Quantile regression estimates the conditional median or another conditional quantile of the response (dependent) variable across values of the predictor (independent) variables. The loss function is an extension of MAE: at the 50th percentile it reduces to MAE, up to a constant factor of one half. It provides prediction intervals even for residuals with non-constant variance and it does not assume a particular parametric distribution for the response.

$$L_\gamma(y, \hat{y}) = \sum_{i:\, y_i \ge \hat{y}_i} \gamma \left| y_i - \hat{y}_i \right| + \sum_{i:\, y_i < \hat{y}_i} (1 - \gamma) \left| y_i - \hat{y}_i \right|$$

𝛾 represents the required quantile. The quantile values are selected based on how we want to weigh the positive and the negative errors.

In the loss function above, 𝛾 has a value between 0 and 1. When there is an underestimation, the first part of the formula will dominate and for overestimation, the second part will dominate. The chosen value of the quantile (𝛾) gives different penalties for over-prediction and under-prediction. When 𝛾 = 0.5, underestimation and overestimation are penalized by the same factor and the median is obtained. When the value of 𝛾 is larger, underestimation is penalized more than overestimation. For example, when 𝛾 = 0.75, underestimation is weighted by 0.75 and overestimation by 0.25, so underestimation costs three times as much as overestimation. Optimization algorithms based on gradient descent then learn the chosen quantile instead of the mean.

def quantile_loss(true, pred, gamma):
    val1 = gamma * np.abs(true - pred)
    val2 = (1-gamma) * np.abs(true - pred)
    q_loss = np.where(true >= pred, val1, val2)
    return q_loss
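
The link between this loss and quantiles can be checked numerically. The sketch below follows the same convention as the function above (weight 𝛾 when the actual value is at or above the prediction, 1 - 𝛾 otherwise), averages the loss over the data, and searches for the constant prediction that minimizes it. The helper names and the integers 1 to 100 used as data are assumptions for the illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, gamma):
    # Mean quantile (pinball) loss: weight gamma for underestimation
    # (y_true >= y_pred) and 1 - gamma for overestimation.
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, gamma * diff, (gamma - 1) * diff))

def best_constant(y_true, gamma, candidates):
    # Brute-force search for the constant prediction with the lowest loss.
    losses = np.array([pinball_loss(y_true, c, gamma) for c in candidates])
    return candidates[np.argmin(losses)]

y = np.arange(1, 101, dtype=float)   # illustrative data: 1, 2, ..., 100
candidates = np.arange(1, 101)

best_50 = best_constant(y, 0.5, candidates)   # lands next to the median
best_90 = best_constant(y, 0.9, candidates)   # lands next to the 90th percentile

# Same absolute miss of 1, opposite directions, with gamma = 0.75.
under_cost = pinball_loss(np.array([10.0]), np.array([9.0]), 0.75)
over_cost = pinball_loss(np.array([10.0]), np.array([11.0]), 0.75)
```

With 𝛾 = 0.75 the underestimate costs 0.75 and the overestimate costs 0.25, and the constant that minimizes the loss moves from around the median to around the 90th percentile as 𝛾 goes from 0.5 to 0.9.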

Pros of the Evaluation Metric:

Cons of the evaluation metric:

Thank you for reading all the way down here! I hope this article was helpful in your learning journey. I would love to hear in the comments about any other loss functions that I have missed. Happy Evaluating!
