Credit Scoring — Applied to Logistic Regression in R (2024)

Identification of Factors that Influence the Risk of Default

One of the most widely used statistical techniques for building a credit risk analysis model is Logistic Regression, which models the dependent variable as a function of independent variables; its defining characteristic is that the dependent variable is categorical and usually binary (dichotomous).


Logistic regression can have real application in different sectors, such as:

  1. Marketing — online advertising tools use logistic regression modeling to predict whether users will click on an ad or not. As a result, marketers can analyze users’ responses to different words and images and create high-performance ads that will get interaction from customers.
  2. Financial — financial firms need to analyze financial transactions for fraud and evaluate loan applications and insurance claims for risk. These problems are suitable for a logistic regression model because they have discrete outcomes, such as high risk or low risk and fraudulent or non-fraudulent.

The application of logistic regression in these and other sectors has great relevance in the field of artificial intelligence and machine learning (AI/ML): ML models can be trained to process large volumes of data without human intervention, and models built with logistic regression help organizations extract practical insights from their business data in a simple, fast, flexible, and transparent way.

What is Logistic Regression?

Logistic regression is a statistical technique that aims to produce, from a set of observations, a model that allows the prediction of values taken by a categorical variable, often binary, as a function of one or more continuous and/or binary independent variables.

Then, from this generated model it is possible to calculate and predict the probability of an event occurring (represented by 1 or 0, yes or no, success or failure), given a random observation.

For example: Let’s say you want to guess if your website visitor will click the checkout button in their shopping cart or not. Logistic regression analysis looks at past visitor behavior, such as time spent on the website and the number of items in the cart. It determines that, in the past, if visitors spent more than five minutes on the site and added more than three items to the cart, they clicked the checkout button.

Using this information, the logistic regression function can predict the behavior of a new site visitor.

In this sense, the logistic regression model allows you to:

  1. Model the probability of an event occurring as a function of the independent variables, which can be categorical or continuous;
  2. Estimate the probability of an event occurring for a randomly selected observation against the probability of the event not occurring;
  3. Predict the effect of the set of variables on the binary dependent variable;
  4. Classify observations by estimating the probability that an observation is in a given category.

The dependent variable Y in logistic regression is often binary, so in these cases it follows the Bernoulli distribution, having an unknown probability p. Remember that the Bernoulli distribution is just a special case of the binomial distribution, where n=1 (considers a single experiment):

P(Y = y) = p^{y}(1 - p)^{1 - y}, \qquad y \in \{0, 1\}

The probability of success is 0 ≤ p ≤ 1 and the probability of failure is q = 1- p. In logistic regression, the unknown probability p is estimated, given a linear combination of independent variables.
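
As a quick sanity check, the Bernoulli pmf can be evaluated in R as a binomial with a single trial (`dbinom` with `size = 1`); the probability below is an illustrative value, not taken from the data:

```r
# Bernoulli pmf P(Y = y) = p^y * (1 - p)^(1 - y), as a binomial with n = 1
p <- 0.7                        # illustrative probability of success
dbinom(1, size = 1, prob = p)   # P(Y = 1) = p^1 * (1 - p)^0 = 0.7
dbinom(0, size = 1, prob = p)   # P(Y = 0) = 1 - p = 0.3
```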

Logistic Function

When you run a logistic regression analysis, the problem at hand is one of classification; the value returned is a probability, so it always lies between 0 and 1.

Unlike linear regression, logistic regression does not return a straight line that best fits the data, but rather an ‘S’ shaped curve that best fits the model. Thus the link function is the logistic or sigmoid function. This function is defined by:

p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}

Readjusting the terms, you have:

\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}

So

\ln\!\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
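
These three forms can be checked numerically in R; the coefficients below are illustrative, not those of the credit model:

```r
# Logistic (sigmoid) link: maps any real z into a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Illustrative coefficients and predictor values (not from a fitted model)
b0 <- -1.5; b1 <- 0.8
x  <- c(-2, 0, 2)
p  <- sigmoid(b0 + b1 * x)

# Taking the log-odds recovers the linear predictor, as in the logit equation
all.equal(log(p / (1 - p)), b0 + b1 * x)  # TRUE
```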

Error Function (Cross Entropy)

The error function in logistic regression is always a comparison between the observed value (y) and the predicted value (ŷ). The objective is to minimize the cross-entropy function: since the sigmoid adds nonlinearity to the system, the cost is expressed through the logarithm of the likelihood.

As such, the total error cost is the sum of the individual errors divided by m, the number of observations in our database. For logistic regression, the cross-entropy function is given by:

J = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \,\right]
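
A minimal R implementation of this cost, written directly from the formula (the vectors below are toy values for illustration):

```r
# Cross-entropy (negative mean log-likelihood) for binary outcomes
cross_entropy <- function(y, y_hat) {
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

y     <- c(1, 0, 1, 1)          # observed classes
y_hat <- c(0.9, 0.2, 0.8, 0.6)  # predicted probabilities
cross_entropy(y, y_hat)         # ~ 0.266
```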

Odds Ratio

The odds ratio (O.R) compares the odds of an event in two groups: it is defined as the ratio between the odds of the event occurring in one group and the odds of the same event occurring in another group. Given two groups A and B, with event probabilities p and q respectively, the odds ratio is obtained by:

O.R = \frac{p/(1 - p)}{q/(1 - q)}
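
In R, with illustrative probabilities for the two groups:

```r
odds <- function(p) p / (1 - p)   # odds of an event with probability p

p <- 0.8  # illustrative event probability in group A
q <- 0.5  # illustrative event probability in group B
odds(p) / odds(q)                 # O.R = (0.8/0.2) / (0.5/0.5) = 4
```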

Wald Test

The Wald test is a parametric statistical test that tests whether each coefficient is significantly different from zero. Thus, this test checks whether each of the independent variables has a statistically significant relationship with the dependent variable. Test Hypothesis:

H_0 : \beta_j = 0 \qquad \text{vs.} \qquad H_1 : \beta_j \neq 0

Method of Selecting Variables

The selection of model variables is based on some algorithm that verifies the importance of a given variable and its inclusion or not in the model. Thus, there are three widespread methods presented here: forward, backward and stepwise.

We will highlight the stepwise method, which combines the forward and backward approaches: it starts like the forward method, but each time a variable is added, the variables already in the model are re-examined to verify that their explanatory power remains significant.

Akaike's Information Criterion (AIC)

The AIC is determined by:

AIC = -2\ln(L_p) + 2(p + 1)

where Lp is the maximized likelihood of the model and p is the number of explanatory variables in the model. Since the lowest AIC value is sought, Akaike’s information criterion penalizes models with many variables: the more variables, the higher the AIC value.
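
The criterion can be verified against R’s built-in AIC() on any fitted glm; the sketch below uses a toy logistic model on the built-in mtcars data, not the credit data:

```r
# AIC by hand: -2 * log-likelihood + 2 * (number of estimated parameters)
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
k   <- length(coef(fit))          # p explanatory variables + 1 intercept
aic_manual <- -2 * as.numeric(logLik(fit)) + 2 * k
all.equal(aic_manual, AIC(fit))   # TRUE
```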

Credit Scoring models are based on historical data from the existing customer base to assess whether a future customer is more likely to be a good or bad payer.

Models that evaluate credit are of great relevance to financial institutions, since a good customer classified as bad wastes the institution’s chance of profit, while a bad customer classified as good causes losses.

No model can achieve absolute accuracy, but these models assist credit-granting decision making, and any improvement in accuracy can generate financial gains for the institution.

How to build a Credit Scoring model using Logistic Regression?

  1. Survey a historical customer base: models are built on past information and it is important that there is availability and quality of this database to result in a successful model.
  2. Classification of customers according to the institution’s policy and definition of the dependent variable: note that the definition of good and bad customers may vary from institution to institution. Besides good and bad customers, there are those on the borderline between the two; these are generally excluded from the study, because it is easier to work with a dichotomous dependent variable.
  3. Selection of a representative sample of the historical customer base: it is suggested that the random sample contain the same number of cases from each category of the dependent variable (good and bad customers), to avoid bias due to class imbalance.
  4. Descriptive analysis and data preparation: consists in analyzing, according to statistical criteria, each variable to be used in the model.
  5. Application of logistic regression: starting from the random sample of the historical base and the variables to be used in the model, the logistic regression analysis is applied in order to obtain a regression model for credit analysis.

In the scenario at hand, we consider that an individual can be classified as a good customer (good payer) or a bad customer (bad payer). Therefore, the binary dependent variable Y can take on the values:

Y = \begin{cases} 1, & \text{if the customer is a good payer} \\ 0, & \text{if the customer is a bad payer} \end{cases}

The dependent variable determined was 1 for good customers and 0 for bad customers, but it could be the other way around. Regardless of which category was coded as 1, the logistic regression technique offers the same results. The logistic regression model obtained from this technique for the proposed coding allows the calculation of the probability that a customer is a good payer. To obtain the probability that he is a bad payer, it is enough to calculate the complementary probability, that is, if the probability that a customer is a good payer is 0.7, the probability that he is a bad payer will be 0.3.

There are a number of characteristics that can be included as possible independent variables, such as: gender, age, marital status, level of education, type of housing (own or rented), number of dependents, amount of income, amount of loan, amount and number of installments, current credit status (delinquent or defaulter), and others.

Objective of the Study

Data Mining techniques and the development of Machine Learning models have become increasingly necessary for finding relevant patterns of information in large volumes of data. In this study, we describe the logistic regression method applied to credit scoring, to discriminate the characteristics of an individual customer that increase or decrease the probability of credit risk (default or delinquency). To carry out this study, a Kaggle dataset was used.

Exploratory Analysis

To paraphrase John Wilder Tukey (1977), exploratory analysis employs a wide variety of quantitative and graphical techniques to extract as much information as possible from the variables in question.

#Libraries Loading
library(tidyverse)
library(pander)
library(modelr)
library(broom)
library(caret)
library(GGally)
library(ggplot2)
library(ROCR)
theme_set(theme_bw())
#Data Loading
df <- read.csv("credit_risk.csv")
df %>% glimpse
[output: glimpse of the dataset]

We remove from the study the variable “loan_int_rate”, which describes the interest rate offered by banks or any financial institution on loans, because there is no fixed value as it varies from bank to bank.

# Removal of the variable loan_int_rate
df <- df[,-8]
# Verification of omitted cases
sum(is.na(df))
[output: total number of missing values]
# Verification of variables with omitted cases
summary(is.na(df)) %>% pander()
[output: missing values per variable]
# Removal of omitted cases
df <- na.omit(df)
# Stats of categorical variables
df[,c(-3,-5,-6,-8,-10,-11)] %>% summary() %>% pander()
[output: summary statistics of the categorical variables]
# Correlation
ggcorr(cor(df[,c(-3,-5,-6,-8,-10)]), label = T, label_round = 3)
[output: correlation matrix plot]

Logistic Regression

For the estimation of the Logistic Regression model, we used a historical base with 32,581 cases, split into 70% for training and 30% for testing.

# Split the data into training and test set
set.seed(123)
df_sample <- df$loan_status %>% createDataPartition(p = 0.7, list = FALSE)
train.data <- df[df_sample, ]
test.data <- df[-df_sample, ]
# Fit the model
model <- glm(loan_status ~., data = train.data, family = binomial)

# Summarize the model
summary(model)

[output: model summary — coefficients, standard errors, z values, and p-values]

Interpretation: In the logistic regression model, the impact of each explanatory variable can be assessed through its coefficient. Positive coefficients correspond to characteristics that increase the probability of the customer not becoming a defaulter. The customer characteristics that individually favor reducing the risk of default are: annual income, type of home ownership (other and own), the person’s loan intent (home improvement), loan grade (B, C, D, E, F, and G), the percentage of the person’s income committed to the loan, and the customer’s credit history.

On the other hand, the variables with negative coefficients reduce the probability of the customer being a good payer, that is, they reduce the probability of the customer not becoming a defaulter. The customer characteristics that individually increase the risk of default are: age in years, type of home ownership (own), length of the customer’s employment in years, the person’s loan intent (training, health, personal, and consumption), the loan amount, and a history of default (yes).

Thus, the larger the loan a customer takes out, the more likely he is to default on it, and the higher the interest rate, the more likely he is to default.

Selection of Variables

The estimation of the logistic model was based on the stepwise method, which incorporates the forward and backward models.

# Application of the stepwise method
stepwise <- step(model, direction="both")
stepwise$formula
[output: formula of the model selected by the stepwise method]

The variables highlighted in the table below are the most significant in the model by the stepwise method, considering the significance level (α = 0.05); they are: age in years, annual income, type of home ownership, length of the customer’s employment in years, the person’s loan intent, loan grade, loan amount, and the percentage of the person’s income committed to the loan.

# Model with the variables indicated by stepwise
stepwise <- glm(stepwise$formula, family=binomial, data=train.data)
summary(stepwise)
[output: summary of the stepwise model]

Interpretation: The table above shows that the coefficients estimated by the stepwise method are in log-odds format. Thus, when annual income increases by one unit, the expected log-odds change by 8.745e-07. The column Pr(>|z|) shows the p-values for the test of the null hypothesis that each coefficient is zero. As a result, the annual income variable is statistically significant (p < 0.0056), below the usual 5% (0.05) threshold.

For the results in the table above the mathematical function of the model is given by:

[fitted model equation: the logit of loan_status as a linear combination of the selected variables]

You have:

[the same equation rearranged to express the probability p]

Interpretation: The regression coefficient is -3.478; this indicates that a one-unit increase in the customer’s annual income multiplies the odds of being delinquent by exp(-3.478) ≈ 0.031, i.e. it decreases them.

Evaluation of the Model's Performance

For the estimated logistic model, we now evaluate its performance on the data set that was set aside for validation; this analysis judges the model’s efficiency on unseen data.

# Calculation of the odds ratio (O.R)
odratio <- exp(cbind(OR = coef(stepwise), confint(stepwise)))
odratio %>% pander()
[output: odds ratios (OR) with confidence intervals]

Interpretation of coefficients: In the logistic model, the interpretation of the coefficient is different from that of the linear model. The exp of the coefficient corresponds to the estimated odds ratio (O.R).

The result above shows that for a one-unit change in the type of home ownership (other), the odds that Y equals 1 increase by 15.3% ((1.153 - 1) × 100). In other words, the odds of Y = 1 are 1.153 times greater when type of home ownership (other) increases by one unit (all other independent variables held constant).

Next we will make predictions using the test data to evaluate the performance of our logistic regression model. The procedure is as follows:

  1. Predict the class association probabilities of the observations based on the prediction variables;
  2. Assign the observations to the class with the highest probability score (i.e. greater than 0.5).
# Making predictions
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities) %>% pander()
[output: first predicted probabilities]
predicted.classes <- ifelse(probabilities > 0.5, "1", "0")
head(predicted.classes) %>% pander()
[output: first predicted classes]
# Assessing model accuracy
mean(predicted.classes == test.data$loan_status)
[output: overall accuracy]

The credit scoring model developed by means of Logistic Regression achieved an overall classification hit rate of 86.7%, indicating good classification performance.

According to Selau & Ribeiro (2009) experts consider credit scoring models with hit rates above 65% to be good models.

The sensitivity, the ability of the model to classify the customer as a defaulter when he really is a defaulter, was 0.867.
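
The sensitivity (and the full confusion matrix) can be obtained with caret’s confusionMatrix(), with positive = "1" marking the class of interest; the sketch below uses toy vectors for illustration, not the study’s actual predictions:

```r
library(caret)

# Toy observed and predicted classes (illustrative, not the study's output)
actual    <- factor(c(1, 1, 0, 0, 1, 0, 1, 1), levels = c("0", "1"))
predicted <- factor(c(1, 0, 0, 0, 1, 1, 1, 1), levels = c("0", "1"))

cm <- confusionMatrix(predicted, actual, positive = "1")
cm$byClass["Sensitivity"]  # true positives / actual positives = 4/5 = 0.8
```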

The results of credit scoring models serve as support for credit analysis, since they make it possible to obtain the probability of occurrence (or non-occurrence) of default and facilitate the identification of the factors that influence default risk. It is up to each organization to evaluate the conditions involved in the operation together with the result obtained from the model. This information provides support to minimize default and, consequently, credit losses.

Another issue, therefore, is that each study offers a particular result, as it depends entirely on what is being considered, the historical basis obtained, the data available and used, and the policy of each institution.

[1] J. Hair et al, Multivariate Data Analysis (2009), 7th Edition, Pearson Publication

[2] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression (2000), 2nd Edition, New York: John Wiley & Sons

[3] P.R. Selau and L.D. Ribeiro, A systematic approach to the construction and choice of credit risk prediction models (2009)

[4] UC Business Analytics R Programming Guide (2023), https://uc-r.github.io/logistic_regression

[5] S. Vinicius, What is logistic regression and how to apply it using Python (2019)

[6] M. Alice, How to perform a Logistic Regression in R (2015)
