Development of credit scoring models and Cut-off score setting in R. (2024)

A credit score is a numerical representation of an individual's creditworthiness, commonly used by lending institutions such as banks, telecoms, and fintechs. These organizations employ credit models that consider factors ranging from customer demographics and sources of income to existing credit commitments and past loan performance in order to calculate a credit score.

Credit scoring models can broadly be categorized into:

  • Behavioral Scorecards
  • Application Scorecards

A credit score indicates the likelihood of timely repayment if a loan or credit card is advanced to the person. Over time, different scorecards have been developed to address specific business needs, for example:

  • lending scorecards
  • collection scorecards
  • fraud scorecards
  • risk-based pricing scorecards

They are used by lenders to determine the interest rates and terms offered to borrowers. Each scorecard is designed to assess specific aspects of credit risk and aid in decision-making processes related to lending, collections, fraud detection, or pricing.

EDA - Exploratory Data Analysis

This is the initial exploration and analysis of the credit-related data to gain insights and understanding of the variables and their relationships. It helps to identify patterns, trends, outliers, and potential issues within the data. R has excellent packages that automate this tedious task. Some of these libraries are:

  • SmartEDA
  • DataExplorer
  • AutoEDA
  • wpa
  • scoringTools

There are many others, but these are my personal favorites.

Step 1: Prepare the environment.

R is open source and offers a rich collection of packages for credit scoring. Before you start your project, remember to clear the workspace and free up memory:

# Clear all objects including hidden objects

rm(list = ls(all.names = TRUE)) 

# Free up memory and report the memory usage.

gc() 

Then you can load the desired packages:

library(ggplot2)
library(pROC)
library(reshape2)
library(dplyr)
library(scales)
library(tidyr)
library(kableExtra)
library(knitr)
library(data.table)
library(ggpubr)
library(gridExtra)
library(lubridate)
library(SmartEDA)
library(caret)
library(randomForest)
library(scorecard)
library(pacman)
library(patchwork)
library(woeBinning)

You need to load the data into R and perform initial ETL, which entails handling dates and other data types during the loading process. There are various ways to load the data into R depending on its source. For simplicity, I will assume my data is a CSV file.

data <- read.csv("hw_table.csv", header = TRUE, stringsAsFactors = FALSE)

#---------------- Format dates
data$var41 <- as.POSIXct(data$var41, format = "%Y-%m-%d")
data$var38 <- as.POSIXct(data$var38, format = "%Y-%m-%d")

Next we create a brief description of our data set just to understand what we are dealing with:

[Figure 1: Data set overview produced by SmartEDA::ExpData]

The code below creates the table above. I have used the pipe operator, which allows one to achieve a task in a single block of code that is easy to comprehend.

data %>%
  SmartEDA::ExpData(type = 1) %>%
  kbl() %>%
  kable_minimal() %>%
  kable_paper("hover", full_width = FALSE)

Credit scoring models are developed using supervised machine learning. The data must therefore have a binary label column indicating good and bad, which is often represented as 0 and 1.

[Figure 2: Count of variables by data type]

The code below will produce the plot shown above.

data %>%
  select(all_of(var_columns)) %>%
  summarise(
    Numeric     = sum(sapply(., is.numeric)),
    Categorical = sum(sapply(., is.character)),
    Binary      = sum(sapply(., function(x) is.logical(x) && length(unique(x)) <= 2)),
    Date        = sum(sapply(., function(x) inherits(x, "Date") || inherits(x, "POSIXct")))
  ) %>%
  pivot_longer(everything(), names_to = "Data Type", values_to = "Count") %>%
  ggplot(aes(x = `Data Type`, y = Count, fill = `Data Type`)) +
  geom_bar(stat = "identity") +
  labs(x = "Data Type", y = "Count", title = "Data Type Counts for Variables Starting with 'var'") +
  scale_fill_manual(values = c("darkolivegreen3", "deepskyblue3", "darkseagreen", "darkslategray3", "bisque3")) +
  theme_minimal()

Credit risk models are developed based on performance windows. These observation windows are defined based on booking/underwriting months. For example, if a customer goes 90 or more days past due (DPD) during the performance window, the borrower is considered a 'bad' customer and labeled as the 'event' in the dependent variable; otherwise the borrower is 'good' and labeled as the 'non-event'.
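
As an illustration, below is a minimal sketch of how such a flag might be derived. The column names (loanid, max_dpd_12m) and the 90 DPD threshold are assumptions and should be adapted to your own performance-window definition.

library(dplyr)

# Hypothetical loan-level extract: worst delinquency observed in the
# performance window after booking (column names and threshold are assumptions)
perf <- data.frame(
  loanid      = 1:6,
  max_dpd_12m = c(0, 15, 45, 95, 120, 30)
)

perf <- perf %>%
  mutate(default_flag = ifelse(max_dpd_12m >= 90, 1, 0),                  # 1 = 'bad' / event
         good_bad     = factor(ifelse(default_flag == 1, "Bad", "Good")))

perf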

Vintage Analysis

At this point it is important to perform a vintage analysis on the data extract. This will help unearth the following (a small illustrative sketch follows the list):

  • Identify if loans issued in a particular month or quarter are riskier than others.
  • Determine the optimal period of performance window in development of scorecard.
  • Monitor or track risk of a portfolio.
  • Estimate the minimum number of months on book after which we can cross-sell to new customers.
  • Forecast risk; it can also be used in stress testing.
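
A minimal vintage-curve sketch on simulated data is shown below; the column names (booking_date, months_on_book, ever_90dpd) are assumptions, and in practice the performance flag would come from your delinquency history.

library(dplyr)
library(ggplot2)

# Simulated loan-level data: booking date, months on book at observation,
# and whether the loan has hit 90+ DPD by that point (all column names are assumptions)
set.seed(1)
vintages <- data.frame(
  booking_date   = sample(seq(as.Date("2023-06-01"), as.Date("2023-09-30"), by = "day"),
                          500, replace = TRUE),
  months_on_book = sample(1:12, 500, replace = TRUE)
)
vintages$ever_90dpd <- rbinom(500, 1, 0.05 + 0.01 * vintages$months_on_book)

# Bad rate by booking month (vintage) and months on book
vintages %>%
  mutate(vintage = format(booking_date, "%Y-%m")) %>%
  group_by(vintage, months_on_book) %>%
  summarise(bad_rate = mean(ever_90dpd), .groups = "drop") %>%
  ggplot(aes(x = months_on_book, y = bad_rate, colour = vintage)) +
  geom_line(size = 1) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Months on book", y = "Bad rate",
       title = "Vintage curves by booking month (simulated)") +
  theme_minimal()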

Roll Rate Analysis

The other analysis to conduct is a roll rate analysis. This is commonly used in loss forecasting and in the determination of 'bad' customers (defaulters). It is used to define the delinquency threshold that identifies 'bad' customers. You can also use it to decide whether to fully or partially charge off a loan. Finally, it can be used to assess cure rates and their timing.

Roll Rate Analysis can provide valuable inputs for the calculation of expected credit losses (ECL) under IFRS 9. By analyzing historical roll rates, financial institutions can gain insights into the probability of transitions between different credit states (e.g., from current to 30 days past due, from 30 days past due to charge-off) and use this information to estimate future credit losses.

Roll Rate Analysis can be used to develop probability of default (PD) models, which are one of the components of IFRS 9 impairment calculations. PD models estimate the likelihood of a borrower defaulting within a given time period.
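
A minimal sketch of a roll-rate (transition) matrix between delinquency buckets is shown below, using simulated month-over-month snapshots; the bucket labels and column names are assumptions.

library(dplyr)

# Hypothetical snapshot: delinquency bucket of each account in month t and month t+1
buckets <- c("Current", "1-29 DPD", "30-59 DPD", "60-89 DPD", "90+ DPD")
set.seed(2)
roll_data <- data.frame(
  bucket_prev = sample(buckets, 1000, replace = TRUE, prob = c(0.70, 0.12, 0.08, 0.06, 0.04)),
  bucket_curr = sample(buckets, 1000, replace = TRUE, prob = c(0.65, 0.13, 0.09, 0.07, 0.06))
)

# Roll-rate matrix: share of accounts moving from each bucket (rows) to each bucket (columns)
roll_matrix <- prop.table(table(roll_data$bucket_prev, roll_data$bucket_curr), margin = 1)
round(roll_matrix, 3)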

Analysis by Origination Month

[Figure 3: Loan counts by booking month, split by good/bad]

More loans were issued in September than in the other three months. It is important to take this into account when sampling.

While the graph above is useful because it shows the number of loans by booking month, an analysis of the bad rate by month of origination is also necessary.

[Figure 4: Bad rate by booking month]

You can plot the above graphs using the following code.

#--------------- Inspect loan issuance by booking month
data %>%
  mutate(var41 = ymd(var41)) %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  mutate(month_year = format(var41, "%m-%Y")) %>%
  count(month_year, good_bad) %>%
  group_by(month_year) %>%
  ungroup() %>%
  ggplot(aes(x = month_year, y = n, fill = good_bad)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  labs(x = "Month-Year", y = "Count", fill = "Bad/Good") +
  ggtitle("Stacked Bar Chart of Goods and Bads by Month-Year") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = comma_format()) +
  scale_fill_manual(values = c("#E7B800", "#00AFBB")) +
  theme_classic()

#--------------- Inspect bad rates over time by booking month
data %>%
  mutate(var41 = ymd(var41)) %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  mutate(month_year = format(var41, "%m-%Y")) %>%
  count(month_year, good_bad) %>%
  group_by(month_year) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = month_year, y = percentage, fill = good_bad)) +
  geom_bar(position = "fill", stat = "identity") +
  labs(x = "Month-Year", y = "Percentage", fill = "Bad/Good") +
  ggtitle("Monthly Analysis of Goods and Bads by Booking Month (Hypothetically)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_classic() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_fill_manual(values = c("coral1", "cyan2")) +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
            position = position_fill(vjust = 0.5), color = "black", size = 4, fontface = "bold")

In some instances you will find that loans are classified by product codes or some other numeric coding scheme. Such codes could be misinterpreted as numerical variables when they are really categorical. An analysis of unique values per variable is therefore important.

[Figure 5: Count of unique values per variable]

This helps the data scientist to identify categorical variables more easily.

Below is a check of variable completeness, i.e. an analysis of missing values by variable. Different tasks may call for a particular threshold of missing values; however, an understanding of each variable informs how many missing values to expect and how to treat them. Missing values can, for example, be imputed using the mean or median, as sketched after the code below.

[Figure 6: Variable completeness (share of non-missing values)]
#---------- Count the number of unique values in each variable
data %>%
  summarise(across(starts_with("var"), ~ length(unique(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Count") %>%
  mutate(Column = factor(Column, levels = var_columns)) %>%
  ggplot(aes(x = Column, y = Count)) +
  geom_bar(stat = "identity", fill = "darkgoldenrod2") +
  geom_text(aes(label = Count, y = Count), vjust = 0.5, color = "black", size = 3) +
  labs(x = "Column", y = "Count") +
  ggtitle("Count of Unique Values in Each Column (Starting with 'var')") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  theme_classic()

#---------------------- Check for completeness of the data
#---------- The data doesn't have incomplete records except for variable 38
data %>%
  select(starts_with("var")) %>%
  summarise_all(~ sum(!is.na(.)) / n()) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Completeness") %>%
  arrange(Completeness) %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
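
Where a variable with gaps is to be retained rather than dropped, a minimal median-imputation sketch is shown below. It only touches numeric var* columns; note that var38 in this data set is a date, so a simple numeric imputation would not apply to it.

library(dplyr)

# Median-impute every numeric var* column (a sketch only; the mean or a
# model-based method may be more appropriate depending on the variable)
data <- data %>%
  mutate(across(starts_with("var") & where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))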

Definition of Good Bad Label

As explained in the section above, the days past due (DPD) on a facility help determine the good/bad flag. This is also informed by the structure of the product.

[Figure 7: Distribution of good vs. bad]
data %>%
  mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
  group_by(good_bad) %>%
  summarise(percent = n() / nrow(data)) %>%
  ggplot(aes(x = good_bad, y = percent, fill = good_bad)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(percent * 100, 1), "%")),
            position = position_stack(vjust = 0.5), vjust = -0.5,
            color = "black", fontface = "bold") +
  labs(x = "Category", y = "Percentage", title = "Distribution of Good vs. Bad") +
  scale_y_continuous(labels = percent_format()) +
  theme_minimal()

Relationship between Regressor and Response variable

It is good practice to subject the explanatory variables, both continuous and discrete, to a distribution analysis in light of the target variable Y.

These plots can help spot instances where the distribution of bad is more than that of good and vice versa.

Plots of a batch of continuous variables.

[Figure 8: Distributions of continuous variables by good/bad]

Plots of the first batch of discrete variables.

[Figure 9: Stacked bar plots of discrete variables by good/bad (first batch)]

Plots of the second batch of discrete variables.

[Figure 10: Stacked bar plots of discrete variables by good/bad (second batch)]

These plots can be generated using the code below

#-----------------------------------------------------------------------------------
#----------------------------- Analysis of discrete variables
#-----------------------------------------------------------------------------------
variables <- c(paste0("var", 2:13), "var42")
data[, variables] <- lapply(data[, variables], factor)

# Create a list to store the plots
plots <- list()

# Create the stacked bar plots for each variable
for (i in 1:length(variables)) {
  variable <- variables[i]
  plot <- data %>%
    mutate(good_bad = factor(ifelse(default_flag == 1, "Bad", "Good"))) %>%
    count(!!as.name(variable), good_bad) %>%
    group_by(!!as.name(variable)) %>%
    mutate(percentage = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = !!as.name(variable), y = percentage, fill = good_bad)) +
    geom_bar(position = "fill", stat = "identity") +
    labs(x = variable, y = "Percentage", fill = "Category") +
    ggtitle(paste("Stacked Bar Plot of", variable, "by Good/Bad")) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme_classic() +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    scale_fill_manual(values = c("#E7B800", "#00AFBB")) +
    geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
              position = position_fill(vjust = 0.5), color = "black", size = 3.5, fontface = "bold")

  plots[[variable]] <- plot
  print(variable)

  # Arrange the plots in a grid once 4 plots have been created, or at the last variable
  if (i %% 4 == 0 || i == length(variables)) {
    grid.arrange(grobs = plots, ncol = 2)
    # Clear the list for the next set of plots
    plots <- list()
  }
}

Feature Engineering

Feature engineering enhances the effectiveness of machine learning models by carefully selecting and preparing the relevant features. This process involves choosing the most appropriate features for the model and transforming them in a manner that aligns with the requirements of the machine learning algorithm. By performing feature engineering, the performance and accuracy of the model can be improved, leading to better predictions and outcomes. The variables that undergo feature engineering are also known as independent variables, features, predictors, attributes, model factors, covariates, regressors, or characteristics.

This task can take many forms; a brief sketch of the encoding and scaling steps (items 2 and 4) follows the list below.

  1. Creating Interaction Terms: This involves combining two or more existing features to create new features that capture the interaction or relationship between them. This can be done using arithmetic operations or statistical functions.
  2. Encoding Categorical Variables: Categorical variables need to be converted into numerical representations for many machine learning algorithms. Techniques like one-hot encoding, label encoding, or target encoding can be used to transform categorical variables into numeric features.
  3. Handling Missing Values: Missing data can be problematic for machine learning models. Techniques such as imputation (e.g., mean, median, mode imputation), deletion of missing values, or advanced methods like multiple imputation can be employed to handle missing data effectively.
  4. Scaling and Normalization: Scaling or normalizing numerical features ensures that they are on a similar scale, preventing certain features from dominating the model. Common scaling techniques include standardization (mean centering and scaling to unit variance) and min-max scaling.
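
As flagged above, here is a minimal sketch of one-hot encoding and min-max scaling using caret, which is already loaded; the small data frame and its variable names are purely illustrative assumptions.

library(caret)

# Hypothetical data frame with one categorical and one numeric predictor
df <- data.frame(
  employment = factor(c("salaried", "self_employed", "salaried", "casual")),
  income     = c(35000, 120000, 52000, 18000)
)

# 2. One-hot encode the categorical variable
dmy    <- dummyVars(~ ., data = df)
df_enc <- as.data.frame(predict(dmy, newdata = df))

# 4. Min-max scale all columns to the [0, 1] range
pp      <- preProcess(df_enc, method = "range")
df_prep <- predict(pp, df_enc)
df_prep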

Feature Reduction:

  1. Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that identifies the most important patterns in the data by transforming the original features into a smaller set of uncorrelated variables called principal components. These components capture the maximum variance in the data (see the sketch after this list).
  2. Recursive Feature Elimination (RFE): RFE is an iterative feature selection method that works by recursively eliminating less important features based on their impact on model performance. It uses a machine learning model and successively removes features with the lowest importance until a desired number of features is reached.
  3. Feature Importance: Many machine learning algorithms provide a measure of feature importance. For example, Random Forests or Gradient Boosting models can rank features based on their contribution to model performance. Features with low importance can be removed or given less priority.
  4. Univariate Feature Selection: This approach involves selecting features based on their individual relationship with the target variable. Statistical tests, such as chi-square test for categorical variables or correlation coefficients for numerical variables, can be used to select the most relevant features.
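
A compact sketch of items 1 and 3 (PCA with prcomp and feature importance from a random forest) is shown below on the built-in mtcars data; the formula and target are illustrative assumptions, not part of the credit data set.

library(randomForest)

# 1. PCA on scaled numeric predictors; summary() reports variance explained per component
num_vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)
summary(pca)

# 3. Feature importance from a random forest on a binary target
mtcars$am <- factor(mtcars$am)
rf <- randomForest(am ~ ., data = mtcars, importance = TRUE, ntree = 200)
importance(rf)   # MeanDecreaseAccuracy / MeanDecreaseGini per feature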

Correlation

Below are some of the highly correlated variables in the data. If these variables are left in the model, they can cause problems in interpreting and drawing reliable conclusions from regression models. It is important to assess and address multicollinearity to ensure the validity of the results and the reliability of the model. Techniques such as variance inflation factor (VIF) analysis can help detect and mitigate multicollinearity; a minimal VIF check is sketched after the figure below.

[Figure 11: Correlation plot of the highly correlated variables]
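
A minimal VIF sketch is shown below; it assumes the car package (not in the library list above) and uses a small illustrative model on built-in data, but the same check applies to a logistic model fitted on the credit data.

# install.packages("car")   # car is not in the library list above
library(car)

# Illustrative linear model on built-in data
fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)

# Variance inflation factors: values above roughly 5-10 are a common warning sign
vif(fit)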

Some of the drawbacks of multicollinearity are:

  1. Unreliable coefficient estimates: Multicollinearity makes it difficult to determine the true relationship between each independent variable and the dependent variable. The coefficients may be imprecise, and small changes in the data can lead to large variations in the estimated coefficients.
  2. Reduced statistical significance: Multicollinearity inflates the standard errors of the coefficients, which reduces the t-statistics and p-values associated with the variables. As a result, variables that might be individually significant in a model with no multicollinearity may become statistically insignificant when multicollinearity is present.
  3. Ambiguous interpretation of coefficients: When variables are highly correlated, it becomes challenging to interpret the coefficients accurately. In the presence of multicollinearity, the coefficients may reflect the combined effects of the correlated variables rather than their individual contributions to the model.
  4. Inefficient allocation of coefficients: Multicollinearity can lead to coefficients being assigned to variables that do not contribute significantly to the dependent variable. This inefficient allocation of coefficients may lead to misleading conclusions about the importance of different variables in explaining the outcome.
  5. Increased standard errors: Multicollinearity inflates the standard errors of the regression coefficients, which reduces the precision of the estimates. Wide confidence intervals around the coefficients make it difficult to draw reliable inferences and may weaken the predictive power of the model.
  6. Instability of coefficients: Multicollinearity makes the estimated coefficients highly sensitive to small changes in the data. Including or excluding a variable or even minor changes in the data set can lead to large variations in the coefficients, making it difficult to replicate or generalize the results.
  7. Difficulties in identifying influential variables: Multicollinearity can mask the true effects of individual variables and make it challenging to identify the most important predictors in a regression model. The correlation between variables can obscure the independent contributions of each variable to the dependent variable.

Other methods of Feature Selection

A missing-value threshold can be set as a method of feature selection: features with a share of missing values above a certain threshold are dropped. The next step is to bin the variables.
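
Before moving on to binning, here is a minimal sketch of that missing-value filter; the 30% cut-off is an assumption and should be set according to the task at hand.

library(dplyr)
library(tidyr)

miss_threshold <- 0.30   # assumed cut-off: drop variables with more than 30% missing

missing_share <- data %>%
  summarise(across(starts_with("var"), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "MissingShare")

keep_vars    <- missing_share$Variable[missing_share$MissingShare <= miss_threshold]
data_reduced <- data %>% select(default_flag, all_of(keep_vars))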

Binning, also known as discretization, is a data preprocessing technique that involves dividing a continuous variable into a set of predefined bins or intervals. Instead of using the original continuous values, the data points are assigned to the corresponding bin based on their values. Binning can be useful in several scenarios:

  1. Handling non-linear relationships: Binning can be used to capture non-linear relationships between a continuous independent variable and the dependent variable. By dividing the variable into bins, you can model different ranges of values separately and potentially capture different patterns or relationships within each bin.
  2. Simplifying complex models: Binning can help simplify complex models by converting continuous variables into categorical variables. Categorical variables can be easier to interpret and may reduce the risk of overfitting. Binning can also help in reducing the complexity of models that have limited data or suffer from the curse of dimensionality.
  3. Dealing with outliers: Binning can mitigate the impact of outliers by grouping extreme values into the same bin. This can help in reducing the influence of outliers on the model's performance and make it more robust to extreme values.
  4. Addressing data sparsity: In situations where there are sparse data points in certain regions of a continuous variable, binning can be used to aggregate those data points into a single bin. This can help overcome issues related to limited data availability and improve the stability of the analysis.
  5. Meeting specific business requirements: Binning can be useful when specific business requirements or constraints call for discretizing continuous variables. For example, in credit scoring, lenders may have specific risk categories based on income levels or credit scores.

However, it's important to consider the potential limitations and drawbacks of binning:

  1. Information loss: Binning reduces the granularity of the data by collapsing values into bins. This can result in information loss and may lead to a loss of precision in the analysis.
  2. Arbitrary bin boundaries: The choice of bin boundaries is subjective and can affect the results. Different binning strategies may lead to different interpretations and outcomes. Careful consideration should be given to selecting appropriate bin boundaries that align with the data and the objectives of the analysis.
  3. Increased dimensionality: Binning can increase the dimensionality of the data, especially when multiple variables are binned simultaneously. This can have implications for subsequent analyses, such as increasing the number of variables in a regression model.
  4. Sensitivity to bin boundaries: Binning can introduce sensitivity to the specific bin boundaries chosen. Small changes in the binning scheme may lead to different results or conclusions. It is important to evaluate the stability and robustness of the analysis with respect to different binning strategies.

Binning can be a useful technique in certain situations, such as handling non-linear relationships or simplifying complex models. However, it is essential to carefully consider the potential drawbacks and evaluate the impact of binning on the analysis before applying it to the data.

Fine and Coarse classing

Fine and coarse classing are two approaches to binning or discretizing continuous variables. These approaches differ in the level of granularity or detail in which the continuous variable is to be divided into bins or categories.

  1. Fine classing: Fine classing involves dividing the continuous variable into a larger number of smaller bins or categories, resulting in a more detailed representation of the data. This approach provides a higher level of granularity and can capture finer patterns or variations within the data. Fine classing is suitable when the data has a wide range and requires a more precise representation.
  2. Coarse classing: Coarse classing, on the other hand, involves dividing the continuous variable into a smaller number of larger bins or categories, resulting in a less detailed representation of the data. This approach provides a broader overview and simplifies the data by grouping similar values together. Coarse classing is suitable when the data has a large range and a higher level of detail is not necessary or practical.

Monotonic binning is a technique commonly used in credit scoring to discretize continuous variables while preserving their monotonic relationship with the target variable (e.g., default or non-default). The objective is to create bins that exhibit a consistent trend or monotonic pattern with the target variable across the range of values.

Bin evaluation: Calculate the default rate (or any other relevant metric) within each bin. Evaluate the monotonicity of the bins by analyzing the trend of the default rate across the bins. Ideally, the default rate should consistently increase or decrease monotonically as the values of the continuous variable increase. This leads to the computation of WoE and IV.

[Figure 12: Bin evaluation output]
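
As a quick check of the bin-evaluation step described above, here is a minimal sketch that computes the default rate per bin and tests monotonicity; the bin labels and toy data are assumptions.

library(dplyr)

# Hypothetical binned variable (bins assumed to be in ascending order) and default flag
bin_check <- data.frame(
  var_bin      = c("(-Inf,20]", "(20,35]", "(35,50]", "(50,Inf]")[c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4)],
  default_flag = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
)

bin_rates <- bin_check %>%
  group_by(var_bin) %>%
  summarise(n = n(), default_rate = mean(default_flag), .groups = "drop")
bin_rates

# Monotonic if the default rate only ever moves in one direction across the ordered bins
diffs <- diff(bin_rates$default_rate)
all(diffs <= 0) || all(diffs >= 0)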

Information Value (IV) comes from information theory; it measures the predictive power of an independent variable, which is useful in feature selection.

[Figure 13: Information Value (IV) by variable]

We then transform all the independent variables using the weight of evidence (WoE) method. Based on the proportion of good applicants to bad applicants at each group level, this method measures the “strength” of grouping for differentiating good and bad risk, and attempts to find a monotonic relationship between the independent variables and the target variable.
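For reference, in the usual formulation (and consistent with the WoE definition in the manual function later in this post), the WoE for bin i is WoE_i = ln( (Good_i / Total Good) / (Bad_i / Total Bad) ), and the variable's IV is the sum over all bins of (Good_i / Total Good - Bad_i / Total Bad) * WoE_i.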

[Figure 14: WoE-transformed variables]

According to Siddiqi (2006), by convention, the values of the IV statistic in credit scoring can be interpreted as follows:

[Figure 15: IV interpretation thresholds (Siddiqi, 2006)]
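As a rule of thumb from that convention, an IV below 0.02 is generally read as not predictive, 0.02 to 0.1 as weak, 0.1 to 0.3 as medium, and above 0.3 as strong, with values above roughly 0.5 often treated as suspicious (possible over-fitting or data leakage).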

The final step is to create two data sets: one for the bin transformation and another for the WoE transformation.

# Convert factor variables to a numeric representation
for (col in names(data)[sapply(data, is.factor)]) {
  data[[col]] <- as.integer(data[[col]])
}

# Correlation
cor_matrix <- cor(data[, startsWith(names(data), "var")], method = "pearson")

# Find attributes that are highly correlated
highlyCorrelated <- findCorrelation(cor_matrix, cutoff = 0.7)
print(colnames(data[, startsWith(names(data), "var")])[highlyCorrelated])

#- Plot
corrplot::corrplot(cor(data[, colnames(data[, startsWith(names(data), "var")])[highlyCorrelated]],
                       method = "pearson"),
                   method = "color", type = "lower", tl.col = "black", tl.srt = 45)

#----------------- Uncorrelated data for IV computation
uncor_data <- data[, -highlyCorrelated]

#-----------------------------------------------------------------------------------
#--------------- Variable selection by IV
#-----------------------------------------------------------------------------------
variables <- variables[variables %in% colnames(uncor_data)]
uncor_data[, variables] <- lapply(uncor_data[, variables], factor)

# Bin the data
binning <- woe.tree.binning(subset(uncor_data[, c(2:ncol(uncor_data), 1)], select = -loanid),
                            'default_flag', uncor_data, event.class = 0)
# woe.binning.plot(binning, multiple.plots = FALSE)
uncor_data_binned <- woe.binning.deploy(uncor_data, binning, min.iv.total = 0.1,
                                        add.woe.or.dum.var = 'woe')
uncor_data_binned <- uncor_data_binned %>% mutate_at(vars(starts_with("woe")), ~ . / 100)

# Tabulate the binned variables
tabulate_binning <- woe.binning.table(binning)
tabulate_binning

model_data <- subset(uncor_data_binned, select = grepl("binned|default_flag", names(uncor_data_binned)))
colnames(model_data) <- gsub("\\.", "_", colnames(model_data))

You could also create the bins manually and compute the WoE using the script below.

#------------------------------------------------------------------------------------
#--------------------- Function to compute WoE and IV manually
#------------------------------------------------------------------------------------
tbl_fun <- function(x, y) {
  options(scipen = 999)
  y_recode <- ifelse(y == 0, "Good", "Bad")
  Mat <- as.matrix(table(as.factor(x), as.factor(y_recode)))
  Total <- Mat[, 1] + Mat[, 2]
  Total_Pct <- percent(Total / sum(Mat), accuracy = 0.01)
  Bad_Pct <- percent((Mat[, 1] / sum(Mat[, 1])), accuracy = 0.01)
  Good_Pct <- percent((Mat[, 2] / sum(Mat[, 2])), accuracy = 0.01)
  Bad_Rate <- percent(round((Mat[, 1] / (Mat[, 1] + Mat[, 2])), 2), accuracy = 0.01)
  WOE <- round(log((Mat[, 2] / sum(Mat[, 2])) / (Mat[, 1] / sum(Mat[, 1]))), 4)
  g_b_comp <- ifelse(Mat[, 1] == Mat[, 2], 0, 1)
  IV <- round(ifelse(g_b_comp == 0, 0,
                     ((Mat[, 2] / sum(Mat[, 2])) - (Mat[, 1] / sum(Mat[, 1]))) * (WOE / 10)), 4)
  Total_IV <- sum(IV)
  tbl <- as.data.frame(cbind(Bins = rownames(Mat), Total, Total_Pct,
                             Good = Mat[, 2], Good_Pct,
                             Bad = Mat[, 1], Bad_Pct,
                             Bad_Rate, WOE, IV, Total_IV))
  rownames(tbl) <- NULL
  return(tbl)
}

tbl_fun(select(model_data, starts_with("var"), default_flag)$var8_binned,
        select(model_data, starts_with("var"), default_flag)$default_flag)
tbl_fun(select(model_data, starts_with("var"), default_flag)$var43_binned,
        select(model_data, starts_with("var"), default_flag)$default_flag)

This is the second batch of the code.

plot1 <- model_data %>%
  mutate(default_flag = ifelse(default_flag == 0, "Good", "Bad"),
         var19_binned = as.character(var19_binned)) %>%
  na.omit() %>%
  group_by(var19_binned, default_flag) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = var19_binned, y = count, fill = default_flag)) +
  geom_col(position = "stack") +
  labs(x = "Credit History", y = "Count", fill = "Good-Bad") +
  ggtitle("Credit History for var43 After Binning") +
  theme_minimal()

plot2 <- model_data %>%
  mutate(default_flag = ifelse(default_flag == 0, "Good", "Bad"),
         Bins = as.character(var19_binned)) %>%
  na.omit() %>%
  group_by(Bins, default_flag) %>%
  summarise(count = n(), .groups = 'drop') %>%
  pivot_wider(names_from = default_flag, values_from = count) %>%
  mutate(Bad_Rate = Bad / (Bad + Good)) %>%
  ggplot(aes(x = Bins, y = Bad_Rate, group = 1)) +
  geom_line(size = 1.5, color = "blue") +
  geom_text(aes(label = scales::percent(Bad_Rate, accuracy = 0.01)),
            vjust = -1.2, size = 3.5, fontface = "bold") +
  labs(x = "Bins", y = "Bad Rate") +
  ggtitle("Bad Rate by Bins for Var43") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent)

# Arrange the plots side by side and display the combined plot
combined_plot <- plot1 + plot2 + plot_layout(ncol = 2)
combined_plot

Below are some of the plots generated by the scripts above.

Count of observations per bin and the corresponding bad rates per bin.

[Figure 16: Observations per bin and bad rate per bin]

Below is the WoE plot by bins

[Figure 17: WoE by bin]

This is the script for the plot above.

woe_plot %>%
  slice(-n()) %>%
  mutate(WOE = as.numeric(WOE)) %>%
  ggplot(aes(x = Bins, y = WOE, fill = factor(sign(WOE)))) +
  geom_col(position = "stack") +
  labs(x = "Bins", y = "WoE", fill = "WoE") +
  ggtitle("WoE plot for the Bins") +
  theme_minimal() +
  scale_fill_manual(values = c("#E7B800", "#00AFBB"),
                    labels = c("Positive", "Negative"),
                    guide = guide_legend(reverse = TRUE))

Monotonic binning means the relationship between the variable being binned and the target variable follows a consistent pattern: as the values of the independent variable increase, the event rate of the target variable consistently increases or decreases. This is achieved through coarse classing.

The wpa package can help achieve this through the script below.

IV_report(
  data = uncor_data[, -which(names(uncor_data) == "loanid")],
  predictors = NULL,
  outcome = "default_flag",
  bins = 5,
  max_var = 9,
  path = "IV report",
  timestamp = TRUE
)

The package wpa would give a WoE plot that is monotonic as shown below. Then the final part would be to combine bin 440 through to 525.

[Figure 18: Monotonic WoE plot from the wpa IV report]

Split the data into Training and Test

For modelling purposes the data is split into training and test sets: one set is used for training and the other for validation. You can also explore the k-fold cross-validation technique, in which a single hold-out split is not required.

#-----------------------------------------------------------------------------------
#------------------ Split the data into training and test
#-----------------------------------------------------------------------------------
raw_data <- uncor_data_binned[, c("default_flag",
                                  sub("_(.*)", "", colnames(select(model_data, starts_with("var")))))]
bins <- woebin(raw_data, y = "default_flag", positive = "bad|1", method = "tree")
bins[1]
raw_data <- raw_data[, c("default_flag", names(bins))]

# Set the seed for reproducibility
set.seed(77)

# Perform stratified sampling
train_indices <- createDataPartition(subset(model_data, select = grepl("woe|default_flag", names(model_data)))$default_flag,
                                     p = 0.8, list = FALSE)

# Split the data into train and test sets based on the indices
train_data <- subset(model_data, select = grepl("woe|default_flag", names(model_data)))[train_indices, ]
test_data  <- subset(model_data, select = grepl("woe|default_flag", names(model_data)))[-train_indices, ]
train_raw  <- raw_data[train_indices, ]
test_raw   <- raw_data[-train_indices, ]

#-----------------------------------------------------------------------------------
#-------------------------- Fit a logistic regression model to the WoE values
#-----------------------------------------------------------------------------------
#--------------- Model 1
colnames(train_data) <- c("default_flag",
                          paste0(sub("^[^_]*_([^_]+)_.*$", "\\1", colnames(train_data))[-1], "_woe"))
formula <- as.formula(paste("I(default_flag == 1) ~",
                            paste(colnames(select(train_data, ends_with("woe"))), collapse = " + ")))
formula
model_1 <- glm(formula, data = train_data, family = "binomial")
logit_model <- step(model_1, direction = "both", trace = FALSE)
summary(logit_model)

#-------------- Model 2
train_woe <- woebin_ply(train_raw, bins)
formula <- as.formula(paste("I(default_flag == 1) ~",
                            paste(colnames(select(train_woe, ends_with("woe"))), collapse = " + ")))
formula
null <- glm(default_flag ~ 1, data = train_woe, family = binomial)
full <- glm(formula, data = train_woe, family = binomial)
bicglm <- step(null, scope = formula(full), direction = "both", k = log(nrow(train_woe)))

model <- glm(formula, family = binomial(), data = train_woe)
logistic_step <- step(model, direction = "both", trace = FALSE)
summary(logistic_step)

pred <- predict(logistic_step)
y_pred_prob <- predict(logistic_step, type = 'response')

After fitting the model, it is important to check that all p-values are less than 0.05, that the coefficients have the expected (positive) sign, and that there are no signs of multicollinearity. If multicollinearity is detected, you can consider:

  • Dropping one or more of the highly correlated variables from the model.
  • Combining or transforming variables to create a single composite variable.
  • Collecting additional data to reduce the correlation among variables.
  • Using regularization techniques like ridge regression or lasso regression.

The final results look like this:

[Figure 19: Summary of the final logistic regression model]

The variables used in the final model are not highly correlated.

[Figure 20: Correlation check of the final model variables]

Gini and AUC

What counts as acceptable AUC and Gini performance depends on the specific context, the nature of the problem, and the performance standards set by the domain or industry.

In general, an AUC value of 0.66 is considered reasonable and suggests that the model has some discriminatory power to distinguish between the positive and negative classes. An AUC of 0.5 represents a model with no discrimination (similar to random guessing), while an AUC of 1.0 indicates a perfect classification model. A Gini coefficient of 0.32 suggests a moderate level of discrimination by the model but indicates room for improvement.

[Figure 21: ROC curve with AUC and Gini]
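
A minimal sketch of how these numbers can be computed with pROC (loaded earlier) is shown below; it assumes the y_pred_prob and train_woe objects from the modelling step above.

library(pROC)

# y_pred_prob and train_woe come from the modelling step above
roc_obj <- roc(train_woe$default_flag, y_pred_prob, quiet = TRUE)

auc_value  <- as.numeric(auc(roc_obj))
gini_value <- 2 * auc_value - 1        # Gini = 2 * AUC - 1

plot(roc_obj, legacy.axes = TRUE,
     main = sprintf("AUC = %.3f, Gini = %.3f", auc_value, gini_value))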

Generating a Scorecard

Convert the model coefficients into a simplified scoring system, known as the scorecard. The scorecard assigns points to each predictor variable based on their coefficients, reflecting their relative importance in predicting creditworthiness. The scaling process involves determining the range and granularity of the scorecard, such as assigning points to specific ranges of values or categories.

#-----------------------------------------------------------------------------------
# Manual scorecard generation
#-----------------------------------------------------------------------------------
#---------------- Get the coefficients from the model
coefficients <- coef(logistic_step); coefficients

#------------------ Define parameters for the scorecard
points0 <- 600
pdo     <- 20
odds0   <- 50

n <- length(model$coefficients[-1]); n
factor <- pdo / log(2)
offset <- points0 - factor * log(odds0)
alpha  <- coefficients[1]

BaseScore <- (offset - (factor * coefficients[1])); round(BaseScore, 0)

The scorecard is generated using the script below.

# Create an empty data frame for the scorecard
card <- data.frame(Variable = character(), Bin = character(), Woe = numeric(),
                   Iv = numeric(), Coef = numeric(), stringsAsFactors = FALSE)

#------------- Calculate the score for each bin of each variable
suppressWarnings({
  for (var in sub("^(.*?)_.*$", "\\1", colnames(train_data)[-1])) {
    var_tbl <- bins[var]
    variable <- var_tbl$var$variable
    bin <- var_tbl$var$bin
    woe <- var_tbl$var$woe
    iv <- var_tbl$var$bin_iv
    coefficient <- coefficients[paste0(var, "_woe")]
    card <- rbind(card, data.frame(Variable = variable, Bin = bin, Woe = woe,
                                   Iv = iv, Coef = coefficient))
    print(var)
  }
})

card$Neutral   <- round(-(alpha / n) * factor + offset / n, 0)
card$ScoreCntr <- round(card$Coef * card$Woe, 4)
card$Score     <- round((card$ScoreCntr + (alpha / n)) * factor + offset / n, 0)
head(card, 4)

From the scorecard above, the total score per customer decreases as the probability of default increases.

[Figure 22: Score versus probability of default]

When scaling the model into a scorecard, logistic regression coefficients from model fitting as well as the transformed WoE values are required. Subsequently you need to convert the score from the log-odds unit to a points system.

[Figure 23: Score scaling formula: points for bin j of variable i = (βi × WoEij + α/n) × Factor + Offset/n, summed over the n characteristics in the model]

  • βi : logistic regression coefficient for the variable Xi
  • α : logistic regression intercept
  • WoE : Weight of Evidence value for variable Xi
  • n : number of independent variables Xi in the model
  • Factor, Offset : the scaling parameters

From the script above, the base score was set at 529:

BaseScore <- (offset - (factor * coefficients[1]))

This is the resulting scorecard.

[Figure 24: The resulting scorecard]

Determination of Cutoff Score

The choice of cut-off score depends on the specific requirements, risk appetite, and business objectives of the organization implementing the credit scoring model. Different methods may be used in combination, and adjustments to the cut-off score may be made over time based on performance monitoring and feedback.

Ultimately, the determination of the cut-off score involves finding a balance between minimizing credit risk and maximizing business opportunities while considering various factors, including model performance, business objectives, and regulatory considerations.

  1. Receiver Operating Characteristic (ROC) Curve: The ROC curve is a graphical representation of the performance of a credit scoring model at various threshold levels. The cut-off score can be determined by selecting a point on the curve that optimizes the balance between sensitivity (the ability to correctly identify positive cases) and specificity (the ability to correctly identify negative cases). This approach allows for a trade-off between false positives and false negatives.
  2. Cost-Benefit Analysis: This method involves evaluating the costs and benefits associated with different cut-off scores. The costs may include defaults, collection efforts, or potential losses, while the benefits may include interest income, loan volumes, or customer acquisition. The optimal cut-off score is the one that maximizes the net benefit or minimizes the overall cost.
  3. Gini Coefficient or Kolmogorov-Smirnov (KS) Statistic: The Gini coefficient and KS statistic are measures of the predictive power of a credit scoring model. These metrics indicate the separation between good and bad credit risks. The cut-off score can be determined by selecting the point on the score distribution where the Gini coefficient or KS statistic is maximized.

You can use the gain/lift table to set the cut-off score. The gain/lift table is constructed by dividing the population into deciles or groups based on the predicted probabilities or scores from the credit scoring model.

[Figure 25: Gain/lift table by score decile]
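
A sketch of how such a decile table can be built is shown below; it assumes the y_pred_prob and train_woe objects from the modelling step above, so the column choices are assumptions rather than the exact table shown.

library(dplyr)

# Decile gain/lift table from the predicted probabilities of the fitted model
gain_tbl <- data.frame(prob = y_pred_prob, bad = train_woe$default_flag) %>%
  mutate(decile = ntile(desc(prob), 10)) %>%                  # decile 1 = highest predicted risk
  group_by(decile) %>%
  summarise(n = n(), bads = sum(bad), bad_rate = mean(bad), .groups = "drop") %>%
  mutate(cum_bad_capture = cumsum(bads) / sum(bads),          # share of all bads captured so far
         lift = bad_rate / (sum(bads) / sum(n)))

gain_tbl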

The optimal cut-off score on the score scale can be obtained from the KS curve by finding the maximum vertical distance (i.e., the largest gap) between the cumulative good percentage and the cumulative bad percentage. This point corresponds to the optimal cut-off score. Observations with scores above the cut-off will be classified as "good," while those below will be classified as "bad."

[Figure 26: KS plot with the cut-off score]

The cut-off score involves a trade-off between the false positive rate (classifying a "good" observation as "bad") and the false negative rate (classifying a "bad" observation as "good"). The specific choice of the cut-off score will depend on the desired balance between these two types of errors, which may vary depending on the application and the costs associated with misclassifications.

y <- train_raw$default_flag
pred <- predict(logistic_step)
y_pred_prob <- predict(logistic_step, type = 'response')
resp <- predict(logistic_step, type = 'response')

res <- tibble(logit = pred,
              y_pred_prob = y_pred_prob,
              y_actual = y,
              odds = exp(pred),
              prob = odds / (odds + 1),
              prob_ctrl = resp)

sc <- scorecard_ply(train_raw, score_card)
res$score <- sc[[1]]
res$score_ctrl <- offset - factor * res$logit
res

ks.test(res$y_pred_prob, res$y_actual, alternative = 'greater')

df_summary <- res %>%
  arrange(score) %>%
  mutate(cum_bad = cumsum(y_actual) / sum(y_actual),
         cum_good = cumsum(1 - y_actual) / sum(1 - y_actual))

max_ks_score <- df_summary$score[which.max(abs(df_summary$cum_bad - df_summary$cum_good))]

ggplot(df_summary, aes(x = score)) +
  geom_line(aes(y = cum_bad), color = "red", size = 1, linetype = "solid") +
  geom_line(aes(y = cum_good), color = "blue", size = 1, linetype = "solid") +
  geom_vline(aes(xintercept = max_ks_score), color = "orange", linetype = "dashed", size = 1.5) +
  annotate("text", x = max_ks_score, y = max(df_summary$cum_bad, df_summary$cum_good) + 0.05,
           label = paste0("KS = ", round(max(abs(df_summary$cum_bad - df_summary$cum_good)), 3)),
           color = "black") +
  annotate("text", x = max_ks_score, y = -Inf,
           label = paste0("x = ", round(max_ks_score, 3)), vjust = -1, color = "black") +
  labs(x = "Score Cutoff", y = "Cumulative Rate", title = "KS Plot") +
  theme_minimal()

Additionally, it's important to validate the chosen cut-off score on an independent dataset or through cross-validation to ensure its generalization and performance.


FAQs

What is the statistical model for credit scoring?

Logistic regression. Logistic regression (LR) is the most commonly used statistical model in credit scoring due to the interpretability of its decisions.

What is the regression model for credit scoring?

Logistic regression links the score and probability of default (PD) through the logistic regression function, and is the default fitting and scoring model when you work with creditscorecard objects.

How to develop a scoring model?

Create a Scoring Model
  1. Enter all of the ranges of values that you need to define for a given data point.
  2. Assign a score to each range of values.
  3. Assign a relative weighting factor to the data point, as it relates to all other data points.

What is the best model for credit scoring?

The FICO scoring model is an algorithm that produces what is considered the most reliable credit scores. About 90% of lenders use FICO's model to evaluate candidates for credit.

What are the four (4) strategic elements of a scorecard?

The balanced scorecard involves measuring four main aspects of a business: Learning and growth, business processes, customers, and finance. BSCs allow companies to pool information in a single report, to provide information into service and quality in addition to financial performance, and to help improve efficiencies.

How do you set up a scorecard?

Follow these steps to create a balanced scorecard:
  1. Outline your purpose. ...
  2. Create specific objectives and performance measures. ...
  3. Strategically map each perspective. ...
  4. Analyze performance. ...
  5. Share and communicate results. ...
  6. Develop strategic changes and initiatives. ...
  7. Implement the changes. ...
  8. Improving strategic planning.
Dec 19, 2022

How are credit risk models developed?

Credit risk models rely on a wide range of data sources to accurately assess the risk of potential borrowers. These data sources include financial statements, credit bureau data, and alternate data.

What is the most used credit score model?

FICO scores are the most widely used credit scores in the U.S. for consumer lending decisions.

What is a credit scoring algorithm?

A credit scoring algorithm is a systematic and rule-based procedure designed to evaluate an individual's or entity's creditworthiness. In essence, it is a set of rules that guide the assessment process, aiming to provide a numerical representation of how likely a borrower is to fulfill their financial obligations.

How are credit scores modeled?

FICO Scores are calculated using many different pieces of credit data in your credit report. This data is grouped into five categories: payment history (35%), amounts owed (30%), length of credit history (15%), new credit (10%) and credit mix (10%).

What is R score in regression?

R in a regression analysis is called the correlation coefficient and it is defined as the correlation or relationship between an independent and a dependent variable. It ranges from -1 to +1.

How does a credit scoring model work?

Credit scoring models leverage statistical algorithms and historical credit data to evaluate the likelihood of a borrower defaulting on a loan or credit obligation. The primary objective is to provide lenders with a quantitative measure that helps them make informed decisions about extending credit.

What is the most commonly used credit scoring model?

Did you know you have more than one FICO® Score? Since FICO® Scores were introduced to lenders over 25 years ago, they have become the best-known and most widely used credit score.

What are the 4 R's of credit scoring?

As [1] summarised, credit scoring is functional in four scenarios denoted by the acronym 4R, namely Risk, Response, Revenue and Retention.

What are the 5 C's of credit?

Called the five Cs of credit, they include capacity, capital, conditions, character, and collateral. There is no regulatory standard that requires the use of the five Cs of credit, but the majority of lenders review most of this information prior to allowing a borrower to take on debt.

Which models can be used to build a credit risk model?

In this regard there are two main classes of credit risk models – structural and reduced form models. Structural models are used to calculate the probability of default for a firm based on the value of its assets and liabilities. A firm defaults if the market value of its assets is less than the debt it has to pay.
