This package has been developed as part of a CIFRE PhD, a special PhD contract in France which is for the most part financed by a company. This company subsequently gets to choose which subject(s) are tackled.
This research has been financed by Crédit Agricole Consumer Finance (CA CF), a subsidiary of the Crédit Agricole Group which provides all kinds of banking and insurance services. CA CF focuses on consumer loans, ranging from luxury cars to small electronics.
In order to accept / reject loan applications more efficiently (both quicker and to select better applicants), most financial institutions resort to Credit Scoring: given the applicant’s characteristics, he/she is given a Credit Score, which has been statistically designed using previously accepted applicants, and which partly decides whether the financial institution will grant the loan or not.
Context
In practice, the statistical modeler has historical data about each customer’s characteristics. For obvious reasons, only data available at the time of inquiry must be used to build a future application scorecard. Those data often take the form of a well-structured table with one line per client alongside their performance (did they pay back their loan or not?) as can be seen in the following table:
Job | Habitation | Time_in_job | Children | Family_status | Default |
---|---|---|---|---|---|
Craftsman | Owner | 10 | 0 | Divorced | No |
Technician | Renter | 20 | 1 | Widower | No |
Executive | Starter | 5 | 2 | Single | Yes |
Office employee | By family | 2 | 3 | Married | No |
Formulation
The construction of the variable to predict, here denoted by Default, is an active research field and we will not discuss it here. We suppose we already have a binary random variable \(Y\) from which we have \(n\) observations \(\mathbf{y} = (y_i)_1^n\).
The \(d\) predictive features, here for example the job, habitation situation, etc., are usually socio-demographic features asked by the financial institutions at the time of application. They are denoted by the random vector \(\boldsymbol{X} = (X_j)_1^d\) and, as for \(Y\), we have \(n\) observations \(\mathbf{x}=(\boldsymbol{x}_i)_1^n\).
We suppose that observations \((\mathbf{x},\mathbf{y})\) come from an unknown distribution \(p(\boldsymbol{x},y)\) which is not directly of interest. Our interest lies in the conditional probability of a client with characteristics \(\boldsymbol{x}\) of paying back their loan, i.e. \(p(y|\boldsymbol{x})\), also unknown.
In the context of Credit Scoring, we historically stick to logistic regression, for various reasons out of the scope of this vignette. The logistic regression model assumes the following relation between \(\boldsymbol{X}\) (supposed continuous here) and \(Y\): \[\ln \left(\frac{p_{\boldsymbol{\theta}}(Y=1|\boldsymbol{x})}{p_{\boldsymbol{\theta}}(Y=0|\boldsymbol{x})}\right) = (1, \boldsymbol{x})'{\boldsymbol{\theta}}\]
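The relation above can be illustrated numerically. The following is a minimal sketch in base R; the coefficient vector `theta` and the feature vector `x` are made-up values chosen for illustration only, not estimates from any dataset.

```r
# Hypothetical coefficients: intercept first, then one weight per feature.
theta <- c(-1.5, 0.8, -0.3)
# Hypothetical continuous features of a single applicant.
x <- c(2.0, 1.0)

# Linear predictor (1, x)' theta, i.e. the log-odds of Y = 1 given x.
log_odds <- sum(c(1, x) * theta)

# Inverting the logit gives p_theta(Y = 1 | x).
p1 <- 1 / (1 + exp(-log_odds))

# plogis() is the logistic distribution function from the stats package
# and should return the same probability.
print(c(manual = p1, plogis = plogis(log_odds)))

# Going back from the probability to the log-odds recovers the linear predictor.
print(log(p1 / (1 - p1)))
```

The model is therefore linear on the log-odds scale, while the predicted probability always stays strictly between 0 and 1.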
We would like to have the “best” model compared to the true \(p(y|\boldsymbol{x})\), from which we only have samples. Had we access to the true underlying model, we would like to maximize, w.r.t. \({\boldsymbol{\theta}}\), the expected log-likelihood \(H_{\boldsymbol{\theta}} = \mathbb{E}_{(\boldsymbol{X},Y) \sim p}[\ln(p_{\boldsymbol{\theta}}(Y|\boldsymbol{X}))]\). Since this is not possible, we approximate this criterion by maximizing, w.r.t. \({\boldsymbol{\theta}}\), the log-likelihood \(\ell({\boldsymbol{\theta}};\mathbf{x},\mathbf{y}) = \sum_{i=1}^n \ln(p_{\boldsymbol{\theta}}(y_i|\boldsymbol{x}_i))\).
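To make the criterion concrete, here is a hedged sketch on simulated data (the sample size, the seed and the "true" coefficients `theta_true` are arbitrary assumptions). It writes the log-likelihood by hand, maximizes it numerically with `optim()`, and compares the result with `glm()`, which maximizes the same criterion.

```r
set.seed(1)

# Simulated data: one continuous feature and a binary outcome.
n <- 500
x <- rnorm(n)
theta_true <- c(-0.5, 1.2)  # hypothetical intercept and slope
y <- rbinom(n, size = 1, prob = plogis(theta_true[1] + theta_true[2] * x))

# Log-likelihood: sum over i of ln p_theta(y_i | x_i).
log_lik <- function(theta, x, y) {
  p <- plogis(theta[1] + theta[2] * x)
  sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

# Direct numerical maximization (optim() minimizes, hence the sign flip).
fit_optim <- optim(c(0, 0), function(th) -log_lik(th, x, y), method = "BFGS")

# glm() reaches the same maximum by iteratively reweighted least squares.
fit_glm <- glm(y ~ x, family = binomial(link = "logit"))

# Both estimates should agree up to the optimizer's tolerance.
print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))

# The hand-written criterion matches the log-likelihood reported by R.
print(c(manual = log_lik(coef(fit_glm), x, y),
        logLik = as.numeric(logLik(fit_glm))))
```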
In R, this is done by fitting a model to the data:
library(scoringTools)
scoring_model <- glm(Default ~ ., data = lendingClub, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
We can now focus on the regression coefficients \(\boldsymbol{\theta}\):
## (Intercept) Amount_Requested
## 5.446254e-01 5.198134e-06
## Loan_Purposecredit_card Loan_Purposedebt_consolidation
## -2.161336e-01 -4.537949e-01
## Loan_Purposeeducational Loan_Purposehome_improvement
## 1.858680e-01 -6.656963e-01
## Loan_Purposehouse Loan_Purposemajor_purchase
## -1.278938e+00 -1.726151e+00
## Loan_Purposemedical Loan_Purposemoving
## -7.204768e-01 -4.125148e-01
## Loan_Purposeother Loan_Purposerenewable_energy
## -1.044591e-01 -1.902471e+01
## Loan_Purposesmall_business Loan_Purposevacation
## -7.710864e-01 -8.271925e-01
## Loan_Purposewedding Loan_Length
## -4.670372e-01 -8.072343e-03
## Debt_To_Income_Ratio Home_OwnershipMORTAGE
## 4.673087e-04 -9.385231e-01
## Home_OwnershipMORTGAE Home_OwnershipMORTGAG
## -4.822148e-02 -1.559521e+00
## Home_OwnershipMORTGAGE Home_OwnershipMORTGGE
## -9.395086e-01 -1.053046e+00
## Home_OwnershipMOTGAGE Home_OwnershipMRTGAGE
## -1.320655e-01 7.277748e-02
## Home_OwnershipORTGAGE Home_OwnershipOTHER
## -4.190851e-01 -2.067367e+01
## Home_OwnershipOWN Home_OwnershipRENT
## -1.052071e+00 -7.275610e-01
## Open_CREDIT_Lines Revolving_CREDIT_Balance
## -1.537008e-02 5.409573e-06
## Inquiries_in_the_Last_6_Months Monthly_Income
## -5.478806e-02 1.679455e-05
## Employment_Length StateAL
## 2.190715e-02 -9.302735e-01
## StateAR StateAZ
## -2.001419e+01 -8.498573e-01
## StateCA StateCO
## -1.324136e+00 -8.503792e-01
## StateCT StateDC
## -1.006930e+00 -8.278092e-01
## StateDE StateFL
## -1.813406e+01 -7.749171e-01
## StateGA StateHI
## -1.658919e+00 -8.162135e-01
## StateIA StateIL
## -2.456610e+00 -8.436800e-01
## StateIN StateKS
## -1.028000e+00 -1.184442e+00
## StateKY StateLA
## -2.489800e+00 -1.522087e+00
## StateMA StateMD
## -2.233885e+00 -6.024556e-01
## StateMI StateMN
## -4.866130e-02 -1.755743e+00
## StateMO StateMS
## -1.490269e+00 -1.272611e+00
## StateMT StateNC
## -2.953696e-01 -1.296718e+00
## StateNH StateNJ
## -9.519204e-01 -1.139183e+00
## StateNM StateNV
## -5.655698e-01 -1.188136e+00
## StateNY StateOH
## -6.806718e-01 -7.549876e-01
## StateOK StateOR
## -2.235123e+00 -1.849957e+00
## StatePA StateRI
## -8.662468e-01 -1.162867e-01
## StateSC StateSD
## -1.620496e+00 1.488559e+01
## StateTX StateUT
## -1.195268e+00 -1.546485e+00
## StateVA StateVT
## -8.237064e-01 3.419687e-01
## StateWA StateWI
## -1.026220e+00 -5.139766e-01
## StateWV StateWY
## -8.625548e-01 -1.224568e+00
## Interest_Rate FICO_Range645-649
## 3.282162e-02 -1.933897e+01
## FICO_Range650-654 FICO_Range655-659
## 2.331851e+01 2.633920e+00
## FICO_Range660-664 FICO_Range665-669
## 9.115198e-01 5.946684e-01
## FICO_Range670-674 FICO_Range675-679
## 1.004201e+00 9.227298e-01
## FICO_Range680-684 FICO_Range685-689
## 7.418759e-01 1.059893e+00
## FICO_Range690-694 FICO_Range695-699
## 6.573794e-01 -1.914143e+01
## FICO_Range700-704 FICO_Range705-709
## -1.916224e+01 -1.916976e+01
## FICO_Range710-714 FICO_Range715-719
## -1.907579e+01 -3.016283e+01
## FICO_Range720-724 FICO_Range725-729
## -1.890553e+01 -3.031735e+01
## FICO_Range730-734 FICO_Range735-739
## -1.903904e+01 -1.898787e+01
## FICO_Range740-744 FICO_Range745-749
## -1.901698e+01 -1.899618e+01
## FICO_Range750-754 FICO_Range755-759
## -1.904364e+01 -1.898244e+01
## FICO_Range760-764 FICO_Range765-769
## -1.883702e+01 -1.871614e+01
## FICO_Range770-774 FICO_Range775-779
## -1.888484e+01 -1.870410e+01
## FICO_Range780-784 FICO_Range785-789
## -1.877299e+01 -1.889972e+01
## FICO_Range790-794 FICO_Range795-799
## -1.882304e+01 -1.947507e+01
## FICO_Range800-804 FICO_Range805-809
## -1.877842e+01 -3.202823e+01
## FICO_Range810-814 FICO_Range815-819
## -1.877796e+01 -1.899147e+01
## FICO_Range820-824 FICO_Range830-834
## -1.776919e+01 -1.902438e+01
## Age
## -3.512107e-03
and the deviance at this estimate of \(\boldsymbol{\theta}\):
## [1] 1103.43
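For readers who want to reproduce this kind of output without the package data, the sketch below fits a logistic regression on a small simulated data frame (the data frame `toy`, its columns and the coefficients used to simulate it are all hypothetical) and reads off the coefficients and the deviance with the standard accessors `coef()` and `deviance()`. It also checks that, for a 0/1 response, the deviance equals minus twice the maximized log-likelihood.

```r
set.seed(2)

# Hypothetical toy data: two continuous features and a binary outcome.
toy <- data.frame(income = rnorm(300), age = rnorm(300))
toy$default <- rbinom(300, size = 1,
                      prob = plogis(-1 + 0.7 * toy$income - 0.4 * toy$age))

# Same call pattern as above, on the toy data.
toy_model <- glm(default ~ ., data = toy, family = binomial(link = "logit"))

# Estimated regression coefficients (intercept first).
print(coef(toy_model))

# Residual deviance at the estimate.
print(deviance(toy_model))

# With ungrouped 0/1 responses the saturated model has log-likelihood 0,
# so the deviance is -2 times the maximized log-likelihood.
print(-2 * as.numeric(logLik(toy_model)))
```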
From this, it seems that Credit Scoring is pretty straightforward when the data is at hand.
Conceptual problems of current approaches to Credit Scoring
Nevertheless, there are a few theoretical limitations of the current approach, e.g.:
- We don’t observe rejected applicants’ performance, i.e. we don’t have observations \(y_i\) for previously rejected applicants;
- The performance variable \(Y\) must be constructed using historical data but we can’t wait for all current contracts to end, which is why financial institutions usually consider a defaulting client to be someone failing to pay two consecutive installments;
- Credit risk modelers often “discretize” the input data \(\boldsymbol{X}\), that is to say continuous variables are transformed into categorical variables corresponding to intervals of the support of \(\boldsymbol{X}\), and categorical variables might see their values regrouped to form a categorical variable with fewer values (but whose coefficients are “easier” to estimate). Up to now, there were no theoretical grounds to do so and no uniformly better method;
- Credit risk modelers have always stuck to logistic regression without knowing whether it is somewhat “close” to the true underlying model.
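As an illustration of the discretization and grouping step mentioned in the list above, here is a minimal base R sketch. The data frame `applicants`, the cut points and the grouping of family statuses are arbitrary choices made for the example; they are not the output of any method from this package.

```r
# Hypothetical applicant data, loosely following the table shown earlier.
applicants <- data.frame(
  time_in_job = c(10, 20, 5, 2, 15, 1),
  family_status = factor(c("Divorced", "Widower", "Single",
                           "Married", "Married", "Single"))
)

# Discretization: map a continuous feature to intervals of its support.
# The intervals are [0, 3], (3, 10] and (10, Inf); the cut points are arbitrary.
applicants$time_in_job_q <- cut(
  applicants$time_in_job,
  breaks = c(0, 3, 10, Inf),
  labels = c("short", "medium", "long"),
  include.lowest = TRUE
)

# Grouping: merge levels of a categorical feature into fewer levels.
applicants$family_status_q <- applicants$family_status
levels(applicants$family_status_q) <- list(
  Alone = c("Divorced", "Single", "Widower"),
  Couple = "Married"
)

print(applicants)
```

A logistic regression fitted on the quantized columns then estimates one coefficient per interval or group (minus the reference level), instead of a single slope for `time_in_job` and one coefficient per original family status.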
Problems tackled in this package
Two problems have been tackled so far in the Credit Scoring framework:
- Reject Inference,
- “Quantization” of continuous (discretization) and qualitative (grouping) features.
Other packages
We released two other packages:
- Package glmdisc for “Quantization” of continuous (discretization) and qualitative (grouping) features and interactions among covariates,
- Package glmtree for “Segmentation” of clients into subpopulations with different scorecards: logistic regression trees.
Other packages focus on Credit Scoring, see e.g. this review paper.