Credit scoring with macroeconomic variables using survival analysis

Original by T. Bellotti, J. Crook, University of Edinburgh, 2008, 28 pages
This summary note was posted on 18 September 2017 in Credit risk, Finance
  • Explores the hypothesis that the probability of default (PD) is affected by general conditions in the economy over time
  • Survival analysis provides a framework for their inclusion as time-varying covariates
  • Macroeconomic variables, such as the interest rate and the unemployment rate, improve model fit and affect the probability of default, yielding a modest improvement in default predictions on a hold-out sample
  • Can simulate the effect of a macroeconomic downturn on the future PD of a single applicant, and also on the future PDs of a portfolio of applicants
  • Uses a Cox PH model with time-varying covariates
  • Survival analysis has been applied in many financial contexts including explaining financial product purchases (Tang et al. 2007), behavioural scoring for consumer credit (Stepanova & Thomas 2001), predicting default on personal loans (Stepanova & Thomas 2002) and the development of generic score cards for retail cards (Andreeva 2006)
  • Allows modelling not just whether a borrower will default, but when

Cox Proportional Hazards Model

  • Studies the time to failure of some population
  • Can include observations that have not failed by the end of the observation period (censored data)
  • Uses the hazard function, which gives the instantaneous rate of failure at time t, conditional on survival up to t
  • The model is semiparametric: the hazard factorises as h(t | x) = h0(t) exp(βᵀx), where the baseline hazard h0(t) is left unspecified and β is a vector of coefficients
  • The coefficients β are estimated by maximizing a partial likelihood over the training observations, which allows maximum-likelihood estimation without the need to know the baseline hazard (Hosmer and Lemeshow 1999, section 7.3)
  • Numerical integration is used to compute the coefficients, following Chen et al. (2005); a minimal fitting sketch follows this list
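
As a concrete illustration, here is a minimal sketch of fitting a Cox PH model with a time-varying macroeconomic covariate using the Python lifelines library. This is not the authors' code: the simulated accounts, covariate names and toy hazard are all hypothetical, and the data are arranged in the long format (one row per account-month) that time-varying Cox fits require.

```python
import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

rng = np.random.default_rng(0)

# Hypothetical macro series: one interest-rate value per calendar month.
months = 36
interest_rate = 5.0 + np.cumsum(rng.normal(0, 0.1, size=months))

rows = []
for acct in range(200):
    opened = rng.integers(0, 24)          # calendar month the account opens
    income = rng.uniform(15_000, 60_000)  # static application covariate
    for t in range(12):                   # follow each account for 12 months
        ir = interest_rate[opened + t]    # macro covariate varies over time
        # Toy hazard: default risk rises with the rate, falls with income.
        p = 0.01 * np.exp(0.5 * (ir - 5.0) - 0.00002 * (income - 35_000))
        event = rng.random() < p
        rows.append((acct, t, t + 1, int(event), income, ir))
        if event:                         # stop following a defaulted account
            break

df = pd.DataFrame(rows, columns=["id", "start", "stop", "default",
                                 "income", "interest_rate"])

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="default",
        start_col="start", stop_col="stop")
ctv.print_summary()  # coefficients for income (static) and interest_rate (time-varying)
```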

Implementation

  • Data on credit card accounts from a UK bank, from 1997 to mid-2005
  • Training data: 1997-2000
  • Out-of-sample test data: 2002-2005
  • Due to the large sample size, forward and backward variable-elimination methods were judged too time-consuming
  • Default is defined as three consecutive missed payments
  • The outcome modelled is default within 12 months of account opening; a labelling sketch follows this list
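
The paper gives no code, but the label construction is simple enough to sketch; the monthly payment layout below is an assumption of mine.

```python
import pandas as pd

def is_default(missed: pd.Series, window: int = 3, horizon: int = 12) -> bool:
    """Default flag: `window` consecutive missed payments (1 = missed)
    occurring within the first `horizon` months of the account."""
    first_year = missed.iloc[:horizon]
    return bool((first_year.rolling(window).sum() == window).any())

# Hypothetical history, 1 = missed payment; months 3-5 are three consecutive misses
history = pd.Series([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(is_default(history))  # True
```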

Problem of imbalanced data

  • Discussed by Chawla et al. (2005)
  • Natural distribution is not necessarily the optimal one  (Weiss and Provost 2003)
  • The most common and simplest solutions are (1) under-sampling the majority class or (2) over-sampling the minority class
  • There are some concerns that over-sampling encourages overfitting (Weiss 2004)
  • Chose to over-sample the bads so that they are as numerous as the goods, which gave good results for this study
  • For logistic regression, the best predictive results were achieved when no over-sampling was used
  • Over-sampling the bad cases artificially alters the distribution of training cases, so the Wald statistic cannot be used to generate P-values; a bootstrap method is used instead to compute a percentile confidence interval for each coefficient estimate for which a P-value is reported (Efron and Tibshirani 1993)
  • Errors on bads carry a higher cost than errors on goods, and a cost function is used to determine the value of a prediction
  • Cut-offs are set so as to minimize the total cost on the test set
  • If a model performs well with cut-offs chosen on both the training and the test sample, this indicates the model is robust
  • Models giving a lower mean cost have performed better; a sketch of the over-sampling, bootstrap interval and cost-based cut-off follows this list
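
The three mechanics above can be sketched as follows. This is an illustration rather than the paper's implementation: inputs are assumed to be NumPy arrays with y coded 1 for bad and 0 for good, scores are predicted default risks, and the 10:1 cost ratio is an assumed placeholder, not the paper's figure.

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample_bads(X, y):
    """Duplicate bad cases (y == 1) with replacement until they are
    as numerous as the goods."""
    bad, good = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(bad, size=len(good) - len(bad), replace=True)
    keep = np.concatenate([good, bad, extra])
    return X[keep], y[keep]

def percentile_ci(estimate, X, y, n_boot=500, alpha=0.05):
    """Percentile-bootstrap confidence interval (Efron & Tibshirani 1993)
    for a scalar statistic; `estimate(X, y)` refits the model and returns
    the coefficient of interest."""
    n = len(y)
    stats = [estimate(X[idx], y[idx])
             for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

def cost_minimizing_cutoff(scores, y, cost_bad=10.0, cost_good=1.0):
    """Sweep candidate cut-offs on predicted default scores and return
    the one with the lowest total cost; letting a bad through is weighted
    more heavily than declining a good."""
    best_c, best_cost = None, np.inf
    for c in np.unique(scores):
        cost = (cost_bad * np.sum((scores < c) & (y == 1))       # bads passed as good
                + cost_good * np.sum((scores >= c) & (y == 0)))  # goods declined
        if cost < best_cost:
            best_c, best_cost = c, cost
    return best_c, best_cost
```

In line with the bullets above, the cut-off would be computed separately on the training and the test sample, and a model judged robust when it incurs a low mean cost under both.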

Results

  • Interest rate is by far the most important factor, followed by earnings
  • Lagged values (3, 6 and 12 months) of the macroeconomic variables did not perform better than the simplest model, which takes the values at the point of default
  • The inclusion of an interaction term IR × Income with a negative coefficient estimate suggests that the PD of higher-income individuals is less sensitive to changes in the interest rate (see the sketch below)
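
To make the interaction concrete, here is a hypothetical fragment extending the lifelines sketch from the Cox section above (it reuses that sketch's df and ctv; the centring is my choice, to tame collinearity in the toy data).

```python
# Add a (mean-centred) IR x Income interaction as an extra covariate,
# then refit the time-varying Cox model.
df["ir_x_income"] = ((df["interest_rate"] - df["interest_rate"].mean())
                     * (df["income"] - df["income"].mean()))

ctv.fit(df, id_col="id", event_col="default",
        start_col="start", stop_col="stop")

# With beta_ir on interest_rate and beta_inter on ir_x_income, the
# sensitivity of the log hazard to the rate is
# beta_ir + beta_inter * (income - mean income),
# so a negative beta_inter damps the rate effect for higher incomes.
```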