- Explores the hypothesis that probability of default is affected by general conditions in the economy over time
- Survival analysis provides a framework for their inclusion as time-varying covariates
- Macroeconomic variables, such as the interest rate and unemployment rate, improve model fit and affect the probability of default, yielding a modest improvement in default predictions on a hold-out sample
- Can simulate the effects of downturns in the macroeconomy on the future PD for an applicant, and likewise on the future PDs for a portfolio of applicants
- Uses a Cox PH model with time-varying covariates
- Survival analysis has been applied in many financial contexts including explaining financial product purchases (Tang et al. 2007), behavioural scoring for consumer credit (Stepanova & Thomas 2001), predicting default on personal loans (Stepanova & Thomas 2002) and the development of generic score cards for retail cards (Andreeva 2006)
- Allows modelling not just whether a borrower will default, but when; a rough code sketch of such a model follows below
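As a rough illustration (not the authors' implementation), a Cox PH model with time-varying macroeconomic covariates can be fitted in Python with the lifelines package; the column names and data layout below are assumptions made for this sketch only.

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

def fit_pd_model(df: pd.DataFrame) -> CoxTimeVaryingFitter:
    """Fit a Cox PH model with time-varying covariates.

    `df` is assumed to be in long format: one row per account per observation
    interval, with columns account_id, start, stop (months on book), a default
    indicator (1 in the interval where default occurs), and covariate columns
    such as interest_rate, unemployment and income (hypothetical names).
    """
    ctv = CoxTimeVaryingFitter()
    ctv.fit(df, id_col="account_id", event_col="default",
            start_col="start", stop_col="stop")
    return ctv

# Usage, assuming `accounts` is such a long-format frame:
# model = fit_pd_model(accounts)
# model.print_summary()  # coefficient estimates and hazard ratios
```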
Cox Proportional Hazard Model
- Studies the time to failure of some population
- Can include observations that have not yet failed (censored data)
- Uses the hazard function, which gives the instantaneous rate of failure at time t, conditional on survival up to time t
- The model is semiparametric: the hazard is the product of an unspecified baseline hazard and an exponential term in a vector of coefficients beta (written out after this list)
- The coefficients beta are estimated using a partial likelihood over the training observations, which allows maximum-likelihood estimation without needing to know the baseline hazard (Hosmer and Lemeshow 1999, section 7.3)
- Numerical integration is used to compute the coefficients, following Chen et al. (2005)
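For reference, the standard Cox formulation with time-varying covariates that these notes describe can be written as follows (generic notation, not copied from the paper):

```latex
% Hazard for individual i with time-varying covariate vector x_i(t)
h(t \mid x_i) = h_0(t)\, \exp\!\bigl(\beta^{\top} x_i(t)\bigr)

% Partial likelihood used to estimate beta without specifying h_0(t)
L(\beta) = \prod_{i:\, \delta_i = 1}
  \frac{\exp\!\bigl(\beta^{\top} x_i(t_i)\bigr)}
       {\sum_{j \in R(t_i)} \exp\!\bigl(\beta^{\top} x_j(t_i)\bigr)}
```

Here delta_i = 1 for uncensored (defaulted) cases and R(t_i) is the set of accounts still at risk at time t_i.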
Implementation
- Data on credit card accounts from a UK bank, covering 1997 to mid-2005.
- Training data: 1997-2000
- Out-of-sample data: 2002-2005
- Due to the large size of the data sample, forward and backward elimination methods for variable selection were judged too time-consuming
- Default defined as three consecutive missed payments
- Uses default within 12 months of opening the account as the outcome; a sketch of constructing this flag follows below
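A minimal sketch of how such a default flag could be constructed; the toy data and column names are hypothetical, not taken from the paper.

```python
import pandas as pd

# Hypothetical monthly payment history: one row per account per month on book,
# missed = 1 if that month's payment was missed.
payments = pd.DataFrame({
    "account_id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "month_on_book": [1, 2, 3, 4, 1, 2, 3, 4],
    "missed":        [0, 1, 1, 1, 0, 1, 0, 1],
}).sort_values(["account_id", "month_on_book"])

# Three consecutive missed payments -> rolling sum over a 3-month window equals 3.
consec3 = (
    payments.groupby("account_id")["missed"]
    .rolling(window=3)
    .sum()
    .reset_index(level=0, drop=True)
    .eq(3)
)

# The event only counts if it happens within 12 months of opening the account.
payments["default_12m"] = consec3 & (payments["month_on_book"] <= 12)

# Account-level event indicator for the survival model.
default_flag = payments.groupby("account_id")["default_12m"].any()
print(default_flag)  # account 1: True, account 2: False
```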
Problem of imbalanced data
- Discussed by Chawla et al. (2005)
- Natural distribution is not necessarily the optimal one (Weiss and Provost 2003)
- Most common and simplest solutions are 1) under-sampling the majority class or 2) over-sampling the minority class
- There is some concern that over-sampling leads to overfitting (Weiss 2004)
- Chose to over-sample the bads so that they are as many as the goods, which was found to give good results for this study (see the resampling sketch after this list)
- For logistic regression, the best predictive results were achieved when no over-sampling was used
- Over-sampling of bad cases artificially alters the distribution of training cases, so the Wald statistic cannot be used to generate P-values; instead a bootstrap method is used to compute a percentile confidence interval for each coefficient estimate for which P-values are reported (Efron and Tibshirani 1993)
- Errors on bads have a higher cost than errors on goods, and a cost function is used to determine the value of a prediction
- Cut-offs are set so as to minimize the total cost on the test set (see the cut-off sketch after this list)
- If a model performs well with cut-offs set on both the training and the test sample, this indicates that the model is good
- Models giving a lower mean cost have performed better
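A rough sketch of over-sampling the bads and of the bootstrap percentile intervals described above; the function names and the `fit_fn` interface are assumptions for illustration.

```python
import numpy as np

def oversample_bads(X, y, seed=0):
    """Resample the bad cases (y == 1) with replacement until they match the goods in number."""
    rng = np.random.default_rng(seed)
    good_idx = np.flatnonzero(y == 0)
    bad_idx = np.flatnonzero(y == 1)
    resampled_bads = rng.choice(bad_idx, size=good_idx.size, replace=True)
    idx = np.concatenate([good_idx, resampled_bads])
    return X[idx], y[idx]

def bootstrap_percentile_ci(fit_fn, X, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence intervals for coefficient estimates (Efron and Tibshirani 1993).

    `fit_fn(X, y)` is assumed to refit the model and return its coefficient vector.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample cases with replacement
        coefs.append(fit_fn(X[idx], y[idx]))
    coefs = np.asarray(coefs)
    lower = np.percentile(coefs, 100 * alpha / 2, axis=0)
    upper = np.percentile(coefs, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```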
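And a sketch of cost-based cut-off selection; the cost values are placeholders, not those used in the study.

```python
import numpy as np

# Placeholder misclassification costs: an error on a bad costs more than one on a good.
COST_BAD_AS_GOOD = 10.0
COST_GOOD_AS_BAD = 1.0

def mean_cost(scores, y, cutoff):
    """Mean cost when cases scoring at or above the cutoff are predicted bad."""
    pred_bad = scores >= cutoff
    missed_bads = (~pred_bad) & (y == 1)   # bads predicted as good
    false_bads = pred_bad & (y == 0)       # goods predicted as bad
    total = COST_BAD_AS_GOOD * missed_bads.sum() + COST_GOOD_AS_BAD * false_bads.sum()
    return total / len(y)

def best_cutoff(scores, y):
    """Cut-off that minimizes the mean cost on the given sample (training or test)."""
    return min(np.unique(scores), key=lambda c: mean_cost(scores, y, c))
```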
Results
- Interest rate is by far the most important factor, followed by Earnings
- Lagged values (3, 6 and 12 months) of the macroeconomic variables did not perform better than the simplest model with values taken at the point of default
- The interaction term IR x Income has a negative coefficient estimate, suggesting that individuals with a higher income are less sensitive to a change in interest rates in determining their PD (see the sketch below)
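To make the interaction reading concrete (generic notation, not the paper's exact specification), the linear predictor contains a term of the form:

```latex
\beta^{\top} x(t) = \dots + \beta_{IR}\, IR(t) + \beta_{Inc}\, Income
                  + \beta_{IR \times Inc}\, IR(t)\cdot Income + \dots
```

so the sensitivity of the log-hazard to the interest rate is beta_IR + beta_{IR x Inc} * Income; with beta_{IR x Inc} < 0, that sensitivity shrinks as income rises.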