While the percentage of customers defaulting on loans is generally low, the financial impact can be disproportionately large. In this case, defaulting on a loan means failure to make loan payments for consecutive months. This makes default prediction a key problem for banks and other financial institutions: with machine learning models that flag accounts likely to default, they can proactively handle problematic accounts and minimize risk. Classifying loan accounts that are likely to default within 12 months provides enough lead time to handle the at-risk accounts and mitigate losses, while not worrying over accounts which may still recover. By applying logistic regression to customer transaction data, we were able to increase the loan default prediction F1-score from 36% to 54%, significantly reducing potential losses.
The project was built and deployed in IBM’s Cloud Pak for Data (CP4D) using its analytics projects, which are collaborative spaces for organizing and managing project resources and can support the full data science process. Analytics projects provided shared, direct access to data sources, which simplified data exploration and preparation and accelerated modeling by enabling easily customizable and scalable Jupyter notebook environments. Productionization of the final model could be implemented by scheduling a versioned Jupyter notebook, or by wrapping the data preparation and model inside a Watson Machine Learning pipeline, which would allow the model to be called as an API.
This solution focused on determining how transactional banking information could be used to predict loan account default. Bank account transaction data provides insight into customer behavior and into the patterns that precede loan default. The data consisted of account information including the types, sources, locations, and amounts of transactions, along with the account balance and the age and type of each account. Combining this information with the age and type of the loan provided a rich feature set to work from. One of the first issues to tackle was the large imbalance between the default and non-default classes. As seen in the graph below, the default rate of accounts is 2% or less per quarter. Without additional sampling methods, this imbalance would cause the model to learn almost exclusively the patterns of non-default accounts and to perform poorly on the positive (default) class.
To counter this, the training data consisted of 100% of the defaults and a randomly selected 30% of the non-defaults, which raised the default rate from ~0.7% to ~2%. In addition, different weights were assigned to the default and non-default classes in the model, which helps the model learn despite the imbalance. The same sampling steps were applied to the test data so that it had the same default rate as the training data, allowing a fair evaluation of the model. The training data was from 2012–2017 and the test data was from 2018 Q1–Q3.
While sampling was useful for training and testing the model, no sampling was done on the validation data – 100% of defaults and non-defaults were used – to show how the model would perform in real life. The validation data was 2019 Q1 – Q3.
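The sampling scheme above can be sketched in a few lines of pandas. The table, column names, and counts here are illustrative assumptions, not the actual schema or data from the project:

```python
import pandas as pd

# Toy account-level table with a ~0.7% default rate (assumed, for illustration).
accounts = pd.DataFrame({
    "account_id": range(1000),
    "default": [1] * 7 + [0] * 993,
})

# Keep 100% of the defaults and a 30% random sample of the non-defaults.
defaults = accounts[accounts["default"] == 1]
non_defaults = accounts[accounts["default"] == 0].sample(frac=0.30, random_state=42)
train = pd.concat([defaults, non_defaults])

default_rate = train["default"].mean()
print(f"Default rate after sampling: {default_rate:.1%}")  # roughly 2%
```

All defaults are retained so no positive signal is lost; only the majority class is thinned.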
Another issue to handle was data aggregation. Since the data was at transaction level and the target was on a quarterly scale, aggregations were applied to create useful account level features which could be used by a model to create one classification per account. These features included monthly and yearly transactional summaries.
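As a sketch of this aggregation step, the following pandas snippet rolls hypothetical transaction rows up to one feature row per account. The column names are assumptions; the article does not show the real schema:

```python
import pandas as pd

# Toy transaction-level data (assumed columns).
tx = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "tx_type":    ["debit", "credit", "debit", "debit", "credit"],
    "amount":     [50.0, 200.0, 30.0, 75.0, 120.0],
})

# One row per account: total amount per transaction type,
# plus overall transaction count and average amount.
features = (
    tx.pivot_table(index="account_id", columns="tx_type",
                   values="amount", aggfunc="sum", fill_value=0.0)
      .join(tx.groupby("account_id")["amount"]
              .agg(tx_count="count", avg_amount="mean"))
)
print(features)
```

In the real project the same idea would be applied per month and per year to build the transactional summary features.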
The following two graphs show how some of the average transaction amounts by transaction type and total sum of debits and credits could be used to differentiate default vs non-default.
Once all features were created, Analysis of Variance (ANOVA) and the Chi-Square Test of Independence were used to measure the strength of the relationship between each feature and the response variable, loan default. The ANOVA test applies to continuous features and compares the means of multiple groups to determine whether there are statistically significant differences between them. The Chi-Square test determines whether there is a significant relationship between two categorical variables, each with two or more categories. These tests are useful for finding significant relationships in data that cannot be inspected visually, for confirming visual analysis, and for removing unhelpful features before modeling, since a large feature set makes a model more computationally expensive. Using each feature's test statistic against the response variable, the feature set could be reduced to a smaller set with greater statistical significance. Correlation between features was also tested to ensure the features were linearly independent. These tests provided a final subset of features that were acceptable for modeling.
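A minimal sketch of these two tests using scikit-learn's `f_classif` (ANOVA F-test) and `chi2` functions; the data is synthetic and the feature construction is an assumption:

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)  # binary default flag

# Two continuous features: one shifts with the class (informative), one is noise.
X_cont = np.column_stack([
    y * 2.0 + rng.normal(size=200),
    rng.normal(size=200),
])
f_stat, f_pval = f_classif(X_cont, y)

# Chi-square requires non-negative (e.g. count or one-hot) features.
X_cat = np.column_stack([y, rng.integers(0, 2, size=200)])
chi_stat, chi_pval = chi2(X_cat, y)

print("ANOVA p-values:", f_pval)
print("Chi-square p-values:", chi_pval)
```

Features with large test statistics (small p-values) against the response are kept; the rest are dropped before modeling.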
Three main types of models were used to classify the accounts as default or non-default: random forest, gradient boosting, and logistic regression.
Random forest and gradient boosting are ensemble models that combine many decision trees. Random forest averages the predictions of independently trained trees (the mean for regression, the majority vote for classification), while gradient boosting sequentially adds weak learners, usually shallow trees, with each new learner correcting the errors of the ones before it. While performance depends on the data, gradient boosting tends to outperform random forest.
Logistic regression applies the logistic function to a linear combination of the features, transforming it into a probability between 0 and 1 that can be thresholded into a binary classification. While random forest and gradient boosting can produce better-performing models, logistic regression produces a more easily interpretable and explainable model, which can be an important benefit.
While gradient boosting would typically be the model of choice in many classification projects, model explainability was of greater importance here, so logistic regression was used as the final model. The random forest and gradient boosting models were used to baseline performance and for comparison against the logistic regression model.
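The three-model comparison can be sketched with scikit-learn on a synthetic imbalanced dataset; the actual features, hyperparameters, and class weights used in the project are not shown in the article, so everything below is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the account-level feature set, ~2% positive class.
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
}
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```

The `class_weight="balanced"` option is one way to apply the differing default/non-default weights mentioned earlier; the tree ensembles serve as performance baselines for the logistic regression model.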
The following image is a graph of how a decision tree built by a random forest model could be used to classify default vs non-default.
Recursive feature elimination (RFE) was then used with the logistic regression model to further reduce the feature set to only the 10 most significant features. A few of the final features are:
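Recursive feature elimination with a logistic regression estimator can be sketched with scikit-learn's `RFE`; the data below is synthetic, standing in for the real feature set:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 25 candidate features, reduced to the 10 strongest.
X, y = make_classification(n_samples=1000, n_features=25,
                           n_informative=8, random_state=0)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print("Selected feature indices:", selected)
```

RFE repeatedly fits the model and drops the weakest feature (by coefficient magnitude) until only the requested number remain.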
With such high data imbalance, accuracy was not a good metric for model performance. Instead, the F1-score was used, and the model achieved an average score of 54%, an improvement over the random-guessing baseline of 36%.
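To see why accuracy misleads at a ~2% default rate, consider a degenerate model that always predicts non-default (the numbers are illustrative, not the project's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1000 accounts at a 2% default rate (assumed for illustration).
y_true = np.array([1] * 20 + [0] * 980)

# A "model" that never predicts default looks excellent on accuracy...
always_zero = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_zero)
f1 = f1_score(y_true, always_zero, zero_division=0)
print(acc)  # 0.98
print(f1)   # 0.0 -- it never finds a single default
```

The F1-score, the harmonic mean of precision and recall on the default class, exposes this failure, which is why it was chosen as the evaluation metric.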
The model could be scheduled to run monthly, or as often as needed, with minimal effort and create easily accessible predictions for business users. When used in a live setting, the model’s classification of accounts can provide financial benefits by allowing a bank to be proactive about handling loan accounts that are at risk of default, and to reduce the impact of unexpected loan defaults. It can also be used to determine the potential riskiness of new loans when the customer already has a deposit account with the bank.
Check out more of our AI & Machine Learning solutions.