Do Credit Companies Really Need Seven Years of Our Data?

If you grew up in the United States, you probably have a distinct memory of receiving 15,000 different credit card applications the MOMENT you turned 18. You may even remember being accosted by friendly employees at American Eagle, Hollister, Best Buy, and Barnes & Noble trying to sell you on their latest credit card to guarantee you 20% off your purchase! The moment you agreed to sign on with that first card was, for many of you, the moment your journey into the world of credit began! There are also things like student loans and car loans, and, if you’re in your mid-20s or early-30s, you may be looking at home loans and mortgages! All of this is wrapped up in the world of credit.

In the world of credit, there are many ways of making mistakes and not too many ways of digging your way out. If you buy more than you can pay for on credit and miss a few payments, you’re likely to rack up an adverse credit history, which is reflected in your score. Did you know that some of these mistakes will remain for seven years?

According to credit monitoring companies like Equifax, Experian, and TransUnion, mistakes such as late or missed payments, accounts that head into collections, and Chapter 13 bankruptcy all remain in your credit history (and therefore affect your daily life) for up to seven years. Chapter 7 bankruptcy lasts for ten years! Of course, there are a great many things you can do to build your score back up even with these at-times-heavy dings on your credit. You can begin by calling collections or the company you owe money to and setting up a payment plan, for example. However, I long wondered about the *why* behind it all.

Why seven years? Seven years is an incredibly long time for most mistakes to follow you around. A missed payment or two did not seem like the sort of mistake that falls in the category of “Should Follow You For 7+ Years.”

I wonder — can payment history that reaches years into your past indeed be a good predictor for your ability to make credit payments in the future?

I did some digging and found a data set upon which to test my questions! The University of California, Irvine’s Machine Learning Repository had a dataset related to this question! It contained payment history, payment status, and account information for 30,000 people who were lent credit by a central bank in Taiwan between April and September of 2005. I wanted to know which features would be the best predictors for whether someone would default on their credit payment in the next month. I created classification models to answer this question.

Classification models are essential to machine learning. They are fed training data full of observations containing a variety of predictor variables and a target variable. If all is done correctly, these models can predict to which sub-group a specific observation belongs.

Think of your email inbox for a second. Do you ever wonder how spam rarely, if ever, seems to reach you? That is thanks to a classification filter built into your inbox! It has been trained on millions and millions of combinations of words to understand which words feature more in a spam email than a regular email. It checks each email you receive and flags any spam that it comes across. These models aren’t perfect, of course; they are not entirely leak-proof. However, they give us an incredible amount of information about how to classify data relevant to our everyday lives.

I built 15 different models to analyze this data. I will highlight two of the most influential models and then discuss which features I found were most relevant when predicting an individual defaulting on their credit. The two models I would like to highlight are called Decision Trees and Random Forests.


Two of the most important terms related to decision trees are entropy and information gain. Entropy measures the impurity of the input set. Information gain is the decrease in entropy that a split produces.

When I refer to a data set’s impurity, here is what I mean: If you have a bowl of 100 white grapes, you know that if you pluck a grape at random, you will get a white grape. Your bowl has purity. If, however, I remove 30 of the white grapes and replace them with purple grapes, your likelihood of plucking out a white grape has decreased to 70%. Your bowl has become impure. The entropy has increased.
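To put numbers on the grape bowl, here is a minimal sketch that computes Shannon entropy in bits for the two bowls described above. The function name `entropy` and the bowl variables are made up for illustration; they are not taken from the original analysis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Add up -p * log2(p) for every class that actually appears.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


pure_bowl = ["white"] * 100                      # 100 white grapes
mixed_bowl = ["white"] * 70 + ["purple"] * 30    # 30 swapped for purple

print(entropy(pure_bowl))              # zero: there is no uncertainty
print(round(entropy(mixed_bowl), 3))   # 0.881 bits
```

The pure bowl scores zero, and the 70/30 bowl scores roughly 0.88 bits. A 50/50 bowl would reach the two-class maximum of 1 bit.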

As each candidate split is considered in your decision tree, the entropy of the resulting child nodes is measured. The split that lowers entropy the most compared to the parent node and the other candidate splits is chosen: the lesser the entropy, the better.
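As a rough illustration of how candidate splits can be compared, the sketch below computes information gain as the parent's entropy minus the size-weighted entropy of the children. The labels and the two candidate splits are invented toy data, and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["default"] * 4 + ["paid"] * 6

# Split A separates the two classes perfectly.
split_a = [["default"] * 4, ["paid"] * 6]
# Split B leaves each child with the same class mix as the parent.
split_b = [["default"] * 2 + ["paid"] * 3, ["default"] * 2 + ["paid"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(round(gain_a, 3), round(gain_b, 3))
```

Split A removes all of the parent's entropy, while split B removes none, so a tree using this criterion would prefer split A.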

The Decision Tree has a great many hyperparameters that we need to tune. Hyperparameters are parameters whose values are set before the learning process begins. In Decision Tree Models, the relevant hyperparameters are the criterion, max depth, minimum samples to split, minimum samples per leaf, and max features.

  1. Criterion: Entropy or Gini. These are different measures of impurity. There is not a vast difference between the two.
  2. Maximum Depth: Limits the depth of the tree so that it generalizes better. This is set depending on your need.
  3. Minimum Samples to Split: The minimum number of samples a node must contain before it can be split.
  4. Minimum Samples per Leaf: The minimum number of samples that must remain in each terminal node.
  5. Maximum Features: The maximum number of features to consider when splitting a node.
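For readers who want to see where these five settings go, here is a small sketch using scikit-learn's `DecisionTreeClassifier`. Using scikit-learn is an assumption on my part, the data is synthetic rather than the Taiwan credit dataset, and the specific values are arbitrary placeholders, not tuned results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; NOT the credit dataset from this article.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(
    criterion="entropy",    # 1. impurity measure ("gini" is the alternative)
    max_depth=4,            # 2. cap on how deep the tree may grow
    min_samples_split=20,   # 3. samples a node needs before it may be split
    min_samples_leaf=10,    # 4. samples every terminal node must keep
    max_features=5,         # 5. features considered at each split
    random_state=0,
)
tree.fit(X_train, y_train)

test_accuracy = tree.score(X_test, y_test)
print(tree.get_depth(), round(test_accuracy, 3))
```

The numbered comments line up with the list above, so each setting can be changed on its own to see how the tree's depth and held-out accuracy respond.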

While I initially wrote code to tune each of these individually, I ultimately used a function called GridSearch to find the best parameters for me. GridSearch tries every possible combination of the parameter values that you feed it to find out which combination will give you the best possible score. It combines K-fold cross-validation with a grid search of the parameters to do so.
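One widely used implementation of this idea is scikit-learn's `GridSearchCV`; that it matches the exact function used in this analysis is an assumption. The sketch below runs it on synthetic data with a small, made-up parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; NOT the credit dataset from this article.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical grid: 2 * 3 * 2 = 12 combinations to try.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Every combination is fit once per fold, so the cost grows quickly as the grid widens. With 12 combinations and 5 folds, this small example already trains 60 trees before refitting the best one.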

This is an example of a tuned decision tree with shallow depth:

At this point, I wanted to see what this model saw as relevant to predicting one class over the other. I created a chart of important features.

This chart shows MASSIVE importance for the payment status in September but such a low readout for the rest. There needed to be a way to gain more precise information.

A word on the data for a moment — I mention payment status throughout this essay. It becomes a central theme in the analysis. The column that measured payment status used the numbers -2, -1, 0, and 1–9. -2 denoted no use of credit for the month, -1 denoted someone who had paid up their account that month, 0 denoted someone using revolving credit, and 1–9 marked someone who was that many months behind in payments (9 stood for nine and above).
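The coding scheme just described can be written down as a small lookup function. This is only a sketch; the function name `describe_status` is hypothetical and the wording of each label is my paraphrase of the description above.

```python
def describe_status(code):
    """Translate a payment-status code into the meaning described in the text."""
    if code == -2:
        return "no credit used this month"
    if code == -1:
        return "account paid up this month"
    if code == 0:
        return "using revolving credit"
    if 1 <= code <= 8:
        return f"{code} month(s) behind on payments"
    if code == 9:
        return "9 or more months behind on payments"
    raise ValueError(f"unexpected status code: {code}")


for code in (-2, -1, 0, 2, 9):
    print(code, "->", describe_status(code))
```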


Imagine creating MANY decision trees!

Random Forests are also very resilient to overfitting: the diverse decision trees in our random forest are trained on different samples of the data and look to varying subsets of features to make predictions. There is room for error in any given tree, but the odds that every tree will make the same error because they all looked at the same predictor are incredibly small!
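A back-of-the-envelope calculation shows why a vote among many trees helps. The sketch below assumes each tree is wrong independently with the same probability, which is an idealization (real trees in a forest are correlated), and the 30% error rate is an arbitrary illustrative number.

```python
from math import comb


def majority_wrong_probability(n_trees, p_wrong):
    """Chance that more than half of n independent voters are wrong."""
    threshold = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_wrong**k * (1 - p_wrong) ** (n_trees - k)
        for k in range(threshold, n_trees + 1)
    )


single = majority_wrong_probability(1, 0.3)    # one tree on its own
forest = majority_wrong_probability(101, 0.3)  # majority vote of 101 trees
all_wrong = 0.3 ** 101                         # every single tree wrong at once

print(single, forest, all_wrong)
```

Under these assumptions, a lone tree is wrong 30% of the time, while the majority of 101 trees is wrong far less often, and all of them being wrong together is vanishingly unlikely. Correlation between real trees weakens this effect, which is part of why forests randomize both the rows and the features each tree sees.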

After GridSearching my random forest and fitting it to my data, these were the features it found to be important:
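For anyone curious how such a ranking can be produced, here is a sketch using scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The library choice is an assumption, the data is synthetic, and the feature names are placeholders rather than the real column names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
forest.fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These importances are normalized to sum to 1, so when a forest spreads credit across more features, the top feature's share shrinks even if it remains the strongest predictor.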

The above feature importance chart is incredibly informative.

The status of payment from September is still paramount.

  • However, instead of sitting above 0.8, its importance is now close to 0.175.
  • The amount one pays each month and the balance have increased in importance; however, the status of payment in August & September is still the best predictor.
  • Most notable third place: july_status

I could have stopped here. My models were sufficiently strong, and I had a decent amount of information regarding feature importance. However, I wanted to dig a bit more deeply into the data, and therefore I created a new dataframe that included the top ten predictors.

The top ten features were the status, monthly payment, and account balance from the past three months (September, August, and July) and the credit limit. I came to decide on these features based on the results of the above feature importance charts and the conversations I had with people who worked in banking, particularly in lending. They explained that recent payment history and information tell a far better story than older data. The two models from this analysis that I will focus on are the GridSearched Decision Tree and the GridSearched Random Forest.
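Building that reduced dataframe might look something like the sketch below, assuming pandas. Only `july_status` appears as a column name in this article; every other column name here, including the target `default_next_month`, is a hypothetical stand-in, and the tiny frame is made-up data.

```python
import pandas as pd

months = ["september", "august", "july"]

# Status, payment, and balance for three months, plus the credit limit.
top_ten = (
    [f"{month}_status" for month in months]
    + [f"{month}_payment" for month in months]
    + [f"{month}_balance" for month in months]
    + ["credit_limit"]
)

# Tiny made-up frame standing in for the full dataset (hypothetical columns).
all_columns = top_ten + ["june_status", "age", "default_next_month"]
df = pd.DataFrame({column: [0, 1] for column in all_columns})

X_top = df[top_ten]               # predictors: the top ten features only
y = df["default_next_month"]      # target: default in the following month

print(list(X_top.columns))
```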

The Decision Tree didn’t bring many new results. It again showed September and August’s payment statuses as paramount above the rest of the variables. The Random Forest did as well, albeit more cleanly. Here is the chart displaying the feature importance from the GridSearched Random Forest model fit with data from the top ten dataframe:

That’s the ballgame!

Well, there we have it! The past three months’ payment statuses are the most important features when determining whether someone will default on their next credit payment. This dataset only covered a few months of payment history. I would love to do more in-depth research into years of payment data to see if my questions are answered clearly. However, this much I understand to be valid from the data analysis: your payment activity from the more recent period of your life is much more predictive of your next moves financially than data from years in your past. I do not believe that credit companies need to keep seven years of your mistakes, holding them over your head for what could be around 1/10th or so of your lifespan. Please let me know if there are reasons I am missing for such a long period of penance.
