I believe I can prove that LendingClub.com, the largest peer-to-peer loan platform, has for years been rating loans with systematic bias. Some borrowers are rated lower risk than they should be, and others are rated higher risk than they should be.

Cause for Suspicion

My investigation began with a simple question: Why do hedge funds get consistently high returns from peer-to-peer lending, while retail investors often report vastly lower returns using the same platform? For instance, the Prime Meridian Income Fund keeps emailing me this graph of their fund’s absurdly good returns from peer-to-peer lending:

How do they do this? According to them, they use “proprietary credit algorithms to overweight the best risk/reward loans”. In other words, at least some of the lending platforms do a poor job of rating their loans, and this fund developed an algorithm that rates loans correctly in order to exploit these inefficiencies.

So if the hedge funds are able to disproportionately stack their portfolio with loans that perform better than average, who gets stuck with the rest of the loans performing worse than average? That would be you and me, my friend: those of us who simply turn on the automatic investing, and let LendingClub diversify us across all the loans. Investing in all the loans will always lose compared to an algorithm that beats LendingClub’s own risk assessment methods, especially if rich hedge funds are scooping up all the best loans shortly after they appear.

It is easy to find people complaining about poor returns at LendingClub, although it is difficult to prove anything from these anecdotes. So what can we prove? A lot, as it turns out, using LendingClub’s own publicly available data (note: log in to your LendingClub investor’s account before downloading this data, otherwise some columns will be missing⁵).

Loan Grading Basics

LendingClub rates loans from A1 (the least risky, having the lowest interest rate) down to G5 (the most risky, having the highest interest rate). In general, loans rated A1 can be expected to default¹ slightly less than those rated A2, which default slightly less than those rated A3, and so on. This rating system basically works:

Things appear to break down when we get to the highest risk loans, but we have not yet accounted for some loans being paid off earlier than others. We will discuss early payoffs more later, but for now here is the same chart adjusted for the risk of early payoffs:

Clearly the rating system basically works. However, we will find that many loans are several grades off from where they should be. Something rated B1 might actually be very low risk, and should have been rated A3. Something rated B3 might actually be much higher risk, and should have been rated C1.

Mortgage or Rent?

One of the biggest factors in most people’s credit score is whether they have a mortgage. Let’s start there. I’m sure LendingClub already takes this into account when they grade loans . . . right?

WHAT?! Renters default more than people with a mortgage in every single grade! Let’s take a closer look at the data. In the graph above, the effect appears to be smaller for the grade-A loans, but that is deceptive. A1 borrowers with a mortgage have a 1.45% chance of default, whereas A1 renters have a 2.19% chance of default. That means A1 renters are 51% more likely to default! The graph below illustrates this across all grades:

Across all loan grades, being a renter increases the risk of default by an average of 26% compared to people having a mortgage. Before we get out the pitchforks, let’s see if LendingClub has fixed this. After all, we’re looking at more than 10 years of data. Perhaps this problem happened in older loans but is no longer an issue.

(Renters are on average 26% more likely to default, but this varies over time.)

Very few of the newest loans have defaulted, so that makes the signal noisy on the right side of the graph since there aren’t many defaults to analyze, but this doesn’t look good. It appears that LendingClub slightly over-penalized renters in 2008, but ever since 2010 it appears that renters have been under-penalized, and it might be getting worse. Hedge funds can massively increase their returns simply by manually selecting loans and avoiding renters. That’s incredibly low-hanging fruit.
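The "51% more likely" figure quoted above for A1 loans is a relative increase, not a difference in percentage points. Here is a minimal sketch of that arithmetic; the function name is made up for illustration, and the two rates are the ones quoted in the text.

```python
def relative_increase(baseline_rate: float, comparison_rate: float) -> float:
    """Return how much larger comparison_rate is than baseline_rate, as a fraction."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return comparison_rate / baseline_rate - 1.0


# A1 default rates quoted in the text
a1_mortgage = 0.0145  # 1.45% of A1 mortgage holders default
a1_renter = 0.0219    # 2.19% of A1 renters default

increase = relative_increase(a1_mortgage, a1_renter)
print(f"A1 renters default {increase:.0%} more often than A1 mortgage holders")
```

The same function applied to the car loan and small business figures quoted later (1.80% and 4.87%) gives roughly the 170% increase reported there.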

Unemployed People

Having found that massive smoking gun, I went fishing for other inefficiencies. The next thing I found was that people who left their job title blank, or chose “N/A” for their length of employment, tended to default much more than other borrowers. It is not a huge leap to assume that these people may be unemployed. Here are some more graphs, adjusted to remove rent bias²:

(Unemployed people are on average 34% more likely to default, but this varies over time.)

Note the Y-axis on that last graph above! It’s 4x wider than the graph for renters.

Surprise! Unemployed people are far, far riskier than other borrowers despite LendingClub lumping them in the same risk grade. It’s easy picking for the hedge funds to exclude these loans (and foist the unemployed on the rest of us). Again, the right side of the graph is noisy due to the loans being new and having very few defaults to analyze, but it appears this grading error may have been getting worse over the past few months. Unemployed people who do pay off their loans also tend to pay them off fifteen days earlier on average, which makes them even riskier, as we will see later in the analysis.
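The "probably unemployed" rule described in this section (blank job title, or "N/A" for length of employment) can be sketched as a small predicate. The field names emp_title and emp_length are assumptions about how the downloaded data labels these columns, and the sample rows are invented.

```python
from typing import Mapping, Optional


def looks_unemployed(loan: Mapping[str, Optional[str]]) -> bool:
    """Heuristic from the text: blank job title OR "N/A" employment length."""
    # Field names are assumed; missing values are treated as blank.
    title = (loan.get("emp_title") or "").strip()
    length = (loan.get("emp_length") or "").strip().lower()
    return title == "" or length == "n/a"


# Invented sample rows, for illustration only
sample_loans = [
    {"emp_title": "Nurse", "emp_length": "5 years"},
    {"emp_title": "", "emp_length": "2 years"},
    {"emp_title": "Driver", "emp_length": "N/A"},
    {"emp_title": None, "emp_length": None},
]

flags = [looks_unemployed(loan) for loan in sample_loans]
print(flags)
```

This is only a proxy: a blank job title does not prove unemployment, which is why the text calls it an assumption.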

Credit History not Useful

Next, I started looking at credit history. Unfortunately, the data provided by LendingClub does not show borrower credit history at the time they applied for the loan, but rather their current credit history⁵. Since their credit might be very different now from when they applied, the credit data provided by LendingClub can’t be used to show rating bias.

Verification Status

LendingClub provides a flag in their data called verification_status which they define as “Indicates if income was verified by LC, not verified, or if the income source was verified”. The three possible values are “Verified”, “Not Verified” and “Source Verified”. I’m not sure how they verify an applicant’s income, but one would generally assume that “Not Verified” would be an indicator of a higher risk loan. However, the data shows the opposite effect for some reason. These graphs show verified vs not verified after adjusting for rent bias and unemployment bias²:

(Verified people are on average 21% more likely to default than unverified, but this varies over time.)

Why unverified incomes would indicate a safer loan is an interesting question. Perhaps LendingClub automatically penalizes the rating of these loans, but does so too aggressively. Whatever the cause, it leaves more big piles of money for hedge funds to collect. They can take the mis-rated loans without income verification, while the retail investor gets the less desirable loans with verified income. Again, the data seems to show this problem getting worse in recent months.

With verification status, there is one big caveat. Loans categorized as unverified tend to be paid off earlier. Verified loans are fully paid off after 605 days on average. Unverified loans are fully paid off after 537 days on average. This actually significantly increases the risk of unverified loans (see next section for an explanation). If we take into account early payoffs, the advantage of unverified loans over verified loans falls from 21% on average to 8%:

Unverified loans still beat verified loans most of the time after considering early payoffs

Let’s take a moment to understand why early payoffs affect risk in this way:

Understanding Early Payoffs

Imagine two types of loans: loan type A is typically paid off in 1 year, while loan type B is typically paid off early, after 1 month. If they both have a 12% chance of default, then a portfolio of type A loans will have 1% of loans default every month, while a portfolio of type B loans will have 12% of loans default every month. They both have a 12% chance of default, but one type is 12 times riskier than the other! Many of the graphs here don’t account for early payoffs, but when early payoffs are a significant factor, I will provide additional information about their impact. (Early payoffs don’t appear to differ between renters and mortgage holders.)
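The two-loan example can be written out as a small calculation. This is a simplified sketch that spreads the lifetime default probability evenly over the time the money is at risk; the function name is invented, and this is not necessarily the exact adjustment used for the charts in this article.

```python
def defaults_per_month(lifetime_default_prob: float, typical_months: float) -> float:
    """Spread a lifetime default probability evenly over the months at risk."""
    if typical_months <= 0:
        raise ValueError("typical_months must be positive")
    return lifetime_default_prob / typical_months


# The example from the text: both loan types have a 12% chance of default
type_a = defaults_per_month(0.12, 12)  # typically paid off in 1 year
type_b = defaults_per_month(0.12, 1)   # typically paid off after 1 month
print(f"type A: {type_a:.0%} per month, type B: {type_b:.0%} per month")

# Rough cross-check using the averages quoted earlier: verified loans default
# 21% more often, but stay outstanding 605 days versus 537 for unverified.
raw_ratio = 1.21
adjusted_ratio = raw_ratio * 537 / 605
print(f"verified vs unverified, per day at risk: {adjusted_ratio - 1:.1%} higher")
```

Applied to the overall averages, this back-of-the-envelope version shrinks the 21% gap to roughly 7%, in the same neighborhood as the 8% quoted above, which was presumably computed from per-loan data rather than from two averages.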

Loan Purpose

Next, let’s break down the data by loan purpose. Here’s a chart showing the loan purpose categories LendingClub uses and how many loans are in each:

Next, let’s see how those categories perform versus the average loan (risks have been adjusted to account for rent bias, unemployment bias, and verification status bias²):

Here again early payoffs make a big difference. Here’s what the same chart looks like once we account for early payoffs:

Clearly “educational” and “renewable energy” loans perform poorly, but there are almost none of them in the system so that isn’t very interesting. Weddings are typically not paid off early, which switches them from looking bad to looking good, but there aren’t many of those either. Instead, let’s look closer at car loans (lower risk) versus small business loans (high risk):

Once adjusted for risk of early payoff, A1 car loans have a 1.80% chance of default while A1 small business loans have a 4.87% chance of default: a 170% increase in risk! The advantage appears to go away for the riskier loans, but there are very few car loans in the riskier grades, so the data is noisy. Here’s the risk increase across all loan grades:

And over time:

Yup: the difference persists over time, and seems to get worse in recent months. Admittedly, car loans and small business loans won’t show up very often, but when they do, the hedge funds will scoop up the car loans and leave the small business loans for the retail investors (suckers) to pick up. I didn’t bother to make the graphs, but I expect that loans with listed purpose “other” are also good loans to grab when they appear.

There Are No Bad Zip Codes

LendingClub provides the first three digits of borrower zip codes, and early passes at the data seemed to show biases by zip code, but in the end I concluded that LendingClub (or perhaps some data they rely on, like a FICO score) is now accounting for zip code as well as can be done, starting in 2017. Prior to 2017 data, past performance of a zip code had some predictive value, but that value pretty much disappears in the latest loans. A word of caution to anybody else analyzing this data: since loans can last up to 5 years, an economic event in a particular zip code, such as a factory closing, can affect multiple vintages of loans. If the event happened in 2016, loans originating from as far back as 2011 could be affected. This can make it appear as if zip code past performance has more predictive value than it actually does. To test predictive power, you have to compare against how the same zip codes performed in vintages more than five years earlier. Therefore I used zip code performance for 2010 loans to try to predict 2016 zip code performance, and 2011 loans to predict 2017 performance³. Here are a couple of graphs:

Zip code past performance used to have some predictive power

A zip code which did poorly for loans originating in 2010 meant that same zip code would be slightly more likely to do poorly for loans originating in 2016.

No more predictive power

2017 zip code performance appears completely independent of how those same zip codes performed in 2011. After seeing so many things obviously done poorly, it is surprising to see zip codes being done so well. As far as I can tell, you can’t make a list of “bad zip codes” to avoid⁴.
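One way to put a number on "predictive power" is the correlation between each zip prefix's default rate in the old vintage and in the new one. The sketch below is an assumption about how such a check could be done, not the method behind the graphs above, and the default rates in it are made up.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up default rates keyed by 3-digit zip prefix (illustration only)
old_vintage = {"100": 0.10, "606": 0.14, "900": 0.08, "331": 0.16}
new_vintage = {"100": 0.09, "606": 0.12, "900": 0.11, "331": 0.13, "787": 0.10}

# Only zip prefixes present in both vintages can be compared
shared = sorted(old_vintage.keys() & new_vintage.keys())
r = pearson([old_vintage[z] for z in shared], [new_vintage[z] for z in shared])
print(f"correlation across {len(shared)} zip prefixes: {r:.2f}")
```

A correlation near zero between vintages more than five years apart would match the "no more predictive power" picture; a clearly positive one would match the earlier years.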

Smaller Loans (Under $5k) Outperform

Next, let’s look at how default rates are affected by the size of the loan. Default rates have been adjusted to account for rent bias, unemployment bias, and verification status bias².

Loans under $5k are significantly less likely to default than other loans, perhaps because they are easier to repay? Let’s see how this holds up across loan grades:

Loans from $15k to <$20k are 36% more likely to default than loans from $0 to <$5k. The difference is fairly consistent across loan grades:

Loans $15k to <$20k persistently default more than loans < $5k over time, although the bias has been slowly getting better

By now you know what I am going to say next. The hedge funds are probably grabbing all these small loans while retail / auto investors get stuck with the larger loans which underperform.

Note the graphs above don’t account for early payoffs, which end up significantly changing some of the risks:

When you account for the tendency for large loans to be paid off early, the biggest loans become the riskiest, while $25k to $30k loans become more attractive due to their tendency to be paid off later.
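The size comparison in this section depends on sorting loans into $5k bands. Here is a minimal sketch of that bucketing and of a raw default rate per band; the helper names and the sample loans are invented, and it does not include the early-payoff or bias adjustments described in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def size_bucket(amount: float, width: int = 5000) -> str:
    """Label the $5k-wide band a loan amount falls into, e.g. '$15k to <$20k'."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    low = int(amount // width) * width
    return f"${low // 1000}k to <${(low + width) // 1000}k"


def default_rate_by_bucket(loans: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """Raw (unadjusted) share of defaulted loans in each size band."""
    totals = defaultdict(lambda: [0, 0])  # band -> [defaults, loan count]
    for amount, defaulted in loans:
        band = size_bucket(amount)
        totals[band][0] += int(defaulted)
        totals[band][1] += 1
    return {band: d / n for band, (d, n) in totals.items()}


# Invented (amount, defaulted) pairs, for illustration only
sample = [
    (3000, False), (4500, False), (2000, True), (4000, False),
    (16000, True), (18000, False), (15000, False), (19000, True),
]
rates = default_rate_by_bucket(sample)
print(rates)
```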

Why Systematically Rate Loans Incorrectly?

So why would LendingClub rate loans so poorly? I can speculate about a couple of reasons, which may or may not be true:

They would rather have more hedge funds than more retail investors. The hedge funds contribute millions of dollars each while you and I contribute perhaps a few thousand dollars each, but they have to answer our questions and give us customer service too, even though we are a much smaller contributor to their profits.

They can offer better rates to higher risk borrowers. The hedge funds supply capital to the credit-worthy borrowers, ensuring competitive rates for them, while many less credit-worthy borrowers get BETTER than competitive rates due to the rating inefficiencies and the retail investing suckers who fund them via auto investing.

Basically, it appears that everybody wins except the naive, trusting retail investor, who loses big time.

Conclusions

One really obvious conclusion is that if you have a LendingClub account with auto-investing turned on, you should turn it off and invest manually or at least ADD FILTERS as I have done with my account:

You could also try moving to a different lending platform (which might have the same problem), or you could try to play hedge fund and choose better loans for yourself now that you are armed with the knowledge of the huge biases; the charts above indicate that a few simple rules could massively elevate your returns. The risk of playing hedge fund of course is that once a loan becomes available, anybody can fund it, and you will be racing against the hedge funds to grab the best loans. I expect the best loans are probably funded within milliseconds of becoming available. This is why I also plan to explore nsrinvest.com to see if they might be able to increase my returns.

Another conclusion is that LendingClub should fix this problem. It’s hard for me to imagine that they don’t already know about it, since anybody half-decent with spreadsheets and/or databases could sniff out these biases (and many people have). However, I expect they have a lot of incentives to keep the biases in place as described earlier, so I’m only marginally hopeful they will take significant action to correct this.

Please note this is not an exhaustive list of biases, but simply the ones that were easiest for me to find. There are undoubtedly many more biases that hedge funds are exploiting.

LendingClub Responds?

I sent a draft of this article to LendingClub on July 3rd, more than two weeks before publishing it, and invited them to review my data and provide me with an official response. They have yet to reply. If they say anything to me, good or bad, I will update this section with any quotes from them.

Methodology

In order to generate these graphs, I imported all of LendingClub’s historical loan data into a Postgres database, and ran various hairy SQL statements to massage the data and generate statistical summary tables which I could then import into a spreadsheet where I massaged the data further and created graphs. My spreadsheet is publicly viewable here, although be advised it is NOT well-labeled or laid out for public consumption.
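To give a flavor of the kind of summary query described above, here is a small sketch that computes the default rate by grade and home ownership. It uses Python's built-in sqlite3 with an in-memory table as a stand-in for Postgres, and the table name, column names, and rows are all invented for illustration; it is not one of the actual queries behind the graphs.

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loans (sub_grade TEXT, home_ownership TEXT, defaulted INTEGER)"
)

# Invented rows: defaulted is 1 for a defaulted loan, 0 otherwise
rows = [
    ("A1", "MORTGAGE", 0), ("A1", "MORTGAGE", 0),
    ("A1", "MORTGAGE", 0), ("A1", "MORTGAGE", 1),
    ("A1", "RENT", 0), ("A1", "RENT", 1),
    ("A1", "RENT", 1), ("A1", "RENT", 0),
]
conn.executemany("INSERT INTO loans VALUES (?, ?, ?)", rows)

# One summary row per (grade, home ownership) pair
summary = conn.execute(
    """
    SELECT sub_grade, home_ownership, COUNT(*) AS n_loans, AVG(defaulted) AS default_rate
    FROM loans
    GROUP BY sub_grade, home_ownership
    ORDER BY sub_grade, home_ownership
    """
).fetchall()

for row in summary:
    print(row)

conn.close()
```

A summary table like this is small enough to paste into a spreadsheet for charting, which is the workflow described above.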

I’m just some guy playing with a huge data set. I found and corrected dozens of errors in my analysis as I worked on it, and I was chasing errors up to the moments before publishing this article. Undoubtedly others will find more problems. Please let me know what problems you spot and I will try to correct them or at least acknowledge them in a footnote. I would love it if more people take a stab at independently verifying these findings.

Other Analyses for Further Reading

I am by no means the first person to find these patterns in LendingClub data. If you are looking for more interesting charts and analysis, here are a few:

This analysis from 2015 has some cool graphs, including showing the scandalous misrating of renters vs mortgage holders and of small business loans: https://rstudio-pubs-static.s3.amazonaws.com/115829_32417d32dbce41eab3eeaf608a0eef9d.html

This analysis using 2017 data also zooms in on the increased risk of renters vs mortgage holders: https://medium.com/@sonicmsba/how-to-minimize-risk-for-your-lending-club-investment-a4c8de0d129a

This analysis from all the way back in 2013 finds many of the same discrepancies I found: http://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html

These brief analyses also point out the rent/mortgage discrepancy:

http://dfile.github.io/

https://www.orchardplatform.com/blog/2014522credit-variables-explained-home-ownership/ (2012 data)

https://nycdatascience.com/blog/r/p2p-loan-data-analysis-using-lending-club-data/ (she didn’t think the difference was big enough to worry about!)

Footnotes