P2P lending is no longer a buzzword, but building an algorithmic bidding system for P2P loans is still a good learning exercise, and one I enjoyed very much. Here I would like to share some of my experience and thoughts on it.

Lending Club lists new loans on its website four times per day. Investors can bid on the loans they like manually or automatically. According to a recent report [1], 99% of all Marketplace/P2P Consumer Lending in 2016 was fulfilled through auto-bidding processes. Building an auto-bidding system is actually not that difficult, because Lending Club offers REST API services that allow investors to access the platform programmatically. But an auto-bidding system alone does not guarantee that you make money: to compete with other investors, the system should be 1) smart and 2) fast.

Not every loan is equally profitable. First, loans are assigned grades, from A to G. Grade A loans have the lowest average interest rates and grade G loans the highest. Each loan also has its own default risk; in general, grade G loans are more likely to default than grade A loans. It is critical to take both the interest rate and the default rate into account when making the bidding decision, i.e., to bid or not to bid. That is what smart means. In general, the smarter the system is, the slower it is to make a decision. Don’t forget that many other institutional and individual investors are checking the same loans at the same time, and you compete with them. So the system has to be fast. The latency comes mainly from the system itself and from the network. To reduce network latency, consider colocating the server.

The system design is straightforward. The system is online all the time, and at each listing time it sends an HTTP GET request through the Lending Club API to fetch the newly listed loans. Each newly listed loan goes through the scoring engine, which decides whether or not to bid on it. All the data are also stored in a database.

I chose Scala as the programming language and MySQL on AWS RDS as the database, and I built the system on the Akka actor system. The main reason for using Scala is that I feel development in Scala is much faster for me than in other languages, and the code is much more concise. Akka allows me to define actors that do different jobs. For example, one actor’s responsibility is to download new listings, triggered by an Akka scheduler. For database access, I used the Slick library. Slick’s code generator can generate case classes from database tables, provided that a table has no more than 22 fields. For tables with more than 22 fields, I had to use HList. I do wish Scala case classes did not have the 22-parameter limitation. For more discussion of the 22-parameter limit in Scala, please refer to [2].
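The fetch, score, bid, and store flow can be sketched in a few lines. This is only an illustration, written in Python for brevity rather than in the Scala/Akka code the system actually uses. The `Loan` fields and the function names are hypothetical, and the HTTP call, the bid request, and the database write are passed in as plain callables so that no real Lending Club API details are assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Loan:
    # Hypothetical minimal set of fields, for illustration only.
    loan_id: int
    grade: str       # "A" .. "G"
    int_rate: float  # annual interest rate, e.g. 0.11 for 11%


def run_listing_cycle(
    fetch_new_loans: Callable[[], Iterable[Loan]],
    score: Callable[[Loan], float],
    threshold: float,
    place_bid: Callable[[Loan], None],
    store: Callable[[Loan, float, bool], None],
) -> List[int]:
    """Handle one listing release: fetch, score, decide, bid, persist."""
    bid_ids = []
    for loan in fetch_new_loans():
        value = score(loan)
        should_bid = value > threshold
        if should_bid:
            place_bid(loan)
            bid_ids.append(loan.loan_id)
        # Every loan is stored, whether or not we bid on it.
        store(loan, value, should_bid)
    return bid_ids


# Usage with stand-ins for the API and the database.
listed = [Loan(1, "A", 0.07), Loan(2, "B", 0.11), Loan(3, "G", 0.28)]
records = []
bids = run_listing_cycle(
    fetch_new_loans=lambda: listed,
    score=lambda loan: 1.0 if loan.grade == "B" else 0.0,  # toy grade rule
    threshold=0.5,
    place_bid=lambda loan: None,
    store=lambda loan, value, bid: records.append((loan.loan_id, value, bid)),
)
print(bids)  # [2]
```

Keeping the I/O behind callables also makes the decision logic easy to test offline, since the same function can be fed historical listings instead of live ones.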

Scoring Engine

This is the fun part for me. Basically, the scoring engine is driven by rules. A simplistic rule is to check the grade of the loan: for example, if the grade is B, we send a bid request. But the only reason for investing in P2P loans is to make money (isn’t it?), so a more realistic and reasonable rule is to value the loan and compare that value with a threshold to make the bidding decision. A loan can be valued with different metrics, for example risk-adjusted net present value (NPV) or internal rate of return (IRR). We use a risk-adjusted metric because loans are risky, i.e., the borrowers can default.
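To make the idea of a risk-adjusted NPV concrete, here is a minimal sketch in Python. It is not the valuation used in my system. It assumes a fully amortizing loan with level monthly payments, a constant monthly default probability, zero recovery after default, and no fees or prepayment. The function names and the `min_return` threshold are made up for the example.

```python
def monthly_payment(principal: float, annual_rate: float, n_months: int) -> float:
    """Level payment of a fully amortizing loan (standard annuity formula)."""
    r = annual_rate / 12.0
    if r == 0:
        return principal / n_months
    return principal * r / (1.0 - (1.0 + r) ** -n_months)


def risk_adjusted_npv(
    principal: float,
    annual_rate: float,
    n_months: int,
    monthly_default_prob: float,
    annual_discount_rate: float,
) -> float:
    """Expected discounted payments minus the amount invested.

    Assumptions for illustration: constant monthly default probability,
    zero recovery once the borrower defaults, no fees, no prepayment.
    """
    payment = monthly_payment(principal, annual_rate, n_months)
    d = annual_discount_rate / 12.0
    still_paying = 1.0
    npv = -principal
    for t in range(1, n_months + 1):
        # Probability that the borrower has not defaulted by month t.
        still_paying *= 1.0 - monthly_default_prob
        npv += still_paying * payment / (1.0 + d) ** t
    return npv


def should_bid(npv: float, principal: float, min_return: float) -> bool:
    """Bid only if the NPV per dollar invested clears a (hypothetical) threshold."""
    return npv / principal > min_return


npv = risk_adjusted_npv(
    principal=10_000, annual_rate=0.12, n_months=36,
    monthly_default_prob=0.003, annual_discount_rate=0.05,
)
print(round(npv, 2), should_bid(npv, 10_000, min_return=0.02))
```

A useful sanity check: with no default risk and a discount rate equal to the loan's interest rate, the NPV is zero, so any default risk at that discount rate makes the loan a losing bet.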

The risk of a loan can be quantified by its probability of default. To estimate that probability, we can train various classification models, such as logistic regression, random forests, or (deep) neural networks. These models take the variables of a loan as input and output the probability that it defaults within its term (3 or 5 years). To train them, we need training data containing the variables of historical loans issued by Lending Club and their final status (fully paid or default/charge-off). Such a dataset can be downloaded from the Lending Club website. We can train the classification model with any machine learning tool; for example, we may train a random forest model using scikit-learn in Python.

It is important, however, to move the trained model into the auto-bidding system itself, to avoid the overhead of calling Python functions from Scala. For most classification models, we can simply store the model parameters (such as the coefficients of a logistic regression) in the database and implement the prediction function in Scala. For tree-based models, we could use tail recursion to implement the prediction. But we can actually do better with tree-based models by using the technique introduced by Yandex in CatBoost [3]. The idea is to replace regular decision trees with oblivious decision trees. In an oblivious decision tree, every node at a given level checks the same condition, which lets us evaluate the conditions of all levels in parallel. In a regular tree that is impossible, because the conditions have to be checked sequentially from the root to the leaf.
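The following Python sketch shows why oblivious trees are cheap to evaluate. It is an illustration of the general idea, not CatBoost's actual implementation or storage format. The representation (one `(feature_index, threshold)` pair per level plus a flat list of leaf values) and the feature names in the example are assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple

Split = Tuple[int, float]  # (feature index, threshold), one per tree level


def predict_oblivious_tree(
    features: Sequence[float],
    splits: Sequence[Split],
    leaf_values: Sequence[float],
) -> float:
    """Evaluate one oblivious tree of depth len(splits).

    Each level's test depends only on the input, never on the outcome of
    another level, so all tests could be evaluated in parallel. The results
    are then packed into a leaf index, one bit per level.
    """
    bits = [1 if features[idx] > thr else 0 for idx, thr in splits]
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return leaf_values[index]


def predict_ensemble(
    features: Sequence[float],
    trees: List[Tuple[Sequence[Split], Sequence[float]]],
) -> float:
    """Raw score of a boosted ensemble: the sum of the individual tree outputs."""
    return sum(predict_oblivious_tree(features, s, v) for s, v in trees)


# Hypothetical depth-2 tree over two made-up features:
# features[0] = credit score, features[1] = debt-to-income ratio.
splits = [(0, 700.0), (1, 0.2)]
leaf_values = [0.1, 0.2, 0.3, 0.4]  # 2**depth leaves, indexed by the bit pattern

print(predict_oblivious_tree([720.0, 0.10], splits, leaf_values))  # 0.3
print(predict_oblivious_tree([650.0, 0.35], splits, leaf_values))  # 0.2
```

Because a tree is just a short list of splits and a flat array of leaf values, it is also easy to store in a database table and to reimplement in Scala.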

In the system I built, however, classification models are not used, since they only provide a single default probability. For a fixed-term loan, when the loan defaults matters more than whether it defaults [4]: an early default can cause a much more severe loss than a default on the last few payments. Survival analysis is the tool I used to model when a loan defaults. For people not familiar with survival analysis, I recommend the course Pop 509 [5]. I didn’t use parametric or semi-parametric survival models. Instead, I implemented a non-parametric survival model based on gradient boosting regression, which performed significantly better. Finally, I saved the model in the database and implemented the prediction function in Scala, which outputs a vector of default probabilities, one for each monthly payment. Based on this default probability vector, I wrote other functions to project the future cash flow of each loan for valuation.
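Here is a small Python sketch of how a per-month default probability vector can be turned into projected cash flows, and why the timing of default matters. It is an illustration, not the projection code from my system. It assumes each entry of the vector is the probability of defaulting in that month given that the borrower has paid so far, and it assumes zero recovery after default.

```python
from typing import List, Sequence


def survival_curve(hazards: Sequence[float]) -> List[float]:
    """hazards[t] = P(default in month t+1 | still paying before that month).

    Returns, for each month, the probability that the payment is made.
    """
    curve = []
    still_paying = 1.0
    for h in hazards:
        still_paying *= 1.0 - h
        curve.append(still_paying)
    return curve


def expected_cash_flows(payment: float, hazards: Sequence[float]) -> List[float]:
    """Expected payment per month, assuming zero recovery after default."""
    return [payment * s for s in survival_curve(hazards)]


# Two 36-month loans with the same 50% chance of defaulting within the term,
# but with the risk concentrated in the first month versus the last month.
early = [0.5] + [0.0] * 35
late = [0.0] * 35 + [0.5]

print(sum(expected_cash_flows(100.0, early)))  # 1800.0
print(sum(expected_cash_flows(100.0, late)))   # 3550.0
```

A classifier would give both loans the same default probability, yet the expected amount collected differs by almost a factor of two.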

As I mentioned earlier, the value of a loan is compared with a threshold to make the final bidding decision. It is rather difficult to find a good threshold, and the threshold should also be dynamic, because it depends on how much you want to invest and on the quality of the loans the system sees in real time. This is a real-time stochastic optimization problem that is beyond the scope of this introduction.

Finally, don’t forget to backtest the scoring engine on historical data, even though good backtest performance doesn’t guarantee a positive return once the system is online.
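A very simple backtest can be sketched as follows in Python. It replays a threshold rule over historical loans and reports the realized cash-on-cash return of the loans the rule would have picked. The record fields are hypothetical, and the return measure ignores the timing of payments, so this is a rough first check rather than a proper evaluation. The scores should come from a model trained only on loans issued before the backtest period, otherwise the result is biased by look-ahead.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HistoricalLoan:
    # Hypothetical record layout for a loan whose outcome is known.
    funded_amount: float
    total_received: float  # all payments actually collected
    score: float           # value assigned by the scoring engine at listing time


def backtest_return(loans: Iterable[HistoricalLoan], threshold: float) -> Optional[float]:
    """Realized return of the loans with score above the threshold.

    Returns None when the rule selects no loan at all.
    """
    selected = [loan for loan in loans if loan.score > threshold]
    invested = sum(loan.funded_amount for loan in selected)
    if invested == 0:
        return None
    received = sum(loan.total_received for loan in selected)
    return received / invested - 1.0


history = [
    HistoricalLoan(1000.0, 1150.0, score=0.08),  # fully paid
    HistoricalLoan(1000.0, 400.0, score=0.06),   # defaulted early
    HistoricalLoan(1000.0, 1100.0, score=0.01),  # fully paid, but scored low
]
for threshold in (0.05, 0.07):
    print(threshold, backtest_return(history, threshold))
```

Running the same function over a range of thresholds gives a first impression of how sensitive the result is to the threshold choice.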

The scoring engine introduced above only considers default. Of course, loans can be prepaid too, but prepayment is not that important. If we do want to consider the prepayment scenario, multi-class classification models can be used. In the context of survival analysis, competing risks and mixture cure models are alternative solutions.