Introduction

What is the usual way of figuring out the price of a used car? You search for similar vehicles, estimate a rough baseline price, and then fine-tune it depending on mileage, color, number of options, etc. You rely on both domain knowledge and an analysis of the current market.

If you go deeper, you may consider selling the car in a different region of the country where the average price is higher. You may even investigate how long cars stay listed in the catalog and detect overpriced samples to make a more informed decision.

Original ad of a late-1990s VW Passat estate in “Rosso corsa” color, which turned out to be the “average car” in Belarus according to the dataset statistics

So there is a lot to think about, and the question I faced here was: “Can data science methods (collecting and cleaning the data, training ML models, etc.) save you time and mental effort in a painful decision-making process?”

I opened a laptop, created a new project and turned the timer on.

Stage 1. Collecting the data

Without going into too much detail: I managed to collect a dataset containing roughly 40,000 car ads with 35 features (mostly categorical) in two days. Collecting the data itself wasn’t too painful, but structuring it in an organized way took a bit of time. I used Python, Requests, Pandas, NumPy, SciPy, etc.

What is interesting about this particular dataset is that most of the categorical features are not encoded in any way and thus can be easily interpreted (like engine_fuel = “diesel”).

Stage 2. Looking at the big picture and dealing with bad data

Initial data analysis quickly revealed suspicious samples: odometer readings of 8 million kilometers, hatchbacks with 10-liter engines, hybrid diesel vehicles for $600, etc. I spent roughly 6 hours writing scripts to detect and process these issues.
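Checks like these can be sketched as a few boolean masks in pandas. This is only a minimal illustration, not the scripts I actually wrote: the thresholds, the toy rows, the `engine_capacity` and `body_type` column names, and the `hybrid-diesel` fuel label are all assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the scraped ads (values are made up for illustration).
ads = pd.DataFrame({
    "odometer_value": [250_000, 8_000_000, 180_000, 120_000],
    "engine_capacity": [1.9, 2.0, 10.0, 1.8],  # liters; hypothetical column name
    "body_type": ["universal", "sedan", "hatchback", "sedan"],  # hypothetical column
    "engine_fuel": ["diesel", "gasoline", "gasoline", "hybrid-diesel"],  # assumed labels
    "price_usd": [4_900, 3_000, 5_500, 600],
})

# Assumed plausibility limits; tune them after looking at the real distributions.
MAX_ODOMETER_KM = 1_000_000
MAX_HATCHBACK_ENGINE_L = 4.0
MIN_HYBRID_PRICE_USD = 2_000

suspicious = (
    (ads["odometer_value"] > MAX_ODOMETER_KM)
    | ((ads["body_type"] == "hatchback")
       & (ads["engine_capacity"] > MAX_HATCHBACK_ENGINE_L))
    | (ads["engine_fuel"].str.startswith("hybrid")
       & (ads["price_usd"] < MIN_HYBRID_PRICE_USD))
)

flagged = ads[suspicious]  # rows to inspect by hand
clean = ads[~suspicious]   # rows that pass every check
```

Keeping the flagged rows in a separate frame, rather than dropping them right away, makes it easier to decide case by case whether a sample is a typo that can be fixed or a bogus ad that should go.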

Visualizing the data (I used Matplotlib and Seaborn) gave me a good sense of the overall market situation.

odometer_value distribution (the distance traveled by a vehicle, in kilometers)

The majority of the cars are pretty heavily used, with a mean odometer_value of 250,000 kilometers, and that is a lot! I also noticed that people prefer to assign nice round numbers to odometer_value, like 250,000 km, 300,000 km, 350,000 km, etc. A bunch of cars have an odometer_value of exactly one million kilometers, which does not make much sense if you look at the distribution of values. I presume that 1 million kilometers is more of a statement: “This car has seen a lot, and I honestly don’t know the exact number of kilometers on it.”
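One way to put a number on this rounding habit is to measure the share of ads whose odometer reading is an exact multiple of some step. The sketch below uses made-up readings and assumes that "nice" means a multiple of 50,000 km; both are illustrative choices, not values taken from the dataset.

```python
import pandas as pd

# Made-up odometer readings in kilometers.
odometer = pd.Series([250_000, 300_000, 187_432, 350_000, 1_000_000, 212_500])

# Assumed definition of a "nice" number: an exact multiple of 50,000 km.
ROUND_STEP_KM = 50_000
is_round = odometer % ROUND_STEP_KM == 0

round_share = is_round.mean()        # fraction of ads with a rounded reading
is_million = odometer == 1_000_000   # the "I honestly don't know" placeholder
```

A share far above what evenly spread readings would give (1 in 50,000 for exact values) suggests the field is often an estimate rather than a reading taken from the dashboard.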

The general trend behind car pricing is pretty intuitive: the older the car, the lower the price. I expected the age of the car to be the number one feature in the overall feature hierarchy.

Also, the older the car, the higher its odometer_value in general, and that is reasonable.

To build the price_usd scatter plot, I limited the maximum car price to roughly $50,000 and removed several outliers with million-level odometer_value.

Cars priced below $50,000 constitute 99.9% of the catalog, so the scatter plot gives a good sense of the pricing trend.

Regarding car age: most of the cars have been in use for a while, with a mean year_produced of 2002. I believe that the distribution of production year (depicted below) was heavily influenced by policies around customs duties for importing cars from abroad.

Distribution of the cars in the catalog by their production year.

The price distribution (price_usd is going to be the target value during model training in this project) is highly skewed to the right, with mean and median prices of $7,275 and $4,900, respectively.

price distribution
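The gap between the mean and the median is itself a quick check for right skew: a few expensive cars pull the mean up while the median stays put. A minimal sketch with made-up prices (not the real dataset) shows the effect:

```python
import pandas as pd

# Made-up prices: many cheap cars and one expensive one.
price_usd = pd.Series([1_200, 2_500, 3_900, 4_900, 5_600, 8_000, 25_000])

mean_price = price_usd.mean()      # pulled up by the expensive car
median_price = price_usd.median()  # robust to the long right tail
skewness = price_usd.skew()        # sample skewness; positive means a right tail
```

With a target this skewed, the median is usually the more honest summary of a "typical" price.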

Some features, like up_counter (the number of times the ad has been promoted manually), don’t reflect parameters of the car at all, but since this data was available, I decided to include it in the project. The distribution was so skewed that the only way to plot it properly was to use a log scale.

distribution of up_counter metric
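For a count feature this skewed, there are two common ways to get a readable picture: transform the values with a logarithm, or bin them on logarithmically spaced edges. The sketch below shows both with NumPy on made-up counts; it is an assumption about how such a plot can be prepared, not a record of the exact plotting code.

```python
import numpy as np

# Made-up counts: most ads are rarely promoted, a few are promoted a lot.
up_counter = np.array([0, 1, 1, 2, 3, 5, 8, 40, 300, 1_500])

# log1p computes log(1 + x), so zero counts stay valid (log(0) is undefined).
log_counts = np.log1p(up_counter)

# Logarithmically spaced bin edges: 1, 10, 100, 1000, 10000.
bin_edges = np.logspace(0, 4, num=5)

# Zero counts cannot be placed on a log axis, so they are left out of the bins.
hist, _ = np.histogram(up_counter[up_counter > 0], bins=bin_edges)
```

Each bin then covers one order of magnitude, so the handful of heavily promoted ads no longer squeezes everything else into the first bar.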

The distribution of brand popularity wasn’t a surprise to me: the most popular model in the catalog is the VW Passat, a legendary means of transportation in Belarus.

I also used Tableau to get a nicer visual representation of each manufacturer’s market share and average price.