Knowing these will help you choose the right car

A Look Into Correlation of Car Features and Prices

Ikenna Nwosu
10 min read · Dec 27, 2020

Introduction
Knowing the right price at which to buy or sell a second hand car, or determining which features of a car suit one’s needs, is challenging for a lot of people. People would normally turn to AutoTrader, a popular online marketplace that specialises in new and second hand automotive sales, for information and price comparison.

Photo by ryansearle on Unsplash

We will take a closer look at a dataset I scraped off AutoTrader in December 2020, across multiple postcodes in the UK. The initial dataset contains 70,656 car listings (or rows) and 12 variables (or columns), comprising 11 car features and the selling price, which is the variable we want to predict.

  1. Make: manufacturer of the car
  2. Model: model of a particular car
  3. Doors: number of doors
  4. Year: the year the vehicle was registered (indicating the age of the car)
  5. Body_Type: refers to the shape (car body) of a particular car
  6. Mileage: number of miles travelled or covered
  7. Engine_Size: the size of the engine
  8. Gearbox: transmission type of the car
  9. Fuel: what fuel type the car runs on
  10. Past_Owners: number of previous owners
  11. HorsePower: power produced by the engine
  12. Price: selling price of the car

This list is not exhaustive, as other features such as the colour and miles per gallon (mpg) also influence the cost of a car. The features above were chosen based on how easy they were to obtain without thousands of missing values; for example, not all the listings on AutoTrader had mpg values. How much these missing features affect the price of a car is something we can only speculate about until we obtain the data and analyse it.

There are five categorical variables (Make, Model, Body_Type, Gearbox, and Fuel), and the rest are numeric. After cleaning and preparing the data to predict the price of a car, I arrived at three questions that I thought would help people choose the right car for their needs and understand which features could affect the price of the car. The questions I asked were:

  1. Do cars with higher engine sizes accumulate more mileage than cars with smaller engine sizes?
  2. Do petrol cars accumulate more mileage than diesel cars over their lifetime?
  3. Which car features influence the price of a car the most?

Question 1: Do cars with higher engine sizes accumulate more mileage than cars with smaller engine sizes?

The size of a car engine is measured in cubic centimetres (cc), with 1,000cc equating to a litre (1.0L). It refers to the engine's displacement: the combined volume swept by all the pistons as they move through the cylinders. Generally, the bigger the engine, the more power it can produce, and the higher the horsepower, the quicker the car gains speed and the faster it can ultimately go. Below, you can see the relationship between engine size and horsepower.

At this point, it seems reasonable to assume that regular motorway commuters are more likely than regular city commuters to buy faster cars. That would mean cars with bigger engines and, given the motorway use, a higher chance of accumulating mileage, but we will not jump to that conclusion yet.

I found that the engine size and mileage data were roughly normally distributed, with most of the data points lying within two standard deviations of the mean.

Now the question is: what engine size do we consider big, and what mileage do we consider high? For this project, I chose engine sizes bigger than the mean engine size (i.e. 1.75L) and any mileage higher than the mean mileage (i.e. 82,300 miles). These are represented by the data points to the right of the green vertical line (i.e. the mean point) in the graph above.

With these criteria, I grouped the data by engine size and aggregated by the mean of the mileage, and then visualized the resultant data as shown below.

The black dotted line indicates the mean mileage and the red dotted line indicates the mean engine size. The shaded region shows that most of the cars with bigger engine sizes have covered more than the mean mileage, but there are a few interesting points which do not follow the trend. These are the data points in the bottom right quadrant.

A closer look at these data points reveals the following, as summarised in the plot and table below.

First 10 rows of the bottom right quadrant

These data points contain newer luxury cars with big engine sizes and very low mileages. Owners of these cars would not normally use them for daily motorway commutes. They are very likely to have other cars for this purpose, which could explain why these cars do not follow the pattern in the shaded region above. However, we cannot confirm from the data whether they all have other cars.

With the exception of these data points, the data suggest that, in general, cars with bigger engines are more likely to have accumulated more mileage than the smaller engine variants.
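The grouping and quadrant split used in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the listings are made-up values, not rows from the scraped dataset, and the variable names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (engine size in litres, mileage in miles) pairs, for illustration only.
listings = [
    (1.0, 40000), (1.0, 60000),
    (1.2, 50000), (1.2, 70000),
    (2.0, 100000), (2.0, 120000),
    (3.0, 10000), (3.0, 20000),
]

# Thresholds: the overall means play the role of the two dotted lines.
mean_engine = fmean(engine for engine, _ in listings)
mean_mileage = fmean(mileage for _, mileage in listings)

# Group by engine size and aggregate by the mean of the mileage.
mileages_by_engine = defaultdict(list)
for engine, mileage in listings:
    mileages_by_engine[engine].append(mileage)
mean_mileage_by_engine = {
    engine: fmean(values) for engine, values in sorted(mileages_by_engine.items())
}

# Big engines that follow the trend (top right quadrant) ...
top_right = [
    engine for engine, avg in mean_mileage_by_engine.items()
    if engine > mean_engine and avg > mean_mileage
]
# ... and big engines with below-average mileage (bottom right quadrant).
bottom_right = [
    engine for engine, avg in mean_mileage_by_engine.items()
    if engine > mean_engine and avg < mean_mileage
]

print(mean_mileage_by_engine)
print("top right:", top_right, "bottom right:", bottom_right)
```

In this toy data the 3.0L group plays the role of the low-mileage luxury cars: a big engine, but a mean mileage well below the overall mean.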

Question 2: Do petrol cars accumulate more mileage than diesel cars over their lifetime?

Below you can see the relationship between the fuel type and the mean accumulated mileage. I grouped the data by fuel type and aggregated by the mean of the mileage. The data clearly shows that, of all the cars listed, cars with diesel engines tend to accumulate more mileage than the petrol engine variants.

Visualizing the fuel type and mileage data again, but this time taking the registration year into consideration (below), we see that for each registration year over the past 20 years, diesel powered cars have accumulated more mileage than their petrol variants.

This could be because diesel cars are more fuel efficient and better suited to long distance travel. Their owners are therefore more likely to be regular motorway commuters than city commuters, and their cars are more likely to have bigger engine sizes, as shown below.

With these, we can conclude that, in general, petrol cars are not likely to accumulate more mileage than their diesel counterparts over their lifetime.
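Comparing fuel types within each registration year matters because older cars have simply had more time to accumulate miles. A minimal sketch of that comparison, assuming made-up listings and hypothetical variable names rather than the real dataset, could look like this:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (registration year, fuel type, mileage) rows, for illustration only.
listings = [
    (2015, "Diesel", 90000), (2015, "Diesel", 70000),
    (2015, "Petrol", 50000), (2015, "Petrol", 60000),
    (2018, "Diesel", 40000),
    (2018, "Petrol", 25000), (2018, "Petrol", 35000),
]

# Group by (year, fuel) and aggregate by the mean of the mileage.
groups = defaultdict(list)
for year, fuel, mileage in listings:
    groups[(year, fuel)].append(mileage)
mean_by_year_fuel = {key: fmean(values) for key, values in groups.items()}

# For each registration year, how many more miles have diesel cars done on average?
years = sorted({year for year, _ in groups})
diesel_minus_petrol = {
    year: mean_by_year_fuel[(year, "Diesel")] - mean_by_year_fuel[(year, "Petrol")]
    for year in years
}

print(diesel_minus_petrol)
```

A positive gap in every year is the pattern the article describes; a gap that appears only in the pooled data would instead point to diesel cars simply being older.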

Question 3: Which car features influence the price of a car the most?

I started by looking at the correlation of the features with the target prices. Below is the pairwise correlation plot for the numerical variables.

Pairwise correlation of the numerical features and prices

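Correlation scores range from -1 to 1. The sign gives the direction of the relationship (positive means the two variables rise together, negative means one falls as the other rises), and the closer the value is to -1 or 1, the stronger the relationship. Values near 0 indicate little or no linear relationship.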
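For readers who want to see what sits behind such a score, here is a minimal sketch of the Pearson correlation coefficient in plain Python. I am assuming Pearson correlation, the usual default for pairwise correlation plots; the numbers are made up for illustration and are not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical values for four cars.
price = [4000, 6000, 8000, 10000]
year = [2012, 2014, 2016, 2018]         # newer cars cost more here
mileage = [90000, 70000, 40000, 20000]  # higher mileage, lower price
doors = [5, 3, 3, 5]                    # no linear link with price

r_year = pearson(year, price)
r_mileage = pearson(mileage, price)
r_doors = pearson(doors, price)
print(r_year, r_mileage, r_doors)
```

The toy values are deliberately extreme: year and price move in lockstep, mileage moves almost exactly opposite to price, and doors shows no linear relationship at all. Real listings are much noisier, which is why the scores reported below are far from -1 or 1.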

The registration year of the car has a positive correlation (0.35) with the car price. As expected, the newer the car, the more expensive it tends to be, generally driven by newer technologies (safety, comfort, efficiency, etc).

There is a negative correlation (-0.31) between the car mileage and the cost of the car. The higher the mileage, the more wear and tear the car is likely to have, and therefore the more likely its price will be lower than that of newer variants.

Engine size and horsepower both have positive correlations (0.22 and 0.16 respectively) with car price. There is an upward trend showing that higher values tend to go with higher car prices, which makes sense as bigger engines generally mean more power (turbocharged cars being an exception, since they can produce more power from a smaller engine).

The number of past owners also has some influence on the price of the car. The more people who have owned the car in the past, the lower the car price tends to be. The correlation score is -0.16, a weak negative correlation.

The plot suggests that cars with 2, 4 or 5 doors are more expensive, but a correlation score of 0.01 gives no confidence that the number of doors has much influence on the price of a car.

I also looked at correlation between the categorical features and the car prices, and the following was observed:

From the data, most of the cars were under £20,000. There was no clear variation indicating that the make of the car affected the price, except that the more luxurious cars were far more expensive than the rest, as shown above. I considered these data points anomalous with regard to our dataset.

With regard to the body type, we can see that SUVs, Coupes, Saloons, Convertibles and Limousines were more expensive than the others, so body type would influence the cost of a car. It is interesting to note that there is a relationship between the body type of a car and the number of doors.

The data shows that cars with automatic transmissions are more expensive than cars with manual transmissions. Clearly this is one feature to consider if the price of the car is of any importance.

Diesel cars are generally known to be more expensive than petrol cars for multiple reasons, but because they are more fuel efficient, lower running costs partly offset the higher purchase cost in the long run. However, our data shows multiple data points with higher prices for petrol vehicles.

Further exploring these data points, with focus on cars from £100k and above, I noticed these are top luxury brands, with mostly petrol engines. This explains the data points with high prices for petrol cars.

To select the features that influence the car price the most, I built multiple machine learning models, and selected the model with the lowest root mean squared error (RMSE) and the highest coefficient of determination score (R²).

RMSE is an absolute measure of fit, expressed in the same units as the price (pounds). It is the square root of the average squared difference between the predicted car price and the actual/observed car price, so lower is better.

R² is a relative measure of fit that indicates how much of the variation in car prices is explained by the car features in the model. In other words, we are measuring the goodness of fit of the model, i.e. how well the predictions approximate the real data points. It does not tell us by how much a change in a single feature, such as engine size or mileage, moves the car price.
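Both metrics are easy to compute by hand. The sketch below uses the standard definitions with made-up prices; the function names are my own, and this is not the code behind the scores reported in this article.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the prices."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Share of the variance in the actual prices explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_ss / total_ss

# Hypothetical prices in pounds: every prediction is off by exactly £500.
actual_prices = [5000, 7000, 9000, 11000]
predicted_prices = [5500, 6500, 9500, 10500]

error = rmse(actual_prices, predicted_prices)
fit = r_squared(actual_prices, predicted_prices)
print(f"RMSE: £{error:.0f}, R²: {fit:.2f}")
```

With every prediction off by £500, the RMSE is £500, and because those errors are small relative to the spread of the prices, R² comes out at 0.95.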

The Gradient Boosting regression model had the lowest RMSE (£1,887) and an R² score of 91%. With this model, I was able to work out which features contributed the most to its predictions, as shown below.

From the above analysis, we conclude that the most important features in predicting the price of a car, based on our dataset, are as follows:

  • Engine size
  • Age of the car
  • Mileage completed at the point of sale
  • Engine Horsepower
  • Transmission (gearbox) type
  • Number of past owners
  • Number of doors
  • Fuel type
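One model-agnostic way to produce a ranking like this is permutation importance: shuffle one feature's values across the cars and measure how much the RMSE gets worse. To be clear, this is a swapped-in technique for illustration and not necessarily how the ranking above was produced. The `toy_price_model` function, its coefficients and the cars are all hypothetical stand-ins for a trained model and real listings.

```python
import random
from math import sqrt

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def toy_price_model(car):
    # Hypothetical stand-in for a trained model. The coefficients are made up,
    # and "Doors" is deliberately ignored.
    return 3000 + 4000 * car["Engine_Size"] - 0.05 * car["Mileage"]

def permutation_importance(predict, cars, prices, feature, repeats=5, seed=0):
    """Average increase in RMSE after shuffling one feature across the cars."""
    rng = random.Random(seed)
    baseline = rmse(prices, [predict(car) for car in cars])
    increases = []
    for _ in range(repeats):
        shuffled = [car[feature] for car in cars]
        rng.shuffle(shuffled)
        permuted = [{**car, feature: value} for car, value in zip(cars, shuffled)]
        increases.append(rmse(prices, [predict(car) for car in permuted]) - baseline)
    return sum(increases) / repeats

cars = [
    {"Engine_Size": 1.0, "Mileage": 90000, "Doors": 5},
    {"Engine_Size": 1.2, "Mileage": 60000, "Doors": 3},
    {"Engine_Size": 1.4, "Mileage": 75000, "Doors": 5},
    {"Engine_Size": 1.6, "Mileage": 40000, "Doors": 5},
    {"Engine_Size": 2.0, "Mileage": 30000, "Doors": 3},
    {"Engine_Size": 3.0, "Mileage": 10000, "Doors": 2},
]
# Toy targets: the model is perfect on this data, so the baseline RMSE is zero.
prices = [toy_price_model(car) for car in cars]

importance = {
    feature: permutation_importance(toy_price_model, cars, prices, feature)
    for feature in ("Engine_Size", "Mileage", "Doors")
}
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Shuffling a feature the model relies on breaks its predictions and the error rises, while shuffling a feature it ignores (Doors in this toy model) changes nothing, so that feature lands at the bottom of the ranking.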

Conclusion

In this post, we looked at a dataset I scraped off AutoTrader to understand what car features affect the price of a car, and how the fuel type and engine size influence how much mileage the car accumulates over its lifetime.

We found some useful insights at each stage of the analysis, although some questions were left open. The key takeaways from this article are:

  1. Regular motorway commuters are more likely to opt for faster cars with bigger engine sizes and more power than regular city commuters. Therefore these cars with bigger engine sizes are more likely to accumulate more mileage than the smaller engine variants.
  2. Cars that run on diesel are likely to accumulate more mileage than petrol powered cars, as they are more likely to be used for motorway commuting than city driving.
  3. When selling or purchasing a car, the engine size, age of the car, mileage reading, the horsepower, gearbox type, number of people who have owned the car in the past, the fuel type and the number of doors (or the body type) should be considered as these features positively or negatively affect the price of the car.

If you are interested in seeing the code for this article, please visit my GitHub.
