Kaggle second-hand car data analysis in R

In this post I read a data set of used car sale postings from German eBay into R. The data set has been shared on Kaggle and can be found here: https://www.kaggle.com/orgesleka/used-cars-database/data

# read in the data; stringsAsFactors ensures factor columns as in the summary below (required on R >= 4.0)
data_df = read.csv("autos.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
# print a summary
summary(data_df)
##               dateCrawled                      name               seller      
##  2016-03-08 15:50:29:     5   Ford_Fiesta        :   336   gewerblich:     2  
##  2016-03-20 16:50:22:     5   Volkswagen_Golf_1.4:   335   privat    :189347  
##  2016-03-26 10:51:07:     5   BMW_318i           :   334                      
##  2016-03-31 17:57:07:     5   Opel_Corsa         :   316                      
##  2016-04-02 14:50:21:     5   BMW_316i           :   258                      
##  2016-03-05 15:48:41:     4   BMW_320i           :   256                      
##  (Other)            :189320   (Other)            :187514                      
##    offerType          price              abtest          vehicleType   
##  Angebot:189341   Min.   :       0   control:91131   limousine :48701  
##  Gesuch :     8   1st Qu.:    1150   test   :98218   kleinwagen:40759  
##                   Median :    2950                   kombi     :34498  
##                   Mean   :   10895                             :19437  
##                   3rd Qu.:    7200                   bus       :15532  
##                   Max.   :99999999                   cabrio    :11668  
##                                                      (Other)   :18754  
##  yearOfRegistration      gearbox          powerPS            model       
##  Min.   :1000                : 10395   Min.   :    0.0   golf   : 15286  
##  1st Qu.:1999       automatik: 39220   1st Qu.:   70.0   andere : 13453  
##  Median :2003       manuell  :139734   Median :  105.0   3er    : 10528  
##  Mean   :2005                          Mean   :  114.7          : 10398  
##  3rd Qu.:2008                          3rd Qu.:  150.0   polo   :  6714  
##  Max.   :9999                          Max.   :19208.0   corsa  :  6415  
##                                                          (Other):126555  
##    kilometer      monthOfRegistration    fuelType                brand      
##  Min.   :  5000   Min.   : 0.000      benzin :114106   volkswagen   :40687  
##  1st Qu.:125000   1st Qu.: 3.000      diesel : 54968   bmw          :20545  
##  Median :150000   Median : 6.000             : 16936   opel         :20427  
##  Mean   :125640   Mean   : 5.733      lpg    :  2732   mercedes_benz:17931  
##  3rd Qu.:150000   3rd Qu.: 9.000      cng    :   308   audi         :16676  
##  Max.   :150000   Max.   :12.000      hybrid :   142   ford         :13093  
##                                       (Other):   157   (Other)      :59990  
##  notRepairedDamage              dateCreated      nrOfPictures   postalCode   
##      : 36542       2016-04-03 00:00:00:  7392   Min.   :0     Min.   : 1067  
##  ja  : 18410       2016-04-04 00:00:00:  7185   1st Qu.:0     1st Qu.:30559  
##  nein:134397       2016-03-20 00:00:00:  6860   Median :0     Median :49661  
##                    2016-03-12 00:00:00:  6853   Mean   :0     Mean   :50892  
##                    2016-03-21 00:00:00:  6843   3rd Qu.:0     3rd Qu.:71577  
##                    2016-03-14 00:00:00:  6691   Max.   :0     Max.   :99998  
##                    (Other)            :147525                                
##                 lastSeen     
##  2016-04-07 00:46:04:    14  
##  2016-04-06 04:17:47:    12  
##  2016-04-06 05:44:34:    12  
##  2016-04-06 08:15:55:    12  
##  2016-04-06 09:46:11:    12  
##  2016-04-06 09:46:51:    12  
##  (Other)            :189275

The data was obtained via web scraping, and the data set contains a considerable number of entries, as the summary above already makes clear.
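The number of rows can also be checked directly; nrow is base R, and the count matches the seller totals in the summary above:

# number of postings in the raw data set
nrow(data_df)
## [1] 189349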

I proceed by cleaning the data. At this stage, cleaning consists of filtering out rows with missing data. Since there are many variables, the share of rows with at least one missing value is relatively high. I therefore first reduce the data set to the relevant variables listed below, and then run a cleaning pass on this “compact” data set that removes every row containing at least one empty entry:

- name
- seller
- price
- vehicleType
- yearOfRegistration
- gearbox
- powerPS
- model
- kilometer
- fuelType
- monthOfRegistration
- brand
- postalCode

library(dplyr)
library(magrittr)
# keep only the relevant variables
data_df = data_df %>% select(name, seller, price, vehicleType, yearOfRegistration, gearbox,
                             powerPS, model, kilometer, fuelType, monthOfRegistration, brand, postalCode)

# filter out rows with empty entries; base R rather than dplyr
for(i in 1:ncol(data_df)){
  data_df = data_df[data_df[,i] != "", ]   # index i, not 1, so every column is checked
}
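The same filtering can be written more compactly with dplyr; a minimal sketch, assuming dplyr 1.0.4 or later for if_all():

# dplyr equivalent: keep only rows in which every column is non-empty
data_df = data_df %>% filter(if_all(everything(), ~ .x != ""))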

# scatter plot of mean posting price by year of vehicle registration
library(ggplot2)
plot_df = data_df %>% 
  filter(yearOfRegistration >= 1990, yearOfRegistration <= 2018) %>% 
  group_by(yearOfRegistration) %>% 
  summarise(mean_price = mean(price))
ggplot(plot_df) + 
  geom_point(mapping = aes(x = yearOfRegistration, y = mean_price), color = "red") + 
  labs(title = "mean prices per second hand car posting, for 1990 - 2018",
       subtitle = "an analysis of German eBay postings (used car market)") + 
  xlab("year of vehicle registration") + 
  ylab("mean price [EUR]")

In upcoming posts I will explore this data set further to analyze the German used car market. I will also clean the data much more thoroughly and look for outliers, which the summary above already hints at (e.g. a maximum price of 99,999,999 EUR and registration years ranging from 1000 to 9999); a first sketch of such a filter follows below.
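The cut-off values in the following snippet are illustrative assumptions on my part, not results from this post:

# illustrative outlier filter with assumed plausibility ranges
data_clean_df = data_df %>% 
  filter(price > 100, price < 100000,                               # assumed plausible price range in EUR
         yearOfRegistration >= 1950, yearOfRegistration <= 2016)    # the data was crawled in 2016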
