K-Means clustering on OECD data in R

In this post I show you: a) How to retrieve data from OECD database using OECD R-package, b) how to run k-means clustering in R using the base package, and c) how to visualize data with the ggplot2 R-package.

I will run a k-means clustering algorithm over a dataset retrieved from the OECD database, using the base and OECD packages in R.

First, I retrieve some OECD gdp data:

# import OECD package in R
library(OECD)

# retrieve economic gdp data measured with output methode - for all countries, between 2008 and 2019 ; for a country selection
# countries included:
# -- austria
# -- belgium
# -- finland
# -- germany
# -- italy

data_df <- as.data.frame(get_dataset(dataset = "SNA_TABLE1",
                                     filter = list(c("AUT","BEL","FIN","DEU","ITA")),
                                     start_time = 2017,
                                     end_time = 2018))

# show the header of that dataset
head(data_df)
##   LOCATION TRANSACT MEASURE TIME_FORMAT UNIT POWERCODE REFERENCEPERIOD obsTime
## 1      AUT    B1_GA       C         P1Y  EUR         6            <NA>    2017
## 2      AUT    B1_GA       C         P1Y  EUR         6            <NA>    2018
## 3      AUT B1G_P119       C         P1Y  EUR         6            <NA>    2017
## 4      AUT B1G_P119       C         P1Y  EUR         6            <NA>    2018
## 5      AUT    P3_P5       C         P1Y  EUR         6            <NA>    2017
## 6      AUT    P3_P5       C         P1Y  EUR         6            <NA>    2018
##   obsValue OBS_STATUS
## 1 370295.8       <NA>
## 2 385711.9       <NA>
## 3 330332.9       <NA>
## 4 344658.8       <NA>
## 5 357239.7       <NA>
## 6 371036.2       <NA>

Processing the retrieved OECD dataset

In a next step I filter further, using dplyr:

# importing dplyr
library(dplyr)
library(magrittr)

# apply filter function from dplyr package
data_df <- data_df %>% filter(TRANSACT == "B1_GA",
                              MEASURE == "C",
                              TIME_FORMAT == "P1Y",
                              POWERCODE == "6")

# using base R functionality to ensure subsetting of dataset without any estimates or "ball-park-figures"
data_df <- data_df[is.na(data_df$OBS_STATUS),]

# select columns relevant for further analysis
data_df <- data_df %>% select(LOCATION, obsTime, obsValue)

# sub-set into two different dataframes
data2017_df <- subset(data_df,obsTime == "2017")
data2018_df <- subset(data_df,obsTime == "2018")

# merge into one new data frame
joint_df <- inner_join(data2017_df,data2018_df,by="LOCATION") %>% select(LOCATION,obsValue.x,obsValue.y)
colnames(joint_df) <- c("country","val2017","val2018")

# view header of data subset
head(joint_df)
##   country   val2017   val2018
## 1     AUT  370295.8  385711.9
## 2     BEL  446364.9  459819.8
## 3     FIN  225785.0  234453.0
## 4     DEU 3244990.0 3344370.0
## 5     ITA 1736601.8 1765421.4

Apply k-means clustering algorithm from R base package

Now I run a k-means clustering algorithm on this small dataset, using the base package in R. I will search for two cluster (middel) points, i.e. two centers:

# create cluster analysis object
clustering_obj <- kmeans(joint_df[,c(2,3)],centers=2)

# assign cluster to joint_df
joint_df$cluster <- clustering_obj$cluster

Visualize results using the ggplot2 package in R

Using a colored scatterplot we can visualize cluster assignment in this small dataset:

# import ggplot2 library
library(ggplot2)

# create scatterplot grouped by cluster index, using discrete color scale
ggplot(joint_df) + geom_point(mapping = aes(x = val2017, y = val2018, fill = cluster))
A simple k-means example, performed on a subset of OECD data (retrieved with OECD package in R)

If you want to learn more about the OECD R-package go ahead and check out my posts on how to retrieve e.g. inland freight OECD data in R.

You May Also Like

Leave a Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.