K-Means clustering on OECD data in R

In this post I show you: a) how to retrieve data from the OECD database using the OECD R package, b) how to run k-means clustering in base R (the kmeans() function from the stats package), and c) how to visualize the result with the ggplot2 R package.

I will run a k-means clustering algorithm over a small dataset retrieved from the OECD database, using the OECD package and the stats package that ships with R.

First, I retrieve some OECD GDP data:

# import OECD package in R
library(OECD)

# retrieve national accounts data (SNA_TABLE1, which includes GDP by the output approach)
# for a selection of countries, for the years 2017 and 2018
# countries included:
# -- Austria
# -- Belgium
# -- Finland
# -- Germany
# -- Italy

data_df <- as.data.frame(get_dataset(dataset = "SNA_TABLE1",
                                     filter = list(c("AUT","BEL","FIN","DEU","ITA")),
                                     start_time = 2017,
                                     end_time = 2018))

# show the first rows of the dataset
head(data_df)

##   LOCATION TRANSACT MEASURE TIME_FORMAT UNIT POWERCODE REFERENCEPERIOD obsTime
## 1      AUT    B1_GA       C         P1Y  EUR         6            <NA>    2017
## 2      AUT    B1_GA       C         P1Y  EUR         6            <NA>    2018
## 3      AUT B1G_P119       C         P1Y  EUR         6            <NA>    2017
## 4      AUT B1G_P119       C         P1Y  EUR         6            <NA>    2018
## 5      AUT    P3_P5       C         P1Y  EUR         6            <NA>    2017
## 6      AUT    P3_P5       C         P1Y  EUR         6            <NA>    2018
##   obsValue OBS_STATUS
## 1 370295.8       <NA>
## 2 385711.9       <NA>
## 3 330332.9       <NA>
## 4 344658.8       <NA>
## 5 357239.7       <NA>
## 6 371036.2       <NA>
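
The retrieved table mixes several transaction codes, which is why it needs further filtering. As a minimal sketch, the six rows printed above can be rebuilt by hand and tabulated to see which codes are present. The name sample_df is only illustrative, and only a few of the columns are reproduced:

```r
# hand-built copy of the six rows printed above (subset of columns only)
sample_df <- data.frame(
  LOCATION = rep("AUT", 6),
  TRANSACT = rep(c("B1_GA", "B1G_P119", "P3_P5"), each = 2),
  MEASURE  = "C",
  obsTime  = rep(c("2017", "2018"), times = 3),
  obsValue = c(370295.8, 385711.9, 330332.9, 344658.8, 357239.7, 371036.2),
  stringsAsFactors = FALSE
)

# count rows per transaction code
table(sample_df$TRANSACT)

# list the measure codes that occur
unique(sample_df$MEASURE)
```

Running the same two calls on the full data_df shows all codes that the later filter has to choose from.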

Processing the retrieved OECD dataset

In the next step I filter further, using dplyr:

# importing dplyr
library(dplyr)

library(magrittr)

# apply filter function from dplyr package
data_df <- data_df %>% filter(TRANSACT == "B1_GA",
                              MEASURE == "C",
                              TIME_FORMAT == "P1Y",
                              POWERCODE == "6")

# use base R subsetting to keep only rows without an observation status flag, i.e. drop estimates and other "ball-park figures"
data_df <- data_df[is.na(data_df$OBS_STATUS),]

# select columns relevant for further analysis
data_df <- data_df %>% select(LOCATION, obsTime, obsValue)

# sub-set into two different dataframes
data2017_df <- subset(data_df,obsTime == "2017")
data2018_df <- subset(data_df,obsTime == "2018")

# merge into one new data frame
joint_df <- inner_join(data2017_df,data2018_df,by="LOCATION") %>% select(LOCATION,obsValue.x,obsValue.y)
colnames(joint_df) <- c("country","val2017","val2018")

# view header of data subset
head(joint_df)

##   country   val2017   val2018
## 1     AUT  370295.8  385711.9
## 2     BEL  446364.9  459819.8
## 3     FIN  225785.0  234453.0
## 4     DEU 3244990.0 3344370.0
## 5     ITA 1736601.8 1765421.4
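
If the OECD API is not reachable, the table above can be recreated offline. The sketch below types in the printed values in long format and reshapes them with base R's reshape() instead of the subset-and-join approach used in this post. It is an alternative under stated assumptions: long_df and wide_df are illustrative names, and obsTime is stored as a number here, which may differ from what the API returns.

```r
# long-format copy of the values printed above, one row per country and year
long_df <- data.frame(
  LOCATION = rep(c("AUT", "BEL", "FIN", "DEU", "ITA"), each = 2),
  obsTime  = rep(c(2017, 2018), times = 5),
  obsValue = c(370295.8, 385711.9,
               446364.9, 459819.8,
               225785.0, 234453.0,
               3244990.0, 3344370.0,
               1736601.8, 1765421.4),
  stringsAsFactors = FALSE
)

# one row per country, one value column per year
wide_df <- reshape(long_df,
                   idvar     = "LOCATION",
                   timevar   = "obsTime",
                   direction = "wide")

# same column names as joint_df in this post
names(wide_df) <- c("country", "val2017", "val2018")

wide_df
```

The reshape() route scales to more than two years without adding one join per year, whereas the join makes the two-year structure very explicit.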

Apply k-means clustering algorithm from base R (stats package)

Now I run a k-means clustering algorithm on this small dataset, using the kmeans() function from the stats package that ships with R. I will search for two cluster midpoints, i.e. two centers:

# fix the random seed, since k-means starts from randomly chosen centers
set.seed(42)

# create cluster analysis object
clustering_obj <- kmeans(joint_df[,c(2,3)],centers=2)

# assign cluster to joint_df
joint_df$cluster <- clustering_obj$cluster
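
The object returned by kmeans() holds more than the cluster index. The following sketch rebuilds joint_df from the values printed earlier so that it runs without the OECD API, and then inspects the centers, the cluster sizes and the total within-cluster sum of squares. The seed value and nstart = 25 are my own choices for a reproducible result, not something this analysis requires.

```r
# hand-built copy of joint_df from the output shown earlier
joint_df <- data.frame(
  country = c("AUT", "BEL", "FIN", "DEU", "ITA"),
  val2017 = c(370295.8, 446364.9, 225785.0, 3244990.0, 1736601.8),
  val2018 = c(385711.9, 459819.8, 234453.0, 3344370.0, 1765421.4),
  stringsAsFactors = FALSE
)

# several random starts make the result less dependent on the initial centers
set.seed(42)
clustering_obj <- kmeans(joint_df[, c("val2017", "val2018")],
                         centers = 2,
                         nstart  = 25)

# coordinates of the two cluster centers
clustering_obj$centers

# number of countries per cluster
clustering_obj$size

# total within-cluster sum of squares (lower means tighter clusters)
clustering_obj$tot.withinss

# countries grouped by assigned cluster
split(joint_df$country, clustering_obj$cluster)
```

With these values, Germany and Italy should end up in one cluster and Austria, Belgium and Finland in the other, since that split has the lowest total within-cluster sum of squares. The cluster numbers themselves (1 or 2) are arbitrary and can swap between runs.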

Visualize results using the ggplot2 package in R

Using a colored scatterplot we can visualize cluster assignment in this small dataset:

# import ggplot2 library
library(ggplot2)

# create scatterplot colored by cluster index; factor() turns the integer index into a discrete color scale
ggplot(joint_df) + geom_point(mapping = aes(x = val2017, y = val2018, color = factor(cluster)))

A simple k-means example, performed on a subset of OECD data (retrieved with OECD package in R)
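
A possible extension of this plot is to label each point with its country code and overlay the two cluster centers. The sketch below is self-contained and rebuilds the data from the values printed earlier. The axis labels assume that UNIT "EUR" with POWERCODE 6 means millions of euros, and the names fit, centers_df and p are illustrative.

```r
library(ggplot2)

# hand-built copy of joint_df from the output shown earlier
joint_df <- data.frame(
  country = c("AUT", "BEL", "FIN", "DEU", "ITA"),
  val2017 = c(370295.8, 446364.9, 225785.0, 3244990.0, 1736601.8),
  val2018 = c(385711.9, 459819.8, 234453.0, 3344370.0, 1765421.4),
  stringsAsFactors = FALSE
)

set.seed(42)
fit <- kmeans(joint_df[, c("val2017", "val2018")], centers = 2, nstart = 25)

# store the cluster index as a factor so ggplot2 uses a discrete color scale
joint_df$cluster <- factor(fit$cluster)

# cluster centers as a data frame with the same column names as the data
centers_df <- as.data.frame(fit$centers)

p <- ggplot(joint_df, aes(x = val2017, y = val2018)) +
  geom_point(aes(color = cluster), size = 3) +          # countries, colored by cluster
  geom_text(aes(label = country), vjust = -1) +         # country code above each point
  geom_point(data = centers_df, shape = 4, size = 5) +  # cluster centers drawn as crosses
  labs(x = "GDP 2017 (million EUR, assumed unit)",
       y = "GDP 2018 (million EUR, assumed unit)",
       color = "Cluster")

print(p)
```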

If you want to learn more about the OECD R package, check out my other posts on it, e.g. on how to retrieve OECD inland freight data in R.

Linnart Felkl

Data scientist focusing on simulation, optimization and modeling in R, SQL, VBA and Python
