4 Transferring, Processing, and Wrangling Data
4.1 Introduction
In the past two sessions, we presented some ideas, concepts, and tools that provide a basic foundation for working with data in R. Now that we have this basic foundation, we’ll turn in this session to a more applied exploration of some actual datasets. Our goal here is to introduce you to some useful functions that will allow you to explore and begin making sense of actual datasets in R. This lesson will provide a tour of various functions that can be particularly helpful as you get started processing and wrangling data in R, so that you can get your raw datasets ready for analysis and visualization. Among the topics we’ll cover today are:
Reading data into R
Handling missing data in R
Preparing and transforming datasets for analysis using a variety of tidyverse functions for data wrangling and processing
4.2 Preliminaries
4.2.1 Install and Load Packages
In this lesson, we’ll work with a variety of packages from the tidyverse suite, as well as the fastDummies package, which is a handy package for quickly transforming categorical variables into binary indicator variables (i.e. dummy variables). We’ll also use the haven package later in the lesson to read in a Stata file. Please go ahead and make sure that the tidyverse, fastDummies, and haven packages are installed and loaded. Recall that in general, you only need to install packages once, so assuming you completed the previous lesson, you wouldn’t need to install the tidyverse again (assuming you’re working on the same computer and haven’t made significant changes to your operating system or updated R since then). If you haven’t installed one or more of the packages, recall that you can install a package by passing its name as an argument to the install.packages() function (e.g. install.packages("fastDummies")).
Note that when you need to install more than one package, you can do so by passing a vector of package names to the install.packages() function. For example, in this case, you could use the following:
install.packages(c("tidyverse", "fastDummies", "haven"))
Remember that even if you’ve already installed the packages that you need in a previous session, you must load packages into memory each time you begin a new R session by passing the package names as arguments to the library() function. In this case:
# load packages into memory
library(tidyverse)
library(fastDummies)
library(haven)
4.2.2 Lesson Datasets
In this lesson, we’ll be working with a handful of datasets, which you should have downloaded from the Workshop’s repository page. One of the datasets is a cross-national dataset published by the Quality of Government (QoG) Institute at the University of Gothenburg. The dataset contains information on a variety of political, social, and economic variables from the early 2020s; additional documentation is available on the QoG Institute’s website. In addition, we’ll be working with several World Bank datasets downloaded from the World Development Indicators. You can download these datasets from the Workshop Repository.
4.3 Importing Datasets into R
We will begin this section by learning a few ways to read an individual dataset into memory in R, using the QoG dataset as an example. Then, we’ll explore how to read multiple external datasets into R efficiently using some basic functional programming techniques.
4.3.1 Importing Individual Datasets
Below, we’ll learn how to import individual datasets into R from various sources. We’ll also explore how to handle data stored in different file formats.
4.3.1.1 Reading in Local Files
Though it’s not strictly necessary, it’s useful to begin by setting your working directory to the location on your computer where the data is stored. The easiest way to do this is through the RStudio menu: click Session, then Set Working Directory, then Choose Directory. We can also set the working directory programmatically using the setwd() function.
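For example, a call along the following lines would set the working directory from code (the path here is just a hypothetical placeholder; substitute the folder where you saved the workshop materials on your own machine):
# sets the working directory programmatically (hypothetical path; adjust for your machine)
setwd("~/Documents/r_workshop")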
We can now pass the name of the file and its extension in quotation marks to the read_csv() function (since the data we want to load is a CSV file). We’ll assign it to an object named qog (for “Quality of Government”):
# reads in the QoG cross-national dataset by passing the file path as an argument to "read_csv" and assigns it to a new object named "qog"
qog<-read_csv("data/quality_of_government/qog_bas_cs_jan25.csv")
Rows: 194 Columns: 331
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cname_qog, cname, ccodealp
dbl (328): ccode_qog, ccodecow, ccode, ajr_settmort, atop_ally, atop_number,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
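As the message above suggests, you can suppress this column-specification printout by setting the show_col_types argument to FALSE (or by specifying the column types explicitly). For example, the following call should read in the same data without printing the message:
# reads in the QoG dataset while suppressing the column specification message
qog <- read_csv("data/quality_of_government/qog_bas_cs_jan25.csv",
                show_col_types = FALSE)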
By printing the name of the object, we can extract a preview of the dataset, along with some useful metadata:
# prints contents of "qog" object to console
qog
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
Recall, also, that we can view data frames in the R Studio data viewer by passing the name of the object to the View() function:
# views "qog" in data viewer
View(qog)
Note that to keep things tractable within this document, the table printed above does not print all of the columns in the dataset, but the full data frame should appear in the Viewer when the corresponding object name is passed to the View() function.
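If you prefer a quick overview in the console rather than the data viewer, dplyr’s glimpse() function prints a compact, transposed summary of a data frame (one line per column, showing its type and first few values); for example:
# prints a compact overview of "qog" (one line per column)
glimpse(qog)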
4.3.1.2 Reading in Data from an Online Source
The Quality of Government Institute hosts its data online, and instead of reading the data into R from a downloaded local file, we could have read the data into R directly from the online source by using the appropriate URL. For example, the code below reads in the same QoG dataset directly from its website, and assigns the data to a new object named qog_direct:
# Reads in cross-national CSV dataset directly from QoG website and assigns it to a new object named "qog_direct"
qog_direct<-read_csv("https://www.qogdata.pol.gu.se/data/qog_bas_cs_jan25.csv")
Rows: 194 Columns: 331
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cname_qog, cname, ccodealp
dbl (328): ccode_qog, ccodecow, ccode, ajr_settmort, atop_ally, atop_number,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can print qog_direct to the console to confirm that the data was successfully read in, and that its contents are identical to qog:
# prints contents of qog_direct
qog_direct
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
4.3.1.3 Reading in Data from Cloud Storage
Sometimes, the dataset you are working with won’t be available on a public website, and it may be more convenient to read the data in from a cloud storage account (e.g. Dropbox, OneDrive, or Google Drive) than to download the data to your machine and read in a local file.
The specific steps required to read in a dataset from cloud storage depend on your provider, as well as the nature of the account (for example, additional steps or functions may be required to read in password protected or non-public files), and you may need to do a bit of research to determine the specific steps required in your case.
In general, though, the process tends to be fairly simple. For example, we’ve placed a copy of the QoG dataset on a publicly shared Dropbox page. In order to read this data from Dropbox directly into R, we simply change the “dl=0” at the end of the Dropbox URL to “dl=1”, and pass this modified URL to the read_csv() function. Below, we’ll read in the dataset from Dropbox, and assign it to a new object named qog_cloud:
# reads dataset into R from Dropbox and assigns it to a new object named "qog_cloud"
qog_cloud<-read_csv("https://www.dropbox.com/scl/fi/xxd5otw869auq56fs4c9k/qog_bas_cs_jan25.csv?rlkey=8thev7gb5u1ffbtmhs2tutxxp&e=1&st=folhfq67&dl=1")
Rows: 194 Columns: 331
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cname_qog, cname, ccodealp
dbl (328): ccode_qog, ccodecow, ccode, ajr_settmort, atop_ally, atop_number,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can print out the contents of qog_cloud to the console to ensure that the data has been read in correctly:
# prints contents of "qog_cloud" to the console
qog_cloud
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
4.3.1.4 Reading in data saved in different file formats
The QoG dataset that we read in from different sources in the subsections above was saved as a CSV file; though this is a widely-used and flexible file format, it’s likely that you’ll also have to read in datasets in file formats other than CSV (for example, .xlsx, or Stata, SPSS, or SAS files). There are useful tidyverse packages that can help with importing datasets stored in a variety of such file formats. For example, the readxl package offers handy functions for reading in Excel files (i.e. .xls and .xlsx files), while the haven package provides functions to read in Stata, SPSS, or SAS files.
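For instance, reading in an Excel version of a dataset might look something like the sketch below (the file name here is hypothetical); note that readxl is installed along with the tidyverse but is not attached by library(tidyverse), so it needs to be loaded separately:
# loads readxl and reads a hypothetical Excel file into a tibble
library(readxl)
example_data <- read_excel("data/example_dataset.xlsx", sheet = 1)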
To get a sense of how to read in a non-CSV file, let’s quickly explore how to use the haven package’s read_dta() function to read a Stata file into R as a data frame. As part of the data package you downloaded for this workshop, there is a Stata version of the QoG dataset (“qog_bas_cs_jan25.dta”). Below, we’ll read this version of the dataset into R from our local directory by passing the file path and file name as an argument to the read_dta() function and assigning it to a new object named qog_stata:
# reads in stata version of QoG crossnational dataset from local drive using haven's "read_dta" function and assigns the data to a new object named "qog_stata"
qog_stata <- read_dta("data/quality_of_government/qog_bas_cs_jan25.dta")
Now, let’s print the contents of qog_stata to the console, and confirm that the Stata dataset was successfully read in as an R data frame/tibble:
# prints contents of "qog_stata"
qog_stata
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
4.3.2 Importing Multiple Datasets
Sometimes, you may be working with more than one dataset, in which case it could make sense to iteratively load multiple datasets into memory. In such cases, it is typically useful to read the datasets directly into a list.
Within the “world_bank” subdirectory in the “data” directory, there are four CSV files downloaded from the World Bank’s development indicators site:
wdi_debt2019.csv (country level World Bank data on debt as a share of GDP in 2019)
wdi_fdi2019.csv (country level World Bank data on foreign direct investment as a share of GDP in 2019)
wdi_trade2019.csv (country level World Bank data on trade as a share of GDP in 2019)
wdi_urban2019.csv (country level World Bank data on the urban population as a share of the overall population in 2019).
The first step we must take to iteratively read these files into a list is to make a character vector of the file names we want to read in. The code below uses the list.files() function to extract the file names of the files in the “data/world_bank” directory to a character vector, which we’ll assign to an object named “worldbank_filenames”:
# prints the names of the files we want to read in and assigns the vector of strings to a new object named "worldbank_filenames"
worldbank_filenames<-list.files("data/world_bank")
Let’s confirm that the file names have been extracted correctly:
# prints "worldbank_filenames"
worldbank_filenames
[1] "wdi_debt2019.csv" "wdi_fdi2019.csv" "wdi_trade2019.csv"
[4] "wdi_urban2019.csv"
Now, we’ll use the map() function to iteratively pass the file names in the worldbank_filenames vector to the read_csv() function, and deposit the imported files into a list named world_bank_list:
# iteratively passes file names in "worldbank_filenames" to the "read_csv" function, and deposits imported world bank files into a list that is assigned to an object named "world_bank_list"; assumes the working directory is the one with the world bank files
setwd("data/world_bank")
world_bank_list <- map(worldbank_filenames, read_csv)
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now, let’s go ahead and print the contents of world_bank_list:
# prints contents of "world_bank_list"
world_bank_list
[[1]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Central gove… GC.DOD.TOTL.… ..
2 Albania ALB Central gove… GC.DOD.TOTL.… 75.69848824949…
3 Algeria DZA Central gove… GC.DOD.TOTL.… ..
4 American Samoa ASM Central gove… GC.DOD.TOTL.… ..
5 Andorra AND Central gove… GC.DOD.TOTL.… ..
6 Angola AGO Central gove… GC.DOD.TOTL.… ..
7 Antigua and Barbu… ATG Central gove… GC.DOD.TOTL.… ..
8 Argentina ARG Central gove… GC.DOD.TOTL.… ..
9 Armenia ARM Central gove… GC.DOD.TOTL.… 50.02842068637…
10 Aruba ABW Central gove… GC.DOD.TOTL.… ..
# ℹ 261 more rows
[[2]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Foreign dire… BX.KLT.DINV.… 0.124495985791…
2 Albania ALB Foreign dire… BX.KLT.DINV.… 7.797920483865…
3 Algeria DZA Foreign dire… BX.KLT.DINV.… 0.804144058246…
4 American Samoa ASM Foreign dire… BX.KLT.DINV.… ..
5 Andorra AND Foreign dire… BX.KLT.DINV.… ..
6 Angola AGO Foreign dire… BX.KLT.DINV.… -5.78081314444…
7 Antigua and Barbu… ATG Foreign dire… BX.KLT.DINV.… 7.433324076307…
8 Argentina ARG Foreign dire… BX.KLT.DINV.… 1.485006875706…
9 Armenia ARM Foreign dire… BX.KLT.DINV.… 0.736361516844…
10 Aruba ABW Foreign dire… BX.KLT.DINV.… -2.21528256776…
# ℹ 261 more rows
[[3]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Trade (% of … NE.TRD.GNFS.… ..
2 Albania ALB Trade (% of … NE.TRD.GNFS.… 76.27919464957…
3 Algeria DZA Trade (% of … NE.TRD.GNFS.… 51.80973844157…
4 American Samoa ASM Trade (% of … NE.TRD.GNFS.… 156.5687789799…
5 Andorra AND Trade (% of … NE.TRD.GNFS.… ..
6 Angola AGO Trade (% of … NE.TRD.GNFS.… 57.82953811830…
7 Antigua and Barbu… ATG Trade (% of … NE.TRD.GNFS.… 137.6251757558…
8 Argentina ARG Trade (% of … NE.TRD.GNFS.… 32.63061504584…
9 Armenia ARM Trade (% of … NE.TRD.GNFS.… 96.11415412887…
10 Aruba ABW Trade (% of … NE.TRD.GNFS.… 145.3435727352…
# ℹ 261 more rows
[[4]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Urban popula… SP.URB.TOTL.… 25.754
2 Albania ALB Urban popula… SP.URB.TOTL.… 61.229
3 Algeria DZA Urban popula… SP.URB.TOTL.… 73.189
4 American Samoa ASM Urban popula… SP.URB.TOTL.… 87.147
5 Andorra AND Urban popula… SP.URB.TOTL.… 87.984
6 Angola AGO Urban popula… SP.URB.TOTL.… 66.177
7 Antigua and Barbu… ATG Urban popula… SP.URB.TOTL.… 24.506
8 Argentina ARG Urban popula… SP.URB.TOTL.… 91.991
9 Armenia ARM Urban popula… SP.URB.TOTL.… 63.219
10 Aruba ABW Urban popula… SP.URB.TOTL.… 43.546
# ℹ 261 more rows
It could be useful to label the list elements of world_bank_list. For labels, it would make sense to use the file names in worldbank_filenames, without the “.csv” extension. Below, we use the str_remove() function to remove the “.csv” extension from the file names in worldbank_filenames and assign the result to a new object named worldbank_filenames_base:
# removes CSV extension from "worldbank_filenames"
worldbank_filenames_base <- str_remove(worldbank_filenames, ".csv")
Now, let’s use the names() function to assign the labels in worldbank_filenames_base to the elements in world_bank_list:
# assigns names to datasets in "world_bank_list"
names(world_bank_list) <- worldbank_filenames_base
Now that the names are assigned, we can extract list elements by their labels. Below, for example, we extract the FDI dataset from world_bank_list using its label:
# extracts fdi dataset from "world_bank_list" by assigned label
world_bank_list[["wdi_fdi2019"]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Foreign dire… BX.KLT.DINV.… 0.124495985791…
2 Albania ALB Foreign dire… BX.KLT.DINV.… 7.797920483865…
3 Algeria DZA Foreign dire… BX.KLT.DINV.… 0.804144058246…
4 American Samoa ASM Foreign dire… BX.KLT.DINV.… ..
5 Andorra AND Foreign dire… BX.KLT.DINV.… ..
6 Angola AGO Foreign dire… BX.KLT.DINV.… -5.78081314444…
7 Antigua and Barbu… ATG Foreign dire… BX.KLT.DINV.… 7.433324076307…
8 Argentina ARG Foreign dire… BX.KLT.DINV.… 1.485006875706…
9 Armenia ARM Foreign dire… BX.KLT.DINV.… 0.736361516844…
10 Aruba ABW Foreign dire… BX.KLT.DINV.… -2.21528256776…
# ℹ 261 more rows
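As an aside, it’s also possible to read in and label the files in a single pipeline, without changing the working directory. The sketch below (not run above) uses the full.names argument of list.files() to keep the directory path attached to each file name, and purrr’s set_names() function to carry the cleaned file names along as list labels:
# alternative sketch: reads the World Bank CSVs into a named list in one pipeline
wb_paths <- list.files("data/world_bank", full.names = TRUE)
world_bank_list_alt <- wb_paths %>%
  set_names(str_remove(basename(wb_paths), ".csv")) %>%
  map(read_csv)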
4.4 Processing and Wrangling a Single Dataset
Once we’ve read in our dataset(s) of interest, we typically need to carry out a variety of processing and wrangling tasks to prepare the data for analysis and visualization. In this section, we’ll consider a number of useful functions (most of them from tidyverse packages such as dplyr) that can help with these data preparation tasks.
We’ll start by making a copy of the qog object, by assigning it to a new object named qog_copy; we’ll work with qog_copy instead of the original qog object, to ensure that we can always revert to the original data when needed. Keeping a “clean” version of the dataset of interest, and carrying out data processing and analysis tasks on a copy of this dataset, is good data management practice.
# makes a copy of "qog", called "qog_copy" that we can use for processing; keeps the original data frame, "qog" untouched
qog_copy<-qog
4.4.1 Introducing the %>% (“pipe”) operator
One of the most useful features of the tidyverse is the %>% operator (pronounced “pipe”) which helps chain together different functions to form a clear and explicit data processing and analysis pipeline. To illustrate how a pipe works, let’s first consider a simple data processing operation that does not use a pipe. In particular, we will use the select() function from the dplyr package to select a few columns from qog_copy; this dataset has over 300 variables, so it would make sense to create a more tractable dataset that extracts the specific columns/variables we’re interested in. Let’s say, for example, that we want to select the “cname_qog”, “cname”, “ccodealp”, “undp_hdi”, and “wdi_expedu” variables from qog_copy. We can do so by passing the name of the data object containing the columns we want to select, along with the names of the desired columns, as arguments to the select() function. Below, we’ll select these columns from qog_copy and assign the modified data frame to a new object named qog_copy_select_initial:
# selects columns/variables from "qog_copy" and assigns the
# modified data frame to a new object named "qog_copy_select_initial"
qog_copy_select_initial <- select(qog_copy, cname_qog, cname, ccodealp, undp_hdi, wdi_expedu)
Let’s view qog_copy_select_initial in the data viewer:
# Views "qog_copy_select_initial" in the data viewer
View(qog_copy_select_initial)
Now that we’ve seen how to use the select() function with traditional syntax, let’s see how we can carry out the same operation with a pipe operator. In particular, the piping syntax looks something like this:
# selects columns/variables from "qog_copy" using the
# pipe syntax and assigns the modified data frame
# to a new object named "qog_copy_select_pipe"
qog_copy_select_pipe <-
qog_copy %>%
select(cname_qog, cname, ccodealp, undp_hdi, wdi_expedu)
Note that the pipe operator %>% comes immediately after qog_copy and immediately before we call the select() function. In essence, the pipe operator takes the contents to its left, and uses these contents as an input to the code on its right. Above, the pipe takes the contents of qog_copy on its left, feeds this data into the select() function on the right, and returns a modified data frame, which is assigned to qog_copy_select_pipe. We can pass qog_copy_select_pipe into the data viewer to confirm that the operation worked as expected:
# views "qog_copy_select_pipe" in data viewer
View(qog_copy_select_pipe)
We can already see that the pipe makes our code slightly more readable even when performing a simple operation, but the pipe’s usefulness for writing concise and readable code becomes even more apparent when performing more complex operations that involve several different data processing functions.
For example, let’s say that in addition to selecting the columns above, we also want to subset the dataset to include only countries for which the “undp_hdi” variable is higher than 0.8. We can subset datasets based on such conditions using the filter() function. Below, we’ll take qog_copy_select_initial and subset this dataset to meet the undp_hdi>0.8 condition, and assign the final version of our processed dataset to a new object named qog_final_processed:
# subsets "qog_copy_select_initial" using the "filter()" function to include only observations with undp_hdi>0.8, and deposits the modified dataset into a new object named "qog_final_processed"
qog_final_processed <- filter(qog_copy_select_initial, undp_hdi>0.8)
Now, let’s view qog_final_processed in the data viewer and confirm that it only includes the selected variables and is appropriately subsetted according to the specified condition:
View(qog_final_processed)
Notice that using conventional notation (i.e. notation without a pipe), we arrived at our final processed dataset in two steps; first, we had to select columns using the select() function, and then we had to take the resulting object and call the filter() function to subset the data based on the “undp_hdi” variable. This process can be a little clunky, and the pipe operator helps to streamline it, making for more efficient and readable code. In short, when we use pipe (rather than conventional) notation, we can get from qog_copy to the final processed dataset with just the following:
qog_copy %>%
select(cname_qog, cname, ccodealp, undp_hdi, wdi_expedu) %>%
filter(undp_hdi>0.8)
# A tibble: 65 × 5
cname_qog cname ccodealp undp_hdi wdi_expedu
<chr> <chr> <chr> <dbl> <dbl>
1 Andorra Andorra AND 0.855 2.58
2 Antigua and Barbuda Antigua and Barbuda ATG 0.819 2.57
3 Argentina Argentina ARG 0.844 4.64
4 Australia Australia AUS 0.949 5.33
5 Austria Austria AUT 0.92 5.49
6 Bahrain Bahrain BHR 0.884 2.22
7 Barbados Barbados BRB 0.803 4.86
8 Belgium Belgium BEL 0.938 6.36
9 Brunei Brunei Darussalam BRN 0.824 NA
10 Belarus Belarus BLR 0.801 4.61
# ℹ 55 more rows
This code takes the qog_copy dataset, and then feeds it into the select() function; the output of the select() function is then fed into the filter() function, which then returns a final dataset equivalent to qog_final_processed. Given the readability and conciseness of pipe notation, we will use it extensively in our exploration of data processing and wrangling.
Below, we’ll explore the select() and filter() functions (along with other data processing functions) at greater length; our main purpose in this sub-section was to motivate the utility of the %>% operator in helping to create complex data processing pipelines with intuitive and readable code.
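As a brief aside, if you’re working in a relatively recent version of R (4.1 or later), base R also provides a native pipe operator, |>, which behaves much like %>% for the straightforward pipelines used in this lesson; for example, the select-and-filter pipeline above could also be written as follows:
# the same pipeline written with base R's native pipe (requires R >= 4.1)
qog_copy |>
  select(cname_qog, cname, ccodealp, undp_hdi, wdi_expedu) |>
  filter(undp_hdi > 0.8)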
4.4.2 Selecting and Deleting Variables
In this section, we’ll explore the select() function at greater length, and use it to select some key variables of interest from qog_copy; this will allow us to subsequently work with a more tractable dataset.
In particular, let’s select the following variables from qog_copy (additional information about the variables is available in the Quality of Government Basic Dataset Codebook):
“cname_qog”: Country name as standardized by QoG
“cname”: Country name based on the ISO standard
“ccodealp”: Three-letter ISO country code
“undp_hdi”: Human development index
“wdi_expedu”: Government expenditure on education as a percentage of GDP
“wdi_acel”: Percentage of population with access to electricity
“wdi_area”: Land area in sq km
“wdi_taxrev”: Tax revenue as a percentage of GDP
“wdi_expmil”: Military expenditure as a percentage of GDP
“wdi_fdiin”: Foreign direct investment (FDI) inflows as a share of GDP
“wdi_trade”: Foreign trade as a percentage of GDP
“cbie_index”: Central bank independence index
“ht_region”: World region of the country
“wbgi_rle”: Rule of law index
“bmr_dem”: Dichotomous democracy measure
“atop_ally”: Member of a military alliance
“gol_est”: Electoral system (majoritarian/proportional/mixed)
“mad_gdppc”: Real GDP per capita in 2018 (2011 dollars)
“mad_gdppc1900”: Real GDP per capita in 1900 (2011 dollars)
“bci_bci”: Bayesian Corruption Indicator
“lis_gini”: Gini coefficient
“top_top1_income_share”: Income share of the population’s top 1%
“wdi_wip”: Percentage of lower house or single house parliamentary seats held by women
Below, we select these variables by passing qog_copy to the select() function via a %>%, and then specify the columns we want to select as arguments to the select() function; we assign the resulting selection to a new object named qog_copy_selection:
# selects specific variables from "qog_copy" and assigns the selection to a new object named "qog_copy_selection"
qog_copy_selection <- qog_copy %>%
select(cname_qog,
cname,
ccodealp,
undp_hdi,
wdi_expedu,
wdi_acel,
wdi_area,
wdi_taxrev,
wdi_expmil,
wdi_fdiin,
wdi_trade,
cbie_index,
ht_region,
wbgi_rle,
bmr_dem,
atop_ally,
gol_est,
mad_gdppc,
mad_gdppc1900,
bci_bci,
lis_gini,
top_top1_income_share,
wdi_wip)
When we view qog_copy_selection in the RStudio data viewer, it looks something like this:
# Views "qog_copy_selection" in the data viewer
View(qog_copy_selection)
Sometimes, it could make more sense to directly delete columns, instead of deciding which ones to keep or select. For example, the code below deletes the “cname” variable from qog_copy_selection by passing it to the select() function with a “-” in front of it:
# removes "cname" column from "qog_copy_selection"
qog_copy_selection %>% select(-cname)
# A tibble: 194 × 22
cname_qog ccodealp undp_hdi wdi_expedu wdi_acel wdi_area wdi_taxrev
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AFG 0.473 NA 97.7 652230 NA
2 Albania ALB 0.785 3.02 100 27400 18.2
3 Algeria DZA 0.74 5.51 99.8 2381741 NA
4 Andorra AND 0.855 2.58 100 470 NA
5 Angola AGO 0.59 2.30 48.2 1246700 10.1
6 Antigua and Barbuda ATG 0.819 2.57 100 440 NA
7 Azerbaijan AZE 0.738 3.70 100 82650 13.4
8 Argentina ARG 0.844 4.64 100 2736690 8.79
9 Australia AUS 0.949 5.33 100 7692020 23.0
10 Austria AUT 0.92 5.49 100 82520 25.9
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expmil <dbl>, wdi_fdiin <dbl>, wdi_trade <dbl>,
# cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>, wdi_wip <dbl>
Note that the “cname” column is not permanently deleted, since we didn’t assign the result back to qog_copy_selection (i.e. with qog_copy_selection <- qog_copy_selection %>% select(-cname)). We can confirm this by reprinting qog_copy_selection and noting that “cname” is still in the data frame:
# prints contents of "qog_copy_selection"
qog_copy_selection# A tibble: 194 × 23
cname_qog cname ccodealp undp_hdi wdi_expedu wdi_acel wdi_area wdi_taxrev
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… AFG 0.473 NA 97.7 652230 NA
2 Albania Alba… ALB 0.785 3.02 100 27400 18.2
3 Algeria Alge… DZA 0.74 5.51 99.8 2381741 NA
4 Andorra Ando… AND 0.855 2.58 100 470 NA
5 Angola Ango… AGO 0.59 2.30 48.2 1246700 10.1
6 Antigua and … Anti… ATG 0.819 2.57 100 440 NA
7 Azerbaijan Azer… AZE 0.738 3.70 100 82650 13.4
8 Argentina Arge… ARG 0.844 4.64 100 2736690 8.79
9 Australia Aust… AUS 0.949 5.33 100 7692020 23.0
10 Austria Aust… AUT 0.92 5.49 100 82520 25.9
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expmil <dbl>, wdi_fdiin <dbl>, wdi_trade <dbl>,
# cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>, wdi_wip <dbl>
If we want to delete multiple columns, we can pass a vector containing the names of the columns to be deleted to the select() function, preceded by a minus sign. For example, if we want to delete “cname”, “undp_hdi”, and “wdi_acel” from qog_copy_selection we can do so with the following:
# deletes "cname", "undp_hdi", and "wdi_acel" from "qog_copy_selection"
qog_copy_selection %>% select(-c(cname, undp_hdi, wdi_acel))
# A tibble: 194 × 20
cname_qog ccodealp wdi_expedu wdi_area wdi_taxrev wdi_expmil wdi_fdiin
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AFG NA 652230 NA 1.83 0.144
2 Albania ALB 3.02 27400 18.2 1.22 6.80
3 Algeria DZA 5.51 2381741 NA 5.59 0.467
4 Andorra AND 2.58 470 NA NA NA
5 Angola AGO 2.30 1246700 10.1 1.29 -6.55
6 Antigua and Bar… ATG 2.57 440 NA NA 18.1
7 Azerbaijan AZE 3.70 82650 13.4 5.27 -3.11
8 Argentina ARG 4.64 2736690 8.79 0.631 1.36
9 Australia AUS 5.33 7692020 23.0 1.99 1.78
10 Austria AUT 5.49 82520 25.9 0.874 3.71
# ℹ 184 more rows
# ℹ 13 more variables: wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>,
# wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>,
# mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>, wdi_wip <dbl>
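Typing out long lists of column names can get tedious. The select() function also accepts tidy-select helper functions such as starts_with(), ends_with(), and contains(); for example, the sketch below (not run above) keeps the country identifiers together with every variable whose name begins with “wdi_”:
# selects the country identifiers plus all columns whose names start with "wdi_"
qog_copy %>%
  select(cname_qog, cname, ccodealp, starts_with("wdi_"))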
4.4.3 Rearranging Columns
We can change the order of the columns in a dataset using the relocate() function. For example, the code below uses the relocate() function to shift the “ccodealp” column to the front of the qog_copy_selection data frame:
# moves "ccodealp" to front of "qog_copy_selection" dataset
qog_copy_selection %>% relocate(ccodealp)
# A tibble: 194 × 23
ccodealp cname_qog cname undp_hdi wdi_expedu wdi_acel wdi_area wdi_taxrev
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AFG Afghanistan Afgh… 0.473 NA 97.7 652230 NA
2 ALB Albania Alba… 0.785 3.02 100 27400 18.2
3 DZA Algeria Alge… 0.74 5.51 99.8 2381741 NA
4 AND Andorra Ando… 0.855 2.58 100 470 NA
5 AGO Angola Ango… 0.59 2.30 48.2 1246700 10.1
6 ATG Antigua and … Anti… 0.819 2.57 100 440 NA
7 AZE Azerbaijan Azer… 0.738 3.70 100 82650 13.4
8 ARG Argentina Arge… 0.844 4.64 100 2736690 8.79
9 AUS Australia Aust… 0.949 5.33 100 7692020 23.0
10 AUT Austria Aust… 0.92 5.49 100 82520 25.9
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expmil <dbl>, wdi_fdiin <dbl>, wdi_trade <dbl>,
# cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>, wdi_wip <dbl>
We can specify more than one argument to the relocate function. For example, in the code below, passing the “ccodealp”, “wdi_acel”, “wdi_expmil”, and “wdi_wip” variables/columns to the relocate() function will make “ccodealp” the first column, “wdi_acel” the second column, “wdi_expmil” the third column, and “wdi_wip” the fourth column. Below, we specify this order, and assign the result back to qog_copy_selection:
# sets the order for the first four columns of "qog_copy_selection" and assigns the result back to "qog_copy_selection"
qog_copy_selection <- qog_copy_selection %>%
relocate(ccodealp, wdi_acel, wdi_expmil, wdi_wip)
Now, let’s print the contents of “qog_copy_selection” to the console and confirm the change has been made:
# prints contents of updated "qog_copy_selection" object
qog_copy_selection
# A tibble: 194 × 23
ccodealp wdi_acel wdi_expmil wdi_wip cname_qog cname undp_hdi wdi_expedu
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 AFG 97.7 1.83 27.0 Afghanistan Afgh… 0.473 NA
2 ALB 100 1.22 35.7 Albania Alba… 0.785 3.02
3 DZA 99.8 5.59 8.11 Algeria Alge… 0.74 5.51
4 AND 100 NA 46.4 Andorra Ando… 0.855 2.58
5 AGO 48.2 1.29 29.5 Angola Ango… 0.59 2.30
6 ATG 100 NA 11.1 Antigua and B… Anti… 0.819 2.57
7 AZE 100 5.27 18.2 Azerbaijan Azer… 0.738 3.70
8 ARG 100 0.631 44.7 Argentina Arge… 0.844 4.64
9 AUS 100 1.99 31.1 Australia Aust… 0.949 5.33
10 AUT 100 0.874 40.4 Austria Aust… 0.92 5.49
# ℹ 184 more rows
# ℹ 15 more variables: wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
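The relocate() function also has .before and .after arguments, which position columns relative to another column rather than moving them to the front of the data frame. For example, the sketch below (not run above) would place “wdi_trade” immediately after “ccodealp”:
# moves "wdi_trade" so that it sits immediately after "ccodealp"
qog_copy_selection %>%
  relocate(wdi_trade, .after = ccodealp)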
4.4.4 Renaming Variables
In order to rename variables, we can use the rename() function; the argument to this function takes the form new_name=old_name. The code below renames the existing “ccodealp” variable in the qog_copy_selection data frame to “iso3”, using the %>% operator to pass qog_copy_selection to the rename() function:
# renames "ccodealp" variable in "qog_copy_selection" to "iso3"
qog_copy_selection %>%
rename(iso3=ccodealp)
# A tibble: 194 × 23
iso3 wdi_acel wdi_expmil wdi_wip cname_qog cname undp_hdi wdi_expedu
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 AFG 97.7 1.83 27.0 Afghanistan Afgh… 0.473 NA
2 ALB 100 1.22 35.7 Albania Alba… 0.785 3.02
3 DZA 99.8 5.59 8.11 Algeria Alge… 0.74 5.51
4 AND 100 NA 46.4 Andorra Ando… 0.855 2.58
5 AGO 48.2 1.29 29.5 Angola Ango… 0.59 2.30
6 ATG 100 NA 11.1 Antigua and Barb… Anti… 0.819 2.57
7 AZE 100 5.27 18.2 Azerbaijan Azer… 0.738 3.70
8 ARG 100 0.631 44.7 Argentina Arge… 0.844 4.64
9 AUS 100 1.99 31.1 Australia Aust… 0.949 5.33
10 AUT 100 0.874 40.4 Austria Aust… 0.92 5.49
# ℹ 184 more rows
# ℹ 15 more variables: wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
Note that the “ccodealp” variable was renamed to “iso3”. If we want to change the name of more than one variable, the rename() function can take multiple arguments, separated by commas. For example, the code below changes “undp_hdi” to “hdi” and “wdi_area” to “wdi_area_sqkm”:
# renames "undp_hdi" variable to "hdi", and "wdi_area" to "wdi_area_sqkm" in "qog_copy_selection" data frame
qog_copy_selection %>%
rename(hdi=undp_hdi,
wdi_area_sqkm=wdi_area)
# A tibble: 194 × 23
ccodealp wdi_acel wdi_expmil wdi_wip cname_qog cname hdi wdi_expedu
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 AFG 97.7 1.83 27.0 Afghanistan Afgh… 0.473 NA
2 ALB 100 1.22 35.7 Albania Alba… 0.785 3.02
3 DZA 99.8 5.59 8.11 Algeria Alge… 0.74 5.51
4 AND 100 NA 46.4 Andorra Ando… 0.855 2.58
5 AGO 48.2 1.29 29.5 Angola Ango… 0.59 2.30
6 ATG 100 NA 11.1 Antigua and Barb… Anti… 0.819 2.57
7 AZE 100 5.27 18.2 Azerbaijan Azer… 0.738 3.70
8 ARG 100 0.631 44.7 Argentina Arge… 0.844 4.64
9 AUS 100 1.99 31.1 Australia Aust… 0.949 5.33
10 AUT 100 0.874 40.4 Austria Aust… 0.92 5.49
# ℹ 184 more rows
# ℹ 15 more variables: wdi_area_sqkm <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
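If you ever need to rename many columns at once according to a rule (rather than one at a time), dplyr’s rename_with() function applies a renaming function to the column names; as a quick sketch (not run above), the following would convert all of the column names in qog_copy_selection to upper case:
# renames all columns by applying toupper() to the existing column names
qog_copy_selection %>%
  rename_with(toupper)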
4.4.5 Sorting data
It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the qog_copy_selection data frame in ascending order with respect to the “wdi_trade” variable using the arrange() function, and then uses the relocate() function to move “wdi_trade” towards the front of the dataset, right after the country name and country code variables:
# sorts "qog_copy_selection" data frame in ascending (low to high) order with respect to the "wdi_trade" variable, and then brings the "wdi_trade" variable
qog_copy_selection %>%
arrange(wdi_trade) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade)
# A tibble: 194 × 23
cname_qog cname ccodealp wdi_trade wdi_acel wdi_expmil wdi_wip undp_hdi
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ethiopia (1993… Ethi… ETH 24.3 54.2 0.492 42.6 0.489
2 United States Unit… USA 25.3 100 3.46 27.6 0.921
3 Pakistan (1971… Paki… PAK 27.0 94.9 2.87 20.2 0.537
4 Bangladesh Bang… BGD 27.7 99 1.20 20.9 0.662
5 Burundi Buru… BDI 28.8 10.2 2.04 38.2 0.419
6 Egypt Egypt EGY 29.9 100 1.12 27.7 0.726
7 Tanzania Tanz… TZA 29.9 42.7 1.08 36.9 0.529
8 Kenya Kenya KEN 30.7 76.5 1.07 21.6 0.596
9 Argentina Arge… ARG 32.9 100 0.631 44.7 0.844
10 Turkmenistan Turk… TKM 33.1 100 NA 25 0.74
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>,
# wdi_fdiin <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
If, instead, we want to sort the dataset in descending order with respect to the “wdi_trade” variable, we can pass the name of the variable to the desc() function within the arrange() function, as below:
# sorts "qog_copy_selection" data frame in descending (high to low) order with respect to the "wdi_trade" variable, and then brings the "wdi_trade" variable
qog_copy_selection %>%
arrange(desc(wdi_trade)) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade)
# A tibble: 194 × 23
cname_qog cname ccodealp wdi_trade wdi_acel wdi_expmil wdi_wip undp_hdi
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Luxembourg Luxem… LUX 393. 100 0.472 35 0.927
2 San Marino San M… SMR 342. 100 NA 33.3 0.853
3 Singapore Singa… SGP 330. 100 2.78 29.8 0.942
4 Malta Malta MLT 315. 100 0.496 13.4 0.912
5 Djibouti Djibo… DJI 264. 65.4 NA 26.2 0.512
6 Ireland Irela… IRL 227. 100 0.251 22.5 0.946
7 Vietnam Viet … VNM 187. 100 2.28 30.3 0.718
8 Slovakia Slova… SVK 184. 100 1.77 22.7 0.852
9 Seychelles Seych… SYC 184. 100 2.78 22.9 0.795
10 Cyprus (1975-) Cyprus CYP 175. 100 1.91 14.3 0.901
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>,
# wdi_fdiin <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
Note that it’s also possible to pass several arguments to the arrange() function, and thereby sort a dataset with respect to multiple variables; for example, the code below sorts the qog_copy_selection dataset by region (“ht_region”), and then further sorts it in descending order by “wdi_trade”. It assigns the result back to qog_copy_selection to effectively “save” these changes to the object:
# arranges the "qog_copy_selection" data frame in ascending order with respect to "ht_region" and then in descending order with respect to "wdi_trade", and then relocates these variables to the front of the dataset; changes are assigned back to "qog_copy_selection" to store these changes
qog_copy_selection<-qog_copy_selection %>%
arrange(ht_region, desc(wdi_trade)) %>%
relocate(cname_qog, cname, ccodealp, ht_region, wdi_trade)
Let’s view the updated version of qog_copy_selection in the data viewer:
# Views "qog_copy_selection" in the data viewer
View(qog_copy_selection)
As we can see, sorting the data in this way allows us to quickly note the country in each regional grouping with the highest value on the “wdi_trade” variable. Sorting a dataset with respect to more than one variable can be especially useful in certain contexts, particularly when you’re dealing with nested data. For example, in a time series dataset, it can be useful to first sort by year, and then by month within each year. In a dataset with regional information, it could be helpful to first sort by region, and then by cities within those regional groupings.
4.4.6 Creating New Variables Based on Existing Variables
Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. The mutate() function from the dplyr package is particularly useful for this purpose. To see how it works, consider the code below, which uses the mutate() function to create a new variable named “mip” (which stands for “men in parliament”) that is computed by subtracting the existing “wdi_wip” (percentage of parliamentary seats held by women) variable from 100. For convenience, we’ll also move “wdi_wip” and the newly created “mip” variables to the front of the dataset using relocate():
# Creates new variable named "mip" (percentage of men in parliement) that is calculated by substracting the women's share of parliamentary sets ("wdi_wip") from 100 and relocates these variables to the front of the dataset
qog_copy_selection %>%
mutate(mip=100-wdi_wip) %>%
relocate(cname_qog, cname, ccodealp, wdi_wip, mip)
# A tibble: 194 × 24
cname_qog cname ccodealp wdi_wip mip ht_region wdi_trade wdi_acel
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Slovakia Slovakia SVK 22.7 77.3 1 184. 100
2 Estonia Estonia EST 25.7 74.3 1 162. 100
3 Slovenia Slovenia SVN 26.7 73.3 1 161. 100
4 Hungary Hungary HUN 13.1 86.9 1 160. 100
5 Lithuania Lithuania LTU 27.7 72.3 1 156. 100
6 North Macedonia North Ma… MKD 41.7 58.3 1 147. 100
7 Czech Republic Czechia CZE 25 75 1 142. 100
8 Belarus Belarus BLR 40 60 1 136. 100
9 Latvia Latvia LVA 29 71 1 132. 100
10 Bulgaria Bulgaria BGR 23.8 76.2 1 121. 99.8
# ℹ 184 more rows
# ℹ 16 more variables: wdi_expmil <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>,
# wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>,
# wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>,
# mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
Note that the arguments to the mutate() function take the form new_variable=expression, where the expression is typically based on one or more existing variables. It’s possible to define more than one new variable at a time using mutate(); simply separate the arguments with commas within the mutate() function. For example, the code below creates a new variable (“land_area_sqmiles”) based on “wdi_area” (land area in sq km), as well as a new variable (“no_electricity_access”) that gives the percentage of the population without access to electricity (based on the “wdi_acel” variable, which represents the percentage of the population with access to electricity):
# creates "land_area_sqmiles" variable based on "wdi_area" and "no_electricity_access" variable based on "wdi_acel"
qog_copy_selection %>%
mutate(land_area_sqmiles=wdi_area/2.5899,
no_electricity_access=100-wdi_acel) %>%
relocate(cname_qog,
cname,
ccodealp,
land_area_sqmiles,
wdi_area,
no_electricity_access,
wdi_acel)
# A tibble: 194 × 25
cname_qog cname ccodealp land_area_sqmiles wdi_area no_electricity_access
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Slovakia Slov… SVK 18564. 48080 0
2 Estonia Esto… EST 16506. 42750 0
3 Slovenia Slov… SVN 7775. 20136. 0
4 Hungary Hung… HUN 35237. 91260 0
5 Lithuania Lith… LTU 24175. 62610 0
6 North Macedo… Nort… MKD 9738. 25220 0
7 Czech Republ… Czec… CZE 29803. 77187. 0
8 Belarus Bela… BLR 78362. 202950 0
9 Latvia Latv… LVA 24028. 62230 0
10 Bulgaria Bulg… BGR 41917. 108560 0.200
# ℹ 184 more rows
# ℹ 19 more variables: wdi_acel <dbl>, ht_region <dbl>, wdi_trade <dbl>,
# wdi_expmil <dbl>, wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
The mutate() function is particularly handy when we want to transform a continuous numeric variable into a dichotomous variable (i.e. a “dummy” variable that takes on a value of 1 if a condition is met and 0 otherwise) or a categorical variable. For example, let’s say we want to create a new variable, named “trade_open”, that takes on the value of 1 if the “wdi_trade” variable is greater than or equal to 60, and 0 otherwise. We can generate this new dummy variable using the mutate() function. In particular, within the mutate() function below, we specify that we want to create a new variable named “trade_open”; the ifelse() function specifies the desired condition (wdi_trade>=60), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value it is to take if the condition is not met (0). In other words, we can translate ifelse(wdi_trade>=60, 1, 0) to “if wdi_trade>=60, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “qog_copy_selection”:
# Creates a new dummy variable named "trade_open" based on the existing "wdi_trade" variable (which takes on a value of 1 if "wdi_trade" is greater than or equal to 60, and 0 otherwise), and then moves the newly created variables to the front of the dataset; all changes are assigned to "qog_copy_selection", thereby overwriting the existing version of "qog_copy_selection"
qog_copy_selection<-
qog_copy_selection %>%
mutate(trade_open=ifelse(wdi_trade>=60, 1, 0)) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade, trade_open)
Let’s print the updated contents of qog_copy_selection to see that the changes have been made:
# prints updated contents of "qog_copy_selection"
qog_copy_selection
# A tibble: 194 × 24
cname_qog cname ccodealp wdi_trade trade_open ht_region wdi_acel wdi_expmil
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Slovakia Slov… SVK 184. 1 1 100 1.77
2 Estonia Esto… EST 162. 1 1 100 2.01
3 Slovenia Slov… SVN 161. 1 1 100 1.24
4 Hungary Hung… HUN 160. 1 1 100 1.68
5 Lithuania Lith… LTU 156. 1 1 100 1.97
6 North Mace… Nort… MKD 147. 1 1 100 1.47
7 Czech Repu… Czec… CZE 142. 1 1 100 1.40
8 Belarus Bela… BLR 136. 1 1 100 1.12
9 Latvia Latv… LVA 132. 1 1 100 2.07
10 Bulgaria Bulg… BGR 121. 1 1 99.8 1.52
# ℹ 184 more rows
# ℹ 16 more variables: wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>,
# wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>,
# wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>,
# mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
It can also be useful to create categorical variables based on numeric thresholds from an existing variable. For instance, let’s say we want to take the existing “wdi_trade” variable, and define a new variable named “trade_level”, which is set to “Low_Trade” when “wdi_trade” is greater than 15 and less than 50; “Intermediate_Trade” when “wdi_trade” is greater than or equal to 50 and less than 100; and “High_Trade” when “wdi_trade” is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate() function; within mutate(), the case_when() function maps the conditions onto the desired values of “trade_level” using the following syntax:
# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low_Trade" when the "wdi_trade" variable is greater than 15 and less than 50, coded as "Intermediate_Trade" when "wdi_trade" is greater than or equal to 50 and less than 100, and coded as "High_Trade" when "wdi_trade" is greater than or equal to 100), and then reorders the dataset to move "trade_level" and "wdi_trade" to the front of the dataset; the changes are assigned back to "qog_copy_selection"
qog_copy_selection<-
qog_copy_selection %>%
mutate(trade_level=case_when(wdi_trade>15 & wdi_trade<50~"Low_Trade",
wdi_trade>=50 & wdi_trade<100~"Intermediate_Trade",
wdi_trade>=100~"High_Trade")) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade, trade_level)
Let’s print the contents of the updated qog_copy_selection data frame and confirm that the changes have been made:
# prints updated contents of "qog_copy_selection"
qog_copy_selection
# A tibble: 194 × 25
cname_qog cname ccodealp wdi_trade trade_level trade_open ht_region wdi_acel
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Slovakia Slov… SVK 184. High_Trade 1 1 100
2 Estonia Esto… EST 162. High_Trade 1 1 100
3 Slovenia Slov… SVN 161. High_Trade 1 1 100
4 Hungary Hung… HUN 160. High_Trade 1 1 100
5 Lithuania Lith… LTU 156. High_Trade 1 1 100
6 North Mac… Nort… MKD 147. High_Trade 1 1 100
7 Czech Rep… Czec… CZE 142. High_Trade 1 1 100
8 Belarus Bela… BLR 136. High_Trade 1 1 100
9 Latvia Latv… LVA 132. High_Trade 1 1 100
10 Bulgaria Bulg… BGR 121. High_Trade 1 1 99.8
# ℹ 184 more rows
# ℹ 17 more variables: wdi_expmil <dbl>, wdi_wip <dbl>, undp_hdi <dbl>,
# wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# cbie_index <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>,
# gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>,
# lis_gini <dbl>, top_top1_income_share <dbl>
Another useful operation is to create dummy variables based on a categorical variable. For example, consider the “trade_level” variable we just created. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols() function. Below, we simply take the existing qog_copy_selection dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols() function:
# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset; assigns changes back to "qog_copy_selection"
qog_copy_selection<-
qog_copy_selection %>%
dummy_cols("trade_level") %>%
relocate(cname_qog,
cname,
ccodealp,
trade_level,
trade_level_High_Trade,
trade_level_Intermediate_Trade,
trade_level_Low_Trade,
trade_level_NA)
You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on. If we were to incorporate these dummy variables into a regression model, we would exclude one of them (the excluded category would then serve as the reference category).
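Note that dummy_cols() also has optional arguments that can streamline this step. In particular, to the best of our knowledge, setting remove_first_dummy = TRUE drops the dummy for the first category (so the remaining dummies can go straight into a regression, with the dropped category serving as the reference), while ignore_na = TRUE skips creating the “_NA” column for missing values. An illustrative sketch, which could have been used in place of the call above, might look like this:
# sketch: creates dummies from "trade_level", dropping the first category's dummy
# and skipping the NA indicator column
qog_copy_selection %>%
  dummy_cols("trade_level", remove_first_dummy = TRUE, ignore_na = TRUE)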
Let’s view the updated version of qog_copy_selection in the data viewer to see our newly created variables. First, we’ll bring the variables of interest to the front of the data frame:
# specifies variable order
qog_copy_selection<-
qog_copy_selection %>%
relocate(cname_qog, cname, ccodealp, wdi_trade, trade_open, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade, trade_level_NA)
Let’s now view the modified data frame and note the new variables:
# Views "qog_copy_selection" in the data viewer
View(qog_copy_selection)
4.4.7 Subsetting (Filtering) Observations
We already introduced the filter() function in an earlier section, in the context of our discussion of the %>% operator. We’ll now briefly return to this function, and provide additional examples of how it can be used to subset our data frames.
Let’s say that we want to subset the qog_copy_selection data frame to extract only countries that are part of the South-East Asia region (coded as a “7” in the “ht_region” variable), and assign this South-East Asia data frame to its own object, named se_asia_data. We can do so with the following:
# uses the filter() function to extract South East Asia observations from "qog_copy_selection" and assigns the result to a new object named "se_asia_data"
se_asia_data<-qog_copy_selection %>%
filter(ht_region==7)
Let’s print out the contents of se_asia_data to see if it worked as expected:
# prints "se_asia_data" to console
se_asia_data
# A tibble: 11 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Singapore Singapore SGP 330. 1 High_Trade
2 Vietnam Viet Nam VNM 187. 1 High_Trade
3 Brunei Brunei Darussalam BRN 147. 1 High_Trade
4 Malaysia (1966-) Malaysia MYS 134. 1 High_Trade
5 Cambodia Cambodia KHM 129. 1 High_Trade
6 Thailand Thailand THA 117. 1 High_Trade
7 Timor-Leste Timor-Leste TLS 99.9 1 Intermedia…
8 Philippines Philippines (the) PHL 63.5 1 Intermedia…
9 Indonesia Indonesia IDN 40.2 0 Low_Trade
10 Myanmar Myanmar MMR NA NA <NA>
11 Laos Lao People's Demo… LAO NA NA <NA>
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
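Notice that Myanmar and Laos have missing values for “wdi_trade” and “trade_level”. If we wanted to keep only observations with non-missing trade data, we could combine filter() with is.na(); the short sketch below simply builds on the se_asia_data object we just created and drops those two rows:
# optional: keeps only the rows of "se_asia_data" where "wdi_trade" is not missing
se_asia_data %>%
  filter(!is.na(wdi_trade))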
We can also chain together multiple conditions. For example, let’s use the filter() function to extract observations for countries that are part of the “Sub-Saharan Africa” region and have a Human Development Index (“undp_hdi”) score greater than or equal to 0.65. We can do so by setting up an expression that uses the “&” logical operator:
# subsets data to include Sub-Saharan African countries with a "undp_hdi" value greater than or equal to 0.65 and assigns the result to an object named "sub_saharan_africa_hdi"
sub_saharan_africa_hdi<-qog_copy_selection %>%
filter(ht_region==4 & undp_hdi>=0.65)Let’s print out the contents of sub_saharan_africa_hdi:
# prints "sub_saharan_africa_hdi"
sub_saharan_africa_hdi# A tibble: 6 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Seychelles Seychelles SYC 184. 1 High_Trade
2 Mauritius Mauritius MUS 98.0 1 Intermediate_Trade
3 Botswana Botswana BWA 88.8 1 Intermediate_Trade
4 Cape Verde Cabo Verde CPV 77.5 1 Intermediate_Trade
5 Gabon Gabon GAB 74.5 1 Intermediate_Trade
6 South Africa South Africa ZAF 56.1 0 Intermediate_Trade
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
We can also subset datasets using non-numeric data. For example, to extract all of the observations from sub_saharan_africa_hdi for which the “trade_level” variable we created earlier is “Intermediate_Trade”, we can do the following:
# Filters observations from "sub_saharan_africa_hdi" for which the "trade_level" variable is set to "Intermediate_Trade"
sub_saharan_africa_hdi %>%
filter(trade_level=="Intermediate_Trade")# A tibble: 5 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Mauritius Mauritius MUS 98.0 1 Intermediate_Trade
2 Botswana Botswana BWA 88.8 1 Intermediate_Trade
3 Cape Verde Cabo Verde CPV 77.5 1 Intermediate_Trade
4 Gabon Gabon GAB 74.5 1 Intermediate_Trade
5 South Africa South Africa ZAF 56.1 0 Intermediate_Trade
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
We can also extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by “!=”. For example, an operation that is equivalent to the one immediately above for this particular dataset (where “High_Trade” and “Intermediate_Trade” are the only trade levels present) is to exclude observations for which “trade_level” is set to “High_Trade”:
# filters observations from "sub_saharan_africa_hdi" for which the "trade_level" variable is NOT set to "High_Trade"
sub_saharan_africa_hdi %>%
filter(trade_level != "High_Trade")# A tibble: 5 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Mauritius Mauritius MUS 98.0 1 Intermediate_Trade
2 Botswana Botswana BWA 88.8 1 Intermediate_Trade
3 Cape Verde Cabo Verde CPV 77.5 1 Intermediate_Trade
4 Gabon Gabon GAB 74.5 1 Intermediate_Trade
5 South Africa South Africa ZAF 56.1 0 Intermediate_Trade
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
It can also be useful to filter data using “or” conditions; we can make such statements within the filter() function using a “|”. For example, the code below creates a new “east_asia” dataset by using the filter() function to extract South-East Asia or East Asia observations from qog_copy_selection:
# uses the "filter()" function to extract observations from South-East Asia (ht_region=7) and East Asia (ht_region=6) and assigns the result to a new object named "east_asia"
east_asia<-qog_copy_selection %>%
filter(ht_region==6|ht_region==7) %>%
relocate(cname_qog, cname, ccodealp, ht_region)Now, let’s print the contents of the newly created east_asia object:
# prints contents of "east_asia"
east_asia# A tibble: 17 × 29
cname_qog cname ccodealp ht_region wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Mongolia Mongolia MNG 6 119. 1 High_Trade
2 Korea, South Korea (… KOR 6 80.2 1 Intermedia…
3 China China CHN 6 37.3 0 Low_Trade
4 Japan Japan JPN 6 36.8 0 Low_Trade
5 Taiwan Taiwan … TWN 6 NA NA <NA>
6 Korea, North Korea (… PRK 6 NA NA <NA>
7 Singapore Singapo… SGP 7 330. 1 High_Trade
8 Vietnam Viet Nam VNM 7 187. 1 High_Trade
9 Brunei Brunei … BRN 7 147. 1 High_Trade
10 Malaysia (1966-) Malaysia MYS 7 134. 1 High_Trade
11 Cambodia Cambodia KHM 7 129. 1 High_Trade
12 Thailand Thailand THA 7 117. 1 High_Trade
13 Timor-Leste Timor-L… TLS 7 99.9 1 Intermedia…
14 Philippines Philipp… PHL 7 63.5 1 Intermedia…
15 Indonesia Indones… IDN 7 40.2 0 Low_Trade
16 Myanmar Myanmar MMR 7 NA NA <NA>
17 Laos Lao Peo… LAO 7 NA NA <NA>
# ℹ 22 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, wdi_acel <dbl>, wdi_expmil <dbl>, wdi_wip <dbl>,
# undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>,
# wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>
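As an aside, when a filter involves several acceptable values of the same variable, chaining many “|” conditions can become verbose; the %in% operator offers a more compact way to write the same condition. The sketch below reproduces the East Asia filter above using %in% (it is simply an alternative formulation and produces the same set of rows):
# equivalent filter written with %in% instead of chained "|" conditions
qog_copy_selection %>%
  filter(ht_region %in% c(6, 7)) %>%
  relocate(cname_qog, cname, ccodealp, ht_region)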
4.5 Processing and Wrangling Multiple Datasets
4.5.1 Joining Data
Joining, or merging, distinct datasets into a single dataset based on a common variable that exists in both individual datasets is an essential operation in most research projects. To see how we can implement joins in R using dplyr’s join functions (e.g. left_join(), inner_join(), full_join()), let’s first pull out two of the list elements in world_bank_list and assign them to objects in the global environment. First, we’ll extract the dataset on foreign direct investment and assign it to an object named wdi_fdi:
# extracts fdi dataset from "world_bank_list" by assigned name and assigns it to a new object named "wdi_fdi"
wdi_fdi<-world_bank_list[["wdi_fdi2019"]]Then, we’ll extract the World Bank trade dataset and assign it to an object named wdi_trade:
# extracts trade dataset from "world_bank_list" by assigned name and assigns it to a new object named "wdi_trade"
wdi_trade<-world_bank_list[["wdi_trade2019"]]Then, we’ll clean up these datasets by dropping rows with NA values, and then renaming the awkwardly named “2019 [YR2019]” variable in each of the datasets to something more intuitive and descriptive:
# drop na's and rename variable in trade dataset and assign to "wdi_trade_cleaned"
wdi_trade_cleaned<-wdi_trade %>%
drop_na() %>%
rename(trade_2019=`2019 [YR2019]`)
# drop na's and rename variable in FDI dataset and assign to "wdi_fdi_cleaned"
wdi_fdi_cleaned<-wdi_fdi %>%
drop_na() %>%
rename(fdi_2019=`2019 [YR2019]`)Now, we can go ahead and use one of the join functions to merge the datasets together. There are several different join functions (e.g. left_join(), inner_join(), full_join()); you should read the documentation (e.g. ?full_join) to make sure you’re applying the correct one given the context of your research. We will use the full_join() function, which keeps all of the observations from each of the component datasets in the joined (i.e. output) dataset (even those that do not have a corresponding observation in the other dataset). Below, we call the full_join() function, and pass as arguments the two datasets we’d like to join. The specification by="Country Code" indicates that the join field (i.e. the common field that can be used to link the data frames) is the “Country Code” variable. The column containing the country codes is named “Country Code” in both data frames, which makes this especially easy; if, however, the columns containing the join variable were named differently, we could equate them within the full_join() function by passing a named vector to the “by” argument that specifies the join field from each component dataset. For example, if the country code variable were in a column named “iso3” in wdi_fdi and “Country Code” in wdi_trade, we could specify by=c("iso3"="Country Code") to let the function know that the join field in wdi_fdi is “iso3” and the join field in wdi_trade is “Country Code”. Below, we assign the product of the join to a new object named fdi_trade_join:
# join together "wdi_fdi_cleaned" and "wdi_trade_cleaned" using country code
fdi_trade_join<-full_join(wdi_fdi_cleaned, wdi_trade_cleaned, by="Country Code")Let’s see what fdi_trade_join looks like by passing it to the R Studio Viewer:
View(fdi_trade_join)Note that the trade and FDI variables now appear in the joined dataset as separate columns.
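To make the differently-named-key scenario described above concrete, the illustrative sketch below renames the country-code column in a copy of wdi_fdi_cleaned to a hypothetical name (“iso3”), and then joins it to wdi_trade_cleaned by equating the two key names inside the “by” argument; this is purely for demonstration, and yields essentially the same joined dataset as before (with the key column named “iso3”):
# illustration only: rename the key column in one dataset, then equate the two key names in the join
wdi_fdi_renamed<-wdi_fdi_cleaned %>%
  rename(iso3=`Country Code`)
fdi_trade_join_alt<-full_join(wdi_fdi_renamed, wdi_trade_cleaned,
                              by=c("iso3"="Country Code"))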
4.5.2 Appending Data
If joining datasets is fundamentally about situating disparate datasets side-by-side in a new unified dataset, appending datasets is about stacking disparate datasets with a similar structure on top of each other in a new unified dataset. In other words, joining datasets leads to a unified dataset with more columns than the original individual datasets, while appending datasets leads to a unified dataset with more rows than the original individual datasets. Depending on the type of data you’re working with, appending data can be just as important a method for bringing together disparate datasets.
We can easily append data frames into a single unified data frame using the bind_rows() function from dplyr. Below, we append wdi_fdi to wdi_trade by passing these objects as arguments to the bind_rows() function. We’ll assign the appended dataset to a new object named worldbank_trade_fdi_appended:
# Appends "wdi_fdi" to "wdi_trade" and assigns the new dataset to an object named "worldbank_trade_fdi_appended"
worldbank_trade_fdi_appended<-bind_rows(wdi_trade, wdi_fdi)To confirm that the appending operation worked as expected, please view worldbank_trade_fdi_appended in the data viewer.
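One optional refinement when appending: if you want the stacked dataset to record which source each row came from, bind_rows() can take a named list together with an .id argument that creates an identifier column. The sketch below is a variant of the append above; the “source” column name is our own choice:
# variant of the append above that adds a "source" column identifying each row's original dataset
worldbank_trade_fdi_tagged<-bind_rows(list(trade=wdi_trade, fdi=wdi_fdi),
                                      .id="source")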
4.5.3 Reshaping Data
In a “wide” dataset, each variable or measurement is stored in a separate column; on the other hand, in a “long” dataset, each measurement is stored in a single column, and there is an additional column that identifies the specific variable or category for each observation. For example, in our context, fdi_trade_join is a “wide” dataset, while worldbank_trade_fdi_appended is formatted as a long dataset.
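A small toy example may help make the distinction concrete. The invented tibbles below (purely for illustration, with made-up values) contain the same information twice: once in wide form, where each measurement has its own column, and once in long form, where a single value column is paired with a column identifying the measurement:
# toy wide dataset: one column per measurement (invented values, purely for illustration)
toy_wide<-tibble(country=c("A", "B"),
                 trade=c(80, 120),
                 fdi=c(2.5, 4.1))
# the same information in long form: a "variable" column identifies the measurement, and a single "value" column holds the data
toy_long<-tibble(country=c("A", "A", "B", "B"),
                 variable=c("trade", "fdi", "trade", "fdi"),
                 value=c(80, 2.5, 120, 4.1))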
The process of converting a wide dataset to a long one, or vice versa, is typically referred to as reshaping data. Often, one will need to bring different datasets together (either through a join or append operation), but in order to successfully implement the required procedure, it is first necessary to reshape at least one of the datasets so that they’re correctly formatted; the first step in integrating datasets, therefore, is frequently a reshaping operation that ensures that the datasets have a common structure. Below, we’ll quickly review some functions for carrying out these reshaping operations from the tidyr package, which is part of the tidyverse suite.
4.5.3.1 Long to Wide
To see how a “long to wide” operation works using tidyr, let’s imagine we want to transform worldbank_trade_fdi_appended, currently formatted as a long dataset, into a wide dataset. First, we’ll clean up worldbank_trade_fdi_appended using some familiar tidyverse functions, and assign the result to a new object named worldbank_trade_fdi_cleaned:
# cleans the dataset before reshaping
worldbank_trade_fdi_cleaned<-worldbank_trade_fdi_appended %>%
rename(economic_variables="2019 [YR2019]",
series_code="Series Code") %>%
select(-"Series Name") %>%
drop_na()Before proceeding, we should note that the column containing the data, "economic_variables", is formatted as a character variable:
# prints class of "economic_variables" column
class(worldbank_trade_fdi_cleaned$economic_variables)[1] "character"
In order for the reshaping process to work, we need to transform that column into the “numeric” class, which we do with the following:
# converts "economic_variables" to numeric
worldbank_trade_fdi_cleaned$economic_variables<-as.numeric(worldbank_trade_fdi_cleaned$economic_variables)Warning: NAs introduced by coercion
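The warning is expected here: the World Bank extracts use “..” as a placeholder for missing values, and those strings cannot be converted to numbers, so they become NA. As an aside, the same conversion can also be written in pipe-friendly form with mutate(); the sketch below would produce an identical result (and running it now is harmless, since the column is already numeric):
# equivalent, pipe-friendly way of converting the column to numeric (".." entries become NA)
worldbank_trade_fdi_cleaned<-worldbank_trade_fdi_cleaned %>%
  mutate(economic_variables=as.numeric(economic_variables))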
Now, we use tidyr’s pivot_wider() function to “pivot” the data from long to wide. Below, we’ll take worldbank_trade_fdi_cleaned and then feed that data into the pivot_wider() function using the %>% operator. Within the pivot_wider() function, the “names_from” argument specifies the current column that contains the names we would like to use as column names in the reshaped dataset, while the “values_from” argument specifies the name of the current column that contains the data that will be used to populate the columns in the reshaped dataset. We’ll assign the output to a new object named worldbank_trade_fdi_wide:
# reshapes "worldbank_trade_fdi_cleaned" from long to wide and assigns the wide dataset to an object named "worldbank_trade_fdi_wide"
worldbank_trade_fdi_wide<-worldbank_trade_fdi_cleaned %>%
tidyr::pivot_wider(names_from=series_code,
values_from=economic_variables)Let’s quickly take a look at the reshaped dataset:
# prints contents of "worldbank_trade_fdi_wide"
worldbank_trade_fdi_wide# A tibble: 6 × 4
`Country Name` `Country Code` NE.TRD.GNFS.ZS BX.KLT.DINV.WD.GD.ZS
<chr> <chr> <dbl> <dbl>
1 Afghanistan AFG NA 0.124
2 Albania ALB 76.3 7.80
3 Algeria DZA 51.8 0.804
4 American Samoa ASM 157. NA
5 Andorra AND NA NA
6 Angola AGO 57.8 -5.78
It might be nice to rename the variables to something more intuitive:
# renames columns in "worldbank_trade_fdi_wide"
worldbank_trade_fdi_wide<-worldbank_trade_fdi_wide %>%
rename(trade2019=NE.TRD.GNFS.ZS,
FDI2019=BX.KLT.DINV.WD.GD.ZS)Go ahead and inspect worldbank_trade_fdi_wide in the data viewer:
View(worldbank_trade_fdi_wide)4.5.3.2 Wide to Long
To make the conversion in the opposite direction, i.e. reshaping a dataset from wide to long, we can use the pivot_longer() function. Below, we take the worldbank_trade_fdi_wide data frame, and feed it into the pivot_longer() function using the %>% operator. Within the function, we use cols=c(FDI2019, trade2019) to specify the current columns that we’re going to collapse into a single column in the new dataset. The “names_to” argument specifies the name of the new column that will hold the names of the original “FDI2019” and “trade2019” columns, while the “values_to” argument specifies the name of the column that will actually hold the data. Let’s go ahead and run the code, and assign the output to a new object named world_bank_trade_long:
# reshapes "worldbank_trade_fdi_wide" back to long format and assigns the reshaped dataset to a new object named "world_bank_trade_long"
world_bank_trade_long<-worldbank_trade_fdi_wide %>%
pivot_longer(cols=c(FDI2019, trade2019),
names_to="economic_variable",
values_to = "2019")Let’s view world_bank_trade_long in the data Viewer to confirm that it has been successfully reshaped:
View(world_bank_trade_long)4.5.4 Automating Data Processing with Functions
When working with multiple datasets, we may need to carry out identical operations on those datasets, in which case we could save time by wrapping up the relevant code in a function and iterating over the dataset elements we’d like to transform. Let’s say, for example, that we want to clean up our World Bank datasets by deleting a column, renaming others, and removing the rows where the country code is NA. Rather than carrying out these operations on each dataset individually, let’s create a function, called worldbank_cleaning_function, that implements these changes and returns a clean dataset:
# write function to clean World Bank dataset
worldbank_cleaning_function<-function(input_dataset){
modified_dataset<-input_dataset %>%
select(-"Series Code") %>%
rename("Country"="Country Name",
"CountryCode"="Country Code",
"Series"="Series Name",
"2019"="2019 [YR2019]") %>%
drop_na(CountryCode)
return(modified_dataset)
}Let’s test the function using wdi_trade:
# passes "wdi_trade" to "worldbank_cleaning_function"
worldbank_cleaning_function(wdi_trade)# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Trade (% of GDP) ..
2 Albania ALB Trade (% of GDP) 76.2791946495763
3 Algeria DZA Trade (% of GDP) 51.8097384415762
4 American Samoa ASM Trade (% of GDP) 156.568778979907
5 Andorra AND Trade (% of GDP) ..
6 Angola AGO Trade (% of GDP) 57.8295381183036
7 Antigua and Barbuda ATG Trade (% of GDP) 137.625175755884
8 Argentina ARG Trade (% of GDP) 32.6306150458499
9 Armenia ARM Trade (% of GDP) 96.1141541288708
10 Aruba ABW Trade (% of GDP) 145.343572735289
# ℹ 256 more rows
That looked like it worked, so let’s go ahead and apply worldbank_cleaning_function to all of the data frames in world_bank_list. We’ll use the familiar map() function to do so; below, world_bank_list is the list we’re iterating over, while worldbank_cleaning_function() is the function we’re applying:
# Iteratively apply "worldbank_cleaning_function" to all of the datasets in "world_bank_list", and deposit the cleaned datasets into a new list named "world_bank_list_cleaned"
world_bank_list_cleaned<-map(.x=world_bank_list, .f=worldbank_cleaning_function)Let’s print out the list and confirm that the operation worked:
# prints contents of "world_bank_list_cleaned"
world_bank_list_cleaned$wdi_debt2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Central government debt, total (% of … ..
2 Albania ALB Central government debt, total (% of … 75.69…
3 Algeria DZA Central government debt, total (% of … ..
4 American Samoa ASM Central government debt, total (% of … ..
5 Andorra AND Central government debt, total (% of … ..
6 Angola AGO Central government debt, total (% of … ..
7 Antigua and Barbuda ATG Central government debt, total (% of … ..
8 Argentina ARG Central government debt, total (% of … ..
9 Armenia ARM Central government debt, total (% of … 50.02…
10 Aruba ABW Central government debt, total (% of … ..
# ℹ 256 more rows
$wdi_fdi2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Foreign direct investment, net inflow… 0.124…
2 Albania ALB Foreign direct investment, net inflow… 7.797…
3 Algeria DZA Foreign direct investment, net inflow… 0.804…
4 American Samoa ASM Foreign direct investment, net inflow… ..
5 Andorra AND Foreign direct investment, net inflow… ..
6 Angola AGO Foreign direct investment, net inflow… -5.78…
7 Antigua and Barbuda ATG Foreign direct investment, net inflow… 7.433…
8 Argentina ARG Foreign direct investment, net inflow… 1.485…
9 Armenia ARM Foreign direct investment, net inflow… 0.736…
10 Aruba ABW Foreign direct investment, net inflow… -2.21…
# ℹ 256 more rows
$wdi_trade2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Trade (% of GDP) ..
2 Albania ALB Trade (% of GDP) 76.2791946495763
3 Algeria DZA Trade (% of GDP) 51.8097384415762
4 American Samoa ASM Trade (% of GDP) 156.568778979907
5 Andorra AND Trade (% of GDP) ..
6 Angola AGO Trade (% of GDP) 57.8295381183036
7 Antigua and Barbuda ATG Trade (% of GDP) 137.625175755884
8 Argentina ARG Trade (% of GDP) 32.6306150458499
9 Armenia ARM Trade (% of GDP) 96.1141541288708
10 Aruba ABW Trade (% of GDP) 145.343572735289
# ℹ 256 more rows
$wdi_urban2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Urban population (% of total populati… 25.754
2 Albania ALB Urban population (% of total populati… 61.229
3 Algeria DZA Urban population (% of total populati… 73.189
4 American Samoa ASM Urban population (% of total populati… 87.147
5 Andorra AND Urban population (% of total populati… 87.984
6 Angola AGO Urban population (% of total populati… 66.177
7 Antigua and Barbuda ATG Urban population (% of total populati… 24.506
8 Argentina ARG Urban population (% of total populati… 91.991
9 Armenia ARM Urban population (% of total populati… 63.219
10 Aruba ABW Urban population (% of total populati… 43.546
# ℹ 256 more rows
As expected, each of the processed datasets is an element in world_bank_list_cleaned.
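Note that the “2019” column in each cleaned dataset is still stored as character, with “..” marking missing values. If you want numeric values across the whole list, one option, sketched below as an optional extra step, is to map a small mutate() call over world_bank_list_cleaned (the “..” entries will again be coerced to NA, with the same warning as before):
# optional extra step: converts the `2019` column to numeric in every cleaned dataset
world_bank_list_numeric<-map(world_bank_list_cleaned,
                             ~ mutate(.x, `2019`=as.numeric(`2019`)))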
4.6 Exporting Processed Data
After we have processed our datasets, it’s often necessary to get the transformed data out of the R environment. There are a variety of functions that can export data frames from R to your local disk in various file formats. Because CSV files are so flexible and commonly used, we’ll explore how to export R data frames to your local disk as CSV files; if you need to export your data frames into another file format (for example .xlsx or .dta files), you would need to look up the relevant packages and functions that enable this.
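For instance, and without going into detail, the haven package (which we loaded at the start of this lesson) provides write_dta() for Stata .dta files, and the separate writexl package (not otherwise used in this lesson, and installed separately) provides write_xlsx() for Excel files. The lines below are a rough sketch of what those calls look like, assuming the “outputs” folder exists and the relevant packages are installed:
# rough sketch: exporting a data frame to Stata (.dta) and Excel (.xlsx) formats
haven::write_dta(east_asia, "outputs/east_asia.dta")
writexl::write_xlsx(east_asia, "outputs/east_asia.xlsx")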
To write out an R data frame to disk as a CSV, we can use the write_csv() function. Below, we export the east_asia data frame we created earlier as a CSV to the “outputs” folder in our working directory. The first argument is the name of the object we want to export, while the second argument specifies the file path, name, and extension of the desired output file.
# exports "east_asia" to a local directory (i.e. the "outputs" sub-directory of our working directory)
write_csv(east_asia, "outputs/east_asia.csv")In order to export multiple data frames (for example, we may want to export a bunch of cleaned data frames out of R to use in another statistical package), we can use the walk() function and its relatives; these functions are part of the purrr package, and work similarly to the map() family of functions.
Let’s say we want to export the cleaned World Bank data frames in world_bank_list_cleaned. First, we’ll create a vector of file names for the exported files using the paste0() function. We’ll assign this vector to a new object named WB_filenames_export:
# create file names for exported World Bank files
WB_filenames_export<-paste0("outputs/", worldbank_filenames_base, "_cleaned.csv")Let’s see what this looks like:
# prints "WB_filenames_export" contents
WB_filenames_export[1] "outputs/wdi_debt2019_cleaned.csv" "outputs/wdi_fdi2019_cleaned.csv"
[3] "outputs/wdi_trade2019_cleaned.csv" "outputs/wdi_urban2019_cleaned.csv"
Now, we’ll use the walk2() function to iteratively pass the data frames (contained in world_bank_list_cleaned) and the file names (contained in WB_filenames_export) to the write_csv() function. The code below takes the first elements from world_bank_list_cleaned and WB_filenames_export, and passes them to write_csv(), which exports the first data frame with the first file name; then, it takes the second elements from world_bank_list_cleaned and WB_filenames_export, and passes them to write_csv(), which exports the second data frame with the second file name; and so on.
# exports datasets in "world_bank_list_cleaned" to "outputs" directory using filenames in "WB_filenames_export"
walk2(.x=world_bank_list_cleaned, .y=WB_filenames_export, write_csv)Check your “outputs” directory to ensure that the files have been written out as expected.
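As a closing aside, because world_bank_list_cleaned is a named list, a roughly equivalent export can also be written with purrr’s iwalk(), which supplies each list element (.x) together with its name (.y). The sketch below builds the file names inside the call rather than in a separate vector, and assumes the list names match the base file names used above:
# roughly equivalent export using iwalk(), which passes each element (.x) and its name (.y)
iwalk(world_bank_list_cleaned,
      ~ write_csv(.x, paste0("outputs/", .y, "_cleaned.csv")))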