4 Transferring, Processing, and Wrangling Data
4.1 Introduction
In the past two sessions, we presented some ideas, concepts, and tools that provide a basic foundation for working with data in R. Now that we have this basic foundation, we’ll turn in this session to a more applied exploration of some actual datasets. Our goal here is to introduce you to some useful functions that will allow you to explore and begin making sense of actual datasets in R. This lesson will provide a tour of various functions that can be particularly helpful as you get started processing and wrangling data in R, so that you can get your raw datasets ready for analysis and visualization. Among the topics we’ll cover today are:
Reading data into R
Handling missing data in R
Preparing and transforming datasets for analysis using a variety of tidyverse functions for data wrangling and processing
4.2 Preliminaries
4.2.1 Install and Load Packages
In this lesson, we’ll work with a variety of packages from the tidyverse suite, as well as the fastDummies package, which is a handy package for quickly transforming categorical variables into binary indicator variables (i.e. dummy variables). We’ll also use the haven package later in the lesson to read in a Stata file. Please go ahead and make sure that the tidyverse, fastDummies, and haven packages are installed and loaded. Recall that in general, you only need to install packages once, so assuming you completed the previous lesson, you wouldn’t need to install the tidyverse again (assuming you’re working on the same computer and haven’t made significant changes to your operating system or updated R since then). If you haven’t installed one or more of the packages, recall that you can install a package by passing its name as an argument to the install.packages() function (e.g. install.packages("fastDummies")).
Note that when you need to install more than one package, you can do so by passing a vector of package names to the install.packages() function. For example, in this case, you could use the following:
install.packages(c("tidyverse", "fastDummies", "haven"))
Remember that even if you’ve already installed the packages that you need in a previous session, you must load packages into memory each time you begin a new R session by passing the package names as arguments to the library() function. In this case:
# load packages into memory
library(tidyverse)
library(fastDummies)
library(haven)
4.2.2 Lesson Datasets
In this lesson, we’ll be working with a handful of datasets, which you should have downloaded from the Workshop’s repository page. One of the datasets is a cross-national dataset published by the Quality of Government (QoG) Institute at the University of Gothenburg. The dataset contains information on a variety of political, social, and economic variables from the early 2020s; additional documentation is available on the QoG Institute’s website. In addition, we’ll be working with several World Bank datasets downloaded from the World Development Indicators. You can download these datasets from the Workshop Repository.
4.3 Importing Datasets into R
We will begin this section by learning a few ways to read an individual dataset into memory in R, using the QoG dataset as an example. Then, we’ll explore how to read multiple external datasets into R efficiently using some basic functional programming techniques.
4.3.1 Importing Individual Datasets
Below, we’ll learn how to import individual datasets into R from various sources. We’ll also explore how to handle data stored in different file formats.
4.3.1.1 Reading in Local Files
Though it’s not strictly necessary, it’s useful to begin by setting your working directory to the location on your computer where the data is stored. The easiest way to do this is through the RStudio menu: click Session, then Set Working Directory, then Choose Directory. We can also set the working directory programmatically using the setwd() function.
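For example, a call along the following lines would set the working directory from code (the path here is just a hypothetical placeholder; substitute the folder where you saved the workshop materials on your own machine):
# sets the working directory programmatically (hypothetical path; adjust for your machine)
setwd("~/Documents/r_workshop")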
We can now pass the name of the file and its extension in quotation marks to the read_csv() function (since the data we want to load is a CSV file). We’ll assign it to an object named qog (for “Quality of Government”):
# reads in the QoG cross-national dataset by passing the file path as an argument to "read_csv" and assigns it to a new object named "qog"
qog<-read_csv("data/quality_of_government/qog_bas_cs_jan25.csv")
Rows: 194 Columns: 331
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cname_qog, cname, ccodealp
dbl (328): ccode_qog, ccodecow, ccode, ajr_settmort, atop_ally, atop_number,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
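As the message above suggests, you can suppress this column-specification printout by setting the show_col_types argument to FALSE (or by specifying the column types explicitly). For example, the following call should read in the same data without printing the message:
# reads in the QoG dataset while suppressing the column specification message
qog <- read_csv("data/quality_of_government/qog_bas_cs_jan25.csv",
                show_col_types = FALSE)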
By printing the name of the object, we can extract a preview of the dataset, along with some useful metadata:
# prints contents of "qog" object to console
qog
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
Recall, also, that we can view data frames in the R Studio data viewer by passing the name of the object to the View() function:
# views "qog" in data viewer
View(qog)
Note that to keep things tractable within this document, the table printed above does not print all of the columns in the dataset, but the full data frame should appear in the Viewer when the corresponding object name is passed to the View() function.
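If you prefer a quick overview in the console rather than the data viewer, dplyr’s glimpse() function prints a compact, transposed summary of a data frame (one line per column, showing its type and first few values); for example:
# prints a compact overview of "qog" (one line per column)
glimpse(qog)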
4.3.1.2 Reading in Data from an Online Source
The Quality of Government Institute hosts its data online, and instead of reading the data into R from a downloaded local file, we could have read the data into R directly from the online source by using the appropriate URL. For example, the code below reads in the same QoG dataset directly from its website, and assigns the data to a new object named qog_direct:
# Reads in cross-national CSV dataset directly from QoG website and assigns it to a new object named "qog_direct"
qog_direct<-read_csv("https://www.qogdata.pol.gu.se/data/qog_bas_cs_jan25.csv")
Rows: 194 Columns: 331
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cname_qog, cname, ccodealp
dbl (328): ccode_qog, ccodecow, ccode, ajr_settmort, atop_ally, atop_number,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can print qog_direct to the console to confirm that the data was successfully read in, and that its contents are identical to qog:
# prints contents of qog_direct
qog_direct
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
4.3.1.3 Reading in Data from Cloud Storage
Sometimes, the dataset you are working with won’t be available on a public website, and it may be more convenient to read the data in from a cloud storage account (e.g. Dropbox, OneDrive, or Google Drive) than to download the data to your machine and read in a local file.
The specific steps required to read in a dataset from cloud storage depend on your provider, as well as the nature of the account (for example, additional steps or functions may be required to read in password protected or non-public files), and you may need to do a bit of research to determine the specific steps required in your case.
In general, though, the process tends to be fairly simple. For example, we’ve placed a copy of the QoG dataset on a publicly shared Dropbox page. In order to read this data from Dropbox directly into R, we simply change the “dl=0” at the end of the Dropbox URL to “dl=1”, and pass this modified URL to the read_csv() function. Below, we’ll read in the dataset from Dropbox, and assign it to a new object named qog_cloud:
# reads dataset into R from Dropbox and assigns it to a new object named "qog_cloud"
qog_cloud<-read_csv("https://www.dropbox.com/scl/fi/xxd5otw869auq56fs4c9k/qog_bas_cs_jan25.csv?rlkey=8thev7gb5u1ffbtmhs2tutxxp&e=1&st=folhfq67&dl=1")
Rows: 194 Columns: 331
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cname_qog, cname, ccodealp
dbl (328): ccode_qog, ccodecow, ccode, ajr_settmort, atop_ally, atop_number,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can print out the contents of qog_cloud to the console to ensure that the data has been read in correctly:
# prints contents of "qog_cloud" to the console
qog_cloud
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
4.3.1.4 Reading in data saved in different file formats
The QoG dataset that we read in from different sources in the subsections above was saved as a CSV file; though this is a widely-used and flexible file format, it’s likely that you’ll also have to read in datasets in file formats other than CSV (for example, .xlsx, or Stata, SPSS, or SAS files). There are useful tidyverse packages that can help with importing datasets stored in a variety of such file formats. For example, the readxl package offers handy functions for reading in Excel files (i.e. .xls and .xlsx files), while the haven package provides functions to read in Stata, SPSS, or SAS files.
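For instance, reading in an Excel version of a dataset might look something like the sketch below (the file name here is hypothetical); note that readxl is installed along with the tidyverse but is not attached by library(tidyverse), so it needs to be loaded separately:
# loads readxl and reads a hypothetical Excel file into a tibble
library(readxl)
example_data <- read_excel("data/example_dataset.xlsx", sheet = 1)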
To get a sense of how to read in a non-CSV file, let’s quickly explore how to use the haven package’s read_dta() function to read a Stata file into R as a data frame. As part of the data package you downloaded for this workshop, there is a Stata version of the QoG dataset (“qog_bas_cs_jan25.dta”). Below, we’ll read this version of the dataset into R from our local directory by passing the file path and file name as an argument to the read_dta() function and assigning it to a new object named qog_stata:
# reads in stata version of QoG crossnational dataset from local drive using haven's "read_dta" function and assigns the data to a new object named "qog_stata"
qog_stata <- read_dta("data/quality_of_government/qog_bas_cs_jan25.dta")
Now, let’s print the contents of qog_stata to the console, and confirm that the Stata dataset was successfully read in as an R data frame/tibble:
# prints contents of "qog_stata"
qog_stata
# A tibble: 194 × 331
cname_qog cname ccode_qog ccodecow ccodealp ccode ajr_settmort atop_ally
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… 4 700 AFG 4 4.54 1
2 Albania Alba… 8 339 ALB 8 NA 1
3 Algeria Alge… 12 615 DZA 12 4.36 1
4 Andorra Ando… 20 232 AND 20 NA 1
5 Angola Ango… 24 540 AGO 24 5.63 1
6 Antigua and B… Anti… 28 58 ATG 28 NA 1
7 Azerbaijan Azer… 31 373 AZE 31 NA 1
8 Argentina Arge… 32 160 ARG 32 4.23 1
9 Australia Aust… 36 900 AUS 36 2.15 1
10 Austria Aust… 40 305 AUT 40 NA 1
# ℹ 184 more rows
# ℹ 323 more variables: atop_number <dbl>, bci_bci <dbl>, bicc_gmi <dbl>,
# bmr_dem <dbl>, bmr_demdur <dbl>, bti_aar <dbl>, bti_acp <dbl>,
# bti_aod <dbl>, bti_cdi <dbl>, bti_ci <dbl>, bti_cps <dbl>, bti_cr <dbl>,
# bti_ds <dbl>, bti_eo <dbl>, bti_eos <dbl>, bti_ep <dbl>, bti_ffe <dbl>,
# bti_foe <dbl>, bti_ij <dbl>, bti_mes <dbl>, bti_muf <dbl>, bti_pdi <dbl>,
# bti_pp <dbl>, bti_prp <dbl>, bti_ps <dbl>, bti_rol <dbl>, bti_sdi <dbl>, …
4.3.2 Importing Multiple Datasets
Sometimes, you may be working with more than one dataset, in which case it could make sense to iteratively load multiple datasets into memory. In such cases, it is typically useful to read the datasets directly into a list.
Within the “world_bank” subdirectory in the “data” directory, there are four CSV files downloaded from the World Bank’s development indicators site:
wdi_debt2019.csv (country level World Bank data on debt as a share of GDP in 2019)
wdi_fdi2019.csv (country level World Bank data on foreign direct investment as a share of GDP in 2019)
wdi_trade2019.csv (country level World Bank data on trade as a share of GDP in 2019)
wdi_urban2019.csv (country level World Bank data on the urban population as a share of the overall population in 2019).
The first step we must take to iteratively read these files into a list is to make a character vector of the file names we want to read in. The code below uses the list.files() function to extract the file names of the files in the “data/world_bank” directory to a character vector, which we’ll assign to an object named “worldbank_filenames”:
# prints the names of the files we want to read in and assigns the vector of strings to a new object named "worldbank_filenames"
worldbank_filenames<-list.files("data/world_bank")
Let’s confirm that the file names have been extracted correctly:
# prints "worldbank_filenames"
worldbank_filenames
[1] "wdi_debt2019.csv" "wdi_fdi2019.csv" "wdi_trade2019.csv"
[4] "wdi_urban2019.csv"
Now, we’ll use the map() function to iteratively pass the file names in the worldbank_filenames vector to the read_csv() function, and deposit the imported files into a list named world_bank_list:
# iteratively passes file names in "worldbank_filenames" to the "read_csv" function, and deposits imported world bank files into a list that is assigned to an object named "world_bank_list"; assumes the working directory is the one with the world bank files
setwd("data/world_bank")
world_bank_list <- map(worldbank_filenames, read_csv)
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country Name, Country Code, Series Name, Series Code, 2019 [YR2019]
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now, let’s go ahead and print the contents of world_bank_list:
# prints contents of "world_bank_list"
world_bank_list
[[1]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Central gove… GC.DOD.TOTL.… ..
2 Albania ALB Central gove… GC.DOD.TOTL.… 75.69848824949…
3 Algeria DZA Central gove… GC.DOD.TOTL.… ..
4 American Samoa ASM Central gove… GC.DOD.TOTL.… ..
5 Andorra AND Central gove… GC.DOD.TOTL.… ..
6 Angola AGO Central gove… GC.DOD.TOTL.… ..
7 Antigua and Barbu… ATG Central gove… GC.DOD.TOTL.… ..
8 Argentina ARG Central gove… GC.DOD.TOTL.… ..
9 Armenia ARM Central gove… GC.DOD.TOTL.… 50.02842068637…
10 Aruba ABW Central gove… GC.DOD.TOTL.… ..
# ℹ 261 more rows
[[2]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Foreign dire… BX.KLT.DINV.… 0.124495985791…
2 Albania ALB Foreign dire… BX.KLT.DINV.… 7.797920483865…
3 Algeria DZA Foreign dire… BX.KLT.DINV.… 0.804144058246…
4 American Samoa ASM Foreign dire… BX.KLT.DINV.… ..
5 Andorra AND Foreign dire… BX.KLT.DINV.… ..
6 Angola AGO Foreign dire… BX.KLT.DINV.… -5.78081314444…
7 Antigua and Barbu… ATG Foreign dire… BX.KLT.DINV.… 7.433324076307…
8 Argentina ARG Foreign dire… BX.KLT.DINV.… 1.485006875706…
9 Armenia ARM Foreign dire… BX.KLT.DINV.… 0.736361516844…
10 Aruba ABW Foreign dire… BX.KLT.DINV.… -2.21528256776…
# ℹ 261 more rows
[[3]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Trade (% of … NE.TRD.GNFS.… ..
2 Albania ALB Trade (% of … NE.TRD.GNFS.… 76.27919464957…
3 Algeria DZA Trade (% of … NE.TRD.GNFS.… 51.80973844157…
4 American Samoa ASM Trade (% of … NE.TRD.GNFS.… 156.5687789799…
5 Andorra AND Trade (% of … NE.TRD.GNFS.… ..
6 Angola AGO Trade (% of … NE.TRD.GNFS.… 57.82953811830…
7 Antigua and Barbu… ATG Trade (% of … NE.TRD.GNFS.… 137.6251757558…
8 Argentina ARG Trade (% of … NE.TRD.GNFS.… 32.63061504584…
9 Armenia ARM Trade (% of … NE.TRD.GNFS.… 96.11415412887…
10 Aruba ABW Trade (% of … NE.TRD.GNFS.… 145.3435727352…
# ℹ 261 more rows
[[4]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Urban popula… SP.URB.TOTL.… 25.754
2 Albania ALB Urban popula… SP.URB.TOTL.… 61.229
3 Algeria DZA Urban popula… SP.URB.TOTL.… 73.189
4 American Samoa ASM Urban popula… SP.URB.TOTL.… 87.147
5 Andorra AND Urban popula… SP.URB.TOTL.… 87.984
6 Angola AGO Urban popula… SP.URB.TOTL.… 66.177
7 Antigua and Barbu… ATG Urban popula… SP.URB.TOTL.… 24.506
8 Argentina ARG Urban popula… SP.URB.TOTL.… 91.991
9 Armenia ARM Urban popula… SP.URB.TOTL.… 63.219
10 Aruba ABW Urban popula… SP.URB.TOTL.… 43.546
# ℹ 261 more rows
It could be useful to label the list elements of world_bank_list. For labels, it would make sense to use the file names in worldbank_filenames, without the “.csv” extension. Below, we use the str_remove() function to remove the “.csv” extension from the file names in worldbank_filenames and assign the result to a new object named worldbank_filenames_base:
# removes CSV extension from "worldbank_filenames"
worldbank_filenames_base <- str_remove(worldbank_filenames, ".csv")
Now, let’s use the names() function to assign the labels in worldbank_filenames_base to the elements in world_bank_list:
# assigns names to datasets in "world_bank_list"
names(world_bank_list) <- worldbank_filenames_base
Now that the names are assigned, we can extract list elements by their labels. Below, for example, we extract the FDI dataset from world_bank_list using its label:
# extracts fdi dataset from "world_bank_list" by assigned label
world_bank_list[["wdi_fdi2019"]]
# A tibble: 271 × 5
`Country Name` `Country Code` `Series Name` `Series Code` `2019 [YR2019]`
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan AFG Foreign dire… BX.KLT.DINV.… 0.124495985791…
2 Albania ALB Foreign dire… BX.KLT.DINV.… 7.797920483865…
3 Algeria DZA Foreign dire… BX.KLT.DINV.… 0.804144058246…
4 American Samoa ASM Foreign dire… BX.KLT.DINV.… ..
5 Andorra AND Foreign dire… BX.KLT.DINV.… ..
6 Angola AGO Foreign dire… BX.KLT.DINV.… -5.78081314444…
7 Antigua and Barbu… ATG Foreign dire… BX.KLT.DINV.… 7.433324076307…
8 Argentina ARG Foreign dire… BX.KLT.DINV.… 1.485006875706…
9 Armenia ARM Foreign dire… BX.KLT.DINV.… 0.736361516844…
10 Aruba ABW Foreign dire… BX.KLT.DINV.… -2.21528256776…
# ℹ 261 more rows
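As an aside, it’s also possible to read in and label the files in a single pipeline, without changing the working directory. The sketch below (not run above) uses the full.names argument of list.files() to keep the directory path attached to each file name, and purrr’s set_names() function to carry the cleaned file names along as list labels:
# alternative sketch: reads the World Bank CSVs into a named list in one pipeline
wb_paths <- list.files("data/world_bank", full.names = TRUE)
world_bank_list_alt <- wb_paths %>%
  set_names(str_remove(basename(wb_paths), ".csv")) %>%
  map(read_csv)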
4.4 Processing and Wrangling a Single Dataset
Once we’ve read in our dataset(s) of interest, we typically need to carry out a variety of processing and wrangling tasks to prepare the data for analysis and visualization. In this section, we’ll consider a number of useful functions (most of them from tidyverse packages such as dplyr) that can help with these data preparation tasks.
We’ll start by making a copy of the qog object, by assigning it to a new object named qog_copy; we’ll work with qog_copy instead of the original qog object, to ensure that we can always revert to the original data when needed. Keeping a “clean” version of the dataset of interest, and carrying out data processing and analysis tasks on a copy of this dataset, is good data management practice.
# makes a copy of "qog", called "qog_copy" that we can use for processing; keeps the original data frame, "qog" untouched
qog_copy<-qog
4.4.1 Introducing the %>% (“pipe”) operator
One of the most useful features of the tidyverse is the %>% operator (pronounced “pipe”) which helps chain together different functions to form a clear and explicit data processing and analysis pipeline. To illustrate how a pipe works, let’s first consider a simple data processing operation that does not use a pipe. In particular, we will use the select() function from the dplyr package to select a few columns from qog_copy; this dataset has over 300 variables, so it would make sense to create a more tractable dataset that extracts the specific columns/variables we’re interested in. Let’s say, for example, that we want to select the “cname_qog”, “cname”, “ccodealp”, “undp_hdi”, and “wdi_expedu” variables from qog_copy. We can do so by passing the name of the data object containing the columns we want to select, along with the names of the desired columns, as arguments to the select() function. Below, we’ll select these columns from qog_copy and assign the modified data frame to a new object named qog_copy_select_initial:
# selects columns/variables from "qog_copy" and assigns the
# modified data frame to a new object named "qog_copy_select_initial"
qog_copy_select_initial <- select(qog_copy, cname_qog, cname, ccodealp, undp_hdi, wdi_expedu)
Let’s view qog_copy_select_initial in the data viewer:
# Views "qog_copy_select_initial" in the data viewer
View(qog_copy_select_initial)
Now that we’ve seen how to use the select() function with traditional syntax, let’s see how we can carry out the same operation with a pipe operator. In particular, the piping syntax looks something like this:
# selects columns/variables from "qog_copy" using the
# pipe syntax and assigns the modified data frame
# to a new object named "qog_copy_select_pipe"
qog_copy_select_pipe <-
qog_copy %>%
select(cname_qog, cname, ccodealp, undp_hdi, wdi_expedu)
Note that the pipe operator %>% comes immediately after qog_copy and immediately before we call the select() function. In essence, the pipe operator takes the contents to its left, and uses these contents as an input to the code on its right. Above, the pipe takes the contents of qog_copy on its left, feeds this data into the select() function on the right, and returns a modified data frame, which is assigned to qog_copy_select_pipe. We can pass qog_copy_select_pipe into the data viewer to confirm that the operation worked as expected:
# views "qog_copy_select_pipe" in data viewer
View(qog_copy_select_pipe)
We can already see that the pipe makes our code slightly more readable even when performing a simple operation, but the pipe’s usefulness for writing concise and readable code becomes even more apparent when performing more complex operations that involve several different data processing functions.
For example, let’s say that in addition to selecting the columns above, we also want to subset the dataset to include only countries for which the “undp_hdi” variable is higher than 0.8. We can subset datasets based on such conditions using the filter() function. Below, we’ll take qog_copy_select_initial and subset this dataset to meet the undp_hdi>0.8 condition, and assign the final version of our processed dataset to a new object named qog_final_processed:
# subsets "qog_copy_select_initial" using the "filter()" function to include only observations with undp_hdi>0.8, and deposits the modified dataset into a new object named "qog_final_processed"
qog_final_processed <- filter(qog_copy_select_initial, undp_hdi>0.8)
Now, let’s view qog_final_processed in the data viewer and confirm that it only includes the selected variables and is appropriately subsetted according to the specified condition:
View(qog_final_processed)
Notice that using conventional notation (i.e. notation without a pipe), we arrived at our final processed dataset in two steps; first, we had to select columns using the select() function, and then we had to take the resulting object and call the filter() function to subset the data based on the “undp_hdi” variable. This process can be a little clunky, and the pipe operator helps to streamline it, making for more efficient and readable code. In short, when we use pipe (rather than conventional) notation, we can get from qog_copy to the final processed dataset with just the following:
qog_copy %>%
select(cname_qog, cname, ccodealp, undp_hdi, wdi_expedu) %>%
filter(undp_hdi>0.8)
# A tibble: 65 × 5
cname_qog cname ccodealp undp_hdi wdi_expedu
<chr> <chr> <chr> <dbl> <dbl>
1 Andorra Andorra AND 0.855 2.58
2 Antigua and Barbuda Antigua and Barbuda ATG 0.819 2.57
3 Argentina Argentina ARG 0.844 4.64
4 Australia Australia AUS 0.949 5.33
5 Austria Austria AUT 0.92 5.49
6 Bahrain Bahrain BHR 0.884 2.22
7 Barbados Barbados BRB 0.803 4.86
8 Belgium Belgium BEL 0.938 6.36
9 Brunei Brunei Darussalam BRN 0.824 NA
10 Belarus Belarus BLR 0.801 4.61
# ℹ 55 more rows
This code takes the qog_copy dataset, and then feeds it into the select() function; the output of the select() function is then fed into the filter() function, which then returns a final dataset equivalent to qog_final_processed. Given the readability and conciseness of pipe notation, we will use it extensively in our exploration of data processing and wrangling.
Below, we’ll explore the select() and filter() functions (along with other data processing functions) at greater length; our main purpose in this sub-section was to motivate the utility of the %>% operator in helping to create complex data processing pipelines with intuitive and readable code.
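As a brief aside, if you’re working in a relatively recent version of R (4.1 or later), base R also provides a native pipe operator, |>, which behaves much like %>% for the straightforward pipelines used in this lesson; for example, the select-and-filter pipeline above could also be written as follows:
# the same pipeline written with base R's native pipe (requires R >= 4.1)
qog_copy |>
  select(cname_qog, cname, ccodealp, undp_hdi, wdi_expedu) |>
  filter(undp_hdi > 0.8)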
4.4.2 Selecting and Deleting Variables
In this section, we’ll explore the select() function at greater length, and use it to select some key variables of interest from qog_copy; this will allow us to subsequently work with a more tractable dataset.
In particular, let’s select the following variables from qog_copy (additional information about the variables is available in the Quality of Government Basic Dataset Codebook):
“cname_qog”: Country name as standardized by QoG
“cname”: Country name based on the ISO standard
“ccodealp”: Three-letter ISO country code
“undp_hdi”: Human development index
“wdi_expedu”: Government expenditure on education as a percentage of GDP
“wdi_acel”: Percentage of population with access to electricity
“wdi_area”: Land area in sq km
“wdi_taxrev”: Tax revenue as a percentage of GDP
“wdi_expmil”: Military expenditure as a percentage of GDP
“wdi_fdiin”: Foreign direct investment (FDI) inflows as a share of GDP
“wdi_trade”: Foreign trade as a percentage of GDP
“cbie_index”: Central bank independence index
“ht_region”: World region of the country
“wbgi_rle”: Rule of law index
“bmr_dem”: Dichotomous democracy measure
“atop_ally”: Member of a military alliance
“gol_est”: Electoral system (majoritarian/proportional/mixed)
“mad_gdppc”: Real GDP per capita in 2018 (2011 dollars)
“mad_gdppc1900”: Real GDP per capita in 1900 (2011 dollars)
“bci_bci”: Bayesian Corruption Indicator
“lis_gini”: Gini coefficient
“top_top1_income_share”: Income share of the population’s top 1%
“wdi_wip”: Percentage of lower house or single house parliamentary seats held by women
Below, we select these variables by passing qog_copy to the select() function via a %>%, and then specify the columns we want to select as arguments to the select() function; we assign the resulting selection to a new object named qog_copy_selection:
# selects specific variables from "qog_copy" and assigns the selection to a new object named "qog_copy_selection"
qog_copy_selection <- qog_copy %>%
select(cname_qog,
cname,
ccodealp,
undp_hdi,
wdi_expedu,
wdi_acel,
wdi_area,
wdi_taxrev,
wdi_expmil,
wdi_fdiin,
wdi_trade,
cbie_index,
ht_region,
wbgi_rle,
bmr_dem,
atop_ally,
gol_est,
mad_gdppc,
mad_gdppc1900,
bci_bci,
lis_gini,
top_top1_income_share,
wdi_wip)
When we view qog_copy_selection in the RStudio data viewer, it looks something like this:
# Views "qog_copy_selection" in the data viewer
View(qog_copy_selection)
Sometimes, it could make more sense to directly delete columns, instead of deciding which ones to keep or select. For example, the code below deletes the “cname” variable from qog_copy_selection by passing it to the select() function with a “-” in front of it:
# removes "cname" column from "qog_copy_selection"
qog_copy_selection %>% select(-cname)
# A tibble: 194 × 22
cname_qog ccodealp undp_hdi wdi_expedu wdi_acel wdi_area wdi_taxrev
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AFG 0.473 NA 97.7 652230 NA
2 Albania ALB 0.785 3.02 100 27400 18.2
3 Algeria DZA 0.74 5.51 99.8 2381741 NA
4 Andorra AND 0.855 2.58 100 470 NA
5 Angola AGO 0.59 2.30 48.2 1246700 10.1
6 Antigua and Barbuda ATG 0.819 2.57 100 440 NA
7 Azerbaijan AZE 0.738 3.70 100 82650 13.4
8 Argentina ARG 0.844 4.64 100 2736690 8.79
9 Australia AUS 0.949 5.33 100 7692020 23.0
10 Austria AUT 0.92 5.49 100 82520 25.9
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expmil <dbl>, wdi_fdiin <dbl>, wdi_trade <dbl>,
# cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>, wdi_wip <dbl>
Note that the “cname” column is not permanently deleted, since we didn’t assign the result back to qog_copy_selection (i.e. with qog_copy_selection <- qog_copy_selection %>% select(-cname)). We can confirm this by reprinting qog_copy_selection and noting that “cname” is still in the data frame:
# prints contents of "qog_copy_selection"
qog_copy_selection# A tibble: 194 × 23
cname_qog cname ccodealp undp_hdi wdi_expedu wdi_acel wdi_area wdi_taxrev
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Afgh… AFG 0.473 NA 97.7 652230 NA
2 Albania Alba… ALB 0.785 3.02 100 27400 18.2
3 Algeria Alge… DZA 0.74 5.51 99.8 2381741 NA
4 Andorra Ando… AND 0.855 2.58 100 470 NA
5 Angola Ango… AGO 0.59 2.30 48.2 1246700 10.1
6 Antigua and … Anti… ATG 0.819 2.57 100 440 NA
7 Azerbaijan Azer… AZE 0.738 3.70 100 82650 13.4
8 Argentina Arge… ARG 0.844 4.64 100 2736690 8.79
9 Australia Aust… AUS 0.949 5.33 100 7692020 23.0
10 Austria Aust… AUT 0.92 5.49 100 82520 25.9
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expmil <dbl>, wdi_fdiin <dbl>, wdi_trade <dbl>,
# cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>, wdi_wip <dbl>
If we want to delete multiple columns, we can pass a vector containing the names of the columns to be deleted to the select() function, preceded by a minus sign. For example, if we want to delete “cname”, “undp_hdi”, and “wdi_acel” from qog_copy_selection we can do so with the following:
# deletes "cname", "undp_hdi", and "wdi_acel" from "qog_copy_selection"
qog_copy_selection %>% select(-c(cname, undp_hdi, wdi_acel))
# A tibble: 194 × 20
cname_qog ccodealp wdi_expedu wdi_area wdi_taxrev wdi_expmil wdi_fdiin
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AFG NA 652230 NA 1.83 0.144
2 Albania ALB 3.02 27400 18.2 1.22 6.80
3 Algeria DZA 5.51 2381741 NA 5.59 0.467
4 Andorra AND 2.58 470 NA NA NA
5 Angola AGO 2.30 1246700 10.1 1.29 -6.55
6 Antigua and Bar… ATG 2.57 440 NA NA 18.1
7 Azerbaijan AZE 3.70 82650 13.4 5.27 -3.11
8 Argentina ARG 4.64 2736690 8.79 0.631 1.36
9 Australia AUS 5.33 7692020 23.0 1.99 1.78
10 Austria AUT 5.49 82520 25.9 0.874 3.71
# ℹ 184 more rows
# ℹ 13 more variables: wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>,
# wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>,
# mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>, wdi_wip <dbl>
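Typing out long lists of column names can get tedious. The select() function also accepts tidy-select helper functions such as starts_with(), ends_with(), and contains(); for example, the sketch below (not run above) keeps the country identifiers together with every variable whose name begins with “wdi_”:
# selects the country identifiers plus all columns whose names start with "wdi_"
qog_copy %>%
  select(cname_qog, cname, ccodealp, starts_with("wdi_"))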
4.4.3 Rearranging Columns
We can change the order of the columns in a dataset using the relocate() function. For example, the code below uses the relocate() function to shift the “ccodealp” column to the front of the qog_copy_selection data frame:
# moves "ccodealp" to front of "qog_copy_selection" dataset
qog_copy_selection %>% relocate(ccodealp)
# A tibble: 194 × 23
ccodealp cname_qog cname undp_hdi wdi_expedu wdi_acel wdi_area wdi_taxrev
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AFG Afghanistan Afgh… 0.473 NA 97.7 652230 NA
2 ALB Albania Alba… 0.785 3.02 100 27400 18.2
3 DZA Algeria Alge… 0.74 5.51 99.8 2381741 NA
4 AND Andorra Ando… 0.855 2.58 100 470 NA
5 AGO Angola Ango… 0.59 2.30 48.2 1246700 10.1
6 ATG Antigua and … Anti… 0.819 2.57 100 440 NA
7 AZE Azerbaijan Azer… 0.738 3.70 100 82650 13.4
8 ARG Argentina Arge… 0.844 4.64 100 2736690 8.79
9 AUS Australia Aust… 0.949 5.33 100 7692020 23.0
10 AUT Austria Aust… 0.92 5.49 100 82520 25.9
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expmil <dbl>, wdi_fdiin <dbl>, wdi_trade <dbl>,
# cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>, wdi_wip <dbl>
We can specify more than one argument to the relocate function. For example, in the code below, passing the “ccodealp”, “wdi_acel”, “wdi_expmil”, and “wdi_wip” variables/columns to the relocate() function will make “ccodealp” the first column, “wdi_acel” the second column, “wdi_expmil” the third column, and “wdi_wip” the fourth column. Below, we specify this order, and assign the result back to qog_copy_selection:
# sets the order for the first four columns of "qog_copy_selection" and assigns the result back to "qog_copy_selection"
qog_copy_selection <- qog_copy_selection %>%
relocate(ccodealp, wdi_acel, wdi_expmil, wdi_wip)
Now, let’s print the contents of “qog_copy_selection” to the console and confirm the change has been made:
# prints contents of updated "qog_copy_selection" object
qog_copy_selection
# A tibble: 194 × 23
ccodealp wdi_acel wdi_expmil wdi_wip cname_qog cname undp_hdi wdi_expedu
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 AFG 97.7 1.83 27.0 Afghanistan Afgh… 0.473 NA
2 ALB 100 1.22 35.7 Albania Alba… 0.785 3.02
3 DZA 99.8 5.59 8.11 Algeria Alge… 0.74 5.51
4 AND 100 NA 46.4 Andorra Ando… 0.855 2.58
5 AGO 48.2 1.29 29.5 Angola Ango… 0.59 2.30
6 ATG 100 NA 11.1 Antigua and B… Anti… 0.819 2.57
7 AZE 100 5.27 18.2 Azerbaijan Azer… 0.738 3.70
8 ARG 100 0.631 44.7 Argentina Arge… 0.844 4.64
9 AUS 100 1.99 31.1 Australia Aust… 0.949 5.33
10 AUT 100 0.874 40.4 Austria Aust… 0.92 5.49
# ℹ 184 more rows
# ℹ 15 more variables: wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
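The relocate() function also has .before and .after arguments, which position columns relative to another column rather than moving them to the front of the data frame. For example, the sketch below (not run above) would place “wdi_trade” immediately after “ccodealp”:
# moves "wdi_trade" so that it sits immediately after "ccodealp"
qog_copy_selection %>%
  relocate(wdi_trade, .after = ccodealp)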
4.4.4 Renaming Variables
In order to rename variables, we can use the rename() function; the argument to this function takes the form new_name=old_name. The code below renames the existing “ccodealp” variable in the qog_copy_selection data frame to “iso3”, using the %>% operator to pass qog_copy_selection to the rename() function:
# renames "ccodealp" variable in "qog_copy_selection" to "iso3"
qog_copy_selection %>%
rename(iso3=ccodealp)
# A tibble: 194 × 23
iso3 wdi_acel wdi_expmil wdi_wip cname_qog cname undp_hdi wdi_expedu
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 AFG 97.7 1.83 27.0 Afghanistan Afgh… 0.473 NA
2 ALB 100 1.22 35.7 Albania Alba… 0.785 3.02
3 DZA 99.8 5.59 8.11 Algeria Alge… 0.74 5.51
4 AND 100 NA 46.4 Andorra Ando… 0.855 2.58
5 AGO 48.2 1.29 29.5 Angola Ango… 0.59 2.30
6 ATG 100 NA 11.1 Antigua and Barb… Anti… 0.819 2.57
7 AZE 100 5.27 18.2 Azerbaijan Azer… 0.738 3.70
8 ARG 100 0.631 44.7 Argentina Arge… 0.844 4.64
9 AUS 100 1.99 31.1 Australia Aust… 0.949 5.33
10 AUT 100 0.874 40.4 Austria Aust… 0.92 5.49
# ℹ 184 more rows
# ℹ 15 more variables: wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
Note that the “ccodealp” variable was renamed to “iso3”. If we want to change the name of more than one variable, the rename() function can take multiple arguments, separated by commas. For example, the code below changes “undp_hdi” to “hdi” and “wdi_area” to “wdi_area_sqkm”:
# renames "undp_hdi" variable to "hdi", and "wdi_area" to "wdi_area_sqkm" in "qog_copy_selection" data frame
qog_copy_selection %>%
rename(hdi=undp_hdi,
wdi_area_sqkm=wdi_area)
# A tibble: 194 × 23
ccodealp wdi_acel wdi_expmil wdi_wip cname_qog cname hdi wdi_expedu
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 AFG 97.7 1.83 27.0 Afghanistan Afgh… 0.473 NA
2 ALB 100 1.22 35.7 Albania Alba… 0.785 3.02
3 DZA 99.8 5.59 8.11 Algeria Alge… 0.74 5.51
4 AND 100 NA 46.4 Andorra Ando… 0.855 2.58
5 AGO 48.2 1.29 29.5 Angola Ango… 0.59 2.30
6 ATG 100 NA 11.1 Antigua and Barb… Anti… 0.819 2.57
7 AZE 100 5.27 18.2 Azerbaijan Azer… 0.738 3.70
8 ARG 100 0.631 44.7 Argentina Arge… 0.844 4.64
9 AUS 100 1.99 31.1 Australia Aust… 0.949 5.33
10 AUT 100 0.874 40.4 Austria Aust… 0.92 5.49
# ℹ 184 more rows
# ℹ 15 more variables: wdi_area_sqkm <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# wdi_trade <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
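If you ever need to rename many columns at once according to a rule (rather than one at a time), dplyr’s rename_with() function applies a renaming function to the column names; as a quick sketch (not run above), the following would convert all of the column names in qog_copy_selection to upper case:
# renames all columns by applying toupper() to the existing column names
qog_copy_selection %>%
  rename_with(toupper)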
4.4.5 Sorting data
It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the qog_copy_selection data frame in ascending order with respect to the “wdi_trade” variable using the arrange() function, and then uses the relocate() function to move “wdi_trade” towards the front of the dataset, right after the country name and country code variables:
# sorts "qog_copy_selection" data frame in ascending (low to high) order with respect to the "wdi_trade" variable, and then brings the "wdi_trade" variable
qog_copy_selection %>%
arrange(wdi_trade) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade)
# A tibble: 194 × 23
cname_qog cname ccodealp wdi_trade wdi_acel wdi_expmil wdi_wip undp_hdi
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ethiopia (1993… Ethi… ETH 24.3 54.2 0.492 42.6 0.489
2 United States Unit… USA 25.3 100 3.46 27.6 0.921
3 Pakistan (1971… Paki… PAK 27.0 94.9 2.87 20.2 0.537
4 Bangladesh Bang… BGD 27.7 99 1.20 20.9 0.662
5 Burundi Buru… BDI 28.8 10.2 2.04 38.2 0.419
6 Egypt Egypt EGY 29.9 100 1.12 27.7 0.726
7 Tanzania Tanz… TZA 29.9 42.7 1.08 36.9 0.529
8 Kenya Kenya KEN 30.7 76.5 1.07 21.6 0.596
9 Argentina Arge… ARG 32.9 100 0.631 44.7 0.844
10 Turkmenistan Turk… TKM 33.1 100 NA 25 0.74
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>,
# wdi_fdiin <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
If, instead, we want to sort the dataset in descending order with respect to the “wdi_trade” variable, we can pass the name of the variable to the desc() function within the arrange() function, as below:
# sorts "qog_copy_selection" data frame in descending (high to low) order with respect to the "wdi_trade" variable, and then brings the "wdi_trade" variable
qog_copy_selection %>%
arrange(desc(wdi_trade)) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade)
# A tibble: 194 × 23
cname_qog cname ccodealp wdi_trade wdi_acel wdi_expmil wdi_wip undp_hdi
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Luxembourg Luxem… LUX 393. 100 0.472 35 0.927
2 San Marino San M… SMR 342. 100 NA 33.3 0.853
3 Singapore Singa… SGP 330. 100 2.78 29.8 0.942
4 Malta Malta MLT 315. 100 0.496 13.4 0.912
5 Djibouti Djibo… DJI 264. 65.4 NA 26.2 0.512
6 Ireland Irela… IRL 227. 100 0.251 22.5 0.946
7 Vietnam Viet … VNM 187. 100 2.28 30.3 0.718
8 Slovakia Slova… SVK 184. 100 1.77 22.7 0.852
9 Seychelles Seych… SYC 184. 100 2.78 22.9 0.795
10 Cyprus (1975-) Cyprus CYP 175. 100 1.91 14.3 0.901
# ℹ 184 more rows
# ℹ 15 more variables: wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>,
# wdi_fdiin <dbl>, cbie_index <dbl>, ht_region <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
Note that it’s also possible to pass several arguments to the arrange() function, and thereby sort a dataset with respect to multiple variables; for example, the code below sorts the qog_copy_selection dataset by region (“ht_region”), and then further sorts it in descending order by “wdi_trade”. It assigns the result back to qog_copy_selection to effectively “save” these changes to the object:
# arranges the "qog_copy_selection" data frame in ascending order with respect to "ht_region" and then in descending order with respect to "wdi_trade", and then relocates these variables to the front of the dataset; changes are assigned back to "qog_copy_selection" to store these changes
qog_copy_selection<-qog_copy_selection %>%
arrange(ht_region, desc(wdi_trade)) %>%
relocate(cname_qog, cname, ccodealp, ht_region, wdi_trade)
Let’s view the updated version of qog_copy_selection in the data viewer:
# Views "qog_copy_selection" in the data viewer
View(qog_copy_selection)
As we can see, sorting the data in this way allows us to quickly note the country in each regional grouping with the highest value on the “wdi_trade” variable. Sorting a dataset with respect to more than one variable can be especially useful in certain contexts, particularly when you’re dealing with nested data. For example, in a time series dataset, it can be useful to first sort by year, and then by month within each year. In a dataset with regional information, it could be helpful to first sort by region, and then by cities within those regional groupings.
4.4.6 Creating New Variables Based on Existing Variables
Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. The mutate() function from the dplyr package is particularly useful for this purpose. To see how it works, consider the code below, which uses the mutate() function to create a new variable named “mip” (which stands for “men in parliament”) that is computed by subtracting the existing “wdi_wip” (percentage of parliamentary seats held by women) variable from 100. For convenience, we’ll also move “wdi_wip” and the newly created “mip” variables to the front of the dataset using relocate():
# Creates new variable named "mip" (percentage of men in parliement) that is calculated by substracting the women's share of parliamentary sets ("wdi_wip") from 100 and relocates these variables to the front of the dataset
qog_copy_selection %>%
mutate(mip=100-wdi_wip) %>%
relocate(cname_qog, cname, ccodealp, wdi_wip, mip)
# A tibble: 194 × 24
cname_qog cname ccodealp wdi_wip mip ht_region wdi_trade wdi_acel
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Slovakia Slovakia SVK 22.7 77.3 1 184. 100
2 Estonia Estonia EST 25.7 74.3 1 162. 100
3 Slovenia Slovenia SVN 26.7 73.3 1 161. 100
4 Hungary Hungary HUN 13.1 86.9 1 160. 100
5 Lithuania Lithuania LTU 27.7 72.3 1 156. 100
6 North Macedonia North Ma… MKD 41.7 58.3 1 147. 100
7 Czech Republic Czechia CZE 25 75 1 142. 100
8 Belarus Belarus BLR 40 60 1 136. 100
9 Latvia Latvia LVA 29 71 1 132. 100
10 Bulgaria Bulgaria BGR 23.8 76.2 1 121. 99.8
# ℹ 184 more rows
# ℹ 16 more variables: wdi_expmil <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>,
# wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>,
# wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>,
# mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
Note that the arguments to the mutate() function take the form new_variable=expression, where the expression is typically based on one or more existing variables. It’s possible to define more than one new variable at a time using mutate(); simply separate the arguments with commas within the mutate() function. For example, the code below creates a new variable (“land_area_sqmiles”) based on “wdi_area” (land area in sq km), as well as a new variable (“no_electricity_access”) that gives the percentage of the population without access to electricity (based on the “wdi_acel” variable, which represents the percentage of the population with access to electricity):
# creates "land_area_sqmiles" variable based on "wdi_area" and "no_electricity_access" variable based on "wdi_acel"
qog_copy_selection %>%
mutate(land_area_sqmiles=wdi_area/2.5899,
no_electricity_access=100-wdi_acel) %>%
relocate(cname_qog,
cname,
ccodealp,
land_area_sqmiles,
wdi_area,
no_electricity_access,
wdi_acel)
# A tibble: 194 × 25
cname_qog cname ccodealp land_area_sqmiles wdi_area no_electricity_access
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Slovakia Slov… SVK 18564. 48080 0
2 Estonia Esto… EST 16506. 42750 0
3 Slovenia Slov… SVN 7775. 20136. 0
4 Hungary Hung… HUN 35237. 91260 0
5 Lithuania Lith… LTU 24175. 62610 0
6 North Macedo… Nort… MKD 9738. 25220 0
7 Czech Republ… Czec… CZE 29803. 77187. 0
8 Belarus Bela… BLR 78362. 202950 0
9 Latvia Latv… LVA 24028. 62230 0
10 Bulgaria Bulg… BGR 41917. 108560 0.200
# ℹ 184 more rows
# ℹ 19 more variables: wdi_acel <dbl>, ht_region <dbl>, wdi_trade <dbl>,
# wdi_expmil <dbl>, wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
The mutate() function is particularly handy when we want to transform a continuous numeric variable into a dichotomous variable (i.e. a “dummy” variable that takes on a value of 1 if a condition is met and 0 otherwise) or a categorical variable. For example, let’s say we want to create a new variable, named “trade_open”, that takes on the value of 1 if the “wdi_trade” variable is greater than or equal to 60, and 0 otherwise. We can generate this new dummy variable using the mutate() function. In particular, within the mutate() function below, we specify that we want to create a new variable named “trade_open”; the ifelse() function specifies the desired condition (wdi_trade>=60), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value it is to take if the condition is not met (0). In other words, we can translate ifelse(wdi_trade>=60, 1, 0) to “if wdi_trade>=60, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “qog_copy_selection”:
# Creates a new dummy variable named "trade_open" based on the existing "wdi_trade" variable (which takes on a value of 1 if "wdi_trade" is greater than or equal to 60, and 0 otherwise), and then moves the newly created variables to the front of the dataset; all changes are assigned to "qog_copy_selection", thereby overwriting the existing version of "qog_copy_selection"
qog_copy_selection<-
qog_copy_selection %>%
mutate(trade_open=ifelse(wdi_trade>=60, 1, 0)) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade, trade_open)
Let’s print the updated contents of qog_copy_selection to see that the changes have been made:
# prints updated contents of "qog_copy_selection"
qog_copy_selection
# A tibble: 194 × 24
cname_qog cname ccodealp wdi_trade trade_open ht_region wdi_acel wdi_expmil
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Slovakia Slov… SVK 184. 1 1 100 1.77
2 Estonia Esto… EST 162. 1 1 100 2.01
3 Slovenia Slov… SVN 161. 1 1 100 1.24
4 Hungary Hung… HUN 160. 1 1 100 1.68
5 Lithuania Lith… LTU 156. 1 1 100 1.97
6 North Mace… Nort… MKD 147. 1 1 100 1.47
7 Czech Repu… Czec… CZE 142. 1 1 100 1.40
8 Belarus Bela… BLR 136. 1 1 100 1.12
9 Latvia Latv… LVA 132. 1 1 100 2.07
10 Bulgaria Bulg… BGR 121. 1 1 99.8 1.52
# ℹ 184 more rows
# ℹ 16 more variables: wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>,
# wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>,
# wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>,
# mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>,
# top_top1_income_share <dbl>
It can also be useful to create categorical variables based on numeric thresholds from an existing variable. For instance, let’s say we want to take the existing “wdi_trade” variable, and define a new variable named “trade_level”, which is set to “Low_Trade” when “wdi_trade” is greater than 15 and less than 50; “Intermediate_Trade” when “wdi_trade” is greater than or equal to 50 and less than 100; and “High_Trade” when “wdi_trade” is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate() function; within mutate(), the case_when() function maps the conditions onto the desired values of “trade_level” using the following syntax:
# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low_Trade" when the "wdi_trade" variable is greater than 15 and less than 50, coded as "Intermediate_Trade" when "wdi_trade" is greater than or equal to 50 and less than 100, and coded as "High_Trade" when "wdi_trade" is greater than or equal to 100), and then reorders the dataset to move "trade_level" and "wdi_trade" to the front of the dataset; the changes are assigned back to "qog_copy_selection"
qog_copy_selection<-
qog_copy_selection %>%
mutate(trade_level=case_when(wdi_trade>15 & wdi_trade<50~"Low_Trade",
wdi_trade>=50 & wdi_trade<100~"Intermediate_Trade",
wdi_trade>=100~"High_Trade")) %>%
relocate(cname_qog, cname, ccodealp, wdi_trade, trade_level)
Let’s print the contents of the updated qog_copy_selection data frame and confirm that the changes have been made:
# prints updated contents of "qog_copy_selection"
qog_copy_selection
# A tibble: 194 × 25
cname_qog cname ccodealp wdi_trade trade_level trade_open ht_region wdi_acel
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Slovakia Slov… SVK 184. High_Trade 1 1 100
2 Estonia Esto… EST 162. High_Trade 1 1 100
3 Slovenia Slov… SVN 161. High_Trade 1 1 100
4 Hungary Hung… HUN 160. High_Trade 1 1 100
5 Lithuania Lith… LTU 156. High_Trade 1 1 100
6 North Mac… Nort… MKD 147. High_Trade 1 1 100
7 Czech Rep… Czec… CZE 142. High_Trade 1 1 100
8 Belarus Bela… BLR 136. High_Trade 1 1 100
9 Latvia Latv… LVA 132. High_Trade 1 1 100
10 Bulgaria Bulg… BGR 121. High_Trade 1 1 99.8
# ℹ 184 more rows
# ℹ 17 more variables: wdi_expmil <dbl>, wdi_wip <dbl>, undp_hdi <dbl>,
# wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>, wdi_fdiin <dbl>,
# cbie_index <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>, atop_ally <dbl>,
# gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>, bci_bci <dbl>,
# lis_gini <dbl>, top_top1_income_share <dbl>
Another useful operation is to create dummy variables based on a categorical variable. For example, consider the “trade_level” variable we just created. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols() function. Below, we simply take the existing qog_copy_selection dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols() function:
# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset; assigns changes back to "qog_copy_selection"
qog_copy_selection<-
qog_copy_selection %>%
dummy_cols("trade_level") %>%
relocate(cname_qog,
cname,
ccodealp,
trade_level,
trade_level_High_Trade,
trade_level_Intermediate_Trade,
trade_level_Low_Trade,
trade_level_NA)
You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on. If we were to incorporate these dummy variables into a regression model, we would exclude one of them (the excluded category would then serve as the reference category).
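Note that dummy_cols() also has optional arguments that can streamline this step. In particular, to the best of our knowledge, setting remove_first_dummy = TRUE drops the dummy for the first category (so the remaining dummies can go straight into a regression, with the dropped category serving as the reference), while ignore_na = TRUE skips creating the “_NA” column for missing values. An illustrative sketch, which could have been used in place of the call above, might look like this:
# sketch: creates dummies from "trade_level", dropping the first category's dummy
# and skipping the NA indicator column
qog_copy_selection %>%
  dummy_cols("trade_level", remove_first_dummy = TRUE, ignore_na = TRUE)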
Let’s view the updated version of qog_copy_selection in the data viewer to see our newly created variables. First, we’ll bring the variables of interest to the front of the data frame:
# specifies variable order
qog_copy_selection<-
qog_copy_selection %>%
relocate(cname_qog, cname, ccodealp, wdi_trade, trade_open, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade, trade_level_NA)
Let’s now view the modified data frame and note the new variables:
# Views "qog_copy_selection" in the data viewer
View(qog_copy_selection)
4.4.7 Subsetting (Filtering) Observations
We already introduced the filter() function in an earlier section, in the context of our discussion of the %>% operator. We’ll now briefly return to this function, and provide additional examples of how it can be used to subset our data frames.
Let’s say that we want to subset the qog_copy_selection data frame to extract only countries that are part of the South-East Asia region (coded as a “7” in the “ht_region” variable), and assign this South-East Asia data frame to its own object, named se_asia_data. We can do so with the following:
# uses the filter() function to extract South East Asia observations from "qog_copy_selection" and assigns the result to a new object named "se_asia_data"
se_asia_data<-qog_copy_selection %>%
filter(ht_region==7)
Let’s print out the contents of se_asia_data to see if it worked as expected:
# prints "se_asia_data" to console
se_asia_data
# A tibble: 11 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Singapore Singapore SGP 330. 1 High_Trade
2 Vietnam Viet Nam VNM 187. 1 High_Trade
3 Brunei Brunei Darussalam BRN 147. 1 High_Trade
4 Malaysia (1966-) Malaysia MYS 134. 1 High_Trade
5 Cambodia Cambodia KHM 129. 1 High_Trade
6 Thailand Thailand THA 117. 1 High_Trade
7 Timor-Leste Timor-Leste TLS 99.9 1 Intermedia…
8 Philippines Philippines (the) PHL 63.5 1 Intermedia…
9 Indonesia Indonesia IDN 40.2 0 Low_Trade
10 Myanmar Myanmar MMR NA NA <NA>
11 Laos Lao People's Demo… LAO NA NA <NA>
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
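Notice that Myanmar and Laos have missing values for “wdi_trade” and “trade_level”. If we wanted to keep only observations with non-missing trade data, we could combine filter() with is.na(); the short sketch below simply builds on the se_asia_data object we just created and drops those two rows:
# optional: keeps only the rows of "se_asia_data" where "wdi_trade" is not missing
se_asia_data %>%
  filter(!is.na(wdi_trade))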
We can also chain together multiple conditions. For example, let’s use the filter() function to extract observations for countries that are part of the “Sub-Saharan Africa” region and have a Human Development Index (“undp_hdi”) score greater than or equal to 0.65. We can do so by setting up an expression that uses the “&” logical operator:
# subsets data to include Sub-Saharan African countries with a "undp_hdi" value greater than or equal to 0.65 and assigns the result to an object named "sub_saharan_africa_hdi"
sub_saharan_africa_hdi<-qog_copy_selection %>%
filter(ht_region==4 & undp_hdi>=0.65)Let’s print out the contents of sub_saharan_africa_hdi:
# prints "sub_saharan_africa_hdi"
sub_saharan_africa_hdi# A tibble: 6 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Seychelles Seychelles SYC 184. 1 High_Trade
2 Mauritius Mauritius MUS 98.0 1 Intermediate_Trade
3 Botswana Botswana BWA 88.8 1 Intermediate_Trade
4 Cape Verde Cabo Verde CPV 77.5 1 Intermediate_Trade
5 Gabon Gabon GAB 74.5 1 Intermediate_Trade
6 South Africa South Africa ZAF 56.1 0 Intermediate_Trade
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
We can also subset datasets using non-numeric data. For example, to extract all of the observations from sub_saharan_africa_hdi for which the “trade_level” variable we created earlier is “Intermediate_Trade”, we can do the following:
# Filters observations from "sub_saharan_africa_hdi" for which the "trade_level" variable is set to "Intermediate_Trade"
sub_saharan_africa_hdi %>%
filter(trade_level=="Intermediate_Trade")# A tibble: 5 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Mauritius Mauritius MUS 98.0 1 Intermediate_Trade
2 Botswana Botswana BWA 88.8 1 Intermediate_Trade
3 Cape Verde Cabo Verde CPV 77.5 1 Intermediate_Trade
4 Gabon Gabon GAB 74.5 1 Intermediate_Trade
5 South Africa South Africa ZAF 56.1 0 Intermediate_Trade
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
We can also extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by “!=”. For example, an operation that is equivalent to the one immediately above for this particular dataset (where “High_Trade” and “Intermediate_Trade” are the only trade levels present) is to exclude observations for which “trade_level” is set to “High_Trade”:
# filters observations from "sub_saharan_africa_hdi" for which the "trade_level" variable is NOT set to "High_Trade"
sub_saharan_africa_hdi %>%
filter(trade_level != "High_Trade")# A tibble: 5 × 29
cname_qog cname ccodealp wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 Mauritius Mauritius MUS 98.0 1 Intermediate_Trade
2 Botswana Botswana BWA 88.8 1 Intermediate_Trade
3 Cape Verde Cabo Verde CPV 77.5 1 Intermediate_Trade
4 Gabon Gabon GAB 74.5 1 Intermediate_Trade
5 South Africa South Africa ZAF 56.1 0 Intermediate_Trade
# ℹ 23 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, ht_region <dbl>, wdi_acel <dbl>, wdi_expmil <dbl>,
# wdi_wip <dbl>, undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>,
# wdi_taxrev <dbl>, wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>,
# bmr_dem <dbl>, atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>,
# mad_gdppc1900 <dbl>, bci_bci <dbl>, lis_gini <dbl>, …
It can also be useful to filter data using “or” conditions; we can make such statements within the filter() function using a “|”. For example, the code below creates a new “east_asia” dataset by using the filter() function to extract South-East Asia or East Asia observations from qog_copy_selection:
# uses the "filter()" function to extract observations from South-East Asia (ht_region=7) and East Asia (ht_region=6) and assigns the result to a new object named "east_asia"
east_asia<-qog_copy_selection %>%
filter(ht_region==6|ht_region==7) %>%
relocate(cname_qog, cname, ccodealp, ht_region)Now, let’s print the contents of the newly created east_asia object:
# prints contents of "east_asia"
east_asia# A tibble: 17 × 29
cname_qog cname ccodealp ht_region wdi_trade trade_open trade_level
<chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Mongolia Mongolia MNG 6 119. 1 High_Trade
2 Korea, South Korea (… KOR 6 80.2 1 Intermedia…
3 China China CHN 6 37.3 0 Low_Trade
4 Japan Japan JPN 6 36.8 0 Low_Trade
5 Taiwan Taiwan … TWN 6 NA NA <NA>
6 Korea, North Korea (… PRK 6 NA NA <NA>
7 Singapore Singapo… SGP 7 330. 1 High_Trade
8 Vietnam Viet Nam VNM 7 187. 1 High_Trade
9 Brunei Brunei … BRN 7 147. 1 High_Trade
10 Malaysia (1966-) Malaysia MYS 7 134. 1 High_Trade
11 Cambodia Cambodia KHM 7 129. 1 High_Trade
12 Thailand Thailand THA 7 117. 1 High_Trade
13 Timor-Leste Timor-L… TLS 7 99.9 1 Intermedia…
14 Philippines Philipp… PHL 7 63.5 1 Intermedia…
15 Indonesia Indones… IDN 7 40.2 0 Low_Trade
16 Myanmar Myanmar MMR 7 NA NA <NA>
17 Laos Lao Peo… LAO 7 NA NA <NA>
# ℹ 22 more variables: trade_level_High_Trade <int>,
# trade_level_Intermediate_Trade <int>, trade_level_Low_Trade <int>,
# trade_level_NA <int>, wdi_acel <dbl>, wdi_expmil <dbl>, wdi_wip <dbl>,
# undp_hdi <dbl>, wdi_expedu <dbl>, wdi_area <dbl>, wdi_taxrev <dbl>,
# wdi_fdiin <dbl>, cbie_index <dbl>, wbgi_rle <dbl>, bmr_dem <dbl>,
# atop_ally <dbl>, gol_est <dbl>, mad_gdppc <dbl>, mad_gdppc1900 <dbl>,
# bci_bci <dbl>, lis_gini <dbl>, top_top1_income_share <dbl>
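As an aside, when a filter involves several acceptable values of the same variable, chaining many “|” conditions can become verbose; the %in% operator offers a more compact way to write the same condition. The sketch below reproduces the East Asia filter above using %in% (it is simply an alternative formulation and produces the same set of rows):
# equivalent filter written with %in% instead of chained "|" conditions
qog_copy_selection %>%
  filter(ht_region %in% c(6, 7)) %>%
  relocate(cname_qog, cname, ccodealp, ht_region)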
4.5 Processing and Wrangling Multiple Datasets
4.5.1 Joining Data
Joining, or merging, distinct datasets into a single dataset based on a common variable that exists in both individual datasets is an essential operation in most research projects. To see how we can implement joins in R using dplyr’s join functions (e.g. left_join(), inner_join(), full_join()), let’s first pull out two of the list elements in world_bank_list and assign them to objects in the global environment. First, we’ll extract the dataset on foreign direct investment and assign it to an object named wdi_fdi:
# extracts fdi dataset from "world_bank_list" by assigned name and assigns it to a new object named "wdi_fdi"
wdi_fdi<-world_bank_list[["wdi_fdi2019"]]Then, we’ll extract the World Bank trade dataset and assign it to an object named wdi_trade:
# extracts trade dataset from "world_bank_list" by assigned name and assigns it to a new object named "wdi_trade"
wdi_trade<-world_bank_list[["wdi_trade2019"]]Then, we’ll clean up these datasets by dropping rows with NA values, and then renaming the awkwardly named “2019 [YR2019]” variable in each of the datasets to something more intuitive and descriptive:
# drop na's and rename variable in trade dataset and assign to "wdi_trade_cleaned"
wdi_trade_cleaned<-wdi_trade %>%
drop_na() %>%
rename(trade_2019=`2019 [YR2019]`)
# drop na's and rename variable in FDI dataset and assign to "wdi_fdi_cleaned"
wdi_fdi_cleaned<-wdi_fdi %>%
drop_na() %>%
rename(fdi_2019=`2019 [YR2019]`)Now, we can go ahead and use one of the join functions to merge the datasets together. There are several different join functions (e.g. left_join(), inner_join(), full_join()); you should read the documentation (e.g. ?full_join) to make sure you’re applying the correct one given the context of your research. We will use the full_join() function, which keeps all of the observations from each of the component datasets in the joined (i.e. output) dataset (even those that do not have a corresponding observation in the other dataset). Below, we call the full_join() function, and pass as arguments the two datasets we’d like to join. The specification by="Country Code" indicates that the join field (i.e. the common field that can be used to link the data frames) is the “Country Code” variable. The column containing the country codes is named “Country Code” in both data frames, which makes this especially easy; if, however, the columns containing the join variable were named differently, we could equate them within the full_join() function by passing a named vector to the “by” argument that specifies the join field from each component dataset. For example, if the country code variable were in a column named “iso3” in wdi_fdi and “Country Code” in wdi_trade, we could specify by=c("iso3"="Country Code") to let the function know that the join field in wdi_fdi is “iso3” and the join field in wdi_trade is “Country Code”. Below, we assign the product of the join to a new object named fdi_trade_join:
# join together "wdi_fdi_cleaned" and "wdi_trade_cleaned" using country code
fdi_trade_join<-full_join(wdi_fdi_cleaned, wdi_trade_cleaned, by="Country Code")Let’s see what fdi_trade_join looks like by passing it to the R Studio Viewer:
View(fdi_trade_join)Note that the trade and FDI variables now appear in the joined dataset as separate columns.
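To make the differently-named-key scenario described above concrete, the illustrative sketch below renames the country-code column in a copy of wdi_fdi_cleaned to a hypothetical name (“iso3”), and then joins it to wdi_trade_cleaned by equating the two key names inside the “by” argument; this is purely for demonstration, and yields essentially the same joined dataset as before (with the key column named “iso3”):
# illustration only: rename the key column in one dataset, then equate the two key names in the join
wdi_fdi_renamed<-wdi_fdi_cleaned %>%
  rename(iso3=`Country Code`)
fdi_trade_join_alt<-full_join(wdi_fdi_renamed, wdi_trade_cleaned,
                              by=c("iso3"="Country Code"))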
4.5.2 Appending Data
If joining datasets is fundamentally about situating disparate datasets side-by-side in a new unified dataset, appending datasets is about stacking disparate datasets with a similar structure on top of each other in a new unified dataset. In other words, joining datasets leads to a unified dataset with more columns than the original individual datasets, while appending datasets leads to a unified dataset with more rows than the original individual datasets. Depending on the type of data you’re working with, appending data can be just as important a method for bringing together disparate datasets.
We can easily append data frames into a single unified data frame using the bind_rows() function from dplyr. Below, we append wdi_fdi to wdi_trade by passing these objects as arguments to the bind_rows() function. We’ll assign the appended dataset to a new object named worldbank_trade_fdi_appended:
# Appends "wdi_fdi" to "wdi_trade" and assigns the new dataset to an object named "worldbank_trade_fdi_appended"
worldbank_trade_fdi_appended<-bind_rows(wdi_trade, wdi_fdi)To confirm that the appending operation worked as expected, please view worldbank_trade_fdi_appended in the data viewer.
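One optional refinement when appending: if you want the stacked dataset to record which source each row came from, bind_rows() can take a named list together with an .id argument that creates an identifier column. The sketch below is a variant of the append above; the “source” column name is our own choice:
# variant of the append above that adds a "source" column identifying each row's original dataset
worldbank_trade_fdi_tagged<-bind_rows(list(trade=wdi_trade, fdi=wdi_fdi),
                                      .id="source")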
4.5.3 Reshaping Data
In a “wide” dataset, each variable or measurement is stored in a separate column; on the other hand, in a “long” dataset, each measurement is stored in a single column, and there is an additional column that identifies the specific variable or category for each observation. For example, in our context, fdi_trade_join is a “wide” dataset, while worldbank_trade_fdi_appended is formatted as a long dataset.
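A small toy example may help make the distinction concrete. The invented tibbles below (purely for illustration, with made-up values) contain the same information twice: once in wide form, where each measurement has its own column, and once in long form, where a single value column is paired with a column identifying the measurement:
# toy wide dataset: one column per measurement (invented values, purely for illustration)
toy_wide<-tibble(country=c("A", "B"),
                 trade=c(80, 120),
                 fdi=c(2.5, 4.1))
# the same information in long form: a "variable" column identifies the measurement, and a single "value" column holds the data
toy_long<-tibble(country=c("A", "A", "B", "B"),
                 variable=c("trade", "fdi", "trade", "fdi"),
                 value=c(80, 2.5, 120, 4.1))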
The process of converting a wide dataset to a long one, or vice versa, is typically referred to as reshaping data. Often, one will need to bring different datasets together (either through a join or append operation), but in order to successfully implement the required procedure, it is first necessary to reshape at least one of the datasets so that they’re correctly formatted; the first step in integrating datasets, therefore, is frequently a reshaping operation that ensures that the datasets have a common structure. Below, we’ll quickly review some functions for carrying out these reshaping operations from the tidyr package, which is part of the tidyverse suite.
4.5.3.1 Long to Wide
To see how a “long to wide” operation works using tidyr, let’s imagine we want to transform worldbank_trade_fdi_appended, currently formatted as a long dataset, into a wide dataset. First, we’ll clean up worldbank_trade_fdi_appended using some familiar tidyverse functions, and assign the result to a new object named worldbank_trade_fdi_cleaned:
# cleans the dataset before reshaping
worldbank_trade_fdi_cleaned<-worldbank_trade_fdi_appended %>%
rename(economic_variables="2019 [YR2019]",
series_code="Series Code") %>%
select(-"Series Name") %>%
drop_na()Before proceeding, we should note that the column containing the data, "economic_variables", is formatted as a character variable:
# prints class of "economic_variables" column
class(worldbank_trade_fdi_cleaned$economic_variables)[1] "character"
In order for the reshaping process to work, we need to transform that column into the “numeric” class, which we do with the following:
# converts "economic_variables" to numeric
worldbank_trade_fdi_cleaned$economic_variables<-as.numeric(worldbank_trade_fdi_cleaned$economic_variables)Warning: NAs introduced by coercion
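The warning is expected here: the World Bank extracts use “..” as a placeholder for missing values, and those strings cannot be converted to numbers, so they become NA. As an aside, the same conversion can also be written in pipe-friendly form with mutate(); the sketch below would produce an identical result (and running it now is harmless, since the column is already numeric):
# equivalent, pipe-friendly way of converting the column to numeric (".." entries become NA)
worldbank_trade_fdi_cleaned<-worldbank_trade_fdi_cleaned %>%
  mutate(economic_variables=as.numeric(economic_variables))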
Now, we use tidyr’s pivot_wider() function to “pivot” the data from long to wide. Below, we’ll take worldbank_trade_fdi_cleaned and then feed that data into the pivot_wider() function using the %>% operator. Within the pivot_wider() function, the “names_from” argument specifies the current column that contains the names we would like to use as column names in the reshaped dataset, while the “values_from” argument specifies the name of the current column that contains the data that will be used to populate the columns in the reshaped dataset. We’ll assign the output to a new object named worldbank_trade_fdi_wide:
# reshapes "worldbank_trade_fdi_cleaned" from long to wide and assigns the wide dataset to an object named "worldbank_trade_fdi_wide"
worldbank_trade_fdi_wide<-worldbank_trade_fdi_cleaned %>%
tidyr::pivot_wider(names_from=series_code,
values_from=economic_variables)Let’s quickly take a look at the reshaped dataset:
# prints contents of "worldbank_trade_fdi_wide"
worldbank_trade_fdi_wide# A tibble: 6 × 4
`Country Name` `Country Code` NE.TRD.GNFS.ZS BX.KLT.DINV.WD.GD.ZS
<chr> <chr> <dbl> <dbl>
1 Afghanistan AFG NA 0.124
2 Albania ALB 76.3 7.80
3 Algeria DZA 51.8 0.804
4 American Samoa ASM 157. NA
5 Andorra AND NA NA
6 Angola AGO 57.8 -5.78
It might be nice to rename the variables to something more intuitive:
# renames columns in "worldbank_trade_fdi_wide"
worldbank_trade_fdi_wide<-worldbank_trade_fdi_wide %>%
rename(trade2019=NE.TRD.GNFS.ZS,
FDI2019=BX.KLT.DINV.WD.GD.ZS)Go ahead and inspect worldbank_trade_fdi_wide in the data viewer:
View(worldbank_trade_fdi_wide)4.5.3.2 Wide to Long
To make the conversion in the opposite direction, i.e. reshaping a dataset from wide to long, we can use the pivot_longer() function. Below, we take the worldbank_trade_fdi_wide data frame, and feed it into the pivot_longer() function using the %>% operator. Within the function, we use cols=c(FDI2019, trade2019) to specify the current columns that we’re going to collapse into a single column in the new dataset. The “names_to” argument specifies the name of the new column that will hold the names of the original “FDI2019” and “trade2019” columns, while the “values_to” argument specifies the name of the column that will actually hold the data. Let’s go ahead and run the code, and assign the output to a new object named world_bank_trade_long:
# reshapes "worldbank_trade_fdi_wide" back to long format and assigns the reshaped dataset to a new object named "world_bank_trade_long"
world_bank_trade_long<-worldbank_trade_fdi_wide %>%
pivot_longer(cols=c(FDI2019, trade2019),
names_to="economic_variable",
values_to = "2019")Let’s view world_bank_trade_long in the data Viewer to confirm that it has been successfully reshaped:
View(world_bank_trade_long)4.5.4 Automating Data Processing with Functions
When working with multiple datasets, we may need to carry out identical operations on those datasets, in which case we could save time by wrapping up the relevant code in a function and iterating over the dataset elements we’d like to transform. Let’s say, for example, that we want to clean up our World Bank datasets by deleting a column, renaming others, and removing the rows where the country code is NA. Rather than carrying out these operations on each dataset individually, let’s create a function, called worldbank_cleaning_function, that implements these changes and returns a clean dataset:
# write function to clean World Bank dataset
worldbank_cleaning_function<-function(input_dataset){
modified_dataset<-input_dataset %>%
select(-"Series Code") %>%
rename("Country"="Country Name",
"CountryCode"="Country Code",
"Series"="Series Name",
"2019"="2019 [YR2019]") %>%
drop_na(CountryCode)
return(modified_dataset)
}Let’s test the function using wdi_trade:
# passes "wdi_trade" to "worldbank_cleaning_function"
worldbank_cleaning_function(wdi_trade)# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Trade (% of GDP) ..
2 Albania ALB Trade (% of GDP) 76.2791946495763
3 Algeria DZA Trade (% of GDP) 51.8097384415762
4 American Samoa ASM Trade (% of GDP) 156.568778979907
5 Andorra AND Trade (% of GDP) ..
6 Angola AGO Trade (% of GDP) 57.8295381183036
7 Antigua and Barbuda ATG Trade (% of GDP) 137.625175755884
8 Argentina ARG Trade (% of GDP) 32.6306150458499
9 Armenia ARM Trade (% of GDP) 96.1141541288708
10 Aruba ABW Trade (% of GDP) 145.343572735289
# ℹ 256 more rows
That looked like it worked, so let’s go ahead and apply worldbank_cleaning_function to all of the data frames in world_bank_list. We’ll use the familiar map() function to do so; below, world_bank_list is the list we’re iterating over, while worldbank_cleaning_function() is the function we’re applying:
# Iteratively apply "worldbank_cleaning_function" to all of the datasets in "world_bank_list", and deposit the cleaned datasets into a new list named "world_bank_list_cleaned"
world_bank_list_cleaned<-map(.x=world_bank_list, .f=worldbank_cleaning_function)Let’s print out the list and confirm that the operation worked:
# prints contents of "world_bank_list_cleaned"
world_bank_list_cleaned$wdi_debt2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Central government debt, total (% of … ..
2 Albania ALB Central government debt, total (% of … 75.69…
3 Algeria DZA Central government debt, total (% of … ..
4 American Samoa ASM Central government debt, total (% of … ..
5 Andorra AND Central government debt, total (% of … ..
6 Angola AGO Central government debt, total (% of … ..
7 Antigua and Barbuda ATG Central government debt, total (% of … ..
8 Argentina ARG Central government debt, total (% of … ..
9 Armenia ARM Central government debt, total (% of … 50.02…
10 Aruba ABW Central government debt, total (% of … ..
# ℹ 256 more rows
$wdi_fdi2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Foreign direct investment, net inflow… 0.124…
2 Albania ALB Foreign direct investment, net inflow… 7.797…
3 Algeria DZA Foreign direct investment, net inflow… 0.804…
4 American Samoa ASM Foreign direct investment, net inflow… ..
5 Andorra AND Foreign direct investment, net inflow… ..
6 Angola AGO Foreign direct investment, net inflow… -5.78…
7 Antigua and Barbuda ATG Foreign direct investment, net inflow… 7.433…
8 Argentina ARG Foreign direct investment, net inflow… 1.485…
9 Armenia ARM Foreign direct investment, net inflow… 0.736…
10 Aruba ABW Foreign direct investment, net inflow… -2.21…
# ℹ 256 more rows
$wdi_trade2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Trade (% of GDP) ..
2 Albania ALB Trade (% of GDP) 76.2791946495763
3 Algeria DZA Trade (% of GDP) 51.8097384415762
4 American Samoa ASM Trade (% of GDP) 156.568778979907
5 Andorra AND Trade (% of GDP) ..
6 Angola AGO Trade (% of GDP) 57.8295381183036
7 Antigua and Barbuda ATG Trade (% of GDP) 137.625175755884
8 Argentina ARG Trade (% of GDP) 32.6306150458499
9 Armenia ARM Trade (% of GDP) 96.1141541288708
10 Aruba ABW Trade (% of GDP) 145.343572735289
# ℹ 256 more rows
$wdi_urban2019
# A tibble: 266 × 4
Country CountryCode Series `2019`
<chr> <chr> <chr> <chr>
1 Afghanistan AFG Urban population (% of total populati… 25.754
2 Albania ALB Urban population (% of total populati… 61.229
3 Algeria DZA Urban population (% of total populati… 73.189
4 American Samoa ASM Urban population (% of total populati… 87.147
5 Andorra AND Urban population (% of total populati… 87.984
6 Angola AGO Urban population (% of total populati… 66.177
7 Antigua and Barbuda ATG Urban population (% of total populati… 24.506
8 Argentina ARG Urban population (% of total populati… 91.991
9 Armenia ARM Urban population (% of total populati… 63.219
10 Aruba ABW Urban population (% of total populati… 43.546
# ℹ 256 more rows
As expected, each of the processed datasets is an element in world_bank_list_cleaned.
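Note that the “2019” column in each cleaned dataset is still stored as character, with “..” marking missing values. If you want numeric values across the whole list, one option, sketched below as an optional extra step, is to map a small mutate() call over world_bank_list_cleaned (the “..” entries will again be coerced to NA, with the same warning as before):
# optional extra step: converts the `2019` column to numeric in every cleaned dataset
world_bank_list_numeric<-map(world_bank_list_cleaned,
                             ~ mutate(.x, `2019`=as.numeric(`2019`)))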
4.6 Exporting Processed Data
After we have processed our datasets, it’s often necessary to get the transformed data out of the R environment. There are a variety of functions that can export data frames from R to your local disk in various file formats. Because CSV files are so flexible and commonly used, we’ll explore how to export R data frames to your local disk as CSV files; if you need to export your data frames into another file format (for example .xlsx or .dta files), you would need to look up the relevant packages and functions that enable this.
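For instance, and without going into detail, the haven package (which we loaded at the start of this lesson) provides write_dta() for Stata .dta files, and the separate writexl package (not otherwise used in this lesson, and installed separately) provides write_xlsx() for Excel files. The lines below are a rough sketch of what those calls look like, assuming the “outputs” folder exists and the relevant packages are installed:
# rough sketch: exporting a data frame to Stata (.dta) and Excel (.xlsx) formats
haven::write_dta(east_asia, "outputs/east_asia.dta")
writexl::write_xlsx(east_asia, "outputs/east_asia.xlsx")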
To write out an R data frame to disk as a CSV, we can use the write_csv() function. Below, we export the east_asia data frame we created earlier as a CSV to the “outputs” folder in our working directory. The first argument is the name of the object we want to export, while the second argument specifies the file path, name, and extension of the desired output file.
# exports "east_asia" to a local directory (i.e. the "outputs" sub-directory of our working directory)
write_csv(east_asia, "outputs/east_asia.csv")In order to export multiple data frames (for example, we may want to export a bunch of cleaned data frames out of R to use in another statistical package), we can use the walk() function and its relatives; these functions are part of the purrr package, and work similarly to the map() family of functions.
Let’s say we want to export the cleaned World Bank data frames in world_bank_list_cleaned. First, we’ll create a vector of file names for the exported files using the paste0() function. We’ll assign this vector to a new object named WB_filenames_export:
# create file names for exported World Bank files
WB_filenames_export<-paste0("outputs/", worldbank_filenames_base, "_cleaned.csv")Let’s see what this looks like:
# prints "WB_filenames_export" contents
WB_filenames_export[1] "outputs/wdi_debt2019_cleaned.csv" "outputs/wdi_fdi2019_cleaned.csv"
[3] "outputs/wdi_trade2019_cleaned.csv" "outputs/wdi_urban2019_cleaned.csv"
Now, we’ll use the walk2() function to iteratively pass the data frames (contained in world_bank_list_cleaned) and the file names (contained in WB_filenames_export) to the write_csv() function. The code below takes the first elements from world_bank_list_cleaned and WB_filenames_export, and passes them to write_csv(), which exports the first data frame with the first file name; then, it takes the second elements from world_bank_list_cleaned and WB_filenames_export, and passes them to write_csv(), which exports the second data frame with the second file name; and so on.
# exports datasets in "world_bank_list_cleaned" to "outputs" directory using filenames in "WB_filenames_export"
walk2(.x=world_bank_list_cleaned, .y=WB_filenames_export, write_csv)Check your “outputs” directory to ensure that the files have been written out as expected.
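As a closing aside, because world_bank_list_cleaned is a named list, a roughly equivalent export can also be written with purrr’s iwalk(), which supplies each list element (.x) together with its name (.y). The sketch below builds the file names inside the call rather than in a separate vector, and assumes the list names match the base file names used above:
# roughly equivalent export using iwalk(), which passes each element (.x) and its name (.y)
iwalk(world_bank_list_cleaned,
      ~ write_csv(.x, paste0("outputs/", .y, "_cleaned.csv")))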