# Install packages using the "install.packages" function
install.packages("tidyverse")
install.packages("psych")
install.packages("janitor")
install.packages("fastDummies")
install.packages("stargazer")
install.packages("gtsummary")
install.packages("ggeffects")Processing and Wrangling Data in R
1 Introduction
In the past two sessions, we presented some ideas, concepts, and tools that provide a basic foundation for working with data in R. Now that we have this basic foundation, we’ll turn in this session to a more applied exploration of some actual datasets. Our goal here is to introduce you to some useful functions that will allow you to explore and begin making sense of actual datasets in R. This lesson will provide a tour of various functions that can be particularly helpful as you get started cleaning, and wrangling data in R. Among the topics we’ll cover today are:
Loading data into R
Handling missing data
Summary Statistics in R
Data cleaning and processing
We will be working with a handful of datasets, which you will need to download. We will primarily be working with a crossnational dataset collected and made public by the economists Torsten Persson and Guido Tabellini; it’s one of the datasets they work with in their highly influential book, entitled “The Economic Effects of Constitutions.” The codebook for the dataset is provided along with the data in the downloaded materials. We’ll also be working with some datasets collected from the World Bank’s World Development Indicators next class, and these datasets are included in the data download.
2 Installing Packages and Loading Libraries
Please go ahead and install any of the packages below that you have not already installed.
Note that when you need to install a lot of packages at once, you can put the names of the packages you’d like to install into a character vector and pass it to the install.packages() function:
# Installing multiple packages by passing package names as strings to the "install.packages" function
install.packages(c("tidyverse", "psych", "janitor", "fastDummies"))Remember to load the required libraries (even packages you’ve already installed before).
# load libraries
library(tidyverse)
library(psych)
library(janitor)
library(fastDummies)
library(janitor)
library(summarytools)
library(stargazer)
library(gtsummary)
library(ggeffects)3 Data Transfer Part 1: Reading Data into R
If you haven’t already, please download the data for this session. Deposit the data into a local directory that you will use for the workshop.
3.1 Reading in Local Files
Though it’s not strictly necessary, it’s useful to begin by setting your working directory to the location on your computer where the data is stored. The easiest way to do this is to go to the R Studio menu: click Session, then click Set Working Directory, then click Choose Directory. We can also set the working directory programmatically using the setwd() function.
We can now pass the name of file and its extension in quotation marks to the read_csv() function (since the data we want to load is a CSV file). We’ll assign it to an object named pt (after Persson and Tabellini):
# reads in the workshop dataset (Persson and Tabellini cross-national dataset) by passing the file path as an argument to the "read_csv" and assigns it to a new object named "pt"
pt<-read_csv("data/pt/persson_tabellini_workshop.csv")Rows: 85 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (73): oecd, pind, pindo, ctrycd, col_uk, t_indep, col_uka, col_espa, col...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Go ahead and view pt in the R Studio viewer.
# Views the dataset in the R Studio data viewer
View(pt)3.2 Reading in Data from Cloud Storage
As you continue working in R, it may be helpful to read data in directly from cloud storage, rather than your local machine. It may take some effort to figure out how to do this (it will vary depending on your cloud storage provider) but it can be as simple as passing the link to your data to an R function. For example, an alternative way of reading the PT data into R is to paste the Dropbox link associated with the dataset to the read_csv() function. If you’re reading in files from Dropbox, change the “0” at the end of the url to a 1 before reading it into R.
read_csv("https://www.dropbox.com/scl/fo/2aianyhhdx92auh9aa6l2/AFfOXirAyGDi-xPC2c_gcDw/data/pt/persson_tabellini_workshop.csv?rlkey=ilmv8do8gcpmd8m43tbw2wxjp&e=1&dl=1")Rows: 85 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (73): oecd, pind, pindo, ctrycd, col_uk, t_indep, col_uka, col_espa, col...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 85 × 75
oecd country pind pindo ctrycd col_uk t_indep col_uka col_espa col_otha
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 Argentina 0 0 213 0 183 0 0.268 0
2 1 Australia 1 1 193 1 98 0.608 0 0
3 1 Austria 0 0 122 0 250 0 0 0
4 0 Bahamas 1 1 313 1 26 0.896 0 0
5 0 Bangladesh 1 1 513 0 28 0 0 0.888
6 0 Barbados 1 1 316 1 33 0.868 0 0
7 0 Belarus 1 1 913 0 8 0 0 0.968
8 1 Belgium 0 0 124 0 169 0 0 0.324
9 0 Belize 1 1 339 1 18 0.928 0 0
10 0 Bolivia 0.116 0.116 218 0 174 0 0.304 0
# ℹ 75 more rows
# ℹ 65 more variables: legor_uk <dbl>, legor_so <dbl>, legor_fr <dbl>,
# legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>, catho80 <dbl>, confu <dbl>,
# avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>, loga <dbl>,
# yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>,
# latitude <dbl>, gastil <dbl>, cgexp <dbl>, cgrev <dbl>, ssw <dbl>,
# rgdph <dbl>, trade <dbl>, prop1564 <dbl>, prop65 <dbl>, federal <dbl>, …
4 Processing, Wrangling, and Analyzing Numeric Data
We’ll start by making a copy of the pt object, by assigning it to a new object named pt_copy. We’ll use pt_copy when exploring the dataset, which ensures that we do not make inadvertent changes to our original pt data frame, and can always refer back to it when needed. Keeping a “clean” version of the data, and carrying out analysis tasks on a copy of this dataset, is good data management practice.
# makes a copy of "pt", called "pt_copy" that we can use for processing and analysis; keeps the original data frame, "pt" untouched
pt_copy<-pt4.1 Missing Data
Before proceeding, it’s useful to have an orientation to how R handles missing data, which is a common feature of most social science datasets (including the data we’re working with today). Let’s work with a simple toy dataset for now; as an exercise, you’ll be asked to perform analogous operations on the pt_copy dataset
# makes toy dataset, assigned to object named "student_scores"
student_scores<-data.frame(ID=c(1:5),
Age=c(25, NA, 30, 22, NA),
Score=c(85, 90, NA, 78, 88))If we want to identify missing data, we can do so with the is.na() function, which returns a logical matrix with the value TRUE for missing values:
# uses "is.na" to return a logical matrix indicating missing values (TRUE for missing values)
is.na(student_scores) ID Age Score
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE TRUE FALSE
If we want the total number of missing values in the dataset, we can take the sum of is.na(student_scores) :
# calculates total number of missing values in "student_scores"
sum(is.na(student_scores))[1] 3
If we want the total number of missing values per column, we can pass is.na(student_scores) as an argument to the colSum() function, which we can do because in R, logical values can be used in numeric operations; when they’re used in this way, TRUE=1 and FALSE=0.
# calculates total number of missing values per column
colSums(is.na(student_scores)) ID Age Score
0 2 1
If we wanted to calculate the percentage of missing data in the dataset, we could use our count of the number of NA values in the dataset sum(is.na(student_scores)) and divide it by the total number of cells in the dataset, derived by multiplying the dataset’s rows and columns:
# calculates missing data percentage in "student_scores"
# first calculates count of missing values and assigns it to "total_missing"
total_missing<-sum(is.na(student_scores))
# calculates total number of cells and assigns it to "total_values"
total_values<-prod(dim(student_scores))
# calculates percentage of missing data and assigns it to "missing_pct"
missing_pct<-(total_missing/total_values)*100
# prints contents of "missing_pct"
missing_pct[1] 20
Calculating the percentage of data missing in a dataset can be useful, and something we may want to do frequently, so it could be useful to wrap this code into a function, which we’ll do below, and assign to a new object named missing_data_percentage():
# creates function to calculate the percentage of missing data in a dataset
missing_data_percentage<-function(dataset){
# generates count of missing values
total_missing<-sum(is.na(dataset))
# calculates total number of cells
total_values<-prod(dim(dataset))
# calculates percentage of missing data
missing_pct<-(total_missing/total_values)*100
return(missing_pct)
}Let’s test out the function on student_scores:
# passes "student_scores" as an argument to custom function "missing_data_percentage" which yields the percentage of missing data in the "student_scores" dataset
missing_data_percentage(student_scores)[1] 20
We can now use this function for any other datasets we might create or work with, without having to retype a bunch of code. For example, if we wanted to calculate the percentage of missing data in pt_copy, we could pass it as an argument to missing_data_percentage:
# calculates missing data percentage in pt_copy
missing_data_percentage(pt_copy)[1] 3.843137
It could also be useful to get a sense of the missing data percentage per column. We can do so with the following:
# calculates percentage of missing data per column in "student_scores" and assigns the resulting vector to an object named "missing_pct_per_col"
missing_pct_per_col<-colSums(is.na(student_scores))/nrow(student_scores)*100
# prints contents of "missing_pct_per_col"
missing_pct_per_col ID Age Score
0 40 20
If we want to remove missing data, the drop_na() function from the tidyverse is useful. The drop_na() function will remove all rows from the dataset with NA values in any column:
# removes all rows with NA values from "student_scores"
drop_na(student_scores) ID Age Score
1 1 25 85
2 4 22 78
The default behavior of the drop_na() function is to drop all rows with NA values in any column (i.e. only rows with complete data are kept in the dataset). However, it’s possible to change this default behavior by specifying column/variable names after passing the data frame object as an argument; when we do this, drop_na() no longer drops all rows with missing data in any column, but all rows with missing data in any specified column. For example, the code below drops all rows where “Age” values are NA, but not all rows where “Score” scores are NA
# removes all rows where "Age" has NA values in "student_scores"
drop_na(student_scores, Age) ID Age Score
1 1 25 85
2 3 30 NA
3 4 22 78
If, instead, we specify “Score” as an argument, the function drops all rows where “Score” values are NA, but not all rows where “Age” values are NA.
# removes all rows where "Score" has NA values in "student_scores"
drop_na(student_scores, Score) ID Age Score
1 1 25 85
2 2 NA 90
3 4 22 78
4 5 NA 88
Note that we can specify more than one column argument to drop_na(), in which case rows which have NA values associated with those specified columns will be dropped, but rows which have NA values with other (non-specified) columns are kept.
Imputing missing values is a complex topic beyond the scope of our workshop, but it’s worth briefly noting that for whatever reason, NA values may be incorrectly coded, or placeholders for other values. For example, you may know from a dataset’s codebook, or from context, that for a particular column in the dataset, observations that appear as missing/NA should actually be coded as “0”. We can use the replace_na() function to replace such NA values with their proper values. For example, let’s say that NA values in the “Score” column should actually be “0” (perhaps NA was being used as a placeholder, and needs to be converted to 0 since the student never turned in their assignment). In the replace_na() function, the first argument is the name of the data frame object, while the second is a list that indicates the column(s) with NA values that need replacement, and a specification for what they will be replaced with. In our case, the syntax looks something like this:
# replace NA values in the "Score" column with 0
replace_na(student_scores, list(Score=0)) ID Age Score
1 1 25 85
2 2 NA 90
3 3 30 0
4 4 22 78
5 5 NA 88
Finally, it’s important to note that functions like mean() and sum() will return NA if there are missing values. For example:
# calculates mean of "Score" (NA values are not excluded; default behavior)
mean(student_scores$Score)[1] NA
# calculates mean of "Score" (NA values are excluded due to na.rm=TRUE specification; as a result, the function computes an average based on non-NA values)
mean(student_scores$Score, na.rm=TRUE)[1] 85.25
When working in R, different packages or functions may handle missing data differently, so it’s useful to consult the documentation of the packages you use for more information (especially when something is not working as expected).
4.2 Summary Statistics
Once we have a dataset loaded into R, one of the first things we’ll likely want to do after taking a look at the dataset in the viewer with the View() function is to get a sense of are data by calculating some summary statistics. This section reviews some summary statistics functions and tools we can call on for this purpose.
4.2.1 Summary Statistics Tables
A quick way to generate a table of summary statistics is to use the describe() function from the psych package. Below, we’ll generate summary statistics for the pt_copy dataset by passing pt_copy as an argument to the describe() function, and assign the resulting table of summary statistics to a new object named pt_copy_summarystats1.
# Generate summary statistics for "pt_copy" and assign table of summary statistics to a new object named "pt_copy_summarystats1"
pt_copy_summarystats1<-describe(pt_copy)Let’s go ahead and view the table of summary statistics. We can print the object to our console:
# prints contents of "pt_copy_summarystats1" to console
pt_copy_summarystats1 vars n mean sd median trimmed mad min max
oecd 1 85 0.29 0.46 0.00 0.25 0.00 0.00 1.00
country 2 85 43.00 24.68 43.00 43.00 31.13 1.00 85.00
pind 3 85 0.46 0.47 0.42 0.45 0.62 0.00 1.00
pindo 4 85 0.61 0.46 0.90 0.63 0.15 0.00 1.00
ctrycd 5 85 430.88 284.89 299.00 406.86 240.18 111.00 968.00
col_uk 6 85 0.35 0.48 0.00 0.32 0.00 0.00 1.00
t_indep 7 85 119.73 89.76 92.00 117.48 114.16 6.00 250.00
col_uka 8 85 0.28 0.39 0.00 0.24 0.00 0.00 0.93
col_espa 9 85 0.06 0.13 0.00 0.03 0.00 0.00 0.79
col_otha 10 85 0.22 0.36 0.00 0.16 0.00 0.00 0.98
legor_uk 11 85 0.39 0.49 0.00 0.36 0.00 0.00 1.00
legor_so 12 85 0.13 0.34 0.00 0.04 0.00 0.00 1.00
legor_fr 13 85 0.35 0.48 0.00 0.32 0.00 0.00 1.00
legor_ge 14 85 0.07 0.26 0.00 0.00 0.00 0.00 1.00
legor_sc 15 85 0.06 0.24 0.00 0.00 0.00 0.00 1.00
prot80 16 85 17.46 25.50 2.70 12.24 3.85 0.00 97.80
catho80 17 85 40.69 38.66 27.60 38.92 39.88 0.00 97.30
confu 18 85 0.07 0.26 0.00 0.00 0.00 0.00 1.00
avelf 19 85 0.29 0.26 0.18 0.26 0.21 0.00 0.84
govef 20 81 4.21 1.75 4.48 4.25 1.99 0.84 7.26
graft 21 81 4.17 1.89 4.24 4.26 2.18 0.74 6.92
logyl 22 74 9.23 0.90 9.29 9.30 1.15 6.95 10.48
loga 23 73 8.17 0.61 8.37 8.23 0.58 6.28 9.02
yrsopen 24 75 0.52 0.54 0.42 0.47 0.53 0.00 4.09
gadp 25 75 0.69 0.20 0.68 0.69 0.24 0.31 1.00
engfrac 26 78 0.14 0.32 0.00 0.06 0.00 0.00 1.00
eurfrac 27 78 0.40 0.44 0.08 0.38 0.12 0.00 1.00
frankrom 28 78 2.87 0.84 2.91 2.86 0.78 0.94 5.64
latitude 29 78 17.96 27.87 17.30 19.13 33.67 -36.89 63.89
gastil 30 85 2.44 1.23 2.28 2.36 1.65 1.00 4.89
cgexp 31 82 28.82 10.49 28.90 28.50 13.04 9.74 51.18
cgrev 32 78 26.49 10.12 24.16 26.05 10.98 8.92 50.85
ssw 33 71 8.15 6.67 7.17 7.65 8.60 0.13 22.38
rgdph 34 85 6688.63 5495.17 4400.03 6130.45 4371.13 530.22 20782.81
trade 35 85 78.77 47.34 68.45 73.50 34.38 17.56 343.39
prop1564 36 84 62.07 5.76 63.94 62.49 5.31 49.05 71.70
prop65 37 84 8.45 4.89 6.47 8.26 5.68 2.30 17.43
federal 38 83 0.16 0.37 0.00 0.07 0.00 0.00 1.00
eduger 39 82 88.58 17.70 93.48 90.55 14.33 40.05 117.11
spropn 40 77 0.13 0.25 0.00 0.07 0.00 0.00 1.00
yearele 41 81 1965.55 36.85 1981.00 1972.62 16.31 1800.00 1994.00
yearreg 42 81 1961.48 40.16 1978.00 1969.65 20.76 1800.00 1994.00
seats 43 85 215.45 162.54 166.00 195.25 127.17 15.00 656.00
maj 44 85 0.39 0.49 0.00 0.36 0.00 0.00 1.00
pres 45 85 0.39 0.49 0.00 0.36 0.00 0.00 1.00
lyp 46 85 8.41 0.97 8.39 8.45 1.25 6.27 9.94
semi 47 85 0.11 0.31 0.00 0.01 0.00 0.00 1.00
majpar 48 85 0.25 0.42 0.00 0.19 0.00 0.00 1.00
majpres 49 85 0.13 0.33 0.00 0.04 0.00 0.00 1.00
propres 50 85 0.26 0.44 0.00 0.20 0.00 0.00 1.00
dem_age 51 85 1958.34 43.74 1978.00 1967.00 19.27 1800.00 1994.00
lat01 52 78 0.32 0.19 0.28 0.31 0.20 0.00 0.71
age 53 85 0.21 0.22 0.11 0.17 0.10 0.03 1.00
polityIV 54 80 7.17 3.64 8.11 7.92 2.80 -6.00 10.00
spl 55 74 -2.18 3.48 -1.69 -2.22 2.85 -11.36 12.59
cpi9500 56 71 4.81 2.38 5.29 4.95 2.46 0.27 8.25
du_60ctry 57 85 0.71 0.46 1.00 0.75 0.00 0.00 1.00
magn 58 84 0.47 0.40 0.26 0.46 0.31 0.01 1.00
sdm 59 77 0.35 0.39 0.16 0.32 0.15 0.01 1.00
oecd.x 60 85 0.27 0.45 0.00 0.22 0.00 0.00 1.00
mining_gdp 61 77 4.26 6.72 1.29 2.82 1.51 0.02 37.20
gini_8090 62 72 39.20 10.41 37.52 38.71 12.23 19.49 62.30
con2150 63 85 0.11 0.31 0.00 0.01 0.00 0.00 1.00
con5180 64 85 0.29 0.46 0.00 0.25 0.00 0.00 1.00
con81 65 85 0.49 0.50 0.00 0.49 0.00 0.00 1.00
list 66 84 114.48 129.55 77.67 93.16 115.15 0.00 510.33
maj_bad 67 85 1.06 1.60 0.00 0.77 0.00 0.00 4.89
maj_gin 68 72 16.40 20.96 0.00 13.86 0.00 0.00 62.00
maj_old 69 85 0.08 0.19 0.00 0.04 0.00 0.00 1.00
pres_bad 70 85 1.21 1.69 0.00 0.97 0.00 0.00 4.89
pres_gin 71 72 16.66 22.90 0.00 13.83 0.00 0.00 62.00
pres_old 72 85 0.06 0.16 0.00 0.02 0.00 0.00 1.00
propar 73 85 0.35 0.48 0.00 0.32 0.00 0.00 1.00
lpop 74 60 15.90 1.92 15.94 15.95 1.47 11.61 20.63
continent 75 85 3.04 1.06 3.00 3.16 1.48 1.00 4.00
range skew kurtosis se
oecd 1.00 0.89 -1.23 0.05
country 84.00 0.00 -1.24 2.68
pind 1.00 0.13 -1.89 0.05
pindo 1.00 -0.46 -1.69 0.05
ctrycd 857.00 0.63 -1.08 30.90
col_uk 1.00 0.60 -1.65 0.05
t_indep 244.00 0.26 -1.54 9.74
col_uka 0.93 0.71 -1.44 0.04
col_espa 0.79 2.68 9.07 0.01
col_otha 0.98 1.16 -0.42 0.04
legor_uk 1.00 0.45 -1.82 0.05
legor_so 1.00 2.17 2.74 0.04
legor_fr 1.00 0.60 -1.65 0.05
legor_ge 1.00 3.29 8.96 0.03
legor_sc 1.00 3.68 11.71 0.03
prot80 97.80 1.65 1.99 2.77
catho80 97.30 0.38 -1.59 4.19
confu 1.00 3.29 8.96 0.03
avelf 0.84 0.76 -0.79 0.03
govef 6.42 -0.22 -1.11 0.19
graft 6.17 -0.37 -1.14 0.21
logyl 3.52 -0.57 -0.53 0.10
loga 2.73 -0.85 0.12 0.07
yrsopen 4.09 3.72 22.32 0.06
gadp 0.69 0.11 -1.19 0.02
engfrac 1.00 1.97 2.04 0.04
eurfrac 1.00 0.31 -1.80 0.05
frankrom 4.70 0.25 0.60 0.10
latitude 100.78 -0.31 -0.98 3.16
gastil 3.89 0.39 -1.16 0.13
cgexp 41.44 0.17 -0.99 1.16
cgrev 41.93 0.34 -0.84 1.15
ssw 22.26 0.41 -1.16 0.79
rgdph 20252.59 0.77 -0.71 596.03
trade 325.82 2.37 9.85 5.13
prop1564 22.66 -0.60 -0.87 0.63
prop65 15.13 0.23 -1.60 0.53
federal 1.00 1.86 1.46 0.04
eduger 77.06 -0.94 0.59 1.95
spropn 1.00 2.22 4.33 0.03
yearele 194.00 -2.13 5.16 4.09
yearreg 194.00 -1.85 3.28 4.46
seats 641.00 1.06 0.18 17.63
maj 1.00 0.45 -1.82 0.05
pres 1.00 0.45 -1.82 0.05
lyp 3.67 -0.24 -1.04 0.11
semi 1.00 2.52 4.39 0.03
majpar 1.00 1.12 -0.69 0.05
majpres 1.00 2.17 2.75 0.04
propres 1.00 1.08 -0.84 0.05
dem_age 194.00 -1.65 2.05 4.74
lat01 0.71 0.27 -1.10 0.02
age 0.97 1.65 2.05 0.02
polityIV 16.00 -1.62 2.20 0.41
spl 23.95 0.88 3.85 0.40
cpi9500 7.98 -0.49 -1.12 0.28
du_60ctry 1.00 -0.89 -1.23 0.05
magn 0.99 0.39 -1.64 0.04
sdm 0.99 0.93 -0.95 0.04
oecd.x 1.00 1.01 -0.98 0.05
mining_gdp 37.18 2.41 6.61 0.77
gini_8090 42.81 0.35 -0.80 1.23
con2150 1.00 2.52 4.39 0.03
con5180 1.00 0.89 -1.23 0.05
con81 1.00 0.02 -2.02 0.05
list 510.33 1.19 0.79 14.14
maj_bad 4.89 1.23 0.04 0.17
maj_gin 62.00 0.63 -1.30 2.47
maj_old 1.00 3.05 9.57 0.02
pres_bad 4.89 0.91 -0.81 0.18
pres_gin 62.00 0.71 -1.33 2.70
pres_old 1.00 4.26 18.75 0.02
propar 1.00 0.60 -1.65 0.05
lpop 9.02 -0.16 -0.32 0.25
continent 3.00 -0.72 -0.82 0.12
Or, alternatively (and preferably), we can view the table in the data viewer:
# prints contents of "pt_copy_summarystats1" to console
View(pt_copy_summarystats1)To learn more about the describe() function, or any function you use in R, recall that you can query the function’s documentation by printing a ? following by the function’s name in the console (or executing it within a script). For example:
?describeOne thing to note here is that the function defaults to setting na.rm=TRUE; under this setting, the describe() function simply omits NA (i.e. missing) values when calculating summary statistics. If, instead, na.rm=FALSE, the function deletes all rows that contain missing data, and calculates summary statistics with reference to this truncated dataset that has been excised of missing values. In practical terms, it will typically make the most sense to go with the default na.rm=TRUE option, but it’s important to be clear about these different options for handling missing data in the context of the describe() function. If you’re still struggling to understand the difference, we can consider a simple example with a made-up toy dataset. Let’s first create this dataset:
# creates toy dataset for illustrative purposes and assigns it to object named "df_sample"
df_sample<-data.frame(Country=c("A", "B", "C", "D"),
Variable1=c(10, NA, 15, 20),
Variable2=c(5, 8, 12, NA))
# prints contents of "df_sample"
df_sample Country Variable1 Variable2
1 A 10 5
2 B NA 8
3 C 15 12
4 D 20 NA
Let’s now generate a summary statistics table using describe() with the default na.rm=TRUE setting and assign it to an object named df_sample_summary_default:
# generates summary statistics with describe function and assigns it to "df_sample_summary_default"; na.rm=TRUE ignores NA values when calculating summary statistics
df_sample_summary_default<-describe(df_sample, na.rm=TRUE)Go ahead and view df_sample_summary_default in the data viewer:
View(df_sample_summary_default)Note for example, that the mean of “Variable1” is 15 (i.e. the average of 10, 15, and 20) while the mean of “Variable2” is 12.5 (i.e. the average of 5, 8, and 12). When computing these averages, the describe() function simply ignores “NA”, i.e. missing values. Now, let’s use describe() to generate a summary statistics table, but set na.rm=FALSE ; we’ll assign it to an object named df_sample_summary_B :
# generates summary statistics with describe function and assigns it to "df_sample_summary_B"; na.rm=FALSE removes rows with any NA values before calculating summary statistics (in other words, summary statistics are computed on rows with complete data)
df_sample_summary_B<-describe(df_sample, na.rm=FALSE)Go ahead and inspect the summary table using the View() function:
View(df_sample_summary_B)Note that the mean of “Variable1” is now 12.5 (the average of 10 and 15), because the na.rm=FALSE command removed both rows 1 and 3 in df_sample before performing the calculation, since both of these rows have an NA value in them. Similarly, the mean of “Variable2” is now 8.5 (the average of 5 and 12).
For the sake of argument, let’s say you want to only calculate summary statistics for columns that have no missing values. The describe() function doesn’t allow you to do this by specifying an argument, but we can carry out this task by writing a custom function. First, let’s write a function that checks whether a vector has any missing values; we’ll assign this object to a new object named check_na(). This function takes a column as an input, and returns a logical condition TRUE if there are any NA values in the column and FALSE if there are no NA values in the column:
# writes a function that takes a column as input and checks whether there are any NA values in that column; returns TRUE if there is, FALSE otherwise
check_na<-function(column){
any(is.na(column))
}Let’s test the function for a few columns in our dataset
check_na(pt_copy$country)[1] FALSE
check_na(pt_copy$graft)[1] TRUE
That worked. Now, let’s call on the discard() function, which is part of the purrr package we’re now familiar with; it removes a vector from a list if some condition is met. We’ll pass pt_copy and our check_na() function as arguments to it. In particular, the code below uses the discard() function to run the check_na() function across each column in pt_copy; if a column does have an NA value (i.e. evaluates to TRUE when the check_na() function is passed to it), it is “discarded”, but if it does not have an NA value (i.e. evaluates to FALSE when the check_na() function is passed to it), it is kept. We’re left with a version of pt_copy() without any columns that have NA values, which we’ll assign to a new object named pt_copy_NAcolumns_removed:
# discards columns in "pt_copy" that have missing data/NA Values, and assigns the resulting data frame to a new object named "pt_copy_NAcolumns_removed"
pt_copy_NAcolumns_removed<-discard(pt_copy, check_na)We can confirm that all of the columns with NA values have been removed by passing pt_copy_NAcolumns_removed to the View() function:
View(pt_copy_NAcolumns_removed)Now, we can pass pt_copy_NAcolumns_removed as an argument to the describe() function, and thereby generate a table of summary statistics for only variables without any missing observations. Below, we’ll assign this table of summary statistics to a new object named pt_copy_NAcolumns_removed_summarystats:
# passes "pt_copy_NAcolumns_removed" as an argument to "describe" function, which generates summary statistics only for variables without missing observations; the table of summary statistics is assigned to a new object named "pt_copy_NAcolumns_removed_summarystats"
pt_copy_NAcolumns_removed_summarystats<-describe(pt_copy_NAcolumns_removed)Let’s go ahead and view this table in the data viewer:
View(pt_copy_NAcolumns_removed_summarystats)Let’s say, for the sake of argument, that we will be working with a lot of datasets, and you will want to quickly generate tables of summary statistics that exclude variables that have missing values. We don’t want to go through these steps all the time, so we can wrap this code into a function that takes a dataset as an input, and returns a table of summary statistics that excludes information on variables with missing values:
# creates function to take dataset input and return a table of summary statistics only for variables that have no missing values
summary_stats_noNAcolumns<-function(dataset_input){
cleaned_dataset<-discard(dataset_input, check_na)
cleaned_summary_stats<-describe(cleaned_dataset)
return(cleaned_summary_stats)
}Now, we can simply pass a dataset as an argument to our custom summary_stats_noNAcolumns function, and it will return a summary statistics table with these desired specifications. Test this function out on pt_copy to confirm it works:
# passes "pt_copy" as argument to "summary_stats_noNAcolumns" function
summary_stats_noNAcolumns(pt_copy) vars n mean sd median trimmed mad min max range skew
oecd 1 85 0.29 0.46 0.00 0.25 0.00 0 1 1 0.89
country* 2 85 43.00 24.68 43.00 43.00 31.13 1 85 84 0.00
pind 3 85 0.46 0.47 0.42 0.45 0.62 0 1 1 0.13
pindo 4 85 0.61 0.46 0.90 0.63 0.15 0 1 1 -0.46
ctrycd 5 85 430.88 284.89 299.00 406.86 240.18 111 968 857 0.63
col_uk 6 85 0.35 0.48 0.00 0.32 0.00 0 1 1 0.60
kurtosis se
oecd -1.23 0.05
country* -1.24 2.68
pind -1.89 0.05
pindo -1.69 0.05
ctrycd -1.08 30.90
col_uk -1.65 0.05
For further confirmation, create a toy dataset and pass it as an argument to summary_stats_noNAcolumns() to ensure that it works as expected.
It might be the case that the summary statistics table created by describe() does not contain information you’d like, in which case it may be possible to customize it to provide that information. For example, it could be nice to have a column in the summary statistics table that provides information on the number of NA values there are for each variable. To get that information into the summary statistics table, we’ll first write a function to count the number of NA values in a vector:
# Define a named function to count missing values
count_missing <- function(x) {
sum(is.na(x))
}Next, we’ll apply this function to each column/variable in pt_copy, and deposit the number of missing values for each column in a vector, such that the number of missing values in the first column is deposited as the first element in the vector, the number of missing values in the second column is deposited as the second element in the vector, and so on; we’ll assign this vector to a new object named missing_values_vector:
# applies "count_missing" function to the columns of "pt_copy", and deposits the results (i.e. count of missing values in each column of "pt_copy") to a numeric vector assigned to the object "missing_values_vector"
missing_values_vector <- map_dbl(pt_copy, count_missing)Now, we’ll add this vector to pt_copy_summarystats1:
# adds "missing_values_vector" as a column named "missing" to "pt_copy_summarystats1" summary stats table
pt_copy_summarystats1$missing<-missing_values_vectorView pt_copy_summarystats1 in the data viewer to confirm that the column has been successfully added:
View(pt_copy_summarystats1)Below, we’ll learn another way of adding new columns to datasets that draws on the tidyverse; the syntax above is from base R.
Finally, there may be instances in which the summary statistics table produced by the describe() function provides unwanted information. For example, the summary statistics table provides summary statistics for categorical variables as well as numeric ones, but quantitative summary statistics for categorical information may not be meaningful, and simply clutter things. What if we want to remove categorical variables from the table of summary statistics?
We can first use the select_if() function from dplyr (a data-wrangling package that is part of the tidyverse suite) to select only the numeric columns in pt_copy. We’ll assign this “numeric only” version of pt_copy to a new object named pt_copy_numeric:
# selects columns from "pt_copy" that are numeric
pt_copy_numeric<-select_if(pt_copy, is.numeric)We can confirm that pt_copy_numeric only contains numeric variables:
# views "pt_copy_numeric" in data viewer
View(pt_copy_numeric)Before proceeding, it’s useful to note that there are other summary statistics packages that are worth exploring. For example, the summarytools package provides the descr() , which is similar to describe() from the psych package. The summarytools package defaults to providing information as a matrix, which we haven’t covered, so we pass descr(pt_copy) to the as.data.frame() function to get the summary statistics table output as a data frame.
# uses "descr" function from summarytools package to create a table of summary statistics as a data frame and assigns it to "pt_summary_ST"
pt_summary_ST<-as.data.frame(descr(pt_copy))# views "pt_copy_numeric" in data viewer
View(pt_summary_ST)Another handy package for generating summary statistics is stargazer, which is most useful for creating and exporting regression tables, but which can also produce a well-formatted summary statistics table that is ready to be exported. We can use stargazer to create a summary statistics table with the following (we’ll export it in a later section):
# uses stargazer package to generate summary statistics for pt_copy
stargazer(as.data.frame(pt_copy), type = "text")
======================================================
Statistic N Mean St. Dev. Min Max
------------------------------------------------------
oecd 85 0.294 0.458 0 1
pind 85 0.460 0.466 0.000 1.000
pindo 85 0.606 0.457 0.000 1.000
ctrycd 85 430.882 284.889 111 968
col_uk 85 0.353 0.481 0 1
t_indep 85 119.729 89.759 6 250
col_uka 85 0.282 0.393 0.000 0.928
col_espa 85 0.058 0.133 0.000 0.788
col_otha 85 0.218 0.360 0.000 0.976
legor_uk 85 0.388 0.490 0 1
legor_so 85 0.129 0.338 0 1
legor_fr 85 0.353 0.481 0 1
legor_ge 85 0.071 0.258 0 1
legor_sc 85 0.059 0.237 0 1
prot80 85 17.462 25.498 0.000 97.800
catho80 85 40.694 38.662 0.000 97.300
confu 85 0.071 0.258 0 1
avelf 85 0.286 0.263 0.000 0.836
govef 81 4.209 1.751 0.836 7.258
graft 81 4.169 1.891 0.742 6.917
logyl 74 9.226 0.902 6.954 10.476
loga 73 8.171 0.610 6.284 9.015
yrsopen 75 0.516 0.543 0.000 4.089
gadp 75 0.689 0.199 0.313 1.000
engfrac 78 0.138 0.318 0.000 1.000
eurfrac 78 0.398 0.437 0.000 1.004
frankrom 78 2.866 0.840 0.940 5.639
latitude 78 17.964 27.869 -36.892 63.892
gastil 85 2.445 1.229 1.000 4.889
cgexp 82 28.818 10.488 9.743 51.178
cgrev 78 26.488 10.120 8.923 50.849
ssw 71 8.148 6.674 0.129 22.385
rgdph 85 6,688.634 5,495.171 530.223 20,782.810
trade 85 78.766 47.339 17.562 343.387
prop1564 84 62.068 5.764 49.048 71.703
prop65 84 8.447 4.885 2.296 17.430
federal 83 0.157 0.366 0 1
eduger 82 88.580 17.698 40.050 117.114
spropn 77 0.126 0.245 0.000 1.000
yearele 81 1,965.547 36.852 1,800.000 1,994.000
yearreg 81 1,961.481 40.162 1,800 1,994
seats 85 215.446 162.537 15.000 656.000
maj 85 0.388 0.490 0 1
pres 85 0.388 0.490 0 1
lyp 85 8.406 0.969 6.273 9.942
semi 85 0.106 0.310 0 1
majpar 85 0.250 0.422 0.000 1.000
majpres 85 0.129 0.334 0.000 1.000
propres 85 0.260 0.440 0.000 1.000
dem_age 85 1,958.341 43.737 1,800 1,994
lat01 78 0.316 0.187 0.003 0.710
age 85 0.208 0.219 0.030 1.000
polityIV 80 7.172 3.643 -6.000 10.000
spl 74 -2.177 3.479 -11.362 12.589
cpi9500 71 4.812 2.382 0.268 8.250
du_60ctry 85 0.706 0.458 0 1
magn 84 0.470 0.395 0.007 1.000
sdm 77 0.355 0.386 0.008 1.000
oecd.x 85 0.271 0.447 0 1
mining_gdp 77 4.256 6.717 0.018 37.198
gini_8090 72 39.202 10.412 19.490 62.300
con2150 85 0.106 0.310 0 1
con5180 85 0.294 0.458 0 1
con81 85 0.494 0.503 0 1
list 84 114.479 129.555 0.000 510.333
maj_bad 85 1.062 1.598 0.000 4.889
maj_gin 72 16.401 20.961 0.000 62.000
maj_old 85 0.084 0.186 0.000 1.000
pres_bad 85 1.211 1.689 0.000 4.889
pres_gin 72 16.663 22.900 0.000 62.000
pres_old 85 0.061 0.163 0.000 1.000
propar 85 0.353 0.481 0 1
lpop 60 15.897 1.917 11.610 20.632
------------------------------------------------------
4.3 Frequency Tables and Crosstabs
For qualitative data, it can be especially useful to generate frequency tables, which we can do with the freq() function:
# creates frequency table for the continent variable
freq(pt_copy$continent)Frequencies
pt_copy$continent
Type: Character
Freq % Valid % Valid Cum. % Total % Total Cum.
------------ ------ --------- -------------- --------- --------------
africa 11 12.94 12.94 12.94 12.94
asiae 13 15.29 28.24 15.29 28.24
laam 23 27.06 55.29 27.06 55.29
other 38 44.71 100.00 44.71 100.00
<NA> 0 0.00 100.00
Total 85 100.00 100.00 100.00 100.00
Cross-tabs, which show the frequency distribution of two or more categorical variables, are another useful way to begin exploring your data; one useful function for generating crosstabs in R is the table() function, that takes dataset columns as arguments.
# creates crosstab with continent and federal variables
table(pt_copy$continent, pt_copy$federal)
0 1
africa 11 0
asiae 11 2
laam 19 4
other 29 7
The cross tab above quickly tells us, for example, that countries in the “asiae” category (Asia and Europe) have 2 countries with a federal structure of government and 11 without it, while countries in the “laam” category (Latin America) have 4 countries with a federal structure of government and 19 without it.
4.4 Group-Level Summary Statistics
While having a simple table of summary statistics is a useful starting point, it is often useful to generate group-level summary statistics, where summary statistics are presented for different subgroups in the dataset. One way to generate group summary statistics is to use the describeBy() function (also from the psych package), where the first argument is the data frame you would like to generate group-level summary statistics for, and the second argument is the column that contains the relevant groups. Below, we generate summary statistics for pt_copy parsed out by the different continents in the “continent” column. The expression pt_copy$continent indicates that the groups with respect to which we want to calcualte the summary statistics is in the “continent” column of the pt_copy data frame. More generally, we can explicitly refer to columns in an R data frame using this dollar-sign notation, where the expression before the dollar sign refers to the data frame object, and the expression after refers to the name of the column.
The describeBy() function will produce a list that contains summary statistics for different groups as list elements. Below, we’ll assign the list of group summary statistics to a new object named summary_stats_by_continent:
# Creates summary statistics for each continent grouping, and puts results in list named "summary_stats_by_continent"
summary_stats_by_continent<-describeBy(pt_copy, pt_copy$continent)Now, let’s say we want to extract the summary statistics for Africa, one of the continent categories in the “continent” column. We can do so using the double-bracket notation we discussed above:
# Accessing continent-level summary statistics for africa from the "summary_stats_by_continent" list
summary_stats_by_continent[["africa"]] vars n mean sd median trimmed mad min max range skew kurtosis
oecd 1 11 0.00 0.00 0 0.00 0.00 0 0 0 NaN NaN
country 2 11 53.36 24.45 49 54.56 28.17 11 85 74 -0.17 -1.41
pind 3 11 0.77 0.42 1 0.83 0.00 0 1 1 -1.06 -0.79
pindo 4 11 0.77 0.42 1 0.83 0.00 0 1 1 -1.06 -0.79
ctrycd 5 11 647.55 154.90 684 685.56 56.34 199 754 555 -2.13 3.44
col_uk 6 11 0.82 0.40 1 0.89 0.00 0 1 1 -1.43 0.08
se
oecd 0.00
country 7.37
pind 0.13
pindo 0.13
ctrycd 46.70
col_uk 0.12
Recall that we can assign list elements that we extract from a list to their own object, which allows us to conveniently retrieve it whenever it is needed. Below, we’ll assign the summary statistics for Africa to a new object named africa_summary:
# Group-level summary statistics can be assigned to their own object for easy retrieval
africa_summary<-summary_stats_by_continent[["africa"]]Another convenient way to extract group-level statistics is with the dplyr package’s group_by() function. First, we’ll run the code below, and assign it to a new object named trade_age_by_continent:
# Generate a table that displays summary statistics for trade at the continent level and assign to object named "trade_age_by_continent"
trade_age_by_continent<-pt_copy %>%
group_by(continent) %>%
summarise(meanTrade=mean(trade),sdTrade=sd(trade),
meanAge=mean(age), sdAge=sd(age),
n=n())Let’s now print the contents of trade_age_by_continent:
# prints contents of "trade_age_by_continent"
trade_age_by_continent# A tibble: 4 × 6
continent meanTrade sdTrade meanAge sdAge n
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 africa 77.3 32.1 0.121 0.124 11
2 asiae 97.8 84.6 0.110 0.0846 13
3 laam 68.6 32.8 0.139 0.153 23
4 other 78.8 40.7 0.309 0.263 38
Let’s now unpack the code that created this table. We started with the pt_copy data frame, and then used group_by(continent) to declare that subsequent calculations should be performed at the continent-level; then, within the summarise() function, we defined the column names we wanted to use in the group-level summary table, and how those variables are to be calculated. For example, meanTrade=mean(trade) indicates that we want the first column to be named “meanTrade”, which is to be calculated by taking the mean of the “trade” variable for each continent grouping. After that, sdTrade=sd(trade) indicates that we want the second column to be named “sdTrade”, which is to be calculated by taking the standard deviation of the “trade” variable for each continent grouping. And so on. Note that n=n() indicates that we want the final column, named “n”, to provide information about the number of observations in each continent-level grouping.
You might have noticed a mysterious symbol in the above code that comes immediately after pt_copy, and immediately after group_by(continent). This symbol is known as a “pipe” (%>%). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Above, the pipe takes the contents of pt_copy on its left, and then feeds this data into the group_by() function on the right; then, after grouping the data by continent, it feeds this grouped data on its left into the summarise() function on its right. We will use the pipe operator throughout the lesson to chain together functions in this manner.
4.5 Data Cleaning and Wrangling
After getting a sense of your data by computing some summary statistics and running some crosstabs, you’ll often have a sense of how you would like to clean or transform your data for analysis. This section briefly describes some functions that are useful for these basic data-preparation and wrangling tasks. Most of these functions are from the tidyverse’s dplyr package.
4.5.1 Rearranging Columns
We can manipulate the order of the columns in a dataset using the relocate function. For example, the code below uses the relocate() function to shift the “country” column to the front of the dataset, and then assigns this change back to pt_copy to update the object:
# bring the "country" column to the front of the dataset
pt_copy<-pt_copy %>% relocate(country)Go ahead and confirm that the change has been implemented by viewing pt_copy in the data viewer:
# Views "pt_copy" in data viewer
View(pt_copy)We can specify more than one argument to the relocate function. For example, in the code below, passing the “country”, “list”, “trade”, and “oecd” variables/columns to the relocate() function will make “country” the first column, “list” the second column, “trade” the third column, and so on.
# bring the "country", "list", "trade", "oecd" columns to the front of the dataset
pt_copy<-pt_copy %>% relocate(country, list, trade, oecd)Let’s go ahead and view the updated dataset in the R Studio Viewer:
# Views updated "pt_copy" data frame in data viewer
View(pt_copy)4.5.2 Renaming Variables
In order to rename variables, we can use the rename() function, as below. The code below renames the existing “list” variable to “party_list”, which is more descriptive, and assigns the change back to the pt_copy object.
## Renaming a variable (renames "list" to "party_list")
pt_copy<-pt_copy %>% rename(party_list=list)Let’s check the pt_copy data frame in the viewer to ensure that the change has been made.
# Views updated "pt_copy" data frame in data viewer
View(pt_copy)You can rename more than one variable at a time using the rename() function; simply separate the arguments by a comma. For example, if we wanted to also change the variable currently named “ctrycd” to “country_code”, in addition to changing “list” to “party_list”, we could use the following expression: pt_copy<-pt_copy %>% rename(party_list=list, country_code=ctrycd)
4.5.3 Sorting Datasets With Respect to Variables
It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the pt_copy data frame in ascending order with respect to the “trade” variable using the arrange() function:
# sorting in ascending (low to high) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(trade)If, instead, you want to sort the dataset in descending order with respect to the “trade” variable, pass the name of the variable to the desc() function within the arrange() function, as below:
# sorting in descending (high to low) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(desc(trade))Note that it’s also possible to pass several arguments to the arrange function, and thereby sort a dataset with respect to multiple variables; for example, the code below first relocates the “continent” column and the “trade” column towards the front of the dataset (just after “Country), then sorts the dataset by continent, and then further sorts it in descending order by trade. This allows us to quickly note the country in each continent grouping with the highest value on the”trade” variable.
# takes the "pt_copy" dataset, relocates "country", "continent", and "trade" to the front of the dataset, and then arranges the dataset based on the "continent" variable, then in descending order with respect to the "trade" variable
pt_copy<-pt_copy %>%
relocate(country, continent, trade) %>%
arrange(continent, desc(trade))Let’s go ahead and see what this looks like using the Viewer:
# Views updated "pt_copy" data frame in data viewer
View(pt_copy)Sorting a dataset with respect to more than one variable can be especially useful in certain contexts, particularly in cases where you’re dealing with nested data. For example, in a time series dataset, it can be useful to first sort by year, then by months within the year. In a dataset with regional information, it could be helpful to first sort by regional, then by cities within those regional groupings.
4.5.4 Creating New Variables Based on Existing Variables
Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. To do so, we can use dplyr’s mutate() function. The code below, for example, uses the mutate() function to create a new variable, named “non_catholic_80”, that is computed by subtracting the existing “catho80” variable from 100; for convenience, the “country”, “catho80”, and newly created “non_catholic_80” variables are all moved to the front of the dataset using the relocate() function:
# Create new variable named "non_catholic_80" that is calculated by substracting the Catholic share of the population in 1980 ("catho80") from 100 and relocates "country", "catho80", and the newly created "non_catholic_80" to the front of the dataset
pt_copy<-pt_copy %>% mutate(non_catholic_80=100-catho80) %>%
relocate(country, catho80, non_catholic_80)We can view the updated pt_copy data frame to confirm that the new variable has been created:
# Views updated "pt_copy" data frame in data viewer
View(pt_copy)Note that it’s possible to define more than one new variable at a time using mutate(); simply separate the arguments by a comma within the mutate() function.
4.5.5 Selecting or Deleting Variables
Sometimes, we will have a dataset with many variables, and to make things more tractable, we’ll want to select only the variables that are relevant to our analysis. we can explicitly select desired variables using the select() function from dplyr. The code below selects the “country”, “cgexp”, “cgrev”, “trade”, and “federal” columns from pt_copy, and then assigns this selection to a new object named pt_copy_selection:
# Selects "country", "cgexp", "cgrev", and "trade" variables from the "pt_copy" dataset and assigns the selection to a new object named "pt_copy_selection"
pt_copy_selection<-pt_copy %>%
select(country, cgexp, cgrev, trade, federal)When we view the pt_copy_selection object in the data viewer, we’ll see that we now have a new data frame that consists only of these variables:
# views "pt_copy_selection" in data viewer
View(pt_copy_selection)Sometimes, it could make more sense to directly delete columns, instead of deciding on which ones to keep or select. For example, the code below deletes the “federal” variable from pt_copy_selection by passing it to the select() column with a “-” in front of it.
# deletes "federal" variable from "pt_copy_selection"
pt_copy_selection %>% select(-federal)# A tibble: 85 × 4
country cgexp cgrev trade
<chr> <dbl> <dbl> <dbl>
1 Mauritius 22.5 21.6 128.
2 Gambia 24.4 23.8 122.
3 Namibia 37.3 33.2 115.
4 Botswana 38.6 50.8 87.5
5 Zambia 25.2 19.7 73.8
6 Zimbabwe 31.2 25.6 69.2
7 Senegal NA NA 65.4
8 Malawi 25.8 NA 62.5
9 Ghana 19.0 17.2 53.5
10 South Africa 31.3 24.1 43.9
# ℹ 75 more rows
If you want to delete multiple columns, simply specify the columns in a vector, preceded by a minus sign, that is passed to the select() function. The code below, for instance, takes the existing pt_copy_selection data frame, deletes the “federal” and “trade” columns, and assigns the result to a new object named pt_copy_selection_modified:
# deletes "federal" and "trade" from "pt_copy_selection" and assigns it to new object named "pt_copy_selection_modified"
pt_copy_selection_modified<-pt_copy_selection %>% select(-c(federal, trade))Check the pt_copy_selection_modified data frame in the data viewer to confirm these changes:
# views "pt_copy_selection_modified" in data viewer
View(pt_copy_selection_modified)4.5.6 Recoding Variables
“Recoding” a variable refers to the process of taking an existing variable, and generating new variable(s) that represent the information from that original variable in a new way. Below, we’ll consider some common recoding operations.
4.5.7 Creating Dummy Variables from Continuous Numeric Variables
You may sometimes have a continuous numeric variable, but want to create a new dummy variable (a variable that takes on the value of 1 if a given condition is met, and 0 otherwise) based on that numeric variable. For example, let’s say we want to create a new variable, named “trade_open” that takes on the value of 1 if the trade variable is greater than or equal to 77, and 0 otherwise. We can generate this new dummy variable using the mutate() function; within the mutate() function below, we specify that we want to create a new variable named “trade_open”; the ifelse() function specifies the desired condition (trade>=77), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value the new “trade_open” variable is to take if the condition is not met (0). In other words, we can translate ifelse(trade>=77, 1, 0) to “if trade>=77, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “pt_copy”:
# Creates a new dummy variable based on the existing "trade" variable named "trade_open" (which takes on a value of "1" if "trade" is greater than or equal to 77, and 0 otherwise) and then moves the newly created variable to the front of the dataset along with "country" and "trade"; all changes are assigned to "pt_copy", thereby overwriting the existing version of "pt_copy"
pt_copy<-pt_copy %>% mutate(trade_open=ifelse(trade>=77, 1, 0)) %>%
relocate(country, trade_open, trade)View the data frame to ensure that the new variable “trade_open”, recoded based on “trade”, has been created:
4.5.8 Creating categorical variables from continuous numeric variables
Sometimes, we will want to create a variable that contains categories or classifications that derive from numeric thresholds of an existing variable. For instance, let’s say we want to take the existing “trade” variable, and define a new variable named “trade_level”, which is set to “Low Trade” when the “trade” variable is greater than 15 and less than 50; “Intermediate_Trade” when the “trade” variable is greater than or equal to 50 and less than 100; and “High_Trade” when the “trade” variable is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate() function, and the case_when() function that maps the conditions onto the desired variable values for “trade_level” using the following syntax:
# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low Trade" when the "trade" variable is greater than 15 and less than 50, coded as "Intermediate Trade" when "trade" is greater than or equal to 50 and less than 100, and coded as "High TradE" when "trade" is greater than or equal to 100), and then reorders the dataset such that "country", "trade_level", and "trade" are the first three variables in the dataset
pt_copy<-pt_copy %>% mutate(trade_level=case_when(trade>15 & trade<50~"Low_Trade",
trade>=50 & trade<100~"Intermediate_Trade",
trade>=100~"High_Trade")) %>%
relocate(country, trade_level, trade)Check to see that the new “trade_level” variable has indeed been created in pt_copy according to the specifications above:
# views updated "pt_copy" data frame in data viewer
View(pt_copy)4.5.9 Creating dummy variables from categorical variables
Sometimes, we may have a categorical variable in a dataset, and want to create dummy variables based on those categories. For example, consider the “trade_level” variable we created above. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols() function. Below, we simply take the existing pt_copy dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols() function:
# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset
pt_copy<-pt_copy %>%
dummy_cols("trade_level") %>%
relocate(country, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade)Let’s now view the updated pt_copy data frame, with the newly created dummy variables:
# views updated "pt_copy" data frame in data viewer
View(pt_copy)You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on.
4.5.10 Subsetting (Filtering) Variables
We will often want to subset, or “filter” our datasets to extract observations that meet specified criteria. The dplyr packages allows us to carry out these subsetting operations with a function called filter(), which takes various logical conditions as arguments. Let’s say, for example, that we want to extract all of the OECD country observations from the pt_copy dataset. The “oecd” variable in pt_copy is equal to 1, for all OECD countries, and 0 for non-OECD countries. By passing the condition oecd==1 to the filter() function, we can extract all OECD observations. We’ll assign this data subset to a new object named oecd_countries, and view it in the data viewer:
# Extracts OECD observations in "pt_copy" and assigns to object named "oecd_countries"
oecd_countries<-pt_copy %>% filter(oecd==1) %>%
relocate(country, oecd)# views "oecd_countries" in data viewer
View(oecd_countries)Let’s take another example. Let’s use the filter() function to extract all observations for which the “cgrev” (central government revenue as a share of GDP) exceeds 40. We’ll assign the observations that satisfy this condition to a new object named high_revenues:
# Extracts observations for which cgrev (central government revenue as % of gdp)>40, and assigns to object named "high_revenues"
high_revenues<-pt_copy %>% filter(cgrev>40) %>%
relocate(country, cgrev)# Views "high_revenues" in data viewer
View(high_revenues) Let’s try another example. Let’s subset observations from pt_copy for which the Catholic share of the population in 1980 (“catho80”) is less than or equal to 50, and assign the filtered data to a new object named minority_catholic:
# Extracts observations for which the "catho80" variable is less than or equal to 50
minority_catholic<-pt_copy %>% filter(catho80<=50) %>%
relocate(country, catho80)# Views "minority_catholic" in the data viewer
View(minority_catholic)It is also possible to chain together multiple conditions as arguments to the filter() function. For example, if we want to subset observations from OECD countries that also have a federal political structure, we can use the “&” operator to specify these two conditions; we’ll assign the filtered dataset to a new object named oecd_federal_countries:
# Extracts federal OECD countries (where oecd=1 AND federal=1) and assigns to a new object named "oecd_federal_countries"
oecd_federal_countries<-pt_copy %>% filter(oecd==1 & federal==1) %>%
relocate(country, oecd, federal)# Views "oecd_federal_countries" in data viewer
View(oecd_federal_countries)We can use a vertical line (|) to specify “or” conditions. For example, the code below subsets observations from countries in Africa OR countries in Asia/Europe, and assigns the subsetted data to a new object named asia_europe_africa:
# Extracts observations that are in Africa ("africa") OR in Asia/Europe ("asiae) and assigns to an object named "asia_europe_africa"
asia_europe_africa<-pt_copy %>% filter(continent=="africa"|continent=="asiae") %>%
relocate(continent)# views "asia_europe_africa" in data viewer
View(asia_europe_africa)It is also useful to know how to subset datasets to extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by a “!=”. For example, if we wanted to extract observations from pt_copy where the “continent” variable is NOT equal to “africa”, and assign the result to a new object named pt_copy_sans_africa, we can write the following:
# Extracts all non-Africa observations and assigns to object named "pt_copy_sans_africa"
pt_copy_sans_africa<-pt_copy %>% filter(continent!="africa") %>% relocate(continent)# views pt_copy_sans_africa in the data viewer
View(pt_copy_sans_africa)5 Exercises
Exercise 1
Write a function that takes a dataset as an input argument, and returns a table of summary statistics that includes a column containing information on the number of missing values/NA values associated with each variable
Exercise 2
Earlier, we used the following to drop rows in the student_scores dataset that have NA values associated with the “Age” variable: drop_na(student_scores, Age). Rewrite that code using the %>% operator.
Exercise 3
Apply the filter() AND the select() function to one of the datasets we’ve worked with in this lesson (chaining together operations with the %>% operator), and assign the modified dataset to a new object that you write out to a local directory on your computer as a CSV file.