Processing and Wrangling Data in R

Author

Aditya Ranganath

1 Introduction

In the past two sessions, we presented some ideas, concepts, and tools that provide a basic foundation for working with data in R. Now that we have this basic foundation, we’ll turn in this session to a more applied exploration of some actual datasets. Our goal here is to introduce you to some useful functions that will allow you to explore and begin making sense of actual datasets in R. This lesson will provide a tour of various functions that can be particularly helpful as you get started cleaning, and wrangling data in R. Among the topics we’ll cover today are:

  • Loading data into R

  • Handling missing data

  • Summary Statistics in R

  • Data cleaning and processing

We will be working with a handful of datasets, which you will need to download. We will primarily be working with a crossnational dataset collected and made public by the economists Torsten Persson and Guido Tabellini; it’s one of the datasets they work with in their highly influential book, entitled “The Economic Effects of Constitutions.” The codebook for the dataset is provided along with the data in the downloaded materials. We’ll also be working with some datasets collected from the World Bank’s World Development Indicators next class, and these datasets are included in the data download.

2 Installing Packages and Loading Libraries

Please go ahead and install any of the packages below that you have not already installed.

# Install packages using the "install.packages" function
install.packages("tidyverse")
install.packages("psych")
install.packages("janitor")
install.packages("fastDummies")
install.packages("stargazer")
install.packages("gtsummary")
install.packages("ggeffects")

Note that when you need to install a lot of packages at once, you can put the names of the packages you’d like to install into a character vector and pass it to the install.packages() function:

# Installing multiple packages by passing package names as strings to the "install.packages" function
install.packages(c("tidyverse", "psych", "janitor", "fastDummies"))

Remember to load the required libraries (even packages you’ve already installed before).

# load libraries
library(tidyverse)
library(psych)
library(janitor)
library(fastDummies)
library(janitor)
library(summarytools)
library(stargazer)
library(gtsummary)
library(ggeffects)

3 Data Transfer Part 1: Reading Data into R

If you haven’t already, please download the data for this session. Deposit the data into a local directory that you will use for the workshop.

3.1 Reading in Local Files

Though it’s not strictly necessary, it’s useful to begin by setting your working directory to the location on your computer where the data is stored. The easiest way to do this is to go to the R Studio menu: click Session, then click Set Working Directory, then click Choose Directory. We can also set the working directory programmatically using the setwd() function.

We can now pass the name of file and its extension in quotation marks to the read_csv() function (since the data we want to load is a CSV file). We’ll assign it to an object named pt (after Persson and Tabellini):

# reads in the workshop dataset (Persson and Tabellini cross-national dataset) by passing the file path as an argument to the "read_csv" and assigns it to a new object named "pt"
pt<-read_csv("data/pt/persson_tabellini_workshop.csv")
Rows: 85 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, continent
dbl (73): oecd, pind, pindo, ctrycd, col_uk, t_indep, col_uka, col_espa, col...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Go ahead and view pt in the R Studio viewer.

# Views the dataset in the R Studio data viewer
View(pt)

3.2 Reading in Data from Cloud Storage

As you continue working in R, it may be helpful to read data in directly from cloud storage, rather than your local machine. It may take some effort to figure out how to do this (it will vary depending on your cloud storage provider) but it can be as simple as passing the link to your data to an R function. For example, an alternative way of reading the PT data into R is to paste the Dropbox link associated with the dataset to the read_csv() function. If you’re reading in files from Dropbox, change the “0” at the end of the url to a 1 before reading it into R.

read_csv("https://www.dropbox.com/scl/fo/2aianyhhdx92auh9aa6l2/AFfOXirAyGDi-xPC2c_gcDw/data/pt/persson_tabellini_workshop.csv?rlkey=ilmv8do8gcpmd8m43tbw2wxjp&e=1&dl=1")
Rows: 85 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, continent
dbl (73): oecd, pind, pindo, ctrycd, col_uk, t_indep, col_uka, col_espa, col...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 85 × 75
    oecd country     pind pindo ctrycd col_uk t_indep col_uka col_espa col_otha
   <dbl> <chr>      <dbl> <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
 1     0 Argentina  0     0        213      0     183   0        0.268    0    
 2     1 Australia  1     1        193      1      98   0.608    0        0    
 3     1 Austria    0     0        122      0     250   0        0        0    
 4     0 Bahamas    1     1        313      1      26   0.896    0        0    
 5     0 Bangladesh 1     1        513      0      28   0        0        0.888
 6     0 Barbados   1     1        316      1      33   0.868    0        0    
 7     0 Belarus    1     1        913      0       8   0        0        0.968
 8     1 Belgium    0     0        124      0     169   0        0        0.324
 9     0 Belize     1     1        339      1      18   0.928    0        0    
10     0 Bolivia    0.116 0.116    218      0     174   0        0.304    0    
# ℹ 75 more rows
# ℹ 65 more variables: legor_uk <dbl>, legor_so <dbl>, legor_fr <dbl>,
#   legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>, catho80 <dbl>, confu <dbl>,
#   avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>, loga <dbl>,
#   yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>,
#   latitude <dbl>, gastil <dbl>, cgexp <dbl>, cgrev <dbl>, ssw <dbl>,
#   rgdph <dbl>, trade <dbl>, prop1564 <dbl>, prop65 <dbl>, federal <dbl>, …

4 Processing, Wrangling, and Analyzing Numeric Data

We’ll start by making a copy of the pt object, by assigning it to a new object named pt_copy. We’ll use pt_copy when exploring the dataset, which ensures that we do not make inadvertent changes to our original pt data frame, and can always refer back to it when needed. Keeping a “clean” version of the data, and carrying out analysis tasks on a copy of this dataset, is good data management practice.

# makes a copy of "pt", called "pt_copy" that we can use for processing and analysis; keeps the original data frame, "pt" untouched
pt_copy<-pt

4.1 Missing Data

Before proceeding, it’s useful to have an orientation to how R handles missing data, which is a common feature of most social science datasets (including the data we’re working with today). Let’s work with a simple toy dataset for now; as an exercise, you’ll be asked to perform analogous operations on the pt_copy dataset

# makes toy dataset, assigned to object named "student_scores"
student_scores<-data.frame(ID=c(1:5),
                           Age=c(25, NA, 30, 22, NA),
                           Score=c(85, 90, NA, 78, 88))

If we want to identify missing data, we can do so with the is.na() function, which returns a logical matrix with the value TRUE for missing values:

# uses "is.na" to return a logical matrix indicating missing values (TRUE for missing values)
is.na(student_scores)
        ID   Age Score
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE  TRUE FALSE

If we want the total number of missing values in the dataset, we can take the sum of is.na(student_scores) :

# calculates total number of missing values in "student_scores"
sum(is.na(student_scores))
[1] 3

If we want the total number of missing values per column, we can pass is.na(student_scores) as an argument to the colSum() function, which we can do because in R, logical values can be used in numeric operations; when they’re used in this way, TRUE=1 and FALSE=0.

# calculates total number of missing values per column
colSums(is.na(student_scores))
   ID   Age Score 
    0     2     1 

If we wanted to calculate the percentage of missing data in the dataset, we could use our count of the number of NA values in the dataset sum(is.na(student_scores)) and divide it by the total number of cells in the dataset, derived by multiplying the dataset’s rows and columns:

# calculates missing data percentage in "student_scores"
# first calculates count of missing values and assigns it to "total_missing"
total_missing<-sum(is.na(student_scores))
# calculates total number of cells and assigns it to "total_values"
total_values<-prod(dim(student_scores))
# calculates percentage of missing data and assigns it to "missing_pct"
missing_pct<-(total_missing/total_values)*100

# prints contents of "missing_pct"
missing_pct
[1] 20

Calculating the percentage of data missing in a dataset can be useful, and something we may want to do frequently, so it could be useful to wrap this code into a function, which we’ll do below, and assign to a new object named missing_data_percentage():

# creates function to calculate the percentage of missing data in a dataset
missing_data_percentage<-function(dataset){
  # generates count of missing values
  total_missing<-sum(is.na(dataset))
  # calculates total number of cells
  total_values<-prod(dim(dataset))
  # calculates percentage of missing data
missing_pct<-(total_missing/total_values)*100
return(missing_pct)
}

Let’s test out the function on student_scores:

# passes "student_scores" as an argument to custom function "missing_data_percentage" which yields the percentage of missing data in the "student_scores" dataset
missing_data_percentage(student_scores)
[1] 20

We can now use this function for any other datasets we might create or work with, without having to retype a bunch of code. For example, if we wanted to calculate the percentage of missing data in pt_copy, we could pass it as an argument to missing_data_percentage:

# calculates missing data percentage in pt_copy
missing_data_percentage(pt_copy)
[1] 3.843137

It could also be useful to get a sense of the missing data percentage per column. We can do so with the following:

# calculates percentage of missing data per column in "student_scores" and assigns the resulting vector to an object named "missing_pct_per_col"
missing_pct_per_col<-colSums(is.na(student_scores))/nrow(student_scores)*100

# prints contents of "missing_pct_per_col"
missing_pct_per_col
   ID   Age Score 
    0    40    20 

If we want to remove missing data, the drop_na() function from the tidyverse is useful. The drop_na() function will remove all rows from the dataset with NA values in any column:

# removes all rows with NA values from "student_scores"
drop_na(student_scores)
  ID Age Score
1  1  25    85
2  4  22    78

The default behavior of the drop_na() function is to drop all rows with NA values in any column (i.e. only rows with complete data are kept in the dataset). However, it’s possible to change this default behavior by specifying column/variable names after passing the data frame object as an argument; when we do this, drop_na() no longer drops all rows with missing data in any column, but all rows with missing data in any specified column. For example, the code below drops all rows where “Age” values are NA, but not all rows where “Score” scores are NA

# removes all rows where "Age" has NA values in "student_scores"
drop_na(student_scores, Age)
  ID Age Score
1  1  25    85
2  3  30    NA
3  4  22    78

If, instead, we specify “Score” as an argument, the function drops all rows where “Score” values are NA, but not all rows where “Age” values are NA.

# removes all rows where "Score" has NA values in "student_scores"
drop_na(student_scores, Score)
  ID Age Score
1  1  25    85
2  2  NA    90
3  4  22    78
4  5  NA    88

Note that we can specify more than one column argument to drop_na(), in which case rows which have NA values associated with those specified columns will be dropped, but rows which have NA values with other (non-specified) columns are kept.

Imputing missing values is a complex topic beyond the scope of our workshop, but it’s worth briefly noting that for whatever reason, NA values may be incorrectly coded, or placeholders for other values. For example, you may know from a dataset’s codebook, or from context, that for a particular column in the dataset, observations that appear as missing/NA should actually be coded as “0”. We can use the replace_na() function to replace such NA values with their proper values. For example, let’s say that NA values in the “Score” column should actually be “0” (perhaps NA was being used as a placeholder, and needs to be converted to 0 since the student never turned in their assignment). In the replace_na() function, the first argument is the name of the data frame object, while the second is a list that indicates the column(s) with NA values that need replacement, and a specification for what they will be replaced with. In our case, the syntax looks something like this:

# replace NA values in the "Score" column with 0
replace_na(student_scores, list(Score=0))
  ID Age Score
1  1  25    85
2  2  NA    90
3  3  30     0
4  4  22    78
5  5  NA    88

Finally, it’s important to note that functions like mean() and sum() will return NA if there are missing values. For example:

# calculates mean of "Score" (NA values are not excluded; default behavior)
mean(student_scores$Score)
[1] NA
# calculates mean of "Score" (NA values are excluded due to na.rm=TRUE specification; as a result, the function computes an average based on non-NA values)
mean(student_scores$Score, na.rm=TRUE)
[1] 85.25

When working in R, different packages or functions may handle missing data differently, so it’s useful to consult the documentation of the packages you use for more information (especially when something is not working as expected).

4.2 Summary Statistics

Once we have a dataset loaded into R, one of the first things we’ll likely want to do after taking a look at the dataset in the viewer with the View() function is to get a sense of are data by calculating some summary statistics. This section reviews some summary statistics functions and tools we can call on for this purpose.

4.2.1 Summary Statistics Tables

A quick way to generate a table of summary statistics is to use the describe() function from the psych package. Below, we’ll generate summary statistics for the pt_copy dataset by passing pt_copy as an argument to the describe() function, and assign the resulting table of summary statistics to a new object named pt_copy_summarystats1.

# Generate summary statistics for "pt_copy" and assign table of summary statistics to a new object named "pt_copy_summarystats1"
pt_copy_summarystats1<-describe(pt_copy)

Let’s go ahead and view the table of summary statistics. We can print the object to our console:

# prints contents of "pt_copy_summarystats1" to console
pt_copy_summarystats1
           vars  n    mean      sd  median trimmed     mad     min      max
oecd          1 85    0.29    0.46    0.00    0.25    0.00    0.00     1.00
country       2 85   43.00   24.68   43.00   43.00   31.13    1.00    85.00
pind          3 85    0.46    0.47    0.42    0.45    0.62    0.00     1.00
pindo         4 85    0.61    0.46    0.90    0.63    0.15    0.00     1.00
ctrycd        5 85  430.88  284.89  299.00  406.86  240.18  111.00   968.00
col_uk        6 85    0.35    0.48    0.00    0.32    0.00    0.00     1.00
t_indep       7 85  119.73   89.76   92.00  117.48  114.16    6.00   250.00
col_uka       8 85    0.28    0.39    0.00    0.24    0.00    0.00     0.93
col_espa      9 85    0.06    0.13    0.00    0.03    0.00    0.00     0.79
col_otha     10 85    0.22    0.36    0.00    0.16    0.00    0.00     0.98
legor_uk     11 85    0.39    0.49    0.00    0.36    0.00    0.00     1.00
legor_so     12 85    0.13    0.34    0.00    0.04    0.00    0.00     1.00
legor_fr     13 85    0.35    0.48    0.00    0.32    0.00    0.00     1.00
legor_ge     14 85    0.07    0.26    0.00    0.00    0.00    0.00     1.00
legor_sc     15 85    0.06    0.24    0.00    0.00    0.00    0.00     1.00
prot80       16 85   17.46   25.50    2.70   12.24    3.85    0.00    97.80
catho80      17 85   40.69   38.66   27.60   38.92   39.88    0.00    97.30
confu        18 85    0.07    0.26    0.00    0.00    0.00    0.00     1.00
avelf        19 85    0.29    0.26    0.18    0.26    0.21    0.00     0.84
govef        20 81    4.21    1.75    4.48    4.25    1.99    0.84     7.26
graft        21 81    4.17    1.89    4.24    4.26    2.18    0.74     6.92
logyl        22 74    9.23    0.90    9.29    9.30    1.15    6.95    10.48
loga         23 73    8.17    0.61    8.37    8.23    0.58    6.28     9.02
yrsopen      24 75    0.52    0.54    0.42    0.47    0.53    0.00     4.09
gadp         25 75    0.69    0.20    0.68    0.69    0.24    0.31     1.00
engfrac      26 78    0.14    0.32    0.00    0.06    0.00    0.00     1.00
eurfrac      27 78    0.40    0.44    0.08    0.38    0.12    0.00     1.00
frankrom     28 78    2.87    0.84    2.91    2.86    0.78    0.94     5.64
latitude     29 78   17.96   27.87   17.30   19.13   33.67  -36.89    63.89
gastil       30 85    2.44    1.23    2.28    2.36    1.65    1.00     4.89
cgexp        31 82   28.82   10.49   28.90   28.50   13.04    9.74    51.18
cgrev        32 78   26.49   10.12   24.16   26.05   10.98    8.92    50.85
ssw          33 71    8.15    6.67    7.17    7.65    8.60    0.13    22.38
rgdph        34 85 6688.63 5495.17 4400.03 6130.45 4371.13  530.22 20782.81
trade        35 85   78.77   47.34   68.45   73.50   34.38   17.56   343.39
prop1564     36 84   62.07    5.76   63.94   62.49    5.31   49.05    71.70
prop65       37 84    8.45    4.89    6.47    8.26    5.68    2.30    17.43
federal      38 83    0.16    0.37    0.00    0.07    0.00    0.00     1.00
eduger       39 82   88.58   17.70   93.48   90.55   14.33   40.05   117.11
spropn       40 77    0.13    0.25    0.00    0.07    0.00    0.00     1.00
yearele      41 81 1965.55   36.85 1981.00 1972.62   16.31 1800.00  1994.00
yearreg      42 81 1961.48   40.16 1978.00 1969.65   20.76 1800.00  1994.00
seats        43 85  215.45  162.54  166.00  195.25  127.17   15.00   656.00
maj          44 85    0.39    0.49    0.00    0.36    0.00    0.00     1.00
pres         45 85    0.39    0.49    0.00    0.36    0.00    0.00     1.00
lyp          46 85    8.41    0.97    8.39    8.45    1.25    6.27     9.94
semi         47 85    0.11    0.31    0.00    0.01    0.00    0.00     1.00
majpar       48 85    0.25    0.42    0.00    0.19    0.00    0.00     1.00
majpres      49 85    0.13    0.33    0.00    0.04    0.00    0.00     1.00
propres      50 85    0.26    0.44    0.00    0.20    0.00    0.00     1.00
dem_age      51 85 1958.34   43.74 1978.00 1967.00   19.27 1800.00  1994.00
lat01        52 78    0.32    0.19    0.28    0.31    0.20    0.00     0.71
age          53 85    0.21    0.22    0.11    0.17    0.10    0.03     1.00
polityIV     54 80    7.17    3.64    8.11    7.92    2.80   -6.00    10.00
spl          55 74   -2.18    3.48   -1.69   -2.22    2.85  -11.36    12.59
cpi9500      56 71    4.81    2.38    5.29    4.95    2.46    0.27     8.25
du_60ctry    57 85    0.71    0.46    1.00    0.75    0.00    0.00     1.00
magn         58 84    0.47    0.40    0.26    0.46    0.31    0.01     1.00
sdm          59 77    0.35    0.39    0.16    0.32    0.15    0.01     1.00
oecd.x       60 85    0.27    0.45    0.00    0.22    0.00    0.00     1.00
mining_gdp   61 77    4.26    6.72    1.29    2.82    1.51    0.02    37.20
gini_8090    62 72   39.20   10.41   37.52   38.71   12.23   19.49    62.30
con2150      63 85    0.11    0.31    0.00    0.01    0.00    0.00     1.00
con5180      64 85    0.29    0.46    0.00    0.25    0.00    0.00     1.00
con81        65 85    0.49    0.50    0.00    0.49    0.00    0.00     1.00
list         66 84  114.48  129.55   77.67   93.16  115.15    0.00   510.33
maj_bad      67 85    1.06    1.60    0.00    0.77    0.00    0.00     4.89
maj_gin      68 72   16.40   20.96    0.00   13.86    0.00    0.00    62.00
maj_old      69 85    0.08    0.19    0.00    0.04    0.00    0.00     1.00
pres_bad     70 85    1.21    1.69    0.00    0.97    0.00    0.00     4.89
pres_gin     71 72   16.66   22.90    0.00   13.83    0.00    0.00    62.00
pres_old     72 85    0.06    0.16    0.00    0.02    0.00    0.00     1.00
propar       73 85    0.35    0.48    0.00    0.32    0.00    0.00     1.00
lpop         74 60   15.90    1.92   15.94   15.95    1.47   11.61    20.63
continent    75 85    3.04    1.06    3.00    3.16    1.48    1.00     4.00
              range  skew kurtosis     se
oecd           1.00  0.89    -1.23   0.05
country       84.00  0.00    -1.24   2.68
pind           1.00  0.13    -1.89   0.05
pindo          1.00 -0.46    -1.69   0.05
ctrycd       857.00  0.63    -1.08  30.90
col_uk         1.00  0.60    -1.65   0.05
t_indep      244.00  0.26    -1.54   9.74
col_uka        0.93  0.71    -1.44   0.04
col_espa       0.79  2.68     9.07   0.01
col_otha       0.98  1.16    -0.42   0.04
legor_uk       1.00  0.45    -1.82   0.05
legor_so       1.00  2.17     2.74   0.04
legor_fr       1.00  0.60    -1.65   0.05
legor_ge       1.00  3.29     8.96   0.03
legor_sc       1.00  3.68    11.71   0.03
prot80        97.80  1.65     1.99   2.77
catho80       97.30  0.38    -1.59   4.19
confu          1.00  3.29     8.96   0.03
avelf          0.84  0.76    -0.79   0.03
govef          6.42 -0.22    -1.11   0.19
graft          6.17 -0.37    -1.14   0.21
logyl          3.52 -0.57    -0.53   0.10
loga           2.73 -0.85     0.12   0.07
yrsopen        4.09  3.72    22.32   0.06
gadp           0.69  0.11    -1.19   0.02
engfrac        1.00  1.97     2.04   0.04
eurfrac        1.00  0.31    -1.80   0.05
frankrom       4.70  0.25     0.60   0.10
latitude     100.78 -0.31    -0.98   3.16
gastil         3.89  0.39    -1.16   0.13
cgexp         41.44  0.17    -0.99   1.16
cgrev         41.93  0.34    -0.84   1.15
ssw           22.26  0.41    -1.16   0.79
rgdph      20252.59  0.77    -0.71 596.03
trade        325.82  2.37     9.85   5.13
prop1564      22.66 -0.60    -0.87   0.63
prop65        15.13  0.23    -1.60   0.53
federal        1.00  1.86     1.46   0.04
eduger        77.06 -0.94     0.59   1.95
spropn         1.00  2.22     4.33   0.03
yearele      194.00 -2.13     5.16   4.09
yearreg      194.00 -1.85     3.28   4.46
seats        641.00  1.06     0.18  17.63
maj            1.00  0.45    -1.82   0.05
pres           1.00  0.45    -1.82   0.05
lyp            3.67 -0.24    -1.04   0.11
semi           1.00  2.52     4.39   0.03
majpar         1.00  1.12    -0.69   0.05
majpres        1.00  2.17     2.75   0.04
propres        1.00  1.08    -0.84   0.05
dem_age      194.00 -1.65     2.05   4.74
lat01          0.71  0.27    -1.10   0.02
age            0.97  1.65     2.05   0.02
polityIV      16.00 -1.62     2.20   0.41
spl           23.95  0.88     3.85   0.40
cpi9500        7.98 -0.49    -1.12   0.28
du_60ctry      1.00 -0.89    -1.23   0.05
magn           0.99  0.39    -1.64   0.04
sdm            0.99  0.93    -0.95   0.04
oecd.x         1.00  1.01    -0.98   0.05
mining_gdp    37.18  2.41     6.61   0.77
gini_8090     42.81  0.35    -0.80   1.23
con2150        1.00  2.52     4.39   0.03
con5180        1.00  0.89    -1.23   0.05
con81          1.00  0.02    -2.02   0.05
list         510.33  1.19     0.79  14.14
maj_bad        4.89  1.23     0.04   0.17
maj_gin       62.00  0.63    -1.30   2.47
maj_old        1.00  3.05     9.57   0.02
pres_bad       4.89  0.91    -0.81   0.18
pres_gin      62.00  0.71    -1.33   2.70
pres_old       1.00  4.26    18.75   0.02
propar         1.00  0.60    -1.65   0.05
lpop           9.02 -0.16    -0.32   0.25
continent      3.00 -0.72    -0.82   0.12

Or, alternatively (and preferably), we can view the table in the data viewer:

# prints contents of "pt_copy_summarystats1" to console
View(pt_copy_summarystats1)

To learn more about the describe() function, or any function you use in R, recall that you can query the function’s documentation by printing a ? following by the function’s name in the console (or executing it within a script). For example:

?describe

One thing to note here is that the function defaults to setting na.rm=TRUE; under this setting, the describe() function simply omits NA (i.e. missing) values when calculating summary statistics. If, instead, na.rm=FALSE, the function deletes all rows that contain missing data, and calculates summary statistics with reference to this truncated dataset that has been excised of missing values. In practical terms, it will typically make the most sense to go with the default na.rm=TRUE option, but it’s important to be clear about these different options for handling missing data in the context of the describe() function. If you’re still struggling to understand the difference, we can consider a simple example with a made-up toy dataset. Let’s first create this dataset:

# creates toy dataset for illustrative purposes and assigns it to object named "df_sample"
df_sample<-data.frame(Country=c("A", "B", "C", "D"),
                      Variable1=c(10, NA, 15, 20),
                      Variable2=c(5, 8, 12, NA))

# prints contents of "df_sample"
df_sample
  Country Variable1 Variable2
1       A        10         5
2       B        NA         8
3       C        15        12
4       D        20        NA

Let’s now generate a summary statistics table using describe() with the default na.rm=TRUE setting and assign it to an object named df_sample_summary_default:

# generates summary statistics with describe function and assigns it to "df_sample_summary_default"; na.rm=TRUE ignores NA values when calculating summary statistics
df_sample_summary_default<-describe(df_sample, na.rm=TRUE)

Go ahead and view df_sample_summary_default in the data viewer:

View(df_sample_summary_default)

Note for example, that the mean of “Variable1” is 15 (i.e. the average of 10, 15, and 20) while the mean of “Variable2” is 12.5 (i.e. the average of 5, 8, and 12). When computing these averages, the describe() function simply ignores “NA”, i.e. missing values. Now, let’s use describe() to generate a summary statistics table, but set na.rm=FALSE ; we’ll assign it to an object named df_sample_summary_B :

# generates summary statistics with describe function and assigns it to "df_sample_summary_B"; na.rm=FALSE removes rows with any NA values before calculating summary statistics (in other words, summary statistics are computed on rows with complete data)
df_sample_summary_B<-describe(df_sample, na.rm=FALSE)

Go ahead and inspect the summary table using the View() function:

View(df_sample_summary_B)

Note that the mean of “Variable1” is now 12.5 (the average of 10 and 15), because the na.rm=FALSE command removed both rows 1 and 3 in df_sample before performing the calculation, since both of these rows have an NA value in them. Similarly, the mean of “Variable2” is now 8.5 (the average of 5 and 12).

For the sake of argument, let’s say you want to only calculate summary statistics for columns that have no missing values. The describe() function doesn’t allow you to do this by specifying an argument, but we can carry out this task by writing a custom function. First, let’s write a function that checks whether a vector has any missing values; we’ll assign this object to a new object named check_na(). This function takes a column as an input, and returns a logical condition TRUE if there are any NA values in the column and FALSE if there are no NA values in the column:

# writes a function that takes a column as input and checks whether there are any NA values in that column; returns TRUE if there is, FALSE otherwise
check_na<-function(column){
  any(is.na(column))
}

Let’s test the function for a few columns in our dataset

check_na(pt_copy$country)
[1] FALSE
check_na(pt_copy$graft)
[1] TRUE

That worked. Now, let’s call on the discard() function, which is part of the purrr package we’re now familiar with; it removes a vector from a list if some condition is met. We’ll pass pt_copy and our check_na() function as arguments to it. In particular, the code below uses the discard() function to run the check_na() function across each column in pt_copy; if a column does have an NA value (i.e. evaluates to TRUE when the check_na() function is passed to it), it is “discarded”, but if it does not have an NA value (i.e. evaluates to FALSE when the check_na() function is passed to it), it is kept. We’re left with a version of pt_copy() without any columns that have NA values, which we’ll assign to a new object named pt_copy_NAcolumns_removed:

# discards columns in "pt_copy" that have missing data/NA Values, and assigns the resulting data frame to a new object named "pt_copy_NAcolumns_removed"
pt_copy_NAcolumns_removed<-discard(pt_copy, check_na)

We can confirm that all of the columns with NA values have been removed by passing pt_copy_NAcolumns_removed to the View() function:

View(pt_copy_NAcolumns_removed)

Now, we can pass pt_copy_NAcolumns_removed as an argument to the describe() function, and thereby generate a table of summary statistics for only variables without any missing observations. Below, we’ll assign this table of summary statistics to a new object named pt_copy_NAcolumns_removed_summarystats:

# passes "pt_copy_NAcolumns_removed" as an argument to "describe" function, which generates summary statistics only for variables without missing observations; the table of summary statistics is assigned to a new object named "pt_copy_NAcolumns_removed_summarystats"
pt_copy_NAcolumns_removed_summarystats<-describe(pt_copy_NAcolumns_removed)

Let’s go ahead and view this table in the data viewer:

View(pt_copy_NAcolumns_removed_summarystats)

Let’s say, for the sake of argument, that we will be working with a lot of datasets, and you will want to quickly generate tables of summary statistics that exclude variables that have missing values. We don’t want to go through these steps all the time, so we can wrap this code into a function that takes a dataset as an input, and returns a table of summary statistics that excludes information on variables with missing values:

# creates function to take dataset input and return a table of summary statistics only for variables that have no missing values
summary_stats_noNAcolumns<-function(dataset_input){
  cleaned_dataset<-discard(dataset_input, check_na)
  cleaned_summary_stats<-describe(cleaned_dataset)
  return(cleaned_summary_stats)
}

Now, we can simply pass a dataset as an argument to our custom summary_stats_noNAcolumns function, and it will return a summary statistics table with these desired specifications. Test this function out on pt_copy to confirm it works:

# passes "pt_copy" as argument to "summary_stats_noNAcolumns" function
summary_stats_noNAcolumns(pt_copy)
         vars  n   mean     sd median trimmed    mad min max range  skew
oecd        1 85   0.29   0.46   0.00    0.25   0.00   0   1     1  0.89
country*    2 85  43.00  24.68  43.00   43.00  31.13   1  85    84  0.00
pind        3 85   0.46   0.47   0.42    0.45   0.62   0   1     1  0.13
pindo       4 85   0.61   0.46   0.90    0.63   0.15   0   1     1 -0.46
ctrycd      5 85 430.88 284.89 299.00  406.86 240.18 111 968   857  0.63
col_uk      6 85   0.35   0.48   0.00    0.32   0.00   0   1     1  0.60
         kurtosis    se
oecd        -1.23  0.05
country*    -1.24  2.68
pind        -1.89  0.05
pindo       -1.69  0.05
ctrycd      -1.08 30.90
col_uk      -1.65  0.05

For further confirmation, create a toy dataset and pass it as an argument to summary_stats_noNAcolumns() to ensure that it works as expected.

It might be the case that the summary statistics table created by describe() does not contain information you’d like, in which case it may be possible to customize it to provide that information. For example, it could be nice to have a column in the summary statistics table that provides information on the number of NA values there are for each variable. To get that information into the summary statistics table, we’ll first write a function to count the number of NA values in a vector:

# Define a named function to count missing values
count_missing <- function(x) {
  sum(is.na(x))
}

Next, we’ll apply this function to each column/variable in pt_copy, and deposit the number of missing values for each column in a vector, such that the number of missing values in the first column is deposited as the first element in the vector, the number of missing values in the second column is deposited as the second element in the vector, and so on; we’ll assign this vector to a new object named missing_values_vector:

# applies "count_missing" function to the columns of "pt_copy", and deposits the results (i.e. count of missing values in each column of "pt_copy") to a numeric vector assigned to the object "missing_values_vector"
missing_values_vector <- map_dbl(pt_copy, count_missing)

Now, we’ll add this vector to pt_copy_summarystats1:

# adds "missing_values_vector" as a column named "missing" to "pt_copy_summarystats1" summary stats table
pt_copy_summarystats1$missing<-missing_values_vector

View pt_copy_summarystats1 in the data viewer to confirm that the column has been successfully added:

View(pt_copy_summarystats1)

Below, we’ll learn another way of adding new columns to datasets that draws on the tidyverse; the syntax above is from base R.

Finally, there may be instances in which the summary statistics table produced by the describe() function provides unwanted information. For example, the summary statistics table provides summary statistics for categorical variables as well as numeric ones, but quantitative summary statistics for categorical information may not be meaningful, and simply clutter things. What if we want to remove categorical variables from the table of summary statistics?

We can first use the select_if() function from dplyr (a data-wrangling package that is part of the tidyverse suite) to select only the numeric columns in pt_copy. We’ll assign this “numeric only” version of pt_copy to a new object named pt_copy_numeric:

# selects columns from "pt_copy" that are numeric
pt_copy_numeric<-select_if(pt_copy, is.numeric)

We can confirm that pt_copy_numeric only contains numeric variables:

# views "pt_copy_numeric" in data viewer
View(pt_copy_numeric)

Before proceeding, it’s useful to note that there are other summary statistics packages that are worth exploring. For example, the summarytools package provides the descr() , which is similar to describe() from the psych package. The summarytools package defaults to providing information as a matrix, which we haven’t covered, so we pass descr(pt_copy) to the as.data.frame() function to get the summary statistics table output as a data frame.

# uses "descr" function from summarytools package to create a table of summary statistics as a data frame and assigns it to "pt_summary_ST"
pt_summary_ST<-as.data.frame(descr(pt_copy))
# views "pt_copy_numeric" in data viewer
View(pt_summary_ST)

Another handy package for generating summary statistics is stargazer, which is most useful for creating and exporting regression tables, but which can also produce a well-formatted summary statistics table that is ready to be exported. We can use stargazer to create a summary statistics table with the following (we’ll export it in a later section):

# uses stargazer package to generate summary statistics for pt_copy
stargazer(as.data.frame(pt_copy), type = "text")

======================================================
Statistic  N    Mean    St. Dev.     Min       Max    
------------------------------------------------------
oecd       85   0.294     0.458       0         1     
pind       85   0.460     0.466     0.000     1.000   
pindo      85   0.606     0.457     0.000     1.000   
ctrycd     85  430.882   284.889     111       968    
col_uk     85   0.353     0.481       0         1     
t_indep    85  119.729   89.759       6        250    
col_uka    85   0.282     0.393     0.000     0.928   
col_espa   85   0.058     0.133     0.000     0.788   
col_otha   85   0.218     0.360     0.000     0.976   
legor_uk   85   0.388     0.490       0         1     
legor_so   85   0.129     0.338       0         1     
legor_fr   85   0.353     0.481       0         1     
legor_ge   85   0.071     0.258       0         1     
legor_sc   85   0.059     0.237       0         1     
prot80     85  17.462    25.498     0.000     97.800  
catho80    85  40.694    38.662     0.000     97.300  
confu      85   0.071     0.258       0         1     
avelf      85   0.286     0.263     0.000     0.836   
govef      81   4.209     1.751     0.836     7.258   
graft      81   4.169     1.891     0.742     6.917   
logyl      74   9.226     0.902     6.954     10.476  
loga       73   8.171     0.610     6.284     9.015   
yrsopen    75   0.516     0.543     0.000     4.089   
gadp       75   0.689     0.199     0.313     1.000   
engfrac    78   0.138     0.318     0.000     1.000   
eurfrac    78   0.398     0.437     0.000     1.004   
frankrom   78   2.866     0.840     0.940     5.639   
latitude   78  17.964    27.869    -36.892    63.892  
gastil     85   2.445     1.229     1.000     4.889   
cgexp      82  28.818    10.488     9.743     51.178  
cgrev      78  26.488    10.120     8.923     50.849  
ssw        71   8.148     6.674     0.129     22.385  
rgdph      85 6,688.634 5,495.171  530.223  20,782.810
trade      85  78.766    47.339    17.562    343.387  
prop1564   84  62.068     5.764    49.048     71.703  
prop65     84   8.447     4.885     2.296     17.430  
federal    83   0.157     0.366       0         1     
eduger     82  88.580    17.698    40.050    117.114  
spropn     77   0.126     0.245     0.000     1.000   
yearele    81 1,965.547  36.852   1,800.000 1,994.000 
yearreg    81 1,961.481  40.162     1,800     1,994   
seats      85  215.446   162.537   15.000    656.000  
maj        85   0.388     0.490       0         1     
pres       85   0.388     0.490       0         1     
lyp        85   8.406     0.969     6.273     9.942   
semi       85   0.106     0.310       0         1     
majpar     85   0.250     0.422     0.000     1.000   
majpres    85   0.129     0.334     0.000     1.000   
propres    85   0.260     0.440     0.000     1.000   
dem_age    85 1,958.341  43.737     1,800     1,994   
lat01      78   0.316     0.187     0.003     0.710   
age        85   0.208     0.219     0.030     1.000   
polityIV   80   7.172     3.643    -6.000     10.000  
spl        74  -2.177     3.479    -11.362    12.589  
cpi9500    71   4.812     2.382     0.268     8.250   
du_60ctry  85   0.706     0.458       0         1     
magn       84   0.470     0.395     0.007     1.000   
sdm        77   0.355     0.386     0.008     1.000   
oecd.x     85   0.271     0.447       0         1     
mining_gdp 77   4.256     6.717     0.018     37.198  
gini_8090  72  39.202    10.412    19.490     62.300  
con2150    85   0.106     0.310       0         1     
con5180    85   0.294     0.458       0         1     
con81      85   0.494     0.503       0         1     
list       84  114.479   129.555    0.000    510.333  
maj_bad    85   1.062     1.598     0.000     4.889   
maj_gin    72  16.401    20.961     0.000     62.000  
maj_old    85   0.084     0.186     0.000     1.000   
pres_bad   85   1.211     1.689     0.000     4.889   
pres_gin   72  16.663    22.900     0.000     62.000  
pres_old   85   0.061     0.163     0.000     1.000   
propar     85   0.353     0.481       0         1     
lpop       60  15.897     1.917    11.610     20.632  
------------------------------------------------------

4.3 Frequency Tables and Crosstabs

For qualitative data, it can be especially useful to generate frequency tables, which we can do with the freq() function:

# creates frequency table for the continent variable
freq(pt_copy$continent)
Frequencies  
pt_copy$continent  
Type: Character  

               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------ ------ --------- -------------- --------- --------------
      africa     11     12.94          12.94     12.94          12.94
       asiae     13     15.29          28.24     15.29          28.24
        laam     23     27.06          55.29     27.06          55.29
       other     38     44.71         100.00     44.71         100.00
        <NA>      0                               0.00         100.00
       Total     85    100.00         100.00    100.00         100.00

Cross-tabs, which show the frequency distribution of two or more categorical variables, are another useful way to begin exploring your data; one useful function for generating crosstabs in R is the table() function, that takes dataset columns as arguments.

# creates crosstab with continent and federal variables
table(pt_copy$continent, pt_copy$federal)
        
          0  1
  africa 11  0
  asiae  11  2
  laam   19  4
  other  29  7

The cross tab above quickly tells us, for example, that countries in the “asiae” category (Asia and Europe) have 2 countries with a federal structure of government and 11 without it, while countries in the “laam” category (Latin America) have 4 countries with a federal structure of government and 19 without it.

4.4 Group-Level Summary Statistics

While having a simple table of summary statistics is a useful starting point, it is often useful to generate group-level summary statistics, where summary statistics are presented for different subgroups in the dataset. One way to generate group summary statistics is to use the describeBy() function (also from the psych package), where the first argument is the data frame you would like to generate group-level summary statistics for, and the second argument is the column that contains the relevant groups. Below, we generate summary statistics for pt_copy parsed out by the different continents in the “continent” column. The expression pt_copy$continent indicates that the groups with respect to which we want to calcualte the summary statistics is in the “continent” column of the pt_copy data frame. More generally, we can explicitly refer to columns in an R data frame using this dollar-sign notation, where the expression before the dollar sign refers to the data frame object, and the expression after refers to the name of the column.

The describeBy() function will produce a list that contains summary statistics for different groups as list elements. Below, we’ll assign the list of group summary statistics to a new object named summary_stats_by_continent:

# Creates summary statistics for each continent grouping, and puts results in list named "summary_stats_by_continent"
summary_stats_by_continent<-describeBy(pt_copy, pt_copy$continent)

Now, let’s say we want to extract the summary statistics for Africa, one of the continent categories in the “continent” column. We can do so using the double-bracket notation we discussed above:

# Accessing continent-level summary statistics for africa from the "summary_stats_by_continent" list
summary_stats_by_continent[["africa"]]
        vars  n   mean     sd median trimmed   mad min max range  skew kurtosis
oecd       1 11   0.00   0.00      0    0.00  0.00   0   0     0   NaN      NaN
country    2 11  53.36  24.45     49   54.56 28.17  11  85    74 -0.17    -1.41
pind       3 11   0.77   0.42      1    0.83  0.00   0   1     1 -1.06    -0.79
pindo      4 11   0.77   0.42      1    0.83  0.00   0   1     1 -1.06    -0.79
ctrycd     5 11 647.55 154.90    684  685.56 56.34 199 754   555 -2.13     3.44
col_uk     6 11   0.82   0.40      1    0.89  0.00   0   1     1 -1.43     0.08
           se
oecd     0.00
country  7.37
pind     0.13
pindo    0.13
ctrycd  46.70
col_uk   0.12

Recall that we can assign list elements that we extract from a list to their own object, which allows us to conveniently retrieve it whenever it is needed. Below, we’ll assign the summary statistics for Africa to a new object named africa_summary:

# Group-level summary statistics can be assigned to their own object for easy retrieval
africa_summary<-summary_stats_by_continent[["africa"]]

Another convenient way to extract group-level statistics is with the dplyr package’s group_by() function. First, we’ll run the code below, and assign it to a new object named trade_age_by_continent:

# Generate a table that displays summary statistics for trade at the continent level and assign to object named "trade_age_by_continent"
trade_age_by_continent<-pt_copy %>% 
                          group_by(continent) %>% 
                            summarise(meanTrade=mean(trade),sdTrade=sd(trade),
                                      meanAge=mean(age), sdAge=sd(age),
                                      n=n())

Let’s now print the contents of trade_age_by_continent:

# prints contents of "trade_age_by_continent"
trade_age_by_continent
# A tibble: 4 × 6
  continent meanTrade sdTrade meanAge  sdAge     n
  <chr>         <dbl>   <dbl>   <dbl>  <dbl> <int>
1 africa         77.3    32.1   0.121 0.124     11
2 asiae          97.8    84.6   0.110 0.0846    13
3 laam           68.6    32.8   0.139 0.153     23
4 other          78.8    40.7   0.309 0.263     38

Let’s now unpack the code that created this table. We started with the pt_copy data frame, and then used group_by(continent) to declare that subsequent calculations should be performed at the continent-level; then, within the summarise() function, we defined the column names we wanted to use in the group-level summary table, and how those variables are to be calculated. For example, meanTrade=mean(trade) indicates that we want the first column to be named “meanTrade”, which is to be calculated by taking the mean of the “trade” variable for each continent grouping. After that, sdTrade=sd(trade) indicates that we want the second column to be named “sdTrade”, which is to be calculated by taking the standard deviation of the “trade” variable for each continent grouping. And so on. Note that n=n() indicates that we want the final column, named “n”, to provide information about the number of observations in each continent-level grouping.

You might have noticed a mysterious symbol in the above code that comes immediately after pt_copy, and immediately after group_by(continent). This symbol is known as a “pipe” (%>%). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Above, the pipe takes the contents of pt_copy on its left, and then feeds this data into the group_by() function on the right; then, after grouping the data by continent, it feeds this grouped data on its left into the summarise() function on its right. We will use the pipe operator throughout the lesson to chain together functions in this manner.

4.5 Data Cleaning and Wrangling

After getting a sense of your data by computing some summary statistics and running some crosstabs, you’ll often have a sense of how you would like to clean or transform your data for analysis. This section briefly describes some functions that are useful for these basic data-preparation and wrangling tasks. Most of these functions are from the tidyverse’s dplyr package.

4.5.1 Rearranging Columns

We can manipulate the order of the columns in a dataset using the relocate function. For example, the code below uses the relocate() function to shift the “country” column to the front of the dataset, and then assigns this change back to pt_copy to update the object:

# bring the "country" column to the front of the dataset
pt_copy<-pt_copy %>% relocate(country)

Go ahead and confirm that the change has been implemented by viewing pt_copy in the data viewer:

# Views "pt_copy" in data viewer
View(pt_copy)

We can specify more than one argument to the relocate function. For example, in the code below, passing the “country”, “list”, “trade”, and “oecd” variables/columns to the relocate() function will make “country” the first column, “list” the second column, “trade” the third column, and so on.

# bring the "country", "list", "trade", "oecd" columns to the front of the dataset
pt_copy<-pt_copy %>% relocate(country, list, trade, oecd)

Let’s go ahead and view the updated dataset in the R Studio Viewer:

# Views updated "pt_copy" data frame in data viewer
View(pt_copy)

4.5.2 Renaming Variables

In order to rename variables, we can use the rename() function, as below. The code below renames the existing “list” variable to “party_list”, which is more descriptive, and assigns the change back to the pt_copy object.

## Renaming a variable (renames "list" to "party_list")
pt_copy<-pt_copy %>% rename(party_list=list)

Let’s check the pt_copy data frame in the viewer to ensure that the change has been made.

# Views updated "pt_copy" data frame in data viewer
View(pt_copy)

You can rename more than one variable at a time using the rename() function; simply separate the arguments by a comma. For example, if we wanted to also change the variable currently named “ctrycd” to “country_code”, in addition to changing “list” to “party_list”, we could use the following expression: pt_copy<-pt_copy %>% rename(party_list=list, country_code=ctrycd)

4.5.3 Sorting Datasets With Respect to Variables

It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the pt_copy data frame in ascending order with respect to the “trade” variable using the arrange() function:

# sorting in ascending (low to high) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(trade)

If, instead, you want to sort the dataset in descending order with respect to the “trade” variable, pass the name of the variable to the desc() function within the arrange() function, as below:

# sorting in descending (high to low) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(desc(trade))

Note that it’s also possible to pass several arguments to the arrange function, and thereby sort a dataset with respect to multiple variables; for example, the code below first relocates the “continent” column and the “trade” column towards the front of the dataset (just after “Country), then sorts the dataset by continent, and then further sorts it in descending order by trade. This allows us to quickly note the country in each continent grouping with the highest value on the”trade” variable.

# takes the "pt_copy" dataset, relocates "country", "continent", and "trade" to the front of the dataset, and then arranges the dataset based on the "continent" variable, then in descending order with respect to the "trade" variable
pt_copy<-pt_copy %>% 
        relocate(country, continent, trade) %>% 
        arrange(continent, desc(trade))

Let’s go ahead and see what this looks like using the Viewer:

# Views updated "pt_copy" data frame in data viewer
View(pt_copy)

Sorting a dataset with respect to more than one variable can be especially useful in certain contexts, particularly in cases where you’re dealing with nested data. For example, in a time series dataset, it can be useful to first sort by year, then by months within the year. In a dataset with regional information, it could be helpful to first sort by regional, then by cities within those regional groupings.

4.5.4 Creating New Variables Based on Existing Variables

Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. To do so, we can use dplyr’s mutate() function. The code below, for example, uses the mutate() function to create a new variable, named “non_catholic_80”, that is computed by subtracting the existing “catho80” variable from 100; for convenience, the “country”, “catho80”, and newly created “non_catholic_80” variables are all moved to the front of the dataset using the relocate() function:

# Create new variable named "non_catholic_80" that is calculated by substracting the Catholic share of the population in 1980 ("catho80") from 100  and relocates "country", "catho80", and the newly created "non_catholic_80" to the front of the dataset
pt_copy<-pt_copy %>% mutate(non_catholic_80=100-catho80) %>% 
                     relocate(country, catho80, non_catholic_80)

We can view the updated pt_copy data frame to confirm that the new variable has been created:

# Views updated "pt_copy" data frame in data viewer
View(pt_copy)

Note that it’s possible to define more than one new variable at a time using mutate(); simply separate the arguments by a comma within the mutate() function.

4.5.5 Selecting or Deleting Variables

Sometimes, we will have a dataset with many variables, and to make things more tractable, we’ll want to select only the variables that are relevant to our analysis. we can explicitly select desired variables using the select() function from dplyr. The code below selects the “country”, “cgexp”, “cgrev”, “trade”, and “federal” columns from pt_copy, and then assigns this selection to a new object named pt_copy_selection:

# Selects "country", "cgexp", "cgrev", and "trade" variables from the "pt_copy" dataset and assigns the selection to a new object named "pt_copy_selection"
pt_copy_selection<-pt_copy %>% 
                    select(country, cgexp, cgrev, trade, federal)

When we view the pt_copy_selection object in the data viewer, we’ll see that we now have a new data frame that consists only of these variables:

# views "pt_copy_selection" in data viewer
View(pt_copy_selection)

Sometimes, it could make more sense to directly delete columns, instead of deciding on which ones to keep or select. For example, the code below deletes the “federal” variable from pt_copy_selection by passing it to the select() column with a “-” in front of it.

# deletes "federal" variable from "pt_copy_selection"
pt_copy_selection %>% select(-federal)
# A tibble: 85 × 4
   country      cgexp cgrev trade
   <chr>        <dbl> <dbl> <dbl>
 1 Mauritius     22.5  21.6 128. 
 2 Gambia        24.4  23.8 122. 
 3 Namibia       37.3  33.2 115. 
 4 Botswana      38.6  50.8  87.5
 5 Zambia        25.2  19.7  73.8
 6 Zimbabwe      31.2  25.6  69.2
 7 Senegal       NA    NA    65.4
 8 Malawi        25.8  NA    62.5
 9 Ghana         19.0  17.2  53.5
10 South Africa  31.3  24.1  43.9
# ℹ 75 more rows

If you want to delete multiple columns, simply specify the columns in a vector, preceded by a minus sign, that is passed to the select() function. The code below, for instance, takes the existing pt_copy_selection data frame, deletes the “federal” and “trade” columns, and assigns the result to a new object named pt_copy_selection_modified:

# deletes "federal" and "trade" from "pt_copy_selection" and assigns it to new object named "pt_copy_selection_modified"
pt_copy_selection_modified<-pt_copy_selection %>% select(-c(federal, trade))

Check the pt_copy_selection_modified data frame in the data viewer to confirm these changes:

# views "pt_copy_selection_modified" in data viewer
View(pt_copy_selection_modified)

4.5.6 Recoding Variables

“Recoding” a variable refers to the process of taking an existing variable, and generating new variable(s) that represent the information from that original variable in a new way. Below, we’ll consider some common recoding operations.

4.5.7 Creating Dummy Variables from Continuous Numeric Variables

You may sometimes have a continuous numeric variable, but want to create a new dummy variable (a variable that takes on the value of 1 if a given condition is met, and 0 otherwise) based on that numeric variable. For example, let’s say we want to create a new variable, named “trade_open” that takes on the value of 1 if the trade variable is greater than or equal to 77, and 0 otherwise. We can generate this new dummy variable using the mutate() function; within the mutate() function below, we specify that we want to create a new variable named “trade_open”; the ifelse() function specifies the desired condition (trade>=77), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value the new “trade_open” variable is to take if the condition is not met (0). In other words, we can translate ifelse(trade>=77, 1, 0) to “if trade>=77, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “pt_copy”:

# Creates a new dummy variable based on the existing "trade" variable named "trade_open" (which takes on a value of "1" if "trade" is greater than or equal to 77, and 0 otherwise) and then moves the newly created variable to the front of the dataset along with "country" and "trade"; all changes are assigned to "pt_copy", thereby overwriting the existing version of "pt_copy"

pt_copy<-pt_copy %>% mutate(trade_open=ifelse(trade>=77, 1, 0)) %>% 
                     relocate(country, trade_open, trade)

View the data frame to ensure that the new variable “trade_open”, recoded based on “trade”, has been created:

4.5.8 Creating categorical variables from continuous numeric variables

Sometimes, we will want to create a variable that contains categories or classifications that derive from numeric thresholds of an existing variable. For instance, let’s say we want to take the existing “trade” variable, and define a new variable named “trade_level”, which is set to “Low Trade” when the “trade” variable is greater than 15 and less than 50; “Intermediate_Trade” when the “trade” variable is greater than or equal to 50 and less than 100; and “High_Trade” when the “trade” variable is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate() function, and the case_when() function that maps the conditions onto the desired variable values for “trade_level” using the following syntax:

# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low Trade" when the "trade" variable is greater than 15 and less than 50, coded as "Intermediate Trade" when "trade" is greater than or equal to 50 and less than 100, and coded as "High TradE" when "trade" is greater than or equal to 100), and then reorders the dataset such that "country", "trade_level", and "trade" are the first three variables in the dataset
pt_copy<-pt_copy %>% mutate(trade_level=case_when(trade>15 & trade<50~"Low_Trade",
                                      trade>=50 & trade<100~"Intermediate_Trade",
                                                  trade>=100~"High_Trade")) %>% 
                    relocate(country, trade_level, trade)

Check to see that the new “trade_level” variable has indeed been created in pt_copy according to the specifications above:

# views updated "pt_copy" data frame in data viewer
View(pt_copy)

4.5.9 Creating dummy variables from categorical variables

Sometimes, we may have a categorical variable in a dataset, and want to create dummy variables based on those categories. For example, consider the “trade_level” variable we created above. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols() function. Below, we simply take the existing pt_copy dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols() function:

# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset
pt_copy<-pt_copy %>% 
  dummy_cols("trade_level") %>% 
 relocate(country, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade)

Let’s now view the updated pt_copy data frame, with the newly created dummy variables:

# views updated "pt_copy" data frame in data viewer
View(pt_copy)

You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on.

4.5.10 Subsetting (Filtering) Variables

We will often want to subset, or “filter” our datasets to extract observations that meet specified criteria. The dplyr packages allows us to carry out these subsetting operations with a function called filter(), which takes various logical conditions as arguments. Let’s say, for example, that we want to extract all of the OECD country observations from the pt_copy dataset. The “oecd” variable in pt_copy is equal to 1, for all OECD countries, and 0 for non-OECD countries. By passing the condition oecd==1 to the filter() function, we can extract all OECD observations. We’ll assign this data subset to a new object named oecd_countries, and view it in the data viewer:

# Extracts OECD observations in "pt_copy" and assigns to object named "oecd_countries"
oecd_countries<-pt_copy %>% filter(oecd==1) %>% 
                            relocate(country, oecd)
# views "oecd_countries" in data viewer
View(oecd_countries)

Let’s take another example. Let’s use the filter() function to extract all observations for which the “cgrev” (central government revenue as a share of GDP) exceeds 40. We’ll assign the observations that satisfy this condition to a new object named high_revenues:

# Extracts observations for which cgrev (central government revenue as % of gdp)>40, and assigns to object named "high_revenues"
high_revenues<-pt_copy %>% filter(cgrev>40) %>% 
                              relocate(country, cgrev)
# Views "high_revenues" in data viewer
View(high_revenues) 

Let’s try another example. Let’s subset observations from pt_copy for which the Catholic share of the population in 1980 (“catho80”) is less than or equal to 50, and assign the filtered data to a new object named minority_catholic:

# Extracts observations for which the "catho80" variable is less than or equal to 50
minority_catholic<-pt_copy %>% filter(catho80<=50) %>% 
                               relocate(country, catho80)
# Views "minority_catholic" in the data viewer
View(minority_catholic)

It is also possible to chain together multiple conditions as arguments to the filter() function. For example, if we want to subset observations from OECD countries that also have a federal political structure, we can use the “&” operator to specify these two conditions; we’ll assign the filtered dataset to a new object named oecd_federal_countries:

# Extracts federal OECD countries (where oecd=1 AND federal=1) and assigns to a new object named "oecd_federal_countries"
oecd_federal_countries<-pt_copy %>% filter(oecd==1 & federal==1) %>% 
                                      relocate(country, oecd, federal)
# Views "oecd_federal_countries" in data viewer
View(oecd_federal_countries)

We can use a vertical line (|) to specify “or” conditions. For example, the code below subsets observations from countries in Africa OR countries in Asia/Europe, and assigns the subsetted data to a new object named asia_europe_africa:

# Extracts observations that are in Africa ("africa") OR in Asia/Europe ("asiae) and assigns to an object named "asia_europe_africa"
asia_europe_africa<-pt_copy %>% filter(continent=="africa"|continent=="asiae") %>% 
                                  relocate(continent)
# views "asia_europe_africa" in data viewer
View(asia_europe_africa)

It is also useful to know how to subset datasets to extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by a “!=”. For example, if we wanted to extract observations from pt_copy where the “continent” variable is NOT equal to “africa”, and assign the result to a new object named pt_copy_sans_africa, we can write the following:

# Extracts all non-Africa observations and assigns to object named "pt_copy_sans_africa"
pt_copy_sans_africa<-pt_copy %>% filter(continent!="africa") %>% relocate(continent)
# views pt_copy_sans_africa in the data viewer
View(pt_copy_sans_africa)

5 Exercises

Exercise 1

Write a function that takes a dataset as an input argument, and returns a table of summary statistics that includes a column containing information on the number of missing values/NA values associated with each variable

Exercise 2

Earlier, we used the following to drop rows in the student_scores dataset that have NA values associated with the “Age” variable: drop_na(student_scores, Age). Rewrite that code using the %>% operator.

Exercise 3

Apply the filter() AND the select() function to one of the datasets we’ve worked with in this lesson (chaining together operations with the %>% operator), and assign the modified dataset to a new object that you write out to a local directory on your computer as a CSV file.