11 Numeric Data Summarizing, Processing and Wrangling Tasks
In this section, we will survey some useful functions (primarily from the dplyr package) for wrangling and processing numeric data. We will demonstrate these functions using the Persson-Tabellini dataset on political-economic data (pt).
First, we’ll make a copy of the pt object, by assigning it to a new object named pt_copy. We’ll use pt_copy when exploring the dataset, which ensures that we do not make inadvertent changes to our original pt data frame, and can always refer back to it when needed. Keeping a “clean” version of the data, and carrying out analysis tasks on a copy of this dataset, is good data management practice.
# Make a copy of the dataset so we don't alter the original dataset; then, view
# the copied dataset
pt_copy<-ptWe can go ahead and print the contents of pt_copy, which, at this point, is identical to pt:
## # A tibble: 85 × 75
## oecd country pind pindo ctrycd col_uk t_indep col_uka col_espa col_otha legor_uk legor_so legor_fr
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Argentina 0 0 213 0 183 0 0.268 0 0 0 1
## 2 1 Australia 1 1 193 1 98 0.608 0 0 1 0 0
## 3 1 Austria 0 0 122 0 250 0 0 0 0 0 0
## 4 0 Bahamas 1 1 313 1 26 0.896 0 0 1 0 0
## 5 0 Bangladesh 1 1 513 0 28 0 0 0.888 1 0 0
## 6 0 Barbados 1 1 316 1 33 0.868 0 0 1 0 0
## 7 0 Belarus 1 1 913 0 8 0 0 0.968 0 1 0
## 8 1 Belgium 0 0 124 0 169 0 0 0.324 0 0 1
## 9 0 Belize 1 1 339 1 18 0.928 0 0 1 0 0
## 10 0 Bolivia 0.116 0.116 218 0 174 0 0.304 0 0 0 1
## # ℹ 75 more rows
## # ℹ 62 more variables: legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>, catho80 <dbl>, confu <dbl>,
## # avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>, loga <dbl>, yrsopen <dbl>, gadp <dbl>,
## # engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, latitude <dbl>, gastil <dbl>, cgexp <dbl>, cgrev <dbl>,
## # ssw <dbl>, rgdph <dbl>, trade <dbl>, prop1564 <dbl>, prop65 <dbl>, federal <dbl>, eduger <dbl>,
## # spropn <dbl>, yearele <dbl>, yearreg <dbl>, seats <dbl>, maj <dbl>, pres <dbl>, lyp <dbl>,
## # semi <dbl>, majpar <dbl>, majpres <dbl>, propres <dbl>, dem_age <dbl>, lat01 <dbl>, age <dbl>, …
We can also view it in the data viewer:
11.1 Summary Statistics
Once you have a dataset loaded into R, one of the first things you’ll want to do is likely to generate a table of summary statistics. A quick way to do that is to use the describe() function from the psych package. Below, we’ll generate summary statistics for the pt_copy dataset by passing it to the describe() function, and assign the table of summary statistics to a new object named pt_copy_summarystats1. We’ll then view it in the data viewer:
# Generate summary statistics for "pt_copy" and assign to new object named "pt_copy_summarystats1"
pt_copy_summarystats1<-describe(pt_copy)While having a simple table of summary statistics is often a useful starting point, it is often useful to generate group-level summary statistics, where summary statistics are presented for different subgroups in the dataset. One way to generate group summary statistics is to use the describeBy() function (also from the psych package), where the first argument is the data frame you would like to generate group-level summary statistics for, and the second argument is the column that contains the relevant groups. Below, we generate summary statistics for pt_copy parsed out by the different continents in the “continent” column. The expression pt_copy$continent indicates that the groups with respect to which we want to calcualte the summary statistics is in the “continent” column of the pt_copy data frame. More generally, we can explicitly refer to columns in an R data frame using this dollar-sign notation, where the expression before the dollar sign refers to the data frame object, and the expression after refers to the name of the column.
The describeBy() function will produce a list that contains summary statistics for different groups as list elements. Below, we’ll assign the list of group summary statistics to a new object named summary_stats_by_continent:
# Creates summary statistics for each continent grouping, and puts results in list named "summary_stats_by_continent"
summary_stats_by_continent<-describeBy(pt_copy, pt_copy$continent)Now, let’s say we want to extract the summary statistics for Africa, one of the continent categories in the “continent” column. We can do so using the double-bracket notation we discussed above:
# Accessing continent-level summary statistics for africa from the "summary_stats_by_continent" list
summary_stats_by_continent[["africa"]]## vars n mean sd median trimmed mad min max range skew kurtosis se
## oecd 1 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## country 2 11 53.36 24.45 49.00 54.56 28.17 11.00 85.00 74.00 -0.17 -1.41 7.37
## pind 3 11 0.77 0.42 1.00 0.83 0.00 0.00 1.00 1.00 -1.06 -0.79 0.13
## pindo 4 11 0.77 0.42 1.00 0.83 0.00 0.00 1.00 1.00 -1.06 -0.79 0.13
## ctrycd 5 11 647.55 154.90 684.00 685.56 56.34 199.00 754.00 555.00 -2.13 3.44 46.70
## col_uk 6 11 0.82 0.40 1.00 0.89 0.00 0.00 1.00 1.00 -1.43 0.08 0.12
## t_indep 7 11 36.64 19.77 35.00 33.89 5.93 9.00 89.00 80.00 1.38 1.88 5.96
## col_uka 8 11 0.69 0.35 0.86 0.74 0.02 0.00 0.92 0.92 -1.31 -0.14 0.10
## col_espa 9 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## col_otha 10 11 0.15 0.33 0.00 0.07 0.00 0.00 0.96 0.96 1.58 0.79 0.10
## legor_uk 11 11 0.82 0.40 1.00 0.89 0.00 0.00 1.00 1.00 -1.43 0.08 0.12
## legor_so 12 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## legor_fr 13 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00 1.00 1.43 0.08 0.12
## legor_ge 14 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## legor_sc 15 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## prot80 16 11 22.17 20.23 25.80 19.96 19.57 0.10 64.20 64.10 0.46 -0.80 6.10
## catho80 17 11 19.46 13.67 18.70 18.07 13.20 1.90 49.60 47.70 0.71 -0.39 4.12
## confu 18 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## avelf 19 11 0.71 0.14 0.73 0.73 0.15 0.38 0.84 0.46 -1.15 0.44 0.04
## govef 20 11 5.37 0.82 5.02 5.25 0.68 4.56 7.26 2.70 0.97 -0.17 0.25
## graft 21 11 5.11 0.77 5.39 5.12 0.80 3.93 6.23 2.30 -0.17 -1.62 0.23
## logyl 22 11 7.93 0.78 7.75 7.90 0.53 6.95 9.13 2.18 0.42 -1.43 0.23
## loga 23 11 7.38 0.66 7.33 7.37 0.55 6.28 8.58 2.29 0.03 -0.91 0.20
## yrsopen 24 11 0.21 0.29 0.16 0.15 0.18 0.00 1.00 1.00 1.72 2.15 0.09
## gadp 25 11 0.55 0.12 0.54 0.55 0.12 0.37 0.74 0.37 0.28 -1.38 0.04
## engfrac 26 11 0.02 0.04 0.00 0.02 0.00 0.00 0.09 0.09 0.95 -1.09 0.01
## eurfrac 27 11 0.07 0.17 0.00 0.03 0.00 0.00 0.57 0.57 2.24 3.76 0.05
## frankrom 28 11 2.90 0.51 2.94 2.86 0.56 2.19 3.95 1.77 0.54 -0.69 0.15
## latitude 29 11 -9.14 15.17 -15.81 -9.58 8.49 -29.13 14.77 43.90 0.44 -1.52 4.57
## gastil 30 11 3.59 1.16 4.00 3.66 1.32 1.61 4.89 3.28 -0.48 -1.45 0.35
## cgexp 31 10 27.00 7.63 25.50 27.10 8.58 14.65 38.57 23.92 0.06 -1.30 2.41
## cgrev 32 9 26.15 10.36 23.81 26.15 6.14 17.24 50.85 33.61 1.40 0.71 3.45
## ssw 33 6 1.67 1.46 0.94 1.67 0.58 0.44 3.80 3.36 0.52 -1.87 0.60
## rgdph 34 11 1899.87 1832.60 1116.28 1522.39 738.30 530.22 6666.77 6136.54 1.50 1.28 552.55
## trade 35 11 77.34 32.13 69.17 76.87 27.13 30.83 128.12 97.29 0.31 -1.40 9.69
## prop1564 36 11 54.23 4.91 53.23 53.51 2.96 49.05 65.95 16.90 1.19 0.34 1.48
## prop65 37 11 3.28 1.16 2.80 3.06 0.65 2.34 6.26 3.92 1.47 1.16 0.35
## federal 38 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## eduger 39 11 73.95 23.54 73.55 73.64 25.47 40.05 110.67 70.62 0.08 -1.50 7.10
## spropn 40 10 0.27 0.42 0.00 0.21 0.00 0.00 1.00 1.00 0.92 -1.07 0.13
## yearele 41 8 1982.50 13.48 1990.50 1982.50 5.19 1965.00 1994.00 29.00 -0.41 -2.00 4.77
## yearreg 42 8 1982.50 13.48 1990.50 1982.50 5.19 1965.00 1994.00 29.00 -0.41 -2.00 4.77
## seats 43 11 151.20 109.96 122.22 136.21 86.65 37.33 400.00 362.67 0.92 -0.20 33.16
## maj 44 11 0.73 0.47 1.00 0.78 0.00 0.00 1.00 1.00 -0.88 -1.31 0.14
## pres 45 11 0.64 0.50 1.00 0.67 0.00 0.00 1.00 1.00 -0.49 -1.91 0.15
## lyp 46 11 7.22 0.81 7.02 7.15 0.88 6.27 8.80 2.53 0.53 -1.18 0.25
## semi 47 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00 1.00 1.43 0.08 0.12
## majpar 48 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00 1.00 1.43 0.08 0.12
## majpres 49 11 0.55 0.52 1.00 0.56 0.00 0.00 1.00 1.00 -0.16 -2.15 0.16
## propres 50 11 0.09 0.30 0.00 0.00 0.00 0.00 1.00 1.00 2.47 4.52 0.09
## dem_age 51 11 1975.82 24.77 1989.00 1981.11 7.41 1910.00 1994.00 84.00 -1.57 1.64 7.47
## lat01 52 11 0.17 0.08 0.18 0.17 0.05 0.00 0.32 0.32 -0.28 -0.38 0.03
## age 53 11 0.12 0.12 0.05 0.09 0.04 0.03 0.45 0.42 1.57 1.64 0.04
## polityIV 54 11 2.34 5.56 0.22 2.42 6.75 -6.00 10.00 16.00 0.07 -1.63 1.68
## spl 55 8 -1.55 4.52 -1.54 -1.55 1.91 -6.77 8.23 15.00 0.98 0.05 1.60
## cpi9500 56 9 5.70 1.15 5.90 5.70 1.14 3.93 7.55 3.61 0.01 -1.45 0.38
## du_60ctry 57 11 0.27 0.47 0.00 0.22 0.00 0.00 1.00 1.00 0.88 -1.31 0.14
## magn 58 11 0.71 0.41 1.00 0.75 0.00 0.02 1.00 0.98 -0.58 -1.70 0.12
## sdm 59 9 0.71 0.45 1.00 0.71 0.00 0.03 1.00 0.97 -0.67 -1.63 0.15
## oecd.x 60 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## mining_gdp 61 10 8.43 11.70 4.10 5.89 5.71 0.02 37.20 37.18 1.39 0.79 3.70
## gini_8090 62 9 50.25 9.95 54.00 50.25 11.86 35.36 62.30 26.94 -0.19 -1.71 3.32
## con2150 63 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
## con5180 64 11 0.27 0.47 0.00 0.22 0.00 0.00 1.00 1.00 0.88 -1.31 0.14
## con81 65 11 0.73 0.47 1.00 0.78 0.00 0.00 1.00 1.00 -0.88 -1.31 0.14
## list 66 11 49.83 119.87 0.00 16.46 0.00 0.00 400.00 400.00 2.22 3.64 36.14
## maj_bad 67 11 2.73 2.05 3.83 2.80 1.56 0.00 4.89 4.89 -0.32 -1.81 0.62
## maj_gin 68 9 37.31 22.84 41.35 37.31 18.75 0.00 62.00 62.00 -0.71 -1.17 7.61
## maj_old 69 11 0.06 0.07 0.04 0.06 0.06 0.00 0.17 0.17 0.69 -1.39 0.02
## pres_bad 70 11 2.63 2.18 3.83 2.67 1.56 0.00 4.89 4.89 -0.30 -1.92 0.66
## pres_gin 71 9 26.72 26.59 35.36 26.72 39.50 0.00 62.00 62.00 0.04 -1.97 8.86
## pres_old 72 11 0.04 0.05 0.03 0.03 0.04 0.00 0.17 0.17 1.66 2.10 0.02
## propar 73 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00 1.00 1.43 0.08 0.12
## lpop 74 3 13.99 0.15 13.92 13.99 0.05 13.88 14.17 0.28 0.36 -2.33 0.09
## continent 75 11 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN 0.00
Recall that we can assign list elements that we extract from a list to their own object, which allows us to conveniently retrieve it whenever it is needed. Below, we’ll assign the summary statistics for Africa to a new object named africa_summary:
# Group-level summary statistics can be assigned to their own object for easy retrieval
africa_summary<-summary_stats_by_continent[["africa"]]Another convenient way to retrieve group-level summary statistics is through the group_by() function in the dplyr package. First, we’ll run the code below, and assign it to a new object named trade_age_by_continent:
# Generate a table that displays summary statistics for trade at the continent level and assign to object named "trade_age_by_continent"
trade_age_by_continent<-pt_copy %>%
group_by(continent) %>%
summarise(meanTrade=mean(trade),sdTrade=sd(trade),
meanAge=mean(age), sdAge=sd(age),
n=n())Now, let’s print the contents of trade_age_by_continent:
## # A tibble: 4 × 6
## continent meanTrade sdTrade meanAge sdAge n
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 africa 77.3 32.1 0.121 0.124 11
## 2 asiae 97.8 84.6 0.110 0.0846 13
## 3 laam 68.6 32.8 0.139 0.153 23
## 4 other 78.8 40.7 0.309 0.263 38
Let’s now unpack the code that created this table. We started with the pt_copy data frame, and then used group_by(continent) to declare that subsequent calculations should be performed at the continent-level; then, within the summarise() function, we defined the column names we wanted to use in the group-level summary table, and how those variables are to be calculated. For example, meanTrade=mean(trade) indicates that we want the first column to be named “meanTrade”, which is to be calculated by taking the mean of the “trade” variable for each continent grouping. After that, sdTrade=sd(trade) indicates that we want the second column to be named “sdTrade”, which is to be calculated by taking the standard deviation of the “trade” variable for each continent grouping. And so on. Note that n=n() indicates that we want the final column, named “n”, to provide information about the number of observations in each continent-level grouping.
You might have noticed a mysterious symbol in the above code that comes immediately after pt_copy, and immediately after group_by(continent). This symbol is known as a “pipe” (%>%). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Above, the pipe takes the contents of pt_copy on its left, and then feeds this data into the group_by() function on the right; then, after grouping the data by continent, it feeds this grouped data on its left into the summarise() function on its right. We will use the pipe operator throughout the lesson to chain together functions in this manner.
Finally, in addition to calculating summary statistics and group-level summary statistics, another useful way to explore your data is to generate simple crosstabs that show the breakdown of one variable with respect to the other. The code below uses the tabyl() function from the janitor package to compute a crosstab between the “federal” variable (i.e. this variable takes on the value of 1 if a country has a federal structure of government, and 0 if it’s a unitary government) and the “continent” variable; it assigns the crosstab to a new object named crosstab_federal_continent:
# Creates cross-tab showing the breakdown of federal/non federal across continents
crosstab_federal_continent<-pt_copy %>% tabyl(federal, continent)Let’s print the contents of crosstab_federal_continent:
## federal africa asiae laam other
## 0 11 11 19 29
## 1 0 2 4 7
## NA 0 0 0 2
This tells us, for instance, that among Latin American countries, 19 had a unitary government, and 4 had federal structure of government.
11.3 Basic Data Cleaning and Preparation Tasks
After getting a sense of your data by computing some summary statistics and running some crosstabs, you’ll often have a sense of how you would like to clean or transform your data for analysis. This subsection briefly describes some functions that are useful for these basic data-preparation tasks.
11.3.1 Rearranging Columns
We can manipulate the order of the columns in a dataset using the relocate function. For example, the code below uses the relocate() function to shift the “country” column to the front of the dataset, and then assigns this change back to pt_copy to update the object:
You can confirm that the change has been implemented by viewing pt_copy in the data viewer:
You can specify more than one argument to the relocate function. For example, in the code below, passing the “country”, “list”, “trade”, and “oecd” variables/columns to the relocate() function will make “country” the first column, “list” the second column, “trade” the third column, and so on.
11.3.2 Renaming Variables
In order to rename variables, we can use the rename() function, as below. The code below renames the existing “list” variable to “party_list”, which is more descriptive, and assigns the change back to the pt_copy object.
## Renaming a variable (renames "list" to "party_list")
pt_copy<-pt_copy %>% rename(party_list=list)Check the pt_copy data frame in the viewer to ensure that the change has been made.
11.3.3 Sorting a dataset in ascending or descending order with respect to a variable
It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the pt_copy data frame in ascending order with respect to the “trade” variable using the arrange() function:
# sorting in ascending (low to high) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(trade)If, instead, you want to sort the dataset in descending order with respect to the “trade” variable, pass the name of the variable to the desc() function within the arrange() function, as below:
11.3.4 Creating new variables based on existing variables
Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. To do so, we can use dplyr’s mutate() function. The code below, for example, uses the mutate() function to create a new variable, named “non_catholic_80”, that is computed by subtracting the existing “catho80” variable from 100; for convenience, the “country”, “catho80”, and newly created “non_catholic_80” variables are all moved to the front of the dataset using the relocate() function:
# Create new variable named "non_catholic_80" that is calculated by substracting the Catholic share of the population in 1980 ("catho80") from 100 and relocates "country", "catho80", and the newly created "non_catholic_80" to the front of the dataset
pt_copy<-pt_copy %>% mutate(non_catholic_80=100-catho80) %>%
relocate(country, catho80, non_catholic_80)You can view the updated pt_copy data frame to confirm that the new variable has been created:
11.3.5 Selecting or Deleting Variables
Sometimes, you will have a dataset with many variables, and to make things more tractable, you’ll want to select only the variables that are relevant to your analysis. You can explicitly select desired variables using the select() function from dplyr. The code below selects the “country”, “cgexp”, “cgrev”, “trade”, and “federal” columns from pt_copy, and then assigns this selection to a new object named pt_copy_selection:
# Selects "country", "cgexp", "cgrev", and "trade" variables from the "pt_copy" dataset and assigns the selection to a new object named "pt_copy_selection"
pt_copy_selection<-pt_copy %>%
select(country, cgexp, cgrev, trade, federal)When you view the pt_copy_selection object in the data viewer, you’ll see that we now have a new data frame that consists only of these variables:
Instead of selecting columns to keep, it may sometimes by easier to directly delete columns. For example, the code below deletes the “federal” variable from pt_copy_selection by passing it to the select() column with a “-” in front of it.
## # A tibble: 85 × 4
## country cgexp cgrev trade
## <chr> <dbl> <dbl> <dbl>
## 1 Singapore 18.5 34.7 343.
## 2 Malta 41.0 35.0 190.
## 3 Luxembourg 40.2 45.5 189.
## 4 Malaysia 24.5 26.8 176.
## 5 Estonia 30.0 31.1 154.
## 6 Belgium 47.9 43.7 132.
## 7 Ireland 38.1 34.8 129.
## 8 Mauritius 22.5 21.6 128.
## 9 St. Vincent&G 34.8 28.7 123.
## 10 Jamaica NA NA 122.
## # ℹ 75 more rows
If you want to delete multiple columns, simply specify the columns in a vector, preceded by a minus sign, that is passed to the select() function. The code below, for instance, takes the existing pt_copy_selection data frame, deletes the “federal” and “trade” columns, and assigns the result to a new object named pt_copy_selection_modified:
# deletes "federal" and "trade" from "pt_copy_selection" and assigns it to new object named "pt_copy_selection_modified"
pt_copy_selection_modified<-pt_copy_selection %>% select(-c(federal, trade))Check the pt_copy_selection_modified data frame in the data viewer to confirm these changes:
## # A tibble: 85 × 3
## country cgexp cgrev
## <chr> <dbl> <dbl>
## 1 Singapore 18.5 34.7
## 2 Malta 41.0 35.0
## 3 Luxembourg 40.2 45.5
## 4 Malaysia 24.5 26.8
## 5 Estonia 30.0 31.1
## 6 Belgium 47.9 43.7
## 7 Ireland 38.1 34.8
## 8 Mauritius 22.5 21.6
## 9 St. Vincent&G 34.8 28.7
## 10 Jamaica NA NA
## # ℹ 75 more rows
11.3.6 Recoding Variables
“Recoding” a variable refers to the process of taking an existing variable, and generating new variable(s) that represent the information from that original variable in a new way. Below, we’ll consider some common recoding operations.
Creating Dummy Variables from Continuous Numeric Variables
You may sometimes have a continuous numeric variable, but want to create a new dummy variable (a variable that takes on the value of 1 if a given condition is met, and 0 otherwise) based on that numeric variable. For example, let’s say we want to create a new variable, named “trade_open” that takes on the value of 1 if the trade variable is greater than or equal to 77, and 0 otherwise. We can generate this new dummy variable using the mutate() function; within the mutate() function below, we specify that we want to create a new variable named “trade_open”; the ifelse() function specifies the desired condition (trade>=77), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value the new “trade_open” variable is to take if the condition is not met (0). In other words, we can translate ifelse(trade>=77, 1, 0) to “if trade>=77, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “pt_copy”:
# Creates a new dummy variable based on the existing "trade" variable named "trade_open" (which takes on a value of "1" if "trade" is greater than or equal to 77, and 0 otherwise) and then moves the newly created variable to the front of the dataset along with "country" and "trade"; all changes are assigned to "pt_copy", thereby overwriting the existing version of "pt_copy"
pt_copy<-pt_copy %>% mutate(trade_open=ifelse(trade>=77, 1, 0)) %>%
relocate(country, trade_open, trade)View the data frame to ensure that the new variable “trade_open”, recoded based on “trade”, has been created:
Creating categorical variables from continuous numeric variables
Sometimes, you will want to create a variable that contains categories or classifications that derive from numeric thresholds of an existing variable. For instance, let’s say we want to take the existing “trade” variable, and define a new variable named “trade_level”, which is set to “Low Trade” when the “trade” variable is greater than 15 and less than 50; “Intermediate_Trade” when the “trade” variable is greater than or equal to 50 and less than 100; and “High_Trade” when the “trade” variable is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate() function, and the case_when() function that maps the conditions onto the desired variable values for “trade_level” using the following syntax:
# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low Trade" when the "trade" variable is greater than 15 and less than 50, coded as "Intermediate Trade" when "trade" is greater than or equal to 50 and less than 100, and coded as "High TradE" when "trade" is greater than or equal to 100), and then reorders the dataset such that "country", "trade_level", and "trade" are the first three variables in the dataset
pt_copy<-pt_copy %>% mutate(trade_level=case_when(trade>15 & trade<50~"Low_Trade",
trade>=50 & trade<100~"Intermediate_Trade",
trade>=100~"High_Trade")) %>%
relocate(country, trade_level, trade)Check to see that the new “trade_level” variable has indeed been created in pt_copy according to the specifications above:
Creating dummmy variables from categorical variables
Sometimes, you may have a categorical variable in a dataset, and want to create dummy variables based on those categories. For example, consider the “trade_level” variable we created above. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols() function. Below, we simply take the existing pt_copy dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols() function:
# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset
pt_copy<-pt_copy %>% dummy_cols("trade_level") %>%
relocate(country, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade)Let’s now view the updated pt_copy data frame, with the newly created dummy variables:
You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on.
11.3.7 Subsetting Variables
You will often want to subset, or “filter” your datasets to extract observations that meet specified criteria. The dplyr packages allows you to carry out these subsetting operations with a function called filter(), which takes various logical conditions as arguments. Let’s say, for example, that we want to extract all of the OECD country observations from the pt_copy dataset. The “oecd” variable in pt_copy is equal to 1, for all OECD countries, and 0 for non-OECD countries. By passing the condition oecd==1 to the filter() function, we can extract all OECD observations. We’ll assign this data subset to a new object named oecd_countries, and view it in the data viewer:
# Extracts OECD observations in "pt_copy" and assigns to object named "oecd_countries"
oecd_countries<-pt_copy %>% filter(oecd==1) %>%
relocate(country, oecd)Let’s take another example. Let’s use the filter() function to extract all observations for which the “cgrev” (central government revenue as a share of GDP) exceeds 40. We’ll assign the observations that satisfy this condition to a new object named high_revenues:
# Extracts observations for which cgrev (central government revenue as % of gdp)>40, and assigns to object named "high_revenues"
high_revenues<-pt_copy %>% filter(cgrev>40) %>%
relocate(country, cgrev)## # A tibble: 10 × 81
## country cgrev trade_level trade_level_High_Trade trade_level_Intermed…¹ trade_level_Low_Trade trade
## <chr> <dbl> <chr> <int> <int> <int> <dbl>
## 1 Luxembourg 45.5 High_Trade 1 0 0 189.
## 2 Belgium 43.7 High_Trade 1 0 0 132.
## 3 Netherlands 47.6 High_Trade 1 0 0 100.
## 4 Botswana 50.8 Intermedia… 0 1 0 87.5
## 5 Hungary 45.6 Intermedia… 0 1 0 73.7
## 6 Norway 41.1 Intermedia… 0 1 0 72.2
## 7 Sweden 40.8 Intermedia… 0 1 0 68.1
## 8 Poland 40.3 Low_Trade 0 0 1 48.3
## 9 France 40.9 Low_Trade 0 0 1 44.9
## 10 Italy 41.2 Low_Trade 0 0 1 44.3
## # ℹ abbreviated name: ¹trade_level_Intermediate_Trade
## # ℹ 74 more variables: trade_open <dbl>, catho80 <dbl>, non_catholic_80 <dbl>, party_list <dbl>,
## # oecd <dbl>, pind <dbl>, pindo <dbl>, ctrycd <dbl>, col_uk <dbl>, t_indep <dbl>, col_uka <dbl>,
## # col_espa <dbl>, col_otha <dbl>, legor_uk <dbl>, legor_so <dbl>, legor_fr <dbl>, legor_ge <dbl>,
## # legor_sc <dbl>, prot80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>,
## # loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, latitude <dbl>,
## # gastil <dbl>, cgexp <dbl>, ssw <dbl>, rgdph <dbl>, prop1564 <dbl>, prop65 <dbl>, federal <dbl>, …
Let’s try another example. Let’s subset observations from pt_copy for which the Catholic share of the population in 1980 (“catho80”) is less than or equal to 50, and assign the filtered data to a new object named minority_catholic:
# Extracts observations for which the "catho80" variable is less than or equal to 50
minority_catholic<-pt_copy %>% filter(catho80<=50) %>%
relocate(country, catho80)It is also possible to chain together multiple conditions as arguments to the filter() function. For example, if we want to subset observations from OECD countries that also have a federal political structure, we can use the “&” operator to specify these two conditions; we’ll assign the filtered dataset to a new object named oecd_federal_countries:
# Extracts federal OECD countries (where oecd=1 AND federal=1) and assigns to a new object named "oecd_federal_countries"
oecd_federal_countries<-pt_copy %>% filter(oecd==1 & federal==1) %>%
relocate(country, oecd, federal)## # A tibble: 7 × 81
## country oecd federal trade_level trade_level_High_Trade trade_level_Intermed…¹ trade_level_Low_Trade
## <chr> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Austria 1 1 Intermedia… 0 1 0
## 2 Switzerla… 1 1 Intermedia… 0 1 0
## 3 Canada 1 1 Intermedia… 0 1 0
## 4 Germany 1 1 Low_Trade 0 0 1
## 5 Mexico 1 1 Low_Trade 0 0 1
## 6 Australia 1 1 Low_Trade 0 0 1
## 7 USA 1 1 Low_Trade 0 0 1
## # ℹ abbreviated name: ¹trade_level_Intermediate_Trade
## # ℹ 74 more variables: trade <dbl>, trade_open <dbl>, catho80 <dbl>, non_catholic_80 <dbl>,
## # party_list <dbl>, pind <dbl>, pindo <dbl>, ctrycd <dbl>, col_uk <dbl>, t_indep <dbl>, col_uka <dbl>,
## # col_espa <dbl>, col_otha <dbl>, legor_uk <dbl>, legor_so <dbl>, legor_fr <dbl>, legor_ge <dbl>,
## # legor_sc <dbl>, prot80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>,
## # loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, latitude <dbl>,
## # gastil <dbl>, cgexp <dbl>, cgrev <dbl>, ssw <dbl>, rgdph <dbl>, prop1564 <dbl>, prop65 <dbl>, …
We can use a vertical line (|) to specify “or” conditions. For example, the code below subsets observations from countries in Africa OR countries in Asia/Europe, and assigns the subsetted data to a new object named asia_europe_africa:
# Extracts observations that are in Africa ("africa") OR in Asia/Europe ("asiae) and assigns to an object named "asia_europe_africa"
asia_europe_africa<-pt_copy %>% filter(continent=="africa"|continent=="asiae") %>%
relocate(continent)# Prints contents of "asia_europe_africa"
asia_europe_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
deferRender = TRUE,
scrollY = 350,
scrollX = 350,
dom = "t",
scroller = TRUE,
fixedColumns = list(leftColumns = 3)
))It is also useful to know how to subset datasets to extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by a “!=”. For example, if we wanted to extract observations from pt_copy where the “continent” variable is NOT equal to “africa”, and assign the result to a new object named pt_copy_sans_africa, we can write the following: