11 Numeric Data Summarizing, Processing and Wrangling Tasks

In this section, we will survey some useful functions (primarily from the dplyr package) for wrangling and processing numeric data. We will demonstrate these functions using the Persson-Tabellini dataset on political-economic data (pt).

First, we’ll make a copy of the pt object, by assigning it to a new object named pt_copy. We’ll use pt_copy when exploring the dataset, which ensures that we do not make inadvertent changes to our original pt data frame, and can always refer back to it when needed. Keeping a “clean” version of the data, and carrying out analysis tasks on a copy of this dataset, is good data management practice.

# Make a copy of the dataset so we don't alter the original dataset; then, view
# the copied dataset 
pt_copy<-pt

We can go ahead and print the contents of pt_copy, which, at this point, is identical to pt:

# Print contents of "pt_copy"
pt_copy
## # A tibble: 85 × 75
##     oecd country     pind pindo ctrycd col_uk t_indep col_uka col_espa col_otha legor_uk legor_so legor_fr
##    <dbl> <chr>      <dbl> <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1     0 Argentina  0     0        213      0     183   0        0.268    0            0        0        1
##  2     1 Australia  1     1        193      1      98   0.608    0        0            1        0        0
##  3     1 Austria    0     0        122      0     250   0        0        0            0        0        0
##  4     0 Bahamas    1     1        313      1      26   0.896    0        0            1        0        0
##  5     0 Bangladesh 1     1        513      0      28   0        0        0.888        1        0        0
##  6     0 Barbados   1     1        316      1      33   0.868    0        0            1        0        0
##  7     0 Belarus    1     1        913      0       8   0        0        0.968        0        1        0
##  8     1 Belgium    0     0        124      0     169   0        0        0.324        0        0        1
##  9     0 Belize     1     1        339      1      18   0.928    0        0            1        0        0
## 10     0 Bolivia    0.116 0.116    218      0     174   0        0.304    0            0        0        1
## # ℹ 75 more rows
## # ℹ 62 more variables: legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>, catho80 <dbl>, confu <dbl>,
## #   avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>, loga <dbl>, yrsopen <dbl>, gadp <dbl>,
## #   engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, latitude <dbl>, gastil <dbl>, cgexp <dbl>, cgrev <dbl>,
## #   ssw <dbl>, rgdph <dbl>, trade <dbl>, prop1564 <dbl>, prop65 <dbl>, federal <dbl>, eduger <dbl>,
## #   spropn <dbl>, yearele <dbl>, yearreg <dbl>, seats <dbl>, maj <dbl>, pres <dbl>, lyp <dbl>,
## #   semi <dbl>, majpar <dbl>, majpres <dbl>, propres <dbl>, dem_age <dbl>, lat01 <dbl>, age <dbl>, …

We can also view it in the data viewer:

# Views "pt_copy" in data frame
View(pt_copy)

11.1 Summary Statistics

Once you have a dataset loaded into R, one of the first things you’ll want to do is likely to generate a table of summary statistics. A quick way to do that is to use the describe() function from the psych package. Below, we’ll generate summary statistics for the pt_copy dataset by passing it to the describe() function, and assign the table of summary statistics to a new object named pt_copy_summarystats1. We’ll then view it in the data viewer:

# Generate summary statistics for "pt_copy" and assign to new object named "pt_copy_summarystats1"
pt_copy_summarystats1<-describe(pt_copy)
# View contents of "pt_copy_summarystats1" in data viewer
View(pt_copy_summarystats1)

While having a simple table of summary statistics is often a useful starting point, it is often useful to generate group-level summary statistics, where summary statistics are presented for different subgroups in the dataset. One way to generate group summary statistics is to use the describeBy() function (also from the psych package), where the first argument is the data frame you would like to generate group-level summary statistics for, and the second argument is the column that contains the relevant groups. Below, we generate summary statistics for pt_copy parsed out by the different continents in the “continent” column. The expression pt_copy$continent indicates that the groups with respect to which we want to calcualte the summary statistics is in the “continent” column of the pt_copy data frame. More generally, we can explicitly refer to columns in an R data frame using this dollar-sign notation, where the expression before the dollar sign refers to the data frame object, and the expression after refers to the name of the column.

The describeBy() function will produce a list that contains summary statistics for different groups as list elements. Below, we’ll assign the list of group summary statistics to a new object named summary_stats_by_continent:

# Creates summary statistics for each continent grouping, and puts results in list named "summary_stats_by_continent"
summary_stats_by_continent<-describeBy(pt_copy, pt_copy$continent)

Now, let’s say we want to extract the summary statistics for Africa, one of the continent categories in the “continent” column. We can do so using the double-bracket notation we discussed above:

# Accessing continent-level summary statistics for africa from the "summary_stats_by_continent" list
summary_stats_by_continent[["africa"]]
##            vars  n    mean      sd  median trimmed    mad     min     max   range  skew kurtosis     se
## oecd          1 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## country       2 11   53.36   24.45   49.00   54.56  28.17   11.00   85.00   74.00 -0.17    -1.41   7.37
## pind          3 11    0.77    0.42    1.00    0.83   0.00    0.00    1.00    1.00 -1.06    -0.79   0.13
## pindo         4 11    0.77    0.42    1.00    0.83   0.00    0.00    1.00    1.00 -1.06    -0.79   0.13
## ctrycd        5 11  647.55  154.90  684.00  685.56  56.34  199.00  754.00  555.00 -2.13     3.44  46.70
## col_uk        6 11    0.82    0.40    1.00    0.89   0.00    0.00    1.00    1.00 -1.43     0.08   0.12
## t_indep       7 11   36.64   19.77   35.00   33.89   5.93    9.00   89.00   80.00  1.38     1.88   5.96
## col_uka       8 11    0.69    0.35    0.86    0.74   0.02    0.00    0.92    0.92 -1.31    -0.14   0.10
## col_espa      9 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## col_otha     10 11    0.15    0.33    0.00    0.07   0.00    0.00    0.96    0.96  1.58     0.79   0.10
## legor_uk     11 11    0.82    0.40    1.00    0.89   0.00    0.00    1.00    1.00 -1.43     0.08   0.12
## legor_so     12 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## legor_fr     13 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00    1.00  1.43     0.08   0.12
## legor_ge     14 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## legor_sc     15 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## prot80       16 11   22.17   20.23   25.80   19.96  19.57    0.10   64.20   64.10  0.46    -0.80   6.10
## catho80      17 11   19.46   13.67   18.70   18.07  13.20    1.90   49.60   47.70  0.71    -0.39   4.12
## confu        18 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## avelf        19 11    0.71    0.14    0.73    0.73   0.15    0.38    0.84    0.46 -1.15     0.44   0.04
## govef        20 11    5.37    0.82    5.02    5.25   0.68    4.56    7.26    2.70  0.97    -0.17   0.25
## graft        21 11    5.11    0.77    5.39    5.12   0.80    3.93    6.23    2.30 -0.17    -1.62   0.23
## logyl        22 11    7.93    0.78    7.75    7.90   0.53    6.95    9.13    2.18  0.42    -1.43   0.23
## loga         23 11    7.38    0.66    7.33    7.37   0.55    6.28    8.58    2.29  0.03    -0.91   0.20
## yrsopen      24 11    0.21    0.29    0.16    0.15   0.18    0.00    1.00    1.00  1.72     2.15   0.09
## gadp         25 11    0.55    0.12    0.54    0.55   0.12    0.37    0.74    0.37  0.28    -1.38   0.04
## engfrac      26 11    0.02    0.04    0.00    0.02   0.00    0.00    0.09    0.09  0.95    -1.09   0.01
## eurfrac      27 11    0.07    0.17    0.00    0.03   0.00    0.00    0.57    0.57  2.24     3.76   0.05
## frankrom     28 11    2.90    0.51    2.94    2.86   0.56    2.19    3.95    1.77  0.54    -0.69   0.15
## latitude     29 11   -9.14   15.17  -15.81   -9.58   8.49  -29.13   14.77   43.90  0.44    -1.52   4.57
## gastil       30 11    3.59    1.16    4.00    3.66   1.32    1.61    4.89    3.28 -0.48    -1.45   0.35
## cgexp        31 10   27.00    7.63   25.50   27.10   8.58   14.65   38.57   23.92  0.06    -1.30   2.41
## cgrev        32  9   26.15   10.36   23.81   26.15   6.14   17.24   50.85   33.61  1.40     0.71   3.45
## ssw          33  6    1.67    1.46    0.94    1.67   0.58    0.44    3.80    3.36  0.52    -1.87   0.60
## rgdph        34 11 1899.87 1832.60 1116.28 1522.39 738.30  530.22 6666.77 6136.54  1.50     1.28 552.55
## trade        35 11   77.34   32.13   69.17   76.87  27.13   30.83  128.12   97.29  0.31    -1.40   9.69
## prop1564     36 11   54.23    4.91   53.23   53.51   2.96   49.05   65.95   16.90  1.19     0.34   1.48
## prop65       37 11    3.28    1.16    2.80    3.06   0.65    2.34    6.26    3.92  1.47     1.16   0.35
## federal      38 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## eduger       39 11   73.95   23.54   73.55   73.64  25.47   40.05  110.67   70.62  0.08    -1.50   7.10
## spropn       40 10    0.27    0.42    0.00    0.21   0.00    0.00    1.00    1.00  0.92    -1.07   0.13
## yearele      41  8 1982.50   13.48 1990.50 1982.50   5.19 1965.00 1994.00   29.00 -0.41    -2.00   4.77
## yearreg      42  8 1982.50   13.48 1990.50 1982.50   5.19 1965.00 1994.00   29.00 -0.41    -2.00   4.77
## seats        43 11  151.20  109.96  122.22  136.21  86.65   37.33  400.00  362.67  0.92    -0.20  33.16
## maj          44 11    0.73    0.47    1.00    0.78   0.00    0.00    1.00    1.00 -0.88    -1.31   0.14
## pres         45 11    0.64    0.50    1.00    0.67   0.00    0.00    1.00    1.00 -0.49    -1.91   0.15
## lyp          46 11    7.22    0.81    7.02    7.15   0.88    6.27    8.80    2.53  0.53    -1.18   0.25
## semi         47 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00    1.00  1.43     0.08   0.12
## majpar       48 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00    1.00  1.43     0.08   0.12
## majpres      49 11    0.55    0.52    1.00    0.56   0.00    0.00    1.00    1.00 -0.16    -2.15   0.16
## propres      50 11    0.09    0.30    0.00    0.00   0.00    0.00    1.00    1.00  2.47     4.52   0.09
## dem_age      51 11 1975.82   24.77 1989.00 1981.11   7.41 1910.00 1994.00   84.00 -1.57     1.64   7.47
## lat01        52 11    0.17    0.08    0.18    0.17   0.05    0.00    0.32    0.32 -0.28    -0.38   0.03
## age          53 11    0.12    0.12    0.05    0.09   0.04    0.03    0.45    0.42  1.57     1.64   0.04
## polityIV     54 11    2.34    5.56    0.22    2.42   6.75   -6.00   10.00   16.00  0.07    -1.63   1.68
## spl          55  8   -1.55    4.52   -1.54   -1.55   1.91   -6.77    8.23   15.00  0.98     0.05   1.60
## cpi9500      56  9    5.70    1.15    5.90    5.70   1.14    3.93    7.55    3.61  0.01    -1.45   0.38
## du_60ctry    57 11    0.27    0.47    0.00    0.22   0.00    0.00    1.00    1.00  0.88    -1.31   0.14
## magn         58 11    0.71    0.41    1.00    0.75   0.00    0.02    1.00    0.98 -0.58    -1.70   0.12
## sdm          59  9    0.71    0.45    1.00    0.71   0.00    0.03    1.00    0.97 -0.67    -1.63   0.15
## oecd.x       60 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## mining_gdp   61 10    8.43   11.70    4.10    5.89   5.71    0.02   37.20   37.18  1.39     0.79   3.70
## gini_8090    62  9   50.25    9.95   54.00   50.25  11.86   35.36   62.30   26.94 -0.19    -1.71   3.32
## con2150      63 11    0.00    0.00    0.00    0.00   0.00    0.00    0.00    0.00   NaN      NaN   0.00
## con5180      64 11    0.27    0.47    0.00    0.22   0.00    0.00    1.00    1.00  0.88    -1.31   0.14
## con81        65 11    0.73    0.47    1.00    0.78   0.00    0.00    1.00    1.00 -0.88    -1.31   0.14
## list         66 11   49.83  119.87    0.00   16.46   0.00    0.00  400.00  400.00  2.22     3.64  36.14
## maj_bad      67 11    2.73    2.05    3.83    2.80   1.56    0.00    4.89    4.89 -0.32    -1.81   0.62
## maj_gin      68  9   37.31   22.84   41.35   37.31  18.75    0.00   62.00   62.00 -0.71    -1.17   7.61
## maj_old      69 11    0.06    0.07    0.04    0.06   0.06    0.00    0.17    0.17  0.69    -1.39   0.02
## pres_bad     70 11    2.63    2.18    3.83    2.67   1.56    0.00    4.89    4.89 -0.30    -1.92   0.66
## pres_gin     71  9   26.72   26.59   35.36   26.72  39.50    0.00   62.00   62.00  0.04    -1.97   8.86
## pres_old     72 11    0.04    0.05    0.03    0.03   0.04    0.00    0.17    0.17  1.66     2.10   0.02
## propar       73 11    0.18    0.40    0.00    0.11   0.00    0.00    1.00    1.00  1.43     0.08   0.12
## lpop         74  3   13.99    0.15   13.92   13.99   0.05   13.88   14.17    0.28  0.36    -2.33   0.09
## continent    75 11    1.00    0.00    1.00    1.00   0.00    1.00    1.00    0.00   NaN      NaN   0.00

Recall that we can assign list elements that we extract from a list to their own object, which allows us to conveniently retrieve it whenever it is needed. Below, we’ll assign the summary statistics for Africa to a new object named africa_summary:

# Group-level summary statistics can be assigned to their own object for easy retrieval
africa_summary<-summary_stats_by_continent[["africa"]]

Another convenient way to retrieve group-level summary statistics is through the group_by() function in the dplyr package. First, we’ll run the code below, and assign it to a new object named trade_age_by_continent:

# Generate a table that displays summary statistics for trade at the continent level and assign to object named "trade_age_by_continent"
trade_age_by_continent<-pt_copy %>% 
                          group_by(continent) %>% 
                            summarise(meanTrade=mean(trade),sdTrade=sd(trade),
                                      meanAge=mean(age), sdAge=sd(age),
                                      n=n())

Now, let’s print the contents of trade_age_by_continent:

# prints contents of "trade_age_by_continent"
trade_age_by_continent
## # A tibble: 4 × 6
##   continent meanTrade sdTrade meanAge  sdAge     n
##   <chr>         <dbl>   <dbl>   <dbl>  <dbl> <int>
## 1 africa         77.3    32.1   0.121 0.124     11
## 2 asiae          97.8    84.6   0.110 0.0846    13
## 3 laam           68.6    32.8   0.139 0.153     23
## 4 other          78.8    40.7   0.309 0.263     38

Let’s now unpack the code that created this table. We started with the pt_copy data frame, and then used group_by(continent) to declare that subsequent calculations should be performed at the continent-level; then, within the summarise() function, we defined the column names we wanted to use in the group-level summary table, and how those variables are to be calculated. For example, meanTrade=mean(trade) indicates that we want the first column to be named “meanTrade”, which is to be calculated by taking the mean of the “trade” variable for each continent grouping. After that, sdTrade=sd(trade) indicates that we want the second column to be named “sdTrade”, which is to be calculated by taking the standard deviation of the “trade” variable for each continent grouping. And so on. Note that n=n() indicates that we want the final column, named “n”, to provide information about the number of observations in each continent-level grouping.

You might have noticed a mysterious symbol in the above code that comes immediately after pt_copy, and immediately after group_by(continent). This symbol is known as a “pipe” (%>%). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Above, the pipe takes the contents of pt_copy on its left, and then feeds this data into the group_by() function on the right; then, after grouping the data by continent, it feeds this grouped data on its left into the summarise() function on its right. We will use the pipe operator throughout the lesson to chain together functions in this manner.

Finally, in addition to calculating summary statistics and group-level summary statistics, another useful way to explore your data is to generate simple crosstabs that show the breakdown of one variable with respect to the other. The code below uses the tabyl() function from the janitor package to compute a crosstab between the “federal” variable (i.e. this variable takes on the value of 1 if a country has a federal structure of government, and 0 if it’s a unitary government) and the “continent” variable; it assigns the crosstab to a new object named crosstab_federal_continent:

# Creates cross-tab showing the breakdown of federal/non federal across continents
crosstab_federal_continent<-pt_copy %>% tabyl(federal, continent)

Let’s print the contents of crosstab_federal_continent:

# prints contents of "crosstab_federal_continent"
crosstab_federal_continent
##  federal africa asiae laam other
##        0     11    11   19    29
##        1      0     2    4     7
##       NA      0     0    0     2

This tells us, for instance, that among Latin American countries, 19 had a unitary government, and 4 had federal structure of government.

11.2 Missing Data

11.3 Basic Data Cleaning and Preparation Tasks

After getting a sense of your data by computing some summary statistics and running some crosstabs, you’ll often have a sense of how you would like to clean or transform your data for analysis. This subsection briefly describes some functions that are useful for these basic data-preparation tasks.

11.3.1 Rearranging Columns

We can manipulate the order of the columns in a dataset using the relocate function. For example, the code below uses the relocate() function to shift the “country” column to the front of the dataset, and then assigns this change back to pt_copy to update the object:

# bring the "country" column to the front of the dataset
pt_copy<-pt_copy %>% relocate(country)

You can confirm that the change has been implemented by viewing pt_copy in the data viewer:

# Views "pt_copy" in data viewer
View(pt_copy)

You can specify more than one argument to the relocate function. For example, in the code below, passing the “country”, “list”, “trade”, and “oecd” variables/columns to the relocate() function will make “country” the first column, “list” the second column, “trade” the third column, and so on.

# bring the "country", "list", "trade", "oecd" columns to the front of the dataset
pt_copy<-pt_copy %>% relocate(country, list, trade, oecd)
# Views updated "pt_copy" data frame in data viewer
View(pt_copy)

11.3.2 Renaming Variables

In order to rename variables, we can use the rename() function, as below. The code below renames the existing “list” variable to “party_list”, which is more descriptive, and assigns the change back to the pt_copy object.

## Renaming a variable (renames "list" to "party_list")
pt_copy<-pt_copy %>% rename(party_list=list)

Check the pt_copy data frame in the viewer to ensure that the change has been made.

11.3.3 Sorting a dataset in ascending or descending order with respect to a variable

It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the pt_copy data frame in ascending order with respect to the “trade” variable using the arrange() function:

# sorting in ascending (low to high) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(trade)

If, instead, you want to sort the dataset in descending order with respect to the “trade” variable, pass the name of the variable to the desc() function within the arrange() function, as below:

# sorting in descending (high to low) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(desc(trade))

11.3.4 Creating new variables based on existing variables

Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. To do so, we can use dplyr’s mutate() function. The code below, for example, uses the mutate() function to create a new variable, named “non_catholic_80”, that is computed by subtracting the existing “catho80” variable from 100; for convenience, the “country”, “catho80”, and newly created “non_catholic_80” variables are all moved to the front of the dataset using the relocate() function:

# Create new variable named "non_catholic_80" that is calculated by substracting the Catholic share of the population in 1980 ("catho80") from 100  and relocates "country", "catho80", and the newly created "non_catholic_80" to the front of the dataset
pt_copy<-pt_copy %>% mutate(non_catholic_80=100-catho80) %>% 
                     relocate(country, catho80, non_catholic_80)

You can view the updated pt_copy data frame to confirm that the new variable has been created:

# views updated "pt_copy" data frame in R Studio data viewer
View(pt_copy)

11.3.5 Selecting or Deleting Variables

Sometimes, you will have a dataset with many variables, and to make things more tractable, you’ll want to select only the variables that are relevant to your analysis. You can explicitly select desired variables using the select() function from dplyr. The code below selects the “country”, “cgexp”, “cgrev”, “trade”, and “federal” columns from pt_copy, and then assigns this selection to a new object named pt_copy_selection:

# Selects "country", "cgexp", "cgrev", and "trade" variables from the "pt_copy" dataset and assigns the selection to a new object named "pt_copy_selection"
pt_copy_selection<-pt_copy %>% 
                    select(country, cgexp, cgrev, trade, federal)

When you view the pt_copy_selection object in the data viewer, you’ll see that we now have a new data frame that consists only of these variables:

# views "pt_copy_selection" in data viewer
View(pt_copy_selection)

Instead of selecting columns to keep, it may sometimes by easier to directly delete columns. For example, the code below deletes the “federal” variable from pt_copy_selection by passing it to the select() column with a “-” in front of it.

# deletes "federal" variable from "pt_copy_selection"
pt_copy_selection %>% select(-federal)
## # A tibble: 85 × 4
##    country       cgexp cgrev trade
##    <chr>         <dbl> <dbl> <dbl>
##  1 Singapore      18.5  34.7  343.
##  2 Malta          41.0  35.0  190.
##  3 Luxembourg     40.2  45.5  189.
##  4 Malaysia       24.5  26.8  176.
##  5 Estonia        30.0  31.1  154.
##  6 Belgium        47.9  43.7  132.
##  7 Ireland        38.1  34.8  129.
##  8 Mauritius      22.5  21.6  128.
##  9 St. Vincent&G  34.8  28.7  123.
## 10 Jamaica        NA    NA    122.
## # ℹ 75 more rows

If you want to delete multiple columns, simply specify the columns in a vector, preceded by a minus sign, that is passed to the select() function. The code below, for instance, takes the existing pt_copy_selection data frame, deletes the “federal” and “trade” columns, and assigns the result to a new object named pt_copy_selection_modified:

# deletes "federal" and "trade" from "pt_copy_selection" and assigns it to new object named "pt_copy_selection_modified"
pt_copy_selection_modified<-pt_copy_selection %>% select(-c(federal, trade))

Check the pt_copy_selection_modified data frame in the data viewer to confirm these changes:

# views "pt_copy_selection_modified" in data viewer
pt_copy_selection_modified
## # A tibble: 85 × 3
##    country       cgexp cgrev
##    <chr>         <dbl> <dbl>
##  1 Singapore      18.5  34.7
##  2 Malta          41.0  35.0
##  3 Luxembourg     40.2  45.5
##  4 Malaysia       24.5  26.8
##  5 Estonia        30.0  31.1
##  6 Belgium        47.9  43.7
##  7 Ireland        38.1  34.8
##  8 Mauritius      22.5  21.6
##  9 St. Vincent&G  34.8  28.7
## 10 Jamaica        NA    NA  
## # ℹ 75 more rows

11.3.6 Recoding Variables

“Recoding” a variable refers to the process of taking an existing variable, and generating new variable(s) that represent the information from that original variable in a new way. Below, we’ll consider some common recoding operations.

Creating Dummy Variables from Continuous Numeric Variables

You may sometimes have a continuous numeric variable, but want to create a new dummy variable (a variable that takes on the value of 1 if a given condition is met, and 0 otherwise) based on that numeric variable. For example, let’s say we want to create a new variable, named “trade_open” that takes on the value of 1 if the trade variable is greater than or equal to 77, and 0 otherwise. We can generate this new dummy variable using the mutate() function; within the mutate() function below, we specify that we want to create a new variable named “trade_open”; the ifelse() function specifies the desired condition (trade>=77), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value the new “trade_open” variable is to take if the condition is not met (0). In other words, we can translate ifelse(trade>=77, 1, 0) to “if trade>=77, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “pt_copy”:

# Creates a new dummy variable based on the existing "trade" variable named "trade_open" (which takes on a value of "1" if "trade" is greater than or equal to 77, and 0 otherwise) and then moves the newly created variable to the front of the dataset along with "country" and "trade"; all changes are assigned to "pt_copy", thereby overwriting the existing version of "pt_copy"

pt_copy<-pt_copy %>% mutate(trade_open=ifelse(trade>=77, 1, 0)) %>% 
                     relocate(country, trade_open, trade)

View the data frame to ensure that the new variable “trade_open”, recoded based on “trade”, has been created:

Creating categorical variables from continuous numeric variables

Sometimes, you will want to create a variable that contains categories or classifications that derive from numeric thresholds of an existing variable. For instance, let’s say we want to take the existing “trade” variable, and define a new variable named “trade_level”, which is set to “Low Trade” when the “trade” variable is greater than 15 and less than 50; “Intermediate_Trade” when the “trade” variable is greater than or equal to 50 and less than 100; and “High_Trade” when the “trade” variable is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate() function, and the case_when() function that maps the conditions onto the desired variable values for “trade_level” using the following syntax:

# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low Trade" when the "trade" variable is greater than 15 and less than 50, coded as "Intermediate Trade" when "trade" is greater than or equal to 50 and less than 100, and coded as "High TradE" when "trade" is greater than or equal to 100), and then reorders the dataset such that "country", "trade_level", and "trade" are the first three variables in the dataset
pt_copy<-pt_copy %>% mutate(trade_level=case_when(trade>15 & trade<50~"Low_Trade",
                                                  trade>=50 & trade<100~"Intermediate_Trade",
                                                  trade>=100~"High_Trade")) %>% 
                    relocate(country, trade_level, trade)

Check to see that the new “trade_level” variable has indeed been created in pt_copy according to the specifications above:

# views updated "pt_copy" data frame in data viewer
View(pt_copy)

Creating dummmy variables from categorical variables

Sometimes, you may have a categorical variable in a dataset, and want to create dummy variables based on those categories. For example, consider the “trade_level” variable we created above. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols() function. Below, we simply take the existing pt_copy dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols() function:

# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset
pt_copy<-pt_copy %>% dummy_cols("trade_level") %>% 
                      relocate(country, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade)

Let’s now view the updated pt_copy data frame, with the newly created dummy variables:

# views updated "pt_copy" in data viewer
View(pt_copy)

You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on.

11.3.7 Subsetting Variables

You will often want to subset, or “filter” your datasets to extract observations that meet specified criteria. The dplyr packages allows you to carry out these subsetting operations with a function called filter(), which takes various logical conditions as arguments. Let’s say, for example, that we want to extract all of the OECD country observations from the pt_copy dataset. The “oecd” variable in pt_copy is equal to 1, for all OECD countries, and 0 for non-OECD countries. By passing the condition oecd==1 to the filter() function, we can extract all OECD observations. We’ll assign this data subset to a new object named oecd_countries, and view it in the data viewer:

# Extracts OECD observations in "pt_copy" and assigns to object named "oecd_countries"
oecd_countries<-pt_copy %>% filter(oecd==1) %>% 
                            relocate(country, oecd)
# views "oecd_countries" in data viewer
View(oecd_countries)

Let’s take another example. Let’s use the filter() function to extract all observations for which the “cgrev” (central government revenue as a share of GDP) exceeds 40. We’ll assign the observations that satisfy this condition to a new object named high_revenues:

# Extracts observations for which cgrev (central government revenue as % of gdp)>40, and assigns to object named "high_revenues"
high_revenues<-pt_copy %>% filter(cgrev>40) %>% 
                              relocate(country, cgrev)
# Views "high_revenues" in data viewer
high_revenues
## # A tibble: 10 × 81
##    country     cgrev trade_level trade_level_High_Trade trade_level_Intermed…¹ trade_level_Low_Trade trade
##    <chr>       <dbl> <chr>                        <int>                  <int>                 <int> <dbl>
##  1 Luxembourg   45.5 High_Trade                       1                      0                     0 189. 
##  2 Belgium      43.7 High_Trade                       1                      0                     0 132. 
##  3 Netherlands  47.6 High_Trade                       1                      0                     0 100. 
##  4 Botswana     50.8 Intermedia…                      0                      1                     0  87.5
##  5 Hungary      45.6 Intermedia…                      0                      1                     0  73.7
##  6 Norway       41.1 Intermedia…                      0                      1                     0  72.2
##  7 Sweden       40.8 Intermedia…                      0                      1                     0  68.1
##  8 Poland       40.3 Low_Trade                        0                      0                     1  48.3
##  9 France       40.9 Low_Trade                        0                      0                     1  44.9
## 10 Italy        41.2 Low_Trade                        0                      0                     1  44.3
## # ℹ abbreviated name: ¹​trade_level_Intermediate_Trade
## # ℹ 74 more variables: trade_open <dbl>, catho80 <dbl>, non_catholic_80 <dbl>, party_list <dbl>,
## #   oecd <dbl>, pind <dbl>, pindo <dbl>, ctrycd <dbl>, col_uk <dbl>, t_indep <dbl>, col_uka <dbl>,
## #   col_espa <dbl>, col_otha <dbl>, legor_uk <dbl>, legor_so <dbl>, legor_fr <dbl>, legor_ge <dbl>,
## #   legor_sc <dbl>, prot80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>,
## #   loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, latitude <dbl>,
## #   gastil <dbl>, cgexp <dbl>, ssw <dbl>, rgdph <dbl>, prop1564 <dbl>, prop65 <dbl>, federal <dbl>, …

Let’s try another example. Let’s subset observations from pt_copy for which the Catholic share of the population in 1980 (“catho80”) is less than or equal to 50, and assign the filtered data to a new object named minority_catholic:

# Extracts observations for which the "catho80" variable is less than or equal to 50
minority_catholic<-pt_copy %>% filter(catho80<=50) %>% 
                               relocate(country, catho80)
# Views "minority_catholic" in the data viewer
View(minority_catholic)

It is also possible to chain together multiple conditions as arguments to the filter() function. For example, if we want to subset observations from OECD countries that also have a federal political structure, we can use the “&” operator to specify these two conditions; we’ll assign the filtered dataset to a new object named oecd_federal_countries:

# Extracts federal OECD countries (where oecd=1 AND federal=1) and assigns to a new object named "oecd_federal_countries"
oecd_federal_countries<-pt_copy %>% filter(oecd==1 & federal==1) %>% 
                                      relocate(country, oecd, federal)
# Views "oecd_federal_countries" in data viewer
oecd_federal_countries
## # A tibble: 7 × 81
##   country     oecd federal trade_level trade_level_High_Trade trade_level_Intermed…¹ trade_level_Low_Trade
##   <chr>      <dbl>   <dbl> <chr>                        <int>                  <int>                 <int>
## 1 Austria        1       1 Intermedia…                      0                      1                     0
## 2 Switzerla…     1       1 Intermedia…                      0                      1                     0
## 3 Canada         1       1 Intermedia…                      0                      1                     0
## 4 Germany        1       1 Low_Trade                        0                      0                     1
## 5 Mexico         1       1 Low_Trade                        0                      0                     1
## 6 Australia      1       1 Low_Trade                        0                      0                     1
## 7 USA            1       1 Low_Trade                        0                      0                     1
## # ℹ abbreviated name: ¹​trade_level_Intermediate_Trade
## # ℹ 74 more variables: trade <dbl>, trade_open <dbl>, catho80 <dbl>, non_catholic_80 <dbl>,
## #   party_list <dbl>, pind <dbl>, pindo <dbl>, ctrycd <dbl>, col_uk <dbl>, t_indep <dbl>, col_uka <dbl>,
## #   col_espa <dbl>, col_otha <dbl>, legor_uk <dbl>, legor_so <dbl>, legor_fr <dbl>, legor_ge <dbl>,
## #   legor_sc <dbl>, prot80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>,
## #   loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, latitude <dbl>,
## #   gastil <dbl>, cgexp <dbl>, cgrev <dbl>, ssw <dbl>, rgdph <dbl>, prop1564 <dbl>, prop65 <dbl>, …

We can use a vertical line (|) to specify “or” conditions. For example, the code below subsets observations from countries in Africa OR countries in Asia/Europe, and assigns the subsetted data to a new object named asia_europe_africa:

# Extracts observations that are in Africa ("africa") OR in Asia/Europe ("asiae) and assigns to an object named "asia_europe_africa"
asia_europe_africa<-pt_copy %>% filter(continent=="africa"|continent=="asiae") %>% 
                                  relocate(continent)
# views "asia_europe_africa" in data viewer
View(asia_europe_africa)
# Prints contents of "asia_europe_africa"
asia_europe_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
  deferRender = TRUE,
  scrollY = 350,
  scrollX = 350,
  dom = "t",
  scroller = TRUE,
  fixedColumns = list(leftColumns = 3)
))

It is also useful to know how to subset datasets to extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by a “!=”. For example, if we wanted to extract observations from pt_copy where the “continent” variable is NOT equal to “africa”, and assign the result to a new object named pt_copy_sans_africa, we can write the following:

# Extracts all non-Africa observations and assigns to object named "pt_copy_sans_africa"
pt_copy_sans_africa<-pt_copy %>% filter(continent!="africa") %>% relocate(continent)
# views pt_copy_sans_africa in the data viewer
View(pt_copy_sans_africa)
# Prints contents of "pt_copy_sans_africa"
pt_copy_sans_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
  deferRender = TRUE,
  scrollY = 350,
  scrollX = 350,
  dom = "t",
  scroller = TRUE,
  fixedColumns = list(leftColumns = 3)
))