A Using tidycensus to extract relevant census data

In Section 5.1, we read in a county-level demographic dataset (extracted from the 2010 decennial census), which we subsequently used to generate one of the components of our bias indicator. That dataset was provided to you at the start of the workshop, and we did not discuss the process by which the dataset was generated in the main lesson. However, if you are curious about how that data was generated, the script in this Appendix can be used as a guide to reproduce the demographic dataset introduced in Section 5.1.

To extract the census demographic data required to create the bias index, we must first load a few required packages. In particular, please load the tidyverse (which was used in the workshop tutorial) and the tidycensus package, which is an R package that allows users to pull Census Bureau data using the Census API (if tidycensus is not already installed, please install it using install.packages("tidycensus").

# Loads libraries required to extract census data
library(tidycensus)
library(tidyverse)

Before extracting data with tidycensus, you must acquire a Census API key from the Census Bureau website; once you apply for a key on the website, your key will be immediately emailed to the address you provide. Enter your census API key in R Studio with the following code, replacing the with your key:

census_api_key("INSERT HERE")

In the workshop, we were working with 2010 police stops data, so it made sense to pull demographic data from the 2010 decennial census (which would have been collected in 2009). The discussion below is therefore framed with respect to the 2010 decennial census; if you choose to use another census data product (such as the American Community Survey) to create the bias index, your code may look slightly different.

A.1 Define your variables

First, we can generate a table that contains the various variables (and associated variable codes) for the 2010 decennial census by using the load_variables() function. The arguments to the function below (2010, "sf1") indicate that we want to extract variable names and codes for the 2010 decennial census. Below, we assign the table to an object named decennial_2010_variables:

# Variable list for 2010 Decennial assigned to object named "decennial_2010_variables"
decennial_2010_variables<-load_variables(2010, "sf1")

At this point, we can easily view the 2010 decennial variables table by passing decennial_2010_variables as an argument to the View() function:

Based on the information in decennial_2010_variables, we can identify the variable codes for the variables we want to extract. Then, we will define a vector (see below), assigned to an object named my_vars that assigns the variable codes to descriptive names; these descriptive names will be used as column names in the dataset returned by the census API call, while the variable codes will be used by tidycensus to populate the respective fields with the desired data.

For the purpose of defining the “bias_index” variable, recall that we needed to define the county-level percentage of traffic stops involving a Black motorist, and the county-level percentage of the adult Black population (with respect to the county’s overall adult population). The first component of this index was derived using the Stanford Open Policing data. In order to calculate the second component of the bias index, we need two key variables at the county level: the total adult (which we define as 17+) population, and the total adult Black population. These two variables can be extracted from the decennial census. However, there is no separate category in the decennial census for these two measures, so we must derive them based on the data that is available.

Therefore, given the available data, calculating the total over-17 population will require us to extract data for the male under-5 population, the male 5 to 9 population, the male 10 to 14 population, and the male 15 to 17 population, and analogous measures for the female population. Subtracting these values from the total overall population (among all age groups) will yield a value for the total over-17 population.

To calculate the Black over-17 population, we will extract a variable that defines the total Black population, and a series of variables that indicate the Black population for different demographic (sex/age) combinations 17 years old and younger; subtracting the sum of the latter variables from the total Black population yields a value for the Black over-17 population.

The code below extracts all the variables needed to carry out these calculations:

# Define and name variables for census API call
my_vars<-c(total_pop="P001001",
           totalpop_men_u5="P012003",
           totalpop_men_5to9="P012004",
           totalpop_men_10to14="P012005",
           totalpop_men_15to17="P012006",
           totalpop_women_u5="P012027",
           totalpop_women_5to9="P012028",
           totalpop_women_10to14="P012029",
           totalpop_women_15to17="P012030",
           black_totalpop="PCT012B001",
           black_men_u1="PCT012B003",
           black_men_1="PCT012B004",
           black_men_2="PCT012B005",
           black_men_3="PCT012B006",
           black_men_4="PCT012B007",
           black_men_5="PCT012B008",
           black_men_6="PCT012B009",
           black_men_7="PCT012B010",
           black_men_8="PCT012B011",
           black_men_9="PCT012B012",
           black_men_10="PCT012B013",
           black_men_11="PCT012B014",
           black_men_12="PCT012B015",
           black_men_13="PCT012B016",
           black_men_14="PCT012B017",
           black_men_15="PCT012B018",
           black_men_16="PCT012B019",
           black_men_17="PCT012B020",
           black_women_u1="PCT012B107",
           black_women_1="PCT012B108",
           black_women_2="PCT012B109",
           black_women_3="PCT012B110",
           black_women_4="PCT012B111",
           black_women_5="PCT012B112",
           black_women_6="PCT012B113",
           black_women_7="PCT012B114",
           black_women_8="PCT012B115",
           black_women_9="PCT012B116",
           black_women_10="PCT012B117",
           black_women_11="PCT012B118",
           black_women_12="PCT012B119",
           black_women_13="PCT012B120",
           black_women_14="PCT012B121",
           black_women_15="PCT012B122",
           black_women_16="PCT012B123",
           black_women_17="PCT012B124")

A.2 Extract the variables using tidycensus

Now that we have a vector of the variables we want to extract (along with descriptive names for those variables), we will use the get_decennial() function from tidycensus to extract these variables from the 2010 decennial census. Several arguments are passed to the get_decennial() function below:

geography="county" specifies that we want the census data to be provided at the county level
variables=my_vars specifies the variables we want to extract, and the names they are to be given in the dataset; this information is contained in the my_vars vector defined above
state=CO specifies the state for which we want to extract the data; this argument, together with the geography="county" argument, means that tidycensus will extract the specified data in my_vars at the county level for the state of Colorado.
survey="sf1" indicates which census dataset we would like to query; here sf1 (short for Summary File 1), indicates we are referring to the decennial census (as opposed, for example, to the American Community Survey)
output=wide indicates that we want the dataset with the extracted variables to be in “wide” format, wherein each variable is assigned to its own column.
year=2010 indicates that we are interested in data from 2010. Combined with survey="sf1", this will extract the 2010 decennial census data.

Finally, we’ll assign the extracted dataset to an object named co_counties_race:

# Issue call to Census API
co_counties_race<-
get_decennial(
  geography="county", 
  variables=my_vars,
  state="CO",
  survey="sf1",
  output="wide",
  year=2010)

## Getting data from the 2010 decennial Census

## Using FIPS code '08' for state 'CO'

## Using Census Summary File 1

We can print the first few lines of the dataset to the console to view its structure, and ensure that everything looks in order:

# prints contents of "co_counties_race"
co_counties_race

# A tibble: 64 × 48
   GEOID NAME       total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15… totalpop_women_…
   <chr> <chr>          <dbl>           <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1 08023 Costilla …      3524              99              102              107               95               76
 2 08025 Crowley C…      5823              88              110              136               81               94
 3 08027 Custer Co…      4255              62               95              113               85               74
 4 08029 Delta Cou…     30952             887              913             1042              680              865
 5 08031 Denver Co…    600158           22252            18894            15319             8920            21263
 6 08035 Douglas C…    285465           11278            13587            12664             6913            10622
 7 08033 Dolores C…      2064              58               63               71               40               78
 8 08049 Grand Cou…     14843             416              421              431              261              422
 9 08039 Elbert Co…     23086             594              790              976              611              563
10 08041 El Paso C…    622263           23152            23050            23252            14097            22060
# … with 54 more rows, and 40 more variables: totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
#   totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>, black_men_2 <dbl>,
#   black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>, black_men_6 <dbl>, black_men_7 <dbl>,
#   black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>, black_men_12 <dbl>,
#   black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>, black_men_16 <dbl>, black_men_17 <dbl>,
#   black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>, black_women_3 <dbl>, black_women_4 <dbl>,
#   black_women_5 <dbl>, black_women_6 <dbl>, black_women_7 <dbl>, black_women_8 <dbl>, black_women_9 <dbl>, …

As always, it is also possible to view the dataset in the R Studio data viewer by running View(co_counties_race).

A.3 Clean the tidycensus dataset

Having extracted the dataset, you may want to tidy it up, depending on your needs and preferences. For example, the “NAME” field includes the name of the state, which is not really necessary here since there are only observations from Colorado in the dataset. The code below creates a new variable named “County”, which is populated by removing the state name from the “NAME” field, and updates the co_counties_race object with this new variable:

# Remove state name from NAME field
co_counties_race<-co_counties_race %>% 
                  mutate(County=str_remove_all(NAME, ", Colorado"))

Note the newly created “County” variable is now in the dataset:

# Prints contents of "co_counties_race"
co_counties_race

# A tibble: 64 × 49
   GEOID County          NAME        total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15…
   <chr> <chr>           <chr>           <dbl>           <dbl>            <dbl>            <dbl>            <dbl>
 1 08023 Costilla County Costilla C…      3524              99              102              107               95
 2 08025 Crowley County  Crowley Co…      5823              88              110              136               81
 3 08027 Custer County   Custer Cou…      4255              62               95              113               85
 4 08029 Delta County    Delta Coun…     30952             887              913             1042              680
 5 08031 Denver County   Denver Cou…    600158           22252            18894            15319             8920
 6 08035 Douglas County  Douglas Co…    285465           11278            13587            12664             6913
 7 08033 Dolores County  Dolores Co…      2064              58               63               71               40
 8 08049 Grand County    Grand Coun…     14843             416              421              431              261
 9 08039 Elbert County   Elbert Cou…     23086             594              790              976              611
10 08041 El Paso County  El Paso Co…    622263           23152            23050            23252            14097
# … with 54 more rows, and 41 more variables: totalpop_women_u5 <dbl>, totalpop_women_5to9 <dbl>,
#   totalpop_women_10to14 <dbl>, totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>,
#   black_men_1 <dbl>, black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>,
#   black_men_6 <dbl>, black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>,
#   black_men_11 <dbl>, black_men_12 <dbl>, black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>,
#   black_men_16 <dbl>, black_men_17 <dbl>, black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>,
#   black_women_3 <dbl>, black_women_4 <dbl>, black_women_5 <dbl>, black_women_6 <dbl>, black_women_7 <dbl>, …

A.4 Define new variables

Now that we have a cleaned dataset with all our necessary variables, we can use these variables to generate the demographic variables needed to calculate the bias index. First, the code below defines a new variable, called “total_pop_over17”, that is calculated by subtracting the total population that is 17 and under from the total overall population:

# Create variable for total over-17 population
co_counties_race<-
  co_counties_race %>% 
     mutate(total_pop_over17=
              total_pop-totalpop_men_u5-totalpop_men_5to9-
              totalpop_men_10to14-totalpop_men_15to17-totalpop_women_u5-
              totalpop_women_5to9-totalpop_women_10to14-totalpop_women_15to17)

Then, we create a new variable named “total_black_pop_over17”, which is defined by subtracting the total Black population that is 17 and under from the total Black population:

# Create variable for total over-17 black population
co_counties_race<-
  co_counties_race %>% 
  mutate(total_black_pop_over17=
          black_totalpop-black_men_u1-black_men_1-
          black_men_2-black_men_3-black_men_4-black_men_5-black_men_6-
          black_men_7-black_men_8-black_men_9-black_men_10-black_men_11-
          black_men_12-black_men_13-black_men_14-black_men_15-black_men_16-
          black_men_17-black_women_u1-black_women_1-black_women_2-black_women_3-
          black_women_4-black_women_5-black_women_6-black_women_7-black_women_8-
          black_women_9-black_women_10-black_women_11-black_women_12-
          black_women_13-black_women_14-black_women_15-black_women_16-
          black_women_17)

A.5 Finalize and export the dataset

Now that we have our two key variables defined, let’s clean up the dataset by removing the variables we no longer need, only keeping the variables necessary to help create the bias index. Below, we take the existing dataset assigned to co_counties_race, and select the “GEOID” and “County” variables (which serve as ID variables), “total_black_pop_over17” and “total_pop_over17” (which are used to compute the bias index), and “total_pop” (which is not necessary to create “bias_index”, but which could prove useful in exploring alternate ways of defining a bias index than the one implemented in the tutorial). The new dataset is assigned to an object named co_counties_census_2010:

# Clean data by select relevant variables for analysis, and assign 
# selection to new object named "co_counties_census_2010"
co_counties_census_2010<-
  co_counties_race %>% 
    select(GEOID, County, total_pop, total_black_pop_over17, total_pop_over17)

Let’s now view the contents of co_counties_census_2010:

# prints contents of "co_counties_census_2010"
co_counties_census_2010

# A tibble: 64 × 5
   GEOID County          total_pop total_black_pop_over17 total_pop_over17
   <chr> <chr>               <dbl>                  <dbl>            <dbl>
 1 08023 Costilla County      3524                     18             2788
 2 08025 Crowley County       5823                    556             5034
 3 08027 Custer County        4255                     37             3525
 4 08029 Delta County        30952                    139            24101
 5 08031 Denver County      600158                  45338           471392
 6 08035 Douglas County     285465                   2447           198453
 7 08033 Dolores County       2064                      4             1602
 8 08049 Grand County        14843                     43            11825
 9 08039 Elbert County       23086                    122            17232
10 08041 El Paso County     622263                  27280           459587
# … with 54 more rows

At this point, we now have the census data that was provided to you at the start of the tutorial.

To export this dataset, use the write_csv() function; below, the first argument is the name of the object which contains the dataset to be exported, and the second argument is the desired file name. The data is exported to the current working directory, and can subsequently be opened on your spreadsheet software of choice as a CSV file.

# Exports the data assigned to the "co_counties_census_2010" object
write_csv(co_counties_census_2010, "co_counties_census_2010.csv")