A Using tidycensus to extract relevant census data
In Section 5.1, we read in a county-level demographic dataset (extracted from the 2010 decennial census), which we subsequently used to generate one of the components of our bias indicator. That dataset was provided to you at the start of the workshop, and we did not discuss the process by which the dataset was generated in the main lesson. However, if you are curious about how that data was generated, the script in this Appendix can be used as a guide to reproduce the demographic dataset introduced in Section 5.1.
To extract the census demographic data required to create the bias index, we must first load a few required packages. In particular, please load the tidyverse
(which was used in the workshop tutorial) and the tidycensus
package, which is an R package that allows users to pull Census Bureau data using the Census API (if tidycensus is not already installed, please install it using install.packages("tidycensus")
.
# Loads libraries required to extract census data
library(tidycensus)
library(tidyverse)
Before extracting data with tidycensus
, you must acquire a Census API key from the Census Bureau website; once you apply for a key on the website, your key will be immediately emailed to the address you provide. Enter your census API key in R Studio with the following code, replacing the
census_api_key("INSERT HERE")
In the workshop, we were working with 2010 police stops data, so it made sense to pull demographic data from the 2010 decennial census (which would have been collected in 2009). The discussion below is therefore framed with respect to the 2010 decennial census; if you choose to use another census data product (such as the American Community Survey) to create the bias index, your code may look slightly different.
A.1 Define your variables
First, we can generate a table that contains the various variables (and associated variable codes) for the 2010 decennial census by using the load_variables()
function. The arguments to the function below (2010, "sf1")
indicate that we want to extract variable names and codes for the 2010 decennial census. Below, we assign the table to an object named decennial_2010_variables
:
# Variable list for 2010 Decennial assigned to object named "decennial_2010_variables"
<-load_variables(2010, "sf1") decennial_2010_variables
At this point, we can easily view the 2010 decennial variables table by passing decennial_2010_variables
as an argument to the View()
function:
Based on the information in decennial_2010_variables
, we can identify the variable codes for the variables we want to extract. Then, we will define a vector (see below), assigned to an object named my_vars
that assigns the variable codes to descriptive names; these descriptive names will be used as column names in the dataset returned by the census API call, while the variable codes will be used by tidycensus to populate the respective fields with the desired data.
For the purpose of defining the “bias_index” variable, recall that we needed to define the county-level percentage of traffic stops involving a Black motorist, and the county-level percentage of the adult Black population (with respect to the county’s overall adult population). The first component of this index was derived using the Stanford Open Policing data. In order to calculate the second component of the bias index, we need two key variables at the county level: the total adult (which we define as 17+) population, and the total adult Black population. These two variables can be extracted from the decennial census. However, there is no separate category in the decennial census for these two measures, so we must derive them based on the data that is available.
Therefore, given the available data, calculating the total over-17 population will require us to extract data for the male under-5 population, the male 5 to 9 population, the male 10 to 14 population, and the male 15 to 17 population, and analogous measures for the female population. Subtracting these values from the total overall population (among all age groups) will yield a value for the total over-17 population.
To calculate the Black over-17 population, we will extract a variable that defines the total Black population, and a series of variables that indicate the Black population for different demographic (sex/age) combinations 17 years old and younger; subtracting the sum of the latter variables from the total Black population yields a value for the Black over-17 population.
The code below extracts all the variables needed to carry out these calculations:
# Define and name variables for census API call
<-c(total_pop="P001001",
my_varstotalpop_men_u5="P012003",
totalpop_men_5to9="P012004",
totalpop_men_10to14="P012005",
totalpop_men_15to17="P012006",
totalpop_women_u5="P012027",
totalpop_women_5to9="P012028",
totalpop_women_10to14="P012029",
totalpop_women_15to17="P012030",
black_totalpop="PCT012B001",
black_men_u1="PCT012B003",
black_men_1="PCT012B004",
black_men_2="PCT012B005",
black_men_3="PCT012B006",
black_men_4="PCT012B007",
black_men_5="PCT012B008",
black_men_6="PCT012B009",
black_men_7="PCT012B010",
black_men_8="PCT012B011",
black_men_9="PCT012B012",
black_men_10="PCT012B013",
black_men_11="PCT012B014",
black_men_12="PCT012B015",
black_men_13="PCT012B016",
black_men_14="PCT012B017",
black_men_15="PCT012B018",
black_men_16="PCT012B019",
black_men_17="PCT012B020",
black_women_u1="PCT012B107",
black_women_1="PCT012B108",
black_women_2="PCT012B109",
black_women_3="PCT012B110",
black_women_4="PCT012B111",
black_women_5="PCT012B112",
black_women_6="PCT012B113",
black_women_7="PCT012B114",
black_women_8="PCT012B115",
black_women_9="PCT012B116",
black_women_10="PCT012B117",
black_women_11="PCT012B118",
black_women_12="PCT012B119",
black_women_13="PCT012B120",
black_women_14="PCT012B121",
black_women_15="PCT012B122",
black_women_16="PCT012B123",
black_women_17="PCT012B124")
A.2 Extract the variables using tidycensus
Now that we have a vector of the variables we want to extract (along with descriptive names for those variables), we will use the get_decennial()
function from tidycensus to extract these variables from the 2010 decennial census. Several arguments are passed to the get_decennial()
function below:
geography="county"
specifies that we want the census data to be provided at the county levelvariables=my_vars
specifies the variables we want to extract, and the names they are to be given in the dataset; this information is contained in themy_vars
vector defined abovestate=CO
specifies the state for which we want to extract the data; this argument, together with thegeography="county"
argument, means that tidycensus will extract the specified data inmy_vars
at the county level for the state of Colorado.survey="sf1"
indicates which census dataset we would like to query; heresf1
(short for Summary File 1), indicates we are referring to the decennial census (as opposed, for example, to the American Community Survey)output=wide
indicates that we want the dataset with the extracted variables to be in “wide” format, wherein each variable is assigned to its own column.year=2010
indicates that we are interested in data from 2010. Combined withsurvey="sf1"
, this will extract the 2010 decennial census data.
Finally, we’ll assign the extracted dataset to an object named co_counties_race
:
# Issue call to Census API
<-
co_counties_raceget_decennial(
geography="county",
variables=my_vars,
state="CO",
survey="sf1",
output="wide",
year=2010)
## Getting data from the 2010 decennial Census
## Using FIPS code '08' for state 'CO'
## Using Census Summary File 1
We can print the first few lines of the dataset to the console to view its structure, and ensure that everything looks in order:
# prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 48
GEOID NAME total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15… totalpop_women_…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 08023 Costilla … 3524 99 102 107 95 76
2 08025 Crowley C… 5823 88 110 136 81 94
3 08027 Custer Co… 4255 62 95 113 85 74
4 08029 Delta Cou… 30952 887 913 1042 680 865
5 08031 Denver Co… 600158 22252 18894 15319 8920 21263
6 08035 Douglas C… 285465 11278 13587 12664 6913 10622
7 08033 Dolores C… 2064 58 63 71 40 78
8 08049 Grand Cou… 14843 416 421 431 261 422
9 08039 Elbert Co… 23086 594 790 976 611 563
10 08041 El Paso C… 622263 23152 23050 23252 14097 22060
# … with 54 more rows, and 40 more variables: totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
# totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>, black_men_2 <dbl>,
# black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>, black_men_6 <dbl>, black_men_7 <dbl>,
# black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>, black_men_12 <dbl>,
# black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>, black_men_16 <dbl>, black_men_17 <dbl>,
# black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>, black_women_3 <dbl>, black_women_4 <dbl>,
# black_women_5 <dbl>, black_women_6 <dbl>, black_women_7 <dbl>, black_women_8 <dbl>, black_women_9 <dbl>, …
As always, it is also possible to view the dataset in the R Studio data viewer by running View(co_counties_race)
.
A.3 Clean the tidycensus dataset
Having extracted the dataset, you may want to tidy it up, depending on your needs and preferences. For example, the “NAME” field includes the name of the state, which is not really necessary here since there are only observations from Colorado in the dataset. The code below creates a new variable named “County”, which is populated by removing the state name from the “NAME” field, and updates the co_counties_race
object with this new variable:
# Remove state name from NAME field
<-co_counties_race %>%
co_counties_racemutate(County=str_remove_all(NAME, ", Colorado"))
Note the newly created “County” variable is now in the dataset:
# Prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 49
GEOID County NAME total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15…
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 08023 Costilla County Costilla C… 3524 99 102 107 95
2 08025 Crowley County Crowley Co… 5823 88 110 136 81
3 08027 Custer County Custer Cou… 4255 62 95 113 85
4 08029 Delta County Delta Coun… 30952 887 913 1042 680
5 08031 Denver County Denver Cou… 600158 22252 18894 15319 8920
6 08035 Douglas County Douglas Co… 285465 11278 13587 12664 6913
7 08033 Dolores County Dolores Co… 2064 58 63 71 40
8 08049 Grand County Grand Coun… 14843 416 421 431 261
9 08039 Elbert County Elbert Cou… 23086 594 790 976 611
10 08041 El Paso County El Paso Co… 622263 23152 23050 23252 14097
# … with 54 more rows, and 41 more variables: totalpop_women_u5 <dbl>, totalpop_women_5to9 <dbl>,
# totalpop_women_10to14 <dbl>, totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>,
# black_men_1 <dbl>, black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>,
# black_men_6 <dbl>, black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>,
# black_men_11 <dbl>, black_men_12 <dbl>, black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>,
# black_men_16 <dbl>, black_men_17 <dbl>, black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>,
# black_women_3 <dbl>, black_women_4 <dbl>, black_women_5 <dbl>, black_women_6 <dbl>, black_women_7 <dbl>, …
A.4 Define new variables
Now that we have a cleaned dataset with all our necessary variables, we can use these variables to generate the demographic variables needed to calculate the bias index. First, the code below defines a new variable, called “total_pop_over17”, that is calculated by subtracting the total population that is 17 and under from the total overall population:
# Create variable for total over-17 population
<-
co_counties_race%>%
co_counties_race mutate(total_pop_over17=
-totalpop_men_u5-totalpop_men_5to9-
total_pop-totalpop_men_15to17-totalpop_women_u5-
totalpop_men_10to14-totalpop_women_10to14-totalpop_women_15to17) totalpop_women_5to9
Then, we create a new variable named “total_black_pop_over17”, which is defined by subtracting the total Black population that is 17 and under from the total Black population:
# Create variable for total over-17 black population
<-
co_counties_race%>%
co_counties_race mutate(total_black_pop_over17=
-black_men_u1-black_men_1-
black_totalpop-black_men_3-black_men_4-black_men_5-black_men_6-
black_men_2-black_men_8-black_men_9-black_men_10-black_men_11-
black_men_7-black_men_13-black_men_14-black_men_15-black_men_16-
black_men_12-black_women_u1-black_women_1-black_women_2-black_women_3-
black_men_17-black_women_5-black_women_6-black_women_7-black_women_8-
black_women_4-black_women_10-black_women_11-black_women_12-
black_women_9-black_women_14-black_women_15-black_women_16-
black_women_13 black_women_17)
A.5 Finalize and export the dataset
Now that we have our two key variables defined, let’s clean up the dataset by removing the variables we no longer need, only keeping the variables necessary to help create the bias index. Below, we take the existing dataset assigned to co_counties_race
, and select the “GEOID” and “County” variables (which serve as ID variables), “total_black_pop_over17” and “total_pop_over17” (which are used to compute the bias index), and “total_pop” (which is not necessary to create “bias_index”, but which could prove useful in exploring alternate ways of defining a bias index than the one implemented in the tutorial). The new dataset is assigned to an object named co_counties_census_2010
:
# Clean data by select relevant variables for analysis, and assign
# selection to new object named "co_counties_census_2010"
<-
co_counties_census_2010%>%
co_counties_race select(GEOID, County, total_pop, total_black_pop_over17, total_pop_over17)
Let’s now view the contents of co_counties_census_2010
:
# prints contents of "co_counties_census_2010"
co_counties_census_2010
# A tibble: 64 × 5
GEOID County total_pop total_black_pop_over17 total_pop_over17
<chr> <chr> <dbl> <dbl> <dbl>
1 08023 Costilla County 3524 18 2788
2 08025 Crowley County 5823 556 5034
3 08027 Custer County 4255 37 3525
4 08029 Delta County 30952 139 24101
5 08031 Denver County 600158 45338 471392
6 08035 Douglas County 285465 2447 198453
7 08033 Dolores County 2064 4 1602
8 08049 Grand County 14843 43 11825
9 08039 Elbert County 23086 122 17232
10 08041 El Paso County 622263 27280 459587
# … with 54 more rows
At this point, we now have the census data that was provided to you at the start of the tutorial.
To export this dataset, use the write_csv()
function; below, the first argument is the name of the object which contains the dataset to be exported, and the second argument is the desired file name. The data is exported to the current working directory, and can subsequently be opened on your spreadsheet software of choice as a CSV file.
# Exports the data assigned to the "co_counties_census_2010" object
write_csv(co_counties_census_2010, "co_counties_census_2010.csv")