A Using tidycensus to extract relevant census data
This section provides the script used to extract the census dataset that was read into R Studio in Section 5.1. To save time during the workshop, we provided you with this census data beforehand. However, if you would like to extract the census dataset within R Studio yourself from scratch, the following script can be used as a guide to reproduce the workshop’s census dataset.
To extract the census data required to create the bias index, load the tidyverse (which was used in the workshop tutorial) and tidycensus, an R package that allows users to pull Census Bureau data using the Census API. If tidycensus is not already installed, install it first using install.packages("tidycensus").
# Loads libraries required to extract census data
library(tidycensus)
library(tidyverse)
Before extracting data with tidycensus, you must acquire a Census API key from the Census Bureau website; once you apply for a key on the website, your key will be immediately emailed to the address you provide. Enter your Census API key in R Studio with the following code, replacing the "INSERT HERE" placeholder with your key:
census_api_key("INSERT HERE")
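If you are unsure whether a key has already been set in your current session, a quick check can save a failed API call. The snippet below is a minimal sketch that assumes the key is stored in an environment variable named CENSUS_API_KEY (the name tidycensus uses when it stores a key); has_census_key() is a hypothetical helper written for this illustration, not part of tidycensus.

```r
# Sketch only: check whether a Census API key is available in this session.
# Assumption: the key lives in the CENSUS_API_KEY environment variable.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (has_census_key()) {
  message("A Census API key is set for this session.")
} else {
  message("No Census API key found; run census_api_key() first.")
}
```

If you would rather not re-enter the key in every session, census_api_key() also accepts install = TRUE, which saves the key to your .Renviron file for future sessions.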
In the workshop, we were working with 2010 police stops data, so it made sense to pull demographic data from the 2010 decennial census (which counted the population as of April 1, 2010). The discussion below is therefore framed with respect to the 2010 decennial census; if you choose to use another census dataset to create the index, your code may look slightly different.
A.1 Define your variables
First, we can generate a table that contains the various variables (and associated variable codes) for the 2010 decennial census by using the load_variables() function. The arguments to the function below (2010, "sf1") indicate that we want to extract variable names and codes for the 2010 decennial census. We assign the table to an object named decennial_2010_variables, which allows us to easily view the table by using the View() function, and refer back to it whenever needed.
# Variable list for 2010 Decennial
decennial_2010_variables <- load_variables(2010, "sf1")
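The variable table is long, so filtering it is often quicker than scrolling through it in the viewer. The sketch below uses a tiny hand-made stand-in for the table, with the same three columns that load_variables() returns (name, label, concept). The object toy_variables and its label text are illustrative assumptions; the exact wording of the labels in the real table may differ.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for decennial_2010_variables; the real table has
# thousands of rows and its label wording may differ from what is shown here.
toy_variables <- tibble(
  name    = c("P001001", "P012003", "P012027"),
  label   = c("Total", "Total!!Male!!Under 5 years", "Total!!Female!!Under 5 years"),
  concept = c("TOTAL POPULATION", "SEX BY AGE", "SEX BY AGE")
)

# Keep only the sex-by-age rows whose label mentions the under-5 age group
under5 <- toy_variables %>%
  filter(concept == "SEX BY AGE", str_detect(label, "Under 5 years"))

under5$name
```

The same filter() and str_detect() pattern can be applied to the real decennial_2010_variables object once it has been loaded.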
Based on the information in decennial_2010_variables, we can identify the variable codes for the variables we want to extract. Then, we define a vector, assigned to an object named my_vars, that assigns the variable codes to descriptive names; these descriptive names will be used as column names in the dataset returned by the census API call, while the variable codes will be used by tidycensus to populate the respective fields with the desired data.
For the purpose of defining the “bias_index” variable, recall that the two key variables we need are the over-17 total population, and the over-17 Black population (counted at the county level). There is no separate category in the census dataset for these measures, so we must derive them based on the data that is available.
Given the available data, to calculate the total over-17 population, we must extract data for the male under-5 population, the male 5 to 9 population, the male 10 to 14 population, and the male 15 to 17 population, and analogous measures for the female population. Subtracting these values from the total overall population (among all age groups) will yield a value for the total over-17 population.
To calculate the Black over-17 population, we will extract a variable that defines the total Black population, and a series of variables that measure the Black population for each sex and single year of age from under 1 through 17; subtracting the sum of the latter variables from the total Black population yields a measure of the Black over-17 population.
The code below defines all the variables needed to carry out these calculations:
# Define and name variables for census API call
my_vars <- c(total_pop="P001001",
totalpop_men_u5="P012003",
totalpop_men_5to9="P012004",
totalpop_men_10to14="P012005",
totalpop_men_15to17="P012006",
totalpop_women_u5="P012027",
totalpop_women_5to9="P012028",
totalpop_women_10to14="P012029",
totalpop_women_15to17="P012030",
black_totalpop="PCT012B001",
black_men_u1="PCT012B003",
black_men_1="PCT012B004",
black_men_2="PCT012B005",
black_men_3="PCT012B006",
black_men_4="PCT012B007",
black_men_5="PCT012B008",
black_men_6="PCT012B009",
black_men_7="PCT012B010",
black_men_8="PCT012B011",
black_men_9="PCT012B012",
black_men_10="PCT012B013",
black_men_11="PCT012B014",
black_men_12="PCT012B015",
black_men_13="PCT012B016",
black_men_14="PCT012B017",
black_men_15="PCT012B018",
black_men_16="PCT012B019",
black_men_17="PCT012B020",
black_women_u1="PCT012B107",
black_women_1="PCT012B108",
black_women_2="PCT012B109",
black_women_3="PCT012B110",
black_women_4="PCT012B111",
black_women_5="PCT012B112",
black_women_6="PCT012B113",
black_women_7="PCT012B114",
black_women_8="PCT012B115",
black_women_9="PCT012B116",
black_women_10="PCT012B117",
black_women_11="PCT012B118",
black_women_12="PCT012B119",
black_women_13="PCT012B120",
black_women_14="PCT012B121",
black_women_15="PCT012B122",
black_women_16="PCT012B123",
black_women_17="PCT012B124")
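Because the single-year Black population codes run consecutively, the long list above could also be generated rather than typed by hand. The following is only a sketch of that idea, not the approach used in the workshop; age_labels, black_men, black_women, and black_under18_vars are names introduced here for illustration.

```r
# Ages covered by the single-year variables: under 1, then 1 through 17
age_labels <- c("u1", 1:17)

# Codes PCT012B003 to PCT012B020 (men) and PCT012B107 to PCT012B124 (women),
# each paired with the same descriptive names used in my_vars
black_men   <- setNames(sprintf("PCT012B%03d", 3:20),
                        paste0("black_men_", age_labels))
black_women <- setNames(sprintf("PCT012B%03d", 107:124),
                        paste0("black_women_", age_labels))

# Named vector with the 36 age-specific Black population variables
black_under18_vars <- c(black_men, black_women)
```

The result could be combined with the remaining entries using c(), but it is worth comparing the generated codes against decennial_2010_variables before relying on them.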
A.2 Extract the variables using tidycensus
Now that we have a vector of the variables we want to extract (along with descriptive names for those variables), we will use the get_decennial() function from tidycensus to extract these variables from the 2010 decennial census. Several arguments are passed to the get_decennial() function below:
geography="county" specifies that we want the census data to be provided at the county level.
variables=my_vars specifies the variables we want to extract, and the names they are to be given in the dataset; this information is contained in the my_vars vector defined above.
state="CO" specifies the state for which we want to extract the data; this argument, together with the geography="county" argument, means that tidycensus will extract the data specified in my_vars at the county level for the state of Colorado.
sumfile="sf1" indicates which census dataset we would like to query; here sf1 (short for Summary File 1) indicates we are referring to the decennial census (as opposed, for example, to the American Community Survey).
output="wide" indicates that we want the dataset with the extracted variables to be in “wide” format, wherein each variable is assigned to its own column.
year=2010 indicates that we are interested in data from 2010. Combined with sumfile="sf1", this will extract the 2010 decennial census data.
Finally, we’ll assign the extracted dataset to an object named co_counties_race:
# Issue call to Census API
co_counties_race <- get_decennial(
  geography="county",
  variables=my_vars,
  state="CO",
  sumfile="sf1",
  output="wide",
  year=2010)
## Getting data from the 2010 decennial Census
## Using FIPS code '08' for state 'CO'
## Using Census Summary File 1
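To make the “wide” format concrete, the sketch below builds a small long-format table by hand (one row per county and variable, with the GEOID, NAME, variable, and value columns that tidycensus returns when wide output is not requested) and reshapes it with pivot_wider(). The counts are taken from the Costilla and Crowley county rows of the 2010 data shown in this appendix; toy_long and toy_wide are names introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Long format: one row per county-variable combination
toy_long <- tibble(
  GEOID    = c("08023", "08023", "08025", "08025"),
  NAME     = c("Costilla County, Colorado", "Costilla County, Colorado",
               "Crowley County, Colorado", "Crowley County, Colorado"),
  variable = c("total_pop", "totalpop_men_u5", "total_pop", "totalpop_men_u5"),
  value    = c(3524, 99, 5823, 88)
)

# Wide format: one row per county, one column per variable
toy_wide <- toy_long %>%
  pivot_wider(names_from = variable, values_from = value)

toy_wide
```

Requesting wide output in the API call gives the second layout directly, which is convenient here because the later steps subtract one column from another.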
We can print the first few lines of the dataset to the console to view its structure, and ensure that everything looks in order:
# prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 48
GEOID NAME total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15… totalpop_women_…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 08023 Cost… 3524 99 102 107 95 76
2 08025 Crow… 5823 88 110 136 81 94
3 08027 Cust… 4255 62 95 113 85 74
4 08029 Delt… 30952 887 913 1042 680 865
5 08031 Denv… 600158 22252 18894 15319 8920 21263
6 08035 Doug… 285465 11278 13587 12664 6913 10622
7 08033 Dolo… 2064 58 63 71 40 78
8 08049 Gran… 14843 416 421 431 261 422
9 08039 Elbe… 23086 594 790 976 611 563
10 08041 El P… 622263 23152 23050 23252 14097 22060
# … with 54 more rows, and 40 more variables: totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
# totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
# black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>, black_men_6 <dbl>,
# black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>,
# black_men_12 <dbl>, black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>, black_men_16 <dbl>,
# black_men_17 <dbl>, black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>,
# black_women_3 <dbl>, black_women_4 <dbl>, black_women_5 <dbl>, black_women_6 <dbl>, …
As always, it is also possible to view the dataset in the R Studio data viewer by running View(co_counties_race).
A.3 Clean the tidycensus dataset
Having extracted the dataset, you may want to tidy it up depending on your needs and preferences. For example, the “NAME” field includes the name of the state, which is not really necessary here since there are only observations from Colorado in the dataset. The code below creates a new field named “County” that contains the county name with the state name removed, and updates the co_counties_race object with this change:
# Create County field: NAME with the state name removed
co_counties_race <- co_counties_race %>%
  mutate(County=str_remove_all(NAME, ", Colorado"))
# Prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 49
GEOID NAME total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15… totalpop_women_…
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 08023 Cost… 3524 99 102 107 95 76
2 08025 Crow… 5823 88 110 136 81 94
3 08027 Cust… 4255 62 95 113 85 74
4 08029 Delt… 30952 887 913 1042 680 865
5 08031 Denv… 600158 22252 18894 15319 8920 21263
6 08035 Doug… 285465 11278 13587 12664 6913 10622
7 08033 Dolo… 2064 58 63 71 40 78
8 08049 Gran… 14843 416 421 431 261 422
9 08039 Elbe… 23086 594 790 976 611 563
10 08041 El P… 622263 23152 23050 23252 14097 22060
# … with 54 more rows, and 41 more variables: totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
# totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
# black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>, black_men_6 <dbl>,
# black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>,
# black_men_12 <dbl>, black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>, black_men_16 <dbl>,
# black_men_17 <dbl>, black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>,
# black_women_3 <dbl>, black_women_4 <dbl>, black_women_5 <dbl>, black_women_6 <dbl>, …
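The “County” field sits at the end of the tibble, so it does not appear in the truncated printout above. To see what the string operation does, the following sketch applies it to two county names written out by hand; county_names, cleaned, and cleaned_anchored are names introduced for illustration.

```r
library(stringr)

# Two NAME values in the "<county>, <state>" form returned by the API
county_names <- c("Costilla County, Colorado", "El Paso County, Colorado")

# Remove every occurrence of the literal text ", Colorado"
cleaned <- str_remove_all(county_names, ", Colorado")

# A slightly stricter variant: only strip the text at the end of the string
cleaned_anchored <- str_remove(county_names, ", Colorado$")

cleaned
```

Both variants give the same result for these names; the anchored pattern is simply a more defensive choice if a name could contain the same text elsewhere.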
A.4 Define new variables
Now that we have a cleaned dataset with all our necessary variables, we can use these variables to generate the demographic variables needed to calculate the bias index. First, the code below defines a new variable, called “total_pop_over17”, that is calculated by subtracting the total population that is 17 and under from the total overall population:
# Create variable for total over-17 population
co_counties_race <- co_counties_race %>%
  mutate(total_pop_over17=
    total_pop-totalpop_men_u5-totalpop_men_5to9-
    totalpop_men_10to14-totalpop_men_15to17-totalpop_women_u5-
    totalpop_women_5to9-totalpop_women_10to14-totalpop_women_15to17)
Then, we create a new variable named “total_black_pop_over17”, which is defined by subtracting the total Black population that is 17 and under from the total Black population:
# Create variable for total over-17 Black population
co_counties_race <- co_counties_race %>%
  mutate(total_black_pop_over17=
    black_totalpop-black_men_u1-black_men_1-
    black_men_2-black_men_3-black_men_4-black_men_5-black_men_6-
    black_men_7-black_men_8-black_men_9-black_men_10-black_men_11-
    black_men_12-black_men_13-black_men_14-black_men_15-black_men_16-
    black_men_17-black_women_u1-black_women_1-black_women_2-black_women_3-
    black_women_4-black_women_5-black_women_6-black_women_7-black_women_8-
    black_women_9-black_women_10-black_women_11-black_women_12-
    black_women_13-black_women_14-black_women_15-black_women_16-
    black_women_17)
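Typing out every column in a long subtraction is easy to get wrong. As a possible alternative (a sketch, not the method used in the workshop), the age-specific columns can be summed by name prefix with rowSums() and across(). The example uses a small toy tibble with made-up counts and only a few of the age columns; it relies on the naming convention from my_vars, in which the age-specific columns begin with "totalpop_", "black_men_", or "black_women_" while the totals do not.

```r
library(dplyr)

# Hypothetical toy data: made-up counts, only a subset of the age columns
toy <- tibble(
  total_pop         = c(1000, 2000),
  totalpop_men_u5   = c(30, 60),
  totalpop_women_u5 = c(28, 55),
  black_totalpop    = c(100, 50),
  black_men_u1      = c(2, 1),
  black_women_u1    = c(1, 0)
)

toy <- toy %>%
  mutate(
    # Total population minus the sum of all "totalpop_" age columns
    total_pop_over17 = total_pop - rowSums(across(starts_with("totalpop_"))),
    # Black population minus the sum of all age-specific Black columns
    total_black_pop_over17 = black_totalpop -
      rowSums(across(c(starts_with("black_men_"), starts_with("black_women_"))))
  )

toy
```

This shortcut only works if no unrelated column shares those prefixes, so it is worth checking the column names before using it on the full dataset.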
A.5 Finalize and export the dataset
Now that we have our two key variables defined, let’s clean up the dataset by removing the variables we no longer need, and only keeping the variables necessary to create the bias index. Below, we take the existing dataset assigned to co_counties_race, and select the “GEOID” and “County” variables (which serve as ID variables), “total_black_pop_over17” and “total_pop_over17” (which are used to compute the bias index), and “total_pop” (which is not necessary to create “bias_index”, but which could prove useful in exploring alternate ways of defining a bias index than the one implemented in the tutorial). The new dataset is assigned to an object named co_counties_census_2010:
# Clean data by selecting relevant variables for analysis, and assign selection to new object named "co_counties_census_2010"
co_counties_census_2010 <- co_counties_race %>%
  select(GEOID, County, total_pop, total_black_pop_over17, total_pop_over17)
Let’s view the dataset’s contents:
# prints contents of "co_counties_census_2010"
co_counties_census_2010
# A tibble: 64 × 5
GEOID County total_pop total_black_pop_over17 total_pop_over17
<chr> <chr> <dbl> <dbl> <dbl>
1 08023 Costilla County 3524 18 2788
2 08025 Crowley County 5823 556 5034
3 08027 Custer County 4255 37 3525
4 08029 Delta County 30952 139 24101
5 08031 Denver County 600158 45338 471392
6 08035 Douglas County 285465 2447 198453
7 08033 Dolores County 2064 4 1602
8 08049 Grand County 14843 43 11825
9 08039 Elbert County 23086 122 17232
10 08041 El Paso County 622263 27280 459587
# … with 54 more rows
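Before exporting, it can be reassuring to confirm that the derived columns are internally consistent: the Black over-17 population should not exceed the total over-17 population, which in turn should not exceed the total population. The sketch below runs that check on the first three rows of the printout above, re-entered by hand; first_rows and checks are names introduced for illustration.

```r
library(dplyr)

# First three rows of the printed dataset, typed in by hand
first_rows <- tibble(
  GEOID                  = c("08023", "08025", "08027"),
  County                 = c("Costilla County", "Crowley County", "Custer County"),
  total_pop              = c(3524, 5823, 4255),
  total_black_pop_over17 = c(18, 556, 37),
  total_pop_over17       = c(2788, 5034, 3525)
)

# Flag rows where the expected ordering of the three counts holds
checks <- first_rows %>%
  mutate(consistent = total_black_pop_over17 <= total_pop_over17 &
                      total_pop_over17 <= total_pop)

# Stops with an error if any row violates the expected ordering
stopifnot(all(checks$consistent))
```

The same mutate() and stopifnot() lines could be applied to the full co_counties_census_2010 object.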
At this point, we now have the census data used in the tutorial. This data was exported from R Studio, and provided to workshop participants as a CSV file that was included in the workshop materials.
To export the data, use the write_csv() function; below, the first argument is the name of the object which contains the dataset to be exported, and the second argument is the desired file name. The data is exported to the current working directory, and can subsequently be opened as a CSV file in your spreadsheet software of choice.
# Exports the data
write_csv(co_counties_census_2010, "co_counties_census_2010.csv")
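When the exported file is read back into R, it can help to state explicitly that “GEOID” is text, since these codes begin with a zero and spreadsheet software may drop leading zeros if it treats them as numbers. The following is a minimal sketch of a round trip that uses a temporary file and a two-row toy dataset rather than the full export; toy_export, csv_path, and reimported are names introduced for illustration.

```r
library(readr)
library(tibble)

# Two rows standing in for the exported dataset
toy_export <- tibble(
  GEOID     = c("08023", "08025"),
  County    = c("Costilla County", "Crowley County"),
  total_pop = c(3524, 5823)
)

# Write to a temporary file so the working directory is left untouched
csv_path <- tempfile(fileext = ".csv")
write_csv(toy_export, csv_path)

# Read the file back, declaring GEOID as character rather than relying on
# automatic type guessing
reimported <- read_csv(csv_path, col_types = cols(GEOID = col_character()))

reimported$GEOID
```

With the column type declared this way, the codes come back as the same five-character strings that were written out.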