A Using tidycensus to extract relevant census data

This section provides the script used to extract the census dataset that was read into R Studio in Section 5.1. To save time during the workshop, we provided this census data beforehand. However, if you would like to extract the census dataset from scratch within R Studio, the following script can be used as a guide to reproduce the workshop’s census dataset.

To extract the census data required to create the bias index, load the tidyverse (which was used in the workshop tutorial) and the tidycensus package, an R package that allows users to pull Census Bureau data using the Census API. If tidycensus is not already installed, please install it using install.packages("tidycensus").

# Loads libraries required to extract census data
library(tidyverse)
library(tidycensus)

Before extracting data with tidycensus, you must acquire a Census API key from the Census Bureau website; once you apply for a key on the website, your key will be emailed to the address you provide. Enter your Census API key in R Studio with the following code, replacing INSERT HERE with your key:

census_api_key("INSERT HERE")

In the workshop, we were working with 2010 police stops data, so it made sense to pull demographic data from the 2010 decennial census (which counted the population as of April 1, 2010). The discussion below is therefore framed with respect to the 2010 decennial census; if you choose to use another census dataset to create the index, your code may look slightly different.

A.1 Define your variables

First, we can generate a table that contains the various variables (and associated variable codes) for the 2010 decennial census by using the load_variables() function. The arguments to the function below (2010, "sf1") indicate that we want to extract variable names and codes for the 2010 decennial census. We assign the table to an object named decennial_2010_variables, which allows us to easily view the table by using the View() function, and refer back to it whenever needed.

# Variable list for 2010 Decennial
decennial_2010_variables<-load_variables(2010, "sf1")
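
Scrolling through the full table in View() can be slow, so it may help to filter it instead. The sketch below assumes the table has the three columns name, label, and concept, and it uses a small made-up stand-in (toy_variables, with hypothetical codes) so that it runs without an API call; the same filter() call could be applied to decennial_2010_variables.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the table returned by load_variables(). The column names
# mirror that table, but the codes and rows here are hypothetical and must
# not be used as a codebook.
toy_variables <- tribble(
  ~name,     ~label,                          ~concept,
  "X000001", "Total",                         "SEX BY AGE",
  "X000002", "Total!!Male!!Under 5 years",    "SEX BY AGE",
  "X000003", "Total!!Female!!Under 5 years",  "SEX BY AGE",
  "Y000001", "Total",                         "RACE"
)

# Keep only the rows from one table (concept) whose label mentions an age group
under5_rows <- toy_variables %>%
  filter(concept == "SEX BY AGE", str_detect(label, "Under 5 years"))

under5_rows$name
```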

Based on the information in decennial_2010_variables, we can identify the variable codes for the variables we want to extract. Then, we define a named vector, assigned to an object named my_vars, that pairs each variable code with a descriptive name; these descriptive names will be used as column names in the dataset returned by the census API call, while the variable codes will be used by tidycensus to populate the respective fields with the desired data.

For the purpose of defining the “bias_index” variable, recall that the two key variables we need are the over-17 total population, and the over-17 Black population (counted at the county level). There is no separate category in the census dataset for these measures, so we must derive them based on the data that is available.

Given the available data, to calculate the total over-17 population, we must extract data for the male under-5 population, the male 5 to 9 population, the male 10 to 14 population, and the male 15 to 17 population, and analogous measures for the female population. Subtracting these values from the total overall population (among all age groups) will yield a value for the total over-17 population.

To calculate the Black over-17 population, we will extract a variable that defines the total Black population, and a series of variables that measure the Black population for each sex/age combination aged 17 and under; subtracting the sum of the latter variables from the total Black population yields a measure of the Black over-17 population.

The code below defines all the variables needed to carry out these calculations:

# Define and name variables for census API call
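
As a rough sketch of what such a named vector can look like, the block below builds one with 46 entries, matching the 46 data columns in the extracted dataset. Both the variable codes and several of the descriptive names are assumptions: the codes are our reading of the 2010 SF1 "Sex by Age" tables (P012 and PCT012B), and the names follow the pattern visible in the printed output further down. Each code should be checked against decennial_2010_variables before use.

```r
# ASSUMPTION: these codes are our reading of the 2010 SF1 "Sex by Age" tables
# (P012 for the whole population, PCT012B for the Black population). Confirm
# every code in decennial_2010_variables before relying on it.
age_labels <- c("u1", 1:17)  # under 1, then single years of age 1 through 17

my_vars <- c(
  total_pop             = "P012001",
  totalpop_men_u5       = "P012003",
  totalpop_men_5to9     = "P012004",
  totalpop_men_10to14   = "P012005",
  totalpop_men_15to17   = "P012006",
  totalpop_women_u5     = "P012027",
  totalpop_women_5to9   = "P012028",
  totalpop_women_10to14 = "P012029",
  totalpop_women_15to17 = "P012030",
  black_totalpop        = "PCT012B001",
  # Single-year-of-age cells, generated rather than typed out one by one
  setNames(sprintf("PCT012B%03d", 3:20),    paste0("black_men_", age_labels)),
  setNames(sprintf("PCT012B%03d", 107:124), paste0("black_women_", age_labels))
)

length(my_vars)
```

Generating the single-year entries with sprintf() and setNames() keeps the definition short and avoids typing 36 nearly identical lines by hand.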


A.2 Extract the variables using tidycensus

Now that we have a vector of the variables we want to extract (along with descriptive names for those variables), we will use the get_decennial() function from tidycensus to extract these variables from the 2010 decennial census. Several arguments are passed to the get_decennial() function below:

  • geography="county" specifies that we want the census data to be provided at the county level
  • variables=my_vars specifies the variables we want to extract, and the names they are to be given in the dataset; this information is contained in the my_vars vector defined above
  • state="CO" specifies the state for which we want to extract the data; this argument, together with the geography="county" argument, means that tidycensus will extract the specified data in my_vars at the county level for the state of Colorado.
  • sumfile="sf1" indicates which census dataset we would like to query; here sf1 (short for Summary File 1) indicates we are referring to the decennial census (as opposed, for example, to the American Community Survey)
  • output="wide" indicates that we want the dataset with the extracted variables to be in “wide” format, wherein each variable is assigned to its own column.
  • year=2010 indicates that we are interested in data from 2010. Combined with sumfile="sf1", this will extract the 2010 decennial census data.

Finally, we’ll assign the extracted dataset to an object named co_counties_race:

# Issue call to Census API
co_counties_race <- get_decennial(geography = "county",
                                  variables = my_vars,
                                  state = "CO",
                                  sumfile = "sf1",
                                  output = "wide",
                                  year = 2010)
## Getting data from the 2010 decennial Census
## Using FIPS code '08' for state 'CO'
## Using Census Summary File 1
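
To make the effect of the “wide” output option concrete, the sketch below reshapes a small hand-built table from long to wide format. It assumes that the long ("tidy") layout has the columns GEOID, NAME, variable, and value; the four counts are taken from the printed output in this section, and pivot_wider() from tidyr is used here only to illustrate the two layouts, not to describe what tidycensus does internally.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long format: one row per county/variable combination
long_example <- tribble(
  ~GEOID,  ~NAME,                       ~variable,          ~value,
  "08023", "Costilla County, Colorado", "total_pop",          3524,
  "08023", "Costilla County, Colorado", "totalpop_men_u5",      99,
  "08025", "Crowley County, Colorado",  "total_pop",          5823,
  "08025", "Crowley County, Colorado",  "totalpop_men_u5",      88
)

# Wide format: one row per county, one column per variable
wide_example <- long_example %>%
  pivot_wider(names_from = variable, values_from = value)

wide_example
```

The wide layout is convenient here because the later steps subtract one column from another within each county's row.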

We can print the first few lines of the dataset to the console to view its structure, and ensure that everything looks in order:

# prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 48
   GEOID NAME  total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15… totalpop_women_…
   <chr> <chr>     <dbl>           <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1 08023 Cost…      3524              99              102              107               95               76
 2 08025 Crow…      5823              88              110              136               81               94
 3 08027 Cust…      4255              62               95              113               85               74
 4 08029 Delt…     30952             887              913             1042              680              865
 5 08031 Denv…    600158           22252            18894            15319             8920            21263
 6 08035 Doug…    285465           11278            13587            12664             6913            10622
 7 08033 Dolo…      2064              58               63               71               40               78
 8 08049 Gran…     14843             416              421              431              261              422
 9 08039 Elbe…     23086             594              790              976              611              563
10 08041 El P…    622263           23152            23050            23252            14097            22060
# … with 54 more rows, and 40 more variables: totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
#   totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
#   black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>, black_men_6 <dbl>,
#   black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>,
#   black_men_12 <dbl>, black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>, black_men_16 <dbl>,
#   black_men_17 <dbl>, black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>,
#   black_women_3 <dbl>, black_women_4 <dbl>, black_women_5 <dbl>, black_women_6 <dbl>, …

As always, it is also possible to view the dataset in the R Studio data viewer by running View(co_counties_race).

A.3 Clean the tidycensus dataset

Having extracted the dataset, you may want to tidy it up depending on your needs and preferences. For example, the “NAME” field includes the name of the state, which is not really necessary here since there are only observations from Colorado in the dataset. The code below removes the state name, stores the result in a new field named “County” (leaving “NAME” unchanged), and updates the co_counties_race object with this change:

# Remove state name from NAME field
co_counties_race<-co_counties_race %>% 
                  mutate(County=str_remove_all(NAME, ", Colorado"))   
# Prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 49
   GEOID NAME  total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10… totalpop_men_15… totalpop_women_…
   <chr> <chr>     <dbl>           <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1 08023 Cost…      3524              99              102              107               95               76
 2 08025 Crow…      5823              88              110              136               81               94
 3 08027 Cust…      4255              62               95              113               85               74
 4 08029 Delt…     30952             887              913             1042              680              865
 5 08031 Denv…    600158           22252            18894            15319             8920            21263
 6 08035 Doug…    285465           11278            13587            12664             6913            10622
 7 08033 Dolo…      2064              58               63               71               40               78
 8 08049 Gran…     14843             416              421              431              261              422
 9 08039 Elbe…     23086             594              790              976              611              563
10 08041 El P…    622263           23152            23050            23252            14097            22060
# … with 54 more rows, and 41 more variables: totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
#   totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
#   black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>, black_men_6 <dbl>,
#   black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>,
#   black_men_12 <dbl>, black_men_13 <dbl>, black_men_14 <dbl>, black_men_15 <dbl>, black_men_16 <dbl>,
#   black_men_17 <dbl>, black_women_u1 <dbl>, black_women_1 <dbl>, black_women_2 <dbl>,
#   black_women_3 <dbl>, black_women_4 <dbl>, black_women_5 <dbl>, black_women_6 <dbl>, …
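
Because most columns are hidden in the printed output, it can be easier to check this step on a small example. The sketch below applies the same idea to a two-row toy table (toy_counties is a hypothetical name); it uses str_remove() with a fixed() pattern, a slight variation on the call above that treats ", Colorado" as literal text rather than as a regular expression.

```r
library(dplyr)
library(stringr)
library(tibble)

# Two-row toy table with the same GEOID/NAME layout as co_counties_race
toy_counties <- tibble(
  GEOID = c("08023", "08031"),
  NAME  = c("Costilla County, Colorado", "Denver County, Colorado")
)

# Add a County column holding NAME without the state suffix; NAME is kept as is
toy_counties <- toy_counties %>%
  mutate(County = str_remove(NAME, fixed(", Colorado")))

toy_counties
```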

A.4 Define new variables

Now that we have a cleaned dataset with all our necessary variables, we can use these variables to generate the demographic variables needed to calculate the bias index. First, the code below defines a new variable, called “total_pop_over17”, that is calculated by subtracting the total population that is 17 and under from the total overall population:

# Create variable for total over-17 population
  co_counties_race %>% 
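
A minimal sketch of this subtraction is shown below on a single hypothetical county with made-up counts. The column names follow the naming pattern visible in co_counties_race, but the full names of the truncated columns are assumptions, and the intermediate column pop_17_and_under is introduced here only for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical county with made-up counts (not census data)
toy_county <- tibble(
  County                = "Example County",
  total_pop             = 1000,
  totalpop_men_u5       = 30,
  totalpop_men_5to9     = 35,
  totalpop_men_10to14   = 40,
  totalpop_men_15to17   = 25,
  totalpop_women_u5     = 28,
  totalpop_women_5to9   = 32,
  totalpop_women_10to14 = 38,
  totalpop_women_15to17 = 22
)

toy_county <- toy_county %>%
  mutate(
    # Sum the eight sex/age columns; "total_pop" itself does not match the
    # "totalpop_" prefix, so it is left out of the sum
    pop_17_and_under = rowSums(across(starts_with("totalpop_"))),
    total_pop_over17 = total_pop - pop_17_and_under
  )

toy_county$total_pop_over17
```

Selecting the columns by prefix is a design choice for brevity; writing out the eight column names explicitly in the subtraction would work just as well and makes the calculation easier to audit.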

Then, we create a new variable named “total_black_pop_over17”, which is defined by subtracting the total Black population that is 17 and under from the total Black population:

# Create variable for total over-17 Black population
  co_counties_race %>% 
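
The same approach can be sketched for the Black over-17 population. The block below builds a one-row toy table with made-up counts and assumes that the 36 single-year columns are named black_men_u1, black_men_1 through black_men_17, and the analogous black_women_ columns (only some of these names are visible in the printed output above).

```r
library(dplyr)
library(tibble)

# Assumed names of the 36 single-year-of-age columns (ages under 1 through 17)
age_labels    <- c("u1", 1:17)
under18_cols  <- c(paste0("black_men_", age_labels),
                   paste0("black_women_", age_labels))

# One hypothetical county: 2 people in each sex/age cell, 100 people in total
toy_black <- as_tibble(setNames(as.list(rep(2, 36)), under18_cols)) %>%
  mutate(black_totalpop = 100)

toy_black <- toy_black %>%
  mutate(
    total_black_pop_over17 =
      black_totalpop - rowSums(across(all_of(under18_cols)))
  )

toy_black$total_black_pop_over17
```

Using all_of() with an explicit vector of names means the code stops with an error if any expected column is missing, which is a useful safeguard when 36 columns are involved.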

A.5 Finalize and export the dataset

Now that we have our two key variables defined, let’s clean up the dataset by removing the variables we no longer need, and only keeping the variables necessary to create the bias index. Below, we take the existing dataset assigned to co_counties_race, and select the “GEOID” and “County” variables (which serve as ID variables), “total_black_pop_over17” and “total_pop_over17” (which are used to compute the bias index), and “total_pop” (which is not necessary to create “bias_index”, but which could prove useful in exploring ways of defining a bias index other than the one implemented in the tutorial). The new dataset is assigned to an object named co_counties_census_2010:

# Clean data by selecting relevant variables for analysis, and assign selection to new object named "co_counties_census_2010"
co_counties_census_2010 <- co_counties_race %>% 
    select(GEOID, County, total_pop, total_black_pop_over17, total_pop_over17)

Let’s view the dataset’s contents:

# prints contents of "co_counties_census_2010"
co_counties_census_2010
# A tibble: 64 × 5
   GEOID County          total_pop total_black_pop_over17 total_pop_over17
   <chr> <chr>               <dbl>                  <dbl>            <dbl>
 1 08023 Costilla County      3524                     18             2788
 2 08025 Crowley County       5823                    556             5034
 3 08027 Custer County        4255                     37             3525
 4 08029 Delta County        30952                    139            24101
 5 08031 Denver County      600158                  45338           471392
 6 08035 Douglas County     285465                   2447           198453
 7 08033 Dolores County       2064                      4             1602
 8 08049 Grand County        14843                     43            11825
 9 08039 Elbert County       23086                    122            17232
10 08041 El Paso County     622263                  27280           459587
# … with 54 more rows

At this point, we now have the census data used in the tutorial. This data was exported from R Studio, and provided to workshop participants as a CSV file that was included in the workshop materials.

To export the data, use the write_csv() function; below, the first argument is the name of the object which contains the dataset to be exported, and the second argument is the desired file name. The data is exported to the current working directory, and can subsequently be opened in your spreadsheet software of choice as a CSV file.

# Exports the data
write_csv(co_counties_census_2010, "co_counties_census_2010.csv")
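
When the exported file is read back into R, it may be worth declaring the column types explicitly so that “GEOID” is kept as text and its leading zero (as in "08023") is preserved. The sketch below is one way to do this, using two rows copied from the output above and a temporary file in place of the working directory; example_rows, csv_path, and round_trip are hypothetical names.

```r
library(readr)
library(tibble)

# Two rows copied from the printed co_counties_census_2010 output
example_rows <- tibble(
  GEOID                  = c("08023", "08025"),
  County                 = c("Costilla County", "Crowley County"),
  total_pop              = c(3524, 5823),
  total_black_pop_over17 = c(18, 556),
  total_pop_over17       = c(2788, 5034)
)

# Write to a temporary file so the working directory is left untouched
csv_path <- tempfile(fileext = ".csv")
write_csv(example_rows, csv_path)

# Read back, keeping the ID columns as character and the counts as numbers
round_trip <- read_csv(
  csv_path,
  col_types = cols(
    GEOID    = col_character(),
    County   = col_character(),
    .default = col_double()
  )
)

round_trip$GEOID
```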