3 Cleaning and filtering data

To make things tractable (after all, the entire dataset has more than 3,000,000 observations!), let’s focus our attention on Colorado traffic patrol stops from the year 2010; this also has the advantage of allowing us to eventually use 2010 census data in crafting our measure of racial bias in traffic stops. Note that there currently isn’t a separate field containing the year in which a particular stop took place; rather, there is a “date” field, in which the date is stored in YYYY-MM-DD format. The easiest way to extract observations from 2010 is therefore to first extract the YYYY portion of the “date” field and use it to make a new “Year” field, which contains only the year of a given stop; we can then extract the 2010 observations based on this newly generated “Year” field.

3.1 Create a “Year” field

To create this new “Year” field within co_traffic_stops, we can use the mutate() function, which is a tidyverse function that allows us to define new variables within a dataset, and the substr() function, which allows us to extract part of a given string by specifying the positions of its first and last characters.
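As a quick illustration of what substr() does on its own, the sketch below applies it to a single made-up date string (example_date is a hypothetical name, not a value taken from the dataset):

```r
# Hypothetical date string in YYYY-MM-DD format (not taken from the dataset)
example_date <- "2013-06-19"

# Extract characters 1 through 4, i.e. the YYYY portion
example_year <- substr(example_date, 1, 4)
example_year
# [1] "2013"

# substr() returns a character string rather than a number
class(example_year)
# [1] "character"
```

Because the result is a character string, a field built this way will be stored as text unless it is explicitly converted.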

The code below takes the data in the co_traffic_stops object, and then (%>%) creates a new field named “Year” using the mutate() function; this field is set equal to the first four characters of the existing “date” field by passing date, 1, 4 to the substr() function. Because mutate() evaluates its arguments within the dataset, the “date” field can be referenced directly by name. It then uses the assignment operator (<-) to assign this change back to the co_traffic_stops object, which updates the dataset with the addition of the new “Year” field:

# Creates "Year" field, which contains the year of a given stop,
# in "co_traffic_stops"
co_traffic_stops <- co_traffic_stops %>%
  mutate(Year = substr(date, 1, 4))

Let’s check the updated co_traffic_stops object and make sure that the new field has been successfully created:

# prints contents of "co_traffic_stops"
co_traffic_stops
# A tibble: 3,112,853 × 21
   county_name      date       Year  raw_row_number      time  location subject_age subject_race subject_sex
   <chr>            <date>     <chr> <chr>               <lgl> <chr>          <dbl> <chr>        <chr>      
 1 Mesa County      2013-06-19 2013  1947986|1947987     NA    19, I70…          26 hispanic     male       
 2 Jefferson County 2012-08-24 2012  1537576             NA    254, H2…          NA <NA>         <NA>       
 3 Logan County     2012-09-23 2012  1581594             NA    115, I7…          52 white        male       
 4 Douglas County   2011-08-25 2011  1009205             NA    197, H8…          32 white        female     
 5 Kiowa County     2013-06-08 2013  1932619             NA    107, H2…          33 hispanic     male       
 6 Boulder County   2011-12-23 2011  1179436             NA    48, 384…          NA <NA>         <NA>       
 7 Boulder County   2012-04-07 2012  1326795             NA    0, R250…          39 white        male       
 8 Arapahoe County  2013-03-03 2013  1786795             NA    19, E47…          44 white        female     
 9 Park County      2012-09-02 2012  1552164             NA    224, H2…          NA <NA>         <NA>       
10 Adams County     2011-08-21 2011  1004281|1004282|10… NA    R2000, …          32 hispanic     male       
# … with 3,112,843 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
#   violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
#   contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>

Note the newly created “Year” field above. We can also check that the “Year” field has been successfully created by viewing co_traffic_stops in the RStudio data viewer with View(co_traffic_stops).
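Another possible check is to tabulate the new field and see which years appear and how often. The sketch below is only illustrative: it uses a small made-up tibble named toy_stops as a stand-in for co_traffic_stops, and its values are not drawn from the real data.

```r
library(dplyr)

# Small made-up stand-in for co_traffic_stops (values are illustrative only)
toy_stops <- tibble(
  county_name = c("Mesa County", "Logan County", "Park County", "Adams County"),
  date = as.Date(c("2010-03-01", "2010-11-15", "2011-08-21", "2012-09-02"))
)

# Add a "Year" field from the first four characters of "date"
toy_stops <- toy_stops %>%
  mutate(Year = substr(date, 1, 4))

# Count the number of stops recorded in each year
year_counts <- toy_stops %>%
  count(Year)
year_counts
```

Running the same count(Year) step on co_traffic_stops should give one row per year in the data, which makes unexpected values (such as missing or malformed years) easy to spot.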

3.2 Filter by year

Now that we have created the “Year” field, we can use it to extract the 2010 observations using the filter() function, and assign the resulting dataset of 2010 stops to a new object named co_traffic_stops_2010:

# Extract 2010 observations and assign to a new object named 
# "co_traffic_stops_2010"; "Year" is stored as text, so we compare it to "2010"
co_traffic_stops_2010 <- co_traffic_stops %>% filter(Year == "2010")

When we print the contents of the newly created co_traffic_stops_2010 object, note that the observations are now only from 2010.

# Print contents of "co_traffic_stops_2010" object
co_traffic_stops_2010
# A tibble: 470,284 × 21
   date       Year  county_name      subject_race raw_row_number      time  location subject_age subject_sex
   <date>     <chr> <chr>            <chr>        <chr>               <lgl> <chr>          <dbl> <chr>      
 1 2010-04-17 2010  Montezuma County white        188721|188722       NA    2, 989,…          16 female     
 2 2010-04-17 2010  Montezuma County white        187958              NA    991, 32           54 male       
 3 2010-04-17 2010  Montezuma County hispanic     188451              NA    9, 280,…          49 male       
 4 2010-04-17 2010  Montezuma County white        186989|186990|1869… NA    3, 277,…          16 male       
 5 2010-04-17 2010  Montezuma County white        186997|186998|1869… NA    3, 277,…          37 male       
 6 2010-04-17 2010  Montezuma County white        186993|186994|1869… NA    3, 277,…          39 male       
 7 2010-12-21 2010  Mineral County   <NA>         600865              NA    164.5, …         110 <NA>       
 8 2010-12-21 2010  Mineral County   <NA>         600477              NA    163, 29…         110 <NA>       
 9 2010-01-20 2010  Pueblo County    hispanic     36625|36626         NA    312, H5…          45 male       
10 2010-01-01 2010  Chaffee County   white        275                 NA    127, H2…          17 female     
# … with 470,274 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
#   violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
#   contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>
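Since the printed output shows only the first ten rows, it can be worth confirming programmatically that a filter like this one kept only 2010 observations. The following is a minimal sketch on made-up data; toy_stops and toy_stops_2010 are hypothetical names used for illustration, not objects from the tutorial.

```r
library(dplyr)

# Made-up stand-in data with a "date" field and a derived "Year" field
toy_stops <- tibble(
  date = as.Date(c("2010-04-17", "2010-12-21", "2011-08-25", "2013-06-08")),
  Year = substr(date, 1, 4)
)

# Keep only the rows whose "Year" is "2010"
toy_stops_2010 <- toy_stops %>%
  filter(Year == "2010")

# The only remaining value of "Year" should be "2010"
unique(toy_stops_2010$Year)

# The earliest and latest dates should both fall within 2010
range(toy_stops_2010$date)
```

Applying unique() and range() to co_traffic_stops_2010 in the same way would give a quick check on the full filtered dataset.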