3 Cleaning and filtering data

To make things tractable (after all, the entire dataset has more than 3,000,000 observations!), let’s focus our attention on Colorado traffic patrol from the year 2010; this also has the advantage of allowing us to eventually use 2010 census data in crafting our measure of racial bias in traffic stops. Note that currently, there isn’t a separate field which contains the year in which a particular stop took place; rather, there is a “date” field, in which the date is stored in YYYY-MM-DD format. The easiest way to extract observations from the year 2010 is therefore to first extract the YYYY information from the “date” field, and use that information to make a new “Year” field that only contains the year of a given stop; we can then extract the 2010 observations based on this newly generated “Year” field.

3.1 Create a “Year” field

To create this new “Year” field within co_traffic_stops, we can use the mutate() function, which is a tidyverse function that allows us to define new variables within a dataset, and the substr() function, which allows us to extract a subset of a given string.

Below, we take the data in the co_traffic_stops object, and then create a new column named “Year” using the mutate() function; we set this column equal to the the first four digits of the existing “date” column by passing the expression that reads co_traffic_stops$data, 1, 4 to the substr() function. We then use the the assignment operator (<-) to assign this change back to the co_traffic_stops object, which permanently updates the dataset with the addition of the new “Year” field:

# Creates "Year" field, that contains the year of a given stop, 
# in "co_traffic_stops"
co_traffic_stops<-co_traffic_stops %>% 
                    mutate(Year=substr(co_traffic_stops$date, 1,4))

You might have noticed a mysterious symbol in the above code that comes immediately after co_traffic_stops and immediately before calling the mutate() function in the next line. This symbol is known as a “pipe” (%>%). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Here, the pipe takes the contents of co_traffic_stops object on its left, and then feeds this data into the mutate() function on its right. In other words, the pipe operator links the code on its left to the code on its right, and establishes that the data which is to be modified using the mutate() function is the data assigned to co_traffic_stops. We will use the pipe operator throughout the lesson to chain together functions in this manner.

Let’s check the updated co_traffic_stops object and make sure that the new field has been successfully created:

# prints contents of "co_traffic_stops"
co_traffic_stops
# A tibble: 3,112,853 × 21
   county_name      date       Year  raw_row_number           time  location subject_age subject_race subject_sex
   <chr>            <date>     <chr> <chr>                    <lgl> <chr>          <dbl> <chr>        <chr>      
 1 Mesa County      2013-06-19 2013  1947986|1947987          NA    19, I70…          26 hispanic     male       
 2 Jefferson County 2012-08-24 2012  1537576                  NA    254, H2…          NA <NA>         <NA>       
 3 Logan County     2012-09-23 2012  1581594                  NA    115, I7…          52 white        male       
 4 Douglas County   2011-08-25 2011  1009205                  NA    197, H8…          32 white        female     
 5 Kiowa County     2013-06-08 2013  1932619                  NA    107, H2…          33 hispanic     male       
 6 Boulder County   2011-12-23 2011  1179436                  NA    48, 384…          NA <NA>         <NA>       
 7 Boulder County   2012-04-07 2012  1326795                  NA    0, R250…          39 white        male       
 8 Arapahoe County  2013-03-03 2013  1786795                  NA    19, E47…          44 white        female     
 9 Park County      2012-09-02 2012  1552164                  NA    224, H2…          NA <NA>         <NA>       
10 Adams County     2011-08-21 2011  1004281|1004282|1004283… NA    R2000, …          32 hispanic     male       
# … with 3,112,843 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
#   violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
#   contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>

Note the newly created Year field above. We can also check to make sure that the “Year” field has been successfully created by viewing co_traffic_stops in the R Studio data viewer with View(co_traffic_stops).

3.2 Filter by year

Now that we have created the “Year” field, we can use it to extract the 2010 observations using the filter() function , and assign the filtered dataset of 2010 stops to a new object named co_traffic_stops_2010:

# Extract 2010 observations and assign to a new object named 
# "co_traffic_stops_2010"
co_traffic_stops_2010<-co_traffic_stops %>% filter(Year==2010)

When we print the contents of the newly created co_traffic_stops_2010 object, note that the observations are now only from 2010.

# Print contents of "co_traffic_stops_2010" object
co_traffic_stops_2010
# A tibble: 470,284 × 21
   date       Year  county_name      subject_race raw_row_number           time  location subject_age subject_sex
   <date>     <chr> <chr>            <chr>        <chr>                    <lgl> <chr>          <dbl> <chr>      
 1 2010-04-17 2010  Montezuma County white        188721|188722            NA    2, 989,…          16 female     
 2 2010-04-17 2010  Montezuma County white        187958                   NA    991, 32           54 male       
 3 2010-04-17 2010  Montezuma County hispanic     188451                   NA    9, 280,…          49 male       
 4 2010-04-17 2010  Montezuma County white        186989|186990|186991|18… NA    3, 277,…          16 male       
 5 2010-04-17 2010  Montezuma County white        186997|186998|186999|18… NA    3, 277,…          37 male       
 6 2010-04-17 2010  Montezuma County white        186993|186994|186995|18… NA    3, 277,…          39 male       
 7 2010-12-21 2010  Mineral County   <NA>         600865                   NA    164.5, …         110 <NA>       
 8 2010-12-21 2010  Mineral County   <NA>         600477                   NA    163, 29…         110 <NA>       
 9 2010-01-20 2010  Pueblo County    hispanic     36625|36626              NA    312, H5…          45 male       
10 2010-01-01 2010  Chaffee County   white        275                      NA    127, H2…          17 female     
# … with 470,274 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
#   violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
#   contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>