3 Cleaning and filtering data
To make things tractable (after all, the entire dataset has more than 3,000,000 observations!), let’s focus our attention on Colorado traffic patrol stops from the year 2010; this also has the advantage of allowing us to eventually use 2010 census data in crafting our measure of racial bias in traffic stops. Note that the dataset currently has no separate field containing the year in which a particular stop took place; rather, there is a “date” field, in which the date is stored in YYYY-MM-DD format. The easiest way to extract the 2010 observations is therefore to first pull the YYYY portion out of the “date” field and use it to make a new “Year” field, which contains only the year of a given stop; we can then filter the observations based on this newly generated “Year” field.
3.1 Create a “Year” field
To create this new “Year” field within co_traffic_stops, we can use the mutate() function, a tidyverse function that allows us to define new variables within a dataset, and the substr() function, which allows us to extract a subset of a given string.
The code below takes the data in the co_traffic_stops object, and then (%>%) creates a new field named “Year” using the mutate() function; this field is set equal to the first four characters of the existing “date” field by passing the arguments co_traffic_stops$date, 1, 4 to the substr() function. It then uses the assignment operator (<-) to assign the result back to the co_traffic_stops object, which overwrites the object so that the dataset now includes the new “Year” field:
# Creates "Year" field, which contains the year of a given stop,
# in "co_traffic_stops"
co_traffic_stops <- co_traffic_stops %>%
  mutate(Year = substr(co_traffic_stops$date, 1, 4))
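To see what substr() does to a date before running it on the full dataset, a minimal sketch on a made-up data frame may help. The object toy_stops and its values are hypothetical and are not part of the Colorado data; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for co_traffic_stops (not real observations)
toy_stops <- data.frame(
  county_name = c("County A", "County B", "County C"),
  date = as.Date(c("2010-04-17", "2011-08-25", "2013-06-19"))
)

# substr() converts each Date to a "YYYY-MM-DD" string and keeps characters 1 through 4
toy_stops <- toy_stops %>%
  mutate(Year = substr(date, 1, 4))

# Inspect the new field and its type
toy_stops$Year
class(toy_stops$Year)
```

Because substr() returns text, the resulting “Year” field is a character column rather than a numeric one, which matches the <chr> type shown for Year in the printed tibble.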
Let’s check the updated co_traffic_stops object and make sure that the new field has been successfully created:
# prints contents of "co_traffic_stops"
co_traffic_stops
# A tibble: 3,112,853 × 21
county_name date Year raw_row_number time location subject_age subject_race subject_sex
<chr> <date> <chr> <chr> <lgl> <chr> <dbl> <chr> <chr>
1 Mesa County 2013-06-19 2013 1947986|1947987 NA 19, I70… 26 hispanic male
2 Jefferson County 2012-08-24 2012 1537576 NA 254, H2… NA <NA> <NA>
3 Logan County 2012-09-23 2012 1581594 NA 115, I7… 52 white male
4 Douglas County 2011-08-25 2011 1009205 NA 197, H8… 32 white female
5 Kiowa County 2013-06-08 2013 1932619 NA 107, H2… 33 hispanic male
6 Boulder County 2011-12-23 2011 1179436 NA 48, 384… NA <NA> <NA>
7 Boulder County 2012-04-07 2012 1326795 NA 0, R250… 39 white male
8 Arapahoe County 2013-03-03 2013 1786795 NA 19, E47… 44 white female
9 Park County 2012-09-02 2012 1552164 NA 224, H2… NA <NA> <NA>
10 Adams County 2011-08-21 2011 1004281|1004282|10… NA R2000, … 32 hispanic male
# … with 3,112,843 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
# violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
# contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>
Note the newly created Year field above. We can also confirm that the “Year” field has been successfully created by viewing co_traffic_stops in the RStudio data viewer with View(co_traffic_stops).
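Besides printing the object or opening the viewer, one quick way to confirm that a newly created field looks sensible is to tabulate it. The sketch below does this with dplyr's count() on a hypothetical toy data frame (toy_stops, with made-up dates); on the real data, the same call would be applied to co_traffic_stops.

```r
library(dplyr)

# Hypothetical toy data with a "Year" field built the same way as in the text
toy_stops <- data.frame(
  date = as.Date(c("2010-04-17", "2010-12-21", "2011-08-25", "2013-06-19"))
) %>%
  mutate(Year = substr(date, 1, 4))

# One row per distinct year, with the number of stops in column n
year_counts <- toy_stops %>% count(Year)
year_counts

# Every extracted value should be a four-character string
all(nchar(toy_stops$Year) == 4)
```

A tabulation like this makes it easy to spot unexpected values, such as missing or malformed years, before filtering on the field.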
3.2 Filter by year
Now that we have created the “Year” field, we can use it to extract the 2010 observations using the filter() function, and assign the new dataset of 2010 stops to a new object named co_traffic_stops_2010:
# Extract 2010 observations and assign to a new object named
# "co_traffic_stops_2010"
co_traffic_stops_2010 <- co_traffic_stops %>% filter(Year == 2010)
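Because “Year” was built with substr(), it is a character column, so comparing it with the number 2010 relies on R converting 2010 to the string "2010" before the comparison. The sketch below illustrates this on hypothetical toy data (toy_stops and the toy_2010_* objects are made-up names, not part of the tutorial's dataset) and shows that comparing against the string "2010" selects the same rows while making the column type explicit.

```r
library(dplyr)

# Hypothetical toy data with a character "Year" field
toy_stops <- data.frame(
  date = as.Date(c("2010-04-17", "2010-12-21", "2011-08-25", "2013-06-19"))
) %>%
  mutate(Year = substr(date, 1, 4))

# Numeric comparison: R coerces 2010 to "2010" before comparing
toy_2010_numeric <- toy_stops %>% filter(Year == 2010)

# String comparison: selects the same rows, with the type made explicit
toy_2010_string <- toy_stops %>% filter(Year == "2010")

# Compare the two results
nrow(toy_2010_numeric)
nrow(toy_2010_string)
```

Either form works here; the string version is simply less dependent on implicit type conversion.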
When we print the contents of the newly created co_traffic_stops_2010 object, note that the observations are now only from 2010.
# Print contents of "co_traffic_stops_2010" object
co_traffic_stops_2010
# A tibble: 470,284 × 21
date Year county_name subject_race raw_row_number time location subject_age subject_sex
<date> <chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <chr>
1 2010-04-17 2010 Montezuma County white 188721|188722 NA 2, 989,… 16 female
2 2010-04-17 2010 Montezuma County white 187958 NA 991, 32 54 male
3 2010-04-17 2010 Montezuma County hispanic 188451 NA 9, 280,… 49 male
4 2010-04-17 2010 Montezuma County white 186989|186990|1869… NA 3, 277,… 16 male
5 2010-04-17 2010 Montezuma County white 186997|186998|1869… NA 3, 277,… 37 male
6 2010-04-17 2010 Montezuma County white 186993|186994|1869… NA 3, 277,… 39 male
7 2010-12-21 2010 Mineral County <NA> 600865 NA 164.5, … 110 <NA>
8 2010-12-21 2010 Mineral County <NA> 600477 NA 163, 29… 110 <NA>
9 2010-01-20 2010 Pueblo County hispanic 36625|36626 NA 312, H5… 45 male
10 2010-01-01 2010 Chaffee County white 275 NA 127, H2… 17 female
# … with 470,274 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
# violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
# contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>