3 Cleaning and filtering data
To make things tractable (after all, the entire dataset has more than 3,000,000 observations!), let’s focus our attention on Colorado traffic patrol from the year 2010; this also has the advantage of allowing us to eventually use 2010 census data in crafting our measure of racial bias in traffic stops. Note that currently, there isn’t a separate field which contains the year in which a particular stop took place; rather, there is a “date” field, in which the date is stored in YYYY-MM-DD format. The easiest way to extract observations from the year 2010 is therefore to first extract the YYYY information from the “date” field, and use that information to make a new “Year” field that only contains the year of a given stop; we can then extract the 2010 observations based on this newly generated “Year” field.
3.1 Create a “Year” field
To create this new “Year” field within co_traffic_stops
, we can use the mutate()
function, which is a tidyverse function that allows us to define new variables within a dataset, and the substr()
function, which allows us to extract a subset of a given string.
Below, we take the data in the co_traffic_stops
object, and then create a new column named “Year” using the mutate()
function; we set this column equal to the the first four digits of the existing “date” column by passing the expression that reads co_traffic_stops$data, 1, 4
to the substr()
function. We then use the the assignment operator (<-
) to assign this change back to the co_traffic_stops
object, which permanently updates the dataset with the addition of the new “Year” field:
# Creates "Year" field, that contains the year of a given stop,
# in "co_traffic_stops"
<-co_traffic_stops %>%
co_traffic_stopsmutate(Year=substr(co_traffic_stops$date, 1,4))
You might have noticed a mysterious symbol in the above code that comes immediately after co_traffic_stops
and immediately before calling the mutate()
function in the next line. This symbol is known as a “pipe” (%>%
). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Here, the pipe takes the contents of co_traffic_stops
object on its left, and then feeds this data into the mutate()
function on its right. In other words, the pipe operator links the code on its left to the code on its right, and establishes that the data which is to be modified using the mutate()
function is the data assigned to co_traffic_stops
. We will use the pipe operator throughout the lesson to chain together functions in this manner.
Let’s check the updated co_traffic_stops
object and make sure that the new field has been successfully created:
# prints contents of "co_traffic_stops"
co_traffic_stops
# A tibble: 3,112,853 × 21
county_name date Year raw_row_number time location subject_age subject_race subject_sex
<chr> <date> <chr> <chr> <lgl> <chr> <dbl> <chr> <chr>
1 Mesa County 2013-06-19 2013 1947986|1947987 NA 19, I70… 26 hispanic male
2 Jefferson County 2012-08-24 2012 1537576 NA 254, H2… NA <NA> <NA>
3 Logan County 2012-09-23 2012 1581594 NA 115, I7… 52 white male
4 Douglas County 2011-08-25 2011 1009205 NA 197, H8… 32 white female
5 Kiowa County 2013-06-08 2013 1932619 NA 107, H2… 33 hispanic male
6 Boulder County 2011-12-23 2011 1179436 NA 48, 384… NA <NA> <NA>
7 Boulder County 2012-04-07 2012 1326795 NA 0, R250… 39 white male
8 Arapahoe County 2013-03-03 2013 1786795 NA 19, E47… 44 white female
9 Park County 2012-09-02 2012 1552164 NA 224, H2… NA <NA> <NA>
10 Adams County 2011-08-21 2011 1004281|1004282|1004283… NA R2000, … 32 hispanic male
# … with 3,112,843 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
# violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
# contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>
Note the newly created Year field above. We can also check to make sure that the “Year” field has been successfully created by viewing co_traffic_stops
in the R Studio data viewer with View(co_traffic_stops)
.
3.2 Filter by year
Now that we have created the “Year” field, we can use it to extract the 2010 observations using the filter()
function , and assign the filtered dataset of 2010 stops to a new object named co_traffic_stops_2010
:
# Extract 2010 observations and assign to a new object named
# "co_traffic_stops_2010"
<-co_traffic_stops %>% filter(Year==2010) co_traffic_stops_2010
When we print the contents of the newly created co_traffic_stops_2010
object, note that the observations are now only from 2010.
# Print contents of "co_traffic_stops_2010" object
co_traffic_stops_2010
# A tibble: 470,284 × 21
date Year county_name subject_race raw_row_number time location subject_age subject_sex
<date> <chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <chr>
1 2010-04-17 2010 Montezuma County white 188721|188722 NA 2, 989,… 16 female
2 2010-04-17 2010 Montezuma County white 187958 NA 991, 32 54 male
3 2010-04-17 2010 Montezuma County hispanic 188451 NA 9, 280,… 49 male
4 2010-04-17 2010 Montezuma County white 186989|186990|186991|18… NA 3, 277,… 16 male
5 2010-04-17 2010 Montezuma County white 186997|186998|186999|18… NA 3, 277,… 37 male
6 2010-04-17 2010 Montezuma County white 186993|186994|186995|18… NA 3, 277,… 39 male
7 2010-12-21 2010 Mineral County <NA> 600865 NA 164.5, … 110 <NA>
8 2010-12-21 2010 Mineral County <NA> 600477 NA 163, 29… 110 <NA>
9 2010-01-20 2010 Pueblo County hispanic 36625|36626 NA 312, H5… 45 male
10 2010-01-01 2010 Chaffee County white 275 NA 127, H2… 17 female
# … with 470,274 more rows, and 12 more variables: officer_id_hash <chr>, officer_sex <chr>, type <chr>,
# violation <chr>, arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>, outcome <chr>,
# contraband_found <lgl>, search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>