4 Transforming Data
Now that we have created a new data object that contains information on traffic stops from our year of interest (2010), let’s do a bit of work with the data so that we can find out the total number of traffic stops within each county, and the number of stops within each county in which the person stopped by the traffic police was Black.
4.1 Tabulate county-level count of traffic stops by race
First, let’s find out the racial breakdown of stops for each county, using the data in the “subject_race” field of co_traffic_stops_2010
.
Below, we take the co_traffic_stops_2010
object, declare “county_name” as a grouping variable using group_by(county_name)
, and then count up the number of stops associated with each racial category within each county using count(subject_race)
. The dataset thats results from these operations is assigned to a new object named co_county_summary
:
# Compute county-level count of traffic stops by race
<-co_traffic_stops_2010 %>%
co_county_summarygroup_by(county_name) %>%
count(subject_race)
Let’s print the first few rows from the dataset and observe its structure:
# Prints contents of "co_county_summary"
co_county_summary
# A tibble: 439 × 3
# Groups: county_name [65]
county_name subject_race n
<chr> <chr> <int>
1 Adams County asian/pacific islander 582
2 Adams County black 1208
3 Adams County hispanic 8012
4 Adams County other 36
5 Adams County unknown 462
6 Adams County white 20225
7 Adams County <NA> 3825
8 Alamosa County asian/pacific islander 18
9 Alamosa County black 43
10 Alamosa County hispanic 1537
# … with 429 more rows
Note that each row contains information on the number of times a person from a given racial category was stopped by the traffic police in a given county. For example, in Adams County, CO, Asian/Pacific Islander motorists were stopped 582 times, Black motorists were stopped 1208 times, and white motorists were stopped 20225 times. The dataset contains information on the racial and ethnic breakdown of police stops for each county in Colorado, in the year 2010.
4.2 Reshape the data
Note that co_county_summary
is currently a “long” dataset, in which there are multiple rows associated with each county, with each row corresponding to a distinct county/racial category combination, and the n
column providing information on the number of stops associated with that county/racial category pairing.
It will be easier to instead work with a “wide” dataset, in which each county is associated with a single row, and each racial category is assigned to its own column. Each cell in this “wide” dataset would correspond to the number of stops associated with a given county (defined by the row) for a given racial category (defined by the column).
To reshape the dataset from its current “long” format into a “wide” format, we can use the pivot_wider()
function. The code below takes the current co_county_summary
data object (in “long” format), and then transforms it into a “wide” format with pivot_wider(names_from=subject_race, values_from=n)
. The names_from
argument specifies the name of the current column which contains the categories which we want to transform into columns (here, subject_race
), while the values_from
argument specifies the name of the current column which contains the values that will be associated with each county/racial category combination (here, n
). Finally, the code below assigns the transformed dataset to a new object named co_county_summary_wide
:
# Transforms "co_county_summary" from long format to wide, and assigns
# the reshaped dataset to a new object named "co_county_summary_wide"
<-co_county_summary %>%
co_county_summary_widepivot_wider(names_from=subject_race, values_from=n)
Let’s print the contents of co_county_summary_wide
to ensure that the data has indeed been transformed into a “wide” format:
# prints contents of "co_county_summary_wide"
co_county_summary_wide
# A tibble: 65 × 8
# Groups: county_name [65]
county_name `asian/pacific islander` black hispanic other unknown white `NA`
<chr> <int> <int> <int> <int> <int> <int> <int>
1 Adams County 582 1208 8012 36 462 20225 3825
2 Alamosa County 18 43 1537 9 30 2427 414
3 Arapahoe County 540 1819 1862 12 300 11089 1898
4 Archuleta County 17 28 392 71 41 4125 417
5 Baca County 11 61 288 NA 6 971 174
6 Bent County 8 46 314 1 6 1155 278
7 Boulder County 345 192 1050 10 180 9682 1594
8 Broomfield County 32 22 104 3 18 690 226
9 Chaffee County 43 37 361 9 71 4806 1194
10 Cheyenne County 10 38 147 3 2 821 85
# … with 55 more rows
Indeed, we can see that each row is now associated with a single county, and each column is now associated with a given racial category.
4.3 Calculate total stops for each county in co_county_summary_wide
Now that we have our dataset in “wide” format, let’s create a new column, named “total_stops” that contains information on the total number of traffic stops for each county. We can create this column by calculating the sum total of stops across all of the racial categories for each county, which yields the total number of stops for each county.
The code below takes the co_county_summary_wide
object, and then calls the rowise()
function, which allows us to make calculations across the rows of a data frame. It then creates a new column, called “total_stops” using the now-familiar mutate()
function; this “total_stops” column is populated by taking the sum of traffic stops across racial categories for each county. This is accomplished with sum(c_across(where(is.integer)), na.rm=TRUE)
. This expression can be translated as follows: “for each row in the dataset, calculate the sum across the columns whenever the value in a column is an integer; whenever a cell value is ‘NA’, simply ignore it in the calculation.” We’ll assign the dataset, with the newly added “total_stops” column, back to the same co_county_summary_wide
object; this effectively overwrites the contents of the current co_county_summary_wide
object (a dataset without the “total_stops” column), with the new dataset (which does have the “total_stops” column).
# Takes the existing "co_county_summary_wide" dataset, and creates a new column
# called "total_stops" that sums the values across columns for each row;
# the revised dataset is assigned back to "co_county_summary_wide",
# which overwrites the object's previous contents with the revised dataset
<-co_county_summary_wide %>%
co_county_summary_widerowwise() %>%
mutate(total_stops=sum(c_across(where(is.integer)),
na.rm=TRUE))
Let’s print the contents of the co_county_summary_wide
object and confirm that the new column has been successfully created:
# Prints updated contents of "co_county_summary_wide"
co_county_summary_wide
# A tibble: 65 × 9
# Rowwise: county_name
county_name total_stops `asian/pacific islander` black hispanic other unknown white `NA`
<chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 Adams County 34350 582 1208 8012 36 462 20225 3825
2 Alamosa County 4478 18 43 1537 9 30 2427 414
3 Arapahoe County 17520 540 1819 1862 12 300 11089 1898
4 Archuleta County 5091 17 28 392 71 41 4125 417
5 Baca County 1511 11 61 288 NA 6 971 174
6 Bent County 1808 8 46 314 1 6 1155 278
7 Boulder County 13053 345 192 1050 10 180 9682 1594
8 Broomfield County 1095 32 22 104 3 18 690 226
9 Chaffee County 6521 43 37 361 9 71 4806 1194
10 Cheyenne County 1106 10 38 147 3 2 821 85
# … with 55 more rows
4.4 Clean co_county_summary_wide
and assign to new object
Let’s clean up the co_county_summmary_wide
dataset a bit more to make it easier to work with. First, because our interest in this workshop is in exploring the possibility that Black motorists suffer disproportionately high traffic stop rates (relative to their share of the overall adult population), let’s create a new object that only contains the data that is essential for the subsequent analysis: the county’s name (“county_name”), the number of Black motorists that were stopped (“black”), and the total number of stops in the county across all racial categories ("total_stops).
The code below takes the existing co_county_summary_wide
object, and then uses the select()
function to select the columns we want to keep. It then assigns this selection to a new object named “co_county_black_stops”:
# Selects the "county_name", "black", and "total_stops" columns
# from the "co_county_summary_wide" object, and assigns the selection
# to a new object named "co_county_black_stops"
<-co_county_summary_wide %>%
co_county_black_stopsselect(county_name, black, total_stops)
Let’s open up the newly created co_county_black_stops
object to ensure that these changes have been successfully implemented:
# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 65 × 3
# Rowwise: county_name
county_name black total_stops
<chr> <int> <int>
1 Adams County 1208 34350
2 Alamosa County 43 4478
3 Arapahoe County 1819 17520
4 Archuleta County 28 5091
5 Baca County 61 1511
6 Bent County 46 1808
7 Boulder County 192 13053
8 Broomfield County 22 1095
9 Chaffee County 37 6521
10 Cheyenne County 38 1106
# … with 55 more rows
As expected, we now have a dataset that contains information on the total number of traffic stops for each Colorado county in 2010 (“total_stops”), and the number of traffic stops that involved Black motorists (“black”).
The name of the column containing information on the number of Black motorists stopped by the traffic patrol is simply “black”, which was inherited from the initial dataset from the Stanford Policing project. However, this column name is somewhat vague, and may cause confusion down the road when we work with census demographic data. Therefore, let’s rename that column name to “black_stops” using the rename()
function, and assign that change back to co_county_black_stops
:
# Takes the existing "co_county_black_stops" object and renames the column named
# "black" to "black_stops"; assigns the modified dataset back to
# "co_county_black_stops", which overwrites the existing contents of
# "co_county_black_stops"
<-co_county_black_stops %>%
co_county_black_stopsrename(black_stops=black)
Let’s now print the modified co_county_black_stops
object:
# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 65 × 3
# Rowwise: county_name
county_name black_stops total_stops
<chr> <int> <int>
1 Adams County 1208 34350
2 Alamosa County 43 4478
3 Arapahoe County 1819 17520
4 Archuleta County 28 5091
5 Baca County 61 1511
6 Bent County 46 1808
7 Boulder County 192 13053
8 Broomfield County 22 1095
9 Chaffee County 37 6521
10 Cheyenne County 38 1106
# … with 55 more rows
Note that the name of the column has successfully been changed, and is now more descriptive.
Finally, if we view co_county_black_stops
within the R Studio data viewer (View(co_county_black_stops)
), we’ll note a small quirk in the dataset. In particular, while Colorado has 64 counties, the dataset has 65 rows; one of the rows is an extra row, with an “NA” value associated with the “county_name” field, and 20 total traffic stops associated with it. Because we’re interested in a county-level analysis, and these stops are not associated with an actual county, we’ll go ahead and delete that row with the following code, which takes the current co_county_black_stops
object, and then uses the filter()
function to select only those rows for which “county_name” is NOT EQUAl (!=
) to “NA”; after effectively deleting the row where “county_name” is set to “NA”, the code assigns the modified dataset back to co_county_black_stops
:
# Takes the "co_county_black_stops" object and removes the row for which
# "county_name" is "NA"; assigns the modified dataset back to
# "co_county_black_stops", which overwrites the dataset that is
# currently assigned to that object
<-co_county_black_stops %>%
co_county_black_stopsfilter(county_name!="NA")
Note that co_county_black_stops
now contains 64 rows:
# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 64 × 3
# Rowwise: county_name
county_name black_stops total_stops
<chr> <int> <int>
1 Adams County 1208 34350
2 Alamosa County 43 4478
3 Arapahoe County 1819 17520
4 Archuleta County 28 5091
5 Baca County 61 1511
6 Bent County 46 1808
7 Boulder County 192 13053
8 Broomfield County 22 1095
9 Chaffee County 37 6521
10 Cheyenne County 38 1106
# … with 54 more rows