4 Transforming Data

Now that we have created a new data object that contains information on traffic stops from our year of interest (2010), let’s do a bit of work with the data so that we can find out the total number of traffic stops within each county, and the number of stops within each county in which the person stopped by the traffic police was Black.

4.1 Tabulate county-level count of traffic stops by race

First, let’s find out the racial breakdown of stops for each county, using the data in the “subject_race” field of co_traffic_stops_2010.

Below, we take the co_traffic_stops_2010 object, declare “county_name” as a grouping variable using group_by(county_name), and then count up the number of stops associated with each racial category within each county using count(subject_race). The dataset thats results from these operations is assigned to a new object named co_county_summary:

# Compute county-level count of traffic stops by race
co_county_summary<-co_traffic_stops_2010 %>% 
                    group_by(county_name) %>% 
                    count(subject_race) 

Let’s print the first few rows from the dataset and observe its structure:

# Prints contents of "co_county_summary"
co_county_summary
# A tibble: 439 × 3
# Groups:   county_name [65]
   county_name    subject_race               n
   <chr>          <chr>                  <int>
 1 Adams County   asian/pacific islander   582
 2 Adams County   black                   1208
 3 Adams County   hispanic                8012
 4 Adams County   other                     36
 5 Adams County   unknown                  462
 6 Adams County   white                  20225
 7 Adams County   <NA>                    3825
 8 Alamosa County asian/pacific islander    18
 9 Alamosa County black                     43
10 Alamosa County hispanic                1537
# … with 429 more rows

Note that each row contains information on the number of times a person from a given racial category was stopped by the traffic police in a given county. For example, in Adams County, CO, Asian/Pacific Islander motorists were stopped 582 times, Black motorists were stopped 1208 times, and white motorists were stopped 20225 times. The dataset contains information on the racial and ethnic breakdown of police stops for each county in Colorado, in the year 2010.

4.2 Reshape the data

Note that co_county_summary is currently a “long” dataset, in which there are multiple rows associated with each county, with each row corresponding to a distinct county/racial category combination, and the n column providing information on the number of stops associated with that county/racial category pairing.

It will be easier to instead work with a “wide” dataset, in which each county is associated with a single row, and each racial category is assigned to its own column. Each cell in this “wide” dataset would correspond to the number of stops associated with a given county (defined by the row) for a given racial category (defined by the column).

To reshape the dataset from its current “long” format into a “wide” format, we can use the pivot_wider() function. The code below takes the current co_county_summary data object (in “long” format), and then transforms it into a “wide” format with pivot_wider(names_from=subject_race, values_from=n). The names_from argument specifies the name of the current column which contains the categories which we want to transform into columns (here, subject_race), while the values_from argument specifies the name of the current column which contains the values that will be associated with each county/racial category combination (here, n). Finally, the code below assigns the transformed dataset to a new object named co_county_summary_wide:

# Transforms "co_county_summary" from long format to wide, and assigns 
# the reshaped dataset to a new object named "co_county_summary_wide"
co_county_summary_wide<-co_county_summary %>% 
                        pivot_wider(names_from=subject_race, values_from=n)

Let’s print the contents of co_county_summary_wide to ensure that the data has indeed been transformed into a “wide” format:

# prints contents of "co_county_summary_wide"
co_county_summary_wide
# A tibble: 65 × 8
# Groups:   county_name [65]
   county_name       `asian/pacific islander` black hispanic other unknown white  `NA`
   <chr>                                <int> <int>    <int> <int>   <int> <int> <int>
 1 Adams County                           582  1208     8012    36     462 20225  3825
 2 Alamosa County                          18    43     1537     9      30  2427   414
 3 Arapahoe County                        540  1819     1862    12     300 11089  1898
 4 Archuleta County                        17    28      392    71      41  4125   417
 5 Baca County                             11    61      288    NA       6   971   174
 6 Bent County                              8    46      314     1       6  1155   278
 7 Boulder County                         345   192     1050    10     180  9682  1594
 8 Broomfield County                       32    22      104     3      18   690   226
 9 Chaffee County                          43    37      361     9      71  4806  1194
10 Cheyenne County                         10    38      147     3       2   821    85
# … with 55 more rows

Indeed, we can see that each row is now associated with a single county, and each column is now associated with a given racial category.

4.3 Calculate total stops for each county in co_county_summary_wide

Now that we have our dataset in “wide” format, let’s create a new column, named “total_stops” that contains information on the total number of traffic stops for each county. We can create this column by calculating the sum total of stops across all of the racial categories for each county, which yields the total number of stops for each county.

The code below takes the co_county_summary_wide object, and then calls the rowise() function, which allows us to make calculations across the rows of a data frame. It then creates a new column, called “total_stops” using the now-familiar mutate() function; this “total_stops” column is populated by taking the sum of traffic stops across racial categories for each county. This is accomplished with sum(c_across(where(is.integer)), na.rm=TRUE). This expression can be translated as follows: “for each row in the dataset, calculate the sum across the columns whenever the value in a column is an integer; whenever a cell value is ‘NA’, simply ignore it in the calculation.” We’ll assign the dataset, with the newly added “total_stops” column, back to the same co_county_summary_wide object; this effectively overwrites the contents of the current co_county_summary_wide object (a dataset without the “total_stops” column), with the new dataset (which does have the “total_stops” column).

# Takes the existing "co_county_summary_wide" dataset, and creates a new column 
# called "total_stops" that sums the values across columns for each row; 
# the revised dataset is assigned back to "co_county_summary_wide", 
# which overwrites the object's previous contents with the revised dataset
co_county_summary_wide<-co_county_summary_wide %>% 
                        rowwise() %>% 
                        mutate(total_stops=sum(c_across(where(is.integer)), 
                                               na.rm=TRUE)) 

Let’s print the contents of the co_county_summary_wide object and confirm that the new column has been successfully created:

# Prints updated contents of "co_county_summary_wide"
co_county_summary_wide
# A tibble: 65 × 9
# Rowwise:  county_name
   county_name       total_stops `asian/pacific islander` black hispanic other unknown white  `NA`
   <chr>                   <int>                    <int> <int>    <int> <int>   <int> <int> <int>
 1 Adams County            34350                      582  1208     8012    36     462 20225  3825
 2 Alamosa County           4478                       18    43     1537     9      30  2427   414
 3 Arapahoe County         17520                      540  1819     1862    12     300 11089  1898
 4 Archuleta County         5091                       17    28      392    71      41  4125   417
 5 Baca County              1511                       11    61      288    NA       6   971   174
 6 Bent County              1808                        8    46      314     1       6  1155   278
 7 Boulder County          13053                      345   192     1050    10     180  9682  1594
 8 Broomfield County        1095                       32    22      104     3      18   690   226
 9 Chaffee County           6521                       43    37      361     9      71  4806  1194
10 Cheyenne County          1106                       10    38      147     3       2   821    85
# … with 55 more rows

4.4 Clean co_county_summary_wide and assign to new object

Let’s clean up the co_county_summmary_wide dataset a bit more to make it easier to work with. First, because our interest in this workshop is in exploring the possibility that Black motorists suffer disproportionately high traffic stop rates (relative to their share of the overall adult population), let’s create a new object that only contains the data that is essential for the subsequent analysis: the county’s name (“county_name”), the number of Black motorists that were stopped (“black”), and the total number of stops in the county across all racial categories ("total_stops).

The code below takes the existing co_county_summary_wide object, and then uses the select() function to select the columns we want to keep. It then assigns this selection to a new object named “co_county_black_stops”:

# Selects the "county_name", "black", and "total_stops" columns 
# from the "co_county_summary_wide" object, and assigns the selection 
# to a new object named "co_county_black_stops"

co_county_black_stops<-co_county_summary_wide %>%
                        select(county_name, black, total_stops) 

Let’s open up the newly created co_county_black_stops object to ensure that these changes have been successfully implemented:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 65 × 3
# Rowwise:  county_name
   county_name       black total_stops
   <chr>             <int>       <int>
 1 Adams County       1208       34350
 2 Alamosa County       43        4478
 3 Arapahoe County    1819       17520
 4 Archuleta County     28        5091
 5 Baca County          61        1511
 6 Bent County          46        1808
 7 Boulder County      192       13053
 8 Broomfield County    22        1095
 9 Chaffee County       37        6521
10 Cheyenne County      38        1106
# … with 55 more rows

As expected, we now have a dataset that contains information on the total number of traffic stops for each Colorado county in 2010 (“total_stops”), and the number of traffic stops that involved Black motorists (“black”).

The name of the column containing information on the number of Black motorists stopped by the traffic patrol is simply “black”, which was inherited from the initial dataset from the Stanford Policing project. However, this column name is somewhat vague, and may cause confusion down the road when we work with census demographic data. Therefore, let’s rename that column name to “black_stops” using the rename() function, and assign that change back to co_county_black_stops:

# Takes the existing "co_county_black_stops" object and renames the column named 
# "black" to "black_stops"; assigns the modified dataset back to 
# "co_county_black_stops", which overwrites the existing contents of 
# "co_county_black_stops"
co_county_black_stops<-co_county_black_stops %>%
                        rename(black_stops=black)

Let’s now print the modified co_county_black_stops object:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 65 × 3
# Rowwise:  county_name
   county_name       black_stops total_stops
   <chr>                   <int>       <int>
 1 Adams County             1208       34350
 2 Alamosa County             43        4478
 3 Arapahoe County          1819       17520
 4 Archuleta County           28        5091
 5 Baca County                61        1511
 6 Bent County                46        1808
 7 Boulder County            192       13053
 8 Broomfield County          22        1095
 9 Chaffee County             37        6521
10 Cheyenne County            38        1106
# … with 55 more rows

Note that the name of the column has successfully been changed, and is now more descriptive.

Finally, if we view co_county_black_stops within the R Studio data viewer (View(co_county_black_stops)), we’ll note a small quirk in the dataset. In particular, while Colorado has 64 counties, the dataset has 65 rows; one of the rows is an extra row, with an “NA” value associated with the “county_name” field, and 20 total traffic stops associated with it. Because we’re interested in a county-level analysis, and these stops are not associated with an actual county, we’ll go ahead and delete that row with the following code, which takes the current co_county_black_stops object, and then uses the filter() function to select only those rows for which “county_name” is NOT EQUAl (!=) to “NA”; after effectively deleting the row where “county_name” is set to “NA”, the code assigns the modified dataset back to co_county_black_stops:

# Takes the "co_county_black_stops" object and removes the row for which 
# "county_name" is "NA"; assigns the modified dataset back to 
# "co_county_black_stops", which overwrites the dataset that is 
# currently assigned to that object
co_county_black_stops<-co_county_black_stops %>% 
                        filter(county_name!="NA")

Note that co_county_black_stops now contains 64 rows:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 64 × 3
# Rowwise:  county_name
   county_name       black_stops total_stops
   <chr>                   <int>       <int>
 1 Adams County             1208       34350
 2 Alamosa County             43        4478
 3 Arapahoe County          1819       17520
 4 Archuleta County           28        5091
 5 Baca County                61        1511
 6 Bent County                46        1808
 7 Boulder County            192       13053
 8 Broomfield County          22        1095
 9 Chaffee County             37        6521
10 Cheyenne County            38        1106
# … with 54 more rows