1 Introduction

This workshop introduces participants to basic text analysis techniques using R, an open-source statistical computing language, and RStudio, a user-friendly development environment in which you write R code for analysis and visualization. Both are widely used in digital scholarship and teaching.

Over the course of the workshop, we will tour the RStudio interface, cover basic setup and installation procedures, demonstrate how to process and prepare text data, and explore functions that implement common text analysis tasks, such as creating word clouds and frequency graphs. The workshop is intended for programming beginners; a working knowledge of text mining methods is helpful but not required.
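To give a sense of what a frequency count looks like in practice, here is a minimal sketch that uses only base R. The sentence is a toy example invented for illustration and is not drawn from the workshop corpus; the packages and functions used in the workshop itself may differ.

```r
# Toy sentence invented for illustration; not taken from the workshop data.
text <- "The quick brown fox jumps over the lazy dog and the fox sleeps"

# Lowercase the text, then split on any run of characters that is not a-z.
# (For text with accented letters, "[^[:alpha:]]+" is a safer pattern.)
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[words != ""]

# Count how often each word occurs and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
head(freq, 3)

# Draw a simple frequency graph of the five most common words.
barplot(head(freq, 5), las = 2, ylab = "Count")
```

The same pattern of splitting text into words, counting them, and plotting the counts underlies both word clouds and frequency graphs.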

1.1 Objectives

This workshop has four learning goals:

Build familiarity with the RStudio interface
Install and load common text analysis packages
Practice writing R syntax
Implement basic text processing tasks with relevant functions

1.2 Background on Workshop Data

The textual data for this workshop was collected from El Diario de la Gente, an independent Chicanx student newspaper published at the University of Colorado Boulder from 1972 to 1983. Founded by then-journalism major Juan Espinosa and UMAS, the newspaper covered a wide range of topics, including boycotts and protests, the tragic events surrounding Los Seis de Boulder, and the friction between the Chicanx student community and the CU Boulder administration. The activist newspaper also served as a creative platform for the poetry, art, and literature of El Movimiento. The Colorado Historic Newspapers Collection completed digitization of the newspaper in 2019, including optical character recognition (OCR) of the text. That fall, a group of digital humanities graduate students created a plain-text corpus for computational analysis. We include parts of that dataset here; it still contains errors from the collection process (e.g., incorrect OCR, encoding errors), which serve as teaching and learning opportunities. The corpus is divided into articles, creatives, notices, and advertisements; see Digital El Diario for more details.

1.3 Accessing Data and Workshop Materials

All of the data and underlying code for the workshop can be downloaded from the GitHub repository that underpins this workshop site. If you would like to follow along but prefer an alternative to GitHub, you can also download the data from this Dropbox folder. Note that the lesson plan uses the text collection in the “creatives” subdirectory of the main “2019.11.14-ElDiarioCorpus” directory, which contains all of the data associated with El Diario.
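Once the data is downloaded, loading the “creatives” texts into R might look like the sketch below. It assumes the corpus directory sits inside your current working directory and that the files carry a .txt extension; both are assumptions, so adjust the path and pattern to match your own download. The names `corpus_dir`, `read_one`, and `creatives` are illustrative only.

```r
# Assumed location: the corpus folder is inside the current working directory.
# Change this path to wherever you saved the download.
corpus_dir <- file.path("2019.11.14-ElDiarioCorpus", "creatives")

# List the plain-text files in the subdirectory (assumes a .txt extension).
txt_files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# Helper (illustrative name): read one file and join its lines into one string.
read_one <- function(path) {
  paste(readLines(path, encoding = "UTF-8", warn = FALSE), collapse = "\n")
}

# Read every file; the result is a character vector with one element per text.
creatives <- vapply(txt_files, read_one, character(1))

# How many texts were loaded? A result of 0 usually means the path is wrong.
length(creatives)
```

If `length(creatives)` returns 0, check your working directory with `getwd()` and confirm that the folder names match those in your download.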

A Gentle Introduction to Text Analysis in R
