The following script assumes that all of the strategic plans listed in the strategic plans spreadsheet (located in the repository’s data folder) are downloaded as PDFs to your R working directory.
Once the plans are downloaded and stored as individual pdf files in the working directory, load the following libraries:
# Load relevant libraries
library(pdftools)
library(tm)
library(SnowballC)
library(wordcloud)
library(tidyverse)
library(knitr)
In the following, we make a vector of the names of strategic plan file names (all plans are stored as PDFs in the working directory), and assign this vector to an object named files
.
# Make a vector of the names of the PDF files (i.e. plan documents) in working directory
files<-list.files(pattern="pdf$")
We will now use the vector of plan document names (files
) to make a text corpus that is based on the underlying text of the strategic plans; this corpus is assigned to an object named corpus_plans
:
# Make corpus based on strategic plan documents
corpus_plans<-Corpus(URISource(files),readerControl = list(reader=readPDF))
Now, we use the corpus as an argument in the TermDocumentMatrix
function to make a term document matrix derived from the corpus. This term document matrix is assigned to an object named plans.tdm
# Make term document matrix based on corpus_plans object
plans.tdm <- as.matrix(TermDocumentMatrix(corpus_plans,
control =
list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = FALSE,
removeNumbers = TRUE,
bounds = list(global = c(3, Inf)))))
How many times does the word “commons” appear in the corpus of strategic plans? To find out, we extract a vector that sums up the number of times various words appear in the corpus using the rowSums
function, and assign this vector to an object called words
; we then create a data frame based on words
, with one column (named “word”) containing the various words in the corpus, and the other column (named “freq”) containing information on how many times the associated word appears in the corpus. We then extract the rows with the words “commons” and “library” (to serve as a reference point) to see how many times they appear in the corpus.
# Extract vector containing information on word frequencies
words <- sort(rowSums(plans.tdm), decreasing=TRUE)
# Create a data frame based on the information in "words"
df <- data.frame(word = names(words),freq=words)
# Find the frequency of "commons" and "library" in the corpus
df %>% filter(word=="commons"|word=="library") %>% remove_rownames()
## word freq
## 1 library 2908
## 2 commons 83
Finally, we use the “word” and “freq” columns from the df
object created above to generate a word cloud of the most frequently recurring words in the strategic plan corpus:
# Set seed for reproducibility
set.seed(1234)
# Generate word cloud
wordcloud(words = df$word, freq = df$freq, min.freq = 50, max.words=90, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
This brief script draws on previously published online guides to working with text data and generating word clouds in R. Especially helpful were Clay Ford’s introduction to text mining in R (published through the University of Virginia Libraries and available here), STHDA’s tutorial on text mining and word clouds in R, and Celine Van den Rul’s tutorial in the towards data science series on generating word clouds in R.