Code to Generate Word Cloud of Key Words in Corpus of R1 Academic Library Strategic Plans

Preliminaries

The following script assumes that all of the strategic plans listed in the strategic plans spreadsheet (located in the repository’s data folder) are downloaded as PDFs to your R working directory.

Once the plans are downloaded and stored as individual pdf files in the working directory, load the following libraries:

# Load relevant libraries
library(pdftools)
library(tm)
library(SnowballC)
library(wordcloud)
library(tidyverse)
library(knitr)

Create vector of strategic plan document names

In the following, we make a vector of the names of strategic plan file names (all plans are stored as PDFs in the working directory), and assign this vector to an object named files.

# Make a vector of the names of the PDF files (i.e. plan documents) in working directory
files<-list.files(pattern="pdf$")

Make corpus and term document matrix

We will now use the vector of plan document names (files) to make a text corpus that is based on the underlying text of the strategic plans; this corpus is assigned to an object named corpus_plans:

# Make corpus based on strategic plan documents
corpus_plans<-Corpus(URISource(files),readerControl = list(reader=readPDF))

Now, we use the corpus as an argument in the TermDocumentMatrix function to make a term document matrix derived from the corpus. This term document matrix is assigned to an object named plans.tdm

# Make term document matrix based on corpus_plans object
plans.tdm <- as.matrix(TermDocumentMatrix(corpus_plans, 
                                control = 
                                  list(removePunctuation = TRUE,
                                       stopwords = TRUE,
                                       tolower = TRUE,
                                       stemming = FALSE,
                                       removeNumbers = TRUE,
                                       bounds = list(global = c(3, Inf)))))

The frequency of the word “commons” in the strategic plan corpus

How many times does the word “commons” appear in the corpus of strategic plans? To find out, we extract a vector that sums up the number of times various words appear in the corpus using the rowSums function, and assign this vector to an object called words; we then create a data frame based on words, with one column (named “word”) containing the various words in the corpus, and the other column (named “freq”) containing information on how many times the associated word appears in the corpus. We then extract the rows with the words “commons” and “library” (to serve as a reference point) to see how many times they appear in the corpus.

# Extract vector containing information on word frequencies
words <- sort(rowSums(plans.tdm), decreasing=TRUE)
# Create a data frame based on the information in "words"
df <- data.frame(word = names(words),freq=words)
# Find the frequency of "commons" and "library" in the corpus 
df %>% filter(word=="commons"|word=="library") %>% remove_rownames()

##      word freq
## 1 library 2908
## 2 commons   83

Generate word cloud

Finally, we use the “word” and “freq” columns from the df object created above to generate a word cloud of the most frequently recurring words in the strategic plan corpus:

# Set seed for reproducibility
set.seed(1234)
# Generate word cloud
wordcloud(words = df$word, freq = df$freq, min.freq = 50, max.words=90, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

Acknowledgements

This brief script draws on previously published online guides to working with text data and generating word clouds in R. Especially helpful were Clay Ford’s introduction to text mining in R (published through the University of Virginia Libraries and available here), STHDA’s tutorial on text mining and word clouds in R, and Celine Van den Rul’s tutorial in the towards data science series on generating word clouds in R.