5 Creating a word frequency table

A common task in basic text analysis is to identify the frequency with which various words occur in a text collection, and present this information in a table (commonly referred to as a word frequency table). In this section, we will learn how to transform a tidy text data frame such as diario_creatives_tidy into a word frequency table.

5.1 Tokenize diario_creatives_tidy

The first step in creating a word frequency table is to split our column containing the relevant text (“text” in diario_creatives_tidy) into word tokens; in the context of text analysis, a “token” is the meaningful unit of text that you wish to use for information extraction purposes. Because our end goal is a word frequency table, the unit of analysis that makes sense here is the single word, so we will use single words as our tokens.

In the code below, we first take our existing diario_creatives_tidy data frame, and then use the unnest_tokens() function from the tidytext package to tokenize the information in the “text” column of diario_creatives_tidy by word, and return a new data frame where the textual unit of analysis is the word. Note that the unnest_tokens() function takes three arguments, described below:

  • The first, input=text, specifies the name of the column in diario_creatives_tidy that contains the text data we wish to tokenize; here, the name of this desired input column is “text”.
  • The second argument, token="words", specifies that we wish to tokenize the text data in the “text” column at the level of the individual word.
  • Finally, the argument output=word specifies that the name of the column containing the word data in the new one-token-per-row dataset is to be “word.”

We’ll assign the new one-token-per-row dataset that results from the tokenization process to a new object named diario_word_tokenized:

# Tokenizes "diario_creatives_tidy" by word and assigns the resulting dataset to 
# "diario_word_tokenized"
diario_word_tokenized<-
  diario_creatives_tidy %>% # declares dataset with relevant text
    unnest_tokens(input=text, # specifies name of input column containing text data
                  token="words", # specifies how to tokenize input column
                  output=word)  # specifies name of output column containing tokens

Now, go ahead and take a look at the newly tokenized dataset, diario_word_tokenized, within the RStudio data viewer:

# opens "diario_word_tokenized" in RStudio data viewer
View(diario_word_tokenized)


One feature of the code used to generate a tokenized dataset may require additional elaboration. In particular, the code above uses a “pipe”, a symbol that looks like this: %>%. The pipe operator essentially takes whatever is to its left, and then uses it as an input to the code on its right. More specifically, in the code above, the operations on the right of the %>% are applied to the object on the left of the %>%. In other words, the pipe operator links the code on its two sides, and establishes that the dataset to be tokenized in the manner specified by the code on its right is diario_creatives_tidy.
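To see the pipe’s role concretely, the tokenization code above is equivalent to calling unnest_tokens() directly, with diario_creatives_tidy supplied explicitly as its first argument:

# equivalent to the piped version above; the dataset is passed explicitly 
# as the first argument to unnest_tokens()
diario_word_tokenized<-unnest_tokens(diario_creatives_tidy,
                                     input=text,
                                     token="words",
                                     output=word)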

5.2 Extract the first draft of a word frequency table from the tokenized dataset

Now that we have tokenized our text collection by word, and organized this information in a word-level dataset (diario_word_tokenized), let’s use it to create the first draft of a word frequency table. To do so, we will use the count() function on the diario_word_tokenized dataset we created above. The count() function is part of the dplyr package, which contains extremely useful functions for data wrangling tasks; dplyr is one package within the broader suite of tidyverse packages.

In the code below, we take our tokenized dataset, diario_word_tokenized, and then call the count() function. The first argument to count(), word, specifies the name of the column containing the values (i.e. words) whose frequencies we’d like to count up; here, the name of that column is “word”. The second argument, sort=TRUE, specifies that the frequency table generated by the count() function should arrange the words in descending order of frequency. We’ll assign the resulting dataset, which will contain information on the number of times each distinct word appears in the text collection (stored in a column named “n”), to a new object named diario_frequency_table:

# Uses "count" function to generate a dataset that contains information on the 
# frequency of each word in "diario_word_tokenized"; assigns this newly created 
# dataset (organized in descending order) to object named 
# "diario_frequency_table"
diario_frequency_table<-diario_word_tokenized %>% 
                          count(word, sort=TRUE)

Go ahead and view the newly created frequency table, diario_frequency_table, in the RStudio data viewer:

# opens "diario_frequency_table" in RStudio data viewer
View(diario_frequency_table)

We now have a word frequency table! The words which appear in the text collection are contained in the column on the left (“word”), and the frequency of each word is contained in the column on the right (“n”).

5.3 Clean the word frequency table

Having generated this basic word frequency table, it is now time to edit and clean it, so as to ensure that it contains meaningful information.

5.3.1 Remove stopwords

One issue with our frequency table (diario_frequency_table) that you may have noticed is that many of the words are (predictably enough) common words that don’t really inform us about the distinctive semantic features of the “creatives” text collection. For example, the word “the” appears 857 times in our collection, but for most research projects, this would be completely uninteresting, and may even obscure more interesting patterns within the text collection. When working with text data, it is therefore common practice to excise these common words (known as “stopwords”) from a dataset of interest, before carrying out any further analysis or visualization tasks.

Let’s now go ahead and excise stopwords from diario_frequency_table. In order to remove stopwords, we must first compile those words, and assemble them in an R data structure (such as a data frame or vector). This can be a tedious task, but luckily, many text mining and analysis packages provide pre-assembled collections of stopwords that we can use off the shelf, without having to develop our own from scratch.

Within the tidytext package, stopwords are contained in a data frame that is assigned to an object named stop_words. We can preview some of these words by printing the contents of stop_words to the console:

# prints contents of "stop_words"
stop_words
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # … with 1,139 more rows

Alternatively, if you want to see all of the stopwords within stop_words, you can view it within the RStudio data viewer (i.e. View(stop_words)).
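Incidentally, stop_words combines words from several stopword lexicons, which are identified in its “lexicon” column. If you would rather work with a single lexicon, you can (optionally) subset stop_words with the filter() function before removing stopwords; for example, the code below keeps only the stopwords from the Snowball lexicon:

# (optional) keeps only stopwords from the "snowball" lexicon
snowball_stopwords<-stop_words %>% 
                      filter(lexicon=="snowball")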

Because our text documents also contain a meaningful amount of Spanish, it makes sense to also remove Spanish language stopwords. The stop_words data frame from the tidytext package only contains English stopwords, but the tm package allows users to extract Spanish stopwords by passing the string "spanish" to its stopwords() function. Below, we extract Spanish stopwords from the tm package’s stopwords() function as a data frame (by passing stopwords("spanish") as an argument to the as.data.frame() function), and assign it to a new object named spanish_stopwords:

# extract Spanish stopwords as a data frame, and assign it to an object named 
# "spanish_stopwords"
spanish_stopwords<-as.data.frame(stopwords("spanish"))

Feel free to view spanish_stopwords within the data viewer by passing it as an argument to the View() function (i.e. View(spanish_stopwords)). Note that the column containing the stopwords is named stopwords("spanish").

Now that we have a set of English stopwords (stop_words), and a set of Spanish stopwords (spanish_stopwords), let’s remove these stopwords from our word frequency table (diario_frequency_table):

# Takes the existing "diario_frequency_table" dataset, and removes English 
# and Spanish stopwords from it
diario_frequency_table<-
  diario_frequency_table %>% 
    filter(!word %in% stop_words$word) %>% # removes English stopwords
    filter(!word %in% spanish_stopwords$`stopwords("spanish")`) # removes Spanish stopwords

Let’s unpack the code in the previous codeblock:

  • The first element (to the right of the assignment operator), is the name of the object from which we’d like to remove the relevant stopwords, diario_frequency_table.
  • After declaring the name of the object we’d like to modify, we use a %>% to connect this object to the subsequent line of code, which removes the English stopwords using filter(!word %in% stop_words$word). The filter() function (which is part of the dplyr package), is used to create a subset of an existing dataset, in which all of the rows that satisfy a condition (or set of conditions) are retained (and those which do not satisfy the condition(s) are discarded). Here, the argument to the filter() function, which reads !word %in% stop_words$word, can be interpreted as specifying that we wish to subset and retain all of the rows in diario_frequency_table in which the “word” column (within diario_frequency_table) is NOT equal to any of the words in the “word” column of stop_words (note that in order to refer to a column from a dataset, we can type the name of the data frame object, followed by a $, followed by the name of the column within the data frame). This procedure effectively excises the English stopwords from diario_frequency_table. The ! before word is a small detail, but important; the exclamation mark is a symbol for logical negation, and without it, the filter() function would instead subset and retain all of the rows in diario_frequency_table where the word in the “word” column is an English stopword.
  • After excising all of the English stopwords from diario_frequency_table, we use another pipe (%>%), to indicate that we want to implement additional changes to diario_frequency_table. We then use the filter() function to remove Spanish stopwords from diario_frequency_table, with filter(!word %in% spanish_stopwords$`stopwords("spanish")`). The logic behind this expression is the same as the logic behind the previous expression, which removed English stopwords from the frequency table. That is, the argument to the filter() function specifies that we want to subset and retain all of the rows in diario_frequency_table for which the “word” column is NOT one of the Spanish stopwords contained in the stopwords("spanish") column within the spanish_stopwords dataset. This effectively removes the words in spanish_stopwords from diario_frequency_table.
  • Finally, we assign the changes we made to diario_frequency_table (i.e. removing stopwords) back to the existing diario_frequency_table object with the assignment operator (<-) instead of creating a new object; this effectively updates the contents of diario_frequency_table. Before, this data object included the stopwords in stop_words and spanish_stopwords; now, after assigning the changes back to the existing object, diario_frequency_table no longer contains these stopwords.
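If the behavior of %in% and the ! negation is unclear, a toy vector (whose contents are purely illustrative) may help:

# toy illustration of %in% and its negation
fruits<-c("apple", "banana", "cherry")
fruits %in% c("banana", "cherry") # returns FALSE TRUE TRUE
!(fruits %in% c("banana", "cherry")) # returns TRUE FALSE FALSE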

You can confirm that the stopwords have been removed by viewing our updated diario_frequency_table object within the RStudio data viewer.

5.3.2 Remove numbers

You may have noticed that some of the “words” in diario_frequency_table are actually numbers. In certain situations, this may be acceptable or desirable, but in other cases, you may want to remove numbers from your word frequency table. In this sub-section, we’ll learn one way to do this.

The first step is to parse each row in the “word” column of diario_frequency_table with the parse_number() function (from the readr package, which is part of the tidyverse): for each entry, parse_number() extracts any number it finds, and returns an “NA” value if the entry contains no numbers; this information can be stored as a vector. The code below uses the parse_number() function to generate such a vector, and assigns this vector to a new object named diario_frequency_table_numbers:

# defines a vector that extracts numbers from the "word" column in 
# "diario_frequency_table" and assigns it to a new vector object named 
# "diario_frequency_table_numbers"
diario_frequency_table_numbers<-parse_number(diario_frequency_table$word)
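If you’d like to see how parse_number() behaves before applying it at scale, you can try it on a few toy strings (chosen purely for illustration):

# toy illustration of parse_number(); entries without numbers yield NA 
# (along with a parsing warning)
parse_number(c("42", "2nd")) # returns 42 2
parse_number("hello") # returns NA (with a warning)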

Next, we’ll add this vector as a column in diario_frequency_table by using the cbind() function, which is used to bind a vector to a dataset as a column. Below, the first argument to cbind() is the name of the dataset to which we want to bind the vector (diario_frequency_table), and the second argument is the name of the desired vector (diario_frequency_table_numbers). We’ll assign the resulting dataset (i.e. diario_frequency_table with the diario_frequency_table_numbers vector added to it as a column) back to diario_frequency_table, which will update the existing diario_frequency_table object with this new column:

# updates the existing "diario_frequency_table" data frame by binding the
# "diario_frequency_table_numbers" vector to it
diario_frequency_table<-cbind(diario_frequency_table, diario_frequency_table_numbers)

In the updated diario_frequency_table, we now have a new column (diario_frequency_table_numbers) which contains any numbers from the corresponding “word” column on the same row, and an “NA” value if there are no numbers in the corresponding “word” column. You can confirm this by viewing diario_frequency_table in the RStudio data viewer.
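As an aside, cbind() comes from base R; if you prefer to stay within the tidyverse, the same column could instead have been added in one step with dplyr’s mutate() function. The sketch below is an equivalent alternative to the parse_number()/cbind() sequence above:

# (alternative sketch) adds the numbers column with mutate() instead of cbind()
diario_frequency_table<-diario_frequency_table %>% 
                          mutate(diario_frequency_table_numbers=parse_number(word))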

Now that we have a new column that contains “NA” values when the corresponding word in the “word” column does NOT include numbers, we can use the filter() function to create a subset of the existing diario_frequency_table for which the value of the “diario_frequency_table_numbers” column is “NA”; this effectively excises all rows where the value for the “word” column includes a number:

# subsets all rows in which the "diario_frequency_table_numbers" column of the 
# "diario_frequency_table" data frame is an NA value; this effectively removes 
# all rows in "diario_frequency_table" in which the "word" column has a number
diario_frequency_table<-diario_frequency_table %>% 
                        filter(is.na(diario_frequency_table_numbers))

Now that we no longer need the “diario_frequency_table_numbers” column, we can go ahead and delete it. We can select a desired column from an existing dataset using the select() function from dplyr; conversely, we can also use select() to delete a column, by including a minus sign (-) before the name of the column we’d like to delete. The code below takes the existing diario_frequency_table and then deletes its “diario_frequency_table_numbers” column with the expression that reads select(-diario_frequency_table_numbers):

# deletes the "diario_frequency_table_numbers" column from the 
# "diario_frequency_table" data frame
diario_frequency_table<-diario_frequency_table %>% 
                        select(-diario_frequency_table_numbers)
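Now that each cleaning step is familiar, note that the operations from Sections 5.2 and 5.3 could equivalently be written as a single pipeline. The sketch below assumes the diario_word_tokenized, stop_words, and spanish_stopwords objects defined earlier already exist (parse_number() will emit parsing warnings for non-numeric words, which can be ignored here):

# (equivalent sketch) builds the cleaned frequency table in a single pipeline
diario_frequency_table<-
  diario_word_tokenized %>% 
    count(word, sort=TRUE) %>% 
    filter(!word %in% stop_words$word) %>% 
    filter(!word %in% spanish_stopwords$`stopwords("spanish")`) %>% 
    filter(is.na(parse_number(word)))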

5.4 View the final (cleaned) frequency table

Let’s now view our final word frequency table, which we generated by cleaning the draft version of the frequency table (from Section 5.2), by excising stopwords and numbers (in Section 5.3):

# Views updated "diario_frequency_table" in RStudio data viewer
View(diario_frequency_table)