8 Extracting keywords in context

The idea behind “keywords in context” is that it is often useful to extract a word of interest (i.e. a “keyword”) from a text collection along with its surrounding words, so as to develop a sense of the context in which that keyword tends to be used.

In this final section, we’ll briefly learn how to use functions from the text mining and analysis package quanteda to extract a table which provides contextual information about a given keyword.

In order to extract a keyword in context using quanteda, we first have to create a quanteda tokens object. We can do so by passing the column in diario_creatives_tidy that contains our text data (named “text”) to the tokens() function. We’ll also set remove_punct=TRUE to remove punctuation from our tokens object (since punctuation marks would otherwise be counted as tokens and take up slots in the window of words surrounding a given keyword). We’ll assign the resulting tokens object to a new object named kwic_token:

# creates a tokens object based on the "text" column of the 
# "diario_creatives_tidy" data frame and assigns it to "kwic_token"
kwic_token<-tokens(diario_creatives_tidy$text, remove_punct = TRUE)
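To see what remove_punct=TRUE changes, here is a minimal sketch on a made-up sentence. The object names (toy_text, with_punct, without_punct) are hypothetical and are not part of our dataset; the sketch only assumes that quanteda is installed.

```r
library(quanteda)

# A made-up sentence used purely for illustration (not from our dataset)
toy_text <- c(doc1 = "The earth, seen from space, is blue.")

# Tokenizes the sentence twice: once keeping punctuation, once removing it
with_punct <- as.list(tokens(toy_text, remove_punct = FALSE))[[1]]
without_punct <- as.list(tokens(toy_text, remove_punct = TRUE))[[1]]

# Prints both token vectors so that they can be compared
print(with_punct)
print(without_punct)
```

When punctuation is kept, the commas and the final period appear as tokens of their own, so they would be counted as context “words” around a keyword.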

Now, we’ll use the kwic() function to extract contextual information about a given keyword. Let’s unpack the various arguments we pass to kwic() below:

The first argument to kwic() is the tokens object we defined above (kwic_token), which contains our text data tokenized by word.
The second argument, pattern="earth", specifies that our keyword of interest is “earth”.
Finally, the window=3 argument specifies the number of context words we’d like to extract on either side of the keyword. With window=3, the kwic() function will identify every instance of the word “earth” in the text collection, and extract the three words before and the three words after each instance. It will then return a data frame that organizes this contextual information; we’ll assign this data frame to a new object named earth_keyword_context:

# Extracts contextual text data for the keyword "earth", based on a
# window of 3 words; the resulting data frame containing the contextual 
# information associated with each appearance of the keyword is assigned to 
# a new object named "earth_keyword_context"
earth_keyword_context<-kwic(kwic_token, pattern="earth", window=3)
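The same call can be tried on a small made-up example where the result is easy to check by eye. The texts and object names below (toy_texts, toy_tokens, toy_context, toy_df) are hypothetical. Note that, with its default settings, kwic() matches patterns without regard to case, so both “earth” and “Earth” should be picked up.

```r
library(quanteda)

# Two made-up documents used purely for illustration (not from our dataset)
toy_texts <- c(
  doc_a = "The earth is round and the Earth orbits the sun.",
  doc_b = "Nothing on this planet compares."
)

# Tokenizes the toy documents, dropping punctuation
toy_tokens <- tokens(toy_texts, remove_punct = TRUE)

# Extracts up to three words on either side of each match for "earth"
toy_context <- kwic(toy_tokens, pattern = "earth", window = 3)

# Converts the result to a plain data frame and prints the context columns
toy_df <- as.data.frame(toy_context)
print(toy_df[, c("docname", "pre", "keyword", "post")])
```

The first match sits near the start of its document, so its “pre” column holds fewer than three words; the window is simply cut short at the edge of a document.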

Let’s now take a look at earth_keyword_context within the data viewer:

# Views "earth_keyword_context" in data viewer
View(earth_keyword_context)

As you can see, the “docname” column identifies the specific text file in which the “earth” keyword occurred, and the “pre” and “post” columns provide (respectively) the three words before and the three words after the keyword, for every instance in which it is used.
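If you are working outside of RStudio, or would like to summarize the result rather than browse it, one option is to convert the output of kwic() to a plain data frame and work with its columns directly. The sketch below uses made-up texts, and all of its object names (toy_texts, toy_context, toy_df, matches_per_doc) are hypothetical.

```r
library(quanteda)

# Two made-up documents used purely for illustration (not from our dataset)
toy_texts <- c(
  doc_a = "The earth is round. Life on earth is varied.",
  doc_b = "They walked the earth for years."
)

# Tokenizes the toy documents and extracts the context around "earth"
toy_context <- kwic(tokens(toy_texts, remove_punct = TRUE),
                    pattern = "earth", window = 3)

# Converts the result to a plain data frame and lists its column names
toy_df <- as.data.frame(toy_context)
print(names(toy_df))

# Counts how many times the keyword appears in each document
matches_per_doc <- table(as.character(toy_df$docname))
print(matches_per_doc)
```

A count like this gives a quick sense of which documents use the keyword most often before you read through the individual contexts.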