8 Extracting keywords in context
The idea behind “keywords in context” is that it is often useful to extract a word of interest (i.e. a “keyword”) from a text collection along with its surrounding words, so as to develop a sense of the context in which that keyword tends to be used.
In this final section, we’ll briefly learn how to use functions from the text mining and analysis package quanteda to extract a table which provides contextual information about a given keyword.
To extract a keyword in context using quanteda, we first have to create a quanteda tokens object. We can do so by passing the column in diario_creatives_tidy that contains our text data (named “text”) to the tokens() function. We’ll also set remove_punct=TRUE to remove punctuation from our tokens object, since punctuation marks would otherwise be counted as tokens and take up slots among the words surrounding a given keyword. We’ll assign the resulting tokens object to a new object named kwic_token:
# creates a tokens object based on the "text" column of the
# "diario_creatives_tidy" data frame and assigns it to "kwic_token"
kwic_token <- tokens(diario_creatives_tidy$text, remove_punct = TRUE)
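To see what remove_punct=TRUE does before applying it to the full collection, it can help to tokenize a tiny example first. The sketch below is only illustrative: the toy_text vector and its document names are made up for this example and are not part of the tutorial's data.

```r
library(quanteda)

# Hypothetical two-document example (not taken from diario_creatives_tidy)
toy_text <- c(doc_a = "The earth, they say, is round.",
              doc_b = "Heaven and earth!")

# Tokenize by word, dropping punctuation marks
toy_tokens <- tokens(toy_text, remove_punct = TRUE)

# Print the tokens and count how many each document contains
print(toy_tokens)
toy_counts <- ntoken(toy_tokens)
print(toy_counts)
```

Because the commas, period, and exclamation mark are dropped, only the words themselves are counted as tokens.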
Now, we’ll use the kwic() function to extract contextual information about a given keyword. Let’s unpack the various arguments we pass to kwic() below:
- The first argument to kwic() is our tokens object defined above (kwic_token), which contains our text data tokenized by word.
- The second argument, pattern="earth", specifies that our keyword of interest is “earth”.
- Finally, the window=3 argument specifies the number of context words we’d like to extract on either side of the keyword.

By setting window=3, the kwic() function will identify every instance of the word “earth” in the text collection, and extract the three words before and the three words after each instance of our keyword. It will then return a data frame that organizes this contextual information; we’ll assign this data frame to a new object named earth_keyword_context:
# Extracts contextual text data for the keyword "earth", based on a
# window of 3 words; the resulting data frame containing the contextual
# information associated with each appearance of the keyword is assigned to
# a new object named "earth_keyword_context"
earth_keyword_context <- kwic(kwic_token, pattern = "earth", window = 3)
Let’s now take a look at earth_keyword_context within the data viewer:
# Views "earth_keyword_context" in data viewer
View(earth_keyword_context)
As you can see, the “docname” column identifies the specific text file in which the “earth” keyword occurred, and the “pre” and “post” columns contain (respectively) the three words before and the three words after the keyword, for every instance in which it is used.
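If the data viewer is not available (for example, when running R outside RStudio), the same columns can be inspected in the console. The following is a minimal sketch on made-up data; toy_text, toy_kwic, and toy_df are hypothetical names used only for illustration, and the same steps should carry over to earth_keyword_context.

```r
library(quanteda)

# Hypothetical two-document example (not taken from diario_creatives_tidy)
toy_text <- c(doc_a = "We walk upon the earth every single day.",
              doc_b = "Heaven and earth are far apart.")
toy_tokens <- tokens(toy_text, remove_punct = TRUE)

# Extract each occurrence of "earth" with up to 3 words on either side
toy_kwic <- kwic(toy_tokens, pattern = "earth", window = 3)

# Convert to a plain data frame and print the columns discussed above
toy_df <- as.data.frame(toy_kwic)
print(toy_df[, c("docname", "pre", "keyword", "post")])
```

When a keyword sits near the start or end of a document, as in doc_b here, the “pre” or “post” column simply contains fewer words than the window size. Note also that kwic() matches patterns case-insensitively by default, so pattern="earth" would also pick up “Earth”.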