4 Creating a tidy data frame from a text corpus

Once we have created a text corpus from the collection of text documents in our directory, we can work directly with that corpus to extract information about its contents. However, corpus objects from the tm package can be unwieldy and unintuitive to work with. Luckily, the tidy() function from the tidytext package can quickly transform a tm corpus object into a more tractable data frame (i.e. a tabular dataset) in which each text document is assigned its own row and the full text of each document is stored in a single column. Below, we pass our corpus object, diario_creatives_corpus, as an argument to the tidy() function and assign the resulting data frame to a new object named diario_creatives_tidy:

# Uses the "tidy" function from the "tidytext" package to transform the 
# "diario_creatives_corpus" corpus into a tidy data frame, where each text file
# is associated with a row in the data frame; this data frame is 
# assigned to a new object named "diario_creatives_tidy"
diario_creatives_tidy <- tidy(diario_creatives_corpus)
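To see the conversion in isolation, here is a minimal sketch that builds a small stand-in corpus and tidies it. The names toy_texts, toy_corpus, and toy_tidy are hypothetical and are not part of the tutorial's data; the sketch assumes the tm and tidytext packages are installed.

```r
# Load the packages that provide the corpus class and the tidy() method
library(tm)
library(tidytext)

# Hypothetical two-document collection, used only for illustration
toy_texts <- c("First toy document.", "Second toy document.")

# Build a tm corpus from the character vector (one document per element)
toy_corpus <- VCorpus(VectorSource(toy_texts))

# Convert the corpus to a data frame with one row per document
toy_tidy <- tidy(toy_corpus)

# Check the shape of the result: number of rows and the column names
nrow(toy_tidy)
names(toy_tidy)
```

The same call should behave the same way on diario_creatives_corpus, which simply contains more documents.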

Now, let’s inspect the diario_creatives_tidy data frame. One way to explore the contents of a data frame object is simply to type the name of the object in the console (or run it from a script); a truncated version of the dataset will then print to the console:

# Prints the contents of "diario_creatives_tidy" to the console
diario_creatives_tidy

## # A tibble: 103 × 8
##    author datetimestamp       description heading id                                   language origin text 
##    <lgl>  <dttm>              <lgl>       <lgl>   <chr>                                <chr>    <lgl>  <chr>
##  1 NA     2022-08-14 22:44:10 NA          NA      1972.10.27-en-creative-Metemorphosi… en       NA     "MET…
##  2 NA     2022-08-14 22:44:10 NA          NA      1972.10.27-sp-creative-LaTragediaDe… en       NA     "LA …
##  3 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-en-creative-BoomerangChi… en       NA     "Boo…
##  4 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-en-creative-Pensamientor… en       NA     "RIN…
##  5 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-en-creative-Pensamientos… en       NA     "A N…
##  6 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-en-creative-Pensamientos… en       NA     "IN …
##  7 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-en-creative-Pensamientos… en       NA     "THE…
##  8 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-sp-creative-Pensamientor… en       NA     "VIR…
##  9 NA     2022-08-14 22:44:10 NA          NA      1972.11.03-sp-creative-Pensamientos… en       NA     "MI …
## 10 NA     2022-08-14 22:44:10 NA          NA      1972.12.01-en-creative-MadGenieTerr… en       NA     "Mad…
## # … with 93 more rows
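If the printed preview is too wide to read comfortably, a few helper functions give a more compact overview. The sketch below is one possible approach, shown on a hypothetical stand-in corpus (toy_corpus and toy_tidy are illustrative names); it assumes the tm, tidytext, and dplyr packages are installed.

```r
library(tm)
library(tidytext)
library(dplyr)

# Hypothetical stand-in for the tutorial's corpus, for illustration only
toy_corpus <- VCorpus(VectorSource(c("First toy document.",
                                     "Second toy document.")))
toy_tidy <- tidy(toy_corpus)

# Number of rows (documents) and columns
toy_dims <- dim(toy_tidy)
toy_dims

# Column names only
names(toy_tidy)

# One line per column, showing its type and first few values
glimpse(toy_tidy)
```

Replacing toy_tidy with diario_creatives_tidy should give the same kind of summary for the full dataset.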

A better option, however, is to open data frame objects in the RStudio Data Viewer, which lets you scroll through the entire dataset, with formatting applied to make the information easier to read. To view diario_creatives_tidy in the Data Viewer, simply pass it as an argument to the View() function, as below:

# Opens "diario_creatives_tidy" in the RStudio Data Viewer
View(diario_creatives_tidy)

Please note that data frames reproduced in this tutorial are sometimes truncated to save space; you will therefore usually see fewer rows in the tutorial material than you will see when inspecting the same data frames in the RStudio Data Viewer on your own computer.

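As noted above, each text document is associated with a unique row. Seven columns (author, datetimestamp, description, heading, id, language, and origin) hold metadata about the corresponding document; in this dataset most of them are empty (NA), while id stores the name of the source file. Note that the language column reads “en” in every row shown, including documents whose id contains “sp”, so it should not be relied on to identify a document’s language. Most importantly, a column named “text” contains the full text of the document in that row.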
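Because the full text sits in an ordinary character column, it can be handled like any other column. The following sketch, again using a hypothetical stand-in corpus (toy_corpus, toy_tidy, and toy_summary are illustrative names) and assuming the tm, tidytext, and dplyr packages are installed, keeps only the id and text columns and adds a character count for each document.

```r
library(tm)
library(tidytext)
library(dplyr)

# Hypothetical stand-in for the tutorial's corpus, for illustration only
toy_corpus <- VCorpus(VectorSource(c("A short text.",
                                     "A slightly longer toy text.")))
toy_tidy <- tidy(toy_corpus)

# Keep the document identifier and the text, then count the characters
# in each document's text
toy_summary <- toy_tidy %>%
  select(id, text) %>%
  mutate(n_characters = nchar(text))

toy_summary
```

Applied to diario_creatives_tidy, the same pipeline would give one row per file with its id, its text, and the length of that text.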

Once we have recast our text corpus into a concise and well-structured data frame, we can begin to extract information of interest from our text data.