3 Reading in a Text Corpus

The first step when working with text data in RStudio is to load the data you want to analyze into your R environment.

Sometimes you only need to analyze the text within a single document, but often you will want to analyze an entire collection of text documents, known as a corpus. In our case, we want to analyze a corpus composed of the text documents in the “creatives” section of the broader “El Diario” text collection.

To make this more concrete, we can take a look at these “creatives” files in a directory on our local computer:

Figure 3.1: Creatives text files stored in local directory

Our first job is to load all of the individual text files within this “creatives” directory into RStudio as a corpus. To do so, we will first read in the file names of the text documents that constitute the “creatives” corpus and store these names in a character vector. A “vector” in R is simply a sequence of elements of the same type; in a character vector, those elements are text strings.

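The code below reads in the file names from the “creatives” directory using list.files(pattern=".txt"). The list.files() function returns a character vector of the file names within a specified directory; if no directory is specified, as is the case here, it uses the current working directory, so the working directory must be set to the “creatives” folder before this line is run. The pattern argument filters those names. Note that pattern is interpreted as a regular expression rather than as a literal file extension: in ".txt" the dot matches any single character, so the function returns every file whose name contains “txt” preceded by at least one character. That is sufficient here because the directory holds only .txt files, but a stricter pattern such as "\\.txt$" would match only names that end in “.txt”. Finally, the code assigns this vector of file names to a new object named diario_files: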

# reads in the filenames for diario creatives as a character vector, and 
# assigns it to an object named "diario_files"
diario_files<-list.files(pattern=".txt")
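
To see how the pattern argument behaves, here is a minimal sketch that runs list.files() against a throwaway directory. The directory name (list_files_demo) and the three file names are hypothetical and are not part of the “El Diario” collection; they exist only to contrast a loose pattern with a stricter one.

```r
# create a temporary directory with three empty, hypothetical files
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("poem.txt", "notes_txt.csv", "essay.txt")))

# pattern is a regular expression: the unescaped "." matches any character,
# so "notes_txt.csv" is returned along with the two .txt files
loose_match <- list.files(path = demo_dir, pattern = ".txt")

# escaping the dot and anchoring with "$" keeps only names ending in ".txt"
strict_match <- list.files(path = demo_dir, pattern = "\\.txt$")

loose_match
strict_match
```

Passing the path argument, as done here, also avoids having to change the working directory before listing files.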

Object assignment, which we alluded to above, is a fundamental concept when working in a scripting environment; the ability to assign values to named objects is what allows us to manipulate and process our data programmatically. To better understand the mechanics of object assignment, let’s briefly step away from our text data and consider a simple example:

# assign value 5 to new object named x
x<-5

In the code above, we use R’s assignment operator, <-, to assign the value 5 to an object named x. Now that an object named x has been created and assigned the value 5, typing x in the console (or typing x in a script and running that line) will return the value that has been assigned to x, i.e. 5:

# prints value assigned to "x"
x
## [1] 5

More generally, assignment evaluates the code on the right side of the assignment operator (<-) and binds the resulting value to the name specified on the left side. Whenever we want to look at the contents of an object (i.e. the value produced by the code to the right of the assignment operator), we simply type the name of the object in the R console (or type the name in a script and run that line).
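
As a small illustration of this rule, the sketch below assigns the results of two expressions rather than literal values. The object names y and word_lengths are hypothetical and are used only for this example.

```r
# the right side is evaluated first; its value is then bound to the name "y"
y <- 2 + 3

# the value can be a whole vector: nchar() counts the characters in each string
word_lengths <- nchar(c("el", "diario"))

# typing the object names prints the stored values
y
word_lengths
```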

While the example above was very simple, we can assign the result of virtually any R code, and by extension the data structures generated by that code (such as datasets, vectors, and plot objects), to an R object. Above, we assigned the vector of text file names created by list.files(pattern=".txt") to an object named diario_files, and we can now confirm that this vector is associated with the diario_files object by typing the object name:

# prints contents of "diario_files" vector
diario_files
##   [1] "1972.10.27-en-creative-Metemorphosis.txt"                       
##   [2] "1972.10.27-sp-creative-LaTragediaDeRicardoFalcon.txt"           
##   [3] "1972.11.03-en-creative-BoomerangChicano.txt"                    
##   [4] "1972.11.03-en-creative-PensamientorsRinche.txt"                 
##   [5] "1972.11.03-en-creative-PensamientosANewLegendElRaton.txt"       
##   [6] "1972.11.03-en-creative-PensamientosInTheNameOfGoodWill.txt"     
##   [7] "1972.11.03-en-creative-PensamientosSaladBowloftheWorld.txt"     
##   [8] "1972.11.03-sp-creative-PensamientorsVirginia.txt"               
##   [9] "1972.11.03-sp-creative-PensamientosMiCarnal.txt"                
##  [10] "1972.12.01-en-creative-MadGenieTerrorizesNeighborhood.txt"      
##  [11] "1973.01.23-en-creative-VocalistFeatured.txt"                    
##  [12] "1973.03.06-en-creative-EscuelaTlatelolco.txt"                   
##  [13] "1973.03.06-en-creative-LaLloronaWeepsOn.txt"                    
##  [14] "1973.03.20-en-creatives-ElTortillaKid.txt"                      
##  [15] "1973.03.20-en-creatives-MateskisSourGrapes.txt"                 
##  [16] "1973.04.10-en-creative-ElTortillaKid.txt"                       
##  [17] "1973.04.10-en-creative-PhotoRichardGarcia.txt"                  
##  [18] "1973.04.24-en-creative-ElTortillaKid.txt"                       
##  [19] "1973.04.24-en-creative-TeamstersHiringHall.txt"                 
##  [20] "1973.05.05-en-creative-AndWhatShallIDoNow.txt"                  
##  [21] "1973.05.05-en-creative-BreakSolomonsChains.txt"                 
##  [22] "1973.05.05-en-creative-FromHoustonToAustin.txt"                 
##  [23] "1973.05.05-en-creative-TimeHasCome.txt"                         
##  [24] "1973.05.05-en-creative-Untitled.txt"                            
##  [25] "1973.05.05-en-creative-Untitled2.txt"                           
##  [26] "1973.05.05-en-creative-WeAreLaRaza.txt"                         
##  [27] "1973.05.05-sp-creative-Chicano.txt"                             
##  [28] "1973.05.05-sp-creative-Obrapafalcon.txt"                        
##  [29] "1973.06.15-en-creative-SeeTheFunnyU.S.GovernmentWork.txt"       
##  [30] "1973.06.15-sp-creative-DIANUBLADO.txt"                          
##  [31] "1973.07.13-en-creative-Escuela.txt"                             
##  [32] "1973.10.12-en-creative-ADVENTURESRABBLEROUSER.txt"              
##  [33] "1973.10.26-en-creative-LECHUGUERO.txt"                          
##  [34] "1973.10.26-sp-en-creative-PabloNerudapoet1904-1973.txt"         
##  [35] "1973.11.09-en-creative-FinancialAidGame.txt"                    
##  [36] "1973.12.13-sp-creative-DondeEstaraLaMovimiento.txt"             
##  [37] "1973.12.13-sp-creative-Manos.txt"                               
##  [38] "1973.12.13-sp-creative-VersosDeLasPosadas.txt"                  
##  [39] "1973.12.13-sp-en-creative-NoHasMuertoCompanero.txt"             
##  [40] "1973.12.13-sp-en-creative-SantosRodriguezKilledByDallasPigs.txt"
##  [41] "1974.01.25-sp-creative-ANuestroCarinosoAntecesor.txt"           
##  [42] "1974.02.22-en-creative-APoemByTigre.txt"                        
##  [43] "1974.03.08-sp-creative-Recetas.txt"                             
##  [44] "1974.03.22-sp-creative-BatoDelBarrio.txt"                       
##  [45] "1974.03.22-sp-creative-Recetas.txt"                             
##  [46] "1974.05.05-en-creative-AMadMan.txt"                             
##  [47] "1974.05.05-en-creative-Antiperros.txt"                          
##  [48] "1974.05.05-en-creative-CuandoLaCucarachaCamine.txt"             
##  [49] "1974.05.05-en-creative-Descanco.txt"                            
##  [50] "1974.05.05-en-creative-FaceYourFearsCarnal.txt"                 
##  [51] "1974.05.05-en-creative-Fighters.txt"                            
##  [52] "1974.05.05-en-creative-HangToughChicano.txt"                    
##  [53] "1974.05.05-en-creative-LasComadres.txt"                         
##  [54] "1974.05.05-en-creative-LetYourselfBeSidetrackedByYourGuiro.txt" 
##  [55] "1974.05.05-en-creative-LosPintos.txt"                           
##  [56] "1974.05.05-en-creative-MarioSuarez.txt"                         
##  [57] "1974.05.05-en-creative-OfferingOfManToGod.txt"                  
##  [58] "1974.05.05-en-creative-Pachucos.txt"                            
##  [59] "1974.05.05-en-creative-TheOrganizer.txt"                        
##  [60] "1974.05.05-sp-creative-ALaFlorDeNuestraHerencia.txt"            
##  [61] "1974.05.05-sp-creative-Antiperros.txt"                          
##  [62] "1974.05.05-sp-creative-DeColores.txt"                           
##  [63] "1974.05.05-sp-creative-Descanso.txt"                            
##  [64] "1974.05.05-sp-creative-LaUnitedFruitCo.txt"                     
##  [65] "1974.05.05-sp-creative-UnViajeAMexico.txt"                      
##  [66] "1974.05.05-sp-creative-YoSoyChicano.txt"                        
##  [67] "1974.05.05-sp-en-creative-Raza.txt"                             
##  [68] "1974.06.11-en-creative-Inheritance.txt"                         
##  [69] "1974.10.03-en-creative-Aztlan.txt"                              
##  [70] "1974.10.03-sp-creative-NewSong.txt"                             
##  [71] "1974.10.03-sp-en-creative-TeranDerramadorDeFronteras.txt"       
##  [72] "1975.01.30-en-creative-ChicanoGraphicByLorettaMalacara.txt"     
##  [73] "1975.01.30-en-creative-HarvardGraduates.txt"                    
##  [74] "1975.01.30-en-creative-PoetryOfAncientMexicanIndians.txt"       
##  [75] "1975.01.30-en-creative-TheEarthSoRichInImage.txt"               
##  [76] "1975.07.17-en-creative-ArturoSylvanoBobianPhotoCaption.txt"     
##  [77] "1975.07.17-en-creative-ExerciseInFutility-1.txt"                
##  [78] "1975.07.17-en-creative-ExerciseInFutility-2.txt"                
##  [79] "1975.07.17-en-creative-Life.txt"                                
##  [80] "1975.07.17-en-creative-WindsOfAztlan.txt"                       
##  [81] "1975.07.17-en-sp-creative-CamaYTortillas.txt"                   
##  [82] "1975.07.17-en-sp-creative-MachoReflections.txt"                 
##  [83] "1975.10.01-en-creative-LosParrasCantanDeRevolucion.txt"         
##  [84] "1975.10.01-en-creative-WhenDayIsBorn.txt"                       
##  [85] "1975.10.01-en-creative-WindOfThePeople.txt"                     
##  [86] "1975.10.01-sp-creative-CuandoAmaneceElDia.txt"                  
##  [87] "1975.10.01-sp-creative-TodaLaTierraEntera.txt"                  
##  [88] "1975.10.01-sp-creative-VientosDelPueblo.txt"                    
##  [89] "1976.02.01-en-creative-AsTimePasses.txt"                        
##  [90] "1976.02.01-en-creative-ToLaura.txt"                             
##  [91] "1977.07.01-sp-creative-LasMujeresDeCuba.txt"                    
##  [92] "1977.08.01-en-creative-Criticism.txt"                           
##  [93] "1977.08.01-en-creative-Philanthropyfundingdisregard.txt"        
##  [94] "1977.08.01-sp-creative-Poemas.txt"                              
##  [95] "1978.10.01-sp-creative-PoemasdeNicaragua.txt"                   
##  [96] "1978.11.01-en-creative-AWorldofDreams.txt"                      
##  [97] "1979.03.13-en-creative-AlternativeEducation.txt"                
##  [98] "1980.02.01-en-creative-AWarriorInChains.txt"                    
##  [99] "1983.04.01-en-creative-ChiefLeonardCrowDog.txt"                 
## [100] "1983.04.01-en-creative-IAmUMAS.txt"                             
## [101] "1983.04.01-en-creative-LostToOurLand.txt"                       
## [102] "1983.04.01-en-creative-TogetherWeAreLaRaza.txt"                 
## [103] "1983.04.01-sp-creative-MiCarnalEsATodoMadre.txt"

We will use this basic principle of object assignment throughout the lesson.
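
Because the file names are stored in an ordinary character vector, the usual vector tools apply to them. The sketch below uses a hypothetical stand-in vector, example_files, built from the first three names printed above, so that it runs without access to the “creatives” directory; the same calls would work on diario_files.

```r
# hypothetical stand-in for diario_files: the first three names shown above
example_files <- c(
  "1972.10.27-en-creative-Metemorphosis.txt",
  "1972.10.27-sp-creative-LaTragediaDeRicardoFalcon.txt",
  "1972.11.03-en-creative-BoomerangChicano.txt"
)

# number of elements in the vector
n_files <- length(example_files)

# square brackets extract elements by position
second_file <- example_files[2]

# grepl() returns TRUE/FALSE for each name; fixed = TRUE treats "-sp-" literally
spanish_files <- example_files[grepl("-sp-", example_files, fixed = TRUE)]

n_files
second_file
spanish_files
```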

Now, let’s use this vector of file names to create a corpus object that contains the text from all of these files. To do so, we’ll use the tm package’s Corpus() function. Below, the first argument to the Corpus() function, URISource(diario_files), specifies the file names of the text documents from which we want to create our corpus. The second argument, readerControl=list(reader=readPlain), specifies that we want the Corpus() function to use a plain text reader (“readPlain”) to read in the text within the documents listed in diario_files (if our files were in a different format, such as PDF, we would use a reader appropriate to that format). Finally, we assign the corpus to a new object named diario_creatives_corpus:

# loads the "tm" package (assumes the package is already installed)
library(tm)

# Uses the "Corpus" function from the "tm" package to create a new text corpus 
# based on the diario creatives text files; this corpus is assigned to a new 
# object named "diario_creatives_corpus"
diario_creatives_corpus <- Corpus(URISource(diario_files), readerControl = list(reader = readPlain))

If we type the name of our corpus object in the console (or run it from a script), R returns some basic metadata about the corpus we’ve just created, including the number of documents it contains:

# prints metadata about the corpus assigned to the "diario_creatives_corpus" object
diario_creatives_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 103
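
The summary above reports only how many documents the corpus holds. As a rough sketch of how one might look inside a corpus, the code below builds a two-document corpus from temporary files and extracts the text of the first document. The directory name (corpus_demo), the file names, the file contents, and the object names are all hypothetical stand-ins rather than part of the “El Diario” collection, and the sketch assumes the tm package is installed.

```r
# assumes the "tm" package is installed
library(tm)

# write two small, hypothetical text files to a temporary directory
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("We are La Raza", file.path(demo_dir, "demo-en.txt"))
writeLines("Yo soy Chicano", file.path(demo_dir, "demo-sp.txt"))

# full.names = TRUE returns full paths, so the files can be read
# regardless of the current working directory
demo_files <- list.files(path = demo_dir, pattern = "\\.txt$", full.names = TRUE)

# same call as in the lesson, applied to the stand-in files
demo_corpus <- Corpus(URISource(demo_files), readerControl = list(reader = readPlain))

# number of documents in the corpus
n_docs <- length(demo_corpus)

# double brackets extract a single document; content() returns its text
first_text <- content(demo_corpus[[1]])

n_docs
first_text
```

The same two steps, length() and content() on a document extracted with double brackets, should apply to diario_creatives_corpus as well.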