In this tutorial, we’ll explore important principles and tools for working with data in R. Our goal is not to provide a comprehensive introduction to the R language, but to offer a practical, example-driven discussion that serves as a starting point for your own exploration. Topics covered include:
At the start of the session, please download the workshop data.
In this preliminary section, we’ll cover basic information that will help you to get started with RStudio.
If you haven’t already, please go ahead and install both the R and RStudio applications. R and RStudio must be installed separately; you should install R first, and then RStudio. The R application is a bare-bones computing environment that supports statistical computing using the R programming language; RStudio is a visually appealing, feature-rich, and user-friendly interface that allows users to interact with this environment in an intuitive way. Once you have both applications installed, you don’t need to open up R and RStudio separately; you only need to open and interact with RStudio (which will run R in the background).
The following subsections provide instructions on installing R and RStudio for the macOS and Windows operating systems. These instructions are taken from the “Setup” section of the Data Carpentry Course entitled R for Social Scientists. The Data Carpentry page also contains installation instructions for the Linux operating system; if you’re a Linux user, please refer to that page for instructions.
The Appendix to Garrett Grolemund’s book Hands-On Programming with R also provides an excellent overview of the R and RStudio installation process.
If you’re a Windows user, run the .exe file that was just downloaded; if you’re a macOS user, download and open the .pkg file for the latest R version.

Now that we’ve installed and opened up RStudio, let’s familiarize ourselves with the RStudio interface. When we open up RStudio, we’ll see a window that looks something like this:
If your interface doesn’t look exactly like this, it shouldn’t be a problem; we would expect to see minor cosmetic differences in the appearance of the interface across operating systems and computers (based on how they’re configured). However, you should see four distinct windows within the larger RStudio interface:
To open a new script, click the File button on the RStudio menu bar, scroll down to New File, and then select R Script from the menu that opens up. To inspect a dataset, pass its name to the View() function, which will display the relevant data within a new tab in the Source window.

R is an open-source programming language for statistical computing that allows users to carry out a wide range of data analysis and visualization tasks (among other things). One of the big advantages of using R is that it has a very large user community among social scientists, statisticians, and digital humanists, who frequently publish R packages. One might think of packages as workbooks of sorts, which contain a well-integrated set of R functions, scripts, data, and documentation; these “workbooks” are designed to facilitate certain tasks or implement useful procedures. Packages are shared with the broader R user community, and anyone who needs to accomplish the tasks that a given package addresses can use it in the context of their own projects. The ability to use published packages considerably simplifies the work of applied data research in R; it means that we rarely have to write code entirely from scratch, and can build on the code that others have published in the form of packages. This allows applied researchers to focus on substantive problems, without getting bogged down in complicated programming tasks.
In this workshop, we will use the following packages to carry out relevant data analysis and visualization tasks (note that the tidyverse is not a single package, but rather an entire suite of packages used for common data science and analysis tasks):

+ tidyverse
+ wosr
+ psych
+ fastDummies
+ janitor
+ tidytext
+ wordcloud2
To install a package in R, we can use the install.packages()
function. A function is essentially a programming construct that takes a specified input (called an “argument”), runs this input through a set of procedures, and returns an output. In the code block below, the name of the package we want to install (here, the tidyverse suite) is enclosed within quotation marks and placed within parentheses after install.packages
Running the code below will effectively download the tidyverse suite of packages to our computer:
# Installs the tidyverse suite of packages
install.packages("tidyverse")
To run this code in your own R session, copy the line of code above (for example, via the Edit menu of your browser), paste it into a script in RStudio, and run it. Below, we can see how that line of code should look in your script, and how to run it:
Please note that you can follow along with the tutorial on your own computers by transferring all of the subsequent codeblocks into your script in just this way. Run each codeblock in your RStudio environment as you go, and you should be able to replicate the entire tutorial on your computer. You can copy-paste the workshop code if you wish, but we recommend actually retyping the code into your script, since this will help you to more effectively familiarize yourself with the process of writing code in R.
Note that the codeblocks in the tutorial usually have a comment, prefaced by a hash (“#”). When writing code in R (or any other programming language), it is good practice to preface one’s code with brief comments that describe what a block of code is doing. Writing these comments allows someone else (or your future self) to read and quickly understand the code more easily than otherwise might be the case. The hash before the comment effectively tells R that the subsequent text is a comment, and should be ignored when running a script. If a comment is not prefaced with a hash, R would not know to ignore it, and would throw an error message.
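As a quick illustration of how comments behave, consider the following sketch (the values here are arbitrary):

```r
# This entire line is a comment, so R ignores it when the script runs
2 + 2  # a comment can also follow code on the same line
```

Everything after the hash is skipped, so only the expression `2 + 2` is actually evaluated.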
Now, let’s install the other packages we mentioned above, using the same install.packages()
function:
install.packages("wosr")
install.packages("psych")
install.packages("fastDummies")
install.packages("janitor")
install.packages("tidytext")
install.packages("wordcloud2")
All of the packages we need are now installed!
However, while our packages are installed, they are not yet ready to use. Before we can use our packages, we must load them into our environment. We can think of the process of loading installed packages into a current R environment as analogous to opening up an application on your phone or computer after it has been installed (even after an application has been installed, you can’t use it until you open it!). To load (i.e. “open”) an R package, we pass the name of the package we want to load as an argument to the library()
function. For example, if we want to load our tidyverse packages into the current environment, we can type:
# Loads tidyverse packages into memory
library(tidyverse)
At this point, the tidyverse suite’s full functionality is available for us to use.
Now, let’s go ahead and load the remainder of the packages that we’ll need:
# loads remainder of required packages
library(wosr)
library(psych)
library(fastDummies)
library(janitor)
library(tidytext)
library(wordcloud2)
At this point, the packages are loaded and ready to go! One important thing to note regarding the installation and loading of packages is that we only have to install packages once; after a package is installed, there is no need to subsequently reinstall it. However, we must load the packages we need (using the library
function) every time we open a new R session. In other words, if we were to close RStudio at this point and open it up later, we would not need to install these packages again, but would need to load the packages again.
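One common pattern that follows from this install-once, load-every-session distinction is to check whether a package is already available before installing it. The sketch below uses the stats package (which is bundled with every R installation) purely as a safe illustration; in practice you would substitute a package name such as "janitor":

```r
# Installs the package only if it isn't already available, then loads it.
# "stats" ships with R, so this is just an illustration of the pattern.
if (!requireNamespace("stats", quietly = TRUE)) {
  install.packages("stats")
}
library(stats)
```

This pattern is convenient in scripts that will be run on multiple computers, since the install step is skipped whenever the package is already present.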
Before we can get a sense of how to work with data in R, it is important to familiarize ourselves with basic features of the R language’s syntax, and the basic data structures that are used to store and process data.
At its most basic, R can be used as a calculator. For instance:
# calculates 2+2
2+2
## [1] 4
# calculates 65 to the power of 4
65^4
## [1] 17850625
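Beyond addition and exponentiation, R supports the other standard arithmetic operators; a quick sketch (the numbers are arbitrary):

```r
# subtraction, multiplication, and division
10 - 3
6 * 7
20 / 8

# integer division and remainder (modulo)
20 %/% 8
20 %% 8
```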
While this is a useful starting point, the possibility of assigning values to objects (or variables) considerably increases the scope of the operations we are able to carry out. We turn to object assignment in the next sub-section.
The concept of object (or variable) assignment is a fundamental concept when working in a scripting environment; indeed, the ability to easily assign values to objects is what allows us to easily and intuitively manipulate and process our data in a programmatic setting. To better understand the mechanics of object assignment, consider the following:
# assign value 5 to new object named x
x<-5
In the code above, we use R’s assignment operator, <-
, to assign the value 5 to an object named x
. Now that an object named x
has been created and assigned the value 5, printing x
in our console (or printing x
in our script and running it) will return the value that has been assigned to the x
object, i.e. 5:
# prints value assigned to "x"
x
## [1] 5
More generally, the process of assignment effectively equates the output created by the code on the right side of the assignment operator (<-
) to an object with a name that is specified on the left side of the assignment operator. Whenever we want to look at the contents of an object (i.e. the output created by the code to the right side of the assignment operator), we simply print the name of the object in the R console (or print the name and run it within a script).
Let’s create another object, named y
, and assign it the value “12”:
# assign value 12 to new object named y
y<-12
As we noted above, we can print the value that was assigned to y
by printing its name:
# prints value assigned to "y"
y
## [1] 12
It’s possible to use existing objects to assign values to new ones. For example, we can assign the sum of x
and y
to a new object that we’ll name xy_sum
:
# creates a new object, named "xy_sum" whose value is the sum of "x" and "y"
xy_sum<-x+y
Now, let’s print the contents of xy_sum
# prints contents of "xy_sum"
xy_sum
## [1] 17
As expected, we see that the value assigned to xy_sum
is “17” (i.e. the sum of the values assigned to x
and y
).
It is possible to change the value assigned to a given object. For example, let’s say we want to change the value assigned to x
from “5” to “8”:
# assign value of "8" to object named "x"
x<-8
We can confirm that x
is now associated with the value “8”:
# prints updated value of "x"
x
## [1] 8
It’s worth noting that updating the value assigned to x
will not automatically update the value assigned to xy_sum
(which, recall, is the sum of x
and y
). If we print the value assigned to xy_sum
, we see that it is still “17”:
xy_sum
## [1] 17
In order for the value assigned to xy_sum
to be updated with the new value of x
, we must run the assignment operation again:
# assigns sum of "y" and newly updated value of "x" to "xy_sum" object
xy_sum<-x+y
Now, the value of xy_sum
should reflect the updated value of x
, which we can confirm by printing the value of xy_sum
:
# prints value of "xy_sum"
xy_sum
## [1] 20
Note that the value assigned to xy_sum
is now “20” (the sum of “8” and “12”), rather than “17” (the sum of “5” and “12”).
While the examples above were very simple, we can assign virtually any R code, and by extension, the data structure(s) generated by that code (such as datasets, vectors, graphs/plots etc.) to an R object. When naming your objects, try to be descriptive, so that the name of the object signifies something about its corresponding value.
Below, consider a simple example of an object, named our_location
that has been assigned a non-numeric value. Its value is a string, or textual information:
# assigns text string "Boulder, CO" to new object named "our_location"
our_location<-"Boulder, CO"
We can print the string that has been assigned to the our_location
object by typing the name of the object in our console, or running it from our script:
# prints value of "our_location" object
our_location
## [1] "Boulder, CO"
Note that generally speaking, you have a lot of flexibility in naming your R objects, but there are certain rules. For example, object names must start with a letter, and cannot contain any special symbols (they can only contain letters, numbers, underscores, and periods). Also, object names cannot contain spaces; if you’d like to use multiple words or phrases, connect the discrete elements with an underscore (_
), or use camel case (where each new word within the name begins with a capital letter).
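To illustrate these naming rules, here is a brief sketch (the object names are arbitrary examples, not part of the workshop data):

```r
# Valid names: words connected with underscores, periods, or camel case
max_temperature <- 31.5
maxTemperature <- 31.5
max.temperature <- 31.5

# Invalid names (these would throw errors if uncommented):
# 2nd_temperature <- 31.5    # cannot start with a number
# max temperature <- 31.5    # cannot contain a space
# max-temperature <- 31.5    # cannot contain special symbols like "-"
```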
It is also worth emphasizing that object names are case sensitive; in order to print the value assigned to an object, that object’s name must be printed exactly as it was created. For example, if we were to type our_Location
, we would get an error, since there is no our_Location
object (only an our_location
object):
our_Location
## Error in eval(expr, envir, enclos): object 'our_Location' not found
We now turn to a brief overview of some important data structures that help us to work with data in R. We will consider three data structures that are particularly useful: vectors, data frames, and lists. Note that this is not an exhaustive treatment of data structures in R; there are other structures, such as matrices and arrays, that are also important. However, we will limit our discussion to the data structures that are essential for getting started with data-based research in R.
In R, a vector is a sequence of values. A vector is created using the c()
function. For example, let’s make a vector with some arbitrary numeric values:
# makes vector with values 5, 7, 55.6, 32.5
c(5, 7, 55.6, 32.5)
## [1]  5.0  7.0 55.6 32.5
If we plan to work with this numeric vector again later in our workflow, it makes sense to assign it to an object, which we’ll call arbitrary_values
:
# assigns vector of arbitrary values to new object named "arbitrary_values"
arbitrary_values<-c(5,7,55.6,32.5)
Now, whenever we want to print the vector assigned to the arbitrary_values
object, we can simply print the name of the object:
# prints vector assigned to "arbitrary_values" object
arbitrary_values
## [1] 5.0 7.0 55.6 32.5
It is possible to carry out mathematical operations with numeric vectors; for instance, let’s say that we want to double the values in the arbitrary_values
vector; to do so, we can simply multiply arbitrary_values
by 2, which yields a new vector where each numeric element is twice the corresponding element in arbitrary_values
. Below, we’ll create a new vector that doubles the values in arbitrary_values
, assign it to a new object named arbitrary_values_2x
, and print the contents of arbitrary_values_2x
:
# creates a new vector that doubles the values in "arbitrary_values" and assigns it to a new object named "arbitrary_values_2x"
arbitrary_values_2x<-arbitrary_values*2
# prints contents of "arbitrary_values_2x"
arbitrary_values_2x
## [1] 10.0 14.0 111.2 65.0
Now, let’s say we want to add different vectors together; the code below creates a new vector by adding together arbitrary_values
and arbitrary_values_2x
:
# adds "arbitrary_values" vector and "arbitrary_values_2x" vector
arbitrary_values + arbitrary_values_2x
## [1] 15.0 21.0 166.8 97.5
Note that each element of the resulting vector printed above is the sum of the corresponding elements in arbitrary_values
and arbitrary_values_2x
.
Other arithmetic operations on numeric vectors are also possible, and you may wish to explore these on your own as an exercise.
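For instance, a quick sketch of element-wise subtraction and division (the vectors are re-created here so the snippet is self-contained):

```r
arbitrary_values <- c(5, 7, 55.6, 32.5)
arbitrary_values_2x <- arbitrary_values * 2

# subtracting element-wise recovers the original values
arbitrary_values_2x - arbitrary_values

# dividing element-wise yields a vector of 2s
arbitrary_values_2x / arbitrary_values
```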
In many cases, it is useful to extract a specific element from a vector. Each element in a given vector is assigned an index number, starting with 1; that is, the first element in a vector is assigned an index value of 1, the second element of a vector is assigned an index value of 2, and so on. We can use these index values to extract our desired vector elements. In particular, we can specify the desired index within square brackets after the name of the vector object of interest. For example, let’s say we want to extract the 3rd element of the vector in arbitrary_values
. We can do so with the following:
# extracts third element of "arbitrary_values" vector
arbitrary_values[3]
## [1] 55.6
It is also possible to extract a range of values from a vector using index values. For example, let’s say we want to extract a new vector comprised of the second, third, and fourth numeric elements in arbitrary_values
; we can do so with the following:
# extracts a new vector comprised of the 2nd, 3rd, and 4th elements of the existing "arbitrary_values" vector
arbitrary_values[2:4]
## [1] 7.0 55.6 32.5
Thus far, we have been working with numeric vectors, where each of the vector’s elements is a numeric value, but it is also possible to create vectors in which the elements are strings (i.e. text). Such vectors are known as character vectors. For example, the code below creates a character vector of the first four months of the year, and assigns it to a new object named months_four
:
# creates character vector whose elements are the first four months of the year, and assigns the vector to a new object named "months_four"
months_four<-c("January", "February", "March", "April")
Let’s now print the character vector assigned to months_four
:
# prints contents of "months_four"
months_four
## [1] "January" "February" "March" "April"
We can extract elements from character vectors using index values in the same way we did so for elements in a numeric vector. For example:
# extracts second element of "months_four" object (i.e. the "February" string)
months_four[2]
## [1] "February"
# subsets the second and third elements of "months_four" object (i.e. the "February" and "March" strings, which are extracted as a new character vector)
months_four[2:3]
## [1] "February" "March"
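R also supports negative indices, which drop elements rather than selecting them; a quick sketch (re-creating the vector so the snippet is self-contained):

```r
months_four <- c("January", "February", "March", "April")

# a negative index drops the corresponding element
months_four[-1]

# a vector of negative indices drops several elements at once
months_four[-c(2, 3)]
```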
The data frame structure is the workhorse of data analysis in R. A data frame resembles a table, of the sort you might generate in a spreadsheet application.
Often, the most important (and arduous) step in a data analysis workflow is to assemble disparate strands of data into a tractable data frame. What does it mean for a data frame to be “tractable”? One way to define this concept more precisely is to appeal to the concept of “tidy” data, which is often referenced in the data science world. Broadly speaking, a “tidy” data frame is a table in which:

+ each variable forms its own column;
+ each observation forms its own row;
+ each value occupies its own cell.
We will work extensively with data frames later in the workshop, but let’s generate a simple data frame from scratch, and assign it to a new object. We will generate a data frame containing “dummy” country-level data on basic economic, geographic, and demographic variables, and assign it to a new object named country_df
. The data frame is created through the use of the data.frame()
function, which has already been programmed into R. Column names and the corresponding column values are passed to the data.frame()
function in the manner below, and the function effectively binds these different columns together into a table:
# Creates a dummy country-level data frame
country_df<-data.frame(Country=c("Country A", "Country B", "Country C"),
GDP=c(8000, 30000, 23500),
Population=c(2000, 5400, 10000),
Continent=c("South America", "Europe", "North America"))
To observe the structure of the table, we can print it to the R console by simply printing the name of the object to which it has been assigned:
# prints "country_df" data frame to console
country_df
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
One nice feature of RStudio is that instead of simply printing our data frames into the console, we can view a nicely formatted version of our data frame by passing the name of the data frame object through the View()
function. For example, the code below will bring up the country_df
data frame as a new tab in RStudio:
# pulls up "country_df" data frame in RStudio data viewer
View(country_df)
Note the “tidy” features of this simple data frame: each column corresponds to a single variable, each row corresponds to a single observation (here, a country), and each cell contains a single value.
We will explore data frames, and the process of extracting information from them, at greater length in subsequent sections.
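As a small preview of what’s ahead, individual columns of a data frame can be extracted as vectors with the $ operator (a sketch, re-creating the data frame so the snippet is self-contained):

```r
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))

# extracts the "GDP" column as a numeric vector
country_df$GDP

# vector functions can then be applied to the extracted column
mean(country_df$GDP)
```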
In R, a list is a data structure that allows us to conveniently store a variety of different objects, of various types. For example, we can use a list to store vectors, data frames, and visualizations or graphs: basically any R object you can think of! It is also possible to store a list within a list.
Lists allow us to keep track of the various objects we create, and are therefore a useful data management tool. In addition, lists are very helpful to use when we want to perform iterative operations across multiple objects.
We can create lists in R using the list()
function; the arguments to this function are the objects that we want to include in the list. In the code below, we’ll create a list (assigned to an object named example_list
) that contains some of the objects we create earlier in the lesson: the arbitrary_values
vector, the months_four
vector, and the country_df
data frame.
# creates list whose elements are the "arbitrary_values" numeric vector, the "months_four" character vector, and the "country_df" data frame, and assigns it to a new object named "example_list"
example_list<-list(arbitrary_values, months_four, country_df)
Now that we’ve created our list object, let’s print out its contents:
# prints contents of "example_list"
example_list
## [[1]]
## [1] 5.0 7.0 55.6 32.5
##
## [[2]]
## [1] "January" "February" "March" "April"
##
## [[3]]
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
As you can see, our list contains each of the various specified objects within a single, unified structure. We can access specific elements within a list using the specific index number of the desired element, in much the same way we did for vectors. When extracting a single list element from a list, the convention is to enclose the index number of the desired list element in double square brackets. For example, if we want to extract the country-level data frame from example_list
, we can use the following:
# extracts country-level data frame from "example_list"; the country-level data frame is the third element in "example_list"
example_list[[3]]
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
If we want to subset a list, and extract more than one list element as a separate list, we can do so by creating a vector of the index values of the desired elements, and enclosing it in single brackets after the name of the list object. For example, if we wanted to generate a new list that contained only the first and third elements of example_list
(the numeric vector of arbitrary values and the data frame), we would use the following syntax:
example_list[c(1,3)]
## [[1]]
## [1] 5.0 7.0 55.6 32.5
##
## [[2]]
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
While list elements are not automatically named, we can name our list elements using the names()
function. The first step is to define a character vector of desired names. We can specify any names we’d like, but for the sake of illustration, let’s say we want to name the first list element “element1”, the second list element “element2”, and the third list element “element3”. Let’s create a vector of our desired names, and assign it to an object named name_vector
:
# creates a character vector of desired names for list elements, and assigns it to a new object named "name_vector"
name_vector<-c("element1", "element2", "element3")
Now, we’ll assign these names in name_vector
to the list elements in example_list
with the following:
# assigns names from "name_vector" to list elements in "example_list"
names(example_list)<-name_vector
Let’s now print the contents of example_list
:
# prints contents of "example_list"
example_list
## $element1
## [1] 5.0 7.0 55.6 32.5
##
## $element2
## [1] "January" "February" "March" "April"
##
## $element3
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
Note that the list elements now have names attached to them; the first character string in name_vector
is assigned as the name of the first element in example_list
, the second character string in name_vector
is assigned as the name of the second element in example_list
, and so on.
Practically speaking, we can now extract list elements using the assigned names. For example, if we want to extract the data frame from example_list
, we could do so by its assigned name (“element3”), as follows:
# Extracts the data frame from "example_list" by its assigned name
example_list[["element3"]]
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
Note that even after assigning names to list elements, you can still extract elements by their index value, if you would prefer to do so:
# Extracts the "element3" data frame from "example_list" by its index number
example_list[[3]]
## Country GDP Population Continent
## 1 Country A 8000 2000 South America
## 2 Country B 30000 5400 Europe
## 3 Country C 23500 10000 North America
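Named list elements can also be extracted with the $ operator, in the same way we extracted data frame columns; a brief sketch (using a small list re-created here so the snippet is self-contained):

```r
example_list <- list(element1 = c(5, 7, 55.6, 32.5),
                     element2 = c("January", "February", "March", "April"))

# "$" extracts a named element; equivalent to example_list[["element2"]]
example_list$element2
```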
It is useful to be able to quickly identify the data structure of a given object. Indeed, one way that things can go wrong when processing or analyzing data in R is that a given function expects a certain type of data structure as an input, but encounters something else, which will cause the function to throw an error or perform unexpectedly. In such circumstances, it is especially useful to be able to quickly double-check the data structure of a given object.
We can quickly ascertain this information by passing a given object as an argument to the class()
function, which will provide information about the object’s data structure.
For example, let’s say we want to confirm that example_list
is indeed a list:
# print the data structure of the "example_list" object
class(example_list)
## [1] "list"
Let’s take another example:
# print the data structure of the "months_four" object
class(months_four)
## [1] "character"
Note that we can read “character”, as “character vector”.
Similarly, we can read “numeric” as “numeric vector”:
# print the data structure of the "arbitrary_values" object
class(arbitrary_values)
## [1] "numeric"
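The same check works for data frames; passing a data frame to class() reports “data.frame” (a sketch, using a small stand-in for the country_df object from earlier):

```r
country_df <- data.frame(Country = c("Country A", "Country B"),
                         GDP = c(8000, 30000))

# a data frame reports the class "data.frame"
class(country_df)
## [1] "data.frame"
```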
As we mentioned earlier, a function is a programming construct that takes a set of inputs (also known as arguments), manipulates those inputs/arguments in a specific way (the body of the function), and returns an output that is the product of how those inputs are manipulated in the body of the function. It is much like a recipe, where the recipe’s ingredients are analogous to a function’s inputs, the instructions about how to combine and process those ingredients are analogous to the body of the function, and the end product of the recipe (for example, a cake) is analogous to the function’s output. R packages are essentially pre-written collections of functions organized around a given theme, and for a large number of data processing and analysis tasks, one can rely on these pre-written functions. In some cases, however, you may want to write your own functions from scratch.
Why might you want to write your own functions?
Writing your own functions can be challenging, but this section will provide you with some basic intuition for how the process works. To develop this intuition, we’ll use a very simple example.
Let’s say you have a large collection of temperature data, measured in Fahrenheit, and you want to convert these data to Celsius. Recall that the formula to convert from Fahrenheit to Celsius is the following, where “C” represents temperature in Celsius, and “F” represents temperature in Fahrenheit:
# Fahrenheit-to-Celsius conversion formula (mathematical notation, not runnable R code)
C = (F - 32) * (5/9)
Recall that at its most basic level, R is a calculator; if for example, we have a Fahrenheit measurement of 55 degrees, we can convert this to Celsius by plugging 55 into the conversion formula:
# Converts 55 degrees fahrenheit to Celsius
(55-32)*(5/9)
## [1] 12.77778
This is easy enough, but if we have a large amount of temperature data that requires processing, we wouldn’t want to carry out this calculation using arithmetic operators for each measurement in our data collection; that could quickly become unwieldy and tedious. Instead of repeatedly using arithmetic operators, we can wrap the Fahrenheit-to-Celsius conversion formula into a function:
# Generates function that takes fahrenheit value ("fahrenheit_input") and returns a value in Celsius, and assigns the function to an object named "fahrenheit_to_celsius_converter"
fahrenheit_to_celsius_converter<-function(fahrenheit_input){
celsius_output<-(fahrenheit_input-32)*(5/9)
return(celsius_output)
}
Let’s unpack the code above, which we used to create our function:
+ First, we use the keyword function; within the parentheses after function, we specify the function’s argument(s). Here, the function’s argument is an input named fahrenheit_input. The name of the argument(s) is arbitrary, and can be anything you like; ideally, its name should be informed by relevant context. Here, the argument/input to the function is a temperature value expressed in degrees Fahrenheit, so the name “fahrenheit_input” describes the nature of this input.
+ Next, we type a left curly brace {, and then define the body of the function (i.e. the recipe), which specifies how we want to transform this input. In particular, we take fahrenheit_input, subtract 32, and then multiply by 5/9, which transforms the input to the Celsius temperature scale. We tell R to assign this transformed value to a new object, named celsius_output.
+ In the line return(celsius_output), we specify the value we want the function to return. Here, we are saying that we want the function to return the value that was assigned to celsius_output. We then close the function by typing a right curly brace } below the return statement.
+ Finally, the whole function definition is assigned to a new object named fahrenheit_to_celsius_converter.

After creating our function by running that code, we can use the newly created fahrenheit_to_celsius_converter function to perform our Fahrenheit-to-Celsius transformations. Let’s say we have a Fahrenheit value of 68, and want to transform it to Celsius. Instead of the following calculation:
# Uses arithmetic operation to convert 68 degrees Fahrenheit to Celsius
(68-32)*(5/9)
## [1] 20
We can use our function:
# Uses "fahrenheit_to_celsius_converter" function to convert 68 degrees Fahrenheit to Celsius
fahrenheit_to_celsius_converter(fahrenheit_input=68)
## [1] 20
Above, we passed the argument “fahrenheit_input=68” to the fahrenheit_to_celsius_converter
function that we created; the function then took this value (68), plugged it into “fahrenheit_input” within the function and assigned the resulting value to “celsius_output”; it then returned the value of “celsius_output” (20) back to us.
Let’s try another one:
fahrenheit_to_celsius_converter(fahrenheit_input=22)
## [1] -5.555556
In short, we can specify any value for the “fahrenheit_input” argument; this value will be substituted for “fahrenheit_input” in the expression celsius_output<-(fahrenheit_input-32)*(5/9)
, after which the value of celsius_output
will be returned to us.
Even though the Fahrenheit to Celsius conversion formula is not particularly complex, it is clear that writing a function to perform this calculation is nonetheless more efficient than repeatedly performing the relevant arithmetic operation. As the operations you need to perform on your data become more complex, and the number of times you need to perform those operations increases, the benefits of wrapping those operations into a function become ever-more apparent.
Once we have a function written down, it is straightforward to apply that function to multiple inputs in an iterative fashion. For example, let’s say you have four different Fahrenheit temperature values that you would like to convert to Celsius, using the fahrenheit_to_celsius_converter function we developed above. One option would be to apply the fahrenheit_to_celsius_converter function to each of the Fahrenheit temperature inputs individually. For example, let’s say our Fahrenheit values, which we’d like to convert to Celsius, are the following: 45.6, 95.9, 67.8, 43. We could, of course, run these values through the function individually, as below:
fahrenheit_to_celsius_converter(fahrenheit_input=45.6)
## [1] 7.555556
fahrenheit_to_celsius_converter(fahrenheit_input=95.9)
## [1] 35.5
fahrenheit_to_celsius_converter(fahrenheit_input=67.8)
## [1] 19.88889
fahrenheit_to_celsius_converter(fahrenheit_input=43)
## [1] 6.111111
This is manageable with a collection of only four Fahrenheit values, but would quickly become tedious if you had a substantially larger set of Fahrenheit temperature values that required conversion. Instead of manually applying the function to each individual input value, we can put these values into a vector, and then iteratively apply the fahrenheit_to_celsius_converter
function to each of these vector elements.
Let’s first assign our Fahrenheit temperature values to a numeric vector object named fahrenheit_input_vector
:
# makes a vector out of Fahrenheit values we want to convert, and assigns it to a new object named "fahrenheit_input_vector"
fahrenheit_input_vector<-c(45.6, 95.9, 67.8, 43)
Our goal is to iteratively apply our function to all of these vector elements, and deposit the transformed results into a new vector. In programming languages, functions are typically applied to multiple inputs in an iterative fashion using a construct known as a for-loop, which some of you may already be familiar with. R users also frequently use specialized functions (instead of for-loops) to iterate over elements; this is often faster, or at the very least makes R scripts more readable. One family of these iterative functions is the “apply” family of functions. A more recent set of functions that facilitate iteration is part of the tidyverse, and is found within the purrr package. These functions are known as map()
functions, and we will use them here to iteratively apply our functions to multiple inputs.
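For comparison, here is a sketch of what this iteration might look like written as a base-R for-loop, using the vector and function defined above:
# Pre-allocates an output vector, then fills it one element at a time
celsius_outputs_loop<-numeric(length(fahrenheit_input_vector))
for (i in seq_along(fahrenheit_input_vector)) {
  celsius_outputs_loop[i]<-fahrenheit_to_celsius_converter(fahrenheit_input_vector[i])
}
The map() functions accomplish the same thing more concisely.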
Let’s see how we can use a map()
function to sequentially apply the fahrenheit_to_celsius_converter()
function we created to several different values for the “fahrenheit_input” argument, contained in fahrenheit_input_vector
. We’ll pass fahrenheit_input_vector
as the first argument to the map_dbl()
function, and fahrenheit_to_celsius_converter
(i.e. the function we want to apply iteratively to the elements in fahrenheit_input_vector
) as the second argument. The result of this operation will be a new “results vector”, containing the transformed temperature values for each input in the original vector of Fahrenheit values (fahrenheit_input_vector
). We’ll assign this result/output vector to a new object named celsius_outputs_vector
:
# Iteratively applies the "fahrenheit_to_celsius_converter" function to the Fahrenheit input values in "fahrenheit_input_vector" and assigns the resulting vector of converted temperature values to "celsius_outputs_vector"
celsius_outputs_vector<-map_dbl(fahrenheit_input_vector, fahrenheit_to_celsius_converter)
In short, the code above takes fahrenheit_input_vector (i.e. a vector with the numbers 45.6, 95.9, 67.8, 43), runs each of these numbers through the fahrenheit_to_celsius_converter() function, and sequentially deposits the transformed results into the newly created celsius_outputs_vector object, which contains the following elements:
# prints contents of "celsius_outputs_vector"
celsius_outputs_vector
## [1] 7.555556 35.500000 19.888889 6.111111
More explicitly, the code that reads celsius_outputs_vector<-map_dbl(fahrenheit_input_vector, fahrenheit_to_celsius_converter)
did the following:
Take the first element of fahrenheit_input_vector (45.6), pass it to the fahrenheit_to_celsius_converter() function, and place the output (7.555556) as the first element in a new vector of transformed values, named celsius_outputs_vector.
Take the second element of fahrenheit_input_vector (95.9), pass it to the fahrenheit_to_celsius_converter() function, and deposit the output (35.500000) as the second element in celsius_outputs_vector.
Take the third element of fahrenheit_input_vector (67.8), pass it to the fahrenheit_to_celsius_converter() function, and deposit the output (19.888889) as the third element in celsius_outputs_vector.
Take the fourth element of fahrenheit_input_vector (43), pass it to the fahrenheit_to_celsius_converter() function, and deposit the output (6.111111) as the fourth element in celsius_outputs_vector.
There are a variety of map()
functions from the purrr package, and the precise one you should use depends on the number of arguments taken by the function (in this example, there is of course only one argument, i.e. “fahrenheit_input”), and the desired class of the output (i.e. numeric vector, character vector, data frame, list, etc.). For example, let’s say we want to apply the fahrenheit_to_celsius_converter
function iteratively to the input values in fahrenheit_input_vector
, but that we want the output values to be stored as a list, rather than as a vector. Instead of using the map_dbl()
function, we can use the map()
function, which always returns outputs as a list. Below, we pass our input vector (fahrenheit_input_vector
), and the function we want to iteratively apply to the elements of the input vector (fahrenheit_to_celsius_converter
) to the map()
function. We’ll assign the output list to a new object named celsius_outputs_list
:
# iteratively applies the "fahrenheit_to_celsius_converter" function to the input values in "fahrenheit_input_vector", and assigns the list of celsius output values to a new object named "celsius_outputs_list"
celsius_outputs_list<-map(fahrenheit_input_vector, fahrenheit_to_celsius_converter)
Let’s print out the list of output values:
# prints contents of "celsius_outputs_list"
celsius_outputs_list
## [[1]]
## [1] 7.555556
##
## [[2]]
## [1] 35.5
##
## [[3]]
## [1] 19.88889
##
## [[4]]
## [1] 6.111111
We can confirm that celsius_outputs_list
is indeed a list using the class()
function that we introduced earlier:
# checks data structure of "celsius_outputs_list"
class(celsius_outputs_list)
## [1] "list"
Now, let’s say we want to organize our information in a data frame, where one column represents our Fahrenheit input values, and the other column represents the corresponding Celsius output values. To do so, we’ll first slightly modify our function to return a data frame:
# Creates function that takes an input value in degrees Fahrenheit (fahrenheit_input), converts this value to Celsius, and returns a data frame with the input Fahrenheit temperature value as one column, and the corresponding Celsius temperature value as another column; the function is assigned to a new object named "fahrenheit_to_celsius_converter_df"
fahrenheit_to_celsius_converter_df<-function(fahrenheit_input){
celsius_output<-(fahrenheit_input-32)*(5/9)
celsius_output_df<-data.frame(fahrenheit_input, celsius_output)
return(celsius_output_df)
}
Now, let’s test out this new function for a single “fahrenheit_input” value, to make sure it works as expected; we’ll test it out for a value of 63 degrees Fahrenheit:
# applies "fahrenheit_to_celsius_converter_df" function to input value of 63 degrees Fahrenheit
fahrenheit_to_celsius_converter_df(fahrenheit_input=63)
## fahrenheit_input celsius_output
## 1 63 17.22222
Having confirmed that the function works as expected, let’s now assemble a dataset using multiple Fahrenheit input values, where one column consists of these input values, and the second column consists of the corresponding Celsius outputs. We can do so using the map_dfr()
function from the purrr package, which is a cousin of the map()
and map_dbl()
functions we explored above. While the map()
function returns function outputs in a list, and the map_dbl()
function returns function outputs in a numeric vector, the map_dfr()
function is used to bind together multiple function outputs rowwise into a data frame. To make this more concrete, let’s consider the code below, which uses map_dfr()
to iteratively apply the fahrenheit_to_celsius_converter_df
function to the Fahrenheit values in fahrenheit_input_vector
, and assemble the resulting rows into a data frame that is assigned to a new object named celsius_outputs_df
:
# Iteratively applies the "fahrenheit_to_celsius_converter_df" function to input values in "fahrenheit_input_vector" to generate a data frame with column of input Fahrenheit values, and column of corresponding output Celsius values; assigns this data frame to a new object named "celsius_outputs_df"
celsius_outputs_df<-map_dfr(fahrenheit_input_vector, fahrenheit_to_celsius_converter_df)
Let’s now print the contents of celsius_outputs_df
:
# prints contents of "celsius_outputs_df"
celsius_outputs_df
## fahrenheit_input celsius_output
## 1 45.6 7.555556
## 2 95.9 35.500000
## 3 67.8 19.888889
## 4 43.0 6.111111
We now have a dataset with one column consisting of our Fahrenheit inputs (taken from fahrenheit_input_vector
), and a second column consisting of our Celsius outputs (derived by applying the fahrenheit_to_celsius_converter_df()
function to our vector of input values, fahrenheit_input_vector
).
We’ve just covered three different purrr functions: map()
(which returns a list), map_dbl()
(which returns a vector), and map_dfr()
(which returns a data frame). There are other map functions that return different types of objects; you can see a list of these by inspecting the documentation for the map()
function:
?map
The process of iteratively applying a function with more than one argument is beyond the scope of the workshop, but the same general principles are at work in those cases. If you’d like to explore the process of iteratively applying a function with two arguments, or more than two arguments, check out the documentation for the map2()
and pmap()
functions, respectively.
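To give a flavor of what that looks like, here is a brief sketch using map2_dbl(); note that the two-argument function below (and its “digits” argument) is invented purely for illustration:
# Hypothetical two-argument function: converts Fahrenheit to Celsius, rounded to a specified number of digits
fahrenheit_to_celsius_rounded<-function(fahrenheit_input, digits){
  round((fahrenheit_input-32)*(5/9), digits)
}
# map2_dbl() walks over the two input vectors in parallel, passing one element from each to the function on every iteration
map2_dbl(c(45.6, 95.9, 67.8), c(0, 1, 2), fahrenheit_to_celsius_rounded)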
Before we move into the next section, let’s consider one more example of how you can use your own custom-written functions in conjunction with the iteration functions in the purrr package to write scripts that can help you to automate tedious tasks. In particular, we’ll demonstrate the utility of the list data structure in helping you to carry out such automation tasks.
Let’s say, for example, that you have temperature values stored in Fahrenheit, for multiple countries, and want to quickly convert those country-level values to degrees Celsius. Suppose that these Fahrenheit values are stored in a series of vectors:
# creates sample country-level Fahrenheit data for Country A
countryA_fahrenheit<-c(55,67,91,23, 77, 98, 27)
# creates sample country-level Fahrenheit data for Country B
countryB_fahrenheit<-c(33,45,11,66, 44)
# creates sample country-level Fahrenheit data for Country C
countryC_fahrenheit<-c(60,55,12,109)
# creates sample country-level Fahrenheit data for Country D
countryD_fahrenheit<-c(76, 24, 77, 78)
Let’s say that we want to take all of these vectors, and iteratively pass them as arguments to the fahrenheit_to_celsius_converter_df
function, thereby creating four country-specific data frames that have the original Fahrenheit values in one column and the transformed Celsius values in the other column. The easiest way to do this is to first put our input vectors into a list, which we’ll assign to a new object named temperature_input_list
:
# Creates list of input vectors and assigns this list to a new object named "temperature_input_list"
temperature_input_list<-list(countryA_fahrenheit, countryB_fahrenheit, countryC_fahrenheit, countryD_fahrenheit)
Now, we’ll use the map()
function to iteratively pass the vectors in temperature_input_list
as arguments to the fahrenheit_to_celsius_converter_df
function, and deposit the resulting data frames into a list; we’ll assign this list of output data frames to a new object named processed_temperature_data_list
:
# Iteratively passes vectors in "temperature_input_list" as arguments to "fahrenheit_to_celsius_converter_df" and deposits the resulting data frames to a list, which is assigned to a new object named "processed_temperature_data_list"
processed_temperature_data_list<-map(temperature_input_list, fahrenheit_to_celsius_converter_df)
In effect, the code above takes the countryA_fahrenheit
vector, uses it as the argument to the fahrenheit_to_celsius_converter_df
function, and deposits the resulting data frame as the first element in the processed_temperature_data_list
list; it then takes the countryB_fahrenheit
vector, uses it as the argument to the fahrenheit_to_celsius_converter_df
function, and deposits the resulting data frame as the second element in the processed_temperature_data_list
list; and so on.
Let’s print the contents of processed_temperature_data_list
and confirm that our data frames have been created as expected:
# prints contents of "processed_temperature_data_list"
processed_temperature_data_list
## [[1]]
## fahrenheit_input celsius_output
## 1 55 12.777778
## 2 67 19.444444
## 3 91 32.777778
## 4 23 -5.000000
## 5 77 25.000000
## 6 98 36.666667
## 7 27 -2.777778
##
## [[2]]
## fahrenheit_input celsius_output
## 1 33 0.5555556
## 2 45 7.2222222
## 3 11 -11.6666667
## 4 66 18.8888889
## 5 44 6.6666667
##
## [[3]]
## fahrenheit_input celsius_output
## 1 60 15.55556
## 2 55 12.77778
## 3 12 -11.11111
## 4 109 42.77778
##
## [[4]]
## fahrenheit_input celsius_output
## 1 76 24.444444
## 2 24 -4.444444
## 3 77 25.000000
## 4 78 25.555556
As an exercise, try to extract a given dataset from processed_temperature_data_list
using the indexing method we discussed above. Additionally, see if you can assign names to the list elements in processed_temperature_data_list
.
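If you need a reminder of the relevant syntax, here is a quick demonstration on a small made-up list (so as not to give the exercise away):
# Creates a small demonstration list
demo_list<-list(c(1, 2), c(3, 4))
# Double square brackets extract a single list element
demo_list[[2]]
# The names() function assigns names to list elements, which can then be used to access them
names(demo_list)<-c("first", "second")
demo_list$second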
The material in Part 1 was not intended as a comprehensive introduction to the R programming language. Its goal, rather, was to present some ideas, concepts, and tools that can serve as a general foundation for working with data in R. Now that we have this basic foundation, we’ll turn in this section to a more applied exploration of some actual datasets. Our goal here is to introduce you to some useful functions that will allow you to explore and begin making sense of actual datasets in R.
Typically, the first step when working with research data in RStudio is to load your relevant data into memory. There are many ways to do this, and the precise approach will depend on where your data is stored and how it is structured. Below, we’ll cover the process of reading your data into RStudio under a couple of different scenarios.
Often (especially when a dataset is of tractable size), you will have the dataset you would like to analyze stored on a directory on your computer. In order to read in a dataset from a computer directory, you can use the read_csv()
function (provided the file is stored as a CSV; if the file type is different, then the import function will differ as well), and pass the dataset’s file path as an argument to the function. Typically, you will want to assign the dataset you read in to a new R object:
# Reads in Persson/Tabellini Data from local directory, and assigns it to new object named "pt"
pt<-read_csv("data/pt/persson_tabellini_workshop.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## country = col_character(),
## continent = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
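As an aside, the read_csv() function comes from the readr package, which is loaded as part of the tidyverse (as are the purrr map() functions we used earlier); if R reports that it cannot find one of these functions, load the package first:
# Loads the core tidyverse packages, which include readr and purrr
library(tidyverse)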
If you’d like to view the contents of the dataset, pass it to the RStudio data viewer:
# views "pt" data frame in RStudio data viewer
View(pt)
pt %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
deferRender = TRUE,
scrollY = 350,
scrollX = 350,
dom = "t",
scroller = TRUE,
fixedColumns = list(leftColumns = 3)
))
We’ll be working more with this dataset below; if you’d like to learn more about the dataset’s various variables, see the codebook. The data was originally collected by the political economists Torsten Persson and Guido Tabellini for their book The Economic Effects of Constitutions.
Sometimes, your data is spread out over multiple files. For example, you may have multiple CSV files with data stored on disk, which you want to read into R in one go, instead of loading multiple files individually.
To do so, we can use the list data structure to hold all of the desired files, and use the map()
function we learned about above to iteratively read these files into our R environment.
The first step is to use the list.files()
function to create a character vector of the file names we want to read in; if all of the files you want to read in are already in your working directory, you don’t need to supply any arguments to the list.files()
function. If the files are stored in another location, you can specify the relevant file path as an argument to list.files()
. In the case below, the individual files we want to read in are thirteen Web of Science datasets (which were generated using the “climate” + “art” search parameters); these files are in the “wos” directory within the “data” subdirectory of the working directory:
# stores the relevant file names, which are located in the data/wos subdirectory, in a character vector named "wos_files"
wos_files<-list.files("data/wos")
Let’s now print the contents of “wos_files” and observe the file names:
# prints contents of "wos_files"
wos_files
## [1] "ClimateAndArt_01.csv" "ClimateAndArt_02.csv" "ClimateAndArt_03.csv"
## [4] "ClimateAndArt_04.csv" "ClimateAndArt_05.csv" "ClimateAndArt_06.csv"
## [7] "ClimateAndArt_07.csv" "ClimateAndArt_08.csv" "ClimateAndArt_09.csv"
## [10] "ClimateAndArt_10.csv" "ClimateAndArt_11.csv" "ClimateAndArt_12.csv"
## [13] "ClimateAndArt_13.csv"
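As an aside, if you pass the full.names=TRUE argument to list.files(), the function returns complete relative paths (e.g. "data/wos/ClimateAndArt_01.csv") rather than bare file names, which removes the need to change the working directory before reading in the files:
# Stores full relative file paths, rather than bare file names, in "wos_file_paths"
wos_file_paths<-list.files("data/wos", full.names=TRUE)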
Now that we have our file names, we can iteratively pass them through the read_csv()
function, and deposit the files as data frames in a list:
# Sets the working directory to "data/wos", then iteratively reads in the individual WOS files and assigns the resulting list of data frames to an object named "wos_file_list"
setwd("data/wos")
wos_file_list<-map(wos_files, read_csv)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## `Book Series Subtitle` = col_logical(),
## `Cited References` = col_logical(),
## `Cited Reference Count` = col_double(),
## `Times Cited, WoS Core` = col_double(),
## `Times Cited, All Databases` = col_double(),
## `180 Day Usage Count` = col_double(),
## `Since 2013 Usage Count` = col_double(),
## `Publication Year` = col_double(),
## `Meeting Abstract` = col_logical(),
## `Number of Pages` = col_double(),
## `Pubmed Id` = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
##
## (similar column specification output is printed for each of the remaining twelve files)
The code above takes the first file name in wos_files
and then passes it to the read_csv()
function and deposits the file as the first data frame in a new list; it then takes the second file name in wos_files
and then passes it to the read_csv()
function and deposits that file as the second data frame in the list; and so on. The list containing all of the files is assigned to a new object named wos_file_list
; we’ll print the contents below:
# prints contents of "wos_file_list"
wos_file_list
## [[1]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Miles, M <NA> <NA> <NA>
## 2 J Dal Farra,… <NA> <NA> <NA>
## 3 J Chen, MH <NA> <NA> <NA>
## 4 J Guy, S; He… <NA> <NA> <NA>
## 5 J Baztan, J;… <NA> <NA> <NA>
## 6 J Burke, M; … <NA> <NA> <NA>
## 7 J Rodder, S <NA> <NA> <NA>
## 8 J Bentz, J; … <NA> <NA> <NA>
## 9 J Ture, C <NA> <NA> <NA>
## 10 J Kim, S <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[2]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J De Ollas, … <NA> <NA> <NA>
## 2 J Mansfield,… <NA> <NA> <NA>
## 3 J Forrest, M <NA> <NA> <NA>
## 4 J Chiodo, G;… <NA> <NA> <NA>
## 5 J Williams, … <NA> <NA> <NA>
## 6 J Mascaro, G… <NA> <NA> <NA>
## 7 C Anel, JA <NA> Proenca, A; P… <NA>
## 8 J Crosato, A… <NA> <NA> <NA>
## 9 J Barnett, T… <NA> <NA> <NA>
## 10 J Buckland, … <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[3]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Goswami, B… <NA> <NA> <NA>
## 2 J Berman, AL… <NA> <NA> <NA>
## 3 J Kulmala, M… <NA> <NA> <NA>
## 4 J Touchan, R… <NA> <NA> <NA>
## 5 J Zhang, L; … <NA> <NA> <NA>
## 6 J Fallah, A;… <NA> <NA> <NA>
## 7 J Wagner, TJ… <NA> <NA> <NA>
## 8 C Bloschl, G… <NA> Demuth, S; Gu… <NA>
## 9 S Hoang, T; … <NA> Daniere, AG; … <NA>
## 10 J Surovikina… <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[4]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Arce-Nazar… <NA> <NA> <NA>
## 2 J Beevers, L… <NA> <NA> <NA>
## 3 J Staehelin,… <NA> <NA> <NA>
## 4 J Zhang, YH;… <NA> <NA> <NA>
## 5 C Sudantha, … <NA> <NA> IEEE
## 6 J Franzen, C <NA> <NA> <NA>
## 7 J van Oldenb… <NA> <NA> <NA>
## 8 J Tankersley… <NA> <NA> <NA>
## 9 J Pistocchi,… <NA> <NA> <NA>
## 10 C Biggs, HR;… <NA> <NA> ACM
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[5]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Esper, J; … <NA> <NA> <NA>
## 2 J Hallgren, … <NA> <NA> <NA>
## 3 J Caron, LP;… <NA> <NA> <NA>
## 4 J Sharif, K;… <NA> <NA> <NA>
## 5 J Santer, BD… <NA> <NA> <NA>
## 6 J Sarospatak… <NA> <NA> <NA>
## 7 B Pente, P <NA> Jagodzinski, J <NA>
## 8 J Marsh, C <NA> <NA> <NA>
## 9 J Wang, XY; … <NA> <NA> <NA>
## 10 J Servera-Vi… <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[6]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Kim, JK <NA> <NA> <NA>
## 2 J Singh, SJ;… <NA> <NA> <NA>
## 3 J D'Andrea, … <NA> <NA> <NA>
## 4 J Grant, KM;… <NA> <NA> <NA>
## 5 J Guan, B; W… <NA> <NA> <NA>
## 6 J Hummel, M;… <NA> <NA> <NA>
## 7 J Rodney, L <NA> <NA> <NA>
## 8 J Jouili, JS <NA> <NA> <NA>
## 9 J Rodriguez-… <NA> <NA> <NA>
## 10 J Joyette, A… <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[7]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 C Farnoosh, … <NA> <NA> Assoc Advanceme…
## 2 J Nahhas, TM… <NA> <NA> <NA>
## 3 J Yoon, J; B… <NA> <NA> <NA>
## 4 J Masnavi, M… <NA> <NA> <NA>
## 5 J Hu, ZZ; Ku… <NA> <NA> <NA>
## 6 J Gonzalez, … <NA> <NA> <NA>
## 7 C Hungilo, G… <NA> <NA> ACM
## 8 J Fernandes,… <NA> <NA> <NA>
## 9 J Ingwersen,… <NA> <NA> <NA>
## 10 J Zilitinkev… <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[8]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Chalal, ML… <NA> <NA> <NA>
## 2 C El-Araby, … <NA> Brebner, G; C… <NA>
## 3 J Silva, B; … <NA> <NA> <NA>
## 4 J Sorbet, J;… <NA> <NA> <NA>
## 5 J Smart, PDS… <NA> <NA> <NA>
## 6 J Colding, J… <NA> <NA> <NA>
## 7 J Antonini, … <NA> <NA> <NA>
## 8 J Woodworth,… <NA> <NA> <NA>
## 9 J Roh, JS; K… <NA> <NA> <NA>
## 10 J Loch, CH; … <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <lgl>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[9]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 C Rajer, A; … <NA> <NA> CCI; CCI
## 2 J Hashioka, … <NA> <NA> <NA>
## 3 C Halova, P;… <NA> Smutka, L; Re… <NA>
## 4 J MARTI, C; … <NA> <NA> <NA>
## 5 C Taveres-Ca… <NA> Geving, S; Ti… <NA>
## 6 J Haida, M; … <NA> <NA> <NA>
## 7 J Schmid, R <NA> <NA> <NA>
## 8 J Manzini, E… <NA> <NA> <NA>
## 9 J Chang, WL;… <NA> <NA> <NA>
## 10 J Bruhwiler,… <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[10]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Paietta, E <NA> <NA> <NA>
## 2 J Chen, N; P… <NA> <NA> <NA>
## 3 J Bruckert, … <NA> <NA> <NA>
## 4 J Cha, MS; P… <NA> <NA> <NA>
## 5 J Brewin, RJ… <NA> <NA> <NA>
## 6 J Lamane, H;… <NA> <NA> <NA>
## 7 J Rodriguez,… <NA> <NA> <NA>
## 8 J Cortesi, U… <NA> <NA> <NA>
## 9 J Buerki, S;… <NA> <NA> <NA>
## 10 J Nastula, J… <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[11]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Zhao, XY; … <NA> <NA> <NA>
## 2 J Pacheco-To… <NA> <NA> <NA>
## 3 J Nasiyev, B… <NA> <NA> <NA>
## 4 J Helmig, D;… <NA> <NA> <NA>
## 5 C Voigt, C; … <NA> Calabro, M; D… <NA>
## 6 J Hartman, S… <NA> <NA> <NA>
## 7 J Konstantin… <NA> <NA> <NA>
## 8 J Yu, TT; Le… <NA> <NA> <NA>
## 9 J Karam, N; … <NA> <NA> <NA>
## 10 J Di Prima, … <NA> <NA> <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[12]]
## # A tibble: 1,000 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Shah, ARY;… <NA> <NA> <NA>
## 2 J Lindskog, … <NA> <NA> <NA>
## 3 J Morciano, … <NA> <NA> <NA>
## 4 J Lebo, ZJ; … <NA> <NA> <NA>
## 5 J Wilczynska… <NA> <NA> <NA>
## 6 B Ballard, S <NA> Marsching, JD… <NA>
## 7 J Pinkovetsk… <NA> <NA> <NA>
## 8 J Xiong, Y; … <NA> <NA> <NA>
## 9 J Bui, DT; H… <NA> <NA> <NA>
## 10 C Alloza, JA… <NA> Kepner, WG; R… <NA>
## # … with 990 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
##
## [[13]]
## # A tibble: 686 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Marchese, … <NA> <NA> <NA>
## 2 J Blood, A; … <NA> <NA> <NA>
## 3 J Bonannella… <NA> <NA> <NA>
## 4 J Lazarev, S… <NA> <NA> <NA>
## 5 J Gliss, J; … <NA> <NA> <NA>
## 6 J Mueller, C… <NA> <NA> <NA>
## 7 B Cohen, M; … <NA> Cohen, M; Qui… <NA>
## 8 J Nyamekye, … <NA> <NA> <NA>
## 9 J Myriokefal… <NA> <NA> <NA>
## 10 J Viana, M; … <NA> <NA> <NA>
## # … with 676 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
We will work with the separate data frames in wos_file_list
later in the tutorial.
In some cases, however, if the data frames read into a list all share the same structure (i.e., the same variable names and number of columns), it is useful to append them together into a single data frame. This operation is easy to carry out with the bind_rows()
function, which takes as its argument the name of the list containing the data frames whose rows we want to append. Below, we’ll assign the consolidated data frame created through this appending operation to a new object named ws_df_appended
:
# appends data frames in "wos_file_list" into one data frame and assigns it to a new object named "ws_df_appended"
ws_df_appended<-bind_rows(wos_file_list)
You can now view the appended data frame in the R Studio data viewer by passing ws_df_appended
to the View()
function. In addition to working with the files separately, we will also use the consolidated ws_df_appended
data frame to introduce some basic text analysis functions in R.
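If you want to keep track of which file each row came from after appending, bind_rows() also accepts an optional .id argument. Here is a minimal, self-contained sketch using a toy list of two data frames (the names df_list and source are illustrative, not part of the workshop data):

```r
library(dplyr)

# a toy list of two data frames with identical structure
df_list <- list(a = data.frame(x = 1:2),
                b = data.frame(x = 3:4))

# .id adds a column recording which list element each row came from
appended <- bind_rows(df_list, .id = "source")
appended
##   source x
## 1      a 1
## 2      a 2
## 3      b 3
## 4      b 4
```

The same idea applied to a list like wos_file_list would record which of the original files each publication record came from.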
At this point, we have all of the data we need for subsequent sections loaded into our R environment. Before proceeding, however, it’s worth noting some additional methods of reading data into R.
If you store your data in the cloud using a standard storage service such as Dropbox, you can simply extract the URL to the data from your service provider and pass it as an argument to a data-import function in R such as read_csv()
:
# Reads in PT dataset from Dropbox and assigns it to a new object named "pt_cloud"
pt_cloud<-read_csv("https://www.dropbox.com/s/iczslf52s8bzku2/persson_tabellini_workshop.csv?dl=1")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## country = col_character(),
## continent = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
The code above reads in the Persson-Tabellini dataset that is stored on a Dropbox account straight into R using its URL as an argument, and assigns it to a new object named pt_cloud
. If you view pt_cloud
in the data viewer, you’ll notice that the dataset is exactly the same as the one assigned to the pt
object.
Many R packages allow you to import data directly into your R environment via an Application Programming Interface (API), which is a structured mechanism that organizations use to provide programmatic access to their data. The R package wosr, for example, allows users to leverage the Web of Science API to import data directly from the Web of Science. Even after importing Web of Science data into R via wosr, it takes a bit of effort to get the data into usable shape, so we won’t make use of data imported from wosr later in the tutorial. Nevertheless, it can be instructive to see how this process works. For more details on the wosr functions used below, inspect the relevant documentation by typing ? before the function name in your console (or from a script).
The first step is to set your authorization credentials; because CU has a Web of Science subscription (which you can automatically access from campus or through a campus VPN), we can simply set the user name and password to NULL in the session identifier, which is assigned to a new object named sid
:
# creates WOS session identifier and assigns to object named "sid"
sid<-auth(username=NULL, password=NULL)
Now, we’ll create a string that represents our Web of Science query; this text string is assigned to a new object name wos_query
:
# Creates string for WOS query
wos_query<-'TS = ("climate" & "art") AND PY = (2015-2016)'
Now, we’ll pass wos_query
and sid
as arguments to the pull_wos()
function, which is the wosr function that allows for query-based data extraction from the Web of Science API. We’ll assign the data to a new object named wos_api_climate_art
:
# Pulls data from WOS API based on wos_query and assigns to object named "wos_api_climate_art"
wos_api_climate_art<-pull_wos(wos_query, sid=sid)
## Downloading data
##
## Parsing XML
Now that the data has been downloaded, we’ll check the class of wos_api_climate_art
:
# checks class of "wos_api_climate_art"
class(wos_api_climate_art)
## [1] "list" "wos_data"
Note that the data is stored as a list, and we can print its contents below:
# prints contents of "wos_api_climate_art"
wos_api_climate_art
## List of 9
## $ publication :'data.frame': 843 obs. of 7 variables:
## ..$ ut : chr [1:843] "WOS:000344457900003" ...
## ..$ title : chr [1:843] "Climate change, adaptation and Eco-Art in Singa"..
## ..$ journal : chr [1:843] "Journal of Environmental Planning and Managemen"..
## ..$ date : Date[1:843], format: "2015-01-02" ...
## ..$ doi : chr [1:843] "10.1080/09640568.2013.839446" ...
## ..$ tot_cites: int [1:843] 8 22 ...
## ..$ abstract : chr [1:843] "Eco-Art has recently emerged as a potential mea"..
## $ author :'data.frame': 4275 obs. of 7 variables:
## ..$ ut : chr [1:4275] "WOS:000344457900003" ...
## ..$ author_no : int [1:4275] 1 2 ...
## ..$ display_name: chr [1:4275] "Guy, Simon" ...
## ..$ first_name : chr [1:4275] "Simon" ...
## ..$ last_name : chr [1:4275] "Guy" ...
## ..$ email : chr [1:4275] NA ...
## ..$ daisng_id : int [1:4275] 29359223 9147145 ...
## $ address :'data.frame': 2957 obs. of 7 variables:
## ..$ ut : chr [1:2957] "WOS:000344457900003" ...
## ..$ addr_no : int [1:2957] 1 2 ...
## ..$ org_pref: chr [1:2957] "University of Manchester" ...
## ..$ org : chr [1:2957] "Univ Manchester" ...
## ..$ city : chr [1:2957] "Manchester" ...
## ..$ state : chr [1:2957] "Lancs" ...
## ..$ country : chr [1:2957] "England" ...
## $ author_address:'data.frame': 5076 obs. of 3 variables:
## ..$ ut : chr [1:5076] "WOS:000344457900003" ...
## ..$ author_no: int [1:5076] 1 2 ...
## ..$ addr_no : int [1:5076] 1 2 ...
## $ jsc :'data.frame': 1500 obs. of 2 variables:
## ..$ ut : chr [1:1500] "WOS:000344457900003" ...
## ..$ jsc: chr [1:1500] "Development Studies" ...
## $ keyword :'data.frame': 3468 obs. of 2 variables:
## ..$ ut : chr [1:3468] "WOS:000344457900003" ...
## ..$ keyword: chr [1:3468] "eco-art" ...
## $ keywords_plus :'data.frame': 5268 obs. of 2 variables:
## ..$ ut : chr [1:5268] "WOS:000344457900003" ...
## ..$ keywords_plus: chr [1:5268] "sustainable architecture" ...
## $ grant :'data.frame': 1914 obs. of 3 variables:
## ..$ ut : chr [1:1914] "WOS:000389909500018" ...
## ..$ grant_agency: chr [1:1914] "Russian Science Foundation" ...
## ..$ grant_id : chr [1:1914] "14-17-00647" ...
## $ doc_type :'data.frame': 893 obs. of 2 variables:
## ..$ ut : chr [1:893] "WOS:000344457900003" ...
## ..$ doc_type: chr [1:893] "Article" ...
If a dataset of interest is hosted on the web, it is usually straightforward to use the relevant URL to load the data directly into your R environment. For instance, consider this dataset from CU Scholar. If we want to load the dataset into R directly, we can identify the relevant download link by hovering over the blue “Download the file” button, and pass it (with “.csv” appended) as an argument to the read_csv()
function:
# Reads in published dataset from CU Scholar and assigns it to a new object named "green_space_CUScholar"
green_space_CUScholar<-read_csv("https://scholar.colorado.edu/downloads/76537257b.csv")
## Warning: Missing column names filled in: 'X1' [1]
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## time_period = col_character(),
## age_group = col_character(),
## sex = col_character(),
## ethnicity = col_character(),
## race2 = col_character(),
## cohabit = col_character(),
## insured = col_character(),
## looking_for_work = col_character(),
## BAplus = col_character(),
## income = col_character(),
## Q33_1 = col_character(),
## Q33_2 = col_character(),
## Q33_3 = col_character(),
## Q33_4 = col_character(),
## Q33_5 = col_character(),
## diagnosed = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
Above, we assigned the data from that CU Scholar landing page to a new object named green_space_CUScholar
, which you can now view in the R Studio data viewer with View(green_space_CUScholar)
.
In this section, we will survey some useful functions (primarily from the dplyr package) for wrangling and processing numeric data. We will also introduce ggplot2, the most popular visualization package in R. We will demonstrate these functions using the Persson-Tabellini dataset of political-economic data (pt
).
We’ll start by making a copy of the pt
object, by assigning it to a new object named pt_copy
. We’ll use pt_copy
when exploring the dataset, which ensures that we do not make inadvertent changes to our original pt
data frame, and can always refer back to it when needed. Keeping a “clean” version of the data, and carrying out analysis tasks on a copy of this dataset, is good data management practice.
# Make a copy of the dataset so we don't alter the original dataset; then, view
# the copied dataset
pt_copy<-pt
We can go ahead and print the contents of pt_copy
, which, at this point, is identical to pt
:
# Print contents of "pt_copy"
pt_copy
## # A tibble: 85 × 75
## oecd country pind pindo ctrycd col_uk t_indep col_uka col_espa col_otha
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Argentina 0 0 213 0 183 0 0.268 0
## 2 1 Australia 1 1 193 1 98 0.608 0 0
## 3 1 Austria 0 0 122 0 250 0 0 0
## 4 0 Bahamas 1 1 313 1 26 0.896 0 0
## 5 0 Bangladesh 1 1 513 0 28 0 0 0.888
## 6 0 Barbados 1 1 316 1 33 0.868 0 0
## 7 0 Belarus 1 1 913 0 8 0 0 0.968
## 8 1 Belgium 0 0 124 0 169 0 0 0.324
## 9 0 Belize 1 1 339 1 18 0.928 0 0
## 10 0 Bolivia 0.116 0.116 218 0 174 0 0.304 0
## # … with 75 more rows, and 65 more variables: legor_uk <dbl>, legor_so <dbl>,
## # legor_fr <dbl>, legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>,
## # catho80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>,
## # logyl <dbl>, loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>,
## # eurfrac <dbl>, frankrom <dbl>, latitude <dbl>, gastil <dbl>, cgexp <dbl>,
## # cgrev <dbl>, ssw <dbl>, rgdph <dbl>, trade <dbl>, prop1564 <dbl>,
## # prop65 <dbl>, federal <dbl>, eduger <dbl>, spropn <dbl>, yearele <dbl>, …
We can also view it in the data viewer:
# Views "pt_copy" in data viewer
View(pt_copy)
Once you have a dataset loaded into R, one of the first things you’ll want to do is likely to generate a table of summary statistics. A quick way to do that is to use the describe()
function from the psych package. Below, we’ll generate summary statistics for the pt_copy
dataset by passing it to the describe()
function, and assign the table of summary statistics to a new object named pt_copy_summarystats1
. We’ll then view it in the data viewer:
# Generate summary statistics for "pt_copy" and assign to new object named "pt_copy_summarystats1"
pt_copy_summarystats1<-describe(pt_copy)
# View contents of "pt_copy_summarystats1" in data viewer
View(pt_copy_summarystats1)
While a simple table of summary statistics is a useful starting point, it is often helpful to generate group-level summary statistics, where statistics are presented separately for different subgroups in the dataset. One way to generate group summary statistics is to use the describeBy()
function (also from the psych package), where the first argument is the data frame for which you would like to generate group-level summary statistics, and the second argument is the column that contains the relevant groups. Below, we generate summary statistics for pt_copy
broken out by the different continents in the “continent” column. The expression pt_copy$continent
indicates that the groups with respect to which we want to calculate the summary statistics are in the “continent” column of the pt_copy
data frame. More generally, we can explicitly refer to columns in an R data frame using this dollar-sign notation, where the expression before the dollar sign refers to the data frame object, and the expression after it refers to the name of the column.
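As a quick, self-contained illustration of the dollar-sign notation, here is a sketch using R’s built-in mtcars data frame rather than the workshop data:

```r
# extract the "mpg" column of the built-in mtcars data frame as a vector
mpg_values <- mtcars$mpg

# the extracted column is an ordinary numeric vector
class(mpg_values)
## [1] "numeric"

# which we can pass to any function that accepts a vector
mean(mtcars$mpg)
## [1] 20.09062
```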
The describeBy()
function will produce a list that contains summary statistics for different groups as list elements. Below, we’ll assign the list of group summary statistics to a new object named summary_stats_by_continent
:
# Creates summary statistics for each continent grouping, and puts results in list named "summary_stats_by_continent"
summary_stats_by_continent<-describeBy(pt_copy, pt_copy$continent)
Now, let’s say we want to extract the summary statistics for Africa, one of the continent categories in the “continent” column. We can do so using the double-bracket notation we discussed above:
# Accessing continent-level summary statistics for africa from the "summary_stats_by_continent" list
summary_stats_by_continent[["africa"]]
## vars n mean sd median trimmed mad min max
## oecd 1 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## country* 2 11 6.00 3.32 6.00 6.00 4.45 1.00 11.00
## pind 3 11 0.77 0.42 1.00 0.83 0.00 0.00 1.00
## pindo 4 11 0.77 0.42 1.00 0.83 0.00 0.00 1.00
## ctrycd 5 11 647.55 154.90 684.00 685.56 56.34 199.00 754.00
## col_uk 6 11 0.82 0.40 1.00 0.89 0.00 0.00 1.00
## t_indep 7 11 36.64 19.77 35.00 33.89 5.93 9.00 89.00
## col_uka 8 11 0.69 0.35 0.86 0.74 0.02 0.00 0.92
## col_espa 9 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## col_otha 10 11 0.15 0.33 0.00 0.07 0.00 0.00 0.96
## legor_uk 11 11 0.82 0.40 1.00 0.89 0.00 0.00 1.00
## legor_so 12 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## legor_fr 13 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00
## legor_ge 14 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## legor_sc 15 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## prot80 16 11 22.17 20.23 25.80 19.96 19.57 0.10 64.20
## catho80 17 11 19.46 13.67 18.70 18.07 13.20 1.90 49.60
## confu 18 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## avelf 19 11 0.71 0.14 0.73 0.73 0.15 0.38 0.84
## govef 20 11 5.37 0.82 5.02 5.25 0.68 4.56 7.26
## graft 21 11 5.11 0.77 5.39 5.12 0.80 3.93 6.23
## logyl 22 11 7.93 0.78 7.75 7.90 0.53 6.95 9.13
## loga 23 11 7.38 0.66 7.33 7.37 0.55 6.28 8.58
## yrsopen 24 11 0.21 0.29 0.16 0.15 0.18 0.00 1.00
## gadp 25 11 0.55 0.12 0.54 0.55 0.12 0.37 0.74
## engfrac 26 11 0.02 0.04 0.00 0.02 0.00 0.00 0.09
## eurfrac 27 11 0.07 0.17 0.00 0.03 0.00 0.00 0.57
## frankrom 28 11 2.90 0.51 2.94 2.86 0.56 2.19 3.95
## latitude 29 11 -9.14 15.17 -15.81 -9.58 8.49 -29.13 14.77
## gastil 30 11 3.59 1.16 4.00 3.66 1.32 1.61 4.89
## cgexp 31 10 27.00 7.63 25.50 27.10 8.58 14.65 38.57
## cgrev 32 9 26.15 10.36 23.81 26.15 6.14 17.24 50.85
## ssw 33 6 1.67 1.46 0.94 1.67 0.58 0.44 3.80
## rgdph 34 11 1899.87 1832.60 1116.28 1522.39 738.30 530.22 6666.77
## trade 35 11 77.34 32.13 69.17 76.87 27.13 30.83 128.12
## prop1564 36 11 54.23 4.91 53.23 53.51 2.96 49.05 65.95
## prop65 37 11 3.28 1.16 2.80 3.06 0.65 2.34 6.26
## federal 38 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## eduger 39 11 73.95 23.54 73.55 73.64 25.47 40.05 110.67
## spropn 40 10 0.27 0.42 0.00 0.21 0.00 0.00 1.00
## yearele 41 8 1982.50 13.48 1990.50 1982.50 5.19 1965.00 1994.00
## yearreg 42 8 1982.50 13.48 1990.50 1982.50 5.19 1965.00 1994.00
## seats 43 11 151.20 109.96 122.22 136.21 86.65 37.33 400.00
## maj 44 11 0.73 0.47 1.00 0.78 0.00 0.00 1.00
## pres 45 11 0.64 0.50 1.00 0.67 0.00 0.00 1.00
## lyp 46 11 7.22 0.81 7.02 7.15 0.88 6.27 8.80
## semi 47 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00
## majpar 48 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00
## majpres 49 11 0.55 0.52 1.00 0.56 0.00 0.00 1.00
## propres 50 11 0.09 0.30 0.00 0.00 0.00 0.00 1.00
## dem_age 51 11 1975.82 24.77 1989.00 1981.11 7.41 1910.00 1994.00
## lat01 52 11 0.17 0.08 0.18 0.17 0.05 0.00 0.32
## age 53 11 0.12 0.12 0.05 0.09 0.04 0.03 0.45
## polityIV 54 11 2.34 5.56 0.22 2.42 6.75 -6.00 10.00
## spl 55 8 -1.55 4.52 -1.54 -1.55 1.91 -6.77 8.23
## cpi9500 56 9 5.70 1.15 5.90 5.70 1.14 3.93 7.55
## du_60ctry 57 11 0.27 0.47 0.00 0.22 0.00 0.00 1.00
## magn 58 11 0.71 0.41 1.00 0.75 0.00 0.02 1.00
## sdm 59 9 0.71 0.45 1.00 0.71 0.00 0.03 1.00
## oecd.x 60 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## mining_gdp 61 10 8.43 11.70 4.10 5.89 5.71 0.02 37.20
## gini_8090 62 9 50.25 9.95 54.00 50.25 11.86 35.36 62.30
## con2150 63 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## con5180 64 11 0.27 0.47 0.00 0.22 0.00 0.00 1.00
## con81 65 11 0.73 0.47 1.00 0.78 0.00 0.00 1.00
## list 66 11 49.83 119.87 0.00 16.46 0.00 0.00 400.00
## maj_bad 67 11 2.73 2.05 3.83 2.80 1.56 0.00 4.89
## maj_gin 68 9 37.31 22.84 41.35 37.31 18.75 0.00 62.00
## maj_old 69 11 0.06 0.07 0.04 0.06 0.06 0.00 0.17
## pres_bad 70 11 2.63 2.18 3.83 2.67 1.56 0.00 4.89
## pres_gin 71 9 26.72 26.59 35.36 26.72 39.50 0.00 62.00
## pres_old 72 11 0.04 0.05 0.03 0.03 0.04 0.00 0.17
## propar 73 11 0.18 0.40 0.00 0.11 0.00 0.00 1.00
## lpop 74 3 13.99 0.15 13.92 13.99 0.05 13.88 14.17
## continent* 75 11 1.00 0.00 1.00 1.00 0.00 1.00 1.00
## range skew kurtosis se
## oecd 0.00 NaN NaN 0.00
## country* 10.00 0.00 -1.53 1.00
## pind 1.00 -1.06 -0.79 0.13
## pindo 1.00 -1.06 -0.79 0.13
## ctrycd 555.00 -2.13 3.44 46.70
## col_uk 1.00 -1.43 0.08 0.12
## t_indep 80.00 1.38 1.88 5.96
## col_uka 0.92 -1.31 -0.14 0.10
## col_espa 0.00 NaN NaN 0.00
## col_otha 0.96 1.58 0.79 0.10
## legor_uk 1.00 -1.43 0.08 0.12
## legor_so 0.00 NaN NaN 0.00
## legor_fr 1.00 1.43 0.08 0.12
## legor_ge 0.00 NaN NaN 0.00
## legor_sc 0.00 NaN NaN 0.00
## prot80 64.10 0.46 -0.80 6.10
## catho80 47.70 0.71 -0.39 4.12
## confu 0.00 NaN NaN 0.00
## avelf 0.46 -1.15 0.44 0.04
## govef 2.70 0.97 -0.17 0.25
## graft 2.30 -0.17 -1.62 0.23
## logyl 2.18 0.42 -1.43 0.23
## loga 2.29 0.03 -0.91 0.20
## yrsopen 1.00 1.72 2.15 0.09
## gadp 0.37 0.28 -1.38 0.04
## engfrac 0.09 0.95 -1.09 0.01
## eurfrac 0.57 2.24 3.76 0.05
## frankrom 1.77 0.54 -0.69 0.15
## latitude 43.90 0.44 -1.52 4.57
## gastil 3.28 -0.48 -1.45 0.35
## cgexp 23.92 0.06 -1.30 2.41
## cgrev 33.61 1.40 0.71 3.45
## ssw 3.36 0.52 -1.87 0.60
## rgdph 6136.54 1.50 1.28 552.55
## trade 97.29 0.31 -1.40 9.69
## prop1564 16.90 1.19 0.34 1.48
## prop65 3.92 1.47 1.16 0.35
## federal 0.00 NaN NaN 0.00
## eduger 70.62 0.08 -1.50 7.10
## spropn 1.00 0.92 -1.07 0.13
## yearele 29.00 -0.41 -2.00 4.77
## yearreg 29.00 -0.41 -2.00 4.77
## seats 362.67 0.92 -0.20 33.16
## maj 1.00 -0.88 -1.31 0.14
## pres 1.00 -0.49 -1.91 0.15
## lyp 2.53 0.53 -1.18 0.25
## semi 1.00 1.43 0.08 0.12
## majpar 1.00 1.43 0.08 0.12
## majpres 1.00 -0.16 -2.15 0.16
## propres 1.00 2.47 4.52 0.09
## dem_age 84.00 -1.57 1.64 7.47
## lat01 0.32 -0.28 -0.38 0.03
## age 0.42 1.57 1.64 0.04
## polityIV 16.00 0.07 -1.63 1.68
## spl 15.00 0.98 0.05 1.60
## cpi9500 3.61 0.01 -1.45 0.38
## du_60ctry 1.00 0.88 -1.31 0.14
## magn 0.98 -0.58 -1.70 0.12
## sdm 0.97 -0.67 -1.63 0.15
## oecd.x 0.00 NaN NaN 0.00
## mining_gdp 37.18 1.39 0.79 3.70
## gini_8090 26.94 -0.19 -1.71 3.32
## con2150 0.00 NaN NaN 0.00
## con5180 1.00 0.88 -1.31 0.14
## con81 1.00 -0.88 -1.31 0.14
## list 400.00 2.22 3.64 36.14
## maj_bad 4.89 -0.32 -1.81 0.62
## maj_gin 62.00 -0.71 -1.17 7.61
## maj_old 0.17 0.69 -1.39 0.02
## pres_bad 4.89 -0.30 -1.92 0.66
## pres_gin 62.00 0.04 -1.97 8.86
## pres_old 0.17 1.66 2.10 0.02
## propar 1.00 1.43 0.08 0.12
## lpop 0.28 0.36 -2.33 0.09
## continent* 0.00 NaN NaN 0.00
Recall that we can assign list elements that we extract from a list to their own object, which allows us to conveniently retrieve it whenever it is needed. Below, we’ll assign the summary statistics for Africa to a new object named africa_summary
:
# Group-level summary statistics can be assigned to their own object for easy retrieval
africa_summary<-summary_stats_by_continent[["africa"]]
Another convenient way to retrieve group-level summary statistics is through the group_by()
function in the dplyr package. First, we’ll run the code below, and assign it to a new object named trade_age_by_continent
:
# Generate a table that displays summary statistics for trade at the continent level and assign to object named "trade_age_by_continent"
trade_age_by_continent<-pt_copy %>%
group_by(continent) %>%
summarise(meanTrade=mean(trade),sdTrade=sd(trade),
meanAge=mean(age), sdAge=sd(age),
n=n())
Now, let’s print the contents of trade_age_by_continent
:
# prints contents of "trade_age_by_continent"
trade_age_by_continent
## # A tibble: 4 × 6
## continent meanTrade sdTrade meanAge sdAge n
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 africa 77.3 32.1 0.121 0.124 11
## 2 asiae 97.8 84.6 0.110 0.0846 13
## 3 laam 68.6 32.8 0.139 0.153 23
## 4 other 78.8 40.7 0.309 0.263 38
Let’s now unpack the code that created this table. We started with the pt_copy
data frame, and then used group_by(continent)
to declare that subsequent calculations should be performed at the continent-level; then, within the summarise()
function, we defined the column names we wanted to use in the group-level summary table, and how those variables are to be calculated. For example, meanTrade=mean(trade)
indicates that we want the first column to be named “meanTrade”, which is to be calculated by taking the mean of the “trade” variable for each continent grouping. After that, sdTrade=sd(trade)
indicates that we want the second column to be named “sdTrade”, which is to be calculated by taking the standard deviation of the “trade” variable for each continent grouping. And so on. Note that n=n()
indicates that we want the final column, named “n”, to provide information about the number of observations in each continent-level grouping.
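One caveat worth knowing: mean() and sd() return NA if the variable contains any missing values. The columns summarized above happen to be complete, but for variables with missing data you can add na.rm = TRUE inside these functions. A sketch using R’s built-in airquality data, whose Ozone column contains NAs:

```r
library(dplyr)

# without na.rm = TRUE, any month containing missing Ozone readings
# would get meanOzone = NA
airquality %>%
  group_by(Month) %>%
  summarise(meanOzone = mean(Ozone, na.rm = TRUE),
            n = n())
```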
You might have noticed a mysterious symbol in the above code that comes immediately after pt_copy
, and immediately after group_by(continent)
. This symbol is known as a “pipe” (%>%
). The pipe operator takes the result of the expression on its left and passes it as an input to the code on its right. Above, the first pipe takes the contents of pt_copy
and feeds this data into the group_by()
function; then, after the data is grouped by continent, the second pipe feeds the grouped data into the summarise()
function. We will use the pipe operator throughout the lesson to chain together functions in this manner.
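To see that the pipe is just a different way of writing nested function calls, compare the two equivalent versions below (a self-contained sketch using the built-in mtcars data):

```r
library(dplyr)

# nested form: read from the inside out
nested <- summarise(group_by(mtcars, cyl), meanMPG = mean(mpg))

# piped form: read left to right, top to bottom
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(meanMPG = mean(mpg))

identical(nested, piped)
## [1] TRUE
```

The piped form states the steps in the order they happen, which is why dplyr code is conventionally written this way.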
Finally, in addition to calculating summary statistics and group-level summary statistics, another useful way to explore your data is to generate simple crosstabs that show the breakdown of one variable with respect to another. The code below uses the tabyl()
function from the janitor package to compute a crosstab between the “federal” variable (which takes the value 1 if a country has a federal structure of government, and 0 if its government is unitary) and the “continent” variable; it assigns the crosstab to a new object named crosstab_federal_continent
:
# Creates cross-tab showing the breakdown of federal/non federal across continents
crosstab_federal_continent<-pt_copy %>% tabyl(federal, continent)
Let’s print the contents of crosstab_federal_continent
:
# prints contents of "crosstab_federal_continent"
crosstab_federal_continent
## federal africa asiae laam other
## 0 11 11 19 29
## 1 0 2 4 7
## NA 0 0 0 2
This tells us, for instance, that among Latin American countries, 19 had a unitary government and 4 had a federal structure of government.
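If you would rather see proportions than raw counts, janitor also provides adorn_percentages() and adorn_pct_formatting(), which can be chained onto a tabyl. A self-contained sketch using the built-in mtcars data (cross-tabbing engine shape “vs” against transmission type “am”):

```r
library(janitor)
library(dplyr)  # for the pipe

# raw counts
ct <- tabyl(mtcars, vs, am)

# convert counts to row percentages and format them for display
ct %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1)
```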
After getting a sense of your data by computing some summary statistics and running some crosstabs, you’ll often have a sense of how you would like to clean or transform your data for analysis. This section briefly describes some functions that are useful for these basic data-preparation tasks.
Rearranging Columns
We can manipulate the order of the columns in a dataset using the relocate
function. For example, the code below uses the relocate()
function to shift the “country” column to the front of the dataset, and then assigns this change back to pt_copy
to update the object:
# bring the "country" column to the front of the dataset
pt_copy<-pt_copy %>% relocate(country)
You can confirm that the change has been implemented by viewing pt_copy
in the data viewer:
# Views "pt_copy" in data viewer
View(pt_copy)
You can pass more than one argument to the relocate()
function. For example, in the code below, passing the “country”, “list”, “trade”, and “oecd” columns to the relocate()
function makes “country” the first column, “list” the second column, “trade” the third column, and so on.
# bring the "country", "list", "trade", "oecd" columns to the front of the dataset
pt_copy<-pt_copy %>% relocate(country, list, trade, oecd)
# Views updated "pt_copy" data frame in data viewer
View(pt_copy)
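relocate() also accepts .before and .after arguments for placing columns relative to another column, rather than at the front of the dataset. A sketch with the built-in mtcars data:

```r
library(dplyr)

# move "gear" so it sits immediately before "cyl"
mtcars %>% relocate(gear, .before = cyl) %>% names() %>% head(3)
## [1] "mpg"  "gear" "cyl"

# move "mpg" to the very end of the data frame
mtcars %>% relocate(mpg, .after = last_col()) %>% names() %>% tail(1)
## [1] "mpg"
```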
Renaming variables
In order to rename variables, we can use the rename()
function, as below. The code below renames the existing “list” variable to “party_list”, which is more descriptive, and assigns the change back to the pt_copy
object.
## Renaming a variable (renames "list" to "party_list")
pt_copy<-pt_copy %>% rename(party_list=list)
Check the pt_copy
data frame in the viewer to ensure that the change has been made.
Sorting a dataset in ascending or descending order with respect to a variable
It is often useful to sort a data frame in ascending or descending order with respect to a given variable. The code below sorts the pt_copy
data frame in ascending order with respect to the “trade” variable using the arrange()
function:
# sorting in ascending (low to high) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(trade)
If, instead, you want to sort the dataset in descending order with respect to the “trade” variable, pass the name of the variable to the desc()
function within the arrange()
function, as below:
# sorting in descending (high to low) order with respect to the "trade" variable
pt_copy<-pt_copy %>% arrange(desc(trade))
Creating new variables based on existing variables
Depending on your research question and empirical strategy, it is often useful or necessary to create new variables in your dataset, based on existing variables. To do so, we can use dplyr’s mutate()
function. The code below, for example, uses the mutate()
function to create a new variable, named “non_catholic_80”, that is computed by subtracting the existing “catho80” variable from 100; for convenience, the “country”, “catho80”, and newly created “non_catholic_80” variables are all moved to the front of the dataset using the relocate()
function:
# Create new variable named "non_catholic_80" that is calculated by subtracting the Catholic share of the population in 1980 ("catho80") from 100 and relocates "country", "catho80", and the newly created "non_catholic_80" to the front of the dataset
pt_copy<-pt_copy %>% mutate(non_catholic_80=100-catho80) %>%
relocate(country, catho80, non_catholic_80)
You can view the updated pt_copy
data frame to confirm that the new variable has been created:
# views updated "pt_copy" data frame in RStudio data viewer
View(pt_copy)
Selecting or Deleting Variables
Sometimes, you will have a dataset with many variables, and to make things more tractable, you’ll want to select only the variables that are relevant to your analysis. You can explicitly select desired variables using the select()
function from dplyr. The code below selects the “country”, “cgexp”, “cgrev”, “trade”, and “federal” columns from pt_copy
, and then assigns this selection to a new object named pt_copy_selection
:
# Selects "country", "cgexp", "cgrev", and "trade" variables from the "pt_copy" dataset and assigns the selection to a new object named "pt_copy_selection"
pt_copy_selection<-pt_copy %>%
select(country, cgexp, cgrev, trade, federal)
When you view the pt_copy_selection
object in the data viewer, you’ll see that we now have a new data frame that consists only of these variables:
# views "pt_copy_selection" in data viewer
View(pt_copy_selection)
Instead of selecting columns to keep, it may sometimes be easier to directly delete columns. For example, the code below deletes the “federal” variable from pt_copy_selection
by passing it to the select()
function with a “-” in front of it.
# deletes "federal" variable from "pt_copy_selection"
pt_copy_selection %>% select(-federal)
## # A tibble: 85 × 4
## country cgexp cgrev trade
## <chr> <dbl> <dbl> <dbl>
## 1 Singapore 18.5 34.7 343.
## 2 Malta 41.0 35.0 190.
## 3 Luxembourg 40.2 45.5 189.
## 4 Malaysia 24.5 26.8 176.
## 5 Estonia 30.0 31.1 154.
## 6 Belgium 47.9 43.7 132.
## 7 Ireland 38.1 34.8 129.
## 8 Mauritius 22.5 21.6 128.
## 9 St. Vincent&G 34.8 28.7 123.
## 10 Jamaica NA NA 122.
## # … with 75 more rows
If you want to delete multiple columns, simply specify the columns in a vector, preceded by a minus sign, that is passed to the select()
function. The code below, for instance, takes the existing pt_copy_selection
data frame, deletes the “federal” and “trade” columns, and assigns the result to a new object named pt_copy_selection_modified
:
# deletes "federal" and "trade" from "pt_copy_selection" and assigns it to new object named "pt_copy_selection_modified"
pt_copy_selection_modified<-pt_copy_selection %>% select(-c(federal, trade))
Check the pt_copy_selection_modified
data frame to confirm these changes:
# prints contents of "pt_copy_selection_modified"
pt_copy_selection_modified
## # A tibble: 85 × 3
## country cgexp cgrev
## <chr> <dbl> <dbl>
## 1 Singapore 18.5 34.7
## 2 Malta 41.0 35.0
## 3 Luxembourg 40.2 45.5
## 4 Malaysia 24.5 26.8
## 5 Estonia 30.0 31.1
## 6 Belgium 47.9 43.7
## 7 Ireland 38.1 34.8
## 8 Mauritius 22.5 21.6
## 9 St. Vincent&G 34.8 28.7
## 10 Jamaica NA NA
## # … with 75 more rows
Recoding Variables
“Recoding” a variable refers to the process of taking an existing variable, and generating new variable(s) that represent the information from that original variable in a new way. Below, we’ll consider some common recoding operations.
Creating Dummy Variables from Continuous Numeric Variables
You may sometimes have a continuous numeric variable, but want to create a new dummy variable (a variable that takes on the value of 1 if a given condition is met, and 0 otherwise) based on that numeric variable. For example, let’s say we want to create a new variable, named “trade_open” that takes on the value of 1 if the trade variable is greater than or equal to 77, and 0 otherwise. We can generate this new dummy variable using the mutate()
function; within the mutate()
function below, we specify that we want to create a new variable named “trade_open”; the ifelse()
function specifies the desired condition (trade>=77), followed by the value the new “trade_open” variable is to take if the condition is met (1), and the value the new “trade_open” variable is to take if the condition is not met (0). In other words, we can translate ifelse(trade>=77, 1, 0)
to “if trade>=77, set the ‘trade_open’ variable to 1, otherwise set it to 0.” We’ll assign the data frame with the new “trade_open” variable back to “pt_copy”:
# Creates a new dummy variable based on the existing "trade" variable named "trade_open" (which takes on a value of "1" if "trade" is greater than or equal to 77, and 0 otherwise) and then moves the newly created variable to the front of the dataset along with "country" and "trade"; all changes are assigned to "pt_copy", thereby overwriting the existing version of "pt_copy"
pt_copy<-pt_copy %>% mutate(trade_open=ifelse(trade>=77, 1, 0)) %>%
relocate(country, trade_open, trade)
View the data frame to ensure that the new variable “trade_open”, recoded based on “trade”, has been created:
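As with the earlier examples, we can do this with View():

```r
# views updated "pt_copy" data frame in data viewer
View(pt_copy)
```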
Creating categorical variables from continuous numeric variables
Sometimes, you will want to create a variable that contains categories or classifications that derive from numeric thresholds of an existing variable. For instance, let’s say we want to take the existing “trade” variable, and define a new variable named “trade_level”, which is set to “Low Trade” when the “trade” variable is greater than 15 and less than 50; “Intermediate_Trade” when the “trade” variable is greater than or equal to 50 and less than 100; and “High_Trade” when the “trade” variable is greater than or equal to 100. The code below creates this new “trade_level” variable using the mutate()
function, and the case_when()
function that maps the conditions onto the desired variable values for “trade_level” using the following syntax:
# Creates a new variable in the "pt_copy" dataset named "trade_level" (that is coded as "Low_Trade" when the "trade" variable is greater than 15 and less than 50, coded as "Intermediate_Trade" when "trade" is greater than or equal to 50 and less than 100, and coded as "High_Trade" when "trade" is greater than or equal to 100), and then reorders the dataset such that "country", "trade_level", and "trade" are the first three variables in the dataset
pt_copy<-pt_copy %>% mutate(trade_level=case_when(trade>15 & trade<50~"Low_Trade",
trade>=50 & trade<100~"Intermediate_Trade",
trade>=100~"High_Trade")) %>%
relocate(country, trade_level, trade)
Check to see that the new “trade_level” variable has indeed been created in pt_copy
according to the specifications above:
# views updated "pt_copy" data frame in data viewer
View(pt_copy)
Creating dummy variables from categorical variables
Sometimes, you may have a categorical variable in a dataset, and want to create dummy variables based on those categories. For example, consider the “trade_level” variable we created above. Let’s say we want to use the “trade_level” column to create dummy variables for each of the categories in that column. We can do so with the fastDummies package, which can quickly generate dummy variables for the categories in a categorical variable using the dummy_cols()
function. Below, we simply take the existing pt_copy
dataset, and pass the name of the categorical variable out of which we want to create the dummies (“trade_level”) to the dummy_cols()
function:
# Creates dummy variables from "trade_level" column, and relocates the new dummies to the front of the dataset
pt_copy<-pt_copy %>% dummy_cols("trade_level") %>%
relocate(country, trade_level, trade_level_High_Trade, trade_level_Intermediate_Trade, trade_level_Low_Trade)
Let’s now view the updated pt_copy
data frame, with the newly created dummy variables:
# views updated "pt_copy" in data viewer
View(pt_copy)
You’ll notice that there are now dummy variables corresponding to each of the categories in the categorical “trade_level” variable; for example, the “trade_level_High_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “High_Trade” and 0 otherwise; the “trade_level_Intermediate_Trade” dummy variable takes on the value of 1 for all observations where the “trade_level” variable is “Intermediate_Trade” and 0 otherwise; and so on.
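Beyond eyeballing the data viewer, one way to spot-check the new dummies is to cross-tabulate one of them against the original categorical variable using tabyl(), which we used earlier; a sketch:

```r
# cross-tab of "trade_level" against the "trade_level_High_Trade" dummy;
# all "High_Trade" rows should fall in the dummy's "1" column
pt_copy %>% tabyl(trade_level, trade_level_High_Trade)
```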
Subsetting Observations
You will often want to subset, or “filter” your datasets to extract observations that meet specified criteria. The dplyr package allows you to carry out these subsetting operations with a function called filter()
, which takes various logical conditions as arguments. Let’s say, for example, that we want to extract all of the OECD country observations from the pt_copy
dataset. The “oecd” variable in pt_copy
is equal to 1, for all OECD countries, and 0 for non-OECD countries. By passing the condition oecd==1
to the filter()
function, we can extract all OECD observations. We’ll assign this data subset to a new object named oecd_countries
, and view it in the data viewer:
# Extracts OECD observations in "pt_copy" and assigns to object named "oecd_countries"
oecd_countries<-pt_copy %>% filter(oecd==1) %>%
relocate(country, oecd)
# views "oecd_countries" in data viewer
View(oecd_countries)
Let’s take another example. Let’s use the filter()
function to extract all observations for which the “cgrev” (central government revenue as a share of GDP) exceeds 40. We’ll assign the observations that satisfy this condition to a new object named high_revenues
:
# Extracts observations for which cgrev (central government revenue as % of gdp)>40, and assigns to object named "high_revenues"
high_revenues<-pt_copy %>% filter(cgrev>40) %>%
relocate(country, cgrev)
# Views "high_revenues" in data viewer
high_revenues
## # A tibble: 10 × 81
## country cgrev trade_level trade_level_Hig… trade_level_Int… trade_level_Low…
## <chr> <dbl> <chr> <int> <int> <int>
## 1 Luxembo… 45.5 High_Trade 1 0 0
## 2 Belgium 43.7 High_Trade 1 0 0
## 3 Netherl… 47.6 High_Trade 1 0 0
## 4 Botswana 50.8 Intermedia… 0 1 0
## 5 Hungary 45.6 Intermedia… 0 1 0
## 6 Norway 41.1 Intermedia… 0 1 0
## 7 Sweden 40.8 Intermedia… 0 1 0
## 8 Poland 40.3 Low_Trade 0 0 1
## 9 France 40.9 Low_Trade 0 0 1
## 10 Italy 41.2 Low_Trade 0 0 1
## # … with 75 more variables: trade <dbl>, trade_open <dbl>, catho80 <dbl>,
## # non_catholic_80 <dbl>, party_list <dbl>, oecd <dbl>, pind <dbl>,
## # pindo <dbl>, ctrycd <dbl>, col_uk <dbl>, t_indep <dbl>, col_uka <dbl>,
## # col_espa <dbl>, col_otha <dbl>, legor_uk <dbl>, legor_so <dbl>,
## # legor_fr <dbl>, legor_ge <dbl>, legor_sc <dbl>, prot80 <dbl>, confu <dbl>,
## # avelf <dbl>, govef <dbl>, graft <dbl>, logyl <dbl>, loga <dbl>,
## # yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, eurfrac <dbl>, frankrom <dbl>, …
Let’s try another example. Let’s subset observations from pt_copy
for which the Catholic share of the population in 1980 (“catho80”) is less than or equal to 50, and assign the filtered data to a new object named minority_catholic
:
# Extracts observations for which the "catho80" variable is less than or equal to 50
minority_catholic<-pt_copy %>% filter(catho80<=50) %>%
relocate(country, catho80)
# Views "minority_catholic" in the data viewer
View(minority_catholic)
It is also possible to chain together multiple conditions as arguments to the filter()
function. For example, if we want to subset observations from OECD countries that also have a federal political structure, we can use the “&” operator to specify these two conditions; we’ll assign the filtered dataset to a new object named oecd_federal_countries
:
# Extracts federal OECD countries (where oecd=1 AND federal=1) and assigns to a new object named "oecd_federal_countries"
oecd_federal_countries<-pt_copy %>% filter(oecd==1 & federal==1) %>%
relocate(country, oecd, federal)
# Views "oecd_federal_countries" in data viewer
oecd_federal_countries
## # A tibble: 7 × 81
## country oecd federal trade_level trade_level_Hig… trade_level_Int…
## <chr> <dbl> <dbl> <chr> <int> <int>
## 1 Austria 1 1 Intermediate_Trade 0 1
## 2 Switzerland 1 1 Intermediate_Trade 0 1
## 3 Canada 1 1 Intermediate_Trade 0 1
## 4 Germany 1 1 Low_Trade 0 0
## 5 Mexico 1 1 Low_Trade 0 0
## 6 Australia 1 1 Low_Trade 0 0
## 7 USA 1 1 Low_Trade 0 0
## # … with 75 more variables: trade_level_Low_Trade <int>, trade <dbl>,
## # trade_open <dbl>, catho80 <dbl>, non_catholic_80 <dbl>, party_list <dbl>,
## # pind <dbl>, pindo <dbl>, ctrycd <dbl>, col_uk <dbl>, t_indep <dbl>,
## # col_uka <dbl>, col_espa <dbl>, col_otha <dbl>, legor_uk <dbl>,
## # legor_so <dbl>, legor_fr <dbl>, legor_ge <dbl>, legor_sc <dbl>,
## # prot80 <dbl>, confu <dbl>, avelf <dbl>, govef <dbl>, graft <dbl>,
## # logyl <dbl>, loga <dbl>, yrsopen <dbl>, gadp <dbl>, engfrac <dbl>, …
We can use a vertical line (|) to specify “or” conditions. For example, the code below subsets observations from countries in Africa OR countries in Asia/Europe, and assigns the subsetted data to a new object named asia_europe_africa
:
# Extracts observations that are in Africa ("africa") OR in Asia/Europe ("asiae") and assigns to an object named "asia_europe_africa"
asia_europe_africa<-pt_copy %>% filter(continent=="africa"|continent=="asiae") %>%
relocate(continent)
# views "asia_europe_africa" in data viewer
View(asia_europe_africa)
# Prints contents of "asia_europe_africa"
asia_europe_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
deferRender = TRUE,
scrollY = 350,
scrollX = 350,
dom = "t",
scroller = TRUE,
fixedColumns = list(leftColumns = 3)
))
Filtering for observations that do NOT meet a condition
It is also useful to know how to subset datasets to extract observations that do NOT meet a given condition. In particular, the condition “not equal to” is denoted by a “!=”. For example, if we wanted to extract observations from pt_copy
where the “continent” variable is NOT equal to “africa”, and assign the result to a new object named pt_copy_sans_africa
, we can write the following:
# Extracts all non-Africa observations and assigns to object named "pt_copy_sans_africa"
pt_copy_sans_africa<-pt_copy %>% filter(continent!="africa") %>% relocate(continent)
# views pt_copy_sans_africa in the data viewer
View(pt_copy_sans_africa)
# Prints contents of "pt_copy_sans_africa"
pt_copy_sans_africa %>% datatable(extensions=c("Scroller", "FixedColumns"), options = list(
deferRender = TRUE,
scrollY = 350,
scrollX = 350,
dom = "t",
scroller = TRUE,
fixedColumns = list(leftColumns = 3)
))
R has a number of powerful visualization capabilities, but one of the most frequently used tools for data visualization in R is the ggplot2 package, which is part of the tidyverse suite. Data visualization using ggplot2 is a vast topic; our goal here is to provide you with some basic intuition for how ggplot visualizations are constructed by developing some basic exploratory visualizations. While our treatment here focuses on bar charts and scatterplots, ggplot offers functions for a much wider variety of visualizations. However, bar charts and scatterplots offer a convenient way to familiarize yourself with basic ggplot syntax.
Bar Charts
Let’s make some simple bar charts for the African countries in pt_copy
. Let’s say we want to make a bar chart that displays variation in the “cgexp” variable (central government expenditure as a share of GDP) for African countries. We’ll begin by extracting the Africa observations from pt_copy
using the filter()
function, and removing any “NA” observations for this variable from the dataset using the drop_na()
function:
# filters Africa observations
pt_africa<-pt_copy %>%
filter(continent=="africa") %>%
drop_na(cgexp)
Now, let’s make a basic bar chart of the “cgexp” data from pt_africa
, and assign it to an object named cgexp_africa
:
# Creates a bar chart of the "cgexp" variable (central government expenditure as a share of GDP) for the Africa observations and assigns the plot to an object named "cgexp_africa"
cgexp_africa<-
ggplot(pt_africa)+
geom_col(aes(x=country, y=cgexp))+
labs(
title="Central Govt Expenditure as Pct of GDP for Select African Countries (1990-1998 Average)",
x="Country Name",
y="CGEXP")+
theme(plot.title=element_text(hjust=0.5),
axis.text.x = element_text(angle = 90))
Let’s unpack the code above:
ggplot(pt_africa)
specifies that we want to initialize ggplot, and declares the dataset containing the data we want to map (“pt_africa”)geom_col()
indicates that we want to make a bar chart. If you wanted to make a different type of chart, this function would be different. Within the geom_col()
function, we indicate our desired aesthetic mapping aes()
; an aesthetic mapping indicates how we would like variables in the datasets to be represented on the chosen visualization. Here, the expression x=country, y=cgexp
simply indicates that we want countries to be represented on the x-axis of the chart, and the “cgexp” variable to be represented on the y-axis.labs()
function (short for “labels”) specify a desired title for the visualization, and x-axis and y-axis labels.theme()
function specify a desired position for the plot title, and a desired format for the x-axis labels.Note that ggplot2 functions are chained together with a “+” sign.
Let’s see what cgexp_africa
looks like:
# prints contents of cgexp_africa
cgexp_africa
This is a nice start, but it may look a bit cleaner if we arranged the bars in ascending order with respect to the cgexp variable. To do so, we can slightly change our aesthetic mapping to look like this: aes(x=reorder(country, cgexp), y=cgexp)
. This indicates that we’d still like the “cgexp” variable on the y-axis, and countries on the x-axis; however, we’d also like to order countries in ascending order with respect to the “cgexp” variable. We’ll assign this modified chart to a new object named cgexp_africa_ascending
:
# Creates a bar chart of the "cgexp" variable (central government expenditure as a share of GDP) for the Africa observations; countries are on the x axis and arrayed in ascending order with respect to the cgexp variable, which is on the y-axis; plot is assigned to an object named "cgexp_africa_ascending"
cgexp_africa_ascending<-
ggplot(pt_africa)+
geom_col(aes(x=reorder(country, cgexp), y=cgexp))+
labs(
title="Central Govt Expenditure as Pct of GDP for Select African Countries (1990-1998 Average)",
x="Country Name",
y="CGEXP")+
theme(plot.title=element_text(hjust=0.5),
axis.text.x = element_text(angle = 90))
All other aspects of the code are the same as before. Let’s see what the modified chart looks like:
# prints "cgexp_africa_ascending"
cgexp_africa_ascending
If, instead of arranging the countries in ascending order with respect to the “cgexp” variable, we want to arrange them in descending order, we can simply put a “-” before “cgexp” within the aesthetic mapping; we’ll assign the modified chart to a new object named cgexp_africa_descending
:
# Creates a bar chart of the "cgexp" variable (central government expenditure as a share of GDP) for the Africa observations; countries are on the x axis and arrayed in descending order with respect to the cgexp variable, which is on the y-axis; plot is assigned to an object named "cgexp_africa_descending"
cgexp_africa_descending<-
ggplot(pt_africa)+
geom_col(aes(x=reorder(country, -cgexp), y=cgexp))+
labs(
title="Central Govt Expenditure as Pct of GDP for Select African Countries (1990-1998 Average)",
x="Country Name",
y="CGEXP")+
theme(plot.title=element_text(hjust=0.5),
axis.text.x = element_text(angle = 90))
Let’s see how the cgexp_africa_descending
plot now looks:
# prints contents of "cgexp_africa_descending"
cgexp_africa_descending
Sometimes, you may wish to invert the axes of your charts, which you can do using the coord_flip()
function. The code below takes the cgexp_africa_ascending
chart we created above, inverts the axes using coord_flip()
, and assigns the result to a new object named cgexp_africa_ascending_inverted
:
# creates a sideways bar chart using the "coord_flip" function and assigns it to a new object named "cgexp_africa_ascending_inverted"
cgexp_africa_ascending_inverted<-cgexp_africa_ascending+
coord_flip()
Let’s see what cgexp_africa_ascending_inverted
looks like:
# prints "cgexp_africa_ascending_inverted"
cgexp_africa_ascending_inverted
Scatterplots
The syntax to make a scatterplot is fairly similar to the syntax used to create a bar chart; the main difference is that instead of using the geom_col()
function to indicate that we want a bar chart, we use the geom_point()
function to indicate that we want a scatterplot. The code below generates a scatterplot of the “cgexp” variable (on the x axis) and the “trade” variable (on the y-axis) for all observations in the pt_copy
dataset, and assigns it to a new object named scatter_cgexp_trade
:
# Creates scatterplot with "cgexp" variable on x-axis and "trade" variable on y-axis and assigns to object named "scatter_cgexp_trade"
scatter_cgexp_trade<-
ggplot(pt_copy)+
geom_point(aes(x=cgexp, y=trade))+
labs(title="Trade Share of GDP \nas a function of\n Central Govt Expenditure (1990-1998 Average) ",
x="Central Government Expenditure (Pct of GDP)", y="Overall Trade (Pct of GDP)")+
theme(plot.title=element_text(hjust=0.5))
Let’s see what scatter_cgexp_trade
looks like:
# prints contents of "scatter_cgexp_trade"
scatter_cgexp_trade
## Warning: Removed 3 rows containing missing values (geom_point).
Sometimes, you may wish to distinguish between different groups in a scatterplot. One way to do that is to assign different colors to different groups of interest. For example, if we wanted to distinguish continents in the scatterplot, we could specify color=continent
in the aesthetic mapping. The code below does so, and assigns the result to a new object named scatter_cgexp_trade_grouped
:
# Creates scatterplot with "cgexp" variable on x-axis and "trade" variable on y-axis, and uses different color points for different continents; plot is assigned to object named "scatter_cgexp_trade_grouped"
scatter_cgexp_trade_grouped<-
ggplot(pt_copy)+
geom_point(aes(x=cgexp, y=trade, color=continent))+
labs(title="Trade Share of GDP \nas a function of\n Central Govt Expenditure (1990-1998 Average) ",
x="Central Government Expenditure (Pct of GDP)", y="Overall Trade (Pct of GDP)")+
theme(plot.title=element_text(hjust=0.5))
Let’s see what scatter_cgexp_trade_grouped
looks like:
# prints contents of "scatter_cgexp_trade_grouped"
scatter_cgexp_trade_grouped
## Warning: Removed 3 rows containing missing values (geom_point).
An alternative way of parsing categories is to use facets, which create separate visualizations for each of the different categories in a dataset. Below, for example, we create separate scatterplots for each continent (this is specified by the final line in the code, facet_wrap(~continent, nrow=2))
:
# Creates continent-level subplots for scatterplot, using facets; assigns plot to new object named "scatter_cgexp_trade_facets"
scatter_cgexp_trade_facets<-
ggplot(pt_copy) +
geom_point(aes(x = cgexp, y = trade)) +
facet_wrap(~ continent, nrow = 2)
# prints contents of "scatter_cgexp_trade_facets"
scatter_cgexp_trade_facets
## Warning: Removed 3 rows containing missing values (geom_point).
Finally, it’s important to note that it’s possible to layer different geometries over each other. For example, the code below plots a scatterplot for the pt_copy
dataset with the “cgexp” variable on the x axis and the trade variable on the y-axis, but also plots a line of best fit on top of the scatterplot with geom_smooth(aes(x=cgexp, y=trade), method="lm")
; we’ll assign the resulting plot to scatter_cgexp_trade_line
:
# Creates scatterplot with "cgexp" variable on x-axis and "trade" variable on y-axis, adds line of best fit; plot assigned to object named "scatter_cgexp_trade_line"
scatter_cgexp_trade_line<-
ggplot(pt_copy)+
geom_point(aes(x=cgexp, y=trade))+
geom_smooth(aes(x=cgexp, y=trade), method="lm")+
labs(title="Trade Share of GDP \nas a function of\n Central Govt Expenditure (1990-1998 Average) ",
x="Central Government Expenditure (Pct of GDP)", y="Overall Trade (Pct of GDP)")+
theme(plot.title=element_text(hjust=0.5))
# Prints contents of "scatter_cgexp_trade_line"
scatter_cgexp_trade_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
Having learned some basic functions for processing and working with numeric data, we will now turn to a brief exploration of text data in R. We will work with the dataset we created earlier, by appending the various separate Web of Science data frames together; recall that we assigned this data frame to an object named ws_df_appended
:
# prints ws_df_appended
ws_df_appended
## # A tibble: 12,686 × 70
## `Publication Type` Authors `Book Authors` `Book Editors` `Book Group Au…`
## <chr> <chr> <chr> <chr> <chr>
## 1 J Miles, M <NA> <NA> <NA>
## 2 J Dal Farra,… <NA> <NA> <NA>
## 3 J Chen, MH <NA> <NA> <NA>
## 4 J Guy, S; He… <NA> <NA> <NA>
## 5 J Baztan, J;… <NA> <NA> <NA>
## 6 J Burke, M; … <NA> <NA> <NA>
## 7 J Rodder, S <NA> <NA> <NA>
## 8 J Bentz, J; … <NA> <NA> <NA>
## 9 J Ture, C <NA> <NA> <NA>
## 10 J Kim, S <NA> <NA> <NA>
## # … with 12,676 more rows, and 65 more variables: `Author Full Names` <chr>,
## # `Book Author Full Names` <chr>, `Group Authors` <chr>,
## # `Article Title` <chr>, `Source Title` <chr>, `Book Series Title` <chr>,
## # `Book Series Subtitle` <lgl>, Language <chr>, `Document Type` <chr>,
## # `Conference Title` <chr>, `Conference Date` <chr>,
## # `Conference Location` <chr>, `Conference Sponsor` <chr>,
## # `Conference Host` <chr>, `Author Keywords` <chr>, `Keywords Plus` <chr>, …
We will extract the paper abstracts from this dataset, and create a word-frequency table (a table that indicates how many times different words appear in a given corpus) based on the text of all of the abstracts. Such a table will give us a sense of the “keywords” used in a given scholarly literature, which will provide a window into its universe of discourse. In addition to a word frequency table, we’ll also create a word cloud, which is a quick way to visualize the frequency with which various words appear in the corpus of abstracts; the word cloud will be based on the word-frequency table we initially create. Let’s first extract the “Abstract” column from ws_df_appended
and assign it to a new object named wos_abstracts
:
# selects "Abstract" column from "ws_df_appended" and assigns to new object named "wos_abstracts"
wos_abstracts<-ws_df_appended %>% select(Abstract)
Now that we have our abstracts in a tractable data frame, we want to take each word in every abstract, and assign it to its own row; this process is known as “tokenizing” a dataset. The code below tokenizes wos_abstracts
by passing it to the unnest_tokens()
function; input=Abstract
specifies the column containing the data to be tokenized; token="words"
specifies that the unit of analysis is the word (i.e. each word in each abstract gets its own row); and output=word
specifies that the name of the column in the tokenized output that contains the words is to be “word”. We’ll assign the tokenized dataset to a new object named wos_abstracts_tokenized
:
# Tokenizes "Abstract" column text by word; assigns tokenized dataset (with words in "word" column) to a new object named "wos_abstracts_tokenized"
wos_abstracts_tokenized<-wos_abstracts %>%
unnest_tokens(input=Abstract,
token="words",
output=word)
Let’s print wos_abstracts_tokenized
to the console to observe its structure. You can also view it in the data viewer:
# prints contents of "wos_abstracts_tokenized"
wos_abstracts_tokenized
## # A tibble: 2,923,523 × 1
## word
## <chr>
## 1 climate
## 2 change
## 3 is
## 4 now
## 5 an
## 6 established
## 7 scientific
## 8 fact
## 9 and
## 10 dealing
## # … with 2,923,513 more rows
Now, we can use the count()
function to generate the rough draft of a word frequency table. Below, we take the wos_abstracts_tokenized
object, and then pass it to the count()
function, which parses the “word” column in wos_abstracts_tokenized
, and counts up the number of times each word occurs. The resulting frequency table is assigned to a new object named wos_abstracts_frequency
:
# generates frequency table from "wos_abstracts_tokenized", and assigns it to a new object named "wos_abstracts_frequency"
wos_abstracts_frequency<-wos_abstracts_tokenized %>%
count(word, sort=TRUE)
Let’s print the contents of wos_abstracts_frequency
, which you can also observe in the data viewer:
# prints contents of "wos_abstracts_frequency"
wos_abstracts_frequency
## # A tibble: 61,132 × 2
## word n
## <chr> <int>
## 1 the 199114
## 2 of 124221
## 3 and 109854
## 4 in 71449
## 5 to 66830
## 6 a 47559
## 7 for 30201
## 8 is 29116
## 9 with 22105
## 10 that 21885
## # … with 61,122 more rows
When looking through wos_abstracts_frequency
, you’ll notice that many of the words are common or mundane words (e.g. “the”) that don’t provide much information, or numbers, which also typically don’t provide much information in the context of a text analysis. We’ll therefore remove numbers and “common” words from wos_abstracts_frequency
.
These “common” words are often referred to in the text analysis lexicon as “stop words”, and text analysis packages typically provide data frames or vectors containing these words (you can also create your own objects with stop words, if necessary). We’ll use a data frame of stop words from the tidytext package, stored in an object named stop_words
, which we’ll print below:
# prints "stop_words" (part of the "tidytext" package)
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
Below, we’ll remove stop words contained in stop_words
, and any rows containing numeric digits, from wos_abstracts_frequency
using the filter() function, and assign the result to a new object named wos_abstracts_frequency_cleaned
:
# cleans "wos_abstracts_frequency" by removing stop words and removing numbers
wos_abstracts_frequency_cleaned<-wos_abstracts_frequency %>%
filter(!word %in% stop_words$word) %>%
filter(!grepl('[0-9]', word))
The expression that reads filter(!word %in% stop_words$word)
can be translated as saying “remove any rows in wos_abstracts_frequency
for which a word in the ‘word’ column contains one of the stop words in the stop_words
object”. The expression that reads filter(!grepl('[0-9]', word))
can be translated as saying “remove any rows in wos_abstracts_frequency
in which the corresponding ‘word’ column contains a numeric digit.” The grepl()
function is used for pattern matching, which you can read more about in the documentation (?grepl
).
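To build some intuition for how this pattern matching works, here is a small self-contained sketch (the toy vector is invented for illustration):

```r
# grepl() returns TRUE/FALSE for each element of a character vector,
# depending on whether the pattern matches; "[0-9]" matches any digit
words <- c("climate", "co2", "2020", "art")
grepl("[0-9]", words)
## [1] FALSE  TRUE  TRUE FALSE
# negating the result keeps only the words without digits
words[!grepl("[0-9]", words)]
## [1] "climate" "art"
```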
You can now view the wos_abstracts_frequency_cleaned
object in the data viewer, and confirm that digits and stop words have been removed. Peruse the table, and get a sense of what the most frequently recurring words in the corpus are.
Let’s now use wos_abstracts_frequency_cleaned
to make a quick visualization using ggplot2. wos_abstracts_frequency_cleaned
is too large to plot in its entirety with a bar chart, but it may be useful to make a plot of the most frequently recurring words. In this case, we’ll plot the ten most frequently recurring words in wos_abstracts_frequency_cleaned
. We’ll first use the slice_max()
function to extract the rows in wos_abstracts_frequency_cleaned
with the ten highest values of “n”, and assign the result to a new object named wos_top_ten
:
# creates a new data frame that consists of the rows with the ten highest values for "n" (i.e. the ten most frequently recurring words) and assigns it to a new object named "wos_top_ten"
wos_top_ten<-wos_abstracts_frequency_cleaned %>%
slice_max(n, n=10)
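If you're curious how this selection could be done in base R, the same idea can be sketched with `order()` and `head()` on a toy frequency table (the small data frame below is invented for illustration):

```r
# base-R analogue of slice_max(n, n = 10): order the rows by "n" in
# descending order, then keep the top rows with head()
freq <- data.frame(word = c("art", "climate", "model", "data"),
                   n    = c(5, 12, 9, 7))
top_two <- head(freq[order(-freq$n), ], 2)
top_two$word
## [1] "climate" "model"
```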
Let’s view wos_top_ten
in the data viewer:
# views "wos_top_ten" in data viewer
View(wos_top_ten)
Now, let’s make a bar chart using the data in wos_top_ten
and assign it to a new object named wos_frequency_graph
:
# creates bar chart of word frequency data in "wos_top_ten", where words (on the x-axis) are arrayed in ascending order of frequency and frequency (n) is represented on the y-axis; the graph is assigned to a new object named "wos_frequency_graph"
wos_frequency_graph<-
ggplot(data=wos_top_ten)+
geom_col(aes(x=reorder(word, n), y=n))+
labs(title="Ten Most Frequent Words in Abstracts of Publications on Climate + Art",
caption = "Source: Web of Science",
x="",
y="Frequency")
Let’s print its contents:
# prints contents of "wos_frequency_graph"
wos_frequency_graph
And, let’s also create a graph with inverted axes using the coord_flip
function, and assign it to a new object named wos_frequency_graph_inverted
:
# inverts axes of "wos_frequency_graph" and assigns the result to a new object named "wos_frequency_graph_inverted"
wos_frequency_graph_inverted<-
wos_frequency_graph+
coord_flip()
# prints contents of "wos_frequency_graph_inverted"
wos_frequency_graph_inverted
Word Cloud
Another useful tool for exploratory text analysis is the word cloud. We can use the wordcloud2 package’s wordcloud2()
function to quickly make a word cloud out of the data from our cleaned word frequency table, wos_abstracts_frequency_cleaned
:
# make word cloud based on word frequency information from "wos_abstracts_frequency_cleaned"
wordcloud2(data = wos_abstracts_frequency_cleaned, minRotation = 0, maxRotation = 0, ellipticity = 0.6)
It is useful to have a sense of how to write basic custom functions to automate routine data processing tasks. Let’s say, for instance, that we have a list of multiple data frames, and want to apply similar procedures to all of those data frames. How can we do so in an efficient and programmatic fashion that saves time?
Let’s return to our list of separate Web of Science data frames, wos_file_list
, which we created earlier. Print the contents of wos_file_list
if you need a reminder of what it looks like.
Now, let’s say for each data frame in the list, we want to select the “Authors”, “Article Title”, “Source Title” and “Language” columns; rename the “Article Title” column to “Article” and rename the “Source Title” column to “Source”; and then subset the English language observations. Instead of applying these changes to all of the data frames individually, we can write a function to perform these tasks, which we’ll assign to a new object named wos_clean_function()
:
# write function that takes an input WOS dataset; selects the "Authors", "Article Title", "Source Title", and "Language" columns; renames the "Article Title" column to "Article" and the "Source Title" column to "Source"; and then subsets English language papers; the function is assigned to an object named "wos_clean_function"
wos_clean_function<-function(input_dataset){
modified_dataset<-input_dataset %>%
select(Authors, "Article Title", "Source Title", Language) %>%
rename("Article"="Article Title",
"Source"="Source Title") %>%
filter(Language=="English")
return(modified_dataset)
}
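To see the same select/rename/filter pattern without any tidyverse dependencies, here is a base-R sketch applied to a small invented data frame (the toy data and the `clean_base` helper are illustrations, not part of the workshop pipeline):

```r
# toy stand-in for one Web of Science data frame; check.names = FALSE
# preserves the spaces in the column names
toy <- data.frame(
  Authors         = c("A", "B"),
  `Article Title` = c("X", "Y"),
  `Source Title`  = c("S1", "S2"),
  Language        = c("English", "German"),
  check.names = FALSE
)
# base-R version of the clean-up: select columns, rename two of them,
# then keep only the English-language rows
clean_base <- function(d) {
  d <- d[, c("Authors", "Article Title", "Source Title", "Language")]
  names(d)[names(d) == "Article Title"] <- "Article"
  names(d)[names(d) == "Source Title"]  <- "Source"
  d[d$Language == "English", ]
}
clean_base(toy)
```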
Now, we’ll iteratively apply the wos_clean_function
function to all of the list elements in wos_file_list
using the map()
function. The code below applies the wos_clean_function
function to the first element of wos_file_list
and deposits the transformed dataset as the first element of a new list; it then applies the wos_clean_function
to the second element of wos_file_list
and deposits the transformed dataset as the second element of a new list; and so on. The list containing the transformed data frames is assigned to a new list named processed_wos_list
:
# apply "wos_clean_function" to all list elements in "wos_file_list" and assign the new list of modified data frames to a new object named "processed_wos_list"
processed_wos_list<-map(wos_file_list, wos_clean_function)
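If you prefer base R, `lapply()` behaves analogously to `purrr::map()`; a minimal self-contained sketch (with an invented toy list):

```r
# lapply() is the base-R analogue of purrr::map(): it applies a function
# to each list element and returns a list of the same length
toy_list <- list(1:3, 4:6)
squared  <- lapply(toy_list, function(x) x^2)
squared
## [[1]]
## [1] 1 4 9
##
## [[2]]
## [1] 16 25 36
```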
Let’s print the contents of processed_wos_list
to check that the data frames have indeed been transformed in accordance with the specifications of the function:
# print contents of "processed_wos_list"
processed_wos_list
## [[1]]
## # A tibble: 964 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Miles, M Repres… CULTU… English
## 2 Dal Farra, R; Suarez, P RED CR… LEONA… English
## 3 Guy, S; Henshaw, V; Heidrich, O Climat… JOURN… English
## 4 Baztan, J; Vanderlinden, JP; Jaffres, L; Jorgensen, … Facing… CLIMA… English
## 5 Burke, M; Tickwell, D; Whitmarsh, L Partic… GLOBA… English
## 6 Rodder, S The Cl… MINER… English
## 7 Bentz, J; O'Brien, K ART FO… ELEME… English
## 8 Kim, S Art th… ARTS … English
## 9 Gabrys, J; Yusoff, K Arts, … SCIEN… English
## 10 Taplin, R CONTEM… LEONA… English
## # … with 954 more rows
##
## [[2]]
## # A tibble: 978 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 De Ollas, C; Morillon, R; Fotopoulos, V; Puertolas, … Facing… FRONT… English
## 2 Mansfield, LA; Nowack, PJ; Kasoar, M; Everitt, RG; C… Predic… NPJ C… English
## 3 Forrest, M A Refl… PAIDE… English
## 4 Chiodo, G; Garcia-Herrera, R; Calvo, N; Vaquero, JM;… The im… ENVIR… English
## 5 Williams, KD; Jones, A; Roberts, DL; Senior, CA; Woo… The re… CLIMA… English
## 6 Mascaro, G; Viola, F; Deidda, R Evalua… JOURN… English
## 7 Anel, JA High P… IBERG… English
## 8 Crosato, A; Grissetti-Vazquez, A; Bregoli, F; Franca… Adapta… JOURN… English
## 9 Barnett, TP; Pierce, DW; Schnur, R Detect… SCIEN… English
## 10 Buckland, D; Lertzman, R David … ENVIR… English
## # … with 968 more rows
##
## [[3]]
## # A tibble: 969 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Goswami, BB; Khouider, B; Phani, R; Mukhopadhyay, P;… Implem… JOURN… English
## 2 Berman, AL; Silvestri, GE; Tonello, MS On the… QUATE… English
## 3 Kulmala, M; Asmi, A; Lappalainen, HK; Carslaw, KS; P… Introd… ATMOS… English
## 4 Touchan, R; Anchukaitis, KJ; Meko, DM; Sabir, M; Att… Spatio… CLIMA… English
## 5 Zhang, L; Han, WQ; Hu, ZZ Interb… JOURN… English
## 6 Fallah, A; Sungmin, O; Orth, R Climat… HYDRO… English
## 7 Wagner, TJW; Eisenman, I How cl… GEOPH… English
## 8 Bloschl, G; Ardoin-Bardin, S; Bonell, M; Dorninger, … UNESCO… CLIMA… English
## 9 Hoang, T; Pulliat, G Green … URBAN… English
## 10 Sanz, T; Rodriguez-Labajos, B Does a… GEOFO… English
## # … with 959 more rows
##
## [[4]]
## # A tibble: 974 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Arce-Nazario, JA Transl… JOURN… English
## 2 Beevers, L; Popescu, I; Pregnolato, M; Liu, YX; Wrig… Identi… FRONT… English
## 3 Staehelin, J; Tummon, F; Revell, L; Stenke, A; Peter… Tropos… ATMOS… English
## 4 Zhang, YH; Seidel, DJ; Golaz, JC; Deser, C; Tomas, RA Climat… JOURN… English
## 5 Sudantha, BH; Warusavitharana, EJ; Ratnayake, GR; Ma… Buildi… 2018 … English
## 6 Franzen, C Shelte… ENVIR… English
## 7 van Oldenborgh, GJ; Drijfhout, S; van Ulden, A; Haar… Wester… CLIMA… English
## 8 Tankersley, MS; Ledford, DK Stingi… JOURN… English
## 9 Pistocchi, A; Sarigiannis, DA; Vizcaino, P Spatia… SCIEN… English
## 10 Biggs, HR; Desjardins, A Crafti… PROCE… English
## # … with 964 more rows
##
## [[5]]
## # A tibble: 964 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Esper, J; Klippel, L; Krusic, PJ; Konter, O; Raible,… Easter… CLIMA… English
## 2 Hallgren, AM (Un)st… KONST… English
## 3 Caron, LP; Hermanson, L; Dobbin, A; Imbers, J; Lledo… How Sk… BULLE… English
## 4 Sharif, K; Gormley, M Integr… WATER English
## 5 Santer, BD; Wigley, TML; Gaffen, DJ; Bengtsson, L; D… Interp… SCIEN… English
## 6 Sarospataki, M; Szabo, P; Fekete, A Future… LAND English
## 7 Pente, P Slow M… INTER… English
## 8 Marsh, C Tagore… ASIAT… English
## 9 Wang, XY; Dai, ZG; Zhang, EH; Fuyang, KE; Cao, YC; S… Tropos… ADVAN… English
## 10 Servera-Vives, G; Riera, S; Picornell-Gelabert, L; M… The on… PALAE… English
## # … with 954 more rows
##
## [[6]]
## # A tibble: 974 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Kim, JK Novel … MATER… English
## 2 Singh, SJ; Fischer-Kowalski, M; Chertow, M Introd… SUSTA… English
## 3 D'Andrea, F; Provenzale, A; Vautard, R; De Noblet-Du… Hot an… GEOPH… English
## 4 Grant, KM; Rohling, EJ; Bar-Matthews, M; Ayalon, A; … Rapid … NATURE English
## 5 Guan, B; Waliser, DE; Ralph, FM A mult… ANNAL… English
## 6 Hummel, M; Hoose, C; Pummer, B; Schaupp, C; Frohlich… Simula… ATMOS… English
## 7 Rodney, L Road S… SPACE… English
## 8 Jouili, JS Islam … COMPA… English
## 9 Rodriguez-Labajos, B Artist… CURRE… English
## 10 Joyette, ART; Nurse, LA; Pulwarty, RS Disast… DISAS… English
## # … with 964 more rows
##
## [[7]]
## # A tibble: 976 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Farnoosh, A; Azari, B; Ostadabbas, S Deep S… THIRT… English
## 2 Nahhas, TM; Kohl, H Tradit… ARABI… English
## 3 Yoon, J; Bae, S Perfor… SUSTA… English
## 4 Masnavi, MR; Gharai, F; Hajibandeh, M Explor… INTER… English
## 5 Hu, ZZ; Kumar, A; Jha, B; Zhu, JS; Huang, BH Persis… JOURN… English
## 6 Gonzalez, FR; Raval, S; Taplin, R; Timms, W; Hitch, M Evalua… NATUR… English
## 7 Hungilo, GG; Emmanuel, G; Emanuel, AWR Image … 2019 … English
## 8 Fernandes, LL; Lee, ES; McNeil, A; Jonsson, JC; Noui… Angula… ENERG… English
## 9 Ingwersen, W; Gausman, M; Weisbrod, A; Sengupta, D; … Detail… JOURN… English
## 10 Zilitinkevich, SS; Tyuryakov, SA; Troitskaya, YI; Ma… Theore… IZVES… English
## # … with 966 more rows
##
## [[8]]
## # A tibble: 977 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Chalal, ML; Benachir, M; White, M; Shrahily, R Energy… RENEW… English
## 2 El-Araby, E; Taher, M; El-Ghazawi, T; Le Moigne, J Protot… FPT 0… English
## 3 Silva, B; Prieto, B; Rivas, T; Sanchez-Biezma, MJ; P… Rapid … INTER… English
## 4 Sorbet, J; Fernandez-Peruchena, C; Zaversky, F; Chak… Perfor… JOURN… English
## 5 Smart, PDS; Thanammal, KK; Sujatha, SS A nove… SADHA… English
## 6 Colding, J; Wallhagen, M; Sorqyist, P; Marcus, L; Hi… Applyi… SMART… English
## 7 Antonini, E; Vodola, V; Gaspari, J; De Giglio, M Outdoo… ENERG… English
## 8 Woodworth, PL Differ… JOURN… English
## 9 Roh, JS; Kim, S All-fa… JOURN… English
## 10 Loch, CH; Terwiesch, C Accele… JOURN… English
## # … with 967 more rows
##
## [[9]]
## # A tibble: 975 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Rajer, A; Heard, C Pink p… CONSE… English
## 2 Hashioka, T; Vogt, M; Yamanaka, Y; Le Quere, C; Buit… Phytop… BIOGE… English
## 3 Halova, P; Kroupova, ZZ; Havlikova, M; Cechura, L; M… Provis… AGRAR… English
## 4 MARTI, C; BADIA, D CHARAC… ARID … English
## 5 Taveres-Cachat, E; Grynning, S; Almas, O; Goia, F Advanc… 11TH … English
## 6 Haida, M; Palacz, M; Bodys, J; Smolka, J; Gullo, P; … An exp… APPLI… English
## 7 Schmid, R Pocket… SAGE … English
## 8 Manzini, E; Cagnazzo, C; Fogli, PG; Bellucci, A; Mul… Strato… GEOPH… English
## 9 Chang, WL; Griffin, RJ; Dabdub, D Partit… PROCE… English
## 10 Bruhwiler, PA; Buyan, M; Huber, R; Bogerd, CP; Sznit… Heat t… JOURN… English
## # … with 965 more rows
##
## [[10]]
## # A tibble: 976 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Paietta, E Commen… BEST … English
## 2 Chen, N; Paek, SY; Lee, JY; Park, JH; Lee, SY; Lee, … High-p… ENERG… English
## 3 Bruckert, J; Hoshyaripour, GA; Horvath, A; Muser, LO… Online… ATMOS… English
## 4 Cha, MS; Park, JE; Kim, S; Han, SH; Shin, SH; Yang, … Poly(c… ENERG… English
## 5 Brewin, RJW; Sathyendranath, S; Platt, T; Bouman, H;… Sensin… EARTH… English
## 6 Lamane, H; Moussadek, R; Baghdad, B; Mouhir, L; Bria… Soil w… HELIY… English
## 7 Rodriguez, A; Alejo-Reyes, A; Cuevas, E; Robles-Camp… Numeri… MATHE… English
## 8 Cortesi, U; Ceccherini, S; Del Bianco, S; Gai, M; Ti… Advanc… ATMOS… English
## 9 Buerki, S; Jose, S; Yadav, SR; Goldblatt, P; Manning… Contra… PLOS … English
## 10 Nastula, J; Ponte, RM; Salstein, DA Compar… GEOPH… English
## # … with 966 more rows
##
## [[11]]
## # A tibble: 988 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Zhao, XY; Miao, CH Spatia… INTER… English
## 2 Pacheco-Torgal, F Eco-ef… CONST… English
## 3 Nasiyev, B; Gabdulov, M; Zhanatalapov, N; Makanova, G Study … RESEA… English
## 4 Helmig, D; Petrenko, V; Martinerie, P; Witrant, E; R… Recons… ATMOS… English
## 5 Voigt, C; Schumann, U; Graf, K CONTRA… PROGR… English
## 6 Hartman, S; Ogilvie, AEJ; Ingimundarson, JH; Dugmore… Mediev… GLOBA… English
## 7 Konstantiniuk, F; Krobath, M; Ecker, W; Tkadletz, M;… Influe… INTER… English
## 8 Yu, TT; Leng, H; Yuan, Q; Jiang, CY Vulner… JOURN… English
## 9 Karam, N; Khiat, A; Algergawy, A; Sattler, M; Weilan… Matchi… KNOWL… English
## 10 Di Prima, S; Castellini, M; Pirastru, M; Keesstra, S Soil W… WATER English
## # … with 978 more rows
##
## [[12]]
## # A tibble: 981 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Shah, ARY; Shah, KS; Shah, CR; Shah, MA State … RENEW… English
## 2 Lindskog, M; Ridal, M; Thorsteinsson, S; Ning, T Data a… ATMOS… English
## 3 Morciano, M; Fasano, M; Boriskina, SV; Chiavazzo, E;… Solar … ENERG… English
## 4 Lebo, ZJ; Morrison, H A Nove… JOURN… English
## 5 Wilczynska, D; Lysak-Radomska, A; Podczarska-Glowack… Effect… BMC S… English
## 6 Ballard, S Nonorg… FAR F… English
## 7 Pinkovetskaia, I; Gromova, T; Nikitina, I Produc… TARIH… English
## 8 Xiong, Y; Zhang, JP; Yan, Y; Sun, SB; Xu, XY; Higuer… Effect… SUSTA… English
## 9 Bui, DT; Hoang, ND; Pham, TD; Ngo, PTT; Hoa, PV; Min… A new … JOURN… English
## 10 Alloza, JA; Vallejo, R Restor… DESER… English
## # … with 971 more rows
##
## [[13]]
## # A tibble: 676 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Marchese, C; de la Guardia, LC; Myers, PG; Belanger,… Region… ECOLO… English
## 2 Blood, A; Starr, G; Escobedo, F; Chappelka, A; Staud… How Do… FORES… English
## 3 Bonannella, C; Chirici, G; Travaglini, D; Pecchi, M;… Charac… FIRE-… English
## 4 Lazarev, S; Kuiper, KF; Oms, O; Bukhsianidze, M; Vas… Five-f… GLOBA… English
## 5 Gliss, J; Mortier, A; Schulz, M; Andrews, E; Balkans… AeroCo… ATMOS… English
## 6 Mueller, CW; Gutsch, M; Kothieringer, K; Leifeld, J;… Bioava… SOIL … English
## 7 Cohen, M; Quigley, K Submar… AESTH… English
## 8 Nyamekye, AB; Dewulf, A; Van Slobbe, E; Termeer, K Inform… AFRIC… English
## 9 Myriokefalitakis, S; Groger, M; Hieronymus, J; Dosch… An exp… OCEAN… English
## 10 Viana, M; Hammingh, P; Colette, A; Querol, X; Degrae… Impact… ATMOS… English
## # … with 666 more rows
Recall that we can also extract a given list element using double-bracket notation. For example, we can extract the third element of processed_wos_list
with the following:
# extract one of the list elements from "processed_wos_list"
processed_wos_list[[3]]
## # A tibble: 969 × 4
## Authors Article Source Language
## <chr> <chr> <chr> <chr>
## 1 Goswami, BB; Khouider, B; Phani, R; Mukhopadhyay, P;… Implem… JOURN… English
## 2 Berman, AL; Silvestri, GE; Tonello, MS On the… QUATE… English
## 3 Kulmala, M; Asmi, A; Lappalainen, HK; Carslaw, KS; P… Introd… ATMOS… English
## 4 Touchan, R; Anchukaitis, KJ; Meko, DM; Sabir, M; Att… Spatio… CLIMA… English
## 5 Zhang, L; Han, WQ; Hu, ZZ Interb… JOURN… English
## 6 Fallah, A; Sungmin, O; Orth, R Climat… HYDRO… English
## 7 Wagner, TJW; Eisenman, I How cl… GEOPH… English
## 8 Bloschl, G; Ardoin-Bardin, S; Bonell, M; Dorninger, … UNESCO… CLIMA… English
## 9 Hoang, T; Pulliat, G Green … URBAN… English
## 10 Sanz, T; Rodriguez-Labajos, B Does a… GEOFO… English
## # … with 959 more rows
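The distinction between double-bracket and single-bracket indexing is worth keeping straight; a small self-contained example (the toy list is invented for illustration):

```r
# [[ ]] extracts a single element from a list, while [ ] returns a
# (possibly one-element) sub-list
l <- list(a = 1:3, b = letters[1:2])
l[[1]]     # the vector stored in the first element
## [1] 1 2 3
l[1]       # a one-element list containing that vector
## $a
## [1] 1 2 3
```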
You’ll often need to get data out of R once you’re finished with your analysis, and this section reviews some techniques to do so.
Let’s say you have a data frame in R that you’d like to export as a CSV file to your working directory. You can use the write_csv()
function; below, we export the data frame assigned to the pt_africa
object we created above. The first argument of write_csv()
is the name of the relevant data frame object, and the second argument (“pt_africa.csv”) is the desired file name for the exported object:
# exports pt_africa object to working directory
write_csv(pt_africa, "pt_africa.csv")
Once the code runs, check to make sure that there is a file named “pt_africa.csv” exported to your working directory.
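Base R offers a similar function, `write.csv()`; note that, unlike `readr::write_csv()`, it writes row names by default. A minimal sketch, written to a temporary directory so it runs anywhere (the file name "demo.csv" is invented):

```r
# base-R equivalent of readr::write_csv(); row.names = FALSE suppresses
# the row-name column that write.csv() would otherwise add
df   <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- file.path(tempdir(), "demo.csv")
write.csv(df, path, row.names = FALSE)
file.exists(path)
## [1] TRUE
```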
Let’s say we have multiple data frames that we’d like to efficiently export. For example, after generating our list of transformed Web of Science data frames (processed_wos_list
), let’s say we want to export all of the data frames to our working directory. Our first step is to create a vector of desired file names.
Let’s first take the initial wos_files
vector, and remove the “.csv” extension; we’ll assign the result to a new object named base_names
:
# Removes the ".csv" suffix from the strings in "wos_files" and then assigns the modified character vector to a new object named "base_names"
base_names<-str_remove(wos_files, ".csv")
Let’s print the contents of base_names
:
# prints contents of "base_names"
base_names
## [1] "ClimateAndArt_01" "ClimateAndArt_02" "ClimateAndArt_03" "ClimateAndArt_04"
## [5] "ClimateAndArt_05" "ClimateAndArt_06" "ClimateAndArt_07" "ClimateAndArt_08"
## [9] "ClimateAndArt_09" "ClimateAndArt_10" "ClimateAndArt_11" "ClimateAndArt_12"
## [13] "ClimateAndArt_13"
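One caution: `str_remove()` interprets its pattern as a regular expression, in which `.` matches any character. That happens to be harmless here, but an escaped, anchored pattern is safer in general; a base-R sketch with `sub()`:

```r
# sub() with an escaped, anchored pattern: "\\." matches a literal dot,
# and "$" ensures only a trailing ".csv" is removed
files <- c("ClimateAndArt_01.csv", "ClimateAndArt_02.csv")
sub("\\.csv$", "", files)
## [1] "ClimateAndArt_01" "ClimateAndArt_02"
```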
We’ll now use the paste0()
function to append the suffix "_modified.csv" to the base names, so that we know that these are the modified files:
# appends the suffix "_modified.csv" to the strings in the "base_names" character vector, and assigns the resulting character vector to a new object named "processed_wos_list_names"
processed_wos_list_names<-paste0(base_names, "_modified.csv")
Let’s now print our final vector of file names:
# prints contents of "processed_wos_list_names"
processed_wos_list_names
## [1] "ClimateAndArt_01_modified.csv" "ClimateAndArt_02_modified.csv"
## [3] "ClimateAndArt_03_modified.csv" "ClimateAndArt_04_modified.csv"
## [5] "ClimateAndArt_05_modified.csv" "ClimateAndArt_06_modified.csv"
## [7] "ClimateAndArt_07_modified.csv" "ClimateAndArt_08_modified.csv"
## [9] "ClimateAndArt_09_modified.csv" "ClimateAndArt_10_modified.csv"
## [11] "ClimateAndArt_11_modified.csv" "ClimateAndArt_12_modified.csv"
## [13] "ClimateAndArt_13_modified.csv"
Now, we’ll use the walk2()
function to iteratively write out the data frames in processed_wos_list
to disk, using the names specified in processed_wos_list_names
:
# uses the walk2 function to iteratively apply the "write_csv" function, using the data frames in "processed_wos_list" and the file names in "processed_wos_list_names" as arguments; the files are written out to the working directory
walk2(processed_wos_list, processed_wos_list_names, write_csv)
The code above takes the first data frame in processed_wos_list
and the first desired file name in processed_wos_list_names
, passes them as arguments to the write_csv
function, and thereby writes a new CSV file to the working directory that contains the first data frame and is named after the first string in processed_wos_list_names
(“ClimateAndArt_01_modified.csv”); it then takes the second data frame in processed_wos_list
and the second desired file name in processed_wos_list_names
and writes out “ClimateAndArt_02_modified.csv”; and so on for the remaining data frames and file names.
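The same pairing of data frames with file names can be sketched as a base-R for loop; the toy data frames and file names below are invented, and the files are written to a temporary directory so the sketch runs anywhere:

```r
# base-R sketch of what walk2() does: iterate over the data frames and
# file names in parallel, writing one CSV per pair
dfs    <- list(data.frame(x = 1), data.frame(x = 2))
fnames <- c("first_modified.csv", "second_modified.csv")
paths  <- file.path(tempdir(), fnames)
for (i in seq_along(dfs)) {
  write.csv(dfs[[i]], paths[i], row.names = FALSE)
}
all(file.exists(paths))
## [1] TRUE
```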
Check your working directory to make sure that all 13 of the modified files have been written out to disk.
To export ggplot objects, you can use the package’s ggsave()
function. For example, if we wanted to save the visualization associated with cgexp_africa_ascending_inverted
, we could use the following, where the first argument is the desired file name for the exported visualization, the second argument is the name of the corresponding object, and the optional “width” and “height” arguments specify the desired dimensions for the exported visualization:
# exports "cgexp_africa_ascending_inverted" to working directory as png file
ggsave("africa_bar.png", cgexp_africa_ascending_inverted, width=10, height=5)
If you want a different file format, simply change the extension. For example, to save the file as a PDF instead of a PNG, change the extension of the exported file’s desired file name to “.pdf”:
# exports "cgexp_africa_ascending_inverted" to working directory as pdf file
ggsave("africa_bar.pdf", cgexp_africa_ascending_inverted, width=10, height=5)
If you have several visualizations that you want to save in a single file, you can use a graphics device. Below, we use a PDF graphics device to create a single PDF file (named “workshop_visualizations.pdf”) that contains the scatter_cgexp_trade_grouped
and cgexp_africa_ascending_inverted
visualizations. When using a graphics device, always remember to turn it off (using dev.off()
) after your file has been exported.
# Uses graphics device to export "scatter_cgexp_trade_grouped" and "cgexp_africa_ascending_inverted" to the working directory as a single PDF file named "workshop_visualizations"
pdf("workshop_visualizations.pdf", width=12, height=5)
scatter_cgexp_trade_grouped
## Warning: Removed 3 rows containing missing values (geom_point).
cgexp_africa_ascending_inverted
dev.off()
## quartz_off_screen
## 2