5 Data Structures
We now turn to a brief overview of some important data structures that help us to work with data in R. We will consider three data structures that are particularly useful: vectors, data frames, and lists. Note that this is not an exhaustive treatment of data structures in R; there are other structures, such as matrices and arrays, that are also important. However, we will limit our discussion to the data structures that are essential for getting started with data-based social scientific research in R.
5.1 Vectors
In R, a vector is a sequence of values. A vector is created using the c() function. For example, let’s make a vector with some arbitrary numeric values:
## [1] 5 7 55 32
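The output above corresponds to a call along the following lines. This is a minimal sketch, with the values taken from the printed result:

```r
# combines four numeric values into a single vector and prints it
c(5, 7, 55, 32)
```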
If we plan to work with a numeric vector like this one again later in our workflow, it makes sense to assign it to an object. Below, we assign a similar vector to an object, which we’ll call arbitrary_values:
# assigns vector of arbitrary values to new object named "arbitrary_values"
arbitrary_values <- c(5, 7, 55.6, 32.5)
Now, whenever we want to print the vector assigned to the arbitrary_values object, we can simply type the name of the object:
# prints contents of "arbitrary_values"
arbitrary_values
## [1] 5.0 7.0 55.6 32.5
It is possible to carry out mathematical operations with numeric vectors. For instance, to double the values in arbitrary_values, we can simply multiply it by 2, which yields a new vector in which each element is twice the corresponding element of arbitrary_values. Below, we create this doubled vector, assign it to a new object named arbitrary_values_2x, and print its contents:
# creates a new vector that doubles the values in "arbitrary_values" and assigns it to a new object named "arbitrary_values_2x"
arbitrary_values_2x <- arbitrary_values * 2
# prints contents of "arbitrary_values_2x"
arbitrary_values_2x
## [1] 10.0 14.0 111.2 65.0
Now, let’s say we want to add different vectors together; the code below creates a new vector by adding together arbitrary_values and arbitrary_values_2x:
# adds "arbitrary_values" vector and "arbitrary_values_2x" vector
arbitrary_values + arbitrary_values_2x
## [1] 15.0 21.0 166.8 97.5
Note that each element of the resulting vector printed above is the sum of the corresponding elements in arbitrary_values and arbitrary_values_2x.
Other arithmetic operations on numeric vectors are also possible, and you may wish to explore these on your own as an exercise.
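As a starting point for that exercise, here is a minimal sketch of a few other element-wise operations. The vector is re-created so that the snippet runs on its own, and the object names other than arbitrary_values are hypothetical, introduced only for illustration:

```r
# re-creates the vector used above so that this snippet is self-contained
arbitrary_values <- c(5, 7, 55.6, 32.5)

# subtracts 1 from each element (hypothetical object name)
values_minus_one <- arbitrary_values - 1

# divides each element by 2 (hypothetical object name)
values_halved <- arbitrary_values / 2

# squares each element (hypothetical object name)
values_squared <- arbitrary_values^2

# prints the three resulting vectors
values_minus_one
values_halved
values_squared
```

In each case the operation is applied to every element separately, just as with multiplication and addition.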
In many cases, it is useful to extract a specific element from a vector. Each element in a vector has an index number, starting with 1: the first element has an index of 1, the second element has an index of 2, and so on. We can use these index values to extract the elements we want by placing the desired index in square brackets after the name of the vector object. For example, let’s say we want to extract the 3rd element of the vector in arbitrary_values. We can do so with the following:
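A minimal sketch of this extraction, with the vector re-created so that the snippet runs on its own:

```r
# re-creates the vector so that this snippet is self-contained
arbitrary_values <- c(5, 7, 55.6, 32.5)

# extracts the 3rd element of "arbitrary_values"
arbitrary_values[3]
```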
## [1] 55.6
It is also possible to extract a range of values from a vector using index values. For example, let’s say we want to extract a new vector consisting of the second, third, and fourth elements of arbitrary_values; we can do so with the following:
# extracts a new vector consisting of the 2nd, 3rd, and 4th elements of the existing "arbitrary_values" vector
arbitrary_values[2:4]
## [1] 7.0 55.6 32.5
Thus far, we have been working with numeric vectors, where each of the vector’s elements is a numeric value, but it is also possible to create vectors in which the elements are strings (i.e. text). Such vectors are known as character vectors. For example, the code below creates a character vector of the first four months of the year, and assigns it to a new object named months_four:
# creates character vector whose elements are the first four months of the year, and assigns the vector to a new object named "months_four"
months_four <- c("January", "February", "March", "April")
Let’s now print the character vector assigned to months_four:
# prints contents of "months_four"
months_four
## [1] "January" "February" "March" "April"
We can extract elements from character vectors using index values, just as we did for numeric vectors. For example:
# extracts the second element of "months_four" object (i.e. the "February" string)
months_four[2]
## [1] "February"
# subsets the second and third elements of "months_four" object (i.e. the "February" and "March" strings, which are extracted as a new character vector)
months_four[2:3]
## [1] "February" "March"
5.2 Data Frames
The data frame structure is the workhorse of data analysis in R. A data frame resembles a table, of the sort you might generate in a spreadsheet application.
Often, the most important (and arduous) step in a data analysis workflow is to assemble disparate strands of data into a tractable data frame. What does it mean for a data frame to be “tractable”? One way to define this concept more precisely is to appeal to the concept of “tidy” data, which is often referenced in the data science world. Broadly speaking, a “tidy” data frame is a table in which:
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
We will work extensively with data frames later in the workshop, but let’s generate a simple data frame from scratch, and assign it to a new object. We will generate a data frame containing “dummy” country-level data on basic economic, geographic, and demographic variables, and assign it to a new object named country_df. The data frame is created with the data.frame() function, which is built into R. Column names and the corresponding column values are passed to data.frame() in the manner below, and the function binds these columns together into a table:
# Creates a dummy country-level data frame
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))
To observe the structure of the table, we can print it to the R console by simply typing the name of the object to which it has been assigned:
# prints contents of "country_df"
country_df
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
One nice feature of RStudio is that instead of simply printing our data frames to the console, we can view a nicely formatted version of a data frame by passing the name of the data frame object to the View() function. For example, the code below will bring up the country_df data frame as a new tab in RStudio:
# opens "country_df" in the RStudio data viewer
View(country_df)
Note the “tidy” features of this simple data frame:
- Each of the variables (i.e. GDP, Population, Continent) has its own column
- Each of the (country-level) observations has its own row
- Each of the values (i.e. country-level information about a given variable) has its own distinct cell
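The first two features can be checked quickly with a few base R functions. The following is a minimal sketch, with country_df re-created so that the snippet runs on its own:

```r
# re-creates the dummy country-level data frame so that this snippet is self-contained
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))

# number of rows, i.e. one per country-level observation
nrow(country_df)

# number of columns, i.e. one per variable (including the country identifier)
ncol(country_df)

# names of the columns
names(country_df)
```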
We will explore data frames, and the process of extracting information from them, at greater length in subsequent sections.
5.3 Lists
In R, a list is a data structure that allows us to conveniently store a variety of different objects, of various types. For example, we can use a list to store vectors, data frames, visualizations and graphs; basically any R object you can think of! It is also possible to store a list within a list.
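To illustrate the last point, here is a minimal sketch of a list stored within a list. The object names inner_list and outer_list, and their contents, are hypothetical and used only for illustration:

```r
# creates a small list (hypothetical example object)
inner_list <- list(1:3, "a")

# creates a list whose third element is the list created above
outer_list <- list(c(5, 7), c("January", "February"), inner_list)

# confirms that the third element of "outer_list" is itself a list
class(outer_list[[3]])
```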
Lists allow us to keep track of the various objects we create, and are therefore a useful data management tool. In addition, lists are very helpful to use when we want to perform iterative operations across multiple objects.
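One common way to iterate over a list in base R is the lapply() function, which applies a function to each element of a list and returns the results as a new list. The following is a minimal sketch; score_list, mean_list, and the values they contain are hypothetical:

```r
# creates a list of three numeric vectors (hypothetical example data)
score_list <- list(c(1, 2, 3), c(10, 20), c(4, 4, 4, 4))

# applies the mean() function to each element of "score_list" and stores the results in a new list
mean_list <- lapply(score_list, mean)

# prints the list of means
mean_list
```

The result is a list with one element per input vector, so the same call works no matter how many vectors the list contains.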
We can create lists in R using the list() function; the arguments to this function are the objects that we want to include in the list. In the code below, we’ll create a list (assigned to an object named example_list) that contains some of the objects we created earlier in the lesson: the arbitrary_values vector, the months_four vector, and the country_df data frame.
# creates list whose elements are the "arbitrary_values" numeric vector, the "months_four" character vector, and the "country_df" data frame, and assigns it to a new object named "example_list"
example_list <- list(arbitrary_values, months_four, country_df)
Now that we’ve created our list object, let’s print out its contents:
# prints contents of "example_list"
example_list
## [[1]]
## [1] 5.0 7.0 55.6 32.5
##
## [[2]]
## [1] "January" "February" "March" "April"
##
## [[3]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
As you can see, our list contains each of the specified objects within a single, unified structure. We can access a specific element within a list using its index number, in much the same way we did for vectors. When extracting a single element from a list, the convention is to enclose the index number of the desired element in double square brackets. For example, if we want to extract the country-level data frame from example_list, we can use the following:
# extracts country-level data frame from "example_list"; the country-level data frame is the third element in "example_list"
example_list[[3]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
If we want to subset a list, and extract more than one list element as a separate list, we can do so by creating a vector of the index values of the desired elements, and enclosing it in single brackets after the name of the list object. For example, if we wanted to generate a new list that contained only the first and third elements of example_list (the numeric vector of arbitrary values and the data frame), we would use the following syntax:
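A minimal sketch of that syntax, with the earlier objects re-created so that the snippet runs on its own:

```r
# re-creates the objects from earlier in the lesson so that this snippet is self-contained
arbitrary_values <- c(5, 7, 55.6, 32.5)
months_four <- c("January", "February", "March", "April")
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))
example_list <- list(arbitrary_values, months_four, country_df)

# extracts the 1st and 3rd elements of "example_list" as a new list
example_list[c(1, 3)]
```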
## [[1]]
## [1] 5.0 7.0 55.6 32.5
##
## [[2]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
While list elements are not automatically named, we can name our list elements using the names() function. The first step is to define a character vector of desired names. We can specify any names we’d like, but for the sake of illustration, let’s say we want to name the first list element “element1”, the second list element “element2”, and the third list element “element3”. Let’s create a vector of our desired names, and assign it to an object named name_vector:
# creates a character vector of desired names for list elements, and assigns it to a new object named "name_vector"
name_vector <- c("element1", "element2", "element3")
Now, we’ll assign the names in name_vector to the list elements in example_list with the following:
# assigns names from "name_vector" to list elements in "example_list"
names(example_list) <- name_vector
Let’s now print the contents of example_list:
# prints contents of "example_list"
example_list
## $element1
## [1] 5.0 7.0 55.6 32.5
##
## $element2
## [1] "January" "February" "March" "April"
##
## $element3
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
Note that the list elements now have names attached to them; the first character string in name_vector is assigned as the name of the first element in example_list, the second character string in name_vector is assigned as the name of the second element in example_list, and so on.
Practically speaking, we can now extract list elements using the assigned names. For example, if we want to extract the data frame from example_list, we could do so by its assigned name (“element3”), as follows:
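One way to do this is to place the name, as a string, inside double square brackets. The sketch below re-creates the earlier objects so that it runs on its own. The shorter form example_list$element3 returns the same object; which of the two forms the original example used is not shown, so treat the choice here as an assumption:

```r
# re-creates the objects from earlier in the lesson so that this snippet is self-contained
arbitrary_values <- c(5, 7, 55.6, 32.5)
months_four <- c("January", "February", "March", "April")
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))
example_list <- list(arbitrary_values, months_four, country_df)
names(example_list) <- c("element1", "element2", "element3")

# extracts the data frame from "example_list" by its assigned name
example_list[["element3"]]
```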
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
Note that even after assigning names to list elements, you can still extract elements by their index value, if you would prefer to do so:
# extracts the third element of "example_list" by its index value
example_list[[3]]
##     Country   GDP Population     Continent
## 1 Country A  8000       2000 South America
## 2 Country B 30000       5400        Europe
## 3 Country C 23500      10000 North America
5.4 Identifying Data Structures
It is useful to be able to quickly identify the data structure of a given object. Indeed, one way that things can go wrong when processing or analyzing data in R is that a given function expects a certain type of data structure as an input, but encounters something else, which will cause the function to throw an error or perform unexpectedly. In such circumstances, it is especially useful to be able to quickly double-check the data structure of a given object.
We can quickly ascertain this information by passing a given object as an argument to the class() function, which will provide information about the object’s data structure.
For example, let’s say we want to confirm that example_list is indeed a list:
# checks the data structure of "example_list"
class(example_list)
## [1] "list"
Let’s take another example:
## [1] "character"
Note that we can read “character” as “character vector”.
Similarly, we can read “numeric” as “numeric vector”:
## [1] "numeric"
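The two results above are what class() returns for a character vector and a numeric vector. The sketch below assumes, for illustration only, that the objects being checked are months_four and arbitrary_values (the original calls are not shown), and adds country_df for comparison:

```r
# re-creates the objects from earlier in the lesson so that this snippet is self-contained
months_four <- c("January", "February", "March", "April")
arbitrary_values <- c(5, 7, 55.6, 32.5)
country_df <- data.frame(Country = c("Country A", "Country B", "Country C"),
                         GDP = c(8000, 30000, 23500),
                         Population = c(2000, 5400, 10000),
                         Continent = c("South America", "Europe", "North America"))

# a character vector is reported as "character"
class(months_four)

# a numeric vector is reported as "numeric"
class(arbitrary_values)

# a data frame is reported as "data.frame"
class(country_df)
```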