In this lesson, we’ll continue to develop an understanding of foundational skills and concepts that will allow you to use R effectively for applied social scientific research. Our goal is to develop an elementary proficiency in functional programming, which will allow you to fully exploit R’s capabilities when you use it for your research data tasks. A function is essentially a small program that takes an input (or series of inputs), runs the input(s) through an algorithm, and produces output(s). Many functions come pre-programmed into R. R packages are essentially open-source, user-written libraries of interrelated functions united by some theme, which we can draw on to extend the range of functions available to us. And finally, we can write our own custom functions.
Given the enormous variety and sophistication of the R package ecosystem, you will not have to become an expert programmer to work with data in R; rather, you can draw on the functions others have written to implement virtually any data-related task you could imagine. However, developing a basic understanding of how to write your own functions is nevertheless important, for a variety of reasons:
With those considerations in mind, we’ll learn more in this lesson about built-in functions in R and R packages, but our main purpose is to learn how to write some simple functions of our own. We’ll also learn more about how to use functions from the purrr package (part of the tidyverse suite) to iteratively apply our functions to multiple objects, which can help to automate various data processing workflows. Iteration is the process of applying a function to each element in a vector, list, or data frame; it is a key part of functional programming that can save you enormous amounts of time and energy.
As we have noted, functions are programmatic constructs that take in a set of inputs, and return an output or set of outputs after applying an algorithm, or “recipe” to the set of inputs. The input(s) of a function are often called argument(s). Many functions come programmed into R. For example, consider the sum() function, which takes a numeric vector as an input, and returns the sum of those elements as an output. To see how this works, we’ll first create a toy numeric vector, sample_vector:
# creates a new numeric vector and assigns it to a new object named "sample_vector"
sample_vector<-c(5, 11, 5.6, 8)
# prints contents of "sample_vector"
sample_vector
[1]  5.0 11.0  5.6  8.0
Now, we’ll pass sample_vector as an argument to the sum() function. The sum() function takes this argument, applies the algorithm programmed into it to calculate the sum of the vector elements, and returns that sum as the output:
# calculates sum of vector elements using built-in "sum" function
sum(sample_vector)
[1] 29.6
Another example of a built-in function is prod(), which returns the product of vector elements. Below, we pass sample_vector as an argument to this function, and it returns the product of the vector elements:
# calculates product of vector elements using built-in "prod" function
prod(sample_vector)
[1] 2464
Now, let’s use the built-in mean() function to calculate the mean of the vector elements in sample_vector. We pass sample_vector as an argument to the mean() function, which performs the calculation based on its internal programming, and returns the mean of the vector elements as the output:
# calculates the mean of vector elements using built-in "mean" function
mean(sample_vector)
[1] 7.4
Let’s try applying the built-in median() function to sample_vector:
# calculates the median of vector elements
median(sample_vector)
[1] 6.8
Thus far, we’ve been exploring elementary functions that perform mathematical calculations on numeric vectors. Let’s consider a function that’s relevant for character vectors. The nchar() function takes a string as an argument, and returns the number of characters in that string as the output. Below, we’ll pass the argument “Hello, World!” to nchar(), which returns the number of characters in that argument:
# calculates the number of characters in a string using built-in "nchar" function
nchar("Hello, World!")
[1] 13
Now, let’s consider the colnames() function, which is a useful built-in function that takes a data frame as an argument, and returns the names of all of its columns. Below, the argument to the colnames() function is mtcars, which is a dataset that’s built into R (R comes installed with several built-in datasets that are very helpful for practicing new skills and experimenting with code). If you’d like, you can get a sense of the mtcars dataset by passing it to the View() function, or by viewing its documentation with ?mtcars.
# extracts column names for "mtcars" dataset (which is built into R) using the built-in
# "colnames" function
colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
We can also extract a data frame’s row names using the built-in rownames() function. Below, we’ll pass mtcars as an argument to rownames():
# extracts row names for "mtcars" dataset
rownames(mtcars)
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
R has a large number of useful built-in functions, but we are not limited to these. By drawing on R packages, we can greatly expand the number of functions at our disposal.
R packages are essentially pre-written collections of functions organized around a given theme. One of the big advantages of using R is that it has a very large user community among social scientists, statisticians, and digital humanists, who frequently publish R packages. One might think of packages as workbooks of sorts, which contain a well-integrated set of R functions, scripts, data, and documentation; these “workbooks” are designed to facilitate certain tasks or implement useful procedures. These packages are then shared with the broader R user community, and at this point, anyone who needs to accomplish the tasks to which the package addresses itself can use the package in the context of their own projects. The ability to use published packages considerably simplifies the work of applied data research using R; it means that we rarely have to write code entirely from scratch, and can build on the code that others have published in the form of packages. This allows applied researchers to focus on substantive problems, without having to get too bogged down in complicated programming tasks.
In this session, we will use various functions from the tidyverse, which is actually a suite of several different packages that implement a variety of data science tasks. When we install the tidyverse, we simultaneously install the entire range of these packages. The main tidyverse package we’ll use today is purrr, which contains a variety of functions that facilitate the iterative application of functions to multiple input arguments.
To install a package in R, we can use the install.packages() function. In the code block below, the name of the package we want to install (here, the tidyverse suite) is enclosed within quotation marks and placed within parentheses after install.packages. Running the code below will download the tidyverse suite of packages to our computer:
install.packages("tidyverse")
At this point, the tidyverse suite of packages should be installed on your computer, but the packages are not yet ready for use. Before we can use our packages, we must load them into our environment. We can think of the process of loading installed packages into a current R environment as analogous to opening up an application on your phone or computer after it has been installed (even after an application has been installed, you can’t use it until you open it!). To load (i.e. “open”) an R package, we pass the name of the package we want to load as an argument to the library() function. For example, if we want to load our tidyverse packages into the current environment, we can type:
library(tidyverse)
At this point, the core tidyverse packages, including purrr, are loaded and available for us to use. In the next few sessions, we’ll install and load additional packages, but this is all we need for now.
One important thing to note regarding the installation and loading of packages is that we only have to install packages once; after a package is installed, there is usually no need to subsequently reinstall it. However, we must load the packages we need (using the library() function) every time we open a new R session. In other words, if we were to close RStudio at this point and open it up later, we would not need to install the tidyverse again, but we would need to load it again with the library() function.
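The install-once, load-every-session pattern can be sketched in code. This is a minimal illustration rather than part of the lesson's own workflow: it uses base R's requireNamespace() to check whether a package is already installed, and the helper name is_installed is a hypothetical one introduced here. The check is run on the stats package, which ships with R, so the example runs even where the tidyverse is absent; the tidyverse lines are left commented out because they would download packages.

```r
# hypothetical helper (not from the lesson): returns TRUE if a package is installed
is_installed <- function(package_name) {
  requireNamespace(package_name, quietly = TRUE)
}

# "stats" ships with R, so this check should succeed
is_installed("stats")

# the same idea applied to the tidyverse: install only if it is missing,
# then load it (loading is required in every new R session)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
# library(tidyverse)
```

Guarding the installation step in this way is one option for scripts that are run repeatedly, since it avoids re-downloading a package that is already present.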
As we mentioned earlier, a function is a programming construct that takes in a set of inputs (also known as arguments), manipulates those inputs/arguments in a specific way (within what’s known as the body of the function), and returns an output that is the product of how those inputs are manipulated in the body of the function. It is much like a recipe, where the recipe’s ingredients are analogous to a function’s inputs, the instructions about how to combine and process those ingredients are analogous to the body of the function, and the end product of the recipe (for example, a cake) is analogous to the function’s output.
In the previous section on built-in functions, we specified the functions’ arguments and noted their outputs; the actual “recipes” were hidden from view. The best way to learn how these under-the-hood “recipes” work is to develop our own. To that end, we’ll now learn how to write some simple functions, and develop some intuition for how they are put together. Writing your own functions can be challenging, so we’ll develop our intuition by starting with a very simple example. In particular, we begin by writing a one-argument function; we then turn to writing a two-argument function; and then consider a function with more than two arguments. We’ll also explore writing functions with conditional statements embedded into them, so that they execute differently based on whether or not certain conditions are met.
Let’s say we have a large collection of temperature data, measured in Fahrenheit, and we want to convert these data to Celsius. Recall that the formula to convert from Fahrenheit to Celsius is the following, where “C” represents temperature in Celsius, and “F” represents temperature in Fahrenheit:
# Fahrenheit to Celsius formula, where C is Celsius output and F is Fahrenheit input
(F-32)*(5/9)=C
As we discussed before, at its most basic level, R is a calculator. If, for example, one of our Fahrenheit measurements is 55 degrees, we can convert it to Celsius by plugging 55 into the conversion formula:
# Converts 55 degrees Fahrenheit to Celsius
(55-32)*(5/9)
[1] 12.77778
This is easy enough, but if we have a large amount of temperature data that requires processing, we wouldn’t want to carry out this calculation for each measurement in our data collection. The first step in allowing us to carry out this conversion operation at scale is to write a function. Let’s see how we can wrap the Fahrenheit-Celsius formula above into a function:
# creates fahrenheit to celsius conversion function and assigns it to a new object named "fahrenheit_to_celsius_converter"
fahrenheit_to_celsius_converter<-function(fahrenheit_input){
  celsius_output<-(fahrenheit_input-32)*5/9
  return(celsius_output)
}
Let’s unpack the code above, which we used to create our function:
We declare that we are creating a new function with the word function; within the parentheses after function, we specify the function’s argument(s). Here, the function’s argument is an input named fahrenheit_input. The name of the argument(s) is arbitrary, and can be anything you like; ideally, its name should be informed by relevant context. Here, the argument/input to the function is a temperature value expressed in degrees Fahrenheit, so the name “fahrenheit_input” describes the nature of this input.
After enclosing the function’s arguments within parentheses, we type an opening curly brace {, and then define the body of the function (i.e. the recipe), which specifies how we want to transform this input. In particular, we take fahrenheit_input, subtract 32, and then multiply by 5/9, which converts the input to the Celsius temperature scale. We tell R to assign this transformed value to a new object within the function, named celsius_output. Objects defined within a function are treated differently than objects defined outside of it; we’ll return to this topic at the end of the session, but this is worth flagging and keeping in mind right now.
In the function’s final line, return(celsius_output), we specify the value we want the function to return. Here, we are saying that we want the function to return the value that was assigned to celsius_output. We then close the function by typing a closing curly brace } below the return statement.
Just as we can assign data or visualizations to objects that allow us to subsequently retrieve the outputs of our code, so too with functions. Here, we’ll assign the function we have just written to an object named fahrenheit_to_celsius_converter.
After running that code, we can use the newly created fahrenheit_to_celsius_converter() function to perform our Fahrenheit to Celsius transformations. Let’s say we have a Fahrenheit value of 68, and want to transform it to Celsius:
# tests function using an input of 68 degrees fahrenheit
fahrenheit_to_celsius_converter(fahrenheit_input=68)
[1] 20
Above, we passed the argument fahrenheit_input=68 to the fahrenheit_to_celsius_converter() function that we created; the function then took this value (68), plugged it into “fahrenheit_input” within the function and assigned the resulting value to “celsius_output”; it then returned the value of “celsius_output” (20) back to us. Note that while it’s good practice to label one’s arguments, as we did above (fahrenheit_input=68), it isn’t strictly necessary. For example, we could just enter a numeric argument for the temperature input, and the function will work:
# uses "fahrenheit_to_celsius_converter" function using an input of 22 degrees fahrenheit
fahrenheit_to_celsius_converter(22)
[1] -5.555556
In short, we can specify any value for the “fahrenheit_input” argument; this value will be substituted for “fahrenheit_input” in the expression celsius_output<-(fahrenheit_input-32)*5/9, after which the value of celsius_output will be returned to us.
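One way to build confidence in a newly written function is to test it against inputs whose outputs are already known. The sketch below repeats the lesson's function so that it runs on its own, and checks it against two familiar reference points: water freezes at 32 degrees Fahrenheit (0 degrees Celsius) and boils at 212 degrees Fahrenheit (100 degrees Celsius). The object names freezing_point and boiling_point are illustrative choices introduced here.

```r
# the conversion function from the lesson, repeated so this block runs on its own
fahrenheit_to_celsius_converter <- function(fahrenheit_input) {
  celsius_output <- (fahrenheit_input - 32) * 5 / 9
  return(celsius_output)
}

# freezing point of water: 32 degrees Fahrenheit should convert to 0 degrees Celsius
freezing_point <- fahrenheit_to_celsius_converter(32)

# boiling point of water: 212 degrees Fahrenheit should convert to 100 degrees Celsius
boiling_point <- fahrenheit_to_celsius_converter(fahrenheit_input = 212)

freezing_point
boiling_point
```

If either value came back wrong, that would suggest a mistake in the body of the function, such as a misplaced parenthesis in the formula.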
Let’s extend what we learned above by writing a function that takes two arguments, rather than one. The principles are the same. To see this, let’s define a function that takes export and import values as arguments, and returns a value for net exports (defined as the difference between total exports and total imports). Below, we assign this function to an object named net_exports_calculation:
# writes function that takes export and import values as inputs, and returns a value for net exports; function is assigned to a new object named "net_exports_calculation"
net_exports_calculation<-function(exports, imports){
  net_export_value<-exports-imports
  return(net_export_value)
}
In essence, the function has two arguments, “exports” and “imports”, that are supplied by the user; the body of the function takes these arguments, subtracts the supplied value of imports from exports, and assigns this result to the object net_export_value, which it then returns as the output. Let’s go ahead and test the function:
# tests the "net_exports_calculation" function in a case where exports are 133, and imports are 55
net_exports_calculation(exports=133, imports=55)
[1] 78
The function works as expected. Note that if we switch the order in which we supply the arguments, the function continues to work as expected, so long as we label the arguments:
# tests the "net_exports_calculation" function in a case where exports are 133, and imports are 55; reverses order in which inputs are supplied
net_exports_calculation(imports=55, exports=133)
[1] 78
However, if the arguments are not labelled, the order in which they are supplied does matter. That is, if the arguments are not labelled, the function assumes that they are passed in the order they’re defined in the function; in this case, that means the first argument is assumed to be the exports argument and the second the imports argument. So, the following presumes that exports are 55, and imports are 133:
# tests the "net_exports_calculation" function in a case where exports are 55, and imports are 133; does not explicitly label inputs, order matters
net_exports_calculation(55, 133)
[1] -78
And the following presumes the opposite, that exports are 133, and imports are 55.
# uses the "net_exports_calculation" function in a case where exports are 133, and imports are 55; does not explicitly label inputs, order matters
net_exports_calculation(133, 55)
[1] 78
In this section, we’ll create a function that takes more than two inputs. In particular, we’ll create a function that takes numeric values for consumption spending (consumption_spending), government spending (government_spending), investment spending (investment_spending), and net exports (net_exports) as arguments, and returns a value for GDP (which is the sum of these values). We’ll assign this GDP calculator function to a new object named gdp_calculation:
# creates a new function that takes consumption spending, government spending, investment spending, and net exports as inputs, and returns a value for GDP by summing these elements; function is assigned to a new object named "gdp_calculation"
gdp_calculation<-function(consumption_spending, government_spending, investment_spending, net_exports){
  gdp<-consumption_spending+government_spending+investment_spending+net_exports
  return(gdp)
}
In short, the function takes numeric data on consumption spending, government spending, investment spending, and net exports as user-supplied arguments; it then adds these values up and assigns the sum to the object gdp, which it returns as output.
Let’s now test the function; as before, we’ll assume that units are in millions of dollars. We’ll test our function for a country with consumption spending of $125 (consumption_spending=125), government spending of $66 (government_spending=66), investment spending of $36 (investment_spending=36), and net exports of -$33 (net_exports=-33):
# tests gdp calculation for consumption spending of 125, government spending of 66, investment spending of 36, and net exports of -33
gdp_calculation(consumption_spending=125, government_spending=66, investment_spending=36, net_exports=-33)
[1] 194
As expected, the function returns the sum of these values, which translates into 194 (interpreted here as a GDP of $194 million).
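Because the output of one function can serve as the input to another, the two functions written so far can be combined. The sketch below repeats both definitions so that it runs on its own. The export and import figures (100 and 133) are illustrative assumptions chosen here so that net exports come out to -33, matching the test case above; they do not appear in the lesson.

```r
# functions from the lesson, repeated so this block runs on its own
net_exports_calculation <- function(exports, imports) {
  net_export_value <- exports - imports
  return(net_export_value)
}

gdp_calculation <- function(consumption_spending, government_spending,
                            investment_spending, net_exports) {
  gdp <- consumption_spending + government_spending + investment_spending + net_exports
  return(gdp)
}

# illustrative export and import values (assumed for this example)
net_exports_value <- net_exports_calculation(exports = 100, imports = 133)

# passes the output of the first function as an argument to the second
gdp_value <- gdp_calculation(consumption_spending = 125,
                             government_spending = 66,
                             investment_spending = 36,
                             net_exports = net_exports_value)

net_exports_value
gdp_value
```

Here net_exports_value is -33 and gdp_value is 194, the same GDP figure obtained when net exports were typed in directly.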
Now that we’re hopefully getting the hang of writing functions with any number of arguments, we can introduce a concept that will allow you to write more complex and sophisticated functions, namely conditional statements. By embedding conditional statements within functions, we enable them to make decisions based on whether a condition is true or false, and execute code accordingly. Conditional statements take the following form:
if (condition1) {
  # Code to execute if condition1 is TRUE
} else if (condition2) {
  # Code to execute if condition1 is FALSE and condition2 is TRUE
} else {
  # Code to execute if all the above conditions are FALSE
}
The if condition is always evaluated first. If it is FALSE, the program moves to the else if block (there can be multiple such blocks) and evaluates its condition. If none of the if or else if conditions is TRUE, the else block (which is optional) executes.
Let’s take an example; below, we’ll create a function that takes two arguments, “value” and “unit”. The “value” argument is a numeric temperature value, while “unit” is a string that specifies whether “value” is in Celsius or Fahrenheit. If the input temperature value is in Fahrenheit, the function converts it to Celsius and returns this value; if the input value is in Celsius, the function converts it to Fahrenheit and returns this value; and finally, if the specified unit is neither Celsius nor Fahrenheit, the function returns a message saying “Please indicate whether your input is in Celsius or Fahrenheit”:
# creates a function that takes a temperature value ("value"), and a temperature scale label ("unit") that is either "Fahrenheit" or "Celsius" which designates the temperature scale of the input temperature value; if the input argument is in "Fahrenheit", the function converts the temperature to Celsius and returns this value; if the input is in "Celsius" the function converts the temperature to Fahrenheit and returns this value; if the temperature scale label is neither "Fahrenheit" nor "Celsius" a message is returned to the user; the function is assigned to a new object named "convert_temperature"
convert_temperature <- function(value, unit) {
  if (unit == "Fahrenheit") {
    # Convert Fahrenheit to Celsius
    celsius <- (value - 32) * 5 / 9
    return(celsius)
  } else if (unit == "Celsius") {
    # Convert Celsius to Fahrenheit
    fahrenheit <- (value * 9 / 5) + 32
    return(fahrenheit)
  } else {
    # Handle invalid input for the unit
    return("Please indicate whether your input is in Celsius or Fahrenheit")
  }
}
Let’s go ahead and test the function. Let’s say we want to convert 100 degrees Fahrenheit to Celsius:
# Converts 100 degrees Fahrenheit to Celsius
convert_temperature(100, "Fahrenheit")
[1] 37.77778
Or, say we have Celsius values, and want to convert 25 degrees Celsius to Fahrenheit:
# Converts 25 degrees Celsius to Fahrenheit
convert_temperature(25, "Celsius")
[1] 77
Let’s try inputting a temperature in Kelvin:
# Attempts to convert a temperature of 100 Kelvin to another unit; met with a message specifying the function's constraints
convert_temperature(100, "Kelvin")
[1] "Please indicate whether your input is in Celsius or Fahrenheit"
As expected, the function returns a message saying this is beyond its scope.
As you can see, using conditional statements in a function allows it to be flexible and dynamic, which can save time; instead of writing separate functions to convert Celsius to Fahrenheit and vice versa, we can write a single function that implements the conversion in either direction, depending on the unit the user supplies.
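One possible variation, which is not the approach taken in this lesson, concerns the final else block. The function above returns a number for valid units but a character string for invalid ones, so code that uses its output would need to check which kind of value it received. The sketch below instead uses base R's stop() to signal an error when the unit is not recognized; the name convert_temperature_strict is hypothetical, and tryCatch() is used only to capture the error message for display.

```r
# hypothetical variant of the lesson's function: signals an error for unknown units
convert_temperature_strict <- function(value, unit) {
  if (unit == "Fahrenheit") {
    # Convert Fahrenheit to Celsius
    return((value - 32) * 5 / 9)
  } else if (unit == "Celsius") {
    # Convert Celsius to Fahrenheit
    return((value * 9 / 5) + 32)
  } else {
    # Stop with an error instead of returning a message string
    stop("Please indicate whether your input is in Celsius or Fahrenheit")
  }
}

# valid units behave as before
celsius_result <- convert_temperature_strict(25, "Celsius")
celsius_result

# tryCatch() captures the error so its message can be inspected without halting the script
kelvin_result <- tryCatch(
  convert_temperature_strict(100, "Kelvin"),
  error = function(e) conditionMessage(e)
)
kelvin_result
```

Whether to return a message or raise an error is a design choice; an error makes the problem harder to overlook, while a returned message lets a script keep running.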
Now that we have a sense of how to write basic functions, let’s now turn to the concept of iteration, which is fundamentally about applying a function across the elements of a vector, list, or data frame in a programmatic way. Functions and iteration are thus closely related; functions are small programs that perform some action, while iteration applies those programs repeatedly across several objects. For example, we already have a function to convert Fahrenheit temperature values to Celsius temperature values; using this newly created function helps us to avoid manually converting each of our temperature values from the Fahrenheit scale to the Celsius scale. Instead of repeating the calculation over and over manually, we could simply plug our Fahrenheit temperature values into the function, and let the function carry out the calculation for us. However, it is still time-consuming to plug our Fahrenheit values into the function one-by-one. Instead, we could deposit our Fahrenheit temperature values into a vector, and iteratively (i.e. sequentially) apply our function to all of these vector elements, and deposit the transformed results into a new vector (or a list, or as the rows of a data frame).
In many programming languages, functions are typically applied to multiple inputs in an iterative fashion using a construct known as a for-loop, which some of you may already be familiar with. R users also frequently use specialized functions (instead of for-loops) to iterate over elements; this is sometimes faster and, at the very least, often makes R scripts more readable. One family of these iterative functions is the “apply” family of functions (for example, lapply() and sapply()). A more recent set of functions that facilitate iteration is part of the tidyverse, and is found within the purrr package. These functions are known as map() functions, and we will use them in this lesson to iteratively apply our functions to multiple input arguments.
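For comparison with the map() functions, here is a rough sketch of the two base R approaches just mentioned: a for-loop and sapply() from the apply family. Neither is used in the rest of this lesson. The conversion function is repeated so the block runs on its own, and the vector fahrenheit_values and the two result names are illustrative assumptions.

```r
# the conversion function from the lesson, repeated so this block runs on its own
fahrenheit_to_celsius_converter <- function(fahrenheit_input) {
  celsius_output <- (fahrenheit_input - 32) * 5 / 9
  return(celsius_output)
}

# illustrative Fahrenheit values (assumed for this example)
fahrenheit_values <- c(32, 50, 212)

# approach 1: a for-loop that fills in a pre-allocated numeric vector
celsius_from_loop <- numeric(length(fahrenheit_values))
for (i in seq_along(fahrenheit_values)) {
  celsius_from_loop[i] <- fahrenheit_to_celsius_converter(fahrenheit_values[i])
}

# approach 2: sapply() applies the function to each element and simplifies the result to a vector
celsius_from_sapply <- sapply(fahrenheit_values, fahrenheit_to_celsius_converter)

celsius_from_loop
celsius_from_sapply
```

Both approaches produce the same three Celsius values (0, 10, and 100); the sapply() version simply expresses the iteration in a single line.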
There are many different kinds of map() functions within the purrr package, and we’ll introduce specific functions from this family of functions as we go. However, two of the fundamental map functions to be aware of at the outset are map(), which iterates over the elements of a vector or list, and deposits the results in a list, and map_dbl(), which iterates over the elements of a vector or list, and deposits the results in a numeric vector. If you’d like a preview of map() functions before we dive in, please consult the function’s documentation: ?map().
Let’s say we have four different Fahrenheit temperature values that we want to convert to Celsius: 45.6, 95.9, 67.8, 43. We could pass each input as an argument to fahrenheit_to_celsius_converter() individually, and get our Celsius values that way, but it would quickly become tedious. Instead, we’ll implement a strategy that involves iteratively applying our function to those Fahrenheit temperature values. As a first step, we’ll create a vector of Fahrenheit temperature input arguments:
# creates a vector of fahrenheit inputs
fahrenheit_input_vector<-c(45.6, 95.9, 67.8, 43)
Now, we’ll use the map() function to iteratively apply fahrenheit_to_celsius_converter() to each element in that vector, and deposit the results in a list. We’ll pass fahrenheit_input_vector (i.e. the elements we want to iterate over) as the first argument to the map() function, and fahrenheit_to_celsius_converter (i.e. the function we want to apply iteratively to the elements in fahrenheit_input_vector) as the second argument. The result of this operation will be a new “results list”, containing the transformed temperature values for each input in the original vector of Fahrenheit values (fahrenheit_input_vector). We’ll assign this result/output list to a new object named celsius_outputs_list:
# iteratively applies the "fahrenheit_to_celsius_converter" function to the vector of input arguments, "fahrenheit_input_vector", and assigns the resulting list of outputs to "celsius_outputs_list"
celsius_outputs_list<-map(.x=fahrenheit_input_vector, .f=fahrenheit_to_celsius_converter)
In short, the code above takes fahrenheit_input_vector, runs each of its elements through the fahrenheit_to_celsius_converter() function, and sequentially deposits each transformed result in the newly created celsius_outputs_list object, which contains the transformed Celsius values:
# prints contents of "celsius_outputs_list"
celsius_outputs_list
[[1]]
[1] 7.555556
[[2]]
[1] 35.5
[[3]]
[1] 19.88889
[[4]]
[1] 6.111111
More explicitly, the code that reads celsius_outputs_list<-map(.x=fahrenheit_input_vector, .f=fahrenheit_to_celsius_converter) did the following:
Pass 45.6 (the first element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and place the output (7.555556) as the first element in a new list of transformed values, named celsius_outputs_list.
Pass 95.9 (the second element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and deposit the output (35.500000) as the second element in celsius_outputs_list.
Pass 67.8 (the third element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and deposit the output (19.888889) as the third element in celsius_outputs_list.
Pass 43 (the fourth element in the input vector, fahrenheit_input_vector) to the fahrenheit_to_celsius_converter() function, and deposit the output (6.111111) as the fourth element in celsius_outputs_list.
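The four steps above can be written out by hand and compared with the result of map(). The sketch below assumes the purrr package is installed; it repeats the lesson's function and input vector so that it runs on its own, and the names manual_list and mapped_list are illustrative.

```r
library(purrr)

# function and input vector from the lesson, repeated so this block runs on its own
fahrenheit_to_celsius_converter <- function(fahrenheit_input) {
  celsius_output <- (fahrenheit_input - 32) * 5 / 9
  return(celsius_output)
}
fahrenheit_input_vector <- c(45.6, 95.9, 67.8, 43)

# builds the output list by hand, one function call per input element
manual_list <- list(
  fahrenheit_to_celsius_converter(fahrenheit_input_vector[1]),
  fahrenheit_to_celsius_converter(fahrenheit_input_vector[2]),
  fahrenheit_to_celsius_converter(fahrenheit_input_vector[3]),
  fahrenheit_to_celsius_converter(fahrenheit_input_vector[4])
)

# lets map() carry out the same four calls
mapped_list <- map(.x = fahrenheit_input_vector, .f = fahrenheit_to_celsius_converter)

# checks whether the two lists hold the same values
all.equal(manual_list, mapped_list)
```

The hand-written version grows by one line for every additional input, whereas the map() call stays the same length no matter how many elements the input vector contains.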
Recall that if we want to extract an element from a list, we can do so by specifying its index within double brackets. For instance, if we wanted to extract the second element in celsius_outputs_list, we could type the following:
# extracts second element from "celsius_outputs_list"
celsius_outputs_list[[2]]
[1] 35.5
As we have noted, there are a variety of map() functions, and the precise one you should use depends on the number of arguments used by the function (here, this value is of course one), and the desired class of the output (i.e. list, numeric vector etc.). Here, we used the core map() function because we wanted a list as an output, and we have a one-argument function that we are applying. Below, we’ll talk more about how to handle functions with multiple arguments within the purrr ecosystem. Before that, though, let’s see how to use a slightly different type of map() function to return a different kind of output.
In particular, let’s say we want to iteratively pass the values in fahrenheit_input_vector as arguments to fahrenheit_to_celsius_converter(), but that we want the outputs to be deposited in a numeric vector, rather than a list (as above). To do so, we can pass the same arguments we passed to the map() function, but use the map_dbl() function instead, which will return a numeric vector:
# iteratively applies the "fahrenheit_to_celsius_converter" function to the vector of input arguments, "fahrenheit_input_vector", and assigns the resulting vector of outputs to "celsius_outputs_vector"
celsius_outputs_vector<-map_dbl(.x=fahrenheit_input_vector, .f=fahrenheit_to_celsius_converter)
In short, the code above takes the first element of fahrenheit_input_vector, passes it as an input argument to fahrenheit_to_celsius_converter(), and deposits the output Celsius value as the first element in celsius_outputs_vector; it then takes the second element of fahrenheit_input_vector, passes it as an input argument to fahrenheit_to_celsius_converter(), and deposits the output Celsius value as the second element in celsius_outputs_vector; and so on. Let’s print the contents of celsius_outputs_vector:
# prints contents of "celsius_outputs_vector"
celsius_outputs_vector
[1]  7.555556 35.500000 19.888889  6.111111
As expected, we see that it is a vector containing the output Celsius values generated by applying the function to the various Fahrenheit input arguments.
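To make the difference between the two output types concrete, the sketch below checks the class of each result with base R's is.list() and is.numeric(), and confirms that flattening the list with unlist() gives the same values as map_dbl(). It assumes the purrr package is installed and repeats the lesson's function and input vector so that it runs on its own.

```r
library(purrr)

# function and input vector from the lesson, repeated so this block runs on its own
fahrenheit_to_celsius_converter <- function(fahrenheit_input) {
  celsius_output <- (fahrenheit_input - 32) * 5 / 9
  return(celsius_output)
}
fahrenheit_input_vector <- c(45.6, 95.9, 67.8, 43)

# map() returns a list; map_dbl() returns a numeric vector
celsius_outputs_list <- map(.x = fahrenheit_input_vector, .f = fahrenheit_to_celsius_converter)
celsius_outputs_vector <- map_dbl(.x = fahrenheit_input_vector, .f = fahrenheit_to_celsius_converter)

# checks the type of each output
is.list(celsius_outputs_list)
is.numeric(celsius_outputs_vector)

# flattening the list yields the same values as the numeric vector
all.equal(unlist(celsius_outputs_list), celsius_outputs_vector)
```

A numeric vector is usually the more convenient form when the results feed directly into arithmetic or a data frame column, while a list can hold outputs of any type.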
What if we want a data frame that contains the input Fahrenheit values as one column, and the output Celsius values as another, instead of a list or vector of outputs? We can do so by calling map_dbl() with the same arguments as above inside the data.frame() function, which creates a data frame. In particular:
# creates a data frame in which one column contains Fahrenheit input values, and the other contains Celsius output values
fahrenheit_celsius_df_output<-data.frame(Fahrenheit=fahrenheit_input_vector,
                                         Celsius=map_dbl(.x=fahrenheit_input_vector, .f=fahrenheit_to_celsius_converter))
Let’s confirm that the data frame has been created as expected:
# prints contents of "fahrenheit_celsius_df_output"
fahrenheit_celsius_df_output
  Fahrenheit   Celsius
1       45.6  7.555556
2       95.9 35.500000
3       67.8 19.888889
4       43.0  6.111111
As you can see, we now have a handy data frame that has a column of Fahrenheit inputs, and a column of Celsius outputs.
In the previous subsection, we explored the map() function in the context of single-argument functions. In this subsection, we’ll explore how related functions from the purrr package can be used to iteratively pass arguments to a function with two input arguments. To illustrate, we will consider the net_exports_calculation() function we created above.
Let’s say we have export and import data from three countries, and want to calculate net exports for each country. First, we’ll deposit our input arguments into two different vectors. The numeric vector export_vector contains information for export values, while import_vector contains information on import values:
# creates export and import vectors
export_vector<-c(78, 499, 785)
import_vector<-c(134, 345, 645)
Now, we’ll use the map2() function to iteratively pass the input arguments from these two vectors to the net_exports_calculation() function, and deposit the outputs (i.e. net export values) into a list, which we’ll assign to an object named net_export_list. The “.x” label signifies that export_vector is the first argument for the net_exports_calculation() function to iterate over, while the “.y” label signifies that import_vector is the second argument for net_exports_calculation() to iterate over. The “.f” label signifies the name of the function to which we’re applying these arguments.
# iteratively applies the "net_exports_calculation" function to the export values contained in "export_vector" and the import values contained in "import_vector" and deposits the resulting outputs in a list that's assigned to the new object entitled "net_export_list"
net_export_list<-map2(.x=export_vector, .y=import_vector, .f=net_exports_calculation)
In short, the code above takes the first value in export_vector and the first value in import_vector and passes these values to net_exports_calculation() to calculate net exports for the first country, which is then deposited as the first element in net_export_list; then, it takes the second value in export_vector and the second value in import_vector and passes these values to net_exports_calculation() to calculate net exports for the second country, which is then deposited as the second element in net_export_list; and likewise for the third country. We can print the contents of net_export_list to ensure that the code worked as expected:
# prints contents of "net_export_list"
net_export_list
[[1]]
[1] -56
[[2]]
[1] 154
[[3]]
[1] 140
If, instead of depositing the results into a list, we’d like to deposit our outputs into a numeric vector, we can do so using the map2_dbl() function, the analog of map_dbl() for functions that take two input arguments rather than one. We’ll assign our results vector to a new object named net_export_vector:
# iteratively applies the "net_exports_calculation" function to the export values contained in "export_vector" and the import values contained in "import_vector" and deposits the resulting outputs in a vector that's assigned to the new object entitled "net_export_vector"
net_export_vector<-map2_dbl(.x=export_vector, .y=import_vector, .f=net_exports_calculation)
Let’s print the contents of net_export_vector:
# prints contents of "net_export_vector"
net_export_vector
[1] -56 154 140
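Because map2() and map2_dbl() pair up the elements of “.x” and “.y” by position, the two inputs need compatible lengths. The sketch below illustrates this, assuming that net_exports_calculation() simply subtracts imports from exports (which is consistent with the outputs above); the argument names exports and imports, and the objects recycled_result and mismatch_result, are hypothetical.

```r
library(purrr)

# Assumed definition, consistent with the outputs shown above (exports minus imports);
# the argument names are hypothetical
net_exports_calculation <- function(exports, imports) {
  exports - imports
}

export_vector <- c(78, 499, 785)

# A length-one .y is recycled, so every export value is paired with 100
recycled_result <- map2_dbl(.x = export_vector, .y = 100,
                            .f = net_exports_calculation)
recycled_result

# Inputs of length 3 and 2 cannot be paired up; tryCatch() captures the resulting error
mismatch_result <- tryCatch(
  map2_dbl(.x = export_vector, .y = c(134, 345), .f = net_exports_calculation),
  error = function(e) "lengths do not match"
)
mismatch_result
```

In practice, this means it is worth checking that the two input vectors describe the same set of cases, in the same order, before iterating over them.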
If, instead, we’d like a data frame that contains exports in the first column, imports in the second column, and net exports in the third, we could use the data.frame() function, and run the code that generated net_export_vector within it. We’ll assign this data frame to a new object named net_exports_dataframe:
# creates data frame with exports in first column, imports in second column, and net exports in third; assigns the data frame to a new object named "net_exports_dataframe"
net_exports_dataframe<-data.frame(exports=export_vector,
                                  imports=import_vector,
                                  net_exports=map2_dbl(.x=export_vector, .y=import_vector, .f=net_exports_calculation))
Of course, we could have also created the data frame above with the following:
# alternative way of creating "net_exports_dataframe"
data.frame(exports=export_vector,
           imports=import_vector,
           net_exports=net_export_vector)
  exports imports net_exports
1      78     134         -56
2     499     345         154
3     785     645         140
While the map2() family of functions allows us to conveniently handle iteration tasks involving two-argument functions, we will often need to write and work with functions that have more than two arguments. How can we carry out iteration tasks when we need to iteratively pass multiple input arguments to a function?
The pmap() family of functions within purrr allows us to handle iteration tasks using functions with any number of inputs, including more than two, by using a list as a container for all of the input arguments we would like to iteratively pass to a function. To see how this works, let’s consider the gdp_calculation() function that we created above. Let’s say we have consumption spending, government spending, investment spending, and net export data for four different countries, and we want to iteratively pass the data for these countries as arguments to gdp_calculation() and derive the GDP for each country.
The first step is to create a new list of input arguments, where each list element is a vector that contains the country-level values for each argument of the gdp_calculation() function. We’ll assign this list to a new object named gdp_input_list:
# creates a list as a container for the input arguments we'll iteratively run through the "gdp_calculation" function
gdp_input_list<-list(consumption_spending=c(44, 89, 64, 33),
                     government_spending=c(54, 76, 222, 110),
                     investment_spending=c(123, 200, 55, 45),
                     net_exports=c(-55, 89, 143, -12))
To make sure we understand what gdp_input_list represents, consider the first element in each of the four numeric vectors in the list; these first elements correspond to the first country, which we can see has consumption spending of $44, government spending of $54, investment spending of $123, and net exports of -$55. The second element of each of the vectors in the list corresponds to information for the second country, which has consumption spending of $89, government spending of $76, investment spending of $200, and net exports of $89. And so on for Countries 3 and 4.
Now that we have defined our list of input values (gdp_input_list) based on the arguments to the gdp_calculation() function, we can pass gdp_input_list (the list of input values) and gdp_calculation() (the function to which we’re iteratively passing the arguments in gdp_input_list) as arguments to purrr’s pmap() function. The pmap() function iteratively passes the arguments in the input list to the gdp_calculation() function in a vectorized fashion. That is, the pmap() function uses the first element in each vector of the input list to generate the first output value, then uses the second element in each vector of the input list to generate the second output value, and so on. We’ll assign the resulting list of output values to a new object named gdp_output_list. The “.l” label is used to designate the list of input arguments, while the “.f” argument designates the function to which we are iteratively passing arguments from the input list:
# iteratively passes arguments from "gdp_input_list" to the "gdp_calculation" function and deposits the results in a new list object named "gdp_output_list"
gdp_output_list<-pmap(.l=gdp_input_list, .f=gdp_calculation)
Let’s now print the contents of gdp_output_list:
# prints contents of "gdp_output_list"
gdp_output_list
[[1]]
[1] 166
[[2]]
[1] 454
[[3]]
[1] 484
[[4]]
[1] 176
As expected, the first list element contains the GDP of the first country, 166 (44+54+123-55); the second list element contains the GDP of the second country, 454 (89+76+200+89); and so on, for the third and fourth countries.
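One detail worth keeping in mind is how pmap() connects the list elements to the function’s arguments: when the input list is named, the names are matched to the function’s argument names, and when it is unnamed, elements are matched by position. The sketch below illustrates both cases. It assumes that the arguments of gdp_calculation() carry the same names used in gdp_input_list and that the function sums its four inputs; the objects gdp_input_list_misnamed, misnamed_result, and unnamed_result are hypothetical.

```r
library(purrr)

# Assumed definition: argument names are taken from the names used in "gdp_input_list"
gdp_calculation <- function(consumption_spending, government_spending,
                            investment_spending, net_exports) {
  consumption_spending + government_spending + investment_spending + net_exports
}

gdp_input_list <- list(consumption_spending = c(44, 89, 64, 33),
                       government_spending = c(54, 76, 222, 110),
                       investment_spending = c(123, 200, 55, 45),
                       net_exports = c(-55, 89, 143, -12))

# Hypothetical list in which the first name does not match any argument of the function
gdp_input_list_misnamed <- gdp_input_list
names(gdp_input_list_misnamed)[1] <- "consumption"

# A mismatched name produces an error, which tryCatch() captures here
misnamed_result <- tryCatch(
  pmap_dbl(.l = gdp_input_list_misnamed, .f = gdp_calculation),
  error = function(e) "argument names do not match"
)
misnamed_result

# An unnamed list is matched to the function's arguments by position instead
unnamed_result <- pmap_dbl(.l = unname(gdp_input_list), .f = gdp_calculation)
unnamed_result
```

Naming the list elements after the function’s arguments, as the lesson does, is usually the safer choice, since it makes the pairing explicit.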
If we want the results deposited in a numeric vector instead, we can use the pmap_dbl() function:
# iteratively passes arguments from "gdp_input_list" to the "gdp_calculation" function and deposits the results in a new numeric vector object named "gdp_output_vector"
gdp_output_vector<-pmap_dbl(.l=gdp_input_list, .f=gdp_calculation)
Let’s print the contents of gdp_output_vector and confirm that it contains the expected outputs of the GDP function:
# prints contents of "gdp_output_vector"
gdp_output_vector
[1] 166 454 484 176
As an exercise below, we’ll ask you to assemble a data frame in which consumption spending, government spending, investment spending, and net exports for these countries are in columns, along with another column that contains the GDP value.
Now, let’s see how we can iteratively apply a more complex function with multiple inputs and conditional logic that results in different outputs depending on whether certain conditions are met. We’ll slightly modify the convert_temperature() function we created above. In particular, this modified function will take three inputs; the first specifies the name of a country, the second specifies a temperature in either Celsius or Fahrenheit, and the third provides information on whether the temperature value is provided in Celsius or Fahrenheit.
If the temperature is provided in Fahrenheit, the function converts this value to Celsius, while also recording the temperature in Fahrenheit based on the input value. If the temperature is provided in Celsius, it converts Celsius to Fahrenheit, while also recording the temperature in Celsius based on the input value. Then, it uses this information to create a one-row data frame, where the columns are the Country, temperature in Celsius, and temperature in Fahrenheit. The function below is extensively commented; see if you can make sense of its logic.
# creates new function to take a country name, temperature value in either Celsius or Fahrenheit, and a designation for the temperature unit as inputs, and return a data frame with the country name, temperature value in Fahrenheit, and temperature value in Celsius as columns; the function is assigned to a new object named "convert_temperature_df"
convert_temperature_df <- function(country, temperature, unit) {
  # Check if the unit is valid
  if (unit == "Fahrenheit") {
    # Convert Fahrenheit to Celsius
    celsius <- (temperature - 32) * 5 / 9
    fahrenheit <- temperature
  } else if (unit == "Celsius") {
    # Convert Celsius to Fahrenheit
    fahrenheit <- (temperature * 9 / 5) + 32
    celsius <- temperature
  } else {
    # Throw an error if the unit is invalid
    stop("Error: Please indicate whether your input is in 'Celsius' or 'Fahrenheit'")
  }
  # Create and return a data frame
  result <- data.frame(
    Country = country,
    Temperature_Celsius = round(celsius, 2), # Round to 2 decimal places
    Temperature_Fahrenheit = round(fahrenheit, 2)
  )
  return(result)
}
Now, let’s go ahead and test this function, where the country is “USA”, and the temperature in Fahrenheit is 100:
# tests "convert_temperature_df" function for USA as the country input, and a temperature of 100 degrees in Fahrenheit
convert_temperature_df("USA", 100, "Fahrenheit")
  Country Temperature_Celsius Temperature_Fahrenheit
1     USA               37.78                    100
As we can see, the function behaved as expected; it creates a data frame where the country is “USA”, and there are columns for the temperature in Celsius and Fahrenheit. In this case, the Fahrenheit temperature was filled in by the user-supplied argument, while the Celsius temperature was filled in by transforming the Fahrenheit temperature value to the Celsius scale within the function.
Let’s test the function again, with different input arguments:
# tests "convert_temperature_df" function for India as the country input, and a temperature of 39 degrees in Celsius
convert_temperature_df("India", 39, "Celsius")
  Country Temperature_Celsius Temperature_Fahrenheit
1   India                  39                  102.2
Again, the function behaves as expected. It creates a data frame where the country is “India”, and columns for the temperature in Celsius and Fahrenheit. In this case, the Celsius temperature was filled in by the user-supplied argument, while the Fahrenheit temperature was filled in by transforming the Celsius temperature value to the Fahrenheit scale within the function.
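So far we have only exercised the two valid branches of the function. The sketch below checks the third branch by passing an unsupported unit. The function is repeated so that the example runs on its own; the inputs “Iceland”, 280, and “Kelvin”, and the object name kelvin_attempt, are hypothetical and chosen only for illustration.

```r
# Copy of the "convert_temperature_df" function from above, so this sketch runs on its own
convert_temperature_df <- function(country, temperature, unit) {
  if (unit == "Fahrenheit") {
    celsius <- (temperature - 32) * 5 / 9
    fahrenheit <- temperature
  } else if (unit == "Celsius") {
    fahrenheit <- (temperature * 9 / 5) + 32
    celsius <- temperature
  } else {
    stop("Error: Please indicate whether your input is in 'Celsius' or 'Fahrenheit'")
  }
  data.frame(Country = country,
             Temperature_Celsius = round(celsius, 2),
             Temperature_Fahrenheit = round(fahrenheit, 2))
}

# "Kelvin" is not a supported unit, so stop() is triggered;
# tryCatch() captures the error so that we can inspect its message
kelvin_attempt <- tryCatch(
  convert_temperature_df("Iceland", 280, "Kelvin"),
  error = function(e) conditionMessage(e)
)
kelvin_attempt
```

Without tryCatch(), the same call would halt with the error message, which is the behavior we want when a user supplies a unit the function cannot handle.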
Now, let’s imagine we have data for several countries, and we want to iteratively pass this data as arguments to convert_temperature_df to generate a data frame with information on the temperatures for these countries. To do so, we can use the pmap() function. First, we’ll create a list with vectors that contain the input arguments we want to pass to the function; we’ll assign this list to a new object named input_list_temperatures:
# creates a list of inputs to iterate over
input_list_temperatures<-list(country=c("USA", "Canada", "Mexico", "France"),
                              temperature=c(66, 11, 25, 33),
                              unit=c("Fahrenheit", "Celsius", "Fahrenheit", "Celsius"))
Next, we’ll pass this list and the convert_temperature_df() function as arguments to pmap(), and assign the resulting list of outputs to a new object named convert_temperature_list.
# iteratively applies the "convert_temperature_df" function using the input variables in "input_list_temperatures"; the outputs are deposited in a list assigned to the object named "convert_temperature_list"
convert_temperature_list<-pmap(.l=input_list_temperatures, .f=convert_temperature_df)
The code above first takes the first elements in the vectors in input_list_temperatures (“USA”, 66, “Fahrenheit”), runs them through the convert_temperature_df() function, and deposits the resulting data frame as the first element in convert_temperature_list. It then takes the second elements in the vectors in input_list_temperatures (“Canada”, 11, “Celsius”), passes them as arguments to the convert_temperature_df() function, and deposits the resulting data frame as the second element in convert_temperature_list. And so on.
Let’s now print the contents of convert_temperature_list, which should contain four single-row data frames:
# prints contents of "convert_temperature_list"
convert_temperature_list
[[1]]
  Country Temperature_Celsius Temperature_Fahrenheit
1     USA               18.89                     66
[[2]]
  Country Temperature_Celsius Temperature_Fahrenheit
1  Canada                  11                   51.8
[[3]]
  Country Temperature_Celsius Temperature_Fahrenheit
1  Mexico               -3.89                     25
[[4]]
  Country Temperature_Celsius Temperature_Fahrenheit
1  France                  33                   91.4
As we can see, the data frames are stored separately as list elements (which is what we expected). If we’d like to bind these data frame rows together into a single, consolidated data frame, we can easily do so by passing convert_temperature_list as an argument to the bind_rows() function from the dplyr package. We’ll assign this data frame to a new object named convert_temperature_df_final:
# appends together the single-row data frames in "convert_temperature_list" into a single data frame by passing "convert_temperature_list" as an argument to "bind_rows"; the newly created data frame is assigned to a new object named "convert_temperature_df_final"
convert_temperature_df_final<-bind_rows(convert_temperature_list)
# prints contents of "convert_temperature_df_final"
convert_temperature_df_final
  Country Temperature_Celsius Temperature_Fahrenheit
1     USA               18.89                   66.0
2  Canada               11.00                   51.8
3  Mexico               -3.89                   25.0
4  France               33.00                   91.4
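If dplyr is not loaded, one possible base R alternative for stacking a list of data frames is to combine do.call() with rbind(). The sketch below uses a small hypothetical list, temperature_rows, that stands in for the first two elements of convert_temperature_list; it is offered as an illustration rather than as a replacement for the bind_rows() approach used in this lesson.

```r
# Two one-row data frames standing in for elements of "convert_temperature_list"
temperature_rows <- list(
  data.frame(Country = "USA", Temperature_Celsius = 18.89, Temperature_Fahrenheit = 66),
  data.frame(Country = "Canada", Temperature_Celsius = 11, Temperature_Fahrenheit = 51.8)
)

# do.call() passes every list element to rbind() as a separate argument,
# which stacks the rows into a single data frame
temperature_rows_combined <- do.call(rbind, temperature_rows)
temperature_rows_combined
```

Note that rbind() expects the data frames to share the same column names, whereas bind_rows() is more forgiving and fills missing columns with NA.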
Finally, it’s worth noting that it’s possible to incorporate iteration with the map() family of functions into our own custom-written functions. In other words, we can include map() functions within the body of our functions, rather than simply applying them to pre-defined functions. Moreover, it’s possible to incorporate functions we’ve already written into new custom functions; incorporating “functions within functions” can often help us accomplish some very useful tasks.
Let’s consider an example by writing a new function that takes a vector of Fahrenheit temperature values and transforms these values to the Celsius scale. It will return these transformed values as a list, a vector, or a data frame (containing the initial Fahrenheit temperatures in one column and the transformed Celsius temperatures in another), depending on the user’s preference.
Below, the function’s first argument is the vector of Fahrenheit temperatures, while the second is the user’s preference over whether the transformed values are returned as a list (“List”), vector (“Vector”), or data frame (“Data.Frame”).
If the user desires the output as a list, the function passes the vector of Fahrenheit inputs as an argument to the map() function, along with the fahrenheit_to_celsius_converter() function that we defined earlier. This will iteratively pass the temperature values in the input vector as arguments to fahrenheit_to_celsius_converter(), and the output Celsius temperatures are returned as a list.
If, instead, the user desires the output as a numeric vector, the function passes the vector of Fahrenheit inputs as an argument to the map_dbl() function, along with fahrenheit_to_celsius_converter(). This will iteratively pass the temperature values in the input vector as arguments to fahrenheit_to_celsius_converter(), and the output Celsius temperatures are returned as a vector (given the behavior of the map_dbl() function).
If the user desires a data frame, the function will create a data frame using the data.frame() function, and define a column for Fahrenheit values using the vector of inputs, and a column of Celsius temperature values using the vector of transformed Celsius values created by map_dbl(.x=vector_of_fahrenheit_inputs, .f=fahrenheit_to_celsius_converter).
Finally, if the user specifies an output that is not “List”, “Data.Frame”, or “Vector”, the function throws an error. We’ll assign this function to a new object named fahrenheit_to_celsius_general:
# Writes a function that takes a vector of fahrenheit temperature values, and returns either a list, data frame, or vector of outputs depending on the user's desired output; assigns the function to a new object named "fahrenheit_to_celsius_general"
fahrenheit_to_celsius_general<-function(vector_of_fahrenheit_inputs, desired_output){
  if (desired_output=="List"){
    outputlist<-map(.x=vector_of_fahrenheit_inputs, .f=fahrenheit_to_celsius_converter)
    return(outputlist)
  } else if (desired_output=="Vector"){
    outputvector<-map_dbl(.x=vector_of_fahrenheit_inputs, .f=fahrenheit_to_celsius_converter)
    return(outputvector)
  } else if (desired_output=="Data.Frame"){
    outputdf<-data.frame(Fahrenheit=vector_of_fahrenheit_inputs,
                         Celsius=map_dbl(.x=vector_of_fahrenheit_inputs, .f=fahrenheit_to_celsius_converter))
    return(outputdf)
  } else {
    stop("Error: Please indicate whether your desired output is a 'Vector', 'Data.Frame', or 'List'")
  }
}
Now, let’s test this function out. First, we’ll define a vector of Fahrenheit values we’d like to convert, and assign it to a new object named test_vector_ftoc:
# tests "fahrenheit_to_celsius_general" function; first, defines a vector of fahrenheit values
test_vector_ftoc<-c(18, 66, 88, -12, 7)
Now let’s test the function by passing test_vector_ftoc as an argument to it, along with specifying “Data.Frame” as our desired output.
# uses "fahrenheit_to_celsius_general" function to convert the temperature values in "test_vector_ftoc" to Celsius and return a data frame with input Fahrenheit values in one column, and corresponding celsius temperatures in another column
fahrenheit_to_celsius_general(vector_of_fahrenheit_inputs=test_vector_ftoc, desired_output="Data.Frame")
  Fahrenheit    Celsius
1         18  -7.777778
2         66  18.888889
3         88  31.111111
4        -12 -24.444444
5          7 -13.888889
The function behaves as expected. Now, let’s try specifying that we want the transformed Celsius temperatures as a list:
# uses "fahrenheit_to_celsius_general" function to convert the temperature values in "test_vector_ftoc" to Celsius and return the outputs as a list
fahrenheit_to_celsius_general(vector_of_fahrenheit_inputs=test_vector_ftoc, desired_output="List")
[[1]]
[1] -7.777778
[[2]]
[1] 18.88889
[[3]]
[1] 31.11111
[[4]]
[1] -24.44444
[[5]]
[1] -13.88889
Finally, we’ll check the function’s behavior when we request the output as a vector:
# uses "fahrenheit_to_celsius_general" function to convert the temperature values in "test_vector_ftoc" to Celsius and return the outputs as a vector
fahrenheit_to_celsius_general(vector_of_fahrenheit_inputs=test_vector_ftoc, desired_output="Vector")
[1]  -7.777778  18.888889  31.111111 -24.444444 -13.888889
It’s worth noting that when we request an output that is not supported, we receive the expected error message. For example, let’s specify our desired output as a tibble (i.e. a special type of data frame):
# uses "fahrenheit_to_celsius_general" function to convert the temperature values in "test_vector_ftoc" to Celsius and return the outputs as a tibble
fahrenheit_to_celsius_general(vector_of_fahrenheit_inputs=test_vector_ftoc, desired_output="Tibble")
Error in fahrenheit_to_celsius_general(vector_of_fahrenheit_inputs = test_vector_ftoc, : Error: Please indicate whether your desired output is a 'Vector', 'Data.Frame', or 'List'
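As an aside, the same branching logic could also be written with base R’s switch() function instead of a chain of if/else statements. The sketch below is one possible variant, not the version used in this lesson: the function name fahrenheit_to_celsius_switch is hypothetical, and the body of fahrenheit_to_celsius_converter() is an assumption standing in for the definition from earlier in the lesson.

```r
library(purrr)

# Assumed definition: stands in for the converter defined earlier in the lesson
fahrenheit_to_celsius_converter <- function(fahrenheit_input) {
  (fahrenheit_input - 32) * (5 / 9)
}

# Hypothetical variant of "fahrenheit_to_celsius_general" that uses switch()
fahrenheit_to_celsius_switch <- function(vector_of_fahrenheit_inputs, desired_output) {
  switch(desired_output,
    "List" = map(.x = vector_of_fahrenheit_inputs, .f = fahrenheit_to_celsius_converter),
    "Vector" = map_dbl(.x = vector_of_fahrenheit_inputs, .f = fahrenheit_to_celsius_converter),
    "Data.Frame" = data.frame(
      Fahrenheit = vector_of_fahrenheit_inputs,
      Celsius = map_dbl(.x = vector_of_fahrenheit_inputs, .f = fahrenheit_to_celsius_converter)
    ),
    # The unnamed final argument is the fallback used when no label matches
    stop("Please indicate whether your desired output is a 'Vector', 'Data.Frame', or 'List'")
  )
}

# Requests the converted values as a vector (32 and 212 are illustrative inputs)
fahrenheit_to_celsius_switch(c(32, 212), "Vector")
```

Whether switch() or if/else reads more clearly is largely a matter of taste; both evaluate only the branch that matches the user’s request.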
Now that we know a little bit more about how functions are put together, it is worth briefly discussing global and local environments in R. The global environment is the environment where the objects we define are created and stored during an R Session. A local environment is a temporary environment created within functions. When objects are defined within a function, those objects only exist within that function, and won’t be accessible globally (unless they’re explicitly assigned to the global environment).
To get a better sense of this, let’s first define an object, x, in our global environment:
# define a variable in the global environment
x<-24
Now, we’ll create a toy function that defines another value for x within its environment.
# creates a toy function that takes a numeric input argument ("input1"); it defines an object, x, within the function, then defines another object, z, that's the sum of x and input1. It returns z as an output
toy_function<-function(input1){
  x<-5
  z<-x+input1
  return(z)
}
Go ahead and test the function to ensure it works as expected.
# passes the argument "input1=7" to the toy function
toy_function(input1=7)
[1] 12
In this context, the key thing to note is that even though we defined x<-5 within the function, when we print the value of x, it returns 24, which was the value assigned in the global environment. The local object x, assigned a value of 5, only exists when the function runs; then it disappears.
# prints value of x; note it returns the value from the global environment
x
[1] 24
Note also that when we try to print out the value of z, we are met with an error, since it’s only defined within the local environment of the function; z is not an object in the global environment. Rather, it is created in the function’s local environment and disappears after the function finishes running.
# prints value of z; note that there's an error, since z is only defined within the local environment of the function
z
Error: object 'z' not found
Another way of noting this is to call the ls() function, and see that “z” is not printed to the console along with other objects in the global environment, since it’s only defined within the toy_function().
# prints objects in memory; note that z is not included, since it's only defined within the function
ls()
[1] "celsius_outputs_list" "celsius_outputs_vector"
[3] "convert_temperature" "convert_temperature_df"
[5] "convert_temperature_df_final" "convert_temperature_list"
[7] "export_vector" "fahrenheit_celsius_df_output"
[9] "fahrenheit_input_vector" "fahrenheit_to_celsius_converter"
[11] "fahrenheit_to_celsius_general" "gdp_calculation"
[13] "gdp_input_list" "gdp_output_list"
[15] "gdp_output_vector" "import_vector"
[17] "input_list_temperatures" "net_export_list"
[19] "net_export_vector" "net_exports_calculation"
[21] "net_exports_dataframe" "sample_vector"
[23] "test_vector_ftoc" "toy_function"
[25] "x"
An important implication of the fact that the local environment within a function exists apart from the global environment is that, when you write functions using ordinary assignment (<-), you don’t have to worry about accidentally overwriting global objects.
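To illustrate what it means to explicitly assign to the global environment, the sketch below contrasts ordinary assignment (<-) inside a function with R’s super-assignment operator (<<-), which modifies an object outside the function. The function names local_assignment_function and global_assignment_function, and the objects x_after_local and x_after_global, are hypothetical.

```r
# defines a variable in the global environment
x <- 24

# Hypothetical function that assigns to x locally with "<-"
local_assignment_function <- function() {
  x <- 5
  x
}

# Hypothetical function that uses "<<-", which modifies the x defined outside the function
global_assignment_function <- function() {
  x <<- 5
  x
}

# Ordinary assignment leaves the global x untouched
local_assignment_function()
x_after_local <- x   # still 24

# Super-assignment overwrites the global x
global_assignment_function()
x_after_global <- x  # now 5
```

Because <<- can silently change objects that other code relies on, it is generally best avoided unless there is a specific reason to use it.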
Exercise 1
Write a function that takes a monetary value (in US dollars) as an input, and returns the equivalent value in Euros. Assign the function to an object, and test it out with some sample arguments.
# creates a US Dollar (usd) to Euro (eur) conversion function based on current Dollar to Euro exchange rate
usd_to_eur<-function(dollar_input){
  dollar_to_euro_conversion<-dollar_input*0.9637
  return(dollar_to_euro_conversion)
}
# Uses function to convert $18.23 to Euros
usd_to_eur(18.23)
[1] 17.56825
# Uses function to convert $127.00 to Euros
usd_to_eur(127)
[1] 122.3899
Exercise 2
Write a function that takes a monetary value (in US Dollars) as an input, as well as an argument in which the user can tell the function to convert that value to an equivalent amount in Euros, Indian Rupees, or Mexican Pesos. The function returns the US dollar equivalent in the desired currency. Assign this function to a new object.
# creates new function, "exchange_rate_calculator", that converts USD amount to Euros (EUR), Indian Rupees (INR), or Pesos (MXN); if the desired currency argument is not either "EUR", "INR", or "MXN" (i.e. the currency tickers), the function throws an error
exchange_rate_calculator<-function(USD_currency_value, desired_currency){
  if (desired_currency=="EUR"){
    euros<-USD_currency_value*0.96
    return(euros)
  } else if (desired_currency=="INR"){
    inr<-USD_currency_value*87.58
    return(inr)
  } else if (desired_currency=="MXN"){
    mxn<-USD_currency_value*20.48
    return(mxn)
  } else {
    stop("Please indicate whether you'd like to convert this value to EUR, INR, or MXN")
  }
}
# tests "exchange_rate_calculator" by converting $25 to INR
exchange_rate_calculator(25, "INR")
[1] 2189.5
# uses "exchange_rate_calculator" to convert $25 to EUR
exchange_rate_calculator(25, "EUR")
[1] 24
# uses "exchange_rate_calculator" to convert $25 to MXN
exchange_rate_calculator(25, "MXN")
[1] 512
Exercise 3
Take the function you wrote for Exercise 1 above, and use a function from the purrr package to programmatically generate a data frame where one column contains US dollars in the following amounts: 10.25, 1245.55, 83, 76, 11559, and the other column contains the equivalent sum of money in Euros.
# create vector of USD currency values to convert
usd_vector<-c(10.25, 1245.55, 83, 76, 11559)
# create vector that contains converted Euro values
euro_vector<-map_dbl(.x=usd_vector, .f=usd_to_eur)
# creates dataset with USD values in one column, and corresponding converted Euro values in another
currency_df<-data.frame(USD_Value=usd_vector,
                        EUR_Value=euro_vector)
# prints "currency_df"
currency_df
  USD_Value    EUR_Value
1     10.25     9.877925
2   1245.55  1200.336535
3     83.00    79.987100
4     76.00    73.241200
5  11559.00 11139.408300