1 Preliminaries

Please load the following packages:

library(sf)
library(tidyverse)
library(tmap)
library(tidycensus)
library(raster)
library(exactextractr)
library(terra)

We will be working with data extracted from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) program. There are many ways to extract MODIS data, including the R package MODIStsp and directly from the MODIS website. To save time, the datasets we’ll be working with in this tutorial have already been extracted and clipped to the extent of Colorado for you; you can download the data at the following link. Please set your working directory to the folder containing the data.

2 Introduction

Earlier in class, we developed the intuition for how raster datasets that inform us about biophysical and social activities across the surface of the earth are created using multi-spectral satellite images, which allow researchers to exploit scientific knowledge about the different spectral properties of various objects to make inferences about the types of objects and activities within a pixel. Now that you have this intuition, we’ll learn how to work with some of these satellite-derived raster data products in R (the datasets we’ll work with are derived from MODIS satellite images). Then, we’ll get you oriented to the Google Earth Engine (GEE) interface, and have you compute and map an NDVI index for the region around Boulder by applying band math on satellite images extracted from GEE.
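Since we’ll be applying band math to compute an NDVI index later on, it’s worth previewing the formula itself: NDVI = (NIR - Red) / (NIR + Red), where NIR and Red are the near-infrared and red reflectance bands. Below is a minimal sketch in R; the reflectance values are invented purely for illustration:

```r
# NDVI = (NIR - Red) / (NIR + Red); healthy vegetation reflects strongly
# in the near-infrared band and absorbs strongly in the red band
ndvi <- function(nir, red) {
  (nir - red) / (nir + red)
}

# hypothetical reflectance values for two pixels
ndvi(nir = 0.5, red = 0.1)   # densely vegetated pixel: high NDVI (~0.67)
ndvi(nir = 0.3, red = 0.25)  # sparse vegetation/soil: low NDVI (~0.09)
```

The same arithmetic is what GEE performs pixel-by-pixel when we map the function over a satellite image’s bands.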

3 NDVI

In this section, we’ll learn how to work with an NDVI raster from MODIS, and visualize it in RStudio.

First, please load the NDVI raster into RStudio:

# Loads NDVI raster into RStudio and assigns it to an object named "CO_ndvi"
CO_ndvi<-raster("MYD13A2_NDVI_2020_185.tif")

The data were initially retrieved using R’s MODIStsp package (mentioned above), and can also be retrieved from this page on MODIS Vegetation Products. Note that the dataset corresponds to the “Vegetation Indices 16-Day L3 Global 1km” (MYD13A2) product, clipped to the extent of Colorado, for the date July 3rd, 2020 (since we’re interested in viewing the spatial distribution of live vegetation in Colorado, it makes sense to pick a day during peak summer).

Now, let’s go ahead and print the raster metadata:

# prints "CO_ndvi" metadata
CO_ndvi
## class      : RasterLayer 
## dimensions : 269, 471, 126699  (nrow, ncol, ncell)
## resolution : 0.01490176, 0.01491085  (x, y)
## extent     : -109.0603, -102.0415, 36.99243, 41.00344  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : MYD13A2_NDVI_2020_185.tif 
## names      : MYD13A2_NDVI_2020_185

Then, let’s print a summary of the data distribution:

# prints a summary of the data distribution of "CO_ndvi"
summary(CO_ndvi, maxsamp=ncell(CO_ndvi))
##         MYD13A2_NDVI_2020_185
## Min.                    -1909
## 1st Qu.                  2219
## Median                   3144
## 3rd Qu.                  5307
## Max.                     8979
## NA's                        2

Note the minimum and maximum values (-1909 and 8979), which look strange; recall that NDVI values range from -1 to 1. The issue is that the raw raster values must be adjusted by a scale factor to recover the correct NDVI values. This scale factor is 0.0001, and is provided in the data product’s metadata (available here; see the “Layers” tab).

To make this adjustment, we can take CO_ndvi and simply multiply by the scale factor, 0.0001; we’ll assign the rescaled raster to a new object named CO_ndvi_adjusted:

# applies scale adjustment to "CO_ndvi" and assigns to new object named "CO_ndvi_adjusted"
CO_ndvi_adjusted<-CO_ndvi*0.0001

Now, let’s view the raster data distribution for CO_ndvi_adjusted:

# prints summary of "CO_ndvi_adjusted"
summary(CO_ndvi_adjusted, maxsamp=ncell(CO_ndvi_adjusted))
##         MYD13A2_NDVI_2020_185
## Min.                  -0.1909
## 1st Qu.                0.2219
## Median                 0.3144
## 3rd Qu.                0.5307
## Max.                   0.8979
## NA's                   2.0000

Note the values range from -0.1909 to 0.8979, which is reasonable, given our knowledge of the NDVI index. Recall that it’s always possible to view a raster’s underlying data by converting the raster object to a dataframe. Below, we’ll convert CO_ndvi_adjusted to a dataframe and assign it to a new object named CO_ndvi_adjusted_df:

# makes data frame from "CO_ndvi_adjusted" and assigns to new object named "CO_ndvi_adjusted_df"
CO_ndvi_adjusted_df <- as.data.frame(CO_ndvi_adjusted, xy=T)

Now, you can view the underlying data in the RStudio data viewer by passing CO_ndvi_adjusted_df to the View() function:

# Views "CO_ndvi_adjusted_df" in data viewer
View(CO_ndvi_adjusted_df)

And, as we learned in the previous class, we can easily view the distribution of these NDVI values using ggplot2. Below, we’ll make a histogram of the NDVI values (stored in the “MYD13A2_NDVI_2020_185” column of CO_ndvi_adjusted_df):

# Makes a histogram of NDVI values stored in the "CO_ndvi_adjusted_df" object within the "MYD13A2_NDVI_2020_185" column
ggplot()+
  geom_histogram(data=CO_ndvi_adjusted_df, 
                 aes(MYD13A2_NDVI_2020_185), 
                 bins=25)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
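If you’d like to summarize the distribution more coarsely, base R’s cut() function can bin the NDVI values into rough categories. The thresholds below are illustrative only (they are not official MODIS classes), and the sample values are made up:

```r
# illustrative NDVI thresholds; real vegetation boundaries vary by biome
ndvi_values <- c(-0.15, 0.05, 0.25, 0.45, 0.75)  # made-up sample values
ndvi_class <- cut(ndvi_values,
                  breaks = c(-1, 0, 0.2, 0.5, 1),
                  labels = c("water/snow", "bare/sparse",
                             "moderate vegetation", "dense vegetation"))
# tabulates how many sample values fall into each category
table(ndvi_class)
```

The same logic applied to the full CO_ndvi_adjusted_df column would give you a quick categorical summary of vegetation cover across the state.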

Now, to get a sense of how these values varied across space on that summer day in Colorado, let’s go ahead and plot the raster using tmap, using a Yellow-to-Green color palette. We’ll assign our map to a new object named CO_ndvi_map:

# Maps Colorado NDVI values and assigns to a new object named "CO_ndvi_map"
CO_ndvi_map<-tm_shape(CO_ndvi_adjusted)+ # declares object to map
                tm_raster(palette = "YlGn", # sets palette
                          style="cont", # sets legend breaks to continuous
                          breaks=c(-0.2, 0,0.2,0.4,0.6,0.8, 1), # specifies vector of legend breaks
                          title="Normalized Difference\nVegetation Index\n(NDVI)")+ # legend title
                tm_layout(legend.outside=T, # puts legend outside map extent
                          legend.title.size=0.7, # sets legend title size
                          main.title="Summer Vegetation in Colorado, 7/3/2020", # sets main title
                          main.title.size=0.8, # sets main title size
                          attr.outside=T)+ # puts credits outside map extent
  tm_credits("Map Author: Aditya Ranganath\nData Source: MODIS", # credits text
             position=c(0.78,0),  # sets credits position
             size=0.38) # sets credits size

Now, let’s print our map to see what it looks like:

# prints "CO_ndvi_map"
CO_ndvi_map
## Warning: Breaks contains positive and negative values. Better is to use
## diverging scale instead, or set auto.palette.mapping to FALSE.

It can be instructive to plot this in interactive mode. Let’s first switch the tmap mode to “view”:

# changes tmap to "View" mode
tmap_mode("view")
## tmap mode set to interactive viewing
# prints map in "view" mode
CO_ndvi_map
## Credits not supported in view mode.
## Warning: Breaks contains positive and negative values. Better is to use
## diverging scale instead, or set auto.palette.mapping to FALSE.
# switches back to "plot" mode
tmap_mode("plot")
## tmap mode set to plotting

Zoom into an area with high concentrations of vegetation; then, turn off the NDVI layer and turn on the “ESRI.WorldTopoMap” base layer; you’ll notice that these concentrations are in National Forest areas, which is what we would expect.

4 Land cover

Let’s get some more practice working with MODIS satellite data by exploring a land cover raster dataset. Just like NDVI rasters, land cover rasters are created by taking satellite imagery (generated by the MODIS sensor, in this case) and applying our knowledge of the spectral signatures of different land cover classes to classify pixels according to their land cover type. In the example we’ll work with, a pixel is assigned to a given land cover class if that class covers the majority of the pixel/grid cell.
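To make the classification idea concrete, here is a toy sketch of the underlying logic: each candidate land class has a reference spectral signature, and a pixel is assigned to the class whose signature is closest to the pixel’s observed reflectances. The two-band signatures and reflectance values below are invented for illustration; real classification algorithms are far more sophisticated:

```r
# invented two-band (red, NIR) reference signatures for three classes
signatures <- rbind(forest    = c(red = 0.05, nir = 0.45),
                    grassland = c(red = 0.10, nir = 0.30),
                    water     = c(red = 0.03, nir = 0.02))

# classifies a pixel by the nearest (Euclidean) reference signature
classify_pixel <- function(pixel, signatures) {
  dists <- apply(signatures, 1, function(sig) sqrt(sum((pixel - sig)^2)))
  names(which.min(dists))
}

# a pixel with low red and high NIR reflectance looks like forest
classify_pixel(c(red = 0.06, nir = 0.43), signatures)  # "forest"
```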

The data we’re using were initially extracted using the MODIStsp package, but can also be downloaded from the MODIS page for land cover products, at this link. The data correspond to the year 2019.

With this information in mind, let’s read in our Land Cover raster:

# read in land cover raster and assign to new object named "CO_land_cover"
CO_land_cover<-raster("MCD12Q1_LC1_2019_001.tif")

Let’s now develop a visualization to see how land cover classes vary across the state of Colorado. The first step is to examine the different land class categories and how they are coded. This information is provided in the land cover products user guide. In particular, the information we need is on Page 7 of this document; it provides us with the names of the 17 land cover classes, and the numbers used to represent these classes in the raster dataset.

Let’s first make a vector of labels, using those land classifications, and assign it to an object named raster_map_label:

# Makes character vector of land cover classes and assigns to object named "raster_map_label"
raster_map_label<-c("Evergreen needleleaf forests",
                    "Evergreen broadleaf forests",
                    "Deciduous needleleaf forests",
                    "Deciduous broadleaf forests",
                    "Mixed forests",
                    "Closed shrublands",
                    "Open shrublands",
                    "Woody savannas",
                    "Savannas",
                    "Grasslands",
                    "Permanent wetlands",
                    "Croplands",
                    "Urban and built-up lands",
                    "Cropland/natural vegetation mosaics",
                    "Permanent Snow and ice",
                    "Barren",
                    "Water bodies")

Now, we’ll create a numeric vector with elements from 1 to 18 to serve as legend break points; 18 break points delimit 17 intervals, one for each land cover class. We’ll assign it to an object named landcover_breaks:

# creates numeric vector with elements from 1 to 18, and assigns to object named "landcover_breaks"
landcover_breaks<-c(1:18)

In our raster plot of CO_land_cover, below, we’ll set the “breaks” argument equal to landcover_breaks, and our “labels” argument equal to raster_map_label. In doing so, tmap will label all observations with a value of “1” as “Evergreen needleleaf forests”, all observations with a raster value of “2” as “Evergreen broadleaf forests”, and so on, until the end, where all values equal to 17 are labelled as “Water bodies”.
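Under the hood, this breaks-and-labels pairing works much like base R’s cut(): the break points delimit intervals, and each interval gets one label. Here is a toy version with just the first three classes (tmap handles the interval endpoints slightly differently, but the pairing logic is the same):

```r
# 4 break points delimit 3 intervals, one per land cover class;
# the cell values below are made up
class_labels <- cut(c(1, 2, 3),
                    breaks = 1:4,
                    labels = c("Evergreen needleleaf forests",
                               "Evergreen broadleaf forests",
                               "Deciduous needleleaf forests"),
                    right = FALSE)  # include left endpoint: 1 falls in [1, 2)
class_labels
```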

We’ll assign our landcover map to a new object named CO_landcover_tmap:

# Creates map of Colorado land cover and assigns to object named "CO_landcover_tmap"
CO_landcover_tmap<-tm_shape(CO_land_cover)+
                    tm_raster(breaks=landcover_breaks,
                              labels=raster_map_label, 
                              palette="Set1",
                              title="Land Cover Classes (IGBP)")+
                      tm_layout(legend.outside=T, # puts legend outside map extent
                          legend.title.size=0.7, # sets legend title size
                          main.title="Land Cover in Colorado, 2019", # sets main title
                          main.title.size=0.8, # sets main title size
                          attr.outside=T)+ # puts credits outside map extent
  tm_credits("Map Author: Aditya Ranganath\nData Source: MODIS", # credits text
             position=c(0.78,0),  # sets credits position
             size=0.38) # sets credits size
# Prints "CO_landcover_tmap"
CO_landcover_tmap
## Some legend labels were too wide. These labels have been resized to 0.58. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

As before, it can be instructive to view this map in interactive mode, and view the underlying topographic or Open Street Map base layers:

# shifts to "view" mode
tmap_mode("view")
## tmap mode set to interactive viewing
# plots "CO_landcover_tmap" in view mode
CO_landcover_tmap
## Credits not supported in view mode.

You’ll see, for instance, a concentration of “Urban and built-up lands” in the Denver metro region, as we would expect.

# switches back to "plot" mode
tmap_mode("plot")
## tmap mode set to plotting

5 Processing Satellite Data

In this section, we’ll explore some simple ways to take a MODIS raster, and process it to derive novel information.

For example, let’s say that we want to derive county-level information on land cover. Currently, we only have state-level information, but recall that we can use zonal statistics to extract county-level information from a state-level raster. In particular, let’s apply zonal statistics to our land cover raster and a polygon vector dataset of Colorado counties to extract information about the dominant (i.e. most commonly occurring) land cover class for each county in Colorado.

First, let’s extract a vector dataset of Colorado counties using tidycensus, and assign it to an object named co_counties:

# Extracts county-level vector polygon dataset for Colorado using tidycensus
 co_counties<-get_decennial(geography = "county",
                            state="CO",
                            variables = "P001001",
                            year = 2010,
                            geometry=TRUE) %>% 
                st_transform(4326)
## Getting data from the 2010 decennial Census
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using Census Summary File 1

Now that we have our county-level spatial dataset, we’ll employ zonal statistics, via the exact_extract() function, to generate a vector containing the land-class codes that appear most frequently in each county.

Below, the first argument to exact_extract() is the name of the raster object we’re using to compute our statistics (CO_land_cover), while the second argument is the name of the vector data object containing our zones of interest (co_counties). By setting fun="majority", we are specifying that we want the exact_extract() function to return the land class code that occurs most frequently in each county. We’ll assign this vector of results to an object named county_land_cover_majority_zonal:
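To see what fun="majority" is doing conceptually, here is a base-R sketch of a majority (i.e. modal) statistic applied to a vector of land class codes; the codes below are made up. Note that exact_extract() additionally weights each cell by the fraction of it that falls inside the polygon, which this sketch ignores:

```r
# modal value: tabulate the codes and pick the most frequent one
majority <- function(codes) {
  counts <- table(codes)
  as.numeric(names(which.max(counts)))
}

# made-up vector of land class codes for the cells in one county;
# grasslands (code 10) appears most often, so it wins
majority(c(10, 10, 10, 7, 7, 13))  # 10
```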

# Uses zonal statistics to extract the most frequently occurring land-cover class for each CO county and assigns the resulting vector to an object named "county_land_cover_majority_zonal"
county_land_cover_majority_zonal<-exact_extract(CO_land_cover, co_counties, fun="majority")

Now, we’ll column-bind the county_land_cover_majority_zonal vector to our co_counties dataset, and assign the result to a new object named county_land_cover_majority:

# Attaches "county_land_cover_majority_zonal" vector to "co_counties" dataset
county_land_cover_majority<-cbind(co_counties, county_land_cover_majority_zonal)

Go ahead and open up county_land_cover_majority in the data viewer:

# Opens "county_land_cover_majority" in data viewer
View(county_land_cover_majority)

Note that we now have a column (“county_land_cover_majority_zonal”) that contains information on which land class covers the largest number of grid cells within a given county. Just eyeballing the dataset, it looks like grasslands occupy the greatest number of pixels in most Colorado counties. In other words, grasslands are the dominant land cover class in the vast majority of Colorado counties.

Now, let’s do something slightly more complicated; instead of just extracting the land cover class that covers the most pixels in each county, let’s create a dataset that provides a count of the number of grid cells associated with each land class that is present in each county.

First, we’ll use the exact_extract() function, but this time, we won’t specify a function argument. As a result, exact_extract(CO_land_cover, co_counties) will return a list of 64 datasets (one for each of Colorado’s 64 counties), where each dataset contains all of the cell-level land cover values for that county. We’ll assign this list to an object named county_land_cover_zonal_list:

# makes list of county-level datasets containing land-cover values by grid-cell; assigns list to object named "county_land_cover_zonal_list"
county_land_cover_zonal_list<-exact_extract(CO_land_cover, co_counties)

We’ll name the list elements in county_land_cover_zonal_list (i.e. county-level datasets) after the GEOID values associated with each Colorado county. To do so, we’ll first make a vector of Colorado county GEOIDs by extracting this column from co_counties, and assign it to an object named geoid_vector:

# creates vector of Colorado county GEOIDs and assign it to an object named "geoid_vector"
geoid_vector<-co_counties$GEOID

Now, we’ll name our list elements after these GEOIDs using the names() function; in other words, each list element (corresponding to a distinct county-level dataset), will be named according to that county’s GEOID. The code below takes our vector of GEOIDs (geoid_vector) and uses it to name the list elements of county_land_cover_zonal_list:

# names list elements of "county_land_cover_zonal_list" according to corresponding GEOIDs contained in "geoid_vector" object
names(county_land_cover_zonal_list)<-geoid_vector

To get a sense of what this list structure looks like, go ahead and open it up in the RStudio data viewer:

# Views "county_land_cover_zonal_list" in data viewer
View(county_land_cover_zonal_list)

Let’s say we want to view the dataset in county_land_cover_zonal_list that corresponds to Boulder; Boulder’s GEOID is “08013”, so we can pass the following to the data viewer in order to view the grid-cell-level land classes for Boulder:

# Extracts the Boulder dataset from "county_land_cover_zonal_list" and views it in the Data Viewer
View(county_land_cover_zonal_list[["08013"]])

Each county has a corresponding dataset within county_land_cover_zonal_list that looks something like this.

Now, we’ll bind all of these distinct county-level datasets into one large dataset, and use .id="GEOID" to specify that we want to create a column in this large dataset that contains the GEOID for each observation, so that we can match observations to counties. We’ll assign this large dataset to an object named county_land_cover_zonal_final.

# binds county-level datasets in "county_land_cover_zonal_list" into a single dataframe and assigns it to a new object named "county_land_cover_zonal_final"
county_land_cover_zonal_final<-bind_rows(county_land_cover_zonal_list, .id = "GEOID")

Please open county_land_cover_zonal_final in the data viewer so you can see what the resulting dataset looks like.

Now, we’ll take county_land_cover_zonal_final, group by “GEOID” and “value” (i.e. the land class code), and then count the number of observations in each distinct GEOID/value pair using the count() function. Setting name="number_grid_cells" within count() specifies that the column containing the counts should be named “number_grid_cells”. We’ll assign this summary dataset to a new object named county_landclass_distribution:

# Creates a dataset that provides information on the number of pixels associated with each land class, for each GEOID, and assigns the result to an object named "county_landclass_distribution"
county_landclass_distribution<-county_land_cover_zonal_final %>% 
                                  group_by(GEOID, value) %>% 
                                  count(name="number_grid_cells")
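For intuition, the same GEOID-by-class counting can be sketched in base R with table(); the toy data frame below stands in for the real extracted values (the GEOIDs and codes are made up for illustration). One difference from count() is that table() also reports zero-count GEOID/class combinations:

```r
# toy stand-in for "county_land_cover_zonal_final": two counties,
# with a few cell-level land class values each
toy <- data.frame(GEOID = c("08013", "08013", "08013", "08031", "08031"),
                  value = c(10, 10, 1, 13, 13))

# cross-tabulate counties against land classes, then flatten to long form
counts <- as.data.frame(table(GEOID = toy$GEOID, value = toy$value))
names(counts)[3] <- "number_grid_cells"
counts
```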

Let’s take a look at the dataset we just created:

# Opens "county_landclass_distribution" in data viewer
View(county_landclass_distribution)

Note that the name of the column containing the land class codes is “value”, which is somewhat generic and confusing; let’s rename it to “land_class”, so that it is more descriptive; we’ll assign the change back into the county_landclass_distribution object:

# Renames "value" column to "land_class"
county_landclass_distribution<-county_landclass_distribution %>% 
                                rename(land_class=value)

Now, let’s merge our co_counties dataset into county_landclass_distribution, using “GEOID” as the join field; we’ll assign the joined dataset to a new object named county_landclass_distribution_final:

# Joins "co_counties" to "county_landclass_distribution" using "GEOID" as join field; assigns joined dataset to object named "county_landclass_distribution_final" 
county_landclass_distribution_final<-left_join(county_landclass_distribution, co_counties, by="GEOID" )

Let’s now open up county_landclass_distribution_final:

# Opens "county_landclass_distribution_final" in data viewer
View(county_landclass_distribution_final)

Note that we now have a dataset providing the number of grid cells/pixels associated with each land class, in each county.

As an exercise, see if you can figure out how to calculate the area associated with each land class in each county. Make sure to think about an appropriate Coordinate Reference System/projection for your geographic domain and the type of calculation you’d like to perform, the raster’s resolution, and the units of measurement associated with your chosen coordinate reference system.
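As a hint for the arithmetic (though not for the CRS choice), once you know a single cell’s area, you can multiply it by your grid-cell counts. MODIS land cover cells on the sinusoidal grid are roughly 463 m on a side (nominally "500 m"); treating each cell as a square of that size is an approximation:

```r
# approximate edge length of a MODIS sinusoidal grid cell, in meters
cell_edge_m <- 463.3127

# area of one cell in square kilometers (~0.21 km^2)
cell_area_km2 <- cell_edge_m^2 / 1e6

# hypothetical count of grassland cells in one county
n_cells <- 1200
grassland_area_km2 <- n_cells * cell_area_km2
grassland_area_km2
```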

6 Exploring Google Earth Engine

Before exploring the GEE code editor, have a look at the Earth Engine catalog, available here. Most of these datasets are publicly available through various organizations and government agencies (like the MODIS data), but Earth Engine has helpfully archived many of these datasets in one place for easy access.

Go ahead and click on a dataset you’re interested in, and have a look at the landing page. One of the really useful features of the catalog is that each dataset’s landing page includes some sample code at the bottom; click the blue “Open in Code Editor” button and the code will open up in the Earth Engine code editor. You can highlight it and click “Run”, and the code will execute; try to interpret the code and think about what each line is doing.

The Earth Engine code editor uses JavaScript, which may be familiar to some of you, but completely new to others. The best way to familiarize yourself with Earth Engine, and the JavaScript syntax, is to explore the sample code for datasets of interest and play around with the pre-written scripts.

After you’ve explored some datasets and scripts in the Code Editor, go to this link, and run the code to compute and visualize an NDVI layer for the Boulder region. This script is taken from a GEE tutorial that is part of the program’s documentation, and only slightly modified to focus on the Boulder area (rather than the SF Bay area). The original tutorial, which includes additional material not included in the script, is available here.

Acknowledgement

Parts of the material on extracting and working with rasters derived from MODIS draw heavily on tutorials produced by the rspatialdata project. The tutorial on land cover data (by Dilinie Seimon) is available here, and the one on vegetation (by the same author) is available here.