This tutorial benefited extensively from several other sources, including but not limited to rspatial.org and Nick Eubank’s (here) and Justin de Benedictis-Kessner’s (here) great online tutorials.

This is the modernized 2026 version of the original 2019 workshop. It has been updated to use the current spatial R stack built around sf and terra, replacing the now-retired sp, rgdal, rgeos, and maptools packages. The pedagogy of the original tutorial is preserved; only the code has been brought up to date.

You’ll need to go to this GitHub link to download the data used in the tutorial. The Rmd assumes the shapefiles and CSVs live in data/BostonData/ relative to the Rmd’s location.

Setup

If you have not installed the packages used in this workshop, you can install them all at once with:

install.packages(c(
  "sf", "terra", "tidyverse", "tmap", "spdep", "spatialreg",
  "tidygeocoder", "ggmap", "RColorBrewer", "units", "knitr"
))

Load the libraries we will use throughout the workshop:

library(sf)
library(terra)
library(tidyverse)
library(tmap)
library(spdep)
library(spatialreg)
library(tidygeocoder)
library(ggmap)
library(RColorBrewer)
library(units)

Some operations that require external APIs (Google Maps for basemaps, geocoding, driving distance) are kept as eval = FALSE code examples; instructions to enable them are provided where they appear.

How R Sees Spatial Data

In this first section, we will understand how R thinks about spatial data and how to import and export spatial datasets. Modern R uses the sf (simple features) package for vector data, which treats spatial objects as ordinary data frames with a special geometry column. This is far simpler than the older sp framework.

Make Spatial Data Points

We start with the most basic geometric object, points. Like regular data points in plots, each spatial point needs a pair of x-y coordinates. In sf, we can build a points object directly from a data.frame with st_as_sf().

plus.coordinates <- data.frame(
  x = c(1.5, 1.75, 2.25, 2.5,   # horizontal
        2,    2,    2,    2,    # vertical
        2),                     # center
  y = c(1.25, 1.25, 1.25, 1.25,
        1.75, 1.5,  1,    0.75,
        1.25)
)
first.points <- st_as_sf(plus.coordinates, coords = c("x", "y"))
plot(st_geometry(first.points), pch = 16)

To understand how R sees these points, we can use the print() or summary() command, or query specific properties directly. sf objects show their geometry type, bounding box, CRS (Coordinate Reference System), number of features, and the attribute table all together.

print(first.points)
## Simple feature collection with 9 features and 0 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1.5 ymin: 0.75 xmax: 2.5 ymax: 1.75
## CRS:           NA
##            geometry
## 1  POINT (1.5 1.25)
## 2 POINT (1.75 1.25)
## 3 POINT (2.25 1.25)
## 4  POINT (2.5 1.25)
## 5    POINT (2 1.75)
## 6     POINT (2 1.5)
## 7       POINT (2 1)
## 8    POINT (2 0.75)
## 9    POINT (2 1.25)
st_geometry_type(first.points, by_geometry = FALSE)
## [1] POINT
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
st_bbox(first.points)
## xmin ymin xmax ymax 
## 1.50 0.75 2.50 1.75
st_is_longlat(first.points)    # NA: no CRS assigned yet
## [1] NA
st_crs(first.points)
## Coordinate Reference System: NA
nrow(first.points)
## [1] 9
st_coordinates(first.points)
##          X    Y
##  [1,] 1.50 1.25
##  [2,] 1.75 1.25
##  [3,] 2.25 1.25
##  [4,] 2.50 1.25
##  [5,] 2.00 1.75
##  [6,] 2.00 1.50
##  [7,] 2.00 1.00
##  [8,] 2.00 0.75
##  [9,] 2.00 1.25

Add Coordinate Reference Systems

The object we created above is a simple spatial object for now. We can make this points object geospatial, or relate its coordinates to locations on the Earth, by adding a Coordinate Reference System (CRS). A CRS combines information about an object’s geographic coordinate system (GCS) and its projection, if any.1 The CRS of an sf object can be retrieved with st_crs(). An empty CRS means the object does not yet have one assigned. To check whether the object has unprojected longitude/latitude coordinates, use st_is_longlat().

st_is_longlat(first.points) # NA before a CRS is set
## [1] NA

The CRS can be defined using the EPSG reference code for a coordinate system. We’ll use the most common CRS, EPSG:4326, also known as WGS 84.2 In sf, you can simply pass the integer EPSG code.

st_crs(first.points) <- 4326
st_is_longlat(first.points)
## [1] TRUE
st_crs(first.points)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

In the old sp framework we needed a separate SpatialPointsDataFrame to attach attributes to points. In sf, an sf object is already a data frame: you can attach attributes as ordinary columns.

first.points$vars1 <- rnorm(9)
first.points$vars2 <- 101:109
first.points

If you start with a data.frame that contains longitude and latitude columns, you can construct an sf object directly with st_as_sf(df, coords = c("lon", "lat"), crs = 4326). We’ll use this method below.

Polygons

In sf, polygons are built up from numeric matrices of vertex coordinates. A POLYGON geometry expects a closed ring (first and last row equal). We build geometries with st_polygon(), wrap them in a geometry column with st_sfc(), and finally combine geometries and attributes into an sf object with st_sf().

# build two triangles (note: closed rings -- the first point is repeated at the end)
triangle1 <- st_polygon(list(rbind(c(1, 1), c(2, 1), c(1.5, 0), c(1, 1))))
triangle2 <- st_polygon(list(rbind(c(1, 1), c(1.5, 2), c(2, 1), c(1, 1))))

# wrap the geometries in a geometry column with a CRS
triangles.geom <- st_sfc(triangle1, triangle2, crs = 4326)

# attach attributes (rows align with geometries by position)
first.polys <- st_sf(
  id    = c("triangle1", "triangle2"),
  attr1 = c(1, 2),
  attr2 = c(6, 5),
  geometry = triangles.geom
)
first.polys
plot(first.polys["attr1"])

Since the CRS was already set when we called st_sfc(..., crs = 4326), the polygons are geospatial: their coordinates are interpreted as longitude/latitude on the WGS84 datum.

Importing and Exporting Spatial Data

Of course creating polygons by hand is not common. Instead we use existing spatial data by importing them from external sources. The most common data format is a shapefile (.shp).

In sf, we read shapefiles with st_read(). We just give it the path to the .shp file – it will automatically read the sidecar files (.dbf, .prj, .shx).

boston <- st_read(file.path("data", "BostonData", "city_council_districts.shp"))
## Reading layer `city_council_districts' from data source 
##   `/private/tmp/gis-r-tutorial-staging/repo/data/BostonData/city_council_districts.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 9 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 739594.1 ymin: 2908187 xmax: 795021.6 ymax: 2970073
## Projected CRS: NAD83 / Massachusetts Mainland (ftUS)

An sf object is a normal data frame with an extra geometry column. You can inspect it just like any other data frame.

st_bbox(boston)
##      xmin      ymin      xmax      ymax 
##  739594.1 2908187.3  795021.6 2970072.8
head(st_coordinates(st_centroid(st_geometry(boston))))
##             X       Y
## [1,] 783326.6 2961212
## [2,] 777534.0 2948923
## [3,] 776960.3 2933459
## [4,] 769011.6 2930656
## [5,] 757770.8 2922005
## [6,] 753001.3 2931516
st_drop_geometry(boston)[1:5, ]
st_crs(boston)
## Coordinate Reference System:
##   User input: NAD83 / Massachusetts Mainland (ftUS) 
##   wkt:
## PROJCRS["NAD83 / Massachusetts Mainland (ftUS)",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]],
##             ID["EPSG",6269]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["Degree",0.0174532925199433]]],
##     CONVERSION["unnamed",
##         METHOD["Lambert Conic Conformal (2SP)",
##             ID["EPSG",9802]],
##         PARAMETER["Latitude of false origin",41,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8821]],
##         PARAMETER["Longitude of false origin",-71.5,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8822]],
##         PARAMETER["Latitude of 1st standard parallel",41.7166666666667,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8823]],
##         PARAMETER["Latitude of 2nd standard parallel",42.6833333333333,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8824]],
##         PARAMETER["Easting at false origin",656166.666666667,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8826]],
##         PARAMETER["Northing at false origin",2460625,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8827]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["US survey foot",0.304800609601219,
##                 ID["EPSG",9003]]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["US survey foot",0.304800609601219,
##                 ID["EPSG",9003]]]]

To export an sf object back out to a shapefile (or GeoPackage, GeoJSON, etc.), you would call st_write(boston, "boston.shp").

Combining Spatial Data with Other Data Sources

Merging

Spatial + Non-Spatial

Now we will learn how to merge a spatial object with tabular data, using a common unique identifier. First, load the elections tabular data.

boston.elections <- read.csv(file.path("data", "BostonData", "dist_results.csv"))

Now we have i) an sf polygons object named boston, and ii) a data.frame called boston.elections, where DISTRICT is the unique identifier in boston and District is the unique identifier in boston.elections. Because an sf object is just a data frame, we can use ordinary merge() (or dplyr::left_join()).

boston <- merge(boston, boston.elections,
                by.x = "DISTRICT", by.y = "District")
st_drop_geometry(boston)[1:5, ]

In the old sp world there was a warning that you should never merge the @data slot directly with another table because it could re-order rows and corrupt the link to the geometries. With sf this concern is gone: the geometry column travels with its row, so merge() and left_join() are safe.

Spatial + Spatial

Before merging two spatial datasets, the first thing you need to do is to check whether they have the same CRS. Both files should agree on the correspondence between coordinate values and points on the Earth. If you try to combine two spatial objects with different CRSs, your results won’t make much sense.

If your objects don’t have the same CRS, you’ll need to reproject one of them. Reprojecting changes not just the metadata associated with an object but also all the actual x and y coordinates.

You can re-project any vector object with st_transform(), which takes i) the object and ii) the EPSG code (or CRS object) you want it to take.

MyCity.reprojected <- st_transform(MyCity, 4326)

If you want to align one object’s CRS with another’s, pull the CRS out and pass it in:

MyCityA.reprojected <- st_transform(MyCityA, st_crs(MyCityB))

Spatial Merges

So far we have used a common unique identifier to merge, as with regular datasets. We can also merge using geographic location. Now we’re going to work with Boston municipal service requests for graffiti removal.

We know the geolocation of graffiti instances but not which district they are in. We will overlay the graffiti points on our Boston district polygons using st_join(). This is the modern replacement for the old over() function from sp.

boston.requests <- read.csv(file.path("data", "BostonData", "graffiti.csv"))
boston.graf <- st_as_sf(boston.requests,
                        coords = c("LONGITUDE", "LATITUDE"),
                        crs = 4326)

Next we make sure the Boston map has the same CRS as the points:

boston <- st_transform(boston, 4326)

For every graffiti instance, we want to know which district it falls in. We use st_join() with join = st_intersects. The result is a new sf object that has each point’s attributes plus the polygon attributes from the district it fell in.

boston.graf.joined <- st_join(boston.graf, boston, join = st_intersects)
st_drop_geometry(boston.graf.joined)[1:5, ]
names(boston.graf.joined)
##  [1] "CASE_ENQUIRY_ID"                "OPEN_DT"                       
##  [3] "CLOSED_DT"                      "TIME_DIFF"                     
##  [5] "CASE_STATUS"                    "CLOSURE_REASON"                
##  [7] "CASE_TITLE"                     "SUBJECT"                       
##  [9] "REASON"                         "TYPE"                          
## [11] "QUEUE"                          "Department"                    
## [13] "Location"                       "neighborhood"                  
## [15] "neighborhood_services_district" "LOCATION_STREET_NAME"          
## [17] "LOCATION_ZIPCODE"               "Source"                        
## [19] "completion.time"                "DISTRICT"                      
## [21] "OBJECTID"                       "Councillor"                    
## [23] "SHAPE_area"                     "SHAPE_len"                     
## [25] "Winning_Perc"                   "Runner_Perc"                   
## [27] "MOV"                            "geometry"

If you only need the index of the intersecting polygon (analogous to the old over(points, geometry(polys))), use st_intersects():

intersections <- st_intersects(boston.graf, boston)
intersections[1:5]
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 7
## 
## [[5]]
## [1] 1

Making Maps

Making Maps with plot()

Now that we’ve merged our spatial data with other data sources, we are ready to create maps. The base plot() method for sf objects gives you a quick look at your data. By default it creates one panel per attribute column, so often you’ll want to plot just the geometry, or one column at a time.

plot(st_geometry(boston))
plot(st_geometry(boston.graf), col = "blue", cex = 0.5, pch = 16, add = TRUE)

Another useful approach is coloring the polygons based on the values in one of the attributes. For instance, since we have already merged the Boston districts with the election results, we can color the districts based on the margin of victory. These maps – areas shaded in proportion to a statistical variable – are called choropleth maps.

The old workflow involved building color codes by hand with classInt, RColorBrewer, and findColours(). The modern tmap package handles classification, color palettes, and legends in one step:

tm_shape(boston) +
  tm_polygons("MOV",
              style = "quantile",
              n = 5,
              palette = "Blues",
              border.col = NA,
              title = "MOV Quantile") +
  tm_layout(legend.position = c("right", "bottom"),
            legend.text.size = 0.6)

If you don’t like this color scheme, simply swap in another palette from RColorBrewer. To see all the palettes, run:

RColorBrewer::display.brewer.all()

Here is the same map with a different palette:

tm_shape(boston) +
  tm_polygons("MOV",
              style = "quantile",
              n = 5,
              palette = "BuPu",
              border.col = NA,
              title = "MOV Quantile") +
  tm_layout(legend.position = c("right", "bottom"),
            legend.text.size = 0.6)

We can also add district labels at polygon centroids. Use st_centroid() to compute centroids, then add a tm_text() layer (tmap issues a warning when computing centroids for longitude/latitude data, which we silence here).

boston.centroids <- suppressWarnings(st_centroid(boston))

tm_shape(boston) +
  tm_polygons("MOV",
              style = "quantile",
              n = 5,
              palette = "BuPu",
              border.col = NA,
              title = "MOV Quantile") +
  tm_shape(boston.centroids) +
  tm_text("DISTRICT", size = 0.7) +
  tm_layout(legend.position = c("right", "bottom"),
            legend.text.size = 0.6)

Making Maps with ggplot2 and geom_sf

The modern replacement for spplot() is ggplot2’s geom_sf(), which understands sf objects natively. It is far cleaner than the old fortify() workflow: there is no need to “tidy” the polygons into vertices; we just hand the sf object to geom_sf().

ggplot(boston) +
  geom_sf(aes(fill = MOV), color = "white") +
  scale_fill_distiller(palette = "Blues", direction = 1,
                       name = "Margin of Victory") +
  ggtitle("Boston City Council Districts") +
  theme_minimal()

We can layer points on top in exactly the same way – another geom_sf(). This entirely replaces the old fortify() / geom_polygon(aes(x = long, y = lat, group = group)) dance.

ggplot() +
  geom_sf(data = boston, aes(fill = MOV), color = "white", alpha = 0.6) +
  geom_sf(data = boston.graf, color = "purple", size = 0.3) +
  scale_fill_distiller(palette = "Blues", direction = 1,
                       name = "Margin of Victory") +
  theme_minimal()

We can also create a 2-D density surface for the graffiti points using stat_density_2d(). Because stat_density_2d() expects plain x/y columns, we extract them from the geometry first.

graf.xy <- boston.graf %>%
  mutate(LONGITUDE = st_coordinates(.)[, 1],
         LATITUDE  = st_coordinates(.)[, 2]) %>%
  st_drop_geometry()

ggplot() +
  geom_sf(data = boston, aes(fill = MOV), color = "white", alpha = 0.6) +
  stat_density_2d(data = graf.xy,
                  aes(x = LONGITUDE, y = LATITUDE, alpha = after_stat(level)),
                  geom = "polygon", fill = "purple") +
  scale_fill_distiller(palette = "Blues", direction = 1,
                       name = "Margin of Victory") +
  labs(alpha = "Service-request density") +
  theme_minimal()

Adding a Basemap with ggmap

The ggmap package is convenient for combining a tiled basemap with ggplot2 layers. Since 2018, Google has required an API key for its tile and geocoding services; Stamen tiles previously fetched directly through ggmap now also require a (free) Stadia Maps key. Because no key is bundled with this Rmd, the basemap example below is left as eval = FALSE. To run it locally:

  1. Sign up for a Stadia Maps key (or Google Maps Platform key).
  2. Store it in an environment variable, e.g. add STADIA_MAPS_API_KEY=... to your .Renviron.
  3. Register the key in R as shown below.

Do not hard-code your key in this Rmd.

# Register your basemap key (read from environment, not hard-coded!)
ggmap::register_stadiamaps(key = Sys.getenv("STADIA_MAPS_API_KEY"))
# For Google basemaps / geocoding, use:
# ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))

boston.bbox <- c(left = -71.20, bottom = 42.22,
                 right = -70.99, top   = 42.40)
g_toner <- get_stadiamap(bbox = boston.bbox, zoom = 12,
                         maptype = "stamen_toner_lite")

base_t <- ggmap(g_toner, extent = "device", legend = "none")
base_t

Once you have a basemap object base_t, layering sf data on top is straightforward. Use geom_sf(..., inherit.aes = FALSE) so that ggmap’s coordinate system is preserved.

base_t +
  geom_sf(data = boston,
          aes(fill = MOV),
          color = "white", alpha = 0.6,
          inherit.aes = FALSE) +
  geom_sf(data = boston.graf,
          color = "purple", size = 0.4,
          inherit.aes = FALSE) +
  scale_fill_distiller(palette = "Blues", direction = 1,
                       name = "Margin of Victory") +
  coord_sf(crs = 4326)

Location, Area, Distance Calculation Tools

Geocoding

Geocoding is the process of converting addresses or place names into latitude/longitude coordinates. The modern, key-free option is tidygeocoder::geocode(), which by default uses OpenStreetMap’s Nominatim service.

places <- tibble(name = c("Harvard Square, Cambridge, MA",
                          "30 Wadsworth St, Cambridge, MA"))
places.geo <- places %>%
  tidygeocoder::geocode(address = name, method = "osm", lat = lat, long = lon)
places.geo

You can also reverse-geocode – ask for an address given a coordinate – using tidygeocoder::reverse_geocode():

places.geo %>%
  tidygeocoder::reverse_geocode(lat = lat, long = lon, method = "osm",
                                address = matched_address)

For finer-grained results (e.g. Google’s loctype precision categories, or driving-time/distance queries), you can still use the Google Maps API through ggmap::geocode() / ggmap::mapdist(). These require a Google API key. The example below is left as eval = FALSE; set GOOGLE_MAPS_API_KEY in your .Renviron and call ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY")) to enable it.

# ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))
pub    <- "Harvard Square, Cambridge, MA"
office <- "30 Wadsworth St, Cambridge, MA"

ggmap::mapdist(pub, office, mode = "bicycling", output = "simple")

Distance

For the Euclidean / great-circle distance between points, sf provides st_distance(), which is the modern replacement for sp::spDists() and geosphere::distHaversine(). It returns a units-aware matrix.

# arbitrary points
fake.points <- st_as_sf(
  data.frame(x = c(-144, -144.4, -144.4, -144.25),
             y = c(70.3,  70,    70.3,   70)),
  coords = c("x", "y"), crs = 4326
)
st_distance(fake.points)  # great-circle distance in meters, as a units matrix
## Units: [m]
##          [,1]      [,2]     [,3]      [,4]
## [1,]     0.00 36617.976 14993.31 34668.246
## [2,] 36617.98     0.000 33358.53  5704.643
## [3,] 14993.31 33358.530     0.00 33835.869
## [4,] 34668.25  5704.643 33835.87     0.000

You can convert units easily with the units package:

set_units(st_distance(fake.points), "km")
## Units: [km]
##          [,1]      [,2]     [,3]      [,4]
## [1,]  0.00000 36.617976 14.99331 34.668246
## [2,] 36.61798  0.000000 33.35853  5.704643
## [3,] 14.99331 33.358530  0.00000 33.835869
## [4,] 34.66825  5.704643 33.83587  0.000000

Point-to-polygon distance

What about the distance from a point to a polygon, plus the polygon it is closest to? In the old workflow this was geosphere::dist2Line(). In sf, compute distances with st_distance() and locate the nearest polygon with st_nearest_feature(). To recover the closest coordinate on the polygon boundary, use st_nearest_points().

home <- places.geo %>%
  filter(name == "Harvard Square, Cambridge, MA") %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326)

# distance from `home` to every district
d <- st_distance(home, boston)
set_units(d[1, ], "km")
## Units: [km]
## [1] 3.2605491 3.9552389 7.7237294 7.3733911 8.4186785 5.0967815 3.9284210
## [8] 2.1074987 0.5813276
# which district is closest?
nearest.idx <- st_nearest_feature(home, boston)
nearest.idx
## [1] 9
boston[nearest.idx, "DISTRICT"]
# the actual closest point on the boundary
near.line <- st_nearest_points(home, boston[nearest.idx, ])
near.line
## Geometry set for 1 feature 
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -71.11974 ymin: 42.36828 xmax: -71.11894 ymax: 42.37347
## Geodetic CRS:  WGS 84

We can visualize the result as a map: shade the nearest district, plot the point of interest, and draw a line to the nearest point on the boundary.

ggplot() +
  geom_sf(data = boston, fill = "grey95", color = "grey60") +
  geom_sf(data = boston[nearest.idx, ], fill = "purple", alpha = 0.4) +
  geom_sf(data = near.line, color = "red") +
  geom_sf(data = home, color = "blue", size = 2) +
  theme_minimal()

Spatial Data Analysis

In this section we will introduce how to detect and deal with spatial dependence in regression models using spatial data.

Assessing Spatial Clustering

To learn how to detect spatial autocorrelation, we will use one informal and one formal method. For this part we will use a new polygons dataset called boston_n, which includes neighborhood coordinates, unique IDs, and demographic variables from 2000.

boston_n <- st_read(file.path("data", "BostonData", "Boston_N.shp"))
## Reading layer `Boston_N' from data source 
##   `/private/tmp/gis-r-tutorial-staging/repo/data/BostonData/Boston_N.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 26 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -71.19125 ymin: 42.22792 xmax: -70.92278 ymax: 42.39699
## Geodetic CRS:  WGS 84
boston_n <- st_make_valid(boston_n)
head(st_drop_geometry(boston_n))
plot(st_geometry(boston_n))

The question we are interested in is whether the share of Black and Hispanic population in a neighborhood (variable blck_l_) is associated with education attainment, measured as the percentage of people with at least a bachelor’s degree (mnmmbchlr). Shapefile column names are often truncated; we’ll rename them for clarity. With sf, the object behaves like a data.frame, so manipulation is straightforward.

names(boston_n)
## [1] "Name"     "OBJECTI"  "Acres"    "SqMiles"  "ShpSTAr"  "ShpSTLn"  "mnbchlr" 
## [8] "blck_l_"  "geometry"
names(boston_n)[7:8] <- c("minimumbachelor", "black_lat")

An informal approach to detecting spatial autocorrelation is to map the residuals of an OLS regression. If neighboring residuals are similar in size, there is likely spatial autocorrelation. We’ll regress education on the share of Black and Hispanic population, store the residuals, and map them.

my.model <- lm(minimumbachelor ~ black_lat, data = boston_n)
summary(my.model)
## 
## Call:
## lm(formula = minimumbachelor ~ black_lat, data = boston_n)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.817 -10.877  -1.205   9.195  28.534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.3272     4.1495  13.815 6.42e-13 ***
## black_lat    -0.5895     0.1154  -5.106 3.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.04 on 24 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.5008 
## F-statistic: 26.08 on 1 and 24 DF,  p-value: 3.174e-05
boston_n$resid_ols <- residuals(my.model)

tm_shape(boston_n) +
  tm_polygons("resid_ols",
              style = "quantile",
              n = 5,
              palette = "Blues",
              border.col = "grey",
              title = "OLS Residuals Quantile") +
  tm_layout(legend.position = c("left", "bottom"),
            legend.text.size = 0.6)

Ideally the residuals would look randomly distributed across space. Instead, similar-sized residuals cluster together, suggesting unmodeled spatial structure.

We can do more formal tests of spatial autocorrelation as well. To do this, we first build the spatial weights matrix that identifies which neighborhoods share edges. We use poly2nb() (which accepts an sf object directly in modern spdep) and convert to a row-standardized listw with nb2listw().

continuity.nb <- poly2nb(boston_n, queen = TRUE)
summary(continuity.nb)
## Neighbour list object:
## Number of regions: 26 
## Number of nonzero links: 92 
## Percentage nonzero weights: 13.60947 
## Average number of links: 3.538462 
## 2 regions with no links:
## 12, 26
## 3 disjoint connected subgraphs
## Link number distribution:
## 
## 0 1 2 3 4 5 6 8 9 
## 2 3 4 4 5 4 2 1 1 
## 3 least connected regions:
## 13 18 23 with 1 link
## 1 most connected region:
## 16 with 9 links
# plot neighbor links over the polygons
plot(st_geometry(boston_n), border = "grey")
plot(continuity.nb,
     coords = st_coordinates(suppressWarnings(st_centroid(boston_n))),
     add = TRUE, col = "red")

continuity.listw <- nb2listw(continuity.nb, style = "W", zero.policy = TRUE)
summary(continuity.listw, zero.policy = TRUE)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 26 
## Number of nonzero links: 92 
## Percentage nonzero weights: 13.60947 
## Average number of links: 3.538462 
## 2 regions with no links:
## 12, 26
## 3 disjoint connected subgraphs
## Link number distribution:
## 
## 0 1 2 3 4 5 6 8 9 
## 2 3 4 4 5 4 2 1 1 
## 3 least connected regions:
## 13 18 23 with 1 link
## 1 most connected region:
## 16 with 9 links
## 
## Weights style: W 
## Weights constants summary:
##    n  nn S0       S1       S2
## W 24 576 24 15.42398 104.2362
head(continuity.listw$weights)
## [[1]]
## [1] 0.25 0.25 0.25 0.25
## 
## [[2]]
## [1] 0.2 0.2 0.2 0.2 0.2
## 
## [[3]]
## [1] 0.25 0.25 0.25 0.25
## 
## [[4]]
## [1] 0.5 0.5
## 
## [[5]]
## [1] 0.25 0.25 0.25 0.25
## 
## [[6]]
## [1] 0.5 0.5

One formal test of spatial dependence is Moran’s I, computed with moran.test(). We can also visualize spatial autocorrelation with moran.plot().

moran.test(boston_n$minimumbachelor,
           listw = continuity.listw,
           zero.policy = TRUE,
           adjust.n = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  boston_n$minimumbachelor  
## weights: continuity.listw  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 2.5398, p-value = 0.005545
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.34111187       -0.04347826        0.02292917
moran.plot(boston_n$minimumbachelor,
           continuity.listw,
           zero.policy = TRUE)

The statistic ranges from -1 to 1. A value near 1 indicates strong positive spatial autocorrelation.

We can also calculate local Moran’s I statistics for each polygon. For local Moran’s I, use localmoran().

locmoran.boston_n <- localmoran(boston_n$minimumbachelor,
                                continuity.listw,
                                zero.policy = TRUE)
head(locmoran.boston_n)
##              Ii          E.Ii       Var.Ii        Z.Ii Pr(z != E(Ii))
## 1  0.2408182217 -8.038335e-03 0.0453505345  1.16857848      0.2425735
## 2 -0.1154786299 -1.287267e-03 0.0055709744 -1.52991632      0.1260374
## 3  0.0054051954 -1.605987e-04 0.0009132582  0.18417486      0.8538763
## 4 -0.2063822346 -4.609860e-03 0.0571664223 -0.84390041      0.3987251
## 5  0.0144825088 -1.906523e-05 0.0001084314  1.39263572      0.1637300
## 6  0.0004766308 -1.906523e-05 0.0002375165  0.03216389      0.9743414
boston_n <- cbind(boston_n, as.data.frame(locmoran.boston_n))

We can also map the Z scores from the local Moran’s I output for a more formal picture of clustering.

tm_shape(boston_n) +
  tm_polygons("Z.Ii",
              style = "quantile",
              n = 5,
              palette = "Blues",
              border.col = "grey",
              title = "Z Scores Quantile") +
  tm_layout(legend.position = c("left", "bottom"),
            legend.text.size = 0.6)

We can now see clear spatial dependence among neighborhoods.

Spatial Regression

Now that we have detected spatial dependence, we can try to take it into account in our regression models. Spatial dependence in a regression setting can be modeled similarly to autocorrelation in time series. Spatial autocorrelation can be thought of in terms of non-zero correlation with the error term, similar to the presence of an endogenous variable. This implies that OLS estimates in the non-spatial model will be biased and inconsistent (Anselin and Bera, 1998).

Spatial regression functions used to live in spdep but have moved to the spatialreg package. We estimate a spatial autoregressive (SAR) or spatial lag model with spatialreg::lagsarlm(). This model returns the \(\rho\) parameter (the spatial autoregressive coefficient), its p-value, and lets us compare with standard OLS.

my.model.sar <- lagsarlm(minimumbachelor ~ black_lat,
                         data = boston_n,
                         listw = continuity.listw,
                         zero.policy = TRUE)
summary(my.model.sar)
## 
## Call:lagsarlm(formula = minimumbachelor ~ black_lat, data = boston_n, 
##     listw = continuity.listw, zero.policy = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8753  -5.8921  -1.9245   5.8259  27.5723 
## 
## Type: lag 
## Regions with no neighbours included:
##  12 26 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 33.59490    7.54598  4.4520 8.506e-06
## black_lat   -0.41208    0.10759 -3.8301 0.0001281
## 
## Rho: 0.47336, LR test value: 8.8257, p-value: 0.0029702
## Asymptotic standard error: 0.13514
##     z-value: 3.5027, p-value: 0.00046063
## Wald statistic: 12.269, p-value: 0.00046063
## 
## Log likelihood: -100.1283 for lag model
## ML residual variance (sigma squared): 121.54, (sigma: 11.025)
## Number of observations: 26 
## Number of parameters estimated: 4 
## AIC: 208.26, (AIC for lm: 215.08)
## LM test for residual autocorrelation
## test value: 0.0048895, p-value: 0.94425

Now we can map the new residuals to see if the spatial model improves on the OLS residual pattern.

boston_n$resid_sar <- residuals(my.model.sar)
head(st_drop_geometry(boston_n))
tm_shape(boston_n) +
  tm_polygons("resid_sar",
              style = "quantile",
              n = 5,
              palette = "Blues",
              border.col = "grey",
              title = "SAR Residuals Quantile") +
  tm_layout(legend.position = c("left", "bottom"),
            legend.text.size = 0.6)

It looks a bit better.

There are many ways to model spatial autocorrelation. The SAR or spatial lag model above is analogous to a time-series lagged dependent variable model, where neighboring values play the role of past values. Another type, the spatial error model, models spatial correlation in the disturbance term – analogous to serial autocorrelation in time series. We can implement this with spatialreg::errorsarlm():

my.model.ear <- errorsarlm(minimumbachelor ~ black_lat,
                           data = boston_n,
                           listw = continuity.listw,
                           zero.policy = TRUE)
summary(my.model.ear)
## 
## Call:errorsarlm(formula = minimumbachelor ~ black_lat, data = boston_n, 
##     listw = continuity.listw, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20.50404  -6.95151   0.27443  10.75028  27.16799 
## 
## Type: error 
## Regions with no neighbours included:
##  12 26 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 51.75193    6.03055  8.5816 < 2.2e-16
## black_lat   -0.58032    0.12462 -4.6567 3.214e-06
## 
## Lambda: 0.56342, LR test value: 3.5274, p-value: 0.060362
## Asymptotic standard error: 0.17381
##     z-value: 3.2416, p-value: 0.0011886
## Wald statistic: 10.508, p-value: 0.0011886
## 
## Log likelihood: -102.7775 for error model
## ML residual variance (sigma squared): 144.44, (sigma: 12.018)
## Number of observations: 26 
## Number of parameters estimated: 4 
## AIC: 213.55, (AIC for lm: 215.08)

Another approach is to use spatial heteroskedasticity- and autocorrelation-robust (HAC) standard errors with OLS. The most commonly used estimator is from Conley (1999, 2008), which allows for serial correlation over time as well as spatial correlation among units within a given distance. Conley standard errors are easy to implement in Stata; in R, see the conleyreg or lmtest/sandwich-based implementations for current options.


  1. A Geographic Coordinate System (GCS) links coordinates to geographic locations using a particular spherical or spheroidal model of the Earth. A Projected Coordinate System (PCS) takes a GCS and projects it onto a two-dimensional surface. All geospatial objects will have a GCS, but not all will be projected.↩︎

  2. WGS 84 is a Geographic Coordinate System. It is not projected. However, many Projected Coordinate Systems are based on WGS 84. You can find the EPSG reference number for almost any CRS at spatialreference.org.↩︎