This tutorial benefited extensively from several other sources, including but not limited to rspatial.org and Nick Eubank’s (here) and Justin de Benedictis-Kessner’s (here) great online tutorials.
This is the modernized 2026 version of the original 2019 workshop. It
has been updated to use the current spatial R stack built around
sf and terra, replacing the now-retired
sp, rgdal, rgeos, and
maptools packages. The pedagogy of the original tutorial is
preserved; only the code has been brought up to date.
You’ll need to go to this GitHub
link to download the data used in the tutorial. The Rmd assumes the
shapefiles and CSVs live in data/BostonData/ relative to
the Rmd’s location.
If you have not installed the packages used in this workshop, you can install them all at once with:
install.packages(c(
"sf", "terra", "tidyverse", "tmap", "spdep", "spatialreg",
"tidygeocoder", "ggmap", "RColorBrewer", "units", "knitr"
))
Load the libraries we will use throughout the workshop:
library(sf)
library(terra)
library(tidyverse)
library(tmap)
library(spdep)
library(spatialreg)
library(tidygeocoder)
library(ggmap)
library(RColorBrewer)
library(units)
Some operations that require external APIs (Google Maps for basemaps,
geocoding, driving distance) are kept as eval = FALSE code
examples; instructions to enable them are provided where they
appear.
In this first section, we will understand how R thinks about spatial
data and how to import and export spatial datasets. Modern R uses the
sf (simple features) package for vector data, which treats
spatial objects as ordinary data frames with a special
geometry column. This is far simpler than the older
sp framework.
We start with the most basic geometric object, points. Like regular
data points in plots, each spatial point needs a pair of x-y
coordinates. In sf, we can build a points object directly
from a data.frame with st_as_sf().
plus.coordinates <- data.frame(
x = c(1.5, 1.75, 2.25, 2.5, # horizontal
2, 2, 2, 2, # vertical
2), # center
y = c(1.25, 1.25, 1.25, 1.25,
1.75, 1.5, 1, 0.75,
1.25)
)
first.points <- st_as_sf(plus.coordinates, coords = c("x", "y"))
plot(st_geometry(first.points), pch = 16)
To understand how R sees these points, we can use the
print() or summary() command, or query
specific properties directly. sf objects show their
geometry type, bounding box, CRS (Coordinate Reference System), number
of features, and the attribute table all together.
print(first.points)
## Simple feature collection with 9 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1.5 ymin: 0.75 xmax: 2.5 ymax: 1.75
## CRS: NA
## geometry
## 1 POINT (1.5 1.25)
## 2 POINT (1.75 1.25)
## 3 POINT (2.25 1.25)
## 4 POINT (2.5 1.25)
## 5 POINT (2 1.75)
## 6 POINT (2 1.5)
## 7 POINT (2 1)
## 8 POINT (2 0.75)
## 9 POINT (2 1.25)
st_geometry_type(first.points, by_geometry = FALSE)
## [1] POINT
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
st_bbox(first.points)
## xmin ymin xmax ymax
## 1.50 0.75 2.50 1.75
st_is_longlat(first.points) # NA: no CRS assigned yet
## [1] NA
st_crs(first.points)
## Coordinate Reference System: NA
nrow(first.points)
## [1] 9
st_coordinates(first.points)
## X Y
## [1,] 1.50 1.25
## [2,] 1.75 1.25
## [3,] 2.25 1.25
## [4,] 2.50 1.25
## [5,] 2.00 1.75
## [6,] 2.00 1.50
## [7,] 2.00 1.00
## [8,] 2.00 0.75
## [9,] 2.00 1.25
The object we created above is a simple spatial object for
now. We can make this points object geospatial, or relate its
coordinates to locations on the Earth, by adding a Coordinate Reference
System (CRS). A CRS combines information about an object’s geographic
coordinate system (GCS) and its projection, if any.1 The CRS of an
sf object can be retrieved with st_crs(). An
empty CRS means the object does not yet have one assigned. To check
whether the object has unprojected longitude/latitude coordinates, use
st_is_longlat().
st_is_longlat(first.points) # NA before a CRS is set
## [1] NA
The CRS can be defined using the EPSG reference code for a coordinate
system. We’ll use the most common CRS, EPSG:4326, also known as WGS
84.2 In
sf, you can simply pass the integer EPSG code.
st_crs(first.points) <- 4326
st_is_longlat(first.points)
## [1] TRUE
st_crs(first.points)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## MEMBER["World Geodetic System 1984 (G2296)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
In the old sp framework we needed a separate
SpatialPointsDataFrame to attach attributes to points. In
sf, an sf object is already a data
frame: you can attach attributes as ordinary columns.
first.points$vars1 <- rnorm(9)
first.points$vars2 <- 101:109
first.points
If you start with a data.frame that contains longitude
and latitude columns, you can construct an sf object
directly with
st_as_sf(df, coords = c("lon", "lat"), crs = 4326). We’ll
use this method below.
In sf, polygons are built up from numeric matrices of
vertex coordinates. A POLYGON geometry expects a closed
ring (first and last row equal). We build geometries with
st_polygon(), wrap them in a geometry column with
st_sfc(), and finally combine geometries and attributes
into an sf object with st_sf().
# build two triangles (note: closed rings -- the first point is repeated at the end)
triangle1 <- st_polygon(list(rbind(c(1, 1), c(2, 1), c(1.5, 0), c(1, 1))))
triangle2 <- st_polygon(list(rbind(c(1, 1), c(1.5, 2), c(2, 1), c(1, 1))))
# wrap the geometries in a geometry column with a CRS
triangles.geom <- st_sfc(triangle1, triangle2, crs = 4326)
# attach attributes (rows align with geometries by position)
first.polys <- st_sf(
id = c("triangle1", "triangle2"),
attr1 = c(1, 2),
attr2 = c(6, 5),
geometry = triangles.geom
)
first.polys
plot(first.polys["attr1"])
Since the CRS was already set when we called
st_sfc(..., crs = 4326), the polygons are geospatial: their
coordinates are interpreted as longitude/latitude on the WGS84
datum.
Of course creating polygons by hand is not common. Instead we use
existing spatial data by importing them from external sources. The most
common data format is a shapefile (.shp).
In sf, we read shapefiles with st_read().
We just give it the path to the .shp file – it will
automatically read the sidecar files (.dbf,
.prj, .shx).
boston <- st_read(file.path("data", "BostonData", "city_council_districts.shp"))
## Reading layer `city_council_districts' from data source
## `/private/tmp/gis-r-tutorial-staging/repo/data/BostonData/city_council_districts.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 9 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 739594.1 ymin: 2908187 xmax: 795021.6 ymax: 2970073
## Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
An sf object is a normal data frame with an extra
geometry column. You can inspect it just like any other
data frame.
st_bbox(boston)
## xmin ymin xmax ymax
## 739594.1 2908187.3 795021.6 2970072.8
head(st_coordinates(st_centroid(st_geometry(boston))))
## X Y
## [1,] 783326.6 2961212
## [2,] 777534.0 2948923
## [3,] 776960.3 2933459
## [4,] 769011.6 2930656
## [5,] 757770.8 2922005
## [6,] 753001.3 2931516
st_drop_geometry(boston)[1:5, ]
st_crs(boston)
## Coordinate Reference System:
## User input: NAD83 / Massachusetts Mainland (ftUS)
## wkt:
## PROJCRS["NAD83 / Massachusetts Mainland (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6269]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["Degree",0.0174532925199433]]],
## CONVERSION["unnamed",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",41,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-71.5,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",41.7166666666667,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",42.6833333333333,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",656166.666666667,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",2460625,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219,
## ID["EPSG",9003]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219,
## ID["EPSG",9003]]]]
To export an sf object back out to a shapefile (or
GeoPackage, GeoJSON, etc.), you would call
st_write(boston, "boston.shp").
Now we will learn how to merge a spatial object with tabular data, using a common unique identifier. First, load the elections tabular data.
boston.elections <- read.csv(file.path("data", "BostonData", "dist_results.csv"))
Now we have i) an sf polygons object named
boston, and ii) a data.frame called
boston.elections, where DISTRICT is the unique
identifier in boston and District is the unique
identifier in boston.elections. Because an sf
object is just a data frame, we can use ordinary merge()
(or dplyr::left_join()).
boston <- merge(boston, boston.elections,
by.x = "DISTRICT", by.y = "District")
st_drop_geometry(boston)[1:5, ]
In the old sp world there was a warning that you should
never merge the @data slot directly with another table
because it could re-order rows and corrupt the link to the geometries.
With sf this concern is gone: the geometry column travels
with its row, so merge() and left_join() are
safe.
Before merging two spatial datasets, the first thing you need to do is to check whether they have the same CRS. Both files should agree on the correspondence between coordinate values and points on the Earth. If you try to combine two spatial objects with different CRSs, your results won’t make much sense.
If your objects don’t have the same CRS, you’ll need to reproject one of them. Reprojecting changes not just the metadata associated with an object but also all the actual x and y coordinates.
You can re-project any vector object with
st_transform(), which takes i) the object and ii) the EPSG
code (or CRS object) you want it to take.
MyCity.reprojected <- st_transform(MyCity, 4326)
If you want to align one object’s CRS with another’s, pull the CRS out and pass it in:
MyCityA.reprojected <- st_transform(MyCityA, st_crs(MyCityB))
So far we have used a common unique identifier to merge, as with regular datasets. We can also merge using geographic location. Now we’re going to work with Boston municipal service requests for graffiti removal.
We know the geolocation of graffiti instances but not which district
they are in. We will overlay the graffiti points on our Boston district
polygons using st_join(). This is the modern replacement
for the old over() function from sp.
boston.requests <- read.csv(file.path("data", "BostonData", "graffiti.csv"))
boston.graf <- st_as_sf(boston.requests,
coords = c("LONGITUDE", "LATITUDE"),
crs = 4326)
Next we make sure the Boston map has the same CRS as the points:
boston <- st_transform(boston, 4326)
For every graffiti instance, we want to know which district it falls
in. We use st_join() with
join = st_intersects. The result is a new sf
object that has each point’s attributes plus the polygon attributes from
the district it fell in.
boston.graf.joined <- st_join(boston.graf, boston, join = st_intersects)
st_drop_geometry(boston.graf.joined)[1:5, ]
names(boston.graf.joined)
## [1] "CASE_ENQUIRY_ID" "OPEN_DT"
## [3] "CLOSED_DT" "TIME_DIFF"
## [5] "CASE_STATUS" "CLOSURE_REASON"
## [7] "CASE_TITLE" "SUBJECT"
## [9] "REASON" "TYPE"
## [11] "QUEUE" "Department"
## [13] "Location" "neighborhood"
## [15] "neighborhood_services_district" "LOCATION_STREET_NAME"
## [17] "LOCATION_ZIPCODE" "Source"
## [19] "completion.time" "DISTRICT"
## [21] "OBJECTID" "Councillor"
## [23] "SHAPE_area" "SHAPE_len"
## [25] "Winning_Perc" "Runner_Perc"
## [27] "MOV" "geometry"
If you only need the index of the intersecting polygon (analogous to
the old over(points, geometry(polys))), use
st_intersects():
intersections <- st_intersects(boston.graf, boston)
intersections[1:5]
## [[1]]
## [1] 3
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 1
plot()Now that we’ve merged our spatial data with other data sources, we
are ready to create maps. The base plot() method for
sf objects gives you a quick look at your data. By default
it creates one panel per attribute column, so often you’ll want to plot
just the geometry, or one column at a time.
plot(st_geometry(boston))
plot(st_geometry(boston.graf), col = "blue", cex = 0.5, pch = 16, add = TRUE)
Another useful approach is coloring the polygons based on the values in one of the attributes. For instance, since we have already merged the Boston districts with the election results, we can color the districts based on the margin of victory. These maps – areas shaded in proportion to a statistical variable – are called choropleth maps.
The old workflow involved building color codes by hand with
classInt, RColorBrewer, and
findColours(). The modern tmap package handles
classification, color palettes, and legends in one step:
tm_shape(boston) +
tm_polygons("MOV",
style = "quantile",
n = 5,
palette = "Blues",
border.col = NA,
title = "MOV Quantile") +
tm_layout(legend.position = c("right", "bottom"),
legend.text.size = 0.6)
If you don’t like this color scheme, simply swap in another palette
from RColorBrewer. To see all the palettes, run:
RColorBrewer::display.brewer.all()
Here is the same map with a different palette:
tm_shape(boston) +
tm_polygons("MOV",
style = "quantile",
n = 5,
palette = "BuPu",
border.col = NA,
title = "MOV Quantile") +
tm_layout(legend.position = c("right", "bottom"),
legend.text.size = 0.6)
We can also add district labels at polygon centroids. Use
st_centroid() to compute centroids, then add a
tm_text() layer (tmap issues a warning when
computing centroids for longitude/latitude data, which we silence
here).
boston.centroids <- suppressWarnings(st_centroid(boston))
tm_shape(boston) +
tm_polygons("MOV",
style = "quantile",
n = 5,
palette = "BuPu",
border.col = NA,
title = "MOV Quantile") +
tm_shape(boston.centroids) +
tm_text("DISTRICT", size = 0.7) +
tm_layout(legend.position = c("right", "bottom"),
legend.text.size = 0.6)
ggplot2 and geom_sfThe modern replacement for spplot() is
ggplot2’s geom_sf(), which understands
sf objects natively. It is far cleaner than the old
fortify() workflow: there is no need to “tidy” the polygons
into vertices; we just hand the sf object to
geom_sf().
ggplot(boston) +
geom_sf(aes(fill = MOV), color = "white") +
scale_fill_distiller(palette = "Blues", direction = 1,
name = "Margin of Victory") +
ggtitle("Boston City Council Districts") +
theme_minimal()
We can layer points on top in exactly the same way – another
geom_sf(). This entirely replaces the old
fortify() /
geom_polygon(aes(x = long, y = lat, group = group))
dance.
ggplot() +
geom_sf(data = boston, aes(fill = MOV), color = "white", alpha = 0.6) +
geom_sf(data = boston.graf, color = "purple", size = 0.3) +
scale_fill_distiller(palette = "Blues", direction = 1,
name = "Margin of Victory") +
theme_minimal()
We can also create a 2-D density surface for the graffiti points
using stat_density_2d(). Because
stat_density_2d() expects plain x/y columns, we extract
them from the geometry first.
graf.xy <- boston.graf %>%
mutate(LONGITUDE = st_coordinates(.)[, 1],
LATITUDE = st_coordinates(.)[, 2]) %>%
st_drop_geometry()
ggplot() +
geom_sf(data = boston, aes(fill = MOV), color = "white", alpha = 0.6) +
stat_density_2d(data = graf.xy,
aes(x = LONGITUDE, y = LATITUDE, alpha = after_stat(level)),
geom = "polygon", fill = "purple") +
scale_fill_distiller(palette = "Blues", direction = 1,
name = "Margin of Victory") +
labs(alpha = "Service-request density") +
theme_minimal()
ggmapThe ggmap package is convenient for combining a tiled
basemap with ggplot2 layers. Since 2018, Google has
required an API key for its tile and geocoding services; Stamen tiles
previously fetched directly through ggmap now also require
a (free) Stadia Maps key. Because no key is bundled with this Rmd, the
basemap example below is left as eval = FALSE. To run it
locally:
STADIA_MAPS_API_KEY=... to your
.Renviron.Do not hard-code your key in this Rmd.
# Register your basemap key (read from environment, not hard-coded!)
ggmap::register_stadiamaps(key = Sys.getenv("STADIA_MAPS_API_KEY"))
# For Google basemaps / geocoding, use:
# ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))
boston.bbox <- c(left = -71.20, bottom = 42.22,
right = -70.99, top = 42.40)
g_toner <- get_stadiamap(bbox = boston.bbox, zoom = 12,
maptype = "stamen_toner_lite")
base_t <- ggmap(g_toner, extent = "device", legend = "none")
base_t
Once you have a basemap object base_t, layering
sf data on top is straightforward. Use
geom_sf(..., inherit.aes = FALSE) so that
ggmap’s coordinate system is preserved.
base_t +
geom_sf(data = boston,
aes(fill = MOV),
color = "white", alpha = 0.6,
inherit.aes = FALSE) +
geom_sf(data = boston.graf,
color = "purple", size = 0.4,
inherit.aes = FALSE) +
scale_fill_distiller(palette = "Blues", direction = 1,
name = "Margin of Victory") +
coord_sf(crs = 4326)
Geocoding is the process of converting addresses or place names into
latitude/longitude coordinates. The modern, key-free option is
tidygeocoder::geocode(), which by default uses
OpenStreetMap’s Nominatim service.
places <- tibble(name = c("Harvard Square, Cambridge, MA",
"30 Wadsworth St, Cambridge, MA"))
places.geo <- places %>%
tidygeocoder::geocode(address = name, method = "osm", lat = lat, long = lon)
places.geo
You can also reverse-geocode – ask for an address given a coordinate
– using tidygeocoder::reverse_geocode():
places.geo %>%
tidygeocoder::reverse_geocode(lat = lat, long = lon, method = "osm",
address = matched_address)
For finer-grained results (e.g. Google’s loctype
precision categories, or driving-time/distance queries), you can still
use the Google Maps API through ggmap::geocode() /
ggmap::mapdist(). These require a Google API key. The
example below is left as eval = FALSE; set
GOOGLE_MAPS_API_KEY in your .Renviron and call
ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))
to enable it.
# ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))
pub <- "Harvard Square, Cambridge, MA"
office <- "30 Wadsworth St, Cambridge, MA"
ggmap::mapdist(pub, office, mode = "bicycling", output = "simple")
For the Euclidean / great-circle distance between points,
sf provides st_distance(), which is the modern
replacement for sp::spDists() and
geosphere::distHaversine(). It returns a units-aware
matrix.
# arbitrary points
fake.points <- st_as_sf(
data.frame(x = c(-144, -144.4, -144.4, -144.25),
y = c(70.3, 70, 70.3, 70)),
coords = c("x", "y"), crs = 4326
)
st_distance(fake.points) # great-circle distance in meters, as a units matrix
## Units: [m]
## [,1] [,2] [,3] [,4]
## [1,] 0.00 36617.976 14993.31 34668.246
## [2,] 36617.98 0.000 33358.53 5704.643
## [3,] 14993.31 33358.530 0.00 33835.869
## [4,] 34668.25 5704.643 33835.87 0.000
You can convert units easily with the units package:
set_units(st_distance(fake.points), "km")
## Units: [km]
## [,1] [,2] [,3] [,4]
## [1,] 0.00000 36.617976 14.99331 34.668246
## [2,] 36.61798 0.000000 33.35853 5.704643
## [3,] 14.99331 33.358530 0.00000 33.835869
## [4,] 34.66825 5.704643 33.83587 0.000000
What about the distance from a point to a polygon, plus the polygon
it is closest to? In the old workflow this was
geosphere::dist2Line(). In sf, compute
distances with st_distance() and locate the nearest polygon
with st_nearest_feature(). To recover the closest
coordinate on the polygon boundary, use
st_nearest_points().
home <- places.geo %>%
filter(name == "Harvard Square, Cambridge, MA") %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326)
# distance from `home` to every district
d <- st_distance(home, boston)
set_units(d[1, ], "km")
## Units: [km]
## [1] 3.2605491 3.9552389 7.7237294 7.3733911 8.4186785 5.0967815 3.9284210
## [8] 2.1074987 0.5813276
# which district is closest?
nearest.idx <- st_nearest_feature(home, boston)
nearest.idx
## [1] 9
boston[nearest.idx, "DISTRICT"]
# the actual closest point on the boundary
near.line <- st_nearest_points(home, boston[nearest.idx, ])
near.line
## Geometry set for 1 feature
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -71.11974 ymin: 42.36828 xmax: -71.11894 ymax: 42.37347
## Geodetic CRS: WGS 84
We can visualize the result as a map: shade the nearest district, plot the point of interest, and draw a line to the nearest point on the boundary.
ggplot() +
geom_sf(data = boston, fill = "grey95", color = "grey60") +
geom_sf(data = boston[nearest.idx, ], fill = "purple", alpha = 0.4) +
geom_sf(data = near.line, color = "red") +
geom_sf(data = home, color = "blue", size = 2) +
theme_minimal()
In this section we will introduce how to detect and deal with spatial dependence in regression models using spatial data.
To learn how to detect spatial autocorrelation, we will use one
informal and one formal method. For this part we will
use a new polygons dataset called boston_n, which includes
neighborhood coordinates, unique IDs, and demographic variables from
2000.
boston_n <- st_read(file.path("data", "BostonData", "Boston_N.shp"))
## Reading layer `Boston_N' from data source
## `/private/tmp/gis-r-tutorial-staging/repo/data/BostonData/Boston_N.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 26 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -71.19125 ymin: 42.22792 xmax: -70.92278 ymax: 42.39699
## Geodetic CRS: WGS 84
boston_n <- st_make_valid(boston_n)
head(st_drop_geometry(boston_n))
plot(st_geometry(boston_n))
The question we are interested in is whether the share of Black and
Hispanic population in a neighborhood (variable blck_l_) is
associated with education attainment, measured as the percentage of
people with at least a bachelor’s degree (mnmmbchlr).
Shapefile column names are often truncated; we’ll rename them for
clarity. With sf, the object behaves like a
data.frame, so manipulation is straightforward.
names(boston_n)
## [1] "Name" "OBJECTI" "Acres" "SqMiles" "ShpSTAr" "ShpSTLn" "mnbchlr"
## [8] "blck_l_" "geometry"
names(boston_n)[7:8] <- c("minimumbachelor", "black_lat")
An informal approach to detecting spatial autocorrelation is to map the residuals of an OLS regression. If neighboring residuals are similar in size, there is likely spatial autocorrelation. We’ll regress education on the share of Black and Hispanic population, store the residuals, and map them.
my.model <- lm(minimumbachelor ~ black_lat, data = boston_n)
summary(my.model)
##
## Call:
## lm(formula = minimumbachelor ~ black_lat, data = boston_n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.817 -10.877 -1.205 9.195 28.534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.3272 4.1495 13.815 6.42e-13 ***
## black_lat -0.5895 0.1154 -5.106 3.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.04 on 24 degrees of freedom
## Multiple R-squared: 0.5207, Adjusted R-squared: 0.5008
## F-statistic: 26.08 on 1 and 24 DF, p-value: 3.174e-05
boston_n$resid_ols <- residuals(my.model)
tm_shape(boston_n) +
tm_polygons("resid_ols",
style = "quantile",
n = 5,
palette = "Blues",
border.col = "grey",
title = "OLS Residuals Quantile") +
tm_layout(legend.position = c("left", "bottom"),
legend.text.size = 0.6)
Ideally the residuals would look randomly distributed across space. Instead, similar-sized residuals cluster together, suggesting unmodeled spatial structure.
We can do more formal tests of spatial autocorrelation as well. To do
this, we first build the spatial weights matrix that identifies which
neighborhoods share edges. We use poly2nb() (which accepts
an sf object directly in modern spdep) and
convert to a row-standardized listw with
nb2listw().
continuity.nb <- poly2nb(boston_n, queen = TRUE)
summary(continuity.nb)
## Neighbour list object:
## Number of regions: 26
## Number of nonzero links: 92
## Percentage nonzero weights: 13.60947
## Average number of links: 3.538462
## 2 regions with no links:
## 12, 26
## 3 disjoint connected subgraphs
## Link number distribution:
##
## 0 1 2 3 4 5 6 8 9
## 2 3 4 4 5 4 2 1 1
## 3 least connected regions:
## 13 18 23 with 1 link
## 1 most connected region:
## 16 with 9 links
# plot neighbor links over the polygons
plot(st_geometry(boston_n), border = "grey")
plot(continuity.nb,
coords = st_coordinates(suppressWarnings(st_centroid(boston_n))),
add = TRUE, col = "red")
continuity.listw <- nb2listw(continuity.nb, style = "W", zero.policy = TRUE)
summary(continuity.listw, zero.policy = TRUE)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 26
## Number of nonzero links: 92
## Percentage nonzero weights: 13.60947
## Average number of links: 3.538462
## 2 regions with no links:
## 12, 26
## 3 disjoint connected subgraphs
## Link number distribution:
##
## 0 1 2 3 4 5 6 8 9
## 2 3 4 4 5 4 2 1 1
## 3 least connected regions:
## 13 18 23 with 1 link
## 1 most connected region:
## 16 with 9 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 24 576 24 15.42398 104.2362
head(continuity.listw$weights)
## [[1]]
## [1] 0.25 0.25 0.25 0.25
##
## [[2]]
## [1] 0.2 0.2 0.2 0.2 0.2
##
## [[3]]
## [1] 0.25 0.25 0.25 0.25
##
## [[4]]
## [1] 0.5 0.5
##
## [[5]]
## [1] 0.25 0.25 0.25 0.25
##
## [[6]]
## [1] 0.5 0.5
One formal test of spatial dependence is Moran’s I, computed with
moran.test(). We can also visualize spatial autocorrelation
with moran.plot().
moran.test(boston_n$minimumbachelor,
listw = continuity.listw,
zero.policy = TRUE,
adjust.n = TRUE)
##
## Moran I test under randomisation
##
## data: boston_n$minimumbachelor
## weights: continuity.listw
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 2.5398, p-value = 0.005545
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.34111187 -0.04347826 0.02292917
moran.plot(boston_n$minimumbachelor,
continuity.listw,
zero.policy = TRUE)
The statistic ranges from -1 to 1. A value near 1 indicates strong positive spatial autocorrelation.
We can also calculate local Moran’s I statistics for each
polygon. For local Moran’s I, use localmoran().
locmoran.boston_n <- localmoran(boston_n$minimumbachelor,
continuity.listw,
zero.policy = TRUE)
head(locmoran.boston_n)
## Ii E.Ii Var.Ii Z.Ii Pr(z != E(Ii))
## 1 0.2408182217 -8.038335e-03 0.0453505345 1.16857848 0.2425735
## 2 -0.1154786299 -1.287267e-03 0.0055709744 -1.52991632 0.1260374
## 3 0.0054051954 -1.605987e-04 0.0009132582 0.18417486 0.8538763
## 4 -0.2063822346 -4.609860e-03 0.0571664223 -0.84390041 0.3987251
## 5 0.0144825088 -1.906523e-05 0.0001084314 1.39263572 0.1637300
## 6 0.0004766308 -1.906523e-05 0.0002375165 0.03216389 0.9743414
boston_n <- cbind(boston_n, as.data.frame(locmoran.boston_n))
We can also map the Z scores from the local Moran’s I output for a more formal picture of clustering.
tm_shape(boston_n) +
tm_polygons("Z.Ii",
style = "quantile",
n = 5,
palette = "Blues",
border.col = "grey",
title = "Z Scores Quantile") +
tm_layout(legend.position = c("left", "bottom"),
legend.text.size = 0.6)
We can now see clear spatial dependence among neighborhoods.
Now that we have detected spatial dependence, we can try to take it into account in our regression models. Spatial dependence in a regression setting can be modeled similarly to autocorrelation in time series. Spatial autocorrelation can be thought of in terms of non-zero correlation with the error term, similar to the presence of an endogenous variable. This implies that OLS estimates in the non-spatial model will be biased and inconsistent (Anselin and Bera, 1998).
Spatial regression functions used to live in spdep but
have moved to the spatialreg package. We estimate a spatial
autoregressive (SAR) or spatial lag model with
spatialreg::lagsarlm(). This model returns the \(\rho\) parameter (the spatial
autoregressive coefficient), its p-value, and lets us compare with
standard OLS.
my.model.sar <- lagsarlm(minimumbachelor ~ black_lat,
data = boston_n,
listw = continuity.listw,
zero.policy = TRUE)
summary(my.model.sar)
##
## Call:lagsarlm(formula = minimumbachelor ~ black_lat, data = boston_n,
## listw = continuity.listw, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8753 -5.8921 -1.9245 5.8259 27.5723
##
## Type: lag
## Regions with no neighbours included:
## 12 26
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 33.59490 7.54598 4.4520 8.506e-06
## black_lat -0.41208 0.10759 -3.8301 0.0001281
##
## Rho: 0.47336, LR test value: 8.8257, p-value: 0.0029702
## Asymptotic standard error: 0.13514
## z-value: 3.5027, p-value: 0.00046063
## Wald statistic: 12.269, p-value: 0.00046063
##
## Log likelihood: -100.1283 for lag model
## ML residual variance (sigma squared): 121.54, (sigma: 11.025)
## Number of observations: 26
## Number of parameters estimated: 4
## AIC: 208.26, (AIC for lm: 215.08)
## LM test for residual autocorrelation
## test value: 0.0048895, p-value: 0.94425
Now we can map the new residuals to see if the spatial model improves on the OLS residual pattern.
boston_n$resid_sar <- residuals(my.model.sar)
head(st_drop_geometry(boston_n))
tm_shape(boston_n) +
tm_polygons("resid_sar",
style = "quantile",
n = 5,
palette = "Blues",
border.col = "grey",
title = "SAR Residuals Quantile") +
tm_layout(legend.position = c("left", "bottom"),
legend.text.size = 0.6)
It looks a bit better.
There are many ways to model spatial autocorrelation. The SAR or
spatial lag model above is analogous to a time-series lagged dependent
variable model, where neighboring values play the role of past values.
Another type, the spatial error model, models spatial
correlation in the disturbance term – analogous to serial
autocorrelation in time series. We can implement this with
spatialreg::errorsarlm():
my.model.ear <- errorsarlm(minimumbachelor ~ black_lat,
data = boston_n,
listw = continuity.listw,
zero.policy = TRUE)
summary(my.model.ear)
##
## Call:errorsarlm(formula = minimumbachelor ~ black_lat, data = boston_n,
## listw = continuity.listw, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.50404 -6.95151 0.27443 10.75028 27.16799
##
## Type: error
## Regions with no neighbours included:
## 12 26
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 51.75193 6.03055 8.5816 < 2.2e-16
## black_lat -0.58032 0.12462 -4.6567 3.214e-06
##
## Lambda: 0.56342, LR test value: 3.5274, p-value: 0.060362
## Asymptotic standard error: 0.17381
## z-value: 3.2416, p-value: 0.0011886
## Wald statistic: 10.508, p-value: 0.0011886
##
## Log likelihood: -102.7775 for error model
## ML residual variance (sigma squared): 144.44, (sigma: 12.018)
## Number of observations: 26
## Number of parameters estimated: 4
## AIC: 213.55, (AIC for lm: 215.08)
Another approach is to use spatial heteroskedasticity- and
autocorrelation-robust (HAC) standard errors with OLS. The most commonly
used estimator is from Conley (1999, 2008), which allows for serial
correlation over time as well as spatial correlation among units within
a given distance. Conley standard errors are easy to implement in Stata;
in R, see the conleyreg or
lmtest/sandwich-based implementations for
current options.
A Geographic Coordinate System (GCS) links coordinates to geographic locations using a particular spherical or spheroidal model of the Earth. A Projected Coordinate System (PCS) takes a GCS and projects it onto a two-dimensional surface. All geospatial objects will have a GCS, but not all will be projected.↩︎
WGS 84 is a Geographic Coordinate System. It is not projected. However, many Projected Coordinate Systems are based on WGS 84. You can find the EPSG reference number for almost any CRS at spatialreference.org.↩︎