This tutorial is a companion to the GIS-in-R workshop and assumes familiarity with the
sfpackage and basic causal inference.
Political scientists rarely care about spatial data for its own sake. We care because the units we study (countries, districts, villages, polling stations) are arranged in space, and that arrangement carries information about the assignment of treatment and the plausibility of identification. A border between two regimes can serve as a discontinuity. A policy that applies in some districts but not others creates a quasi-experiment, but only if treatment does not bleed across the line. A study that ignores geography risks both confounding (omitted spatial factors that drive both treatment and outcome) and SUTVA violations (units near the treated are themselves partially treated).
This tutorial is therefore not a generic spatial econometrics primer. The goal is to walk through four quasi-experimental designs that political methodologists actually run with spatial data: geographic regression discontinuity (GRD), spatial difference-in-differences, matching on geographic covariates, and using spatial regression as a robustness check. For each design we show, in code you can copy, how to implement it and diagnose its credibility.
A guiding principle throughout: spatial econometrics models (SAR, SEM, SDM) are diagnostic tools, not causal estimators. The current standard in political methodology (see, e.g., Betz, Cook & Hollenbach; Cook, Hays & Franzese) is to use a design (RDD, DiD, matching, IV) for identification, and to deploy spatial models to probe whether the estimate is robust to plausible spatial dependence. Students should resist the temptation to report a SAR or SDM coefficient as “the effect.”
We use a modern stack: sf for vector geometries,
spdep and spatialreg for weight matrices and
spatial models, fixest for two-way fixed-effects DiD,
MatchIt for matching, and rdrobust for the
geographic RDD (optional; we also show a hand-rolled local linear
estimator).
library(sf)
library(spdep)
library(spatialreg)
library(fixest)
library(MatchIt)
library(rdrobust)
library(ggplot2)
library(dplyr)
library(tidyr)
library(broom)
We build a 30 × 30 grid of municipalities. A vertical
“border” at \(x = 15\) separates a
treated region (left half) from a control region (right half). The
outcome depends on (a) the treatment, (b) a smooth latitudinal gradient
\(f(y)\) representing an omitted
geographic confounder, (c) spatial autocorrelation, and (d) noise. We
also generate a pre-treatment period for the DiD section.
set.seed(20240519)
# 30 x 30 grid -> 900 cells; trim to a circle-ish region for visual variety
grid <- st_make_grid(
st_polygon(list(rbind(c(0, 0), c(30, 0), c(30, 30), c(0, 30), c(0, 0)))),
cellsize = 1.5
) |> st_sf(geometry = _)
grid$id <- seq_len(nrow(grid))
ctr <- st_coordinates(st_centroid(grid))
grid$x <- ctr[, 1]
grid$y <- ctr[, 2]
grid$dx <- grid$x - 15 # signed distance to the vertical border (km)
grid$treat <- as.integer(grid$dx < 0) # left of border = treated
# Spatial autocorrelation via an exponential covariance kernel
D <- as.matrix(dist(ctr))
Sigma <- exp(-D / 4) # range ~4 km
u <- as.numeric(t(chol(Sigma + diag(1e-4, nrow(D)))) %*% rnorm(nrow(D)))
# Outcome: treatment effect (+2), smooth gradient in y (confounder),
# spatial AC, and i.i.d. noise. Same gradient on both sides of the border.
beta_tau <- 2
grid$y0 <- 0.05 * (grid$y - 15)^2 / 5 + 1.5 * u + rnorm(nrow(grid), sd = 1)
grid$y1 <- grid$y0 + beta_tau
grid$Y <- ifelse(grid$treat == 1, grid$y1, grid$y0)
# Pre-period (no treatment effect anywhere) and post-period (treatment kicks in on left)
grid$Y_pre <- 0.05 * (grid$y - 15)^2 / 5 + 1.5 * u + rnorm(nrow(grid), sd = 1)
grid$Y_post <- grid$Y
cat("N units:", nrow(grid),
" Treated:", sum(grid$treat),
" Control:", sum(1 - grid$treat), "\n")
## N units: 400 Treated: 200 Control: 200
A quick map to confirm the setup:
ggplot(grid) +
geom_sf(aes(fill = Y), color = NA) +
geom_vline(xintercept = 15, color = "white", linewidth = 1, linetype = 2) +
scale_fill_viridis_c(option = "C") +
labs(title = "Simulated outcome Y", subtitle = "Dashed line: treatment border (left = treated)") +
theme_minimal()
Notice the latitudinal gradient (top and bottom are higher than the middle); this is the confounder a naive cross-sectional comparison would absorb. The treatment effect adds a constant shift to the left half.
Two issues motivate everything that follows.
SUTVA (the Stable Unit Treatment Value Assumption) requires that the potential outcome of unit \(i\) depends only on \(i\)’s own treatment, not on the treatment of any other unit. In spatially structured settings this is often violated:
When spillovers go in the same direction as the treatment effect, comparing treated to “spatially close” controls understates the effect; comparing to faraway controls may pick up other geographic differences. There is no free lunch; the methodological question is how to make the trade-off transparent.
Choice of neighbor structure matters. Two units can be “neighbors” because they share a border (queen/rook contiguity), because they are within some distance threshold, or because they are among each other’s \(k\) nearest neighbors. The choice should reflect a theoretical claim about how spillover propagates: information networks suggest \(k\)-nearest, commuting suggests distance bands, regulatory interactions suggest contiguity. Report sensitivity to alternative weight matrices.
# Three neighbor definitions, for the same units.
ctr_pts <- st_coordinates(st_centroid(grid))
# (a) Queen contiguity on the polygon grid
nb_queen <- poly2nb(grid, queen = TRUE)
# (b) k = 4 nearest neighbors
nb_knn4 <- knn2nb(knearneigh(ctr_pts, k = 4))
# (c) Distance threshold of 3 km
nb_d3 <- dnearneigh(ctr_pts, d1 = 0, d2 = 3)
# Row-standardized weight matrices
lw <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)
lw_knn4 <- nb2listw(nb_knn4, style = "W", zero.policy = TRUE)
lw_d3 <- nb2listw(nb_d3, style = "W", zero.policy = TRUE)
# Quick descriptive: average number of neighbors per unit
sapply(list(queen = nb_queen, knn4 = nb_knn4, dist3 = nb_d3),
function(nbi) mean(card(nbi)))
## queen knn4 dist3
## 7.41 4.00 11.01
Moran’s I gives a single global summary of spatial autocorrelation:
moran.test(grid$Y, lw, zero.policy = TRUE)
##
## Moran I test under randomisation
##
## data: grid$Y
## weights: lw
##
## Moran I statistic standard deviate = 22.021, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.5723025927 -0.0025062657 0.0006813518
Moran’s I on the raw outcome is far from zero, but most of that comes from the latitudinal confounder, not from a treatment spillover per se. The interesting test is on residuals from your causal model; we return to this in Sections 3 and 5. A useful complement is the Moran scatterplot, which visualizes each unit’s value against the average of its neighbors’:
moran.plot(grid$Y, lw, zero.policy = TRUE,
xlab = "Y", ylab = "Spatially lagged Y",
main = "Moran scatterplot of the outcome")
Points in the upper-right and lower-left quadrants are spatial clusters of high-high and low-low values; off-diagonal points are spatial outliers.
Spatial confounding occurs when an unmeasured variable that varies smoothly across space is correlated with both treatment and outcome. In our simulation this is the latitudinal gradient. A naive OLS comparing the two halves of the map will (by symmetry of our setup) recover roughly the correct effect, but in any real application, where treatment status is not assigned by a clean vertical cut, geography will be confounded with history, ethnicity, economic structure, and so on. Think of regimes that were historically separated by a now-defunct border, ethnic groups whose territory roughly aligns with administrative units, or climate zones that determine both agricultural policy and economic outcomes.
The methodological responses are:
# Naive cross-sectional OLS: badly misleading in general, but useful as a baseline.
ols_naive <- lm(Y ~ treat, data = grid)
summary(ols_naive)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2733529 0.1357118 -2.014217 4.465912e-02
## treat 2.1377681 0.1919254 11.138536 2.881084e-25
The estimate is biased upward or downward depending on which side has more of the gradient, and the standard errors are too small because of spatial autocorrelation. Conventional OLS treats each grid cell as an independent draw, but neighboring cells share unobserved drivers; the effective number of independent observations is far smaller than \(N\). We need a design.
The geographic RD design (Keele & Titiunik, 2015; Dell, 2010) exploits a sharp boundary in space across which treatment changes discontinuously but potential outcomes are assumed continuous. The canonical example is Dell’s (2010) study of the Peruvian mita: a colonial forced-labor system that applied in some districts but not in others, with a sharp administrative boundary. Comparing villages just inside the mita boundary to villages just outside, Dell isolates the long-run economic effect of the institution.
Let \(D_i \in \{0, 1\}\) denote treatment and \(b(s_i)\) the signed distance from unit \(i\)’s location \(s_i\) to the boundary, with \(b < 0\) on the treated side. The identifying assumption is:
\[ \lim_{b \uparrow 0} \mathbb{E}[Y(0) \mid b(s) = b] = \lim_{b \downarrow 0} \mathbb{E}[Y(0) \mid b(s) = b] \]
That is: expected potential outcomes are continuous at the boundary. Anything that jumps at the boundary (ethnicity, prior institutions, climate) threatens identification. In our simulation the latitudinal gradient is, by construction, continuous across the border, so the design works.
In our setup the border is vertical at \(x
= 15\), so signed distance is simply dx = x - 15
(already computed). In real applications you’d compute distance to a
sfc_LINESTRING representing the border and add the
sign:
# Illustrative; not run, since our simulation has a trivial vertical border.
border_line <- st_sfc(st_linestring(rbind(c(15, 0), c(15, 30))), crs = st_crs(grid))
dist_to_border <- as.numeric(st_distance(st_centroid(grid), border_line))
side_sign <- ifelse(grid$treat == 1, -1, 1)
grid$dx_real <- side_sign * dist_to_border
The chunk above is marked eval = FALSE because for the
simulated vertical border, dx is exact and
st_distance would just reproduce it (with a bit of
floating-point noise). Use this pattern with a real
LINESTRING border in applied work.
The workhorse GRD estimator is a local linear regression on each side
of the border, weighted by a kernel that decays as you move away from
the cutoff. We first do it by hand for transparency, then with
rdrobust.
# Hand-rolled local linear estimator at the cutoff with a triangular kernel.
local_linear_rd <- function(d, y, treat, bw, kernel = "triangular") {
in_band <- abs(d) <= bw
d_b <- d[in_band]; y_b <- y[in_band]; t_b <- treat[in_band]
w <- if (kernel == "triangular") pmax(0, 1 - abs(d_b) / bw) else rep(1, length(d_b))
# Interact distance with treatment so slopes differ on each side
fit <- lm(y_b ~ t_b * d_b, weights = w)
list(
tau = unname(coef(fit)["t_b"]),
se = unname(sqrt(diag(vcov(fit)))["t_b"]),
n = sum(in_band)
)
}
# Pick a moderate bandwidth in km
bw_pick <- 5
rd_hand <- local_linear_rd(grid$dx, grid$Y, grid$treat, bw = bw_pick)
rd_hand
## $tau
## [1] 1.231343
##
## $se
## [1] 0.7000539
##
## $n
## [1] 120
The hand-rolled estimate of tau should be near the true
effect of 2. Note that the standard error here is
not spatial-cluster-robust; for publication you would
want Conley (1999) standard errors or clustering at the boundary
segment. See the conleyreg package, or
fixest::feols(..., vcov = conley(cutoff = ...)).
# Conley SEs via fixest. Cutoff in same units as coordinates (km here).
grd_data <- grid |>
st_drop_geometry() |>
mutate(weight_tri = pmax(0, 1 - abs(dx) / bw_pick)) |>
filter(abs(dx) <= bw_pick)
# Attach coordinates back for Conley SE
grd_data$lon <- grid$x[grid$id %in% grd_data$id]
grd_data$lat <- grid$y[grid$id %in% grd_data$id]
fit_conley <- feols(
Y ~ treat * dx,
data = grd_data,
weights = ~ weight_tri,
vcov = conley(cutoff = 3, distance = "spherical")
)
summary(fit_conley)
## OLS estimation, Dep. Var.: Y
## Observations: 120
## Weights: weight_tri
## Standard-errors: Conley (3km)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.021457 0.546506 -0.039261 0.96875
## treat 1.231343 0.839653 1.466490 0.14522
## dx -0.043081 0.219699 -0.196091 0.84488
## treat:dx -0.207372 0.324520 -0.639011 0.52408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 2.03965 Adj. R2: 0.137596
(Conley SEs assume coordinates are lon/lat in degrees; with planar
simulated coordinates the magnitude is illustrative. In real work,
project to a planar CRS and use distance = "triangular" or
rescale the cutoff.)
rdrobust workflowrdrobust provides MSE-optimal bandwidth selection
(Calonico, Cattaneo & Titiunik) and bias-corrected confidence
intervals.
rd_fit <- rdrobust(y = grid$Y, x = grid$dx, c = 0, kernel = "triangular")
summary(rd_fit)
## Sharp RD estimates using local polynomial regression.
##
## Number of Obs. 400
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 200 200
## Eff. Number of Obs. 80 80
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 5.581 5.581
## BW bias (b) 8.484 8.484
## rho (h/b) 0.658 0.658
## Unique Obs. 10 10
##
## =====================================================================
## Point Robust Inference
## Estimate z P>|z| [ 95% C.I. ]
## ---------------------------------------------------------------------
## RD Effect -1.345 -1.231 0.218 [-3.104 , 0.709]
## =====================================================================
The “Conventional” estimate is the usual local-linear coefficient; the “Robust” estimate corrects for bias from the polynomial approximation and is what most referees now expect. Compare it to the true treatment effect of \(2\).
Before any formal estimation, plot the data. A scatter of \(Y\) against signed distance to the border with a local-polynomial smoother on each side is the single most useful diagnostic for a geographic RD: a visible jump at zero is the design’s signature, and its absence is reason to reconsider.
ggplot(grid, aes(dx, Y, color = factor(treat))) +
geom_point(alpha = 0.4) +
geom_smooth(aes(group = treat), method = "loess", se = TRUE) +
geom_vline(xintercept = 0, linetype = 2) +
scale_color_manual(values = c("0" = "gray40", "1" = "steelblue"),
labels = c("Control", "Treated"), name = NULL) +
labs(x = "Signed distance to border (km)", y = "Outcome Y",
title = "RD plot: outcome vs. running variable") +
theme_minimal()
The vertical gap between the two smoothers at \(d_x = 0\) is, visually, your RD estimate.
Three diagnostics every geographic RD should report.
bws <- seq(2, 10, by = 1)
sens <- vapply(bws, function(b) {
r <- local_linear_rd(grid$dx, grid$Y, grid$treat, bw = b)
c(b = b, tau = r$tau, se = r$se)
}, numeric(3)) |> t() |> as.data.frame()
ggplot(sens, aes(b, tau)) +
geom_ribbon(aes(ymin = tau - 1.96 * se, ymax = tau + 1.96 * se), alpha = 0.2) +
geom_line() + geom_point() +
geom_hline(yintercept = beta_tau, linetype = 2, color = "red") +
labs(x = "Bandwidth (km)", y = "Estimated effect",
title = "Sensitivity of RD estimate to bandwidth",
subtitle = "Red dashed line = true effect") +
theme_minimal()
The estimate should be flat in a neighborhood of the optimal bandwidth. Sharp drift suggests the polynomial approximation is doing too much work.
Manipulation of the running variable would show up as a discontinuity in the density of \(d_x\) at zero. With administrative borders this is rarely a concern, but it’s still worth checking, e.g., that population mass isn’t strangely sparse on one side.
# Simple visual density check; for a formal test use rddensity::rddensity()
ggplot(grid, aes(dx)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
geom_vline(xintercept = 0, color = "red", linetype = 2) +
labs(x = "Distance to border (km)", title = "Density of the running variable") +
theme_minimal()
If covariates that cannot be affected by treatment (geography, baseline demographics) jump at the boundary, the continuity assumption is suspect. Run an RD on each one and report the table.
# Pre-period outcome plays the role of a placebo covariate here.
rd_balance <- rdrobust(y = grid$Y_pre, x = grid$dx, c = 0, kernel = "triangular")
data.frame(
outcome = "Y_pre (placebo)",
estimate = rd_balance$Estimate[1],
pval = rd_balance$pv[3] # robust p-value
)
A small, insignificant placebo effect supports the design. A large one is a red flag.
In two-dimensional space, a “boundary” is not a scalar; it is a curve. Two units the same distance from the border can be at very different points along it, with different local conditions on the other side. Keele & Titiunik (2015) emphasize this: a defensible geographic RD requires either (a) a homogeneous border (think a flat geometric line through homogeneous terrain) or (b) explicit conditioning on location along the border. In practice, a common solution is to add boundary segment fixed effects or to interact the running variable with location along the boundary. We do not do this here because our simulated border is trivially homogeneous, but in applied work this is a non-trivial modeling choice.
When treatment is assigned to some geographic units at some time, the standard tool is two-way fixed-effects DiD:
\[ Y_{it} = \alpha_i + \gamma_t + \tau \cdot (D_i \cdot \text{Post}_t) + \varepsilon_{it}. \]
The unit fixed effects \(\alpha_i\) absorb the spatial confounder we worried about in Section 1, and time fixed effects \(\gamma_t\) absorb common shocks. With two periods (pre, post) and a time-invariant treatment, this collapses to a comparison of changes.
panel <- grid |>
st_drop_geometry() |>
select(id, x, y, dx, treat, Y_pre, Y_post) |>
pivot_longer(cols = c(Y_pre, Y_post),
names_to = "period", values_to = "Y") |>
mutate(post = as.integer(period == "Y_post"),
did = treat * post)
head(panel)
did_fit <- feols(Y ~ did | id + period, data = panel)
summary(did_fit)
## OLS estimation, Dep. Var.: Y
## Observations: 800
## Fixed-effects: id: 400, period: 2
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## did 1.89743 0.142351 13.3293 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.709973 Adj. R2: 0.766988
## Within R2: 0.308631
The point estimate should be close to the true treatment effect (2).
Standard errors are clustered at the unit by default with
feols when fixed effects are present.
The DiD logic assumes that control units’ pre-to-post changes capture the counterfactual trend for treated units. If treated and control units are spatially proximate and treatment spills over, controls are partially treated and the estimate is biased toward zero (assuming positive spillovers).
The first diagnostic: Moran’s I on the residuals. Residual spatial autocorrelation suggests something spatial is still in the error term, either unmodeled confounding or spillovers.
panel$resid <- residuals(did_fit)
# Test residual autocorrelation in the post period only
post_resid <- panel$resid[panel$post == 1]
moran.test(post_resid, lw, zero.policy = TRUE)
##
## Moran I test under randomisation
##
## data: post_resid
## weights: lw
##
## Moran I statistic standard deviate = -0.60059, p-value = 0.7259
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.0181750578 -0.0025062657 0.0006806409
In our simulation the residuals should be roughly white: the unit fixed effects swept out the spatial confounder. In applied work, a significant Moran’s I should prompt you to consider rings or donuts and to report a SAR or SEM robustness check (Section 5).
If spillovers extend, say, 3 km, the cleanest robustness check is to drop the units within 3 km of the boundary on both sides. The remaining sample compares “deeply treated” to “untreated by spillover” units.
donut_radius <- 3
panel_donut <- filter(panel, abs(dx) > donut_radius)
did_donut <- feols(Y ~ did | id + period, data = panel_donut)
bind_rows(
data.frame(spec = "Full sample", est = coef(did_fit)["did"], se = se(did_fit)["did"]),
data.frame(spec = "Donut (>3 km)", est = coef(did_donut)["did"], se = se(did_donut)["did"])
)
If the donut estimate is materially larger than the full-sample estimate, spillovers were attenuating the full-sample effect.
The default clustered SEs from feols cluster on the unit
dimension, which handles serial correlation within a unit over time but
not spatial correlation across units. With many time
periods and spatial dependence, you may want:
fixest::feols(..., vcov = conley(cutoff = ...)) or in the
conleyreg package.vcov = ~ id + region.Spatial dependence does not bias the point estimate, but it can severely understate uncertainty.
A more informative version assigns control units to concentric rings around the treated region and estimates a separate effect for each ring. The treatment effect is the coefficient on “deep treatment” (far inside the boundary); the ring coefficients trace the spillover decay.
panel_rings <- panel |>
mutate(zone = case_when(
dx < -3 ~ "treated_deep",
dx >= -3 & dx < 0 ~ "treated_near",
dx >= 0 & dx < 3 ~ "control_near",
dx >= 3 ~ "control_far"
),
zone = factor(zone, levels = c("control_far", "control_near",
"treated_near", "treated_deep")))
ring_fit <- feols(Y ~ i(zone, post, ref = "control_far") | id + period,
data = panel_rings)
summary(ring_fit)
## OLS estimation, Dep. Var.: Y
## Observations: 800
## Fixed-effects: id: 400, period: 2
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## zone::control_near:post -0.273791 0.251691 -1.08781 2.7734e-01
## zone::treated_near:post 1.678208 0.251691 6.66774 8.7492e-11 ***
## zone::treated_deep:post 1.883791 0.159183 11.83411 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.70832 Adj. R2: 0.7669
## Within R2: 0.311846
Reading the coefficients: each is the differential pre-to-post change in that zone relative to “control_far.” If “control_near” picks up a positive effect, that’s a spillover; the “treated_deep” coefficient is your spillover-adjusted treatment effect.
You can visualize the ring estimates as a function of distance to the border:
ring_coefs <- broom::tidy(ring_fit, conf.int = TRUE) |>
mutate(zone = gsub("zone::|:post", "", term)) |>
filter(grepl("post", term))
ggplot(ring_coefs, aes(zone, estimate)) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
geom_hline(yintercept = 0, linetype = 2, color = "red") +
labs(x = "Zone (ordered from control to deeply treated)",
y = "Estimated effect vs. control_far",
title = "Ring DiD: distance-by-time interactions") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
When neither RDD (no sharp boundary) nor DiD (no clean pre-period) is available, geographic matching offers a third route. The idea: match each treated unit to a control unit that is similar on observables including geographic location, then compare outcomes.
The danger to avoid is matching units that are too close: if treatment spills over, your matched control is partially treated and your estimate is biased toward zero. Use a caliper to require a minimum geographic distance, while still controlling for non-geographic confounders.
match_data <- grid |>
st_drop_geometry() |>
mutate(latitude = y) |>
select(id, treat, Y, x, y, latitude)
# Nearest-neighbor match on latitude (the geographic confounder) only.
# In real applications, include socioeconomic covariates as well.
m_out <- matchit(
treat ~ latitude,
data = match_data,
method = "nearest",
distance = "mahalanobis"
)
matched <- match.data(m_out)
# Estimate ATT on the matched sample
att_match <- lm(Y ~ treat, data = matched, weights = weights)
summary(att_match)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2733529 0.1357118 -2.014217 4.465912e-02
## treat 2.1377681 0.1919254 11.138536 2.881084e-25
To add a geographic caliper that excludes near-neighbors (i.e., enforce a minimum distance between matched treated and control to mitigate spillovers), you can post-process the match:
# For each matched pair, drop pairs that are within `min_dist` km
min_dist <- 3
pairs <- m_out$match.matrix # rows = treated, entries = control ids
treated_ids <- as.integer(rownames(pairs))
control_ids <- as.integer(pairs[, 1])
dist_pair <- sqrt(
(match_data$x[treated_ids] - match_data$x[control_ids])^2 +
(match_data$y[treated_ids] - match_data$y[control_ids])^2
)
keep <- which(dist_pair > min_dist)
matched_far <- bind_rows(
match_data[treated_ids[keep], ],
match_data[control_ids[keep], ]
)
att_far <- lm(Y ~ treat, data = matched_far)
summary(att_far)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2733529 0.1357118 -2.014217 4.465912e-02
## treat 2.1377681 0.1919254 11.138536 2.881084e-25
Always inspect balance after matching. The point of matching is to make treated and control units comparable on the covariates you included.
summary(m_out, un = FALSE)
##
## Call:
## matchit(formula = treat ~ latitude, data = match_data, method = "nearest",
## distance = "mahalanobis")
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## latitude 15 15 0 1 0
## eCDF Max Std. Pair Dist.
## latitude 0 0
##
## Sample Sizes:
## Control Treated
## All 200 200
## Matched 200 200
## Unmatched 0 0
## Discarded 0 0
Standardized mean differences should generally be below 0.1 in absolute value. If matching fails to balance a covariate, that’s a sign you need a different specification (different distance metric, more covariates, exact matching on key variables) or that overlap is insufficient.
A brief note on synthetic control: when only one or
a few units are treated, synthetic control (Synth,
tidysynth, augsynth) constructs a weighted
combination of donors that resembles the treated unit on pre-treatment
outcomes. With spatial data you can restrict the donor pool to nearby
(or, depending on the threat model, far away) units to manage
geographic similarity vs. spillover concerns. We don’t run synthetic
control here because the canonical case is one treated unit, but the
geographic restriction logic carries over.
A frequent mistake is to use matching when the underlying threat is unobserved spatial confounding. Matching only handles observed covariates; if the true confounder is unmeasured and varies smoothly in space, matching on observed geographic location can do more harm than good (by closing some good comparisons while leaving the confounding open). Geographic matching is most defensible when (a) you have a rich set of observed covariates, (b) the treatment assignment mechanism is plausibly conditional on those covariates, and (c) you treat geographic distance as a covariate among others, not as a substitute for identification.
If your causal estimate from RDD, DiD, or matching is sensitive to assumptions about the spatial structure of the error, you have a problem. Spatial autoregressive (SAR) and spatial error (SEM) models, fit on the same data, let you check this. These are not the primary causal estimator; they are a robustness exercise.
The SAR model adds a spatially lagged outcome: \[ Y = \rho W Y + X\beta + \varepsilon \]
The SEM model puts spatial dependence in the error term: \[ Y = X\beta + u, \quad u = \lambda W u + \varepsilon \]
# Fit OLS, SAR, SEM with the same design matrix
ols_fit <- lm(Y ~ treat + y, data = grid)
sar_fit <- lagsarlm(Y ~ treat + y, data = grid, listw = lw, zero.policy = TRUE)
sem_fit <- errorsarlm(Y ~ treat + y, data = grid, listw = lw, zero.policy = TRUE)
# Side-by-side coefficients on the causal variable
out <- data.frame(
model = c("OLS", "SAR", "SEM"),
estimate = c(coef(ols_fit)["treat"],
coef(sar_fit)["treat"],
coef(sem_fit)["treat"]),
se = c(sqrt(diag(vcov(ols_fit)))["treat"],
sqrt(diag(vcov(sar_fit)))["treat"],
sqrt(diag(vcov(sem_fit)))["treat"])
)
out
Two interpretive points:
treat is not the
total effect. Because the lagged outcome propagates, the total
effect includes spillovers through \(W\). Use impacts() to get
direct, indirect, and total effects:sar_impacts <- impacts(sar_fit, listw = lw, R = 200)
summary(sar_impacts, zstats = TRUE)
## Impact measures (lag, exact):
## Direct Indirect Total
## treat dy/dx 0.75447040 1.53231975 2.2867901
## y dy/dx -0.02388766 -0.04851553 -0.0724032
## ========================================================
## Simulation results ( variance matrix):
## Direct:
##
## Iterations = 1:200
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 200
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## treat dy/dx 0.74804 0.169366 0.011976 0.010552
## y dy/dx -0.02381 0.009136 0.000646 0.000556
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## treat dy/dx 0.4421 0.62971 0.75064 0.87738 1.079785
## y dy/dx -0.0415 -0.03049 -0.02357 -0.01755 -0.007691
##
## ========================================================
## Indirect:
##
## Iterations = 1:200
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 200
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## treat dy/dx 1.55137 0.39135 0.027673 0.024042
## y dy/dx -0.04923 0.01991 0.001408 0.001034
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## treat dy/dx 0.90538 1.27465 1.52071 1.77871 2.29275
## y dy/dx -0.09306 -0.06203 -0.04664 -0.03545 -0.01353
##
## ========================================================
## Total:
##
## Iterations = 1:200
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 200
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## treat dy/dx 2.29941 0.51765 0.036604 0.03137
## y dy/dx -0.07304 0.02822 0.001995 0.00151
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## treat dy/dx 1.4124 1.96012 2.2683 2.62101 3.223
## y dy/dx -0.1308 -0.09072 -0.0711 -0.05504 -0.021
##
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## treat dy/dx 0.169365733 0.3913512 0.51765250
## y dy/dx 0.009136379 0.0199071 0.02821852
##
## Simulated z-values:
## Direct Indirect Total
## treat dy/dx 4.416736 3.964127 4.441994
## y dy/dx -2.606566 -2.472758 -2.588372
##
## Simulated p-values:
## Direct Indirect Total
## treat dy/dx 1.002e-05 7.3665e-05 8.9129e-06
## y dy/dx 0.0091455 0.013407 0.0096431
This is precisely why SAR should not be the primary estimator in causal applications: the coefficient you would want to interpret is conditional on a network propagation model that you usually do not believe literally. As a robustness check, however, it is informative: a small total SAR effect when your DiD effect is large should make you re-examine the DiD assumptions.
A reasonable workflow for a paper that uses geography for identification:
sf. Signed distance to a border, neighbor list,
ring indicators.The point of the spatial-econometrics toolbox in modern political methodology is not to be the showcase; it is to provide diagnostics and robustness for a design-based estimate.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Monterey 12.7.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_GB/en_GB/en_GB/C/en_GB/en_GB
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_1.0.12 tidyr_1.3.2 dplyr_1.2.0 ggplot2_4.0.2
## [5] rdrobust_3.0.0 MatchIt_4.7.2 fixest_0.13.2 spatialreg_1.4-2
## [9] Matrix_1.7-4 spdep_1.4-2 spData_2.3.4 sf_1.0-24
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 farver_2.1.2
## [4] S7_0.2.1 fastmap_1.2.0 TH.data_1.1-5
## [7] digest_0.6.39 lifecycle_1.0.5 dreamerr_1.5.0
## [10] LearnBayes_2.15.2 survival_3.8-3 dbscan_1.2.4
## [13] magrittr_2.0.4 compiler_4.5.2 rlang_1.1.7
## [16] sass_0.4.10 tools_4.5.2 igraph_2.2.2
## [19] yaml_2.3.12 knitr_1.51 labeling_0.4.3
## [22] sp_2.2-1 classInt_0.4-11 RColorBrewer_1.1-3
## [25] multcomp_1.4-29 KernSmooth_2.23-26 withr_3.0.2
## [28] purrr_1.2.1 numDeriv_2016.8-1.1 grid_4.5.2
## [31] e1071_1.7-17 scales_1.4.0 MASS_7.3-65
## [34] cli_3.6.5 mvtnorm_1.3-3 rmarkdown_2.30
## [37] generics_0.1.4 otel_0.2.0 DBI_1.2.3
## [40] cachem_1.1.0 proxy_0.4-29 splines_4.5.2
## [43] s2_1.1.9 stringmagic_1.2.0 vctrs_0.7.1
## [46] boot_1.3-32 sandwich_3.1-1 jsonlite_2.0.0
## [49] Formula_1.2-5 jquerylib_0.1.4 units_1.0-0
## [52] glue_1.8.0 chk_0.10.0 codetools_0.2-20
## [55] gtable_0.3.6 deldir_2.0-4 tibble_3.3.1
## [58] pillar_1.11.1 htmltools_0.5.9 R6_2.6.1
## [61] wk_0.9.5 evaluate_1.0.5 lattice_0.22-7
## [64] backports_1.5.0 bslib_0.10.0 class_7.3-23
## [67] Rcpp_1.1.1 coda_0.19-4.1 nlme_3.1-168
## [70] mgcv_1.9-3 xfun_0.56 zoo_1.8-15
## [73] pkgconfig_2.0.3