This vignette provides a tutorial for fitting spatial regression models to raster data using geostan. The term “raster” is used here to refer to any regularly spaced set of observations such that the data can be represented spatially by a rectangular grid. Remotely sensed imagery is a common form of raster data.
geostan can be used for spatial regression with fairly large raster data layers, although the functionality of these models will often be limited to the estimation of regression coefficients and spatial autocorrelation parameters. Limited experience thus far finds that geostan’s spatial autoregressive models can be fit to raster layers with two hundred thousand observations using a laptop computer and fewer than ten minutes of sampling time.
Start by loading some necessary R packages.
library(geostan)
library(sf)
set.seed(1127)
We will create a small raster data layer for the purpose of illustration.
40
row <- 30
col <-c(N <- row * col)
## [1] 1200
st_sfc(st_polygon(list(rbind(c(0,0), c(col,0), c(col,row), c(0,0)))))
sfc = st_make_grid(sfc, cellsize = 1, square = TRUE)
grid <- st_as_sf(grid)
grid <- shape2mat(grid, style = "W", queen = FALSE) W <-
## Warning in shape2mat(grid, style = "W", queen = FALSE): The 'queen' argument is
## deprecated. Use 'method' instead.
## Contiguity condition: rook
## Number of neighbors per unit, summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 4.000 3.883 4.000 4.000
##
## Spatial weights, summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.2500 0.2500 0.2575 0.2500 0.5000
$z <- sim_sar(w = W, rho = 0.9)
grid$y <- -0.5 * grid$z + sim_sar(w = W, rho = .9, sigma = .3) grid
plot(grid[,'z'])
The following R code will fit a spatial autoregressive model to these data:
stan_sar(y ~ z, data = grid, C = W) fit <-
The stan_sar
function will take the spatial weights matrix W
and pass it through a function called prep_sar_data
which will calculate the eigenvalues of the spatial weights matrix using base::eigen
, as required for computational reasons. This step is prohibitive for large data sets (e.g., \(N = 100,000\)).
The following code would normally be used to fit a conditional autoregressive (CAR) model:
shape2mat(grid, style = "B", queen = FALSE)
C <- prep_car_data(C, "WCAR")
car_list <- stan_car(y ~ z, data = grid, car_parts = car_list) fit <-
Here, the prep_car_data
function calculates the eigenvalues of the spatial weights matrix using base::eigen
, which is not feasible for large N.
The prep_sar_data2
and prep_car_data2
functions are designed for raster layers. As input, they require the dimensions of the grid (number of rows and number of columns). The eigenvalues are produced very quickly using Equation 5 from Griffith (2000). The methods have certain restrictions. First, this is only applicable to raster layers—regularly spaced, rectangular grids of observations. Second, to define which observations are adjacent to one another, the “rook” criteria is used (spatially, only observations that share an edge are defined as neighbors to one another). Third, the spatial adjacency matrix will be row-standardized. This is standard (and required) for SAR models, and it corresponds to the “WCAR” specification of the CAR model (see Donegan 2022).
The following code will fit a SAR model to our grid
data, and is suitable for much larger raster layers:
prep_sar_data2(row = row, col = col) sar_list <-
## Range of permissible rho values: -1, 1
stan_sar(y ~ z,
fit <-data = grid,
centerx = TRUE,
sar_parts = sar_list,
iter = 500,
chains = 4,
slim = TRUE #,
# cores = 4, # for multi-core processing
)
##
## *Setting prior parameters for intercept
## Distribution: normal
## location scale
## 1 0.035 5
##
## *Setting prior parameters for beta
## Distribution: normal
## location scale
## 1 0 5
##
## *Setting prior for SAR scale parameter (sar_scale)
## Distribution: student_t
## df location scale
## 1 10 0 3
##
## *Setting prior for SAR spatial autocorrelation parameter (sar_rho)
## Distribution: uniform
## lower upper
## 1 -1 1
##
## SAMPLING FOR MODEL 'foundation' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.001244 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 12.44 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 500 [ 0%] (Warmup)
## Chain 1: Iteration: 251 / 500 [ 50%] (Sampling)
## Chain 1: Iteration: 500 / 500 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 1.862 seconds (Warm-up)
## Chain 1: 1.302 seconds (Sampling)
## Chain 1: 3.164 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'foundation' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0.00099 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 9.9 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 500 [ 0%] (Warmup)
## Chain 2: Iteration: 251 / 500 [ 50%] (Sampling)
## Chain 2: Iteration: 500 / 500 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 1.903 seconds (Warm-up)
## Chain 2: 1.476 seconds (Sampling)
## Chain 2: 3.379 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'foundation' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0.000961 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 9.61 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 500 [ 0%] (Warmup)
## Chain 3: Iteration: 251 / 500 [ 50%] (Sampling)
## Chain 3: Iteration: 500 / 500 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 1.929 seconds (Warm-up)
## Chain 3: 1.361 seconds (Sampling)
## Chain 3: 3.29 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'foundation' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0.000981 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 9.81 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 500 [ 0%] (Warmup)
## Chain 4: Iteration: 251 / 500 [ 50%] (Sampling)
## Chain 4: Iteration: 500 / 500 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 2.173 seconds (Warm-up)
## Chain 4: 1.266 seconds (Sampling)
## Chain 4: 3.439 seconds (Total)
## Chain 4:
print(fit)
## Spatial Model Results
## Formula: y ~ z
## Spatial method (outcome): SAR
## Likelihood function: auto_gaussian
## Link function: identity
## Observations: 1200
## Data models (ME): none
## Inference for Stan model: foundation.
## 4 chains, each with iter=500; warmup=250; thin=1;
## post-warmup draws per chain=250, total post-warmup draws=1000.
##
## mean se_mean sd 2.5% 20% 50% 80% 97.5% n_eff Rhat
## intercept 0.034 0.003 0.078 -0.121 -0.031 0.033 0.101 0.181 709 0.998
## z -0.510 0.000 0.009 -0.527 -0.518 -0.510 -0.502 -0.490 1129 0.998
## sar_rho 0.874 0.001 0.016 0.843 0.859 0.873 0.887 0.904 758 0.998
## sar_scale 0.312 0.000 0.007 0.298 0.306 0.312 0.317 0.326 955 0.999
##
## Samples were drawn using NUTS(diag_e) at Wed Sep 18 07:06:55 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The user first creates the data list using prep_sar_data2
and then passes it to stan_sar
using the sar_parts
argument. Also, slim = TRUE
is invoked to prevent the model from collecting N-length parameter vectors and quantities of interest (such as fitted values and log-likelihoods).
For large data sets and complex models, slim = TRUE
can bring about computational improvements at the cost of losing some functionality (including the loss of convenience functions like sp_diag
, me_diag
, spatial
, resid
, and fitted
). Many quantities of interest, such as fitted values and spatial trend terms, can still be calculated manually using the data and parameter estimates (intercept, coefficients, and spatial autocorrelation parameters).
The favorable MCMC diagnostics for this model (sufficiently large effective sample sizes n_eff
, and Rhat
values very near to 1), based on just 250 post-warmup iterations per chain with four MCMC chains, provides some indication as to how computationally efficient these spatial autoregressive models can be.
Also, note that Stan usually samples more efficiently when variables have been mean-centered. Using the centerx = TRUE
argument in stan_sar
(or any other model-fitting function in geostan) can be very helpful in this respect. Also note that the SAR models in geostan are (generally) no less computationally-efficient than the CAR models, and may even be slightly more efficient.
Donegan, Connor. 2022. “Building Spatial Conditional Autoregressive Models in the Stan Programming Language.” OSF Preprints. https://doi.org/10.31219/osf.io/3ey65.
Griffith, Daniel A. 2000. “Eigenfunction Properties and Approximations of Selected Incidence Matrices Employed in Spatial Analyses.” Linear Algebra and Its Applications 321 (1-3): 95–112.