Introduction to space-time GAMS with stgam

Lex Comber, Paul Harris and Chrs Brunsdon

June 2024

Overview

The aim of this vignette is provide an introduction to the stgam package and demonstrates how to construct a spatially vary coefficient (SVC) model and spatial and temporally varying (STVC) model using the stgam package. Essentially stgam provides a wrapper for constructing GAMs with Gaussian Process (GP) smooths or splines, that are parameterised with location for SVCs and with location and time for STVCs. The key ideas underpinning the development of SVCs with GAMs in the stgam package and this vignette are:

  1. Standard linear regression assumes that predictor-to-response relationships to be the same throughout the study area.
  2. This is often not the case when when location is considered, for example of outliers.
  3. Many geographic processes have a Gaussian form when they are examined over 2-dimensional space, as they essentially exhibit distance decay.
  4. The GAMs can include smooths or splines of different forms. One such form is a Gaussian Process (GP) spline.
  5. GP splines can be specified to model non-linearity (wiggliness) over geographic space if location is included with the covariate.
  6. A GAM with GP splines parameterised by location - a Geographic GP GAM or GGP-GAM - defines a spatially varying coefficient (SVC) model.
  7. This can be extended to time and space-time.

The approach is presented in outline below, but detail of the SVCs, TVCs and STVCs constructed using GAMs with GP smooths, and the evolution of their application from spatial models to space-time coefficient modelling can be found in Comber, Harris, and Brunsdon (2024) and Comber et al. (submitted). In this, the concept of a Gaussian Process (Wood 2020) is important in the context of regression modelling. It provides a data model of the likelihood that a given data set is generated given a statistical model involving some unknown parameters and in regression modelling, the unknown parameters are the regression parameters. These are described formally in Comber, Harris, and Brunsdon (2024) and Comber et al. (submitted).

One final comment is that in this vignette you will all of the varying coefficient models you create assume that spatial, temporal or spatio-temporal dependencies are present in the data. This may not be the case and these assumptions are examined in detail in the second vignette in the stgam package.

In this vignette you will:

Data and Packagees

You should install the stgam package either from CRAN or from GitHub:

install.packages("stgam", dependencies = TRUE)
remotes::install_github("lexcomber/stgam")

And then make sure the required packages and data are loaded:

# load the packages
library(stgam)
library(cols4all)   # for nice shading in graphs and maps
library(cowplot)    # for managing plots
library(dplyr)      # for data manipulation 
library(ggplot2)    # for plotting and mapping
library(glue)       # for model construction 
library(mgcv)       # for GAMs
library(sf)         # for spatial data
library(doParallel) # for parallelising operations
library(purrr)      # for model construction
library(tidyr)      # for model construction 
# load the data 
data(productivity)
data(us_data) 

The productivity data is annual economic productivity data for the 48 contiguous US states (with Washington DC merged into Maryland), for years 1970 to 1985 (17 years). This was extracted from the plm package (Croissant, Millo, and Tappe 2022). The us_data is a spatial dataset of the US states in a USA Contiguous Equidistant Conic projection (ESRI:102005) from the spData package (Bivand, Nowosad, and Lovelace 2019). The productivity data includes locational information of the state geometric centroid in the same projection. The code below maps the X and Y locations in productivity along with the US state areas.

ggplot() +  geom_sf(data = us_data, fill = NA) +
  geom_point(data = productivity |> filter(year == "1970"), aes(x = X, y = Y)) +
  theme_bw() + ylab("") + xlab("")
The US States and the geoemtric centroids used as locations.
The US States and the geoemtric centroids used as locations.

The data attributes can be examined and a spatial model of constructed using gam function from the mgcv package. The varying coefficient models created below all focus on Private capital stock (privC) as the target variable, with Unemployment (unemp) and Public capital (pubC) covariates, where the coefficient functions are assumed to be realisation of a Gaussian Process (GP) introduced above.

head(productivity)
#> # A tibble: 6 × 14
#>   state   GEOID  year region   pubC   hwy water  util  privC   gsp   emp unemp       X
#>   <chr>   <chr> <int> <fct>   <dbl> <dbl> <dbl> <dbl>  <dbl> <int> <dbl> <dbl>   <dbl>
#> 1 ALABAMA 01     1970 6      15033. 7326. 1656. 6051. 35794. 28418 1010.   4.7 858146.
#> 2 ALABAMA 01     1971 6      15502. 7526. 1721. 6255. 37300. 29375 1022.   5.2 858146.
#> 3 ALABAMA 01     1972 6      15972. 7765. 1765. 6442. 38670. 31303 1072.   4.7 858146.
#> 4 ALABAMA 01     1973 6      16406. 7908. 1742. 6756. 40084. 33430 1136.   3.9 858146.
#> 5 ALABAMA 01     1974 6      16763. 8026. 1735. 7002. 42057. 33749 1170.   5.5 858146.
#> 6 ALABAMA 01     1975 6      17316. 8158. 1752. 7406. 43972. 33604 1155.   7.7 858146.
#> # ℹ 1 more variable: Y <dbl>

A simple SVC

A spatially varying coefficient model will be created using the productivity dataset.

The code below defines an intercept column (Intercept) in the data. This to allow the intercept to be treated as an addressable term in the model. It also defines parametric and non-parametric forms for the intercept and each covariate, so that they can can take a global form (i.e. as in a standard OLS regression) and a spatially varying form in the GP smooth.

# define intercept term
productivity <- productivity |> mutate(Intercept = 1)
# create the SVC
svc.gam = gam(privC ~ 0 +
                Intercept   + s(X, Y, bs = 'gp', by = Intercept) + 
                unemp + s(X, Y, bs = "gp", by = unemp) + 
                pubC  + s(X, Y, bs = "gp", by = pubC), 
              data = productivity |> filter(year == "1970"))

Notice the 0 + in the model. This indicates that the intercept coefficient is not included implicitly and it is included explicitly as Intercept. Also notice the different form of the splines from those specified in Part 1. Here, for each covariate, a GP smooth is specified for X and Y (the coordinates in geographic space) and the covariate is included via the by parameter. This is to explore the interaction of the covariate with the target variable over the space defined by X and Y locations, allowing spatially varying coefficients to be generated. The model has 4 key terms specified in the same way by <VAR> + (X, Y, b s= 'gp', by = <VAR>):

The model output can be assessed: the k' and edf parameters are quite close, but the p-values are high and and the k-index is greater than 1 so this looks OK. The diagnostic plots are again generated by the gam.check function as below:

# check 
gam.check(svc.gam)
The GAM GP SVC diagnostic plots.
The GAM GP SVC diagnostic plots.
#> 
#> Method: GCV   Optimizer: magic
#> Smoothing parameter selection converged after 6 iterations by steepest
#> descent step failure.
#> The RMS GCV score gradient at convergence was 2452.878 .
#> The Hessian was not positive definite.
#> Model rank =  8 / 101 
#> 
#> Basis dimension (k) checking results. Low p-value (k-index<1) may
#> indicate that k is too low, especially if edf is close to k'.
#> 
#>                     k'   edf k-index p-value   
#> s(X,Y):Intercept 32.00  2.00    0.82   0.065 . 
#> s(X,Y):unemp     33.00  2.00    0.82   0.010 **
#> s(X,Y):pubC      33.00  3.67    0.82   0.035 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# model summary
summary(svc.gam)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> privC ~ 0 + Intercept + s(X, Y, bs = "gp", by = Intercept) + 
#>     unemp + s(X, Y, bs = "gp", by = unemp) + pubC + s(X, Y, bs = "gp", 
#>     by = pubC)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> Intercept  0.0014245  0.0003432   4.151 0.000168 ***
#> unemp      0.0064540  0.0015548   4.151 0.000168 ***
#> pubC      -3.9243042  0.9571325  -4.100 0.000196 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                    edf Ref.df       F p-value    
#> s(X,Y):Intercept 2.000  2.000   0.453   0.639    
#> s(X,Y):unemp     2.000  2.000   0.405   0.676    
#> s(X,Y):pubC      3.672  3.715 188.593  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 8/101
#> R-sq.(adj) =  0.918   Deviance explained = 96.5%
#> GCV = 1.9338e+08  Scale est. = 1.613e+08  n = 48

Here it can be seen that:

  1. The model is well tuned: all of the all effective degrees of freedom (edf are well below k in the gam.check() printed output (the k-index<1 issue is not important because of this).
  2. All of the the fixed parametric terms are significant.
  3. Of the smooth terms, only pccap is locally significant and spatially varying.

The spatially varying coefficient estimates can be extracted using predict. To do this a dummy data set is created with the pubC term set to 1, and the intersect and unemp terms set to zero. The result is that the predicted values for the coefficient estimate are just a function of \(\beta_2\), the pubC coefficient estimate at each location.

get_b2<- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 0, pubC = 1)
res <- productivity |> filter(year == "1970") |> 
  mutate(b2 = predict(svc.gam, newdata = get_b2))

The resulting data.frame called res has a new variable called b2 which is the spatially varying coefficient estimate for pubC. For comparison, we can generate the spatially varying coefficient estimate for the intercept (\(\beta_0\)) and unemp \(\beta_1\) (which were not found to be significant locally) in the same way by setting the other terms in the model to zero:

get_b0 <- productivity |> filter(year == "1970") |> mutate(Intercept = 1, unemp = 0, pubC = 0)
res <- res |> mutate(b0 = predict(svc.gam, newdata = get_b0))
get_b1 <- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 1, pubC = 0)
res <- res |> mutate(b1 = predict(svc.gam, newdata = get_b1))

So res has the records for the year 1970 and three new columns for b0, b1 and b2 The distribution of the spatially varying coefficient estimates can be examined:

res |> select(b0, b1, b2) |>
  apply(2, summary) |> round(2)
#>                b0       b1    b2
#> Min.    -19058.71 -2848.15 -0.18
#> 1st Qu.  -7882.46 -1199.14  1.29
#> Median     731.22  -156.64  2.11
#> Mean         0.00     0.30  2.08
#> 3rd Qu.   7891.22  1179.57  2.85
#> Max.     20699.80  3362.89  3.80

The stgam package has a function to extract the varying coefficients, calculate_vcs. This takes three arguments, the GAM varying coefficient model, the model terms, and the data used to create the model:

terms = c("Intercept", "unemp", "pubC")
res <-  calculate_vcs(model = svc.gam, 
                      terms = terms, 
                      input_data = productivity |> filter(year == "1970"))
summary(res[, paste0("b_",terms)])
#>   b_Intercept            b_unemp              b_pubC       
#>  Min.   :-19058.713   Min.   :-2848.148   Min.   :-0.1822  
#>  1st Qu.: -7882.457   1st Qu.:-1199.142   1st Qu.: 1.2941  
#>  Median :   731.222   Median : -156.638   Median : 2.1100  
#>  Mean   :     0.001   Mean   :    0.302   Mean   : 2.0825  
#>  3rd Qu.:  7891.224   3rd Qu.: 1179.568   3rd Qu.: 2.8474  
#>  Max.   : 20699.803   Max.   : 3362.893   Max.   : 3.8034

Standard dplyr and ggplot approaches can be used to join and map the coefficient estimates, formally \(\beta_0\), \(\beta_1\) and \(\beta_2\)) as in the figure below. Notice the North-South trend for the Intercept and the East-West and trend for Unemployment - both insignificant predictors of privC - and a much stronger specific spatial pattern between the target variable and Public capital (pubC), with particularity high coefficient estimates in the south.

# join the data 
map_results <-
  us_data |> left_join(res |> select(GEOID, b_Intercept, b_unemp, b_pubC), by = "GEOID")
# plot the insignificant coefficient estimates
tit =expression(paste(""*beta[0]*""))
p1 <- 
  ggplot(data = map_results, aes(fill=b_Intercept)) + 
  geom_sf() + 
  scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) + 
  coord_sf() +
  ggtitle("Intercept: not significant")
tit =expression(paste(""*beta[1]*""))
p2 <- 
  ggplot(data = map_results, aes(fill=b_unemp)) + 
  geom_sf() + 
  scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) + 
  coord_sf() +
  ggtitle("Unemployment: not significant")
# plot the significant pubC coefficient estimates
tit =expression(paste(""*beta[2]*" "))
p3 <- 
  ggplot(data = map_results, aes(fill=b_pubC)) + 
  geom_sf() + 
  scale_fill_continuous_c4a_div(palette="brewer.prgn",name=tit) + 
  coord_sf() +
  ggtitle("Public captial: significant")
plot_grid(p1, p2, p3, ncol = 1)
The spatially varying coefficient (SVC) estimates.
The spatially varying coefficient (SVC) estimates.

A simple TVC

The productivity data was filtered for 1970 the the SVC above in which the X-Y location of the 48 states was used to parametrise the Gaussian Process smooths. The same structure can be used to create a temporally vary coefficient model (TVC), with smooths specified to include the year parameter, but this time not restricting the analysis data to records from a single year:

# create the TVC
tvc.gam = gam(privC ~ 0 +
                Intercept   + s(year, bs = 'gp', by = Intercept) + 
                unemp + s(year, bs = "gp", by = unemp) + 
                pubC  + s(year, bs = "gp", by = pubC), 
              data = productivity)

The model can be inspected in the same way to examine the the k' and edf parameters using the gam.check function and again there are no concerns:

gam.check(tvc.gam)
summary(tvc.gam)

The model summary indicates that all of the parametric and temporally vary coefficients are significant at the 95% level except the parametric one for Unemployment, but there are some interesting patterns in the residuals, as reflected in the diagnostics plots above and the model fit (\(R^2\)) value.

The temporally varying coefficients can be extracted in the same way as the DVC approach, explored using the predict function and setting each of the covariates to 1 and the others to zero in turn:

terms = c("Intercept", "unemp", "pubC")
res <-  calculate_vcs(model = tvc.gam, 
                      terms = terms, 
                      input_data = productivity)
summary(res[, paste0("b_",terms)])
#>   b_Intercept       b_unemp            b_pubC     
#>  Min.   :-1057   Min.   :-3733.9   Min.   :1.462  
#>  1st Qu.: 7649   1st Qu.:-2342.3   1st Qu.:1.653  
#>  Median :16356   Median : -950.6   Median :1.844  
#>  Mean   :16356   Mean   : -950.6   Mean   :1.844  
#>  3rd Qu.:25063   3rd Qu.:  441.0   3rd Qu.:2.035  
#>  Max.   :33769   Max.   : 1832.6   Max.   :2.226

The variation in coefficient estimates can be inspected over time, remembering that each State has the same coefficient estimate value for each year, so just one state is selected in the code below (you could chose another and the result would be the same). Notice the linear declines due to the linearity of the time covariate in the figure below.

res |> 
  filter(state == "ARIZONA") |> 
  select(year, b_Intercept, b_unemp, b_pubC) |> 
  pivot_longer(-year) |>
  ggplot(aes(x = year, y = value)) +
  geom_point() + geom_line() +
  facet_wrap(~name, scale = "free") +
  theme_bw() + xlab("Year") + ylab("") 
Trends in the temporally varying coefficient estimates.
Trends in the temporally varying coefficient estimates.

Spatially and Temporally Varying Coefficient (STVC) models

We can combine space and time in GAM GP splines. But how? We could use separate smooths for location and for time, or a single, 3D smooth parameterised with both location and time. There are assumptions associated with each of these. The code below specifies the interaction of the covariates within a single space-time GP smooth. You will notice this takes a few seconds longer to run, and choice of how to specify the smooths is explored in the second vignette in the stgam package.

stvc.gam = gam(privC ~ 0 +
                 Intercept   + s(X, Y, year, bs = 'gp', by = Intercept) + 
                 unemp + s(X, Y, year, bs = "gp", by = unemp) + 
                 pubC  + s(X, Y, year, bs = "gp", by = pubC), 
               data = productivity)

The model can be inspected in the usual way using the gam.check function (again, no concerns) and the summary function with \(k\) and \(edf\) despite the concerning k-index and p-value values. In this case all of the parametric and smooth coefficient estimates are significant at the 95% level, and the model fit (\(R^2\)) has again increased over the SVC model.

gam.check(stvc.gam)
summary(stvc.gam)

The coefficients spatial and temporally vary coefficients can be extracted in the same way as before and the variation in coefficient estimates from the STVC-GAM model summarised:

terms = c("Intercept", "unemp", "pubC")
res <-  calculate_vcs(model = stvc.gam, 
                      terms = terms, 
                      input_data = productivity)
summary(res[, paste0("b_",terms)])
#>   b_Intercept          b_unemp              b_pubC        
#>  Min.   :-9767.27   Min.   :-1581.313   Min.   :0.009162  
#>  1st Qu.:-3966.23   1st Qu.: -642.095   1st Qu.:1.619839  
#>  Median :  -77.85   Median : -312.564   Median :2.298574  
#>  Mean   :    0.00   Mean   :   -2.411   Mean   :2.308127  
#>  3rd Qu.: 3378.97   3rd Qu.:  629.192   3rd Qu.:3.027175  
#>  Max.   :10806.31   Max.   : 2195.418   Max.   :4.396288

This indicates that a positive relationship between Private capital with with Public capital and mixed positive and negative one with Unemployment and the Intercept.

It is instructive to unpick some of the model coefficients in more detail and the code below summarises variations over time through the median values of each coefficient estimate:

res |> 
  select(year, b_Intercept, b_unemp, b_pubC) |>
  group_by(year) |>
  summarise(med_b0 = median(b_Intercept),
            med_b1 = median(b_unemp),
            med_b2 = median(b_pubC))
#> # A tibble: 17 × 4
#>     year med_b0 med_b1 med_b2
#>    <int>  <dbl>  <dbl>  <dbl>
#>  1  1970  -77.8  -313.   1.95
#>  2  1971  -77.8  -313.   1.99
#>  3  1972  -77.8  -313.   2.03
#>  4  1973  -77.8  -313.   2.07
#>  5  1974  -77.8  -313.   2.11
#>  6  1975  -77.8  -313.   2.15
#>  7  1976  -77.8  -313.   2.19
#>  8  1977  -77.8  -313.   2.23
#>  9  1978  -77.8  -313.   2.27
#> 10  1979  -77.8  -313.   2.31
#> 11  1980  -77.8  -313.   2.35
#> 12  1981  -77.8  -313.   2.39
#> 13  1982  -77.8  -313.   2.43
#> 14  1983  -77.8  -313.   2.47
#> 15  1984  -77.8  -313.   2.51
#> 16  1985  -77.8  -313.   2.55
#> 17  1986  -77.8  -313.   2.59

It is evident that of the 2 covariates and the intercept used to model Private capital, only Public capital (b2) varies (increases) over time. This increase is shown visually below .

# inputs to plot
res |> select(starts_with("b"), year) |> 
  mutate(year = "All Years") -> tmp
cols = c(c4a("tableau.red_gold", n = 17, reverse = T), "grey")
tit =expression(paste(""*beta[`Private Capital`]*""))
# plot
res |> select(starts_with("b"), year) |> 
  rbind(tmp) |> 
  mutate(year = factor(year)) |> 
  ggplot(aes(y = year, x = b_pubC, fill = year)) +
  geom_boxplot(outlier.alpha = 0.1) +
  scale_fill_manual(values=cols, guide = "none") +
  theme_bw() + xlab(tit) + ylab("Time") 
The temporal variation of the Public capital coefficient estimates over 17 years.
The temporal variation of the Public capital coefficient estimates over 17 years.

The spatial pattern of this temporal trend can also be explored as below. This shows that the increasing intensity of the effect of Public capital on Private capital does not vary spatially: the increase in effect is spatially even, with high values in the south.

tit =expression(paste(""*beta[`Public Capital`]*""))
# join the data 
map_results <-
  us_data |> left_join(res |> select(GEOID, year, b_Intercept, b_unemp, b_pubC), by = "GEOID")
# create the plot
map_results |>
  ggplot() + geom_sf(aes(fill = b_pubC), col = NA) +
    scale_fill_binned_c4a_seq(palette="scico.lajolla", name = tit) + 
  facet_wrap(~year) +
    theme_bw() + xlab("") + ylab("") + 
    theme(
      strip.background =element_rect(fill="white"), 
      strip.text = element_text(size = 8, margin = margin()),
      legend.position = "inside", legend.position.inside = c(0.7, 0.1),
      legend.direction = "horizontal",
      legend.key.width = unit(1, "cm"),
        axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(), 
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())  
The spatial variation of the Unemployment coefficient estimates over time.
The spatial variation of the Unemployment coefficient estimates over time.

Summary

The rationale for using GAMs with GP splines for spatially varying coefficient (SVC) or temporally varying coefficient (TVC) models is as follows:

Initial research has demonstrated the formulation and application of a GAM with GP splines calibrated via observation location as a multiscale SVC model: the Geographical Gaussian Process GAM (GGP-GAM) (Comber, Harris, and Brunsdon 2024). The GGP-GAM was compared with the most popular SVC model, Geographically Weighted Regression (GWR) (Brunsdon, Fotheringham, and Charlton 1996) and shown to out-perform Multiscale GWR.

However, when handling space and time, simply plugging all the space time data into specific GAM configuration is to make potentially unreasonable assumptions about how space and time interact in spatially and temporally varying coefficient (STVC) models. To address this workshop has suggested that the full set of models is investigated to identify the best model or the best set of models. Where there is a clear winner, this can be applied. Where these is not, as in the example used then the model coefficients can be combined using Bayesian Model Averaging (in the second vignette).

References

Bivand, Roger, Jakub Nowosad, and Robin Lovelace. 2019. “spData: Datasets for Spatial Analysis.” R package version 0.3. 0, URL https://CRAN. R-project. org/package= spData.
Brunsdon, Chris, A Stewart Fotheringham, and Martin E Charlton. 1996. “Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity.” Geographical Analysis 28 (4): 281–98.
Comber, Alexis, Paul Harris, and Chris Brunsdon. 2024. “Multiscale Spatially Varying Coefficient Modelling Using a Geographical Gaussian Process GAM.” International Journal of Geographical Information Science 38 (1): 27–47.
Comber, Alexis, Paul Harris, Naru Tsutsumida, Jennie Gray, and Chris Brunsdon. submitted. “Where, When and How? Specifying Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models with Gaussian Process Splines.” International Journal of Geographical Information Science, submitted.
Croissant, Yves, Giovanni Millo, and Kevin Tappe. 2022. “Plm: Linear Models for Panel Data.”
Wood, Simon N. 2020. “Inference and Computation with Generalized Additive Models and Their Extensions.” TEST 29 (2): 307–39. https://doi.org/10.1007/s11749-020-00711-5.