dfms provides a user friendly and computationally efficient approach to estimate linear Gaussian Dynamic Factor Models in R. The package is not geared at any specific application, and can be used for dimensionality reduction, forecasting and nowcasting systems of time series. The use of the package is facilitated by a comprehensive set of methods to explore/plot models and extract results.
introduction.R
This vignette walks through the main features of the package. The data provided in the package in xts format is taken from Banbura and Modugno (2014)1, henceforth BM14, and covers the Euro Area from January 1980 through September 2009.
# Using the monthly series from BM14
dim(BM14_M)
#> [1] 357 92
range(index(BM14_M))
#> [1] "1980-01-31" "2009-09-30"
head(colnames(BM14_M))
#> [1] "ip_total" "ip_tot_cstr" "ip_tot_cstr_en" "ip_constr"
#> [5] "ip_im_goods" "ip_capital"
plot(scale(BM14_M), lwd = 1)
introduction.R
The data frame BM14_Models
provides information about
the series2, and the various models estimated by
BM14.
head(BM14_Models, 3)
#> series
#> 1 ip_total
#> 2 ip_tot_cstr
#> 3 ip_tot_cstr_en
#> label
#> 1 IP-Total Industry - Working Day and Seasonally Adjusted
#> 2 IP-Total Industry (Excluding Construction) - Working Day and Seasonally Adjusted
#> 3 IP-Total Industry Excluding Construction and MIG Energy - Working Day and Seasonally Adjusted
#> code freq log_trans small medium large
#> 1 sts.m.i5.Y.PROD.NS0010.4.000 M TRUE FALSE FALSE TRUE
#> 2 sts.m.i5.Y.PROD.NS0020.4.000 M TRUE TRUE TRUE TRUE
#> 3 sts.m.i5.Y.PROD.NS0021.4.000 M TRUE FALSE FALSE TRUE
# Using only monthly data
BM14_Models_M <- subset(BM14_Models, freq == "M")
introduction.R
Prior to estimation, all data is differenced by BM14, and some series
are log, differenced, as indicated by the log_trans
column
in BM14_Models
. In general, dfms uses a stationary
Kalman Filter with time-invariant system matrices, and therefore expects
data to be stationary. Data is also scaled and centered3 in the main
DFM()
function, thus this does not need to be done by the
user.
library(magrittr)
# log-transforming and first-differencing the data
BM14_M[, BM14_Models_M$log_trans] %<>% log()
BM14_M_diff = diff(BM14_M)
plot(scale(BM14_M_diff), lwd = 1)
introduction.R
Before estimating a model, the ICr()
function can be
applied to determine the number of factors. It computes 3 information
criteria proposed in Bai and NG (2002)4, whereby the second
criteria generally suggests the most parsimonious model.
ic = ICr(BM14_M_diff)
#> Missing values detected: imputing data with tsnarmimp() with default settings
print(ic)
#> Optimal Number of Factors (r) from Bai and Ng (2002) Criteria
#>
#> IC1 IC2 IC3
#> 7 7 13
plot(ic)
introduction.R
Another option is to use a Screeplot to gauge the number of factors
by looking for a kink in the plot. A mathematical procedure for finding
the kink was suggested by Onatski (2010)5, but this is currently
not implemented in ICr()
.
introduction.R
Based on both the information criteria and the Screeplot, I gauge
that a model with 4 factors should be estimated, as factors, 5, 6 and 7
do not add much to the explanatory power of the model. Next to the
number of factors, the lag order of the factor-VAR of the transition
equation should be estimated (the default is 1 lag). This can be done
using the VARselect()
function from the vars
package, with PCA factor estimates reported by ICr()
.
# Using vars::VARselect() with 4 principal components to estimate the VAR lag order
vars::VARselect(ic$F_pca[, 1:4])
#> $selection
#> AIC(n) HQ(n) SC(n) FPE(n)
#> 6 3 3 6
#>
#> $criteria
#> 1 2 3 4 5 6
#> AIC(n) 5.810223 5.617282 5.427760 5.389413 5.407765 5.381829
#> HQ(n) 5.898758 5.776646 5.657953 5.690434 5.779614 5.824507
#> SC(n) 6.032560 6.017490 6.005838 6.145361 6.341582 6.493517
#> FPE(n) 333.696100 275.153456 227.671078 219.144228 223.265640 217.639133
#> 7 8 9 10
#> AIC(n) 5.409877 5.394900 5.421375 5.460761
#> HQ(n) 5.923383 5.979235 6.076538 6.186753
#> SC(n) 6.699434 6.862328 7.066673 7.283929
#> FPE(n) 223.956824 220.793226 226.933863 236.331677
introduction.R
The selection thus suggests we should estimate a factor model with
r = 4
factors and p = 3
lags6. Before estimating the
model I note that dfms does not deal with seasonality in
series, thus it is recommended to also seasonally adjust data,
e.g. using the seasonal package before estimation. BM14 only
use seasonally adjusted series, thus this is not necessary with the
example data provided.
Estimation can then simply be done using the DFM()
function with parameters r
and p
7.
# Estimating the model with 4 factors and 3 lags using BM14's EM algorithm
model1 = DFM(BM14_M_diff, r = 4, p = 3)
#> Converged after 26 iterations.
print(model1)
#> Dynamic Factor Model: n = 92, T = 356, r = 4, p = 3, %NA = 25.8366
#>
#> Factor Transition Matrix [A]
#> L1.f1 L1.f2 L1.f3 L1.f4 L2.f1 L2.f2 L2.f3 L2.f4 L3.f1
#> f1 0.4720 -0.1297 0.8460 0.2098 -0.0733 -0.1436 -0.0595 0.1565 0.2356
#> f2 -0.1612 0.1699 0.2389 0.1598 0.0641 -0.1341 -0.0542 0.1287 0.1336
#> f3 0.3965 0.3264 0.0213 -0.3033 -0.1542 -0.0467 -0.1484 -0.0150 -0.1172
#> f4 0.1096 0.1601 -0.1578 0.2485 -0.0365 -0.0563 -0.0230 -0.1117 -0.0719
#> L3.f2 L3.f3 L3.f4
#> f1 -0.0803 -0.0386 0.0408
#> f2 0.1347 -0.0024 -0.0342
#> f3 -0.0087 0.1767 0.0249
#> f4 0.0307 0.0662 -0.0035
plot(model1)
introduction.R
The model can be investigated using summary()
, which
returns an object of class ‘dfm_summary’ containing the system matrices
and summary statistics of the factors and the residuals in the
measurement equation, as well as the R-Squared of the factor model for
individual series. The print method automatically adjusts the amount of
information printed to the data size. For large databases with more than
40 series, no series-level statistics are printed.
dfm_summary <- summary(model1)
print(dfm_summary) # Large model with > 40 series: defaults to compact = 2
#> Dynamic Factor Model: n = 92, T = 356, r = 4, p = 3, %NA = 25.8366
#>
#> Call: DFM(X = BM14_M_diff, r = 4, p = 3)
#>
#> Summary Statistics of Factors [F]
#> N Mean Median SD Min Max
#> f1 356 -0.0448 0.3455 4.4505 -21.9265 11.0306
#> f2 356 -0.0319 -0.0892 2.68 -9.9549 7.4988
#> f3 356 -0.1032 -0.0593 3.2891 -12.0969 16.2455
#> f4 356 -0.0118 0.089 2.161 -8.2883 10.7219
#>
#> Factor Transition Matrix [A]
#> L1.f1 L1.f2 L1.f3 L1.f4 L2.f1 L2.f2 L2.f3 L2.f4 L3.f1
#> f1 0.4720 -0.1297 0.84605 0.2098 -0.07334 -0.14356 -0.05950 0.15645 0.2356
#> f2 -0.1612 0.1699 0.23889 0.1598 0.06406 -0.13413 -0.05415 0.12869 0.1336
#> f3 0.3965 0.3264 0.02128 -0.3033 -0.15424 -0.04669 -0.14839 -0.01495 -0.1172
#> f4 0.1096 0.1601 -0.15776 0.2485 -0.03655 -0.05626 -0.02304 -0.11169 -0.0719
#> L3.f2 L3.f3 L3.f4
#> f1 -0.080320 -0.038592 0.040812
#> f2 0.134692 -0.002391 -0.034215
#> f3 -0.008694 0.176663 0.024876
#> f4 0.030716 0.066201 -0.003465
#>
#> Factor Covariance Matrix [cov(F)]
#> f1 f2 f3 f4
#> f1 19.8067 2.0846* -3.4700* -2.1094*
#> f2 2.0846* 7.1822 -2.8725* -1.0631*
#> f3 -3.4700* -2.8725* 10.8182 1.9286*
#> f4 -2.1094* -1.0631* 1.9286* 4.6701
#>
#> Factor Transition Error Covariance Matrix [Q]
#> u1 u2 u3 u4
#> u1 9.0178 0.3303 -3.0764 -1.0182
#> u2 0.3303 5.4425 -1.3095 -0.5051
#> u3 -3.0764 -1.3095 7.0230 0.8639
#> u4 -1.0182 -0.5051 0.8639 3.8005
#>
#> Summary of Residual AR(1) Serial Correlations
#> N Mean Median SD Min Max
#> 92 -0.0409 -0.0782 0.2959 -0.5073 0.6858
#>
#> Summary of Individual R-Squared's
#> N Mean Median SD Min Max
#> 92 0.3712 0.299 0.2888 0.0067 0.9978
# Can request more detailed printouts
# print(dfm_summary, compact = 1)
# print(dfm_summary, compact = 0)
introduction.R
Apart from the model summary, the dfm methods
residuals()
and fitted()
return observation
residuals and fitted values from the model. The default format is a
plain matrix, but the functions also have an argument to return data in
the original (input) format.
introduction.R
Another way to examine the factor model visually is to plot the
Quasi-Maximum-Likelihood (QML) factor estimates against PCA and Two-Step
estimates following Doz, Giannone and Reichlin (2011)8, where the Kalman
Filter and Smoother is run only once. Both estimates are also computed
by DFM()
during EM estimation and can also be visualized
with plot.dfm
.
introduction.R
The plot with the various estimates shows that the QML estimates are
more volatile in the initial periods where there are many missing
series, but less volatile in the latter periods. In general, QML
estimates may now always be superior across the entire data range to
Two-Step and PCA estimates. Often, Two-Step estimates also provide
similar forecasting performance, and are much faster to estimate using
DFM(BM14_M_diff, r = 4, p = 3, em.method = "none")
.
The factor estimates themselves can be extracted in a data frame
using as.data.frame()
, which also provides various options
regarding the estimates retained and the format of the frame. It is also
possible to add a time variable from the original data (the default is a
sequence of integers).
# Default: all estimates in long format
head(as.data.frame(model1, time = index(BM14_M_diff)))
#> Method Factor Time Value
#> 1 PCA f1 1980-02-29 1.1503013
#> 2 PCA f1 1980-03-31 0.3374613
#> 3 PCA f1 1980-04-30 -1.4905337
#> 4 PCA f1 1980-05-31 -1.5360036
#> 5 PCA f1 1980-06-30 -0.2828433
#> 6 PCA f1 1980-07-31 0.3385403
introduction.R
DFM forecasts can be obtained with the predict()
method,
which dynamically forecasts the factors using the transition equation
(default 10 periods), and then also predicts data forecasts using the
observation equation. Objects are of class ‘dfm_forecast’
# 12-period ahead DFM forecast
fc = predict(model1, h = 12)
print(fc)
#> 12 Step Ahead Forecast from Dynamic Factor Model
#>
#> Factor Forecasts
#> f1 f2 f3 f4
#> 1 3.4698 1.7276 0.2706 -0.9064
#> 2 2.4379 0.0305 1.3119 -0.0097
#> 3 1.6576 0.3182 0.7247 0.1037
#> 4 1.7445 0.7829 -0.2230 -0.1513
#> 5 0.8281 0.1390 0.5545 0.1199
#> 6 0.9576 0.3001 0.0126 -0.0923
#> 7 0.6599 0.2489 0.0333 -0.0518
#> 8 0.3354 0.0681 0.1999 0.0332
#> 9 0.4286 0.1824 -0.0830 -0.0672
#> 10 0.1853 0.0605 0.0882 0.0107
#> 11 0.1605 0.0544 0.0268 -0.0067
#> 12 0.1509 0.0734 -0.0288 -0.0235
#>
#> Series Forecasts
#> ip_total ip_tot_cstr ip_tot_cstr_en ip_constr ip_im_goods ip_capital
#> ip_d_cstr ip_nd_cons ip_en ip_en_2 ip_manuf ip_metals ip_chemicals
#> ip_electric ip_machinery ip_paper ip_plastic new_cars orders
#> ret_turnover_defl ecs_ec_sent_ind ecs_ind_conf ecs_ind_order_book
#> ecs_ind_stocks ecs_ind_prod_exp ecs_ind_prod_rec_m ecs_ind_x_orders
#> ecs_ind_empl_exp ecs_cons_conf ecs_cons_sit_over_next_12 ecs_cons_exp_unempl
#> ecs_cons_gen_last_12m ecs_cstr_conf ecs_cstr_order_books ecs_cstr_empl_exp
#> ecs_cstr_prod_recent ecs_ret_tr_conf ecs_ret_tr_bus_sit ecs_ret_tr_stocks
#> ecs_ret_tr_exp_bus ecs_ret_tr_empl ecs_serv_conf ecs_serv_empl_exp
#> pms_comp_output pms_comp_empl pms_pmi pms_manuf_empl pms_manuf_output
#> pms_manuf_product pms_serv_out pms_serv_empl pms_serv_new_bus
#> pms_serv_product urx empl_total empl_tot_xc empl_cstr empl_manuf
#> extra_ea_trade_exp_val intra_ea_trade_exp_val extra_ea_trade_imp_val
#> intra_ea_trade_imp_val us_ip us_urx us_empl us_retail_sales
#> us_ip_manuf_exp us_cons_exp us_r3_m us_r10_year m3 loans ir_long
#> ir_short ir_1_year ir_2_year ir_5_year eer eer_cpi eer_ppi exr_usd
#> exr_gbp rxr_yen euro50 euro325 sp500 dow_j raw_mat_en raw_mat_oil
#> raw_mat_gold raw_mat_oil_fwd raw_mat
#> [ reached getOption("max.print") -- omitted 12 rows ]
introduction.R
These forecasts can also be visualized using a plot method. By default the entire series history is plotted along with the forecasts, thus it is often helpful to restrict the plot range. As with any stationary autoregressive model, the forecasts tend to zero quite quickly9.
# Predicting with Two-Step estimates
# plot(predict(model1, h = 12, method = "2s"), xlim = c(320, 370))
introduction.R
The forecasts can also be retrieved in data frame using
as.data.frame()
. Again the method has various arguments to
control the forecasts retained (factors, data or both, default factors),
and the format of the frame.
# Factor forecasts in wide format
head(as.data.frame(fc, pivot = "wide"))
#> Time Forecast f1 f2 f3 f4
#> 1 1 FALSE 4.207651 -1.6552707 3.2190834 -1.82663141
#> 2 2 FALSE -3.656458 -6.6178403 -3.9121446 -0.05762001
#> 3 3 FALSE -14.486337 -0.6573393 -6.8663850 1.69525122
#> 4 4 FALSE -14.794458 3.1573790 -6.4820328 1.70916785
#> 5 5 FALSE -8.191696 0.5563773 -0.1840816 0.98407541
#> 6 6 FALSE -1.357927 -0.3805697 3.5068964 -0.05346754
introduction.R
dfms currently provides no specific adjustments for data at different frequencies. An algorithm that accommodates monthly and quarterly series is planned for summer 2023. In the meantime, users may choose to block the data (creating multiple quarterly series from a monthly series, and duplicating quarterly series to maintain equal representation).
dfms also exports central functions that help with DFM
estimation, such as imputing missing values with
tsnarmimp()
, estimating a VAR with .VAR()
, or
Kalman Filtering and Smoothing with SKFS()
, or separately
with SKF()
followed by FIS()
. To my knowledge
these are the fastest routines for simple stationary Kalman Filtering
and Smoothing currently available in R. The function
em_converged()
can be used to check convergence of the
log-likelihood in EM estimation.
dfms also exports a matrix inverse and pseudo-inverse from
the Armadillo C++ library through the functions ainv()
and
apinv()
. These are often faster than solve()
,
and somewhat more robust in near-singularity cases.
Doz, C., Giannone, D., & Reichlin, L. (2011). A two-step estimator for large approximate dynamic factor models based on Kalman filtering. Journal of Econometrics, 164(1), 188-205.
Doz, C., Giannone, D., & Reichlin, L. (2012). A quasi-maximum likelihood approach for large, approximate dynamic factor models. Review of Economics and Statistics, 94(4), 1014-1024.
Banbura, M., & Modugno, M. (2014). Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data. Journal of Applied Econometrics, 29(1), 133-160.
Mariano, R. S., & Murasawa, Y. (2003). A new coincident index of business cycles based on monthly and quarterly series. Journal of Applied Econometrics, 18(4), 427-443.
Bai, J., Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.
Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics, 92(4), 1004-1016.
Stock, J. H., & Watson, M. W. (2016). Dynamic Factor Models, Factor-Augmented Vector Autoregressions, and Structural Vector Autoregressions in Macroeconomics. Handbook of Macroeconomics, 2, 415–525.
Banbura, M., & Modugno, M. (2014). Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data. Journal of Applied Econometrics, 29(1), 133-160.↩︎
Both about the monthly ones and quarterly series
contained in BM14_Q
. The order of rows in
BM14_Models
matches the column order of series in
merge(BM14_M, BM14_Q)
.↩︎
Data must be scaled and centered because the Kalman Filter has no intercept term.↩︎
Bai, J., Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221. doi:10.1111/1468-0262.00273↩︎
Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics, 92(4), 1004-1016.↩︎
Some authors like Doz, Giannone, and Reichlin (2012)
additionally allow the number of transition error processes, termed
‘dynamic factors’ and given an extra parameter q
, to be
less than the number of ‘static factors’ r
. I find this
terminology confusing and the feature not very useful for most practical
purposes, thus I have not implemented it in dfms.↩︎
Users can also choose from two different implementations
of the EM algorithm using the argument em.method
. The
default is em.method = "auto"
, which chooses the modified
EM algorithm proposed by BM14 for missing data if anyNA(X)
,
and the plain EM implementation of Dos, Giannone and Reichlin (2012),
henceforth DGR12, otherwise. The baseline implementation of DGR12 is
typically more than 2x faster than BM14’s method, and can also be used
with data that has a few random missing values (< 5%), but gives
wrong results with many and systematically missing data, such as ragged
edges at the beginning of end of the sample or series at different
frequencies. With complete datasets, both em.method = "DGR"
and em.method = "BM"
gives identical factor estimates, and
in this case em.method = "DGR"
should be used.↩︎
Doz, C., Giannone, D., & Reichlin, L. (2011). A two-step estimator for large approximate dynamic factor models based on Kalman filtering. Journal of Econometrics, 164(1), 188-205.↩︎
Depending also on the lag-order of the factor-VAR. Higher lag-orders product more interesting forecast dynamics.↩︎