This R package provides a big-data-friendly and memory-efficient difference-in-differences (DiD) estimator for staggered (and non-staggered) treatment contexts. It supports controlling for time-varying covariates, heteroskedasticity-robust standard errors, and (single and multi-way) clustered standard errors. It addresses 4 issues that arise in the context of large administrative datasets:
1. Speed: DiDforBigData will provide estimation and inference for staggered DiD with millions of observations on a personal laptop. It is orders of magnitude faster than other available software if the sample size is large; see the demonstration here.
2. Memory: DiDforBigData helps by using much less memory than other software; see the demonstration here.
3. Dependencies: It relies on data.table for big data management and sandwich for robust standard error estimation, which are already installed with most R distributions. Optionally, it will use the fixest package to speed up the estimation if it is installed. If the progress package is installed, it will also provide a progress bar so you know how much longer the estimation will take.
4. Parallelization: DiDforBigData makes parallelization easy as long as the parallel package is installed.

To install the package from CRAN:
install.packages("DiDforBigData")
To install the package from Github:
devtools::install_github("setzler/DiDforBigData")
To use the package after it is installed:
library(DiDforBigData)
It is recommended to also make sure these optional packages have been installed:
library("progress")
library("fixest")
library("parallel")
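Because library() stops with an error when a package is missing, a non-failing check can be more convenient on servers without internet access. The following is a minimal sketch using base R's requireNamespace(); the variable names optional_pkgs, available, and missing_pkgs are illustrative and not part of DiDforBigData.

```r
# Optional packages named in this README.
optional_pkgs <- c("progress", "fixest", "parallel")

# requireNamespace() returns TRUE/FALSE instead of raising an error,
# so this check is safe to run whether or not the packages are present.
available <- vapply(
  optional_pkgs,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

print(available)

# Report any optional packages that still need to be installed.
missing_pkgs <- names(available)[!available]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```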
There are only 3 functions in this package:
SimDiD(): This function simulates data.
DiDge(): This function estimates DiD for a single cohort and a single event time.
DiD(): This function estimates DiD for all available cohorts and event times.
Details for each function are available from the Function Documentation.
Before estimation, set up a variable list with the names of your variables:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
To estimate DiD for a single cohort and event time, use the DiDge command. For example:
DiDge(inputdata = yourdata, varnames = varnames,
cohort_time = 2010, event_postperiod = 3)
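If you do not have data at hand, the call above can be tried on simulated data. The sketch below makes several assumptions that should be checked against ?SimDiD: that SimDiD() accepts a sample_size argument, that it returns a list whose simdata element holds the panel with columns id, year, cohort, and Y, and that 2010 is one of the simulated cohorts. The example is skipped when the package is not installed.

```r
# Variable list matching the column names assumed for the simulated data.
varnames <- list(
  time_name    = "year",
  outcome_name = "Y",
  cohort_name  = "cohort",
  id_name      = "id"
)

if (requireNamespace("DiDforBigData", quietly = TRUE)) {
  library(DiDforBigData)

  # Assumption: the simulated panel is stored in the `simdata` element.
  sim <- SimDiD(sample_size = 1000)
  simdata <- sim$simdata

  # Estimate DiD for the 2010 cohort, 3 periods after treatment onset.
  result <- DiDge(inputdata = simdata, varnames = varnames,
                  cohort_time = 2010, event_postperiod = 3)
  print(result)
} else {
  message("DiDforBigData is not installed; skipping the example.")
}
```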
A detailed manual explaining the various features available in DiDge is available here or by running this command in R:
?DiDge
To estimate DiD for many cohorts and event times, use the DiD command. For example:
DiD(inputdata = yourdata, varnames = varnames,
min_event = -3, max_event = 5)
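Since estimation across many cohorts and event times is where parallelization matters most, the sketch below shows one way a core count might be chosen and passed along. It assumes that DiD() accepts a parallel_cores argument and that SimDiD() behaves as described in its help page; please confirm both with ?DiD and ?SimDiD. Falling back to a single core on Windows is a conservative choice made here, not a documented requirement of the package.

```r
# Pick a core count, leaving one core free for the rest of the system.
n_cores <- 1
if (requireNamespace("parallel", quietly = TRUE)) {
  detected <- parallel::detectCores()
  if (!is.na(detected)) n_cores <- max(1, detected - 1)
}
# Conservative assumption: fork-based parallelism is unavailable on Windows.
if (.Platform$OS.type == "windows") n_cores <- 1

if (requireNamespace("DiDforBigData", quietly = TRUE)) {
  library(DiDforBigData)

  # Assumption: SimDiD() returns a list with the panel in `simdata`.
  sim <- SimDiD(sample_size = 1000)
  varnames <- list(time_name = "year", outcome_name = "Y",
                   cohort_name = "cohort", id_name = "id")

  # Assumption: `parallel_cores` is the argument controlling parallelism.
  results <- DiD(inputdata = sim$simdata, varnames = varnames,
                 min_event = -3, max_event = 5,
                 parallel_cores = n_cores)
  print(results)
} else {
  message("DiDforBigData is not installed; skipping the example.")
}
```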
A detailed manual explaining the various features available in DiD is available here or by running this command in R:
?DiD
For more information, read the following articles:
Acknowledgements: Thanks to Mert Demirer and Kirill Borusyak for helpful comments.