The functions genCorGen and addCorGen generate correlated data from a number of distributions (see the more general vignette discussing correlated data). Each function includes an option to specify a correlation matrix. There are three functions that facilitate the creation of matrices that can be used in both CorGen functions: genCorMat, blockExchangeMat, and blockDecayMat. This vignette provides a brief introduction to each.
In its simplest form, genCorMat can generate a single (square) correlation matrix of a specified dimension. This can be a completely randomly generated (valid) correlation matrix or a correlation matrix with a set of specified coefficients. Here is an example of the first, a randomly generated correlation matrix.
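A matrix like the one below could have been produced with a call along these lines (a sketch; it assumes the simstudy and data.table packages have been loaded, and that calling genCorMat with only the nvars argument returns a randomly generated matrix):
library(simstudy)
library(data.table)

# with only the dimension specified, the coefficients are randomly generated
genCorMat(nvars = 4)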
## [,1] [,2] [,3] [,4]
## [1,] 1.00000000 -0.22742403 0.01285282 -0.3201579
## [2,] -0.22742403 1.00000000 -0.04973973 -0.1218070
## [3,] 0.01285282 -0.04973973 1.00000000 -0.2940923
## [4,] -0.32015788 -0.12180695 -0.29409234 1.0000000
And here is a matrix with a specified set of coefficients (you will get an error message if the resulting matrix is not positive semidefinite!). These coefficients define the lower (and upper) triangle of the matrix, reading down the columns.
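Such a matrix might be created with a call like the following (a sketch; the cors vector supplies the lower-triangle coefficients read down the columns, and the result is stored in R for use below):
# coefficients ordered by column of the lower triangle
R <- genCorMat(nvars = 4, cors = c(0.6, 0.4, 0.2, 0.5, 0.3, 0.8))
R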
## [,1] [,2] [,3] [,4]
## [1,] 1.0 0.6 0.4 0.2
## [2,] 0.6 1.0 0.5 0.3
## [3,] 0.4 0.5 1.0 0.8
## [4,] 0.2 0.3 0.8 1.0
This matrix can be used to generate data using the functions genCorData or genCorGen:
dd <- genCorGen(n = 1000, nvars = 4, corMatrix = R, params1 = c(3, 5, 8, 9),
dist = "poisson", wide = TRUE)
head(dd)
## Key: <id>
## id V1 V2 V3 V4
## <int> <num> <num> <num> <num>
## 1: 1 3 3 5 6
## 2: 2 3 9 12 10
## 3: 3 1 2 14 16
## 4: 4 4 9 13 15
## 5: 5 4 9 7 9
## 6: 6 4 5 6 7
And the correlation from this data set is quite close to the specified matrix R.
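One way to check this (a sketch that mirrors the empirical checks used later in this vignette) is to drop the id column and compute the correlation of the wide data set, rounded to one decimal place:
round(cor(as.matrix(dd[, -1])), 1)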
## V1 V2 V3 V4
## V1 1.0 0.6 0.4 0.2
## V2 0.6 1.0 0.5 0.3
## V3 0.4 0.5 1.0 0.8
## V4 0.2 0.3 0.8 1.0
It is also possible to specify an exchangeable/compound symmetry or auto-regressive structure. Here is the compound symmetry structure:
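A matrix with this structure could be requested with something like the following (a sketch; rho sets the common correlation and corstr = "cs" specifies compound symmetry):
genCorMat(nvars = 4, rho = 0.6, corstr = "cs")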
## [,1] [,2] [,3] [,4]
## [1,] 1.0 0.6 0.6 0.6
## [2,] 0.6 1.0 0.6 0.6
## [3,] 0.6 0.6 1.0 0.6
## [4,] 0.6 0.6 0.6 1.0
And here is a matrix with an auto-regressive or decaying structure:
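A call along these lines (again a sketch) would yield the auto-regressive (AR-1) structure, where the correlation between two units is \(\rho^{|i-j|}\):
genCorMat(nvars = 4, rho = 0.6, corstr = "ar1")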
## [,1] [,2] [,3] [,4]
## [1,] 1.000 0.60 0.36 0.216
## [2,] 0.600 1.00 0.60 0.360
## [3,] 0.360 0.60 1.00 0.600
## [4,] 0.216 0.36 0.60 1.000
It is also possible to generate a list of correlation matrices, each of which corresponds to a specific cluster. These matrices can be of different sizes (to accommodate different cluster sizes) and have different parameters (if not random). The only constraints are that the overall structure of the matrices needs to be the same (i.e., random, cs, or ar1), and it is not possible to use the cors argument (since the number of correlation parameters would differ depending on the cluster size).
In this example, I am generating matrices with a cs structure for four clusters with sizes 2, 3, 4, and 3, respectively, and within-cluster correlations of \(\rho_1 = 0.6\), \(\rho_2 = 0.7\), \(\rho_3 = 0.5\), and \(\rho_4 = 0.4\). This reflects an overall block correlation matrix that looks like this:
Each column represents an individual unit (and so does each row). Reading down a column (or across a row) gives the correlations with the other individual units. The clusters are represented by the grids drawn over the matrix. In this case, individuals are correlated only with other individuals in the same cluster.
To generate this system of matrices, we just need to specify the number of observations per cluster (\(nvars\)), the correlation coefficients for each cluster (\(rho\), which in this case is a vector), and the number of clusters. The \(nvars\) argument needs to match the numbers of individuals in each cluster in the data set, and the lengths of \(nvars\) and \(rho\) must be the same as the number of clusters (though either or both can be scalars, in which case the values are shared across the clusters). The output is a list of correlation matrices, one for each cluster.
RC <- genCorMat(nvars = c(2, 3, 4, 3), rho = c(0.6, 0.7, 0.5, 0.4),
corstr = "cs", nclusters = 4)
RC
## $`1`
## [,1] [,2]
## [1,] 1.0 0.6
## [2,] 0.6 1.0
##
## $`2`
## [,1] [,2] [,3]
## [1,] 1.0 0.7 0.7
## [2,] 0.7 1.0 0.7
## [3,] 0.7 0.7 1.0
##
## $`3`
## [,1] [,2] [,3] [,4]
## [1,] 1.0 0.5 0.5 0.5
## [2,] 0.5 1.0 0.5 0.5
## [3,] 0.5 0.5 1.0 0.5
## [4,] 0.5 0.5 0.5 1.0
##
## $`4`
## [,1] [,2] [,3]
## [1,] 1.0 0.4 0.4
## [2,] 0.4 1.0 0.4
## [3,] 0.4 0.4 1.0
To create the correlated data, first we can generate a data set of individuals that are clustered in groups. The outcome will be Poisson distributed, so we are specifying mean \(\lambda\) for each cluster:
d1 <- defData(varname = "n", formula = "c(2, 3, 4, 3)", dist = "nonrandom")
d1 <- defData(d1, varname = "lambda", formula = "c(6, 7, 9, 8)", dist = "nonrandom")
ds <- genData(4, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")
Now, we can generate data using the correlation matrix RC:
dd <- addCorGen(dc, idvar = "site", param1 = "lambda", corMatrix = RC,
dist = "poisson", cnames = "y", method = "copula")
dd
## Key: <site>
## site n lambda id y
## <int> <num> <num> <int> <num>
## 1: 1 2 6 1 11
## 2: 1 2 6 2 7
## 3: 2 3 7 3 4
## 4: 2 3 7 4 3
## 5: 2 3 7 5 5
## 6: 3 4 9 6 8
## 7: 3 4 9 7 7
## 8: 3 4 9 8 10
## 9: 3 4 9 9 11
## 10: 4 3 8 10 2
## 11: 4 3 8 11 6
## 12: 4 3 8 12 4
If we want to confirm that everything is working as expected, we can recover the overall correlation matrix by generating a large number of data sets (in this case 5000):
replicate <- function(R, dc) {
  reps <- lapply(1:5000, function(x)
    addCorGen(dc, idvar = "site", param1 = "lambda", corMatrix = R,
      dist = "poisson", cnames = "y", method = "copula")
  )
  # stack the 5000 replicated data sets and index the observations within each replication
  drep <- data.table::rbindlist(reps, idcol = "rep")
  drep[, seq := 1:.N, keyby = rep]
  # reshape to wide format (one column per observation) and estimate the correlation matrix
  dmat <- as.matrix(dcast(drep, rep ~ seq, value.var = "y")[, -1])
  round(cor(dmat), 1)
}
replicate(R = RC, dc = dc)
It seems to have worked quite well: the empirical matrix matches the hypothetical matrix above. Later in this vignette, I'll describe how block matrices for different clusters over different time periods can also be flexibly generated for different groups.
Here is an example that generates data for a large number of clusters, where the parameters (cluster means and correlation coefficients) themselves are randomly generated. By providing this flexibility, we induce extra variability in the data generation process.
d1 <- defData(varname = "n", formula = 20, dist = "noZeroPoisson")
d1 <- defData(d1, varname = "mu", formula = 10, variance = 8, dist = "normal")
d1 <- defData(d1, varname = "s2", formula = 4, dist = "nonrandom")
ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", "id")
n <- dc[, .N, keyby = site][, N]
nsites <- length(n)
rho <- rbeta(nsites, 25, 15)
RM <- genCorMat(nvars = n, rho = rho, corstr = "cs", nclusters = nsites)
Here are the first three rows and columns of the correlation matrices for three clusters, as well as the dimensions for each matrix.
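One way to produce this display (a sketch; the cluster indices 1, 38, and 97 are simply the ones shown below) is to subset each matrix in the list and then check its dimensions:
# first three rows and columns of three of the cluster-specific matrices
lapply(RM[c(1, 38, 97)], function(x) x[1:3, 1:3])

# dimensions of those same matrices
lapply(RM[c(1, 38, 97)], dim)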
## $`1`
## [,1] [,2] [,3]
## [1,] 1.0000000 0.6155452 0.6155452
## [2,] 0.6155452 1.0000000 0.6155452
## [3,] 0.6155452 0.6155452 1.0000000
##
## $`38`
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4967856 0.4967856
## [2,] 0.4967856 1.0000000 0.4967856
## [3,] 0.4967856 0.4967856 1.0000000
##
## $`97`
## [,1] [,2] [,3]
## [1,] 1.0000000 0.6292113 0.6292113
## [2,] 0.6292113 1.0000000 0.6292113
## [3,] 0.6292113 0.6292113 1.0000000
## $`1`
## [1] 20 20
##
## $`38`
## [1] 26 26
##
## $`97`
## [1] 24 24
And here is how we generate the data:
dd <- addCorGen(dc, idvar = "site", param1 = "mu", param2 = "s2",
corMatrix = RM, dist = "normal", cnames = "y", method = "copula")
dd
## Key: <site>
## site n mu s2 id y
## <int> <num> <num> <num> <int> <num>
## 1: 1 20 6.141913 4 1 5.104369
## 2: 1 20 6.141913 4 2 7.747311
## 3: 1 20 6.141913 4 3 4.779143
## 4: 1 20 6.141913 4 4 5.396952
## 5: 1 20 6.141913 4 5 6.308432
## ---
## 2005: 100 16 10.032301 4 2005 10.680678
## 2006: 100 16 10.032301 4 2006 11.243578
## 2007: 100 16 10.032301 4 2007 10.375579
## 2008: 100 16 10.032301 4 2008 10.412091
## 2009: 100 16 10.032301 4 2009 11.517456
There are two functions, blockExchangeMat and blockDecayMat, that generate correlation matrices reflecting different correlation patterns over time. This type of correlation has a block structure, where time periods define the blocks. The general idea is that the correlation of individuals measured in the same time period (within-period correlation) could be different from the correlation of individuals measured in different time periods (between-period correlation).
A parameterization of the structure of these block correlation matrices is described by Li et al. (see reference) and has been implemented in these two functions. The options for the block matrices depend on the distinction between cross-sectional and cohort samples, as well as between exchangeable and decay patterns of correlation.
In the case where individuals are measured only once, the sample is considered cross-sectional. The key point is that, in a cross-sectional design, the individuals measured in different time periods are distinct. The structure of the correlation will depend on the assumption we make about how correlation changes over time: the correlation can reflect either exchangeability or decay.
Under the assumption of exchangeability, there is a constant within-period correlation (\(\rho_w\)) across all study participants in the same period. For participants in different periods, the between-period correlation (\(\rho_b\)) is different (presumably lower) but constant over time.
A conceptual diagram of this exchangeable correlation matrix for a cross-sectional design is shown below; it includes three periods and two individuals per period. Each box represents a different time period. So, the correlation represented in the box in the upper left hand corner is the within-period correlation for the first period. The bottom left box represents the between-period correlation for the individuals in the first and third periods.
Under the assumption of decay, the within-period correlation (\(\rho_w\)) is the same as under the exchangeability assumptions. The between-period correlation is now a function of the difference in time when the two individuals were measured. It is \(\rho_w * r^{|s-t|}\), where \(r\) is a decay parameter between 0 and 1, and \(s\) and \(t\) are the periods under consideration. For example, in the lower left-hand box, we have the correlation between individuals in the first period (\(s=1\)) and individuals in the third period (\(t=3\)), which gives a correlation coefficient of \(\rho_w \times r^{|1-3|} = \rho_w \times r^2\). As the difference in periods grows, \(r^{|s-t|}\) gets smaller.
When individuals are measured repeatedly (i.e., in each period of a study), the sample is considered to be a cohort. Actually, if every individual is measured in each period, as I've just described, this would be a closed cohort, closed in the sense that once the cohort is defined at the beginning of the study, no new participants are added. If we allow participants to start and stop at random points, this would be an open cohort design. For the purposes of simulation, it is challenging to generate data under an open cohort design with this marginal approach (using correlation matrices); it is much easier to do with random effects. What I am describing here applies to closed cohorts only.
The key difference between the cross-sectional and cohort design is the within-individual between-period (auto) correlation. Under the exchangeable assumption, the autocorrelation is specified with the correlation coefficient \(\rho_a\). The within-period between-individual correlation is still \(\rho_w\), and the between-period between-individual correlation is still \(\rho_b\). All of these correlations remain constant in the exchangeable framework:
The decay structure under an assumption of a closed cohort is the last of the four possible variations. The within-period between-individual correlation \(\rho_w\) remains the same as in the cross-sectional case, and so does the between-period between-individual correlation \(\rho_w r^{|s-t|}\). However, the between-period within-individual correlation is specified as \(r^{|s-t|}\):
These variations are implemented in blockExchangeMat and blockDecayMat. In the simulations that follow, I will generate a single correlation matrix that reflects three periods and two individuals per period, just to keep it simple.
In the first example, we specify \(\rho_w = 0.5\) and \(\rho_b = 0.3\):
library(simstudy)
library(data.table)
R_XE <- blockExchangeMat(ninds = 2, nperiods = 3, rho_w = 0.5,
  rho_b = 0.3, pattern = "xsection")
R_XE
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 0.5 0.3 0.3 0.3 0.3
## [2,] 0.5 1.0 0.3 0.3 0.3 0.3
## [3,] 0.3 0.3 1.0 0.5 0.3 0.3
## [4,] 0.3 0.3 0.5 1.0 0.3 0.3
## [5,] 0.3 0.3 0.3 0.3 1.0 0.5
## [6,] 0.3 0.3 0.3 0.3 0.5 1.0
The correlated data are generated using genCorGen with the correlation matrix R_XE. I am effectively generating 5000 data sets with 6 observations each, all based on a Poisson distribution with mean 7. I then report the empirical correlation matrix.
dd <- genCorGen(n = 5000, nvars = 6, corMatrix = R_XE,
dist = "poisson", params1 = 7, wide = TRUE)
round(cor(as.matrix(dd[, -1])), 2)
## V1 V2 V3 V4 V5 V6
## V1 1.00 0.48 0.29 0.29 0.29 0.30
## V2 0.48 1.00 0.25 0.28 0.29 0.28
## V3 0.29 0.25 1.00 0.50 0.28 0.29
## V4 0.29 0.28 0.50 1.00 0.29 0.31
## V5 0.29 0.29 0.28 0.29 1.00 0.50
## V6 0.30 0.28 0.29 0.31 0.50 1.00
Here, there is a decay parameter \(r = 0.8\) and no parameter \(\rho_b\).
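A matrix with this structure can be generated with blockDecayMat; the call below is a sketch (R_XD is an assumed name), keeping the same within-period correlation \(\rho_w = 0.5\) as before:
# assumed call: decay structure for a cross-sectional design
R_XD <- blockDecayMat(ninds = 2, nperiods = 3, rho_w = 0.5,
  r = 0.8, pattern = "xsection")

R_XD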
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.50 0.4 0.4 0.32 0.32
## [2,] 0.50 1.00 0.4 0.4 0.32 0.32
## [3,] 0.40 0.40 1.0 0.5 0.40 0.40
## [4,] 0.40 0.40 0.5 1.0 0.40 0.40
## [5,] 0.32 0.32 0.4 0.4 1.00 0.50
## [6,] 0.32 0.32 0.4 0.4 0.50 1.00
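The empirical correlation matrix that follows can be recovered just as before (a sketch, assuming R_XD holds the decay matrix shown above):
dd <- genCorGen(n = 5000, nvars = 6, corMatrix = R_XD,
  dist = "poisson", params1 = 7, wide = TRUE)

round(cor(as.matrix(dd[, -1])), 2)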
## V1 V2 V3 V4 V5 V6
## V1 1.00 0.51 0.38 0.41 0.31 0.31
## V2 0.51 1.00 0.38 0.39 0.30 0.30
## V3 0.38 0.38 1.00 0.50 0.40 0.39
## V4 0.41 0.39 0.50 1.00 0.38 0.39
## V5 0.31 0.30 0.40 0.38 1.00 0.49
## V6 0.31 0.30 0.39 0.39 0.49 1.00
Since we have a cohort, we introduce \(\rho_a = 0.4\) and specify \(pattern = \text{"cohort"}\):
R_CE <- blockExchangeMat(ninds = 2, nperiods = 3, rho_w = 0.5,
  rho_b = 0.3, rho_a = 0.4, pattern = "cohort")
R_CE
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 0.5 0.4 0.3 0.4 0.3
## [2,] 0.5 1.0 0.3 0.4 0.3 0.4
## [3,] 0.4 0.3 1.0 0.5 0.4 0.3
## [4,] 0.3 0.4 0.5 1.0 0.3 0.4
## [5,] 0.4 0.3 0.4 0.3 1.0 0.5
## [6,] 0.3 0.4 0.3 0.4 0.5 1.0
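The empirical check that follows uses the same approach as before (a sketch), generating 5000 data sets based on R_CE:
dd <- genCorGen(n = 5000, nvars = 6, corMatrix = R_CE,
  dist = "poisson", params1 = 7, wide = TRUE)

round(cor(as.matrix(dd[, -1])), 2)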
## V1 V2 V3 V4 V5 V6
## V1 1.00 0.49 0.39 0.27 0.39 0.29
## V2 0.49 1.00 0.29 0.40 0.30 0.41
## V3 0.39 0.29 1.00 0.50 0.38 0.29
## V4 0.27 0.40 0.50 1.00 0.28 0.40
## V5 0.39 0.30 0.38 0.28 1.00 0.49
## V6 0.29 0.41 0.29 0.40 0.49 1.00
In the final case, the parameterization for decaying correlation with a cohort is the same as decay in the case of a cross-sectional design; the only difference is that we set \(pattern = \text{"cohort"}\).
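A sketch of the call that could generate the matrix below (R_CD is an assumed name; the parameters \(\rho_w = 0.5\) and \(r = 0.8\) carry over from the cross-sectional decay example):
# assumed call: decay structure for a closed cohort
R_CD <- blockDecayMat(ninds = 2, nperiods = 3, rho_w = 0.5,
  r = 0.8, pattern = "cohort")

R_CD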
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.50 0.8 0.4 0.64 0.32
## [2,] 0.50 1.00 0.4 0.8 0.32 0.64
## [3,] 0.80 0.40 1.0 0.5 0.80 0.40
## [4,] 0.40 0.80 0.5 1.0 0.40 0.80
## [5,] 0.64 0.32 0.8 0.4 1.00 0.50
## [6,] 0.32 0.64 0.4 0.8 0.50 1.00
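And once more, the empirical correlation matrix that follows can be recovered with the same approach (a sketch, using the assumed R_CD):
dd <- genCorGen(n = 5000, nvars = 6, corMatrix = R_CD,
  dist = "poisson", params1 = 7, wide = TRUE)

round(cor(as.matrix(dd[, -1])), 2)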
## V1 V2 V3 V4 V5 V6
## V1 1.00 0.47 0.79 0.38 0.63 0.30
## V2 0.47 1.00 0.38 0.78 0.30 0.62
## V3 0.79 0.38 1.00 0.48 0.79 0.39
## V4 0.38 0.78 0.48 1.00 0.39 0.79
## V5 0.63 0.30 0.79 0.39 1.00 0.48
## V6 0.30 0.62 0.39 0.79 0.48 1.00
In the case of a cross-sectional design, the number of observations per period for a specific cluster does not need to remain constant (though in the case of data generation under a cohort design it does). We can vary the total number of observations as well as the correlation parameters by cluster.
In this example, there are 10 clusters and three periods. The number of individuals per cluster per period ranges from two to four and is randomly generated. The decay rate \(r\) varies by cluster (generated using the beta distribution with shape parameters 6 and 2). The parameter \(\rho_w\) is constant across all clusters and is 0.6.
defC <- defData(varname = "lambda", formula = "sample(5:10, 1)", dist = "nonrandom")
defP <- defDataAdd(varname = "n", formula = "2;4", dist="uniformInt")
dc <- genData(n = 10, dtDefs = defC, id = "site")
dc <- addPeriods(dtName = dc, nPeriods = 3,
idvars = "site", perName = "period")
dc <- addColumns(defP, dc)
dd <- genCluster(dtClust = dc, cLevelVar = "timeID",
numIndsVar = "n", level1ID = "id")
In this example, the 10 clusters will have varying numbers of observations per period. Here are the counts for three sites:
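Counts like these could be pulled directly from the period-level data (a sketch; the sites displayed, 1, 3, and 7, are an arbitrary selection):
dc[site %in% c(1, 3, 7), .(site, period, n)]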
## site period n
## <int> <int> <int>
## 1: 1 0 4
## 2: 1 1 2
## 3: 1 2 3
## 4: 3 0 3
## 5: 3 1 3
## 6: 3 2 3
## 7: 7 0 3
## 8: 7 1 2
## 9: 7 2 3
The sites will also have unique decay rates:
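The code that draws the decay rates is not shown above; a sketch of how they might have been created (one rate per site from a Beta(6, 2) distribution, rounded here to two decimal places, with sites 1, 3, and 7 displayed) is:
# site-specific decay rates, used below when building the block matrices
r <- round(rbeta(10, 6, 2), 2)
r[c(1, 3, 7)]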
## [1] 0.66 0.85 0.81
Here are the correlation matrices for these three sites:
N <- dd[, .N, keyby = .(site, period)][, N]
R <- blockDecayMat(ninds = N, nperiods = 3, rho_w = 0.6, r = r, nclusters = 10)
lapply(R, function(x) round(x, 2))[c(1, 3, 7)]
## [[1]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1.00 0.60 0.60 0.60 0.4 0.4 0.26 0.26 0.26
## [2,] 0.60 1.00 0.60 0.60 0.4 0.4 0.26 0.26 0.26
## [3,] 0.60 0.60 1.00 0.60 0.4 0.4 0.26 0.26 0.26
## [4,] 0.60 0.60 0.60 1.00 0.4 0.4 0.26 0.26 0.26
## [5,] 0.40 0.40 0.40 0.40 1.0 0.6 0.40 0.40 0.40
## [6,] 0.40 0.40 0.40 0.40 0.6 1.0 0.40 0.40 0.40
## [7,] 0.26 0.26 0.26 0.26 0.4 0.4 1.00 0.60 0.60
## [8,] 0.26 0.26 0.26 0.26 0.4 0.4 0.60 1.00 0.60
## [9,] 0.26 0.26 0.26 0.26 0.4 0.4 0.60 0.60 1.00
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1.00 0.60 0.60 0.51 0.51 0.51 0.43 0.43 0.43
## [2,] 0.60 1.00 0.60 0.51 0.51 0.51 0.43 0.43 0.43
## [3,] 0.60 0.60 1.00 0.51 0.51 0.51 0.43 0.43 0.43
## [4,] 0.51 0.51 0.51 1.00 0.60 0.60 0.51 0.51 0.51
## [5,] 0.51 0.51 0.51 0.60 1.00 0.60 0.51 0.51 0.51
## [6,] 0.51 0.51 0.51 0.60 0.60 1.00 0.51 0.51 0.51
## [7,] 0.43 0.43 0.43 0.51 0.51 0.51 1.00 0.60 0.60
## [8,] 0.43 0.43 0.43 0.51 0.51 0.51 0.60 1.00 0.60
## [9,] 0.43 0.43 0.43 0.51 0.51 0.51 0.60 0.60 1.00
##
## [[3]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1.00 0.60 0.60 0.49 0.49 0.39 0.39 0.39
## [2,] 0.60 1.00 0.60 0.49 0.49 0.39 0.39 0.39
## [3,] 0.60 0.60 1.00 0.49 0.49 0.39 0.39 0.39
## [4,] 0.49 0.49 0.49 1.00 0.60 0.49 0.49 0.49
## [5,] 0.49 0.49 0.49 0.60 1.00 0.49 0.49 0.49
## [6,] 0.39 0.39 0.39 0.49 0.49 1.00 0.60 0.60
## [7,] 0.39 0.39 0.39 0.49 0.49 0.60 1.00 0.60
## [8,] 0.39 0.39 0.39 0.49 0.49 0.60 0.60 1.00
And here is code to generate the empirical correlation matrices for the three sites, based on 5000 replications of the data:
reps <- lapply(1:5000,
  function(x) addCorGen(dd, idvar = "site", corMatrix = R,
    dist = "poisson", param1 = "lambda", cnames = "y")
)

drep <- data.table::rbindlist(reps, idcol = "rep")

empir_corr <- function(cluster) {
  dcrep <- drep[site == cluster, ]
  dcrep[, seq := 1:.N, keyby = rep]
  dmat <- as.matrix(dcast(dcrep, rep ~ seq, value.var = "y")[, -1])
  return(round(cor(dmat), 2))
}
empir_corr(cluster = 1)
empir_corr(cluster = 3)
empir_corr(cluster = 7)
Reference:
Li, Fan, James P. Hughes, Karla Hemming, Monica Taljaard, Edward R. Melnick, and Patrick J. Heagerty. “Mixed-effects models for the design and analysis of stepped wedge cluster randomized trials: an overview.” Statistical Methods in Medical Research 30, no. 2 (2021): 612-639.