Comparing samples defined over a single dimension is a straightforward task that relies on standard, well-established methods. Measuring the distance between samples in high-dimensional space, on the other hand, remains a largely unexplored field. Available solutions rely on multivariate normal distributions, an assumption that is both difficult to verify and that overlooks key behaviors of biological samples, where populations of interest often correspond to a small proportion (<1%) of all the points that have been measured.
We have developed `hilbertSimilarity` to address the problem of sample similarity in mass cytometry, where samples are measured over up to 100 dimensions and where each sample contains a slightly different number of points (or cells). Our method first transforms each sample into a probability vector, by dividing each dimension into a fixed number of bins and by associating each cell with a specific multidimensional hypercube. The proportion of cells in each hypercube is characteristic of a given sample. To characterize and compare samples we use measures from Information Theory, since their interpretation in terms of similarity and complexity is straightforward.
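In miniature, and independently of the package API, the binning step can be sketched as follows (toy data for illustration only):

# toy example: turn a 2-channel sample into a probability vector
set.seed(42)
toy <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c('chA', 'chB')))
bins <- apply(toy, 2, cut, breaks = 2, labels = FALSE)  # 2 bins per dimension
voxel <- paste(bins[, 1], bins[, 2])                    # one id per hypercube
prop.table(table(voxel))                                # proportion of cells per hypercube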
To demonstrate the power of `hilbertSimilarity` we applied the method to a subset of the Bodenmiller et al. dataset, comparing the effect of different stimulations and identifying groups of cells that are significantly affected by different treatments.
Compared to other methods, `hilbertSimilarity` does not rely on expert-driven gating, nor does it require any hypothesis about the number of populations present in the data. This makes it a powerful tool to quickly assess a dataset before applying traditional methods, or when populations are not known a priori.
`hilbertSimilarity` can be installed using the following command:

devtools::install_github('yannabraham/hilbertSimilarity')
Once installed, the package can be loaded using the standard `library` command.
library(hilbertSimilarity)
We first need data from the `bodenmiller` package:
library(bodenmiller)
# phenotypic and functional channels for the reference and untreated samples
data(refPhenoMat)
data(untreatedPhenoMat)
data(refFuncMat)
data(untreatedFuncMat)
# cell-level annotations; the reference sample gets an explicit Treatment label
data(refAnnots)
refAnnots$Treatment <- 'reference'
data(untreatedAnnots)
fullAnnots <- rbind(refAnnots[,names(untreatedAnnots)],
                    untreatedAnnots)
fullAnnots$Treatment <- factor(fullAnnots$Treatment)
fullAnnots$Treatment <- relevel(fullAnnots$Treatment,'reference')
# combine phenotypic and functional channels into a single matrix
refMat <- cbind(refPhenoMat,refFuncMat)
untreatedMat <- cbind(untreatedPhenoMat,
                      untreatedFuncMat)
fullMat <- rbind(refMat,untreatedMat)
# clip each channel to its [0.5%, 99.5%] quantiles to limit the effect of outliers
fullMat <- apply(fullMat,2,function(x) {
  qx <- quantile(x,c(0.005,0.995))
  x[x<qx[1]] <- qx[1]
  x[x>qx[2]] <- qx[2]
  return(x)
})
In this dataset 51936 cells, corresponding to 4 treatments, have been measured over 23 channels. Cells have been gated into 14 populations. The percentage of each population per treatment is shown below:
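The chunk that builds this table is not shown in the vignette; a minimal sketch using `dplyr` and `tidyr` (assuming the `Cells` and `Treatment` columns of `fullAnnots`, as above) could look like this:

library(dplyr)
library(tidyr)
# percentage of each gated population per treatment
fullAnnots %>%
  count(Cells, Treatment) %>%
  group_by(Treatment) %>%
  mutate(pc = round(100 * n / sum(n), 2)) %>%
  select(-n) %>%
  pivot_wider(names_from = Treatment, values_from = pc)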
## # A tibble: 14 x 5
## Cells reference `BCR/FcR-XL` `PMA/Ionomycin` vanadate
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 cd14-hladr- 0.14 0.12 0.17 0.43
## 2 cd14-hladrhigh 0.38 0.12 0.03 0.02
## 3 cd14-hladrmid 1.55 0.98 0.37 0.28
## 4 cd14-surf- 11.4 12.1 10.4 28.9
## 5 cd14+hladr- 0.08 0.11 0.1 1.44
## 6 cd14+hladrhigh 0.13 0.02 0.03 0.02
## 7 cd14+hladrmid 8.69 5.5 2.04 2.45
## 8 cd14+surf- 0.51 0.5 1.47 2.63
## 9 cd4+ 25.3 26.7 31.2 7.96
## 10 cd8+ 32.1 33.4 35.1 19
## 11 dendritic 0.37 0.2 0.19 0.36
## 12 igm- 1.47 2.16 1.24 1.42
## 13 igm+ 6.77 5.66 6.81 6.25
## 14 nk 11.1 12.4 10.8 28.8
The smallest cell population, cd14+hladrhigh, corresponds to 0.058% of the total cells.
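This figure can be checked directly from the annotations:

# percentage of the rarest gated population over all cells
min(prop.table(table(fullAnnots$Cells))) * 100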
To demonstrate the use of `hilbertSimilarity` we will analyze the data at the treatment level.
The `make.cut` function is used to prepare the grid that will be used to process the data matrix. It requires 2 arguments: the number of cuts (the number of bins plus 1) and the minimum number of cells to consider for a density. The latter depends on the technology and on the judgment of the scientist running the analysis. For CyTOF we require a population to correspond to at least 40 cells.
The number of bins in each dimension is defined as follows:
\[c^{d} < N\]
Where \(c\) is the number of bins, \(d\) is the number of dimensions and \(N\) is the total number of cells in the dataset. We chose the largest integer \(c\) that generates at most 1 bin per cell, so that each bin contains at least 1 cell on average. Because many channel combinations don't make biological sense, the fraction of occupied bins will be lower than 1.

\(c\) can be computed easily using the following formula:
\[c = \max \left \{ \left \lfloor \sqrt[d]{N} \right \rfloor , 2 \right \}\]
The formula has been implemented in the `hilbert.order` function. For our example, where \(N\) is 51936 and \(d\) is 23, \(c\) is 2. The number of cuts to generate is the number of required bins plus 1.
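The computation is easy to check by hand; the following lines mirror the formula above (the `hilbert.order` function wraps the same calculation):

# largest integer whose d-th power does not exceed the number of cells, floored at 2
N <- nrow(fullMat)  # 51936 cells
d <- ncol(fullMat)  # 23 channels
max(floor(N^(1/d)), 2)
## [1] 2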
After manual inspection, we kept \(c\) at 2, corresponding to 3 cuts per dimension:
nbins <- 2
# n cuts define nbins bins per channel; densities smaller than count.lim cells are ignored
cuts <- make.cut(fullMat,
                 n=nbins+1,
                 count.lim=40)
## CD20
## IgM
## CD4
## CD33
## HLA.DR
## CD14
## CD7
## CD3
## CD123
## pStat1
## pSlp76
## pBtk
## pPlcg2
## pErk
## pLat
## pS6
## pNFkB
## pp38
## pStat5
## pAkt
## pSHP2
## pZap70
## pStat3
The `make.cut` function returns 2 sets of cuts:

- `fixed` corresponds to equally sized bins over the range of the channel
- `combined` adjusts the bin size to the density of the channel

The cuts can be visualized using the `show.cut` function:
show.cut(cuts)
The green lines correspond to the `fixed` limits, the red lines to the adjusted limits (when applicable). For channels like CD3, pSlp76 and pZap70, the adjustment allows for a better separation between the positive and negative populations.
Given a dataset and a grid, the `do.cut` function will assign each cell to a particular bin in every dimension.
cutFullMat <- do.cut(fullMat,cuts,type='combined')
Effectively, each cell is now associated with a voxel, and each voxel is enriched for a particular cell type corresponding to its unique combination of dimensions and ranges.
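One can inspect the result directly; each row of the cut matrix is a cell, each column a channel, and each value a bin assignment:

head(cutFullMat[, 1:4])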
To uniquely identify each occupied voxel, one can compute a Hilbert index over the grid, after the dimensions have been ordered to better capture the structure of the data. Intuitively, each voxel on the grid contains a specific cell type. In order for populations to be associated with consecutive bins, one must optimize the order of channels so that dimensions that are specific to a population are grouped together in the input matrix.
A simple way to achieve this is to use the normalized mutual information between the dimensions to cluster them into meaningful groups:
library(entropy)
# pairwise distance between channels: 1 minus the normalized mutual information,
# so that channels carrying redundant information end up close together
miFullMat <- matrix(0,nrow=ncol(fullMat),ncol=ncol(fullMat))
for (i in seq(ncol(fullMat)-1)) {
  for (j in seq(i+1,ncol(fullMat))) {
    cur.tbl <- table(cutFullMat[,i],cutFullMat[,j])
    nent <- 1-mi.empirical(cur.tbl)/entropy.empirical(cur.tbl)
    miFullMat[i,j] <- miFullMat[j,i] <- nent
  }
}
dimnames(miFullMat) <- list(colnames(fullMat),colnames(fullMat))
# cluster the channels on this distance to derive an ordering
hcFullMat <- hclust(as.dist(miFullMat))
plot(hcFullMat)
After ordering the cut matrix using the normalized mutual information, it can be transformed into a Hilbert index using the `do.hilbert` function:
# order the channels following the clustering
col.order <- hcFullMat$labels[rev(hcFullMat$order)]
hc <- do.hilbert(cutFullMat[,col.order],nbins)
All cells are found in 13484 voxels out of the \(8.388608 \times 10^{6}\) (\(2^{23}\)) possible voxels defined over the initial 23-dimensional space (0.16%).
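These numbers can be checked directly from the index:

length(unique(hc))      # 13484 occupied voxels
nbins^ncol(cutFullMat)  # 2^23 = 8388608 possible voxels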
Using a Hilbert index to describe the grid has the advantage that consecutive indices correspond to adjacent voxels in the grid. Effectively, it corresponds to a projection from N dimensions to a single one.
To visualize the Hilbert curve we use Andrews plots to standardize the display to a common range. First we compute the number of cells per Hilbert index, per treatment:
treatment.table <- with(fullAnnots,
table(hc,Treatment))
treatment.pc <- apply(treatment.table,2,function(x) x/sum(x))
Next we prepare the Andrews vector; increasing the number of breaks returns a smoother curve:
av <- andrewsProjection(t(treatment.pc),breaks=30)
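For reference, an Andrews plot maps a vector \(x = (x_1, x_2, x_3, \ldots)\), here the proportions of cells per occupied Hilbert index, onto a single smooth curve:

\[f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \cdots\]

so that samples with similar bin profiles trace similar curves over a common range of \(t\).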
Then we project the Hilbert curve of each treatment onto the Andrews vector:
treatment.proj <- t(treatment.pc) %*% t(av$freq)
Now we can visualize the different treatments using line charts:
library(ggplot2)
library(reshape2)
melt(treatment.proj) %>%
rename(AndrewsIndex=Var2) %>%
mutate(AndrewsIndex=av$i[AndrewsIndex]) %>%
ggplot(aes(x=AndrewsIndex,y=value))+
geom_line(aes(group=Treatment,color=Treatment))+
theme_light(base_size=16)+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Specific treatments are associated with changes in specific bins. Please note that this method compresses the Hilbert curve by only considering indices that contain cells.
The Hilbert curve can be re-folded into any number of dimensions using the `hilbertProjection` function. To visualize the Hilbert curve as a scatter plot, we use 2 as the target number of dimensions:
proj <- hilbertProjection(hc,target = 2)
## Projecting the Hilbert curve to 2 dimensions will require 4096 bins
To visualize the Hilbert curve as a 2D projection, we first compute the number of cells per unique combination of annotation columns and Hilbert index:
fullProj <- fullAnnots %>%
  mutate(HilbertIndex=hc) %>%
  group_by_at(vars(one_of(colnames(fullAnnots),'HilbertIndex'))) %>%
  count() %>%
  bind_cols(as.data.frame(proj[.$HilbertIndex+1,]))  # indices are 0-based, R rows are 1-based
fullProjCount <- fullProj %>%
ungroup() %>%
count(HilbertIndex,Treatment) %>%
arrange(desc(n))
library(knitr)
kable(head(fullProjCount))
| HilbertIndex | Treatment | n |
|---|---|---|
| 516096 | reference | 4 |
| 523264 | reference | 4 |
| 523903 | BCR/FcR-XL | 4 |
| 524287 | reference | 4 |
| 524287 | BCR/FcR-XL | 4 |
| 524287 | vanadate | 4 |
The percentage of cells per treatment is visualized using `ggplot2`:
fullProj %>%
group_by(Treatment) %>%
mutate(PC=n/sum(n)) %>%
ggplot(aes(x=V1,y=V2))+
geom_tile(aes(fill=PC),
width=24,
height=24)+
facet_wrap(~Treatment)+
scale_fill_gradient(low='grey80',high='blue')+
theme_light(base_size=16)+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
The quality of the projection depends on the density of the Hilbert curve. As this curve is sparse (only 0.16% of the Hilbert indices contain at least 1 cell), this projection is only provided as an example.
With each cell now associated with a specific Hilbert index, each sample can be described by the percentage of its cells that fall into each index. The resulting table can be visualized as a heat map:
heatmap(log10(treatment.pc),
scale = 'none',
Colv = NA,
Rowv = NA,
labRow = NA,
col = colorRampPalette(c('white',blues9))(256),
margin = c(12,1))
In this experiment, each sample corresponds to a specific treatment: to compute the distance between samples we use a distance derived from information theory, the Jensen-Shannon distance. This is done through the `js.dist` function. The resulting distance matrix can be used to compute a hierarchical cluster:
treatment.dist <- js.dist(t(treatment.table))
treatment.hc <- hclust(treatment.dist)
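For intuition, the Jensen-Shannon distance between two count vectors is the square root of the Jensen-Shannon divergence. A minimal hand-rolled version (an illustration of the metric, not the package implementation) could look like this:

# illustration only: Jensen-Shannon distance between two count vectors
jsDist <- function(p, q) {
  p <- p / sum(p)                 # counts to probabilities
  q <- q / sum(q)
  m <- (p + q) / 2                # mixture distribution
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))  # KL divergence
  sqrt((kl(p, m) + kl(q, m)) / 2) # bounded in [0, 1] when using log2
}
jsDist(treatment.table[, 'reference'], treatment.table[, 'vanadate'])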
When ordering samples using `treatment.hc`, we see patterns emerging:
heatmap(log10(treatment.pc),
scale = 'none',
Colv = as.dendrogram(treatment.hc),
Rowv = NA,
labRow = NA,
col = colorRampPalette(c('white',blues9))(256),
margin = c(12,2))
Samples corresponding to reference and BCR/FcR-XL cluster together, while PMA/Ionomycin and vanadate samples show strong differences from every other sample.