SNPLinkage is modular enough to use directly dataframes of correlation matrices and chromosomic positions specified by the user, e.g. for visualizing RNASeq data. The user can compute a correlation matrix or any kind of pair-wise similarity matrix independently and then use SNPLinkage to build and arrange easily customizable ggplot2 objects.
The user can specify the correlations he wants to visualize as a
dataframe to the ggplot_ld
function. The column names must
follow the following pattern: SNP_A
and SNP_B
for the two variables in relation, and R2
for the
correlation value.
library(snplinkage)
#> Loading required package: GWASTools
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
# example rnaseq data matrix, 20 variables of 20 patients
m_rna = matrix(runif(20 ^ 2), nrow = 20)
# pair-wise correlation matrix
m_ld = cor(m_rna) ^ 2
# keep only upper triangle and reshape to data frame
m_ld[lower.tri(m_ld, diag = TRUE)] = NA
df_ld = reshape2::melt(m_ld) |> na.omit()
# rename for SNPLinkage
names(df_ld) = c('SNP_A', 'SNP_B', 'R2')
# visualize with ggplot_ld
gg_ld = ggplot_ld(df_ld)
gg_ld
Similarly, the user can specify a dataframe to the
ggplot_snp_pos
function. The dataframe is assumed to be in
the same order as the correlation dataframe, and the column name
position
is required.
# let's imagine the 20 variables came from 3 physically close regions
positions = c(runif(7, 31e6, 31.5e6), runif(6, 32e6, 32.5e6),
runif(7, 33e6, 33.5e6)) |> sort()
# build the dataframe
df_snp_pos = data.frame(position = positions)
# minimal call
gg_snp_pos = ggplot_snp_pos(df_snp_pos)
Optionally, one can specify the labels_colname
parameter
to give the name of a column that will have the labels to display.
We then arrange the plots with the gtable_ld_grobs
function. One needs to specify in the labels_colname
parameter if the chromosomic positions plot was built with labels or
not. The title
parameter is also required.
Finally we add the variables’ associations to our outcome of
interest. The ggplot_associations
uses as input a dataframe
and accepts a parameter pvalue_colname
to specify which
column holds the association values, by default ‘pvalues’. It also
requires a labels_colname
parameter to specify the column
holding the labels, and a column named chromosome
. The
linked_area
parameter will affect how the associations are
plotted and it is recommended to be used in combination with the
diamonds
parameter of ggplot_ld
(i.e. TRUE for
small number of variables, approximately less than 40).
Additionally, the n_labels
parameter controls the number
of highest association labels displayed (be default 10, the behavior can
be disabled by setting labels_colname
to NULL), and the
nudge
parameter will affect how the labels are displayed
(passed to geom_label_repel
function of ‘ggrepel’
package).
# let's imagine the middle region, HLA-B, is more associated with the outcome
pvalues = c(runif(7, 1e-3, 1e-2), runif(6, 1e-8, 1e-6), runif(7, 1e-3, 1e-2))
log10_pvals = -log10(pvalues)
# we can reuse the df_snp_pos object
df_snp_pos$pvalues = log10_pvals
# add the chromosome column
df_snp_pos$chromosome = 6
gg_assocs = ggplot_associations(df_snp_pos, labels_colname = 'label',
linked_area = TRUE, nudge = c(0, 0.5),
n_labels = 12)
We then arrange the plots with the gtable_ld_grobs
function as previously. We need to call the ggplot_snp_pos
function with the upper_subset
parameter set to TRUE for it
to connect to the upper graph.
gg_pos_biplot = ggplot_snp_pos(df_snp_pos, labels_colname = 'label',
upper_subset = TRUE)
# let's also say the middle region HLA-B is particularly correlated
df_ld$R2[df_ld$SNP_A %in% 8:13 & df_ld$SNP_B %in% 8:13] = runif(15, 0.7, 0.9)
gg_ld = ggplot_ld(df_ld)
l_ggs = list(pos = gg_pos_biplot, ld = gg_ld, pval = gg_assocs)
gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
We can extract a title and remove the horizontal axis text as follows.
library(ggplot2)
gg_assocs <- gg_assocs + theme(axis.text.x = element_blank())
title <- gg_assocs$labels$x %>% gsub(' (Mbp)', '', ., fixed = TRUE) %>%
paste('-', nrow(df_snp_pos), 'SNPs')
gg_assocs <- gg_assocs + labs(title = title, x = NULL)
l_ggs$pval = gg_assocs
gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
grid::grid.draw(gt_ld)
Let’s say we want to change the color of the associations area. We first need to identify which layer it corresponds to:
gg_assocs$layers
#> [[1]]
#> geom_area: na.rm = FALSE, orientation = NA, outline.type = upper
#> stat_align: na.rm = FALSE, orientation = NA
#> position_stack
#>
#> [[2]]
#> mapping: xend = ~x
#> geom_segment: arrow = NULL, arrow.fill = NULL, lineend = butt, linejoin = round, na.rm = FALSE
#> stat_identity: na.rm = FALSE
#> position_identity
#>
#> [[3]]
#> geom_label_repel: parse = FALSE, box.padding = 0.25, label.padding = 0.1, point.padding = 1e-06, label.r = 0.15, label.size = 0.1, min.segment.length = 0.5, arrow = NULL, na.rm = TRUE, force = 1, force_pull = 1, max.time = 0.5, max.iter = 10000, max.overlaps = 10, nudge_x = 0, nudge_y = 0.5, xlim = c(NA, NA), ylim = c(NA, NA), direction = both, seed = NA, verbose = FALSE
#> stat_identity: na.rm = TRUE
#> position_nudge_repel
#>
#> [[4]]
#> geom_point: na.rm = FALSE
#> stat_identity: na.rm = FALSE
#> position_identity
Then we can change the color with:
And rebuild our object.
To change the lines and labels colors, parameters in the functions are available. You can either specify a single value or a vector of same length as your number of features.
gg_pos_biplot = ggplot_snp_pos(df_snp_pos, labels_colname = 'label',
upper_subset = TRUE, colors = '#101d6b')
gg_assocs = ggplot_associations(df_snp_pos, labels_colname = 'label',
linked_area = TRUE, nudge = c(0, 0.5),
n_labels = 12, colors = '#101d6b')
# extract title
gg_assocs <- gg_assocs + theme(axis.text.x = element_blank())
title <- gg_assocs$labels$x %>% gsub(' (Mbp)', '', ., fixed = TRUE) %>%
paste('-', nrow(df_snp_pos), 'SNPs')
gg_assocs <- gg_assocs + labs(title = title, x = NULL)
# replace area color
gg_assocs$layers[[1]]$aes_params$fill = "#0147ab"
# rebuild
l_ggs = list(pos = gg_pos_biplot, ld = gg_ld, pval = gg_assocs)
gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
grid::grid.draw(gt_ld)
In this dataset from the ‘gap’ package (Zhao, Kurt Hornik, and Ripley 2015), 206 SNPs from chromosome 5 (5q31) were measured from 129 Crohn’s disease patients and their 2 parents, totalling 387 samples.
data('crohn')
m_hla = crohn[, -(1:6)]
m_ld = cor(m_hla) ^ 2
# keep only upper triangle and reshape to data frame
m_ld[lower.tri(m_ld, diag = TRUE)] = NA
df_ld = reshape2::melt(m_ld) |> na.omit()
# rename for SNPLinkage
names(df_ld) = c('SNP_A', 'SNP_B', 'R2')
# visualize with ggplot_ld
gg_ld = ggplot_ld(df_ld)
Compute p-values
mlog10_pvals = chisq_pvalues(m_hla, crohn[, 'crohn'])
df_pos = data.frame(probe_id = colnames(m_hla), pvalues = mlog10_pvals,
chromosome = 5)
# if we don't have positions we can use byindex = TRUE
gg_assocs = ggplot_associations(df_pos, byindex = TRUE, nudge = c(0, 0.5))
Arrange with ‘cowplot’
Focus on most associated
df_top_assocs = subset(df_pos, pvalues > quantile(pvalues, 0.9))
gg_assocs = ggplot_associations(df_top_assocs, linked_area = TRUE,
nudge = c(0, 0.5))
df_ld = subset(df_ld, SNP_A %in% df_top_assocs$probe_id &
SNP_B %in% df_top_assocs$probe_id)
gg_ld = ggplot_ld(df_ld)
cowplot::plot_grid(gg_assocs, gg_ld, nrow = 2)
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 snplinkage_1.2.0 GWASTools_1.50.0
#> [4] Biobase_2.64.0 BiocGenerics_0.50.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 gdsfmt_1.40.0 httr2_1.0.1
#> [4] sandwich_3.1-0 biomaRt_2.60.1 rlang_1.1.4
#> [7] magrittr_2.0.3 compiler_4.4.1 RSQLite_2.3.7
#> [10] mgcv_1.9-1 reshape2_1.4.4 png_0.1-8
#> [13] vctrs_0.6.5 quantreg_5.98 stringr_1.5.1
#> [16] pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3
#> [19] SNPRelate_1.38.0 fastmap_1.2.0 backports_1.5.0
#> [22] dbplyr_2.5.0 XVector_0.44.0 labeling_0.4.3
#> [25] utf8_1.2.4 rmarkdown_2.28 UCSC.utils_1.0.0
#> [28] nloptr_2.1.1 MatrixModels_0.5-3 purrr_1.0.2
#> [31] bit_4.0.5 xfun_0.47 glmnet_4.1-8
#> [34] jomo_2.7-6 logistf_1.26.0 zlibbioc_1.50.0
#> [37] cachem_1.1.0 GenomeInfoDb_1.40.1 jsonlite_1.8.8
#> [40] progress_1.2.3 blob_1.2.4 highr_0.11
#> [43] pan_1.9 parallel_4.4.1 broom_1.0.6
#> [46] prettyunits_1.2.0 R6_2.5.1 bslib_0.8.0
#> [49] stringi_1.8.4 boot_1.3-30 DNAcopy_1.78.0
#> [52] rpart_4.1.23 lmtest_0.9-40 jquerylib_0.1.4
#> [55] Rcpp_1.0.13 iterators_1.0.14 knitr_1.48
#> [58] zoo_1.8-12 IRanges_2.38.1 Matrix_1.7-0
#> [61] splines_4.4.1 nnet_7.3-19 tidyselect_1.2.1
#> [64] yaml_2.3.10 codetools_0.2-20 curl_5.2.2
#> [67] plyr_1.8.9 lattice_0.22-6 tibble_3.2.1
#> [70] withr_3.0.1 quantsmooth_1.70.0 KEGGREST_1.44.1
#> [73] evaluate_0.24.0 survival_3.6-4 BiocFileCache_2.12.0
#> [76] xml2_1.3.6 Biostrings_2.72.1 pillar_1.9.0
#> [79] filelock_1.0.3 mice_3.16.0 foreach_1.5.2
#> [82] stats4_4.4.1 generics_0.1.3 S4Vectors_0.42.1
#> [85] hms_1.1.3 munsell_0.5.1 scales_1.3.0
#> [88] minqa_1.2.8 GWASExactHW_1.2 glue_1.7.0
#> [91] tools_4.4.1 data.table_1.16.0 lme4_1.1-35.5
#> [94] SparseM_1.84-2 cowplot_1.1.3 grid_4.4.1
#> [97] tidyr_1.3.1 AnnotationDbi_1.66.0 colorspace_2.1-1
#> [100] nlme_3.1-164 formula.tools_1.7.1 GenomeInfoDbData_1.2.12
#> [103] cli_3.6.3 rappdirs_0.3.3 fansi_1.0.6
#> [106] dplyr_1.1.4 gtable_0.3.5 sass_0.4.9
#> [109] digest_0.6.37 operator.tools_1.6.3 ggrepel_0.9.6
#> [112] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
#> [115] lifecycle_1.0.4 httr_1.4.7 mitml_0.4-5
#> [118] bit64_4.0.5 MASS_7.3-60.2