Other Data Formats And Customizations

Thomas Charlon

2024-09-09

Other Data Formats And Customizations

SNPLinkage is modular enough to use directly dataframes of correlation matrices and chromosomic positions specified by the user, e.g. for visualizing RNASeq data. The user can compute a correlation matrix or any kind of pair-wise similarity matrix independently and then use SNPLinkage to build and arrange easily customizable ggplot2 objects.

Building the correlation matrix plot

The user can specify the correlations he wants to visualize as a dataframe to the ggplot_ld function. The column names must follow the following pattern: SNP_A and SNP_B for the two variables in relation, and R2 for the correlation value.

  library(snplinkage)
#> Loading required package: GWASTools
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.

  # example rnaseq data matrix, 20 variables of 20 patients
  m_rna = matrix(runif(20 ^ 2), nrow = 20)

  # pair-wise correlation matrix
  m_ld = cor(m_rna) ^ 2

  # keep only upper triangle and reshape to data frame
  m_ld[lower.tri(m_ld, diag = TRUE)] = NA
  df_ld = reshape2::melt(m_ld) |> na.omit()

  # rename for SNPLinkage
  names(df_ld) = c('SNP_A', 'SNP_B', 'R2')

  # visualize with ggplot_ld
  gg_ld = ggplot_ld(df_ld)
  gg_ld

Adding chromosomic positions

Similarly, the user can specify a dataframe to the ggplot_snp_pos function. The dataframe is assumed to be in the same order as the correlation dataframe, and the column name position is required.

  # let's imagine the 20 variables came from 3 physically close regions
  positions = c(runif(7, 31e6, 31.5e6), runif(6, 32e6, 32.5e6),
                runif(7, 33e6, 33.5e6)) |> sort()

  # build the dataframe
  df_snp_pos = data.frame(position = positions)

  # minimal call
  gg_snp_pos = ggplot_snp_pos(df_snp_pos)

Optionally, one can specify the labels_colname parameter to give the name of a column that will have the labels to display.

  df_snp_pos$label = c(rep('HLA-A', 7), rep('HLA-B', 6), rep('HLA-C', 7))
  gg_snp_pos = ggplot_snp_pos(df_snp_pos, labels_colname = 'label')

Arranging the plots together

We then arrange the plots with the gtable_ld_grobs function. One needs to specify in the labels_colname parameter if the chromosomic positions plot was built with labels or not. The title parameter is also required.

  l_ggs = list(snp_pos = gg_snp_pos, ld = gg_ld)
  gt_ld = gtable_ld_grobs(l_ggs, labels_colname = TRUE,
                          title = 'RNASeq correlations')
  grid::grid.draw(gt_ld)

Adding associations values

Finally we add the variables’ associations to our outcome of interest. The ggplot_associations uses as input a dataframe and accepts a parameter pvalue_colname to specify which column holds the association values, by default ‘pvalues’. It also requires a labels_colname parameter to specify the column holding the labels, and a column named chromosome. The linked_area parameter will affect how the associations are plotted and it is recommended to be used in combination with the diamonds parameter of ggplot_ld (i.e. TRUE for small number of variables, approximately less than 40).

Additionally, the n_labels parameter controls the number of highest association labels displayed (be default 10, the behavior can be disabled by setting labels_colname to NULL), and the nudge parameter will affect how the labels are displayed (passed to geom_label_repel function of ‘ggrepel’ package).

  # let's imagine the middle region, HLA-B, is more associated with the outcome
  pvalues = c(runif(7, 1e-3, 1e-2), runif(6, 1e-8, 1e-6), runif(7, 1e-3, 1e-2))
  log10_pvals = -log10(pvalues)

  # we can reuse the df_snp_pos object
  df_snp_pos$pvalues = log10_pvals
  
  # add the chromosome column
  df_snp_pos$chromosome = 6

  gg_assocs = ggplot_associations(df_snp_pos, labels_colname = 'label',
                                  linked_area = TRUE, nudge = c(0, 0.5),
                                  n_labels = 12)

We then arrange the plots with the gtable_ld_grobs function as previously. We need to call the ggplot_snp_pos function with the upper_subset parameter set to TRUE for it to connect to the upper graph.

  gg_pos_biplot = ggplot_snp_pos(df_snp_pos, labels_colname = 'label',
                                 upper_subset = TRUE)

  # let's also say the middle region HLA-B is particularly correlated
  df_ld$R2[df_ld$SNP_A %in% 8:13 & df_ld$SNP_B %in% 8:13] = runif(15, 0.7, 0.9)
  gg_ld = ggplot_ld(df_ld)

  l_ggs = list(pos = gg_pos_biplot, ld = gg_ld, pval = gg_assocs)
  gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)

We can extract a title and remove the horizontal axis text as follows.

  library(ggplot2)
  gg_assocs <- gg_assocs + theme(axis.text.x = element_blank())
  title <- gg_assocs$labels$x %>% gsub(' (Mbp)', '', ., fixed = TRUE) %>%
    paste('-', nrow(df_snp_pos), 'SNPs')
  gg_assocs <- gg_assocs + labs(title = title, x = NULL)
  
  l_ggs$pval = gg_assocs
  gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
  grid::grid.draw(gt_ld)

Customizing the ggplots

Let’s say we want to change the color of the associations area. We first need to identify which layer it corresponds to:

  gg_assocs$layers
#> [[1]]
#> geom_area: na.rm = FALSE, orientation = NA, outline.type = upper
#> stat_align: na.rm = FALSE, orientation = NA
#> position_stack 
#> 
#> [[2]]
#> mapping: xend = ~x 
#> geom_segment: arrow = NULL, arrow.fill = NULL, lineend = butt, linejoin = round, na.rm = FALSE
#> stat_identity: na.rm = FALSE
#> position_identity 
#> 
#> [[3]]
#> geom_label_repel: parse = FALSE, box.padding = 0.25, label.padding = 0.1, point.padding = 1e-06, label.r = 0.15, label.size = 0.1, min.segment.length = 0.5, arrow = NULL, na.rm = TRUE, force = 1, force_pull = 1, max.time = 0.5, max.iter = 10000, max.overlaps = 10, nudge_x = 0, nudge_y = 0.5, xlim = c(NA, NA), ylim = c(NA, NA), direction = both, seed = NA, verbose = FALSE
#> stat_identity: na.rm = TRUE
#> position_nudge_repel 
#> 
#> [[4]]
#> geom_point: na.rm = FALSE
#> stat_identity: na.rm = FALSE
#> position_identity

Then we can change the color with:

  gg_assocs$layers[[1]]$aes_params$fill = "#0147ab"

And rebuild our object.

  l_ggs$pval = gg_assocs
  gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)

To change the lines and labels colors, parameters in the functions are available. You can either specify a single value or a vector of same length as your number of features.

  gg_pos_biplot = ggplot_snp_pos(df_snp_pos, labels_colname = 'label',
                                 upper_subset = TRUE, colors = '#101d6b')

  gg_assocs = ggplot_associations(df_snp_pos, labels_colname = 'label',
                                  linked_area = TRUE, nudge = c(0, 0.5),
                                  n_labels = 12, colors = '#101d6b')

  # extract title
  gg_assocs <- gg_assocs + theme(axis.text.x = element_blank())
  title <- gg_assocs$labels$x %>% gsub(' (Mbp)', '', ., fixed = TRUE) %>%
    paste('-', nrow(df_snp_pos), 'SNPs')
  gg_assocs <- gg_assocs + labs(title = title, x = NULL)

  # replace area color
  gg_assocs$layers[[1]]$aes_params$fill = "#0147ab"

  # rebuild
  l_ggs = list(pos = gg_pos_biplot, ld = gg_ld, pval = gg_assocs)
  gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
  grid::grid.draw(gt_ld)

Example on Crohn dataset from gap package

In this dataset from the ‘gap’ package (Zhao, Kurt Hornik, and Ripley 2015), 206 SNPs from chromosome 5 (5q31) were measured from 129 Crohn’s disease patients and their 2 parents, totalling 387 samples.

  data('crohn')
  m_hla = crohn[, -(1:6)]
  m_ld = cor(m_hla) ^ 2

  # keep only upper triangle and reshape to data frame
  m_ld[lower.tri(m_ld, diag = TRUE)] = NA
  df_ld = reshape2::melt(m_ld) |> na.omit()

  # rename for SNPLinkage
  names(df_ld) = c('SNP_A', 'SNP_B', 'R2')

  # visualize with ggplot_ld
  gg_ld = ggplot_ld(df_ld)

Compute p-values

  mlog10_pvals = chisq_pvalues(m_hla, crohn[, 'crohn'])
  df_pos = data.frame(probe_id = colnames(m_hla), pvalues = mlog10_pvals,
                      chromosome = 5)

  # if we don't have positions we can use byindex = TRUE
  gg_assocs = ggplot_associations(df_pos, byindex = TRUE, nudge = c(0, 0.5))

Arrange with ‘cowplot’

  cowplot::plot_grid(gg_assocs, gg_ld, nrow = 2)

Focus on most associated

  df_top_assocs = subset(df_pos, pvalues > quantile(pvalues, 0.9))
  gg_assocs = ggplot_associations(df_top_assocs, linked_area = TRUE,
                                  nudge = c(0, 0.5))

  df_ld = subset(df_ld, SNP_A %in% df_top_assocs$probe_id &
                        SNP_B %in% df_top_assocs$probe_id)

  gg_ld = ggplot_ld(df_ld)
   
  cowplot::plot_grid(gg_assocs, gg_ld, nrow = 2)

References

Zhao, Jing Hua, colleagues with inputs from Kurt Hornik, and Brian Ripley. 2015. : Genetic Analysis Package. https://CRAN.R-project.org/package=gap.

Session information

  sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1       snplinkage_1.2.0    GWASTools_1.50.0   
#> [4] Biobase_2.64.0      BiocGenerics_0.50.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3               gdsfmt_1.40.0           httr2_1.0.1            
#>   [4] sandwich_3.1-0          biomaRt_2.60.1          rlang_1.1.4            
#>   [7] magrittr_2.0.3          compiler_4.4.1          RSQLite_2.3.7          
#>  [10] mgcv_1.9-1              reshape2_1.4.4          png_0.1-8              
#>  [13] vctrs_0.6.5             quantreg_5.98           stringr_1.5.1          
#>  [16] pkgconfig_2.0.3         shape_1.4.6.1           crayon_1.5.3           
#>  [19] SNPRelate_1.38.0        fastmap_1.2.0           backports_1.5.0        
#>  [22] dbplyr_2.5.0            XVector_0.44.0          labeling_0.4.3         
#>  [25] utf8_1.2.4              rmarkdown_2.28          UCSC.utils_1.0.0       
#>  [28] nloptr_2.1.1            MatrixModels_0.5-3      purrr_1.0.2            
#>  [31] bit_4.0.5               xfun_0.47               glmnet_4.1-8           
#>  [34] jomo_2.7-6              logistf_1.26.0          zlibbioc_1.50.0        
#>  [37] cachem_1.1.0            GenomeInfoDb_1.40.1     jsonlite_1.8.8         
#>  [40] progress_1.2.3          blob_1.2.4              highr_0.11             
#>  [43] pan_1.9                 parallel_4.4.1          broom_1.0.6            
#>  [46] prettyunits_1.2.0       R6_2.5.1                bslib_0.8.0            
#>  [49] stringi_1.8.4           boot_1.3-30             DNAcopy_1.78.0         
#>  [52] rpart_4.1.23            lmtest_0.9-40           jquerylib_0.1.4        
#>  [55] Rcpp_1.0.13             iterators_1.0.14        knitr_1.48             
#>  [58] zoo_1.8-12              IRanges_2.38.1          Matrix_1.7-0           
#>  [61] splines_4.4.1           nnet_7.3-19             tidyselect_1.2.1       
#>  [64] yaml_2.3.10             codetools_0.2-20        curl_5.2.2             
#>  [67] plyr_1.8.9              lattice_0.22-6          tibble_3.2.1           
#>  [70] withr_3.0.1             quantsmooth_1.70.0      KEGGREST_1.44.1        
#>  [73] evaluate_0.24.0         survival_3.6-4          BiocFileCache_2.12.0   
#>  [76] xml2_1.3.6              Biostrings_2.72.1       pillar_1.9.0           
#>  [79] filelock_1.0.3          mice_3.16.0             foreach_1.5.2          
#>  [82] stats4_4.4.1            generics_0.1.3          S4Vectors_0.42.1       
#>  [85] hms_1.1.3               munsell_0.5.1           scales_1.3.0           
#>  [88] minqa_1.2.8             GWASExactHW_1.2         glue_1.7.0             
#>  [91] tools_4.4.1             data.table_1.16.0       lme4_1.1-35.5          
#>  [94] SparseM_1.84-2          cowplot_1.1.3           grid_4.4.1             
#>  [97] tidyr_1.3.1             AnnotationDbi_1.66.0    colorspace_2.1-1       
#> [100] nlme_3.1-164            formula.tools_1.7.1     GenomeInfoDbData_1.2.12
#> [103] cli_3.6.3               rappdirs_0.3.3          fansi_1.0.6            
#> [106] dplyr_1.1.4             gtable_0.3.5            sass_0.4.9             
#> [109] digest_0.6.37           operator.tools_1.6.3    ggrepel_0.9.6          
#> [112] farver_2.1.2            memoise_2.0.1           htmltools_0.5.8.1      
#> [115] lifecycle_1.0.4         httr_1.4.7              mitml_0.4-5            
#> [118] bit64_4.0.5             MASS_7.3-60.2