Updated: Nov-22-2024
The MotifPeeker package facilitates the comparison and validation of datasets from epigenomic profiling methods, using motif enrichment as the key benchmark. The package generates a comprehensive summary report with results from various downstream analyses by processing peak, alignment, and motif files. This allows for detailed statistical analysis of multiple epigenomic datasets without any coding, ensuring both accessibility and robustness.
The rapidly advancing field of epigenomics has led to the development of various techniques for profiling protein interactions with DNA, enhancing our understanding of gene regulatory mechanisms and genetic factors behind complex diseases. However, the validation of these newer methods, such as CUT&RUN, CUT&TAG and TIP-Seq, remains a critical area that requires further exploration, especially given their potential to address the challenges of traditional ChIP-Seq.
Common epigenomic profiling techniques rely on target proteins, such as the transcriptional regulator CTCF, binding to their respective sites on the DNA to isolate the sequences for sequencing. These binding sites may contain specific sequences recognised by the transcription factors, called motifs. Unlike other comparison tools like ChIPseeker and EpiCompare, MotifPeeker checks for the presence of these motifs in the sequences enriched from epigenomic profiling methods as a novel strategy to benchmark them.
At the same time, general metrics like FRiP scores and peak width distributions are also reported to add more context to the comparisons. While the goal remains to benchmark different epigenomic datasets, MotifPeeker can also be used to compare the effects of various downstream processing, such as the thresholds for peak calling and the choice of the peak caller itself. The package can also help identify differences arising from different experimental conditions or protocol optimisations.
MotifPeeker comes with a small subset of two epigenomic datasets targeting CTCF in HCT116 cells, generated using ChIP-Seq and TIP-Seq.
CTCF_ChIP_alignment.bam
)
sourced from the ENCODE project (Accession:
ENCFF091ODJ).CTCF_TIP_alignment.bam
) was
manually processed using the nf-core/cutandrun
pipeline. The raw read files were sourced from NIH Sequence Read
Archives (ID:
SRR16963166).The alignment files were processed using the MACS3 peak
caller to produce their respective peak files with the
q-value
parameter set to 0.01.
Two motif files for CTCF are also bundled with the package:
Please note that the peaks and alignments included are a very small subset (chr10:65,654,529-74,841,155) of the actual data. It only serves as an example to demonstrate the package and run tests to maintain the integrity of the package.
MotifPeeker
uses memes
which relies on a local install of the MEME suite, which can be
installed as follows:
MEME_VERSION=5.5.5 # or the latest version
wget https://meme-suite.org/meme/meme-software/$MEME_VERSION/meme-$MEME_VERSION.tar.gz
tar zxf meme-$MEME_VERSION.tar.gz
cd meme-$MEME_VERSION
./configure --prefix=$HOME/meme --with-url=http://meme-suite.org/ \
--enable-build-libxml2 --enable-build-libxslt
make
make install
# Add to PATH
echo 'export PATH=$HOME/meme/bin:$HOME/meme/libexec/meme-$MEME_VERSION:$PATH' >> ~/.bashrc
echo 'export MEME_BIN=$HOME/meme/bin' >> ~/.bashrc
source ~/.bashrc
NOTE: It is important that Perl dependencies
associated with MEME suite are also installed, particularly
XML::Parser
, which can be installed using the following
command in the terminal:
cpan install XML::Parser
For more information, refer to the Perl dependency section of the MEME suite.
Once the MEME suite and its associated Perl dependencies are
installed, install and load MotifPeeker
:
library(MotifPeeker)
Alternatively, you can use the Docker/Singularity container to run the package out-of-the-box.
In this example, we will compare the bundled ChIP-Seq dataset against the TIP-Seq dataset.
Once installed, load the package using:
library(MotifPeeker)
## Peak files processed using read_peak_file()
data("CTCF_ChIP_peaks", package = "MotifPeeker")
data("CTCF_TIP_peaks", package = "MotifPeeker")
## Motif files processed using read_motif_file()
data("motif_MA1102.3", package = "MotifPeeker")
data("motif_MA1930.2", package = "MotifPeeker")
MotifPeeker accepts lists of both GRanges
objects produced by read_peak_file()
, or paths to the
MACS2/3 .narrowPeak
files or SEACR
.bed
files, or ENCODE file IDs to automatically download
the respective files.
## MACS2/3 peak files
peak_files <- list("/path/to/peak1.narrowPeak", "/path/to/peak2.narrowPeak")
## or SEACR peak files
peak_files <- list("/path/to/peak1.bed", "/path/to/peak2.bed")
In this example, we will use the bundled GRanges
peaks:
peak_files <- list(CTCF_ChIP_peaks, CTCF_TIP_peaks)
Optionally provide a list of path to .bam
alignment
files, or ENCODE file IDs to generate additional comparisons like FRiP
scores.
In this example, we will use the built-in alignment files.
## Alignment files
CTCF_ChIP_alignment <- system.file("extdata", "CTCF_ChIP_alignment.bam",
package = "MotifPeeker")
CTCF_TIP_alignment <- system.file("extdata", "CTCF_TIP_alignment.bam",
package = "MotifPeeker")
alignment_files <- list(CTCF_ChIP_alignment, CTCF_TIP_alignment)
MotifPeeker accepts a list of either
universalmotif
objects, or paths to the
.jaspar
files.
## JASPAR motif files
motif_files <- list("/path/to/motif1.jaspar", "/path/to/motif2.jaspar")
If you use JASPAR motif files, it is recommended that you label them
by using the motif_labels
parameter of the
MotifPeeker()
function.
In this example, we will use the bundled universalmotif
motifs:
motif_files <- list(motif_MA1102.3, motif_MA1930.2)
The report can be generated by using the main function
MotifPeeker()
. For more run customisations, refer to the
next sections.
if (MotifPeeker:::confirm_meme_install(continue = TRUE)) {
MotifPeeker(
peak_files = peak_files,
reference_index = 2, # Set TIP-seq experiment as reference
alignment_files = alignment_files,
exp_labels = c("ChIP", "TIP"),
exp_type = c("chipseq", "tipseq"),
genome_build = "hg38", # Use hg38 genome build
motif_files = motif_files,
cell_counts = NULL, # No cell-count information
motif_discovery = TRUE,
motif_discovery_count = 3, # Discover top 3 motifs
motif_db = NULL, # Use default motif database (JASPAR)
download_buttons = TRUE,
out_dir = tempdir(), # Save output in a temporary directory
BPPARAM = BiocParallel::SerialParam(), # Use two CPU cores on a 16GB RAM machine
debug = FALSE,
quiet = TRUE,
verbose = TRUE
)
}
## Cannot find MEME suite installation. If installed, try setting the path 'MEME_BIN' environment varaible, or use the 'meme_path' parameter in the MotifPeeker function call.
## For more information, see the memes pacakge documention-
## https://github.com/snystrom/memes#detecting-the-meme-suite
These input parameters must be provided:
peak_files
: A list of path to peak files or
GRanges
objects with the peaks to analyse. Currently, only
peak files from MACS2/3
(.narrowPeak
) and
SEACR
(.bed
) are supported. ENCODE file IDs
can also be provided to automatically fetch peak file(s) from the ENCODE
database.reference_index
: An integer specifying the index of the
reference dataset in the peak_files
list to use as
reference for various comparisons. (default = 1)genome_build
: A character string or a
BSgenome
object specifying the genome build of the
datasets. At the moment, only hg38 and hg19 are supported as abbreviated
input.out_dir
: A character string specifying the output
directory to save the HTML report and other files.These input parameters optional, but recommended to add more analyses, or enhance them:
alignment_files
: A list of path to alignment files or
Rsamtools::BamFile
objects with the alignment sequences to
analyse. Alignment files are used to calculate read-related metrics like
FRiP score. ENCODE file IDs can also be provided to automatically fetch
alignment file(s) from the ENCODE database.exp_labels
: A character vector of labels for each peak
file. If not provided, capital letters will be used as labels in the
report.exp_type
: A character vector of experimental types for
each peak file.exp_type
is used only for labelling. It does not affect the
analyses. You can also input custom strings. Datasets will be grouped as
long as they match their respective exp_type
. Supported
experimental types are:chipseq
: ChIP-seq datatipseq
: TIP-seq datacuttag
: CUT&Tag datacutrun
: CUT&Run datamotif_files
: A character vector of path to motif files,
or a vector of universalmotif-class
objects. Required to
run Known Motif Enrichment Analysis. JASPAR matrix IDs can also be
provided to automatically fetch motifs from the JASPAR.motif_labels
: A character vector of labels for each
motif file. Only used if path to file names are passed in motif_files.
If not provided, the motif file names will be used as labels.cell_counts
: An integer vector of experiment cell
counts for each peak file (if available). Creates additional comparisons
based on cell counts.motif_db
: Path to .meme
format file to use
as reference database, or a list of universalmotif-class
objects. Results from motif discovery are searched against this database
to find similar motifs. If not provided, JASPAR CORE database will be
used, making this parameter truly optional.
NOTE: p-value estimates are inaccurate when the
database has fewer than 50 entries.For more information on additional parameters, please refer to the
documentation for MotifPeeker()
.
For 4 datasets, the runtime is approximately 3 minutes with motif_discovery disabled. However, motif discovery can take hours to complete.
To make computation faster, we highly recommend tuning the following arguments:
BPPARAM = Multicore(x)
: Running motif discovery in
parallel can significantly reduce runtime, but it is very
memory-intensive, consuming upwards of 10GB of RAM per thread. Memory
starvation can greatly slow the process, so set workers (x) with
caution.motif_discovery_count
: The number of motifs to discover
per sequence group exponentially increases runtime. We recommend no more
than 5 motifs to make a meaningful inference.trim_seq_width
: Trimming sequences before running motif
discovery can significantly reduce the search space. Sequence length can
exponentially increase runtime. We recommend running the script with
motif_discovery = FALSE
and studying the motif-summit
distance distribution under general metrics to find the sequence length
that captures most motifs. A good starting point is 150 but it can be
reduced further if appropriate.MotifPeeker
generates its output in a new folder within
he out_dir
directory. The folder is named
MotifPeeker_YYYYMMDD_HHMMSS
and contains the following
files:
MotifPeeker.html
: The main HTML report, including all
analyses and plots.save_runfiles
is set to
TRUE
.If something does not work as expected, refer to troubleshooting.
trim_peak_width
to reduce
motif discovery runtime.utils::sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MotifPeeker_0.99.11
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9
## [3] gridExtra_2.3 testthat_3.2.1.1
## [5] rlang_1.1.4 magrittr_2.0.3
## [7] matrixStats_1.4.1 compiler_4.5.0
## [9] RSQLite_2.3.8 vctrs_0.6.5
## [11] pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 dbplyr_2.5.0
## [15] XVector_0.47.0 memes_1.15.0
## [17] ca_0.71.1 utf8_1.2.4
## [19] Rsamtools_2.23.0 rmarkdown_2.29
## [21] tzdb_0.4.0 UCSC.utils_1.3.0
## [23] waldo_0.6.1 purrr_1.0.2
## [25] bit_4.5.0 xfun_0.49
## [27] zlibbioc_1.53.0 ggseqlogo_0.2
## [29] cachem_1.1.0 GenomeInfoDb_1.43.1
## [31] jsonlite_1.8.9 blob_1.2.4
## [33] DelayedArray_0.33.2 BiocParallel_1.41.0
## [35] parallel_4.5.0 R6_2.5.1
## [37] bslib_0.8.0 RColorBrewer_1.1-3
## [39] rtracklayer_1.67.0 pkgload_1.4.0
## [41] brio_1.1.5 GenomicRanges_1.59.1
## [43] jquerylib_0.1.4 Rcpp_1.0.13-1
## [45] assertthat_0.2.1 SummarizedExperiment_1.37.0
## [47] iterators_1.0.14 knitr_1.49
## [49] R.utils_2.12.3 readr_2.1.5
## [51] IRanges_2.41.1 Matrix_1.7-1
## [53] tidyselect_1.2.1 abind_1.4-8
## [55] yaml_2.3.10 viridis_0.6.5
## [57] TSP_1.2-4 codetools_0.2-20
## [59] curl_6.0.1 lattice_0.22-6
## [61] tibble_3.2.1 Biobase_2.67.0
## [63] evaluate_1.0.1 desc_1.4.3
## [65] heatmaply_1.5.0 BiocFileCache_2.15.0
## [67] universalmotif_1.25.1 Biostrings_2.75.1
## [69] pillar_1.9.0 filelock_1.0.3
## [71] MatrixGenerics_1.19.0 DT_0.33
## [73] foreach_1.5.2 stats4_4.5.0
## [75] plotly_4.10.4 generics_0.1.3
## [77] rprojroot_2.0.4 RCurl_1.98-1.16
## [79] S4Vectors_0.45.2 hms_1.1.3
## [81] ggplot2_3.5.1 munsell_0.5.1
## [83] scales_1.3.0 glue_1.8.0
## [85] lazyeval_0.2.2 tools_4.5.0
## [87] dendextend_1.19.0 BiocIO_1.17.1
## [89] data.table_1.16.2 BSgenome_1.75.0
## [91] webshot_0.5.5 GenomicAlignments_1.43.0
## [93] registry_0.5-1 XML_3.99-0.17
## [95] grid_4.5.0 tidyr_1.3.1
## [97] seriation_1.5.6 cmdfun_1.0.2
## [99] colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [101] restfulr_0.0.15 cli_3.6.3
## [103] fansi_1.0.6 S4Arrays_1.7.1
## [105] viridisLite_0.4.2 dplyr_1.1.4
## [107] gtable_0.3.6 R.methodsS3_1.8.2
## [109] sass_0.4.9 digest_0.6.37
## [111] BiocGenerics_0.53.3 SparseArray_1.7.2
## [113] rjson_0.2.23 htmlwidgets_1.6.4
## [115] R.oo_1.27.0 memoise_2.0.1
## [117] htmltools_0.5.8.1 lifecycle_1.0.4
## [119] httr_1.4.7 MASS_7.3-61
## [121] bit64_4.5.2