Cell-free DNA (cfDNA) in blood has emerged as an ideal surrogate for tumor biopsy. It can be obtained noninvasively, and provides a comprehensive landscape of the heterogeneous genetic and epigenetic alterations in tumors. The cfTools package provides methods for cfDNA methylation data analysis to facilitate cfDNA-based tumor-derived cfDNA and estimate the tumor-derived cfDNA fraction composition and the cfDNA fraction of multiple tissue types for a plasma cfDNA sample. studies, including (1) cancer detection: sensitively detect (tumor burden); (2) tissue deconvolution: infer the tissue type
cfTools 1.0.0
Given the methylation sequencing data of a cell-free DNA (cfDNA) sample, for each cancer marker or tissue marker, we deconvolve the tumor-derived or tissue-specific reads from all reads falling in the marker region. Our read-based deconvolution algorithm exploits the pervasiveness of DNA methylation for signal enhancement, therefore can sensitively identify a trace amount of tumor-specific or tissue-specific cfDNA in plasma.
Specifically, cfTools can deconvolve different sources of cfDNA fragments (or reads) in two contexts:
Cancer detection [1]: separate cfDNA fragments into tumor-derived fragments and background normal fragments (2 classes), and estimate the tumor-derived cfDNA fraction \(\theta\) (\(0\leq \theta < 1\)).
Tissue deconvolution [2]: separate cfDNA fragments from different tissues (> 2 classes), and estimate the cfDNA fraction \(\theta_t\) (\(0\leq \theta_t < 1\)) of different tissue types (including an unknown type) \(t\) for a plasma cfDNA sample.
These functions can serve as foundations for more advanced cfDNA-based studies, including cancer diagnosis and disease monitoring.
cfTools is an R
package available via the
Bioconductor repository for packages.
You can install the release version by using the following commands
in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("cfTools")
Alternatively, you can install the development version from GitHub :
BiocManager::install("jasminezhoulab/cfTools")
The two main input files for CancerDetector()
and cfDeconvolve()
are
Input 1: fragment-level methylation states (methState), which are represented by a sequence of binary values (0 represents unmethylated CpG and 1 represents methylated CpG on the same fragment);
Input 2: methylation pattern (paired shape parameters of beta distributions) of markers.
Section 3.1, 3.2, 3.3 provide an example for generating Input 1. We require users to provide pre-processed paired-end bisulfite sequencing reads (i.e., aligned to the reference genome). For each cfDNA sample, users need to prepare (1) the standard (sorted) BED file of the aligned reads and (2) the methylation information that bismark extracts from the aligned reads as input data files. Specifically,
MergePEReads()
MergeCpGs()
CpG_OT*
file and a CpG_OB*
file generated by
bismark methylation extractorGenerateFragMeth()
MergePEReads()
and MergeCpGs()
Section 3.4 provides an example for generating Input 2. Specifically,
GenerateMarkerParam()
Example input data files are included within the package:
library(cfTools)
library(utils)
demo.dir <- system.file("data", package="cfTools")
Function MergePEReads()
generates fragment-level information for
paired-end sequencing reads. The main input file is a standard (sorted)
BED file (e.g. output of bedtools bamtobed) of paired-end sequencing
reads containing columns of chromosome name, chromosome start,
chromosome end, sequence ID, mapping quality score, and strand.
PEReads <- file.path(demo.dir, "demo.sorted.bed.txt.gz")
head(read.table(PEReads, header=FALSE), 2)
#> V1 V2 V3
#> 1 chr21 9417768 9417888
#> 2 chr21 9418101 9418223
#> V4
#> 1 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT/2
#> 2 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT/1
#> V5 V6
#> 1 0 +
#> 2 0 -
The output is a list in BED file format and/or written to an output BED file, where each line contains the information of a cfDNA fragment.
fragInfo <- MergePEReads(PEReads)
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'create' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/cfTools/1.0.0/my_env' 'python=3.7.0' '--quiet' '-c' 'conda-forge'
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'install' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/cfTools/1.0.0/my_env' 'python=3.7.0'
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'install' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/cfTools/1.0.0/my_env' '-c' 'conda-forge' 'python=3.7.0' 'python=3.7.0' 'numpy=1.16' 'scipy=1.5.3'
head(fragInfo, 2)
#> chr start end fragmentLength strand
#> 1 chr21 9439599 9439741 142 -
#> 2 chr21 9483455 9483622 167 -
#> name
#> 1 GWNJ-0965:805:GW2108234048th:1:1101:11972:44486_1:N:0:ATTCCT:R1:TTACGGCG:R2:CAGTGGAG:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1101:20933:67955_1:N:0:ATTCCT:R1:GGGACTTT:R2:CGCGAGTA:F1:TGACT:F2:TGACT
Function MergeCpGs()
generates fragment-level methylation states of CpGs.
The main inputs of it are two output files of bismark methylation extractor,
which is a program performing methylation calling on bisulfite treated
sequencing reads. The CpG_OT*
file contains methylation information for
CpGs on the original top strand (OT); the CpG_OB*
file contains methylation
information for CpGs on the original bottom strand (OB). Both files contain
columns of sequence ID, methylation state, chromosome name, chromosome start,
methylation call.
CpG_OT <- file.path(demo.dir, "CpG_OT_demo.txt.gz")
CpG_OB <- file.path(demo.dir, "CpG_OB_demo.txt.gz")
head(read.table(CpG_OT, header=FALSE), 2)
#> V1
#> 1 GWNJ-0965:805:GW2108234048th:2:2108:26250:45136_1:N:0:ATTCCT:R1:CTTTCCAA:R2:GCTTCGAG:F1:TGACT:F2:AGACG
#> 2 GWNJ-0965:805:GW2108234048th:2:2108:26250:45136_1:N:0:ATTCCT:R1:CTTTCCAA:R2:GCTTCGAG:F1:TGACT:F2:AGACG
#> V2 V3 V4 V5
#> 1 + chr21 9438974 Z
#> 2 - chr21 9438990 z
head(read.table(CpG_OB, header=FALSE), 2)
#> V1
#> 1 GWNJ-0965:805:GW2108234048th:2:1114:27600:18502_1:N:0:ATTCCT:R1:GAAAATGA:R2:AAGATGCC:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:2:1114:27600:18502_1:N:0:ATTCCT:R1:GAAAATGA:R2:AAGATGCC:F1:TGACT:F2:TGACT
#> V2 V3 V4 V5
#> 1 + chr21 9498035 Z
#> 2 + chr21 9497961 Z
The output is a list in BED file format and/or written to an output BED file, where each line contains methylation states of all CpGs on the same fragment. Column methState is a sequence of binary values indicating the methylation states of all CpGs on the same fragment (0 represents unmethylated CpG and 1 represents methylated CpG).
methInfo <- MergeCpGs(CpG_OT, CpG_OB)
head(methInfo, 2)
#> chr cpgStart cpgEnd strand cpgNumber
#> 1 chr21 9439600 9439741 - 7
#> 2 chr21 9483487 9483622 - 16
#> cpgPosition
#> 1 9439600,9439636,9439678,9439685,9439717,9439725,9439741
#> 2 9483487,9483489,9483516,9483519,9483522,9483524,9483536,9483539,9483541,9483574,9483577,9483580,9483582,9483600,9483603,9483622
#> methState
#> 1 0111010
#> 2 0111011010011010
#> name
#> 1 GWNJ-0965:805:GW2108234048th:1:1101:11972:44486_1:N:0:ATTCCT:R1:TTACGGCG:R2:CAGTGGAG:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1101:20933:67955_1:N:0:ATTCCT:R1:GGGACTTT:R2:CGCGAGTA:F1:TGACT:F2:TGACT
Function GenerateFragMeth()
combines the output lists of
MergePEReads()
and MergeCpGs()
into one list,
which contains both the fragment information and the methylation
states of all CpGs on each fragment.
fragMeth <- GenerateFragMeth(fragInfo, methInfo)
head(fragMeth, 2)
#> chr start end
#> 1 chr21 9417768 9418223
#> 2 chr21 9431278 9431614
#> name
#> 1 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1106:4411:3542_1:N:0:ATTCCT:R1:CATGTGCC:R2:AGCACTAA:F1:TGACT:F2:ACACT
#> fragmentLength strand cpgNumber cpgPosition
#> 1 455 - 2 9418124,9418171
#> 2 336 + 5 9431279,9431386,9431537,9431554,9431596
#> methState
#> 1 01
#> 2 00111
Function GenerateMarkerParam()
calculates paired shape parameters
of beta distributions for each marker. There are three main inputs
to this function: (1) a list of methylation levels (e.g.,
beta values), where each row is a sample and each column is a marker;
(2) a vector of sample types (e.g., tumor or normal, tissue types)
corresponding to the rows of the list; (3) a vector of marker
names corresponding to the columns of the list.
methLevel <- read.table(file.path(demo.dir, "beta_matrix.txt.gz"),
row.names=1, header = TRUE)
sampleTypes <- read.table(file.path(demo.dir, "sample_type.txt.gz"),
row.names=1, header = TRUE)$sampleType
markerNames <- read.table(file.path(demo.dir, "marker_index.txt.gz"),
row.names=1, header = TRUE)$markerIndex
print(methLevel)
#> marker1 marker2 marker3
#> sample1 0.8967302 0.9444779 0.9274943
#> sample2 0.6807593 0.9391826 0.8329514
#> sample3 0.8407675 0.9312582 0.8824475
#> sample4 0.8614165 0.9571972 0.8170904
#> sample5 0.8685513 0.9138511 0.8993120
#> sample6 0.5926250 0.9626752 0.8629679
#> sample7 0.8382709 0.9145921 0.2179631
#> sample8 0.8554025 0.9510139 0.8368095
#> sample9 0.8286787 0.9376723 0.8830746
#> sample10 0.8362881 0.9286860 0.8326099
#> sample11 0.8782031 0.9801491 0.9095433
#> sample12 0.6815395 0.9093327 0.8749625
#> sample13 0.8751985 0.9527680 0.8362279
#> sample14 0.7974884 0.9184259 0.9163573
#> sample15 0.8785107 0.8995657 0.8645149
#> sample16 0.8159900 0.9719771 0.8779748
#> sample17 0.4364249 0.8534525 0.3489145
#> sample18 0.8701227 0.9525218 0.8704187
#> sample19 0.8737151 0.9037688 0.8957254
#> sample20 0.8572883 0.9467923 0.8380768
print(sampleTypes)
#> [1] "normal" "tumor" "tumor" "normal" "normal" "tumor" "tumor" "normal"
#> [9] "normal" "tumor" "normal" "tumor" "normal" "tumor" "tumor" "normal"
#> [17] "tumor" "normal" "tumor" "normal"
print(markerNames)
#> [1] 1 2 3
The output is a list containing the paired shape parameters of
beta distributions for markers, which are delimited by :
. Users can
save this list into a file with column names, no row names, and
columns are delimited by TAB for later use.
markerParam <- GenerateMarkerParam(methLevel, sampleTypes, markerNames)
print(markerParam)
#> markerName normal tumor
#> 1 1 183.74:29.723 5.955:2.032
#> 2 2 134.584:6.958 83.653:7.662
#> 3 3 72.949:10.939 1.476:0.484
To make the computation more efficient, users may only keep the fragments
that overlap with the genomic regions of markers. Here, we provide an
example of using R
package GenomicRanges to perform the
intersection.
First, transform the two BED files into GRanges classes.
library(GenomicRanges)
# a BED file of fragment-level methylation information
frag_bed <- read.csv(file.path(demo.dir, "demo.fragment_level.meth.bed.txt.gz"),
header=TRUE, sep="\t")
frag_meth.gr <- GRanges(seqnames=frag_bed$chr,
ranges=IRanges(frag_bed$start, frag_bed$end),
strand=frag_bed$strand,
methState=as.character(frag_bed$methState))
# a BED file of genomic regions of markers
markers_bed <- read.csv(file.path(demo.dir, "markers.bed.txt.gz"),
header=TRUE, sep="\t")
markers.gr <- GRanges(seqnames=markers_bed$chr,
ranges=IRanges(markers_bed$start, markers_bed$end),
markerName=markers_bed$markerName)
head(frag_meth.gr, 2)
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | methState
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr21 9417768-9418223 - | 1
#> [2] chr21 9431278-9431614 + | 111
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(markers.gr, 2)
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | markerName
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] chr21 9418124-9418171 * | 1
#> [2] chr21 9437462-9437533 * | 2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Then, overlap two GRanges classes and get the fragment-level methylation states intersecting with the markers.
ranges <- subsetByOverlaps(frag_meth.gr, markers.gr, ignore.strand=TRUE)
hits <- findOverlaps(frag_meth.gr, markers.gr,ignore.strand=TRUE)
idx <- subjectHits(hits)
values <- DataFrame(markerName=markers.gr$markerName[idx])
mcols(ranges) <- c(mcols(ranges), values)
marker.methState <- as.data.frame(cbind(ranges$markerName,
ranges$methState))
colnames(marker.methState) <- c("markerName", "methState")
head(marker.methState, 4)
#> markerName methState
#> 1 1 1
#> 2 2 10111101110
#> 3 2 101100011010
#> 4 2 1011000111001
Function CancerDetector()
separates cfDNA into tumor-derived fragments
and background normal fragments and estimates the tumor burden. The main
inputs are two files: (1) the fragment-level methylation states of reads
(column methState
) that mapped to the cancer-specific markers;
(2) paired shape parameters of beta distributions for cancer-specific markers.
All columns are delimited by TAB, and the first line is the column names.
fragMethFile <- file.path(demo.dir, "CancerDetector.reads.txt.gz")
markerParamFile <- file.path(demo.dir, "CancerDetector.markers.txt.gz")
head(read.csv(fragMethFile, sep = "\t", colClasses = "character"), 4)
#> markerName methState
#> 1 2 000
#> 2 2 000
#> 3 2 000
#> 4 2 111111
head(read.csv(markerParamFile, sep = "\t"), 4)
#> markerName tumor normalPlasma
#> 1 2 5:3 97:12
#> 2 3 3:2 57:9
#> 3 4 4:4 24:3
#> 4 5 3:3 32:5
The output is the estimated tumor burden \(\theta\) and the normal cfDNA fraction \(1-\theta\).
CancerDetector(fragMethFile, markerParamFile)
#> cfDNA_tumor_burden normal_cfDNA_fraction
#> 1 0.055 0.945
Function cfDeconvolve()
estimates fractions of cfDNA fragments from
different tissues (> 2 classes). The main two input files are similar
to function CancerDetector()
: (1) the fragment-level methylation states
of reads (column methState
) that mapped to the tissue-specific markers;
(2) paired shape parameters of beta distributions for tissue-specific
markers. All columns are delimited by TAB, and there is a header line of
the column names.
fragMethFile2 <- file.path(demo.dir, "cfDeconvolve.reads.txt.gz")
markerParamFile2 <- file.path(demo.dir, "cfDeconvolve.markers.txt.gz")
head(read.csv(fragMethFile2, header=TRUE, sep="\t",
colClasses = "character"), 4)
#> markerName methState
#> 1 2 1111100
#> 2 2 0100110
#> 3 2 0111111
#> 4 2 1111111
head(read.csv(markerParamFile2, header=TRUE, sep="\t",
colClasses = "character"), 4)
#> markerName tissue1 tissue2 tissue3 tissue4 tissue5 tissue6
#> 1 2 1.68:1.82 2:18 1.02:0.459 2.08:18 0.061 0.176
#> 2 27 0.605 0.47 0.667 29.1:12.1 0.0528 0.542
#> 3 61 12.8:2.68 9.35:1.75 19.4:4.06 6.28:7.06 8.44:20.4 21.1:3.48
#> 4 63 4.71:2.49 13.5:2.17 55.3:8.53 0.906 31.3:3.8 5.46:3.17
#> tissue7
#> 1 6.65:2.19
#> 2 7.31:3.47
#> 3 6.69:4.01
#> 4 15:3.3
Other input parameters are:
em.global.unknown
(default), em.global.known
, em.local.unknown
,
em.local.known
.For example:
numTissues <- 7
emAlgorithmType <- "em.global.unknown"
likelihoodRatioThreshold <- 2
emMaxIterations <- 100
The output is a list containing the cfDNA fractions of different tissue types and an unknown class.
tissueFraction <- cfDeconvolve(fragMethFile2, markerParamFile2, numTissues,
emAlgorithmType, likelihoodRatioThreshold,
emMaxIterations)
tissueFraction
#> tissue1 tissue2 tissue3 tissue4 tissue5 tissue6 tissue7
#> 1 1.58246e-13 1.35298e-19 0.0442296 1.45249e-21 0.0157226 4.00821e-18 0.930494
#> unknown
#> 1 0.00955414
[1] Li W, Li Q, Kang S, Same M, Zhou Y, Sun C, Liu CC, Matsuoka L, Sher L, Wong WH, Alber F, Zhou XJ. CancerDetector: ultrasensitive and non-invasive cancer detection at the resolution of individual reads using cell-free DNA methylation sequencing data. Nucleic Acids Res. 2018 Sep 6;46(15):e89. doi: 10.1093/nar/gky423. PMID: 29897492; PMCID: PMC6125664.
[2] Del Vecchio G, Li Q, Li W, Thamotharan S, Tosevska A, Morselli M, Sung K, Janzen C, Zhou X, Pellegrini M, Devaskar SU. Cell-free DNA methylation and transcriptomic signature prediction of pregnancies with adverse outcomes. Epigenetics. 2021 Jun;16(6):642-661. doi: 10.1080/15592294.2020.1816774. PMID: 33045922; PMCID: PMC8143248.
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 IRanges_2.34.0
#> [4] S4Vectors_0.38.0 BiocGenerics_0.46.0 cfTools_1.0.0
#> [7] BiocStyle_2.28.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.5-4 jsonlite_1.8.4 compiler_4.3.0
#> [4] BiocManager_1.30.20 filelock_1.0.2 Rcpp_1.0.10
#> [7] bitops_1.0-7 parallel_4.3.0 jquerylib_0.1.4
#> [10] png_0.1-8 yaml_2.3.7 fastmap_1.1.1
#> [13] here_1.0.1 lattice_0.21-8 reticulate_1.28
#> [16] R6_2.5.1 XVector_0.40.0 knitr_1.42
#> [19] bookdown_0.33 rprojroot_2.0.3 GenomeInfoDbData_1.2.10
#> [22] bslib_0.4.2 R.utils_2.12.2 rlang_1.1.0
#> [25] cachem_1.0.7 dir.expiry_1.8.0 xfun_0.39
#> [28] sass_0.4.5 cli_3.6.1 withr_2.5.0
#> [31] zlibbioc_1.46.0 grid_4.3.0 digest_0.6.31
#> [34] basilisk_1.12.0 R.methodsS3_1.8.2 R.oo_1.25.0
#> [37] evaluate_0.20 RCurl_1.98-1.12 rmarkdown_2.21
#> [40] basilisk.utils_1.12.0 tools_4.3.0 htmltools_0.5.5