1 Introduction

Given the methylation sequencing data of a cell-free DNA (cfDNA) sample, for each cancer marker or tissue marker, we deconvolve the tumor-derived or tissue-specific reads from all reads falling in the marker region. Our read-based deconvolution algorithm exploits the pervasiveness of DNA methylation for signal enhancement, therefore can sensitively identify a trace amount of tumor-specific or tissue-specific cfDNA in plasma.

Specifically, cfTools can deconvolve different sources of cfDNA fragments (or reads) in two contexts:

  1. Cancer detection [1]: separate cfDNA fragments into tumor-derived fragments and background normal fragments (2 classes), and estimate the tumor-derived cfDNA fraction \(\theta\) (\(0\leq \theta < 1\)).

  2. Tissue deconvolution [2]: separate cfDNA fragments from different tissues (> 2 classes), and estimate the cfDNA fraction \(\theta_t\) (\(0\leq \theta_t < 1\)) of different tissue types (including an unknown type) \(t\) for a plasma cfDNA sample.

These functions can serve as foundations for more advanced cfDNA-based studies, including cancer diagnosis and disease monitoring.

2 Installation

cfTools is an R package available via the Bioconductor repository for packages. You can install the release version by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("cfTools")

Alternatively, you can install the development version from GitHub :

BiocManager::install("jasminezhoulab/cfTools")

3 Input data preparation

The two main input files for CancerDetector() and cfDeconvolve() are

  • Input 1: fragment-level methylation states (methState), which are represented by a sequence of binary values (0 represents unmethylated CpG and 1 represents methylated CpG on the same fragment);

  • Input 2: methylation pattern (paired shape parameters of beta distributions) of markers.

Section 3.1, 3.2, 3.3 provide an example for generating Input 1. We require users to provide pre-processed paired-end bisulfite sequencing reads (i.e., aligned to the reference genome). For each cfDNA sample, users need to prepare (1) the standard (sorted) BED file of the aligned reads and (2) the methylation information that bismark extracts from the aligned reads as input data files. Specifically,

  • Section 3.1 MergePEReads()
    • Input: a standard (sorted) BED file of paired-end sequencing reads
    • Output: fragment-level information of cfDNA reads
  • Section 3.2 MergeCpGs()
    • Input: a CpG_OT* file and a CpG_OB* file generated by bismark methylation extractor
    • Output: methylation information of all CpGs on the same fragment
  • Section 3.3 GenerateFragMeth()
    • Input: the output lists of MergePEReads() and MergeCpGs()
    • Output: fragment-level information about methylation states

Section 3.4 provides an example for generating Input 2. Specifically,

  • Section 3.4 GenerateMarkerParam()
    • Input: a list of methylation levels (e.g., beta values) for markers, a vector of sample types (e.g., tumor or normal, tissue types) corresponding to the rows of the list, a vector of marker names corresponding to the columns of the list
    • Output: a list containing the paired shape parameters of beta distributions for markers

Example input data files are included within the package:

library(cfTools)
library(utils)
demo.dir <- system.file("data", package="cfTools")

3.1 Merge paired-end sequencing reads to fragment-level

Function MergePEReads() generates fragment-level information for paired-end sequencing reads. The main input file is a standard (sorted) BED file (e.g. output of bedtools bamtobed) of paired-end sequencing reads containing columns of chromosome name, chromosome start, chromosome end, sequence ID, mapping quality score, and strand.

PEReads <- file.path(demo.dir, "demo.sorted.bed.txt.gz")
head(read.table(PEReads, header=FALSE), 2)
#>      V1      V2      V3
#> 1 chr21 9417768 9417888
#> 2 chr21 9418101 9418223
#>                                                                                                         V4
#> 1 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT/2
#> 2 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT/1
#>   V5 V6
#> 1  0  +
#> 2  0  -

The output is a list in BED file format and/or written to an output BED file, where each line contains the information of a cfDNA fragment.

fragInfo <- MergePEReads(PEReads)
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'create' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/cfTools/1.0.0/my_env' 'python=3.7.0' '--quiet' '-c' 'conda-forge'
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'install' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/cfTools/1.0.0/my_env' 'python=3.7.0'
#> + '/home/biocbuild/.cache/R/basilisk/1.12.0/0/bin/conda' 'install' '--yes' '--prefix' '/home/biocbuild/.cache/R/basilisk/1.12.0/cfTools/1.0.0/my_env' '-c' 'conda-forge' 'python=3.7.0' 'python=3.7.0' 'numpy=1.16' 'scipy=1.5.3'
head(fragInfo, 2)
#>     chr   start     end fragmentLength strand
#> 1 chr21 9439599 9439741            142      -
#> 2 chr21 9483455 9483622            167      -
#>                                                                                                     name
#> 1 GWNJ-0965:805:GW2108234048th:1:1101:11972:44486_1:N:0:ATTCCT:R1:TTACGGCG:R2:CAGTGGAG:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1101:20933:67955_1:N:0:ATTCCT:R1:GGGACTTT:R2:CGCGAGTA:F1:TGACT:F2:TGACT

3.2 Merge methylation states of CpGs on two strands to fragment-level

Function MergeCpGs() generates fragment-level methylation states of CpGs. The main inputs of it are two output files of bismark methylation extractor, which is a program performing methylation calling on bisulfite treated sequencing reads. The CpG_OT* file contains methylation information for CpGs on the original top strand (OT); the CpG_OB* file contains methylation information for CpGs on the original bottom strand (OB). Both files contain columns of sequence ID, methylation state, chromosome name, chromosome start, methylation call.

CpG_OT <- file.path(demo.dir, "CpG_OT_demo.txt.gz")
CpG_OB <- file.path(demo.dir, "CpG_OB_demo.txt.gz")
head(read.table(CpG_OT, header=FALSE), 2)
#>                                                                                                       V1
#> 1 GWNJ-0965:805:GW2108234048th:2:2108:26250:45136_1:N:0:ATTCCT:R1:CTTTCCAA:R2:GCTTCGAG:F1:TGACT:F2:AGACG
#> 2 GWNJ-0965:805:GW2108234048th:2:2108:26250:45136_1:N:0:ATTCCT:R1:CTTTCCAA:R2:GCTTCGAG:F1:TGACT:F2:AGACG
#>   V2    V3      V4 V5
#> 1  + chr21 9438974  Z
#> 2  - chr21 9438990  z
head(read.table(CpG_OB, header=FALSE), 2)
#>                                                                                                       V1
#> 1 GWNJ-0965:805:GW2108234048th:2:1114:27600:18502_1:N:0:ATTCCT:R1:GAAAATGA:R2:AAGATGCC:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:2:1114:27600:18502_1:N:0:ATTCCT:R1:GAAAATGA:R2:AAGATGCC:F1:TGACT:F2:TGACT
#>   V2    V3      V4 V5
#> 1  + chr21 9498035  Z
#> 2  + chr21 9497961  Z

The output is a list in BED file format and/or written to an output BED file, where each line contains methylation states of all CpGs on the same fragment. Column methState is a sequence of binary values indicating the methylation states of all CpGs on the same fragment (0 represents unmethylated CpG and 1 represents methylated CpG).

methInfo <- MergeCpGs(CpG_OT, CpG_OB)
head(methInfo, 2)
#>     chr cpgStart  cpgEnd strand cpgNumber
#> 1 chr21  9439600 9439741      -         7
#> 2 chr21  9483487 9483622      -        16
#>                                                                                                                       cpgPosition
#> 1                                                                         9439600,9439636,9439678,9439685,9439717,9439725,9439741
#> 2 9483487,9483489,9483516,9483519,9483522,9483524,9483536,9483539,9483541,9483574,9483577,9483580,9483582,9483600,9483603,9483622
#>          methState
#> 1          0111010
#> 2 0111011010011010
#>                                                                                                     name
#> 1 GWNJ-0965:805:GW2108234048th:1:1101:11972:44486_1:N:0:ATTCCT:R1:TTACGGCG:R2:CAGTGGAG:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1101:20933:67955_1:N:0:ATTCCT:R1:GGGACTTT:R2:CGCGAGTA:F1:TGACT:F2:TGACT

3.3 Generate fragment-level information about methylation states

Function GenerateFragMeth() combines the output lists of MergePEReads() and MergeCpGs() into one list, which contains both the fragment information and the methylation states of all CpGs on each fragment.

fragMeth <- GenerateFragMeth(fragInfo, methInfo)
head(fragMeth, 2)
#>     chr   start     end
#> 1 chr21 9417768 9418223
#> 2 chr21 9431278 9431614
#>                                                                                                     name
#> 1 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT
#> 2   GWNJ-0965:805:GW2108234048th:1:1106:4411:3542_1:N:0:ATTCCT:R1:CATGTGCC:R2:AGCACTAA:F1:TGACT:F2:ACACT
#>   fragmentLength strand cpgNumber                             cpgPosition
#> 1            455      -         2                         9418124,9418171
#> 2            336      +         5 9431279,9431386,9431537,9431554,9431596
#>   methState
#> 1        01
#> 2     00111

3.4 Generate the methylation pattern of markers

Function GenerateMarkerParam() calculates paired shape parameters of beta distributions for each marker. There are three main inputs to this function: (1) a list of methylation levels (e.g., beta values), where each row is a sample and each column is a marker; (2) a vector of sample types (e.g., tumor or normal, tissue types) corresponding to the rows of the list; (3) a vector of marker names corresponding to the columns of the list.

methLevel <- read.table(file.path(demo.dir, "beta_matrix.txt.gz"), 
                      row.names=1, header = TRUE)
sampleTypes <- read.table(file.path(demo.dir, "sample_type.txt.gz"), 
                        row.names=1, header = TRUE)$sampleType
markerNames <- read.table(file.path(demo.dir, "marker_index.txt.gz"), 
                        row.names=1, header = TRUE)$markerIndex
print(methLevel)
#>            marker1   marker2   marker3
#> sample1  0.8967302 0.9444779 0.9274943
#> sample2  0.6807593 0.9391826 0.8329514
#> sample3  0.8407675 0.9312582 0.8824475
#> sample4  0.8614165 0.9571972 0.8170904
#> sample5  0.8685513 0.9138511 0.8993120
#> sample6  0.5926250 0.9626752 0.8629679
#> sample7  0.8382709 0.9145921 0.2179631
#> sample8  0.8554025 0.9510139 0.8368095
#> sample9  0.8286787 0.9376723 0.8830746
#> sample10 0.8362881 0.9286860 0.8326099
#> sample11 0.8782031 0.9801491 0.9095433
#> sample12 0.6815395 0.9093327 0.8749625
#> sample13 0.8751985 0.9527680 0.8362279
#> sample14 0.7974884 0.9184259 0.9163573
#> sample15 0.8785107 0.8995657 0.8645149
#> sample16 0.8159900 0.9719771 0.8779748
#> sample17 0.4364249 0.8534525 0.3489145
#> sample18 0.8701227 0.9525218 0.8704187
#> sample19 0.8737151 0.9037688 0.8957254
#> sample20 0.8572883 0.9467923 0.8380768
print(sampleTypes)
#>  [1] "normal" "tumor"  "tumor"  "normal" "normal" "tumor"  "tumor"  "normal"
#>  [9] "normal" "tumor"  "normal" "tumor"  "normal" "tumor"  "tumor"  "normal"
#> [17] "tumor"  "normal" "tumor"  "normal"
print(markerNames)
#> [1] 1 2 3

The output is a list containing the paired shape parameters of beta distributions for markers, which are delimited by :. Users can save this list into a file with column names, no row names, and columns are delimited by TAB for later use.

markerParam <- GenerateMarkerParam(methLevel, sampleTypes, markerNames)
print(markerParam)
#>   markerName        normal        tumor
#> 1          1 183.74:29.723  5.955:2.032
#> 2          2 134.584:6.958 83.653:7.662
#> 3          3 72.949:10.939  1.476:0.484

4 Fragments intersecting with marker regions

To make the computation more efficient, users may only keep the fragments that overlap with the genomic regions of markers. Here, we provide an example of using R package GenomicRanges to perform the intersection.

First, transform the two BED files into GRanges classes.

library(GenomicRanges)

# a BED file of fragment-level methylation information
frag_bed <- read.csv(file.path(demo.dir, "demo.fragment_level.meth.bed.txt.gz"), 
                     header=TRUE, sep="\t")
frag_meth.gr <- GRanges(seqnames=frag_bed$chr, 
                     ranges=IRanges(frag_bed$start, frag_bed$end),
                     strand=frag_bed$strand,
                     methState=as.character(frag_bed$methState))

# a BED file of genomic regions of markers
markers_bed <- read.csv(file.path(demo.dir, "markers.bed.txt.gz"), 
                        header=TRUE, sep="\t")
markers.gr <- GRanges(seqnames=markers_bed$chr, 
                      ranges=IRanges(markers_bed$start, markers_bed$end),
                      markerName=markers_bed$markerName)

head(frag_meth.gr, 2)
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames          ranges strand |   methState
#>          <Rle>       <IRanges>  <Rle> | <character>
#>   [1]    chr21 9417768-9418223      - |           1
#>   [2]    chr21 9431278-9431614      + |         111
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(markers.gr, 2)
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames          ranges strand | markerName
#>          <Rle>       <IRanges>  <Rle> |  <integer>
#>   [1]    chr21 9418124-9418171      * |          1
#>   [2]    chr21 9437462-9437533      * |          2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then, overlap two GRanges classes and get the fragment-level methylation states intersecting with the markers.

ranges <- subsetByOverlaps(frag_meth.gr, markers.gr, ignore.strand=TRUE)
hits <- findOverlaps(frag_meth.gr, markers.gr,ignore.strand=TRUE)
idx <- subjectHits(hits)

values <- DataFrame(markerName=markers.gr$markerName[idx])
mcols(ranges) <- c(mcols(ranges), values)

marker.methState <- as.data.frame(cbind(ranges$markerName, 
                                        ranges$methState))
colnames(marker.methState) <- c("markerName", "methState")
head(marker.methState, 4)
#>   markerName     methState
#> 1          1             1
#> 2          2   10111101110
#> 3          2  101100011010
#> 4          2 1011000111001

5 Cancer detection

Function CancerDetector() separates cfDNA into tumor-derived fragments and background normal fragments and estimates the tumor burden. The main inputs are two files: (1) the fragment-level methylation states of reads (column methState) that mapped to the cancer-specific markers; (2) paired shape parameters of beta distributions for cancer-specific markers. All columns are delimited by TAB, and the first line is the column names.

fragMethFile <- file.path(demo.dir, "CancerDetector.reads.txt.gz")
markerParamFile <- file.path(demo.dir, "CancerDetector.markers.txt.gz")
head(read.csv(fragMethFile, sep = "\t", colClasses = "character"), 4)
#>   markerName methState
#> 1          2       000
#> 2          2       000
#> 3          2       000
#> 4          2    111111
head(read.csv(markerParamFile, sep = "\t"), 4)
#>   markerName tumor normalPlasma
#> 1          2   5:3        97:12
#> 2          3   3:2         57:9
#> 3          4   4:4         24:3
#> 4          5   3:3         32:5

The output is the estimated tumor burden \(\theta\) and the normal cfDNA fraction \(1-\theta\).

CancerDetector(fragMethFile, markerParamFile)
#>   cfDNA_tumor_burden normal_cfDNA_fraction
#> 1              0.055                 0.945

6 Tissue deconvolution

Function cfDeconvolve() estimates fractions of cfDNA fragments from different tissues (> 2 classes). The main two input files are similar to function CancerDetector(): (1) the fragment-level methylation states of reads (column methState) that mapped to the tissue-specific markers; (2) paired shape parameters of beta distributions for tissue-specific markers. All columns are delimited by TAB, and there is a header line of the column names.

fragMethFile2 <- file.path(demo.dir, "cfDeconvolve.reads.txt.gz")
markerParamFile2 <- file.path(demo.dir, "cfDeconvolve.markers.txt.gz")
head(read.csv(fragMethFile2, header=TRUE, sep="\t", 
              colClasses = "character"), 4)
#>   markerName methState
#> 1          2   1111100
#> 2          2   0100110
#> 3          2   0111111
#> 4          2   1111111
head(read.csv(markerParamFile2, header=TRUE, sep="\t", 
              colClasses = "character"), 4)
#>   markerName   tissue1   tissue2    tissue3   tissue4   tissue5   tissue6
#> 1          2 1.68:1.82      2:18 1.02:0.459   2.08:18     0.061     0.176
#> 2         27     0.605      0.47      0.667 29.1:12.1    0.0528     0.542
#> 3         61 12.8:2.68 9.35:1.75  19.4:4.06 6.28:7.06 8.44:20.4 21.1:3.48
#> 4         63 4.71:2.49 13.5:2.17  55.3:8.53     0.906  31.3:3.8 5.46:3.17
#>     tissue7
#> 1 6.65:2.19
#> 2 7.31:3.47
#> 3 6.69:4.01
#> 4    15:3.3

Other input parameters are:

  • Number of tissue types: a positive integer number k. Reads will be classified into k different tissue types.
  • Read-based tissue deconvolution EM algorithm type: options are em.global.unknown (default), em.global.known, em.local.unknown, em.local.known.
  • Likelihood ratio threshold: a positive float number. We suggest this number is 2 (default). All reads with likelihood ratio < cutoff (default: -1.0, meanin no reads filtering is used) will not be used. Likelihood ratio is the max(all tissues’ likelihoods)/min(all tissues’ likelihoods).
  • EM algorithm maximum iteration number: a positive integer number. Default is 100.

For example:

numTissues <- 7
emAlgorithmType <- "em.global.unknown"
likelihoodRatioThreshold <- 2
emMaxIterations <- 100

The output is a list containing the cfDNA fractions of different tissue types and an unknown class.

tissueFraction <- cfDeconvolve(fragMethFile2, markerParamFile2, numTissues, 
                               emAlgorithmType, likelihoodRatioThreshold, 
                               emMaxIterations)
tissueFraction
#>       tissue1     tissue2   tissue3     tissue4   tissue5     tissue6  tissue7
#> 1 1.58246e-13 1.35298e-19 0.0442296 1.45249e-21 0.0157226 4.00821e-18 0.930494
#>      unknown
#> 1 0.00955414

7 References

Appendix

[1] Li W, Li Q, Kang S, Same M, Zhou Y, Sun C, Liu CC, Matsuoka L, Sher L, Wong WH, Alber F, Zhou XJ. CancerDetector: ultrasensitive and non-invasive cancer detection at the resolution of individual reads using cell-free DNA methylation sequencing data. Nucleic Acids Res. 2018 Sep 6;46(15):e89. doi: 10.1093/nar/gky423. PMID: 29897492; PMCID: PMC6125664.

[2] Del Vecchio G, Li Q, Li W, Thamotharan S, Tosevska A, Morselli M, Sung K, Janzen C, Zhou X, Pellegrini M, Devaskar SU. Cell-free DNA methylation and transcriptomic signature prediction of pregnancies with adverse outcomes. Epigenetics. 2021 Jun;16(6):642-661. doi: 10.1080/15592294.2020.1816774. PMID: 33045922; PMCID: PMC8143248.

Session info

#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0  IRanges_2.34.0      
#> [4] S4Vectors_0.38.0     BiocGenerics_0.46.0  cfTools_1.0.0       
#> [7] BiocStyle_2.28.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.5-4            jsonlite_1.8.4          compiler_4.3.0         
#>  [4] BiocManager_1.30.20     filelock_1.0.2          Rcpp_1.0.10            
#>  [7] bitops_1.0-7            parallel_4.3.0          jquerylib_0.1.4        
#> [10] png_0.1-8               yaml_2.3.7              fastmap_1.1.1          
#> [13] here_1.0.1              lattice_0.21-8          reticulate_1.28        
#> [16] R6_2.5.1                XVector_0.40.0          knitr_1.42             
#> [19] bookdown_0.33           rprojroot_2.0.3         GenomeInfoDbData_1.2.10
#> [22] bslib_0.4.2             R.utils_2.12.2          rlang_1.1.0            
#> [25] cachem_1.0.7            dir.expiry_1.8.0        xfun_0.39              
#> [28] sass_0.4.5              cli_3.6.1               withr_2.5.0            
#> [31] zlibbioc_1.46.0         grid_4.3.0              digest_0.6.31          
#> [34] basilisk_1.12.0         R.methodsS3_1.8.2       R.oo_1.25.0            
#> [37] evaluate_0.20           RCurl_1.98-1.12         rmarkdown_2.21         
#> [40] basilisk.utils_1.12.0   tools_4.3.0             htmltools_0.5.5