Package version: InPAS 1.2.1

Contents

1 Introduction

Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which occurs in most human genes. APA in a gene can result in altered expression of the gene, which can lead pathological effect to the cells, such as uncontrolled cell cycle and growth. However, there are only limited ways to identify and quantify APA in genes, and most of them suffers from complicated process for library construction and requires large amount of RNAs.

RNA-seq has become one of the most commonly used methods for quantifying genome-wide gene expression. There are massive RNA-seq datasets available publicly such as GEO and TCGA. For this reason, we develop the InPAS algorithm for identifying APA from conventional RNA-seq data.

The workflow for InPAS is:

2 How to

To use InPAS, BSgenome and TxDb object have to be loaded before run.

suppressPackageStartupMessages(library(InPAS))
suppressPackageStartupMessages(library(BSgenome.Mmusculus.UCSC.mm10))
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
path <- file.path(find.package("InPAS"), "extdata")

Users can prepare annotaiton by utr3Annotation with a TxDb and org annotation. The 3UTR annotation prepared by utr3Annotation includes the gaps to next annotation region.

suppressPackageStartupMessages(library(org.Hs.eg.db))
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
            package="GenomicFeatures")
txdb <- loadDb(samplefile)
utr3.sample.anno <- utr3Annotation(txdb=txdb, 
                   orgDbSYMBOL="org.Hs.egSYMBOL")
utr3.sample.anno
## GRanges object with 177 ranges and 6 metadata columns:
##                             seqnames                 ranges strand   |
##                                <Rle>              <IRanges>  <Rle>   |
##      uc001bum.2_5|IQCC|utr3     chr1 [ 32673684,  32674288]      +   |
##    uc001fbq.3_3|S100A9|utr3     chr1 [153333315, 153333503]      +   |
##    uc001gde.2_2|LRRC52|utr3     chr1 [165533062, 165533185]      +   |
##   uc001hfg.3_15|PFKFB2|utr3     chr1 [207245717, 207251162]      +   |
##   uc010psc.2_17|PFKFB2|utr3     chr1 [207252365, 207254368]      +   |
##                         ...      ...                    ...    ... ...
##      uc004dst.3_22|PHF8|CDS     chrX [ 53965591,  53965679]      -   |
##      uc004dsx.3_15|PHF8|CDS     chrX [ 53969797,  53969835]      -   |
##     uc004ehz.1_5|ARMCX3|CDS     chrX [100879970, 100881109]      +   |
##    uc004elw.3_6|FAM199X|CDS     chrX [103434289, 103434459]      +   |
##      uc004fmj.1_10|GAB3|CDS     chrX [153906455, 153906571]      -   |
##                                 feature          id          exon
##                             <character> <character>   <character>
##      uc001bum.2_5|IQCC|utr3     unknown        utr3  uc001bum.2_5
##    uc001fbq.3_3|S100A9|utr3     unknown        utr3  uc001fbq.3_3
##    uc001gde.2_2|LRRC52|utr3     unknown        utr3  uc001gde.2_2
##   uc001hfg.3_15|PFKFB2|utr3     unknown        utr3 uc001hfg.3_15
##   uc010psc.2_17|PFKFB2|utr3     unknown        utr3 uc010psc.2_17
##                         ...         ...         ...           ...
##      uc004dst.3_22|PHF8|CDS     unknown         CDS uc004dst.3_22
##      uc004dsx.3_15|PHF8|CDS     unknown         CDS uc004dsx.3_15
##     uc004ehz.1_5|ARMCX3|CDS     unknown         CDS  uc004ehz.1_5
##    uc004elw.3_6|FAM199X|CDS     unknown         CDS  uc004elw.3_6
##      uc004fmj.1_10|GAB3|CDS     unknown         CDS uc004fmj.1_10
##                              transcript        gene      symbol
##                             <character> <character> <character>
##      uc001bum.2_5|IQCC|utr3  uc001bum.2       55721        IQCC
##    uc001fbq.3_3|S100A9|utr3  uc001fbq.3        6280      S100A9
##    uc001gde.2_2|LRRC52|utr3  uc001gde.2      440699      LRRC52
##   uc001hfg.3_15|PFKFB2|utr3  uc001hfg.3        5208      PFKFB2
##   uc010psc.2_17|PFKFB2|utr3  uc010psc.2        5208      PFKFB2
##                         ...         ...         ...         ...
##      uc004dst.3_22|PHF8|CDS  uc004dst.3       23133        PHF8
##      uc004dsx.3_15|PHF8|CDS  uc004dsx.3       23133        PHF8
##     uc004ehz.1_5|ARMCX3|CDS  uc004ehz.1       51566      ARMCX3
##    uc004elw.3_6|FAM199X|CDS  uc004elw.3      139231     FAM199X
##      uc004fmj.1_10|GAB3|CDS  uc004fmj.1      139716        GAB3
##   -------
##   seqinfo: 27 sequences from hg19 genome; no seqlengths

Users can load mm10 and hg19 annotation from pre-prepared data. Here we loaded the prepared mm10 3UTR annotation file.

##step1 annotation
# load from dataset
data(utr3.mm10)

The coverage is calculated from BEDGraph file. The RNA-seq BAM files could be converted to BEDGraph files by bedtools genomecov tool (parameter: -bg -split). PWM and a classifier of polyA signal can be used for adjusting CP sites prediction.

load(file.path(path, "polyA.rds"))
suppressPackageStartupMessages(library(cleanUpdTSeq))
data(classifier)
bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), 
               file.path(path, "UM15.extract.bedgraph"))
hugeData <- FALSE
##step1 Calculate coverage
coverage <- coverageFromBedGraph(bedgraphs, 
                                 tags=c("Baf3", "UM15"),
                                 genome=BSgenome.Mmusculus.UCSC.mm10,
                                 hugeData=hugeData)

##step2 Predict cleavage sites
CPs <- CPsites(coverage=coverage, 
               genome=BSgenome.Mmusculus.UCSC.mm10,
               utr3=utr3.mm10, 
               search_point_START=200, 
               cutEnd=.2, 
               long_coverage_threshold=3,
               background="10K",
               txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
               PolyA_PWM=pwm, 
               classifier=classifier,
               shift_range=50)

##step3 Estimate 3UTR usage
res <- testUsage(CPsites=CPs, 
                 coverage=coverage, 
                 genome=BSgenome.Mmusculus.UCSC.mm10,
                 utr3=utr3.mm10, 
                 method="fisher.exact",
                 gp1="Baf3", gp2="UM15")

##step4 view the results
as(res, "GRanges") 
## GRanges object with 4 ranges and 25 metadata columns:
##              seqnames                 ranges strand |  transcript
##                 <Rle>              <IRanges>  <Rle> | <character>
##   uc009daz.2     chr6 [ 98018176,  98021358]      + |  uc009daz.2
##   uc009dhz.2     chr6 [114859343, 114862071]      + |  uc009dhz.2
##   uc009eet.1     chr6 [128846244, 128850081]      - |  uc009eet.1
##   uc012ero.1     chr6 [119315133, 119317055]      - |  uc012ero.1
##                     gene      symbol  fit_value Predicted_Proximal_APA
##              <character> <character>  <numeric>              <numeric>
##   uc009daz.2       17342        Mitf 12594.6737               98018978
##   uc009dhz.2       74244        Atg7 27383.4125              114859674
##   uc009eet.1      232406    BC035044   128.8275              128849128
##   uc012ero.1      211187       Lrtm2   168.5509              119316541
##              Predicted_Distal_APA           type utr3start   utr3end
##                         <numeric>    <character> <numeric> <numeric>
##   uc009daz.2             98021358 novel proximal  98018276  98021358
##   uc009dhz.2            114862071   novel distal 114859443 114860614
##   uc009eet.1            128846244   novel distal 128849981 128848044
##   uc012ero.1            119315133 novel proximal 119316955 119315133
##              Baf3_short.form.usage UM15_short.form.usage
##                          <numeric>             <numeric>
##   uc009daz.2             33.531338             1.0521497
##   uc009dhz.2            520.914790           172.1523492
##   uc009eet.1             22.671784             0.7334097
##   uc012ero.1              8.857865            49.5356160
##              Baf3_long.form.usage UM15_long.form.usage Baf3_PDUI UM15_PDUI
##                         <numeric>            <numeric> <numeric> <numeric>
##   uc009daz.2           282.824024            189.14112 0.8940074 0.9944680
##   uc009dhz.2           208.024604            456.54462 0.2853798 0.7261760
##   uc009eet.1             7.876603             21.96014 0.2578402 0.9676820
##   uc012ero.1             8.535131             70.81263 0.4907223 0.5883977
##              short.mean.gp1 long.mean.gp1 short.mean.gp2 long.mean.gp2
##                   <numeric>     <numeric>      <numeric>     <numeric>
##   uc009daz.2      33.531338    282.824024      1.0521497     189.14112
##   uc009dhz.2     520.914790    208.024604    172.1523492     456.54462
##   uc009eet.1      22.671784      7.876603      0.7334097      21.96014
##   uc012ero.1       8.857865      8.535131     49.5356160      70.81263
##               PDUI.gp1  PDUI.gp2       dPDUI      logFC      P.Value
##              <numeric> <numeric>   <numeric>  <numeric>    <numeric>
##   uc009daz.2 0.8940074 0.9944680 -0.10046063 -0.1536382 1.586980e-06
##   uc009dhz.2 0.2853798 0.7261760 -0.44079612 -1.3474358 1.709476e-60
##   uc009eet.1 0.2578402 0.9676820 -0.70984179 -1.9080557 2.632879e-08
##   uc012ero.1 0.4907223 0.5883977 -0.09767539 -0.2618847 5.930309e-01
##                 adj.P.Val
##                 <numeric>
##   uc009daz.2 2.115973e-06
##   uc009dhz.2 6.837904e-60
##   uc009eet.1 5.265757e-08
##   uc012ero.1 5.930309e-01
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
filterRes(res, gp1="Baf3", gp2="UM15",
          background_coverage_threshold=3,
          adj.P.Val_cutoff=0.05, 
          dPDUI_cutoff=0.3, 
          PDUI_logFC_cutoff=0.59)
## GRanges object with 4 ranges and 26 metadata columns:
##              seqnames                 ranges strand |  transcript
##                 <Rle>              <IRanges>  <Rle> | <character>
##   uc009daz.2     chr6 [ 98018176,  98021358]      + |  uc009daz.2
##   uc009dhz.2     chr6 [114859343, 114862071]      + |  uc009dhz.2
##   uc009eet.1     chr6 [128846244, 128850081]      - |  uc009eet.1
##   uc012ero.1     chr6 [119315133, 119317055]      - |  uc012ero.1
##                     gene      symbol  fit_value Predicted_Proximal_APA
##              <character> <character>  <numeric>              <numeric>
##   uc009daz.2       17342        Mitf 12594.6737               98018978
##   uc009dhz.2       74244        Atg7 27383.4125              114859674
##   uc009eet.1      232406    BC035044   128.8275              128849128
##   uc012ero.1      211187       Lrtm2   168.5509              119316541
##              Predicted_Distal_APA           type utr3start   utr3end
##                         <numeric>    <character> <numeric> <numeric>
##   uc009daz.2             98021358 novel proximal  98018276  98021358
##   uc009dhz.2            114862071   novel distal 114859443 114860614
##   uc009eet.1            128846244   novel distal 128849981 128848044
##   uc012ero.1            119315133 novel proximal 119316955 119315133
##              Baf3_short.form.usage UM15_short.form.usage
##                          <numeric>             <numeric>
##   uc009daz.2             33.531338             1.0521497
##   uc009dhz.2            520.914790           172.1523492
##   uc009eet.1             22.671784             0.7334097
##   uc012ero.1              8.857865            49.5356160
##              Baf3_long.form.usage UM15_long.form.usage Baf3_PDUI UM15_PDUI
##                         <numeric>            <numeric> <numeric> <numeric>
##   uc009daz.2           282.824024            189.14112 0.8940074 0.9944680
##   uc009dhz.2           208.024604            456.54462 0.2853798 0.7261760
##   uc009eet.1             7.876603             21.96014 0.2578402 0.9676820
##   uc012ero.1             8.535131             70.81263 0.4907223 0.5883977
##              short.mean.gp1 long.mean.gp1 short.mean.gp2 long.mean.gp2
##                   <numeric>     <numeric>      <numeric>     <numeric>
##   uc009daz.2      33.531338    282.824024      1.0521497     189.14112
##   uc009dhz.2     520.914790    208.024604    172.1523492     456.54462
##   uc009eet.1      22.671784      7.876603      0.7334097      21.96014
##   uc012ero.1       8.857865      8.535131     49.5356160      70.81263
##               PDUI.gp1  PDUI.gp2       dPDUI      logFC      P.Value
##              <numeric> <numeric>   <numeric>  <numeric>    <numeric>
##   uc009daz.2 0.8940074 0.9944680 -0.10046063 -0.1536382 1.586980e-06
##   uc009dhz.2 0.2853798 0.7261760 -0.44079612 -1.3474358 1.709476e-60
##   uc009eet.1 0.2578402 0.9676820 -0.70984179 -1.9080557 2.632879e-08
##   uc012ero.1 0.4907223 0.5883977 -0.09767539 -0.2618847 5.930309e-01
##                 adj.P.Val      PASS
##                 <numeric> <logical>
##   uc009daz.2 2.115973e-06     FALSE
##   uc009dhz.2 6.837904e-60      TRUE
##   uc009eet.1 5.265757e-08      TRUE
##   uc012ero.1 5.930309e-01     FALSE
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The process described above can be done in one step.

if(interactive()){
    res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), 
              genome=BSgenome.Mmusculus.UCSC.mm10, 
              utr3=utr3.mm10, gp1="Baf3", gp2="UM15",
              txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
              search_point_START=200,
              short_coverage_threshold=15,
              long_coverage_threshold=3, 
              cutStart=0, cutEnd=.2, 
              hugeData=FALSE,
              shift_range=50,
              PolyA_PWM=pwm, classifier=classifier,
              method="fisher.exact",
              adj.P.Val_cutoff=0.05,
              dPDUI_cutoff=0.3, 
              PDUI_logFC_cutoff=0.59)
}

InPAS can handle single group data.

inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), 
      genome=BSgenome.Mmusculus.UCSC.mm10, 
      utr3=utr3.mm10, gp1="Baf3", gp2=NULL,
      txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
      search_point_START=200,
      short_coverage_threshold=15,
      long_coverage_threshold=3, 
      cutStart=0, cutEnd=.2, 
      hugeData=FALSE, 
      PolyA_PWM=pwm, classifier=classifier,
      method="singleSample",
      adj.P.Val_cutoff=0.05)
## converged at iteration 1 with logLik: -1835.501 
## converged at iteration 5 with logLik: -838.8306 
## converged at iteration 1 with logLik: -1487.249 
## converged at iteration 5 with logLik: -711.1667 
## converged at iteration 1 with logLik: -1173.482 
## converged at iteration 5 with logLik: -579.836 
## converged at iteration 1 with logLik: -460.8479 
## converged at iteration 16 with logLik: -211.5795 
## converged at iteration 2 with logLik: -191.9692 
## converged at iteration 12 with logLik: -153.2558
## dPDUI is calculated by gp2 - gp1.
## GRanges object with 6 ranges and 20 metadata columns:
##              seqnames                   ranges  strand   |    transcript
##               "<Rle>"              "<IRanges>" "<Rle>" "|" "<character>"
##   uc009daz.2   "chr6" "[ 98018176,  98021358]"     "+" "|"  "uc009daz.2"
##   uc009dhz.2   "chr6" "[114859343, 114862071]"     "+" "|"  "uc009dhz.2"
##   uc009die.2   "chr6" "[114860615, 114862164]"     "-" "|"  "uc009die.2"
##   uc009eet.1   "chr6" "[128847265, 128850081]"     "-" "|"  "uc009eet.1"
##   uc009eol.1   "chr6" "[140651362, 140651622]"     "+" "|"  "uc009eol.1"
##   uc012ero.1   "chr6" "[119315133, 119317055]"     "-" "|"  "uc012ero.1"
##                       gene        symbol   fit_value
##              "<character>" "<character>" "<numeric>"
##   uc009daz.2       "17342"        "Mitf"    17842.87
##   uc009dhz.2       "74244"        "Atg7"    7576.055
##   uc009die.2      "232334"       "Vgll4"    11877.19
##   uc009eet.1      "232406"    "BC035044"    174.2304
##   uc009eol.1       "11569"       "Aebp2"      190776
##   uc012ero.1      "211187"       "Lrtm2"    19.16645
##              Predicted_Proximal_APA Predicted_Distal_APA             type
##                         "<numeric>"          "<numeric>"    "<character>"
##   uc009daz.2               98019024             98021358 "novel proximal"
##   uc009dhz.2              114860286            114862071   "novel distal"
##   uc009die.2              114861564            114860615   "novel distal"
##   uc009eet.1              128849248            128847265   "novel distal"
##   uc009eol.1              140651561            140651622 "novel proximal"
##   uc012ero.1              119316666            119315133 "novel proximal"
##                utr3start     utr3end       data2 Baf3_short.form.usage
##              "<numeric>" "<numeric>"    "<list>"           "<numeric>"
##   uc009daz.2    98018176    98021358 Integer,848               29.6839
##   uc009dhz.2   114859343   114860614 Integer,943              2.934986
##   uc009die.2   114862164   114862092  Integer,73              104.7689
##   uc009eet.1   128850081   128848044 Integer,833              28.03751
##   uc009eol.1   140651362   140651622 Integer,199              1057.506
##   uc012ero.1   119317055   119315133 Integer,389              8.935672
##              Baf3_long.form.usage   Baf3_PDUI short.mean  long.mean
##                       "<numeric>" "<numeric>"   "<list>"   "<list>"
##   uc009daz.2             283.3645   0.9051779 "########" "########"
##   uc009dhz.2             210.5062   0.9862492 "########" "########"
##   uc009die.2             191.2421   0.6460641 "########" "########"
##   uc009eet.1              11.7248   0.2948722 "########" "########"
##   uc009eol.1             248.3387   0.1901748 "########" "########"
##   uc012ero.1             9.095176   0.5044231 "########" "########"
##                    PDUI    P.Value  adj.P.Val       dPDUI        PASS
##                "<list>"   "<list>"   "<list>" "<numeric>" "<logical>"
##   uc009daz.2 "########" "########" "########"         NaN       FALSE
##   uc009dhz.2 "########" "########" "########"         NaN       FALSE
##   uc009die.2 "########" "########" "########"         NaN       FALSE
##   uc009eet.1 "########" "########" "########"         NaN       FALSE
##   uc009eol.1 "########" "########" "########"         NaN       FALSE
##   uc012ero.1 "########" "########" "########"         NaN       FALSE
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

3 Session Info

sessionInfo()

R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] cleanUpdTSeq_1.8.0
[2] e1071_1.6-7
[3] seqinr_3.1-3
[4] ade4_1.7-3
[5] BSgenome.Drerio.UCSC.danRer7_1.4.0
[6] org.Hs.eg.db_3.2.3
[7] RSQLite_1.0.0
[8] DBI_0.3.1
[9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.2.2 [10] BSgenome.Mmusculus.UCSC.mm10_1.4.0
[11] BSgenome_1.38.0
[12] rtracklayer_1.30.1
[13] Biostrings_2.38.2
[14] XVector_0.10.0
[15] InPAS_1.2.1
[16] GenomicFeatures_1.22.6
[17] AnnotationDbi_1.32.1
[18] GenomicRanges_1.22.1
[19] GenomeInfoDb_1.6.1
[20] IRanges_2.4.4
[21] S4Vectors_0.8.3
[22] Biobase_2.30.0
[23] BiocGenerics_0.16.1
[24] BiocStyle_1.8.0

loaded via a namespace (and not attached): [1] Rcpp_0.12.2 biovizBase_1.18.0
[3] lattice_0.20-33 class_7.3-14
[5] Rsamtools_1.22.0 digest_0.6.8
[7] truncnorm_1.0-7 plyr_1.8.3
[9] futile.options_1.0.0 acepack_1.3-3.3
[11] evaluate_0.8 ggplot2_1.0.1
[13] zlibbioc_1.16.0 depmixS4_1.3-2
[15] rpart_4.1-10 preprocessCore_1.32.0
[17] rmarkdown_0.8.1 proto_0.3-10
[19] splines_3.2.2 BiocParallel_1.4.0
[21] stringr_1.0.0 foreign_0.8-66
[23] RCurl_1.95-4.7 biomaRt_2.26.1
[25] munsell_0.4.2 Gviz_1.14.0
[27] htmltools_0.2.6 Rsolnp_1.15
[29] nnet_7.3-11 SummarizedExperiment_1.0.1 [31] gridExtra_2.0.0 Hmisc_3.17-0
[33] matrixStats_0.15.0 XML_3.98-1.3
[35] GenomicAlignments_1.6.1 MASS_7.3-45
[37] bitops_1.0-6 grid_3.2.2
[39] gtable_0.1.2 magrittr_1.5
[41] formatR_1.2.1 scales_0.3.0
[43] stringi_1.0-1 reshape2_1.4.1
[45] limma_3.26.3 latticeExtra_0.6-26
[47] futile.logger_1.4.1 Formula_1.2-1
[49] lambda.r_1.1.7 RColorBrewer_1.1-2
[51] tools_3.2.2 dichromat_2.0-0
[53] survival_2.38-3 yaml_2.1.13
[55] colorspace_1.2-6 cluster_2.0.3
[57] knitr_1.11 VariantAnnotation_1.16.3