This package contains a collection of various tools for Proteomics used at the proteomics platform of the IGBMC. To get started, we need to load the packages “wrMisc” and this package (wrProteo), both are available from CRAN. The packages wrGraph and RColorBrewer get used internally by some of the functions from this package for (optional/improved) figures. Furthermore, the Bioconductor package limma will be used internally for statistical testing.
If you are not familiar with R you may find many introductory documents on the official R-site in contributed documents or under Documentation/Manuals. Of course, numerous other documents/sites with tutorials exit.
The aim of package-vignettes is to provide additional information and show examples how the R-package concerned may be used, thus complementing the documentation given with help() for each of the functions of the package. In terms of examples, frequent standard types of problems are preferred in a vignette. Nevertheless, most functions can be used in many other ways, for this you may have to check the various arguments via calling help on the function of interest. All R-code in this vigentte can be directly repeated by the user, all data used is provided with the package.
## if you need to install the packages 'wrMisc','wrProteo' and 'wrGraph' from CRAN :
install.packages("wrMisc")
install.packages("wrProteo")
## The package 'wrGraph' is not obligatory, but it allows making better graphs
install.packages("wrGraph")
## Installation of limma from Bioconductor
if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("limma")
## Let's assume this is a fresh R-session
## Get started by loading the packages
library("knitr")
library("wrMisc")
library("wrProteo")
library("wrGraph")
# This is wrProteo version no :
packageVersion("wrProteo")
#> [1] '1.12.0'
This way you can browse all vignettes available to this package :
browseVignettes("wrProteo")
There you can find another vignette dedicated to the analysis of heterogenous spike-in experiments.
Please note that molecular masses may be given in two flavours : Monoisotopic mass and average mass. For details you may refer to Wikipedia: monoisotopic mass. Monoisotopic masses commonly are used in mass-spectrometry and will be used by default in this package.
Molecular (mono-isotopic) masses of the atomes integrated in this package were taken from Unimod. They can be easily updated, if in the future, (mono-isotopic) molecular masses will be determined with higher precision (ie more digits).
At this level (summed) atomic compositions are evaluated. Here, the number of atoms has to be written before the atom. Thus, ‘2C’ means two atoms of carbon. Empty or invalid entries will be by default returned as mass=0, a message will comment such issues.
The mass of an electron can be assigned using ‘e’.
massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN", " ", "e"))
#> massDeFormula : can't find ' ' .. setting to 0 mass
#> 12H12O 1H1O 2H1Se1e6C2N 1H1Se1e1C1N 1e
#> 2.040329e+02 1.700274e+01 1.819389e+02 1.069280e+02 0.000000e+00 5.485799e-04
# Ignore empty/invalid entries
massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), rmEmpty=TRUE)
#> 12H12O 1H1O 2H1Se1e6C2N 1H1Se1e1C1N
#> 204.03288 17.00274 181.93887 106.92797
Using the argument massTy one can switch from default monoisotopic mass to average mass :
massDeFormula(c("12H12O", "HO", " 2H 1 Se, 6C 2N", "HSeCN"), massTy="aver")
#> 12H12O 1H1O 2H1Se1e6C2N 1H1Se1e1C1N
#> 204.08808 17.00734 181.05403 105.98589
The mass of these amino-acids can be used:
AAmass()
#> A R N D C E Q G
#> 71.03711 156.10111 114.04293 115.02694 103.00918 129.04259 128.05858 57.02146
#> H I L K M F P S
#> 137.05891 113.08406 113.08406 128.09496 131.04048 147.06841 97.05276 87.03203
#> T W Y V
#> 101.04768 186.07931 163.06333 99.06841
Here the one-letter amino-acid code is used to descibre peptides or proteins.
## mass of peptide (or protein)
pep1 <- c(aa="AAAA",de="DEFDEF")
convAASeq2mass(pep1, seqN=FALSE)
#> aa de
#> 302.1590 800.2865
This package contains a parser for Fasta-files allowing to separate
different fields of meta-data like IDs, name and species of the
respecive entries. Here we will read a tiny example fasta-file (a
collection of typical contaminants in proteomics) using
readFasta2()
.
path1 <- system.file('extdata', package='wrProteo')
fiNa <- "conta1.fasta.gz"
## basic reading of Fasta
fasta1 <- readFasta2(file.path(path1, fiNa))
str(fasta1)
#> Named chr [1:36] "FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATV"| __truncated__ ...
#> - attr(*, "names")= chr [1:36] "P00761 TRYP_PIG TRYPSIN PRECURSOR" "P00760 TRY1_BOVIN Cationic trypsin (Fragment) - Bos taurus (Bovine)" "P00766 CTRA_BOVIN Chymotrypsinogen A - Bos taurus (Bovine)" "P00767 CTRB_BOVIN Chymotrypsinogen B - Bos taurus (Bovine)" ...
## now let's read and further separate details in annotation-fields
fasta1b <- readFasta2(file.path(path1, fiNa), tableOut=TRUE)
str(fasta1b)
#> chr [1:36, 1:9] "generic" "generic" "generic" "generic" "generic" ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:9] "database" "uniqueIdentifier" "entryName" "proteinName" ...
Now we can check if some entries appear twice.
dupEntry <- duplicated(fasta1)
table(dupEntry)
#> dupEntry
#> FALSE TRUE
#> 35 1
Let’s remove the duplicated entry.
fasta3 <- fasta1[which(!dupEntry)]
length(fasta3)
#> [1] 35
Once we have modified a fasta we might want to save it again as
fasta-formatted file. This can be done using
writeFasta2()
.
writeFasta2(fasta3, fileNa="testWrite.fasta")
.
Multiple algorithms and software implementations have been developed for quantitation label-free proteomics experiments (LFQ), in particular for extracted ion chromatograms (XIC). For more background information you may look at Wikipedia labell-free Proteomics.
The tools presented here are designed for use with label-free XIC (ie LFQ) data. Several of the programs for extracting initial quantitations also allow getting spectral counting (PSM) data which can also get imported into R, however their use is not further discussed in this vignette. In general it is preferable to use XIC for comparing peptde of protein quantities between different protein extracts/samples.
This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant, Fragpipe, Proline, MassChroQ, DIA-NN, AlphaPept, Wombat-P and OpenMS.
All quantitation import functions offer special features for further separating annotation related information, like species, for later use.
In most common real-world cases people typically analyze data using only one quantitation algorithm/software. Below in this vignette, we’ll use only the quantitation data generated using MaxQuant (AlphaPept, DIA-NN, FragPipe, MassChroQ, OpenMS, ProteomeDiscoverer, Proline and Wombat-P are supported, too). The other vignette to this package (“UPS-1 spike-in Experiments”) shows in detail the import functions available for MaxQuant, ProteomeDiscoverer and Proline and how further comparsions can be performed in bench-mark studies. All these import functions generate an equivalent output format, separating (selected) annotation data ($annot) from normalized log2-quantitation data ($quant) and initial quantitation ($raw).
Normalization (discussed below in more detail) is an important part of ‘preparing’ the data for subsequant analysis. The import functions in this package allow performin an initial normalization step (with choice among multiple algorithims), too. Further information about the proteins identifed can be considered during normalization: For example, it is possible to exclude contaminants like keratins which are frequently found among the higher abundant proteins which may potentially introduce bias at global normalization.
Technical replicates are very frequently produced in proteomics, they allow to assess the variability linked to repeated injection of the same material. Biological replicates, however, make additional information accessible, allowing the interpretation of experiments in a more general way.
MaxQuant is free software
provided by the Max-Planck-Institute, see also
Tyanova et al 2016.
Typically MaxQuant exports by
default quantitation data on level of consensus-proteins as a folder
called txt with a file always called ‘proteinGroups.txt’. Data exported
from MaxQuant can get imported
(and normalized) using readMaxQuantFile()
, in a standard
case one needs only to provide the path to the file ‘proteinGroups.txt’
which can be found the combined/txt/ folder produced by
MaxQuant. gz-compressed files can be read, too (as in the example below
the file ‘proteinGroups.txt.gz’). The argument specPref allows
giving further details about expected (primary) species, it defaults to
working with human proteins. To get started, let’s just set it to
NULL for ignoring.
path1 <- system.file("extdata", package="wrProteo")
dataMQ <- readMaxQuantFile(path1, specPref=NULL, normalizeMeth="median")
#> readMaxQuantFile : Note: Found 11 out of 1115 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> readMaxQuantFile : Transform 2845(9.5%) initial '0' values to 'NA'
#> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> readMaxQuantFile : Found 1 species name(s) appearing inside other ones, assume as truncated (eg Saccharomyces cerevis)
#> readMaxQuantFile : Note: 5 proteins with unknown species
#> data by species : Gallus gallus: 1, Homo sapiens: 49, Mus musculus: 1, Saccharomyces cerevisiae: 1047, Sus scrofa: 1,
#> readMaxQuantFile : Found 70 composite accession-numbers (eg P00761;P00761), truncating
#> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
## number of lines and columns of quantitation data
dim(dataMQ$quant)
#> [1] 1104 27
Similarly we can also add directly information about principal species, contaminants, special groups of proteins and add sdrf annotation (if existing) directly when reading the data. Setting customized tags according to species or other search-terms can be done using the argument specPref. In the example below we define a main species (tags are made by comparing to the species information initially given by the fasta) and we define a custom group of proteins by their Uniprot-Accessions (here the UPS1 spike-in). Then, the content of argument specPref will get searched in multiple types of annotation (if available from the initial Fasta).
By setting suplAnnotFile=TRUE the import function will also look for files (by default produced by MaxQuant as ‘summary.txt’ and ‘parameters.txt’) giving more information about experiment and samples and integrate this to the output. (This time let’s do not display the plot of distributions, it’s the same plot as above, see argument plotGraph.)
## The grouping of replicates
grp9 <- rep(1:9,each=3)
head(grp9)
#> [1] 1 1 1 2 2 2
## special group of proteins (we want to differentiate/ highlight lateron)
UPS1ac <- c("P00915", "P00918", "P01031", "P69905", "P68871", "P41159", "P02768", "P62988",
"P04040", "P00167", "P01133", "P02144", "P15559", "P62937", "Q06830", "P63165", "P00709", "P06732",
"P12081", "P61626", "Q15843", "P02753", "P16083", "P63279", "P01008", "P61769", "P55957", "O76070",
"P08263", "P01344", "P01127", "P10599", "P99999", "P06396", "P09211", "P01112", "P01579", "P02787",
"O00762", "P51965", "P08758", "P02741", "P05413", "P10145", "P02788", "P10636-8", "P00441", "P01375")
specPrefMQ <- list(conta="CON_|LYSC_CHICK", mainSpecies="OS=Saccharomyces cerevisiae", spike=UPS1ac)
dataMQ <- readMaxQuantFile(path1, specPref=specPrefMQ, suplAnnotFile=TRUE, groupPref=list(lowNumberOfGroups=FALSE), gr=grp9, plotGraph=FALSE)
#> readMaxQuantFile : Note: Found 11 out of 1115 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> readMaxQuantFile : Transform 2845(9.5%) initial '0' values to 'NA'
#> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> readMaxQuantFile : Found 1 species name(s) appearing inside other ones, assume as truncated (eg Saccharomyces cerevis)
#> readMaxQuantFile : Note: 5 proteins with unknown species
#> data by species : Gallus gallus: 1, Homo sapiens: 49, Mus musculus: 1, Saccharomyces cerevisiae: 1047, Sus scrofa: 1,
#> readMaxQuantFile : Found 70 composite accession-numbers (eg P00761;P00761), truncating
#> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
#> readMaxQuantFile : readSampleMetaData : PROBLEM : Meta-data and abundance data do not match ! Number of samples from summary.txt.gz (27) and from main data (27) do NOT match !! .. ignoring
## the quantifiation data is the same as before
dim(dataMQ$quant)
#> [1] 1104 27
Now we can access special tags in the annotation part of the resulting object the results :
## count of tags based on argument specPref
table(dataMQ$annot[,"SpecType"])
#>
#> conta mainSpecies spike
#> 9 1047 48
This information can be used automatically lateron for assigning different symbols and/or colors when drawing Volcano-plots or PCA.
To further analyze the data from an experiment typically the user also need to know/declare different groups of samples (eg who is replicate of whom). In the simplest case this can be done via the argument gr, as shown above. By the way, if gr is provided it gets priority over other automcatic mining results.
The import-functions from this package try to help you in multiple ways to find out more about the experimental details. Most quantitation software (like MaxQuant and ProteomeDiscoverer) also produce files/documentation about experimental annotation specified by the user. These files may be automatically read and mined via argument suplAnnotFile=TRUE to gather information about groups of samples.
The project Proteomics Sample Metadata Format aims to provide a framework of providing a uniform format for documenting experimental meta-data (sdrf). If sfdr-annotation (see Proteomics Sample Metadata Format) exists on Pride, it can be imported, too. The information on the experimental setup will be mined to automatically to design groups of samples (ie levels of covariant factors). If sdrf has not been prepared, the user may also simply provide a data.frame formatted like sfdr from Pride.
Finally, if nothing of the above is available, the column-names from the quantitation columns will be minded to search hints about groups of replicates (in particular when using MaxQuant).
For a bit more complex example of using readMaxQuantFile() or integrating other annotation information, please look at the vignette “UPS1 spike-in Experiments” also available to this package.
The simplest way of adding sdrf annotation consists in addin the project ID from Pride, as shown below. The argument groupPref allows defining further adjustments/choices. The import-function will first check if this a local file, and if not try to download from Pride (if available) and further mine the information.
dataMQ <- readMaxQuantFile(path1, specPref=specPrefMQ, sdrf="PXD001819", suplAnnotFile=TRUE, groupPref=list(lowNumberOfGroups=FALSE), plotGraph=FALSE)
#> readMaxQuantFile : Note: Found 11 out of 1115 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> readMaxQuantFile : Transform 2845(9.5%) initial '0' values to 'NA'
#> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> readMaxQuantFile : Found 1 species name(s) appearing inside other ones, assume as truncated (eg Saccharomyces cerevis)
#> readMaxQuantFile : Note: 5 proteins with unknown species
#> data by species : Gallus gallus: 1, Homo sapiens: 49, Mus musculus: 1, Saccharomyces cerevisiae: 1047, Sus scrofa: 1,
#> readMaxQuantFile : Found 70 composite accession-numbers (eg P00761;P00761), truncating
#> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
#> readMaxQuantFile : readSampleMetaData : PROBLEM : Meta-data and abundance data do not match ! Number of samples from summary.txt.gz (27) and from main data (27) do NOT match !! .. ignoring
#> readMaxQuantFile : readSampleMetaData : readSdrf : Successfully read 24 annotation columns for 27 samples
#> readMaxQuantFile : readSampleMetaData : Note : Some filenames contain '.raw', others do NOT; solved inconsistency ..
#> readMaxQuantFile : readSampleMetaData : Successfully adjusted order of sdrf to content of summary.txt.gzparameters.txt.gz
#> readMaxQuantFile : readSampleMetaData : Choosing model 'combNonOrth' for evaluating replicate-structure (ie 9 groups of samples)
As mentioned, the Proteomics Sample Metadata Format - sdrf is an effort for standardizing experimental meta-data. Many of the typically documented ones may already have been entered when lauching MaxQuant and can be exported as a draft Sdrf-file. All main columns for standard experiments are present in the file, though some columns will have to be completed by the user (by any text-editor) for submitting to Pride.
path1 <- system.file("extdata", package="wrProteo")
fiNaMQ <- "proteinGroups.txt.gz"
dataMQ2 <- readMaxQuantFile(path1, file=fiNaMQ, refLi="mainSpe", sdrf=FALSE, suplAnnotFile=TRUE)
#> readMaxQuantFile : Note: Found 11 out of 1115 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> readMaxQuantFile : Transform 2845(9.5%) initial '0' values to 'NA'
#> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> readMaxQuantFile : Found 1 species name(s) appearing inside other ones, assume as truncated (eg Saccharomyces cerevis)
#> readMaxQuantFile : Note: 5 proteins with unknown species
#> data by species : Gallus gallus: 1, Homo sapiens: 49, Mus musculus: 1, Saccharomyces cerevisiae: 1047, Sus scrofa: 1,
#> readMaxQuantFile : Found 70 composite accession-numbers (eg P00761;P00761), truncating
#> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
#> readMaxQuantFile : Could not find any proteins matching argument 'refLi=mainSpe', ignoring ...
#> readMaxQuantFile : normalizeThis : Omit redundant 'refLines'
#> readMaxQuantFile : readSampleMetaData : PROBLEM : Meta-data and abundance data do not match ! Number of samples from summary.txt.gz (27) and from main data (27) do NOT match !! .. ignoring
#> readMaxQuantFile : readSampleMetaData : Note: 'sdrf' looks bizarre (trouble ahead ?), expecting either file, data.frame or complete list
#> readMaxQuantFile : readSampleMetaData : Note : Ignoring 'sdrf' : it does NOT have the expected number or rows (1 given but 27 expected !)
#> readMaxQuantFile : readSampleMetaData : Reading of sdrf was NOT successful and no summaryD available => nothing can be done to mine experimental setup...
## Here we'll write simply in the current temporary directory of this R-session
exportSdrfDraft(dataMQ2, file.path(tempdir(),"testSdrf.tsv"))
#> exportSdrfDraft : Successfully exported sdrf-draft to file 'C:\Users\wraff\AppData\Local\Temp\RtmpMxiL0A/testSdrf.tsv'
Similarly it is possible to read the file by default called ‘peptides.txt’ for the peptide-data. In the example below we’ll provide a custom file-name (to a tiny example non-representative for biological interpretation). The data get imported to a similar structure like the protein-level data, quantitations on peptide level by default median-normalized, sample-setup from sdrf-files may be added, too.
MQpepFi1 <- "peptides_tinyMQ.txt.gz"
path1 <- system.file("extdata", package="wrProteo")
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST", spec2="HUMAN")
dataMQpep <- readMaxQuantPeptides(path1, file=MQpepFi1, specPref=specPref1, tit="Tiny MaxQuant Peptides")
#> readMaxQuantPeptides : Transform 405(19%) initial '0' values to 'NA'
summary(dataMQpep$quant)
#> 12500am.1 12500am.2 12500am.3 125am.1
#> Min. :20.87 Min. :21.02 Min. :19.89 Min. :18.87
#> 1st Qu.:22.24 1st Qu.:22.24 1st Qu.:22.42 1st Qu.:22.20
#> Median :23.08 Median :23.08 Median :23.08 Median :23.08
#> Mean :23.38 Mean :23.44 Mean :23.46 Mean :23.29
#> 3rd Qu.:24.12 3rd Qu.:24.28 3rd Qu.:24.32 3rd Qu.:24.11
#> Max. :28.65 Max. :28.66 Max. :28.86 Max. :28.17
#> NA's :37 NA's :26 NA's :28 NA's :37
#> 125am.2 125am.3 25000am.1 25000am.2
#> Min. :20.74 Min. :20.39 Min. :20.58 Min. :19.69
#> 1st Qu.:22.24 1st Qu.:22.23 1st Qu.:22.26 1st Qu.:22.12
#> Median :23.08 Median :23.08 Median :23.08 Median :23.08
#> Mean :23.35 Mean :23.26 Mean :23.48 Mean :23.26
#> 3rd Qu.:24.22 3rd Qu.:24.02 3rd Qu.:24.34 3rd Qu.:24.09
#> Max. :28.09 Max. :27.99 Max. :28.75 Max. :28.51
#> NA's :35 NA's :38 NA's :28 NA's :37
#> 25000am.3 2500am.1 2500am.2 2500am.3
#> Min. :18.98 Min. :20.52 Min. :20.85 Min. :20.70
#> 1st Qu.:22.08 1st Qu.:22.11 1st Qu.:22.33 1st Qu.:22.15
#> Median :23.08 Median :23.08 Median :23.08 Median :23.08
#> Mean :23.32 Mean :23.31 Mean :23.37 Mean :23.37
#> 3rd Qu.:24.25 3rd Qu.:24.13 3rd Qu.:24.16 3rd Qu.:24.20
#> Max. :28.87 Max. :28.39 Max. :28.46 Max. :28.66
#> NA's :24 NA's :39 NA's :38 NA's :38
If the argument suplAnnotFile is set to TRUE, the files ‘summary.txt’ and ‘parameters.txt’ (produced by MaxQuant by default) will be searched in the same directory. If these files are available and seem to correspond to the quantiation date read in the main part of the function, supplemental information about experimental setup will be mined and added to the resulting object.
.
Proteome
Discoverer is commercial software from ThermoFisher
(www.thermofisher.com), see also Orsburn, 2021. Data
exported from Proteome
Discoverer can get imported (typically the xx_Proteins.txt
file) using readProteomeDiscovererFile()
, for details
please see the vignette “UPS-1 spike-in Experiments” also available with
this package.
The example below is just a toy data-set, normally one can identify and
quantify many more proteins.
fiNa <- "tinyPD_allProteins.txt.gz"
dataPD <- readProteomeDiscovererFile(file=fiNa, path=path1, suplAnnotFile=FALSE, plotGraph=FALSE)
#> readProteomeDiscovererFile : Adding supl annotation-columns 'Gene'
summary(dataPD$quant)
#> Abundance.S1rep1 Abundance.S1rep2 Abundance.S1rep3 Abundance.S2rep1
#> Min. :17.56 Min. :17.44 Min. :18.32 Min. :17.20
#> 1st Qu.:20.15 1st Qu.:20.21 1st Qu.:20.01 1st Qu.:20.28
#> Median :21.72 Median :21.72 Median :21.72 Median :21.72
#> Mean :21.75 Mean :21.83 Mean :21.71 Mean :21.90
#> 3rd Qu.:23.08 3rd Qu.:23.21 3rd Qu.:23.04 3rd Qu.:23.27
#> Max. :28.27 Max. :28.37 Max. :28.26 Max. :28.62
#> NA's :1 NA's :3 NA's :1 NA's :2
#> Abundance.S2rep2 Abundance.S2rep3
#> Min. :17.64 Min. :17.22
#> 1st Qu.:20.25 1st Qu.:20.11
#> Median :21.72 Median :21.72
#> Mean :21.86 Mean :21.78
#> 3rd Qu.:23.23 3rd Qu.:23.10
#> Max. :28.54 Max. :28.47
#> NA's :2
Please note, that quantitation data exported from ProteomeDiscoverer frequently have very generic column-names (increasing numbers). When calling the import-function they can be replaced by more meaningful names either using the argument sampNa (thus, much care should be taken on the order when preparing the vector sampleNames !), or from reading the default annotation in the file ‘InputFiles.txt’ (if exported) or, from sdrf-annotation (if available). In this case, supplemental information about experimental setup will be mined and added to the resulting object.
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way. For a more complex example of using readProteomeDiscovererFile() please see the vignette ‘UPS1 spike-in Experiments’ of this package.
Similarly it is possible to read the peptide-data files exported by
ProteomeDiscoverer using the function
readProtDiscovererPeptides()
. The data get imported to a
similar structure like the protein-level data, quantitations on peptide
level by default median-normalized, sample-setup from sdrf-files may be
added, too.
DIA-NN is free
software provided by the by Demichev, Ralser and Lilley labs, see also
Demichev et al,
2020. Typically DIA-NN allows exporting
quantitation data on level of consensus-proteins as tsv-formatted files.
Such data can get imported (and normalized) using
readDiaNNFile()
. The example below is just a toy data-set,
normally one can identify and quantify many more proteins.
diaNNFi1 <- "tinyDiaNN1.tsv.gz"
## This file contains much less identifications than one may usually obtain
path1 <- system.file("extdata", package="wrProteo")
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="HUMAN")
dataNN <- readDiaNNFile(path1, file=diaNNFi1, specPref=specPref1, tit="Tiny DIA-NN Data", plotGraph=FALSE)
summary(dataNN$quant)
#> 1 e2 e3
#> Min. :13.39 Min. :13.02 Min. :13.82
#> 1st Qu.:15.79 1st Qu.:15.83 1st Qu.:15.96
#> Median :17.30 Median :17.30 Median :17.30
#> Mean :17.23 Mean :17.38 Mean :17.43
#> 3rd Qu.:18.55 3rd Qu.:18.79 3rd Qu.:18.82
#> Max. :22.93 Max. :23.27 Max. :23.26
#> NA's :21 NA's :4
Similarly data from DIA-NN on peptide level
can get imported (and normalized) using
readDiaNNPeptides()
.
Proline is free
software provided by the Profi-consortium, see also Bouyssié et al
2020. Data exported from Proline (xlsx, csv or
tsv format) can get imported using readProlineFile()
. The
example below is just a toy data-set, normally one can identify and
quantify many more proteins.
path1 <- system.file("extdata", package="wrProteo")
fiNa <- "exampleProlineABC.csv.gz" # gz compressed data can be read, too
dataPL <- readProlineFile(file=fiNa, path=path1, plotGraph=FALSE)
summary(dataPL$quant[,1:8])
#> A_01.raw.mzDB.t.xml A_02.raw.mzDB.t.xml A_03.raw.mzDB.t.xml
#> Min. :14.11 Min. :14.06 Min. :14.03
#> 1st Qu.:19.11 1st Qu.:19.23 1st Qu.:19.15
#> Median :20.65 Median :20.65 Median :20.65
#> Mean :20.83 Mean :20.92 Mean :20.86
#> 3rd Qu.:22.35 3rd Qu.:22.43 3rd Qu.:22.47
#> Max. :28.51 Max. :28.59 Max. :28.54
#> NA's :29 NA's :27 NA's :21
#> A_04.raw.mzDB.t.xml B_01.raw.mzDB.t.xml B_02.raw.mzDB.t.xml
#> Min. :14.34 Min. :16.13 Min. :12.58
#> 1st Qu.:19.37 1st Qu.:19.18 1st Qu.:19.33
#> Median :20.65 Median :20.65 Median :20.65
#> Mean :20.99 Mean :20.78 Mean :20.85
#> 3rd Qu.:22.47 3rd Qu.:22.37 3rd Qu.:22.51
#> Max. :28.72 Max. :27.96 Max. :28.17
#> NA's :27 NA's :77 NA's :77
#> B_03.raw.mzDB.t.xml B_04.raw.mzDB.t.xml
#> Min. :15.70 Min. :13.85
#> 1st Qu.:19.16 1st Qu.:19.09
#> Median :20.65 Median :20.65
#> Mean :20.80 Mean :20.81
#> 3rd Qu.:22.38 3rd Qu.:22.42
#> Max. :28.13 Max. :28.23
#> NA's :76 NA's :73
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way. For a more complex example of using readProlineFile() please see the vignette ‘UPS1 spike-in Experiments’ from this package.
Fragpipe is a database
search tool for peptide identification, open-source developed by the Nesvizhskii lab, see eg Kong et al 2017, da Veiga Leprevost; et
al 2020 or other related publications. Data exported from Fragpipe (in tsv format) can
get imported using readFragpipeFile()
. The example below is
just a toy data-set, normally one can identify and quantify many more
proteins.
FPproFi1 <- "tinyFragpipe1.tsv.gz"
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="MOUSE")
dataFP <- readFragpipeFile(path1, file=FPproFi1, specPref=specPref1, tit="Tiny Fragpipe Example", plotGraph=FALSE)
#> readFragpipeFile : Count by 'specPref' : Bos taurus: 1 ; Homo sapiens: 5 ; Mus muscullus: 1 ; Mus musculus: 92 ; Sus scrofa: 1 ;
#> readFragpipeFile : Removing 24 lines/proteins removed as NOT passing protein identification filter at 0.99
summary(dataFP$quant)
#> A_1 A_2 B_1 B_2
#> Min. :15.09 Min. :13.73 Min. :14.48 Min. :15.53
#> 1st Qu.:17.89 1st Qu.:18.21 1st Qu.:18.78 1st Qu.:18.58
#> Median :19.94 Median :19.94 Median :19.94 Median :19.94
#> Mean :19.93 Mean :20.13 Mean :20.54 Mean :20.39
#> 3rd Qu.:21.43 3rd Qu.:21.68 3rd Qu.:21.92 3rd Qu.:21.78
#> Max. :30.64 Max. :30.41 Max. :30.32 Max. :29.99
#> NA's :11 NA's :10 NA's :15 NA's :14
#> B_3 B_4 C_1 C_2
#> Min. :13.31 Min. :13.96 Min. :14.77 Min. :14.74
#> 1st Qu.:18.57 1st Qu.:18.46 1st Qu.:18.51 1st Qu.:18.73
#> Median :19.94 Median :19.94 Median :19.94 Median :19.94
#> Mean :20.38 Mean :20.40 Mean :20.29 Mean :20.66
#> 3rd Qu.:21.80 3rd Qu.:21.94 3rd Qu.:21.63 3rd Qu.:21.84
#> Max. :30.16 Max. :30.46 Max. :30.17 Max. :30.63
#> NA's :14 NA's :14 NA's :12 NA's :12
#> C_3 C_4
#> Min. :13.94 Min. :14.89
#> 1st Qu.:18.36 1st Qu.:18.85
#> Median :19.94 Median :19.94
#> Mean :20.30 Mean :20.41
#> 3rd Qu.:21.78 3rd Qu.:21.70
#> Max. :29.89 Max. :30.46
#> NA's :10 NA's :12
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way.
MassChroQ is
free open software provided by the PAPPSO, see also Valot et al 2011.
Inital quantifications are on peptide basis and should be normalized and
summarized using the R-package MassChroqR, which is also publicly
available at the PAPPSO.
Quantifications at protein-level can be saved as matrix into an
RData-file or written to tsv, csv or txt files for following import into
the framework of this package using readMassChroQFile()
,
for details please see the help-page to this function. The example below
is just a toy data-set, normally one can identify and quantify many more
proteins.
MCproFi1 <- "tinyMC.RData"
dataMC <- readMassChroQFile(path1, file=MCproFi1, tit="Tiny MassChroq Example", plotGraph=FALSE)
#> readMassChroQFile : Loading R-object 'tinyMC' as quantification data out of C:/Users/wraff/AppData/Local/Temp/RtmpuMGvwt/Rinstfe037d3b3/wrProteo/extdata/tinyMC.RData
summary(dataMC$quant)
#> sampa0 sampa1 sampa2 sampa5
#> Min. :6.525 Min. :6.582 Min. :6.614 Min. :6.759
#> 1st Qu.:7.282 1st Qu.:7.276 1st Qu.:7.343 1st Qu.:7.316
#> Median :7.769 Median :7.769 Median :7.769 Median :7.769
#> Mean :7.718 Mean :7.691 Mean :7.724 Mean :7.742
#> 3rd Qu.:8.020 3rd Qu.:7.993 3rd Qu.:8.053 3rd Qu.:8.051
#> Max. :9.205 Max. :8.769 Max. :8.861 Max. :9.013
#> NA's :1 NA's :1 NA's :1 NA's :1
#> sampa6 sampa7
#> Min. :6.699 Min. :6.695
#> 1st Qu.:7.287 1st Qu.:7.284
#> Median :7.769 Median :7.769
#> Mean :7.716 Mean :7.718
#> 3rd Qu.:8.054 3rd Qu.:8.061
#> Max. :9.834 Max. :8.815
#> NA's :1 NA's :1
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way.
AlphaPept is a
free open-source search tool for peptide identification created by the
Mann-lab, see eg Strauss et al 2021.
Data exported from AlphaPept (in csv format) can get imported using
readAlphaPeptFile()
. The example below is just a toy
data-set, normally one can identify and quantify many more proteins.
APproFi1 <- "tinyAlpaPeptide.csv.gz"
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK")
dataAP <- readAlphaPeptFile(path1, file=APproFi1, specPref=specPref1, tit="Tiny AlphaPept Example", plotGraph=FALSE)
#> readAlphaPeptFile : convToNum : length(x) 128 class(x) character mode(x) character head 20527055.9907 11156676.1246 69877530.7494 69046565.9149 9278681.48595 58945043.9570
#> readAlphaPeptFile : convToNum : length(x) 128 class(x) character mode(x) character head 19545754.2 4638044.8 66308486.5 85844464.1 9132372.2 60070113.0
#> readAlphaPeptFile : .extrSpecPref : 'mainSpecies' Seem absent from 'specPref' !
#> readAlphaPeptFile : Note: 128 proteins with unknown species
#> readAlphaPeptFile : Use column '' as identifyer (has fewest, ie 0 duplicated entries) as rownames
summary(dataAP$quant)
#> 1 2
#> Min. :17.06 Min. :19.68
#> 1st Qu.:23.15 1st Qu.:23.35
#> Median :24.30 Median :24.35
#> Mean :24.42 Mean :24.52
#> 3rd Qu.:25.87 3rd Qu.:25.88
#> Max. :29.54 Max. :29.56
#> NA's :5 NA's :4
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way.
ionbot is a database search
tool for protein and peptide identification, developed by the ComPomics group headed by L
Martens, see eg Kong et al 2017 or
other related publications. Peptide identification and quantification
data exported from ionbot (in
tsv format) can get imported using readIonbotPeptides()
.
The example below is just a tiny toy data-set, normally one can identify
and quantify many more proteins.
path1 <- system.file("extdata", package="wrProteo")
fiIonbot <- "tinyIonbotFile1.tsv.gz"
datIobPep <- readIonbotPeptides(fiIonbot, path=path1)
#> readIonbotPeptides : .plotQuantDistr : Unknown log-status, assuming data is log2, ie notLogAbund=FALSE
summary(datIobPep$quant)
#> Intensity_QEKAC160601_34 Intensity_QEKAC160601_125 Intensity_QEKAC160601_119
#> Min. : -Inf Min. : -Inf Min. : -Inf
#> 1st Qu.:19.17 1st Qu.:19.11 1st Qu.:19.14
#> Median :20.44 Median :20.44 Median :20.44
#> Mean : -Inf Mean : -Inf Mean : -Inf
#> 3rd Qu.:22.14 3rd Qu.:22.09 3rd Qu.:22.13
#> Max. :28.60 Max. :28.42 Max. :28.58
#> Intensity_QEKAC160601_117 Intensity_QEKAC160601_71
#> Min. : -Inf Min. : -Inf
#> 1st Qu.:19.06 1st Qu.:19.17
#> Median :20.44 Median :20.44
#> Mean : -Inf Mean : -Inf
#> 3rd Qu.:22.12 3rd Qu.:22.14
#> Max. :28.48 Max. :28.43
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way.
Wombat-P is a free
open-source search tool for peptide identification created by an
Elixir-consortium, see also Bouyssie et al
2023. Data exported from Wombat-P (in csv format) can get imported
using readWombatNormFile()
. The example below is just a toy
data-set, normally one can identify and quantify many more proteins.
WBproFi1 <- "tinyWombCompo1.csv.gz"
## let's define the main species and allow tagging some contaminants
specPref1 <- c(conta="conta|CON_|LYSC_CHICK", mainSpecies="YEAST")
dataWB <- readWombatNormFile(path1, file=WBproFi1, specPref=specPref1, tit="Tiny Wombat-P Example", plotGraph=FALSE)
#> readWombatNormFile : Removing 1 lines since absent ID or all NA (won't be able to do anything lateron withour ID ..)
#> readWombatNormFile : Adjusting order of annot to have ID in 1st column
#> readWombatNormFile : Note: 102 proteins with unknown species
#> readWombatNormFile : Use column 'Accession' as identifyer (has fewest, ie 2 duplicated entries) as rownames
summary(dataWB$quant)
#> 1.fmol.CN.UPS1_2 10.amol.CN.UPS1_1 1.fmol.CN.UPS1_4 10.amol.CN.UPS1_3
#> Min. :15.00 Min. :18.50 Min. :18.58 Min. :18.05
#> 1st Qu.:22.23 1st Qu.:22.39 1st Qu.:22.42 1st Qu.:22.44
#> Median :23.73 Median :23.87 Median :23.90 Median :23.86
#> Mean :23.59 Mean :23.91 Mean :23.89 Mean :23.84
#> 3rd Qu.:25.22 3rd Qu.:25.34 3rd Qu.:25.37 3rd Qu.:25.31
#> Max. :27.92 Max. :29.02 Max. :29.04 Max. :29.13
#> NA's :4 NA's :3 NA's :3 NA's :3
#> 1.fmol.CN.UPS1_3 1.fmol.CN.UPS1_1 10.amol.CN.UPS1_2 10.amol.CN.UPS1_4
#> Min. :18.96 Min. :17.97 Min. :18.84 Min. :17.79
#> 1st Qu.:22.56 1st Qu.:22.28 1st Qu.:22.51 1st Qu.:22.38
#> Median :24.04 Median :23.83 Median :23.96 Median :23.83
#> Mean :24.06 Mean :23.84 Mean :24.07 Mean :23.79
#> 3rd Qu.:25.47 3rd Qu.:25.30 3rd Qu.:25.51 3rd Qu.:25.24
#> Max. :29.04 Max. :29.23 Max. :29.02 Max. :29.14
#> NA's :4 NA's :3 NA's :3 NA's :3
As descibed with MaxQuant, additional meta-data as sdrf can be imported in the same way.
OpenMS is free open software
provided by the deNBI Center for integrative Bioinformatics, see also Rost et al 2016. Peptide
level data exported as csv get summarized from peptide to protein level
and further normalized using readOpenMSFile()
, for details
please see the help-page to this function.
The project Proteomics Sample Metadata Format aims to provide a framework of providing a uniform format for documenting experimental meta-data (sdrf-format).
As mentioned at the section for reading MaxQuant, most import-functions from wrProteo can directly import (if available) the experimental setup from sdrf, or from files produced using the various quantitation software (as shown with MaxQuant.
To do this separately, or if you need to read an alternative
annotation file, you may use readSampleMetaData()
. If sdrf
annotation is available on Pride/github this information can be read and
directly integrated with software specific annotation using the
import-functions shown above or as shown below. Of course the user
should always make sure the annotation really corresponds to current the
experimental data !
When adding the quantitation-data using argument abund, the functions also checks if the number of samples fit and tries to align the order of the meta-data to that of the quantitation data (based on the raw files), since they are not necessarily in the same order.
MQsdrf001819Setup <- readSampleMetaData(quantMeth="MQ", sdrf="PXD001819", path=path1, suplAnnotFile="summary.txt.gz", abund=dataMQ$quant)
#> readSampleMetaData : PROBLEM : Meta-data and abundance data do not match ! Number of samples from summary.txt.gz (27) and from main data (27) do NOT match !! .. ignoring
#> readSampleMetaData : readSdrf : Successfully read 24 annotation columns for 27 samples
#> readSampleMetaData : Note : Some filenames contain '.raw', others do NOT; solved inconsistency ..
#> readSampleMetaData : Successfully adjusted order of sdrf to content of summary.txt.gz
#> readSampleMetaData : Choosing model 'combNonOrth' for evaluating replicate-structure (ie 9 groups of samples)
str(MQsdrf001819Setup)
#> List of 7
#> $ sdrfDat :'data.frame': 27 obs. of 24 variables:
#> ..$ source.name : chr [1:27] "Sample 1" "Sample 1" "Sample 1" "Sample 2" ...
#> ..$ characteristics.organism. : chr [1:27] "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" ...
#> ..$ characteristics.organism.part. : chr [1:27] "not available" "not available" "not available" "not available" ...
#> ..$ characteristics.disease. : chr [1:27] "not available" "not available" "not available" "not available" ...
#> ..$ characteristics.cell.type. : chr [1:27] "not applicable" "not applicable" "not applicable" "not applicable" ...
#> ..$ characteristics.mass. : chr [1:27] "2 mg" "2 mg" "2 mg" "2 mg" ...
#> ..$ characteristics.spiked.compound. : chr [1:27] "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...
#> ..$ characteristics.biological.replicate.: int [1:27] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ material.type : chr [1:27] "lysate" "lysate" "lysate" "lysate" ...
#> ..$ assay.name : chr [1:27] "run 1" "run 2" "run 3" "run 4" ...
#> ..$ technology.type : chr [1:27] "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" ...
#> ..$ comment.label. : chr [1:27] "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" ...
#> ..$ comment.instrument. : chr [1:27] "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" ...
#> ..$ comment.precursor.mass.tolerance. : chr [1:27] "5 ppm" "5 ppm" "5 ppm" "5 ppm" ...
#> ..$ comment.fragment.mass.tolerance. : chr [1:27] "0.8 Da" "0.8 Da" "0.8 Da" "0.8 Da" ...
#> ..$ comment.cleavage.agent.details. : chr [1:27] "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" ...
#> ..$ comment.modification.parameters. : chr [1:27] "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" ...
#> ..$ comment.modification.parameters..1 : chr [1:27] "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" ...
#> ..$ comment.modification.parameters..2 : chr [1:27] "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" ...
#> ..$ comment.technical.replicate. : int [1:27] 1 2 3 1 2 3 1 2 3 1 ...
#> ..$ comment.fraction.identifier. : int [1:27] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ comment.file.uri. : chr [1:27] "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw" "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R2.raw" "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R3.raw" "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_125amol_R1.raw" ...
#> ..$ comment.data.file. : chr [1:27] "UPS1_12500amol_R1.raw" "UPS1_12500amol_R2.raw" "UPS1_12500amol_R3.raw" "UPS1_125amol_R1.raw" ...
#> ..$ factor.value.spiked.compound. : chr [1:27] "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...
#> $ level : Named int [1:27] 1 1 1 2 2 2 3 3 3 4 ...
#> ..- attr(*, "names")= chr [1:27] "_S_a_m_p_l_e_1_" "_S_a_m_p_l_e_1_" "_S_a_m_p_l_e_1_" "_S_a_m_p_l_e_2_" ...
#> $ col : Named int 1
#> ..- attr(*, "names")= chr "source.name"
#> $ meth : chr "combNonOrth"
#> $ annotBySoft :'data.frame': 27 obs. of 9 variables:
#> ..$ Raw.file : chr [1:27] "UPS1_12500amol_R1" "UPS1_12500amol_R2" "UPS1_12500amol_R3" "UPS1_125amol_R1" ...
#> ..$ Experiment : chr [1:27] "12500amol_R1" "12500amol_R2" "12500amol_R3" "125amol_R1" ...
#> ..$ Enzyme : chr [1:27] "Trypsin/P" "Trypsin/P" "Trypsin/P" "Trypsin/P" ...
#> ..$ Variable.modifications: chr [1:27] "Oxidation (M);Acetyl (Protein N-term)" "Oxidation (M);Acetyl (Protein N-term)" "Oxidation (M);Acetyl (Protein N-term)" "Oxidation (M);Acetyl (Protein N-term)" ...
#> ..$ Fixed.modifications : chr [1:27] "Carbamidomethyl (C)" "Carbamidomethyl (C)" "Carbamidomethyl (C)" "Carbamidomethyl (C)" ...
#> ..$ Multi.modifications : logi [1:27] NA NA NA NA NA NA ...
#> ..$ ref : chr [1:27] "12500amol_1" "12500amol_2" "12500amol_3" "125amol_1" ...
#> ..$ grp : chr [1:27] "12500amol" "12500amol" "12500amol" "125amol" ...
#> ..$ grpInd : int [1:27] 1 1 1 2 2 2 3 3 3 4 ...
#> $ syncColumns : Named logi [1:2] TRUE FALSE
#> ..- attr(*, "names")= chr [1:2] "sdrfDat" "annotBySoft"
#> $ iniSdrfOrder: int [1:27] 1 2 3 4 5 6 7 8 9 10 ...
However, the recommended and most convenient way is to add/import meta-data directly when importing quantitation-data (eg using readMaxQuantFile(), readProteomeDiscovererFile(), etc).
If needed, function fuseProteomicsProjects()
allows
combining up to 3 separate data-sets previously imported using wrProteo.
The user should very carefully think how and why he wants to fuse
multiple separately imported data-sets, which might have their own
charteristics. Note, the function presented here does not re-normalize
the combined data, the user should investigate the data and decide on
suitable strategies for further normalization.
Data from different software may not contain exactely the same proteins or peptides, only the common identifiers are retained by this approach. The user should pay attention to which identifyer should be used and that they do not appear multiple times in the the same data-set. If, however, some IDs appear multiple times (ie as separate lines) in the same data-set, the corresponding numeric data will be summarized to a single line. This may may have notocable effect on the following biological interpretation.
Thus, it is very important to know your data and to understand when lines that appear with the same identifyers should/may be fused/summarized without doing damage to the later biological interpretation ! The user may specify for each dataset the colum out of the protein/peptide-annotation to use via the argument columnNa. Then, this content will be matched as identical match, so when combining data from different software special care shoud be taken !
path1 <- system.file("extdata", package="wrProteo")
dataMQ <- readMaxQuantFile(path1, specPref=NULL, normalizeMeth="median")
#> readMaxQuantFile : Note: Found 11 out of 1115 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> readMaxQuantFile : Transform 2845(9.5%) initial '0' values to 'NA'
#> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> readMaxQuantFile : Found 1 species name(s) appearing inside other ones, assume as truncated (eg Saccharomyces cerevis)
#> readMaxQuantFile : Note: 5 proteins with unknown species
#> data by species : Gallus gallus: 1, Homo sapiens: 49, Mus musculus: 1, Saccharomyces cerevisiae: 1047, Sus scrofa: 1,
#> readMaxQuantFile : Found 70 composite accession-numbers (eg P00761;P00761), truncating
#> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
dataMC <- readMassChroQFile(path1, file="tinyMC.RData", tit="Tiny MassChroq Example", plotGraph=FALSE)
#> readMassChroQFile : Loading R-object 'tinyMC' as quantification data out of C:/Users/wraff/AppData/Local/Temp/RtmpuMGvwt/Rinstfe037d3b3/wrProteo/extdata/tinyMC.RData
dataFused <- fuseProteomicsProjects(dataMQ, dataMC)
str(dataFused$quant)
#> num [1:82, 1:33] 20.7 22.3 29 26.8 30 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:82] "O13563" "O14467" "P00330" "P00358" ...
#> ..$ : chr [1:33] "dataMQ.12500amol_1" "dataMQ.12500amol_2" "dataMQ.12500amol_3" "dataMQ.125amol_1" ...
As mentioned, the aim of normalization is to remove bias in data not
linked to the original (biological) question. The import functions
presented above do already by default run global median normalization.
When choosing a normalization procedure one should reflect what
additional information may be available to guide normalization. For
example, it may be very useful to exclude classical protein contaminants
since they typically do not reflect the original biolocial material. In
overall, it is important to inspect results from normalization,
graphical display of histograms, boxplots or violin-plots to compare
distributions. Multiple options exist for normalizing data, please look
at the documentation provided with the import-functions introduced
above. Please note, that enrichment experiments (like IP) can quickly
pose major problems to the choice of normalization approaches. The
function normalizeThis()
from the package wrMisc is run
internally. It can be used to run additional normalization, if
needed.
Different normalization procedures intervene with different ‘aggressiveness’, ie also with different capacity to change the initial data. In general, it is suggested to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or disappear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-imputation to avoid introducing new bias in the group of imputed values.
In proteomics the quantitation of very low abundances is very challenging. Proteins which are absent or very low abundances typically appear in the results as 0 or NA. Frequantly this may be linked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) while other samples had a strong peak at the respective place which led to successful MS-2 identification. Please note, that the match-between-runs option in the various softwar options allows to considerably reduce the number of NAs. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.
Before replacing NA-values it is important to verify that such values
may be associated to absent or very low abundances. To do so, we suggest
to inspect groups of replicate-measurements using
matrixNAinspect()
. In particular, with multiple technical
replicates of the same sample it is supposed that any variability
observed is not linked to the sample itself. So for each NA that occurs
in the data we suggest to look what was reported for the same protein
with the other (technical) replicates. This brings us to the term of
‘NA-neighbours’ (quantifications for the same protein in replicates).
When drawing histograms of NA-neighbours one can visually inspect and
verify that NA-neighbours are typically low abundance values, however,
but not necessarily the lowest values observed in the entire
data-set.
## Let's inspect NA values as graphic
matrixNAinspect(dataMQ$quant, gr=grp9, tit="Histogram of Protein Abundances and NA-Neighbours")
#> stableMode : Method='density', length of x =1142, 'bandw' has been set to 47
So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may safely proceed to replacing them by low random values.
If one uses a unique (very) low value for NA-replacements, this will quickly pose a problem at the level of t-tests to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.
This package proposes several related strategies/options for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. The mean value for the Normal data can be taken from the median or mode of the NA-neighbour values, since (in case of technical replicetes) NA-neighbours tell us what these values might have been and thus we model a distribution around. Later in this vignette, a more elaborate version based on repeated implementations to obtain more robust results will be presented.
The function matrixNAneighbourImpute()
proposed in this
package offers automatic selection of these parameters, which have been
tested in a number of different projects. However, this choice should be
checked by critically inspecting the histograms of ‘NA-neighbours’ (ie
successful quantitation in other replicate samples of the same protein)
and the final resulting distribution. Initially all NA-neighbours are
extracted. It is also worth mentioning that in the majority of data-sets
encountered, such NA-neighbours form skewed distributions.
The successful quantitation of instances with more than one NA-values per group may be considered even more representative, but of course less sucessfully quntified values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.
The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful quantitation and NA values.
## MaxQuant simple NA-imputation (single round)
dataMQimp <- matrixNAneighbourImpute(dataMQ$quant, gr=grp9, tit="Histogram of Imputed and Final Data")
#> matrixNAneighbourImpute : stableMode : Method='density', length of x =1142, 'bandw' has been set to 47
#> matrixNAneighbourImpute : n.woNA= 26963 , n.NA = 2845
#> Imputing based on 'mode2' using mean= 21.49 and sd= 0.5
#>
#> matrixNAneighbourImpute : Plotting figure
However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then, the final group mean of imputed values may be even higher than the mean of another group with successful quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in a sample where all values for this protein were NA. To circumvent this problem there are 2 options : 1) one may use special filtering schemes to exclude such constellations from final results or 2) one could repeat replacement of NA-values numerous times.
The function testRobustToNAimputation() allows such repeated replacement of NA-values. For details, see also the following section.
For other packages dealing with missing values (NAs), please also look at the missing data task-view on CRAN.
The main aim of filtering in omic-data analysis is to remove proteins/genes/lines which are for obvious reasons not suitable for further analysis. Invalid or low quality measures are not suitable for good results and may thus be removed. Frequently additional information is used to justy the procedure of removing certain proteins/genes/lines.
One very common element in filtering is the observation that very low abundance measures are typically less precise than medium or high abundance values. Thus, a protein/gene with all abundance measures at the very low end of the global scale may as well just seem to change abundance due to the elevated variance of low abundance measures. However, most statitical tests are not very well prepared for elevated variance of low abundance measures. In consequence, it is common to remove or disqualify such proteins/genes/lines which are at high risk of yielding false positive results.
In the context of proteomics the number of samples with NAs (non-quantified peptides/proteins) for a given protein/peptide represents also an interesting starting point. If almost all values finally compared are a result of (random) imputation, any apparent change in abundanc of such proteins/peptides lay rather reflect rare stochastic events of NA-imputation. Please note, that rather aggressive filtering may severly reduce the ability to identify on/off situations which very well may occur in most biological settings.
General filtering can be performed using presenceFilt()
(from package wrMisc). Other
filtering of proteins/peptides/lines based on the annotation (eg for
hypothetical proteins etc) may done using filterLiColDeList()
(also from package wrMisc).
Initial information for filtering is already collected by the import-functions (readMaxQuantFile(), readProteomeDiscovererFile()_, readProlineFile(), readOpenMSFile() etc..). Then information for filtering can be used by the function combineMultFilterNAimput() which is integrated to testRobustToNAimputation() (see section below) to conveniently include filtering-aspects.
The t-test remains the main statistical test used, as in many other coases of omics, too. Statistical testing in the context of proteomics data poses challenges similar to transcriptomics : Many times the number of replicate-samples is fairly low and the inter-measurement variability quite high. In some unfortunate cases proteins with rather constant quantities may appear as false positives when searching for proteins who’s abundance changes between two groups of samples : If the apparent variability is by chance too low, the respective standard-deviations will be low and a plain t-test may give very enthusiastic p-values. Besides stringent filtering (previous section of this vignette), the use of shrinkage when estimating the intra-group/replicate variance from the Bioconductor package limma turns out very helpful, see also Ritchie et al 2015. In this package the function eBayes() has been used and adopted to proteomics.
The function testRobustToNAimputation()
allows running
multiple cycles of NA-imputation and statistical testing with the aim of
providing stable imputation and testing results. It performs
NA-imputation and statistical testing (after repeated imputation)
between all groups of samples the same time (as it would be inefficient
to separate these two tasks). The tests underneath apply shrinkage from
the empirical Bayes procedure from the bioconductor package limma.
In addition, various formats of multiple test correction can be directly
added to the results : Benjamini-Hochberg FDR, local false discovery
rate (lfdr, using the package fdrtool, see Strimmer 2008),
or modified testing by ROTS,
etc …
The fact that a single round of NA-imputation may provoke false positives as well as false negatives, made it necessary to combine this (iterative) process of NA-imputation and subsequent testing in one single function.
## Impute NA-values repeatedly and run statistical testing after each round of imputations
testMQ <- testRobustToNAimputation(dataMQ, gr=grp9)
#> testRobustToNAimputation : matrixNAneighbourImpute : stableMode : Method='density', length of x =1142, 'bandw' has been set to 47
#> testRobustToNAimputation : matrixNAneighbourImpute : n.woNA= 26963 , n.NA = 2845
#> Imputing based on 'mode2' using mean= 21.49 and sd= 0.5
#>
## Example of the data after repeated NA-imputation
head(testMQ$datImp[,1:6])
#> 12500amol_1 12500amol_2 12500amol_3 125amol_1 125amol_2 125amol_3
#> A5Z2X5 23.72618 23.71144 23.65240 23.76086 24.03255 23.70416
#> P00761 28.82344 28.77536 28.84650 28.70372 28.70950 28.75524
#> P01966 21.15903 21.01804 20.95925 21.53529 21.50786 21.47617
#> P02768 24.65225 24.20638 24.55969 15.69368 16.19980 14.93828
#> P04264 21.44216 21.49767 21.43551 21.39250 21.44091 21.25765
#> P13645 21.51048 21.50719 21.40007 21.45895 21.47966 21.41829
Brielfy, principal components analysis (PCA) searches to decompose the data along all the axises defined by all samples. Then, the axis-combinations with the highest degree of correlation are searched. In principle one could also run PCA along the rows, ie the proteins, but their number is typically so high that the resultant plots get too crowded.
In the context of high throughput experiments, like proteomics, PCA allows to distinguish important information how the different samples are related (ie similar). This covers of course the real differences between different biological conditions, but also additional bias introduced as (technical) artifacts. Thus, such plots serve as well for quality control (in particular to identify outlyer-samples, eg due to degraded material) as well as for the biological interpretation.
Normally one could immediately check the normalized data by PCA before running statistical tests. As stated in other places, PCA can’t handle missing values (ie NA ). Thus, all proteins having one NA in just one sample won’t be considered during PCA. This would mask a significant number of proteins in numerous of proteomics experiments. Thus, it may be preferable to run PCA after NA-imputation. However, since in this package statistical testing was coupled to the repeated NA-imputation, it may be better to use the NA-imputations made for the statistical testing (in the section above).
Here we’ll use the function plotPCAw()
form the package
wrGraph
# limit to UPS1
plotPCAw(testMQ$datImp, sampleGrp=grp9, tit="PCA on Protein Abundances (MaxQuant,NAs imputed)", rowTyName="proteins", useSymb2=0)
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> plotPCAw : addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
#> addBagPlot : Keep 3 out of 3 and consider 0 as outliers
Please note, the vignette dedicated to spike-in experiments (“UPS-1 spike-in Experiments”) presents a slightly different way of making PCA-plots for this specific type of experiment/data-set.
MA-plots are mainly used for diagnostic purposes. Basically, an MA-plot displays the
log-Fold-Change versus the average abundance. We’ll use the function
MAplotW()
from the package wrGraph.
# By default this plots at the first of all pairwise questions
MAplotW(testMQ)
#> MAplotW : Successfully extracted 1104 Mvalues and 1104 Avalues plus annotation
#> MAplotW : filtered (based on 'filtFin') from 1104 to 950 lines
Now for the second group of pair-wise comparisons, plus adding names of proteins passing threshold:
res1 <- NULL
MAplotW(testMQ, useComp=2, namesNBest="passFC")
#> MAplotW : Successfully extracted 1104 Mvalues and 1104 Avalues plus annotation
#> MAplotW : filtered (based on 'filtFin') from 1104 to 978 lines
A Volcano-plot allows to compare the simple fold-change (FC) opposed to the outcome of a statistcal test. Frequently we can obsereve, that a some proteins show very small FC but enthousiastic p-values and subsequently enthousiastic FDR-values. However, generally such proteins with so small FC don’t get considered as reliable results, therefore it is common practice to add an additional FC-threshold, typically a 1.5 or 2 fold-change.
The number of proteins retained by pair-wise comparison :
## by default the first pairwise comparison is taken
## using the argument 'namesNBest' we can add names from the annotation
VolcanoPlotW(testMQ, useComp=2, namesNBest="passFDR")
#> VolcanoPlotW : Using element 'BH' as FDR-values for plot
#> VolcanoPlotW : Successfully extracted 1104 Mvalues and 1104 pValues
Additional Note : Volcano-plots may also help identifying bias in the data, in particular, to the question if normalization gave satisfactory results. Based on the hypothesis of no global change used for normalization, normally, one would expect about the same number of down-regulated as up-regulated proteins.
In fact, this experiment is somehow unusual since one set of samples got a strong increase in abundance for 48 UPS1 proteins while the other proteins remained constant. Thus, on the global scale there may be a (small) imbalance of abundances and the global median will reflect this, which can create some bias. So, in this special case it might be better to perform normalization only based on the yeast proteins (which are assumed as constant), as it has been performed in the vignette ‘UPS-1 spike-in Experiments’, a vignette which is entirely dedicated to UPS1 spike-in experiments.
Tables with results can be either directed created using
VolcanoPlotW() or, as shown below, using the function
extractTestingResults()
.
For example, let’s look at the first of the pair-wise comparisons (the Volcano-plot above shwed another pait-wise comparison): The moderated t-test expressed as Benjamini-Hochberg FDR gave 125 proteins with FDR < 0.05 for the comparison 1-2. Since unfortunately many verly low fold-change instances are amongst the results, one should add an additional filter for too low FC values. This is common practice in most omics analysis when mostly technical replicates are run and/or the number of replicates is rather low.
res1 <- extractTestingResults(testMQ, compNo=1, thrsh=0.05, FCthrs=2)
After FC-filtering for 2-fold (ie change of protein abundance to double or half) 12 proteins remain.
knitr::kable(res1[,-1], caption="5%-FDR (BH) Significant results for 1st pairwise set", align="c")
EntryName | GeneName | FDR | logFC.1-2 | av.1 | av.2 | av.3 | av.4 | av.5 | av.6 | av.7 | av.8 | av.9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
P02768 | ALBU_HUMAN_UPS | ALB | 0.0000000 | -8.86218 | 24.4728 | 15.6106 | 25.6550 | 20.9497 | 15.8247 | 26.7898 | 22.8419 | 15.8119 | 15.4360 |
P00441 | SODC_HUMAN_UPS | SOD1 | 0.0001366 | -2.52883 | 23.2696 | 20.7408 | 24.0286 | 20.7999 | 21.0287 | 26.1312 | 22.5730 | 20.9296 | 20.8630 |
P02144 | MYG_HUMAN_UPS | MB | 0.0000515 | -4.27243 | 23.6040 | 19.3316 | 24.6858 | 20.8111 | 21.4857 | 26.1499 | 22.2525 | 21.7257 | 21.4153 |
P06732 | KCRM_HUMAN_UPS | CKM | 0.0040671 | -6.07530 | 23.2133 | 17.1380 | 24.8177 | 20.3664 | 21.0183 | 25.9850 | 21.5912 | 19.8332 | 21.2603 |
P0CX81 | MTCU2_YEAST | C | 0.0217374 | 1.82053 | 23.9388 | 25.7594 | 25.0909 | 24.6317 | 25.5004 | 24.9255 | 25.2460 | 25.4509 | 26.2346 |
P12081 | SYHC_HUMAN_UPS | HARS | 0.0000000 | -3.10841 | 23.2354 | 20.1270 | 24.5269 | 20.6061 | 20.0177 | 26.0028 | 21.7845 | 20.2674 | 20.7059 |
P16083 | NQO2_HUMAN_UPS | NQO2 | 0.0002334 | -7.63434 | 24.2829 | 16.6486 | 25.5550 | 20.3500 | 18.1979 | 27.1847 | 21.6031 | 19.5608 | 16.8013 |
P40518 | ARPC5_YEAST | ARC15 | 0.0001238 | 1.34381 | 22.4921 | 23.8359 | 22.3436 | 22.3461 | 23.4080 | 22.4787 | 22.4556 | 23.4558 | 23.3886 |
P48015 | GCST_YEAST | GCV1 | 0.0463851 | -1.31805 | 23.7550 | 22.4369 | 22.8775 | 24.3904 | 23.2207 | 23.8159 | 23.5783 | 23.8336 | 22.2663 |
P63165 | SUMO1_HUMAN_UPS | NA | 0.0092354 | -3.81835 | 23.4815 | 19.6632 | 24.7151 | 19.5454 | 20.4615 | 26.0607 | 21.4671 | 20.4616 | 21.3713 |
Q03677 | MTND_YEAST | ADI1 | 0.0133064 | 1.30599 | 21.3911 | 22.6971 | 22.2767 | 21.3213 | 22.5188 | 20.7517 | 20.7684 | 21.9049 | 22.6539 |
Q06830 | PRDX1_HUMAN_UPS | PRDX1 | 0.0000001 | -4.64343 | 23.9604 | 19.3170 | 25.3546 | 21.3546 | 19.1452 | 26.5003 | 21.9310 | 19.2336 | 19.3971 |
Please note that the column-name ‘BH’ referrs to Benjamini-Hochberg FDR (several other options of multiple testing correction exist, too). We can see that many UPS1 proteins are, as expected, among the best-ranking differential changes. However, not all UPS1 proteins do show up in the results as expected, and furthermore, a number of yeast proteins (however expected to remain constant !) were reported as differential, too.
The function extractTestingResults() also allows to write the data shown above directly to a csv-file.
.
In case of standard projects one typically would like to find out more about the biological context of the proteins retained at statistical analysis, their function and their interactors. Such a list of significant proteins from a given project could be tested lateron for enrichment of GO-functions or for their inter-connectivity in PPI networks like String. There are multiple tools available on Bioconductor and CRAN as well as outside of R to perform such analysis tasks.
In case of UPS1 spike-in experiments the subsequent analysis is different. Suggestions for in depth-analysis of UPS1 spike-in are shown in the vignette ‘UPS-1 spike-in Experiments’ of this package.
.
In most ‘Omics’ activities getting additional annotation may get sometimes a bit tricky. In Proteomics most mass-spectrometry software will use the informaton provided in the Fasta-file as annotation (typically as provided from UniProt). But this lacks for example chromosomal location information. There are are many repositories with genome-, gene- and protein-annotation and most of them are linked, but sometimes the links get broken when data-base updates are not done everywhere or are not followed by new re-matching. The Fasta-files used initially for mass-spectrometry peak-identification may be not completely up to date (sometimes gene- or protein-IDs do change or may even disappear) and thus will contribute to a certain percentage of entries hard to link.
Globally two families of strategies for adding annotation exist :
Strategies using online-based ressources for getting the most up-to-date information/annotation (like biomaRt). Despite the advantage of most up-to-date information there may be some downsides : Interogating databases may require more time to run all queries via interet-connections and this strategy is vulnerable to broken links (eg linked to the delay of updates between different types of databases that may need to get combined). Furthermore, the results typically may change a bit when the queries get repeated (in particular when this contains hypothetical peptides, pseudogenes etc). When combining multiple interconnected ressources it may be very tricky to document the precise version of all individual ressources used.
Strategies based on using (local copies of) defined versions of databases. Frequently, these databases can get downloaded/installed locally and thus allow faster queries and guarantee of repeatability and comparability to other tools or studies. A number of databases are available on Bioconductor formatted for R. Besides, the tables from UCSC are another option (which will be used here). Note, that tracking version-numbers may be much easier using this approach based on defined versions of databases. And finally results are 100% reproducible when the same versions much easier to document are used.
In the context of adding chromosomal annotation to a list of proteins here the following concept is developed : Annotation-tables from UCSC are available for a decent number of species and can be downloaded for conventient off-line search. However, in the context of less common species we realized that the UniProt tables from UCSC had many times low yield in final matching. For this reason we propose the slightly more complicated route that provided finally a much higher success-rate to find chromosomal locations for a list of UniProt IDs. First one needs to acces/download from UCSC the table corresponding to the species of question fields ‘clade’,‘genome’,‘assembly’). For ‘group’ choose ‘Genes and Gene Predictions’ and for ‘track’ choose ‘Ensembl Genes’, as table choose ‘ensGene’. In addition, it is possible to select either the entire genome-annotation or user-specified regions. In terms of ‘output-format’ one may choose ‘GTF’ (slightly more condensed, no headers) or ‘all filds from selected table’.
The following strategy for adding genomic location data using this package is presented here :
Locate (& download) organism annotation from UCSC, read into R (readUCSCtable() ) -> from R export (non-redundant) ‘enst’-IDs (still using readUCSCtable() ), get corresponding UniProt-IDs at UniProt site, save as file and import result into R (readUniProtExport() ) -> (in R) combine with initial UCSC table (readUniProtExport() ) .
The function readUCSCtable()
is able to read such files
downloaded from UCSC, compressed .gz files can be read, too (like in the
example below). In the example below we’ll just look at chromosome 11 of
the human genome - to keep this example small.
path1 <- system.file("extdata", package="wrProteo")
gtfFi <- file.path(path1, "UCSC_hg38_chr11extr.gtf.gz")
UcscAnnot1 <- readUCSCtable(gtfFi)
#> readUCSCtable : File 'UCSC_hg38_chr11extr.gtf.gz' is gtf : TRUE
#> readUCSCtable : simplifed from 2000 to 223 non-redundant gene_id
# The Ensemble transcript identifyers and their chromosomal locations :
head(UcscAnnot1)
#> gene_id chr start end strand frame
#> [1,] "ENST00000270115.8" "chr11" "537527" "554912" "+" "."
#> [2,] "ENST00000308020.6" "chr11" "450279" "491399" "+" "."
#> [3,] "ENST00000311189.8" "chr11" "532242" "535576" "-" "."
#> [4,] "ENST00000312165.5" "chr11" "278570" "285304" "+" "."
#> [5,] "ENST00000325113.8" "chr11" "196738" "200258" "+" "."
#> [6,] "ENST00000325147.13" "chr11" "202924" "207382" "-" "."
However, this annotation does not provide protein IDs. In order to obtain the corresponding protein IDs an additional step is required : Here we will use the batch search/conversion tool from UniProt. In order to do so, we can export directly from readUCSCtable() a small text-file which can be fed into the UniProt batch-search tool.
# Here we'll redo reading the UCSC table, plus immediatley write the file for UniProt conversion
# (in this vignette we write to tempdir() to keep things tidy)
expFi <- file.path(tempdir(),"deUcscForUniProt2.txt")
UcscAnnot1 <- readUCSCtable(gtfFi, exportFileNa=expFi)
#> readUCSCtable : File 'UCSC_hg38_chr11extr.gtf.gz' is gtf : TRUE
#> readUCSCtable : simplifed from 2000 to 223 non-redundant gene_id
#> readUCSCtable : Write to file : ENST00000270115, ENST00000308020, ENST00000311189, ENST00000312165 ...
#> readUCSCtable : Exporting file 'C:/Users/wraff/AppData/Local/Temp/RtmpMxiL0A/deUcscForUniProt2.txt' for conversion on https://www.uniprot.org/id-mapping/
Now everything is ready to go to UniProt for retrieving the corresponding UniProt-IDs. Since we exported Ensemble transcript IDs (ENSTxxx), select converting from ‘Ensembl Transcript’ to ‘UniProtKB’. Then, when downloading the conversion results, choose tab-separated file format (compression is recommended), this may take several seconds (depending on the size).
It is suggested to rename the downloaded file so one can easily
understand its content. Note, that the function
readUniProtExport()
can also read .gz compressed files. To
continue this vignette we’ll use a result which has been downloaded from
UniProt and renamed to
‘deUniProt_hg38chr11extr.tab’. One may also optionally define a specific
genomic region of interest using the argument ‘targRegion’, here the
entire chromosome 11 was chosen.
deUniProtFi <- file.path(path1, "deUniProt_hg38chr11extr.tab")
deUniPr1 <- readUniProtExport(UniP=deUniProtFi, deUcsc=UcscAnnot1, targRegion="chr11:1-135,086,622")
#> readUniProtExport : low yield matching 'enst00000410108', 'enst00000342878' and 'enst00000325113,enst00000525282' and 'ENST00000270115', 'ENST00000308020' and 'ENST00000311189' convert all to lower case and remove version numbers ('xxx.2') for better matching
#> readUniProtExport : intergrating genomic information for 88 entries (14 not found)
#> readUniProtExport : 88 IDs in output
str(deUniPr1)
#> 'data.frame': 88 obs. of 15 variables:
#> $ EnsTraID : chr "enst00000410108" "enst00000342878" "enst00000325113,enst00000525282" "enst00000342593" ...
#> $ UniProtID : chr "B8ZZS0" "Q8TD33" "Q96PU9" "F8W6Z3" ...
#> $ Entry.name : chr "B8ZZS0_HUMAN" "SG1C1_HUMAN" "ODF3A_HUMAN" "F8W6Z3_HUMAN" ...
#> $ Status : chr "unreviewed" "reviewed" "reviewed" "unreviewed" ...
#> $ Protein.names: chr "BET1-like protein" "Secretoglobin family 1C member 1 (Secretoglobin RYD5)" "Outer dense fiber protein 3 (Outer dense fiber of sperm tails protein 3) (Sperm tail protein SHIPPO 1) (Transcr"| __truncated__ "Outer dense fiber protein 3" ...
#> $ Gene.names : chr "BET1L" "SCGB1C1" "ODF3 SHIPPO1 TISP50" "ODF3" ...
#> $ Length : int 152 95 254 127 102 111 92 58 88 148 ...
#> $ ID : logi NA NA NA NA NA NA ...
#> $ chr : chr "chr11" "chr11" NA "chr11" ...
#> $ start : num 167784 193078 NA 197303 202924 ...
#> $ end : num 207383 194575 NA 200033 207382 ...
#> $ strand : chr "-" "+" NA "+" ...
#> $ frame : chr "." "." NA "." ...
#> $ avPos : num 187584 193826 NA 198668 205153 ...
#> $ inTarg : logi FALSE FALSE NA FALSE FALSE FALSE ...
The resulting data.frame (ie the column ‘UniProtID’) may be used to complement protein annotation after importing mass-spectrometry peak- and protein-identification results. Obviously, using recent Fasta-files from UniProt for protein-identification will typically give better matching at the end.
You may note that sometimes Ensemble transcript IDs are written as ‘enst00000410108’ whereas at other places it may be written as ‘ENST00000410108.5’. The function readUniProtExport() switches to a more flexible search mode stripping of version-numbers and reading all as lower-caps, if initial direct matching reveals less than 4 hits.
Finally, it should be added, that of course several other ways of retrieving annotation do exist, too. For example, as mentioned above, Bioconductor) offers serval packages dedicated to gene- and protein-annotation.
The author would like to acknowledge the support by the IGBMC (CNRS UMR 7104, Inserm U 1258, UdS), CNRS, Université de Strasbourg (UdS) and Inserm. All collegues from the proteomics platform at the IGBMC work very commited to provide high quality mass-spectrometry data (including some of those used here). The author wishes to thank the CRAN-staff for all their help with new entries and their efforts in maintaining this repository of R-packages. Furthermore, many very fruitful discussions with colleages on national and international level have helped to improve the tools presented here.
Thank you for you interest. This package is constantly evolving, new featues/functions may get added to the next version.
For completeness :
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=French_France.utf8
#> [3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
#> [5] LC_TIME=French_France.utf8
#>
#> time zone: Europe/Paris
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rmarkdown_2.27 knitr_1.47 wrGraph_1.3.7 wrProteo_1.12.0
#> [5] wrMisc_1.15.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.5 limma_3.60.3 jsonlite_1.8.8 dplyr_1.1.4
#> [5] compiler_4.4.0 highr_0.11 Rcpp_1.0.12 tidyselect_1.2.1
#> [9] stringr_1.5.1 jquerylib_0.1.4 splines_4.4.0 scales_1.3.0
#> [13] yaml_2.3.8 fastmap_1.2.0 statmod_1.5.0 plyr_1.8.9
#> [17] ggplot2_3.5.1 R6_2.5.1 generics_0.1.3 fdrtool_1.2.17
#> [21] tibble_3.2.1 munsell_0.5.1 sm_2.2-6.0 bslib_0.7.0
#> [25] pillar_1.9.0 RColorBrewer_1.1-3 R.utils_2.12.3 rlang_1.1.4
#> [29] utf8_1.2.4 stringi_1.8.4 cachem_1.1.0 xfun_0.45
#> [33] sass_0.4.9 cli_3.6.3 magrittr_2.0.3 digest_0.6.36
#> [37] grid_4.4.0 lifecycle_1.0.4 R.oo_1.26.0 R.methodsS3_1.8.2
#> [41] vctrs_0.6.5 qvalue_2.36.0 evaluate_0.24.0 glue_1.7.0
#> [45] fansi_1.0.6 colorspace_2.1-0 reshape2_1.4.4 tools_4.4.0
#> [49] pkgconfig_2.0.3 htmltools_0.5.8.1