Input Preparation Workflow

Dina Schuster

2024-10-22

Introduction

This vignette will give you an overview of how you can prepare the quantitative protein/peptide matrix output from common search engines and software such as Spectronaut, MaxQuant, Proteome Discoverer and Skyline for the analysis with protti. Due to its modular and flexible structure protti can be used on the output of common bottom-up proteomics search engines irrespective of the measurement mode (DDA, DIA, targeted-MS).

Furthermore, you are not only restricted to reports from the above mentioned search engines. As long as your data has a tabular format (data frame) and a specific minimal number of data columns you can analyse it with protti. The columns minimally required contain information on sample, condition, intensity, protein ID and the level the intensity is based on (fragment, precursor, peptide) if different from protein intensity. Depending on the analysis many more columns can be useful, but they are not required. Ultimately, your data should have a structure similar to this:

Sample Protein ID Peptide Sequence Condition Intensity
sample1 P62942 PEPTIDER treated 14000
sample2 P62942 PEPTI treated 15000
sample3 P62942 PEPTIDE treated 14500
sample4 P62942 PEPTIDER control 18000
sample5 P62942 PEPTI control 21000
sample6 P62942 PEPTIDE control 19000

It is very important, that each unit of the level you perform your analysis on (e.g. peptide) has a single unique intensity associated with it. If, for example, a peptide has two different intensities, protti would not know how to deal with this and many functions will likely fail.

Data should always be organised in a format called tidy data. That means data should be contained in a long format (e.g. all sample names in one column) rather than a wide format (e.g. each sample name in its own column with intensity as the content of the columns). You can easily achieve this by using the pivot_longer() function from the tidyr package. The output of many search engines already contains tidy data and working with it is very easy because you can refer to information with only one variable. protti is designed to work together well with the tidyverse package family that is build around the concept of tidy data.

Protein-centric analysis

Many search engines provide the user with protein intensities. However, it is also possible to calculate protein intensities directly from precursor intensities with the protti function calculate_protein_abundance(). Protti implements the "iq" method, previously implemented in the R package iq which performs protein quantification based on the maximal peptide ratio extraction algorithm adapted from the MaxLFQ algorithm (Cox, J. 2013).

One advantage of calculating the protein abundance with protti is the possibility to median normalise run intensities on the precursor level. This is closer to the actually acquired intensities and thus sample concentrations than if normalisation is performed on the protein level. Some search engines provide the option for automatic median normalisation but not all. Furthermore, some search engines calculate protein intensities by summation of precursor intensities irrespective of missingness of peptides in certain samples. In these cases the maximal peptide ratio implemented in extraction algorithm provides a more robust calculation of protein intensities.

If you prefer to use protein intensities provided by the seach engine of your choice this is not a problem and we will show how some of this information can be converted into the right format.

Loading packages

We will demonstrate how most outputs can be converted with functions from the R packages magrittr, dplyr, tidyr and stringr. You can load packages after you installed them with the library() function.

library(magrittr)
library(dplyr)
library(tidyr)
library(stringr)

Note that we are using the R package magrittr because of its pipe operator %>%. It takes the output of the preceding function and supplies it as the first argument of the following function. Using %>% makes code easier to read and follow.

Spectronaut

Spectronaut reports already contain data in the tidy data format. Therefore nothing needs to be changed in order to use them with protti. However, the columns we would recommend (not all columns are required) to export from Spectronaut are:

Please make sure that the report is a .csv file. You can use the read_protti() function in order to load the spectronaut report into R. This function is a wrapper around the fast fread() function from the data.table package and the clean_names() function from the janitor package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case. This will make it easier to remember them and to use them in your data analysis. For the Spectronaut columns R.FileName will change for example into r_file_name.

# To read in your own data you can use read_protti()
spectronaut_data <- read_protti(filename = "mydata/spectronaut.csv")

MaxQuant

Depending on which analysis you are performing you will have to use different outputs. For peptide-centric analyses we would recommend to use the evidence.txt file. If you want to perform a protein-centric analysis and you want to use protein quantities calculated by MaxQuant, you need the proteinGroups.txt file. However, you can also apply the maximal peptide ratio extraction algorithm from the iq R package implemented in the protein_abundance_calculation() function of protti. This allows you to only use the evidence.txt file. The resulting protein intensities are identical since they were calculated with the same algorithm.

Peptide-centric analysis/LiP-MS analysis

In case you are interested in performing a peptide-centric analysis (necessary for LiP-MS), you should use the evidence.txt file provided in the search output of MaxQuant.

The evidence.txt file basically contains all the information we need to run protti. It is also contained in a long format which makes it easy to read in and use directly. One thing to take into consideration is the lack of a column for information on proteotypicity of peptides. However, this information can be inferred from the Proteins column if it contains more than one protein ID. You can extract this information and create a new column called is_proteotypic containing logicals that will be TRUE if the Proteins column does not contain a semicolon and FALSE if it does (this indicates that the peptide belongs to more than one protein). As mentioned in the data analysis vignettes this information is necessary for the analysis of LiP-MS data but it could be also considered for the correct calculation of protein abundances.

Another column that is required for the analysis of your data is a column indicating conditions to which certain samples belong. This can be easily added to the evidence file by joining a data frame containing the specific annotations. You can create such a data frame in Excel and import it into R for a large number of samples or just create it directly in R.

MaxQuant output provides information on decoy hits contained in the column reverse and also has information on whether your hit is a contaminant potential_contaminant. You should filter these out before the analysis. However, the contaminant column can be used for quality control.

One important thing for MaxQuant data is to make sure that you only have one intensity assigned to each peptide or precursor. You can do this by summing up all intensities that MaxQuant exports (these can be MULTI-MSMS, MSMS, ISO-MSMS, MULTI-MATCH, ISO-SECPEP) or you can filter for example for precursors with MULTI-MSMS quantification and only use these.

In this section we will show you how to read in the file with read_protti() and how to create the is_proteotypic column and the condition column (minimally required) with the help of the stringr and dplyr packages. How to filter your data best is described in the data analysis vignettes.

# To read in your own data you can use read_protti()
evidence <- read_protti(filename = "yourpath/evidence.txt")

evidence_proteotypic <- evidence %>%
  # adds new column with logicals that are TRUE if the peptide can be assigned
  # to only one protein and FALSE if it can be assigned to multiple
  mutate(is_proteotypic = str_detect(
    string = proteins,
    pattern = ";",
    negate = TRUE
  )) %>%
  # adds new column with logicals indicating if peptide is coming from a potential contaminant
  mutate(is_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE))

# Make an annotation data frame and merge it with your data frame to obtain conditions
# We are annotating sample 1-3 as controls and samples 4-6 as treated conditions

file_name <- c( # make sure that the names are the same name as in your report
  "sample1",
  "sample2",
  "sample3",
  "sample4",
  "sample5",
  "sample6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
evidence_annotated <- evidence_proteotypic %>%
  left_join(y = annotation, by = "file_name")

Protein-centric analysis

For protein-centric analyses you can use the proteinGroups.txt file provided by MaxQuant. This file contains information in a wide format where each sample has its own column containing intensity values. Therefore, we need to transform this data into a long format to meet the conditions of tidy data.

We will filter the data and use tidyr’s pivot_longer() to change the format to long format. Furthermore, we produce an annotation data frame to create a conditions column. The filtering is only done in order to remove proteins with potentially low quality. Further filtering for decoys and potential contaminants should be performed based on the data analysis vignettes.

# To read in your own data you can use read_protti()
protein_groups <- read_protti(filename = "yourpath/proteinGroups.txt") %>%
  # adds new column with logicals indicating if protein is a potential contaminant,
  # you can filter these out later on. You should also consider filtering out proteins
  # that were "only identified by site" and reverse hits, as well as proteins with only
  # one identified peptide
  mutate(is_potential_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE))

# Change wide format to long format and create new columns called `r_file_name`and `intensity`
protein_groups_long <- protein_groups %>%
  pivot_longer(
    cols = starts_with("intensity_"),
    names_to = "file_name",
    values_to = "intensity"
  )

# Make an annotation data frame and merge it with your data frame to obtain conditions
# We are annotating sample 1-3 as controls and samples 4-6 as treated conditions

file_name <- c( # make sure that the names are the same name as in your report
  "intensity_sample1",
  "intensity_sample2",
  "intensity_sample3",
  "intensity_sample4",
  "intensity_sample5",
  "intensity_sample6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
protein_groups_annotated <- protein_groups_long %>%
  left_join(y = annotation, by = "file_name")

Skyline

The Skyline output is already in long format, however, to process it you need to sum up the transition intensities to obtain the intensity of one precursor. If you prefer to analyse your data on the fragment level, you should create a column that uniquely identifies each fragment of each precursor. You could do that by pasting together the peptide sequence with the charge and the product m/z.

The required Skyline output columns include:

You can add replicate and condition annotations in Skyline directly. However, we will explain in this section how you can also do it in R. If you want to analyse your data on the protein abundance level you will have to combine the precursor intensities to obtain one value for protein abundance. This could be done using the calculate_protein_abundance() function from protti.

# Load data
skyline_data <- read_protti(filename = "yourpath/skyline.csv")

skyline_data_int <- skyline_data %>%
  # create a column with precursor information
  mutate(precursor = paste0(peptide_sequence, "_", charge)) %>%
  group_by(replicate_name, precursor) %>%
  # making a new column containing the summed up intensities of all transitions of one precursor
  mutate(sum_intensity = sum(area)) %>%
  select(-c(product_mz, area)) %>% # removing the columns we don't need
  distinct() # removing duplicated rows from the data frame

# Add annotation
# make sure that the names are the same name as in your report
replicate_name <- c(
  "sample_1",
  "sample_2",
  "sample_3",
  "sample_1",
  "sample_2",
  "sample_3"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(replicate_name, condition)

# Combine your long data frame with the annotation
skyline_annotated <- skyline_data_int %>%
  left_join(y = annotation, by = "replicate_name")

Proteome Discoverer

The Proteome Discoverer output contains data in wide format (one column for each sample). Similar to MaxQuant there is also the option for a peptide or a protein-centric export. We will discuss both cases in this segment.

Peptide-centric analysis/LiP-MS analysis

For a peptide-centric or a LiP-MS analysis please export the “Peptide Groups” report. Before preparing your export you can add the column “sequence” to your table otherwise Proteome Discoverer will only export the “annotated sequence” column which includes the preceding and following amino acids in the protein sequence.

The required columns include:

After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.

We will read in the file using read_protti() and then select the columns we are interested in. You can use the contaminant column for qualitiy control. The number_proteins column contains information on the proteotypicity. If this is 1 then the peptide is proteotypic. If you want to analyse your data qualitatively only with quality control functions of protti you can keep peptides without quantifications. Before you start your quantitative analysis remove observations that are labeled "No Quan Values" in the quan_info column. In the below example they are filtered out at this step, but you can keep them and only filter them out later.

# Load data
pd_pep_data <- read_protti("yourpath/PDpeptides.csv")

# Select relevant columns
pd_pep_selected <- pd_pep_data %>%
  select(
    sequence,
    modifications,
    number_proteins,
    contaminant,
    master_protein_accessions,
    starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped"
    quan_info
  )

# Filter data frame
pd_pep_filtered <- pd_pep_selected %>%
  filter(contaminant == FALSE) %>% # remove annotated contaminants
  filter(number_proteins == 1) %>% # select proteotypic peptides
  filter(quan_info != "No Quan Values") # remove peptides that have no quantification values

# Convert into long format
pd_pep_long <- pd_pep_filtered %>%
  pivot_longer(
    cols = starts_with("abundances"),
    names_to = "file_name",
    values_to = "intensity"
  ) %>%
  # combine peptide sequence and modifications to make a precursor column
  mutate(precursor = paste(sequence, modifications))

# Make annotation data frame
file_name <- c( # make sure that the names are the same name as in your report
  "abundances_grouped_f1",
  "abundances_grouped_f2",
  "abundances_grouped_f3",
  "abundances_grouped_f4",
  "abundances_grouped_f5",
  "abundances_grouped_f6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
pd_pep_long_annotated <- pd_pep_long %>%
  left_join(y = annotation, by = "file_name")

Protein-centric analysis

For a protein-centric or analysis please export the “Proteins” report.

The required columns include:

After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.

We will read in the file using read_protti() and then select the columns we are interested in. Similar to above you can either filter the contaminant and number_peptides columns now or later.

# Load data
pd_prot_data <- read_protti("yourpath/PDproteins.csv")

# Select relevant columns
pd_prot_selected <- pd_prot_data %>%
  select(
    accession,
    description,
    contaminant,
    number_peptides,
    starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped"
  )

# Filter data frame
pd_prot_data_filtered <- pd_prot_selected %>%
  filter(contaminant == FALSE) %>% # remove annotated contaminants
  filter(number_peptides > 1) # select proteins with more than one identified peptide

# Convert into long format
pd_prot_long <- pd_prot_data_filtered %>%
  pivot_longer(
    cols = starts_with("abundances"),
    names_to = "file_name",
    values_to = "intensity"
  )

# Make annotation data frame
file_name <- c( # make sure that the names are the same name as in your report
  "abundances_grouped_f1",
  "abundances_grouped_f2",
  "abundances_grouped_f3",
  "abundances_grouped_f4",
  "abundances_grouped_f5",
  "abundances_grouped_f6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
pd_prot_long_annotated <- pd_prot_long %>%
  left_join(y = annotation, by = "file_name")

Other search engines and software

As mentioned in the beginning of this vignette you can use the output of any search engine as long as it contains the minimally required columns. If it is not in the right format you can see if some of the above transformations can be applied to your data. It is also always useful to check if you can find additional columns that help you in your analysis and that you can export from your search engine. Always make sure that all of your observations are ones you are interested in. Check if there are decoys, contaminants or non-proteotypic peptides in your data. For protein-centric analysis, potentially remove quantifications that rely on only a few peptides.