The HaploCatcher package is based on the work done in Winn et al (2022) which demonstrated that relatively simple machine learning models could be used to generate haplotype information for genome-wide genotyped lines. The idea behind this approach was to leverage the historical PCR-based marker information in programs to identify haplotypes of interest in early generation material which has been sequenced for genomic selection. This approach offers several distinct advantages for breeding programs:
Lines that are genotyped for genomic selection in early generations, which would not have PCR-based markers run on them, can receive haplotype information that they would otherwise not have.
Users can apply this information for selection of desirable haplotypes in early generations.
Haplotype information could become available earlier in the breeding process.
Haplotypes could be predicted for historical lines for use in estimation of locus effects across time.
The development team of HaploCatcher made a package available for individuals to apply this method in their programs with as little intervention as possible. In this demonstration, we will discuss the potential uses of this package and the file formats required to run the functions in the HaploCatcher package.
To use the HaploCatcher package, the data must be structured properly. There are three necessary data frames that must be available to use this package. Let’s take a look at them.
First, lets look at the head of the “gene_comp” file.
Trait | Chromosome | Gene | Nursery | Line | FullSampleName | Call |
---|---|---|---|---|---|---|
sawfly_resistance_solid_stem | 3B | sst1_solid_stem | SRPN | Geno_190 | Geno_190 | non_sst1_solid_stem |
sawfly_resistance_solid_stem | 3B | sst1_solid_stem | SRPN | Geno_100 | Geno_100 | non_sst1_solid_stem |
sawfly_resistance_solid_stem | 3B | sst1_solid_stem | RGON | Geno_49 | Geno_49 | non_sst1_solid_stem |
sawfly_resistance_solid_stem | 3B | sst1_solid_stem | CSU | Geno_71 | Geno_71 | non_sst1_solid_stem |
sawfly_resistance_solid_stem | 3B | sst1_solid_stem | CSU | Geno_123 | Geno_123 | non_sst1_solid_stem |
sawfly_resistance_solid_stem | 3B | sst1_solid_stem | CSU | Geno_88 | Geno_88 | non_sst1_solid_stem |
The gene_comp file is a categorization file made by the user for their specific program. It contains seven columns which are as follows:
Trait: Defines what trait the locus affects. In this case, we are looking at Sst1 which affects resistance to sawfly in wheat.
Chromosome: Defines what chromosome the gene is located on. This will be specific to a marker_information file later on.
Gene: Defines the name of the gene.
Nursery: Defines what nursery/program the information was derived from.
Line: Defines the “line designation” given to a genotype. The line designation is denoted by the breeder.
FullSampleName: The name of the line which matches with the row names in the marker matrix.
Call: The haplotype for that line for that gene. This call column is very important for the functioning of the HaploCatcher package. This package only predicts biallelic loci. The locus can be only homozygous cases or include a heterozygous case.
For simplicity in the package, calls for gene haplotypes should be formatted like this:
Call |
---|
non_sst1_solid_stem |
sst1_solid_stem |
het_sst1_solid_stem |
The heterozygous case has the string “het_” in front of it like “het_sst1_solid_stem” and the negative case has the string “non_” in front of it like “non_sst1_solid_stem”.
If the user wants to only provide the positive and negative cases, without the heterozygous case, the format would be the same without “het_” category. To use the package properly, users must have at least four columns in the data present: FullSampleName, Chromosome, Gene, and Call. Other data may be present. These columns must both be properly named and case correct in the data.
For example, if the user was trying to predict the Pm34 locus in wheat with heterozygous cases, the data frame for the gene_comp would look like this:
FullSampleName | Chromosome | Gene | Call |
---|---|---|---|
Geno_1 | 5D | pm34 | pm34 |
Geno_2 | 5D | pm34 | het_pm34 |
Geno_3 | 5D | pm34 | non_pm34 |
The structure of the genotypic matrix is a full-rank (no missing data) numeric matrix of markers.
S3B_496866 | S3B_496906 | S3B_509279 | S3B_619176 | S3B_619185 | |
---|---|---|---|---|---|
Geno_1 | 1 | 0 | 0 | 0 | 0 |
Geno_2 | 0 | 0 | 0 | 0 | 0 |
Geno_3 | 0 | 0 | 0 | 0 | 0 |
Geno_4 | 0 | 0 | 0 | 0 | 0 |
Geno_5 | 0 | 0 | 0 | 0 | 2 |
Where the columns of the matrix are markers, the rows of the matrix are genotypes (FullSampleName), and the the marker information is coded numerically. The marker matrix provided in the package is coded as 0, 1,and 2; where 0 is the major allele, 1 is the heterozygous haplotype, and 2 is the minor allele. The coding of the matrix must always be numeric, but it can be in any numeric format as long as the numbers are real integers.
Note: the column names must match information provided in the marker info file (see next section) and the row names must match the “FullSampleName” provided in the gene_comp!
The marker_info file contains information on all the markers found in the columns of the geno_mat file.
Marker | Chromosome | BP_Position |
---|---|---|
S3B_496866 | 3B | 496866 |
S3B_496906 | 3B | 496906 |
S3B_509279 | 3B | 509279 |
S3B_619176 | 3B | 619176 |
S3B_619185 | 3B | 619185 |
S3B_627259 | 3B | 627259 |
The marker_info file has three required columns:
Marker: The name of the marker which is found in the column names of the marker matrix
Chromosome: The name of the chromosome (or linkage group) that the marker belongs to
BP_Position: The base pair position of the marker. This can be replaced by a numeric dummy variable or a centimorgan position!
Below is an example where a linkage group and centimorgan position is provided.
Marker | Chromosome | BP_Position |
---|---|---|
TRACE_00102 | 1 | 0.0 |
TRACE_13112 | 1 | 0.5 |
TRACE_43821 | 1 | 1.2 |
Note: even though these are not chromosomes or base pair positions, you must leave the names of the columns titled as such! Furthermore, the columns for the marker_info and geno_comp are case sensitive
The total pipeline for the HaploCatcher package can be understood using the provided diagram (Figure 1).
Figure 1. A diagram of [A] input data structure and [B] the “auto_locus” function pipeline. Panel [A] shows a total data set that is partitioned into a training and test population. The training population in panel [A] shows a population of individuals, that is suggested to be comprised of more than 750 individuals, which have both genome wide marker and historical haplotype data. The testing population in panel [A] shows a testing population, which may be any size greater than zero, which only has genome wide marker data. Panel [B] shows the workflow of the “auto_locus” function. In the cross-validation step [I], the total training population is split in a user defined way (default is 80:20 split) and the 80% tuning population is used to train and select optimal hyper-parameters for a k-nearest neighbors (KNN) and random forest (RF) model. The trained models are then used to predict the haplotype of the validation population. The predicted haplotype is then compared to the ‘true’ haplotype and kappa (and accuracy) are calculated. This is repeated a user set number of times (default is 30). The best performing model based on accuracy or kappa (default is kappa) is then taken as the model to be used in forward prediction. There are two options post cross-validation: [IIA] a single model with a set seed for repeatability or [IIB] a user set number of random models (default is 30) used to create a consensus haplotype prediction.
The HaploCatcher package contains several different functions which are all placed into a pipeline for ease of access. The whole process hinges on having the “gene_comp” file, which has historical information about a locus of interest and a “geno_mat” file that contains all those individuals in the “geno_comp” file and individuals for which a predicted haplotype is desired (Figure 1A).
The “auto_locus” function (Figure 1B) has two distinct steps: (I) Selecting either a random forest or k-nearest neighbors model by cross-validation and (II) using the cross-validation results to select the desired model, training off the total available data, and either (IIA) set a random seed and predict haplotypes once or (IIB) set no random seed and predict many times to get a consensus vote by majority rule (Figure 1B).
Now that we understand the format of the data and the pipeline process, it is time to use the main function “auto_locus”.
First, read in the data…
#library
library(HaploCatcher)
#read data from package
data("geno_mat")
data("gene_comp")
data("marker_info")
In this example, only individuals who are in the gene_comp are found in the geno_mat file provided in the package, so we will have to randomly partition the data so that we can show how the “auto_locus” function could be used for forward prediction of lines with only genome-wide marker data!
#set seed (for reproducible results)
set.seed(022294)
#randomly partition the training data and test from total
=sample(rownames(geno_mat), size = round(nrow(geno_mat)*0.8, 0))
training_genotypes=rownames(geno_mat)[!rownames(geno_mat) %in% training_genotypes]
testing_genotypes
#nullify the seed we set so we don't mess with cross validation
set.seed(NULL)
In this example, we assume that the individuals in the “testing_genotypes” vector DO NOT have haplotype calls for Sst1 and we are trying to produce a predictive haplotype for these individuals.
Now that we have defined our testing and training data sets, we can use these inputs for the auto_locus function. The auto_locus function has eight user defined inputs:
The auto_locus function has many user defined setting that are optional, such as “include_hets” which either allows heterozygous lines to be predicted (TRUE) or not (FALSE). The default is FALSE.
Note: if your data has only positive and negative cases without heterozygous calls, you should set “include_hets” to FALSE, otherwise you will receive an error!
Furthermore, there are options for sequential modeling (parallel=FALSE) or parallel modeling (parallel=TRUE), as well as methods to suppress output of text (verbose=FALSE).
The example data set for the HaploCatcher package contains heterozygous calls, so we will perform models using the “include_hets” argument set to FALSE. We will also set n_perms=5 for expediency of models, even though the minimum should be 30.
#run without heterozygous individuals sequentially
<-auto_locus(geno_mat = geno_mat,
results1gene_file = gene_comp,
gene_name = "sst1_solid_stem",
marker_info = marker_info,
chromosome = "3B",
training_genotypes = training_genotypes,
testing_genotypes = testing_genotypes,
set_seed = 022294,
n_perms = 10,
verbose = FALSE)
#> Loading required package: lattice
Each result obtained puts out a list-of-list objects which contain several portions which can be found in the documentation for the “auto_locus” function. The predictions made for the training population can be found in the results$predictions pathway.
#show results for with and without hets
::kable(results1$predictions[1:15,],
knitralign = "c",
caption = "Heterozygous Excluded Results")
FullSampleName | Prediction_RF |
---|---|
Geno_2 | sst1_solid_stem |
Geno_6 | sst1_solid_stem |
Geno_9 | sst1_solid_stem |
Geno_13 | sst1_solid_stem |
Geno_14 | sst1_solid_stem |
Geno_15 | sst1_solid_stem |
Geno_16 | sst1_solid_stem |
Geno_19 | sst1_solid_stem |
Geno_41 | sst1_solid_stem |
Geno_45 | sst1_solid_stem |
Geno_76 | sst1_solid_stem |
Geno_77 | sst1_solid_stem |
Geno_79 | sst1_solid_stem |
Geno_86 | non_sst1_solid_stem |
Geno_89 | sst1_solid_stem |
For users who do not want to run a single final model with a set seed, there is the “predict_by_vote” option, which runs a series of random models to get a majority rule vote. Here we set the number of votes to 10 with the “n_votes” argument and the number of cross-validation permutations to 10 with the “n_perms” argument. We also silence the output using the “verbose=FALSE” argument.
Note that the predictions, when made through voting, may be found in the results$consensus_predictions pathway. This data frame shows the FullSampleName, a haplotype call titled “Consensus_Call” made by majority rule, and the different potential calls with a number of observations over all permutations of the model.
If users which to visualize the cross-validation results outside of the model, they can use the “plot_locus_perm_cv” function by referencing the list-of-list in the results$cross_validation_results.
#plot out the cross validation results
plot_locus_perm_cv(results1$cross_validation_results)
Each results object has cross validation parameters to check the predictive ability of the data set. To see this, users can access summaries in the list-of-list results object. These parameters may be useful for publications or model selection.
#look at overall and by_class performance
::kable(results1$cross_validation_results$Overall_Summary[,1:5],
knitralign = "c",
caption = "Subsection of Overall Summary for 'results1' Object")
Model | Mean_Accuracy | Min_Accuracy | Max_Accuracy | SD_Accuracy |
---|---|---|---|---|
K-Nearest Neighbors | 0.9750000 | 0.9591837 | 0.9897959 | 0.0121334 |
Random Forest | 0.9755102 | 0.9591837 | 0.9846939 | 0.0089346 |
::kable(results1$cross_validation_results$By_Class_Summary[,1:5],
knitralign = "c",
caption = "Subsection of By-Class Summary for 'results1' Object")
Model | Mean_Sensitivity | Min_Sensitivity | Max_Sensitivity | SD_Sensitivity |
---|---|---|---|---|
K-Nearest Neighbors | 0.9728407 | 0.9453125 | 1 | 0.0181457 |
Random Forest | 0.9745744 | 0.9393939 | 1 | 0.0206555 |
In this tutorial, we have demonstrated how to use the HaploCatcher package. We hope that this tutorial has been informative for potential users and that it has fueled interest in the use of our package. We appreciate your interest in our work and thank you for taking the time to learn about HaploCatcher. If you have any further questions or comments, please do not hesitate to contact the package maintainer listed in the package description.