Overview

GENESIS provides statistical methodology for analyzing genetic data from samples with population structure and/or familial relatedness. This vignette provides a description of how to use GENESIS for inferring population structure, as well as estimating relatedness measures such as kinship coefficients, identity by descent (IBD) sharing probabilities, and inbreeding coefficients. GENESIS uses PC-AiR for population structure inference that is robust to known or cryptic relatedness, and it uses PC-Relate for accurate relatedness estimation in the presence of population structure, admixutre, and departures from Hardy-Weinberg equilibrium.

Relatedness Estimation Adjusted for Principal Components (PC-Relate)

Many estimators exist that use genome-wide SNP genotype data from samples in genetic studies to estimate measures of recent genetic relatedness such as pairwise kinship coefficients, pairwise IBD sharing probabilities, and individual inbreeding coefficients. However, many of these estimators either make simplifying assumptions that do not hold in the presence of population structure and/or ancestry admixture, or they require reference population panels of known ancestry from pre-specified populations. When assumptions are violated, these estimators can provide highly biased estimates.

The PC-Relate method is used to accurately estimate measures of recent genetic relatedness in samples with unknown or unspecified population structure without requiring reference population panels. PC-Relate uses ancestry representative principal components to account for sample ancestry differences and provide estimates that are robust to population structure, ancestry admixture, and departures from Hardy-Weinberg equilibirum.

Data

Reading in Genotype Data

The functions in the GENESIS package read genotype data from a GenotypeData class object as created by the GWASTools package. Through the use of GWASTools, a GenotypeData class object can easily be created from:

an R matrix of SNP genotype data
a GDS file
PLINK files

Example R code for creating a GenotypeData object is presented below. Much more detail can be found in the GWASTools package reference manual.

R Matrix

geno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID, chromosome = chromosome, 
                             position = position, scanID = scanID)
genoData <- GenotypeData(geno)

genotype is a matrix of genotype values coded as 0 / 1 / 2, where rows index SNPs and columns index samples
snpID is an integer vector of unique SNP IDs
chromosome is an integer vector specifying the chromosome of each SNP
position is an integer vector specifying the position of each SNP
scanID is a vector of unique individual IDs

GDS files

geno <- GdsGenotypeReader(filename = "genotype.gds")
genoData <- GenotypeData(geno)

filename is the file path to the GDS object

PLINK files

The SNPRelate package provides the snpgdsBED2GDS function to convert binary PLINK files into a GDS file.

snpgdsBED2GDS(bed.fn = "genotype.bed", bim.fn = "genotype.bim", fam.fn = "genotype.fam", 
              out.gdsfn = "genotype.gds")

bed.fn is the file path to the PLINK .bed file
bim.fn is the file path to the PLINK .bim file
fam.fn is the file path to the PLINK .fam file
out.gdsfn is the file path for the output GDS file

Once the PLINK files have been converted to a GDS file, then a GenotypeData object can be created as described above.

HapMap Data

To demonstrate PC-AiR and PC-Relate analyses with the GENESIS package, we analyze SNP data from the Mexican Americans in Los Angeles, California (MXL) and African American individuals in the southwestern USA (ASW) population samples of HapMap 3. Mexican Americans and African Americans have a diverse ancestral background, and familial relatives are present in these data. Genotype data at a subset of 20K autosomal SNPs for 173 individuals are provided as a GDS file.

# read in GDS data
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
HapMap_genoData

## An object of class GenotypeData 
##  | data:
## File: /tmp/RtmpaHIlQO/Rinst50b55d9b81a0/GENESIS/extdata/HapMap_ASW_MXL_geno.gds (923.5 KB)
## +    [  ] *
## |--+ sample.id   { Int32,factor 173 ZIP(40.90%), 283 bytes } *
## |--+ snp.id   { Int32 20000 ZIP(34.64%), 27.7 KB }
## |--+ snp.position   { Int32 20000 ZIP(34.64%), 27.7 KB }
## |--+ snp.chromosome   { Int32 20000 ZIP(0.13%), 103 bytes }
## |--+ genotype   { Bit2 20000x173, 865.0 KB } *
##  | SNP Annotation:
## NULL
##  | Scan Annotation:
## NULL

Principal Components Analysis in Related Samples (PC-AiR)

Pairwise Measures of Ancestry Divergence

It is possible to identify a subset of mutually unrelated individuals in a sample based solely on pairwise measures of genetic relatedness (i.e. kinship coefficients). However, in order to obtain accurate population structure inference for the entire sample, it is important that the ancestries of all individuals in the sample are represented by at least one individual in this subset. In order to identify a mutually unrelated and ancestry representative subset of individuals, PC-AiR also utilizes measures of ancestry divergence. These measures are calculated using the KING-robust kinship coefficient estimator (Manichaikul et al., 2010), which provides systematically negative estimates for unrelated pairs of individuals with different ancestry. The number of negative pairwise estimates that an individual has provides information regarding how different that individual’s ancestry is from the rest of the sample, which helps to prioritize individuals that should be kept in the ancestry representative subset.

The relevant output from the KING software is two text files with the file extensions .kin0 and .kin. The king2mat function can be used to extract the kinship coefficients (which we use as divergence measures) from this output and create a matrix usable by the GENESIS functions. The iids input of the king2mat function should be used to ensure that the output matrix orders individuals in the same way as the GenotypeData object.

# read individual IDs from GenotypeData object
iids <- getScanID(HapMap_genoData)
head(iids)

## [1] NA19919 NA19916 NA19835 NA20282 NA19703 NA19902
## 173 Levels: NA19625 NA19649 NA19650 NA19651 NA19652 NA19653 ... NA20412

# create matrix of KING estimates
KINGmat <- king2mat(file.kin0 = system.file("extdata", "MXL_ASW.kin0", package="GENESIS"), 
                    file.kin = system.file("extdata", "MXL_ASW.kin", package="GENESIS"), 
                    iids = iids)

## Reading .kin0 file...

## Reading .kin file...

## Determining Unique Individual IDs from KING Output...

## Checking Provided Individual IDs

## Adding Kinship Entries from .kin0 file...

## Adding Kinship Entries from .kin file...

KINGmat[1:5,1:5]

##         NA19919 NA19916 NA19835 NA20282 NA19703
## NA19919  0.5000 -0.0009 -0.0059 -0.0080  0.0014
## NA19916 -0.0009  0.5000 -0.0063 -0.0150 -0.0039
## NA19835 -0.0059 -0.0063  0.5000 -0.0094 -0.0104
## NA20282 -0.0080 -0.0150 -0.0094  0.5000 -0.0134
## NA19703  0.0014 -0.0039 -0.0104 -0.0134  0.5000

The column and row names of the matrix are the individual IDs, and each off-diagonal entry is the KING-robust estimate for the specified pair of individuals.

Alternative to running the KING software, the snpgdsIBDKING function from the SNPRelate package can be used to calculate the KING-robust estimates directly from a GDS file. The ouput of this function contains a matrix of pairwise estimates, which can be used by the GENESIS functions.

Running PC-AiR

The PC-AiR algorithm requires pairwise measures of both kinship and ancestry divergence in order to partition the sample into an “unrelated subset” and a “related subset.” The kinship coefficient estimates are used to identify relatives, as only one individual from a set of relatives can be included in the “unrelated subset,” and the rest must be assigned to the “related subset.” The ancestry divergence measures calculated from KING-robust are used to help select which individual from a set of relatives has the most unique ancestry and should be given priority for inclusion in the “unrelated subset.”

The KING-robust estimates read in above are always used as measures of ancestry divergence for unrelated pairs of individuals, but they can also be used as measures of kinship for relatives (NOTE: they may be biased measures of kinship for admixed relatives with different ancestry). The pcair function performs the PC-AiR analysis.

# run PC-AiR
mypcair <- pcair(genoData = HapMap_genoData, kinMat = KINGmat, divMat = KINGmat)

## Partitioning Samples into Related and Unrelated Sets...

## Unrelated Set: 97 Samples 
## Related Set: 76 Samples

## Running Analysis with 20000 SNPs - in 2 Block(s)

## Computing Genetic Correlation Matrix for the Unrelated Set: Block 1 of 2 ...

## Computing Genetic Correlation Matrix for the Unrelated Set: Block 2 of 2 ...

## Performing PCA on the Unrelated Set...

## Predicting PC Values for the Related Set: Block 1 of 2 ...

## Predicting PC Values for the Related Set: Block 2 of 2 ...

## Concatenating Results...

genoData is a GenotypeData class object
kinMat is a matrix of pairwise kinship coefficient estimates
divMat is a matrix of pairwise measures of ancestry divergence (KING-robust estimates)

If better estimates of kinship coefficients are available, then the kinMat input can be replaced with a similar matrix of these estimates. The divMat input should always remain as the KING-robust estimates.

Reference Population Samples

As PCA is an unsupervised method, it is often difficult to identify what population structure each of the top PCs represents. In order to provide some understanding of the inferred structure, it is sometimes recommended to include reference population samples of known ancestry in the analysis. If the data set contains such reference population samples, we may want to use only those individuals as the “unrelated subset” for performing the PCA and predict values for all other sample individuals. This can be accomplished by setting the input unrel.set equal to a vector, IDs, of the individual IDs for the reference population samples.

mypcair <- pcair(genoData = HapMap_genoData, unrel.set = IDs)

If, instead, we want to make sure that these reference population samples are included in the “unrelated subset,” but also allow for the PC-AiR algorithm to select additional individuals from the sample to be a part of the “unrelated subset,” then this can be accomplished by using the inputs kinMat, divMat, and unrel.set together.

mypcair <- pcair(genoData = HapMap_genoData, kinMat = KINGmat, divMat = KINGmat, unrel.set = IDs)

This will force individuals specified with unrel.set into the “unrelated subset,” move any of their relatives into the “related subset,” and then perform the PC-AiR partitioning algorithm on the remaining samples.

Partition a Sample without Running PCA

It may be of interest to partition a sample into an ancestry representative “unrelated subset” of individuals and a “related subset” of individuals without performing a PCA. The pcairPartition function, which is called within the pcair function, enables the user to do this.

part <- pcairPartition(kinMat = KINGmat, divMat = KINGmat)

The output contains two vectors that give the individual IDs for the “unrelated subset” and the “related subset.”

head(part$unrels); head(part$rels)

## [1] "NA19916" "NA19835" "NA19703" "NA19901" "NA19908" "NA19914"

## [1] "NA19919" "NA20282" "NA19902" "NA20287" "NA20335" "NA19713"

As with the pcair function, the input unrel.set can be used to specify certain individuals that must be part of the “unrelated subset.”

Output from PC-AiR

An object returned from the pcair function has class pcair. A summary method for an object of class pcair is provided to obtain a quick overview of the results.

summary(mypcair)

## Call:
## pcair(genoData = HapMap_genoData, kinMat = KINGmat, divMat = KINGmat)
## 
## PCA Method: PC-AiR 
## 
## Sample Size: 173 
## Unrelated Set: 97 Samples 
## Related Set: 76 Samples 
## 
## Kinship Threshold: 0.02209709 
## Divergence Threshold: -0.02209709 
## 
## Principal Components Returned: 20 
## Eigenvalues: 9.334 1.87 1.289 1.24 1.207 1.189 1.174 1.162 1.148 1.139 ...
## 
## MAF Filter: 0.01 
## SNPs Used: 19996

The output provides the total sample size along with the number of individuals assigned to the unrelated and related subsets, as well as the threshold values used for determining which pairs of individuals were related or ancestrally divergent. The eigenvalues for the top PCs are also shown, which can assist in determining the number of PCs that reflect structure. The minor allele frequency (MAF) filter used for excluding SNPs is also specified, along with the total number of SNPs analyzed after this filtering. Details describing how to modify the analysis parameters and the available output of the function can be found with the command help(pcair).

Plotting PC-AiR PCs

The GENESIS package also provides a plot method for an object of class pcair to quickly visualize pairs of PCs. Each point in one of these PC pairs plots represents a sample individual. These plots help to visualize population structure in the sample and identify clusters of individuals with similar ancestry.

# plot top 2 PCs
plot(mypcair)
# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)

The default is to plot PC values as black dots and blue pluses for individuals in the “unrelated subset” and “related subsets” respectively. The plotting colors and characters, as well as other standard plotting parameters, can be changed with the standard input to the plot function.

Relatedness Estimation Adjusted for Principal Components (PC-Relate)

Running PC-Relate

PC-Relate uses the ancestry representative principal components (PCs) calculated from PC-AiR to adjust for the population structure and ancestry of individuals in the sample and provide accurate estimates of recent genetic relatedness due to family structure. The pcrelate function performs the PC-Relate analysis.

The training.set input of the pcrelate function allows for the specification of which samples are used to estimate the ancestry adjustment at each SNP. The adjustment tends to perform best when close relatives are excluded from training.set, so the individuals in the “unrelated subset” from the PC-AiR analysis are typically a good choice. However, when an “unrelated subset” is not available, the adjustment still works well when estimated using all samples (training.set = NULL).

# run PC-Relate
mypcrelate <- pcrelate(genoData = HapMap_genoData, pcMat = mypcair$vectors[,1:2], 
                       training.set = mypcair$unrels)

## Running Analysis with 20000 SNPs - in 2 Block(s)

## Running Analysis with 173 Samples - in 1 Block(s)

## Using 2 PC(s) in pcMat to Calculate Adjusted Estimates

## Using 97 Samples in training.set to Estimate PC effects on Allele Frequencies

## Computing PC-Relate Estimates...

## ...SNP Block 1 of 2 Completed - 5.468 secs

## ...SNP Block 2 of 2 Completed - 5.472 secs

## Performing Small Sample Correction...

genoData is a GenotypeData class object
pcMat is a matrix whose columns are the PCs used for the ancestry adjustment
training.set is a vector of individual IDs specifying which samples are used to esimate the ancestry adjustment at each SNP

If estimates of IBD sharing probabilities are not desired, then setting the input ibd.probs = FALSE will speed up the computation.

Output from PC-Relate

The pcrelate function will either return an object of class pcrelate (when the input write.to.gds = FALSE) or it will save the output to a GDS file (when the input write.to.gds = TRUE). Saving the output to a GDS file is useful for large samples, as it allows for efficient storage and access to the estimates (see the package gdsfmt for more details). The following command can be used to read in the results from a previous PC-Relate analysis saved to the GDS file “tmp_pcrelate.gds”

library(gdsfmt)
mypcrelate <- openfn.gds("tmp_pcrelate.gds")

Functions are provided for easily reading the output from pcrelate (either a class pcrelate object or a GDS file) and making a table of pairwise relatedness estimates, a table of individual inbreeding coeficients, and a genetic relationship matrix (GRM).

pcrelateReadKinship(pcrelObj = mypcrelate, scan.include = iids[1:40], kin.thresh = 2^(-9/2))

##         ID1     ID2  nsnp        kin           k2        k1           k0
## 7   NA19919 NA19908 19437 0.24668482 -0.017347662 1.0157801 0.0015676062
## 17  NA19919 NA19909 19626 0.24792942  0.008880185 0.9911198 0.0000000000
## 138 NA20282 NA20301 19765 0.32326286  0.364583049 0.5780613 0.0573556808
## 186 NA19902 NA19901 19640 0.25122134  0.011649796 0.9883502 0.0000000000
## 212 NA19902 NA19900 19771 0.23177090 -0.023637964 1.0228805 0.0007574672
## 365 NA20335 NA20337 19906 0.06867109  0.009357025 0.2559703 0.7346726524
## 493 NA20340 NA20349 19679 0.08084154  0.005413026 0.3125401 0.6820468648
## 501 NA20340 NA20344 19657 0.05710043 -0.010172410 0.2487465 0.7614258883
## 513 NA20297 NA20281 19669 0.06444652  0.019864365 0.2180573 0.7620783007
## 623 NA20290 NA20289 19666 0.25641191 -0.002406525 1.0001161 0.0022904721
## 625 NA20290 NA20333 19731 0.04482144  0.018667900 0.1419500 0.8393821454
## 634 NA20295 NA20294 19769 0.25275105 -0.006725302 1.0044206 0.0023046668
## 649 NA20346 NA20349 19765 0.05540436 -0.005522529 0.2326625 0.7728600478
## 657 NA20346 NA20344 19743 0.05144352 -0.002491201 0.2107565 0.7917347083
## 722 NA20349 NA20344 19878 0.27963012  0.280096947 0.5316348 0.1882682417

pcrelateReadInbreed(pcrelObj = mypcrelate, scan.include = iids[1:40], f.thresh = 2^(-11/2))

##         ID  nsnp          f
## 11 NA20335 19949 0.03210545
## 13 NA19904 19814 0.02276732
## 20 NA20317 19933 0.02302650
## 30 NA20294 19827 0.02705174
## 37 NA20344 19905 0.03607983
## 38 NA20333 19902 0.03498920

pcrelateMakeGRM(pcrelObj = mypcrelate, scan.include = iids[1:5], scaleKin = 2)

##              NA19919     NA19916      NA19835      NA20282     NA19703
## NA19919  0.971883727 0.011129327 -0.029404169  0.009562848 0.032625513
## NA19916  0.011129327 1.004740236  0.007354497  0.001717844 0.008775792
## NA19835 -0.029404169 0.007354497  0.970464272 -0.010680929 0.002279565
## NA20282  0.009562848 0.001717844 -0.010680929  0.989649727 0.016076041
## NA19703  0.032625513 0.008775792  0.002279565  0.016076041 1.000048656

pcrelObj is the output from pcrelate; either a class pcrelate object or a GDS file
scan.include is a vector of individual IDs specifying which individuals to include in the table or matrix
kin.thresh specifies a minimum kinship coefficient value to include in the table
f.thresh specifies a minimum inbreeding coefficient value to include in the table
scaleKin specifies a factor to multiply the kinship coefficients by in the GRM

References

Conomos M.P., Miller M.B., & Thornton T.A. (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology, 39(4), 276-293.
Gogarten, S.M., Bhangale, T., Conomos, M.P., Laurie, C.A., McHugh, C.P., Painter, I., … & Laurie, C.C. (2012). GWASTools: an R/Bioconductor package for quality control and analysis of Genome-Wide Association Studies. Bioinformatics, 28(24), 3329-3331.
Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

Population Structure and Relatedness Inference using the GENESIS Package

Matthew P. Conomos

2016-01-21

Overview

Relatedness Estimation Adjusted for Principal Components (PC-Relate)

Data

Reading in Genotype Data

R Matrix

GDS files

PLINK files

HapMap Data

Relatedness Estimation Adjusted for Principal Components (PC-Relate)

Running PC-Relate

Output from PC-Relate

References