Pairwise Measures of Ancestry Divergence
It is possible to identify a subset of mutually unrelated individuals in a sample based solely on pairwise measures of genetic relatedness (i.e. kinship coefficients). However, in order to obtain accurate population structure inference for the entire sample, it is important that the ancestries of all individuals in the sample are represented by at least one individual in this subset. In order to identify a mutually unrelated and ancestry representative subset of individuals, PC-AiR also utilizes measures of ancestry divergence. These measures are calculated using the KING-robust kinship coefficient estimator (Manichaikul et al., 2010), which provides systematically negative estimates for unrelated pairs of individuals with different ancestry. The number of negative pairwise estimates that an individual has provides information regarding how different that individual’s ancestry is from the rest of the sample, which helps to prioritize individuals that should be kept in the ancestry representative subset.
The relevant output from the KING software is two text files with the file extensions .kin0 and .kin. The king2mat
function can be used to extract the kinship coefficients (which we use as divergence measures) from this output and create a matrix usable by the GENESIS
functions. The iids
input of the king2mat
function should be used to ensure that the output matrix orders individuals in the same way as the GenotypeData
object.
# read individual IDs from GenotypeData object
iids <- getScanID(HapMap_genoData)
head(iids)
## [1] NA19919 NA19916 NA19835 NA20282 NA19703 NA19902
## 173 Levels: NA19625 NA19649 NA19650 NA19651 NA19652 NA19653 ... NA20412
# create matrix of KING estimates
KINGmat <- king2mat(file.kin0 = system.file("extdata", "MXL_ASW.kin0", package="GENESIS"),
file.kin = system.file("extdata", "MXL_ASW.kin", package="GENESIS"),
iids = iids)
## Reading .kin0 file...
## Reading .kin file...
## Determining Unique Individual IDs from KING Output...
## Checking Provided Individual IDs
## Adding Kinship Entries from .kin0 file...
## Adding Kinship Entries from .kin file...
KINGmat[1:5,1:5]
## NA19919 NA19916 NA19835 NA20282 NA19703
## NA19919 0.5000 -0.0009 -0.0059 -0.0080 0.0014
## NA19916 -0.0009 0.5000 -0.0063 -0.0150 -0.0039
## NA19835 -0.0059 -0.0063 0.5000 -0.0094 -0.0104
## NA20282 -0.0080 -0.0150 -0.0094 0.5000 -0.0134
## NA19703 0.0014 -0.0039 -0.0104 -0.0134 0.5000
The column and row names of the matrix are the individual IDs, and each off-diagonal entry is the KING-robust estimate for the specified pair of individuals.
Alternative to running the KING software, the snpgdsIBDKING
function from the SNPRelate
package can be used to calculate the KING-robust estimates directly from a GDS file. The ouput of this function contains a matrix of pairwise estimates, which can be used by the GENESIS
functions.
Running PC-AiR
The PC-AiR algorithm requires pairwise measures of both kinship and ancestry divergence in order to partition the sample into an “unrelated subset” and a “related subset.” The kinship coefficient estimates are used to identify relatives, as only one individual from a set of relatives can be included in the “unrelated subset,” and the rest must be assigned to the “related subset.” The ancestry divergence measures calculated from KING-robust are used to help select which individual from a set of relatives has the most unique ancestry and should be given priority for inclusion in the “unrelated subset.”
The KING-robust estimates read in above are always used as measures of ancestry divergence for unrelated pairs of individuals, but they can also be used as measures of kinship for relatives (NOTE: they may be biased measures of kinship for admixed relatives with different ancestry). The pcair
function performs the PC-AiR analysis.
# run PC-AiR
mypcair <- pcair(genoData = HapMap_genoData, kinMat = KINGmat, divMat = KINGmat)
## Partitioning Samples into Related and Unrelated Sets...
## Unrelated Set: 97 Samples
## Related Set: 76 Samples
## Running Analysis with 20000 SNPs - in 2 Block(s)
## Computing Genetic Correlation Matrix for the Unrelated Set: Block 1 of 2 ...
## Computing Genetic Correlation Matrix for the Unrelated Set: Block 2 of 2 ...
## Performing PCA on the Unrelated Set...
## Predicting PC Values for the Related Set: Block 1 of 2 ...
## Predicting PC Values for the Related Set: Block 2 of 2 ...
## Concatenating Results...
genoData
is a GenotypeData
class object
kinMat
is a matrix of pairwise kinship coefficient estimates
divMat
is a matrix of pairwise measures of ancestry divergence (KING-robust estimates)
If better estimates of kinship coefficients are available, then the kinMat
input can be replaced with a similar matrix of these estimates. The divMat
input should always remain as the KING-robust estimates.
Reference Population Samples
As PCA is an unsupervised method, it is often difficult to identify what population structure each of the top PCs represents. In order to provide some understanding of the inferred structure, it is sometimes recommended to include reference population samples of known ancestry in the analysis. If the data set contains such reference population samples, we may want to use only those individuals as the “unrelated subset” for performing the PCA and predict values for all other sample individuals. This can be accomplished by setting the input unrel.set
equal to a vector, IDs
, of the individual IDs for the reference population samples.
mypcair <- pcair(genoData = HapMap_genoData, unrel.set = IDs)
If, instead, we want to make sure that these reference population samples are included in the “unrelated subset,” but also allow for the PC-AiR algorithm to select additional individuals from the sample to be a part of the “unrelated subset,” then this can be accomplished by using the inputs kinMat
, divMat
, and unrel.set
together.
mypcair <- pcair(genoData = HapMap_genoData, kinMat = KINGmat, divMat = KINGmat, unrel.set = IDs)
This will force individuals specified with unrel.set
into the “unrelated subset,” move any of their relatives into the “related subset,” and then perform the PC-AiR partitioning algorithm on the remaining samples.
Partition a Sample without Running PCA
It may be of interest to partition a sample into an ancestry representative “unrelated subset” of individuals and a “related subset” of individuals without performing a PCA. The pcairPartition
function, which is called within the pcair
function, enables the user to do this.
part <- pcairPartition(kinMat = KINGmat, divMat = KINGmat)
The output contains two vectors that give the individual IDs for the “unrelated subset” and the “related subset.”
head(part$unrels); head(part$rels)
## [1] "NA19916" "NA19835" "NA19703" "NA19901" "NA19908" "NA19914"
## [1] "NA19919" "NA20282" "NA19902" "NA20287" "NA20335" "NA19713"
As with the pcair
function, the input unrel.set
can be used to specify certain individuals that must be part of the “unrelated subset.”
Output from PC-AiR
An object returned from the pcair
function has class pcair
. A summary
method for an object of class pcair
is provided to obtain a quick overview of the results.
summary(mypcair)
## Call:
## pcair(genoData = HapMap_genoData, kinMat = KINGmat, divMat = KINGmat)
##
## PCA Method: PC-AiR
##
## Sample Size: 173
## Unrelated Set: 97 Samples
## Related Set: 76 Samples
##
## Kinship Threshold: 0.02209709
## Divergence Threshold: -0.02209709
##
## Principal Components Returned: 20
## Eigenvalues: 9.334 1.87 1.289 1.24 1.207 1.189 1.174 1.162 1.148 1.139 ...
##
## MAF Filter: 0.01
## SNPs Used: 19996
The output provides the total sample size along with the number of individuals assigned to the unrelated and related subsets, as well as the threshold values used for determining which pairs of individuals were related or ancestrally divergent. The eigenvalues for the top PCs are also shown, which can assist in determining the number of PCs that reflect structure. The minor allele frequency (MAF) filter used for excluding SNPs is also specified, along with the total number of SNPs analyzed after this filtering. Details describing how to modify the analysis parameters and the available output of the function can be found with the command help(pcair)
.
Plotting PC-AiR PCs
The GENESIS
package also provides a plot
method for an object of class pcair
to quickly visualize pairs of PCs. Each point in one of these PC pairs plots represents a sample individual. These plots help to visualize population structure in the sample and identify clusters of individuals with similar ancestry.
# plot top 2 PCs
plot(mypcair)
# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)
The default is to plot PC values as black dots and blue pluses for individuals in the “unrelated subset” and “related subsets” respectively. The plotting colors and characters, as well as other standard plotting parameters, can be changed with the standard input to the plot
function.