polyRAD is an R package that assists with genotype calling from DNA sequence datasets such as genotyping-by-sequencing (GBS) or restriction site-associated DNA sequencing (RAD) in polyploids and diploids. Genotype likelihoods are estimated from allelic read depth, genotype prior probabilities are estimated from population parameters, and then genotype posterior probabilities are estimated from likelihoods and prior probabilities. Posterior probabilities can be used directly in downstream analysis, converted to posterior mean genotypes for analyses of additive genetic effects, or used for export of the most probable genotypes for analyses that require discrete genotypic data.
Analyses in polyRAD center around objects of an S3 class called
RADdata
. A single RADdata
object contains the
entire dataset of read depth and locus information, as well as
parameters that are estimated during the course of analysis.
For any function named in this section, see its help page for more
information. (For example by typing ?VCF2RADdata
into the R
console.)
Several functions are available for import of read depth data and (optionally) alignment information into a RADdata object:
VCF2RADdata
readTagDigger
readStacks
readHMC
readTASSELGBSv2
readProcessSamMulti
readProcessIsoloci
readDArTag
More generally, the RADdata
function is used for
constructing RADdata objects; see the help page for that function for
more information on what data are needed.
Several pipelines are available for genotype estimation, depending on how the population is structured (i.e. what the genotype prior probabilities should be):
PipelineMapping2Parents
IterateHWE
IterateHWE_LD
IteratePopStruct
IteratePopStructLD
For exporting the estimated genotypes to other software:
ExportGAPIT
Export_rrBLUP_Amat
Export_rrBLUP_GWAS
Export_TASSEL_Numeric
Export_polymapR
Export_polymapR_probs
Export_MAPpoly
Export_GWASpoly
Export_Structure
Export_adegenet_genind
RADdata2VCF
If you need continuous numerical genotypes exported in some other
format, see GetWeightedMeanGenotypes
. If you need discrete
numerical genotypes, see GetProbableGenotypes
. Also,
GetLikelyGen
returns the most likely genotypes (based on
read depth only) for a single sample.
There are also various utilities for manipulating RADdata objects:
SubsetByTaxon
SubsetByLocus
SubsetByPloidy
SplitByChromosome
MergeRareHaplotypes
MergeTaxaDepth
RemoveMonomorphicLoci
RemoveHighDepthLoci
RemoveUngenotypedLoci
EstimateContaminationRate
StripDown
LocusInfo
For identifying problematic loci and individuals:
HindHe
HindHeMapping
ExpectedHindHe
ExpectedHindHeMapping
InbreedingFromHindHe
For interrogating the genotype calling process, see
?ExamineGenotype
.
See ?GetTaxa
for a list of accessor functions as
well.
Ploidy is represented in two dimensions in RADdata objects. The first
dimension, indicating inheritance mode of loci with respect to the
reference genome, is specified in the possiblePloidies
slot. The second dimension, indicating additional genome duplication
within individuals, is specified by the taxaPloidy
slot.
The product of possiblePloidies
and
taxaPloidy
, divided by two, indicates the possible
inheritance modes within each individual.
If you expect each locus to align to a unique location within the
reference genome, set possiblePloidies = list(2)
. If you
are working, for example, with an allotetraploid but are aligning to the
genome of one diploid ancestor, set
possiblePloidies = list(c(2, 2))
to indicate that a single
alignment location will represent two Mendelian loci. For example, if
you are analysing allohexaploid winter wheat, if you are aligning to the
allohexaploid reference you would set
possiblePloidies = list(2)
, but if you were aligning to the
reference of a diploid ancestor you would set
possiblePloidies = list(c(2, 2, 2))
. In reality, the data
may be messier than that, particularly if you are using a reference-free
variant calling pipeline, and so you can for example set
possiblePloidies = list(2, c(2, 2), c(2, 2, 2))
to test
each genomic location for whether it seems to consist of one, two, or
three Mendelian loci. Another example case is an ancient tetraploid
genome in which some regions have diploidized and others still segregate
in an autotetraploid fashion, which could be coded as
possiblePloidies = list(2, 4)
or
possiblePloidies = list(2, 4, c(2, 2))
. Keep in mind that
the memory footprint of polyRAD during and after genotyping will scale
linearly with the number of inheritance modes tested, however.
One situation you want to avoid is alleles from a single Mendelian locus aligning to multiple locations in the reference, as polyRAD will treat these as different loci. The reference genome used in variant calling should therefore be one in which none of the chromosomes are expected to pair with each other at meiosis. If you are studying a mixture of two species and their hybrids, I recommend performing variant calling and allelic read depth quantification against the reference genome of just one species (whichever has a better assembly and annotation available), then taking the set of RAD tags in which variants were called and aligning them to the reference genome of the other species, and filtering to retain only markers with alignment locations in both species.
The taxaPloidy
slot is a multiplier on top of the
possiblePloidies
slot. It has one value per individual
(taxon) in the dataset, but when a RADdata object is created it is
possible to just supply one value to assign all individuals the same
taxaPloidy
. For example, if you are studying
autotetraploids, you can set possiblePloidies = list(2)
and
taxaPloidy = 4
, although
possiblePloidies = list(4)
and taxaPloidy = 2
will also work. An allotetraploid with a diploid reference would be
possiblePloidies = list(c(2,2))
and
taxaPloidy = 2
, and an allotetraploid with an
allotetraploid reference would be
possiblePloidies = list(2)
and taxaPloidy = 2
.
If you have, for example, a mixture if diploids, triploids, and
tetraploids, you should construct a named integer vector, where the
values are the ploidies of the individuals, and the names are the sample
names, then pass this vector to the taxaPloidy
argument
when making your RADdata object (see example in “Estimating genotype
probabilities in a diversity panel” below). For mapping populations, it
is possible for the parents to have different ploidies as long as all
progeny are the ploidy expected from the cross (e.g. diploid x
tetraploid = triploid).
In this example, we’ll import some data from an F1 mapping population of Miscanthus sinensis that were output by the UNEAK pipeline. These data are from a study by Liu et al. (2015; doi:10.1111/gcbb.12275; data available at http://hdl.handle.net/2142/79522), and can be found in the “extdata” folder of the polyRAD installation. Miscanthus is an ancient tetraploid that has undergone diploidization. Given the ability of the UNEAK pipeline to filter paralogs, we expect most loci to behave in a diploid fashion, but some may behave in an allotetraploid fashion.
We’ll start by loading polyRAD and importing the data into a
RADdata
object. The possiblePloidies
argument
indicates the expected inheritance modes: diploid (2) and allotetraploid
(2 2). The taxaPloidy
argument indicates that there is no
further genome duplication beyond that indicated with
possiblePloidies
.
With your own dataset, you will not need to use
system.file
. Instead, directly create a text string
indicating the name of your file (and its location if it is not in the
working directory.)
library(polyRAD)
maphmcfile <- system.file("extdata", "ClareMap_HapMap.hmc.txt",
package = "polyRAD")
maphmcfile
## [1] "/tmp/RtmpJScqln/Rinst19b123332eea/polyRAD/extdata/ClareMap_HapMap.hmc.txt"
## ## RADdata object ##
## 299 taxa and 50 loci
## 766014 total reads
## Assumed sample cross-contamination rate of 0.001
##
## Possible ploidies:
## Autodiploid (2)
## Allotetraploid (2 2)
We can view the imported taxa names (subsetted here for space).
## [1] "IGR-2011-001" "Kaskade-Justin" "Map1-001" "Map1-002"
## [5] "Map1-003" "Map1-005" "Map1-008" "Map1-011"
## [9] "Map1-016" "Map1-018" "Map1-488" "Map1-489"
## [13] "Map1-490" "Map1-491" "Zebrinus-Justin" "p196-150A-c"
## [17] "p877-348-b"
All names starting with “Map” are progeny. “Kaskade-Justin” and “Zebrinus-Justin” are the parents. “IGR-2011-001”, “p196-150A-c”, and “p877-348-b” aren’t part of the population, but were doubled haploid lines that were used to screen for paralogous markers. We can tell polyRAD which taxa are the parents; since this is an F1 population it doesn’t matter which is “donor” and which is “recurrent”.
mydata <- SetDonorParent(mydata, "Kaskade-Justin")
mydata <- SetRecurrentParent(mydata, "Zebrinus-Justin")
Although we will end up excluding the doubled haploid lines from
analysis below, if we did want to retain them in the analysis and treat
them as haploid for the purpose of genotyping, we could adjust
taxaPloidy
for just these samples:
## ## RADdata object ##
## 299 taxa and 50 loci
## 766014 total reads
## Assumed sample cross-contamination rate of 0.001
##
## Possible ploidies:
## Autohaploid (1)
## Autodiploid (2)
## Allodiploid (1 1)
## Allotetraploid (2 2)
The next thing we’ll want to do is add our genomic alignment data.
For this dataset, we have alignment data stored in a CSV file, also in
the “extdata” directory of the polyRAD installation. We’ll add it to the
locTable
slot of our RADdata
object. Be sure
to name the new columns “Chr” and “Pos”.
alignfile <- system.file("extdata", "ClareMap_alignments.csv",
package = "polyRAD")
aligndata <- read.csv(alignfile, row.names = 1)
head(aligndata)
## Sorghum.LG Position.on.Sorghum.LG..bp.
## TP5489 1 4560204
## TP13305 1 4584260
## TP18261 1 2911329
## TP18674 1 387849
## TP19030 1 7576879
## TP26698 1 6972841
mydata$locTable$Chr <- aligndata[GetLoci(mydata), 1]
mydata$locTable$Pos <- aligndata[GetLoci(mydata), 2]
head(mydata$locTable)
## Chr Pos
## TP5489 1 4560204
## TP13305 1 4584260
## TP18261 1 2911329
## TP18674 1 387849
## TP19030 1 7576879
## TP26698 1 6972841
If you don’t have alignment data in your own dataset, you can still
use the pipeline described here. Just set
useLinkage = FALSE
in the code below. The advantage of
including alignment data is that gentoypes at linked markers are used
for imputing missing or correcting erroneous genotypes.
It is important that the only individuals included in the analysis are those that are truly part of the population. Allele frequencies are used for inferring segregation pattern, and could be incorrect if many individuals are included that are not part of the population. Additionally, the genotype priors will be incorrect for individuals that are not part of the population, leading to incorrect genotypes.
At this point we would normally do
However, because a very small number of markers was used in this example dataset, the PCA does not accurately reflect the relatedness of individuals. Here I will load a PCA that was done with the full set of markers.
Now a plot can be used for visualizing the relationship among taxa.
Now we’ll extract a subset of taxa that we actually want to analyze. We can see from the plot that a fair number of them were the product of self-fertilization of “Zebrinus-Justin” and should be eliminated.
realprogeny <- GetTaxa(mydata)[mydata$PCA[,"PC1"] > -10 &
mydata$PCA[,"PC1"] < 10]
# eliminate the one doubled haploid line in this group
realprogeny <- realprogeny[!realprogeny %in% c("IGR-2011-001", "p196-150A-c",
"p877-348-b")]
# also retain parents
keeptaxa <- c(realprogeny, GetDonorParent(mydata), GetRecurrentParent(mydata))
mydata <- SubsetByTaxon(mydata, taxa = keeptaxa)
plot(mydata)
Now we can perform a preliminary run of the pipeline. The
allowedDeviation
argument indicates how different the
apparent allele frequency (based on read depth ratios) can be from an
expected allele frequency (determined based on ploidy and mapping
population type) and still be classified as that allele frequency. The
default settings assume an F1 population, but the population type can be
adjusted using the n.gen.backcrossing
,
n.gen.intermating
, and n.gen.selfing
arguments. We’ll also lower minLikelihoodRatio
from the
default because one of the parents has many uncertain genotypes under
the tetraploid model (which was determined by exploration of the dataset
outside of this tutorial; many NA values were observed in
priorProb
under the default). Since this first run is for a
rough estimate of genotypes, we’ll set useLinkage = FALSE
to save a little computational time.
mydata2 <- PipelineMapping2Parents(mydata,
freqAllowedDeviation = 0.06,
useLinkage = FALSE,
minLikelihoodRatio = 2)
## Making initial parameter estimates...
## Generating sampling permutations for allele depth.
## Done.
We can use these preliminary estimates to determine whether we need
to adjust the overdispersion parameter. Exactly how much does read depth
distribution deviate from what would be expected under binomial
distibution? The TestOverdispersion
function will help us
here.
## Optimal value is 12.
sapply(overdispersionP[names(overdispersionP) != "optimal"],
quantile, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))
## 8 9 10 11 12 13
## 1% 0.005533597 0.004096143 0.002848244 0.001954561 0.001322713 0.0009932275
## 25% 0.304361182 0.279742132 0.258070046 0.239189269 0.223981900 0.2093677395
## 50% 0.583948430 0.562335702 0.545448898 0.528202715 0.512373571 0.4977739384
## 75% 0.818619207 0.808643509 0.799476463 0.793000540 0.785103002 0.7775481475
## 99% 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000 1.0000000000
## 14 15
## 1% 0.0007527206 0.0005746429
## 25% 0.1957778953 0.1837745108
## 50% 0.4842515536 0.4706852334
## 75% 0.7704398101 0.7645633921
## 99% 1.0000000000 1.0000000000
It looks like 12
follows the expected distribution of
p-values most closely, both from the automated output of
TestOverdispersion
and from examining the quantiles of
p-values. In the matrix of quantiles, we are looking for columns where
the 25th percentile is about 0.25, the 50th percentile is about 0.5,
etc. These libraries were made with 15 PCR cycles. If you used more PCR
cycles you are likely to have a lower optimal value for the
overdispersion parameter, or conversely if you used fewer PCR cycles you
will probably have a higher value for the overdispersion parameter.
Next we can check for markers that are behaving in a non-Mendelian fashion. If we are expecting diploid segregation, all markers should show a Hind/HE value of 0.5 or less. (For an autopolyploid, the expected value is $\frac{ploidy - 1}{ploidy}$.)
myhindhe <- HindHeMapping(mydata, ploidy = 2L)
hist(colMeans(myhindhe, na.rm = TRUE), col = "lightgrey",
xlab = "Hind/He", main = "Histogram of Hind/He by locus")
How does this compare to the distribution that we might expect from
this dataset if all loci were Mendelian? We will use
ExpectedHindHeMapping
with the overdispersion parameter
that we estimated above. One your own dataset, I recommend using the
default value of reps
.
set.seed(720)
ExpectedHindHeMapping(mydata, ploidy = 2, overdispersion = my_ovdisp, reps = 2,
contamRate = 0.001, errorRate = 0.001)
## Generating sampling permutations for allele depth.
## Simulating rep 1
## Completed 2 simulation reps.
## Mean Hind/He: 0.468
## Standard deviation: 0.027
## 95% of observations are between 0.426 and 0.529
Comparing the observed distribution to the expected distribution, we may want to filter some markers.
goodMarkers <- colnames(myhindhe)[which(colMeans(myhindhe, na.rm = TRUE) < 0.53 &
colMeans(myhindhe, na.rm = TRUE) > 0.43)]
mydata <- SubsetByLocus(mydata, goodMarkers)
Now we can re-run the pipeline to properly call the genotypes.
mydata <- PipelineMapping2Parents(mydata,
freqAllowedDeviation = 0.06,
useLinkage = TRUE, overdispersion = my_ovdisp,
minLikelihoodRatio = 2)
## Making initial parameter estimates...
## Generating sampling permutations for allele depth.
## Updating priors using linkage...
## Done.
We can examine the allele frequencies. Allele frequencies that fall outside of the expected ranges will be recorded as they were estimated from read depth. In this case most are within the expected ranges.
##
## 0.25 0.5 0.75
## 11 2 11
Genotype likelihood is also stored in the object for each possible genotype at each locus, taxon, and ploidy. This is the probability of seeing the observed distribution of reads.
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1 TP34939_0 TP34939_1
## 2 13 0 6 0 16 8 9
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1
## 0 9.706321e-04 4.742367e-09 9.987416e-01 1.237115e-07 9.973885e-01 4.444851e-11
## 1 2.328815e-02 2.328815e-02 3.741288e-02 3.741288e-02 1.567148e-03 1.567148e-03
## 2 4.742367e-09 9.706321e-04 1.237115e-07 9.987416e-01 4.444851e-11 9.973885e-01
## TP34939_0 TP34939_1
## 0 8.959640e-06 1.180948e-06
## 1 1.199589e-01 1.199589e-01
## 2 1.180948e-06 8.959640e-06
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1
## 0 9.706321e-04 4.742367e-09 9.987416e-01 1.237115e-07 9.973885e-01 4.444851e-11
## 1 1.578635e-01 6.145133e-04 2.426471e-01 2.279025e-03 5.641026e-02 1.187627e-05
## 2 2.328815e-02 2.328815e-02 3.741288e-02 3.741288e-02 1.567148e-03 1.567148e-03
## 3 6.145133e-04 1.578635e-01 2.279025e-03 2.426471e-01 1.187627e-05 5.641026e-02
## 4 4.742367e-09 9.706321e-04 1.237115e-07 9.987416e-01 4.444851e-11 9.973885e-01
## TP34939_0 TP34939_1
## 0 8.959640e-06 1.180948e-06
## 1 5.115888e-02 3.296284e-02
## 2 1.199589e-01 1.199589e-01
## 3 3.296284e-02 5.115888e-02
## 4 1.180948e-06 8.959640e-06
Above, for one individal (Map1-089), we see its read depth at eight alleles (four loci), followed by the genotype likelihoods under diploid and tetraploid models. For example, at locus TP19030, heterozygosity is the most likely state, although there is a chance that this individual is homozygous for allele 1 and the two reads of allele 0 were due to contamination. If this locus is allotetraploid, it is most likely that there is one copy of allele 0 and three copies of allele 1. Other individuals have higher depth and as a result there is less uncertainty in the genotype, particularly for the diploid model.
The prior genotype probabilities (expected genotype distributions)
are also stored in the object for each possible ploidy. These
distributions are estimated based on the most likely parent genotypes.
Low confidence parent genotypes can be ignored by increasing the
minLikelihoodRatio
argument to
PipelineMapping2Parents
.
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1 TP34939_0
## 0 0.25 0.25 0.5 0.0 0.5 0.0 0.0
## 1 0.50 0.50 0.5 0.5 0.5 0.5 0.5
## 2 0.25 0.25 0.0 0.5 0.0 0.5 0.5
## TP34939_1
## 0 0.5
## 1 0.5
## 2 0.0
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1 TP34939_0
## 0 0 0 NA NA 0 0 NA
## 1 0 0 NA NA 1 0 NA
## 2 1 1 NA NA 0 0 NA
## 3 0 0 NA NA 0 1 NA
## 4 0 0 NA NA 0 0 NA
## TP34939_1
## 0 NA
## 1 NA
## 2 NA
## 3 NA
## 4 NA
Here we see some pretty big differences under the diploid and allotetraploid models. For example, if TP19030 is behaving in a diploid fashion we expect F2-like segregation since both parents were heterozygous. However, if TP19030 is behaving in an allotetraploid fashion, a 1:1 segregation ratio is expected due to one parent being heterozygous at one isolocus and the other being homozygous at both isoloci.
Now we want to determine which ploidy is the best fit for each locus. This is done by comparing genotype prior probabilities to genotype likelihoods and estimating a χ2 statistic. Lower values indicate a better fit.
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1
## [1,] 0.3258861 0.3258861 0.07787878 0.07787878 0.1627961 0.1627961
## [2,] 162.8754654 162.8754654 NA NA 202.4987169 202.4987169
## TP34939_0 TP34939_1
## [1,] 1.021263 1.021263
## [2,] NA NA
We can make a plot to get an overall sense of how well the markers fit the diploid versus tetraploid model.
plot(mydata$ploidyChiSq[1,], mydata$ploidyChiSq[2,],
xlab = "Chi-squared for diploid model",
ylab = "Chi-squared for tetraploid model")
For each allele, whichever model gives the lower Chi-squared value is the one with the best fit. In this case it looks like everything is diploid with fairly high confidence, in agreement with our Hind/HE results.
Now we’ll examine the posterior genotype probabilities. These are still estimated separately for each ploidy.
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1
## 0 1.969255e-02 9.339240e-11 0.92839995 0.00000000 0.998431214 0.000000000
## 1 9.803075e-01 9.803075e-01 0.07160005 0.07160005 0.001568786 0.001568786
## 2 9.339240e-11 1.969255e-02 0.00000000 0.92839995 0.000000000 0.998431214
## TP34939_0 TP34939_1
## 0 0.000000e+00 1.778922e-07
## 1 9.999998e-01 9.999998e-01
## 2 1.778922e-07 0.000000e+00
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1 TP34939_0
## 0 0 0 NA NA 0 0 NA
## 1 0 0 NA NA 1 0 NA
## 2 1 1 NA NA 0 0 NA
## 3 0 0 NA NA 0 1 NA
## 4 0 0 NA NA 0 0 NA
## TP34939_1
## 0 NA
## 1 NA
## 2 NA
## 3 NA
## 4 NA
Since we decided from the Chi-squared results that the markers were only segregating in a diploid manner, we can remove allotetraploidy from the dataset.
Typically in a mapping population, due to noisy data polyRAD will not
be able to determine the segregation patterns of some markers, which end
up having NA
values for their prior and posterior
probabilities. There may also be some cases where both parents were
homozygous and as a result there is no segregation in an F1 population.
In this example dataset, these issues are not present (as long as
diploidy is assumed) because the markers were curated from a set that
had already been filtered for mapping. Generally, however, you would
want to find and remove such markers using
RemoveUngenotypedLoci
:
We can export the results for use in downstream analysis. The
function below weights possible ploidies for each allele based on the
results in mydata$ploidyChiSq
, and for each taxon outputs a
continuous, numerical genotype that is the mean of all possible
genotypes weighted by genotype posterior probabilities (i.e.
the posterior mean genotype). By default, one allele per locus is
discarded in order to avoid mathematical singularities in downstream
analysis. The continuous genotypes also range from zero to one by
default, which can be changed with the minval
and
maxval
arguments.
## TP53071_0 TP57018_0 TP110401_1 TP115159_0
## Kaskade-Justin 0.5 0.5 0.5 0.5
## Zebrinus-Justin 0.0 0.0 0.0 0.0
## Map1-001 0.5 0.5 0.0 0.0
## Map1-002 0.0 0.0 0.5 0.5
## Map1-003 0.0 0.0 0.0 0.5
## Map1-005 0.5 0.5 0.5 0.5
## Map1-008 0.0 0.0 0.5 0.0
Note that the parent posterior mean genotypes were estimated using
gentoype likelihood only, ignoring the priors set for the progeny. In
some places they may not match the progeny genotypes, indicating a
likely error in parental genotype calling. We can see the parental
genotypes that were used for estimating progeny priors using
$likelyGeno_donor
and
$likelyGeno_recurrent
.
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1 TP34939_0 TP34939_1
## 1 1 1 1 0 2 1 1
## TP19030_0 TP19030_1 TP28986_0 TP28986_1 TP31810_0 TP31810_1 TP34939_0 TP34939_1
## 1 1 0 2 1 1 2 0
Pipelines in polyRAD for processing a diversity panel (i.e. a germplasm collection, a set of samples collected in the wild, or a panel for genome-wide association analysis or genomic prediction) use iterative algorithms. Essentially, allele frequencies are re-estimated with each iteration until convergence is reached.
Here we’ll import a RAD-seq dataset from a large collection of wild and ornamental Miscanthus from Clark et al. (2014; doi:10.1093/aob/mcu084).
Since the data are in VCF format, we will need the Bioconductor package VariantAnnotation to load them. See https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html for installation instructions.
Again, with your own dataset you will not need to use
system.file
(see section on mapping populations).
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'VariantAnnotation'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'IRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomeInfoDb'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomicRanges'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'XVector'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'SummarizedExperiment'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'S4Arrays'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'DelayedArray'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'SparseArray'
## Warning: replacing previous import 'S4Arrays::read_block' by
## 'DelayedArray::read_block' when loading 'SummarizedExperiment'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'Biostrings'
## Warning: multiple methods tables found for 'setequal'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'AnnotationDbi'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomicFeatures'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'rtracklayer'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomicAlignments'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'BSgenome'
For your own VCF files, you will want to compress and index them before reading them. This has already been done for the file supplied with polyRAD, but here is how you would do it:
This dataset is mostly diploid, but it contains a few haploid and triploid individuals. We’ll load a text file listing the ploidy of each individual.
pldfile <- system.file("extdata", "Msi_ploidies.txt", package = "polyRAD")
msi_ploidies <- read.table(pldfile, sep = "\t", header = FALSE)
head(msi_ploidies)
## V1 V2
## 1 Akeno 2
## 2 blank 2
## 3 CANE9233-US47-0011 2
## 4 DC-2010-001-A 2
## 5 DC-2010-001-E 2
## 6 Gunma 2
##
## 1 2 3
## 3 581 1
Now we can make our RADdata
object. Because this is a
small example dataset, we are setting expectedLoci
and
expectedAlleles
to very low values; in a real dataset they
should reflect how much data you are actually expecting. It is best to
slightly overestimate the number of expected alleles and loci.
mydata <- VCF2RADdata(myVCF, possiblePloidies = list(2, c(2,2)),
expectedLoci = 100, expectedAlleles = 500,
taxaPloidy = pld_vect)
## Reading file...
## Unpacking data from VCF...
## Filtering markers...
## Phasing 55 SNPs on chromosome 01
## Reading file...
## 24 loci imported.
## Building RADdata object...
## Merging rare haplotypes...
## 24 markers retained out of 24 originally.
## ## RADdata object ##
## 585 taxa and 24 loci
## 422433 total reads
## Assumed sample cross-contamination rate of 0.001
##
## Possible ploidies:
## Autodiploid (2)
## Autotriploid (3)
## Autohaploid (1)
## Allotetraploid (2 2)
## Allohexaploid (3 3)
## Allodiploid (1 1)
For natural populations and diversity panels, we can run
TestOverdispersion
before performing any genotype
calling.
## Genotype estimates not found in object. Performing rough genotype estimation under HWE.
## Generating sampling permutations for allele depth.
## Optimal value is 10.
sapply(overdispersionP[names(overdispersionP) != "optimal"],
quantile, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))
## 8 9 10 11 12 13 14
## 1% 0.03361521 0.02496243 0.0185622 0.0141304 0.01060993 0.00805366 0.006298907
## 25% 0.26936814 0.24868707 0.2290173 0.2086904 0.19289979 0.18050423 0.167115359
## 50% 0.53640477 0.51375557 0.4922544 0.4732410 0.46074857 0.44954064 0.435073179
## 75% 0.80687246 0.79642717 0.7870220 0.7785209 0.76967492 0.76067468 0.753409104
## 99% 1.00000000 1.00000000 1.0000000 1.0000000 1.00000000 1.00000000 1.000000000
In this case, ten looks like a good value. In the matrix of quantiles, we are looking for columns where the 25th percentile is about 0.25, the 50th percentile is about 0.5, etc. These libraries were made with 15 PCR cycles. If you used more PCR cycles you are likely to have a lower optimal value for the overdispersion parameter, or conversely if you used fewer PCR cycles you will probably have a higher value for the overdispersion parameter.
Before we perform genotype calling, we can also test for Mendelian segregation at each marker using the Hind/HE statistic.
myhindhe <- HindHe(mydata)
myhindheByLoc <- colMeans(myhindhe, na.rm = TRUE)
hist(myhindheByLoc, col = "lightgrey",
xlab = "Hind/He", main = "Histogram of Hind/He by locus")
abline(v = 0.5, col = "blue", lwd = 2)
The peak below 0.5 indicates well-behaved diploid loci. For estimation of inbreeding, we’ll just want to look at markers with a minor allele frequency of at least 0.05, since lower minor allele frequencies tend to have a lot of bias in Hind/HE dependent on sample size. We will also just look at the predominant ploidy (diploid in this case) to estimate inbreeding.
mydata <- AddAlleleFreqHWE(mydata)
theseloci <- GetLoci(mydata)[mydata$alleles2loc[mydata$alleleFreq >= 0.05 & mydata$alleleFreq < 0.5]]
theseloci <- unique(theseloci)
myhindheByLoc2 <- colMeans(myhindhe[mydata$taxaPloidy == 2L, theseloci], na.rm = TRUE)
hist(myhindheByLoc2, col = "lightgrey",
xlab = "Hind/He", main = "Histogram of Hind/He by locus, MAF >= 0.05")
abline(v = 0.5, col = "blue", lwd = 2)
In a typical dataset with more markers, you can get more resolution
on the histogram (see the breaks
argument of
hist
), but let’s say the peak is at 0.35. The graph below
can be used to estimate inbreeding from Hind/HE
and overdispersion. For a diploid with overdispersion of 10 and Hind/HE
of 0.35, inbreeding is about 0.25. Note that the graph assumes a
sequencing error rate of 0.001 and a minor allele frequency of at least
0.05.
There is also the InbreedingFromHindHe
function for
estimating inbreeding, but it is from an older version of polyRAD and
does not account for overdispersion or sequencing error, leading to
overestimates of inbreeding.
With overdispersion and inbreeding estimated, we can simulate what
the Hind/HE
distribution might look like if the dataset consisted entirely of
Mendelian diploid loci with no other technical issues. Here
reps
is set to 10 because of the size of the dataset and
the need for a short run time, but in your own data is is probably best
to leave it at the default. You do not need to use set.seed
unless you are trying to reproduce this vignette exactly.
set.seed(803)
ExpectedHindHe(mydata, inbreeding = 0.25, ploidy = 2, overdispersion = my_ovdisp,
reps = 10, contamRate = 0.001, errorRate = 0.001)
## Simulating rep 1
## Completed 10 simulation reps.
## Mean Hind/He: 0.346
## Standard deviation: 0.0543
## 95% of observations are between 0.253 and 0.467
According to these results, good quality markers can be expected to have Hind/HE values from about 0.24 to 0.48. Values lower than that indicate technical problems such as restriction cut site polymorphisms, causing overdispersion in the data that could reduce genotyping quality. Values higher than that indicate paralogy or higher ploidy than expected. Since we are allowing for allotetraploidy in our genotype calling, we’ll only remove markers where Hind/HE is too low (although you may consider filtering differently in your own dataset).
## [1] 0.2916667
We can iteratively estimate genotype probabilities assuming
Hardy-Weinberg equilibrium. The argument tol
is set to a
higher value than the default here in order to help the tutorial run
more quickly. Since Miscanthus is highly outcrossing, we will
leave the selfing.rate
argument at its default of zero.
Let’s take a look at allele frequencies:
We can do a different genotype probability estimation that models population structure and variation in allele frequencies among populations. We don’t need to specify populations, since principal components analysis is used to assess population structure assuming an isolation-by-distance model, with gradients of gene flow across many groups of individuals. This dataset includes a very broad sampling of Miscanthus across Asia, so it is very appropriate to model population structure in this case.
For this example, since random number generation is used internally
by IteratePopStruct
for probabalistic principal components
analysis, I am setting a seed so that the vignette always renders in the
same way.
set.seed(3908)
mydataPopStruct <- IteratePopStruct(mydata, nPcsInit = 8, tol = 5e-03,
overdispersion = 10)
Allele frequency estimates have changed slightly:
Here’s some of the population structure that was used for modeling allele frequencies (fairly weak in this case because so few markers were used):
And here’s an example of allele frequency varying across the
environment. Allele frequencies were estimated for each taxon, and are
stored in the $alleleFreqByTaxa
slot. In the plot below,
color indicates estimated local allele frequency.
Let’s examine the inheritance mode of the markers again now that we have run the pipeline.
plot(mydataPopStruct$ploidyChiSq[1,], mydataPopStruct$ploidyChiSq[2,],
xlab = "Chi-squared for diploid model",
ylab = "Chi-squared for allotetraploid model", log = "xy")
abline(a = 0, b = 1, col = "blue", lwd = 2)
It seems that some markers look allotetraploid, and others look diploid. We can see if this matches Hind/HE results.
myChiSqRat <- mydataPopStruct$ploidyChiSq[1,] / mydataPopStruct$ploidyChiSq[2,]
myChiSqRat <- tapply(myChiSqRat, mydataPopStruct$alleles2loc, mean)
allelesPerLoc <- as.vector(table(mydataPopStruct$alleles2loc))
library(ggplot2)
ggplot(mapping = aes(x = myhindheByLoc[GetLoci(mydata)], y = myChiSqRat, fill = as.factor(allelesPerLoc))) +
geom_point(shape = 21, size = 3) +
labs(x = "Hind/He", y = "Ratio of Chi-squared values, diploid to allotetraploid",
fill = "Alleles per locus") +
geom_hline(yintercept = 1) +
geom_vline(xintercept = 0.5) +
scale_fill_brewer(palette = "YlOrRd")
Markers that fall in (or near) the lower-left quadrent are probably well-behaved diploid markers, but others might represent merged paralogs.
As before, we can export the posterior mean genotypes for downstream analysis.
## S01_139820_TT S01_139820_CT S01_150928_GG S01_150928_AA
## KMS207-8 0.884929543 0.114970457 2.867904e-08 9.992910e-08
## JM0051.003 0.007736915 0.279295342 3.370304e-07 1.031228e-06
## JM0034.001 0.125390846 0.317981638 1.268545e-04 6.107921e-07
## JM0220.001 0.220283676 0.171998526 1.061244e-07 3.562348e-07
## NC-2010-003-001 0.000100000 0.154894658 4.103560e-05 1.589559e-06
## JM0026.001 0.093776437 0.566301114 1.996941e-08 2.563725e-06
## JM0026.002 0.162077586 0.163232974 2.925276e-03 3.066340e-06
## PI294605-US64-0007-01 0.006523966 0.008399963 1.963375e-06 6.720166e-09
## JM0058.001 0.671449424 0.293410060 2.940711e-04 1.775268e-07
## UI10-00086-Silberfeil 0.000100000 0.127364065 2.360426e-09 8.333395e-09
## S01_151004_A
## KMS207-8 7.415028e-03
## JM0051.003 7.481997e-02
## JM0034.001 2.588873e-02
## JM0220.001 1.000000e-04
## NC-2010-003-001 1.000000e-04
## JM0026.001 1.000000e-04
## JM0026.002 7.940355e-02
## PI294605-US64-0007-01 6.177545e-02
## JM0058.001 6.135116e-03
## UI10-00086-Silberfeil 3.272857e-05
If you expect that your species has high linkage disequilibrium, the
functions IterateHWE_LD
and IteratePopStructLD
behave like IterateHWE
and IteratePopStruct
,
respectively, but also update priors based on genotypes at linked
loci.
GBS/RAD data are inherently messy. Some markers may behave in a non-Mendelian fashion due to misalignments, amplification bias, presence-absence variation, or other issues. In addition to filtering out problematic markers, you may also want to confirm that all individuals in the dataset are well-behaved.
The Hind/HE
statistic (Clark et
al. 2020) helps to filter such markers and individuals. In a mapping
population it can be run using the HindHeMapping
function,
which requires a single ploidy to be input, along with the mapping
population design. In a natural population or diversity panel, the
HindHe
function can be used. HindHe
should
also be used for mapping populations in which the most recent generation
was created by random intermating among all progeny. In all cases, I
recommend running HindHe
or HindHeMapping
before running TestOverdispersion
or any of the genotype
calling functions, as demonstrated in the previous sections.
Below we’ll work with a dataset from Miscanthus
sacchariflorus, including 635 individuals and 1000 loci (Clark et al. 2018). The
RADdata
object is not provided here due to size, but the
following objects were created from it:
We will load these:
## [1] "myHindHe" "ploidies" "TotDepthT"
This additionally provides a vector called ploidies
indicating the ploidy of each individual, determined primarily by flow
cytometry. myHindHe
is a matrix with one value per
individual*locus, and TotDepthT
is a vector showing the
total read depth at each locus.
To investigate individuals, we can take the row means of the matrix:
Then we can plot these versus depth for each ploidy.
ggplot(data.frame(Depth = TotDepthT, HindHe = myHindHeByInd,
Ploidy = ploidies),
mapping = aes(x = Depth, y = HindHe, color = Ploidy)) +
geom_point() +
scale_x_log10() +
facet_wrap(~ Ploidy) +
geom_hline(data = data.frame(Ploidy = c("2x", "3x", "4x"),
ExpHindHe = c(1/2, 2/3, 3/4)),
mapping = aes(yintercept = ExpHindHe), lty = 2) +
labs(x = "Read Depth", y = "Hind/He", color = "Ploidy")
Dashed lines indicate the expected value under Hardy-Weinberg Equilibrium. This is $\frac{ploidy - 1}{ploidy}$, e.g. 0.5 for diploids and 0.75 for tetraploids. Since there is some population structure, most individuals show a lower value. However, some interspecific hybrids have values higher than expected. We can also see that it is fairly easy to distinguish diploids and tetraploids. This method is not a replacement for flow cytometry, but can complement it if some minority of samples in the dataset have unknown ploidy.
Let’s divide the Hind/HE results into those for diploids vs. tetraploids.
Now we can look a the distribution of values across markers.
myHindHeByLoc2x <- colMeans(myHindHe2x, na.rm = TRUE)
hist(myHindHeByLoc2x, breaks = 50, xlab = "Hind/He",
main = "Distribution of Hind/He among loci in diploids",
col = "lightgrey")
abline(v = 0.5, col = "blue", lwd = 2)
myHindHeByLoc4x <- colMeans(myHindHe4x, na.rm = TRUE)
hist(myHindHeByLoc4x, breaks = 50, xlab = "Hind/He",
main = "Distribution of Hind/He among loci in tetraploids",
col = "lightgrey")
abline(v = 0.75, col = "blue", lwd = 2)
Most loci look good, but those to the right of the blue line should probably be filtered from the dataset.
goodLoci <- colnames(myHindHe)[myHindHeByLoc2x < 0.5 & myHindHeByLoc4x < 0.75]
length(goodLoci) # 611 out of 1000 markers retained
## [1] 611
## [1] "S05_132813" "S05_266733" "S05_606899" "S05_754184" "S05_764224"
## [6] "S05_1643871"
The goodLoci
vector that we created here could then be
used by SubsetByLocus
to filter the dataset. Remember that
you would also want to set the taxaPloidy
slot to indicate
the ploidy of each individual. The ExpectedHindHe
function
can also help with determining a good cutoff for filtering markers.
RADdata
objects contain large matrices and arrays for
storing read depth and the parameters that are estimated by the pipeline
functions, and as a result require a lot of RAM (computer memory) in
comparison to the posterior mean genotypes that are exported. A
RADdata
object that has just been imported will take up
less RAM than one that has been processed by a pipeline function.
RADdata
objects will also take up more RAM (and take longer
for pipeline functions to process) if they have more possible ploidies
and/or higher ploidies.
If you have hundreds of thousands, or possibly even tens of
thousands, of markers in your dataset, it may be too large to process as
one object on a typical computer. In that case, I recommend using the
SplitByChromosome
function immediately after import. This
function will create separate RADdata
objects by
chromosomes or groups of chromosomes, and will save those objects to
separate R workspace (.RData) files on your hard drive. You can then run
a loop to re-import those objects one at a time, process each one with a
pipeline function, and export posterior mean geneotypes (or any other
parameters you wish to keep) to a file or a smaller R object. If you
have access to a high performance computing cluster, you may instead
wish to process individual chromosomes as parallel jobs.
If you don’t have alignment positions for your markers, or if you
want to divide them up some other way than by chromosome, see
SubsetByLocus
. If you are importing from VCF but don’t want
to import the whole genome at once, see the examples on the help page
for VCF2RADdata
for how to import just a particular genomic
region.
You might use SubsetByLocus
and select a random subset
of ~1000 loci to use with TestOverdispersion
for estimating
the overdispersion parameter.
If you are using one of the iterative pipelines, it is possible to
set the tol
argument higher in order to reduce processing
time at the expense of accuracy.
After you have run a pipeline, if you want to keep the
RADdata
object but discard any components that are not
needed for genotype export, you can use the StripDown
function.
Below is an example script showing how I processed a real dataset with hundreds of thousands of SNPs. Note that the (very large) VCF files are not included with the polyRAD installation.
library(polyRAD)
library(VariantAnnotation)
# Two files produced by the TASSEL-GBSv2 pipeline using two different
# enzyme systems.
NsiI_file <- "170705Msi_NsiI_genotypes.vcf.bgz"
PstI_file <- "170608Msi_PstI_genotypes.vcf.bgz"
# The vector allSam was defined outside of this script, and contains the
# names of all samples that I wanted to import. Below I find sample names
# within the VCF files that match those samples.
NsiI_sam <- allSam[allSam %in% samples(scanVcfHeader(NsiI_file))]
PstI_sam <- allSam[allSam %in% samples(scanVcfHeader(PstI_file))]
# Import two RADdata objects, assuming diploidy. A large yield size was
# used due to the computer having 64 Gb RAM; on a typical laptop you
# would probably want to keep the default of 5000.
PstI_RAD <- VCF2RADdata(PstI_file, samples = PstI_sam, yieldSize = 5e4,
expectedAlleles = 1e6, expectedLoci = 2e5)
NsiI_RAD <- VCF2RADdata(NsiI_file, samples = NsiI_sam, yieldSize = 5e4,
expectedAlleles = 1e6, expectedLoci = 2e5)
# remove any loci duplicated across the two sets
nLoci(PstI_RAD) # 116757
nLoci(NsiI_RAD) # 187434
nAlleles(PstI_RAD) # 478210
nAlleles(NsiI_RAD) # 952511
NsiI_keeploci <- which(!GetLoci(NsiI_RAD) %in% GetLoci(PstI_RAD))
cat(nLoci(NsiI_RAD) - length(NsiI_keeploci),
file = "180522Num_duplicate_loci.txt") #992 duplicate
NsiI_RAD <- SubsetByLocus(NsiI_RAD, NsiI_keeploci)
# combine allele depth into one matrix
PstI_depth <- PstI_RAD$alleleDepth
NsiI_depth <- NsiI_RAD$alleleDepth
total_depth <- matrix(0L, nrow = length(allSam),
ncol = ncol(PstI_depth) + ncol(NsiI_depth),
dimnames = list(allSam,
c(colnames(PstI_depth),
colnames(NsiI_depth))))
total_depth[,colnames(PstI_depth)] <- PstI_depth[allSam,]
total_depth[rownames(NsiI_depth),colnames(NsiI_depth)] <- NsiI_depth
# combine other slots
total_alleles2loc <- c(PstI_RAD$alleles2loc,
NsiI_RAD$alleles2loc + nLoci(PstI_RAD))
total_locTable <- rbind(PstI_RAD$locTable, NsiI_RAD$locTable)
total_alleleNucleotides <- c(PstI_RAD$alleleNucleotides,
NsiI_RAD$alleleNucleotides)
# build new RADdata object and save
total_RAD <- RADdata(total_depth, total_alleles2loc, total_locTable,
list(2L), 0.001, total_alleleNucleotides)
#save(total_RAD, file = "180524_RADdata_NsiIPstI.RData")
# Make groups representing pairs of chromosomes, and one group for all
# non-assembled scaffolds.
splitlist <- list(c("^01$", "^02$"),
c("^03$", "^04$"),
c("^05$", "^06$"),
c("^07$", "^08$"),
c("^09$", "^10$"),
c("^11$", "^12$"),
c("^13$", "^14$", "^15$"),
c("^16$", "^17$"),
c("^18$", "^194"), "^SCAFFOLD")
# split by chromosome and save seperate objects
SplitByChromosome(total_RAD, chromlist = splitlist,
chromlist.use.regex = TRUE, fileprefix = "180524splitRAD")
# files with RADdata objects
splitfiles <- grep("^180524splitRAD", list.files("."), value = TRUE)
# list to hold markers formatted for GAPIT/FarmCPU
GAPITlist <- list()
length(GAPITlist) <- length(splitfiles)
# loop through RADdata objects
for(i in 1:length(splitfiles)){
load(splitfiles[i])
splitRADdata <- IteratePopStructLD(splitRADdata)
GAPITlist[[i]] <- ExportGAPIT(splitRADdata)
}
#save(GAPITlist, file = "180524GAPITlist.RData")
# put together into one dataset for FarmCPU
GM.all <- rbind(GAPITlist[[1]]$GM, GAPITlist[[2]]$GM, GAPITlist[[3]]$GM,
GAPITlist[[4]]$GM, GAPITlist[[5]]$GM, GAPITlist[[6]]$GM,
GAPITlist[[7]]$GM, GAPITlist[[8]]$GM,
GAPITlist[[9]]$GM, GAPITlist[[10]]$GM)
GD.all <- cbind(GAPITlist[[1]]$GD, GAPITlist[[2]]$GD[,-1],
GAPITlist[[3]]$GD[,-1], GAPITlist[[4]]$GD[,-1],
GAPITlist[[5]]$GD[,-1], GAPITlist[[6]]$GD[,-1],
GAPITlist[[7]]$GD[,-1], GAPITlist[[8]]$GD[,-1],
GAPITlist[[9]]$GD[,-1], GAPITlist[[10]]$GD[,-1])
#save(GD.all, GM.all, file = "180525GM_GD_all_polyRAD.RData") # 1076888 markers
Clark LV, Lipka AE, and Sacks EJ (2019) polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids. G3 9(3):663–673, doi:10.1534/g3.118.200913.
To cite the Hind/HE statistic:
Clark LV, Mays W, Lipka AE, and Sacks EJ (2022) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 23: 101, doi:10.1186/s12859-022-04635-9.