This tutorial uses RAD-seq data from a diversity panel consisting mostly of wild-collected Miscanthus sacchariflorus (Clark et al. 2018, https://doi.org/10.1093/aob/mcy161, https://doi.org/10.13012/B2IDB-0170190_V3, https://doi.org/10.13012/B2IDB-8170405_V1). Most individuals are diploid or tetraploid, with a few triploids.
The R package containing this tutorial is archived on Zenodo at
In addition to loading polyRAD, we’ll load some other packages for population genetics and making plots. The polyRADtutorials package is not on CRAN but can be installed from GitHub or R-universe, and contains the dataset.
library(polyRAD)
library(adegenet)
library(polyRADtutorials)
library(ggplot2)
library(maps)
library(pegas)
library(PBSmapping)
library(spdep)
library(polysat)
library(tidyr)
library(umap)
First we will load a spreadsheet of collection location information
and ploidy as determined by flow cytometry. Here I use the
system.file()
function to locate the example file on my
computer, but for your own dataset you can input the file path
directly.
Msa_latlong <- read.csv(system.file("extdata", "Msa_ploidy.csv",
package = "polyRADtutorials"))
head(Msa_latlong)
## Accession Ploidy Country Latitude Longitude
## 1 PMS-074 2 China 40.13448 116.1748
## 2 PMS-512 2 China 41.32800 123.6909
## 3 PMS-458 4 China 36.17340 120.4978
## 4 PMS-075 2 China 40.13460 116.1751
## 5 PMS-076-7 2 China 40.16420 116.0500
## 6 KMS-widespread 4 South Korea 33.43167 126.4172
We’ll also load some geographical data, namely the borders of East Asian countries where these samples were collected. This will be used for plotting later.
## Accession Ploidy Country Latitude
## Length:595 Min. :2.00 Length:595 Min. :28.60
## Class :character 1st Qu.:2.00 Class :character 1st Qu.:35.59
## Mode :character Median :2.00 Mode :character Median :38.38
## Mean :2.83 Mean :39.63
## 3rd Qu.:4.00 3rd Qu.:43.97
## Max. :4.00 Max. :49.33
## Longitude
## Min. :104.5
## 1st Qu.:122.2
## Median :128.5
## Mean :129.2
## 3rd Qu.:134.9
## Max. :145.1
east_asia <- east_asia[east_asia$long > 104 & east_asia$long < 146 &
east_asia$lat > 28 & east_asia$lat < 50,]
# prevent some ugly lines in the map
east_asia$group2 <- integer(nrow(east_asia))
j <- 1L
for(i in seq_len(nrow(east_asia))){
if(i != 1L && east_asia$order[i] - east_asia$order[i - 1L] != 1L){
j <- j + 1L
}
east_asia$group2[i] <- j
}
Now we will import allelic read depth from VCF. Note that if you want
to be able to export to VCF later, you will need your reference genome
file to pass to the refgenome
argument (I am not doing that
here because I probably should not be distributing a reference genome
sequence on GitHub). I am also setting expectedAlleles
and
expectedLoci
to low values because I know there are only
5000 SNPs in this example dataset. Many of these SNPs originated from
the same RAD tags, so you’ll see that they get reduced to 1395 markers,
many of which are multiallelic.
# Find the VCF file to import
infile <- system.file("extdata", "Msa_Chr03_5k.vcf.bgz",
package = "polyRADtutorials")
# Make a vector of taxa ploidies from our metadata spreadsheet
tp <- Msa_latlong$Ploidy
names(tp) <- Msa_latlong$Accession
# Read the VCF
mydata <- VCF2RADdata(infile, possiblePloidies = list(2), taxaPloidy = tp,
expectedAlleles = 15000, expectedLoci = 5000)
## ## RADdata object ##
## 595 taxa and 1395 loci
## 24387145 total reads
## Assumed sample cross-contamination rate of 0.001
##
## Possible ploidies:
## Autodiploid (2)
## Autotetraploid (4)
## Autotriploid (3)
Let’s examine the distribution of read depth and the Hind/HE statistic across samples, by ploidy.
hh <- HindHe(mydata)
Msa_latlong$HindHe <- rowMeans(hh, na.rm = TRUE)[Msa_latlong$Accession]
Msa_latlong$Depth <- rowSums(mydata$locDepth)[Msa_latlong$Accession]
ggplot(Msa_latlong, aes(x = Depth, y = HindHe)) +
geom_point() +
facet_wrap(~ Ploidy) +
ggtitle("Read depth and Hind/He across individuals") +
scale_x_log10()
Within ploidy, there aren’t any major outliers for Hind/HE, so we probably don’t need to filter samples. We can visualize Hind/HE, scaled by its expected value, on a map to see if particular regions are more inbred. In this case there’s nothing particularly striking although some regions of Russia and central Japan may be more inbred than other regions.
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = Msa_latlong,
mapping = aes(x = Longitude, y = Latitude, fill = HindHe * Ploidy / (Ploidy - 1),
shape = as.character(Ploidy)),
size = 3) +
scale_fill_viridis_c() +
labs(shape = "Ploidy") +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
ggtitle("Hind/He by individual")
To filter markers by Hind/HE, we must first estimate overdispersion.
## Genotype estimates not found in object. Performing rough genotype estimation under HWE.
## Generating sampling permutations for allele depth.
## Optimal value is 11.
Next, we must estimate inbreeding. We will do this within ploidy using the graph provided in the polyRAD tutorial, using only markers with a minor allele frequency of at least 0.05.
alfreq2x <- colMeans(mydata$depthRatio[mydata$taxaPloidy == 2,], na.rm = TRUE)
theseloci2x <- GetLoci(mydata)[mydata$alleles2loc[alfreq2x >= 0.05 & alfreq2x < 0.5]]
theseloci2x <- unique(theseloci2x)
hh2x_05 <- colMeans(hh[mydata$taxaPloidy == 2, theseloci2x], na.rm = TRUE)
hist(hh2x_05, breaks = 20, xlab = "Hind/He", main = "Hind/He in diploids, MAF >= 0.05")
alfreq4x <- colMeans(mydata$depthRatio[mydata$taxaPloidy == 4,], na.rm = TRUE)
theseloci4x <- GetLoci(mydata)[mydata$alleles2loc[alfreq4x >= 0.05 & alfreq4x < 0.5]]
theseloci4x <- unique(theseloci4x)
hh4x_05 <- colMeans(hh[mydata$taxaPloidy == 4, theseloci4x], na.rm = TRUE)
hist(hh4x_05, breaks = 20, xlab = "Hind/He", main = "Hind/He in tetraploids, MAF >= 0.05")
We are expecting 0.5 in diploids and 0.75 in tetraploids, so the modes of 0.35 for diploids and 0.45 for tetraploids probably represent Mendelian markers. Using the chart from the main tutorial, with overdispersion of 12 this means inbreeding of about 0.3 for both diploids and tetraploids.
Now we can use our estimated overdispersion and inbreeding to determine the expected distribution of Hind/HE in our dataset, given sample size and read depth, if it consisted entirely of Mendelian markers. We will use a 95% confidence interval on that distribution to filter the dataset.
## Simulating rep 1
## Completed 4 simulation reps.
## Mean Hind/He: 0.391
## Standard deviation: 0.047
## 95% of observations are between 0.314 and 0.507
thresh1 <- 0.321
thresh2 <- 0.507
keeploci <- names(hhByLoc)[hhByLoc > thresh1 & hhByLoc < thresh2]
mydata <- SubsetByLocus(mydata, keeploci)
mydata
## ## RADdata object ##
## 595 taxa and 557 loci
## 8497806 total reads
## Assumed sample cross-contamination rate of 0.001
##
## Possible ploidies:
## Autodiploid (2)
## Autotetraploid (4)
## Autotriploid (3)
Since this is a collection of wild plants,
IteratePopStruct
will be the most accurate way to call
genotypes.
## Performing initial PCA and allele frequency estimation.
## Starting iteration 1
## PCs used: 5
## Generating sampling permutations for allele depth.
## Mean difference in allele frequencies of 0.0119675976945463
## Starting iteration 2
## PCs used: 6
## Mean difference in allele frequencies of 0.00836220815504441
## Starting iteration 3
## PCs used: 7
## Mean difference in allele frequencies of 0.00723467211831438
## Starting iteration 4
## PCs used: 8
## Mean difference in allele frequencies of 0.00255887651506484
## Starting iteration 5
## PCs used: 8
## Mean difference in allele frequencies of 0.00159533301665825
## Starting iteration 6
## PCs used: 8
## Mean difference in allele frequencies of 0.00115739767368151
## Starting iteration 7
## PCs used: 8
## Mean difference in allele frequencies of 0.000897258722711223
However… I plan to use my genotypes to estimate population structure, when I just used population structure to estimate my genotypes. It’s kind of circular. Will a reviewer have a problem with it? Maybe. Think about it, anyway. If you would rather use a naive method to call genotypes, do this:
mydataNaive <- AddGenotypePriorProb_Even(mydata)
mydataNaive <- AddGenotypeLikelihood(mydataNaive, overdispersion = od$optimal)
## Allele frequencies not found; estimating under HWE from depth ratios.
## Generating sampling permutations for allele depth.
Alternatively, we can fine-tune the number of PCs so that genotypes are corrected somewhat by population structure, but hopefully not overcorrected.
## Performing initial PCA and allele frequency estimation.
## Starting iteration 1
## PCs used: 2
## Generating sampling permutations for allele depth.
## Mean difference in allele frequencies of 0.0138692770267731
## Starting iteration 2
## PCs used: 3
## Mean difference in allele frequencies of 0.00339792175295924
## Starting iteration 3
## PCs used: 3
## Mean difference in allele frequencies of 0.0013791234277688
## Starting iteration 4
## PCs used: 3
## Mean difference in allele frequencies of 0.000682214613514888
We can take a look at the PCs that were used for genotype calling by the population structure method. This will give us an idea of patterns of population structure that could be exaggerated using that calling method. The first two axes are shown below, but I recommend exploring all eight (or however many are generated with your dataset).
## [1] TRUE
Msa_latlong <- cbind(Msa_latlong, mydataPopStruct$PCA)
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = Msa_latlong,
mapping = aes(x = Longitude, y = Latitude, fill = PC1,
shape = as.character(Ploidy)),
size = 3) +
scale_fill_viridis_c() +
labs(shape = "Ploidy") +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
ggtitle("PC1")
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = Msa_latlong,
mapping = aes(x = Longitude, y = Latitude, fill = PC2,
shape = as.character(Ploidy)),
size = 3) +
scale_fill_viridis_c() +
labs(shape = "Ploidy") +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
ggtitle("PC2")
The R package adegenet has some nice analyses popular in population genetics. Better still, many of them now work on simple data matrices, without the need to use any of adegenet’s built-in data classes. So we’ll generate matrices of marker data from polyRAD and then go from there.
matrixPopStruct <- GetWeightedMeanGenotypes(mydataPopStruct)
matrixNaive <- GetWeightedMeanGenotypes(mydataNaive)
matrixIntermediate <- GetWeightedMeanGenotypes(mydataIntermediate)
First we’ll run find.clust
to identify likely cluster
membership for each sample.
Based on these curves, I chose seven, four, and four clusters for the population structure, naive, and intermediate datasets, respectively.
Now we will use these preliminary clusters to run DAPC. You may
choose to omit the n.pca
and n.da
arguments in
order to choose these values interactively.
dapc_PopStruct <- dapc(matrixPopStruct, clust1$grp,
n.pca = 200, n.da = 6)
dapc_Naive <- dapc(matrixNaive, clust2$grp,
n.pca = 200, n.da = 3)
dapc_Intermediate <- dapc(matrixIntermediate, clust3$grp,
n.pca = 200, n.da = 3)
We’ll add the assignments to our sample table.
Msa_latlong$DAPC_PopStruct <- dapc_PopStruct$assign
Msa_latlong$DAPC_Naive <- dapc_Naive$assign
Msa_latlong$DAPC_Intermediate <- dapc_Intermediate$assign
How did DAPC groups correspond to ploidy?
##
## 2 3 4
## 1 0 0 117
## 2 48 0 0
## 3 126 0 0
## 4 8 5 14
## 5 56 0 0
## 6 1 2 112
## 7 105 1 0
##
## 2 3 4
## 1 57 0 0
## 2 194 0 0
## 3 20 5 168
## 4 73 3 75
##
## 2 3 4
## 1 0 0 128
## 2 229 1 0
## 3 55 0 0
## 4 60 7 115
We can visualize Hind/HE by DAPC group in order to see if that matches our expectations based on ploidy. Groups with higher values than expected could represent interspecies hybrids.
ggplot(Msa_latlong, aes(x = DAPC_Intermediate, y = HindHe, fill = DAPC_Intermediate)) +
geom_boxplot()
We can map these groups to see if they make sense geographically.
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = Msa_latlong,
mapping = aes(x = Longitude, y = Latitude, fill = DAPC_PopStruct,
shape = as.character(Ploidy)),
size = 2.5) +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
labs(shape = "Ploidy")
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = Msa_latlong,
mapping = aes(x = Longitude, y = Latitude, fill = DAPC_Naive,
shape = as.character(Ploidy)),
size = 2.5) +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
labs(shape = "Ploidy")
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = Msa_latlong,
mapping = aes(x = Longitude, y = Latitude, fill = DAPC_Intermediate,
shape = as.character(Ploidy)),
size = 2.5) +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
labs(shape = "Ploidy")
The intermediate approach seems to make the most sense geographically, so it may be best to proceed with that approach.
sPCA is like PCA, but combines genetic and spatial information in order to find more subtle patterns of geographic population structure. To perform this analysis, we will first need to convert latitude and longitude to UTM. We’ll choose a UTM zone central to the dataset; see https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system.
tempXY <- data.frame(Accession = Msa_latlong$Accession,
X = Msa_latlong$Longitude,
Y = Msa_latlong$Latitude)
attr(tempXY, "projection") <- "LL"
attr(tempXY, "zone") <- 52
utmXY <- PBSmapping::convUL(tempXY)
## convUL: Converting coordinates within the northern hemisphere.
## Accession X Y
## Length:595 Min. :-1875.4 Min. :3232
## Class :character 1st Qu.: -76.3 1st Qu.:3961
## Mode :character Median : 459.7 Median :4309
## Mean : 476.7 Mean :4418
## 3rd Qu.: 939.4 3rd Qu.:4923
## Max. : 1810.2 Max. :5492
We will jitter the positions since some accessions were collected at the same site.
Now we need to build a connection network between collection sites. I chose the Gabriel graph (type 2 when prompted).
Now we can run sPCA. I recommend running first without the
nfposi
and nfnega
arguments so that you can
chose these interactively.
spca_PopStruct <- spca(matrixPopStruct[utmXY$Accession,], cn = myCN,
nfposi = 5, nfnega = 0, scannf = FALSE)
## Registered S3 methods overwritten by 'adegraphics':
## method from
## biplot.dudi ade4
## kplot.foucart ade4
## kplot.mcoa ade4
## kplot.mfa ade4
## kplot.pta ade4
## kplot.sepan ade4
## kplot.statis ade4
## scatter.coa ade4
## scatter.dudi ade4
## scatter.nipals ade4
## scatter.pco ade4
## score.acm ade4
## score.mix ade4
## score.pca ade4
## screeplot.dudi ade4
## Registered S3 methods overwritten by 'adespatial':
## method from
## plot.multispati adegraphics
## print.multispati ade4
## summary.multispati ade4
spca_Naive <- spca(matrixNaive[utmXY$Accession,], cn = myCN,
nfposi = 5, nfnega = 0, scannf = FALSE)
spca_Intermediate <- spca(matrixIntermediate[utmXY$Accession,], cn = myCN,
nfposi = 5, nfnega = 0, scannf = FALSE)
We’ll add the results to our sample table.
temp <- matrix(NA_real_, nrow = nrow(Msa_latlong), ncol = 15,
dimnames = list(Msa_latlong$Accession,
c(paste0("sPCA_PopStruct_", 1:5),
paste0("sPCA_Naive_", 1:5),
paste0("sPCA_Intermediate_", 1:5))))
temp[rownames(spca_PopStruct$li),1:5] <- as.matrix(spca_PopStruct$li)
temp[rownames(spca_Naive$li),6:10] <- as.matrix(spca_Naive$li)
temp[rownames(spca_Intermediate$li),11:15] <- as.matrix(spca_Intermediate$li)
Msa_latlong <- cbind(Msa_latlong, temp)
Now we can visualize the first few axes. The first two are similar to what we got with the PCA done for genotype calling. The fourth and fifth differ by genotype calling method, so the first three are probably the most meaningful and trustworthy.
temp <- Msa_latlong[, c(1,2,4,5,grep("sPCA_", colnames(Msa_latlong)))]
temp <- pivot_longer(temp, cols = -(1:4), names_prefix = "sPCA_",
names_to = c("Approach", "PC"), names_sep = "_",
values_to = "Score")
ggplot() +
geom_path(data = east_asia,
mapping = aes(x = long, y = lat, group = group2),
color = "black") +
geom_point(data = temp,
mapping = aes(x = Longitude, y = Latitude, fill = Score,
shape = as.character(Ploidy))) +
scale_fill_viridis_c() +
scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
labs(shape = "Ploidy") +
scale_x_continuous(limits = c(110, 146)) +
facet_grid(PC ~ Approach, labeller = "label_both")
For the rest of the tutorial, I will use only one genotype calling method for the sake of brevity.
For a Mantel test, we need to compare genetic and geographic distances. First we will get Euclidian genetic distances.
We can use the UTM coordinates directly to get distance in kilometers.
Now we can test that these are correlated with each other. Unsurprisingly at this point, the relationship is significant.
## Monte-Carlo test
## Call: mantel.randtest(m1 = distIntermediate, m2 = dist_geo)
##
## Observation: 0.3271219
##
## Based on 999 replicates
## Simulated p-value: 0.001
## Alternative hypothesis: greater
##
## Std.Obs Expectation Variance
## 17.9079710542 -0.0005498143 0.0003347998
This relationship can also be visualized. A positive, bimodal relationship can be observed. It also looks like there are some pairs with particularly low genetic distance in close geographic proximity, which may represent siblings, parent-offspring, or clones.
ggplot(mapping = aes(x = dist_geo, y = distIntermediate)) +
geom_point(alpha = 0.1) + geom_density_2d() +
labs(x = "Geographic distance", y = "Genetic distance")
## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.
## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.
If we visualize our first two PCA axes, some geographical regions and DAPC clusters are stretched out and not well-represented on them, and we know there’s more to the story since there was a third axis that seemed important and reproducible.
ggplot(Msa_latlong, aes(x = PC1, y = PC2, color = Country, shape = as.factor(Ploidy))) +
geom_point()
ggplot(Msa_latlong, aes(x = PC1, y = PC2, color = DAPC_Intermediate, shape = as.factor(Ploidy))) +
geom_point()
An alternative is a non-linear reduction called UMAP, which places similar individuals next to each other, with further distances only being rough approximations of genetic distance.
set.seed(1002)
myumap <- umap(matrixIntermediate, method = "naive")
identical(Msa_latlong$Accession, rownames(myumap$layout))
## [1] TRUE
Msa_latlong$UMAP_1 <- myumap$layout[,1]
Msa_latlong$UMAP_2 <- myumap$layout[,2]
ggplot(Msa_latlong, aes(x = UMAP_1, y = UMAP_2, color = Country, shape = as.factor(Ploidy))) +
geom_point()
ggplot(Msa_latlong, aes(x = UMAP_1, y = UMAP_2, color = DAPC_Intermediate, shape = as.factor(Ploidy))) +
geom_point()
We can see the separation of northern and southern Japan much more clearly now. The Yangtze River cluster still looks like an offshoot of other populations from China, but no longer takes up a disproportionate amount of space on the plot for the number of individuals it has.
In AMOVA, genetic distances between individuals or populations are partitioned so that we can determine the amount of variance explained by hierarchical groupings. There are a few implementations in R, and we will use the one from the pegas package here. Here the matrix of Euclidean distances is regressed on the DAPC clusters.
set.seed(1002)
amovaIntermediate <- pegas::amova(distIntermediate ~ DAPC_Intermediate, data = Msa_latlong)
amovaIntermediate
##
## Analysis of Molecular Variance
##
## Call: pegas::amova(formula = distIntermediate ~ DAPC_Intermediate,
## data = Msa_latlong)
##
## SSD MSD df
## DAPC_Intermediate 3268.042 1089.34721 3
## Error 23277.234 39.38618 591
## Total 26545.276 44.68902 594
##
## Variance components:
## sigma2 P.value
## DAPC_Intermediate 7.5392 0
## Error 39.3862
##
## Phi-statistics:
## DAPC_Intermediate.in.GLOBAL
## 0.1606633
##
## Variance coefficients:
## a
## 139.2672
We can see that these four DAPC clusters explain 16.1% of the
molecular variance. Significance is less than 0.001; we would have to do
more permutations (nperm
) to determine it exactly.
To calculate population differentiation statistics, we need to know allele frequencies in each group. We can calculate these as column means of the posterior mean genotypes.
matrixIntermediateAll <- GetWeightedMeanGenotypes(mydataIntermediate, omit1allelePerLocus = FALSE)
freq_by_pop <- do.call(rbind, by(matrixIntermediateAll, Msa_latlong$DAPC_Intermediate, colMeans))
freq_by_pop[,1:5]
## S03_117378_T S03_117378_C S03_189211_CC S03_189211_CT S03_189211_TT
## 1 0.8504978 0.1495022 0.8124241 0.1849128 0.002352002
## 2 0.8189026 0.1810974 0.8930126 0.0982381 0.008520857
## 3 0.7101366 0.2898634 0.7814224 0.2142264 0.003801448
## 4 0.8259216 0.1740784 0.7885790 0.1779828 0.031604546
We need to make sure that allele frequencies sum to one within loci, which is not guaranteed by polyRAD when there are more than two alleles. We also need to rename alleles to a format supported by polysat.
for(i in seq_len(nLoci(mydataIntermediate))){
thesealleles <- which(mydataIntermediate$alleles2loc == i)
# normalize to sum to one
temp <- freq_by_pop[,thesealleles]
temp <- sweep(temp, 1, rowSums(temp), "/")
freq_by_pop[,thesealleles] <- temp
# change allele names
thisloc <- GetLoci(mydataIntermediate)[i]
thisloc <- gsub("\\.", "_", thisloc) # remove periods from locus names
colnames(freq_by_pop)[thesealleles] <- paste(thisloc, seq_along(thesealleles), sep = ".")
}
For the polysat calcPopDiff
function, we also need a
column called “Genomes” indicating the size of the population multiplied
by the ploidy.
genomes <- tapply(Msa_latlong$Ploidy, Msa_latlong$DAPC_Intermediate, sum)
freq_df <- data.frame(Genomes = genomes,
freq_by_pop)
Now we can get differentiation statistics.
## 1 2 3 4
## 1 0.00000000 0.05538850 0.07910069 0.02364677
## 2 0.05538850 0.00000000 0.09681908 0.04313514
## 3 0.07910069 0.09681908 0.00000000 0.05197611
## 4 0.02364677 0.04313514 0.05197611 0.00000000
## 1 2 3 4
## 1 0.00000000 3.068945e-02 4.136511e-02 1.368700e-02
## 2 0.03068945 2.520609e-18 5.225528e-02 2.380334e-02
## 3 0.04136511 5.225528e-02 3.696712e-18 2.733831e-02
## 4 0.01368700 2.380334e-02 2.733831e-02 -3.210725e-18
We can also use our matrix of allele frequencies to calculate expected heterozygosity (i.e. the probability that if two alleles are drawn at random from the population, they will be different), a common measure of diversity. Here I’m just doing the calculation manually rather than using polysat.
div_mat <- matrix(nrow = nrow(freq_by_pop), ncol = nLoci(mydataIntermediate),
dimnames = list(rownames(freq_by_pop), GetLoci(mydataIntermediate)))
for(i in seq_len(nLoci(mydataIntermediate))){
thesealleles <- which(mydataNaive$alleles2loc == i)
temp <- freq_by_pop[,thesealleles]
div_mat[,i] <- 1 - rowSums(temp ^ 2)
}
rowMeans(div_mat)
## 1 2 3 4
## 0.3744920 0.3588097 0.3440956 0.3861777
Structure
is more than two decades old and has stood the test of time. You can use
the Export_Structure
function to take discrete gentoype
calls from polyRAD and export them to Structure. A Structure run on a
typical GBS dataset can take several days, so I recommend using a
computing cluster or cloud service to run many jobs in parallel from the
command line. In my experience, this approach is worthwhile because
Structure is much more sensitive than similar but faster methods.
(Instructions on running Structure from the command line are beyond the
scope of this tutorial.)
Number of individuals: 595
Number of loci: 553
Ploidy of data: 4
Missing data value: -9
File contains:
Row of marker names
Individual ID for each individual
If you have any questions or problems with this tutorial, you can start a discussion or file an issue.