Using polyRAD for population genetics

Introduction

This tutorial uses RAD-seq data from a diversity panel consisting mostly of wild-collected Miscanthus sacchariflorus (Clark et al. 2018, https://doi.org/10.1093/aob/mcy161, https://doi.org/10.13012/B2IDB-0170190_V3, https://doi.org/10.13012/B2IDB-8170405_V1). Most individuals are diploid or tetraploid, with a few triploids.

The R package containing this tutorial is archived on Zenodo at DOI

In addition to loading polyRAD, we’ll load some other packages for population genetics and making plots. The polyRADtutorials package is not on CRAN but can be installed from GitHub or R-universe, and contains the dataset.

library(polyRAD)
library(adegenet)
library(polyRADtutorials)
library(ggplot2)
library(maps)
library(pegas)
library(PBSmapping)
library(spdep)
library(polysat)
library(tidyr)
library(umap)

First we will load a spreadsheet of collection location information and ploidy as determined by flow cytometry. Here I use the system.file() function to locate the example file on my computer, but for your own dataset you can input the file path directly.

Msa_latlong <- read.csv(system.file("extdata", "Msa_ploidy.csv",
                                    package = "polyRADtutorials"))
head(Msa_latlong)
##        Accession Ploidy     Country Latitude Longitude
## 1        PMS-074      2       China 40.13448  116.1748
## 2        PMS-512      2       China 41.32800  123.6909
## 3        PMS-458      4       China 36.17340  120.4978
## 4        PMS-075      2       China 40.13460  116.1751
## 5      PMS-076-7      2       China 40.16420  116.0500
## 6 KMS-widespread      4 South Korea 33.43167  126.4172

We’ll also load some geographical data, namely the borders of East Asian countries where these samples were collected. This will be used for plotting later.

east_asia <- map_data("world")
summary(Msa_latlong)
##   Accession             Ploidy       Country             Latitude    
##  Length:595         Min.   :2.00   Length:595         Min.   :28.60  
##  Class :character   1st Qu.:2.00   Class :character   1st Qu.:35.59  
##  Mode  :character   Median :2.00   Mode  :character   Median :38.38  
##                     Mean   :2.83                      Mean   :39.63  
##                     3rd Qu.:4.00                      3rd Qu.:43.97  
##                     Max.   :4.00                      Max.   :49.33  
##    Longitude    
##  Min.   :104.5  
##  1st Qu.:122.2  
##  Median :128.5  
##  Mean   :129.2  
##  3rd Qu.:134.9  
##  Max.   :145.1
east_asia <- east_asia[east_asia$long > 104 & east_asia$long < 146 &
                         east_asia$lat > 28 & east_asia$lat < 50,]

# prevent some ugly lines in the map
east_asia$group2 <- integer(nrow(east_asia))
j <- 1L
for(i in seq_len(nrow(east_asia))){
  if(i != 1L && east_asia$order[i] - east_asia$order[i - 1L] != 1L){
     j <- j + 1L
  }
  east_asia$group2[i] <- j
}

Now we will import allelic read depth from VCF. Note that if you want to be able to export to VCF later, you will need your reference genome file to pass to the refgenome argument (I am not doing that here because I probably should not be distributing a reference genome sequence on GitHub). I am also setting expectedAlleles and expectedLoci to low values because I know there are only 5000 SNPs in this example dataset. Many of these SNPs originated from the same RAD tags, so you’ll see that they get reduced to 1395 markers, many of which are multiallelic.

# Find the VCF file to import
infile <- system.file("extdata", "Msa_Chr03_5k.vcf.bgz",
                      package = "polyRADtutorials")

# Make a vector of taxa ploidies from our metadata spreadsheet
tp <- Msa_latlong$Ploidy
names(tp) <- Msa_latlong$Accession

# Read the VCF
mydata <- VCF2RADdata(infile, possiblePloidies = list(2), taxaPloidy = tp,
                      expectedAlleles = 15000, expectedLoci = 5000)
mydata
## ## RADdata object ##
## 595 taxa and 1395 loci
## 24387145 total reads
## Assumed sample cross-contamination rate of 0.001
## 
## Possible ploidies:
## Autodiploid (2)
## Autotetraploid (4)
## Autotriploid (3)

Filtering

Samples

Let’s examine the distribution of read depth and the Hind/HE statistic across samples, by ploidy.

hh <- HindHe(mydata)

Msa_latlong$HindHe <- rowMeans(hh, na.rm = TRUE)[Msa_latlong$Accession]
Msa_latlong$Depth <- rowSums(mydata$locDepth)[Msa_latlong$Accession]

ggplot(Msa_latlong, aes(x = Depth, y = HindHe)) +
  geom_point() +
  facet_wrap(~ Ploidy) +
  ggtitle("Read depth and Hind/He across individuals") +
  scale_x_log10()

Within ploidy, there aren’t any major outliers for Hind/HE, so we probably don’t need to filter samples. We can visualize Hind/HE, scaled by its expected value, on a map to see if particular regions are more inbred. In this case there’s nothing particularly striking although some regions of Russia and central Japan may be more inbred than other regions.

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = Msa_latlong,
             mapping = aes(x = Longitude, y = Latitude, fill = HindHe * Ploidy / (Ploidy - 1),
                           shape = as.character(Ploidy)),
             size = 3) +
  scale_fill_viridis_c() +
  labs(shape = "Ploidy") +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  ggtitle("Hind/He by individual")

Markers

To filter markers by Hind/HE, we must first estimate overdispersion.

od <- TestOverdispersion(mydata, to_test = 8:14)
## Genotype estimates not found in object. Performing rough genotype estimation under HWE.
## Generating sampling permutations for allele depth.
## Optimal value is 11.

Next, we must estimate inbreeding. We will do this within ploidy using the graph provided in the polyRAD tutorial, using only markers with a minor allele frequency of at least 0.05.

alfreq2x <- colMeans(mydata$depthRatio[mydata$taxaPloidy == 2,], na.rm = TRUE)
theseloci2x <- GetLoci(mydata)[mydata$alleles2loc[alfreq2x >= 0.05 & alfreq2x < 0.5]]
theseloci2x <- unique(theseloci2x)

hh2x_05 <- colMeans(hh[mydata$taxaPloidy == 2, theseloci2x], na.rm = TRUE)

hist(hh2x_05, breaks = 20, xlab = "Hind/He", main = "Hind/He in diploids, MAF >= 0.05")

alfreq4x <- colMeans(mydata$depthRatio[mydata$taxaPloidy == 4,], na.rm = TRUE)
theseloci4x <- GetLoci(mydata)[mydata$alleles2loc[alfreq4x >= 0.05 & alfreq4x < 0.5]]
theseloci4x <- unique(theseloci4x)

hh4x_05 <- colMeans(hh[mydata$taxaPloidy == 4, theseloci4x], na.rm = TRUE)

hist(hh4x_05, breaks = 20, xlab = "Hind/He", main = "Hind/He in tetraploids, MAF >= 0.05")

We are expecting 0.5 in diploids and 0.75 in tetraploids, so the modes of 0.35 for diploids and 0.45 for tetraploids probably represent Mendelian markers. Using the chart from the main tutorial, with overdispersion of 12 this means inbreeding of about 0.3 for both diploids and tetraploids.

Now we can use our estimated overdispersion and inbreeding to determine the expected distribution of Hind/HE in our dataset, given sample size and read depth, if it consisted entirely of Mendelian markers. We will use a 95% confidence interval on that distribution to filter the dataset.

hhByLoc <- colMeans(hh, na.rm = TRUE)
hist(hhByLoc)

set.seed(528)
ExpectedHindHe(mydata, inbreeding = 0.3, overdispersion = od$optimal)
## Simulating rep 1
## Completed 4 simulation reps.
## Mean Hind/He: 0.391
## Standard deviation: 0.047
## 95% of observations are between 0.314 and 0.507

thresh1 <- 0.321
thresh2 <- 0.507

keeploci <- names(hhByLoc)[hhByLoc > thresh1 & hhByLoc < thresh2]

mydata <- SubsetByLocus(mydata, keeploci)
mydata
## ## RADdata object ##
## 595 taxa and 557 loci
## 8497806 total reads
## Assumed sample cross-contamination rate of 0.001
## 
## Possible ploidies:
## Autodiploid (2)
## Autotetraploid (4)
## Autotriploid (3)

Genotype calling

Since this is a collection of wild plants, IteratePopStruct will be the most accurate way to call genotypes.

set.seed(326)
mydataPopStruct <- IteratePopStruct(mydata, overdispersion = od$optimal)
## Performing initial PCA and allele frequency estimation.
## Starting iteration 1
## PCs used: 5
## Generating sampling permutations for allele depth.
## Mean difference in allele frequencies of 0.0119675976945463
## Starting iteration 2
## PCs used: 6
## Mean difference in allele frequencies of 0.00836220815504441
## Starting iteration 3
## PCs used: 7
## Mean difference in allele frequencies of 0.00723467211831438
## Starting iteration 4
## PCs used: 8
## Mean difference in allele frequencies of 0.00255887651506484
## Starting iteration 5
## PCs used: 8
## Mean difference in allele frequencies of 0.00159533301665825
## Starting iteration 6
## PCs used: 8
## Mean difference in allele frequencies of 0.00115739767368151
## Starting iteration 7
## PCs used: 8
## Mean difference in allele frequencies of 0.000897258722711223

However… I plan to use my genotypes to estimate population structure, when I just used population structure to estimate my genotypes. It’s kind of circular. Will a reviewer have a problem with it? Maybe. Think about it, anyway. If you would rather use a naive method to call genotypes, do this:

mydataNaive <- AddGenotypePriorProb_Even(mydata)
mydataNaive <- AddGenotypeLikelihood(mydataNaive, overdispersion = od$optimal)
## Allele frequencies not found; estimating under HWE from depth ratios.
## Generating sampling permutations for allele depth.
mydataNaive <- AddGenotypePosteriorProb(mydataNaive)
mydataNaive <- AddPloidyChiSq(mydataNaive)

Alternatively, we can fine-tune the number of PCs so that genotypes are corrected somewhat by population structure, but hopefully not overcorrected.

mydataIntermediate <- IteratePopStruct(mydata, overdispersion = od$optimal,
                                       maxR2changeratio = 0.5)
## Performing initial PCA and allele frequency estimation.
## Starting iteration 1
## PCs used: 2
## Generating sampling permutations for allele depth.
## Mean difference in allele frequencies of 0.0138692770267731
## Starting iteration 2
## PCs used: 3
## Mean difference in allele frequencies of 0.00339792175295924
## Starting iteration 3
## PCs used: 3
## Mean difference in allele frequencies of 0.0013791234277688
## Starting iteration 4
## PCs used: 3
## Mean difference in allele frequencies of 0.000682214613514888

We can take a look at the PCs that were used for genotype calling by the population structure method. This will give us an idea of patterns of population structure that could be exaggerated using that calling method. The first two axes are shown below, but I recommend exploring all eight (or however many are generated with your dataset).

identical(rownames(mydataPopStruct$PCA), Msa_latlong$Accession)
## [1] TRUE
Msa_latlong <- cbind(Msa_latlong, mydataPopStruct$PCA)

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = Msa_latlong,
             mapping = aes(x = Longitude, y = Latitude, fill = PC1,
                           shape = as.character(Ploidy)),
             size = 3) +
  scale_fill_viridis_c() +
  labs(shape = "Ploidy") +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  ggtitle("PC1")

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = Msa_latlong,
             mapping = aes(x = Longitude, y = Latitude, fill = PC2,
                           shape = as.character(Ploidy)),
             size = 3) +
  scale_fill_viridis_c() +
  labs(shape = "Ploidy") +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  ggtitle("PC2")

Analyses with adegenet

The R package adegenet has some nice analyses popular in population genetics. Better still, many of them now work on simple data matrices, without the need to use any of adegenet’s built-in data classes. So we’ll generate matrices of marker data from polyRAD and then go from there.

matrixPopStruct <- GetWeightedMeanGenotypes(mydataPopStruct)
matrixNaive <- GetWeightedMeanGenotypes(mydataNaive)
matrixIntermediate <- GetWeightedMeanGenotypes(mydataIntermediate)

Discriminant analysis of principal components (DAPC)

First we’ll run find.clust to identify likely cluster membership for each sample.

clust1 <- find.clusters(matrixPopStruct, n.pca = nrow(matrixPopStruct))
BIC using pop struct dataset
BIC using pop struct dataset
clust2 <- find.clusters(matrixNaive, n.pca = nrow(matrixNaive))
BIC using naive dataset
BIC using naive dataset
clust3 <- find.clusters(matrixIntermediate, n.pca = nrow(matrixIntermediate))
BIC using intermediate dataset
BIC using intermediate dataset

Based on these curves, I chose seven, four, and four clusters for the population structure, naive, and intermediate datasets, respectively.

Now we will use these preliminary clusters to run DAPC. You may choose to omit the n.pca and n.da arguments in order to choose these values interactively.

dapc_PopStruct <- dapc(matrixPopStruct, clust1$grp,
                       n.pca = 200, n.da = 6)
dapc_Naive <- dapc(matrixNaive, clust2$grp,
                       n.pca = 200, n.da = 3)
dapc_Intermediate <- dapc(matrixIntermediate, clust3$grp,
                       n.pca = 200, n.da = 3)

We’ll add the assignments to our sample table.

Msa_latlong$DAPC_PopStruct <- dapc_PopStruct$assign
Msa_latlong$DAPC_Naive <- dapc_Naive$assign
Msa_latlong$DAPC_Intermediate <- dapc_Intermediate$assign

How did DAPC groups correspond to ploidy?

table(Msa_latlong$DAPC_PopStruct, Msa_latlong$Ploidy)
##    
##       2   3   4
##   1   0   0 117
##   2  48   0   0
##   3 126   0   0
##   4   8   5  14
##   5  56   0   0
##   6   1   2 112
##   7 105   1   0
table(Msa_latlong$DAPC_Naive, Msa_latlong$Ploidy)
##    
##       2   3   4
##   1  57   0   0
##   2 194   0   0
##   3  20   5 168
##   4  73   3  75
table(Msa_latlong$DAPC_Intermediate, Msa_latlong$Ploidy)
##    
##       2   3   4
##   1   0   0 128
##   2 229   1   0
##   3  55   0   0
##   4  60   7 115

We can visualize Hind/HE by DAPC group in order to see if that matches our expectations based on ploidy. Groups with higher values than expected could represent interspecies hybrids.

ggplot(Msa_latlong, aes(x = DAPC_PopStruct, y = HindHe, fill = DAPC_PopStruct)) +
  geom_boxplot()

ggplot(Msa_latlong, aes(x = DAPC_Naive, y = HindHe, fill = DAPC_Naive)) +
  geom_boxplot()

ggplot(Msa_latlong, aes(x = DAPC_Intermediate, y = HindHe, fill = DAPC_Intermediate)) +
  geom_boxplot()

We can map these groups to see if they make sense geographically.

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = Msa_latlong,
             mapping = aes(x = Longitude, y = Latitude, fill = DAPC_PopStruct,
                           shape = as.character(Ploidy)),
             size = 2.5) +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  labs(shape = "Ploidy")

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = Msa_latlong,
             mapping = aes(x = Longitude, y = Latitude, fill = DAPC_Naive,
                           shape = as.character(Ploidy)),
             size = 2.5) +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  labs(shape = "Ploidy")

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = Msa_latlong,
             mapping = aes(x = Longitude, y = Latitude, fill = DAPC_Intermediate,
                           shape = as.character(Ploidy)),
             size = 2.5) +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  labs(shape = "Ploidy")

The intermediate approach seems to make the most sense geographically, so it may be best to proceed with that approach.

Spatial principal coordinates analysis (sPCA)

sPCA is like PCA, but combines genetic and spatial information in order to find more subtle patterns of geographic population structure. To perform this analysis, we will first need to convert latitude and longitude to UTM. We’ll choose a UTM zone central to the dataset; see https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system.

tempXY <- data.frame(Accession = Msa_latlong$Accession,
                     X = Msa_latlong$Longitude,
                     Y = Msa_latlong$Latitude)
attr(tempXY, "projection") <- "LL"
attr(tempXY, "zone") <- 52
utmXY <- PBSmapping::convUL(tempXY)
## convUL: Converting coordinates within the northern hemisphere.
summary(utmXY)
##   Accession               X                 Y       
##  Length:595         Min.   :-1875.4   Min.   :3232  
##  Class :character   1st Qu.:  -76.3   1st Qu.:3961  
##  Mode  :character   Median :  459.7   Median :4309  
##                     Mean   :  476.7   Mean   :4418  
##                     3rd Qu.:  939.4   3rd Qu.:4923  
##                     Max.   : 1810.2   Max.   :5492

We will jitter the positions since some accessions were collected at the same site.

utmXY$X <- jitter(utmXY$X)
utmXY$Y <- jitter(utmXY$Y)

Now we need to build a connection network between collection sites. I chose the Gabriel graph (type 2 when prompted).

myCN <- chooseCN(utmXY[,c("X", "Y")])

Now we can run sPCA. I recommend running first without the nfposi and nfnega arguments so that you can chose these interactively.

spca_PopStruct <- spca(matrixPopStruct[utmXY$Accession,], cn = myCN,
                       nfposi = 5, nfnega = 0, scannf = FALSE)
## Registered S3 methods overwritten by 'adegraphics':
##   method         from
##   biplot.dudi    ade4
##   kplot.foucart  ade4
##   kplot.mcoa     ade4
##   kplot.mfa      ade4
##   kplot.pta      ade4
##   kplot.sepan    ade4
##   kplot.statis   ade4
##   scatter.coa    ade4
##   scatter.dudi   ade4
##   scatter.nipals ade4
##   scatter.pco    ade4
##   score.acm      ade4
##   score.mix      ade4
##   score.pca      ade4
##   screeplot.dudi ade4
## Registered S3 methods overwritten by 'adespatial':
##   method             from       
##   plot.multispati    adegraphics
##   print.multispati   ade4       
##   summary.multispati ade4
spca_Naive <- spca(matrixNaive[utmXY$Accession,], cn = myCN,
                   nfposi = 5, nfnega = 0, scannf = FALSE)
spca_Intermediate <- spca(matrixIntermediate[utmXY$Accession,], cn = myCN,
                   nfposi = 5, nfnega = 0, scannf = FALSE)

We’ll add the results to our sample table.

temp <- matrix(NA_real_, nrow = nrow(Msa_latlong), ncol = 15,
               dimnames = list(Msa_latlong$Accession,
                               c(paste0("sPCA_PopStruct_", 1:5),
                                 paste0("sPCA_Naive_", 1:5),
                                 paste0("sPCA_Intermediate_", 1:5))))
temp[rownames(spca_PopStruct$li),1:5] <- as.matrix(spca_PopStruct$li)
temp[rownames(spca_Naive$li),6:10] <- as.matrix(spca_Naive$li)
temp[rownames(spca_Intermediate$li),11:15] <- as.matrix(spca_Intermediate$li)
Msa_latlong <- cbind(Msa_latlong, temp)

Now we can visualize the first few axes. The first two are similar to what we got with the PCA done for genotype calling. The fourth and fifth differ by genotype calling method, so the first three are probably the most meaningful and trustworthy.

temp <- Msa_latlong[, c(1,2,4,5,grep("sPCA_", colnames(Msa_latlong)))]
temp <- pivot_longer(temp, cols = -(1:4), names_prefix = "sPCA_",
                     names_to = c("Approach", "PC"), names_sep = "_",
                     values_to = "Score")

ggplot() +
  geom_path(data = east_asia,
            mapping = aes(x = long, y = lat, group = group2),
            color = "black") +
  geom_point(data = temp,
             mapping = aes(x = Longitude, y = Latitude, fill = Score,
                           shape = as.character(Ploidy))) +
  scale_fill_viridis_c() +
  scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) +
  labs(shape = "Ploidy") +
  scale_x_continuous(limits = c(110, 146)) +
  facet_grid(PC ~ Approach, labeller = "label_both")

Mantel test

For the rest of the tutorial, I will use only one genotype calling method for the sake of brevity.

For a Mantel test, we need to compare genetic and geographic distances. First we will get Euclidian genetic distances.

distIntermediate <- dist(matrixIntermediate[utmXY$Accession,], method = "euclidean")

We can use the UTM coordinates directly to get distance in kilometers.

dist_geo <- dist(utmXY[,c("X", "Y")], method = "euclidean")

Now we can test that these are correlated with each other. Unsurprisingly at this point, the relationship is significant.

mantel_Intermediate <- mantel.randtest(distIntermediate, dist_geo)
mantel_Intermediate
## Monte-Carlo test
## Call: mantel.randtest(m1 = distIntermediate, m2 = dist_geo)
## 
## Observation: 0.3271219 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: greater 
## 
##       Std.Obs   Expectation      Variance 
## 17.9079710542 -0.0005498143  0.0003347998

This relationship can also be visualized. A positive, bimodal relationship can be observed. It also looks like there are some pairs with particularly low genetic distance in close geographic proximity, which may represent siblings, parent-offspring, or clones.

ggplot(mapping = aes(x = dist_geo, y = distIntermediate)) +
  geom_point(alpha = 0.1) + geom_density_2d() + 
  labs(x = "Geographic distance", y = "Genetic distance")
## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.
## Don't know how to automatically pick scale for object of type <dist>.
## Defaulting to continuous.

Export of genotypes to adegenet

Are there other functions from adegenet that you want to use? You can export discrete genotypes directly from polyRAD to a genind object.

mygenind <- Export_adegenet_genind(mydataIntermediate)

Visualization with UMAP

If we visualize our first two PCA axes, some geographical regions and DAPC clusters are stretched out and not well-represented on them, and we know there’s more to the story since there was a third axis that seemed important and reproducible.

ggplot(Msa_latlong, aes(x = PC1, y = PC2, color = Country, shape = as.factor(Ploidy))) +
  geom_point()

ggplot(Msa_latlong, aes(x = PC1, y = PC2, color = DAPC_Intermediate, shape = as.factor(Ploidy))) +
  geom_point()

An alternative is a non-linear reduction called UMAP, which places similar individuals next to each other, with further distances only being rough approximations of genetic distance.

set.seed(1002)
myumap <- umap(matrixIntermediate, method = "naive")

identical(Msa_latlong$Accession, rownames(myumap$layout))
## [1] TRUE
Msa_latlong$UMAP_1 <- myumap$layout[,1]
Msa_latlong$UMAP_2 <- myumap$layout[,2]

ggplot(Msa_latlong, aes(x = UMAP_1, y = UMAP_2, color = Country, shape = as.factor(Ploidy))) +
  geom_point()

ggplot(Msa_latlong, aes(x = UMAP_1, y = UMAP_2, color = DAPC_Intermediate, shape = as.factor(Ploidy))) +
  geom_point()

We can see the separation of northern and southern Japan much more clearly now. The Yangtze River cluster still looks like an offshoot of other populations from China, but no longer takes up a disproportionate amount of space on the plot for the number of individuals it has.

Analysis of molecular variance (AMOVA) using pegas

In AMOVA, genetic distances between individuals or populations are partitioned so that we can determine the amount of variance explained by hierarchical groupings. There are a few implementations in R, and we will use the one from the pegas package here. Here the matrix of Euclidean distances is regressed on the DAPC clusters.

set.seed(1002)
amovaIntermediate <- pegas::amova(distIntermediate ~ DAPC_Intermediate, data = Msa_latlong)
amovaIntermediate
## 
##  Analysis of Molecular Variance
## 
## Call: pegas::amova(formula = distIntermediate ~ DAPC_Intermediate, 
##     data = Msa_latlong)
## 
##                         SSD        MSD  df
## DAPC_Intermediate  3268.042 1089.34721   3
## Error             23277.234   39.38618 591
## Total             26545.276   44.68902 594
## 
## Variance components:
##                    sigma2 P.value
## DAPC_Intermediate  7.5392       0
## Error             39.3862        
## 
## Phi-statistics:
## DAPC_Intermediate.in.GLOBAL 
##                   0.1606633 
## 
## Variance coefficients:
##        a 
## 139.2672

We can see that these four DAPC clusters explain 16.1% of the molecular variance. Significance is less than 0.001; we would have to do more permutations (nperm) to determine it exactly.

Differentiation statistics using polysat

To calculate population differentiation statistics, we need to know allele frequencies in each group. We can calculate these as column means of the posterior mean genotypes.

matrixIntermediateAll <- GetWeightedMeanGenotypes(mydataIntermediate, omit1allelePerLocus = FALSE)
freq_by_pop <- do.call(rbind, by(matrixIntermediateAll, Msa_latlong$DAPC_Intermediate, colMeans))
freq_by_pop[,1:5]
##   S03_117378_T S03_117378_C S03_189211_CC S03_189211_CT S03_189211_TT
## 1    0.8504978    0.1495022     0.8124241     0.1849128   0.002352002
## 2    0.8189026    0.1810974     0.8930126     0.0982381   0.008520857
## 3    0.7101366    0.2898634     0.7814224     0.2142264   0.003801448
## 4    0.8259216    0.1740784     0.7885790     0.1779828   0.031604546

We need to make sure that allele frequencies sum to one within loci, which is not guaranteed by polyRAD when there are more than two alleles. We also need to rename alleles to a format supported by polysat.

for(i in seq_len(nLoci(mydataIntermediate))){
  thesealleles <- which(mydataIntermediate$alleles2loc == i)
  # normalize to sum to one
  temp <- freq_by_pop[,thesealleles]
  temp <- sweep(temp, 1, rowSums(temp), "/")
  freq_by_pop[,thesealleles] <- temp
  # change allele names
  thisloc <- GetLoci(mydataIntermediate)[i]
  thisloc <- gsub("\\.", "_", thisloc) # remove periods from locus names
  colnames(freq_by_pop)[thesealleles] <- paste(thisloc, seq_along(thesealleles), sep = ".")
}

For the polysat calcPopDiff function, we also need a column called “Genomes” indicating the size of the population multiplied by the ploidy.

genomes <- tapply(Msa_latlong$Ploidy, Msa_latlong$DAPC_Intermediate, sum)
freq_df <- data.frame(Genomes = genomes,
                      freq_by_pop)

Now we can get differentiation statistics.

my_JostD <- calcPopDiff(freq_df, metric = "Jost's D")
my_JostD
##            1          2          3          4
## 1 0.00000000 0.05538850 0.07910069 0.02364677
## 2 0.05538850 0.00000000 0.09681908 0.04313514
## 3 0.07910069 0.09681908 0.00000000 0.05197611
## 4 0.02364677 0.04313514 0.05197611 0.00000000
my_Gst <- calcPopDiff(freq_df, metric = "Gst")
my_Gst
##            1            2            3             4
## 1 0.00000000 3.068945e-02 4.136511e-02  1.368700e-02
## 2 0.03068945 2.520609e-18 5.225528e-02  2.380334e-02
## 3 0.04136511 5.225528e-02 3.696712e-18  2.733831e-02
## 4 0.01368700 2.380334e-02 2.733831e-02 -3.210725e-18

We can also use our matrix of allele frequencies to calculate expected heterozygosity (i.e. the probability that if two alleles are drawn at random from the population, they will be different), a common measure of diversity. Here I’m just doing the calculation manually rather than using polysat.

div_mat <- matrix(nrow = nrow(freq_by_pop), ncol = nLoci(mydataIntermediate),
                  dimnames = list(rownames(freq_by_pop), GetLoci(mydataIntermediate)))
for(i in seq_len(nLoci(mydataIntermediate))){
  thesealleles <- which(mydataNaive$alleles2loc == i)

  temp <- freq_by_pop[,thesealleles]
  div_mat[,i] <- 1 - rowSums(temp ^ 2)
}
rowMeans(div_mat)
##         1         2         3         4 
## 0.3744920 0.3588097 0.3440956 0.3861777

Export to Structure

Structure is more than two decades old and has stood the test of time. You can use the Export_Structure function to take discrete gentoype calls from polyRAD and export them to Structure. A Structure run on a typical GBS dataset can take several days, so I recommend using a computing cluster or cloud service to run many jobs in parallel from the command line. In my experience, this approach is worthwhile because Structure is much more sensitive than similar but faster methods. (Instructions on running Structure from the command line are beyond the scope of this tutorial.)

Export_Structure(mydataIntermediate, "Msa_structure.txt")
Number of individuals: 595
Number of loci: 553
Ploidy of data: 4
Missing data value: -9

File contains:
Row of marker names
Individual ID for each individual

Did I miss anything?

If you have any questions or problems with this tutorial, you can start a discussion or file an issue.