--- title: "Using polyRAD for population genetics" author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois, Urbana-Champaign date: "`r format(Sys.time(), '%d %B %Y')`" output: html_document: toc: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Using polyRAD for population genetics} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Introduction This tutorial uses RAD-seq data from a diversity panel consisting mostly of wild-collected _Miscanthus sacchariflorus_ (Clark et al. 2018, https://doi.org/10.1093/aob/mcy161, https://doi.org/10.13012/B2IDB-0170190_V3, https://doi.org/10.13012/B2IDB-8170405_V1). Most individuals are diploid or tetraploid, with a few triploids. The R package containing this tutorial is archived on Zenodo at [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4876568.svg)](https://doi.org/10.5281/zenodo.4876568) In addition to loading polyRAD, we'll load some other packages for population genetics and making plots. The polyRADtutorials package is not on CRAN but can be installed from GitHub or R-universe, and contains the dataset. ```{r libs, message = FALSE, warning = FALSE} library(polyRAD) library(adegenet) library(polyRADtutorials) library(ggplot2) library(maps) library(pegas) library(PBSmapping) library(spdep) library(polysat) library(tidyr) library(umap) ``` First we will load a spreadsheet of collection location information and ploidy as determined by flow cytometry. Here I use the `system.file()` function to locate the example file on my computer, but for your own dataset you can input the file path directly. ```{r data} Msa_latlong <- read.csv(system.file("extdata", "Msa_ploidy.csv", package = "polyRADtutorials")) head(Msa_latlong) ``` We'll also load some geographical data, namely the borders of East Asian countries where these samples were collected. This will be used for plotting later. ```{r borders} east_asia <- map_data("world") summary(Msa_latlong) east_asia <- east_asia[east_asia$long > 104 & east_asia$long < 146 & east_asia$lat > 28 & east_asia$lat < 50,] # prevent some ugly lines in the map east_asia$group2 <- integer(nrow(east_asia)) j <- 1L for(i in seq_len(nrow(east_asia))){ if(i != 1L && east_asia$order[i] - east_asia$order[i - 1L] != 1L){ j <- j + 1L } east_asia$group2[i] <- j } ``` Now we will import allelic read depth from VCF. Note that if you want to be able to export to VCF later, you will need your reference genome file to pass to the `refgenome` argument (I am not doing that here because I probably should not be distributing a reference genome sequence on GitHub). I am also setting `expectedAlleles` and `expectedLoci` to low values because I know there are only 5000 SNPs in this example dataset. Many of these SNPs originated from the same RAD tags, so you'll see that they get reduced to 1395 markers, many of which are multiallelic. ```{r echo = FALSE} # Determine if VariantAnnotation is installed. haveVA <- requireNamespace("VariantAnnotation", quietly = TRUE) ``` ```{r vcf, eval = haveVA} # Find the VCF file to import infile <- system.file("extdata", "Msa_Chr03_5k.vcf.bgz", package = "polyRADtutorials") # Make a vector of taxa ploidies from our metadata spreadsheet tp <- Msa_latlong$Ploidy names(tp) <- Msa_latlong$Accession # Read the VCF mydata <- VCF2RADdata(infile, possiblePloidies = list(2), taxaPloidy = tp, expectedAlleles = 15000, expectedLoci = 5000) ``` ```{r backupvcf, eval = !haveVA, echo = FALSE} mydata <- readRDS("VCFdata.rds") ``` ```{r printmydata} mydata ``` ## Filtering ### Samples Let's examine the distribution of read depth and the $H_{ind}/H_E$ statistic across samples, by ploidy. ```{r hh_sample} hh <- HindHe(mydata) Msa_latlong$HindHe <- rowMeans(hh, na.rm = TRUE)[Msa_latlong$Accession] Msa_latlong$Depth <- rowSums(mydata$locDepth)[Msa_latlong$Accession] ggplot(Msa_latlong, aes(x = Depth, y = HindHe)) + geom_point() + facet_wrap(~ Ploidy) + ggtitle("Read depth and Hind/He across individuals") + scale_x_log10() ``` Within ploidy, there aren't any major outliers for $H_{ind}/H_E$, so we probably don't need to filter samples. We can visualize $H_{ind}/H_E$, scaled by its expected value, on a map to see if particular regions are more inbred. In this case there's nothing particularly striking although some regions of Russia and central Japan may be more inbred than other regions. ```{r inbrmap, fig.width = 9, fig.height = 7} ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = Msa_latlong, mapping = aes(x = Longitude, y = Latitude, fill = HindHe * Ploidy / (Ploidy - 1), shape = as.character(Ploidy)), size = 3) + scale_fill_viridis_c() + labs(shape = "Ploidy") + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + ggtitle("Hind/He by individual") ``` ### Markers To filter markers by $H_{ind}/H_E$, we must first estimate overdispersion. ```{r overdispersion} od <- TestOverdispersion(mydata, to_test = 8:14) ``` Next, we must estimate inbreeding. We will do this within ploidy using the graph provided in the polyRAD tutorial, using only markers with a minor allele frequency of at least 0.05. ```{r estinbr} alfreq2x <- colMeans(mydata$depthRatio[mydata$taxaPloidy == 2,], na.rm = TRUE) theseloci2x <- GetLoci(mydata)[mydata$alleles2loc[alfreq2x >= 0.05 & alfreq2x < 0.5]] theseloci2x <- unique(theseloci2x) hh2x_05 <- colMeans(hh[mydata$taxaPloidy == 2, theseloci2x], na.rm = TRUE) hist(hh2x_05, breaks = 20, xlab = "Hind/He", main = "Hind/He in diploids, MAF >= 0.05") alfreq4x <- colMeans(mydata$depthRatio[mydata$taxaPloidy == 4,], na.rm = TRUE) theseloci4x <- GetLoci(mydata)[mydata$alleles2loc[alfreq4x >= 0.05 & alfreq4x < 0.5]] theseloci4x <- unique(theseloci4x) hh4x_05 <- colMeans(hh[mydata$taxaPloidy == 4, theseloci4x], na.rm = TRUE) hist(hh4x_05, breaks = 20, xlab = "Hind/He", main = "Hind/He in tetraploids, MAF >= 0.05") ``` We are expecting 0.5 in diploids and 0.75 in tetraploids, so the modes of 0.35 for diploids and 0.45 for tetraploids probably represent Mendelian markers. Using the chart from the main tutorial, with overdispersion of 12 this means inbreeding of about 0.3 for both diploids and tetraploids. Now we can use our estimated overdispersion and inbreeding to determine the expected distribution of $H_{ind}/H_E$ in our dataset, given sample size and read depth, if it consisted entirely of Mendelian markers. We will use a 95% confidence interval on that distribution to filter the dataset. ```{r hh_locus} hhByLoc <- colMeans(hh, na.rm = TRUE) hist(hhByLoc) set.seed(528) ExpectedHindHe(mydata, inbreeding = 0.3, overdispersion = od$optimal) thresh1 <- 0.321 thresh2 <- 0.507 keeploci <- names(hhByLoc)[hhByLoc > thresh1 & hhByLoc < thresh2] mydata <- SubsetByLocus(mydata, keeploci) mydata ``` ## Genotype calling Since this is a collection of wild plants, `IteratePopStruct` will be the most accurate way to call genotypes. ```{r iteratepopstruct} set.seed(326) mydataPopStruct <- IteratePopStruct(mydata, overdispersion = od$optimal) ``` However... I plan to use my genotypes to estimate population structure, when I just used population structure to estimate my genotypes. It's kind of circular. Will a reviewer have a problem with it? Maybe. Think about it, anyway. If you would rather use a naive method to call genotypes, do this: ```{r naive} mydataNaive <- AddGenotypePriorProb_Even(mydata) mydataNaive <- AddGenotypeLikelihood(mydataNaive, overdispersion = od$optimal) mydataNaive <- AddGenotypePosteriorProb(mydataNaive) mydataNaive <- AddPloidyChiSq(mydataNaive) ``` Alternatively, we can fine-tune the number of PCs so that genotypes are corrected somewhat by population structure, but hopefully not overcorrected. ```{r intermidiate} mydataIntermediate <- IteratePopStruct(mydata, overdispersion = od$optimal, maxR2changeratio = 0.5) ``` We can take a look at the PCs that were used for genotype calling by the population structure method. This will give us an idea of patterns of population structure that could be exaggerated using that calling method. The first two axes are shown below, but I recommend exploring all eight (or however many are generated with your dataset). ```{r checkpca, fig.width = 9, fig.height = 7} identical(rownames(mydataPopStruct$PCA), Msa_latlong$Accession) Msa_latlong <- cbind(Msa_latlong, mydataPopStruct$PCA) ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = Msa_latlong, mapping = aes(x = Longitude, y = Latitude, fill = PC1, shape = as.character(Ploidy)), size = 3) + scale_fill_viridis_c() + labs(shape = "Ploidy") + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + ggtitle("PC1") ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = Msa_latlong, mapping = aes(x = Longitude, y = Latitude, fill = PC2, shape = as.character(Ploidy)), size = 3) + scale_fill_viridis_c() + labs(shape = "Ploidy") + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + ggtitle("PC2") ``` ## Analyses with adegenet The R package adegenet has some nice analyses popular in population genetics. Better still, many of them now work on simple data matrices, without the need to use any of adegenet's built-in data classes. So we'll generate matrices of marker data from polyRAD and then go from there. ```{r getwmg} matrixPopStruct <- GetWeightedMeanGenotypes(mydataPopStruct) matrixNaive <- GetWeightedMeanGenotypes(mydataNaive) matrixIntermediate <- GetWeightedMeanGenotypes(mydataIntermediate) ``` ### Discriminant analysis of principal components (DAPC) First we'll run `find.clust` to identify likely cluster membership for each sample. ```{r findclust1a, eval = FALSE} clust1 <- find.clusters(matrixPopStruct, n.pca = nrow(matrixPopStruct)) ``` ![BIC using pop struct dataset](images/BIC_popstruct.png) ```{r findclust1b, eval = FALSE} clust2 <- find.clusters(matrixNaive, n.pca = nrow(matrixNaive)) ``` ![BIC using naive dataset](images/BIC_naive.png) ```{r findclust1c, eval = FALSE} clust3 <- find.clusters(matrixIntermediate, n.pca = nrow(matrixIntermediate)) ``` ![BIC using intermediate dataset](images/BIC_intermediate.png) Based on these curves, I chose seven, four, and four clusters for the population structure, naive, and intermediate datasets, respectively. ```{r findclust2, echo = FALSE} set.seed(529) clust1 <- find.clusters(matrixPopStruct, n.pca = nrow(matrixPopStruct), n.clust = 7) clust2 <- find.clusters(matrixNaive, n.pca = nrow(matrixNaive), n.clust = 4) clust3 <- find.clusters(matrixIntermediate, n.pca = nrow(matrixIntermediate), n.clust = 4) ``` Now we will use these preliminary clusters to run DAPC. You may choose to omit the `n.pca` and `n.da` arguments in order to choose these values interactively. ```{r dapc} dapc_PopStruct <- dapc(matrixPopStruct, clust1$grp, n.pca = 200, n.da = 6) dapc_Naive <- dapc(matrixNaive, clust2$grp, n.pca = 200, n.da = 3) dapc_Intermediate <- dapc(matrixIntermediate, clust3$grp, n.pca = 200, n.da = 3) ``` We'll add the assignments to our sample table. ```{r dapctable} Msa_latlong$DAPC_PopStruct <- dapc_PopStruct$assign Msa_latlong$DAPC_Naive <- dapc_Naive$assign Msa_latlong$DAPC_Intermediate <- dapc_Intermediate$assign ``` How did DAPC groups correspond to ploidy? ```{r dapcpld} table(Msa_latlong$DAPC_PopStruct, Msa_latlong$Ploidy) table(Msa_latlong$DAPC_Naive, Msa_latlong$Ploidy) table(Msa_latlong$DAPC_Intermediate, Msa_latlong$Ploidy) ``` We can visualize $H_{ind}/H_E$ by DAPC group in order to see if that matches our expectations based on ploidy. Groups with higher values than expected could represent interspecies hybrids. ```{r dapcbarplots} ggplot(Msa_latlong, aes(x = DAPC_PopStruct, y = HindHe, fill = DAPC_PopStruct)) + geom_boxplot() ggplot(Msa_latlong, aes(x = DAPC_Naive, y = HindHe, fill = DAPC_Naive)) + geom_boxplot() ggplot(Msa_latlong, aes(x = DAPC_Intermediate, y = HindHe, fill = DAPC_Intermediate)) + geom_boxplot() ``` We can map these groups to see if they make sense geographically. ```{r dapcmap} ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = Msa_latlong, mapping = aes(x = Longitude, y = Latitude, fill = DAPC_PopStruct, shape = as.character(Ploidy)), size = 2.5) + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + labs(shape = "Ploidy") ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = Msa_latlong, mapping = aes(x = Longitude, y = Latitude, fill = DAPC_Naive, shape = as.character(Ploidy)), size = 2.5) + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + labs(shape = "Ploidy") ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = Msa_latlong, mapping = aes(x = Longitude, y = Latitude, fill = DAPC_Intermediate, shape = as.character(Ploidy)), size = 2.5) + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + labs(shape = "Ploidy") ``` The intermediate approach seems to make the most sense geographically, so it may be best to proceed with that approach. ### Spatial principal coordinates analysis (sPCA) sPCA is like PCA, but combines genetic and spatial information in order to find more subtle patterns of geographic population structure. To perform this analysis, we will first need to convert latitude and longitude to UTM. We'll choose a UTM zone central to the dataset; see https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system. ```{r utm} tempXY <- data.frame(Accession = Msa_latlong$Accession, X = Msa_latlong$Longitude, Y = Msa_latlong$Latitude) attr(tempXY, "projection") <- "LL" attr(tempXY, "zone") <- 52 utmXY <- PBSmapping::convUL(tempXY) summary(utmXY) ``` We will jitter the positions since some accessions were collected at the same site. ```{r jitter} utmXY$X <- jitter(utmXY$X) utmXY$Y <- jitter(utmXY$Y) ``` Now we need to build a connection network between collection sites. I chose the Gabriel graph (type 2 when prompted). ```{r chooseCN, eval = FALSE} myCN <- chooseCN(utmXY[,c("X", "Y")]) ``` ```{r chooseCN2, echo = FALSE} myCN <- chooseCN(utmXY[,c("X", "Y")], type = 2, plot.nb = TRUE) ``` Now we can run sPCA. I recommend running first without the `nfposi` and `nfnega` arguments so that you can chose these interactively. ```{r spca, warning = FALSE} spca_PopStruct <- spca(matrixPopStruct[utmXY$Accession,], cn = myCN, nfposi = 5, nfnega = 0, scannf = FALSE) spca_Naive <- spca(matrixNaive[utmXY$Accession,], cn = myCN, nfposi = 5, nfnega = 0, scannf = FALSE) spca_Intermediate <- spca(matrixIntermediate[utmXY$Accession,], cn = myCN, nfposi = 5, nfnega = 0, scannf = FALSE) ``` We'll add the results to our sample table. ```{r addspca} temp <- matrix(NA_real_, nrow = nrow(Msa_latlong), ncol = 15, dimnames = list(Msa_latlong$Accession, c(paste0("sPCA_PopStruct_", 1:5), paste0("sPCA_Naive_", 1:5), paste0("sPCA_Intermediate_", 1:5)))) temp[rownames(spca_PopStruct$li),1:5] <- as.matrix(spca_PopStruct$li) temp[rownames(spca_Naive$li),6:10] <- as.matrix(spca_Naive$li) temp[rownames(spca_Intermediate$li),11:15] <- as.matrix(spca_Intermediate$li) Msa_latlong <- cbind(Msa_latlong, temp) ``` Now we can visualize the first few axes. The first two are similar to what we got with the PCA done for genotype calling. The fourth and fifth differ by genotype calling method, so the first three are probably the most meaningful and trustworthy. ```{r plotspca, fig.width = 9, fig.height = 12, warning = FALSE} temp <- Msa_latlong[, c(1,2,4,5,grep("sPCA_", colnames(Msa_latlong)))] temp <- pivot_longer(temp, cols = -(1:4), names_prefix = "sPCA_", names_to = c("Approach", "PC"), names_sep = "_", values_to = "Score") ggplot() + geom_path(data = east_asia, mapping = aes(x = long, y = lat, group = group2), color = "black") + geom_point(data = temp, mapping = aes(x = Longitude, y = Latitude, fill = Score, shape = as.character(Ploidy))) + scale_fill_viridis_c() + scale_shape_manual(values = c("2" = 21, "3" = 24, "4" = 22)) + labs(shape = "Ploidy") + scale_x_continuous(limits = c(110, 146)) + facet_grid(PC ~ Approach, labeller = "label_both") ``` ### Mantel test For the rest of the tutorial, I will use only one genotype calling method for the sake of brevity. For a Mantel test, we need to compare genetic and geographic distances. First we will get Euclidian genetic distances. ```{r distsub} distIntermediate <- dist(matrixIntermediate[utmXY$Accession,], method = "euclidean") ``` We can use the UTM coordinates directly to get distance in kilometers. ```{r utmdist} dist_geo <- dist(utmXY[,c("X", "Y")], method = "euclidean") ``` Now we can test that these are correlated with each other. Unsurprisingly at this point, the relationship is significant. ```{r mantel} mantel_Intermediate <- mantel.randtest(distIntermediate, dist_geo) mantel_Intermediate ``` This relationship can also be visualized. A positive, bimodal relationship can be observed. It also looks like there are some pairs with particularly low genetic distance in close geographic proximity, which may represent siblings, parent-offspring, or clones. ```{r plotmantel, warning = FALSE} ggplot(mapping = aes(x = dist_geo, y = distIntermediate)) + geom_point(alpha = 0.1) + geom_density_2d() + labs(x = "Geographic distance", y = "Genetic distance") ``` ### Export of genotypes to adegenet Are there other functions from adegenet that you want to use? You can export discrete genotypes directly from polyRAD to a genind object. ```{r exportgenind, eval = FALSE} mygenind <- Export_adegenet_genind(mydataIntermediate) ``` ## Visualization with UMAP If we visualize our first two PCA axes, some geographical regions and DAPC clusters are stretched out and not well-represented on them, and we know there's more to the story since there was a third axis that seemed important and reproducible. ```{r pca2d} ggplot(Msa_latlong, aes(x = PC1, y = PC2, color = Country, shape = as.factor(Ploidy))) + geom_point() ggplot(Msa_latlong, aes(x = PC1, y = PC2, color = DAPC_Intermediate, shape = as.factor(Ploidy))) + geom_point() ``` An alternative is a non-linear reduction called UMAP, which places similar individuals next to each other, with further distances only being rough approximations of genetic distance. ```{r umap} set.seed(1002) myumap <- umap(matrixIntermediate, method = "naive") identical(Msa_latlong$Accession, rownames(myumap$layout)) Msa_latlong$UMAP_1 <- myumap$layout[,1] Msa_latlong$UMAP_2 <- myumap$layout[,2] ggplot(Msa_latlong, aes(x = UMAP_1, y = UMAP_2, color = Country, shape = as.factor(Ploidy))) + geom_point() ggplot(Msa_latlong, aes(x = UMAP_1, y = UMAP_2, color = DAPC_Intermediate, shape = as.factor(Ploidy))) + geom_point() ``` We can see the separation of northern and southern Japan much more clearly now. The Yangtze River cluster still looks like an offshoot of other populations from China, but no longer takes up a disproportionate amount of space on the plot for the number of individuals it has. ## Analysis of molecular variance (AMOVA) using pegas In AMOVA, genetic distances between individuals or populations are partitioned so that we can determine the amount of variance explained by hierarchical groupings. There are a few implementations in R, and we will use the one from the pegas package here. Here the matrix of Euclidean distances is regressed on the DAPC clusters. ```{r amova} set.seed(1002) amovaIntermediate <- pegas::amova(distIntermediate ~ DAPC_Intermediate, data = Msa_latlong) amovaIntermediate ``` We can see that these four DAPC clusters explain 16.1% of the molecular variance. Significance is less than 0.001; we would have to do more permutations (`nperm`) to determine it exactly. ## Differentiation statistics using polysat To calculate population differentiation statistics, we need to know allele frequencies in each group. We can calculate these as column means of the posterior mean genotypes. ```{r popfreq} matrixIntermediateAll <- GetWeightedMeanGenotypes(mydataIntermediate, omit1allelePerLocus = FALSE) freq_by_pop <- do.call(rbind, by(matrixIntermediateAll, Msa_latlong$DAPC_Intermediate, colMeans)) freq_by_pop[,1:5] ``` We need to make sure that allele frequencies sum to one within loci, which is not guaranteed by polyRAD when there are more than two alleles. We also need to rename alleles to a format supported by polysat. ```{r freqnorm} for(i in seq_len(nLoci(mydataIntermediate))){ thesealleles <- which(mydataIntermediate$alleles2loc == i) # normalize to sum to one temp <- freq_by_pop[,thesealleles] temp <- sweep(temp, 1, rowSums(temp), "/") freq_by_pop[,thesealleles] <- temp # change allele names thisloc <- GetLoci(mydataIntermediate)[i] thisloc <- gsub("\\.", "_", thisloc) # remove periods from locus names colnames(freq_by_pop)[thesealleles] <- paste(thisloc, seq_along(thesealleles), sep = ".") } ``` For the polysat `calcPopDiff` function, we also need a column called "Genomes" indicating the size of the population multiplied by the ploidy. ```{r freqgenomes} genomes <- tapply(Msa_latlong$Ploidy, Msa_latlong$DAPC_Intermediate, sum) freq_df <- data.frame(Genomes = genomes, freq_by_pop) ``` Now we can get differentiation statistics. ```{r diffstats} my_JostD <- calcPopDiff(freq_df, metric = "Jost's D") my_JostD my_Gst <- calcPopDiff(freq_df, metric = "Gst") my_Gst ``` We can also use our matrix of allele frequencies to calculate expected heterozygosity (i.e. the probability that if two alleles are drawn at random from the population, they will be different), a common measure of diversity. Here I'm just doing the calculation manually rather than using polysat. ```{r diversity} div_mat <- matrix(nrow = nrow(freq_by_pop), ncol = nLoci(mydataIntermediate), dimnames = list(rownames(freq_by_pop), GetLoci(mydataIntermediate))) for(i in seq_len(nLoci(mydataIntermediate))){ thesealleles <- which(mydataNaive$alleles2loc == i) temp <- freq_by_pop[,thesealleles] div_mat[,i] <- 1 - rowSums(temp ^ 2) } rowMeans(div_mat) ``` ## Export to Structure [Structure](https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/html/structure.html) is more than two decades old and has stood the test of time. You can use the `Export_Structure` function to take discrete gentoype calls from polyRAD and export them to Structure. A Structure run on a typical GBS dataset can take several days, so I recommend using a computing cluster or cloud service to run many jobs in parallel from the command line. In my experience, this approach is worthwhile because Structure is much more sensitive than similar but faster methods. (Instructions on running Structure from the command line are beyond the scope of this tutorial.) ```{r structure, eval = FALSE} Export_Structure(mydataIntermediate, "Msa_structure.txt") ``` ``` Number of individuals: 595 Number of loci: 553 Ploidy of data: 4 Missing data value: -9 File contains: Row of marker names Individual ID for each individual ``` ## Did I miss anything? If you have any questions or problems with this tutorial, you can [start a discussion](https://github.com/lvclark/polyRADtutorials/discussions) or [file an issue](https://github.com/lvclark/polyRADtutorials/issues).