Title: | Tools for Polyploid Microsatellite Analysis |
---|---|
Description: | A collection of tools to handle microsatellite data of any ploidy (and samples of mixed ploidy) where allele copy number is not known in partially heterozygous genotypes. It can import and export data in ABI 'GeneMapper', 'Structure', 'ATetra', 'Tetrasat'/'Tetra', 'GenoDive', 'SPAGeDi', 'POPDIST', 'STRand', and binary presence/absence formats. It can calculate pairwise distances between individuals using a stepwise mutation model or infinite alleles model, with or without taking ploidies and allele frequencies into account. These distances can be used for the calculation of clonal diversity statistics or used for further analysis in R. Allelic diversity statistics and Polymorphic Information Content are also available. polysat can assist the user in estimating the ploidy of samples, and it can estimate allele frequencies in populations, calculate pairwise or global differentiation statistics based on those frequencies, and export allele frequencies to 'SPAGeDi' and 'adegenet'. Functions are also included for assigning alleles to isoloci in cases where one pair of microsatellite primers amplifies alleles from two or more independently segregating isoloci. polysat is described by Clark and Jasieniuk (2011) <doi:10.1111/j.1755-0998.2011.02985.x> and Clark and Schreier (2017) <doi:10.1111/1755-0998.12639>. |
Authors: | Lindsay V. Clark [aut, cre] , Alistair J. Hall [ctb] , Handunnethi Nihal de Silva [ctb], Tyler William Smith [ctb] |
Maintainer: | Lindsay V. Clark <[email protected]> |
License: | GPL-2 |
Version: | 1.7-7 |
Built: | 2024-11-01 04:40:04 UTC |
Source: | https://github.com/lvclark/polysat |
The accessor functions return information that is contained, either
directly or indirectly, in the slots of a gendata
object. The
replacement functions alter information in one or more slots as appropriate.
Samples(object, populations, ploidies) Samples(object) <- value Loci(object, usatnts, ploidies) Loci(object) <- value PopInfo(object) PopInfo(object) <- value PopNames(object) PopNames(object) <- value PopNum(object, popname) PopNum(object, popname) <- value Ploidies(object, samples, loci) Ploidies(object) <- value Usatnts(object) Usatnts(object) <- value Description(object) Description(object) <- value Missing(object) Missing(object) <- value Present(object) Present(object) <- value Absent(object) Absent(object) <- value Genotype(object, sample, locus) Genotype(object, sample, locus) <- value Genotypes(object, samples = Samples(object), loci = Loci(object)) Genotypes(object, samples = Samples(object), loci = Loci(object)) <- value
Samples(object, populations, ploidies) Samples(object) <- value Loci(object, usatnts, ploidies) Loci(object) <- value PopInfo(object) PopInfo(object) <- value PopNames(object) PopNames(object) <- value PopNum(object, popname) PopNum(object, popname) <- value Ploidies(object, samples, loci) Ploidies(object) <- value Usatnts(object) Usatnts(object) <- value Description(object) Description(object) <- value Missing(object) Missing(object) <- value Present(object) Present(object) <- value Absent(object) Absent(object) <- value Genotype(object, sample, locus) Genotype(object, sample, locus) <- value Genotypes(object, samples = Samples(object), loci = Loci(object)) Genotypes(object, samples = Samples(object), loci = Loci(object)) <- value
object |
An object of the class |
populations |
A character or numeric vector indicating from which populations to return samples. (optional) |
ploidies |
A numeric vector indicating ploidies, if only samples or loci with a certain ploidy should be returned. (optional) |
sample |
A character string or number indicating the name or number of the sample whose genotype should be returned. |
locus |
A character string or number indicating the name or number of the locus whose genotype should be returned. |
samples |
A character or numeric vector indicating samples for which to return genotypes or ploidies. (optional) |
loci |
A character or numeric vector indicating loci for which to return genotypes or ploidies. (optional) |
usatnts |
A numeric vector indicating microsatellite repeat lengths, where only loci of those repeat lengths should be returned. (optional) |
popname |
Chacter string or vector. The name(s) of the
population(s) for which to retrieve or replace the corresponding
|
value |
|
Samples<-
and Loci<-
can only be used to change sample
and locus names, not to add or remove samples and loci from the
dataset.
For slots that require integer values, numerical values used in replacement functions will be coerced to integers. The replacement functions also ensure that all slots remain properly indexed.
The Missing<-
function finds any genotypes with the old missing
data symbol and changes them to the new missing data symbol, then
assigns the new symbol to the slot that indicates what the missing
data symbol is. Present<-
and Absent<-
work similarly for
the genbinary
class.
The Genotype
access and replacement functions deal with
individual genotypes, which are vectors in the genambig
class.
The Genotypes
access and replacement functions deal with lists
of genotypes.
The PopInfo<-
replacement function also adds elements to
PopNames(object)
if necessary in order to have names for all of
the populations. These will be of the form "Pop"
followed by
the population number, and can be later edited using
PopNames<-
.
The PopNum<-
replacement function first finds all samples in
the population popname
, and replaces the number in
PopInfo(object)
for those samples with value
. It then
inserts NA
into the original PopNames
slot that
contained popname
, and inserts popname
into
PopNames(object)[value]
. If this results in two populations
being merged into one, a message is printed to the console.
PopInfo
, PopNames
, Missing
, Description
,
Usatnts
, Ploidies
and Genotypes
simply return the
contents of the slots of the same names (or in the case of
Ploidies
, object@Ploidies@pld
is returned). Samples
and
Loci
return character vectors taken from the names
of
other slots (PopInfo
and Usatnts
,
respectively; the initialization and replacement methods ensure that
these slots are always named according to samples and
loci). PopNum
returns an integer vector indicating the
population number(s) of the population(s) named in popname
.
Genotype
returns a single genotype for a given sample and
locus, which is a vector whose exact form will depend on the class of
object
.
Lindsay V. Clark
deleteSamples
, deleteLoci
,
viewGenotypes
, editGenotypes
,
isMissing
, estimatePloidy
,
merge,gendata,gendata-method
, gendata
# create a new genambig (subclass of gendata) object to manipulate mygen <- new("genambig", samples=c("a", "b", "c"), loci=c("locG", "locH")) # retrieve the sample and locus names Samples(mygen) Loci(mygen) # change some of the sample and locus names Loci(mygen) <- c("lG", "lH") Samples(mygen)[2] <- "b1" # describe the dataset Description(mygen) <- "Example dataset for documentation." # name some populations and assign samples to them PopNames(mygen) <- c("PopL", "PopK") PopInfo(mygen) <- c(1,1,2) # now we can retrieve samples by population Samples(mygen, populations="PopL") # we can also adjust the numbers if we want to make them # match another dataset PopNum(mygen, "PopK") <- 3 PopNames(mygen) PopInfo(mygen) # change the population identity of just one sample PopInfo(mygen)["b1"] <- 3 # indicate that both loci are dinucleotide repeats Usatnts(mygen) <- c(2,2) # indicate that all samples are tetraploid Ploidies(mygen) <- 4 # or Ploidies(mygen) <- rep(4, times = length(Samples(mygen)) * length(Loci(mygen))) # actually, one sample is triploid Ploidies(mygen)["c",] <- 3 # view ploidies Ploidies(mygen) # view the genotype array as it currently is: filled with missing # values Genotypes(mygen) # fill in the genotypes Genotypes(mygen, loci="lG") <- list(c(120, 124, 130, 136), c(122, 120), c(128, 130, 134)) Genotypes(mygen, loci="lH") <- list(c(200, 202, 210), c(206, 208, 210, 214), c(208)) # genotypes can also be edited or retrieved by sample Genotypes(mygen, samples="a") # fix a single genotype Genotype(mygen, "a", "lH") <- c(200, 204, 210) # retrieve a single genotype Genotype(mygen, "c", "lG") # change a genotype to being missing Genotype(mygen, "c", "lH") <- Missing(mygen) # show the current missing data symbol Missing(mygen) # an example of genotypes where one contains the missing data symbol Genotypes(mygen, samples="c") # change the missing data symbol Missing(mygen) <- as.integer(-1) # now look at the genotypes Genotypes(mygen, samples="c")
# create a new genambig (subclass of gendata) object to manipulate mygen <- new("genambig", samples=c("a", "b", "c"), loci=c("locG", "locH")) # retrieve the sample and locus names Samples(mygen) Loci(mygen) # change some of the sample and locus names Loci(mygen) <- c("lG", "lH") Samples(mygen)[2] <- "b1" # describe the dataset Description(mygen) <- "Example dataset for documentation." # name some populations and assign samples to them PopNames(mygen) <- c("PopL", "PopK") PopInfo(mygen) <- c(1,1,2) # now we can retrieve samples by population Samples(mygen, populations="PopL") # we can also adjust the numbers if we want to make them # match another dataset PopNum(mygen, "PopK") <- 3 PopNames(mygen) PopInfo(mygen) # change the population identity of just one sample PopInfo(mygen)["b1"] <- 3 # indicate that both loci are dinucleotide repeats Usatnts(mygen) <- c(2,2) # indicate that all samples are tetraploid Ploidies(mygen) <- 4 # or Ploidies(mygen) <- rep(4, times = length(Samples(mygen)) * length(Loci(mygen))) # actually, one sample is triploid Ploidies(mygen)["c",] <- 3 # view ploidies Ploidies(mygen) # view the genotype array as it currently is: filled with missing # values Genotypes(mygen) # fill in the genotypes Genotypes(mygen, loci="lG") <- list(c(120, 124, 130, 136), c(122, 120), c(128, 130, 134)) Genotypes(mygen, loci="lH") <- list(c(200, 202, 210), c(206, 208, 210, 214), c(208)) # genotypes can also be edited or retrieved by sample Genotypes(mygen, samples="a") # fix a single genotype Genotype(mygen, "a", "lH") <- c(200, 204, 210) # retrieve a single genotype Genotype(mygen, "c", "lG") # change a genotype to being missing Genotype(mygen, "c", "lH") <- Missing(mygen) # show the current missing data symbol Missing(mygen) # an example of genotypes where one contains the missing data symbol Genotypes(mygen, samples="c") # change the missing data symbol Missing(mygen) <- as.integer(-1) # now look at the genotypes Genotypes(mygen, samples="c")
Where a single locus represents two or more independent isoloci (as in
an allopolyploid, or a diploidized autopolyploid), these two functions
can be used in sequence to assign alleles to isoloci.
alleleCorrelations
uses K-means and UPGMA clustering of pairwise p-values
from Fisher's exact test to make initial groupings of alleles into
putative isoloci. testAlGroups
is then used to check those
groupings against individual genotypes, and adjust the assignments if necessary.
alleleCorrelations(object, samples = Samples(object), locus = 1, alpha = 0.05, n.subgen = 2, n.start = 50) testAlGroups(object, fisherResults, SGploidy=2, samples=Samples(object), null.weight=0.5, tolerance=0.05, swap = TRUE, R = 100, rho = 0.95, T0 = 1, maxreps = 100)
alleleCorrelations(object, samples = Samples(object), locus = 1, alpha = 0.05, n.subgen = 2, n.start = 50) testAlGroups(object, fisherResults, SGploidy=2, samples=Samples(object), null.weight=0.5, tolerance=0.05, swap = TRUE, R = 100, rho = 0.95, T0 = 1, maxreps = 100)
object |
A |
samples |
An optional character or numeric vector indicating which samples to analyze. |
locus |
A single character string or integer indicating which locus to analyze. |
alpha |
The significance threshold, before multiple correction, for determining whether two alleles are significantly correlated. |
n.subgen |
The number of subgenomes (number of isoloci) for this locus. This would
be |
n.start |
Integer, passed directly to the |
fisherResults |
A list output from |
SGploidy |
The ploidy of each subgenome (each isolocus). This is |
null.weight |
Numeric, indicating how genotypes with potential null alleles should
be counted when looking for signs of homoplasy. |
tolerance |
The proportion of genotypes that are allowed to be in disagreement with the allele assignments. This is the proportion of genotypes that are expected to have meiotic error or scoring error. |
swap |
Boolean indicating whether or not to use the allele swapping algorithm before checking for homoplasy. TRUE will yield more accurate results in most cases, but FALSE may be preferable for loci with null or homoplasious alleles at high frequency. |
R |
Simulated annealing parameter for the allele swapping algorithm. Indicates how many swaps to attempt in each rep (i.e. how many swaps to attempt before changing the temperature). |
rho |
Simulated annealing parameter for the allele swapping algorithm. Factor by which to reduce the temperature at the end of each rep. |
T0 |
Simulated annealing parameter for the allele swapping algorithm. Starting temperature. |
maxreps |
Simulated annealing parameter for the allele swapping algorithm. Maximum number of reps if convergence is not achieved. |
These functions implement a novel methodology, introduced in polysat version 1.4 and updated in version 1.6, for cases where one pair of microsatellite primers amplifies alleles at two or more independently-segregating loci (referred to here as isoloci). This is not typically the case with new autopolyploids, in which all copies of a locus have equal chances of pairing with each other at meiosis. It is, however, frequently the case with allopolyploids, in which there are two homeologous subgenomes that do not pair (or infrequently pair) at meiosis, or ancient autopolyploids, in which duplicated chromosomes have diverged to the point of no longer pairing at meiosis.
Within the two functions there are four major steps:
alleleCorrelations
checks to see if there are any alleles
that are present in every genotype in the dataset. Such invariable
alleles are assumed to be fixed at one isolocus (which is not
necessarily true, but may be corrected by
testAlGroups
in steps 4 and 5).
If present, each invariable allele is assigned to its own isolocus.
If there are more invariable alleles than isoloci, the function throws
an error. If only one isolocus remains, all remaining (variable) alleles are
assigned to that isolocus. If there are as many invariable alleles as
isoloci, all remaining (variable) alleles are assigned to all isoloci
(i.e. they are considered homoplasious because they cannot be
assigned).
If, after step 1, two or more isoloci remain
without alleles assigned to them, correlations between alleles are
tested by alleleCorrelations
. The dataset is converted
to "genbinary"
if not
already in that format, and a Fisher's exact test, with negative
association (odds ratio being less than one) as the alternative
hypothesis, is performed between
each pair of columns (alleles) in the genotype matrix. The p-value of
this test between each pair of alleles is stored in a square matrix,
and zeros are inserted into the diagonal of the matrix. K-means
clustering and UPGMA are then performed on the square matrix of
p-values, and the
clusters that are produced represent initial assignments of alleles
to isoloci.
The output of alleleCorrelations
is then passed to
testAlGroups
. If the results of K-means clustering and UPGMA
were not identical, testAlGroups
checks both sets of
assignments against all genotypes in the dataset. For a genotype to
be consistent with a set of assignments, it should have at least one
allele and no more than SGploidy
alleles belonging to each
isolocus. The set of assignments that is consistent with the greatest
number of genotypes is chosen, or in the case of a tie, the set of
assignments produced by K-means clustering.
If swap = TRUE
and the assignments chosen in the previous
step are inconsistent with some genotypes, testAlGroups
attempts
to swap the isoloci of single alleles, using a simulated annealing
(Bertsimas and Tsitsiklis 1993) algorithm to search for a new set of
assignments that is consistent with as many genotypes as possible.
At each step, an allele is chosen at random to be moved to a different
isolocus (which is also chosen at random if there are more than two
isoloci). If the new set of allele assignments is consistent with an equal or
greater number of genotypes than the previous set of assignments, the new
set is retained. If the new set is consistent with fewer genotypes than
the old set, there is a small probability of retaining the new set,
dependent on how much worse the new set of assignments is and what the
current “temperature” of the algorithm is. After R
allele
swapping attempts, the temperature is lowered, reducing the probability
of retaining a set of allele assignments that is worse than the previous set.
A new rep of R
swapping attempts then begins.
If a set of allele assignments is found that is consistent with all genotypes,
the algorithm stops immediately. Otherwise it stops if no changes are made
during an entire rep of R
swap attempts, or if maxreps
reps
are performed.
testAlGroups
then checks through all genotypes to look
for signs of homoplasy, meaning single alleles that should be assigned
to more than one isolocus. For each genotype, there should be no more
than SGploidy
alleles assigned to each isolocus. Additionally,
if there are no null alleles, each genotype should have at least one
allele belonging to each isolocus. Each time a genotype is
encountered that does not meet these criteria, the a score is
increased for all alleles that might be homoplasious. (The second
criterion is not checked if null.weight = 0
.) This score
starts at zero and is increased by 1 if there are too many alleles per
isolocus or by null.weight
if an isolocus has no alleles. Once
all genotypes have been checked, the allele with the highest score is
considered to be homoplasious and is added to the other isolocus. (In
a hexaploid or higher, which isolocus the allele is added to depends on the
genotypes that were found to be inconsistent with the allele
assignments, and which isolocus or isoloci the allele could have
belonged to in order to fix the assignment.) Allele scores are reset
to zero and all alleles are then
checked again with the new set of allele assignments. The process is
repeated until the proportion of genotypes that are inconsistent with
the allele assignments is at or below tolerance
.
Both functions return lists. For alleleCorrelations
:
locus |
The name of the locus that was analyzed. |
clustering.method |
The method that was ultimately used to
produce |
significant.neg |
Square matrix of logical values indicating whether there was significant negative correlation between each pair of alleles, after multiple testing correction by Holm-Bonferroni. |
significant.pos |
Square matrix of logical values indicating whether there was significant positive correlation between each pair of alleles, after multiple testing correction by Holm-Bonferroni. |
p.values.neg |
Square matrix of p-values from Fisher's exact test for negative correlation between each pair of alleles. |
p.values.pos |
Square matrix of p-values from Fisher's exact test for positive correlation between each pair of alleles. |
odds.ratio |
Square matrix of the odds ratio estimate from Fisher's exact test for each pair of alleles. |
Kmeans.groups |
Matrix with |
UPGMA.groups |
Matrix in the same format as
|
heatmap.dist |
Square matrix like |
totss |
Total sums of squares output from K-means clustering. |
betweenss |
Sums of squares between clusters output from K-means
clustering. |
gentable |
The table indicating presence/absence of each allele in each genotype. |
For testAlGroups
:
locus |
Name of the locus that was tested. |
SGploidy |
The ploidy of each subgenome, taken from the
|
assignments |
Matrix with as many rows as there are isoloci, and as many
columns as there are alleles in the dataset. |
proportion.inconsistent.genotypes |
A number ranging from zero to one
indicating the proportion of genotypes from the dataset that are inconsistent
with |
alleleCorrelations
will print a warning to the console or to the
standard output stream if a significant positive correlation is found
between any pair of alleles. (This is not a “warning” in the
technical sense usually used in R, because it can occur by random
chance and I did not want it to cause polysat to fail package
checks.) You can see which allele pair(s) caused this warning by
looking at value$significant.pos
. If you receive this warning for
many loci, consider that there may be population structure in your
dataset, and that you might split the dataset into multiple
populations to test seperately. If it happens at just a few loci,
check to make sure there are not scoring problems such as stutter
peaks being miscalled as alleles. If it only happens at one locus and
you can't find any evidence of scoring problems, two alleles may have
been positively correlated simply from random chance, and the warning
can be ignored.
alleleCorrelations
can also produce an actual warning stating
“Quick-TRANSfer stage steps exceeded maximum”. This warning
is produced internally by kmeans
and may occur if many
genotypes are similar, as in mapping populations. It can be safely
ignored.
Lindsay V. Clark
Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between allelic variables. Molecular Ecology Resources, 17, 1090–1103. DOI: 10.1111/1755-0998.12639.
Bertsimas, D. and Tsitsiklis, J.(1993) Simulated annealing. Statistical Science 8, 10–15.
recodeAllopoly
, mergeAlleleAssignments
,
catalanAlleles
, processDatasetAllo
# randomly generate example data for an allotetraploid mydata <- simAllopoly(n.alleles=c(5,5), n.homoplasy=1) viewGenotypes(mydata) # test allele correlations # n.start is lowered in this example to speed up computation time myCorr <- alleleCorrelations(mydata, n.subgen=2, n.start=10) myCorr$Kmeans.groups myCorr$clustering.method if(!is.null(myCorr$heatmap.dist)) heatmap(myCorr$heatmap.dist) # check individual genotypes # (low maxreps used in order to speed processing time for this example) myRes <- testAlGroups(mydata, myCorr, SGploidy=2, maxreps = 5) myRes$assignments myRes2 <- testAlGroups(mydata, myCorr, SGploidy=2, swap = FALSE) myRes2$assignments
# randomly generate example data for an allotetraploid mydata <- simAllopoly(n.alleles=c(5,5), n.homoplasy=1) viewGenotypes(mydata) # test allele correlations # n.start is lowered in this example to speed up computation time myCorr <- alleleCorrelations(mydata, n.subgen=2, n.start=10) myCorr$Kmeans.groups myCorr$clustering.method if(!is.null(myCorr$heatmap.dist)) heatmap(myCorr$heatmap.dist) # check individual genotypes # (low maxreps used in order to speed processing time for this example) myRes <- testAlGroups(mydata, myCorr, SGploidy=2, maxreps = 5) myRes$assignments myRes2 <- testAlGroups(mydata, myCorr, SGploidy=2, swap = FALSE) myRes2$assignments
alleleDiversity
returns the number of unique alleles and/or a
list of vectors of all unique alleles, indexed by locus and population.
alleleDiversity(genobject, samples = Samples(genobject), loci = Loci(genobject), alleles = TRUE, counts = TRUE)
alleleDiversity(genobject, samples = Samples(genobject), loci = Loci(genobject), alleles = TRUE, counts = TRUE)
genobject |
An object of the class |
samples |
Optional. A character or numeric vector indicating samples to include in the analysis. |
loci |
Optional. A character or numeric vector indicating loci to include in the analysis. |
alleles |
Boolean, indicating whether or not to return the alleles themselves. |
counts |
Boolean, indicating whether or not to return the number of unique alleles. |
Under default settings, a list is returned:
alleles |
A two dimensional list. The first dimension is indexed
by population, with the additional element “overall”
representing the entire dataset. The second dimension is indexed by
locus. Each element of the list is a vector, containing all unique
alleles found for that population and locus.
|
counts |
A matrix, indexed in the same way as |
If the argument alleles
or counts
is set to FALSE
,
then only one of the above list elements is returned.
Lindsay V. Clark
# generate a dataset for this example mygen <- new("genambig", samples=c("a","b","c","d"), loci=c("E","F")) PopInfo(mygen) <- c(1,1,2,2) Genotypes(mygen, loci="E") <- list(c(122,132),c(122,124,140), c(124,130,132),c(132,136)) Genotypes(mygen, loci="F") <- list(c(97,99,111),c(113,115), c(99,113),c(111,115)) # look at unique alleles myal <- alleleDiversity(mygen) myal$counts myal$alleles myal$alleles[["Pop1","E"]] myal$alleles[["overall","F"]]
# generate a dataset for this example mygen <- new("genambig", samples=c("a","b","c","d"), loci=c("E","F")) PopInfo(mygen) <- c(1,1,2,2) Genotypes(mygen, loci="E") <- list(c(122,132),c(122,124,140), c(124,130,132),c(132,136)) Genotypes(mygen, loci="F") <- list(c(97,99,111),c(113,115), c(99,113),c(111,115)) # look at unique alleles myal <- alleleDiversity(mygen) myal$counts myal$alleles myal$alleles[["Pop1","E"]] myal$alleles[["overall","F"]]
This is a simulated microsatellite dataset for seven loci and 303 individuals. It is intended to be used with the tutorial “Assigning alleles to isoloci in polysat”.
data("AllopolyTutorialData")
data("AllopolyTutorialData")
A "genambig"
object.
data(AllopolyTutorialData) summary(AllopolyTutorialData) viewGenotypes(AllopolyTutorialData, samples=1:10, loci=1)
data(AllopolyTutorialData) summary(AllopolyTutorialData) viewGenotypes(AllopolyTutorialData, samples=1:10, loci=1)
assignClones
uses a distance matrix such as that produced by
meandistance.matrix
or meandistance.matrix2
to place individuals into groups representing asexually-related ramets,
or any other grouping based on a distance threshold.
assignClones(d, samples = dimnames(d)[[1]], threshold = 0)
assignClones(d, samples = dimnames(d)[[1]], threshold = 0)
d |
A symmetrical, square matrix containing genetic distances between
individuals. Both dimensions should be named according to the names
of the individuals (samples). A matrix produced by
|
samples |
A character vector containing the names of samples to analyze. This
should be all or a subset of the names of |
threshold |
A number indicating the maximum distance between two individuals that will be placed into the same group. |
This function groups individuals very similarly to the software GenoType
(Meirmans and van Tienderen, 2004). If a distance matrix from
polysat is exported to GenoType, the results will be the same as
those from assignClones
assuming the same threshold is used. Note
that GenoType requires that
distances be integers rather than decimals, so you will have to
multiply the distances produced by polysat by a large number and
round them to the nearest integer if you wish to export them to GenoType.
When comparing the
results of assignClones
and GenoType using my own data, the only
differences I have seen have been the result of rounding; a decimal that
was slightly above the threshold in when analyzed in R was rounded down
to the threshold when analyzed in GenoType.
Note that when using a distance threshold of zero (the default), it is advisable to exclude all samples with missing data, in order to prevent the merging of non-identical clones. At higher thresholds, some missing data are allowable, but samples that have missing data at many loci should be excluded.
The write.table
function can be used for exporting the results to
GenoDive. See the R documentation for information on how to
make a tab-delimited file with no header.
A numeric vector, named by samples
. Each clone or group is given
a number, and the number for each sample indicates the clone or group
to which it belongs.
Lindsay V. Clark
Meirmans, P. G. and Van Tienderen, P. H. (2004) GENOTYPE and GENODIVE: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes 4, 792–794.
# set up a simple matrix with three samples test <- matrix(c(0,0,.5,0,0,.5,.5,.5,0), ncol=3, nrow=3) abc <- c("a", "b", "c") dimnames(test) <- list(abc,abc) # assign clones with a threshold of zero or 0.5 assignClones(test) assignClones(test, threshold=0.5)
# set up a simple matrix with three samples test <- matrix(c(0,0,.5,0,0,.5,.5,.5,0), ncol=3, nrow=3) abc <- c("a", "b", "c") dimnames(test) <- list(abc,abc) # assign clones with a threshold of zero or 0.5 assignClones(test) assignClones(test, threshold=0.5)
This function calculates the distance between two individuals at one microsatellite locus using a method based on that of Bruvo et al. (2004).
Bruvo.distance(genotype1, genotype2, maxl=10, usatnt=2, missing=-9)
Bruvo.distance(genotype1, genotype2, maxl=10, usatnt=2, missing=-9)
genotype1 |
A vector of alleles for one individual at one locus. Allele length is in nucleotides or repeat count. Each unique allele corresponds to one element in the vector, and the vector is no longer than it needs to be to contain all unique alleles for this individual at this locus. |
genotype2 |
A vector of alleles for another individual at the same locus. |
maxl |
If both individuals have more than this number of
alleles at this locus, |
usatnt |
Length of the repeat at this locus. For example
|
missing |
A numerical value that, when in the first allele
position, indicates missing data. |
Since allele copy number is frequently unknown in polyploid microsatellite data, Bruvo et al. developed a measure of genetic distance similar to band-sharing indices used with dominant data, but taking into account mutational distances between alleles. A matrix is created containing all differences in repeat count between the alleles of two individuals at one locus. These differences are then geometrically transformed to reflect the probabilities of mutation from one allele to another. The matrix is then searched to find the minimum sum if each allele from one individual is paired to one allele from the other individual. This sum is divided by the number of alleles per individual.
If one genotype has more alleles than the other, ‘virtual alleles’ must
be created so that both genotypes are the same length. There are
three options for the value of these virtual alleles, but
Bruvo.distance
only implements the simplest one, assuming that it is
not known whether differences in ploidy arose from genome addition or
genome loss. Virtual alleles are set to infinity, such that the
geometric distance between any allele and a virtual allele is 1.
In the original publication by Bruvo et al. (2004), ambiguous
genotypes were dealt with by calculating the distance for all possible
unambiguous genotype combinations and averaging across all of them
equally. When Bruvo.distance
is called from
meandistance.matrix
, ploidy is unknown and all genotypes are
simply treated as if they had one copy of each allele. When
Bruvo.distance
is called from meandistance.matrix2
, the
analysis is truer to the original, in that ploidy is known and all
possible unambiguous genotype combinations are considered. However,
instead of all possible unambiguous genotypes being weighted equally,
in meandistance.matrix2
they are weighted based on allele
frequencies and selfing rate, since some unambiguous genotypes are
more likely than others.
A number ranging from 0 to 1, with 0 indicating identical
genotypes, and 1 being a theoretical maximum distance if all alleles
from genotype1
differed by an infinite number of repeats from all
alleles in genotype2
. NA
is returned if both genotypes have
more than maxl
alleles or if either genotype has the symbol for
missing data as its first allele.
The processing time is a function of the factorial of the number of
alleles, since each possible combination of allele pairs must be
evaluated. For genotypes with a sufficiently large number of alleles,
it may be more efficient to estimate distances manually by creating
the matrix in Excel and visually picking out the shortest distances
between alleles. This is the purpose of the maxl
argument. On my
personal computer, if both genotypes had more than nine alleles, the
calculation could take an hour or more, and so this is the default
limit. In this case, Bruvo.distance
returns NA
.
Lindsay V. Clark
Bruvo, R., Michiels, N. K., D'Sousa, T. G., and Schulenberg, H. (2004) A simple method for calculation of microsatellite genotypes irrespective of ploidy level. Molecular Ecology 13, 2101-2106.
meandistance.matrix
, Lynch.distance
,
Bruvo2.distance
Bruvo.distance(c(202,206,210,220),c(204,206,216,222)) Bruvo.distance(c(202,206,210,220),c(204,206,216,222),usatnt=4) Bruvo.distance(c(202,206,210,220),c(204,206,222)) Bruvo.distance(c(202,206,210,220),c(204,206,216,222),maxl=3) Bruvo.distance(c(202,206,210,220),c(-9))
Bruvo.distance(c(202,206,210,220),c(204,206,216,222)) Bruvo.distance(c(202,206,210,220),c(204,206,216,222),usatnt=4) Bruvo.distance(c(202,206,210,220),c(204,206,222)) Bruvo.distance(c(202,206,210,220),c(204,206,216,222),maxl=3) Bruvo.distance(c(202,206,210,220),c(-9))
This is an inter-individual distance measure similar to
Bruvo.distance
, except that where genotypes have different
numbers of alleles, virtual alleles are equal to those from the longer
and/or shorter genotype, rather than being equal to infinity.
Bruvo2.distance(genotype1, genotype2, maxl = 7, usatnt = 2, missing = -9, add = TRUE, loss = TRUE)
Bruvo2.distance(genotype1, genotype2, maxl = 7, usatnt = 2, missing = -9, add = TRUE, loss = TRUE)
genotype1 |
A numeric vector representing the genotype of one individual at one
locus. This type of vector is produced by the |
genotype2 |
The second genotype for the distance calculation, in the same format as
|
maxl |
The maximum number of alleles that either genotype can have. If it is
exceeded, |
usatnt |
The microsatellite repeat type for the locus. |
missing |
The symbol that indicates missing data for a given sample and locus.
See |
add |
|
loss |
|
Bruvo et al. (2004) describe multiple methods for calculating
genetic distances between individuals of different ploidy. (See
“Special cases” starting on page 2102 of the paper.) The
original Bruvo.distance
function in polysat uses the method
described for systems with complex changes in ploidy, adding virtual
alleles equal to infinity to the shorter genotype to make it the same
length as the longer genotype. This method, however, can exaggerate
distances between individuals of different ploidy, particularly when
used with meandistance.matrix2
as opposed to
meandistance.matrix
.
Bruvo2.distance
calculates distances between individuals under
the models of genome addition and genome loss. If add = TRUE
and
loss = TRUE
, the distance produced is equal to that of equation 6
in the paper.
If add = TRUE
and loss = FALSE
, the distance calculated is
that under genome addition only. Likewise if add = FALSE
and
loss = TRUE
the distance is calculated under genome loss only.
The latter distance should be greater than the former. If both were averaged
together, they would give the identical result to that produced when
add = TRUE
and loss = TRUE
. All three distances will be
less than that produced by Bruvo.distance
.
If both genotypes have the same number of alleles, they are passed to
Bruvo.distance
for the calculation. This also happens if
add = FALSE
and loss = FALSE
. Otherwise, if the genotypes have
different numbers of alleles, all possible genotypes with virtual
alleles are enumerated and passed to Bruvo.distance
one by one, and
the results averaged.
The number of different genotypes simulated under genome loss or genome
addition is , where
is the length of the genotype from
which virtual alleles are being taken, and
is the difference in
length between the longer and shorter genotype. For example, under
genome addition for a diploid individual with alleles 1 and 2 being
compared to a tetraploid individual, the genotypes 1211, 1212, 1221, and
1222 will each be used once to represent the diploid individual.
A decimal between 0 and 1, with 0 indicating complete identity of two
genotypes, and 1 indicating maximum dissimilarity. NA
is
returned if one or both genotypes are missing or if maxl
is exceeded.
Figure 1B and 1C of Bruvo et al. (2004) illustrate an example of this distance measure. To perform the identical calculation to that listed directly under equation 6, you would type:
Bruvo2.distance(c(20,23,24), c(20,24,26,43), usatnt=1)
However, you will notice that the result, 0.401, is slightly different from that given in the paper. This is due to an error in the paper. For the distance under genome loss when the virtual allele is 26, the result should be 1 instead of 1.75.
Lindsay V. Clark
Bruvo, R., Michiels, N. K., D'Sousa, T. G., and Schulenberg, H. (2004) A simple method for calculation of microsatellite genotypes irrespective of ploidy level. Molecular Ecology 13, 2101–2106.
Lynch.distance
, Bruvo.distance
,
meandistance.matrix2
Bruvo2.distance(c(102,104), c(104,104,106,110)) Bruvo2.distance(c(102,104), c(104,104,106,110), add = FALSE) Bruvo2.distance(c(102,104), c(104,104,106,110), loss = FALSE)
Bruvo2.distance(c(102,104), c(104,104,106,110)) Bruvo2.distance(c(102,104), c(104,104,106,110), add = FALSE) Bruvo2.distance(c(102,104), c(104,104,106,110), loss = FALSE)
Given a data frame of allele frequencies and population sizes,
calcPopDiff
calculates a matrix of pairwise ,
, Jost's
, or
values, or a single global
value for any of these four statistics.
calcFst
is a wrapper for calcPopDiff
to allow backwards
compatibility with previous versions of polysat.
calcPopDiff(freqs, metric, pops = row.names(freqs), loci = unique(gsub("\\..*$", "", names(freqs))), global = FALSE, bootstrap = FALSE, n.bootstraps = 1000, object = NULL) calcFst(freqs, pops = row.names(freqs), loci = unique(gsub("\\..*$", "", names(freqs))), ...)
calcPopDiff(freqs, metric, pops = row.names(freqs), loci = unique(gsub("\\..*$", "", names(freqs))), global = FALSE, bootstrap = FALSE, n.bootstraps = 1000, object = NULL) calcFst(freqs, pops = row.names(freqs), loci = unique(gsub("\\..*$", "", names(freqs))), ...)
freqs |
A data frame of allele frequencies and population sizes such as that
produced by |
metric |
The population differentiation statistic to estimate. Can be “Fst”, “Gst”, “Jost's D”, or “Rst”. |
pops |
A character vector. Populations to analyze, which should be
a subset of |
loci |
A character vector indicating which loci to analyze. These should be a
subset of the locus names as used in the column names of |
global |
Boolean. If |
bootstrap |
Boolean. If |
n.bootstraps |
Integer. The number of bootstrap replicates to perform. Ignored if
|
object |
A |
... |
Additional arguments to be passed to |
For metric = "Fst"
or calcFst
:
and
are estimate directly from allele frequencies
for each locus for each pair of populations, then averaged across loci.
Wright's
is then calculated for each pair of populations as
.
values (expected heterozygosities for populations and combined
populations) are calculated as one minus the sum of all squared allele
frequencies at a locus. To calculte
, allele frequencies between two
populations are averaged before the calculation. To calculate
,
values are averaged after the calculation. In both cases, the averages
are weighted by the relative sizes of the two populations (as indicated
by
freqs$Genomes
).
For metric = "Gst"
:
This metric is similar to , but mean allele frequencies and
mean expected heterozygosities are not weighted by population size. Additionally,
unbiased estimators of
and
are used according to Nei
and Chesser (1983; equations 15 and 16, reproduced also in Jost (2008)).
Instead of using twice the harmonic mean of the number of individuals in the two
subpopulations as described by Nei and Chesser (1983), the harmonic mean of the
number of allele copies in the two subpopulations (taken from
freq$Genomes
)
is used, in order to allow for polyploidy.
is estimated for each locus and then averaged across loci.
For metric = "Jost's D"
:
The unbiased estimators of and
and calculated as
with
, without weighting by population size. They are then used
to estimate
at each locus according to Jost (2008; equation 12):
Values of are then averaged across loci.
For metric = "Rst"
:
is estimated on a per-locus basis according to Slatkin (1995),
but with populations weighted equally regardless of size. Values are then averaged
across loci.
For each locus:
where is the number of populations,
is an individual population,
and
are individual alleles,
is the frequency of an allele in a population,
is the number of allele copies in a population,
is the size of an allele
expressed in number of repeat units,
is an allele frequency averaged across
populations (with populations weighted equally), and
is the total number of allele copies
across all populations.
If global = FALSE
and bootstrap = FALSE
, a square matrix containing ,
,
, or
values.
The rows and columns of the matrix are both named by population.
If global = TRUE
and bootstrap = FALSE
, a single value indicating the ,
,
, or
value.
If global = TRUE
and bootstrap = TRUE
, a vector of bootstrapped replicates of ,
,
, or
.
If global = FALSE
and bootstrap = TRUE
, a square two-dimensional list, with rows and columns named by population. Each item is
a vector of bootstrapped values for ,
,
, or
for that pair of populations.
Lindsay V. Clark
Nei, M. (1973) Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences of the United States of America 70, 3321–3323.
Nei, M. and Chesser, R. (1983) Estimation of fixation indices and gene diversities. Annals of Human Genetics 47, 253–259.
Jost, L. (2008) and its relatives to not measure differentiation.
Molecular Ecology 17, 4015–4026.
Slatkin, M. (1995) A measure of population subdivision based on microsatellite allele frequencies. Genetics 139, 457–462.
# create a data set (typically done by reading files) mygenotypes <- new("genambig", samples = paste("ind", 1:6, sep=""), loci = c("loc1", "loc2")) Genotypes(mygenotypes, loci = "loc1") <- list(c(206), c(208,210), c(204,206,210), c(196,198,202,208), c(196,200), c(198,200,202,204)) Genotypes(mygenotypes, loci = "loc2") <- list(c(130,134), c(138,140), c(130,136,140), c(138), c(136,140), c(130,132,136)) PopInfo(mygenotypes) <- c(1,1,1,2,2,2) mygenotypes <- reformatPloidies(mygenotypes, output="sample") Ploidies(mygenotypes) <- c(2,2,4,4,2,4) Usatnts(mygenotypes) <- c(2,2) # calculate allele frequencies myfreq <- simpleFreq(mygenotypes) # calculate pairwise differentiation statistics myfst <- calcPopDiff(myfreq, metric = "Fst") mygst <- calcPopDiff(myfreq, metric = "Gst") myD <- calcPopDiff(myfreq, metric = "Jost's D") myrst <- calcPopDiff(myfreq, metric = "Rst", object = mygenotypes) # examine the results myfst mygst myD myrst # get global statistics calcPopDiff(myfreq, metric = "Fst", global = TRUE) calcPopDiff(myfreq, metric = "Gst", global = TRUE) calcPopDiff(myfreq, metric = "Jost's D", global = TRUE) calcPopDiff(myfreq, metric = "Rst", global = TRUE, object = mygenotypes)
# create a data set (typically done by reading files) mygenotypes <- new("genambig", samples = paste("ind", 1:6, sep=""), loci = c("loc1", "loc2")) Genotypes(mygenotypes, loci = "loc1") <- list(c(206), c(208,210), c(204,206,210), c(196,198,202,208), c(196,200), c(198,200,202,204)) Genotypes(mygenotypes, loci = "loc2") <- list(c(130,134), c(138,140), c(130,136,140), c(138), c(136,140), c(130,132,136)) PopInfo(mygenotypes) <- c(1,1,1,2,2,2) mygenotypes <- reformatPloidies(mygenotypes, output="sample") Ploidies(mygenotypes) <- c(2,2,4,4,2,4) Usatnts(mygenotypes) <- c(2,2) # calculate allele frequencies myfreq <- simpleFreq(mygenotypes) # calculate pairwise differentiation statistics myfst <- calcPopDiff(myfreq, metric = "Fst") mygst <- calcPopDiff(myfreq, metric = "Gst") myD <- calcPopDiff(myfreq, metric = "Jost's D") myrst <- calcPopDiff(myfreq, metric = "Rst", object = mygenotypes) # examine the results myfst mygst myD myrst # get global statistics calcPopDiff(myfreq, metric = "Fst", global = TRUE) calcPopDiff(myfreq, metric = "Gst", global = TRUE) calcPopDiff(myfreq, metric = "Jost's D", global = TRUE) calcPopDiff(myfreq, metric = "Rst", global = TRUE, object = mygenotypes)
catalanAlleles
uses genotypes present in a "genambig"
object to sort alleles from one locus into two or more isoloci in an
allopolyploid or diploidized autopolyploid species. Alleles are
determined to belong to different
isoloci if they are both present in a fully homozygous genotype. If
necessary, heterozygous genotypes are also examined to resolve remaining
alleles.
catalanAlleles(object, samples = Samples(object), locus = 1, n.subgen = 2, SGploidy = 2, verbose = FALSE)
catalanAlleles(object, samples = Samples(object), locus = 1, n.subgen = 2, SGploidy = 2, verbose = FALSE)
object |
A |
samples |
Optional argument indicating samples to be analyzed. Can be integer or character, as with other polysat functions. |
locus |
An integer or character string indicating which locus to analyze. Cannot be a vector greater than length 1. (The function will only analyze one locus at a time.) |
n.subgen |
The number of isoloci (number of subgenomes). For example, |
SGploidy |
The ploidy of each genome. Only one value is allowed; all genomes
must be the same ploidy. |
verbose |
Boolean. Indicates whether results, and if applicable, problematic genotypes, should be printed to the console. |
catalanAlleles
implements and extends an approach used by Catalan
et al. (2006) that sorts alleles from a duplicated microsatellite
locus into two or more isoloci (homeologous loci on different
subgenomes). First, fully homozygous genotypes are identified and used
in analysis. If a genotype has as many alleles as there are subgenomes
(for example, a genotype with two alleles in an allotetraploid species),
it is assumed to be fully homozygous and the alleles are assumed to
belong to different subgenomes. If some alleles remain unassigned after
examination of all fully homozygous genotypes, heterozygous genotypes
are also examined to attempt to assign those remaining alleles.
For example, in an allotetraploid, if a genotype contains one unassigned allele, and all other alleles in the genotype are known to belong to one isolocus, the unassigned allele can be assigned to the other isolocus. Or, if two alleles in a genotype belong to one isolocus, one allele belongs to the other isolocus, and one allele is unassigned, the unassigned allele can be assigned to the latter isolocus. The function follows such logic (which can be extended to higher ploidies) until all alleles can be assigned, or returns a text string saying that the allele assignments were unresolvable.
It is important to note that this method assumes no null alleles and no homoplasy across isoloci. If the function encounters evidence of either it will not return allele assignments. Null alleles and homoplasy are real possibilities in any dataset, which means that this method simply will not work for some microsatellite loci.
(Null alleles are those that do not produce a PCR amplicon, usually because of a mutation in the primer binding site. Alleles that exhibit homoplasy are those that produce amplicons of the same size, despite not being identical by descent. Specifically, homoplasy between alleles from different isoloci will interfere with the Catalan method of allele assignment.)
A list containing the following items:
locus |
A character string giving the name of the locus. |
SGploidy |
A number giving the ploidy of each subgenome.
Identical to the |
assignments |
If assignments cannot be made, a character string
describing the problem. Otherwise, a matrix with |
Aside from homoplasy and null alleles, stochastic effects may prevent the minimum combination of genotypes needed to resolve all alleles from being present in the dataset. For a typical allotetraploid dataset, 50 to 100 samples will be needed, whereas an allohexaploid dataset may require over 100 samples. In simulations, allo-octoploid datasets with two tetraploid genomes were unresolvable even with 10,000 samples due to the low probability of finding full homozygotes. Additionally, loci are less likely to be resolvable if they have many alleles or if one isolocus is monomorphic.
Although determination of allele copy number by is not needed (or
expected) for catalanAlleles
as it was in the originally
published Catalan method, it is still very important that the
genotypes be high quality. Even a single scoring error can cause the
method to fail, including allelic dropout, contamination between
samples, stutter peaks miscalled as alleles, and PCR artifacts
miscalled as alleles. Poor quality loci (those that require some
“artistic” interpretation of gels or electropherograms) are
unlikely to work with this method. Individual genotypes that are of
questionable quality should be discarded before running the function.
Lindsay V. Clark
Catalan, P., Segarra-Moragues, J. G., Palop-Esteban, M., Moreno, C. and Gonzalez-Candelas, F. (2006) A Bayesian approach for discriminating among alternative inheritance hypotheses in plant polyploids: the allotetraploid origin of genus Borderea (Dioscoreaceae). Genetics 172, 1939–1953.
alleleCorrelations
, mergeAlleleAssignments
,
recodeAllopoly
, simAllopoly
# make the default simulated allotetraploid dataset mydata <- simAllopoly() # resolve the alleles myassign <- catalanAlleles(mydata)
# make the default simulated allotetraploid dataset mydata <- simAllopoly() # resolve the alleles myassign <- catalanAlleles(mydata)
These functions remove samples or loci from all relevant slots of an object.
deleteSamples(object, samples) deleteLoci(object, loci)
deleteSamples(object, samples) deleteLoci(object, loci)
object |
An object containing the dataset of interest. Generally an object of
some subclass of |
samples |
A numerical or character vector of samples to be removed. |
loci |
A numerical or character vector of loci to be removed. |
These are generic functions with methods for genambig
,
genbinary
, and
gendata
objects. The methods for the subclasses remove samples
or loci
from the @Genotypes
slot, then pass the object to the method for
gendata
, which removes samples or loci from the @PopInfo
,
@Ploidies
, and/or @Usatnts
slots, as appropriate. The
@PopNames
slot is left untouched even if an entire population is
deleted, in order to preserve the connection between the numbers in
@PopInfo
and the names in @PopNames
.
If your intent is to experiment with excluding samples or loci, it may
be a better idea to create character vectors of samples and loci that
you want to use and then use these vectors as the samples
and
loci
arguments for analysis or export functions.
An object identical to object
, but with the specified samples or
loci removed.
These functions are somewhat redundant with the subscripting function
"["
, which also works for all gendata
objects. However,
they may be more convenient depending on whether the user prefers to
specify the samples and loci to use or to
exclude.
Lindsay V. Clark
Samples
, Loci
,
merge,gendata,gendata-method
# set up genambig object mygen <- new("genambig", samples = c("ind1", "ind2", "ind3", "ind4"), loci = c("locA", "locB", "locC", "locD")) # delete a sample Samples(mygen) mygen <- deleteSamples(mygen, "ind1") Samples(mygen) # delete some loci Loci(mygen) mygen <- deleteLoci(mygen, c("locB", "locC")) Loci(mygen)
# set up genambig object mygen <- new("genambig", samples = c("ind1", "ind2", "ind3", "ind4"), loci = c("locA", "locB", "locC", "locD")) # delete a sample Samples(mygen) mygen <- deleteSamples(mygen, "ind1") Samples(mygen) # delete some loci Loci(mygen) mygen <- deleteLoci(mygen, c("locB", "locC")) Loci(mygen)
This function uses the method of De Silva et al. (2005) to estimate allele frequencies under polysomic inheritance with a known selfing rate.
deSilvaFreq(object, self, samples = Samples(object), loci = Loci(object), initNull = 0.15, initFreq = simpleFreq(object[samples, loci]), tol = 1e-08, maxiter = 1e4)
deSilvaFreq(object, self, samples = Samples(object), loci = Loci(object), initNull = 0.15, initFreq = simpleFreq(object[samples, loci]), tol = 1e-08, maxiter = 1e4)
object |
A |
self |
A number between 1 and 0, indicating the rate of selfing. |
samples |
An optional character vector indicating a subset of samples to use in the calculation. |
loci |
An optional character vector indicating a subset of loci for which to calculate allele frequencies. |
initNull |
A single value or numeric vector indicating initial frequencies to use for the null allele at each locus. |
initFreq |
A data frame containing allele frequencies (for non-null loci) to use
for initialization. This needs to be in the same format as the output
of |
tol |
The tolerance level for determining when the results have converged.
Where |
maxiter |
The maximum number of iterations that will be performed for each locus and population. |
Most of the SAS code from the supplementary material of De Silva
et al.
(2005) is translated directly into the R code for this function. The
SIMSAMPLE (or CreateRandomSample in the SAS code) function is omitted
so that the actual allelic phenotypes from the dataset can be used
instead of simulated phenotypes. deSilvaFreq
loops through each locus and population, and in each loop tallies the
number of alleles and sets up matrices using GENLIST, PHENLIST, RANMUL, SELFMAT,
and CONVMAT as described in the paper.
Frequencies of each allelic phenotype are then tallied
across all samples in that population with non-missing data at the
locus. Initial allele
frequencies for that population and locus are then extraced from
initFreq
and adjusted according to initNull
. The EM
iteration then begins for that population and locus, as described in the
paper (EXPECTATION, GPROBS, and MAXIMISATION).
Each repetition of the EM algorithm includes an expectation and maximization step. The expectation step uses allele frequencies and the selfing rate to calculate expected genotype frequencies, then uses observed phenotype frequencies and expected genotype frequencies to estimate genotype frequencies for the population. The maximization step uses the estimated genotype frequencies to calculate a new set of allele frequencies. The process is repeated until allele frequencies converge.
In addition to returning a data frame of allele frequencies,
deSilvaFreq
also prints to the console the number of EM
repetitions used for each population and locus. When each locus and
each population is begun, a message is printed to the console so that
the user can monitor the progress of the computation.
A data frame containing the estimated allele frequencies. The row names
are population names from PopNames(object)
. The first column
shows how many genomes each population has. All other columns represent
alleles (including one null allele per locus). These column names are
the locus name and allele name separated by a period.
It is possible to exceed memory limits for R if a locus has too many
alleles in a population (e.g. 15 alleles in a tetraploid if the memory
limit is 1535 Mb, see memory.limit
).
De Silva et al. mention that their estimation method could be extended to the case of disomic inheritence. A method for disomic inheritence is not implemented here, as it would require knowledge of which alleles belong to which isoloci.
De Silva et al. also suggest a means of estimating the selfing rate with a least-squares method. Using the notation in the source code, this would be:
lsq <- smatt %*% EP - rvec
self <- as.vector((t(EP - rvec) %*% lsq)/(t(lsq) %*% lsq))
However, in my experimentation with this calculation, it sometimes yields selfing rates greater than one. For this reason, it is not implemented here.
Lindsay V. Clark
De Silva, H. N., Hall, A. J., Rikkerink, E., and Fraser, L. G. (2005) Estimation of allele frequencies in polyploids under certain patterns of inheritance. Heredity 95, 327–334
simpleFreq
, write.freq.SPAGeDi
,
GENLIST
## Not run: ## An example with a long run time due to the number of alleles # create a dataset for this example mygen <- new("genambig", samples=c(paste("A", 1:100, sep=""), paste("B", 1:100, sep="")), loci=c("loc1", "loc2")) PopNames(mygen) <- c("PopA", "PopB") PopInfo(mygen) <- c(rep(1, 100), rep(2, 100)) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 Usatnts(mygen) <- c(2, 2) Description(mygen) <- "An example for allele frequency calculation." # create some genotypes at random for this example for(s in Samples(mygen)){ Genotype(mygen, s, "loc1") <- sample(seq(120, 140, by=2), sample(1:4, 1)) } for(s in Samples(mygen)){ Genotype(mygen, s, "loc2") <- sample(seq(130, 156, by=2), sample(1:4, 1)) } # make one genotype missing Genotype(mygen, "B4", "loc2") <- Missing(mygen) # view the dataset summary(mygen) viewGenotypes(mygen) # calculate the allele frequencies if the rate of selfing is 0.2 myfrequencies <- deSilvaFreq(mygen, self=0.2) # view the results myfrequencies ## End(Not run) ## An example with a shorter run time, for checking that the funciton ## is working. Genotype simulation is also a bit more realistic here. # Create a dataset for the example. mygen <- new("genambig", samples=paste("A", 1:100, sep=""), loci="loc1") PopNames(mygen) <- "PopA" PopInfo(mygen) <- rep(1, 100) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 Usatnts(mygen) <- 2 for(s in Samples(mygen)){ alleles <- unique(sample(c(122,124,126,0), 4, replace=TRUE, prob = c(0.3, 0.2, 0.4, 0.1))) Genotype(mygen, s, "loc1") <- alleles[alleles != 0] if(length(Genotype(mygen, s, "loc1"))==0) Genotype(mygen, s, "loc1") <- Missing(mygen) } # We have created a random mating populations with four alleles # including one null. The allele frequencies are given in the # 'prob' argument. # Estimate allele frequencies myfreq <- deSilvaFreq(mygen, self=0.01) myfreq
## Not run: ## An example with a long run time due to the number of alleles # create a dataset for this example mygen <- new("genambig", samples=c(paste("A", 1:100, sep=""), paste("B", 1:100, sep="")), loci=c("loc1", "loc2")) PopNames(mygen) <- c("PopA", "PopB") PopInfo(mygen) <- c(rep(1, 100), rep(2, 100)) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 Usatnts(mygen) <- c(2, 2) Description(mygen) <- "An example for allele frequency calculation." # create some genotypes at random for this example for(s in Samples(mygen)){ Genotype(mygen, s, "loc1") <- sample(seq(120, 140, by=2), sample(1:4, 1)) } for(s in Samples(mygen)){ Genotype(mygen, s, "loc2") <- sample(seq(130, 156, by=2), sample(1:4, 1)) } # make one genotype missing Genotype(mygen, "B4", "loc2") <- Missing(mygen) # view the dataset summary(mygen) viewGenotypes(mygen) # calculate the allele frequencies if the rate of selfing is 0.2 myfrequencies <- deSilvaFreq(mygen, self=0.2) # view the results myfrequencies ## End(Not run) ## An example with a shorter run time, for checking that the funciton ## is working. Genotype simulation is also a bit more realistic here. # Create a dataset for the example. mygen <- new("genambig", samples=paste("A", 1:100, sep=""), loci="loc1") PopNames(mygen) <- "PopA" PopInfo(mygen) <- rep(1, 100) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 Usatnts(mygen) <- 2 for(s in Samples(mygen)){ alleles <- unique(sample(c(122,124,126,0), 4, replace=TRUE, prob = c(0.3, 0.2, 0.4, 0.1))) Genotype(mygen, s, "loc1") <- alleles[alleles != 0] if(length(Genotype(mygen, s, "loc1"))==0) Genotype(mygen, s, "loc1") <- Missing(mygen) } # We have created a random mating populations with four alleles # including one null. The allele frequencies are given in the # 'prob' argument. # Estimate allele frequencies myfreq <- deSilvaFreq(mygen, self=0.01) myfreq
The genotypes from an object of one of the subclasses of gendata
are converted to a data frame (if necessary), then displayed in the data
editor. After
the user makes the desired edits and closes the data editor window, the
new genotypes are written to the gendata
object and the object is
returned.
editGenotypes(object, maxalleles = max(Ploidies(object)), samples = Samples(object), loci = Loci(object))
editGenotypes(object, maxalleles = max(Ploidies(object)), samples = Samples(object), loci = Loci(object))
object |
An object of the class |
maxalleles |
Numeric. The maximum number of alleles found in any given genotype.
The method
for |
samples |
Character or numeric vector indicating which samples to edit. |
loci |
Character or numeric vector indicating which loci to edit. |
The method for genambig
lists sample and locus names in each row
in order to identify the genotypes. However, only the alleles
themselves should be edited. NA values and duplicate alleles in the
data editor will be
omitted from the genotype vectors that are written back to the
genambig
object.
An object identical to object
but with edited genotypes.
Lindsay V. Clark
viewGenotypes
, Genotype<-
,
Genotypes<-
if(interactive()){ #this line included for automated checking on CRAN # set up "genambig" object to edit mygen <- new("genambig", samples = c("a", "b", "c"), loci = c("loc1", "loc2")) Genotypes(mygen, loci="loc1") <- list(c(133, 139, 142), c(130, 136, 139, 145), c(136, 142)) Genotypes(mygen, loci="loc2") <- list(c(202, 204), Missing(mygen), c(200, 206, 208)) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 # open up the data editor mygen <- editGenotypes(mygen) # view the results of your edits viewGenotypes(mygen) }
if(interactive()){ #this line included for automated checking on CRAN # set up "genambig" object to edit mygen <- new("genambig", samples = c("a", "b", "c"), loci = c("loc1", "loc2")) Genotypes(mygen, loci="loc1") <- list(c(133, 139, 142), c(130, 136, 139, 145), c(136, 142)) Genotypes(mygen, loci="loc2") <- list(c(202, 204), Missing(mygen), c(200, 206, 208)) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 # open up the data editor mygen <- editGenotypes(mygen) # view the results of your edits viewGenotypes(mygen) }
estimatePloidy
calculates the maximum and mean number of unique
alleles for each sample across a given set of loci. These values are
presented in a data editor, along with other pertinent information, so
that the user can then edit the ploidy values for the object.
estimatePloidy(object, extrainfo, samples = Samples(object), loci = Loci(object))
estimatePloidy(object, extrainfo, samples = Samples(object), loci = Loci(object))
object |
The object containing genotype data, and to which ploidies will be written. |
extrainfo |
A named or unnamed vector or data frame containing extra information
(such as morphological or flow cytometry data) to
display in the data editor, to assist with making decisions about
ploidy. If unnamed, the vector (or the rows of the data frame) is
assumed to be in the same order as |
samples |
A numeric or character vector indicating a subset of samples to evaluate. |
loci |
A numeric or character vector indicating a subset of loci to use in the calculation of mean and maximum allele number. |
estimatePloidy
is a generic function with methods written for
the genambig
and genbinary
classes.
If the Ploidies
slot of object
is not already a
"ploidysample"
object, the function will first convert the
Ploidies
slot to this format, deleting any data that is currently
there. (Ploidies must be indexed by sample and not by locus.) If
ploidies were already in the "ploidysample"
format, any ploidy
data already in the object is retained and put into the table (see
below).
Population identities are displayed in the table only if more than one
population identity is found in the dataset. Likewise, the current
ploidies of the dataset are only displayed if there is more than one
ploidy level already found in Ploidies(object)
.
Missing genotypes are ignored; maximum and mean allele counts are only
calculted across genotypes that are not missing. If all genotypes for a
given sample are missing, NA
is displayed in the corresponding
cells in the data editor.
The default values for new.ploidy
are the maximum number of
alleles per locus for each sample.
object
is returned, with Ploidies(object)
now equal to the
values set in the new.ploidy
column of the data editor.
Lindsay V. Clark
if(interactive()){ #this line included for automated checking on CRAN # create a dataset for this example mygen <- new("genambig", samples=c("a", "b", "c"), loci=c("loc1", "loc2")) Genotypes(mygen, loci="loc1") <- list(c(122, 126, 128), c(124, 130), c(120, 122, 124)) Genotypes(mygen, loci="loc2") <- list(c(140, 148), c(144, 150), Missing(mygen)) # estimate the ploidies mygen <- estimatePloidy(mygen) # view the ploidies Ploidies(mygen) }
if(interactive()){ #this line included for automated checking on CRAN # create a dataset for this example mygen <- new("genambig", samples=c("a", "b", "c"), loci=c("loc1", "loc2")) Genotypes(mygen, loci="loc1") <- list(c(122, 126, 128), c(124, 130), c(120, 122, 124)) Genotypes(mygen, loci="loc2") <- list(c(140, 148), c(144, 150), Missing(mygen)) # estimate the ploidies mygen <- estimatePloidy(mygen) # view the ploidies Ploidies(mygen) }
For 20 Rubus samples, contains colors and symbols to use for plotting data.
data(FCRinfo)
data(FCRinfo)
Data frame. FCRinfo$Plot.color
contains character
strings of the colors to be used to represent species groups.
FCR.info$Plot.symbol
contains integers to be passed to
pch
to designate the symbol used to represent each individual.
These reflect chloroplast haplotypes.
Clark, L. V. and Jasieniuk, M. (2012) Spontaneous hybrids between native and exotic Rubus in the Western United States produce offspring both by apomixis and by sexual recombination. Heredity 109, 320–328. Data available at: doi:10.5061/dryad.m466f
This function returns a data frame listing the locus and sample names of all genotypes with missing data.
find.missing.gen(object, samples = Samples(object), loci = Loci(object))
find.missing.gen(object, samples = Samples(object), loci = Loci(object))
object |
A |
samples |
A character vector of all samples to be searched. Must be a subset of
|
loci |
A character vector of all loci to be searched. Must be a subset of
|
A data frame with no row names. The first column is named “Locus” and the second column is named “Sample”. Each row represents one missing genotype, and gives the locus and sample of that genotype.
Lindsay V. Clark
# set up the genotype data samples <- paste("ind", 1:4, sep="") samples loci <- paste("loc", 1:3, sep="") loci testgen <- new("genambig", samples = samples, loci = loci) Genotypes(testgen, loci="loc1") <- list(c(-9), c(102,104), c(100,106,108,110,114), c(102,104,106,110,112)) Genotypes(testgen, loci="loc2") <- list(c(77,79,83), c(79,85), c(-9), c(83,85,87,91)) Genotypes(testgen, loci="loc3") <- list(c(122,128), c(124,126,128,132), c(120,126), c(124,128,130)) # look up which samples*loci have missing genotypes find.missing.gen(testgen)
# set up the genotype data samples <- paste("ind", 1:4, sep="") samples loci <- paste("loc", 1:3, sep="") loci testgen <- new("genambig", samples = samples, loci = loci) Genotypes(testgen, loci="loc1") <- list(c(-9), c(102,104), c(100,106,108,110,114), c(102,104,106,110,112)) Genotypes(testgen, loci="loc2") <- list(c(77,79,83), c(79,85), c(-9), c(83,85,87,91)) Genotypes(testgen, loci="loc3") <- list(c(122,128), c(124,126,128,132), c(120,126), c(124,128,130)) # look up which samples*loci have missing genotypes find.missing.gen(testgen)
Given a data frame of allele frequencies such as that produced by
simpleFreq
or deSilvaFreq
, freq.to.genpop
creates a
data frame of allele counts that can be read by the as.genpop
function in the package adegenet.
freq.to.genpop(freqs, pops = row.names(freqs), loci = unique(as.matrix(as.data.frame(strsplit(names(freqs), split = ".", fixed = TRUE), stringsAsFactors = FALSE))[1, ]))
freq.to.genpop(freqs, pops = row.names(freqs), loci = unique(as.matrix(as.data.frame(strsplit(names(freqs), split = ".", fixed = TRUE), stringsAsFactors = FALSE))[1, ]))
freqs |
A data frame of allele frequencies. Row names are population names.
The first column is called |
pops |
An optional character vector indicating the names of populations to use. |
loci |
An optional character vector indicating the names of loci to use. |
adegenet expects one ploidy for the entire dataset. Therefore, data
frames of allele frequencies with multiple “Genomes” columns,
such as those produced when ploidy varies by locus, are not allowed as
the freqs
argument.
A data frame with row and column names identical to those in
freqs
, minus the "Genomes"
column and any columns for loci
not included in loci
. Allele frequencies are converted to counts
by multiplying by the values in the "Genomes"
column and rounding
to the nearest integer.
Lindsay V. Clark
Jombart, T. (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24, 1403-1405.
simpleFreq
, deSilvaFreq
,
write.freq.SPAGeDi
, gendata.to.genind
# create a simple allele frequency table # (usually done with simpleFreq or deSilvaFreq) myfreq <- data.frame(row.names=c("popA","popB"), Genomes=c(120,100), locG.152=c(0.1,0.4), locG.156=c(0.5, 0.3), locG.160=c(0.4, 0.3), locK.179=c(0.15, 0.25), locK.181=c(0.35, 0.6), locK.183=c(0.5, 0.15)) myfreq # convert to adegenet format gpfreq <- freq.to.genpop(myfreq) gpfreq ## Not run: # If you have adegenet installed, you can now make this into a # genpop object. require(adegenet) mygenpop <- genpop(gpfreq, ploidy=as.integer(4), type="codom") # examine the object that has been created mygenpop popNames(mygenpop) mygenpop@tab [email protected] # Perform a distance calculation with the object dist.genpop(mygenpop) ## End(Not run)
# create a simple allele frequency table # (usually done with simpleFreq or deSilvaFreq) myfreq <- data.frame(row.names=c("popA","popB"), Genomes=c(120,100), locG.152=c(0.1,0.4), locG.156=c(0.5, 0.3), locG.160=c(0.4, 0.3), locK.179=c(0.15, 0.25), locK.181=c(0.35, 0.6), locK.183=c(0.5, 0.15)) myfreq # convert to adegenet format gpfreq <- freq.to.genpop(myfreq) gpfreq ## Not run: # If you have adegenet installed, you can now make this into a # genpop object. require(adegenet) mygenpop <- genpop(gpfreq, ploidy=as.integer(4), type="codom") # examine the object that has been created mygenpop popNames(mygenpop) mygenpop@tab mygenpop@all.names # Perform a distance calculation with the object dist.genpop(mygenpop) ## End(Not run)
Objects of this class store microsatellite datasets in which allele copy
number is ambiguous. Genotypes are stored as a two-dimensional list of
vectors, each vector containing all unique alleles for a given sample at
a given locus. genambig
is a subclass of gendata
.
Objects can be created by calls of the form new("genambig",
samples, loci, ...)
. This automatically sets up a two-dimensional
list in the Genotypes
slot, with dimnames=list(samples,
loci)
. This array-list is initially populated with the missing data
symbol. All other slots are given initial values according to the
initialize
method for gendata
. Data can then be inserted
into the slots using the replacement functions (see
Accessors
).
Genotypes
:Object of class "array"
. The first
dimension of the array represents and is named by samples, while
the second dimension represents and is named by loci. Each
element of the array can contain a vector. Each vector should
contain each unique allele for the genotype once. If an array
element contains a vector of length 1 containing only the symbol
that is in the Missing
slot, this indicates missing data
for that sample and locus.
Description
:Object of class "character"
. This
stores a description of the dataset for the user's convenience.
Missing
:Object of class "ANY"
. A symbol to be
used to indicate missing data in the Genotypes
slot. This is
the integer -9
by default.
Usatnts
:Object of class "integer"
. A vector,
named by loci. Each element indicates the repeat type of the
locus. 2
indicates dinucleotide repeats, 3
indicates trinucleotide repeats, and so on. If the alleles stored
in the Genotypes
slot for a given locus are already written
in terms of repeat number, the Usatnts
value for that locus
should be 1
. In other words, all alleles for a locus can be
divided by the number in Usatnts
to give alleles expressed in
terms of relative repeat number.
Ploidies
:Object of class "integer"
. A vector,
named by samples. This stores the ploidy of each sample. NA
indicates unknown ploidy. See Ploidies<-
and
estimatePloidy
for ways to fill this slot.
PopInfo
:Object of class "integer"
. A vector,
named by samples, containing the population identity of each sample.
PopNames
:Object of class "character"
. A
vector containing names for all populations. The position of a
population name in the vector indicates the integer used to
represent that population in PopInfo
.
Class "gendata"
, directly.
For more information on any of these methods, see the help files of their respective generic functions.
signature(object = "genambig")
: Removes
columns in the array in the Genotypes
slot corresponding to
the locus names supplied, then passes the arguments to the method
for gendata
.
signature(object = "genambig")
: Removes
rows in the array in the Genotypes
slot corresponding to
the sample names supplied, then passes the arguments to the method
for gendata
.
signature(object = "genambig")
: Each
vector in the Genotypes
slot is placed into the row of a
data frame, along with the sample and locus name for this vector.
The data frame is then opened in the Data Editor so that the user
can make changes. When the Data Editor window is closed, vectors
are extracted back out of the data frame and written to the
Genotypes
slot.
signature(object = "genambig")
:
Calculates the length of each genotype vector (excluding those
with the missing data symbol), and creates a data frame showing
the maximum and mean number of alleles per locus for each sample.
This data frame is then opened in the Data Editor, where the user
may edit ploidy levels. Once the Data Editor is closed, the
genambig
object is returned with the new values written to
the Ploidies
slot.
signature(object = "genambig")
: Retrieves a
single genotype vector, as specified by sample
and
locus
arguments.
signature(object = "genambig")
: Replaces a
single genotype vector.
signature(object = "genambig")
: Retrieves a
two-dimensional list of genotype vectors.
signature(object = "genambig")
: Replaces a
one- or two-dimensional list of genotype vectors.
signature(.Object = "genambig")
: When
new
is called to create a new genambig
object, the
initialize
method sets up a two dimensional list in the
Genotypes
slot indexed by sample and locus, and fills this
list with the missing data symbol. The initialize
method
for gendata
is then called.
signature(object = "genambig")
: Given a set of
samples and loci, each position in the array in the Genotypes
slot is checked to see if it matches the missing data value. A single
Boolean value or an array of Boolean values is returned.
signature(object = "genambig")
: For changing
the names of loci. The names are changed in the second dimension
of the array in the Genotypes
slot, and then the
Loci<-
method for gendata
is called.
signature(object = "genambig")
: For changing
the missing data symbol. All elements of the Genotypes
array
that match the current missing data symbol are changed to the new
missing data symbol. The Missing<-
method for gendata
is then called.
signature(object = "genambig")
: For changing
the names of samples. The names are changed in the first dimension
of the array in the Genotypes
slot, and then the
Samples<-
method for gendata
is called.
signature(object = "genambig")
: Prints the
dataset description (Description
slot) to the console as
well as the number of missing genotypes, then calls the
summary
method for gendata
.
signature(object = "genambig")
: Prints the
data to the console, formatted to make it more legible. The
genotype for each locus is shown as the size of each allele,
separated by a '/'. The SSR motif length ('Usatnts'), ploidies,
population names, and population membership ('PopInfo') are
displayed if they exist.
signature(object = "genambig")
: Prints a
tab-delimited table of samples, loci, and genotype vectors to the console.
signature(x = "genambig", i = "ANY", j = "ANY")
:
For subscipting genambig
objects. Should be of the form
mygenambig[mysamples, myloci]
. Returns a genambig
object. The Genotypes
slot
is replaced by one containing only samples i
and loci
j
. Likewise, the PopInfo
and Ploidies
slots
are truncated to contain only samples i
, and the
Usatnts
slot is truncated to contain only loci
j
. Other slots are left unaltered.
signature(x = "genambig", y = "genambig")
:
Merges two genotypes objects together. See
merge,genambig,genambig-method
.
Lindsay V. Clark, Tyler W. Smith
gendata
, Accessors
,
merge,genambig,genambig-method
# display class definition showClass("genambig") # create a genambig object mygen <- new("genambig", samples=c("a", "b", "c", "d"), loci=c("L1", "L2", "L3")) # add some genotypes Genotypes(mygen)[,"L1"] <- list(c(133, 139, 145), c(142, 154), c(130, 142, 148), Missing(mygen)) Genotypes(mygen, loci="L2") <- list(c(105, 109, 113), c(111, 117), c(103, 115), c(105, 109, 113)) Genotypes(mygen, loci="L3") <- list(c(254, 258), Missing(mygen), c(246, 250, 262), c(250, 258)) # see a summary of the object summary(mygen) # display some of the genotypes viewGenotypes(mygen[c("a", "b", "c"),])
# display class definition showClass("genambig") # create a genambig object mygen <- new("genambig", samples=c("a", "b", "c", "d"), loci=c("L1", "L2", "L3")) # add some genotypes Genotypes(mygen)[,"L1"] <- list(c(133, 139, 145), c(142, 154), c(130, 142, 148), Missing(mygen)) Genotypes(mygen, loci="L2") <- list(c(105, 109, 113), c(111, 117), c(103, 115), c(105, 109, 113)) Genotypes(mygen, loci="L3") <- list(c(254, 258), Missing(mygen), c(246, 250, 262), c(250, 258)) # see a summary of the object summary(mygen) # display some of the genotypes viewGenotypes(mygen[c("a", "b", "c"),])
These functions convert back and forth between the genambig
and
genbinary
classes.
genambig.to.genbinary(object, samples = Samples(object), loci = Loci(object)) genbinary.to.genambig(object, samples = Samples(object), loci = Loci(object))
genambig.to.genbinary(object, samples = Samples(object), loci = Loci(object)) genbinary.to.genambig(object, samples = Samples(object), loci = Loci(object))
object |
The object containing the genetic dataset. A |
samples |
An optional character vector indicating samples to include in the new object. |
loci |
An optional character vector indicating loci to include in the new object. |
The slots Description
, Ploidies
, Usatnts
,
PopNames
, and PopInfo
are transferred as-is from the old
object to the new. The value in the
Genotypes
slot is converted from one format to the other, with
preservation of allele names.
For genambig.to.genbinary
: a genbinary
object containing
all of the data from object
. Missing
, Present
,
and Absent
are set at their default values.
For genbinary.to.genambig
: a genambig
object containing
all of the data from object
. Missing
is at the default
value.
Lindsay V. Clark
# set up a genambig object for this example mygen <- new("genambig", samples = c("A", "B", "C", "D"), loci = c("locJ", "locK")) PopNames(mygen) <- c("PopQ", "PopR") PopInfo(mygen) <- c(1,1,2,2) Usatnts(mygen) <- c(2,2) Genotypes(mygen, loci="locJ") <- list(c(178, 184, 186), c(174,186), c(182, 188, 190), c(182, 184, 188)) Genotypes(mygen, loci="locK") <- list(c(133, 135, 141), c(131, 135, 137, 143), Missing(mygen), c(133, 137)) # convert it to a genbinary object mygenB <- genambig.to.genbinary(mygen) # check the results viewGenotypes(mygenB) viewGenotypes(mygen) PopInfo(mygenB) # convert back to a genambig object mygenA <- genbinary.to.genambig(mygenB) viewGenotypes(mygenA) # note: identical(mygen, mygenA) returns FALSE, because the alleles # origninally input are not stored as integers, while the alleles # produced by genbinary.to.genambig are integers.
# set up a genambig object for this example mygen <- new("genambig", samples = c("A", "B", "C", "D"), loci = c("locJ", "locK")) PopNames(mygen) <- c("PopQ", "PopR") PopInfo(mygen) <- c(1,1,2,2) Usatnts(mygen) <- c(2,2) Genotypes(mygen, loci="locJ") <- list(c(178, 184, 186), c(174,186), c(182, 188, 190), c(182, 184, 188)) Genotypes(mygen, loci="locK") <- list(c(133, 135, 141), c(131, 135, 137, 143), Missing(mygen), c(133, 137)) # convert it to a genbinary object mygenB <- genambig.to.genbinary(mygen) # check the results viewGenotypes(mygenB) viewGenotypes(mygen) PopInfo(mygenB) # convert back to a genambig object mygenA <- genbinary.to.genambig(mygenB) viewGenotypes(mygenA) # note: identical(mygen, mygenA) returns FALSE, because the alleles # origninally input are not stored as integers, while the alleles # produced by genbinary.to.genambig are integers.
This is a subclass of gendata
that allows genotypes to be stored
as a matrix indicating the presence and absence of alleles.
Objects can be created by calls of the form new("genbinary",
samples, loci, ...)
. After objects are initialized with sample and
locus names, data can be added to slots using the replacement functions.
Genotypes
:Object of class "matrix"
. Row names
of the matrix are sample names. Each column name is a locus name
and an allele separated by a period (e.g.
"loc1.124"
); each column represents on allele. The number of
alleles per locus is not limited and can be expanded even after
entering initial data. Each element of the matrix must be equal to
either Present(object)
, Absent(object)
, or
Missing(object)
. These symbols indicate, respectively, that
a sample has an allele, that a sample does not have an allele, or
that data for the sample at that locus are missing.
Present
:Object of class "ANY"
. The integer
1
by default. This symbol is used in the Genotypes
slot to indicate the presence of an allele in a sample.
Absent
:Object of class "ANY"
. The integer
0
by default. This symbol is used in the Genotypes
slot to indicate the absence of an allele in a sample.
Description
:Object of class "character"
. A
character string or vector describing the dataset, for the
convenience of the user.
Missing
:Object of class "ANY"
. The integer
-9
by default. This symbol is used in the Genotypes
slot to indicate that data are missing for a given sample and locus.
Usatnts
:Object of class "integer"
. A vector,
named by loci. This indicates the repeat length of each locus.
2
indicates dinucleotide repeats, 3
indicates trinucleotide repeats, and so on. If the alleles stored
in the column names of the Genotypes
slot for a given locus
are already written
in terms of repeat number, the Usatnts
value for that locus
should be 1
. In other words, all alleles for a locus can be
divided by the number in Usatnts
to give alleles expressed in
terms of relative repeat number.
Ploidies
:Object of class "integer"
. A vector,
named by samples. This indicates the ploidy of each sample.
PopInfo
:Object of class "integer"
. A vector,
named by samples. This indicates the population identity of each
sample.
PopNames
:Object of class "character"
. Names
of each population. The position of the population name in the
vector corresponds to the number used to represent that population
in the PopInfo
slot.
Class "gendata"
, directly.
signature(object = "genbinary")
: Returns the
symbol used to indicate that a given allele is absent in a given
sample.
signature(object = "genbinary")
: Changes the
symbol used to indicate that a given allele is absent in a given
sample. The matrix in the Genotypes
slot is searched for
the old symbol, which is replaced by the new. The new symbol is
then written to the Absent
slot.
signature(object = "genbinary")
: Returns a
matrix containing the genotype for a given sample and locus (by a
call to Genotypes
).
signature(object = "genbinary")
: Returns the
matrix stored in the Genotypes
slot, or a subset as specified
by the samples
and loci
arguments.
signature(object = "genbinary")
: A method
for adding or replacing genotype data in the object. Note that
allele columns cannot be removed from the matrix in the
Genotypes
slot using this method, although an entire column
could be filled with zeros in order to effectively remove an
allele from the dataset. If the order of rows in value
(the matrix containing values to be assigned to the
Genotypes
slot) is not identical to Samples(object)
,
the samples
argument should be used to indicate row order.
Row names in value
are ignored. The loci
argument can
be left at the default, even if only a subset of loci are being
assigned. Column names of value
are important, and should be
the locus name and allele name separated by a period, as they are in
the Genotypes
slot. After checking that the column name is
valid, the method checks for whether the column name already exists or
not in the Genotypes
slot. If it does exist, data from that
column are replaced with data from value
. If not, a column is
added to the matrix in the Genotypes
slot for the new allele. If
the column is new and data are not being written for samples, the method
automatically fills in Missing
or Absent
symbols for
additional samples, depending on whether or not data for the locus
appear to be missing for the sample or not.
signature(.Object = "genbinary")
: Sets up a
genbinary
object when new("genbinary")
is called. If
samples
or loci
arguments are missing, these are
filled in with dummy values ("ind1", "ind2", "loc1",
"loc2"
). The matrix is then set up in the Genotypes
slot. Sample names are used for row names, and there are zero
columns. The initialize
method for gendata
is then
called.
signature(object = "genbinary")
: Replaces
all elements in matrix in the Genotypes
slot containing the
old Missing
symbol with the new Missing
symbol. The
method for gendata
is then called to replace the value in the
Missing
slot.
signature(object = "genbinary")
: Returns the
symbol used to indicate that a given allele is present in a given
sample.
signature(object = "genbinary")
: Changes the
symbol used for indicating that a given allele is present in a given
sample. The symbol is first replaced in the Genotypes
slot,
and then in the Present
slot.
signature(object = "genbinary")
: Changes sample
names in the dataset. Changes the row names in the Genotypes
slot, then calls the method for gendata
to change the names in
the PopInfo
and Ploidies
slots.
signature(object = "genbinary")
: Changes locus
names in the dataset. Replaces the locus portion of the column names
in the Genotypes
slot, then calls the method for gendata
to change the names in the Usatnts
slot.
signature(object = "genbinary")
: Returns Boolean
values, by sample and locus, indicating whether genotypes are missing.
If there are any missing data symbols within the genotype, it is
considered missing.
signature(object = "genbinary")
: Prints
description of dataset and number of missing genotypes, then calls the
method for gendata
to print additional information.
signature(object = "genbinary")
: Opens the
genotype matrix in the Data Editor for editing. Useful for making
minor changes, although allele columns cannot be added using this method.
signature(object = "genbinary")
: Prints the
genotype matrix to the console, one locus at a time.
signature(object = "genbinary")
: Removes the
specified samples from the genotypes matrix, then calls the method for
gendata
.
signature(object = "genbinary")
: Removes the
specified loci from the genotypes matrix, then calls the method for
gendata
.
signature(x = "genbinary", i = "ANY", j = "ANY")
:
Subscripting method. Returns a genbinary
object with a subset
of the samples and/or loci from x
. Usage:
genobject[samples,loci]
.
signature(object = "genbinary")
: Creates a
data frame of mean and maximum number of alleles per sample, which is
then opened in the Data Editor so that the user can manually specify
the ploidy of each sample. Ploidies are then written to the
Ploidies
slot of the object.
signature(x = "genbinary", y = "genbinary")
: Merges
two genotype objects together. See
merge,genbinary,genbinary-method
.
Lindsay V. Clark
# show the class definition showClass("genbinary") # create a genbinary object mygen <- new("genbinary", samples = c("indA", "indB", "indC", "indD"), loci = c("loc1", "loc2")) Description(mygen) <- "Example genbinary object for the documentation." Usatnts(mygen) <- c(2,3) PopNames(mygen) <- c("Maine", "Indiana") PopInfo(mygen) <- c(1,1,2,2) Genotypes(mygen) <- matrix(c(1,1,0,0, 1,0,0,1, 0,0,1,1, 1,-9,1,0, 0,-9,0,1, 1,-9,0,1, 0,-9,1,1), nrow=4, ncol=7, dimnames = list(NULL, c("loc1.140", "loc1.144", "loc1.150", "loc2.97", "loc2.100", "loc2.106", "loc2.109"))) # view all of the data in the object mygen
# show the class definition showClass("genbinary") # create a genbinary object mygen <- new("genbinary", samples = c("indA", "indB", "indC", "indD"), loci = c("loc1", "loc2")) Description(mygen) <- "Example genbinary object for the documentation." Usatnts(mygen) <- c(2,3) PopNames(mygen) <- c("Maine", "Indiana") PopInfo(mygen) <- c(1,1,2,2) Genotypes(mygen) <- matrix(c(1,1,0,0, 1,0,0,1, 0,0,1,1, 1,-9,1,0, 0,-9,0,1, 1,-9,0,1, 0,-9,1,1), nrow=4, ncol=7, dimnames = list(NULL, c("loc1.140", "loc1.144", "loc1.150", "loc2.97", "loc2.100", "loc2.106", "loc2.109"))) # view all of the data in the object mygen
This is a superclass for other classes that contain population
genetic datasets. It has slots for population information, ploidy,
microsatellite repeat lengths, and a missing data symbol, but does not
have a slot to store genotypes. Sample and locus names are stored as
the names
of vectors in the slots.
Objects can be created by calls of the form new("gendata", samples, loci, ...)
.
The missing data symbol will be set to -9
by default. The
default initial value for PopNames
is a
character vector of length 0, and for Description
is the string
"Insert dataset description here"
. The default initial value for
the Ploidies
slot is a "ploidymatrix"
object, containing a matrix filled with NA
and named by samples
in the first dimension and loci in the second dimension. For
other slots, vectors filled with NA
will be generated and will be
named by samples (for PopInfo
) or loci (for
Usatnts
). The slots can then be edited
using the methods described below.
Note that in most cases you will want to instead create an
object from one of gendata
's subclasses, such as genambig
.
Description
:Object of class "character"
. One
or more character strings to name or describe the dataset.
Missing
:Object of class "ANY"
. A value to
indicate missing data in the genotypes of the dataset. -9
by default.
Usatnts
:Object of class "integer"
. This
vector must be named by locus names. Each element should be the
length of the microsatellite repeat for that locus, given in nucleotides.
For example, 2
would indicate a locus with dinucleotide
repeats, and 3
would indicate a locus with trinucleotide
repeats. 1
should be used for mononucleotide repeats OR if
alleles for that locus are already expressed in terms of repeat
number rather than nucleotides. To put it another way, if you
divided the number used to represent an allele by the corresponding
number in Usatnts
(and rounded if necessary), the result would
be the number of repeats (plus some additional length for flanking regions).
Ploidies
:Object of class "ploidysuper"
. This
object will contain a pld
slot that is a matrix named by
samples and loci, a vector named by samples or loci, or a single
value, depending on the subclass. Each element is an integer that
represents ploidy. NA
indicates unknown ploidy.
PopInfo
:Object of class "integer"
. This
vector also must be named by sample names. Each element
represents the number of the population to which each sample belongs.
PopNames
:Object of class "character"
. An
unnamed vector containing the name of each population. If a
number from PopInfo
is used to index PopNames
, it
should find the correct population name. For example, if the
first element of PopNames
is "ABC"
, then any samples
with 1
as their PopInfo
value belong to population
"ABC"
.
signature(object = "gendata")
: Permanently
remove loci from the dataset. This removes elements from Usatnts
.
signature(object = "gendata")
:
Permanently remove samples from the dataset. This removes
elements from PopInfo
and Ploidies
.
signature(object = "gendata")
: Returns the
character vector in the Description
slot.
signature(object = "gendata")
: Assigns a
new value to the character vector in the Description
slot.
signature(.Object = "gendata")
: This is
called when the new("gendata")
function is used. A new
gendata
object is created with sample and locus names used
to index the appropriate slots.
signature(object = "gendata", usatnts =
"missing", ploidies="missing")
: Returns a character vector
containing all locus
names for the object. The method accomplishes this by returning
names(object@Usatnts)
.
signature(object = "gendata", usatnts =
"numeric", ploidies = "missing")
: Returns a character vector
of all loci for a given
set of repeat lengths. For example, if usatnts = 2
all
loci with dinucleotide repeats will be returned.
signature(object= "gendata", usatnts = "missing",
ploidies = "numeric")
: Returns a character vector of all loci for a
given set of ploidies. Only works if object@Ploidies
is a
"ploidylocus"
object.
signature(object = "gendata", usatnts = "numeric",
ploidies = "numeric")
: Returns a character vector of all loci that have
one of the indicated repeat types and one of the indicated ploidies.
Only works if object@Ploidies
is a "ploidylocus"
object.
signature(object = "gendata")
: Assigns new
names to loci in the dataset (changes names(object@Usants)
.
Should not be used for adding or removing loci.
signature(object = "gendata")
: Returns the
missing data symbol from object@Missing
.
signature(object = "gendata")
: Assigns a new
value to object@Missing
(changes the missing data symbol).
signature(object = "gendata", samples = "ANY",
loci = "ANY")
: Returns the
ploidies in the dataset (object@Ploidies
), indexed by
sample and locus if applicable..
signature(object = "gendata")
: Assigns new
values to ploidies of samples in the dataset. The assigned values
are coerced to integers by the method. Names in the assigned
vector or matrix are ignored; sample and/or locus names already
present in the gendata
object are used instead.
signature(object = "gendata")
: Returns the
population numbers of samples in the dataset (object@PopInfo
).
signature(object = "gendata")
: Assigns new
population numbers to samples in the dataset. The assigned values
are coerced to integers by the method. Names in the assigned
vector are ignored; sample names already present in the
gendata
object are used instead.
signature(object = "gendata")
: Returns a
character vector of population names (object@PopNames
).
signature(object = "gendata")
: Assigns new
names to populations.
signature(object = "gendata",
popname="character")
: Returns the number corresponding to a
population name.
signature(object = "gendata", popname =
"character")
: Changes the population number for a given
population name, merging it with an existing population of that
number if applicable.
signature(object = "gendata", populations =
"character", ploidies = "missing")
: Returns all sample names
for a given set of population names.
signature(object = "gendata", populations =
"character", ploidies = "numeric")
: Returns all sample names
for a given set of population names and ploidies. Only samples
that fit both criteria will be returned.
signature(object = "gendata", populations =
"missing", ploidies = "missing")
: Returns all sample names.
signature(object = "gendata", populations =
"missing", ploidies = "numeric")
: Returns all sample names for
a given set of ploidies. Only works if object@Ploidies
is a
"ploidysample"
object.
signature(object = "gendata", populations =
"numeric", ploidies = "missing")
: Returns all sample names for
a given set of population numbers.
signature(object = "gendata", populations =
"numeric", ploidies = "numeric")
: Returns all sample names for
a given set of population numbers and ploidies. Only samples that
fit both criteria will be returned. Only works if
object@Ploidies
is a "ploidysample"
object.
signature(object = "gendata")
: Assigns new
names to samples. This edits both names(object@PopInfo)
and
names(object@Ploidies)
. It should not be used for adding
or removing samples from the dataset.
signature(object = "gendata")
: Prints some
informaton to the console, including the numbers of samples, loci,
and populations, the ploidies present, and the types of
microsatellite repeats present.
signature(object = "gendata")
: Returns
microsatellite repeat lengths for loci in the dataset
(object@Usatnts
).
signature(object = "gendata")
: Assigns new
values to microsatellite repeat lengths of loci
(object@Usatnts
). The assigned values
are coerced to integers by the method. Names in the assigned
vector are ignored; locus names already present in the
gendata
object are used instead.
signature(x = "gendata", i = "ANY", j = "ANY")
:
Subscripts the data by a subset of samples and/or loci. Should be
used in the format mygendata[mysamples, myloci]
. Returns a
gendata
object with PopInfo
, Ploidies
, and
Usatnts
truncated to only contain the samples and loci
listed in i
and j
, respectively.
Description
, Missing
, and PopNames
are left
unaltered.
signature(x = "gendata", y = "gendata")
: Merges
two genotype objects. See merge,gendata,gendata-method
.
Lindsay V. Clark
genambig
, genbinary
,
Accessors
# show class definition showClass("gendata") # create an object of the class gendata # (in reality you would want to create an object belonging to one of the # subclasses, but the procedure is the same) mygen <- new("gendata", samples = c("a", "b", "c"), loci = c("loc1", "loc2")) Description(mygen) <- "An example for the documentation" Usatnts(mygen) <- c(2,3) PopNames(mygen) <- c("PopV", "PopX") PopInfo(mygen) <- c(2,1,2) Ploidies(mygen) <- c(2,2,4,2,2,2) # view a summary of the object summary(mygen)
# show class definition showClass("gendata") # create an object of the class gendata # (in reality you would want to create an object belonging to one of the # subclasses, but the procedure is the same) mygen <- new("gendata", samples = c("a", "b", "c"), loci = c("loc1", "loc2")) Description(mygen) <- "An example for the documentation" Usatnts(mygen) <- c(2,3) PopNames(mygen) <- c("PopV", "PopX") PopInfo(mygen) <- c(2,1,2) Ploidies(mygen) <- c(2,2,4,2,2,2) # view a summary of the object summary(mygen)
This is a function for exporting data to the package adegenet.
gendata.to.genind(object, samples = Samples(object), loci = Loci(object))
gendata.to.genind(object, samples = Samples(object), loci = Loci(object))
object |
A |
samples |
A character vector indicating the samples to include in the output. |
loci |
A character vector indicating the loci to include in the output. |
gendata.to.genind
converts a "genambig"
or
"genbinary"
object to a "genind"
object using the package
adegenet. Each individual must have a single ploidy.
Ploidy and population information are carried over to the new
object. Data will be coded as presence/absence in the new object. The
locus names in the new object are locus and allele names
seperated by a hyphen.
adegenet must be installed in order to use this function.
A genetic dataset in the "genind"
class, ready for use in
adegenet.
Lindsay V. Clark
http://adegenet.r-forge.r-project.org/
Jombart, T. (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24, 1403–1405.
# create a "genambig" object mydata <- new("genambig", samples=c("a","b","c","d"), loci=c("e","f")) PopNames(mydata) <- c("G","H") PopInfo(mydata) <- c(1,1,2,2) mydata <- reformatPloidies(mydata, output="one") Ploidies(mydata) <- 3 Genotypes(mydata, loci="e") <- list(c(100),c(100,102), c(98,102,104),c(102,106)) Genotypes(mydata, loci="f") <- list(c(200,202,204),Missing(mydata), c(210,212),c(204,210,214)) # convert to "genind"; not tested as it takes several seconds to load adegenet if(require("adegenet")){ mydata2 <- gendata.to.genind(mydata) mydata2@tab locNames(mydata2) indNames(mydata2) popNames(mydata2) pop(mydata2) }
# create a "genambig" object mydata <- new("genambig", samples=c("a","b","c","d"), loci=c("e","f")) PopNames(mydata) <- c("G","H") PopInfo(mydata) <- c(1,1,2,2) mydata <- reformatPloidies(mydata, output="one") Ploidies(mydata) <- 3 Genotypes(mydata, loci="e") <- list(c(100),c(100,102), c(98,102,104),c(102,106)) Genotypes(mydata, loci="f") <- list(c(200,202,204),Missing(mydata), c(210,212),c(204,210,214)) # convert to "genind"; not tested as it takes several seconds to load adegenet if(require("adegenet")){ mydata2 <- gendata.to.genind(mydata) mydata2@tab locNames(mydata2) indNames(mydata2) popNames(mydata2) pop(mydata2) }
This function will return all unique genotypes for a given locus (ignoring allele
order, but taking copy number into account) and return those genotypes as well
as an index indicating which genotype(s) each individual has. This is a generic function
with methods for "genambig"
objects and for arrays. The array
method is primarily intended for internal use with meandistance.matrix2
,
processing the output of genotypeProbs
.
genIndex(object, locus)
genIndex(object, locus)
object |
Typically, a |
locus |
A character string or integer indicating which locus to process. |
A list with two elements:
uniquegen |
A list, where each element in the list is a vector indicating a unique genotype that was found. |
genindex |
For |
Lindsay V. Clark
meandistance.matrix
uses the "genambig"
method internally.
data(simgen) genIndex(simgen, 1)
data(simgen) genIndex(simgen, 1)
genotypeDiversity
calculates diversity statistics based on
genotype frequencies, using a distance matrix to assign individuals to
genotypes. The Shannon
and Simpson
functions are also
available to calculate these statistics directly from a vector
of frequencies.
genotypeDiversity(genobject, samples = Samples(genobject), loci = Loci(genobject), d = meandistance.matrix(genobject, samples, loci, all.distances = TRUE, distmetric = Lynch.distance), threshold = 0, index = Shannon, ...) Shannon(p, base = exp(1)) Simpson(p) Simpson.var(p)
genotypeDiversity(genobject, samples = Samples(genobject), loci = Loci(genobject), d = meandistance.matrix(genobject, samples, loci, all.distances = TRUE, distmetric = Lynch.distance), threshold = 0, index = Shannon, ...) Shannon(p, base = exp(1)) Simpson(p) Simpson.var(p)
genobject |
An object of the class |
samples |
An optional character vector indicating a subset of samples to analyze. |
loci |
An optional character vector indicating a subset of loci to analyze. |
d |
A list such as that produced by |
threshold |
The maximum genetic distance between two samples that can be considered to be the same genotype. |
index |
The diversity index to calculate. This should be |
... |
Additional arguments to pass to |
p |
A vector of counts. |
base |
The base of the logarithm for calculating the Shannon index. This is
|
genotypeDiversity
runs assignClones
on distance
matrices for individual loci and then for all loci, for each seperate
population. The results of assignClones
are used to
calculate a vector of genotype frequencies, which is passed to
index
.
Shannon
calculates the Shannon index, which is:
(or log base 2 or any other base, using the base
argument) given
a vector of genotype counts, where
is the sum of those counts.
Simpson
calculates the Simpson index, which is:
Simpson.var
calculates the variance of the Simpson index:
The variance of the Simpson index can be used to calculate a confidence
interval, for example the results of Simpson
plus or minus twice
the square root of the results of Simpson.var
would be the 95%
confidence interval.
A matrix of diversity index results, with populations in rows and
loci in columns. The final column is called "overall"
and gives
the results when all loci are analyzed together.
Lindsay V. Clark
Shannon, C. E. (1948) A mathematical theory of communication. Bell System Technical Journal 27:379–423 and 623–656.
Simpson, E. H. (1949) Measurement of diversity. Nature 163:688.
Lowe, A., Harris, S. and Ashton, P. (2004) Ecological Genetics: Design, Analysis, and Application. Wiley-Blackwell.
Arnaud-Haond, S., Duarte, M., Alberto, F. and Serrao, E. A. (2007) Standardizing methods to address clonality in population studies. Molecular Ecology 16:5115–5139.
http://www.comparingpartitions.info/index.php?link=Tut4
# set up dataset mydata <- new("genambig", samples=c("a","b","c"), loci=c("F","G")) Genotypes(mydata, loci="F") <- list(c(115,118,124),c(115,118,124), c(121,124)) Genotypes(mydata, loci="G") <- list(c(162,170,174),c(170,172), c(166,180,182)) Usatnts(mydata) <- c(3,2) # get genetic distances mydist <- meandistance.matrix(mydata, all.distances=TRUE) # calculate diversity under various conditions genotypeDiversity(mydata, d=mydist) genotypeDiversity(mydata, d=mydist, base=2) genotypeDiversity(mydata, d=mydist, threshold=0.3) genotypeDiversity(mydata, d=mydist, index=Simpson) genotypeDiversity(mydata, d=mydist, index=Simpson.var)
# set up dataset mydata <- new("genambig", samples=c("a","b","c"), loci=c("F","G")) Genotypes(mydata, loci="F") <- list(c(115,118,124),c(115,118,124), c(121,124)) Genotypes(mydata, loci="G") <- list(c(162,170,174),c(170,172), c(166,180,182)) Usatnts(mydata) <- c(3,2) # get genetic distances mydist <- meandistance.matrix(mydata, all.distances=TRUE) # calculate diversity under various conditions genotypeDiversity(mydata, d=mydist) genotypeDiversity(mydata, d=mydist, base=2) genotypeDiversity(mydata, d=mydist, threshold=0.3) genotypeDiversity(mydata, d=mydist, index=Simpson) genotypeDiversity(mydata, d=mydist, index=Simpson.var)
Given an ambiguous genotype and either a data frame of allele
frequencies or a vector of genotype probabilities,
genotypeProbs
calculates all possible unambiguous genotypes and
their probabilities of being the true genotype.
genotypeProbs(object, sample, locus, freq = NULL, gprob = NULL, alleles = NULL)
genotypeProbs(object, sample, locus, freq = NULL, gprob = NULL, alleles = NULL)
object |
An object of class |
sample |
Number or character string indicating the sample to evaluate. |
locus |
Character string indicating the locus to evaluate. |
freq |
A data frame of allele frequencies, such as that produced by
|
gprob |
A vector of genotype probabilities based on allele frequencies and
selfing rate. This is generated by |
alleles |
An integer vector of all alleles. This argument should only be used
if |
This function is primarily designed to be called by
meandistance.matrix2
, in order to calculate distances between all
possible unambiguous genotypes. Ordinary users won't use
genotypeProbs
unless they are designing a new analysis.
The genotype analyzed is Genotype(object, sample, locus)
. If
the genotype is unambiguous (fully heterozygous or homozygous), a single
unambiguous genotype is returned with a probability of one.
If the genotype is ambiguous (partially heterozygous), a recursive algorithm is used to generate all possible unambiguous genotypes (all possible duplications of alleles in the genotype, up to the ploidy of the individual.)
If the freq
argument is supplied:
The probability of each unambiguous genotype is then calculated from the allele frequencies of the individual's population, under the assumption of random mating. Allele frequencies are normalized so that the frequencies of the alleles in the ambiguous genotype sum to one; this converts each frequency to the probability of the allele being present in more than one copy. The product of these probabilities is multiplied by the appropriate polynomial coefficient to calculate the probability of the unambiguous genotype.
where p is the probability of the unambiguous genotype, n is the number of alleles in the ambiguous genotype, f is the normalized frequency of each allele, c is the number of duplicated copies (total number of copies minus one) of the allele in the unambiguous genotype, and k is the ploidy of the individual.
If the gprob
and alleles
arguments are supplied:
The probabilities of all possible genotypes in the population have
already been calculated, based on allele frequencies and selfing rate.
This is done in meandistance.matrix2
using code from De Silva
et al. (2005).
Probabilities for the genotypes of interest (those that the ambiguous
genotype could represent) are normalized to sum to 1, in order to give
the conditional probabilities of the possible genotypes.
probs |
A vector containing the probabilities of each unambiguous genotype. |
genotypes |
A matrix. Each row represents one genotype, and the number of columns is equal to the ploidy of the individual. Each element of the matrix is an allele. |
Lindsay V. Clark
De Silva, H. N., Hall, A. J., Rikkerink, E., and Fraser, L. G. (2005) Estimation of allele frequencies in polyploids under certain patterns of inheritance. Heredity 95, 327–334
# get a data set and define ploidies data(testgenotypes) Ploidies(testgenotypes) <- c(8,8,8,4,8,8,rep(4,14)) # get allele frequencies tfreq <- simpleFreq(testgenotypes) # see results of genotypeProbs under different circumstances Genotype(testgenotypes, "FCR7", "RhCBA15") genotypeProbs(testgenotypes, "FCR7", "RhCBA15", tfreq) Genotype(testgenotypes, "FCR10", "RhCBA15") genotypeProbs(testgenotypes, "FCR10", "RhCBA15", tfreq) Genotype(testgenotypes, "FCR1", "RhCBA15") genotypeProbs(testgenotypes, "FCR1", "RhCBA15", tfreq) Genotype(testgenotypes, "FCR2", "RhCBA23") genotypeProbs(testgenotypes, "FCR2", "RhCBA23", tfreq) Genotype(testgenotypes, "FCR3", "RhCBA23") genotypeProbs(testgenotypes, "FCR3", "RhCBA23", tfreq)
# get a data set and define ploidies data(testgenotypes) Ploidies(testgenotypes) <- c(8,8,8,4,8,8,rep(4,14)) # get allele frequencies tfreq <- simpleFreq(testgenotypes) # see results of genotypeProbs under different circumstances Genotype(testgenotypes, "FCR7", "RhCBA15") genotypeProbs(testgenotypes, "FCR7", "RhCBA15", tfreq) Genotype(testgenotypes, "FCR10", "RhCBA15") genotypeProbs(testgenotypes, "FCR10", "RhCBA15", tfreq) Genotype(testgenotypes, "FCR1", "RhCBA15") genotypeProbs(testgenotypes, "FCR1", "RhCBA15", tfreq) Genotype(testgenotypes, "FCR2", "RhCBA23") genotypeProbs(testgenotypes, "FCR2", "RhCBA23", tfreq) Genotype(testgenotypes, "FCR3", "RhCBA23") genotypeProbs(testgenotypes, "FCR3", "RhCBA23", tfreq)
The internal functions G
, INDEXG
, GENLIST
,
RANMUL
, and SELFMAT
are used for calculating
genotype probabilities under partial selfing. The internal function
.unal1loc
finds all unique alleles at a single locus. The internal
function fixloci
converts locus names to a format that will be
compatible as column headers for allele frequency tables.
G(q, n) INDEXG(ag1, na1, m2) GENLIST(ng, na1, m2) RANMUL(ng, na1, ag, m2) SELFMAT(ng, na1, ag, m2) .unal1loc(object, samples, locus) fixloci(loci, warn = TRUE)
G(q, n) INDEXG(ag1, na1, m2) GENLIST(ng, na1, m2) RANMUL(ng, na1, ag, m2) SELFMAT(ng, na1, ag, m2) .unal1loc(object, samples, locus) fixloci(loci, warn = TRUE)
q |
Integer. |
n |
Integer. |
ag1 |
A vector representing an unambiguous genotype. |
na1 |
Integer. The number of alleles, including a null. |
m2 |
Integer. The ploidy. |
ng |
Integer. The number of genotypes. |
ag |
An array of genotypes such as that produced by
|
object |
A |
samples |
Optional, a numeric or character vector indicating which samples to use. |
locus |
A character string or number indicating which locus to use. |
loci |
A character vector of locus names. |
warn |
Boolean indicating whether a warning should be issued if locus names are changed. |
G
returns
INDEXG
returns an integer indicating the row containing a
particular genotype in the matrix produced by GENLIST
.
GENLIST
returns an array with dimensions ng, m2
,
containing all possible unambiguous genotypes, one in each row. The null
allele is the highest-numbered allele.
RANMUL
returns a list. The first item is a vector of
polynomial coefficients for calculating genotype frequencies under
random mating. The second is an array
showing how many copies of each allele each genotype has.
SELFMAT
returns the selfing matrix. Parental genotypes are
represented in rows, and offspring genotypes in columns. The numbers
indicate relative amounts of offspring genotypes produced when the
parental genotypes are self-fertilized.
.unal1loc
returns a vector containing all unique alleles, not
including Missing(object)
.
fixloci
returns a character vector of corrected locus names.
Lindsay V. Clark
De Silva, H. N., Hall, A. J., Rikkerink, E., and Fraser, L. G. (2005) Estimation of allele frequencies in polyploids under certain patterns of inheritance. Heredity 95, 327–334
deSilvaFreq
, meandistance.matrix2
,
genotypeProbs
, genambig.to.genbinary
,
alleleDiversity
# Calculation of genotype probabilities in a tetraploid with four # alleles plus a null, and a selfing rate of 0.5. This is a translation # of code in the supplementary material of De Silva et al. (2005). m2 <- 4 m <- m2/2 na1 <- 5 self <- 0.5 ng <- na1 for(j in 2:m2){ ng <- ng*(na1+j-1)/j } ag <- polysat:::GENLIST(ng, na1, m2) temp <- polysat:::RANMUL(ng, na1, ag, m2) rmul <- temp[[1]] arep <- temp[[2]] rm(temp) smat <- polysat:::SELFMAT(ng, na1, ag, m2) smatdiv <- (polysat:::G(m-1,m+1))^2 p1 <- c(0.1, 0.4, 0.2, 0.2, 0.1) # allele frequencies # GPROBS subroutine rvec <- rep(0,ng) for(g in 1:ng){ rvec[g] <- rmul[g] for(j in 1:m2){ rvec[g] <- rvec[g]*p1[ag[g,j]] } } id <- diag(nrow=ng) smatt <- smat/smatdiv s3 <- id - self * smatt s3inv <- solve(s3) gprob <- (1-self) * s3inv %*% rvec # gprob is a vector of probabilities of the seventy genotypes.
# Calculation of genotype probabilities in a tetraploid with four # alleles plus a null, and a selfing rate of 0.5. This is a translation # of code in the supplementary material of De Silva et al. (2005). m2 <- 4 m <- m2/2 na1 <- 5 self <- 0.5 ng <- na1 for(j in 2:m2){ ng <- ng*(na1+j-1)/j } ag <- polysat:::GENLIST(ng, na1, m2) temp <- polysat:::RANMUL(ng, na1, ag, m2) rmul <- temp[[1]] arep <- temp[[2]] rm(temp) smat <- polysat:::SELFMAT(ng, na1, ag, m2) smatdiv <- (polysat:::G(m-1,m+1))^2 p1 <- c(0.1, 0.4, 0.2, 0.2, 0.1) # allele frequencies # GPROBS subroutine rvec <- rep(0,ng) for(g in 1:ng){ rvec[g] <- rmul[g] for(j in 1:m2){ rvec[g] <- rvec[g]*p1[ag[g,j]] } } id <- diag(nrow=ng) smatt <- smat/smatdiv s3 <- id - self * smatt s3inv <- solve(s3) gprob <- (1-self) * s3inv %*% rvec # gprob is a vector of probabilities of the seventy genotypes.
isMissing
returns Boolean values indicating whether the genotypes
for a given set of samples and loci are missing from the dataset.
isMissing(object, samples = Samples(object), loci = Loci(object))
isMissing(object, samples = Samples(object), loci = Loci(object))
object |
An object of one of the subclasses of |
samples |
A character or numeric vector indicating samples to be tested. |
loci |
A character or numeric vector indicating loci to be tested. |
isMissing
is a generic function with methods for genambig
and genbinary
objects.
For each genotype in a genambig
object, the function evaluates and returns
Genotype(object, sample, locus)[1] == Missing(object)
. For a
genbinary
object, TRUE %in% (Genotype(object, sample,
locus) == Missing(object))
is returned for the genotype. If only
one sample and locus are being evaluated, this is the Boolean value that
is returned. If multiple samples and/or loci are being evaluated, the
function creates an array of Boolean values and recursively calls itself
to fill in the result for each element of the array.
If both samples
and loci
are of length 1, a single
Boolean value is returned, TRUE
if the genotype is missing, and
FALSE
if it isn't. Otherwise, the function returns a named
array with samples in the first dimension and loci in the second
dimension, filled with Boolean values indicating whether the genotype
for each sample*locus combination is missing.
Lindsay V. Clark
Missing
, Missing<-
, Genotype
,
find.missing.gen
# set up a genambig object for this example mygen <- new("genambig", samples=c("a", "b"), loci=c("locD", "locE")) Genotypes(mygen) <- array(list(c(122, 126), c(124, 128, 134), Missing(mygen), c(156, 159)), dim=c(2,2)) viewGenotypes(mygen) # test if some individual genotypes are missing isMissing(mygen, "a", "locD") isMissing(mygen, "a", "locE") # test an array of genotypes isMissing(mygen, Samples(mygen), Loci(mygen))
# set up a genambig object for this example mygen <- new("genambig", samples=c("a", "b"), loci=c("locD", "locE")) Genotypes(mygen) <- array(list(c(122, 126), c(124, 128, 134), Missing(mygen), c(156, 159)), dim=c(2,2)) viewGenotypes(mygen) # test if some individual genotypes are missing isMissing(mygen, "a", "locD") isMissing(mygen, "a", "locE") # test an array of genotypes isMissing(mygen, Samples(mygen), Loci(mygen))
Given two genotypes in the form of vectors of unique alleles, a dissimilarity is calculated as: 1 - (number of alleles in common)/(average number of alleles per genotype).
Lynch.distance(genotype1, genotype2, usatnt = NA, missing = -9)
Lynch.distance(genotype1, genotype2, usatnt = NA, missing = -9)
genotype1 |
A vector containing all alleles for a particular sample and locus. Each allele is only present once in the vector. |
genotype2 |
A vector of the same form as |
usatnt |
The microsatellite repeat length for this locus (ignored by the function). |
missing |
The symbol used to indicate missing data in either genotype vector. |
Lynch (1990) defines a simple measure of similarity between DNA
fingerprints. This is 2 times the number of bands that two fingerprints
have in common, divided by the total number of bands that the two genotypes
have. Lynch.distance
returns a dissimilarity, which is 1 minus
the similarity.
If the first element of either or both input genotypes is equal to
missing
, NA is returned.
Otherwise, a numerical value is returned. This is one minus the similarity. The similarity is calculated as the number of alleles that the two genotypes have in common divided by the mean length of the two genotypes.
Lindsay V. Clark
Lynch, M. (1990) The similarity index and DNA fingerprinting. Molecular Biology and Evolution 7, 478-484.
Bruvo.distance
,
meandistance.matrix
Lynch.distance(c(100,102,104), c(100,104,108)) Lynch.distance(-9, c(102,104,110)) Lynch.distance(c(100), c(100,104,106))
Lynch.distance(c(100,102,104), c(100,104,108)) Lynch.distance(-9, c(102,104,110)) Lynch.distance(c(100), c(100,104,106))
meandist.from.array
produces a mean distance matrix from an array of
pairwise distances by locus, such as that produced by
meandistance.matrix
when all.distances=TRUE
. find.na.dist
finds missing distances in such an array, and
find.na.dist.not.missing
finds missing distances that aren't the
result of missing genotypes.
meandist.from.array(distarray, samples = dimnames(distarray)[[2]], loci = dimnames(distarray)[[1]]) find.na.dist(distarray, samples = dimnames(distarray)[[2]], loci = dimnames(distarray)[[1]]) find.na.dist.not.missing(object, distarray, samples = dimnames(distarray)[[2]], loci = dimnames(distarray)[[1]])
meandist.from.array(distarray, samples = dimnames(distarray)[[2]], loci = dimnames(distarray)[[1]]) find.na.dist(distarray, samples = dimnames(distarray)[[2]], loci = dimnames(distarray)[[1]]) find.na.dist.not.missing(object, distarray, samples = dimnames(distarray)[[2]], loci = dimnames(distarray)[[1]])
distarray |
A three-dimensional array of pairwise distances between samples, by
locus. Loci are represented in the first dimension, and samples are
represented in the second and third dimensions. Dimensions are named
accordingly. Such an array is the first element of the list produced by
|
samples |
Character vector. Samples to analyze. |
loci |
Character vector. Loci to analyze. |
object |
A |
find.na.dist.not.missing
is primarily intended to locate distances
that were not calculated by Bruvo.distance
because both genotypes
had too many alleles (more than maxl
). The user may wish to
estimate these distances manually and fill them into the array, then
recalculate the mean matrix using meandist.from.array
.
meandist.from.array
returns a matrix, with both rows and columns
named by samples, of distances averaged across loci.
find.na.dist
and find.na.dist.not.missing
both return data
frames with three columns: Locus, Sample1, and Sample2. Each row
represents the index in the array of an element containing NA.
Lindsay V. Clark
meandistance.matrix
, Bruvo.distance
,
find.missing.gen
# set up the genotype data samples <- paste("ind", 1:4, sep="") samples loci <- paste("loc", 1:3, sep="") loci testgen <- new("genambig", samples=samples, loci=loci) Genotypes(testgen, loci="loc1") <- list(c(-9), c(102,104), c(100,106,108,110,114), c(102,104,106,110,112)) Genotypes(testgen, loci="loc2") <- list(c(77,79,83), c(79,85), c(-9), c(83,85,87,91)) Genotypes(testgen, loci="loc3") <- list(c(122,128), c(124,126,128,132), c(120,126), c(124,128,130)) Usatnts(testgen) <- c(2,2,2) # look up which samples*loci have missing genotypes find.missing.gen(testgen) # get the three-dimensional distance array and the mean of the array gendist <- meandistance.matrix(testgen, distmetric=Bruvo.distance, maxl=4, all.distances=TRUE) # look at the distances for loc1, where there is missing data and long genotypes gendist[[1]]["loc1",,] # look up all missing distances in the array find.na.dist(gendist[[1]]) # look up just the missing distances that don't result from missing genotypes find.na.dist.not.missing(testgen, gendist[[1]]) # Copy the array to edit the new copy newDistArray <- gendist[[1]] # calculate the distances that were NA from genotype lengths exceeding maxl # (in reality, if this were too computationally intensive you might estimate # it manually instead) subDist <- Bruvo.distance(c(100,106,108,110,114), c(102,104,106,110,112)) subDist # insert this distance into the correct positions newDistArray["loc1","ind3","ind4"] <- subDist newDistArray["loc1","ind4","ind3"] <- subDist # calculate the new mean distance matrix newMeanMatrix <- meandist.from.array(newDistArray) # look at the difference between this matrix and the original. newMeanMatrix gendist[[2]]
# set up the genotype data samples <- paste("ind", 1:4, sep="") samples loci <- paste("loc", 1:3, sep="") loci testgen <- new("genambig", samples=samples, loci=loci) Genotypes(testgen, loci="loc1") <- list(c(-9), c(102,104), c(100,106,108,110,114), c(102,104,106,110,112)) Genotypes(testgen, loci="loc2") <- list(c(77,79,83), c(79,85), c(-9), c(83,85,87,91)) Genotypes(testgen, loci="loc3") <- list(c(122,128), c(124,126,128,132), c(120,126), c(124,128,130)) Usatnts(testgen) <- c(2,2,2) # look up which samples*loci have missing genotypes find.missing.gen(testgen) # get the three-dimensional distance array and the mean of the array gendist <- meandistance.matrix(testgen, distmetric=Bruvo.distance, maxl=4, all.distances=TRUE) # look at the distances for loc1, where there is missing data and long genotypes gendist[[1]]["loc1",,] # look up all missing distances in the array find.na.dist(gendist[[1]]) # look up just the missing distances that don't result from missing genotypes find.na.dist.not.missing(testgen, gendist[[1]]) # Copy the array to edit the new copy newDistArray <- gendist[[1]] # calculate the distances that were NA from genotype lengths exceeding maxl # (in reality, if this were too computationally intensive you might estimate # it manually instead) subDist <- Bruvo.distance(c(100,106,108,110,114), c(102,104,106,110,112)) subDist # insert this distance into the correct positions newDistArray["loc1","ind3","ind4"] <- subDist newDistArray["loc1","ind4","ind3"] <- subDist # calculate the new mean distance matrix newMeanMatrix <- meandist.from.array(newDistArray) # look at the difference between this matrix and the original. newMeanMatrix gendist[[2]]
Given a genambig
object, meandistance.matrix
produces a symmetrical matrix of
pairwise distances between samples, averaged across all loci. An
array of all distances prior to averaging may also be produced.
meandistance.matrix(object, samples = Samples(object), loci = Loci(object), all.distances=FALSE, distmetric = Bruvo.distance, progress = TRUE, ...) meandistance.matrix2(object, samples = Samples(object), loci = Loci(object), freq = simpleFreq(object, samples, loci), self = 0, all.distances = FALSE, distmetric = Bruvo.distance, progress = TRUE, ...)
meandistance.matrix(object, samples = Samples(object), loci = Loci(object), all.distances=FALSE, distmetric = Bruvo.distance, progress = TRUE, ...) meandistance.matrix2(object, samples = Samples(object), loci = Loci(object), freq = simpleFreq(object, samples, loci), self = 0, all.distances = FALSE, distmetric = Bruvo.distance, progress = TRUE, ...)
object |
A |
samples |
A character vector of samples to be analyzed. These
should be all or a subset of the sample names used in |
loci |
A character vector of loci to be analyzed. These should
be all or a subset of the loci names used in |
freq |
A data frame of allele frequencies such as that produced
by |
self |
A number ranging from 0 to 1, indicating the rate of selfing. |
all.distances |
If |
distmetric |
The function to be used to calculate distances
between genotypes. |
progress |
If |
... |
Additional arguments (such as |
Each distance for the three-dimensional array is calculated only once, to save computation time. Since the array (and resulting mean matrix) is symmetrical, the distance is written to two positions in the array at once.
meandistance.matrix
uses ambiguous genotypes exactly as they
are, whereas meandistance.matrix2
uses
genotypeProbs
to calculate all possible unambiguous
genotypes and their probabilities under random mating or partial
selfing. The distance
between each possible pair of unambiguous genotypes for the two
samples is calculated with distmetric
and weighted by the
product of the probabilities of the two gentoypes. As you might
expect, meandistance.matrix2
takes longer to process a given
"genambig"
object than meandistance.matrix
does.
Additionally, the distance between two identical ambiguous genotypes
will be zero when calculated with meandistance.matrix
, and
greater than zero when calculated with meandistance.matrix2
,
due to potential differences in copy number of the alleles.
When Bruvo.distance
is used,
meandistance.matrix2
exaggerates distances between individuals
of different ploidy as compared to meandistance.matrix
. The
use of Bruvo2.distance
with meandistance.matrix2
allows individuals with different ploidies to have similar
inter-individual distances to those between individuals of the same ploidy.
In general, it will be desirable to use Bruvo.distance
with
meandistance.matrix
for complex datasets with high ploidy
levels, or Bruvo.distance2
with meandistance.matrix2
for
hexaploid or lower datasets (based on how long it takes my personal
computer to perform these calculations) where changes in ploidy are
due to genome doubling or genome loss. If all individuals have the
same ploidy, Bruvo.distance
and Bruvo2.distance
will
give identical results regardless of whether
meandistance.matrix
or meandistance.matrix2
is used.
meandistance.matrix2
does not allow a genotype to have more
alleles than the ploidy of the individual (as listed in the
Ploidies
slot). Additionally, if self
is greater than
zero, each population may only have one ploidy at each locus.
A symmetrical matrix containing pairwise distances between all
samples, averaged across all loci. Row and column names of the matrix
will be the sample names provided in the samples
argument. If
all.distances=TRUE
, a list will be produced containing the above
matrix as well as a three-dimensional array containing all distances
by locus and sample. The array is the first item in the list, and the
mean matrix is the second.
Lindsay V. Clark
Bruvo.distance
, Bruvo2.distance
,
Lynch.distance
, meandist.from.array
,
GENLIST
# create a list of genotype data mygendata <- new("genambig", samples = c("ind1","ind2","ind3","ind4"), loci = c("locus1","locus2","locus3","locus4")) Genotypes(mygendata) <- array(list(c(124,128,138),c(122,130,140,142),c(122,132,136),c(122,134,140), c(203,212,218),c(197,206,221),c(215),c(200,218), c(140,144,148,150),c(-9),c(146,150),c(152,154,158), c(233,236,280),c(-9),c(-9),c(-9))) Usatnts(mygendata) <- c(2,3,2,1) # make index vectors of data to use myloci <- c("locus1","locus2","locus3") mysamples <- c("ind1","ind2","ind4") # calculate array and matrix mymat <- meandistance.matrix(mygendata, mysamples, myloci, all.distances=TRUE) # view the results mymat[[1]]["locus1",,] mymat[[1]]["locus2",,] mymat[[1]]["locus3",,] mymat[[2]] # add addtional info needed for meandistance.matrix2 mygendata <- reformatPloidies(mygendata, output="one") Ploidies(mygendata) <- 4 PopInfo(mygendata) <- c(1,1,1,1) # calculate distances taking allele freqs into account mymat2 <- meandistance.matrix2(mygendata, mysamples, myloci) mymat2 # now do the same under selfing mymat3 <- meandistance.matrix2(mygendata, mysamples, myloci, self=0.3) mymat3
# create a list of genotype data mygendata <- new("genambig", samples = c("ind1","ind2","ind3","ind4"), loci = c("locus1","locus2","locus3","locus4")) Genotypes(mygendata) <- array(list(c(124,128,138),c(122,130,140,142),c(122,132,136),c(122,134,140), c(203,212,218),c(197,206,221),c(215),c(200,218), c(140,144,148,150),c(-9),c(146,150),c(152,154,158), c(233,236,280),c(-9),c(-9),c(-9))) Usatnts(mygendata) <- c(2,3,2,1) # make index vectors of data to use myloci <- c("locus1","locus2","locus3") mysamples <- c("ind1","ind2","ind4") # calculate array and matrix mymat <- meandistance.matrix(mygendata, mysamples, myloci, all.distances=TRUE) # view the results mymat[[1]]["locus1",,] mymat[[1]]["locus2",,] mymat[[1]]["locus3",,] mymat[[2]] # add addtional info needed for meandistance.matrix2 mygendata <- reformatPloidies(mygendata, output="one") Ploidies(mygendata) <- 4 PopInfo(mygendata) <- c(1,1,1,1) # calculate distances taking allele freqs into account mymat2 <- meandistance.matrix2(mygendata, mysamples, myloci) mymat2 # now do the same under selfing mymat3 <- meandistance.matrix2(mygendata, mysamples, myloci, self=0.3) mymat3
The generic function merge
has methods defined in polysat to
merge two genotype objects of the same class. Each method has optional
samples
and loci
arguments for specifying subsets of
samples and loci to be included in the merged object. Each method also
has an optional overwrite
argument to specify which of the two
objects should not be used in the case of conflicting data.
merge(x, y, ...)
merge(x, y, ...)
x |
One of the objects to be merged. For the methods defined for
polysat this should be of class |
y |
The other object to be merged. Should be of the same class as
|
... |
Additional arguments specific to the method. |
The methods for merge
in polysat have four additional
arguments: objectm, samples, loci, overwrite
.
The samples
and loci
arguments can specify, using
character vectors, a subset of the samples and loci found in
x
and y
to write to the object that is returned.
If overwrite = "x"
, data from the second object will be used
wherever there is contradicting data. Likewise if overwrite =
"y"
, data from the first object will be used wherever there is
contradicting data. If no overwrite
argument is given, then
any contradicting data between the two objects will produce an error
indicating where the contradicting data were found.
The objectm
argument is primarily for internal use (most
users will not need it). If this argument is not
provided, a new genotype object is created and data from x
and y
are written to it. If objectm
is
provided, this is the object to which data will be written, and the
object that will be returned.
signature(x = "genambig", y = "genambig")
This method merges the genotype data from x
and y
. If the
missing data symbols differ between the objects, overwrite
is
used to determine which missing data symbol to use, and all missing data
symbols in the overwritten object are converted. If overwrite
is
not provided and the missing data symbols differ between the objects, an
error will be given. The genotypes are then filled in. If certain
sample*locus combinations do not exist in either object (x
and
y
have different samples as well as different loci), missing data
symbols are left in these positions. Again, for genotypes,
overwrite
determines which object to preferentially use for data
and whether to give an error if there is a disagreement.
The merge
method for gendata
is then called.
signature(x = "genbinary", y = "genbinary")
This method also merges genotype data for x
and y
, then
calls the method for gendata
. Missing
, Present
,
and Absent
are checked for consistency between objects similarly
to what happens with Missing
in the genambig
method.
Genotypes are then written to the merged object, and consistency between
genotypes is checked.
signature(x = "gendata", y = "gendata")
This method merges data about ploidy, repeat length, and population identity, as well as writing one or both dataset descriptions to the merged object.
The same population numbers can have different meanings in
PopInfo(x)
and PopInfo(y)
. The unique PopNames
are
used instead to determine population identity, and the PopInfo
numbers are changed if necessary. Therefore, it is important for
identical populations to be named the same way in both objects, but not
important for identical populations to have the same number in both objects.
In cases where multiple populations are used separately to assign
alleles to homeologous loci, mergeAlleleAssignments
is used to
consolidate the results.
mergeAlleleAssignments(x)
mergeAlleleAssignments(x)
x |
A list, where each element is in the format of results produced by
|
A list in similar format to x
, but with only one element per locus.
Lindsay V. Clark
testAlGroups
, catalanAlleles
,
recodeAllopoly
# List of allele assignment results for this example; normally these # would be produced by other functions evaluating a genetic dataset. # The example below is for an allotetraploid. # Locus L1 is a well-behaved locus with no homoplasy. myresults <- list(list(locus="L1", SGploidy=2, assignments=matrix(c(1,0,0,1,1,0,0,1,0,1), nrow=2, ncol=5, dimnames=list(NULL, c("124","128","130","134","138")))), list(locus="L1", SGploidy=2, assignments=matrix(c(0,1,1,0,1,0,0,1), nrow=2, ncol=4, dimnames=list(NULL, c("124","128","132","140")))), list(locus="L1", SGploidy=2, assignments=matrix(c(0,1,1,0,0,1,1,0), nrow=2, ncol=4, dimnames=list(NULL, c("126","128","130","132")))), # Locus L2 is unresolvable because there are no shared alleles between # populations. list(locus="L2", SGploidy=2, assignments=matrix(c(1,0,1,0,0,1), nrow=2, ncol=3, dimnames=list(NULL, c("205","210","225")))), list(locus="L2", SGploidy=2, assignments=matrix(c(1,0,0,1,0,1,1,0), nrow=2, ncol=4, dimnames=list(NULL, c("195","215","220","230")))), # Locus L3 has homoplasy that makes it unresolvable. list(locus="L3", SGploidy=2, assignments=matrix(c(1,0,0,1,0,1,1,0), nrow=2, ncol=4, dimnames=list(NULL, c("153","159","168","171")))), list(locus="L3", SGploidy=2, assignments=matrix(c(1,0,0,1,1,0,0,1), nrow=2, ncol=4, dimnames=list(NULL, c("153","156","165","171")))), # Locus L4 has homoplasy, but the results can still be merged. list(locus="L4", SGploidy=2, assignments=matrix(c(1,0,1,0,0,1,0,1,0,1), nrow=2, ncol=5, dimnames=list(NULL, c("242","246","254","260","264")))), list(locus="L4", SGploidy=2, assignments=matrix(c(1,0,0,1,1,0,0,1,0,1), nrow=2, ncol=5, dimnames=list(NULL, c("242","246","250","254","260")))) ) myresults # merge within loci mergedresults <- mergeAlleleAssignments(myresults) mergedresults
# List of allele assignment results for this example; normally these # would be produced by other functions evaluating a genetic dataset. # The example below is for an allotetraploid. # Locus L1 is a well-behaved locus with no homoplasy. myresults <- list(list(locus="L1", SGploidy=2, assignments=matrix(c(1,0,0,1,1,0,0,1,0,1), nrow=2, ncol=5, dimnames=list(NULL, c("124","128","130","134","138")))), list(locus="L1", SGploidy=2, assignments=matrix(c(0,1,1,0,1,0,0,1), nrow=2, ncol=4, dimnames=list(NULL, c("124","128","132","140")))), list(locus="L1", SGploidy=2, assignments=matrix(c(0,1,1,0,0,1,1,0), nrow=2, ncol=4, dimnames=list(NULL, c("126","128","130","132")))), # Locus L2 is unresolvable because there are no shared alleles between # populations. list(locus="L2", SGploidy=2, assignments=matrix(c(1,0,1,0,0,1), nrow=2, ncol=3, dimnames=list(NULL, c("205","210","225")))), list(locus="L2", SGploidy=2, assignments=matrix(c(1,0,0,1,0,1,1,0), nrow=2, ncol=4, dimnames=list(NULL, c("195","215","220","230")))), # Locus L3 has homoplasy that makes it unresolvable. list(locus="L3", SGploidy=2, assignments=matrix(c(1,0,0,1,0,1,1,0), nrow=2, ncol=4, dimnames=list(NULL, c("153","159","168","171")))), list(locus="L3", SGploidy=2, assignments=matrix(c(1,0,0,1,1,0,0,1), nrow=2, ncol=4, dimnames=list(NULL, c("153","156","165","171")))), # Locus L4 has homoplasy, but the results can still be merged. list(locus="L4", SGploidy=2, assignments=matrix(c(1,0,1,0,0,1,0,1,0,1), nrow=2, ncol=5, dimnames=list(NULL, c("242","246","254","260","264")))), list(locus="L4", SGploidy=2, assignments=matrix(c(1,0,0,1,1,0,0,1,0,1), nrow=2, ncol=5, dimnames=list(NULL, c("242","246","250","254","260")))) ) myresults # merge within loci mergedresults <- mergeAlleleAssignments(myresults) mergedresults
Given a set of allele frequencies, this function estimates the Polymorphic Information Content (PIC) for each locus, within and/or across populations.
PIC(freqs, pops = row.names(freqs), loci = unique(as.matrix(as.data.frame( strsplit(names(freqs), split = ".", fixed = TRUE), stringsAsFactors = FALSE))[1, ]), bypop = TRUE, overall = TRUE)
PIC(freqs, pops = row.names(freqs), loci = unique(as.matrix(as.data.frame( strsplit(names(freqs), split = ".", fixed = TRUE), stringsAsFactors = FALSE))[1, ]), bypop = TRUE, overall = TRUE)
freqs |
A data frame of allele frequencies, such as those output by |
pops |
An optional characer vector containing names of populations to include. |
loci |
An optional character vector containing names of loci to include. |
bypop |
If |
overall |
If |
PIC is estimated as:
according to Botstein et al. (1980), where and
are allele frequencies at alleles
i and j, respectively, and
is the number of alleles.
The higher this value is, the more useful a marker is for distinguishing individuals and understanding relationships among them.
A matrix, with loci in columns, and populations and/or “Overall” in rows. Each element of the matrix contains a PIC value.
Lindsay V. Clark
Botstein, M., White, R. L., Skolnick, M. and Davis, R. W. (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. American Journal of Human Genetics 32, 314–331.
# generate allele frequencies for this example myfreq <- data.frame(row.names = c("pop1", "pop2"), Genomes = c(20,30), loc1.124 = c(0.1, 0.25), loc1.126 = c(0.2, 0), loc1.128 = c(0.05, 0.4), loc1.130 = c(0.3, 0.1), loc1.132 = c(0.1, 0.1), loc1.134 = c(0.25, 0.15), loc2.150 = c(0.4, 0.5), loc2.155 = c(0.3, 0.2), loc2.160 = c(0.3, 0.3)) # estimate PIC PIC(myfreq)
# generate allele frequencies for this example myfreq <- data.frame(row.names = c("pop1", "pop2"), Genomes = c(20,30), loc1.124 = c(0.1, 0.25), loc1.126 = c(0.2, 0), loc1.128 = c(0.05, 0.4), loc1.130 = c(0.3, 0.1), loc1.132 = c(0.1, 0.1), loc1.134 = c(0.25, 0.15), loc2.150 = c(0.4, 0.5), loc2.155 = c(0.3, 0.2), loc2.160 = c(0.3, 0.3)) # estimate PIC PIC(myfreq)
"ploidysuper"
Objects
pld
accesses and replaces the pld
slot of objects of
"ploidysuper"
subclasses. plCollapse
tests
whether an object of one of these classes can be converted to an object
of a simpler one of these classes, and optionally returns the converted
object. These are generic functions with methods for the subclasses of
"ploidysuper"
. These functions are primarily for internal use.
pld(object, samples, loci) pld(object) <- value plCollapse(object, na.rm, returnvalue)
pld(object, samples, loci) pld(object) <- value plCollapse(object, na.rm, returnvalue)
object |
A |
samples |
An optional character or numeric vector indexing the samples for which to return ploidy values. |
loci |
An optional character or numeric vector indexing the loci for which to return ploidy values. |
value |
A numeric vector or matrix that can be coerced to integers. These
represent the ploidies to store in the |
na.rm |
Boolean. If |
returnvalue |
Boolean. If |
pld
returns the vector or matrix containing the ploidy values.
This is the contents of object@pld
.
plCollapse
either returns a Boolean value indicating whether the
ploidy can be changed to a simpler format, or a new "ploidysuper"
object with all of the ploidy data of object
put into a simpler
format. If object
is a "ploidymatrix"
object, a
"ploidysample"
, "ploidylocus"
, or "ploidyone"
object can be returned depending on how many unique ploidy values there
are and how they are distributed. If object
is a
"ploidysample"
or "ploidylocus"
object, a
"ploidyone"
object can be returned.
Lindsay V. Clark
test <- new("ploidymatrix", samples=c("a","b","c"), loci=c("l1","l2","l3")) pld(test) # view the ploidies pld(test) <- 2 # make it diploid at all samples and loci pld(test)["a",] <- c(2,4,4) # change the ploidies for sample a pld(test, samples=c("a","b")) # view ploidies at a subset of samples # test to see if the ploidies can be simplified p <- plCollapse(test, na.rm=FALSE, returnvalue=TRUE) p # now change a ploidy and repeat the test pld(test)["a","l1"] <- 4 p <- plCollapse(test, na.rm=FALSE, returnvalue=TRUE) p # change something else and collapse it further pld(p)["a"] <- 2 p2 <- plCollapse(p, na.rm=FALSE, returnvalue=TRUE) p2 # if na.rm=FALSE, NA values are not ignored: pld(test)["a","l1"] <- NA pld(test) plCollapse(test, na.rm=FALSE, returnvalue=TRUE) # NA values are ignored with na.rm=TRUE plCollapse(test, na.rm=TRUE, returnvalue=TRUE)
test <- new("ploidymatrix", samples=c("a","b","c"), loci=c("l1","l2","l3")) pld(test) # view the ploidies pld(test) <- 2 # make it diploid at all samples and loci pld(test)["a",] <- c(2,4,4) # change the ploidies for sample a pld(test, samples=c("a","b")) # view ploidies at a subset of samples # test to see if the ploidies can be simplified p <- plCollapse(test, na.rm=FALSE, returnvalue=TRUE) p # now change a ploidy and repeat the test pld(test)["a","l1"] <- 4 p <- plCollapse(test, na.rm=FALSE, returnvalue=TRUE) p # change something else and collapse it further pld(p)["a"] <- 2 p2 <- plCollapse(p, na.rm=FALSE, returnvalue=TRUE) p2 # if na.rm=FALSE, NA values are not ignored: pld(test)["a","l1"] <- NA pld(test) plCollapse(test, na.rm=FALSE, returnvalue=TRUE) # NA values are ignored with na.rm=TRUE plCollapse(test, na.rm=TRUE, returnvalue=TRUE)
"ploidysuper"
and SubclassesThese classes contain ploidy data indexed by sample, locus, both, or
neither. They are intended to go in the Ploidies
slot of
"gendata"
objects.
"ploidysuper"
is a virtual
class: No objects may be created from it.
Objects of the subclasses "ploidymatrix"
, "ploidysample"
,
"ploidylocus"
, and "ploidyone"
can be created with the
call new(ploidyclass, samples, loci, ...)
, where
ploidyclass
is a character string of one of the class names, and
samples
and loci
are character vectors naming samples and
loci, respectively. The latter two arguments are optional depending on
the class (whether ploidies are indexed by sample and/or locus). The
typical user will not have to create an object in this way, because other
functions in polysat will do it for you.
pld
:The only slot for objects of these classes. For
"ploidymatrix"
, this is a matrix of integers, indexed in
the first dimension by sample name and in the second dimension by
locus name. Each element represents the ploidy at a given sample
and locus. For "ploidysample"
and "ploidylocus"
,
the slot is an integer vector, named by sample or locus and
indicating the ploidy at each sample or locus, respectively. For
"ploidyone"
, the slot contains a single integer
representing the ploidy for the entire dataset.
signature(object= "ploidymatrix")
: Returns the
contents of object@pld
: a matrix of ploidies indexed by sample
and locus. The samples
and loci
arguments can be used,
optionally, to only return a subset of ploidies.
Lindsay V. Clark
reformatPloidies
, pld
,
plCollapse
, gendata
showClass("ploidysuper")
showClass("ploidysuper")
processDatasetAllo
runs alleleCorrelations
on every locus in
a "genambig"
object, then runs testAlGroups
on every locus
using several user-specified parameter sets. It chooses a single best set of allele assignments
for each locus, and produces plots to help the user evaluate assignment quality.
plotSSAllo
assists the user in evaluating the quality of allele assignments by plotting
the results of K-means clustering. plotParamHeatmap
assists the user in choosing the
best parameter set for testAlGroups
for each locus.
plotSSAllo(AlCorrArray) plotParamHeatmap(propMat, popname = "AllInd", col = grey.colors(12)[12:1], main = "") processDatasetAllo(object, samples = Samples(object), loci = Loci(object), n.subgen = 2, SGploidy = 2, n.start = 50, alpha = 0.05, parameters = data.frame(tolerance = c(0.05, 0.05, 0.05, 0.05), swap = c(TRUE, FALSE, TRUE, FALSE), null.weight = c(0.5, 0.5, 0, 0)), plotsfile = "alleleAssignmentPlots.pdf", usePops = FALSE, ...)
plotSSAllo(AlCorrArray) plotParamHeatmap(propMat, popname = "AllInd", col = grey.colors(12)[12:1], main = "") processDatasetAllo(object, samples = Samples(object), loci = Loci(object), n.subgen = 2, SGploidy = 2, n.start = 50, alpha = 0.05, parameters = data.frame(tolerance = c(0.05, 0.05, 0.05, 0.05), swap = c(TRUE, FALSE, TRUE, FALSE), null.weight = c(0.5, 0.5, 0, 0)), plotsfile = "alleleAssignmentPlots.pdf", usePops = FALSE, ...)
AlCorrArray |
A two-dimensional list, where each item in the list is the output of |
propMat |
A two-dimensional array, with loci in the first dimension and parameter sets in the
second dimension, indicating the proportion of alleles that were found to be homoplasious by |
popname |
The name of the population corresponding to the data in |
col |
The color scale for representing the proportion of loci that are homoplasious or the proportion of genotypes that are missing. |
main |
A title for the plot. |
object |
A |
samples |
An optional character vector indicating which samples to include in analysis. |
loci |
An optional character vector indicating which loci to include in analysis. |
n.subgen |
The number of isoloci into which each locus should be split. Passed directly to
|
SGploidy |
The ploidy of each isolocus. Passed directly to |
n.start |
Passed directly to the |
alpha |
The significance threshold for determining whether two alleles are significantly correlated. Used
primarily for identifying potentially problematic positive correlations. Passed directly to |
parameters |
Data frame indicating parameter sets to pass to |
plotsfile |
A PDF output file name for drawing plots to help assess assignment quality. Can be |
usePops |
If |
... |
Additional parameters to pass to |
plotSSAllo
produces a plot of loci by population, with the sums-of-squares ratio on the x-axis and the evenness of allele distribution
on the y-axis (see Value). Locus names are written directly on the plot. If there are multiple population names, locus names are colored
by population, and a legend is provided for colors. Loci with high-quality allele clustering are expected to be in the upper-right
quadrant of the plot. If locus names are in italics, it indicates that positive correlations were found between some alleles, indicating
population structure or scoring error that could interfere with assignment quality.
plotParamHeatmap
produces an image to indicate the proportion of alleles found to be homoplasious, or the proportion of genotypes that
could not be unambiguously recoded using allele assignments, for each locus and
parameter set for a given population (when looking at homoplasy) or merged across populations (for homoplasy or the proportion of non-recodeable
genotypes). Darker colors indicate more homoplasy or more genotypes that could not be recoded.
The word “best” indicates, for each
locus, the parameter set that found the least homoplasy or smallest number of non-recodeable genotypes.
By default, processDatasetAllo
generates a PDF file containing output from plotSSAllo
and plotParamHeatmap
,
as well as heatmaps of the $heatmap.dist
output of alleleCorrelations
for each locus and population.
Heatmaps are not plotted for loci where an allele is present in all individuals. processDatasetAllo
also
generates a list of R objects containing allele assignments under different parameters, as well as statistics for evaluating
clustering quality and choosing the optimal parameter sets, as described below.
plotSSAllo
draws a plot and invisibly returns a list:
ssratio |
A two-dimensional array with loci in the first dimension and populations in the second dimension. Each value is the sums-of-squares between isoloci divided by the total sums-of-squares, as output by K-means clustering. If K-means clustering was not performed, the value is zero. |
evenness |
An array of the same dimensions as
where |
max.evenness |
The maximum possible value for |
min.evenness |
The minimum possible value for |
posCor |
An array of the same dimensions as |
processDatasetAllo
returns a list:
AlCorrArray |
A two-dimensional list with loci in the first dimension and populations in the second
dimension, giving the results of |
TAGarray |
A three-dimensional list with loci in the first dimension, populations in the second dimension,
and parameter sets in the third dimension, giving the results of |
plotSS |
The output of |
propHomoplasious |
A three-dimensional array, with the same dimensions as |
mergedAssignments |
A two-dimensional list, with loci in the first dimension and parameter sets in the
second dimension, containing allele assignments merged across populations. This is the output of
|
propHomoplMerged |
A two-dimensional array, of the same dimensions as |
missRate |
A matrix with the same dimensions as |
bestAssign |
A one-dimensional list with a single best set of allele assignments, from |
plotParamHeatmap
draws a plot and does not return anything.
Lindsay V. Clark
Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between allelic variables. Molecular Ecology Resources, 17, 1090–1103. DOI: 10.1111/1755-0998.12639.
alleleCorrelations
, recodeAllopoly
# get example dataset data(AllopolyTutorialData) # data cleanup mydata <- deleteSamples(AllopolyTutorialData, c("301", "302", "303")) PopInfo(mydata) <- rep(1:2, each = 150) Genotype(mydata, 43, 2) <- Missing(mydata) # allele assignments # R is set to 10 here to speed processing for example. It should typically be left at the default. myassign <- processDatasetAllo(mydata, loci = c("Loc3", "Loc6"), plotsfile = NULL, usePops = TRUE, R = 10, parameters = data.frame(tolerance = c(0.5, 0.5), swap = c(TRUE, FALSE), null.weight = c(0.5, 0.5))) # view best assignments for each locus myassign$bestAssign # plot K-means results plotSSAllo(myassign$AlCorrArray) # plot proportion of homoplasious alleles plotParamHeatmap(myassign$propHomoplasious, "Pop1") plotParamHeatmap(myassign$propHomoplasious, "Pop2") plotParamHeatmap(myassign$propHomoplMerged, "Merged across populations") # plot proportion of missing data, after recoding, for each locus and parameter set plotParamHeatmap(myassign$missRate, main = "Missing data:")
# get example dataset data(AllopolyTutorialData) # data cleanup mydata <- deleteSamples(AllopolyTutorialData, c("301", "302", "303")) PopInfo(mydata) <- rep(1:2, each = 150) Genotype(mydata, 43, 2) <- Missing(mydata) # allele assignments # R is set to 10 here to speed processing for example. It should typically be left at the default. myassign <- processDatasetAllo(mydata, loci = c("Loc3", "Loc6"), plotsfile = NULL, usePops = TRUE, R = 10, parameters = data.frame(tolerance = c(0.5, 0.5), swap = c(TRUE, FALSE), null.weight = c(0.5, 0.5))) # view best assignments for each locus myassign$bestAssign # plot K-means results plotSSAllo(myassign$AlCorrArray) # plot proportion of homoplasious alleles plotParamHeatmap(myassign$propHomoplasious, "Pop1") plotParamHeatmap(myassign$propHomoplasious, "Pop2") plotParamHeatmap(myassign$propHomoplMerged, "Merged across populations") # plot proportion of missing data, after recoding, for each locus and parameter set plotParamHeatmap(myassign$missRate, main = "Missing data:")
Given a file formatted for the software ATetra, read.ATetra
produces a genambig
object containing genotypes, population
identities, population names, and a dataset description from the file.
Ploidy in the genambig
object is automatically set to
4
.
read.ATetra(infile)
read.ATetra(infile)
infile |
Character string. A file path to the file to be read. |
read.ATetra
reads text files in the exact format specified by
the ATetra documentation. Note that this format only allows tetraploid
data and that there can be no missing data.
A genambig
object as described above.
Lindsay V. Clark
http://www.vub.ac.be/APNA/ATetra_Manual-1-1.pdf
van Puyvelde, K., van Geert, A. and Triest, L. (2010) ATETRA, a new software program to analyze tetraploid microsatellite data: comparison with TETRA and TETRASAT. Molecular Ecology Resources 10, 331–334.
write.ATetra
, read.Tetrasat
,
read.GeneMapper
,
read.Structure
, read.GenoDive
,
read.SPAGeDi
, read.POPDIST
,
read.STRand
# create a file to be read # (this would normally be done in a text editor or with ATetra's Excel template) myfile <- tempfile() cat("TIT,Sample Rubus Data for ATetra", "LOC,1,CBA15", "POP,1,1,Commonwealth", "IND,1,1,1,CMW1,197,208,211,213", "IND,1,1,2,CMW2,197,207,211,212", "IND,1,1,3,CMW3,197,208,212,219", "IND,1,1,4,CMW4,197,208,212,219", "IND,1,1,5,CMW5,197,208,211,212", "POP,1,2,Fall Creek Lake", "IND,1,2,6,FCR4,197,207,211,212", "IND,1,2,7,FCR7,197,208,212,218", "IND,1,2,8,FCR14,197,207,212,218", "IND,1,2,9,FCR15,197,208,211,212", "IND,1,2,10,FCR16,197,208,211,212", "IND,1,2,11,FCR17,197,207,212,218","LOC,2,CBA23","POP,2,1,Commonwealth", "IND,2,1,1,CMW1,98,100,106,125","IND,2,1,2,CMW2,98,125,,", "IND,2,1,3,CMW3,98,126,,","IND,2,1,4,CMW4,98,106,119,127", "IND,2,1,5,CMW5,98,106,125,","POP,2,2,Fall Creek Lake", "IND,2,2,6,FCR4,98,125,,","IND,2,2,7,FCR7,98,106,126,", "IND,2,2,8,FCR14,98,127,,","IND,2,2,9,FCR15,98,108,117,", "IND,2,2,10,FCR16,98,125,,","IND,2,2,11,FCR17,98,126,,","END", file = myfile, sep = "\n") # Read the file and examine the data exampledata <- read.ATetra(myfile) summary(exampledata) PopNames(exampledata) viewGenotypes(exampledata)
# create a file to be read # (this would normally be done in a text editor or with ATetra's Excel template) myfile <- tempfile() cat("TIT,Sample Rubus Data for ATetra", "LOC,1,CBA15", "POP,1,1,Commonwealth", "IND,1,1,1,CMW1,197,208,211,213", "IND,1,1,2,CMW2,197,207,211,212", "IND,1,1,3,CMW3,197,208,212,219", "IND,1,1,4,CMW4,197,208,212,219", "IND,1,1,5,CMW5,197,208,211,212", "POP,1,2,Fall Creek Lake", "IND,1,2,6,FCR4,197,207,211,212", "IND,1,2,7,FCR7,197,208,212,218", "IND,1,2,8,FCR14,197,207,212,218", "IND,1,2,9,FCR15,197,208,211,212", "IND,1,2,10,FCR16,197,208,211,212", "IND,1,2,11,FCR17,197,207,212,218","LOC,2,CBA23","POP,2,1,Commonwealth", "IND,2,1,1,CMW1,98,100,106,125","IND,2,1,2,CMW2,98,125,,", "IND,2,1,3,CMW3,98,126,,","IND,2,1,4,CMW4,98,106,119,127", "IND,2,1,5,CMW5,98,106,125,","POP,2,2,Fall Creek Lake", "IND,2,2,6,FCR4,98,125,,","IND,2,2,7,FCR7,98,106,126,", "IND,2,2,8,FCR14,98,127,,","IND,2,2,9,FCR15,98,108,117,", "IND,2,2,10,FCR16,98,125,,","IND,2,2,11,FCR17,98,126,,","END", file = myfile, sep = "\n") # Read the file and examine the data exampledata <- read.ATetra(myfile) summary(exampledata) PopNames(exampledata) viewGenotypes(exampledata)
Given a vector of filepaths to tab-delimited text files containing
genotype data in the ABI GeneMapper Genotypes Table format,
read.GeneMapper
produces a genambig
object containing
the genotype data.
read.GeneMapper(infiles, forceInteger=TRUE)
read.GeneMapper(infiles, forceInteger=TRUE)
infiles |
A character vector of paths to the files to be read. |
forceInteger |
Boolean. If |
read.GeneMapper
can read the genotypes tables that are exported
by the Applied Biosystems GeneMapper software. The only alterations
to the files that the user may have to make are 1) delete
any rows with missing data or fill in -9
in the first allele slot for
that row, 2) make sure that all allele names are numeric
representations of fragment length (no question marks or dashes), and
3) put sample names into the Sample Name column, if the names that you
wish to use in analysis are not already there. Each file should have
the standard header row produced by the software. If any sample has
more than one genotype listed for a given locus, only the last
genotype listed will be used.
The file format is simple enough that the user can easily create files
manually if GeneMapper is not the software used in allele calling.
The files are tab-delimited text files. There should be a header row
with column names. The column labeled “Sample Name” should contain
the names of the samples, and the column labeled “Marker” should
contain the names of the loci. You can have as many or as few columns as
needed to contain the alleles, and each of these columns should be
labeled “Allele X” where X is a number unique to each column. Row
labels and any other columns are ignored. For any given sample, each
allele is listed only once and is given as an integer that is the
length of the fragment in nucleotides. Duplicate alleles in the same
row are ignored by read.GeneMapper
. Alleles are separated by
tabs. If you have more allele columns than alleles for any given
sample, leave the extra cells blank so that read.table
will
read them as NA
. Example data files in this format are
included in the package.
read.GeneMapper
will read all of your data at once. It takes
as its first argument a character vector containing paths to all of
the files to be read. How the data are distributed over these files
does not matter. The function finds all unique sample names and all
unique markers across all the files, and automatically puts a missing
data symbol into the list if a particular sample and locus combination
is not found. Rows in which all allele cells are blank should NOT be
included in the input files; either delete these rows or put the
missing data symbol into the first allele cell.
Sample and locus names must be consistent within and across the files. The object that is produced is indexed by these names.
If forceInteger=FALSE
, alleles can be non-numeric values. Some
functionality of polysat will be lost in this case, but it could
allow for the import of SNP data, for example.
A genambig
object containing genotypes from the files, stored as
vectors of unique alleles in its Genotypes
slot. Other slots are
left at the default values.
A ‘subscript out of bounds’ error may mean that a sample name
or marker was left blank in one of the input files. A ‘NAs
introduced by coercion’ warning when forceInteger=TRUE
means that a
non-numeric, non-whitespace character was included in one of the
allele fields of the file(s), in which case the file(s) should be
carefully checked and re-imported.
Lindsay V. Clark
GeneMapper website: https://www.thermofisher.com/order/catalog/product/4475073
genambig
, read.Structure
,
read.GenoDive
, read.SPAGeDi
,
read.Tetrasat
, read.ATetra
,
write.GeneMapper
, read.POPDIST
,
read.STRand
# create a table of data gentable <- data.frame(Sample.Name=rep(c("ind1","ind2","ind3"),2), Marker=rep(c("loc1","loc2"), each=3), Allele.1=c(202,200,204,133,133,130), Allele.2=c(206,202,208,136,142,136), Allele.3=c(NA,208,212,145,148,NA), Allele.4=c(NA,216,NA,151,157,NA) ) # create a file (inspect this file in a text editor or spreadsheet # software to see the required format) myfile <- tempfile() write.table(gentable, file=myfile, quote=FALSE, sep="\t", na="", row.names=FALSE, col.names=TRUE) # read the file mygenotypes <- read.GeneMapper(myfile) # inspect the results viewGenotypes(mygenotypes)
# create a table of data gentable <- data.frame(Sample.Name=rep(c("ind1","ind2","ind3"),2), Marker=rep(c("loc1","loc2"), each=3), Allele.1=c(202,200,204,133,133,130), Allele.2=c(206,202,208,136,142,136), Allele.3=c(NA,208,212,145,148,NA), Allele.4=c(NA,216,NA,151,157,NA) ) # create a file (inspect this file in a text editor or spreadsheet # software to see the required format) myfile <- tempfile() write.table(gentable, file=myfile, quote=FALSE, sep="\t", na="", row.names=FALSE, col.names=TRUE) # read the file mygenotypes <- read.GeneMapper(myfile) # inspect the results viewGenotypes(mygenotypes)
read.GenoDive
takes a text file in the format for the software
GenoDive and produces a genambig
object.
read.GenoDive(infile)
read.GenoDive(infile)
infile |
A character string. The path to the file to be read. |
GenoDive is a Mac-only program for population genetic analysis that
allows for polyploid data. read.GenoDive
imports data from text
files formatted for this program.
The first line of the file is a comment line, which is written to the
Description
slot of the genambig
object.
On the second line, separated by tabs, are the number of individuals,
number of populations, number of loci, maximum ploidy
(ignored), and number of digits used to code alleles.
The following lines contain the names of populations, which are written
to the PopNames
slot of the genambig
object. After that is a
header line for the genotype data. This line contains, separated by
tabs, column headers for populations, clones (optional), and
individuals, followed by the name of each locus. The locus names for the
genotype object are derived from this line.
Each individual is on one line following the genotype header line. Separated by tabs are the population number, the clone number (optional), the individual name (used as the sample name in the output) and the genotypes at each locus. Alleles at one locus are concatenated together in one string without any characters to separate them. Each allele must have the same number of digits, although leading zeros can be omitted.
If the only alleles listed for a particular individual and locus are
zeros, this is interpreted by read.GenoDive
as missing data, and
Missing(object)
(the default, -9
) is written in that
genotype slot in the genambig
object.
GenoDive allows for a genotype to be partially missing but polysat does
not; therefore, if an allele is coded as zero but other alleles are
recorded for that sample and locus, the output genotype will just
contain the alleles that are present, with the zeros thrown out.
A genambig
object containing the data from the file.
Lindsay V. Clark
Meirmans, P. G. and Van Tienderen, P. H. (2004) GENOTYPE and GENODIVE: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes 4, 792-794.
http://www.bentleydrummer.nl/software/software/GenoDive.html
read.GeneMapper
, write.GenoDive
,
read.Tetrasat
, read.ATetra
,
read.Structure
,
read.SPAGeDi
, read.POPDIST
,
read.STRand
# create data file (normally done in a text editor or spreadsheet software) myfile <- tempfile() cat(c("example comment line", "5\t2\t2\t3\t2", "pop1", "pop2", "pop\tind\tloc1\tloc2", "1\tJohn\t102\t1214", "1\tPaul\t202\t0", "2\tGeorge\t101\t121213", "2\tRingo\t10304\t131414","1\tYoko\t10303\t120014"), file = myfile, sep = "\n") # import file data exampledata <- read.GenoDive(myfile) # view data summary(exampledata) viewGenotypes(exampledata) exampledata
# create data file (normally done in a text editor or spreadsheet software) myfile <- tempfile() cat(c("example comment line", "5\t2\t2\t3\t2", "pop1", "pop2", "pop\tind\tloc1\tloc2", "1\tJohn\t102\t1214", "1\tPaul\t202\t0", "2\tGeorge\t101\t121213", "2\tRingo\t10304\t131414","1\tYoko\t10303\t120014"), file = myfile, sep = "\n") # import file data exampledata <- read.GenoDive(myfile) # view data summary(exampledata) viewGenotypes(exampledata) exampledata
read.POPDIST
reads one or more text files formatted for the
software POPDIST and produces a "genambig"
object containing
genotypes, ploidies, and population identities from the file(s).
read.POPDIST(infiles)
read.POPDIST(infiles)
infiles |
A character vector of file paths to be read. |
The format for the software POPDIST is a modified version of the popular Genepop format. The first line is a comment line, followed by a list of locus names, each on a separate line or on one line separated by commas. A line starting with the string “Pop” (“pop” and “POP” are also recognized) indicates the beginning of data for one population. Each individual is then represented on one line, with the population name and individual genotype separated by a tab followed by a comma. Genotypes for different loci are separated by a tab or space. Each allele must be coded by two digits. Zeros (“00”) indicate missing data, either for an entire locus or for a partially heterozygous genotype. Partially heterozygous genotypes can also be represented by the arbitrary duplication of alleles.
If more than one file is read at once, locus names must be consistent
across all files. Locus and population names should not start with “Pop”,
“pop”, or “POP”, as read.POPDIST
searches for these
character strings in order to identify the lines that delimit populations.
A "genambig"
object. The Description
slot of the object
is taken from the comment line of the first file. Locus names are taken
from the files, and samples are given numbers instead of names. Each
genotype consists of all unique non-zero integers for a given sample and
locus. The Ploidies
slot is filled in based on how many alleles
are present at each locus of each sample (the number of characters
for the genotype, divided by two). reformatPloidies
is
used internally by the function to collapse the ploidies to the simplest
format. Population names are taken from the
individual genotype lines, and population identities are recorded based
on how the individuals are delimited by “Pop” lines.
Lindsay V. Clark
Tomiuk, J., Guldbrandtsen, B. and Loeschcke, B. (2009) Genetic similarity of polyploids: a new version of the computer program POPDIST (version 1.2.0) considers intraspecific genetic differentiation. Molecular Ecology Resources 9, 1364-1368.
Guldbrandtsen, B., Tomiuk, J. and Loeschcke, B. (2000) POPDIST version 1.1.1: A program to calculate population genetic distance and identity measures. Journal of Heredity 91, 178-179.
write.POPDIST
, read.Tetrasat
,
read.ATetra
, read.Structure
,
read.SPAGeDi
, read.GeneMapper
,
read.GenoDive
, read.STRand
# Create a file to read (this is typically done in a text editor) myfile <- tempfile() cat("An example for the read.POPDIST documentation.", "abcR", "abcQ", "Pop", "Piscataqua\t, 0204 0505", "Piscataqua\t, 0404 0307", "Piscataqua\t, 050200 030509", "Pop", "Salmon Falls\t, 1006\t0805", "Salmon Falls\t, 0510\t0308", "Pop", "Great Works\t, 050807 030800", "Great Works\t, 0000 0408", "Great Works\t, 0707 0305", file=myfile, sep="\n") # View the file in the R console (or open it in a text editor) cat(readLines(myfile), sep="\n") # Read the file into a "genambig" object fishes <- read.POPDIST(myfile) # View the data in the object summary(fishes) PopNames(fishes) PopInfo(fishes) Ploidies(fishes) viewGenotypes(fishes)
# Create a file to read (this is typically done in a text editor) myfile <- tempfile() cat("An example for the read.POPDIST documentation.", "abcR", "abcQ", "Pop", "Piscataqua\t, 0204 0505", "Piscataqua\t, 0404 0307", "Piscataqua\t, 050200 030509", "Pop", "Salmon Falls\t, 1006\t0805", "Salmon Falls\t, 0510\t0308", "Pop", "Great Works\t, 050807 030800", "Great Works\t, 0000 0408", "Great Works\t, 0707 0305", file=myfile, sep="\n") # View the file in the R console (or open it in a text editor) cat(readLines(myfile), sep="\n") # Read the file into a "genambig" object fishes <- read.POPDIST(myfile) # View the data in the object summary(fishes) PopNames(fishes) PopInfo(fishes) Ploidies(fishes) viewGenotypes(fishes)
read.SPAGeDi
can read a text file formatted for the SPAGeDi
software and return a genambig
object,
as well as optionally returning a data frame of spatial coordinates.
The genambig
object includes genotypes, ploidies, and population
identities (from the category column, if present) from the file.
read.SPAGeDi(infile, allelesep = "/", returnspatcoord = FALSE)
read.SPAGeDi(infile, allelesep = "/", returnspatcoord = FALSE)
infile |
A character string indicating the path of the file to read. |
allelesep |
The character that is used to delimit alleles within genotypes, or
|
returnspatcoord |
Boolean. Indicates whether a data frame should be returned containing the spatial coordinates columns. |
SPAGeDi offers a lot of flexibility in how data files are formatted.
read.SPAGeDi
accomodates most of that flexibility. The primary
exception is that alleles must be delimited in the same way across all
genotypes, as specified by allelesep
. Comment lines beginning
with //
, as well as blank lines, are ignored by
read.SPAGeDi
just as they are by SPAGeDi.
read.SPAGeDi
is not designed to read dominant data (see section
3.2.2 of the SPAGeDi 1.3 manual). However, see
genbinary.to.genambig
for a way to read this type
of data after some simple manipulation in a spreadsheet program.
The first line of a SPAGeDi file contains information that is used by
read.SPAGeDi
. The ploidy as specified in the 6th position of the
first line is ignored, and is instead calculated by counting alleles for
each individual (including zeros on the right, but not the left, side of
the genotype). The number of digits specified in the 5th position of
the first line is only used if allelesep=""
. All other values
in the first line are important for the function.
If the only alleles found for a particular individual and locus are
zeros, the genotype is interpreted as missing. Otherwise, zeros on the
left side of a genotype are ignored, and zeros on the right side of a
genotype are used in calculating the ploidy but are not included in the
genotype object that is returned. If allelesep=""
,
read.SPAGeDi
checks that the number of characters in the genotype
can be evenly divided by the number of digits per allele. If not, zeros
are added to the left of the genotype string before splitting it into
alleles.
The Ploidies
slot of the "genambig"
object that is created
is initially indexed by both sample and locus, with ploidy being
written to the slot on a per-genotype basis. After all genotypes have
been imported, reformatPloidies
is used to convert
Ploidies
to the simplest possible format before the object is returned.
Under the default where returnspatcoord=FALSE
, a genambig
object is returned. Alleles are formatted as integers. The
Ploidies
slot is filled in according to the number of alleles
per genotype, ignoring zeros on the left. If the first line of the
file indicates that there are more than zero categories, the category
column is used to fill in the PopNames
and PopInfo
slots.
Otherwise, a list is returned:
SpatCoord |
A data frame of spatial coordinates,
unchanged from the file. The format of each column is determined under
the default |
Dataset |
A |
Lindsay V. Clark
https://ebe.ulb.ac.be/ebe/SPAGeDi.html
Hardy, O. J. and Vekemans, X. (2002) SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618–620.
write.SPAGeDi
, genbinary.to.genambig
,
read.table
, read.GeneMapper
,
read.GenoDive
, read.Structure
,
read.ATetra
, read.Tetrasat
,
read.POPDIST
, read.STRand
# create a file to read (usually done with spreadsheet software or a # text editor): myfile <- tempfile() cat("// here's a comment line at the beginning of the file", "5\t0\t-2\t2\t2\t4", "4\t5\t10\t50\t100", "Ind\tLat\tLong\tloc1\tloc2", "ind1\t39.5\t-120.8\t00003133\t00004040", "ind2\t39.5\t-120.8\t3537\t4246", "ind3\t42.6\t-121.1\t5083332\t40414500", "ind4\t38.2\t-120.3\t00000000\t41430000", "ind5\t38.2\t-120.3\t00053137\t00414200", "END", sep="\n", file=myfile) # display the file cat(readLines(myfile), sep="\n") # read the file mydata <- read.SPAGeDi(myfile, allelesep = "", returnspatcoord = TRUE) # view the data mydata viewGenotypes(mydata[[2]])
# create a file to read (usually done with spreadsheet software or a # text editor): myfile <- tempfile() cat("// here's a comment line at the beginning of the file", "5\t0\t-2\t2\t2\t4", "4\t5\t10\t50\t100", "Ind\tLat\tLong\tloc1\tloc2", "ind1\t39.5\t-120.8\t00003133\t00004040", "ind2\t39.5\t-120.8\t3537\t4246", "ind3\t42.6\t-121.1\t5083332\t40414500", "ind4\t38.2\t-120.3\t00000000\t41430000", "ind5\t38.2\t-120.3\t00053137\t00414200", "END", sep="\n", file=myfile) # display the file cat(readLines(myfile), sep="\n") # read the file mydata <- read.SPAGeDi(myfile, allelesep = "", returnspatcoord = TRUE) # view the data mydata viewGenotypes(mydata[[2]])
This function reads in data in a format derived from the “BTH” format for exporting genotypes from the allele calling software STRand.
read.STRand(file, sep = "\t", popInSam = TRUE)
read.STRand(file, sep = "\t", popInSam = TRUE)
file |
A text string indicating the file to read. |
sep |
Field delimiter for the file. Tab by default. |
popInSam |
Boolean. If |
This function does not read the files directly produced from STRand, but
requires some simple clean-up in spreadsheet software. The BTH format
in STRand produces two columns per locus. One of these columns should
be deleted so that there is just one column per locus. Loci names
should remain in the column headers. The column containing sample names
should be deleted or renamed “Ind”. A
“Pop” column will need to be added, containing population names.
An “Ind” column is also necessary, containing either full sample
names or a sample suffix to be concatenated with the population name
(see popInSam
argument).
STRand adds an asterisk to the end of any genotype with more than two
alleles. read.STRand
will automatically strip this asterisk out
of the genotype.
Missing data is indicated by a zero in the file.
A "genambig"
object containing genotypes, locus and sample names,
population names, and population identities from the file.
Lindsay V. Clark
https://vgl.ucdavis.edu/STRand
Toonen, R. J. and Hughes, S. (2001) Increased Throughput for Fragment Analysis on ABI Prism 377 Automated Sequencer Using a Membrane Comb and STRand Software. Biotechniques 31, 1320–1324.
read.table
, read.GeneMapper
,
read.GenoDive
, read.Structure
,
read.ATetra
, read.Tetrasat
,
read.POPDIST
, read.SPAGeDi
# generate file to read strtemp <- data.frame(Pop=c("P1","P1","P2","P2"), Ind=c("a","b","a","b"), LocD=c("0","172/174","170/172/178*","172/176"), LocG=c("130/136/138/142*","132/136","138","132/140/144*")) myfile <- tempfile() write.table(strtemp, file=myfile, sep="\t", row.names=FALSE, quote=FALSE) # read the file mydata <- read.STRand(myfile) viewGenotypes(mydata) PopNames(mydata) # alternative example with popInSam=FALSE strtemp$Ind <- c("OH1","OH5","MT4","MT7") write.table(strtemp, file=myfile, sep="\t", row.names=FALSE, quote=FALSE) mydata <- read.STRand(myfile, popInSam=FALSE) Samples(mydata) PopNames(mydata)
# generate file to read strtemp <- data.frame(Pop=c("P1","P1","P2","P2"), Ind=c("a","b","a","b"), LocD=c("0","172/174","170/172/178*","172/176"), LocG=c("130/136/138/142*","132/136","138","132/140/144*")) myfile <- tempfile() write.table(strtemp, file=myfile, sep="\t", row.names=FALSE, quote=FALSE) # read the file mydata <- read.STRand(myfile) viewGenotypes(mydata) PopNames(mydata) # alternative example with popInSam=FALSE strtemp$Ind <- c("OH1","OH5","MT4","MT7") write.table(strtemp, file=myfile, sep="\t", row.names=FALSE, quote=FALSE) mydata <- read.STRand(myfile, popInSam=FALSE) Samples(mydata) PopNames(mydata)
read.Structure
creates a genambig
object by reading a text
file formatted for the software Structure. Ploidies
and
PopInfo
(if
available) are also written to the object, and data from additional
columns can optionally be extracted as well.
read.Structure(infile, ploidy, missingin = -9, sep = "\t", markernames = TRUE, labels = TRUE, extrarows = 1, popinfocol = 1, extracols = 1, getexcols = FALSE, ploidyoutput="one")
read.Structure(infile, ploidy, missingin = -9, sep = "\t", markernames = TRUE, labels = TRUE, extrarows = 1, popinfocol = 1, extracols = 1, getexcols = FALSE, ploidyoutput="one")
infile |
Character string. The file path to be read. |
ploidy |
Integer. The ploidy of the file, i.e. how many rows there are for each individual. |
missingin |
The symbol used to represent missing data in the Structure file. |
sep |
The character used to delimit the fields of the Structure file (tab by default). |
markernames |
Boolean, indicating whether the file has a header containing marker names. |
labels |
Boolean, indicating whether the file has a column containing sample names. |
extrarows |
Integer. The number of extra rows that the file has, not counting marker names. This could include rows for recessive alleles, inter-marker distances, or phase information. |
popinfocol |
Integer. The column number (after the labels column, if present)
where the data to be used for |
extracols |
Integer. The number of extra columns that the file has, not counting
sample names (labels) but counting the column to be used for
|
getexcols |
Boolean, indicating whether the function should return the data from any extra columns. |
ploidyoutput |
This argument determines what is assigned to the |
The current version of read.Structure
does not support the
ONEROWPERIND option in the file format. Each locus must only have one
column. If your data are in ONEROWPERIND format, it should be fairly
simple to manipulate it in a spreadsheet program so that it can be read
by read.GeneMapper
instead.
read.Structure
uses read.table
to initially read the file into a
data frame, then extracts information from the data frame. Because of
this, any header rows (particularly the one containing marker names)
should have leading tabs (or spaces if sep=" "
) so that the marker
names align correctly with their corresponding genotypes. You should be
able to open the file in a spreadsheet program and have everything align
correctly.
If the file does not contain sample names, set labels=FALSE
. The
samples will be numbered instead, and if you like you can use the
Samples<-
function to edit the sample names of the genotype object after
import. Likewise, if markernames=FALSE
, the
loci will be numbered automatically by the column names that
read.table
creates, but these can also be edited after the fact.
The Ploidies
slot of the "genambig"
object that is created
is initially indexed by both sample and locus, with ploidy being
written to the slot on a per-genotype basis. After all genotypes have
been imported, reformatPloidies
is used to convert
Ploidies
to the simplest possible format before the object is returned.
If getexcols=FALSE
, the function returns only a genambig
object.
If getexcols=TRUE
, the function returns a list with two elements. The
first, named ExtraCol
, is a data frame, where the row names are the
sample names and each column is one of the extra columns from the file
(but with each sample only once instead of being repeated ploidy
number of times). The second element is named Dataset
and is the
genotype object described above.
Lindsay V. Clark
Hubisz, M. J., Falush, D., Stephens, M. and Pritchard, J. K. (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 9, 1322–1332.
Falush, D., Stephens, M. and Pritchard, J. K. (2007) Inferences of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes 7, 574–578.
write.Structure
, read.GeneMapper
,
read.Tetrasat
, read.ATetra
,
read.GenoDive
,
read.SPAGeDi
, read.POPDIST
,
read.STRand
# create a file to read (normally done in a text editor or spreadsheet # software) myfile <- tempfile() cat("\t\tRhCBA15\tRhCBA23\tRhCBA28\tRhCBA14\tRUB126\tRUB262\tRhCBA6\tRUB26", "\t\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t197\t98\t152\t170\t136\t208\t151\t99", "WIN1B\t1\t208\t106\t174\t180\t166\t208\t164\t99", "WIN1B\t1\t211\t98\t182\t187\t184\t208\t174\t99", "WIN1B\t1\t212\t98\t193\t170\t203\t208\t151\t99", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD1\t2\t208\t100\t138\t160\t127\t202\t151\t124", "MCD1\t2\t208\t102\t153\t168\t138\t207\t151\t134", "MCD1\t2\t208\t106\t157\t180\t162\t211\t151\t137", "MCD1\t2\t208\t110\t159\t187\t127\t215\t151\t124", "MCD1\t2\t208\t114\t168\t160\t127\t224\t151\t124", "MCD1\t2\t208\t124\t193\t160\t127\t228\t151\t124", "MCD1\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD1\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD2\t2\t208\t98\t138\t160\t136\t202\t150\t120", "MCD2\t2\t208\t102\t144\t174\t145\t214\t150\t132", "MCD2\t2\t208\t105\t148\t178\t136\t217\t150\t135", "MCD2\t2\t208\t114\t151\t184\t136\t227\t150\t120", "MCD2\t2\t208\t98\t155\t160\t136\t202\t150\t120", "MCD2\t2\t208\t98\t157\t160\t136\t202\t150\t120", "MCD2\t2\t208\t98\t163\t160\t136\t202\t150\t120", "MCD2\t2\t208\t98\t138\t160\t136\t202\t150\t120", "MCD3\t2\t197\t100\t172\t170\t159\t213\t174\t134", "MCD3\t2\t197\t106\t174\t178\t193\t213\t176\t132", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", sep="\n",file=myfile) # view the file cat(readLines(myfile), sep="\n") # read the structure file into genotypes and populations testdata <- read.Structure(myfile, ploidy=8) # examine the results testdata
# create a file to read (normally done in a text editor or spreadsheet # software) myfile <- tempfile() cat("\t\tRhCBA15\tRhCBA23\tRhCBA28\tRhCBA14\tRUB126\tRUB262\tRhCBA6\tRUB26", "\t\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t197\t98\t152\t170\t136\t208\t151\t99", "WIN1B\t1\t208\t106\t174\t180\t166\t208\t164\t99", "WIN1B\t1\t211\t98\t182\t187\t184\t208\t174\t99", "WIN1B\t1\t212\t98\t193\t170\t203\t208\t151\t99", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "WIN1B\t1\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD1\t2\t208\t100\t138\t160\t127\t202\t151\t124", "MCD1\t2\t208\t102\t153\t168\t138\t207\t151\t134", "MCD1\t2\t208\t106\t157\t180\t162\t211\t151\t137", "MCD1\t2\t208\t110\t159\t187\t127\t215\t151\t124", "MCD1\t2\t208\t114\t168\t160\t127\t224\t151\t124", "MCD1\t2\t208\t124\t193\t160\t127\t228\t151\t124", "MCD1\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD1\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD2\t2\t208\t98\t138\t160\t136\t202\t150\t120", "MCD2\t2\t208\t102\t144\t174\t145\t214\t150\t132", "MCD2\t2\t208\t105\t148\t178\t136\t217\t150\t135", "MCD2\t2\t208\t114\t151\t184\t136\t227\t150\t120", "MCD2\t2\t208\t98\t155\t160\t136\t202\t150\t120", "MCD2\t2\t208\t98\t157\t160\t136\t202\t150\t120", "MCD2\t2\t208\t98\t163\t160\t136\t202\t150\t120", "MCD2\t2\t208\t98\t138\t160\t136\t202\t150\t120", "MCD3\t2\t197\t100\t172\t170\t159\t213\t174\t134", "MCD3\t2\t197\t106\t174\t178\t193\t213\t176\t132", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", "MCD3\t2\t-9\t-9\t-9\t-9\t-9\t-9\t-9\t-9", sep="\n",file=myfile) # view the file cat(readLines(myfile), sep="\n") # read the structure file into genotypes and populations testdata <- read.Structure(myfile, ploidy=8) # examine the results testdata
Given a file containing genotypes in the TETRASAT format,
read.Tetrasat
produces a genambig
object containing
genotypes and population identities from the file.
read.Tetrasat(infile)
read.Tetrasat(infile)
infile |
A character string of the file path to be read. |
read.Tetrasat
reads text files that are in the exact format
specified by the software TETRASAT and TETRA (see references for more
information). This is similar to the file format for GenePop but allows
for up to four alleles per locus. All alleles must be coded by two
digits. Another difference between the TETRASAT and GenePop formats is
that in TETRASAT the sample name and genotypes are not separated by a
comma, because the columns of data have fixed widths.
Since TETRASAT files also contain information about which samples belong
to which populations, this information is put into the PopInfo
slot of the genambig
object. Population names are not taken from
the file. The Ploidies
slot is
filled with the number 4 (using the "ploidyone"
class), because all individuals should be tetraploid.
The first line of the file is put into the Description
slot.
Locus names should not contain the letters "pop", uppercase or lowercase, adjacent to each other.
A genambig
object containing data from the file.
Lindsay V. Clark
Markwith, S. H., Stewart, D. J. and Dyer, J. L. (2006) TETRASAT: a program for the population analysis of allotetraploid microsatellite data. Molecular Ecology Notes 6, 586-589.
Liao, W. J., Zhu, B. R., Zeng, Y. F. and Zhang, D. Y. (2008) TETRA: an improved program for population genetic analysis of allotetraploid microsatellite data. Molecular Ecology Resources 8, 1260–1262.
read.GeneMapper
, write.Tetrasat
,
read.ATetra
, read.GenoDive
,
read.Structure
,
read.SPAGeDi
, read.POPDIST
,
read.STRand
# example with defined data: myfile <- tempfile() cat("Sample Data", "A1_Gtype", "A10_Gtype", "B1_Gtype", "D7_Gtype", "D9_Gtype", "D12_Gtype", "Pop", "BCRHE 1 0406 04040404 0208 02020202 03030303 0710", "BCRHE 10 0406 04040404 07070707 02020202 0304 0710", "BCRHE 2 04040404 04040404 0708 02020202 010305 0710", "BCRHE 3 04040404 04040404 02020202 0203 03030303 0809", "BCRHE 4 04040404 04040404 0608 0203 03030303 070910", "BCRHE 5 04040404 04040404 0208 02020202 03030303 050710", "BCRHE 6 0304 04040404 0207 02020202 03030303 07070707", "BCRHE 7 0406 04040404 0708 02020202 03030303 07070707", "BCRHE 8 0304 04040404 0203 0203 03030303 0709", "BCRHE 9 0406 04040404 0708 02020202 03030303 0710", "Pop", "BR 1 0406 04040404 05050505 02020202 03030303 1012", "BR 10 030406 04040404 0607 02020202 03030303 1011", "BR 2 030406 04040404 07070707 02020202 03030303 09090909", "BR 3 010304 04040404 07070707 02020202 03030303 09090909", "BR 4 030406 04040404 07070707 0203 03030303 10101010", "BR 5 030406 04040404 07070707 02020202 03030303 10101010", "BR 6 0406 04040404 0507 0203 03030303 10101010", "BR 7 0304 04040404 0809 02020202 03030303 070910", "BR 8 030406 04040404 07070707 02020202 03030303 070910", "BR 9 0406 04040404 07070707 02020202 03030303 07070707", sep="\n", file=myfile) mydata2 <- read.Tetrasat(myfile) summary(mydata2) viewGenotypes(mydata2, loci="B1_Gtype")
# example with defined data: myfile <- tempfile() cat("Sample Data", "A1_Gtype", "A10_Gtype", "B1_Gtype", "D7_Gtype", "D9_Gtype", "D12_Gtype", "Pop", "BCRHE 1 0406 04040404 0208 02020202 03030303 0710", "BCRHE 10 0406 04040404 07070707 02020202 0304 0710", "BCRHE 2 04040404 04040404 0708 02020202 010305 0710", "BCRHE 3 04040404 04040404 02020202 0203 03030303 0809", "BCRHE 4 04040404 04040404 0608 0203 03030303 070910", "BCRHE 5 04040404 04040404 0208 02020202 03030303 050710", "BCRHE 6 0304 04040404 0207 02020202 03030303 07070707", "BCRHE 7 0406 04040404 0708 02020202 03030303 07070707", "BCRHE 8 0304 04040404 0203 0203 03030303 0709", "BCRHE 9 0406 04040404 0708 02020202 03030303 0710", "Pop", "BR 1 0406 04040404 05050505 02020202 03030303 1012", "BR 10 030406 04040404 0607 02020202 03030303 1011", "BR 2 030406 04040404 07070707 02020202 03030303 09090909", "BR 3 010304 04040404 07070707 02020202 03030303 09090909", "BR 4 030406 04040404 07070707 0203 03030303 10101010", "BR 5 030406 04040404 07070707 02020202 03030303 10101010", "BR 6 0406 04040404 0507 0203 03030303 10101010", "BR 7 0304 04040404 0809 02020202 03030303 070910", "BR 8 030406 04040404 07070707 02020202 03030303 070910", "BR 9 0406 04040404 07070707 02020202 03030303 07070707", sep="\n", file=myfile) mydata2 <- read.Tetrasat(myfile) summary(mydata2) viewGenotypes(mydata2, loci="B1_Gtype")
genambig
Dataset with Loci Split into Isoloci
Given a "genambig"
object and a list of allele
assignments such as those produced by testAlGroups
or
catalanAlleles
, recodeAllopoly
will generate a
new "genambig"
object, with genotypes split according to which
alleles belong to which isoloci.
recodeAllopoly(object, x, allowAneuploidy = TRUE, samples = Samples(object), loci = Loci(object))
recodeAllopoly(object, x, allowAneuploidy = TRUE, samples = Samples(object), loci = Loci(object))
object |
A |
x |
A list. Each item in the list should itself be a list, in the format
output by |
allowAneuploidy |
Boolean. This controls what happens when the function encounters
genotypes that have more alleles than are possible for a given
isolocus. (For example, the genotype has four alleles, but three belong
to isolocus 1 and one belongs to isolocus 2.) If |
samples |
An optional character vector indicating which samples to analyze and output. |
loci |
An optional character vector indicating which loci to analyze and output. |
The same locus may appear more than once in x
, for example if
distinct populations were analyzed separately to produce the
allele assignments. If this is the case, recodeAllopoly
will
internally use mergeAlleleAssignments
to consolidate items in
x
with the same locus name. Loci that are in x
but not
object
are ignored with a warning. Loci that are in
object
but not x
are retained in the output of the
function, but not re-coded.
This function allows homoplasy, and uses process-of-elimination to try to determine which isoloci the homoplasious alleles belong to. In cases where genotypes cannot be determined for certain due to homoplasy, missing data are inserted.
If a genotype has more alleles than should be possible (e.g. five alleles in an allotetraploid), the genotype is skipped and will be output as missing data for all corresponding isoloci.
A "genambig"
object, with loci that are in x
split into
the appropriate number of isoloci.
Lindsay V. Clark
Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between alleles. Molecular Ecology Resources, 17, 1090–1103. DOI: 10.1111/1755-0998.12639.
# generate a dataset for this example testdata <- new("genambig", samples = paste("S", 1:9, sep = ""), loci = c("L1", "L2","L3")) Genotypes(testdata, loci="L1") <- list(c(120,124),c(124,126,130),c(120,126),c(126,132,134), c(120,124,130,132),c(120,126,130),c(120,132,134), c(120,124,126,130),c(120,132,138)) Genotypes(testdata, loci="L2") <- list(c(210,219,222,225),c(216,228),c(210,213,219,222),c(213,222,225,228), c(210,213,216,219),c(222,228),c(213),c(210,216),c(219,222,228)) Genotypes(testdata, loci="L3") <- list(c(155,145,153),c(157,155),c(151,157,159,165),c(147,151),c(149,153,157), c(149,157),c(153,159,161),c(163,165),c(147,163,167)) viewGenotypes(testdata) # generate allele assignments for this example myAssign <- list(list(locus="L1", SGploidy=2, assignments=matrix(c(1,0,0,1,1,1,0,1,1,0,1,1), nrow=2, ncol=6, dimnames=list(NULL, c("120","124","126","130","132","134")))), list(locus="L2", SGploidy=2, assignments=matrix(c(1,1,1,1,1,1,1,0,1,0,1,0,0,1), nrow=2, ncol=7, dimnames=list(NULL,c("210","213","216","219","222","225","228")))), list(locus="L3", SGploidy=2, assignments="No assignment")) myAssign # recode the dataset splitdata <- recodeAllopoly(testdata, myAssign) # view results viewGenotypes(splitdata) Ploidies(splitdata)
# generate a dataset for this example testdata <- new("genambig", samples = paste("S", 1:9, sep = ""), loci = c("L1", "L2","L3")) Genotypes(testdata, loci="L1") <- list(c(120,124),c(124,126,130),c(120,126),c(126,132,134), c(120,124,130,132),c(120,126,130),c(120,132,134), c(120,124,126,130),c(120,132,138)) Genotypes(testdata, loci="L2") <- list(c(210,219,222,225),c(216,228),c(210,213,219,222),c(213,222,225,228), c(210,213,216,219),c(222,228),c(213),c(210,216),c(219,222,228)) Genotypes(testdata, loci="L3") <- list(c(155,145,153),c(157,155),c(151,157,159,165),c(147,151),c(149,153,157), c(149,157),c(153,159,161),c(163,165),c(147,163,167)) viewGenotypes(testdata) # generate allele assignments for this example myAssign <- list(list(locus="L1", SGploidy=2, assignments=matrix(c(1,0,0,1,1,1,0,1,1,0,1,1), nrow=2, ncol=6, dimnames=list(NULL, c("120","124","126","130","132","134")))), list(locus="L2", SGploidy=2, assignments=matrix(c(1,1,1,1,1,1,1,0,1,0,1,0,0,1), nrow=2, ncol=7, dimnames=list(NULL,c("210","213","216","219","222","225","228")))), list(locus="L3", SGploidy=2, assignments="No assignment")) myAssign # recode the dataset splitdata <- recodeAllopoly(testdata, myAssign) # view results viewGenotypes(splitdata) Ploidies(splitdata)
This function changes the class of the object in the Ploidies
slot of a "gendata"
object. (See the four subclasses described
in "ploidysuper"
.) Existing ploidy data can either
be erased or, if possible, used in the new format.
reformatPloidies(object, output = "collapse", na.rm = FALSE, erase = FALSE)
reformatPloidies(object, output = "collapse", na.rm = FALSE, erase = FALSE)
object |
A |
output |
A character string indicating the desired result of the conversion:
|
na.rm |
Boolean. If |
erase |
Boolean. If |
This is a versatile function that can accomplish several tasks relating to the format of ploidies in the dataset:
If you wish to change how ploidy is indexed, but don't care about
keeping any data in the Ploidies
slot, set erase=TRUE
and output
to "matrix"
, "sample"
, "locus"
,
or "one"
.
If you wish to keep ploidy data while moving from a simpler format to
a more complex format (i.e. from "one"
to any other
format, or from "sample"
or "locus"
to "matrix"
),
leave erase=FALSE
and set output
to the desired format.
Existing data will be duplicated to fill out the new format. For
example, if ploidies were indexed by sample and you change to matrix
format, the ploidy that had previously been recorded for each sample
will be duplicated for each locus.
If you wish to keep ploidy data while performing any other format
conversions (e.g. "matrix"
to "sample"
or
"sample"
to "locus"
), the function will check that there
is one unique ploidy for each sample, locus, or the entire dataset (as
appropriate), and will produce an error if the conversion cannot be
done without a loss of information.
If you wish to keep ploidy data and convert to the simplest possible
format, set output="collapse"
. The function will automatically
determine the simplest format for conversion without loss of data.
(The read
functions in polysat that take ploidy data from
the input file use this option.)
A "gendata"
object that is a copy of object
but with the
Ploidies
slot converted to a new class.
Lindsay V. Clark
# Make a new "genambig" object for this example testdata <- new("genambig") Ploidies(testdata) ## If you need to reformat before you have entered any ploidy ## information: # convert from matrix to sample format testdata <- reformatPloidies(testdata, output="sample") Ploidies(testdata) ## If you have entered ploidy information but realized you can use a ## simpler format: # Enter some ploidies Ploidies(testdata)[1] <- 2 Ploidies(testdata) # Convert from "sample" to "one" with na.rm=TRUE testdata <- reformatPloidies(testdata, na.rm=TRUE, output="one") Ploidies(testdata) ## If you change your mind and want to go back to a more complex format # Convert from "one" to "locus" testdata <- reformatPloidies(testdata, output="locus") Ploidies(testdata)
# Make a new "genambig" object for this example testdata <- new("genambig") Ploidies(testdata) ## If you need to reformat before you have entered any ploidy ## information: # convert from matrix to sample format testdata <- reformatPloidies(testdata, output="sample") Ploidies(testdata) ## If you have entered ploidy information but realized you can use a ## simpler format: # Enter some ploidies Ploidies(testdata)[1] <- 2 Ploidies(testdata) # Convert from "sample" to "one" with na.rm=TRUE testdata <- reformatPloidies(testdata, na.rm=TRUE, output="one") Ploidies(testdata) ## If you change your mind and want to go back to a more complex format # Convert from "one" to "locus" testdata <- reformatPloidies(testdata, output="locus") Ploidies(testdata)
Given the number of subgenomes, the ploidy of each subgenome, and
optionally, allele frequencies, simAllopoly
will generate a
"genambig"
object containing simulated data for one locus.
simAllopoly(ploidy = c(2, 2), n.alleles = c(4, 4), n.homoplasy = 0, n.null.alleles=rep(0, length(ploidy)), alleles = NULL, freq = NULL, meiotic.error.rate=0, nSam = 100, locname = "L1")
simAllopoly(ploidy = c(2, 2), n.alleles = c(4, 4), n.homoplasy = 0, n.null.alleles=rep(0, length(ploidy)), alleles = NULL, freq = NULL, meiotic.error.rate=0, nSam = 100, locname = "L1")
ploidy |
A vector of integers, with one value for each subgenome, indicating the
ploidy of that subgenome. For example, |
n.alleles |
A vector, in similar format to the previous argument, indicating how
many different unique alleles there are for each isolocus. Ignored if
|
n.homoplasy |
A single value indicating how many homoplasious alleles there are.
Ignored if |
n.null.alleles |
A vector, in similar format to |
alleles |
Optional. A list of vectors of allele names (which are usually expressed as integers, but can also be character strings if desired). Each element of the list contains the allele names for the corresponding isolocus. Zero indicates a null allele. Allele names that are identical between isoloci will be treated as homoplasious. If this argument is not provided, alleles will be named as described in “Details”. |
freq |
Optional. A list of vectors of allele frequencies. If |
meiotic.error.rate |
A single value ranging from 0 to 0.5. The probability of a gamete containing a meiotic error involving this locus. See “Details”. |
nSam |
A single value indicating the number of samples to generate. |
locname |
The name for the locus. |
If alleles=NULL
, allele names will be generated in the format
A-1, A-2, B-1, B-2
etc., where A and B refer to separate
subgenomes. Homoplasious alleles will be named H-1, H-2
, etc.
Meiotic errors, as simulated by simAllopoly
, always result in
balanced aneuploidy, i.e. one copy of an isolocus will be
replaced by an additional copy of a different isolocus. This is
simulated on a per-gamete basis, so each gamete can have a maximum of
one meiotic error per locus, but an individual could potentially be
derived from two error-containing gametes. Note that in homozygotes and
partial heterozygotes, it may not be possible to detect aneuploidy by
examining the genotype; this phenomenon lowers the apparent rate of
aneuploidy in the dataset.
If alleles are provided by the user with the alleles
argument,
zero (for sets of numeric alleles) or "N"
(for sets of character
alleles) indicates a null allele. The null allele will be
simulated at the frequency specified, but will not be shown in the
output dataset. Genotypes with no non-null alleles are recorded as missing.
A "genambig"
object.
Unlike the code supplied in the file extdata/simgen.R
, all
genotypes in a dataset generated by this function will be of the same
ploidy.
Lindsay V. Clark
Clark, L. V. and Drauch Schreier, A. (2017) Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between alleles. Molecular Ecology Resources, 17, 1090–1103. DOI: 10.1111/1755-0998.12639.
alleleCorrelations
, catalanAlleles
,
simgen
# Generate an allotetraploid dataset with no homoplasy. # One isolocus has five alleles, while the other has eight. test <- simAllopoly(n.alleles=c(5,8)) # Generate an allo-octoploid dataset with two tetraploid subgenomes, ten # alleles per subgenome, including one homoplasious allele. test2 <- simAllopoly(ploidy=c(4,4), n.alleles=c(10,10), n.homoplasy=1) # Generate an allotetraploid dataset, and manually define allele names # and frequencies. test3 <- simAllopoly(alleles=list(c(120,124,126),c(130,134,138,140)), freq=list(c(0.4,0.3,0.3),c(0.25,0.25,0.25,0.25))) # Generate an autotetraploid dataset with seven alleles test4 <- simAllopoly(ploidy=4, n.alleles=7) # Generate an allotetraploid dataset with a null allele at high frequency test5 <- simAllopoly(n.null.alleles=c(1,0), freq=list(c(0.5,0.1,0.1,0.3), c(0.25,0.25,0.4,0.1)))
# Generate an allotetraploid dataset with no homoplasy. # One isolocus has five alleles, while the other has eight. test <- simAllopoly(n.alleles=c(5,8)) # Generate an allo-octoploid dataset with two tetraploid subgenomes, ten # alleles per subgenome, including one homoplasious allele. test2 <- simAllopoly(ploidy=c(4,4), n.alleles=c(10,10), n.homoplasy=1) # Generate an allotetraploid dataset, and manually define allele names # and frequencies. test3 <- simAllopoly(alleles=list(c(120,124,126),c(130,134,138,140)), freq=list(c(0.4,0.3,0.3),c(0.25,0.25,0.25,0.25))) # Generate an autotetraploid dataset with seven alleles test4 <- simAllopoly(ploidy=4, n.alleles=7) # Generate an allotetraploid dataset with a null allele at high frequency test5 <- simAllopoly(n.null.alleles=c(1,0), freq=list(c(0.5,0.1,0.1,0.3), c(0.25,0.25,0.4,0.1)))
genambig
object containing simulated data from three
populations with 100 individuals each, at three loci. Individuals
are a random mixture of diploids and tetraploids. Genotypes were
generated according to pre-set allele frequencies.
data(simgen)
data(simgen)
A genambig
object with data in the Genotypes
,
PopInfo
, PopNames
, Ploidies
and Usatnts
slots. This is saved as an .RData file. simgen
was created
using the code found in the “simgen.R” file in the
“extdata” directory of the polysat package installation. This
code may be useful for inspiration on how to create a simulated dataset.
simulated data
Given genetic data, allele frequencies by population are calculated. This estimation method assumes polysomic inheritance. For genotypes with allele copy number ambiguity, all alleles are assumed to have an equal chance of being present in multiple copies. This function is best used to generate initial values for more complex allele frequency estimation methods.
simpleFreq(object, samples = Samples(object), loci = Loci(object))
simpleFreq(object, samples = Samples(object), loci = Loci(object))
object |
A |
samples |
An optional character vector of samples to include in the calculation. |
loci |
An optional character vector of loci to include in the calculation. |
If object
is of class genambig
, it is converted to a
genbinary
object before allele frequency calculations take
place. Everything else being equal, the function will work more quickly
if it is supplied with a genbinary
object.
For
each sample*locus, a conversion factor is generated that is the ploidy
of the sample (and/or locus) as specified in Ploidies(object)
divided by the number of
alleles that the sample has at that locus. Each allele is then
considered to be present in as many copies as the conversion factor
(note that this is not necessarily an integer). The number of copies of
an allele is totaled for the population and is divided by the total
number of genomes in the population (minus missing data at the locus)
in order to calculate allele frequency.
A major assumption of this calculation method is that each allele in a partially heterozygous genotype has an equal chance of being present in more than one copy. This is almost never true, because common alleles in a population are more likely to be partially homozygous in an individual. The result is that the frequency of common alleles is underestimated and the frequency of rare alleles is overestimated. Also note that the level of inbreeding in the population has an effect on the relationship between genotype frequencies and allele frequencies, but is not taken into account in this calculation.
Data frame, where each population is in one row. If each sample in
object
has only one ploidy, the first column of the data frame is
called Genomes
and contains the number of genomes in each
population. Otherwise, there is a
Genomes
column for each locus. Each remaining column contains
frequencies for one allele.
Columns are named by locus and allele, separated by a period. Row names
are taken from PopNames(object)
.
Lindsay V. Clark
# create a data set for this example mygen <- new("genambig", samples = paste("ind", 1:6, sep=""), loci = c("loc1", "loc2")) mygen <- reformatPloidies(mygen, output="sample") Genotypes(mygen, loci="loc1") <- list(c(206),c(208,210),c(204,206,210), c(196,198,202,208),c(196,200),c(198,200,202,204)) Genotypes(mygen, loci="loc2") <- list(c(130,134),c(138,140),c(130,136,140), c(138),c(136,140),c(130,132,136)) PopInfo(mygen) <- c(1,1,1,2,2,2) Ploidies(mygen) <- c(2,2,4,4,2,4) # calculate allele frequencies myfreq <- simpleFreq(mygen) # look at the results myfreq # an example where ploidy is indexed by locus instead mygen2 <- new("genambig", samples = paste("ind", 1:6, sep=""), loci = c("loc1", "loc2")) mygen2 <- reformatPloidies(mygen2, output="locus") PopInfo(mygen2) <- 1 Ploidies(mygen2) <- c(2,4) Genotypes(mygen2, loci="loc1") <- list(c(198), c(200,204), c(200), c(198,202), c(200), c(202,204)) Genotypes(mygen2, loci="loc2") <- list(c(140,144,146), c(138,144), c(136,138,144,148), c(140), c(140,142,146,150), c(142,148,150)) myfreq2 <- simpleFreq(mygen2) myfreq2
# create a data set for this example mygen <- new("genambig", samples = paste("ind", 1:6, sep=""), loci = c("loc1", "loc2")) mygen <- reformatPloidies(mygen, output="sample") Genotypes(mygen, loci="loc1") <- list(c(206),c(208,210),c(204,206,210), c(196,198,202,208),c(196,200),c(198,200,202,204)) Genotypes(mygen, loci="loc2") <- list(c(130,134),c(138,140),c(130,136,140), c(138),c(136,140),c(130,132,136)) PopInfo(mygen) <- c(1,1,1,2,2,2) Ploidies(mygen) <- c(2,2,4,4,2,4) # calculate allele frequencies myfreq <- simpleFreq(mygen) # look at the results myfreq # an example where ploidy is indexed by locus instead mygen2 <- new("genambig", samples = paste("ind", 1:6, sep=""), loci = c("loc1", "loc2")) mygen2 <- reformatPloidies(mygen2, output="locus") PopInfo(mygen2) <- 1 Ploidies(mygen2) <- c(2,4) Genotypes(mygen2, loci="loc1") <- list(c(198), c(200,204), c(200), c(198,202), c(200), c(202,204)) Genotypes(mygen2, loci="loc2") <- list(c(140,144,146), c(138,144), c(136,138,144,148), c(140), c(140,142,146,150), c(142,148,150)) myfreq2 <- simpleFreq(mygen2) myfreq2
genambig
object containing alleles of 20 Rubus
samples at three microsatellite loci.
data(testgenotypes)
data(testgenotypes)
A genambig
object with data in the Genotypes
,
PopInfo
, PopNames
, and Usatnts
slots. This is
saved as a .RData file. Population identities are used here to
indicate two different species.
Clark, L. V. and Jasieniuk, M. (2012) Spontaneous hybrids between native and exotic Rubus in the Western United States produce offspring both by apomixis and by sexual recombination. Heredity 109, 320–328. Data available at: doi:10.5061/dryad.m466f
viewGenotypes
prints a tab-delimited table of samples, loci, and
alleles to the console so that genotypes can be easily viewed.
viewGenotypes(object, samples = Samples(object), loci = Loci(object))
viewGenotypes(object, samples = Samples(object), loci = Loci(object))
object |
An object of
one of the |
samples |
A numerical or character vector indicating which samples to display. |
loci |
A numerical or character vector indicating which loci to display. |
viewGenotypes
is a generic function with methods for
genambig
and genbinary
objects.
For a genambig
object, a header line indicating sample, locus,
and allele columns is printed.
Genotypes are printed below this. Genotypes are ordered first by locus
and second by sample.
For a genbinary
object, the presence/absence matrix is printed,
organized by locus. After the matrix for one locus is printed, a blank
line is inserted and the matrix for the next locus is printed.
No value is returned.
Lindsay V. Clark
# create a dataset for this example mygen <- new("genambig", samples=c("ind1", "ind2", "ind3", "ind4"), loci=c("locA", "locB")) Genotypes(mygen) <- array(list(c(98, 104, 108), c(100, 104, 110, 114), c(102, 108, 110), Missing(mygen), c(132, 135), c(138, 141, 147), c(135, 141, 144), c(129, 150)), dim=c(4,2)) # view the genotypes viewGenotypes(mygen)
# create a dataset for this example mygen <- new("genambig", samples=c("ind1", "ind2", "ind3", "ind4"), loci=c("locA", "locB")) Genotypes(mygen) <- array(list(c(98, 104, 108), c(100, 104, 110, 114), c(102, 108, 110), Missing(mygen), c(132, 135), c(138, 141, 147), c(135, 141, 144), c(129, 150)), dim=c(4,2)) # view the genotypes viewGenotypes(mygen)
write.ATetra
uses genotype and population information contained
in a genambig
object to create a
text file of genotypes in the ATetra format.
write.ATetra(object, samples = Samples(object), loci = Loci(object), file = "")
write.ATetra(object, samples = Samples(object), loci = Loci(object), file = "")
object |
A |
samples |
A character vector of samples to write to the file. This is a subset
of |
loci |
A character vector of loci to write to the file. This is a subset of
|
file |
A character string indicating the path and name to which to write the file. |
Note that missing data are not allowed in ATetra, although
write.ATetra
will still process missing data. When it does so, it
leaves all alleles blank in the file for that particular sample and
locus, and also prints a warning indicating which sample and locus had
missing data.
A file is written but no value is returned.
Lindsay V. Clark
http://www.vub.ac.be/APNA/ATetra_Manual-1-1.pdf
van Puyvelde, K., van Geert, A. and Triest, L. (2010) ATETRA, a new software program to analyze tetraploid microsatellite data: comparison with TETRA and TETRASAT. Molecular Ecology Resources 10, 331-334.
read.ATetra
, write.Tetrasat
,
write.GeneMapper
, write.POPDIST
# set up sample data (usually done by reading files) mysamples <- c("ind1", "ind2", "ind3", "ind4") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples=mysamples, loci=myloci) mygendata <- reformatPloidies(mygendata, output="one") Genotypes(mygendata, loci="loc1") <- list(c(202,204), c(204), c(200,206,208,212), c(198,204,208)) Genotypes(mygendata, loci="loc2") <- list(c(78,81,84), c(75,90,93,96,99), c(87), c(-9)) PopInfo(mygendata) <- c(1,2,1,2) PopNames(mygendata) <- c("this pop", "that pop") Ploidies(mygendata) <- 4 Description(mygendata) <- "Example for write.ATetra." ## Not run: # write an ATetra file write.ATetra(mygendata, file="atetratest.txt") # view the file cat(readLines("atetratest.txt"),sep="\n") ## End(Not run)
# set up sample data (usually done by reading files) mysamples <- c("ind1", "ind2", "ind3", "ind4") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples=mysamples, loci=myloci) mygendata <- reformatPloidies(mygendata, output="one") Genotypes(mygendata, loci="loc1") <- list(c(202,204), c(204), c(200,206,208,212), c(198,204,208)) Genotypes(mygendata, loci="loc2") <- list(c(78,81,84), c(75,90,93,96,99), c(87), c(-9)) PopInfo(mygendata) <- c(1,2,1,2) PopNames(mygendata) <- c("this pop", "that pop") Ploidies(mygendata) <- 4 Description(mygendata) <- "Example for write.ATetra." ## Not run: # write an ATetra file write.ATetra(mygendata, file="atetratest.txt") # view the file cat(readLines("atetratest.txt"),sep="\n") ## End(Not run)
A table of allele frequencies such as that produced by simpleFreq
or deSilvaFreq
is used to calculate average allele frequencies
for the entire dataset. These are then written in a format that can be
read by the software SPAGeDi.
write.freq.SPAGeDi(freqs, usatnts, file = "", digits = 2, pops = row.names(freqs), loci = unique(as.matrix(as.data.frame(strsplit(names(freqs), split = ".", fixed = TRUE), stringsAsFactors = FALSE))[1, ]))
write.freq.SPAGeDi(freqs, usatnts, file = "", digits = 2, pops = row.names(freqs), loci = unique(as.matrix(as.data.frame(strsplit(names(freqs), split = ".", fixed = TRUE), stringsAsFactors = FALSE))[1, ]))
freqs |
A data frame of population sizes and allele frequencies, such as that
produced by |
usatnts |
An integer vector containing the lengths of the microsatellite repeats
for the loci in the table. In most cases, if |
file |
The name of the file to write. |
digits |
The number of digits to use to represent each allele. This should be
the same as that used in |
pops |
An optional character vector indicating a subset of populations from the table to use in calculating mean allele frequencies. |
loci |
An optional character vector indicating a subset of loci to write to the file. |
For some calculations of inter-individual relatedness and kinship
coefficients, SPAGeDi can read a file of allele frequencies to use in
the calculation. write.freq.SPAGeDi
puts allele frequencies from
polysat into this format.
A weighted average of allele frequencies is calculated across all
populations (or those specified by pops
). The average is
weighted by population size as specified in the “Genomes” column
of freqs
.
Allele names are converted to match those produced by
write.SPAGeDi
. Alleles are divided by the numbers in
usatnts
in order to convert fragment length in nucleotides to
repeat numbers. If necessary, 10^(digits-1)
is repeatedly
subtracted from all alleles until they can be represented using the
right number of digits.
The file produced is tab-delimited and contains two columns per locus. The first column contains the locus name followed by all allele names, and the second column contains the number of alleles followed by the allele frequencies.
A file is written but no value is returned.
SPAGeDi can already estimate allele frequencies in a way that is
identical to that of simpleFreq
. Therefore, if you have allele
frequencies produced by simpleFreq
, there is not much sense in
exporting them to SPAGeDi. deSilvaFreq
, however, is a more
advanced and accurate allele frequency estimation than what is
available in SPAGeDi v1.3. write.freq.SPAGeDi
exists primarily to
export allele frequencies from deSilvaFreq
.
Lindsay V. Clark
https://ebe.ulb.ac.be/ebe/SPAGeDi.html
Hardy, O. J. and Vekemans, X. (2002) SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618-620.
## Not run: # set up a genambig object to use in this example mygen <- new("genambig", samples=c(paste("G", 1:30, sep=""), paste("R", 1:50, sep="")), loci=c("afrY", "ggP")) PopNames(mygen) <- c("G", "R") PopInfo(mygen) <- c(rep(1, 30), rep(2, 50)) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 Usatnts(mygen) <- c(2, 2) # randomly create genotypes according to pre-set allele frequencies for(s in Samples(mygen, populations=1)){ Genotype(mygen, s, "afrY") <- unique(sample(c(140, 142, 146, 150, 152), 4, TRUE, c(.30, .12, .26, .08, .24))) Genotype(mygen, s, "ggP") <- unique(sample(c(210, 214, 218, 220, 222), 4, TRUE, c(.21, .13, .27, .07, .32))) } for(s in Samples(mygen, populations=2)){ Genotype(mygen, s, "afrY") <- unique(sample(c(140, 142, 144, 150, 152), 4, TRUE, c(.05, .26, .17, .33, .19))) Genotype(mygen, s, "ggP") <- unique(sample(c(212, 214, 220, 222, 224), 4, TRUE, c(.14, .04, .36, .20, .26))) } # write a SPAGeDi file write.SPAGeDi(mygen, file="SPAGdataFreqExample.txt") # calculate allele frequenies myfreq <- deSilvaFreq(mygen, self = 0.05) # write allele frequencies file write.freq.SPAGeDi(myfreq, usatnts=Usatnts(mygen), file="SPAGfreqExample.txt") ## End(Not run)
## Not run: # set up a genambig object to use in this example mygen <- new("genambig", samples=c(paste("G", 1:30, sep=""), paste("R", 1:50, sep="")), loci=c("afrY", "ggP")) PopNames(mygen) <- c("G", "R") PopInfo(mygen) <- c(rep(1, 30), rep(2, 50)) mygen <- reformatPloidies(mygen, output="one") Ploidies(mygen) <- 4 Usatnts(mygen) <- c(2, 2) # randomly create genotypes according to pre-set allele frequencies for(s in Samples(mygen, populations=1)){ Genotype(mygen, s, "afrY") <- unique(sample(c(140, 142, 146, 150, 152), 4, TRUE, c(.30, .12, .26, .08, .24))) Genotype(mygen, s, "ggP") <- unique(sample(c(210, 214, 218, 220, 222), 4, TRUE, c(.21, .13, .27, .07, .32))) } for(s in Samples(mygen, populations=2)){ Genotype(mygen, s, "afrY") <- unique(sample(c(140, 142, 144, 150, 152), 4, TRUE, c(.05, .26, .17, .33, .19))) Genotype(mygen, s, "ggP") <- unique(sample(c(212, 214, 220, 222, 224), 4, TRUE, c(.14, .04, .36, .20, .26))) } # write a SPAGeDi file write.SPAGeDi(mygen, file="SPAGdataFreqExample.txt") # calculate allele frequenies myfreq <- deSilvaFreq(mygen, self = 0.05) # write allele frequencies file write.freq.SPAGeDi(myfreq, usatnts=Usatnts(mygen), file="SPAGfreqExample.txt") ## End(Not run)
Given a genambig
object,
write.GeneMapper
writes a text file of a table containing columns
for sample name, locus, and alleles.
write.GeneMapper(object, file = "", samples = Samples(object), loci = Loci(object))
write.GeneMapper(object, file = "", samples = Samples(object), loci = Loci(object))
object |
A |
file |
Character string. The path to which to write the file. |
samples |
Character vector. Samples to write to the file. This should be a
subset of |
loci |
Character vector. Loci to write to the file. This should be a subset
of |
Although I do not know of any population genetic software other than polysat that will read this data format directly, the ABI GeneMapper Genotypes Table format is a convenient way for the user to store microsatellite genotype data and view it in a text editor or spreadsheet software. Each row contains the sample name, locus name, and alleles separated by tabs.
The number of allele columns needed is detected by the maximum value
of Ploidies(object,samples,loci)
. The function will add additional
columns if it encounters genotypes with more than this number of
alleles.
write.GeneMapper
handles missing data in a very simple way, in
that it writes the missing data symbol directly to the table as though
it were an allele. If you want missing data to be represented
differently in the table, you can open it in spreadsheet software and
use find/replace or conditional formatting to locate missing data.
The file that is produced can be read back into R directly by
read.GeneMapper
, and therefore may be a convenient way to backup
genotype data for future analysis and manipulation in polysat.
(save
can also be used to backup an R object more directly,
including population and other information.) This can also enable the
user to edit genotype data in spreadsheet software, if the
editGenotypes
function is not sufficient.
A file is written but no value is returned.
Lindsay V. Clark
GeneMapper website: https://www.thermofisher.com/order/catalog/product/4475073
read.GeneMapper
, write.Structure
,
write.GenoDive
, write.Tetrasat
,
write.ATetra
, write.POPDIST
,
write.SPAGeDi
, editGenotypes
# create a genotype object (usually done by reading a file) mysamples <- c("ind1", "ind2", "ind3", "ind4") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples=mysamples, loci=myloci) mygendata <- reformatPloidies(mygendata, output="one") Genotypes(mygendata, loci="loc1") <- list(c(202,204), c(204), c(200,206,208,212), c(198,204,208)) Genotypes(mygendata, loci="loc2") <- list(c(78,81,84), c(75,90,93,96,99), c(87), c(-9)) Ploidies(mygendata) <- 6 ## Not run: # write a GeneMapper file write.GeneMapper(mygendata, "exampleGMoutput.txt") # view the file with read.table read.table("exampleGMoutput.txt", sep="\t", header=TRUE) ## End(Not run)
# create a genotype object (usually done by reading a file) mysamples <- c("ind1", "ind2", "ind3", "ind4") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples=mysamples, loci=myloci) mygendata <- reformatPloidies(mygendata, output="one") Genotypes(mygendata, loci="loc1") <- list(c(202,204), c(204), c(200,206,208,212), c(198,204,208)) Genotypes(mygendata, loci="loc2") <- list(c(78,81,84), c(75,90,93,96,99), c(87), c(-9)) Ploidies(mygendata) <- 6 ## Not run: # write a GeneMapper file write.GeneMapper(mygendata, "exampleGMoutput.txt") # view the file with read.table read.table("exampleGMoutput.txt", sep="\t", header=TRUE) ## End(Not run)
write.GenoDive
uses data from a genambig
object to
create a file formatted for the software GenoDive.
write.GenoDive(object, digits = 2, file = "", samples = Samples(object), loci = Loci(object))
write.GenoDive(object, digits = 2, file = "", samples = Samples(object), loci = Loci(object))
object |
A |
digits |
An integer indicating how many digits to use to represent each allele (usually 2 or 3). |
file |
A character string of the file path to which to write. |
samples |
A character vector of samples to include in the file. A subset of
|
loci |
A character vector of loci to include in the file. A subset of
|
The number of individuals, number of populations, number of loci, and
maximum ploidy of the sample are calculated automatically and entered in
the second line of the file. If the maximum ploidy needs to be reduced
by random removal of alleles, it is possible to do this in the software
GenoDive after importing the data. The genambig
object should
not have individuals with more alleles than the highest ploidy level
listed in its Ploidies
slot.
Several steps happen in order to convert alleles to the right format.
First, all instances of the missing data symbol are replaced with zero.
Alleles are then divided by the numbers provided in
Usatnts(object)
(and rounded down if necessary) in order to
convert them from nucleotides to repeat numbers. If the alleles are
still too big to be represented by the number of digits specified,
write.GenoDive
repeatedly subtracts a number (10^(digits-1)
; 10
if digits=2
) from all alleles at a locus until the alleles are small
enough. Alleles are then converted to characters, and a leading zero is
added to an allele if it does not have enough digits. These alleles are
concatenated at each locus so that each sample*locus genotype is an
uninterrupted string of numbers.
A file is written but no value is returned.
Lindsay V. Clark
Meirmans, P. G. and Van Tienderen P. H. (2004) GENOTYPE and GENODIVE: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes 4, 792-794.
http://www.bentleydrummer.nl/software/software/GenoDive.html
read.GenoDive
, write.Structure
,
write.ATetra
, write.Tetrasat
,
write.GeneMapper
, write.POPDIST
,
write.SPAGeDi
# set up the genotype object (usually done by reading a file) mysamples <- c("Mal", "Inara", "Kaylee", "Simon", "River", "Zoe", "Wash", "Jayne", "Book") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples=mysamples, loci=myloci) mygendata <- reformatPloidies(mygendata, output="sample") Genotypes(mygendata, loci="loc1") <- list(c(304,306), c(302,310), c(306), c(312,314), c(312,314), c(308,310), c(312), c(302,308,310), c(-9)) Genotypes(mygendata, loci="loc2") <- list(c(118,133), c(121,130), c(122,139), c(124,133), c(118,124), c(121,127), c(124,136), c(124,127,136), c(121,130)) Usatnts(mygendata) <- c(2,3) PopNames(mygendata) <- c("Core","Outer Rim") PopInfo(mygendata) <- c(2,1,2,1,1,2,2,2,1) Ploidies(mygendata) <- c(2,2,2,2,2,2,2,3,2) Description(mygendata) <- "Serenity crew" ## Not run: # write files (use file="" to write to the console instead) write.GenoDive(mygendata, digits=2, file="testGenoDive2.txt") write.GenoDive(mygendata, digits=3, file="testGenoDive3.txt") ## End(Not run)
# set up the genotype object (usually done by reading a file) mysamples <- c("Mal", "Inara", "Kaylee", "Simon", "River", "Zoe", "Wash", "Jayne", "Book") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples=mysamples, loci=myloci) mygendata <- reformatPloidies(mygendata, output="sample") Genotypes(mygendata, loci="loc1") <- list(c(304,306), c(302,310), c(306), c(312,314), c(312,314), c(308,310), c(312), c(302,308,310), c(-9)) Genotypes(mygendata, loci="loc2") <- list(c(118,133), c(121,130), c(122,139), c(124,133), c(118,124), c(121,127), c(124,136), c(124,127,136), c(121,130)) Usatnts(mygendata) <- c(2,3) PopNames(mygendata) <- c("Core","Outer Rim") PopInfo(mygendata) <- c(2,1,2,1,1,2,2,2,1) Ploidies(mygendata) <- c(2,2,2,2,2,2,2,3,2) Description(mygendata) <- "Serenity crew" ## Not run: # write files (use file="" to write to the console instead) write.GenoDive(mygendata, digits=2, file="testGenoDive2.txt") write.GenoDive(mygendata, digits=3, file="testGenoDive3.txt") ## End(Not run)
write.POPDIST
uses data from a "genambig"
object to write
a file formatted for the software POPDIST.
write.POPDIST(object, samples = Samples(object), loci = Loci(object), file = "")
write.POPDIST(object, samples = Samples(object), loci = Loci(object), file = "")
object |
A |
samples |
An optional character vector of samples to use. Must be a subset of
|
loci |
An optional character vector of loci to use. Must be a subset of
|
file |
Character string. File path to which to write. |
POPDIST is a program that calculates inter-population distance measures, some of which are available for polyploid samples with allele copy number ambiguity. Each population must be of uniform ploidy, but different populations may have different ploidies.
Two types of warning messages may be printed by write.POPDIST
.
The first indicates that a population contains individuals of more than
one ploidy. In this case a file is still written, but POPDIST may not
be able to read it. Separate populations with different ploidies are
okay. The second type of warning indicates that an individual has more
alleles than its ploidy level. If this occurs, alleles are randomly
removed from the genotype that is written to the file.
If necessary, write.POPDIST
converts alleles into a two-digit
format, similarly to write.Tetrasat
. If the value of any allele
for a given locus is greater than 99, the function first checks to see
if the locus has a Usatnts
value greater than 1, and if so
divides all alleles by this value and rounds down. If the locus still
has alleles with more than two digits, a multiple of 10 is subtracted
from all alleles. A zero is placed in front of any allele with one digit.
A file is written but no value is returned.
Lindsay V. Clark
Tomiuk, J., Guldbrandtsen, B. and Loeschcke, B. (2009) Genetic similarity of polyploids: a new version of the computer program POPDIST (version 1.2.0) considers intraspecific genetic differentiation. Molecular Ecology Resources 9, 1364-1368.
Guldbrandtsen, B., Tomiuk, J. and Loeschcke, B. (2000) POPDIST version 1.1.1: A program to calculate population genetic distance and identity measures. Journal of Heredity 91, 178–179.
read.POPDIST
, write.Tetrasat
,
write.ATetra
, write.SPAGeDi
,
write.GenoDive
, write.Structure
,
write.GeneMapper
# create a "genambig" object containing the dataset mygen <- new("genambig", samples=c("a", "b", "c", "d"), loci=c("loc1", "loc27")) mygen <- reformatPloidies(mygen, output="sample") Description(mygen) <- "Some example data for POPDIST" PopInfo(mygen) <- c(1,1,2,2) PopNames(mygen) <- c("Old Orchard Beach", "York Beach") Ploidies(mygen) <- c(2,2,4,4) Usatnts(mygen) <- c(2,2) Genotypes(mygen, loci="loc1") <- list(c(128, 134), c(130), Missing(mygen), c(126, 128, 132)) Genotypes(mygen, loci="loc27") <- list(c(209,211), c(207,217), c(207,209,215,221), c(211,223)) ## Not run: # write the file write.POPDIST(mygen, file="forPOPDIST.txt") # view the file cat(readLines("forPOPDIST.txt"), sep="\n") ## End(Not run)
# create a "genambig" object containing the dataset mygen <- new("genambig", samples=c("a", "b", "c", "d"), loci=c("loc1", "loc27")) mygen <- reformatPloidies(mygen, output="sample") Description(mygen) <- "Some example data for POPDIST" PopInfo(mygen) <- c(1,1,2,2) PopNames(mygen) <- c("Old Orchard Beach", "York Beach") Ploidies(mygen) <- c(2,2,4,4) Usatnts(mygen) <- c(2,2) Genotypes(mygen, loci="loc1") <- list(c(128, 134), c(130), Missing(mygen), c(126, 128, 132)) Genotypes(mygen, loci="loc27") <- list(c(209,211), c(207,217), c(207,209,215,221), c(211,223)) ## Not run: # write the file write.POPDIST(mygen, file="forPOPDIST.txt") # view the file cat(readLines("forPOPDIST.txt"), sep="\n") ## End(Not run)
write.SPAGeDi
uses data contained in a genambig
object
to create a file that can be read by the software SPAGeDi. The
user controls how the genotypes are formatted, and can provide a data
frame of spatial coordinates for each sample.
write.SPAGeDi(object, samples = Samples(object), loci = Loci(object), allelesep = "/", digits = 2, file = "", spatcoord = data.frame(X = rep(1, length(samples)), Y = rep(1, length(samples)), row.names = samples))
write.SPAGeDi(object, samples = Samples(object), loci = Loci(object), allelesep = "/", digits = 2, file = "", spatcoord = data.frame(X = rep(1, length(samples)), Y = rep(1, length(samples)), row.names = samples))
object |
A |
samples |
Character vector. Samples to write to the file. Must be a subset of
|
loci |
Character vector. Loci to write to the file. Must be a subset of
|
allelesep |
The character that will be used to separate alleles within a genotype.
If each allele should instead be a fixed number of digits, with no
characters to delimit alleles, set |
digits |
Integer. The number of digits used to represent each allele. |
file |
A character string indicating the path to which the file should be written. |
spatcoord |
Data frame. Spatial coordinates of each sample. Column names are used
for column names in the file. Row names indicate sample, or if absent
it is assumed that the rows are in the same order as |
The Categories column of the SPAGeDi file that is produced contains
information from the PopNames
and PopInfo
slots of
object
; the population name for each sample is written to the
column.
The first line of the file contains the number of individuals, number of categories, number of spatial coordinates, number of loci, number of digits for coding alleles, and maximum ploidy, and is generated automatically from the data provided.
The function does not write distance intervals to the file, but instead
writes 0
to the second line.
All alleles for a given locus are divided by the Usatnts
value
for that locus, after all missing data symbols have been replaced with
zeros. If necessary, a multiple of 10 is subtracted from all
alleles at a locus in order to get the alleles down to the right number
of digits.
If a genotype has fewer alleles than the Ploidies
value for
that sample and locus, zeros are added up to the ploidy. If the
genotype has more
alleles than the ploidy, a random subset of alleles is used and a
warning is printed. If the genotype has only one allele (is fully
heterozygous), then that allele is replicated to the ploidy of the
individual. Genotypes are then concatenated into strings,
delimited by allelesep
. If allelesep=""
, leading zeros
are first added to alleles as necessary to make them the right number of
digits.
A file is written but no value is returned.
Lindsay V. Clark
https://ebe.ulb.ac.be/ebe/SPAGeDi.html
Hardy, O. J. and Vekemans, X. (2002) SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618–620.
read.SPAGeDi
, write.freq.SPAGeDi
,
write.GenoDive
,
write.Structure
, write.GeneMapper
,
write.ATetra
, write.Tetrasat
,
write.POPDIST
# set up data to write (usually read from a file) mygendata <- new("genambig", samples = c("ind1","ind2","ind3","ind4"), loci = c("loc1", "loc2")) mygendata <- reformatPloidies(mygendata, output="sample") Genotypes(mygendata, samples="ind1") <- list(c(102,106,108),c(207,210)) Genotypes(mygendata, samples="ind2") <- list(c(104),c(204,210)) Genotypes(mygendata, samples="ind3") <- list(c(100,102,108),c(201,213)) Genotypes(mygendata, samples="ind4") <- list(c(102,112),c(-9)) Ploidies(mygendata) <- c(3,2,2,2) Usatnts(mygendata) <- c(2,3) PopNames(mygendata) <- c("A", "B") PopInfo(mygendata) <- c(1,1,2,2) myspatcoord <- data.frame(X=c(27,29,24,30), Y=c(44,41,45,46), row.names=c("ind1","ind2","ind3","ind4")) ## Not run: # write a file write.SPAGeDi(mygendata, spatcoord = myspatcoord, file="SpagOutExample.txt") ## End(Not run)
# set up data to write (usually read from a file) mygendata <- new("genambig", samples = c("ind1","ind2","ind3","ind4"), loci = c("loc1", "loc2")) mygendata <- reformatPloidies(mygendata, output="sample") Genotypes(mygendata, samples="ind1") <- list(c(102,106,108),c(207,210)) Genotypes(mygendata, samples="ind2") <- list(c(104),c(204,210)) Genotypes(mygendata, samples="ind3") <- list(c(100,102,108),c(201,213)) Genotypes(mygendata, samples="ind4") <- list(c(102,112),c(-9)) Ploidies(mygendata) <- c(3,2,2,2) Usatnts(mygendata) <- c(2,3) PopNames(mygendata) <- c("A", "B") PopInfo(mygendata) <- c(1,1,2,2) myspatcoord <- data.frame(X=c(27,29,24,30), Y=c(44,41,45,46), row.names=c("ind1","ind2","ind3","ind4")) ## Not run: # write a file write.SPAGeDi(mygendata, spatcoord = myspatcoord, file="SpagOutExample.txt") ## End(Not run)
Given a dataset stored in a genambig
object,
write.Structure
produces a text file of the genotypes in a format
readable by Structure 2.2 and higher. The user specifies the overall
ploidy of the file, while the ploidy of each sample is extracted from
the genambig
object. PopInfo
and other data can
optionally be written to the file as well.
write.Structure(object, ploidy, file="", samples = Samples(object), loci = Loci(object), writepopinfo = TRUE, extracols = NULL, missingout = -9)
write.Structure(object, ploidy, file="", samples = Samples(object), loci = Loci(object), writepopinfo = TRUE, extracols = NULL, missingout = -9)
object |
A |
ploidy |
PLOIDY for Structure, i.e. how many rows per individual to write. |
file |
A character string specifying where the file should be written. |
samples |
An optional character vector listing the names of samples to be written to the file. |
loci |
An optional character vector listing the names of the loci to be written to the file. |
writepopinfo |
|
extracols |
An array, with the first dimension names
corresponding to |
missingout |
The number used to indicate missing data. |
Structure 2.2 and higher can process autopolyploid microsatellite data,
although 2.3.3 or higher is recommended for its improvements on
polyploid handling. The input format of Structure requires that
each locus take up one column and that each individual take up as
many rows as the parameter PLOIDY. Because of the multiple rows per
sample, each sample name must be duplicated, as well as any
population, location, or phenotype data. Partially heterozygous
genotypes also must have one arbitrary allele duplicated up to the
ploidy of the sample, and samples that have a lower ploidy than that
used in the file (for mixed polyploid data sets) must have a missing
data symbol inserted to fill in the extra rows. Additionally, if
some samples have more alleles than PLOIDY (if you are using a lower
PLOIDY to save processing time, or if there are extra alleles from
scoring errors), some alleles must be randomly removed from the data.
write.Structure
performs this duplication, insertion, and random
deletion of data.
The sample names from samples
will be used as row
names in the Structure file. Each sample name should only be in the
vector samples
once, because write.Structure
will duplicate
the sample names a number of times as dictated by ploidy
.
In writing genotypes to the file, write.Structure
compares the number
of alleles in the genotype, the ploidy of the sample*locus as stored in
Ploidies
, and the ploidy of the file as stored in
ploidy
, and does one of six things (for a given sample x and
locus loc):
1) If Ploidies(object,x,loc)
is greater than or equal to
ploidy
, and
length(Genotype(object, x, loc))
is equal to ploidy
, the
genotype data are used as is.
2) If Ploidies(object,x,loc)
is greater than or equal to
ploidy
, and
length(Genotype(object, x, loc))
is less than ploidy
,
the first allele is
duplicated as many times as necessary for there to be as many alleles
as ploidy
.
3) If Ploidies(object,x,loc)
is greater than or equal to
ploidy
, and length(Genotype(object, x, loc))
is greater
than ploidy
, a random sample of
the alleles, without replacement, is used as the genotype.
4) If Ploidies(object,x,loc)
is less than ploidy
, and
length(Genotype(object, x, loc))
is equal to
Ploidies(object,x,loc)
, the genotype
data are used as is and missing data symbols are inserted in the extra
rows.
5) If Ploidies(object,x,loc)
is less than ploidy
, and
length(Genotype(object, x, loc))
is less than
Ploidies(object,x,loc)
, the first
allele is duplicated as many times as necessary for there to be as
many alleles as Ploidies(object,x,loc)
, and missing data symbols
are inserted in the extra rows.
6) If Ploidies(object,x,loc)
is less than ploidy
, and
length(Genotype(object, x, loc))
is greater than
Ploidies(object,x,loc)
, a random
sample, without replacement, of Ploidies(object)[x]
alleles is
used, and
missing data symbols are inserted in the extra rows. (Alleles are
removed even though there is room for them in the file.)
Two of the header rows that are optional for Structure are written by
write.Structure
. These are ‘Marker Names’, containing the
names of loci supplied in gendata
, and ‘Recessive
Alleles’, which contains the missing data symbol once for each locus.
This indicates to the program that all alleles are codominant with
copy number ambiguity.
No value is returned, but instead a file is written at the path specified.
If extracols
is a character array, make sure none of the
elements contain white space.
Lindsay V. Clark
Hubisz, M. J., Falush, D., Stephens, M. and Pritchard, J. K. (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 9, 1322-1332.
Falush, D., Stephens, M. and Pritchard, J. K. (2007) Inferences of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes 7, 574-578.
read.Structure
, write.GeneMapper
,
write.GenoDive
, write.SPAGeDi
,
write.ATetra
, write.Tetrasat
,
write.POPDIST
# input genotype data (this is usually done by reading a file) mygendata <- new("genambig", samples = c("ind1","ind2","ind3", "ind4","ind5","ind6"), loci = c("locus1","locus2")) Genotypes(mygendata) <- array(list(c(100,102,106,108,114,118),c(102,110), c(98,100,104,108,110,112,116),c(102,106,112,118), c(104,108,110),c(-9), c(204),c(206,208,210,212,220,224,226), c(202,206,208,212,214,218),c(200,204,206,208,212), c(-9),c(202,206)), dim=c(6,2)) Ploidies(mygendata) <- c(6,6,6,4,4,4) # Note that some of the above genotypes have more or fewer alleles than # the ploidy of the sample. # create a vector of sample names to be used. Note that this excludes # ind6. mysamples <- c("ind1","ind2","ind3","ind4","ind5") # Create an array containing data for additional columns to be written # to the file. You might also prefer to just read this and the ploidies # in from a file. myexcols <- array(data=c(1,2,1,2,1,1,1,0,0,0),dim=c(5,2), dimnames=list(mysamples, c("PopData","PopFlag"))) # Write the Structure file, with six rows per individual. # Since outfile="", the data will be written to the console instead of a file. write.Structure(mygendata, 6, "", samples = mysamples, writepopinfo = FALSE, extracols = myexcols)
# input genotype data (this is usually done by reading a file) mygendata <- new("genambig", samples = c("ind1","ind2","ind3", "ind4","ind5","ind6"), loci = c("locus1","locus2")) Genotypes(mygendata) <- array(list(c(100,102,106,108,114,118),c(102,110), c(98,100,104,108,110,112,116),c(102,106,112,118), c(104,108,110),c(-9), c(204),c(206,208,210,212,220,224,226), c(202,206,208,212,214,218),c(200,204,206,208,212), c(-9),c(202,206)), dim=c(6,2)) Ploidies(mygendata) <- c(6,6,6,4,4,4) # Note that some of the above genotypes have more or fewer alleles than # the ploidy of the sample. # create a vector of sample names to be used. Note that this excludes # ind6. mysamples <- c("ind1","ind2","ind3","ind4","ind5") # Create an array containing data for additional columns to be written # to the file. You might also prefer to just read this and the ploidies # in from a file. myexcols <- array(data=c(1,2,1,2,1,1,1,0,0,0),dim=c(5,2), dimnames=list(mysamples, c("PopData","PopFlag"))) # Write the Structure file, with six rows per individual. # Since outfile="", the data will be written to the console instead of a file. write.Structure(mygendata, 6, "", samples = mysamples, writepopinfo = FALSE, extracols = myexcols)
Given a genambig
object,
write.Tetrasat
creates a file that can be read by the software
Tetrasat and Tetra.
write.Tetrasat(object, samples = Samples(object), loci = Loci(object), file = "")
write.Tetrasat(object, samples = Samples(object), loci = Loci(object), file = "")
object |
A |
samples |
A character vector of samples to write to the file. Should be a subset
of |
loci |
A character vector of loci to write to the file. Should be a subset of
|
file |
A character string indicating the file to which to write. |
Tetrasat files are space-delimited text files in which all alleles at a locus are concatenated into a string eight characters long. Population names or numbers are not used in the file, but samples are ordered by population, with the line “Pop” delimiting populations.
write.Tetrasat
divides each allele by the length of the repeat and
rounds down in order to convert alleles to repeat numbers. If
necessary, it subtracts a multiple of 10 from all alleles at a locus to
make all allele values less than 100, or puts a zero in front of the
number if it only has one digit. If the individual is fully homozygous
at a locus, the single allele is repeated four times. If any genotype
has more than four alleles, write.Tetrasat
picks a random sample of
four alleles without replacement, and prints a warning. Missing data
are represented by blank spaces.
Sample names should be a maximum of 20 characters long in order for the file to be read correctly by Tetrasat or Tetra.
A file is written but no value is returned.
Lindsay V. Clark
Markwith, S. H., Stewart, D. J. and Dyer, J. L. (2006) TETRASAT: a program for the population analysis of allotetraploid microsatellite data. Molecular Ecology Notes 6, 586-589.
Liao, W. J., Zhu, B. R., Zeng, Y. F. and Zhang, D. Y. (2008) TETRA: an improved program for population genetic analysis of allotetraploid microsatellite data. Molecular Ecology Resources 8, 1260–1262.
read.Tetrasat
,
write.GeneMapper
, write.ATetra
,
write.POPDIST
# set up sample data (usually done by reading files) mysamples <- c("ind1", "ind2", "ind3", "ind4") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples = mysamples, loci = myloci) mygendata <- reformatPloidies(mygendata, output="one") Usatnts(mygendata) <- c(2, 3) Genotypes(mygendata, loci="loc1") <- list(c(202,204), c(204), c(200,206,208,212), c(198,204,208)) Genotypes(mygendata, loci="loc2") <- list(c(78,81,84), c(75,90,93,96,99), c(87), c(-9)) PopInfo(mygendata) <- c(1,2,1,2) Description(mygendata) <- "An example for write.Tetrasat." Ploidies(mygendata) <- 4 ## Not run: # write a Tetrasat file write.Tetrasat(mygendata, file="tetrasattest.txt") # view the file cat(readLines("tetrasattest.txt"),sep="\n") ## End(Not run)
# set up sample data (usually done by reading files) mysamples <- c("ind1", "ind2", "ind3", "ind4") myloci <- c("loc1", "loc2") mygendata <- new("genambig", samples = mysamples, loci = myloci) mygendata <- reformatPloidies(mygendata, output="one") Usatnts(mygendata) <- c(2, 3) Genotypes(mygendata, loci="loc1") <- list(c(202,204), c(204), c(200,206,208,212), c(198,204,208)) Genotypes(mygendata, loci="loc2") <- list(c(78,81,84), c(75,90,93,96,99), c(87), c(-9)) PopInfo(mygendata) <- c(1,2,1,2) Description(mygendata) <- "An example for write.Tetrasat." Ploidies(mygendata) <- 4 ## Not run: # write a Tetrasat file write.Tetrasat(mygendata, file="tetrasattest.txt") # view the file cat(readLines("tetrasattest.txt"),sep="\n") ## End(Not run)