First, we load our packages:
Since you are now familiar with R, we can proceed to our first lab in evolutionary processes. Today, we will focus on the fundamental entity of evolution: genes. We will first review how the laws of probability can be used to build very simple (but powerful!) models of genotypic evolutionary change. Then we will calculate some properties of and test hypotheses about such change. We will finish the lab by introducing methods to measure the genetic diversity generated by such processes.
It is important to note that we will change conventions quickly in this tutorial, meaning that the same letter might have different meanings in different equations, so be mindful of which section you are in.
The first notation we will introduce is the summation, denoted by the uppercase Greek letter sigma (∑). Usually, the subscript of this letter tells you from which element to start the operation, and the upperscript tells you at which element the summation ends. For instance the notation:
indicates a summation of elements starting at the first (i = 1) and
going until the n-th element (the upperscript). It is left implicit in
this notation that the summation is made element by element, but
sometimes this information might appear explicitly. If you are familiar
with R’s for
loops, it may be useful to think of the
summation as representing a for
loop that iterates from the
sub- to the upperscript, each time applying the operation and adding its
result to the running sum.
A familiar example is the formula to calculate the arithmetic mean:
As another example, take the following equation, which describes the expected number of heterozygotes in a diploid population with k alleles in Hardy-Weinberg equilibrium:
where k is the number of alleles, and pi and pj are the frequencies of the ith and jth alleles. Note that on the right side of the equality, we sum the product of all allele frequencies. The inequality i ≠ j tells you to not apply the summation when i = j. This kind of explicit information may appear in a summation.
So, in the case of k = 3, we would have the following equation:
$H_{k=3} = \sum_{i=1 ; i\neq{j}}^{3}p_ip_j = (p_1 \times p_2) + (p_1 \times p_3) + (p_2 \times p_1) + (p_2 \times p_3) + (p_3 \times p_1) + (p_3 \times p_2)$
Just for the sake of completion, one could also say that:
$H = 1-G = 1- (\sum_{i=1}^{k}p_i^2)$
There is an equivalent of the summation for multiplication, and it is symbolized by the uppercase Greek letter pi (∏). So, the probability P of n independent events v, each having the same probability Pv is:
With v representing each event. Note that the upper- and subscript are read in the same way as in the summation.
The joint probability of independent events is the multiplication of each respective probability (a quick video recap on that here).
So, the probability of seeing “snake eyes” (i.e. two “ones”) when you throw two dice is:
The notation in the middle equality above might seem overly complicated when compared to the left side of the equality, but the compactness of this notation becomes clear when we want to multiply many elements. R has functions that allow you to “throw dice” and study the probability of events. These are the functions that generate random numbers from probability functions.
Using functions that generate random numbers, we can simulate much more than just coins and dice. R has functions that generate random draws from famous probability distributions.
To explore this concept, we will focus on one of these famous distributions: the binomial distribution. The binomial distribution applies for samples of events that have a binomial outcome, such as coin flips which result in either a “heads” or “tails” outcome. In the jargon of the field, we call one of these outcomes a “success” and the other one a “failure”; this makes more sense in the coin flip analogy if we imagine that we are using the coin to determine the result of some outcome, and one of the sides has the more favorable outcome (e.g. eat a piece of chocolate only when the coin lands on heads, no chocolate on tails).
The binomial distribution has two major properties:
The binomial distribution is therefore a distribution of integers that each represent how many successes there were within a sample of binomial events. The distribution is made by taking many repeated such samples. Each time we generate a random number from a binomial distribution, we are generating an integer x between 0 (no successes occurred) and the total number of coin flips n (only successes occurred). In the coin tossing example, x is equivalent to counting heads out of n coin tosses.
Like all probability functions, the binomial distribution has certain parameters. These are: p, the probability of success for a single binary event, and n, the total number of binary events. In the coin flip example, p would be 0.5, representing a 50% chance of getting heads. n would be however many times we flip the coin.
Let’s simulate a coin flip. Recall that the binomial distribution outputs the number of successes from a sample of events, so this is what each generated random number represents.
We use the rbinom
function to randomly generate numbers
from the binomial distribution. rbinom
is in the r-family
of functions, the collection of functions that randomly generate numbers
from probability distributions.
A potential point of confusion is that, in rbinom
, the
argument size
represents the parameter n discussed above, while the
argument n
actually represents the total number of times we
repeated sampling. The argument prob
represents the
parameter p discussed above,
the probability of success – much more straightforward. The line of code
above represents flipping a coin (with a 50% chance of heads) 10 times
and counting the number of heads flipped a total of rr
times.
If we check what the object binom_nbrs
contains, we will
see…
## [1] 7 6 5 5 7 5 5 7 6 3
… a bunch of integers between 0 and 10. Each of these integers represents the number of heads flipped in a sample of 10 coin tosses.
As the binomial is a discrete probability function, it is meaningful to tabulate the frequencies of each number we got from the random draw we just made:
## binom_nbrs
## 0 1 2 3 4 5 6 7 8 9 10
## 6 101 461 1138 2079 2425 2082 1161 439 100 8
We can also plot what we did very easily.
Above are the random numbers we generated – i.e., the stochastic realizations of the process that follows the distribution we chose. Each number represents the summation of the number of successes per sample. But what if instead of having a stochastic realization of the binomial process, we wanted to know what is the fixed probability associated with each specific number of successes?
For this, we would use the d-family of functions! These calculate the probability density associated with an event. An event in this case is a number of successes. For instance, to directly calculate the probability of observing, say, 3 heads out of a sample of 10 coin tosses, we would write:
P_x3_s10_p0.5 <- dbinom(x = 3, size = 10, prob = 0.5)
# x is the integer value we want to know the probability density of
P_x3_s10_p0.5
## [1] 0.1171875
This function basically just applies the probability mass function of the binomial distribution, which is
with: $\binom{n}{x} = \frac{n!}{x!(n-x)!}$
Knowing that dinom
is simply using the function written
above, you could manually calculate the expected probability density for
x = 3, and check to make sure
it is the same as what dbinom
gives you. To do that, use
equation (6) and substitute n
with 10 and x with 3 – you
should get exactly the same number as the output of dbinom
,
which is 0.1171875.
There is a difference here between the stochastic realization of a
process, acquired with rbinom
, and the theoretically
expected outcome of a process, acquired with dbinom
. If we
draw many random numbers, we should be able to approximate the expected
probability by calculating the proportion of each outcome.
# comparing the stochastic realization of x = 3 with the expectation for x = 3
table(binom_nbrs)[4] / rr #why the division?
## 3
## 0.1138
## [1] 0.1171875
What do you think? Is it a good approximation? How could we improve it?
We can also plot the random draws we made:
The idea of geometric/exponential/Malthusian growth is of major importance for evolutionary biology. You have probably learnt of this concept before, but we will reiterate it here. Although often called “Malthusian growth”, the basic analytic approach was first developed by Euler and published in 1748. He modeled population growth with equation (7):
where x is the growth rate of the population or the fractional increase per year, N is the population size (in number of individuals), and t is time.
Instead of relying on a function to determine population size in the future, we can simulate population growth across time.
First we decide on an initial value for our population size. This is the number of hermaphrodite individuals, or the number of reproductive females in the population.
Let’s also construct a vector to store time.
We need to decide on a value for R.
We also have to decide for how far in time our projections of population size will go.
Let’s also create a vector that stores the generations we will simulate through.
Now we will construct what in programming is called a loop. This is just a series of very similar steps repeated. It is like running the same chunk of code again and again, but with small changes in inputs or parameter values (an explanation can be found here). We create a loop as below.
# For each generation, beginning in the generation = 1 and:
for(generation in all_generations){
N_t <- popsize[length(popsize)] # by indexing
# the vector in this way (i.e. using "[length()]"),
# we guarantee we are always taking the last value of
# this vector.
N_t_plus_one <- N_t * R # this is the application of Euler's formula
#Now, let's store the population size at that generation:
popsize <- c(popsize, N_t_plus_one)
#let's not forget to record the time that we are on:
time <- c(time, generation)
}
Notice that the way our loop works is completely dependent on the way
we set up all objects (i.e., popsize
,
time
, R
, tmax
, and so on) before.
If you change the loop or change how these objects are
structured/organized, your code may return an error message (or not give
you the right result). It is safe though to change the numeric values
in those objects to explore how exponential growth behaves at
different values.
Now, we will visualize the results of the simulation we just did.
Let’s first create an empty plot, which includes providing the numerical range of the x and y axes of the plot. Then, we will add a line describing the size of the population over time.
#
plot(NA, xlim=c(0,tmax), ylim=c(0,1500),
xlab="time", ylab="Population size",
main="Malthusian growth")
lines(x = time, y = popsize, col = "blue", lwd = 3)
Try to change the numbers inside the arguments xlim
and
ylim
. How does this change the plot?
Now, make a plot of exponential growth in four populations: one where
R < 1, one where R = 1, one where R > 1, and one where R > x, where x is a
number of your choice that must be greater than 1. Color the populations
using increasing darker shades of your chosen color. (Names of colors in
R can be found here).
Choose x and y limits (arguments xlim
and
ylim
, respectively, in the plot()
function)
that you think best describes qualitatively the difference(s) between
the 4 lines.
Repeat the same plot as above, but this time plot the y axis on a log scale.
We will now use simulations to test if the observed set of genotype counts is consistent with what we would see if we sampled genotypes at random. We will first create a null pattern using R functions, then we will contrast this pattern with the observed genotype frequencies and compare them to expectations under Hardy-Weinberg equilibrium (hereafter, HWE, Hardy, 1908; Weinberg, 1908). For a historical perspective on HWE, we recommend reading Mayo (2008).
For the purposes of this lab, we will look to a bear population in Canada.
Hedrick & Ritland (2012) studied a population of Ursus americanus where a “single nucleotide change from G to A, resulting in the replacement of Tyr to Cys at codon 298 in the melanocortin 1 receptor gene (mc1r)” created a whole new phenotype related to color: the “Spirit bear” variety (see figure 1, from Hedrick & Ritland (2012)).
In a previous study (Ritland et al, 2001), the following frequencies of genotypes were observed: 42 individuals with the dominant homozygote phenotype, 24 heterozygotes, and 21 recessive genotypes. Hedrick & Ritland (2012) asks: are the Spirit bear frequencies a result of neutral processes? This is equivalent of asking is this genotype frequency a simple reflex of Hardy-Weinberg Equilibrium (HWE)?
Let’s start our investigation with a simpler question: what is the allele frequency of this population? To have an idea, we would need to count each allele’s relative frequency in the population genetic pool.
Let’s consider A1 (the dominant allele): first we would need to count the number of A1 alleles by finding out (a) how many copies of that allele each genotype carries, and then (b) multiplying that by the number of individuals with each relevant genotype. So, applying this to the data on the Spirit bear:
counts(A1) = (42 * 2) + (24 * 1) + (21 * 0) = 108
We sum all these terms and divide this sum by the total number of alleles in the population to get the allele frequency. As this gene has only two alleles, we get:
So: $f(A_1) = \frac{108}{(108+66)} = 0.617$
Using the allele frequencies we calculated above, we can use statistics to answer the question posed by Hedrick & Ritland (2012).
To do this, we will do a test that resembles a chi-square test. The first step is to decide upon a test statistic – i.e., some measure we care about. For instance, we could pick the sum of the squared difference between each observed genotype frequency and the HWE expected frequency:
where Og is the observed number of individuals and Eg is the expected number of individuals. This calculation is performed for each genotype g (A is the total number of genotypes).
Considering all HWE assumptions, we should expect the frequencies to be p2 for one homozygote, q2 for the second homozygote, and 2pq for the heterozygotes. As p = 0.617, and q = 1 − p, our statistic is calculated as:
Estat = (42 − (0.38 * 87))2 + (24 − (0.47 * 87))2 + (21 − (0.15 * 87))2 = 428.3982
Note that the value above uses the rounded allele frequencies for simplicity. However, if you calculate such frequencies using R, the calculation will be very precise and the lack of rounding will change the number of Estat.
Once we calculate this value, we need to know: is the value of our statistic big enough to reject HWE? We do not know yet.
To proceed, we have to create a null distribution of our statistic of interest (Estat). To construct the null distribution, we just have to know what would be the distribution of our statistic of interest if all assumptions of HWE are met. We will meet these assumptions in our simulations. Note that this is basically the construction of a virtual Wright-Fisher population (more details on this in future lectures).
To make it easier to simulate data, we will use the built-in
simulations from evolved
, the function
OneGenHWSim
. It takes three arguments:
n.ind
: the number of individuals in the population
n.sim
: the number of simulations you want to do
p
: the allele frequency of one of the two alleles
So, if we want to study HWE frequencies (in a Wright-Fisher) using 5 simulated populations of 100 individuals each, all having an allele frequency f(A1) = 0.467, we would write:
sim_pops <- OneGenHWSim(n.ind = 100, n.sim = 5, p = 0.467)
#now, checking the result of our simulations:
sim_pops
## A1A1 A1A2 A2A2
## sim_1 24 45 31
## sim_2 16 62 22
## sim_3 23 48 29
## sim_4 22 49 29
## sim_5 20 54 26
What the result above shows is a table with the number of individuals with each genotype at the end of the simulations. We can use this to calculate our statistic of interest. The following steps will walk you through this process.
Do many simulations of populations that have the same number of individuals and allele frequencies as the Kermode bear population studied by Hedrick & Ritland (2012).
Calculate Estat for every single one of your many populations. A good tip here is to remember that R is a vector calculator, so you can apply the same operation to all elements of a column in a data frame. If you don’t remember how R does vectorized calculations, you can check the Lab_00.pdf file, or see below:
# To remember how this works, let's imagine you want to
# arbitrarily calculate (f(A1) + 3) / 4.5 for all your simulations.
#You should run:
freqs_A1 <- sim_pops$A1A1 / (sim_pops$A1A1 + sim_pops$A1A2 + sim_pops$A2A2)
result <- (freqs_A1 + 3) / 4.5
result
## [1] 0.7200000 0.7022222 0.7177778 0.7155556 0.7111111
# the above has no biological "meaning".
# it is just to remind you how vectorized calculation works
Visualize what we did. To do this, first use the function
hist()
to plot the histogram of the set of Estat
values that came from our simulations. This represents the distribution
of the likely values of Estat
if the population is following HW assumptions, adjusted to the
particularities (i.e., the “parameters”) of the empirical population.
Where would the empirical (observed) value of Estat
be on the x axis?
Measure quantitatively the likelihood of our empirical value of Estat being generated by pure chance. This is equivalent of asking: what is the probability of, by pure chance alone, observing the empirical value of our statistic (or a value even more extreme than that)?
Can you think of some very simple code that quantifies the proportion of simulations that have an Estat value equal or larger than the empirical value? Maybe one that uses a logical test?
n.sim
)abline
to mark relevant values in your
histogram using vertical (argument v
) or horizontal
(argument h
) lines. Hint: make sure the line you plot is
inside the values of your axes! To manually change your axes, use the
argument xlim = c(A,B)
, where A
is the left
limit of your x axis, and B
is the right limit of your x
axis.Now, use the procedure above to answer question number five:
The first challenge of this lab is recommended for students that feel confident with their R skills:
As defined in the lecture, “heterozygosity” might mean two things:
As you saw in the lecture, some of the most remarkable first studies of genetic variation are Hubby & Lewontin (1966) and Lewontin & Hubby (1966). To deepen your perspective on these studies and their historical and current applications, we recommend reading Charlesworth et al (2016).
Genetic diversity has had an important role in the study of biological diversity. For the purposes of this class, we will take an extremely simplifying approach and calculate different, but simple statistics of “genetic diversity”.
Consider the following aligned DNA sequences from 3 genes (genes separated by space):
Individual 1 ACCGTA AAAAAT CTTATA
Individual 2 AGCGGA CATAAT CTTATA
Individual 3 ACCGTA AAAAAT CTACTA
Individual 4 ACCGGA AAAAAT CTACTA
Charlesworth, B., Charlesworth, D., Coyne, J. A., & Langley, C. H. (2016). Hubby and Lewontin on protein variation in natural populations: when molecular genetics came to the rescue of population genetics. Genetics, 203(4), 1497-1503.
Hardy, G. H. (1908). Mendelian proportions in a mixed population. Science, 28, 49–50.
Hedrick, P. W., & Ritland, K. (2012). Population genetics of the white‐phased “Spirit” black bear of British Columbia. Evolution: International Journal of Organic Evolution, 66(2), 305-313.
Hubby, J. L., & Lewontin, R. C. (1966). A molecular approach to the study of genic heterozygosity in natural populations. I. The number of alleles at different loci in Drosophila pseudoobscura. Genetics, 54(2), 577.
Lewontin, R. C., & Hubby, J. L. (1966). A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of heterozygosity in natural populations of Drosophila pseudoobscura. Genetics, 54(2), 595.
Mayo, O. (2008). A century of Hardy–Weinberg equilibrium. Twin Research and Human Genetics, 11(3), 249-256.
Ritland, K., C. Newton, and H. D. Marshall. 2001. Inheritance and population structure of the white-phased “Kermode” black bear. Curr. Biol. 11:1468–1472.
Weinberg, W. (1908). Uber den Nachweis der Vererbung beim Menschen. Jahreshefte des Vereins fur vaterlandische Naturkunde in Wurttemberg, Stuttgart 64:369–382. [On the demonstration of inheritance in humans]. Translation by R. A. Jameson printed in D. L. Jameson (Ed.), (1977). Benchmark papers in genetics, Volume 8: Evolutionary genetics (pp. 115–125). Stroudsburg, PA: Dowden, Hutchinson & Ross.