One consequence of speciation being a population event is that populations have genetic diversity – not all members of the population are genetically identical. For any particular gene, then, a population may have several slightly different forms present within it. These different forms are called alleles. An example in humans that is fairly well-known is the different alleles that control blood types: one allele gives rise to the A type, another to the B type, and a third allele the O type. Individuals may be either blood type A (either two A alleles or A + O); blood type B (either two B alleles or B + O); type AB (one A allele + one B allele) or type O (two O alleles). Any one individual can have only two alleles of this gene (one from mom, the other from dad), but as a population we collectively maintain all three. Other human genes have many more alleles than three (for example, some genes of the immune system have hundreds of alleles) despite the fact that any given individual can have at most two. The larger a population is, the more alleles of a given gene it can maintain. Smaller populations are more at risk of losing alleles due to chance (something called genetic drift).
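The link between population size and allele retention can be illustrated with a small simulation. The sketch below assumes a neutral Wright-Fisher model (random mating, non-overlapping generations, no mutation or selection), which is a standard simplification and not something taken from this post. The function names, population sizes and generation counts are illustrative choices only.

```python
import random


def surviving_alleles(pop_size, n_alleles=3, generations=100, seed=0):
    """Count distinct alleles left after neutral drift.

    pop_size is the number of diploid individuals, so the gene pool
    holds 2 * pop_size copies. Alleles start at (near) equal frequency.
    """
    rng = random.Random(seed)
    copies = 2 * pop_size
    pool = [i % n_alleles for i in range(copies)]
    for _ in range(generations):
        # Every gene copy in the next generation is drawn at random
        # (with replacement) from the current generation's copies.
        pool = rng.choices(pool, k=copies)
    return len(set(pool))


def mean_surviving(pop_size, replicates=20, **kwargs):
    """Average the surviving-allele count over independent replicate runs."""
    counts = [surviving_alleles(pop_size, seed=s, **kwargs)
              for s in range(replicates)]
    return sum(counts) / replicates


if __name__ == "__main__":
    for size in (10, 500):
        print(size, mean_surviving(size))
```

Under these assumptions, the small population tends to lose alleles over the simulated interval while the large one tends to keep them, which is the pattern described above as genetic drift.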
The fact that populations maintain genetic diversity is important to remember when considering speciation. Speciation events are commonly represented with branching tree diagrams (“phylogenies”, or “species trees”) such as this one:
Here we see that Species 1 and Species 2 are more closely related to each other than they are to Species 3. What this says is that Species 1 and Species 2 shared a common ancestral population more recently with each other than either did with Species 3. So far, so good – but this doesn’t mean that comparing gene sequences between these species will always group 1 & 2 together as more similar to each other than to 3. While this will be true most of the time, it is expected that some of the time this pattern will not hold. The reason is something called incomplete lineage sorting, and it has to do with the fact that populations going their separate ways carry genetic diversity with them. Let’s try to explain what is going on here.
Imagine that the ancestral population of all three species (the 1,2,3 common ancestor) has four alleles of a certain gene (represented by different colors in the diagram). These alleles originally arose due to a single mutational difference during DNA copying. Once there is a difference in place, two alleles can go on to acquire other differences over time, again, through copying errors. As a result, alleles can be compared to each other, just like species. Alleles that are recently separated will have more similarities in common, and alleles that have been separate for longer will have acquired more differences. In this example, the blue and green alleles are more similar to each other than either is to red or orange, and vice versa. The blue and green alleles arose from a common ancestral allele, and the red and orange alleles arose from a common ancestral allele. Further back in time, these two ancestral alleles themselves arose from one common starting allele. All four alleles will have a great deal in common (nucleotide sequences inherited from the single ancestral allele), as well as differences (for example, the red and orange alleles will share all changes that occurred between the time they split off from the blue/green lineage and when they themselves separated into two distinct alleles).
Now consider the time when the (1,2,3 common ancestor) population divides to become the (1,2 common ancestor) species and the Species 3 ancestor (the first branch in the diagram). As this population divides into two species, it is not guaranteed that all four alleles will be present in the founding population of each new species, simply by chance. Each founding population is a sample of the original population, but any given sample may omit certain alleles:
In the example above, we see that the red allele has been lost from the (1,2 common ancestor) species, and that the Species 3 ancestor has lost the blue and orange alleles. What this means is that the founding population of the (1,2 common ancestor) species didn’t have any individuals that carried the red allele, and that the Species 3 ancestor founding population didn’t have any individuals that had the blue or orange alleles. Both events happened simply by chance, because the founding populations are not representative samples of the original population.
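How likely is it that a founding group misses an allele entirely? A minimal sketch, assuming each diploid founder contributes two gene copies and that copies are independent random draws from a large source population (an assumption made here for illustration, not a claim from the example above). The function name and the numbers are hypothetical.

```python
def prob_allele_missing(allele_freq, n_founders):
    """Probability that no founder carries a given allele.

    Each diploid founder contributes 2 gene copies; copies are treated
    as independent draws from a large source population (an assumption).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    # All 2 * n_founders copies must be some other allele.
    return (1.0 - allele_freq) ** (2 * n_founders)


if __name__ == "__main__":
    for founders in (5, 10, 50, 100):
        print(founders, prob_allele_missing(0.05, founders))
```

With these illustrative numbers, an allele at 5% frequency is missed by a group of 10 founders roughly a third of the time, while a group of 100 founders almost never misses it.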
Later, as the (1,2 common ancestor) species separates again into Species 1 and Species 2, the same issues arise. The two founding populations may not transmit all of the genetic diversity of the (1,2 common ancestor) population:
In this case, the founding population leading to Species 1 did not include a member with the green allele, and the founding population leading to Species 2 did not include any members with either blue or orange alleles. Also, the green allele has been lost in the lineage leading to Species 3 (it became rare and was eventually not passed on due to chance).
In the present day, examining the alleles of the three modern species will reveal different levels of similarity. The blue allele is now only found in Species 1, and it is most similar to the green allele in Species 2, and less similar to the red allele in Species 3. This pattern matches the overall “species tree” pattern for these three species:
The orange allele in Species 1, however, tells a different story: it is most similar to the red allele in Species 3, and less similar to the green and blue alleles. If we knew only about the orange allele in Species 1, we might conclude that Species 1 and Species 3 are the closest relatives. This is because the “gene tree” for these alleles places orange closest with red, even though the true “species tree” reveals an overall pattern of speciation that is different:
The orange allele thus has a gene phylogeny that is said to be “discordant” with the overall species phylogeny.
It might seem from the above discussion that assembling a species phylogeny from gene phylogenies is a hopeless task: after all, if any individual gene tree might be misleading, how can we be certain we have the correct species tree?
The solution is to realize that while any individual gene tree might be discordant, gene trees that match the species tree will be the most common category. In our example above, Species 1 and Species 2 share a common ancestral population for some time after the (1,2 common ancestor) and the Species 3 common ancestor populations diverge. This means that any event that happens to this population (loss of an allele, for example) will be reflected in all descendant species (in our example, Species 1 and Species 2). This common history favors gene trees that match the species tree. For a discordant tree, the ancestral (1,2) population needs to maintain two alleles, and these alleles cannot sort equally into Species 1 and 2. This can happen, but it is less likely.
What this means in practice is that biologists expect a certain pattern of gene trees when comparing related organisms. Using our three species as an example, most gene trees should match the species tree. The less likely outcome is a gene tree where an allele from Species 1 is more similar to the allele in Species 3. We can be confident we have the correct species tree because the majority of the gene trees favor one species tree over the alternatives.
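The claim that concordant gene trees form the largest category can be made quantitative. A minimal sketch, assuming the standard neutral coalescent result for three species (not derived in this post): if the (1,2) ancestral population persists for t generations with effective size Ne, the two lineages fail to coalesce within it with probability e^(-t/(2Ne)), after which all three pairings are equally likely. The function and parameter names are hypothetical.

```python
import math


def gene_tree_probabilities(t_generations, ne):
    """Topology probabilities for one sampled allele per species.

    t_generations: generations the (1,2) ancestral population existed
                   between the two speciation events.
    ne:            effective size of that ancestral population.
    """
    # Length of the internal branch in coalescent units.
    big_t = t_generations / (2.0 * ne)
    # Chance the Species 1 and Species 2 lineages do NOT coalesce
    # before reaching the (1,2,3) ancestral population.
    p_no_coalescence = math.exp(-big_t)
    # If they do not, each of the three pairings is equally likely.
    discordant_each = p_no_coalescence / 3.0
    return {
        "(1,2),3": 1.0 - 2.0 * discordant_each,  # matches species tree
        "(1,3),2": discordant_each,
        "(2,3),1": discordant_each,
    }


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    for t in (0, 50_000, 200_000):
        print(t, gene_tree_probabilities(t, ne=50_000))
```

Under this model the concordant topology is never rarer than either discordant one, and the two discordant topologies are expected at equal frequency. Discordance grows when the interval between speciation events is short or the ancestral population is large.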
The fact that gene phylogenies/trees and species phylogenies/trees don’t always match is not something that surprises scientists, since it is a well-known phenomenon and the mechanisms underlying it are understood: species arise from genetically diverse populations and that diversity does not always sort completely down to every descendant species. Discordant phylogenies, however, are commonly used among Christians as a means to cast doubt on common ancestry and/or evolutionary biology as a whole. One example from the Intelligent Design movement will serve as an illustration. In a blog post discussing discordant trees found when comparing the human genome to that of other primates, Casey Luskin argues:
Since humans are typically said to be most closely related to chimps, this data conflicts with the standard supposed tree … the basic problem is that one gene (or portion of the genome) gives you one version of the tree, while another gene (or portion of the genome) gives you a very different version of the tree. This leads to discrepancies between molecule-based trees, wherein DNA data fails to provide a consistent picture of common ancestry.
In the end, molecular trees are based upon the sheer assumption that the degree of genetic similarity reflects the degree of evolutionary relatedness … Clearly this assumption fails when different genes paint contradictory pictures of evolutionary relationships.
As we have seen, these differences are the natural, expected consequence of genetic diversity from an ancestral population sorting itself incompletely into different descendant species. The data set Casey is concerned about is primate evolution, where the species tree for humans, chimpanzees, gorillas and orangutans is as follows:
In the article linked above, Casey is discussing a recent comparison of the newly-completed orangutan genome with the human genome. The availability of the orangutan genome allowed researchers to scan the human genome for locations where humans are more similar to orangutans than to chimps. These regions are rare in the human genome, and very short in length. Indeed, the researchers found a pattern: chromosome segments in humans most often match chimpanzees, and do so for thousands of nucleotide base pairs at a time, on average. Those regions that match orangutans are tiny (on average less than 100 base pairs) and rare. This is exactly what one expects from the species tree: humans and chimps are much more likely to have gene trees in common, since they more recently shared a common ancestral population (around 4-5 million years ago). Humans and orangutans, on the other hand, haven’t shared a common ancestral population in about 10 million years or more, meaning that it is much less likely for any given human allele to more closely match an orangutan allele. It is certainly possible, however, and in scanning over the entire genome rare sites that have this pattern can be found. Indeed, the authors of the paper above used previously-determined speciation times and population size estimates to predict what fraction of the human genome would be expected to match more closely with orangutans. Based on these parameters obtained in other studies, they predicted 0.9% of the human genome would have a human : orangutan gene tree. Their observed value was 0.8% - a result that provides additional support for the population size estimates and speciation times from other studies.
Aside from its misinterpretation by the ID movement, this sort of data actually provides us with information about the population size of the species that went on to give rise to orangutans, gorillas, chimpanzees and humans, as well as times for the various speciation events. I have discussed similar data for the (gorilla/chimpanzee/human) and (chimpanzee/human) common ancestor populations elsewhere; this new data merely confirms previous estimates of the population sizes of the various ancestral groups, and extends back to the (orangutan/gorilla/chimpanzee/human) common ancestor population with greater precision. As before, these results continue to strongly support the hypothesis that the human lineage has never been as low as two individuals at any point in our evolutionary history. Indeed, these new results confirm that the human : chimp common ancestor population was large (about 50,000 members). As Darrel Falk and I have discussed here on BioLogos in the past, all methods used to date (numerous approaches, all using independent assumptions) would have to be wildly wrong (by several orders of magnitude) if indeed our species arose from just two individuals.
In genetics, coalescent theory is a retrospective model of population genetics. It attempts to trace all alleles of a gene shared by all members of a population to a single ancestral copy, known as the most recent common ancestor (MRCA; sometimes also termed the coancestor to emphasize the coalescent relationship). The inheritance relationships between alleles are typically represented as a gene genealogy, similar in form to a phylogenetic tree. This gene genealogy is also known as the coalescent. Understanding the statistical properties of the coalescent under different assumptions forms the basis of coalescent theory.
The coalescent runs models of genetic drift backward in time to investigate the genealogy of antecedents. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure. Advances in coalescent theory, however, allow extension to the basic coalescent, and can include recombination, selection, and virtually any arbitrarily complex evolutionary or demographic model in population genetic analysis. The mathematical theory of the coalescent was originally developed in the early 1980s by John Kingman.
Consider two distinct haploid organisms that differ at a single nucleotide. Tracing the ancestry of these two individuals backwards, one eventually reaches a point in time at which the MRCA is encountered and the two lineages coalesce.
A useful analysis based on coalescent theory seeks to predict the amount of time elapsed between the introduction of a mutation and the arising of a particular allele or gene distribution in a population. This time period is equal to how long ago the most recent common ancestor existed.
The probability that two lineages coalesce in the immediately preceding generation is the probability that they share a parental DNA sequence. In a diploid population with a constant effective population size with 2Ne copies of each locus, there are 2Ne "potential parents" in the previous generation. Under a random mating model, the probability that two alleles share a parent is thus 1/(2Ne) and, correspondingly, the probability that they do not coalesce is 1 − 1/(2Ne).
At each successive preceding generation, the probability of coalescence is geometrically distributed; that is, it is the probability of noncoalescence at the t − 1 preceding generations multiplied by the probability of coalescence at the generation of interest:

$$P_c(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \left(\frac{1}{2N_e}\right)$$

For sufficiently large values of Ne, this distribution is well approximated by the continuously defined exponential distribution

$$P_c(t) = \frac{1}{2N_e} e^{-\frac{t-1}{2N_e}}$$
The standard exponential distribution has both the expected value and the standard deviation equal to 2Ne; therefore, although the expected time to coalescence is 2Ne, actual coalescence times have a wide range of variation. Note that coalescent time is the number of preceding generations at which the coalescence took place, not calendar time, though an estimate of the latter can be made by multiplying 2Ne by the average time between generations.
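The wide spread of coalescence times can be checked numerically. The sketch below draws pairwise coalescence times from the geometric distribution described above, using a per-generation coalescence probability of 1/(2Ne); the function names, the value of Ne and the generation time used for the calendar-time conversion are illustrative assumptions.

```python
import math
import random
import statistics


def sample_coalescence_time(ne, rng):
    """Draw the number of generations back to the MRCA of two gene copies."""
    p = 1.0 / (2.0 * ne)  # chance of coalescing in any one generation
    u = rng.random()      # uniform on [0, 1)
    # Inverse-transform sampling of a geometric distribution on 1, 2, 3, ...
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1


def summarize(ne, n_samples=20000, seed=1):
    """Return the sample mean and standard deviation of coalescence times."""
    rng = random.Random(seed)
    times = [sample_coalescence_time(ne, rng) for _ in range(n_samples)]
    return statistics.mean(times), statistics.stdev(times)


if __name__ == "__main__":
    ne = 1000                  # hypothetical effective population size
    years_per_generation = 20  # hypothetical generation time
    mean_t, sd_t = summarize(ne)
    print("mean generations:", mean_t)
    print("std dev generations:", sd_t)
    print("mean in years:", mean_t * years_per_generation)
```

Both the mean and the standard deviation should come out close to 2Ne, which is why a single observed coalescence time says little on its own and many loci are needed for a stable estimate.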
Coalescent theory can also be used to model the amount of variation in DNA sequences expected from genetic drift and mutation. This value is termed the mean heterozygosity, represented as $\bar{H}$. Mean heterozygosity is calculated as the probability of a mutation occurring at a given generation divided by the probability of any "event" at that generation (either a mutation or a coalescence). The probability that the event is a mutation is the probability of a mutation in either of the two lineages: $2\mu$, where $\mu$ is the per-generation mutation rate. Thus the mean heterozygosity is equal to

$$\bar{H} = \frac{2\mu}{2\mu + \frac{1}{2N_e}} = \frac{4N_e\mu}{1 + 4N_e\mu} = \frac{\theta}{1 + \theta}, \qquad \theta = 4N_e\mu$$

For $4N_e\mu \gg 1$, the vast majority of allele pairs have at least one difference in nucleotide sequence.
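As a quick numerical check of the heterozygosity relationship, the sketch below computes it two ways: from the ratio of event probabilities (mutation in either lineage, 2μ, against coalescence, 1/(2Ne)) and from the compact form θ/(1 + θ) with θ = 4Neμ. The function names and parameter values are hypothetical.

```python
def heterozygosity_from_events(ne, mu):
    """Mutation probability divided by the probability of any event."""
    p_mutation = 2.0 * mu            # mutation in either of two lineages
    p_coalescence = 1.0 / (2.0 * ne)  # the two lineages share a parent
    return p_mutation / (p_mutation + p_coalescence)


def heterozygosity_from_theta(ne, mu):
    """Equivalent compact form, theta / (1 + theta)."""
    theta = 4.0 * ne * mu
    return theta / (1.0 + theta)


if __name__ == "__main__":
    mu = 2.5e-5  # hypothetical per-generation mutation rate for a locus
    for ne in (100, 10_000, 1_000_000):
        print(ne,
              heterozygosity_from_events(ne, mu),
              heterozygosity_from_theta(ne, mu))
```

The two forms agree, and the value climbs toward 1 as 4Neμ grows, matching the statement that most allele pairs then differ at one or more sites.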
Coalescents can be visualised using dendrograms which show the relationship of branches of the population to each other. The point where two branches meet indicates a coalescent event.
Note: This series of posts is intended as a basic introduction to the science of evolution for non-specialists. You can see the introduction to this series here. In this post we discuss how comparative genomics data can be used to estimate population sizes at different points within a phylogeny.
In the last post in this series we introduced the (challenging) concept of discordant gene trees within a species tree that arise through incomplete lineage sorting (ILS). In this post, we’ll take a look at one of the interesting implications of ILS – using it to estimate population sizes - before moving on to other topics. (Again, if this topic seems too challenging, feel free to bypass this post and the last – the later posts in this series will not depend on understanding this information.)
We can use ILS to measure population size because discordant trees give us a way to measure the number of alleles present in an ancestral population (which in turn can be used to estimate the number of individuals in that population). Before getting into the details, however, let’s briefly review how speciation is a population-level phenomenon.
As you will recall from previous posts in this series, speciation events get their start when two populations become genetically isolated from one another (either completely, or partially). This allows the average characteristics of the two populations to diverge, which in turn may lead to speciation over time. The point to emphasize here is that both populations are populations – a group of interbreeding organisms of the same species. Populations, as we have seen, are capable of passing on much more genetic diversity than a single individual can – where one individual can carry only two alleles of a given gene, a population can maintain hundreds, or even thousands.
With this in mind, we can return to our discussion of incomplete lineage sorting and the resulting discordant gene trees nested within a species tree. The example we used previously had a species tree in which humans and chimpanzees are each other’s closest relatives, with gorillas more distantly related:
We then described an example of incomplete lineage sorting where gorillas and chimpanzees inherit the most closely related alleles, and humans a more distantly related allele:
We are now ready to discuss what we can infer from this pattern, and what it tells us about the (H,G,C) common ancestral population. First, this pattern tells us that the red and blue alleles were present before the chimpanzee lineage separated from the gorilla lineage. Since we know from the species tree that the (gorilla / chimpanzee) common ancestral population is the (H,G,C) common ancestral population, this confirms that the blue and red alleles were part of the variation that this population maintained. The next thing to notice is that the yellow allele is more ancestral – in other words, it has fewer mutations when compared to the red and blue alleles. This means that the yellow allele is older than the red or blue alleles. This places the yellow allele on the phylogeny prior to the (G) / (H,C) speciation event. Also, since humans have the yellow allele, it must have been present in the (H,C) common ancestral population at the point when it separates from the (G) lineage. Taken together, this means that the yellow allele was also present in the (H,G,C) population. In the absence of new mutations (which are excluded in these analyses) there is no other way to produce this pattern of inheritance unless all three alleles are present in the (H,G,C) population. Even though the present-day species have only one allele each, we can infer that their shared ancestral population had all three.
So, discordant gene trees are a window to the past that reveal the genetic diversity of an ancestral population – how many alleles it maintained for a given region of the genome. By comparing large sets of genome data from humans, chimpanzees and gorillas, it is possible to get an accurate estimate of population size for the (H,G,C) ancestral population (about 50,000 individuals). This measure, called the effective population size (denoted Ne) is the population size needed to transmit the observed amount of genetic variation from an ancestral population to the present day. The human / chimpanzee (H,C) common ancestral lineage, estimated using the same methods, also numbered about 50,000 individuals over its history.
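To give a feel for how a discordance fraction can be turned into a population size, here is a deliberately simplified sketch. It assumes the basic three-species neutral coalescent, in which the total discordant fraction is f = (2/3)·e^(-t/(2Ne)) for an ancestral population that lasted t generations; the published estimates cited at the end of this post instead rely on a coalescent hidden Markov model. The function names and all input values are hypothetical.

```python
import math


def discordance_from_ne(ne, t_generations):
    """Expected total fraction of discordant gene trees."""
    return (2.0 / 3.0) * math.exp(-t_generations / (2.0 * ne))


def ne_from_discordance(discordant_fraction, t_generations):
    """Estimate ancestral Ne from an observed discordant fraction.

    discordant_fraction: share of the genome whose gene tree disagrees
        with the species tree (both discordant topologies combined).
    t_generations: generations between the two speciation events.
    """
    if not 0.0 < discordant_fraction < 2.0 / 3.0:
        raise ValueError("discordant_fraction must be between 0 and 2/3")
    # Invert f = (2/3) * exp(-T), where T = t / (2 * Ne).
    big_t = -math.log(1.5 * discordant_fraction)
    return t_generations / (2.0 * big_t)


if __name__ == "__main__":
    # Hypothetical inputs, not values from the studies discussed here.
    print(ne_from_discordance(discordant_fraction=0.30,
                              t_generations=100_000))
```

The design point is that more discordance over the same interval implies a larger ancestral population, since more alleles had to be carried through it without sorting.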
The sequencing of the orangutan genome (completed in 2011) provided researchers with an opportunity to check these estimates using an additional data set. The orangutan lineage branches off the primate phylogeny from a common ancestral population (i.e. the (H,O,G,C) population, where the “O” stands for orangutan) leaving the (H,G,C) ancestral population which will undergo speciation later:
Using prior estimates of (H,G,C) and (H,C) population sizes, the researchers were able to predict in advance that a very small fraction of the human and orangutan genomes should be more closely related to each other – i.e. that incomplete lineage sorting should have produced rare genome regions where the human and orangutan alleles are more similar to each other than to other primates. The expected value of such (H,O) paired regions (~1%) is tiny when compared to the predicted value for (H,G) paired regions (around 25%), in large part because humans, chimpanzees and gorillas underwent speciation in a relatively short period of time, whereas the time between the orangutan divergence and the later gorilla divergence is greater. The fraction of our genome that more closely matches the orangutan genome is about 0.8% – remarkably close to the predicted value, and consistent with the Ne values estimated for the (H,G,C) and (H,C) populations from prior work. In other words, when comparing primate genomes, we see a pattern of incomplete lineage sorting – as expected, our genome matches chimpanzees most frequently, then gorilla, and then orangutan. (As an aside, it is formally possible that once the gibbon genome is sequenced and analyzed there might be a trace of incomplete lineage sorting that gives (human, gibbon) allele groupings, but it is likely that this fraction of the genome will be too tiny to detect reliably, since gibbons branch off the primate tree well before orangutans do.)
Far from being a “problem” for common ancestry, incomplete lineage sorting is an expected consequence of populations undergoing speciation events – and a window into their genetic diversity. The end result within a phylogeny, as we have seen, is a subset of characteristics that have a discordant tree within the species tree. In the next post in this series, we’ll explore another effect that can also produce patterns at odds with a species tree: convergent evolution.
Hobolth A, et al. (2007). Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLoS Genetics 3(2): e7.
Hobolth A, et al. (2011). Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection. Genome Research 21(3): 349.