Chapter 2.3: Evolutionary signatures (Video Transcript)

Origins of Genetic Variation

Title: Origins of Genetic Variation

Presenter(s): Jessica Pamment, PhD (Department of Biological Sciences, DePaul University)

Jessica Pamment:

Next time you’re in the classroom, look around you, and you’ll see that although all your peers are humans, the same species as you, no two individuals in the class will look exactly the same, unless you have identical twins in the room. This variation in traits is true not only for human populations but for any species. Some of the differences observed within a population are caused by the environment and experiences of each individual. For example, hormonal changes brought on by cooler temperatures result in the fur of an arctic fox turning from brown to white. Although the environment definitely plays a role in introducing variation in a population, most of the variation seen in populations is caused by differences in genes. For example, one gene is responsible for determining whether rats will have brown or black fur. With the exception of clones, such as identical twins, each individual within a population carries a unique set of genes, half of which were received from one parent and half from the other. The total set of genes of all individuals in a given population is called the gene pool.

A gene is a discrete unit of hereditary information consisting of a specific nucleotide sequence in DNA. So, nucleotides are the building blocks of DNA and, therefore, of genes. Differences between individuals can be measured all the way down to the level of individual nucleotides. However, measuring differences within a gene pool at this level is not particularly useful because much of the variation lies within non-coding regions of the DNA, meaning that these variations don’t result in an observable difference. It’s often better to measure variation at the gene level because it is at this level that both quantitative and discrete traits are coded.

So, how does genetic variation arise in a population? Well, one of the ways is as a result of mutations, which result in a change in the original DNA sequence. Mutations can occur as mistakes during DNA replication. However, if the mutation does not happen in a cell that is passed down to offspring, such as an egg or sperm cell, the change cannot lead to a new allele, which is an alternative version of a gene. Variation can also arise at the chromosome level during the process of meiosis. This is a modified type of cell division found only in sexually reproducing organisms, which results in the production of gametes.

There are two ways in which variation is introduced in meiosis: crossing over and independent assortment. Crossing over happens early on in meiosis, in prophase one, and results in the exchange of DNA between homologous chromosomes, so between the paternal and maternal chromosome of each chromosome pair. This results in recombinant chromosomes. The second way in which variation is introduced is as a result of the random arrangement of chromosome pairs on the metaphase plate during metaphase one. In humans, the random assortment of chromosomes gives rise to about 8.4 million possible combinations of chromosomes, and this is without taking crossing over into account, which introduces even more variation.

Another mechanism that contributes to genetic variation in sexually reproducing organisms is random fertilization. As I just mentioned in humans, each male and female gamete represents one of about 8.4 million possible chromosome combinations due to independent assortment. The fusion of a male gamete with a female gamete during fertilization is completely random and will produce a cell with any of about 70 trillion chromosome combinations. If we factor in variation brought in by crossing-over, the number of combinations is even higher.
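The two figures quoted above follow from simple counting, assuming 23 chromosome pairs that each orient independently and ignoring crossing over. A minimal sketch of the arithmetic (the variable names are illustrative only):

```python
# Humans have 23 chromosome pairs; each pair orients independently at metaphase I,
# so each gamete receives one of two homologs for every pair.
HAPLOID_NUMBER = 23

# Possible chromosome combinations in a single gamete.
gamete_combinations = 2 ** HAPLOID_NUMBER

# Random fertilization pairs any egg with any sperm.
zygote_combinations = gamete_combinations ** 2

print(f"{gamete_combinations:,}")  # 8,388,608 (about 8.4 million)
print(f"{zygote_combinations:,}")  # 70,368,744,177,664 (about 70 trillion)
```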

Hopefully, you can see how unique you really are. Now that we’ve learned how genetic variation is introduced into a population of sexually reproducing organisms, it is important to remember the evolutionary significance of this. Natural selection is a driving force behind evolution, and natural selection results in the accumulation of genetic variations favored by the environment. Another way of thinking about this is that genetic variation is the raw material needed for evolution to occur.

[Music]

MPG Primer: Natural selection & human genetic variation

Title: MPG Primer: Natural selection & human genetic variation

Presenter(s): Stephen Schaffner, PhD (Infectious Disease and Microbiome Program, Broad Institute of Harvard and MIT)

Stephen Schaffner:

Intro

Good morning, hello everyone. Thank you for coming. I have the privilege of introducing myself. I’m Steve Schaffner, a staff scientist here, a computational biologist, and I’ve been here for a long time. I used to be part of MPG back in the Thousand Genomes and HapMap era and even before that. These days, I work more in pathogens, malaria, and viruses, but I’m still interested in human genetics and natural selection in humans. That’s why I’m talking to you today about natural selection in humans and in particular its effect on human genetic variation and what we can learn from this and how we can detect it.

Natural selection

So, I think you probably all know what natural selection is. I’ll just state it for the record that it’s the principle that alleles that make an organism more successful in terms of survival and reproduction are likely to be transmitted more to the next generation and therefore likely to increase in frequency while they’re being selected. We can, for convenience, divide the kinds of selection into several different types. First is balancing selection, which is selection that maintains multiple alleles in the population at some intermediate frequency. Then there’s purifying selection, which is the elimination of new mutations that are deleterious. Finally, there’s positive selection, which is selection for some beneficial trait. Actually, the last two are really flip sides of the same thing: if you’re selecting for one trait, you’re selecting against some other trait. You have to be choosing one, but it’s convenient to distinguish them based on what starts out rare; you can think of purifying selection as eliminating new rare things.

Sickle cell

I’ll start by talking about balancing selection. This is probably the rarest kind of selection. It’s kind of cool, if you can find it, but it’s a little difficult to detect and probably doesn’t happen very often. There is one very well-known case in humans, the sickle cell trait. The sickle cell allele was one of the first cases, probably I think the first case, of identified natural selection in humans. It was Haldane in the ’50s, who noticed that certain diseases involving hemoglobin were much more common in places where there was a lot of malaria, and he hypothesized that natural selection was playing some sort of role and he was correct. These two maps: the top map shows where malaria occurs worldwide and the bottom map shows where there’s a high prevalence of the sickle cell variant, the sickle cell trait. The reason is quite clear and it’s easy to find out.

Heterozygous individuals survive best

To check, you can actually measure the difference in fitness. If you have one copy of the allele, you’re a heterozygote, then you have considerable protection against malaria. So, if you have no copies, you’re exposed to malaria and more likely to die younger because malaria is a big killer. If you have one copy, you have a real benefit. If you have two copies, then you get sickle cell disease and you’re also likely to die young, because without modern medical care, it’s a very serious disease. So, you could look at the survival and it’s very clear that heterozygotes have an advantage. That means that there’s selection pressure to maintain this allele at some intermediate level, so that there’s a maximum number of heterozygotes in the population. It’s not a very pleasant solution, but it is an evolutionarily stable solution to a severe selective pressure.
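The idea that heterozygote advantage holds an allele at an intermediate frequency can be illustrated with the standard one-locus selection model. This is a minimal sketch: the fitness costs `s` (for having no copies) and `t` (for having two copies) are hypothetical values chosen for illustration, not measured values for the sickle cell allele.

```python
def next_frequency(q, s, t):
    """One generation of selection on an allele at frequency q.

    Relative fitnesses (assumed): no copies = 1 - s, one copy = 1, two copies = 1 - t.
    """
    p = 1.0 - q
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    return (p * q + q * q * (1 - t)) / mean_fitness


def iterate(q0, s, t, generations):
    """Apply selection repeatedly, starting from frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s, t)
    return q


s, t = 0.1, 0.8             # hypothetical costs, for illustration only
predicted = s / (s + t)     # textbook equilibrium under heterozygote advantage

from_rare = iterate(0.01, s, t, generations=2000)
from_common = iterate(0.50, s, t, generations=2000)
print(predicted, from_rare, from_common)
```

Whether the allele starts rare or common, the frequency settles at the same intermediate value, which is what makes this a stable solution.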

Balancing selection for diversity

You can also get balancing selection in other ways, through selection for diversity. There are a lot of cases in which it’s good to be different from other members of your species, like escaping from predators. If the predator is used to finding purple people and eating them, it can be kind of good to be green. It’s also quite common in resistance to disease. If a new virus enters the village and it can infect most of the people in the village, and you have a different genotype than most of the people, then you have an advantage. You’re immune to the thing that’s spreading through the rest of the population. You can see the effects of this selection for diversity in disease resistance in the HLA region, which is critical for immune response to pathogens.

Selection for diversity: HLA

And if you just look on chromosome 6, the density of SNPs in that region is much higher than elsewhere in the genome because there’s a lot of selection. Obviously, we’re constantly exposed to different kinds of infectious diseases, so there’s a lot of selection for having diversity there. And there’s the HLA region, if you couldn’t guess where it was.

And this is actually an example of frequency-dependent selection, that is, whether an allele is advantageous depends on how frequent it is. When it’s rare, it’s good to be different, so that’s advantageous. As it increases in frequency, it can become less advantageous. And also, it obviously can vary quite a lot with your local environment, what pathogens or what predators happen to be nearby, and in humans, it’s mostly pathogens we worry about, not so much predators these days. So, it can fluctuate quite a bit on small geographic scales and on short timescales.

As a side note, pathogens also evolve and they do some similar things. This is one of the chromosomes of Plasmodium falciparum, which causes the most severe kind of malaria. The plot is showing diversity across the chromosome, and that short, sharp spike in diversity is a gene that codes for a protein that’s exposed to the immune system. A lot of malaria proteins are not exposed because they hide in red blood cells, but this protein is exposed to the human immune system. And again, there’s a lot of pressure to be diverse, so that if you are the first parasite entering a village, you want to be different from all the other parasites that the population has been exposed to, so that you can happily infect people without having that unpleasant immune system triggered immediately. And they evolve faster than we do, by the way. So, they’re constantly evolving as well. So that’s balancing selection.

Purifying selection

Purifying selection is the most common kind of selection. It’s sort of a little dull, because all it is is the removal of new deleterious mutations. Most organisms are pretty well adapted to their environment, and a functional change, something that changes their phenotype in an important way, is probably bad for them. It’s going to be eliminated by natural selection. It’s not going to be passed on for very long. And it’s very easy to see this in human genetic data. The top figure here is a plot from Thousand Genomes data. It’s the diversity across the whole genome, looking at different parts of genes. And I’ve marked where some exons are, the first exon, the middle exon, and the last exon within genes throughout the genome. And you can see the diversity is much lower within these coding exons, because changes to the protein are probably bad and they tend to be eliminated. You can see the effect is strong, and you can just read off from there how many mutations have occurred in these exons and have been eliminated by selection over time. And although they’re eliminated, they may not be eliminated immediately. Obviously, if something makes you non-viable, then you’re not going to see it in the population. But lots of mutations are mildly deleterious. They might make you more likely to be sick or stupid or less attractive, whatever, which might be bad in certain circumstances.

Purifying selection eliminates deleterious mutations

And so, you can see these alleles hanging around in the population, but they won’t rise to very high frequency, because they are bad and they’re less successful. So, you can just plot the allele frequency of nonsynonymous mutations and compare it to synonymous mutations; that’s what I show here on the left as a function of frequency. In that first bin, the 1% bin, you see there’s an excess of nonsynonymous mutations compared to the synonymous mutations, because more of these are functional. That excess represents mildly deleterious alleles that are going to be eliminated eventually by selection, but haven’t been eliminated yet.

Even within the nonsynonymous mutations, you can break it down further. On the right, I’ve plotted different categories of nonsynonymous mutations according to whether or not they’re likely to damage the protein, that is, change its function. In pink and red are changes that have a pretty good chance of changing the protein’s function, and those are the ones you see more of at low frequency, so those are the ones that are going to be eliminated.

In terms of evolutionary biology, these are not interesting; this is just sludge that’s being eliminated by natural selection all the time. In terms of medical genetics or human genetics of disease, these are probably a lot of the ones that we’re interested in, because one of the ways of being deleterious is making you more likely to get sick. And so, these are some of the things that are causing genetic diseases or increasing your risk for early onset of diseases of various kinds. So, these are of great medical interest in some cases but not of tremendous interest to evolutionary biologists.

The effect of purifying selection can be seen not only in the allele itself that’s being selected against – it can also be seen in some of the surrounding variation. The effect of purifying selection is that it reduces diversity in that region. You can think of it like this: Say there’s a gene where deleterious mutations keep happening. When a deleterious mutation happens, the chromosome it happens on is going to be removed from the population eventually. So effectively, it’s not part of the population now in terms of the long-term success. So, you have right around genes or other functional elements, you effectively have a smaller population size, and that means you can sustain less diversity because some of the diversity gets taken. Any diversity that’s on that chromosome that gets the bad mutation is going to be removed from the population.

And it’s very easy to see this, too. This is again 1000 Genomes data. Here they plotted diversity against the distance from the start or the stop of a gene, across the whole genome. The three colors are three different populations. And there’s a significant dip in diversity around a gene. It’s on the order of 50 or 100 kb, so it’s a substantial stretch where there’s a notable reduction in diversity, and this is affecting the distribution of diversity throughout the genome. This is an ongoing effect.

Audience question: [Unable to hear on video].

Stephen: Why? Why are the absolute levels different? Well, there’s just less diversity. The red and the blue are the two non-African populations, and there’s just less diversity outside of Africa because they passed through a bottleneck leaving Africa and so there isn’t that much there. There may also be a small difference in how much diversity is reduced there. There’s a question of whether purifying selection has been less effective outside of Africa because the effective population size was smaller, but that’s a pretty minor effect.

Finally, there’s positive selection, which is what we usually think of as natural selection. It’s what Darwin was famous for. It’s the basis for adaptation, pretty much all adaptation we think. And so, it’s kind of a sexy thing to look for, and people have looked for it.

I’m going to focus it initially on selective sweeps, which is selection where the mutation starts out basically as a new mutation and increases in frequency. I’ll add a few complications later. And so, the question is: what’s the effect of positive selection on genetic variation? And basically that means: How can we detect it just by looking at genetic variation?

So, let’s take the case where selection doesn’t happen, this is neutral evolution. So, suppose there’s a mutation that happens in this blue guy here, and as time goes on, the frequency of that allele may increase, it may decrease, it kind of bops around a little bit. It drifts, and the technical term is “it drifts”, it’s genetic drift. But on average, it doesn’t actually change in frequency. So very slowly, it might change over time.

Now suppose there’s positive selection for something. If this new allele, this mutation, provides a benefit, well then over time, it can rapidly increase in frequency. And that rapid increase is what leaves the genetic signature that we can look for. It actually leaves a number of different genetic signatures, and I’ll kind of describe some of them.
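The contrast between drift and positive selection can be sketched with a small simulation. The code below assumes a simple haploid Wright-Fisher model in which a new mutation starts as a single copy; the population size, the selection coefficient `s`, and the function names are all hypothetical choices for illustration, not taken from the talk.

```python
import random


def simulate_allele(n_copies, s, max_generations, rng):
    """Follow a new mutation (one copy) until it is lost, fixed, or time runs out.

    Returns the list of allele frequencies, one per generation.
    """
    count = 1
    trajectory = [count / n_copies]
    for _ in range(max_generations):
        if count == 0 or count == n_copies:
            break
        p = count / n_copies
        # Selection shifts the expected frequency; s = 0 is neutral drift.
        p_selected = p * (1 + s) / (1 + p * s)
        # Random sampling of the next generation is the drift step.
        count = sum(rng.random() < p_selected for _ in range(n_copies))
        trajectory.append(count / n_copies)
    return trajectory


def count_fixations(n_copies, s, replicates, seed):
    """Number of replicate runs in which the new allele reaches frequency 1."""
    rng = random.Random(seed)
    return sum(
        simulate_allele(n_copies, s, 1000, rng)[-1] == 1.0
        for _ in range(replicates)
    )


neutral_fixations = count_fixations(n_copies=100, s=0.0, replicates=300, seed=1)
selected_fixations = count_fixations(n_copies=100, s=0.2, replicates=300, seed=1)
print(neutral_fixations, selected_fixations)
```

Most new mutations are lost in both cases, but the beneficial allele reaches high frequency far more often and far faster than the neutral one, and it is that rapid rise that leaves the signatures described next.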

So, let’s consider what the genetic situation looks like before selection happens and after. In this cartoon version of some chromosomes in the population, this new red mutation is beneficial, and there’s a lot of diversity around it. There are different alleles and different combinations of them, different haplotypes, present in the population, because the region has just been sitting there behaving normally. After selection has happened, this allele has increased rapidly in frequency and is present in a large fraction of the population. That leaves signatures we can look for. Here’s one. Suppose this selection has happened in only one geographic region, say the Boston area: maybe there’s an allele for being a Red Sox fan, or maybe even a Yankee fan, that makes you more reproductively successful here. Let’s compare it to New York. If that’s the case, then you will find that allele at very high frequency in that region and very low frequency elsewhere. So, it’ll be unusual. One signature of selection, then, is an unusually large difference in frequency between populations at that locus, and this can indeed be seen. A common way of measuring frequency differences between populations is a statistic called FST, but there are other statistics you can use, too.

And a classic example of this is the Duffy null allele. The Duffy protein is a blood antigen, one of the many blood antigens; it sits on the outside of red blood cells and does something or other, it’s not entirely clear what. But one role it’s not intended for is that it’s also the way by which Plasmodium vivax, another cause of malaria, enters red blood cells. It’s the only way of entering the red blood cell, so it’s critical for invasion and for infection. If you lose that protein, then you are pretty much immune to vivax malaria. There is a mutation that knocks out that gene; it’s called the Duffy null allele. The plot that’s shown there is the frequency of that mutation around the world. And you see it’s at very high frequency in sub-Saharan Africa, and as a consequence, people in sub-Saharan Africa are largely immune to the effects of vivax malaria. So much so that there’s almost no vivax malaria across Africa. So, this was a highly successful case of natural selection in humans, suggesting that there was a very large cost from vivax malaria at some point. And outside of Africa, it’s basically not there at all. And so, this is sort of a classic case. In fact, I believe this is how it was discovered that this protein was the invasion route for this parasite. So, it can be very useful to be looking at natural selection.

High altitude adaptation in Tibet

A more recent case, in terms of when it was studied, came from people who were looking for adaptation to high altitude, where there’s very low oxygen. What this research group did was compare allele frequencies between a Tibetan population, a sample from Tibet, and a Han Chinese sample: very closely related populations with very similar allele frequencies. The two axes are the Tibetan frequency and the Han Chinese frequency, and you can see almost everything has pretty much the same frequency in both populations. But there are two alleles that are at very high frequency in Tibet and not at very high frequency in the Han Chinese, and they’re in the lower right-hand corner. The two alleles are in the same gene, EPAS1, and it turns out this does indeed confer adaptation to the low oxygen levels present there. So, this was a very easy way of finding this particular gene.

So, in principle, this is a powerful way of detecting where selection has happened regionally. In practice, there aren’t very many low-hanging fruit like that. Somebody took the trouble of checking how many real outliers there are. They took a whole bunch of different population pairs, so each one of these dots is a pair of populations. On the x-axis is the average difference in allele frequency, measured by FST, between those two populations, and on the y-axis is the most extreme single allele in that pair of populations. And most of the time, you can predict really well what the extreme is going to be just from the averages. So, all you’re seeing is the tail of the distribution up at the high end. There are a few cases up here where there are clearly outliers. The one I was just talking about, EPAS1, is one of them. All the other colored dots here are known pigmentation genes, so there’s information in there, but by itself, it may not be a very easy way of finding out what’s been selected.
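To make the frequency-difference signature concrete, here is a minimal sketch of one simple form of FST for a single two-allele site in two equally weighted populations. It assumes the allele frequencies are known exactly; estimators used on real data also correct for sample size, which is left out here, and the function name and example frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Simple FST at one biallelic site from two population allele frequencies.

    FST = (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity
    within populations and H_T is the expected heterozygosity of the pooled
    population (populations weighted equally).
    """
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # site is not variable in either population
    return (h_t - h_s) / h_t


print(fst_two_populations(0.50, 0.50))  # identical frequencies
print(fst_two_populations(0.55, 0.45))  # typical small difference
print(fst_two_populations(0.90, 0.10))  # strongly differentiated locus
```

A scan of this kind computes the statistic at every site and looks for loci that sit far above the genome-wide average for that pair of populations.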

Alright, so that’s just the first signature. I won’t focus much on the others. Another signature is low diversity. If everyone or a large fraction of the population is the same now around this new selected allele, that means if you’re the same, you’re not different. There’s not a lot of genetic diversity there. And so, if you just plot genetic diversity across the chromosome, you’ll find regions where diversity dips because there’s been a selective sweep there. As I said, purifying selection produces reduced diversity around functional elements. Positive selection can also produce reduced diversity. So, they have somewhat similar signatures here, which is inconvenient. The loss of diversity can be more profound in the case of positive selection because if a sweep goes all the way to fixation, everybody is the same and so there’s virtually no diversity present.
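Plotting diversity across a chromosome amounts to computing a diversity measure in consecutive windows. The sketch below uses the average number of pairwise differences between haplotypes as that measure; the four toy haplotypes (strings of 0 and 1, one character per site) are invented so that the middle window looks like a completed sweep.

```python
from itertools import combinations


def mean_pairwise_differences(haplotypes):
    """Average number of differing sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


def windowed_diversity(haplotypes, window):
    """Diversity in consecutive, non-overlapping windows of `window` sites."""
    n_sites = len(haplotypes[0])
    return [
        mean_pairwise_differences([h[start:start + window] for h in haplotypes])
        for start in range(0, n_sites, window)
    ]


# Invented data: sites 4-7 are identical in every haplotype.
toy_haplotypes = [
    "010100001001",
    "100100000110",
    "001000001010",
    "110000000101",
]
diversity = windowed_diversity(toy_haplotypes, window=4)
print(diversity)
```

The middle window comes out at zero while the flanking windows do not, which is the dip a scan would flag. On its own the dip does not say whether positive or purifying selection caused it.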

Another thing that’s a little bit easier to look for, and that serves as a different way of looking at the same thing, is that if everyone is the same, they have the same haplotype there. Basically, if you have that red allele, you can predict what other alleles everyone carrying it will have, up to the point where recombination has broken the haplotype down. If this selection happened recently, then recombination hasn’t had time to break it down yet. And so, you can look for long haplotypes that are at high frequency in the population. And there’s a whole series of statistical tests that have been developed for detecting that there’s an unbroken long haplotype present.
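One common building block of long-haplotype tests is extended haplotype homozygosity: among chromosomes carrying a given core allele, the chance that two chosen at random are identical across the whole region out to some distance from the core. The sketch below is a simplified version that extends symmetrically in both directions (published tests usually treat each direction separately), and the toy haplotypes are invented.

```python
from collections import Counter
from math import comb


def ehh(carrier_haplotypes, core, distance):
    """Probability that two random carriers are identical within `distance`
    sites on either side of the core site (index `core`)."""
    lo = max(0, core - distance)
    hi = core + distance + 1
    classes = Counter(h[lo:hi] for h in carrier_haplotypes)
    n = len(carrier_haplotypes)
    return sum(comb(c, 2) for c in classes.values()) / comb(n, 2)


# Invented carriers of the core allele at index 2; one has recombined edges.
carriers = ["00100", "00100", "00100", "10101"]
for distance in range(3):
    print(distance, ehh(carriers, core=2, distance=distance))
```

After a recent sweep the value stays near 1 over long distances; for an old allele it decays quickly as recombination breaks the haplotype down.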

Long haplotype: LCT

And this turns out to be a very powerful way of detecting selection that’s happened within the last 20,000 years, roughly. Here, a very clear signature for positive selection can be seen at one of the now poster children for selection in humans, which is lactase persistence. The normal state for mammals is that as you get older, you lose the ability to digest lactose, because lactose is present mostly only in milk and normal mammals don’t drink milk as adults. But in human populations where they practiced herding for a long time, many adults can, in fact, digest lactose; they’re lactose tolerant. Most Northern Europeans are… I’m lactose tolerant because my ancestors are from Northern Europe. If you look around the lactase gene, there is an enormously long haplotype that’s hardly broken at all. And in Northern Europe, it’s around 70% frequency. It extends for more than a megabase. It is the result of a very strong natural selection, positive selection for this trait. The mutation that gives the capacity for digesting lactose is, in fact, on that haplotype. The plot just shows that if you plot the length of the haplotype versus the frequency; lactase sits way up here on the right. This is the strongest signal of selection by this test in Western Europe.

Finally, one other signature that’s a little bit harder to visualize is that in regions that have been undergoing recent positive selection, there will be an excess of high-frequency derived alleles. Normally, the derived allele, the newer allele, is at low frequency. It tends to stay at low frequency. But when this kind of sweep happens towards higher frequency, it can bring any other mutations that are nearby to higher frequency. So, it can bring more of these rare mutations up to high frequency along with it. That’s just a different sort of thing you can look at in the data. All these signatures have been looked at, they’ve been known for a while, and they’ve been looked at in various scans across the genome, looking for places where selection has happened. You can do some interesting things with them.

Composite of Multiple Signals (CMS)

It turns out that there’s independent information in each of these signatures. So, one thing you can do, an approach that was pioneered by Pritchard and Sabeti, is to combine the information from the different signals. Here, in a cartoon version, we’ve got evidence from long haplotypes, evidence from derived allele frequency, and evidence from differentiation between populations. They’re each giving you different information, but by combining them you can get enhanced information about where selection actually happened. That’s important because some of these signatures tend to cover very large regions of a chromosome and don’t really tell you much about which allele or gene was actually selected for. So, if you want to get down to details, it’s better to combine information.
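The idea of combining signals can be sketched as follows. This is not the published CMS procedure; it is a naive illustration that assumes each test has already been converted, per SNP, into a likelihood ratio (selected versus neutral) and that the tests are independent, so the ratios multiply. The SNP names and numbers are invented.

```python
import math


def composite_log_score(likelihood_ratios):
    """Sum of log likelihood ratios, i.e. the log of their product."""
    return sum(math.log(lr) for lr in likelihood_ratios)


# Invented per-SNP likelihood ratios for three tests:
# (long haplotype, population differentiation, derived allele frequency)
snps = {
    "snp_a": (2.0, 1.5, 1.2),
    "snp_b": (3.0, 4.0, 5.0),
    "snp_c": (2.5, 0.8, 1.0),
}

scores = {name: composite_log_score(lrs) for name, lrs in snps.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Each individual test is only mildly elevated across the region, but the SNP that is elevated in all three at once stands out clearly in the combined score.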

CMS pinpoints candidate variants

Here’s an actual case. I think this is chromosome 5 in humans, and I don’t remember what dataset this is; it might be 1000 Genomes. These are different signatures of selection. The top one is a long haplotype, the second one is population differentiation, and then the one below that is derived alleles. A lot of these are raised or elevated, so somewhere in here, there’s evidence that selection happened. But they’re very noisy, and it’s very hard to know exactly where the selection happened. But if you combine them, then, as if by magic, the bottom distribution shows the score for where you think selection happened. And I don’t know if you can actually see it, but there are only a handful that have an elevated score, indicating that right here is where selection probably happened. It turns out the biggest signal there is a nonsynonymous mutation, so there’s a good chance it’s functional. It’s sitting in a gene that’s important for skin pigmentation. This, in fact, is an allele that contributes to European skin color; it’s one of the major alleles for that.

Audience question: What kind of sample size do you need to detect these kinds of changes?

Stephen: Good question. Depends on how strong it is. I’d say, for really strong signals, 100 is fine. As you’re getting to more subtle things, thousands are better. Above that, the problem isn’t so much sample size as knowing how to distinguish between background stuff and what’s actually selection. It’s not just a statistical problem. Basically, what we’re doing is looking for the weirdest part of the genome, and lots of weird stuff can be happening in the genome. There are various other things that confound these tests. For instance, the long haplotype test can be confounded if there’s an inversion that suppresses recombination in that region. That’s one of those sorts of things that can confuse you. So, there’s a variety of things that can affect you.

So, as I said, these tests have been known for a while. There was a big spate of genome scans about ten or twelve years ago as genome-wide data became available. But there is still work going on in this area.

A new signature: singleton density

This is a figure from a paper that was published two weeks ago in Science, which introduces a new signature for recent selection. I thought it was cool enough that I will try to explain it to you. The basic idea is – we’ve got two alleles here at a site. Along the bottom, these are all the samples. If you just construct a gene tree of which samples are related to which at this particular locus, this is the gene tree. In blue, here is a derived allele, a mutation that happened at some point in the past and has been selected for. If it’s selected for, it increases in frequency, which means it’s younger. On average, you have more recent common ancestors. If you compare people, they have a recent common ancestor; that’s what it means for it to have increased in frequency recently. So, if you just look at the terminal branches, the terminal lines here leading to all of these samples, they tend to be longer for the one that wasn’t selected for, because they’re older. Any mutation that occurs on one of these is going to appear as a singleton in your sample. It’s just one mutation.

So, all this test does is count how many singletons there are around each allele from the genome. Just count how many singletons there are nearby and look at the density of singletons. The density will be lower, with fewer singletons, around recently selected things. One of the nice things about this test is that, unlike some of the other tests, as I’ll mention in a minute, it works not just on selective sweeps but also in some other more complicated situations. The authors worked out what it is sensitive to: if you have a reasonable sample size of a few thousand, then it’s probably sensitive to selection that occurred in the last 2000 years. In genetic terms, that’s very recent selection. So, this is cool, and it’s worth reading that paper.
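The counting idea can be illustrated with a toy example. This is not the statistic from the paper, which works from real genotype data and distances to the nearest singletons; it is a deliberately simplified sketch that counts, in a small window of invented haplotypes, how many singletons sit on chromosomes carrying the derived versus the ancestral allele at a focal site.

```python
def singletons_per_haplotype(haplotypes, focal):
    """Average number of singletons carried per haplotype, split by the
    allele ("0" ancestral, "1" derived) at the focal site."""
    n_sites = len(haplotypes[0])
    # A singleton is a derived variant seen on exactly one haplotype.
    singleton_sites = [
        j for j in range(n_sites)
        if j != focal and sum(h[j] == "1" for h in haplotypes) == 1
    ]
    totals = {"0": 0, "1": 0}
    counts = {"0": 0, "1": 0}
    for h in haplotypes:
        allele = h[focal]
        counts[allele] += 1
        totals[allele] += sum(h[j] == "1" for j in singleton_sites)
    return {a: totals[a] / counts[a] for a in totals if counts[a] > 0}


# Invented data: the first three haplotypes carry the derived focal allele.
toy = [
    "10000000",
    "10000000",
    "10000001",
    "00100000",
    "00010000",
    "00001000",
]
density = singletons_per_haplotype(toy, focal=0)
print(density)
```

In this toy sample the derived-allele haplotypes carry fewer singletons on average than the ancestral ones, which is the direction expected if the derived allele rose in frequency recently.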

Okay, so I said there are some complications. A selective sweep is nice and pure if you have one, but it’s making some assumptions. It assumes that the beneficial mutation occurred once or at least was so rare that you treat it as just one copy, and then it rises to frequency and has all these effects. There’s just that one mutation of pretty large effect.

Selection can happen in other ways. Selection can occur on standing variation, which is variants that have been in the population for a long time. If a variant has been there for half a million years or a million years, recombination has been happening all that time, so the variant is sitting on lots of different haplotype backgrounds, and they’re all going to rise in frequency. You’re not going to see most of these signals. The same mutation can also happen more than once. That’s happened with the sickle cell trait, where the identical mutation has occurred on multiple occasions and been selected for. Then again you’re in a scenario where you have multiple haplotypes increasing in frequency.

Finally, there may be lots of different alleles that contribute to the same trait, and each one may only shift a little bit in frequency, but you can still have a substantial effect on that trait. For that case, these signatures of selection are pretty much hopeless. There are attempts to use modifications of these tests, like looking at allele frequency distributions in the case of selection on standing variation, but it’s just harder. The signal isn’t as easy to find.

Focus on traits rather than alleles

You can still do some interesting things, though, and find some interesting stuff if you stop thinking just about alleles but rather think about the trait. If you know what genes or what alleles are contributing to variation in a trait, then you can aggregate different alleles that are involved in that trait.

This is an example that was carried out by Joel Hirschhorn’s group here at Harvard, looking at height variation in Europeans. We know from genome-wide association studies (GWAS) many of the variants that affect stature. We also know just from observing Europeans that there’s a cline in height across Europe from north to south; northern Europeans tend to be taller than southern Europeans. So, what they did in this figure was rank the SNPs from the GWAS in terms of how large an effect each SNP had on stature; then, on the y-axis, they plotted the difference in frequency between northern Europe and southern Europe. What they find is that the alleles with the biggest effect on stature are also the ones with the biggest frequency difference in Europe. This is consistent across all the major alleles here. This isn’t just a few alleles that randomly happen to be higher. All of the height-increasing alleles have higher frequency in northern Europe or, equivalently, the lower-stature alleles all have higher frequency in southern Europe. This provides pretty good evidence – and it’s been supported by further studies by others – that selection was acting on stature in Europeans in some way. It’s not exactly clear how, but in some way.
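The logic of "all the height-increasing alleles point the same way" can be sketched as a simple sign test. This is only an illustration under stated assumptions, not the analysis from the study: the SNP values below are invented toy numbers, the function names are hypothetical, and a real analysis would also have to account for population structure and linkage between SNPs.

```python
from math import comb


def height_increasing_freq_diff(effect: float, freq_north: float, freq_south: float) -> float:
    """North-minus-south frequency difference of the height-increasing allele.

    `effect` is the effect on height of the tracked allele. If it is negative,
    the other allele is the height-increasing one, and its frequency
    difference is the negative of the tracked allele's difference.
    """
    diff = freq_north - freq_south
    return diff if effect > 0 else -diff


def sign_test_p(n_positive: int, n_total: int) -> float:
    """One-sided probability of seeing at least n_positive positive signs
    out of n_total if each sign were an independent fair coin flip."""
    return sum(comb(n_total, k) for k in range(n_positive, n_total + 1)) / 2 ** n_total


# Invented toy data: (effect of tracked allele, frequency north, frequency south)
toy_snps = [
    (0.8, 0.62, 0.50),
    (-0.6, 0.30, 0.41),
    (0.5, 0.55, 0.49),
    (-0.4, 0.71, 0.78),
    (0.3, 0.20, 0.17),
]

diffs = [height_increasing_freq_diff(e, n, s) for e, n, s in toy_snps]
n_up = sum(d > 0 for d in diffs)
p_value = sign_test_p(n_up, len(diffs))
```

The point of the design is that no single SNP needs a large frequency shift; the evidence comes from the consistency of direction across many trait-associated alleles.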

There have been other studies. This is a similar sort of study looking at a variety of traits correlated with geography. In this case, they found significant selection for a stronger response to damage from ultraviolet radiation if you live near the equator, which maybe isn’t too surprising. Different colors are different continents, and on different continents, the same sort of selection pressure has been operating. So, you can indeed extract a fair bit of information about the trait that’s been selected for, even if you don’t necessarily know which particular allele has been selected.

So, that leaves the question: what traits have been selected for? What have we actually found? What’s been going on in humans in the last 20,000 years or so? Pretty much all the data I’m talking about are from selection that’s occurred within the last twenty to forty thousand years, because that’s the easy place to look. Lots of interesting selection happened before that, like what made us anatomically modern humans, but that’s a lot harder to find and harder to study. So, we’re studying the relatively easier things.

Results of a selection scan

So, as I said, you can scan the genome for these signatures of selection, and many people did. Here’s the result of one of those scans; I guess it was done here. In this figure, there are many places in the genome where there’s evidence that selection happened. There are probably some false positives in there, but there are lots of actual cases of selection too. The question is: What do they do? What was involved? The answer is mostly, we have no idea. Something happened there, somewhere. And we don’t know what the trait was. We don’t know what the selection pressure was. It’s going from “something happened at this locus” to “what’s the phenotype?” That’s hard work; that’s biology. That’s not just sitting around looking at genome data. You actually have to do some biology then.

From candidate to function: EDAR

I’ll provide one example here of the kind of work that can elucidate this sort of thing. This is a little segment of one of those genome scans. This is chromosome 2 in East Asians. This is a test for long haplotype, so there’s a very high score here across this entire region, indicating that selection happened here. Where exactly? Probably somewhere in here, but you can’t really tell.

Audience question: [Unable to hear on video].

Stephen: Relative to the size of a gene. Also, the sheer number of variants here is a problem if you want to see which one of these variants had a phenotypic effect that actually drove this. Because one of these variants was probably actually important, and the rest weren’t. So, figuring out what to test, if you want to do some functional work, is by itself kind of daunting. There are just too many variants here to put into a model organism, say, or to look at in a cell line.

But you can use the trick I mentioned before of combining information from multiple signatures, and that actually cleans up this particular locus very nicely. There are only a handful of candidate variants you’d want to look at, and one of those turns out to be a nonsynonymous change. As you can see here, there are, in fact, multiple genes across this region. It was a nonsynonymous change in a particular gene known to be involved in the development of hair and sweat glands. It’s known to have been under selection in other organisms as well. That then gives you something you can look at in more detail.

Mouse model EDARV370A

You can take that variant and study it in the lab. Somebody who was doing a postdoc at Harvard, working with the Broad Institute, did that. This was selection that apparently happened in East Asia, where this variant increased in frequency, and she put this particular variant in a mouse to see what it did. Well, it did several different things. It produced thicker hair, smaller and denser mammary glands, and a higher density of eccrine sweat glands. It turns out that at least the first and third effects, the hair and the sweat glands, are also seen in humans who carry the variant. The thicker hair is something you can actually see; East Asian hair doesn’t look the same as European hair, typically. So, this is a mutation that was selected for. It’s illustrative in that by focusing on something that was selected for, we found something that clearly has a notable phenotypic effect in humans, one that distinguishes humans from one another.

It also illustrates one of the problems, which is that mutations don’t always have one effect. So, it affected hair and it affected sweat glands – and we don’t know why. You can guess maybe it’s something about temperature regulation, and sweat glands are important. Or maybe hair was important. We don’t know. So, it’s not a final answer, but it was a major clue to finding something that’s phenotypically important and that distinguishes humans from one another.

If you look more broadly, this is from a very recent review article by Sarah Tishkoff’s group in Science about what we’ve learned about regional selection in human populations. You can see, on the map, we’ve learned a number of different things, and several of the cases I’ve talked about are on this map. I’ll just mention them: selection for changes to diet – the lactase persistence case up there in Europe – that’s one of the classic ones. If you can’t find lactase, then you’re doing something wrong when you’re studying selection. But that’s also been studied in East Africa and the Middle East, where there have been other herding populations. These turn out to be independent mutations in the same regulatory region that have the same phenotypic effect. So, it’s exactly the same pathway, exactly the same mechanism, but it’s occurred multiple times in different places. There are other populations around the world that have lactase persistence, in South Asia and West Africa, and these have not been studied in any depth at this point. So, exactly what the mechanisms are there is as yet unknown. I mentioned skin pigmentation. There are a bunch of genes where alleles are known to contribute to the paler skin color that Europeans have, which varies with latitude. Those have been well studied. Other genes are known to have mutations in Asia – some of them independent, some of them shared. So, it’s a partially different set of mutations. There are other parts of the world where pigmentation also varies. Within Africa, for instance, there’s quite a lot of variation and, probably, some of it has been under selection, but again, the studies haven’t been done there. There are plenty of things still to study. I mentioned the polygenic selection on stature in Europe. It turns out there’s also been selection on stature elsewhere, particularly in rainforest environments. The selection for smaller stature in humans – pygmies that tend to have small stature – is well-known in the Central African rainforest.
Selection there has operated in a rather different way. In Europe, it was very polygenic – selection on lots of different alleles. In Central Africa, it apparently was strong selection operating on a handful of loci for shorter stature. It might be because the pressure was of a different kind, or it may just be that’s the way it happened to work. Don’t really know. I think the last one is the circle here for high altitude adaptation. That’s been studied in the Himalayas, and it’s also been studied in other high places, in the Ethiopian highlands and in the Andes. In these cases, you find independent mutations, most of them in the same pathway leading to a similar phenotype. So, the same selection pressure, different mutations, but similar outcome.

Audience question: What’s the thinking behind the selection for short stature and the advantageous nature of that?

Stephen: I don’t know much about the thinking. It might just be a matter of resources – you’re better off if you need to eat less because it’s not a very nourishing environment. But it’s not something I’ve looked at in any detail.

So, those are the cases we’ve understood something about. We don’t always necessarily know exactly what the selection pressure was, but at least we have some idea what the phenotype might be.

How much positive selection is there, anyway?

One question that might occur to people is, well, how much of this is there? How much has natural selection been operating in humans recently? It turns out to be kind of tricky to figure out. There are frequency differences between populations, and they vary across the genome. How much of that is neutrally occurring, and how much is the result of selection? Especially since we don’t really know the demographic history of humans in detail, we can’t exactly model it and tell you what the distribution should look like. So, it’s kind of an open question how much positive selection has happened.

One way of trying to address this is to look at that reduction in diversity, particularly the reduction of diversity around genes. And I said that can be caused either by selective sweeps happening repeatedly, in positive selection, or by background selection. So, you need to find a way to distinguish them. What one group did was look only at genes. Presumably, background selection is happening all the time there, and they compared places where there’s been a nonsynonymous change in humans over the last million years or so to places where there’s only been a synonymous change. The idea is that if there have been a lot of selective sweeps, you should see reduced diversity around the nonsynonymous changes. Here’s the plot from their paper. This is Hernandez et al. from 2011, a few years ago now. Here’s where the substitution happened at some point, and they plot diversity around synonymous substitutions in blue and around nonsynonymous substitutions in red. Those distributions are basically the same. There’s no obvious difference between the nonsynonymous and the synonymous.

Conclusion 1

And their conclusion was, “We don’t see any evidence for lower diversity around functional changes.” Classic selective sweeps, sure, there’ve been some, but they’re not a major factor. Maybe it’s been, you know, this polygenic selection, or other kinds of selection are happening. Basically, they said the conclusion is, “You should stop wasting your time looking for these things,” which was a little annoying for those of us who were looking for them, particularly those of us who wanted to get funding to look for them. Because then, you know, the reviewers of the grant proposal say, “Why are you looking for this? Hernandez et al. just showed that there aren’t any selective sweeps, or you know, we’ve already found them all. Don’t waste your time.”

And then a couple of years later, there was another paper from a different group, David Enard, Dmitri Petrov’s group, and they came to a very different conclusion. I think I have time to go through this. They looked at the same data and they said, “No, you’ve drawn the wrong conclusion. Because the problem is, you’re looking at nonsynonymous mutations.” Well, let’s think about that a little bit. Say you’ve got two genes. Gene A is highly constrained: every amino acid is a precious jewel, and if you change one of these, it’s bad. So, anytime there’s a missense mutation, it’s deleterious. You will find very few missense mutations that happen neutrally. So, if there are any missense substitutions, they will have been beneficial. What that also means is that lots of background selection happens; there’s lots of purifying selection going on all the time, because all of these mutations are bad. So, if you look at diversity around that gene, there’s going to be very low diversity. Now, Gene B, on the other hand, is a weakly constrained gene; a lot of the amino acids don’t do anything, so there are a lot of neutral changes. The consequence is that there’s going to be less background selection, because there are fewer opportunities for deleterious mutations. And so, around those genes, there’s going to be a smaller reduction in diversity. What they say is that when you look at nonsynonymous substitutions that have happened, which is what the Hernandez et al. group did, you’re picking primarily from this second group, from the weakly constrained genes. On the other hand, when you’re picking synonymous mutations, you’re picking from both of them. So, you’re biasing yourself toward places where background selection hasn’t reduced diversity much. And so, if there are a lot of cases of positive selection going on, selective sweeps, you’re going to completely lose them because of this bias.
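The bias in that argument can be worked through with a small deterministic calculation. This is a toy model with invented numbers, not a reconstruction of either paper's analysis: two equally common gene classes, a baseline diversity set by background selection, a relative rate of nonsynonymous substitution, and an assumed fraction of nonsynonymous substitutions that were sweeps. All names and values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneClass:
    name: str
    baseline_diversity: float  # diversity left after background selection (toy units)
    nonsyn_rate: float         # relative rate of nonsynonymous substitutions


def expected_diversity(
    classes: List[GeneClass],
    sweep_fraction: float,
    sweep_reduction: float,
) -> Tuple[float, float, float]:
    """Return (around synonymous, around nonsynonymous without sweeps,
    around nonsynonymous with sweeps) expected diversity.

    Synonymous substitutions are assumed to sample gene classes equally;
    nonsynonymous substitutions sample them in proportion to nonsyn_rate.
    """
    syn = sum(c.baseline_diversity for c in classes) / len(classes)
    total_rate = sum(c.nonsyn_rate for c in classes)
    nonsyn_neutral = sum(c.nonsyn_rate * c.baseline_diversity for c in classes) / total_rate
    nonsyn_with_sweeps = nonsyn_neutral * (1.0 - sweep_fraction * sweep_reduction)
    return syn, nonsyn_neutral, nonsyn_with_sweeps


# Invented values: Gene A is constrained, Gene B is weakly constrained.
toy_classes = [
    GeneClass("A (highly constrained)", baseline_diversity=0.6, nonsyn_rate=0.2),
    GeneClass("B (weakly constrained)", baseline_diversity=1.0, nonsyn_rate=1.0),
]

# Assume 30% of nonsynonymous substitutions were sweeps that halved nearby diversity.
syn, nonsyn_neutral, nonsyn_with_sweeps = expected_diversity(toy_classes, 0.3, 0.5)
```

With these made-up numbers, diversity around nonsynonymous substitutions would be higher than around synonymous ones if there were no sweeps, and the assumed sweeps pull it back down to about the synonymous level. The two curves then look the same even though sweeps are common in the toy model, which is the masking effect described above.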

Conclusion 2

And so, their conclusion was that there’s actually strong evidence for lots of selective sweeps in recent human history. There’s been a high rate of strongly adaptive substitutions near amino acid changes, and there have been even more sweeps driven around regulatory changes, which I think we know independently. So, that same data, very different conclusions, and a little heartening to those of us who are interested in selection.

I mentioned that new test for selection, the density of singletons. If you just accept their data at face value, the authors conclude that lots of different quantitative traits in humans that have been studied by GWAS, like more than half of them, show evidence of having undergone selection in the last 2,000 years. I don’t know if I believe that, but certainly their evidence suggests that lots of slow selection of some kind is going on all the time. Okay, so that’s where we are in terms of studying selection from just looking at genetic variation and what its effect is.

But there’s one other topic that’s really interesting because it gives us a whole new way of looking at this, and that’s ancient DNA. Because in all these things, we’re inferring what happened in the past. In all these kinds of studies, you look at genetic variation in a bunch of Norwegians, and you try to figure out what happened 10,000 years ago. But now we can actually look at DNA from 10,000 years ago, 5,000 years ago, and see directly what’s changed between then and now. People have done this. We’re now at the point where we’ve sequenced enough ancient genomes that we can compare allele frequencies in the past to allele frequencies now.

230 ancient genomes

And this is data from a paper from last year, from David Reich’s group at Harvard. The colored dots are the frequencies of several particular alleles of interest in several ancient populations, and the dashed lines are the frequencies in modern European populations. These ancient populations are basically from Europe and West Asia. I’ll point out a couple of cases. The top-left one is lactase, again, this classic example. You can see that the frequency of the allele that gives you the ability to digest lactose was at or near zero in these ancient populations. In modern southern European populations, it’s still fairly low, but in northern European populations, it’s very high. So, this is selection that’s happened just in the last few thousand years. You can see that the allele has dramatically increased in frequency over that time.

Here on the other hand is a pigmentation allele, one of the major ones that contributes to European paleness. You can see that it was at different levels in different populations. The steppe peoples, these are the people of the Western steppes in Asia, were apparently paler than these other Europeans down here. But they’re all at a lower frequency than in modern Europeans. So, this is a selection that was ongoing at this time and has continued into the present.
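As a rough illustration of comparing an ancient allele frequency with a modern one, here is a naive two-proportion z-score in Python. This is not the method used in the paper, which would also need to account for ancestry mixture and genetic drift; the function name and the example counts in the usage note are hypothetical.

```python
from math import sqrt
from typing import Tuple


def frequency_shift(
    ancient_count: int,
    ancient_n: int,
    modern_count: int,
    modern_n: int,
) -> Tuple[float, float]:
    """Return (modern minus ancient allele frequency, naive z-score).

    Counts are numbers of sampled chromosomes carrying the allele;
    *_n are total sampled chromosomes. The z-score only reflects
    sampling noise, not drift or population replacement.
    """
    p_ancient = ancient_count / ancient_n
    p_modern = modern_count / modern_n
    pooled = (ancient_count + modern_count) / (ancient_n + modern_n)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / ancient_n + 1.0 / modern_n))
    z = (p_modern - p_ancient) / se if se > 0 else 0.0
    return p_modern - p_ancient, z
```

For example, `frequency_shift(0, 100, 50, 100)` describes an allele absent from 100 ancient chromosomes and present on half of 100 modern ones. A large z-score here only says the two samples differ; whether the change reflects selection or migration is exactly the kind of question the ancient DNA studies have to untangle.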

If you look across the genome – I don’t know if you’re used to Manhattan plots, but this is the entire genome spread out – at the places where there are signals that selection has happened, just from comparing the ancient DNA to the modern DNA, you find a lot of the same things. You find the skin pigmentation genes that are already known. You find genes including lactase and fatty acid desaturase, which has been seen to be under selection in other populations. And you find selection for resistance to infectious disease at HLA and the toll-like receptors.

Ancient DNA tells a similar selection story

So, the basic story we get from ancient DNA is very similar, which is heartening, because we were reconstructing the past based on computer models, and it’s nice to see that when you actually can look at the ancient DNA, it tells you we were right in a lot of these cases. We really were correctly inferring that selection happened. But you get a lot more detail when you look at the ancient DNA, because our computer models are simple. So, I’ll give you a few examples of how the story differs and end here.

I talked about this gradient of stature in Europe, this north-south gradient. So, we concluded selection happened. Well, looking at the ancient DNA, they could draw conclusions in a little more detail. They conclude that selection for shorter stature happened in southern Europe, and that selection for taller stature happened in West Asia, in these steppe populations. That seems to be where we see it, and we see this greater height in northern Europe because those people then moved into northern Europe. It wasn’t necessarily selection happening in northern Europe for greater height – it was just people happening to migrate in. Looking at modern populations, we would have no idea about this. It’s very hard to figure out where all these people have been moving around. We tend to assume that if we’re looking at Chinese people today, their ancestors were living there 10,000 years ago. No, people move, and they move a lot, it turns out from looking at ancient DNA, sometimes in ways that we couldn’t tell at all from modern DNA.

Alright, a second case, pigmentation. There are two main genes that contribute to typical European pigmentation, with very similar names: SLC24A5 and SLC45A2. And we know from the genetic evidence in modern Europeans that they’ve both been under strong selection. There’s an allele that’s risen to high frequency in both cases. It turns out the two genes have somewhat different histories, looking at the ancient DNA. One of them – actually, it’s the one I just showed, this skin pigmentation gene – you can see rising in frequency within Europe as selection was occurring for lighter skin. The other allele actually entered Europe at a very high frequency with the first farmers when they moved in from Anatolia, which is modern-day Turkey. They largely replaced the European population, and they already had lighter skin, presumably as a result of earlier selection. So, you get a much more detailed picture of the history this way.

And finally, the last case is the case of EDAR, the one that gives you more sweat glands and thicker hair. When we were studying it, the story seemed pretty simple. This plot shows where that allele is present: it’s present in East Asia and it’s present in the New World, because people from eastern Asia populated the New World. And so, it was clear, they did very detailed modeling for this. This is estimating where this allele originated, and it originated in central China 30,000 years ago. So, it was a nice, simple story. There was a Cell paper, and it’s a great paper, but the problem with this particular conclusion is that when you look at the ancient DNA, it turns out this allele was at high frequency in Swedish hunter-gatherers 6,000 years ago, which is not something you would guess from looking at the modern distribution. It was still under selection in East Asia, and that story hasn’t changed, but the details of what happened historically are more complicated. It probably arose in western Asia and happened to be selected for in eastern Asia. So, this is the kind of information you get from it. Really, what we have is a tiny snapshot of ancient DNA. As we get more DNA, we’ll be able to learn a great deal more, at least from places where there is ancient DNA – in a lot of the world, you know, the tropics, DNA does not preserve well.

I’m going to conclude with a couple of comments. So, recent positive selection has clearly had a significant impact on human phenotypic diversity, both between individuals and between groups. Exactly how much? You know, some traits – many traits, at least two traits – have been changed by this. And many of these traits are of medical interest or biological interest. It’s one way of finding out where these important phenotypic changes are. So, if you study natural selection, you can identify places where things have changed. It’s not, by itself, an all-purpose tool. It’s a clue. It has to be combined with functional work, with GWAS, with association studies, with all kinds of other things. But it is one tool in the toolkit. And I’ll stop there, thanks.