ANDY CLARK: As Matt just said, I'm Andy Clark. My background, I did my PhD work at Stanford with Drosophila melanogaster. So I kind of come from the model organism side of things. Worked on that for many years then backed into human genetics, and nowadays we're working on horse, donkey, mule, hinny, opossum, Spix's macaw, Florida scrub jay, the list goes on and on and on. So things have just exploded with the ease with which one can collect genomic data now, and you can tailor your question to the appropriate organism.
And so my talk-- where's the keyboard-- is geared to-- why isn't that working? Is geared to sort of just being a pep talk. Mind you, I do love sweating the details. I think that's going to be really-- you're going to enjoy getting into the details of dealing with sequence data very soon and that's important, but what I'm mostly going to do is try to motivate. Why do we want to collect genome sequences and what are we going to expect to learn from them?
And so the plan for my roughly hour talk here is to talk briefly about how genomic data reflect past selection, a little bit about codon substitution models and how they can be used to test selection, a little bit about site-frequency spectrum tests, some of the complications that you need to worry about from the very beginning with respect to dealing with site-frequency based tests, polymorphism and divergence based tests, extended haplotype test, hard and soft sweeps, and patterns of inter-population divergence. So it's sort of a hodgepodge. I'm going to go pretty quickly so let's have the scheme of being bold and screaming questions if you encounter anything that's just mysterious.
So some of the attributes of DNA sequence data that reflect past selection are listed here. First of all, patterns of synonymous and nonsynonymous variation in protein coding genes. We'll see a little bit about that and apply it. I mentioned the site-frequency spectrum already. That's one of the first things you get when you get sequence data.
There are a lot of details having to do with the kinds of errors that you make when you have less than perfect sequence. And you'll see from the beginning that you'll never get perfect sequence, that dealing with error is an important issue to be concerned about, and the site-frequency spectrum in particular is one that gets distorted by errors. Aberrant patterns of linkage disequilibrium, exceptionally long haplotypes, patterns of polymorphism that are out of kilter with the levels of divergence between species, and then divergence between populations. That's pretty much in the same order, they just happen to be the particular attributes of the sequence that we'll go through.
So this idea of looking at differences at synonymous and nonsynonymous sites is particularly informative when you have multiple species. It's particularly applied in the situation where I'm comparing a particular single representative of multiple species and I want to know, was there some difference in the divergence time between those species? So one can draw a tree of human, chimp, and mouse. And I think the very first genome wide study for looking for selection in humans was ours in 2003 in Science. At the time I had no idea that this would explode and be its own little kind of cottage industry of trying to infer selection in humans. I guess I should have anticipated that.
At any rate, one can actually estimate rates of divergence at nonsynonymous sites-- that is, changes in protein coding sequences that actually affect amino acid sequence-- in a branch specific way. And the particular thing that's being illustrated here is one can ask, is there a particular attribute of the nonsynonymous substitution rate on the branch towards human that's different from the other branches of that tree? And one can construct formal likelihood tests and ask, is there a significant acceleration of the rate of nonsynonymous change in humans? And so those two terms, dN and dS, then again are the rate of nonsynonymous substitutions per nonsynonymous site and the rate of synonymous substitutions per synonymous site, those that do not change the amino acid.
So the parameter in these so-called codon substitution models that Rasmus Nielsen, Ziheng Yang and a number of others have developed over the years is this parameter omega. And one can specify the expected pattern of substitutions of synonymous and nonsynonymous sites under this model. And one of the parameters of that model is this omega, which you can then estimate.
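For reference, the ratio described above and its usual reading can be written out as follows. The interpretation thresholds are the standard convention for codon models, not something stated explicitly in the talk.

```latex
\omega = \frac{d_N}{d_S}, \qquad
\omega < 1 \ (\text{purifying selection}), \quad
\omega = 1 \ (\text{neutral evolution}), \quad
\omega > 1 \ (\text{positive selection})
```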
So there are some issues with the calculation of dN/dS that can make it not very informative. And again, we're starting with human, chimp. We count chimp as a non-model organism I guess here. Human and chimp genes differ in only a few nucleotides. It's only a percent or two. And so even very large genes don't have that many substitutions. So this is a second issue, of power.
So statistical power is something that will come up again and again in your discussions about the kinds of tests that you can do. And a low power test means, even if the null hypothesis is false-- that is, even if there is selection operating-- you're very unlikely to be able to reject it because you don't have enough data, basically. And so doing single gene tests, as the genes get shorter and shorter-- obviously, very short genes, human versus chimp comparison-- you'll just never be able to reject the null hypothesis.
This dN/dS can be spuriously large if that denominator dS is small. That's sort of obvious. Any test that's a ratio often has aberrant statistical properties and so we need to be able to deal with that. So that paper was published and it found lots of genes involved in immunity and so forth that are important, and many, many other subsequent studies have come down the pike and we see that the immune system is one of the most rampant targets of natural selection that you see because pathogens are one of the strongest forces of selection historically in many, many genomes in any organism.
Including, it turns out, insects. So a particular illustration of this method of dN/dS I'd like to show you emphasizes one other point, which is, don't just think of your one non-model organism, think of the tree of the several relatives of your non-model organism. Because if you can get phylogeny based data, multiple sequences from related species, you're that much more empowered.
So Hollis Woodard and Brielle Fischman visited my lab for a couple of weeks. They're students of Gene Robinson and had exome sequence, that is, protein coding sequence, from a number of bees. And the question here was about the origin of sociality in bees. And you think of all the questions that you're asking like, what are the genes that are involved in that which allowed bees to live in groups harmoniously and have specified tasks and castes and so forth? Kind of a high level question that you wouldn't think you could make much traction on by just doing some genome sequence, but that was the end-- the hope anyway in this study.
So lots of attributes were acquired in the process of becoming social. The wonderful thing about this is you can draw these bees on a phylogeny. There is enough sequence data to do a pretty good phylogenetic tree. And you can color code the tips of those trees red for the social ones-- Apis mellifera is the well known honeybee-- and you can see that Apis mellifera and florea are both social, eusocial, truly social bees with castes and everything. And so then you can actually make this inference that if those are both social then maybe the common ancestor was social. But if these are solitary bees, then sociality must have evolved along that lineage. That's the kind of thinking that we're doing.
It turns out that sociality evolved a couple of times, apparently, if you draw it on the tree in this way. And so one can ask about rates of synonymous and nonsynonymous substitutions on branches which lead to sociality and branches which didn't. And I'm going to skip a couple of these just to kind of keep up the speed. So basically one does this tests-- these tests on nonsynonymous and synonymous as a formal hierarchical likelihood based test. So you can say, what's the probability of this data under this model where we allow for different rates on social and solitary branches?
So we're specifically estimating an omega parameter along these red branches separate-- from a separate omega parameter along the black branches. So that's a model. We get a goodness of fit and all that sort of thing, a chi-square like. The null hypothesis is everybody has all the same rates all throughout. And you can do likelihood ratio tests by actually comparing the goodness of fit of those two. And that was the basic scheme for this.
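The likelihood ratio comparison described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: it assumes the two models are nested and differ by exactly one free parameter (one extra omega), so the statistic is compared to a chi-square distribution with 1 degree of freedom. The function name and the log-likelihood values are made up for the example.

```python
import math

def likelihood_ratio_test(lnL_null, lnL_alt):
    """Compare two nested models that differ by one free parameter (df = 1)."""
    stat = 2.0 * (lnL_alt - lnL_null)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimizer noise
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods for one gene (invented numbers, not from the study).
lnL_one_omega = -2345.67   # null: a single omega on every branch
lnL_two_omega = -2340.12   # alternative: a separate omega on the social branches
stat, p = likelihood_ratio_test(lnL_one_omega, lnL_two_omega)
print(f"LRT statistic = {stat:.2f}, p = {p:.4g}")
```

In a genome-wide scan this is repeated for every gene, so the resulting p-values would still need a multiple-testing correction.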
And it worked beautifully. Got a nice paper in PNAS out of it. And we did find a number of genes that looked like they were accelerated on the social branches. I should also mention that it's not quite as simple as I just glossed over because there are different issues about exactly how many times did sociality arise. So you need to do the test a couple of ways, a couple of times, based on these different models of when sociality arose. So just add that caveat.
So this is just showing a few genes that are accelerated, most accelerated, evolving rapidly on the social lineages. That is, specifically, we reject the null hypothesis of equal rates and conclude that there's accelerated amino acid substitution on the social branches. And this is just a list of a few of the top genes with a predominant emphasis on genes that are involved in behavior and brain function and metabolism. Now, metabolism had been seen before. Lots of aspects of honeybee metabolism are different. They're revved up, basically, compared to solitary bees.
So a number of the blue ones I'm not going to say any more about, but some of these ones that are involved in behavior, you might think, well, so social bees, they actually have to live a pretty complicated life. They have to understand how to show each other where the food is, all the stuff that social bees do. And that one dunce particularly caught my attention. I almost fell off my seat, in fact, because I knew about this work that was published in 1976. It pays to read the older literature, too.
Chip Quinn actually was a student with Seymour Benzer, the Nobel Prize winner. Seymour Benzer got interested in his book, Time, Love, Memory. If you haven't read that, it's a terrific history of genetics and science book that Seymour Benzer wrote on his thinkings-- that was written about him that was-- sort of followed his career. And towards the end he studied behavior. How do you do formal genetics of behavior?
And Chip Quinn was one of his first students getting into this and they devised this instrument here. It was an olfaction kind of machine where the flies could go into one tube or the other. This is a model organism, sorry about that. But we'll get back to the-- and so the flies could enter the tube or not and there would be different odorants in the tube. And if it was an odorant that they wanted to train the fly to avoid, like benzaldehyde or something, they'd have a shock and the flies would get a shock and they'd learn if they smell benzaldehyde they'd get a shock. And very, very quickly they learned to avoid that.
And so this is showing, with a control odorant, they stay-- this is the fraction of time they stay in the start tube. It doesn't change over time. But if they shock them then they quickly learn and then they could measure, are there genetic differences between lines that learn quickly or fail to learn or forget quickly and so forth? So there are all these learning mutants and they were named things like turnip and rutabaga. They started out with different names of Nobel Prize winners but they thought that was offensive and so-- they honestly did that.
And so dunce, you can see here, actually-- you can see it's not particularly learning here, but then it very, very quickly-- let me see what's going on here. So this is-- OK. So they're training it. So when they get a shock they leave the start tube. But then in the test here, you can see that-- the last one over here is the test-- do they actually learn? Do they avoid it when there's not a shock? With dunce the answer's no and the normal ones, of course, they do remember and avoid that.
So here's a specific gene, cyclic AMP phosphodiesterase clearly implicated in learning behavior in flies. And that's the fastest evolving behavior gene in the honeybee. So we're off and running. Gene Robinson has a huge lab at the University of Illinois and is doing lots of follow up stuff on that. So that's one little vignette.
The idea that you can do this with multiple species has hit a number of individuals. So we just did it with multiple species in that bee case. But you can do it with multiple mammals as well. And Carolin Kosiol, who was a former post-doc here, expanded this test and looked at, so what are the genes that are showing accelerated rates in different parts of the mammalian tree? This one here is just the human specific branch. They only found 10 genes showing acceleration by their test, and one can very specifically dial in to what particular genes are accelerated in what parts of that tree.
So if you have a particular phylogeny that has some organisms over here with big flippers and organisms over here that have some other attribute, you can start to dial in and ask very specific questions about adaptation in those different features. And again, from Kosiol's paper they're finding not only immunity and defense but also sensory perception. So many genes that are involved in sensing the environment also seem to be rapidly evolving. And that's also true in insects as well as mammals, which is pretty interesting.
I'm not going to go into details of this but this is just showing that these methods that let you get synonymous and nonsynonymous rate differences, also allow you to scan along each gene and ask, are there particular parts of a gene that are fast evolving? And for immunoglobulins it's regions of those immunoglobulins that are interacting with the antigens. And so if you have a particular protein of interest that you think is interacting with the environment or with sensing or something like that, you can specifically ask, am I seeing a signature of positive selection specifically on the parts of the molecule that I think might be involved in some adaptive process. And often that's been successful. This is just one example of many.
OK. Any questions at that point? dN/dS kind of stuff? So you'll see lots in the literature of it going hot and cold. People sometimes don't like it and complain about the statistics of it, but it's among the more formally rigorous kinds of tests. There are lots of things that can foul it up so you need to be careful. But there's beautiful software for it. One called PAML. I don't know if you guys are going to deal with that at all this meeting-- in the workshop-- but it's out there and pretty easy to pick up. And scales to genome wide tests pretty well. There are other accelerated forms of it as well.
OK. Site-frequency spectrum. So this is in the situation where you get your genome and sequence a number of individuals within a population. So this is population genetics now where we have variability within a population, multiple individuals within a population. And we're going to be looking at this thing here which is you're counting up the sites that have the following property. So there'll be a SNP, a polymorphic site, where there is one copy of a derived allele and all the rest are the sort of common allele. So it's a very rare variant where only one individual has it. That's known as a singleton.
So the count of how many there are, the proportion of SNPs that are in that class, the singleton class, form this first bar here. Those are singletons. And then if there are two copies of the rare allele and the rest are the common allele, that's a doubleton. That's the next bar over. Three copies and the rest are common. So as you move to the right, you're looking at more and more common variation. These are the rare variants, the singletons, and this is more common variation.
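The counting just described can be sketched as follows. This is a minimal illustration under stated assumptions: each sampled chromosome is a string of '0' (ancestral) and '1' (derived) calls, the ancestral state is known, and there is no missing data. The function name and the toy sample are invented for the example.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Return a list where entry k-1 is the number of sites with k derived copies.

    `haplotypes` is a list of equal-length strings, one per sampled chromosome,
    with '0' for the ancestral allele and '1' for the derived allele.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    counts = Counter()
    for site in range(n_sites):
        derived = sum(h[site] == "1" for h in haplotypes)
        if 0 < derived < n:          # keep only polymorphic sites
            counts[derived] += 1
    return [counts[k] for k in range(1, n)]

# Toy sample of 5 chromosomes and 6 sites (invented data).
sample = [
    "100100",
    "000100",
    "010000",
    "000011",
    "000010",
]
sfs = site_frequency_spectrum(sample)
print(sfs)  # singletons first, then doubletons, and so on
```

Here three sites are singletons, two are doubletons, and one site is monomorphic and so is not counted.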
And so what we're looking at here is a sort of a cartoon of a couple of different kinds of site-frequency spectra. Oh. They should have been in different colors, but anyway. So what we're looking at is-- that's not too well labeled but-- what happens is there are different distortions. This gets pushed one way or the other depending on forces that are at play. So if there's purifying selection, purifying selection means that the new variant is deleterious so get rid of it. So it's pushed ever rarer in frequency. So purifying selection would be like this blue bar. It would be-- the variation would be pushed towards a rarer class.
A sweep in progress would be like the red bar where there's some common variant that's favored. So you see this high frequency of a derived allele. High frequency derived alleles might be a positive sweep. Heterozygote advantage is not shown here. Heterozygote advantage, we'll see in a minute, has a bump in the middle. And there are other forces that are at play on populations that also distort the site-frequency spectrum. So we need to think through all of those.
And some of the caveats then, some of these other factors besides selection that drive it, are ascertainment bias. So if the SNPs that I'm looking at are on a microarray and I pick the SNPs to be on that microarray for some reason, like I know that they're intermediate frequency so I'll put them on the chip and then I test them, that's obviously not like a random sample of SNPs. There's ascertainment bias to them.
Admixture. So if there are individuals from two different populations that are combining and interbreeding, you can see you get rare variants from those two outside populations and there will be some distortions there. Population stratification. The easiest way to see this is, you imagine here is a territory I'm sampling my organisms from, and in fact, instead of that being one territory, there's like a boundary between two different subpopulations that don't interbreed very much but breed within those two subpopulations.
So if I consider them as being one panmictic unit when really it's subdivided, that's a very, very common source of error. And that can be fairly subtle so one has to be very careful about population stratification. We'll mention some ways to deal with that.
And finally, various sorts of demographic histories. So rapid population growth also generates excess rare variation. In fact, one of the odd things in human population genetics is the explosive very recent growth that we've seen in our population gives a horrific distortion. There are five times more singletons than expected in the human population if you take a very large sample. So huge distortion just caused by population growth.
So one of the tests that's commonly used for the site-frequency spectrum distortion is known as the Tajima's D test. And just a couple of terms. We'll be encountering these here and there. One is this term theta. Theta is 4 times N times mu. That's a common population parameter where it's the population size, or effective population size, times the mutation rate. In a population that's at balance between mutation and drift, the level of heterozygosity per nucleotide site is roughly equal to theta for theta small if you're thinking about a per site basis.
And we can estimate it by just counting up the number of segregating sites of all frequency classes and dividing by this sum, which is-- little n is the sample size-- so it's 1 plus 1/2 plus 1/3 plus 1/4 up to 1 over n minus 1, whatever that is. So you can see, the more segregating sites you have, the larger is theta. And theta is sort of a summary of per nucleotide variability. So that's Watterson's theta and that's a good parameter under a strictly neutral kind of variation.
Another parameter we can estimate is pi. Pi is like calculating the heterozygosity at each site. The heterozygosity is the probability that if I draw two copies, from two individuals, the probability that they are different from each other. That's the heterozygosity. And they average that across all sites. That's pi. It's just simply the average nucleotide heterozygosity. And this shows one of the ways that you can calculate that.
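Written out, the two estimators described above take the following standard forms. The symbols S (number of segregating sites), n (sample size), a_n, d_ij (number of differences between sequences i and j) and L (number of sites surveyed) are notation introduced here for the sketch, not taken from the slides.

```latex
\theta = 4 N \mu, \qquad
\hat{\theta}_W = \frac{S}{a_n}, \quad a_n = \sum_{i=1}^{n-1} \frac{1}{i}, \qquad
\hat{\pi} = \frac{1}{\binom{n}{2}} \sum_{i<j} d_{ij}
```

Dividing either estimate by L puts it on a per site basis.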
And Tajima's D is simply the difference between these. It's actually normalized by the standard deviation of that difference. And this is just showing that a site that's under no selection-- I just stole this off the web, I'll admit. It's actually a pretty nice slide. So you have some phylogeny to get the samples of alleles that you're looking at that are right here. And mutations occur along the branches of this phylogeny, and so this particular haplotype has these mutations and these mutations and this one.
So this is then a sketch of the different haplotypes you get from this tree. It's a very nice kind of way to see that whole picture. And from this picture also you can count up. Here's a doubleton, here's a singleton, here's another doubleton. So you get a site frequency spectrum as well from that tree. So those things ought to sort of relate conceptually in your mind. A tree, a collection of haplotypes, and a frequency spectrum.
And there's some particular relationship between the heterozygosity at each site that we get from this, calculated by pi, and Watterson's theta, from the number of segregating sites, and if everything's strictly neutral that difference will be 0. If that difference tends to be negative, which means there are too many segregating sites for the level of heterozygosity-- in other words, there's not enough per site heterozygosity, which happens if the site-frequency spectrum is skewed towards too many rare variants-- then Tajima's D is negative. Tajima's D is negative when you have purifying selection or excess population growth and so forth.
So that's just one of several other tests that are basically crude overview tests of the site-frequency spectrum. There are many other follow ups to this that essentially fit the whole site-frequency spectrum to a model and so forth. So that's showing, again, with purifying selection, you get this distortion of excess rare variation, Tajima's D goes negative and you'll see a pattern that looks like this.
This is just showing one of the other distortions of the site-frequency spectrum. I just came across this picture too and thought, oh, that would be fun to show. So this is just looking at Atlantic versus Gulf populations of sea bass, sparrow, horseshoe crab. Many organisms show relatively low migration around Florida, and so you compare the southeast Atlantic coast against the Gulf and you'll see this kind of break as though the populations used to be connected before the sea level lowered and Florida rose up out of the sea and isolated these two populations. And now they've been isolated and they've been diverging, and you see this in many, many of these.
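As a rough sketch of how this statistic can be computed from a site-frequency spectrum: the code below uses the normalizing constants from Tajima's 1989 formulation, and assumes an unfolded spectrum with no missing data. The function name and the toy spectra are illustrative only.

```python
import math

def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS, where sfs[k-1] = number of sites with k derived copies."""
    n = len(sfs) + 1                      # sample size (number of chromosomes)
    S = sum(sfs)                          # number of segregating sites
    if S == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    pairs = n * (n - 1) / 2.0
    # pi: average number of pairwise differences, summed over sites
    pi = sum(count * k * (n - k) for k, count in enumerate(sfs, start=1)) / pairs
    theta_w = S / a1                      # Watterson's estimator
    # Constants for the standard deviation of (pi - theta_w)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - theta_w) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Toy spectra for n = 10 chromosomes (invented counts).
excess_rare = [20, 0, 0, 0, 0, 0, 0, 0, 0]       # all singletons
excess_middle = [0, 0, 0, 0, 20, 0, 0, 0, 0]     # all at intermediate frequency
print(tajimas_d(excess_rare), tajimas_d(excess_middle))
```

A spectrum whose counts fall off as 1/k, the neutral expectation, gives pi equal to Watterson's theta and so a D of zero; an excess of singletons pushes D negative and an excess of intermediate-frequency sites pushes it positive.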
So if you didn't know about this-- and oyster as well-- if you didn't know about this and just considered this one population, all the sites that are rare in this population different from these would give this excess. These patterns, if you lump them together, would all give a strongly negative Tajima's D. I guarantee it. There would be excess rare variation in samples like that. So drawing the tree and trying to infer, am I seeing really a single panmictic population, is an important first step before you apply these tests.
Let's not go into this in more detail. This is a-- ascertainment bias comes up in a situation where you're designing a microarray and scoring SNPs in a very large number of samples with a microarray. And let's not go there yet. So this-- let's skip that, too.
One other thing that you can compare, many organisms have sex chromosomes. There are lots of very interesting things going on with sex chromosomes where the nature of the way selection operates on the sex chromosomes is different. Organisms that are XX female, XY male, means that mutations that occur on the X are hemizygous in males. So if they're otherwise recessive, males sort of immediately display those-- effects of those mutations.
And so you often see this phenomenon of a faster male evolution or faster X evolution. The X chromosome tends to spend more of its time in females than males and that has consequences. So there are many, many attributes that are of interest in contrasting the autosomal and the X chromosome. And this is just showing that-- in this particular, this actually is human data-- if you look at autosomal and X variation you do see this excess rare variation on the X and so there's more purifying selection. But you also see excess derived variants on the X as well, as though there's more selective sweeps going on on the X as well.
So there's just lots and lots of work to do on the sex chromosomes and I urge you to think about that. Too many studies just sort of set aside the sex chromosomes and they go, we'll worry about that later. But there's often some very interesting work to be done there.
You can also contrast these site-frequency spectra stratified by different attributes of the sites. So we mentioned this dN/dS test. You can sort of combine the two by asking, what about the site-frequency spectrum of synonymous sites versus nonsynonymous sites? So if you think about this for a second, synonymous sites are not going to be changing the amino acid sequence. There is this phenomenon of codon substitution bias where you might expect there to be-- I'm sorry, codon bias-- where you might expect there to be some weak selection operating on different silent substitutions.
But for the most part, we can think of silent sites as being somewhat analogous to quasi neutral. And so these, the red bars, are giving the site-frequency spectrum for those synonymous or silent changes. And if we look at these nonsynonymous changes that probably affect the protein-- and there's software that allows one to try to predict this, PolyPhen is one of them and there are a few others. They don't do that good a job but they take a start at it. Anyway, the probably damaging mutations are like a radical amino acid change. So these are all nonsynonymous changes.
And you can see that nonsynonymous probably damaging ones do have this big excess of the rarest class of variation. Which is consistent with our idea that, if they are probably damaging, there's probably purifying selection pushing them to ever rarer frequency classes, and that's exactly what we see.
So that's the kind of test that you can come up with yourselves. Just thinking about, how can I combine attributes of these different tests? So this is combining a site-frequency based test with this comparison of synonymous and nonsynonymous sites.
What about balancing selection? So balancing selection is a situation where heterozygotes have the highest fitness, higher than the homozygous classes. There are a few classic examples of this in the literature. Probably the most well-demonstrated one is for the major histocompatibility complex. In many organisms there's excess variability and actually even excess heterozygotes compared to Hardy-Weinberg for many MHC genes.
And so Aida Andrés, a former post-doc in the lab, asked about the site-frequency spectrum across all human genes. Sorry about human examples but there's just so much data there it's sort of easiest to illustrate things there. In this sort of situation where you see this excess of intermediate allele frequencies, you can do a scan-- let's look across all genes and ask, do any genes show this sort of pattern of excess intermediate frequency?
You find a few, and some of them have been very nicely confirmed in several different independent population samples as though there really is a phenomenon going on here. And some of them show this sort of picture where this is one common haplotype. It's common across many different human populations. Here's another common haplotype also common in many human populations. And it seems like they're all shared across all populations.
This is a radically non-neutral kind of pattern and it's consistent with the idea that, well, maybe there's some actual selective force maintaining that haplotype diversity in that particular gene. And a number of groups at the Max Planck Institute in Leipzig are tracking down what's going on with ERAP2 and could there be some form of balancing selection going on there. So these methods work for not only purifying and adaptive selection but also for this kind of balancing selection.
Now, another thing to think about when you consider a mutation arising in a population that's advantageous-- you've all seen pictures of this-- they increase in frequency. When they increase in frequency, if there is relatively little recombination around it, that mutation arrives on a particular haplotype and the whole haplotype gets dragged up in higher frequency. And what this might do is reduce, then, the level of diversity flanking that particular adaptive mutation.
And so Kim and Stephan and many others since then have been asking about, can we find-- can we detect signatures of this kind of adaptive selection that results in what we call a selective sweep? A reduction-- a selective event that's so strong that it essentially removes flanking variation resulting in this sort of trough of diversity. So these are plots of nucleotide diversity pi on this axis against just position along the genome. And we're seeing these chunks of the genome that are showing this deficit of variability as though something went on there.
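A scan of this kind can be sketched as a windowed calculation of pi along the sequence. This is a minimal illustration, not the method from the papers mentioned: it assumes 0-based SNP positions, a known derived-allele count at each SNP, non-overlapping windows, and a sequence length that is a multiple of the window size. All names and data are invented.

```python
def windowed_pi(positions, derived_counts, n, window, seq_length):
    """Per-base-pair nucleotide diversity in non-overlapping windows.

    positions      -- 0-based SNP coordinates, each less than seq_length
    derived_counts -- number of derived copies at each SNP (out of n chromosomes)
    """
    n_windows = seq_length // window
    totals = [0.0] * n_windows
    pairs = n * (n - 1) / 2.0
    for pos, k in zip(positions, derived_counts):
        # k * (n - k) / pairs is the heterozygosity contributed by this SNP
        totals[pos // window] += k * (n - k) / pairs
    return [t / window for t in totals]

# Toy data: 10 chromosomes, 3 kb of sequence, no SNPs in the middle kilobase.
positions = [100, 450, 900, 2100, 2600, 2950]
derived_counts = [3, 5, 2, 4, 1, 5]
pi_by_window = windowed_pi(positions, derived_counts, n=10, window=1000, seq_length=3000)
trough = min(range(len(pi_by_window)), key=pi_by_window.__getitem__)
print(pi_by_window, "lowest-diversity window:", trough)
```

A window flagged this way is only a candidate region, since demographic events can produce the same kind of trough without any selection.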
Now, I have to give you a little bit of warning that if you see something like that it doesn't necessarily mean, aha, selection, there was a selective sweep, I've proven it. Because other things can do this as well. One in particular is, if you simply drag a population through a bottleneck, you might think, OK, a bottleneck is a reduced population size or some increased probability that drift is occurring. You might think that very homogeneously reduces the level of diversity as you go through a bottleneck.
But that's not quite true. Just by chance, some regions of the genome will completely lose variability and others will keep higher levels of variability. So you see little troughs of diversity like this as a natural consequence of bottlenecks as well. So it takes great care to try to disentangle what's due to a selective sweep and what's due to a bottleneck.
And this is just showing an example of dragging a population by simulation through a bottleneck and actually observing patterns of this sort of reduced variability around very spiky looking regions of a genome. It's a little bit hard to imagine up front, why would it look like that if I just had this homogeneous reduction in population? And it's just by chance. You sample from a smaller population when it's in a bottleneck, by chance you'll see these sorts of events. I'll skip that one, too.
So where are we for time? Ooh. We do have to keep up this kind of pace. Are we OK? Any questions at this point? All right.
So I mentioned the complication of demography. So it's not only population growth which gives you excess rare variation but this situation where the population is complicated with migrants coming in. That affects it. If the population is subdivided as I mentioned, we need to deal with that. Another thing that's actually important is, if we're dealing with a population where there are relatively few founders and then there was a relatively small population size where many individuals are related to each other, that is, if there's kinship among individuals in your sample, if there's close kinship. Or if the way you sample is more likely to collect large numbers from one particular family unit somehow rather than just a homogeneous random sampling. You really have to be careful there because this kinship thing can really foul things up.
And I can't remember if I have slides for these. Yeah. So these are solutions for each of those three in order. So for this demographic model, what people do-- and this starts to get way more complicated, although it's still feasible and there are methods out there you can download and play with-- people do different approaches to estimate actual demographic parameters. So by this I mean, what was the rate of growth in this population in the past? What was the rate of migration between populations? So there are methods for estimating both of those parameters.
So typically you find fourfold degenerate sites, sites that you believe are more neutral than the rest, fit some kind of demographic model. And the idea here is that the whole genome is going to be dragged through that demographic model. So if the whole genome has some distortion of the site-frequency spectrum, you'll be able to estimate that with these kinds of model fitting procedures. And then each particular gene was dragged through that demography, plus on top of that there may have been selection operating. So that's the nature of this modeling. You fit the demographic model first in the whole genome, and then ask about, am I seeing on top of that an extra distortion of the site-frequency spectrum for each gene?
For population substructure there's a very commonly applied method that works pretty well, which is just you jam all of your variation through principal components analysis. So PCA actually provides a plot where, for each individual, you get weightings of those principal components. If the population is structured in some way, you could get a very nice graphical picture of that. You can see actual clusters. I think I'll have some pictures of this later on. And then you use those principal components, those weightings of those principal components, in the model that's testing selection. And there'll be some applications of that later.
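As a rough sketch of the PCA step, the following assumes a genotype matrix with one row per individual and one column per SNP, coded as 0, 1, or 2 copies of an allele. The function name and the simulated two-population data are hypothetical, and published pipelines usually also scale each SNP by its allele frequency, which is omitted here.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Return per-individual principal component scores.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)                  # center each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # scores = U * singular values

# Hypothetical data: two populations with very different allele frequencies.
rng = np.random.default_rng(1)
pop_a = rng.binomial(2, 0.1, size=(20, 200))
pop_b = rng.binomial(2, 0.9, size=(20, 200))
scores = genotype_pca(np.vstack([pop_a, pop_b]))
print(scores[:3, 0], scores[-3:, 0])  # PC1 separates the two groups
```

The returned columns are the per-individual weightings that would then go in as covariates in the model testing selection.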
Finally, for kinship, you actually have to estimate for each pair of individuals, what is their degree of kinship? You can do this by methods that are referred to as identity by descent estimation, and then you get these estimates of IBD and kinship and get a kinship matrix across all pairs of individuals and use that in the model. All right. Boy, we better accelerate.
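To make the idea of a kinship matrix concrete, here is a minimal sketch. It swaps in a simple genotype-correlation (genomic relationship) estimator rather than a true identity-by-descent method, so treat it as an assumption-laden stand-in. The function name and the toy genotypes are hypothetical.

```python
import numpy as np

def relationship_matrix(genotypes):
    """Pairwise relatedness from standardized genotypes (0/1/2 coding).

    A simplified estimator, not a full IBD inference: each SNP is centered
    by 2p and scaled by sqrt(2p(1-p)), then individuals are correlated.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                  # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                  # drop monomorphic SNPs
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / z.shape[1]               # individuals x individuals

# Toy data: individuals 0 and 1 have identical genotypes.
toy = np.array([[0, 1, 2, 1],
                [0, 1, 2, 1],
                [2, 1, 0, 0],
                [1, 0, 1, 2]])
k = relationship_matrix(toy)
print(np.round(k, 2))
```

The resulting matrix is what would be passed to the downstream model as the kinship term.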
So polymorphism and divergence tests. McDonald-Kreitman is a very commonly applied one. This is this very cute idea that if you have two species that have been diverging, there's going to be a certain rate of change at synonymous and nonsynonymous sites that are different between them. And if this gene is completely neutral, you'll get some neutral nonsynonymous and synonymous variation. And then within a species there'll be divergence along the branches within that species, and they will all have the same relative rates of synonymous and nonsynonymous divergence.
That means we ought to be able to do a chi-square test, a two by two test, where we count up synonymous and nonsynonymous differences that are differences between species and count up the sites that are polymorphic sites within species and just do a simple chi-square. That seems just too good to be true, too simple to be true, but it's an incredibly robust test and has been very widely applied. Anyway, that's out there. It's one that you can do on the back of an envelope if you can just count up sites. You don't need a big fancy computer program.
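The two by two test described above can be sketched in a few lines. This is a minimal version using the plain Pearson chi-square with one degree of freedom and no continuity correction; the function name is hypothetical and the counts in the usage line are illustrative only, not data from the talk.

```python
import math

def mcdonald_kreitman(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """Pearson chi-square on the 2x2 McDonald-Kreitman table.

    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    a, b, c, d = fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is all zeros")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 df, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Illustrative counts only.
chi2, p = mcdonald_kreitman(fixed_nonsyn=7, fixed_syn=17,
                            poly_nonsyn=2, poly_syn=42)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

With small counts, an exact test is often preferred over the chi-square approximation; the chi-square is used here only because it is the back-of-the-envelope version mentioned in the talk.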
An extension that Carlos Bustamante developed is a Poissant field rand-- Poisson random field model. Sorry. That counts-- so this is, again, for human chimp divergent sites. At synonymous and nonsynonymous positions in coding sequence, there were 34,000 synonymous and 20,000 nonsynonymous. So you look at that ratio, normally there should be about two times as many nonsynonymous as synonymous if they're neutral. Right? That's how many mutations coming into the genetic code ought to have twice as many nonsynonymous.
So you see there's a deficit of nonsynonymous divergent sites here. If you look at SNPs within humans, there's roughly equal number of synonymous and nonsynonymous differences. And so if you do this chi-square, it's wildly significant, saying that that cell's the odd one. There's been sort of a weeding out of a lot of the nonsynonymous differences on the branches leading to human and chimp.
And another way to see is, if this is what's normally going on in evolution, namely a weeding out of many of these nonsynonymous differences, then a lot of this segregating nonsynonymous difference in human must be deleterious. Right? Because we haven't weeded it out yet. It's segregating in the population but it's destined to be removed by this action of selection.
So this is a pattern that's consistent with a fairly strong case for purifying selection acting on our nonsynonymous polymorphism. And the nice thing about this Poisson random field model is there are actually model fits for each of the four terms of that McDonald-Kreitman table that are based on neutral evolution and molecular clock from Kimura, they're based on selection theory from Sewall Wright and so forth. And we won't go through all the details of that but you can actually estimate parameters of those and get likelihood ratio tests for selection based on McDonald-Kreitman type data at a much, much more rich scale because you can estimate parameters of selection and so forth from them. So we applied that genome-wide and got lots of evidence for selection.
Anyway, we can skip through that because this is published also. Again, I should mention, just the punchline was, defense and immunity were among the most strongly selected genes. And you can see this is-- gamma is that selection parameter. There was actually a lot of signature of positive selection. This was the overwhelming signal for this class of genes showing the strongest positive selection.
But some genes in immunity do undergo purifying selection. That's not too surprising. Some of those genes are not interacting with antigens. They don't need to be highly variable, they're just mission critical genes in the process of immunity, and such a gene would then suffer deleterious consequences if there were mutations in that gene. And there is, therefore, purifying selection acting on some genes in immunity.
Another class of tests that I'm just going to mention and sort of give you the hint of how it works is the Hudson-Kreitman-Aguade test. It also is based on this idea of polymorphism and divergence, that under strict neutrality there ought to be relationships between levels and patterns of variability within a population that relate to the rate of divergence between the two. And the basic idea of HKA test is you have a test locus, the one you want to know, is this gene under selection? And you need one or more genes that are your reference, neutral loci, and that's always been kind of a bothersome thing about this test. But pick a gene that you think is neutral.
And you're basically-- the null hypothesis you're testing is, are these genes under the same kind of selective constraint? So if this one's neutral, the test is, is this one under any selection? And you estimate polymorphism for each of those genes and divergence for each of those genes and it's basically a test of the null hypothesis that the ratio of polymorphism and divergence is the same across the two genes. This test is also very robust. It works very nicely. You've seen many applications of it also in a genome-wide way and it's highly Google-able and you can get lots of information on that test.
All right. The most commonly applied tests in human genetics literature recently, and you're seeing this more and more applied elsewhere, are these extended haplotype tests. And the caveat here for your non-model organisms is, generally this requires the highest standard of data. That is, you need high quality, full sequence data where you can say, if I see this long run of sequence that's the same across this individual, this individual, and this individual, I've not only been able to infer the sequence but I can also get phasing of the haplotypes from those individuals. So the very best way to do this extended haplotype test is with phased high quality sequence data. You can also do it with SNP data, however.
So one illustration of this that I had the great good fortune to be involved with is with this story with Drosophila pachea and the senita cactus. So senita is a pretty interesting cactus in the Sonoran Desert. Almost nobody lives on it. Most of the other cacti, saguaro and so forth, there's a series of many different so-called cactophilic Drosophila species that live there.
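One common way to quantify "a long run of sequence that's the same" across individuals is extended haplotype homozygosity (EHH): the chance that two haplotypes carrying the core allele are identical everywhere between the core site and a given site. The sketch below assumes phased haplotypes coded 0/1; the function name and the toy haplotypes are hypothetical.

```python
from collections import Counter
from math import comb

import numpy as np

def ehh(haplotypes, core, allele=1):
    """EHH at every site for haplotypes carrying `allele` at the core site.

    haplotypes: (haplotypes x sites) array of 0/1 alleles, already phased.
    Returns a list with one EHH value per site (1.0 at the core itself).
    """
    haps = np.asarray(haplotypes)
    carriers = haps[haps[:, core] == allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two haplotypes carrying the core allele")
    values = []
    for site in range(haps.shape[1]):
        lo, hi = min(core, site), max(core, site)
        # Group carriers by their full sequence between core and site.
        groups = Counter(map(tuple, carriers[:, lo:hi + 1]))
        same_pairs = sum(comb(c, 2) for c in groups.values())
        values.append(same_pairs / comb(n, 2))
    return values

toy = [[1, 0, 0],
       [1, 0, 0],
       [1, 0, 1],
       [1, 1, 1]]
print(ehh(toy, core=0))  # decays with distance from the core site
```

A swept haplotype shows EHH staying near 1 much farther from the core than the alternative allele does; unphased or error-prone data will erode that signal, which is why phased high quality data matter here.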
But senita is-- this species, Drosophila pachea, is the only one that lives on the cactus itself. At the base of the cactus there's another one called Drosophila mettleri but it doesn't actually live on the host itself. And there's a very interesting story relating to cholesterol metabolism. So most plants actually do have a little bit of cholesterol, and flies require that cholesterol. They cannot synthesize cholesterol at all. We can. Flies can't so they have to get it in their diet.
And where it's mission critical for the fly is they use cholesterol to make ecdysone. Ecdysone is the hormone that's necessary to be able to go from the larval stage to [INAUDIBLE]. Yes?
ANDY CLARK: OK. So if you have a block in this particular gene here, the flies cannot make ecdysone and hence they stay as larvae forever. And so if they stay as larvae forever, then they're always children, and hence this gene was called Neverland. That's the Drosophila community for you. So this was discovered in flies, this enzyme is called Neverland. And it's a particular-- it catalyzes this reaction from cholesterol to 7-dehydrocholesterol. And of course that's where Peter Pan was. OK. You all know the story.
So it turns out that senita cactus makes lathosterol, and lathosterol actually can be metabolized into 7-dehydrocholesterol and they can make ecdysone. So if a bug can metabolize lathosterol instead of cholesterol, they're OK and they can make ecdysone.
So then the whole story comes down to, what's going on with Neverland in Drosophila pachea? Can their version of Neverland actually use lathosterol? And so many different species were sequenced in Neverland and this is showing pachea's here somewhere. Here at the very top is pachea. And you can see, here's a site, G250A, that's an A in-- that is an alanine in pachea and a glutamic acid in all the other species. And so that's a site that says, well, maybe there was a specific substitution that was important in pachea that the other species didn't have. And so on for these other sites. They're not all quite fixed like that but there's five sites that show this rather radical change in pachea compared to the others.
So if you calculate dN/dS along the branches, you can see there are numbers all over the place. For pachea it's 0.23. It never gets to be very high, but that's higher than many of the other branches. And one of the problems with dN/dS, if you think that ratio has to be greater than 1, it's actually referring to the ratio across all the sites in the gene. And if it's only a couple of sites that are under selection, maybe the whole ratio isn't going to get higher than 1.
Anyway, this is showing the part of the genome. Here's Neverland right here. And so this, all this segment right close to Neverland, about 100 kb that were sequenced. And here are some other random parts of the genome, and you can see there is pronounced reduced polymorphism in that neighborhood of Neverland. Lots of interesting polymorphism statistics that show a negative Tajima's D. Same thing. Same thing.
So Hudson-Kreitman-Aguade test is showing rather high significance for Neverland. It looks like it's undergoing selection. But the key thing I wanted to show you was this one, which is this omega statistic, which is one of those tests for was there a selective sweep. And it's showing that in this region around Neverland, this is that 100 kb, it's more complicated than you might guess. In fact, here's Neverland. One of the peaks isn't right on top of it, which could be due to selection of Neverland. There could be low enough recombination here that would drive that.
This one's a little bit mysterious. So it could be actually-- this could also be because of Neverland if there's that low recombination. So it really ups the need. We really want to know what's the nature of recombination in this region so we can interpret this. It's also formally possible that there was something else that was selected right near Neverland so that the reduced polymorphism of Neverland is due to something else altogether.
So even though it's very clear that there were substitutions in Neverland that allowed it to be able to use lathosterol-- and very careful biochemistry was done to show that-- the absolute proof that selection was on Neverland itself is still a little bit thin. If we start with Neverland as the focal point and ask, what's the size of the haplotype as we go out from Neverland? They're showing a consistent haplotype.
That's what these plots are meant to show, and the size of these, I don't know, spider-gram kind of balls going out from Neverland, is much, much larger than you see from any other focal point. So you do this picture with a lot of mis-- totally graphical, there's not a significance test associated with this particular diagram but it's a very nice graphical way to show those sweep events. And I should mention that this work was primarily done in the laboratory of Virginia Orgogozo in Paris. She stumbled upon this Neverland story and really ran with it, did a beautiful story published recently in Science.
So that's a case where a single gene gave-- resulted in a niche shift to this new cactus. It's really quite-- it'll be in textbooks. Watch for it. Neverland. OK.
Inter-population divergence, and we're about out of time. This is just-- let me sort of jump to the main point here, which is, if you look at FST, the degree of population subdivision for different genes, you find a number of genes that are outliers. And again, there are just such boatloads of data for human genetics, this is very easy to illustrate with the human data. Several genes that are involved in skin color, in pigmentation, show very, very strong patterns of differences in and out of Africa. And this is just showing this radical change in allele frequency going out of Africa, and then the haplotypes.
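As a rough sketch of a per-gene or per-SNP FST scan, the following uses the simple heterozygosity-based definition, FST = (HT - HS) / HT, for two populations weighted equally. It ignores sample-size corrections that standard estimators apply, and the function name and example frequencies are hypothetical.

```python
import numpy as np

def fst_two_pops(p1, p2):
    """Per-site FST from allele frequencies in two equally weighted populations."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pop
    with np.errstate(divide="ignore", invalid="ignore"):
        # Sites with no variation anywhere have undefined FST.
        return np.where(h_total > 0, (h_total - h_within) / h_total, np.nan)

# Hypothetical frequencies at three sites.
print(fst_two_pops([0.0, 0.3, 0.2], [1.0, 0.3, 0.8]))
```

Outliers would then be the sites or genes whose values sit far in the upper tail of the genome-wide distribution.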
So it's showing that they are exceptionally-- this is sort of like spider-gram picture but just sort of on a flat-- flattened out. And you can see these very large extended haplotypes for regions of genes as though going out of Africa there was a mutation that arose that said, you need more vitamin D. If you had lighter skin you'd get more vitamin D so let's have this mutation sweep up in frequency. And this happened for several different genes involved in skin color.
It happened in different ways in Europe and Asia very strikingly. So some genes involved in lighter skin in Europe are different from those in Asians. But again, this pattern is pronouncedly different between African and European. And we can see very clear patterns of selection now that are driven-- the Asian one, MC1R, is very clearly a gene that was under positive selection in Asians for lightening skin but not so in Europeans. Not so much.
And so this idea of capitalizing on site-frequency spectrum, long haplotypes, and their geographic structure is particularly powerful. You can make inferences about geography and differential selection with different environments by combining those different methods.
One last one is just, simply by comparing the site-frequency spectrum on two axes here where the axes are two populations-- so this is Tibet, this is Han Chinese, and so these are the rarest class. Imagine this as being like a histogram so it's the color showing the frequency. So this is a site that's rare in Tibet and in Han Chinese, these are the common variants, and so we have these site-frequency spectra plotted against each other as a scatterplot. And so you're looking for these exceptions. And these exceptions turn out to have very direct consequences with respect to ability for the physiology to adapt to high altitude, to low oxygen content. So EPAS1 is involved specifically in that developmental process of high blood hemoglobin.
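The two-population plot described here is a joint site-frequency spectrum. A minimal sketch of how such a table could be tallied from derived allele counts follows; the function name and the toy counts are hypothetical.

```python
import numpy as np

def joint_sfs(counts_pop1, counts_pop2, n1, n2):
    """Tally SNPs by derived allele count in each of two populations.

    counts_pop1, counts_pop2: derived allele count per SNP in each sample.
    n1, n2: number of chromosomes sampled from each population.
    Returns an (n1 + 1) x (n2 + 1) table; entry [i, j] is the number of
    SNPs with i derived copies in population 1 and j in population 2.
    """
    c1 = np.asarray(counts_pop1, dtype=int)
    c2 = np.asarray(counts_pop2, dtype=int)
    table = np.zeros((n1 + 1, n2 + 1), dtype=int)
    np.add.at(table, (c1, c2), 1)   # handles repeated (i, j) pairs correctly
    return table

# Toy example with three SNPs and four chromosomes per population.
print(joint_sfs([1, 1, 3], [0, 0, 2], n1=4, n2=4))
```

Most SNPs fall near the diagonal when the populations are closely related; the exceptions far off the diagonal are the candidates of interest.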
So, again, this contrast of site-frequency spectra across populations, not only using FST but actually the full site-frequency spectrum, can also be highly informative. So this is just saying everything again. Natural selection, if you look hard enough with full sequence data, it actually is quite abundant in natural populations. Signatures of selection arise from patterns of polymorphism and divergence.
We can do tests that look for different times in the past. So these tests that are looking, in particular at these haplotypes that have swept to high frequency, that's looking at a relatively recent event compared to those dN/dS sorts of tests which might be integrating information over the last, well, since species divergence really. Positive selection is especially common for genes in immunity, defense, and perception. Most of the data there were from mammals, including human chimp divergence, but we're seeing immunity and defense for sure in insects, and we even saw some genes in perception in that evolution of social insects story.
Some genes that have a signature of recent selection, they also have larger allele frequency differences between populations. So if you have a structured population and can compare them, particularly if they are in different environments where there might be some sort of reasoning that you can do about specific adaptation to different environmental attributes, definitely don't ignore that.
And power is an issue to always pay attention to. Ideally you do this before you jump in and spend all the money on the experiment. Do some calculations on how much variability is there, what sample size do I need to be able to detect a given strength of selection, and if it looks promising then you can jump in and go for it. And besides, NSF likes to see those power calculations. And a lot of times it's storytelling but you can do some modeling, get some idea, is there any prayer for this test having any ability to detect selection? I think that's it. Yeah. All right. Any questions?
AUDIENCE: You talked about dN/dS ratios. And I'm just wondering, in non model organisms, if you don't have good ideas about [INAUDIBLE] intron boundaries, how easy is it to get that then--
ANDY CLARK: Yes.
AUDIENCE: --or [INAUDIBLE].
ANDY CLARK: Yeah. So annotation is going to be critical for a species where, if you want to make inference about protein coding genes, which in terms of going from genome sequence to making any statement about function, protein coding genes are the easiest place to start. So if you have genome sequence and you want to go down this route, it's a sort of long term commitment to getting somewhere with a species, you'll almost certainly want to do RNA sequencing and you'll want to run it through, you'll want to assemble them, align them to the genome and try to annotate genes.
It's very true that even going through all that your confidence in the genes is going to be lower than first-- for a model organism. So it's sort of that trade off between-- some of the genes, a highly expressed gene, you're going to be able to do pretty well. It may be a gene that's of less interest with respect to adaptation but yeah. No. It's a very good point. Annotation of protein coding genes is essential for those protein based tests of selection.
In fact, that bee story that was a dN/dS story, the very first thing out of the box was just exome sequence. It was actually derived-- I shouldn't say exome, it was transcriptome sequence. That was actually in the days when it was first done by 454 which had sufficient read length to do pretty good de novo assembly of the transcriptomes. Now they've done boatloads of additional transcriptome sequence by RNA sequence with Illumina. And now they're doing the whole genome sequences as well.
So they sort of did it backwards that way but you can do a pretty good job. So it's the top 3,000, 4,000 genes that you get by fairly deep RNA sequencing and assembly. And even if a gene is fragmentary, you can have a very good sense of what is the-- you don't have to have a complete thorough knowledge of a gene with precise understanding of all the splice forms and everything. If you just have high confidence in the reading frame for at least a good portion of the gene, then you can apply those tests.
SPEAKER: We've got 10 minutes for questions. Great.
AUDIENCE: So when you're looking at the site-frequency spectrum you kept saying excess of rare. Is that purely a visual? Like, you look at the graph and say, oh, it's a spike at this point or is there a formal statistical test you can do?
ANDY CLARK: Tajima's D and Fu and Li's F test, those are essentially looking at the site-frequency spectrum. So a negative Tajima's D is, in words, equivalent with excess rare variation. Yeah. And there is a formal test for that where you get a chi-squared and a p-value.
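To show how a negative Tajima's D corresponds to excess rare variation, here is a minimal sketch that computes D directly from an unfolded site-frequency spectrum using the standard formula from Tajima (1989). The function name and the example spectra are hypothetical. It returns only the statistic; assessing significance would need a null distribution, which is not shown.

```python
import math

def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS.

    sfs[i-1] is the number of sites where the derived allele appears
    i times in a sample of n chromosomes, for i = 1 .. n-1.
    """
    n = len(sfs) + 1
    s = sum(sfs)                                  # segregating sites
    # Mean pairwise differences, summed over sites.
    pi = sum(x * 2 * i * (n - i) / (n * (n - 1))
             for i, x in enumerate(sfs, start=1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

all_singletons = [10, 0, 0, 0, 0, 0, 0, 0, 0]     # n = 10, only rare variants
neutral_shape = [60 / i for i in range(1, 10)]    # expected counts scale as 1/i
print(tajimas_d(all_singletons))   # negative
print(tajimas_d(neutral_shape))    # essentially zero
```

A spectrum made up only of singletons gives a clearly negative value, while a spectrum with the neutral 1/i shape gives a value of essentially zero.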
AUDIENCE: So that site-frequency spectrum is just on the minor alleles? Or minor--
ANDY CLARK: Yeah. You're raising a very good point, which I kind of glossed over, which is, ideally, the very best situation is where you have not only your species and the polymorphism in it but you have an out group or more than one out group so that you can determine what's a derived allele versus the ancestral allele. So the ancestral allele is the state of that SNP that's shared with the common ancestor, and then the newly arisen allele is the derived allele.
Now, there are situations where, of course, that axis-- the x-axis-- can go from 0-- well, it's actually singletons is the first column-- all the way up to fixation-- well, 1 minus fixation because if it's fixed, it's not on the site-frequency spectrum. So a derived allele can have frequency n minus 1. All but one could be the derived allele. So the commonest allele may be the derived one if it's being swept to frequency or some drifting alleles will as well.
So that's the best. That's known as unfolded site-frequency spectrum where you actually know what's ancestral and what's derived. That's the best case. Often you don't have that out group, or often the out group will be highly polymorphic as well in which case you could have polymorphism that's shared all the way since-- and so there'll be confounding of what's ancestral, what's derived.
So sometimes what's done is the site-frequency spectrum is folded. So that means that you're combining. So you look at a common and a rare variant. Anytime you see a singleton and all the rest are another allele, whether that singleton is a count of one of the derived allele or a count of n minus 1 of the derived allele, you still put them in that class. So it's folded at 0.5 and then you're adding them up. So the folded site-frequency spectrum is often what you deal with, and some of the tests specifically deal with knowing that it's a folded site-frequency spectrum. It has a little bit different properties. But that's a very good essential point.
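Folding can be sketched as follows, assuming the unfolded spectrum is stored as a list of site counts for derived allele counts 1 through n minus 1. The function name and the example numbers are hypothetical.

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS into minor allele count classes.

    unfolded[i-1] is the number of sites with i derived copies out of n,
    for i = 1 .. n-1. The result has one entry per minor allele count
    1 .. n // 2.
    """
    n = len(unfolded) + 1
    folded = []
    for j in range(1, n // 2 + 1):
        k = n - j                      # the complementary derived count
        total = unfolded[j - 1]
        if k != j:                     # avoid double counting the middle class
            total += unfolded[k - 1]
        folded.append(total)
    return folded

# n = 6 chromosomes: derived counts 1..5 hold 5, 4, 3, 2, 1 sites.
print(fold_sfs([5, 4, 3, 2, 1]))  # [6, 6, 3]
```

The singleton class of the folded spectrum therefore mixes true derived singletons with sites where the derived allele is in all but one chromosome, which is why tests need to know whether they are being given folded or unfolded data.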
AUDIENCE: You talked about fitting demographic models to the genome data. Do you worry about including genes that are under selection when you're fitting that model?
ANDY CLARK: Yeah. To some extent. So there are two ways that it's done. One way is to identify regions of the genome that are sort of picked for being less likely to be selected. Kind of a strange operation but it has worked reasonably well. So they'll take either introns or fourfold degenerate sites or if there are genes that clearly have some really wacky behavior compared to the rest of the genome they'll set those aside. That's one way.
Another way is to simply not even try to do that but just take only four-fold degenerate sites across all proteins. And if you do that, then a gene that was under some strong selection like it went through a sweep, if it went through a sweep, even the four-fold degenerate sites would have an odd site-frequency spectrum. They would have been distorted.
But the idea is it comes out in the wash across the whole genome. You look at 15,000 genes, how many have been selected where there's this obvious sweep event? It might be a dozen or whatever or maybe it's even 100. But out of 15,000, the idea is it comes out in the wash and you still get a pretty good estimate of the genome wide demographic.
AUDIENCE: What's your opinion about using that? What kind of out group for polarizing the data group?
ANDY CLARK: There the main issue is level of divergence. So if it's too close, the main problem that you have is there is too much what they call incomplete lineage sorting, where the common ancestor was a polymorphic species that then gave rise to this group that you're studying and this group as well, and many lineages, many polymorphisms are in the common ancestor species as well as both your species and the one that you're using as an out group. So that kind of incomplete lineage sorting is a real pain.
And they're developing beautiful methods to explicitly deal with it, but as a first start it's probably not the easiest place to go. So you want something more distant in the past than that ideally. And of course, if it's too far back then too much has gone on and your inferences about difference between that out group species and your targeted species of interest is integrating over too much evolutionary time and you lose signal that way, too.
What is that sweet spot in the middle? I'm afraid it varies from one organism to another depending on a lot of factors, including not just mutation rate, but also effective population size, the nature of the demographic changes that have occurred. It's sort of hard to tell. So one thing I can say is the very best always is to have more than one out group. If you have a whole phylogeny, a tree where your species of interest is here, then you'll find that different genes within your target-- target species of interest-- different genes will have different optimal out group species for studying it. So it's sort of embedding its whole genome in the context of the phylogeny is the strongest approach if you can do it.
SPEAKER: Andy, I really liked how you helped motivate [INAUDIBLE] data. [INAUDIBLE] of annotation. I wonder if you could comment on [INAUDIBLE] data but in non models with different [INAUDIBLE] have extended haplotypes. But there is options for collecting data [INAUDIBLE] you can get, say, about 100 base pairs of [INAUDIBLE] for haplotypes as opposed to SNPs. And to what degree do you gain power in some of these tests [INAUDIBLE]?
ANDY CLARK: So I guess one of the better examples maybe is that Neverland story. I was a little bit more negative about our inability to specifically say that the selection was acting at Neverland, but that's the gene Neverland itself. In that gene, it's very clear that selection was operating, and that was based on data where we did actually make inference of haplotypes.
Now, that was with inbred lines of flies where we can actually get haplotypes directly. A lot of times if you can't bring an organism into a laboratory, you can't inbreed it, you just have to deal with whatever is the collection of genotypes that you happen to sample in nature. And then there are statistical methods for inference of haplotypes. And what really matters for being able to do that is the degree of linkage disequilibrium in the sample.
So as soon as you just collect SNP data on a bunch of individuals, you can calculate what's the pairwise linkage disequilibrium across all the SNPs. I didn't show any pictures of this I guess, but one can-- there are various ways to depict that graphically. There are various ways to test the statistical significance of that pairwise correlation of SNP genotypes across sites that we know as linkage disequilibrium. And you're kind of hoping for the situation where there's enough linkage disequilibrium where you can make some traction in inferring these haplotypes.
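The pairwise correlation of SNP genotypes mentioned here can be sketched as a squared Pearson correlation between genotype columns. This is a minimal version for unphased 0/1/2 genotypes; the function name and the toy matrix are hypothetical, and monomorphic SNPs would give undefined values and should be filtered out first.

```python
import numpy as np

def genotype_r2(genotypes):
    """Pairwise r-squared between SNPs from unphased genotypes.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    Returns a (SNPs x SNPs) matrix of squared correlations.
    """
    g = np.asarray(genotypes, dtype=float)
    r = np.corrcoef(g, rowvar=False)   # columns are the variables (SNPs)
    return r ** 2

toy = np.array([[0, 0, 2, 0],
                [1, 1, 1, 1],
                [2, 2, 0, 1],
                [0, 0, 2, 2]])
print(np.round(genotype_r2(toy), 3))
```

In the toy matrix the first three SNPs are perfectly correlated (the third is just coded with the other allele), while the fourth is uncorrelated with the first.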
And the reason it is so powerful is all those tests that were haplotype based, like those extended large haplotypes, they're sort of borrowing information from neighboring SNPs. So anytime you deal with one SNP at a time, the level of sequencing error-- and it will always be there-- your ability to call SNP genotypes for each individual SNP in a data set depends on your read depth and all kinds of-- read mapping-- all kinds of issues that will arise. Your idea that DNA sequence is digital and everything is going to be clean, soon as you touch any data, you realize, god, is this a mess. It's really a lot of work to get high confidence in these SNP genotypes.
And so there'll be this underlying level of error. And so if you're relying desperately tightly on a single SNP's genotype, you're in a much worse situation than in the situation where you can see information across many SNPs in a region as a haplotype where the occasional SNP error-- you'll still know that it's this haplotype-- and your ability to infer that estimate of the frequency of that haplotype is much better.
And so those are I guess the main regions-- reasons. You see this sort of phenomenon of selection impacting the genome as haplotypes, like these extended haplotype tests, as well as the inherent error correction that you get out of haplotype based tests. So PHASE, fastPHASE, there are a bunch of tools that, just given SNP genotype data, you'll be able to infer linkage phase of these haplotypes as well. Of course, with error, so watch out.
SPEAKER: Let's do one last question.
AUDIENCE: This [INAUDIBLE], on the subject of frequency error, how much should we worry [INAUDIBLE]?
ANDY CLARK: Yeah. So think about it. So what is it going to do if you have just sequence and then you introduce errors into it? It almost always increases your rare variants, right? So you get-- so what fraction of your singletons are due to sequencing error is always an issue. Tests like-- the Fu and Li F test specifically deals with counts of singletons. So it's like, out of all your beautiful sequence you've collected, it focuses on the one hardest thing to get, which is singleton counts.
And so it's an issue. The test-- Tajima's D doesn't only rely on singletons, it does rely on other site-frequencies as well. And so there are various things that you can do to tighten up inference of that rare variation. The relationship between singleton and doubleton counts, in particular, is good. So your probability of making doubletons purely by sequence error is lower than singletons by quite a margin, roughly the square of error rate, so if they're both errors and independent. So if you have a real spike in singletons and everything else looks normal, then be suspicious. I'm sure there's a better answer than that but [INAUDIBLE].
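The singleton versus doubleton argument can be made concrete with a small calculation. This sketch assumes a truly monomorphic site, errors that are independent across chromosomes, and errors that always produce the same alternate allele; under those assumptions the number of erroneous calls is binomial. The function name, sample size, and error rate are hypothetical.

```python
from math import comb

def false_variant_prob(n_chromosomes, error_rate, k):
    """Probability that exactly k of n chromosomes show an erroneous allele
    at a site with no real variation (independent errors assumed)."""
    n, e = n_chromosomes, error_rate
    return comb(n, k) * e ** k * (1 - e) ** (n - k)

n, e = 100, 1e-3                       # illustrative values only
p_singleton = false_variant_prob(n, e, 1)
p_doubleton = false_variant_prob(n, e, 2)
print("false singleton:", p_singleton)
print("false doubleton:", p_doubleton)
print("ratio:", p_doubleton / p_singleton)
```

The doubleton probability carries an extra factor of the error rate, so with a small per-call error rate false doubletons are far rarer than false singletons. That is the basis for being suspicious of a singleton spike that is not matched in the doubleton class.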
AUDIENCE: Can you point us to any references that you think you would particularly [INAUDIBLE] dealing with that?
ANDY CLARK: Well, this is textbook, principles of population.
You mean on the issue of error?
ANDY CLARK: Wow.
AUDIENCE: And site-frequency spectrum.
ANDY CLARK: Yeah. So the site-frequency spectrum is-- that's sort of-- gosh, it's spread over so many papers. I think I would start with the textbooks. The error issue is going to be more where things really got tightened up, and you should all be kind of grateful to the human community as odd as it is for doing this, for going just from NextGen sequence to coming up with the tools and platforms and procedures for really tightening up SNP confidence. Some of those papers, like the 1000 Genomes, those sorts of papers that really went through GATK pipeline and how that works and how they infer SNPs, how they infer confidence is one.
There was a paper recently published by Matt Nelson and colleagues in Science in 2012 where they sequenced, in 14,000 people, 202 genes. And there their focus really was on those singletons. Out of a sample of 14,000, imagine that nightmare. So their write-up on how they inferred confidence in those singletons and really tightened up the error, that's exceptionally good. The supplemental report of that Matt Nelson paper is really a good place to go.
SPEAKER: Thank you. Thanks, Andy.
Andy Clark, the Jacob Gould Schurman Professor of Population Genetics and Nancy and Peter Meinig Family Investigator, explains how to find and identify signals of adaptation in genomes, as part of a population genomics workshop at Cornell University, July 23-24, 2013.
Population genomics involves sampling, financial, and bioinformatics trade-offs, so proper experimental design requires understanding probability, sequencing technologies and evolutionary theory and how they relate to research trade-offs. The workshop, "Next Generation Population Genomics for Nonmodel Taxa," explored the strengths and weaknesses of different approaches to genome sequencing and bioinformatics when studying population genomics in nonmodel species.
Go to the workshop website for information associated with these videos including lecture notes, descriptions of exercises, and computer code. This website is a site for on-going learning about methods for population genomics.