ALEX BUERKLE: OK. So if you've looked at the schedule, you realize that I'm going to be talking for a while today. And I'm going to use an overhead projector or a fancy modern overhead projector, where I can write on paper. That will serve multiple purposes.
I can draw pictures. I can do things in real time. And I'll go slower. And I'll try to write all the important things that I think are the key things for you.
It also allows me to number notes in a hierarchy. So I have numbered notes all the way through for all of the hours that I'm going to talk today. So you'll be able to keep track of where I am. And we can pick back up.
Whereas when you have PowerPoints, you don't know the nestedness of those. And so we'll have topics. And this is our first topic here.
And because I don't have a PowerPoint and I have so much time over several different sections, please interrupt me. OK. It would be terrible if I said something wrong or terribly confusing early, early on, and it left us confused for the rest of the day. So I'm assuming that we're all naive about this, because I myself was very naive about these topics until very recently.
OK. So I'm assuming you're where I was, which is I didn't know very much at all. OK. So you're probably ahead of me. So if this is remedial or very introductory for you, forgive me. There are other people that I'm confident are benefiting from what I'm going to say.
And my objective in all of this is not to turn us all into modelers who are able to write code. But let's make ourselves into better consumers of models. And, usually, I try to avoid business metaphors, or "consumers" in the business sense, when I'm talking about academics.
But my metaphor that I thought of this morning is that this is like learning how to read nutrition labels on boxes of food and things like that. We want to be able to look at models and evaluate what's in them. We don't necessarily want to be able to make Cheetos. But we want to make an informed decision about whether this is good for us or not and when we're going to have them and we're not going to have them.
And so when we're looking at models, whatever model it is-- a STRUCTURE model, a Bayesian genomic cline model, a genome-wide association study model-- we want to have some idea of what's inside, even if we couldn't reimplement it. And so that's what I want to do. I want to bring our literacy up a little bit, so we can read those labels and even find them in the paper.
OK. Where's the nutrition label on a paper? Maybe that should be a new box that we should require for modeling papers. So that's our objective.
And it'll be an introduction. It will be a departure for more learning. It'll give us some points of entry so that when you go further and do other things, you have a place to start hopefully. So the first question I wanted to ask is why do we need probability theory for genomics?
And the simple answer is we want to estimate parameters. And it's the probability that gives us the basis for estimation. So let's go with a quick example of what that is.
And this should be familiar from things that I said yesterday. An example of a parameter that we're going to want to estimate, particularly in the context of this workshop, is genotype probabilities. Even if you're not going to work with genotype uncertainty directly, DNA sequencers produce stochastic reads from a true genotype for an individual or for a pool.
OK. So we're dealing with stochastic sampling. We're dealing with an unobserved quantity. We only indirectly observe it through our sampling, so we need to estimate something.
And so our example is genotype probabilities as a parameter. And so there's this fundamental issue that results from next-generation sequencing. Whatever library preparation we use-- and in some cases, you could even think about SNP arrays.
We don't actually see the genotype. We see light being emitted or some type of binding that is giving us information about relative probabilities of one genotype or the other. We have this fundamental issue that results from next-generation sequencing: we have stochastic sampling that only indirectly represents the true state.
And so we want to end up with some type of genotype probability that ideally would look like something like this, where we have a very high probability of one of the states. And as I mentioned yesterday, genotype probabilities have been used in various settings other than next-generation sequencing previously. So, previously, in some organisms we were already imputing unobserved SNPs where there was no data, but we knew something about the population from which the sample came and what haplotypes existed.
And so we could impute and basically fill in a blank where we had no information. That's going to be done probabilistically. OK. And there has to be some model for that.
And so we have to make some assumptions about recombination. We need to make some assumptions about allele frequencies and the haplotypes that are in the population to do that type of imputation. So the use of genotype probabilities is not restricted to next-generation sequencing. There are models that people have been using for this previously.
There are two ways, at least-- I know I'm near the bottom-- of using genotype probabilities. We can threshold, where we retain information only if the probabilities exceed some threshold, or we can model the uncertainty and propagate it. So we could apply a threshold after what we're doing now with the transcriptome sequences and the alignment and the SNP calling that we're going to do today.
We can impose a threshold and say, we're only going to retain information for a locus for an individual if it has a certain genotype probability. It has to be very, very high for one of these categories. Otherwise, we're just going to put N/A, no information for that locus. So it's either in, or it's out.
Alternatively, you could take those three numbers and take them forward in your analysis and have your uncertainty be reflected, and avoid making the strong claim that you know what the genotype is. And, of course, the consequences of that are going to depend on how different your three probabilities are. If you have a lot of information about the genotype already, and there are some big differences in probability between those three categories, you're not going to lose very much information by thresholding.
But if you have uncertainty, it'd be better to carry that forward and see what consequences it has in downstream analyses. But either one is possible. And either one will depend on a calculation of the genotype probabilities.
So even if you plan to use thresholding to reconstruct typical genotypes and put them into typical software, like STRUCTURE, or any of the other software out there, you will have made genotype probability calculations along the way. So everybody who's dealing with next-generation sequencing data is dealing with genotype uncertainty at some point in the process, even if, at the end, you threshold and throw the uncertainty away and turn this into 1, 0, 0.
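To make that concrete, here is a minimal sketch (my own illustration, not the speaker's exact model) of how genotype probabilities might be computed from read counts at one site, assuming a binomial read-sampling model with a fixed error rate and a uniform prior over the three diploid genotypes, followed by the thresholding step just described:

```python
from math import comb

def genotype_posteriors(n_ref, n_alt, error=0.01):
    """Posterior probabilities for genotypes carrying 0, 1, or 2 alternate
    alleles, given counts of reference and alternate reads at one site."""
    n = n_ref + n_alt
    # Probability that a single read shows the alternate allele, per true genotype.
    p_alt_given_g = [error, 0.5, 1.0 - error]
    likelihoods = [comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
                   for p in p_alt_given_g]
    total = sum(likelihoods)  # uniform prior cancels in the normalization
    return [lik / total for lik in likelihoods]

def threshold_call(posteriors, cutoff=0.95):
    """Call the most probable genotype, or return None (N/A) if nothing exceeds the cutoff."""
    best = max(range(3), key=lambda g: posteriors[g])
    return best if posteriors[best] >= cutoff else None

post = genotype_posteriors(n_ref=20, n_alt=1)
print([round(p, 4) for p in post])  # nearly all mass on the homozygous-reference genotype
print(threshold_call(post))
```

With 20 reference reads and a single alternate read, almost all the posterior mass lands on the homozygous-reference genotype, so the threshold call succeeds; with only 3 reads total, the probabilities are spread out and the locus would come back N/A.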
Any questions so far? We're at 8:13. I know it's early in the morning. We're going fairly-- seems like an OK pace for getting us going in the morning. OK.
I don't know if those were nervous chuckles. Anyway, so that's our example of genotype probabilities. Let's think more generally about population genetic parameters. Because nobody is going to be very happy if they work very hard for next-generation sequencing data and all they end up with is genotype probabilities.
We want to use those as information. And we want to use those as the basis for inferences about individuals and populations and species and biological processes. The genotype probabilities in and of themselves aren't very useful.
And I made the argument yesterday that Bayesian modeling provides a framework for us. And so I'm going to make the argument in what comes up here that, ideally, we're going to use an inferential framework for our parameters that makes use of mutual information. OK. And because of the system in which we're working with genetics, individuals live in populations, populations live within species, and populations exist in space.
So there are opportunities for having mutual information at that organismal level. There are also opportunities for mutual information at the genomic level. A SNP does not exist in isolation. It's not in a vacuum.
It actually lives next to other SNPs. And those SNPs are linked to one another on chromosomes. And they travel together within individuals. So we have the opportunity for mutual information.
And as our models evolve, I think we'll learn more and more about using that mutual information. So, for example, if you're trying to make an inference about a particular SNP, you could think about, well, what are the neighbors doing? In a probabilistic formal model way, not in a sliding window, but that's an aside.
So we want to have formal framework models for using mutual information within the genome, but also at the organismal level, individuals within populations, populations within species. And so we're going to strive for modeling frameworks that make use of those. But we can also start with naive models that just really do treat them fairly simply.
This probably isn't that important. But I thought I would give some examples of population genetic parameters. One that came up a bunch yesterday is the population mutation rate, or the population-scaled mutation rate, 4Nμ.
That, actually, might not be something you think you're interested in. But it's a measure of diversity. And a lot of us, no matter what we're interested in, are interested in measures of diversity for our populations. We want to understand whether populations at the edge of a range are more or less diverse, or whether this particular focal population is more or less diverse than others.
So we might want to estimate something like this, or simply allele frequencies in populations. And I'm just going to call those p, because we're going to be doing that a lot today. So, allele frequencies in populations.
And that sounds super mundane. But one of the things I'm going to teach you is that in Bayesian analysis, if you get estimates of a parameter like this, and you have posterior probability distributions for them, and you can model the uncertainty in them, you can actually transform those to anything you want. And you'll get proper probability distributions for those transformed variables.
And you probably know that most population genetic things that we want to measure are transformations of allele frequency, like heterozygosity. Expected heterozygosity in a population is simply a transformation of this. So if we can do a good job of estimating allele frequencies, we've already done a good job of estimating all kinds of transformations of allele frequencies.
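As a quick sketch of that point (with invented numbers, not real data): if we have posterior samples for an allele frequency p, transforming each sample with 2p(1 - p) gives a proper posterior distribution for expected heterozygosity, interval and all.

```python
import random

random.seed(1)

# Pretend these are posterior samples of an allele frequency p.
# A Beta(5, 15) posterior is just a made-up example here.
p_samples = [random.betavariate(5, 15) for _ in range(10000)]

# Transform every sample: expected heterozygosity He = 2p(1 - p).
het_samples = sorted(2.0 * p * (1.0 - p) for p in p_samples)

# Central 95% interval for He, read straight off the transformed samples.
lo, hi = het_samples[249], het_samples[9749]
print(f"95% interval for expected heterozygosity: ({lo:.3f}, {hi:.3f})")
```

The key point is that no extra math is needed: the transformed samples are themselves draws from the posterior of the transformed quantity.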
OK. So that's a fundamental parameter that we're interested in. And then, of course, if you're interested in hybridization and movement between differentiated gene pools, you might be interested in hybrid admixture coefficients. And FST, in some definition of it, might be another population genetic parameter you would be interested in.
So those are just some examples. That shouldn't blow our minds. So let me say a few more things generally about parameter estimation.
Parameter estimation is central to inference, but often disregarded. This is a very general point. Even if you don't do population genetics anymore in two years, you are going to be interested in parameter estimation. Because it's central to understanding the things that we measure: we want to estimate parameters of populations in order to use those as a basis of inference.
Unfortunately, most of the statistics classes that we take, unless you're very fortunate to be somewhere where this isn't true, don't focus on parameter estimation. So in a basic stats class, what did you do in terms of parameter estimation if it's so central to inference? I think I heard somebody say something. What do you do?
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: What's that?
AUDIENCE: P-values.
ALEX BUERKLE: p-values. Yeah, there's a parameter. Actually, it's a statistic, but it's actually not a parameter that we estimate. p-values are usually based on some other parameters. But you're right.
The point I'm making here is that we end up being focused on p-values rather than the parameters. But so what are the key parameters?
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: What's that?
AUDIENCE: The mean.
ALEX BUERKLE: Absolutely. The mean is a parameter of a normal distribution. And so very often, it's a parameter of all distributions. But that's what you're typically thinking of.
And so we focus on that. And so what we report and what we end up looking at in our hypothesis testing is whether means are equal. Right? And we test the hypothesis. Are they all equal in all the groups that we're studying?
And then what we report is a test statistic, perhaps, and a p-value. But we actually never report what the-- I'm sorry. Very rarely do we report what the true estimated difference in means was. Right?
That would be cool to know. Was it 5, or was it 0.001? Because both could be significant depending on your sample size. If you have a huge sample size, you can detect very, very small differences.
So this gets at the whole issue of biological effect size. And, really, we want to know how big are the differences. OK. We don't really care that they're not zero.
In fact, in any two finite samples, the difference is not zero. OK. It's the question of whether you can distinguish that difference from something that is probable just due to chance. OK. And so we want to estimate parameters directly.
But, unfortunately, we're rarely given the tools and learning. So we need some new statistical tools to think about this and to do a good job of parameter estimation. So some of the stuff I'm going to say is just general about that and how we can go about parameter estimation.
Parameter estimation gives us a framework for model comparison and choice. The authors of a nice book subtitled it Confronting Models With Data. You guys know what that is a subtitle to? You know the name of the book? The Ecological Detective.
It's a good little read. It's a little Princeton book, I think. It's called The Ecological Detective. The subtitle is Confronting Models With Data. So we have all these models out there, and we want to confront them with data.
We can go get data. And there are an infinite number of models. Which of them applies to ours? Or, which is more supported by our data versus others?
So we want to contrast those different things. And there are kind of two different ways that we can think about this model comparison. We can break it down into comparing alternative parameter values.
And what I mean by that, how is comparing alternative parameter values comparing models? Well, two distributions with a different mean are different models. You can think of them that way.
OK. So we're comparing different models that are consistent with our data. Particularly if you're thinking about the difference in means between two groups and you subtract them and you can get a distribution, what is the difference between them? Let me not wave my hands. I have a pen.
Suppose you are doing inference that is equivalent to a t-test. You're comparing the mean of two populations. This is the estimated mean here of one population. This is the estimated mean of a second population.
We can estimate that difference. OK. And one possibility, often, that we consider is whether it's zero. But it's never going to be zero. It's going to be something.
And we want to estimate that parameter value. Different choices for means are alternative models. OK. And so we want to estimate different parameter values. That's model comparison. The other type of model comparison, and maybe this will help, is that we're going to contrast models not just with different parameter values, but with different numbers and types of parameters.
So when people typically think about model choice, and they're thinking about choosing between different linear regressions or something like that, they're thinking about, well, what different covariates or explanatory variables do I have in the model? Let me choose amongst those. But even when you're doing a linear regression and choosing amongst models with different slopes, that's model choice as well.
OK. And in that case, we have maximum likelihood estimates for the best-fitting model given the data. Does that make sense, the general point about comparing different parameter values qualifying as model comparison? Any questions so far?
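As a sketch of that idea with made-up numbers: instead of reporting only a p-value, we can get a whole posterior for the difference in means. Assuming flat priors and approximating each group mean's posterior as Normal(sample mean, standard error), sampling the difference gives an interval for how big the difference actually is.

```python
import random
import statistics

random.seed(2)

# Invented measurements for two groups.
group_a = [5.1, 4.8, 5.5, 5.0, 5.3, 4.9]
group_b = [4.2, 4.6, 4.1, 4.5, 4.0, 4.4]

def mean_posterior_samples(data, n_draws=10000):
    """Approximate posterior draws for a group mean under a flat prior:
    Normal(sample mean, standard error of the mean)."""
    m = statistics.mean(data)
    se = statistics.stdev(data) / len(data) ** 0.5
    return [random.gauss(m, se) for _ in range(n_draws)]

# Posterior for the difference in means: subtract paired draws.
diff = sorted(a - b for a, b in zip(mean_posterior_samples(group_a),
                                    mean_posterior_samples(group_b)))
print(f"estimated difference in means: {statistics.mean(diff):.2f}")
print(f"95% interval: ({diff[249]:.2f}, {diff[9749]:.2f})")
```

Each possible value of the difference is an alternative model, and the posterior tells us which of those models the data support, which is more informative than only knowing the difference is "not zero."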
Maybe I'm still going too slowly. Am I going too slowly? Is it OK? So to bring this back into the world of not means and things, let's put it in the context of population genetics. We might want to compare.
Like I was just writing on the previous page, we might want to estimate FST as a parameter and choose amongst the alternative models that are consistent with the data. What FSTs are supported by the data? Or, we might want to estimate genetic architectures for traits. And I really mean by that what people usually write in papers, the number of causal loci and their effect sizes.
So I think enough on that. I think this motivates a little bit about why we want to think about Bayesian estimation of parameters. And so I'm going to dive into that a little bit more now. If we want to have that type of framework for estimation, let's talk a little bit about the essentials of Bayesian estimation.
So I'm going to give you a nice list of six different key elements of Bayesian inference. This first one I made yesterday in my short talk. Bayesian methods give us a probability density for the model parameters given the data.
Like other methods, they're going to take the data as fixed. That's what we know about the world. But there are all these different models out there. And we're going to get a probability density for our model space.
OK. So it's easier to think about if our alternative models are alternative parameters, we're going to get a probability density for all those different models. The data are going to be fixed. We're going to refer to that probability density as the posterior probability.
And I'll talk more about how that all works and some of the terminology in a bit. But that's what I mean when I say a probability density for the parameters. We're going to learn about the posterior probability of those.
Importantly, the probability density fully describes all information about the parameters and uncertainty. And a closely related point is that the measure of uncertainty for each individual parameter incorporates uncertainty for all others. That sounds good.
If we're going to calculate FST or allele frequencies or any of these higher-level things that I'm talking about, when we say that our confidence, or our uncertainty-- those are two sides of the same coin-- that the true parameter lies in this interval, we'd like that to be true about all the levels of uncertainty. Rather than just the uncertainty in one particular dimension, we want it to reflect our uncertainty in genotype, our uncertainty due to finite sampling, and all the different levels of information that came forward into that parameter estimate.
This is not true. This point, c, is not true of other methods in most cases. You can do Monte Carlo approaches to likelihood. OK. That's very rare. You can, in likelihood, have your marginal parameter estimates reflect your uncertainty about all the other ones. But that's not how likelihood is typically done.
Instead, what I said yesterday, I think I went like this, you have to fix all your parameters and all the other dimensions to their maximum likelihood estimate. And then you can solve for the marginal likelihood and the uncertainty in that one parameter that you're interested in. This becomes important.
If you're doing linear regression in small models, maybe the likelihood models are just fine. OK. But we're talking about-- even if you're doing a fairly simple project with 10,000 loci and 5 or 10 populations, you're talking about many tens of thousands of parameters. So if you want to say, this is my uncertainty in this parameter, it really should reflect your uncertainty in all the other sources of information that go forward into it.
OK. So there's a really nice point. And remember when I said yesterday that people get all bent and upset about the prior and things like that, and they focus on that? I don't understand why they don't focus on this, because this is something good. We want this.
OK. Bayesian people are not running around wanting to incorporate a prior. OK. No, I'm not going to incorporate an informative prior. I want this. I want to have good estimates of my uncertainty when I have a model.
Most of the time, I'm going to incorporate uninformative priors that are totally overwhelmed by other terms in the calculation. So this is a big selling point. Yeah.
AUDIENCE: And this is what you'd call [INAUDIBLE]?
ALEX BUERKLE: Yes. I mean, it's related to that. I mean, lack of propagation of uncertainty could be that you're just fixing things at a certain value explicitly by turning genotype probabilities into called genotypes. But that's closely related to fixing your estimates of the slopes at all other coefficients at their maximum likelihood, rather than recognizing, oh, my uncertainty over there in that term interacts with this one. Yes.
AUDIENCE: Is it fair to say that Bayesian estimates of uncertainty are always higher than maximum likelihood ones, but [INAUDIBLE]? Is that--
ALEX BUERKLE: Yeah. That's a really good way of putting it. Because what you've done-- I mean, I can't say always. I know not to say always. But, typically, you will underestimate your uncertainty by fixing all those other parameters at their maximum likelihood estimates, and focusing only on the marginal likelihood that you have for that parameter.
I would tend to think that they will underestimate. Yeah. The other thing that relates to that is that in likelihood, you don't get a probability distribution even for that marginal parameter. You don't get A, a probability density. You get a likelihood surface.
And you can use operational criteria for saying, oh, I'm going to accept things that are within a certain support limit from my maximum likelihood estimate. But it's not a probability. Whereas, here, you can calculate quantiles and really put limits on it to have a probability interpretation. Thank you for the questions, by the way. It's good.
AUDIENCE: So with the prior, though, isn't it important then, if you're building a Bayesian model, to test how robust it is and [INAUDIBLE]? Because it is true that we are trying to just build a prior. But priors can have important [INAUDIBLE].
ALEX BUERKLE: Yup. They can. And there are different methods for if you are the model developer, particularly, for measuring conflict between the likelihood and the prior, for example. So you can measure how much information comes from your other terms beyond the-- and we're getting a little bit ahead.
But you definitely can. As with any modeling that you do, you can test for the robustness of your inferences to different model choices. And swapping out different priors would be one of those.
AUDIENCE: So kind of along the same lines, if you're allowing all these things to interact with each other, is there more of a risk for a snowball effect where one thing is [INAUDIBLE] just gets carried through and totally [INAUDIBLE]?
ALEX BUERKLE: So there are absolutely opportunities to be misled in your inference, where your prior is overwhelming the information that you have at a certain level in the hierarchy. So, yes. I think we have to be cautious about that and aware of it.
And so simpler models will be more easily investigated in terms of the consequences of different model choices. If you have a monster model with a huge hierarchy, like I often do, then you have to watch yourself. But yeah, you can investigate.
We'll get into some of the methods. I thought where you were going with your question was that, because it's this multi-dimensional thing with lots of different interactions, might you be misled in parameter space in terms of estimation, which is kind of a related question. And you can.
The theory we'll talk about is that if you do your search long enough, that you should converge. But that could be a million years. OK. So let's see. Let's talk about D.
So computers make it possible to obtain very good descriptions of the posterior distribution of parameters. Put this in, in part, because in some parts of Bayesian inference and some subdisciplines within biology, people are not using computers that much, actually, to get their estimates. And they're depending on more analytical methods.
And this actually offers an explanation for why we're typically not taught this type of inference in our statistics classes. Because for most of the problems that we're interested in, you need computers. And you actually need fairly fast computers and computers with storage. And that's not how statistics was done.
OK. And that's not how the professors who teach most of the classes were taught. And so there's going to be an era, I think, where computational statistics will become much more prevalent. And it'll be much more routine to use computers to obtain samples from a posterior distribution.
And so that if you want to investigate something as simple as a difference in means between two groups, that you might be taught or your kids or my kids or something in the future, will be taught a Bayesian approach to that from the get-go. Because it actually makes a lot more sense than a lot of the stuff that we have to learn to be able to do frequentist statistics. So computers have played a key role in Bayesian estimation.
And for practical problems, whatever they are in biology, computers are required. OK. You are not required to be able to do a bunch of algebra or to integrate and to solve analytical solutions. You can use computers to obtain samples from these distributions. That probably was obvious to you.
The next point I've made in passing a little bit earlier, which is we can get densities. So probability densities are readily obtained for transformed parameters. You might not immediately think of why you'd want to transform a variable.
But, for example, if you're describing something about what your expectation for the parameter is and your uncertainty in it, we want to be able to transform it and still have our uncertainty be represented. And so, for example, if you measure two attributes of an organism, two continuous variables, you get an estimate for its mean and its uncertainty. But if you want to multiply those two things together, you don't have a probability distribution for that transformation of multiplying them together.
But if you're co-estimating them in a model, a Bayesian model, you could take that product along the way and get a posterior distribution for that product. So this becomes important in all kinds of physiology and that type of thing, or ecosystem science, or all kinds of different measurements where you want to be able to measure something that actually is a combination of your variables. That's an example.
Or, the example in population genetics that I gave, where you want to estimate allele frequencies. And you can transform allele frequencies to get expected heterozygosity. But that's a quadratic, right? So that's 2 times p times 1 minus p.
How is that going to affect your uncertainty in other contexts? Well, in Bayesian context, you can get a measure of your uncertainty on that transformation of allele frequencies as well. So that's super sweet, too.
And then kind of a grab bag last statement is that it's a robust framework for inference. And by that, I mean using criteria to perform model choice and to be able to represent our strength of evidence for different alternative models. And so we can use things like information criteria to do model choice and weigh our evidence for alternatives, which is really, really important if you're doing things, particularly if you're doing it in the context of policy or anything where really the consequences of your decisions and your choices matter, and you have to weigh things.
So that was a long list of key elements. And as I've gone along, I've said things about likelihood analysis. And so I want to say a couple more, and then put it away.
So these two forms of inference are two of the key tools that we use. There are certain contexts where likelihood analysis is beneficial. And there are some advantages that are gained from it.
It simplifies the problem in a lot of cases relative to Bayesian estimation. You don't have to store as much data on disk. And computationally, it might be easier.
But for that long list of reasons that I gave there, almost each one of them makes me prefer Bayesian analysis over likelihood analysis. OK. So that list of six things, in a sense, was an argument in favor of Bayesian analysis relative to likelihood analysis. Almost each one of them is an advantage of Bayesian over likelihood. OK.
So that's some of my basic motivation, the argument for Bayes' and why we should learn a little bit more about Bayesian probability. Pretty good so far? So we're going to start talking about Bayes and some of the terminology associated with it.
I'm going to just lay out Bayes' theorem, demystify what a prior is, what a likelihood is. And then we're going to do an application pretty quickly, where we'll solve a closed form solution for allele frequencies in a population. All right. So let's go with Bayes' theorem and what it really represents.
I think a lot of us probably learned this in middle school or somewhere along the way in middle school or secondary school if you had a little bit of probability. But you had no idea why you were doing this. It was probably presented in the context of Venn diagrams and logic and that kind of thing.
So even if you haven't used it, you might recognize that Bayes' theorem is the probability of some parameters. And I'm going to write theta. That could be a whole set of parameters.
It could be a mean and a variant. It could be the allele frequencies in a whole set of populations. So I'm just going to use the theta as a grab bag there as parameters.
And that's going to be given the data. It's going to be equal to the inverse probability: the probability of the data given theta, times the probability of theta, over the probability of the data. So P(theta | data) = P(data | theta) P(theta) / P(data). Theta and data rhyme. I didn't realize that beforehand.
So we've got that. That's Bayes' theorem. So all of our inferences are going to be based on this idea. And this is what you probably did if you did this in math at some point without an application is you derived this.
OK. You talked about how this relationship exists. We're just going to use it as a tool. So let's label some of these parts.
This part over here, this probability of the parameters is called the prior probability. And it makes sense as a prior. Because this is our probability distribution before we've seen any data.
And the controversies come in. Because in some cases, particularly in medicine, we know something about our parameters even before we've seen some data. And we want to update our beliefs on the basis of the new data that we have.
But we can choose uninformative priors that basically multiply everything by 1. OK. So, mathematically, you can just think of an uninformative prior as having a 1 over there. It's not going to affect the shape of the distribution at all. It's not really 1. But it's the same probability for all parameter values.
Regretfully, this part of the equation is called the likelihood. You guys agree that that's regretful? There are reasons for it, I suppose. But it's confusing.
But we know that we're not doing likelihood analysis. We just happen to have something that's called the likelihood within our equation. But some of the terminology that can trip you up when you read books and papers is just that. And you're like, I thought I was reading a Bayesian analysis paper. Now, they're talking about the likelihood.
In all cases, the likelihood will be the term, however long the equation is, where the data ultimately appear on the left side of the given sign. OK. Sorry, I should explain. I think you know what this means. I think I said it.
But this vertical line means given. You've written that in math, right? I sometimes lose my confidence about what we learned and what we didn't learn. But so we're all on the same page, it means for the stuff on the left, we're calculating the probability of that given this other thing, that we've observed some data.
Same thing here is that given the parameters, what's the probability of our data? And once we have that illustrated, it won't be hard to figure out how that works at all. We'll be thinking about something like, given the allele frequency in a population, what's the probability of observing these data that I have in hand?
OK. So that would be an example of a likelihood. So that's not very complicated. But we just need to have those labels to know what we're talking about. Some other terminology and conceptual stuff to introduce is that you will hear me and others refer to analytical solutions for this equation.
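Before getting to analytical solutions, the labeled pieces, the prior, the likelihood, and the probability of the data in the denominator, can be checked numerically in a toy discrete case. The numbers below are invented purely for illustration:

```python
# Toy discrete version of Bayes' theorem (all numbers invented for
# illustration): two hypotheses about a coin, theta = 0.5 (fair) or
# theta = 0.8 (biased), each given prior probability 0.5.
# Data: 8 heads out of 10 tosses.
from math import comb

def binom_pmf(y, n, p):
    # The likelihood P(data | theta) for coin tosses is binomial
    return comb(n, y) * p**y * (1 - p)**(n - y)

prior = {0.5: 0.5, 0.8: 0.5}                      # P(theta)
like = {t: binom_pmf(8, 10, t) for t in prior}    # P(data | theta)

# The denominator P(data): sum of likelihood times prior over all thetas
p_data = sum(like[t] * prior[t] for t in prior)

# Bayes' theorem: posterior = likelihood times prior, divided by P(data)
posterior = {t: like[t] * prior[t] / p_data for t in prior}
print(posterior)
```

With these made-up data, the posterior sums to 1 and puts most of the probability on theta = 0.8, exactly the updating of beliefs described above.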
I guess I should label on the left side that this is the posterior probability. We can talk about [INAUDIBLE]. I've already said that. OK. So there are analytical solutions for this posterior probability over on the left side of the equation.
These are available in simple cases. And I'll actually say in simple classroom cases. The other word or the other term that you'll hear for analytical solution is closed form, the alternative way of saying the same thing.
And, basically, this is going to be an equation for the integral, so the calculus statement for that probability. OK. So it is going to be an equation for the full probability distribution that sums to 1. OK. All probability distributions sum up to 1.
You can write a summation if it's a discrete probability distribution or an integral, which is the continuous equivalent of a summation, for that left-hand side of the equation. Given that, this is what I mean by saying that there are probability distributions. We can then calculate quantiles and the expectation of the distribution, so the expectation being the mean.
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Yes, thank you. I want to do a little bit more terminology. Because at least a few people will be helped by this. If we have a probability distribution, we know over all values of this function that it sums up to 1.
And so if it's a discrete distribution or if it's a continuous distribution, that would be an integral. When we talk about the expectation, we're talking about the mean of that distribution. And I don't think that's news.
But when I say quantiles, I'm sure that there are a few people who don't know for sure what I mean by a quantile. It's just a percentile is another way of thinking about it. And often, we're interested in the 97.5% quantile and the 2.5% quantile of the distribution.
The other term that you hear about is quartiles. Those are the values that divide the distribution into quarters. 25% of the values lie to the left, 50% lie to the left, 75% to the left. So when we see box plots in R, it's presenting those three quartiles.
OK. Quantiles is more general. It's any point on that distribution. And we're really interested in the 2.5% and the 97.5%, because that's our conventional way of presenting uncertainty. The distance between those two points represents our uncertainty.
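For anyone who wants to see quartiles and quantiles computed, here is a small standard-library sketch. The data are made-up draws, and Python's statistics module stands in for however you would do this in R:

```python
# Pulling quartiles and quantiles off a pile of samples, standard library
# only; the draws here are just made-up normals for illustration.
import random
import statistics

random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# The three quartiles that a box plot shows: 25%, 50% (the median), 75%.
q25, q50, q75 = statistics.quantiles(samples, n=4)

# Cutting the distribution into 40 slices gives cut points every 2.5%,
# so the first and last cut points are the 2.5% and 97.5% quantiles
# that we conventionally report uncertainty with.
cuts = statistics.quantiles(samples, n=40)
lo, hi = cuts[0], cuts[-1]
print(q25, q50, q75, lo, hi)
```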
Was I right? Quantiles is worth explaining for a few of you? OK. So we can calculate that.
We can obtain those quantities from not just analytical solutions, but in the alternative cases as well. I said that these are available in simple cases. OK. And unfortunately-- maybe someone asked me the question yesterday. Where can we go learn about Bayesian stuff? But it was a different question I think.
But you might have that question now. Where can I go learn more about this? I want to read a book. You were confusing.
I found the books, for the most part, to be very frustrating. Because you have to wade through about 200 pages of analytical solutions to get to anything that's useful for me. OK. So that's why I'm cutting past that for the most part. OK.
Most of the analytical things are not useful to the biological problems that we're studying as population geneticists and to a lot of other things as well. You need computational solutions. And so that's my next category here is that you have computational simulations.
That's not how people usually refer to them. They usually refer to them as MCMC. So one of those MCMCs refers to the simulation part. It's the Monte Carlo part, Monte Carlo being a place where there are a bunch of casinos and games of chance.
And so they're simulations. Because simulations are-- well, now I understand that there are people who think that there are non-stochastic simulations, which are basically back-of-the-envelope calculations that are deterministic. But they call those simulations, too.
I always thought simulations meant stochastic simulations. That's what I mean when I say it anyway. So stochastic simulations are what we were thinking about, the ones that have chance involved, probability distributions.
So these allow us to not calculate this equation explicitly, but to take samples from that distribution. So that if we take a bunch of samples, we have a very good description of its shape. So they are going to be simulation based samples. And so we're going to draw samples from the posterior.
And there are algorithms to do the sampling. You've heard of some of these. And I'll talk more about them. But you've seen people refer to Metropolis algorithm or Metropolis-Hastings algorithm. And you said, OK, fine, I don't know what that is maybe.
But it's an algorithm for ensuring that as long as you sample long enough, that you will be taking samples from this posterior distribution. You might have to burn in, so that you get beyond some period to where you're sampling from a stationary posterior distribution. So that's what these algorithms ensure.
And a lot of the development in Bayesian estimation and things that people are doing with computational statistics is working on these algorithms and having good ways of obtaining samples through simulation. This is very different than where you might have learned likelihood. I think a lot of people are exposed to likelihood in the context of phylogeny estimation, for example, tree estimation and searching that tree, where it involves some type of search algorithm.
What we are doing is not searching. OK. I didn't say that we're searching for the maximum likelihood estimate or the estimate in question. We are seeking to draw samples from that distribution, meaning that we will draw samples that are unlikely according to their probability. And we will draw samples that are likely according to their probability.
OK. And so the numbers of samples that we get at different values will be informative about the shape of that posterior distribution. OK. So it's not a search. This is something that, at least for some people, is a challenge to think about.
You're not running the stochastic simulation longer to be able to finish your search. You're running it long enough so that it achieves stationarity, it's converged, and you have a large enough number of independent samples from that distribution to have a good idea of what its shape is. Yes.
AUDIENCE: So what is it converging on?
ALEX BUERKLE: The posterior distribution.
AUDIENCE: So [INAUDIBLE].
ALEX BUERKLE: So you're going to be really wrong when you start. OK. You put in a number. And this is a simulation based method. It's going to have an algorithm for updating that number. And early on, it might change very rapidly.
You are not sampling from a stationary distribution that's constant. Instead, it's moving as you update by this algorithm. And so you need to burn in for a certain period, so that you're getting repeated samples from a stationary distribution.
AUDIENCE: So it's the distribution that's reaching stationarity?
ALEX BUERKLE: Is your posterior.
AUDIENCE: Yeah, posterior distribution is reaching stationarity rather than your sample [INAUDIBLE].
ALEX BUERKLE: No. I mean, the posterior distribution is what it is. But your estimates or your samples need to reach stationarity and converge on that posterior distribution.
AUDIENCE: Because, by chance, you could end up sampling a whole bunch of the tail and that's--
ALEX BUERKLE: Or, you could just be very far from the answer.
AUDIENCE: Oh, OK.
ALEX BUERKLE: Particularly if you're not bounded by zero and one-- let's say your parameter is a variance. So it could go up to infinity. You might start off far away from infinity. And you would take some time to converge upon samples that are samples from the posterior distribution, those parameters given the data.
Because a lot of parameters would be highly improbable given the data. And so you need to move away from those and get into a place where you're sampling from the posterior distribution. And so these algorithms that are very, very simple-- in truth when you look at them they're very, very simple-- ensure that if we sample long enough, we will converge upon the posterior distribution. The problem is we don't know how long that will be.
AUDIENCE: Yeah.
AUDIENCE: Are there possibilities of having multiple posteriors during difference minimal-- I guess, like [INAUDIBLE].
ALEX BUERKLE: That's the big difference with likelihood and why I made that distinction is that we're not searching for an optimum and the shape around that. So we can't get stuck in the same way as long as we have sampled long enough even if we have a multi-modal distribution. As long as our form of our equation is such that it could be that, we could estimate it. And we'd get samples from that distribution according to their probability.
AUDIENCE: Yeah. You might just need a sample [INAUDIBLE]?
ALEX BUERKLE: No. So for example, in Bayesian estimation, people often use it for mixture models where you actually have a mixture of multiple different distributions. And if you specify that properly in your equation, you can calculate the properties of that mixture distribution where it really is multi-modal.
So if you have a mixture of individuals from-- you're studying diets or something like that. And you're using isotopes to look at the different contributions to their overall makeup. That's a mixture of isotopic compositions in different species. So I think you don't end up getting stuck-- the theory will show, if you run it long enough, that you will be able to sample off of each of those modes. You won't get stuck on one of them if that's the issue. Yeah.
AUDIENCE: Can you explain what you mean by burn in.
ALEX BUERKLE: I will. Let me think when I'm going to do that. Let's do that later when we're doing some of the samples. But it means those samples that you can discard, because you can recognize that you are not sampling from the posterior distribution, that you're moving.
Your estimates are moving in a directional way. And so, often, people plot traces of their parameter estimates through MCMC steps. And you see, oh, it's still moving, and then it settled down.
OK. And so that's one of the benefits of being able to look at your estimates as you go through an MCMC is that you can do those diagnostics. Some of the software that we use, you specify that a priori, right? You don't ever even look at that.
So, for example, with STRUCTURE, people just typically burn it in for a bunch of steps. And then they take a certain number of samples. But they never actually look at how well things are mixing.
But the idea is that you've burned in long enough. You've thrown those samples away, so that all your remaining samples are from the posterior distribution. There was another question, yeah.
AUDIENCE: So the sampling is never done with replacement?
ALEX BUERKLE: Absolutely. There are-- well, I'm jumping ahead. There are two different ways to update your chain and take your next sample. Your next sample could be what's called an independence chain, which is that they constantly come from the same distribution.
Or, it can be a random walk where the update is based on your current value and some kernel of proposal values around that. But it's not sampling with replacement at all. So in memory when you're doing this computationally, you typically only have the current step and the proposed next step.
Those are the only two things. So often, Bayesian programs are not very memory intensive. Because they aren't necessarily keeping a ton of samples in RAM. They're disk intensive in that they're writing lots and lots of MCMC steps out to disk. Yes.
AUDIENCE: You raised about the program STRUCTURE for the Bayesian [INAUDIBLE] analysis and [INAUDIBLE] for MCMC approach. Other people have talked about alternative [INAUDIBLE] like BAPS. Does that represent a different type of computational simulation? Or, is that a more [INAUDIBLE]?
ALEX BUERKLE: So BAPS as an alternative model just for the population genetics? It's specifying a different equation for the posterior. But in terms of these algorithms that I'm talking about, they're the same.
OK. In fact, whether you're doing analysis of isotopes or studying the economy or modeling all kinds of different things, you're using the same math in terms of algorithms. You're just specifying a different model. You're making different assumptions about allele frequencies in populations and in individuals and how they mix.
So if I remember what BAPS is, it's a more spatially explicit model. And so it's incorporating information about space I think. OK. So we have analytical solutions and simulation solutions.
I haven't talked anything about the denominator, the probability of the data. It's going to be a normalizing constant for the equation. So it's in the denominator.
It's a quantity that divides the numerator so that the left-hand side of the equation adds up to 1. OK. It's a normalizing constant. So in a lot of what we're going to talk about, and what you will do when you're doing MCMC methods, the probability of the data, all it does is shift the surface up and down. It's just a constant, right?
It's just a number. It's multiplication, division. They're the same thing, right? It's just scaling all the values by some constant value.
In analytical methods, analytical solutions, we need to have some way to calculate that typically or do some math trick to make it just go away. And in simulation methods, what we will do instead is if we use the simulation methods, we are ensured that we can get the form of this probability distribution right. And we'll write this thing that is supposed to be a symbol that looks a little bit like an alpha symbol, but it means proportional to.
We're going to have our posterior probability be proportional to the likelihood times the prior. And the algorithms, the MCMC algorithms, ensure that the random samples we draw from the posterior will allow us to approximate the true posterior probability distribution correctly without ever knowing what that constant is in the denominator. So by using Metropolis-Hastings, using Metropolis, running it long enough that we've converged on the posterior distribution, they will ensure that we can write our equation in this way and disregard the probability of the data, and that the relative density of our random samples in different intervals will reflect the true probability distribution.
So we never need to know that scalar in terms of how it moves things up and down. It will be integrated out through our stochastic sampling. So we're going to take that on faith. I'm pretty sure that a probability professor could talk about that for a week. But I'm not going to do that.
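A minimal random-walk Metropolis sampler makes the proportionality point concrete: the accept/reject step only ever uses a ratio of unnormalized posterior values, so the probability of the data cancels and never has to be computed. This is just a sketch with made-up data (7 successes in 10 trials, a flat prior on a proportion), not any particular program's implementation:

```python
# Minimal random-walk Metropolis sampler. The target is known only up to
# a constant: posterior proportional to likelihood times prior.
# Made-up data: 7 successes in 10 trials, flat prior on p in (0, 1).
import math
import random

random.seed(42)
Y, N = 7, 10

def log_unnorm_posterior(p):
    # log(likelihood times prior); the binomial coefficient and P(data)
    # are constants in p, so they are dropped entirely.
    if not 0.0 < p < 1.0:
        return -math.inf          # prior probability 0 outside (0, 1)
    return Y * math.log(p) + (N - Y) * math.log(1.0 - p)

samples = []
p = 0.5                           # arbitrary starting value
for _ in range(20_000):
    proposal = p + random.gauss(0.0, 0.1)   # random-walk proposal
    # Accept with probability min(1, ratio); the unknown normalizing
    # constant cancels in this ratio, which is why we never need P(data).
    log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(p)
    if math.log(random.random()) < log_ratio:
        p = proposal
    samples.append(p)

kept = samples[2_000:]            # discard a burn-in period
mean = sum(kept) / len(kept)      # true posterior is Beta(8, 4), mean 8/12
print(round(mean, 2))
```

Run long enough, the density of the kept samples in different intervals traces out the posterior's shape, without a search and without the denominator.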
And as I already wrote up above, for both analytical solutions and these computational solutions, we can calculate quantiles and the expectation. Both of them will give rise to them. So how would we obtain the quantiles, the 2.5% quantile of 10,000 samples that we generated for a parameter?
We're using stochastic simulations. Let's not make it 10,000. Math is hard before 9:00. Let's make it 1,000. OK. We have 1,000 samples that we think are from our posterior distribution.
How would we-- they're now on your desk. How would you find what is the 2.5% quantile of those simulations? What would you need to do?
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Pardon?
AUDIENCE: [INAUDIBLE] the variance?
ALEX BUERKLE: You could calculate the variance. And then you can make some assumptions about the shape of that distribution, and then use the parametric estimate of the 95% confidence interval. But we don't want to do that. We want to know the 2.5%-- let me make it easier. How would we calculate the median?
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Sort them. And we'd make a pile of 500 over here and 500 over here in sorted order. And we'd compare those two values, the 500th and the 501st. Although it's awkward. Medians are awkward with even numbers.
So anyway, approximate with those. We'd look at that point. It would be roughly between our 500th and our 501st. So what would we do for the 2.5% quantile?
The median is the 50%. So what do we need to do? We still need to sort them. And we need to see all of them to be able to sort them. You can't start calculating quantiles until you have all of your samples.
OK. This is bad. Because it means you have to write them to disk, so that you can look at them afterwards. You can't calculate posterior estimates or quantiles as you go. You could approximate them maybe.
So what do we need to do for the 2.5%? We need to sort them. And we're going to look at our first--
AUDIENCE: 25.
ALEX BUERKLE: 25. So this is an empirical estimate. Just sorting, it's an empirical estimate of this probability distribution. That's what you always do with simulation.
OK. To estimate quantiles of a distribution, it's an empirical sample. OK. And so that's what I mean here. The mean is easy. We could just add them all up and divide by the number of observations. That gives us the mean.
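That recipe, written out for 1,000 samples (the draws here are uniform stand-ins for whatever your MCMC run wrote to disk):

```python
# The sort-and-read-off recipe for 1,000 samples. The draws are just
# uniform stand-ins for real MCMC output.
import random

random.seed(7)
draws = sorted(random.random() for _ in range(1000))

median = (draws[499] + draws[500]) / 2  # between the 500th and 501st value
q025 = draws[24]                        # the 25th sorted value: 2.5% in
q975 = draws[974]                       # 25 values from the top: 97.5%
mean = sum(draws) / len(draws)          # add them all up, divide by count
print(q025, median, q975, mean)
```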
So that's how we can report our quantiles, 2.5% and 97.5%, the median if we want, and then the mean. So I think that illustrates that. OK. Any questions?
I want to work an example and an application to some population genetic data. I've said that we can't typically do closed form solutions. But let's do one just to amuse ourselves and see whether I can do algebra on Wednesday morning.
So let's work an example and an application to pop gen. So our example is pretty simple. We worked hard. We got 100 individuals from a population.
63 of them had this genotype. 34, this genotype. And 3, this genotype. So we're not working with genotype uncertainty. OK. We won't be able to do closed form solutions when we work with genotype uncertainty.
So this is with known genotypes. We went out and collected these 100 individuals and found these counts of different genotypes. So these might be SNP alleles. Or you got them by some means.
And so we're going to take p as our allele frequency for the A allele. So if we weren't in Bayesian analysis, what would be our point estimate of the frequency of the A allele? So why don't you guys do that? I'm not going to wait for one.
Everybody, let's figure out what's the frequency of the A allele in this population. What's our point estimate for it? Don't be confused. Just what is the frequency in the sample? That is your point estimate.
I hear 0.8. I set it up so that came out to be a nice number. So how did you get this? Well, you said 63 times 2, because each of the individuals has two copies of that allele. So we're counting them.
And then we're adding the 34 individuals that have just one copy of it. That gives us 160. And there were 200 total allele copies in the sample, right? This is intro bio stuff.
We make people do this, so that then we can make them go through convulsions with Hardy-Weinberg. But that's a point estimate of the frequency and our population. Because this is a sample.
We took in a finite sample of 100 individuals from a population. Our best information about that population comes from our sample. And we would say that the true underlying frequency, our point estimate for it, is 0.8 without using Bayesian approach.
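The arithmetic just described, as a few lines of Python. The AA/AT/TT labels are my shorthand for the three genotype classes with two, one, and zero copies of the A allele:

```python
# Point estimate of the A allele frequency from the genotype counts in
# the sample of 100 individuals.
counts = {"AA": 63, "AT": 34, "TT": 3}

a_copies = 2 * counts["AA"] + counts["AT"]  # 63 x 2 + 34 = 160 copies of A
total = 2 * sum(counts.values())            # 100 individuals = 200 copies
p_hat = a_copies / total
print(p_hat)                                # 0.8
```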
But what if we wanted to do a posterior probability distribution and estimate this in a Bayesian framework? And so we want to write an equation for the probability of the allele frequency given our data, these data that we just talked about. What would that need to have?
What are the two terms that are going to be on the right side of the equation? A likelihood and a prior. OK. So I'll help with the prior.
OK. So there is a prior. That's our parameter, the probability of p. What is our likelihood going to look like? This is the inverse probability, right?
So we're just going to write probability of p given the data. We don't have to think yet. This is all automatic. This is just-- thank you. Am I off the screen?
AUDIENCE: [INAUDIBLE] reversed.
[INTERPOSING VOICES]
ALEX BUERKLE: Oh, sorry. Yes, thank you. That's the right response when I do something wrong. But at first, I didn't recognize what you were talking about.
OK. So there we have our likelihood. We're going to, for now, put the proportional symbol there. OK. We're not going to worry about the probability of the data in the bottom.
So let's deal with the likelihood first. And I should say what we're doing right now, this is actually very representative of what you would do in any type of analysis. You would write the equation.
Because for the equation, you actually don't need to know what form any of those things have. You're just sort of defining the hierarchy and the relationships with the things. Now the next step is to say, well, what probability distributions are you going to use for those things, for the likelihood and for the prior.
So we're going to make some choices about those. We're going to specify our model. So in our model specification, let's think about the likelihood. We have drawn alleles from our population, our sample, and there's an underlying probability of seeing the A or the T allele.
They're complementary. They're the only two possible outcomes of our trials. What does that sound like? Binomial. It sounds like coin tossing. We can think of A and T as the two possible outcomes in a Bernoulli trial.
OK. And we've performed a certain number of Bernoulli trials. Those are described by a binomial distribution. OK. Notice that what we did with our data-- and this relates to the topic of this workshop-- is that we got rid of the individual information.
If we want to estimate allele frequencies, we do not need to know genotypes. Our data are not genotypes in this case. We quickly turn them into 160 allele copies that were A and the remaining 40 that were T. Those are our data.
So for estimating allele frequency in populations, any piece of software that does that quickly takes your carefully, painstakingly gathered data, where you determined genotypes with confidence, gets rid of that, and says, oh, you had 160 out of 200. And we know how many individuals you sampled. So that's, I think, illustrative.
Someone said correctly that this is a binomial. We're going to use a binomial. And we're going to think about the probability of our data, where those are the counts of the A allele, given the probability of the A allele. We defined that way up here: p is for the A allele. And we need to know how many individuals we've sampled. But that's fixed. That's not a parameter.
But that defines the shape of the binomial distribution. And so in our case, it was n equals 200, the number of allele copies that were sampled. So that seems pretty comfortable.
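To make the likelihood concrete, here is a sketch that evaluates the binomial probability of our 160-out-of-200 count at a few candidate values of p (plain Python, not any particular package):

```python
# Evaluating the binomial likelihood P(y | p, n) at our data, y = 160
# A copies out of n = 200, for a few candidate allele frequencies.
from math import comb

def binom_pmf(y, n, p):
    # n-choose-y combinatorial coefficient times p^y (1-p)^(n-y)
    return comb(n, y) * p**y * (1 - p)**(n - y)

for p in (0.5, 0.7, 0.8, 0.9):
    print(p, binom_pmf(160, 200, p))
# The data are most probable near p = 0.8, our point estimate.
```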
What about the prior? How might we specify the prior? What form might it take? Yes.
AUDIENCE: So is there an important distinction here between the proportion of the A allele in the sample and the proportion in the population?
ALEX BUERKLE: Yeah. By p, I mean the true parameter for the population. OK. We know what the proportion is in our sample. That actually doesn't require estimation. But that's a good point.
The random variable that we're trying to understand, the unknown unobserved thing is the true parameter for the population. And so what attribute-- yes.
AUDIENCE: When you said [INAUDIBLE] p, is that the big P or the little p?
ALEX BUERKLE: OK. So my big P is the probability when I keep writing P. That's what I'm talking about that. I could write prob P.
AUDIENCE: So just related to the last question, when he's talking about the true parameter in the population, is that the big P or the little p?
ALEX BUERKLE: I really mean that there's only one little p as a parameter. That's the only one. And so when I wrote up here this was the parameter for true A allele frequency, this was our point estimate of that. It happens to be the same number, which is a little bit confusing maybe. Because our best information about that true parameter is our sample parameter.
Seem OK? So we're going to try to estimate that true allele frequency. So we're thinking about what is the prior. So what should some of the features of this prior be? What would be desirable features for a prior for allele frequency? It should be positive.
AUDIENCE: Between 0 and 1.
ALEX BUERKLE: Between 0 and 1. OK. So those are our key features. We want 0 and 1 as our interval. So it can take any value between 0 and 1. Are some values more likely than others?
AUDIENCE: Yes.
ALEX BUERKLE: Yes?
AUDIENCE: [INAUDIBLE] is a lot more [INAUDIBLE] 0.
ALEX BUERKLE: 0.8 could be more likely than 0, but not necessarily. I'm sorry. A priori, before we've seen our data, so it's a prior probability. Before you've seen your data, do we want to ascribe greater anticipation of some results than others?
AUDIENCE: You could assume everything is in [INAUDIBLE]. And that could be [INAUDIBLE] you know [INAUDIBLE] the population [INAUDIBLE]
ALEX BUERKLE: So this is a model for allele frequency. So Hardy-Weinberg would be a model for genotypes given the allele frequencies. But the other thing of Hardy-Weinberg is that they just stay constant.
So the model part of Hardy-Weinberg for allele frequency is just they're constant. So we don't need to make any of those assumptions. It's just about allele frequency per se.
0.5 would be a good expectation, so middle of the road. But what about 0.75? Should it have lower probability than 0.5?
So we want maybe a flat distribution, right? I drew some of these yesterday. Or, they were on the computer screen. Where we have 0 to 1 that looks like this, would be reasonable, so that all of them are equally likely.
What are different names for this distribution? A uniform distribution, where there's a uniform probability over interval, over the parameter space. And I'm going to start us with a beta distribution. Because the beta can take that shape. The beta includes the uniform as a special case.
It has limits of 0 and 1. And it can be symmetric, so that 0.5 would be our expectation. And in particular, beta 1, 1-- or we could write it out more extensively, beta where alpha equals 1 and beta equals 1-- has that uniform distribution.
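A quick numeric check of that claim, standard library only, with the beta density written out by hand:

```python
# Sanity check that Beta(alpha = 1, beta = 1) is the flat, uniform case:
# its density p^(a-1) * (1-p)^(b-1) / B(a, b) comes out 1 everywhere.
from math import gamma

def beta_pdf(p, a, b):
    # B(a, b) written in terms of gamma functions
    B = gamma(a) * gamma(b) / gamma(a + b)
    return p ** (a - 1) * (1 - p) ** (b - 1) / B

for p in (0.1, 0.5, 0.8, 0.99):
    print(p, beta_pdf(p, 1.0, 1.0))   # 1.0 at every p
```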
Now, you might be asking, well, why would you want to do that instead of just a uniform? And the reason is because if we multiply a beta times a binomial, the shape of the posterior is a beta. By assuming a beta, we get conjugacy.
The beta is the conjugate prior for the binomial. This lets us do the closed form solution. Intrinsically there is nothing better.
There's no reason to prefer the beta over the uniform. They are the same probabilities. They're flat over this interval between 0 and 1. But algebraically, the beta gives us something nice, which is that if we multiply a beta times binomial, what I'm going to show you in a minute with some algebra is that you get a beta back.
OK. So we know the shape. We don't know the algebraic relationship between a binomial times a uniform. There is no closed form analytical solution for it. But a binomial times a beta gives you back a beta. Yes.
AUDIENCE: Is having the beta [INAUDIBLE] the only thing that you need in order to [INAUDIBLE]?
ALEX BUERKLE: Yes. That's all we need. And it's going to be super easy. The algebra will take me a second to write down. But we already know what we're going to get. I've told you.
We're going to get back to a beta distribution. OK. So all the algebra will be able to show us what the equation is for the left side of the equation. On the left side, it will be a beta with new parameters. Because we just need to figure out with the algebra how are the beta parameters changed by multiplying the binomial times it. OK. Yes.
AUDIENCE: So what would have happened if you'd use a [INAUDIBLE]?
ALEX BUERKLE: This is an uninformative prior. This is uninformative. And it's a good question. It's uninformative, because all the parameter values, any value, any model that I can consider, has equal probability. Right? All the different p parameters, the infinite number of models that are between 0 and 1, all have equal probability.
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Because it's beautiful, yup. I'm sorry, because the probability is uniform across this beta distribution whether you specify it with a beta or [INAUDIBLE]. Yes.
AUDIENCE: So related to that, why not incorporate the knowledge and [INAUDIBLE] frequency into this distribution?
ALEX BUERKLE: So at this point, the prior, this equation, has no information about the data. We're really specifying this saying, what is the prior probability of this parameter in the absence of any information. So what you might be thinking of is that sometimes we do know something about the frequency at a particular locus given features of the genome, let's say.
Allele frequencies, an allele frequency spectrum in the population, that might be able to tell you something and give you some prior information about that locus. And we'll talk about that this afternoon. But if you're just talking about this mathematical term, it does not have the data in it.
OK. So there are no data at this point. OK. So we're taking that term and multiplying it by this term where the data do appear. That has a mathematical relationship to our posterior distribution. OK. So we're constrained by the Bayes' theorem to have the likelihood and the prior.
And if we knew something about the probability of the data and put it in the denominator, we can put it there. But these are just the two terms. And in the prior, the data don't appear.
I want to add that we could use MCMC for this. There are always analytical methods and simulation approaches. You should get the same answer if you do the MCMC right. OK. But we can do an analytical solution to illustrate this point. What am I supposed to do at 9:30?
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: So we're scheduled for questions from 9:30 to 9:45. So just interrupt me. And I'll just keep going. We'll get this analytical solution done.
So the closed form solution, given what we've just said, we've got our full model that I'll just rewrite. So the probability of p-- and I'm going to call the allele count, our data, I'll just call y if that's OK. And we had n trials. It's proportional to the probability of y given p and n times the prior.
And we already had that, right? And we said we're going to make this a binomial likelihood. So we need to write out the equation for the binomial.
So this is probability of y term given p and n. So the basic part of the binomial or an important part is p to the y. So that allele frequency to the number of samples that-- yes.
AUDIENCE: So the y now is the p in the last section? Is that--
ALEX BUERKLE: No. So before, we estimated p with a point estimate. Though I never wrote down a symbol for y before, it was our 160.
AUDIENCE: [INAUDIBLE] data.
ALEX BUERKLE: It's the 160. OK. If you want to write 160 there, the data are fixed in this example. So you could just write 160 there. It wouldn't change the meaning of anything.
So, basically, here we have an allele frequency. We have 160 draws from the population that matched that allele. OK. So that's going to be this term.
And then there's another frequency, the alternative allele. And we had n minus y draws from the population that matched that. So in our case, that would be 40, right? 200 minus 160 matched that case.
So that's most of the binomial probability distribution. What's missing? You guys remember this? This is definitely, like, regents exam or something to be able to get out of high school type of thing.
What's out in front of the binomial? n choose y, the binomial coefficient. And there's actually a function in R, choose(), to be able to do that type of thing.
But it's the combinatorial coefficient for all the different ways that you could get this, where you get 160 and 40. But there are a bunch of different ways of doing that. We do that in pop gen, where we calculate the two different ways that you could get a heterozygote.
OK. This is just a big coefficient out here, n choose y, that does not depend on p. OK. It's just a coefficient for all the different ways that you could get our numerical result, and it involves only n and y. So we're just going to write a C out there for now.
If you like, you can write n choose y out there. And that involves factorials and a ratio, which is fun to write, but does not affect our calculations. Because it's just a coefficient. It does not depend on p.
So I'll write that out. C is a constant that doesn't depend on p. So we don't care about it for now.
We have a beta prior. So that's our probability of p. That equation is probably unfamiliar. But it's going to have p in it. And it's going to have alpha and beta, because those are the parameters of the distribution.
And it has a similar form to the binomial. It looks like this. And it has its own constant out front that does not depend on p. So C is still a constant out in front.
So it's p raised to the alpha minus 1, times 1 minus p raised to the beta minus 1. That's a beta equation. That's the equation for a beta distribution.
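Written out, the two terms on the overhead are the binomial likelihood and the beta prior. In both, the leading constant (the C in the lecture) does not depend on p:

```latex
P(y \mid p, n) = \binom{n}{y}\, p^{y} (1 - p)^{n - y}
\qquad
P(p) = \frac{1}{B(\alpha, \beta)}\, p^{\alpha - 1} (1 - p)^{\beta - 1}
```

Here B(α, β) is the beta function; both leading constants drop out when we work up to proportionality, which is why only the p terms matter below.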
So we know up here that we have this term, our likelihood. And we want to multiply it by this term. And we can look at this.
And we say, well, we have the same base twice. We have p twice. And we have 1 minus p twice. It appears in both terms.
And rules of exponents are that when you multiply things with the same base, we just add the exponents. So all the algebra that we have to do is when we multiply the binomial likelihood times the beta prior, we need to say, well, that p base, the one with y as its exponent, is going to be multiplied by this one. So if we say p to the y times p to the alpha minus 1, what does that simplify to?
Well, we add exponents. It's y plus alpha minus 1. And the other base that we have there, the 1 minus p, is to the n minus y. And the other one is beta minus 1. We just add the exponents again.
So if we multiply those two things together, what do we get? We get, if we're multiplying our likelihood times our prior, is equal to that constant that's out in front times these terms that we just calculated. Of the two probability distributions that we talked about, this looks like both of them, right?
Both of them have the same bases. But it's the beta that has the minus 1 term. And this right here is a beta distribution. I'll write it lowercase, a beta distribution with new parameters.
OK. So it's beta y plus alpha. And the other parameter is now n minus y beta. That's our posterior distribution. This is the equation for our posterior distribution, is a beta with y plus alpha and n minus y minus beta.
In our example, what did we set alpha and beta to be? 1, right? And so our posterior distribution we can further simplify it to be y plus 1 and n minus 1-- I'm sorry, n minus y plus 1. That is the posterior distribution for allele frequencies, the algebraic solution for it.
It is a beta distribution with those new parameters. What we did here is that we took our prior and updated it by our data. We added our data to it is one way to think about this.
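The multiplication just described, with exponents added for each shared base:

```latex
P(p \mid y, n) \;\propto\; p^{y} (1-p)^{n-y} \times p^{\alpha - 1} (1-p)^{\beta - 1}
\;=\; p^{\,y + \alpha - 1}\, (1-p)^{\,n - y + \beta - 1}
```

That is the kernel of a Beta(y + α, n − y + β) distribution. With the uniform Beta(1, 1) prior and the lecture's counts, the posterior is Beta(160 + 1, 40 + 1).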
We had a prior probability distribution, which was uniform, with alpha and beta both equal to 1, right? And we updated it by the number of observations that fit the one allele category. How many was that? What was y?
AUDIENCE: 160.
ALEX BUERKLE: 160. So in terms of our calculation, this is going to be 160 plus 1. Sound right? This is going to be 40 plus 1.
So, algebraically, that is the equation for the beta distribution. And it's going to be 160 plus 1. And the other parameter will be 40 plus 1. Right there, you can judge the relative effect of your data and the prior.
The prior was like having one prior observation of each of the two allele states. It affects the shape of this distribution by adding 1 to both categories. But your data are of much larger magnitude, 160 and 40, so they have a much larger effect on the shape of the posterior distribution and its density. Amanda?
AUDIENCE: Yeah. I was looking for-- the first line of p, you have n minus 1 plus beta minus 1. How do you go from that to the next line down where it's n minus y minus--
ALEX BUERKLE: Yeah. So this is the equation--
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Well, so this is the equation for a beta distribution with parameters alpha and beta. So this right here is our alpha. This is our beta. So if we look down here, this is our new alpha. This is our new beta. So I just put those-- whoop, whoop.
AUDIENCE: It just seems like you're missing something.
AUDIENCE: You changed a sign on the--
[INTERPOSING VOICES]
AUDIENCE: --plus beta.
[INTERPOSING VOICES]
ALEX BUERKLE: Yes. Sorry. There were a lot of choices. And a lot of symbols start to [INAUDIBLE]. Did I get it right now? Sorry about that.
It's good to throw in some-- if everything's right up here, if it was all PowerPoint--
AUDIENCE: If it makes you feel any better, one of my math professors in college says always my arithmetic. Mathematicians can't do what we do.
ALEX BUERKLE: Yeah. And it's hard to think and write and all that at the same time. OK. So that's cool, I think. But most of us have a sense of what a beta distribution looks like at the extremes, like flat. And I said something about u-shaped and modality yesterday. But let's plot this.
Because we now know the equation for this, we should be able to plot it in R. OK. Because one of the whole reasons you use R is to plot. And so it's on the web. But I think it's just as easy for me to write the code up here.
And then during our break, you can find in my section of the course, I've numbered where we are in some R code. There's a file called Commands.R. And there'll be a version of what I'm going to write there. You can just copy and paste and put it into R to plot the beta distribution.
But what we want to do is we want to plot a bunch of deviates. So we're going to sequence from 0 to 1 at some scale that pleases us. A bunch of numbers between 0 and 1, those will be our deviates where we want to look at the height of the distribution.
So those will be our p values, the allele frequencies that we're going to evaluate the density for. And we'll look at the density of the beta for all those deviates. And we need to give it the parameters. And as I mentioned before, in R, we have shape 1 and shape 2.
They can be matched positionally or named. But our alpha will be this y plus alpha. So that'll be 160 plus 1. And shape 2 will be our 40 plus 1.
And we need to then close that, because that's all the beta stuff. And we can say that we want our plot to be a line. And so we're sure that, what we're looking at, on the x-axis we have p, our estimated allele frequency. And on the y-axis, we have a density.
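The canonical version is in the Commands.R file mentioned above; this is a minimal reconstruction of what is written on the overhead, using the lecture's counts of 160 and 40 and the Beta(1, 1) prior:

```r
# Deviates: a grid of allele-frequency values between 0 and 1
p <- seq(0, 1, by = 0.001)

# Posterior density: Beta(y + alpha, n - y + beta) = Beta(160 + 1, 40 + 1)
dens <- dbeta(p, shape1 = 160 + 1, shape2 = 40 + 1)

# Plot the density as a line, with p on the x-axis
plot(p, dens, type = "l",
     xlab = "p (allele frequency)", ylab = "density")

# Posterior mean, (y + alpha) / (n + alpha + beta): close to the
# point estimate 160/200 = 0.8
161 / (161 + 41)
```

The posterior mode sits exactly at 0.8, the point estimate from earlier in the lecture, and the density shows how much uncertainty surrounds it.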
So I suggest fire up R. Do this. And then we'll do some exercises. The exercise that I want to do is that one of the very common questions that you get from your purely empiricist friends will be how big should my sample be from a population. Here, we have a sample of 100 individuals. That's a lot.
Could I get by with 16 or 20? And what would be my confidence in allele frequencies with smaller sample sizes? Well, you now have an equation for the beta distribution for allele frequencies, where you can just change your sample size by modifying that 160 and the 40 over there and see what effect that has on the width of that distribution.
So in addition to plotting this, play with some different sample sizes to look at the extent to which the distributions change in size. And then we'll talk a little bit more about this after the break. Does that seem OK?
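One way to sketch the sample-size exercise, keeping the same 80/20 allele ratio and the Beta(1, 1) prior (the specific sizes below are just example choices, not from the lecture):

```r
p <- seq(0, 1, by = 0.001)

# Numbers of gene copies sampled: e.g. 10, 25, and 100 diploid individuals
sizes <- c(20, 50, 200)

# Overlay the posterior for each sample size, with 80% of draws in one class
plot(p, dbeta(p, 0.8 * 200 + 1, 0.2 * 200 + 1), type = "l",
     xlab = "p (allele frequency)", ylab = "density")
for (n in sizes[-length(sizes)]) {
  lines(p, dbeta(p, 0.8 * n + 1, 0.2 * n + 1), lty = 2)
}

# Width of the 95% credible interval shrinks as the sample grows
sapply(sizes, function(n)
  diff(qbeta(c(0.025, 0.975), 0.8 * n + 1, 0.2 * n + 1)))
```

The interval widths make the trade-off concrete: each posterior narrows as more gene copies are sampled, which is the comparison to show an empiricist asking how big their sample needs to be.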
Alex Buerkle, associate professor of evolutionary genetics at the University of Wyoming, gives an overview of probability basics for models of SNPs and genotypes, as part of a population genomics workshop at Cornell University, July 23-24, 2013.
Population genomics involves sampling, financial, and bioinformatics trade-offs, so proper experimental design requires understanding probability, sequencing technologies, and evolutionary theory, and how they relate to research trade-offs. The workshop, "Next Generation Population Genomics for Nonmodel Taxa," explored the strengths and weaknesses of different approaches to genome sequencing and bioinformatics when studying population genomics in nonmodel species.
Go to the workshop website for information associated with these videos including lecture notes, descriptions of exercises, and computer code. This website is a site for on-going learning about methods for population genomics.