Chapter 8.5: Fine-Mapping (Video Transcript)
Title: Introduction to fine-mapping methods
Presenter(s): Hilary Finucane, PhD (Broad Institute of Harvard and MIT)
Sarah [Host]:
Good morning, everyone, and welcome to the MPG primer for today. It’s 8:30, so we’ll go ahead and get started with the introductions. So, this is our penultimate primer for the season, and we are very happy to have Dr. Hilary Finucane today to speak to us about fine-mapping methods. Her background includes a Bachelor’s in math from Harvard. She then followed that up with a Master’s in theoretical computer science, and then went on to complete a Ph.D. in applied math at MIT. She was selected for a very prestigious NIH Director’s Early Independence Award and has been doing wonderful work here at the Broad. She’s now co-director of the program in Medical and Population Genetics and she’s also an assistant investigator at the Analytic and Translational Genetics Unit at MGH and is about to be an assistant professor at HMS. And we are so thankful to her today for sharing this presentation with us. She’s happy to take questions and has natural pauses built into her talk, but I will also keep an eye on any raised hands and Q&A, and so we welcome your participation. Thank you very much.
Hilary:
Thanks very much, Sarah, for that lovely introduction, and hi, everyone. I’m happy to be talking today about Bayesian fine-mapping methods. And this isn’t going to be a comprehensive review. I’m going to try to give an overview of some of the main ideas in the field, but as Sarah said, I’m very happy to take questions as I go, and Sarah will be moderating those questions.
So let me start by talking about the context for fine-mapping. So in a genome-wide association study, we see often these days many genome-wide associated regions. So here’s an example of a Manhattan plot from the 2014 schizophrenia GWAS, where every green diamond is a genomic locus that has passed genome-wide significance. And that naturally invites the question, what’s actually going on in the locus? And there are a lot of questions that we can ask about a particular locus.
That can mean a lot of things, and what I’m going to focus on now is, what are the actual variants that are driving the association at the locus? And so typically, when we zoom in on a locus, we might see something like this. So here we’ve got genomic coordinates on the x-axis and then the level of significance on the y-axis. And this is an example from Hailiang Huang’s IBD analysis. And what we imagine is going on is that there’s actually a simple underlying causal structure, where maybe there are only two causal variants in the locus, and it’s only because of patterns of LD, and then the noise due to finite sample size, that we see all of these many variants coming up as associated in this way. So, the goal of statistical fine-mapping is to take the GWAS data that shows this complex association at the locus and to try to disentangle it and figure out what’s the actual simple story that’s underlying it. What are the causal variants that are underlying this association?
And so why might we want to do something like this? Well, one reason is if we’re interested in genes: If we can identify the causal variants, these variants sometimes implicate genes. For example, the variants may be coding variants that directly implicate a gene, or they may be regulatory variants that we can then tie to a gene. So, fine-mapping can often help us with this goal of finding causal genes. And another reason might be, even once we’ve got the name of the gene, we want the variant-to-gene mechanism. And that, for example, might enable us to do an experiment that more realistically recapitulates the disease-relevant biology than knocking out the gene altogether. Then there’s another set of reasons having to do with the genetic architecture. So, for example, by looking at many fine-mapping results across many loci or by building models that are based on fine-mapping models, you might be able to do enrichment analyses - which types of variants tend to be associated or causal for disease? Moving from association to fine mapping can also enable cross-population and cross-trait comparisons and has the potential to be particularly useful in prediction. And so there are a lot of things that we’re trying to do that become easier once we have some model that lets us get not just association, but rather to make some inference about causal structure.
And so today, I’m gonna focus mostly on different aspects of statistical methods for fine-mapping. This is the outline: I’ll start by talking about posterior inclusion probabilities and credible sets, and then I’ll go through a few different methods points, and then I’ll close with some thoughts on evaluating fine-mapping methods. And, I’ll pause after each section here, and so maybe I’ll just start by pausing after that brief introduction if there are any questions so far.
Great, so then let me continue with PIPs (posterior inclusion probabilities) and credible sets. What are these kind of basic concepts?
So, our goal in fine-mapping is to recover the causal variants, but of course, we can’t always with precise accuracy and perfect confidence recover exactly what the causal variants are.
And so, what does the output of a fine-mapping algorithm typically look like? Well, there are two aspects that I’ll focus on here. We’ll take each variant in the locus, and then we can plot it now, with the y-axis being the posterior inclusion probability. So, each variant gets a PIP, and then we can also identify sets of variants called credible sets. Here, one credible set is red, and one credible set is blue. So, what are PIPs and what are credible sets?
The posterior inclusion probability for a variant is the posterior probability that the variant is causal, and this, of course, is according to the model. So, once you’ve bought into all of the assumptions of your model, then the PIP reflects the probability that the variant is causal. And so, a PIP of 1 would be the most confident you can possibly get, and then as the PIP gets lower, that means you’re less and less confident that this is a causal variant driving the signal. And this has a couple of different names; posterior inclusion probability is the most standard one that I’ve seen, but some people call this posterior probability of causality, or you may see other acronyms in the literature.
Then, a credible set (typically we talk about 95% credible sets) is a set of variants that contains a causal variant with at least 95 percent probability. And this has also been defined in some alternative ways in some places in the literature, but this is now, to my understanding, the most standard use.
And so, if we go back and look at this particular locus, you can see that the blue credible set is a set of variants that contains exactly one variant, and that variant has a very high PIP. So that means that there’s one signal that’s been really resolved very well. The blue credible set says, “I think that one of the causal variants is here, and I’m pretty confident about it.” And then there’s a red credible set, so that means there’s a second causal variant. “I think it’s one of these five red variants. I’m not quite sure which of the five, and my posterior inclusion probability is going to quantify exactly what do I think is the probability that each one of these variants is the causal variant for this second signal”. And so you can think of each credible set as corresponding to one putative causal variant, and it’s reflecting the uncertainty around which variant is that actual putative causal variant.
So typically, when we think about fine-mapping methods, what we’re interested in is getting a PIP per variant in the locus, and then a credible set, each one of which reflects one causal variant and the uncertainty around where that causal variant might be.
So, let me again ask if there are questions so far on PIPs and credible sets. Okay, so then with that, I’ll dig into some of how do we actually try to compute these PIPs and credible sets, and I’ll start with the case of single causal variant fine-mapping. So you can imagine you’ve done a GWAS, you’ve got a particular locus you’re interested in, you’ve got the data on the locus, and I’ll discuss later on whether by that, I mean summary statistics and LD or genotypes and phenotypes. And now what you’d like to get are some PIPs and some credible sets, and you have a choice now: which is, are you going to figure there’s probably only one causal variant in the locus, or there may be multiple causal variants in the locus? And there’s increasingly good evidence in the field that many loci harbor multiple causal variants, and so that’s going to be an important point, but single causal variant fine-mapping is very robust and statistically straightforward, and it’s also a building block for a couple of the different multiple causal variant methods. So first, I’m going to talk about single causal variant fine-mapping.
So here we have our locus, and we’d like to know what’s the PIP for each variant, and then there’s only going to be one credible set here because we’re assuming one causal variant. And so, which variants should we put into our credible set? The PIPs now, these PIPs will sum to one. We’re saying there’s actually one causal variant; we just don’t know which one it is. And so now I’m going to talk in a little bit of technical detail for a few slides on how we actually go about doing this.
So, what is the PIP at SNP J? Let’s start by computing the PIP at SNP J, and we can write this as the probability under our model that SNP J is causal given the data that we have. And we’re being Bayesian, so let’s say that we have a flat prior on which variant is causal. Then Bayes’ rule allows us to rewrite this probability as the probability of the data given SNP J is causal, divided by the sum over all variants in the locus of this probability of the data given that the variant is causal. This is a pretty straightforward application of Bayes’ rule.
Then, a trick comes in: in order to make the computation easier, let’s just divide both the numerator and the denominator by this null probability - the likelihood of the data under a null model in which none of the variants is causal.
And now we can call this new quantity that we’ve got a Bayes factor. So, the Bayes factor is the likelihood of the data given that the variant K is causal, divided by this null probability. And this just allows us to rewrite our PIPs as the Bayes factor for SNP J, divided by the sum over all variants of the Bayes factors. And the reason that this is a nice thing is because this Bayes factor turns out to be pretty simple to compute.
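Written out, the steps just described look roughly like this. The notation is an assumption made for this sketch and is not taken from the slides: D is the data at the locus, M is the number of variants, c is the index of the single causal variant, and BF_k is the Bayes factor for variant k.

```latex
\begin{align*}
\mathrm{PIP}_j
  &= P(c = j \mid D)
   = \frac{P(D \mid c = j)\, P(c = j)}{\sum_{k=1}^{M} P(D \mid c = k)\, P(c = k)} \\
  &= \frac{P(D \mid c = j)}{\sum_{k=1}^{M} P(D \mid c = k)}
     \qquad \text{(flat prior: } P(c = k) = 1/M \text{ for every } k\text{)} \\
  &= \frac{P(D \mid c = j) \,/\, P(D \mid \text{null})}
          {\sum_{k=1}^{M} P(D \mid c = k) \,/\, P(D \mid \text{null})}
   = \frac{\mathrm{BF}_j}{\sum_{k=1}^{M} \mathrm{BF}_k}
\end{align*}
```

The flat prior cancels in the second line, and dividing top and bottom by the same null likelihood in the third line leaves the ratio unchanged.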
So, Maller et al. showed that you don’t actually need to model all of the data at the locus to compute the Bayes factor for a single variant. You only care about what the genotypes are at that particular variant. And then Wakefield and others showed that this Bayes factor can, in fact, be computed or approximated, depending on the model that you’re fitting, from summary statistics. And so, in computing this Bayes factor, you can just go one variant at a time and compute a pretty straightforward transformation of the summary statistics that you’ve seen. So, in particular, this doesn’t depend on LD at all and is a linear-time computation. So, this is how for single causal variant fine-mapping you might compute PIPs.
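As a minimal sketch of that per-variant computation, the code below uses one common form of the Wakefield-style approximate Bayes factor. It assumes each estimated effect is normally distributed around the true effect with variance equal to the squared standard error, and it assumes a normal prior on the effect size. The function names and the value `prior_var=0.04` are illustrative choices for this sketch, not something specified in the talk.

```python
import math

def log_abf(beta_hat, se, prior_var=0.04):
    """Log approximate Bayes factor (variant causal vs. null) for one variant.

    Assumes beta_hat ~ N(beta, se^2) and a N(0, prior_var) prior on beta.
    prior_var is an illustrative value, not a recommendation.
    """
    v = se ** 2
    shrink = prior_var / (v + prior_var)   # W / (V + W)
    z = beta_hat / se
    return 0.5 * math.log(1.0 - shrink) + 0.5 * shrink * z ** 2

def pips_from_log_bfs(log_bfs):
    """PIPs under a flat prior with exactly one causal variant."""
    top = max(log_bfs)                      # subtract the max to avoid overflow
    weights = [math.exp(x - top) for x in log_bfs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy summary statistics for three variants (made-up numbers).
betas = [0.10, 0.09, 0.02]
ses = [0.02, 0.02, 0.02]
log_bfs = [log_abf(b, s) for b, s in zip(betas, ses)]
print(pips_from_log_bfs(log_bfs))
```

Each Bayes factor uses only that variant's own effect estimate and standard error, so no LD matrix appears anywhere, and the work is linear in the number of variants.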
And how about credible sets? Well, let’s first remember how we defined a credible set. S is a set of variants that we’ll call a 95% credible set if the probability that it harbors the causal variant (because we’re in single causal variant land) is at least 95 percent. So, we now have a probability for each one of our variants that it’s the causal variant, and we want to know which variants should be put together so that we are covering at least 95% of the probability space.
And because we are making the single causal variant assumption, the probability that the causal variant is in S is just the sum of the PIPs of the variants in S. So, to construct the smallest 95% credible set, we can just add the variant that has the highest PIP and then add the variant that has the second-highest PIP, and just keep on going until our PIPs sum to at least 95%. And I should note, there are a lot of different ways to construct credible sets. You could always just throw all of the variants in, and that’ll sum to more than 95 percent. So usually, the goal is to construct the smallest possible credible set because what you’d like is to have as much resolution as possible and to be able to say, “We really narrowed down our signal to as few variants as possible.”
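The greedy construction just described can be sketched in a few lines. The function name `credible_set` and its return format are assumptions for illustration, and the code assumes the single causal variant model, so the PIPs sum to one.

```python
def credible_set(pips, coverage=0.95):
    """Smallest set of variant indices whose PIPs sum to at least `coverage`.

    Assumes a single causal variant, so the PIPs sum to 1 and the
    probability that the set holds the causal variant is the sum of its PIPs.
    """
    # Visit variants from highest PIP to lowest.
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += pips[i]
        if total >= coverage:
            break
    return chosen, total

# Made-up PIPs for four variants.
members, covered = credible_set([0.04, 0.70, 0.06, 0.20])
print(members, covered)
```

Taking variants in decreasing PIP order is what makes the set as small as possible: any other set with the same number of variants covers no more probability.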
Host: Then the question is asking what values are part of the flat prior and what assumptions are made in order to calculate that flat prior?
Hilary: Absolutely, absolutely. So, when I say flat prior, yeah, I should have clarified this better. What I mean is a flat prior over which variant is causal, which is also something that I’ll come back to. So, the flat prior here is saying a priori, before I’ve seen the GWAS data in my locus at all, I’m going to say that every variant is equally likely to be causal. There’s another prior that has to be defined that has to do with what’s the effect size of each variant in the locus, and there you do have to specify it. There are different ways that different folks do that, and it turns out that that might actually be pretty important, but for the sake of time, I’m leaving that out of this particular presentation. And so here, in order for what I said on this slide to hold, what you need is for the prior on which variant is causal to be uniform across the different variants. Does that answer the question?
Host: Yes, there is a follow-up one, whether a single causal variant is a prior that there is only one or no causal variant, or a constraint?
Hilary: In this case, it’s a constraint. So, in this case, when I say single causal variant fine mapping, what I mean is the model that you write down says there is exactly one variant, and it’s going to be one of these. Under the prior, if you have M variants, your probability is 1 over M that your first variant is causal, and it’s 1 over M that your second variant is causal, and that sums to 1 across the whole locus. In subsequent work that I’ll talk about in the next section, we put priors on the number of causal variants, and those might up-weight or down-weight, well those seem to up-weight sparse solutions like single causal variant solutions. But in this case, there’s a hard constraint: there is only one causal variant at the locus.
Host: And does LD structure affect the PIP? There’s a lot of questions coming in, great!
Hilary: Excellent, great! No, that’s kind of the magical thing about single causal variant fine mapping. This was first shown in 2012 in this Maller et al. paper. For one single causal variant, there are a couple of different ways to see it, and if I had a whiteboard, then I would show some of them. But the fact that these Bayes factors, that you can actually compute the probability of all of the data given that a SNP is causal divided by the probability of the data under the null model, that no longer depends on all of the other variants in the locus. You can see this, for example, if you’re looking at a linear model or a standard model for quantitative traits that you usually write down. You can actually write down the normal likelihoods and watch things cancel, and then a bunch of stuff disappears, and you wind up with something pretty simple. But there are also probabilistic arguments in both Maller et al. and Wang et al. that show that whether you’re conditioning on X or consider X to be part of your data, you actually get this canceling, and so your Bayes factors don’t depend on any variant except the variant that you’re computing the Bayes factor for. I think that’s part of why people like single causal variant fine-mapping so much; it means there’s no way to misspecify your LD, and it’s super simple and straightforward.
Host: I think a related question, just to finish up, is whether there are any other methods that prefer proximity. So, if you have a clustering of variants instead of a single variant, if that adjacency is considered in any alternative models.
Hilary: Interesting! So the question there is, now you’re modeling multiple causal variants, and you want to put a prior that your causal variants are likely to be close together, but you don’t want to up-weight or down-weight any particular variant, is that right?
Host: Well, so that was my interpretation of the question, but I’ll read the question which was, “Does your candidate selection require that variants are adjacent, or is there a method that prefers proximity?”
Hilary: Ah, so this is about credible sets now. With credible sets, there’s nothing explicit about adjacency. I think that typically, if you have a single causal variant, then the variants that have the highest PIPs are going to tend to be in LD with each other. So typically, credible sets tend to consist of variants that are in at least a medium amount of LD with each other. This can even be used as a diagnostic in some methods. If your credible set contains a bunch of variants that are in very loose LD with each other, then there’s a sense in which things didn’t work, and you should become suspicious. So I would say that if the model is well-specified, then you might expect a credible set to consist of variants that are in LD with each other, but there’s nothing explicit here that enforces that.
Host: Thank you so much!
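The diagnostic mentioned in that answer (checking whether the variants in a credible set are in reasonably high LD with each other) can be sketched as follows. This is an illustrative helper rather than any particular tool's implementation: the name `min_abs_ld`, the list-of-lists LD matrix, and the 0.5 cutoff in the usage lines are all assumptions.

```python
def min_abs_ld(ld, members):
    """Smallest absolute correlation between any two variants in a set.

    ld: square matrix (list of lists) of pairwise correlations r.
    members: indices of the variants in the credible set.
    A set with a single variant is treated as perfectly coherent.
    """
    pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
    if not pairs:
        return 1.0
    return min(abs(ld[a][b]) for a, b in pairs)

# Made-up LD matrix for three variants.
ld = [[1.0, 0.9, 0.1],
      [0.9, 1.0, -0.2],
      [0.1, -0.2, 1.0]]
for cs in ([0, 1], [0, 1, 2]):
    flag = "ok" if min_abs_ld(ld, cs) >= 0.5 else "suspicious"  # arbitrary cutoff
    print(cs, flag)
```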
Hilary: So, to recap, how might you do the single causal variant fine mapping? Well, first, you can take your summary statistics and compute approximate Bayes factors, transform these into PIPs and then compute credible sets from your PIPs.
One nice thing about single causal variant fine-mapping is that it also allows us to build some intuition about some basic concepts in fine-mapping. So, one thing that we might be very interested in is “what factors affect our ability to fine-map effectively?”. We’re happy if we get a few variants with high PIP and other variants with low PIP, and that means we’ve really been able to zoom in on the causal variants. Another way to think about “power” (in quotes because it’s a very frequentist term in statistics) is intuitively, we’re trying to say, with what confidence have we been able to identify these causal variants? And you can imagine that if there’s a lot of LD in your locus, then it’s going to be harder to identify the causal variant. You know, in the extreme, if you have two variants in perfect LD, then it doesn’t matter what your sample size is or what your algorithm is. You’re never going to be able to tease apart which of those variants is causal without bringing in some extra information. The less LD there is in the locus, the easier it becomes to kind of tease apart which variant is causal and which variants are non-causal. Similarly, as with GWAS, both sample size and effect size are very important for being able to confidently zoom in on a small number of most likely causal variants.
In this work by Schaid et al., the authors wrote down an approximate expected PIP at a causal SNP under a simplified model. And so here’s an example: you can imagine you have a locus with ten SNPs. All SNPs have equal LD; they’re correlated to each other at level R. There’s a single causal SNP that explains 1% of the variance in your phenotype. The authors wrote down an analytic expression for roughly under this scenario what would you expect the PIP of the causal variant to be, and they created this figure.
So, here high values are good, that means we were able to narrow in with a lot of confidence on the causal variant. You can see that on the x-axis, as the amount of LD among variants in the locus increases, you’re less and less confident that the causal variant is actually causal. The colored lines show how, as you increase your sample size, you’re more and more confident. Being able to get this kind of quantitative sense of what’s the trade-off between LD and sample size as you’re trying to zoom in on particular causal variants can be a useful way to build an intuition. And one comment I want to make here is that when we think about cross-population fine-mapping, one reason that it can be particularly effective to combine information across multiple populations in fine-mapping is because it changes the LD structure. The relevant LD is related to the average LD between the two populations. So, if you compare, let’s say, the same sample size, but you can choose to have it either all in one population or all in another population, or half-and-half in two populations, then because there are differences in LD structure between the two populations, combining across populations can help you move to the left in this plot, which is, as you can see, a good way to also move up, which means you’re more confident in the causal variant.
So, that’s an overview of single causal variant fine-mapping. Now, I’ll give kind of a high-level introduction to multiple causal variant Bayesian fine-mapping, and maybe I’ll pause one more time for questions. We had some in the middle, but just because I’m at the outline slide again: are there other questions?
Great, so we know that there’s often not just a single causal variant in a locus, and so that’s usually not an assumption that we’d like to hard-code, and especially as our sample sizes increase, this becomes more and more relevant and is reflected more and more clearly in the GWAS data that we see. So now, if we think about multiple causal variant fine mapping, there are two main approaches. The first one is to say, “Okay, there are multiple causal variants. Let’s split our locus up in some way, and then apply single causal variant fine-mapping because that’s a really robust tool that we can use.” So, then how does this typically work? What does it mean to split the locus up? There are a lot of different ways to do this.
One standard way is conditional analysis, and so here’s a figure describing conditional analysis. Let’s say that this is your locus in the top left here, and in conditional analysis, you take the top signal, and then you include the genotypes at that variant as a covariate in your association, and if that variant is in high LD with a causal variant, and there’s only one causal variant, then that variant explains all of the other associations in the locus, and so by conditioning on that variant, you get this, you know, you kill all the signal, and you get this null pattern here. So, if there’s a single causal variant and if the top variant is in high LD with that causal variant, then conditioning on the top variant will kill all of your signal.
On the other hand, if you’ve got two causal variants, then conditioning on the top variant is unlikely to kill all of your signal. And in particular, if that top variant is in high LD with a causal variant, then after you’ve conditioned on it, there’s a sense in which you’ve, you know, accounted for the effect of that causal variant, and now you’ve got a locus that’s got one fewer causal variant than before. So, you can iterate this and then get this set of index SNPs. Conditional analysis is one commonly used way to break complex loci into multiple signals. Then, one way you might use single causal variant fine-mapping would be to fine-map each of these signals conditioning on the others. So, once you’ve got all of the index variants that you got by conditional analysis, you include all but one as covariates and apply single causal variant fine-mapping, and then repeat that, excluding each signal one at a time.
So, that’s a commonly used type of approach, conditional analysis, and it has some limitations. One limitation is that there’s no guarantee that your top variant is in high LD with a causal variant. So here’s an example from the SuSiE paper where they did a simulation where SNP 1 and SNP 2 are the causal SNPs, but because the yellow SNP tags both of the red SNPs, it comes out as most associated, even though it’s not in particularly high LD with either one of these causal SNPs. This would be a case where if you did conditional analysis, you start by conditioning on the yellow SNP, but that wouldn’t properly kill either of your signals because this assumption that your top variant is in high LD with a causal variant is violated in this particular case. So, conditional analysis is one approach, but examples like this motivate instead writing down a Bayesian model to jointly model the effects of multiple variants at the same time.
And that’s what I’m calling, you know, approach number two: “How might we jointly model multiple causal variants in one Bayesian model for the locus?”
Host: Hilary, there are two questions about that last approach. What do you mean by top variants? Is that defined by the GWAS score?
Hilary: Yes, yes, sorry, by marginal significance.
Host: And then when you iterate for variants conditionally, there’s an assumption that it’s not done manually. What’s the process like in sorting out hits?
Hilary: So there’s a software to do this, and it’s pretty automatic. Right, at each step, you want to take the most significant variant. So typically, the kind of manual part is, you have to decide when you’re gonna stop, and that’s often done by setting a threshold on significance. At what point are you going to say you’ve killed all of the signal? And so you take the most significant variant and you include it as a covariate. If any variant passes whatever your predetermined level of residual significance is, then you’ll do that again. You’ll take the most significant variant, condition on it, and then iterate. And then you consider yourself done when no variant passes your predetermined level of significance. Does that answer the question?
Host: I think so.
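The stepwise procedure described in that answer can be sketched on individual-level data as below. This is a toy illustration under stated assumptions, not the software referred to in the talk: genotypes are given as one list of values per variant, the only covariate is an intercept, significance is judged by an approximate z statistic, and the default `z_threshold=5.45` (roughly the conventional genome-wide level) and `max_signals=10` are arbitrary choices. Conditioning is done by residualizing the phenotype and every variant on each selected index SNP, which gives the same test as including that SNP as a covariate.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residualize(v, on):
    """Remove from v its projection onto the vector `on`."""
    denom = dot(on, on)
    if denom < 1e-12:
        return v[:]
    coef = dot(v, on) / denom
    return [x - coef * o for x, o in zip(v, on)]

def z_score(x, y, n_conditioned):
    """Approximate z statistic for residualized phenotype y on variant x."""
    sxx = dot(x, x)
    if sxx < 1e-12:                  # variant already explained by index SNPs
        return 0.0
    beta = dot(x, y) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    dof = len(y) - 2 - n_conditioned  # intercept, this variant, index SNPs
    sigma2 = dot(resid, resid) / dof
    if sigma2 <= 0.0:
        return math.copysign(math.inf, beta)
    return beta / math.sqrt(sigma2 / sxx)

def conditional_analysis(genotypes, phenotype, z_threshold=5.45, max_signals=10):
    """Return indices of index SNPs found by stepwise conditioning."""
    cols = [center(g) for g in genotypes]
    y = center(phenotype)
    index_snps = []
    while len(index_snps) < max_signals:
        zs = [z_score(x, y, len(index_snps)) for x in cols]
        top = max(range(len(cols)), key=lambda j: abs(zs[j]))
        if abs(zs[top]) < z_threshold:   # nothing passes: signal is "killed"
            break
        index_snps.append(top)
        lead = cols[top]
        # Condition on the new index SNP for all later steps.
        y = residualize(y, lead)
        cols = [residualize(x, lead) for x in cols]
    return index_snps
```

The only manual decision is the stopping threshold, as noted in the answer. The limitation discussed earlier still applies: if the top variant at some step is not in high LD with a causal variant, the index SNPs returned here can be misleading.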
Hilary: Great, so then I’ll move on to how we might jointly model multiple causal variants. So here, let’s start by analogy to single causal variant fine-mapping, but here, instead of one variant, we’re gonna look at sets of variants. So let’s let Sj be a set of variants, and we want to know what’s the probability that this set of variants is causal given the data, and we can again apply Bayes’ rule and start to try to compute some likelihoods, but we get stuck very quickly.
And the reason is before, we were only summing over variants in the locus, and so we could say, like, what is the space of all things that could possibly happen? Well, variant one could be causal, variant two could be causal, variant three could be causal, and so on. There are only as many possible choices as there are variants. But now, what’s the space of all things that could possibly happen? Well, variant one could be causal, or variants 1 and 2 could be causal, or variants 1, 3, and 10 could be causal. And so now, if you want to just naively apply Bayes’ rule, you’re summing over all possible configurations of causal variants, and that’s large, two to the size of the locus. So, that’s way too many terms to be tractable. There are a number of different methods to do joint modeling of multiple causal variants, and each one of them approaches this challenge differently. CAVIAR, which to my knowledge was the first work to write down this model in this way, limits the maximum number of causal variants and is typically applied to smaller loci. Once you limit the maximum number of causal variants, then that limits the total number of configurations as well in a pretty direct way. And then there are methods such as FINEMAP and DAP-G that sum over what their algorithm thinks are the most likely configurations. And then, more recently, the SuSiE method takes a different approach based on variational inference, for those of you who know what that is. It’s analogous to iterative conditional analysis, where instead of just doing conditional analysis once through the locus, they then go back and redo the conditional analysis multiple times until convergence. This has some nice theoretical properties as well. So, this isn’t a comprehensive overview of multiple causal variant fine-mapping, but just to give a sense that when you want to do joint modeling of multiple causal variants, there’s kind of a fundamental challenge to the first way we would think of doing it.
There’s been a series of really nice work making that more and more efficient in these different methods and in other works.
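To get a feel for the size of the configuration space, the short sketch below counts subsets of variants, with and without a cap on the number of causal variants. The function name and the example locus sizes are illustrative assumptions, and the count includes the empty (null) configuration so that the uncapped total matches "two to the size of the locus".

```python
import math

def n_configurations(n_variants, max_causal=None):
    """Number of causal configurations (subsets of variants) at a locus.

    With no cap this is 2 ** n_variants; with a cap K it is the number of
    subsets of size 0 through K.
    """
    if max_causal is None:
        return 2 ** n_variants
    return sum(math.comb(n_variants, k) for k in range(max_causal + 1))

for m in (10, 100, 1000):
    capped = n_configurations(m, max_causal=3)
    digits = len(str(n_configurations(m)))
    print(f"{m} variants: {capped} configurations with at most 3 causal, "
          f"versus a {digits}-digit number with no cap")
```

Capping the number of causal variants turns exponential growth into polynomial growth, which is why a cap makes exhaustive enumeration feasible for small loci.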
I’m not going to go into the details of exactly how these different methods work; though that’s something that I find very interesting. Instead, I’m going to touch on two other method topics, and one of them is functionally informed fine-mapping. So let me pause again for questions before I move on to functionally informed fine mapping.
Host: There is one question: Do you need to take into account effect size when you do this? Either assume effect size of each causal variant is the same or weight causal variants by effect size?
Hilary: Yes, that’s a really subtle point that the different methods deal with differently. The usual way to do it is to put a prior on effect size and then integrate over that prior. The question is, how do you figure out what the prior should be? Some methods do this by having the prior be a mixture of normals or learning the prior from the data. In some cases, it’s shared across all variants, and in some cases, it’s different for the different variants. So, that’s an important point that different methods deal with differently.
Host: Thank you.
Hilary: So let me, sorry, is there another question? I might be looking at the wrong place.
Host: Just popped up. Um, you mentioned that there’s evidence that there are multiple causal variants for GWAS loci, and just curious as to which studies have confirmed that?
Hilary: Yeah, there’s a couple of different ways to see that, I guess. I mean, one way to see that is if you look at the applications of multiple causal variant methods that then give you a posterior on how many variants there are, then that posterior is often concentrated away from one. Another way to see that is doing conditional analysis. If there’s a single causal variant, then conditional analysis should kill your signal pretty well, and it very often doesn’t. Another is, depending on how you define your locus, sometimes you can just look at the locus zoom plot, and it’s pretty clear that there’s more than one signal. For example, if you’ve got variants with a high marginal effect that are in low LD with your top variant, that’s not really consistent with a single causal variant at the locus. There’s been some work on estimating amounts of allelic heterogeneity from Farhad et al. and [indistinguishable], where they try to, you know, model this specifically. But I’d say that the fact that it often happens that there are multiple causal variants seems to be something that you can see in a lot of different ways. And then the question of how often and how many causal variants, I think, is a much subtler and more difficult thing to get at.
Host: Thank you! One more just popped up, yes. So, CAVIAR, SuSiE, and DAP-G each use different models. Is there a way to judge a priori which method best suits a user’s data?
Hilary: That’s something that I’ll get into towards the end: evaluating fine-mapping methods. In my opinion, one of the things that this field really needs more of is benchmarking in realistic settings, and so I’ll talk a little bit about that at the end. But you can also base it a bit on intuition based on just the assumptions that the methods make. But I think actually, rather than go into that, I think that more empirical evaluation is really needed. A common thing is also to apply more than one method and then when they agree, to have more confidence. That’s something that our group has done, where we apply both FINEMAP and SuSiE, and then one way of evaluating the methods is to look at functional enrichments of the variants that get prioritized by these two different methods. And if you look at the enrichment when they agree versus the enrichment when they disagree and you go with either method, then you can see much stronger functional enrichment at the loci where the two methods agree than where they disagree, but in our hands, at least, they mostly agree, which is, I think, a good sign.
Host: Thank you.
Hilary: Okay, so I got a question earlier about flat priors, and what I was saying was the methods that I’ve described so far assume that before you look at the GWAS data in the locus, you think every variant is equally likely to be causal. But intuitively, of course, that’s not the case. If you haven’t looked at your GWAS data yet, you just know which variants are in the locus, but some of them are coding and some of them are non-coding. Then a coding variant is more likely to drive disease than a non-coding variant.
And because we’re doing Bayesian analysis here, that can be incorporated into a prior. So, a functionally informed prior is one where you take into account the functional annotations of a variant to up-weight and down-weight certain variants according to which ones are more or less likely to be causal a priori. And then the question is, how do you set that prior? Do you have to just kind of trust your own intuition that, I don’t know, enhancer variants are five times more likely than other non-coding variants to be causal? One way to get around this question is to learn the prior from the data. So, for a lot of the methods that I’ve described so far, if you just want to say what the prior is, that can actually be done pretty simply. What makes this difficult is learning from the data, by looking across many loci, what prior would make sense to set. And so now what you’d like to do is say, “Okay, I’ve got several different loci. I’m going to fine-map them simultaneously, but I want to learn, by looking at these loci, how much enrichment they are consistent with.” And different methods, again, have done this in different ways. fgwas is a functionally informed single-causal-variant fine-mapping method, PAINTOR allows for functionally informed fine-mapping with multiple causal variants, and CAVIARBF allows for many annotations in a multiple-causal-variant framework. And most recently, PolyFun leverages polygenic enrichment, using stratified LD score regression.
And so, just to give an example of how this works, I’ve pulled a figure from the PolyFun paper. If you first focus only on the squares, the squares here reflect the PIPs that are not functionally informed. And if you look only at the squares, what you can see is that none of the PIPs is bigger than 0.4, and rs288326, the red square, gets a PIP that’s, you know, somewhere below 0.4. But that particular variant turns out to be nonsynonymous. And so the functionally informed fine-mapping results, which are displayed as circles here, up-weight that variant in the prior. And then, if you look at the posterior inclusion probability, or the posterior causal probability, you can see that incorporating this functional information has bumped up that nonsynonymous variant to a posterior probability closer to one, which might match our intuition better, given the combination of the data together with our understanding that this is a nonsynonymous variant.
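As a rough illustration of how a functionally informed prior can move a PIP, here is a minimal Python sketch under the single-causal-variant assumption. It is not the PolyFun method: it uses Wakefield-style approximate Bayes factors, and the z-scores, the effect-size prior variance `w`, and the 20-fold up-weighting of the nonsynonymous variant are all made-up values chosen for illustration. In a real analysis the enrichment would be learned across loci rather than hard-coded.

```python
import numpy as np

def log_abf(z, v, w=0.04):
    """Log approximate Bayes factor (Wakefield-style) for each variant.

    z: marginal z-scores; v: sampling variance of each effect estimate;
    w: prior variance of the causal effect size (an assumed value).
    """
    r = w / (v + w)
    return 0.5 * np.log(1.0 - r) + 0.5 * r * z**2

def single_causal_pips(z, v, prior_weights):
    """PIPs assuming exactly one causal variant: PIP_j is proportional to prior_j * BF_j."""
    prior = prior_weights / prior_weights.sum()
    log_post = log_abf(z, v) + np.log(prior)
    log_post -= log_post.max()          # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical locus: five variants in high LD with similar z-scores.
z = np.array([5.2, 5.1, 5.0, 4.9, 5.15])
v = np.full(5, 0.01)

# Flat prior: every variant equally likely to be causal a priori.
flat_weights = np.ones(5)
# Functionally informed prior: variant 1 is nonsynonymous, so it is
# up-weighted 20-fold (an arbitrary illustrative enrichment).
informed_weights = np.array([1.0, 20.0, 1.0, 1.0, 1.0])

pips_flat = single_causal_pips(z, v, flat_weights)
pips_informed = single_causal_pips(z, v, informed_weights)

print("flat prior PIPs:    ", np.round(pips_flat, 3))
print("informed prior PIPs:", np.round(pips_informed, 3))
```

With the flat prior the signal is spread across the correlated variants, while the informed prior concentrates most of the posterior on the annotated variant, which is the qualitative pattern described for the figure.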
So, this is, you know, an example of the kinds of ways that functional information can be incorporated into fine-mapping, and this has pretty clear advantages. For example, if your prior reflects true biology, then you’ll get a more accurate posterior. One disadvantage arises if you want to use functional information downstream, for example to evaluate your fine-mapping method. Sometimes it can be useful to be able to say, “My fine-mapping results haven’t incorporated the functional information yet,” so that you can then do, for example, enrichment analyses. But I think that, especially as these methods become more efficient and robust, as they have recently, this is going to be an important direction, and a very useful type of information to be incorporating into fine-mapping. So, are there any questions on functionally informed fine-mapping?
Great, so then, um, sorry, was that… sure.
Host: It was just a question about variants that might be in trans and how that complicates this analysis.
Hilary: Yeah, for sure, for sure. So, in order to do functionally informed fine-mapping, you need a set of annotations. What you’re taking advantage of is knowing how to characterize variants. If you don’t know how to characterize the variants, then you can’t take advantage of that anymore. So typically, you first start by writing down a set of functional annotations: here are my coding variants, here are my promoter variants. And one thing that’s different among the different methods is how many of those you can write down. But if something is regulatory in trans in a way that hasn’t been well characterized, or that you can’t work into your model, then yeah, that’s not something that you can take advantage of with these types of methods.
Host: And how specific is PolyFun to a particular cell type, disease, or phenotype, and can that be customized?
Hilary: So actually, I’d let Omar field this question, but in general, if you think about functionally informed fine-mapping, it again depends on which annotations get used. And so if you only incorporate annotations from a certain cell type, then it’ll be cell type-specific. My understanding is the default for PolyFun is not cell type-specific, and that it uses annotations that don’t correspond to a particular phenotype, which makes it pretty widely applicable to polygenic phenotypes where you can only pick enrichment estimates. I don’t know if Omar is on the call, but if he is, then he should feel free to chime in.
Host: And do those annotations include features like promoters and enhancers?
Hilary: Yeah, coding is just one example, but there’s, depending on which method you’re looking at, typically a large number of annotations that can be incorporated.
Host: And then this is testing the limits of my Zoom abilities, but Layla has a hand up.
Layla: That was an accident.
Host: Thank you so much. Great!
Hilary: Alright, so then maybe I’ll say a few words about summary statistics.
So many of the methods that I’ve described, and I haven’t been differentiating so far, rather than requiring your full genotype matrix and phenotype vector, can actually be run given only your LD matrix and summary statistics. This is convenient because, depending on what your sample size is and how you’re defining your loci, the LD matrix can be a bit smaller. But it’s particularly convenient if you can estimate patterns of LD from a reference panel. And I’ll get into that in the next slide, but let me first point out that this isn’t actually a coincidence. If we call our genotype matrix X and our phenotype vector Y, our LD matrix is then, up to normalization, X transpose X. And our summary statistics allow us to recover X transpose Y. X transpose X and X transpose Y are actually sufficient for beta in the linear model that most of these methods are based on. What that means is that X transpose X and X transpose Y statistically have all the information about beta that you would want to get from X and Y. And so the fact that there continue to be summary statistics-based methods is based on this very nice fact. As long as we’re starting from this Y equals X beta plus epsilon model, it’s going to be possible to do it from summary statistics. Although here, the only guarantee is if you have the actual X transpose X from your entire genotype matrix, so this is full, exact in-sample LD. And of course, it doesn’t apply to, you know, logistic regression and things like that. So the statistical guarantees come from in-sample LD, and the question is: when is it okay to use a subset of your samples, or LD that you’ve estimated from a different population?
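To make the sufficiency point concrete, here is a small Python sketch on simulated data. It computes a posterior mean for the effect sizes in the Y equals X beta plus epsilon model twice: once from the full X and Y, and once from only the in-sample LD matrix, the marginal effect estimates, and the sample size. The Gaussian prior, the known residual variance, and all simulated values are simplifying assumptions for illustration; actual fine-mapping methods use sparse priors, but with in-sample LD they depend on the data through the same quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6

# Simulated stand-in for a genotype matrix, with LD between variants 0 and 1.
X = rng.normal(size=(n, p))
X[:, 1] = 0.9 * X[:, 0] + np.sqrt(1 - 0.9**2) * X[:, 1]
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize each variant

beta_true = np.array([0.3, 0.0, 0.0, 0.0, 0.2, 0.0])
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()

sigma2, tau2 = 1.0, 0.1   # assumed residual variance and prior effect variance

# Route 1: full individual-level data, via augmented least squares
# (equivalent to the posterior mean under beta ~ N(0, tau2 * I)).
X_aug = np.vstack([X, np.sqrt(sigma2 / tau2) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
post_mean_full = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# Route 2: summary-level data only.
R = X.T @ X / n                                   # in-sample LD matrix
beta_marginal = (X * y[:, None]).sum(axis=0) / n  # per-variant regression estimates
XtX = n * R                                       # recovered X transpose X
Xty = n * beta_marginal                           # recovered X transpose Y
post_mean_summary = np.linalg.solve(XtX / sigma2 + np.eye(p) / tau2, Xty / sigma2)

print(np.round(post_mean_full, 4))
print(np.round(post_mean_summary, 4))
print("identical up to rounding:", np.allclose(post_mean_full, post_mean_summary))
```

The two routes agree only because `R` is the exact in-sample LD; replacing it with LD from other individuals breaks the identity, which is the reference-panel question.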
And so, Benner et al. have written about this particular question, and this is their schematic of what is the question that we’re asking here. So starting from the right, you can do fine-mapping from summary statistics and LD information. If your LD information comes from your GWAS data, traits, and genotypes, then that’s optimal. And then the question is, if you have a reference panel, then can it work to compute LD from the reference panel instead?
Their conclusion is that it depends on the size of the reference panel and the size of your GWAS. So as your GWAS gets bigger, you have to have a bigger and bigger reference panel, and of course, the population has to match as well. I think what they say is, for a GWAS of over 10,000 individuals, you need a reference panel of at least 1,000 individuals, or something like that. And then there’s this question of the population having to match as well. I haven’t seen much work exploring exactly how well it has to match: do you have to have chosen a perfectly random subset of the individuals you did your GWAS in, or is it okay to get the right continent, or is it something in between? And I think the fact that a small, perfectly matched subset doesn’t suffice means that as your GWAS gets bigger and bigger, you really have to be getting the LD very close to perfect. So I think that continuing to explore exactly in what situations reference panel LD is okay and gives accurate answers is something where more work would still be helpful, because, you know, if it did work, that would be very good.
So, that’s summary statistics versus full data. Now, moving on to my last few minutes. Oops, and it looks like I don’t actually have time to talk about evaluating fine-mapping methods. So maybe I’ll actually conclude there and just say that the high-level point about evaluating fine-mapping methods is that it’s important to try to break them in all of the ways that we think they’re broken. I’ll show you just this one slide.
Fine-mapping methods tend to assume that all the causal variants in the locus are modeled, that there’s no imputation noise, that you have somewhere between one and five, or one and ten, causal variants, that your phenotype is normally distributed conditional on your genotype, you know, things like that. And then typically, when fine-mapping methods are evaluated, all of these assumptions are satisfied in the evaluation. So one thing that my group has been working on, that we think is very important, is trying to find other ways to evaluate fine-mapping methods, both in simulations that might break some of these assumptions and also in real data analyses that can give us insight into what’s working and what’s not.
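One heuristic way to see why the required reference panel grows with GWAS size is sketched below in Python. This is not the analysis from Benner et al.; it is a toy simulation with assumed values throughout (Gaussian stand-ins for genotypes, a made-up LD pattern, and a hypothetical standardized effect size `beta`). The idea is that the error in a reference-panel LD estimate shrinks with panel size, while the z-score at a causal variant grows with the square root of the GWAS sample size, so a fixed LD error translates into a larger inconsistency, in z-score units, as the GWAS gets bigger.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
idx = np.arange(p)
# Assumed population LD: correlation decays with distance between variants.
R_pop = 0.9 ** np.abs(idx[:, None] - idx[None, :])
chol = np.linalg.cholesky(R_pop)

def sample_ld(n):
    """LD (correlation) matrix estimated from n simulated individuals."""
    G = rng.normal(size=(n, p)) @ chol.T
    return np.corrcoef(G, rowvar=False)

def ld_error(n_ref, n_rep=20):
    """Average largest entrywise gap between reference-panel LD and population LD."""
    gaps = [np.abs(sample_ld(n_ref) - R_pop).max() for _ in range(n_rep)]
    return float(np.mean(gaps))

ref_sizes = (100, 1_000, 10_000)
errors = {n_ref: ld_error(n_ref) for n_ref in ref_sizes}

beta = 0.05   # hypothetical standardized effect of a single causal variant
for n_gwas in (10_000, 100_000):
    z_causal = beta * np.sqrt(n_gwas)   # expected z-score at the causal variant
    for n_ref in ref_sizes:
        # A tag variant's expected z is r * z_causal, so an LD error of size e
        # shifts that expectation by roughly e * z_causal (sampling noise in z has sd 1).
        mismatch = errors[n_ref] * z_causal
        print(f"GWAS n={n_gwas:>7}, panel n={n_ref:>6}: "
              f"LD error ~{errors[n_ref]:.3f}, z mismatch ~{mismatch:.2f}")
```

Under these assumptions, a panel whose mismatch is small relative to the unit sampling noise for a modest GWAS can produce a mismatch of one or more z-score units for a much larger one.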
So, with that, I will conclude because we’re out of time. If there are any final questions, maybe I could take one.
Host: First, Omar wrote in and completely agreed with you that PolyFun can be customized, but isn’t by default. And then I’ll just take one question. I think this is an interesting one that I’ve actually been wondering about: summary statistics preserve privacy, but is there a way to publish the true underlying LD matrices, or approximations thereof, that will also preserve adequate participant privacy?
Hilary: I think that’s a super interesting thing to look into, and I don’t know the answer to that. I’m pretty sure that you can release approximate LD while preserving privacy, because approximate LD should be the same in different samples from the same population, but I’m not sure whether you can publish infinite-precision exact LD while preserving privacy. That’s not something I’ve worked on myself, and I don’t know of any work on that in particular. If someone else on the call does, they should chime in.
Host: Good, this was a wonderful session. Thank you so much, Hilary. This was our most interactive primer yet. Clearly a topic of great interest, very well presented. But thank you all, and we’ll see you in just a few minutes for the MPG session.