Chapter 8.8: TWAS (Video Transcript)

Transcriptome-Wide Association Studies

Title: Understanding GWAS mechanisms with Transcriptome-Wide Association Studies

Presenter(s): Sasha Gusev, PhD (Dana-Farber Cancer Institute, Harvard Medical School)

Sasha Gusev:

I am Sasha Gusev. This is my first time at CGSI, so thanks, everybody, for having me and giving me the opportunity to give this tutorial. As with the other ones, please feel free to interrupt or ask questions throughout, and I’ll try to sort of break things down in a way that’s accessible. I’m going to be talking about genome-wide association studies and specifically trying to make sense of genome-wide association studies as a way to understand human disease and complex traits.

So just to sort of start it at the very basic level, this is the output of a genome-wide association study or GWAS. The procedure is very straightforward - you collect a lot of genetic data on individuals with the disease and without the disease or with a quantitative trait, and then you test each genetic variant (and that’s what each of these dots is here) for association with the phenotype. The variants that are significantly associated are above a predefined threshold here, and if they replicate, we treat those as genetic variants that are causal for the disease. This is sort of a study design that I think initially almost seemed too simple to work, but now over time and with very large sample sizes has produced thousands, if not hundreds of thousands, of associations for nearly every complex trait that it’s been applied to when there was sufficient sample size.

In fact, the challenge is now that these association studies are almost producing too many results, and what we would rather have than sort of this figure, which is a real plot from a GWAS in prostate cancer, is something more like this, which is a systemic or systematic understanding of the disease, of which genes are involved in the disease, how they interact, what contexts they’re relevant in, and so forth. So, whereas initially there is sort of a challenge of just fleshing out this side of the plot, getting these associations, I think a key challenge now is in connecting from this side of the plot over here to an actual understanding of the disease.

One of the most basic pieces of getting to that understanding is connecting variants to the genes, associated variants to the genes that they likely operate through and then operate on the trait. So, we can break it down into this very simple structure. We have a variant; we want to know its target gene and the effect that it has on that disease. So, in particular, we can break this down even further and first just ask whether we can identify variants that influence the expression of genes in a systematic way.

Something that was observed some time ago is that if you take gene expression and basically run a GWAS with expression as your outcome (gene expression measured in the past through microarrays or now through RNA-seq), testing variants that are typically near the gene, in cis with the gene, for association with expression across individuals, you will find that the expression of many genes is often highly heritable. So, there’s an estimate here, from 2011, that the cis locus for an average gene contributed between 24% and 37% of the variance of expression. And again, once you have a heritable phenotype in a population, you can apply the GWAS paradigm to that phenotype, and we call that an eQTL analysis.
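To make the "GWAS on expression" idea concrete, here is a minimal sketch of a cis-eQTL scan on simulated data. Everything in it (the function name `cis_eqtl_scan`, the sample size, the effect size, the position of the causal SNP) is an illustrative assumption, not something taken from the talk or from any eQTL software.

```python
import numpy as np

def cis_eqtl_scan(genotypes, expression):
    """Marginal eQTL z-scores: test each SNP for association with expression."""
    n = genotypes.shape[0]
    # Standardize so that each single-SNP regression slope is a correlation.
    X = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (expression - expression.mean()) / expression.std()
    r = X.T @ y / n
    # t-statistic for a correlation coefficient with n - 2 degrees of freedom.
    return r * np.sqrt((n - 2) / (1.0 - r**2))

# Simulated cis locus: 500 individuals, 50 independent SNPs (no LD, for simplicity).
rng = np.random.default_rng(0)
n, m = 500, 50
maf = rng.uniform(0.1, 0.5, size=m)
genotypes = rng.binomial(2, maf, size=(n, m)).astype(float)

causal = 10  # hypothetical causal eQTL
expression = 1.0 * genotypes[:, causal] + rng.normal(size=n)

z = cis_eqtl_scan(genotypes, expression)
top_snp = int(np.argmax(np.abs(z)))
print("top eQTL index:", top_snp)
```

In real data the SNPs at a locus are correlated through LD, so the top SNP is not necessarily the causal one; the simulation drops LD only to keep the sketch short.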

I’m sure you folks have seen work from the GTEx Consortium over many years, applying eQTL studies and identifying thousands of variants associated with the expression of many genes in many tissues. And in fact, again, this is one of those cases, where as the sample sizes have grown, this study design has actually yielded a very large number of associations that are almost like too difficult, too many to fully process.

And I think the most recent GTEx study showed that if you sort of relax the significance threshold for these associations, nearly every gene has at least one eQTL in some tissue. And in fact, I think that if you continue as the sample sizes have grown even further, we see that genes then start to have secondary eQTLs and tertiary eQTLs, and this sort of curve does not even hit diminishing returns. So that’s the piece about identifying genetic variants that influence gene expression.

Then there’s been a lot of work in trying to understand how these eQTLs connect to disease, and I’ll highlight a couple of studies in particular, which basically asked in a couple of different ways whether an eQTL is more likely to be a GWAS variant or is more likely to be associated with a complex trait. So, the results on the left show that eQTLs, specifically as you get more confident about them being the causal eQTL, are more enriched for heritability across many complex traits from GWAS. And then this figure on the right from Gamazon et al. showing that if you just sort of try to partition the amount of disease heritability that could be explained by eQTLs, those estimates are also quite high across a large number of complex traits, again ranging from maybe 10% up to 35%. So there’s this sort of incidental evidence that eQTLs are enriched for disease heritability and may therefore give us an instrument to understand the likely causal genes and eventually go back to that big system-wide understanding of the phenotype.

So that’s the first part of the arrow. The other part of this network is we want to understand how this genetic mechanism of gene expression actually goes on to influence the trait and for which traits, and this is where the approach of a transcriptome-wide association study or a TWAS comes in.

I’ll just start with a very basic sort of thought experiment of what would we want to do if we had the ideal data set. How would we, with infinite resources, try to relate gene expression, genetics, and disease together? I think one way that we could do this is we could estimate expression in the hundreds of thousands of individuals that we have genetics and case-control status in. Here, like this represents case-control status. And then we could ask what genes are genetically correlated, meaning the effect sizes on expression are also shared with the effect sizes on disease. We could do this for every single gene across the genome, and that would give us an estimate of the genes that, in principle, could be linked to this phenotype. The hurdle here is that we very rarely or pretty much never have data at this scale. What we typically have is a relatively small study of genotypes and measured gene expression, usually as in the case of the GTEx, in a sort of healthy, relatively healthy, population that was convenient to sample. And then we also have very large disease studies that also have genotypes but no gene expression measured. So the basic insight of the transcriptome-wide association study or TWAS is kind of thinking about the fact that what is shared across these two studies is the genetics, and we know previously that gene expression is itself a heritable trait. And if it’s a heritable trait, then in principle, it should be a predictable trait.

So what we want to do is use the genetics to predict expression into this study over here where we haven’t measured it and then use the predicted expression as a sort of proxy to estimate the relationship between the genetic component or the predicted component of expression and the phenotype. Again, I’m sort of presenting everything in the context of a single gene, but the idea is to use this methodology and scan across every gene in the genome and identify the set of genes that are significantly genetically correlated or for whom the predicted expression is significantly associated with the phenotype.

And so right, then we do the test. So the first question is: can we actually predict gene expression in this way? And the fact that we’ve observed significant eQTLs, individual variants that affect expression, basically tells us that we can. And in work that we’ve done and others have done, we’ve shown, using a number of different prediction schemes that I won’t go into but that are various forms of penalized or Bayesian regression, that you can, in fact, predict gene expression with a substantial degree of accuracy. In particular, when you use models that incorporate all of the genetic variation around the gene, you typically have substantial gains in the predictive accuracy. So even though the single top eQTL explains a large fraction of the cis effect, or of the total heritability near the gene, there is a very large number of genes for which additional variants contribute substantially to the predictive accuracy. Simply going from a single-SNP paradigm to a locus-wide paradigm increases our predictive accuracy, and that’s going to translate into better association statistics in the eventual GWAS.
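As a rough illustration of the single-SNP versus locus-wide comparison, the sketch below fits a ridge regression (one simple form of penalized regression, used here as a stand-in and not necessarily one of the schemes from the work described) and compares its held-out accuracy with a top-eQTL-only predictor. The simulated genotypes are independent standard normals, and the effect sizes, penalty `lam`, and helper names are all assumptions for illustration.

```python
import numpy as np

def fit_ridge_weights(X, y, lam):
    """Closed-form ridge weights: (X'X + lam * I)^-1 X'y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

def r_squared(pred, obs):
    """Squared correlation between predicted and observed expression."""
    return np.corrcoef(pred, obs)[0, 1] ** 2

rng = np.random.default_rng(1)
n_train, n_test, m = 400, 400, 30

# Stand-in for standardized cis genotypes (independent SNPs, no LD).
X = rng.normal(size=(n_train + n_test, m))
true_w = np.zeros(m)
true_w[[3, 12, 20]] = 0.4  # three hypothetical causal eQTLs
expr = X @ true_w + rng.normal(scale=0.8, size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = expr[:n_train], expr[n_train:]

# Locus-wide model: all SNPs, penalized.
w_all = fit_ridge_weights(X_tr, y_tr, lam=10.0)

# Top-eQTL model: only the most strongly associated SNP in the training data.
top = int(np.argmax(np.abs(X_tr.T @ y_tr)))
w_top = np.zeros(m)
w_top[top] = (X_tr[:, top] @ y_tr) / (X_tr[:, top] @ X_tr[:, top])

r2_all = r_squared(X_te @ w_all, y_te)
r2_top = r_squared(X_te @ w_top, y_te)
print(f"held-out R2, locus-wide: {r2_all:.2f}; top eQTL only: {r2_top:.2f}")
```

With several causal variants, the locus-wide model recovers more of the heritable signal in held-out data; with a single causal variant the two predictors would be much closer.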

Now, one additional constraint is that we typically don’t really even have this design where there’s individual-level data in both studies. What we actually have more frequently is this design where we have individual-level data for the gene expression study, and then we have summary statistics for the GWAS. The summary statistics are basically for every SNP, the marginal association statistics for every variant. And what we want to know from this kind of data is what would the gene-trait association have been if we could get to the individual-level data and measure it. And so this is really where the TWAS methodology comes in. Again because this is the type of data we have most of the time.

I’ll just sort of sketch out how this parameter is estimated, and the basic idea is that we think about what we would want to do with individual-level data and then we kind of move terms around and try to identify pieces that can be estimated from the summary level data. So, we start with predicted expression over here (X are the genotypes that we use for the prediction, w are the weights that we’ve trained in the gene expression data that gives us this term G, that’s the predicted expression), and then what we want to know is the association between Y (the phenotype) and G (the predicted expression). So specifically, we want to know this orange β_TWAS. So we can kind of plug in the terms into a basic ordinary least squares regression and then decompose these terms, and you can start to see pieces here that you can actually estimate from summary level data. In particular, you’ll see that this covariance between the genotype and the phenotype actually corresponds to these GWAS summary statistics that we get, the association between each SNP and the phenotype. And then this term down here, the covariance between the SNPs themselves is also something that we, in principle, can get from reference panels because it doesn’t rely on knowing the phenotype. So these two pieces we can get externally, we plug them back in, and now this is a summary-based estimate of the β_TWAS that only requires the Z-scores, the reference LD, and then these weights which we have (we sort of assume that we have a priori). And then I won’t go into the details of how we derive the variance for this statistic; it’s very, very similar. 
And the final association statistic that we get looks like this, where again in the numerator you have, you can think of this, as a weighted sum of the GWAS scores that’s weighted by the predictors of expression, and then in the denominator, we have essentially the variance of that predicted expression that accounts for the correlation across these SNPs – so SNPs that are correlated are going to add to the variance and SNPs that are independent are not. So this is basically the score, and I think this is also kind of a useful framework to think about how you can go from individual-level data to estimates of quantities we’re interested in with summary-level data.
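The slides are not reproduced in this transcript, so the following is a reconstruction of the derivation as described, under stated assumptions: genotypes X and phenotype Y are standardized, n is the GWAS sample size, Z is the vector of GWAS z-scores for the SNPs in the model, and Σ is the LD (SNP correlation) matrix from a reference panel. The symbols n, Z, and Σ are notation chosen here.

```latex
% Individual-level quantity: regress phenotype Y on predicted expression G = Xw
G = Xw, \qquad
\hat{\beta}_{\mathrm{TWAS}}
  = \frac{\operatorname{Cov}(G, Y)}{\operatorname{Var}(G)}
  = \frac{w^{\top} X^{\top} Y}{w^{\top} X^{\top} X \, w}

% Pieces available from summary data
\frac{1}{n} X^{\top} Y \approx \frac{Z}{\sqrt{n}}, \qquad
\frac{1}{n} X^{\top} X \approx \Sigma

% Summary-based estimate and its z-score
\hat{\beta}_{\mathrm{TWAS}} \approx \frac{w^{\top} Z / \sqrt{n}}{w^{\top} \Sigma \, w},
\qquad
Z_{\mathrm{TWAS}} = \frac{w^{\top} Z}{\sqrt{w^{\top} \Sigma \, w}}
```

The denominator of the z-score follows because, under the null, Z is approximately distributed as N(0, Σ), so the variance of the weighted sum wᵀZ is wᵀΣw. Correlated SNPs with same-sign weights contribute off-diagonal terms to that variance, which matches the description above.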

When we apply this technique to summary-based data and individual-based data, it works really well. The correlation is nearly perfect, and again, we didn’t really make any assumptions going through that previous derivation except for the fact that the LD is well-matched to the target population. There’s also a hidden assumption that the effect sizes can’t be so enormous that we need to account for changes in the environmental variance, and those assumptions are very easily satisfied in most studies.
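The agreement between the two versions of the test can be checked in a small simulation. This is a sketch under simplifying assumptions: Gaussian stand-ins for genotypes with an AR(1)-style LD pattern, a null phenotype, random sparse weight vectors playing the role of many "genes", and a separate reference panel drawn from the same population. All names and sizes are hypothetical.

```python
import numpy as np

def assoc_z(x, y):
    """Z-score (t-statistic) from regressing y on a single predictor x."""
    n = len(y)
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((n - 2) / (1.0 - r**2))

def twas_z_summary(w, gwas_z, ld):
    """Summary-based TWAS z-score: weighted sum of GWAS z-scores over its SD."""
    return (w @ gwas_z) / np.sqrt(w @ ld @ w)

rng = np.random.default_rng(2)
n_gwas, n_ref, m, n_genes = 2000, 500, 20, 200

# LD that decays with distance between SNPs.
idx = np.arange(m)
ld_true = 0.7 ** np.abs(idx[:, None] - idx[None, :])
chol = np.linalg.cholesky(ld_true)

X_gwas = rng.normal(size=(n_gwas, m)) @ chol.T
X_gwas = (X_gwas - X_gwas.mean(axis=0)) / X_gwas.std(axis=0)
X_ref = rng.normal(size=(n_ref, m)) @ chol.T  # separate reference panel
y = rng.normal(size=n_gwas)                   # null phenotype

# What a GWAS would publish: one marginal z-score per SNP.
gwas_z = np.array([assoc_z(X_gwas[:, j], y) for j in range(m)])
ld_ref = np.corrcoef(X_ref, rowvar=False)

z_ind, z_sum = [], []
for _ in range(n_genes):
    w = rng.normal(size=m) * (rng.random(m) < 0.3)  # sparse expression weights
    if not w.any():
        continue
    z_ind.append(assoc_z(X_gwas @ w, y))             # individual-level test
    z_sum.append(twas_z_summary(w, gwas_z, ld_ref))  # summary-level test

agreement = np.corrcoef(z_ind, z_sum)[0, 1]
print(f"correlation between individual- and summary-level z: {agreement:.4f}")
```

Drawing the reference panel from a population with a different LD pattern is the easiest way to see the first assumption break.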

Now, thinking about when does this approach actually lead to associations, we ran some simulations where we considered three different study designs under the model where there is a causal gene, and we’ve observed the predictors of that causal gene. So you could imagine, in that scenario, just running your standard GWAS to try to identify the association. You could imagine testing only the top SNP, the top eQTL that’s associated with expression, or you can imagine running a full TWAS test. And when we do that, we see that in this scenario, because we’re testing fewer features, we’re only testing each gene instead of each SNP, then the power of the TWAS or the eQTL-only approach is higher than the GWAS approach. So, this is one case where not only are we getting a parameter that we’re interested in on its own, we also have some increase in power because the multiple testing burden is effectively lower.
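The multiple-testing part of this argument can be sketched with a simple power calculation. The numbers of tests and the signal strength below are hypothetical round figures, and the calculation assumes the gene-level test carries the same signal as the SNP-level test, which is roughly the single-causal-variant, well-predicted-gene scenario.

```python
from statistics import NormalDist

def power_two_sided(ncp, alpha):
    """Power of a two-sided z-test with non-centrality ncp at level alpha."""
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return (1.0 - nd.cdf(crit - ncp)) + nd.cdf(-crit - ncp)

n_snps = 1_000_000  # hypothetical number of independent SNP tests
n_genes = 20_000    # hypothetical number of gene tests

alpha_gwas = 0.05 / n_snps   # Bonferroni threshold per SNP
alpha_twas = 0.05 / n_genes  # Bonferroni threshold per gene

ncp = 5.0  # hypothetical expected z-score at the causal feature
power_gwas = power_two_sided(ncp, alpha_gwas)
power_twas = power_two_sided(ncp, alpha_twas)
print(f"power at GWAS threshold: {power_gwas:.2f}")
print(f"power at TWAS threshold: {power_twas:.2f}")
```

The same signal clears the gene-level threshold more often simply because fewer tests are being corrected for.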

Furthermore, if we expand the model and say, additionally consider genes with multiple causal variants, where now the TWAS approach of applying a penalized model to the entire locus is giving us more signals, more predictive accuracy than the top eQTL, we see that the power of these single SNP approaches drops, but the power of the TWAS locus-wide approach remains effectively the same. So again, this is another scenario where when we have many causal variants for expression that all lead to disease, then we can substantially boost power. And where the truth is in between or maybe a little bit off the page, there’s going to be some loci where we don’t have the measured expression at all, so these expression-based approaches will just fail. There’s going to be some loci where there’s only a single variant for the gene, and we’ll be up here, and there’s going to be some loci where there are many causal variants for the gene, and the TWAS will then maximize power relative to other approaches.

So this is all in simulations under very specific presumed models. We can also ask how well this approach performs in real data. And this has actually been quite a challenging question to answer because, as it stands, we have very few well-established causal genes for disease. So I showed you that plot at the beginning that had over 200 known associations for prostate cancer, but the number of really definitively established causal genes for prostate cancer from that study is extremely small, and that’s sort of the case for most complex traits. So we’re kind of working in a regime where we don’t really have a ground truth. There was a study in this pre-print by Weeks et al., from the Finucane lab, which I thought was an interesting attempt to try to get at a ground truth. The basic idea was that you look at data, in their case from the UK Biobank, where you have associations both from a standard common-variant GWAS and from a rare, coding variant-based set of tests, and you identify a locus where there are both common non-coding associations and rare coding variant associations. You can assume, maybe it’s not a safe assumption, but they assume that the rare coding variant is telling you the right causal gene. And so under this model, they basically have a kind of ground truth, which is what the rare coding variant tells you the causal gene is, and then they can ask how various other approaches do at identifying that causal gene based on just the blue stuff, just the common variant associations.

And so, now they have a ground truth. They can plot precision-recall curves, and they used this approach to evaluate a bunch of different methods listed here and then also to propose their method, which is, conceptually, quite different. I won’t go into it, but it’s sort of like an ensemble that integrates many different features at the locus to make the predictions. But I think what’s relevant here is how these other approaches perform, and what you can see is that there’s quite a lot of heterogeneity in their performance. The TWAS is here in blue, and at one point, it has the highest recall in this model relative to the other approaches, aside from their ensemble-based approach. And then, additionally, they integrated each of these methods together with their model, and in that scenario, the TWAS had the highest precision together with their approach. But again, I think an important takeaway here is that this is far from a solved model with a clear optimal method. The TWAS provides you an estimate of a certain statistical quantity, but this is biology and biology is complicated. So, lots of different approaches have different trade-offs for what they’re able to identify and at what levels of precision and recall. And then, you know, one thing I should mention that maybe people are noticing is that if you take a very simple model of just what is the nearest gene or what’s the distance or how far away is the potential causal gene, that actually performs really well. And, in fact, it performs about as well as the method that they developed and also, when combined, has very good precision. So, again, I think that there are many explanations for this. One is that, in fact, it may be that the nearest gene, oftentimes, is the correct gene. It may also be the case that this specific model tends to emphasize genes that are close to the association statistics. 
But, again, I think it’s also important to keep in mind that probably some hybrid of all of these methods that also consider proximity is going to eventually be the optimal solution.

Okay, so that’s kind of where we stand with TWAS applications. Coming back to this figure, you can sort of wonder why the precision of TWAS is relatively low compared to these other methods, and I think, again, there’s an important set of caveats, which were highlighted in this paper from Michael Wainberg et al. a couple of years ago in Nature Genetics, which essentially come down to the fact that TWAS is an association study. It’s not a causal inference technique. And as an association study, it’s going to be susceptible to tagging and correlation in the same way that genome-wide association studies are.

And so, in this paper, they proposed a number of alternative models, which could still identify a significant TWAS hit. One alternative model is that you can have co-regulation at a locus, where the same genetic variant or set of genetic variants drive the expression of multiple genes, both the causal and non-causal genes. And this is a real phenomenon that it’s not that uncommon that you will identify loci with multiple genes with very high cis genetic correlation and high genetic correlation to the trait.

Another case is that you can imagine some part of the genetic effect on a non-causal gene is tagged, because of LD between variants, with the effect on the causal gene. And this would produce a false positive association, or it would sort of induce some association for both the causal and the non-causal gene.

And then, likewise, you can imagine a scenario where the effect on the causal gene was missed in the gene expression study because it was not sufficiently powered or it didn’t get the right context. And so this would lead to a false negative association. And so, again, I think this is important to keep in mind that this is a test that is expected to tag the causal mechanism when these assumptions are met, but in the real world, these assumptions should also be interrogated.

The other important limitation to consider, and one that I think is potentially solvable, is the fact that, as we’ve seen with other genetic predictors, the predicted expression models do not generalize well to other populations. And really, because most of the data has so far been collected in individuals of European ancestry, this is particularly a problem for generalizing to data from non-European populations or admixed populations with low European ancestry. And so this paper showed models that were trained in European individuals that had high accuracy when predicted into held-out European individuals and had significant drops in accuracy when predicting into individuals of African ancestry. And, again, I think that there are potentially interesting ways to address this problem. Probably the most basic is just to start collecting more data in other populations; we should definitely be doing that. But also, there are methodological approaches that could potentially leverage all of the training data that we have available or think about the differences between populations to improve the prediction of these models.

So I also wanted to talk a little bit about methods. I think these are all methods that we did not develop and had no hand in, but that I think are interesting approaches to moving beyond just that β_TWAS that I described for the association between expression and disease. And I’ll sort of walk through them briefly, you know, to give you guys a flavor of methodologically what else can be done in this space.

There’s a great method called UTMOST that came out a couple of years ago in Nature Genetics, which thought about how gene expression data that’s measured in multiple tissues in the same individuals could potentially be used to improve these predictive models. So everything that I’ve been talking about so far sort of assumed that there was a population with some single modality of gene expression. But you could imagine, and this is exactly how GTEx was designed, that you’ve measured multiple tissues for every individual. This Y is now a matrix instead of a vector for a given gene. And then, the approach that UTMOST proposes is to actually try to learn the expression for each tissue together with all of the other tissues observed. And so, again, you see some similarities here. These B’s, what they’re learning, are sort of the w’s that I talked about earlier, but now they are being learned for all tissues at once. And they do that by using again a form of penalized regression where they have a penalty within each tissue where they want the weights to generally be sparse. And then they also have a penalty across the tissues where they don’t want to see a lot of differences between tissues. They sort of assume that if a SNP is important for one tissue, it should also be important for another tissue. And this approach, particularly for tissues that had relatively small sample sizes, substantially increased the prediction accuracy, basically by borrowing signal from other tissues that were available. And what’s shown here in purple is the increase in prediction accuracy in held-out data.
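One way to write down an objective with the two penalties described here is sketched below. This is an illustration consistent with the verbal description, not a transcription of the UTMOST paper; the exact scaling of each term in the published method may differ. The notation is chosen here: T tissues, n_t individuals with expression in tissue t, m SNPs, B the m × T weight matrix with column B_{·t} for tissue t and row B_{j·} for SNP j, and tuning parameters λ₁ and λ₂.

```latex
\min_{B} \;
  \sum_{t=1}^{T} \frac{1}{2 n_t} \left\lVert y_t - X_t B_{\cdot t} \right\rVert_2^2
  \; + \; \lambda_1 \sum_{t=1}^{T} \left\lVert B_{\cdot t} \right\rVert_1
  \; + \; \lambda_2 \sum_{j=1}^{m} \left\lVert B_{j \cdot} \right\rVert_2
```

The L1 term pushes the weights within each tissue toward sparsity. The group penalty on each row acts on a SNP's weights across all tissues at once, so a SNP tends to be kept or dropped for all tissues together, which is how tissues with small sample sizes can borrow signal from the others.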

Another approach in this multi-tissue framework is that, instead of learning weights using multiple tissues, we may be interested in testing multiple tissues where each set of weights was learned individually. And so there was this work, a method called MultiXcan, by Barbeira et al. in 2019, which essentially addressed the following. Now we have many G’s for a single gene: we have the predictive model from one tissue, a second tissue, a third tissue, and we want to know if there’s an association for any of these features in a joint model, so a multi-degree of freedom test for association. Again, because we don’t have the individual-level data, what we’ve actually observed is these marginal, individual TWAS statistics. But if we know the correlation between these statistics, then we can actually approximate the effect under the three-degree of freedom or n-degree of freedom test from these marginal effects. The MultiXcan paper also did some clever stuff for the case where you have many tissues with highly correlated expression and you don’t want to just throw them all into this model: they use principal components analysis to first reduce the dimensionality of the expression, and then test just the leading components of expression for association in this P degree of freedom test. And in practice, this produces a much larger number of significant gene-trait associations. Again, now we’re sort of saying that if there’s a little bit of signal here and a little bit of signal here and here, then that can add up to a lot of signal across the three degrees of freedom.
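The idea of combining marginal per-tissue statistics through their correlation can be sketched as follows. This is an illustration of the general technique (a chi-square statistic on the leading principal components of the correlation matrix), not the MultiXcan implementation; the function name, the `var_explained` cutoff for choosing components, and the example z-scores are assumptions made here.

```python
import numpy as np

def joint_tissue_test(z, corr, var_explained=0.95):
    """Joint chi-square statistic across tissues using leading principal components.

    z    : marginal TWAS z-scores for one gene, one per tissue
    corr : correlation matrix of predicted expression across tissues
    Returns the statistic and its degrees of freedom (components kept).
    """
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    # Keep the fewest components reaching the requested share of variance.
    cum = np.cumsum(eigval) / eigval.sum()
    k = int(np.searchsorted(cum, var_explained) + 1)
    # Under the null each projection has variance equal to its eigenvalue.
    proj = eigvec[:, :k].T @ np.asarray(z, dtype=float)
    stat = float(np.sum(proj**2 / eigval[:k]))
    return stat, k

z = [2.0, 2.0, 2.0]  # modest signal in each of three tissues

# Independent tissues: the three modest signals add up.
stat_indep, k_indep = joint_tissue_test(z, np.eye(3))

# Highly correlated tissues: they collapse to roughly one component.
corr_hi = np.full((3, 3), 0.9) + 0.1 * np.eye(3)
stat_corr, k_corr = joint_tissue_test(z, corr_hi, var_explained=0.9)

print(f"independent tissues: chi2 = {stat_indep:.2f} on {k_indep} df")
print(f"correlated tissues:  chi2 = {stat_corr:.2f} on {k_corr} df")
```

Dropping the trailing components is what keeps the test stable when tissues are nearly collinear, since dividing by very small eigenvalues would otherwise inflate the statistic.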

Another interesting method in this space is thinking about how to integrate many TWAS associations across a given locus. And so, you’ve probably seen methods for GWAS fine-mapping that try to identify the set of causal variants, or a set of variants that contains the causal variant with some predefined probability. The same sort of concept can be applied to TWAS statistics. And so, this work of Mancuso et al. in 2019 reformulated the problem in terms of having multiple TWAS associations at a single locus and then fine-mapping these down to the set of likely causal genes. And this is actually starting to address some of the caveats that I outlined earlier. When you have co-regulation of multiple genes, or you have some tagging across multiple genes, this is now an approach to put probabilities on which genes out of many are likely to be causal and which are likely to just be tags, and to estimate posterior probabilities of causality for a given gene.
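To give a flavor of what "posterior probabilities of causality" means, here is a deliberately simplified sketch. It assumes exactly one causal gene at the locus, equal prior probability for each gene, and a normal prior on the causal effect. This is not the method of Mancuso et al., which handles multiple causal genes and models the correlation between predicted expression of nearby genes; the function name and `prior_var` value are hypothetical.

```python
import math

def single_causal_pips(z_scores, prior_var=25.0):
    """Posterior probability that each gene is the single causal gene at a locus."""
    # Log Bayes factor for z ~ N(0, 1 + prior_var) versus the null z ~ N(0, 1).
    log_bf = [
        -0.5 * math.log(1.0 + prior_var)
        + 0.5 * z * z * prior_var / (1.0 + prior_var)
        for z in z_scores
    ]
    # Normalize in a numerically stable way.
    mx = max(log_bf)
    weights = [math.exp(v - mx) for v in log_bf]
    total = sum(weights)
    return [w / total for w in weights]

# Three genes at one locus; the first two might be co-regulated or in LD.
twas_z = [5.2, 4.8, 1.0]
pips = single_causal_pips(twas_z)
for gene_z, pip in zip(twas_z, pips):
    print(f"z = {gene_z:4.1f}  posterior probability = {pip:.3f}")
```

Even a small difference in z-score between two significant genes turns into a clear difference in posterior probability, which is the kind of ranking a gene-level credible set is built from.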

So just in the last couple of minutes, I wanted to mention a bit about what else can be done with this framework. And so, everything so far that I’ve been talking about has involved gene expression or transcription, but really the idea is that any molecular trait that is heritable and that can be predicted from data that we’ve measured is amenable to this sort of approach and this way of integrating with GWAS.

And, in particular, we can go back to this model which maybe we’ve solved now in some sense and observe that this is also an oversimplification. In that most of the time for non-coding variants, what we expect is that there’s some regulatory element that sits in between the variant and the expressed gene that maybe is the modifier or is the mediator of this gene expression. So really, there’s probably an enhancer or a transcription factor or a combination of those features through which this SNP has an effect on the expression of the gene, which then goes on to have an effect on the trait. And with sufficient data, we can actually start modeling these regulatory elements and the genetic predictors of these regulatory elements.

And we have some recent work to that end, which we call a regulome or a cistrome-wide association study. So, we’re sort of padding out the letters of the alphabet here, but the idea is that instead of learning predictors of a given gene, you can learn predictors of some biochemical activity, including transcription factor binding, chromatin state, or chromatin accessibility. And additionally, in this regime, we can also leverage some allele-specific information from variants that are inside these peaks that we suspect to be modifying their activity. And so, we can boost power even further because we can leverage signal within each individual in addition to across the individuals. And again, we’ve shown that this approach is fairly robust, that you can identify a very large number of predictive models.

And we’ve applied this approach specifically to cancer GWAS phenotypes, so again, this is cancer risk, we’re just talking about the predisposition to develop cancer. Going back to this plot that I started with, we’ve now characterized each of these loci. This inner circle here is the number of loci that had a significant TWAS association with a gene. But then, when we incorporate these epigenetic features, we see a much larger number of loci that additionally have associations through chromatin accessibility in this case, many of which do not actually exhibit a direct transcriptomic association. And so, this is actually sort of interesting and somewhat mysterious in that we’re able to identify loci where there seems to be a genetic regulatory effect that we don’t see cascade downstream onto expression. We do capture most of the loci that have the TWAS association, those we’re able to characterize, but then we have this number of additional loci.

And we’ve sort of started to think about what those loci could be telling us. One observation is that if you look at the distribution of evolutionary constraint across the genome, you will see that in regions with higher evolutionary constraint, fewer TWAS models can be built, probably because selection is decreasing the observed effect on expression, making it harder to pick up the eQTLs. But we actually see more of these RWAS or chromatin-WAS models in those loci. So maybe this gap could potentially explain that sliver in the previous figure. These are loci that are very difficult to pick up in the expression framework but are not as difficult to pick up when we look directly at the intermediate chromatin phenotype. And a related observation that we’ve made is about genes in terms of their tissue specificity. As we move from the left to the right, these genes are more specifically expressed in prostate tissue (we’re looking at a prostate cancer GWAS again), and we see that for the TWAS models, there are fewer of them, they’re harder to fit for more tissue-specific expression, but for the chromatin-based models and the transcription factor-based models, there are more of them and they’re easier to fit. And so again, this could be pointing towards a phenomenon where more tissue-specific expression has lower power for the eQTL and transcriptome-based models but higher power for these epigenetic-based models.

And so with that, I’ll conclude. I hope I’ve been able to convince you at the very least that gene expression is a complex, heritable, and predictable trait, and this predictability is something that we can leverage to integrate that trait into other datasets where we don’t have it measured. And specifically, we derived this TWAS statistic, which is a measure of the cis genetic correlation between the gene expression and the disease. As I noted at the end, this is not just limited to transcription; other molecular phenotypes can be used within the same framework. And again, I want to sort of emphasize the caveats that go along with any kind of association study. It’s not a causal inference, and in fact, causal inference in this space is I think a really interesting and sort of ongoing open problem. How do we disentangle all of those different arrows that I was showing earlier? And then also, just to remind you that all the prediction here has been within the cis locus of the gene. That’s where we have power at the current sample size. But there’s a whole world of trans effects, which we haven’t really scratched the surface in understanding. And so that’s something that I think as studies get larger and as we have more experimental data, we’ll also be able to fold into this framework. And so with that, I’ll take your questions. Thanks. Thank you.

TWAS Primer

Title: PGC TWAS Primer

Presenter(s): Sasha Gusev, PhD (Dana-Farber Cancer Institute, Harvard Medical School)

Sasha Gusev:

Hello! Thank you for having me for PGC Day to talk about transcriptome-wide association studies and our method, FUSION. I’m very excited to give a little bit of background on this methodology and also some examples of how to use it and how to interpret the results.

So I will start by talking a little bit more generally about what the transcriptome-wide association study, or TWAS, is, and then I’ll get into how to actually run it yourself with your own data. And this is going to be very brief. The basic idea of TWAS is that oftentimes when we’re performing a genome-wide association study (GWAS), we’re interested in understanding the potential mechanisms of an associated locus or identifying novel mechanisms that we haven’t discovered yet. One way that we can make those associations is by integrating gene expression data. So, under the assumption that genetic variants modify transcriptional activity and then lead to disease via that transcriptional activity, what we would like to have is a study with genetic information, gene expression information, and disease all measured in the same individuals, in which we could investigate the relationships between all of these modalities and additionally compute the genetic component of expression and ask whether that component is associated with the phenotype and how strongly.

Unfortunately, in a typical GWAS, we don’t have this data. We usually don’t have gene expression measured in our cases and controls, and so we can’t probe these questions directly. But the key insight that TWAS makes is essentially that we can probe these questions by predicting this genetic component of expression, and that’s something we can do with summary-level data from the GWAS only, without even requiring the individual-level phenotype information. So the way that TWAS works is by constructing predictive models that relate genotypes to expression in some training data, for example, in the GTEx cohort or in the CommonMind Consortium datasets, which don’t necessarily have the phenotype of interest measured in them. Then we predict this expression into our GWAS study, and now we can associate it with the phenotype directly, and we can infer associations between the genetic effect on expression and the disease. Alternatively, you can think about this as inferring the genetic correlation between the gene expression and the phenotype. And in all of the applications I’m going to be talking about, we train this using just the cis locus, just a megabase around the gene. So this is all the cis genetic correlation or the local genetic correlation between gene expression and disease.
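The two-step idea described above (predict the genetic component of expression, then associate it with the phenotype) can be sketched in a few lines of Python. This is only an illustration for the individual-level case: the weights, genotypes, and phenotype values are made-up toy numbers, and the function names are hypothetical rather than part of FUSION.

```python
import math

def predict_expression(genotypes, weights):
    """Predicted genetic component of expression: weighted sum of allele dosages."""
    return [sum(w * g for w, g in zip(weights, person)) for person in genotypes]

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data (assumed): 3 cis SNPs, 6 individuals, dosages coded 0/1/2.
weights = [0.4, -0.1, 0.0]          # would come from a model trained in a reference panel
genotypes = [[0, 1, 2], [1, 0, 1], [2, 2, 0], [1, 1, 1], [2, 0, 2], [0, 2, 0]]
phenotype = [0.1, 0.5, 0.7, 0.2, 0.9, -0.3]

predicted = predict_expression(genotypes, weights)
association = pearson(predicted, phenotype)
print(predicted, association)
```

In a real analysis the weights come from a model trained in a panel that has both genotypes and expression, and the association step would be a proper regression with covariates rather than a raw correlation.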

Now, the one other point that’s important here is that in the cases where we don’t have the underlying phenotype information but we do have summary GWAS data, so association statistics, p-values, Z-scores, and effect sizes, we can still perform this analysis by using an LD reference panel with the summary-level data and estimating what the predicted gene-trait association would have been if we had the underlying data. And for more information and the derivation of how we do this, how accurate it is relative to individual level data, I would refer you to the references here.
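As a rough sketch of the summary-based version: given expression weights w, GWAS Z-scores z for the same SNPs, and an LD (correlation) matrix from a reference panel, the predicted gene-trait Z-score is the weighted sum of the GWAS Z-scores divided by the standard deviation of that sum under the null. The numbers below are toy values and the function names are hypothetical; this follows the published form of the statistic but is not the FUSION code itself.

```python
import math

def twas_z(weights, gwas_z, ld):
    """Summary-based TWAS Z-score: (w . z) / sqrt(w' LD w)."""
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return numerator / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided p-value for a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

# Toy inputs (assumed): two SNPs in LD with r = 0.5.
weights = [0.5, 0.2]
gwas_z = [4.0, 3.0]
ld = [[1.0, 0.5],
      [0.5, 1.0]]

z = twas_z(weights, gwas_z, ld)
print(z, two_sided_p(z))
```

The sign of the result carries information: a positive value means higher predicted expression goes with higher trait value or risk, with respect to the allele coding shared by the weights and the GWAS.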

Okay, so a little bit more context now that we have a sense of what TWAS is doing: how do we interpret those results? So, one point to think about is how TWAS compares to other approaches that integrate molecular data with GWAS, and I think a main analysis type is co-localization analysis, and it’s important to keep in mind how TWAS differs from co-localization. So, as I mentioned, TWAS is estimating essentially the genetic correlation between the cis component of expression and the disease that we’re interested in. Co-localization is testing a specific hypothesis or estimating a specific probability that the disease and the molecular phenotype have the same causal variant. So, TWAS is an association test, and co-localization is evaluating the probability of a shared causal variant. That’s a little bit different. The rest of the differences are kind of mechanistic, so TWAS is a frequentist test, and it provides a signed test statistic because it’s estimating a genetic correlation, whereas co-localization is typically implemented in a Bayesian framework, and it estimates a posterior probability, between zero and one, of the shared causal effect.

TWAS doesn’t make any assumptions or does not have to make any assumptions on the causal variants either in the disease or in the expression trait. So, you can have allelic heterogeneity; you can have complex relationships, as long as the causal effects are linear within the summary-based model. There’s no assumption on how those causal effects have to be distributed, whereas typically co-localization requires some assumption on the number or the relative relationship of the causal effect sizes to estimate this probability.

And then, lastly, because TWAS is using individual level data to train predictive models, it can train all sorts of fancy predictors of gene expression and then impute those into the target GWAS study. Whereas typically co-localization is using marginal eQTL association statistics and marginal GWAS association statistics, and so it does not inherently allow for these kinds of fancy predictive models. And this means that, in some cases, TWAS maybe can squeeze out a bit more power by modeling complex genotype-phenotype relationships.

One other important point to think about is how not to interpret TWAS or ways in which TWAS can provide you with non-causal associations, which is what we don’t want. And this was covered in detail in this great paper from Wainberg et al. in 2019 in Nature Genetics. But the basic idea is that because TWAS is an association statistic, it can pick up associations due to all sorts of tagging, in the same way that GWAS can pick up associations due to SNPs being tagged or correlated with a causal variant. TWAS can pick up associations due to genes or eQTLs being correlated with the causal variant, and so this example from Wainberg et al. shows that if you have a non-causal gene that’s correlated and co-regulated, meaning the same variant drives the non-causal gene and the causal gene, you may observe a TWAS association with the non-causal gene that’s just a product of this co-regulation. And likewise, if you have eQTLs that are influencing both a causal gene and maybe partially a non-causal gene, that may induce some TWAS association. And if you are missing an eQTL for the causal gene, but you have an eQTL for the non-causal gene in the same locus, that can induce a misleading TWAS association. So these are all things to keep in mind. Essentially, the same sort of limitations that apply to integrating eQTL studies of any kind apply to the TWAS framework, which is building on top of the eQTLs.

Okay, so now getting to the heart of the problem, how do we actually run these methods and what do we get out of them? So, I’m just going to run through a specific example. I would urge everybody to go to the TWAS website (the link is here) and sort of work through the outline and download the code. I’m going to show you what it looks like for a single gene. The inputs that we need for this analysis are the FUSION software, which implements the TWAS association test; LD reference data, as I mentioned; GWAS summary statistics in a prepared format; and gene expression weights. These are the predictive models that we’re going to be using that are the key component of the TWAS analysis.

So, let’s walk through each of these one at a time. The LD reference data is essentially a directory of PLINK-formatted genotype files, from the 1000 Genomes in this case, just broken up by chromosome - basic genotype files. You can substitute your own LD reference panels if you’re working with other populations.

For the summary statistics, I’m going to use schizophrenia summary statistics from the PGC2. They follow the LDSC format. If you’ve ever used LDSC for heritability analyses, the format is essentially the same. We need to know the SNP, the alleles for that SNP, and the Z-score field, which gives the direction and significance of the association. That’s all we need for this analysis.
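To make the required fields concrete, here is a small Python sketch that reads a summary statistics table and checks that the SNP, allele, and Z-score columns are present. The column names SNP, A1, A2, and Z follow the LDSC-style convention; the tab delimiter, the toy rows, and the function name are assumptions for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("SNP", "A1", "A2", "Z")

def read_sumstats(handle):
    """Return {snp_id: (a1, a2, z)} from a tab-delimited table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"summary statistics are missing columns: {missing}")
    return {row["SNP"]: (row["A1"], row["A2"], float(row["Z"])) for row in reader}

# Toy file contents (made-up SNP IDs and Z-scores).
toy_file = io.StringIO(
    "SNP\tA1\tA2\tZ\n"
    "rs_toy1\tA\tG\t-1.25\n"
    "rs_toy2\tC\tT\t4.10\n"
)
sumstats = read_sumstats(toy_file)
print(sumstats["rs_toy2"])
```

The Z-score sign is interpreted relative to the first allele, so the alleles need to be matched against the LD reference and the weights before any statistics are combined.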

Lastly, the gene expression weights. Again, you can download these from the FUSION website. We’ve compiled these for GTEx, for CommonMind, for various PsychENCODE studies, as well as actually for other phenotypes besides gene expression. When you download those weights, you get an annotation file, this .pos file, that contains information about each underlying weight. The weight itself is in an R data format; you can just load this in R and see what the weights are for predicting expression. The .pos file also has the information on the gene and the position of the gene, and again, this file and the pointers to each of these weights are all that’s required.

Okay, so let’s run a TWAS analysis now that we have all of those components. FUSION implements a series of R scripts for doing this analysis. To perform a test, you run this FUSION association test script. It takes as inputs the summary statistics, a pointer to the weights file, a reference to where the weights actually are so that it can read them in, a reference to the LD reference panel (which has all the chromosome information), the specific chromosome that we’re going to be running on (chromosome 3), and where to print the output. This is it. We run this.
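For reference, the call described above can be assembled like this. The script name and flags are the ones documented for FUSION's association test; every file path below is a placeholder, and the Python wrapper is just a hypothetical convenience for building and printing the command, not something shipped with FUSION.

```python
import shlex

def assoc_test_command(sumstats, pos_file, weights_dir, ld_prefix, chrom, out):
    """Build the argument list for the FUSION association test script."""
    return [
        "Rscript", "FUSION.assoc_test.R",
        "--sumstats", sumstats,        # GWAS summary statistics (SNP, A1, A2, Z)
        "--weights", pos_file,         # .pos annotation file listing the weights
        "--weights_dir", weights_dir,  # directory the weight paths are relative to
        "--ref_ld_chr", ld_prefix,     # prefix of per-chromosome PLINK LD reference files
        "--chr", str(chrom),           # chromosome to analyze
        "--out", out,                  # where to write the results table
    ]

# Placeholder paths (assumed); substitute your own files.
command = assoc_test_command(
    sumstats="scz.sumstats",
    pos_file="WEIGHTS/panel.pos",
    weights_dir="WEIGHTS/",
    ld_prefix="LDREF/1000G.EUR.",
    chrom=3,
    out="scz.chr3.dat",
)
print(shlex.join(command))
```

Passing the same list to `subprocess.run(command, check=True)` would execute it, assuming R and the FUSION scripts are installed in the working directory.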

And the output that we get is this .dat file. I’m just going to look at a couple of lines from that file, the significant associations from that file. And they look like this. This isn’t all that readable, but you can see each line is one result from a TWAS test analysis. Just to break this down a little bit further, let’s take a look at one of these lines. So it looks something like this. You have entries for which weights file was used. In this case, the CNTN4 gene was used for prediction. The name of the gene, where it is, the heritability of the gene. We have some information on the best GWAS SNP in the locus and the Z-score for that GWAS SNP. We also have information on the best eQTL for that gene and the R-squared and Z-score for that eQTL. So, you can see that the eQTL, the best eQTL, is actually quite a bit less significant in the GWAS study than the best SNP in the locus.

We have information on how many SNPs were used to train the model and how many were actually retained for prediction; this NWGT is the number of non-zero weights. We have the best predictive model of the many different models trained, and the cross-validation R-squared of that model. And, again, something to keep in mind here is that the R-squared from the predictive model, the Elastic Net model, is much higher than the R-squared that we got from the top eQTL. So this suggests that there are some additional variants that are contributing to the predictive accuracy of this model.

Then there’s the cross-validation model p-value; this is just the p-value on this R-squared. And then finally, the statistics that we’re interested in, the TWAS association Z-score and the p-value on that Z-score. So this is the key statistic that we want: how strongly is this CNTN4 predictor associated with the disease?
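A results table like the one just described is easier to read once it is parsed by column name. The sketch below assumes a tab-delimited table with a header row and uses a subset of FUSION-style column names (ID, HSQ, EQTL.R2, NSNP, NWGT, MODEL, MODELCV.R2, TWAS.Z, TWAS.P); the gene names and values are invented toy data, not results from the talk.

```python
import csv
import io

NUMERIC_COLUMNS = ("HSQ", "EQTL.R2", "MODELCV.R2", "TWAS.Z", "TWAS.P")

def read_twas_results(handle):
    """Parse a tab-delimited results table and sort genes by TWAS p-value."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return sorted(rows, key=lambda r: r["TWAS.P"])

# Toy table (made-up genes and numbers).
toy_output = io.StringIO(
    "ID\tHSQ\tEQTL.R2\tNSNP\tNWGT\tMODEL\tMODELCV.R2\tTWAS.Z\tTWAS.P\n"
    "GENE_B\t0.10\t0.03\t350\t1\ttop1\t0.04\t-1.2\t2.3e-01\n"
    "GENE_A\t0.20\t0.05\t400\t12\tenet\t0.15\t5.1\t3.4e-07\n"
)
results = read_twas_results(toy_output)
for row in results:
    gain = row["MODELCV.R2"] - row["EQTL.R2"]  # model accuracy beyond the top eQTL
    print(row["ID"], row["MODEL"], row["TWAS.Z"], row["TWAS.P"], round(gain, 3))
```

Comparing the cross-validation R-squared of the chosen model with the R-squared of the top eQTL, as in the last column printed here, is one quick way to see whether the multi-SNP model is adding anything.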

Okay, now in addition to getting these kinds of outputs, something else that we may be interested in is visualizing these kinds of results. So, I’m going to take a look at again some of the top significant TWAS associations from our analysis. We’ll put them in this file .top. Then we run this script in FUSION for post-processing the data. The script takes some of the same inputs, essentially the summary statistics and the LD reference data, this .top file that we just generated, as well as some information on what we want to do. We want to plot these loci within 100 kb of the gene and generate the outputs.
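The two steps here, selecting the top associations and then calling the post-processing script, might look like the following. The Bonferroni cutoff is an assumed way of defining "significant" (the talk does not specify the threshold), the file names are placeholders, and the toy rows are invented; the script name and flags are the ones documented for FUSION's post-processing.

```python
import shlex

def bonferroni_hits(rows, alpha=0.05):
    """Keep rows whose TWAS p-value passes alpha divided by the number of tests."""
    if not rows:
        return []
    threshold = alpha / len(rows)
    return [row for row in rows if row["TWAS.P"] < threshold]

def post_process_command(sumstats, top_file, ld_prefix, chrom, out, window_bp=100_000):
    """Build the argument list for the FUSION post-processing script."""
    return [
        "Rscript", "FUSION.post_process.R",
        "--sumstats", sumstats,
        "--input", top_file,               # table of significant associations
        "--out", out,
        "--ref_ld_chr", ld_prefix,
        "--chr", str(chrom),
        "--plot",                          # request locus plots
        "--locus_win", str(window_bp),     # window around each gene, in base pairs
    ]

# Toy rows (made-up); in practice these come from the association output.
rows = [
    {"ID": "GENE_A", "TWAS.P": 3.4e-07},
    {"ID": "GENE_B", "TWAS.P": 2.3e-01},
    {"ID": "GENE_C", "TWAS.P": 4.0e-02},
]
hits = bonferroni_hits(rows)
command = post_process_command("scz.sumstats", "scz.chr3.top",
                               "LDREF/1000G.EUR.", 3, "scz.chr3.top.analysis")
print([h["ID"] for h in hits])
print(shlex.join(command))
```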

And this produces figures that look like this, which I think are very useful to make sense of these analyses. What this figure is showing is a Manhattan plot of the significant locus from this .top file. In gray is the original GWAS association statistic, and then in blue is the GWAS association statistic after conditioning on the TWAS-predicted gene, which is shown in green here. This is a locus with the THOC7 gene; after conditioning on the predictor of THOC7, you see that the GWAS Manhattan plot goes from significant associations beyond a p-value of 10^-8 to essentially no significant associations in this locus. So, this is a very good sign visually that things are kind of working as we expect. And when we condition on the predictor, the association goes away, which is consistent with the predictor being the mediator. You can do these analyses when you have multiple genes in the locus, you can do all sorts of fancy pairwise or individual model conditioning, but I think that this is, in addition to looking at just the raw statistics, this is a useful visual output to understand what’s going on at the locus.
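The conditioning shown in the plot can be approximated from summary statistics alone. A minimal sketch, assuming the standard formula for conditioning one Z-score on a correlated one: if r is the correlation between a SNP and the predicted expression, the conditional SNP Z-score is (z_snp - r * z_twas) / sqrt(1 - r^2). The numbers are toy values, the function name is hypothetical, and this is not presented as FUSION's exact implementation.

```python
import math

def conditional_z(z_snp, z_twas, r):
    """GWAS Z-score for a SNP after conditioning on the predicted expression.

    r is the correlation between the SNP and the predicted expression,
    estimated from the LD reference panel.
    """
    if abs(r) >= 1.0:
        raise ValueError("SNP is perfectly correlated with the predictor")
    return (z_snp - r * z_twas) / math.sqrt(1.0 - r * r)

# Toy values (assumed): a strongly associated SNP that is highly
# correlated with the predicted expression of the gene.
before = 6.0
after = conditional_z(z_snp=before, z_twas=6.5, r=0.9)
print(before, round(after, 3))
```

When the SNP signal drops toward zero after conditioning, as in this toy case, the GWAS association at that SNP is statistically explained by the predicted expression; that is consistent with mediation but does not prove it.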

Okay, so hopefully, that was a useful primer on running a TWAS analysis with FUSION. I’ll just mention that there are a couple of other analyses that you can do. We’ve worked with Nick Mancuso in this paper in 2019 on fine-mapping TWAS associations in a similar way to how you would fine-map SNPs. I would also refer you to this paper of Yao et al. in Nature Genetics looking at estimating the fraction of heritability that’s mediated by gene expression rather than just associated. Additionally, there are a lot of scripts on the FUSION website that let you predict into individual-level data, perform co-localization analyses in addition to TWAS analyses, and do more detailed comparisons and visualizations than the ones that I described, as well as cross-model correlations: what are the weights that are contributing to each model, what do the eQTLs for a model look like, and so forth. These let you really dig into an association in a visual way.

The one other thing I’ll mention is that we’ve put together an interactive website called twas-hub.org, where you can go and look at all of these associations. You can search for individual phenotypes or for genes and look at the TWAS associations and investigate which models are predictive and do all this in the web interface, which I think is also very useful to get a sense of how these methods perform. And with that, thank you very much.