Chapter 8.6: Quantitative Trait Loci (Video Transcript)
GWAS Studies and eQTL Analysis
Title: GWAS Studies and eQTL Analysis
Presenter(s): Xiaole Shirley Liu, PhD (GV20 Therapeutics)
Xiaole Shirley Liu:
GWAS Studies
Over the last decade, we have seen an increasing rate of GWAS studies. This is because people really want to know: "Based on my genetic information, can I predict a disease? How likely am I to have a disease?" They are also interested in whether there's an association with a patient's response to drugs. This has been facilitated by the drastic reduction of genome sequencing costs, and in the early days by falling microarray costs. So you can see the number of publications in GWAS is really increasing. In the early days, when the cost was high and people didn't have enough data yet, collecting sample cohorts was very time-consuming; it takes decades to collect enough patients for some case-control studies. Back then, studies looked at cohorts of only a few hundred patients. Sometimes they might find a SNP that gives them some association, and if another scientist repeats the same study, collecting another cohort of a few hundred patients, they see some SNPs, and there could be very low overlap between the two studies. This is mostly because, in the early days, there were not enough patients, and most of these individual GWAS associations have very weak effects on the disease. And because we're doing so much multiple hypothesis testing, you actually hit on things that are not really real. If you change the cohort, you'll land in some other regions. You can see here that the number of GWAS hits is directly associated with your discovery cohort size. In the early days, it could be a few hundred or a few thousand, then 20,000, and even 200,000. And now, with more and more people doing genome sequencing, you might even have huge cohorts with close to a million individuals to look at associations. The nice thing about this is, once you sequence somebody's DNA, the information is always there, and all you need to decide is which phenotype to look at.
You can see here this could be height, body mass index, the QT interval (a heart measurement), cholesterol level. You can look at hair color, eye color, other disease associations, you can look at IQ. Yeah, there are many traits, like skin color and things like that, that you can look at.
Population Stratification
So in these types of GWAS studies, to make sure that you are getting robust results, having a bigger population, a bigger cohort, is very, very important. Another important thing is to make sure that you don't have population stratification. For example, say I want to look for SNPs that can predict longevity. I collect people who live past a hundred, centenarians, versus people who died before they reached 100 years old. Of course, centenarians are hard to find, but when you collect them and do the GWAS study, you might run into hidden structure you don't know about. For example, the centenarians might be enriched in, say, a Japanese population, whereas the people who died younger are a random mix. Because of that, when you look at SNP differences, you might see, "Oh, the Asian SNPs are all showing up," and it has nothing to do with longevity; it just happened that the longevity cohort has more people of Asian or Japanese descent. So in this case, one population is overrepresented among your cases while your controls are different. What happens is that when you do the p-value calculation for the different SNPs (remember, we are doing millions of chi-squared tests), we can take the p-values of those chi-squared tests and make a Q-Q plot to compare the expected p-values against the observed p-values. If you don't have population stratification, you would see that most of the SNPs are not significant, and then suddenly a few SNPs are significant; those are real. Whereas if you see a very early divergence from the expected line, like this, that is an indication that your case and control populations have population stratification.
There can be a lot of SNPs that appear significantly different between your cases and controls simply because racial/ethnic groups are unevenly represented in the two, and so you have to correct for that and only call the remaining ones as real differences.
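One standard way to quantify the deviation seen in such a Q-Q plot is the genomic inflation factor, lambda GC. Here is a minimal sketch in Python, assuming you already have the per-SNP p-values from the chi-squared tests (the p-values below are simulated):

```python
import numpy as np
from scipy import stats

def genomic_inflation(pvals):
    """Genomic inflation factor (lambda GC): ratio of the median observed
    chi-squared statistic to the median expected under the null (~0.456)."""
    chi2_obs = stats.chi2.ppf(1.0 - np.asarray(pvals), df=1)
    return np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)

# Well-calibrated null p-values give lambda near 1; population
# stratification inflates the whole distribution, pushing lambda above ~1.05.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=100_000)
print(genomic_inflation(null_p))
```

A Q-Q plot is then just the sorted observed -log10 p-values plotted against their expected quantiles; an early departure from the diagonal is the divergence described above.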
Yeah, so this is one example. If you just run PCA on all the SNPs, you can already see on the PCA plot that there are people from different populations. So you need to correct for the population differences in your case-control studies that arise from uneven populations in the different cohorts. Another situation is that there are some unknown relatives. Maybe you have a distant cousin in the study that you don't know about, and in a case-control study, we don't want family structure; you really want unrelated individuals. There is also something called IBD (identity by descent), which gives you the chance that two individuals are from the same family. So if the IBD test shows a lot of people from the same family, you might want to remove the siblings or the cousins from the study. You really want one representative member from each unrelated family in your study.
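The PCA check can be sketched like this; the genotype matrix here is simulated from two populations with shifted allele frequencies, purely for illustration:

```python
import numpy as np

# Two simulated populations with shifted allele frequencies (hypothetical data).
rng = np.random.default_rng(1)
freqs_a = rng.uniform(0.1, 0.9, size=500)
freqs_b = np.clip(freqs_a + rng.normal(0.0, 0.2, size=500), 0.05, 0.95)
pop_a = rng.binomial(2, freqs_a, size=(60, 500))   # 60 individuals each,
pop_b = rng.binomial(2, freqs_b, size=(60, 500))   # 500 SNPs (0/1/2 counts)
G = np.vstack([pop_a, pop_b]).astype(float)

# Center each SNP, then take the top principal components via SVD.
G -= G.mean(axis=0)
U, S, Vt = np.linalg.svd(G, full_matrices=False)
pcs = U[:, :2] * S[:2]  # per-individual coordinates on PC1 and PC2

# PC1 separates the two populations; in a GWAS, these PCs are included
# as covariates (or samples are matched) to avoid spurious associations.
print(pcs.shape)
```

In practice, tools such as PLINK or EIGENSTRAT do this at scale, with LD pruning and variant standardization first.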
GWAS Catalog
And so by this time, for example, this is a figure from 2019. At NCBI, there is a GWAS catalog, and they continue to collect published GWAS studies and their SNPs, so you can see, across chromosomes 1, 2, 3, 4, and so on, the current discoveries of SNPs associated with different diseases. There are many summary statistics files that tell you that one GWAS study identified the following SNPs as associated with a disease, and another study identified a different collection of SNPs associated with another disease.
Majority of the GWAS SNPs Are Located in the Non-coding Regions
And as Dr. Fong mentioned in our lecture, of these GWAS SNPs, people found that only a very small set are actually in coding regions, because coding regions make up only 2% of the genome. A lot of these GWAS-associated SNPs land in non-coding regions of the genome, and interestingly, a lot of them are enriched in DNase hypersensitivity peaks. There could be some that are not in the peaks but in perfect LD with SNPs in those peaks. This is because sometimes the association was not done with genome sequencing; they just used SNP arrays and typed one SNP, not the SNP in the DNase hypersensitivity peak region. But the two are perfectly linked, so you suspect that the real effect is in the DNase hypersensitivity region. Only about a quarter of them are not in LD with a peak, which suggests that a lot of these disease-associated SNPs are related to a transcription factor binding to a region, which influences nearby gene expression. This "nearby" could mean 50 kb or 200 kb, quite a distal region, that influences the expression of the nearby gene, which then has an effect on the phenotype. So understanding epigenetics now becomes very important for interpreting these GWAS findings. You know, "Oh, there's a SNP here, and it's associated with the disease," but what's happening? We want to see what transcription factor binds to that enhancer and what nearby genes are linked, and then we can understand why that's related to the phenotype or the disease.
eQTL: expression Quantitative Trait Loci
Another type of study that's really useful is called eQTL analysis, and a consortium project called GTEx is really quite exciting. First, they looked at about a thousand individuals. These are not disease-specific individuals; they are just average people who have died from various causes, many of them unnatural. They donated their bodies to the project, and the project took different tissues, in this case brain, heart, and so on, 54 different tissues in total from these individuals. They did both SNP typing, using SNP arrays to look at SNPs across the genome, and also measured expression in those 54 tissues across individuals.
Here, because we are not interested in disease per se, the trait is no longer whether a person has blue eyes or black hair, fair skin, or a high body mass index. Instead, we use expression data as the phenotype. We say, "Well, for example, in the livers of a thousand people, this gene is expressed at a much higher level in some individuals but lower in others. What is the difference?"
eQTL Analysis
So the features are the SNPs, and we say, "Oh, it turns out that if they have, for example, this SNP, the expression is much higher in the liver. But if they have another SNP, the expression of the nearby gene is much lower in the liver." And so you can establish this association through something called expression quantitative trait loci, or eQTL. A quantitative trait is like height; it's quantitative, not black and white, not 1 or 2. The trait in this case is not eye color or a physical trait; it is just an expression level, and we are trying to find a particular SNP locus that is associated with a quantitative gene expression level. Using these types of eQTL analyses, again, a lot of people find that there are SNPs in nearby regions that are linked to gene expression differences, and a lot of those are also in ATAC-seq or DNase peaks, which means that transcription factor binding in distal regions might influence gene expression and eventually influence a real phenotype or trait.
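At its core, each eQTL test is a regression of expression on genotype dosage. A minimal sketch with simulated data (the effect size, allele frequency, and sample size here are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500

# Genotype dosage: 0, 1, or 2 copies of the alternate allele.
genotype = rng.binomial(2, 0.3, size=n).astype(float)
# Simulated liver expression with an additive effect of 0.5 per allele.
expression = 0.5 * genotype + rng.normal(0.0, 1.0, size=n)

# Slope = expression change per alternate allele; the p-value tests slope != 0.
res = stats.linregress(genotype, expression)
print(f"slope={res.slope:.2f}  p={res.pvalue:.2e}")
```

In a real study this test is repeated for every SNP-gene pair within some window, so the resulting p-values must be corrected for multiple testing.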
Summary
Yeah, so that's kind of the summary. In the genome, there are SNPs; they are often linked if they are in proximity on a chromosome, and there are haplotype blocks where a lot of SNPs are linked to each other. To do genome-wide association studies, you can collect big families where members are known to be affected by the disease and look at which transmitted allele made the difference, or you can just do case-control studies, normal versus disease, and check every allele to see whether there is any difference in allele frequency between the two populations. To increase the statistical power of your analysis, you can look at haplotype association rather than individual SNP association. You can increase your patient sample population; the bigger the cohort, the better the power. You also need to make sure to eliminate population stratification and remove family members who are too closely related; this applies to the case-control studies. We can also use eQTL analysis to identify SNPs associated with differential expression, just as GWAS SNPs are linked to some trait. A lot of these eQTL sites and GWAS SNPs are in non-coding regions, and we need to use epigenetic data to help us understand what's happening. Okay, that's all for today.
MPG Primer: Introduction to expression quantitative trait loci
Title: MPG Primer: Introduction to expression quantitative trait loci (2021).
Presenter(s): François Aguet, PhD (Illumina's Artificial Intelligence Laboratory)
François Aguet:
It’s a pleasure to be here. I’ve learned a ton from this series in the past, so it’s really nice to have an opportunity to contribute to it. So, what I’d like to do with this talk is give a sort of broad overview of quantitative trait loci or QTLs (I put “expression” here in parentheses), although most of the talk will be focused on expression QTLs. Many of the concepts are really generally applicable to any kind of molecular QTL.
In terms of the structure of the talk, I'll start with aspects of data normalization and covariate correction; then get into the details of QTL mapping for cis-eQTLs; then talk about other QTL types, including trans-eQTLs, splicing QTLs, conditionally independent QTLs, as well as context-dependent QTLs. And as I've mentioned, these concepts are really generally applicable. There are some topics the talk won't cover, just for the sake of time. One of those aspects, which is really key to these analyses but basically another talk in itself, is how to perform quality control of the core data types, both on the genotyping side and the RNA sequencing side. It also won't be a step-by-step, really detailed walkthrough of how to do these analyses with code examples; it's more of a higher-level overview. And it won't be a discussion of QTL results; although I'll present several examples, those will really be for illustration purposes.
So, before I start, I just want to acknowledge that pretty much everything that I’m going to present is built on work that I did and others did as part of the GTEx Consortium, also building on a lot of prior expertise of mapping QTLs in other previous studies. So, this is really the result of a large collaboration.
As most of you are probably aware, the large and still-growing number of genome-wide association studies conducted over the past decade or so has now revealed over 70,000 associations with common diseases and traits. A large majority of those are found in non-coding regions of the genome, rendering their interpretation highly challenging.
But so, the hypothesis is that the mechanism of action of these non-coding variants must, at least partially, go through modification of transcription. So, the idea behind quantitative trait loci, and specifically expression quantitative trait loci, is to identify this relationship between genetic variation and a molecular phenotype, that is, the total level of gene expression, or the alternative expression of different transcripts of a gene, and I'll get to that later in the talk. The general concept is that we want to scan the genome, either locally around genes or looking for any variant genome-wide, that might affect the expression of a specific gene. And often, these effects can be tissue-specific. So, what I'm illustrating on the slide is an example where, in the lung, we find a strong correlation between genotype and gene expression in the presence of a specific variant, assuming that there's some cofactor that drives this in the lung, whereas that cofactor is absent in the heart, where the effect is not observed. And to be able to observe these effects, we need to study large populations. In practice, this means that we generally require at least 70 samples or so to be able to reliably detect these relationships for common genetic variation. And then, the goal is, for each variant, to look for correlations between genotype and gene expression.
And so, there are many different types of these relationships we can try to identify. To start with a little bit of terminology: when we try to identify local genetic regulation of gene expression, we call these cis-eQTLs. This simply means that we're looking, typically in a 1 Mb window on each side of the transcription start site of a gene, for common genetic variants that affect the expression level of the gene. The other component is trans-eQTLs, where we're looking for distal regulatory variation, and that can mean either further away from the gene than the cis window of a megabase or, more robustly, regulation coming from a different chromosome, as illustrated in the cartoon here. And in addition to total gene expression level changes, we can make the same observation in terms of splicing. So, there could be a genetic variant that switches the expression to a specific isoform of a gene, and this can be detected in the same cis window as for total expression level, as well as in trans.
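As a concrete illustration of the cis window, here is how one might select the variants to test for a given gene (the positions are made up; the ±1 Mb window follows the convention mentioned above):

```python
import numpy as np

# Hypothetical variant positions on one chromosome (in bp).
variant_pos = np.array([150_000, 900_000, 1_450_000, 2_600_000, 3_100_000])
tss = 1_500_000          # transcription start site of the gene being tested
cis_window = 1_000_000   # +/- 1 Mb around the TSS

# Variants inside the window are candidate cis-eQTLs for this gene;
# everything else would only be considered in a trans analysis.
in_cis = np.abs(variant_pos - tss) <= cis_window
print(variant_pos[in_cis])  # [ 900000 1450000]
```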
And so, the bulk of this talk is really just answering the question: How do we identify which of these relationships are statistically significant, considering that we're doing this for every gene expressed in the specific tissue or cell context, and genome-wide when we're looking for trans associations? I'm showing two examples of visualizations that you might be familiar with if you've looked at QTL data before. Typically, for the association of a single variant and phenotype, we visualize these in box or violin plots, where in the top example you see a very strong relationship between genotype and expression level, and in locus plots, where we see the several thousand common variants in a genetic locus and their association p-values, with the LD level between variants overlaid, so variants in strong linkage are highlighted. The second example I'm showing here is much weaker: there's a slight trend towards lower expression levels as a function of genotype, and it is also much more ambiguous in the locus visualization. So, the goal of all these analyses is to identify which of these signals are significant. In the top case, the association p-value is so strong that it will survive any sort of genome-wide multiple hypothesis correction. But cases like the signal at the bottom here are much trickier. So it's really: how do we come up with a robust framework to determine the significance levels of any of these associations?
On this slide, I'm just showing a broad map of the workflow. Starting with the study data, consisting of genotypes, RNA-seq data, and sample metadata, the first couple of steps generate normalized expression matrices as well as sets of covariates to correct for unwanted variation; I'll get into this in much more detail. Once we have these inputs, we can perform the various types of QTL mapping, including the discovery of cis-QTLs, trans-QTLs, conditional QTLs, and context-dependent ones. I'll get into all of these aspects. The outputs from that are typically a number of files: one summarizing which genes have significant QTLs and the lead variants for those, and then summary statistics for all associations. Using those results, many downstream analyses can be performed, including replication, fine-mapping, and so on. The focus of the talk, again, is really on the central part here. There have been other very nice resources for some of these downstream analyses, including in this primer series.
So, if there are no questions so far, I'll start with the first section, an overview of data normalization and covariate correction. Very briefly, partly because we've developed tools for this: in terms of RNA-seq quality control, it's really important to remove obvious outlier samples that might indicate poor quality, and to identify sample swaps. In any large-scale study like this, it will invariably happen that some samples get swapped somewhere in the process, and identifying those is critical to avoid a loss of power from misassigned samples. So, if you're interested in this particular topic, feel free to get in touch with me, and I can provide more details.
The first really critical step once we have the RNA-seq data in hand, meaning essentially a quantification of read counts for every gene in every sample, is to apply normalization between these samples. If you're familiar with RNA-seq data, it's really a relative measurement, and if you conduct the experiment for several samples, there's no normalization between them. The only normalization factor is the library size, meaning the total number of read counts you get for a specific sample, and that can be experimentally variable. Some standard normalizations just rescale to units called transcripts per million, which fixes the scale but ignores potential outlier effects. So it's important to use a normalization that mitigates these, and there are several methods for this. Some of the best-performing ones are TMM and the size factors from DESeq. The idea there is to normalize all the samples to a representative sample that represents the median, or to construct such a representative sample from the set, and then to identify a scaling factor that rescales each sample relative to this reference. Just to illustrate why this is important, here is the relationship between raw library size, the total read coverage, and the scaling factor. While the correlation is fairly high and the scaling factor is close to one for most samples, there are some clear deviations, so making sure these are well corrected for is important. Among these well-performing approaches, there's relatively little difference, so any of them would be reasonable to use in practice.
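The DESeq-style size factors (median of ratios to a pseudo-reference sample) can be written in a few lines; this is a simplified sketch, not the actual DESeq implementation:

```python
import numpy as np

def deseq_size_factors(counts):
    """Median-of-ratios size factors in the style of DESeq.
    counts: genes x samples matrix of raw read counts."""
    log_counts = np.log(np.asarray(counts, dtype=float))
    # Pseudo-reference sample: geometric mean of each gene across samples;
    # genes with a zero count anywhere are excluded.
    keep = np.all(np.isfinite(log_counts), axis=1)
    log_geo_means = log_counts[keep].mean(axis=1)
    # Size factor = median ratio of each sample to the reference.
    return np.exp(np.median(log_counts[keep] - log_geo_means[:, None], axis=0))

# Toy example: sample 2 sequenced at exactly twice the depth of sample 1.
counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 80, 160]])
print(deseq_size_factors(counts))  # ratio of the two factors is 2
```

Dividing each sample's counts by its size factor puts all samples on a common scale while being robust to a handful of highly expressed outlier genes.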
In terms of this normalization, it can be applied to samples that are largely similar or only weakly perturbed relative to each other. But I don't want to give the impression that this between-sample normalization will enable comparing and combining samples from vastly different tissues or experimental contexts. There's a really nice review from a few years back that highlights these issues. The general idea is that once the perturbation of the transcriptome becomes too strong, these computational normalization approaches, which rely on having at least a mid-range of gene expression that's unperturbed between samples, start to fail.
After normalizing the samples, the next important step is identifying potential confounding effects.
Audience question: There was one question that came up based on your last statement, saying that eQTL is mainly for non-coding SNPs. Does that make eQTL and pQTL complementary to each other in some sense? Answer: That's a great question, and it's important to clarify. One of the main goals is really to facilitate the interpretation of non-coding SNPs, because identifying a functional mechanism for those is much harder. But there are definitely coding SNPs that can be both eQTLs and pQTLs. And the analysis itself, in terms of just the mapping, is actually agnostic to the specific characteristics and functional impact of a variant. So, we really take all common variants in a cis window around the gene and look for associations with a molecular phenotype. We can do this with gene expression, and we can do this with protein levels. The QTLs that regulate overall expression levels will likely be the same in both; variants that at the same time modify protein structure or function might have other downstream effects. But in terms of detecting which are significant associations between variants and the selected phenotype, the general statistical approach is the same and agnostic to the type of variant.
The next step is making sure that we're not confounded by unwanted technical or population effects. As you might know if you've worked with genome-wide association studies, many of these concerns are exactly the same. In terms of experimental batch effects, many studies have nicely shown, under highly controlled experimental protocols, how measurements can be robust or vary between different labs. Here's an example showing that when quantifying exon-level gene expression for samples sequenced at many different centers, sample identity remains very robust, but at the transcript level, there's much more variation and a bit of a trend towards center-specific effects. So it's important to keep in mind that experimental variation can add a lot of variance to gene expression, and that needs to be corrected for when we map QTLs. The same goes for population structure: we want to correct for it so as not to be confounded by effects that are correlated with population structure.
Here's one of these examples of technical effects between centers, and one of the stronger ones is minor variation in GC content that can arise as part of the library preparation protocols in RNA-seq. In this example, looking at the fragment GC content distribution, there's a bit of variation at the upper end of the GC percentage. But when we look at individual genes, this can translate to quite massive differences in coverage. In this particular example, the coverage on this exon is almost lost in one of the centers. And this means that, in terms of both global expression levels and potentially splicing differences, we might artifactually believe this represents a change even though it's entirely driven by a technical artifact.
Similarly, when working with degraded samples, such as some of the postmortem samples from GTEx, RNA can be at least partially degraded, and in extreme cases this manifests as a strong 3' bias. Since the protocol is based on poly-A selection, we see a strong enrichment of coverage towards the 3' UTR end of the transcript. That, too, is important to correct for, especially if these effects vary across samples; we want to make sure we normalize this out.
Listing these potential effects individually and trying to correct for them explicitly is really challenging because, for a large-scale study, there can be many of them, and there's no certainty that we're capturing or even aware of all of them. So, there are frameworks to do this automatically, in a sense, and try to identify latent factors that best capture this unwanted variation. There are two more points that are important to make here, one of which I'll get into in more detail. The first is that usually the strongest contributor to expression variance is actually the cellular composition heterogeneity of different tissue samples in a study like GTEx. But this can also be the case for whole blood, PBMCs, or other blood-based samples, where there's quite a bit of variability between individuals. So, to account for that and the various technical factors, and to understand what variance components we're identifying with latent factors, having access to really good sample metadata is very important. We can then disentangle what's captured by the latent factors and assign it to cell-type-specific variation or technical effects like the ones I've shown before.
And here’s just an example, sort of highlighting how strong the cellular heterogeneity differences can be between samples, shown for both heart tissue and colon in the bottom row here. So this is part of an analysis where we actually looked at the enrichment of specific cell types in these tissues, and I’ll just use the bottom example here to highlight where some of these differences are coming from. Just in terms of the tissue sampling, for these colon samples, there will be samples where almost all of the tissue collected is from the muscular layer of the colon, whereas other samples might be almost purely epithelial. So, the cell type proportions across these can vary drastically.
Audience question: François, there was a question about why you use PEER for adjusting for latent factors rather than SVA. Answer: So, PEER is essentially just a Bayesian framework for identifying such latent factors. There are many other approaches that can be used, including principal components. In the PEER paper and prior work from the same authors, they show that the PEER framework is more efficient at capturing unwanted variation with fewer components. But I've seen several QTL studies that use principal components, and there's probably not a very drastic difference between these approaches in the end. SVA is designed to capture variation that's orthogonal to a variable of interest, and it's more often used in the context of differential expression. I'll talk a little bit about context-dependent QTLs later, and SVA could certainly be a useful framework for these types of analyses. But overall, these different approaches try to do the same thing and are largely compatible.
And here, just to belabor this point in terms of quality control: for the variance captured by whichever latent factors we select, whether PEER factors or principal components, what we generally see is that the first couple of factors are really strongly correlated with the cellular composition of the tissue. What I'm showing here is an enrichment score, a correlation between an estimate of the tissue heterogeneity and the PEER factors. At the same time, these factors are only weakly correlated with known technical variables.
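This QC step can be sketched with plain principal components standing in for PEER factors; the expression matrix and cell-type fractions below are simulated:

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_genes = 80, 1000

# Simulated normalized expression with one strong latent effect driven by
# cell-type proportion (all values here are hypothetical).
cell_frac = rng.uniform(0.0, 1.0, size=n_samples)
loadings = rng.normal(size=n_genes)
E = 3.0 * np.outer(cell_frac, loadings) + rng.normal(size=(n_samples, n_genes))

# Top principal components of expression, standing in for PEER factors.
Ec = E - E.mean(axis=0)
U, S, _ = np.linalg.svd(Ec, full_matrices=False)
factors = U[:, :5] * S[:5]

# QC: correlate each latent factor with known sample attributes.
# The first factor should track the cell-type fraction.
corr = [abs(np.corrcoef(factors[:, i], cell_frac)[0, 1]) for i in range(5)]
print([round(c, 2) for c in corr])
```

The same correlation table, computed against every metadata column, is what lets you assign latent factors to cellular composition versus technical effects.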
In terms of selecting these PEER factors, what is the optimal number of factors or components that we actually want to remove from the expression data? Here it gets a little trickier, because the metadata we collect and the data we can infer from the samples, such as cell type composition, can only inform us to some degree about what the latent factors capture; there is no guarantee that we're really removing every potential confounding effect. So the strategy we typically employ is to select the number of factors that maximizes QTL discovery. Here I'm just showing this from an earlier analysis in GTEx, where the plots show an increasing number of PEER factors against the number of genes with at least one significant QTL, and this is to some degree sample size dependent. How we choose is a little heuristic; generally, we cut off these curves as they start to plateau. One danger of selecting too many PEER factors, which is stronger with principal components because an orthogonal decomposition of variance will just increasingly remove signal, is loss of power. But a bigger concern, when we're using these same factors for discovering both cis and trans regulatory variation, is that as we increase the number of PEER factors, they might actually start to capture trans effects. So one important quality control is to test for this: we run a genome-wide scan of each PEER factor to see if there's any enrichment for specific loci significantly associated with that factor. And here's just an example showing a case where we don't detect such effects.
So that covers the aspects of data normalization and covariate correction. With this, I'll get into more details of QTL mapping. The model itself is very simple: just a linear regression of genotype onto phenotype with the set of covariates I've discussed, namely the PEER factors, genotype principal components, and some technical batches we may want to include, especially if there were technical batches on the genotyping side. Importantly, we typically don't care about the coefficients of these covariates, so we just residualize the phenotype and genotype relative to the covariates using the orthogonal projection matrix, which can be pre-computed, and this residualization can be done very efficiently. In practice, rather than the box plots you typically see, the data in the residualized space generally look like this. Here's an extreme example, again of a very strong association.
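The residualization can be sketched as follows. By the Frisch-Waugh-Lovell theorem, the slope from regressing residualized expression on residualized genotype equals the genotype coefficient from the full multiple regression (the covariates and effect size here are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 5

# Simulated genotype dosages, covariates (standing in for PEER factors,
# genotype PCs, batches), and expression with a true effect of 0.8.
g = rng.binomial(2, 0.4, size=n).astype(float)
C = rng.normal(size=(n, k))
y = 0.8 * g + C @ rng.normal(size=k) + rng.normal(size=n)

# Orthogonal projection onto the complement of the covariate space:
# M = I - X (X^T X)^{-1} X^T, pre-computed once and reused for every
# gene/variant pair.
X = np.column_stack([np.ones(n), C])  # covariates plus an intercept
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
y_res, g_res = M @ y, M @ g

# Slope of residualized expression on residualized genotype.
beta = (g_res @ y_res) / (g_res @ g_res)
print(round(beta, 2))
```

Because M depends only on the covariates, it is computed once per tissue, which is what makes scanning millions of variant-gene pairs fast.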
Audience question: Francois, there was one question about when you add additional PEER factors to maximize the number of eQTLs found, how do you differentiate between adding true positives and false positive discoveries? Answer: Yeah, that's a good point. The advantage of the PEER factors is that once you add too many, they tend to be correlated and will, in most cases, add noise and lead to loss of power rather than false positives. At least that's what we've seen in practice. But it's an important point; there's no guarantee that this will never be the case. The best sanity check is really looking for trans signals that might get captured, in which case we know that we're starting to remove real effects.
And so then here I just wanted to give an intuition of going back to this question of how do we identify a significant signal. I wanted to give an intuition of how this is done. So, in a specific locus, we can look for the strongest association between any variant and gene expression. But how do we know that this didn’t just arise by chance for this particular gene? So what we do is just apply a permutation strategy where we permute the sample labels and then retest this. And ask under this permutation, how often do we actually see a strongest association that’s at least as strong as the one we saw with the unpermuted labels? And that’s sort of illustrated here. So, we typically run a few thousand permutations, and then based on that, we employ an approximation strategy that extrapolates from this distribution, which can be modeled with the β distribution, to get a more accurate estimate of the empirical p-value. And we use these p-values to determine which genes have at least one significant signal.
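As a rough sketch of this permutation scheme on toy simulated data (the beta-distribution extrapolation used in practice to refine the tail estimate is noted but not implemented here):

```python
import numpy as np

def lead_pvalue_via_permutation(G, y, n_perm=1000, seed=1):
    """Empirical p-value for the strongest variant-phenotype association
    in a locus: permute sample labels and ask how often the permuted
    lead |correlation| is at least as strong as the observed one.
    (In practice a beta distribution is fit to the permutation lead
    statistics to extrapolate more accurate p-values.)"""
    def lead_abs_r(yy):
        Gc = G - G.mean(axis=0)
        yc = yy - yy.mean()
        r = (Gc * yc[:, None]).sum(0) / (
            np.sqrt((Gc**2).sum(0)) * np.sqrt((yc**2).sum()))
        return np.abs(r).max()
    rng = np.random.default_rng(seed)
    observed = lead_abs_r(y)
    perm = np.array([lead_abs_r(rng.permutation(y)) for _ in range(n_perm)])
    # add-one correction keeps the estimate away from exactly zero
    return observed, (1 + (perm >= observed).sum()) / (n_perm + 1)

rng = np.random.default_rng(0)
n, m = 150, 50
G = rng.integers(0, 3, size=(n, m)).astype(float)   # 50 variants in the locus
y = 0.8 * G[:, 10] + rng.normal(size=n)             # variant 10 is causal
obs, p_emp = lead_pvalue_via_permutation(G, y)
```

Note that taking the maximum over all variants inside each permutation is exactly what accounts for the multiple testing within the locus.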
Here's a couple of examples of this. One case where there's a very localized signal that's strongly significant after permutation. Another case where there's a lot of LD, so essentially across almost two-thirds of the region there are variants that are strongly correlated and tag the same signal. But this, too, is very significant under the permutation. And here's a much more marginal association, where the nominal p-value is still relatively strong, but even visually the signal in the locus is less clear. And with the empirical p-value, we also see that this is not significant by any means.
Audience question: Francois, there is a question just about how you actually do the permutation and compute the signal after the permutation has been completed. Answer: So, the first scan is in the locus. We have a few thousand variants; we compute the correlation between the genotypes for each variant and the phenotype, and then we pick the lead variant. This relies on the ordering of the mapping between phenotype and genotype sample labels, and the permutations are just a scrambling of these labels. Then we perform exactly the same calculation. So, in this plot, the orange line indicates the p-value we had for the non-scrambled labels, and each scrambled p-value contributes to this gray histogram; that's the null distribution. And then we extrapolate this to get more accurate p-values. But it's really the permutation that builds the null distribution. And we do this because, although these plots are all drawn on the same scale, these distributions vary highly across genes, so we really have to do this for every gene individually.
Once we’ve done this, we can apply FDR (false discovery rate) across all genes, and we typically do this using Storey q-values. But other corrections such as Benjamini-Hochberg would also be appropriate. As I mentioned, we do this using these beta-approximated empirical p-values.
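For concreteness, here is a small sketch of the Benjamini-Hochberg step on a hypothetical vector of per-gene empirical p-values (Storey's q-value additionally estimates the proportion of true nulls, which I don't reproduce here):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for a vector of per-gene
    empirical p-values. Storey's q-value additionally estimates the
    null proportion pi0; BH corresponds to the conservative pi0 = 1."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downward
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(q, 0, 1)
    return out

p = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60, 0.74, 0.91]
q = benjamini_hochberg(p)
```

At an FDR of 5%, the genes with q below 0.05 would be declared eGenes.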
The same sort of framework can be used for replication. This also answers one of the earlier questions about how we know whether we're capturing a true signal or whether the addition of PEER factors adds false positives: one way, which also has some caveats, is to look for replication in an independent study. So here I'm just showing examples from GTEx where, for tissues that had a paired tissue type in the Twins UK study, we wanted to see how the p-values from GTEx replicate in those. What we're looking for here is essentially enrichment: assuming that the signals discovered in GTEx are real, we now want to see an enrichment of small p-values for all these gene-variant pairs, and that's what these histograms are showing. The histograms show the replication p-values, which tend to all be very small, so less than 0.05. But to quantify this, we can use the same strategy behind the q-values, which is to estimate the proportion of true positives.
Now, linking this back to both in terms of the FDR thresholds and the permutation scheme to nominal p-values, I’ve shown you examples showing that these empirical p-values can vary drastically between different genes. But in practice, we often get asked, if we calculate the nominal p-value threshold based on that, couldn’t we just use this as a sort of general cutoff for genes to avoid having to do all these permutations? And the answer to this is shown in this graph and it’s essentially no, because there’s quite a bit of variability, especially for a small number of outlier genes in this space that have really very different cutoffs corresponding to a set FDR level. And this has a little bit of an association with sample size, which is increasing here, but only vaguely.
Everything I’ve described so far in terms of the mapping relies on the normalized gene expression space that was described earlier. While this is great for performing the associations robustly, the effect sizes that come out of this are not necessarily meaningful or interpretable. So what we actually want to know and quantify is the relative cis-regulatory effect of the alternative allele compared to a reference. And the way we quantify this is as allelic fold change, which is the definition of this shown here. So, it’s essentially the log fold change of expression of the alternate allele over the reference. To compute this, we actually have to work with the untransformed gene expression data, just with the read counts that have been corrected for between-sample variation but without further normalization. And this means that we are essentially dealing with a model that has now multiplicative noise instead of the additive noise of the simple regression model. To solve this, we can apply iterative approaches, but these are not necessarily tractable for the large number of associations that we want to compute this for, and there are now efficient linear approximations to speed this up.
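The definition can be made concrete with a toy sketch on simulated counts. This is the simple linear approximation (intercept and slope of counts on dosage), not the full iterative aFC estimator:

```python
import numpy as np

def allelic_fold_change(counts, g):
    """Linear-approximation estimate of allelic fold change (aFC).
    Model: E[counts | g] = e_ref*(2 - g) + e_alt*g, so the expected
    count at g=0 is 2*e_ref and at g=2 is 2*e_alt, and
    aFC = log2(e_alt / e_ref). An OLS fit of counts on dosage gives
    intercept c0 ~ 2*e_ref and slope c1 ~ e_alt - e_ref.
    Illustrative sketch only, not the full iterative estimator."""
    c1, c0 = np.polyfit(g, counts, 1)
    return np.log2((c0 + 2 * c1) / c0)

rng = np.random.default_rng(0)
n = 500
g = rng.integers(0, 3, size=n).astype(float)
e_ref, e_alt = 10.0, 25.0                       # true aFC = log2(2.5)
mu = e_ref * (2 - g) + e_alt * g
counts = rng.poisson(mu).astype(float)          # count noise on the raw scale
afc = allelic_fold_change(counts, g)
```

The key point is that this works on the untransformed counts, so the estimate has a direct interpretation as the expression ratio of the two alleles.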
Here's just an example of what we can learn when we have these allelic fold change effect sizes in hand. One important takeaway is that as we increase sample size in QTL studies, we tend to discover increasingly small effects. So, the discovery of large effects saturates with a few hundred samples, and then as sample size increases, we detect increasingly smaller effects.
Just briefly now, after going over the bulk of cis-QTL discovery, I want to mention a few points about other types of QTLs. One is splicing QTLs: instead of just quantifying total read counts for every gene, we can quantify splicing phenotypes. One really nice approach is described in the paper on the LeafCutter method, which computes intron excision ratios and groups them by connected components. So, groups of introns that share pairs of junctions essentially get clustered together and quantified together, and these ratios get computed for all of these clusters. In terms of actually mapping QTLs for these, it's not fundamentally different from expression QTLs, but we want to treat this as a grouped phenotype: to identify whether there's a significant splicing QTL at the gene level, we're now essentially testing each of these clusters for the gene. And since we're always picking the smallest p-value, we want to make sure that, now that we're testing multiple phenotypes for each gene, the selection of the smallest p-value among n phenotypes is also corrected for as part of the permutation scheme. And that's really all that's different for splicing QTL discovery.
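To illustrate the grouped-phenotype correction on toy data (function names are my own): the key point is that the minimum over both variants and all phenotypes in the group is recomputed inside every permutation.

```python
import numpy as np

def grouped_lead_stat(G, Y):
    """Lead |correlation| across all variants AND all phenotypes in a
    group (e.g. the intron excision ratios of one LeafCutter cluster)."""
    Gz = (G - G.mean(0)) / G.std(0)
    Yz = (Y - Y.mean(0)) / Y.std(0)
    R = Gz.T @ Yz / G.shape[0]          # variants x phenotypes correlations
    return np.abs(R).max()

def grouped_empirical_p(G, Y, n_perm=500, seed=0):
    """Permutation p-value that corrects for taking the smallest p-value
    over n phenotypes as well as over variants: the same max-|r| (i.e.
    min-p) is taken inside every permutation."""
    rng = np.random.default_rng(seed)
    obs = grouped_lead_stat(G, Y)
    n = G.shape[0]
    hits = sum(grouped_lead_stat(G, Y[rng.permutation(n)]) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (n_perm + 1)

rng = np.random.default_rng(1)
n, m, k = 120, 30, 4                    # samples, variants, phenotypes in group
G = rng.integers(0, 3, size=(n, m)).astype(float)
Y = rng.normal(size=(n, k))
Y[:, 2] += 0.7 * G[:, 5]                # one phenotype in the cluster is genetic
p = grouped_empirical_p(G, Y)
```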
Without going much into detail here, an important aspect of splicing QTLs is that there’s now a more subtle technical effect that can happen during the RNA-seq alignment, which is called the “allelic mapping bias.” So, if a specific RNA-seq read contains a variant allele, this might actually mean that it aligns to a different place in the genome than if it contained the reference allele. This is generally a weak bias, but specifically, if it affects a splice site, this could actually induce a false positive splicing QTL and needs to be corrected for. Just something to keep in mind if you’re interested in splicing QTL analysis.
Audience question: So there was a question about the kind of grouping and clustering that you observe with the splicing QTLs. I guess, do you have any more insights into how they cluster or what the biology behind it might be? Answer: This is not really biological; this is essentially just a computational strategy to robustly identify these associations. We want to make sure that we’re not inflating the p-values by selecting the smallest among many. But the biological interpretation here is more complex. Essentially, what comes out of this analysis is a SNP that’s associated with alternative exon or transcript usage within the patterns detected by this method. So, it does miss some alternative splicing phenotypes, such as alternative UTRs and polyA tails, and so on. But within that framework, this will detect changes; however, to identify the specific change, one needs to go back to the raw data and isoform assembly data to figure out what the switch exactly corresponds to.
Now, just very briefly about trans-QTLs. In a way, detecting trans-QTLs is simpler than cis-QTLs because we're just performing an association scan, looking for a correlation between a specific splicing phenotype or gene expression and all variants across the genome. So here it's more about computational efficiency and being able to conduct these scans. But with sample sizes such as GTEx's, we're still very underpowered to detect these effects; we have only detected a few dozen of these trans-genes in GTEx. Because of this limited power, there are alternative strategies, where one can take trait-associated variants and then specifically test those for trans effects.
The reason I wanted to quickly bring up some details about trans-effects is that there’s an important technical confounder that needs to be corrected for. Between two genes, there might be a region of homology, and reads that originate from one gene might erroneously map to another. In the case of trying to detect trans-QTLs, this means that a cis-regulatory variant that affects a gene containing such a region might then actually artificially drive a trans association just because the reads from the cis gene map to the trans gene, and that association gets picked up.
Audience question: Thanks, Francois. So, everyone's very interested in your approaches, and the question was: when looking at these trans-eQTLs, what's the relative power you need to detect a trans-eQTL versus a cis-eQTL in terms of sample size? Answer: I don't have a good answer for this, and I actually don't have the relevant plot in my slides. The first figure of a typical QTL paper, especially from GTEx, usually shows the correlation between sample size and the number of eGenes discovered, which gives these growing and now plateauing curves. We haven't observed something like this for trans. There's a slight indication that this starts to grow for the two tissues highlighted here, thyroid and skeletal muscle, which are among the tissues with the largest sample sizes. But it's difficult to extrapolate from this how far we'll really need to go to detect these effects. And there are other confounders that we're just starting to explore; many of these trans-eQTLs could actually be driven by cellular composition effects. This is definitely something that's still being worked on, and hopefully, as large-scale studies get done, we'll learn a lot more about these and be able to discover a lot more of these effects.
So, just being aware of the time, one more type of QTL mapping that I wanted to describe is conditionally independent QTLs and how to detect allelic heterogeneity. By applying this type of conditional analysis, we're able to detect multiple independent signals. Here's an extreme example of this: in the top row, I'm showing the bulk association signal, and the bottom two rows show a conditional analysis. Just as a reminder, in this context, "conditional analysis" means that once we find a lead variant associated with the gene, we add it to the covariates, essentially regressing out the effect of that variant, and in the residuals ask whether another signal remains that also passes our significance threshold. In this example, there are two LD blocks that are perfectly separated, so if we regress out the signal from the first variant, a strong secondary signal remains. What's nice in this particular example is that we're then able to map this back to effects originating in different cell types.
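The conditioning step itself is just another residualization. Here's a minimal sketch with two simulated independent variants (names like `regress_out` are my own):

```python
import numpy as np

def regress_out(y, x):
    """Residualize y against a single variant x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ beta

rng = np.random.default_rng(0)
n = 300
g1 = rng.integers(0, 3, n).astype(float)   # lead variant, first LD block
g2 = rng.integers(0, 3, n).astype(float)   # independent secondary signal
y = 0.6 * g1 + 0.4 * g2 + rng.normal(size=n)

r_lead = np.corrcoef(g1, y)[0, 1]
y_cond = regress_out(y, g1)                # condition on the lead variant
r_lead_cond = np.corrcoef(g1, y_cond)[0, 1]    # lead effect is gone
r_secondary = np.corrcoef(g2, y_cond)[0, 1]    # secondary signal remains
```

Iterating this (add each new lead variant to the covariates, rescan, stop when nothing passes the threshold) yields the set of conditionally independent signals.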
On that note, I just wanted to briefly give a last section on identifying context-dependent QTLs. This can mean many things, so the focus of what I’ll cover will be trying to identify cell type-specific effects, but this could be really any sort of context, meaning differences between genders, dynamic QTLs responding to a specific stimulus, for example, during development, and so on.
One way to identify these effects is using interaction models. So now, complementing the original simple model where we just looked for correlation between genotype and phenotype, we add an interaction term, this genotype-by-context cross term here, with again a set of latent factors and covariates. In terms of intuition, what this means is that for samples where this effect is very weak or not present, you'll essentially see no difference in total RNA-seq read coverage between the three genotype groups, illustrated in gray, blue, and red here. But as this effect becomes stronger, for example if there's a strong enrichment in samples for a specific cell type, the coverage difference and the QTL effect become more apparent. We can also visualize these in scatter plots, where, as the cell type enrichment grows, we see a separation between the different genotypes and the QTL effect becomes apparent. We've extensively applied this to try to identify cell type-specific effects, and this general idea of identifying context-dependent QTLs was already proposed in an earlier paper by Lude Franke's group.
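The interaction model can be written out as a small sketch, with a simulated genotype and a hypothetical cell-type proportion as the context variable (toy data, not any specific study's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
g = rng.integers(0, 3, n).astype(float)      # genotype dosage
c = rng.uniform(0, 1, n)                     # e.g. estimated cell-type proportion
# The eQTL effect only appears in samples enriched for the cell type:
y = 1.0 + 0.8 * g * c + 0.3 * c + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, genotype, context, genotype-by-context cross term
X = np.column_stack([np.ones(n), g, c, g * c])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
b_interaction = beta[3]   # the coefficient of interest; true value here is 0.8
```

In a real analysis the covariates (PEER factors, genotype PCs) would be included in the design as well, and a test on the interaction coefficient identifies the context-dependent QTLs.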
Here’s an example of the cartoon I showed in the previous slide, where at the bottom, for samples where there’s a low proportion of keratinocytes, there’s a relatively weak difference in coverage across the exons of this gene for the different genotypes. This effect becomes much more pronounced with samples with high keratinocytes, and we also see this in the scatter plot.
Why is this important, and why do we think this is an exciting direction to go in? Because when we started to incorporate cell type-specific effects, we were able to find a lot more colocalizations with GWAS signals. I'm rushing a little bit here towards the end because I really wanted to give a broad overview of the different types of analyses, but what this talk leaves out is a very important aspect of downstream analysis. Once you have your QTL results in hand, meaning the different eGenes and the conditionally independent signals, an important next step in many analyses will be to fine-map these signals and then find colocalizations with results from other studies, including GWAS. Here's an example showing this: typically, for a given locus and an example gene, we would try to identify a QTL that underlies the association that we see with the GWAS signal. The example here shows that we were not able to identify such a colocalization with the bulk QTL signal. But when we apply these context-dependent analyses and try to identify cell type-specific effects, in this case looking for QTLs that depend on the enrichment of neutrophils, we actually pick up a QTL signal that colocalizes very well with the GWAS. When we apply this across all the blood genes in GTEx, and also across all tissues, we see a very strong gain in the colocalizations that we are able to identify with these context-dependent QTLs, indicating that this is an important direction to pursue. Others have had results along these lines, of course. The way forward will be to go beyond just bulk tissue studies and try to increasingly identify cell type-specific effects.
Very briefly, all I've shown before was conceptual. Here's a shameless self-promoting plug: we've built software that allows conducting these associations extremely efficiently by leveraging GPUs, called tensorQTL. It builds on previous work, essentially extending what fastQTL, a previously widely used mapper, has been doing, but it implements a lot of different modes; essentially all the types of QTL mapping that I've described are implemented in the software. So, if you have any questions about this, feel free to reach out.
I think the really exciting direction for the field now is to go after identifying more cell type and context-dependent QTLs. This is more on the experimental design side of QTL studies, but also increasing sample sizes, especially to identify more trans effects. There are large-scale biobank efforts such as TOPMed, which are generating a lot of multi-omic data, which will provide really exciting opportunities to do these analyses. At the same time, these new data types and larger study sizes will require improved statistical methods to deal with the challenges that will arise from these data types.
I’ll just end here with the slide containing different resources to follow up on this. Of course, the GTEx portal contains all the results from GTEx and has nice visualizations that allow you to explore the QTLs. A really nice resource is the eQTL catalog, which aggregates a lot of QTL studies and makes all the summary statistics available. All the pipelines and software that are behind the analyses and methods that I’ve described in the talk are publicly available at these two links. If you have any questions, please reach out and get in touch. Thank you so much.
Audience question: Francois, that was an excellent and tightly packed review of quantitative trait loci across many different situations. Just as one final question before we end: How do you think this applies to the study of rare variants and their contribution to human disease? Answer: So, I've actually glossed over this a little bit. In terms of the genotype QC at the beginning, these types of studies, depending on sample size, are typically restricted to common variants with a minor allele frequency of 1% or higher, or even 5% if the sample sizes are small. In terms of identifying rare variant effects, for GWAS, I'm much less familiar with those approaches, but there are aggregate tests to gain power to identify the effect of rare variants, and that's certainly an idea that I think the QTL field will move toward. So, as sample sizes from biobank-scale studies in blood hopefully go towards tens of thousands of samples, those are models that we'll be able to start exploring, including just lowering the minor allele frequency threshold to go to slightly more rare variants in the…
eQTLs, genes, and molecular networks
Title: Introduction to expression (e)QTL & their role in connecting QTL to genes and molecular networks.
Presenter(s): Laura Saba, PhD (Department of Pharmaceutical Sciences, University of Colorado)
Laura Saba:
Thank you, thank you. So, just to touch on a few housekeeping things, this is the third webinar in the "Quantitative Genetics Tools for Mapping Trait Variation to Mechanisms, Therapeutics, and Intervention" series. It's sponsored by the NIDA Center of Excellence in Omics Systems Genetics and the Addictome. Today, we're going to be talking about expression QTL and how we connect those to genes. So, we sent out a survey a few weeks ago after our second webinar, and one of the really helpful suggestions from that survey was to aim for a presentation that's close to an hour and then add an extra 30 minutes for those who want to stick around, ask questions, and have more of a discussion. So, that's how we'll work it today. Feel free, if you've only got an hour, to log off at the end of the talk, or feel free to stay on and talk in more detail about some of the things that we're discussing in these webinars. And I do need to apologize for sending out the wrong link; for those of you who had to register twice because of my email, lesson learned about copying and pasting. The slides are now available on GitHub, and the video and the slides will be available online at the Oak Harwell website within the next couple of days. We'll do another webinar in a couple of weeks, so we'll send out information about that so that you can register and join in if you'd like. I think that about sums up the housekeeping items, except that I will be sending out another quick survey, just three questions, in case you want to give us feedback on the format or suggest other topics you'd like to see covered. So, let's jump right into it. I'm Laura Saba, I'm at the University of Colorado, and I'm co-director of the NIDA Center for Omics Systems Genetics and the Addictome. I'm going to be talking today about expression QTL.
Quantitative Genetics Tools for Mapping Trait Variation to Mechanisms, Therapeutics, and Interventions Webinar Series
To remind you what this whole webinar series is about, we're trying to traverse this path in a forward genetics approach where we start off with a phenotype and genotype, identify the QTL, and then identify genomic regions, genes, and pathways that could be responsible for the association between the DNA variant and our phenotype of interest. I wanted to start off by recapping what we've already presented, to get everybody on the same page and give you a reminder of the things we covered before. If you attended the first webinar, that was Saunak, who gave us a great introduction to quantitative trait loci analyses in general. He told us what they are, what genome scans are, and how we find these QTLs through markers or DNA variants to establish the link between DNA and our phenotype.
Then, for the webinar two, Rob built on this knowledge of QTLs and showed us how to do it within GeneNetwork. So, he gave us a review of rodent models and human models looking at substance use disorder. He walked us through on GeneNetwork: how we can take the phenotype data, do some initial preliminary glances at it, like looking at normalization and blocking and distribution, and then go ahead and do the actual mapping of the QTLs right on GeneNetwork. And then he ended the presentation by observing what genes were physically located underneath these QTLs that were identified. Sorry, I keep pushing the wrong button.
Outline
So, now I’m going to take over here. I’m going to talk about a slightly alternative pathway to identifying genes underneath a QTL and start talking about how we can incorporate RNA expression to link that genomic region or those QTLs with genes. And so, we’ll go over the first half will be about expression QTLs for individual genes or individual transcripts. And then we’ll talk about taking that a step further and looking at co-expression networks and how we can link co-expression networks to regions of the genome or to QTLs of particular phenotypes.
Strategies for linking QTL to genes of interest
So, we’ll start with an introduction to eQTLs. Like Rob mentioned in the GeneNetwork analysis, he looked at 15 genes physically located in…
[audio cuts out for 3 minutes]
Why RNA expression?
And so, a couple of reasons why we're interested in RNA expression: it really is one of the first quantitative links between DNA sequence and phenotype. DNA sequence differences are usually yes/no, heterozygous or homozygous, whereas when we get to RNA, we really are talking about levels and slight differences that may have a large impact down the road. It's one of the first steps where DNA sequence and environment interact, and we can see that right away with differences in RNA expression in different types of cells and different types of tissues. And RNA expression, to that point, can help us differentiate cell types and tissues within individuals. So, incorporating this information into our analysis of how a causal DNA variant can affect our phenotype is also going to bring in knowledge about relevant cell types and relevant tissues. And finally, what we'll touch on at the end of this presentation is that, by using network analyses on these transcription levels and their natural relationships to each other, we can also gain a little insight into genetic and environmental interactions.
RNA as a mediator of the effect of a DNA variant on a phenotype
So, when we start bringing RNA into this analysis of translating our QTLs into genes and mechanisms, we need to think of RNA as a mediator of the effect of a DNA variant on a phenotype. So, in our phenotypic QTL, we see this relationship, this dotted line between DNA and phenotype, and we really don’t know what’s going on in between. By adding this element of RNA expression, we can start to understand whether the DNA variant causes differences in RNA expression that, in turn, influence the phenotype.
So, by incorporating this concept of RNA acting as a mediator of that relationship between DNA and the phenotype, we can add to our information about translating that QTL. Instead of looking at genes physically located under the QTL, what we can do is look for RNA expression levels that are controlled by the same QTL that controls the phenotype.
So, when we start to think about some examples of how this would actually work in the cell, we bring back those effects that we talked about that were physically located. So, again, a DNA variant in a transcription factor binding site or an untranslated region that affects levels, but we also incorporate information on what I’m calling, in this context, an “indirect effect”. So maybe the causal variant is in a transcription factor rather than in the transcription factor binding site, so the variant is causing differences in the transcription factor, which then causes differences in expression of many other genes.
Genetic differences in RNA expression
And so, one of our first assumptions, if we're going to bring RNA into this model of how DNA affects a phenotype, is that RNA expression is genetically controlled. From a very simplistic view, we think about heritability. A lot of times, we'll calculate heritability on phenotypes for humans and for rodent models. In the most simplistic case, in an inbred panel of animals, we can calculate heritability relatively easily. We can look at the variation within an inbred strain - so within animals with the exact same genetic background - and compare that to the variability across strains. This gives us an idea of how much of the variability in RNA expression is controlled by genetics, or by differences in the DNA. If you look on the left-hand side of this graphic, we see a toy demonstration of strongly heritable gene expression. Along my y-axis are expression levels, and each of the colors along the x-axis represents a different strain. We can see that there's very little variation within a strain: when animals have the same genetic background, the RNA expression levels are fairly stable. But then we see relatively big differences when we look across strains. So, when we do vary the genetic background, we get strong differences. Whereas for a non-heritable RNA transcript, we see a lot of variability within strain, and just as much variability within strain as we see across strains. An RNA that has little or no heritability isn't a good candidate for the mediator of a DNA variant's effect on our phenotype. So, we're looking for transcripts with this strong heritability characteristic.
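This within-versus-across-strain idea can be sketched numerically. Here the simple ratio of between-strain to total sum of squares (a one-way ANOVA R-squared) stands in for a proper variance-components heritability estimate, on made-up strain data:

```python
import numpy as np

def broad_sense_heritability(values_by_strain):
    """Toy heritability estimate for an inbred panel: the proportion of
    total variance explained by strain (between-strain SS / total SS),
    i.e. the R^2 of a one-way ANOVA on strain. Illustrative only."""
    all_vals = np.concatenate(values_by_strain)
    grand = all_vals.mean()
    ss_total = ((all_vals - grand) ** 2).sum()
    ss_between = sum(len(v) * (np.mean(v) - grand) ** 2
                     for v in values_by_strain)
    return ss_between / ss_total

rng = np.random.default_rng(0)
# Heritable transcript: strain means differ, little within-strain noise
heritable = [m + rng.normal(scale=0.3, size=8) for m in [1.0, 3.0, 5.0, 7.0]]
# Non-heritable transcript: same mean everywhere, all noise within strain
flat = [4.0 + rng.normal(scale=2.0, size=8) for _ in range(4)]

h2_high = broad_sense_heritability(heritable)
h2_low = broad_sense_heritability(flat)
```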
Mapping expression (e)QTL
When we go to map expression, we do it in a very similar way to what we did for the phenotypes that Saunak talked about a couple of webinars ago. So, we simply map the RNA as a quantitative trait against our DNA variants; again, we test for an association between each of our markers or SNPs and our RNA expression levels. The big difference really comes in the computation of it all, and this is something to keep in mind when picking out tools to run these analyses: instead of running one analysis rather quickly, we now need to run 30,000 genes, so we have to be very intentional about computational efficiency. I also wanted to make a little note: early on, when eQTLs were first discussed, this field was given the name "genetical genomics." We still hear that off and on in more modern times, but originally it was called genetical genomics, the study of the genetics that drive RNA expression.
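As an intuition for the computational side, the whole scan reduces to a single matrix product once genotypes and expression are standardized. A toy sketch with simulated data (much smaller than a real 30,000-gene panel):

```python
import numpy as np

def correlation_matrix(G, E):
    """All variant-by-transcript correlations in one matrix product:
    standardize the columns, then R = G' E / n. This vectorization is
    what makes scanning tens of thousands of transcripts tractable."""
    n = G.shape[0]
    Gz = (G - G.mean(0)) / G.std(0)
    Ez = (E - E.mean(0)) / E.std(0)
    return Gz.T @ Ez / n

rng = np.random.default_rng(0)
n, m, k = 100, 200, 50                  # samples, variants, transcripts
G = rng.integers(0, 3, size=(n, m)).astype(float)
E = rng.normal(size=(n, k))
E[:, 7] += 1.2 * G[:, 42]               # plant one strong eQTL-like signal
R = correlation_matrix(G, E)            # 200 x 50 correlations at once
i, j = np.unravel_index(np.abs(R).argmax(), R.shape)
```

The strongest entry of R recovers the planted variant-transcript pair; p-values then follow from the usual correlation-to-t transformation.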
cis vs. trans eQTL
We can also classify these eQTLs into two different categories. So oftentimes, we call them cis-eQTL or trans-eQTL, but “local” versus “distal” is a more accurate term for it. In a cis-eQTL or local eQTL, the locus that controls transcription is near the physical location of the gene in the genome. It’s right in that promoter region, it’s right in that untranslated region. It’s something that has a direct effect on the gene of interest. Whereas a trans-eQTL can be further away on the chromosome or can be on a completely different chromosome than what the gene is physically located on.
So if we come back to our example of a transcription factor, if there is a variant that was affecting the transcription level of a transcription factor on a different chromosome, but that transcription factor was necessary for the gene of interest to be expressed, that would end up looking like a trans-eQTL.
A couple of characteristics that we typically see on cis- versus trans- eQTLs: cis-eQTLs tend to be much stronger. So they tend to have lower p-values, higher statistical significance than our trans-eQTLs. We can rationalize that based on these mechanistic effects that we’ve been thinking about and use to conceptualize eQTLs. So, if the effect is more direct to the gene and doesn’t have to act through an intermediate, then it’s likely to have a stronger statistical effect than it would if it has to go through another transcript, another protein, or a more indirect route to controlling that expression.
eQTL hot spots
With trans-eQTLs, we also often see hotspots. A hotspot, as it's called, is a locus that acts as an eQTL for many genes that are not physically located there. So, in this graphic up here, we have a single element on a particular chromosome that actually controls multiple genes on multiple different chromosomes. We can also conceptualize it, again going back to my favorite example of a transcription factor: if the red square or red triangle here is a transcription factor physically located at its own cis-eQTL, it may be creating a trans-eQTL for many genes. So, these blue circles represent genes that aren't physically located near the eQTL but are controlled from that region.
Tissue-specificity of eQTL
We talked a little bit about one of the benefits of using RNA expression to discover genes related to phenotypic QTLs: namely, that we can gain information about tissue and cell type. This graphic is from the GTEx project, which looked at cis- and trans-eQTLs across many human tissues. A couple of things to point out: the upper triangle here represents the cis-eQTL correlations across tissues, and the bottom triangle represents trans-eQTLs. A few things stand out fairly quickly. First, tissues that are similar tend to have similar eQTLs, both cis and trans. This box here represents different brain regions, and some of the other boxes along the axis represent sets of highly related tissues; again, you can see the brighter colors in these boxes. Second, brain regions tend to be similar among themselves but rather different from the other tissues. And finally, as we expected, cis-eQTLs tend to be more conserved across tissues than trans-eQTLs; you can see that in the brighter colors in the upper triangle versus the lighter colors in the lower triangle, and in the difference in the color scales. The other interesting thing about tissue specificity: when they compared cis-eQTLs to see whether they were shared across tissues, it was immediately apparent that the distribution of the number of shared tissues is very bimodal. Cis-eQTLs tend to be present in all or most tissues, or in just one or a couple of tissues, with not very much in between. And finally, cis-eQTLs tend to have larger effect sizes than trans-eQTLs and tend to be more consistent across many tissues.
Tools for mapping eQTL
We’ve been touching on this a little bit throughout: the tools for mapping eQTLs are similar to any of the tools used to map a behavioral or physiological QTL, but efficiency is of utmost importance. You can still use R/qtl, and there are other tools, such as QTLReaper and Matrix eQTL (another R package), that can do this rather quickly. And when we’re doing QTL analyses with population structure, there are several software packages out there that make this more computationally efficient, so they can run these mixed models quickly.
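To make concrete what these tools are computing, here is a minimal sketch of the core test behind eQTL mapping: a linear regression of one transcript’s expression on allele dosage at one SNP, using simulated data. Real tools like Matrix eQTL vectorize this over all transcript-SNP pairs and handle covariates; this Python sketch shows only the single-pair test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                              # individuals (e.g., RI strains)
genotype = rng.integers(0, 3, size=n).astype(float)  # allele dosage 0/1/2 at one SNP
# Simulate one transcript whose expression depends on that genotype:
expression = 1.5 * genotype + rng.normal(0.0, 1.0, size=n)

# Center both variables and fit expression ~ genotype by least squares.
g = genotype - genotype.mean()
e = expression - expression.mean()
beta = (g @ e) / (g @ g)                # effect size per allele copy
resid = e - beta * g
se = np.sqrt((resid @ resid) / (n - 2) / (g @ g))
t_stat = beta / se                      # large |t| -> candidate eQTL
print(f"effect = {beta:.2f}, t = {t_stat:.1f}")
```

Scanning 30,000 transcripts against 10,000 variants is 300 million of these tests, which is why the vectorized implementations matter.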
Databases/Websites for Rat eQTL
This does require a lot of computation if you’re thinking about 30,000 transcripts tested against 10,000 variants, so a number of databases have been built for rat eQTLs, and we’ll talk particularly about rat because that’s what our NIDA center happens to be focused on. Both PhenoGen and GeneNetwork have several rather large datasets of these eQTLs. And that’s one of the benefits when incorporating them into these studies: if you’re measuring a phenotypic QTL, you can actually use eQTL information from different populations or different labs to link to it.
Summary of intro to eQTL
Just to take a pause here and summarize what we learned in this intro to eQTL: expression QTLs are simply regions of the genome that influence RNA expression levels, that is, the genetically driven variation in RNA expression. When we separate them into local and distal eQTLs, local eQTLs tend to have bigger effect sizes and are more likely to be retained across tissues than distal eQTLs, and eQTLs are mapped using the same methods and software as other QTLs.
Motivation for Genetical Genomics/Phenomics Approach
Now, how do we link this information we’ve gotten from eQTLs to our phenotypic QTLs and eventually to genes? So, what’s our motivation? I’m going to call this a genetical genomics/phenomics approach throughout, just to give it some terminology. If we only identify candidate genes that are physically located in a phenotypic QTL, we may miss genes whose RNA expression is controlled from that QTL but that don’t physically reside there; that is, we’re going to miss genes with trans-eQTLs in that region. We also run the risk of identifying genes that are not expressed in our tissue of interest. For instance, if we’re looking at a dependence-related trait for alcohol or drugs of abuse, a candidate gene may not be expressed in the brain or another tissue relevant to that disease. And third, we could miss critical information about how the causal variant influences the phenotype. When we instead use the overlap of an eQTL with a phenotypic QTL, we’re more likely to focus in on genes from the tissues we believe are important, and we’re more likely to have a little more information about the mechanism by which that region influences the phenotype. And we might be able to identify broader biological and mechanistic themes by looking at genes that are controlled from the same region, although they may not all be physically located there. They may have a mechanistic reason for sharing transcriptional control, and we’ll go into this a little more when we start to talk about networks.
Definition of Genetical Genomics/Phenomics Approach
In a genetical genomics phenomics approach, we’re combining three sources of information. We’re combining the sequence polymorphisms with the complex trait or diseases, so our phenotype here, with variation in transcription levels. So, in a very traditional genetical genomics phenomics approach, we look at each one of these comparisons and pairwise comparisons individually and then look at convergence of results.
For the sequence-to-disease comparison, we use a QTL analysis. We can look at the correlation between a transcript and the disease. And finally, we can calculate the eQTLs and make sure that each transcript’s eQTL involves the same polymorphisms that are associated with the disease. So, again, we’re looking for the convergence of all three of these analyses on a single gene to make it a candidate in a genetical genomics/phenomics approach.
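As a toy illustration of this convergence step, a gene becomes a candidate only when all three pairwise analyses point to it. The gene names and set memberships below are invented for the example (the Adh genes are used only because alcohol dehydrogenases come up in the application later):

```python
# Hypothetical results of the three pairwise analyses (invented gene sets):
corr_with_phenotype = {"Adh1", "Adh7", "Gstm3", "Cyp2e1"}   # transcript ~ trait
genes_with_eqtl     = {"Adh1", "Adh7", "Slc6a4", "Gstm3"}   # sequence ~ transcript
eqtl_overlaps_pqtl  = {"Adh1", "Adh7", "Slc6a4"}            # eQTL falls in the pQTL

# Convergence: a candidate must appear in all three analyses.
candidates = corr_with_phenotype & genes_with_eqtl & eqtl_overlaps_pqtl
print(sorted(candidates))  # ['Adh1', 'Adh7']
```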
Application of GGP approach
And so, just to give you an example of the way this has been done in the past: in this particular analysis, we were looking at alcohol consumption in an RI panel of rats. We measured the phenotype of interest, in this case simply the amount of alcohol consumed. Then we took individual genes, first requiring that they be correlated with the phenotype, and then requiring that their eQTL location overlap the pQTL location; that’s how we got our candidate list of genes. We were then able to look at shared biological annotations and the literature on these genes to get a better idea of the genetic pathways involved.
So, the quick summary here is simply that by looking at the convergence of this information, we’re getting more information about mechanism. We’re getting candidate genes that we can link through both DNA and RNA to our trait.
Why networks?
And so, this is a genetical genomics approach. What we’ve shown you so far is working on a gene-by-gene basis and ending up with a list of candidate genes. But an alternative is to think of this in a network setting. So, why do we want to move from a single-gene to a network approach? First, it really is logical to conclude that no gene product acts independently in the cell; most, if not all, gene products interact with one another. We know that most tissues, especially the brain, are complex hierarchical networks: there are cell types, cell-to-cell communication, cell-type-to-cell-type communication, and region-to-region communication, and all of these represent interactions and ways genes work together. Another thing that comes up often, especially in substance use disorders, is that we can conceptualize many diseases as a failure of network regulation: some kind of network breaks down to produce the disease, to produce the phenotype. So, it makes sense to model our data using that network structure as well. And finally, it can provide more insight if we know how the genes involved interact with one another, so that we can get a better idea of predisposition to disease and how diseases evolve.
Methods for defining networks of genes
There are lots of different methods for defining networks of genes, or defining how genes and proteins interact. There are protein-protein interaction databases and networks that draw on both experimental results and computationally derived results about how proteins interact with each other to form protein complexes and carry out other mechanisms. There are annotated pathways and ontologies, such as KEGG and the Gene Ontology (GO), that derive these networks through annotation from the literature and other means. And then, finally (and this isn’t an exhaustive list), there is RNA co-expression: instead of looking at relationships at the protein level, or at relationships derived from the literature, we use a mathematical or statistical measure of the relationship between two genes or two RNA transcripts. Again, there are lots of different methods for doing that. Weighted gene co-expression network analysis (WGCNA) is just one; I’ll talk a little more about it later, just as an example. But there are also very classic methods like k-means clustering, and more statistically advanced methods like Bayesian networks and Gaussian graphical models, for identifying these co-expression networks.
Co-expression as a measure of “connection”
So, for the remainder of this talk, I’m going to focus on co-expression as a measure of “connection”; I’m just going to talk about the co-expression methods for identifying networks. Not that the other methods aren’t valid or don’t add a lot of information; it’s just the focus for today.
When we think of co-expression as a measure of connection, the theory behind it is that if the expression levels of two transcripts correlate over multiple environments, then the two transcripts are likely to be involved in similar biological processes. So, if gene A is increased under stress and gene B is also increased under stress, or if A is reduced under hypoxia and so is B, and they tend to react in a similar manner across these multiple environments, then the likelihood that they’re doing similar things in the cell increases. When we talk about genetic panels like the Hybrid Rat Diversity Panel or HS rats or other types of animals, instead of environments like stress or actual toxins, we’re treating the different genomes as different environments. So, if gene A is high or low in a particular strain, gene B tends to be high or low in that same strain. In this toy example, the blue gene and the red gene vary in a similar manner across all the strains, so we hypothesize that they are involved in similar processes and may share a mechanism of transcriptional control, whereas the green gene does not follow the pattern of the other genes.
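The toy example can be sketched numerically. The expression values below are invented: the “red” and “blue” genes rise and fall together across eight strains, while the “green” gene follows its own pattern:

```python
import numpy as np

# Expression of three genes across eight strains (values invented):
red   = np.array([5.0, 1.0, 4.0, 2.0, 6.0, 1.0, 5.0, 2.0])
blue  = np.array([10.2, 2.1, 8.0, 4.3, 11.9, 1.8, 10.1, 3.9])
green = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

def cor(a, b):
    """Pearson correlation between two expression profiles."""
    return float(np.corrcoef(a, b)[0, 1])

# Red and blue rise and fall together across strains; green does not:
print(f"red-blue: {cor(red, blue):.2f}, red-green: {cor(red, green):.2f}")
```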
What do we gain by building networks and identifying co- expression modules instead of considering each candidate gene individually?
We have found this in our research to be extremely helpful when we start looking at under-annotated or unannotated genes. These could be protein-coding genes, non-coding genes, or small RNAs for which we may not have KEGG information or GO information, and which may not produce a protein we could examine through protein-protein interactions. But we can incorporate these genes into our co-expression modules and start to make hypotheses about their biological function, based on what the genes they’re highly correlated with are known to do. In a similar vein, many genes perform multiple functions in the cell. When a gene has multiple roles, looking at the overlap of roles among the genes in a network associated with a particular phenotype may tell us which of those roles is important for that phenotype. When we start making functional hypotheses about mechanisms and thinking about therapeutic targets, the most significant gene in our association might not be druggable; it might not be the best one to target. But when we have a network of genes, we can start looking for alternatives: things that already have a drug, or things we know more about in terms of potential side effects. And finally, just as with biological function, we can get information about cell types and brain regions that may be of interest within a heterogeneous tissue. So, if half of the genes in our co-expression module are exclusive to dopaminergic cells, that may be the cell type that’s important for all the genes within the module.
Weighted gene co-expression network analysis (WGCNA)
So, I’m just going to go into a little bit of detail about weighted gene co-expression network analysis (WGCNA). This method has been around for a really long time and has been used by many people. It’s rather simple from a statistical standpoint, making it easy to use; when I have trainees who are just beginning to learn about co-expression network analysis, it’s a really good one to start with. There are great tutorials, there’s an R package, and many people are using it, not only in the traditional way but also in novel ways. Where WGCNA differs from simply looking at correlation is that it applies some statistical and mathematical manipulations to traditional correlation coefficients to make the connections in the network resemble a scale-free network; we’ll talk about the motivation for that in the next couple of slides. The second thing WGCNA does above and beyond simple correlation is that it provides what’s called a robust measure of connection, or co-expression, between two genes, computed by taking into account not only the direct correlation between the two but also the other genes that are correlated with both of them. From a social network perspective, it’s looking to see whether these two genes’ friends are friends. That provides more evidence that there really is a robust association between the two genes.
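The scale-free manipulation WGCNA applies is soft thresholding: raising the absolute correlation to a power so that weak correlations shrink toward zero while strong ones are preserved. WGCNA itself is an R package; this is an illustrative Python sketch on simulated data, with beta = 6 used as a common default for unsigned networks:

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """expr: genes x samples; returns a weighted gene-gene adjacency.
    Raising |correlation| to the power beta shrinks weak correlations
    toward zero while keeping strong ones, which pushes the network
    toward scale-free behavior (beta = 6 is a common unsigned default)."""
    corr = np.corrcoef(expr)       # pairwise Pearson correlations
    adj = np.abs(corr) ** beta     # unsigned soft-thresholded adjacency
    np.fill_diagonal(adj, 0.0)     # ignore self-connections
    return adj

rng = np.random.default_rng(1)
expr = rng.normal(size=(5, 30))               # 5 genes, 30 samples (simulated)
expr[1] = expr[0] + rng.normal(0.0, 0.1, 30)  # make genes 0 and 1 co-expressed
adj = soft_threshold_adjacency(expr)
print(f"co-expressed pair: {adj[0, 1]:.2f}, unrelated pair: {adj[0, 2]:.4f}")
```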
Scale-free network assumption
Now, moving on to the scale-free network assumption. On the left-hand side is a visual demonstration of the difference between what we consider a random network (on the far left) and a scale-free network (on the right). Looking at the two pictures, we often make the analogy of a highway system for the random network and an airline system for the network on the right. In the random network, each node has just a few links coming out of it: each node, each city in our analogy, has about the same number of highways coming into and out of it. So, if we look at the distribution of the number of links per node, we see that most of our nodes, or cities, have about the same number of links, with very few cities that have no links and very few that have a lot of links. On the other hand, in the scale-free network up here, the red nodes are like hub cities for the airlines, with lots of lines going in and out, while some of the other nodes have very few, maybe one or two. The difference in the distribution of links per node in the scale-free network is that most cities in this graphic have only one or two flights going in, and a couple of cities have a lot going in and out. So, that’s a general description of a scale-free network. We have a couple of sources of evidence that lead us to believe RNA may behave in a similar manner. We don’t know that for sure, but one piece of evidence is that we can see some of these so-called small-world properties in metabolic networks.
So, instead of metabolites being produced in a linear chain, where metabolite 1 is needed to create metabolite 2, which is needed to create metabolite 3, and so forth, we see that a linear arrangement can be really inefficient: getting from metabolite 1 to the production of metabolite 778 takes a long time, and if any one of the nodes in between breaks down or disappears, there’s no way to get from 1 to 778. It just isn’t a very efficient biological model. Whereas with hubs, in a scale-free paradigm, we can move quickly from hub to hub, from the beginning of the chain to the end, and if any of the smaller nodes in between fail, we’re able to bypass them and still get the work done. The other source of evidence, which I don’t have a graphic for here, is evolutionary: genes get added to the genome over time, so older genes tend to have a lot more connections and tend to be your hubs, whereas evolutionarily younger genes tend to have fewer connections.
Integration of indirect and direct correlations
The other thing that makes WGCNA a little bit different from plain correlation is this idea of indirect and direct correlations, and I realize that there’s some ambiguity around the terminology here. When we have high indirect correlation, if I’m looking at the association between the two grey nodes here, they have a strong direct correlation, but we also see that the green nodes, all the other genes they’re correlated with, are correlated with each other: this green node is correlated with both grey ones, and so is the other green one. Their friends are friends. Whereas in this demonstration of low indirect correlation, there’s a high correlation between the two greys, but none of the other genes correlated with one grey node are correlated with the other. The first tends to be a more robust correlation measurement, and the second a less robust one.
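This “friends of friends” idea is usually formalized as the topological overlap measure (TOM), which combines the direct adjacency between two genes with the adjacency they share through common neighbors. A minimal sketch, with an invented adjacency matrix in which genes 0 and 1 share two neighbors while genes 0 and 4 have an equally strong direct link but no shared neighbors:

```python
import numpy as np

# Invented weighted adjacency: genes 0 and 1 are both connected to
# genes 2 and 3 (shared "friends"); gene 4 connects only to gene 0.
adj = np.array([
    [0.0, 0.5, 0.8, 0.8, 0.5],
    [0.5, 0.0, 0.8, 0.8, 0.0],
    [0.8, 0.8, 0.0, 0.0, 0.0],
    [0.8, 0.8, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0],
])

def tom_similarity(adj):
    """Topological overlap: reward gene pairs whose neighbors overlap."""
    k = adj.sum(axis=1)              # connectivity of each gene
    shared = adj @ adj               # sum over shared neighbors u of a_iu * a_uj
    n = adj.shape[0]
    tom = np.zeros_like(adj)
    for i in range(n):
        for j in range(n):
            if i != j:
                denom = min(k[i], k[j]) + 1.0 - adj[i, j]
                tom[i, j] = (shared[i, j] + adj[i, j]) / denom
    return tom

tom = tom_similarity(adj)
# Same direct adjacency (0.5) for both pairs, but the pair with shared
# friends gets a higher topological overlap:
print(f"genes 0,1 (friends are friends): {tom[0, 1]:.3f}")
print(f"genes 0,4 (no shared friends):  {tom[0, 4]:.3f}")
```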
Module eigengenes
Once we have our measures of how connected two genes are, we use a modified hierarchical clustering method to identify modules of co-expressed genes based on this new measure of connectivity, or distance, between two genes. But once we have the modules, if we want to summarize the association of a module with a phenotype, or to map the module, we need a summary of expression across the genes in a module. We call this the module eigengene. The graphics here on the right give a general idea of what I’m talking about. If each of these grey lines represents a gene’s expression level across 18 samples (you can see it goes up and down), then the black line represents the module eigengene for this particular module: it summarizes the relationship among the samples across all the genes. This is a similar graphic, a module expression heatmap with genes along the Y-axis and samples along the X-axis, summarized into an eigengene expression value. For example, sample 3 tends to be high across all these genes, and sample 6 tends to be low, and we see that reflected in the eigengene expression. We typically use PCA to do this and take the first principal component. Our end result, like this last graphic here, is an eigengene expression value: one value for each sample for a given module.
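A minimal sketch of computing a module eigengene as the first principal component of the module’s standardized expression, on simulated data where all the genes in the module track one shared signal:

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples = 20, 18
hidden = rng.normal(size=n_samples)       # shared "module" signal per sample
loadings = rng.normal(1.0, 0.2, n_genes)  # how strongly each gene follows it
expr = np.outer(loadings, hidden) + rng.normal(0.0, 0.5, (n_genes, n_samples))

# Standardize each gene, then take the first principal component across
# samples (the first right singular vector of the standardized matrix).
z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(z, full_matrices=False)
eigengene = vt[0]                         # one summary value per sample

# The eigengene should track the shared signal (up to an arbitrary sign):
r = np.corrcoef(eigengene, hidden)[0, 1]
print(f"|correlation with shared signal| = {abs(r):.2f}")
```

The one-value-per-sample eigengene is what makes the next step possible: it can be mapped across the genome exactly like any other quantitative trait.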
Summary of co-expression networks
And then, just like with individual transcripts, we can map this module eigengene using all the same tools, tricks, and models we had before. To summarize what I wanted to cover with co-expression networks: by looking at genes as a group and describing how they relate to one another, we’re hoping to gain more information to develop functional hypotheses down the road. Co-expression is one mathematical way of describing a network. It allows us to be more inclusive, but at times at the cost of ignoring known information about interactions and of assuming specific types of relationships, whether linear or monotonic, between two transcripts. We talked about WGCNA very briefly, just because it’s a popular method and a little easier to grasp and understand. And just like individual eQTLs, we can map our module eigengenes, the summaries of those modules, across the genome.
So, how do we link this all together? Again, all we’ve done is taken that genetical genomics phenomics approach and replaced an individual gene with a co-expression module. So, we’ve looked at genetic correlations, we’ve looked at QTLs, and we’ve looked for overlap with that module eigengene QTL.
Candidate Co-expression Module for Alcohol Clearance
Just want to check how we’re doing on time; I’m almost out, but I’m close to the end. I wanted to close with an example of using this process of overlapping phenotypic QTLs with eigengene QTLs. In the particular example I’m going to talk about today, we measured a couple of alcohol metabolism-related traits in an RI panel and then used liver RNA expression to identify candidate genes. We did this very intentionally, because we know a lot about what’s going on in alcohol metabolism, so we had an expectation of what should come out of this procedure.
We did the typical analysis with our phenotypic QTLs, looking at alcohol clearance and circulating acetate levels, or acetate area under the curve (AUC). In this particular analysis, we were able to identify several peaks, which you can see throughout here. We then identified our co-expression networks in liver and looked for those associated with our phenotypes: these were associated with alcohol clearance, and these with acetate AUC. One module I want to point out is this orange 3 module, which is associated with both acetate AUC and alcohol clearance.
When we overlap the maps of both phenotypes and the orange 3 eigengene, which summarizes that particular module, we see a similar peak for both the orange 3 eigengene and our alcohol clearance phenotype, indicating that we do have the overlap. And when we look further into the orange 3 module, we see that the two most highly connected genes within it are alcohol dehydrogenases. That’s what we expected based on what we knew about the process of alcohol metabolism.
Summary of pQTL to meQTL to network
So, in summary, in going from pQTL to meQTL to network, we are again gathering and looking for convergence of information. We’re using not only the identity of these genes but how they relate to one another. And we were able to show that, with a somewhat unsupervised approach, we get the things we would expect from this process.
Conclusions
So, I wanted to wrap up this portion, or the whole talk, I guess, by saying that in bringing in RNA expression, we’re thinking of it as a natural mediator between DNA variants and behavior, and that by incorporating RNA expression information, we’re broadening the possible mechanisms for how a QTL affects a phenotype. We would miss differences in protein structure by looking only at RNA expression and eQTL overlap, but we would miss some potentially critical genes by looking only at genes physically located in the QTL. So, oftentimes it requires looking at multiple sources of evidence to identify these genes. And these co-expression networks not only reduce some of the multiple testing burden from a statistical point of view, but also allow us to bring in genes that we just don’t have a lot of information on. If we had done this on a gene-by-gene basis, our natural human instinct would have been to ignore those genes completely and just go with the genes we know. So, this is a way to bring that information in and make sure we’re keeping these under-annotated or unannotated genes on our radar when thinking about the next functional test.
The last slide is just to remind you of the NIDA Core, which has been sponsoring all of this. We’re here to help provide training and study design and analysis services, and we have some pilot grants that we’ve given out to help build up these databases and resources.
And finally, you know, I have a great group in my lab. We have funding from NIAAA as well, as you saw from all those alcohol samples, to generate some of these datasets that hopefully can be used by many, many people. So, I think I will stop my share there and see if we can tackle some of the questions.
Host: Thank you, thanks very much, Laura. There are some numbered questions in there. Joe had two towards the beginning, so if you just scroll up to the top of the chat, you can follow along. Joe, if you want to unmute yourself and ask your set of questions, go ahead. Let’s see, do I have to unmute Joe, or can he do that himself? There you are.
Audience member: I have to go back and figure out what the first question was.
Laura: Oh, I see it in the chat. The question was about splice variants. I find splice variants especially intriguing, and I think our ability to capture and add splice variants to these types of analyses will continue to evolve. There are some groups that have specifically looked at splicing QTLs, meaning genetic variants that affect splicing, versus looking at eQTLs at the splice-isoform level rather than at the gene level. All of those things, I think, can be really easily incorporated into these kinds of analyses. But calculating things at the isoform level requires really deep sequencing and a great depth of knowledge about what splice variants are out there, so it requires more informatics at the front end. We, as a rat community, are working towards that. But all of those splice variants aren’t annotated yet, and that causes some ambiguity in the isoform annotations.
Audience member: Do you have any sense of how much biology we’re missing by not formally including splice variants in the tests that we’re doing? And second, I think there are reasonable methods out there to look at splice variants. You referred to those that are annotated, but you could also do discovery of splice variants, and if a variant is related to the trait of interest, that could motivate secondary studies to validate its occurrence, and then work that would, again, motivate further functional studies.
Laura: Right, right, no, and I totally agree. The reason why I say that is because we have done some transcript reconstruction with RNA-seq, and those are the methods that identify these novel splice variants. Those types of methods do really well at identifying splice junctions, but they do so-so at chaining them together in the correct order. When we’ve compared that to what we’ve gotten out of single-molecule sequencing, which actually defines the full transcript length, we see a few differences. So, the reconstruction is better than the annotation, and single-molecule sequencing is even better than that. That’s why I hedged my bets a little bit there. But I do think it has a huge influence. You know, with the datasets where we’ve used single-molecule sequencing, especially in brain, we’re capturing 60,000 splice variants in the brain where we have, you know, twenty annotated, right? So, we know that a lot of that’s going on. And it’s not only the splicing, but also differences in 5’ and 3’ ends, which could be really important. As we build these datasets and as these technologies improve, we can get more and more specific about the networks that we’re building. When we build them at the gene level, we’re kind of grabbing the low-hanging fruit, right? I don’t think we’re wrong in the things we’re identifying, but I think we’re missing a lot of detail and subtle differences, like splicing.
Audience member: Because with expression levels, you’re changing the amount of basically the same isoform. With splicing, you’re potentially changing functionality of the encoded protein.
Laura: Right, right. And right now, we’re trying to build databases that quantitate transcripts at the isoform level, but I think most of the results right now are at the gene level.
Audience member: Yeah, so my second question was related to something that you said later on, about strain differences giving you a measure of heritability. But a strain difference is just that: anything that differs between strains. Not only the nuclear genome, but also mitochondria, the microbiome, parental behavioral influences, stress in the parents affecting stress in the offspring. When we say heritability in the narrow sense, we mean we’re interested in genes, because we’re geneticists; but heritability more generally is anything that’s transmitted from one generation to the next in that nice linear, strain-specific way. We need to take that into account when we look at those data. Just the way that you said: to geneticists, everything looks like a gene.
Laura: And I agree. We’re re-deriving what we’re calling the Hybrid Rat Diversity Panel, so I think that will be interesting when thinking about some of these microbiome effects too. These strains were originally raised in different places, and if we resuscitate them in the same place and the same environment, are we changing some of those microbiome effects, if they were environmentally driven?
Audience member: Ron Kahn has a great paper, I think in Cell Metabolism, maybe three or four years ago, where he looked at the same strain obtained from different sources, and at the changes in the microbiome as the animals acclimated to the same facility. There were remarkable changes in the microbiome, and they were correlated with changes in the host metabolic features.
Host: It’s a good point. It really comes back to Suanak’s first comment about the method of differences. That was where we had the caveats of when that doesn’t work too well if you’ve got confounders you haven’t taken care of. And that, of course, led to a lot of frustration on computing heritabilities. The nice thing about getting down to the QTL level is that, at that point, you don’t have to really concern yourself too much about the heritability if you have a strong linkage.
Audience member: What I was getting at, though, is that if we see differences between strains, does that mean it has a genetic basis? There could be other features, outside of the nuclear genome, that could be just as interesting biologically.
Host: Join the discussion now. Just unmute yourself if you have any comments. It’s not just the three of us here, so weigh in if you feel like weighing in.
Audience member: Should we go down the question list?
Host: Um, yeah, I just want to make sure we’ve got plenty of time, so we’ll get to those. The next question is mine. So, Laura, that was a great introduction to cis- and trans-eQTLs. I asked a question, Carl gave an initial answer, and I just want your thoughts. When we declare a cis-eQTL or trans-eQTL, the criteria are somewhat different, because one is initially a genome-wide search and one is sort of a regional search around a gene, and that has some important implications for declaring that you found a cis- or a trans-eQTL. So talk about that a little bit.
Laura: It’s all in how you apply it, right? Or the context of the experiment. I mean, one option, the approach we’ve kind of taken before in Colorado, is to do a genome-wide analysis, identify the top variant, and then declare whether that top variant is cis or trans, as opposed to separating it out into two separate analyses. That’s one option for controlling a genome-wide error rate, but there’s also a lot of value in only looking for the cis-eQTLs. I do think that you can’t really compare the thresholds between the two, in my opinion, unless you’re searching for them in the same manner.
Host: Um, let’s see, I’m just going through this. I think that was, yeah, that was question three, and then we’re about halfway through your talk, and Joe has the question, “Why are tests for cis-eQTLs limited to local rather than genome-wide?”
Audience member: I believe that was related to what you were just saying.
Host: I think the question is about a genome-wide penalty for cis tests. Do you want to field that, Laura?
Laura: The approach I was talking about, if I’m interpreting the question right, is that you penalize everything for a genome-wide search and then classify your hits as either cis or trans, rather than testing each local effect directly and only looking at cis genes.
Audience member: Do cis effects get a multiple testing penalty?
Laura: So it depends on how the QTL analysis is applied.
Host: My philosophy is a little bit more liberal, and it’s based on what I think is a reasonable Bayesian prior: if any variant is going to control gene expression, it’s going to be a local variant. So that self-control is sort of a prior expectation, and from that prior expectation you actually expect, in a sufficiently polymorphic population, every transcript to have a cis-eQTL at some level. It may not be true in a population that has a lot of regions that are IBD, but if you imagine a hypothetical population where every gene is highly polymorphic, then every gene should have a cis-eQTL, frankly. So, from that point of view, the initial barrier for declaring a cis-eQTL, in my opinion, should not involve any kind of genome-wide correction. It might involve a correction for the length of the segment that’s in LD with the gene. In a human GWAS, that’ll be pretty small, 100 kb or less. But in that 100 kb, you might end up doing 2,200 tests. So, in that case, you would need an FDR correction for all the local tests you’ve done; you couldn’t just take a p-value of 0.05 as sufficient evidence of linkage. But it really depends a lot on your priors, and my prior is, again, that cis-eQTLs are, frankly, no-brainers. I find them interesting but hardly surprising.
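The local-window correction the host describes, an FDR adjustment over only the few thousand SNP tests in LD with a gene rather than a genome-wide penalty, can be sketched with a standard Benjamini-Hochberg adjustment. This is an illustrative sketch with simulated p-values, not anyone’s published pipeline:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a set of tests."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / (np.arange(n) + 1)      # p_(i) * n / rank
    # enforce monotonicity starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# e.g. ~2,000 local SNP tests for one gene: correct within the cis window
# only, not genome-wide
rng = np.random.default_rng(0)
local_p = rng.uniform(size=2000)
local_p[0] = 1e-6                       # one genuinely strong local signal
q = bh_adjust(local_p)
print(q[0] < 0.05)                      # the strong cis signal survives
```

A raw p = 0.05 would not survive this correction, which is exactly the host’s point about not accepting 0.05 as sufficient evidence of linkage.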
Audience member: So, Rob, your point is completely consistent with Carl’s suggestion, in that his suggestion of applying the FDR across all the cis-eQTL tests is exactly what you’re talking about, except that the FDR has an empirically based interpretation. It is, in effect, estimating the prior from the data itself, because, based on the shape of the distribution of p-values, you can estimate what fraction of cis-eQTLs are actually null. So his suggestion is completely consistent with your intuitive idea of how they should be treated, but, in some ways, applying the FDR has the advantage of taking out, to some extent, individual opinions, so you have a more formalized procedure, right?
Host: Carl, do you have anything you wanna add? You’re still on mute. I’ve tried to unmute you, but maybe you’re not right in front of your monitor. We had a question from Herodotus, where is that question?
Laura: I see it: “How to appropriately test for gene overlap between two modules from data…” [indistinguishable] This is a really good question that we have been, not struggling with, but investigating, let’s say; that’s a more positive way to put it. It’s part of the research we’ve been doing computationally in the lab: comparing two networks, to say that a network has fallen apart or a network has been activated under different conditions. So, there are some tools, for example from WGCNA, for looking at the robustness of a module. What they would do is take the genes from e-turquoise and look at what their relationships are in the splicing dataset. Then they can assign a quantitative value to how well that e-turquoise module is conserved in the splicing data, but that’s not really meant to say, “I’ve identified these modules independently; how much do they overlap?” So there are several statistical tools to say, “does this group of genes in the e-turquoise module have the same relationships when I look at them in a different dataset?”, but not as much for the overlap, because an overly simplistic approach would be to simply look for enrichment of genes from e-turquoise in s-turquoise, but what you’re missing there is the weighting based on how connected those genes are in the module. We’ve anecdotally found, as we’ve run these types of analyses with different technologies and different normalizations and things like that, that our modules tend to be really robust among those genes that are highly connected with the module. For liver, the alcohol dehydrogenases, the ones that were highly connected, tend to be really robust no matter what we do to the dataset. It’s the ones on the periphery, that weren’t that connected with other genes, that kind of fall in and fall out.
So a strict, you know, Fisher’s exact test for the two modules would miss that information, the prioritizing of that information, but I can definitely put a link in the chat discussion that we post on GitHub for the article from at least the WGCNA group that has this z-score for replicability of a module.
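For reference, the “overly simplistic” overlap test Laura mentions is easy to run. The gene names and module sizes below are invented for illustration, and, as she notes, the test ignores intramodular connectivity weights entirely:

```python
from scipy.stats import fisher_exact

def module_overlap_test(module_a, module_b, universe):
    """Test whether two gene modules share more genes than expected by
    chance. Note: every gene counts equally here; connectivity within the
    module is ignored, which is the limitation discussed above."""
    a, b, u = set(module_a), set(module_b), set(universe)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(u - a - b)
    odds, p = fisher_exact([[both, only_a], [only_b, neither]],
                           alternative="greater")
    return both, p

# toy example with hypothetical gene names
universe = [f"gene{i}" for i in range(1000)]
e_turquoise = universe[:50]             # module from the expression data
s_turquoise = universe[30:80]           # module from the splicing data
overlap, p = module_overlap_test(e_turquoise, s_turquoise, universe)
print(overlap, p)                       # 20 shared genes, tiny p-value
```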
Host: There’s a question that just came in from Priscilla; you might have a look at that one last. Has anybody compared cis-eQTLs of genes in WGCNA?
Laura: We’ve just done it anecdotally, not systematically. But it would make sense that the most highly connected genes, especially for modules that have a strong eigengene QTL, would have a cis-eQTL at the hub. In some of our work where we’ve looked at alcohol consumption and brain, our most highly connected genes in the module are ones that are physically located in that region and have a cis-eQTL. And that sort of leads to, I think it was your question, Rob, about how to avoid LD artifacts. In the implementation of WGCNA, we can’t avoid it: LD could be one of the reasons why they’re correlated.
Host: Can you explain exactly what that means?
Laura: LD is linkage disequilibrium, and so it’s saying that gene A and gene B have correlated expression simply because they’re physically near each other on the genome. During recombination, they tend to travel together, but the actual variants controlling expression for those two genes are two different variants, right? They just happen to always travel together during recombination, so the genes appear correlated, but it’s just because they’re so close to one another on the genome. Do you agree with that definition, roughly?
Host: How could I disagree?
Laura: And one of the things that we’ve done in a post hoc manner with our co-expression modules has been to look at partial correlations among genes within the module once you control for that module eigengene QTL. So if two genes are highly correlated and are physically next to each other, and we do a partial correlation where we account for that QTL and they’re no longer correlated, then it’s a signal to us, or evidence for us, that those two were correlated just because of linkage disequilibrium, rather than some mechanistic relationship between the two. And, again, I’ve got a whole slide on that; we could do a whole webinar sometime about the rationale behind it and how the statistics work, but we’ve done it simply, in a post hoc vein. I think I’ve seen other people do it where they adjust out the cis-eQTL effect of all genes and then do a WGCNA, which is another option for how to control for that.
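The partial-correlation idea can be sketched as follows: regress both genes’ expression on the shared QTL genotype and correlate the residuals. If a strong raw correlation vanishes in the partial correlation, LD at the shared locus, rather than a mechanistic link, is the likely explanation. A minimal illustration with simulated data, not the lab’s actual pipeline:

```python
import numpy as np

def partial_corr_given_qtl(expr_a, expr_b, genotype):
    """Correlation between two genes' expression after regressing out a
    QTL genotype (coded 0/1/2)."""
    g = np.column_stack([np.ones(len(genotype)), genotype])
    # residuals from an OLS fit of each gene's expression on the genotype
    resid_a = expr_a - g @ np.linalg.lstsq(g, expr_a, rcond=None)[0]
    resid_b = expr_b - g @ np.linalg.lstsq(g, expr_b, rcond=None)[0]
    return np.corrcoef(resid_a, resid_b)[0, 1]

# simulate two genes driven only by the same (or tightly linked) variant
rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=200).astype(float)
gene_a = 2.0 * geno + rng.normal(size=200)
gene_b = -1.5 * geno + rng.normal(size=200)
raw = np.corrcoef(gene_a, gene_b)[0, 1]
partial = partial_corr_given_qtl(gene_a, gene_b, geno)
print(round(raw, 2), round(partial, 2))  # strong raw correlation, ~0 partial
```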
Host: I’m pleased we have Jake Lusis on, taking the class with us. He could be giving this course, having coined the term eQTL in some papers in the early 2000s. Jake, if you want to unmute yourself and say hi, that would be great.
Audience member: Hi, everybody. Good to see everyone.
Host: He was Eric Schadt’s mentor, and so was behind all of those very early corn, human, and mouse eQTL studies, and of course, he’s been very busy in the field for the last twenty years since then, so it’s good to have him joining us.
Audience member: Thank you.
Host: And if you want to give one of these courses, just let me, Laura, and Sean know. There was one more from Herodotus, but maybe that got answered: module size. What’s a typical module size?
Laura: It depends. One of the clinicians in our department laughs, because she says that’s the standard statistician answer: “it depends.” The creators of WGCNA, Steve Horvath and Peter Langfelder, suggested 30 genes as a minimum module size, and part of their rationale is that if you want to do any kind of enrichment analysis afterward, say for GO terms or KEGG pathways, you need at least 30 genes to do that in any reasonable way. In our lab, we tend to drop it down to five. The reason is that, with our technology, we might not expect a really strong correlation among a ton of genes. We’re willing to do the knowledge-based analyses of looking things up, and to us, a smaller number of genes is much more manageable for looking at each individual gene in the literature to make a story. We’ve tested it out with WGCNA. What happens is that when a gene does not fit into a module, it is put into a clump called “gray,” which just means it wasn’t in a module that met the minimum size. When we adjust that minimum module size in our development of these modules, what we see is that “gray” tends to get smaller if we allow those smaller modules. So, we were willing to take the hit on the multiple testing correction for more modules to be able to dig a little deeper into some of these small ones. It really depends on what you want to do next, and a lot of times it may even depend on what phenotype you’re interested in. Is it a phenotype that you expect a thousand genes to be influencing, or is it a phenotype where you expect a smaller number of relevant genes on a smaller network to be involved?
Host: You can probably see that question from Saunak at the very bottom:
Laura: “How do your methods change for other omics data such as microbiome, metabolome, etc., versus the transcriptome?” That’s a good question. You know, one of the things we’ve been struggling with (again, struggling is the wrong word, let’s be positive: investigating a little more) is this: in my mind, the philosophy of the scale-free network is easy to comprehend at the protein-coding gene level. When we start adding in things like non-coding RNA or the microbiome, we might not expect that scale-free network distribution. We might not expect the same relationships between a non-coding and a protein-coding RNA as between two protein-coding RNAs. And so it’s that piece of WGCNA that I’m more hesitant about when we switch to other omics: the idea of the scale-free network and whether or not that’s something we want to enforce.
Host: To your question, one other obvious difference is that some of the “omes” are rooted in the genome: the epigenome is, the transcriptome is, and the proteome is, so the concept of a cis- or trans-QTL makes sense, at least for the proteome and transcriptome. It doesn’t for the metabolome, or the metagenome, or a lot of the “omes” that are new; in some sense, they don’t have that built-in “rooting.” The lovely thing about cis-eQTLs is that they serve as a sort of built-in standard for data quality control, and obviously there’s a lot of real biology there too. I had a question for you, Laura, about WGCNA. It seems like when the correlation coefficients are raised to the scale-free exponent beta, you know, to the eighth power or so, which is kind of the recommended default, it tends to kill all the negative correlations, which typically have lower correlation coefficients than the positive correlations do. So if you look at a distribution of correlation coefficients, it tends to be positively skewed, and when you raise it to some power like 8, the negative correlations get washed out. But those negative correlations are really important. In your talk, you mentioned that you’re finding “friends” of transcripts, but I’m often more interested in finding antagonists. And with WGCNA, I find I almost always have to rummage around and rescue the negative covariates. I’m just wondering whether you have a strategy.
Laura: That’s a good question. I’ve never looked in depth at the imbalance in the distribution of negative versus positive correlations. There are options in WGCNA, and there are differences in philosophy on whether you do what they consider a signed network versus an unsigned network. In a signed network, you’re forcing all of the genes within a module to have positive correlations with each other, as opposed to allowing negative and positive correlations within the same module. I’ve always been a big fan of letting them both in. I don’t think there’s much value in only looking at one direction. So many of our pathways have both, like you said, protagonists and antagonists, and it’s important to have that information.
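The signed-versus-unsigned distinction, and the host’s point about powering up correlations, can be seen numerically. Soft thresholding in WGCNA uses adjacency |r|^beta (unsigned) or ((1 + r)/2)^beta (signed); beta = 8 below simply follows the value mentioned above, and this is a toy illustration, not WGCNA itself:

```python
import numpy as np

beta = 8
cor = np.array([0.9, 0.5, -0.5, -0.9])   # toy gene-gene correlations

unsigned = np.abs(cor) ** beta           # sign discarded before powering
signed = ((1 + cor) / 2) ** beta         # negatives pushed toward zero

print(np.round(unsigned, 4))  # a strong negative r keeps high adjacency
print(np.round(signed, 6))    # negative correlations effectively vanish
```

In the unsigned network, r = -0.9 and r = +0.9 get identical adjacency, but any correlation of modest magnitude, which is where negatives tend to sit, is crushed by the eighth power; in the signed network, negatives are removed outright.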
Host: So, in your Orange 3 module, could you tell me which ones were positive and which ones were negative?
Laura: Yes, I can.
Host: I just wanted to make sure that information was actually preserved. I haven’t followed that, but I am fairly sensitive to the imbalance. It’s way more acute in human datasets. If you look at GTEx, it’s extreme: you can look at the top 500 covariates of a transcript, and every one of them will be positive. In rodent expression data, we rarely see that; it tends to be more balanced. But even in the rodent datasets, the positive covariates are more common, and that kind of makes sense, because almost all of your batch effects are going to add on the positive side.
Laura: That would be my interpretation of the overabundance of positive correlates.
Host: I wish there was a way. I mean, there is a [indistinguishable] way to take the negative covariances, give them a little boost in the negative direction so that they’re perfectly balanced with the positives, and then raise everybody to the power of eight.
Laura: So I think we have one last question, given we only have five more minutes, and this was from [indistinguishable]: “What does it take to independently validate an expression module?” And again, the answer is “it depends.” It’s a really hard question to answer. I’ll go back to WGCNA: they have a method, published in a paper, for asking “is my module replicable?” They take, I believe, nine different measures of replication: are the same genes present, in the same order; is the most connected gene still the most connected one; those sorts of things. So they have different aspects of a module that you can quantitatively measure to say that it’s been reproduced, and they combine those into a z-score. But there isn’t a globally accepted way to say that an entire expression module is validated, right? Like we talked about a little, anecdotally, the highly connected genes tend to validate and the weakly connected ones tend not to; do you consider that whole module validated or not? So it’s not an easy answer. We tend to use an overall, almost meta-analytic combination of different aspects to look at the robustness of our network. We’ll take our WGCNA module, do a bootstrap sample, see what the conservation of that module is in the bootstrap sample, and see how consistent it is when we do things like bootstrap or natural ways to [indistinguishable].
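A bootstrap robustness check of the kind Laura describes might look like the sketch below. It tracks a single simple metric, the mean intramodular correlation across bootstrap resamples, whereas the WGCNA preservation statistic combines several metrics into a z-score, so treat this as illustrative only:

```python
import numpy as np

def bootstrap_module_consistency(expr, module_idx, n_boot=200, seed=0):
    """Resample samples (rows) with replacement and track how stable the
    module's mean intramodular correlation is across resamples."""
    rng = np.random.default_rng(seed)
    n = expr.shape[0]
    k = len(module_idx)
    stats = []
    for _ in range(n_boot):
        boot = expr[rng.integers(0, n, size=n)][:, module_idx]
        c = np.corrcoef(boot, rowvar=False)
        # mean off-diagonal correlation within the module
        stats.append((c.sum() - k) / (k * (k - 1)))
    return float(np.mean(stats)), float(np.std(stats))

# toy data: 100 samples x 20 genes, first 10 genes share a latent driver
rng = np.random.default_rng(2)
latent = rng.normal(size=(100, 1))
expr = rng.normal(size=(100, 20))
expr[:, :10] += 2.0 * latent            # a coherent "module"
mean_r, sd_r = bootstrap_module_consistency(expr, list(range(10)))
print(round(mean_r, 2), round(sd_r, 3))  # high and stable across resamples
```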
Host: Anybody else, unmute yourself: comments, questions, and suggestions for improvement are all welcome at this point. Countdown 10, 9, 8… All right, well, I guess we’re heading into the weekend. Actually, Saunak and I are heading into our next meeting.
Laura: Thank you.
From SNPs to Genes
Title: Linking SNPs with genes in GWAS.
Presenter(s):
Luke O’Connor, PhD (Department of Biomedical Informatics, Harvard Medical School)
Dan Weiner, PhD (Broad Institute of Harvard and MIT)
Sarah [Host]:
Good morning, everyone, and welcome to today’s MPG primer session. We are very excited to have a talk by Luke O’Connor and Dan Weiner, who are here today to share their work on the intersection of common and rare variant disease genetics. Luke O’Connor is a Schmidt fellow at the Broad Institute; his focus is on developing statistical approaches to understand the genetics of common disease. His work integrates data from multiple phenotypes, from common and rare genetic variation, and from diverse populations.
Dan Weiner is an MD-PhD candidate at Harvard Medical School who has just defended his PhD. Congratulations! He is studying the convergence of common and rare genetic variation, co-advised by Luke O’Connor and Elise Robinson. They will be speaking to us today about their recent work linking single nucleotide polymorphisms with genes in genome-wide association studies. So, thank you both for being here today. They’ll be happy to take your questions during the talk, so go ahead and post them in the Q&A or type them in the chat, and there will be pauses to have them addressed.
Luke O’Connor:
Oh, thank you very much, Sarah, for the invitation and the introduction, and good morning, everyone. So today, Dan and I will be telling you about a well-studied and well-known challenge in statistical genetics and functional genomics, which is linking SNPs to genes in GWAS; in particular, linking disease-associated SNPs with the disease genes that probably mediate their effects.
So I’ll just give a quick background on GWAS and the importance of this problem before telling you about a very direct approach to linking SNPs to genes using eQTLs. And then I’ll pass it over to Dan, who will tell you about work that he led on linking SNPs to genes with proximity and this abstract mediation model, as well as some work led by our collaborators on combining multiple SNP-to-gene linking strategies.
So the challenge, of course, is that after performing a genome-wide association study, you observe some associated variants, and you don’t know which genes they regulate. In particular, GWAS hits are usually not in coding regions, and typically the loci implicated in GWAS contain multiple plausible candidate causal genes. What it looks like these variants do is affect transcriptional regulation.
So, very early on, it became apparent that significant hits were non-coding, and that they instead localized to regions of the genome that plausibly affected transcriptional regulation. DHS (DNase I hypersensitivity) sites were an early assay of possible transcriptional regulatory activity. These are regions of the genome that are accessible to DNase, and therefore might also be accessible to binding by a transcription factor, and a SNP localized in such a region might affect the binding of that transcription factor and thereby alter regulation of a nearby gene. So, there was early evidence that a large fraction of GWAS hits were in DHSs or in linkage disequilibrium (LD) with DHSs. And heritability-style analyses (that was 2012) found the same thing. These were some early estimates from Gusev et al. (Sasha Gusev, by the way, is local and has done a lot of work on eQTLs as well) looking at the heritability explained by coding as opposed to regulatory regions, DHS regions in particular. So it’s about 10 or 20 percent of SNPs explaining a large fraction of heritability; maybe the estimates have been revised down a little bit since 2014. Coding variants, even though they’re actually more enriched for heritability, about 10x enriched, comprise a much smaller fraction of the genome, so they explain less heritability.
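The enrichment arithmetic here is simply the share of heritability a SNP category explains divided by the share of SNPs it contains. The numbers below are illustrative stand-ins, not the published estimates:

```python
# Heritability enrichment: fraction of h2 explained by a SNP category
# divided by the fraction of SNPs in that category. Illustrative only.
def h2_enrichment(frac_h2, frac_snps):
    return frac_h2 / frac_snps

# e.g. DHS-like regulatory regions: a modest share of SNPs, a large
# share of heritability
print(h2_enrichment(0.60, 0.15))   # ~4x enriched

# coding variants: ~1% of SNPs explaining ~10% of h2: more enriched per
# SNP (~10x), yet less total heritability because the category is small
print(h2_enrichment(0.10, 0.01))   # ~10x enriched
```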
Now, on the one hand, it’d be nice if a larger fraction of heritability localized to coding variants, so we could identify the genes more easily. But on the other hand, one nice thing about regulatory variation is that it’s cell type-specific. A lot of regions of DNA look like they have regulatory activity in some cell types but not others. This is a figure from Finucane et al. 2015, the original stratified LD score regression paper, where they looked at the heritability enrichment of cell type-specific regulatory elements and found, sure enough, that they’re enriched in the diseases you’d expect. In schizophrenia, brain regulatory elements are highly enriched; in rheumatoid arthritis, immune and hematopoietic regulatory elements are enriched. One interesting finding here was that BMI looks more like a brain or behavioral trait than a metabolic trait, even though, of course, metabolic processes are also important. So this implies that disease-relevant transcriptional processes are probably usually cell type-specific, and any approach that measures those processes should probably recapitulate that specificity.
And it strongly supports a model where most disease-causing SNPs affect gene expression in some causal cell type or cellular context, and that, in turn, affects disease risk. I don’t want to give you the wrong idea, though, because protein-coding variation is also very important, and I’ll come back to transcriptional regulation in a second. In particular, rare coding variants are strongly implicated. They’ve been implicated for a long time in rare diseases, of course, in particular developmental disorders. And more recently, they’ve also been strongly implicated in complex traits, where they often have larger effect sizes.
Low-frequency variants, you know, around one percent frequency or kind of in between, look like they have a much higher percentage of their heritability in coding regions compared with common variants. And for common variants as well, fine-mapping studies will often turn up a whole bunch of coding fine-mapped SNPs. This was a study in inflammatory bowel disease where they found that 30% of their fine-mapped variants were in coding regions. And, of course, those types of variants are very valuable; they point to genes directly.
Okay, so next, I’m going to talk about eQTLs. I just want to pause and see if there are any questions.
Host: Thank you, Luke. I don’t see any questions yet, so.
Luke: Okay. So, eQTLs are a very direct approach to link SNPs with the genes they regulate.
So the hypothesis behind eQTL studies is that there’s some causal SNP that is associated with disease risk in your GWAS, and you look at all the nearby genes, and you see that, well, here’s a gene that this SNP actually regulates. The SNP is associated with the expression level of this gene, maybe in some tissue or cell type that you think is disease-relevant. That would be good evidence that what the SNP is doing is affecting the gene, and the gene is affecting the risk of disease.
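At its core, a single-SNP eQTL test is a regression of expression on additive genotype. The sketch below is a minimal illustration with simulated data; real eQTL analyses additionally adjust for ancestry, batch, and inferred expression covariates:

```python
import numpy as np
from math import erfc, sqrt

def eqtl_test(genotype, expression):
    """Regress expression on genotype (0/1/2); return the OLS slope and a
    large-sample normal-approximation two-sided p-value."""
    g = genotype - genotype.mean()
    e = expression - expression.mean()
    beta = (g @ e) / (g @ g)                    # OLS slope
    resid = e - beta * g
    se = np.sqrt((resid @ resid) / (len(g) - 2) / (g @ g))
    z = beta / se
    p = erfc(abs(z) / sqrt(2))                  # two-sided normal p-value
    return beta, p

rng = np.random.default_rng(3)
geno = rng.integers(0, 3, size=500).astype(float)
expr = 0.5 * geno + rng.normal(size=500)        # a simulated true eQTL
beta, p = eqtl_test(geno, expr)
print(round(beta, 2), p < 1e-6)                 # recovers the effect
```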
So, the GTEx project in particular is an eQTL study that is not the first or the largest, but it is the most comprehensive in terms of the tissues represented. It’s very much motivated by the observation that disease-relevant transcriptional processes are probably cell type- or tissue-specific.
There have been some challenges with this approach, though, and the first challenge is co-regulation, in particular co-regulation across cell types and tissues. Whereas a lot of regulatory elements, especially enhancers, are cell type-specific, a lot if not most eQTLs are not tissue-specific. This is a figure from Urbut et al., who did basically a clustering of eQTLs by their pattern of tissue specificity. The pattern that explained the most eQTLs was one where the variants affect expression levels across all the tissues, with effects particularly correlated across different brain regions, and also strongly correlated across non-brain regions. The result is that when you look at the relevance of these eQTLs in GWAS, it’s hard to pick out the right tissues.
So, this is a figure from Hormozdiari et al., who performed fine-mapping of these eQTL signals, identified likely causally linked eQTLs (causal for gene expression, I mean), and then looked at the disease heritability enrichment in those variants, finding that actually, the strongest disease heritability is associated with the tissues with the largest sample sizes and not necessarily the tissues you’d expect to be involved in various traits. That, of course, is a consequence of the fact that the eQTL signal is relatively similar across tissues, with important exceptions.
Another flavor of this challenge is co-regulation across nearby genes. It’s actually not always the case that you see a SNP and it’s clear that it regulates one gene specifically and not the others. A lot of enhancers will activate the transcription of all of their nearby genes, and accordingly a lot of regulatory SNPs affect multiple genes, resulting in these patterns of co-regulation. So, this is a heatmap showing the cis-genetic correlations between the expression levels of nearby genes at a locus. You get these blocks of correlated genes, and you also get faraway correlations as well. Some of these are negative correlations, which are equally problematic for identifying causal genes. So, it’s much less informative as to which is the disease gene if your disease-associated SNP is regulating all of the genes at the locus.
And another reason that you can get co-regulation is just LD. The variants themselves can be highly correlated, and then even if the causal variant for your trait only affects the expression level of one gene, it might be in LD with a different SNP that affects the regulation of a different gene. So, you can have co-regulation due to LD as well. And this is a figure from Chun et al., who performed co-localization. Co-localization is an approach similar to fine-mapping, trying to identify shared causal SNPs, so SNPs that are causal for both gene expression and the trait, as opposed to different SNPs. Here, they’re looking at autoimmune disease loci and at immune cells where there is an eQTL at the same locus. And they found that most of the time, like 80 or 90% of the time, even though there’s an autoimmune association and an immune cell eQTL, the causal variants are actually different.
So the approach to address this challenge (maybe I shouldn’t call it a solution) that has been most successful is to explicitly model co-regulation. This is a figure from [audio cuts out] [indistinguishable] tissues in terms of the eQTL enrichments for different traits, and most of the time, that turns up sensible tissues. And this is another figure from Mancuso et al., showing basically a fine-mapping approach at the level of genes. So, we can model co-regulation across genes and identify which genes look like their expression levels are causal. These blue triangles are sort of their credible set. This is a locus where lots and lots of genes look like they’re associated; this is the transcriptome-wide significance threshold. Lots of genes look like they’re associated, but their model thinks that one of these five genes here is the causal gene. In particular, this gene and this gene are strongly associated, but they’re probably just along for the ride.
Host: Luke, someone in the audience, Cal, has just raised a hand, so I’m gonna see if I can unmute Cal to ask a question. Cal, you should be able to speak now. Let’s see. I’m not sure that, um, oh, Cal can’t hear you, or at least I can’t hear you.
Cal: I raised my hand by accident.
Host: Oh, okay, great. Um, maybe we will assume for the moment that that’s the case, but, Cal, if you have a question, please do post it in the Q&A, and we’ll handle it that way. Thanks.
Luke: Yeah, I’m definitely happy to take questions at any point. I’ll pause at the end of the section, which is just coming up in a few slides as well.
Okay, so, and of course, these methods also need to account for LD, which is another major challenge.
The second challenge associated with eQTLs is that they may measure the wrong thing. In particular, they might measure expression level in the wrong cell type or tissue. Well, probably not the wrong tissue, since GTEx measures essentially all the tissues, but it might be the wrong cell type, the wrong cellular context, or the wrong stage of development. These are figures from Yao et al. Doug developed a model where basically you’ve got some observed expression levels, you’ve got SNPs, and you’ve got observed effects of the SNPs on the expression levels and of the SNPs on your trait or disease. The model estimates what fraction of heritability looks like it’s explained by these observed gene expression levels versus not mediated by them, that is, explained by some other mechanism. Under this model, Doug estimated that only about 10% of heritability is actually explained by gene expression levels in the observed cell types and tissues. That’s consistent, of course, with the finding from Chun et al. that oftentimes the eQTL and the disease-associated SNP are just different variants.
And then the other flavor of the measuring-the-wrong-thing challenge is that you might be finding eQTLs for the wrong genes. This is an interesting figure from Yao et al. again. Surprisingly, genes with higher cis-heritability, so genes with more eQTLs or stronger eQTLs, seemed to mediate less trait heritability. They seem to be depleted of trait heritability relative to genes with just a few eQTLs or only weak eQTLs. That’s consistent with a possible effect of negative selection. What might be going on is that important disease-associated genes are under selective constraint, and they don’t have strong eQTLs or a lot of eQTLs. So then you perform an eQTL study in a couple hundred individuals, and you mostly pick out the strong eQTLs, and the strong eQTLs are mostly those for less important genes. Mostafavi et al., a really nice recent preprint, explores this model where you’ve got genes that matter for a trait, and those genes are under strong negative selection, and asks what happens to your eQTL signal. They find that, as a result of this type of effect, there are all sorts of differences between genes that look like they’re associated with traits and genes that have eQTLs. They think this explains the limited overlap between eQTLs and disease genes, and why it’s been a challenge to use this approach.
And then, of course, the possible solutions: for measuring the wrong cell types, a possible solution is to measure the right cell types. And if you're not getting eQTLs for the right genes because those genes only have weak eQTLs, a possible solution is to use larger sample sizes. So there's a lot of interest right now in single-cell RNA-seq, single-cell eQTLs in particular, and really context-specific eQTLs, and a lot of interest in scaling that up to large sample sizes.
And before passing it over to Dan, I'll just note that some people think you should just pick the closest gene instead of using eQTLs. Eric Fauman is very vocal on Twitter about this. He thinks that, basically, you should just look at the closest gene, or the closest couple of genes, and forget about the eQTL signal. As an illustration of this point, he really likes to use metabolomics GWAS, where we often actually know what the causal gene is, just from understanding the biology, and GWAS will often pull out these whopping hits, and it turns out the closest gene is the causal one. That sort of approach to figuring out what makes a good SNP-to-gene strategy is the same sort of approach that we use, that Dan is about to tell you about, and that our collaborators use as well. So now I'll pause for questions before handing over to Dan.
Host: Thank you very much, Luke. I don't see any questions yet. One thing I'd be curious to learn more about is whether you have detected any patterns in whether the closest-gene assumption, as a starting point, is a better or worse approach for some types of phenotypes than others.
Luke: Yeah, that's a good question. I suspect it's especially good for metabolomics GWAS, because they've got these whopping hits that are just much stronger than what you observe in other GWAS, and maybe the way you get a really strong hit like that is basically by being a promoter variant or being very close to the causal gene. So I suspect that's the case. Yeah.
Host: Great. Thank you.
Dan: Fantastic. Well, thanks, Luke, for that great overview, and I’m gonna first talk about our recent work on estimating gene-mediated heritability with the abstract mediation model.
So, as Luke talked about, it's often unknown which SNPs regulate which genes, and this obscures something we might be particularly interested in: gene-mediated heritability. Coming at it from the rare-variant sequencing perspective, we get coding variants, which make estimation of gene-mediated heritability much more straightforward. But from a common-variant perspective, where the mapping between largely non-coding common variants and genes is uncertain, it's much more challenging to estimate. As Luke mentioned, there are a number of potential approaches to this issue. We can estimate gene-mediated heritability, or something close to it, using proximity-based enrichment. These are approaches like LDSC and MAGMA, which can estimate the heritability enrichment of SNPs near a gene set of interest. But those SNPs may not actually be regulating those genes. In the toy example in the bottom left, we have a SNP that regulates gene B but is closest to gene A, so a closest-gene approach might fully map its heritability to gene A, when in reality that's not the appropriate mapping.
The second potential approach, from the same paper that Luke just talked about by Doug Yao and colleagues, is the idea of integrating molecular data like eQTLs to estimate mediation. But that paper found that only a small fraction of disease heritability is mediated by measured expression levels.
The approach that we took was not to use measured molecular phenotypes, given the challenges we've discussed, but instead to develop an abstract mediation approach: instead of directly observed SNP-to-gene effects, for example from eQTLs, we used enriched proxies. We use SNP-to-gene proximity for the SNP-to-gene effect, membership of genes in an enriched gene set, and results from genome-wide association studies for the SNP-to-trait effects. The idea is that we can estimate SNP-to-gene architecture from the decay in heritability enrichment as the set of enriched genes gets farther away from those variants. This allows us to partition heritability across SNP-gene pairs and then estimate gene-mediated heritability in any combination we would like, for example in a gene set of interest.
To illustrate this a little further, we model p_k, the proportion of heritability, on average genome-wide, mediated by the k-th closest gene, which in expectation decays with distance, along with membership of genes in an enriched gene set, for example the set of constrained genes. The expected SNP heritability is then essentially a function of whether a SNP has a proximate gene in the enriched gene set and the proportion of heritability, on average, mediated by genes at that proximity rank. And we can estimate all of this using an approach similar to LD score regression.
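As a rough sketch of that expectation, here is a toy version in Python. The p_k values, enrichment factor, and baseline heritability are all hypothetical, and the real AMM estimator works from summary statistics via an LD-score-style regression rather than this direct calculation:

```python
import numpy as np

# Hypothetical per-rank mediation proportions: p[k] is the average share
# of a SNP's heritability mediated by its (k+1)-th closest gene.
p = np.array([0.30, 0.15, 0.10, 0.08, 0.07])
p = p / p.sum()          # normalize for this toy so the shares sum to 1

enrichment = 4.0         # hypothetical heritability enrichment of the gene set
base_h2 = 1e-5           # hypothetical average per-SNP heritability

def expected_snp_h2(in_set):
    """Expected heritability of one SNP.

    in_set[k] is True if the SNP's (k+1)-th closest gene is in the
    enriched gene set; heritability mediated through those genes is
    scaled up by the enrichment factor.
    """
    scale = np.where(in_set, enrichment, 1.0)
    return base_h2 * np.sum(p * scale)

# A SNP whose closest gene is in the enriched set, versus a SNP with no
# proximate genes in the set:
h2_near = expected_snp_h2([True, False, False, False, False])
h2_far = expected_snp_h2([False] * 5)
print(h2_near / h2_far)  # greater than 1, but less than the full enrichment
```

The ratio is below the full enrichment factor because only the share of heritability mediated by the closest gene, p_1, gets scaled up; that dilution is what lets the decay of enrichment with proximity rank identify the p_k values.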
The approach behaves well in simulations, and there are a couple of important assumptions in this model that I wanted to go over.
So, the first is that the heritability enrichment of SNPs mapped to a gene set is actually mediated by those genes. For example, if we pick genes based on GWAS results, we can get spurious enrichment: genes in the set may sit in disease-relevant regions of the genome without being relevant themselves.
The second is that the cis-regulatory architecture does not vary across gene sets. The SNP-to-gene architecture we estimate comes from our training set of enriched genes, primarily the set of constrained genes, and if the regulatory architecture varies outside that set, we won't be able to see it. You could violate this by picking a training gene set with a particularly different cis-regulatory architecture.
And then the third assumption has to do with cis versus trans mediation. So, we assume that genes in cis, nearby genes, mediate all heritability. But if there are some trans regulatory effects that are not mediated by those genes in cis, then that can give us bias.
So, we used the set of constrained genes, meta-analyzed across 47 traits, to estimate this SNP-to-gene architecture. We estimate that, on average, the closest gene mediates just under 30% of heritability, and that these fractions decay as genes get farther away, though a detectable proportion of heritability is mediated by more distal genes. This is consistent with the idea that if you had to pick the single gene that mediates the most heritability on average, it would be the closest gene, but substantial proportions are mediated by more distal genes.
We checked that these SNP-to-gene architecture estimates weren't specific to constrained genes, and we see qualitatively similar estimates using gene sets of tissue-specific expression matched to relevant traits, here from liver and from cortex. So now, with these SNP-to-gene architecture estimates, these proportions of heritability mediated by the closest genes, we can go on to estimate gene-mediated heritability.
These are estimates from a range of gene sets, and we see a number of interesting things. For example, for a small gene set of Mendelian lipid disorders and lipid drug targets, we see massive enrichment of LDL heritability: these roughly 20 genes mediate approximately a quarter of LDL heritability. In contrast, for genes implicated in Mendelian developmental disorders, we see much more modest enrichments, which is consistent with differences in negative selection affecting these gene sets and traits.
We also compared mediated heritability enrichments across tissues based on the magnitude of gene expression. On the left is the cumulative expression fraction for cortex and liver: many fewer genes are highly expressed in liver, relative to cortex, which has a much flatter distribution of expression across genes. And we see this difference echoed in the mediated heritability enrichments: for liver, we see strong mediated heritability enrichments in the most highly expressed genes, decaying quickly, while for cortex, the enrichments are much flatter across the distribution of gene expression. I think these observations are consistent with elements of the omnigenic model, where expression in the appropriate cell type is sufficient for enrichment of those genes, perhaps as peripheral genes in that model.
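The cumulative expression fraction contrast can be illustrated with simulated expression profiles. The heavy-tailed "liver-like" and flatter "cortex-like" distributions below are invented stand-ins, not GTEx data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5000

# Hypothetical expression profiles (illustration only): a heavy-tailed,
# liver-like distribution where a few genes dominate total expression,
# and a lighter-tailed, cortex-like distribution that is much flatter.
skewed = rng.pareto(0.8, n_genes)
flat = rng.pareto(3.0, n_genes)

def cumulative_expression_fraction(expr, top_frac=0.1):
    """Fraction of total expression contributed by the top genes."""
    expr = np.sort(expr)[::-1]
    n_top = int(len(expr) * top_frac)
    return expr[:n_top].sum() / expr.sum()

print(cumulative_expression_fraction(skewed))  # large: few genes dominate
print(cumulative_expression_fraction(flat))    # smaller: flatter profile
```

In the skewed profile, the top 10% of genes carry most of the total expression, which is the liver-like pattern where mediated heritability enrichment concentrates in the most-expressed genes.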
A couple of important limitations of the approach: the gene sets must be pre-specified, and, more broadly, the approach doesn't identify causal genes; we're estimating SNP-to-gene architecture on average across genes. Cellular context and cell state, as Luke mentioned, are very important for many of these cis-regulatory elements and gene mappings, and AMM doesn't identify the relevant cellular context. Similarly, the approach has limited power for individual genes and small gene sets. This is different from the next paper I'll present, which looks both at genome-wide patterns and at individual links between SNPs and genes.
I’ll finish this section by mentioning that the AMM approach is available as a command-line tool and operates on summary statistics and gene sets. And as I mentioned before, estimation of gene-mediated heritability using AMM provides an apples-to-apples comparison for gene-mediated rare variant heritability from exome sequencing. And we used this comparison recently in a paper where we characterized the genetic architecture of rare coding variation and compared it to common variation.
Host: Thank you very much again. Before you go on, there were two questions, both of them relevant to eQTLs, so I think they may pertain to both of the sections so far. So, the first is this: for eQTLs, the eQTLGen Consortium combines PBMC (15% of cohorts) and whole blood (85% of cohorts) eQTLs. The audience member would like to know if you think that the eQTLs may therefore represent different signals.
Luke: So, if I understand the question correctly, I think it’s saying that there’s a mixture of different cell types. Is that right?
Host: That is my interpretation too. Yes, although if the person who raised this question is here and would like to elaborate, of course, please do in the Q&A.
Luke: That's a good question. Of course, most tissues are a mixture of different cell types, and it sounds like eQTLGen might actually have a mixture of different sample types in there. I know eQTLGen has a massive sample size, so it might be a meta-analysis. It's definitely a challenge, especially in whole blood (which I think eQTLGen mostly is), that you have a mixture of different cell types. For some eQTLs, that's not a problem, because those eQTLs are actually shared across cell types. For a promoter variant in particular, you might expect it to have an effect in every cell type where the gene is expressed, and you're going to maximize your power to detect it by pooling all those cell types together. On the other hand, if you have a cell type-specific enhancer, that effect might be missed, or it might get swamped by stronger signals when you mix a whole bunch of cell types. So that's definitely a potential problem.
Host: Great, thank you very much. And then the second eQTL-related question is this: for GWAS and eQTLs, the signals do not seem to always align well. And when the goal is to predict gene expression levels, the audience member would like to know your thoughts on using genotype data versus eQTLs to do that.
Luke: Sorry, the question was: when the goal is to predict gene expression?
Host: I believe the question may be your thoughts on using GWAS as compared to eQTLs to do that.
Luke: If your goal is to predict gene expression levels, then eQTLs are great. eQTLs are, by definition, variants associated with gene expression levels, so if you want to predict the expression level of a gene from genotypes, then eQTLs are definitely a good approach.
Host: Great. I believe that’s the question, but again, if you’re in the audience and want to elaborate, please do. And that’s all the questions at the moment. So, thank you.
Dan: All right. For this last section, I'm going to present a recent paper from one of our colleagues, Steven Gazal, and the Price group, on the prospect of combining different SNP-to-gene linking strategies, both to connect individual loci to mediating genes and for genome-wide characterization of gene-mediated heritability.
As we've talked about, there are many approaches to prioritizing target genes from causal variants, and each has strengths and weaknesses. Previous papers have suggested that a combined approach could perform better than any individual SNP-to-gene mapping strategy, but it's unclear, first, how to estimate the performance of an individual SNP-to-gene approach, and second, how to take those performance measures and combine approaches to optimize the overall SNP-to-gene mapping strategy. These were the two challenges this paper took on.
They considered 13 primary SNP-to-gene strategies, including a number that we have mentioned. Some are not functionally informed, for instance mapping variants to their closest TSS, or taking exonic variants and mapping them to the corresponding gene. There are also functional approaches, including using data from GTEx and other resources to map variants to mediating genes through eQTLs, as well as Hi-C approaches and more sophisticated approaches like the activity-by-contact model.
The authors synthesized each of these SNP-to-gene mapping approaches into what can be thought of as a SNP-to-gene matrix, where each SNP-gene pair has a numeric value representing the strength of the connection between that SNP and gene. You can then use that as a genome annotation to estimate heritability enrichments for that SNP-to-gene mapping approach.
So, the authors took a heritability-based approach to assessing different SNP-to-gene mapping strategies, and there were a couple of key quantities they looked at. The first is the coverage of a SNP-to-gene mapping approach: the percent of SNP heritability that's linked to genes. The second is precision: essentially, how efficiently the strategy captures SNP heritability, with the intuition that a precise SNP-to-gene strategy is more likely to link a SNP to a disease-critical gene. Here, as in AMM, they use the set of constrained genes as a benchmark for assessing the performance of different SNP-to-gene approaches.
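As a toy illustration of those two quantities, the sketch below computes coverage and a precision-like value from invented per-SNP heritabilities and invented links; the actual estimators in Gazal et al. are heritability-based regressions, and every number here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

n_snps = 200
snp_h2 = rng.exponential(1e-5, n_snps)       # hypothetical per-SNP heritability

# Hypothetical S2G strategy: the gene each SNP links to (-1 = unlinked),
# and whether that linked gene is in the benchmark (e.g. constrained) set.
linked_gene = rng.integers(-1, 20, n_snps)
in_benchmark = rng.random(n_snps) < 0.3

linked = linked_gene >= 0

# Coverage: fraction of total SNP heritability linked to any gene.
coverage = snp_h2[linked].sum() / snp_h2.sum()

# Precision proxy: of the linked heritability, the share landing on
# benchmark genes, i.e. how efficiently the strategy hits critical genes.
precision = snp_h2[linked & in_benchmark].sum() / snp_h2[linked].sum()

print(coverage, precision)
```

A strategy like "exon" would show high precision but low coverage in this picture (few SNPs linked, mostly to the right genes), while "closest TSS" would show the opposite pattern.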
Here's one of the primary results: a precision-recall plot. You can see, for example, that approaches like promoter and exon are highly precise but capture only a small fraction of overall heritability, because only a small fraction of disease-associated variants are exonic, while closest TSS captures a much greater fraction of trait heritability but often maps to genes that are less likely to be disease-critical.
One of the primary advances here is the development of combined SNP-to-gene strategies. They use linear combinations of individual SNP-to-gene strategies, with weights w_k across the K SNP-to-gene approaches, where the weights are determined by maximizing recall subject to a precision threshold. Here's the composition of the combined SNP-to-gene approach, which is heavily based on exon and promoter annotations but includes a number of functional approaches as well. And on the top right of this plot, you can see the performance of the combined SNP-to-gene approach, which is better than any of the individual approaches.
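The weight-optimization idea can be sketched as a small grid search. The scores, the benchmark labels, and the thresholds below are all invented, and the real optimization in Gazal et al. differs in its exact objective and evaluation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy benchmark: 500 candidate SNP-gene links, ~20% truly causal, with
# scores from two hypothetical S2G strategies (all values invented).
n = 500
truth = rng.random(n) < 0.2
score_a = 0.5 * truth + 0.5 * rng.random(n)   # the more precise strategy
score_b = 0.2 * truth + 0.8 * rng.random(n)   # the noisier strategy

def precision_recall(called):
    """Precision and recall of a boolean call set against the benchmark."""
    tp = (called & truth).sum()
    precision = tp / max(called.sum(), 1)
    recall = tp / truth.sum()
    return precision, recall

# Grid-search linear-combination weights w * A + (1 - w) * B, keeping the
# combination with the best recall among those meeting a precision bar.
best_recall, best_w = 0.0, None
for w in np.linspace(0.0, 1.0, 21):
    combined = w * score_a + (1.0 - w) * score_b
    prec, rec = precision_recall(combined >= 0.6)
    if prec >= 0.6 and rec > best_recall:
        best_recall, best_w = rec, w

print("best weight on strategy A:", best_w, "recall:", best_recall)
```

The design choice mirrored here is that the combination is evaluated only at operating points meeting a minimum precision, so a noisy but broad strategy can contribute coverage without dragging precision below the bar.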
There are a couple of applications of the combined approach in this paper. The authors look at nomination of causal genes at disease loci, as well as genome-wide assessments of disease "omnigenicity", comparing the effective number of causal SNPs with the effective number of causal genes. They see a wide range, from highly polygenic traits like neuroticism to traits with a much sparser genetic architecture, like hair color.
The combined SNP-to-gene strategies are publicly available and are a great resource that will definitely continue to develop over time, as other constituent SNP-to-gene approaches come online. And I think that is it.
Host: Great, thank you so much to you both. I don't see any other questions in the Q&A at the moment, so I encourage anyone in the audience who has remaining questions to post those now. We'll give people a couple of seconds to think about that.
So, I have a question to get things started, if that's okay. Thank you both for an excellent talk today, covering a huge amount of recent advancement as well as some of the persistent stumbling blocks. For someone who, like Dan, is moving toward the clinic: what do you do once you have a SNP that's associated with a disease gene? What do you see as the future here? Where we see these SNPs that may confer an incremental risk, what do you think we'll be doing with this information in five or ten years?
Dan: Yeah, I think there are a range of different paths in that context. There's the approach of taking the SNP-to-gene link and then deeply studying the biology of that association, essentially using GWAS to get to a set of relevant genes and focusing in on those. Given the recent expansion of exome sequencing, I also think it's really important to look at the overlap of evidence between common and rare variation, to see whether, in the context of gene loss-of-function or other variants with potentially larger effects on that gene, we can learn about the potential efficacy and safety of a therapeutic targeting that gene. But there have also been great recent examples, in sickle cell disease and elsewhere, of targeting the non-coding cis-regulatory elements themselves as a therapeutic approach. So it's definitely not necessary to use the non-coding region just as a tool to get to the gene; there are exciting examples of therapeutic modulation of the cis architecture as well.
Host: Thank you. A new question has just popped up in the Q&A, and here it is: Do you expect cis eQTL heritability to change depending on different tissue developmental times and postmortem versus fresh, and how do we tease apart noise from true signal?
Luke: Yeah, that's a good question. I do expect that the cis-regulatory architecture will be different across developmental stages and, more generally, across different cellular states. We know that, especially during development, cells do different things at different time points and go through stages where they have different roles, and I don't think we understand the genetic effects on those processes very well. We know there are enhancers that are specifically active in fetal brain, for example, and I don't think we understand well at all what regulatory variation does at those time points.
Host: Great. Thank you. Well, I think that’s the last of the questions, so this seems like a good time to conclude. Thank you so much to you both, that was really, really an exciting and thought-provoking talk, so thank you, and we will talk with you and look forward to hearing more in the future.
Dan and Luke: Thank you very much. Thank you.