Chapter 9.4: Interactions with Environmental Factors (Video Transcript)


Statistics of the Interaction Term

Dummy Variables: Interaction Terms Explanation

Title: Dummy variables: interaction terms explanation

Presenter(s): Ben Lambert, PhD (College of Engineering, Mathematics and Physical Sciences, University of Exeter)

Ben Lambert:

So let’s think back to our example we had in the last video. So let’s say we were interested in how wage rates varied between, let’s say, male and female people. So the idea is that we regress wage on, let’s say, now that we’re sort of implicitly assuming that we have all these other variables, so I’m not going to include them explicitly. We’re just gonna have alpha plus, let’s say, beta-1 times the number of years of education plus beta-2 times our sort of sex variable where our sex variable takes on the value of 1 if the individual is female and it takes on a value of naught if the individual is male. But then we included a further term, which was, let’s say, beta-3 where we multiplied sex times education. So the idea is that we have collected all these variables across all our individuals in the population or in our sample, rather, and we have included a multiplicative term in our regression specification.

So what does this multiplicative term mean? How do we interpret this beta-3? Well, let’s think about again what this sort of average wage rate would be for a female and compare it with the average wage rate for a male. So the average wage rate for a female, if they had a given number of years of education, would be alpha plus beta-1 times the number of years of education which they had, plus, well, this sex variable now takes on the value of 1, so I’ve just got plus beta-2. And now our sex variable here is taking on a value of 1 as well, so I’ve got plus beta-3 times the number of years of education, okay? And then we can sort of simplify this if we notice that our alpha and our beta-2 are both constants here. So, writing those both at the start of the model, we get alpha plus beta-2, let’s say, and then we recognize that we have essentially got two education terms. We’ve got this one and this one, so I could simplify these as well by combining them. I just get beta-1 plus beta-3, all multiplied by the number of years of education.

Okay, so that’s for the female case. What do we have for the male? So the idea for the males in our sample is that the average wage rate is given by alpha plus beta-1 times the number of years of education because our sex variable takes on a value of zero, so these last two terms actually vanish for the male. So now we can think about what the effect of our sex variable has been on our specification and our interpretation.
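Written out, the two cases above look like this. The names wage, educ, and sex are shorthand for the variables in the spoken example; the error term and the other, implicitly assumed regressors are left out.

```latex
\begin{align*}
\text{wage} &= \alpha + \beta_1\,\text{educ} + \beta_2\,\text{sex} + \beta_3\,(\text{sex}\times\text{educ}) \\
\text{female } (\text{sex}=1):\quad E[\text{wage}] &= (\alpha + \beta_2) + (\beta_1 + \beta_3)\,\text{educ} \\
\text{male } (\text{sex}=0):\quad E[\text{wage}] &= \alpha + \beta_1\,\text{educ}
\end{align*}
```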

So what does beta-2 represent here? Well, beta-2 represents the additional premium which females would have over males if they had zero years’ worth of education because if they had zero years’ worth of education, then both of the education terms would disappear, and the only difference between males and females would, in fact, actually be our beta-2. So just like we showed in the last video, that is actually the wage premium which females have over males in this case of having zero years’ worth of education.

Okay, so what does beta-3 represent? Well, notice that the only difference between these two specifications in terms of the education variable is that essentially the partial effect of education for females has been boosted by an amount beta-3 relative to the males. So what does that mean? Well, if beta-3 was greater than zero, it means that the additional effect of one more year of education for females was, in fact, greater than that for males. If it was less than zero, then it would be the other way around, so having one more year of education would, on average, tend to cause a smaller increase in female wage than it would do for males. So we can sort of think about what these cross terms mean in our regression specification. Well, essentially what they mean is that if I’m interacting a dummy variable with a continuous variable, it allows us to have different slopes of that particular continuous variable across the two different values which our dummy variable can take on. And that’s quite an appropriate assumption to make in a whole host of different situations. In this particular situation, you might suppose that there might be a different effect of education between males and females, but there are a whole host of other ways in which this could be true across other types of models.
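As a small sketch of the "different slopes" idea, the snippet below turns the four coefficients into one intercept and one slope per group. The coefficient values are hypothetical, picked only for illustration, and the function names are not from the video.

```python
def group_line(alpha, beta1, beta2, beta3, sex):
    """Return (intercept, slope) of the wage-education line for sex in {0, 1}."""
    intercept = alpha + beta2 * sex
    slope = beta1 + beta3 * sex
    return intercept, slope


def predicted_wage(alpha, beta1, beta2, beta3, sex, educ):
    """Average wage implied by the full specification with the interaction term."""
    return alpha + beta1 * educ + beta2 * sex + beta3 * sex * educ


# Hypothetical coefficients, chosen only for illustration.
coefs = dict(alpha=5.0, beta1=1.0, beta2=-0.5, beta3=0.25)

male = group_line(sex=0, **coefs)
female = group_line(sex=1, **coefs)
print("male   (intercept, slope):", male)
print("female (intercept, slope):", female)
print("slope difference (beta-3):", female[1] - male[1])
```

The intercepts differ by beta-2 and the slopes differ by beta-3, which is exactly the reading given above.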


Continuous Variables: Interaction Term Interpretation

Title: Continuous variables: interaction term interpretation

Presenter(s): Ben Lambert, PhD (College of Engineering, Mathematics and Physical Sciences, University of Exeter)

Ben Lambert:

Hi there! In this video, I wanted to explain what the interpretation is when we have two continuous variables multiplied together in some sort of regression model. Okay, so let’s think about a particular example. The idea here is that let’s say we were trying to explain a company’s level of sales, but we are trying to do that in terms of, let’s say, the effect of price and let’s say the effect of advertising. So this might be the company’s level of advertising spend, and this is just the company’s price set for a particular product. So traditional theory would sort of expect us to have a downward-sloping demand curve, so we would expect beta-1 to be less than zero because if you lower the price, then sales increase, and we would sort of expect that if we spend more on advertising, then sales tend to be higher as well. So we’ve got beta-1 being less than 0 and beta-2 being greater than 0, but let’s say we included a third term here, which is beta-3, and now included the product of price and the company’s spend on advertising. What interpretation can we actually give to this beta-3? Well, let’s think about this in two different situations.

(First Situation): So let’s say that the company was, let’s say, spending $100,000 on advertising, and let’s think about what the company’s expected sales would be under that situation. Well, the idea is that the company’s level of sales we would expect if advertising is $100,000 would be equal to alpha plus beta-1 times the price plus now we’re going to get $100,000 times beta-2 for this third term, and then we’re going to get plus $100,000 times beta-3 times the price. Okay, so what does this show us?

Well, we can actually sort of think about the effect of price because price is appearing twice in our model here. We can sort of combine the price terms to create a sort of aggregate effect of price. So here the aggregate effect of price would be beta-1 plus $100,000 times beta-3, and then that would all be multiplied by the price.
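In symbols, the model and the first situation can be written as follows. Here adv stands for advertising spend; the notation is shorthand for the spoken description rather than something shown in the transcript.

```latex
\begin{align*}
\text{sales} &= \alpha + \beta_1\,\text{price} + \beta_2\,\text{adv} + \beta_3\,(\text{price}\times\text{adv}) \\
\text{adv} = 100{,}000:\quad E[\text{sales}] &= (\alpha + 100{,}000\,\beta_2) + (\beta_1 + 100{,}000\,\beta_3)\,\text{price}
\end{align*}
```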

So what is the interpretation of beta-3, and what sign would we expect beta-3 to have in this case? Well, we would actually expect that beta-3 would be greater than zero. Why would we expect that?

Well, the idea here is that if you spend more money on advertising, then that tends to decrease the sensitivity of your consumers to price changes in that product. So notice that this appears because beta-1 is less than zero. So if we’re adding $100,000 times beta-3, where beta-3 is greater than zero, then the idea is actually we have decreased the sort of sensitivity of consumers to price changes, or we sort of made our customers less reactive to price changes, which is something which you might expect companies’ sales to exhibit. Right? You might expect if you spend more money on advertising, you increase the brand value or the sort of non-tangible effect which consumers consider when they’re thinking about your brand, so that might make them less price-sensitive. Okay, so that’s kind of what beta-3 is representing in this case.

(Second Situation): Let’s think about another example whereby let’s say we had the price level be set to ten, and let’s sort of say what we might predict the company sales to be in that case. So the idea is that the company sales, on average, when the price was ten, would be equal to alpha plus ten times beta-1 plus beta-2 times the level of advertising, which we haven’t specified. Plus now we can have ten times beta-3 times the level of advertising. So notice that again here we have two terms which essentially have the same variable. So we can combine these, so we now have an aggregate effect of advertising being beta-2 plus ten times beta-3. So what does beta-3 represent in this case?

Well, remember that we found from the first example that beta-3, by theory, should be greater than zero. Well, what does it say in this case? It says if your price is higher (so remember the prices are represented by this ten here), then the effect of advertising tends to be greater. So that might be the case. If you have a higher premium price product, you might have to demonstrate to consumers that it’s worthwhile buying, so the effect of advertising is greater than if you have, let’s say, a low-price product which consumers would flock towards anyway.

So beta-3 generally, what does it mean? What have we learned from considering these two cases? Well, it shows that the effect of price depends on the level of advertising spend, and the effect of advertising tends to be determined by, or tends to be affected by, the level of price, so beta-3 is sort of a way of adjusting the effect of price and advertising to take into account their multiplicative effects on one another.
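The two situations can be sketched in a few lines of code: the effect of price is beta-1 plus beta-3 times advertising, and the effect of advertising is beta-2 plus beta-3 times price. The coefficient values below are hypothetical; only their signs follow the discussion.

```python
def price_effect(beta1, beta3, advertising):
    """Change in expected sales per unit increase in price, at a given ad spend."""
    return beta1 + beta3 * advertising


def advertising_effect(beta2, beta3, price):
    """Change in expected sales per unit increase in ad spend, at a given price."""
    return beta2 + beta3 * price


# Hypothetical coefficients: beta1 < 0, beta2 > 0, beta3 > 0, as in the discussion.
beta1, beta2, beta3 = -50.0, 0.01, 0.0002

for adv in (0, 100_000):
    print(f"advertising={adv:>7}: price effect = {price_effect(beta1, beta3, adv):.2f}")
for price in (0, 10):
    print(f"price={price:>2}: advertising effect = {advertising_effect(beta2, beta3, price):.4f}")
```

With a positive beta-3, the price effect moves closer to zero as advertising rises (less price sensitivity), and the advertising effect grows with price.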


Gene-Environment Interaction in Psychiatric Genetics

Title: Gene-environment interaction analysis

Presenter(s): Kenny Westerman, PhD (Broad Institute of Harvard and MIT)

Kenny Westerman:

Yeah, so I’m a little bit daunted by the scope of presenting in general on gene-environment interaction analysis because there’s so much that can be encompassed here. But I’ll do my best to provide a broad overview and I’m happy to go into any of the topics in more depth.

Outline

All right, so to start off here, I’ll generally present this in three chunks.

  1. Here we can start with gene-environment interactions in general: what are they, why are we interested in them, and how might we go about assessing or measuring them?

  2. The second, a little bit of investigation into why statistical power is a particular issue for gene-environment interactions (or G by E) and talk about some of the opportunities for addressing those and improving power.

  3. And then number three will be a bit of a grab bag of some additional topics that are relevant and potentially interesting for a future investigation.

Phenylketonuria: a high-impact gene-environment interaction

So, I want to start with an example here, and that example is of phenylketonuria, which is known as PKU. This is an example of a high-impact gene-environment interaction. Basically, the idea here is that phenylketonuria is a disease in which the phenylalanine hydroxylase gene, which is responsible for converting phenylalanine to tyrosine, loses most or almost all of its function. This leads to phenylalanine buildup, and that can downstream lead to intellectual disability and developmental problems quite early in life in babies if not addressed. It’s so impactful that it has long been part of routine neonatal testing, and that’s partly because it can be treated, in a sense, by specifically using a low phenylalanine diet in babies and throughout life, and that is able to largely remove these symptoms.

The interesting thing here, and the reason I call it a gene-environment interaction, is that if we look at the top row of this table here where we say we sort of have a normal phenylalanine hydroxylase gene, the diet doesn’t necessarily make a difference: whether it’s a low phenylalanine diet or a standard diet, that’s not going to change the disease phenotype. Likewise, in the case of a low phenylalanine diet, you will see a normal phenotype whether the gene is normal or the gene is dysfunctional. It’s only in this sort of two-hit synergistic manner that having both the dysfunctional phenylalanine hydroxylase gene and a standard diet will result in these disease symptoms, and this takes us towards the general concept of a gene-environment interaction.

What is a gene-environment interaction (GxE)?

The idea of a gene-environment interaction can be viewed from one of two perspectives. One, I call it genotype-centric, where we’re interested in a genetic effect, but that genetic effect depends on some exposure in terms of either its direction or its magnitude. Likewise, we can think of it, although it’s equivalent, in an exposure-centric perspective where we care about the effect or the association of some exposure with an outcome, but that depends on a genotype. And so what this looks like in practice: here’s my little cartoon.

Here, if we look at the left panel, there’s both an effect of the genotype—the phenotype increases as we go from genotype A to B—and the phenotype also increases as we go from environment one to two, but there’s no interaction here because these effects are independent of each other. Another way to say this is that these lines are parallel. So in the right panel instead here, we have an example of the G by E interaction because in genotype A, we see that the difference between environment one and two is different than in genotype B and vice versa.
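The "parallel lines" check can be written as a difference of differences on a 2x2 table of mean phenotypes. The cell means below are hypothetical stand-ins for the cartoon, and the PKU table simply codes disease symptoms as 1 and a normal phenotype as 0.

```python
def interaction_contrast(means):
    """Difference of differences for a 2x2 table of mean phenotypes.

    means[genotype][environment], with genotypes "A"/"B" and environments 1/2.
    Zero means the lines are parallel (no interaction on this scale).
    """
    env_effect_in_a = means["A"][2] - means["A"][1]
    env_effect_in_b = means["B"][2] - means["B"][1]
    return env_effect_in_b - env_effect_in_a


# Hypothetical cell means for the two cartoon panels.
additive = {"A": {1: 1.0, 2: 2.0}, "B": {1: 2.0, 2: 3.0}}
interacting = {"A": {1: 1.0, 2: 1.5}, "B": {1: 2.0, 2: 4.0}}

# PKU-style table: genotype A = normal gene, B = dysfunctional gene;
# environment 1 = low phenylalanine diet, 2 = standard diet.
pku = {"A": {1: 0, 2: 0}, "B": {1: 0, 2: 1}}

for name, table in [("additive", additive), ("interacting", interacting), ("pku", pku)]:
    print(name, interaction_contrast(table))
```

A contrast of zero corresponds to the left panel; anything else means the environment effect differs by genotype, and equivalently the genotype effect differs by environment.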

Other common concepts in genetics can reflect GxE

So just to expand this intuition just a little bit, other common concepts in genetics can reflect gene-environment interactions in a way that we might not typically think of. So other concepts that might encompass this include variable penetrance. So we talk about, “Well, this genetic mutation is impactful in some people, but it’s not impactful in others.” That might be because of interactions with the environment. When we stratify genetic associations, let’s say by sex, and we say there’s a genetic association in males that’s different than in females, this is an interaction between sex and the genetic variant. And also pharmacogenomics, because the outcome of interest in pharmacogenomics is the response to some exposure, which is a pharmaceutical drug in this case. This can also be thought of as a G by E.

Why do we care about GxE?

So, why do we care in general? There are a whole host of reasons why gene-environment interactions could be interesting. First is simply a better understanding of biology. So, if there’s a genetic variant where we think we understand how it’s acting, and it turns out that that depends on the environment, that could give really interesting clues for refining our understanding of the biology there. If we think in this genotype-centric manner that I talked about earlier, we might be interested in understanding genetic effects but in the context of the environment. So, maybe this helps us explain missing heritability; you know, we can’t find the right genetic effects to explain heritability because they are dependent in some way on the environment. Or it might help us detect genetic associations that we otherwise wouldn’t because again they’re hidden in some way. If we move to the exposure-centric perspective where we care about the effects of exposure being modulated by genotype, this is where we get into the realm of precision medicine, right? The idea that for me versus for you, based on our genotype, we should be taking different drugs, or we should be eating different diets or something like that.

Statistical model

We can start with a standard genome-wide association study (GWAS). You know we’re going to put one that’s very straightforward here. We have a genotype term and we have covariates, and the G by E extension is not too complicated. Essentially, we’re going to add one term, an exposure term, which is just like a covariate and it’s going to be treated as such. And then we have a G by E term, and this is the nuts and bolts of it, right? We have a product term, just a pointwise product between the genotype and the environment, and it’s that beta G by E that’s often of interest when we evaluate such a model. But once we’ve trained or fit such a model, we can then get quite a lot of things from it, in some sense. So, one, we have an interaction test, and that would be evaluating this beta G by E, and we would maybe, for that, be trying to understand biology or understand again specifically this synergy between the gene and the environment in impacting the outcome. We can also evaluate jointly, and there are statistical ways to do this. We can jointly evaluate both the genotype and the gene-environment interaction term, and this gives us some idea that maybe we can better detect genetic associations by incorporating both of these terms jointly than only the genetic term by itself. And likewise, though this is not a statistical test that’s often performed, if we’re thinking about precision medicine or an exposure-centric manner, we’re in some sense trying to holistically evaluate both the interaction term and the effect of the exposure so that we understand let’s say some notion of a context-specific impact overall of the exposure.
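The slide itself is not part of the transcript, so the following is a reconstruction of the two models and the two tests in conventional notation. The symbols (Y for the outcome, G for genotype, E for exposure, C for covariates) are assumptions chosen to match the spoken description.

```latex
\begin{align*}
\text{GWAS:}\quad Y &= \beta_0 + \beta_G\,G + \boldsymbol{\beta}_C^{\top}\mathbf{C} + \varepsilon \\
\text{GxE:}\quad Y &= \beta_0 + \beta_G\,G + \beta_E\,E + \beta_{G\times E}\,(G \cdot E) + \boldsymbol{\beta}_C^{\top}\mathbf{C} + \varepsilon \\[4pt]
\text{interaction test:}\quad & H_0:\ \beta_{G\times E} = 0 \\
\text{joint test:}\quad & H_0:\ \beta_G = \beta_{G\times E} = 0
\end{align*}
```

The interaction test has one degree of freedom and the joint test has two, since it constrains both the genotype and the interaction coefficients.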

Note: we can use “environment” liberally

I’ll make one note here. We can use "environment" liberally; we’re using the term gene-environment interaction. For me, that’s largely historical in the way I use that term, but we can expand far beyond what we think of as an environment. Let’s say that might be pollution or social factors, but we can expand to lifestyle, smoking, or diet, for example. We can expand to demographic factors like sex. We can expand to even physiological factors that are sometimes used as an outcome, like BMI; those could be used as exposures in a gene-environment interaction test. Here, because of my area of expertise and in order to constrain the space of where I’m getting samples from, I’ll be focusing specifically on dietary exposures and using BMI as an outcome, but that’s by no means the limit of the scope of gene-environment interactions.

GxE discovery as intended: saturated fat, APOA2, and BMI

So, the PKU, the phenylketonuria example is a particularly stark one because we have a phenotype that’s very notable, and it’s very responsive to this interaction. But in practice, we expect to find more modest effects. And so, I’ll give one little vignette here of a series of studies looking into one interaction that might give a little intuition for how we maybe hope or expect that this could work in practice. So, in a study in 2009, in the Framingham Heart Study, a group was able to look at stratifying individuals by both their saturated fat intake, which you see on the groups here on the x-axis, as well as genotype at a variant in APOA2. And you see those different colored bars, or the different shaded bars here, are those genotypes, and they’re interested in seeing how those different strata and that interaction of saturated fat and the genotype at APOA2 influenced BMI. And what we can see here in this diagram is that in individuals with low saturated fat intake, this genotype, while there might be a little decrease, we don’t see a statistically significant difference in BMI. However, if we look at the high saturated fat consumers, more than 22 grams per day, we see there is a significant difference in BMI between the two genotypes, CC carriers versus everyone else. And so, they originally discovered this in the Framingham Heart Study, and here you don’t need to look at everything for these figures. Let’s just focus on the relative heights of the bars. We can see that they were able to see a relatively similar type of pattern in the GOLDN study, which is also in the US, as well as the Boston-Puerto Rican study, which is in different ancestry individuals but also from the US.

As we move on, additional studies by this group and others were able to replicate again the same sort of shape of these relative bars, you know, an increase based on genotype in high but not low saturated fat consumers, in additional populations. We have a Mediterranean population, Chinese and Asian Indians, and Iranian individuals with type 2 diabetes. So, in all sorts of ancestries, ethnicities, and even physiological states like type 2 diabetes, this interaction seems to replicate.

And finally, a mechanistic follow-up type study was able to be performed, and they understood that genotype and saturated fat come to mutually influence a regulatory region in APOA2 that then downstream is able to affect DNA methylation, then transcription, then certain metabolites, especially tryptophan and branched-chain amino acid metabolites, ultimately downstream affecting BMI, and they are looking into a clinical trial to test this interaction. So, this is just an example of how we hope that interaction analysis will work, and I can step in a moment into how things often turn out in practice, but I’ll stop there for questions just at the moment.

Moderator: Thank you, Kenny. I don’t see any questions just yet, so I will let you know at the next break point.

Kenny: Sounds great. All right, so getting into where reality hits a little bit, we often have some statistical power difficulties in gene-environment interactions.

GxE discovery is difficult

Some of you may have seen this; this is even from four years ago or so. We’ve noted that GWAS discovery, in step with both time passing and sample sizes increasing, we see a nice increase in locus discovery. But GxE analysis has just lagged a little bit behind this, and so one major reason for this is that statistical power is a major obstacle to gene-environment interaction.

Statistical power is a major obstacle

So, at least in the case of a binary exposure, one heuristic we often use is that interaction detection statistically will often require almost four times the sample size compared to a main effect that has a similar magnitude of effect on the outcome. So, to get a little handle on this, we can look at just some quick ad hoc power calculations that I performed for this. Say we simulate an effect size of 0.05 standard deviations in some, let’s say, continuous outcome. If we want intuition for this, we could think about BMI and think about the FTO variant that is one of the best-known GxE variants and one of the strongest, which associates with about 0.4 BMI units per allele. So, we’d be talking about an interaction effect that would correspond to about half of that, let’s say 0.2 BMI units that this interaction is responsible for, in some sense. And so, we can see across different minor allele frequencies, if we’re looking towards rare variants of let’s say 0.005, all the way up to 0.5, these are naturally, given the same effect size, explaining different amounts of variance and will require different sample sizes. So, if we’re conducting a GWAS to try to detect this type of effect size at a genome-wide significance threshold, for the 0.5 minor allele frequency, let’s say maybe 34,000-ish individuals here is what’s necessary, and then we’re moving all the way up for this rarer variant. If we’re not aggregating, it might take quite a lot of individuals, even for a GWAS, almost reaching 2 million to detect this effect. Now, if we extend to GxE, suddenly, you know, we’re expanding; this is not exactly a factor of four, but we need a substantially greater number of individuals based on a set of reasonable assumptions to get to the same type of power for the interaction.
And we’re getting towards, here we see moving from 34,000, which is very common these days for GWAS; we’re getting towards the need for biobank-scale population sizes in order to even identify an interaction effect for a common variant, and we’re getting towards maybe even infeasible sample sizes for something like a very rare variant that might be, you know, the 0.005 minor allele frequency with current sample sizes that are available.
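A rough back-of-the-envelope version of this kind of calculation is sketched below. It is not the speaker’s calculation, whose assumptions are not given in the transcript. It assumes a standardized outcome, an additive genotype under Hardy-Weinberg proportions, 80% power, a two-sided genome-wide threshold of 5e-8, a normal approximation, and (for the interaction) a binary 0/1 exposure that is independent of genotype. The function name `required_n` and its parameters are hypothetical.

```python
from statistics import NormalDist


def required_n(beta_sd, maf, alpha=5e-8, power=0.80, exposure_prev=None):
    """Approximate sample size to detect a per-allele effect of beta_sd SDs.

    With exposure_prev set, beta_sd is read as the G x E interaction effect for
    a binary exposure (coded 0/1) that is independent of genotype.
    """
    z = NormalDist()
    ncp = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    information = beta_sd ** 2 * 2 * maf * (1 - maf)  # per sample, outcome variance 1
    if exposure_prev is not None:
        # Variance of the binary exposure scales the information for the product term.
        information *= exposure_prev * (1 - exposure_prev)
    return ncp / information


for maf in (0.5, 0.05, 0.005):
    main = required_n(0.05, maf)
    gxe = required_n(0.05, maf, exposure_prev=0.5)
    print(f"MAF {maf}: main effect n ~ {main:,.0f}, interaction n ~ {gxe:,.0f}")
```

Under these simplified assumptions a 50/50 binary exposure gives exactly the factor of four from the heuristic, and the main-effect numbers land in the same range as the figures quoted above rather than matching them exactly. Less balanced exposures push the ratio higher.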

Exposure measurement can be an additional limiting factor

So, in addition, beyond simply that general difficulty in statistical power, we have the additional obstacle of exposure measurement. This poses a couple of obstacles. One is that we can have imprecise or biased measurement. You know, a lot of exposures are self-reported, and this can result in both lower power and weaker interaction effect estimates. We have cultural and environmental heterogeneity, resulting in more difficult replication. And, in addition, we have the potential for reverse causation. Here the interaction effect may be just as strong, but we could conclude incorrectly if we are thinking in one direction, like diet soda affects obesity, and maybe it turns out to be the opposite.

Approaches to increase statistical power

So, taking these two obstacles into account, we have a couple of types of approaches that can be used to increase statistical power in this context. We’re going to focus; we’re going to say, “Okay, we have some genotype or genotypes, some exposures, and they’re interacting to affect body mass index,” and we can focus on the genotypes or the exposures in trying to increase statistical power. So, first, focusing on the genotypes here…

  1. Variant prioritization for two-stage testing: A very common type of approach is to use two-stage testing, in which we prioritize variants in stage one that, for some reason, we think are more likely to participate in interactions, and then in stage two, we can test for GxE and greatly reduce the multiple testing burden.

    1. So, the most straightforward approach; we can choose variants or variants from genes based on some prior biological knowledge. We have genes related to BMI; let’s use those. That’s very simple, straightforward, and is often done.

    2. We can take one step further and say that we have some reason to believe that interactions will often show up in known loci that are already found from a standard GWAS.

    This is one analysis looking at waist-to-hip ratio as a function of genes, sex, and interaction between genes and sex, and here from one of our papers recently, and what we’re showing here is a Manhattan plot for interaction p-values going up on the top and marginal genetic effects going down, and what you generally see here is that almost every peak that we see for interaction we see showing up as a marginal effect locus as well, indicating that maybe we can simply prioritize GWAS loci and then look for interaction effects at those loci.

  2. Variant prioritization (variance-QTLs): Yet another idea would be this idea of variance-QTLs. So here, the point is that interactions can induce genetic associations with the variance of some phenotype, and so what we’re looking at here, and we might actually start in the panel on the right in B here, and point out that if there’s a genetic interaction, let’s say the genotype increases the phenotype in one environment, let’s say the blue, and in the red, it’s actually decreasing the phenotype, this may not happen a lot in practice, but let’s suppose it happens. We can then look at the actual distribution of the values of these phenotypes, and now, look to the left and see that if we were to test for the difference in variance across genotypes, AA, AG, and GG, we might actually detect that this locus is interesting even if we don’t know or we haven’t measured the red, green, blue environment in panel B there.

    And so, multiple studies have looked at this, but one, for example, took a look at finding these variance-QTLs and then ultimately asking, if we find interactions at those vQTLs and try to replicate them internally, how well do they replicate, indicating sort of how enriched maybe these variants are for actual interaction. If they take a bunch of random SNPs, they see essentially negligible replication of the interactions that they find, as might be expected. If they take main effect only, so that was what I was talking about in the last slide, maybe a GWAS locus, they find that, okay, if they find a certain set of interactions, then maybe about 25 percent of them are replicating in this held-out internal sample. But finally, if they choose vQTLs, we can see that we’re getting a lot higher here, and that we’re more than doubling the number of interactions that seem to be real or that seem to replicate. So, this is quite a promising strategy, and I’ll actually be talking in the main MPG talk today about one investigation leveraging this idea.

  3. Polygenic scores for interaction: Another genotype-focused approach is to use polygenic scores, and this is probably the most straightforward of all here. So, we have an example here of sugar-sweetened beverages, genotypes, and BMI. This study was quite a well-known study from 2012, where they simply constructed a BMI polygenic score, not based on interaction, just based on the typical polygenic score methods. On the y-axis here, we have the effect size of sugar-sweetened beverages on BMI, and we can see that across each of these three cohorts, and when we pool them together, there’s this step increase, such that as individuals’ genetic predisposition to higher BMI increases, the effect of sugar-sweetened beverages goes from negative or essentially zero up to a quite substantial effect. So, this is another quite promising path that’s quite often used to increase power for interaction detection.

  4. Exposure-focused strategies: In addition, we have exposure-focused strategies, right? So, I’ll put this in two categories here.

    1. First, we can think about the datasets that we actually collect.

      1. One approach is better measurement, so maybe wearable technologies for physical activity, for sleep, etc. These might give us better estimates of some of these behavioral components than we might get from self-report.

      2. We can incorporate more diverse populations, where ethnic and cultural heterogeneity leads to differences in the distribution of environments and exposures; this can actually be quite helpful in some instances in aiding discovery, especially in meta-analysis, that sort of thing.

      3. And finally, we can collect longitudinal data, which will also help us pin down better estimates of the outcome and the exposure in different individuals.

    2. Likewise, on the analysis side, there are multiple very interesting methods cropping up related to the use of multiple environments at once.

      1. There’s the idea of a lifestyle risk score that’s come out recently from the CHARGE Gene-Lifestyle Interaction Group where they’re trying to aggregate and say, “Well, if we combine multiple healthy behaviors in some sort of score, does that interact with genotype?”

      2. Likewise, another approach taken in a 2019 paper here was this idea of high-dimensional environments where we construct an environment relationship matrix that’s similar to what we might think of as a genomic relationship matrix and use that to encompass a similarity of environments across individuals across a wide scope of these exposures.

      3. And finally, longitudinal methods here in the analysis will help us better take advantage of the longitudinal data that I mentioned above to again better identify these interactions.
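To make the variance-QTL idea from item 2 of the genotype-focused list concrete, here is a small simulation sketch. It assumes the extreme "crossing" case described there: an unmeasured environment that takes one of two values, with the genotype effect reversing sign between them. The effect size, sample size, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, variance

random.seed(1)


def simulate(genotype, n, beta_gxe=0.5):
    """Phenotypes for one genotype group under a crossing interaction.

    The hidden environment is +1 or -1 with equal probability, and the
    genotype effect flips sign between the two environments.
    """
    values = []
    for _ in range(n):
        env = random.choice((-1, 1))
        values.append(beta_gxe * genotype * env + random.gauss(0, 1))
    return values


groups = {g: simulate(g, 5000) for g in (0, 1, 2)}
for g, values in groups.items():
    print(f"genotype {g}: mean = {mean(values):+.3f}, variance = {variance(values):.3f}")
```

The group means stay close to zero, so a standard test on means would see little, while the variance grows with the number of alleles. That variance difference is what a vQTL scan picks up without the environment ever being measured.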

I’ll take a break here for a moment, and I saw a couple of questions here.

Moderator: Thank you very much. Yes, I will voice the one that’s here and ask anyone else again who has questions to please post them in the Q&A. So, the question here is, “Isn’t the assumption that interactions are more likely to be found in main effect GWAS loci flawed because the larger variance introduced by interaction effects would lower the likelihood that a given variant reaches significance? Thus, main effect GWAS loci could be biased toward non-interacting loci.”

Kenny: I think this is a really good point and something that I glossed over at the beginning when I introduced interactions. I think a key piece here is the distinction between two kinds of interaction. The first is what we’ll often call a quantitative interaction, where the effect of, let’s say, the genotype is just being tuned up or down; you know, it’s a stronger effect in one environmental stratum than in the other.

Then we have this idea of qualitative interactions, and I can actually back up real quick here. So, on this slide, if we’re looking at panel B on the left here, what’s being shown is an example of what we might call a qualitative interaction. So, this is a particular instance where the person asking the question is totally right: here we’d expect a potentially strong interaction and potentially no marginal genetic effect whatsoever. What I would point out is that we don’t expect to see this often in practice. At least intuitively, it seems a bit more likely that we’d find, let’s say, a genetic variant that’s sort of tuning the effect. If we go to our saturated fat-BMI example, we might not expect that, between high and low saturated fat intake, the effect of a genotype is going to totally reverse direction; we might rather expect that it’s going to be modest in one group of the population and more substantial in another group.

So, I think that is a great point, and that’s why it’s useful to conduct genome-wide studies of interaction and not just neglect that. But for certain, especially depending on your hypothesis of what the interaction might look like, it will often be worth the trade-off to focus on main-effect loci, let’s say.
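As a rough illustration of the distinction drawn in this answer, the sketch below averages hypothetical per-stratum genotype effects to approximate what a marginal (main-effect) analysis would see. The function name, the effect sizes, and the 50/50 exposure split are all assumptions made for illustration, and the weighted average only holds under the added assumption that genotype and environment are independent.

```python
def marginal_genetic_effect(effect_by_stratum, stratum_freq):
    """Average per-stratum genotype effects, weighted by stratum frequency.

    Assumes genotype and environment are independent, so that a marginal
    (GWAS-style) effect is approximately this weighted average.
    """
    if len(effect_by_stratum) != len(stratum_freq):
        raise ValueError("need one frequency per stratum")
    if abs(sum(stratum_freq) - 1.0) > 1e-9:
        raise ValueError("stratum frequencies must sum to 1")
    return sum(b * f for b, f in zip(effect_by_stratum, stratum_freq))


# Quantitative interaction: same direction, different size (hypothetical values).
quantitative = marginal_genetic_effect([0.1, 0.3], [0.5, 0.5])

# Qualitative (crossover) interaction: the effect reverses direction.
qualitative = marginal_genetic_effect([-0.2, 0.2], [0.5, 0.5])

print(f"quantitative: marginal effect = {quantitative:.2f}")
print(f"qualitative:  marginal effect = {qualitative:.2f}")
```

In the quantitative case the marginal effect stays clearly non-zero, so a main-effect GWAS can still flag the locus. In the symmetric crossover case the two strata cancel, which is the situation the attendee's question describes.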

Moderator: Thank you very much. There’s one other question before you go on. The attendee thanks you for your talk and says: “I’d like to know how to conduct the joint test, GxE plus G, and the other test, GxE plus E.”

Kenny: Yeah, so for the joint test specifically, there are a couple of different software tools. Our lab recently released one; it’s called the GEM tool, and you can read about that in our paper right here, the Bioinformatics paper that is in press at the moment. So, that is one approach. Another approach is to use one of a couple of other software packages. Some software that’s used for calculating gene-environment interactions, especially genome-wide, will be able to output the joint test directly.

Another approach, though, in general, if you have, let’s say, one interaction: ultimately the joint test is basically a chi-squared test with two degrees of freedom. It takes into account the genetic effect and its variance, the GxE effect and its variance, and the covariance between those two estimates. And so, if it’s maybe for one locus or something, you can construct this yourself if need be.

And a final thing I’ll note: there’s also a patch that my advisor, Alisa Manning, and her collaborator, Han Chen, contributed to METAL, where you can run a meta-analysis that takes both the interaction terms and the main effects and outputs this joint test as well. Feel free to get in touch with me afterwards, and I’m happy to talk specifics.
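For a single locus, the construction described here can be sketched as a Wald statistic on the two estimates. This is a minimal sketch, not the implementation used by GEM or METAL: the function name and the example numbers are hypothetical, and the inputs are assumed to come from a regression that includes both the G and the GxE term.

```python
import math


def joint_test_2df(beta_g, beta_gxe, var_g, var_gxe, cov_g_gxe):
    """Wald joint test of the genetic main effect and the GxE effect.

    Computes b' V^{-1} b for b = (beta_g, beta_gxe) and their 2x2 covariance
    matrix V, then a p-value from a chi-squared distribution with 2 df.
    """
    det = var_g * var_gxe - cov_g_gxe ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Quadratic form written out using the closed-form inverse of a 2x2 matrix.
    stat = (
        beta_g ** 2 * var_gxe
        - 2 * beta_g * beta_gxe * cov_g_gxe
        + beta_gxe ** 2 * var_g
    ) / det
    # With 2 degrees of freedom, the chi-squared survival function is exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Hypothetical estimates for one variant (not taken from any real study).
stat, p = joint_test_2df(
    beta_g=0.10, beta_gxe=0.05, var_g=0.0016, var_gxe=0.0009, cov_g_gxe=-0.0006
)
print(f"chi-squared (2 df) = {stat:.2f}, p = {p:.2e}")
```

The covariance term matters: the G and GxE estimates are typically correlated, so simply summing the two squared z-scores would only be correct when that covariance is zero.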

Moderator: Thank you so much. Those are the questions for the moment.

Kenny: So, all right, so here, as I mentioned before, we can step into a little bit of a grab bag here and just some interesting or thought-provoking questions in interaction testing that I think are worth thinking about in this context.

Interactions masquerading as main effects

So the first is what I’m calling interactions masquerading as main effects. The idea here is that we’re going to set up a situation where we simulate a pure interaction. This was not my work; it was in Hugues Aschard’s paper from 2016 here. So, if we simulate a pure interaction, we don’t expect that the genotype by itself or the environment by itself is explaining any effects per se, only insofar as they participate in the interaction. But the question that we’re going to ask here, and that we’re going to look at in this plot, is how much additional variance is explained by the interaction beyond what we could find by environment and genotype alone.

What we can see, if we make some basic assumptions here (for example, we’re going to assume that the mean of the environment is non-zero), is that as we move across the risk allele frequency spectrum, there’s a large change in some of these quantities: how much variance is explained and which components are explaining it, the interaction, the genotype by itself, or the environment. But what you’ll note most prominently here is that the vast majority of the variance explained across the whole allele frequency spectrum is explained by things that are not the interaction. For intuition, we can think about the fact that if, as we assumed here, the mean of the environment is two, or let’s just say it’s non-zero, the interaction term becomes correlated with the gene term and/or the environment term. Because of that, we expect that performing a simple GWAS might very well uncover loci that in reality are pure interactions, even though it is only testing for a marginal genetic effect.

And this also gets back to the question that was asked earlier about whether we expect interaction effects to show up in GWAS loci or not. This is one of the reasons why we expect that. And so just to summarize our takeaway at this point, we don’t necessarily expect that G by E will explain a huge amount of additional variance, because of this effect that we’re seeing here to the left, where GWAS will find interactions even though it’s not explicitly looking for them. And so we might not expect that incorporating G by E is going to hugely improve prediction, but what it can do is help us identify causal players. That’s one of the specific reasons why interaction analysis is particularly interesting: even if it’s not going to improve prediction so much for some phenotypes, we can at least expect that it might reveal interesting causal associations.
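The intuition in this section can be checked with a small closed-form calculation. The sketch below is not a reproduction of the paper's simulation; it assumes a noiseless phenotype Y = G × E, a genotype G ~ Binomial(2, risk allele frequency), and an environment E that is independent of G with mean 2 and variance 1 (the variance is an assumption here). Under those assumptions, G × E splits into a part linear in G, a part linear in E, and a centered product term, and the three parts are uncorrelated.

```python
def pure_interaction_variance_shares(raf, env_mean, env_var):
    """Share of Var(G * E) attributable to G alone, E alone, and G x E.

    Uses G*E = env_mean*G + g_mean*E + (G - g_mean)*(E - env_mean) - const,
    whose three non-constant terms are uncorrelated when G and E are independent.
    """
    if not 0 < raf < 1 or env_var <= 0:
        raise ValueError("need 0 < raf < 1 and env_var > 0")
    g_mean = 2 * raf
    g_var = 2 * raf * (1 - raf)
    parts = {
        "genotype": env_mean ** 2 * g_var,     # picked up by a marginal G term
        "environment": g_mean ** 2 * env_var,  # picked up by a marginal E term
        "interaction": g_var * env_var,        # only captured by the G x E term
    }
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


for raf in (0.05, 0.2, 0.5):
    shares = pure_interaction_variance_shares(raf, env_mean=2.0, env_var=1.0)
    print(raf, {name: round(share, 3) for name, share in shares.items()})
```

With these assumed values the interaction share stays below one fifth at every allele frequency, while setting `env_mean=0.0` drives the genotype share to zero. That is the sense in which a non-zero environment mean lets a pure interaction show up in a marginal genetic test.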

Additive vs. multiplicative interactions

Okay, so the second idea here is additive versus multiplicative interactions. And so here we’re going to set up another hypothetical example. We have an exposure that has two categories and we have a genotype that has two categories. Just to simplify here, let’s say each of them doubles risk for some binary outcome. So if, in the genotype-equals-zero row, we move from exposure zero to one, we double our risk; if we move from genotype 0 to genotype 1 in the exposure-zero stratum, we double our risk. If this continues to double risk in that final bottom right cell, then we expect that the relative risk there is four.

Now, if this is the case, then modifying the exposure in some sense matters more in risk genotype carriers than it does in non-risk genotype carriers. If that’s the case, then we might care more about it for, let’s say, public health reasons, right? Because if we think that this is some disease, for example, then risk carriers are generally going to have a greater burden of the disease. So even if there’s no multiplicative interaction, and no interaction where we find something new biologically, it still might be more compelling, if we have, I don’t know, limited resources, something like that, to adjust the exposure in people who have the risk genotype.

And so this is what we call an additive interaction: if you have effects of both the genotype and the exposure, it might be of interest to understand how non-additive they are. In this case, there’s no multiplicative interaction; we’re simply multiplying a relative risk of two by a relative risk of two. But that doesn’t mean that the two aren’t important for each other in determining how much absolute risk is being changed. This is particularly relevant for binary outcomes. Additive and multiplicative interactions can exist for continuous outcomes, but the distinction tends to be less relevant.
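The two-by-two example above can be worked through numerically. In this sketch the baseline risk of 1% is an assumed value chosen only for illustration, and the additive-scale summary used is the relative excess risk due to interaction (RERI), a standard epidemiological measure that the talk itself does not name.

```python
def interaction_scales(risk):
    """Summarize a 2x2 risk table indexed as risk[genotype][exposure]."""
    baseline = risk[0][0]
    rr_genotype = risk[1][0] / baseline
    rr_exposure = risk[0][1] / baseline
    rr_both = risk[1][1] / baseline
    return {
        # Equals 1 when the joint relative risk is the product of the two.
        "multiplicative_ratio": rr_both / (rr_genotype * rr_exposure),
        # Equals 0 when the excess relative risks simply add up.
        "reri": rr_both - rr_genotype - rr_exposure + 1,
        # Absolute risk change from the exposure within each genotype group.
        "risk_diff_noncarriers": risk[0][1] - risk[0][0],
        "risk_diff_carriers": risk[1][1] - risk[1][0],
    }


# Assumed baseline risk of 1%; genotype and exposure each double the risk.
table = [[0.01, 0.02],
         [0.02, 0.04]]
summary = interaction_scales(table)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Here the multiplicative ratio is 1 (no multiplicative interaction), yet the RERI is positive and removing the exposure changes absolute risk twice as much in carriers as in non-carriers. That is the public health argument made above.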

Areas of investigation

Just a quick note on some other ongoing areas of investigation. One is that we have whole-genome approaches to heritability and variance explained due to interactions, as I mentioned before. There are some particular considerations in thinking about whether we expect to uncover lots of new heritability, but it still might be quite interesting to get some large-scale view. For some of these methods, we need to know the environment; for some of them, we don’t need to know the environment at all. But there are a whole host of methods that can look in a genome-wide way at the heritability contributed by interactions in some sense.

For rare variants, we have the same trade-off that exists in GWAS, right, that we expect that we very well might find some higher-impact rare variants, but this is going to be traded off with the fact that, as we already described, power is quite an obstacle for gene-environment interaction analysis and even more so when we start to reduce the minor allele frequency, as we saw in that simulation. So rare variants are certainly an ongoing area of interest, but the power trade-off has to be taken into account.

And finally, something I just think is a lot of fun. I don’t know if this is an official method or not, but a group from the UK has, in a couple of different studies, looked into this idea of Mendelian randomization G by E, where the idea is that G by E interaction analysis can help probe the robustness of an MR analysis. So in our example here, say we have an MR analysis that’s proxying smoking and trying to understand a causal effect of smoking. We can look at it from a gene-environment interaction perspective where there’s some modifying factor, which is whether the population smokes at all. We’d expect that even a genetic variant proxying smoking will not have any effect on the outcome in a population where there is no smoking, for example. So it’s just another interesting area for investigation as well.

Summary

  1. And so just to sum up here, gene-environment interaction analysis can be a critical building block for biological understanding as well as for informing personalized medicine. As I talked about, the biological understanding may come from looking specifically at the interaction term and understanding synergies between environments writ large and genotypes. Informing personalized medicine, again, comes from this idea of being able to look at the effect of the exposure holistically, conditional on some genotype.

  2. Low statistical power, as I talked about in the second section, is the major obstacle, and we saw that that’s due to both general difficulties in power for gene-environment interactions as well as the fact that often the exposure will not be measured as well as the genotype or the outcome, and if that’s the case, then that will further dilute statistical power. But we have a series of approaches that can help deal with that.

    1. We can prioritize variants in any number of ways: GWAS variants, variance QTLs, etc.

    2. We can use multi-genotype approaches like a polygenic risk score to collapse genotypes into one, and we can use multi-exposure approaches.

    3. This could be like a lifestyle risk score, or this could be like some of the more advanced, maybe high-dimensional environment random effect-type approaches to understanding the impact of multiple environments interacting at once.

  3. And to make progress on this, a lot of these takeaways are similar to what we might say for GWAS, right? We can use larger samples. We can try to actively seek more diverse populations. And I think what’s unique compared with GWAS is that one thing we can mean by diversity here is that you can specifically seek populations where there’s diversity in these exposures, that is, environmental differences between these groups.

  4. And finally, better exposure measurement will always be helpful in pinpointing and identifying these interactions.

And so thank you for your attention. Happy to take some additional questions here.

Moderator: Thank you so much. That was really wonderful. There are a couple of questions. And before I voice them, a reminder to everyone who’s attending that if you have any questions, please continue posting them in the Q&A.

So the first question is posted there, and it’s a question about terminology, and the attendee says, “Are the additive rather than multiplicative interactions sometimes called gene-environment correlations rather than interactions?”

Kenny: Yeah. So I’m aware of this, although I’m not too familiar with the literature on it. I think, especially in the social sciences, there’s a lot of discussion of and references to this gene-environment correlation, where we’re thinking about cases where our exposure is, let’s say, some behavior that may itself be influenced by genotype, and that can have impacts on both our statistical inference and how we think about the mechanism of an interaction. So that is definitely worth thinking about, and it’s something that can be a really key component to consider when you’re testing for interactions and trying to understand what’s going on. With that said, and I’m happy to hear if there’s an element that I’m not thinking about, I don’t know that the additive versus multiplicative distinction and gene-environment correlation are necessarily related. At least in the way I presented it here, in this hypothetical setup, we’re not necessarily claiming that there’s any impact of the genotype on the exposure, but simply that changing one of them will have more impact on the absolute number of people, let’s say, getting some disease in one stratum versus the other of exposure or genotype, etc.

Moderator: Got it. Great. Thank you. And to the attendee, if you wanted to follow up, please, of course, post things in the Q&A. I will follow up with one question that you’ve already touched on a bit, but it’s something that really frequently occurs to me. Oftentimes, even with a simple GWAS or, you know, just genotype-phenotype analyses, there are varying levels of confidence in the phenotypes. Sometimes you’re as certain as one can be of a phenotype, and for other individuals you may have a decent guess, but you have some probability associated with that. And then of course, when you add the environmental component, that adds a whole new opportunity for variation in the certainty among individuals, and I know you’ve addressed aspects of this before. I’m wondering what your sense is: if you have complete, as much as that can be the case, phenotype and environment data for a subset of individuals and then only phenotype data for others, is there a way to include all of the individuals, bearing in mind that the amount or the types of information or the certainty is variable among them? Or is it better just to exclude individuals that fall far below the information standards for the others?

Kenny: Yeah, I think this is a really tricky one, and my mind’s going a couple of different ways. One is the idea of semi-supervised learning, if we’re thinking in a machine learning context, right? We have some labeled data and some unlabeled data, and can we use all the data in a useful way? I think one way, naively, that we could address a situation like this: let’s just say we decide that for half of our individuals, we can’t use their exposure data whatsoever. We might still be able to use that part of the sample in some way if we think about some of the variant prioritization approaches that I talked about. So let’s stratify our sample and first use the half who don’t have as good exposure data. Let’s think about, for example, variance QTLs, right? That’s what I was talking about. You might very well be able to take individuals from the same population for whom you don’t know the environment and prioritize 10% of your variants, 5% of your variants; you know, you could use a very liberal cutoff and just say, “All right, we’re trying to narrow down by some substantial amount the set of variants that we’re looking at,” and then you could test those variants in the individuals with the exposure data.

Another way, in general, and especially for more noisily measured exposures like, let’s say, self-reported physical activity or diet, is an approach I’ve seen explored where we look only at the extremes. Binarization or some sort of discretization of individuals where the environment isn’t measured as well could still allow us to use that information, and maybe it would be diluted a little bit less. So, these are all kinds of approaches towards that.

Moderator: Great. Thank you very much. That’s really thought-provoking, and I think we have time for one more question. And this goes back to the previous question. "So to go back to the earlier question about traditional GWAS being biased toward main effects for any interacting variant, not just the example you mentioned, the standard deviation will be larger and thus the wider confidence intervals may impair achieving genome-wide significance for the interacting variant compared to a non-interacting variant with the same main effect, correct?"

Kenny: Yes. Is that the question? Or is that the first part?

Moderator: That is the question as written, and again, please follow up if you would like to ask more questions.

Kenny: But yes. So, I think the idea there would be, yes, we might expect that the confidence interval could be wider, and it could be a little more difficult to detect that variant compared to a main-effect variant of the same effect size. But I think the important comparison here is that, you know, more power is always helpful, more sample size is always helpful, but we have pretty solid sample sizes for a lot of phenotypes for GWAS. So much so that we might expect that even for those variants that are a little more difficult to identify, at this point we might be getting towards identifying them. And the comparison of that with a main-effect GWAS is less important than the comparison with an interaction test, where we have maybe mismeasurement of exposure, etc., and so I think that all contributes.