Chapter 5.4: Meta-Analysis (Video Transcript)
Title: Genome-wide association study design and interpretation
Presenter(s): Gina Peloso, PhD (Department of Biostatistics, Boston University)
Host:
Okay, I think we’re going to get started. Good morning, today is the third in the primer series. We had one talk focused on complex trait genetics and one talk focused on Mendelian genetics. Today, we’re going to have a talk focused on common-variant genetic association studies, and the subsequent three talks are going to focus on rare coding variant association studies of different forms. So, this will be the only session dedicated to genome-wide association studies, a tool that has been widely used in the past 10 years and is quite foundational for much of the work that many folks are doing. Gina did her PhD work at Boston University, did a post-doc at the Broad Institute and Mass General, and is now an assistant professor in the Department of Biostatistics at Boston University and an affiliate of the Broad Institute, and today she is talking about genome-wide association study design and interpretation. Thanks very much, Gina.
Gina:
Thanks. So, with genome-wide association studies, we are testing the association of phenotypic variation with genotypic variation, and this is particularly useful for complex traits: traits that have both a genetic and an environmental component. These complex traits not only have a genetic component, but they have many genes contributing to the trait variability, and some of these genetic effects can be very subtle. With genome-wide association studies, we are testing a type of genetic variation called single nucleotide polymorphisms, or SNPs. Here is a pictorial of 10 chromosomes, and you can see that, for most of these base pairs, the individuals all have the exact same allele, but in the middle here there is a position that varies among individuals, where the allele could be a C or a G, and this is a single nucleotide polymorphism. And you can see that it’s a common variant: it’s seen on four of the 10 chromosomes, so it’s seen in many individuals, and this is the type of variation we’re going to be testing with genome-wide association studies.
GWAS are really getting at the common disease-common variant hypothesis. In the upper left-hand corner here, you see that variants affecting Mendelian disease are very rare. Along the x-axis is allele frequency, and along the y-axis is the penetrance, or the effect size, of a variant. Mendelian diseases up here are driven by very rare variants of high effect. On the opposite side of the graph is common variation with very subtle, low effect sizes, and this is what we’re getting at with genome-wide association studies, this type of variation down here.
So, genome-wide association studies have been performed for the past 10 years or so, and it really began with the HapMap Project. The goal of the HapMap Project was to describe common variation in the human genome in several populations: individuals of Yoruba descent from Nigeria; individuals living in Beijing, China; individuals from Tokyo; as well as individuals from the CEPH cohorts, who were of Northern and Western European descent. They queried a small number of individuals from these four population groups for variation, and they took that variation from the HapMap Project and put it onto commercial genotyping arrays that could then be genotyped in many, many individuals. So the goal of GWAS is to look across this genome-wide set of SNPs for an association with a particular outcome. Now, it can be said that GWAS have been very successful in identifying regions of the genome associated with a range of diseases. This diagram was downloaded from the GWAS Catalog and was updated just last week, and it shows all the genetic associations that have been identified by GWAS at a genome-wide significance level. It’s impressive: the GWAS Catalog contains 2,554 studies and 25,037 unique SNP-trait associations. So, GWAS have been very successful in terms of identifying locations in the genome associated with disease.
So, these are the basic steps for performing a genome-wide association study. You have to think about your study design and optimally designing your study, and we’re going to talk about sample collection as well as genotyping. Then one of the steps that has been really fine-tuned within GWAS, and has made GWAS so successful, is robust quality control of the data. And finally, after you have performed your quality control, you can go on and do the interesting part and look at the association between your trait of interest and these genetic markers across the genome.
So, first, you have to decide on your outcome, what you are interested in studying, and whether you want that outcome to be studied as a disease case-control sample, where you’re collecting cases and controls; for example, type 2 diabetes, schizophrenia, and obesity studies have collected individuals with disease and those without disease (or the extremes of a trait) and performed GWAS comparing the two groups. You can also do GWAS on quantitative traits; I particularly work on cholesterol levels, where you compare the distribution of the phenotype across the genotypes. Importantly, before you embark on a GWAS, you want to make sure your trait has a heritable component, confirming that there actually is a genetic contribution to the trait, so that you have the ability to find genetic markers associated with it.
So, you might ask, what sample size do I need to detect effects of a certain magnitude? This is a pretty old figure, from 2005, right at the start of GWAS. Along the x-axis is the frequency of the disease susceptibility allele, and along the y-axis is the sample size required. This was for 80% power to detect an effect at about a 1×10⁻⁶ α level. What you can see is that if you want to detect smaller and smaller effects, you need larger and larger sample sizes. Here we have an odds ratio of 1.2, and actually in GWAS we’re detecting odds ratios much smaller than that. So, with an odds ratio of 1.2 and the most informative marker, an allele frequency of about 0.5, you need about 4,000 subjects. In GWAS, we’re often detecting effect sizes on the order of an odds ratio of 1.02, so you’re all the way up here and needing individuals in the tens of thousands to be able to detect such subtle effects.
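As a rough, back-of-the-envelope illustration of this kind of power calculation (not the exact method behind the figure), here is a minimal Python sketch that approximates a case-control GWAS test as a two-proportion comparison of allele frequencies between cases and controls; the function name cases_needed, the default thresholds, and the odds ratios in the loop are illustrative assumptions.

```python
# Approximate sample-size sketch for a case-control GWAS signal, treating the
# per-allele test as a two-proportion z-test on allele frequencies.
from scipy.stats import norm

def cases_needed(p_ctrl, odds_ratio, alpha=1e-6, power=0.80):
    """Approximate number of cases (assuming equally many controls) needed
    to detect a per-allele odds ratio at a given significance level."""
    # risk-allele frequency implied in cases by the odds ratio
    p_case = odds_ratio * p_ctrl / (1 + p_ctrl * (odds_ratio - 1))
    p_bar = (p_case + p_ctrl) / 2
    z_alpha = norm.isf(alpha / 2)                      # two-sided threshold
    z_power = norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5 +
                 z_power * (p_case * (1 - p_case) +
                            p_ctrl * (1 - p_ctrl)) ** 0.5) ** 2
    alleles_per_group = numerator / (p_case - p_ctrl) ** 2
    return alleles_per_group / 2                       # 2 alleles per person

for OR in (1.5, 1.2, 1.05):
    print(OR, round(cases_needed(p_ctrl=0.5, odds_ratio=OR)))
```

The qualitative point matches the slide: the smaller the odds ratio, the steeper the rise in the required sample size.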
The sample size you need will help determine the route you’re going to take for performing a genome-wide association study. There are two kinds of routes you can take. You can do a single-study analysis, where you’ve collected the phenotype of interest, run a genotyping array, and analyzed it in-house. These are really great when you’re looking for large effects and when you have unique phenotypes that are not traditionally collected on many individuals. However, power is limited in these studies because there are only so many subjects you can collect within a single study. So, a common approach, and what’s traditionally been done over the last ten years for common traits, is a multi-study approach, or meta-analysis, where multiple studies contribute results for the same trait. This is great when you have commonly collected phenotypes; it gives you a larger sample size and therefore more power to detect subtler effects of the genotypes.
So, there are over 40 genotyping arrays that have been developed in the past ten years. This isn’t showing all of them, it didn’t come up right, but there are both Illumina and Affymetrix chips, and they vary widely in their content. This is going to come into play later when we talk about combining data across multiple studies, because different studies might not have been genotyped on the same chip.
Once you have the phenotype collected and you’ve gotten those individuals’ genotypes, you have to do quality control.
Because the ability to detect a true genetic association is only as good as the quality of your underlying data, and because a large number of markers are tested for association, even a low error rate can be detrimental to a GWAS. Take this example: we have 1 million markers tested for association, which is pretty typical in a genome-wide association study, and let’s assume that approximately 0.1% of those markers are poorly genotyped and that the inaccurate calling results in spurious associations. That means up to a thousand markers might be unnecessarily taken forward for replication because of false positive associations due to poor genotyping. So, quality control steps are essential in analyzing genetic data, and they are taken to remove both individuals and markers with high error rates. It’s assumed that many thousands of individuals have been genotyped to maximize the power to detect an association, so removing a handful of individuals will have little effect on the overall power of the study. Also, given that a very large number of markers are genotyped, the removal of a small percentage of the SNPs should not markedly decrease the overall power of the study. That being said, every marker removed from a study is potentially a disease-associated locus that you’re not testing, so removing a marker may have more of an impact than removing one individual. Of course, we’re going to talk about genotype imputation, where we may be able to recover these markers.
Here are some standard QC metrics used in genome-wide association studies; there is both sample QC and SNP QC. For sample QC, we look for high missingness rates; deviations in heterozygosity, which can indicate contamination; gender checks, to make sure we have the right individuals; duplicates; and cryptic or unexpected relatedness (what we can check depends on our study design and whether we have family information; if we do have families in the data set, we can look at Mendelian errors and see if there is a problem that could indicate we have the wrong individual); and then we usually exclude population outliers. For SNP QC, we look at missingness; deviations from Hardy-Weinberg equilibrium, because a problem with the genotyping can cause those deviations (and if you have a case-control study, you can look at differential missingness between cases and controls); and SNPs with a high number of Mendelian errors, which can indicate a problem genotyping that SNP. Now, QC is usually done on individuals first and then on markers. This order is used because there are many more markers contributing to the sample-level statistics than there are samples contributing to the marker-level statistics: a couple of bad markers are drowned out by all the good markers and won’t affect the sample-level statistics much, whereas there are usually far fewer samples than markers, so a single bad sample carries more weight in the marker-level statistics.
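As an illustration of two of the sample-level checks just listed, here is a minimal sketch, assuming genotypes are held in a NumPy array `geno` of shape (samples x SNPs), coded 0/1/2 with NaN for missing calls; the function name and thresholds are illustrative, not from the talk.

```python
import numpy as np

def sample_qc(geno, max_missing=0.03, het_sd=3.0):
    """Flag samples to keep based on missingness and heterozygosity."""
    observed = ~np.isnan(geno)
    missingness = 1.0 - observed.mean(axis=1)          # per-sample missing rate
    het = ((geno == 1) & observed).sum(axis=1) / observed.sum(axis=1)
    het_outlier = np.abs(het - het.mean()) > het_sd * het.std()
    keep = (missingness <= max_missing) & ~het_outlier
    return keep, missingness, het
```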
Audience question: So the question was: when we look at relatedness, do we have information about how the subjects are related? Yes, you want to look at both. If you have a family study, you would compare the expected relationships to the observed relationships based on the genome-wide identity-by-state matrix. So, you can compute the observed proportion of allele sharing between individuals and compare it to what you would expect given the reported relationships. For family studies, you would compare that to the family structure, but if you think you have a set of unrelated subjects, you would expect those IBS (identity by state) estimates of the proportion of sharing between individuals to be relatively low.
Gina: Okay, so one of the confounders of genome-wide association studies is population structure. Population structure occurs when there are subgroups within your data that differ with respect to both trait distributions and marker frequencies, and this can cause spurious association results. Population structure is one of the few true confounders in genetic association studies. That’s not to say you shouldn’t be adjusting for covariates in your tests for association; you still want to adjust for covariates because they increase the precision of your estimates. But population structure causes a confounding effect that needs to be adjusted for.
Luckily, there are techniques for adjusting for this effect. You can use a well-matched design, making sure you’re selecting individuals from the same regions, so there’s less of a concern. But even within samples of European ancestry, there is observable population structure. Here is a plot of a sample of all Europeans, showing principal components of genetic relatedness. Principal components are weighted, aggregated scores of independent genetic variants. You can calculate them and plot principal component one versus principal component two, which gives this figure. Here, I’ve labeled individuals as Italian or non-Italian, and you can see that these two principal components, which are weighted scores of SNPs, distinguish between being Italian and not Italian. So, while this sample is all European, there are subtle differences that can be detected, and these can cause spurious associations. You can use these principal components as covariates within a statistical model.
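A minimal sketch of how such principal components can be computed, assuming a complete (samples x SNPs) matrix `geno` of 0/1/2 genotypes at roughly independent common SNPs (in practice the SNPs are LD-pruned first); the toy simulation at the end is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def genetic_pcs(geno, n_components=10):
    """Standardize each SNP and return the leading principal components."""
    p = geno.mean(axis=0) / 2                        # per-SNP allele frequency
    keep = (p > 0) & (p < 1)                         # drop monomorphic SNPs
    z = (geno[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    return PCA(n_components=n_components).fit_transform(z)

# toy usage: 200 samples, 5,000 SNPs simulated under Hardy-Weinberg equilibrium
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=5000)
geno = rng.binomial(2, freqs, size=(200, 5000)).astype(float)
pcs = genetic_pcs(geno)        # pcs[:, 0] vs pcs[:, 1] is the kind of plot shown
```

The leading components would then be included as covariates in the association model, as described in the talk.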
Audience question: The question is: if the genomic inflation λ statistic is larger than 1.05, why does that indicate population stratification? Having a λ statistic greater than 1.05 doesn’t necessarily indicate that there is population structure, but it could. When we talk about QQ plots, you’ll see that if there are many markers showing spurious association because of population structure, they can pull the observed distribution off the line.
Gina: The last way you can control for population structure, and one of the ways that is more routinely done in today’s GWAS, is using mixed models with kinship relationship matrices: a matrix of IBS sharing among individuals that captures subtle differences in relatedness between individuals. So, using a mixed model is a third way to control for population structure.
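The kinship matrix such a mixed model uses can be built from the same standardized genotypes; below is a minimal sketch of a genetic relationship matrix (GRM), with fitting of the mixed model itself left to dedicated software.

```python
import numpy as np

def grm(geno):
    """Genetic relationship matrix from a (samples x SNPs) 0/1/2 matrix."""
    p = geno.mean(axis=0) / 2
    keep = (p > 0) & (p < 1)
    z = (geno[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    return z @ z.T / keep.sum()      # (samples x samples) relatedness estimates
```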
Here are some resources for best practices on QC of genome-wide association studies that provide a lot more detail on what QC thresholds should be used.
So I have alluded to the fact that we can improve power by increasing the sample size, combining studies from multiple different cohorts to increase our power. The problem is that different studies have used different commercial chips, with different sets of SNPs, for their genotyping. We could take the set of SNPs shared by each of the genotyping arrays and analyze just those, but that shared set of SNPs across platforms is really restrictive.
This brings us to imputation. What imputation does is fill in the genotypes for SNPs that were not directly genotyped, based on LD and haplotypes from a reference sample, to get a fuller set of SNPs in your study. Take, for example, a reference sample that has been genotyped at many SNPs, and our own sample genotyped on a commercial array that covers a subset of those SNPs. What we can do is use imputation software to fill in the missing SNPs based on the reference haplotypes, leveraging the LD between SNPs.
The imputation panels that we have used over the last 10 years started with HapMap, which was the original backbone for imputation to fill in missing SNPs; the goal was to define variation with frequency greater than 5% in the four collected sample groups. Then, around 2008, 1000 Genomes came along. 1000 Genomes includes about 2,500 individuals, so it’s a much larger set of individuals than HapMap. HapMap had, for Europeans, 30 trios, which gave 120 independent chromosomes to form the backbone, so you really could only get at very common variation. When we went to 1000 Genomes, you could go a little further down the allele frequency spectrum and impute variants down to perhaps 1% frequency well, because you had more copies seen in the reference.
And today, if you were to do imputation, you would go to the Haplotype Reference Consortium (HRC). The HRC has leveraged the sequencing studies that have been performed over the last several years and aggregated that data to create a new reference panel of haplotypes to use as the backbone of your imputation. The HRC has approximately 60,000 haplotypes available, and you can impute down to a minor allele count of approximately 5. As you go down this slide, you increase the number of haplotypes used for the backbone and decrease the allele frequency down to which variants can be imputed well.
There are multiple imputation software packages that can be used to go from one of these panels and impute those SNPs into your samples that have already been genotyped on a chip. I’m not going to comment on them individually.
Once you do the imputation, you get post-imputation measures of quality, because you don’t want to keep all your variants; some of the SNPs aren’t going to be imputed well. If a SNP has an imputation quality score close to 1, that means you’ve gotten good imputation quality and you can move on to analyzing that SNP. However, sometimes imputation doesn’t work well, for example in LD-poor regions, and so we exclude SNPs with low imputation quality.
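As an illustration of what such a quality score captures, here is a minimal sketch of one commonly used dosage-based metric (a MaCH/minimac-style Rsq): the variance of the imputed dosages relative to the variance expected for perfectly observed genotypes under Hardy-Weinberg equilibrium.

```python
import numpy as np

def dosage_rsq(dosages):
    """dosages: 1-D array of imputed allele dosages (0..2) for one SNP."""
    p = dosages.mean() / 2                  # estimated allele frequency
    expected_var = 2 * p * (1 - p)          # variance if genotypes were observed
    if expected_var == 0:
        return 0.0                          # monomorphic: no information
    return float(np.var(dosages)) / expected_var
```

Values near 1 suggest well-imputed SNPs; low values are typically filtered out before analysis.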
Once you’ve done imputation and excluded variants with poor imputation quality, you can readily combine information across studies. Here, it shows two different studies, one done on an Affymetrix chip and one on an Illumina chip. If we look at the overlap of SNPs between the two chips, it is very small. But after imputation is applied to fill in missing SNPs based on the reference haplotypes, the overlap becomes much larger, and many more variants can be analyzed in the combined study.
So, after we do quality control and imputation, we can move to the exciting part, where you want to be, and do the analyses. Basically, with genome-wide association studies, we are doing simple regression, comparing the trait value between genotype groups, and we’re doing this over a massive number of SNPs. So we have a simple model here, where Xi is equal to 1 if individual i carries the A allele at the SNP being tested and 0 otherwise, and you regress the trait on the genotype. Basically, you are testing the difference in the mean trait value between individuals who carry the A allele and those who do not. I just described simple linear regression: for quantitative traits, at each marker we compare the trend in the trait across the genotypes, and we can use linear regression and adjust for covariates. For case-control studies, or any dichotomous outcome, you compare the marker frequency in cases to the marker frequency in controls, and there are a couple of different tests you can use, depending on your counts and ultimately your research question. And you want to make sure you control for possible covariates; a lot of times we control for age and sex. Controlling for a covariate (a covariate, not a confounder) is done to increase the precision of your estimates, whereas we control for population structure, using principal components, because it is a confounder of the association.
So basically, our null hypothesis is that there’s no association between SNP i and the outcome. We’re testing whether β is equal to zero, versus the alternative that β is not equal to zero. We get effect sizes, so we’re looking at how much of an effect the genotype has on the outcome, along with standard errors and p-values. So, you’re getting an effect size, a standard error, and a p-value for each of the hundreds of thousands to 2.5 million SNPs in your GWAS, testing each of those SNPs individually against the outcome.
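A minimal sketch of this per-SNP testing loop, assuming a quantitative trait, a (samples x SNPs) genotype matrix `geno`, and an optional covariate matrix; real GWAS software performs the same computation far more efficiently.

```python
import numpy as np
import statsmodels.api as sm

def gwas_linear(trait, geno, covars=None):
    """Return (beta, SE, p-value) per SNP from trait ~ SNP + covariates."""
    base = np.ones((len(trait), 1)) if covars is None else sm.add_constant(covars)
    results = []
    for j in range(geno.shape[1]):
        X = np.column_stack([base, geno[:, j]])    # intercept, covariates, SNP
        fit = sm.OLS(trait, X).fit()
        results.append((fit.params[-1], fit.bse[-1], fit.pvalues[-1]))
    return np.array(results)
```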
Now, we might see a significant result for many reasons. We might see a significant result because there is an actual effect of that SNP on the outcome, but we might also see a significant result because of chance, some bias, or unadjusted confounding. So, we use statistical tests to determine whether our observed difference between groups is likely due to chance.
And we rely on the p-value, which is the probability of the observed result, or something more extreme, given that the null hypothesis is true. If this probability is small enough, we say that the difference is unlikely to be due to chance alone, that we have an actual effect, and we call the result statistically significant. A point to note is that we never accept the null hypothesis; we only fail to reject it, because we may simply not have enough evidence or power. We never accept the null hypothesis.
So, you might ask yourself, what is small enough? We’re testing many variants across the genome for association with a trait, so the simplest approach is to adjust your overall α level for the number of tests you’re performing. Traditionally we do that with a Bonferroni correction, where you divide your overall α level by the number of comparisons to get a new α level, and you compare your p-values to this new α level. In GWAS, we’ve used a significance threshold of approximately 5×10⁻⁸, which corresponds to an overall α of 0.05 divided by the approximately 1 million independent tests that common variation in European populations is understood to represent. That’s very key: this number, 5×10⁻⁸, is really for common variation in Europeans. If you have a sample of another ancestry, you might have more independent variants, and therefore your significance threshold should actually be lower.
So, coming back to combining data across multiple studies, there are two ways to do it. You can do a combined analysis, where you take the individual-level raw data from each of the contributing cohorts, create one huge dataset with all the data together, and do the analysis that way. But this is often not feasible because of patient confidentiality and restrictions on sharing data across institutions. So, a common approach in genome-wide association studies is meta-analysis, where you generate the association statistics within each study and then combine those statistics across studies.
The most popular type of meta-analysis is inverse variance-weighted meta-analysis, also called fixed-effects meta-analysis. It’s implemented in many software packages for running these tests across all the variants in the genome. It is a weighted average of the effect sizes from each study, taking the precision of each estimate into consideration so that larger studies are given more weight and smaller studies are given less weight; the weights are inversely proportional to the variance of the effect estimate (the squared standard error), which reflects the sample size of each study.
There are just some equations here. You have the effect sizes and standard errors within each of the contributing studies. From those you can get a weight for each study, a pooled effect estimate across the studies, and a pooled standard error. Then you can get a meta-analysis Z value by taking that pooled effect size divided by the pooled standard error, and convert that Z-score, which is distributed as a standard normal under the null, to a p-value for your meta-analysis.
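A minimal sketch of those equations in code, using the standard fixed-effects formulas (weights w = 1/SE², pooled β = Σwβ/Σw, pooled SE = √(1/Σw)); the example numbers at the end are made up.

```python
import numpy as np
from scipy.stats import norm

def ivw_meta(betas, ses):
    """Inverse-variance-weighted (fixed-effects) meta-analysis for one SNP."""
    betas, ses = np.asarray(betas, float), np.asarray(ses, float)
    w = 1.0 / ses ** 2                           # inverse-variance weights
    beta_pooled = np.sum(w * betas) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))
    z = beta_pooled / se_pooled
    p = 2 * norm.sf(abs(z))                      # two-sided p-value
    return beta_pooled, se_pooled, z, p

# e.g. ivw_meta([0.12, 0.08, 0.15], [0.05, 0.04, 0.09])
```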
Here are some best practices for imputation-driven meta-analysis that provide a lot more detail on how to operationalize this.
So, whether you’ve done a single study or whether you’ve done this meta-analysis, you have a lot of association statistics that you need to look through and understand, and so two different plots are traditionally created to be able to summarize the many tests that are done within a genome-wide association study.
The first is the QQ plot, or quantile-quantile plot, which gives a visualization of the overall distribution of p-values. On the x-axis you have the expected -log10 p-value, and on the y-axis the observed -log10 p-value. Under the null hypothesis, you would expect the points to follow the 45-degree line, and when they do, you can think that you have no unaccounted-for confounding or other issues with the association statistics. The genomic control λ value is the median observed chi-squared statistic divided by the median chi-squared statistic expected under the null, and it really gets at the bulk of the distribution. Most of the points on this graph lie down here at the lower end, because your p-values should be uniformly distributed between 0 and 1 under the null. So, since we’re plotting -log10 p-values, the majority of the points are down in this lower piece of the QQ plot, and the λ value is getting at what the median of this distribution looks like. We expect λ to be close to 1, and it is quite dependent on sample size: as you get more samples, you’re more likely to see a larger λ value.
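A minimal sketch of the genomic-control λ computation described here, assuming a 1-degree-of-freedom test; the QQ plot itself is just the sorted observed -log10 p-values plotted against the expected uniform quantiles.

```python
import numpy as np
from scipy.stats import chi2

def lambda_gc(pvalues):
    """Genomic-control lambda: median observed chi-square over its null median."""
    observed = chi2.isf(np.asarray(pvalues), df=1)    # p-values -> 1-df statistics
    return np.median(observed) / chi2.ppf(0.5, df=1)  # null median is about 0.455
```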
Audience question: So the question was: why, in this QQ plot, do the points dip below the diagonal?
Gina: Here, it’s just random noise. You can see I drew a confidence interval around that 45-degree line, and the points are all pretty much within that confidence interval, so it is probably just random noise. It could also be that we are not sufficiently powered in the association study. A lot of early rare variant association tests have seen the observed distribution fall below the 45-degree line, and that’s just because they’re not powered enough to detect associations well.
Audience question: The question was why there were more points below the diagonal here. This is simulated data, not real data, so I’m saying it’s random noise; it was a simulated graph, so it’s just random.
Gina: So that is right. The bulk of the distribution is down here because most of our p-values come from the uniform null distribution.
Audience question: So the question is, is the Z-score always normally distributed in a GWAS? If we have adequately designed and analyzed our study, the Z statistic, β divided by its standard error, should follow a standard normal distribution under the null; you just might not have the power to detect very significant effects.
Gina: So, you can have inflation in your test statistics, and you see that in your QQ plot. Here, I simulated some inflation, and you can see that the QQ plot deviates from the line pretty early. This could be because there’s population structure that’s not accounted for, relatedness that’s not accounted for, some technical bias, or poor-quality genotypes. When you see points deviating from the line early on, in the bulk of the distribution, it indicates that there is a problem with your GWAS.
Okay, so let’s take a real example of these QQ plots. Here is a study of LDL cholesterol, and here you see all the SNPs. The 45-degree line is all the way down here; you can’t even see it, and that’s because we have really significant results: we’re getting p-values around 10⁻⁶⁰⁰, highly significant. But we have known variation that is associated with LDL cholesterol. If we zoom in on the lower part of the distribution, you can see that at this lower end the observed values follow the 45-degree line, and what we have up here are true associations. The green points are genome-wide significant. When we remove the loci we’ve already found, the plot behaves pretty normally; this version includes variants that have already been shown to be genome-wide significant.
Here is the second way we summarize the distribution of p-values, again for LDL cholesterol. These are called Manhattan plots because they’re supposed to look like the skyline of Manhattan. What you see are peaks where we have associated loci, at multiple places in the genome. Each of the points on this graph is an individual p-value, an individual association. So, this peak here on chromosome 5 shows a bunch of variants that look like they’re associated with LDL cholesterol, and that is because of LD: the top variant here is in LD with the next variant, and there’s decay of LD as you go down the peak. You would expect to see this in a genome-wide association study, especially when you have imputed data. You do not want to see just a single point sitting up here showing significance; if I just have a single point up in the peaks, there might be a problem with that association, because it’s saying this SNP is associated but nothing in LD with it is. So, that’s something to be cautious of.
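A minimal sketch of how such a Manhattan plot can be drawn, assuming the results are in a pandas DataFrame `df` with columns 'chrom', 'pos', and 'pval' (column names are illustrative).

```python
import numpy as np
import matplotlib.pyplot as plt

def manhattan(df):
    """Scatter -log10(p) against cumulative genomic position, by chromosome."""
    df = df.sort_values(['chrom', 'pos'])
    offset, centers = 0, {}
    fig, ax = plt.subplots(figsize=(12, 4))
    for i, (chrom, grp) in enumerate(df.groupby('chrom', sort=False)):
        x = grp['pos'] + offset
        ax.scatter(x, -np.log10(grp['pval']), s=2,
                   color='navy' if i % 2 else 'steelblue')
        centers[chrom] = x.mean()
        offset = x.max()
    ax.axhline(-np.log10(5e-8), color='red', lw=1)   # genome-wide significance
    ax.set_xticks(list(centers.values()))
    ax.set_xticklabels([str(c) for c in centers])
    ax.set_xlabel('chromosome')
    ax.set_ylabel('-log10(p)')
    return ax
```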
You can really dive down into particular regions with regional association plots. This is done with LocusZoom, a software tool; you can Google it. It takes your genome-wide association results and looks at a particular locus you’re interested in. Here I’m plotting the SORT1 locus, which has been robustly associated with LDL cholesterol. Using publicly available LD information, LocusZoom colors the SNPs by their correlation with the top variant, and you can see that the reds, oranges, and greens are highly correlated with the top SNP in the region, indicating that there is probably one signal that’s really associated in this region and that the other SNPs showing association are just in LD with that top variant.
On the other hand, this is a regional association plot of the CETP region, and you can see that there are some variants that are highly significant but do not look like they are correlated with the top variant, indicating that there might be multiple signals in this region. You would want to do some fine-mapping or conditional analyses to get at the multiple signals within this region.
A key point is that association is not causation. The variants, or SNPs, that we’re analyzing can have a functional effect on the trait: they could cause an amino acid change, or they could change the expression of a gene or be involved in its regulation. But they could also simply be in LD with a functional variant. So, with GWAS, you’re really identifying regions of the genome, loci, that are associated with disease, not the particular variants that are causal for disease.
There are many tools to perform GWAS, developed by individuals at the Broad and Broad affiliates, particularly PLINK and EIGENSOFT, as well as METAL, which was developed for meta-analysis by the Abecasis group, and LocusZoom, which produced the regional association plots I was showing.
So, after you’ve done your GWAS, there’s a lot more that can be done. You shouldn’t think of GWAS as just getting a new set of significant loci associated with your trait; you want to start thinking about secondary analyses that can gain further information from genome-wide association studies. You could do risk prediction: can you use a score of genetic variants that have been shown to be associated with your outcome to predict disease? Can we use genetics and GWAS to predict disease? Pathway analyses: are the associated loci linked to a particular biological pathway? Can we learn new etiology of the disease based on the associations we’re finding in GWAS? We can also do Mendelian randomization, where we leverage genetic markers to get at the causality of biomarkers.
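As an illustration of the risk-prediction idea, a polygenic score is just a weighted sum of effect-allele counts, with the GWAS (or meta-analysis) effect sizes as weights; `geno` and `betas` below are assumed inputs for the scored SNPs.

```python
import numpy as np

def polygenic_score(geno, betas):
    """geno: (samples x SNPs) effect-allele counts; betas: per-SNP GWAS effects."""
    return geno @ np.asarray(betas)       # one score per individual
```

The resulting scores would then be tested as predictors of disease in an independent sample, for example in a logistic regression adjusted for age, sex, and principal components.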
Other secondary analyses include estimating the variance explained by sets of SNPs, which is done with the GCTA software. Now that many GWAS have been performed across a wide range of phenotypes, you can look at pleiotropy: does a SNP relate to multiple traits? You could fine-map your genome-wide association regions to get at the independent SNPs within a locus. And LD score regression is a technique, also developed by individuals at the Broad, to distinguish between confounding and polygenicity in genome-wide association studies; it has been a good tool as the sample sizes of GWAS being performed for common diseases keep increasing.
In summary, GWAS have been successful at locating regions of the genome associated with complex traits. Many of the loci are non-coding, with no obvious gene function, and need to be further investigated through functional studies and follow-up. And I think there are more genetic variants to be found as we grow in sample size; we’re getting more significant genome-wide association signals, they just happen to have lower effect sizes. And we can really leverage non-European individuals: most GWAS have been done in European individuals, and non-European individuals will gain us additional information. So that’s all I have for today. Thank you very much.