Chapter 8.3: Gene Association Analysis (Video Transcript)
MAGMA
Title: How do we go from genetic discoveries from GWAS/WGS/WES to mechanistic disease insight?
Presenter(s): Danielle Posthuma, PhD (Department of Complex Trait Genetics, Vrije Universiteit Amsterdam)
Danielle Posthuma:
Well, welcome back. This is part three of the session on how we go from genetic discoveries to mechanistic disease insight, and in this last part I will focus a little bit on the software tool MAGMA for conducting gene-based and pathway analysis. In the practical you will learn how to work with MAGMA, a tool that was created by Christiaan de Leeuw a couple of years ago, and it can be downloaded from this website over here.
It is a tool for gene-set analysis and requires you to work with the command-line interface, which should now be quite familiar to you. As input you can either provide raw genotypic and phenotypic data, or you can provide summary statistics from already published results, but then you would also need some reference data, because we need information on the LD structure between the SNPs that are part of your analysis. The other input is gene definitions, so that’s which SNPs belong to which genes, and I will come back to that in the next couple of slides, and also definitions of gene sets. If you download MAGMA, there are some default files that you can use that do this for you, but you’re also free to use your own files if you want to.
So just as an aside, if you want to have access to public summary statistics, we created a database, and on this slide I’ve just added one example. When you’re interested in looking at one particular GWAS, it gives you the Manhattan plot, it gives you gene-based plots, it gives you the QQ plot. It also gives you gene-set outcomes already. And it gives you some information about this GWAS, with the link to the PubMed ID and where to download the data. So you can use this database if you want to play around with any software tool that requires you to input summary statistics; you can just download summary statistics from this database, but there are also some other databases that have the same purpose.
In MAGMA gene-set analysis, there are three main steps. Step one is the annotation, where we match SNPs to genes, so MAGMA needs to know which SNPs it has to analyze as part of which gene. So that’s step one.
Step two is the gene analysis, so that’s where we compute the association of the gene with the phenotype. So here the unit of analysis is the gene. And then step three is the gene-set analysis, where the association of gene sets with your phenotype is tested. And then, because it’s a very general linear regression framework which can easily be extended, it’s very easy to use continuous sets. So instead of having a dichotomous set, where genes are either a member of the gene set or not, you can also have a quantitatively defined set of genes, where every gene has a value that indicates how likely it is to be part of a gene set, or that indicates the expression level of a gene in a cell type, and then the cell type is the gene set. And it also allows you to do conditional and joint analysis and interaction analysis, as was explained in part two of today’s lectures.
Annotation
Now going back to the three main steps: annotation. If you download MAGMA it comes with a general annotation file, and there SNPs are mapped to genes based on physical location, but you can also change this annotation file. If you would like to have eQTLs included in it, you can map SNPs that are physically located outside of a gene but have a known eQTL link to the gene, or a chromatin interaction; that’s also possible to use. You can also add a window around the gene, so you can say, well, I would like to have maybe 1 megabase before and after the gene, and those SNPs should also be analyzed as part of this gene. And one SNP can actually be mapped to multiple genes.
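The positional mapping described above can be illustrated with a small sketch. This is not MAGMA's code or file format; the function name `annotate_snps`, the toy coordinates, and the gene and SNP identifiers are all hypothetical, and the sketch only shows the idea of a window around each gene and of one SNP mapping to several genes.

```python
from typing import Dict, List, Tuple

# Hypothetical toy inputs (not MAGMA's file formats):
# gene -> (chromosome, start, stop), SNP -> (chromosome, base-pair position).
GeneLocations = Dict[str, Tuple[str, int, int]]
SnpLocations = Dict[str, Tuple[str, int]]


def annotate_snps(snps: SnpLocations, genes: GeneLocations,
                  window: int = 0) -> Dict[str, List[str]]:
    """Assign each SNP to every gene whose window-extended interval contains it."""
    annotation: Dict[str, List[str]] = {gene_id: [] for gene_id in genes}
    for snp_id, (snp_chrom, position) in snps.items():
        for gene_id, (gene_chrom, start, stop) in genes.items():
            same_chromosome = snp_chrom == gene_chrom
            inside = start - window <= position <= stop + window
            if same_chromosome and inside:
                annotation[gene_id].append(snp_id)
    return annotation


# Two overlapping made-up genes and four made-up SNPs.
genes = {"GENE_A": ("1", 1000, 2000), "GENE_B": ("1", 1900, 3000)}
snps = {"rs1": ("1", 950), "rs2": ("1", 1950), "rs3": ("1", 3500), "rs4": ("2", 1500)}

no_window = annotate_snps(snps, genes)
with_window = annotate_snps(snps, genes, window=100)
print(no_window)
print(with_window)
```

In this toy data, rs2 falls inside both genes and is therefore analyzed as part of both, while rs1 only joins GENE_A once a 100 bp window is added.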
Gene Analysis
Then if you run the analysis, there are four models that are available in MAGMA. If you have the raw genotypic data, then it will conduct a principal component linear regression analysis, and that can only be done when you have access to the raw data. If you input summary statistics, which most of you will probably do, then there are three different models that you can use to evaluate the statistical significance of your genes and of your gene sets.
So the first model is the SNP-wise mean, and it performs the test on the mean SNP association, so that evaluates the evidence for association of all of the SNPs that are located in a gene and then uses the average association to evaluate whether the gene is actually associated. Or you can do the SNP-wise top model, where the focus of the analysis is on the strongest SNP association. And you can also combine these two models and get the SNP-wise multi model, where the evidence from both of these previous models is combined into one p-value for your gene. And deciding which model is best for you, that depends on what your actual hypothesis is. So what kind of sensitivity would you want? We don’t think that there’s a best model. It really depends on the situation or your research question. That’s why we provide multiple models in the MAGMA tool.
So what’s being done in the MAGMA tool when you do a gene-set analysis, that’s basically an analysis of genes. So instead of individuals being your unit of analysis or your data points, the genes are the data points in the analysis. In this table we have listed 10 different gene IDs, and each of these genes has been tested for association in the gene-based step in MAGMA. So they all have some kind of measure for the strength of the association with your phenotype, based on your GWAS summary statistics. And then there’s also an indication of whether or not they are part of the gene set that you would like to test.
So in this case the genes are the data points, the gene set is the grouping variable, and the genetic association with the phenotype is the outcome, so this is basically a simple t-test, testing whether the average association of the genes that are inside your gene set is different from the average association of the genes that are outside of your gene set. And that’s basically just a one-sided test, because you have a very strong hypothesis about which association should be stronger. Now, there are two kinds of tests. You could do a self-contained analysis, where you ask whether the mean genetic association of genes in a gene set is greater than zero. So that’s your null hypothesis and your alternative hypothesis. Whereas in a competitive analysis you ask whether the mean genetic association of genes in the gene set is greater than that of the genes outside of the gene set.
Yeah, so that’s your competitive analysis. And compare this with a randomized controlled trial or any experimental setup. In a self-contained analysis we would ask if the mean improvement of patients in the treatment group is greater than zero, whereas in a competitive analysis you would have a control group, so you would ask whether the mean improvement of patients in the treatment group is actually greater than that of patients in the control group. Now everybody would agree that we would want to do a competitive analysis. We would need a control group, otherwise we cannot really say that the treatment is causing the patients to improve. So that’s also the reason why we think competitive analysis is the way to go in gene-set analysis, and that self-contained analyses are not informative for asking the question whether the set of genes that you tested is actually causally associated with your trait of interest. That’s why we advise never to do a self-contained analysis, but always to use a competitive gene-set analysis.
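The contrast between the two tests can be made concrete with a small sketch. It is a deliberate simplification: plain t-statistics on made-up gene-level Z-scores, with hypothetical function names, and none of the regression framework or covariates that MAGMA itself uses. The toy trait is polygenic in the sense that every gene is mildly associated, inside the set or not.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def self_contained_t(z_in: Sequence[float]) -> float:
    """One-sample t-statistic: is the mean association of the set genes above zero?"""
    return mean(z_in) / sqrt(variance(z_in) / len(z_in))


def competitive_t(z_in: Sequence[float], z_out: Sequence[float]) -> float:
    """Welch t-statistic: set genes compared with the genes outside the set."""
    standard_error = sqrt(variance(z_in) / len(z_in) + variance(z_out) / len(z_out))
    return (mean(z_in) - mean(z_out)) / standard_error


# Made-up gene-level Z-scores for 10 genes; all are mildly associated.
z_in = [1.1, 0.8, 1.3, 1.0]              # genes inside the gene set
z_out = [1.0, 1.2, 0.9, 1.1, 0.7, 1.3]   # genes outside the gene set

print("self-contained t:", round(self_contained_t(z_in), 2))
print("competitive t:   ", round(competitive_t(z_in, z_out), 2))
```

With these numbers the self-contained statistic is large simply because the whole genome is mildly associated, while the competitive statistic stays close to zero because the set is no more associated than the genes around it, which is the control-group argument made above.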
OK. This is just stressing that same point, which I also made in part two of today’s lectures, and if this is not clear, then maybe you should go back to part two of the lecture. So I hope that this message does come across, and I’m looking forward to the MAGMA practical that is planned for later today. Thank you for listening and see you later!
FUMA
Title: FUMA: Functional mapping and annotation of genetic associations
Presenter(s): Kyoko Watanabe, PhD (Regeneron)
Kyoko Watanabe:
Hi everyone. I’m a PhD student at the Vrije Universiteit in Amsterdam. My work mainly focuses on understanding genetic associations in a biological context. Today, I’m going to introduce you to a web application I have recently developed, which is Functional Mapping and Annotation of Genetic Associations [FUMA].
So, I’m going to start with a very quick recap of what GWAS [genome-wide association study] was again. We basically start by genotyping a large number of individuals using SNP [single nucleotide polymorphism] arrays, which nowadays can tag around a million SNPs, and by performing imputation with reference panels, you end up with a maximum of 20 million SNPs. So, in a very simple case, when you have case and control groups in your genotyped individuals, you perform statistical tests to see if the frequency of minor alleles differs between the case and control groups. So, in the end, you get a p-value for every single SNP you have. But, as you can imagine, the number of statistical tests being performed is the same as the number of SNPs you have. Of course, you have to correct for multiple testing, and the gold standard for a genome-wide significant p-value is 5 × 10⁻⁸. So, whenever you find SNPs with p-values less than that, those genomic regions are called “hits” or “significant.”
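The 5 × 10⁻⁸ threshold is conventionally motivated as a Bonferroni correction of a 0.05 significance level for roughly one million independent tests. The sketch below just reproduces that arithmetic; the figure of one million independent tests is the usual rule-of-thumb assumption, and the SNP identifiers and p-values are made up.

```python
from math import isclose

alpha = 0.05                     # family-wise significance level
n_independent_tests = 1_000_000  # assumption: about a million independent common variants

# Bonferroni correction: divide the significance level by the number of tests.
genome_wide_threshold = alpha / n_independent_tests


def is_genome_wide_significant(p_value: float) -> bool:
    """True when a SNP's p-value passes the corrected threshold."""
    return p_value < genome_wide_threshold


# Made-up p-values for three SNPs.
p_values = {"rs1": 3e-9, "rs2": 2e-7, "rs3": 0.04}
hits = [snp for snp, p in p_values.items() if is_genome_wide_significant(p)]
print(genome_wide_threshold, hits)
```

Note that rs2 and rs3 would both pass an uncorrected 0.05 threshold, which is why the correction matters when a million tests are run.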
So, the very first GWAS was published in 2005, and since then, the cost of genotyping has dramatically decreased, which allowed us to collect a much larger number of individuals. Nowadays, big consortia for meta-analysis usually use more than 100,000 individuals. And by increasing the sample size, we also increase the statistical power to detect relatively weak effect sizes. For example, the height study, using around two hundred thousand individuals, ended up identifying more than a hundred genome-wide significant loci. So, we’ve been conducting GWAS in the last decade, and nowadays, in the GWAS Catalog, we have more than 3,000 studies, including over 38,000 unique SNP-trait associations for over 600 phenotypes. So basically, we have a lot of risk loci spread all over the genome.
However, especially for complex traits which are highly polygenic, we know that the association of single SNPs is very weak. To detect those effects, we need a much larger number of samples. And luckily, the UK Biobank was just released this month, and the QC2 database [UK Biobank dataset] contains information on 500,000 individuals and more than a thousand phenotypes. So, the UK Biobank has the potential to identify novel loci for many human complex traits, and we are expecting more and more GWAS to be published in the coming months.
So, the question is, what benefits do we gain from GWAS results? Ideally, we would like to identify the causal variants from genetic associations that can be used to improve diagnostics, prognostics, or even identify novel drug targets or biomarkers. However, an association isn’t causal. Association doesn’t tell anything about causality. And also, purely based on p-values from GWAS, you don’t really know much about underlying biology. Identifying causal variants from GWAS results is not straightforward. So, to overcome this problem, we usually go through several steps.
The first step is to correct for LD [linkage disequilibrium], which is [the] non-random association of alleles at different SNPs. Because of LD, the most significant SNP you find in a specific genomic locus doesn’t necessarily have to be the one actually causing the phenotype. Instead, there could be other SNPs that are truly responsible for the phenotype, and these SNPs might have a high correlation with the most significant SNPs. So, we don’t want to miss those SNPs just based on the p-value. So the first step is to include all the SNPs that have a high correlation with the significant SNPs.
Once you have the list of SNPs you’re interested in, the second step is to check the functional consequences on the genes, for example, whether your SNPs are in exonic regions or in non-coding regions. There are several software tools that can perform this task. However, more than 90% of GWAS findings are known to fall into non-coding regions. So just knowing that you have a hit in a non-coding region doesn’t really help you understand what is actually going on in a biological context. So you still need to annotate the biological functions.
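The first step above, keeping every SNP that is highly correlated with a significant SNP, can be sketched as follows. The function name, the toy p-values, the pairwise r² values, and the r² cut-off of 0.6 are illustrative assumptions, not a description of any particular tool's implementation.

```python
from typing import Dict, Set, Tuple


def candidate_snps(p_values: Dict[str, float],
                   r2: Dict[Tuple[str, str], float],
                   p_threshold: float = 5e-8,
                   r2_threshold: float = 0.6) -> Set[str]:
    """Significant SNPs plus every SNP in high LD (r2 >= threshold) with one of them."""
    significant = {snp for snp, p in p_values.items() if p < p_threshold}
    candidates = set(significant)
    for (snp_a, snp_b), value in r2.items():
        if value < r2_threshold:
            continue
        # LD is symmetric, so check the pair in both directions.
        if snp_a in significant:
            candidates.add(snp_b)
        if snp_b in significant:
            candidates.add(snp_a)
    return candidates


# Made-up summary statistics and pairwise LD for four SNPs.
p_values = {"rs1": 1e-9, "rs2": 3e-6, "rs3": 0.2, "rs4": 4e-7}
r2 = {("rs1", "rs2"): 0.85, ("rs1", "rs3"): 0.10, ("rs2", "rs4"): 0.90}

selected = candidate_snps(p_values, r2)
print(sorted(selected))
```

Here rs2 is kept although its own p-value is not genome-wide significant, because it is in high LD with rs1; rs4 is dropped because it is only correlated with rs2, which is not itself significant.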
There are several data resources you can use. For example, the CADD score is a metric that assesses the deleteriousness of SNPs, and RegulomeDB is a categorical score that indicates how likely the SNP is to affect regulatory elements. Additionally, there are several eQTL databases; for example, GTEx has data for 44 different tissue types. And, especially for non-coding regions, you will also want to check the epigenetic status. The data is available from Roadmap and ENCODE. I didn’t bring up any database names over here, but in the field of the 3-D genome, more and more data is becoming available. So, including Hi-C data might also be another option to map SNPs to distal genes. So, using this functional information at the SNP level, you can end up with a list of genes you’re interested in. Finally, you need to consider expression patterns in different tissue types and also shared biological functions, such as pathways.
So, we’ve been performing these multiple steps manually. As you can imagine, this requires you to install different software and download various databases, and sometimes reformat the data each time. So this is very time-consuming and elaborate. So, we hoped to make a single platform that can perform all of them.
So, we developed a web application named FUMA that basically optimises the four steps I showed in the previous slides into one single platform. In FUMA, there are two main processes. The first one is SNP-to-gene: starting from GWAS summary statistics, we provide lists of candidate SNPs with annotations, and also the lists of prioritized genes. And these genes can be passed to the second process, which is gene-to-func[tion analysis], where it provides you [with] further annotation at the gene level. And another advantage of FUMA is we also provide interactive visualisation in the web application, so you don’t have to use external software just for visualisation.
So I’m going to go through what FUMA actually does in each process. In SNP-to-gene, starting from GWAS summary statistics, we first characterize genomic loci by correcting for LD. And here, we provide you with the list of lead SNPs and the genomic risk loci. All the SNPs which are in LD with the lead SNPs are then passed to the second step, which is the annotation of SNPs. Here, we run ANNOVAR and annotate several variant scores and eQTLs, and also Hi-C. Using this information, we finally perform the gene mapping. Currently, we have three different criteria for gene mapping: positional mapping using annotations from ANNOVAR, eQTL mapping, and also chromatin interaction mapping. Before you perform this gene mapping, you can also filter SNPs based on the annotations you obtain from step two. And you can also combine different mappings together. You can specify lots of different parameters when you submit the job. And we provide a list of genes mapped based on the user-defined parameters.
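The idea of filtering SNPs on an annotation and then combining several mapping strategies can be sketched as a filtered union. This is a toy illustration under stated assumptions, not FUMA's implementation: the function `map_genes`, the strategy names, the score cut-off, and all SNP and gene identifiers are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set

# Each hypothetical mapping strategy is a lookup from SNP to the genes it maps to.
Mapping = Dict[str, Set[str]]


def map_genes(snps: Iterable[str],
              mappings: Dict[str, Mapping],
              use: Iterable[str],
              scores: Optional[Dict[str, float]] = None,
              min_score: Optional[float] = None) -> Set[str]:
    """Union of genes over the selected strategies, after an optional SNP-level filter."""
    strategies = list(use)
    mapped: Set[str] = set()
    for snp in snps:
        # Optional filter: skip SNPs whose annotation score is below the cut-off.
        if scores is not None and min_score is not None:
            if scores.get(snp, float("-inf")) < min_score:
                continue
        for strategy in strategies:
            mapped |= mappings[strategy].get(snp, set())
    return mapped


mappings = {
    "positional": {"rs1": {"GENE_A"}, "rs2": {"GENE_A"}},
    "eqtl": {"rs2": {"GENE_B"}},
    "chromatin": {"rs3": {"GENE_C"}},
}
scores = {"rs1": 15.0, "rs2": 3.0, "rs3": 20.0}  # made-up deleteriousness scores
snps = ["rs1", "rs2", "rs3"]

positional_and_eqtl = map_genes(snps, mappings, use=["positional", "eqtl"])
all_filtered = map_genes(snps, mappings, use=["positional", "eqtl", "chromatin"],
                         scores=scores, min_score=10.0)
print(sorted(positional_and_eqtl), sorted(all_filtered))
```

In the toy data, adding the chromatin interaction strategy brings in GENE_C, while the score filter removes rs2 and with it the only link to GENE_B.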
So this is just an example of how the result page looks. We provide a Manhattan plot on the top.
And the second one is the gene-based test, which we perform using the MAGMA software. So this is the Manhattan plot based on gene p-values. And then there are the summary results per genomic risk locus. All the results are available as a table. And you can also create a regional plot with all the annotations and results together. And all the results and plots are downloadable.
So this is just one example [of] how you can utilise the eQTL mapping. This is one of the risk loci on chromosome 14 from a schizophrenia GWAS. From the top, you’ll see a zoomed-in Manhattan plot, and the genes, CADD score, RegulomeDB, chromatin states, and eQTLs. As you can see, the risk locus itself spans multiple genes. So if you don’t have any further information, you end up listing all the genes, or you can manually check the function of the genes and pick the one that has the most interesting function for the phenotype. However, by performing eQTL mapping, we prioritise the single genes which have eQTLs in the brain. So by performing different types of eQTL mapping, you can also prioritise genes.
And another example is for chromatin interaction mapping. FUMA currently uses Hi-C data from Schmitt et al., which includes 14 different tissue types and cell lines. As I already said, the field is growing very fast. We also provide [the] option to supply a custom chromatin interaction matrix, which isn’t limited to Hi-C but can include Capture Hi-C and 5C. So the graph shows the risk loci on chromosome 16 from a BMI GWAS. The outermost layer is the Manhattan plot, and the second, the blue circle, is the genome coordinates. And the risk loci are highlighted in blue. And inside of the circle, the orange links are Hi-C links, and green links are eQTLs. So, as you can clearly see, Hi-C can map SNPs to more distal genes than eQTLs. So this can help you to identify novel candidate genes which you might have missed.
So finally, once you have the list of genes, you can use the gene-to-func process, where we provide a gene expression heatmap, tissue specificity by performing overrepresentation tests for differentially expressed genes across different tissue types, enrichment testing for gene sets, and also external links to OMIM [Online Mendelian Inheritance in Man] and DrugBank to further investigate the individual genes.
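An overrepresentation test of the kind mentioned above is commonly run as a one-sided hypergeometric test: given a background of genes, how surprising is the overlap between the prioritized genes and a gene set? The sketch below shows that calculation with the standard library only. Treating it as the test used here is an assumption, and the function name and all counts are made up.

```python
from math import comb


def hypergeom_enrichment_p(n_background: int, n_in_set: int,
                           n_selected: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_in_set belong to the gene set."""
    largest_possible = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return tail / total


# Made-up counts: 20,000 background genes, a 200-gene set,
# 50 prioritized genes of which 5 are in the set.
p_enriched = hypergeom_enrichment_p(20_000, 200, 50, 5)
print(f"enrichment p-value: {p_enriched:.3g}")
```

By chance alone about 0.5 of the 50 prioritized genes would be expected to fall in the set, so an overlap of 5 gives a small p-value; in practice such p-values would still need multiple-testing correction across all the gene sets tested.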
So, in summary, we optimise the post-GWAS annotation in a single platform, as a web application. So this might be the very first place to stop by for a very broad overview of what’s going on in the GWAS risk loci once you get new GWAS results. But also, if you have a phenotype of interest, there are lots of GWAS summary statistics already available. So you can run FUMA on the available GWAS and start integrating it with your research. And for future updates, we are thinking of extending FUMA to be able to accept whole-exome sequencing studies and also EWAS.
And so finally, I would like to thank my supervisor, my co-supervisor, and FUMA is available online, so please feel free to visit the website. And I also have a poster this evening at location B-325, so if you want to know more details, please feel free to visit me. Thank you. [Note: please note that this is an archival recording; the FUMA website is available at https://fuma.ctglab.nl/]
Facilitator: Time for a couple of questions?
Audience member: So I have a question. I’ve seen a couple of cases where, even though the risk locus is associated with the same phenotype, there’s clear evidence of distinct haplotypes. It seems like FUMA would probably be able to show you cases like that where potentially you’re getting the same phenotype from distinct variants that are affecting, say, the promoter of the gene or a nearby enhancer.
Kyoko Watanabe: You mean, like, pleiotropy?
Audience member: Like, same phenotype but two different causal variants.
Kyoko Watanabe: In the same region?
Audience member: In the same region.
Kyoko Watanabe: Um. Yeah. So it’s more like FUMA is just to annotate what functional information is available, so it just provides you the options [for] which SNPs you’re going to look at further. So it’s not… It’s not removing the information. So, you might get multiple SNPs that have functions from one locus, but yeah, we cannot distinguish which [one] is actually causal. But, I don’t think you’re actually going to miss that information.