Software Tutorials: Datasets (Video Transcript)

FinnGen

Title: Intro to FinnGen

Presenter(s): Aarno Palotie, MD, PhD (Institute for Molecular Medicine Finland (FIMM), University of Helsinki)

Aarno Palotie:

Good morning, good evening, good afternoon, wherever you happen to be in your time zone, and thank you for attending this PGC meeting today. I’m going to tell you about three major collections in Finland that utilize some of the special characteristics of the Finnish population.

So there they are: FinnGen, the SUPER Finland project, and the Northern Finnish Intellectual Disability Cohort. The success of genetic studies, especially in disease genetics, relies on basically four cornerstones: the population isolate, the national health registers, the long-standing epidemiological studies which have further developed into biobanks, and then, of course, knowledge of the genome.

First, diving into what the national registries in the Scandinavian countries are. They are ways to record the usage of healthcare services, and they are very similar in all Scandinavian countries, not just in Finland. It means that whenever and wherever in the country you visit a hospital, your diagnosis is recorded in a central register. The same thing happens when you purchase a prescription drug. These registers were originally developed for administrative purposes, but since they accumulate a lot of data on healthcare service usage, they have become very interesting resources for research. And although they are silos as such, the unique personal identification number (or social security number, whichever way you would like to look at it) makes it possible to combine this data once you get the appropriate permits.

The second aspect is the population history of Finland. It stems a few thousand years back, when a small number of settlers were mainly living in the coastal regions of Finland, but then in the 16th century the Swedish King Gustav Vasa demanded that Finns also move to the eastern and northern parts of the country, which resulted in a second, very strong internal migration and multiple bottlenecks. What then happened in all Nordic countries, after the 18th century, is rapid population growth, which obviously resulted in all the typical characteristics of a population isolate.

The nationwide registers that I was describing: what is great about them is that every single individual – every citizen, every resident – is recorded, and they go back decades in a digital format. Some of them go back all the way to the 1950s, but I think the most important time point is the late 1960s, when the hospital discharge data, the cause-of-death data, and the reimbursement data came into use.

And this obviously provides an opportunity to look at longitudinal data, because this health data is available from birth to death, and that is one of the key characteristics of this data. It doesn’t include symptom-level data – it’s ICD codes and ATC codes – but the great thing is, it’s longitudinal.

And this is what the FinnGen research project is based on. It’s a project that was initiated in 2017 with the idea of using genetic strategies to understand disease mechanisms. It aims to collect 500,000 individuals, which is roughly 10% of the population, genotype them, impute them against a population-specific deep whole-genome sequence backbone, and then integrate the national health register data, which results in a dataset of 500,000 individuals with both genome and health data for association analysis.

This project is a public-private collaboration – a research project in which all Finnish biobanks, which exist in all university hospitals around the country, and all universities with a medical school are involved, as well as the Institute for Health and Welfare and the blood transfusion center. The Finnish Biobank Cooperative is also a partner, and currently 13 pharma partners are working with us on our scientific goals.

This is a 10-year project, and we have just completed year four. The sample collection will go on for a little less than two more years, at which point our collection will be complete and we will really have an opportunity to focus further on the analysis. Where are we with respect to the collection? We have currently collected 472,000 participants, which means that we are slightly ahead of schedule and seem well set to reach our goal. This includes – not quite yet, but eventually – some 200,000 legacy collections and 300,000 prospective collections. What is crucial about the prospective collection is that most of the samples come from hospital biobanks, typically university hospitals, which means they are specifically enriched for diseases treated in these places. So, compared to the UK Biobank, this is a big difference in the content of the sample.

We produce datasets every six months. The current data freeze consists of 350,000 individuals and more than 4,000 endpoints. The pheWAS analyses done from these data freezes are made public 12 months after they become ready.

If you look at the type of phenotypes that we have, the special thing is, again, that the mean age is relatively high, which means there are already quite a lot of healthcare events. The mean number of health events per individual is 340, including some 186 drug purchases. This means that there’s quite a lot of data. The way we build these phenotype endpoints is that we combine data from different registers, whether drug usage or hospital visits, and create an endpoint in this way. This requires quite a lot of understanding of how the national register data works: what its strengths are, what the drawbacks are, and so forth.
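
To make the register-to-endpoint idea concrete, here is a minimal sketch (not FinnGen’s actual pipeline) of how a disease endpoint can be assembled by combining an ICD-coded care register with an ATC-coded drug purchase register. The table layouts, code lists, and the example endpoint definition are all hypothetical.

```python
# Minimal sketch of building a register-based endpoint.
# Register layouts, code lists, and the endpoint rule are hypothetical.
import pandas as pd

# Hospital discharge register: one row per visit with an ICD-10 code
hospital = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "icd10":     ["I21", "E11", "I10", "F32"],
})

# Drug purchase register: one row per prescription purchase with an ATC code
purchases = pd.DataFrame({
    "person_id": [2, 3, 3],
    "atc":       ["A10BA02", "N06AB06", "N06AB06"],
})

# Example endpoint: "type 2 diabetes" defined as an E11 diagnosis OR
# a purchase of an oral antidiabetic drug (ATC code starting with A10B)
dx_cases   = set(hospital.loc[hospital["icd10"].str.startswith("E11"), "person_id"])
drug_cases = set(purchases.loc[purchases["atc"].str.startswith("A10B"), "person_id"])

t2d_cases = dx_cases | drug_cases
print(sorted(t2d_cases))   # -> [2]
```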

And how do we then access the data? We have two levels, and both of them work in the Google Cloud. One is the summary-level data – the results, in other words, from the GWAS and pheWAS analyses. And then there is the secure environment, and this is something to stress for those familiar with the European data regulations, which have become quite stringent. This environment copes with all those requirements while still allowing the data to be accessed from outside the European Union. The idea is that it’s a secure environment where you cannot take data out; you can analyze it, you can look at it, but you cannot copy it out or download it to your own computer.

What has been characteristic in analyzing this data is that there are a large number of Finnish-specific variants, even in very extensively genetically studied diseases like type 2 diabetes. All the hits marked with stars here are Finnish-enriched alleles and Finnish-specific hits. And this is characteristic of all the traits that we have studied: there are typically Finnish-enriched variants.

The other way you can look at it: since we have the medication data from all individuals in the entire country, you can ask questions the other way around. What about those who use a given medicine? Do we see enriched variants that actually give additional information beyond the basic diagnosis? And this seems to be the case, at least in some of the cardiovascular medications and their related medication GWASs.

So, when we look at the numbers of mental health diagnoses in FinnGen, out of the 321,000 individuals in that data freeze, any ICD code for mental health was observed in 76,000 individuals, but even more individuals – almost 8,000 more – had depression medication. A depression diagnosis was seen in 33,000 individuals, schizophrenia, schizotypal and delusional disorders in 11,600, schizophrenia itself in 6,000, and bipolar disorder in almost 6,000 individuals as well.

A place where you can look into the distribution of diagnoses and how they have been constructed from the register data is called Risteys. If you go to the FinnGen website, you can be guided to this research website, which is open for everyone. You can see various age distributions of the individuals, years of first diagnosis, and comorbidities. It’s quite a helpful site to get an understanding of how our endpoints are constructed.

Then we move to the second collection, which is the SUPER Finland collection. This has been critically supported by the Stanley Center; the collection was done over the years 2016 to 2018. It consists of a little more than 10,000 patients with psychotic disorders. All samples have been genotyped and exome sequenced, and a little fewer than 900 also have whole-genome sequencing. Almost 90 percent of the individuals have consented to iPS research and can be re-contacted, and as of now a little more than 5,200 PBMC lines have been collected for iPS cell generation. This map shows the distribution of where the patients come from, which means that they represent the population density of the country quite well.

As you can imagine, there’s a huge number of hard-working people behind it. They don’t all fit here in the slide, but they worked very hard over the years to first collect them, and now later to analyze them.

The diagnostic groups in SUPER are such that a little more than half of them – a little less than 6,000 – have a schizophrenia diagnosis, 2,600 have bipolar disorder, with the next largest groups being schizoaffective disorders, psychotic depression, and other psychoses.

From the recruited individuals, we can clearly see that, as expected, education completion is lower than in the general population, they are less often married than non-psychotic individuals in the population, and so forth. So, we understand that quite a lot of these people are chronically ill and represent that end of the psychosis spectrum.

We also collected questionnaire and cognitive test data at the time of collection, so we don’t have longitudinal cognitive data, just data from recruitment. We have the medical care register data, and blood plasma is also stored.

Then, the third and last collection is the Northern Finnish Intellectual Disability Cohort. We currently have a little more than 3,000 participants. They are all collected from the northern regions, and if you remember the late settlement movement I described, which happened after the 16th century, these were areas that were very sparsely populated, in small villages. We can see from the frequencies that the prevalence of both intellectual disability and schizophrenia is higher in these areas, and that was one of the reasons that stimulated us to collect individuals from this area who currently don’t have a clear diagnosis for their intellectual disability. Currently, most of them have already been exome sequenced and GWAS analyzed. Just for interest, 56 SCHEMA gene variants are observed in 80 carriers. It’s interesting that we are dealing with a majority of mild ID cases, which is slightly unusual for this type of cohort. It’s interesting to compare the variant carriers in this group and in the SUPER psychosis study: indeed, the same variants exist, as well as the same genes carrying different variants, in these two collections.

And I thank everyone for listening and hope that these cohorts provide an opportunity for a good harvest. Thank you.


PsychENCODE

Title: Introduction to PsychENCODE

Presenter(s): Chunyu Liu, PhD (Department of Psychiatry, SUNY Upstate Medical University)

Chunyu Liu:

Hello. I’m Chunyu Liu at SUNY Upstate Medical University at Syracuse. I’m glad to be here to introduce the PsychENCODE project and explain how PsychENCODE data might be useful for your study of psychiatric genetics.

So, the PsychENCODE Consortium was established in 2014, funded by NIMH. Now, it has grown into a big consortium with 47 grants supporting 12 institutes and more than 100 scientists. The Consortium focuses on the human brain; we do have individual projects covering non-human primates and even mice, but the majority of the studies are on humans. We cover several major psychiatric disorders, including schizophrenia, bipolar disorder, autism, PTSD, and major depression. But a big chunk of the brain samples actually come from psychiatrically normal people. We also cover a wide range of life stages, from prenatal fetal tissue and early developmental stages to adult brain.

The major features of PsychENCODE which make it different from other Consortia like ENCODE and GTEx are the following four points:

1. Genetic variation: We use population samples and have hundreds, even thousands, of brains assayed and sequenced. With that, we can study genetic variation for its contribution or regulatory role in different omics.

2. Brain focus: We focus on the brain over other organ tissues. The major brain region we study is frontal cortex. We do have a few studies that cover other brain regions, but frontal cortex is a major brain region we focus on.

3. Major diseases: As I already said, schizophrenia, bipolar disorder, autism, major depression, and PTSD are the major diseases we study.

4. Different omics: We have several different omics covered by different projects. For example, we have chromatin data, RNA-seq data, and Hi-C data.

In 2019, the Consortium had the first wave of publications. Eleven papers were published in the Science family journals. Since the papers have been published, I will not spend time on the major results of those studies.

I just want to describe how you can access the data. All the data produced by PsychENCODE has been deposited into Synapse, which is managed by Mette Peters and Kelsey Montgomery. So, this is the website they created as a public portal where you can browse the data and request access to all the data produced so far. It’s released to the public domain.

There are already more than 200 terabytes of data produced by the early phases of the PsychENCODE project. As you can see here, the biggest datasets are actually RNA-seq data, followed by ChIP-seq data and ATAC-seq data. The other data types are relatively smaller, like Hi-C, NOMe-seq, bisulfite sequencing, and so on.

A web interface or service that the Consortium is creating is PsychSCREEN. This is a project led by Dr. Zhiping Weng’s group. They are creating a website for you to query PsychENCODE’s analytical results. You can query the information by genes, regulatory elements, or genetic variants; querying by genetic variant is the simplest path in. We will talk a little more about how these relate to PGC.

The major thing I want to convey today is how PsychENCODE can actually be connected to PGC. This blue box shows that SNP associations with different disorders – the GWAS results – are the major products from PGC. The green oval covers the major products from PsychENCODE: all the brain omics for gene expression, epigenetic measures, and protein abundance. With that, you can calculate QTLs, including expression QTLs (eQTLs), which basically associate SNPs with gene expression. You can also compute QTLs for the other omics. Similarly, you can do differential analysis – differential expression, for example. Beyond that, you can also perform Mendelian randomization analysis, for example, to study the causal relationship between SNP and disease.
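
As a concrete illustration of the eQTL idea described above, here is a minimal sketch of a single SNP–gene association test on simulated data. It is not the Consortium’s pipeline: real analyses run across thousands of SNP–gene pairs, adjust for covariates such as ancestry principal components and expression surrogate variables, and correct for genome-wide multiple testing.

```python
# Minimal eQTL sketch: test whether genotype dosage (0/1/2 alt alleles)
# is associated with a gene's expression level. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 400                                          # number of brain samples (illustrative)
dosage = rng.binomial(2, 0.3, size=n)            # SNP genotype dosage per sample
expression = 0.5 * dosage + rng.normal(size=n)   # simulated expression with a true effect

slope, intercept, r, p, se = stats.linregress(dosage, expression)
print(f"eQTL effect (beta) = {slope:.2f}, p = {p:.2e}")
```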

I want to spend a few minutes talking about the project in my own lab, called BrainGVEX. It’s one of the dozens of projects funded by PsychENCODE. My first collaborators on this project were Kevin White in Chicago and Junmin Peng from Children’s Hospital. My most recent project is a collaboration with Stella Dracheva from Mount Sinai and Eran Mukamel from UCSD. We also have collaborations with the Wang lab at the University of Texas and with Central South University in China, working on data mining and analysis of the data we have generated so far. I want to especially thank the program officers, Gita and Alexander, who have provided us with continued support.

From this BrainGVEX study, we have generated genotype data – either by array or by low-pass whole-genome sequencing – on more than 400 brains. Most of them have been sequenced for the transcriptome using RNA-seq. We also have a good proportion of samples with ATAC-seq, Ribo-seq, and proteomics by mass spectrometry.

So, you can see that this really covers the major components of the central dogma, from genetic variants to gene expression and transcription, and on to translation – mRNA binding to the ribosome and then being translated into protein.

We use this multiomics data, along with ChIP-seq data, to perform QTL mapping. We can also build co-expression networks. With this information, we can further connect to the GWAS results from PGC. The goal is to use all this information to understand and explain the GWAS signals, and also to try to capture causal relationships that are not obvious from the GWAS signals alone. On top of that, we can also perform TWAS analysis to identify novel risk genes and pathways and further build prediction models.

I want to share a few unpublished results. The first is the QTL mapping of multiomics data. This study was done by a former PhD student, Jiang Yi. He was analyzing our BrainGVEX data, which contains RNA-seq, Ribo-seq, mass spec, and ATAC-seq from 200 to 400 brains.

The results show that different omics give you different numbers of QTLs. Panel A shows that eQTLs and splicing QTLs are much more numerous than ribosome QTLs and protein QTLs. And if you look at the effect sizes of the different QTLs, you can see that eQTLs have a stronger effect size than ribosome QTLs and protein QTLs. I think that’s easy to appreciate, because gene expression and ribosome binding are closer to the genetic variants than protein, which involves more steps in the central dogma. They also explain different amounts of heritability from the GWAS results.

When you compare the different types of QTLs obtained from different omics, you will see that SNP–gene pairs tested in different omics show good consistency, which is reflected in the positive correlation along the diagonal. But at the same time, you do see some interesting pairs going in the opposite direction. For example, in this panel comparing eQTLs with ribosome QTLs, a small number of SNP–gene pairs show a negative correlation. The same can be observed when comparing ribosome QTLs to protein QTLs, or gene expression to protein.

At the same time, we realize that many gaps remain to be filled by the PsychENCODE Consortium and its data analysis, particularly from the eQTL perspective. We still do not cover early development well, we do not have good coverage of diverse populations, and we still need to cover cell types in detail.

We have several ongoing projects, and I’m going to describe one right now. Again, this is unpublished data, from a collaboration between my lab and Michael Gandal at UCLA. This slide, prepared by Michael’s student Cindy Wen, shows what we are looking at. We are generating gene-level, isoform, and splicing QTLs on fetal brain tissue from nearly 700 samples covering three trimesters and three population sources. This data shows that the fetal brain does capture some unique early developmental QTLs. You can use this information to analyze GWAS signals, and these QTLs are enriched in the GWAS signal as well, just as you observe with adult brain QTLs.

That was a study on fetal brain. We already have some cross-population studies, but in adult brain the same gap exists: we don’t have good coverage of diverse populations. The current data we generate from bulk tissue RNA-seq predominantly originates from European populations; we have only 18 African American brains. That has been corrected to a degree in the single-cell RNA-seq data, where we have more representation from Hispanic and African American populations relative to the Caucasian sample. But so far, none of the non-European samples have been formally analyzed.

We have an ongoing project to study East Asian (Han Chinese) brain eQTLs. This is a project with my collaborator Chen Chao in China. These slides were prepared by our student, Chen Yu. Using 150 brain samples from China, we can identify a large number of eQTLs and splicing QTLs, covering thousands of genes. About 80% of the eGenes and splicing QTLs are consistent across the two populations; at the same time, this tells us that a substantial number of eQTLs appear in only one population and not the other. And the important thing is that when we use the QTLs to explain GWAS signals, we see an interesting phenomenon: if you use East Asian QTLs to explain East Asian GWAS signals, a significantly higher proportion can be explained than when using the European eQTL data.

So, I said we have a gap to fill regarding cell-type eQTLs. That is a major task for the current phase of PsychENCODE. Right now, hundreds of brains are under investigation, meaning they’re being sequenced with single-cell RNA-seq or ATAC-seq. We expect that many major cell types will be covered.

So, stay tuned. In 2022, there should be publications coming out from our Consortium regarding cell-type eQTLs, early developmental stage QTLs, and some population-specific QTLs.

Because of the pandemic, this is our latest group picture. Several labs joined the Consortium later, during the pandemic, so they are not in this picture. As you can see, we have many senior investigators and young fellows here. I believe everyone will have their own vision or interpretation of the PsychENCODE data. You can definitely approach them and listen to their explanations or presentations about PsychENCODE. Hopefully, this will be a very useful resource for you to study psychiatric genetics.

Thank you, bye.


gnomAD

Title: gnomAD: Using large genomic data sets to interpret human genetic variation

Presenter(s): Anne O’Donnell-Luria, PhD (Broad Institute of Harvard and MIT)

Anne O’Donnell-Luria:

Right, I am very excited to be here today to talk to you about one of my favorite topics: reference population databases.

Just to give you a little background: when we sequence one person’s exome, we are actually sequencing about 20,000 human genes, and we find 20,000 to 30,000 protein-coding variants; it varies depending on the person’s ancestry. From a rare disease perspective (which isn’t the only use of exome sequencing, obviously, but it’s the perspective I come from), we’re looking for one or two pathogenic variants. And what do I mean by pathogenic variants? These are disease-causing variants – changes in the sequence that cause disease. We use this to contrast with benign variants, or what the old terminology called polymorphisms. So, we go through all this data and look through it. I will just mention that Daniel MacArthur has a primer from 2017 on the same topic that takes a broader view and covers a lot of the research applications of what we can do with these databases. What I’m going to talk to you about today is really how to use one of the largest population reference databases in very high detail, so that you should all come out of this feeling comfortable using this database.

What’s in an exome?

So, what is in an exome? If I were to sequence any of our exomes from this room today, we would find many rare, potentially functional variants. We all have about 500 rare missense variants, and about a third of them are predicted damaging by in silico predictors. In silico predictors are tools like PolyPhen, SIFT, and CADD. There are many of these these days; they are computational tools that look at a variant – they might look at conservation, how big the amino acid change is biochemically, or other properties – and they try to predict whether the variant is going to damage the protein structure or function, or whether it’s going to be tolerated. Those are very useful tools to have, but given that about a third of rare variants are still predicted to be damaging, we’re obviously not able to just highlight our pathogenic variants with these predictors.

We all have about a hundred loss-of-function variants – variants that disrupt the protein. These are places we might think could be important, but we all have a hundred of them. About 20 are homozygous, so we are all knockouts for about 20 genes, and we all have about 20 very rare loss-of-function variants.

We all have a hundred rare variants in known disease genes. When I say ‘rare,’ I tend to mean less than about 1% allele frequency in the general population. So, that’s a lot of variants to look at there. When we first started making these large reference databases, we found that everyone in the general population had over 50 variants that had been reported as disease-causing in a clinical database like the Human Gene Mutation Database or ClinVar. Obviously, you’ve all made it here today, so you must not have 50 disease-causing variants, because we’re all functioning very well. What turned out to be going on was that we used to sequence people who came into clinic with rare disease, we would look at the genes we knew were associated with disease, and we would say, ‘Oh, if you have a variant that’s not seen in 50 or 100 Europeans, that must be what is causing your disease.’ But it turns out a lot of those variants were ancestral variants – common in East Asians or common in Latinos – and we were confusing those ancestry-related variants with disease variants. So, the databases are starting to get somewhat cleaned up, and it’s getting much, much better over time, particularly with the current criteria for variant interpretation, which I’ll briefly touch on.

We all have one or two de novo protein-coding mutations – new mutations in us. And then we have an unknown number of sequencing errors. So, my goal as a clinical geneticist doing rare disease exome analysis is to find these pathogenic genetic variants within this sea of benign variation.

This is obviously the approach we would like to take—you’d like to just be able to look at where a variant is and know it right away. This is our current approach; it actually works pretty well. We’re able to make a diagnosis in about 30% of cases, but it requires digging through the exome variants until we kind of look like this and find what we’re looking for. I think many of you who I work with are very familiar with this feeling.

Harnessing the power of allele frequency

So, one of the things that has really completely changed the field is what we’re going to talk about today: using the large general population and harnessing the power of population allele frequency to compare against the exome of one individual, so you can tell which variants are common, which are rare, and which are extremely rare.

Mendelian disease: Mainly looking for rare variants with large effect size

What I focus on is these rare alleles, but the databases contain the whole allele frequency spectrum, from the common variants down to the very rare variants. A lot of what we’re talking about today is looking for rare variants that have a very large effect size – so that if you have that rare variant, you will have the disease.
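
As a concrete illustration of this kind of frequency-based filtering, here is a minimal sketch that keeps only variants that are rare in a reference population. The variant IDs, the column name, and the 0.1% cutoff are illustrative assumptions, not a prescribed workflow.

```python
# Minimal sketch: keep only variants that are rare in the reference population.
# Variant IDs and column names are hypothetical placeholders.
import pandas as pd

variants = pd.DataFrame({
    "variant":          ["var_1", "var_2", "var_3"],
    "gnomad_af_popmax": [0.12, 0.0004, None],   # None = not observed in gnomAD
})

RARE_CUTOFF = 0.001  # 0.1% allele frequency
is_rare = variants["gnomad_af_popmax"].fillna(0) < RARE_CUTOFF
print(variants[is_rare])   # keeps var_2 and var_3, drops the common var_1
```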

Increasing the scale of reference databases

These are the main general reference population databases that are available. One of the early ones was the 1000 Genomes Project, an ancestrally diverse reference with about 26 ancestries sequenced. It was a really great initial reference, but it’s a very small database. Next was the Exome Sequencing Project (ESP), which sequenced about 6,000 exomes, a mixture of Europeans and African Americans. Sometimes I hear that this is a healthy population, but it’s actually the same kind of general population that we see in ExAC and gnomAD, which we’ll talk about – it includes people with cardiovascular disease and other common diseases.

The DiscovEHR cohort is a really interesting cohort; it’s from a collaboration between the pharmaceutical company Regeneron and the Geisinger healthcare system in Pennsylvania. They have exome-sequenced about 50,000 individuals from their healthcare population and made an aggregate dataset. What’s really exciting about this dataset is that it’s connected to health medical records, so they’re able to learn interesting things about human biology. We don’t use it as a reference population, because over half of the individuals in that healthcare system are related to each other, and that would be a very biased way to look at allele frequencies, but it’s an interesting dataset.

ExAC, the Exome Aggregation Consortium dataset, was for a while the largest publicly available dataset. It was released in October 2014. This work was led by Daniel MacArthur’s lab, which is who I work with; it was much more ancestrally diverse and all exome data, about 60,000 samples. BRAVO is currently the largest whole-genome dataset that’s publicly available, and they have a site that’s also worth checking out. It’s from the TOPMed study funded by NHLBI.

And then there is gnomAD, which is the majority of what I’m going to talk about today, although I will touch on some analysis that was done on ExAC and that we haven’t gotten to on gnomAD yet. This is now the largest reference population database. It has exome or genome data from over 140,000 individuals.

The data has been provided by 109 – actually, 110 now – different PIs. These are studies that were already conducted, many at the Broad Institute. What Daniel did was form this consortium and get all of these PIs to agree to share their data with the project. Then all the data was run through the same pipeline and jointly called to make one giant, very high-quality dataset.

The actual whole dataset is available for download. I don’t suggest you do that, because it’s a huge file, but if you’re setting up a diagnostic or research pipeline, you can download it and annotate all of your data with the gnomAD data. We are not able to share individual-level data; we don’t have consent for that. These are studies that were already done for other reasons that we’re using for this dataset – these are not samples that were sequenced for gnomAD, and we don’t have information on these individuals. It’s fairly balanced by sex, with fifty-five percent male. The mean age is fifty-four years, so these are mostly adults; there are a few individuals under eighteen in the dataset. There are cases and controls from common disease studies. We don’t knowingly include any cohorts that were recruited for pediatric-onset disease, but that doesn’t mean there couldn’t be some individual with pediatric-onset disease who participates in an adult study. There’s lots of type 2 diabetes, there’s schizophrenia and bipolar disorder, there’s GI disease, there’s cardiovascular disease – a whole range of diseases. But there are cases and controls, and one of the nice things is that because there are so many different cohorts in here, a lot of the effects from any individual cohort are washed out in the whole dataset. So, we have a pretty good representation of the general population.

We report the overall allele frequency for each variant but also something called popmax, which is the highest allele frequency in any of the subpopulations. But we only use continental populations – Europeans, South Asians, East Asians, Africans/African Americans, and Latinos – as our general populations for popmax.
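
A minimal sketch of what the popmax computation amounts to, using made-up allele frequencies for the five continental populations mentioned above:

```python
# Popmax sketch: the highest allele frequency among the continental
# populations listed in the talk. The frequencies below are made up.
pop_af = {
    "nfe": 0.0105,   # Non-Finnish European
    "afr": 0.0002,   # African/African American
    "eas": 0.0,      # East Asian
    "sas": 0.0008,   # South Asian
    "amr": 0.0015,   # Latino/Admixed American
}
popmax_pop, popmax_af = max(pop_af.items(), key=lambda kv: kv[1])
print(popmax_pop, popmax_af)   # -> nfe 0.0105
```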

You can go and download all the data. As I mentioned, the exome data and the genome data are processed in parallel and kept separate. I want to highlight a common misconception that I’ve been hearing about: a lot of people download the exome data if they’re interested in studying coding regions, but download the genome data if they’re interested in studying everything. The exome data has 125,000 individuals with exomes, and the genome data has 15,000 individuals with genomes, so you’re only looking at a subset of the data if you download either of those files. There is no file that merges the exome and genome data together; they’re separate. So, if you’re only going to use one, you would use the exome data because it’s the larger sample size, but it’s better to work with both.

This is a little bit about how we determine ancestries. Laurent Francioli did this work. This is the principal component analysis showing the different ancestry groups. This is done with a random forest machine learning approach, so we are not using self-reported ancestries; we’re determining them from the genetic data. And so what happens here: you have Europeans in orange over here, and the Finnish and Ashkenazi Jewish are the two blue clusters nearby. We have the South Asian samples up here, the East Asian over here, the Latino down here, and the African over there. The way this is computed is that we have individuals whose ancestry we know; we see where they fall on this map, give them a color, and then we figure out who falls near them. You’ll also notice these arcs between the clusters. These are admixture, meaning there are individuals in the dataset who might have, say, one African parent and one South Asian parent, and so you’re seeing the admixture between those.

In this dataset, we also remove any low-quality samples, including anyone with sex chromosome abnormalities like Turner or Klinefelter syndrome. Very importantly, we also remove first- and second-degree relatives, so we shouldn’t have any inflation of our allele frequencies because of relatedness. Konrad Karczewski and Grace Tiao, along with others, have been working on subpopulations in this dataset using a fun way to represent this called UMAP, which I’ve shown you here. The bioRxiv link is there, and there’s also more information in the gnomAD release blog post on the MacArthur lab site. Basically, you can see the different subpopulations broken out here. The size of each cluster has more to do with the representation of each population in the dataset, not in the world, and you can’t read too much into how close the populations are to each other, but it shows you the different subpopulations we’re able to break out.

Cystic fibrosis: to demonstrate a use of gnomAD

Okay, so now we’re going to take the cystic fibrosis gene as our example to really dive into what information is available about a gene and about each variant in this dataset. Just to review, cystic fibrosis is an autosomal recessive disorder: you need two mutations to have the disease, one from your mother and one from your father. If you only have one pathogenic variant in this gene, then you are a carrier for cystic fibrosis.

I’ve talked about the downloads, but the way I actually access all this information is through a website that was built so anyone can have this information easily at their fingertips. It’s incredibly easy to use; I highly recommend you check it out. You just go to gnomad.broadinstitute.org. When you get to this page, there’s a bar where you can type in a gene name or a variant, and you can do region searches. This browser was initially set up by Konrad Karczewski and Ben Weisburd, but recently Matt Solomonson has taken over the lead and has redesigned the look and functionality of the gnomAD browser. Nick Watts is another software engineer who has joined the team and been working on this a lot.

So, this is the cystic fibrosis gene here. There’s a lot of information on this page. The first thing I’m going to draw your attention to are these black bars: these are the exons, with the introns represented between them. The introns are obviously not drawn to scale – they’re much bigger than that – but really all the data is around the exons, so that’s what we show you. This arrow over here tells you whether it’s a forward- or reverse-strand gene; this is a forward-strand gene. These little blue mountains above the exons are the coverage – how many reads we have for the samples on average in the exome data, in blue. You can see that we have more in the middle, with some fall-off at the edges. In green, which is much flatter across the whole gene, is the coverage in the genome data. In genomes, we generally do about 30x, and that’s about what you see here.

There’s other information on here. You have the Ensembl ID up here and the number of variants we see, and there’s always a link you can click to go right to this gene or variant in the UCSC Genome Browser, or to look up the gene in a clinical database like OMIM. We’re going to talk about this table later. We also show all the pathogenic and likely pathogenic variants that are in the ClinVar database, which is a database where clinical labs put their interpretations of genetic variants. Just to be clear, these are the variants that are in that clinical database; many of these, maybe most, may not be in the gnomAD dataset, and we don’t have those linked yet. This is just to show you where disease variants have been seen in each gene.

This row here looks noisy, but it’s showing you the different variants that we see. We see a ton of variation in basically every gene of the human genome. That does vary to some degree, but we see lots of variation across the human population. Red is loss-of-function variants, yellow is missense, green is synonymous, and non-coding variants are in gray. The height of the bar reflects the allele frequency. Below that, you can choose whether you only want to look at a subset of the data – only loss-of-function, only missense, or you can pick and choose among those. Then, at the bottom, is a long table listing every variant we see in this gene. I will say that we do quality control, and the variants shown are only the variants passing our strict QC metrics. Everything is still there in the VCF – we don’t throw anything out – so if you want to see everything, including the variants that fail QC, you just check this box to include filtered variants.

We have the variant listed here – the chromosome, the coordinate, the reference sequence, the alternate sequence, whether we see it in exomes or genomes, and the consequence. This is the beginning of the gene, so these are all in the UTR; they’re non-coding, and that’s how the annotation tells you they’re in the UTR. The allele count is the number of times we see the variant in the hundred and forty thousand people. The allele number is the number of chromosomes on which we have confident, high-quality genotypes – here about 250,000 chromosomes. For an autosomal gene, each person has two copies of that chromosome, so the number of individuals is this number divided by two. The allele frequency is the allele count divided by the allele number. We also show the number of homozygotes, if there are any.
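
Here is the arithmetic behind those columns as a small worked example (the numbers are illustrative, not taken from a real variant):

```python
# Worked example of the table columns described above (illustrative numbers).
allele_count  = 5          # times the alternate allele is observed (AC)
allele_number = 250_000    # high-quality genotyped chromosomes (AN)

individuals_genotyped = allele_number // 2        # autosomal: 2 chromosomes per person
allele_frequency      = allele_count / allele_number

print(individuals_genotyped)        # -> 125000 people
print(f"{allele_frequency:.2e}")    # -> 2.00e-05
```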

You can search for any variant using this nice search bar. I wanted to look up the phenylalanine 508 deletion (Phe508del) variant, which is the most common pathogenic variant for cystic fibrosis. When I type that in, you’ll see that there are five variants within that single codon, and I am interested in the first one, so I click over here to go to the variant page. When you type this in, it also shows you where in the gene the variant falls, which is sometimes helpful.

This is what every variant page looks like. There’s tons of information on here; I’m going to zoom in on some of it. Some of the things I’m going to ignore, but they’re here: some of the quality metrics, if you want to dive into those, and some information about the annotations on different transcripts. To compress this onto one page and show you some features I want to highlight: these are the allele counts in exomes and then genomes, with the allele frequencies. In the top right-hand corner of both the gene page and every variant page, we now have subsets of gnomAD. If you’re studying a patient with some type of neurologic disease and you’re worried that someone in gnomAD who has your variant is in one of the cohorts for adult neurologic disease, you can choose the non-neuro subset – the same goes for non-cancer – or the controls-only dataset, and see if your variant is in one of those subsets. So there are some subsets you can use; I just use all of gnomAD. Also, there is some overlap of samples between TOPMed and gnomAD, and that’s why that option is there.

We have age histograms, so you can see the ages of people who have the variant; these are all rounded ages. And then we have the IGV web browser view, so you can look at the actual raw reads supporting the variant to see if you think the variant is real, or if there are any issues there.

So this is really the heart of the page: the allele frequencies broken down by different ancestry groups. I just want to use this as an example. This is, again, the most common pathogenic variant for cystic fibrosis. We know that the carrier frequency of cystic fibrosis is very high in Europeans – actually, one in twenty-five Europeans is a carrier for cystic fibrosis. I’m only looking at one variant; I don’t want to take all the variants and add them up. We know from databases that about one in forty Europeans is a carrier for this particular variant. So, I wanted to ask: how well does gnomAD work as a representation of the general population? For Europeans, I take the number of chromosomes we see here, divide by two chromosomes per person, and get the number of Europeans genotyped at this site. I multiply that by one in forty, and I expect one thousand six hundred and twelve carriers. And I see fifteen hundred and ninety-eight, so it’s working very well.
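
For readers who want the back-of-envelope calculation spelled out, here is the same check in a few lines. The allele number is back-calculated from the expected carrier count quoted in the talk, so treat it as illustrative:

```python
# Back-of-envelope check from the talk: does the European carrier count in
# gnomAD match the known ~1-in-40 carrier frequency for this variant?
european_allele_number = 128_960                       # chromosomes (back-calculated, illustrative)
europeans_genotyped    = european_allele_number // 2   # -> 64,480 people
expected_carriers      = europeans_genotyped / 40      # ~1 in 40 Europeans carries Phe508del

print(round(expected_carriers))   # -> 1612 expected; the talk reports 1,598 observed
```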

You can look at interesting things like East Asians: we never see this common European carrier variant in the roughly 9,000 East Asians we’ve genotyped at this site. And if you’re interested in subpopulations, you just click this arrow to open them up, and you can see the allele counts and allele frequencies in the different subpopulations. You can also see how many males and females there are for each one.

In general, when we do filtering, we tend to use the full European population, because you worry a little bit about using subpopulations – there could be founder effects or other smaller effects. But if something were seen in ten percent of Swedes, for example, that would make me somewhat reassured that it wasn’t a pathogenic variant. So, the information is helpful, but for the most part, we use the larger ancestry classes when we think about filtering.

Now, by Hardy-Weinberg – I’m not going to show you the math – we expect to see ten homozygotes out of the 64,000 Europeans in this dataset. How many do we actually see? We see one homozygote. And actually, if you had asked me, I would have predicted zero homozygotes in this dataset, because, as I already told you, we haven’t included any cohorts that recruit for pediatric-onset disease, so I wouldn’t think there would necessarily be anyone with cystic fibrosis in this dataset. But I can’t say 100 percent that we would have excluded everyone, because there are so many different types of studies in here; it’s possible that someone with cystic fibrosis could participate in another type of study. I was curious because type 2 diabetes is a known risk if you have cystic fibrosis, so I thought maybe they would have participated in one of those studies by accident. But before I get to that, the first thing you want to ask is: is this homozygous call real? Because if it’s just a sequencing artifact, then it’s not as interesting. So, you go to the read data and look. We have a three-base-pair deletion there, tons of coverage, and it looks very clean. I don’t have any concerns that I can raise from the sequence data, so it looks real. What’s going on?
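
The Hardy-Weinberg arithmetic that the speaker skips is short enough to show; this sketch uses the approximate numbers quoted in the talk (a 1-in-40 carrier frequency and 64,000 Europeans):

```python
# Hardy-Weinberg arithmetic behind the "expect ~10 homozygotes" statement.
carrier_frequency = 1 / 40                   # heterozygote (carrier) frequency, ~2pq
allele_frequency  = carrier_frequency / 2    # q ~= 1/80 when q is small (p ~= 1)
n_europeans       = 64_000

expected_homozygotes = allele_frequency ** 2 * n_europeans   # q^2 * N
print(round(expected_homozygotes))   # -> 10 expected; the talk reports 1 observed
```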

So, I looked, and this person is actually in the control subset; they seem to be a control in whatever study they’re participating in. And that’s interesting: it’s possible that this individual is not penetrant for the condition. That’s a way we can find people who potentially have really interesting biology, whom we might want to include in modifier studies or other studies to figure out why they’re not penetrant. I will say that if you go to cystic fibrosis clinics and talk to the physicians there, they will say they have encountered people who aren’t penetrant for cystic fibrosis despite having mutations that are normally considered completely penetrant. So, this is a known phenomenon, and modifiers have actually been found in one of the sodium channels. It’s an interesting thing you can think about with these databases – although it’s really biobanks, where you’re able to reconnect with the participants, where these studies are the most powerful.

gnomAD is really not meant for finding that one individual, connecting with them, and doing additional studies. There is a database that does try to do that, called Geno2MP; I just wanted to mention it briefly. It has exome data from about 10,000 rare disease patients. You can go and look up your variant of interest, click on the variant, and get one or two high-level HPO phenotype terms associated with that variant. So, if you were studying a new disease gene or looking for individuals with rare disease – it’s not a healthy population – this is a place you can go to find out whether your variant is in there and what it’s associated with. It’s helpful for ruling out a lot of variants. In this example, it’s an intellectual disability gene, but this person has a skeletal system abnormality, so I would not think the variant is pathogenic. But if you’re interested, you can always contact the person who has the case.

Okay, so we looked at this. I told you I thought it looks good. Hoping most of you agree. One of the questions I get is, ‘Well, that’s fine to tell me that it looks good, but what should I be looking for? What looks bad?’ So, I’m going to show you some examples.

Low confidence loss-of-function variants

These are the things we’re going to consider. The first is some flags I want you to be aware of. ‘Low confidence loss-of-function’ variants are predicted to be loss-of-function – they might be nonsense variants – but there’s some reason that the LOFTEE tool, developed by Konrad Karczewski and Daniel MacArthur, has flagged that they might not actually cause loss of function. Sometimes these variants are in exons that don’t seem to be well conserved or aren’t really part of the main transcript. Other times they are nonsense variants in the last exon, for example, which wouldn’t be expected to trigger nonsense-mediated decay, and so those get flagged. It’s a computational prediction, so it doesn’t mean the variant won’t cause loss of function – just as a variant called ‘stop gained’ at the top that is not flagged isn’t guaranteed to cause loss of function. Human interpretation and experimental studies are needed on top of this, but at least it’s something to be aware of.

Poorly aligned regions

This is a region that doesn’t look very good. If one of these were a variant I was interested in and this is what the region looked like, I would be very concerned that there are sequencing errors here. What you notice is that there are many variants in the region; they have different allele balances – some at very low allele balance, some higher – and there’s an insertion there too. There’s just a lot going on here that raises my concern. This is actually a low copy repeat, so there are mapping issues here; this is probably a mapping artifact, and you would not want to treat it as real. Low copy repeats are actually flagged, but anytime you see a lot going on like that, I would be cautious.

Homopolymer runs

Homopolymers are stretches of a single nucleotide in a row. These are hard for PCR to get through; it tends to make errors, inserting or deleting a base – it can be any letter. These are common PCR artifact regions, but they are also regions where the human DNA polymerase has trouble, so they’re places where you can get real sequence variants. I just want to highlight that you need to be cautious of these, and you definitely want to Sanger-validate them if you’re interested in them.

Multinucleotide variants

Multi-nucleotide variants are interesting. That’s when the VCF called each of these variants separately, but when they’re in cis, you should really interpret them together if they’re within the same codon, as these two are. These are actually flagged in the gnomAD browser now, and I’m going to show you that on the next slide. I will also point out that the same kind of thing can happen with a complex indel: you can have a deletion of one base and then, a few nucleotides later, an insertion of one base, so you keep the frame and probably end up with two synonymous or missense variants. But those are going to be called two frameshifts in the VCF, so you need to put them back together. We don’t have those flagged in gnomAD, but they should pop out when you look at the read data.

So, multi-nucleotide variants are flagged in the browser here. This is what they look like on the page. There’s a warning that it’s a multi-nucleotide variant, and if you click ‘more info,’ it shows you a box where you can see the two variants – how they’re interpreted separately and how they’re interpreted combined. This one was called a nonsense variant in a gene associated with severe intellectual disability, so we weren’t sure why we were seeing it in the gnomAD dataset. But combined, it’s really just a missense variant; it’s likely not pathogenic.

Somatic mosaicism

The final thing here is skewed allele balance. After lessons from ExAC, we now have a hard filter for an allele balance of 20% in the gnomAD dataset. Those variants are still in the dataset, but again, they’re only in the filtered view. This one, for example, is at 21%, and it’s a pathogenic variant for a severe pediatric-onset intellectual disability disease. So, we were suspicious of it, and I think this is likely a somatic variant in gnomAD. The only way to be sure would be to go back, do Sanger sequencing, and get different samples from the person. We can’t do any of that, so I would just be suspicious of it.
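
As a small illustration of what allele balance means here, this sketch computes it from hypothetical reference and alternate read counts at a heterozygous call:

```python
# Allele balance sketch: the fraction of reads supporting the alternate allele
# at a heterozygous call. gnomAD applies a hard filter at 20%; values just
# above that, as in this hypothetical example, can suggest somatic mosaicism.
ref_reads = 79
alt_reads = 21

allele_balance = alt_reads / (ref_reads + alt_reads)
print(f"{allele_balance:.0%}")   # -> 21%, right at the suspicious boundary
```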

When a variant is absent from gnomAD, it’s important to determine if that region is covered

The other thing to check – it’s a little bit less of an issue in gnomAD now that we have genome data – is, if a variant is absent from the dataset, is it really absent? There are a few different things that could be happening. The first is that you might be looking at the HGVS nomenclature – a transcript or p. notation – and your variant might actually be displayed differently in the gnomAD browser, so you might be looking in the wrong place. It’s very important that you’re looking at the actual position you think you’re looking at, and chromosome coordinates are the best way to do that. We’re trying to encourage diagnostic labs – and urge all of you, if you get diagnostic reports – to ask your diagnostic labs to put the actual chromosome coordinates on the report so you know you’re looking at the right thing. Otherwise, you can put your HGVS notation into tools like MutationTaster or Mutalyzer and figure out what your position is.

The second is that the position might not be well covered in gnomAD. In ExAC that was definitely an issue: high-GC-content areas or regions that just weren’t on the exome capture weren’t covered. To say that your variant is absent from a dataset when there’s no data there is meaningless, so you don’t want to do that. The third possibility is that the variant is actually absent – not present in 140,000 people – and that’s more interesting. If I want to check quickly that a variant has good coverage but isn’t in the dataset, the cheat I use is this: here is an example variant with its chromosome coordinate. I do a region search – I don’t worry about the C>A, I just search the region – and look for nearby variants; there’s actually another variant in this codon. Then I look at the allele number, which is the number of chromosomes genotyped at that position or nearby. So, while I can’t say the variant is absent from 140,000 people, I can say it’s absent from about 32,000 people, which still makes it very reasonable to call it a rare variant.

Nuance about allele number on the browser

I wanted to point out a nuance in how the data is represented. The genome and exome data are processed separately. You’ll see that all of these variants are in the exomes, and some are also in the genomes, so the allele number is in the 250,000 range. This variant was only identified in the genomes, so it looks like the allele number drops there, but that’s because it’s only found in the genome VCF, not in the exome VCF, and so it’s just represented that way. We’ve left this as it is; we think people can look around it, mostly because the way to fix it would be either to change the browser to represent something different from what the VCF represents, which we worry would cause confusion, or to actually edit the VCF, which just shouldn’t be done. So this is a caveat to be aware of. When I interpret this variant, I would interpret the coverage as the coverage of the surrounding variants – the roughly 250,000 alleles. There’s no reason to think that coverage drops over this site; it’s just that the variant is only in the one dataset.

All the coverage data for all of gnomAD is available for download, so you can use it in your pipeline and not have to look up every variant by hand. But if you have a single diagnostic report, this is an easy way to look it up.

Reference population databases in clinical variant interpretation

Okay, on to using reference population databases in clinical variant interpretation. I’m going to show you a little bit about how we think about variant interpretation. Every variant starts out, in a sense, as a variant of uncertain significance, and what you’re trying to figure out is whether there is evidence that it doesn’t cause disease – that it’s benign – or evidence that it’s pathogenic. Our evidence framework is the standards and guidelines published by Richards et al. from the American College of Medical Genetics.

These are the guidelines we use, and I’m just going to focus on the population data here. I have zoomed in on that line so you can actually read it. What we’re asking is: is the variant too common in the general population to be consistent with causing the disease? Is it absent or very rare in a population database? Or is it overrepresented in cases versus controls – which is more of a case-control study than strictly using a reference population database.

I want to give a caveat here. We give this moderate evidence if a variant is absent from the reference population. However, the vast majority of possible variants are not found in gnomAD: there are tons of variants in every gene in gnomAD, but overall we only see about 10% of the possible variation in the dataset. That doesn’t mean the 90% of variation that’s absent is disease-causing; nobody is suggesting that. However, when we see a rare variant in a gene we know is associated with disease, and it’s the type of variant that causes disease, its absence from the general population raises our suspicion. It’s possible that we’re over-counting this evidence a little bit, and we’re trying to think about ways to be a little smarter about that. So, it’s very useful, but I just want to put a little caution on it.

When we use reference population databases, this is their power. When we had 6,000 people, we asked, in a single person’s exome, how many variants are left if we remove everything with an allele frequency greater than one in 1,000, or 0.1 percent. Using the 6,000 people in the ESP database, which has Europeans and African Americans, we were left with 600 to almost a thousand variants. You can see this database was built with Europeans and African Americans in it, so it works best in those ancestries and much less well in the others. When we went to ExAC, which had 60,000 people and much better ancestral representation, the number of very rare variants left drops to about 150. That helps a lot, but it’s still 150 rare variants to think about, and that’s only filtering down to 1 in 1,000; for recessive diseases you can have pathogenic variants that are much more common than that. But it really highlights the power of these databases.
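As a toy illustration of that kind of filtering, here is a minimal sketch that keeps only the variants in a single annotated VCF that survive an allele-frequency cutoff. It assumes the VCF already carries a reference-population allele-frequency annotation in an INFO field called gnomad_AF here; the field name and file are placeholders.

```python
import pysam

def rare_variants(vcf_path, af_field="gnomad_AF", max_af=0.001):
    """Return variants whose annotated reference-population allele frequency
    is below the cutoff (or that are absent from the reference database)."""
    kept = []
    for rec in pysam.VariantFile(vcf_path):
        af = rec.info.get(af_field)      # may be a tuple, one value per alt allele
        if isinstance(af, tuple):
            af = max(af)
        if af is None or af < max_af:
            kept.append(rec)
    return kept

# Hypothetical usage on one proband's annotated exome:
# print(len(rare_variants("proband.annotated.vcf.gz")))
```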

Example genetic test report

This is an example of a genetic testing report that we get on patients, and every report we get now should tell you how often the variant is seen in the general population. This is a known disease gene, it’s a protein-truncating variant where haploinsufficiency is the mechanism of disease, the patient’s phenotype fits, and the variant is not observed in the general population. So, we’re able to classify it as pathogenic.

Applying gnomAD to clinical practice

The opposite thing we do is say the variant is too common in the general population to be causing disease. Most often these are benign, so you don’t even really see them on the report. But this is an example of a report from 2015. The ExAC data came out in 2014, but some labs were a little slow to adopt it and get it into their workflow. You’ll notice that they don’t comment on the reference population allele frequency in this report. So, any time I see a report like that, I look up the variant in gnomAD myself, which takes two or three minutes, if that.

And so, this is the gene name here, and this is the missense variant they saw. It’s heterozygous. They called it a variant of uncertain clinical significance, which I interpret as a variant of uncertain significance, and the gene is associated with idiopathic pulmonary fibrosis, which is autosomal dominant. That’s an adult-onset disease, and the patient is actually a young child with developmental delay, hypotonia, and some unclear respiratory issues, which is why the pulmonologist sent this testing. When I look up this variant in gnomAD, this is what I see: it’s present in 3.6 percent of Europeans and 2.6 percent of everyone. Again, this is a dominant disease, so I wouldn’t expect such high allele frequencies, and these are the homozygote counts – 338 homozygotes in the gnomAD population with this variant. So, I felt we were able to reclassify it as benign based on this information. The other thing I will say is this position is not conserved. When you look here, it’s not very well conserved, and you see it’s actually a leucine in a number of species. This variant has also been classified by another diagnostic lab and entered into ClinVar as benign.

Frequency filtering

Okay, I’m going to talk about frequency filtering. I’ve thrown out 1% and 0.1%, but what allele frequency should we really be using? This is a statistical approach developed by James Ware and Nicky Whiffin, who works with him at Imperial College London; they have both spent a lot of time working with us here at the Broad as well. And also Eric Minikel, who was in Daniel MacArthur’s lab and is now a graduate student at the Broad working on prion disease. This was published in Genetics in Medicine.

I’m going to go over this at a higher level and also show how you use it. The central tenet is that the frequency of a pathogenic variant in a reference sample that has not been selected for the condition you’re studying should not exceed the prevalence of the condition – very simple. Reference population databases are, after all, a sampling of the general population. There are some exceptions: founder mutations can confuse this, as can bottlenecked populations and, to some degree, balancing selection. And then you need to take penetrance into consideration, because if a variant has low penetrance, its frequency can exceed the prevalence of the condition.

The other point here is that we know the stricter you can be with your filtering, the more likely the variants you keep are to have high odds ratios for developing disease. So you want to filter down to as low an allele frequency as you can justify, because you will have fewer variants to look at.

So, I’m going to pick one variant and use it to walk through this example. This is a ClinVar variant of uncertain significance; it’s actually been classified by nine different labs, and they all agree it’s a variant of uncertain significance. This is its allele frequency table: it’s seen in the general population, but there are no homozygotes, and it’s not very common in any ancestry. The other thing I need to tell you is that this variant has been reported for hypertrophic cardiomyopathy, a condition you would expect to see in the gnomAD population, so it wouldn’t be something that’s been depleted from the dataset the way a severe pediatric-onset disease would be.

Disease-specific allele frequency (AF) thresholds for autosomal dominant disease

Okay, there are two pieces to this. The first is calculating the maximum credible population allele frequency. You need to look at the genetic architecture of the condition to decide how common variants for this condition should be in the population. That depends on the prevalence of the condition, the genetic heterogeneity of the condition, which is the hardest thing to figure out, and the penetrance, which can also be hard to figure out; but honestly, you can hand-wave a bit – you don’t have to be exact.

Hypertrophic cardiomyopathy (HCM) specific AF threshold

Okay, so we’re going to use hypertrophic cardiomyopathy as our first example. We know that about one in 500 people in the general population will develop hypertrophic cardiomyopathy over the course of their life. This is a dominant condition, and you have two chromosomes, so you multiply by 1/2. We’re going to call the penetrance for this condition 50%. I don’t know exactly what it is; it varies by variant, but 50% seems like a relatively conservative estimate. And then, to estimate the genetic heterogeneity, what I’m going to do is take the most common pathogenic allele known for this pretty well-studied disease – not its allele frequency in the general population, but, if you take a cohort of people with hypertrophic cardiomyopathy, the percentage of those people whose disease is explained by that variant. The most common cause of hypertrophic cardiomyopathy is an MYBPC3 variant seen in 2.2% of cases, but we’re going to use 3% as the upper confidence limit. So, we’ll put 3% in there. You can do that math and get 6×10⁻⁵ as our maximum credible population allele frequency, or you can go to the cardiodb.org allele frequency app, pick dominant disease, put your numbers in, and, if you want to play around and ask, ‘What if I had 20% penetrance? What if I had 80% penetrance?’, see how it affects your numbers and decide what number you’re going to go with. These are the numbers we’re going to use for this example.
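To make the arithmetic explicit, here is a minimal sketch of that calculation for a dominant condition, following the inputs in the example: prevalence, times 1/2 (one of two chromosomes), times the maximum allelic contribution, divided by penetrance.

```python
def max_credible_af_dominant(prevalence, max_allelic_contribution, penetrance):
    """Maximum credible population allele frequency for a variant causing an
    autosomal dominant condition: prevalence x 1/2 (one of two chromosomes)
    x maximum allelic contribution, divided by penetrance."""
    return prevalence * 0.5 * max_allelic_contribution / penetrance

# Hypertrophic cardiomyopathy numbers from the example above:
af_max = max_credible_af_dominant(prevalence=1 / 500,
                                  max_allelic_contribution=0.03,
                                  penetrance=0.5)
print(f"{af_max:.1e}")   # 6.0e-05
```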

Then you compare that to a filtering allele frequency, which is a little bit different from the allele frequencies shown in gnomAD. If the variant’s filtering allele frequency is larger than the maximum credible population allele frequency, you discard the variant; it’s too common to be pathogenic. If instead the maximum credible population allele frequency is larger, you retain the variant; it could be pathogenic. This is just a filtering step; there are other things you’re going to do, but this is the filtering piece.

So, what is the filtering allele frequency, and why do we need it? The reason is that gnomAD is a sampling of the population. The numbers in gnomAD are not the true allele frequency of a variant across the entire world, or across a single ancestry; they are estimates from a sample. The filtering allele frequency is a conservative statistical adjustment for that sampling: a lower-bound estimate of how rare the variant could really be while still being compatible with what we observe in gnomAD. I’m going to show you a picture that explains what that means.

Reference population databases sampling measurements of the general population

The first reason we know we need this comes from taking variants in the Exome Sequencing Project. If we take variants at 1% allele frequency in the Exome Sequencing Project and look at their actual frequency in ExAC, we see roughly a Poisson distribution around 1%, and the same for 0.1 percent. But when we take variants seen only once in that population – one out of six thousand – they are not seen at one in six thousand in ExAC; they are often seen much less frequently, and in fact the majority of variants seen once in ESP are never seen again in ExAC. That’s just because there are a lot of very rare variants out there, and you’re always going to sample some of them. So we know there’s a left skew for rare variants, and we want to take that into account.

Computing filtering allele frequencies

And so, because there’s a left skew and it isn’t exactly a Poisson, what Nicky and Eric did is take the observed gnomAD allele count, whatever that number is, and ask which Poisson distributions could have produced it. They find the one whose 95% upper confidence bound lands on the observed count, and make the mean of that Poisson the filtering allele frequency. So it’s back-estimating from the observation using a 95% confidence interval. Your filtering allele frequency will be a little lower – a little rarer – than the raw gnomAD frequency, so you’ll keep a few more variants with this approach. Does that make sense?
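Here is a minimal sketch of that back-estimation using the standard exact Poisson lower confidence limit; it assumes scipy and is only illustrative, since the published method and the values gnomAD precomputes differ in the details.

```python
from scipy.stats import chi2

def filtering_allele_frequency(ac: int, an: int, ci: float = 0.95) -> float:
    """Approximate filtering allele frequency: the lowest true allele frequency
    still consistent, at the given confidence level, with having observed
    `ac` alleles out of `an` genotyped chromosomes.
    Uses the exact Poisson lower confidence limit (chi-square form)."""
    if ac == 0 or an == 0:
        return 0.0
    lower_count = 0.5 * chi2.ppf(1.0 - ci, 2 * ac)   # lower bound on the Poisson mean
    return lower_count / an

# Toy example: 100 copies observed in 250,000 chromosomes.
print(f"observed AF = {100 / 250_000:.1e}, FAF95 ≈ {filtering_allele_frequency(100, 250_000):.1e}")
```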

Okay. And we don’t do this at a global level; we do it for each of the continental ancestries, which is important. We don’t use the Ashkenazi Jewish or Finnish populations, which are founder populations, and, as I mentioned before, we also don’t recommend the sub-continental populations for this type of filtering.

On each variant page in gnomAD, we have the popmax filtering allele frequency, given separately for exomes and genomes because the data are kept separate. I would really love to give you one number. I generally recommend using the exome data, just because there is more of it, so it’s usually the better thing to use; but if you’re only going to use one value, use whichever has the higher allele count, or simply default to the exome data. If you hover over it, you’ll see which population is being used for the popmax. The browser shows the 95 percent confidence level, and the VCF has both the 95 and the 99 percent versions, so you can choose which you use.
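A small sketch of that rule of thumb: pick whichever call set has the higher allele count for the variant and use its popmax filtering allele frequency. The dictionary keys here are illustrative placeholders, not the exact browser or VCF field names.

```python
from typing import Optional

def choose_popmax_faf(exomes: Optional[dict], genomes: Optional[dict]) -> Optional[float]:
    """Pick the popmax filtering allele frequency from whichever call set
    (exomes or genomes) has the higher allele count for this variant."""
    candidates = [d for d in (exomes, genomes)
                  if d is not None and d.get("faf95_popmax") is not None]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d["ac"])
    return best["faf95_popmax"]

# Illustrative values only; "ac" and "faf95_popmax" are placeholder keys.
print(choose_popmax_faf({"ac": 100, "faf95_popmax": 3.4e-4},
                        {"ac": 12, "faf95_popmax": 2.1e-4}))
```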

Okay, so then you take that filtering allele frequency and compare it to the number we calculated earlier. And be careful here, because you’re not comparing 6 to 4: you’re comparing 4×10⁻⁴ against 6×10⁻⁵. The maximum credible frequency is the smaller one, so the filtering allele frequency is higher than the maximum credible population allele frequency, and you can discard this variant as too common to be pathogenic. Let me take a step back: when I say ‘retain a variant,’ rarity is necessary but not sufficient for pathogenicity. And when I say ‘discard a variant,’ what I mean is assign BS1, too common in controls. If the variant is in a disease-relevant gene, you probably want to do a little more than that. If you’re just filtering a whole exome, you obviously can’t go into great detail on every variant; but because this one is in a disease-relevant gene – we’re thinking about cardiomyopathy – we would curate it further. So, I would give it BS1 (‘benign strong 1’ – too common in controls) based on the frequency filtering approach, and then I would see what else we know about it. It also gets BP5 (‘benign supporting 5’), meaning an alternate cause of disease has been found in several cases; that’s known from diagnostic labs that have this data, or from the literature, and diagnostic labs have reported into ClinVar that alternate causes of disease were found in several cases. Additionally, nobody has any segregation data showing that this variant tracks with cardiomyopathy, and no functional data are available. So, based on BP5 and BS1, and using a calculator that weights and adds up these criteria, this variant now gets a ‘likely benign’ classification.
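Putting the two numbers together, a minimal sketch of just the filtering decision from this example could look like the following; the evidence labels in the return strings are shorthand, not a full ACMG classifier.

```python
def frequency_filter(faf: float, max_credible_af: float) -> str:
    """Compare a variant's filtering allele frequency against the
    disease-specific maximum credible population allele frequency."""
    if faf > max_credible_af:
        return "discard: too common in controls (assign BS1, then curate further)"
    return "retain: rare enough to be compatible (rarity is necessary, not sufficient)"

# Numbers from the hypertrophic cardiomyopathy example in the talk:
print(frequency_filter(faf=4e-4, max_credible_af=6e-5))
```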

Audience question: Is that at a whole-gene level? Because there are now several genes where mutations in one exon cause one disease and mutations in another exon cause a different disease, whereas this looks like it covers the whole thing…

Anne: This is a variant-level assessment. So you’re asking whether this assertion is made at the gene level or the variant level?

Audience member: That’s why I was asking which one you’re using here, because there are many genes now where one gene causes two very different diseases depending on which exon the variant is in, and this seems to eliminate that as a possibility. At what level are you resolving it, whole-gene or variant?

Anne: So this is a variant. This is assessing whether this variant is associated with, or causes, this specific phenotype, and I would do a different assessment for every phenotype I considered, because the genetic architecture and the penetrance might be different for each disease. So, every assertion is made for a specific variant considering a specific phenotype or disease.

Audience member: That’s great.

Anne: Yes, Okay. Yes?

Audience member: Yes, so just because the penetrance could be different for each variant, for this one, if you had enough patients, would you try to do a statistical analysis to see if [indistinguishable].

Anne: So part of that is about using more of a case-control study, because you did talk about having a cohort. I don’t actually have that cohort here; I just know what was reported in ClinVar about this variant. But if you have a cohort, then absolutely, doing a case-versus-control comparison is a great thing to do. And the second part was about the penetrance. Two things: I could have played with the penetrance, and the numbers I gave you – the FAF (filtering allele frequency) and the maximum credible allele frequency – were pretty far apart. So, even if I had lowered the penetrance a lot, I don’t think it would have changed my interpretation.

But you can also think about estimating the penetrance – how penetrant a variant could be for a disease – based on how often you see it in gnomAD. That’s a more complex calculation, but it’s something people are doing with this data. It’s very interesting, and I don’t have time to go further into it.
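As a rough flavour of that kind of calculation (and not necessarily the specific method the speaker has in mind), one common back-of-the-envelope approach is a Bayes-rule estimate: penetrance is roughly disease prevalence times the ratio of the allele frequency in cases to the allele frequency in the reference population. The sketch below just encodes that arithmetic with made-up numbers.

```python
def estimated_penetrance(prevalence, af_cases, af_population):
    """Bayes-rule style penetrance estimate:
    P(disease | variant) ≈ P(disease) * P(variant | disease) / P(variant)."""
    return min(1.0, prevalence * af_cases / af_population)

# Made-up numbers: a 1-in-500 disease, variant at 1% in cases, 0.005% in the reference population.
print(f"{estimated_penetrance(1 / 500, 0.01, 5e-5):.0%}")   # ~40%
```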

I’m going to quickly talk about constraint. I have very little time left, so I’m going to fly through this, but this is work by Kaitlin Samocha and Konrad Karczewski, who developed the new constraint scores I’m going to talk about today, together with Mark Daly and Dan MacArthur. Kaitlin has an excellent primer on the model that underlies this, so I recommend checking that out if you’re interested.

Okay, so the idea here is: we know about conservation across species; constraint is, basically, how conserved something is within the human species. So if you have two genes – one that’s very tolerant of variation and one that’s very constrained, intolerant of variation because variation causes disease – those are going to look different when we look at the general population. Mutations arise in both of those genes by chance at roughly the same rate, if they’re otherwise similar genes. The difference is that in the tolerant gene, a lot of that variation gets passed on because it doesn’t cause a problem, while in the constrained gene, most of that variation is deleterious and does not get passed on. So when we look today, tolerant genes have a lot of variation and constrained genes have very little.

And Konrad Karczewski has developed a metric called ‘observed over expected,’ which uses Kaitlin’s mathematical model of how many variants we should see in each gene in a dataset this size, counts how many we actually see, and reports the ratio: what percent of the expectation do we observe in gnomAD? This is really cool, because for 70% of the genes that are depleted for loss-of-function variation, we don’t yet know the human phenotype. So this highlights a lot of genes that are probably underlying interesting human biology.
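The metric itself is simple arithmetic once you have the expected counts from the mutational model; here is a minimal sketch with the expected counts treated as given, since computing them is the hard part that Kaitlin’s model handles.

```python
def observed_over_expected(observed: int, expected: float) -> float:
    """Constraint metric: fraction of the expected number of variants
    (from a mutational model) actually observed in the reference dataset."""
    return observed / expected if expected else float("nan")

# Missense numbers quoted in the talk for NSD1: expect ~1,500, observe ~1,000.
print(f"missense o/e ≈ {observed_over_expected(1000, 1500):.2f}")   # ~0.67
# The talk quotes a LoF o/e of 0.04 for NSD1; these raw counts are made up
# purely to reproduce that ratio.
print(f"LoF      o/e ≈ {observed_over_expected(2, 50):.2f}")        # 0.04
```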

The NSD1 gene causes Sotos syndrome – it’s an intellectual disability gene with a well-described haploinsufficiency syndrome – and its observed over expected for loss-of-function variation is 0.04. We can also look at missense variation: we expect about 1,500 missense variants and see about 1,000, so we observe roughly 70 percent of expectation. I mention that because I’m going to go into it in a little more detail.

So, again, this is basically a constrained gene, but I want to give some important notes on constraint. It’s for dominant disease genes; we don’t expect to see a strong constraint signal for recessive disease genes. Occasionally there’s some signal. You don’t have to have a super strong phenotype; you just have to have some degree of negative selection, but that selection has to occur before the age of fertility. So, for example, BRCA1 is a haploinsufficient risk gene for autosomal dominant breast and ovarian cancer, and we know that this is a true disease association – I’m not questioning that at all – but you don’t see that it’s constrained for loss-of-function variation, because much of the breast cancer onset is post-fertility.

So, I’m going to again recommend Kaitlin’s talk to go through most of this, but all of this data is here on the browser. There are two other features you can find on the browser: you can look at all the different transcripts, and you can see the gene expression, which is over here. The other thing you can look at is regional missense constraint. Right now it’s only shown for the ExAC data, but you can see regions of a gene that are depleted for missense variants. I’m just going to skip through all of that – sorry, guys.

DECIPHER is another site that represents this in a slightly different way and shows you the protein domains; I just wanted to mention it because I like it.

Okay, so on the gnomAD site we have a frequently asked questions page, and I recommend checking that out for a lot of information. We have a contact address, and a GitHub where you can file browser issues, and on each variant page you can also report a variant if you think there’s a mistake or something going on with that variant. So, I hope I have convinced you today that reference population databases are very powerful, that it’s really important to review the read data and coverage, and that frequency filtering is a great approach for more robust filtering in sequencing analysis. I briefly talked about constraint, and I’m sorry I didn’t have time to go through that in more detail. And I just wanted to mention that we’re excited: we expect that this year we’re going to have a larger, whole-genome gnomAD v3 coming out, and there’s also a structural variant reference call set in progress that we hope to have on the browser this year, too. So, more exciting things to come.

And then, I just want to acknowledge Daniel MacArthur, who has now joined us and can also answer your questions. He has led this whole project; it’s a great resource for the community, but it’s brought to you by a large team that works very hard. I also want to call out the whole Data Sciences Platform, all the gnomAD PIs, and the Hail team, who created a data system that let this whole dataset be built and analyzed. Without Hail, we wouldn’t have been able to do any of this, so I suggest you check it out; it’s really great for analysis with this dataset and other large genomic datasets.