5.1 Quality Control (Video Transcript)
Quality Control: Introduction
Title: Quality control
Presenter(s): Katrina Grasby, PhD (The Psychiatric Genetics Group, QIMR Berghofer Medical Research Institute)
Coauthor(s): Lucía Colodro Conde, PhD (The Psychiatric Genetics Group, QIMR Berghofer Medical Research Institute)
Katrina Grasby:
Thanks for joining me for this session on Quality Control. In this recording I’m going to be talking about the quality control, or QC, steps that we apply to genetic data. So this is in the very early stages of a study. We’ve collected our DNA, it’s been transformed into data, we’re going to clean that data up, then we will impute, and then we can do our statistical analyses. So there are many points in a study where we’ll be applying QC, but the steps that we’ll be discussing here and in the tutorial are the quality control steps that we apply to our genetic data.
Why do quality control?
So why do quality control? Essentially, poor quality data is going to contribute to false positives and false negatives in our results, and we want robust results, so we’re going to need to clean our data up. So we’ll essentially be removing genotyping errors. These can be errors in the calling of genotypes, or in the translation of DNA into data, and they can be due to lots of different factors. One of the pictures that I like to bring to my own mind was a story given to me by a woman I work with, who was involved in a project where they posted out two spit kits to a couple who were participating. Somewhere in that delivery one of the kits went missing or was damaged. The couple, trying to be helpful, both spat into the same kit and posted it back to her. They also included a letter to say what they had done, but it was a classic example of DNA contamination. It’s an example of human error: as a result, we ended up with no usable data from two people instead of usable data from one person. There is no way that we can disentangle the DNA in that spit kit and say this belongs to that person and this belongs to the other. It’s also an example of contaminated DNA, and even if they had not included a letter to say what they had done, the steps that we will go through in the tutorial would be able to identify a problem like this. So we can actually go, OK, this isn’t clearly data from one specific person; we can remove it so it doesn’t interfere with our analyses.
So one of the other things we’ll be doing in the tutorial, after we’ve cleaned up our data, is to have a look at the relationship structure within our data. Whilst that’s not necessarily a quality control step, it is a necessary aspect of coming to understand our data so that we can apply appropriate analyses, and that is going to be important for minimizing our false positives and false negatives. So how do we go from DNA to data?
DNA to data
I’m a behavior geneticist. I use statistics to analyze data. I have no experience working in a laboratory, actually processing the DNA into data. But it is still useful for me to have an idea of the many different steps that are involved, an appreciation of the possible sources of error, and a sense of what exactly my data represents. So we are able to post out spit kits to participants, who can spit into the kit at home and post it back. The sample is then processed so that the DNA is fragmented, chopped up into little pieces, and then amplified, so we’ve got more of it. The DNA is then extracted; we can store some and plate some onto SNP chips or genotyping arrays for further analysis. So this down the bottom here: these images come from the Illumina website. This here at A is an example of a SNP chip or genotyping array. There are many different forms of SNP chips; the technology has improved over time and I’m sure it will continue to improve. This is an example of the bead technology. On this particular chip, there is space for DNA from 12 different individuals; these horizontal bars are each for a different individual. Now, looking at this SNP chip, you’ve got information from multiple individuals, you’ll have many chips, and they might be sent off to a genotyping company for processing in different batches. If you are thinking from an experimental point of view, and you’ve got cases and controls, you want to have your cases and controls randomly allocated to both the chips and the batch runs that they’re being processed under. In a similar way, if you have males and females you want to randomize them across your chips and also your batch runs. That way we can ensure that we can actually pick up any particular batch effects once we’ve got our data at the end.
So back to the chip. For each of these individuals, there will be hundreds of thousands of probes in order to test the alleles at hundreds of thousands of points in the genome. Many, many, many loci. So each of these wells has a bead. This here is a schematic of a bead. This bead is targeting an allele at a locus. It has a particular sequence here, an address, which is the order of bases, and then this here is the locus of interest. So once you’ve got your fragmented DNA, it’s going to come along, and if it’s from the right location in the genome, it will bind to this bead. Then, if this allele here bonds to this C, so G will bond to C, this bead will fluoresce green. A different bead it might bond to: if there’s an A at that location and a T here, then it will bond this way and fluoresce red. So this is how we establish the allele at that locus. You might have a G or you might have a T; if it’s a G it’s going to bond to the C and fluoresce green, and if it’s a T it will bond to the A and fluoresce red. So this is translating the DNA into a color, called an intensity. So you’ve got DNA coming from your biological father and from your biological mother, so you’ve got two alleles at that locus. If your two alleles are the same, say two G alleles, they’re both going to fluoresce green: a nice solid green color. If you’ve got two T alleles, they’re both going to bond to these A beads and be a nice solid red color. If you’ve got a G coming from one parent and a T coming from the other, then some of the beads that are C are going to fluoresce green, some of the beads that are A are going to fluoresce red, and that person is heterozygous and will have this yellow color. So these colors are representing the three possible genotypes at that locus in the genome, and these are for hundreds of thousands of different loci. What I’ve got in this particular slide are examples of genotyping intensities. This is how we look at the color clusters representing the different genotypes and see whether or not there are any problems. Now, this is typically done by a genotyping company; you will probably not be doing this. But they will give you information about these first steps of quality control at this stage so you know what’s going on with your data. It’ll be there in a report from the company.
This top left-hand corner is a really good example of what we’re looking for. We have three nice, separated clusters: this is the group homozygous for the A allele, this is a heterozygous group of individuals, and this is the group homozygous for the other allele. In these two examples, the little black Xs represent missing data. Missing data may not be terribly problematic if there’s just a little bit of missingness and it’s spread across all the different genotypes. However, if it is biased towards one allele or one genotype, then that’s going to interfere with the allele frequencies in our sample, which means it won’t be representative of the population, and it won’t be representative in terms of how we test this genotype against a phenotype. We don’t want biased information about allele calling or genotype calling. Down here in the bottom left-hand corner we have an example of a very rare allele. Sorry, a rare genotype, or rather it is a rare allele as well as a rare genotype. There’s only one individual here who’s homozygous for the A allele, and very few heterozygotes. In the middle down the bottom, this would be an example of a group that is monomorphic at this locus, so it really isn’t a useful locus for us to have genotyped. Or it could be that there is simply no variation in this population at this locus. And in the bottom right-hand corner we have an example where there’s really been a failure to call the genotypes correctly. There is no indication of any red color, which would represent the heterozygous group of people. We’ve got these two kind of green clusters and the missingness is all off on this cluster; it’s a complete fail.
Checking the data
So the steps that we’re going to be going through in our quality control tutorial: we’re going to start off by checking the data. We’re going to have a look at the file format. How is the data coded? How is missingness coded? We’re going to look at the build, so that we know what assembly our data is on. The genotyping company would have provided us with that, but you might not always have access to that information, so there are ways that we can check that ourselves. This is a very useful resource, which we will use in the tutorial to do that. Knowing what build your data is on is very important, particularly for meta-analysis, but also if you’re going to do any follow-up work with your results. We’ll be doing a sex check, which is to check that the sex we can infer from the genetic information matches the sex reported by the individuals. This check looks at the heterozygosity of the X chromosome, and we have different expectations depending on whether an individual has one or two X chromosomes. So if the sex an individual reports doesn’t match the genetic information that comes back, and that happens for a lot of your sample, then you might have a problem with matching the genetic information returned after genotyping to your participant IDs. Bear in mind this is about biological sex and not about gender.
Genotyping call rate
We will be checking for missingness. There are two types of missingness that we’ll check for. One is this one, the genotyping call rate. This is where SNPs are missing information on individuals. For each SNP we want to have information coming from most of our individuals. If there is too much missing data for that SNP, meaning too many individuals did not have a genotype called correctly for that SNP, then that SNP might not be a good SNP for us to be using in our analyses.
Hardy Weinberg equilibrium
We will have a look at the Hardy-Weinberg equilibrium, to see whether or not our allele frequencies match what we expect. This can highlight whether we’ve got some bias in terms of the frequency of alleles, or perhaps in terms of calling genotypes appropriately, thinking back to those genotyping intensities. We’ll also be checking the minor allele frequency. This is to make sure that we have enough information to do statistical analyses. If an allele is too rare, then a GWAS is perhaps not the appropriate tool to use for that particular locus.
Sample Call Rate
We’ll be having a look at the sample call rate. This is another form of missingness. This is to ask: do all of our individuals have information across almost all of their SNPs? We don’t want individuals to be missing too much information across many SNPs.
Heterozygosity
We’ll be looking at the proportion of heterozygosity. This is a way of checking -- think back to that sample where we had two people spitting into the same kit. That’s going to give us too much variation; there will be way too much variation in that DNA sample, so heterozygosity would be excessive. Conversely, reduced heterozygosity could be an indication of inbreeding, but it could also just be that we had lots of missing data.
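To make the last few checks concrete (genotyping call rate, minor allele frequency, sample call rate, and heterozygosity), here is a minimal Python sketch on a toy genotype matrix. The 0/1/2 coding, the invented numbers, and the use of NumPy are illustrative assumptions only; in practice a tool such as PLINK computes these summaries for you.

```python
# Minimal sketch of the per-SNP and per-sample checks described above,
# assuming genotypes coded as 0/1/2 copies of the minor allele and
# np.nan marking missing calls. Toy data, purely illustrative.
import numpy as np

# rows = individuals, columns = SNPs
geno = np.array([
    [0, 1, 2, np.nan, 1],
    [1, 1, 2, 0,      1],
    [0, 2, np.nan, 0, 2],
    [1, 1, 2, 0,      np.nan],
], dtype=float)

called = ~np.isnan(geno)

# Genotyping (SNP) call rate: fraction of individuals called at each SNP
snp_call_rate = called.mean(axis=0)

# Sample call rate: fraction of SNPs called for each individual
sample_call_rate = called.mean(axis=1)

# Minor allele frequency per SNP (ignoring missing calls)
allele_freq = np.nanmean(geno, axis=0) / 2.0
maf = np.minimum(allele_freq, 1.0 - allele_freq)

# Proportion of heterozygous calls per individual
het_rate = (geno == 1).sum(axis=1) / called.sum(axis=1)

print("SNP call rate:   ", snp_call_rate)
print("Sample call rate:", sample_call_rate)
print("MAF:             ", maf)
print("Heterozygosity:  ", het_rate)
```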
Reduced Heterozygosity
So that’s one of the reasons we’re going to check our missingness first, before we do our heterozygosity check. We don’t want to set ourselves up to potentially make inferences that have negative social consequences. If you’ve got missing data and that’s the reason you have reduced heterozygosity, you don’t want to end up looking at your sample and going, “oh, there’s lots of inbreeding here”.
Relationship structure
Towards the end of the tutorial, after we’ve cleaned our data, we’ll then have a look at the relationship structure in our data. We might have lots of families or we might have extended families. We want to know whether or not our individuals are related so that we can apply the right type of statistical analyses.
Population structure
And finally, we’ll be having a look at population structure or stratification. That will be talked about more in another one of the sessions, but this is when we have a look at allele frequencies. There are differences in allele frequencies across different groups or different populations, and that is an important thing for us to be aware of and to account for appropriately in our analysis. Otherwise, we’re going to get false positives and false negatives. If your population structure is also correlated in some way with your outcome of interest, that’s where we’re going to get a problem, and that’s when we talk about it in terms of population stratification.
So this is going to be our checklist of the key QC steps that we’ll be running through in the tutorial.
Running Quality Control on Genotype Data
Title: How to run Quality Control on Genome-Wide Genotyping Data
Presenter(s): Jonathan Coleman, PhD (Social, Genetic, and Developmental Psychiatry Centre, King’s College London)
Jonathan Coleman:
Hello, I’m Joni Coleman and in this brief presentation I’m going to discuss some key points concerned with running quality control on genome-wide genotype data which is a common first step in running a GWAS.
Overview
I’m going to provide a theoretical overview, addressing the overarching reasons why we need to do QC, highlighting some common steps, and discussing a few pitfalls the data might throw up.
I’m not going to talk about conducting imputation, or GWAS analyses, or secondary analyses. Nor am I going to talk at great length about the process of genotyping and ensuring the quality of genotyping calls. I’ll similarly not go into any deep code or maths; however, if you are starting to run your own QC and analyses, I recommend the PGC’s RICOPILI automated pipeline as a starting point. There are also some simple scripts on my group’s GitHub that may be useful as well. They follow a step-by-step process with code and explanations. We’re currently updating this repository, so look out for some video tutorials there as well.
The beginning: genome-wide genotypes
So here is our starting point. I’ll be using this graph on the top right several times through this talk, and this is a genotype calling graph with common homozygotes in blue, heterozygotes in green, and rare homozygotes in red. Hopefully your data will already have been put through an automated genotype calling pipeline, and if you’re really lucky, an overworked and under-appreciated bioinformatician might have done some manual recalling to ensure the quality of the data is as high as possible.
But in point of fact the data you will be using won’t be in this visual form but rather a numeric matrix like the one below, with SNPs and individuals. This might be in the form of a PLINK genotype file or its binary equivalent, or in some similar form that can be converted to the PLINK format.
The desired endpoint: clean, complete data
Where we want to go is clean data with variants that are called in the majority of participants in your study, and won’t cause biases in downstream analyses.
That should give a nice clean Manhattan plot from a GWAS, like the one below, rather than the starry-night effect of the poorly QC’d Manhattan plot above. However, something I’d like to emphasize across this talk is that QC is a data-informed process, and what works for one cohort won’t necessarily be exactly right for another. Good QC requires the analyst to investigate and understand the data.
“Rare” variants
Often the first step is to remove rare variants, and this is because we cannot be certain of their calls. Consider the variants in the circle on the right. Are these outlying common homozygotes or are they heterozygotes? We cannot really tell, because there aren’t enough of them to form a recognizable cluster. Typically, we might want to exclude variants with a low minor allele count, for example five. There are many excellent automated calling methods to increase the amount of certainty you have in these variants, but it’s also worth noting that many analytical methods don’t deal well with rare variants anyway.
Again, the demands of your data determine your QC choices. It may be more useful for you to call rare variants even if you’re uncertain of them. Or you may wish to remove them and be absolutely certain of the variants that you retain.
Data missingness
Next we need to think about missing data. Genotyping is a biochemical process, and like all such processes it goes wrong in some cases, and a call cannot be made. This can be a failure of the genotyping probe, poor quality DNA, or a host of other reasons, but such calls are unreliable and they need to be removed.
Missingness
Missingness is best dealt with iteratively. To convince you of that, let’s examine this example data. We want to keep only the participants (the rows in this example) with complete or near-complete data on the eight variants we’re examining (shown here in the columns). So, we could remove everyone with data on fewer than seven SNPs, but when we do that -- oh dear, we’ve obliterated our sample size.
Iterative Missingness
So instead let’s do things iteratively. We’ll remove the worst SNP, so variant seven goes, then we remove the worst participant, bye bye Dave, then we remove the next worst SNP, which is SNP two, and now everyone has near-complete data and we’ve retained nearly all of our cohort. So that was obviously a simple example; how does this look with real data?
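A minimal sketch of this iterative idea in Python, assuming a genotype matrix coded 0/1/2 with np.nan for missing calls. The particular sequence of thresholds is an illustrative assumption; the talk walks from 10% down to 1%, and you should tune it to your own data.

```python
# Iterative missingness filtering: at each threshold, first drop the
# worst SNPs, then the worst samples, before tightening the threshold.
import numpy as np

def iterative_missingness(geno, thresholds=(0.10, 0.05, 0.02, 0.01)):
    """Return boolean masks of samples and SNPs retained after iterative filtering."""
    keep_samples = np.ones(geno.shape[0], dtype=bool)
    keep_snps = np.ones(geno.shape[1], dtype=bool)
    for thr in thresholds:
        # drop SNPs missing in more than `thr` of the currently retained samples
        sub = geno[np.ix_(keep_samples, keep_snps)]
        snp_ok = np.isnan(sub).mean(axis=0) <= thr
        keep_snps[np.where(keep_snps)[0][~snp_ok]] = False
        # then drop samples missing more than `thr` of the currently retained SNPs
        sub = geno[np.ix_(keep_samples, keep_snps)]
        sample_ok = np.isnan(sub).mean(axis=1) <= thr
        keep_samples[np.where(keep_samples)[0][~sample_ok]] = False
    return keep_samples, keep_snps
```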
Real data missingness
So here we have some real data, and it’s pretty good data: most variants are only missing in a small percentage of the cohort, but there are some that are missing in as much as 10% of the cohort. So let’s do that iterative thing, removing variants missing in 10% of individuals, then individuals missing more than 10% of variants, then 9%, and so on down to one percent. When we do this the data looks good. Nearly all of the variants have zero percent missingness, and those that don’t are present in at least 578 of the 582 possible participants, and we’ve lost around 25 participants and about 22 and a half thousand SNPs. But what if we didn’t do the iterative thing and just went straight for 99% complete data?
So when we do that, the distribution of variants looks good again, arguably even better, and we’ve retained an additional 16,000 variants, but we’ve lost another 40 participants, which is about six percent more of the original total than we lost with the iterative method. Typically, participants are more valuable than variants, which can be regained through imputation anyway, but this again is a data-driven decision. If coverage is more important than cohort size in your case, you might want to prioritize well-genotyped variants over individuals.
Hardy-Weinberg equilibrium
So we’ve addressed rare variants, where genotyping is uncertain, and missingness, where the data is unreliable. But sometimes calling is simply wrong, and again there are many reasons that could be. We can identify some of these implausible genotype calls by using some simple population genetic theory. From our observed genotypes we can calculate the allele frequency at any biallelic SNP we’ve called. So here the frequency of the A allele comes from twice the count of the AA calls (those are our common homozygotes in blue) plus the count of the AB calls (our heterozygotes in green), divided by twice the number of individuals, and we can do the equivalent, as you see on the slide, for the frequency of the B allele.
Knowing the frequency of the A and the B allele, we can use Hardy and Weinberg’s calculation for how we expect alleles at a given frequency to be distributed into genotypes, generating an expectation for the genotypes we should observe at that allele frequency. We can then compare how our observed genotypes, i.e. the blue, green, and red clusters, fit to that expectation, and we can test that using a chi-squared test.
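A minimal sketch of that calculation for a single biallelic SNP, with invented genotype counts: estimate the allele frequency from the observed counts, form the Hardy-Weinberg expected counts, and run a one-degree-of-freedom chi-squared test with SciPy.

```python
# Hardy-Weinberg check for one SNP from observed genotype counts.
import numpy as np
from scipy.stats import chi2

def hwe_test(n_AA, n_AB, n_BB):
    n = n_AA + n_AB + n_BB
    p = (2 * n_AA + n_AB) / (2 * n)            # frequency of the A allele
    q = 1.0 - p                                # frequency of the B allele
    expected = np.array([p * p * n, 2 * p * q * n, q * q * n])
    observed = np.array([n_AA, n_AB, n_BB], dtype=float)
    chi_sq = ((observed - expected) ** 2 / expected).sum()
    p_value = chi2.sf(chi_sq, df=1)            # 3 genotype classes - 1 - 1 estimated frequency
    return chi_sq, p_value

# A SNP that fits Hardy-Weinberg exactly vs. one with far too few heterozygotes
print(hwe_test(490, 420, 90))
print(hwe_test(600, 100, 300))
```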
Now Hardy-Weinberg equilibrium is an idealized mathematical abstraction, so there are lots of plausible ways it can be broken, most notably by evolutionary pressure. As a result, in case control data it’s typically best to assess it just in controls, or to be less strict with defining violations of Hardy-Weinberg in cases. That said, in my experience genotyping errors can produce very large violations of Hardy-Weinberg, so if you exclude the strongest violations you tend to be removing the biggest genotyping errors.
Sex mislabelling
The previous steps are mostly focused on problematic variants, but samples can also be erroneous. One example is the potential for sample swaps, either through sample mislabeling in the lab or through incorrectly entered phenotypic data.
These are often quite hard to detect, but one way to detect at least some of them is to compare self-reported sex with X chromosome homozygosity, which is expected to differ between males and females. In particular, males have one X chromosome - they’re what’s known as hemizygous - so when you genotype them they appear to be homozygous at all SNPs on the X chromosome. Females, on the other hand, have two X chromosomes, so they can be heterozygous on the X, and their homozygosity estimates follow a roughly normal distribution centered around zero, which is the sample mean in this case. You could also look at Y chromosome SNPs for the same reason; however, Y chromosome genotyping tends to be a bit sparse and is often not of fantastic quality, so there are benefits to using both of these methods. It’s also worth noting that potential errors here are just that - potential. Where possible it’s useful to confirm them with further information. For example, if there isn’t a distinction between self-reported sex and self-reported gender in your phenotype data, then known transgender individuals may be being removed unnecessarily. The aim here is to determine places where the phenotypic and genotypic data are discordant, as these may indicate a sample swap, which would mean the genotype-to-phenotype relationship has been broken and that data is no longer useful to you.
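A minimal sketch of this discordance check, assuming you already have an X-chromosome homozygosity estimate per participant (the F statistic that PLINK’s --check-sex reports) plus self-reported sex. The 0.2/0.8 cut-offs mirror PLINK’s defaults but are assumptions to revisit for your own data; the sample table is invented.

```python
# Flag participants whose X-homozygosity-inferred sex disagrees with reported sex.
import pandas as pd

samples = pd.DataFrame({
    "IID": ["S1", "S2", "S3", "S4"],
    "reported_sex": ["F", "M", "F", "M"],       # from the phenotype file
    "x_homozygosity_F": [0.02, 0.99, 0.97, 0.15],
})

def inferred_sex(f):
    if f < 0.2:
        return "F"       # substantial X heterozygosity -> two X chromosomes
    if f > 0.8:
        return "M"       # effectively homozygous across the X -> one X chromosome
    return "UNKNOWN"     # ambiguous; worth manual follow-up

samples["inferred_sex"] = samples["x_homozygosity_F"].apply(inferred_sex)
discordant = samples[samples["inferred_sex"] != samples["reported_sex"]]
print(discordant)        # candidates for a sample swap, to be confirmed, not auto-removed
```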
Homozygosity and the inbreeding coefficient
Average variant homozygosity can also be examined across the genome, where this metric is sometimes referred to as the inbreeding coefficient. It’s called that because high values of it can be caused by consanguinity - related individuals having children together - which increases the average homozygosity of the genome. There can also be other violations of expected homozygosity, so it’s worth examining the distribution of values and investigating or excluding any outliers that you see.
Relatedness
Examining genetic data also gives us the opportunity to assess the degree of relatedness between samples. For example, identical sets of variants imply duplicates or identical twins. 50% sharing implies a parent-offspring relationship or siblings, and those two can be separated by examining how often both alleles of a variant are shared. Specifically, we would expect parents and offspring to always share exactly one allele at each variant, whereas siblings may share no alleles, one allele, or two alleles. Lower amounts of sharing imply uncles and aunts, cousins, grandparents, and so on down to more and more distant relationships. In some approaches to analysis, individuals are assumed to be unrelated, so the advice used to be to remove one member of each pair of related individuals.
However, mixed linear models have become more popular in GWAS, and because mixed linear models are able to retain and include related individuals in analyses, related individuals should therefore be retained if the exact analysis method isn’t known. Again, it’s worth having some phenotypic knowledge here. Unexpected relatives are a potential sign of sample switches and need to be examined, confirmed, and potentially removed if they are truly unexpected. And once again it’s important to know your sample: the data shown in this graph does not, despite what the graph appears to suggest, come from a sample with a vast number of cousins; instead it comes from one in which a minority of individuals were from a different ancestry, and that biases this metric. I’ll talk a little more about that in just a moment.
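A hedged sketch of how pairwise relatedness is often inspected in practice: run PLINK’s --genome IBD estimation and bin the PI_HAT column it reports. The file names ("mydata", "related") are placeholders, and the interpretation bands are rough rules of thumb rather than hard rules; the Z1/Z2 columns are what lets you separate parent-offspring pairs from siblings, as described above.

```python
# Requires plink (v1.9) on the PATH and a binary fileset mydata.{bed,bim,fam}.
import subprocess
import pandas as pd

subprocess.run(
    ["plink", "--bfile", "mydata", "--genome", "--min", "0.05", "--out", "related"],
    check=True,
)

ibd = pd.read_csv("related.genome", sep=r"\s+")

def label(pi_hat):
    if pi_hat > 0.9:
        return "duplicate / MZ twin"
    if pi_hat > 0.35:
        return "parent-offspring or full sibling"   # check Z1 (~1) vs Z0/Z2 to separate
    if pi_hat > 0.15:
        return "second-degree relative"
    return "more distant"

ibd["relationship"] = ibd["PI_HAT"].apply(label)
print(ibd[["IID1", "IID2", "Z0", "Z1", "Z2", "PI_HAT", "relationship"]].head())
```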
Average relatedness
Relatedness can also be useful for detecting sample contamination. Contamination will result in a mixture of different DNAs being treated as a single sample, and this results in an overabundance of heterozygote calls. This in turn creates a signature pattern of low-level relatedness between the contaminated sample and many other members of the cohort. These samples should be queried with the genotyping lab to confirm whether or not a contamination event has occurred, and potentially be removed if an alternative explanation for this odd pattern of inter-sample relatedness can’t be found.
Population structure
Finally, a word on genetic ancestry. Because of the way in which we have migrated across our history, there is a correlation between the geography of human populations and their genetics. This can be detected by running principal component analysis on genotype data pruned for linkage disequilibrium. For example, this is the UK Biobank data: you can see subsets of individuals who cluster together and share European ethnicities, other subsets who share African ethnicities, and subsets who share different Asian ethnicities, and in a more diverse cohort you would be able to see other groupings as well. This kind of 2D plot isn’t the best way of visualizing this; for example, here it isn’t really possible to distinguish the South Asian and admixed American groupings, and you don’t get the full sense of the dominance of European-ancestry data in this cohort. The Europeans in this case account for around 95% of the full cohort, but because of overplotting, i.e. the same values being plotted on top of each other in this 2D plot, you don’t really appreciate that. Looking across multiple principal components helps with that.
Ancestry is important to QC. Many of the processes I’ve talked about rely on the group being assessed being fairly homogeneous. As such, if your data is multi-ancestry it’s best to separate those ancestries out and re-run QC in each group separately.
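A hedged sketch of the LD-pruning-plus-PCA approach described above, driven through PLINK. The file names are placeholders, and the pruning parameters (window 200, step 50, r² 0.2) are common choices rather than a prescription.

```python
# LD-prune the genotypes, then compute principal components on the pruned set.
import subprocess

# 1. Identify an LD-pruned variant set
subprocess.run(
    ["plink", "--bfile", "mydata",
     "--indep-pairwise", "200", "50", "0.2",
     "--out", "pruned"],
    check=True,
)

# 2. Compute the top 10 principal components on those variants
subprocess.run(
    ["plink", "--bfile", "mydata",
     "--extract", "pruned.prune.in",
     "--pca", "10",
     "--out", "pcs"],
    check=True,
)

# pcs.eigenvec can then be plotted (PC1 vs PC2, PC3 vs PC4, ...) to look
# for clusters corresponding to different genetic ancestries.
```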
Take-aways
So that was a brief run-through of some of the key things to think about when running QC.
I hope I’ve got across the need to treat this as a data informed process, and to be willing to re-run steps, and adjust approaches to fit cohorts. Although we’ve got something resembling standard practice in genotype QC, I think there are still some unresolved questions. So get hold of some data, look online for guides and automated pipelines, and enjoy your QC.
Thank you very much for listening. I’m doing a Q&A at 9:30 EST; otherwise please feel free to throw questions at me on Twitter, where I live, or at the email address on screen, which I occasionally check. Thank you very much.
Considerations for Genotyping QC
Title: Considerations for genotyping, quality control, and imputation in GWAS
Author: Ayşe Demirkan, PhD (School of Biosciences, Surrey Institute for People-Centred Artificial Intelligence, University of Surrey)
Ayşe Demirkan:
Hello everyone, my name is Ayşe Demirkan. I’m affiliated with the University of Groningen from the Netherlands and the University of Surrey from the UK. This is a pre-recorded lecture in the second session of on-demand sessions: “Introduction to the Statistical Analysis of Genome-Wide Association Studies.” I will be talking about considerations for genotyping quality control and imputation in genomic association studies.
So here you see an overview of the lecture. We will shortly go over genotyping platforms and options for quality control. Then, I will talk about the definition and purpose of imputation and how it is done. This is going to include reference data, tools, analysis of imputed data, and assessing imputation accuracy.
Genotyping and Platforms
What we call genotyping is the process of determining differences in the genetic makeup (genotype) of an individual by examining the individual’s DNA sequence. Of course, the technology used for genotyping depends on the structural properties of the genetic variation: whether it is a single nucleotide polymorphism, a copy number variation, or another structural variation. It also depends on the project rationale or scientific question and, mainly, your budget. Related to that, of course, is how many SNPs you want to genotype, if it is a genome-wide association study, and the number of individuals you would like to include. Depending on your study design, you will also be limited by your DNA sample quality and quantity.
So here on this slide, you see the most common approaches used for genotyping SNPs.
Common Approaches
Depending on your study, you will most likely be using one of these. What are they? On the y-axis, you see the number of SNPs that are easily captured by each approach, and on the x-axis, you see the number of individuals. So what do we have? We have PCR, RFLP, sequencing, pyrosequencing, Fluidigm platforms, and TaqMan, as well as Illumina microarrays. For instance, one of the best examples is the Illumina arrays for whole-genome scans. Whole-genome genotyping by these arrays provides an overview of the entire genome and enables novel discoveries and associations. So, using high-throughput next-generation sequencing and microarray technologies, you can obtain a deeper understanding of the genome because you are covering a very wide proportion of it.
So you can use one of a selection of Illumina or Affymetrix arrays, whichever you think may be suitable for your study. There are many options, and for Illumina’s array platforms there are genome-wide genotyping arrays for 18 species at the moment. The number of markers on each array varies by product; for humans, up to four million markers per sample are possible. Then, there is the Infinium low-cost screening array. This one, for example, includes 600,000 markers, and you can start from 200 nanograms of genomic DNA. What you can also do is add custom marker panels; there is an add-on capacity of up to 50k markers. And then there is the Omni family of Illumina arrays.
Omni Family of Illumina Arrays
Here you see a simple description of their coverage and the inclusion of genetic markers in relation to their minor allele frequencies. The arrays on the left include only common variation, with minor allele frequencies higher than five percent. Some include CNVs (copy number variations), and some include SNPs with lower minor allele frequencies. Which one to choose among these will depend on your research question and the population you want to screen. For instance, are you looking for rare or common variation in terms of SNPs? Are you looking for CNVs? Are you working with a rare or common disease, and what is your sample size and budget? Now, I’ve listed some websites here. Please take 10-20 minutes to check on the technologies mentioned in this first section using these websites.
Quality Control (QC) of Genotyping: From Machine to Dataset - Genotype Calling
Now let’s talk about genotyping quality control (QC). You’ve designed your study, chosen a proper array platform or service, perhaps one from your institute. One critical initial step between the chemically induced intensity signals and data analysis is the transfer of the determined genotypes to your computer and their QC. This critical step is called genotype calling. Genotype calling algorithms are always implemented in the proprietary software accompanying the genotyping platform you choose, so you don’t need to invent them yourself. Typically, calling software uses a mathematical clustering algorithm to analyze the raw intensity data and estimate the probability that the genotype is AA, AB, or BB for a given individual at a given biallelic marker locus. One method of checking initial SNP quality is visually inspecting the intensity clustering of a particular SNP in the overall population. Based on this, one can decide whether a SNP is characterized by a clear signal or not.
Here on the left of these figures, you see a clear intensity clustering of a SNP in the population. The common homozygotes are depicted in red, the heterozygotes in purple in the middle, and the individuals homozygous for the less common allele in blue. The table on the right shows the raw values that this plot is drawn from. So, the plot shows a tight clustering of genotypes, and there is not much noise in the measurement of this SNP. Following that, there are important data QC steps. One of them is to work on replicates: including the same DNA sample on different experimental batches is a good thing to do for inspecting plating issues, by looking at genotype concordance. Then, there are Mendelian errors to control for, for instance transmission inconsistencies. For example, SNPs with more than a 10 percent Mendelian error rate can be excluded; this would be based on the number of trios that you include in your experiment.
Unfortunately, this option is obviously only available for family-based and trio designs. Another QC measure we use is the SNP call rate. This is basically 1 minus the missing genotype rate per SNP. It can depend on the quality of the DNA samples, and the threshold is generally set between 95 percent and 99 percent; this is a very standard thing to include in your QC. Another is the Hardy-Weinberg equilibrium deviance of your SNPs. This is another method for checking the quality of your SNPs and excluding some of them, and it will be explained on the next slide. Another one is the sample call rate, a sample-based QC measure. This is a good indication of sample success. Different platforms have different thresholds, but it will mainly be determined by your initial DNA quality and will be related to the SNP call rate. Once you apply the SNP call rate filter, you can apply the sample call rate filter, and you may then want to repeat the SNP call rate filter depending on that.
Another thing to do is a sample gender check. For this quality measure, you need X chromosome information, and you may want to add this as an additional sanity check to make sure that there is a perfect overlap with your phenotype files in terms of sex.
Another important one is sample heterozygosity. This is to check for outliers; for example, samples with more heterozygosity than expected can be an indication of contamination. And on top of all of that, you need to check samples for cryptic relatedness and unexpected twinning, and whether there is actually relatedness and structure in the data. But this will be covered more in the lecture by Reedik Mägi.
Hardy-Weinberg Equilibrium
As the occurrence of the two alleles of a SNP in the same individual are two independent events, the distribution of the genotypes across individuals should be more or less in equilibrium with the frequencies of the alleles of a biallelic SNP. This is only true under ideal conditions, of course, which would be random mating, no selection, equal survival, no migration, no mutation, no inbreeding, and a large population size.
So, under these conditions, large deviations from Hardy-Weinberg equilibrium are an indication of genotype calling problems, and a commonly used threshold for genotyped variants is a Hardy-Weinberg equilibrium p-value of less than ten to the minus five. Such a value is an indication of a deviation from Hardy-Weinberg equilibrium, and you may want to take a look at these SNPs, or you may want to exclude them from your dataset.
**Another important thing to always consider is genome builds and alignments.**
The characterization of the human genome is an ongoing effort, and a genome build tells us the positions of the SNPs on the genome. The latest build is called build 38, but the most commonly used one at the moment is still build 37. For instance, HapMap was released on build 35 and build 36. So, you need to be aware of issues relating to merging and meta-analyzing data from different genome builds. Also, when preparing your data for imputation, this is very important because you need to make sure that your data is coded according to the same genome build in both the target set and the reference dataset.
**So, there are tools for that; they are called liftover tools. For instance, there is one from Oxford that we use for this purpose, and I provide the link to it here.**
Commonly Used Software for QC: PLINK
So, all of the QC steps I shortly went over here are pretty standard, and there are a couple of widely used tools. One very commonly used tool that we also use for data storage, analysis, and QC is called PLINK. Here on this slide, I made a snapshot of some of the PLINK options that I covered during the lecture; these functions are implemented in the PLINK software, and you can use them for the QC of your genetic data.
**The first thing you need to do to be able to use PLINK is obviously to install PLINK, and then you will need to read your genotype call data into PLINK in the form of MAP and PED files. Then, you can perform QC at the SNP level – remove or extract SNPs – and you can perform QC at the sample level – remove or extract individuals.**
And under the summary statistics option here, there are functions listed to check call rate and missingness, Hardy-Weinberg equilibrium, allele frequencies, and Mendelian errors. You can also perform sex checks. What PLINK can also do is extract genetic principal components and identify cryptically related individuals or twinnings in the data, and hence the genetic structure of the data. You can then use this to determine ethnic outliers in your dataset. I will not talk about this here because it is part of the lecture by Reedik Mägi in the next session.
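A hedged sketch of how these PLINK functions might be strung together from Python. The flags are standard PLINK 1.9 options, but the thresholds (2% SNP missingness, 2% sample missingness, 1% MAF, HWE p < 1e-5) and the "raw"/"clean" file names are placeholder assumptions, not recommendations for any particular dataset.

```python
# A basic PLINK QC pass covering the filters mentioned in this lecture.
import subprocess

def run(args):
    subprocess.run(["plink"] + args, check=True)

# SNP- and sample-level filters applied in one pass, writing a new fileset
run(["--bfile", "raw",
     "--geno", "0.02",      # drop SNPs missing in >2% of samples
     "--mind", "0.02",      # drop samples missing >2% of SNPs
     "--maf", "0.01",       # drop SNPs with minor allele frequency <1%
     "--hwe", "1e-5",       # drop SNPs strongly violating Hardy-Weinberg
     "--make-bed", "--out", "clean"])

# Additional per-sample reports to inspect by hand
run(["--bfile", "clean", "--check-sex", "--out", "sexcheck"])   # X-based sex check
run(["--bfile", "clean", "--het", "--out", "het"])              # heterozygosity / F statistic
run(["--bfile", "clean", "--missing", "--out", "miss"])         # call-rate reports
```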
Intermezzo
Here I have put two websites. One of them is for PLINK and how to use PLINK for QC, and the other one is for BEAGLE, which I mentioned is one of the algorithms you could use for genotype calling. So now, take 10-20 minutes to have a look at these websites and try to grasp what you can do with them.
Genetic data missingness
Now, let’s talk about imputation. Why do we need imputation? We need imputation to address missingness in the genetic data; this is all about missing values in the genetic data. Where do the missing values come from? During QC, we already set some values to missing, right? And during genotype calling, you could also set some data points to missing. But actually, most of the missing values come from the initial targeted coverage of the genotyping chips and platforms we used. Remember that there are many types of arrays - some more dense, some less dense. There are arrays made specifically for oncological studies, like the onco arrays. There is a metabolic chip that is designed especially for metabolic diseases, and there are arrays focused mainly on SNPs with higher minor allele frequency, or on CNVs. But even the dense SNP arrays do not cover all of the genetic variation; they cover much less than you would imagine. In addition, SNPs included in one array may not be included in another, so for many variable positions on the genome we do not have matching information across genotype sets of individuals. For instance, look at what I have tried to depict here: think of three individuals, where the first two are typed on array X and the third one is typed on array Y. Hence, they have different missing data points. When you try to pool their data for a pooled analysis or for a meta-analysis, you are going to have even more missingness because of the non-overlapping positions, and you will not be able to replicate findings from one dataset in the other. Additionally, we will be analyzing only half of the genetic variation, and we may miss causal variants in the analysis. For all these reasons, we use genetic data imputation.
Imputation Principle
So, what do we do in principle? In principle, imputation means estimating the most likely genotypes of an individual at the missing positions by looking at the correlated SNP values from a more complete dataset and, based on that, filling in the missing values in the target dataset. So, how does it work? First of all, we need a dataset where dense genotypes are directly measured. This can be a dense array or a set of sequenced individuals; this we call a reference panel. Then, we use an imputation software or service, and by looking at the correlation structure of the dense genotypes or sequenced SNPs, we estimate them in the target dataset. In the end, these are probabilities, and we end up with dosage information for alleles or genotypes rather than hard genotype calls. This dosage information, which accounts for the uncertainty in the estimation, is then included in the genome-wide association analysis. So, to sum up, the purpose of imputation is to increase power, because the reference panel is obviously more likely to contain the causal variants than a less dense genotyping array; to improve fine mapping, because imputation provides a higher-resolution overview of an association signal across a locus; and to enable meta-analysis, because imputation allows data typed on different arrays to be combined up to the variants in the reference panel.
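A minimal sketch of what "dosage" means here: for each individual and variant, imputation returns probabilities for the three genotypes, and the dosage is the expected count of one allele. The numbers below are invented for illustration.

```python
# Convert genotype probabilities into allele dosages.
import numpy as np

# genotype probabilities for 3 individuals at one variant: P(AA), P(AB), P(BB)
probs = np.array([
    [0.95, 0.05, 0.00],
    [0.10, 0.80, 0.10],
    [0.01, 0.30, 0.69],
])

allele_counts = np.array([0.0, 1.0, 2.0])   # copies of the B allele per genotype
dosage = probs @ allele_counts               # expected allele count per individual
print(dosage)                                # [0.05, 1.00, 1.68]

# A "best guess" genotype would instead take argmax(probs), discarding the
# uncertainty - which is why analysing dosages is generally preferred.
best_guess = probs.argmax(axis=1)
```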
Historical Milestones 2010-2018
Going over the historical milestones in imputation also summarizes the theoretical and technological advancements in human genomic data imputation. One important advance in all of this was the generation of reference panels. The first reference panel was HapMap, and HapMap2 was the most commonly used release of HapMap. It consists of a limited sample of individuals from diverse genetic backgrounds: 60 Yoruba, 90 Han Chinese and Japanese, and 60 individuals who were Utah residents of European ethnic origin. Now, it sounds funny to think that we imputed thousands of people based on the genetic material of only 60 Utah residents, speaking of the Europeans, but this actually yielded a lot of success, and it is what we had for a long time. We could only impute up to around 3 million SNPs with HapMap at the time.
Then came the 1000 Genomes reference panel, which in the end included 2,500 individuals from multiple ethnic groups. Later on, and currently, the most widely used reference panel is the panel of the Haplotype Reference Consortium (HRC). Briefly, this is a combined set of whole-genome and exome sequence data for more than 30,000 individuals, and it yields up to 39 million SNPs after imputation. This will also depend on the scaffold that you use for imputation; many of these SNPs will not be imputed with good quality, but under ideal conditions you can go up to 39 million. And finally, we now have a reference panel from the Trans-Omics for Precision Medicine (TOPMed) program, which consists of almost 100,000 deeply sequenced human genomes and can yield up to 308 million genetic variants. One technical milestone worth mentioning is the prephasing of haplotypes. Genetic imputation is a highly computationally intensive process because of the probabilistic framework and the high rate of missing data we are trying to deal with. One of the major milestones in reducing the computational burden was the introduction of prephasing. This involves a two-step imputation process: an initial step of prephasing, which is actually haplotype estimation of the genotypes, and a subsequent step of imputation into the estimated phased haplotypes. This reduces the complexity of the imputation process and speeds it up. The current versions of all imputation software can deal with the prephasing approach. And what is very important is the choice of the reference panel.
It has been shown that making use of all-ancestries reference panels rather than ethnicity-specific reference panels improves imputation accuracy for rare variants in any population. Formatted reference panels for IMPUTE and Minimac can be downloaded from the software websites. And it is very important to make sure that the genotype scaffold and the reference panel are aligned to the same build of the human genome; I will get back to that later as well.
Another very important and current technological advancement that makes our lives easier is imputation services. These are freely available services such as the Michigan and Sanger imputation services. You can simply format and upload your data in a secure way to the server and get the imputed and phased genotypes back in a few days. How long it takes depends on how busy the server is and on the sample size you are trying to impute, of course.
Historical Milestones - Sanger
So, in parallel to the Michigan imputation server, the Sanger Institute also has a similar service. In this service, you can also upload your data in VCF format and optionally perform prephasing using the BEAGLE or SHAPEIT software. The current reference panels on the Sanger imputation server include HRC, UK10K, and 1000 Genomes. As I said, there is also a server dedicated to TOPMed. This is all very self-explanatory: this is how the Sanger imputation server would like you to prepare the data, and there is a whole set of instructions there for you to follow. The use of these services comes with instructions and manuals, so feel free to make an account and run some test datasets. You will need to format the data as required in the instructions, match the coordinates and reference alleles to the expected genome build, and prepare one file for each chromosome. This is for the Sanger imputation server. Another important thing in terms of imputation is, of course, speed.
Speed: IMPUTE5 (PLOS Genetics)
So, increasing the reference panel size improves accuracy for markers with low minor allele frequencies, but this positive effect increases the computational challenges for imputation methods. Recently, a new imputation software, IMPUTE5, was introduced by the same group. It does memory-efficient imputation by selecting haplotypes using the positional Burrows-Wheeler transform. Using the HRC reference panel, the developers of the software showed that IMPUTE5 is up to 30 times faster than Minimac4 and up to 3 times faster than BEAGLE 5.1, and uses less memory than both of these methods.
Example Framework
So, using all the considerations mentioned up until now, you can build an in-silico framework similar to this one. You can use, for instance, PLINK functions for the first two steps of genetic data QC. Then you can check chip information and strand issues using R software tools, and if needed, you can update your genome build using the liftover tool. You can then prephase using SHAPEIT and, finally, impute in-house or using one of the servers mentioned. Two links to this software are given here. Now take time, probably hours, to explore these three imputation services, and you may also want to produce quality plots per chromosome, varying by minor allele frequency strata and position on the chromosome. For instance, here is an example figure. This imputation quality vs MAF figure shows a typical relationship between minor allele frequency and imputation quality. On the y-axis, you see the imputation accuracy as determined by the imputation quality metric, the r-squared or info score from different software, and on the x-axis, you see the minor allele frequency. You see that accuracy is highest when the minor allele frequency is high, when the allele is more common, and it drops as the minor allele frequency gets lower. You still have some well-imputed SNPs among the rare ones, but most of the low-quality SNPs are going to come from the low minor allele frequency range. So, keep in mind that when you filter by imputation quality, you will be filtering out a lot of rare SNPs as well.
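A hedged sketch of producing that kind of quality-versus-MAF summary, assuming an imputation INFO file with "MAF" and "INFO" columns; the file name is a placeholder and the exact column names differ between servers (Minimac, for example, reports "Rsq"), so adjust to your own output.

```python
# Bin imputed variants by minor allele frequency and summarize imputation quality.
import pandas as pd

info = pd.read_csv("chr22.info.gz", sep="\t")   # placeholder file name

bins = [0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
info["maf_bin"] = pd.cut(info["MAF"], bins=bins)
summary = info.groupby("maf_bin", observed=True)["INFO"].agg(["mean", "count"])
print(summary)   # quality typically rises with MAF, as in the example figure
```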
Factors Affecting Imputation
So, at the genome-wide level, the number of individuals imputed matters. For this reason, we merge scaffold datasets before imputation if we are going to impute more than one: the more, the merrier. The second factor is the reference panel. The choice of reference panel matters because the whole idea is to use the correlation between SNPs, and this correlation may differ from population to population. You want to go for a large multi-ethnic panel if you are not able to go for a large ethnicity-specific panel. Finally, at the SNP level, the lower the minor allele frequency, the lower the quality of the imputation is going to be.
How to Analyze the Imputed Data: Analysis of Imputed Genotypes
So, for each individual, imputation provides a probability distribution of possible genotypes for each untyped variant. These probabilities can be converted into best-guess genotypes, but this is not generally recommended, as it increases false positives and reduces power. Also, if you do use best-guess genotypes, you want to apply strict filtering on them, which will result in more NAs in your dataset. So, it is better to convert the probabilities to expected allele counts and analyze them taking the uncertainty of the imputation into account. That is really important. To do that, you need to match the data formats to the software: not all software accepts all types of data, and you may need to do data conversions. Software such as EPACTS, SNPTEST, and PLINK 2 supports dosage information, and you should check the lecture by Reedik Mägi for the analysis of genome-wide data.
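A minimal sketch of what testing association on dosages rather than best-guess genotypes looks like: an ordinary least-squares regression of a quantitative phenotype on the dosage, with a covariate. All data below are simulated; real analyses would use SNPTEST, PLINK 2, or similar tools, which handle dosage files and the full set of variants directly.

```python
# Association test using dosages (expected allele counts) for one variant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
dosage = rng.uniform(0, 2, size=n)          # expected allele counts from imputation
age = rng.normal(50, 10, size=n)            # an example covariate
phenotype = 0.3 * dosage + 0.02 * age + rng.normal(size=n)

X = sm.add_constant(np.column_stack([dosage, age]))
fit = sm.OLS(phenotype, X).fit()
print(fit.params)      # intercept, dosage effect (beta), age effect
print(fit.pvalues[1])  # p-value for the dosage term
```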
Messages
So, this is the last slide of this lecture. The take-home message is that we are dealing with hypothesis-free approaches here. Unfortunately, it all comes down to the bittersweet money and resources we have. So, you need to think about the best and most cost-effective way of getting genetics done in a large sample size. And the answer is combining a dense genome-scan array with imputation, as the reference panels are free at the moment and the cost of arrays is going down as well. But you really need to think about the in-silico part of doing so, as it also requires staff and computational resources to some level. And what else should you consider? You want to know, depending on your research question of course, whether there is a better array for you, perhaps a targeted array such as a metabolic chip if you are going to conduct your research in a very restricted field. And most importantly, what are the future uses of this data? Obviously, you don’t want to build something that you are going to use for only a couple of years and then finalize the research on it. You ideally want to invest in big data. So, are you going to invest in a population-based cohort or a disease-based cohort? Is it going to be a short-term project, or is it going to be a follow-up study that is likely to build up and extend throughout the years, including new phenotypes? And finally, who do you want to collaborate with? Which consortia? Which diseases?
So, I hope this lecture will be useful for your research and future studies. For people who are interested in a better and more in-depth understanding of imputation: twice a year, we have a GWAS course organized by the University of Surrey in collaboration with Imperial College and the University of Tartu in Estonia. This course includes a hands-on workshop as well as theoretical lectures, where we teach these concepts and matters in more detail. The last one ran from May 10th to July 5th, 2021. For more information, there is an email address you can contact. Thank you very much and have a nice conference for the remaining time.