Chapter 5.2: Imputation (Video Transcript)
Imputation Introduction
Title: Haplotypes and Imputation
Presenter(s): Dr. Gábor Mészáros, PhD (Institute of Livestock Sciences (NUWI), University of Natural Resources and Life Sciences)
Gábor Mészáros:
Hi everyone, welcome back to the Introduction to Genomics lecture series. This time, we will be talking about haplotypes and imputation before we move on to the new material. So, here is the quick summary from the previous lectures.
We talked about SNP markers, which are widely used. There are a number of ways to express these genotypes, but we are always talking about biallelic SNPs. These biallelic SNPs are genotyped with species-specific SNP chips. We talked about how to determine positions on the genome, and we talked about physical maps. We also talked about recombination events, which are of major biological importance and introduce variability into populations.
Haplotypes
This graph is also from the last time, so we have an individual here, and there is a recombination event. The previous “capital A,” “capital B,” and “capital C” haplotype is changing to “capital A,” “lowercase b,” and “lowercase c” haplotype because of this recombination event.
During this lecture, we will look a bit more closely at these haplotypes and also show how to use them or what is the use for them in the context of genotype imputation.
So, when we look at the genotypes, what we actually see is the paternal and maternal chromosomes together, joined during fertilization, as a set of alleles. So, we see certain genotypes at certain loci. For the sake of example, let's say that we have these four individuals, and at four loci, we have these genotypes. What we see here are summaries only.
Now, of course, we can ask the question: what are the actual sets of alleles on each chromosome? For the first individual, it is easy because its genotype consists entirely of homozygotes: basically, "A A," "B B," and "C C." So, we know that both chromosomes actually carry the very same haplotype of "capital A," "capital B," and "capital C."
In the second individual, we have one heterozygote already. So, while here, too, there is actually just one option for how to divide the haplotypes, the actual haplotypes on the two chromosomes are different from each other: one chromosome carries "capital A," "capital B," and "lowercase c," and the other chromosome has "lowercase a," "capital B," and "lowercase c." So basically, one of each of these alleles goes onto one chromosome, and the other onto the other chromosome.
Of course, it becomes more interesting the more heterozygotes we have in our genome, because this actually creates options for how the haplotypes could be distributed. So we have these three loci here, and two of them are heterozygous: loci B and C. If we look at the pairwise combinations of these alleles, then we could arrive at actually two solutions: either this one or this one.
If you combine the alleles in either of these haplotype pairs, you always end up with the same summary genotype. But if you look carefully, the first haplotype pair is different from the second haplotype pair.
Of course, the more heterozygotes we have, the more complicated it gets. With three heterozygotes, for example, we have even more combinations, so I put question marks here. If you want, you can work this one out yourself: just pause the video and try to work out the actual haplotype possibilities in case we have three heterozygous loci. So, what are the haplotype combinations that are possible and that end up with these summary genotypes?
After you’ve done it, you can unpause the video and see if you were right or just continue watching and get the answer in three, two, one… Go!
So, these are the actual possibilities. You see that there are actually four haplotype pairs that are possible based on these three heterozygous genotypes, and each of these haplotype pairs is different from the others. So, basically, how do you solve this? First, you take the first allele from each pair: "capital A," "capital B," "capital C," and then what remains is "lowercase a," "lowercase b," "lowercase c." That would be the first pair. Then you go through the pairwise combinations: you flip one locus at a time until you get all combinations.
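As a small illustration of this counting exercise, here is a minimal sketch in Python (my own illustration, not part of the lecture slides) that lists every haplotype pair consistent with a set of genotypes; with three heterozygous loci it prints the four pairs just described:

```python
# Minimal sketch: enumerate haplotype pairs consistent with a genotype.
# Flipping one heterozygous locus at a time generates all possibilities.
from itertools import product

def haplotype_pairs(genotypes):
    """genotypes: one (allele1, allele2) tuple per locus, e.g. ('A', 'a')."""
    het_loci = [i for i, (x, y) in enumerate(genotypes) if x != y]
    pairs = set()
    # Decide, for each heterozygous locus, which allele goes on haplotype 1;
    # the remaining allele automatically goes on haplotype 2.
    for flips in product([False, True], repeat=len(het_loci)):
        flip_at = dict(zip(het_loci, flips))
        hap1, hap2 = [], []
        for i, (x, y) in enumerate(genotypes):
            if flip_at.get(i, False):
                hap1.append(y); hap2.append(x)
            else:
                hap1.append(x); hap2.append(y)
        pairs.add(frozenset([tuple(hap1), tuple(hap2)]))  # unordered pair
    return pairs

# Three heterozygous loci -> four distinct haplotype pairs
for pair in sorted(sorted(p) for p in haplotype_pairs([('A', 'a'), ('B', 'b'), ('C', 'c')])):
    print(pair)
```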
Phasing
Of course, we have many more than just three heterozygotes on the genome. So, there is a question: how do we solve these questions in practice, where we have tens of thousands of loci and also thousands of heterozygous genotypes? Now the answer is, of course, with computers. Fortunately, there is sophisticated software that solves these questions for us and delivers the haplotypes we could analyze further. This computation process is called phasing.
So, phasing is the computational task of assigning alleles to the paternal and maternal chromosomes. It looks for haplotypes, or these so-called "phases," in large-scale genotype data and solves the complex problem of assigning the correct haplotypes.
Of course, this is easier if so-called trios are genotyped, that is, a father, a mother, and their child, or even multi-generational families that also include grandparents and great-grandparents. If everyone is genotyped, the process is somewhat easier. In reality, however, we don't have this ideal situation; many times only parts of the population are genotyped, so it is harder to work out the actual haplotypes. Fortunately, this is also possible, and haplotypes can be determined for samples of "unrelated" individuals from a population. "Unrelated" here is in quotation marks because there is usually some kind of relationship between the individuals within a population.
So, as I mentioned before, there are specific software solutions for all of this, which actually divide the genotypes into smaller segments and try to derive these haplotypes from these smaller segments and merge them back properly.
Imputation - general definition
Now, when we determine these haplotypes or these phases in a population, these are really useful for a number of purposes, and one of these purposes is the so-called genotype imputation. I mentioned multiple times that the SNP genotyping is fairly reliable, but occasionally, we see missing genotypes. So, actually, with this genotype imputation process, we can make an educated guess on how to fill in these missing genotypes so we get the full information.
So, the imputation process is nothing else than filling in missing information. There are two major ways to use this method. The first one is the imputation of sporadically missing SNPs, and the other one is imputation between SNP chips, where, for example, we can extend a lower density SNP chip, say a 50k SNP chip, to a higher density. For both of these approaches, I will give examples in the following slides.
Imputation of sporadically missing genotypes
Out of the two methods, the imputation of sporadically missing SNPs is more straightforward. As we established, some of the SNPs could be missing due to genotyping error, and because of these genotyping errors, we might be forced to remove individuals from our analysis. Or, if we need complete data, in the sense that all SNPs should be known, then missing genotypes are also a problem for us. But this situation can be fixed by imputing these sporadically missing SNPs.
Let's say that we have an established haplotype in a population that looks like this: [haplotype diagram not provided]. And now we have another animal or individual that is genotyped, and there is a genotyping error, so its haplotype looks like this: [On slide]. It is basically totally the same as before, so all the other loci for this haplotype match exactly, but these genotypes are missing. Based on this comparison, if every other SNP fits, we have a very good idea what should be filled in at the place of the question marks, so we have complete data also for this individual.
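As a tiny illustration of this matching idea (a sketch of my own, not the actual software), the snippet below compares a haplotype with missing sites against the known population haplotypes and copies the missing alleles when exactly one haplotype fits every typed site:

```python
# Minimal sketch: fill sporadically missing alleles on a haplotype by
# matching it against haplotypes already established in the population.
def fill_missing(target, known_haplotypes):
    """target: alleles with None at missing sites; known_haplotypes: full haplotypes."""
    matches = [hap for hap in known_haplotypes
               if all(t is None or t == h for t, h in zip(target, hap))]
    if len(matches) == 1:                           # a single consistent haplotype
        return [h if t is None else t for t, h in zip(target, matches[0])]
    return target                                   # ambiguous or no match: leave as-is

known = [['A', 'B', 'c', 'D'],                      # hypothetical population haplotypes
         ['a', 'b', 'C', 'd']]
print(fill_missing(['A', None, 'c', None], known))  # -> ['A', 'B', 'c', 'D']
```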
Imputation between different SNP densities
The imputation between SNP chips works on a similar logic, but it's somewhat more complex. So let's say we have SNP chips of two densities, and this is a small example. You see that there are 16 columns here; this will be our larger SNP chip, and the second SNP chip would be a smaller one that consists of eight SNPs. Each line here is an individual, and each column is a locus, and these loci are either homozygous for one allele (coded as 0), heterozygous (coded as 1), or homozygous for the other allele (coded as 2).
Now, the usual arrangement with these smaller and larger SNP chips, containing fewer or more SNPs, is that the smaller SNP chip is a subset of the bigger one. So, basically, all the SNPs from the smaller SNP chip appear on the bigger one as well, but there are other SNPs that are on the larger SNP chip and unknown for the smaller one. This shows the starting situation here, where we genotyped nine individuals with the smaller SNP chip.
Now, let's say that these individuals are from a population and are fairly unrelated, but we also know that even in unrelated individuals, there are short stretches of sequence that are identical by descent. These local patterns of IBD (identity by descent) can be described, and also the length of these segments determined, which, of course, varies based on the recombinations. If we identify these segments, or these haplotypes, we can use them to our advantage. So, for example, these would be the haplotypes that occur in our population, and for the sake of this example, they are also color-coded.
So if we return to our original example of the nine individuals genotyped with the lower density SNP chip, we can see that each of these individuals can be described as a combination of certain haplotypes. And because these haplotypes are already known, we actually know what we should put in the place of the question marks. This is then also done, and the information is filled into these gaps that were previously unknown.
So what we basically do is take the information from the higher density SNP chip, build the haplotypes for the population, and use these haplotypes to fill in the information for the other individuals that were genotyped with the lower density SNP chip, provided the lower density SNP chip is a subset of the higher density one.
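To make this concrete, here is a minimal sketch (hypothetical data, my own illustration rather than the software used in practice): the low-density 0/1/2 genotypes of an individual are matched against pairs of haplotypes determined from the high-density reference animals, and the first consistent pair supplies the genotypes at all high-density loci:

```python
# Minimal sketch: impute from a lower-density chip to a higher-density one
# by finding a haplotype pair consistent with the low-density genotypes.
from itertools import combinations_with_replacement
import numpy as np

def impute_to_high_density(low_genotypes, low_idx, ref_haplotypes):
    """
    low_genotypes  : 0/1/2 counts observed on the low-density chip
    low_idx        : positions of those SNPs within the high-density panel
    ref_haplotypes : (H, M) 0/1 array of haplotypes seen in the population
    """
    for h1, h2 in combinations_with_replacement(range(len(ref_haplotypes)), 2):
        candidate = ref_haplotypes[h1] + ref_haplotypes[h2]      # 0/1/2 genotypes
        if np.array_equal(candidate[low_idx], low_genotypes):    # consistent with chip data
            return candidate                                     # imputed high-density genotypes
    return None                                                  # no consistent pair found

ref = np.array([[0, 1, 1, 0, 1, 0],    # haplotypes from the high-density reference
                [1, 0, 0, 1, 0, 1],
                [0, 0, 1, 1, 0, 0]])
low_idx = np.array([0, 2, 4])          # SNPs shared with the low-density chip
print(impute_to_high_density(np.array([1, 1, 1]), low_idx, ref))  # -> [1 1 1 1 1 1]
```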
Here I would also underline that these haplotypes do not come from nothing; we actually need a sufficient number of individuals in this population genotyped with the higher density SNP chip, so we can determine the actual haplotypes that occur in the population, which can then be used for the genotype imputation described here.
Imputation accuracy and practical use
Why is this useful? Well, the lower density SNP chips tend to cost less. So if genotyping costs are an issue, or we want to genotype a really large number of individuals, we can use just the lower density SNP chip and go for the imputation process. Of course, for this, we need haplotypes that were determined from individuals genotyped with the high-density SNP chip. This imputation is done "in silico," so basically with computers, which also means that there are no additional costs other than the computational cost of the whole process.
There are different software options for this process, and to my knowledge, all or most of them are free or open access. With this software, we can do the imputation, which will have a certain accuracy. The whole process is not 100 percent accurate, but it actually works surprisingly well. The imputation accuracy, in general, depends on the size of the reference set and the data quality. What I mean by this is that we need to determine the haplotypes that occur in the population of interest. So of course, we need a representative sample genotyped with the higher density SNP chip in order to determine the haplotypes that occur in the population, so we can use these haplotypes further on in the imputation process.
In general, the imputation works really well for common SNPs, which occur reasonably frequently within a population. This also means that, unfortunately, the imputation works less well for so-called rare SNPs that occur very infrequently, because there is just no way for the imputation process to pick them up from the haplotypes that are available for the population. So the general advice is that if someone is interested in very specific rare alleles, then the imputation process is perhaps not the best solution; in that case, genotyping the individuals with the actual higher density SNP chip is advisable.
But overall, the imputation works really well. So I put there that the imputation accuracy could be more than 95%. I just put these numbers there so you have a bit of an idea that we are talking about very high values, especially in the simulation studies. In my experience, if the imputation accuracy is lower than 99%, people start to get unhappy. In the papers, especially in simulations, the imputation accuracy is often much higher than 95%. In real data, well, it can be variable, as I mentioned; this really depends on the reference and the data quality.
Also, there is a range of possibilities for how to evaluate the actual imputation accuracy, but it is mostly done with the so-called masking procedure. It is a very similar process to the one I described in this presentation: there are genotypes obtained from a higher density SNP chip, some of these genotypes are deleted, and then the imputation software is used to fill these missing markers in. But of course, we still know the actual genotypes from the higher density SNP chip.
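A minimal sketch of this masking idea follows (my own illustration; the impute() function below is just a placeholder for whichever imputation software is actually used):

```python
# Minimal sketch: mask some known genotypes, re-impute them, and compare.
import numpy as np

rng = np.random.default_rng(1)
true_geno = rng.integers(0, 3, size=(50, 200)).astype(float)  # individuals x SNPs, coded 0/1/2

masked = true_geno.copy()
hidden = rng.random(masked.shape) < 0.05                       # hide ~5% of the genotypes
masked[hidden] = np.nan

def impute(geno):
    """Placeholder: a real run would call the imputation software here.
    For illustration only, each masked genotype is filled with the SNP mean."""
    col_mean = np.nanmean(geno, axis=0)
    return np.where(np.isnan(geno), col_mean, geno)

imputed = impute(masked)

# Concordance of the rounded calls and squared correlation of the dosages
concordance = np.mean(np.round(imputed[hidden]) == true_geno[hidden])
r2 = np.corrcoef(imputed[hidden], true_geno[hidden])[0, 1] ** 2
print(f"concordance at masked genotypes: {concordance:.3f}, dosage r2: {r2:.3f}")
```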
So then, basically, the values that were filled in by the software are compared with those obtained from the actual genotyping, and this is the basis for evaluating how good a job the imputation software does. But as I mentioned, these programs do a surprisingly good job, and with that we have arrived at the end of this segment, and we end, as always, with a short summary.
Summary of the lecture
We talked about the so-called haplotypes, which are a series of SNPs, and these haplotypes clarify which combination of alleles comes from which parent. Of course, if we want to do these computations at a large scale or on real genotypes, we need to use computers, and there is a range of specialized software programs that do the job for us. The approach itself is called phasing, and these phases, or haplotypes, can be used in various ways, but one of the uses is the so-called imputation process, which is nothing else than filling in the missing SNPs in our data.
Here we also have options. We can fill in sporadically missing SNPs that were not genotyped for some reason, so genotyping errors or missing SNPs can be filled in or imputed. Or we have a different option, where we can actually extend smaller SNP chips to a larger one based on the haplotypes and information from these larger and denser SNP chips, perhaps even saving some money in the process because the lower density SNP chips tend to cost less. And if we are not interested in some very specific rare alleles and are fine with the imputed version of these genotypes, we can use them for our research.
So we end here today. Let me know if you have any questions or comments down in the comment section below. Also, thank you for the time you spent on this video, and I wish you a very nice day.
Imputation Steps
Title: Imputation
Presenter(s): Sarah Medland, PhD (The Psychiatric Genetics Group, Queensland Institute of Medical Research)
Sarah Medland:
Hello, my name is Sarah Medland and I'll be talking to you today about imputation. So there are three main reasons why we might impute data. The first of these is meta-analysis, or combining our data with that of another cohort. Secondly, fine-mapping, and I'll give an example of that in just a moment. And thirdly, to combine data from different chips.
So imagine a situation where you have a large cohort which has been genotyped half on chip A and half on chip B. If we were to put the data from these two chips together and analyze them, we would end up with a mixture of power distributions. So we would have some SNPs that are on both chips, and they would be the most powerful SNPs in the analysis compared to those that are on one chip or the other. If we were to take this forward for analysis and look at our QQ and Manhattan plots, we would have a very hard time interpreting those results because of that mixture of powers. And if we were to find an association and go in and look at the region, we could expect that the distribution of p-values wouldn't follow what we would expect based on the LD, or the correlation structure, within that region. Because we have this differential in power, the SNPs that are on both chips would have the highest power and potentially more significant p-values than those that are on only one chip. So to get around this, what we could do is bring those two data sets together, take them forward for imputation, and end up with a data set that has a fairly constant N and a single power distribution that's not dependent on whether or not a SNP was present on the chip. We can also use imputation to correct for sporadic missingness and genotyping errors, and also to impute types of variation that we haven't directly genotyped, such as structural variants.
Here’s an example of fine mapping. In this situation we have run a GWAS, but we’ve only used genotyped SNPs and we have this variant that we’re finding on chromosome 19. When we go in and have a look, it appears to be floating, so it’s not really supported. We have nothing really in this region to back it up particularly well. So looking at that, it’s very hard to work out if that could be a true finding or not, and one of the things we might do is fine mapping, which is to go in and impute other content in that region and see whether there is additional support that we’re not observing in our genotyped SNPs. So when we go ahead and do that in this case, we can see this actually is the true effect. It’s well supported by SNPs in the region, it’s just that these variants were not genotyped on this chip.
So when we’re talking about imputation, what are we actually talking about here? We usually start with a genotype data set that has missing or untyped genotypes. We have a reference set of haplotypes, so a public reference set, and those references are compared to our genotypes. We try and identify which haplotype best represents each segment of data, and then we infer in the missing content. So to put this another way, we start with the genotype sample, which has some genotypes but is missing others. We have our set of reference haplotypes. What we’re going to do is compare our genotype samples to our reference haplotypes. Try and work out which haplotypes best represent which segments of data, and then infer in the missing genotypes. This is done in a probabilistic way, and we can assess the accuracy of this imputation as we go.
Steps to Imputation
OK, so there’s a couple of steps and things we need to think about when we’re setting up for imputation. So firstly we need to have really well QC’ed data and this would be similar to the QC that you were shown in the QC in GWAS session from yesterday. Secondly, we need to decide which of our references we want to use, and we are in the situation where we now have quite a lot of references, so it’s worth thinking carefully about why you are using a particular reference and what you’re trying to do with your analysis.
The most common references to use at the moment are the 1000 Genomes and the HRC references. The 1000 Genomes reference is a multi-ethnic reference, whereas the HRC is a predominantly European reference. The HapMap and 1000 Genomes references can be downloaded and used locally; the other references are mainly only available from custom imputation servers. There is a wide difference in the size of the references: for example, the 1000 Genomes reference yields around 20 million markers, whereas the HRC yields around 40 million and TOPMed yields around 300 million. At the end of the day, if you have a cohort of predominantly European individuals, you're likely to end up with between 8 and 10 million usable markers for your analysis.
Phasing
So once we have QC'ed our data and decided on our references, the next step is to phase our data. Phasing in this case means we estimate the haplotypes within our data. So we take our genotype data and try to reconstruct the haplotypes using reference data; for example, in this situation here we have three genotypes and there are four potential haplotypes that can arise from that data. We don't do this manually. We use software that's been specially designed to do it, and the most common software packages at the moment are Eagle and Shapeit. They use hidden Markov model and Markov chain Monte Carlo methods to reconstruct the haplotypes, and these are then used to provide scaffolds to infer or impute the data.
Imputation Programs
For our imputation as well, we use customized programs, and the most commonly used ones at the moment are minimac and IMPUTE. There are others that are available. An important point is to never use Plink for imputation: although Plink has an imputation option, it's really not very well designed and I wouldn't recommend using it. So the two most commonly used imputation programs are minimac and IMPUTE. Minimac comes from the work of Gonçalo Abecasis, Christian Fuchsberger, and colleagues, and has a number of downstream analysis options, including SAIGE, which we will use later in the week, BOLT-LMM, and Plink2. IMPUTE is now up to version 5. This comes from Jonathan Marchini and colleagues, and it incorporates the Positional Burrows-Wheeler Transform (PBWT), so it's a fast and efficient way of undertaking imputation. Once again, it has a number of downstream analysis programs that have been written specifically for the output of this program.
Imputation Cookbook
So how would you actually go about doing your imputation? If you are in the situation where you have to do imputation locally, I would seriously recommend using what we call a cookbook, and there are a number of these available online. So here's a link for a minimac3 imputation cookbook for 1000 Genomes. If you are in the situation where you can use an imputation server, I strongly recommend that you do that, and there are a couple of these available. There's one at the University of Michigan, which is probably the most heavily used one, there's one at the Sanger in the UK, and a new one, the TOPMed imputation server, for those wanting to impute to the TOPMed reference. Here's a little shot of each of their front pages. In the practical, we're not going to walk through how you impute data, because there's a really good set of imputation practical sessions available on the Michigan imputation server site; these are from the American Society of Human Genetics meeting in 2020, and you can walk through each of those if you're interested in learning how to run imputation on the server.
Data QC
The main points are that, as I said, we need to QC the data well, so we exclude SNPs with excessive missingness, low minor allele frequency, Hardy–Weinberg issues, and Mendelian errors. We should also drop strand-ambiguous or palindromic SNPs. And you need to be careful that your data is on the right build and alignment: depending on which reference you've chosen, if you've chosen the TOPMed reference, you need to have your data on build 38; if you choose the others, the data should be on build 37. You then need to output your data in the format expected by the phasing program, and it's really important that you check the naming convention for the references and the program that you want to use. So do the SNPs use rs numbers, or are they named by position?
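As an illustration of these filters on a simple 0/1/2 genotype matrix, here is a minimal sketch in Python (the thresholds are common choices rather than prescriptions, and Mendelian-error checks are omitted because they need pedigree information):

```python
# Minimal sketch: pre-imputation SNP QC on an (individuals x SNPs) 0/1/2 matrix
# with np.nan for missing calls. Returns a boolean mask of SNPs to keep.
import numpy as np
from scipy.stats import chi2

def qc_keep(geno, alleles, max_missing=0.05, min_maf=0.01, min_hwe_p=1e-6):
    """alleles: one (ref, alt) tuple per SNP, used to flag palindromic SNPs."""
    n_called = np.sum(~np.isnan(geno), axis=0)
    missing_rate = 1.0 - n_called / geno.shape[0]

    p_alt = np.nansum(geno, axis=0) / (2.0 * n_called)          # alternate allele frequency
    maf = np.minimum(p_alt, 1.0 - p_alt)

    # Simple 1-df chi-square test against Hardy-Weinberg expectations
    obs = np.stack([np.sum(geno == g, axis=0) for g in (0, 1, 2)])
    exp = np.stack([n_called * (1 - p_alt) ** 2,
                    2 * n_called * p_alt * (1 - p_alt),
                    n_called * p_alt ** 2])
    with np.errstate(divide="ignore", invalid="ignore"):
        hwe_stat = np.nansum(np.where(exp > 0, (obs - exp) ** 2 / exp, 0.0), axis=0)
    hwe_p = chi2.sf(hwe_stat, df=1)

    # Strand-ambiguous (palindromic) SNPs: A/T or C/G
    ambiguous = np.array([{a.upper(), b.upper()} in ({"A", "T"}, {"C", "G"})
                          for a, b in alleles])

    return (missing_rate <= max_missing) & (maf >= min_maf) \
           & (hwe_p >= min_hwe_p) & ~ambiguous
```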
If you are using an imputation server, once you have your data QC’ed and ready to go, it really is as simple as uploading your data, picking the options that you want to use, and then submitting the job. After the imputation you have about a week to get your imputed data off the server and then it’s all wiped.
OK, so once we've done our imputation, if we use the Michigan imputation server, then our data is going to be in VCF format. In this format, each line in the file represents a variant, and each block of data represents an individual. The file contains our imputed data in three different formats. The first of these, before the first colon, is the hard-call or best-guess genotype, and this refers to the number of copies of the alternate allele that someone has. The second is the dosage format, which ranges between zero and two, and once again this is a count of the number of doses of the alternate allele that someone has. The third format, which is not used very often, is what we call the genotype probability format, and this is the probability that an individual has an AA, AB, or BB genotype for each of the SNPs in our file.
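As a small illustration, here is a minimal sketch of pulling the three encodings out of one imputed VCF record; the record itself is made up, and the code assumes a minimac-style FORMAT field such as GT:DS:GP (the field order is read from the FORMAT column rather than hard-coded):

```python
# Minimal sketch: extract hard call (GT), dosage (DS) and genotype
# probabilities (GP) from a single, made-up imputed VCF data line.
line = ("1\t752566\t1:752566:A:G\tA\tG\t.\tPASS\tR2=0.91\t"
        "GT:DS:GP\t0|1:1.02:0.05,0.88,0.07\t0|0:0.03:0.97,0.03,0.00")

fields = line.split("\t")
fmt = fields[8].split(":")                       # e.g. ['GT', 'DS', 'GP']
for sample in fields[9:]:
    values = dict(zip(fmt, sample.split(":")))
    gt = values["GT"]                            # best-guess genotype, e.g. 0|1
    ds = float(values["DS"])                     # expected copies of the alternate allele
    gp = [float(x) for x in values["GP"].split(",")]   # P(AA), P(AB), P(BB)
    hard_call = sum(int(a) for a in gt.replace("|", "/").split("/"))
    print(hard_call, ds, gp)
```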
To go along with these, we have a series of info files that contain the information about the imputation accuracy and the frequency of the variants in the sample. So we have our SNP identifiers and you can see some of these get quite long. We have our two alleles, our frequencies, our r-squareds, or imputation accuracies, a column telling us whether it’s genotyped or not. And for those SNPs that are genotyped, we have a leave-one-out imputation accuracy. So this gives us an idea of how accurately these genotyped SNPs have been imputed or would have been imputed if they weren’t already genotyped.
Rsquared
So one thing to keep in mind is that there are subtle differences between the way the different programs create their R-squared metrics. In both cases, effectively the R-squared is the ratio of the observed variance to the expected variance, but there are small differences in how they are calculated. There's also a difference in that the IMPUTE info measure is capped at one, whereas the MaCH or minimac r-squared measure is allowed to go above one as an empirical estimate.
The two programs do get fairly good agreement, though, and they should line up fairly well if you were to impute the same set of data both ways. So the R-squared, or the info, is telling us about the level of certainty we have in the data. If we had an R-squared of one, it would indicate there's no uncertainty; an R-squared of zero means complete uncertainty. And, for example, an R-squared of 0.8 in 1000 individuals will give us the same amount of power as if that SNP had been genotyped in 800 individuals, so you can think of it that way.
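As a rough illustration of the "observed over expected variance" idea, here is a minimal sketch of a MaCH/minimac-style dosage r-squared estimate (the programs' exact formulas differ in detail; this is just the basic ratio):

```python
# Minimal sketch: dosage-based r-squared = empirical variance of the imputed
# dosages divided by the variance expected under Hardy-Weinberg equilibrium.
import numpy as np

def dosage_rsq(dosages):
    dosages = np.asarray(dosages, dtype=float)
    p = dosages.mean() / 2.0                 # estimated alternate allele frequency
    expected_var = 2.0 * p * (1.0 - p)       # variance of a true 0/1/2 genotype under HWE
    if expected_var == 0.0:
        return 0.0
    return dosages.var() / expected_var      # can exceed 1 as an empirical estimate

print(round(dosage_rsq([0.02, 0.98, 1.01, 1.95, 0.10, 1.00]), 3))
```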
After we've done our imputation, it's a good idea to do some QC and check that it's worked. Some things that you can do are to look at the minor allele frequency compared to the reference and see how that looks, and to look at the R-squared across the chromosome and see how that looks. Here are some toy examples of a relatively good imputation; you can see the r-squared varies quite markedly across the genome but generally follows a fairly well-defined distribution. Whereas this one is a deliberately bad imputation, and you can see that we've got a very different distribution here.
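A minimal sketch of the frequency check mentioned here (my own illustration; the 0.10 threshold is arbitrary): compare the allele frequency implied by the imputed dosages with the reference panel frequency and flag markers that deviate strongly.

```python
# Minimal sketch: flag imputed markers whose dosage-based allele frequency
# deviates strongly from the reference panel frequency.
import numpy as np

def flag_frequency_outliers(dosages, ref_freq, max_diff=0.10):
    """dosages: (N individuals, M markers) imputed dosages; ref_freq: length-M frequencies."""
    imputed_freq = dosages.mean(axis=0) / 2.0
    diff = np.abs(imputed_freq - np.asarray(ref_freq))
    return np.where(diff > max_diff)[0]          # indices of suspicious markers

dosages = np.array([[0.1, 1.9, 1.0],
                    [0.0, 2.0, 1.1],
                    [0.2, 1.8, 0.9]])
print(flag_frequency_outliers(dosages, ref_freq=[0.05, 0.60, 0.50]))  # -> [1]
```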
GWAS
After doing the basic imputation, it's a good idea to run a GWAS for a trait that is fairly well powered, ideally something continuous. Have a look at the lambda and at your Manhattan and QQ plots. Then run a GWAS for the same trait using only the observed genotypes, plot the imputed versus the observed variant results, and see what you get.
Consortiums
Lastly, quite often when we’re running imputation, we’re actually running it for a consortium or a meta-analysis, and they will give you instructions about which reference panels to use and what to do. They will probably ask you to analyze all variants, regardless of whether they pass QC or not. It’s important to think about this, especially if you are using those TOPMed references which have something like 300,000,000 variants, as only around 8 to 10 million of those will typically be useful for GWAS, and so you’ll be running analysis and uploading results for many, many SNPs that won’t be very useful for anyone. So if you’ve got any questions, feel free to ask them at the start of the practical session or to post them to the slack. Thanks very much, bye.
Imputation Deep-Dive
Title: An Introduction to Genotype Imputation
Presenter(s): Brian Browning, PhD (Department of Medicine, University of Washington)
Brian Browning:
Introduction
This talk is going to look at genotype imputation, which is a standard technique. We'll cover just an overview of it, and we'll look at some of the models used; the research talk that I'll give on Thursday will make use of some of the information in this tutorial. I'll finish up with maybe a little bit of discussion of programming, because that's something that's very relevant to this audience.
What is imputation?
So imputation is just estimating missing data. You can use the other data in the data set, or you can have an external data set. And if you have played any word games, you've done imputation. A classic example is Hangman. Say I give you three characters, and the last two characters are A-T; about a third of the letters in the alphabet can fit in there. And Hangman gives a good illustration of a general principle of imputation: the more context you have, the better you can fill in or estimate that missing data. If I give you some additional characters, for the sentence "the dog chased the," you can do a much better job filling it in. Instead of a third of the characters in the alphabet, there's one; C probably springs to mind first. It could also be an R, but your probability distribution becomes much more pointed.
Genotype Imputation
Imputation is the filling in of genotypes. It originally started with actually imputing genotypes; now, for computational reasons, we work at the haplotype level and impute alleles. So your reference data consists of reference haplotypes, two phased reference haplotypes per reference sample. The sample you're imputing has two haplotypes too, but it's missing a lot of data; it's typically genotyped on a SNP array, so you have just a couple of markers. You might have a marker here where, on one haplotype, there's a G, and a marker here where there's a C; and on the second haplotype for the sample, an A and a C. Based on these reference samples and a probabilistic model, you want to make inferences about what all these dots are.
Applications to GWAS
Imputation has been around for a long time; imputation of sporadic missing genotypes has been around for a long time. But imputation came into prominence in 2007, when a group from Oxford working with the Wellcome Trust Case Control Consortium and, more or less simultaneously, a group in Michigan, Gonçalo Abecasis's group, developed methods for imputing ungenotyped markers using reference data. The initial application was to finding new trait-associated loci, and in the initial studies it actually didn't produce much, although it does produce some extra power. The idea is that you have a SNP array on which you genotype 300,000, 500,000, a million markers, but there are a lot of other markers in the genome. If you can impute them, then you can test additional markers, so it should give you a little bit of an increase in power, and it does.
The second application was for fine mapping. So your genotyped markers show that you have an association in a region, but there may be other markers that weren't genotyped that give you a stronger signal. That can be valuable for replication studies. So you impute the additional markers in the region, and you might find a marker that's more highly correlated with the trait you're interested in; that's then the marker you'd want to take forward, certainly into your replication study, because it should have a greater chance of replicating the association if it's real.
These first two applications are nice -- I don't think they're necessarily game-changers, but they're useful, very useful. The real killer application, though, is meta-analysis. There are lots of different SNP arrays out there with different numbers of markers from different vendors, and SNP arrays from different vendors tend not to have a lot of overlap. When you want to do meta-analysis, it's very difficult when your datasets are genotyped on different markers. It's like a Gordian knot, and imputation just slices through that knot. You take a reference panel, you impute all your individual datasets so that they all have the markers that are in the reference panel, and now they're on the same set of markers, and you proceed. That's been very valuable: when you see these studies in Nature or Nature Genetics where they have several hundred thousand samples and scores and scores of associations, it's imputation that made that work in a straightforward way, because there were a lot of different data sets, and they had to use imputation to get them all onto the same marker set to do the meta-analysis. So the meta-analysis application, I think, has been very, very successful.
Imputation Output
What you get out of imputation is not necessarily like Hangman, where you're guessing what you think is the most likely letter; it's a probabilistic output. So we think, based on the reference data and the observed data in the sample, that at this position on this haplotype there's a certain probability that the allele is the A allele and a certain probability that it's the B allele. Throughout this talk, all these methods extend to multi-allelic markers, but just to remove that complexity, we'll assume di-allelic markers, and I'll typically refer to the alleles as A and B. My background is with human data, so whenever I refer to some physical characteristics of data, I'll always be thinking of human data. So my apologies to people from an animal background; it's just that's not my background, so my examples are from the human domain.
So here's a haplotype at a marker: you might have, for the A allele, probability 98%, and for the B allele, probability 2%. And on the other haplotype, you also have a probability distribution, and that essentially gives you all the information you need for whatever you want to use. The advantage of probabilistic output is that you're capturing the uncertainty in the imputation, rather than hard genotype calls where you've erased that uncertainty. You can get called genotypes if you want by just taking the genotype that has the highest probability. And to get posterior probabilities at the genotype level rather than the allele level, you can just assume Hardy-Weinberg equilibrium, and it pops out. Also, with probabilistic genotypes, you can use them in the standard frameworks for testing. If you do linear regression analysis, typically the predictor at a marker is the number of copies of the minor allele, 0, 1, or 2. That same framework works with imputed data; it's just that instead of an integer number of copies, you have the expected number of copies, which is sometimes referred to as the expected dose of the allele. So, like in this example, the B allele dose turns out to be 0.88; that's what you'd plug into your regression analysis.
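As a small worked example (my own numbers, chosen so the dose comes out to 0.88, since the slide's exact inputs aren't shown in the transcript), here is a minimal sketch of turning the two per-haplotype allele probabilities into a dosage and, treating the haplotypes as independent, into genotype probabilities:

```python
# Minimal sketch: expected B-allele dose and genotype probabilities from
# per-haplotype allele probabilities (haplotypes treated as independent).
def dose_and_genotype_probs(p_b_hap1, p_b_hap2):
    dose = p_b_hap1 + p_b_hap2                                   # expected copies of B
    p_aa = (1 - p_b_hap1) * (1 - p_b_hap2)
    p_ab = p_b_hap1 * (1 - p_b_hap2) + (1 - p_b_hap1) * p_b_hap2
    p_bb = p_b_hap1 * p_b_hap2
    return dose, (p_aa, p_ab, p_bb)

print(dose_and_genotype_probs(0.02, 0.86))   # dose = 0.88
```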
Measuring Imputation Accuracy
There's more uncertainty in imputation, so you need a way of measuring it. And there are two sorts of ways: one, I think, is really obvious; one is a little less so.
So the most obvious way is just genotype discordance.
The little less obvious way is the correlation in allele dosage. Again, I think the first groups that were developing imputation devised these methods: the Michigan group and the Oxford group devised something similar, and Michigan devised the correlation metric I'll talk about here.
So R-squared, even though it's a little bit more complicated, turns out to have some big advantages. It's normalized for allele frequency. For example, if I tell you I have a marker at which I can impute the alleles with 99.9% accuracy, it's very tough to interpret that without some more information. If the marker allele frequency is 30%, 40%, 50%, then 99.9% accuracy is really, really good. If the marker allele frequency is 0.1%, then 99.9% accuracy is really, really, really bad, right? Because you could just use the dumb imputation strategy of always imputing the allele to be the major allele; you've destroyed all the information at the marker, and you've achieved 99.9% accuracy. So you have to know the allele frequency, whereas correlation automatically builds it in. If you remember from your introductory statistics class, when you compute a correlation, there are variances in the denominator, and those variances capture the allele frequencies. So it can be interpreted in a much better way without actually having to know the allele frequency.
The squared correlation metric looks at the expected correlation between the imputed allele dose, the imputed number of copies of the allele in your samples, and the true number of copies of the allele in your samples. It also has a couple of other features that strike me as not very obvious, but they're useful. So R-squared can be estimated even if the true genotypes are unknown: you can get an estimate of your accuracy without even knowing what the truth is. Now, it assumes that your posterior genotype probabilities are well-calibrated, so there is that assumption built in. But if they are, you can estimate the R-squared from the imputed data itself, without knowing the truth. This idea was developed by Michigan, and a derivation of something similar, which illustrates one way to derive this, is given in the reference I've cited in the American Journal of Human Genetics.
The second surprising feature is that R-squared gives information about relative power; there's an interpretation in terms of power that's useful for R-squared. It turns out that allelic tests have similar power if you use imputed genotypes in N samples or the true genotypes in R-squared times N samples; something similar to this has been known for a long, long time. The best explanation I've seen of this is a box in the American Journal article that I've cited here, if you want to look it up; it's just a small box that goes through the derivation. So if you're looking at a marker you imputed where the estimated R-squared, which we're taking to be the true R-squared, is 0.8, and you have a thousand total samples, maybe half cases and half controls, then if you test that imputed marker, the power should be roughly the same as using the true genotypes for 800 samples. So if you're trying to determine what R-squared threshold to use for carrying an imputed marker into downstream analysis, this gives you a way to interpret what that R-squared might mean.
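The power interpretation boils down to a one-line calculation; a minimal sketch:

```python
# Minimal sketch: an imputed marker tested in n samples is roughly as powerful
# as the true genotypes tested in r_squared * n samples.
def effective_sample_size(n_samples, r_squared):
    return r_squared * n_samples

print(effective_sample_size(1000, 0.8))   # -> 800.0
```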
What determines the frequencies you can impute? The general rule with imputation is that you can impute high-frequency markers really well and low-frequency markers not very well. Where's the cutoff? What determines the threshold of what you can impute? There are two primary factors, one of which we can change and one we can't. The one we can't change is the effective population size; we're stuck with that. The more diverse the population is, the bigger the effective population size, the shorter the shared haplotype segments in the population, and the shorter those shared haplotype segments are, the harder it is to impute. We can't change that. But what we can change, given money, is the reference panel size. The bigger the reference panel size, the lower the frequencies we can impute. We often think in terms of minor allele frequency, which for many applications is more natural, but for imputation, the thing I find useful to think about is minor allele count.
Imputation accuracy (MAF VS MAC)
This is the same data, and these plots are sort of nice for getting a sense of the potential imputation accuracy at different frequencies. This is simulated data from a Northwest European population; reference panel sizes vary from 50 thousand to 200 thousand samples, and the target data was on a million-SNP chip. On the left-hand plot, we're plotting the squared correlation between the simulated truth and the imputed data for different minor allele frequencies. On the right-hand plot, it's the same data but broken up by minor allele count. On the left-hand side, you can see that you really need to know the frequency of what you're imputing to understand the accuracy. On the right-hand side, when it's expressed in terms of minor allele count, it's much more stable. So, with this simulated data, at around 10 copies of the minor allele in the reference panel, you're getting a squared correlation around 0.74; at around 20 copies in the reference panel, you're getting around 80% imputation accuracy. Now, this is simulated data, so it's going to be a bit better than current reference panels, because current reference panels are predominantly, if not totally, from low-coverage sequence data, and the Achilles heel of low-coverage sequence data is estimating low-frequency genotypes; it has a very high error rate for low-frequency genotypes. But as we move into reference panels obtained from high-coverage sequencing, this kind of performance should be practical in outbred populations like European populations. One of the inferences from this is that if you were able to impute markers with at least 20 copies of the minor allele at a certain reference panel size, then when you double the reference panel size, that should still be true; it'll even get a teeny bit better. So, every time you double the reference panel size, the frequency of variants that you can impute, other things being equal, cuts in half. It's sort of a linear relationship.
Why impute when you can sequence?
Now, if you’re from a sequencing background, a natural question to ask is why impute when you can sequence? Imputation has errors; sequencing is more accurate, high-coverage sequencing. Why go to the trouble of imputing? This slide just sort of breaks down the things that imputation is competitive with high-coverage sequencing at, and things that it’s not competitive with.
So, the easiest thing to do is estimating allele frequencies, and that's what we do when we do association testing, which is where imputation has been used most widely; that's its strong point. With 50,000 Northwest European reference samples, if you impute down to 20 copies, that's imputing down to a minor allele frequency of 2 times 10 to the minus 4, so you can go relatively low and do very well with imputation if your goal is to estimate minor allele frequencies. If your goal is a little bit harder, because it's much harder to estimate a genotype than to estimate a minor allele frequency, then it gets more problematic.
In my simulated data, for 5 percent minor allele frequency and above, it does very well; it can actually estimate the genotype with about 99.9% accuracy in that range. But that's not true below 5 percent; the accuracy slips, and the imputation at the genotype level, not the minor allele frequency level, is just not as accurate. Now, we could improve that by going to ever larger reference panels, but there is a break point, and it's much higher than the break point for estimating allele frequency. And of course, for de novo mutations, I don't care how big your reference panel is; you aren't going to be able to impute them.
So, it’s true there are things that genotype sequencing does much better than imputation. So why would you do it (imputation)? Money.
Alright, high-coverage sequencing is about a thousand dollars, or at least it was the last time I checked; I think it's still in that range, and that may even require ordering in bulk. I'm curious if somebody has data on that; I'd like to hear what the current costs are for genotyping as a service. Chip-typing: if you're a good negotiator, you can get a pretty good deal on chip-typing. You need big datasets, and you need to play Affymetrix off against Illumina and get them to go against each other, but you can get it down to $50 a sample. Imputation with the current methods is, you know, half a cent a genome for 10,000 reference samples, and these are order-of-magnitude figures, five cents a genome for a hundred thousand reference samples. And if you have a million reference samples, which we won't for a few years probably, 50 cents a sample. So, essentially a hundredfold less than the chip-typing cost, and there's a lot of data out there with GWAS chip data available. Compared to sequencing, the cost difference is a factor of two thousand. You may not have a thousand dollars to sequence a sample, but you probably have five cents. Alright, so yeah, there's a trade-off depending on what your application is; especially if you're interested in association testing, imputation gives you a lot of bang for the buck.
Hidden Markov Model (HMM)
So in the next part of the talk, I'd like to go over the models, the standard model, the most widely used model for imputation. There's been some very nice, clever work developing other approaches, matrix completion approaches, summary statistic approaches, but in the interest of summarizing, I'm going to talk about the one I'm most familiar with and also the one that, from what I've seen, has the greatest accuracy.
Hidden Markov models: So the basic methods are based on hidden Markov models, where you have a Markov process and you can't observe the underlying states; the process is hidden. What you do have is observed data. I'll go through the parts of the model, and then we'll use this model in the research talk on Thursday. So I'll describe the Li and Stephens model. Once the field of imputation moved to what's been called pre-phasing, where we impute onto haplotypes, the Li and Stephens model becomes, in my view, the model of choice, because it's computationally tractable at the haplotype level without having to do a lot of shortcuts. You can do the full Li and Stephens model, and it gives you very accurate results. The reference for that seminal model is given on this slide.
Hidden Markov Model (HMM)
So the hidden Markov model has a number of components; I'll go through those components, and all of them will typically be illustrated on a slide like this, so I'll go over it in some detail. The first thing it has is model states, and there's going to be a model state for every pairing of a haplotype and a marker. So the markers, these are on the reference haplotypes, are given as columns of the matrix; the haplotypes, these are reference haplotypes, not the haplotypes you're imputing but those from the reference samples, are given here as rows. We'll label these h1, h2, h3, h4, and so on. The states of the model then just become the elements of the matrix, these circles. And for reasons that will become apparent in a slide or two, we want to label those states with the allele that the reference haplotype carries. So, we'll use two colors: blue will represent the reference allele, and yellow will represent the alternative allele. The number of states in your model is just the number of rows times the number of columns, that is, the number of haplotypes times the number of markers.
Initial State Probabilities: The next component of the model, after you've defined the states for the Li and Stephens hidden Markov model, is the initial state probabilities. These are the probabilities before you've seen any observed data. The way the Markov process works is that you start at the first marker and work your way through to the last marker. So, for the initial state probabilities, there are only nonzero probabilities in the first column, for the first marker. For each haplotype, all the states at the first marker have equal probability, so that the probabilities sum to one; there's no reason to prefer one state over another. The states at every other marker, in every other column, have probability 0 at the beginning.
State Transitions: Then there are state transitions, and just to keep the slide from getting too cluttered, I've only shown one so far. The state transitions I've been showing are what you can think of as the primary state transition: when you go from one marker to the next, you stay on the same haplotype with probability close to one. But actually, it's a little bit more flexible and complex: with a small probability, you can jump to a random haplotype, and that's what I've tried to show for just one single marker right here. So, with probability close to one, you stay on the same state, and there is what we call no recombination. With the small remaining probability, you jump to a random state, and that random state can also be the state you're on.
What that is modeling is historical recombination: for a while, the haplotype you're imputing has inherited the same sequence of alleles as a certain reference haplotype, and then, because of a historical recombination, it switches to another reference haplotype. That small probability of transitioning to a random haplotype is, over short distances, approximately proportional to genetic distance. The bigger the genetic distance, the higher the recombination rate, and so the greater the probability of transitioning to a random haplotype. I won't show this anymore, but just be aware that the actual state transitions can go to any state at the next marker; I'm only going to show the primary ones, where you stay on the same haplotype.
Emission probabilities: The next component of the hidden Markov model is some way of relating your observed data to the Markov process, and that comes from what are called emission probabilities. This is where the labeling comes in, where we labeled each state with the allele that the reference haplotype carries. So if you're in a particular state, let's just take the state at the first marker on the first haplotype: it will, with probability near one, emit the blue allele, and that's shown in these equations here. So, if you're in a blue state, the probability of emitting a blue square (I use a square to represent the allele in the observed data) will be one minus epsilon, where epsilon is a small value, so that probability is close to one. With a small probability, epsilon, it will emit the other allele, and you'll have a mismatch between your observed data and the state you're in. The same principle holds for yellow: if you're in a yellow state, you'll emit a yellow allele with high probability, and with small probability it'll emit the other allele.
This works at any state where you have observed data. These open squares mean you have missing data, like you would have if you were performing genotype imputation. Then here's another marker that's genotyped in the sample you're imputing, and so, for example, not knowing anything else, you'd intuitively expect it's more likely you'd be in a state on haplotype 4, 1, or 2, because the emission probabilities are higher there, where the allele matches, than in the states on h3 and h5. At the states where you have missing data, the emission terms are constant; it doesn't matter what the underlying state is, you have no observed data, so you can treat the term as one and it drops out.
Now let me back up: once you have the state probabilities, you can get the imputed allele probabilities. The key thing you're trying to work out is, given the observed alleles on the haplotype you're imputing, what the probability of each of these hidden states is. Once you have that, you're essentially done with the imputation problem. If you want to know the probability of the blue allele at the third marker, once you have these state probabilities, you just add up the state probabilities for the blue states at that marker, and that gives you the posterior probability of the blue allele. For the posterior probability of the yellow allele at the third marker, you just add up the state probabilities for all the states where the reference haplotype carries the yellow allele. So, the key is those state probabilities, and that's the next slide.
Calculating HMM State Probabilities
So there is a standard way of breaking this up, and it's really beautiful math; I love the math that you use to get these state probabilities. You break it up into what's called a forward probability and a backward probability. So first, this is the state probability: little m is a marker and h is your haplotype, so it's the probability of being in the state at marker m on haplotype h. The O is your observed alleles, and capital M is the total number of markers. So given all the observed alleles at the genotyped markers, you want to be able to compute, conditional on that, what the state probability is. From what we talked about on the last slide, as soon as you know the state probabilities, you just sum them up to get your imputed allele probabilities. So you break it up, using the HMM's conditional independence assumptions, into two parts that are called a forward and a backward probability.
So there are a couple of things, I think, to note about this equation.
One is that there's a forward probability for every state in your model. Remember, the number of states is the number of rows times columns, the number of haplotypes times the number of markers. So there's a forward probability for each of those states, and there's a backward probability for each of those states, for each marker and for each haplotype. The forward probability only includes the observed data up through marker m. So if we are computing a forward probability for a state at marker m, the forward probability only includes the observed data up to marker m. The backward probability includes the observed data from the next marker all the way to the end. So there's that kind of division going on. And then the name "forward probability," I'm guessing, comes from the way they're computed. It turns out you compute the forward probabilities by making a forward pass through your data.
Computing Forward Probabilities: So the way it works is: you start with the forward probabilities for marker 1, which you get just from the initial state probabilities; they're all equal. Given your forward probabilities for marker 1, there's an update equation, which I'll flash on the screen in a few minutes, that gives you all the forward probabilities at marker 2. Once you have all the forward probabilities at marker 2, there's an update equation that uses these probabilities to determine all the forward probabilities at marker 3. And you just march through your data one marker at a time. At each step, you use the forward probabilities at the preceding marker to give you the forward probabilities at the next marker, and you keep marching.
Computing Backward Probabilities: As you might guess, the backward probabilities work the same way, just in reverse. You start at capital M, the last marker in your data set. You start with the backward probabilities there, which turn out to all be one initially. And then from that, you get the backward probabilities at marker M minus one, the preceding marker. And you keep marching backward, so you end up eventually at marker 6. Given the backward probabilities at marker 6, you can get them at marker 5; given all the backward probabilities at marker 5, the backward update equation gives you those at marker 4, and so on. And you march back. So you do a forward pass and a backward pass through the data, and it's imaginatively called the forward-backward algorithm.
Forward Update: We'll use this equation in the research talk, so I just wanted to give a high-level look at it. This is an example of the forward update equation. You don't have to memorize it, just understand its different components. So remember, the forward update equation gives you the forward probabilities at marker m + 1, given the forward probabilities at marker m; you'll notice there's an m + 1 on the left-hand side. Then you sum over all the reference haplotypes, all the states at marker m: here is the forward probability at marker m. You multiply that by a transition probability, the probability of being in the state for reference haplotype h' at marker m and transitioning to the state for reference haplotype h at the next marker. And then you multiply by an emission probability: given the state you're in, what's the probability of the observed allele at that marker? The backward update, which I won't go over, has the same format. It's a little bit different, but it's the same structure: you're summing over a triple product of the backward probability at the state you're coming from, a transition probability, and an emission probability. One thing we'll use very strongly in the research talk on Thursday is the fact that when you're at an imputed marker, the emission term effectively drops out. And we'll use that to find faster ways to compute imputation.
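Written out in the same generic notation as above (again, a sketch; the slide's own symbols may differ), the forward update being described is a triple product summed over the states at marker m:

```latex
\alpha_{m+1}(h) \;=\; \sum_{h'} \alpha_m(h')\; t_m(h' \rightarrow h)\; e_{m+1}(h)
```

where t_m(h' → h) is the transition probability between the states for reference haplotypes h' and h, and e_{m+1}(h) is the emission probability of the observed allele given the state. At an imputed marker there is no observed allele, so e_{m+1}(h) is constant across states, which is the sense in which the emission term "effectively drops out."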
The Practice of Programming
In the last part of the talk, I wanted to talk about programming. For people with a CS background, some of this, maybe all of this, you've seen before. But there are probably people here who are like I was a number of years ago, coming into computational genetics with little or no programming background, and there are certain things I thought it might be helpful to go over that may save you some time and make your life a little simpler, because a lot of our work involves writing code. I find that when writing software, the chief challenge for me is complexity. When code gets more complex, my mind has a harder time grasping it, and I become more error-prone. And as much as you might get a buzz from writing really fast code (if you're like me, it's pretty exciting to write code that's really fast), that's really not the first objective.
Simplicity: The first objective is to write clean, simple code. Don't worry about optimization at first. You're just trying to write code simple enough that when you look at it, it's easy to understand what it does: it operates in a logical way, its structure makes sense, and it's easy for the mind to absorb and grasp without having to really study it. There is a place for optimization, but it's not the first thing you want to do. There's a famous quote where Donald Knuth took a statement from scripture and adapted it for computer science: "premature optimization is the root of all evil." I think the idea is that if you optimize too soon, you can end up optimizing the wrong thing, or doing unnecessary optimizations that don't actually improve the code; or, if your code isn't simple to begin with, you'll have a hard time finding the right optimizations. Optimization is not free. Yes, it can speed things up, but it has a cost, and if your simple code was fast enough, you wouldn't need to optimize at all.
So when you optimize, by definition you are introducing complexity, and that complexity has costs. You're going to have more bugs, because the code is more complex. The bugs are going to be harder to find, because it's more complex. It's going to be harder to extend your code and add new features, because it's more complex. It's going to be harder to maintain, because it's more complex. And it's harder for other people to come to your code and figure out what you're doing, because it's more complex. So there's that cost, and you have to weigh it against the expected benefit. Is a big increase in complexity worth a 5% reduction in runtime? In most cases, no. Even a tenfold or hundredfold reduction in runtime may not be worth the increased complexity if the code is already fast enough for your purposes, if it runs in a second. So weigh it up and understand how much complexity you're adding and what the trade-off is before you add it, because optimization is just not free.
Modularity: The second general principle that I find useful is modularity, which is just the idea that you want units of code where the input is very simple, what it does is simple to understand, at least at a high level, and the output is very simple: a module of code that you can treat conceptually as a unit without having to think about it very deeply. Now, when you write the module, you may have to think about it deeply, but once it's working and doing what you want, you can just treat it as a building block.
And ideally, you want your program to be made of very loosely interacting modules, so that when you're working on a particular module, because it operates very independently, you can give it your whole concentration. You don't have to have the whole complex program at your fingertips, in your memory; you can just focus on the individual part you're working on, because it's loosely coupled. The classic example of this would be the UNIX utilities at the command line. If you've used a UNIX system, you know there are utilities for sorting, counting lines, counting words, extracting columns, extracting lines that meet certain criteria, changing characters, replacing words. There are all these UNIX utilities, and you can do a lot of your programming without actually sitting down and writing code; you just take the UNIX utilities and string them together in a series of filters, in a pipeline. UNIX utilities are a classic example of units of code that do one thing, do it fairly well, and that you can work with as a unit without understanding how they're implemented. And that kind of approach is really useful for writing complex projects.
Functions: Another general thing to be aware of when you're writing is when your classes or your methods (which are called functions in some languages) do too much. One of the pieces of advice I read early on, when I was learning programming, is: be aware of functions that extend beyond the length of your computer screen. In my experience, that's good advice. When a function doesn't fit on the screen, I'm more likely to make errors because I can't see the whole function without scrolling, and the extra length usually indicates a bit of extra complexity. All rules have exceptions, but generally, if a function extends beyond the screen, I want to see if there's a way to make the code cleaner, simpler, and easier to understand by breaking it up into parts. The same goes for classes, if you work in an object-oriented environment, which can be very useful for complex programs. Your threshold may be different, but mine is this: once a file (I work predominantly in Java) is more than two or three hundred lines, I notice that I have a harder time really understanding what's going on in the class. So when it gets long, I try, if I can, to find a way to break things up that makes it simpler to understand. There isn't always one, but I try.
Refactoring: So, getting that simple code involves what's called refactoring, which is just cleaning up your code without changing how it behaves. So when do you refactor? How do you know which parts of the code are worth spending some effort cleaning up? Just from experience, anything I've just written needs cleaning up when I scan it. I never get the design really right the first time, unless it's just a trivial piece of code. I may not realize when I write it that it has problems, but there's an acid test for discovering you have code that's hard to understand: it's when you come back to it after a couple of months. For some reason, you have to go back to your code, and you're looking at it, and it's very humbling, because you can't understand what you wrote. You wonder what you were thinking. Why did I do this? Is this a bug? Isn't this more convoluted than doing it this other way? All those thoughts come when you look at the code with fresh eyes, and it's usually then that I spend time refactoring. When I first wrote the code, I was just a little too close to it; it's hard for me to see the blemishes. But when you come at it with fresh eyes, they're really obvious. And it's often during that painful time spent trying to understand what your code is doing again that it becomes very easy to see ways I could have done things differently: I could have combined things, I could have changed the organization to make the structure easier to follow, I could divide a long method into two shorter methods, I could combine essentially duplicate code so that instead of maintaining two pieces of code, there's just one. All those things become obvious when you look at the code again with fresh eyes. So if you've had the experience, like I have many, many times, of a not very fun day trying to understand what you wrote in the past, refactoring just means the next time you look at the code, it won't be so bad. And it'll help any poor soul who isn't you: if you have a hard time understanding your own code, just imagine what difficulties somebody else coming into it will have.
Testing and Debugging: Then, for testing and debugging, one tool that I find really useful is regression testing. This is not linear regression from statistics; it's just testing to make sure your code is still working correctly, and it can be set up in a very automated way. You have some test datasets, the test datasets you originally used to convince yourself that your code was behaving well and doing what it was supposed to do. Save those test datasets, write some scripts, and save the output that you had from a previous version. Then, when you continue working on your code, maybe doing some refactoring, and you reach a stable point, you can check whether your code is still producing the same output. You can check it at a qualitative level to see whether it's still producing the same accuracy, or you can just use the diff tool in UNIX to check whether the output files have changed at all. This is a fantastic way to catch bugs that you've introduced: you had working code, you made some changes, and without realizing it, you broke something. Regression testing makes it very systematic and very easy to find those types of bugs.
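As a minimal sketch of that kind of automated check, written in Python (the program name impute.py, the test dataset, and the output paths are hypothetical placeholders, not anything from the talk):

```python
import subprocess
from pathlib import Path

# Hypothetical paths: a saved test dataset and the output from a trusted
# earlier version of the program.
TEST_INPUT = Path("tests/test_input.vcf")
EXPECTED_OUTPUT = Path("tests/expected_output.txt")
CURRENT_OUTPUT = Path("tests/current_output.txt")

def run_program():
    # Run the (hypothetical) program on the saved test dataset and capture
    # its output, exactly as the trusted version was run.
    with CURRENT_OUTPUT.open("w") as out:
        subprocess.run(
            ["python", "impute.py", str(TEST_INPUT)],
            stdout=out,
            check=True,
        )

def outputs_match():
    # Simplest possible check: the new output is byte-for-byte identical to
    # the saved output (the same idea as diffing the two files).
    return CURRENT_OUTPUT.read_bytes() == EXPECTED_OUTPUT.read_bytes()

if __name__ == "__main__":
    run_program()
    if outputs_match():
        print("Regression test passed: output unchanged.")
    else:
        print("Regression test FAILED: output differs from the saved version.")
```

Running something like this after each round of changes flags any difference in the output, which is the same kind of check you would get by running diff on the two files at the command line.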
Another useful tool is a version control system. I use a tool called Git. It allows you to go back and see a previous state of your program. Maybe you started off on a line of development and then found out that you made a mistake and need to backtrack. Or maybe there's a bug that you never caught before, and it was introduced in the past. A version control system allows you to essentially do a binary search to find exactly the point where the bug was introduced and determine exactly what changes you made at that point (in Git, the bisect command automates this kind of search). So you can nail down, almost with a microscope, the place where the bug was introduced. If you use Git, there are a lot of resources online for it. There's a short Udacity course on Git, and there are also manuals online; if you search for the "Git book," you should find the online book that explains Git, if you want to learn it. Git isn't easy to learn initially, I don't think; these version control systems take a little bit of use, but they're a powerful tool once you learn them. It's like the UNIX environment: hard to learn, but very powerful.
The last thing is that it pays to remember your previous bugs. Bugs are not uniformly distributed in your code; typically, they tend to cluster. The reason for that is fairly natural. It could be a complex section of code; complex code is more likely to harbor bugs, so bugs end up spatially correlated in your code. Or it could be something in your life: you weren't at your best the day you wrote that code, you didn't get enough sleep, you got a letter from the IRS, whatever, just something that threw you a little bit so you weren't at your full capabilities. So it pays to remember where your past bugs were. When you find a bug, if you can remember the types of bugs you've had and where they tended to occur in the code in the past, that can help you: you can give a little more preference to those parts of the code when you're trying to track the bug down. And it can save you a lot of time.
So, thank you. Thank you for your attention. That’s the end of my remarks.