Chapter 2.4: Linkage Disequilibrium (Video Transcript)

What is linkage disequilibrium?

Title: What is linkage disequilibrium?

Presenter(s): Gábor Mészáros, PhD (Institute of Livestock Sciences (NUWI), University of Natural Resources and Life Sciences)

Gábor Mészáros:

Hello, everyone. Welcome back to the Genomics Boot Camp to a series about linkage disequilibrium. In this video, we just start to talk about this genomic phenomenon, so therefore, the question is what it is exactly. So what is linkage disequilibrium? And you will know it from this video.

Before we start, let’s activate some of our prior knowledge that you might know from this channel or from other sources. And the two most important issues to highlight here is one that the single nucleotide polymorphisms, or SNPs, exist. These are some markers that are widely used and currently, in my opinion, the most important marker types as for the genomics. Tens of thousands of such SNPs are being genotyped in a cost-effective way on the entire genome, so basically it is a very good way to get to know something about the genomes themselves. And the other bit of information that is important here is the existence of the recombinations or recombination events that are of major biological importance. This means that the nucleotides on our genome are not being inherited independently but in a form of shorter or longer genomic segments from the paternal and maternal side.

Mendels Law

Such statement about the non-independent inheritance is in conflict with the Mendel’s law of independent assortment, which says genes do not influence each other with regard to the sorting of alleles into genes and every possible combination of alleles for every gene is equally likely to occur. Mendel’s law is, of course, valid when the genes or the parts of the genome are far from each other or, for example, on separate chromosomes, but as you will see, it’s not valid when these genes or SNPs or parts of the genome are very close to each other.

Linkage Equilibrium

The next few slides I took from the presentation of Professor Henner Simianer from University of Göttingen to demonstrate what happens in such occasions and what, in fact, the linkage disequilibrium is. So, for the sake of simplicity, let’s have an individual with two alleles and we ensure that this individual is entirely heterozygous. Then we mate it with another individual, which is entirely homozygous for these alleles A and B. And when the law of independent assortment is valid, then we get the four possible genotypes and, if our sample is big enough, then we have all these possible genotypes with a 25% probability. So, in our population, we would see that all four genotypes appear in a proportion of 25. In some cases, however, we might notice that these four genotypes are, in fact, not appearing in an equal proportion but very, very differently from each other. In this case, the alleles denoted by the capital letters A and B seem to appear much more frequently together, and also the alleles denoted by the lowercase a and b also appear to occur much more frequently with each other. So, in this case 45% in comparison to a situation when the lowercase a for example and a higher case b would be together in a single individual. The reason for this is, of course, recombination, so we have a non-recombinant gametes on the one side and the recombined gametes on the other side. So, in other words again, in case a recombination happened between these two loci, then we have a different occurrence for these for these gametes. The degree of recombinations is measured by the recombination rate.

Linkage Disequilibrium

So what we see is a departure from this equal distribution, denoted as a linkage disequilibrium or LD, which is the very frequently used abbreviation. I wanted to underline that LD is, in fact, a parameter of the population, so you need more individuals and, in the best-case scenario, a large number of individuals to determine the LD between two loci in a population. It is, in fact, a non-random association between the loci within this population, which could be measured so we can tell that the two loci are strongly linked or weakly linked or not linked at all together. In other words, the LD tells us something about the strength of the information if we see an allele in a certain locus and what it can tell us about the occurrence of another allele on another locus. There are various methods to measure LD, and we will talk about these measurement possibilities in the next video.

Heatmap

So, here is an example of a part of a genome with a heat map, so here is basically the darker colors are high LD and lighter colors are low LD. So it’s measured with D’ in this example - don’t worry about this right now. But basically, what I want to show you that the SNPs that are close to each other tend to have a higher LD between each other, and there are, for example, SNPs – for example, #9 and #20 - you see that there are relatively lower LD between them. Also, there are segments on the genome that are so called LD blocks where all the pairwise combinations of the SNPs yield a high LD, so basically this whole block is inherited in one piece. I really like this picture about the LD block structure, in this particular case, in a chicken genome, that demonstrates a similar thing as in a previous picture or previous slide but on a much larger part of the genome. Also, it compares to chicken lines where we see that the same part of the genome could be very different even within the species when we talk about different breeds or, in this case, the different lines. And the reason for this could be that, for example, in this chicken breed, in this particular part of the genome, there are some genes that are very important for this breed, therefore this part of the genome is quite well conserved with very high pairwise LDs in this part and the same genes are, of course, present also in the other breed. In this breed these genes are not important or at least not selected for. Therefore, we do not see such a strong LD block occurring in this breed.

Why is LD important

This simple picture supposed to illustrate the use of LD and why is it important in the genomics as such. So what we have here is let’s say a chromosome and we have a SNP on them, and let’s say we did some kind of analysis where we found the significance of each SNP, so this would be the higher SNP is, the higher the significance level. Now we see that there we have some kind of a signal here. So the question is: is this the gene of interest that influences our trait? Of course, because we are speaking about SNPs that are themselves just markers and not the causal variants, so this is, in fact, not the gene of interest but actually just shows what region of the genome is interesting. For our case, so we have a region of interest here and most likely the exact gene of interest resides somewhere here. Now the size of this region is determined by the linkage disequilibrium so LD, because most likely this SNP, the highest significant SNP, is connected to all of these nucleotides around. Also that, the parts of the genome that were not observed and, therefore, helps us to find the gene that is actually interesting for us.

Summary

So, to sum up why LD is important, it actually defines how far we are allowed to or supposed to look from the detected markers when we are looking for the causality on our genome, and it shows the association strength between the observed SNPs and the unobserved genes or QTLs. Also, I want to underline that the LD itself is actually dependent on a population or species, so most likely these association strengths or LD in these regions or the size of the regions might differ between the different populations.

Applications

As for the applications of LD in genomics, it is really, really wide-ranging, so I just mentioned a few examples here. So in evolutionary biology, it allows us to reflect on past events and gives insight into evolutionary history. When it comes to genetic diversity, it’s a similar as in a previous point, but we often compute the effective population size for our populations, and one of the computation methods is via LD itself. Then we can tell something about the artificial or natural selection events on the genome when it again comes to the population with the so-called selection signatures and the genetic hitchhiking - these are events when the actual selected gene drags along part of the surrounding genome and appearing in a population as a selection signature. So this size is determined by the LD itself. My small schematics showed an example of a genome-wide association study where we could search for causal genes and, as it was mentioned, there we could locate our search to these interesting regions where the most important or most significant SNPs reside. And LD is also used in a genomic selection in an indirect way, so here the question is that, how many SNPs we need that are more or less equally distributed on the genome so the whole genome is covered. So we could detect these very small QTLs and very small genes that affect our trait of interest, which is usually a quantitative trait, so it is influenced by many genes of small effect. We need to ensure that the whole genome is covered properly, so that all of these small genes are connected to at least one snip in a high enough LD and, therefore, taken into account.

When the genomic LD value is calculated, so again a slide to give an answer to the initial question: what is linkage disequilibrium? So, linkage disequilibrium is a parameter that quantifies the non-random association between the loci. It shows if the frequency of a different alleles between two loci is higher or lower than it would be expected if the loci were independent from each other and associated randomly.

Conclusion

And the very last summary slide for the entire presentation we talked about SNPs, markers that allow us to track the associations between parts of the genome, and such associations and such connections could be non-random, especially if the SNPs or markers or parts of the genome are very close to each other. Such non-random association is called linkage disequilibrium or LD in a shortened abbreviated form. And there are wide-ranging applications of LD within genomics. In the following videos, we will continue with the exact measurement techniques of the linkage disequilibrium. Also, I will provide a very nice hands-on example but, of course, also show the way how the LD is computed for large data sets using PLINK. For today, I thank you for your time and see you at another video on the Genomics Bootcamp Channel. I wish you a very nice day.


Measuring linkage disequilibrium

Title: How to measure linkage disequilibrium?

Presenter(s): Gábor Mészáros, PhD (Institute of Livestock Sciences (NUWI), University of Natural Resources and Life Sciences)

Gábor Mészáros:

Hello everyone, welcome back to the Genomics Boot Camp. Today we will continue to talk about linkage disequilibrium. Before we start, however, I want to thank you for the 500 subscribers or the over 500 subscribers on the Genomics Boot Camp channel. It seems that there is a continued and ever increasing interest about the contents of the channel, which is, of course, very good news for me and good motivation to move forward. Of course, if you are not subscribed yet. you can use this opportunity to do so. So thank you, thank you, thank you, and without further delay, let’s move on to further discussions about linkage disequilibrium.

Linkage disequilibrium

So let’s start with a little bit of activation of our prior knowledge. Namely that single nucleotide polymorphisms, or SNPs, exist, and these are actually the main marker types we speak about on this channel. Linkage disequilibrium, as such, also exists, and we spoke about this in the previous video. And in short, it actually measures the non-random associations between SNPs, and there is wide-ranging applications for LD in genomics, but for today’s question: how to measure linkage disequilibrium?

We saw this graph in the previous video, and basically what we see is that some of the allele combinations are appearing much more frequently than others, thus leading to linkage disequilibrium.

This non-random association between some alleles of a different loci can be visualized in such a two by two, table where we have on one side locus A and with the alleles, capital A and small a, and on the top side, locus B with alleles, capital B and small b. Each of the combinations is then represented with a proportion, so basically the pAB is the proportion of the combinations for these alleles. If we are interested, for example, in the total proportion of the allele A, then we have it in the right side. The same we can tell about the proportion of all the other alleles, so the  lowercase a, the capital B, and this lowercase b. There is one rule if we are talking about biallelic systems that the sum of the proportion of the two alleles sums up to 1.

In case of linkage equilibrium, when the loci A and B are not linked, we can actually compute the genotype proportions based on the allele proportions. This is shown, for example, in the first line, and we can actually do the same for all the other genotypes. One additional comment here, so we use this feature of this previous 2 x 2 tables. So I told you that basically the proportions for the two alleles are summing up to 1, so basically the pb we can also express as 1 - pB. This is, of course, a preferred setting, because we reduce the number of unknowns from four to two. So basically, we just have to work with two variables: pA and pB. Now this situation is valid for the linkage equilibrium as I mentioned, so when the alleles are unconnected.

In case of linkage disequilibrium, however, these equations do not hold true. In other words, from the proportions of the alleles, we cannot compute the proportion of the genotypes. We cannot compute this, because there is a difference between these two metrics. That is the disequilibrium coefficient or D, and it can be computed as the proportion of the AB together minus the proportion of A times proportion of B, and when we include this equilibrium coefficient to the equations as before we get the proportions of the genotypes. So in other words, this D or this equilibrium coefficient is a measure of the linkage disequilibrium or a coefficient that can be used to express it.

D prime

But we have a bit of a problem with this this equilibrium coefficient because it’s a bit hard to interpret and the sign is arbitrary so it’s depending on which allele actually you consider first. Of course, there is a common convention to set the capital A and capital B to be the most common allele and the smaller case a and b to be the rare allele. But of course, the rarity of the alleles, or the minor allele frequencies, can change from population to population, even within the same species, so that is not such a big win after all.

Now there is a better version of this D or the disequilibrium coefficient called the D’. That is a scaled version that is computed as shown in this slide. So basically, it is divided by the minimum of these two values, in case the disequilibrium coefficient is lower than zero, or the minimum or of these two values if the disequilibrium coefficient is higher than zero. Now we end up with a coefficient, which is another, a more popular version of the linkage disequilibrium and ranges between minus one and plus one. Extreme values, in this case, imply that at least one of the haplotypes was not observed.

It has several advantages that is the D’ = 1 or D’ = -1 means that the two SNPs are not separated by recombination, so that they are in a complete LD. It also means that if the earlier frequencies of the two loci are relatively similar then the high D’ means that the markers are good surrogates for each other, but in this case, we have also some disadvantages. namely that the D’ estimates are inflated. In case, the sample size is small, and the D’ estimates might be also inflated when one of the alleles is rare.

Correlation

Then there is a follow-up question if there is a more intuitive way to measure linkage disequilibrium, to which the answer is yes. And that it can be measured straight with the correlation coefficient, or an adapted value anyway. This correlation coefficient as we would expect expresses the mutual relationship between the alleles of the two loci.

Now a little bit of refreshment on correlation, so this is the correlation as we know it from the statistics. So, it has a standardized abbreviation of lowercase r and ranges from -1 to +1. Basically there could be different values for the correlation coefficient, which describe how the two variables behave in relation to each other. So basically, we have an x and a y, and if, with the increasing x, the y decreases, then we have a negative correlation coefficient. And if the two values are increasing, so, with the increasing x, also y increases, then we have a positive correlation coefficient. In between -1 and 1, so the extreme values, so the correlation coefficient expresses the strength of such relationships. And of course, it could be situations when the two values are not correlated, so then the correlation coefficient is zero.

Now for linkage disequilibrium, we use the squared correlation coefficient, so this is the r2. So the squared correlation between the markers, and therefore, it ranges between 0 and 1. r2 of one implies that the markers provide exactly the same information, and the r2 of zero implied that the two markers are not connected at all, so independent from each other. So, in other words, the r2 measures a loss of efficiency when a marker A is replaced with the marker B.

Now the r2 value could be computed from the allele frequencies themselves and what we need is actually just 3 values from here: the joint appearance of A and B, the proportion of capital A, and the proportion of capital B.

Based on these values, we can put them into this very nice equation, and we can compute the correlation coefficient r2. Now I realize that there is quite a jump from the correlation itself to this equation, and I provide actually all the details in an additional video, which I call the advanced or “advanced” video, so you can look at it if you’re interested. It is already uploaded on the Genomics Boot Camp channel.

And once we have our measure of linkage disequilibrium, of course, we can compute the pairwise linkage disequilibria between markers.  Such it was done in Qanbari et al., 2010, for this example. So in here, what we have on the y-axis is the linkage disequilibrium, measured in r2. On the x-axis, there is a distance between the markers, in this case, measured in morgans, which can be, of course, translated to megabases. So this would be one megabase, two megabase difference, three, four, and five megabase difference. Each dot on this graph is a linkage disequilibrium value for a marker pair, so you see that if the two alleles are very close to each other, then the linkage disequilibrium could be very high up to 1, but then this value quickly goes down to a lower value as the distances between the markers or the genomic distances between the markers increase. This sudden decrease of the average linkage disequilibrium between markers is shown with this dashed line, and in fact, its shape is very typical for all the organisms. We refer to this sudden decrease as linkage disequilibrium decay.

The shape of this LD decay is, in general, similar between the organisms, but the starting values and the further developments can be different when it comes to different species or, in this case, as shown in Pérez O’Brien, 2014, it could be different even when we consider Bos taurus and Bos indicus breeds. This, of course, has some relevance when the LD is used for further genomic analysis as discussed in the last video.

Summary

So, to summarize this video, the LD characterizes the degree of relationships between the nearby loci. There are various methods and possibilities to measure it. One of them is the disequilibrium coefficient denoted by D or as a D’ - that is a metric to use to quantify the LD but it has some disadvantages. And then the other more commonly used metric is the r2, or the squared correlation coefficient. That is a robust measure of the LD. So this is the end of this video. If you’re interested how we arrive to the equation to compute the r2, then be sure to check out the next one, a follow-up to this, and if you say that is enough for today, then of course, I thank you for your time and wish you a very nice continuation of the day.


(Advanced) Measuring linkage disequilibrium

Title: How to measure linkage disequilibrium? (ADVANCED)

Presenter(s): Gábor Mészáros, PhD (Institute of Livestock Sciences (NUWI), University of Natural Resources and Life Sciences)

Gábor Mészáros:

Hello, everyone. Welcome back to another video on the Genomics Boot Camp. This time is a follow-up or an add-on to the previous video how to measure linkage disequilibrium. This video is entitled “advanced,” because we have a bit more equations here as on the average of the channel. The reason I did not include these ones into the previous video is that perhaps not everybody is interested or they don’t like to see, you know, too many equations in an educational video. Therefore, I thought it makes sense to divide the two, but you are here, you are interested, so let’s get to it. I repeat here a few slides that were in the original video in order to provide context, but we will go through these ones in a quicker manner.

R-squared

We start with this video where we ended last time, and this is our prior knowledge now. And the most important part is the r2, so there we have the commonly used robust measure of LD in the previous video. I was just including the summary equation, but in this video, we will have more information on how we arrive to this r2.

So again we have the correlation coefficient from a general statistics and abbreviated as a lowercase r, ranges from -1 to +1, and there are various values of the correlation coefficients that are expressing the relationships between the two variables x and y.

Of course, the correlation coefficient is also computed somehow and it is actually computed as a covariance between x and y divided by the square root of variance of x and variance of y. If we want to write out this one in a further detail, so we have this would be the covariance between x and y and the variance of x and the variance of y in these brackets. Basically, the goal of this video is to provide you with the link between this equation and the r2 equation to compute the linkage disequilibrium that was shown at the end of the last video and also will be shown more times also in this one.

So for refreshment we have the r2 value that is the squared correlation between the markers, therefore ranges between 0 and 1. And the value of 1 implies that the markers provide exactly the same information, while the value of 0 for the r square implies that the markers are not connected at all to each other. Therefore, it measured the loss of efficiency when a marker A is replaced by a marker B.

The r2 itself could be computed based on the allele frequencies from this 2x2 table, so we have a locus A and the locus B and each of them has two values capital B and small b and a capital A and a small a. So these values are valid for a biallelic system - so these kind of computations are valid only in a case that the markers have a two alleles, but because the SNPs on the SNP chips are biallelic by design, then we are good to go. There are, in fact, only three values that we need; that is the joint occurrence of A and B and the respective frequencies of capital A and capital B.

Now we have the r2, which is the measure of the linkage disequilibrium and that can be computed with this equation. Previously, it was stated that the correlation itself, so the r, could be computed with this equation. So basically, the main question of this video is how to get from this to this.

Now to answer this question, we have to reformulate our example a little bit. And we consider two loci A and B again on one gamete but each with a possible random realization. We give the value of 1 to locus A with the probability of pA and the value of 0 with the probability of p 1 minus p­A , so this is basically the other allele. So this is the pa and pA and then this is the pa, but because pa and p­A equal to one, so the sum of them is equal to one. So then the pa could be expressed as 1 minus pA. The very same thing happens with the locus B, so we assign a value of 1 with the probability of B and the value of 0 with the probability of one minus pB.

So this is a very particular example that we will follow up on in detail in the next video, but still I thought it’s useful to show it here so perhaps it’s eases some understanding. So basically what we have are two loci here: locus 1 and locus 2. So we have alleles capital A, lowercase a, and for locus 2, it’s  alleles B and lowercase b. And basically what we do is we replace with the capital A with that ones and the lowercase a is with the zeros, so then it looks like this. So basically. we just count how many capital A’s we have, so that would be a 5. And how many capital B’s we have that would be a 6. And the joint occurrence of capital A and capital B would be 4 in this particular example.

Now, of course, we are interested in in a general example, so we introduce a population size of N and we can compute the proportion of the allele A as the sum of occurrence of allele A divided by the population size N. The same way, we can compute the proportion of B with the sum of the allele capital B divided by the population size N. And the joint occurrence of A and B we can compute when the jointly occurring A and B together divided by the population size N.

Follow up

Okay then the follow-up is what is the correlation between the realizations of 0 and 1 of a random variable at the two loci, and here comes again the equation for the correlation coefficient so that is the r. This equation basically has three parts, so that is one in this top, so basically the covariance, and the variance of x and the variance of y.

And basically we have all these parts figured out already in the previous slide as shown here, so it can be replaced by N times the joint occurrence of A and B. And the sum x can be replaced by N times the proportion of A and the y could be replaced as the N times proportion of B. And similarly, all the other parts could be replaced and then, in some cases, also adjusted, so we end up with such a beautiful and much more simple equations. To provide you a link to the next slide, I also color code them, so basically we have this orange, the green, and the blue part, which we then put back to the original correlation coefficient equation.

So this is the top part here is the correlation coefficient, and then we put in the orange, the green, and the blue part. And then you see here that the N is present all the way, so it can be actually removed from the equation and then only this part remains. So this part is basically the simple correlation between two loci 1 and 2 containing the alleles A and B, and if we square that equation, because we are interested in r2 rather than a simple r, so then we have this correlation or a squared correlation coefficient that is the measure of the linkage disequilibrium as we have shown in the previous video and also in this one.

Summary

So, for the summary, the r2 is a commonly used and robust measure of LD, and the computation of the r2 is adapted based on the equation for the correlation coefficient. And for the allele and genotype frequencies in reality, of course, we talk about computation of linkage disequilibria between a large number of markers or marker pairs, so we use software for it, for example, PLINK or other software. And also, in the follow-up videos, we will actually show how it is done using PLINK itself. For today, I thank you for your time, thank you for your interest also in this advanced content, and I wish you a very nice continuation of the day.


Computing linkage disequilibrium

Title: Compute linkage disequilibrium (Part 1)

Presenter(s): Gábor Mészáros, PhD (Institute of Livestock Sciences (NUWI), University of Natural Resources and Life Sciences)

Gábor Mészáros:

Hello, everyone. Welcome back to another video on the Genomics Boot Camp back again with another video on linkage disequilibrium. This is the first part of the finishing two when we are actually computing the linkage disequilibrium this time by hand. I believe this is an interesting piece of knowledge, so it’s worthwhile to look at.

So what we know so far. So LD characterizes the degree of relationship between two loci as always, and r2 is the commonly used measure of the LD computed with the equation below.

And here we are at the first of the two examples I will show you today. The first one is the very basic one, and actually you might recall this 2x2 table from the previous videos. When we look at the r2 equation, it is clear what we need to compute, and that is the proportion of A, proportion of B, and the proportion of the joint occurrence of A and B. Therefore, we are interested in what is in this segment or cell, in this other one for the B, and the joint occurrence of A and B together. Because we are speaking about proportions, we need to divide them with the sum of all alleles. So, this is already how we compute the proportions. So, the proportion of A is the 5 divided by 10, so 0.5. In this case proportion of the B allele is 6 divided by 10, so 0.6. And the joint occurrence is 4 divided by 10, 0.4. When we have our usual equation, we basically have all the unknowns, so the proportion of A, B, and AB and we put them in, and after the computation, we get the LD between these two loci A and B as 0.167.

So after this warm-up example, I show you the real deal. That is this LD calculation exercise that was shown by Professor Henner Simianer at the “Livestock Conservation Genomics Data Tools and Trends” workshop back in 2012 in Pag Island, Croatia. I really like this example, because it puts the LD calculation into context of haplotypes and all the allele combinations using SNPs. So first, I give you the initial example how it was given back in 2012. Then I give you the breakdown of the example, and you will have the chance to stop the video and do the computations yourself and, after that, you can look up the solution in this video.

So the exercise looks like this, so we have our genome here and we have a four SNPs. First one is an AT SNP, second one a CG SNP, third one is also a CG SNP, and the fourth is an AT SNP. We have also our population, and these are already the haplotypes, where this haplotype for our genome or part of the genome appears 17 times, this other one 14 times, 3 times, 3 times, 2 times, and this haplotype appears just 1 time in the population. The exercise is to calculate the LD between all pairs of loci. Now in this video, I will show you how to compute LD between locus 1 and 2 in detail, then I will show you the solutions for the two other pairs of loci and the last three you can follow up yourselves if you want.

So to give you a bit of a head’s up information, I list here the loci that you need to compute the proportions for, so for locus 1 is the A, the locus 2 is a C, locus 3 is also a C, and locus 4 is an A. You need to compute it for these alleles specifically, because these are the major alleles for these loci. So, if you count them up, so the A is the major allele for locus 1 and the C is for locus 2 and the C is also for locus 3 and A for locus 4, and obviously then you need to compute the joint occurrences between the two loci that you are following up and you want to compute LD for.

So, if you actually have all these proportions, you can compute the LD between all pairs or actually for pairs 1-2, 1-3, and 1-4 as we will list it in this video. And putting these proportions into the well-known equation, then you get the r2 value for that pair of loci. Now, if you want to go ahead and do the calculations yourself on with pen and paper, then this is the moment to pause the video and well just go ahead with the computations. And after you are done or at least you are done with the locus, for example locus 1-2, then to resume the video and see if you have done well or if there is something to be corrected. Okay, so we are continuing with the solution in 3… 2… 1…

Now actually what needs to be done is pretty similar to the previous smaller case, but in this case, you need to build up the 2x2 table yourself, so here we have a locus 1 with an AT SNP, a locus 2 with the CG SNP. And because we told that the A is the actual major locus, then you actually need to count them up, so basically, you need to fill up this spot in the 2x2 table. For the locus 2, the C is the major locus, so you need to fill up this spot on the 2x2 table. And of course, the joint occurrence as well.

So how to do that? Of course, we will use the actual haplotype numbers as they are given in the exercise. Actually, right now, we are looking at locus 1 and locus 2, so basically these two columns are interesting for us, so we can actually hide the other ones so that they don’t bother us too much.

Now perhaps the first thing you might consider is that, well, we need to calculate the proportions, so we basically need to know how many haplotypes are there total. So basically, we add up 17 plus 14 plus 3 plus 3 plus 2 plus 1 and, when we do that, we actually arrived number 40, so this is the total number of haplotypes now.

For locus 1, we need to count how many times the allele A appears, so in this haplotype, there is an A so, it’s 17 times. So it’s 17. Here in the locus one is a T, so that is we don’t count. This also there is an A here and A here, here. There is a T here, so we don’t count this, and the A here. So basically, we have a 17 plus 3 plus 3 plus 1 - that is together 24.

Following the same logic for locus 2, in this case, we count the C allele. So here we have a C, so we count this. Here we have a G, that so we don’t count. This count, this, this, and this, so everywhere where is the c, so there is 17 plus 3 plus 3 plus 2 and that is together 25.

The last thing we need to find out that how many times the A and C appear together so here the A and C appear together, so that we count, this we don’t count this, but also AC here, AC here. Here although there is a C, but there is not an A, so we don’t count this part. And also here is an A but there is a G here, so it’s not an AC combination, so we don’t count this. So basically, we count this 17, these three, and this three, which is together 23.

And because we are actually interested in proportions rather than the numbers themselves. So we divide for the locus 1, the proportion of A is 24 divided by 40 - that is 0.6. And therefore, locus 2, the proportion of C is 25 divided by 40 - that is 0.625. And the joint occurrence the same way, so the 23 divided by 40 and that is this number 0.575.

So with this, we already have everything. So we have a proportion of A, proportion of C, and the joint occurrence, so we have our beautiful equation here. So this is the joint occurrence and this would be the proportion of A and proportion of C, so we put them in into the equation itself and then we end up with an LD between locus 1 and locus 2 as 0.711, so this is the LD between these two loci.

On this slide, I give you all the results for the other loci, so locus 1, 2, 3, and 4 and the joint occurrences of these three pairs. And if we fill them in into the equation, the well-known equation, then we end up with – well, this is just a repetition from before - and then there is the LD between locus 1-3 and the LD between locus 1 and 4, so this is basically it this is how you compute LD by hand.

So, for the summary, an example of the LD computation by hand was shown, which is not a very widespread piece of knowledge, so I argue that is really good to have it in your inventory. But on the other hand, we also have to note that this manual mode of computation is not very effective. For that, we need to use software solutions that are able to handle computations of LD between thousands and hundreds of thousands of loci, and such computations will be shown in the next video. For today, I thank you for your time and have a very nice rest of the day.