Chapter 3.1: SNP array genotyping (Video Transcript)

Title: SNP Chips (Introduction to genomics theory)

Presenter(s): Gábor Mészáros, PhD (Institute of Livestock Sciences (NUWI), University of Natural Resources and Life Sciences)

Gábor Mészáros:

Introduction

Hi everyone. Welcome back to the introduction to genomics lecture series. We continue now with the second part, and we will talk about SNP chips. Before we do so, a quick refresher on the previous lectures. We talked about DNA and its structure, which is based on certain building blocks and a set of rules for how these building blocks connect. We also established that if we cannot look at everything at once, molecular markers are good surrogates, because we actually know their genotype and their exact position on the genome. They are also connected to genomic regions of interest, for example genes, that influence the traits we are actually interested in. There are multiple possibilities for genomic markers, but the most widespread and most widely used ones are the so-called “SNP markers”.

SNP chips

So, a quick refresher on this as well. SNP markers are single nucleotide polymorphism markers, that is, single base pair positions that differ between the genomes of two individuals. So, here we have Individual 1 and Individual 2, and when we compare their sequences we find that most of the sequence is identical, except for some variants. One type of these variants are the SNP markers, which are single base pair mutations. The good thing about these SNP markers is that there are many of them throughout the genome, so we can cover the entire genome with them and use them to our advantage. Now, there are really, really a lot of these SNP markers, in the millions, and not all of them appear consistently within populations. So what we really want to do is identify just those SNP markers that appear consistently, so we can genotype them all the time and analyze consistent data from many individuals. If we have a standardized, consistent set of SNPs, we can genotype them in a straightforward and cost-effective way. This cost-effective way is genotyping the standardized set of SNPs with so-called SNP chips. These SNP chips go by multiple names, so beadchip, beadarray, SNP chip, and microarray all basically mean the same thing.

So this is what the SNP chip looks like. As I mentioned, they have multiple names, but one thing is common: the SNPs selected for them are biallelic by design. For example, there is allele A and allele B, so we have three possible genotypes: homozygous AA, homozygous BB, and heterozygous AB. If we look at these SNP chips very closely, we find that there are hundreds of thousands of tiny wells on them, as shown on the right side of the screen. In these wells there are beads, hence the name beadchip or beadarray. What actually happens is that each of these wells and beads is coated with multiple copies of an oligonucleotide probe targeting a very specific locus on the genome. Therefore, each of these wells, and each of these beads, is designed to capture a very specific SNP for the particular species for which the SNP chip was developed.

Now how does it work? Obviously we need the DNA that we want to genotype, and these DNA fragments pass over the beadchip. Each probe binds to a complementary sequence in the DNA, stopping one base before the locus of interest. After that comes a single-base extension that incorporates one of the four labeled nucleotides. These nucleotides are very special, because when they are excited by a laser, so when the laser shines on them, they emit a specific signal, and the intensity of the signal conveys information about the genotype at that particular locus, or in that particular bead. This is shown in the picture on the left side here. Here we see the wells and also the beads, here is the sequence, and at the end of each sequence there is the labeled nucleotide. These are three SNPs here, each with its RS code, and for each of these beads there is a certain genotype that emits a certain signal. If it is homozygous, it emits one signal very strongly and not the other one. Similarly, for a different locus there is a different homozygous genotype, so it again emits a different signal, but again just on the one side, and for a heterozygous genotype the signal intensity is somewhere in between the extremes.

Now this is how the SNP chips look closer to reality. Basically we have these lanes here, and you see that a tiny fraction of a lane is magnified, and there you see these tiny dots. Each of them is a well with a bead that is emitting some kind of signal. Of course, these signals do not tell us anything just by looking at them. They need to be analyzed in a very specific way, so that we know the exact meaning of the signal at each of these dots. This analysis is done by a genotype clustering algorithm that automatically clusters the samples into two homozygous groups and one heterozygous group. There are circles around each cluster where the genotypes should fall, there are also wider shaded areas where we still accept the genotype calls, and samples falling outside even these shaded areas are the ones that are not given a genotype. This is what I had in mind. This is such a graph for a single SNP. Each dot here is an individual genotype for that particular SNP, and here are the circles. So this would be one homozygous, the other homozygous, and in between them are the heterozygous genotypes. Whatever falls into these circles is fine and is called as that genotype. You also see these wider shaded areas; they are still OK, so an individual that falls into this area is still called, for example here, as heterozygous. There are some individuals that are outside of these areas, for example this one and this one. In this case the algorithm is not certain about the actual genotype call, and this is how we get these so-called “missing” SNPs or “missing” calls in our data. So something just went wrong, and rather than giving a very inaccurate result, the genotyping algorithm decides not to call this SNP and records it as missing.
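To make the idea of called and "missing" genotypes concrete, here is a minimal Python sketch. It is not the algorithm used by any genotyping software: real callers learn the clusters from many samples, whereas this toy version uses fixed windows. The function name, the angle transform, and the thresholds `window` and `min_intensity` are all assumptions chosen for illustration.

```python
import math

def call_genotype(signal_a, signal_b, window=0.15, min_intensity=0.2):
    """Toy genotype call from the two signal intensities of one SNP in one sample.

    Returns "AA", "AB", "BB", or None for a no-call ("missing").
    Thresholds are illustrative assumptions, not values from any real software.
    """
    # Too little total signal: refuse to call rather than guess.
    if signal_a + signal_b < min_intensity:
        return None
    # Map the pair of intensities to an angle between 0 (only A) and 1 (only B).
    theta = (2 / math.pi) * math.atan2(signal_b, signal_a)
    # Fixed cluster centres; a real algorithm estimates these from the data.
    centres = {"AA": 0.0, "AB": 0.5, "BB": 1.0}
    for genotype, centre in centres.items():
        if abs(theta - centre) <= window:
            return genotype
    # Between clusters: the call is uncertain, so it is recorded as missing.
    return None

for pair in [(1.0, 0.02), (0.5, 0.5), (0.02, 1.0), (1.0, 0.4), (0.05, 0.05)]:
    print(pair, call_genotype(*pair))
```

A sample with a strong signal on only one side is called homozygous, a balanced signal is called heterozygous, and a sample that lands between clusters or has too little signal gets no call.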

On a SNP chip we have pre-selected SNPs, so we have SNPs that work very well, as in this case, where we can clearly determine the homozygous, heterozygous, and other homozygous genotypes. I also show this example of a so-called bad SNP, just for comparison. So there are also cases like this. Again we have the circles here, the homozygous, the other homozygous, and the heterozygous, but you see that this is somewhat problematic. Here, some of the genotype calls are really, really close to one another or even overlapping, so if an individual falls somewhere here, for example, it’s not really safe to determine whether it is heterozygous or homozygous. So there are also SNPs like this. They are generally problematic, but they do not appear on the SNP chips, because the SNP chips we will talk about and analyze are basically these sets of pre-selected, well-working SNPs.

After this genotyping process is done, basically everything gets transferred to a text file that is called the final report. Now, I have already made a few videos about these final reports, and you can find them on the channel. But the short story is that everything from the genotyping routine is saved in this final report, which is basically a large text file. The genotypes are also part of this final report, and they can then be transferred to other file formats, for example standard PLINK files. These files and data can then be analyzed with either PLINK or various other software, and you can see a bunch of examples of this on this channel.
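As a rough illustration of pulling genotypes out of such a report, the Python sketch below reads a small text with one row per SNP and sample. The layout is an assumption made for this example: the `[Data]` section marker and the column names `SNP Name`, `Sample ID`, `Allele1`, and `Allele2` may not match your file, so check the header of your own final report before adapting it.

```python
import csv

# Hypothetical miniature report; layout and column names are assumptions.
EXAMPLE_REPORT = (
    "[Header]\n"
    "Num SNPs\t2\n"
    "[Data]\n"
    "SNP Name\tSample ID\tAllele1\tAllele2\n"
    "snp1\tind1\tA\tG\n"
    "snp2\tind1\tC\tC\n"
    "snp1\tind2\t-\t-\n"
    "snp2\tind2\tC\tT\n"
)

def read_long_genotypes(text):
    """Return {sample: {snp: (allele1, allele2)}} from a long-format report."""
    lines = text.splitlines()
    # Everything after the assumed "[Data]" marker is a tab-separated table.
    start = lines.index("[Data]") + 1
    reader = csv.DictReader(lines[start:], delimiter="\t")
    genotypes = {}
    for row in reader:
        per_sample = genotypes.setdefault(row["Sample ID"], {})
        per_sample[row["SNP Name"]] = (row["Allele1"], row["Allele2"])
    return genotypes

genotypes = read_long_genotypes(EXAMPLE_REPORT)
print(genotypes["ind1"])
```

Collecting the rows per sample like this is the first step towards a one-line-per-individual format.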

Notes on data handling

This series of videos is supposed to be more on the theory side, so I don’t want to spend too much time on this right now. If you are interested in the practical applications there are lots of other videos on the channel, but I would still mention that this is what the data then looks like. Here each line is one individual, and here we have the actual genotypes. And, of course, we also know the locations of these SNPs: we know which chromosome they are on, their exact base pair position, and their name, so we can conduct routine analyses of various kinds. Afterwards, when we have our data, we can transform them, using appropriate methodologies, into some kind of signals, and these signals might reveal something about the organisms, individuals, or populations we are interested in.
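The "one line per individual plus SNP locations" layout can be sketched with the PLINK text formats. In a `.map` file each line holds chromosome, SNP name, genetic distance, and base pair position; in a `.ped` file each line holds six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two alleles per SNP. The tiny example data and the function names below are made up for illustration.

```python
# Made-up example content in PLINK .map and .ped layout.
EXAMPLE_MAP = "1 snp1 0 10500\n1 snp2 0 60200\n"
EXAMPLE_PED = "fam1 ind1 0 0 1 -9 A G C C\nfam1 ind2 0 0 2 -9 0 0 C T\n"

def parse_map(text):
    """Return a list of SNP records: chromosome, name, base pair position."""
    snps = []
    for line in text.strip().splitlines():
        chrom, name, _genetic_distance, bp = line.split()
        snps.append({"chrom": chrom, "name": name, "bp": int(bp)})
    return snps

def parse_ped(text, n_snps):
    """Return {individual ID: [(allele1, allele2), ...]} in map order."""
    individuals = {}
    for line in text.strip().splitlines():
        fields = line.split()
        iid, alleles = fields[1], fields[6:]
        if len(alleles) != 2 * n_snps:
            raise ValueError(f"{iid}: expected {2 * n_snps} alleles, got {len(alleles)}")
        # Pair up consecutive alleles into one genotype per SNP.
        individuals[iid] = [
            (alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)
        ]
    return individuals

snps = parse_map(EXAMPLE_MAP)
individuals = parse_ped(EXAMPLE_PED, n_snps=len(snps))
for snp, genotype in zip(snps, individuals["ind1"]):
    print(snp["name"], snp["chrom"], snp["bp"], genotype)
```

Because genotypes are stored in the same order as the SNPs in the map, each genotype can be matched to its chromosome, position, and name.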

Now, when it comes to handling genomic data, we do not rely entirely on our knowledge of biology. We are talking about large datasets, and these large datasets are handled exclusively with computers and various software, so I dare to say that some degree of knowledge of computers or information technologies is also really, really useful if you want to do serious research in this area. I’m not saying you need to be a hardware or software expert, but you still need to know the basic jargon and know your way around computers and also computer servers. It is really useful to have this kind of knowledge in the long run. When it comes to genomics, in our daily work we deal a lot with software because, as I mentioned, it’s really not possible to analyze this type of data by hand. Programming skills are useful. Well, I say here essential, but maybe I would rephrase that: it’s really useful, and maybe not programming but scripting. If you’re really serious about this kind of work, you really need to know some kind of scripting language, and you need to be able to write some really basic scripts that tailor or modify the data the way you want, or to be able to run software that you have not used before, all kinds of things. So you need to have some knowledge of computers and how to use them.

Depending on what you do you can rely on your own scripts, but there are also a ton of established programs that do all kinds of things. So, especially at the beginning, there’s really nothing wrong with relying on these established programs or packages that do the stuff that you want. For any given methodology or approach there is a large number of approaches and software solutions, so I would really encourage you to look around and see which ones fit your needs the best way. But at the end, we will all come back to the same thing, so we will come back to large text files that will have SNP genotypes in them, which can be homozygous, other homozygous, or heterozygous. So this is a different kind of graph, don’t worry about that, but basically what we are after are these SNP chips and SNP genotypes in a text format that we need to analyze.

Allele and genotype codes

In these large text files with the genotype data, you might find alleles and genotypes in different types of coding, and I want to detail these coding types on this slide. One of the most common ones is, of course, nucleotide coding. We know that DNA consists of four nucleotides: guanine, cytosine, adenine, and thymine. Their abbreviations, G, C, A, and T, are used for this type of coding. Now, you might notice that TOP format coding is noted in brackets here, because on the SNP chips, for some reason, there are two types of nucleotide coding, usually called TOP and FORWARD. Both of them are nucleotide coding, so you would see the same type of codes, but genotypes for the same SNPs could be denoted a bit differently in TOP and FORWARD coding. If you analyze a single population this is not a problem, so you don’t need to care too much which coding it is. The question of TOP and FORWARD allele codes comes into play mostly when you want to merge datasets. Again, there is a video on data merging on this channel, so if you are really interested in that, I would encourage you to look up that video. But for right now, it is just information that there is nucleotide coding, and that there could be TOP and FORWARD coding on the SNP chips.

Now, I mentioned that each SNP on the chip is biallelic, meaning that there are exactly two alleles possible for each SNP. So you can simplify things: you don’t need four letters, or four possibilities, because each SNP is only biallelic, so you can recode or rename one allele as A and the other allele as B. So there is another type of coding, character allele codes, with so-called AB coding. Also, sometimes you need to use programs or approaches where it is necessary, or otherwise somehow beneficial, to store the allele codes not as characters but as numbers. In this case, the numbers used very often are 1 and 2: “1” is one of the alleles of the SNP, and “2” is the other allele. Sometimes you can also come across numeric allele coding that uses “0” for one allele and “1” for the other. So in all cases, after you get the genotype file, you look up which coding style is used, and you also need to make sure you know what these allele codes mean in your particular case, because it could be different, and there is no single rule or scheme that is used all the time. There are some schemes that are used more often, but of course this doesn’t guarantee that the file you have uses these particular allele or genotype coding conventions.
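The recoding between nucleotide, AB, and numeric allele codes can be sketched in a few lines of Python. This is only an illustration: the function name and the `SCHEMES` table are made up, and which nucleotide counts as the first or second allele of a SNP is an input here that has to come from the documentation of your own data.

```python
# Target allele codes for the first and second allele of a biallelic SNP.
SCHEMES = {"AB": ("A", "B"), "12": ("1", "2"), "01": ("0", "1")}

def recode_genotype(genotype, first_allele, second_allele, scheme="AB"):
    """Recode a nucleotide genotype such as ("C", "T") into another scheme.

    first_allele / second_allele say which nucleotide maps to the first and
    second code of the scheme; this assignment is an assumption of the caller.
    """
    code_first, code_second = SCHEMES[scheme]
    mapping = {first_allele: code_first, second_allele: code_second}
    try:
        return tuple(mapping[allele] for allele in genotype)
    except KeyError as err:
        # A third nucleotide or a missing code ends up here.
        raise ValueError(f"unexpected allele {err.args[0]!r}") from None

for scheme in SCHEMES:
    print(scheme, recode_genotype(("C", "T"), "C", "T", scheme=scheme))
```

The same heterozygous genotype comes out as A/B, 1/2, or 0/1 depending on the scheme, which is why checking the coding of a new file comes first.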

Also, while I mentioned all these allele codes, there are also the missing alleles that I mentioned before. These are often coded with “0”. Of course, if the numeric coding is 0/1, then missing is coded as something else, but many times the missing alleles are coded “0”. So this is the other thing you need to check: first, which codes are used for the alleles, and second, which codes are used for missing genotypes. For example, in the final report I very often come across the missing allele coded as “-”, a minus sign.
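Because the missing code differs between files, it helps to make it an explicit parameter. The Python sketch below computes the fraction of called genotypes for a list of genotypes; the function name and the default missing codes are assumptions for this example, and the default must be changed for 0/1 allele coding, where "0" is a real allele.

```python
def call_rate(genotypes, missing_codes=("0", "-")):
    """Fraction of genotypes in which neither allele is a missing code."""
    if not genotypes:
        raise ValueError("no genotypes given")
    called = sum(
        1
        for allele1, allele2 in genotypes
        if allele1 not in missing_codes and allele2 not in missing_codes
    )
    return called / len(genotypes)

# Nucleotide coding with "0" or "-" as missing.
print(call_rate([("A", "G"), ("0", "0"), ("G", "G"), ("-", "-")]))

# 0/1 allele coding: "0" is an allele here, so only "-" counts as missing.
print(call_rate([("0", "1"), ("0", "0"), ("-", "-")], missing_codes=("-",)))
```

Using the wrong missing code in the second case would silently drop valid genotypes, which is the kind of mistake the check described above is meant to prevent.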

Now, on the previous slide I mentioned allele codes, but again, I underline that the SNPs are biallelic, meaning that there are two alleles that make up a certain genotype, and this could be homozygous for one allele, homozygous for the other, or heterozygous. Again, depending on the allele coding type, there could be different codes for the genotypes. This would be an example of nucleotide coding. This would be an example of AB coding, so AA, AB, and BB. In the case of numeric coding, when the allele codes are 1 and 2, these are the genotype codes. Here, I would underline that this is not pronounced “eleven”, “twelve”, and “twenty-two”; we refer to these genotypes as “one-one”, “one-two”, and “two-two”. There is also a different type of genotype coding where you use just one number for each genotype, and it is customary to have it as 0, 1, and 2. These numbers are used because they are the numbers of so-called “2” alleles. So “0” is used for the genotype 1/1 because there are zero “2” alleles, the heterozygous is often denoted as “1” because in 1/2 there is just one “2” allele, and 2/2 is denoted by “2” because there are two “2” alleles. And obviously, if this type of genotype coding is used, the code for the missing genotype must be something other than zero, because zero is already used for one of the homozygous genotypes.
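The 0, 1, 2 genotype coding is simply a count of "2" alleles, which the following Python sketch makes explicit. The function names are made up, and representing a missing genotype as `None` is a choice made for this example (any value other than 0, 1, or 2 would do).

```python
def count_code(genotype, counted="2", missing="0"):
    """Number of `counted` alleles in a 1/2-coded genotype, or None if missing."""
    allele1, allele2 = genotype
    if missing in (allele1, allele2):
        return None
    return int(allele1 == counted) + int(allele2 == counted)

def allele_frequency(codes):
    """Frequency of the counted allele, ignoring missing genotypes."""
    observed = [code for code in codes if code is not None]
    if not observed:
        raise ValueError("no called genotypes")
    # Each individual carries two alleles, hence the factor 2.
    return sum(observed) / (2 * len(observed))

genotypes = [("1", "1"), ("1", "2"), ("2", "2"), ("0", "0")]
codes = [count_code(g) for g in genotypes]
print(codes)
print(allele_frequency(codes))
```

One practical benefit of this coding is that the frequency of the "2" allele falls out directly as the sum of the codes divided by twice the number of called individuals.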

SNP chip types

So the SNP chips are specific for each species, and here I show the possible SNP chip types in cattle. I mention cattle as the first species because, well, I work mostly with livestock, and cattle are the most widely genotyped among livestock species. Because of this, there are also a lot of possibilities in terms of chip types. What we have most commonly is the so-called mid-density SNP chip. Funnily enough, it has about 54,000 SNPs, but it is still referred to as 50K or mid-density. Anyway, this chip is very often used for many purposes, from population genetics to genomic selection. There are also SNP chips that have a higher or lower density, depending on what you want to use them for. The high-density SNP chip has around 800K SNPs and the low-density one around 7K, but these numbers could also be different; I really just put them out as an example. There are also custom SNP chips that might have, for example, any of these as a base and add additional, special SNPs that researchers or breeding organizations are very specifically interested in.

This is just a quick comparison of the 50K and the HD SNP chips in cattle. You see that each of these coloured dots here is a SNP, shown on all of the chromosomes in cattle, and you see that the entire genome is covered. Of course, we have many more SNPs on the HD chip, so the coverage is much denser and the inter-SNP distances are much shorter, but all in all, both SNP chips do the job and cover the entire genome, so we can use these data to conduct various types of analyses.
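A back-of-the-envelope calculation shows how chip density translates into average inter-SNP distance. The SNP counts below are the approximate ones mentioned in the lecture. The genome length of 2.7 billion base pairs is a rough assumption for cattle that does not come from the lecture, and evenly spaced SNPs are a simplification.

```python
def mean_snp_spacing(genome_length_bp, n_snps):
    """Average distance between SNPs if they were spread evenly over the genome."""
    if n_snps <= 0:
        raise ValueError("n_snps must be positive")
    return genome_length_bp / n_snps

# Rough assumption for the cattle genome length, not a value from the lecture.
ASSUMED_GENOME_LENGTH_BP = 2_700_000_000

# Approximate SNP counts mentioned in the lecture.
CHIPS = {
    "low density (~7K)": 7_000,
    "mid density (~54K)": 54_000,
    "high density (~800K)": 800_000,
}

for name, n_snps in CHIPS.items():
    spacing = mean_snp_spacing(ASSUMED_GENOME_LENGTH_BP, n_snps)
    print(f"{name}: about {spacing:,.0f} bp between neighbouring SNPs")
```

Under these assumptions the mid-density chip has one SNP roughly every 50,000 base pairs, and the HD chip one every few thousand base pairs.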

Of course, SNP chips exist for a wide variety of other species, and here I just mention some of these species, and some of these chip types. So there is a lot more on the market, but for you, just to have an idea, I mention a few of these. So there is a human SNP chip with around 900,000 SNPs. In horses, ovine, porcine, companion animals (for example, dogs, cats, and birds), and all these kinds of stuff, there are SNP chips available. Also, there are SNP chips for mice used in all kinds of research experiments. Additionally, in plants, wheat being one of the major crops, and I just included strawberry because I found it funny that there are SNP chips already existing for strawberries. Well, I just wasn’t expecting to find it, so I included it here as a kind of a “cherry on top,” in this case, a strawberry at the bottom of the list.

So, to summarize, there are different SNP chip types, and there are SNP chips for many species. As I mentioned on the previous slide, there are also different manufacturers, so there is at least some kind of competition on the market, which is, of course, very good for price development. There are options you can go for if you want something very specific.

There are a lot of laboratories that are providing the service of genotyping. So actually, you don’t need to have these genotyping machines in your lab. Basically, what you need is just to get the DNA, send it to a laboratory, and they do everything for you, including DNA extraction and genotyping. Then, they send you back the genotype data in a text format that you can then analyze further on.

I also want to mention on this slide the saying that sometimes comes up in relation to genotyping. The saying goes, “In the age of genotypes, the phenotype is King!” This actually points out that nowadays, getting genotypes is really easy. All you need to have is DNA, or even, you don’t need to have DNA but just a biological sample, and you send it into a lab, and you get back the genotypes in a relatively short time. But if you want to have some very specific phenotypes, you might have a hard time getting them.

So, while we are talking about genotypes a lot during these lectures, we shall not forget that phenotyping is also a crucial thing and is really, really important for a range of analyses that we might want to conduct. A general example would be, for example, a genome-wide association study when we want to associate the genotypes with the phenotypes. Obviously, we need those phenotype records. And if we remain in the livestock sector, for example, genomic selection is one of the large areas where we actually rely on phenotypic information as well, including recording and all this other stuff that we will detail in a specific presentation towards the end of this lecture series.

So again, just to summarize the entire process: You get the biological sample, and then you get the DNA out of that. You send it to a lab that uses SNP chips that generates the data, and you can use this data to get some kind of results out of them. And, of course, it depends on what kind of results you are after. You will use the appropriate methodologies, software, and so on. Some of these examples and tutorials are also on this channel, but of course, there is a wide range of possibilities that you could go for.

As for the applications of genomic data, as I mentioned, there are really, really many of them. I mean, when it comes to research groups, they tend to focus on certain types of analyses of genomic data. Some research groups are more after, let’s say, population genomics; others are more focused on genomics of diversity, and still, others may be interested in some kind of GWAS-oriented or selection signature-oriented analyses. So, it really depends on the personal interests of research groups and people.

There are lots of possible applications, and some of them we already mentioned on this channel, and certainly, we will mention others as well at some point. Also, during these lecture series, we will talk about some of these. So, you could use the genomic data to compute the admixture proportions between populations. In the case of crossbreeding, you can compute genomic relatedness. You can use it for genome-wide association studies, selection signatures, genomic selection, genomic inbreeding coefficients, and all kinds of stuff.

We will do everything eventually, but for now, we arrive at the end of this lecture, and I want to end it with a short summary. So, we talked about the SNP markers that are being genotyped with high-throughput machines that determine the genotype of these SNPs in a cost-efficient manner. At the end, what we get are large text files that can be analyzed further. These text files have various ways of expressing the genotypes of these biallelic SNPs, for example nucleotide coding or numeric coding. There are also various possibilities for how the missing data are denoted.

Overall, these SNP chips are a very standard way of how to deal with genotype data in basically all populations, and SNP chips with different densities exist for many species.

So, we end here today. Thank you for the time you spent on this video, and I’m looking forward to seeing you again at the next lecture. So, thank you again, and have a very nice day.