Chapter 9.1: Copy Number Variation (Video Transcript)
Title: How to run Copy Number Variation (CNV) analysis
Presenter(s): Daniel Howrigan, PhD (Broad Institute of Harvard and MIT)
Daniel Howrigan:
Hello, my name is Daniel Howrigan, and today I’ll be talking about how to run copy number variation analysis. In today’s talk, I’ll address a few questions, namely: what is a copy number variant, and how do we detect it with genotype data? What does the CNV file format look like, and what does CNV analysis output look like? And finally, how do I use this to run CNV burden and association tests?
So, what is a copy number variant? Well, I define it here as a subset of structural variation involving a gain (duplication) or a loss (deletion) of genomic sequence. Now, I call it a subset of structural variation because structural variation can encompass basically any change in the length of a genomic sequence, which can be as small as a single base pair insertion or deletion. Most SNPs that we think of are just substitutions, so the actual size of the genome isn’t changed by a single nucleotide polymorphism, but a structural variant can encompass anything from very small changes all the way up to entire chromosomes.
What we call CNVs, being a general name, fall into the category of, at least as defined here, at minimum one kilobase up to submicroscopic events. With array data, we generally mean something around 10 to 100 kilobases, given our sensitivity to detect these CNVs, whereas larger CNVs are usually greater than 500 kilobases up to multiple megabases, after which we get into much larger events. Now, I’m not going to get into the details of the mechanisms that cause them; I’ve listed a few here, and you can look them up. But there are different hotspots in the genome that are more prone to these copy number variants, due to repeat regions or other ways in which the copying machinery can make mistakes and lead to these gains or losses of genomic sequence.
How do we detect CNVs using genotype data? So, when we genotype across the genome, collecting a bunch of SNPs...
I show here the basic mechanism, or at least a figure showing how we take light intensity from the assays capturing either the A allele (AA here) or the B allele (TT here); for a heterozygote, you can see a mix of both red and green. We can leverage these allele intensities in what we call the B allele frequency (BAF), or in this case the frequency of the T allele. Normally, when there’s no copy number variant, the two alleles contribute roughly equally at a heterozygous site, so the BAF sits around 50 percent. When we have a deletion, we should see a loss of heterozygosity, and therefore a gap in the B allele frequencies at any site where you would normally be heterozygous. With a duplication, the ratios shift to roughly two-thirds and one-third, so you would see the heterozygous cluster move in the direction of the pink lines shown here: in a duplication, the B allele frequencies move off 50 percent, but not all the way to zero or one.
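To make that arithmetic concrete, here is a minimal sketch in R (not from the talk) of the expected BAF under each copy-number state, assuming the BAF is simply the fraction of chromosome copies carrying the B allele:

```r
# Expected B allele frequency (BAF) for every possible genotype at a site,
# given the total number of chromosome copies (the copy-number state).
expected_baf <- function(total_copies) {
  b_copies <- 0:total_copies
  setNames(round(b_copies / total_copies, 2),
           paste0(b_copies, "B/", total_copies, "copies"))
}

expected_baf(2)  # normal:      0.00 0.50 1.00
expected_baf(1)  # deletion:    0.00 1.00            (the 0.5 heterozygote band vanishes)
expected_baf(3)  # duplication: 0.00 0.33 0.67 1.00  (the het band splits off 0.5)
```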
Now, that’s looking at the B alleles. The other thing we use to detect CNVs is the log R ratio (LRR), which is basically measuring the total light intensity: when you see a drop here across all of these calls, say a drop of about 50 percent, we have a deletion, and you will subsequently see a rise of about 33 percent for a duplication. Now, I’m not going to get into the details of the callers that are used to detect these copy number variants; suffice it to say, this is the foundation they’re built upon. What I’m going to talk about next is, given that you’ve run copy number variant callers, the kind of QC’d calls you’re moving into the analysis stage with.

Now, I use plink to analyze the CNV data. You can get CNV calls in a bunch of different file formats from different callers, but if you want to use plink, you would convert them into its particular file format, which is the .cnv file. This is basically the individual identifiers, the chromosome, the start and end position of the CNV, the type (1 being a deletion, 3 being a duplication), and then a few other fields, the score and sites fields. In this example, I use the score as, say, the number of copy number variant callers that agreed on that call, which could range up to six, and sites as the number of SNPs used to call the CNV. Note that score and sites are not forced into a particular convention: you could replace score with the number of genes overlapping the CNV, or sites with some other variable you’re interested in measuring. Along with this file, plink creates a cnv.map file, which breaks down the breakpoints of each CNV into something akin to a map file, similar to what we have for SNP data.
And you can see here that every distinct position is mapped: the start, the end position, and also the single position just after the end, because you may want to run an additional test just beyond the end of a CNV to see how the signal changes.
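As a made-up illustration of those two files (the values and marker names are placeholders; the field order FID, IID, CHR, BP1, BP2, TYPE, SCORE, SITES follows the plink .cnv convention described above):

```
# mydata.cnv : one CNV call per row
    FID       IID  CHR       BP1       BP2  TYPE  SCORE  SITES
 FAM001  SAMPLE01    2  50840000  51190000     1      5     84
 FAM002  SAMPLE02   16  29560000  30110000     3      6    112

# mydata.cnv.map : every distinct BP1, BP2, and BP2 + 1 position gets a dummy marker
 CHR           SNP        BP
   2   p2-50840000  50840000
   2   p2-51190000  51190000
   2   p2-51190001  51190001
  16  p16-29560000  29560000
  ...
```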
Note that the CNV file format commands are not available in plink 1.9, only in the original version (plink 1.07). Granted, the speed-ups you get from the newer version of plink aren’t all that applicable here, because we’re generally dealing with rare variation, so these file sizes are generally not too large and the sort of computation involved is not too heavy.
So, what does CNV analysis output look like? Usually, whenever you run a command in plink on your CNV files, you get a cnv.indiv file. This is a per-sample file giving the number of CNV segments each individual has, the number of kilobases those segments cover, and the average kilobases covered per segment. You also get a cnv.summary file, which is akin to the map file, summarizing the number of affected and unaffected individuals carrying a CNV at any given breakpoint (each start and end-plus-one position). The output here seems quite simple, but I’ll show how, with these files, you can do quite sophisticated analyses by using a lot of different filters.
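For orientation, here is roughly what those two output files look like (the values are made up, and the column headers follow my reading of the plink 1.07 output, so check them against your own files):

```
# plink.cnv.indiv : one row per sample
    FID       IID  PHE  NSEG      KB   KBAVG
 FAM001  SAMPLE01    2     3   812.4   270.8
 FAM002  SAMPLE02    1     1   145.2   145.2

# plink.cnv.summary : case (AFF) and control (UNAFF) CNV counts at each mapped position
 CHR          SNP        BP  AFF  UNAFF
   2  p2-50840000  50840000    4      1
   2  p2-51190001  51190001    0      0
```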
So, a lot of the magic in plink is in all the flags you can use to subset your list of CNVs. What I have here is a verbose command, just to show the options available in plink, and for each of these I describe what that function is doing. So with plink, you have `--cfile`, which reads in the .cnv and cnv.map files. I want to select only deletions; I want CNVs that are at least 100 kilobases in length; I want CNVs with a score of four or higher and at least 50 sites. I want to exclude CNVs that overlap a particular set of regions, so I can pass in a text file with a list of chromosome and start-end positions, and I specify that, for this exclusion, a CNV must overlap a region by at least 50 percent to be excluded.
I can also filter on frequency: where at least 10 CNVs overlap, I’d like to exclude those, because maybe I’m more interested in very rare CNVs. I can also write out the frequencies of these CNVs, just so I can verify which CNVs are being dropped and which are being kept. And then I can run a basic burden test using a permutation model; here I just set the number of permutations to ten thousand. So you can see there are a lot of different flags here, and it’s by manipulating these flags that you can get just what you would like in terms of your analysis.
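Putting those flags together, a sketch of that verbose command might look like the following (plink 1.07 CNV flags; mydata and exclude_regions.txt are placeholder file names, so check each flag against the plink CNV documentation before running):

```sh
# Filter the CNV list and run a permutation-based burden test in plink 1.07.
plink --cfile mydata \
      --cnv-del \
      --cnv-kb 100 \
      --cnv-score 4 \
      --cnv-sites 50 \
      --cnv-exclude exclude_regions.txt \
      --cnv-overlap 0.50 \
      --cnv-freq-exclude-above 10 \
      --cnv-indiv-perm \
      --mperm 10000 \
      --out cnv_burden

# A separate run that swaps --cnv-indiv-perm / --mperm for --cnv-write --cnv-write-freq
# writes out the filtered CNV list with frequencies, so you can confirm exactly
# which CNVs were kept and which were dropped.
```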
Now, granted, the burden test in plink doesn’t handle covariates; it basically just looks at something like case-control status. So what I recommend is taking the output, particularly the `cnv.indiv` file, and reading it into, say, Python or R (I prefer R) to run more sophisticated models. As you apply more filters, if I go back, the number of segments will change depending on the filters, and consequently the number of kilobases covered by those segments will change as well, and it’s by iteratively reading in these files at different filtering steps that you can produce a wide range of tests.
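As a minimal sketch of that covariate-adjusted burden model (not from the talk; covars.txt and its PC/platform columns are hypothetical files you would prepare yourself, and the cnv.indiv columns are assumed to be FID, IID, PHE, NSEG, KB, KBAVG):

```r
# Read the per-sample burden output from plink and fit a covariate-adjusted
# logistic regression of case status on CNV kilobase burden.
indiv  <- read.table("cnv_burden.cnv.indiv", header = TRUE)
covars <- read.table("covars.txt", header = TRUE)   # FID, IID, PC1-PC4, platform (hypothetical)
dat    <- merge(indiv, covars, by = c("FID", "IID"))

dat$case <- dat$PHE - 1   # plink coding: 1 = control, 2 = case

fit <- glm(case ~ KB + PC1 + PC2 + PC3 + PC4 + factor(platform),
           data = dat, family = binomial)
summary(fit)$coefficients["KB", ]   # burden effect, SE, z, p-value
```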
So, I’ve shown a couple of figures here that we published when looking at CNV burden in the PGC schizophrenia dataset. I read those .cnv files into R and ran a logistic regression predicting schizophrenia status, adding a number of covariates such as principal components and genotyping platform. Then, as I iteratively ran different commands in plink to look at kilobases covered, CNV counts, lengths, frequencies, and whether or not CNVs fall in a particular region, I could build up a number of tests of overall burden. In this example, this would be all CNVs, deletions, and duplications, stratified by genotyping platform and then all together; and in panel B here, I’m stratifying by different frequencies and by previously implicated CNV regions. In the green and blue bars we’re looking at enrichment, and you can see the big deviation here: there’s a lot of enrichment when we look at CNVs of this size, or at least this frequency, and much of it goes away because most of these regions have already been implicated. So, when I exclude those regions and rerun the burden test, you can see that much of the signal is already captured by previously implicated CNVs.
So, how do we use this to run CNV association tests at individual CNV loci, or at individual CNV breakpoints? Basically, I would run a very similar command. I can use all the same filters and drop a few commands, in particular the `--cnv-indiv-perm` step, but still run a permutation test, and what you’ll get out is a .cnv.summary.mperm file, where an association test is run at each base position using permutation. There’s the pointwise permutation p-value, but there’s also a family-wise permutation p-value that corrects for all the tests. So the association is run at all possible start and end-plus-one positions. And one thing you can do if you’d like to include covariates, at least what I’ve done in the past, is run your logistic regression model with all of your covariates, pull out the residuals, and then use those as a quantitative trait; you can then run the association mapping in plink on that to get p-values.
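A minimal sketch of that residual trick (again with hypothetical file names; covars.txt is a covariate file you build yourself):

```r
# Regress case/control status on the covariates, then write the residuals out
# as a quantitative phenotype that plink can use for the CNV association scan.
fam    <- read.table("mydata.fam",
                     col.names = c("FID", "IID", "PAT", "MAT", "SEX", "PHE"))
covars <- read.table("covars.txt", header = TRUE)
dat    <- merge(fam, covars, by = c("FID", "IID"))

dat$case <- dat$PHE - 1   # plink coding: 1 = control, 2 = case
fit <- glm(case ~ PC1 + PC2 + PC3 + PC4 + factor(platform),
           data = dat, family = binomial)

pheno <- data.frame(FID = dat$FID, IID = dat$IID, RESID = residuals(fit))
write.table(pheno, "cnv_resid.pheno",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
# Then, for example:
#   plink --cfile mydata --pheno cnv_resid.pheno --mperm 10000 --out cnv_assoc
```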
So, what does this look like? I think having a figure is instructive here, so I’ve plotted the signal at the NRXN1 gene in a genome browser. I’ll break this down. In red, we have our deletions: light red for deletions in our schizophrenia cases, dark red for deletions in our schizophrenia controls. I also have duplications in blue, which don’t make up much of the signal here. As you can see, I’ve also plotted the negative log10 p-value, and if you look closely, you can see a different test being run at each little breakpoint. At a spot like this, where there are many different breakpoints and a lot of granularity, you can run many different tests and get a shape of the association around this gene. You can also collapse: another test that we’ve done is collapsing across all the exons of this gene, which is more akin to a gene burden test, where you collapse the region, test for overlap at that region, run a similar model, and then aggregate all the CNVs in cases and controls to report, say, a gene-based p-value.
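One way to approximate that gene-level collapse with the flags already described is to restrict the CNV list to calls intersecting the gene’s exons and rerun the burden permutation. This is only a sketch, and nrxn1_exons.txt is a region file (chromosome, start, end) you would prepare yourself:

```sh
# Keep only deletions that intersect the NRXN1 exon regions, then run the
# case/control permutation test restricted to that gene.
plink --cfile mydata \
      --cnv-del \
      --cnv-intersect nrxn1_exons.txt \
      --cnv-indiv-perm \
      --mperm 10000 \
      --out nrxn1_burden
```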
So, that’s a very quick overview of how to run burden and association tests with CNV data. A few considerations: one thing you don’t have access to with CNVs is imputation. It is something to think about that there isn’t a set of reference haplotypes or a larger dataset to do additional QC against. So there can be additional challenges, particularly with subpar data, which can’t be rescued in the way that imputation can rescue SNP genotypes. And on that note, the genotyping chip matters very much, because you can’t impute a bunch of new sites. The variability in the number of SNPs, particularly for smaller CNVs, is a very important consideration. So, within a particular genotyping chip, if you don’t have a good case-control balance, you may not have the sensitivity to properly detect CNVs. There’s a lot of work that goes into determining at what CNV lengths, or at what case-control balances, given your genotyping chip, you have the right amount of power and sensitivity to do a proper case-control association test.
Another thing is ancestry PCs. Most of the CNV hotspots and associated CNVs that we see in psychiatric disease aren’t very impacted by ancestry, mainly because these are recurrent de novo CNV regions with a higher mutation rate, so they aren’t really ancestrally defined, in the sense that there isn’t a large difference in frequency across ancestries. But PCs are still useful to include, particularly as you get to higher frequencies and more inherited CNVs. Finally, for multiple testing correction, permutation is one of the more robust ways to account for the multiple tests, because the nature of CNV data is such that, given the type of genotyping chip you use and the size of your dataset, no particular study is going to be very similar to another. By leveraging the correlation structure within your own dataset, given your own ability to detect CNVs of various sizes and frequencies, permutation is usually the best way to go about properly testing for association.
So, if you have any questions, feel free to email me. I’ve also put down some sites you can search for the PGC CNV paper that we did in 2017, and I think the plink write-up on how to do this is really good and very descriptive of all its functionality. Thank you.