Software Tutorials: Genetic Correlation and SNP Heritability (Video Transcript)
LAVA
Title: LAVA Tutorial
Presenter(s): Josefin Werme, MSc (Department of Complex Trait Genetics, Vrije Universiteit Amsterdam)
Josefin Werme:
Hi, everyone! My name is Josefin Werme. I work at the Complex Trait Genetics Lab at the VU in Amsterdam, and today I’ll do a small tutorial to show you how to run LAVA.
So, LAVA is a tool dedicated to local genetic correlation analysis, and as the name suggests, this is simply a genetic correlation, but estimated within a local genomic region that you have specified. Today I won't have time to go into the technical details of the method, but if you're interested in those, I suggest you check out the preprint on bioRxiv.
Today we will go through four different analysis options. First, we have a univariate test, which is used to confirm the presence of local genetic signal for any phenotype of interest. Then, we have a standard bivariate test, which is used to evaluate the local genetic correlation between any two phenotypes of interest. And then we also have two types of conditional tests. The first is a partial correlation, which you can use to evaluate the local genetic correlation between two phenotypes of interest conditioned on another phenotype or a set of other phenotypes. The second is a multiple regression approach, which allows you to model the local genetic signal for an outcome phenotype of interest using several predictors jointly.
So, to run this, you basically just need summary statistics for your phenotypes of interest with the standard columns: a SNP ID, the effect allele (and the other allele), some effect size, p-value, or Z-statistic, and an N (sample size) column. And then, because we're dealing with summary statistics, you of course also need some reference data for the estimation of the LD between the SNPs. You also need something called an input info file, which just specifies some required variables, such as the phenotype ID, the file name, and, for binary phenotypes, the number of cases and controls. If you have continuous phenotypes, you can just set these to NA, and the program will know that this is a continuous phenotype and treat it as such.
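For reference, the input info file is simply a whitespace-delimited text table along these lines (the column names follow the LAVA GitHub documentation; the phenotype IDs, counts, and file names here are purely illustrative):

```
phenotype  cases   controls  filename
scz        35000   45000     scz_sumstats.txt.gz
bmi        NA      NA        bmi_sumstats.txt.gz
```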
Something that may bias your genetic correlations is potential sample overlap between the data sets that your summary statistics were derived from. To account for this, you can first run bivariate LD Score regression (LDSC) for your phenotypes, extract the intercepts from those analyses, put them into a matrix, and LAVA will then account for any potential bias due to sample overlap.
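A minimal sketch of what assembling such a matrix in R might look like, assuming you have collected the cross-trait LDSC intercepts by hand; the phenotype IDs and values are made up, and the exact format LAVA expects should be checked against its GitHub documentation:

```r
# illustrative phenotype IDs and cross-trait LDSC intercepts (not real values)
phenos  <- c("scz", "bip", "mdd")
overlap <- diag(length(phenos))
dimnames(overlap) <- list(phenos, phenos)
overlap["scz", "bip"] <- overlap["bip", "scz"] <- 0.10
overlap["scz", "mdd"] <- overlap["mdd", "scz"] <- 0.05
overlap["bip", "mdd"] <- overlap["mdd", "bip"] <- 0.12

# write the matrix to the sample overlap file passed to LAVA
write.table(overlap, "sample.overlap.txt", quote = FALSE)
```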
Then finally, because we are analyzing local regions of the genome, we also need a file that specifies which regions to consider. For this tutorial, we will use a file that was created by segmenting the genome into smaller regions on the basis of LD. They're not entirely independent blocks, but the LD between them has been minimized. You are free to use whatever you want here if you have something else in mind.
Basic processing
Right. So, for this tutorial, I will run through all the basic processing and input functions interactively in R. You can download the package from GitHub, where you can also find more detailed information about how to install it, the different types of input data, and the analysis functions.
First of all, we have the basic input processing functions. We start by reading in the locus file, and then we process all the summary statistics together with the input info file and the sample overlap file, if you provide one. If not, you can set this to NULL, but I would only do that if you're really confident that there isn't any potential sample overlap.
This function also takes the prefix of your reference data and reads in the .bim file; here we essentially align all the summary statistics to the reference so that the effect directions are consistent, subset to the SNPs in common, and so on. By default, this is going to read in and process all the summary statistics files listed in the input info file, but if you only want a subset of phenotypes, you can specify the 'phenos' argument, and it will process only those for you. So that's the first thing you need to do.
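Put together, this input processing step might look roughly like the sketch below. The function and argument names follow the LAVA GitHub documentation as I recall it, and the file names are illustrative, so double-check against the package manual:

```r
library(LAVA)

# read the locus definition file
loci <- read.loci("lava_ld_blocks.loci")                           # illustrative file name

# process the summary statistics listed in the input info file
input <- process.input(input.info.file     = "input.info.txt",
                       sample.overlap.file = "sample.overlap.txt", # or NULL if no overlap
                       ref.prefix          = "g1000_eur",          # reference data prefix
                       phenos              = c("scz", "bip", "mdd")) # optional subset
```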
Then, for any locus that we want to analyze with the subsequent analysis functions, we need to do some additional processing to convert the local marginal SNP effects to their joint counterparts, correcting for the LD between the SNPs in the locus. This is done using this function here, which also computes a number of other parameters that are relevant for the analysis. You basically just pass it a single locus and the input object with all the processed summary statistics. If you inspect the result, you'll find that it is actually an environment, and that's just because it's a bit more memory-efficient to work with environments. So if you want to look at what's inside, you need to use 'ls()', and the same is true for the input object; it's also an environment. You can then subset it if you want to extract things like the start and stop positions or any other information about the locus. Once we've done this, we can proceed with the analysis. Now, of course, here I am showing you how to do this on a locus-by-locus basis; for a whole-genome bivariate genetic correlation analysis, I would do this on a compute cluster, and I will provide scripts that show you how to do that, which you can adapt however you want.
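A minimal sketch of that locus-processing step, again with function and field names taken from the LAVA documentation as I remember it (the locus index is arbitrary):

```r
# convert the marginal SNP effects within one locus to joint effects,
# correcting for the LD between the SNPs in the locus
locus <- process.locus(loci[100, ], input)   # locus 100, purely illustrative

# both 'locus' and 'input' are environments, so use ls() to see their contents
ls(locus)
ls(input)

# e.g. extract the locus coordinates
locus$chr; locus$start; locus$stop
```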
Right, so to proceed with the analysis, once we've defined our locus object, we can run the univariate analysis function, which evaluates the local heritability for all your phenotypes. In the output, you first see the list of phenotypes, then the observed-scale estimate of the local heritability for each phenotype, and a p-value for it. If you're interested in the actual estimated population-level h2 for binary phenotypes, you would need to provide a column specifying the population prevalence of your phenotype, but for just evaluating significance, this isn't relevant.
So, what do you consider sufficient to proceed with a bivariate analysis? That's up to you, and it depends on what your aims are. Some people might just want to filter out completely non-significant loci, while others might not even be interested in interpreting a genetic correlation from loci without genome-wide significant levels of local heritability.
So there's no gold standard here; it's up to you, but I would definitely recommend filtering out really unassociated loci, because it's the same principle as with LD score regression: if there's just no heritability, you wouldn't proceed. But here, clearly, there is lots of signal. So, to proceed with the bivariate genetic correlation analysis, we can run this function here. You only really need to provide the 'locus' argument, but for the purpose of this tutorial, I'm going to speed up the processing time slightly by turning off the additional iterations for the simulation p-values at lower thresholds, which would normally be on; I would recommend keeping those on for your own analyses.
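A sketch of those two calls, with function names as documented on the LAVA GitHub (the argument that controls the extra simulation iterations is not shown here, since its exact name should be checked in the function manual):

```r
# univariate test: local heritability estimate and p-value per phenotype
run.univ(locus)

# bivariate test: local genetic correlation for all unique phenotype pairs
run.bivar(locus)
```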
Right. So here, for each unique phenotype pair, you get the estimated local genetic correlation with a 95% confidence interval for that estimate. You also get the square of this correlation, which can be interpreted as the proportion of local genetic signal in one phenotype that can be attributed to that of the other, and vice versa, and then you also get your p-values, of course.
By default, this is going to analyze all the unique combinations of phenotypes, but if you just want a subset of the phenotypes you processed, you can do that using the 'phenos' argument. I forgot to say, but this also applies to the univariate function. And if, for example, you have only one target phenotype of interest, in this case let's say MDD, you can set the 'target' argument, and it will only compute the pairwise local genetic correlations with that phenotype and ignore the pairs among your other phenotypes.
So, like I mentioned, I do think you should do some filtering based on the univariate signal. Instead of writing an "if" statement, you can also use this combined function, which does that in one go: it will only do the bivariate local genetic correlation analysis conditional on there being some level of univariate association signal. This is the p-value threshold I selected, and as you can see, bipolar disorder has now been excluded from any subsequent analysis because it didn't reach this threshold. If you set a more stringent threshold, it's just going to return NULL. And if all the phenotypes pass, by default it's going to do all the combinations between them, but you can again pass the 'phenos' or the 'target' argument, and it will subset accordingly.
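A minimal sketch of the combined call; the function name is from the LAVA documentation, and the argument name and threshold used here ('univ.thresh', 1e-4) are illustrative and should be checked against the manual:

```r
# bivariate test restricted to phenotypes passing the univariate threshold
run.univ.bivar(locus, univ.thresh = 1e-4)

# optionally restrict to pairs involving a single target phenotype
run.univ.bivar(locus, univ.thresh = 1e-4, target = "mdd")
```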
Conditional analysis
Those are the basic univariate and bivariate association tests. Like I mentioned, I will provide scripts showing how to do this at the whole-genome level as well. But now I'm going to go through the conditional analyses.
So, if you find that there are significant bivariate genetic relations between some phenotypes, and you have a particular conditional hypothesis you want to test, you can apply these two approaches to do so. As an example, say that I'm particularly interested in MDD (that's my phenotype of interest), and I want to model the local genetic signal in MDD using a set of psychiatric phenotypes. In this locus, I can see that bipolar disorder and MDD are really strongly correlated, but there is also a significant correlation with schizophrenia. Now, depending on how many tests you did, let's just say all of these passed multiple-testing correction. Then my question might be: does this relation actually still hold if I account for bipolar disorder? Because bipolar disorder and schizophrenia are, of course, also correlated, and perhaps this correlation is just confounded by that association.
Then I might run the partial correlation. Here, I specify that my target association of interest is the one between schizophrenia and MDD, and by default it will just treat any remaining phenotypes as those to condition on, as you can see here. In the output, you first see which phenotypes are your target phenotypes and, under 'z', those that you condition on. You also get the relevant bivariate r2 values as a reference, and then the partial correlation with confidence intervals and a p-value. As you can see, there's quite a noticeable reduction in the partial correlation between schizophrenia and MDD compared to the marginal, or bivariate, association between them, and it is also far from significant, which suggests that if we account for bipolar disorder, the correlation between MDD and schizophrenia is no longer significant.
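A sketch of that call, using the function name from the LAVA documentation and illustrative phenotype IDs:

```r
# partial correlation between schizophrenia and MDD, conditioned on the
# remaining processed phenotypes (here, bipolar disorder)
run.pcor(locus, target = c("scz", "mdd"))
```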
Right, so that's one type of conditional test you can do. There's also the multiple regression, which you use if you want to model the genetic association signal for an outcome phenotype of interest using several predictors jointly. Again, the target phenotype is MDD. In the output, you first get your predictors and the outcome, and then the joint, standardized multiple regression coefficients for your two predictors, with confidence intervals, as well as a joint r2 for this model. This r2 can be interpreted as the proportion of local genetic signal in MDD that can be explained by schizophrenia and bipolar disorder jointly, which means it accounts for the genetic correlations between these predictors as well.
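And a sketch of the multiple regression call, again with the function name from the LAVA documentation and illustrative phenotype IDs:

```r
# model the local genetic signal for MDD using schizophrenia and
# bipolar disorder as joint predictors
run.multireg(locus, target = "mdd")
```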
What you might notice here is that the p-values are, of course, much less significant than they were before, but that is typical of multiple regression with correlated predictors. This shouldn't be interpreted as meaning that these phenotypes aren't jointly relevant; it's really typical, and that's why you always need to interpret this in the context of the bivariate relations as well. And although here we're dealing with genetic correlations, or genetic relations, the interpretation of a partial correlation and a multiple regression is the same as it would normally be.
That's, quickly, how to run these basic functions. Like I said, I will provide scripts for whole-genome bivariate genetic correlation analysis. If things are still unclear, there's more information on the GitHub about the individual functions, and you can also check the function manuals. You are also very welcome to email me with any questions you might have, and please join the Q&A if you like.
So I hope that this was at least somewhat informative, and yeah, if you have any more questions, email me, and thank you very much for watching.
GCTA-GREML
Title: Estimating SNP-based heritability with GCTA-GREML
Presenter(s): Jian Yang, PhD (Queensland Brain Institute, The University of Queensland)
Jian Yang:
Okay. First, I’d like to thank the organizer for this opportunity to talk and thank you for watching this video. In this presentation, I’ll be talking about the use of GCTA-GREML to estimate SNP-based heritability from GWAS data.
We have three modules in GCTA. The first module is mainly about heritability estimation, genetic correlation analysis, and phenotype prediction. In the second module, we have methods for genome-wide association analysis, and in module number three we have a couple of other methods.
So, in module number one, we have methods to compute the genetic relationship matrix (GRM), to estimate or partition genetic variation, and to estimate dominance variance or the variance explained by chromosome X. We can run genetic correlation analysis or phenotype prediction using BLUP or summary-based BLUP. We also have a version of Haseman-Elston (HE) regression implemented in GCTA.
In module number two, we have a couple of GWAS methods. There are two mixed linear model-based GWAS tools, called MLMA and fastGWA. If your sample size is not so large, you can use MLMA or MLMA-LOCO (leave one chromosome out). If your sample size is as big as the UK Biobank or even larger, you can use fastGWA, or fastGWA-GLMM if the phenotypes are binary. We also have COJO and mtCOJO for summary-based conditional analysis: we use COJO for conditional analysis conditioning on SNPs, and mtCOJO for conditional analysis conditioning on traits. We have also included gene-based test methods, such as fastBAT and ACAT.
In module number three, we have a GWAS simulator for doing GWAS simulations based on real genotype data. You can do PCA, or PC projection if your discovery GWAS data set is too large. We can also compute FST values for genetic variants and inbreeding coefficients for the individuals. We can compute LD scores or find the LD friends for a given set of variants. We also have a version of GSMR implemented in GCTA to run Mendelian randomization analysis.
So, the main character today is GCTA-GREML. There are two steps to run GCTA-GREML. The first step is to create a GRM. Here is a command to run the analysis: we tell the program to read in a genotype file in PLINK format, and we tell it to run the analysis on the autosomes; we also have options to include or exclude a specific set of individuals or variants. If your data set is big, you can parallelize the analysis across the individual chromosomes. In this example, we compute the GRM for chromosome 1 separately.
If this is still too slow, as for an analysis in the UK Biobank, you can further divide the computation on chromosome 1 into different parts. Here we divide the analysis into 100 parts, and once all the parts are computed, you can simply use the "cat" command in a Linux or Unix shell to merge all the parts together. You can then further merge the GRMs for the chromosomes into a single GRM.
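A rough sketch of this two-step GRM workflow, assuming a PLINK-format fileset called "test" and the GCTA binary "gcta64"; the flags are taken from the GCTA documentation, but the file names, part counts, and thresholds here are illustrative:

```bash
# GRM from all autosomal SNPs (MAF filter and thread count illustrative)
gcta64 --bfile test --autosome --maf 0.01 --make-grm --out test_grm --thread-num 10

# GRM for chromosome 1 only
gcta64 --bfile test --chr 1 --make-grm --out test_chr1

# very large samples: split the chromosome-1 computation into 100 parts
gcta64 --bfile test --chr 1 --make-grm-part 100 1 --out test_chr1   # part 1 of 100
# ... run parts 2..100, then merge (check the exact part file names in your output)
cat test_chr1.part_100_*.grm.id    > test_chr1.grm.id
cat test_chr1.part_100_*.grm.bin   > test_chr1.grm.bin
cat test_chr1.part_100_*.grm.N.bin > test_chr1.grm.N.bin

# merge the per-chromosome GRMs (prefixes listed one per line in grm_chrs.txt)
gcta64 --mgrm grm_chrs.txt --make-grm --out test_grm_all
```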
Once you have the GRM, you can run a GREML analysis by typing this command. Basically, you use the flag "--reml" and input the GRM and the phenotype file, and we also have other flags, for example, to include a specific set of individuals or some covariates; in this case, we have 10 PCs included as covariates. If your data come from a case-control study, you can specify the prevalence of the disease in the population to convert the estimate of SNP-based heritability from the observed zero-one scale to the unobserved underlying liability scale. We also provide an option to run a genotype-by-environment (GxE) interaction analysis, which is the "--gxe" flag. So, you tell the program to run REML with this flag to input the environmental information, and then ask the program to run a likelihood ratio test for the GxE component specifically.
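A sketch of these REML commands under the same assumptions as above (illustrative file names and prevalence; the variance-component index passed to the likelihood-ratio-test flag should be checked against the GCTA documentation):

```bash
# GREML with 10 PCs as quantitative covariates and liability-scale conversion
gcta64 --reml --grm test_grm --pheno test.phen \
       --qcovar pcs10.txt --prevalence 0.01 --out test_reml

# GxE interaction analysis with a likelihood ratio test for the GxE component
gcta64 --reml --grm test_grm --pheno test.phen \
       --gxe env_factor.txt --reml-lrt 2 --out test_gxe
```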
You can also run a GREML analysis with multiple components. In that case, you basically stratify the variants by different information: for example, you can divide the variants by chromosome, by genic versus intergenic regions, or by the minor allele frequency or LD scores of the variants. So the variants can be divided into many different groups; you then run a GRM computation for each group, creating a set of GRMs, and fit all the GRMs together in a GREML analysis to partition the genetic variance across the different groups of variants.
Here is a very specific example. This method, called GREML-LDMS, is specifically designed to estimate SNP-based heritability from whole-genome sequence data or imputed data that include rare variants. There are four steps: in the first step, you compute the LD scores for the variants. You then stratify the variants by both the minor allele frequency and the LD scores of the variants. You then compute a GRM for each variant group using the following commands. And finally, you run a GREML analysis with all the GRMs fitted together. Of course, you can include covariates in this kind of analysis.
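A sketch of the four GREML-LDMS steps under the same assumptions; the LD-score segment length, the number of MAF/LD bins, and the file names are illustrative, and the binning itself (step 2) is typically done in R:

```bash
# 1) segment-based LD scores
gcta64 --bfile test --ld-score-region 200 --out test_ldscore

# 2) stratify the SNPs into MAF/LD groups (e.g. in R),
#    producing snp_group1.txt ... snp_group4.txt

# 3) one GRM per SNP group
gcta64 --bfile test --extract snp_group1.txt --make-grm --out test_group1
# ... repeat for the remaining groups

# 4) fit all GRMs jointly (prefixes listed one per line in multi_grm.txt)
gcta64 --reml --mgrm multi_grm.txt --pheno test.phen --qcovar pcs10.txt --out test_ldms
```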
We also provide options to estimate non-additive genetic variance, in this case dominance variance. You start by computing a dominance GRM together with an additive GRM, which is the default GRM I mentioned before. So you have two commands here: one to compute an additive GRM, and the other to compute a dominance GRM. Once you have the additive and dominance GRMs, you fit them together in a GREML analysis to estimate the additive and dominance genetic variance for a trait.
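A sketch of the dominance analysis; "--make-grm-d" is the documented flag for a dominance GRM, but the exact prefix of the files it writes (and therefore the entries in the multi-GRM list) should be checked against your GCTA version:

```bash
# additive (default) and dominance GRMs
gcta64 --bfile test --make-grm   --out test_add
gcta64 --bfile test --make-grm-d --out test_dom

# fit both together; list the two GRM prefixes (one per line) in add_dom_grms.txt
gcta64 --reml --mgrm add_dom_grms.txt --pheno test.phen --out test_add_dom
```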
We also provide an option to do genetic correlation analysis, which is essentially a bivariate GREML analysis. The command is actually very simple: you tell the program to run a bivariate REML analysis with the GRM and the phenotype file as input. I call this a "balanced design" because, in this analysis, we assume you have two phenotypes measured on exactly the same set of individuals. But in some cases the design can be unbalanced. For example, if I want to estimate the genetic correlation between males and females for human height, I treat male height and female height as two different phenotypes. So, in this case, I have two phenotypes: male height, measured in males only, and female height, measured in females only. If you input a phenotype file like this into the bivariate GREML analysis, the program can still estimate the genetic correlation, which is essentially the male-female genetic correlation.
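A sketch of the bivariate GREML command (the "--reml-bivar" flag takes the two phenotype columns to analyze; file names are illustrative):

```bash
# bivariate GREML on phenotype columns 1 and 2 of the phenotype file
gcta64 --reml-bivar 1 2 --grm test_grm --pheno two_traits.phen --out test_bivar
# for the male/female height example, column 1 holds height in males (missing in females)
# and column 2 holds height in females (missing in males)
```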
I guess this is the end of the tutorial part of the talk, but I also want to take this opportunity to mention a few frequently asked questions. A commonly asked question is about the sample size required for a GCTA-GREML analysis. For common variants, I would suggest a sample size of at least 5,000; of course, it would be better if you can have more. For whole-genome sequence data or imputed data, I would suggest tens of thousands. In terms of computer memory requirements, we have provided equations on our website to approximately compute the memory required for the GRM computation and the REML analysis.
Another question is about whether we can apply GREML analysis to family data. The answer is yes, and we often have two strategies. The first strategy is to remove one of each pair of individuals with an estimated genetic relatedness larger than a threshold, for example, 0.05. The other strategy is the so-called big-K small-K analysis, which is essentially a multi-GRM analysis with one dense GRM and one sparse GRM, to estimate the SNP-based heritability and the pedigree-based heritability simultaneously. In terms of the relatedness threshold, we often use 0.05, but in real data applications we don't really see much difference if we use 0.025 or even 0.1 for GRMs computed from HapMap3 SNPs or a similar set of common SNPs.
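A sketch of the two strategies using documented GCTA flags ("--grm-cutoff" for relatedness pruning, "--make-bK" for the sparse GRM); thresholds and file names are illustrative:

```bash
# Strategy 1: prune relatives above a GRM threshold of 0.05
gcta64 --grm test_grm --grm-cutoff 0.05 --make-grm --out test_grm_unrel

# Strategy 2 (big-K small-K): sparse GRM with off-diagonals below 0.05 set to zero,
# then fit the dense and sparse GRMs together (prefixes listed in bigK_smallK.txt)
gcta64 --grm test_grm --make-bK 0.05 --out test_grm_bK
gcta64 --reml --mgrm bigK_smallK.txt --pheno test.phen --out test_bKsK
```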
So, we also have feedback about issues related to convergence or sometimes estimates are too big or too small. For example, in the REML iteration, the estimate can hit the boundary of the parameter space, which is zero or one. In this case, we think that it’s likely because the sample size is not large enough. We know that the variance of estimates (var(estimate)) is inversely proportional to sample size squared, so if the sample size is too small, the estimates vary a lot and by chance, it can hit the boundary if the true parameter is either too big or too small. In case-control studies, this may also indicate that you have some technical differences between cases and controls, and often more stringent quality control needs to be applied.
Another issue is that the GCTA-GREML analysis did not converge. Unfortunately, we don't have a universal solution to this yet. It often has to do with errors in the data or the analysis script, and sometimes you may just have to increase the sample size.
Yeah, the other question is about whether we can use GREML to estimate the variance explained by a subset of variants. The answer is yes, but the estimate will be inflated if the SNPs are selected from the same data. So, one suggestion would be to select SNPs from one dataset and to do the estimation in another independent dataset. We also have flags in GCTA to estimate the effect size of the covariates or to estimate the fixed effects, and this is sometimes useful if you are interested in testing the association between a covariate and the phenotype, conditioning on the polygenic background.
Finally, we also have an option to do prediction, that is, to predict the random effects. Here is an example: we can use the "--reml-pred-rand" flag to output the BLUP solutions for the total genetic values of the individuals captured by the SNPs. We can then use the "--blup-snp" option to input those individual genetic values and convert the individual BLUP solutions into BLUP solutions for the SNPs. You can then use these SNP BLUP solutions in PLINK to compute a polygenic risk score (PRS) in an independent sample. So it's essentially an individual-level-data-based BLUP analysis.
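A sketch of that three-step BLUP workflow using the documented GCTA flags; the output file suffixes and the PLINK "--score" column indices follow the GCTA documentation as I recall it, so check them against your own output files:

```bash
# 1) BLUP solutions for the individuals' total genetic values
gcta64 --reml --grm test_grm --pheno test.phen --reml-pred-rand --out test_blup

# 2) convert the individual BLUPs into SNP BLUP effects
gcta64 --bfile test --blup-snp test_blup.indi.blp --out test_snp_blup

# 3) score an independent target sample in PLINK with the SNP BLUP effects
plink --bfile target --score test_snp_blup.snp.blp 1 2 3 --out target_prs
```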
I think this is the end of my presentation, so if you have any questions, you can visit our website or drop me an email. And thank you again for watching this video.
Genetic Architecture
Title: Tracking 12 years of genetic architecture estimates of schizophrenia
Presenter(s): Naomi Wray, PhD, FAA, FAHMS (Department of Psychiatry and Big Data Institute, University of Oxford; Institute for Molecular Bioscience (IMB) and Queensland Brain Institute, The University of Queensland, Brisbane, Australia)
Naomi Wray:
Hello. In this short talk, I'm going to talk about tracking 12 years of genetic architecture estimates of schizophrenia. My target audience for this talk are really analysts from the PGC. The work I'm talking about is part of the latest PGC schizophrenia paper, which is currently on medRxiv, and in particular work with Vasa [Trubetskoy] and Mick [O'Donovan] as we drilled down into some results; the results are found in the main-text section on SNP-based heritability and polygenic prediction, and also in the supplementary note. My motivation was really that schizophrenia is one of the flagship disorders for the PGC: because its sample sizes were larger earlier on, some of the things that we do and see are later seen for other disorders.
So, 12 years of tracking. I'm considering the papers that started in 2009 with the International Schizophrenia Consortium paper, which was the first large genome-wide association study for schizophrenia, with 3,000 cases. We then had PGC Wave 1 in 2011 with 9,000 cases, PGC Wave 2 in 2014 with 37,000 cases, and the latest, Wave 3, with 69,000 cases. This graph just illustrates the discovery across those publications, with the number of cases on the x-axis and the number of independent associated loci on the y-axis, going from zero in the 2009 paper to nearly 250 now with Wave 3.
Before I start, I'm going to give some basic concepts. With this bar, I'm representing the variation from all factors that contribute to a trait, in this case risk of a disorder: risk of schizophrenia. This bar represents the variation attributable to genetic factors, and the ratio between these two is the heritability, which we estimate from family data. This bar represents the variance attributable to the genetic factors that are tagged by our common-variant genotyping arrays, so in our GWAS datasets. Obviously, our GWAS arrays don't tag all variants, just some of them, so the variation associated with them is smaller than the total genetic variance. The ratio between this one and this one is then the SNP-based heritability, which is less than the heritability. And then this bar represents the variance explained in out-of-sample polygenic risk prediction, for which we use the R2 term. Our expectation is that this value is less than the SNP-based heritability, but as sample sizes increase, it will get closer to it.
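In symbols, writing $V_P$ for the total (liability) variance, $V_G$ for the variance due to all genetic factors, and $V_{SNP}$ for the part tagged by the genotyped SNPs, the quantities described here are:

```latex
h^2 = \frac{V_G}{V_P}, \qquad
h^2_{SNP} = \frac{V_{SNP}}{V_P}, \qquad
R^2_{\text{out-of-sample}} \le h^2_{SNP} \le h^2 .
```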
Just a small word about R2. We're used to R2 as the variance explained by a regression when we think about linear regression. When we have a binary trait, it's a little bit more complicated, and so we have this statistic, Nagelkerke's R2.
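For reference, Nagelkerke's R2 is the likelihood-based (Cox and Snell) R2 rescaled so that its maximum is 1, where $L_0$ is the likelihood of the null model (covariates only), $L_1$ the likelihood of the model that also includes the polygenic score, and $n$ the sample size:

```latex
R^2_{\text{Nagelkerke}}
  = \frac{1 - \left( L_0 / L_1 \right)^{2/n}}{1 - L_0^{\,2/n}} .
```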
One of the problems with this statistic is that it depends on the proportion of cases in the sample. In this graph, all the points on each line explain the same variance in liability, but the actual Nagelkerke estimate depends on the proportion of cases in the sample.
So back to 2009. This very first paper was the first where we actually made estimates of what we now call SNP-based heritability, and the first where we looked at out-of-sample prediction. This was the iconic figure from that paper. On the y-axis is the Nagelkerke R2. We're predicting from the International Schizophrenia Consortium into other independent samples: here, another European sample called the MGS sample. The different colored bars reflect that we were using the basic p-value clumping and thresholding method with different p-value thresholds. In this talk, I'm not explaining polygenic risk prediction; I'm assuming everybody knows what it is, and if you don't, there are other PGC talks which are specifically about polygenic risk prediction.
In this paper, we asked two questions: what genetic architecture is consistent with the observations that we saw, and how do we expect the out-of-sample prediction to change as sample size increases? Many of you might have heard me say before that my favorite figure in this paper is supplementary figure 8. On the y-axis, we have the Nagelkerke R2, and on the x-axis, we have sample sizes; these are results from simulations. We considered a sample size of 3,000 cases and 3,000 controls, chosen to mimic the real data, and then we thought about increasing sample sizes going up to 20,000 cases and 20,000 controls. Back in 2009, we thought it was absolutely ridiculous to think about going to larger sample sizes; we even thought we'd be laughed at for going to 20,000 cases. But of course, 12 years on, we're now three times that size. The different colored bars are again the different p-value thresholds, and the different graphs represent different genetic architecture models. Those genetic architecture models differed in the number of causal variants, the allele frequencies of the causal variants, and the effect sizes. What we found was that we could model many different genetic architectures, which were really very different in terms of the number of causal variants, but each of them generated observations very similar to what we actually saw in real life. But we predicted that as sample sizes increase, we might see differences in the shape of the p-value thresholding results depending on the genetic architecture. And you can see we went up to a Nagelkerke R2 of, you know, about 20 percent.
So what do we expect with increasing sample size, first of all, for SNP-based heritability? We have to have some caveats for making comparisons as sample sizes increase. First of all, we have to make sure that we're thinking about the same definition of the phenotype. We need to have our samples drawn from the same ancestry. We need to be using the same SNP set for these comparisons, and the same methods to estimate the SNP-based heritability. There are some other third-order assumptions which we don't need to worry about. But basically, our expectation is that the estimate of the SNP-based heritability should not change with sample size, because we expect it to be an unbiased estimate whatever the sample size. Of course, sample size comes into the standard error, so we expect that as sample sizes increase, the standard errors will get smaller; the estimates might bounce around as sample size increases, but only with respect to the standard error.
So, what did we actually see? Here, on the y-axis, I've got the estimate of SNP-based heritability on the liability scale and, on the x-axis, the different sample sizes. This is the first International Schizophrenia Consortium sample of 3,000 cases, and this was the estimate we made from simulation, 0.34. We were then able to go in and estimate it directly from individual-level data using the GREML method, and that estimate was 0.27, with this 95% confidence interval. We then went to the Wave 1 data, and the estimate was 0.23, which was not statistically different. We did observe, though, that when we made those estimates separately for the ISC sample, the MGS sample, and the other cohorts, those individual estimates were all larger than the overall estimate: we had 0.27, 0.31, 0.27, but when we put them together, it was 0.23. So, what was going on there? Well, it could be that some of those assumptions on the previous slide weren't upheld; for example, maybe a slightly different phenotype definition, or not quite sampling from the same population, et cetera.
These are the estimates from Wave 2 and Wave 3, again not statistically different. This says 69k, but it was actually just the European subset. I've also got a methodology comparison here: these estimates were made with GREML, but when we get to the larger waves, we didn't have access to all the individual-level data, so the estimates are based on summary statistics. We can use LD score regression, which is perhaps more commonly used, but we know from many simulation studies that LD score regression gives slightly biased estimates: it slightly underestimates the true SNP-based heritability, whereas estimates from more recent methods, such as SBayesS, are less biased. So this is now our best estimate of the SNP-based heritability, at 0.24, but basically, the estimates haven't really changed with increasing sample size, as we would expect.
So, what do we expect with increasing sample size for out-of-sample prediction? Well, again, we have to compare like with like. Some of the same assumptions as for SNP-based heritability also hold for the GWAS discovery sample from which we make our polygenic prediction, but we also have some additional assumptions for the target sample. We need to be comparing based on the same method to generate the polygenic risk score. We need to be using the same kind of SNP QC, which is arguably also part of the methodology. And to have an apples-to-apples comparison, we need to be using the same test cohort. Under these conditions, what do we expect? We expect that the out-of-sample prediction statistic will increase as the sample size of the discovery sample increases, and we know that there is a maximum for that R2 statistic, which is the SNP-based heritability. The sample size of the target sample comes in not to the estimate but to the standard error. So the key thing is the sample size of the discovery sample: as it increases, the out-of-sample prediction R2 goes up, while the sample size of the target sample affects the standard error of the estimate.
These are the results from the Wave 3 study, and this is a figure from the paper on medRxiv. On the y-axis, we've got the R2 on the liability scale; on the x-axis, we've got the different p-value thresholds, again using the basic p-value clumping and thresholding method. In Wave 3, we have cohorts of different ancestries, and these results show the leave-one-cohort-out predictions: we leave one cohort out and predict into it. We have samples of different ancestries, shown from the orange dots, which are African Americans, up to the green dots, which are the European cohorts. We can see that we have better out-of-sample prediction in the European cohorts, which is not surprising given that the majority of our discovery cohorts were also European. We got the highest out-of-sample prediction for the p-value threshold of 0.05, and the median value across those cohorts was 7.3 percent. This talk was really motivated by that initial observation: this out-of-sample prediction median of 7.3 percent across the cohorts raised a slight warning flag for me, because I knew that from Wave 2 we had an out-of-sample prediction of 7 percent, and I felt that this increase from 7 to 7.3 was not as much as I might have expected. So then we did a bit of digging.
So what did we find? This is now an out-of-sample prediction comparing apples with apples, predicting into a single cohort, the MGS cohort, and here we're looking at the Nagelkerke R2, because that was the statistic used in that first paper in 2009. Now we do see that, when we compare on a single cohort, the variance explained increases nicely with sample size, going from that 3 percent Nagelkerke R2 in 2009, up to 6% when we had 9,000 cases, 18% for 37,000 cases, and 21% now. And this translates those estimates to the liability scale, so we're now at a point of 9.9% of liability variance explained. In fact, we know that if we use more advanced methods for generating the polygenic scores, this now goes above 10%.
The other thing that we looked at was, again, comparing like with like. On the previous slide, I said that when we had the Wave 2 discovery predicting into the left-out PGC Wave 2 cohorts, the median out-of-sample prediction was 7%, and I showed that when we have the Wave 3 discovery predicting into the left-out Wave 3 cohorts, that median was 7.3%, which is not a very big difference. But we realized that if we take the PGC3 discovery and predict into the PGC2 cohorts, so now again a like-for-like comparison, that goes up to 8.5%. So part of the small difference here was actually due to the nature of those PGC3 cohorts.
That was all I wanted to show in this talk, and I hope it's useful for analysts of other PGC disorders. So it just remains to give my acknowledgments. Nothing I do goes without funding, so thank you for the funding, and a particular shoutout to the PGC Schizophrenia Consortium and all the collaborators that make it happen, and particularly to Vasa [Trubetskoy], Mick [O'Donovan], and Jian [Zeng], who really helped investigate what's presented in this talk. Thank you.