Software Tutorials: PRS (Video Transcript)
Comparison of PRS Methods
Title: Comparison of PRS methods
Presenter(s): Guiyan Ni, PhD (Institute for Molecular Bioscience, University of Queensland)
Guiyan Ni:
Welcome! The topic of this video is how to run polygenic risk score comparisons. By now, you have already watched the individual talks on PRS-CS, LDpred2, SBayesR, and other methods, so I believe you all have a good understanding of polygenic scores and of each of those methods.
So, this slide here is just to set the scene and make sure we are on the same page. A polygenic risk score for an individual is a weighted sum of that individual's risk allele counts. The basic method for constructing a polygenic risk score from GWAS summary statistics is p-value clumping and thresholding. This method is simple, but it doesn't formally model the genetic architecture, so many newer methods try to model the genetic architecture of the trait of interest.
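In symbols, the score can be written as follows (a standard formulation, with notation assumed here: $\hat{\beta}_j$ is the GWAS effect size estimate of SNP $j$, and $x_{ij}$ is the number of risk alleles, 0, 1, or 2, carried by individual $i$ at that SNP):

$$ \mathrm{PRS}_i \;=\; \sum_{j=1}^{M} \hat{\beta}_j \, x_{ij} $$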
For example, different priors can be used in the Bayesian regression, as in the LDpred infinitesimal model, LDpred2, SBayesC, PRS-CS, and SBayesR, and methods like lassosum use lasso regression. MegaPRS is another method that can run under different priors. For example, if it runs with a prior that is a mixture of four normal distributions, it will be the same as, or similar to, SBayesR. If it assumes that all SNPs contribute to the phenotypic variance, then it will be similar to the LDpred infinitesimal model or SBLUP. It can also run with a BOLT-like prior, similar to BOLT-LMM, and it can run lassosum regression. The difference between MegaPRS and the other methods is that the expected per-SNP heritability can vary with LD and minor allele frequency. So in this talk, we will compare all those methods. We know that when each method was proposed it was already compared with other methods, but the fundamental question we are trying to answer here is: which method should we use in the PGC data?
We use leave-one-cohort-out cross-validation to answer this question and to compare the performance of the different methods. Here's a toy example showing the cross-validation. Each cell here is one cohort: the pink cells are the discovery cohorts and the green cell is the target cohort. In the first round of the analysis, we use the four pink cohorts as the discovery set and then validate the performance of each method in the target sample. We repeat this process until each of those cohorts has served as the target cohort. If the method requires it, another cohort serves as a tuning sample to select the optimal hyperparameters. In the real data analysis, we use the GWAS summary statistics from the second schizophrenia GWAS as the discovery sample. Out of those data, we have access to 33 cohorts where individual-level genotypes are available, and we use each of them as a target sample. For the tuning cohort, we use four cohorts in turn to tune the hyperparameters, and then we can predict the polygenic score into each of the target samples.
Here, we used three statistics to measure the performance of different methods: One is AUC, another one is the proportion of variance explained in the liability scale, and the third one is the odds ratio. I will go through each of them to show how to calculate each of those statistics.
AUC
So first, let's start with AUC. Here is a toy example of how to calculate AUC by hand. AUC is short for the "area under the ROC curve," which is shaded in pink here. The ROC curve is made by plotting the true positive rate against the false positive rate at each possible cutoff. So, what that means is: assume this is the density plot of the polygenic score in the control samples, this one is for the case samples, and this vertical line is the current cutoff. In this case, the samples can be divided into four groups: true negatives, false negatives, true positives, and false positives. We can then calculate the proportion in each group, and from those the true positive rate and the false positive rate, which are the coordinates used in the ROC curve. So, at the current cutoff used here, roughly 75 percent of cases are correctly classified as cases, and about 10 percent of controls are wrongly classified as cases, which gives us the coordinates of this dot here. When we vary this vertical line, this cutoff, we trace out the ROC curve shown on this slide.
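To make the by-hand calculation concrete, here is a minimal Python sketch of the same idea; the PRS values are simulated placeholders rather than the numbers on the slide.

```python
import numpy as np

# Simulated PRS distributions for controls and cases (placeholders, not the slide's data)
rng = np.random.default_rng(0)
prs_controls = rng.normal(0.0, 1.0, 1000)
prs_cases = rng.normal(1.0, 1.0, 1000)     # cases shifted to higher scores

# Sweep the cutoff from high to low and record (FPR, TPR) at each cutoff
cutoffs = np.sort(np.concatenate([prs_controls, prs_cases]))[::-1]
tpr = np.array([(prs_cases >= c).mean() for c in cutoffs])      # cases correctly called cases
fpr = np.array([(prs_controls >= c).mean() for c in cutoffs])   # controls wrongly called cases

# Area under the ROC curve by the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC: {auc:.3f}")
```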
Variance explained in the liability scale:
You've seen the first statistic we use to measure the performance of the different methods; the second one is the variance explained on the liability scale when using ascertained case-control studies. This quantity is a function of the variance explained on the observed scale in the case-control study (this R squared) and another two parameters, C and theta. The variance explained on the observed scale is itself a function of the likelihoods of the null model and the full model, as defined in these two equations. The parameter C is a function of K, z, and P: K is the proportion of the population that is diseased, that is, the prevalence of the disease; z is the density of the standard normal distribution at the threshold t shown here; and P is the proportion of cases in your GWAS or case-control study. The theta parameter is a function of the same K, z, P, and the threshold t, but in a different combination.
So in this slide, I just give the final result of how to calculate the variance explained in the liability scale. The full derivation of this equation can be found in this reference.
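For reference, the transformation described above is usually written as follows (a hedged reconstruction using the talk's notation, where K is the disease prevalence, P the case proportion in the study, t the liability threshold, and z the standard normal density at t; check it against the cited reference):

$$ R^2_{\mathrm{liab}} = \frac{C\,R^2_{\mathrm{obs}}}{1 + C\,\theta\,R^2_{\mathrm{obs}}}, \qquad C = \frac{K(1-K)}{z^2}\cdot\frac{K(1-K)}{P(1-P)}, \qquad \theta = m\,\frac{P-K}{1-K}\left(m\,\frac{P-K}{1-K}-t\right), \quad m = \frac{z}{K}, $$

with the observed-scale variance explained computed from the null-model and full-model likelihoods as

$$ R^2_{\mathrm{obs}} = 1 - \left(\frac{L_0}{L_{\mathrm{full}}}\right)^{2/N}. $$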
Odds ratio:
And the third statistic is the odds ratio. An odds ratio is a ratio between two odds, and an odds is the probability of being a case over the probability of being a control. Here is a toy example showing how to calculate the odds ratio by hand. Say we order the individuals by their polygenic risk score from lowest to highest, and we are interested in comparing the 10th decile with the 1st decile, with the numbers of cases and controls shown in this table. The odds of being a case in the 1st decile is 23 over 103, and the odds of being a case in the 10th decile is 83 divided by 43. The odds ratio between the two deciles is 9.3. This value means that when we order individuals by their polygenic score, the individuals in the 10th decile, that is, the top 10 percent, have 9.3 times higher odds of being a case than the individuals in the bottom 10 percent. This odds ratio can be easily estimated from a logistic regression using the logit link function.
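Here is a minimal Python sketch of the same calculation, both by hand and via logistic regression; the 2x2 counts are hypothetical placeholders rather than the exact numbers on the slide.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical counts of cases and controls in the bottom and top PRS deciles
cases_bottom, controls_bottom = 20, 100   # 1st (bottom) decile
cases_top, controls_top = 80, 40          # 10th (top) decile

# Odds ratio by hand: odds of being a case in the top decile over odds in the bottom decile
or_by_hand = (cases_top / controls_top) / (cases_bottom / controls_bottom)
print(f"odds ratio by hand: {or_by_hand:.1f}")

# The same quantity from a logistic regression of case status on a top-decile indicator
y = np.r_[np.ones(cases_bottom), np.zeros(controls_bottom),
          np.ones(cases_top), np.zeros(controls_top)]
x = np.r_[np.zeros(cases_bottom + controls_bottom),
          np.ones(cases_top + controls_top)]
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(f"odds ratio from logistic regression: {np.exp(fit.params[1]):.1f}")
```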
So using this leave-one-cohort-out strategy, we can assess the AUC, variance explained, and odds ratio for each of the target cohorts. Shown here are the AUC results for each cohort and each method. The different colors stand for the different methods, and the y-axis is the AUC difference relative to P+T, which is the benchmark we used. As you can see, there is a lot of variation across validation cohorts, and that is why we think our comparison is more robust than comparisons that use only one target cohort.
If we summarize these results by method, we get this bar plot. The y-axis is AUC, each group corresponds to one of the methods we compared, and each bar within a group corresponds to a different tuning cohort. We noticed that the methods that formally model the genetic architecture have quite similar performance. This is because psychiatric disorders are highly polygenic. If we look at the results for Alzheimer's disease, which is less polygenic than psychiatric disorders, we observe a bigger difference across methods. We also observed a similar pattern for the variance explained on the liability scale, for the odds ratio between the top 10 percent and the bottom 10 percent, and for the odds ratio between the top 10 percent and the median. We observed that LDpred2, SBayesR, and MegaPRS ranked highest in most of the comparisons.
To summarize, in this talk I showed how to calculate the AUC, the variance explained on the liability scale, and the odds ratio by hand. Based on the comparisons we made, we observed that for psychiatric disorders, which are very polygenic, all the methods perform similarly, but some rank higher than others, for example SBayesR and MegaPRS. The results shown here are part of a recently published study. In that paper, we also performed the comparison for major depression, along with other sensitivity analyses, and we provide the code to run each method, each comparison, and each of the statistics used in the comparisons.
And with this, I would like to give a big thank you to Professor Naomi Wray, who always gives me huge support in whatever I need. Thanks to all the other PGC members, and thank you all.
PRS-CS and PRS-CSx
PRS-CSx to perform PRS analysis
Title: Hands-on tutorial of using PRS-CSx to perform multi-ancestry PRS analysis
Presenter(s):
Tian Ge, PhD (Center for Genomic Medicine, Mass General Research Institute)
Yunfeng Ruan, PhD (Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard)
Yunfeng Ruan:
Hello everyone, this is Yunfeng Ruan. I'm going to introduce how to calculate polygenic risk scores with PRS-CSx. PRS-CSx combines multiple GWAS to increase the accuracy of the polygenic risk score. If you are interested in the algorithm, please refer to our preprint on medRxiv. Our GitHub has a very detailed readme that covers almost all aspects of how to run PRS-CSx. If you are familiar with Python, you may figure out how to run it in no time yourself. Here, I will walk you through the whole process and clarify some details beyond the readme.
This is the workflow of the method. The input is GWAS summary statistics from different populations. PRS-CSx adjusts the effect sizes of each population's GWAS using the corresponding LD reference panel. The adjustment is performed on the SNPs that are shared by the input GWAS, the LD reference panels, and the target data. Therefore, PRS-CSx needs the prefix of the target data so that it can read the SNP list from the .bim file. The output is two sets of adjusted effect sizes, one for each input GWAS.
Next, you calculate the PRS of your target sample based on each set of adjusted effect sizes. Usually, this is done with PLINK. Finally, you linearly combine the PRS from the different populations' GWAS using software like R or MATLAB, and predict the phenotype with the final combined PRS. You can also add more GWAS, so that hopefully you get a better result.
Now, let's see how to run PRS-CSx step by step. PRS-CSx is a command-line Python program. You can use either Python 2 or 3, and you have to install SciPy and h5py. You can use the command "git clone" to download and install it in one step, or you can click the Code button to download it locally and unzip the compressed file. You can test whether the installation was successful by running "prscsx --help"; it will print a long list of options like this.
Next, download the LD reference. We provide links to the pre-calculated LD reference panels in the "Getting Started" section of the readme. We have LD references calculated from 1000 Genomes and LD references calculated from UK Biobank samples. You also need to download the SNP information file and put it in the same directory as the other LD reference files. After you have downloaded the software and the LD reference, you can run your first analysis.
You can download the formatted test data and the example command from the "Test Data" section of the readme to run a test. The test finishes in about one minute, and you will have two .txt files as output. If you use your own data, the first thing to do is to format your summary statistics file. The required format is described in the "Using PRS-CSx" section. I want to highlight that the formatted input must contain a header line and have the right column order and column names.
Now you can finally run the method. Here is a typical command, which contains the path to PRS-CSx, the directory of the LD reference, the prefix of the target data, and a comma-separated list of formatted summary statistics files (please note that you put a comma between items and there should be no spaces within the list). It also takes the list of GWAS sample sizes and the list of the GWAS populations. We highly recommend that you run PRS-CSx per chromosome so the calculation can be parallelized; you specify which chromosome to adjust with the "chrom" option. Finally, you give the hyperparameter phi and the output location.
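If you prefer to script the per-chromosome jobs rather than typing 22 commands, something like the following works. This is only a sketch: paths, sample sizes, and the phi value are placeholders, the option names are the ones listed in the readme (check them against the version you downloaded), and the loop here runs jobs one after another, whereas in practice you would submit each command to a job scheduler so they run in parallel.

```python
import subprocess

sst_files = "EUR_sumstats.txt,EAS_sumstats.txt"   # comma-separated list, no spaces
n_gwas = "300000,100000"                          # matching GWAS sample sizes
pops = "EUR,EAS"                                  # matching populations

for chrom in range(1, 23):
    cmd = [
        "python", "PRScsx.py",
        "--ref_dir=path/to/ldref",        # directory holding the LD reference and SNP info file
        "--bim_prefix=path/to/target",    # prefix of the target .bim file
        f"--sst_file={sst_files}",
        f"--n_gwas={n_gwas}",
        f"--pop={pops}",
        f"--chrom={chrom}",               # one chromosome per job
        "--phi=1e-2",                     # one value of the global shrinkage parameter
        "--out_dir=output",
        "--out_name=test",
    ]
    subprocess.run(cmd, check=True)
```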
To adjust the effect sizes for one trait, you need to run 88 jobs (one job per chromosome for each phi value). For each job, PRS-CSx will print something like this to the screen or the log file. You will get two outputs, one for each population. If you adjusted each chromosome in parallel, you can concatenate the results into one file using the command "cat". Altogether, you will have eight sets of adjusted effect sizes. You can then calculate the PRS based on the adjusted effect sizes using the allelic scoring function in PLINK.
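As a sketch of the concatenation and scoring step (the file names and output pattern below are assumptions; adjust them to whatever PRS-CSx actually wrote, and note that columns 2, 4, and 6 are assumed to hold the SNP ID, effect allele, and posterior effect size):

```python
import subprocess

# Concatenate the per-chromosome output for one population and one phi value
with open("test_EUR_all_chr.txt", "w") as out:
    for chrom in range(1, 23):
        # Adjust this pattern to match your actual PRS-CSx output file names
        with open(f"test_EUR_pst_eff_a1_b0.5_phi1e-02_chr{chrom}.txt") as f:
            out.write(f.read())

# Score the target sample with PLINK's allelic scoring function
subprocess.run([
    "plink", "--bfile", "path/to/target",
    "--score", "test_EUR_all_chr.txt", "2", "4", "6",
    "--out", "prs_EUR",
], check=True)
```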
The last step is to predict the phenotype with the PRS. First, you normalize all the PRS to have a mean of zero and a standard deviation of one. Then you optimize the hyperparameter phi in the validation data: for each phi value, you predict the phenotype with the normalized PRS A and normalized PRS B jointly and record the R-squared. You compare the R-squared across phi values and learn the best phi and the coefficients of the two PRS under that phi.
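Here is one way the validation step might look in Python; every file name and column name is a hypothetical placeholder (the .profile files are assumed to be PLINK output with IID and SCORE columns), so treat this as a sketch of the logic rather than a ready-made script.

```python
import numpy as np
import pandas as pd

phis = ["1e-06", "1e-04", "1e-02", "1e+00"]                 # the phi grid that was run
pheno = pd.read_csv("validation_pheno.txt", sep="\t")       # columns: IID, PHENO

best = None
for phi in phis:
    prs_a = pd.read_csv(f"prs_EUR_phi{phi}.profile", sep=r"\s+")
    prs_b = pd.read_csv(f"prs_EAS_phi{phi}.profile", sep=r"\s+")
    df = pheno.merge(prs_a[["IID", "SCORE"]], on="IID").merge(
        prs_b[["IID", "SCORE"]], on="IID", suffixes=("_A", "_B"))

    # Normalize each PRS to mean 0 and standard deviation 1
    za = (df["SCORE_A"] - df["SCORE_A"].mean()) / df["SCORE_A"].std()
    zb = (df["SCORE_B"] - df["SCORE_B"].mean()) / df["SCORE_B"].std()

    # Regress the phenotype on both normalized PRS jointly and record the R-squared
    X = np.column_stack([np.ones(len(df)), za, zb])
    y = df["PHENO"].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - (y - X @ coef).var() / y.var()
    if best is None or r2 > best[0]:
        best = (r2, phi, coef[1], coef[2])

print("best phi:", best[1], "weights:", best[2], best[3], "validation R^2:", best[0])
```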
Then, in the testing data, you calculate the final result. Knowing the best phi and the coefficients of the two PRS, you choose the PRS under that phi and combine them with the learned coefficients. You then use the combined PRS to calculate the R-squared, and this is your final result. If you have any other questions, you can raise an issue on the GitHub website. We thank Hailiang and Tian for supervising this project, and all team members and data resources.
PRS in Ancestrally-Diverse Populations
Title: PRS-CSx: Improving Cross-Population Polygenic Prediction using Coupled Continuous Shrinkage Priors
Presenter(s): Tian Ge, PhD (Center for Genomic Medicine, Mass General Research Institute)
Host:
It's half past nine, so I will, you know, start by introducing Tian. Tian is a good friend and a good colleague of mine at Mass General Hospital and Harvard Medical School. He's an assistant professor at Harvard Medical School and, you know, trained as a mathematician. He's contributed to many domains in science, including neurogenetics, neuroimaging, statistical methods, and the genetics of neurodevelopment. So for today, I'm very excited to have him talking about a new PRS method that is able to jointly model GWAS summary statistics of multiple ancestries to improve prediction accuracy. So without further ado, Tian, let's get started.
Tian Ge:
Um, thanks for the very generous introduction and for having me today. So, I'm going to talk about our recent work that extends our Bayesian polygenic prediction framework, known as PRS-CS, to PRS-CSx, which can integrate GWAS summary stats from multiple populations in order to improve cross-population polygenic prediction.
Um, so to start, I’d just like to briefly recap the idea of polygenic prediction and then talk about the motivation and intuition behind the PRS-CSx work, which might be useful to see how PRS-CSx works.
As many of you already know, um, polygenic prediction summarizes the effects of genome-wide genetic markers to measure the genetic liability to a complex trait or disorder. So a conventional method to compute polygenic risk score is pruning and thresholding, which is also known as Clumping and Thresholding. So basically, to apply this method, usually, we set up a p-value threshold or screen a range of p-value thresholds. And for each of these p-value thresholds, we only consider SNPs reaching the significance level. We then perform a procedure called LD clumping, which basically retains the most significant SNP in each genomic region and discards all the SNPs that are in LD with the lead SNP. So, for example, in this figure, we will just keep the top SNP in purple and then remove all the SNPs that are in red, orange, green, light blue because they are in LD with the lead SNP. Um, and then finally, we sum up the genotypes of the remaining SNPs, weighted by the effect size estimates from external GWAS.
This pruning and thresholding method is very widely used because it's conceptually intuitive and computationally very efficient, but it has several limitations. For example, it relies on marginal GWAS association statistics, and we know that most of the top associated SNPs in a genomic region may not be causal, so a lot of the time we may be using sub-optimal tagging SNPs to build the PRS. Because of the LD clumping procedure, the method also ignores many of the secondary and tertiary signals in each genomic region. And finally, when we sum up the SNPs, we use the effect size estimates directly from the external GWAS without any adjustment or shrinkage. Large effect size estimates in GWAS may suffer from winner's curse, and most non-causal variants will have noisy non-zero effect size estimates, so by including these SNPs and using the GWAS effect size estimates without adjustment, we add a lot of noise to the PRS. All these limitations reduce the predictive performance of PRS.
So, to address these limitations and improve on the conventional pruning and thresholding method, a more principled framework is to calculate the polygenic risk score by jointly modeling the genetic markers across the genome without any arbitrary pruning and thresholding. To do this, we fit this linear regression problem where the phenotype vector is regressed onto the genotype matrix X; beta here is a vector of SNP effect sizes, and epsilon is a vector that captures non-genetic effects. If we can jointly fit this regression model and get the effect size estimates, denoted beta hat here, we can take the estimates to the target dataset where the genotypes are available and compute the polygenic score.
So the methodological challenge here is that in genomic prediction, we often have many more SNPs than samples. This is an ultra-high-dimensional regression problem, and we need to regularize the effect size estimates to avoid overfitting. We know that the pruning and thresholding method can actually be considered a specific way of regularizing and shrinking the SNP effect sizes, because in essence it shrinks the effect sizes of discarded SNPs to zero and performs no shrinkage on the effect size estimates of the selected SNPs. But we have discussed that this shrinkage scheme may be arbitrary and sub-optimal. There are more principled statistical methods that can be employed for shrinkage. For example, the frequentist approach is to fit a regularized regression using methods like lasso, ridge regression, or elastic net, which often encourage sparse effect size estimates and penalize large effects. So, if you have heard of the method called lassosum, that's one of the polygenic prediction methods that applies the lasso to build PRS.
In the past few years, we have also seen many Bayesian polygenic prediction methods developed. The Bayesian approach to this high-dimensional regression problem is to assign a prior to the SNP effect sizes to impose shrinkage. All these models basically fit the same regression; the difference is which prior distribution is used. The question here is: how do we design a prior, or which prior is optimal, for this type of genomic prediction problem?
The most widely used prior is what we call the infinitesimal normal model, also known as Bayesian ridge regression, where the effect size of each SNP follows a normal distribution. This model is used in many classical statistical genetics methods, including GCTA and LD score regression. All these methods assume this underlying infinitesimal normal genetic architecture, and one major advantage of this prior, and the reason it is so popular, is that it's mathematically tractable and there's a closed-form expression for the posterior.
Here, lambda is a penalty or shrinkage parameter, which depends on two variances: one is the per-SNP variance of the SNP effects, and the other is the residual variance. You can see that if the noise, the residual variance, is large relative to the genetic signal, then we impose strong shrinkage on the effect sizes and beta is shrunk towards zero; in the extreme case of no genetic signal, beta is shrunk all the way to zero. On the other hand, if the genetic signal is large relative to the noise, the estimate will be closer to the least squares estimator. And with this penalty term, the matrix is always invertible, so this is a well-defined estimator.
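Written out, the closed-form estimate mentioned here is the usual Bayesian ridge expression (with $\sigma^2_\beta$ the per-SNP effect variance and $\sigma^2_e$ the residual variance described above):

$$ \hat{\beta} = \left(X^{\top}X + \lambda I\right)^{-1} X^{\top} y, \qquad \lambda = \frac{\sigma^2_e}{\sigma^2_\beta}. $$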
We also noticed that this is a multivariate estimate of SNP effect sizes, and X transpose times X here is proportional to the LD Matrix, so it’s easy to incorporate LD information in this estimator, and in practice, you can always divide the genome into independent LD blocks, and then within each block, you can do this joint estimate of SNP effects.
That said, there are also limitations to this prior. As you can see, the shrinkage parameter is a constant, meaning that under this infinitesimal normal prior all SNPs are treated equally and are shrunk towards zero at the same constant rate. This is sub-optimal because, ideally, we want to impose very strong shrinkage on small, noisy, non-causal signals, but at the same time we don't want to over-shrink large, real signals. What we really want is a shrinkage scheme that adapts to different SNPs and different GWAS signals, but this cannot be achieved with the infinitesimal normal prior because the penalty parameter is a constant and is not SNP-specific.
An alternative way to look at this problem is to look at the shape of the prior distribution, which is normal. The non-adaptive nature of the prior is equivalent to saying that the normal distribution, when used as a prior, doesn't have enough mass around zero to impose strong enough shrinkage on noisy signals. In addition, the normal distribution has exponentially decaying tails. These tails are too thin, meaning that a priori we believe there is very low probability of large effect sizes, so the prior cannot accommodate those large effect sizes, which often leads to over-shrinkage of real signals. That's why Bayesian ridge regression, this infinitesimal normal prior, is not very adaptive to different genetic architectures and usually only works well for highly polygenic traits.
So there are many works, um, trying to design a more flexible prior so that the polygenic model is more adaptive to varying genetic architectures. And one idea is that, in contrast to using a single normal distribution as the prior, we can use a mixture of two or more distributions. So, for example, one pioneering approach in this field, LDpred, uses this Spike and slab prior, which assumes that a fraction of the SNPs are null, so they have no effect on the phenotype, while the rest of the SNPs are causal SNPs, and their effects follow a normal distribution.
Um, if we take a look at the density of this prior, it has two components: a spike component, which is a very narrow distribution centered at zero and is used to model small signals, and a slab component, which is much wider and can be used to model large signals. By varying the proportion of causal variants, which is denoted by pi here, this model can cover a wide range of genetic architectures.
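For reference, the LDpred point-normal prior is usually written as below, with $\pi$ the proportion of causal SNPs, $h^2_g$ the SNP heritability, and $M$ the number of SNPs; the slide's density plot draws the spike as a very narrow normal rather than an exact point mass.

$$ \beta_j \sim \begin{cases} N\!\left(0,\; \dfrac{h^2_g}{M\pi}\right) & \text{with probability } \pi, \\[6pt] 0 & \text{with probability } 1-\pi. \end{cases} $$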
Although this prior is much more flexible than the infinitesimal normal prior, it also has two limitations. Number one, this is a discrete mixture of two components, so we call this type of prior a discrete mixture prior. In posterior inference, each SNP can belong either to the null component or to the normal component, so you can imagine that if there are a million SNPs, we have a discrete model space of 2 to the power of a million possibilities, which is practically impossible to fully explore. Number two, unlike the infinitesimal normal prior, which has a closed-form multivariate update of the SNP effects, the spike and slab prior does not allow for a multivariate effect size update. In posterior inference, one has to update effect sizes SNP by SNP, which makes it very difficult to incorporate LD information in the model estimation procedure.
There are many other Bayesian polygenic prediction methods that have been developed using different priors, but the majority of them are discrete mixture priors. For example, you can parameterize a two-normal mixture differently, using an additive or a multiplicative version. You can also use a null component plus a t-distribution, which gives you heavier tails to model larger signals. SBayesR, another method that has received a lot of attention recently, uses a mixture of four normals, and each of these normals captures the effect size distribution on a different scale, which makes the model even more flexible. And finally, there are non-parametric models where the effect size distribution is assumed to be a mixture of an infinite number of normals, and in posterior inference the data determine the optimal number of mixture components. So these are different variations of discrete mixture priors; they are all discrete mixtures of two or more distributions, so they largely share the advantages and limitations of LDpred.
So just to quickly summarize, so we have discussed that the infinitesimal normal prior is computationally efficient and allows for multivariate modeling of LD dependence, but it’s not robust to varying genetic architectures. While discrete mixture priors, on the other hand, can create much more flexible models for the genetic architecture, they are computationally challenging, and it’s often difficult to incorporate LD information. So our motivation was to design a prior that can combine the advantages of these two types of priors.
So in our PRS-CS work, we introduced a conceptually different class of priors called continuous shrinkage priors. In contrast to the horizontal discrete mixture of normals, we use a hierarchical scale mixture of normals. Here, phi is a global shrinkage parameter, which is similar to the penalty parameter in ridge regression; it is shared across all SNPs and models the overall sparseness of the genetic architecture. Different from the infinitesimal normal prior, we added this local shrinkage parameter, where j is the index of the SNP, so this local shrinkage parameter is SNP-specific and can adapt to different genetic signals.
You can see that if we integrate out these hyperparameters, the density function of this prior is continuous, which can also be seen in this density plot on the right. The dashed black line is the normal prior for reference, and the red and blue lines are the two components of the spike and slab prior, while the yellow line is the continuous shrinkage prior. So you can see that unlike the two-component spike and slab prior, the prior we used is one continuous density function, but it can approximate the shape produced by this discrete mixture prior. And compared to the normal distribution, you can see we put substantial mass near zero, which can impose strong shrinkage on small, uninformative signals, and at the same time, this distribution has heavy polynomial tails, which can retain large and real signals. So the continuous shrinkage prior is almost as flexible as the discrete mixture prior, but because of its continuous nature, it also shares some advantages of infinitesimal normal prior, that is, it allows for the multivariate modeling of LD patterns, and it’s also computationally efficient.
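As a sketch, the scale-mixture hierarchy described here can be written roughly as follows (the exact parameterization should be checked against the PRS-CS paper; $\phi$ is the global and $\psi_j$ the SNP-specific local shrinkage parameter, and $N$ is the GWAS sample size):

$$ \beta_j \mid \psi_j \sim N\!\left(0,\; \frac{\sigma^2}{N}\,\phi\,\psi_j\right), \qquad \psi_j \sim \mathrm{Gamma}(a,\,\delta_j), \qquad \delta_j \sim \mathrm{Gamma}(b,\,1). $$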
These are the motivation and some intuitions behind PRS-CS. I’m not going to talk about other features of the method, but the software is available on GitHub, which you can download and test. So we have released both pre-computed 1000 Genomes and UK Biobank reference panels for major populations, which hopefully has made the application easier. In the initial application of the PRS-CS method, we applied it to some existing GWAS of six common complex diseases and six quantitative traits, and then predicted to the Mass General Brigham Biobank. You can see that PRS-CS substantially improved the prediction over the conventional pruning and thresholding method, and often outperformed LDpred.
Another application that might be relevant to this group is the polygenic prediction of schizophrenia. In this study led by Amanda, we aggregated schizophrenia cases and controls across four healthcare systems, Geisinger, Mount Sinai, Mass General Brigham, and Vanderbilt, as part of the PsycheMERGE Consortium. You can see that the polygenic risk score calculated by PRS-CS correlates with case prevalence and schizophrenia diagnosis, and can be used to identify diseases that are genetically correlated with schizophrenia using a PheWAS design. So that's a review of the ideas behind PRS-CS and some of its applications. But one limitation of PRS-CS is that it was designed and tested in homogeneous populations. It is now well recognized that the cross-population predictive performance of polygenic risk scores decreases dramatically, especially when the target sample is genetically distant from the training sample, due to the predominance of European samples in current GWAS.
There are many factors that can limit the transferability of PRS learned from European GWAS. For example, there may be population-specific variants or variation in SNP effect sizes across populations; allele frequencies and LD patterns differ across populations; and differences in phenotyping or environmental factors can also affect prediction accuracy. In the past few years, there have been many efforts to expand the scale of non-European GWAS and to diversify the samples in genomic research in general. However, the sample sizes of most non-European GWAS remain considerably smaller than those of European studies, so right now they cannot be used to fully characterize the genetic architecture in non-European populations or to dissect the relative contributions of these factors to PRS performance. But one question we can ask is: can we leverage these existing non-European GWAS to improve trans-ethnic prediction of PRS, even if they are smaller than European GWAS?
So we have been working on this method called PRS-CSx, which is a fairly simple extension of the PRS-CS framework. Here, to model existing non-European GWAS, we assume that we have data from K different populations, and we still use a continuous shrinkage prior to model the SNP effects, but this time the prior is shared across populations. You can see that these shrinkage parameters do not depend on k, the index of the population, so they are shared across populations. By using this coupled prior, we implicitly make the assumption that causal variants are largely shared across populations. We think this is a reasonable assumption, given that many recent studies have estimated the trans-ethnic genetic correlations of many complex traits and diseases to be moderate to high. With this coupled prior, we can borrow information across summary statistics and increase the accuracy of effect size estimation, particularly for non-European populations whose GWAS sample sizes are relatively small. The other advantage of the coupled prior is that we can leverage the LD diversity across populations to better localize the GWAS signals. This is very similar to the idea of trans-ethnic fine-mapping; although we are not doing any formal fine-mapping analysis here, we are using the same idea.
Although we use this shared prior across populations, the effect sizes of the SNPs are still allowed to vary across populations, which we believe gives the model more flexibility: we don't constrain effect sizes to be the same across populations. We also allow for population-specific variants, meaning that if a SNP is available in one population but absent in the others, for example because of low frequency in the other populations, we still include that SNP in the model, even though in this case there are no effects to couple. Finally, PRS-CSx inherits many features from PRS-CS: it allows for multivariate modeling of LD patterns using population-specific reference panels, and it is computationally efficient.
In practice, PRS-CSx takes GWAS summary stats and ancestry-matched LD reference panels from multiple populations. It then jointly models all these data, fits the model, and outputs one set of posterior SNP effect sizes for each discovery population. These SNP effect size estimates can then be taken to a validation dataset to calculate one PRS per population. We then learn an optimal linear combination of these PRS in the validation dataset and evaluate the predictive performance of the final PRS in an independent testing dataset.
As a comparison, and in the results I'm going to show, we examined two alternative methods that can combine GWAS summary stats from multiple populations. One method, which we call PT-meta here, performs a fixed-effect meta-analysis of the GWAS and then applies pruning and thresholding to the meta-analyzed GWAS. Since the LD pattern is mixed after the meta-analysis, we tried different LD reference panels in this case and then selected the one with the best predictive performance to evaluate in the testing dataset. The other method, which we call PT-multi, was developed by Alkes Price's group a few years back. It applies pruning and thresholding separately to each GWAS summary statistic; the resulting PRS are then linearly combined in the validation dataset, and the final combined PRS is taken to the testing dataset for evaluation.
Okay, so here are some results. We selected 17 quantitative traits that are shared between the UK Biobank and Biobank Japan. In this analysis, the UK Biobank GWAS sample size is typically three to six times larger than the BBJ (Biobank Japan) sample size. We then trained different PRS: these are the PRS methods applied to the Biobank Japan GWAS only, these are the PRS trained on the UK Biobank data only, and the last three methods are the PRS that combine the GWAS summary statistics from the UK Biobank and Biobank Japan. We trained these different PRS and predicted into different populations in the UK Biobank that are independent of the UK Biobank training GWAS.
You can see, in the first panel, when the target sample is the UK Biobank European population, that PRS trained with the ancestry-matched UK Biobank GWAS perform better than PRS trained with the BBJ GWAS, which is expected. In this case, combining the UK Biobank and BBJ GWAS doesn't help too much; there is a very small improvement in prediction accuracy, probably around five percent, when we combine the UK Biobank and Biobank Japan GWAS. That's likely because the UK Biobank GWAS is already quite powerful, so adding a smaller East Asian GWAS doesn't help much for predicting into European samples. But when we predict into the UK Biobank East Asian samples, you can see that PRS-CSx can increase the prediction accuracy. Here, the median variance explained increased by about 25 percent when comparing PRS-CSx with the PRS trained on the European GWAS, and when comparing with the ancestry-matched PRS trained on the Biobank Japan GWAS, the improvement was even larger, around 80 percent. These results show that we can leverage large-scale European GWAS to improve prediction in non-European populations.
[[about 30 minutes in]]
When we predict into the UK Biobank African samples, the target population didn’t match any of the discovery samples, the Biobank Japan sample, or the UK Biobank sample. Both the European and East Asian samples are genetically distant from the African samples. So, in this case, the improvement in predictions was again limited, and predictions in the African population remain quite low relative to the predictions in European and East Asian populations.
So we asked whether we could add some African samples to the discovery dataset to improve prediction in the African population. Among the 17 traits we examined, seven were also available in the PAGE study, which largely comprises African American and Hispanic/Latino samples. But you can see that the sample size of the PAGE study is much smaller than the UK Biobank and Biobank Japan. So the question here is whether adding a small African GWAS to the discovery dataset can improve prediction. You can see in the right panel of this figure that when integrating the UK Biobank, Biobank Japan, and PAGE summary stats using PRS-CSx, the prediction in the African samples was quite dramatically improved; the improvement in median variance explained was about 70 percent compared with PRS-CSx applied to the UK Biobank and Biobank Japan GWAS only, and the prediction was also much better than the PRS trained on the ancestry-matched PAGE study. These results suggest that we can leverage samples whose ancestry matches the target population to improve prediction, even if the non-European training GWAS is considerably smaller than the European studies. Adding the PAGE study to the discovery dataset also improved the prediction in the other target populations, although to a much lesser extent.
In the last example, we evaluated different PRS methods for the prediction of schizophrenia risk. In this analysis, we used GWAS summary statistics derived from the PGC2 schizophrenia GWAS in European samples and the recent schizophrenia GWAS in East Asian samples led by Max Lam, Chia-Yen Chen, Hailiang, and colleagues. We have access to the individual-level data of seven East Asian cohorts, and we left out one cohort as the validation dataset; this is the dataset we use to learn hyperparameters or the linear combinations of PRS. For the remaining six cohorts, we applied a leave-one-out approach, meaning that we in turn use one of the six cohorts as the testing dataset and meta-analyze the remaining five cohorts with the other East Asian cohorts to generate the discovery GWAS for the East Asian population. We then again built different PRS using the East Asian GWAS only, the European GWAS only, or methods that combine the two GWAS. You can see that PRS-CSx can increase the prediction accuracy relative to methods trained on a single GWAS. The median variance explained had approximately a 50 percent increase relative to PRS trained using the ancestry-matched East Asian GWAS, and the prediction accuracy almost doubled relative to PRS trained using the European GWAS.
On the right panel, you can see that at the tail of the PRS distribution, PRS-CSx can also better stratify patients at top and bottom PRS percentiles relative to other methods. Okay, so I think I’ll stop here and thank all my collaborators. In particular, Yunfeng has led many of the real data analysis in this project, and Hailiang has provided critical inputs in every aspect of the project. He also led the Stanley Center East Asia Initiative, which made the schizophrenia analysis in the East Asian cohorts possible. Our preprint is on medRxiv, and we have also released the software on GitHub. So any feedback or comments will be much appreciated. I will stop here and I’m happy to take any questions.
Host: Great, thanks a lot, Tian, and that's a great talk. So, um, we have 20 minutes for questions.
Participant: Terrific talk, Tian. Really appreciate it. So, um, I’ll just start things off. I’m sure there are other questions, um, but I guess one question I have is, it seems like with these methods, beyond pruning and thresholding, one of the big barriers is that some of them are harder to implement than others. Pruning and thresholding is so easy, um, and so I was wondering if you could just say a little bit about how difficult it is to implement this approach and what information people need to be able to use PRS-CSx.
Tian: Um, yeah, so I guess in many of the analyses we still use pruning and thresholding as a baseline because it's computationally fast and also a robust approach, so you can use that as a comparison. In terms of the Bayesian methods, different implementations and different methods require different computational costs and usually take longer than pruning and thresholding. But we have tried to make the software easy to use: we have released these reference panels, so users usually don't need to calculate their own reference panels, and the computation can be parallelized across chromosomes. Usually, for the longest chromosome, it takes around an hour or an hour and a half to finish. So hopefully it doesn't add too much computational burden on the end user.
Participant: That's really helpful, that's really useful, right? So people just need summary statistics from multiple populations, and of course the other things are provided with your software.
Tian: Yeah.
Participant: great!
Host: So, on that topic, Tian, could you perhaps discuss admixed populations a little bit? You know, I think not every population has a released reference panel. So, what are the considerations here, and what can people do if they want to run this kind of analysis?
Tian: Right, that's a great question. There are a lot of challenges in handling admixed populations because, you know, the LD patterns may depend on the specific study and the proportions of the samples in each study. So I don't think there's a universal reference panel that can be used for all admixed populations. Right now, for example, in this study, when we tried to model the PAGE GWAS summary stats, which is a mixture of African American and Hispanic/Latino samples, so it's an admixed sample, we used an African reference panel to approximate the LD in this situation, and it turned out to work okay. But clearly there is still a lot of work to do in thinking about how to better model admixed populations and how to build reference panels in this case.
Participant: All right, maybe I can ask a question. So, dealing solely with summary statistics is always simpler but at the same time more difficult, right, because you lose a lot of detailed information. But I'm curious, if you have individual-level data, would you be able to do better for admixed populations, because you should have much higher individual-level resolution, including local population structure, right?
Tian: Um, right. So, I think there are two aspects to this question. Number one is: do we lose any information when we use summary stats relative to individual-level data, whether it's a homogeneous population or an admixed population? The answer is that if you only use second-order information, which is basically the LD information, and you assume you have an LD reference panel that can accurately approximate the true LD patterns in your GWAS sample, then there is actually no information lost when we use the summary stats relative to the individual-level data. So the question here is: can we find a reference panel that closely approximates the LD patterns in your GWAS sample? A lot of the time, when the GWAS was conducted in a homogeneous population, we think the reference panel is accurate enough, but if in the future it becomes possible to release in-sample LD information with the GWAS summary statistics, then we can of course do better and get more accurate LD patterns, so we'll have less information loss or less reference-sample mismatch.
And the second part of the question was: if we have individual-level data, can we do better for admixed populations? I think, sure, because with individual-level data you can go beyond LD; you can look at local ancestry, do those decompositions, and build PRS using that local ancestry information, which can of course be much more accurate than treating the whole genome in a homogeneous way. So, going forward, releasing LD information and local ancestry information with GWAS summary statistics might be the key to further improving polygenic prediction in admixed populations.
Participant: Just to follow that, um, so do you have an implementation to deal with individual-level data?
Tian: We don't, actually, but if you have individual-level data, you can just compute the in-sample LD, run the GWAS, and then apply the method to the GWAS summary stats. That should give you highly similar results to a method using the individual-level information.
Participant: Thank you.
Participant: Did we have a question from Laura Slootman?
Participant - Laura: We did, but it was answered because I didn’t, uh, I thought the answer was uh, truncated before the individual-level conversation. I was going to ask specifically about what you just clarified.
Participant: Could I ask a practical question? We usually transform odds ratios to log odds ratios before calculating PRS based on the LD pruning and p-value approach. So my question is: do we also need to do that in PRS-CSx, given that the posterior effect sizes in this approach are relatively small? What happens if we don't convert the odds ratios to log odds ratios before calculating the posterior effect sizes?
Tian: Ah, right. So, PRS-CSx can take odds ratio estimates; it basically just takes the log of the odds ratio and converts it to a standardized beta. So if your GWAS summary stats contain odds ratios, that's fine, PRS-CSx can take them.
Participant: I think it was really encouraging to see how much better the prediction was with the PAGE samples for the African-ancestry individuals. Do you have any ideas about why that worked as well as it did?
Tian: Yeah, that's a good question. I think, number one, there may be population-specific variants and information contained in the PAGE GWAS summary stats that are not available in the GWAS of the other populations. The other possibility might be that when we integrate these GWAS, because the LD blocks are smaller in African samples, we get better localization of the GWAS signals, which also improves the prediction accuracy. But I think there's much work to do to dissect these contributions and see whether we can improve on that.
Participant: I think it’s just, it’s very encouraging because I think sometimes when you see samples for European ancestries that are over an order of magnitude or more larger than the other ancestries, you think, well, maybe it’s not worth including these other ancestries, but it sounds like these data are suggesting that definitely you should.
Participant: I just have a question. I think someone said about the phenotype. I just wonder if you think that, you know, for the Latin population, it’s because, you know, I think the sample size is small, if you have a better phenotype, something like that, should you have more chance to, you know, to, I don’t know, but to predict? I think this meta-analysis or phenotype is something, it’s a variable that matters here or not, you know, it’s just I don’t know if you start a project in Latin Americans that is totally mixed and you don’t know and you have a sample, there’s more sample size, do you think it’s good to spend, I don’t know, money or time having a deep phenotype or this doesn’t matter?
Tian: So the phenotyping definitely influences the prediction accuracy of PRS, and if you have very different phenotyping in, say, your training and target populations, that may reduce the prediction accuracy. In much of our work, when we try different phenotyping methods, for example to predict depression, there are different ways to define the phenotype, like using ICD codes or using rule-based or algorithm-based definitions of depression cases and controls, and they do give us meaningfully different prediction accuracies. A lot of the time there's a balance here, because if you use the simpler ICD-based method, you get more cases and controls, but if you use the more stringent definition, you get higher specificity but sometimes lower power because the number of cases is reduced. So, again, there are many factors that contribute here: the sample size, how the phenotyping matches between the discovery and target datasets, and how specific the phenotyping algorithm is. A lot of the time, when you conduct a PRS analysis, these phenotyping issues are beyond our control because we only use GWAS summary stats or test the PRS in existing cohorts. But if you have control over the phenotyping algorithms, I think deeper phenotyping can sometimes boost the prediction of PRS.
Participant: Okay, so another question: just today I saw a talk at a conference where they used families, trios, to compute the polygenic risk score, and they have this analysis using trios. What do you think about that? Is it a good strategy?
Tian: I think trio or family studies provide additional opportunities that you don't have with PRS analyses of unrelated individuals. For example, you can better control for environmental factors, and sometimes you can decompose and dissect the effects of transmitted and non-transmitted alleles. There are interesting questions that you can only address in families or trios. A lot of the work we do is PRS analysis in population-based cohorts, trying to stratify patients, for example, but family studies also give a different, unique perspective where you can look at the relative contributions of genetics and environment. So I think both study designs are useful and can be used to answer different questions.
Participant: Could your method be used in that kind of approach?
Tian: That's a good question. Right now, probably not, because when we built the model, we assumed the GWAS summary stats were calculated on a large sample of unrelated individuals. So if you want to conduct a PRS analysis that is specific to a family or trio design, you would probably need to look into more specialized methods that can better address the questions there.
Participant: I have a question about how you handle the LD matrices in cross-population prediction. … [unintelligible] … I remember there was a paper about estimating cross-population prediction accuracy, and it said the trickiest part of cross-population prediction is the different LD matrices and those variances. So in your method, how are these two parts handled?
Tian: So I actually had some trouble hearing the first part of the question.
Participant: Well, maybe I can speak a bit louder. Is it a little bit better now? Can you hear me? The voice quality is not great; speaking up seems to help a little, but it's still a bit difficult to hear.
Participant: Sorry, I think it’s better sending the question in the chat if that’s okay.
Participant: Then I will type the question. Sorry, I don’t know if you have time.
Tian: but yeah, no problem. If you have any questions, you can also, you know, email me afterward.
Participant: Yeah, the question was, how are the two matrices handled in your method? Two different LD matrices?
Tian: Um, right. So if you have GWAS summary stats from different populations, you'll have one LD reference panel matching the ancestry of each GWAS. Then, when we do the effect size estimation, the LD is taken care of: the effect sizes in the different populations are modeled with the matching LD reference panels.
Host: And there's a follow-up question: how are variants with different minor allele frequencies handled in the different matrices?
Tian: Um, how minor allele frequencies are handled. Well, I'm not sure how to answer this question. When it comes to the LD matrix, we just use a population-specific reference panel to compute the LD matrix, and that gives you the matrix for each population. Then, when we model the effect sizes, the relationships between SNPs in each population are mapped to the population-specific LD reference panel. I'm not sure if that answers the question.
Host: Alright, so I think we’re at the hour. Thanks, uh, Tian for giving this great talk, and thanks everyone for joining us for this, uh, for this meeting. And we look forward to seeing each other again in a month!
Participant: I think the next meeting is May 5th. And I think we’ll be back at the, uh, 1:05 PM Eastern.
Participant: Great, thanks so much, Tian, and great to see everyone. Bye, folks. Bye.
PRSice2 and LASSOSUM
Title: PRSice: Basic Polygenic analysis
Presenter(s): Shing Wan Choi, PhD (Regeneron)
Shing Wan Choi:
So, hi there, today we are going to talk a little bit about PRSice, and because it's a relatively simple piece of software, we'll also talk about some of the basics of polygenic score analysis. My name is Sam, Sam Choi; I'm from O'Reilly's lab at the Icahn School of Medicine at Mount Sinai. Now, first things first, we have to go over the fundamentals of polygenic scores.
So a polygenic score can be calculated as the sum of the betas times x, divided by N. The betas are the effect sizes and the x's are the individual genotypes, so essentially it's a weighted score of your genotype dosages. You'll notice that there's an N here, which is the number of alleles; this helps us account for missingness, and it is slightly different from what you usually see in polygenic score equations.
To illustrate, assume that we have three individuals who each carry different alleles; some genotypes might be missing. For each SNP, we have something called the effective allele. For example, for SNP A the effective allele is A, so what we are saying is that allele A has an effect size of 0.1 relative to the reference, the non-effective allele, which is G. For SNP B, the effective allele is G and the effect size is -2; effect sizes don't have to all be positive, they can be positive or negative. And SNP C has effective allele T with an effect size of four, which is huge. But anyway, let's go ahead and see how we calculate the polygenic score.
So, for each individual, we basically multiply the allele counts by the corresponding effect sizes. For individual A, there are two effective alleles for SNP A, so we multiply the effect size by two and get 0.2. The second individual has one effective allele, so we multiply the beta by one and get 0.1. The third individual has no effective alleles, so the score is zero. The second SNP works almost the same way, but when we have missing data, we usually just assign a zero, or sometimes we do mean imputation, which uses the population allele frequency of the effective allele in place of the missing genotype. So if this is our population, for example, we might have a 0.25 allele frequency for G, because only two samples have observed genotypes, that is four alleles, and there is only one copy of G. But here we'll just use zero, and so on and so forth. We then sum every one of these per-SNP scores into one total score. So for individual A, that is 4 plus 0.2, giving 4.2, divided by the number of alleles: one, two, three, four, five, six alleles. The same goes for the second individual, but for the last sample, because there are two missing alleles, we divide by four instead of six. Now, if we do mean imputation, we would still divide by six; that's the kind of fine control you have over how missingness is handled. Some software, like PRSice, has options to give you finer control over this kind of imputation, while some other software will always just assign a score of zero to missing genotypes.
After that, we can calculate the final scores: the first person has a score of 0.7, the second has a score of -0.317, and the last person has a score of two. So this is the general idea of how a polygenic score is calculated.
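As a rough sketch of the arithmetic just described (the per-SNP allele counts below are my reconstruction, chosen to reproduce the totals quoted in the talk; they are not taken from the slide itself):

```r
# Effect sizes of the effective alleles for the three example SNPs
betas <- c(SNP_A = 0.1, SNP_B = -2, SNP_C = 4)

# Counts of the effective allele per individual (NA = missing genotype);
# one configuration consistent with the scores 0.7, -0.317 and 2 quoted above
geno <- rbind(ind1 = c(2, 0, 1),
              ind2 = c(1, 1, 0),
              ind3 = c(0, NA, 2))

avg_score <- function(g, b) {
  n_alleles <- 2 * sum(!is.na(g))   # drop missing alleles from the denominator
  g[is.na(g)] <- 0                  # score missing genotypes as zero
  sum(g * b) / n_alleles
}

round(apply(geno, 1, avg_score, b = betas), 3)
#  ind1   ind2   ind3
# 0.700 -0.317  2.000
```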
Now, there are actually alternative formulas for PRS calculation. One of the more common ones is the sum score, instead of the average score we are using here. The main difference is that the denominator is removed from the equation, and this is usually what is shown in the literature. The problem with the sum score is that it can be biased by the level of missingness. If, for example, there is an individual with a large amount of genotype missingness, then by definition that sample's score will be relatively smaller than other individuals'. When you try to rank samples or look at their distribution across the population, this will introduce a bias that is correlated with the level of genotype missingness, and that might not be ideal. That's why, for PRSice and PLINK, the default setting is the average score, not the sum score.
Now, another popular formula is the standardized score. What it does is take the average score across the whole population and standardize it so that it has a mean of zero and a standard deviation of one. The problem with the standardized score is that for binary traits, for case-control samples, the standardization can be affected by ascertainment. Imagine a dataset with 50% cases and 50% controls for a trait with a population prevalence of 1%; your data is enriched with cases. When you calculate the mean and standard deviation, the cases contribute more than they would in the population, and this introduces a bias. It's not as much of a problem for quantitative traits, so this is more specific to case-control data. In PRSice, we have an option to standardize your score using only the controls, on the assumption that the controls are a more representative, more random sample. That is something you can do and something you should consider when you do the analysis.
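A minimal sketch contrasting the three definitions, assuming the `geno` and `betas` objects from the previous snippet and a hypothetical logical vector `is_control` flagging the control samples:

```r
contrib   <- sweep(geno, 2, betas, `*`)               # per-SNP contributions; NA where the genotype is missing
sum_score <- rowSums(contrib, na.rm = TRUE)           # sum score: affected by how many SNPs are non-missing
avg_score <- sum_score / (2 * rowSums(!is.na(geno)))  # average score: divide by the number of non-missing alleles

# Standardized score with mean 0 / SD 1; for case-control data, standardizing on the
# controls only avoids the ascertainment problem described above.
std_score <- (avg_score - mean(avg_score[is_control])) / sd(avg_score[is_control])
```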
Now, as I just mentioned, there are also different ways of handling missingness. One is the mean imputation I just described, where you use the allele frequency from your reference genome or your target sample to fill in the missing genotype. The other is zero-centering, where you make sure the missing genotypes contribute a value of zero by subtracting the expected score based on the allele frequency, so it's similar. Now, the main challenges in polygenic score analysis come down to two factors. The first one is winner's curse. When you select SNPs or variants for PRS construction, they are usually selected because they're significant or because they have large effect sizes, and unfortunately the variants that get selected tend to have inflated effect size estimates and overly significant p-values; that's why they got selected, and this is something we have to account for in the analysis. The other problem is linkage disequilibrium. When we do GWAS, linkage disequilibrium helps us a lot, because it reduces the number of tag SNPs we need while still giving good coverage of the genome. But when we calculate a polygenic score, if we have multiple variants in the same region in high LD, then the effect of that particular region will be double-counted, and that will cause a bias.
Now, to deal with winner's curse in polygenic scoring, we usually do something called shrinkage. What we essentially do is push down the effect sizes by a certain factor, using different methods, and the idea is that by shrinking these effect sizes we get a better representation of the true effects and, hopefully, a better estimate. The other issue is linkage disequilibrium. GWAS still relies a lot on tagging, so the causal variants are not necessarily genotyped. That makes it quite difficult to say which variant is the true causal variant; we can only guess, and this leads to the situation where, when we calculate a polygenic score, variants in LD with each other can lead to double counting. Imagine the most extreme scenario, where two variants are in perfect LD: when you calculate the polygenic score, you are essentially counting the same signal twice in the same model. Now, if you imagine a region of the genome that is in high LD, like the MHC region on chromosome 6, then your score will get dominated by that kind of region, and you don't want that, because it causes bias.
So, we introduce a method called PRSice, or rather a software called PRSice. It's a simple approach to polygenic score analysis. It mainly uses two techniques: clumping to account for LD, and p-value thresholding as a form of shrinkage. It is computationally very efficient and, relative to other software, slightly easier to understand, although it does tend to have lower performance in terms of R-square and association p-values. But we do argue that it is a very good starting point for getting hold of your data and getting a general idea of what your polygenic score will look like, and then improving upon it, maybe using more complex, more advanced software like lassosum, LDpred2, PRS-CS, or PRS-CSx, all of which you might have been exposed to during the PGC days. PRSice was first developed by Jack Euesden at King's College London and is now maintained by me and Dr. Paul O'Reilly.
So, for linkage disequilibrium, PRSice performs something called clumping. Imagine we have a bunch of variants; I use these crappy DNA symbols to represent variants, and I rank them by their significance, so the higher they are, the more significant they are. They are in LD, so they're linked together. Now, if I don't do anything and just calculate the score, these variants can get double-counted, or contribute a disproportionate share of the effect to the polygenic score. What PRSice does is look at these variants, select the most significant one, and then remove any variants that are in LD with it and have a less significant p-value. So here, only the pink variants remain after clumping, and those are used for the polygenic score calculation.
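As a rough sketch of that greedy idea (a toy version, not PLINK's or PRSice's actual implementation, and assuming a precomputed pairwise r-squared matrix with ones on the diagonal):

```r
# Simplified sketch of p-value-informed clumping: keep the most significant variant,
# drop everything in LD with it above an r2 cutoff, then repeat with what is left.
clump <- function(pval, r2, r2_cutoff = 0.1) {
  keep <- integer(0)
  candidates <- order(pval)                      # variants from most to least significant
  while (length(candidates) > 0) {
    top  <- candidates[1]
    keep <- c(keep, top)
    # drop the index variant (r2 with itself is 1) and everything in LD with it
    candidates <- candidates[r2[top, candidates] < r2_cutoff]
  }
  keep
}
```

Real implementations typically also restrict the LD comparison to a physical window around the index variant rather than using a full r-squared matrix.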
Next up, we do p-value thresholding. Imagine we have a GWAS; I think this is the PGC 108 freeze 3 GWAS. What we do is perform a kind of hard thresholding. What I mean is, say we only select SNPs with a p-value less than 1E-8; any variant with a p-value higher than 1E-8 gets its effect size shrunk to zero, so the effect size becomes zero. PRSice will iteratively change the p-value threshold, and, as you can see, we retain different portions of SNPs in the polygenic score and shrink the effect sizes of the remaining variants to zero. The way PRSice selects the p-value threshold is that, at each threshold, we get a PRS and regress it against the phenotype of interest. We then say that the threshold whose regression model gives the best performance, the highest R-square, is the best threshold. Now, a lot of people point out that this might be overfitting, and because of this model optimization it does cause some overfitting, but this overfitting is inherent to polygenic score model optimization, not just to PRSice.
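A hedged sketch of that thresholding loop, assuming `geno_clumped` (effective-allele counts for the post-clumping variants), `betas`, `pval`, and a phenotype vector `pheno`; this illustrates the idea, not PRSice's internals:

```r
# For each threshold: zero out effect sizes above it, build the PRS, regress the
# phenotype on the PRS, and keep the threshold with the highest R^2.
thresholds <- c(5e-8, 1e-5, 1e-3, 0.01, 0.05, 0.1, 0.5, 1)
fits <- sapply(thresholds, function(thr) {
  b_thr <- ifelse(pval <= thr, betas, 0)          # hard shrinkage to zero above the threshold
  prs   <- rowSums(sweep(geno_clumped, 2, b_thr, `*`), na.rm = TRUE)
  summary(lm(pheno ~ prs))$r.squared
})
best_thr <- thresholds[which.max(fits)]
```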
So, what we developed in PRSice is a permutation procedure to calculate an empirical p-value. What it does is first take the best PRS and record the p-value it gives us. We then shuffle the phenotype, rerun the optimization to obtain a new "best" p-value, and repeat this, say, 10,000 times. This gives us a null distribution of possible p-values, and we can then calculate the empirical p-value as the proportion of times we observe a p-value more significant than the observed one. This helps to mitigate the overfitting problem of testing many thresholds.
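Continuing the sketch above (same assumed objects), a permutation-based empirical p-value could be computed roughly like this:

```r
# Best p-value across thresholds for a given phenotype vector
get_best_p <- function(y) {
  pvals <- sapply(thresholds, function(thr) {
    b_thr <- ifelse(pval <= thr, betas, 0)
    prs   <- rowSums(sweep(geno_clumped, 2, b_thr, `*`), na.rm = TRUE)
    if (sd(prs) == 0) return(NA_real_)            # no SNPs pass this threshold
    summary(lm(y ~ prs))$coefficients["prs", "Pr(>|t|)"]
  })
  min(pvals, na.rm = TRUE)
}

obs_p  <- get_best_p(pheno)
null_p <- replicate(10000, get_best_p(sample(pheno)))   # shuffle the phenotype each time (slow; illustration only)
emp_p  <- (sum(null_p <= obs_p) + 1) / (length(null_p) + 1)
```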
So that's the gist of it. Now we'll go through some general principles of running PRSice and some of the things you need to think about when you do it. Before you start a PRSice analysis, you'll need the GWAS summary statistics, and you must have the SNP ID, effect size, effective allele, and p-value columns in the summary statistics; those are the bare minimum. The SNP ID is for matching the genotype data with the summary statistic data. The effect size gives us the per-SNP weight. The effective allele tells us which allele the effect size refers to, so that we can do strand flipping, and the p-value is used for clumping and thresholding. You also need genotype data for the target samples, and these samples must not be included in the GWAS, otherwise you'll get into trouble. This is fundamental to all PRS methods: you must not have any of your target samples present in your GWAS, or you will get highly inflated, abnormal results. Now, PRSice accepts files in PLINK or bgen format, so you can use them directly. You can also give PRSice the phenotype data and the appropriate covariates, if they are not already included in the fam file or the sample file. And you can give PRSice an external LD reference, like 1000 Genomes, if your target sample size is small, say fewer than 500 samples; a sample that small usually doesn't have enough power for PRS analysis anyway, but the external reference might still be useful.
Now, PRSice is mainly divided into two parts: the PRSice executable, which is used for the core analysis, and PRSice.R, which is an R script that is useful for visualization. Before you start, though, you should perform some basic quality control on your data. For example, if you have any duplicated SNPs in your GWAS, you should decide whether you want to keep one of those copies or remove both. You should also do common GWAS filtering on your target data: for example, remove related samples, remove samples with sex mismatch, remove SNPs or samples with excessive missingness, and so on, just to ensure high-quality data. If you want more detail, you can refer to the PRS tutorial paper in Nature Protocols, which provides more of an outline of these steps.
Running PRSice is hopefully intuitive; we do try to make sure that our documentation is clear and our parameters make as much sense as possible. To run PRSice you just type PRSice, give it the target sample by providing the PLINK file prefix, and give it the base file, which is the summary statistic file. If your GWAS has a standard format, for example the SNP ID column is called SNP, the effect size column is called BETA, the p-value column is called P, and the effective allele column is called A1, then PRSice will detect those automatically. Otherwise, you can always use the appropriate parameters, --A1 for the effective allele, --stat for the effect size, and so on, to tell PRSice which columns correspond to which information. You can also give it a phenotype file using --pheno, and use --pheno-col to indicate which column you want if there are multiple columns. You then need to tell PRSice whether the target phenotype is continuous or binary using --binary-target. If it is continuous, PRSice will use linear regression to identify the best threshold, and if it is binary, it will do logistic regression. You can also provide a covariate file with --cov and use --cov-col to indicate which columns to include as covariates. One of the most asked-about features is --print-snp: people ask whether there is any way to show which SNPs are included in the PRS, and the --print-snp option is your friend. Finally, to get the empirical p-value, you use --perm 10000. We recommend at least 10,000 permutations, because that's when you get a relatively stable p-value around 0.05; you can go higher, but it's computationally intensive. If you want the plots, all you need to do is run the same command through Rscript instead: Rscript PRSice.R, then --prsice to tell it where the executable is, and all the other arguments are identical.
Now, one of the most important outputs from PRSice is the .summary file, which shows the best model fit. It contains the phenotype; the Set (this is related to PRSet; for a standard PRSice run there is just the single genome-wide set); the threshold, which tells you which p-value threshold was used; PRS.R2, the variance explained by the PRS alone; Full.R2, the variance explained by the full model, including the covariates and the PRS; and Null.R2, the variance explained by the covariates only. If you have a binary trait, you can give PRSice the population prevalence, and it will adjust the R-square to account for ascertainment bias; if not, this column will just be a dash. The Coefficient is the regression coefficient. It can be huge if you use the average score, because a one-unit change in the polygenic score, when you take the average and given the small effect sizes, corresponds to a big change in phenotype, so you will get a large coefficient. If you want it smaller, the standardized score will generally give you a smaller coefficient, but it shouldn't change your p-value or R-square much. The P column is the p-value of the model fit, Standard.Error is the standard error, and Num_SNP is the number of SNPs included in the model. If you run the permutations, you will also get an empirical p-value column. You can also look at the model fit at each individual threshold in the .prsice file, but we generally don't find that very informative, so we ignore it quite a lot. The other important output is the .best file, which contains the scores at the best p-value threshold. It has four columns: FID and IID are the individual IDs provided in the PLINK files; In_Regression indicates whether the sample was included in the regression model, so if a sample has a missing phenotype or other required information we will still calculate its polygenic score, but this column will say "No" to tell you it was not in the regression model; and finally PRS, which is the polygenic score column.
If you use the R script, you will also get some useful plots. This is the bar plot, which is quite famous. It shows you the R-square at each p-value threshold, with the corresponding p-value printed at the top of each bar and indicated by the colors. So you can see immediately that the best p-value threshold here is 0.4463, with an R-square of roughly 0.05 and a p-value of 4.7 times 10 to the minus 18. Alternatively, we also have a high-resolution plot. If you use the high-resolution option of PRSice, it shows you the model fit at each individual p-value threshold. The green lines mark the thresholds that were included in the bar plot, and this shows us where the best p-value threshold is and the corresponding p-values.
So, for common problems we encounter with polygenic score analysis, PRSice does try to catch as many of them as possible. We will warn about duplicated samples, duplicated variants, excessive missingness in covariates, wrong definitions of phenotypes, and so on. We try to do that, but we cannot detect sample overlap; that is something you have to check yourself. A general rule of thumb is: if your results are too good to be true, for example everything is very significant, the R-square is extremely high, or the bar plot is basically flat across thresholds, then most likely you have some sample overlap.
So, this is almost the end. If you want more information about PRSice, you can go to PRSice.info, which is our official website. Or, if you want to know more about polygenic scores in general, how to do a polygenic score analysis and what the concepts are, you can go to our polygenic score tutorial at choishingwan.github.io/PRS-tutorial/. So, that's it! Thank you very much.
LDpred2
Title: How to run LDpred2
Presenter(s): Florian Privé, PhD (National Centre for Register-Based Research, Aarhus University)
Florian Privé:
Hello, my name is Florian Privé. I’m a postdoc at Aarhus University, and I will be talking today in this tutorial video about how to compute a polygenic score using LDpred2. This video was made for the PGC Day at the WCPG 2021.
So, if you go to the tutorial for LDpred2 (link: https://privefl.github.io/bigsnpr/articles/LDpred2.html), I explain how to do the whole computation using some small data. Actually, if you don't have enough data to make your own LD reference, meaning at least 2,000 individuals, then in the LDpred2 paper we provide some LD references based on the UK Biobank, so a very large sample size that you can use. Here, you can see the variants that are used in this data; basically, this is a bit more than one million HapMap3 variants, where you have information like chromosome, position, and alleles, you have some pre-computed LD scores, and you also have the positions in different genome builds. So if you want, for example, to match with summary statistics that use a different build for the positions, you can match directly using those.
Okay, so this is the example from the tutorial. In this tutorial we just use some small fake data, for the purpose of not having to download large files and so that it runs quickly. You can download this tutorial data and unzip it, and it will give a directory called "tmp-data." Then I just read the data from the PLINK BED format into the format that is used by the bigsnpr package. This format is easy: it's just a list with basically three elements. The first one is the genotype matrix, which is stored using a special format, a matrix format stored on disk. Then you have two data frames: one with information on the individuals, and one with information on the variants. I also read the summary statistics; again, these are fake summary statistics. Here I'm using only 50,000 genetic variants, just to keep it small and quick to run.
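A hedged sketch of that step, loosely following the linked tutorial (the file names here are placeholders, not the tutorial's actual files):

```r
library(bigsnpr)

# Convert the PLINK BED file once into the bigSNP backing format (.rds/.bk), then attach it
snp_readBed("tmp-data/example.bed")
obj.bigSNP <- snp_attach("tmp-data/example.rds")

G        <- obj.bigSNP$genotypes                       # file-backed genotype matrix
fam      <- obj.bigSNP$fam                             # information on the individuals
sumstats <- bigreadr::fread2("tmp-data/sumstats.txt")  # GWAS summary statistics
```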
The data that we need to run LDpred2 are the effect sizes (beta), the standard errors of the effect sizes, and the sample size; if you have the sample size per variant, that's even better. We also need the chromosome, position, and alleles, to match your summary statistics with the data that you're using. Just for the purpose of this tutorial, we split the data that we have, so only about 500 people, into a validation set and a test set. The validation set will be used to tune the hyperparameters, that is, to choose the parameters giving the best prediction, and the test set is used to see how the final prediction performs.
The first thing we need to do is match the genotype data that we use with the summary statistics. We can match by chromosome, position, and alleles; to do that, there is a function called "snp_match." If you run this, you will see that basically almost no variants are matched. The problem here is that the summary statistics and the data are not in the same genome build. What you can do is modify the build using liftover, with the function provided for that, or, if you use the LD reference files that we provide, use one of the other columns giving the positions in different builds. Alternatively, instead of joining by positions, you can join by the rsIDs; for simplicity, that's what we'll do here. And now you can see that many more variants are matched.
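A hedged sketch of the matching step; the column renaming mirrors the bigsnpr tutorial, the summary statistics are assumed to carry rsid, chr, pos, a0, a1, beta, beta_se, and n_eff columns, and `join_by_pos = FALSE` switches the join to rsIDs:

```r
# Rename the bigSNP map columns to what snp_match() expects
map <- setNames(obj.bigSNP$map[-3], c("chr", "rsid", "pos", "a1", "a0"))

# Default is to match by chromosome/position/alleles; with mismatched builds,
# either lift over the positions or, as here, match on rsIDs instead
df_beta <- snp_match(sumstats, map, join_by_pos = FALSE)
```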
Okay, now that we have matched the variants, we also need an LD reference. You can either use one of the LD references that we provide or compute one from your own data. In LDpred2, we recommend using a window size of 3 centimorgans, so you need the genetic positions, in centimorgans. For that, you can use this function, which converts the physical positions into genetic positions.
Just for simplicity here, I've pre-computed them, and you can just get them from this file. Then, per chromosome, we compute the correlation using the "snp_cor" function, using a 3 centimorgan window with these genetic positions; run in parallel, it should be quite fast. From this correlation matrix, which is just a standard sparse R matrix, we can get the LD scores, because we'll need the LD scores to run LD score regression later. Again, if you're using the LD files that we provide, we provide pre-computed LD scores. But remember that if you're using LD score regression with those LD scores, you need to use the total number of variants corresponding to them, not just the subset of matched variants. So, be aware of that.
Then, I convert this matrix into another sparse matrix format that is stored on disk, just to make it more efficient to parallelize over this LD matrix. This compact = TRUE option is quite new; it makes the matrix roughly twice as compact. If you use the HapMap3 variants, the size of this data on disk should be approximately 15 or 16 gigabytes, and you need to be able to keep it in cache so that it's fast. You also need to store all the outputs from LDpred2, so requesting something like 50 or 60 gigabytes of memory should normally be more than enough.
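Putting the last few steps together, a hedged sketch (it assumes the matched `df_beta` from above carries beta, beta_se, and n_eff columns; in the tutorial the genetic positions come pre-computed rather than being derived on the fly, and the file paths are placeholders):

```r
NCORES <- nb_cores()
POS2 <- snp_asGeneticPos(map$chr, map$pos, dir = "tmp-data")  # physical -> genetic positions (cM)

corr <- NULL
ld   <- c()
for (chr in unique(df_beta$chr)) {
  ind.chr  <- which(df_beta$chr == chr)
  ind.chr2 <- df_beta[["_NUM_ID_"]][ind.chr]      # column indices of the matched variants in G
  corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000,
                   infos.pos = POS2[ind.chr2], ncores = NCORES)
  ld <- c(ld, Matrix::colSums(corr0^2))           # per-variant LD scores
  if (is.null(corr)) {
    corr <- bigsparser::as_SFBM(corr0, "tmp-data/corr", compact = TRUE)  # sparse LD matrix on disk
  } else {
    corr$add_columns(corr0, nrow(corr))
  }
}

# SNP heritability from LD score regression, used as input for LDpred2 below
ldsc   <- with(df_beta, snp_ldsc(ld, length(ld), chi2 = (beta / beta_se)^2,
                                 sample_size = n_eff, blocks = NULL))
h2_est <- ldsc[["h2"]]
```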
Then we can run the different models. First, the infinitesimal model, which assumes that all variants are causal; there is an analytical formula for that, so it's just a matter of solving a linear system and it should be quite fast. You get the effect sizes, then you can get the prediction by multiplying the genotype matrix with these effect sizes, and then look at the correlation between the prediction and the actual phenotype in the test set; here we get a correlation of about 32 percent.
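A hedged sketch of the infinitesimal model and its evaluation, assuming `y` is the phenotype and `ind.test` holds the test-set indices from the earlier split:

```r
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test,
                        ind.col = df_beta[["_NUM_ID_"]])
cor(pred_inf, y[ind.test])   # around 0.32 in the run described above
```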
Then we can run LDpred2(-grid), which has two hyperparameters: the SNP heritability h2, and the proportion of causal variants p, which can be smaller than one if you don't expect all variants to be causal. For the heritability, we use the heritability estimate from LD score regression. Because this estimate may not be very precise, for example when the trait you are interested in is not very polygenic, we also try 0.7 times this value and 1.4 times this value. And we also add 0.3 times this value, because in our experience using a smaller heritability value induces more shrinkage, more regularization, which is sometimes useful when the quality of the summary statistics is not that good.
For p, we use a sequence from 10 to the minus 5 to 1 on the log scale, and then we have this parameter "sparse." If you use sparse = TRUE, it will make sure that some effects are exactly zero, to make the final vector of effect sizes sparse; many of the effect sizes will be exactly zero, so those variants won't be used in the model. So we compute the betas corresponding to all these models, a lot of models running in parallel, then get the predictions from all these models, and then, in the validation set, we compute some kind of score. Here we use the t-score from a linear regression of the phenotype on the polygenic score. You can include some covariates here as well, and if you have a binary trait, you can use a GLM with family = binomial.
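A hedged sketch of LDpred2-grid and the validation scoring just described, assuming `ind.val` holds the validation-set indices:

```r
h2_seq <- round(h2_est * c(0.3, 0.7, 1, 1.4), 4)
p_seq  <- signif(seq_log(1e-5, 1, length.out = 21), 2)
params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))

beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
pred_grid <- big_prodMat(G, beta_grid, ind.row = ind.val,
                         ind.col = df_beta[["_NUM_ID_"]])

# Score each model by the t-statistic of a regression of the phenotype on the PRS
# (use glm(..., family = binomial) instead of lm() for a binary trait)
params$score <- apply(pred_grid, 2, function(x) {
  if (all(is.na(x)) || sd(x) == 0) return(NA)
  summary(lm(y[ind.val] ~ x))$coefficients["x", 3]
})
best_beta_grid <- beta_grid[, which.max(params$score)]
```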
Then we can look at all the models using ggplot2 if we want: the proportion of causal variants on the x-axis, the heritability as the color, the score we get on the y-axis, with panels for sparse = FALSE and sparse = TRUE. We can also look at the best models. If you care about sparsity, then you could choose a sparse model that gets almost the same score in the validation set as the best model, but gives you a very sparse vector of effect sizes.
Then we take the best model in the validation set, compute the final prediction in the test set, and look at the correlation with the phenotype; we get a much better prediction than with the infinitesimal model. With LDpred2 it's usually the case that you get a better prediction with LDpred2-grid than with LDpred2-inf. And then we have LDpred2-auto, which, instead of trying many values for the hyperparameters, just estimates them from the data.
So the heritability and the proportion of causal variants are estimated directly from the data. We actually run many models here as well, because we run many chains with different initial values for p, again in parallel, so here you get 30 different models.
You can look at one of them, where you get the betas, the estimate of p, the proportion of causal variants, the h2, the SNP heritability estimate, and so on. You can plot them with ggplot2, for example the path of the estimates across iterations, and then you can get all the predictions. You should then have a way to filter out the chains that, for example, didn't converge or diverged. To do that, we propose two approaches: the first is based on computing the standard deviation of the predictors and keeping the chains that are close to each other, or you can compare the heritability and proportion-of-causal estimates and keep the chains that are close to each other. The final prediction is then just the average over the chains that you keep.
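A hedged sketch of LDpred2-auto and the standard-deviation filter on the chains:

```r
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.2, length.out = 30),
                               ncores = NCORES)

beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test,
                         ind.col = df_beta[["_NUM_ID_"]])

# Keep chains whose predictions have a "typical" spread, then average them
sc   <- apply(pred_auto, 2, sd)
keep <- abs(sc - median(sc)) < 3 * mad(sc)
final_pred_auto <- rowMeans(pred_auto[, keep])
cor(final_pred_auto, y[ind.test])
```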
Then, if you look at the correlation of the final prediction with the phenotype, you get a very good prediction. Finally, I will talk very quickly about lassosum2, which is just a re-implementation of lassosum that uses the exact same inputs as LDpred2: the same LD matrix and the same betas, standard errors, and sample sizes. So you can just run snp_lassosum2 with the exact same inputs; it computes a grid over the two lassosum hyperparameters, you get all the predictions, and then you choose the best one, again based on the validation set. And actually, what you can do is take the LDpred2-grid and lassosum2 models together and pick the best prediction overall in the validation set. Yeah, that's all.
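A hedged sketch of lassosum2 with the same inputs, scored in the validation set in the same way as LDpred2-grid:

```r
beta_lassosum2 <- snp_lassosum2(corr, df_beta, ncores = NCORES)
params2 <- attr(beta_lassosum2, "grid_param")          # the grid over the two lassosum hyperparameters

pred_l2 <- big_prodMat(G, beta_lassosum2, ind.row = ind.val,
                       ind.col = df_beta[["_NUM_ID_"]])
params2$score <- apply(pred_l2, 2, function(x) {
  if (all(is.na(x)) || sd(x) == 0) return(NA)
  summary(lm(y[ind.val] ~ x))$coefficients["x", 3]
})
best_beta_l2 <- beta_lassosum2[, which.max(params2$score)]
```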
SBayesR
Content coming soon!
Title: How to run SBayesR
Presenter(s): Jian Zeng, PhD (Institute for Molecular Bioscience, University of Queensland)
Vertical Transmission and Genetic Nurture with PRS
Title: Estimating Vertical Transmission and Genetic Nurture using Polygenic Scores
Presenter(s): Jared Balbona, PhD (Institute for Behavioral Genetics, University of Colorado Boulder)
Coauthor(s):
Yongkang Kim, PhD (Janssen Pharmaceutical Companies of Johnson & Johnson)
Matthew Keller, PhD (Institute for Behavioral Genetics, University of Colorado Boulder)
Jared Balbona:
All right, hi everyone. Thank you all for watching. So my name is Jared Balbona, and today I’ll be talking about a project on estimating vertical transmission and genetic nurture using polygenic scores. This is a project I’ve been working on with Matt Keller and Yongkang Kim here at the University of Colorado Boulder, and it’s one that I’m very excited about. So without further ado, I will start with a brief overview.
As you all know, parents and offspring resemble one another in a variety of ways. One of the most commonly used examples of this, at least in our field, is that highly educated parents tend to have highly educated offspring. So there are a couple of different explanations for the similarity between parents and kids. The first and perhaps most obvious of which is shared genetic effects. So, the parent has a strong genetic predisposition for higher education, which she transmits to her daughter, giving her daughter that same predisposition. However, we can consider an alternative explanation in which being highly educated itself influences how parents raise their child. So, for example, a highly educated parent may be more likely to encourage a positive attitude towards schoolwork or to read to their child – all these different behaviors that collectively could have a potential influence on the child’s education level down the line.
So, this effect of a parental phenotype on an offspring one is called vertical transmission, and it’s mediated here by what’s called the offspring’s familial environment – that is, the environment in which the child is raised. So, of course, this vertical transmission path and this shared genetic one are not mutually exclusive but rather they co-occur, meaning that children with genes conducive to higher education are more likely to also have environments conducive to education. In other words, vertical transmission means that a person’s genes and environment will co-vary with one another, a covariance that has recently been termed genetic nurture. As you can probably guess, this genetic nurture covariance makes it very challenging to tease apart genetic and environmental pathways from one another, which is something that we need to do in order to estimate and understand the relative importance of parenting in shaping who children become. Furthermore, if unaccounted for, genetic nurture will bias the estimates of heritability that we get from both twin-based and genome-wide studies of genetic effects, and it does so in some kind of unpredictable and counter-intuitive ways, making it all the more important for us to tease apart. So, to this end, researchers have begun focusing not only on the genetic variants that parents transmit to their kids but also on the set of alleles that parents have but do not transmit to their children, shown up here.
So, like these transmitted alleles, the non-transmitted alleles still have a direct impact on the parental phenotype, which will in turn influence the offspring’s familial environment and subsequently, the offspring’s phenotype. However, unlike the transmitted alleles that are shared between parents and offspring, these non-transmitted alleles do not have a direct effect on the offspring – so notice that there’s no arrow right here.
Thus, by essentially subtracting these paths in green from these paths in red, we can get an unbiased estimate of this direct genetic effect – that is, the actual causal effect of these alleles on the phenotype. And we can use that to, in turn, get an unbiased estimate of vertical transmission. Recently, Matt, Yong, and I took this logic and instantiated it into a series of structural equation models for a paper we recently published. So, this is one of those models, which contains all the pieces I just described – namely, it has these three Y’s to represent the parents’ and offspring’s phenotypes. And for those of you who do structural equation modeling, note the circles here – so the parental phenotypes don’t actually need to be observed in our models; they can be represented as latent variables if need be.
In addition to these three phenotypes, we also have these three big F's representing each person's familial environment, that is, the environment in which they were raised. We also have these two NT's, representing polygenic scores made from the non-transmitted portions of each parent's genotype, and these two T's, representing polygenic scores made from the transmitted portions of the parental genotypes. Note that these T's have arrows going to both the parental and the offspring phenotypes. This model compares the relationships between all these different pieces to estimate parental effects, and in doing so, it builds on past work in a number of ways. Namely, using only a polygenic score, it can provide unbiased estimates of heritability, vertical transmission, and genetic nurture. And it can do so even when the polygenic score being used is not very predictive, which is great considering that polygenic scores generally explain only a small proportion of trait heritability.
Secondly, these models can properly account for various types of assortative mating, at both equilibrium and disequilibrium. Third, they are not, in principle, biased by missing data, provided the data are missing at random. So these models do not require complete trios; if we just have father-child or mother-child pairs, that works too, you'll just need a larger sample size to achieve equivalent power. Fourth, as will hopefully become clear over the course of this demonstration, these models are very easy to use, relying on relatively simple math. And fifth, which I hope will also become clear, they allow for a vast number of extensions to best fit the trait and data being worked with. So, that was a really quick overview.
But moving on from there, I'll now pivot a bit and start talking about how these estimates are actually derived using OpenMx. OpenMx is an R package that's great for working with structural equation models, and one that I'm guessing a lot of you are probably already familiar with. We'll be running a script I wrote in OpenMx on some example data that I simulated. You can download the code and the practice data from here: tinyurl.com/scmpgs2021 [Note: this link does not work now]. That should redirect you to my Google Drive. But if, for whatever reason, this nice little short link doesn't work, you can also try typing in this link here, which will take you to the same place, where you can download the data and the script I'll be going over.
So, using the simulated data, we'll be trying to answer the question: how does parental educational attainment influence offspring educational attainment? As with just about any structural equation modeling, this process begins with choosing the actual model. As we describe in our paper, there are several different variations of our models, each designed to address different types of questions with different types of data, and they entail somewhat different assumptions along the way. Because of that, there are some questions to consider when you start thinking about the type of test you'd like to run. First, is this a trait that parents may be assorting on, and has this assortment likely been consistent across generations? In other words, is this a quality that partners are more likely to be similar on, such as education or height? And is it possible that the importance of this trait in mate selection has changed over time? For height, probably not; tall people have probably always preferred tall people, and short people, short people. But for education, it's very possible that it has become increasingly relevant over the past 100 years or so. Secondly, are there good existing estimates of heritability for this trait, and could the heritability be age- or sex-dependent? Third, how many trios and parent-offspring pairs do I have available, and do they all have measured phenotypic data? Fourth, do I have any other nuclear-family relatives, like grandparents, siblings, or spouses, in my data? One thing to point out for the first two questions, which are a little more conceptual: you don't need to have actual answers to them. These are just things to consider, and they are also things you could test using our models if they're of interest. And I should also point out that this is not by any means an exhaustive list; it's just something to think about as you're getting started.
Moving back to those models, we can stick with the model that I showed earlier, this one right here, which makes two important assumptions. First, it assumes that there is assortative mating in both the parental and grandparental generations. So, following the educational attainment example we’re working with here, it assumes that in the parental generation and in the grandparental generation, spouses prefer to mate with people who are phenotypically similar to them, who have similar education levels.
Secondly, this model also assumes that the polygenic score explains the full trait heritability, which of course is not true for any trait at present. So, this specific model is mostly useful didactically. I wouldn’t recommend using it or the example script on actual data to estimate parental effects. But we do have other models that will relax the second assumption, which will be more effective for actual analyses. So, just something to point out, okay?
So, as mentioned earlier, our models require five observed pieces of information. First, the offspring's phenotype must be observed; in this case, it's a value indicating the amount of schooling completed by each individual in the sample. Our models also require two educational attainment polygenic scores for each offspring, one from the set of alleles they inherited from their dad and one from the set of alleles they inherited from their mom, as well as two polygenic scores made from the non-transmitted portions of the parental genotypes.
Finally, as I touched on earlier, we can incorporate observed parental phenotypes if we'd like, which will increase our power, but they are not necessary, and for that reason they're not included in the example script I'll be going over. So, these are the observed values that we feed into OpenMx. If I switch over to R, this is the script that's posted on Google Drive. If we read in the data and take a look at it, you can see that it looks exactly the same as what I showed on that PowerPoint slide: one column for each variable, the phenotype and the four polygenic scores, and one row for each individual. You can see, for example, that this individual has a lower polygenic score and a lower phenotypic score than, say, this individual here. All the columns for this particular script do need to be in this specific order, which is something you can adjust when you do this yourself. So, once we've read in the data for this script, we'll specify a couple of options. You can play around with which options you'd like to use; some might be better suited to specific types of tests than others. However, one that is required for our models is the NPSOL optimizer option, which I'll explain in a second.
Alright, so the first thing to do is to create the different variables that our models will be estimating, which we're doing here; we'll get an estimate for each of these down the line. We're specifying that they're free to be estimated, that we're not constraining them to a specific value. And below that, for some of those variables, we give algebraic constraints. These algebraic constraints are derived from path tracing; they're all described in pretty great detail in our paper, if that's of interest to you. One important thing to note: you can see, for example, that VY is a function of VF, and VF is a function of VY; similarly, g is a function of omega, and omega is a function of g. This type of circularity is a hallmark of a lot of causal models, and it's definitely present in ours. These are called nonlinear constraints, and to estimate them is where the NPSOL optimizer comes in. NPSOL is a software package that OpenMx can call to estimate models with nonlinear constraints, and it does so in a way that's really efficient and effective and works well for these kinds of analyses.
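To make the optimizer point concrete, here is a minimal, hypothetical OpenMx fragment (not the authors' published script; the parameter names, start values, and the toy constraint algebra are mine, not the paper's actual path-tracing expressions). It only shows the mechanics of selecting NPSOL and declaring free parameters with circular constraints between them:

```r
library(OpenMx)
mxOption(NULL, "Default optimizer", "NPSOL")   # NPSOL handles models with nonlinear constraints

# Free 1x1 parameter matrices (values are just start values; labels are illustrative)
f  <- mxMatrix("Full", 1, 1, free = TRUE, values = 0.10, labels = "f",  name = "f")
VG <- mxMatrix("Full", 1, 1, free = TRUE, values = 0.50, labels = "vg", name = "VG")
VE <- mxMatrix("Full", 1, 1, free = TRUE, values = 0.30, labels = "ve", name = "VE")
VF <- mxMatrix("Full", 1, 1, free = TRUE, values = 0.10, labels = "vf", name = "VF")
VY <- mxMatrix("Full", 1, 1, free = TRUE, values = 1.00, labels = "vy", name = "VY")

# Circular constraints of the kind described above (toy algebra for illustration only):
# VY is written in terms of VF, and VF in terms of VY.
cVY <- mxConstraint(VY == VG + VF + VE,            name = "cVY")
cVF <- mxConstraint(VF == 2 * (f %*% VY %*% t(f)), name = "cVF")

# In the real script, these pieces plus the expected covariance algebra and the data
# are combined in an mxModel(), which is then fitted with mxRun().
```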
So, once we create the variables and provide the algebraic constraints, we also provide some equality constraints, in this case just saying that certain variables are equal in value. The algebra here is nothing super interesting in this case, but for other, more complicated tests you might want to run, you can play around with these equality constraints. You can say things like, "I want the genetic variances to be constrained to be equal for parents and for their offspring," or you can relax that assumption and say you don't want the parental and offspring genetic values to be the same, or you can do the same thing for assortment or for parental effects, whatever you want. As with everything in the script, you can play around with different options to get different results. Once you've specified all the different variables, we then provide an expected covariance matrix here. The idea behind that, and we'll go back to the PowerPoint for this, is that OpenMx will take all the different variables inside the expected covariance matrix and simultaneously get estimates for each one in a manner that matches the observed values as closely as possible. So, it will try to come up with values of delta, g, w, and k whose implied covariance approximates 0.28 as closely as possible, for example, and it will do that for all of these cells.
Moving back to the R script, we specify all these options, and where I don't get to something, hopefully the annotations will describe what's going on; but if you have any questions, feel free to reach out to me. Here you can see we have a couple of nested for loops. These try out different combinations of start values for delta, VE, and VF. If we run that, then using these different start values OpenMx will iteratively test out different combinations of parameter estimates in a way that, like I said, minimizes the distance between the expected and the observed covariance matrices. Once we run it, we get a couple of warning messages, and fortunately these don't look like anything to worry too much about. Most are saying that certain start values are not feasible, which is to be expected: if we tell OpenMx that the phenotypic variance is very heavily dependent on genetic factors and also very heavily dependent on environmental factors, it's possible that those two things can't both be true at the same time, in which case it will produce this warning and NAs in your output, which is totally fine.
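A hedged sketch of that start-value search, assuming a ready-to-fit MxModel object called `model` whose free parameters carry labels like "delta" and "ve" (the label names and the grid of values are illustrative, not taken from the actual script):

```r
fits <- list()
for (delta_sv in c(0.2, 0.5, 0.8)) {
  for (ve_sv in c(0.2, 0.5, 0.8)) {
    m   <- omxSetParameters(model, labels = c("delta", "ve"),
                            values = c(delta_sv, ve_sv))   # swap in new start values
    fit <- try(mxRun(m, silent = TRUE), silent = TRUE)     # some start values will fail; that's expected
    if (!inherits(fit, "try-error")) fits[[length(fits) + 1]] <- fit
  }
}

# Keep the solution with the lowest -2 log-likelihood across all start values
best <- fits[[which.min(sapply(fits, function(f) f$output$Minus2LogLikelihood))]]
summary(best)
```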
So, once we've tested out all those different combinations of start values, the final step, step four, is to select the one with the lowest minus-2 log-likelihood and produce it at the bottom in a data frame called "result summary," shown here. This "result summary" is really what the whole script is all about: we can see the phenotypic variance and the relative amounts of variance due to additive genetics, parental effects, and unique environment, and it has a key at the bottom to explain what the different columns mean, in case you forget. In these results, we can see that parental effects explain a relatively small proportion of the phenotypic variance, especially compared to additive genetics and unique environment. So, if these results are to be believed, it appears that parents have relatively little influence on the amount of schooling that their children choose to pursue later in life. Of course, it's possible that if you tried a different model, one that makes different assumptions and tests different constraints, it might fit better than this one, and you might see a change in the relative importance of genetics and environment. So, that's something to consider.
And also, not to belabor the point, I know I've said this a couple of times, but again, these are simulated data; they're totally made up. Actual empirical evidence suggests that parents do have a relatively substantial influence on their offspring's education level. So, take these numbers with a big grain of salt.
Okay, so that is the script. Moving back to the PowerPoint, I'll close by talking about a couple of the other models we've built and the types of questions they can address. Importantly, a lot of these models are part of upcoming publications in our lab, so they haven't been posted on GitHub just yet, but hopefully they will be very soon. First, we have models that can leverage sporadic relatedness in large samples, so you can derive estimates of parental effects without needing to recruit a large number of trios. For example, there aren't a ton of trios in the UK Biobank, but there are a large number of siblings and spouses, so you can use those to derive the estimates, which is a lot easier than getting hold of actual trio data. Two, these models can be used bivariately without complicating things too much, so we can test the effect of a parental trait on a different offspring trait and get an unbiased estimate, which has never been done before. Three, we can model various types of assortative mating, like primary phenotypic assortment, genetic homogamy, or social homogamy.
Four, we can test whether effects are sex-dependent. This can be used to address questions like: do fathers have a greater influence on offspring substance use than mothers? Do these effects depend on whether they have a son or a daughter, things like that? And similarly, we can examine how effects have changed across previous generations. So, you can look at how mate preferences have changed across generations, like I said earlier, but you can also look at whether the influence of parenting has increased or decreased over time; maybe peer influence becomes more important for certain traits, or maybe parenting remains equally important across time. There are a lot of different avenues for future studies, a lot of things that we're really excited about, and hopefully we'll start to release results on them in the near future.
So, with that, I'd like to thank my co-authors, Matt and Yong, and all of you for listening. I know that was a lot of information in a pretty short period of time, so feel free to email us if you have any questions. We're happy to help out and to troubleshoot anything we can. So, thank you all very much.