Software Tutorials: Genomic SEM (Video Transcript)
Title: Genomic SEM Tutorial
Presenter(s): Andrew Grotzinger, PhD (Institute for Behavioral Genetics, University of Colorado Boulder)
Andrew Grotzinger:
I’m Andrew Grotzinger, and in this video, we’re going over how to run Genomic SEM using psychiatric traits as examples. Genomic SEM is a general framework for modeling genetic covariance matrices produced by methods like LD score regression to then estimate any number of structural equation models that can be used to test hypotheses about the processes that gave rise to the data that we observe. It only requires GWAS summary statistics, and those summary statistics can come from samples with unknown and varying degrees of sample overlap. What that means is that you can now estimate models for really rare traits that you would not otherwise observe in the same sample. It’s going to be split into two parts for the Practical.
The first part is how to estimate a user-specified model using genome-wide estimates. The second part is how to incorporate the effects of individual SNPs to do things like running a multivariate GWAS. All the Practical materials are available at this Box link here, and this includes an R script and all the files needed to run that R script. The top of that R script includes some code to actually download Genomic SEM if you haven’t done so already. Throughout the presentation, we’re going to be using GWAS summary statistics for schizophrenia, bipolar disorder, and major depressive disorder. Now I’m moving on to how to actually estimate the user-specified model.
Part 1: Estimating User Specified Models in Genomic SEM
This takes three primary steps.
The first is munging the summary statistics (Munge: convert raw data from one form to another).
The second is running LD score regression.
And the third is using that output from LD score regression (ldsc) to then run the model that you specify.
Step 1: munge example code
Munge is a general procedure to format the summary stats in the way LD score regression expects. For the sake of this Practical, we’re using a subset of 10,000 SNPs for schizophrenia, bipolar disorder, and major depression. In general, you can download and use a full set of summary stats on a personal laptop; for the purpose of the Practical and to restrict file size, we’re just using a subset here. Munge takes four arguments. The first is the names of the files. The second is a reference file that’s used to align to the same reference allele across traits. The third is the names of the traits, and the fourth is the total sample size. Putting it all together in this last line here, I’m specifying all these arguments and running munge. Again, I want to really highlight that we are only using the restricted subset for the purpose of the Practical. When munge runs, it’s going to produce a .log file that you should inspect to make sure things like column headers are interpreted correctly. In particular, I’ve highlighted the section where it prints how the effect column was interpreted. For schizophrenia, it’s an odds ratio, and it’s interpreted correctly. But you’ll want to make sure to read the corresponding ReadMe files for the summary statistics and then cross-reference this .log file to check whether the effect column is being interpreted correctly.
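As a rough sketch of what that munge call might look like in R (the file names and sample sizes below are placeholders rather than the actual Practical values):

    library(GenomicSEM)

    # Placeholder file names; substitute the summary statistics from the Practical materials.
    files <- c("SCZ.txt", "BIP.txt", "MDD.txt")
    hm3 <- "w_hm3.snplist"                # HapMap3 reference file used to align alleles
    trait.names <- c("SCZ", "BIP", "MDD")
    N <- c(100000, 100000, 100000)        # placeholder total sample sizes; use the real Ns

    munge(files = files, hm3 = hm3, trait.names = trait.names, N = N)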
Step 2: ldsc example code
In this second step, we run LD score regression within the context of Genomic SEM. The first argument is traits, and that’s the names of these now munged summary stats, which will all end in .sumstats.gz. Then, for binary traits, we do the liability threshold correction, which requires inputting the sample prevalence, which reflects the cases over the total sample size, and the population prevalence, which you can get either from epidemiological papers or from the paper for the corresponding univariate GWAS. The fourth and fifth arguments are the folder of LD scores and the LD score weights. In almost all cases, this is going to be the same folder, and here we’re using the European LD scores because we’re using the European-only summary stats. Finally, we specify the names of the traits, and this is how the traits are going to be named in your actual model. On this last line of code, we run LD score regression.
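Putting those five pieces together, a sketch of the ldsc call might look like this (the prevalence values are illustrative placeholders, not the exact Practical values):

    library(GenomicSEM)

    traits <- c("SCZ.sumstats.gz", "BIP.sumstats.gz", "MDD.sumstats.gz")
    sample.prev <- c(0.4, 0.4, 0.3)         # placeholder: cases / total N for each GWAS
    population.prev <- c(0.01, 0.01, 0.15)  # illustrative lifetime prevalences
    ld <- "eur_w_ld_chr/"                   # folder of European LD scores
    wld <- "eur_w_ld_chr/"                  # LD score weights (same folder here)
    trait.names <- c("SCZ", "BIP", "MDD")

    LDSCoutput <- ldsc(traits = traits, sample.prev = sample.prev,
                       population.prev = population.prev, ld = ld,
                       wld = wld, trait.names = trait.names)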
Ex. Running Steps 1 and 2 in R
In the next step, we’re actually going to specify the model. But before we do that, I want to switch over to R and actually run through the code so you can see it. First, I’m going to load the package, and I’m going to set the working directory to where I downloaded the workshop materials. Then, for munge, I’m going to set the files, the HapMap3 reference file, the trait names, and the sample size, and then run munge. In the second step, I’m going to take those munged summary stats and put in the sample and population prevalences for the liability correction, the folder of LD scores and the LD score weights (which are the same folder), and the trait names, and finally actually run LD score regression. This is going to produce results that are not actually interpretable because we used that subset of 10,000 SNPs. So, when we now go on to step three of running the model, I’ve created an LD score regression object that uses the full set of summary statistics, which we’re now going to load in so you can actually produce interpretable results in the context of the model. We’re going to load that in and now switch back over to the PowerPoint to talk about how you specify a model in Genomic SEM.
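A minimal sketch of how saving and reloading that pre-made object might look (the file name and the use of base R save/load are assumptions):

    # Save the ldsc output after running it once on the full summary statistics...
    save(LDSCoutput, file = "LDSCoutput.RData")

    # ...and load it back in a later session.
    load("LDSCoutput.RData")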
Step 3: How to specify a model
We use the lavaan syntax for specifying the model. In this case, we would specify a regression relationship of A on B using A ~ B, or, for those of you who think of it in the format of Y ~ X, outcome ~ predictor. For covariances, you specify two tildes, and of course, the variance of a variable is its covariance with itself, so A ~~ A. For a factor, you specify the factor name, followed by =~, and then the factor indicators. To fix a parameter, you put a number followed by an asterisk on the right-hand side, in front of the variable, so A ~~ 1*B would fix the covariance between A and B to 1. To name a parameter, you write a parameter label, some set of letters, in that same position. Here, I’m naming the covariance between A and B “CovAB,” and what this does is allow you to use model constraints on this parameter. Let’s say the covariance you’re estimating is negative, but you have some sense that it should be positive; you can then add a model constraint to keep it above zero.
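To put those pieces in one place, here is each element of the syntax as an independent one-line illustration (these lines are not meant to form a single coherent model):

    A ~ B             # regression of A on B (outcome ~ predictor)
    A ~~ B            # covariance between A and B
    A ~~ A            # variance of A (its covariance with itself)
    F1 =~ A + B + C   # factor F1 defined by indicators A, B, and C
    A ~~ 1*B          # fix the covariance between A and B to 1
    A ~~ CovAB*B      # label the covariance "CovAB"
    CovAB > 0         # model constraint keeping CovAB above zero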
We loaded in that pre-made LDSC object, which again uses the full set of summary statistics. This is not simulated data; because we’re using summary stats, this is often something you can readily download online. Then, to actually run the model, this takes two necessary arguments and two optional arguments. The first is “covstruc,” which is the output from LD score regression. The second is the model. We’re running a common factor model here, and we’re telling lavaan, using this “NA*,” that we want to freely estimate the first loading, and then we want to fix the variance of the factor to 1. So we’re using what’s known as unit variance identification for this model. An optional third argument is the estimation method you want to use. We offer DWLS and maximum likelihood, but the default is DWLS. Another optional argument is “std.lv,” which is whether or not you want to automatically fix the variances of the latent variables to 1.
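A sketch of how that call might look, assuming the ldsc output is stored in LDSCoutput and the traits are named SCZ, BIP, and MDD:

    library(GenomicSEM)

    # Common factor model with unit variance identification: NA* frees the
    # first loading, and F1 ~~ 1*F1 fixes the factor variance to 1.
    model <- 'F1 =~ NA*SCZ + BIP + MDD
              F1 ~~ 1*F1'

    results <- usermodel(covstruc = LDSCoutput, estimation = "DWLS",
                         model = model, std.lv = FALSE)
    results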
Then you run the model. This is going to produce this set of results, and I’ll show that in R here in just a second. But to walk you through what those results mean: the first three columns are the parameters being estimated. The fourth and fifth columns are the unstandardized estimate and standard error for those parameters, which come from the model applied to the genetic covariance matrix; the standardized estimate and standard error come from the model applied to the genetic correlation matrix, where the heritabilities are scaled to 1. The model fit here is all going to print as “NA” because a common factor defined by three indicators uses up all of your degrees of freedom, so the model fits perfectly. But I want to walk through how you would interpret these model fit statistics more generally.
Interpreting model fit statistics
Chi-square is the model chi-square, which is an index of exact fit. The next two columns give the degrees of freedom and p-value for the model chi-square; note that this will almost always be highly significant for your model, because chi-square is sensitive to sample size, which, by definition, is massive in GWAS space. The next piece is AIC, which can be used to compare models regardless of whether they are nested, with lower values indicating better fit. CFI is the comparative fit index, which has general heuristic cutoffs, with higher values being better. Finally, SRMR is the standardized root mean square residual, where lower is better.
Going back over to R, we’re going to specify that covstruc argument, the model, the estimation method, and std.lv. We’re going to run the model, and you’ll see that we produce the same results in the code that I showed in the slides. I’ve also included some alternative ways you could specify this model. For example, if you set “std.lv = TRUE,” you don’t have to write the “NA*” on the first loading or the “F1 ~~ 1*F1” line, or you could use the “commonfactor” function to estimate the same model and produce the same results.
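A sketch of those two alternatives, under the same assumptions as the earlier sketch:

    library(GenomicSEM)

    # Alternative 1: std.lv = TRUE imposes unit variance identification, so
    # the NA* and the F1 ~~ 1*F1 line are no longer needed.
    model2 <- 'F1 =~ SCZ + BIP + MDD'
    results2 <- usermodel(covstruc = LDSCoutput, estimation = "DWLS",
                          model = model2, std.lv = TRUE)

    # Alternative 2: the commonfactor() convenience function fits the same
    # common factor model directly from the ldsc output.
    results3 <- commonfactor(covstruc = LDSCoutput, estimation = "DWLS")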
Part 2: Multivariate GWAS in Genomic SEM
Moving on to part two, I want to talk about how you would run a multivariate GWAS in Genomic SEM. To be clear, you don’t have to do both of these parts; you can certainly run a genome-wide model as in part one and publish that alone. You don’t have to bring in these individual SNP effects, but I did want to show this, because oftentimes the multivariate GWAS space is what’s of most interest to people. This includes four primary steps, the first two of which mirror what we already did: running munge and LD score regression. So we’re only going to go over the last two steps of running the “sumstats” function and the multivariate GWAS functions: “commonfactorGWAS” and “userGWAS.”
Step 3: sumstats example code
The “sumstats” function takes a number of different arguments. It can be a little bit confusing, and I’ll note that we have a flowchart on our GitHub to help you figure out how the arguments should be specified. Again, we’re using the drastically subsetted SNPs for the purpose of the presentation, and we’re also using a subset of the reference file, which is used to align the alleles and to pull out the SNP minor allele frequency. The third argument is the trait names, and with the fourth argument here, we’re letting the “sumstats” function know that the standard errors for these traits are on the logistic scale. You’ll want to be really careful with this, because oftentimes, for binary traits, the effect column may be on an odds ratio scale, but the standard error will be on a logistic scale, and that’s something you can often determine from the ReadMe file. If you have continuous traits, or binary traits analyzed using a continuous model, it’s going to be a different set of arguments. For the sake of time, I’m not going to go over all of that here, but we do have that flowchart on the GitHub, and of course, you can reach out to us if you’re confused or getting strange results. Putting that all together, you’d run the “sumstats” function here. Much like the “munge” function, this will produce a .log file that you’ll want to go over to make sure everything’s interpreted correctly. And then, once that’s done, you can use that output in combination with the LD score regression output to run models that include the effects of individual SNPs.
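A sketch of the sumstats call (file names are placeholders; the reference file used in the Practical is a subset of the full reference):

    library(GenomicSEM)

    files <- c("SCZ.txt", "BIP.txt", "MDD.txt")   # placeholder file names
    ref <- "reference.1000G.maf.0.005.txt"        # reference file for alleles and MAF
    trait.names <- c("SCZ", "BIP", "MDD")
    se.logit <- c(TRUE, TRUE, TRUE)               # standard errors are on the logistic scale

    p_sumstats <- sumstats(files = files, ref = ref,
                           trait.names = trait.names, se.logit = se.logit)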
Step 4a: commonfactorGWAS example code
I’m going to go over both the “commonfactorGWAS” and “userGWAS” functions, starting with “commonfactorGWAS.” This takes two necessary arguments. The first is the output from LD score regression, and the second is the output from the “sumstats” function. Then, once again, you can specify “DWLS” or “ML” for the estimation method, and you can also specify whether you want to run in parallel. In practice, you cannot run a full multivariate GWAS on a laptop, unlike the models that we were showing in part one. So, for this, you really will want to be running in parallel on a computing cluster. But for the purposes of this Practical, because we’re using a small subset of the SNPs, we’re going to run serially so you get a sense of how these functions work.
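A sketch of the serial call used here (object names carried over from the earlier sketches):

    library(GenomicSEM)

    # Serial run for the Practical's small SNP subset; for a real analysis you
    # would set parallel = TRUE and request cores on a computing cluster.
    factor_gwas <- commonfactorGWAS(covstruc = LDSCoutput, SNPs = p_sumstats,
                                    estimation = "DWLS", parallel = FALSE)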
I’m going to switch over to R now to go through that “sumstats” function, where, again, we specify the files, the reference file, the trait names, and whether the standard errors are on a logistic scale, setting that to TRUE for all three traits. We run “sumstats,” and now we take that output, specify the LD score regression output and the output from “sumstats” that we created, and actually run the “commonfactorGWAS” function. This is going to produce output that looks like this. There are a lot of columns: the SNP, chromosome, base pair, minor allele frequency, A1 and A2, the parameter being estimated (which is the effect of the SNP on the factor), the estimate, the sandwich-corrected standard error, the Z-statistic, the p-value, and then this other metric that we call “QSNP,” which indexes whether that SNP fits the model poorly. QSNP is a chi-square-distributed test statistic with degrees of freedom that depend on the number of indicators in your model, and it indexes the extent to which the common pathway model, where the SNP operates on the indicators only through the factor, is insufficient relative to an independent pathways model, where the SNP affects the indicators directly. If the common pathway of the SNP through the factor is a poor representation of the independent effects of the SNP on the individual indicators, then QSNP will be significant. That’s going to happen when you have things like a SNP with directionally opposing effects on the indicators, or, let’s say, a SNP with a really strong effect on one of the indicators but not the others. If we go over to these results now, you’ll see the estimates here. One thing you want to inspect is the last two columns, “Fail” and “Warning.” Zero means it’s good to go, but if you look at the warning messages, you’ll see that a handful of the variances are estimated as negative. I want to use this as an opportunity to talk about how you might troubleshoot a warning like this, and then also go over how you would use “userGWAS” in this context.
Step 4b: userGWAS example code
“userGWAS” takes those same first two arguments, the output from LD score regression and the “sumstats” output, but then you also specify the actual model that you want to run, which is now going to include the effect of the individual SNP. So, we’re running that same common factor model. In “commonfactorGWAS,” the model where the SNP predicts the common factor is automatically specified behind the scenes; here we write it out and additionally add in model constraints. We’re labeling the residual variances for schizophrenia, bipolar, and MDD, and we’re constraining them to be above zero because of that warning we got when running “commonfactorGWAS.” Before we run that, I want to mention that another optional argument for “userGWAS” is the “sub” argument. This is whether you want to save only a particular piece of the model output. “commonfactorGWAS” automatically saves only the effect of the SNP on the common factor. What the “sub” argument does is allow you to tell “userGWAS” that, for each SNP, you don’t want to save all of the model output, including the factor loadings and the residual variances; for the sake of memory, you want to save only the effect of the SNP on the common factor. So, “sub” does not change how the model is estimated at all; it changes how large the output file is. I would highly recommend setting this argument.
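A sketch of that userGWAS call with the residual variance constraints and the “sub” argument (object names carried over from the earlier sketches; the SNP is referred to as SNP in the model):

    library(GenomicSEM)

    # Common factor model with the SNP predicting the factor, plus labeled
    # residual variances constrained to be positive.
    model_snp <- 'F1 =~ NA*SCZ + BIP + MDD
                  F1 ~~ 1*F1
                  F1 ~ SNP
                  SCZ ~~ a*SCZ
                  BIP ~~ b*BIP
                  MDD ~~ c*MDD
                  a > 0.001
                  b > 0.001
                  c > 0.001'

    # sub saves only the SNP -> factor effect for each SNP, shrinking the
    # output without changing how the model is estimated.
    user_gwas <- userGWAS(covstruc = LDSCoutput, SNPs = p_sumstats,
                          estimation = "DWLS", model = model_snp,
                          sub = c("F1~SNP"), parallel = FALSE)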
Final Notes
So, now we’re going to run this, and while that’s running, I want to make some final notes. Parallel processing is available for both “userGWAS” and “commonfactorGWAS.” Parallel processing executes exactly the same code as serial processing, except that it takes advantage of additional cores. In an ideal runtime scenario, you would split your job across computing nodes on a cluster and run in parallel there. There is also MPI functionality available. Again, all runs are completely independent of one another. I’ve listed a number of resources here, including the GitHub, and I’m not going to go through this, but I’ve included some slides about some things to keep in mind.
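For completeness, a hypothetical parallel invocation on a single cluster node (the core count here is arbitrary):

    # Same call as before, now run across 16 cores on one node.
    user_gwas <- userGWAS(covstruc = LDSCoutput, SNPs = p_sumstats,
                          estimation = "DWLS", model = model_snp,
                          sub = c("F1~SNP"), parallel = TRUE, cores = 16)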
So, for the final thing, I want to go back over here and show you that this produces the same set of results as “commonfactorGWAS,” or very nearly so, now that we’ve added those residual variance constraints. But if we look at the warnings now, we see that those warnings are gone. So, with that, I’ll add that you can reach out to us with questions, and I hope this was helpful to you.