MOSTWAS: Multi-Omic Strategies for Transcriptome-Wide Association Studies

Traditional predictive models for transcriptome-wide association studies (TWAS) consider only single nucleotide polymorphisms (SNPs) local to genes of interest and perform parameter shrinkage with a regularization process. These approaches ignore the effect of distal-SNPs or other molecular effects underlying the SNP-gene association. Here, we outline multi-omics strategies for transcriptome imputation from germline genetics to allow more powerful testing of gene-trait associations by prioritizing distal-SNPs to the gene of interest. In one extension, we identify mediating biomarkers (CpG sites, microRNAs, and transcription factors) highly associated with gene expression and train predictive models for these mediators using their local SNPs. Imputed values for mediators are then incorporated into the final predictive model of gene expression, along with local SNPs. In the second extension, we assess distal-eQTLs (SNPs associated with genes not in a local window around it) for their mediation effect through mediating biomarkers local to these distal-eSNPs. Distal-eSNPs with large indirect mediation effects are then included in the transcriptomic prediction model with the local SNPs around the gene of interest. Using simulations and real data from ROS/MAP brain tissue and TCGA breast tumors, we show considerable gains of percent variance explained (1–2% additive increase) of gene expression and TWAS power to detect gene-trait associations. This integrative approach to transcriptome-wide imputation and association studies aids in identifying the complex interactions underlying genetic regulation within a tissue and important risk genes for various traits and disorders.

• MeTWAS first detects the association between the expression of gene and expression of gene local to the gene ESR1 (Chromosome 6) to generate local eQTLs and SNPs local to FOXA1 (Chromosome 14) to generate distal-eQTLs for a 400-sample eQTL reference panel and 1,500-sample 140 GWAS imputation panel. We considered two scenarios for each set of simulation parameters: (1) an ideal 141 case where the leveraged associated between the distal-SNP and gene of interest exists in both the 142 reference and imputation panel, and (2) a "null" case where the leveraged association between the distal-143 SNP and the gene of interest exists in the reference panel but does not contribute to phenotype 144 heritability in the imputation panel. Though the choice of these loci was arbitrary for constructing the 145 simulation, there is evidence that ESR1 and FOXA1 are highly co-expression in breast tumors, and local-146 eQTLs of FOXA1 have been shown to be distal-eQTLs of ESR1 [29].

148
In these simulation studies, we found that MOSTWAS methods performed well in prediction across 149 different causal proportions and local and distal mRNA expression heritabilities and generally outperform 150 local-only modelling. Furthermore, across all simulation settings, we observed that MOSTWAS showed 151 greater or nearly equal power to detect gene-trait associations compared to local-only models. We found 152 that, under the setting that distal-eQTLs contributes to trait heritability, the best MOSTWAS model had 153 greater power to detect gene-trait associations than the local-only models, with the advantage in power 154 over local-only models increasing with increased distal expression heritability (Figure 2A). Similarly, we 155 found that as the proportion of total expression heritability attributed to distal variation increased, the 156 positive difference in predictive performance between the best MOSTWAS model and the local-only The power of the distal-SNPs added-last test increased significantly as both the sample sizes of the eQTL 168 reference panel and the GWAS imputation panel increased (Supplemental Figure S4). At a sample size 169 of 10,000 in the GWAS panel with summary statistics (a suitably large GWAS) and a sample size greater than 200 in the eQTL panel, MOSTWAS obtained over 65% power to detect significant distal significant 171 associations (Supplemental Figure S4) Real data applications in brain tissue 178 We applied MOSTWAS to multi-omic data derived from samples of prefrontal cortex, a tissue that has 179 been used previously in studying neuropsychiatric traits and disorders with TWAS [44,45]. There is ample 180 evidence from studies of brain tissue, especially the prefrontal cortex, that non-coding variants may 181 regulate distal genes [44,46,47]; in fact, an eQTL analysis by Sng et Table S1). We also found that MeTWAS and DePMA performed better in cross-validation 2 than local-190 only models (Figures 3A-C). Mean predictive 2 for local-only models was 0.029 (25% to 75% inter-191 quartile interval (0.0,0.015)), for MeTWAS models was 0.079 (0.019, 0.082), and for DePMA models was 192 0.045 (0.013, 0.037).
We used 87 samples in ROS/MAP with genotype and mRNA expression data that were not used in model training to test portability of MOSTWAS models in independent cohorts. As shown in Figure 4A 196 and Supplemental Figure S5, DePMA models obtained the highest predictive adjusted 2 in the external 197 cohort (0.042 (0.009, 0.057)), with MeTWAS (0.040 (0.010, 0.054)) also outperforming local-only models 198 (0.031 (0.007, 0.039)). Overall, among genes with cross-validation adjusted 2 ≥ 0.01, 187 out of 267 199 genes achieved external predictive 2 ≥ 0.01 using local-only models, 683 out of 911 using MeTWAS, 200 and 2,135 out of 2,934 using DePMA (Figure 3A-C Table S2). We also compared these 11 associations to those 209 identified by local-only models (PrediXcan [3] and TIGAR [53]), with raw -values of association shown in 210 Figure 4B. MOSTWAS showed stronger associations at 8 of these loci than both local-only and DPR 211 models. We followed up on the 5 significantly associated loci using the permutation and added-last tests 212 (Methods and Supplemental Methods). Three of these loci (APOE, SORL1, ZCWPW1) showed 213 significant associations, conditional on variants with large GWAS effect sizes (permutation test significant 214 at FDR-adjusted < 0.05). These three loci also showed significant associations with distal variants, 215 above and beyond the association with local variants, at FDR-adjusted < 0.05 (Supplemental Table   216 S2).

218
We then conducted a transcriptome-wide association study for risk of major depressive disorder (MDD) 219 using summary statistics from the Psychiatric Genomics Consortium (PGC) genome-wide meta-analysis 220 that excluded data from the UK Biobank and 23andMe [54]. QQ-plots for TWAS -statistics and -values 221 are provided in Supplemental Figure S7 and Supplemental Figure S8 for both local-only and MOSTWAS models, showing earlier departure from the null using local-only models compared to MOSTWAS. Overall, using all heritable genes with cross-validation 2 with the best MOSTWAS model in 224 ROS/MAP, we identified 102 MDD risk-associated loci with FDR-adjusted < 0.05 that persisted when 225 subjected to permutation testing at FDR-adjusted < 0.05 (colored red in Figure 4C). We downloaded 226 genome-wide association study by proxy (GWAX) summary statistics from the UK Biobank [55] for 227 replication analysis of loci identified using PGC summary statistics. We found 7 of these 102 loci (labeled 228 in Figure 4C and listed in Supplemental Table S3) also showed an association in UK Biobank GWAS 229 that was in the same direction as in PGC. In comparison, using local-only models, we identified 11 genes 230 with significant association with MDD risk at FDR-adjusted < 0.05 that persisted after permutation 231 testing; none of these loci showed significant associations in the UK Biobank GWAX in the same direction 232 as in PGC. These replication rates between MOSTWAS and local-only models were similar (accounting 233 for the total number of associations), highlighting that the inclusion of distal variation does not hinder the 234 replicability of MOSTWAS associations in comparison to local-only models [55,56]. Local-only results are 235 provided in Supplemental Data. It is important to note here that the UK Biobank dataset is not a GWAS 236 dataset as it defined a case of MDD as any subject who has the disorder or a first-degree relative with 237 MDD. Hence, the study forfeits study power to detect gene-trait associations for MDD [55,56].

238
Nonetheless, we believe that strong prediction in independent cohorts and TWAS results across two 239 independent cohorts provided an example of the robustness of MOSTWAS models.

241
In summary, we observed that MOSTWAS models generally had higher predictive 2 than local-only 242 models both in training and independent cohorts. We also found that MOSTWAS recapitulated 5 known 243 Alzheimer's risk loci that were not detected by local-only modeling (both PrediXcan [3] and TIGAR [53]), 3 244 of which had significant distal associations above and beyond the information in local variants using our 245 added-last test. We also illustrated that some MDD-risk-associated loci detected by MOSTWAS in a We applied MOSTWAS using breast tumor multi-omics and disease outcomes, motivated by recent

258
Estimates of heritability for genes were 2-4 times larger when we considered distal variation using 259 MOSTWAS methods (Supplemental Table S1). We also found that MeTWAS and DePMA performed 260 better in cross-validation 2 , with larger numbers of models at 2 ≥ 0.01 and significant germline 261 heritability using MOSTWAS models than local-only models (Figures 3D-F

275
Lastly, we conducted association studies for breast cancer-specific survival using local-only and the [34]. Here, we constructed the weight burden test, as described above and in Pasaniuc et al and Gusev et al [4,28]. We prioritized genes with Benjamini-Hochberg (BH) [37] adjusted < 0.05 for permutation 279 testing. Of the 122 genes that had cross-validation 2 ≥ 0.01 in TCGA-BRCA using both local-only and 280 MOSTWAS models, we found 2 survival associations with the same loci at BH FDR-adjusted < 0.05, Expression prediction in DePMA hinges on prioritizing distal-eSNPs via mediation analysis for inclusion in the final DePMA predictive model, adopting methods from previous studies [11,12,14]. A multi-omic 419 dataset with gene expression, SNP dosages, and potential mediators is first split into training-testing 420 subsets. Based on the minor allele frequencies of SNPs and total sample size, we recommend a low 421 number of splits (less than 5).

423
In the training set, we identify mediation test triplets that consist of (1) a gene of interest, (2) a distal-424 eSNP associated with the expression of the gene (default of < 10 −6 ), and (3) a set of mediating 425 biomarkers local to and associated with the distal-eSNP (default of FDR-adjusted < 0.05). We estimate 426 the total indirect mediation effect (TME) of the distal-eSNP on the gene of interest mediated through the 427 set of these mediators, as defined by Sobel [74]. We assess the magnitude of this indirect effect using a 428 two-sided permutation test to obtain a permutation -value, as more direct methods of computing 429 standard errors for the estimated TME are often biased [14,75]. We also provide an option to estimate an 430 asymptotic approximation to the standard error of the TME and conduct a Wald-type test. This asymptotic   Figure 1: Example of a biological mechanism MOSTWAS leverages in its predictive models. Here, assume a SNP (in green) within a regulatory element affects the transcription of gene that codes for a transcription factor. Transcription factor then binds to a distal regulatory region and affects the transcription of gene . The association between the expression of gene and gene is leveraged in the first step of MeTWAS. A distal-eQTL association is also conferred between this distal-SNP and the eGene , which is leveraged in the DePMA training process.

Figure 2: Comparison of TWAS power via simulations using MOSTWAS and local-only models. (A)
Proportion of gene-trait associations at < 2.5 × 10 −6 using local-only (red) and the most predictive MOSTWAS (blue) models across various local and distal expression heritabilities, trait heritability, and causal proportions. (B) Proportion of significant gene-trait associations across the same simulation parameters with no distal effect on the trait in the simulated external GWAS panel.

Figure 3: Predictive adjusted 2 from cross-validation across local-only, MeTWAS, and DePMA models.
If a given gene does not have ℎ 2 > 0 with < 0.05, we set the predictive adjusted 2 to 0 here for comparison. The top row compares local-only and MeTWAS, middle row compares local-only and DePMA, and the bottom row compares MeTWAS and DePMA. The left column has performance in ROS/MAP, while the right column has performance in TCGA-BRCA. All axes indicate the CV adjusted 2 for different models.

Figure 4: External validation of MOSTWAS and gene-trait associations using MOSTWAS models. (A)
Median predictive adjusted $R^2$ in held-out cohorts from TCGA-BRCA and ROS/MAP in local-only, MeTWAS, and DePMA models that have in-sample significant heritability. The interval shows the 25% and 75% quantiles for external cohort predictive 2 . (B) Associations with 11 known Alzheimer's risk loci, as identified in literature, using MOSTWAS, local-only, and TIGAR Dirichlet process regression (DPR).
Loci are labeled with P if the permutation test achieves FDR-adjusted < 0.05 and D if the added-last test achieves FDR-adjusted < 0.05. (C) TWAS associations for major depressive disorder risk using GWAS summary statistics from PGC. Loci are colored red if the overall association achieves FDRadjusted < 0.05 and the permutation test also achieves FDR-adjusted < 0.05. We label the 7 loci that were independently validated with UK Biobank GWAX summary statistics at FDR-adjusted < 0.05 for both the overall association test and permutation test. (D) TWAS associations for breast cancerspecific survival using GWAS summary statistics from iCOGs. Loci are colored red and labeled if the overall association achieves FDR-adjusted < 0.05.