Departments of Mathematics, Computer Science, and Statistics, St. Olaf College, 1520 St. Olaf Avenue, Northfield, MN 55057, USA

Department of Mathematics, Wittenberg University, 200 West Ward Street, Springfield, OH 45501, USA

Division of Applied Mathematics, Brown University, 151 Thayer Street, Providence, RI 02912, USA

Department of Statistics and Operations Research, University of North Carolina, 318 Hanes Hall, CB 3260, Chapel Hill, NC 27599-3260, USA

Department of Mathematics, Statistics and Computer Science, Dordt College, 498 4th Ave. NE, Sioux Center, IA 51250, USA

Abstract

Analyzing sets of genes in genome-wide association studies is a relatively new approach that aims to capitalize on biological knowledge about the interactions of genes in biological pathways. This approach, called pathway analysis or gene set analysis, has not yet been applied to the analysis of rare variants. Applying pathway analysis to rare variants offers two competing approaches. In the first approach rare variant statistics are used to generate

Background

Analysis of single-nucleotide polymorphism (SNP) microarray data in genome-wide association studies has traditionally been agnostic because prior biological knowledge about the genome has not been taken into account. However, as the biological knowledge base increases, it is increasingly common to use a priori biological knowledge in the analysis of SNP data. A recently proposed approach to integrate biological knowledge in the analysis of SNP data in genome-wide association studies is pathway or gene set analysis

Pathway analysis methods have been successful in a variety of applications (e.g., expression data, SNP microarray data). However, this approach has yet to be applied to rare variant analysis of next-generation sequence data. A variety of methods have been proposed for the analysis of rare variants

In this paper we implement pathway analysis using two opposing strategies. In the first strategy we create sets of SNPs by combining all SNPs within genes in a set or pathway of interest, and then we apply recently proposed rare variant methods to these sets. In the second strategy we use rare variant methods to generate a statistic for each gene (combining information on all rare variants in the gene) and then apply traditional pathway analysis approaches to the gene-level statistics. We compare the strategies by evaluating the type I error rate and comparing statistical power under a variety of different scenarios. Comparisons are made using simulated phenotype data and real sequence data made available as part of Genetic Analysis Workshop 17 (GAW17).

Methods

Data

All analyses presented here are based on data provided by the organizers of GAW17. The data consist of 697 unrelated individuals from the 1000 Genomes Project genotyped at 24,487 autosomal SNPs with minor allele frequencies (MAFs) ranging from 7.17 × 10^{−4} to 0.499. All SNPs are contained in at least one of 3,205 different genes, and so the data can be considered a mini-exome scan. The organizers of GAW17 simulated a dichotomous phenotype for the 697 individuals. The dichotomous disease phenotype is caused by a combination of measured SNPs (160 SNPs in 36 genes) and unmeasured SNPs. Two-hundred separate simulated phenotype replicates (each based on the same disease model) were produced.

Gene set construction

To evaluate the effectiveness of different approaches to pathway analysis, we constructed 2,000 sets of 25 genes with varying degrees of association with the phenotype. The 2,000 sets fall into four broad categories: (1) Five hundred sets contain some number

Rare variant methods

We used two different rare variant methods in our analyses: the WS method ^{2} statistic to assess statistical significance.

Pathway analysis

The traditional approach to pathway analysis is to first generate gene scores and then to aggregate the gene scores across all genes in a set; an alternative approach is to aggregate SNPs into sets directly. To implement the gene score aggregation approaches, we use the

Results

Type I error

The first step in comparing the eight methods for pathway analysis involved comparing the type I error rates of the eight methods on 500 sets that did not contain any truly causal SNPs (see Table

Type I error rates across the five approaches for the 500 null sets and the 500 nonspurious gene sets

Pathway method

Across 500 null sets

Across 500 nonspurious gene sets

Nominal

Nominal

Nominal

Nominal

No gene-level aggregation

WS

0.492

0.190

0.043

0.004

CMC

0.232

0.054

0.037

0.003

Gene-level aggregation

WS-GSEA

0.048

0.004

0.001

0.000

WS-KS

0.040

0.003

0.000

0.000

WS-Fisher

0.429

0.244

0.010

0.004

CMC-GSEA

0.063

0.007

0.002

0.000

CMC-KS

0.044

0.004

0.004

0.000

CMC-Fisher

0.484

0.235

0.048

0.006

WS, weighted sum; CMC, combined multivariate collapsing; GSEA, gene set enrichment analysis; KS, Kolmogorov-Smirnov test; Fisher, Fisher’s combined probability test.

Power

Next we explored the power of the eight different methods. Because of inflation of the type I error rate as a result of genes showing spurious association with the phenotype, we chose to analyze the power of only the 500 sets containing truly causal genes and noncausal genes that also did not show spurious association with the phenotype. Figure

Power of pathway analysis methods across gene sets with varying numbers of associated genes

Power of pathway analysis methods across gene sets with varying numbers of associated genes

Relationships between power and set characteristics

For each set and method combination, we calculated the number of times out of 200 that the set was identified as significant (

Each of the three methods had particular sets for which it was optimal: The WS method yielded the highest power for 429 of the 500 sets, the WS-Fisher method for 44 of the 500 sets, and the CMC-Fisher method for 27 of the 500 sets. For the 27 sets for which the CMC-Fisher method was best, the average power increase was 8.0% (SD = 7.9%) compared to the WS method and 13.4% (SD = 10.6%) compared to the WS-Fisher method. For the 44 sets for which the WS-Fisher method was best, the average power increase was 9.8% (SD = 9.7%) compared to the WS method. In the rest of this section, we attempt to characterize the sets found to be optimal by each method to provide insight into the strengths and weaknesses of each approach.

A logistic regression model was fitted to predict whether the WS method was the best. The model used five explanatory variables: (1) number of causal genes in the set, (2) number of causal SNPs in the set, (3) total number of SNPs in the set, (4) average MAF for associated genes in the set (MAFs for causal SNPs were first summed within genes), and (5) weighted risk score (sum of pairwise products of MAF and ^{−7}) was negatively associated, whereas the number of causal SNPs (^{−7}) was positively associated. Thus overall the WS method did better than the Fisher methods when the number of causal SNPs in the set was higher and the weighted risk score was lower, this being a situation of consistently small associations in the set. The Fisher methods did better in situations in which there were only a few strong associations in the set.

Discussion

Overall, we find substantial differences in power and type I error rates between the different methods, with the WS, WS-Fisher, and CMC-Fisher methods having the highest power while controlling the type I error rate. However, type I error control occurs only with the elimination of spurious genes, which have been identified by others

As noted, when spuriously associated genes are removed from sets, type I error rates are well controlled by all methods (including the WS and WS-Fisher methods), suggesting that if spurious associations are better handled by rare variant methods, type I errors should be well controlled. As shown by Luedtke et al.

Conclusions

Although the WS method outperforms the WS-Fisher and CMC-Fisher methods in the aggregate, the Fisher methods improve their relative performance when the number of causal SNPs is low and the weighted risk score of the set is high. This situation occurs when a few strongly associated genes or SNPs are present in a set but most of the set is not associated with the phenotype. Further analysis with a more comprehensive set of simulated and real sets is needed to fully explore the advantages and disadvantages of the WS method relative to the WS-Fisher and CMC-Fisher methods. It is particularly important to note that the Fisher methods may still be necessary and useful for pathway analysis because in many real-life applications of pathway analysis only a small fraction of a set may actually be associated with the phenotype.

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

NT and AB designed the study and directed the research. AP, AS, AL and SP implemented the study and analyzed results. AP, AS and NT drafted the manuscript.

Acknowledgments

This work is funded by National Human Genome Research Institute grant R15HG004543.

This article has been published as part of