Hypergraph models of biological networks to identify genes critical to pathogenic viral response

Representing biological networks as graphs is a powerful approach to reveal underlying patterns, signatures, and critical components from high-throughput biomolecular data. However, graphs do not natively capture the multi-way relationships present among genes and proteins in biological systems. Hypergraphs are generalizations of graphs that naturally model multi-way relationships and have shown promise in modeling systems such as protein complexes and metabolic reactions. In this paper we seek to understand how hypergraphs can more faithfully identify, and potentially predict, important genes based on complex relationships inferred from genomic expression data sets. We compiled a novel data set of transcriptional host response to pathogenic viral infections and formulated relationships between genes as a hypergraph where hyperedges represent significantly perturbed genes, and vertices represent individual biological samples with specific experimental conditions. We find that hypergraph betweenness centrality is a superior method for identification of genes important to viral response when compared with graph centrality. Our results demonstrate the utility of using hypergraphs to represent complex biological systems and highlight central important responses in common to a variety of highly pathogenic viruses.

that of the system itself, rather than presenting only a simplified view. Commonly, biological systems and processes present as complex networks of interacting entities, for example within and between genes, pathways, and complexes. Graphs are frequently used to model these interactions, but since graphs can only capture interactions between pairs of entities, they fall short in many cases and are not able to model the full complexity present in biological systems and processes.
In this paper we investigate the role that hypergraph models, as mathematical generalizations of graph models, can play in providing the necessary complexity to capture multi-way interactions in biological systems inferred from genomic expression data. We introduce a hypergraph model of this data using data thresholding, and assert that the complexity provided by our proposed hypergraphs more closely represents the systems being studied. In order to validate this assertion we introduce new average hypergraph centrality metrics and provide a comparison between the use of graph and hypergraph centrality metrics to identify genes that are critical in host responses to viral infection. Our findings show that the genes identified using our hypergraph model and centrality metrics align better with genes previously known to correlate with viral response than do genes identified using similar metrics applied to graphs or using average fold change for each gene across all experimental conditions.

Network science for high-throughput data analysis
Modern biology has been transformed by the rapid growth of technologies to measure the abundance of large numbers of biological entities over many samples simultaneously. Such high-throughput methods like transcriptomics, proteomics, metabolomics, and lipidomics allow researchers to gain unparalleled scientific insight into the mechanisms underlying biological systems. A wide range of biological questions have been addressed using such systems biology approaches including questions related to cancer, microbiomes, and infectious disease. Analysis methods for high-throughput measurements are also varied, ranging from simple statistical tests for differential abundance (between control and experimental conditions, for example), to dimensionality reduction, to machine learning, all with the aim of extracting more relevant information from the high-dimensional and often noisy measurements.
A powerful approach for modeling systems using high-throughput data is network biology. Here biological systems are modeled as graphs, with molecular entities (genes, proteins, metabolites) represented as vertices, and relationships between molecules represented as edges connecting them. Relationships between molecules are generally determined from existing knowledge of protein-protein interactions, regulatory interactions, metabolic networks, or can be inferred from high-throughput systems biology data. We and others have used networks inferred from correlation or mutual information between abundance profiles of genes and proteins to identify critical entities [1][2][3], integrate different data types [4][5][6][7], and represent and predict temporal dynamics in the system [8][9][10].

Hypergraphs for complex network models
While graph-based methods have been quite successful in the biological domain, their ability to model complex relationships amongst entities is necessarily limited. Graphs inherently model relationships (edges) between pairs of entities (vertices). But biological systems are replete with relationships among many entities, for example in protein complexes, transcription factor and microRNA regulation networks, lipid and metabolite enzyme-substrate interactions, metabolic networks, pathways, and protein function annotations. Relationships may be interactions, for example, metabolites working together in a metabolic process, or they may represent some commonality among the entities, like genes being differentially expressed in the same conditions, or regulated by the same transcription factor. In a graph model all of these multi-way relationships would be represented as groups of pairs of subunits, which would not fully capture how groups of components interact or have similar behavior.
Sometimes sets of related components are already understood, and sometimes they need to be discovered in experimental data, like high-throughput 'omics. In either event, a higher order mathematical model is needed. The mathematical object that natively represents multi-way interactions amongst entities is called a "hypergraph". In contrast to a graph, in a hypergraph the relationships amongst entities (still called vertices) are connected generally by "hyperedges", where each hyperedge is an arbitrary subset of vertices. Thus every graph is a hypergraph in which each hyperedge happens to have exactly two vertices. A challenge for scientists is to recognize the presence of hypergraph structure in their data, and to judge the relative value of representing them natively as hypergraphs or reducing them to graph structures.
Hypergraph models allow for higher fidelity representation of data that may contain multi-way relationships, albeit at the price of a higher complexity model. An example using a small subset of transcriptomic expression data is shown in Fig. 1. In the upper left is an expression matrix with log 2 -fold change values for five genes (rows) across four experimental conditions (columns). The lower left shows a hypergraph representation of the data, with each gene modeled as a hyperedge surrounding those conditions (vertices) for which the log 2 -fold change is greater than 2. Those cells in the expression matrix are shown in bold, distinguishing those conditions that are included in that gene's hyperedge. The upper right of Fig. 1 shows a matrix produced from one possible graph-based approach to representing these data. Here each pair of genes is related if there is at least one condition for which both genes have log 2 -fold change greater than 2. This would then be interpreted as an "adjacency matrix" of a graph, which is shown in the lower right. It can be seen that this graph representation necessarily loses a great deal of information, boiling down the rich interaction structure that we know to be present to a fully connected graph on all five genes. For example, the hypergraph shows that two pairs of genes-AARS and ABHD11, AASDHPPT and ABCB6-are much more related than other pairs. This fact is not apparent in the graph model.
Although graphs and graph theory dominate network science applications and methods [11], hypergraphs are well-known objects in mathematics and computer science. They have a history of use in a range of applications [12][13][14], and are seeing increasingly wide adoption [4,[15][16][17][18]. In the biological literature we have seen hypergraphs used to model gene and protein interaction networks, pathways, and metabolic networks as derived from a variety of data types. In many of these cases the authors derive hypergraphs from an underlying graph, rather than directly from data. For example, Chitra built a hypergraph model based on an existing graph model of gene interaction networks [19]. They adapt the PageRank algorithm to hypergraphs in order to study disease-gene prioritization, and find that for monogenic diseases hypergraph PageRank noticeably outperforms graph PageRank. Tran studies protein function prediction building a graph from a similarity matrix derived from gene expression data [20] and then applying soft clustering to this graph to produce a hypergraph. Function prediction using this hypergraph is then shown to be superior to predictions based on graphs. Protein-protein interaction networks are studied by Klamt et al. using graph algorithms to find sets of independent elements or tightly connected elements [13]. In those three papers the authors infer a hypergraph from a graph structure rather than directly from data.
Ramadan et al. use hypergraphs to model the yeast proteome, where proteins are vertices and complexes are hyperedges [21], and apply an algorithm that finds tightly connected vertices to identify the core proteome. Finally, Zhou and Nakhleh study the claim that metabolic networks are hierarchical and small-world [22]. While this claim comes from a graph model of the networks, Zhou and Nakleh instead model the metabolic networks of E. coli as a hypergraph and show that the claimed hierarchy and scaling properties are not supported. This result in particular conveys a critical message: when biological interactions are simplified into pairwise relationships and modeled using a graph, they can exhibit very different structure than when their true complexity is modeled using a hypergraph. Because of this structural variance, conclusions drawn based on the graph could provide misleading results. Although the data we consider are different, our method is similar to these last two papers in that we build hypergraphs directly from biological data rather than inferring a hypergraph from a standard graph model of the data. We have not observed researchers building hypergraphs directly from 'omics data, as we will in this paper.

Modeling host response in viral pathogenesis
Viral infection causes a response in the host cells in which the expression of a variety of cell systems are up-or down-regulated. The pathogenesis of the infection is reflected in the signature of host responses elicited by each virus. Host response to viral infection has been extensively studied for decades, yet the root mechanisms of why some infections are severe and some are not remain poorly understood. However, high-throughput molecular approaches offer a way to discover novel host response genes, proteins, and pathways that contribute to the systems-level development of pathogenesis. A major advantage of such a systems biology approach to pathobiology is the ability to identify novel, key elements of a biological process, such as which regulators are involved in critical processes. High-throughput profiling methods (e.g. transcriptomics) provide powerful tools for examining how entire systems respond to different perturbations such as acute disease. Network reconstruction provides the opportunity to utilize all available data and is a critically important tool for representing complex sets of interactions [23].
In this paper we develop and explore a new hypergraph model (see "Hypergraph representations and centrality metrics" section) of host response using transcriptomics data from viral infection by five highly pathogenic viruses in a number of biological systems (see "Data acquisition and processing" section). We found that gene rankings computed using an average hypergraph centrality were highly enriched for known immune and infection-related genes. While rankings derived from graphs constructed using other traditional computational biology techniques applied to the same infection data also resulted in rankings enriched for critical genes, we demonstrate that our hypergraphbased metrics yield superior enrichment results. These results highlight the usefulness of our hypergraph model for exploring mechanisms of virus pathobiology.

Results
By analyzing the curated omnibus transcriptomic data set described in "Data acquisition and processing" section from cells infected with five different viruses and their mutants using both graph and hypergraph approaches, we illustrate the advantages of applying our hypergraph approach to uncover the underlying molecular signatures and mechanisms common across host response to viral infection broadly.

Hypergraph and graph structure
We create hypergraphs from transcriptomics log 2 -fold change data calculated from gene expression levels of infected experiments relative to time-matched uninfected mock experiments Formally defined in "Hypergraph representations and centrality metrics" section, in our hypergraphs hyperedges represent genes and vertices represent conditions. The vertex representing condition X is contained in the hyperedge representing gene G if gene G is significantly perturbed, either up-or down-regulated, for condition X. The entire hypergraph represents n = 179 experimental conditions (vertices) and m = 7,782 genes (edges). A small subset of highly connected hyperedges (that is, genes with a large core of common conditions), is shown in Fig. 2.
Distributions of fundamental hypergraph statistics can illuminate some of the complex interaction structure present in the data. Figure 3a shows that the distribution of the sizes of the hyperedges (that is, the number of conditions a gene is significantly perturbed in) is roughly power-law, sometimes referred to as "heavy tailed". This means that there are many genes (1,247 of them) significantly perturbed in only one condition and relatively few genes significantly perturbed in many conditions, with a maximum number of conditions for a single gene on the order of 100. The six largest hyperedges, with sizes greater than 100 in increasing order, correspond to the genes ISG15, IL6, ATF3, RSAD2, USP18, and IFIT1. All of these genes are part of the interferon response, a critical pathway in response to viral infections [24,25].
On the other hand the vertex degree distribution is the number of edges a particular vertex is contained in (that is, the number of genes significantly perturbed for each condition (vertex)). This is shown as a histogram in Fig. 3b, and it has a very different shape than the edge size distribution. There are relatively few conditions with small numbers (less than 200) or large numbers (more than 500) of significantly perturbed genes. The most common number of significantly perturbed genes for a condition is between 400 and 500. This peak is likely an artifact of how we choose when a vertex is contained within an edge. The degree of a condition vertex is the number of genes that are significantly perturbed for that condition. By our procedure these are genes with z-score higher than 2 and p-value less than 0.05. If we only used the z-score threshold, and our fold change data are normally distributed for each condition, then we would expect that 5% of the genes would have z-score greater than 2. There are 9,760 genes in our data and 5% of that would be 488 genes, which is roughly where the peak is. The skewness and additional modes of the distribution of degrees are likely due to the addition of the p-value condition.
Finally, Fig. 3c shows another power-law distribution, this time of the size of pairwise edge intersections, or in other words, the number of conditions that pairs of genes are both significantly perturbed within. We see that there are many pairs of genes that have few conditions in common and only a few pairs of genes that have many conditions in common, again with a maximum on the order of 100. The pair of genes with largest intersection is not surprisingly the two largest edges, IFIT1 and USP18, with 103 conditions in common. Interestingly, IFIT1 and USP18 are both well-established interferon response genes, with IFIT1 strongly promoting interferon activity, and USP18 serving to dampen the response [26]. In order to compare our hypergraph approach to more common graph approaches we employed the CLR graph methodology that we have used previously to enrich for important genes in a network [1,2,27]. The CLR algorithm was run on the matrix of transcriptomics log 2 -fold change values using parameters spline = 3 and bins = 10 (as used in the original CLR manuscript [28]) and the resulting matrix was filtered for all mutual information values ≥ 2 . With this approach, any two genes with mutual information above the threshold have similar expression profiles and form a graph edge.
In comparing the CLR graph to our hypergraph we find distinct differences, indicating that the structures are capturing different relationships about sometimes different sets of genes. First, the set of genes present in the hypergraph is a subset of the genes present in the CLR graph, meaning that any gene that is shown to be significantly perturbed in at least one condition (i.e., present in the hypergraph) has mutual information ≥ 2 with at least one other gene (i.e., present in the CLR graph), but not necessarily vice versa. Comparing the edges present in the CLR graph to the hyperedge intersections present Fig. 3 Distributions of simple hypergraph statistics of our hypergraph. a Edge size distribution. Each edge represents a gene, the size of the edge is the number of conditions in which the gene is significantly perturbed. b Vertex degree histogram. Each vertex represents a condition, the degree of a vertex is the number of genes significantly perturbed in that condition. c Pairwise edge intersection size distribution. Each edge represents a gene. The intersection between two edges indicates the set of conditions that both genes are significantly perturbed in in the hypergraph, of the nearly 3.7 million edges in the CLR graph, 2.4 million of them are between genes that are in the hypergraph. Roughly 1.1 million of these are present as hyperedge intersections in the hypergraph while 1.3 million do not have a corresponding hyperedge intersection. The remaining 1.3 million edges have one or both endpoints in the set of genes that are not present in the hypergraph. Moreover, there are a total of nearly 6 million nonempty pairwise hyperedge intersections (gene interactions) in the hypergraph, indicating that the hypergraph is expressing additional structure and relationships among genes that the CLR graph does not capture. Finally, for each gene hyperedge we compute the number of other gene hyperedges that it intersects. These values are only loosely correlated with the CLR graph vertex degrees (number of other genes with mutual information ≥ 2 ), with Pearson correlation 0.25. This indicates that not only are there additional connections in the hypergraph, as observed above, there is not a linear relationship between the number of CLR graph connections for a gene and the hypergraph connections for a gene. In other words, one cannot infer the hypergraph from the CLR graph, as it represents fundamentally different higher order relationships specifically over the set of genes with significant perturbation.

Gene importance rankings
Previous studies using graph approaches with similar viral data have demonstrated that network measures like betweenness centrality could be used to identify critical genes [1]. In the present work we hypothesize that extensions of these common graph metrics to a hypergraph, as defined in "Hypergraph representations and centrality metrics" section, can be leveraged to improve upon this prior work. In particular, we hypothesize that, as has been shown in the graph setting, high s-centrality is correlated with the gene being more important in host response to pathogenic viruses.
We calculate average s-betweenness centrality, BC s (g) , and average harmonic s-closeness centrality, HCC s (g) , for all genes (hyperedges) in the hypergraph for 1 ≤ s ≤ 50 . These average centralities change as s increases only for gene hyperedges that overlap in s vertices with at least one other gene hyperedge. In our hypergraph, for s = 50 less than 1% of hyperedges remain connected to at least one other hyperedge with that overlap size, resulting in very little change in the centrality values. Both of these average s-centrality computations provide a numerical value for each gene that can be used to rank the genes from most important (high centrality) to least (low centrality).
To serve as a simpler, but still hypergraph-based, comparison we created another ranked list using hyperedge size for each gene, i.e., the number of conditions the gene was significantly perturbed in. Larger edge sizes indicate more conditions in which the gene was significantly perturbed and therefore the gene is potentially more important to host response.
To compare our hypergraph centrality ranking approach to the CLR graph approach we use the NetworkX graph analytics Python package [29] to calculate vertex degree, betweenness centrality, and harmonic closeness centrality of the CLR association graphs.
To provide a simple baseline for comparison a final ranked list was computed directly from the log 2 -fold change table without using any graph structure. For each gene we computed its average absolute value of log 2 -fold change and ranked the genes from highest to lowest average. Higher values mean the gene is more likely to be highly perturbed from the mock-infected samples in many conditions.
In the Additional file 1 we provide all gene rankings for average s-betweenness centrality and average harmonic s-closeness centrality, hyperedge size, CLR graph betweenness, CLR graph closeness, CLR graph degree, and average fold change.

Comparison of rankings
To ascertain whether our hypergraph rankings are more highly enriched for genes known to be important in host response to viral infection, we gathered three distinct sets of genes: 1) all genes associated with the Gene Ontology (GO) term "immune response" (GO:0006955), downloaded from amigo.geneontology.org, referred to as 'IR' hereafter, 2) interferon-stimulated genes gathered from interferome.org (http:// www. inter ferome. org/ inter ferome/ search/ searc hGene. jspx), referred to as 'ISG' , and 3) a set of human proteins known to be targets of pathogens acquired from Dyer, et al. [30], referred to as 'PT' . Although this is a limited set in terms of number of targets, it represents a set collected from a wide number of pathogens, both viral and bacterial, and is a conservative set for assessing the performance of our method and making comparisons between different approaches and parameters. In Table 1 we show the size of each gene set (along the diagonal) and the sizes of each pairwise intersection of gene sets (off the diagonal). Since our data encompasses a wide variety of virus types and infection systems, general immunerelated sets were deemed suitable for our purpose.
In order to measure the performance of our rankings, we applied gene set enrichment analysis (GSEA) [31] to each of our gene rankings (average hypergraph s-centralities, hyperedge size, CLR centralities, CLR vertex degree, and mean fold-change) using the three immune-related sets as target gene sets. The GSEA score of a ranked list, computed for a specific gene set, quantifies how concentrated the gene set is at the extremal values of the list. A high GSEA score means the gene set is concentrated at the top of the list while a low (highly negative) score indicates that the gene set is concentrated towards the bottom of the list. A score closer to zero means that the gene set is more uniformly distributed throughout the ranked list. The significance, or p-value, of an observed enrichment score, ES, is assessed by comparing it with a set of ES 0 randomized scores. Figure 4 shows GSEA scores for all rankings and for all three target gene sets. We note the following conclusions: • Both average s-centrality metrics for most s values, as well as hyperedge size, showed much higher enrichment than lists derived from CLR graphs and average fold change. • But average s-betweenness enrichment was universally higher than average harmonic s-closeness enrichment, suggesting that these two measurements are capturing fundamentally different behavior within hypergraphs, and that average s-betweenness appears to be more effective at capturing genes that are important in host responses to viral infection. • Both centrality enrichment results (betweeness and closeness) improve significantly when larger s values are taken into account, indicating that when higher order interactions are considered, they become more powerful in identifying important genes. Although the maximum intersection between two hyperedges is 103, Fig. 4 indicates that by s = 20 our best performing measure, average s-betweenness centrality, plateaus with only minimal increase in enrichment score as s increases further.
As noted in "Hypergraph representations and centrality metrics" section we also constructed hypergraphs using z-score thresholds of 3, 4, and 5 and computed GSEA scores for their average s-betweeness and average harmonic s-closeness rankings. Versions of Fig. 4 for z = 3, 4, and 5 are included in the Additional file (see Additional files 2-4: Fig. S1-Fig. S3). The choice of z-score values did not change the conclusion that GSEA scores increase with s values, or that rankings derived from the hypergraph have higher GSEA scores than those derived from the CLR graph. However, with smaller hypergraphs (from large z thresholds), the p-values of GSEA increase (see Additional files 5-8 Fig. S4-Fig. S7). Therefore, z = 2 is an adequate z-score threshold to balance high GSEA score with low p-value of the score, under a permutation test. A summary visualization of our results is shown in Fig. 5 taking the rankings for both average s-betweenness and harmonic s-closeness for the highest level of s = 50 as the representative hypergraph centrality rankings. We compare those with the five other rankings and again see that average s-betweenness centrality outperforms all other measures. While average harmonic s-closeness centrality outperforms all graph measures it is outperformed by the simple hyperedge size ranking. The p-values, nearly all significantly less than 0.05, are shown in the same plot at the end of the bars. These results demonstrate that average s-betweenness, but not necessarily average harmonic s-closeness, considers the complexity of the hypergraph and provides superior performance over graph metrics with regards to identifying biologically important genes. This aligns with prior work in which betweenness centrality computed for vertices in a graph identifies important genes in a network [1,2,27,32].
In order to compute the average s-centrality measures we first separately computed s-centralities BC s (e) and HCC s (e) for each gene edge s. We also computed enrichment scores for these ranked lists. However, we do not include these in our overall comparison because while some s-values produced rankings that were more enriched than others we did not see any trend in performance that would lead us to choose an optimal single s value (e.g., enrichment scores are not unimodal as a function of s). This led us to consider the average s-centralities, in order to take advantage of all intersection levels simultaneously. In future work it may be worth reexamining single s values, or removing

Discussion
We draw attention to two primary observations of interest in our results. First is the observation that s-betweenness centrality consistently outperforms s-closeness centrality. At first glance this seems surprising since betweenness and closeness calculated on the CLR graph have comparable performance. While the explanation for this result is the subject of ongoing investigation, we observe that these two types of centrality are measuring significantly different properties. For both graphs and hypergraphs high harmonic closeness centrality indicates that on average a gene is close to many other genes, while high betweenness centrality means that a gene is on many short paths between other genes. Sometimes these two notions coincide, as seems to be the case in the CLR graph, but there are cases in which they do not. For example, a gene that may be more on the periphery, i.e., not on many (short) paths, could still be very close to a central core. Being on few short paths this gene would have very low betweenness. However, since it is close to a central core it could have high closeness score.
This seems to be the case in the hypergraphs we are studying. Since there are many conditions that have a lot of significantly perturbed genes (see Fig. 3b) there is likely a large central core in the hypergraph that increases the closeness scores for peripheral genes, and perhaps all genes. Indeed we have observed that for small values of s the s-closeness values do not correlate with edge size, however for large s values the s-closeness scores do tend to correlate with edge size. This likely means that for low s values the closeness is somehow washed out by this central core and any variability we see is not significant. In contrast, for s-betweenness we see a correlation between edge size and betweenness at all s values. However, even though s-betweenness is correlated to edge size for all s values its enrichment score is still much larger than that for edge size and so seems to be capturing something more significant about hypergraph structure.
This difference between closeness and betweenness may also be related to the nature of the large gene expression data set used in our study. Since both mouse and humanbased gene expression data were included in the hypergraph some genes may serve as bridges between different regions of the hypergraph (e.g. predominantly human regions vs predominantly mouse regions). Genes that are truly important in host response to viral infection would be important across species and more effectively brought to light by the betweenness measure that tends to highlight elements occupying bridge-like positions in the hypergraph. Thus, betweenness centrality may be most useful for identifying critical elements when heterogeneous data sets are analyzed.
Our second observation is that our average s-betweenness centrality significantly outperforms established graph centrality techniques. This is entirely in keeping with our expectation, as the purpose of hypergraphs is to capture the complex, multi-way interactions present in a system that are beyond the ability of graphs to model. Thus where betweenness centrality has been used in prior studies to identify important biological features the application of hypergraph s-betweenness may promote discovery of additional features of interest. While the finding that hypergraph betweenness represents a new tool for identifying critical hypergraph elements is an exciting contribution of this study, it also presents an additional immediate benefit: genes highly ranked by hypergraph betweenness that do not appear in any of our target gene sets represent potentially novel discoveries of genes central to viral infection. One good example of this is the ZZZ3 gene, which appears in position 4 out of 7,782 in the average hypergraph betweenness ranking, but does not appear in any of the IR, ISG or PT gene sets. ZZZ3 is part of the histone reader ATAC complex, which scans the state of histone modification and contributes to gene activation/repression mechanisms [33]. No known connection between virus infection and ZZZ3 exists, but it may serve a critical role in regulating gene expression in response to general infection.
Similarly, EPHX1, GDF15, and DUSP1 were not included in the three gene sets and ranked 29, 30 and 33, respectively. These genes are identified as an epoxide detoxification component, a stress responsive cytokine and a stress-responsive phosphatase, respectively. These roles may be related to virus-induced stress in host cells, but the specific mechanisms involved are yet to be elucidated. More exploration of these and other highly ranked genes is the subject of future work for us.

Conclusion
The work we present in this paper is similar to much of the work surveyed in our literature review in that we show the value of hypergraphs over traditional graph analysis of biological data. However, our work differs from these prior studies in a number of ways. First, our hypergraphs are built natively from transcriptomics data rather than based on existing graph models of systems. Although still capturing some multi-way complexities, hypergraphs inferred from graphs may include some induced interactions not actually present in the system that is being modeled. Creating hypergraphs natively from the data avoids this imputation. Other papers we surveyed do create hypergraphs natively from other types of data, but rather than applying centrality measures instead study more structural features like highly connected vertices.
Previous work [1,2,27,32] had demonstrated that graph metrics can be used to identify important genes in association graphs, and so we set out to determine if hypergraphs provided an improvement over graphs. To assess performance of (hyper)graphs derived from our large viral infection gene expression data set, we identified three gene sets related to virus/pathogen infection and performed an enrichment analysis of our ranked lists compared to these gene sets. While the sets were partially overlapping they represented relatively distinct aspects of viral infection in general. Our results show that average s-betweenness, but not necessarily average harmonic s-closeness, was a useful metric that is able to identify key genes in a comprehensive gene expression data set. While average harmonic s-closeness does outperform both CLR centrality measures, CLR degree, and average fold change, it does not exceed the performance of a simple ranking according to hyperedge size, which does not require the full hypergraph structure to calculate. On the other hand, ranking based on average s-betweenness outperformed all other metrics.
The hypergraphs we created used samples from a wide range of viruses, strains, cell types, and time since infection. In future work we plan to apply this measure to compare critical genes in viral response across differing sample features. For example, we will split our hypergraph based on pathogenicity (high vs. low), cell type or host, and time since infection (early vs. late). Comparing the critical genes across these different hypergraphs may allow us to discover previously unknown indicators of viral infection for early detection or severity determination. Other future work we plan to pursue includes considering other hypergraph constructions, other data types, and hypergraph algorithms to identify highly connected vertices. We plan to combine transcriptomics with proteomics and other 'omics measurements to understand whether hybrid hypergraphs yield better results or if the inclusion of more data washes out the complexities.

Ebola Virus
(Wild type and two mutants) in human hepatocyte cells.

Influenza Virus
H7N9 (Wild type and two mutants) in human lung epithelial cells, H1N1 (wild type) in human lung epithelial cells, H1N1 (wild type) in mouse lung, H5N1 (wild type and one mutant) in mouse lung, H7N9 (wild type and two mutants) in mouse lung.

MERS-coronavirus (Wild type and four mutants) in human lung epithelial cells,
(wild type only) in ex vivo human epithelial cells, in ex vivo human lung fibroblasts, and ex vivo human lung microvascular endothelial cells. SARS-coronavirus (Wild type and four mutants) in human lung epithelial cells and mouse lung.

West Nile Virus
(Wild type and one mutant) in mouse cerebral cortex, mouse cerebellum and mouse lymph node.
Raw microarray data was processed for background correction, quantile normalization and summarization using the limma package for R (available on Bioconductor) to derive a single normalized intensity value per probe. We utilized a conservative approach and decided to only remove samples (arrays) that showed obvious evidence for failed hybridization or damaged arrays. These types of occurrences are easily detected by inspecting PCA plots and expression heatmaps. Specifically, we looked at PCA plots and expression heatmaps of the 500 most variable genes according to the coefficient of variation. This manual approach resulted in preserving as much data as possible.
The data in this form, as it is available from the above GEO repositories, was used for further compendium construction. Differential expression analysis with linear models in limma [34] was used to identify fold changes and p-values for significance of changes, as well as adjusted p-values to account for multiple testing. Since datasets were collected using multiple microarray platforms, merging gene identifiers using probe IDs was not possible, so genes were matched at the gene symbol level instead. Individual genes are often represented by multiple probes, so only the most significantly changed probe among those representing a single gene was retained in each dataset. This provided a way to match data rows across experiments, and resulted in a compendium matrix of genes common to all datasets with 9,760 rows, with each row representing a single gene.

Hypergraph representations and centrality metrics
Formally, a hypergraph is a structure V , E , with V = {v j } n j=1 a set of vertices, and E = {e i } m i=1 a family of hyperedges with each e i ⊆ V . Hyperedges can come in different sizes, |e i | , possibly ranging from the singleton {v} ⊆ V (distinct from the element v ∈ V ) to the entire vertex set V. A hyperedge e = {v 1 , v 2 } where |e| = 2 is the same as a graph edge and so it follows that all graphs are hypergraphs, specifically identified as being "2-uniform". Where clear from context we may use the terms edge and hyperedge interchangeably.
We construct a hypergraph from transcriptomics data using a threshold approach, much like the example in Fig. 1. Again, vertices v j will represent individual biological or experimental "conditions" (e.g., mouse lung cells treated with a strain of Influenza virus and sampled at 8 h) and hyperedges e i represent genes. Thus for us, a hyperedge e i is a gene i that includes a collection of conditions j as its vertices v j . For each condition, we transform the log 2 -fold change values (relative to uninfected mock) for all of the genes into absolute value z-scores. Then, the vertex representing condition X is contained in the hyperedge representing gene G if the absolute value z-score for G in X is greater than or equal to 2 and the adjusted p-value for that log 2 -fold change measurement is less than 0.05. Since transcriptomics log 2 -fold change values tend to be normally distributed for each condition across all genes a z-score transformation is a reasonable way to get all conditions onto the same scale before applying a threshold. The specific thresholds on z-score and p-value were chosen as commonly used in the field, and in exploring other z-score thresholds we have verified limited sensitivity to them. We note that using a higher z-score threshold results in smaller hypergraph, generated from the genes that change more dramatically.
In this way, hyperedges correspond to genes, and indicate the groups of conditions in which that gene is both highly perturbed (either up or down) from the mock infected control condition, and for which that perturbation is statistically significant. We say that the gene is "significantly perturbed" in the condition. Unlike hypergraph models of pathways or metabolic reactions, hypergraphs constructed from highthroughput data do not necessarily represent actual biological interactions but rather capture relationships based on similar behavior among entities.
It is important to point out that this method to construct a hypergraph using thresholds on absolute value of z-score and p-value is a specific case of a flexible framework we propose for how hyperedges can be formed from 'omics data. Applying other thresholds will result in different hypergraph models of the same data, to potentially answer different questions. For example, in order to understand the relationship and behavioral similarity among up-regulated genes one might consider a gene hyperedge to contain those conditions for which the gene has high raw (as opposed to absolute value) z-score or log 2 -fold change, as in the Fig. 1 example. One could also form edges from conditions for which a gene has a highly negative z-score or fold change, to explore the structure of down-regulated genes. We chose a threshold on the absolute value of z-score in this paper as an attempt to understand genes which are perturbed at all in response to viral infection.
We recognize that this formulation is fundamentally different from a typical graph approach to systems biology data. One such example of a graph approach is context likelihood of relatedness (CLR) in which genes that show similar expression patterns across all conditions, as measured by mutual information, are linked together [28]. Our approach to constructing hypergraphs from the data can be seen as having greater sensitivity and flexibility since it allows similarity between genes to be assessed across any number of conditions (as quantified by the size of the overlap of their hyperedges) rather than requiring assessment across all conditions, as in the mutual information calculation used to define CLR edges. In "Gene importance rankings" section we provide a comparison between our hypergraph and the CLR graph formed from the data.
Another difference between our hypergraph formulation and typical graph approaches is that in graph approaches vertices represent genes and edges indicate some relationship between genes such as interaction or expression correlation. Our motivation for swapping the roles of vertices and edges is for the sake of clarity in our description of hypergraph centrality measures below. Moreover, as a technical matter, each hypergraph H determines another one, called its "dual" H * , formed exactly by swapping the roles of vertices and edges [35]. Therefore, the dual to our hypergraph formulation has the more traditional form with genes as vertices, but the description of hypergraph centrality in this setting would be less intuitive.
As in graphs, the way in which hyperedges connect vertices in complex patterns is central to the study of hypergraphs. While many hypergraph topological measures are available, either as generalizations of graph measures to account for multi-way interactions or as native hypergraph-only measures, our focus in this paper is applying generalizations of graph centrality measures to hypergraphs built from transcriptomics data to identify important genes. In order to define these hypergraph centrality measures we must first introduce the notions of a hypergraph walk and distance [36]. Given two hyperedges e, f ∈ E , an s -walk between e and f is a sequence of hyperedges e 0 , e 1 , . . . , e k such that e 0 = e , e k = f , and s ≤ |e i ∩ e i+1 | for all 0 ≤ i ≤ k − 1 . An s-walk with k + 1 edges has length k. In other words, an s-walk is any sequence of edges, not necessarily of minimal length, such that pairwise intersections between neighboring edges have size at least s. Note that a graph walk is a 1-walk. We note that one could define a hypergraph s-walk to be between vertices rather than hyperedges, as is typically done in a graph. But as above, for the sake of clarity in defining centrality measures we use this edge-based definition.
Continuing to follow Aksoy et al. [36], for a fixed s > 0 , we define the s -distance d s (e, f ) between two edges e, f ∈ E as the shortest length of the possibly many s-walks between them. If there is no s-walk between two edges then the s-distance is infinite. Aksoy et al. also define a number of network science methods generalized from graphs to hypergraphs, including vertex degree, diameter, and clustering coefficients. This work will use their generalization of betweenness centrality and harmonic closeness centrality to hypergraphs using the stratification parameter s.
• The s -betweenness centrality of an edge e is where σ s fg is the total number of shortest s-walks from edge f to edge g and σ s fg (e) is the number of those shortest s-walks that contain edge e. • The harmonic s -closeness centrality of an edge e is the reciprocal of the harmonic mean of all distances from e: where E s = {e ∈ E : |e| ≥ s} . We may refer to this as s-closeness in this paper, although elsewhere in the literature this term refers to a slightly different concept where the harmonic mean is replaced with the arithmetic mean. Intuitively, harmonic s-closeness centrality captures the extent to which a given hyperedge is close in s-distance to other hyperedges. In order to have high harmonic s-closeness a hyperedge must have small s-distance to all (or most) other hyperedges. s-Betweenness, on the other hand, identifies bottlenecks in a hypergraph. A hyperedge with high s-betweenness has many shortest s-walks pass through it. In comparison, the original formulation of betweenness and harmonic closeness centrality in the setting of graphs has the s-distance and number of s-paths replaced simply by graph distance and shortest path.
In order to take into account multiple s values simultaneously in our analysis we developed the average s -betweenness centrality and average harmonic s -closeness centrality, averaging the centrality values defined in [36] across a range of s values. Our average s-centralities are defined as Computing average s-centralities for each hyperedge provides a ranked list of hyperedges from most central (high value) to least central. All hypergraph construction, metric calculations, and visualizations were performed using the Python hypergraph library HyperNetX (https:// github. com/ pnnl/ Hyper NetX).
Some conventional approaches to infer graph structures from high-throughput data use correlated gene expression patterns to build connections. In this context, a gene with high degree (i.e. a hub) has similar expression behavior to many other genes, implicating it as a potential master regulator of gene expression. A gene with high betweenness (i.e. a bottleneck) on the other hand, bridges two regions of the graph indicating that it spans two different behavioral profiles. Genes in this position are potentially involved in causing a transition from one response pattern to another. Thus hubs and bottlenecks may represent master gene expression regulators of two different varieties. Previous work by our group and others has shown that graph