STATISTICAL ANALYSIS OF RELATIONAL DATA: MINING AND MODELING COMPLEX NETWORKS

Networks have created many new and exciting areas of scientific inquiry, particularly in the field of statistics. The relational and often complex nature of network data requires the development of new statistical techniques that address analysis, modeling, and simulation. In this dissertation, we make contributions to the development and application of statistical methodology on network data. This work is divided into two related areas. The first part of the dissertation is devoted to the problem of community detection: the unsupervised clustering of vertices in a network. Community detection is a common and important first step in the analysis of networks because networks tend to cluster into densely connected groups of vertices that often closely associate with important physical patterns of a modeled system. We develop and evaluate two novel significance-based detection techniques - the Extraction of Statistically Significant Communities (ESSC) algorithm, and Multilayer Extraction. The ESSC algorithm is a hypothesis testing approach for undirected networks, while Multilayer Extraction is a score-based approach that identifies significant vertex-layer communities in multilayer networks. The performance and potential use of both methods are investigated through simulations and real data applications, and large graph consistency is established for the Multilayer Extraction algorithm. The second part of the dissertation is devoted to the simulation and modeling of networks with weighted edges. The generalized exponential random graph model (GERGM) was recently proposed to model networks with continuous-valued edges; however, current estimation algorithms for the GERGM only allow inference on a restricted family of model specifications. We develop a Metropolis-Hastings estimation method that greatly extends the family of weighted graphs that can be modeled under the GERGM framework. 
We show that the new, more flexible model specifications can avoid the common problem of likelihood degeneracy. Furthermore, the new specifications can efficiently capture network motifs in applications where such models were not previously available.