Affiliation: School of Medicine, Curriculum in Bioinformatics and Computational Biology
Networks have become a common data mining tool to encode relational definitions between a set of entities. Whether the data describe biological correlations or communication between individuals in a social network, network analysis tools enable interpretation, prediction, and visualization of patterns in the data. Community detection is a well-developed subfield of network analysis, where the objective is to cluster nodes into 'communities' based on their connectivity patterns. There are many useful and robust approaches for identifying communities in a single, moderately sized network, but more complicated networks, which are very large or carry additional information, pose challenges. In this thesis, we address three types of challenging network data and show how to adapt standard community detection approaches to handle them. In particular, we focus on networks that are large, attributed, and multilayer. First, we present a method for identifying communities in multilayer networks, where there exist multiple relational definitions between a set of nodes. Next, we provide a pre-processing technique for reducing the size of large networks, where standard community detection approaches might give inconsistent results or be prohibitively slow. We then introduce an extension to a probabilistic model for community structure that takes node attribute information into account, and we develop a test to quantify the extent to which connectivity and attribute information align. Finally, we demonstrate example applications of these methods in biological and social networks. This work helps to advance the understanding of network clustering, network compression, and the joint modeling of node attributes and network connectivity.