Passive, automatic detection of network server performance anomalies in large networks

Last Modified
  • March 21, 2019
  • Terrell, Jeff
    • Affiliation: College of Arts and Sciences, Department of Computer Science
  • Network management in a large organization often involves, whether explicitly or implicitly, responsibility for ensuring the availability and responsiveness of resources attached to the network, such as servers and printers. Users often think of the services they rely on, such as web sites and email, as part of the network. Although tools exist for ensuring the availability of the servers running these services, ensuring their performance is a more difficult problem. In this dissertation, I introduce a novel approach to managing the performance of servers within a large network broadly and cheaply. I continuously monitor the border link of an enterprise network and, for each inbound connection, build an abstract model of the application-level dialog it contains. Because the measurement is passive, it does not affect the operation of the server in any way. For each request/response exchange, the model includes a measurement of the server response time, which is the fundamental unit of performance I use. I then aggregate the response times for a particular server into daily distributions. Over many days, I use these distributions to define a profile of the typical response time distribution for that server, and I compare new response time distributions against the profile to determine whether they are anomalous.
I applied this method to monitoring the performance of servers on the UNC campus, testing three months of continuous measurements from over two hundred UNC servers for anomalies. Most of the servers saw at least one anomaly, although for many servers the anomalies were minor. I found seven servers with severe anomalies corresponding to real performance issues. I found these issues without any involvement from UNC network managers, although I verified two of them by speaking with network managers.
To diagnose the cause of each issue, I examined the contextual and structural information measured at the border link. I found a variety of causes: servers overloaded by increased demand, heavy-hitting clients, authentication problems, changes in the nature of the request traffic, server misconfigurations, etc. Furthermore, information about the structure of connections proved valuable in diagnosing the issues.
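The pipeline described in the abstract (per-exchange response times, daily distributions, a multi-day profile, and a comparison of new distributions against that profile) can be illustrated with a minimal sketch. The abstract does not say how the profile is built or how the comparison is scored, so this sketch substitutes a simple stand-in, not the dissertation's method: each day is summarized by a few quantiles of log10(response time), the profile is the per-quantile mean and standard deviation across past days, and a day is flagged when any quantile's absolute z-score exceeds a threshold. The names `Exchange`, `response_times`, `daily_quantiles`, `build_profile`, `anomaly_score`, and `is_anomalous`, the response-time definition (end of request to start of response), the quantile levels, and the threshold of 3.0 are all hypothetical.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Dict, List, Sequence


# Hypothetical record of one request/response exchange seen at the border link.
# Timestamps are in seconds; field names are illustrative assumptions.
@dataclass
class Exchange:
    request_end: float     # time the last request byte was observed
    response_start: float  # time the first response byte was observed


def response_times(exchanges: Sequence[Exchange]) -> List[float]:
    """Assumed definition: gap between end of request and start of response."""
    return [e.response_start - e.request_end for e in exchanges]


# Hypothetical quantile levels used to summarize one day's distribution.
QUANTILE_LEVELS = (0.1, 0.25, 0.5, 0.75, 0.9)


def daily_quantiles(times: Sequence[float]) -> List[float]:
    """Summarize one day by linearly interpolated quantiles of log10(time)."""
    logs = sorted(math.log10(t) for t in times if t > 0)
    if not logs:
        raise ValueError("no positive response times for this day")
    out = []
    for q in QUANTILE_LEVELS:
        pos = q * (len(logs) - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, len(logs) - 1)
        out.append(logs[lo] + (pos - lo) * (logs[hi] - logs[lo]))
    return out


def build_profile(history: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Stand-in profile: per-quantile mean and standard deviation over past days."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    columns = list(zip(*(daily_quantiles(day) for day in history)))
    return {
        "mean": [statistics.fmean(col) for col in columns],
        "stdev": [statistics.stdev(col) for col in columns],
    }


def anomaly_score(profile: Dict[str, List[float]], day: Sequence[float]) -> float:
    """Largest absolute z-score of the day's quantiles relative to the profile."""
    scores = []
    for x, mu, sd in zip(daily_quantiles(day), profile["mean"], profile["stdev"]):
        scores.append(abs(x - mu) / max(sd, 1e-6))  # floor avoids division by zero
    return max(scores)


def is_anomalous(profile: Dict[str, List[float]], day: Sequence[float],
                 threshold: float = 3.0) -> bool:
    return anomaly_score(profile, day) > threshold


# Usage with synthetic data: five ordinary days, then a typical and a slow day.
base = [0.010, 0.012, 0.015, 0.020, 0.050]
history = [[t * f for t in base] for f in (1.0, 1.05, 0.95, 1.1, 0.9)]
profile = build_profile(history)
print(is_anomalous(profile, [t * 1.02 for t in base]))  # → False
print(is_anomalous(profile, [t * 10 for t in base]))    # → True
```

Working on log-scaled response times keeps a handful of very slow exchanges from dominating the summary, and comparing several quantiles rather than only the mean lets the test notice a change in the shape of the distribution, such as a heavier tail, even when the median is unchanged.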
Date of publication
Resource type
Rights statement
  • In Copyright
  • "... in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science."
  • Jeffay, Kevin
Degree granting institution
  • University of North Carolina at Chapel Hill
Place of publication
  • Chapel Hill, NC
  • Open access

This work has no parents.