Network Capacity Management and the Trouble with Change

While working on parallel projects on different sides of the Capacity Management (CM) problem, I’ve noticed a rather large disconnect between the way some networks operate and the assumptions hidden within the monitoring tools in use.

Stable networks are the ideal case, but not the problem. If a network has a stable configuration, stable routing, and reasonably stable use, then the assumptions made by simple trending are useful for understanding and managing the capacity of network resources. The only dynamics of concern are the addition of new customers and changes in the usage of existing customers. When these dynamics grow, they are seldom bursty, so they can be predicted reasonably well with the simple trending in the available Capacity Management and Application Performance Monitoring (APM) tools. No problem there.
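The simple trending assumption can be sketched in a few lines: fit a linear trend to utilization samples and project when a link would cross a planning threshold. This is a minimal illustration, not any particular tool's method; the utilization numbers and the 80% threshold are hypothetical.

```python
# Minimal sketch of simple trending for capacity management:
# least-squares linear fit over daily peak utilization, then
# project when the trend crosses a planning threshold.

def days_until_threshold(samples, threshold=0.8):
    """Return days (beyond the last sample) until the fitted linear
    trend crosses threshold, or None if the trend is flat or falling."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    cross = (threshold - intercept) / slope  # x where trend hits threshold
    return max(0.0, cross - (n - 1))

# Hypothetical daily peak utilization on one link, growing steadily:
util = [0.50, 0.52, 0.55, 0.56, 0.60, 0.61, 0.64]
print(round(days_until_threshold(util), 1))
```

On stable, steadily growing traffic like this, the projection is meaningful; the rest of this post is about why it stops being meaningful when the network is dynamic.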

However, networks are often dynamic, and that can be a problem. A network failure forces a change in routing. A special event or mission drives a new network usage dynamic. Some services are designed to be very dynamic, such as policy-based routing (PBR), OpenFlow, and Software Defined Networking (SDN). One reason for implementing these capabilities is to allow a network to be more dynamic, handling traffic by time of day or specific external need, or even giving users more flexibility in their use of the network’s resources. The result is an opportunity for much greater complexity, more degrees of freedom, and many more internal and external causes of changes in traffic. These dynamics make knowing what to do about a potential capacity, usage, or response time problem a much more difficult task. There is the potential for more spikes in usage, and more causes of those spikes that are not necessarily capacity or application problems. The monitoring problem just gets worse, and it becomes more difficult to know when more capacity is needed. The CM and APM tasks just got harder.
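To make the spike problem concrete, here is a minimal sketch (with hypothetical data and thresholds) of separating a sudden deviation from the trailing baseline. A flagged spike may be a reroute after a failure or a special event, not organic growth, so it should not feed straight into a trend-based capacity forecast.

```python
# Flag utilization samples that jump well above a trailing baseline.
# Such spikes often reflect routing changes or special events rather
# than growth that warrants new capacity.

from statistics import mean, stdev

def flag_spikes(samples, window=5, k=3.0):
    """Return indices whose value exceeds the trailing-window mean
    by more than k trailing standard deviations."""
    spikes = []
    for i in range(window, len(samples)):
        base = samples[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma > 0 and samples[i] > mu + k * sigma:
            spikes.append(i)
    return spikes

# Steady usage with one failure-driven reroute at index 8:
util = [0.40, 0.41, 0.40, 0.42, 0.41, 0.42, 0.41, 0.43, 0.75, 0.42]
print(flag_spikes(util))
```

The hard part, as the post argues, is not detecting the spike but deciding what caused it; this sketch only shows why naive trending over such data misleads.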

Some tools are built to address the common architectures that apply some of these dynamics, but many tools do not, and none addresses all potential use cases. There are good APM and CM tools for cloud environments, and some that handle Link Aggregation Groups (LAG). But the general application of these new, dynamic features of networking is not well addressed by the general networking tools community. Some tools now offer flow-based analysis, which helps a lot. Some allow dynamic analysis of subsets of traffic data in very general ways, but that requires expert analyst users, who are expensive, rare, and busy. Analytics has the potential to assist, when the right information is presented, but that is also a hard problem.
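As an example of what flow-based analysis buys you, here is a minimal sketch that aggregates hypothetical NetFlow-style records by source and ranks the top talkers; the addresses and byte counts are invented for illustration. Ranking sources helps distinguish a genuine capacity shortfall from one heavy user or one rerouted flow.

```python
# Aggregate hypothetical flow records (src, dst, bytes) by source
# address and rank the heaviest senders ("top talkers").

from collections import defaultdict

def top_talkers(flows, n=3):
    """Return the n sources sending the most bytes, largest first,
    as (source, total_bytes) pairs."""
    totals = defaultdict(int)
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical flow records:
flows = [
    ("10.0.0.5", "10.0.1.9", 5_000_000),
    ("10.0.0.7", "10.0.1.9", 1_200_000),
    ("10.0.0.5", "10.0.2.3", 3_500_000),
    ("10.0.0.9", "10.0.1.9",   800_000),
]
print(top_talkers(flows, n=2))
# → [('10.0.0.5', 8500000), ('10.0.0.7', 1200000)]
```

Real flow tools group by many more keys (application, interface, QoS class), which is exactly the kind of dynamic slicing that today demands an expert analyst.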

At the very least, it is apparent that specific architectures and use cases will drive specific requirements for their monitoring tools to an even greater degree than ever before. Flexible tools and well-trained expert users will allow greater flexibility and insurance, at a cost. Improved development in the CM and APM space will need to step in next to drive down those costs.

About Rupe

Dr. Jason Rupe wants to make the world more reliable, even though he likes to break things. He received his BS (1989) and MS (1991) degrees in Industrial Engineering from Iowa State University, and his Ph.D. (1995) from Texas A&M University. He worked on research contracts at Iowa State University for CECOM on the Command & Control Communication and Information Network Analysis Tool, and conducted research on large scale systems and network modeling for Reliability, Availability, Maintainability, and Survivability (RAMS) at Texas A&M University. He has taught quality and reliability at these universities, published several papers in respected technical journals, reviewed books, and refereed publications and conference proceedings. He is a Senior Member of IEEE and of IIE. He has served as Associate Editor for IEEE Transactions on Reliability, and currently works as its Managing Editor. He has served as Vice-Chair for RAMS, on the program committee for DRCN, and on the committees of several other reliability conferences because free labor is always welcome. He has also served on the advisory board for IIE Solutions magazine, as an officer for the IIE Quality and Reliability division, and in various local chapter positions for IEEE and IIE. Jason has worked at USWEST Advanced Technologies, and has held various titles at Qwest Communications Intl., Inc., most recently as Director of the Technology Modeling Team, Qwest's Network Modeling and Operations Research group for the CTO. He has always been those companies' reliability lead. Occasionally, he can be found teaching as an Adjunct Professor at Metro State College of Denver. Jason is the Director of Operational Modeling (DOM) at Polar Star Consulting, where he helps government and private industry plan and build high-performing, reliable networks and services. He holds two patents. If you read this far, congratulations for making it to the end!
