While working on parallel projects on different sides of the Capacity Management (CM) problem, I’ve noticed a rather large disconnect between the way some networks operate and the assumptions hidden within the monitoring tools in use.
Stable networks are the ideal case, but not the problem. If a network is stable in its configuration, with stable routing, rarely changing configurations, and rather stable use, then the assumptions made by simple trending are useful for understanding and managing the capacity of network resources. The only dynamic concerns are the addition of new customers and changes in the usage of existing customers. Growth in these dynamics is rarely bursty, and can be predicted reasonably well with the simple trending in the available Capacity Management and Application Performance Monitoring (APM) tools. No problem there.
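To make the assumption concrete, here is a minimal sketch of the kind of simple trending such tools rely on: fit a straight line to historical utilization and project when it crosses a planning threshold. All samples and thresholds below are hypothetical illustrations, not data from any particular tool.

```python
# Minimal sketch of simple trend-based capacity forecasting.
# All utilization samples and thresholds here are hypothetical.
import numpy as np

# Weekly average link utilization (percent of capacity), hypothetical.
weeks = np.arange(12)
util = np.array([41, 42, 44, 43, 45, 47, 46, 48, 50, 49, 51, 53], dtype=float)

# Fit a linear trend: util ~= slope * week + intercept.
slope, intercept = np.polyfit(weeks, util, 1)

# Project when utilization crosses an 80% planning threshold.
threshold = 80.0
weeks_to_threshold = (threshold - intercept) / slope

print(f"growth: {slope:.2f}%/week; reaches {threshold:.0f}% around week {weeks_to_threshold:.0f}")
```

On a stable network this projection is actionable: order capacity well before the crossing date. Everything that follows is about why this simple picture breaks down.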
However, networks are often dynamic, and that can be a problem. A network failure forces a change in routing. A special event or mission drives a new network usage dynamic. Some services are designed to be very dynamic, such as policy-based routing (PBR), OpenFlow, and Software Defined Networking (SDN). One reason for implementing these capabilities is to allow a network to be more dynamic, handling traffic by time of day or specific external need, or even giving users more flexibility in their use of the network’s resources. The result is an opportunity for much greater complexity, more degrees of freedom, and many other internal and external causes of changes in traffic. These dynamics make it much more difficult to know what to do about a potential capacity, usage, or response time problem. There is the potential for more spikes in usage, and for more causes of those spikes that are not necessarily capacity or application problems. The monitoring problem gets worse, and it becomes harder to know when more capacity is needed. The CM and APM tasks just got harder.
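A sketch of the distinction: a spike caused by a reroute or special event sits far outside the recent baseline, where simple trending would either miss it or mistake it for growth. This is a hedged illustration with hypothetical samples and an assumed z-score threshold, not any tool's actual detection logic.

```python
# Hedged sketch: flagging a bursty spike that simple trending
# would misread. Samples and the z-score threshold are hypothetical.
import statistics

# Utilization samples with one reroute-driven spike (index 8), hypothetical.
samples = [44, 45, 46, 45, 47, 46, 48, 47, 90, 48, 49, 50]

window = 6
spikes = []
for i in range(window, len(samples)):
    history = samples[i - window:i]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    # Flag points far above the recent baseline (z-score > 3).
    if stdev > 0 and (samples[i] - mean) / stdev > 3:
        spikes.append(i)

print(spikes)
```

The hard part the text describes is not detecting the spike but attributing it: the same signature can come from a failure-driven reroute, a scheduled PBR change, or genuine demand growth, and only the last one calls for more capacity.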
Some tools are built to address the common architectures that apply some of these dynamics, but many tools do not, and none address all potential use cases. There are good APM and CM tools for Cloud environments, and some that handle Link Aggregation (LAG). But these new, dynamic features of networking are not well addressed by the general networking tools community. Some tools now have the ability to do flow-based analysis, which helps a lot. Some tools allow dynamic analysis of subsets of traffic data in very general ways, but that requires expert analyst users, who are expensive, rare, and busy. Analytics has the potential to assist, when the right information is presented, but that is also a hard problem.
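Flow-based analysis helps because it breaks an aggregate utilization number into who is actually sending the traffic. A minimal sketch, assuming NetFlow/IPFIX-style records reduced to (source, destination, bytes) tuples; the addresses and byte counts are invented for illustration.

```python
# Hedged sketch of flow-based analysis: aggregate hypothetical
# flow records into "top talkers" by source address.
from collections import defaultdict

# (src, dst, bytes) tuples standing in for NetFlow/IPFIX records; hypothetical.
flows = [
    ("10.0.0.5", "10.0.1.9", 1_200_000),
    ("10.0.0.7", "10.0.1.9", 300_000),
    ("10.0.0.5", "10.0.2.3", 900_000),
    ("10.0.0.8", "10.0.1.9", 150_000),
]

bytes_by_src = defaultdict(int)
for src, _dst, nbytes in flows:
    bytes_by_src[src] += nbytes

# Rank sources by total bytes sent.
top_talkers = sorted(bytes_by_src.items(), key=lambda kv: kv[1], reverse=True)
print(top_talkers[:2])
```

Even this trivial cut answers a question trending cannot: whether a utilization change is broad growth or one source shifting its behavior. The expert-analyst problem the text raises is that real tools expose far richer slicing than this, and knowing which slice to look at is the rare skill.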
At the very least, it is apparent that specific architectures and use cases will drive specific requirements for their monitoring tools to an even greater degree than ever before. Flexible tools and well-trained expert users will provide greater flexibility and assurance, at a cost. Improved development in the CM and APM space will need to step in to drive down those costs.