Then you must improve your management. If your organization realizes it needs to do better with its Capacity Management, that is a good place to start. It shows a healthy desire to check your health, if not a recognition of a need to improve. But it is only the first step, and there are plenty more to possibly get wrong. It is tempting to think you lack the tools or information or other elements that make up a good Capacity Management solution. But that cannot be the case. Certainly, it is reasonable to assume any organization interested in Capacity Management must have some resource it must manage, and particularly its capacity. Truly owning a resource means taking responsibility for its capacity, assuring it provides its intended value, not creating a burden or net loss; truly managing that resource. Obtaining the resource without also obtaining its ability to be managed is therefore a lack of management. Having the ability but not doing it effectively is also a failure of management. Therefore, once you recognize you need better capacity management, it is important to recognize immediately that you need better management first!
An important attraction of using Virtual Machines (VM) instead of traditional servers can be presented as a Capacity Management (CM) advantage.
While a traditional server can serve multiple purposes up to a point, it is not built to be as flexible, and doesn’t scale in the same way. A server is a combination of hardware and software, with the software being relatively static in configuration. If needs are highly variable, more servers of different types are needed, and so multiple stochastic demands have to be met with relatively independent servers.
But a VM environment does two key things: it decouples the software from the hardware, so now the software is a pool; and it allows hardware to be configured more freely, so that it can be shared as well. So now software can be provisioned rapidly, closer to real time, and hardware can be shared among a larger pool of resources. By decoupling software from hardware, and making hardware a rapidly configurable resource, the various demands now share a larger pool and may vary less overall. By sharing a larger pool, utilization can be managed more tightly, even with the same variability, so hardware resources are wasted less. Software, because it can be freely copied in such an environment, can be rapidly created and destroyed as needed, as long as licensing allows it.
So what used to have very long lead times, leading to higher costs and lower utilization of resources, now can be managed with higher utilization and lower costs, simply because VMs can be rapidly created and destroyed as needed, in the form needed. The true advantage of VMs, and indeed of Cloud architectures as well, is better systems behavior for CM.
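The pooling effect described above can be sketched numerically. This is only an illustration under assumptions that are not in the original post: the demands are independent and normally distributed, and the function names and example numbers are hypothetical. It compares the capacity needed when every demand gets its own server with the capacity needed when all demands share one pool sized to the same service level.

```python
from math import sqrt
from statistics import NormalDist


def dedicated_capacity(n, mean, sd, service_level):
    """Total capacity when each of n demands has its own server,
    each sized so that its demand fits with probability service_level."""
    z = NormalDist().inv_cdf(service_level)
    return n * (mean + z * sd)


def pooled_capacity(n, mean, sd, service_level):
    """Capacity of one shared pool sized so that the total of n independent
    demands fits with probability service_level."""
    z = NormalDist().inv_cdf(service_level)
    # Standard deviations of independent demands add in quadrature.
    return n * mean + z * sqrt(n) * sd


# Hypothetical numbers: 16 workloads, mean demand 40 units, std. dev. 10 units.
n, mean, sd, level = 16, 40.0, 10.0, 0.99
dedicated = dedicated_capacity(n, mean, sd, level)
pooled = pooled_capacity(n, mean, sd, level)

print(f"dedicated: {dedicated:.0f} units, mean utilization {n * mean / dedicated:.0%}")
print(f"pooled:    {pooled:.0f} units, mean utilization {n * mean / pooled:.0%}")
```

Under these assumptions the pool needs less total capacity for the same mean demand, which is the higher utilization the post describes. Correlated demands would shrink the benefit.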
Returning from Sharkfest, I found that my laptop’s configuration must have changed. The laptop, running Windows 7 home version, was now not able to see a few of the non-Windows machines on my network, including the NAS. Assuming the cause was one of the many changes I made while at Sharkfest, to access the internet on campus, or to tie into the ad hoc networks for the hands-on sessions, I decided to do some digging. And what better place to start than with Wireshark?
Looking at the packet traffic, I could immediately see a problem. The computer was talking to the gateway, but its request for a given IPv4 address was being redirected to a location it could no longer find. I flushed the ARP tables, but saw no change. After more research and a couple of regedit operations, some things got better, but not everything. The computer still thought it needed to append a Home prefix that didn’t exist. Then I corrected an option on the network adapter, but the problem was still not fixed. Then I gave up and went to static IP. That change solved the problem, and all behavior was better than ever before, including the packet capture file. A comparison of packets before and after in Wireshark showed much was improved; the calls to the false home location stopped, as did a lot of the other chatter.
But it bothered me that DHCP was not working properly. I probably just wasn’t patient enough: I switched back to DHCP after a couple of days, and everything remains better than before I left for Sharkfest. The packet trace file looks much better too, and the computer’s network response time is faster. It looks like time would have cleared the issue, either after or without the cleanup I did.
There are a few things this experience teaches:
- Having a baseline of your computer’s performance is always useful. Critical in fact. How else do you know what normal operation looks like in a packet capture?
- Knowing your network is always important for reliability and security, and Wireshark is a great, free, community-supported tool for doing that. Do a packet capture every once in a while and look at the network’s behavior, both to learn and to diagnose.
- Application Performance Monitoring (APM) is tricky, and can be very dependent on the environment. How well would such a system pick up on issues with computers not being able to communicate fully, when users work around the problem anyway? Would it even be able to tell if it never caused a problem with a specific application? And if not, how useful is the solution if that is a critical operation for users? Tools that dive deep, like Wireshark, will always have a role.
- Networks are resilient these days, and sometimes all they need is a bit of time to sort themselves out; patience is a virtue here as well, no matter how fast those packets fly.
- Could someone categorize some common performance issues found with Wireshark, automate some of it, and build indicators to help teach the Wireshark novices? Automation is useful, but only when it accompanies user education for validation and verification.
- There is no substitute for having a good Failure Modes, Effects, and Criticality Analysis (FMECA), or good Root Cause Analysis (RCA) when the former is lacking. In fact, that is a recurring theme in all systems and networks, in all disciplines, in all things important. The packet baseline is strongly related to the FMECA or RCA, as it provides clues for which failure occurred, and how the service is affected.
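As a rough illustration of the baseline idea in the list above, the sketch below compares the protocol mix of a new capture against a baseline capture and reports the protocols whose share of packets moved noticeably. It assumes you have already exported one protocol name per packet from each capture (for example, the Protocol column of the packet list). The function names, the 10% threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def protocol_shares(protocols):
    """Fraction of packets per protocol, given one protocol name per packet."""
    counts = Counter(protocols)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {proto: count / total for proto, count in counts.items()}


def share_changes(baseline, current, threshold=0.10):
    """Protocols whose packet share differs from the baseline by at least
    `threshold` (an arbitrary, hypothetical cutoff)."""
    base = protocol_shares(baseline)
    cur = protocol_shares(current)
    changes = {}
    for proto in set(base) | set(cur):
        delta = cur.get(proto, 0.0) - base.get(proto, 0.0)
        if abs(delta) >= threshold:
            changes[proto] = delta
    return changes


# Made-up data standing in for two exported captures.
baseline = ["TCP"] * 70 + ["DNS"] * 20 + ["ARP"] * 10
current = ["TCP"] * 50 + ["DNS"] * 20 + ["ARP"] * 10 + ["LLMNR"] * 20

for proto, delta in sorted(share_changes(baseline, current).items()):
    print(f"{proto}: share changed by {delta:+.0%}")
```

A shift like this does not diagnose anything by itself, but it tells a novice where to start looking in the capture.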
I read a lot, and I mean a lot, of network reliability, availability, and performability papers. The ones we publish in the IEEE Transactions on Reliability, I read multiple times. So there are a lot of relationships that go unnoticed, or at least unacknowledged, yet are important to understand nonetheless. Some of these tricks I outline below mean that some common assumptions in these papers are not limiting at all, while others are limiting only in that they impact scale. Some of these tricks are ones I use myself when I encode network models for analysis so that I don’t have to do a lot of head-spinning algorithm re-development. I’m lazy. I’d rather use something I know works rather than build something new that might work a bit better, but take much more time and effort to get there. If you too are lazy, read on, and save some time and effort overall!
1) A link is a node is a network. It is common for solutions to network reliability problems (including availability and performability, for example) to assume only nodes or links can fail. That presents no problem, as the two are convertible. A link that can fail can be easily converted to a node that can fail by simply placing a node on the link, rendering the link no longer the failure item, as in Fig. 1 below. Likewise, a node that can fail can be converted to a link by reversing that process, as long as the node has only two bidirectional links connecting to it, or two unidirectional links (one in, one out). Handling the more general node case, however, gets more complicated.
Fig. 1. Converting a failable link to a failable node.
a) Because a node that can fail can be modeled as a set of links that fail concurrently (with correlation of unity), simply by failing the connecting links whenever the node fails, allowing correlation is the easiest way to handle it. See Fig. 2 for a good picture of this replacement. Because most analysis approaches (heuristics, Monte Carlo simulations, structure functions) can be easily augmented to handle correlated parts, this method is usually easiest and best. See trick 2) below, as well.
Fig. 2. Replacing a failing node with a set of links that fail concurrently.
b) If links are unidirectional, then place a new node at all incoming links of the old node, and a new node at all outgoing links of the old node, then add a link between the two new nodes, resulting in a single link that fails just like the old node you are replacing. See Fig. 3.
Fig. 3. Replacing a failable node with a failable link, when links are directional.
c) Now it can get complicated (too complicated for a good figure), but extend case b) for each possible combination of a single inbound link to all other links being outbound, replicating each case, and you have the equivalent sub-network that replaces a node with multiple bi-directional links. This approach assures a bidirectional link is not used in both directions at the node, but could allow both directions to be utilized in the overall network, so this approach is limited, and may not address all possible analysis approaches.
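The two simplest conversions, link to node (Fig. 1) and node to link with unidirectional links (case b), can be sketched on a plain edge list. This is a minimal illustration, not the encoding used in any particular paper. The graph representation (a list of directed `(from, to)` pairs), the function names, and the `(node, "in")` / `(node, "out")` labels are assumptions made here.

```python
def link_to_node(edges, link, new_node):
    """Replace a failable link (u, v) with u -> new_node -> v.
    The failure is then carried by new_node instead of the link."""
    u, v = link
    kept = [e for e in edges if e != link]
    return kept + [(u, new_node), (new_node, v)]


def node_to_link(edges, node):
    """Case b): split a failable node into an 'in' node and an 'out' node
    joined by one new link. That link then carries the node's failure."""
    node_in, node_out = (node, "in"), (node, "out")
    rewired = [
        (node_out if u == node else u, node_in if v == node else v)
        for u, v in edges
    ]
    new_link = (node_in, node_out)
    return rewired + [new_link], new_link


def reachable(edges, source, target):
    """Simple graph search over directed edges."""
    frontier, seen = [source], {source}
    while frontier:
        here = frontier.pop()
        if here == target:
            return True
        for u, v in edges:
            if u == here and v not in seen:
                seen.add(v)
                frontier.append(v)
    return False


# Node "a" sits on the only path from "s" to "t".
split, a_link = node_to_link([("s", "a"), ("a", "t")], "a")
print(reachable(split, "s", "t"))                               # path exists while the new link is up
print(reachable([e for e in split if e != a_link], "s", "t"))   # failing the link acts like failing node "a"
```

Failing the single new link cuts the same paths that failing the original node would, which is why a links-only algorithm can be reused unchanged.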
2) Bidirectional is equivalent to parallel unidirectional, with correlation. Take a single bidirectional link, split it into two parallel, unidirectional links flowing in opposite directions, and force the two new links to only fail concurrently (unity correlation). The new configuration is equivalent to the original. But cycles are still possible, so must be handled by the algorithm you use.
3) Complex correlations can be handled with correlated and uncorrelated parts. Links or nodes that only fail as a group are fully correlated, while most networks are composed of parts that are independent (or assumed to be, at least). Between these two situations of completely correlated to uncorrelated, we have partially correlated links or nodes. In static networks (where the reliabilities or availabilities or performabilities do not change), it is sufficient to split the part into fully correlated and uncorrelated parts, and assign the probabilities appropriately. Positive correlation is handled by a fully correlated part in series with each uncorrelated part, whereas negative correlation is handled by a fully correlated part in parallel with each uncorrelated part. The former case is shown in Fig. 4, and the latter case is similar.
Fig. 4. Top: replacing a link with one that fails correlated (blue) with another, and a link that is independent (orange). Bottom: replacing a node with correlated and uncorrelated nodes.
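The positive-correlation split in trick 3 can be checked with a small calculation. The sketch below assumes two links that each have availability `a`, and models their correlation as one shared part with availability `c` in series with an independent part for each link. The symbols `a`, `c`, and `u`, and the function names, are introduced here for illustration and do not come from the original figures.

```python
def independent_part(a, c):
    """Availability u of each link's independent part, chosen so that the
    shared part (c) in series with it reproduces the link availability a."""
    if not (0.0 < a <= c <= 1.0):
        raise ValueError("need 0 < a <= c <= 1")
    return a / c


def parallel_pair_availability(a, c):
    """Availability of two positively correlated links used in parallel:
    the shared part must be up, and at least one independent part must be up."""
    u = independent_part(a, c)
    return c * (1.0 - (1.0 - u) ** 2)


a = 0.9
print(parallel_pair_availability(a, 1.0))   # c = 1: no shared part, links independent
print(parallel_pair_availability(a, 0.95))  # partial positive correlation
print(parallel_pair_availability(a, a))     # c = a: links fail only together
```

Each link keeps its original availability in every case. Only the joint behavior changes, sliding from the independent result toward that of a single link as the shared part takes on more of the failures.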
See the network in Figure 1 below. Source node 1 fails, which from a pure availability perspective disrupts only traffic originating or terminating at node 1 because all other nodes can still communicate. So a coherent network (as is the one in the figure) continues to be coherent after this failure, as it must due to the definition of coherent.
But now consider performance. Say the traffic from node 1 (for our example, between nodes 1 and 4, but not necessarily) was significant enough to constrain the traffic traversing the network between nodes 2 and 3. For example, the node in the center of the figure may be insufficient for the demand on the network. Traffic between nodes 2 and 3 cannot be adequately carried when all resources in the network are working. From their perspective, the service fails when the network is fully working. But when node 1 fails, suddenly resources are free, and the network can now carry all the traffic between nodes 2 and 3. When the network fails in this way, the network suddenly works from the perspective of nodes 2 and 3.
This situation demonstrates incoherent behavior in a network when capacity constraints are considered. The example network is simple, but clearly extends to the more complex, realistic case. This case, which we may all reference, resulted from a conversation with a colleague where we concluded that this type of incoherent behavior was often forgotten.
Fig. 1. A simple network to illustrate incoherent behavior under capacity constraints.
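A toy calculation makes the incoherence concrete. The numbers below are invented for illustration (the post gives none), and the model is deliberately crude: all traffic crosses the center node, and the service between nodes 2 and 3 counts as working only if the total offered load fits within the center node's capacity.

```python
def service_2_3_works(node1_up, center_capacity=10, demand_1_4=6, demand_2_3=8):
    """True if the traffic between nodes 2 and 3 can be carried.
    All capacities and demands are hypothetical example values."""
    load = demand_2_3 + (demand_1_4 if node1_up else 0)
    return load <= center_capacity


print("node 1 up:    ", service_2_3_works(node1_up=True))
print("node 1 failed:", service_2_3_works(node1_up=False))
```

In a coherent system, repairing a component can never make the service worse. Here the service works with node 1 failed and stops working once node 1 is restored, so the structure function is not monotone.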
NOTE: Special thanks to Troy Houston from SevOne for pointing out an error in an earlier version of this post.
Checklists are a very effective way to improve a manual process. When an enterprise realizes they have a resource that should be better managed, they consider capacity management, in its general form if not specifically in the context of IT. The first step is to determine your best answers to the following questions, being sure to estimate the reliability of your answers as well.
- What is the resource that should be managed? Clearly identify the resource that needs to be managed, as crisply as possible. Define any boundaries to connected and related resources. If it is a pool of servers, how many, where are they located, which are included or excluded from the pool, and what resources on the servers are important enough to be managed?
- Who owns that resource and its capacity? For the sake of action and accountability, identify the owner or owning organization responsible for the capacity of the resources. It is key to identify the owner, and that owner must be able to control the resource’s capacities. If this responsibility is somehow shared, then either divide the resource into two pools to avoid the conflict, or clearly define the nature of this shared responsibility so that there is no ambiguity for who must do what.
- How do they know how much capacity they own? If you can simply go and count those resources, maybe that is enough. But most resources in need of management will need some sort of inventory management capability.
- How do they know how much capacity they need? If change is rapid, then the information must be even more rapid. Rapid data collection usually requires automated data collection. And in any case the information must be reliable, but should otherwise be as inexpensive as can be while still meeting the requirements.
- How do they know how much capacity they will need? Either an analyst will need to do this work, or perhaps share the effort with an analytics engine. The reliability of the demand forecast is critical. Low reliability of the forecast leads to larger amounts of spare capacity, so we have a cost tradeoff to consider.
- How does the organization add or subtract capacity? Processes must be known, and effective, for changing capacity. Procurement processes play a key role usually, so supply chain management becomes important.
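The cost tradeoff between forecast reliability and spare capacity can be framed in several ways. One textbook framing, used here as an assumption and not something the checklist prescribes, is the newsvendor model: choose the service level from the ratio of the cost of being short to the cost of holding idle capacity, then size the spare capacity from the forecast error. The normal forecast error, the function name, and the cost figures are all hypothetical.

```python
from statistics import NormalDist


def spare_capacity(forecast_sd, cost_short, cost_spare):
    """Spare capacity to hold above the mean forecast.

    forecast_sd: standard deviation of the forecast error (same units as capacity)
    cost_short:  cost per unit of demand that cannot be served
    cost_spare:  cost per unit of capacity that sits idle
    """
    critical_ratio = cost_short / (cost_short + cost_spare)
    z = NormalDist().inv_cdf(critical_ratio)
    return z * forecast_sd


# Same costs, two forecasts of different reliability.
print(f"reliable forecast (sd=10):   {spare_capacity(10, 9, 1):.1f} units spare")
print(f"unreliable forecast (sd=20): {spare_capacity(20, 9, 1):.1f} units spare")
```

When a shortage costs more than idle capacity, doubling the forecast error doubles the spare capacity in this model, which is one way to put a price on a better forecast.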
The first step of any capacity management project is to answer those questions above, then to check how adequate the answers are, such as described by the questions below. If the answers are not adequate, then it is time to consider increasing the enterprise’s capability maturity.
- Are the resource and responsibilities ill-defined? Then some process improvement work may be in order to clarify the roles and responsibilities.
- Is the monitoring insufficient? Then perhaps it is time to invest in tools, or improve the tools you have, or improve the use of the tools you have.
- Is the information inadequate? If there is information missing or incorrect, then work on filling those gaps. IT systems experts, systems engineers, or data analysts may need to complete the effort here.
- Is the funding insufficient? Balance budgets to be service oriented, or consider adding pressure for technology improvements to reduce costs.
- Is the procurement process insufficient? Increase the spare capacity in the supply chain to ease the pain, but look at supply chain optimization to rebalance the solution effectively and efficiently.
- Do the teams lack coordination? Much like the first problem in this list, perhaps we need more clarity of roles and responsibilities, but deeper in the organization. Clarify and get agreement on responsibilities, accountabilities, controls (establish or improve your RASIC or RACI), and operational level agreements (OLA) so there is no lack of clarity.
There is a lot more to capacity management than just getting a good tool to provide nice graphs. Much more. There is a responsibility that comes with ownership, and tools and techniques and agreements do not remove that responsibility. They can only help you be a more effective owner.
When comparing architectures for availability and reliability, it is often difficult to comprehend the small differences in the estimates.
It is tempting to compare in terms of downtime, but doing that is a slippery slope; people like to think of minutes per year as an actual outcome, almost too real for reality. Remember there is no such thing as an average; there is also no such outcome, no prediction that is as accurate as people like it to be. But that doesn’t stop the decision makers from thinking as though the estimate is a tangible number.
A couple tricks I like to use are Data Bars in Excel, and a log conversion that compares the “number of nines.”
Under Conditional Formatting in Excel, the Data Bars capability is a very simple way of presenting a grid of numbers, such as when plotting the effect of two different variables in a system design, or combinations of connections in a network design. Figure 1 is an example of the availability of node to node connections in a network, in pairs. Each row is a particular node pair, and each column is as well, so the diagonal is the availability of a node to node connection, and the off diagonal is the availability of the pair of node to node connections. Notice that clusters and patterns are easier to see, such as the higher availability of the diagonal as expected, or the relatively high availability cluster at the bottom right for nodes 3, 4, 5.
Figure 1. Node to node availability grid for an arbitrary network.
When decision makers are comfortable with thinking in terms of the number of nines for comparison, then a log scale can be helpful, as it makes the smaller differences easier to see against the larger differences. But that can lead to unfair comparisons on a graph, where a small difference is made to look as large as a much larger difference. So I like to use this type of comparison against a standard or target, say “five nines” or 0.99999 availability for example. Table I below gives a few of the common conversions. Note that “four and a half nines” is not 0.99995, but rather about 0.9999684. Precisely, the “number of nines” is calculated as –log10(unavailability) in the table.
Table I. Conversions of availability, downtime, and fair weighting of nines.
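The conversions are simple to reproduce. The sketch below is a minimal helper, not the spreadsheet behind the table: the function names are made up here, and downtime is computed against a 365-day year, which is an assumption.

```python
from math import log10

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def nines(availability):
    """Number of nines: -log10(unavailability). Undefined for availability == 1."""
    return -log10(1.0 - availability)


def availability_from_nines(n):
    """Inverse conversion: availability for a (possibly fractional) number of nines."""
    return 1.0 - 10.0 ** (-n)


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for a in (0.999, 0.9999, 0.99995, 0.99999):
    print(f"{a}: {nines(a):.2f} nines, {downtime_minutes_per_year(a):.1f} min/yr")

print(f"four and a half nines = {availability_from_nines(4.5):.7f}")
```

Note how 0.99995 lands at about 4.3 nines instead of 4.5, which is the nonlinearity the text warns about.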
These methods have different uses, and are not for every use either. The first method of Data Bars is broadly useful for comparing large numbers of data visually. And as long as the numbers are fair, the weighting provided by that method will be, and so the comparison should be useful and fair. Edward Tufte might approve (http://www.edwardtufte.com/tufte/). But the second method of log conversion to “the number of nines” can be misleading in comparisons of a large number of results as it is nonlinear. So I reserve its use for comparisons against targets and goals for a design.
Capacity Management (CM) from a telecommunications and IT perspective has a fairly broad meaning, sometimes overlapping some related terms. While understandable, it is unfortunate because of the confusion it causes. But no worries, as we can sort it out pretty well right here.
Capacity Management often refers to a larger context than just the capacity of physical entities, such as CPU, memory, ports, or even routers and switches. This broad reference is both understandable and useful. It’s understandable because, once you identify an entity such as software, a service, or a network resource as critical, even when that entity is not physical, you create the need to manage its capacity. So related terms such as Network Performance Monitoring (NPM), Network Capacity Management (NCM), and even Application Performance Monitoring/Management (APM), all can be considered as subsets of CM. And reasonably so. In fact, ITIL takes this approach: http://www.itlibrary.org/index.php?page=Capacity_Management.
As a result, it can be convenient to reference the entire set as all CM. But this becomes inconvenient when we need to relate to the management of just physical entities. The simplest way out of this is to just be clear, and rely on the context.
If that is not sufficient, then let’s just agree to use CM for the specific idea of the management of physical entities, which may include software if it is isolated to a physical location (not spread around as in applications and services, to be described below).
Now we can let NPM refer to the management of network performance, which can span across elements falling under CM. Likewise, NCM can refer to the management of the capacity of these network resources, which span across elements falling under CM.
Beyond that, APM can refer to the performance monitoring of applications and services spanning across the network(s) and utilizing many different CM elements.
A term I’ve not seen, which could be convenient to add, would be Application Capacity Management, which can refer to the act of managing the capacity of applications, as you should expect. I wouldn’t mind coining a couple more obvious terms: Network Reliability Management, and Application Reliability Management, for obvious reasons.
Now to achieve success with APM and NPM, often Analytics are leveraged, and this is an emerging area as well. While there are tools in existence today that do a great job at finding causes of network and application problems before a person even has a chance to investigate, many more are being created that take it to the next level. And outside the IT and Telecommunications arenas, we have the developing engineering space of Prognostics and Health Management (PHM). PHM is all about utilizing telemetry about a component or system to estimate the risk of failure. Because lack of capacity is a form of system failure, there really is no difference in the concepts of PHM and CM or NPM or APM. So while the various camps of engineering develop within their focused areas, we shall see eventual cross pollination which can lead to exceptional abilities in the IT and Telecommunications network and system analytics arena.
The real engine behind technology advancement is reliability. Innovators would benefit from focusing on reliability.
Sure, we always strive to do better for various reasons: competition, market share, profit, saving lives. If necessity is the mother of invention, perhaps laziness is its father. But neither would succeed without the work to make things more reliable.
Innovative solutions can’t be implemented widely until they are made reliable. Technology advances from not being able to do a thing, to doing it almost by accident, to figuring out how to do it, to doing it more efficiently (with less waste, which requires higher quality, which is more reliable), to doing it cheaply enough and reliably enough to scale the solution. To scale a technology or innovation is to develop it. That development engine is the work to make it reliable. So anyone working to develop anything is tasked to make things more reliable.
If you want to be successful at development, innovation, or leveraging technology in any way, you may benefit from paying close attention to reliability concerns. It could be the best measure of effectiveness you could use for decision making.
Indeed reliability is the driver behind development, so we might be tempted to refer to it as research and reliability, but then we can’t refer to it as R&R now can we?
Once you have the Application Performance Monitoring (APM) tool in place, how do you use it effectively? The vendor sold you a tool, convinced you it would make your days so much easier, and now you have even more equipment to deal with. How is that better? Sure you traded woes, but if you use that tool effectively, and it was a good choice, then you should be on your way to trading up.
Hopefully you’ve been trained in how to use the tool, and perhaps these steps will be familiar from that training. Or, if you’re still testing APM tools, here are some steps you can follow to see if the tool does what it should.
- Be sure to pick the right measure of performance for your application. And you picked the right application to start with too, right? Assuming you did, the performance measurements should align strongly with what the customer believes is important. Surely, downtime will be important to them. Maybe latency is as well, but how much latency is tolerable? How about jitter? Jot that and the other measures down, and make sure they combine to be a sufficient set of measures to assure performance. Are there SLAs in place? If so, what are they, and are you measuring performance against those SLA measures?
- For each of these measures, establish a baseline. Determine typical network performance. Determine the level of performance for your measures when the application is performing as intended, and to a sufficient level of quality to meet SLAs. Keep in mind also that you should establish a baseline of performance that accounts for special events, time of day, day of week, and other foreseeable or predictable factors. Too short of a baseline will miss longer trends. Too long of a baseline might include explainable causes that should be excluded. Do a good job of finding a truly representative baseline for each measure.
- Construct your forecasts. Many good tools will produce a basic trend to get you started, but don’t rely purely on the default methods. Some statistical modeling off line, perhaps outside the APM tool, may be necessary. Certainly some validation is occasionally necessary, and always prudent. You may need to put in some serious modeling work to form a good forecast of your performance measures, when you consider such issues as planned growths and grooms, customer churn, service transitions, and other factors, some known and others not.
- Determine the SLA you need to maintain, and the network or service points where the performance bottlenecks take place. And when you determine the service levels to target, make sure you determine the costs of missing the SLA, and the risks of doing so. To understand these tradeoffs well, you may need to create another model of cost, risk, and service performance. If your SLA is not directly measured by your APM tool, then you will need a translation model in order to proceed.
- Determine the lead time. If you risk breaking an SLA, how long does it take to address the cause of the missed performance level? If the mitigation is network augments, the procurement process will engage, so the sparing policies become relevant, as does the entire supply chain. How much variability is in this lead time as well, and how much variability in the measure such that the SLA may or may not be broken? Variability translates to risk of missing the SLA, so variability is the risk of cost. There is a cost associated with addressing a cause, and a cost associated with the risk, so we have costs to trade off. We have to understand the lead time distribution, or at least something that helps us understand the risks involved. Fast deterioration of performance means we need an earlier trigger. But the deterioration may vary, so we may not want to act very much earlier.
- Determine this trigger point carefully. The SLA affects the trigger point. But so do the variability of the lead time, the measure, and the forecast. And so do all the costs and risks involved: the cost of correcting the capacity, the cost of missing the SLA, and the risk of missing an SLA before the correction is complete, in a very simple model. Every time we update these factors we have to update our trigger accordingly. Most good APM tools will let you set a good trigger point based on a percentile, which is very convenient if available. A good risk and cost tradeoff model will balance the percentile against the SLA to provide the balance of costs, so a trigger based on percentile of a performance measure is appropriate.
- Determine the related triggers as well. There are at least two types. The first type tells you something has changed about your assumptions (risk, cost, variability), so you must revisit your trigger points. The second type is a potential prognostic health monitoring (PHM) measurement that might serve as a secondary measure of effectiveness from which to trigger. A prognostic indicator may be better than the main performance measure if its reliability is better (lower variability, lower risk).
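To show how the lead time, the forecast, and the variability combine into a trigger point, here is a deliberately simplified sketch. It is not how any particular APM tool computes a trigger. It assumes a linear growth forecast, a normally distributed lead time, normally distributed noise in the measure, and it applies the same percentile to both sources of variability separately, which is conservative. All names and numbers are hypothetical.

```python
from statistics import NormalDist


def trigger_level(sla_limit, growth_per_day, lead_mean_days, lead_sd_days,
                  noise_sd, risk):
    """Level of the performance measure at which to start a correction.

    sla_limit:       value of the measure at which the SLA is broken
    growth_per_day:  forecast growth of the measure per day (linear trend)
    lead_mean_days:  mean time needed to complete the correction
    lead_sd_days:    standard deviation of that lead time
    noise_sd:        standard deviation of the measure around its trend
    risk:            acceptable chance of breaking the SLA before the fix lands
    """
    z = NormalDist().inv_cdf(1.0 - risk)
    pessimistic_lead = lead_mean_days + z * lead_sd_days
    growth_during_lead = growth_per_day * pessimistic_lead
    noise_margin = z * noise_sd
    return sla_limit - growth_during_lead - noise_margin


# Hypothetical case: utilization SLA limit of 80%, growing 0.2 points per day,
# corrections take 30 days on average with a 10-day standard deviation.
base = trigger_level(80.0, 0.2, 30.0, 10.0, noise_sd=2.0, risk=0.05)
print(f"start the correction when the measure reaches about {base:.1f}")
```

Any change to the inputs (a longer or more variable lead time, a steeper forecast, a lower acceptable risk) moves the trigger earlier, so the trigger has to be recomputed whenever those assumptions are revisited.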