Understand the problem
Example – An enterprise on an existing network wants to turn up a new service, but is not sure whether the network can support it at the desired Service Level Agreement (SLA), or what edge configuration will be required to meet the availability and performance requirements of the SLA.
Generalities – Traceability of requirements is a key guide at this point. At times, requirements are not as solid as they seem, or are not understood as fully as they seem. A good modeler will understand the problem in sufficient detail to know what kinds of measures of performance and effectiveness are needed, and how detailed the model must be.
Gather the relevant information
Example – The engineering team collects architectures, configurations, and specific performance figures for the equipment in the network, under the software versions that exist. Due to the nature of the network and service offering, a worst-case example and an overall network estimate are in order. So the team collects failure rates and network configuration information to form an inventory of the available information about the network. In some cases, the team must make reasonable assumptions. By collecting goodput estimates from the Network Management System (NMS), and modeling the goodput based on circuit length, they create a reasonable statistical model for performance and utilization of the circuits needed for the service. A simple nonlinear regression using a commercial off-the-shelf (COTS) statistics package is sufficient to form the model. By collecting outage data from the NMS and the ticketing system, in combination with Telcordia part count methods, they create a statistical model of failure rates for components, and of repair rates under various conditions. These new statistical estimates of failure rates are then carried into the later modeling steps.
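As a rough illustration of the outage-data step, the sketch below turns a list of outage records into a failure rate, a repair rate, and a steady-state availability for one component type. The record layout, the function name estimate_rates, and all numbers are hypothetical and not taken from any NMS or ticketing system; it also assumes constant failure and repair rates, which a real analysis would need to check.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    component: str       # hypothetical component identifier
    repair_hours: float  # time from failure to restoration


def estimate_rates(outages, unit_count, observation_hours):
    """Return (failure_rate, repair_rate, availability) for one component type.

    failure_rate: failures per unit-hour of operation
    repair_rate:  repairs per hour of downtime (1 / MTTR)
    """
    if not outages:
        raise ValueError("no outages observed; fall back to a part count estimate")
    total_downtime = sum(o.repair_hours for o in outages)
    uptime_hours = unit_count * observation_hours - total_downtime
    failure_rate = len(outages) / uptime_hours
    repair_rate = len(outages) / total_downtime
    # Steady-state availability of a single repairable unit.
    availability = repair_rate / (failure_rate + repair_rate)
    return failure_rate, repair_rate, availability


# Illustrative numbers only: 50 line cards observed for one year (8760 h).
records = [Outage("line-card", h) for h in (4.0, 6.0, 2.0, 8.0)]
lam, mu, avail = estimate_rates(records, unit_count=50, observation_hours=8760.0)
print(f"failure rate = {lam:.3e}/h, MTTR = {1 / mu:.1f} h, availability = {avail:.6f}")
```

With few observed outages the estimates are noisy, which is one reason to combine field data with part count predictions as the example describes.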
Generalities – It is important to understand the equipment in the network in reasonable detail to meet the needs of the problem. Sometimes it is sufficient to have Telcordia part count Method 1 estimates, but at other times specific field failure and repair information is more important. Two architectures may be compared before either is built by using Telcordia part count Method 1, but analysis of sparing plans requires specific field data. Modeling tools like Opnet can provide performance estimates in some cases, but can also provide good data from the network, as can tools from CA, SolarWinds, and others.
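To make the part count idea concrete, here is a deliberately simplified sketch: it sums per-part failure rates, expressed in FITs (failures per billion device-hours), over a bill of materials and converts the total to an MTBF. The full Telcordia procedure applies per-part adjustment factors that are collapsed here into a single hypothetical environment_factor, and the two bills of materials are invented for illustration.

```python
HOURS_PER_BILLION = 1e9  # one FIT is one failure per 1e9 device-hours


def parts_count_fits(parts, environment_factor=1.0):
    """Sum per-part failure rates (in FITs) for a series system.

    parts: iterable of (quantity, fits_per_part) pairs.
    environment_factor: hypothetical single multiplier standing in for
    the adjustment factors of a full prediction procedure.
    """
    return environment_factor * sum(qty * fits for qty, fits in parts)


def mtbf_hours(total_fits):
    """Convert a failure rate in FITs to mean time between failures in hours."""
    return HOURS_PER_BILLION / total_fits


# Hypothetical bills of materials for two candidate edge designs.
design_a = [(2, 1500.0), (1, 4000.0), (4, 250.0)]
design_b = [(1, 2500.0), (1, 4000.0), (2, 250.0)]

for name, design in (("A", design_a), ("B", design_b)):
    fits = parts_count_fits(design)
    print(f"design {name}: {fits:.0f} FITs, MTBF = {mtbf_hours(fits):.0f} h")
```

A comparison like this says which design is predicted to fail less often; it says nothing about repair times, which is why sparing analysis still needs field data.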
Dovetail the needs with the information available
Example – After some assessment, the team finds there are field failure data for estimating both failure rates and repair rates, plus the original part count method results. The team, having access to the necessary details, works with the equipment vendors to create updated failure rate estimates and improved repair rate estimates. Further, they work with the equipment vendors to validate their performance estimates and reliability models of the equipment for the circuits they intend to build through the equipment. The best way to validate an understanding of network behavior is by presenting the modeling structure and assumptions to the experts, who usually reside within the engineering teams of the equipment providers. Block diagrams of all kinds are usually in order, but circuit diagrams and call scripts may be useful too.
Generalities – Here is where form follows function as we design the right modeling solution. At times, a COTS solution is best because it directly answers the question, or a part of the question, and is already populated with the information required. But most of the time at least some modeling work is necessary that COTS tools are not ready to provide. Therefore, it is good to have some general purpose models at the ready that can be adjusted and reused as needed. Further, this is where we determine whether a simple spreadsheet model is sufficient, a network simulation must be designed, or an algebraic or stochastic model should be formulated.
Create a baseline model, validate and verify
Example – The modeling team iterates with the engineering experts to create a baseline model. They create a worst-case availability estimate, and a network performance estimate as well. They create a small spreadsheet model of the simple case for ballpark estimation, by building simple circuit estimates and edge configuration estimates. The first look suggests that some redundancy is in order for services resembling the worst-case structure. In addition, a 1G circuit may not have enough goodput to provide the service as hoped, so a trunk group of two 1G links is suggested. The engineering team agrees it will work, but the cost will increase, so there is some concern. The modelers create a failure and performance model using general purpose software, leveraging their shortest path solver and a hybrid state enumeration and Monte Carlo simulation engine to analyze an encoded state model for the network. By looking at the overall network model, they find that the baseline model with a single 1G circuit should work fine most of the time. These results suggest some alternative models to try, leading to a new architecture and some adjustments to the baseline network. The engineering team updates the business case based on the baseline model and baseline network, verifying that the architecture must be adjusted to meet the needs of all potential customers.
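The state enumeration part of such an engine can be sketched in a few lines. The code below is not the modelers' tool: it enumerates every up/down combination of a small set of links, assumes links fail independently, and uses a plain connectivity search where the example mentions a shortest path solver. Node names and availabilities are hypothetical.

```python
from itertools import product


def connected(up_links, src, dst):
    """Depth-first search over the links that are currently up."""
    adjacency = {}
    for a, b in up_links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def availability_by_enumeration(links, src, dst):
    """links: list of (node_a, node_b, availability), assumed independent.

    Sums the probability of every link state in which src reaches dst.
    """
    total = 0.0
    for state in product((True, False), repeat=len(links)):
        prob = 1.0
        up = []
        for (a, b, avail), is_up in zip(links, state):
            prob *= avail if is_up else 1.0 - avail
            if is_up:
                up.append((a, b))
        if connected(up, src, dst):
            total += prob
    return total


# Hypothetical worst-case structures: one access link versus a redundant pair.
single = [("edge", "core", 0.99)]
pair = [("edge", "core", 0.99), ("edge", "core", 0.99)]
print(availability_by_enumeration(single, "edge", "core"))
print(availability_by_enumeration(pair, "edge", "core"))
```

Full enumeration visits 2^n states for n links, so it only suits small structures; for a whole network, sampling states by Monte Carlo, as in the hybrid engine the example mentions, is the usual way to keep the run time manageable.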
Generalities – It is usually useful to build simple models for comparison with the more complicated one and with existing knowledge, and then to solve for known results when available. This approach helps to validate the more complicated model. Then the more complicated model can be built to match the baseline, and validated and verified to the best degree possible. Once created, the baseline can provide some early results that may not otherwise be available. For example, the expected overall network availability can be calculated, or at least estimated, better than it can be measured directly from the network.
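One simple form of this check is to run the simulation on a case whose answer is known in closed form. The sketch below compares a Monte Carlo estimate for n independent parallel links against 1 - (1 - a)^n; the function names and the availability value are illustrative assumptions.

```python
import random


def closed_form_parallel(a, n):
    """Availability of n independent parallel links, each with availability a."""
    return 1.0 - (1.0 - a) ** n


def monte_carlo_parallel(a, n, trials, seed=1):
    """Estimate the same quantity by sampling link states."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    up_trials = 0
    for _ in range(trials):
        # The group is up if at least one of its links is up in this sample.
        if any(rng.random() < a for _ in range(n)):
            up_trials += 1
    return up_trials / trials


exact = closed_form_parallel(0.9, 2)
estimate = monte_carlo_parallel(0.9, 2, trials=200_000)
print(f"closed form = {exact:.4f}, simulated = {estimate:.4f}")
```

If the two numbers disagree by more than sampling error allows, the simulation logic is suspect before it is ever applied to the full network.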
Create the alternative model or models
Example – A few redundant architectures are built, and a couple of options for different edge devices are modeled as well. The throughput of the inexpensive edge device is not sufficient on its own, but by providing redundant edge devices the service is highly reliable and meets goodput needs. However, the repair issues with two remotely deployed edge devices cause the costs to be too high, so an alternate edge device is chosen, with redundant links to the core network. The model shows that, by managing over-subscription carefully, the business case will work well, the network can scale, and the end-to-end availability works. To determine this, the modelers take the baseline, adjust the nodes and links at the edges of the network to represent the new edge devices, update the failure rate and repair information, and rerun the model for comparison with the baseline. Further, because the engineers are concerned about how the new edge equipment will protect against failures, the modelers create a stochastic reliability model that handles complex states of the edge equipment during protection and high utilization.
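A stochastic reliability model of this kind is often built as a Markov chain over equipment states. The sketch below is far simpler than the model described: it assumes a redundant pair of identical devices, constant failure and repair rates, and a single repair crew, and it solves the resulting three-state birth-death chain for its steady-state probabilities. All names and rates are hypothetical.

```python
def birth_death_steady_state(failure_rates, repair_rates):
    """Steady-state probabilities of a birth-death chain.

    failure_rates[i]: rate from state i to state i + 1 (one more unit failed)
    repair_rates[i]:  rate from state i + 1 back to state i
    """
    weights = [1.0]
    for fail, repair in zip(failure_rates, repair_rates):
        # Detailed balance: pi[i + 1] * repair = pi[i] * fail.
        weights.append(weights[-1] * fail / repair)
    total = sum(weights)
    return [w / total for w in weights]


def redundant_pair_availability(failure_rate, repair_rate):
    """Service is up while at least one of the two devices is working."""
    probs = birth_death_steady_state(
        [2 * failure_rate, failure_rate],  # both devices exposed, then one
        [repair_rate, repair_rate],        # single crew repairs one at a time
    )
    return probs[0] + probs[1]


# Illustrative rates: one failure per 50,000 h, 24 h mean time to repair.
print(redundant_pair_availability(1 / 50_000, 1 / 24))
```

Protection switching failures or utilization-dependent states would add more states and transitions, but the same balance-equation approach still applies.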
Generalities – The alternatives explored here may be for different network states, possible specific outcomes of concern, sensitivity analysis of the baseline, alternative architectures as in the example, etc. Because of the vast range of analyses necessary to support network architecture, planning, and engineering decisions, a flexible network modeling capability is necessary.
Analyze results, cycle back if needed
Example – The modelers work closely with the engineers to make small changes and rerun the models a few times, validating and verifying that the recommended architectures are low in cost, high in reliability and availability, and more than meet the performance needs. The results are expressed as cost models, goodput and performance results, and availability and reliability estimates for the selected architectures. Because the results are in the form of numerical results, spreadsheets, and block diagrams of several architectures, the team has much work yet to do to form the final report.
Generalities – By working in tight rapid cycles with the engineering experts, the modelers can bring results, work through adjustments, and come up with solid recommendations very fast, allowing for reduced overall risk.
Form results for presentation and decision support
Example – The engineers and modelers work together to trim down the results into a complete view that can be understood easily by leadership and decision makers. They form an architecture document and a supporting presentation. To aid in understanding, they use the simple model and functional block diagrams to explain the worst-case issues, presenting the performance and availability results in a table comparing the options, making it clear why the selected architectures are recommended. The business case analysis is complete, with tornado diagrams and Monte Carlo sensitivity analysis of the parameters of concern. The more detailed network model results are provided in graphical form, with statistics describing the range of results to be expected in the network, explaining how the mix of architectures will meet the needs of the customer base for the target SLAs. The new edge device is selected, and for risk reduction the engineers take the model results as requirements to begin searching for a second provider.
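The numbers behind a tornado diagram come from a one-at-a-time sensitivity sweep: each parameter is moved to its low and high value while the others stay at baseline, and the parameters are ranked by the swing they produce. The sketch below shows that calculation on a toy business case; the annual_margin function, parameter names, and ranges are all invented for illustration.

```python
def tornado_swings(model, baseline, ranges):
    """Return [(name, low_result, high_result)] sorted by widest swing first."""
    rows = []
    for name, (low, high) in ranges.items():
        # Vary one parameter at a time; all others stay at baseline.
        low_result = model(**{**baseline, name: low})
        high_result = model(**{**baseline, name: high})
        rows.append((name, low_result, high_result))
    rows.sort(key=lambda row: abs(row[2] - row[1]), reverse=True)
    return rows


def annual_margin(subscribers, monthly_price, monthly_cost):
    """Hypothetical toy business case, not the one in the example."""
    return 12 * subscribers * (monthly_price - monthly_cost)


baseline = {"subscribers": 100, "monthly_price": 500.0, "monthly_cost": 300.0}
ranges = {
    "subscribers": (80, 120),
    "monthly_price": (450.0, 550.0),
    "monthly_cost": (280.0, 330.0),
}

for name, low_result, high_result in tornado_swings(annual_margin, baseline, ranges):
    print(f"{name:15s} {low_result:>10.0f} {high_result:>10.0f}")
```

Plotting each row as a horizontal bar from the low result to the high result, widest at the top, gives the tornado shape.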
Generalities – Because so much information must be understood in a network analysis, we find it important to present the statistics in table form, in conjunction with network diagrams that associate the statistics to the physical network whenever possible. When presenting multiple architectures for availability, we often rely on box plots of the architectures placed on the same plot for easy comparison, plus the estimates or calculated results for the availability. To provide a good sense as to the amount of traffic and how it is carried on the network, we often use 3D network plots that show multiple layers of traffic. Because it is important to understand architectures well, we rely on block diagrams often. We can combine these and other display options to present a complete solution that meets the risk to reward tradeoff needed.
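The statistics behind a box plot can also go straight into a comparison table. As a small sketch, the function below computes the five-number summary for a set of simulated results using Python's statistics.quantiles with its default method; the sample values are placeholders, not model output.

```python
import statistics


def five_number_summary(samples):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return min(samples), q1, median, q3, max(samples)


# Placeholder values standing in for per-run results of one architecture.
placeholder_runs = [1, 2, 3, 4, 5, 6, 7]
print(five_number_summary(placeholder_runs))
```

Computing one such row per architecture gives the table form, and the same five values are what a box plot draws for each architecture when they are placed side by side.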