I have a cold. Did I fail?

When a person gets ill, say from a cold like the one I have right now, we tend to slow down, have trouble focusing, or otherwise struggle to perform certain tasks. We still function, but not as well as we would like, and perhaps not as well as usual. This is clearly a degraded state. But that doesn’t mean we failed, in the usual binary sense of success or failure, functioning or failed.

Without systems to tell us to slow down because we aren’t feeling well, we would run the risk of pushing ourselves beyond our limits, perhaps to the point of actual physical failure (hospitalized, or bedridden), and perhaps to a state from which we cannot recover (death). But we have those feedback systems to tell us when to reduce the stress, take time to heal, and enter a repair state.

Prognostics and Health Management (PHM) is the engineering field working to develop these feedback systems for engineered systems. In some cases, repair requires replacement, so the indicators suggest a replacement instead of a repair. Either way, these systems indicate when an engineering system is ill in some sense, operating in a degraded state. Telecommunications and IT systems have had this advantage for a very long time, in many ways and across many parts of those complex systems, but not every engineering system could be instrumented this way cost-effectively. With ubiquitous communications and nanotechnologies leading to inexpensive early warning sensors, more engineering systems can take advantage of such solutions, leading to more reliable engineering systems.
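To make the idea concrete, here is a minimal sketch of the kind of early-warning logic a PHM system implements: track a degradation signal, project its trend forward, and raise an alert before it crosses a failure threshold. The signal, threshold, and lead time below are invented for illustration.

```python
import numpy as np

# Hypothetical degradation signal, e.g. a bearing vibration level sampled daily.
# In a real PHM system this would come from a sensor feed.
signal = np.array([1.0, 1.1, 1.3, 1.4, 1.7, 1.9, 2.2, 2.5])
FAILURE_THRESHOLD = 5.0   # assumed level at which the component is considered failed
WARN_LEAD_TIME = 30       # days of warning we want before predicted failure

# Fit a simple linear trend to the recent history and extrapolate.
days = np.arange(len(signal))
slope, intercept = np.polyfit(days, signal, 1)

if slope > 0:
    days_to_failure = (FAILURE_THRESHOLD - signal[-1]) / slope
    if days_to_failure < WARN_LEAD_TIME:
        print(f"PHM alert: predicted failure in ~{days_to_failure:.0f} days; schedule repair or replacement")
    else:
        print(f"Degrading but healthy: ~{days_to_failure:.0f} days of margin remain")
else:
    print("No degradation trend detected")
```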

When we are ill and functioning in a degraded state, we may fail at certain tasks. From a mission point of view, we can fail because of that degraded state: what we are usually able to do, we can’t do while we’re ill. I might not make it to work if my cold gets bad enough. I’m only degraded, but the tasks I intended to do don’t get done, so the mission of work fails, unless someone can take my place. As PHM develops more fully, we will be able to determine clearly when systems or parts of a system are ill, and replace them with like parts. The missions of those systems will therefore fail less often, with lower maintenance costs and fewer mission failures.


What is the cure for Diversity Fever?

Everyone seems to have diversity fever. We want to improve the reliability of our engineered solution, or the availability of a network, so we add redundancy, make sure we have diverse access to the service, and so on.

Diversity is expensive. Not only in terms of spending real money to duplicate systems, but also in all the resource management that goes with it, including repair, maintenance, power, space, and perhaps monitoring.

How often do we really need that? Sure, adding diversity is easy. But it isn’t always beneficial, and it isn’t always the best option. What is the goal for this diversity, and are you meeting it?

  • If it is to improve reliability, consider that you just introduced two things that can fail where you used to have one. Doubling the rate of occurrence of failures may actually decrease the reliability of what you wanted to improve, or it may not change it at all. And depending on the protection scheme, you may have created a reliance on the redundant unit while being blind to its failure, resulting in lower reliability. At best, you have a good, solid protection scheme in place, but have you analyzed the system to know that it is now better than the single unit? Did you get enough improvement to make the case for the redundant system?
  • If you are trying to improve availability, then is the problem dominated by failure frequency, or maybe it is repair time? Perhaps your resources are better spent improving repair time, or better yet installing PHM to improve proactive maintenance.
  • Are you trying to improve the reliability of one thing, or many things? If you are going to use the new redundancy in a load-sharing situation, you might be able to do more than you did before with the new resources. For example, having two batteries for your cell phone means that if one goes bad you can use the other and make another call. But what if you really need two phones? Are you improving the availability of a phone, or trying to accomplish a mission that requires two phones to work? It could be that you have not really built redundancy, but rather built dependency. The added resources are not redundant; they are critical. Where you once needed one, you now need two, or three, and having anything less than all the units working at one time means you can’t accomplish your goal. In this case, what you thought was redundancy became a system with lower reliability and availability. The two units are not in parallel, but in series (see the sketch after this list).
  • Are the units really independent?  If the failure mode of one unit leads to a higher probability of failure in the other unit, then the overall system may be worse than before.  What good are those two batteries for your phone when they are both in your pocket when you fall in the pool?
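To put numbers on the difference between true redundancy and hidden dependency, here is a minimal sketch assuming two identical, independent units with constant (exponential) failure rates. The failure rate and mission time are invented for illustration.

```python
# Minimal sketch: reliability of one unit vs. two "redundant" units, assuming
# independent units with constant failure rate (exponential lifetimes).
import math

failure_rate = 1.0 / 50_000   # assumed failures per hour for one unit
mission_time = 8_760          # one year of operation, in hours

r_single = math.exp(-failure_rate * mission_time)

# True redundancy (1-of-2 must work): the system fails only if both units fail.
r_parallel = 1 - (1 - r_single) ** 2

# Hidden dependency (2-of-2 must work): the "redundant" unit is actually critical,
# so the units are in series and the system fails if either one fails.
r_series = r_single ** 2

print(f"single unit                        : {r_single:.4f}")
print(f"1-of-2 (parallel, true redundancy) : {r_parallel:.4f}")
print(f"2-of-2 (series, hidden dependency) : {r_series:.4f}")
```

With these assumed numbers, the 1-of-2 arrangement improves mission reliability, while the 2-of-2 arrangement is worse than the single unit, which is exactly the trap described above.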

To be fair, redundancy in systems is often a good way to enhance design to improve reliability and availability. It isn’t always complicated, but it can be. It is important to think about what you are trying to achieve by enhancing the reliability and availability of a system or network or process. There are many options beyond redundancy to consider, and sometimes redundancy is a very bad choice.


It’s all about Reliability

If you consider yourself an engineer, think about what you do in a very broad sense. Engineering is all about applying scientific discovery and making it practical, useful, and widely used.

If you are an Electrical Engineer, for example, then you are at some level of the electronics systems hierarchy, building off of scientific discovery by making systems that address needs, solve problems, and so on. If all we had to work with was the first version of every scientific bench experiment, we would not be able to build the systems we do today. The performance would not be sufficient for most problems, the costs would be huge, and certainly the reliability would be very poor. But focusing on making things cheaper is not enough, as a cheap thing may still not perform well enough to be useful. And you can improve performance all you want, but it still won’t be of use if it continues to fail frequently, cannot be easily repaired or maintained, does not meet mission needs, or does not survive operation.

Think about what you do as a professional, and I’ll bet you can generalize at least part of what you do as assuring the reliability of something. Even scientists worry about reliability: their experiments have to be repeatable for confirmation, and that is fundamentally a reliability concern. Even the teacher is focused on improving the reliability of the students.

So if you are an engineer of any sort, even a scientist, a teacher, or a professional filling most any role for that matter, consider yourself to be a professional focused on reliability. And if you feel the need to improve on your own reliability, consider attending the 2014 RAMS® conference, the annual Reliability and Maintainability Symposium, to be held in Colorado Springs in just a few months.


Reliability Society Denver Section Joint Meeting with SME on Subsurface Measurement and its Applications to Manufacturing

On March 19th, the IEEE Reliability Society Denver Section held its first meeting of the year in Westminster, CO, jointly with the Denver Section of the Society for Manufacturing Engineers (SME).  Lively, informal technical discussion over dinner continued into the technical presentation by Yiming Deng, Ph.D., Assistant Professor at CU-Denver and Anschutz Medical Campus.

IEEE RS and SME have a lot in common, so the discussions centered on many current, overlapping concerns, including solar power generation, power systems design, the IEEE Greentech Conference, airline maintenance, and many other engineering topics. At least one of the 12 attendees was a member of both SME and IEEE.

Yiming Deng, who is also the Director of LIIP at CU-Denver, presented some interesting details about his research and the many facets of the work he accomplishes with his graduate students. After explaining his background, the university resources available, and the basic concepts behind the topic, he proceeded to cover some of the cutting-edge work he pursues.

  • Advanced Electro-Magnetic Imaging is useful for Nondestructive Evaluation, and Structural Health Monitoring, both key concepts within the energetic topic of Prognostics and Health Management (PHM).
  • Electromagnetic Nondestructive Evaluation involves magnetic sensors which must be calibrated carefully, and studied, usually with mathematical models, to assure a high probability of detection. I found it interesting to learn that while the operator is good at recognizing large features that are indications of problems, the software is better at identifying smaller features, so a combination of the two works best.
  • Sensors are optimized using a forward model. Because modeling can bring a much larger variety of test conditions and image features, engineers can optimize the sensor array size, configuration, lift off, operation frequency, and more to make sure the sensors are most sensitive. The trick is to avoid type I and type II errors in a way to reduce overall costs.
  • Yiming and his students are working on a novel use of microwave imaging called near field scanning microwave imaging (NFMW).  Because of the near field application, the resolution is not determined by the wavelength, which is too large to be useful, but rather determined by the aperture size, which can be easily controlled.
  • His team is also looking at a hybrid approach, combining microwave and ultrasound to gain the best of both methods.

The applications of this work to both reliability and manufacturing are amazing, and nearly endless. PHM is well known to bring advantages to reliability by proactively identifying problems before they occur. And as a nondestructive evaluation technique, these methods can be applied at inspection points on the manufacturing line to find defects before shipment, and in re-manufacturing plants to assess the rehabilitation and repair necessary to make a product or system as good as new.

If you’re disappointed that you missed this talk, be sure not to miss the next one, tentatively scheduled for early May, with details to be given here.


The What and How of Network Modeling

I often get asked to describe what I do and how I do it, especially when it comes to network modeling. The other day, that need became urgent. With the pressure on to come up with a description, here is what I came up with. How does it compare to the methods you use to model a network?

Understand the problem

Example – An enterprise on an existing network is looking to turn up a new service. But they are not sure whether the network can support this new service at the desired Service Level Agreement (SLA), or what edge configuration will be required to meet the availability and performance requirements of the SLA.

Generalities – Traceability of requirements is a key guide at this point. At times, requirements are not as solid as they seem, or not understood as fully as they seem. A good modeler will understand the problem in sufficient detail to know what kinds of measures of performance and effectiveness are needed, and how detailed the model must be.

Gather the relevant information

Example – The engineering team collects architectures, configurations, and specific performance data for the equipment in the network, under the software versions that exist. Due to the nature of the network and service offering, a worst-case example and an overall network estimate are in order. So the team collects failure rates and network configuration information to form an inventory of what is known about the network, making reasonable assumptions where data is missing. By collecting goodput estimates from the Network Management System (NMS), and modeling the goodput based on circuit length, they create a reasonable statistical model for performance and utilization of the circuits needed for the service. A simple nonlinear regression using a commercial off-the-shelf (COTS) statistics package is sufficient to form the model. By collecting outage data from the NMS and the ticketing system, in combination with Telcordia part count methods, they create a statistical model of failure rates for components, and repair rates under various conditions. These new statistical estimates of failure rates are then used throughout the analysis.
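As a sketch of that kind of regression (the model form, circuit lengths, and goodput values below are illustrative assumptions, not the team’s actual data), the fit can be done in a few lines:

```python
# Fit goodput as a function of circuit length with a simple nonlinear regression.
import numpy as np
from scipy.optimize import curve_fit

length_km = np.array([10, 50, 100, 250, 500, 1000, 2000], dtype=float)
goodput_mbps = np.array([940, 920, 890, 810, 700, 530, 320], dtype=float)

def goodput_model(length, g0, k):
    """Assumed exponential decay of goodput with circuit length."""
    return g0 * np.exp(-k * length)

params, _ = curve_fit(goodput_model, length_km, goodput_mbps, p0=(950, 1e-4))
g0, k = params
print(f"fitted goodput at zero length: {g0:.0f} Mb/s, decay constant: {k:.2e} per km")
print(f"predicted goodput on a 750 km circuit: {goodput_model(750, g0, k):.0f} Mb/s")
```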

Generalities – It is important to understand the equipment in the network in enough detail to meet the needs of the problem. Sometimes it is sufficient to have Telcordia Method 1 estimates, but at other times specific field failure and repair information is more important. Comparative architectures may be assessed before either is built using the Telcordia Method 1 part count, but analysis of sparing plans requires specific field data. Modeling tools like OPNET can provide performance estimates in some cases, and can also pull good data from the network, as can tools from CA, SolarWinds, and others.
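For comparison, a parts-count style estimate in the spirit of Telcordia Method 1 is just a weighted sum of per-part failure rates. The part list, failure rates, and factors below are placeholders for illustration, not actual Telcordia table values.

```python
# Generic parts-count style estimate: sum per-part base failure rates,
# scaled by assumed environment and quality factors.
parts = [
    # (part type, quantity, base failure rate in FITs = failures per 1e9 hours)
    ("optical transceiver", 4, 250.0),
    ("switch ASIC",         1, 400.0),
    ("DRAM module",         2, 120.0),
    ("power supply",        2, 500.0),
]
pi_env = 2.0      # assumed environment factor (e.g., uncontrolled ground site)
pi_quality = 1.0  # assumed quality factor

unit_fits = sum(qty * lam for _, qty, lam in parts) * pi_env * pi_quality
mtbf_hours = 1e9 / unit_fits
print(f"estimated unit failure rate: {unit_fits:.0f} FITs (MTBF ~ {mtbf_hours:,.0f} hours)")
```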

Dovetail the needs with the information available

Example – After some assessment, it seems there are field failure data for estimating both failure rates and repair rates, plus the original part count method results. The team, having access to the necessary details, works with the equipment vendors to create updated failure rate estimates and improved repair rate estimates. Further, they work with the equipment vendors to validate their performance estimates and reliability models of the equipment for the circuits they intend to build through it. The best way to validate an understanding of network behavior is to present the modeling structure and assumptions to the experts, who usually reside within the engineering teams of the equipment providers. Block diagrams of all kinds are usually in order, but circuit diagrams and call scripts may be useful too.

Generalities – Here is where form follows function as we design the right modeling solution. At times, a COTS solution is best because it directly answers the question, or a part of the question, and is already populated with the information required. But most of the time there is at least some modeling work that COTS tools are not ready to provide. Therefore, it is good to have some general-purpose models at the ready, with capabilities that can be adjusted and reused as needed. Further, this is where we determine whether a simple spreadsheet model is sufficient, a network simulation must be designed, or an algebraic or stochastic model should be formulated.

Create a baseline model, validate and verify

Example – The modeling team iterates with the engineering experts to create a baseline model. They create a worst-case availability estimate, and a network performance estimate as well. They build a small spreadsheet model of the simple case for ballpark estimation, from simple circuit and edge configuration estimates. The first look suggests some redundancy is in order for services resembling the worst-case structure. In addition, a 1G circuit may not have enough goodput to provide the service as hoped, so a trunk group of two 1G links is suggested. The engineering team agrees it will work, but the cost will increase, so there is some concern. The modelers then create a failure and performance model using general-purpose software, leveraging their shortest path solver and a hybrid state enumeration and Monte Carlo simulation engine to analyze an encoded state model of the network. Looking at the overall network model, they find that the baseline with a single 1G circuit should work fine most of the time. These results suggest some alternative models to try, leading to a new architecture and some adjustments to the baseline network. The engineering team updates the business case based on the baseline model and baseline network, verifying that the architecture must be adjusted to serve all potential customers.
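To illustrate the Monte Carlo portion of that approach (a sketch only, not the team’s actual engine), the following samples link up/down states from assumed link availabilities and checks whether the service endpoints stay connected; the topology and numbers are invented.

```python
# Minimal Monte Carlo sketch of end-to-end service availability on a small network:
# sample each link up/down from its availability, then check whether the two
# service endpoints remain connected.
import random
import networkx as nx

links = {  # (node_a, node_b): assumed steady-state link availability
    ("edge_A", "core_1"): 0.9995,
    ("edge_A", "core_2"): 0.9995,
    ("core_1", "core_2"): 0.9999,
    ("core_1", "edge_B"): 0.9995,
    ("core_2", "edge_B"): 0.9995,
}

def sample_connected(src="edge_A", dst="edge_B"):
    g = nx.Graph()
    for (a, b), avail in links.items():
        if random.random() < avail:   # link is up in this sampled network state
            g.add_edge(a, b)
    return g.has_node(src) and g.has_node(dst) and nx.has_path(g, src, dst)

trials = 100_000
up = sum(sample_connected() for _ in range(trials))
print(f"estimated end-to-end availability: {up / trials:.5f}")
```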

Generalities – It is usually useful to build simple models for comparison with the more complicated one, and with existing knowledge, and then to solve for known results when they are available. This approach helps to validate the more complicated model. Then the more complicated model can be built to match the baseline, and validated and verified to the best degree possible. Once created, the baseline can provide some early results that may not otherwise be available; the overall network availability, for example, can be calculated, or at least estimated better than it can be measured directly from the network.

Create the alternative model or models

Example – A few redundant architectures are built, and a couple of options for different edge devices are modeled as well. The throughput of the inexpensive edge device is not sufficient, but with redundant edge devices the service is highly reliable and meets goodput needs. However, the repair issues with two remotely deployed edge devices push the costs too high, so an alternate edge device is chosen, with redundant links to the core network. By managing over-subscription carefully, the business case works well, the network can scale, and the end-to-end availability holds up, as found through the model. To determine this, the modelers take the baseline, adjust the nodes and links at the edges of the network to represent the new edge devices, update the failure rate and repair information, and rerun the model for comparison with the baseline. Further, because the engineers are concerned about how the new edge equipment will protect against failures, the modelers create a stochastic reliability model that handles the complex states of the edge equipment during protection and high utilization.
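A tiny example of that kind of stochastic model (a sketch only, with assumed rates, not the team’s actual model): a two-unit redundant edge with a single repair crew, solved as a continuous-time Markov chain for its steady-state availability.

```python
# States: 0 = both units up, 1 = one failed, 2 = both failed (service down).
import numpy as np

lam = 1.0 / 20_000   # assumed per-unit failure rate, per hour
mu = 1.0 / 8         # assumed repair rate, per hour (8 hour mean repair, one crew)

# Continuous-time Markov chain generator matrix Q (rows sum to zero).
Q = np.array([
    [-2 * lam,      2 * lam,  0.0],
    [      mu, -(mu + lam),   lam],
    [     0.0,          mu,   -mu],
])

# Steady-state probabilities: solve pi @ Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = pi[0] + pi[1]   # the service is up unless both units are down
print(f"steady-state availability of the redundant edge: {availability:.7f}")
```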

Generalities – The alternatives explored here may be for different network states, possible specific outcomes of concern, sensitivity analysis of the baseline, alternative architectures as in the example, etc. Because of the vast range of analyses necessary to support network architecture, planning, and engineering decisions, a flexible network modeling capability is necessary.

Analyze results, cycle back if needed

Example – The modelers work closely with the engineers to make small changes and rerun the models a few times, validating and verifying that the recommended architectures are low in cost, high in reliability and availability, and more than meet the performance needs. The results take the form of cost models, goodput and performance results, and availability and reliability estimates for the selected architectures. Because those results are spread across numerical outputs, spreadsheets, and block diagrams of several architectures, the team still has much work to do to form the final report.

Generalities – By working in tight rapid cycles with the engineering experts, the modelers can bring results, work through adjustments, and come up with solid recommendations very fast, allowing for reduced overall risk.

Form results for presentation and decision support

Example – The engineers and modelers work together to trim the results down into a complete view that can be understood easily by leadership and decision makers. They form an architecture document and a supporting presentation. To aid understanding, they use the simple model and functional block diagrams to explain the worst-case issues, presenting the performance and availability results in a table comparing the options and making it clear why the selected architectures are recommended. The business case analysis is complete, with tornado diagrams and Monte Carlo sensitivity analysis of the parameters of concern. The more detailed network model results are provided in graphical form, with statistics describing the range of results to be expected in the network, explaining how the mix of architectures will meet the needs of the customer base for the target SLAs. The new edge device is selected, and for risk reduction the engineers take the model results as requirements to begin searching for a second provider.

Generalities – Because so much information must be understood in a network analysis, we find it important to present the statistics in table form, in conjunction with network diagrams that associate the statistics with the physical network whenever possible. When presenting multiple architectures for availability, we often rely on box plots of the architectures placed on the same plot for easy comparison, along with the estimated or calculated availability figures. To give a good sense of the amount of traffic and how it is carried on the network, we often use 3D network plots that show multiple layers of traffic. Because it is important to understand architectures well, we rely on block diagrams often. We can combine these and other display options to present a complete solution that meets the needed risk-to-reward tradeoff.
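A small sketch of the box-plot comparison described above, with invented availability samples (as if from Monte Carlo runs) for three hypothetical architectures:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
samples = {  # invented unavailability samples subtracted from 1
    "single edge":          1 - rng.gamma(2.0, 2e-4, size=500),
    "redundant edge":       1 - rng.gamma(2.0, 5e-5, size=500),
    "redundant edge + LAG": 1 - rng.gamma(2.0, 2e-5, size=500),
}

data = list(samples.values())
plt.boxplot(data)
plt.xticks(range(1, len(data) + 1), list(samples.keys()))
plt.ylabel("estimated availability")
plt.title("Candidate architectures: Monte Carlo availability estimates")
plt.tight_layout()
plt.show()
```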


Capacity Management, Failure Mitigation, and Where to Start?

Standard mitigation tools are a great place to start with Capacity Management, just like any failure mitigation need.

As telecommunications and IT consultants, we’re often called upon to solve problems. I remember some of my early training suggesting that a good place to start is with generative questions, exploring the problem through framing questions and levels of logic to find the opportunity to dovetail the client’s needs with your skills. But so often, you never get that chance.

Instead, I find a good place to start with many problems is with case studies. Everyone involved in the problem has been a part of the pain, and they can usually provide a case study or two.

The context doesn’t matter much, as a failure of something has usually led someone to call on you. In business, we are all problem solvers. General failure mitigation tools, including FMECA, FMEA, and quality tools like fishbone diagrams, all come in handy to describe the failure and lead you to a mitigation.

Capacity Management is no different.  Consider the categories of capacity management failure.

  • Not enough capacity leads to shortage, and therefore symptoms like long order cycle times, poor network performance, service failures, high operations costs, and related symptoms.
  • Too much capacity leads to wasted resources, high capital costs, and other related symptoms.
  • Surprises in demand can lead to the symptoms of insufficient capacity at times, yet be a temptation leading to too much capacity being deployed.
  • And there are other potential causes including poor information, no data, lack of ownership, etc.; see any basic FMECA or fishbone diagram for guidance.

Note that all of these conditions can exist in the same business, on the same network, and on the same consumable resource.

A solid software development project starts with use cases, a solid systems engineering project has design reviews, and reliability and quality engineering projects seek to understand the ways a system fails and how to improve it. Capacity Management is no different. Start with the case studies that led an organization to seek a better solution for managing the capacity of its consumable resources, be that network capacity, operations staff, cloud storage, or service and application performance.
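As a small illustration of applying a failure mitigation tool such as FMEA to those case studies, here is a sketch that ranks the capacity management failure modes listed above by a risk priority number. The severity, occurrence, and detection scores are invented; in practice they would come from the organization’s own case studies.

```python
# FMEA-style ranking: risk priority number (RPN) = severity * occurrence * detection.
failure_modes = [
    # (failure mode, severity, occurrence, detection difficulty), each scored 1-10
    ("insufficient capacity / shortage",    8, 6, 4),
    ("excess capacity / stranded capital",  5, 7, 6),
    ("demand surprise / usage burst",       7, 5, 8),
    ("poor or missing utilization data",    6, 6, 7),
]

ranked = sorted(failure_modes, key=lambda m: m[1] * m[2] * m[3], reverse=True)
for mode, sev, occ, det in ranked:
    print(f"RPN {sev * occ * det:4d}  {mode}")
```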


Reliability Society Denver Section Meeting on Reliability Based Steel Sheet Pile Assessment

On Thursday, November 29th, the Denver Section of the IEEE Reliability Society enjoyed an evening of professional discussions and networking, with a very interesting technical talk from Dr. Rui Liu, a drawing for a book by Gregg Hobbs, and pizza and subs for the attendees.

We started at 6pm at the Senate Chambers in the Tivoli Building at the University of Colorado-Denver with informal discussions about engineering and reliability, future meeting ideas, and other business. In attendance were at least four IEEE members and eleven non-IEEE members, many of them students, a few of whom decided to join the IEEE that evening. CU Denver’s Department of Electrical Engineering sponsored the cost of the facility, and everyone enjoyed pizza and subs provided by the IEEE Reliability Society’s Denver Section. The event, free to attendees and open to the public, was held at a time and location convenient for the attendees who expressed interest, as well as the speaker. One discussion some of us had may lead to a January joint meeting with student engineering groups at Denver University.

At 6:30pm, Dr. Liu started his talk by describing how organizations like the U.S. Army Corps of Engineers assess the condition of water transportation infrastructure (such as steel sheet pile structures, miter gates, and sector gates), through various measurements of deviation, visual inspection of degradation features, and subjective assessments of condition and degradation as well. Dr. Liu has contributed to this work by developing several models of various degradation features, including statistical variability of materials and measurement error as well as mechanical engineering models, then combining these into reliability assessments through a reliability index. With these models, engineers are now able to assess the various maintenance options, in combination and comparatively, to optimize the maintenance of infrastructure. The U.S. Government is responsible for maintaining this infrastructure, and estimates the cost of correcting the degradation of critical infrastructure in the trillions of dollars. Dr. Liu’s research can potentially help reduce these costs, and at least prioritize the maintenance of a good portion of these systems so we can be much more effective with our Government’s maintenance dollars. The attendees were very interested in the work, asking multiple questions and offering suggestions for where to develop the work’s benefits even further.

Then we held a drawing for a free book: “Accelerated Reliability Engineering: HALT and HASS” by Gregg Hobbs. Virginia Hobbs, the owner and operator of Hobbs Engineering, provided the book, and told us a bit about the history of Highly Accelerated Life Testing and Stress Screening. One fortunate attendee was able to take the book home, and already had plans for reading it and sharing the information.

In closing the evening, we thanked Dr. Liu and Virginia Hobbs for their contributions, mentioned our tentative plans for a January or February meeting next, and wrapped up the many discussions and ideas from the evening.

Be sure to watch our website for updates on our next and future meetings, and to contact us for information, or suggestions on future meetings you would prefer to attend or perhaps contribute to. We hope to see you there!


Network Capacity Management and the Trouble with Change

While working parallel projects on different sides of the Capacity Management (CM) problem, I’ve noticed a rather large disconnect between the way some networks operate and the assumptions hidden within the monitoring tools in use.

Stable networks are the ideal case, but they are not the problem. If a network is stable in its configuration, with stable routing and rather stable use, then the assumptions made by simple trending are useful when trying to understand and manage the capacity of network resources. The only dynamic concerns are the addition of new customers and changes in the usage of existing customers. Even when these dynamics grow, they are not often bursty, and they can be predicted reasonably well with the simple trending in the available Capacity Management and Application Performance Monitoring (APM) tools. No problem there.
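Here is a minimal sketch of that simple trending assumption (the utilization series and threshold are invented): fit a linear trend to link utilization and estimate when it will cross an augmentation threshold.

```python
import numpy as np

monthly_util = np.array([0.42, 0.44, 0.47, 0.49, 0.52, 0.55])  # fraction of link capacity
threshold = 0.80                                                # augment when the trend hits 80%

months = np.arange(len(monthly_util))
slope, intercept = np.polyfit(months, monthly_util, 1)
months_to_threshold = (threshold - monthly_util[-1]) / slope

print(f"growth: {slope:.3f} per month; ~{months_to_threshold:.0f} months until augmentation is needed")
# On a dynamic network (reroutes, PBR/SDN shifts), this extrapolation can be
# badly wrong, which is exactly the problem described next.
```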

However, networks are often dynamic, and that can be a problem. A network failure forces a change in routing. A special event or mission drives a new usage dynamic. Some services are designed to be very dynamic, such as policy based routing (PBR), OpenFlow, and Software Defined Networking (SDN). One reason for implementing these capabilities is to allow a network to be more dynamic, handling traffic by time of day or specific external need, or even giving users more flexibility in their use of the network’s resources. The result is an opportunity for much greater complexity, more degrees of freedom, and many other internal and external causes of changes in traffic. These dynamics make it much more difficult to know what to do about a potential capacity, usage, or response time problem. There is the potential for more spikes in usage, and more causes of those spikes that are not necessarily capacity or application problems. The monitoring problem just gets worse, and it becomes more difficult to know when more capacity is needed. The CM and APM tasks just got harder.

Some tools are set up to address the common architectures that apply some of these dynamics, but many tools do not, and none address all potential use cases. There are good APM and CM tools for cloud environments, and some that handle Link AGgregation (LAG). But the general application of these new, dynamic features of networking is not well addressed by the networking tools community. Some tools now have the ability to address needs like flow-based analysis, which helps a lot. Some tools allow dynamic analysis of subsets of traffic data in very general ways, but that requires expert analyst users, who are expensive, rare, and busy. Analytics has the potential to assist, when the right information is presented, but that is also a hard problem.

At the very least, it is apparent that specific architectures and use cases will drive specific requirements for their monitoring tools to an even greater degree than ever before. Flexible tools and well-trained expert users will provide greater flexibility and insurance, at a cost. Improved development in the CM and APM space will need to step in next to drive down those costs.


Reliability Society Denver Section Meeting on Network Reliability and Prognostics

On Friday, October 26th, the Denver Section of the IEEE Reliability Society held a technical meeting at the Tivoli Building of the University of Colorado-Denver. Our newest Chair, Dr. Jason W. Rupe, presented his observations and work on advances in the area of network reliability, in particular Telecommunications Reliability.

After a brief informal discussion to get the event going, and to give everyone a chance to enjoy the pizza and subs provided by the CU-Denver Department of Electrical Engineering, the presentation started with a discussion of how every engineer has the task of reliability in some way. Reliability, after all, is really about making things last longer, work better, and deliver more effectively. Once research creates a capability, engineers take over to develop it, which is really about making it more reliable. In other words, reliability is development, making a technology scale better.

The main talk then proceeded with a presentation of how different areas of engineering are mostly unaware of the work done in other areas, and how that leads to research that is not well connected to some of the practical problems. Jason then presented some of the work he is doing that attempts to take the best of the research and add his own improvements, to help the work better address the needs of engineers. After he presented some of his approaches and example results, the group discussed their thoughts on the matter, and even shared observations from their own areas where more sharing of approaches and technology would benefit all.

Rather than being a “T”-shaped person, one who is broad in many areas but very deep in one, we discussed how an engineer should be a network-shaped person, looking across disciplines to use and share ideas, results, and capabilities to better all. As reliability engineers and researchers, we are uniquely capable, having a lot to share with the many disciplines we work within.

We concluded the presentation with an impromptu tour of some of the CU-Denver labs. Dr. Yiming Deng, who conducts research in many areas of prognostics and in making medical devices more reliable, showed us some of the work he is leading on combining two sensing technologies at the same time to get a better view into devices for better non-destructive testing. Then Dr. Rui Liu gave us a quick tour of the concrete testing lab, which includes facilities for rapid temperature cycling of concrete blocks and environmentally controlled chambers for designed experiments on the curing of concrete. He also showed us several of the devices they have that allow them to create new mixes and test them for strength.

We concluded with a discussion of how we learned a bit from each other, and of the ideas we gained as a result of sharing our different disciplines. While we had far fewer attendees than we planned and hoped for, the result was a benefit to the engineers and students who attended. We all returned home with new ideas for improving reliability in our various disciplines.

If you missed this meeting, do not be concerned: our next meeting is planned at the same location on Thursday, November 29, where we will learn more about the research being conducted by Dr. Rui Liu on concrete and bridge reliability. Hope to see you there!


Leveraging Human Solutions in Operations Research

People maintain control in spite of our new machine overlords.

Operations Research (O.R.) projects often focus on trying to take control of the entire problem, and therefore fail when the engineer or manager with the ultimate control cannot validate, verify, or sometimes even follow the solution recommended by the software. So we add graphics and simple ways of explaining the results, hoping the person with power gets comfortable and follows the recommendation.  But when that person doesn’t have a Ph.D. in O.R., they still don’t completely trust the solution. Why is that?

  • Sometimes it is because they know something the software doesn’t.
  • Maybe it is because there are requirements or constraints they just can’t articulate.
  • Perhaps it is because there are unpredictable events that the user believes could happen, and would cause the software to do very bad things.

All these possible reasons, and more, make it difficult to trust the software. Even when a person, even an expert, can’t possibly do a better job at finding the right solution, the user doesn’t trust it. So the solution is again ignored. I’ve seen this time and time again.

It seems time to try an alternate approach: use our O.R. skills to convert the problem into one the engineer or manager has a fighting chance to solve and make sense of. Use our applied mathematics skills to clarify the problem, not just optimize. When O.R. works, I contend you will find that the real work does exactly that: it converts the problem to help the person solve it, rather than trying to take the reins from the person.

After much time trusting and validating, maybe they will ask for the decision to be automated. But not right away, not all the time, and not without an option to take back the reins.
