Improving Signal-To-Noise: Streamlining Alerts

By Clay Roach

Almost everyone working in the IT Ops space in 2017 has experienced their fair share of alert fatigue. The diverse set of tools and platforms required to operate competitively virtually ensures any IT Ops manager will be drowning in notifications by lunchtime, even when everything is running smoothly.


The most common fix for alert fatigue – learning to tune out all the alerts so you can focus on the work at hand – is a non-starter. Some alerts are of critical importance, which creates an inconvenient reality for IT Ops when attempting to separate the signal from the noise. If you can learn how to manage them, you’ll gain valuable visibility into your IT operations and be better prepared to combat outages.

Step 1: Consolidate and Correlate Events

What we’ve learned over the past five or six years is that trying to find a single platform to monitor everything going on in your IT space isn’t the right solution, especially for enterprise-level organizations. Best-of-breed tools simply deliver better results, but by their nature each of these tools excels at one thing, so most large-scale ecosystems are going to be populated by a wide variety of tools, each doing its one job very well. Most IT Ops vendors provide some kind of event consolidation, but each takes a slightly different approach. The leaders in event management systems have historically been IBM and HPE Software, but ServiceNow and Moogsoft have entered the space from the low end and have compelling solutions as well.

Once you accept that some tools can’t be replaced and multiple tools can’t be avoided, the next step is to find a platform which will allow you to do event consolidation. An event management platform that can recognize when two alerts from two different tools are reporting the same information, and report it only once, goes a long way towards making your ops management more efficient. If you can turn twenty alerts from twenty tools all telling you the same thing into one alert, your Tier 1 operators have twenty fewer things to look at.
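At its core, this kind of consolidation is just grouping alerts by a shared fingerprint. Here is a minimal sketch, assuming alerts arrive as simple dicts and using a host/check pair as a stand-in for whatever correlation key your platform actually supports:

```python
from collections import defaultdict

def consolidate(alerts):
    """Group alerts that report the same underlying condition,
    emitting one consolidated alert per group."""
    groups = defaultdict(list)
    for alert in alerts:
        # Fingerprint on (host, check); a real platform would use a
        # richer correlation key (time window, topology, etc.).
        groups[(alert["host"], alert["check"])].append(alert)
    return [
        {"host": h, "check": c, "sources": sorted({a["tool"] for a in dups})}
        for (h, c), dups in groups.items()
    ]

# Three tools all shouting about the same full disk...
alerts = [
    {"tool": "nagios",  "host": "db01", "check": "disk_full"},
    {"tool": "zabbix",  "host": "db01", "check": "disk_full"},
    {"tool": "datadog", "host": "db01", "check": "disk_full"},
]
print(consolidate(alerts))  # ...becomes one alert listing all three sources
```

The tool names and field layout are illustrative, but the principle is exactly what the platforms above sell: many alerts in, one actionable event out.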

Step 2: Service Mapping

Once you’re able to cut down on redundant alerts telling you the same information, the next step is to develop a framework for prioritizing what alerts make it through to your support team and ops managers. The best framework is one built on service mapping, also called service modeling. A service map accounts for all the technologies that underlie your service across every layer, and traces the interactions between them to help establish the causal relationships in your ecosystem.

For example, an airline might have a mobile app for customers to book flights. As a customer, you’d never realize that you are going through several different technology layers to make your reservations. So it’s not really just one application; it’s an application layer added on top of all these other technologies – databases, web servers, other applications brought in to share data – and all these parts are interacting.

Consider a SAN failure: it is going to cause outages for everything that depends on that storage, but when you look at all your technologies in a flat list you’re never going to see that cause-and-effect relationship. Service mapping charts these dependencies and associates the underlying technology stack with the business service, which gives you immediate situational awareness and helps reduce downtime during an outage (and can even help predict outages before they occur).

Once the map is in place, you can look from the bottom up and see that if a server is down and your mobile app partially runs on that server, your mobile app might be going down too. And if this information is going into the same platform you’re using for event correlation, suddenly an alert about a server going down isn’t an isolated, potentially lower-priority incident – it’s correlated to the business service that it powers.
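That bottom-up walk is straightforward to sketch. Here is a toy service map (the component names are invented for the airline example above) showing how a single SAN failure surfaces as a mobile-app problem rather than an isolated storage alert:

```python
# Each component lists the components it depends on (a hypothetical map).
SERVICE_MAP = {
    "mobile-app":  ["web-server", "booking-api"],
    "booking-api": ["database"],
    "web-server":  [],
    "database":    ["san-01"],
    "san-01":      [],
}

def impacted_services(failed, service_map):
    """Walk the map bottom-up: which components transitively
    depend on the failed one?"""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for svc, deps in service_map.items():
            if svc not in impacted and (failed in deps or impacted & set(deps)):
                impacted.add(svc)
                changed = True
    return impacted

print(impacted_services("san-01", SERVICE_MAP))
# The SAN failure propagates up through database and booking-api
# all the way to the customer-facing mobile app.
```

A real CMDB-backed service model is far richer than a dict, but the causal chain it exposes is the same.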

Step 3: Automation

With all this information (correlated events, a service map) on one platform, you can start targeting low-hanging fruit for automation. Even with a platform telling you exactly what is wrong – which cuts out the time wasted on the database team blaming the network team and the network team blaming someone else – it still comes down to someone actually having to fix the problem and resolve the situation.

Of course problems are going to arise that are extremely complex involving lots of technologies, which are going to require equally complex, creative solutions. At the same time, alerts are going to come through letting you know you need to increase memory on a server or restart a process. Simple Tier 1 or basic Tier 2 things like this are your low-hanging fruit. You can suppress that alert, run a script to do that work, and that’s one less thing your Tier 1 people are wasting time on.

The next level of this idea is not just to automate technology but to also start to automate your human processes. If increasing memory on a server needs two approvals, there is a way you could automate that entire process:

Instead of getting on a call for 30 minutes to get the first approval, and then another 30-minute call to get that second approval (that is scheduled based on availability and might happen in three days) what if you set up a script to circulate the approvals? “This remediation is required, this is automated, just click on ‘yes’ or ‘no’ based on your approval.” And you’re good to go. Of course process automation could require a bigger set of tools to deploy, so a great first step is just to focus on the technology side and try to automate basic Tier 1 things that don’t require any approvals.
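The approval circulation itself reduces to a small piece of logic: collect yes/no responses from the required approvers and gate the remediation on them. A rough sketch (the approver names and change description are invented):

```python
def circulate_approvals(change, approvers, responses):
    """Gate an automated remediation on recorded yes/no responses
    instead of scheduling two 30-minute calls."""
    missing = [a for a in approvers if a not in responses]
    if missing:
        return "waiting on: " + ", ".join(missing)  # nudge the stragglers
    if all(responses[a] == "yes" for a in approvers):
        return f"approved: run '{change}'"
    return "rejected"

# Both approvals came back via one-click responses; no calls required.
print(circulate_approvals(
    "increase memory on db01",
    ["dba-lead", "ops-lead"],
    {"dba-lead": "yes", "ops-lead": "yes"},
))
```

In practice the responses would arrive through email or chat integrations rather than a dict, but the workflow – circulate, collect, gate – is the same.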

Step 4: Analytics

Every tool you have in place is generating tons of data, and most of it represents a missed opportunity. Less than 1% of data is being used for analytics, so there’s nowhere to go but up. We can leverage the data coming from all of our best-of-breed tools to become more proactive. Rather than just running through logs after a service outage to trace back and see where it began, we can try to spot the precursors to the outage and hopefully take preventative measures.

Once you have a big enough sample of the metrics these tools report on – things like end-user response times, network throughput, or system metrics – you can establish baselines for what a healthy ecosystem looks like. And if we know what it’s supposed to look like, we can identify anomalies. With event correlation, we can start to see these anomalies being reported from different tools as the same event. With service mapping, we can see these events in terms of how they relate to the whole ecosystem. And with automation, we can get scripts in place to address these events if possible.
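One of the simplest ways to turn a baseline into an anomaly detector is a standard-deviation threshold. This sketch (with made-up response-time numbers) flags any reading that falls well outside the healthy baseline:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it falls more than `threshold` standard
    deviations away from the baseline built over `history`."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change is anomalous
    return abs(latest - mu) / sigma > threshold

# End-user response times in milliseconds: a healthy baseline, then a spike.
baseline = [120, 118, 125, 130, 122, 119, 127, 124]
print(is_anomalous(baseline, 126))  # within normal variation
print(is_anomalous(baseline, 480))  # a precursor worth alerting on
```

Real analytics platforms use far more sophisticated models (seasonality, multivariate correlation), but even a crude baseline like this catches the spike before the outage, not after.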

Analytics can also be used to power executive dashboarding that correlates the overall health and performance of the IT ecosystem with revenue information, helping someone like a CIO or CFO see the health of the business in real time. The dashboard an IT Ops manager works from is in-depth by design, and that depth makes it harder to read and interpret the further up the ladder you go, so being able to build a separate executive dashboard is really valuable.

With event management software, you can pass through alerts that have been re-prioritized by the platform. Even if the tool generating the alert thinks it’s red, it might only be yellow in relation to the overall ecosystem. With a simple dashboard you can help the management level or C-Suite see that resources are being expended intelligently by helping them to understand the relative importance of different situations and the relationship between service and revenue. There are very few organizations out there today with the level of IT maturity that’s required to run analytics like this, but new tools can help you bootstrap quickly.
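That re-prioritization is just the service map’s business context overriding the source tool’s severity. A toy sketch (the service names and weights are invented; a real platform would pull them from the service model):

```python
# Business weight of each service (hypothetical; a service map supplies this).
SERVICE_WEIGHT = {"booking": "critical", "internal-wiki": "low"}

def reprioritize(alert, service):
    """Override the source tool's severity using business context:
    a 'red' alert on a low-value service becomes yellow on the dashboard."""
    if SERVICE_WEIGHT.get(service) == "critical":
        return "red"                      # customer-facing: keep it loud
    return "yellow" if alert["severity"] == "red" else alert["severity"]

print(reprioritize({"severity": "red"}, "internal-wiki"))  # downgraded
print(reprioritize({"severity": "red"}, "booking"))        # stays red
```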

Step 5: Implementation

The final step in this process is taking all this knowledge and broad-strokes strategy and actually implementing it in your organization, which can seem overwhelming. This is where the latest operations tools – like OpStream – come into play. These tools deliver the functionality discussed here in a way that helps you hit the ground running without the need for custom implementation and integration.

Tools will only get you part of the way there, though. That’s why it’s important to find resources that will help you assess what you have through event management workshops, plan out where you want to go with in-depth project scoping, and identify the concrete, actionable steps you’ll need to take to get there. Knowing the solution to a problem and implementing that solution are two different things. Take the first step towards streamlining your alerts and improving service availability and delivery: contact us today to speak with our experts.

 

Topics: Opstream
