AIOps: Moving beyond the trigger-driven alerts model with deep learning

The standard application of AI to IT operations is to mitigate the “overload of red alerts”: too much noise, not enough signal. This is a valid concern as IT estates grow larger, but the emphasis on individual overload does nothing to address the underlying process issue.

By Shaun McGirr, AI Evangelist, Dataiku


With AIOps, organisations are able to generate more meaningful signals to begin with. The industries leading the way use AIOps to strategically manage vast IT estates and improve the quality of data they are generating for better, more specific insight. And for those on the frontiers, deep learning models have become a key asset, as when applied correctly they can counter the human bias that leads to “tell-me-everything,” trigger-driven alerting.

Telecoms companies face particularly complex IT estates and have already reaped rewards from this application of AIOps to the underlying problem rather than its surface-level manifestation. To understand this potential value deeply, it’s useful to analyse the trigger-driven alerts model, and how large enterprises including telecoms are using AIOps with deep learning to move past it.

The problem with trigger-driven alerts in telecoms

It’s time to re-think the problems AI was meant to solve for IT operations. Think about it this way: if every system or service is sending alerts, how do you know which ones deserve attention? This is a tough enough problem on its own, but attempts to solve it could easily lead to ever more complex AI systems built on top of the current alerts ecosystem, which would in turn require alerts of their own.

What we need is to use AIOps to move beyond filtering through thousands of trigger-driven alerts and instead improve the way alerts are generated. It’s a bit like saying, if you have trouble locating the needles in the haystack, maybe it’s time to get rid of 90% of the hay. This requires proactively improving the quality of the data you generate about service levels, so your operations are running on a higher-octane fuel to begin with.

Telecoms face twin pressures that make this kind of AIOps solution especially relevant: ever-increasing complexity and scale in these companies’ global IT operations, yet unignorable customer expectations of increasingly smooth, frictionless experiences across channels.

As enterprises, telecoms can have tens of thousands of discrete IT services running nationally or globally, which need constant monitoring. When companies only had 10 or 20 services to monitor, they put people in a global operations centre with a few dashboards and everything was fine, but there are only so many dashboards a pair of eyes can monitor at once, not to mention all the other potential sources of alerts.

These companies know that hiring another 1,000 people to monitor 100,000 services simply isn’t realistic, because of the work involved in attending to each service. It’s a bit like repainting the Golden Gate Bridge: by the time the crew is done, the next crew is starting on the other side because the work already needs doing again. This is where AIOps can be a lifesaver, but it will require pushing into new techniques such as deep learning.

The Challenges of Using Deep Learning

Ever since deep learning became trendy, companies have been tempted to use it as the proverbial hammer for every nail. Ultimately deep learning models, like any technology, are best applied to particular types of problems (otherwise, they can create more problems than they solve).

Deep learning is a high-octane and high-risk approach in many ways, and time and time again, we see it overcomplicating processes when there may have already been a much more explainable, existing way of doing things. Time series forecasting is a great example of this: models from the sixties worked just fine, but when companies used deep learning to reinvent the wheel with them, they often found worse results at greater cost!
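To make the point about older forecasting methods concrete, here is a minimal sketch of simple exponential smoothing, a classical technique of roughly that vintage, applied to a service metric. The function names, the latency readings, and the `alpha` and `tolerance` values are all assumptions chosen for illustration, not figures from any real deployment.

```python
def exponential_smoothing(series, alpha=0.3):
    """Return one-step-ahead forecasts for every point in `series`.

    forecasts[i] is the prediction for series[i] built only from
    earlier readings. The first reading has no history, so it is
    used as its own forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    level = series[0]
    forecasts = [level]
    for value in series[1:]:
        forecasts.append(level)
        # Blend the newest reading into the running level.
        level = alpha * value + (1 - alpha) * level
    return forecasts


def flag_deviations(series, alpha=0.3, tolerance=0.25):
    """Return indices where a reading misses its forecast by more than
    `tolerance`, measured relative to the forecast."""
    forecasts = exponential_smoothing(series, alpha)
    flagged = []
    for i, (actual, predicted) in enumerate(zip(series, forecasts)):
        if predicted and abs(actual - predicted) / abs(predicted) > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical latency readings (milliseconds) for one service.
latency_ms = [100, 102, 99, 101, 100, 103, 180, 101, 100]
print(flag_deviations(latency_ms))
```

Every step here can be explained to an operations engineer in a sentence, which is the kind of explainability that is easy to lose when the same job is handed to a deep learning model.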

AIOps in Action with Deep Learning

However, there is a place for deep learning — it just needs to be implemented in the right place within AIOps. One of the most advanced AIOps approaches is training a model on the prior history of a given IT service, unlocking value from the vast data assets that IT operations typically collect, especially in unstructured or semi-structured log files.

By training models on the history of service failures, companies can monitor degradation and get early warnings of a likely upcoming failure long before the dashboard turns red, the issue becomes a problem, and the engineers become overloaded with alert fatigue.
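One way to picture the data preparation behind this idea is sketched below: a service's metric history is cut into sliding windows, and each window is labelled by whether a failure followed shortly afterwards. This is only the framing step that a model, deep or otherwise, would then be trained on. The function name, the `window` and `horizon` parameters, and the sample error rates are hypothetical.

```python
def make_training_windows(metric, failure_indices, window=4, horizon=3):
    """Turn a metric history into (features, label) training pairs.

    features: the `window` most recent readings, ending at time t.
    label: 1 if a failure is recorded within the `horizon` steps
           after t, otherwise 0.
    """
    failures = set(failure_indices)
    examples = []
    # Stop early enough that every window has a full horizon ahead of it.
    for end in range(window - 1, len(metric) - horizon):
        features = metric[end - window + 1 : end + 1]
        upcoming = range(end + 1, end + horizon + 1)
        label = int(any(t in failures for t in upcoming))
        examples.append((features, label))
    return examples


# Hypothetical error counts per interval; the service failed at index 9.
error_rate = [1, 1, 2, 1, 2, 3, 5, 8, 13, 40, 2, 1]
examples = make_training_windows(error_rate, failure_indices=[9])

for features, label in examples:
    print(features, "->", "failure ahead" if label else "healthy")
```

The windows labelled 1 capture the slow degradation before the failure, which is the signal an early-warning model would learn to recognise while the dashboard is still green.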

It’s this highly proactive AIOps-driven approach to creating more relevant signals in the first place that gets telecoms IT departments away from the scattergun approach and lets them make headway. Using deep learning for what it’s good at, these teams are able to gain enhanced visibility of the very complex processes that may lead to failure in a system, or even the series of events that may lead to customer churn, but without generating alert fatigue.

Deep learning models are able to manage the hundreds, if not thousands, of variables that may determine these processes. They excel where other modelling approaches quickly become too complex, and where humans can’t specify upfront what the model should take into consideration.

From a strategic point of view, IT teams know they need to break their IT estates into smaller, more manageable pieces, but may not have appreciated that data scientists could help them get there. If companies can get to this data, organise it and use it in this way, they can use the years of service performance data they have been recording all along, but which has probably only ever been used to generate alerts. With AIOps, that data can be repurposed for something greater: increasing knowledge about the health of the IT estate as a whole.

When companies begin to apply these technologies this way, they can also begin to understand which services are most critical. They can then focus on making those services faster and more resilient, and solve the underlying business problems faster. And, as they get to practice using more complex techniques such as deep learning to improve their service delivery, the cost of each additional deployed model will decrease over time, especially if they can shift the intensive-but-temporary computing workloads required to train deep learning models to a cloud service.

It’s about using the technology at hand for what it’s good at. The diversity of IT services today is so immense that it is already impossible for any one person or group to manage it single-handedly. With judicious use of deep learning and other advanced techniques within AIOps, companies can manage these vast IT estates across huge networks, both on-premises and in the cloud. Telecoms is only one example of how AIOps is driving efficiencies across billing, customer service, maintenance, and infrastructure.
