Advancing AIOps: Preventive operations powered by Davis AI

The 2024 CrowdStrike incident demonstrated how vulnerable society is to IT outages. A faulty software update caused widespread issues, impacting critical services globally, including airlines, banks, hospitals, and public safety systems. Despite recent advancements such as containers, Kubernetes, and platform engineering, managing enterprise software services has evidently become more complex. IT operations teams must be prepared to address and mitigate disruptions quickly, ensuring business continuity and minimizing damage.

AI, especially AIOps, has emerged as a pivotal solution, promising to avoid downtime. The 2024 State of AI Report highlights this trend, with 89% of technology leaders anticipating that AI will significantly enhance incident response by learning to automate and optimize various tasks, such as performance monitoring and workload scheduling.

Figure 1. Blue screens of death at LGA airport due to the July 2024 CrowdStrike outage. (Source: Wikimedia Commons.)

AIOps can identify and address potential issues before they become major incidents by learning from history and analyzing large amounts of data in real time. This approach improves operational efficiency and resilience, though it’s not without flaws. The complexity of IT environments and the changing nature of threats necessitate human oversight and ongoing adjustment of AIOps systems to handle unforeseen challenges and ensure optimal performance. Additionally, predictions based on historical data are reactive, solely relying on past information to anticipate future events, and can’t prevent all new or emerging issues. This limitation highlights the importance of continuous innovation and adaptation in IT operations and AIOps strategies.

“The shift from reactive to preventive operations represents the next evolution in AIOps.”
Bernd Greifeneder, CTO Dynatrace

When Dynatrace set out with Davis® AI over 10 years ago, pioneering AI-driven operations, we focused initially on problem identification before moving on to problem remediation. The next milestone in enhancing the capabilities of Davis AI—another pioneering step forward in AI-driven operations—is outright problem prevention. In this blog post, we explain how the unique combination of causal, predictive, and generative AI—augmented by the latest Davis AI advancements—is transforming how Dynatrace customers manage and optimize their IT infrastructure.

Automatic root cause detection

Modern, complex, and distributed environments generate a substantial number of events. This creates additional requirements, such as minimizing the total number of reported issues, eliminating false positives, and conducting accurate root cause analysis.

Dynatrace has a longstanding reputation for accurately analyzing root causes and identifying related events. While other methods typically rely on mere correlation and historical data analysis, we’ve further enhanced our capabilities by implementing causational analysis, which leverages contextual information automatically gathered during data ingestion and processing in addition to historical data analysis. This is achieved using Dynatrace Grail™, our causational data lakehouse, which unifies all data in an always-up-to-date topology model. By applying causal AI to incoming data in real time, Davis instantly learns and continuously adapts to new information. This facilitates more precise root cause analysis and anomaly detection, including identifying seasonal anomalies and establishing auto-adaptive thresholds.
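
To make the idea of seasonal baselines and auto-adaptive thresholds more concrete, here is a minimal Python sketch. It is not the Davis AI algorithm, which this post does not describe in detail. It assumes evenly spaced samples, a known seasonal period, and a simple median-plus-MAD rule; the function names and the factor `k` are hypothetical choices for illustration only.

```python
from statistics import median


def seasonal_thresholds(history, period, k=3.0):
    """Return one upper threshold per seasonal slot (for example, hour of day).

    history: evenly spaced samples; sample i belongs to slot i % period.
    Illustrative rule (an assumption): median + k * scaled MAD per slot.
    """
    thresholds = []
    for slot in range(period):
        values = history[slot::period]          # all samples of this slot
        center = median(values)
        mad = median([abs(v - center) for v in values])
        # 1.4826 scales the MAD to be comparable to a standard deviation
        thresholds.append(center + k * 1.4826 * mad)
    return thresholds


def is_anomalous(value, index, thresholds):
    """Compare a new sample against the threshold of its own seasonal slot."""
    return value > thresholds[index % len(thresholds)]


# Three cycles of a period-4 signal with a regular peak in slot 1.
history = [10, 50, 10, 10, 12, 52, 11, 10, 11, 51, 12, 10]
thresholds = seasonal_thresholds(history, period=4)
```

With this toy data, a value of 50 is expected in the peak slot but would be flagged in a quiet slot, which is the difference between a seasonal baseline and a single static threshold. Recomputing the thresholds as new samples arrive is what makes them adapt over time.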

Figure 2. Root cause analysis with the Problems app

When we apply Davis root cause detection within our own IT environment, it filters out over 99.9% of incoming data noise, condensing hundreds of thousands of daily system events into no more than four or five incidents that require attention from our IT operations team.

These algorithms are not limited to monitoring IT environments. At our February 2025 Dynatrace Perform session on exploratory analytics with AI-driven insights, the Performance Engineering Lead of XXXLutz (one of the world’s largest furniture retailers, operating more than 370 stores across Europe) explained how XXXLutz uses Davis AI to proactively identify critical order drops. This allows the company to respond quickly and effectively to changing market conditions and to remain responsive to the needs of its customers.

At the core of Dynatrace problem remediation stands the Problems app, an optimized view into opinionated insights, details, and context of each detected issue for Operations, SREs, and developers. It filters billions of log lines and includes the topology of each incident and its affected entities for efficient problem triaging and troubleshooting, resulting in a 56% faster mean time to repair (MTTR) for critical incidents.

With the latest release, we drive this further by improving the automatic connection of relevant log and trace data for further drill down, presenting the full context of an issue in a single view. This provides comprehensive visibility into even complex architectures, simplifying the process of examining relevant details and addressing code-level issues, reducing 100 clicks and manual filtering to a single click with no loss of context.

Figure 3. Comparative analysis of multiple problems with Davis CoPilot

By utilizing Davis CoPilot™, you can conduct comparative analyses of multiple issues, obtain natural language summaries of individual problems, and receive contextual recommendations along with specific remediation steps.

You can also link troubleshooting guides created in Notebooks to remediated issues, thereby building an intelligent knowledge base. Davis automatically connects additional documents as well as stored workflows. So the next time a similar problem arises, Davis brings up related guides, enabling teams to learn from previous experiences and reducing the risk of knowledge loss.

Figure 4. Harness your collective knowledge by connecting troubleshooting guides

Please refer to our recent blog posts for more information on utilizing Problems for AI-driven insights and the latest Davis CoPilot advancements.

Automating the remediation

While obtaining comprehensive insights is beneficial, true transformation occurs through the use of tools that automatically execute remediation steps. To implement these “AI-driven operations,” it’s essential to forecast future requirements, including capacity demands, potential system failures, and security incidents.

Traditional forecasting engines typically depend on historical data stored in metrics. In contrast, Davis AI generates real-time predictions, facilitating proactive operations. This is possible because Davis can process raw data, such as logs, for forecasting, leveraging Grail to execute queries that were previously unattainable.

Consider the following scenario with a traditional toolchain: You begin by retrieving and analyzing logs to identify relevant values for automation. Once this task is complete, you proceed to your pipelining tool to configure ingestion rules that extract these values into metrics. You then wait several weeks for your prediction engine to generate alerts that can serve as triggers for your workflows.

However, when utilizing Dynatrace with its integrated anomaly detection and forecasting capabilities, you gain the advantage of schema-less data analysis and the ability to process any raw data into time series in real time. This significantly reduces the time required to establish AIOps workflows from several weeks to less than 30 minutes.
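
As a rough illustration of what "processing raw data into time series" means, the following Python sketch counts error log lines per minute without first defining an extraction rule or a metric. It is only a stand-in for the concept: it does not use Grail or any Dynatrace API, and the log line layout (ISO timestamp, level, message), the pattern, and the function name are assumptions made for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed log layout: "<ISO timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def errors_per_minute(lines):
    """Turn raw log lines into a sorted (minute, error_count) time series."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match or match.group("level") != "ERROR":
            continue  # skip unparseable lines and non-error records
        ts = datetime.fromisoformat(match.group("ts"))
        # Truncate to the start of the minute to form the time bucket
        counts[ts.replace(second=0, microsecond=0)] += 1
    return sorted(counts.items())


raw_logs = [
    "2025-03-01T10:00:05 ERROR payment failed",
    "2025-03-01T10:00:40 ERROR payment failed",
    "2025-03-01T10:01:10 INFO order accepted",
    "2025-03-01T10:01:30 ERROR upstream timeout",
    "garbage",
]
series = errors_per_minute(raw_logs)
```

The resulting series can be fed directly into anomaly detection or forecasting, which is the step that otherwise requires configuring ingestion rules and waiting for metric history to accumulate.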

Preventive operations

The complexity of modern software environments makes it challenging to determine a service’s reliability solely through testing. It’s impractical to emulate scenarios such as generating a million tickets to assess performance capabilities. This necessitates real-time insights and operations rather than reactive problem-solving or raising alerts to notify personnel.

Preventive operations address this need by enabling proactive corrective actions before issues arise, akin to predictive maintenance. AI-supported anomaly detection identifies parameters that deviate from the norm, allowing for automatic configuration adjustment to mitigate potential problems preemptively.
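
The pattern of acting before a limit is reached can be sketched in a few lines of Python. This is a simplified illustration, not how Davis predictive AI works: it assumes a plain least-squares linear trend, evenly spaced samples, and hypothetical names such as `preventive_action`, `limit`, and `horizon`.

```python
def linear_forecast(values, steps_ahead):
    """Extrapolate a least-squares linear trend steps_ahead samples forward."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def preventive_action(usage_percent, limit=90.0, horizon=6):
    """Decide on an action before the limit is actually breached.

    Returns "scale_up" if the forecast reaches the limit within the
    horizon, otherwise "none". Both action names are placeholders.
    """
    predicted = linear_forecast(usage_percent, horizon)
    return "scale_up" if predicted >= limit else "none"


rising = [60, 64, 68, 72, 76]   # current usage is still below the limit
flat = [60, 60, 60, 60]
```

In the rising series, current usage of 76% would not trigger a static 90% alert, yet the trend reaches the limit within the horizon, so the corrective action can run before any problem exists.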

Figure 5. Dynatrace offers the only unified, AI-powered platform for all data, all teams, and all possibilities.

Davis CoPilot combines the “power of three”:

  • Davis causal AI for identifying anomalies and root cause analysis
  • Davis predictive AI for precise forecasting and determining when to take action
  • Generative AI capabilities that perform actions beyond simply sending notifications or restarting services

In this way, Dynatrace extends AIOps beyond traditional IT operations tasks and addresses complex scenarios, including security use cases such as threat observability. Consider the following real-world example:

At Dynatrace, we log all failed login attempts. Using seasonal baselining, we identify abnormal patterns, predict potential threats, and raise a security event. The subsequent workflow checks the IP address and generates a threat score. When the score reaches a certain threshold, a new ruleset is automatically added to the web application firewall. This entire process is fully automated and runs before a problem even occurs, reducing the response time from over an hour to a fraction of a second.
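
The shape of such a workflow can be sketched as follows. This is a hypothetical Python outline, not the actual Dynatrace workflow: the scoring rule (observed failures divided by the expected baseline), the `block_at` threshold, and the `deny <ip>` rule format are all assumptions chosen for illustration.

```python
from collections import Counter


def threat_scores(failed_logins, baseline_per_ip):
    """Score each source IP by how far it exceeds the expected baseline.

    failed_logins: iterable of source IPs, one entry per failed attempt.
    baseline_per_ip: expected failed attempts per IP in this time window
    (assumed to come from a seasonal baseline computed elsewhere).
    """
    counts = Counter(failed_logins)
    return {ip: n / baseline_per_ip for ip, n in counts.items()}


def update_waf(scores, rules, block_at=5.0):
    """Append a deny rule for every IP whose score reaches the threshold."""
    for ip, score in sorted(scores.items()):
        rule = f"deny {ip}"
        if score >= block_at and rule not in rules:
            rules.append(rule)
    return rules


# Documentation-range IPs; 12 failures from one source, 2 from another.
attempts = ["203.0.113.7"] * 12 + ["198.51.100.4"] * 2
scores = threat_scores(attempts, baseline_per_ip=2.0)
waf_rules = update_waf(scores, [])
```

Because every step is a function of the incoming events, the whole chain can run automatically as soon as the abnormal pattern is detected, with no manual triage in between.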

In another instance, automatic log pattern analysis crawling our application logs decreased the number of bugs in the production environment by 15% and freed up time previously spent on log analysis and triaging (in pre-prod), equivalent to 17 full-time employees. Consequently, these 17 developers can now dedicate their efforts to adding more value to Dynatrace.

Summary

The State of AI report states that over 88% of technology leaders anticipate AI will enhance incident responses and improve their teams’ ability to predict and proactively resolve service-affecting issues.

With Dynatrace, organizations are prepared to evolve their ITOps and SRE departments from troubleshooting to prevention, getting proactive with forecasting and utilizing generative AI rather than focusing purely on history-based root cause analysis.

Start your preventive operations journey with smart automation and auto-remediation that prevents larger issues.
