Creating a Custom Rovo Agent to Streamline Incident Management

Summary

Challenge: Manual investigation of DevOps incidents in complex AWS environments causes delays in root cause identification. The lack of a centralised interface leads to inefficiencies, communication gaps, and a higher risk of human error.

Solution: A custom Forge-built Rovo Agent automates root cause analysis within Jira Service Management. It connects to AWS, analyses diagnostic data, and provides clear, conversational insights while updating tickets in real time.

Outcome: This reduces resolution time, improves reliability, and enhances operational efficiency. It also increases transparency, supports better decision-making, and minimises human error during incidents.

Background

Here at Automation Consultants (AC), our team wanted to address the challenges faced by DevOps and Cloud operations teams when an AWS service or resource experiences an issue.

To tackle this, we developed a custom, Forge-built Rovo DevOps Agent – designed to consolidate Root Cause Analysis (RCA) and streamline incident management.

Challenge

When an AWS service or resource experiences an issue, it often places a heavy burden on IT teams. Engineers are forced to sift through multiple metrics and consoles to identify the root cause, a time-consuming, manual investigation that can delay incident resolution.

The additional complexity of some AWS environments, which often consist of numerous interconnected services and resources, makes it even more difficult to identify the specific component causing an issue.

This process requires expertise, patience, and a deep understanding of system architecture. Without a centralised, conversational interface for incident investigation, teams struggle to collaborate and share information effectively.

Solution

At Automation Consultants, we recognised these pain points as a significant barrier to operational efficiency and effective incident management.

The solution? An intelligent, custom Rovo Agent that streamlines root cause analysis by consolidating all relevant information into a single interface.

Built in Forge, and harnessing powerful AI, the Agent automates the initial stages of incident investigation by quickly providing engineers with a summary of the root cause and actionable next steps to resolve the issue.

How this custom Rovo Agent works

When an incident is raised in Jira Service Management (JSM), the Agent springs into action by identifying which AWS resource (such as an EC2 instance or Lambda function) is linked to the incident. It then connects to AWS through secure API calls and gathers a wealth of diagnostic data, including instance states, CloudWatch metrics, error logs, and other relevant information.
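To make the first step more concrete, here is a minimal sketch of how an incident could be matched to an AWS resource. It assumes the resource identifier appears somewhere in the ticket text, which this article does not state; the Agent may equally resolve it through linked Assets records. The function name `find_aws_resources` and the sample ticket text are hypothetical, and the patterns cover only the standard `aws` partition.

```python
import re

# EC2 instance IDs are "i-" followed by 17 (current) or 8 (legacy) hex characters.
EC2_INSTANCE_ID = re.compile(r"\bi-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")

# Lambda function ARNs: arn:aws:lambda:<region>:<12-digit account>:function:<name>
LAMBDA_ARN = re.compile(
    r"\barn:aws:lambda:[a-z0-9-]+:\d{12}:function:[A-Za-z0-9_-]+"
)


def find_aws_resources(ticket_text: str) -> list[tuple[str, str]]:
    """Return (resource_type, identifier) pairs found in free-form ticket text.

    Hypothetical helper: assumes identifiers are written out in the ticket.
    """
    found = [("ec2", match) for match in EC2_INSTANCE_ID.findall(ticket_text)]
    found += [("lambda", match) for match in LAMBDA_ARN.findall(ticket_text)]
    return found


if __name__ == "__main__":
    sample = (
        "Checkout API down. Instance i-0abc1234def567890 unreachable; "
        "arn:aws:lambda:eu-west-2:123456789012:function:order-handler timing out."
    )
    for resource_type, identifier in find_aws_resources(sample):
        print(resource_type, identifier)
```

Once a resource is identified, the diagnostic calls to AWS can be scoped to that instance or function rather than to the whole account.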

What makes this Agent unique?

This Agent does not just collect technical information; it also analyses it to detect failures and anomalies (such as a stopped EC2 instance, a spike in error rates, or a malfunctioning Lambda function).
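The kind of check described here can be illustrated with a simple rule-based sketch. This is not the Agent's actual analysis, which relies on AI and is not detailed in this article; it only shows how a stopped instance and an error-rate spike might be flagged from already-collected data. The function name `detect_anomalies`, the input shape, and the `spike_factor` threshold are all assumptions made for illustration.

```python
from statistics import mean


def detect_anomalies(
    instance_state: str,
    error_counts: list[float],
    spike_factor: float = 3.0,
) -> list[str]:
    """Flag simple anomalies in diagnostic data that has already been gathered.

    instance_state: an EC2 state name such as "running" or "stopped".
    error_counts:   error counts per period, oldest first (hypothetical shape).
    spike_factor:   how many times the baseline the latest value must exceed.
    """
    findings = []

    # Any state other than "running" is worth surfacing to the engineer.
    if instance_state != "running":
        findings.append(f"EC2 instance is {instance_state}")

    # Compare the most recent period against the average of earlier periods.
    if len(error_counts) >= 2:
        baseline = mean(error_counts[:-1])
        latest = error_counts[-1]
        # Floor the baseline at 1 so a near-zero history does not flag noise.
        if latest > spike_factor * max(baseline, 1.0):
            findings.append(
                f"Error count spiked to {latest:g} "
                f"against a baseline of {baseline:.1f}"
            )

    return findings


if __name__ == "__main__":
    for finding in detect_anomalies("stopped", [2, 3, 2, 40]):
        print(finding)
```

A fixed multiplier is the simplest possible rule; a real deployment would more likely lean on CloudWatch alarms or a statistical baseline.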

It then structures these findings into a clear, digestible summary, pinpointing the likely root cause and highlighting the affected components. This summary is delivered directly to the DevOps team through the Jira ticket, ensuring that engineers have immediate access to actionable insights without having to manually trawl through multiple AWS dashboards.
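As an illustration of what such a summary might look like in practice, the sketch below models it as a small data structure that renders to plain text. The class name `RcaSummary`, its fields, and the output layout are hypothetical; the article does not describe the Agent's real summary format.

```python
from dataclasses import dataclass, field


@dataclass
class RcaSummary:
    """Hypothetical container for a root cause analysis summary."""

    likely_root_cause: str
    affected_components: list[str]
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the summary as plain text suitable for a ticket comment."""
        lines = [f"Likely root cause: {self.likely_root_cause}"]
        lines.append("Affected components:")
        lines += [f"- {component}" for component in self.affected_components]
        if self.next_steps:
            lines.append("Suggested next steps:")
            lines += [
                f"{number}. {step}"
                for number, step in enumerate(self.next_steps, start=1)
            ]
        return "\n".join(lines)


if __name__ == "__main__":
    summary = RcaSummary(
        likely_root_cause="EC2 instance i-0abc1234def567890 is stopped",
        affected_components=["checkout-api"],
        next_steps=["Start the instance", "Review recent stop events"],
    )
    print(summary.render())
```

Keeping the root cause, affected components, and next steps as separate fields makes it easy to present the same findings in different places, such as a ticket comment or a chat reply.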

Conversational interface

This Agent also supports a range of conversational prompts, allowing users to ask questions. In response, it can produce detailed reports on the cause of service breakdowns, interpret information from JSM, Assets, and AWS, and even provide tailored recommendations for remediation.

Additionally, the Agent can update the incident ticket with new findings as more data becomes available, keeping all stakeholders informed in real time.
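One way a ticket update like this could be expressed is as a comment sent to the Jira Cloud REST API, which accepts comment bodies in Atlassian Document Format (ADF) at `POST /rest/api/3/issue/{issueIdOrKey}/comment`. The sketch below only builds the request payload; how the Agent actually writes to the ticket is not described in this article, and the helper name `build_comment_payload` is hypothetical. Authentication and the HTTP call itself are left out.

```python
import json


def build_comment_payload(summary_text: str) -> dict:
    """Wrap plain text in a minimal ADF document for a Jira comment.

    Each non-blank line becomes its own paragraph. Blank lines are skipped
    because ADF text nodes must not be empty.
    """
    paragraphs = [
        {"type": "paragraph", "content": [{"type": "text", "text": line}]}
        for line in summary_text.splitlines()
        if line.strip()
    ]
    return {"body": {"type": "doc", "version": 1, "content": paragraphs}}


if __name__ == "__main__":
    payload = build_comment_payload(
        "New finding: error rate has returned to baseline.\n"
        "Next step: confirm recovery with the service owner."
    )
    # The JSON below would form the body of the POST request.
    print(json.dumps(payload, indent=2))
```

Posting each new finding as a separate comment, rather than rewriting earlier ones, would keep a visible timeline of the investigation for stakeholders.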

Benefits

By automating the initial stages of root cause analysis, we anticipate that the Agent would dramatically reduce an incident’s mean time to resolution (MTTR).

Engineers would no longer need to manually gather and interpret data from disparate AWS consoles. Instead, they would receive a concise, accurate summary of the problem with suggested next steps, all without ever having to leave the JSM incident ticket.

This streamlined process would also improve the Agent’s accuracy in identifying problems and increase user satisfaction with its insights.

This acceleration in incident response would lead to fewer and shorter service disruptions, directly improving the reliability and availability of business-critical systems.

The reduction in manual AWS investigations would both free up valuable engineering time and minimise the risk of human error during high-pressure incidents. Teams would be able to focus their expertise on resolving the underlying issue, rather than on data gathering and initial diagnosis, further contributing to operational efficiency.

The Agent also enhances transparency and collaboration. By updating the Jira ticket with detailed findings and explanations, it ensures that all relevant stakeholders, from engineers to IT managers, have access to the same up-to-date information. This shared visibility fosters better communication, more informed decision-making, and a stronger culture of accountability.