The growing complexity of modern software makes troubleshooting difficult, requiring deep knowledge and manual work across various systems. This results in slower problem-solving and less efficient operations. More and more customers need automated tools to handle routine tasks and simplify complex processes, so they can resolve issues faster and focus on delivering inovations for their customers.
Today, we’re announcing a new capability in Amazon Q Developer to investigate and remediation operational issues, which is now in preview. This generative AI-powered capability guides you through operational diagnostics and automates root cause analysis for problems in your workloads.
Here’s a quick look at how you can now use Amazon Q Developer for operational investigations.
AWS has more operational experience and scale than any other major cloud provider, delivering cloud services to customers around the world for over 17 years. AWS built this experience into Amazon Q Developer operational capabilities to create and present investigation hypotheses, and guide you through troubleshooting and remediation – capabilities that no other major cloud provider offers.
Get started with operational investigation using Amazon Q Developer
This new capability from Amazon Q Developer seamlessly integrates with Amazon CloudWatch and AWS Systems Manager, providing a unified experience while troubleshooting issues. To get started with this capability, you need to complete some prerequisites. You can learn more on the Get Started with Amazon Q Developer Operational Investigations page.
I’ve completed the setup and configured a CloudWatch alarm to monitor the metrics for my application. After receiving a notification email, I navigate to that alarm in Amazon CloudWatch. I observe that the metric has exceeded its threshold over several time periods.
With this finding, I select Investigate. Then, I have two options: Start new investigation or Add to existing investigation. Because I’m just getting started, I select Start a new investigation and provide some details and notes if necessary.
After I’ve created the investigation, I can view the details by choosing View Details on the banner.
The investigation page is divided into two main sections: the left-hand Feed panel, which contains all findings added during the investigation, and the right-hand Suggestions panel, which displays a list of finding suggestions from Amazon Q Developer to assist in the investigation.
Amazon Q Developer uses its knowledge of my AWS resources to automatically discover the relationships between them and create a topology map of the application. This makes it possible for Amazon Q Developer to follow the architecture and quickly find the component that caused an alarm, helping me get back into production faster than ever before.
As I investigate further, Amazon Q Developer proposes hypotheses based on a series of related metrics from various AWS services such as Amazon DynamoDB, AWS Lambda, Amazon Elastic Container Service (Amazon ECS) and others. I can choose Show reasoning to understand why.
One of the hypotheses suggests that the slowness is caused by throttling on a DynamoDB table, with read and write capacity units frequently exceeding the provisioned limits. I find this hypothesis makes sense, and I can Accept it, which will bring it into my Feed.
With all these findings, I can collect all the supporting data to troubleshoot this issue. In one of the hypotheses from Amazon Q Developer, I can also view suggested actions. I select View actions to understand my options for remediation.
In the Suggested actions menu, Amazon Q Developer proposes AWS Systems Manager Automation runbooks related to the hypothesis. Where applicable, it suggests automated runbooks from the AWS Systems Manager library, which includes over 400 AWS-authored and thousands of customer-authored runbooks to help remediate observed issues. Each runbook defines the actions that Systems Manager performs to help resolve the issue. Additionally, Amazon Q Developer provides relevant documentation links from AWS re:Post articles and AWS Documentation pages.
Here’s the list of suggested actions from Amazon Q Developer. I choose View runbook to understand more on how I can solve this issue by modifying DynamoDB provisioned capacity.
Here, I can read more information on this runbook. It will offer a description of the runbook, including execution history telling me if I ran this runbook successfully in this account in the past.
I can enter the required parameters as defined in the configuration. Under Execution preview segment, I can review a summary highlighting the impact on targeted resources. After confirming the details, I select Execute to implement the necessary changes for my workloads.
After running the runbook, I can see the results, which are then added to my feed.
Another feature I appreciate is the multiple ways to access this capability. For example, in my CloudWatch metrics for my AWS Lambda function, I can initiate an investigation and add findings directly. I can also select the Amazon Q Developer operational investigations icon to open the investigation panel.
This new capability from Amazon Q Developer feels like having an AWS expert available 24/7 to assist with operational troubleshooting. It lowers the barrier to operational experience and saves valuable time and effort.
Now in preview
The new capability of Amazon Q Developer to help you investigate and remediate operational issues is now in preview in the US East (N. Virginia) Region. Transform your operational investigation today and accelerate remediation with Amazon Q Developer. Visit Amazon CloudWatch documentation page to get started.
Happy troubleshooting!
— Donnie