Monitoring best practices for event delivery with Amazon EventBridge

This post is written by Maximilian Schellhorn, Senior Solutions Architect and Michael Gasch, Senior Product Manager, EventBridge

Amazon EventBridge is a serverless event router that allows you to decouple your applications, using events to communicate important changes between event producers and consumers (targets). With EventBridge, producers publish events through an event bus, where you can configure rules to filter, transform, and route your events to a variety of targets such as AWS Lambda functions, Amazon Kinesis Data Streams, and public HTTPS endpoints (API destinations).
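
In practice, producers publish events to an event bus with the PutEvents API. The following is a minimal sketch using the AWS SDK for Python (boto3); the bus name, source, and event fields are placeholders:

    import json

    import boto3

    events = boto3.client("events")

    # Publish a single event to a custom event bus (names and payload are placeholders).
    response = events.put_events(
        Entries=[
            {
                "EventBusName": "demo-bus",
                "Source": "com.example.orders",
                "DetailType": "OrderCreated",
                "Detail": json.dumps({"orderId": "1234", "amount": 42}),
            }
        ]
    )

    # FailedEntryCount > 0 indicates entries that EventBridge did not accept.
    print(response["FailedEntryCount"])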

In event-driven architectures, the flow of sending and receiving events is asynchronous. There is no direct feedback to the producer when targets are invoked or if the invocation was successful. Therefore, to make sure business logic executes reliably in event-driven applications, it’s essential to get an understanding of your event delivery behavior with metrics, such as the number of delivery retries, failed delivery attempts, and the time it takes to deliver events. These metrics allow you to monitor the health of your event-driven architectures, and understand and mitigate event delivery issues caused by underperforming, undersized or unresponsive targets.

This post discusses how to monitor event delivery with EventBridge metrics to detect common event delivery issues and increase the reliability of your event-driven architectures on AWS.

Background

EventBridge is a multi-tenant system that handles more than 2.6 trillion events per month as of February 2024. EventBridge maintains fairness and availability under high load using mechanisms to detect and isolate noisy neighbors. As part of the AWS shared responsibility model, you are responsible for monitoring and responding to target-related issues to achieve reliable event delivery. For example, an under-provisioned Kinesis data stream or a throttled API destination target leads to delivery retries, delays, and failures.

Solution overview

EventBridge provides a variety of metrics to observe, troubleshoot, and optimize event delivery. For example, counter-based metrics such as InvocationAttempts, SuccessfulInvocationAttempts, RetryInvocationAttempts, and FailedInvocations allow you to observe throttling and calculate error rates. Latency-based metrics such as IngestionToInvocationSuccessLatency provide insights into event delivery and delays.

In the following sections, we demonstrate the behavior of these metrics through an example application and discuss best practices for reliable event delivery. The example is composed of three key components, as numbered in the following architecture:

  1. An HTTP load generator to simulate different load patterns through Amazon API Gateway.
  2. An EventBridge event bus and a rule with an API destination target, throttled at 50 requests per second to simulate an under-scaled resource.
  3. A dead-letter queue (DLQ) that retains events whose delivery fails permanently (see the configuration sketch after the architecture diagram).

Example application architecture
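
For reference, the throttled API destination and the DLQ target from this architecture could be provisioned roughly as follows with the AWS SDK for Python (boto3). All names, endpoints, and ARNs are placeholders, the custom event bus is assumed to already exist, and the original example may well be deployed differently, for instance with infrastructure as code:

    import boto3

    events = boto3.client("events")

    # Connection holding the API key for the HTTPS endpoint (values are placeholders).
    connection = events.create_connection(
        Name="demo-connection",
        AuthorizationType="API_KEY",
        AuthParameters={
            "ApiKeyAuthParameters": {"ApiKeyName": "x-api-key", "ApiKeyValue": "secret"}
        },
    )

    # API destination throttled at 50 invocations per second to simulate an
    # under-scaled target.
    destination = events.create_api_destination(
        Name="demo-api-destination",
        ConnectionArn=connection["ConnectionArn"],
        InvocationEndpoint="https://api.example.com/events",  # placeholder endpoint
        HttpMethod="POST",
        InvocationRateLimitPerSecond=50,
    )

    # Rule on the custom bus, with the API destination as target and an SQS DLQ
    # so permanently undeliverable events are retained.
    events.put_rule(
        Name="demo-rule",
        EventBusName="demo-bus",
        EventPattern='{"source": ["com.example.loadgen"]}',  # placeholder pattern
    )
    events.put_targets(
        Rule="demo-rule",
        EventBusName="demo-bus",
        Targets=[
            {
                "Id": "api-destination",
                "Arn": destination["ApiDestinationArn"],
                "RoleArn": "arn:aws:iam::123456789012:role/demo-invoke-role",  # placeholder
                "DeadLetterConfig": {
                    "Arn": "arn:aws:sqs:us-east-1:123456789012:demo-dlq"  # placeholder
                },
            }
        ],
    )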

The load generator creates varying load over multiple phases. To observe the number of incoming events, use the EventBridge metrics MatchedEvents or TriggeredRules on the rule name dimension, as illustrated in the following graph.

Number of incoming events visualized in CloudWatch Metrics
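
As a minimal sketch, assuming the AWS SDK for Python (boto3) and a placeholder rule name, the same MatchedEvents data can also be retrieved programmatically from the AWS/Events namespace:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Sum of matched events per minute for one rule over the last hour.
    # Depending on your setup, you may also need the EventBusName dimension.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Events",
        MetricName="MatchedEvents",
        Dimensions=[{"Name": "RuleName", "Value": "demo-rule"}],  # placeholder rule
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=["Sum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"])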

The following use cases focus on monitoring event delivery. Cases where event producers cannot publish events, for example due to permission errors or PutEvents throttling, are therefore not covered.

Use case 1: Detecting event delivery issues due to target rate limiting

In this use case, event delivery experiences retries due to an under-scaled API destination target. The API itself processes every request it receives successfully. The load generator runs in three phases:

  • First, it warms up with a low number of requests and slowly increases the load while staying below the API destination rate limit of 50 requests per second
  • In the second phase, the load generator increases to 100 requests per second, exceeding the configured invocation rate on the API destination
  • Finally, the load generator slows to 50 requests per second, and eventually finishes

The following graph was created via CloudWatch Metrics and illustrates this scenario.

Load pattern of the example application

EventBridge supports new rule name dimensions for selected metrics, making it straightforward to observe invocations (event delivery) per rule. The following metrics are recommended:

  • InvocationAttempts – The overall number of times EventBridge attempts to invoke the target, including retries
  • SuccessfulInvocationAttempts – The number of invocation attempts that were successful
  • RetryInvocationAttempts – The number of invocation attempts that originated from retries

The following graph visualizes the metrics within the first phase of the example scenario. In this phase, the load stays below the configured rate limit of the target. When events are delivered successfully without throttling or errors, InvocationAttempts and SuccessfulInvocationAttempts are equivalent and RetryInvocationAttempts is 0 (the metric is only emitted if there are retries).

EventBridge metrics during the first phase without throttling or errors
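
To retrieve these per-rule metrics outside of the CloudWatch console, the following is a sketch using get_metric_data with boto3. The rule name is a placeholder, the RuleName dimension key is an assumption based on the rule name dimension described above, and the one-hour window with a per-minute period is an arbitrary choice:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def rule_metric(query_id, metric_name, rule_name="demo-rule"):
        """Build a per-rule metric query (rule name is a placeholder)."""
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Events",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "RuleName", "Value": rule_name}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        }

    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            rule_metric("attempts", "InvocationAttempts"),
            rule_metric("successes", "SuccessfulInvocationAttempts"),
            rule_metric("retries", "RetryInvocationAttempts"),
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
    )

    # Print the total for each metric over the queried window.
    for result in response["MetricDataResults"]:
        print(result["Id"], sum(result["Values"]))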

In the second phase (06:55), the load generator creates more events than the target can handle, exceeding the API destination invocation rate limit. This is reflected in the graph by InvocationAttempts and MatchedEvents increasing, while SuccessfulInvocationAttempts stays at the configured API destination rate limit. At the beginning of the phase, RetryInvocationAttempts is 0 because retries due to rate limiting from API destinations are not immediately executed, but delayed with exponential backoff. After the delay, RetryInvocationAttempts starts increasing (06:58), as shown in the following graph.

EventBridge metrics during the ramp-up phase of the load generator

Because InvocationAttempts also includes retries, the overall number of InvocationAttempts is higher than the incoming MatchedEvents.

Lastly, during the cool-down period, when the number of incoming events decreases significantly (07:03), more retry attempts succeed, and therefore InvocationAttempts and RetryInvocationAttempts decrease. Even though there are no new incoming events (07:05), events are still being retried and eventually complete (07:14).

EventBridge metrics during the cool-down phase of the load generator

Based on the observations during this scenario, you can calculate a custom SuccessfulInvocationRate metric. If you consider retries a first sign of a degraded system state, calculate this rate as SuccessfulInvocationAttempts/InvocationAttempts, for example with metric math in Amazon CloudWatch. Depending on your requirements, you can set up CloudWatch alarms to send notifications when a certain threshold is breached.

Custom SuccessfulInvocationRate metric generated with CloudWatch metric math
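
The following is a sketch of such an alarm created with boto3's put_metric_alarm and a metric math expression. The rule name, 95% threshold, and evaluation periods are placeholder values to adapt to your workload:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the success rate drops below 95% for three consecutive minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="demo-rule-successful-invocation-rate",
        ComparisonOperator="LessThanThreshold",
        Threshold=0.95,
        EvaluationPeriods=3,
        TreatMissingData="notBreaching",
        Metrics=[
            {
                "Id": "attempts",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Events",
                        "MetricName": "InvocationAttempts",
                        "Dimensions": [{"Name": "RuleName", "Value": "demo-rule"}],
                    },
                    "Period": 60,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {
                "Id": "successes",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Events",
                        "MetricName": "SuccessfulInvocationAttempts",
                        "Dimensions": [{"Name": "RuleName", "Value": "demo-rule"}],
                    },
                    "Period": 60,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {
                # Metric math expression evaluated by the alarm.
                "Id": "success_rate",
                "Expression": "successes / attempts",
                "Label": "SuccessfulInvocationRate",
                "ReturnData": True,
            },
        ],
    )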

Although an occasional decrease in SuccessfulInvocationRate due to temporary traffic spikes or invocation errors can be considered normal, a persistent mismatch indicates a misconfigured target and needs to be addressed as part of the shared responsibility model.

Use case 2: Detecting and handling event delivery failures

By default, EventBridge retries delivering an event for 24 hours and up to 185 times. After all retry attempts are exhausted, the event is dropped or sent to a DLQ. See Using dead-letter queues to process undelivered events in EventBridge for more information on how to configure a DLQ with EventBridge. These events can be visualized through the FailedInvocations or InvocationsSentToDlq metrics. Because FailedInvocations doesn’t consider retries that eventually succeed as failed invocations, this metric wasn’t visible in the previous example.
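
The retry policy and DLQ are configured on the rule target. As a sketch with boto3's put_targets, using placeholder names and ARNs and the reduced limit of three retries used in the next example:

    import boto3

    events = boto3.client("events")

    # Limit retries to three attempts and cap the event age at one hour;
    # events with exhausted retries are sent to the SQS DLQ.
    events.put_targets(
        Rule="demo-rule",
        EventBusName="demo-bus",
        Targets=[
            {
                "Id": "api-destination",
                "Arn": "arn:aws:events:us-east-1:123456789012:api-destination/demo-api-destination/abcd1234",  # placeholder
                "RoleArn": "arn:aws:iam::123456789012:role/demo-invoke-role",  # placeholder
                "RetryPolicy": {
                    "MaximumRetryAttempts": 3,
                    "MaximumEventAgeInSeconds": 3600,
                },
                "DeadLetterConfig": {
                    "Arn": "arn:aws:sqs:us-east-1:123456789012:demo-dlq"  # placeholder
                },
            }
        ],
    )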

The following graph represents the same application and load pattern, but the EventBridge rule is configured with a maximum of three retries. During the first phase, there are no failed attempts because the load stays below the throttling limit.

EventBridge metrics with FailedInvocations after the maximum retries are exceeded

In the second phase, you can observe FailedInvocations starting once the configured maximum of three retries is exhausted. Because the example application has a DLQ configured, InvocationsSentToDlq provides the same insight and can be used for alerting.

If you’re experiencing a large number of FailedInvocations or InvocationsSentToDlq, it’s recommended to investigate whether the target is properly scaled and able to handle the given traffic. For cases where retries are expected, configure the retry policy accordingly.
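
For example, a simple CloudWatch alarm on InvocationsSentToDlq could notify an SNS topic as soon as any event lands in the DLQ. The rule name and topic ARN below are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alert when at least one event is sent to the DLQ within a one-minute period.
    cloudwatch.put_metric_alarm(
        AlarmName="demo-rule-events-sent-to-dlq",
        Namespace="AWS/Events",
        MetricName="InvocationsSentToDlq",
        Dimensions=[{"Name": "RuleName", "Value": "demo-rule"}],
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:demo-alerts"],  # placeholder topic
    )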

Use case 3: Detecting event delivery delays

The metrics outlined in the previous scenarios provided an overview of how to monitor your event delivery by the total number of retries or failures during a given time period. However, EventBridge also provides a metric that lets you observe the end-to-end latency (the time it takes from event ingestion to successful delivery to the target).

This can be achieved with the new IngestionToInvocationSuccessLatency metric. This metric surfaces the effects of retries and delayed delivery, for example due to timeouts or slow responses from targets. In the following graph, you can observe the 50th and 99th percentiles (p50 and p99) of IngestionToInvocationSuccessLatency on the right Y-axis. During the second phase of the load generator, where invocations exceed the number of events the target can process, retries occur. Therefore, the overall time until events are delivered successfully to the target increases to almost 10 minutes (597,621 ms at p99).

Combination of counter-based metrics and latency-based metrics
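
To query this latency programmatically, the following is a sketch using boto3 and the p99 statistic. The RuleName dimension is an assumption carried over from the per-rule metrics above, so adjust the dimensions to what your account actually emits:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # p99 of the end-to-end delivery latency (in milliseconds) per minute for one rule.
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                "Id": "p99_latency",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Events",
                        "MetricName": "IngestionToInvocationSuccessLatency",
                        "Dimensions": [{"Name": "RuleName", "Value": "demo-rule"}],  # placeholder
                    },
                    "Period": 60,
                    "Stat": "p99",
                },
            }
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
    )

    result = response["MetricDataResults"][0]
    for timestamp, value in zip(result["Timestamps"], result["Values"]):
        print(timestamp, f"{value:.0f} ms")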

IngestionToInvocationSuccessLatency includes the time the target takes to successfully respond to event delivery. This allows you to monitor the end-to-end latency between EventBridge and your target, and detect performance variations and degradations of targets, even when there is no target throttling or errors. For example, the following graph displays constant successful invocations while the latency increases due to longer response times of the target over a 5-minute period (starting at 09:07).

Visualization of increased target latency without errors or retries

Conclusion

In this post, we explored best practices for observing event delivery with EventBridge. By using key metrics like SuccessfulInvocationAttempts, RetryInvocationAttempts, and FailedInvocations, you can gain visibility and identify issues early. With CloudWatch metric math, you can calculate a SuccessfulInvocationRate metric, allowing you to define thresholds and alerts on a single key metric.

Furthermore, the new IngestionToInvocationSuccessLatency metric provides insights into the end-to-end event delivery latency between EventBridge and your targets, enabling you to detect and respond to performance degradation. It’s recommended to combine these key metrics into a holistic overview, such as using CloudWatch dashboards. By setting up appropriate alarms and taking a proactive approach to observability, you can mitigate event delivery problems and build resilient, scalable, event-driven applications on AWS with EventBridge. Navigate to Monitoring Amazon EventBridge to get an overview of the available metrics and how to get started.

Try these metrics out with your own use case!

To find more serverless patterns, check out Serverless Land.
