We observe increased error rates on API
Incident Report for Opsgenie
Postmortem

SUMMARY

On Feb 4, 2021, between 13:00 and 14:10 UTC, Opsgenie customers in our US region experienced degraded performance across multiple Opsgenie APIs. The event was triggered by a sudden increase in traffic to our APIs. The incident was detected within four minutes by our automated monitoring systems and mitigated by scaling impacted services and blocking some of the unexpected traffic. The total time to resolution was one hour and 10 minutes. During the incident, 21% of API requests failed.

ROOT CAUSE

The issue was triggered by an excessive number of HTTP requests hitting the Opsgenie APIs starting on Feb 4, 2021, at 13:00 UTC. Although the majority of these requests were rate limited, the excess load led to resource starvation across some network-level services (e.g. proxy services) and application-level services.
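To illustrate why rate limiting alone may not protect a service, here is a minimal token-bucket sketch. It is not Opsgenie's implementation: the `TokenBucket` class, the `simulate` helper, and all numeric values are assumptions chosen for illustration. The point it shows is that every request, accepted or rejected, still has to be received and evaluated, so a large enough flood can consume proxy and application resources even when most of it is turned away.

```python
class TokenBucket:
    """Hypothetical token-bucket limiter; names and values are illustrative only."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last = 0.0             # timestamp of the previous check, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is accepted."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate(requests_per_second: int, seconds: int, bucket: TokenBucket):
    """Send evenly spaced requests and count accepted vs. rejected ones.

    Every request costs one call to `allow`, whether or not it is accepted;
    that per-request work is what can starve a service under a flood.
    """
    accepted = rejected = 0
    for i in range(requests_per_second * seconds):
        if bucket.allow(i / requests_per_second):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected


if __name__ == "__main__":
    ok, limited = simulate(100, 1, TokenBucket(rate=10, capacity=10))
    print(f"accepted={ok} rate_limited={limited} handled={ok + limited}")
```

In this sketch most requests are rejected, yet the limiter still handles all of them, which mirrors the situation described above where rate-limited traffic still produced load.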

At 13:04 UTC, Opsgenie customers experienced HTTP 5XX status errors as the services attempted to recover by scaling up based on preconfigured scaling policies. At 13:05 UTC, our automated monitoring systems paged our engineering and SRE teams, who were online within 4 minutes, at 13:09 UTC. The incident response team determined the root cause by reviewing metrics and logs and identified several countermeasures within minutes.

At 13:15 UTC, network-level services were scaled up to better handle the traffic volume. Scaling up network-level services resulted in increased rate limiting of requests from upstream services. Some services also experienced resource starvation as a result of the increase in traffic volume, causing some instances to enter an unhealthy state.

Our services are designed to handle large bursts of traffic. These bursts are often a result of our customers experiencing incidents that send large volumes of alert notifications to the Opsgenie APIs. To protect against unknown failure modes, our systems are designed to prevent scaling beyond preconfigured and tested limits. For this reason, our auto-scaling policies were unable to gracefully manage the unusual increase in traffic volume. Between 13:23 and 13:55 UTC, our incident response team scaled up the impacted services to meet the traffic demand. As new instances started to receive traffic, the load per instance dropped and unhealthy instances were evicted, and Opsgenie services returned to a healthy state.
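The effect of a scaling ceiling can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual Opsgenie scaling policy: the function names, the target load per instance, and the instance limits are all hypothetical. It shows how a policy that clamps the instance count to a tested maximum leaves each instance overloaded once demand exceeds what that maximum can serve, and how raising the limit brings the per-instance load back down.

```python
import math


def desired_instances(current_load: float,
                      target_load_per_instance: float,
                      min_instances: int,
                      max_instances: int) -> int:
    """Instances needed to hit the target load, clamped to configured limits.

    All parameters are hypothetical; real policies also consider cooldowns,
    health checks, and warm-up time, which this sketch ignores.
    """
    if target_load_per_instance <= 0:
        raise ValueError("target_load_per_instance must be positive")
    needed = math.ceil(current_load / target_load_per_instance)
    return max(min_instances, min(needed, max_instances))


def load_per_instance(current_load: float, instances: int) -> float:
    """Average load each instance carries when traffic is spread evenly."""
    return current_load / instances


if __name__ == "__main__":
    target = 100.0  # assumed requests per second one instance handles comfortably

    # A normal burst fits under the ceiling, so the policy absorbs it.
    n = desired_instances(1000, target, min_instances=2, max_instances=20)
    print(n, load_per_instance(1000, n))

    # An unusual spike hits the ceiling; each instance runs above target.
    n = desired_instances(5000, target, min_instances=2, max_instances=20)
    print(n, load_per_instance(5000, n))

    # Raising the ceiling (a manual scale-up) restores the target load.
    n = desired_instances(5000, target, min_instances=2, max_instances=60)
    print(n, load_per_instance(5000, n))
```

The ceiling is a deliberate trade-off: it keeps the system within tested limits at the cost of requiring manual intervention when traffic exceeds what those limits can absorb.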

At 13:59 UTC, the failure rate dropped to 1% and by 14:10 UTC errors were mitigated completely. As a precaution, additional firewall rules were implemented at 15:04 UTC to prevent a recurrence.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We have already initiated a thorough post-incident review process and compiled our corrective actions to prevent future incidents of this kind.

  • We will introduce auto-scaling for upstream services above the existing pre-provisioned capacity to prevent future incidents resulting from resource starvation.
  • We will enhance our monitoring to reduce the detection and response time even further.

We know that outages impact your productivity. We are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Feb 11, 2021 - 12:28 UTC

Resolved
This issue has been resolved and all services are operating normally. The problem was mitigated at 14:10 UTC. We will provide a detailed impact analysis.
Posted Feb 04, 2021 - 15:28 UTC
Monitoring
The issue has been resolved and we are closely monitoring the services.
Posted Feb 04, 2021 - 14:33 UTC
Identified
The problem is related to a customer sending a massive amount of traffic and causing degraded performance.
We have rate-limited the customer's requests and are observing recovery at the moment.
Posted Feb 04, 2021 - 14:18 UTC
Investigating
We are investigating the increased error rates and we'll be providing more updates shortly.
Posted Feb 04, 2021 - 13:41 UTC
This incident affected: EU (Incident REST API, Alert REST API, Heartbeat REST API) and US (Incident REST API, Alert REST API, Heartbeat REST API).