On Feb 4, 2021, between 13:00 and 14:10 UTC, Opsgenie customers in our US region experienced degraded performance across multiple Opsgenie APIs. The event was triggered by a sudden increase in traffic to our APIs. The incident was detected within four minutes by our automated monitoring systems and mitigated by scaling impacted services and blocking some of the unexpected traffic. The total time to resolution was one hour and 10 minutes. During the incident, 21% of API requests failed.
The issue was triggered by an excessive number of HTTP requests hitting the Opsgenie APIs starting on Feb 4, 2021, at 13:00 UTC. Although the majority of requests were rate limited, the excessive load caused by the increase in traffic led to resource starvation across some network-level services (e.g., proxy services) and application-level services.
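Rate limiting of this kind is often implemented with a token bucket: requests above the configured rate are rejected, but even rejected requests consume some work at the proxy layer. The sketch below is purely illustrative (the rate and burst values are assumptions, not Opsgenie's actual configuration) and shows how a burst well above the limit is mostly rejected:

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch only)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = burst          # maximum bucket size
        self.tokens = float(burst)     # start full
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may proceed, False if rate limited."""
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A sudden burst of 1,000 requests against a bucket sized for ~100:
bucket = TokenBucket(rate_per_sec=100, burst=100)
results = [bucket.allow() for _ in range(1000)]
accepted = sum(results)   # roughly the burst size; the rest are rejected
```

Note that each of the ~900 rejected calls still executes the `allow` path, which is the proxy-level cost that accumulates under an extreme burst even when rate limiting is working as intended.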
At 13:04 UTC, Opsgenie customers experienced HTTP 5XX status errors as the services attempted to recover by scaling up based on preconfigured scaling policies. At 13:05 UTC, our automated monitoring systems paged our engineering and SRE teams, who were online within four minutes, at 13:09 UTC. The incident response team identified the root cause by reviewing metrics and logs and devised several countermeasures within minutes.
At 13:15 UTC, network-level services were scaled up to better handle the traffic volume. Scaling up network-level services increased the rate limiting of requests from upstream services, and some services still experienced resource starvation under the increased traffic volume, causing some instances to enter an unhealthy state.
Our services are designed to handle large bursts of traffic. These bursts are often a result of our customers experiencing incidents that generate large volumes of alert notifications reaching the Opsgenie APIs. To guard against unknown failure modes, our systems are designed to prevent scaling beyond pre-configured and tested limits. For this reason, our auto-scaling policies were unable to gracefully manage the unusual increase in traffic volume. Between 13:23 and 13:55 UTC, our incident response team scaled up the impacted services to meet the traffic demand. As new instances began receiving traffic, the load per instance dropped and unhealthy instances were evicted, returning Opsgenie services to a healthy state.
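A capped scaling policy of the kind described can be sketched as a simple function: the desired instance count tracks demand but is clamped to a pre-tested maximum, which is why operators had to raise capacity manually once demand exceeded the cap. All names and numbers below are illustrative assumptions, not Opsgenie's actual configuration:

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance, max_instances):
    """Instance count a capped auto-scaling policy would request.

    The result is clamped to max_instances, a safety limit that has been
    load-tested in advance so the fleet never scales into untested
    territory. (Hypothetical values; not Opsgenie's real configuration.)
    """
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(1, min(needed, max_instances))


# Ordinary burst: the policy scales out to meet demand.
normal = desired_instances(5_000, capacity_per_instance=500, max_instances=20)   # 10

# Unusual surge: demand implies 40 instances, but the cap holds at 20,
# so the excess load must be shed until operators raise the limit.
surge = desired_instances(20_000, capacity_per_instance=500, max_instances=20)   # 20
```

The trade-off shown here is the one the incident exposed: the cap protects against runaway scaling into unknown failure modes, at the cost of requiring a human in the loop when legitimate demand exceeds the tested ceiling.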
At 13:59 UTC, the failure rate dropped to 1%, and by 14:10 UTC errors were fully mitigated. As a precaution, additional firewall rules were implemented at 15:04 UTC to prevent a recurrence.
We have already initiated a thorough post-incident review process and compiled our corrective actions to prevent future incidents of this kind.
We know that outages impact your productivity. We are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support