Intermittent errors across multiple products in eu-central

Incident Report for Opsgenie

Postmortem

SUMMARY

On July 19, 2022, between 05:40 and 07:10 UTC, Atlassian customers in the EU region using Jira, Confluence, and Opsgenie experienced problems loading pages through the web UI. The incident was detected at 05:14 UTC by one of Atlassian’s automated monitoring systems. The main disruption was resolved within 16 minutes, with full recovery taking an additional 74 minutes.

IMPACT

Between 05:40 UTC and 07:10 UTC on July 19, 2022, Jira, Confluence, and Opsgenie users saw some web pages fail to load. During the 16-minute period from 06:40 UTC to 06:56 UTC, customers were unable to access the Jira, Confluence, and Opsgenie web UIs because the Atlassian Proxy (the ingress point for service requests) was unable to service most requests.

ROOT CAUSE

The issue was caused by an AWS-initiated change that degraded Elastic Block Store (EBS) volume performance to such an extent that new instance creation, and therefore auto scaling, was blocked. As a result, the products above, as well as essential internal Atlassian services, could not auto scale to meet the increasing volume of incoming service requests as the EU region came online. Once the AWS change had been rolled back, most Atlassian services recovered. Some internal services required manual scaling because unhealthy nodes prevented scaling from being initiated, which prolonged complete recovery.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity and we apologize to customers whose services were impacted during this incident. We see two main avenues to increase our resiliency during an incident where AWS auto scaling is blocked:

Implement step scaling: Simple scaling works well in most cases. In this case, however, nodes became unhealthy, simple scaling stopped responding to scaling alarms, and the service became “stuck”, unable to recover even once scaling was possible again. We are exploring the use of step scaling, as this will allow scaling even when instances become unhealthy.
Implement improved alarming to identify “stuck” scaling, to reduce the time to recovery (TTR) once scaling is available again.
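
The two items above can be illustrated with a minimal sketch. It is not Atlassian's implementation and does not call any AWS API: the names StepAdjustment, step_scaling_delta, and is_scaling_stuck, the 10-minute grace period, and the example thresholds are all hypothetical and chosen only for illustration. The first function shows the idea behind step scaling, where the size of the capacity change depends on how far the metric has breached the alarm threshold. The second shows one possible "stuck scaling" check: a scaling alarm has been firing for longer than a grace period, yet no instance has launched successfully since it started.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepAdjustment:
    """One step of a hypothetical step scaling policy.

    Bounds are offsets above the alarm threshold, not absolute metric values.
    """
    lower_bound: float            # inclusive
    upper_bound: Optional[float]  # exclusive; None means unbounded
    add_instances: int


def step_scaling_delta(metric: float, threshold: float,
                       steps: Sequence[StepAdjustment]) -> int:
    """Return how many instances to add for the current metric value."""
    breach = metric - threshold
    if breach < 0:
        return 0  # threshold not breached, so no scale-out
    for step in steps:
        below_upper = step.upper_bound is None or breach < step.upper_bound
        if breach >= step.lower_bound and below_upper:
            return step.add_instances
    return 0  # no step covers this breach size


def is_scaling_stuck(alarm_firing_since: Optional[datetime],
                     last_successful_launch: Optional[datetime],
                     now: datetime,
                     grace: timedelta = timedelta(minutes=10)) -> bool:
    """Flag a group whose scaling alarm keeps firing without any new launch.

    The 10-minute default grace period is an assumed value for illustration.
    """
    if alarm_firing_since is None:
        return False  # no active scaling alarm
    if now - alarm_firing_since < grace:
        return False  # still within the time a normal launch may take
    launched_since_alarm = (
        last_successful_launch is not None
        and last_successful_launch >= alarm_firing_since
    )
    return not launched_since_alarm


# Example policy: add 1, 2, or 4 instances depending on the breach size.
EXAMPLE_STEPS = [
    StepAdjustment(0, 10, 1),
    StepAdjustment(10, 20, 2),
    StepAdjustment(20, None, 4),
]

if __name__ == "__main__":
    start = datetime(2022, 7, 19, 5, 40, tzinfo=timezone.utc)
    print(step_scaling_delta(75.0, 60.0, EXAMPLE_STEPS))
    print(is_scaling_stuck(start, None, start + timedelta(minutes=15)))
```

In a real deployment, the inputs to a check like is_scaling_stuck would come from the alarm state and the scaling group's activity history, and a positive result would page an operator instead of waiting for the group to recover on its own.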

We are taking these immediate steps to improve the platform’s resiliency.

Thanks,

Atlassian

Posted Aug 02, 2022 - 22:59 UTC

Resolved

Between 07:00 UTC and 07:45 UTC, we experienced degraded functionality for some features in Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, and Atlassian Developer. The issue has been resolved and the service is operating normally.

Posted Jul 19, 2022 - 08:52 UTC

Monitoring

Multiple Atlassian Cloud products and add-ons were unavailable to customers in some EU regions. The issue has been resolved and we are monitoring for further impact.

Posted Jul 19, 2022 - 08:43 UTC