From 13:57 UTC on November 25th, 2020 to 19:50 UTC on November 27th, 2020, a portion of data synchronization within Atlassian systems was delayed by up to 54 hours, and a subset of real-time customer functionality was down for the first 15 hours of that window. The incident was caused by an outage of an AWS service that Atlassian cloud infrastructure depends on. Customers of Atlassian’s Cloud Platform observed the following impact:
Across multiple products, users experienced delays in the completion of new sign-ups, user deletions, authentication and authorization policy changes, updates to search results, propagation of product-emitted triggers to Forge apps, and activity and in-app notifications; features behind a personalized rollout flag were served incorrectly; and users were unable to at-mention newly signed-up users. In addition, service was degraded for several other product capabilities.
The incident was detected within 8 minutes by our automated monitoring systems. We mitigated the impact by redirecting our internal asynchronous communication traffic from the US East region to the US West region, which put our systems back into a known good state. We restored all product functionality for customers within 15 hours, and the total time to resolution, including clearing the backlog of data synchronization, was about 54 hours and 19 minutes.
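To make the mitigation concrete, the redirection can be thought of as a configuration-driven choice of which region's Kinesis stream producers write to. The following is a minimal sketch assuming a hypothetical stream name and environment-variable override; it is not Atlassian's actual tooling:

```python
import os
import boto3

# Hypothetical names for illustration; the real streams, regions, and
# configuration mechanism are internal to Atlassian.
DEFAULT_REGION = "us-east-1"
ACTIVE_REGION = os.environ.get("EVENT_BUS_REGION_OVERRIDE", DEFAULT_REGION)

# During the mitigation, flipping EVENT_BUS_REGION_OVERRIDE to "us-west-2"
# would point producers at a stream in the healthy region.
kinesis = boto3.client("kinesis", region_name=ACTIVE_REGION)
kinesis.put_record(
    StreamName="enterprise-event-bus",  # assumed stream name
    Data=b'{"type": "user.signed_up", "user_id": "123"}',
    PartitionKey="123",
)
```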
The event was triggered by a significant AWS outage (https://aws.amazon.com/fr/message/11201/) lasting 14 hours in the US East region. Atlassian’s Enterprise Service Bus (ESB) is the backbone for asynchronous communication between services and systems. The ESB has a hard dependency on AWS Kinesis, which was part of the AWS outage. As a result, a significant portion of the data flow within Atlassian systems was either delayed or could not complete, because the data pipe that carries communications following a user’s activity was down. This outage impacted customers across the globe.
Atlassian has many internal systems that perform follow-up actions after a user’s interaction with our products. Examples of such follow-up actions include propagating authentication and authorization policy updates, updating our search indexes, provisioning access for new users after sign-up, and triggering automation after a data update. All of these systems rely on being informed asynchronously, via our Enterprise Service Bus, about the prior action a user has taken or a data change that has occurred. The ESB in turn depends on AWS Kinesis, a data streaming platform that carries messages from producing systems to client systems, each of which consumes the subset of messages relevant to its designated follow-up functionality. A total outage of AWS Kinesis in one of our major geographic regions, US East, therefore prevented any information from propagating through the ESB and led to a significant outage for Atlassian.
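As a rough illustration of this producer/consumer pattern (the stream name, event types, and handlers below are assumptions for the example, not our real implementation): a producer writes an event to the Kinesis-backed stream after a user action, and each downstream system reads the stream and acts only on the event types relevant to its follow-up functionality.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "enterprise-event-bus"  # assumed stream name for illustration

# Producer side: emit an event after a user action.
def publish_event(event_type: str, payload: dict) -> None:
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps({"type": event_type, "payload": payload}).encode(),
        PartitionKey=str(payload.get("user_id", "unknown")),
    )

# Consumer side: each downstream system reads the shared stream and
# handles only the subset of event types it cares about.
def consume(handled_types: set, handler) -> None:
    shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            event = json.loads(record["Data"])
            if event["type"] in handled_types:
                handler(event)
        iterator = batch.get("NextShardIterator")

# Example: a sign-up event is published, and a hypothetical search-indexing
# consumer would only react to content-update events.
publish_event("user.signed_up", {"user_id": "123"})
# consume({"content.updated"}, handler=lambda e: print("reindex", e))
```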
During the post-incident review, we identified enhancements to our technical architecture and resilience measures to counter failures of our Enterprise Service Bus and AWS Kinesis. Moving forward, to reduce our hard dependency on AWS Kinesis, we will implement automated migration of customer traffic to a Kinesis instance in another geographic region during an outage, as well as better retention of data at key stages of the data flow within our systems, so that data synchronization can recover more gracefully after an outage.
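A minimal sketch of what those two measures could look like, assuming hypothetical region names, a hypothetical stream name, and an in-memory spool for undelivered events (the eventual implementation will differ):

```python
import json
import collections
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # primary, then failover (illustrative)
STREAM = "enterprise-event-bus"       # assumed stream name for illustration
clients = {r: boto3.client("kinesis", region_name=r) for r in REGIONS}

# Retained copy of events that could not be delivered to any region,
# so data synchronization can be replayed after an outage.
undelivered = collections.deque()

def publish_with_failover(event: dict) -> bool:
    data = json.dumps(event).encode()
    for region in REGIONS:
        try:
            clients[region].put_record(
                StreamName=STREAM,
                Data=data,
                PartitionKey=str(event.get("user_id", "unknown")),
            )
            return True
        except (BotoCoreError, ClientError):
            continue  # region unhealthy; try the next one
    undelivered.append(event)  # retain the event for later replay
    return False

def replay_undelivered() -> None:
    # Called once the outage is over to clear the synchronization backlog.
    while undelivered:
        event = undelivered.popleft()
        if not publish_with_failover(event):
            undelivered.appendleft(event)
            break
```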