From 13:57 UTC on November 25th, 2020 to 19:50 UTC on November 27th, 2020, a portion of data synchronization within Atlassian systems was delayed by up to 54 hours, and a subset of real-time customer functionality was down for the first 15 hours of that window. The incident was caused by an outage of an AWS service that Atlassian cloud infrastructure depends on. Customers of Atlassian’s Cloud Platform observed the following impact:
Across multiple products, users experienced delays in the completion of new sign-ups, user deletions, authentication and authorization policy changes, updates to search results, propagation of product-emitted triggers to Forge apps, and activity and in-app notifications; features behind a personalized rollout flag were served incorrectly; and users were unable to at-mention newly signed-up users. In addition, service was degraded for several other product capabilities.
The incident was detected within 8 minutes by our automated monitoring systems. We mitigated the impact by redirecting our internal asynchronous communication traffic from the US East region to the US West region, which put our systems back into a known good state. We restored all product functionality for customers within 15 hours, and the total time to resolution, including clearing the backlog of data synchronization, was about 54 hours and 19 minutes.
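To make the mitigation concrete, the redirection can be thought of as a configuration-driven choice of which region's Kinesis stream producers write to. The following is a minimal sketch assuming a hypothetical stream name and environment-variable override; it is not Atlassian's actual tooling:

```python
import os
import boto3

# Hypothetical names for illustration; the real streams, regions, and
# configuration mechanism are internal to Atlassian.
DEFAULT_REGION = "us-east-1"
ACTIVE_REGION = os.environ.get("EVENT_BUS_REGION_OVERRIDE", DEFAULT_REGION)

# During the mitigation, flipping EVENT_BUS_REGION_OVERRIDE to "us-west-2"
# would point producers at a stream in the healthy region.
kinesis = boto3.client("kinesis", region_name=ACTIVE_REGION)
kinesis.put_record(
    StreamName="enterprise-event-bus",  # assumed stream name
    Data=b'{"type": "user.signed_up", "user_id": "123"}',
    PartitionKey="123",
)
```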
The event was triggered by a significant AWS outage (https://aws.amazon.com/fr/message/11201/) lasting 14 hours in the US East region. Atlassian’s Enterprise Service Bus (ESB) is the backbone for asynchronous communication between services and systems. The ESB has a hard dependency on AWS Kinesis, which was part of the AWS outage. As a result, a significant portion of the data flow within Atlassian systems was either delayed or could not complete, because the data pipe that carries communications following a user’s activity was down. This outage impacted customers across the globe.
Atlassian has many internal systems that perform follow-up actions after a user’s interaction with our products. Examples of such follow-up actions include propagating authentication and authorization policy updates, updating our search indexes, provisioning access for new users after sign-up, and triggering automation after a data update. All of these systems rely on being informed asynchronously, via our Enterprise Service Bus, about the prior action a user has taken or a data change that has occurred. The ESB in turn depends on AWS Kinesis, a data streaming platform that carries messages from producing systems to client systems, each of which consumes the subset of messages relevant to its designated follow-up functionality. A total outage of AWS Kinesis in one of our major geographic regions, US East, therefore prevented any information from propagating through the ESB and led to a significant outage for Atlassian.
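As a rough illustration of this producer/consumer pattern (the stream name, event types, and handlers below are assumptions for the example, not our real implementation): a producer writes an event to the Kinesis-backed stream after a user action, and each downstream system reads the stream and acts only on the event types relevant to its follow-up functionality.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "enterprise-event-bus"  # assumed stream name for illustration

# Producer side: emit an event after a user action.
def publish_event(event_type: str, payload: dict) -> None:
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps({"type": event_type, "payload": payload}).encode(),
        PartitionKey=str(payload.get("user_id", "unknown")),
    )

# Consumer side: each downstream system reads the shared stream and
# handles only the subset of event types it cares about.
def consume(handled_types: set, handler) -> None:
    shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            event = json.loads(record["Data"])
            if event["type"] in handled_types:
                handler(event)
        iterator = batch.get("NextShardIterator")

# Example: a sign-up event is published, and a hypothetical search-indexing
# consumer would only react to content-update events.
publish_event("user.signed_up", {"user_id": "123"})
# consume({"content.updated"}, handler=lambda e: print("reindex", e))
```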
During the post-incident review, we identified enhancements to our technical architecture and resilience measures to counter failures of our Enterprise Service Bus and AWS Kinesis. Moving forward, to reduce our hard dependency on AWS Kinesis, we will implement automated migration of customer traffic to a Kinesis instance in another geographic region during an outage, as well as better retention of data at key stages of the data flow within our systems, so that data synchronization can recover more gracefully after an outage.
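A minimal sketch of what those two measures could look like, assuming hypothetical region names, a hypothetical stream name, and an in-memory spool for undelivered events (the eventual implementation will differ):

```python
import json
import collections
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # primary, then failover (illustrative)
STREAM = "enterprise-event-bus"       # assumed stream name for illustration
clients = {r: boto3.client("kinesis", region_name=r) for r in REGIONS}

# Retained copy of events that could not be delivered to any region,
# so data synchronization can be replayed after an outage.
undelivered = collections.deque()

def publish_with_failover(event: dict) -> bool:
    data = json.dumps(event).encode()
    for region in REGIONS:
        try:
            clients[region].put_record(
                StreamName=STREAM,
                Data=data,
                PartitionKey=str(event.get("user_id", "unknown")),
            )
            return True
        except (BotoCoreError, ClientError):
            continue  # region unhealthy; try the next one
    undelivered.append(event)  # retain the event for later replay
    return False

def replay_undelivered() -> None:
    # Called once the outage is over to clear the synchronization backlog.
    while undelivered:
        event = undelivered.popleft()
        if not publish_with_failover(event):
            undelivered.appendleft(event)
            break
```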