Performance issues and outages with Cloud products
Incident Report for Opsgenie
Postmortem

SUMMARY

We understand the importance of providing reliable and consistent service to our valued customers. On July 6, 2023, from 03:52 to 15:11 UTC, we experienced an issue with an upgraded version of a third-party tool that functions as our internal artifact management system. Our monitoring system identified the incident within two minutes, but the issue still degraded the scaling capabilities of our internal hosting platform, resulting in service degradation or outages for Atlassian Cloud customers.

In response, we are taking immediate measures to enhance the stability of our system and prevent similar issues from recurring.

IMPACT

This incident affected multiple regions and products due to the diminished scaling capabilities of our internal hosting platform. In most products and offerings, customers faced reduced functionality, slower response times, and limited access to specific features.

ROOT CAUSE

The root cause of the incident was the introduction of new functionality in a third-party tool that functions as our internal artifact management system. This new functionality led to an unexpected increase in load on the primary database of the artifact system.

Upon identifying and isolating the problem, we promptly adjusted the system configuration to regain stability.

REMEDIAL ACTIONS PLAN & NEXT STEPS

Over the coming months, we will enact a temporary freeze on non-critical upgrades of the artifact management system, and we will focus our efforts on three high-priority initiatives:

  1. Enhancing system scaling: We have prioritized work to ensure that downtime in a critical infrastructure component does not affect the scaling of other components. We expect to complete this initiative within the next two months.
  2. Reducing interdependencies: We are working to mitigate the risk of cascading failures by ensuring that significant system components can operate independently when issues occur. Initiatives 1 and 2 were already in progress and have now been prioritized for completion as soon as possible.
  3. Strengthening testing procedures: Alongside these initiatives, we are introducing more stringent testing procedures than those we already have in place, to prevent potential issues in future updates. We are committed to collaborating closely with our technology partners to ensure the best possible experience for our customers.

We apologize for any inconvenience caused by this incident and appreciate your understanding. Our team is dedicated to continually improving our systems and processes to provide you with the exceptional service you deserve. Thank you for your continued support and trust in us.

Sincerely,

Atlassian Customer Support

Posted Jul 14, 2023 - 05:07 UTC

Resolved
We experienced performance issues and outages for several Atlassian Cloud products. The issue has been resolved and the services are operating normally.
Posted Jul 06, 2023 - 15:37 UTC
Monitoring
We have identified the root cause of an issue with an internal infrastructure component that has been impacting customers and multiple Cloud products, including Jira Software, Jira Service Management, and Confluence. This issue has led to a performance impact and, in some cases, outages.

We have implemented a fix to resolve the issue and recovery is in progress.
Posted Jul 06, 2023 - 13:55 UTC
Identified
We are investigating an issue with an internal infrastructure component that is impacting customers and multiple Cloud products, including Jira Software, Bitbucket, Jira Service Management, and Confluence. The impact includes degraded performance and, in some cases, outages.

Users may experience slow loading and uploading of attachments or login issues, and new customers may be unable to sign up. We have identified the root cause and are actively working on service recovery.
Posted Jul 06, 2023 - 11:18 UTC