Multiple product logins
Incident Report for Opsgenie
Postmortem

SUMMARY

On August 30, 2023, between 4:07 and 5:30 UTC, some customers were unable to log in to Atlassian's Cloud products using id.atlassian.com. Logged-in users were also unable to switch accounts, change passwords, or log out. Users with existing sessions were otherwise not impacted.

Between 5:32 and 6:00 UTC, traffic was incrementally moved back to a previous build, mitigating the impact for users.

The total time to resolution was one hour and 53 minutes.

IMPACT

Users were not able to log in using Atlassian's shared account management system (id.atlassian.com).

This affected users who were trying to log in to the following products: Jira, Confluence, Trello, Opsgenie, mobile apps, and ecosystem apps.

Aside from the inability to log in, there was no impact on other Atlassian products or features.

ROOT CAUSE

Multiple Set-Cookie headers were unintentionally modified so that only the last Set-Cookie header remained in the response to users' browsers. The issue was caused by a change to Network Extensions within the Edge Network. As a result, users who needed a new session could not log in. Upon login, they were redirected to log in again, and no session was created for them.
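The failure mode described above can be illustrated with a minimal sketch. It is not the actual Network Extensions code, which this report does not describe. It only assumes one plausible mechanism: a processing step that stores headers in a structure holding a single value per header name. The function names and cookie names (session_token, csrf_token, locale) are hypothetical.

```python
# Illustrative sketch only. Response headers are modeled as an ordered list of
# (name, value) pairs, which allows the same header name to repeat.

def collapse_headers(headers):
    """Buggy pass: a plain dict keeps one value per name, so every repeated
    Set-Cookie line overwrites the one before it and only the last survives."""
    collapsed = {}
    for name, value in headers:
        collapsed[name.lower()] = value
    return list(collapsed.items())


def preserve_headers(headers):
    """Correct pass: repeated header lines stay as separate entries."""
    return [(name.lower(), value) for name, value in headers]


def cookie_names(headers):
    """Return the cookie name from each Set-Cookie line, in order."""
    return [
        value.split("=", 1)[0]
        for name, value in headers
        if name.lower() == "set-cookie"
    ]


# Hypothetical login response that sets three cookies.
response_headers = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "session_token=abc123; Path=/; Secure; HttpOnly"),
    ("Set-Cookie", "csrf_token=def456; Path=/; Secure"),
    ("Set-Cookie", "locale=en; Path=/"),
]

print(cookie_names(collapse_headers(response_headers)))  # ['locale']
print(cookie_names(preserve_headers(response_headers)))  # ['session_token', 'csrf_token', 'locale']
```

In this sketch the session cookie is among those dropped, so the browser never receives it. That matches the symptom above: the next request carries no session and the user is sent back to the login page.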

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue was not detected in Atlassian's staging environment. End-to-end tests did not cover the use case of multiple Set-Cookie headers in a single response, and therefore this bug went unnoticed.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Automated tests will be put in place to validate that cookies are not being removed from responses.
  • Configuration of networking extensions will be guaranteed to be identical in staging and production to ensure errors are picked up earlier.
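A check of the kind described in the first action above might look like the following sketch. It assumes a test harness can capture the headers of the same response twice, once as sent by the origin service and once as received through the edge. Both captures are modeled as lists of (name, value) pairs, the shape that Python's http.client.HTTPResponse.getheaders() returns. The helper names and sample values are hypothetical.

```python
# Illustrative sketch only: compares Set-Cookie lines before and after the edge.

def set_cookie_values(headers):
    """Collect every Set-Cookie value, keeping repeated lines separate."""
    return [value for name, value in headers if name.lower() == "set-cookie"]


def dropped_cookies(origin_headers, edge_headers):
    """Return Set-Cookie values the origin sent that are missing at the edge."""
    received = set_cookie_values(edge_headers)
    return [v for v in set_cookie_values(origin_headers) if v not in received]


# Hypothetical captured headers for one login response.
origin = [
    ("Set-Cookie", "session_token=abc123; Path=/; Secure; HttpOnly"),
    ("Set-Cookie", "csrf_token=def456; Path=/; Secure"),
]
edge_ok = list(origin)
edge_broken = [("Set-Cookie", "csrf_token=def456; Path=/; Secure")]

print(dropped_cookies(origin, edge_ok))      # []
print(dropped_cookies(origin, edge_broken))  # ['session_token=abc123; Path=/; Secure; HttpOnly']
```

An end-to-end test could fail whenever dropped_cookies returns a non-empty list. The important detail is that the test response carries more than one Set-Cookie header, since that was the uncovered case.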

Furthermore, we typically deploy our changes progressively by cloud region to avoid broad impact, but in this case, the change was not deemed risky and was deployed to all regions. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures:

  • Changes to network extensions in the future will use progressive rollouts.
  • With staging being properly utilized, errors similar to this one will not be deployed to any production environments.
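As a rough sketch of what a progressive rollout by region involves, the loop below deploys to one region at a time, checks health, and rolls back at the first failure. The region names and the deploy, healthy, and rollback callables are hypothetical placeholders and do not represent Atlassian's deployment tooling.

```python
# Illustrative sketch only: region-by-region rollout with a health gate.

def progressive_rollout(regions, deploy, healthy, rollback):
    """Deploy to one region at a time. On the first unhealthy region, roll back
    every region touched so far and stop. Returns (succeeded, regions_touched)."""
    touched = []
    for region in regions:
        deploy(region)
        touched.append(region)
        if not healthy(region):
            for done in reversed(touched):
                rollback(done)
            return False, touched
    return True, touched


# Simulated environment: a build that breaks logins wherever it runs.
live = set()

def deploy(region):
    live.add(region)

def rollback(region):
    live.discard(region)

def healthy(region):
    return False  # stand-in for a real login health check

ok, touched = progressive_rollout(
    ["region-a", "region-b", "region-c"], deploy, healthy, rollback
)
print(ok, touched, live)  # False ['region-a'] set()
```

In the simulated run the faulty build reaches only the first region before it is rolled back, which is the property that limits how widely a breaking change is felt.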

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Sep 18, 2023 - 01:00 UTC

Resolved
Between 4:30 AM UTC and 6:00 AM UTC, we experienced issues for users attempting to log in to Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Jira Product Discovery, Compass, and Atlassian Analytics. The issue has been resolved and the service is operating normally.
Posted Aug 30, 2023 - 06:17 UTC
Investigating
We are investigating reports of intermittent login errors for Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Jira Product Discovery, Compass, and Atlassian Analytics Cloud customers. We will provide more details once we identify the root cause.
Posted Aug 30, 2023 - 05:19 UTC