Delays in notification service
Incident Report for Opsgenie
Postmortem

SUMMARY

On Sep 14, 2022, between 03:36 PM and 04:26 PM UTC, Atlassian customers using the Opsgenie product received delayed notifications for up to 50 minutes. The event was triggered by a code change that upgraded a common framework. The changes included in this framework update impacted customers in both the US and EU regions. The incident was detected by the on-call developer and mitigated by reverting the latest changes, which returned Opsgenie systems to a known good state. The total time to resolution was around 50 minutes.

IMPACT

The impact on Opsgenie lasted from 03:36 PM UTC to 04:26 PM UTC on Sep 14, 2022. The service disruption was limited to US and EU region customers, who did not receive their notifications immediately but instead experienced notification delays of up to 50 minutes. In total, ~132K notifications in the US region and ~23.6K notifications in the EU region were sent with delays. Fewer than 0.6% of active customers were affected.

ROOT CAUSE

The issue was caused by an Atlassian-initiated change to upgrade a common framework. While the majority of the intended changes had been tested successfully, some accompanying changes in the framework upgrade caused the notification service to stop processing new notification requests. Instead, these notifications remained in the queues until the deployment was reverted, resulting in notification delays of up to 50 minutes for customers.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. 

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • We are improving the testing and deployment processes we follow after framework updates.
  • We are implementing new monitoring to reduce the detection and response time even further.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Sep 21, 2022 - 08:42 UTC

Resolved
This incident has been resolved.
Posted Sep 14, 2022 - 16:53 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 14, 2022 - 16:24 UTC
Identified
We have identified the problem and are working on it. We expect the notification service to return to a normal state shortly.
Posted Sep 14, 2022 - 16:08 UTC
Investigating
We are seeing delays with outbound notifications. We have identified the cause and are currently working on mitigation of this issue.
Posted Sep 14, 2022 - 16:00 UTC
This incident affected: EU (Email Notification Delivery, SMS Notification Delivery, Voice Notification Delivery, Mobile Notification Delivery) and US (Email Notification Delivery, SMS Notification Delivery, Voice Notification Delivery, Mobile Notification Delivery).