Increased delays on Jira Cloud and Jira Service Management Cloud integrations while creating/updating Opsgenie alerts in US region

Incident Report for Opsgenie

Postmortem

SUMMARY

On April 3, 2023, from 1:15 pm UTC to 5:20 pm UTC, Atlassian customers using Opsgenie to integrate with a separate Jira Service Management Cloud instance experienced significant delays when creating and updating alerts through the Jira Cloud and Jira Service Management Cloud integrations in the US region. The issue was reported by our customers and was also detected by internal monitoring tools.

The incident occurred because one of the Opsgenie integration components could not scale to handle the high volume of requests from Jira. This caused delays of up to 30 minutes in creating alerts or Jira issues.

The incident was mitigated by scaling the integration component, which put Atlassian systems into a known good state. The total time to resolution was about four hours and 30 minutes.

IMPACT

The impact lasted from 1:15 pm UTC to 5:20 pm UTC on April 3, 2023. The incident degraded service only for customers hosted in the US region.

Customers who have the Jira to Opsgenie integration enabled experienced delays of up to 30 minutes when creating Opsgenie alerts from Jira issues.

ROOT CAUSE

The issue was caused by a sudden spike in message volume due to bulk actions. Handling such a spike requires scaling up the instances manually. Our proactive monitoring is meant to prevent delays by alerting early enough to allow manual scaling. However, a misconfiguration of the alert threshold and escalation policy in our monitoring system prevented us from scaling up the instances in time.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

Improving auto-scaling for integration components to handle sudden spikes in the volume of incoming messages for creating alerts via integration
Adding monitoring mechanisms to raise an alarm when volume thresholds are breached
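
As an illustration only, the two actions above can be sketched as a backlog-based scaling rule paired with a volume alarm. This is a minimal sketch under stated assumptions, not a description of Opsgenie's actual implementation: the names `ScalingPolicy`, `desired_instances`, and `should_alarm`, and every numeric value, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values below are illustrative assumptions, not real production settings.
    per_instance_rate: float = 50.0       # messages one instance processes per second
    target_drain_seconds: float = 300.0   # backlog should clear within this window
    min_instances: int = 2                # never scale below this floor
    max_instances: int = 40               # never scale above this ceiling
    alarm_backlog: int = 20_000           # backlog size that should raise an alarm


def desired_instances(backlog: int, incoming_rate: float, policy: ScalingPolicy) -> int:
    """Return the instance count needed to keep up with incoming traffic
    and drain the existing backlog within the target window."""
    needed_rate = incoming_rate + backlog / policy.target_drain_seconds
    wanted = math.ceil(needed_rate / policy.per_instance_rate)
    # Clamp the result to the configured floor and ceiling.
    return max(policy.min_instances, min(policy.max_instances, wanted))


def should_alarm(backlog: int, policy: ScalingPolicy) -> bool:
    """Return True when the backlog has reached the alarm threshold."""
    return backlog >= policy.alarm_backlog
```

Keeping the alarm threshold independent of the scaling rule means an alarm can still be raised if scaling hits its ceiling or fails to react, which is the kind of gap described in the root cause.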

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the product’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Apr 13, 2023 - 07:16 UTC

Resolved

We observed some delays while creating/updating alerts from Jira Cloud and Jira Service Management Cloud integrations in the US region.
The problem is resolved now.

Posted Apr 03, 2023 - 17:34 UTC

This incident affected: US (Incoming Integration Flow).