On April 3, 2023, from 1:15 pm UTC to 5:20 pm UTC, Atlassian customers using the Opsgenie product to integrate with a separate Jira Service Management Cloud instance faced significant delays when creating and updating alerts from the Jira Cloud and Jira Service Management Cloud integrations in the US region. The issue was reported by our customers and also detected via internal monitoring tools.
The incident occurred because one of the Opsgenie integration components could not scale to handle the high volume of requests from Jira. This delayed the creation of alerts or Jira issues by up to 30 minutes.
The incident was mitigated by scaling the integration component, which put Atlassian systems into a known good state. The total time to resolution was about four hours and 30 minutes.
The overall impact occurred on April 3, 2023, from 1:15 pm UTC to 5:20 pm UTC. The incident caused degradation only for customers hosted in the US region.
It caused delays of up to 30 minutes in creating Opsgenie alerts from Jira issues for customers who have the Jira to Opsgenie integration enabled.
The issue was caused by a sudden spike in the volume of messages due to bulk actions. Handling such a spike requires scaling up the instances manually. Our proactive monitoring is designed to prevent delays by alerting early enough to allow manual scaling. A misconfiguration in the alert threshold and escalation policy in our monitoring system prevented us from scaling up the instances in time.
REMEDIAL ACTIONS PLAN & NEXT STEPS
We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the product’s performance and availability.
Atlassian Customer Support