We observe increased error rates on API

Incident Report for Opsgenie

Postmortem

SUMMARY

On March 3, 2021, between 01:29 and 03:15 UTC, Opsgenie customers were unable to access core functionality of the product. The event was triggered by a change to update the SSL certificate for *.opsgenie.com on Atlassian's Edge Network Infrastructure, which impacted customers globally. The incident was detected within 31 minutes through customer support tickets and mitigated by rolling back to the previous SSL certificate. The total time to resolution was one hour and 27 minutes.

IMPACT

The issue was caused by a failed change to renew the *.opsgenie.com SSL certificate. As a result, customers could not complete TLS handshakes with the Opsgenie API, and users received HTTP 502 errors.

ROOT CAUSE

Our mechanism for renewing the *.opsgenie.com SSL certificate failed to export the Subject Alternative Names (SANs) required to support Opsgenie API traffic. We deployed the certificate without those SANs to Atlassian's Edge Network Infrastructure, which left customers unable to perform TLS handshakes with api.opsgenie.com.
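
The failure mode described here can be caught by comparing a certificate's SAN entries against the hostnames it must serve before the certificate is deployed. The following is a minimal sketch in Python, not Atlassian's actual tooling. The function names `san_matches` and `missing_hostnames` and the `REQUIRED` list are hypothetical (only api.opsgenie.com is named in this report). It assumes the common matching rule that a wildcard covers exactly one left-most label.

```python
def san_matches(pattern: str, hostname: str) -> bool:
    """Return True if one SAN DNS entry covers the hostname.

    A wildcard is honoured only as the whole left-most label and
    matches exactly one label, so "*.opsgenie.com" covers
    "api.opsgenie.com" but not "opsgenie.com" or "a.b.opsgenie.com".
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    head, sep, tail = hostname.partition(".")
    return bool(head) and sep == "." and tail == pattern[2:]


def missing_hostnames(san_entries, required):
    """List the required hostnames that no SAN entry covers."""
    return [
        host for host in required
        if not any(san_matches(entry, host) for entry in san_entries)
    ]


# Hypothetical list of hostnames a renewed certificate must keep serving.
REQUIRED = ["api.opsgenie.com", "app.opsgenie.com"]
```

A renewal pipeline could call `missing_hostnames(exported_sans, REQUIRED)` on the exported certificate's SAN list and refuse to deploy when the result is non-empty.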

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of tests and preventative processes in place, this specific issue wasn't identified because the change was related to a very specific kind of certificate using wildcard entries in the SAN. This was not picked up by our automated continuous deployment suites or our manual test scripts. Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact, but in this case our deployment and incident detection did not work as expected.
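
For readers unfamiliar with progressive deployment, the idea can be sketched as a loop that deploys to one region, checks its health, and stops before touching the next region if the check fails. This is a generic illustration in Python under stated assumptions, not a description of Atlassian's deployment system: the function `progressive_rollout` and its `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical.

```python
from typing import Callable, Iterable, List


def progressive_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy region by region and stop at the first unhealthy one.

    Returns the regions that were deployed and passed the health check.
    On a failed check, every region touched so far is rolled back, most
    recent first, the remaining regions are never reached, and an empty
    list is returned.
    """
    touched: List[str] = []
    for region in regions:
        deploy(region)
        touched.append(region)
        if not is_healthy(region):
            for done in reversed(touched):
                rollback(done)
            return []
    return touched
```

The approach limits impact only if the health check exercises the path that the change can break. A check that never performs a TLS handshake against the affected hostname would not stop a rollout of a certificate with missing SANs.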

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Adding tests to SSL certificate renewals that will prevent the deployment of incorrect certificates.
Extending test coverage of external dependencies and inbound endpoints.
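
One way to test an inbound endpoint is to complete a verified TLS handshake against it, the same way a customer's client would, and inspect the SANs the server presents. The sketch below uses only Python's standard `ssl` and `socket` modules. The function names `dns_names` and `probe_tls` are hypothetical, and the example is an assumption about what such a test could look like, not the test Atlassian introduced.

```python
import socket
import ssl


def dns_names(peer_cert: dict) -> list:
    """Extract DNS SAN entries from the dict returned by getpeercert()."""
    return [
        value
        for kind, value in peer_cert.get("subjectAltName", ())
        if kind == "DNS"
    ]


def probe_tls(hostname: str, port: int = 443, timeout: float = 5.0) -> list:
    """Complete a verified TLS handshake and return the served DNS SANs.

    Raises ssl.SSLCertVerificationError when the certificate does not
    cover the hostname, and OSError on connection problems.
    """
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return dns_names(tls.getpeercert())
```

Running `probe_tls("api.opsgenie.com")` from outside the network after each regional deployment would surface a hostname mismatch as an exception rather than as customer support tickets.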

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve Opsgenie's performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 10, 2021 - 23:36 UTC

Resolved

The issue has now been resolved and all services are operating normally.

Posted Mar 03, 2021 - 03:26 UTC

Monitoring

A fix has been deployed to our network and we are seeing rapid recovery. We are monitoring the system for a full recovery.

Posted Mar 03, 2021 - 03:12 UTC

Identified

The problem is related to a misconfiguration on our network that is causing degraded performance. We are working on a fix and adjusting our network. We have started to see recovery and continue to monitor the changes.

Posted Mar 03, 2021 - 03:04 UTC

Investigating

We are investigating the increased error rates and we'll be providing more updates shortly.

Posted Mar 03, 2021 - 02:18 UTC

This incident affected: US (Incident REST API, Alert REST API, Heartbeat REST API, Incoming Integration Flow, Outgoing Integration Flow, Configuration REST APIs, Incoming Call Routing).