On March 3, 2021, between 01:29 and 03:15 UTC, Opsgenie customers were unable to access core functionality of the product. The event was triggered by a change to update the SSL certificate for *.opsgenie.com on Atlassian's Edge Network Infrastructure, which impacted customers globally. The incident was detected within 31 minutes through customer support tickets and mitigated by rolling back to the previous SSL certificate. The total time to resolution was one hour and 27 minutes.
The issue was caused by a failed change to renew the *.opsgenie.com SSL certificate. As a result, Opsgenie could not perform TLS handshakes with the Opsgenie API, and users received HTTP 502 errors.
Our mechanism for renewing the *.opsgenie.com SSL certificate failed to export the Subject Alternative Names (SANs) required to support Opsgenie API traffic. We deployed the certificate without those SANs to Atlassian's Edge Network Infrastructure, which left customers unable to perform TLS handshakes with api.opsgenie.com.
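To illustrate the kind of gap described above, here is a minimal sketch of a pre-deployment check that compares the SAN entries of a renewed certificate against the hostnames it must serve. It is not Atlassian's tooling. The function names `san_matches` and `missing_hostnames` are hypothetical, the SAN lists are illustrative (the report does not give the actual entries), and the wildcard rule assumed here is the common one in which `*` covers exactly one leftmost label.

```python
from typing import Iterable, List


def san_matches(pattern: str, hostname: str) -> bool:
    """Return True if a single SAN DNS entry covers the hostname.

    Assumption: a wildcard is only honoured as the entire leftmost label
    and stands for exactly one label.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard replaces one non-empty label; the rest must be identical.
        return len(host_labels[0]) > 0 and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def missing_hostnames(san_entries: Iterable[str],
                      required_hostnames: Iterable[str]) -> List[str]:
    """List the required hostnames that no SAN entry covers."""
    sans = list(san_entries)
    return [
        host for host in required_hostnames
        if not any(san_matches(san, host) for san in sans)
    ]


# Illustrative values only: a certificate exported with no SANs at all,
# checked against the API hostname named in the report.
required = ["api.opsgenie.com"]
print(missing_hostnames([], required))
print(missing_hostnames(["*.opsgenie.com"], required))
```

A check like this could run before a renewed certificate is promoted, failing the change when `missing_hostnames` returns a non-empty list. Note that under the one-label wildcard rule assumed here, `*.opsgenie.com` would not cover a deeper name such as the hypothetical `a.b.opsgenie.com`, which is why the expected hostnames are listed explicitly.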
We know that outages impact your productivity. While we have a number of tests and preventative processes in place, this specific issue wasn’t identified because the change involved a very specific kind of certificate that uses wildcard entries in the SAN, a case that neither our automated continuous deployment suites nor our manual test scripts picked up. Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact, but in this case our deployment and incident detection did not work as expected.
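The idea of a progressive, region-by-region deployment gated by verification can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of Atlassian's deployment system: `staged_rollout`, the region names, and the `deploy`, `verify`, and `rollback` callables are all hypothetical, and the verification step is simulated rather than performing a real TLS handshake.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(regions: Sequence[str],
                   deploy: Callable[[str], None],
                   verify: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> List[str]:
    """Deploy one region at a time and stop at the first failed verification.

    On failure, the failing region and every previously updated region are
    rolled back, and an empty list is returned. On success, the list of
    updated regions is returned.
    """
    updated: List[str] = []
    for region in regions:
        deploy(region)
        if not verify(region):
            # Undo the failing region first, then earlier ones in reverse order.
            for name in [region] + updated[::-1]:
                rollback(name)
            return []
        updated.append(region)
    return updated


# In-memory simulation with hypothetical region names.
state: Dict[str, str] = {"region-a": "old", "region-b": "old", "region-c": "old"}
touched: List[str] = []


def deploy(region: str) -> None:
    state[region] = "new"
    touched.append(region)


def verify(region: str) -> bool:
    # Simulated check: pretend handshakes fail against the new certificate.
    return state[region] != "new"


def rollback(region: str) -> None:
    state[region] = "old"


result = staged_rollout(list(state), deploy, verify, rollback)
print(result, touched, state)
```

In this simulation only the first region ever receives the new certificate, because verification fails there and the rollout halts. The gate is only as good as the `verify` step, so it would need to exercise the same hostnames that customers actually use.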
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve Opsgenie's performance and availability.
Thanks,
Atlassian Customer Support