Services Outage
Incident Report for Adaptavist Cloud Apps
Postmortem

On January 9th and 10th 2018 we experienced two major service outages affecting all customers using ScriptRunner for Jira Cloud. Between 18:30 and 22:40 on 9th Jan our add-on was unavailable, communication with customer Jira instances was broken, and Scheduled Jobs did not execute. Between 00:30 and 09:00 on 10th Jan no customer scripts were executed. This includes scripts from the Script Console, Script Listeners, Escalation Service and Scheduled Jobs.

We understand that these outages caused problems for the business procedures and daily workflows that depend on ScriptRunner, and we are sorry.

Our team is now working to improve our procedures, infrastructure and software to prevent these problems from occurring again.

Root Causes

We identified several problems during our root cause analysis:

  • One of the underlying virtual machines in our container cluster was unavailable.
  • New container services were not starting successfully.
  • A bug in our autoscaling caused services to be scaled down to no running services.
  • Our logging infrastructure was unavailable for 24 hours prior to this outage, although no log messages were lost.

Additionally, we did not update this StatusPage during the outages and we did not have a clear internal plan to follow during out-of-hours outages.

Changes We're Making

We have made it clear internally who should be contacted during out-of-hours outages. We will be making sure that engineers are readily available out-of-hours.

We are modifying our alerting systems to better report service outages.

We have updated our service autoscaling to prevent scaling down until we have resolved the problem where new services don't start successfully.
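As an illustration only, a guard of this kind can be sketched as follows. This is not our production autoscaler: the names ScalingPolicy, next_instance_count, min_instances, max_instances and allow_scale_down are hypothetical, and the numbers are placeholders. The sketch shows two ideas: a floor so that a service can never be scaled down to zero, and a switch that blocks any scale-down while new services are failing to start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy; field names and defaults are assumptions."""
    min_instances: int = 1          # floor: never scale to zero
    max_instances: int = 10         # ceiling on running instances
    allow_scale_down: bool = False  # off while new services fail to start


def next_instance_count(current: int, desired: int, policy: ScalingPolicy) -> int:
    """Return the instance count the autoscaler should target next."""
    # Clamp the requested count into the allowed range first.
    target = max(policy.min_instances, min(desired, policy.max_instances))

    # While scale-down is disabled, never drop below what is already running,
    # but still honour the floor if fewer instances than that are running.
    if not policy.allow_scale_down and target < current:
        return max(current, policy.min_instances)

    return target
```

With the default policy, a request to go from three instances to zero leaves three running, and re-enabling scale-down later still stops at the floor of one instance rather than zero.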

We are investigating alternatives to our logging infrastructure and intermediate fallback steps we can implement in the meantime.

Posted Jan 10, 2018 - 17:50 UTC

Resolved
We have completed an analysis of the outages that occurred on 9th and 10th January and begun taking steps to prevent them in the future. Additional changes are already in our procedural and development pipeline. A more detailed post-mortem will follow.
Posted Jan 10, 2018 - 15:28 UTC