Atlassian: Poor team coordination and wrong scripts to blame for cloud outages

Atlassian has published a blog post in which CTO Sri Viswanath explains in detail why the provider’s cloud tools went down for some of its customers. In addition to the already known problems with running a script, internal communication issues are said to be to blame. The post also describes the apparently complicated recovery process, which offers a first explanation of why the restoration, as announced, could take another two weeks.

The error occurred as a result of the now native integration of the Jira service “Insight – Asset Management” into the manufacturer’s products. As part of the migration, the legacy standalone versions of “Insight” that were still installed were to be deactivated, according to Atlassian’s blog post.

The team used an existing script for this. Beforehand, however, there was an internal communication error: the team that was to carry out the deactivation received incorrect information from the team that had planned it. Instead of just the ID of the affected Insight app, the ID of the entire cloud instance on which the standalone app was installed was passed on.

In addition, the script was run in the wrong mode. Besides the “mark for deletion” function, which allows deleted data to be restored, it also offers a “delete permanently” function, which is really only needed to meet compliance rules. When the script was run, however, the latter mode was executed, and the data of 400 customers was permanently removed.
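The two failure modes described above, an ID of the wrong scope and an irreversible mode chosen by mistake, can both be caught by simple guards before anything is deleted. The following Python sketch is purely illustrative and is not Atlassian's script: the names `IdKind`, `DeleteMode`, `Target` and `plan_deletion` are hypothetical, and it assumes that every ID carries information about what kind of object it refers to.

```python
from dataclasses import dataclass
from enum import Enum


class IdKind(Enum):
    """What an ID refers to (hypothetical classification)."""
    APP = "app"    # a single installed app, e.g. a standalone Insight app
    SITE = "site"  # an entire cloud instance with all of its products


class DeleteMode(Enum):
    MARK_FOR_DELETION = "mark"  # recoverable
    PERMANENT = "permanent"     # irreversible, intended for compliance requests


@dataclass(frozen=True)
class Target:
    id: str
    kind: IdKind


def plan_deletion(target, expected_kind,
                  mode=DeleteMode.MARK_FOR_DELETION,
                  confirm_permanent=False):
    """Validate a deletion request and return a description of the action.

    Nothing is deleted here; the function only decides whether the
    request is allowed to proceed.
    """
    # Guard 1: the ID must be of the kind the operator intended to delete.
    if target.kind is not expected_kind:
        raise ValueError(
            f"expected a {expected_kind.value} ID, got a {target.kind.value} ID"
        )
    # Guard 2: permanent deletion is never the default and needs an explicit flag.
    if mode is DeleteMode.PERMANENT and not confirm_permanent:
        raise PermissionError("permanent deletion requires explicit confirmation")
    return f"{mode.value}:{target.kind.value}:{target.id}"
```

With these guards, passing the ID of a whole cloud instance where an app ID is expected raises a `ValueError`, and the recoverable mode is used unless the caller explicitly asks for permanent deletion and confirms it.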

According to Atlassian, its data management keeps backups across multiple AWS Availability Zones. In the past, these backups were only used to restore individual data points, for example if customers accidentally deleted their own data. The process was not previously designed to restore many data sets at once.

The recovery process is also complex and requires, among other things, one-to-one communication with those affected, so restoring an individual account takes up to five days. In the meantime, however, the company wants to automate the time-consuming manual process and be able to handle up to 60 cases in parallel.

However, the incident and the company’s response time do not meet its own standards, CTO Sri Viswanath continues in the blog post: “We know that such incidents can undermine trust”. The company therefore wants to publish another, more detailed post-incident report, improve its external communication and provide daily status updates in the future.

Since Tuesday of last week, some of Atlassian’s customers have had no access to the provider’s popular cloud tools, such as Jira and Confluence. Yesterday, on Tuesday, the company said the outage could last up to two weeks for individual teams. As of April 13, the problem had been resolved for only 45 percent of those affected.


(JVO)
