England has never been so hot: Coningsby measured 40.3 degrees Celsius in the shade on Tuesday. Luton Airport had to suspend operations because the runway melted; there were also heat-related disruptions to rail services, roads and the power supply. The designers of data centers clearly did not expect such high temperatures either. Both Google Cloud and Oracle experienced outages after some of their major cooling systems malfunctioned.
At Oracle, network connections and the block volume, compute and object storage services were affected; at Google it was Google Compute Engine (GCE), including autoscaling, persistent disks and the virtual machines (VMs) running on them. As a result, Kubernetes instances, SQL databases, BigQuery warehouses and, of course, many websites were affected.
Oracle believes in “seasonal temperatures”
Oracle acknowledged on Tuesday: “As a result of unseasonal temperatures in the region, part of the refrigeration equipment at the UK South (London) data center has experienced a problem, resulting in the need to shut down part of our service infrastructure” in order to avoid “uncontrolled hardware failures”. “This step was taken with the intention of limiting the potential for long-term impact on our customers.”
To put it in plain English: Our air conditioning, including the redundant cooling, cannot cope with this sweltering heat. We had to shut down routers and computers abruptly, otherwise they would have burned out and customer data would have been lost. Oracle did not say for which other season these temperatures would have been typical in England, or whether the air conditioning would have held up then.
The emergency shutdown of Oracle's London servers began at 1:10 pm on Tuesday. In a second step, according to Oracle, further computers were manually shut down “as a preventive measure” “to avoid further hardware failure”. In addition, “relevant service teams were activated to bring the affected infrastructure to a healthy state”. In other words: The air conditioning is broken. For some hardware, the emergency shutdown came too late. By deliberately switching off the rest, we are saving it.
The temperature in the data center did not return to a “usable temperature” until seven hours after the emergency shutdown. Some of the broken cooling systems were repaired after nine hours, all of them after eleven hours. After 20 hours of hard work, Oracle was able to report that “all services and their resources have now been restored”.
Google’s air conditioning also failed
At Google, the outage started about two hours after Oracle's, that is, at 3:10 pm. “There is a cooling-related outage in one of our buildings that hosts a zone of the europe-west2 region. This has caused a partial outage in that zone, resulting in virtual machine shutdowns and loss of computing power for a small number of our customers,” the data company reported. Other customers lost their persistent disk redundancy. Google’s cooling system had also failed, and the group had to shut down further computers as a precaution.
After nine hours, Google was able to give the all-clear for its cloud services in London, with only a small portion of the persistent disks still suffering from I/O errors. The lesson from this sorry episode is that resilience in times of climate change requires more than protection from floods, hurricanes and fires and help for climate refugees. Cooling equipment must also be sized more generously, and hardware must be designed to be more heat resistant. This will increase the energy consumption of data centers unless we become more data-efficient.