The fall of CloudLinux, or how to recover a cloud instance without networking

Danil Smirnov
Sep 8, 2019 · 3 min read

The issue

We don’t expect incidents like this in the modern world anymore. In the era of DevOps and automated testing, they simply should not happen. We are also quite used to cheap and easy backups in the cloud, e.g. the EBS volume lifecycle policies of the AWS EC2 service…

But sometimes the stars align in just this rare way, and my friend, who runs a pet SOHO web-hosting business, was woken up by an alarm reporting that his CloudLinux server in AWS had seemed dead since 3am that unfortunate night. :(

What is even worse, he hadn’t configured EBS lifecycle policies for the server, relying on per-client data backups only. Those are of little use in this painful situation, when the server doesn’t respond to any requests.

Reboots didn’t help, and after checking the system log in the AWS console, we quickly discovered that it was a network failure. Networking failed to come up on boot with a rather vague error:

Failed to start LSB: Bring up/down networking

The situation looked really bad: googling the error message brought us tons of irrelevant cases and, as the server lives in AWS, there is simply no way to connect to it without networking up and running. Huh.

My friend was really lucky to find the root cause of the issue quite quickly, because the error message alone would not have let us identify the problem easily. It was a nasty bug in the latest update of the CloudLinux distribution, version 7.7:

After the planned upgrade, which requires a reboot as one of its steps and is applied automatically by the cPanel software, the system won’t start, and the only way to connect to it in the cloud is lost.

I wonder how many system administrators at web-hosting companies all over the world had a day full of trouble on Thursday, September 4. How many of them went down the wrong path before they realised the root cause of the issue…

I believe that any company these days (especially one with customers like IBM and DELL) should use DevOps methodology to prevent crap software from being released to a wide audience… But shit happens, and now we need to recover the server.

Recovery

The disk volumes of a cloud instance can be detached and attached to another, healthy instance with just a few clicks, and then mounted there as a folder. Of course this is quite useful for examining logs, but not only for that.
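
The same detach-and-attach step can be scripted with the AWS CLI instead of clicking through the console. The sketch below only builds and prints the command lines; it does not call AWS or mount anything. The instance IDs, volume ID, device names (/dev/sdf, /dev/xvdf1) and the mount point /mnt/rescue are hypothetical placeholders, and the device name that actually appears on the rescue instance may differ, so check it with lsblk first.

```python
import shlex


def volume_move_plan(broken_instance, rescue_instance, volume_id,
                     attach_device="/dev/sdf",
                     guest_partition="/dev/xvdf1",
                     mount_point="/mnt/rescue"):
    """Return the commands (as argv lists) that move a volume to a rescue instance.

    Every ID, device name and path here is a placeholder (an assumption),
    not a value taken from the incident described in the article.
    """
    return [
        # Stop the broken instance so its volume can be detached cleanly.
        ["aws", "ec2", "stop-instances", "--instance-ids", broken_instance],
        ["aws", "ec2", "wait", "instance-stopped", "--instance-ids", broken_instance],
        # Detach the volume and wait until it is free.
        ["aws", "ec2", "detach-volume", "--volume-id", volume_id],
        ["aws", "ec2", "wait", "volume-available", "--volume-ids", volume_id],
        # Attach it to the healthy rescue instance as a secondary disk.
        ["aws", "ec2", "attach-volume", "--volume-id", volume_id,
         "--instance-id", rescue_instance, "--device", attach_device],
        # The last two commands run on the rescue instance itself, as root.
        ["mkdir", "-p", mount_point],
        ["mount", guest_partition, mount_point],
    ]


if __name__ == "__main__":
    plan = volume_move_plan("i-0123456789abcdef0",    # hypothetical broken instance
                            "i-0fedcba9876543210",    # hypothetical rescue instance
                            "vol-0123456789abcdef0")  # hypothetical volume
    for argv in plan:
        print(shlex.join(argv))
```

Printing the plan before running anything makes it easy to review each step by hand, which is probably what you want at 3am.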

It is also possible to “boot” into the attached volume using the Linux chroot command:
https://en.wikipedia.org/wiki/Chroot
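
A minimal sketch of what entering the chroot usually involves, assuming the broken volume is already mounted at a hypothetical /mnt/rescue on the rescue instance. Tools run inside a chroot generally expect /dev, /proc and /sys to be present, so a common approach is to bind-mount them from the host first. As before, the code only builds the commands and prints them.

```python
import posixpath
import shlex


def chroot_plan(mount_point="/mnt/rescue", shell="/bin/bash"):
    """Return the commands that prepare and enter a chroot at mount_point.

    mount_point and shell are assumptions; adjust them to the real setup.
    """
    plan = []
    # Expose the host's device, process and sysfs trees inside the chroot.
    for pseudo_fs in ("/dev", "/proc", "/sys"):
        target = posixpath.join(mount_point, pseudo_fs.lstrip("/"))
        plan.append(["mount", "--bind", pseudo_fs, target])
    # Start a shell whose root directory is the broken system's volume.
    plan.append(["chroot", mount_point, shell])
    return plan


if __name__ == "__main__":
    for argv in chroot_plan():
        print(shlex.join(argv))
```

Inside the chroot the kernel and the network still belong to the rescue instance, which is exactly why package tools can work there even though the broken system’s own networking cannot start.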

After that, it’s possible to run yum/rpm commands to install, upgrade, or downgrade the faulty packages to their previous versions, as CloudLinux support recommends to fix the issue. Once the fix was applied and the volume was attached back to the original instance, we were able to start the server in the cloud and connect to it as normal.
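
As an illustration of the package step, here is a hedged sketch that builds the rpm and yum commands to run against the chrooted system. The article does not name the faulty packages, so the package name below is a made-up placeholder; use whatever CloudLinux support lists. It also assumes the repositories still carry the previous version and that name resolution works inside the chroot (its /etc/resolv.conf may need attention).

```python
import shlex


def downgrade_plan(packages, mount_point="/mnt/rescue"):
    """Return the commands that inspect and downgrade packages inside a chroot.

    packages must be supplied by the caller; mount_point is an assumption.
    """
    if not packages:
        raise ValueError("name at least one package to downgrade")
    return [
        # List installed packages by install time, newest first,
        # to see what the last update touched.
        ["chroot", mount_point, "rpm", "-qa", "--last"],
        # Ask yum to roll the named packages back to an earlier version.
        ["chroot", mount_point, "yum", "downgrade", "-y", *packages],
    ]


if __name__ == "__main__":
    # "example-package" is a hypothetical name, not the real faulty package.
    for argv in downgrade_plan(["example-package"]):
        print(shlex.join(argv))
```

After the downgrade, the bind mounts and the volume need to be unmounted before the volume is detached from the rescue instance.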

I hope this article helps someone facing such a disaster, but I personally prefer the DevOps approach of treating servers not as pets but as cattle, with everything easily reproducible in an automated way.

Or, even better, get rid of servers completely by moving to a serverless approach. :)

You really deserve to sleep well.
