Post-Mortem after a server-wide crash

Yesterday we had a general failure across our web real-estate.

No one else seems to be discussing this, so I’m writing it down, partly because I learned a couple of things during the diagnosis (done together with teammate Diogo Silva).

Here are some of my TILs.

Silent updates brought me down, or some sysadmin goodness

Our Google Cloud instances run Ubuntu with automatic updates active (at least for sensitive / security rollouts). Ubuntu released a patch for libc6, the C library package, which contained a bug. The symptom was that we couldn’t use cURL to talk to the Mailgun or Salesforce APIs. A Laravel News post brought my attention to what was going on, since some of our landing pages run on top of Laravel. Other people were having similar issues, even on AWS instances.

This raises the very interesting question of what balance you want from a cloud provider: set-it-and-forget-it, manage-my-server-for-me automation vs. manually approving and applying updates (looking at you, Giacomo).
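If you want to check what an automatic update actually changed, the dpkg log is a reasonable place to start. The sketch below assumes the default Ubuntu log location (/var/log/dpkg.log) and its usual "date time action package old-version new-version" line layout; package_upgrades is a helper name made up for this post, not something we ran during the incident.

```shell
# List upgrade events for one package from a dpkg log.
# Usage: package_upgrades libc6 /var/log/dpkg.log
package_upgrades() {
  pkg="$1"
  log="${2:-/var/log/dpkg.log}"   # assumed default path on Ubuntu
  # Assumed dpkg.log fields: date time action package[:arch] old-version new-version
  awk -v pkg="$pkg" '
    $3 == "upgrade" {
      name = $4
      sub(/:.*/, "", name)        # drop the ":amd64"-style architecture suffix
      if (name == pkg) print $1, $2, $5, "->", $6
    }' "$log"
}

# If you would rather approve a package by hand, apt can hold it back:
#   sudo apt-mark hold libc6     # skip this package during upgrades
#   sudo apt-mark unhold libc6   # allow upgrades again
```

Holding a package trades the risk of a bad rollout for the risk of missing a security fix, so it is one point on the balance above rather than an answer to it.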

What solved it

We hard reset the server, but simply restarting the PHP service would have done the trick, just like the Laravel News post suggested.

sudo service php7.0-fpm restart

The next morning Ubuntu rolled back the update, and the services crashed, again. Restarting PHP did it for us, again. Go figure.

Hi Russia, Hi China!

I later concluded they were unrelated to the crash, but while perusing the access and login logs on the server I noticed two attack attempts. One was a brute-force, trial-and-error attempt at GETting a backup of an “admin” DB. Essentially, a script tried every combination of likely file names (admin-db, domain-name.com, sql-lite, etc.) and file extensions (.tar.gz, .zip, .rar, .out, etc.). Some people are brave enough to host a copy of their production DB right out in the public. Jeff Starr has a good post detailing this attack.
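A rough way to spot this kind of probing is to count requests for archive-like file names per client. This is only a sketch: it assumes the access log uses the common "combined" format (client IP in field 1, request path in field 7), it only knows the four extensions listed above, and backup_probes is a hypothetical helper name. Behind a proxy, field 1 may well be the proxy's address rather than the attacker's.

```shell
# Count requests per client IP for paths ending in a backup/archive extension.
# Usage: backup_probes /path/to/access.log
backup_probes() {
  log="$1"
  # Assumed combined log format: $1 = client IP, $7 = request path.
  awk '
    { path = $7; sub(/[?].*/, "", path) }            # ignore any query string
    path ~ /\.(tar\.gz|zip|rar|out)$/ { hits[$1]++ } # extensions from the list above
    END { for (ip in hits) print hits[ip], ip }
  ' "$log" | sort -rn                                # busiest clients first
}
```

Each output line is a hit count followed by an IP, so a single address with hundreds of hits on files that do not exist stands out quickly.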

I also saw attempts to log in to the server as root.

The former attacks were coming from Russia, the latter from China. How cliché. Any folks from North Korea reading this?
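For the root login attempts, a similar tally works on the login log. The sketch assumes the attempts were SSH password logins recorded by sshd in /var/log/auth.log (the Ubuntu default) with the usual "Failed password for root from <ip> port ..." wording; root_login_failures is a name invented for this example.

```shell
# Count failed root password logins per source IP.
# Usage: root_login_failures /var/log/auth.log
root_login_failures() {
  log="${1:-/var/log/auth.log}"   # assumed default path on Ubuntu
  awk '
    /Failed password for root from/ {
      # The source address is the field right after the word "from".
      for (i = 1; i < NF; i++) if ($i == "from") { hits[$(i + 1)]++; break }
    }
    END { for (ip in hits) print hits[ip], ip }
  ' "$log" | sort -rn
}
```

Unlike the web requests, SSH connections do not pass through an HTTP proxy, so the addresses here should be the real source addresses.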

(I had to jump through some hoops to figure out the IP addresses of the requests, since we use Cloudflare as a proxy.)
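One way to skip those hoops next time is to have the web server log the visitor address that Cloudflare passes along in the CF-Connecting-IP header. The sketch below assumes nginx with its realip module, which is a guess on my part and not necessarily our setup; cloudflare_realip_conf is a made-up helper that turns a list of proxy ranges into the matching directives.

```shell
# Turn a list of CIDR ranges (one per line on stdin) into nginx realip directives.
cloudflare_realip_conf() {
  awk '
    NF  { print "set_real_ip_from " $1 ";" }         # trust this proxy range
    END { print "real_ip_header CF-Connecting-IP;" } # header carrying the visitor IP
  '
}

# Usage (needs network access; Cloudflare publishes its IPv4 ranges at this URL):
#   curl -s https://www.cloudflare.com/ips-v4 | cloudflare_realip_conf
```

The generated lines would go into the nginx http or server block. Only Cloudflare's own ranges should be trusted this way, otherwise anyone can forge the header.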

Wrapping up

This was a potentially damaging rollout, and while I have been a huge Ubuntu fan and user for 10 years now (and still am), I feel a proper discussion (and apology) should come from their team, as well as from Google’s (we’re paying customers).
