February downtime: what really happened?

Some of you may have noticed that since the end of January the site had been experiencing very slow load times that affected its usability. In recent days we were forced to take the site down because of this and other problems we were experiencing with both our custom software and our server.

Here's the complete story:

Warning: this may become a TL;DR post.

Fri, Jan 31, 2014:

  • Jonathan notices the server is using up too much CPU to respond to new page requests. Server is rebooted and problem seems to go away.

Sat, Feb 1, 2014:

  • Asynchronous comment loads are released in the afternoon and the problem manifests again. CPU spikes and stays up a long time.
  • After several reboots the problem persists. We decide to leave the server as it is and see whether the problem goes away by itself before doing anything else.

Sun, Feb 2, 2014:

  • The problem gets worse.
  • The site is taken offline for live debugging, but not much is achieved with this.
  • A first diagnosis points to faulty/heavy database requests by our custom software.
  • A decision is made to leave the server like this until the following day.

Mon, Feb 3, 2014:

  • A code review is performed to improve the database requests and the site gets a little better, BUT:
  • By the afternoon we're forced to take the site down completely as a server problem has emerged.
  • After many hours of research and googling, we realize our server is full of garbage files.
  • The custom software we run is reinstalled and we lose our password encryption directives in the process. All users, including admins, will be forced to reset their passwords...
  • The site is left offline until Tuesday for a good cleaning.

Tue, Feb 4, 2014:

  • We now know our HDD has run out of addresses for adding new files.
  • A huge batch process is executed to get rid of all the garbage session files on the server.
  • The site is kept offline while the batch deletion is in progress.
  • The site is released live with minimal systems on. Only blends, comments, news, documentation pages, users, downloads and the contact form are 100% functional.
  • Collections and Likes are disabled because they have not been reviewed.
  • More features will be brought back in the following days, one by one.
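For anyone curious what a cleanup like Tuesday's could look like, here is a minimal sketch in Python. It is not the script we actually ran. It assumes that "running out of addresses" means the filesystem ran out of inodes (the slots a Unix filesystem uses to track files) and that the garbage files carry a `sess_` prefix, as PHP-style session files usually do. The function names, the prefix and the age threshold are all hypothetical.

```python
import os
import time


def inode_usage(path="/"):
    """Return the fraction (0.0 to 1.0) of inodes in use on the filesystem holding `path`.

    Relies on os.statvfs, which is only available on Unix-like systems.
    """
    stats = os.statvfs(path)
    if stats.f_files == 0:  # some filesystems do not report inode counts
        return 0.0
    return 1.0 - stats.f_ffree / stats.f_files


def delete_stale_sessions(session_dir, max_age_seconds, prefix="sess_"):
    """Delete files in `session_dir` whose name starts with `prefix` (an assumed
    naming scheme) and whose last modification is older than `max_age_seconds`.
    Returns the number of files removed.
    """
    cutoff = time.time() - max_age_seconds
    removed = 0
    # os.scandir yields entries lazily, so a directory holding millions of
    # files is never loaded into memory all at once.
    with os.scandir(session_dir) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            try:
                is_file = entry.is_file(follow_symlinks=False)
                mtime = entry.stat(follow_symlinks=False).st_mtime
                if is_file and mtime < cutoff:
                    os.unlink(entry.path)
                    removed += 1
            except FileNotFoundError:
                # The file vanished between listing and deletion; skip it.
                pass
    return removed
```

Something like `delete_stale_sessions("/path/to/sessions", 24 * 3600)` would then remove session files untouched for a day (the path is a placeholder), and `inode_usage("/")` can be checked before and after to see whether the cleanup freed anything.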

Wed, Feb 5, 2014:

  • Liking and Collecting are re-enabled, but you still can't edit or review your collections.
  • Requests are back online, as is the Search bar. Lots of optimizations were done to both.

Thu, Feb 6, 2014:

  • Questions and the User index are back online. Many optimizations have been performed.

Fri, Feb 7, 2014:

  • Enabled private messaging and the dashboard.

Thu, Feb 13, 2014:

  • After many review iterations on all the site code, Collections are 100% operational.
  • Site's been performing like a charm since midday, when we changed some db directives.

So in short, our software needed some optimizations, and then, on the same weekend, a much worse server storage problem manifested itself and forced a complete site takedown for emergency troubleshooting and debugging. It was very unfortunate that both systems failed on the same weekend, but now we are aware of such failures and can work towards preventing them in the future.

What's next?

We will be reviewing our code and bringing every feature back to life as soon as we're done optimizing the database queries and testing their functionality. Some indexes, like those in collections, may be affected by the optimization and may end up showing less data than they did before. But this is just to make sure the site works as fast as possible for you, the users :)

We will keep updating this post with any other events as they happen.

Thanks for being so patient while we have been offline!

All the features you've come to know and love will be back in the following days :)

Cheers!

Edited February 13, 2014 by poifox