Outage report - 2017-10-26

Tomas

Fri Oct 27, 2017 10:17 am

We suffered a partial service outage on the 26th of October 2017.

The outage started at 20:17 UTC, and all services recovered on the 27th at 02:35 UTC.
Total outage duration: 6 hours, 18 minutes.
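For anyone who wants to double-check the duration, here is a minimal sketch of the arithmetic in Python. The function name `outage_duration` is made up for this illustration, and the timestamps are the UTC start and recovery times given above.

```python
from datetime import datetime, timezone


def outage_duration(start, end):
    """Return the outage length as (hours, minutes)."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes = remainder // 60
    return hours, minutes


# Start and recovery times from this report, both in UTC.
start = datetime(2017, 10, 26, 20, 17, tzinfo=timezone.utc)
end = datetime(2017, 10, 27, 2, 35, tzinfo=timezone.utc)

hours, minutes = outage_duration(start, end)
print(f"{hours} hours, {minutes} minutes")
```

Using timezone-aware datetimes keeps the subtraction correct across the midnight boundary.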

Services affected:
- Portal
- Licensing server
- Wiki
- Forum
- Tracker

Services NOT affected:
- Website
- DNS
- Email

Outage report:
Both our primary and our secondary service provider went down at the same time.
Our services are redundant and can tolerate an outage at either provider, but not at both simultaneously.

After investigating, we found that BOTH providers had scheduled maintenance windows, and we received NO notification from either of them.

Preventing future outages:
We already run some of our services in the AWS cloud (the ones not affected by this outage). We had already planned to migrate the remaining services to AWS, but because our services are already redundant and such a migration takes a significant time investment, it was a low-priority task.
After this outage, we have moved it up our to-do list. We hope to migrate all remaining services to a multi-AZ AWS setup by the end of the year.

We have also given our providers a stern talking-to, in the hope of at least getting a proper notification the next time they schedule a maintenance window...