TechnologicalByte's Website


So, my website went down

(Posted on December 25, 2024 - 3:53 PM)

A postmortem of the site outage caused by a third party. 


Sure, this isn't the first outage.

    This site has been around for 1 year (soon to be 2!), and its first major outage occurred on November 9, 2024. There have been some minor outages that lasted only a couple of minutes, but this one was unlike any that came before. I'll be going over what happened and what my service provider had to say.

    Before I start, I'd like to give a huge thanks to the hard-working engineers who worked around the clock to get my website and other impacted services back online!

Technical information inbound! If you don't care about the details, I'll give you a little TL;DR. Simply put, it was caused by a third-party network provider.

Timeline of what happened minute by minute.

       All dates and times are in Mountain Time.

November 9, 2024 - 7:50 PM

Was notified that my website was offline. Attempted to remote in, to no avail, through both VNC and my server's control panel, which was unable to determine the server's actual online state. Sent a reboot request via the control panel and got no response. My CDN was the only server still online, and only until 8:04 PM.
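When a control panel can't report a server's real state, a quick TCP probe from another machine is one way to check for yourself. Here's a minimal sketch in Python; the function name `is_port_open` is made up for illustration, and this is not the tooling I or my provider actually used.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and unreachable hosts.
        return False
```

For example, `is_port_open("example.com", 443)` probes HTTPS on a placeholder host, and port 5900 is the usual default for VNC. A False result only says the port could not be reached from where you ran the probe; it can't tell a powered-off server apart from a broken network path.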

November 9, 2024 - 8:02 PM

Service provider notified me of an outage with no further information given at that point.

November 9, 2024 - 8:04 PM

My CDN went offline, preventing anyone from downloading content from my website. I was notified by my service provider that there was an issue impacting many server clusters and that my servers were among those impacted. No more details were given at this point, though.

November 9, 2024 - 8:05 PM

All of my servers went offline at that point, including my Discord server's bot (Hermosa).

November 9, 2024 - 9:22 PM

Was notified that engineers were at work diagnosing the issue and would get back to me with any further information.

November 9, 2024 - 11:51 PM

Was notified that the outage should be fixed by tomorrow morning. I checked the calendar and saw that the 10th landed on a Sunday, when I know many employees/engineers are off, but I still kept my hopes up of seeing my servers come back online.

-- No more updates on the 9th.

November 10, 2024 - 6:12 AM

As guessed, was notified that the outage was taking longer to resolve because many engineers were off on Sunday. I was assured that a few engineers were working on the issue and was given a rough estimate of about an hour until service returned.

Four hours later....

November 10, 2024 - 10:46 AM

It was reported that the outage was not caused by any hardware fault, but by a major network outage, caused by a third party, that impacted many servers. Mine was among them; it is classified as a dedicated server, not a VPS. On-site engineers attempted to reassign the affected servers' IP addresses to different ones, to no avail, because of the third-party network outage.

November 10, 2024 - 12:02 PM

Was notified by my service provider that the third-party network provider had given information regarding the outage. They indicated that an update was pushed out to the network infrastructure yesterday around 5:30 PM and, because there were no reports of any issues, the update went live. Around an hour later, one of the network nodes came under extreme load, and on-site engineers redirected network traffic to a different node, which eased the load. Unfortunately, 20 minutes later, the redirected traffic experienced major slowdowns as more and more nodes became overloaded, so engineers shut the overloaded nodes down. My servers' traffic was among the redirected traffic, and once the affected nodes were shut down, my server, along with the other impacted servers, was knocked offline.

I was then notified by my service provider that an update rollback was underway and that the outage should be resolved today.
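For the curious, the chain reaction described above (overloaded nodes get shut down, their traffic lands on the survivors, and the survivors overload in turn) can be sketched as a toy simulation. Everything here is an assumption for illustration: the function name, the even redistribution rule, and every number. It is not the network provider's actual topology or procedure.

```python
def simulate_cascade(loads, capacity):
    """Shut down overloaded nodes and spread their traffic over the survivors.

    loads: traffic per node in arbitrary units (hypothetical values).
    capacity: the per-node limit; a node is overloaded when load > capacity.
    Returns (rounds, online): rounds lists the node indices shut down in each
    round, and online maps each surviving node index to its final load.
    """
    online = dict(enumerate(loads))
    rounds = []
    while online:
        overloaded = [i for i, load in online.items() if load > capacity]
        if not overloaded:
            break
        # Take the overloaded nodes offline and collect the traffic they carried.
        displaced = sum(online.pop(i) for i in overloaded)
        rounds.append(overloaded)
        if online:
            # Assumption: displaced traffic is split evenly across survivors.
            share = displaced / len(online)
            for i in online:
                online[i] += share
    return rounds, online


if __name__ == "__main__":
    rounds, survivors = simulate_cascade([120, 80, 70, 60], capacity=100)
    print("shutdown rounds:", rounds)
    print("surviving nodes:", survivors)
```

With these made-up numbers, the total traffic (330) is below the total capacity (400), yet one overloaded node is enough to take all four down in three rounds, because every shutdown removes capacity while the traffic stays.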

November 10, 2024 - 2:12 PM

As the network provider was rolling back changes, I was notified that I should be able to access my server. After the 6th connection attempt, I was able to get in. As expected, speeds were cut severely, going from 1GBps down to 5-10MBps. I decided not to start up my website and left the server online without any running programs. Both my bot and CDN servers remained offline, however.
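To put that slowdown in perspective, here's a small back-of-the-envelope sketch. It reads the speeds exactly as written above, assumes decimal units (1 GB = 1000 MB), and uses a made-up 500 MB file as the example download.

```python
# Speeds as quoted in the post, read as megabytes per second.
# Assumption: 1 GB = 1000 MB (decimal units).
NORMAL_MB_PER_S = 1000.0
DEGRADED_MB_PER_S = (5.0, 10.0)


def slowdown_factor(normal: float, degraded: float) -> float:
    """How many times slower the degraded link is than the normal one."""
    return normal / degraded


def transfer_seconds(size_mb: float, speed_mb_per_s: float) -> float:
    """Seconds needed to move size_mb megabytes at the given speed."""
    return size_mb / speed_mb_per_s


if __name__ == "__main__":
    file_mb = 500.0  # hypothetical download size, not from the post
    for speed in (NORMAL_MB_PER_S, *DEGRADED_MB_PER_S):
        print(f"{speed:>6.0f} MB/s: {transfer_seconds(file_mb, speed):.1f} s")
    for speed in DEGRADED_MB_PER_S:
        print(f"{slowdown_factor(NORMAL_MB_PER_S, speed):.0f}x slower at {speed:.0f} MB/s")
```

Under those assumptions the degraded link is 100 to 200 times slower: the hypothetical 500 MB download goes from half a second to somewhere between 50 and 100 seconds.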

November 10, 2024 - 2:30 PM

Being the curious person I am, I decided to boot up nginx, and after about 5-10 seconds my website came online. I went out and checked the speeds and, as expected, they were very, very slow. Without the CDN server running, anyone attempting to download content got a very rarely seen error, "CDN Error!", followed by the subtitle "This is very embarrassing. My content delivery network appears to hit a snag.", with an HTTP response code of 503.
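The general pattern behind that error page is simple: when the backend that serves downloads is unreachable, answer with 503 Service Unavailable instead of a broken download. Here's a minimal sketch of the idea. The two error strings are the ones quoted above; the function name, the `cdn_is_up` flag, and the success response are hypothetical and are not my site's actual code.

```python
from http import HTTPStatus

# Error text as quoted in the post.
CDN_ERROR_TITLE = "CDN Error!"
CDN_ERROR_SUBTITLE = (
    "This is very embarrassing. "
    "My content delivery network appears to hit a snag."
)


def download_response(cdn_is_up: bool, path: str) -> tuple[int, str]:
    """Return (status_code, body) for a download request.

    cdn_is_up stands in for whatever health check the site really performs.
    """
    if not cdn_is_up:
        # 503 tells clients the failure is temporary and on the server side.
        body = f"{CDN_ERROR_TITLE}\n{CDN_ERROR_SUBTITLE}"
        return HTTPStatus.SERVICE_UNAVAILABLE.value, body
    # Hypothetical success path: a real site would stream or redirect here.
    return HTTPStatus.OK.value, f"Serving {path} from the CDN"
```

Calling `download_response(False, "/files/example.zip")` (a placeholder path) returns status 503 along with the error text, which is what visitors were seeing while the CDN was down.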

-- Website continued to run at reduced speeds for hours till the early morning of the 11th.

November 10, 2024 - 5:22 PM

Network provider finished rolling back changes and restarted the affected nodes; website server speeds increased from 5-10MBps to 60-130MBps. My Discord bot started up while my CDN remained offline. My service provider notified me that other servers were coming back online and that full service availability was expected within a few hours or by tomorrow, the 11th.

November 10, 2024 - 8:38 PM

Network provider reported no issues after the rollback and that all impacted servers should be back online. All but one of mine were: my CDN remained offline until the early morning. A few hours later, my website was up at full speed; my service provider continued to monitor the incident.

-- No more updates on the 10th.

November 11, 2024 - 7:55 PM

Service provider notified me that all servers were back online and that the issue had been resolved. I checked my servers and was able to confirm my CDN was back online, but my website was unable to contact the CDN because of a network change, which I was able to fix. Thus ended the major outage.
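For anyone who wants the totals, the timestamps in the timeline can be turned into durations with a few lines of Python. This is just a sketch: the helper name `elapsed` and the choice of which entries to compare are mine, and it assumes an English locale for parsing the month names.

```python
from datetime import datetime, timedelta

# Matches the timeline's timestamp style, e.g. "November 9, 2024 - 7:50 PM".
FMT = "%B %d, %Y - %I:%M %p"


def elapsed(start: str, end: str) -> timedelta:
    """Time between two timeline entries (same time zone, naive datetimes)."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


FIRST_REPORT = "November 9, 2024 - 7:50 PM"
SITE_BACK_SLOW = "November 10, 2024 - 2:30 PM"
ALL_CLEAR = "November 11, 2024 - 7:55 PM"

if __name__ == "__main__":
    print("until the website was reachable again:", elapsed(FIRST_REPORT, SITE_BACK_SLOW))
    print("until the provider's all-clear:", elapsed(FIRST_REPORT, ALL_CLEAR))
```

By these timestamps, the website was unreachable for 18 hours and 40 minutes, and the whole incident ran 2 days and 5 minutes from the first report to the all-clear.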

In conclusion

   This was the first major outage this site has had, and it brought down my website, Discord bot, and CDN. I felt like making a postmortem to include details on the outage, not for the sake of transparency, but for the lessons learned about rolling out infrastructure changes, as they can lead to outages impacting major services, like what happened recently with Verizon, American Airlines and others.

TechnologicalByte

as he creates his New Year's resolutions.



Copyright 2023 - 2025 TechnologicalByte