Amazon Web Services (AWS) Outage Affects Alarm Grid

It was a crazy day across the internet. Like a lot of other websites you might use, Alarm Grid was affected by the disruptions from the outage at Amazon Web Services (AWS). We wanted to let you know what happened, how it affected us, and the good news that things are getting back to normal.

What Happened?

According to Amazon, at around 3:11 AM EDT, the AWS US-EAST-1 region (a large cluster of data centers in Northern Virginia) started having issues. The first signs of trouble were increased latencies and error rates for sites and applications served by US-EAST-1. This region runs a huge portion of the internet, so when it has a bad day, it can create a domino effect that takes down lots of services.

The crux of the problem appears to have been twofold. Early in this event, at around 5:00 AM EDT, Amazon identified an issue with DNS resolution for the DynamoDB endpoint within the US-EAST-1 region. DNS, or Domain Name System, is how a domain name like alarmgrid.com gets translated into the IP address of the server behind it. By 6:35 AM EDT, Amazon reported that this issue had been fully mitigated.
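For the curious, here is a minimal sketch of what a DNS lookup looks like from a program's point of view. It is only an illustration, not how AWS or Alarm Grid does anything internally. The function name resolve and its resolver parameter are our own hypothetical choices; socket.getaddrinfo is the standard Python call that performs the lookup.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    Returns an empty list when the lookup fails, which is roughly what a
    client sees when a name cannot be resolved.
    """
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved (for example, a DNS failure).
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({sockaddr[0] for *_, sockaddr in results})


# Hypothetical usage (needs network access, and the output will vary):
# print(resolve("alarmgrid.com"))
```

When the lookup fails, the program never learns where to connect, even if the server on the other end is perfectly healthy. That is why a DNS problem can make a working service look like it is down.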

The second source of trouble involves an underlying system designed to monitor the health of AWS load-balancing servers. It's currently unclear exactly what happened with this system, but mitigation efforts for that issue are progressing well too. You're likely still seeing latency with some sites and apps that you visit. At 4:03 PM EDT, the AWS Health Dashboard said:

"Service recovery across all AWS services continues to improve. We continue to reduce throttles for new EC2 Instance launches in the US-EAST-1 Region that were put in place to help mitigate impact. Lambda invocation errors have fully recovered and function errors continue to improve. We have scaled up the rate of polling SQS queues via Lambda Event Source Mappings to pre-event levels. We will provide another update by 1:45 PM PDT."

How Did This Affect Alarm Grid?

Alarm Grid uses AWS in several ways to keep our website and services running smoothly. When AWS went down, it meant we had trouble with:

  • Access to our site: The site alarmgrid.com has been unavailable at times throughout the day today.
  • Website features: You might have noticed our site was super slow, or that parts of our customer portal weren't loading at all. If you are attempting to place an order, you may see some issues. Please retry any failed operation. You can reach out to support@alarmgrid.com if you need assistance.
  • Our own support tools: Our web-based phone system was affected earlier in the day, but seems to have recovered as of now.

We know how frustrating this is, especially when you're trying to get something done. We sincerely apologize for the headache. Our site is up, but is still experiencing significant latency as of this writing. Rest assured, we're here and we'll help however we can.

Good News: Things Are Coming Back Online!

As of this afternoon, AWS says they've identified the issue and are well on their way to a resolution. We're already seeing our systems come back to life, and the Alarm Grid site should be getting back to normal.

What Wasn't Affected

Fortunately, monitoring services appear to have been unaffected. We received no reported issues from our central station partners: Criticom Monitoring Service for customers in the United States, and Rapid Response for our Canadian customers. Alarm signal processing continued as usual. The Total Connect 2.0 and Alarm.com remote services also appear to have been unaffected.


A Few Quick Stats on the Outage

Just to show how big this outage was, at its peak, Downdetector.com showed nearly ten thousand reports for AWS. It also caused problems for major services like Snapchat, Ring, Roblox, Fortnite, and many others.
