What Caused The Resideo Total Connect 2.0 Outage Sunday?


I was able to speak with an industry insider familiar with the events at Resideo's data center on Sunday night into Monday evening. This person related to me that there was an HVAC failure at the primary data center. It was initially thought to be an easy fix, but that turned out to be false.

Things started to go wrong in Resideo's primary data center on Sunday night at around 7:00 PM Eastern Time. An HVAC failure allowed the temperature in the data center to climb to a dangerous level for the servers located there. The normal temperature is around 70℉ (21℃), but on Sunday it rose into the neighborhood of 130℉ (54.4℃). The servers are set to failsafe, so rather than continue running and risk catastrophic damage, they began to shut down.

An automated system is in place that notifies engineering and other stakeholders when a serious event like this occurs. An HVAC technician responded. Initially, the technician believed this would be a quick and easy fix, so the decision was made not to switch to the secondary data center, which is located in the Chicago area. The switch takes around 20 minutes, and the thinking at that point was that the repair would be finished before a switch could pay off.
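The trade-off described here, repair in place versus spending roughly 20 minutes on a switchover, can be written down as a simple rule. The sketch below is purely illustrative and is not Resideo's actual procedure: the `Incident` class, the `should_fail_over` function, and the rule itself are hypothetical. The only figure taken from the account above is the approximate 20-minute switch time.

```python
from dataclasses import dataclass
from typing import Optional

# Approximate switchover time quoted in the article (about 20 minutes).
SWITCH_MINUTES = 20


@dataclass
class Incident:
    """Hypothetical snapshot of what is known when the alert fires."""
    servers_shutting_down: bool
    # Technician's repair estimate in minutes; None means "not known yet".
    estimated_repair_minutes: Optional[float]


def should_fail_over(incident: Incident, switch_minutes: float = SWITCH_MINUTES) -> bool:
    """Assumed rule: switch whenever the repair is not clearly faster than the switch."""
    if not incident.servers_shutting_down:
        return False  # primary site is still serving traffic
    if incident.estimated_repair_minutes is None:
        return True  # an unknown repair time is treated as the worst case
    return incident.estimated_repair_minutes > switch_minutes


# A "quick and easy fix" estimate keeps traffic on the primary site...
print(should_fail_over(Incident(True, 10)))
# ...while a missing part (no usable estimate) triggers the switch.
print(should_fail_over(Incident(True, None)))
```

Under a rule like this, the outcome depends entirely on how much weight is given to an early repair estimate, which is the judgment call that turned out to be wrong on Sunday night.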

However, the HVAC tech discovered that the fix required a part that wasn't on hand and couldn't be obtained at that hour. So, at around 1:00 AM Eastern Time, the decision was made to switch things over to the secondary data center. By about 1:30 AM Eastern Time, the backup data center was in control.

By around daybreak on Monday morning, the HVAC system in the primary data center had been fixed. It then took some time for the temperature to come back down to an acceptable level. By approximately 11:00 AM Eastern Time, Resideo was ready to switch back to the primary data center. At this point, alarm signaling was back up and had been for some time. By around 2:00 PM, AlarmNet360 was back up, and by about 6:00 PM, Total Connect 2.0 was back online, though customer reports and our own testing show that it was somewhat sluggish at first.
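To put the timeline in perspective, the approximate times reported above can be turned into elapsed hours since the HVAC failure. This is only a rough sketch: every time is the "around" figure given by my source, the calendar date is a placeholder chosen because it falls on a Sunday (the article does not state the date), and the event names are labels made up for this example.

```python
from datetime import datetime, timedelta

# Placeholder date: any Sunday works, only the differences between times matter.
SUNDAY = datetime(2023, 1, 1)
MONDAY = SUNDAY + timedelta(days=1)

# Approximate Eastern Time milestones as reported in the article.
events = {
    "hvac_failure": SUNDAY + timedelta(hours=19),                  # ~7:00 PM Sunday
    "failover_decision": MONDAY + timedelta(hours=1),              # ~1:00 AM Monday
    "backup_in_control": MONDAY + timedelta(hours=1, minutes=30),  # ~1:30 AM Monday
    "ready_to_switch_back": MONDAY + timedelta(hours=11),          # ~11:00 AM Monday
    "alarmnet360_restored": MONDAY + timedelta(hours=14),          # ~2:00 PM Monday
    "total_connect_restored": MONDAY + timedelta(hours=18),        # ~6:00 PM Monday
}


def hours_since_failure(name: str) -> float:
    """Elapsed hours between the HVAC failure and the named milestone."""
    delta = events[name] - events["hvac_failure"]
    return delta.total_seconds() / 3600


for name in events:
    print(f"{name:24s} {hours_since_failure(name):5.1f} h after failure")
```

By these figures, roughly 6.5 hours passed between the HVAC failure and the backup data center taking control, and about 23 hours passed before Total Connect 2.0 came back online.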

This outage affected three things. The most serious was alarm signaling. During the early hours of the outage, customers' systems were unable to send signals to the monitoring station, or to send notifications to the customers themselves. Total Connect 2.0, the customer-facing app and website for end-user remote control, was also down. Lastly, AlarmNet360, the dealer-facing service used to create or cancel accounts and remotely troubleshoot issues, was also affected. When things went wrong, the initial focus was on getting alarm signaling back up as quickly as possible. This was the priority when they initially switched to the Chicago-area data center.

This is a fully redundant system, and it is tested regularly. According to my source, there were hourly notifications being sent to alarm dealers, but the database of email addresses for these notifications seems to be outdated. This is something they will address going forward. A root cause analysis will be completed in the coming days, and any processes or procedures that need to be updated will be dealt with at that time. The site at status.resideo.com doesn't have a section showing either AlarmNet360 or Total Connect 2.0 status. Hopefully, this is something that will change in the very near future as well. Finally, those dealers who did receive notification noted that the emails weren't flagged as containing particularly important information. This is also something that will be addressed in the future.


Comments


Thanks, Julia. I'm glad it wasn't some nefarious hacker. However, our panel was still showing a communication failure after midnight Western time, which would be after 3 am Eastern. They should have switched over as soon as servers started shutting down. That's exactly the sort of scenario that capability is intended for. There are many systems available to make such switches between data centers automatic, so I'm amazed it seems to have been some contemplated decision. I don't even understand waiting for the HVAC tech. Yes, they should have system status for AlarmNet360 and Total Connect 2.0 as well!! Also, it seems like the e-mail and text notification system that tells me about my panel changes could have been used to notify us of the outage?
@Julia Ross Feel free to share my feedback/complaint with Resideo as I'm reconsidering recommending their systems to family/friends in the future.
Honeywell/Resideo is an embarrassment. If their system was "fully redundant" like they claim, the second data center in Chicago would've automatically taken over when the primary data center went down. Not only are the AlarmNet servers responsible for monitoring residential customers, but they're also responsible for commercial and high-security applications. The fact that a faulty air conditioner was responsible for causing a global outage that prevented burglar, fire, CO, and medical alarms from being delivered is completely unacceptable. Then on top of that, it took them over 6 hours of downtime to decide "oh, maybe we should switch over to the backup data center". What a joke.... Their servers should be redundant like how Criticom handles their central stations (load balanced, and if one goes down the other two automatically handle the load). Whoever oversees the AlarmNet servers needs to reevaluate the redundancy of their service or step down and allow someone who is more qualified to do it. Maybe while they're at it, they can fix their substandard Total Connect 2.0 app. It would be nice if it took less than 10 to 45 seconds to process an arm/disarm command or activate a Z-Wave device.