What Caused The Resideo Total Connect 2.0 Outage Sunday?

I was able to speak with an industry insider familiar with the events at Resideo's data center on Sunday night into Monday evening. This person related to me that there was an HVAC failure at the primary data center. It was initially thought to be an easy fix, but that turned out to be false.

Things started to go wrong in Resideo's primary data center on Sunday night at around 7:00 PM Eastern Time. An HVAC failure allowed the temperature in the data center to climb to a level that was dangerous for the servers located there. The normal temperature is around 70℉ (21℃), but on Sunday it rose into the neighborhood of 130℉ (54.4℃). The servers are configured to fail safe, so rather than continue running and risk catastrophic damage, they began to shut down.

An automated system is in place which notifies engineering and other stakeholders when a serious event like this occurs. An HVAC technician responded. Initially, the technician believed this would be a quick and easy fix, so the decision was made not to switch to the secondary data center, which is located in the Chicago area. The switch takes a bit of time, somewhere around 20 minutes, and the thought was that it wouldn't be worthwhile at that point to make the switch.

However, the HVAC tech discovered that the fix required a part that was not on hand and could not be obtained at that hour. So, at around 1:00 AM Eastern Time, the decision was made to switch over to the secondary data center. By about 1:30 AM Eastern Time, the backup data center was in control.

By around daybreak on Monday morning, the HVAC system in the primary data center had been fixed. After the repair, it took some time for the temperature to come back down to an acceptable level. By approximately 11:00 AM Eastern Time, Resideo was ready to switch back to the primary data center. At this point, alarm signaling was back up and had been for some time. By around 2:00 PM, AlarmNet360 was back up, and by about 6:00 PM, Total Connect 2.0 was back online, though customer reports and our own testing show that it was somewhat sluggish at first.
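
For anyone who wants to put rough numbers on that timeline, here is a minimal Python sketch that works out how long after the initial HVAC failure each milestone occurred. The times are the approximate ones given above, so the results are ballpark figures only. The event names and the `hours_since_failure` helper are my own labels for illustration, and I'm assuming the 2:00 PM and 6:00 PM times are Eastern like the others.

```python
# Approximate event times from this post, expressed as hours after
# midnight at the start of Sunday (Eastern Time). Values above 24 fall on Monday.
# The event names are illustrative labels, not anything Resideo uses.
EVENTS = {
    "hvac_failure": 19.0,          # Sunday ~7:00 PM
    "failover_decision": 25.0,     # Monday ~1:00 AM
    "backup_in_control": 25.5,     # Monday ~1:30 AM
    "primary_ready": 35.0,         # Monday ~11:00 AM
    "alarmnet360_restored": 38.0,  # Monday ~2:00 PM (assumed Eastern)
    "tc2_restored": 42.0,          # Monday ~6:00 PM (assumed Eastern)
}


def hours_since_failure(event: str) -> float:
    """Return the approximate hours elapsed between the HVAC failure and `event`."""
    return EVENTS[event] - EVENTS["hvac_failure"]


if __name__ == "__main__":
    for name in EVENTS:
        print(f"{name:22s} +{hours_since_failure(name):.1f} h")
```

By this rough math, the backup data center took control about 6.5 hours after the failure, AlarmNet360 returned after about 19 hours, and Total Connect 2.0 after about 23 hours.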

This outage affected three things. The most serious was alarm signaling. During the early hours of the outage, customers' systems were unable to send signals to the monitoring station or to send notifications to the customers themselves. Total Connect 2.0, the customer-facing app and website for end-user remote control, was also down. Lastly, AlarmNet360, the dealer-facing service used to create or cancel accounts and remotely troubleshoot issues, was also affected. When things went wrong, the initial focus was on getting alarm signaling back up as quickly as possible. This was the priority when they initially switched to the Chicago-area data center.

This is a fully redundant system, and it is tested regularly. According to my source, there were hourly notifications being sent to alarm dealers, but the database of email addresses for these notifications seems to be outdated. This is something they will address going forward. A root cause analysis will be completed in the coming days, and any processes or procedures that need to be updated will be dealt with at that time. The site at status.resideo.com doesn't have a section showing either AlarmNet360 or Total Connect 2.0 status. Hopefully, this is something that will change in the very near future as well. Finally, those dealers who did receive notification noted that the emails weren't flagged as containing particularly important information. This is also something that will be addressed in the future.

Comments


Disqus doesn't display your email to us, so I can't check whether you are an Alarm Grid customer or not. Would you mind emailing us at support@alarmgrid.com so we can talk further either way? Also, I'd be curious to hear what model phone you're using, what Android version you're on, and what Total Connect 2.0 app version you're on. Finally, is your L7000 using Wi-Fi and/or cellular for the alarm communications?
Sadly, if you run the Total Connect 2.0 website on your smartphone instead of the App, it loads quickly and Z-Wave device control is quick. The Android App writers need to be replaced with qualified engineers.
In the last year I have really started regretting my purchase of the L7000 panel. Resideo Total Connect 2.0 has ruined it for me! The Android App is BROKEN, loads SLOW, and the Devices tab shows random statuses that usually don't match the actual status of the devices. Running the website for Total Connect 2.0 works as it should: it loads fast and device status shows correctly. So I don't use the TC2 Android App and am instead forced to use the website on my smartphone, as that is the only reliable way to access Total Connect 2.0. Resideo opened a support case with their engineering team, but this has been an issue for over 6 months now. Case # 09560308 will likely go unresolved. My advice to anyone installing a new system is to not purchase a Honeywell panel.
Hey Steve, Thanks for your concern. Once we have more information on their corrective actions that we can share, we will follow up with a new blog post if we are able to share that information.
Thanks, Julia. I'm glad it wasn't some nefarious hacker. However, our panel was still showing a communication failure after midnight Western time, which would be after 3 AM Eastern. They should have switched over as soon as servers started shutting down. That's exactly the sort of scenario that capability is intended for. There are many systems available to make such switches between data centers automatic; I'm amazed it seems to have been a deliberated decision. I don't even understand waiting for the HVAC tech. Yes, they should have system status for AlarmNet360 and Total Connect 2.0 as well!! It also seems like the email and text notification system that tells me about my panel changes could have been used to notify us of the outage.
@Julia Ross Feel free to share my feedback/complaint with Resideo as I'm reconsidering recommending their systems to family/friends in the future.
Honeywell/Resideo is an embarrassment. If their system was "fully redundant" like they claim, the second data center in Chicago would've automatically taken over when the primary data center went down. Not only are the AlarmNet servers responsible for monitoring residential customers, but they're also responsible for commercial and high-security applications. The fact that a faulty air conditioner was responsible for causing a global outage that prevented burglar, fire, CO, and medical alarms from being delivered is completely unacceptable. Then on top of that, it took them over 6 hours of downtime to decide "oh, maybe we should switch over to the backup data center". What a joke.... Their servers should be redundant like how Criticom handles their central stations (load balanced, and if one goes down the other two automatically handle the load). Whoever oversees the AlarmNet servers needs to reevaluate the redundancy of their service or step down and allow someone who is more qualified to do it. Maybe while they're at it, they can fix their substandard Total Connect 2.0 app. It would be nice if it took less than 10 to 45 seconds to process an arm/disarm command or activate a Z-Wave device.