In case it’s not been immediately obvious to anyone, I’ve done some simple diagrams to explain where RIM went wrong in this catastrophic outage they’ve been suffering.
You see, most companies implement what we call redundant infrastructure. In systems that require high availability, this is often accomplished with something as simple as clustered (either LAN or WAN) hardware and communications. Sometimes it’s designed that each component runs at the same time, sharing the load, but if one fails, the other one takes over and runs all the load. In simple terms, it looks like this:
Unfortunately, RIM seemed more focused on having failover capabilities for upper level management, so it instead clustered its’ CEOs:
Unfortunately though, the hardware resiliency wasn’t as up to scratch, and when it started to fail, RIM started having a catastrophic outage.
Now, you may have expected at that point for the active/active CEO cluster to step in and help. Unfortunately though, they’ve barely been heard from. So, in cluster terms, we have to assume a sort of reversed split-brain situation has occurred, where both components of the cluster think the other component is still running:
It’s also a lesson for all you other companies out there: you need fault tolerant infrastructure as well as CEOs.