Where RIM went wrong

2011/10/14

In case it’s not been immediately obvious to anyone, I’ve done some simple diagrams to explain where RIM went wrong in this catastrophic outage they’ve been suffering.

You see, most companies implement what we call redundant infrastructure. In systems that require high availability, this is often accomplished with something as simple as clustered hardware and communications, whether over a LAN or a WAN. Sometimes the cluster is designed so that every component runs at the same time, sharing the load, and if one fails, the other takes over and carries all of it. In simple terms, it looks like this:

[Diagram: Active/Active Clusters]

That all makes sense, right?
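For anyone who prefers code to boxes and arrows, here's a minimal sketch of the same active/active idea. Everything in it (the `Node` and `ActiveActiveCluster` names, the round-robin dispatch, the two-node setup) is invented for illustration and says nothing about how RIM or anyone else actually builds this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster member that records the requests it served."""
    name: str
    healthy: bool = True
    handled: list = field(default_factory=list)


class ActiveActiveCluster:
    """Round-robin across all nodes, skipping any that are unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # round-robin cursor

    def dispatch(self, request):
        # Try each node at most once, starting from the cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node.healthy:
                node.handled.append(request)
                return node.name
        raise RuntimeError("total outage: no healthy nodes left")


a, b = Node("node-a"), Node("node-b")
cluster = ActiveActiveCluster([a, b])

# Both nodes up: the load is shared.
for i in range(4):
    cluster.dispatch(f"req-{i}")

# One node fails: the survivor carries everything.
b.healthy = False
for i in range(4, 8):
    cluster.dispatch(f"req-{i}")

print(len(a.handled), len(b.handled))
```

While both nodes are healthy the requests alternate between them. Once `node-b` is marked unhealthy, `node-a` quietly picks up the whole load. The outage only becomes visible when there is nothing left to fail over to.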

Unfortunately, RIM seemed more focused on having failover capabilities for upper-level management, so it clustered its CEOs instead:

[Diagram: RIM Clustered CEOs]

The supposed theory behind this is that the two CEOs, working in an active/active arrangement, could handle load better and get the job done better than a single CEO, and provide resiliency!

Unfortunately, the hardware resiliency wasn’t up to scratch, and when it started to fail, RIM started having a catastrophic outage.

Now, you might have expected the active/active CEO cluster to step in and help at that point. Unfortunately, they’ve barely been heard from. So, in cluster terms, we have to assume a sort of reversed split-brain situation has occurred, where each component of the cluster thinks the other component is still running:

[Diagram: RIM split-brain]

And there you have it: why RIM is having its current outage.

It’s also a lesson for all you other companies out there: you need fault-tolerant infrastructure as well as fault-tolerant CEOs.