We are all familiar with Mother Nature’s violent electrical storms and what happens when there’s a loss of power and everything goes dark. Everyone worries, and life and work as we know it become paralyzed. The August 2003 blackout, the biggest in North American history, plunged Toronto into darkness and served as an important, although unplanned, test of emergency preparedness for many organizations. It also served as a call to action for organizations that realized their emergency preparedness and business continuity plans did not address the long-term unavailability of utility power.
For the purpose of supporting critical operations, the public utility power grid must be treated as unreliable: an outage is not a matter of if, but when. Effective emergency preparedness plans begin with this precautionary frame of mind. The 2003 blackout, which affected 55 million people, was an event that Q9’s Toronto-One Data Centre was uniquely equipped to handle.
Data centres should always assume the inherent unreliability of public utility power sources and be able to maintain continuous operations throughout emergencies. Their capabilities should include on-site power generation and refueling. Just as important are a minimum N+1 redundant design for internal power delivery systems and qualified in-house technicians responsible for regular inspection, testing, and maintenance. Data centres should be designed for 100 per cent uptime and for full end-to-end control of their power systems; they should not rely on the availability of external resources to ensure successful implementation and operation.
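To make the N+1 idea concrete, here is a minimal sketch of how one might sanity-check that a power delivery design still carries the full critical load after losing any single unit. The capacity figures are hypothetical, purely for illustration:

```python
# Toy N+1 redundancy check: with one unit failed, the remaining units
# must still cover the full critical load. All figures are hypothetical.

def survives_single_failure(unit_capacity_kw: float, unit_count: int,
                            critical_load_kw: float) -> bool:
    """Return True if the design tolerates the loss of any one unit."""
    remaining_capacity = unit_capacity_kw * (unit_count - 1)
    return remaining_capacity >= critical_load_kw

# Example: a 2,000 kW critical load served by three 1,000 kW generators
# (N=2 needed, plus 1 spare) survives any single generator failure.
print(survives_single_failure(1000, 3, 2000))  # True
print(survives_single_failure(1000, 2, 2000))  # False: N units, no spare
```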
How cloud can help
In an emergency, rapid communications and continued operations are critical to minimizing the impact of the incident. One of the advantages of cloud infrastructure is that it can be engineered for diversity of hardware, hosts, and even geographies. Combined with the elastic nature of the cloud, such a system inherently provides the ability to protect both data and applications by applying redundancy as desired across the cloud.
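As a rough illustration of that geographic diversity, and assuming an AWS environment with the boto3 SDK (the same pattern applies to other providers), spreading identical instances across availability zones might look like this:

```python
# Sketch: spread identical instances across availability zones so that
# the loss of any one zone leaves replicas running elsewhere.
# Assumes AWS credentials are configured; the AMI ID, instance type,
# and zone list are placeholders, not recommendations.
import boto3

ec2 = boto3.client("ec2", region_name="ca-central-1")

ZONES = ["ca-central-1a", "ca-central-1b"]  # hypothetical zone list

for zone in ZONES:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```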
While the underlying ability to implement redundancy, backup, high availability (HA), and disaster recovery (DR) exists in most well-engineered cloud platforms, the actual design, customization, implementation, and execution of such strategies can still be tricky or cumbersome for the average business to realize.
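For example, one basic DR building block is keeping copies of data in a second region. A minimal sketch, again assuming boto3 and placeholder identifiers:

```python
# Sketch: snapshot a volume, then copy the snapshot to a second region
# as an off-site disaster-recovery copy. All IDs are placeholders.
import boto3

primary = boto3.client("ec2", region_name="ca-central-1")
dr_site = boto3.client("ec2", region_name="us-east-1")

# 1. Snapshot the production volume in the primary region.
snapshot = primary.create_snapshot(
    VolumeId="vol-0123456789abcdef0",      # placeholder volume ID
    Description="Nightly DR snapshot",
)

# 2. Wait until the snapshot completes before copying it.
primary.get_waiter("snapshot_completed").wait(
    SnapshotIds=[snapshot["SnapshotId"]]
)

# 3. Replicate it to the DR region.
dr_site.copy_snapshot(
    SourceRegion="ca-central-1",
    SourceSnapshotId=snapshot["SnapshotId"],
    Description="DR copy of nightly snapshot",
)
```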
It’s important to understand exactly what your cloud provider – or, if you’re building a private cloud, your cloud technology vendors – can provide you, in terms of both technological capability and management and support.
Preparing for an emergency
In addition to understanding how to enable the redundancy of your data and applications, it’s critical to understand the network connectivity to your cloud and how it’s protected. Make sure you’ve worked with your network teams to provide the redundancy required at the network level and, if needed, at the load-balancing level, so that your cloud remains resilient and available.
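To illustrate the load-balancing point, here is a toy health-check failover in Python. The endpoint URLs are hypothetical, and a real deployment would use a load balancer or DNS failover; this only sketches the principle of probing replicas and routing to the first healthy one:

```python
# Toy failover: probe replicas in priority order and return the first
# healthy endpoint. URLs are hypothetical.
import urllib.request

ENDPOINTS = [
    "https://app.primary-dc.example.com/health",
    "https://app.dr-site.example.com/health",
]

def pick_healthy_endpoint(endpoints, timeout_s=2.0):
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # unreachable or timed out; try the next replica
    raise RuntimeError("no healthy endpoint available")

print(pick_healthy_endpoint(ENDPOINTS))
```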
One of the biggest things to note here is telemetry. Understanding the underlying status of your cloud and the components and technologies that drive it means you can pull the telemetry and reporting you need not only to diagnose and recover from an event when it happens, but potentially to prevent the incident in the first place. Of course, this means choosing a technology stack that can provide the required level of detail and intelligence in its telemetry data.
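As a minimal illustration of turning telemetry into early warning, the sketch below polls basic host metrics with the psutil library and flags trouble before it becomes an outage. The thresholds are arbitrary examples, not tuned values:

```python
# Sketch: poll basic host telemetry and raise an early warning when a
# metric crosses a threshold. Thresholds are illustrative only.
# Requires: pip install psutil
import time
import psutil

THRESHOLDS = {
    "cpu_percent": 90.0,    # sustained CPU saturation
    "disk_percent": 85.0,   # a filling disk often precedes an incident
}

def check_once():
    warnings = []
    cpu = psutil.cpu_percent(interval=1)
    disk = psutil.disk_usage("/").percent
    if cpu > THRESHOLDS["cpu_percent"]:
        warnings.append(f"CPU at {cpu:.0f}%")
    if disk > THRESHOLDS["disk_percent"]:
        warnings.append(f"Disk at {disk:.0f}%")
    return warnings

while True:
    for warning in check_once():
        print(f"EARLY WARNING: {warning}")  # hook into alerting here
    time.sleep(60)  # poll every minute
```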