Holiday Downtime, Without Data Downtime

C. Bostian

December 27, 2022

Data center downtime can be costly. Gartner estimates that downtime can cost $5,600 per minute, which extrapolates to roughly $336,000 per hour. When your organization’s digital service is interrupted, it can impact employee productivity, company reputation, and customer loyalty. It can also result in the loss of business, data, and revenue. With the holiday season in full swing, we have tips on how to enjoy holiday downtime while avoiding the high costs of data center downtime.

To prevent data center downtime, it’s important to first understand why it happens. There can be many causes. One analysis found that the main cause of unplanned downtime is software system failure (27%), followed by hardware system failure (23%), human error (18%), network transmission failure (17%), and environmental factors (8%). Human error is thought to make up 55% and 22% of critical application downtime by contributing to 40% of operation errors and system outages, respectively. Only around 7% of the system outages involved security-related incidents. Much of this downtime results from mistakes by inexperienced staff or, more rarely, intentional and malicious activities by employees. These errors tend to happen when changes are implemented, such as upgrading software, patching, and reconfiguring systems. In the ever-evolving world of technology, stagnation is not an option, so the solution is to find guardrails for change that don’t impede innovation.

Handling the Holidays

Since human error contributes to a significant portion of data center downtime, one of the safest ways to handle the holidays is to put a hold on changes. A best practice across most tech organizations is to have a system embargo during special dates: for example, right before a new product launch, during special events, and during the holidays. Usually, a code freeze will be put in place a few days or even a week before the special period to ensure that no mistakes are pushed into the system.
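A freeze like this is easier to honor when the deployment pipeline checks it automatically. Here is a minimal sketch, assuming a hypothetical list of protected periods and a one-week lead time; the dates, the names `PROTECTED_PERIODS` and `FREEZE_LEAD_DAYS`, and the function `is_change_frozen` are all illustrative and not taken from any particular tool.

```python
from datetime import date, timedelta

# Hypothetical protected periods as (start, end), both inclusive.
# The dates are illustrative only.
PROTECTED_PERIODS = [
    (date(2022, 12, 24), date(2023, 1, 2)),  # holiday period
]

# Assumed lead time: the freeze begins this many days before each period.
FREEZE_LEAD_DAYS = 7


def is_change_frozen(day, periods=PROTECTED_PERIODS, lead_days=FREEZE_LEAD_DAYS):
    """Return True if `day` falls in a protected period or its lead-in window."""
    for start, end in periods:
        freeze_start = start - timedelta(days=lead_days)
        if freeze_start <= day <= end:
            return True
    return False


if __name__ == "__main__":
    today = date.today()
    if is_change_frozen(today):
        print(f"{today}: change freeze in effect, deployment blocked")
    else:
        print(f"{today}: no freeze, deployment allowed")
```

A deployment job could call a check like this first and stop early during the freeze, with a documented override for emergency fixes.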

Load testing the system can also be helpful before the holidays. This is especially relevant if you have an e-commerce site or another service that may become more popular during this time. Using load testing, you can check the performance of your site, app, or software under different loads. This will help you to better understand how it performs when accessed by a large number of users, and at what point bugs, errors, and crashes become an issue. It will also help expose bottlenecks and security vulnerabilities that occur when the load is particularly high. Knowing the limitations of your system can help with setting up alerts and providing relevant information to the people solving any issues that do arise.
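To make the idea concrete, here is a small sketch of a stepped load test using only the Python standard library. It is an illustration under assumptions, not a replacement for a dedicated load testing tool: `fake_checkout` is a hypothetical stand-in for a real request to your service, and the concurrency levels and request counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_checkout(_request_id):
    """Hypothetical stand-in for a real request to the system under test."""
    time.sleep(0.001)  # simulate a small amount of work


def run_load_step(handler, concurrency, total_requests):
    """Call `handler` total_requests times using `concurrency` worker threads."""

    def timed_call(request_id):
        start = time.perf_counter()
        try:
            handler(request_id)
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(latency for latency, _ in outcomes)
    errors = sum(1 for _, ok in outcomes if not ok)
    # Approximate 95th percentile using the nearest-rank method.
    p95_index = max(0, int(0.95 * len(latencies)) - 1)
    return {
        "concurrency": concurrency,
        "requests": total_requests,
        "error_rate": errors / total_requests,
        "p95_seconds": latencies[p95_index],
    }


if __name__ == "__main__":
    # Step the load up and watch where latency or errors start to climb.
    for level in (1, 5, 20):
        print(run_load_step(fake_checkout, level, total_requests=100))
```

The level at which the error rate or 95th-percentile latency starts to climb is a reasonable starting point for alert thresholds.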

As is often said: failing to plan is planning to fail. Having a good on-call plan in place, including documentation and an alert management system, will go a long way toward limiting any downtime that does occur. Many organizations have a rotating schedule over the holidays, where different engineers are on call for 24-hour periods. A good alert management system helps expedite the response by notifying the on-call engineer of issues quickly and, ideally, proactively.
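A rotating schedule of 24-hour shifts is simple enough to generate and query in a few lines. The sketch below is one possible approach under assumptions: the engineer names, start time, and the helpers `build_rotation` and `on_call_at` are hypothetical, and a real setup would normally live in your paging or alert management tool.

```python
from datetime import datetime, timedelta


def build_rotation(engineers, start, shifts, shift_hours=24):
    """Assign engineers round-robin to consecutive shifts starting at `start`."""
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(hours=shift_hours * i)
        schedule.append({
            "engineer": engineers[i % len(engineers)],
            "start": shift_start,
            "end": shift_start + timedelta(hours=shift_hours),
        })
    return schedule


def on_call_at(schedule, moment):
    """Return the engineer whose shift covers `moment`, or None if uncovered."""
    for shift in schedule:
        # Shift start is inclusive, shift end is exclusive.
        if shift["start"] <= moment < shift["end"]:
            return shift["engineer"]
    return None


if __name__ == "__main__":
    # Hypothetical team and holiday window.
    team = ["ana", "ben", "chi"]
    rotation = build_rotation(team, datetime(2022, 12, 24, 9, 0), shifts=5)
    for shift in rotation:
        print(shift["start"], "->", shift["end"], shift["engineer"])
```

An alerting hook can then route a page to `on_call_at(rotation, datetime.now())`, with a `None` result signaling a gap in coverage to fix before the holidays.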

Barriers to Availability

Data architectures are becoming increasingly complex, which makes them more rigid and fragile. Many rely on multiple discrete data sources, multiple layers, various interfaces, and a spaghetti of pipelines. In this modern-day scenario, building high-availability applications becomes increasingly difficult. Each of the sources, layers, pipelines, and applications built on top becomes an additional “point of failure” that the high-availability architecture must account for.

Human error is one of the major adversaries of availability. In 2017, a mistyped command at Amazon took down S3, its popular cloud storage service, and with it a good portion of the internet. Human error extends beyond typos. Without proper data observability, logging, governance, and documentation, the range of potential human errors is wide. For example, relying on low-quality data sources can cause bugs that are hard to identify at large scales.

Security threats can also result in downtime. During the holidays, the technical teams that organizations rely on to secure services may be less available, which makes the holidays a vulnerable time for attacks. Overly complex data architectures, multiple disparate pipelines, dark data, and improper governance can all present serious security risks.

Achieving High Availability

To achieve five nines (99.999%) availability, technical teams need modern tools to overcome the barriers described above. DataOS, an operating system for your data stack created by Modern, can help with all of these obstacles to availability, and more. It supports every data lifecycle stage while improving the quality, discoverability, and observability of your data.
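To put five nines in perspective, an availability target translates directly into a downtime budget. The short sketch below does the arithmetic, assuming a 365-day year; the function names are illustrative, and the cost figure reuses the Gartner estimate of $5,600 per minute cited earlier. At 99.999%, the budget works out to about 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year
COST_PER_MINUTE_USD = 5600        # Gartner estimate cited earlier in this post


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, for an availability target."""
    return period_minutes * (100 - availability_percent) / 100


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage of the given length."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} min/year, "
              f"about ${downtime_cost(budget):,.0f} at the cited rate")
```

Each additional nine cuts the yearly budget by a factor of ten, which is why the last few nines depend so heavily on preventing incidents rather than just recovering from them quickly.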

As a layer on top of your legacy or modern databases, it enables a modern programmable enterprise with a composable, flexible data architecture. DataOS weaves a connective fabric between all of your data sources, dramatically reducing the number of fragile pipelines. This simplified data architecture means fewer potential points of failure. Built-in tools provide unprecedented observability, helping teams to quickly understand, diagnose, and manage data health. The flexible, robust architecture and heightened visibility and observability of data provided by DataOS translate to increased capacity to prevent downtime.

Especially during the holidays, teams are stretched thin. This compounds existing strains on IT teams that already spend most of their time on maintaining data, leaving less time to derive value from it. DataOS automates significant portions of the tedious, but essential, data-gathering and engineering tasks. This leaves more time for technical teams to dedicate to operationalizing data and preventing downtime.

While it may not be possible to prevent all service failures and interruptions, it may be possible to predict them. Predictive analytics can be an invaluable tool for preventing IT disasters. Accurate predictions require two things: the ability to properly store and access large data sets of historical performance information, and machine learning capabilities. That’s why DataOS contains all the essential tools for building high-performing machine learning. Out-of-the-box UI, CLI, and AI tools support every stage of the data development lifecycle, from finding, accessing, governing, and modeling data to measuring impact. With DataOS, your teams can build predictive analytics to solve problems proactively, without unexpected interruptions during the holidays.

Don’t let data center downtime interfere with your holiday downtime. Learn more about DataOS here.
