Zero Database Downtime - Can it be Done?
Non-stop availability has become the baseline in today’s fast-paced business world. Around-the-clock customer support, always-open online shops and 24-hour convenience stores have created a significant shift in customer expectations. If your business is not available when your customer needs you, well… then you are out of business.
Business operations teams therefore take any database downtime very seriously, because downtime translates directly into lost revenue. Database downtime reportedly costs companies an average of $7,900 per minute, and the average outage lasts 90 minutes. Many companies report enduring around 97 hours of database downtime per year.
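To put those figures in perspective, here is a back-of-the-envelope calculation. It is only a sketch: the constants are the averages quoted above, and the function names are made up for illustration.

```python
# Figures quoted in the article; treat them as illustrative averages.
COST_PER_MINUTE_USD = 7_900
AVG_OUTAGE_MINUTES = 90
DOWNTIME_HOURS_PER_YEAR = 97
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def outage_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Cost of a single outage of the given length, in US dollars."""
    return minutes * cost_per_minute


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the database was available."""
    return 1 - downtime_hours / period_hours


if __name__ == "__main__":
    print(f"Average outage cost: ${outage_cost(AVG_OUTAGE_MINUTES):,.0f}")
    print(f"Availability at 97 h/year: {availability(DOWNTIME_HOURS_PER_YEAR):.2%}")
```

With these inputs, an average 90-minute outage works out to $711,000, and 97 hours of downtime per year corresponds to roughly 98.89% availability.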
While the hard revenue loss is compelling, the damage to reputation can be even more significant in today’s hyper-connected social media world, further driving businesses to find proactive solutions to approach zero database downtime.
While it’s virtually impossible to guarantee 100% availability and zero downtime, getting as close as possible is essential. First, businesses should take steps to minimize the potential for outages that cause unplanned downtime. These include hardware failures and equipment malfunctions that plunge the business into crisis mode, requiring staff to detect the issue, repair the damage and get the systems up and running as fast as possible. It is also good practice to use network-monitoring systems, which help reduce downtime by serving as an early warning that a system is about to go down.
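As a rough illustration of what "early warning" can mean in practice, the sketch below compares a few metrics against warning and critical limits. The metric names and threshold values are hypothetical, and a real monitoring system would do far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float
    critical: float


# Hypothetical metrics and limits, chosen only for illustration.
THRESHOLDS = {
    "disk_used_pct": Threshold(warn=80.0, critical=95.0),
    "replication_lag_s": Threshold(warn=30.0, critical=300.0),
    "connections_used_pct": Threshold(warn=75.0, critical=90.0),
}


def evaluate(metrics: dict[str, float]) -> dict[str, str]:
    """Return 'ok', 'warn' or 'critical' for each known metric."""
    status = {}
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # unknown metrics are skipped rather than guessed at
        if value >= limit.critical:
            status[name] = "critical"
        elif value >= limit.warn:
            status[name] = "warn"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    print(evaluate({"disk_used_pct": 85.0, "replication_lag_s": 5.0}))
```

The point of the warning tier is to give staff time to act before the critical limit is reached, rather than after the system has already gone down.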
Minimize Planned Downtime
Planned downtime, in contrast to an unplanned outage, occurs when an administrator shuts down the system or restricts operations at a scheduled time in order to implement upgrades, repairs and other changes. Many changes or updates to the database, even small ones, require the production database to be taken offline intentionally for some period of time. Because every extra minute that the systems are down translates into a loss for the business, there are several steps and best practices organizations can follow to minimize the amount of planned downtime:
Strategize the scheduling
Smaller updates may require the system to be offline for only seconds or minutes, while larger updates often take longer. Because planned downtime is scheduled, administrators can arrange for it to occur when it will be least disruptive. Strategize well. If your user base is located in a single region, plan any upgrades for off-hours, such as the middle of the night or the slowest business days. If your user base is global, check which time zones have the fewest users, and plan your scheduled maintenance accordingly. If at all possible, give your users at least a few days' notice of planned downtime.
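If you have hourly traffic figures, finding the least disruptive slot can be automated. The following is a minimal sketch, assuming you can export an average active-user count for each UTC hour; the function name and the sample data are hypothetical.

```python
def quietest_window(hourly_users: list[int], duration_hours: int) -> tuple[int, int]:
    """Return (start_hour_utc, total_users) for the least busy window.

    hourly_users holds 24 values, one per UTC hour. The window may wrap
    around midnight, so a 2-hour window can start at hour 23.
    """
    if len(hourly_users) != 24:
        raise ValueError("expected 24 hourly values")
    if not 1 <= duration_hours <= 24:
        raise ValueError("duration must be between 1 and 24 hours")
    best_start, best_load = 0, None
    for start in range(24):
        load = sum(hourly_users[(start + i) % 24] for i in range(duration_hours))
        if best_load is None or load < best_load:
            best_start, best_load = start, load
    return best_start, best_load


if __name__ == "__main__":
    # Made-up traffic profile: quiet overnight, busy during the day.
    sample = [40, 25, 15, 10, 12, 30, 80, 150, 220, 260, 270, 280,
              275, 265, 250, 240, 230, 210, 180, 150, 120, 90, 70, 50]
    start, load = quietest_window(sample, duration_hours=2)
    print(f"Quietest 2-hour window starts at {start:02d}:00 UTC ({load} users)")
```

For a global user base, feeding in combined traffic from all regions gives the window with the fewest affected users overall.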
Identify the root of database issues
If a database glitch is found, spend the time necessary to uncover the root cause of the problem, even if it takes days or weeks to find the cause of performance issues in large, complex, distributed production environments. Avoid placing band-aids that temporarily cover up a larger issue.
Minimize human errors
According to a 2016 survey by ITIC, 52% of respondents identified human error as the leading issue affecting downtime. Unintentional coding errors, misspellings and accidental overwrites are all common issues that can needlessly bring down your systems during scheduled maintenance, regardless of organization size, structure or industry. Proper tools and automation let you put deployment safeguards in place and significantly reduce the risk of error, especially in repetitive or complex tasks. In addition, instituting internal service level agreements between employees can further reduce human error.
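One simple kind of deployment safeguard is an automated check that flags risky statements in a change script before anyone runs it. The sketch below is deliberately naive: the deny-list is illustrative, it is not a real SQL parser, and the function name is hypothetical.

```python
import re

# A small, illustrative deny-list; this is not a full SQL parser.
_RISKY = [
    (re.compile(r"^\s*drop\s+(table|database|schema)\b", re.I), "DROP statement"),
    (re.compile(r"^\s*truncate\b", re.I), "TRUNCATE statement"),
    (re.compile(r"^\s*(delete\s+from|update)\b(?!.*\bwhere\b)", re.I | re.S),
     "DELETE/UPDATE without WHERE"),
]


def risky_statements(script: str) -> list[tuple[str, str]]:
    """Return (statement, reason) pairs for statements that need sign-off."""
    findings = []
    # Naive split on ';', which assumes no semicolons inside string literals.
    for statement in filter(None, (s.strip() for s in script.split(";"))):
        for pattern, reason in _RISKY:
            if pattern.search(statement):
                findings.append((statement, reason))
                break
    return findings


if __name__ == "__main__":
    change = """
        ALTER TABLE orders ADD COLUMN currency TEXT;
        UPDATE orders SET currency = 'USD';
        DELETE FROM audit_log WHERE created_at < '2015-01-01';
    """
    for statement, reason in risky_statements(change):
        print(f"NEEDS REVIEW ({reason}): {statement}")
```

A check like this could run as part of the deployment pipeline, so that a missing WHERE clause is caught by a machine rather than discovered in production.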
Audit: Knowledge is power
Data audits allow administrators to keep their finger on the pulse of all data activities, including modifications and updates. Perform systems and process audits to identify the processes most critical to the business, and work to eliminate single points of failure so that those processes remain available in the future. Implementing a version control mechanism – as is common with app development – further reduces the risk of error. Knowing when, what and how changes were made means that they can be undone fairly easily. Regularly check backups as well.
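To make the version control idea concrete, here is a minimal sketch of versioned schema changes in which every change records when it was applied and carries its own undo step. It uses SQLite from the Python standard library purely as a stand-in database; the table names and migrations are hypothetical.

```python
import sqlite3

# Hypothetical migrations: each version pairs a forward and a reverse statement.
MIGRATIONS = {
    1: ("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
        "DROP TABLE customers"),
    2: ("CREATE INDEX idx_customers_name ON customers (name)",
        "DROP INDEX idx_customers_name"),
}


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied version, or 0 if nothing has been applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history ("
        "version INTEGER PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_history").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection, target: int) -> None:
    """Move the schema up or down to `target`, recording each step."""
    version = current_version(conn)
    while version < target:
        version += 1
        conn.execute(MIGRATIONS[version][0])
        conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
    while version > target:
        conn.execute(MIGRATIONS[version][1])
        conn.execute("DELETE FROM schema_history WHERE version = ?", (version,))
        version -= 1
    conn.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    migrate(db, 2)   # apply both changes
    migrate(db, 1)   # undo the most recent one
    print("schema version:", current_version(db))
```

The history table answers the "when" and "what", and the paired reverse statement is what makes a change easy to undo.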
Test, test, test
Before implementing any updates, test each application or change under conditions that mirror the live environment as closely as possible. Always test your migrations against a copy of your production data to see how long they take and what performance impact they have, and to find ways to mitigate that impact if required.
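A migration rehearsal can be as simple as copying the data, running the change against the copy and timing it. The sketch below assumes SQLite as a stand-in for the production database and uses the standard library's `Connection.backup` to make the copy; `rehearse_migration` is a hypothetical helper name.

```python
import sqlite3
import time


def rehearse_migration(source: sqlite3.Connection, migration_sql: str) -> float:
    """Run migration_sql against an in-memory copy and return elapsed seconds."""
    copy = sqlite3.connect(":memory:")
    try:
        source.backup(copy)  # copies the committed data; the source is untouched
        started = time.perf_counter()
        copy.executescript(migration_sql)
        elapsed = time.perf_counter() - started
        # Basic sanity check on the migrated copy.
        result = copy.execute("PRAGMA integrity_check").fetchone()[0]
        if result != "ok":
            raise RuntimeError(f"integrity check failed: {result}")
        return elapsed
    finally:
        copy.close()


if __name__ == "__main__":
    prod = sqlite3.connect(":memory:")  # stand-in for the real database
    prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    prod.executemany("INSERT INTO orders (total) VALUES (?)",
                     [(float(i),) for i in range(100_000)])
    prod.commit()
    seconds = rehearse_migration(
        prod, "ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD';")
    print(f"Migration took {seconds:.3f} s on the copy")
```

Timings taken on a copy are only as representative as the copy itself, so the closer its size and hardware are to production, the more useful the number.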
Verification and failback
Have a failback strategy in place so that, if the new environment proves unstable, you can return to the previous one without downtime or data loss. Strive for backward compatibility by making schema changes that do not affect the existing code, and by ensuring that the deployed code can still work with the legacy schema. A complementary concept is forward compatibility, where legacy code is written to disregard unknown fields and parameters, so that it keeps working as the schema evolves. Of course, following an upgrade or migration, always confirm that the source and target databases are synchronized.
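One way to confirm that source and target are synchronized is to compare a row count and a checksum for each table. The following is a minimal sketch, again using SQLite as a stand-in; the helper names are hypothetical, and hashing every row may be too slow for very large tables.

```python
import hashlib
import sqlite3


def table_fingerprint(conn: sqlite3.Connection, table: str, key: str) -> tuple[int, str]:
    """Return (row_count, sha256 hex digest) of a table read in key order."""
    digest = hashlib.sha256()
    count = 0
    # Identifiers cannot be bound as parameters, so pass trusted names only.
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"'):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def in_sync(source: sqlite3.Connection, target: sqlite3.Connection,
            table: str, key: str) -> bool:
    """True when both databases hold identical rows for the table."""
    return table_fingerprint(source, table, key) == table_fingerprint(target, table, key)


if __name__ == "__main__":
    old, new = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for db in (old, new):
        db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        db.executemany("INSERT INTO customers VALUES (?, ?)",
                       [(1, "Ada"), (2, "Grace")])
    print("in sync:", in_sync(old, new, "customers", "id"))
```

If the check fails after a migration, the failback strategy described above is what lets you return to the source while you investigate.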
Given that 80% of companies lose revenue when their database is down, a proactive plan to maximize uptime should be at the heart of every business’s operational strategy. Smart scheduling, identifying root causes, minimizing human error, auditing, testing, and performing verification and failback are best practices that your company can implement today to approach zero database downtime effectively.