Say disaster and many IT teams imagine floods, hurricanes and breaches. What we don’t always picture: human error. Maybe that’s because someone hitting the wrong button or deleting files when they mean to transfer them just isn’t as dramatic. But data is valuable no matter how it gets contaminated, stolen or lost – and human error can impact daily workflows and recovery.
In fact, a Dimensional Research study showed human error is a major cause of outages. Consider the number of systems and workflows your staff touches each day and you can see why; it’s just too easy for a database administrator or engineer to make one wrong move and unleash a data disaster.
While you might think simple errors would cause simple problems, they can have as big an impact as a major catastrophe. Remember these examples?
- Amazon’s epic outage in February 2017. While trying to fix a slow billing system, the company’s S3 team decided to take a small number of servers offline. An employee accidentally entered an incorrect command and removed a large number of servers instead. It may have been no more than a typo, but the resulting outage was colossal. Some of the removed servers supported other parts of the S3 system, which forced Amazon to restart every affected system and complete safety checks. Major media outlets and top brands plunged into downtime, and many other organizations were unable to use common applications like Slack.
- NASDAQ going down for almost an hour. Once again, an employee performing a simple operational function set off a chain reaction with ugly consequences. An incorrect data delivery to the index distribution system impacted the Global Index Data Service, which provides valuation data for electronic trades. Trading came to a halt while the team corrected the problem.
- British Airways disrupting travel plans for 75,000 passengers. The outage began when an engineer disconnected a power supply; when it was reconnected, a surge damaged the system. The company was flooded with claims for compensation.
Of course, errors causing outages are only one side of the story. Downtime caused by other factors can be extended by errors on the backup and disaster recovery side. Forgetting to make backups, not snapshotting frequently enough, forgetting to test backups, failing to make more than one copy and leaving tape backups exposed to degradation are all common mistakes that can turn a minor downtime occurrence into an epic disaster.
The Best Defense Against Human Error
Automation. Modern BDR solutions offering automation can save organizations from a variety of human failings. Team members forget to make backups or they simply get lazy and postpone basic system tasks; they screw up replication or restoration and forget to test their backups. When disaster strikes, functioning, current backups are nowhere to be found. Automated testing and other features can protect your team from their mistakes.
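As a rough illustration of what automated backup testing can look like, here is a minimal Python sketch that compares a backup file against its source by checksum. The function names `sha256_of` and `verify_backup` are made up for this example, and a checksum comparison is a simplified stand-in: it is not how any particular BDR product tests backups, and a real test would also attempt a restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source: Path, backup: Path) -> bool:
    """Return True only if the backup file exists and matches the source."""
    if not backup.is_file():
        # A missing backup is exactly the failure nobody notices until disaster strikes.
        return False
    return sha256_of(source) == sha256_of(backup)
```

Run on a schedule, with an alert whenever `verify_backup` returns `False`, a check like this takes the "forgot to test the backup" mistake out of human hands.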
Redundancy. Just as a system administrator might delete the wrong files, someone can accidentally delete or damage backups. If that happens, having another set is the best insurance there is. If you haven’t already, consider pairing your offsite datacenter with cloud backups or some other layered arrangement.
Encryption. If someone on the team accidentally exposes your most sensitive data, you’ll feel a lot better knowing it’s encrypted. This applies to backups as well as other systems.
Safeguards. After its epic outage, Amazon added measures that prevent engineers from removing so much server capacity at once. Do a risk assessment of the human errors that might happen, then add safeguards to stop them from happening.
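To make the idea of a safeguard concrete, here is a small Python sketch of a check that refuses to remove too much capacity in one command. Everything in it is hypothetical: the `check_removal` function, the 5 percent limit and the minimum of 10 remaining servers are assumptions chosen for illustration, not a description of Amazon's tooling.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request exceeds the configured safety limits."""


def check_removal(total_servers: int, servers_to_remove: int,
                  max_fraction: float = 0.05, min_remaining: int = 10) -> int:
    """Return the number of servers to remove, or raise if the request is unsafe.

    The default limits are illustrative assumptions, not recommended values.
    """
    if total_servers < 0 or servers_to_remove < 0:
        raise ValueError("server counts must be non-negative")
    # Refuse requests that take out too large a share of the fleet at once.
    if servers_to_remove > total_servers * max_fraction:
        raise CapacityGuardError(
            f"refusing to remove {servers_to_remove} of {total_servers} servers"
        )
    # Refuse requests that would leave the fleet below its minimum size.
    if total_servers - servers_to_remove < min_remaining:
        raise CapacityGuardError("removal would leave too few servers running")
    return servers_to_remove
```

With a guard like this in front of an operational command, a typo that turns a small removal into a large one is rejected instead of executed.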
Simplicity. According to the Dimensional Research study, 59 percent of study participants claim growing network complexity is a contributing factor to human error. This is one of the best reasons out there for choosing unified systems that are simple and intuitive to use. Obviously at Quorum we’re all about bringing your BDR components under one roof with our solution, but we also believe that teams should step away from multi-vendor chaos in general. Quick and simple technology goes far in eliminating confusion and mistakes.
The most galling thing about human error might be the self-blame. Fires break out, tornadoes sweep through and criminals attack; but human error feels more preventable. While no IT team will ever be completely free of mistakes, you can set your organization up now to eliminate some of the more common ones through better technology and more efficient recovery practices.