Top Causes of SQL Server Downtime!
Every DBA should ask themselves this question when they start their day in their SQL Server environment. You might think I'm kidding and wonder why you should think about it at all. Recently I took part in SQL Server Open World (SSOW), and after my presentations on business continuity and high availability, several attendees responded to my call for feedback about the primary causes of planned and unplanned SQL Server outages in their environments.
The first question I asked them was "Have you tested your implementation?" A few of them smiled at me and a few kept quiet. The following list presents the reasons in order of how many attendees reported or agreed on each as their primary cause of downtime, not in order of seriousness. After all, your server is either available or it isn't. Users and employers don't care why they can't get to the database; they simply want the database up and running when they need it.
1. Service Pack and Hotfix Releases
Applying service packs and other patches was the leading cause of downtime (I include OS- and database-level patches and service packs in this group). In light of the many security-related critical updates that Microsoft has recently released, the company needs to improve its patch-management functionality so that customers can apply service packs and patches without having to reboot the affected SQL Server machines.
2. Random Bugs and Unknown Problems with Their Own Code
Random bugs? Most attendees described these as memory-leak problems. Everyone wants to blame Microsoft for these kinds of problems, but I've seen many cases where client-developed code or third-party software was responsible for the outages. So do not assume that code which works in one environment will work in every environment; always test it before deploying it in a large-scale environment.
3. Errors in Administration (Human Errors)
"It was my fault." A few of the attendees summed up this common IT problem by saying, "The reasons we experience downtime are usually caused by human error or network problems. Our main problems come from technicians who either reconfigure something without notifying anyone or test something without notification or from equipment that breaks down. My main frustration comes from technicians who forget to communicate." This reason falls into the "oops, I forgot" class of problems and is a reminder that high availability and business continuity are equal parts technology and human policies and procedures.
4. Lack of Knowledge and Training (call it SQL or Web)
This reason is directly related to the errors-in-administration problem. The difference is that, in the previous problem, DBAs know what to do and simply either don't do it or do it incorrectly. However, sometimes DBAs make mistakes because they don't know better. You'll never have a highly available database environment without investing in the human component of high availability through policies, procedures, and training.
I feel the following causes are also important, although the attendees I met at SSOW didn't mention them:
· Adding indexes to very large tables, which causes blocking
· Virus attacks
· Complex environment interdependencies
With formal procedures in place, you can achieve high availability with lower cost and less downtime for your application. Do not jump into quick decisions about deploying new patches in your environment. Test each patch before applying it at large scale.