The typical Database management strategy employed in most environments is something like this: the DBAs setup a server to the developers/applications requirements, configure alerting, setup maintenance jobs for backups and index maintenance, and then wait for any issues/request that may arise. As long as the developer is happy and the application is working, what’s left to do? The obvious answer is plenty, but it’s understandable that a DBA or accidental DBA would be more concerned about issues they are facing rather than performing regular health checks on a system that is currently functioning at a high level. This is the reactive rut that DBAs and the environments they manage get stuck in. By taking a mainly reactive stance to managing your environment, you’re most likely exposing your business to critical issues. Let’s walk through the top reactive offenses I see in firms before they become Agio DBA clients.
- Old Backups
Last summer we were onboarding a client that had a mature SQL Server environment and during our evaluation we came across a database on a prod server that hadn’t been backed up for six months. When we informed the client, I was met with the silence that comes with shock. Why? Turns out, the database was extremely important to their business and they we skeptical of my evaluation because the backup job has been running daily without a single failure. I understood their reasoning, but not their logic. A job completing successfully isn’t indicative that a backup was taken successfully. In this specific case, one of the DBAs was doing an update on this specific database and updated the job to skip this particular database during the update process. The problem was that the DBA never reverted the job back once her work was done. Luckily for the client, they found out due to our evaluation instead of the hard way.
A proactive approach to this problem is to have a separate alerting process that independently checks those critical processes actually doing the work. You have to verify the results of the process, not that your critical process ran correctly.
- Corruption Checks
Of the four main regularly scheduled maintenance tasks, checking databases for corruption is an afterthought and occasionally not even done. It’s easy to understand why. Doing Index maintenance impacts almost every transaction run on your servers and is highly visible to the rest of the business. Your developers or application users will start treating you like a rock star after you take a process that once took 30 seconds down to less than a second. Not so much when you perform regularly scheduled corruption checks because no one’s patting you on the back when you restore a backup and everything runs perfectly.
The problem is that when corruption happens, it doesn’t always show up. It could have literally occurred a year ago and that means every backup taken since also contains this corruption. There’s no worse time to learn that you should’ve been checking for database corruption when you are trying to recover from a DR outage, and you are unable to restore your database from backup.
- Proactive Resource Management
One of the first steps we do at Agio when bringing on a new client is to execute an extensive evaluation of your environment. We look into every part of the OS, the instance, configuration, and database to compare it against known best practices. No changes are made during this step, we’re just documenting to create a clear picture of your environment. Once this is done, we go line by line with you to discuss everything we found, our recommendations, and then decide with you which changes we should implement and which we should disregard. After this, we have a solid baseline to move forward with monitoring.
As soon as our evaluation is done however, it’s already outdated. Five minutes after we give a server the all clear, a new proc could’ve been introduced that completely tanks the server memory. What was once more than enough memory is no longer enough. Resource management needs to be a continuous part of managing your database environment. You can quickly outgrow your memory, storage, network, and processor. Keeping on top of how resources are being used is paramount to heading off issues before they occur.
Of all the clients we’ve onboarded, not a single one has had a defined security policy for their database environment. Once this is brought to the forefront of our clients’ minds, it’s usually a quick obstacle to hurdle in terms of setting up a defined policy everyone can agree upon. That said, having a defined set of instructions on how your environment is to be permissioned, accessed, and connected to is only half the battle.
How do you enforce your security policy and verify that your admins aren’t skirting around your security policies? Our recommendation is to put in place a robust reporting and auditing framework designed around your specific security policy. You can’t hope everyone is following your policy and then find out after a breach employees were in violation. Finally, regular audits of who has permissions and regular reporting on security changes also needs to be part of your proactive strategy.
- Poor Query Performance Tracking & Reporting
As mentioned in the introduction, most DBAs won’t know that a proc is performing slowly until a developer or end user reaches out letting them know. By the time the DBA finds out about this slowness, countless hours of troubleshooting could’ve occurred before the developer was able isolate the issue to a single proc. Instead, it’s better to track changes to performance and generate regular reports on procs that are running slower against previous run time. If for no other reason than to see the surprise on your developer’s face when it’s YOU that tells him/her their proc is running slow.
The usual Database Administration practice of being reactive to your customers isn’t enough in today’s data driven world. You need to make sure your environment is reliable by ensuring backups and corruption checks are running and functioning correctly. You need to make your environment is secure by making sure your security policy is being followed. And you need to ensure your environment is performing optimally by identifying poor performance when it happens and actively monitoring resource utilization. After all, if you aren’t being proactive when managing your database environment, you’re essentially waiting for on an outage or critical issue to identify issues in production. Isn’t that too late?