Here at MILL5, we realize that customers need help in maintaining their solutions which is why we automate as much as possible when we build them. When that is not enough, we help maintain these solutions by offering managed services at a lower cost than hiring someone full time. This article will talk about a customer and their need to support a mission critical application running on SQL Server.
Recently we were called into a brand new customer to help them address concerns over their SQL Server 2012 AlwaysOn Availability Group. This customer does not have a full-time DBA on staff. What they have are infrastructure engineers and architects that know how to manage their virtual server environment and the hardware it runs on. So when they called us, they wanted to talk about SQL Server. I just might know a little about SQL Server having spent 10 years at Microsoft focused on the SQL Server community. During that time I ran the SQL Server Upgrade and Compatibility Labs, was a member of the SQL Server Insiders, and obtained my SQL Server Certified Master certification. It was clear that this was an opportunity for us to help this customer.
This particular customer was running a custom application, Hyland’s OnBase, with SQL Server running in an AlwaysOn Availability Group. Like many custom applications, we needed to know the best possible high-availability and disaster recovery solution. After talking to both the customer and the vendor, we were able to come away with several recommendations.
Check the placement of your quorum voting members.
Typically you setup your Windows Server Failover Cluster to support automatic failover from a primary server to a secondary server in the same data center for high availability. Additional secondary servers can be setup in the same data center for better high availability or in a remote data center for disaster recovery.
This customer had a three node cluster, a primary and synchronous secondary in their primary data center and an asynchronous secondary in their disaster recovery data center. Automatic failover was setup between the primary and secondary for high availability and manual failover to their disaster recovery data center for disaster recovery. The remote node in the disaster recovery data center was setup so that it did not get a vote in the quorum process for failover (i.e. NodeWeight equal to 0). This makes sense since we are not supporting automatic failover to the remote data center. Unfortunately, this leaves us with only two voting nodes which meeant that we cannot have automatic failover for high availability. To fix this, an additional quorum member using a file share was added to maintain a majority vote. This file share must be accessible to all voting node in the cluster can be used as a voting witness.
Here is where things get interesting. This file share was placed at a third data center. Yes, you heard correctly. You might be wondering why this was done. We are not exactly sure, but we suspect that the file share was placed in a third data center for automatic failover to the disaster recovery site. To have automatic failover to the disaster recovery data center, you need to add a fourth node in your disaster recovery data center and your file share needs to be at the third data center as a tie breaker in case network connectivity was lost between the primary and disaster recovery sites. Also, placing a file share in a third data center is a general recommendation by Microsoft for multi-site clusters.
So, is having a file share quorum member in a third data center a concern? We think so. Remember, this cluster does not support automatic failover to the disaster recovery data center. The file share is only needed to provide a majority vote between the primary and synchronous secondary server at the primary data center. Having the file share in a third data center only complicates the cluster configuration and introduces a situation where a lack of network connectivity to the third data center leaves the cluster in a state where it cannot automatically failover. So while we were not happy with this configuration, we decided not to address this issue until the upgrade of their SQL Servers from 2012 to 2016.
Make sure you are doing log backups.
Database backups are an important part of ongoing database maintenance. Unfortunately, many companies do not know what to do when basic maintenance tasks like regular databases backups go wrong. The same customer that we discussed previously mentioned that their database log file was consuming a considerable amount of space on disk. In fact, the log file was much larger than the data files for the database.
The database was setup to use full recovery mode was needed for AlwaysOn Availability Groups and point-in-time recovery using backups. If your database is configured to use full recovery, you should be taking regular log backups to free space within the log file so that space can be reused. The process which frees up space for reuse is known as transaction log truncation. Truncation must happen periodically to free up space and to ensure that the size of the log file does not keep growing. That is why it is important to take regular log backups.
When we examined the database we did not see any backups being taken. That means that the log was not being backed up and therefore never truncated. Because this had gone unchecked for some time, the log file increased to a size that was now unmanageable. Taking a log backup will free up space within the log file but it will not reduce the size of the log file on disk. You must shrink the log using the DBCC SHRINKFILE to reduce the size of the file on disk.
This is where the story gets interesting again. The customer suggested that we manually truncate the log using BACKUP LOG WITH TRUNCATE_ONLY. This is absolutely the wrong thing to do. Why? BACKUP LOG WITH TRUNCATE_ONLY is a dangerous command since it empties the contents of the log file without really backing it up. This leaves you susceptible to data loss. That is why Microsoft removed the feature in SQL Server 2008. If you ever truncate your database log file, please do a full backup immediately after to protect yourself from data loss.
With regards to the database, there were a couple lessons to be learned. The first lesson is “Do not ignore regular database maintenance”. As we saw, doing so only causes more problems later on. The next lesson is “Do not play part time database administrator”. Doing so could be a costly mistake. Fortunately for our customer, they were able to avoid these mistakes by engaging MILL5.
While we focused on the database for this article, we had performed an extensive analysis of the OnBase application and worked directly with Hyland to come up with the best possible solution for our customer. The customer now has recommended procedures that reduce the overall recovery time of the OnBase application if a failure were to ever occur. These new procedures decrease the recovery time from hours to either minutes or seconds.
So what is this customer supposed to do going forward? They have three choices:
- Continue to play the part time database administrator role
- Hire a full time resource to be a database administrator
- Engage MILL5 to ensure that their applications are being maintained on an ongoing basis
If you are interested in knowing more about MILL5 and our service offerings, send an email to firstname.lastname@example.org.