As IT professionals, our main mission is to keep applications running successfully. Each application is judged by end-users, using different factors. Exchange is considered successful by users as long as email can be sent and received. Active Directory® is seen as running efficiently, as long as users can access the file shares that they need to. SQL Server, on the other hand, functions like a collection of those, and other services. Depending on your environment and how SQL Server is used, it’s possible that SQL Server is responsible for moving data, storing data, querying data, presenting data, emailing data, and controlling access to data. If any one piece of the puzzle is having an issue with performance, you will begin to lose the confidence of your user base. The key, as with other applications, is to make sure you stay out of the habit of putting out fires, and stay in the business of preventing fires.
The best way to do that is start by implementing a monitoring solution. Notice that I said start. Just because you are monitoring, doesn’t mean that your work is done. Even the best monitoring solutions need to be configured with the alerting levels that match with your organization and your environment. More importantly, you need to have an understanding of what the metrics you are monitoring mean. Even with the best monitoring solutions on the market, just because there is not a warning coming from the application, it doesn’t mean that the metrics aren’t showing problems with your SQL Server. Just like other applications, the key resource groups to monitor, as part of SQL Server, are storage, CPU, and memory. This list isn’t the only metrics you should watch in these categories, but are some of the most popular to track for finding performance issues.
When we think about monitoring storage, it’s easy to focus on just available space. Obviously with SQL Server, if you run out of space, it will stop functioning, but storage issues can also be monitored for latency. To be able to see these issues, most monitoring applications display metrics that shows the number of reads and writes on a database, along with latency of read and write requests. These metrics can be pulled from most monitoring solutions, but are also available by reviewing the Dynamic Management Views (DMV) within SQL Server. For storage issues, one of the more valuable DMVs is sys.dm_io_virtual_file_stats. This DMV keeps a running tally of the occurrences of waits and duration of waits, along with the number of read and write transactions. It allows you to see the total number and duration of when it took place. By looking at those metrics, if you have a high number of waits, it can mean that you don’t have enough drives to handle all of the IO requests, among other things.
The key is to take that data set and create and average to understand what you are looking at. Paul Randall (blog | twitter) wrote a great post here explaining this DMV and offers a great query that creates an aggregate of the dataset that calculates average bytes and latency. Keep in mind, that if one transaction does a large number of reads against one file, these average calculations will be skewed, so these numbers should not be the only things you monitor. If your waits are long in duration, this can highlight that your storage is too slow for running SQL Server on. There’s too many reasons to list why your storage would be slow. Suffice it to say, if you are showing long waits, it is recommended that you look at the medium you are using for storage for the server to make sure that the hardware is working properly. If it is, there is a longer conversation to have about buying better drives, or talking to your SAN administrator about moving the storage to different LUNs. Though, if only one database is showing higher waits, this can point to issues with queries running on the database.
You can also look at counters in Performance Monitor. Performance counters like “Database -> I/O Database Writes/sec” and “Database -> I/O Database Reads/Sec” allow you to see the amount calls to storage for reading and writing data. Also, here is a Performance Monitor Data Collector Template to collect Total Disk Reads and Writes, Page File Usage, and Disk Queue Length. Disk Queue Length is also a tell-tale sign that your storage medium can’t keep up with your workload. When you have a constant high disk queue length, it usually means you have transactions that are waiting for their turn to execute on the disk. We’ll get in to looking for those worst performing queries in a future post. Page File Usage isn’t a sign of IO layer issues, but it can be a cause of issues. If a server has memory pressure and is paging memory out to disk, it can create higher than normal hard disk IO which will take resources away from normal processes.
In the next section, we will take a look at some red flags that indicate you’re experiencing CPU pressure.