When it comes to building software, our end users are the key to what we build. Most developers and project managers focus on the use cases for their software, so whatever is built is seen through the eyes of the user. Likewise, when a database professional gets deployed to a customer location, they may hear comments from end users such as, “It is slow,” or, “The software is not responsive,” or, “It just hangs!” However generic these comments may be, we are expected to solve the problem for the customers who are paying us for the software we sold. Getting to the root cause of these problems can be challenging, because the issue may not even be in our software but in the way the environment is set up.
Recently, I was lucky enough to work on one such case for a customer. On deeper inspection, I discovered the solution lay in a completely different direction than I had originally considered. In this case, the SQL Server DBA reported that the systems were slowing down intermittently, and he needed to know why and how to resolve the issue.
Here’s how I approached this issue:
Start troubleshooting: pinpointing the root cause
There are a number of tools available to help you pinpoint the root cause of performance issues (including SolarWinds Database Performance Analyzer), and I took a structured approach to solving this one. Consider these strategies when you get into troubleshooting:
- Many DBAs may start with Perfmon. Perfmon may tell you about disk pressure, but be sure to ask if this positively identifies the issue as being related to the disks. Remember that servers can have a lot of issues, and you must stay focused on where the real problem is. If you don’t, you may end up fixing a perceived problem that does nothing to address the true end user’s pain.
- We checked Activity Monitor and noticed a large number of PAGELATCH waits. If you look at the processes, you may see many of them waiting for PAGELATCH_SH and PAGELATCH_UP on resource 2:1:3, and sometimes on 2:1:1. For us, this was a significant hint that we were moving in the right direction.
- Here, we paused and looked at the wait resources. 2:1:3 refers to database id 2 (tempdb), file id 1, page id 3. We also discussed that page ids: 1 is the PFS page, 2 is the GAM page, and 3 is the SGAM page. We discussed what each of these meant and what should be the next steps.
- If you query sys.dm_exec_requests, you can see many processes waiting for PAGELATCHes on 2:1:3 and 2:1:1. Cross apply sys.dm_exec_sql_text and you will find that these are almost all calls made to a function. Use the statement_start_offset and statement_end_offset and you will see that they are all waiting on an INSERT … SELECT statement.
- As before, this bit of information assured me I was moving in the right direction.
- When we found the function and looked at the text, we recognized that it was a table valued function. So how is a “table valued function” used? Most likely, it is part of another select statement somewhere.
- By right-clicking on the process in Activity Monitor, you can get to the SQL statement. When you do this, you can see it is in a stored procedure. Open the stored procedure and look at the text; here you find where the table valued function is being called. In all, how many calls to table valued functions are being made for each execution of that stored procedure? Where are the tables returned by these table valued functions being stored?
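The DMV steps above can be combined into a single diagnostic query. This is a sketch; the wait-resource values match the ones we saw (database id 2 is tempdb), but they will differ on your system:

```sql
-- Find sessions waiting on tempdb allocation pages (2:1:1 = PFS, 2:1:3 = SGAM)
SELECT r.session_id,
       r.wait_type,
       r.wait_resource,
       -- Carve the waiting statement out of the full batch text
       SUBSTRING(t.text,
                 (r.statement_start_offset / 2) + 1,
                 ((CASE r.statement_end_offset
                     WHEN -1 THEN DATALENGTH(t.text)
                     ELSE r.statement_end_offset
                   END - r.statement_start_offset) / 2) + 1) AS waiting_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type LIKE 'PAGELATCH%'
  AND r.wait_resource LIKE '2:%';   -- database id 2 = tempdb
```

The offsets are byte offsets into a Unicode string, which is why they are divided by two before being passed to SUBSTRING.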
Analyze the results
What we found was that the contention was on the SGAM and PFS pages in tempdb. The query that seemed to be causing this issue rapidly created temporary objects in tempdb, and then rapidly dropped them. The first eight pages allocated to each of these temp objects come from mixed extents, and thus require updates to the SGAM. Since the objects do not have clustered indexes, every insert into them, as well as every object allocation and deallocation, also hits the PFS.
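The kind of pattern that produces this contention looks roughly like the following. This is an invented illustration, not the customer's actual code; the function, table, and column names are all hypothetical:

```sql
-- Hypothetical multi-statement table valued function: every call
-- materializes a table variable in tempdb and drops it on return.
CREATE FUNCTION dbo.fn_GetAttribute (@OrderID int, @AttrName varchar(50))
RETURNS @result TABLE (AttrValue varchar(100))  -- heap: no clustered index,
                                                -- so inserts track free space via the PFS
AS
BEGIN
    INSERT INTO @result (AttrValue)
    SELECT AttrValue
    FROM dbo.OrderAttributes
    WHERE OrderID = @OrderID
      AND AttrName = @AttrName;
    RETURN;
END;
```

Call a function like this once per column, per row, inside a busy stored procedure, and tempdb spends its time allocating and deallocating tiny objects, which is exactly the SGAM/PFS churn we observed.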
Once we understood the results of our investigation, we were ready to make some changes to address the true cause of the performance issues:
- We implemented trace flag 1118 to stop the use of mixed extents, and so relieve pressure on the SGAM. (Use DBCC TRACEON(1118, -1) to enable the trace flag globally on the running instance; use the -T1118 startup parameter to set it at SQL Server startup.)
- Create multiple tempdb files: one equally sized tempdb data file per logical processor, up to a maximum of eight.
- Rewrite the query to use CROSS APPLY instead of calling the function for each column.
- Send the query back to the developers to find another way to derive the same data set with perhaps only one temp object created per each execution of the stored procedure.
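The first two changes can be scripted as follows. The file path, name, and sizes are placeholders for your environment, not values from this case:

```sql
-- Relieve SGAM pressure by allocating full extents to new objects.
-- -1 makes the flag global; also add -T1118 as a startup parameter
-- so the setting survives an instance restart.
DBCC TRACEON (1118, -1);

-- Add equally sized tempdb data files, one per logical processor up to eight.
-- Repeat with tempdev3, tempdev4, ... as needed; path and sizes are placeholders.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\tempdb\tempdev2.ndf',
          SIZE = 1024MB,
          FILEGROWTH = 256MB);
```

Keeping the files the same size matters: SQL Server's proportional-fill algorithm only spreads allocations evenly across files of equal size.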
After we implemented the trace flag, there was no more contention on the SGAM, but we could not tell if overall contention was reduced.
What we were able to determine, by finding the page number the contention existed on and checking whether it was evenly divisible by 8,088 (SELECT <page number> % 8088), was that we still had PFS contention, but farther into the file. When we rewrote the query to use CROSS APPLY, we could see a definite decrease in contention.
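The modulo check works because SQL Server places a PFS page at page 1 of each data file and then at every 8,088th page after that (8088, 16176, and so on). The page number below is an illustrative example, not one from this case:

```sql
-- PFS pages occur at page 1 and then every 8,088 pages in each data file.
-- A remainder of 0 (or a page id of 1) means the contended page is a PFS page.
SELECT 72792 % 8088 AS remainder;   -- 72792 = 8088 * 9, so the remainder is 0
```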
Sometimes the code that we write can point the problem resolution in a different direction, but if we are systematic, we can get to the true root cause of the problem. There is much to learn about how tempdb works inside SQL Server, and this case study was just a simple example of what you might uncover.