This tale starts two days ago, with one of my Oracle DBAs reporting a problem. “We’re missing records from a table, looks like replication from SQL Server is broken.” Yeah, sure, you Oracle guys are always blaming SQL Server when something goes wrong. Based on the evidence so far, however, he seemed to be correct. Grabbing my cane, I struggled out of my easy chair and hobbled over to the computer console.

The table in question contains 96 million rows and is replicated to both an Oracle data warehouse and a SQL Server warehouse. There is also a companion audit table, populated via a trigger, that captures all update and delete activity occurring within the source table. It would be easy to track down the culprit responsible for these missing rows.
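For reference, the audit mechanism is nothing exotic: a trigger along these lines keeps the companion table fed (a minimal sketch; the trigger name and column list are illustrative, not the production definition):

CREATE TRIGGER trg_BigCriticalTable_Audit
ON dbo.BigCriticalTable
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Record the pre-change image of every affected row, stamped with the time
    INSERT INTO dbo.zBigCriticalTable_Audit (RowID, SecretCode, LogDate)
    SELECT d.RowID, d.SecretCode, GETDATE()
    FROM deleted AS d;
END;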
Personally, I was starting to suspect the maid. A quick comparison of row counts revealed 1,295 missing rows. It also revealed a bigger mystery: the rows were missing from the SOURCE table, not the subscriber table in the Oracle warehouse. Checking the subscriber table in the SQL warehouse revealed the same thing – 1,295 rows that existed in the subscriber table but not in the source. I quickly formed an assumption: somebody had deleted these rows from the source table, but replication had for some reason failed to push those transactions to the subscribers. The audit table should make it easy to find out who; I just needed to know which rows were missing. The primary key on this table is “RowID”, so a simple “NOT IN” query gave me the list of RowID values that were missing.
SELECT RowID
FROM WAREHOUSE.StagingDB.dbo.BigCriticalTable
WHERE RowID NOT IN
(
    SELECT RowID
    FROM PublisherDB.dbo.BigCriticalTable
)
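(NOT IN is safe here because RowID is the primary key; if the subquery column were nullable, NOT IN could silently return zero rows. An equivalent NOT EXISTS form sidesteps that trap entirely – same tables, different predicate:)

SELECT w.RowID
FROM WAREHOUSE.StagingDB.dbo.BigCriticalTable AS w
WHERE NOT EXISTS
(
    SELECT 1
    FROM PublisherDB.dbo.BigCriticalTable AS p
    WHERE p.RowID = w.RowID
)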
Grabbing the first RowID from that list, I queried the audit table, ready to reveal the identity of the person responsible for deleting this missing row.
SELECT *
FROM zBigCriticalTable_Audit
WHERE RowID = 12345
ORDER BY LogDate
And the villain is none other than… Hmmm, odd, I can see that the record was modified a few times, but according to the audit table, it has NOT been deleted. This villain is good, very good, he/she was able to bypass our auditing process. Cinching up my tights, I reached for the next tool in my utility belt – the trace logs that we capture 24 hours a day, 7 days a week.
I knew the specific day that these rows disappeared, so mining the trace logs was simply a matter of loading up all of the logs for that day into a table, then running some queries to find the DELETE statement that removed RowID 12345. I started that load process, and nine hours later, I had my answer. Once again, the villain is none other than… Ummm, hmmm, there doesn’t seem to be a DELETE statement anywhere in the trace logs. I suddenly broke out into a cold sweat. As the numbness began to fade from my brain (manager, remember?), I realized who my opponent was.
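In case you’re curious what that nine-hour load looked like, it was essentially this (the file path and work-table name are made up; sys.fn_trace_gettable with DEFAULT as the second argument walks the whole rollover sequence of trace files):

-- Load the day's trace files into a work table
SELECT *
INTO dbo.TraceWork
FROM sys.fn_trace_gettable(N'X:\Traces\MyServerTrace.trc', DEFAULT)

-- Then hunt for the DELETE that removed the test row
SELECT StartTime, LoginName, HostName,
       CAST(TextData AS NVARCHAR(MAX)) AS SqlText
FROM dbo.TraceWork
WHERE CAST(TextData AS NVARCHAR(MAX)) LIKE '%DELETE%BigCriticalTable%'
  AND CAST(TextData AS NVARCHAR(MAX)) LIKE '%12345%'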
As DBAs, we’re responsible for keeping our data safe and secure. We’re the silent guardians of the information age. The bane of our existence, always lurking, always threatening, is data corruption. Rarely seen these days, but when it is, there’s always a mess to be cleaned up. Corruption is a deep, dark pit, and I just landed at the bottom of it.
Going back to my source table, I took a look at the table structure:
The table has a couple of indexes:
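Boiled down to the parts that matter here, the shape is something like this (a simplified sketch – the column types and trailing column list are illustrative, but the index names and keys match the queries that follow):

CREATE TABLE dbo.BigCriticalTable
(
    RowID       INT          NOT NULL,
    SecretCode  VARCHAR(20)  NOT NULL,
    -- ...plus many more columns...
    CONSTRAINT PK_BigCriticalTable
        PRIMARY KEY NONCLUSTERED (RowID)
)

CREATE CLUSTERED INDEX CIX_BigCriticalTable_SecretCode
    ON dbo.BigCriticalTable (SecretCode)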
These are both non-filtered indexes, and should therefore both be able to locate a record by its RowID value, using an index scan if necessary. With index hints, I forced each index to be used to check for the existence of my test record, RowID = 12345:
SELECT * FROM BigCriticalTable WITH (INDEX(PK_BigCriticalTable)) WHERE RowID = 12345

SELECT * FROM BigCriticalTable WITH (INDEX(CIX_BigCriticalTable_SecretCode)) WHERE RowID = 12345
Aha! When the clustered index is used, I’m able to see my test record. The non-clustered index, in spite of being keyed on RowID, returns nothing. My test row is missing from the non-clustered index. Easy fix, right? Just rebuild the non-clustered index!
ALTER INDEX PK_BigCriticalTable ON BigCriticalTable REBUILD WITH (ONLINE=ON)
Is that the way to climb out of this pit? Let’s have a look:
SELECT * FROM BigCriticalTable WITH (INDEX(PK_BigCriticalTable)) WHERE RowID = 12345

SELECT * FROM BigCriticalTable WITH (INDEX(CIX_BigCriticalTable_SecretCode)) WHERE RowID = 12345
No luck – the row is still missing from the non-clustered index. Back to the bottom of the pit. Time to give DBCC a try:
DBCC CHECKTABLE (BigCriticalTable)
This is the actual DBCC CHECKTABLE output from my real table:
DBCC results for 'HostedAuthorization'.
Msg 8951, Level 16, State 1, Line 1
Table error: table 'HostedAuthorization' (ID 293576084). Data row does not have a matching index row in the index 'IDX_HOSAUT_ClientAuthIDClient' (ID 3). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Line 1
Data row (1:36536:157) identified by (ClientCode = 'XXXX' and ClientStudentID = '32168796' and UNIQUIFIER = 48) with index values 'ClientAuthorizationID = NULL and ClientCode = 'XXXX' and ClientStudentID = '32168796' and UNIQUIFIER = 48'.

(same errors repeat hundreds of times)

There are 94437500 rows in 510195 pages for object "HostedAuthorization".
CHECKTABLE found 0 allocation errors and 2508 consistency errors in table 'HostedAuthorization' (object ID 293576084).
repair_rebuild is the minimum repair level for the errors found by DBCC CHECKTABLE (HCDS.dbo.HostedAuthorization).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Looks pretty scary, right? Imagine running into that in a dark data center!
So, according to that output, it looks like I can fix this with DBCC using the REPAIR_REBUILD option. I just have to put the database into single-user mode first. Which means shutting down an extremely busy system used by people around the world, 24 hours a day, or waiting four more days until the scheduled weekend maintenance outage. Neither is an attractive option. So, what to do?
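For the record, the by-the-book repair path would look something like this (a sketch using the database name from the earlier queries; ROLLBACK IMMEDIATE kicks every active connection off, which is exactly the problem):

-- Kick everyone out (the unacceptable part on a 24x7 system)
ALTER DATABASE PublisherDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE

-- Run the repair from within the database
USE PublisherDB
DBCC CHECKTABLE ('BigCriticalTable', REPAIR_REBUILD)

-- Let everyone back in
ALTER DATABASE PublisherDB SET MULTI_USER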
Let’s take a close look at just what the REPAIR_REBUILD option does. This is direct from Books Online for SQL Server 2008 R2:

REPAIR_REBUILD
Performs repairs that have no possibility of data loss. This can include quick repairs, such as repairing missing rows in nonclustered indexes, and more time-consuming repairs, such as rebuilding an index.
Some specific parts of that stand out, in particular “no possibility of data loss” and the references to rebuilding indexes. Sounds an awful lot like what happens when you run this:
ALTER INDEX ALL ON BigCriticalTable REBUILD WITH (ONLINE=ON)
Is this the secret to climbing out of this pit? After running that ALTER INDEX statement, I checked my query results again:
SELECT * FROM BigCriticalTable WITH (INDEX(PK_BigCriticalTable)) WHERE RowID = 12345

SELECT * FROM BigCriticalTable WITH (INDEX(CIX_BigCriticalTable_SecretCode)) WHERE RowID = 12345
Success! The “missing” record is now visible in both indexes. The city, err, data is saved!
Am I suggesting this as a replacement for DBCC CHECKTABLE? No, not at all. I’m simply offering it as an alternative to a costly system outage, or to waiting and possibly allowing the problem to get worse. It’s a low-risk option worth trying (unlike heli-skiing). To be clear, it should never be your first choice over the DBCC functions. But if you’re facing a situation where you need to try something NOW, and DBCC isn’t an option, this might be a solution for you. In my case, the missing data was causing problems NOW, but I could not put the database into single-user mode to run DBCC without bringing down the production system.
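And whichever route you take, trust but verify: once the rebuild finishes, re-run the check to confirm the consistency errors are actually gone (same command as before, with informational messages suppressed):

DBCC CHECKTABLE (BigCriticalTable) WITH NO_INFOMSGS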
You can stop with the Christian Bale voice now…