SQL Server Optimizations for High Concurrency
Our business needs very robust, low latency, highly available and durable online transactional system which supports high concurrency for about four weeks in a year. It’s almost like Thanksgiving sale where you mark down very popular item (think of iPhone ) by 100 % . We operate on state level ( K-12 ) online assessment market where entire state takes test in spring for few weeks. To operate in this market , you need to have a robust system which can absorb such a sudden spike in transaction volumes.
Infrastructure:
We use SQL Server 2016 on Windows Server Failover cluster. Our storage layer includes RAID 10 SAN Storage and Local SSD for TempDB.
LATCH Waits
During initial load testing, we have noticed high occurrences of LATCH Waits. While investigating the root cause for LATCH waits , we found that few tables being accessed by high number of concurrent sessions thus creating LATCH Waits. Most of our transactions are very small data set inserts and updates. We also encountered Last Page Insertion Contention.
Latches are internal to the SQL engine and are used to provide memory consistency, whereas locks are used by SQL Server to provide logical transactional consistency.
PAGELATCH is used to synchronize short term access to database pages that reside in the Buffer cache, as opposed to a PAGEIOLATCH, which is used to synchronize physical access to pages in disk. These are normal in every system, the problem being when there is contention . In our use case many concurrent sessions accesses a single page, causing waits and hindering the ability to perform these inserts and updates efficiently.
A page in SQL Server is 8KB and can store multiple rows. To increase concurrency and performance, buffer latches are held only for the duration of the physical operation on the page, unlike locks which are held for the duration of the logical transaction.
From the root cause Analysis, it became very clear that we have LATCH Contention on few tables which needed to be alleviated to improve the throughput.
Latch Waits Noticed
TempDB Allocation Page Contention
DBCC TRACEON(3604) CREATE TABLE PageResults (ParentObject sysname, OBJECT sysname ,Field sysname ,VALUE nvarchar(MAX)) INSERT INTO PageResults (ParentObject, Object ,Field, VALUE) EXEC ('DBCC PAGE(67,1,3987384,3) WITH tableresults')
Root Cause Identification:
Clear WaitStats and Buffer Cache to Initialize
-- Remove all elements from the plan cache for one database DECLARE @intDBID INT; SET @intDBID = ( SELECT [dbid] FROM master.dbo.sysdatabases WHERE name = 'LoadTestDB' ); -- Flush the procedure cache for one database only IF @intDBID IS NOT NULL BEGIN DBCC FLUSHPROCINDB (@intDBID); END; GO -- Reset wait and latch statistics. DBCC SQLPERF('sys.dm_os_latch_stats' , CLEAR) DBCC SQLPERF('sys.dm_os_wait_stats' , CLEAR)
Wait Stats Query:
Declare @ExcludedWaits Table (WaitType sysname not null primary key) -- Waits that can be ignored Insert Into @ExcludedWaits Values ('CLR_SEMAPHORE'), ('SQLTRACE_BUFFER_FLUSH'), ('WAITFOR'), ('REQUEST_FOR_DEADLOCK_SEARCH'), ('XE_TIMER_EVENT'), ('BROKER_TO_FLUSH'), ('BROKER_TASK_STOP'), ('CLR_MANUAL_EVENT'), ('CLR_AUTO_EVENT'), ('FT_IFTS_SCHEDULER_IDLE_WAIT'), ('XE_DISPATCHER_WAIT'), ('XE_DISPATCHER_JOIN'), ('BROKER_RECEIVE_WAITFOR'); Select SessionID = WT.session_id, WaitDuration_ms = WT.wait_duration_ms, WaitType = WT.wait_type, WaitResource = WT.resource_description, Program = S.program_name, QueryPlan = CP.query_plan, SQLText = SUBSTRING(ST.text, (R.statement_start_offset/2)+1, ((Case R.statement_end_offset When -1 Then DATALENGTH(ST.text) Else R.statement_end_offset End - R.statement_start_offset)/2) + 1), DBName = DB_NAME(R.database_id), BlocingSessionID = WT.blocking_session_id, BlockerQueryPlan = CPBlocker.query_plan, BlockerSQLText = SUBSTRING(STBlocker.text, (RBlocker.statement_start_offset/2)+1, ((Case RBlocker.statement_end_offset When -1 Then DATALENGTH(STBlocker.text) Else RBlocker.statement_end_offset End - RBlocker.statement_start_offset)/2) + 1) From sys.dm_os_waiting_tasks WT Inner Join sys.dm_exec_sessions S on WT.session_id = S.session_id Inner Join sys.dm_exec_requests R on R.session_id = WT.session_id Outer Apply sys.dm_exec_query_plan (R.plan_handle) CP Outer Apply sys.dm_exec_sql_text(R.sql_handle) ST Left Join sys.dm_exec_requests RBlocker on RBlocker.session_id = WT.blocking_session_id Outer Apply sys.dm_exec_query_plan (RBlocker.plan_handle) CPBlocker Outer Apply sys.dm_exec_sql_text(RBlocker.sql_handle) STBlocker Where R.status = 'suspended' -- Waiting on a resource And S.is_user_process = 1 -- Is a used process And R.session_id <> @@spid -- Filter out this session And WT.wait_type Not Like '%sleep%' -- more waits to ignore And WT.wait_type Not Like '%queue%' -- more waits to ignore And WT.wait_type Not Like -- more waits to ignore Case When SERVERPROPERTY('IsHadrEnabled') = 0 Then 'HADR%' Else 'zzzz' End And Not Exists (Select 1 From @ExcludedWaits Where WaitType = WT.wait_type) ORDER BY WaitDuration_ms DESC Option(Recompile); -- Don't save query plan in plan cache
Common Latch Waits
; WITH [Latches] AS (SELECT [latch_class], [wait_time_ms] / 1000.0 AS [WaitS], [waiting_requests_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_latch_stats WHERE [latch_class] NOT IN ( N'BUFFER') AND [wait_time_ms] > 0 ) SELECT MAX ([W1].[latch_class]) AS [LatchClass], CAST (MAX ([W1].[WaitS]) AS DECIMAL(14, 2)) AS [Wait_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL(14, 2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (14, 4)) AS [AvgWait_S] FROM [Latches] AS [W1] INNER JOIN [Latches] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM ([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95; -- percentage threshold GO
Latch Waits Resolution:
Approaches Considered:
- In -Memory OLTP
- Replacing Identity Integer Column with GUID as leading Column for Index
- HASH Partioning with Computed Column
Out of these three options, In-Memory OLTP looked very promising but we didn’t have enough time to implement in-memory OLTP Migration. So we adopted second solution. We replaced Identity Integer Columns with GUID which did increase page splits and index fragmentation but our workload is insert heavy. So we made this trade-off.This is not our preferred resolution but we made this trade off based on available time and resources.
In future, we are planning on migrating these hot tables to In-Memory OLTP. Just replacing INT’s with GUIDs, we did see about 20-30 X Performance improvements.
Technique | Advantages | Disadvantages |
---|---|---|
Non-sequential key/index | Allows the use of other partitioning features, such as archiving data using a sliding window scheme and partition switch functionality. | Possible challenges when choosing a key/index to ensure „close enough to‟ uniform distribution of inserts all of the time. GUID as a leading column can be used to guarantee uniform distribution with the caveat that it can result in excessive page-split operations. Random inserts across B-Tree can result in too many page-split operations and lead to latch contention on non-leaf pages. |
Hash partitioning with computed column | Transparent for inserts. | Partitioning cannot be used for intended management features such as archiving data using partition switch options. Can cause partition elimination issues for queries including individual and range based select/update, and queries that perform a join. Adding a persisted computed column is an offline operation. |
Effort | Less Development Effort | More Development Effort |
TempDB Allocation Page Contention
SELECT session_id , wait_type , wait_duration_ms , blocking_session_id , resource_description , ResourceType = CASE WHEN CAST(RIGHT(resource_description, LEN(resource_description) - CHARINDEX(':', resource_description, 3)) AS INT) - 1 % 8088 = 0 THEN 'Is PFS Page' WHEN CAST(RIGHT(resource_description, LEN(resource_description) - CHARINDEX(':', resource_description, 3)) AS INT) - 2 % 511232 = 0 THEN 'Is GAM Page' WHEN CAST(RIGHT(resource_description, LEN(resource_description) - CHARINDEX(':', resource_description, 3)) AS INT) - 3 % 511232 = 0 THEN 'Is SGAM Page' ELSE 'Is Not PFS, GAM, or SGAM page' END FROM sys.dm_os_waiting_tasks WHERE wait_type LIKE 'PAGE%LATCH_%' AND resource_description LIKE '2:%'
TempDB PFS Page Contention Noticed
TempDB Non Allocation Page Contention
PFS Page Contention Alleviation:
From the root cause Analysis, it became very clear that number of TempDB Files we had was not sufficient. We followed the industry standard best practice of 8 Equal Sized TempDB Data Files because we had 8 Cores. We increased our TempDB File Counts by 4 and repeated the load Test Run. At 24 TempDB Files, we didn’t encounter PFS Page Contention and we settled on 24 TempDB Files. This number is lot higher than the best practice recommendation but that’s what worked for us and we derived this through numerous Load Test Runs.
ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp9',FILENAME = N'T:TempDBtempdb_mssql_9.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp10',FILENAME = N'T:TempDBtempdb_mssql_10.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp11',FILENAME = N'T:TempDBtempdb_mssql_11.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp12',FILENAME = N'T:TempDBtempdb_mssql_12.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp13',FILENAME = N'T:TempDBtempdb_mssql_13.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp14',FILENAME = N'T:TempDBtempdb_mssql_14.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp15',FILENAME = N'T:TempDBtempdb_mssql_15.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO ALTER DATABASE [tempdb] ADD FILE (NAME = N'temp16',FILENAME = N'T:TempDBtempdb_mssql_16.ndf',SIZE = 2097152KB,FILEGROWTH =1048576KB ) GO
ThreadPool Waits Observed
We also encountered ThreadPool waits because we had few thousand transactions coming in per second. We increased our Max Worker Thread size. Extreme caution needs to be exercised to modify this configuration because this might lead to unwanted consequences.
Default Values
EXEC sp_configure 'max worker threads', 1472 ; GO RECONFIGURE; GO
ACCESS_METHODS_HOBT_VIRTUAL_ROOT Waits Noticed
This latch class is when a thread is waiting for access to the in-memory metadata entry containing a B-tree’s root page ID. EX access is required to change the root page ID, which typically happens when a B-tree becomes a level deeper during heavy insertions into a new index and the existing root page has to split. Every B-tree traversal has to start at the root page, which requires obtaining this latch in SH mode.
Max Degree of Parallelism and Cost Threshold
We did several run with different Max Degree of Parallelism and Cost Threshold and settled on following values based on our load Test Observation.
EXEC sp_configure 'max degree of parallelism','12' RECONFIGURE EXEC sp_configure 'cost threshold for parallelism','30' RECONFIGURE
Recommended Values by Microsoft
NUMA Node | Processor Configuration | Recommended MaxDOP Values |
---|---|---|
Server with single NUMA node | Less than 8 logical processors | Keep MAXDOP at or below # of logical processors |
Server with single NUMA node | Greater than 8 logical processors | Keep MAXDOP at 8 |
Server with multiple NUMA nodes | Less than 8 logical processors per NUMA node | Keep MAXDOP at or below # of logical processors per NUMA node |
Server with multiple NUMA nodes | Greater than 8 logical processors per NUMA node | Keep MAXDOP at 8 |
Lock Pages in Memory:
We also enabled lock Pages in Memory
Trace Flag 1118
In SQL Server 2016, By default this trace flag is on. This trace flag switches allocations in tempdb from single-page at a time for the first 8 pages, to immediately allocate an extent (8 pages). It’s used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion. But previous versions, you need to explicitly turn on this trace flag.
Learnings:
We learned a lot during this exercise. Supporting high concurrency ( few thousands transactions per second ) needs very robust system and majority of the standard best practices are not applicable in these scenarios. We ran into Latch Waits, Thread Pool Waits, TempDB Allocation Contention and Write Log Waits and solved one problem at a time after carefully considering the tradeoffs. It took several weeks of coordinated effort from several teams to identify the bottleneck in the system and alleviate those bottlenecks.
References: