Recently we got a case where our customer asked us to implement an archiving strategy for their DWH. We want to share how we approached this and what the final output was.

The task was to create an archive version of operational data store tables that are not used for daily reporting and need to be kept only for archival purposes. The data left in the current tables will be queried daily and must be optimized for fast retrieval. Data retention in these tables was set to the last 3 months. Since the system runs in 24x7 mode, we also needed to avoid any blocking during data migration to the archival tables, so it had to be lightning fast. Finally, the ETL processes should not be affected by this change, so no developer work was allowed on that side.

Our approach (High level):

  • Analyze how tables are used to choose the best possible candidate for future partitioning.
  • Set up a proof of concept (POC) and test process functionality.
  • Deploy to production.


Analyze how tables are used to choose the best possible candidate column for future partitioning.

First, we need to carefully examine the existing workload on the server. Since nobody could tell us exactly how the table is used, we had to find out on our own. For this purpose, we chose a lightweight, engine-based feature built on the Extended Events framework called Server Audit. To use this approach, you need Enterprise Edition of SQL Server. Otherwise, I suggest using Extended Events directly: collect all operations for the specific table, then filter out unnecessary events such as maintenance code, deletes, etc.
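
If you go the Extended Events route, a minimal session could look like the sketch below (the session name, file path, and filter are our assumptions, not part of the original setup):

CREATE EVENT SESSION [GEODATA_STATUS_QueryTrace] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
	ACTION (sqlserver.sql_text, sqlserver.client_app_name)
	WHERE (sqlserver.like_i_sql_unicode_string(sqlserver.sql_text, N'%GEODATA_STATUS%'))
)
ADD TARGET package0.event_file (SET filename = N'D:\Trace\GEODATA_STATUS_XE.xel', max_file_size = (100), max_rollover_files = (10))
GO
ALTER EVENT SESSION [GEODATA_STATUS_QueryTrace] ON SERVER STATE = START
GO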

For our test case we use the table GEODATA_STATUS, which holds information regarding delivery of parcels. We set up an audit session to collect the relevant statements.

USE Master
GO
-- Create Audit Container
CREATE SERVER AUDIT [SMT_MT_GEODATA_STATUSTable_QueryTrace]
TO FILE 
(	FILEPATH = N'D:\Trace'
	,MAXSIZE = 10000 MB
	,MAX_ROLLOVER_FILES = 10
	,RESERVE_DISK_SPACE = OFF
)
WITH
(	QUEUE_DELAY = 30000
	,ON_FAILURE = CONTINUE
)

GO
USE StageArea -- the database that holds the audited table
GO
-- Create database specification and place it into audit container
CREATE DATABASE AUDIT SPECIFICATION [SMT_MT_Trace_IssuesTableActivity]
FOR SERVER AUDIT [SMT_MT_GEODATA_STATUSTable_QueryTrace]
ADD (SELECT ON OBJECT::[dbo].[GEODATA_STATUS] BY [public]),
ADD (INSERT ON OBJECT::[dbo].[GEODATA_STATUS] BY [public]),
ADD (UPDATE ON OBJECT::[dbo].[GEODATA_STATUS] BY [public])
GO
-- Enable the audit specification
ALTER DATABASE AUDIT SPECIFICATION [SMT_MT_Trace_IssuesTableActivity] WITH (STATE = ON);
GO
USE master
GO
-- Enable the audit itself
ALTER SERVER AUDIT [SMT_MT_GEODATA_STATUSTable_QueryTrace] WITH (STATE = ON);

After a day of monitoring we examined the results. The analysis code below is slow, but it gave us the conclusions we needed.

WITH CTE
AS
(
	SELECT
		action_id
		-- rough cut of the FROM clause; the over-long length is harmless for grouping
		,SUBSTRING(statement
				,PATINDEX('%FROM%', statement)
				,CHARINDEX('BATCH', statement, PATINDEX('%FROM%', statement)) + 8
				) AS FromClause
		,SUBSTRING(statement, PATINDEX('%WHERE%', statement), LEN(statement)) AS WhereClause
		,COUNT(action_id) AS Occurrences
	FROM sys.fn_get_audit_file('D:\Trace\SMT_MT_GEODATA_*.*', DEFAULT, DEFAULT)
	WHERE action_id IN ('SL','UP') -- SL = SELECT, UP = UPDATE
	GROUP BY action_id
		,SUBSTRING(statement
				,PATINDEX('%FROM%', statement)
				,CHARINDEX('BATCH', statement, PATINDEX('%FROM%', statement)) + 8
				)
		,SUBSTRING(statement, PATINDEX('%WHERE%', statement), LEN(statement))
)
SELECT action_id
	,WhereClause
	,FromClause
	,Occurrences
FROM CTE
WHERE WhereClause LIKE 'WHERE%'
ORDER BY Occurrences DESC


The output might look like this:

It seems that MPSID_BIC3 was used most often in searches, followed by ParcelNumber, so the indexing strategy must reflect this. For partitioning we decided to use the STATUSDATETIME column, and we agreed with the customer to add this partition column to the search clauses of the most used code, as shown in the results of our analysis. The partition elimination feature will then do most of the work for us.
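
For illustration, a typical lookup changes from filtering on MPSID_BIC3 or PARCELNUMBER alone to also bounding the partition column (a sketch; STATUSDATETIME_Correct is the physical datetime column introduced below):

SELECT *
FROM dbo.GEODATA_STATUS
WHERE PARCELNUMBER = '12345678901234'
	AND STATUSDATETIME_Correct >= '2020-07-08'
	AND STATUSDATETIME_Correct < '2020-07-09' -- limits the scan to a single daily partition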

Another thing was to identify the proper granularity for the planned archive storage. We wanted to use columnstore compression (to allow sporadic querying of the table if needed, without loading the CPU too much). We also planned to use table partitioning for fast data switching between tables.

After the initial analysis, we identified that there are enough rows per month to use this approach.

(Output columns, from left: Rows, Month, Year)
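
The check itself can be a simple aggregation; a sketch, assuming the yyyymmddhhmmss layout of the varchar STATUSDATETIME column (the same layout the conversion code below relies on):

SELECT COUNT(*) AS [Rows]
	,SUBSTRING(STATUSDATETIME, 5, 2) AS [Month]
	,SUBSTRING(STATUSDATETIME, 1, 4) AS [Year]
FROM dbo.GEODATA_STATUS
GROUP BY SUBSTRING(STATUSDATETIME, 1, 4), SUBSTRING(STATUSDATETIME, 5, 2)
ORDER BY [Year], [Month]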


During prototype setup we also computed compression ratios, and the results look great.

CLI - row store clustered index

CCI - clustered columnstore index
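
A quick way to compare the two footprints is sp_spaceused against a row store copy and the columnstore version (the table names below match the observations at the end of the prototype script):

EXEC sp_spaceused 'demo.geodata_status_old' -- CLI, row store clustered index
EXEC sp_spaceused 'demo.geodata_status'     -- CCI, clustered columnstore index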


Set up a proof of concept (POC) and test process functionality.

Before we jump into production, we need to carefully determine the correct order of actions and identify any possible showstoppers. So, we started with a POC.

During this process we identified several drawbacks that needed to be resolved.

Unnecessarily wide and inappropriate data type for the partitioning column.

  • Unfortunately, the STATUSDATETIME column was not a datetime but a char data type. Remember, we were not allowed to change it, because the existing ETL processes expect this data type. We came up with a solution: create a new column (STATUSDATETIME_Correct) derived from the existing one, with the proper data type. The conversion code appears in the prototype script below.
  • You cannot use a computed column as the partitioning column, because of a limitation between columnstore indexes and partitioning; a physical column was needed.
  • For a partition switch, you need the same index structure on both tables, even if the index is never used on the archive one. The crucial information was how the indexes are really used on the tables. For this purpose, we used our long-term monitoring (SMT) to find and remove unnecessary indexes from both tables, which saved a lot of storage, since those indexes were row store due to singleton lookup query requirements.

     (Example of such statistics in SMT, not actual data)

  • Column statistics will be configured as incremental, but only for row store indexes; a CCI keeps statistics only for its delta store rows.
  • After the archival process finishes, we need to prevent writes with old STATUSDATETIME values into the current table. We enforce this by updating a check constraint on the current table, as sketched below.
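
A minimal sketch of that enforcement (the constraint name is hypothetical; the boundary date moves forward after every successful switch):

ALTER TABLE demo.GEODATA_STATUS DROP CONSTRAINT CK_GEODATA_STATUS_MinStatusDate
ALTER TABLE demo.GEODATA_STATUS WITH CHECK
ADD CONSTRAINT CK_GEODATA_STATUS_MinStatusDate
CHECK (STATUSDATETIME_Correct >= '2020-07-10T00:00:00') -- oldest non-archived boundary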

For our prototype we used a daily partition granularity; based on our previous findings, there are enough rows per partition. Since we plan to use partition switching, we do not want any data-intensive operations that could hold long schema locks, so we only switch partitions, without using MERGE or SPLIT. A daily partition was a good aim for this table: given the current SQL Server limit on the number of partitions, we can prepare this solution for the next 27 years, which we consider good enough.


We do not plan to keep archive partitions in the current table; we move them to a separate archive table to prevent situations where users omit the partition column in their searches and cause a full table scan. The archive table is placed in a different schema (e.g., Archive), and user access is managed by a database role that allows access only to a specific schema (e.g., dbo), as sketched below.
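
A minimal sketch of that access setup (the role name is hypothetical):

CREATE ROLE reporting_users
GRANT SELECT ON SCHEMA::dbo TO reporting_users
DENY SELECT ON SCHEMA::Archive TO reporting_users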


The prototype code looks like this.

USE [StageArea_TEST]
GO

-- Setup Test Table 
SET ANSI_NULLS, QUOTED_IDENTIFIER,ANSI_PADDING, NOCOUNT ON
GO
CREATE TABLE [demo].[GEODATA_STATUS](
	[id] [bigint] IDENTITY(1,1) NOT NULL,
	[insert_time] [datetime] NULL,
	[TTEVENTS] [varchar](16) NULL,
	[NUMORDER] [varchar](2) NULL,
	[PARCELNUMBER] [varchar](14) NULL,
	[STATUSTYPE] [varchar](5) NULL,
	[ORIGINPARCELNUMBER] [varchar](32) NULL,
	[SERVICECODE] [varchar](3) NULL,
	[ASCODE] [varchar](20) NULL,
	[DZIPCODE] [varchar](9) NULL,
	[DCOUNTRYCODE] [varchar](3) NULL,
	[SCANCODE] [varchar](10) NULL,
	[REASONCODE] [varchar](3) NULL,
	[REASONCODE2] [varchar](3) NULL,
	[REASONCODE3] [varchar](3) NULL,
	[DEPOT] [varchar](7) NULL,
	[AGENTLOCATIONCODE] [varchar](7) NULL,
	[AGENTLOCATIONCODETYPE] [varchar](5) NULL,
	[AGENTPARCELNUMBER] [varchar](32) NULL,
	[COUNTRY] [varchar](3) NULL,
	[CITY] [varchar](32) NULL,
	[STATUSDATETIME] [varchar](14) NULL,
	[TIMEZONE] [varchar](5) NULL,
	[MEMDATETIME] [varchar](14) NULL,
	[DDEPOT] [varchar](7) NULL,
	[DELIVERYTOUR] [varchar](5) NULL,
	[MEASUREDWEIGHT] [varchar](8) NULL,
	[DIMENSION] [varchar](9) NULL,
	[PARCELREF] [varchar](35) NULL,
	[PARCELREF2] [varchar](35) NULL,
	[PODIMAGEREF] [varchar](256) NULL,
	[RNAME] [varchar](35) NULL,
	[TRDPNAME] [varchar](50) NULL,
	[AREA] [varchar](3) NULL,
	[STOP] [varchar](3) NULL,
	[CONTAINERINFO1] [varchar](2) NULL,
	[CONTAINERINFO2] [varchar](200) NULL,
	[SHIPMENTREF] [varchar](35) NULL,
	[SUPPLIERREF] [varchar](35) NULL,
	[CONTAINERREF] [varchar](14) NULL,
	[CONTAINERTYPE] [varchar](2) NULL,
	[ACTORID] [varchar](10) NULL,
	[OPERATIVERID] [varchar](10) NULL,
	[PICKUPORDERNUMBER] [varchar](20) NULL,
	[GPSLAT] [varchar](16) NULL,
	[GPSLONG] [varchar](16) NULL,
	[RETURNPARCELNUMBER] [varchar](14) NULL,
	[COMPINFO1] [varchar](1024) NULL,
	[DelisID] [varchar](50) NULL,
	[CustomerDSW] [varchar](25) NULL,
	[update_time_delisID] [datetime] NULL,
	[update_time_customerDSW] [datetime] NULL,
 CONSTRAINT [PK_GEODATA_STATUS_ID] PRIMARY KEY CLUSTERED 
(
	[id] ASC
)WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_NC_Parcelnumber_GEODATA_STATUS] ON [demo].[GEODATA_STATUS]
(
	[PARCELNUMBER] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_statusdatetime_geodata_status] ON [demo].[GEODATA_STATUS]
(
	[STATUSDATETIME] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

--Load data: only every 100th row from prod, for test purposes
SET IDENTITY_INSERT StageArea_Test.demo.GEODATA_STATUS ON
INSERT StageArea_Test.demo.GEODATA_STATUS ([id]
      ,[insert_time]
      ,[TTEVENTS]
      ,[NUMORDER]
      ,[PARCELNUMBER]
      ,[STATUSTYPE]
      ,[ORIGINPARCELNUMBER]
      ,[SERVICECODE]
      ,[ASCODE]
      ,[DZIPCODE]
      ,[DCOUNTRYCODE]
      ,[SCANCODE]
      ,[REASONCODE]
      ,[REASONCODE2]
      ,[REASONCODE3]
      ,[DEPOT]
      ,[AGENTLOCATIONCODE]
      ,[AGENTLOCATIONCODETYPE]
      ,[AGENTPARCELNUMBER]
      ,[COUNTRY]
      ,[CITY]
      ,[STATUSDATETIME]
      ,[TIMEZONE]
      ,[MEMDATETIME]
      ,[DDEPOT]
      ,[DELIVERYTOUR]
      ,[MEASUREDWEIGHT]
      ,[DIMENSION]
      ,[PARCELREF]
      ,[PARCELREF2]
      ,[PODIMAGEREF]
      ,[RNAME]
      ,[TRDPNAME]
      ,[AREA]
      ,[STOP]
      ,[CONTAINERINFO1]
      ,[CONTAINERINFO2]
      ,[SHIPMENTREF]
      ,[SUPPLIERREF]
      ,[CONTAINERREF]
      ,[CONTAINERTYPE]
      ,[ACTORID]
      ,[OPERATIVERID]
      ,[PICKUPORDERNUMBER]
      ,[GPSLAT]
      ,[GPSLONG]
      ,[RETURNPARCELNUMBER]
      ,[COMPINFO1]
      ,[DelisID]
      ,[CustomerDSW]
      ,[update_time_delisID]
      ,[update_time_customerDSW])
SELECT a.[id]
      ,[insert_time]
      ,[TTEVENTS]
      ,[NUMORDER]
      ,[PARCELNUMBER]
      ,[STATUSTYPE]
      ,[ORIGINPARCELNUMBER]
      ,[SERVICECODE]
      ,[ASCODE]
      ,[DZIPCODE]
      ,[DCOUNTRYCODE]
      ,[SCANCODE]
      ,[REASONCODE]
      ,[REASONCODE2]
      ,[REASONCODE3]
      ,[DEPOT]
      ,[AGENTLOCATIONCODE]
      ,[AGENTLOCATIONCODETYPE]
      ,[AGENTPARCELNUMBER]
      ,[COUNTRY]
      ,[CITY]
      ,[STATUSDATETIME]
      ,[TIMEZONE]
      ,[MEMDATETIME]
      ,[DDEPOT]
      ,[DELIVERYTOUR]
      ,[MEASUREDWEIGHT]
      ,[DIMENSION]
      ,[PARCELREF]
      ,[PARCELREF2]
      ,[PODIMAGEREF]
      ,[RNAME]
      ,[TRDPNAME]
      ,[AREA]
      ,[STOP]
      ,[CONTAINERINFO1]
      ,[CONTAINERINFO2]
      ,[SHIPMENTREF]
      ,[SUPPLIERREF]
      ,[CONTAINERREF]
      ,[CONTAINERTYPE]
      ,[ACTORID]
      ,[OPERATIVERID]
      ,[PICKUPORDERNUMBER]
      ,[GPSLAT]
      ,[GPSLONG]
      ,[RETURNPARCELNUMBER]
      ,[COMPINFO1]
      ,[DelisID]
      ,[CustomerDSW]
      ,[update_time_delisID]
      ,[update_time_customerDSW]
FROM StageArea.dbo.GEODATA_STATUS
		JOIN  (	SELECT  id, ROW_NUMBER () OVER (ORDER BY [STATUSDATETIME]) AS rno  
					FROM StageArea.dbo.GEODATA_STATUS WITH (NOLOCK)
					) AS A
		ON a.id = StageArea.dbo.GEODATA_STATUS.id
WHERE A.rno % 100 = 0 --100th row mark
SET IDENTITY_INSERT StageArea_Test.demo.GEODATA_STATUS OFF


--- Add datetime column [STATUSDATETIME_Correct]
ALTER TABLE StageArea_test.demo.GEODATA_STATUS
ADD [STATUSDATETIME_Correct] datetime NULL 
--Fill column with valid data from old column 
UPDATE StageArea_test.demo.GEODATA_STATUS
SET [STATUSDATETIME_Correct]= DATETIMEFROMPARTS ( SUBSTRING ([STATUSDATETIME] , 1,4)
							, SUBSTRING ([STATUSDATETIME] , 5,2)
							, SUBSTRING ([STATUSDATETIME] , 7,2)
							, SUBSTRING ([STATUSDATETIME] , 9,2)
							, SUBSTRING ([STATUSDATETIME] , 11,2)
							, SUBSTRING ([STATUSDATETIME] , 13,2), 00 )

ALTER TABLE StageArea_test.demo.GEODATA_STATUS
ALTER COLUMN  [STATUSDATETIME_Correct] datetime NOT NULL 
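
GO
-- demo.TiebrakerGenerator builds the comma-separated list of daily boundary values.
-- A possible sketch of it (an assumption, not the production version):
CREATE FUNCTION demo.TiebrakerGenerator (@from datetime)
RETURNS nvarchar(max)
AS
BEGIN
	DECLARE @out nvarchar(max) = N''
	DECLARE @d date = CONVERT(date, @from)
	WHILE @d <= CONVERT(date, GETDATE())
	BEGIN
		SET @out += CASE WHEN @out = N'' THEN N'' ELSE N',' END
					+ N'''' + CONVERT(nchar(10), @d, 120) + N''''
		SET @d = DATEADD(day, 1, @d)
	END
	RETURN @out
END
GO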

-- GENERATE the partition function and scheme (as a replacement for CreateStagingTable)
-- For the sake of simplicity, the generator makes only daily boundaries for the last three months
DECLARE @Boundaries nvarchar(max) = (SELECT demo.TiebrakerGenerator( DATEADD( month ,-3,GETDATE())))
DECLARE @stmt nvarchar(max)

SET @stmt = 'CREATE PARTITION FUNCTION PF_GEODATA_STATUS_STATUSDATETIME (DATETIME)
AS RANGE LEFT FOR VALUES ('+@Boundaries+')'

EXECUTE (@stmt)

CREATE PARTITION SCHEME PS_GEODATA_STATUS_STATUSDATETIME
AS PARTITION PF_GEODATA_STATUS_STATUSDATETIME	
ALL TO ( [PRIMARY])
--- SETUP Partitioning on existing tables 

BEGIN TRANSACTION
ALTER TABLE [demo].[GEODATA_STATUS] 
DROP CONSTRAINT [PK_GEODATA_STATUS_ID] 
WITH ( ONLINE = OFF )

--presort data by building a B-tree clustered index on the partition scheme first
CREATE CLUSTERED INDEX [CL_GEODATA_STATUS_STATUSDATETIME_Correct] ON [demo].[GEODATA_STATUS] ([STATUSDATETIME_Correct])
 ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])
-- then replace it with a clustered columnstore index (CCI)
CREATE CLUSTERED COLUMNSTORE INDEX [CL_GEODATA_STATUS_STATUSDATETIME_Correct] ON [demo].[GEODATA_STATUS] WITH (DROP_EXISTING = ON)
 ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])

 --Setup Partitioned Nonclustered PK
ALTER TABLE [demo].[GEODATA_STATUS] 
ADD  CONSTRAINT [PK_GEODATA_STATUS_ID] 
PRIMARY KEY NONCLUSTERED ([id] ASC,[STATUSDATETIME_Correct] ASC)
WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) 
ON  [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])


--And rest of the indexes
CREATE NONCLUSTERED INDEX [IX_NC_Parcelnumber_GEODATA_STATUS] ON [demo].[GEODATA_STATUS]([PARCELNUMBER] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])

CREATE NONCLUSTERED INDEX [IX_statusdatetime_geodata_status] ON [demo].[GEODATA_STATUS]([STATUSDATETIME] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])

COMMIT TRANSACTION

----- Make the archive table
--CREATE SCHEMA Archive (create the schema first if it does not exist)

CREATE TABLE [Archive].[GEODATA_STATUS](
	[id] [bigint] IDENTITY(1,1) NOT NULL,
	[insert_time] [datetime] NULL,
	[TTEVENTS] [varchar](16) NULL,
	[NUMORDER] [varchar](2) NULL,
	[PARCELNUMBER] [varchar](14) NULL,
	[STATUSTYPE] [varchar](5) NULL,
	[ORIGINPARCELNUMBER] [varchar](32) NULL,
	[SERVICECODE] [varchar](3) NULL,
	[ASCODE] [varchar](20) NULL,
	[DZIPCODE] [varchar](9) NULL,
	[DCOUNTRYCODE] [varchar](3) NULL,
	[SCANCODE] [varchar](10) NULL,
	[REASONCODE] [varchar](3) NULL,
	[REASONCODE2] [varchar](3) NULL,
	[REASONCODE3] [varchar](3) NULL,
	[DEPOT] [varchar](7) NULL,
	[AGENTLOCATIONCODE] [varchar](7) NULL,
	[AGENTLOCATIONCODETYPE] [varchar](5) NULL,
	[AGENTPARCELNUMBER] [varchar](32) NULL,
	[COUNTRY] [varchar](3) NULL,
	[CITY] [varchar](32) NULL,
	[STATUSDATETIME] [varchar](14) NULL,
	[TIMEZONE] [varchar](5) NULL,
	[MEMDATETIME] [varchar](14) NULL,
	[DDEPOT] [varchar](7) NULL,
	[DELIVERYTOUR] [varchar](5) NULL,
	[MEASUREDWEIGHT] [varchar](8) NULL,
	[DIMENSION] [varchar](9) NULL,
	[PARCELREF] [varchar](35) NULL,
	[PARCELREF2] [varchar](35) NULL,
	[PODIMAGEREF] [varchar](256) NULL,
	[RNAME] [varchar](35) NULL,
	[TRDPNAME] [varchar](50) NULL,
	[AREA] [varchar](3) NULL,
	[STOP] [varchar](3) NULL,
	[CONTAINERINFO1] [varchar](2) NULL,
	[CONTAINERINFO2] [varchar](200) NULL,
	[SHIPMENTREF] [varchar](35) NULL,
	[SUPPLIERREF] [varchar](35) NULL,
	[CONTAINERREF] [varchar](14) NULL,
	[CONTAINERTYPE] [varchar](2) NULL,
	[ACTORID] [varchar](10) NULL,
	[OPERATIVERID] [varchar](10) NULL,
	[PICKUPORDERNUMBER] [varchar](20) NULL,
	[GPSLAT] [varchar](16) NULL,
	[GPSLONG] [varchar](16) NULL,
	[RETURNPARCELNUMBER] [varchar](14) NULL,
	[COMPINFO1] [varchar](1024) NULL,
	[DelisID] [varchar](50) NULL,
	[CustomerDSW] [varchar](25) NULL,
	[update_time_delisID] [datetime] NULL,
	[update_time_customerDSW] [datetime] NULL,
	[STATUSDATETIME_Correct] [datetime] NOT NULL,
 CONSTRAINT [PK_GEODATA_STATUS_Archive_ID] PRIMARY KEY NONCLUSTERED 
(
	[id] ASC,
	[STATUSDATETIME_Correct] ASC
)WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100)
)ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])
GO
CREATE CLUSTERED COLUMNSTORE INDEX [CL_GEODATA_STATUS_Archive_STATUSDATETIME_Correct] ON [Archive].[GEODATA_STATUS] WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0)
GO

CREATE NONCLUSTERED INDEX [IX_NC_Parcelnumber_GEODATA_STATUS_Archive] ON [Archive].[GEODATA_STATUS]([PARCELNUMBER] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])

CREATE NONCLUSTERED INDEX [IX_statusdatetime_geodata_status_Archive] ON [Archive].[GEODATA_STATUS]([STATUSDATETIME] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
ON [PS_GEODATA_STATUS_STATUSDATETIME]([STATUSDATETIME_Correct])

-- SWITCH all column stats to Incremental  
--On Db level allow this option
ALTER DATABASE [StageArea_TEST] SET auto_create_statistics on (incremental = on)
GO 

--On Table level
DECLARE @Stmt nvarchar(max)
DECLARE @Schema nvarchar(256) =  QUOTENAME('demo')
DECLARE @TableName nvarchar(512) =  QUOTENAME('GEODATA_STATUS')
DECLARE @StatsName nvarchar(512) 
DECLARE @TablenameWithSchema nvarchar(512) = @Schema+'.'+@TableName


	SET @Stmt =  'UPDATE STATISTICS '+@TablenameWithSchema+' WITH INCREMENTAL = ON'

	EXECUTE (@Stmt)

--Reconfigure table lock mechanism 

	SET @Stmt =  'ALTER TABLE '+@TablenameWithSchema+' SET (LOCK_ESCALATION = AUTO)'

	EXECUTE (@Stmt)
GO

DECLARE @Stmt nvarchar(max)
DECLARE @Schema nvarchar(256) =  QUOTENAME('Archive')
DECLARE @TableName nvarchar(512) =  QUOTENAME('GEODATA_STATUS')
DECLARE @StatsName nvarchar(512) 
DECLARE @TablenameWithSchema nvarchar(512) = @Schema+'.'+@TableName

	SET @Stmt =  'UPDATE STATISTICS '+@TablenameWithSchema+' WITH INCREMENTAL = ON'

	EXECUTE (@Stmt)

--Reconfigure table lock mechanism 

	SET @Stmt =  'ALTER TABLE '+@TablenameWithSchema+' SET (LOCK_ESCALATION = AUTO)'

	EXECUTE (@Stmt)


GO


-- Check for non empty partitions in source table 
SELECT f.name,index_id, partition_number, in_row_used_page_count, row_count ,pr.value
					FROM sys.dm_db_partition_stats AS ps 
						JOIN sys.partition_Range_values  AS pr
							ON  ps.partition_number = pr.boundary_ID	
						JOIN sys.partition_functions As f
							ON  f.function_id = pr.function_id
					WHERE object_id = object_id('demo.GEODATA_STATUS') AND index_id = 1
						AND f.name =  'PF_GEODATA_STATUS_STATUSDATETIME'
						AND  row_count > 0


-- Switch partitions from prod to archive based on the requested time windows.
-- After the switch, no data should be added to the base table for these windows.
-- We enforce this by updating the check constraint.


EXEC ptf.WLDRSwitchpartition
@SourceTable  = 'GEODATA_STATUS'
,@SourceTableSchema = 'demo'
,@TargetTable = 'GEODATA_STATUS'
,@TargetTableSchema = 'Archive'
,@StartPartitionWindow = '2020-07-07 00:00:00.000'
,@EndPartitionWindow  = '2020-07-10 00:00:00.000'
,@PartFunction  = 'PF_GEODATA_STATUS_STATUSDATETIME'
,@PartColumn  = 'statusdatetime_Correct'
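
-- For reference, inside the procedure each daily window boils down to a
-- metadata-only switch like this sketch (do not run it in addition to the call above):
-- DECLARE @p int = $PARTITION.PF_GEODATA_STATUS_STATUSDATETIME('2020-07-07T12:00:00')
-- ALTER TABLE demo.GEODATA_STATUS SWITCH PARTITION @p TO Archive.GEODATA_STATUS PARTITION @p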


-- Check for non empty partitions in Archive table 
SELECT f.name,index_id, partition_number, in_row_used_page_count, row_count ,pr.value
					FROM sys.dm_db_partition_stats AS ps 
						JOIN sys.partition_Range_values  AS pr
							ON  ps.partition_number = pr.boundary_ID	
						JOIN sys.partition_functions As f
							ON  f.function_id = pr.function_id
					WHERE object_id = object_id('Archive.GEODATA_STATUS') AND index_id = 1
						AND f.name =  'PF_GEODATA_STATUS_STATUSDATETIME'
						AND  row_count > 0



--Observations
-- Not too many rows per partition in the test: ~60k; on production (x100) that is ~6M rows per partition, which will be OK
-- Partition granularity in question; maybe use a weekly interval to get above the 100K-1M range
SELECT * FROM sys.partitions WHERE object_id = Object_id('Demo.Geodata_status') ORDER BY Rows DESC
--CLI, no compression: 5.5GB   sp_spaceused 'demo.geodata_status_old'
--CCI: 0.5GB                   sp_spaceused 'demo.geodata_status'

--Partition elimination is working; this reads only partition 70
SELECT  TOP 10 * FROM demo.Geodata_status WHERE STATUSDATETIME_Correct between '2020-07-08' AND '2020-07-09'

--Column stats for a CCI are NOT incremental; they are automatically skipped by SQL Server in the change scripts and updates. Stats seem to be relevant only for the delta store.
--We can use the Archive schema to limit user access and the Demo (future dbo) schema for reporting needs; this prevents users from accidentally reading large amounts of data from the store.
--If we later want to change the storage for the target table, this mechanism will need changes (more of a sliding window scenario) with real data movement; the current action is a data-movement-free operation.

Deploy to production.

The results of the implementation:

We ended up with two tables, in the schemas dbo (general purpose) and Archive (history). The first holds data for the last three months; the Archive one holds data for long-term archival purposes.

We also set up an automated daily job which switches daily partitions from the dbo to the Archive schema in no time. The migration is literally instant: since both tables sit in the same filegroup, the switch is just a metadata operation with no physical data movement.
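
A minimal sketch of such a job (names, schedule, and the sliding window are hypothetical; the step just calls the switch procedure shown earlier):

USE msdb
GO
EXEC dbo.sp_add_job @job_name = N'Archive GEODATA_STATUS daily'
EXEC dbo.sp_add_jobstep @job_name = N'Archive GEODATA_STATUS daily'
	,@step_name = N'Switch expired daily partition'
	,@subsystem = N'TSQL'
	,@database_name = N'StageArea'
	,@command = N'DECLARE @from datetime = CONVERT(date, DATEADD(month, -3, GETDATE()));
DECLARE @to datetime = DATEADD(day, 1, @from);
EXEC ptf.WLDRSwitchpartition @SourceTable = ''GEODATA_STATUS'', @SourceTableSchema = ''dbo''
	,@TargetTable = ''GEODATA_STATUS'', @TargetTableSchema = ''Archive''
	,@StartPartitionWindow = @from, @EndPartitionWindow = @to
	,@PartFunction = ''PF_GEODATA_STATUS_STATUSDATETIME'', @PartColumn = ''STATUSDATETIME_Correct'';'
EXEC dbo.sp_add_jobschedule @job_name = N'Archive GEODATA_STATUS daily', @name = N'Daily 01:00'
	,@freq_type = 4, @freq_interval = 1, @active_start_time = 010000
EXEC dbo.sp_add_jobserver @job_name = N'Archive GEODATA_STATUS daily'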

We achieved great compression for archival purposes.

Before change:

After change with archive table:

Current table:

Thanks to the clustered columnstore index and the removal of unused indexes from the dbo and Archive tables, we got rid of row store indexes, which shrank the table size from 597 GB to 23.8 GB (3.4 GB for dbo + 20.4 GB for the Archive table). This is a nice example of how effective the columnstore storage approach is in this case. The compression ratio was about 25 (597 / 23.8).

 
Final words

By using existing SQL Server features, we were able to help the customer reclaim most of the existing storage space. Around 25 times more data can now be placed in the table without allocating any additional storage; in the meantime, we will surely use the freed space elsewhere.

We also enabled an instant archival process which does not cause any long-term blocking, which is really needed in these days of never-ending retail sales (24x7 coverage).

Finally, in line with our performance tuning history, we made user queries run faster thanks to far fewer logical reads needed to complete them, achieved by partition elimination and columnstore storage. The server uses less CPU, memory, and disk IO for this workload, so we created capacity for future customer projects.

As you saw, it was not entirely straightforward, and we had to tackle several issues, such as updating the ETL process to also fill data into our new partition column; but this small change is outweighed by the overall process gain and the optimization of server resource usage.
