Challenges of Large Scale DML using T-SQL

Using T-SQL to insert, update, or delete large amounts of data in a table will result in some unexpected difficulties if you've never done it before.

Let's say you have a table from which you want to delete millions of records. If the goal were to remove every row, we could simply use TRUNCATE TABLE. However, if we want to remove only records that meet certain criteria, then executing something like the following will cause more trouble than it is worth.

Why not run the following in a production environment?

DELETE FROM myTable WHERE columnA = 'whatever';

Transaction log growth – expect a large transaction like this to cause the transaction log to expand. Calls to grow the database files are some of the most expensive operations SQL Server makes to the OS. Hopefully the initial sizing was good and the growth settings are reasonable.

Blocking – concurrency increases the likelihood of lock escalation, which can lead to excessive blocking.

Rollbacks – if the SPID is killed, the transaction enters a rollback state. For very large transactions this can take a long time. Don't even think about killing or stopping SQL Server services during a rollback unless you want a corrupt database!

What can we do to improve this?

We follow an important maxim of computer science – break large problems down into smaller problems and solve them.

We break the transaction up into smaller batches to get the job done. Each of the pain points above is relieved in this manner. Let's take a look at some examples.

How to Use T-SQL to Delete Millions of Rows

Combine the TOP operator with a WHILE loop to specify the batches of rows we will delete. Tracking progress is easier if we keep a count of iterations and either print it or load it into a tracking table.

Let's set up an example and work from simple to more complex.

--Make table with millions of records
CREATE TABLE Colors
(
    colA varchar(10)
);

INSERT INTO Colors (colA) VALUES ('Red');
GO 1000000

select * from Colors; --1,000,000

--Delete in batches
declare @rc int; --record count
set @rc = 1;

while @rc > 0
begin
    --deleting by sets of 10,000
    delete top (10000) from Colors;
    set @rc = @@ROWCOUNT;
end

Now let’s print some indication of progress.

--Delete in batches with progress printed
declare @rc int;
set @rc = 1;

while @rc > 0
begin
    delete top (10000) from Colors;
    set @rc = @@ROWCOUNT;
    raiserror('deleted 10,000 rows', 10, 1) with nowait;
end

A better way is to store progress in a table instead of printing to the screen.

/* Delete in batches with progress put into table */

--make table for holding progress
create table progress
(
    tableName varchar(128)
    ,totalRecords int
    ,recordsDeleted int
    ,datetimestamp datetime
);

insert into progress (tableName, recordsDeleted)
values ('Colors', 0);

declare @totalRecords int;
select @totalRecords = count(*) from Colors;

update progress set totalRecords = @totalRecords;

select * from progress;

--begin to delete
declare @rc int;
set @rc = 1;

while @rc > 0
begin
    delete top (10000) from Colors;
    set @rc = @@ROWCOUNT;

    update progress
    set recordsDeleted += @rc
        ,datetimestamp = getdate();
end

--View in another query thread
select * from progress with(nolock);

Finally let’s incorporate dynamic SQL.

/* Delete in batches (parameter) with progress put into table
   Keep same progress table logic as above */

declare @sql nvarchar(max);

set @sql = '
declare @batchSize int;
declare @rc int; --record count

set @batchSize = 10000;
set @rc = 1;

while @rc > 0
begin
    delete top (@batchSize) from Colors;
    set @rc = @@ROWCOUNT;

    update progress
    set recordsDeleted += @rc
        ,datetimestamp = getdate();
end';

exec sp_executesql @sql;

How to Use T-SQL to Insert Millions of Rows

Similar principles apply to inserting large amounts of data from one table to another. We cannot reuse the code above by simply replacing the DELETE statement with an INSERT statement.

What do you see wrong here?

--Source table
CREATE TABLE sourceTable
(
    colA varchar(10)
);

--Target table
CREATE TABLE targetTable
(
    colA varchar(10)
);

INSERT INTO sourceTable (colA) VALUES ('Red');
GO 1000000

declare @rc int;
set @rc = 1;

--WRONG!
while @rc > 0
begin
    insert into targetTable
    select top 10000 * from sourceTable;

    set @rc = @@ROWCOUNT;
end

The problem is a logic error – we've created an infinite loop! Each iteration inserts the same top 10,000 rows from the source table, so @@ROWCOUNT never drops to zero.

To avoid that we will need to keep track of what we are inserting. Here are a few examples:

--Keeping track of inserts
declare @rc int;
declare @tableRows int;
declare @batchSize int;
declare @start int;
declare @end int;

set @rc = 1;
select @tableRows = count(*) from sourceTable; --initialize to total number of rows in table
set @batchSize = 10000;
set @start = 1;
set @end = @start + @batchSize - 1;

while @rc < @tableRows
begin
    with cte (colA, RowNbr) as
    (
        select colA, ROW_NUMBER() over (order by colA) as 'RowNbr'
        from sourceTable
    )
    insert into targetTable (colA)
    select colA
    from cte
    where RowNbr between @start and @end;

    set @rc += @batchSize;
    set @start = @end + 1;
    set @end = @start + @batchSize - 1;
end

Assume here we want to migrate a table – move the records in batches and delete from the source as we go.

create table #tmp
(
    rownbr int
    ,colA varchar(10)
);

declare @rc int;
declare @tableRows int;
declare @batchSize int;

set @rc = 1;
select @tableRows = count(*) from sourceTable; --initialize to total number of rows in table
set @batchSize = 10000;

while @rc < @tableRows
begin
    --Load records into temp table
    with cteInsertingRows as
    (
        select colA, ROW_NUMBER() over (order by colA) as 'RowNbr'
        from sourceTable
    )
    insert into #tmp (rownbr, colA)
    select RowNbr, colA
    from cteInsertingRows
    where RowNbr between 1 and @batchSize;

    --insert into target table
    insert into targetTable
    select colA from #tmp;

    --delete from source table
    delete st
    from (select colA, ROW_NUMBER() over (order by colA) as 'rownbr'
          from sourceTable) st
    inner join #tmp t
        on st.colA = t.colA
        and st.rownbr = t.rownbr;

    --clear temp table for next iteration
    truncate table #tmp;

    set @rc += @batchSize;
end

How to Use T-SQL to Update Millions of Rows

We can employ similar logic to update large tables.

--Updates in batches by row numbers
declare @rc int;
declare @tableRows int;
declare @batchSize int;
declare @start int;
declare @end int;

set @rc = 1;
select @tableRows = count(*) from Colors;
set @batchSize = 10000;
set @start = 1;
set @end = @start + @batchSize - 1;

while @rc < @tableRows
begin
    with cte (colA, RowNbr) as
    (
        select colA, ROW_NUMBER() over (order by colA) as 'RowNbr'
        from Colors
    )
    update cte
    set colA = 'Blue'
    where RowNbr between @start and @end;

    set @rc += @batchSize;
    set @start = @end + 1;
    set @end = @start + @batchSize - 1;
end

Concluding Thoughts

Breaking one large transaction down into many smaller ones can get the job done with less impact on concurrency and the transaction log.

This is just a start. Consider what we left out:

T-SQL is not the only way to move data. We can also consider bcp, SSIS, C#, etc.

When going across linked servers the waits for network may be the greatest. Keep that in mind as you consider the right method.

Joins play a role – whether local or remote. Be mindful of foreign keys and referential integrity.

Indexes can make a big difference in performance.

Dynamic SQL and cursors can be useful if you need to iterate through sys.tables to perform operations on many tables.
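As a rough sketch of that last point, here is one way a cursor over sys.tables can drive dynamic SQL. The name filter and the generated statement are purely illustrative assumptions, not part of the examples above:

```sql
--Sketch: iterate sys.tables with a cursor and build a statement per table.
--The 'Stage%' filter and the batched delete are hypothetical examples.
declare @tableName sysname;
declare @sql nvarchar(max);

declare tableCursor cursor local fast_forward for
    select name
    from sys.tables
    where name like 'Stage%'; --hypothetical naming convention

open tableCursor;
fetch next from tableCursor into @tableName;

while @@FETCH_STATUS = 0
begin
    --quotename guards against odd characters in table names
    set @sql = N'delete top (10000) from ' + quotename(@tableName) + N';';
    exec sp_executesql @sql;

    fetch next from tableCursor into @tableName;
end

close tableCursor;
deallocate tableCursor;
```

The same pattern works for rebuilding indexes, updating statistics, or any other per-table maintenance.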

UPDATE – 11/3/18

I want to clarify some things about this post. I got good feedback from random internet strangers and want to make sure everyone understands this. But first…

WARNING! If you are not careful when batching you can actually make things worse!

WARNING! These are contrived examples meant to demonstrate a methodology. Please do NOT copy them and run in PROD!

WARNING! There is no “one size fits all” way to do this. Please test this and measure performance before running in PROD! Don’t just take code blindly from the web without understanding it.

Pointing out a couple things:

Be mindful of your indexes. Sometimes it can be better to drop indexes before large scale DML operations.
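For illustration only, the drop-and-recreate pattern looks like this. The index name and definition are hypothetical, and whether this helps depends entirely on your workload, so measure first:

```sql
--Sketch: drop a nonclustered index before a large scale delete,
--then recreate it afterward. Names here are hypothetical.
drop index IX_Colors_colA on Colors;

--...run the batched delete shown earlier in this post...

create nonclustered index IX_Colors_colA on Colors (colA);
```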

The INSERT piece is really a migration. You can use an OUTPUT clause on the DELETE to capture the removed rows and insert them into the target. You can also move the row number calculation out of the loop so it is only executed once.
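A minimal sketch of the OUTPUT approach, assuming the same sourceTable and targetTable from the insert examples above. Each batch is deleted from the source and landed in the target in a single atomic statement, so there is no need for row numbers or a temp table:

```sql
--Sketch: migrate in batches with DELETE ... OUTPUT
declare @rc int;
set @rc = 1;

while @rc > 0
begin
    --delete a batch and capture the deleted rows into the target
    delete top (10000) from sourceTable
    output deleted.colA into targetTable (colA);

    set @rc = @@ROWCOUNT;
end
```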

Here is a good example worth reading that illustrates testing the performance of batches.
