A very common database operation is to transform a simple CSV string into a table so it can be used in more complex query operations. There is an overabundance of SQL functions online that perform such transformations. Here is one: [sourcecode language="sql"] CREATE FUNCTION [dbo].[fn_ParseCsvString] ( @CSVString VARCHAR(max) , @Delimiter VARCHAR(1) ) RETURNS @tbl TABLE ( s VARCHAR(128) ) AS BEGIN DECLARE @i INT , @j INT SELECT @i = 1 WHILE @i BEGIN SELECT @j = CHARINDEX(@Delimiter, @CSVString, @i) IF @j = 0 BEGIN SELECT @j = LEN(@CSVString) + 1 END INSERT @tbl SELECT SUBSTRING(@CSVString, @i, @j - @i) SELECT @i = @j + LEN(@Delimiter) END RETURN END [/sourcecode]
And you’ll use it like this:
[sourcecode language="sql"] select * from dbo.fn_ParseCSVString ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15', ',') [/sourcecode]
The problem is the way we use those functions and the fact that their performance order is linear to the number of comma separated values in the string. That works out ok for a small number of items, but not when you want to JOIN on a very large number of CSV. When you are using the CSV functions in a sproc or a query, you'd be writing JOIN statements to the result set from the function in order to filter a real table in your database.
Let's look at a case scenario: Suppose you have an Address table in your Geo schema. Your address table has a primary key 'Id' column and other columns that are irrelevant for the purpose of this article. If you wanted to query your Address based on CSV values using your function, you'd do something like this:
DECLARE @csv VARCHAR(MAX) = '1,2,3,4,5,6,7'
SELECT a.Id FROM Geo.[Address] a JOIN dbo.fn_ParseCsvString(@csv, ',') tbl ON a.Id = tbl.s [/sourcecode]
Here is some benchmarking on the previous query (all in microseconds): 1 second = 1000 milliseconds = 1'000'000 microseconds
1000 CSV ~ 0.5 secs – T1: 557'272 – T2: 408'185 – T3: 446'485
5000 CSV ~ 2.3 secs – T1: 2'376'159 – T2: 2'307'394 – T3: 2'370'191
10000 CSV ~ 4.6 secs – T1: 4'699'332 – T2: 4'737'877 – T3: 4'641'049
You get the idea and it gets even worse when you do left joins on CSV. Even if your tables are indexed on the actual columns you are joining to your function, sequentially calculating those values and inserting them into a table will take some unnecessary time. So here is a solution that requires not a function but instead a store procedure and the reason for the store procedure is the need to run Dynamic SQL.
PROCEDURE [dbo].[CsvToTable] @Csv VARCHAR(MAX) , @SchemaName VARCHAR(128) , @TableName VARCHAR(128) , @KeyColumnName VARCHAR(32) = 'Id' , @Debug BIT = 0 AS SET NOCOUNT ON ; BEGIN TRY DECLARE @mySql NVARCHAR(MAX)= 'select distinct ' + @KeyColumnName + ' from ' + QUOTENAME(@schemaName) + '.' + QUOTENAME(@tableName)
IF ( @Csv IS NULL OR @Csv = '' ) BEGIN SET @mySql = @mySql + ' where 1=0' END ELSE BEGIN
SET @mySql = @mySql + ' where ' + @KeyColumnName + ' in (' + @Csv + ')' END IF ( @Debug = 1 ) PRINT @mySql EXEC sp_executesql @mySql END TRY BEGIN CATCH --Your catch logic here --EXEC dbo.Error_Rethrow END CATCH
And this is how you’d use it:
[sourcecode language="sql"] DECLARE @csv VARCHAR(MAX) = '1,2,3,4,5,6,7'
DECLARE @tbl TABLE ( Id INT ) INSERT INTO @tbl EXEC dbo.CsvToTable @csv, N'Geo', N'Address', N'Id'
SELECT a.Id FROM Geo.Address a JOIN @tbl tbl ON tbl.Id = a.Id [/sourcecode]
Here is some benchmarking on the new solution (all in microseconds): 1 second = 1000 milliseconds = 1'000'000 microseconds
10000 CSV ~ 0.2 secs – T1: 212'289 – T2: 183'635 – T3: 204'688
WOW! That's 1/5 of a second on average to query 10K records. MUCH MUCH better ah?!
Let’s check pros and cons of this solution.
- There is nothing faster than this solution to convert CSV into a table. Period.
- You can configure additional indexes on the columns you want to lookup using a CSV approach and the indexed will have an impact on the speed of the CSV look-up; versus always having a linear execution order because the function is unaware of indexes.
- Simple to use.
- It’s using dynamic SQL, and that brings all the negative things writing dynamic SQL has. I will recommend having a dedicated role to runners of this sproc so that users do not have permissions to run it. Try to always use this sproc through another sproc.
- The fact that is a sproc “feels” wrong, since by their nature sprocs can have side effects, although this one has none. This is precisely the reason why a function does not allow for execution of dynamic SQL.
I think I can live with the feeling in trade of the unmatched performance this solution offers. A couple of things to note that were done to prevent SQL injection on the Dynamic SQL:
- Both schema and table names are provides as separate parameters of the sproc and they are put together in the code with the '.' separator.
- Both schema and table name are wrapped using quotename()
- When executing the Dynamic SQL, I'm using sp_executesql rather than just the plain EXEC() function.
- There is a @Debug parameter that prints the generated Dynamic SQL when debugging. This will safe you a couple of headaches.