TSQL String Tokenizer Function for SQL Server

NOTE: If you are on SQL Server 2016 or newer, you can use the built-in STRING_SPLIT function. See an example at the end of this post.

String tokenizer for SQL Server 2008.

We often have need of string tokenizers in applications. I had need of one in some TSQL development I was doing for a customer. I had previously written one for Oracle, so I decided to re-write it for SQL Server 2008 (I believe it should also work in 2005).

The main limitations with the conversion are the lack of a hierarchical "CONNECT BY LEVEL" clause and the lack of a LEAD() function in SQL Server 2008's limited implementation of analytic functions. Also, the function depends on my generate_series() function as described in this article.
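
If you do not have that article to hand, the sketch below is a minimal stand-in, not the original implementation: it assumes only what the Tokenizer code requires, namely a function [dbo].[generate_series](start, end, step) returning a single column named IntValue.

-- Minimal generate_series() stand-in (an assumption, not the implementation from the linked article).
-- Cross-joined digit tables yield up to 10,000 values; add more joins for longer strings.
CREATE FUNCTION [dbo].[generate_series]
(
  @p_start int,
  @p_end   int,
  @p_step  int
)
Returns TABLE
As
Return (
  SELECT @p_start + (t.n - 1) * @p_step as IntValue
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) as n
            FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d1(n)
           CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d2(n)
           CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d3(n)
           CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d4(n)
         ) t
   WHERE @p_step > 0
     AND t.n <= ((@p_end - @p_start) / @p_step) + 1
);
GO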

Still, with a little perseverance I came up with the following working implementation.

USE [GISDB]  -- You need to change this if you use this function.
GO
-- Create Tokenizer Function
--
CREATE FUNCTION [dbo].[Tokenizer]
(
  @p_string     VarChar(MAX),
  @p_separators VarChar(254)
)
Returns @tokens TABLE 
(
  id        int,
  token     varchar(MAX),
  separator varchar(MAX)
)
As
/****f* TOOLS/Tokenizer (2008)
 *  NAME
 *    Tokenizer - Splits any string into tokens and separators.
 *  SYNOPSIS
 *    Function Tokenizer (
 *       @p_string     varchar(max),
 *       @p_separators varchar(254)
 *     )
 *     Returns @tokens TABLE 
 *     (
 *       id        int,
 *       token     varchar(MAX),
 *       separator varchar(MAX)
 *     ) 
 *  EXAMPLE
 *    SELECT t.id, t.token, t.separator
 *      FROM [dbo].[TOKENIZER]('LINESTRING(0 0,1 1)',' ,()') as t
 *    GO
 *    id token       separator
 *    -- ---------- ---------
 *     1 LINESTRING (
 *     2 0          NULL
 *     3 0          ,
 *     4 1          NULL 
 *     5 1          )
 *  DESCRIPTION
 *    Supplied a string and a list of separators this function returns resultant tokens as a table collection.
 *    Function returns both the token and the separator.
 *    Returned table collection contains a unique identifier to ensure tokens and separators are always correctly ordered.
 *  INPUTS
 *    @p_string     (varchar max) - Any non-null string.
 *    @p_separators (varchar 254) - List of separators eg '(),'
 *  RESULT
 *    Table (Array) of Integers
 *      id        (int)         - Unique identifier for each row starting with first token/separator found.
 *      token     (varchar MAX) - Token between separators
 *      separator (varchar MAX) - Separator between tokens.
 *  AUTHOR
 *    Simon Greener
 *  HISTORY
 *    Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
 *    Simon Greener -  Aug 2008 - Converted to SQL Server 2008
 *    Simon Greener -  Aug 2012 - Added extra return variables.
 *    Simon Greener - June 2020 - Converted to Inline Table-Valued Function from Multi-Statement Table-Valued Function
 *  COPYRIGHT
 *    (c) 2008-2020 by TheSpatialDBAdvisor/Simon Greener
******/
BEGIN
WITH MyCTE AS (
  SELECT c.beg, 
         c.sep, 
         ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
    FROM (SELECT b.beg, c.sep
            FROM (SELECT a.beg 
                    FROM (SELECT c.IntValue as beg
                            FROM [dbo].[generate_series](1,DATALENGTH(@p_string),1) c 
                          ) a
                  ) b,
                  (SELECT SUBSTRING(@p_separators,d.IntValue,1) as sep
                      FROM [dbo].[generate_series](1,DATALENGTH(@p_separators),1) d
                    ) c
            WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
          UNION ALL 
          SELECT 0 as beg, CAST(NULL AS varchar) as sep
         ) c
)
INSERT INTO @tokens
SELECT ROW_NUMBER() OVER (ORDER BY a.rid ASC) as Id, 
       CASE WHEN DATALENGTH(a.token) = 0 THEN NULL ELSE a.token END as token, 
       a.separator
  FROM (SELECT d.rid,
                SUBSTRING(@p_string, 
                         (d.beg + 1), 
                         (d.end_p - d.beg - 1) ) token,
               d.separator
          FROM (SELECT BASE.rid,
                       BASE.beg,
                       LEAD.beg end_p,
                       LEAD.sep separator
                 FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.rid = LEAD.rid-1
               ) d
           WHERE d.end_p IS NOT NULL
       ) as a;
RETURN;
END;
GO

Here are my simple tests.

 SELECT distinct t.token
   FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon
 SELECT t.token
   FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Result.

token
The
rain
in
spain
stays
mainly
on
the
plain

Now, if you want to collect them back into a single string, here’s an example of what you can do.

 SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
                  FROM (SELECT distinct t.token as gtype
                          FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t
                        ) a
                 ORDER BY ':' + a.gtype
                 FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
         ) as GeometryTypes;

Result.

GeometryTypes
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon
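
As an aside, on SQL Server 2017 and above the built-in STRING_AGG function avoids the FOR XML PATH trick altogether; a minimal equivalent of the query above would be something like the following sketch.

 -- STRING_AGG (2017+) re-assembles the distinct tokens in one pass.
 SELECT STRING_AGG(a.gtype, ':') WITHIN GROUP (ORDER BY a.gtype) as GeometryTypes
   FROM (SELECT DISTINCT t.token as gtype
           FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t
        ) a;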

Upgraded Version for SQL Server 2012 and above

SQL Server 2012 and above provide the LAG/LEAD analytic functions that were missing from 2008.
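
As a quick, self-contained illustration, LEAD() pairs each row with the value from the following row, which is exactly how the rewrite matches every separator position with the next one:

 -- LEAD() returns the next row's value (NULL when there is no next row).
 SELECT d.IntValue,
        LEAD(d.IntValue,1) OVER (ORDER BY d.IntValue) as next_value
   FROM (VALUES (1),(3),(7)) d(IntValue);

 -- IntValue next_value
 --        1          3
 --        3          7
 --        7       NULL

With that in hand, the 2008 Tokenizer function above can be re-written as follows.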

 -- Connect to database holding generate_series
 --
 USE [GISDB] -- Change to your database
 GO
 -- Drop function if exists
 --
 IF EXISTS (SELECT *
              FROM dbo.sysobjects 
             WHERE id = object_id (N'[dbo].[Tokenizer]') 
               AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
 DROP FUNCTION [dbo].[Tokenizer]
 GO
 --
 -- Now let's create it
 --
CREATE FUNCTION [dbo].[Tokenizer]
(
  @p_string     VarChar(MAX),
  @p_separators VarChar(254)
)
Returns @tokens TABLE 
(
  id        int,
  token     varchar(MAX),
  separator varchar(MAX)
)
As
/****f* TOOLS/Tokenizer (2012)
 *  NAME
 *    Tokenizer - Splits any string into tokens and separators.
 *  SYNOPSIS
 *    Function Tokenizer (
 *       @p_string     varchar(max),
 *       @p_separators varchar(254)
 *     )
 *     Returns @tokens TABLE 
 *     (
 *       id        int,
 *       token     varchar(MAX),
 *       separator varchar(MAX)
 *     ) 
 *  EXAMPLE
 *
 *    SELECT t.id, t.token, t.separator
 *      FROM [dbo].[TOKENIZER]('LINESTRING(0 0,1 1)',' ,()') as t
 *    GO
 *    id token       separator
 *    -- ---------- ---------
 *     1 LINESTRING (
 *     2 0          NULL
 *     3 0          ,
 *     4 1          NULL 
 *     5 1          )
 *
 *  DESCRIPTION
 *    Supplied a string and a list of separators this function returns resultant tokens as a table collection.
 *    Function returns both the token and the separator.
 *    Returned table collection contains a unique identifier to ensure tokens and separators are always correctly ordered.
 *  INPUTS
 *    @p_string     (varchar max) - Any non-null string.
 *    @p_separators (varchar 254) - List of separators eg '(),'
 *  RESULT
 *    Table (Array) of Integers
 *      id        (int)         - Unique identifier for each row starting with first token/separator found.
 *      token     (varchar MAX) - Token between separators
 *      separator (varchar MAX) - Separator between tokens.
 *  AUTHOR
 *    Simon Greener
 *  HISTORY
 *    Simon Greener - Aug 2012 - Converted to SQL Server 2012 (Uses new Lag/Lead function and returns separators)
 *  COPYRIGHT
 *    (c) 2008-2020 by TheSpatialDBAdvisor/Simon Greener
******/
BEGIN
WITH MyCTE AS (
  SELECT c.beg, 
         c.sep, 
         ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
    FROM (SELECT b.beg, c.sep
            FROM (SELECT a.beg 
                    FROM (SELECT c.IntValue as beg
                            FROM [dbo].[generate_series](1,DATALENGTH(@p_string),1) c 
                          ) a
                  ) b,
                  (SELECT SUBSTRING(@p_separators,d.IntValue,1) as sep
                      FROM [dbo].[generate_series](1,DATALENGTH(@p_separators),1) d
                    ) c
            WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
          UNION ALL 
          SELECT 0 as beg, @p_string as sep
         ) c
)
INSERT INTO @tokens
SELECT Row_Number() OVER (ORDER BY a.rid ASC) as id, 
       Case When DATALENGTH(a.token) = 0 Then null Else a.token End as token, 
       a.separator
  FROM (SELECT d.rid,
               SubString(@p_string, (d.beg + 1), (Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1) ) as token,
               Lead(d.sep,1) OVER (ORDER BY d.rid asc) as separator
          FROM MyCTE d 
       ) as a
 WHERE DATALENGTH(a.token) <> 0 or DATALENGTH(a.separator) <> 0;
RETURN;
END;
GO

Testing

 select distinct t.token
   from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

Results

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

The classic “Rain in Spain…”.

 SELECT t.*
   FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Results

id token separator
1 The {SPACE}
2 rain {SPACE}
3 in {SPACE}
4 spain ,
5 NULL {SPACE}
6 stays {SPACE}
7 mainly {SPACE}
8 on {SPACE}
9 the {SPACE}
10 plain .
11 NULL !

Now, let’s process a POLYGON WKT.

 SELECT t.id, t.token, t.separator
   FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',' ,()') as t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 {SPACE}
4 400 ,
5 NULL {SPACE}
6 2300 {SPACE}
7 700 ,
8 NULL {SPACE}
9 2800 {SPACE}
10 1100 ,
11 NULL {SPACE}
12 2300 {SPACE}
13 1100 ,
14 NULL {SPACE}
15 1800 {SPACE}
16 1100 ,
17 NULL {SPACE}
18 2300 {SPACE}
19 400 )
20 NULL ,
21 NULL {SPACE}
22 NULL (
23 2300 {SPACE}
24 1000 ,
25 NULL {SPACE}
26 2400 {SPACE}
27 NULL {SPACE}
28 900 ,
29 NULL {SPACE}
30 2200 {SPACE}
31 900 ,
32 NULL {SPACE}
33 2300 {SPACE}
34 1000 )
35 NULL )

This time don’t include the space as a separator.

 SELECT t.id, t.token, t.separator
   FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',',()') as t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 400 ,
4 2300 700 ,
5 2800 1100 ,
6 2300 1100 ,
7 1800 1100 ,
8 2300 400 )
9 NULL ,
10 {SPACE} (
11 2300 1000 ,
12 2400 900 ,
13 2200 900 ,
14 2300 1000 )
15 NULL )

SQL Server 2016+ STRING_SPLIT

The STRING_SPLIT function introduced in SQL Server 2016 can be used instead of my Tokenizer function. The main difference is that STRING_SPLIT returns only the tokens (optionally with an ordinal identifier column), whereas the Tokenizer functions return a token id (ordinal), the token, and the separator between the tokens.

select distinct t.value
  from STRING_SPLIT('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

-- Result
--
value
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

-- Now return the ordinal value as an id (the optional enable_ordinal argument requires Azure SQL Database or SQL Server 2022 and above).
--
select t.ordinal as id, t.value as token
  from STRING_SPLIT('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':',1) as t;

-- Result
--
id	token
1	LineString
2	MultiLineString
3	MultiPoint
4	MultiPolygon
5	Point
6	Point
7	LineString
8	Polygon
9	Polygon

I hope that someone out there finds this post useful.
