TSQL String Tokenizer Function for SQL Server

Update: Tokenizer has been updated for Denali analytic features and to expose the separators in the output.

We often have need of string tokenizers in applications. I had need of one in some TSQL development I am currently doing for a customer. I had previously written one for Oracle so I decided to re-write it for SQL Server 2008 (I believe it should work in 2005).

The main limitations with the conversion are the lack of a hierarchical “CONNECT BY LEVEL” clause and the lack of a LEAD() function for SQL Server’s limited implementation of analytics. Also, the function is dependent on my generate_series() function as described in this article

Still, with a little perseverance I came up with a working implementation.

Here it is.

 USE [GISDB]  -- You need to change this if you use this function.
 GO
 /*********************************************************************************
 ** @function   : Tokenizer
 ** @precis      : Splits any string into its tokens.
 ** @description : Supplied a string and a list of separators this function
 **                returns resultant tokens as a table collection.
 ** @example     : SELECT t.token
 **                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
 ** @param       : p_string. The string to be Tokenized.
 ** @param       : p_separators. The characters that are used to split the string.
 ** @depend      : dbo.generate_series()
 ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
 ** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
 ** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
 **/
 Drop Function Tokenizer;
 ==<nbsp/>==
 Create Function Tokenizer(@p_string     VarChar(MAX),
                            @p_separators VarChar(254))
   Returns @varchar_table TABLE 
  (
    token varchar(MAX)
   )
 As
 Begin
   Begin
       With myCte As (
       Select c.beg,
              c.fullstring,
              ROW_NUMBER() OVER(ORDER BY c.beg ASC) RowVersion
         From (Select b.beg, b.fullstring
                 From (Select a.beg, @p_string as fullstring
                         From (Select c.IntValue as beg
                                 From dbo.generate_series(1,DATALENGTH(@p_string),1) c 
                               ) a
                      ) b,
                      (Select SUBSTRING(@p_separators,d.IntValue,1) as delim
                         From dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                      ) c
                Where CHARINDEX(c.delim,SUBSTRING(b.fullstring,b.beg,1)) > 0
                Union All Select 0 as beg, @p_string as fullstring 
                Union All Select DATALENGTH(@p_string)+1 as beg, @p_string as fullstring
              ) c
       )
       INSERT INTO @varchar_table
       Select SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token
         From (Select BASE.beg, 
                       LEAD.beg end_p, 
                       BASE.fullstring
                 From MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.RowVersion = LEAD.RowVersion-1
              ) d
        Where d.end_p Is Not Null
          And d.end_p > d.beg + 1;
       RETURN;
   End;
 End
 GO

Here are my, simple, tests.

 SELECT distinct t.token
   FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon
 SELECT t.token
   FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Result.

token
Then
rain
in
spain
stays
mainly
on
the
plain

Now, if you want to collect them back into a single string, here’s an example of what you can do.

 SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
                  FROM ( select distinct t.token as gtype
                           from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t
                        ) a
                 ORDER BY ':' + a.gtype
                 FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
         ) as GeometryTypes;

Result.

GeometryTypes
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon

Upgraded Version for Denali

 ** @history     : Simon Greener - Apr 2012  - Extended to include returning of tokens
 -- Connect to database holding generate_series
 --
 USE [GISDB] -- Change to your database
 GO
 -- Drop function if exists
 --
 IF EXISTS (SELECT *
              FROM dbo.sysobjects 
             WHERE id = object_id (N'[dbo].[Tokenizer]') 
               AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
 DROP FUNCTION [dbo].[Tokenizer]
 GO
 --
 -- Now let's create it
 --
 /*********************************************************************************
 ** @function   : Tokenizer
 ** @precis      : Splits any string into its tokens.
 ** @description : Supplied a string and a list of separators this function
 **                returns resultant tokens as a table collection.
 ** @example     : SELECT t.token
 **                  FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
 ** @param       : p_string. The string to be Tokenized.
 ** @param       : p_separators. The characters that are used to split the string.
 ** @history     : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
 ** @history     : Simon Greener - Jul 2006 - Original coding (extended SQL)
 ** @history     : Simon Greener - Aug 2008 - Converted to SQL Server 2008
 ** @history     : Simon Greener - Aug 2012 - Converted to SQL Server 2012 and return separators
 **/
 Create Function [dbo].[Tokenizer](@p_string     VarChar(MAX),
                                    @p_separators VarChar(254))
   Returns @varchar_table TABLE 
  (
    id        int,
    token     varchar(MAX),
    separator varchar(MAX)
   )
 As
 Begin
   Begin
     With MyCTE As (
       Select c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
         From (Select b.beg, c.sep
                 From (Select a.beg 
                         From (select c.IntValue as beg
                                 from dbo.generate_series(1,DATALENGTH(@p_string),1) c 
                               ) a
                       ) b,
                       (select SUBSTRING(@p_separators,d.IntValue,1) as sep
                           from dbo.generate_series(1,DATALENGTH(@p_separators),1) d
                         ) c
                 Where CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
               Union All Select 0 as beg, CAST(NULL AS varchar) as sep
              ) c
     )
     Insert Into @varchar_table
     Select Row_Number() Over (Order By a.rid ASC) as Id, 
            Case When DataLength(a.token) = 0 Then null Else a.token End as token, 
            a.sep
       From (Select d.rid,
                     SubString(@p_string, (d.beg + 1), (Lead(d.beg,1) Over (Order By d.rid asc) - d.beg - 1) ) as token,
                     Lead(d.sep,1) Over (Order By d.rid asc) as sep
               From MyCTE d 
            ) as a
      Where DataLength(a.token) <> 0 or DataLength(a.sep) <> 0;
     Return;
   End;
 End
 GO

Testing

 select distinct t.token
   from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

Results

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

The classic “Rain in Spain…”.

 SELECT t.*
   FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Results

id token separator
1 The {SPACE}
2 rain {SPACE}
3 in {SPACE}
4 spain ,
5 NULL {SPACE}
6 stays {SPACE}
7 mainly {SPACE}
8 on {SPACE}
9 the {SPACE}
10 plain .
11 NULL !

Now, let’s process a POLYGON WKT.

 SELECT t.id, t.token, t.separator
   FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',' ,()') as t;

Results

 

id token separator
1 POLYGON (
2 NULL (
3 2300 {SPACE}
4 400 ,
5 NULL {SPACE}
6 2300 {SPACE}
7 700 ,
8 NULL {SPACE}
9 2800 {SPACE}
10 1100 ,
11 NULL {SPACE}
12 2300 {SPACE}
13 1100 ,
14 NULL {SPACE}
15 1800 {SPACE}
16 1100 ,
17 NULL {SPACE}
18 2300 {SPACE}
19 400 )
20 NULL ,
21 NULL {SPACE}
22 NULL (
23 2300 {SPACE}
24 1000 ,
25 NULL {SPACE}
26 2400 {SPACE}
27 NULL {SPACE}
28 900 ,
29 NULL {SPACE}
30 2200 {SPACE}
31 900 ,
32 NULL {SPACE}
33 2300 {SPACE}
34 1000 )
35 NULL )

This time don’t include the space as a separator.

 SELECT t.id, t.token, t.separator
   FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',',()') as t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 400 ,
4 2300 700 ,
5 2800 1100 ,
6 2300 1100 ,
7 1800 1100 ,
8 2300 400 )
9 NULL ,
10 {SPACE} (
11 2300 1000 ,
12 2400 900 ,
13 2200 900 ,
14 2300 1000 )
15 NULL )

I hope that someone out there finds this useful.