TSQL String Tokenizer Function for SQL Server

NOTE: If you are on SQL Server 2016 or newer, you can use the built-in STRING_SPLIT function. See an example at the end of this post.

String tokenizer for SQL Server 2008.

We often have need of string tokenizers in applications. I had need of one in some TSQL development I was doing for a customer. I had previously written one for Oracle, so I decided to re-write it for SQL Server 2008 (I believe it should also work in 2005).

The main limitations with the conversion are the lack of a hierarchical "CONNECT BY LEVEL" clause and the lack of a LEAD() function in SQL Server 2008's limited implementation of analytic functions. Also, the function depends on my generate_series() function as described in this article.
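
If you do not have that article to hand, the sketch below is a minimal stand-in, not the original implementation: it assumes only what the Tokenizer code requires, namely a function [dbo].[generate_series](start, end, step) returning a single column named IntValue.

-- Minimal generate_series() stand-in (an assumption, not the implementation from the linked article).
-- Cross-joined digit tables yield up to 10,000 values; add more joins for longer strings.
CREATE FUNCTION [dbo].[generate_series]
(
  @p_start int,
  @p_end   int,
  @p_step  int
)
Returns TABLE
As
Return (
  SELECT @p_start + (t.n - 1) * @p_step as IntValue
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) as n
            FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d1(n)
           CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d2(n)
           CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d3(n)
           CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) d4(n)
         ) t
   WHERE @p_step > 0
     AND t.n <= ((@p_end - @p_start) / @p_step) + 1
);
GO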

Still, with a little perseverance I came up with the following working implementation.

USE [GISDB]  -- You need to change this if you use this function.
GO
-- Create Tokenizer Function
--
CREATE FUNCTION [dbo].[Tokenizer]
(
  @p_string     VarChar(MAX),
  @p_separators VarChar(254)
)
Returns @tokens TABLE 
(
  id        int,
  token     varchar(MAX),
  separator varchar(MAX)
)
As
/****f* TOOLS/Tokenizer (2008)
 *  NAME
 *    Tokenizer - Splits any string into tokens and separators.
 *  SYNOPSIS
 *    Function Tokenizer (
 *       @p_string     varchar(max),
 *       @p_separators varchar(254)
 *     )
 *     Returns @tokens TABLE 
 *     (
 *       id        int,
 *       token     varchar(MAX),
 *       separator varchar(MAX)
 *     ) 
 *  EXAMPLE
 *    SELECT t.id, t.token, t.separator
 *      FROM [dbo].[TOKENIZER]('LINESTRING(0 0,1 1)',' ,()') as t
 *    GO
 *    id token       separator
 *    -- ---------- ---------
 *     1 LINESTRING (
 *     2 0          NULL
 *     3 0          ,
 *     4 1          NULL 
 *     5 1          )
 *  DESCRIPTION
 *    Supplied a string and a list of separators this function returns resultant tokens as a table collection.
 *    Function returns both the token and the separator.
 *    Returned table collection contains a unique identifier to ensure tokens and separators are always correctly ordered.
 *  INPUTS
 *    @p_string     (varchar max) - Any non-null string.
 *    @p_separators (varchar 254) - List of separators eg '(),'
 *  RESULT
 *    Table (Array) of Integers
 *      id        (int)         - Unique identifier for each row starting with first token/separator found.
 *      token     (varchar MAX) - Token between separators
 *      separator (varchar MAX) - Separator between tokens.
 *  AUTHOR
 *    Simon Greener
 *  HISTORY
 *    Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
 *    Simon Greener -  Aug 2008 - Converted to SQL Server 2008
 *    Simon Greener -  Aug 2012 - Added extra return variables.
 *    Simon Greener - June 2020 - Converted to Inline Table-Valued Function from Multi-Statement Table-Valued Function
 *  COPYRIGHT
 *    (c) 2008-2020 by TheSpatialDBAdvisor/Simon Greener
******/
BEGIN
WITH MyCTE AS (
  SELECT c.beg, 
         c.sep, 
         ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
    FROM (SELECT b.beg, c.sep
            FROM (SELECT a.beg 
                    FROM (SELECT c.IntValue as beg
                            FROM [dbo].[generate_series](1,DATALENGTH(@p_string),1) c 
                          ) a
                  ) b,
                  (SELECT SUBSTRING(@p_separators,d.IntValue,1) as sep
                      FROM [dbo].[generate_series](1,DATALENGTH(@p_separators),1) d
                    ) c
            WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
          UNION ALL 
          SELECT 0 as beg, CAST(NULL AS varchar) as sep
         ) c
)
INSERT INTO @tokens
SELECT ROW_NUMBER() OVER (ORDER BY a.rid ASC) as Id, 
       CASE WHEN DATALENGTH(a.token) = 0 THEN NULL ELSE a.token END as token, 
       a.separator
  FROM (SELECT d.rid,
                SUBSTRING(@p_string, 
                         (d.beg + 1), 
                         (d.end_p - d.beg - 1) ) token,
               d.separator
          FROM (SELECT BASE.rid,
                       BASE.beg,
                       LEAD.beg end_p,
                       LEAD.sep separator
                 FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.rid = LEAD.rid-1
               ) d
           WHERE d.end_p IS NOT NULL
       ) as a;
RETURN;
END;
GO

Here are my simple tests.

 SELECT distinct t.token
   FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

Result.

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon
 SELECT t.token
   FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Result.

token
The
rain
in
spain
stays
mainly
on
the
plain

Now, if you want to collect them back into a single string, here’s an example of what you can do.

 SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
                  FROM (SELECT distinct t.token as gtype
                          FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t
                        ) a
                 ORDER BY ':' + a.gtype
                 FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
         ) as GeometryTypes;

Result.

GeometryTypes
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon
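
As an aside, on SQL Server 2017 and above the built-in STRING_AGG function avoids the FOR XML PATH trick altogether; a minimal equivalent of the query above would be something like the following sketch.

 -- STRING_AGG (2017+) re-assembles the distinct tokens in one pass.
 SELECT STRING_AGG(a.gtype, ':') WITHIN GROUP (ORDER BY a.gtype) as GeometryTypes
   FROM (SELECT DISTINCT t.token as gtype
           FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t
        ) a;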

Upgraded Version for SQL Server 2012 and above

SQL Server 2012 and above provide the LAG/LEAD analytic functions that were missing from 2008.
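
As a quick, self-contained illustration, LEAD() pairs each row with the value from the following row, which is exactly how the rewrite matches every separator position with the next one:

 -- LEAD() returns the next row's value (NULL when there is no next row).
 SELECT d.IntValue,
        LEAD(d.IntValue,1) OVER (ORDER BY d.IntValue) as next_value
   FROM (VALUES (1),(3),(7)) d(IntValue);

 -- IntValue next_value
 --        1          3
 --        3          7
 --        7       NULL

With that in hand, the 2008 Tokenizer function above can be re-written as follows.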

 -- Connect to database holding generate_series
 --
 USE [GISDB] -- Change to your database
 GO
 -- Drop function if exists
 --
 IF EXISTS (SELECT *
              FROM dbo.sysobjects 
             WHERE id = object_id (N'[dbo].[Tokenizer]') 
               AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
 DROP FUNCTION [dbo].[Tokenizer]
 GO
 --
 -- Now let's create it
 --
CREATE FUNCTION [dbo].[Tokenizer]
(
  @p_string     VarChar(MAX),
  @p_separators VarChar(254)
)
Returns @tokens TABLE 
(
  id        int,
  token     varchar(MAX),
  separator varchar(MAX)
)
As
/****f* TOOLS/Tokenizer (2012)
 *  NAME
 *    Tokenizer - Splits any string into tokens and separators.
 *  SYNOPSIS
 *    Function Tokenizer (
 *       @p_string     varchar(max),
 *       @p_separators varchar(254)
 *     )
 *     Returns @tokens TABLE 
 *     (
 *       id        int,
 *       token     varchar(MAX),
 *       separator varchar(MAX)
 *     ) 
 *  EXAMPLE
 *
 *    SELECT t.id, t.token, t.separator
 *      FROM [dbo].[TOKENIZER]('LINESTRING(0 0,1 1)',' ,()') as t
 *    GO
 *    id token       separator
 *    -- ---------- ---------
 *     1 LINESTRING (
 *     2 0          NULL
 *     3 0          ,
 *     4 1          NULL 
 *     5 1          )
 *
 *  DESCRIPTION
 *    Supplied a string and a list of separators this function returns resultant tokens as a table collection.
 *    Function returns both the token and the separator.
 *    Returned table collection contains a unique identifier to ensure tokens and separators are always correctly ordered.
 *  INPUTS
 *    @p_string     (varchar max) - Any non-null string.
 *    @p_separators (varchar 254) - List of separators eg '(),'
 *  RESULT
 *    Table (Array) of Integers
 *      id        (int)         - Unique identifier for each row starting with first token/separator found.
 *      token     (varchar MAX) - Token between separators
 *      separator (varchar MAX) - Separator between tokens.
 *  AUTHOR
 *    Simon Greener
 *  HISTORY
 *    Simon Greener - Aug 2012 - Converted to SQL Server 2012 (Uses new Lag/Lead function and returns separators)
 *  COPYRIGHT
 *    (c) 2008-2020 by TheSpatialDBAdvisor/Simon Greener
******/
BEGIN
WITH MyCTE AS (
  SELECT c.beg, 
         c.sep, 
         ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
    FROM (SELECT b.beg, c.sep
            FROM (SELECT a.beg 
                    FROM (SELECT c.IntValue as beg
                            FROM [dbo].[generate_series](1,DATALENGTH(@p_string),1) c 
                          ) a
                  ) b,
                  (SELECT SUBSTRING(@p_separators,d.IntValue,1) as sep
                      FROM [dbo].[generate_series](1,DATALENGTH(@p_separators),1) d
                    ) c
            WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
          UNION ALL 
          SELECT 0 as beg, @p_string as sep
         ) c
)
INSERT INTO @tokens
SELECT Row_Number() OVER (ORDER BY a.rid ASC) as id, 
       Case When DATALENGTH(a.token) = 0 Then null Else a.token End as token, 
       a.separator
  FROM (SELECT d.rid,
               SubString(@p_string, (d.beg + 1), (Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1) ) as token,
               Lead(d.sep,1) OVER (ORDER BY d.rid asc) as separator
          FROM MyCTE d 
       ) as a
 WHERE DATALENGTH(a.token) <> 0 or DATALENGTH(a.separator) <> 0;
RETURN;
END;
GO

Testing

 select distinct t.token
   from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

Results

token
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

The classic “Rain in Spain…”.

 SELECT t.*
   FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;

Results

id token separator
1 The {SPACE}
2 rain {SPACE}
3 in {SPACE}
4 spain ,
5 NULL {SPACE}
6 stays {SPACE}
7 mainly {SPACE}
8 on {SPACE}
9 the {SPACE}
10 plain .
11 NULL !

Now, let’s process a POLYGON WKT.

 SELECT t.id, t.token, t.separator
   FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',' ,()') as t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 {SPACE}
4 400 ,
5 NULL {SPACE}
6 2300 {SPACE}
7 700 ,
8 NULL {SPACE}
9 2800 {SPACE}
10 1100 ,
11 NULL {SPACE}
12 2300 {SPACE}
13 1100 ,
14 NULL {SPACE}
15 1800 {SPACE}
16 1100 ,
17 NULL {SPACE}
18 2300 {SPACE}
19 400 )
20 NULL ,
21 NULL {SPACE}
22 NULL (
23 2300 {SPACE}
24 1000 ,
25 NULL {SPACE}
26 2400 {SPACE}
27 NULL {SPACE}
28 900 ,
29 NULL {SPACE}
30 2200 {SPACE}
31 900 ,
32 NULL {SPACE}
33 2300 {SPACE}
34 1000 )
35 NULL )

This time don’t include the space as a separator.

 SELECT t.id, t.token, t.separator
   FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',',()') as t;

Results

id token separator
1 POLYGON (
2 NULL (
3 2300 400 ,
4 2300 700 ,
5 2800 1100 ,
6 2300 1100 ,
7 1800 1100 ,
8 2300 400 )
9 NULL ,
10 {SPACE} (
11 2300 1000 ,
12 2400 900 ,
13 2200 900 ,
14 2300 1000 )
15 NULL )

SQL Server 2016+ STRING_SPLIT

The STRING_SPLIT function introduced in SQL Server 2016 can be used instead of my Tokenizer function. The main difference is that STRING_SPLIT returns only the tokens (optionally with an ordinal identifier column), whereas the Tokenizer functions return a token id (ordinal), the token, and the separator between the tokens.

select distinct t.value
  from STRING_SPLIT('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;

-- Result
--
value
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon

-- Now return the ordinal value as an id (the optional enable_ordinal argument requires Azure SQL Database or SQL Server 2022 and above).
--
select t.ordinal as id, t.value as token
  from STRING_SPLIT('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':',1) as t;

-- Result
--
id	token
1	LineString
2	MultiLineString
3	MultiPoint
4	MultiPolygon
5	Point
6	Point
7	LineString
8	Polygon
9	Polygon

I hope that someone out there finds this post useful.
