Top 5 Recent Articles
ARTICLES CATEGORIES
- Algorithms (22)
- All (399)
- Biography (1)
- Blog (44)
- Business Requirements (1)
- Commentary (1)
- Conversion (2)
- Customers (2)
- Data Models (1)
- Education (2)
- GeoRaptor (13)
- GPS (1)
- Image Processing (2)
- Import Export (8)
- Licensing (2)
- LiDAR (1)
- Linear Referencing (4)
- Manifold GIS (3)
- Mapping (1)
- MySQL Spatial (7)
- Networking and Routing (including Optimization) (5)
- Open Source (18)
- Oracle Spatial and Locator (194)
- Partitioning (1)
- PostGIS (36)
- Projections (1)
- Published Articles (1)
- qGIS (1)
- Recommendations (1)
- Services (1)
- Software Change Log (1)
- Source Code (37)
- Space Curves (9)
- Spatial Database Functions (109)
- Spatial DB comparison (1)
- Spatial XML Processing (11)
- SQL Server Spatial (92)
- Standards (3)
- Stored Procedure (17)
- Tessellation or Gridding (10)
- Tools (2)
- Topological Relationships (1)
- Training (2)
- Triangulation (2)
TSQL String Tokenizer Function for SQL Server
NOTE: If your database is 2016 or newer, you can use the new STRING_SPLIT function. See an example at the end of this post.
String tokenizer for SQL Server 2008.
We often have need of string tokenizers in applications. I had need of one in some TSQL development I am did for a customer. I had previously written one for Oracle so I decided to re-write it for SQL Server 2008 (I believe it should work in 2005).
The main limitations with the conversion are the lack of a hierarchical “CONNECT BY LEVEL” clause and the lack of a LEAD() function for SQL Server’s limited implementation of analytics. Also, the function is dependent on my generate_series() function as described in this article.
Still, with a little perseverance I came up with the following working implementation.
USE [GISDB] -- You need to change this if you use this function.
GO
-- Create Tokenizer Function
--
CREATE FUNCTION [dbo].[Tokenizer]
(
@p_string VarChar(MAX),
@p_separators VarChar(254)
)
Returns @tokens TABLE
(
id int,
token varchar(MAX),
separator varchar(MAX)
)
As
/****f* TOOLS/Tokenizer (2008)
* NAME
* Tokenizer - Splits any string into tokens and separators.
* SYNOPSIS
* Function Tokenizer (
* @p_string varchar(max),
* @p_separators varchar(254)
* )
* Returns @tokens TABLE
* (
* id int,
* token varchar(MAX),
* separator varchar(MAX)
* )
* EXAMPLE
* SELECT t.id, t.token, t.separator
* FROM [dbo].[TOKENIZER]('LINESTRING(0 0,1 1)',' ,()') as t
* GO
* id token separator
* -- ---------- ---------
* 1 LINESTRING (
* 2 0 NULL
* 3 0 ,
* 4 1 NULL
* 5 1 )
* DESCRIPTION
* Supplied a string and a list of separators this function returns resultant tokens as a table collection.
* Function returns both the token and the separator.
* Returned table collection contains a unique identifier to ensure tokens and separators are always correctly ordered.
* INPUTS
* @p_string (varchar max) - Any non-null string.
* @p_separators (varchar 254) - List of separators eg '(),'
* RESULT
* Table (Array) of Integers
* id (int) - Unique identifier for each row starting with first token/separator found.
* token (varchar MAX) - Token between separators
* separator (varchar MAX) - Separator between tokens.
* AUTHOR
* Simon Greener
* HISTORY
* Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
* Simon Greener - Aug 2008 - Converted to SQL Server 2008
* Simon Greener - Aug 2012 - Added extra return variables.
* Simon Greener - June 2020 - Converted to Inline Table-Valued Function from Multi-Statement Table-Valued Function
* COPYRIGHT
* (c) 2008-2020 by TheSpatialDBAdvisor/Simon Greener
******/
BEGIN
WITH MyCTE AS (
SELECT c.beg,
c.sep,
ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
FROM (SELECT b.beg, c.sep
FROM (SELECT a.beg
FROM (SELECT c.IntValue as beg
FROM [dbo].[generate_series](1,DATALENGTH(@p_string),1) c
) a
) b,
(SELECT SUBSTRING(@p_separators,d.IntValue,1) as sep
FROM [dbo].[generate_series](1,DATALENGTH(@p_separators),1) d
) c
WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
UNION ALL
SELECT 0 as beg, CAST(NULL AS varchar) as sep
) c
)
INSERT INTO @tokens
SELECT ROW_NUMBER() OVER (ORDER BY a.rid ASC) as Id,
CASE WHEN DATALENGTH(a.token) = 0 THEN NULL ELSE a.token END as token,
a.separator
FROM (SELECT d.rid,
SUBSTRING(@p_string,
(d.beg + 1),
(d.end_p - d.beg - 1) ) token,
d.separator
FROM (SELECT BASE.rid,
BASE.beg,
LEAD.beg end_p,
LEAD.sep separator
FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.rid = LEAD.rid-1
) d
WHERE d.end_p IS NOT NULL
) as a;
RETURN;
END;
GO
Here are my, simple, tests.
SELECT distinct t.token
FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;
Result.
| token |
| LineString |
| MultiLineString |
| MultiPoint |
| MultiPolygon |
| Point |
| Polygon |
SELECT t.token
FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
Result.
| token |
| Then |
| rain |
| in |
| spain |
| stays |
| mainly |
| on |
| the |
| plain |
Now, if you want to collect them back into a single string, here’s an example of what you can do.
SELECT (STUFF((SELECT DISTINCT ':' + a.gtype
FROM (SELECT distinct t.token as gtype
FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t
) a
ORDER BY ':' + a.gtype
FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'')
) as GeometryTypes;
Result.
| GeometryTypes |
| LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon |
Upgraded Version for SQL Server 2012 and aboce
SQL Server 2012 and above provide the LAG/LEAD analytic functions that were missing from 2008. The above 2008 Tokenizer function can be re-written as follows.
-- Connect to database holding generate_series
--
USE [GISDB] -- Change to your database
GO
-- Drop function if exists
--
IF EXISTS (SELECT *
FROM dbo.sysobjects
WHERE id = object_id (N'[dbo].[Tokenizer]')
AND OBJECTPROPERTY(id, N'IsTableFunction') = 1)
DROP FUNCTION [dbo].[Tokenizer]
GO
--
-- Now let's create it
--
CREATE FUNCTION [dbo].[Tokenizer]
(
@p_string VarChar(MAX),
@p_separators VarChar(254)
)
Returns @tokens TABLE
(
id int,
token varchar(MAX),
separator varchar(MAX)
)
As
/****f* TOOLS/Tokenizer (2012)
* NAME
* Tokenizer - Splits any string into tokens and separators.
* SYNOPSIS
* Function Tokenizer (
* @p_string varchar(max),
* @p_separators varchar(254)
* )
* Returns @tokens TABLE
* (
* id int,
* token varchar(MAX),
* separator varchar(MAX)
* )
* EXAMPLE
*
* SELECT t.id, t.token, t.separator
* FROM [dbo].[TOKENIZER]('LINESTRING(0 0,1 1)',' ,()') as t
* GO
* id token separator
* -- ---------- ---------
* 1 LINESTRING (
* 2 0 NULL
* 3 0 ,
* 4 1 NULL
* 5 1 )
*
* DESCRIPTION
* Supplied a string and a list of separators this function returns resultant tokens as a table collection.
* Function returns both the token and the separator.
* Returned table collection contains a unique identifier to ensure tokens and separators are always correctly ordered.
* INPUTS
* @p_string (varchar max) - Any non-null string.
* @p_separators (varchar 254) - List of separators eg '(),'
* RESULT
* Table (Array) of Integers
* id (int) - Unique identifier for each row starting with first token/separator found.
* token (varchar MAX) - Token between separators
* separator (varchar MAX) - Separator between tokens.
* AUTHOR
* Simon Greener
* HISTORY
* Simon Greener - Aug 2012 - Converted to SQL Server 2012 (Uses new Lag/Lead function and returns separators)
* COPYRIGHT
* (c) 2008-2020 by TheSpatialDBAdvisor/Simon Greener
******/
BEGIN
WITH MyCTE AS (
SELECT c.beg,
c.sep,
ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid
FROM (SELECT b.beg, c.sep
FROM (SELECT a.beg
FROM (SELECT c.IntValue as beg
FROM [dbo].[generate_series](1,DATALENGTH(@p_string),1) c
) a
) b,
(SELECT SUBSTRING(@p_separators,d.IntValue,1) as sep
FROM [dbo].[generate_series](1,DATALENGTH(@p_separators),1) d
) c
WHERE CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0
UNION ALL
SELECT 0 as beg, @p_string as sep
) c
)
INSERT INTO @tokens
SELECT Row_Number() OVER (ORDER BY a.rid ASC) as id,
Case When DATALENGTH(a.token) = 0 Then null Else a.token End as token,
a.separator
FROM (SELECT d.rid,
SubString(@p_string, (d.beg + 1), (Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1) ) as token,
Lead(d.sep,1) OVER (ORDER BY d.rid asc) as separator
FROM MyCTE d
) as a
WHERE DATALENGTH(a.token) <> 0 or DATALENGTH(a.separator) <> 0;
RETURN;
END;
GO
Testing
select distinct t.token
from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;
Results
| token |
| LineString |
| MultiLineString |
| MultiPoint |
| MultiPolygon |
| Point |
| Polygon |
The classic “Rain in Spain…”.
SELECT t.*
FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
Results
| id | token | separator |
| 1 | The | {SPACE} |
| 2 | rain | {SPACE} |
| 3 | in | {SPACE} |
| 4 | spain | , |
| 5 | NULL | {SPACE} |
| 6 | stays | {SPACE} |
| 7 | mainly | {SPACE} |
| 8 | on | {SPACE} |
| 9 | the | {SPACE} |
| 10 | plain | . |
| 11 | NULL | ! |
Now, let’s process a POLYGON WKT.
SELECT t.id, t.token, t.separator
FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',' ,()') as t;
Results
| id | token | separator |
| 1 | POLYGON | ( |
| 2 | NULL | ( |
| 3 | 2300 | {SPACE} |
| 4 | 400 | , |
| 5 | NULL | {SPACE} |
| 6 | 2300 | {SPACE} |
| 7 | 700 | , |
| 8 | NULL | {SPACE} |
| 9 | 2800 | {SPACE} |
| 10 | 1100 | , |
| 11 | NULL | {SPACE} |
| 12 | 2300 | {SPACE} |
| 13 | 1100 | , |
| 14 | NULL | {SPACE} |
| 15 | 1800 | {SPACE} |
| 16 | 1100 | , |
| 17 | NULL | {SPACE} |
| 18 | 2300 | {SPACE} |
| 19 | 400 | ) |
| 20 | NULL | , |
| 21 | NULL | {SPACE} |
| 22 | NULL | ( |
| 23 | 2300 | {SPACE} |
| 24 | 1000 | , |
| 25 | NULL | {SPACE} |
| 26 | 2400 | {SPACE} |
| 27 | NULL | {SPACE} |
| 28 | 900 | , |
| 29 | NULL | {SPACE} |
| 30 | 2200 | {SPACE} |
| 31 | 900 | , |
| 32 | NULL | {SPACE} |
| 33 | 2300 | {SPACE} |
| 34 | 1000 | ) |
| 35 | NULL | ) |
This time don’t include the space as a separator.
SELECT t.id, t.token, t.separator
FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',',()') as t;
Results
| id | token | separator |
| 1 | POLYGON | ( |
| 2 | NULL | ( |
| 3 | 2300 400 | , |
| 4 | 2300 700 | , |
| 5 | 2800 1100 | , |
| 6 | 2300 1100 | , |
| 7 | 1800 1100 | , |
| 8 | 2300 400 | ) |
| 9 | NULL | , |
| 10 | {SPACE} | ( |
| 11 | 2300 1000 | , |
| 12 | 2400 900 | , |
| 13 | 2200 900 | , |
| 14 | 2300 1000 | ) |
| 15 | NULL | ) |
SQL Server 2016+ STRING_SPLIT
The STRING_SPLIT function released for SQL Server 2016 can be used instead of my Tokenizer function. The main difference is that the STRING_SPLT function returns only the tokens (or an additional ordinal identifier field) whereas the Tokenizer functions return a token id (ordinal), the token, and the separator between the tokens.
select distinct t.value
from STRING_SPLIT('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;
-- Result
--
value
LineString
MultiLineString
MultiPoint
MultiPolygon
Point
Polygon
-- Now return the ordinal value as an id.
--
select t.ordinal as id, t.value as token
from STRING_SPLIT('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':',1) as t;
-- Result
--
id token
1 LineString
2 MultiLineString
3 MultiPoint
4 MultiPolygon
5 Point
6 Point
7 LineString
8 Polygon
9 Polygon
I hope that someone out there finds this post useful.
Documentation
- GeoRaptor Documentation
- MySQL Spatial General Functions
- Oracle LRS Objects
- Oracle Spatial Exporter (Java + pl/SQL)
- Oracle Spatial Object Functions
- Oracle Spatial Object Functions (Multi Page)
- PostGIS pl/pgSQL Functions
- SC4O Oracle Java Topology Suite (Java + pl/SQL)
- SQL Server Spatial General TSQL Functions
- SQL Server Spatial LRS TSQL Functions