Top 5 Recent Articles
ARTICLES CATEGORIES
- Algorithms (19)
- All (400)
- Biography (1)
- Blog (45)
- Business Requirements (1)
- Commentary (1)
- Customers (2)
- Data Models (1)
- Education (2)
- GeoRaptor (13)
- GPS (1)
- Image Processing (2)
- Import Export (8)
- Licensing (2)
- LiDAR (1)
- Linear Referencing (4)
- Manifold GIS (3)
- Mapping (1)
- MySQL Spatial (7)
- Networking and Routing (including Optimization) (4)
- Open Source (18)
- Oracle Spatial and Locator (193)
- Partitioning (1)
- PostGIS (34)
- Published Articles (1)
- qGIS (1)
- Recommendations (1)
- Services (1)
- Software Change Log (1)
- Source Code (35)
- Space Curves (9)
- Spatial Database Functions (107)
- Spatial DB comparison (1)
- Spatial XML Processing (11)
- SQL Server Spatial (91)
- Standards (3)
- Stored Procedure (15)
- Tessellation or Gridding (10)
- Tools (2)
- Topological Relationships (1)
- Training (2)
- Triangulation (2)
TSQL String Tokenizer Function for SQL Server
Update: Tokenizer has been updated for Denali analytic features and to expose the separators in the output.
We often have need of string tokenizers in applications. I had need of one in some TSQL development I am currently doing for a customer. I had previously written one for Oracle so I decided to re-write it for SQL Server 2008 (I believe it should work in 2005).
The main limitations with the conversion are the lack of a hierarchical “CONNECT BY LEVEL” clause and the lack of a LEAD() function for SQL Server’s limited implementation of analytics. Also, the function is dependent on my generate_series() function as described in this article
Still, with a little perseverance I came up with a working implementation.
Here it is.
USE [GISDB] -- You need to change this if you use this function. GO /********************************************************************************* ** @function : Tokenizer ** @precis : Splits any string into its tokens. ** @description : Supplied a string and a list of separators this function ** returns resultant tokens as a table collection. ** @example : SELECT t.token ** FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t; ** @param : p_string. The string to be Tokenized. ** @param : p_separators. The characters that are used to split the string. ** @depend : dbo.generate_series() ** @history : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html ** @history : Simon Greener - Jul 2006 - Original coding (extended SQL) ** @history : Simon Greener - Aug 2008 - Converted to SQL Server 2008 **/ Drop Function Tokenizer; ==<nbsp/>== Create Function Tokenizer(@p_string VarChar(MAX), @p_separators VarChar(254)) Returns @varchar_table TABLE ( token varchar(MAX) ) As Begin Begin With myCte As ( Select c.beg, c.fullstring, ROW_NUMBER() OVER(ORDER BY c.beg ASC) RowVersion From (Select b.beg, b.fullstring From (Select a.beg, @p_string as fullstring From (Select c.IntValue as beg From dbo.generate_series(1,DATALENGTH(@p_string),1) c ) a ) b, (Select SUBSTRING(@p_separators,d.IntValue,1) as delim From dbo.generate_series(1,DATALENGTH(@p_separators),1) d ) c Where CHARINDEX(c.delim,SUBSTRING(b.fullstring,b.beg,1)) > 0 Union All Select 0 as beg, @p_string as fullstring Union All Select DATALENGTH(@p_string)+1 as beg, @p_string as fullstring ) c ) INSERT INTO @varchar_table Select SUBSTRING(d.fullstring, (d.beg + 1), (d.end_p - d.beg - 1) ) token From (Select BASE.beg, LEAD.beg end_p, BASE.fullstring From MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.RowVersion = LEAD.RowVersion-1 ) d Where d.end_p Is Not Null And d.end_p > d.beg + 1; RETURN; End; End GO
Here are my, simple, tests.
SELECT distinct t.token FROM dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;
Result.
token |
LineString |
MultiLineString |
MultiPoint |
MultiPolygon |
Point |
Polygon |
SELECT t.token FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
Result.
token |
Then |
rain |
in |
spain |
stays |
mainly |
on |
the |
plain |
Now, if you want to collect them back into a single string, here’s an example of what you can do.
SELECT (STUFF((SELECT DISTINCT ':' + a.gtype FROM ( select distinct t.token as gtype from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t ) a ORDER BY ':' + a.gtype FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'') ) as GeometryTypes;
Result.
GeometryTypes |
LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Polygon |
Upgraded Version for Denali
** @history : Simon Greener - Apr 2012 - Extended to include returning of tokens -- Connect to database holding generate_series -- USE [GISDB] -- Change to your database GO -- Drop function if exists -- IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = object_id (N'[dbo].[Tokenizer]') AND OBJECTPROPERTY(id, N'IsTableFunction') = 1) DROP FUNCTION [dbo].[Tokenizer] GO -- -- Now let's create it -- /********************************************************************************* ** @function : Tokenizer ** @precis : Splits any string into its tokens. ** @description : Supplied a string and a list of separators this function ** returns resultant tokens as a table collection. ** @example : SELECT t.token ** FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t; ** @param : p_string. The string to be Tokenized. ** @param : p_separators. The characters that are used to split the string. ** @history : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html ** @history : Simon Greener - Jul 2006 - Original coding (extended SQL) ** @history : Simon Greener - Aug 2008 - Converted to SQL Server 2008 ** @history : Simon Greener - Aug 2012 - Converted to SQL Server 2012 and return separators **/ Create Function [dbo].[Tokenizer](@p_string VarChar(MAX), @p_separators VarChar(254)) Returns @varchar_table TABLE ( id int, token varchar(MAX), separator varchar(MAX) ) As Begin Begin With MyCTE As ( Select c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) as rid From (Select b.beg, c.sep From (Select a.beg From (select c.IntValue as beg from dbo.generate_series(1,DATALENGTH(@p_string),1) c ) a ) b, (select SUBSTRING(@p_separators,d.IntValue,1) as sep from dbo.generate_series(1,DATALENGTH(@p_separators),1) d ) c Where CHARINDEX(c.sep,SUBSTRING(@p_string,b.beg,1)) > 0 Union All Select 0 as beg, CAST(NULL AS varchar) as sep ) c ) Insert Into @varchar_table Select Row_Number() Over (Order By a.rid ASC) as Id, Case When DataLength(a.token) = 0 Then null Else a.token End as token, a.sep From (Select d.rid, SubString(@p_string, (d.beg + 1), (Lead(d.beg,1) Over (Order By d.rid asc) - d.beg - 1) ) as token, Lead(d.sep,1) Over (Order By d.rid asc) as sep From MyCTE d ) as a Where DataLength(a.token) <> 0 or DataLength(a.sep) <> 0; Return; End; End GO
Testing
select distinct t.token from dbo.Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':') as t;
Results
token |
LineString |
MultiLineString |
MultiPoint |
MultiPolygon |
Point |
Polygon |
The classic “Rain in Spain…”.
SELECT t.* FROM dbo.tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') t;
Results
id | token | separator |
1 | The | {SPACE} |
2 | rain | {SPACE} |
3 | in | {SPACE} |
4 | spain | , |
5 | NULL | {SPACE} |
6 | stays | {SPACE} |
7 | mainly | {SPACE} |
8 | on | {SPACE} |
9 | the | {SPACE} |
10 | plain | . |
11 | NULL | ! |
Now, let’s process a POLYGON WKT.
SELECT t.id, t.token, t.separator FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',' ,()') as t;
Results
id | token | separator |
1 | POLYGON | ( |
2 | NULL | ( |
3 | 2300 | {SPACE} |
4 | 400 | , |
5 | NULL | {SPACE} |
6 | 2300 | {SPACE} |
7 | 700 | , |
8 | NULL | {SPACE} |
9 | 2800 | {SPACE} |
10 | 1100 | , |
11 | NULL | {SPACE} |
12 | 2300 | {SPACE} |
13 | 1100 | , |
14 | NULL | {SPACE} |
15 | 1800 | {SPACE} |
16 | 1100 | , |
17 | NULL | {SPACE} |
18 | 2300 | {SPACE} |
19 | 400 | ) |
20 | NULL | , |
21 | NULL | {SPACE} |
22 | NULL | ( |
23 | 2300 | {SPACE} |
24 | 1000 | , |
25 | NULL | {SPACE} |
26 | 2400 | {SPACE} |
27 | NULL | {SPACE} |
28 | 900 | , |
29 | NULL | {SPACE} |
30 | 2200 | {SPACE} |
31 | 900 | , |
32 | NULL | {SPACE} |
33 | 2300 | {SPACE} |
34 | 1000 | ) |
35 | NULL | ) |
This time don’t include the space as a separator.
SELECT t.id, t.token, t.separator FROM dbo.tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',',()') as t;
Results
id | token | separator |
1 | POLYGON | ( |
2 | NULL | ( |
3 | 2300 400 | , |
4 | 2300 700 | , |
5 | 2800 1100 | , |
6 | 2300 1100 | , |
7 | 1800 1100 | , |
8 | 2300 400 | ) |
9 | NULL | , |
10 | {SPACE} | ( |
11 | 2300 1000 | , |
12 | 2400 900 | , |
13 | 2200 900 | , |
14 | 2300 1000 | ) |
15 | NULL | ) |
I hope that someone out there finds this useful.
Documentation
- MySQL Spatial General Function Documentation
- Oracle LRS Object Documentation
- Oracle Spatial Exporter Package Documentation
- Oracle Spatial Object Function Documentation
- Oracle Spatial Object Function Documentation (Multi Page Version)
- PostGIS pl/pgSQL Function Documentation
- SC4O Oracle Java Topology Suite (Stored Procedures) Package Documentation
- SQL Server Spatial General TSQL Function Documentation
- SQL Server Spatial LRS TSQL Function Documentation