Top 5 Recent Articles
ARTICLES CATEGORIES
- Algorithms (22)
- All (399)
- Biography (1)
- Blog (44)
- Business Requirements (1)
- Commentary (1)
- Conversion (2)
- Customers (2)
- Data Models (1)
- Education (2)
- GeoRaptor (13)
- GPS (1)
- Image Processing (2)
- Import Export (8)
- Licensing (2)
- LiDAR (1)
- Linear Referencing (4)
- Manifold GIS (3)
- Mapping (1)
- MySQL Spatial (7)
- Networking and Routing (including Optimization) (5)
- Open Source (18)
- Oracle Spatial and Locator (194)
- Partitioning (1)
- PostGIS (36)
- Projections (1)
- Published Articles (1)
- qGIS (1)
- Recommendations (1)
- Services (1)
- Software Change Log (1)
- Source Code (37)
- Space Curves (9)
- Spatial Database Functions (109)
- Spatial DB comparison (1)
- Spatial XML Processing (11)
- SQL Server Spatial (92)
- Standards (3)
- Stored Procedure (17)
- Tessellation or Gridding (10)
- Tools (2)
- Topological Relationships (1)
- Training (2)
- Triangulation (2)
String Tokenizer for Oracle
UPDATED FUNCTION AT END OF ARTICLE RETURNS SEPARATORS
We often have need of string tokenizers in applications. Here is one for Oracle based on work by Pawel Barut.
-- We need a type to hold the returned tokens
--
CREATE OR REPLACE TYPE T_TokenSet AS TABLE OF VARCHAR2(4000);
/
SHOW errors
GRANT EXECUTE ON T_TokenSet TO public WITH GRANT OPTION;
/*********************************************************************************
** @function : Tokenizer
** @precis : Splits any string into its tokens.
** @description : Supplied a string and a list of separators this function
** returns resultant tokens as a pipelined collection.
** @example : SELECT t.column_value
** FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') ) t;
** @param : p_string. The string to be Tokenized.
** @param : p_separators. The characters that are used to split the string.
** @requires : t_TokenSet type to be declared.
** @history : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
** @history : Simon Greener - July 2006 - Original coding (extended SQL sourced from a blog on the internet)
**/
CREATE OR REPLACE
FUNCTION Tokenizer(p_string IN VarChar2,
p_separators IN VarChar2 DEFAULT ' ')
RETURN T_TokenSet Pipelined
AS
v_strs camod_common.T_TokenSet;
BEGIN
IF ( p_string IS NULL
OR
p_separators IS NULL ) THEN
RETURN;
END IF;
WITH sel_string AS (SELECT p_string fullstring FROM dual)
SELECT substr(fullstring, beg+1, end_p-beg-1) token
Bulk Collect INTO v_strs
FROM (SELECT beg, Lead(beg) OVER (ORDER BY beg) end_p, fullstring
FROM (SELECT beg, fullstring
FROM (SELECT Level beg, fullstring
FROM sel_string
CONNECT BY Level <= LENGTH(fullstring)
)
WHERE instr(p_separators,substr(fullstring,beg,1)) >0
UNION ALL
SELECT 0, fullstring
FROM sel_string
UNION ALL
SELECT LENGTH(fullstring)+1, fullstring
FROM sel_string)
)
WHERE end_p IS NOT NULL
AND end_p > beg + 1;
FOR i IN v_strs.FIRST..v_strs.LAST Loop
PIPE ROW(v_strs(i));
END Loop;
RETURN;
END Tokenizer;
Here are my, simple, tests.
SELECT DISTINCT t.column_value AS token
FROM TABLE(Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':')) t;
Result.
| token |
|---|
| LineString |
| MultiLineString |
| MultiPoint |
| MultiPolygon |
| Point |
| Polygon |
SELECT t.column_value AS token
FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!')) t;
Result.
| token |
|---|
| The |
| rain |
| in |
| spain |
| stays |
| mainly |
| on |
| the |
| plain |
Updated Function
Sometimes it is more than handy to be able to access the separators in the position they are found in the output from a Tokenizer. Here is an updated version of the above that does this.
-- New types
--
-- We need a type to hold the returned tokens
--
DROP TYPE T_Token Force;
DROP TYPE T_TokenSet Force;
CREATE TYPE T_Token AS Object (
id INTEGER,
token varchar2(30000),
separator varchar2(30000)
);
/
SHOW errors
GRANT EXECUTE ON T_Token TO public WITH GRANT OPTION;
CREATE TYPE T_TokenSet AS TABLE OF codesys.t_token;
/
SHOW errors
GRANT EXECUTE ON T_TokenSet TO public WITH GRANT OPTION;
/*********************************************************************************
** @function : Tokenizer
** @precis : Splits any string into its tokens.
** @description : Supplied a string and a list of separators this function
** returns resultant tokens as a pipelined collection.
** @example : SELECT t.column_value
** FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!') ) t;
** @param : p_string. The string to be Tokenized.
** @param : p_separators. The characters that are used to split the string.
** @requires : t_TokenSet type to be declared.
** @history : Pawel Barut, http://pbarut.blogspot.com/2007/03/yet-another-tokenizer-in-oracle.html
** @history : Simon Greener - July 2006 - Original coding (extended SQL sourced from a blog on the internet)
** @history : Simon Greener - Apr 2012 - Extended TO include returning OF tokens
**/
CREATE OR REPLACE
FUNCTION Tokenizer(p_string IN VarChar2,
p_separators IN VarChar2 DEFAULT ' ')
RETURN T_TokenSet Pipelined
AS
v_tokens codesys.T_TokenSet;
BEGIN
IF ( p_string IS NULL
OR
p_separators IS NULL ) THEN
RETURN;
END IF;
WITH myCTE AS (
SELECT c.beg, c.sep, ROW_NUMBER() OVER(ORDER BY c.beg ASC) rid
FROM (SELECT b.beg, c.sep
FROM (SELECT Level beg
FROM dual
CONNECT BY Level <= LENGTH(p_string)
) b,
(SELECT SubStr(p_separators,level,1) AS sep
FROM dual
CONNECT BY Level <= LENGTH(p_separators)
) c
WHERE instr(c.sep,substr(p_string,b.beg,1)) >0
UNION ALL SELECT 0, CAST(NULL AS varchar2(10)) FROM dual
) c
)
SELECT T_Token(ROW_NUMBER() OVER (ORDER BY a.rid ASC),
CASE WHEN LENGTH(a.token) = 0 THEN NULL ELSE a.token END,
a.sep) AS token
Bulk Collect INTO v_tokens
FROM (SELECT d.rid,
SubStr(p_string,
(d.beg + 1),
NVL((Lead(d.beg,1) OVER (ORDER BY d.rid ASC) - d.beg - 1),LENGTH(p_string)) ) AS token, Lead(d.sep,1) OVER (ORDER BY d.rid ASC) AS sep
FROM MyCTE d
) a
WHERE LENGTH(a.token) <> 0 OR LENGTH(a.sep) <> 0;
FOR v_i IN v_tokens.FIRST..v_tokens.LAST loop
PIPE ROW(v_tokens(v_i));
END LOOP;
RETURN;
END Tokenizer;
Testing
SELECT DISTINCT t.token
FROM TABLE(Tokenizer('LineString:MultiLineString:MultiPoint:MultiPolygon:Point:Point:LineString:Polygon:Polygon',':')) t;
Results
| TOKEN |
|---|
| LineString |
| MultiLineString |
| MultiPoint |
| MultiPolygon |
| Point |
| Polygon |
The classic “Rain in Spain…”.
SELECT t.*
FROM TABLE(tokenizer('The rain in spain, stays mainly on the plain.!',' ,.!')) t;
Results
| ID | TOKEN | SEPARATOR |
|---|---|---|
| 1 | The | {SPACE} |
| 2 | rain | {SPACE} |
| 3 | in | {SPACE} |
| 4 | spain | , |
| 5 | (null) | {SPACE} |
| 6 | stays | {SPACE} |
| 7 | mainly | {SPACE} |
| 8 | on | {SPACE} |
| 9 | the | {SPACE} |
| 10 | plain | . |
| 11 | (null) | ! |
Now, let’s process a POLYGON WKT.
SELECT t.id, t.token, t.separator
FROM TABLE(tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',' ,()')) t;
Results
| ID | TOKEN | SEPARATOR |
|---|---|---|
| 1 | POLYGON | ( |
| 2 | (null) | ( |
| 3 | 2300 | {SPACE} |
| 4 | 400 | , |
| 5 | (null) | {SPACE} |
| 6 | 2300 | {SPACE} |
| 7 | 700 | , |
| 8 | (null) | {SPACE} |
| 9 | 2800 | {SPACE} |
| 10 | 1100 | , |
| 11 | (null) | {SPACE} |
| 12 | 2300 | {SPACE} |
| 13 | 1100 | , |
| 14 | (null) | {SPACE} |
| 15 | 1800 | {SPACE} |
| 16 | 1100 | , |
| 17 | (null) | {SPACE} |
| 18 | 2300 | {SPACE} |
| 19 | 400 | ) |
| 20 | (null) | , |
| 21 | (null) | {SPACE} |
| 22 | (null) | ( |
| 23 | 2300 | {SPACE} |
| 24 | 1000 | , |
| 25 | (null) | {SPACE} |
| 26 | 2400 | {SPACE} |
| 27 | (null) | {SPACE} |
| 28 | 900 | , |
| 29 | (null) | {SPACE} |
| 30 | 2200 | {SPACE} |
| 31 | 900 | , |
| 32 | (null) | {SPACE} |
| 33 | 2300 | {SPACE} |
| 34 | 1000 | ) |
| 35 | (null) | ) |
This time don’t include the space as a separator.
SELECT t.id, t.token, t.separator
FROM TABLE(tokenizer('POLYGON((2300 400, 2300 700, 2800 1100, 2300 1100, 1800 1100, 2300 400), (2300 1000, 2400 900, 2200 900, 2300 1000))',',()')) t;
Results
| ID | TOKEN | SEPARATOR |
|---|---|---|
| 1 | POLYGON | ( |
| 2 | (null) | ( |
| 3 | 2300 400 | , |
| 4 | 2300 700 | , |
| 5 | 2800 1100 | , |
| 6 | 2300 1100 | , |
| 7 | 1800 1100 | , |
| 8 | 2300 400 | ) |
| 9 | (null) | , |
| 10 | {SPACE} | ( |
| 11 | 2300 1000 | , |
| 12 | 2400 900 | , |
| 13 | 2200 900 | , |
| 14 | 2300 1000 | ) |
| 15 | (null) | ) |
I hope that someone out there finds this useful.code id=
Documentation
- GeoRaptor Documentation
- MySQL Spatial General Functions
- Oracle LRS Objects
- Oracle Spatial Exporter (Java + pl/SQL)
- Oracle Spatial Object Functions
- Oracle Spatial Object Functions (Multi Page)
- PostGIS pl/pgSQL Functions
- SC4O Oracle Java Topology Suite (Java + pl/SQL)
- SQL Server Spatial General TSQL Functions
- SQL Server Spatial LRS TSQL Functions