Top 5 Recent Articles
ARTICLES CATEGORIES
- Algorithms (22)
- All (399)
- Biography (1)
- Blog (44)
- Business Requirements (1)
- Commentary (1)
- Conversion (2)
- Customers (2)
- Data Models (1)
- Education (2)
- GeoRaptor (13)
- GPS (1)
- Image Processing (2)
- Import Export (8)
- Licensing (2)
- LiDAR (1)
- Linear Referencing (4)
- Manifold GIS (3)
- Mapping (1)
- MySQL Spatial (7)
- Networking and Routing (including Optimization) (5)
- Open Source (18)
- Oracle Spatial and Locator (194)
- Partitioning (1)
- PostGIS (36)
- Projections (1)
- Published Articles (1)
- qGIS (1)
- Recommendations (1)
- Services (1)
- Software Change Log (1)
- Source Code (37)
- Space Curves (9)
- Spatial Database Functions (109)
- Spatial DB comparison (1)
- Spatial XML Processing (11)
- SQL Server Spatial (92)
- Standards (3)
- Stored Procedure (17)
- Tessellation or Gridding (10)
- Tools (2)
- Topological Relationships (1)
- Training (2)
- Triangulation (2)
Using SQL Server XML and Spatial to import VicMap seed/Incremental Update Format (IUF) data (Part 2)
In Part 1 were were introduced to the IUF data, its structure, and processing (for Locality elements).
In Part 2 we will look at ways of reducing the data volume of the seed file, and look at issues when processing whole layers (LGA will be used instead of LOCALITY).
In Part 3 we will look at the effect enhanced pre-processing of the structure of the XY elements of the source seed file can have on data volume, reduced processing complexity and improved speed.
Pruning or Filtering Tables
A piece of functionality that is required when processing XML with SQL is the ability to select from the document only those layers that are needed.
It is also possible to remove unneeded layers from the original document (perhaps one wished to reduce data usage for an Azure Blob Store!)
Both of these approaches are described in the following sections.
1. SQL Filtering using XML Attributes
In part 1 we saw how to process the XML in order to select and process the data a single layer at a time.
This is useful so that we can write to a dedicated table eg [dbo].[Locality] allowing the XPATH/SQL to extract different attributes that are not present in another layer eg LGA.
The filtering can be done using a SQL WHERE clause predicate on the HEADER/@TABLE attribute, or can be done using XPATH against the XML document.
The following extracts the relevant components from the first article.
with data as ( select CONVERT(xml, '<?xml version="1.0"?> <!DOCTYPE IUF SYSTEM "iuf_full.dtd"> <IUF> ... </IUF>',2) as iuf ) select n.c.value('(HEADER/@TABLE)[1]', 'varchar(max)') as table_name from data as d CROSS APPLY d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LOCALITY_POLYGON"]') n(c); GO table_name ---------------- LOCALITY_POLYGON -- OR Filtering via a SQL WHERE clause predicate with data as ( select CONVERT(xml, '<?xml version="1.0"?> <!DOCTYPE IUF SYSTEM "iuf_full.dtd"> <IUF> ... </IUF>',2) as iuf ) SELECT n.c.value('(HEADER/@TABLE)[1]','varchar(max)') as table_name FROM data as d CROSS APPLY d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM') n(c) WHERE n.c.value('(HEADER/@TABLE)[1]','varchar(max)') ='LOCALITY_POLYGON' GO table_name ---------------- LOCALITY_POLYGON
2. Removing layers And Spaces from original document using XSLT
2.1. Removing layers
While filtering against the whole document is simple, the source document is very large (~360M) and can adversely affect the execution times for processing the XML.
Besides the Locality, LGA and Postcode layers, the VicMap Admin file also contains the Parish and Township layers which are less used than the former.
XML processing of the raw XML data includes these additional layers even if that processing is nothing more than filtering.
The unused layers can be removed by filtering using external (to SQL Server) tools.
2.2. Removing Spaces
The supplied XML data is “pretty printed”/Formatted for human consumption and, as we have seen, looks like this:
<?xml version="1.0"?> <!DOCTYPE IUF SYSTEM "iuf_full.dtd"> <IUF> <INSERT> <POLYGON> <!-- SEED for POLYGON table LGA_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="LGA_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> <!-- SEED for POLYGON table LOCALITY_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="LOCALITY_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> <!-- SEED for POLYGON table PARISH_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="PARISH_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> <!-- SEED for POLYGON table POSTCODE_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="POSTCODE_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> </POLYGON> </INSERT> </IUF>
Application XML processing does not need the spaces/tabs used in formatting the XML document.
How much can we save by removing the leading spaces so that the document looks like this:
<?xml version="1.0"?> <!DOCTYPE IUF SYSTEM "iuf_full.dtd"> <IUF> <INSERT> <POLYGON> <!-- SEED for POLYGON table LGA_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="LGA_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> <!-- SEED for POLYGON table LOCALITY_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="LOCALITY_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> <!-- SEED for POLYGON table PARISH_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="PARISH_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> <!-- SEED for POLYGON table POSTCODE_POLYGON commencing --> <POLYGON_NAME> <POLYGON_ITEM> <HEADER TABLE="POSTCODE_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER> <OUTER><XY>1,2</XY></OUTER> </POLYGON_ITEM> </POLYGON_NAME> </POLYGON> </INSERT> </IUF>
The following XSLT (StripAllSpaces.xslt) was used:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" omit-xml-declaration="no" indent="yes" doctype-system="iuf_full.dtd" encoding="UTF-8"/> <xsl:strip-space elements="*"/> <!--Identity transform (aka identity template).--> <xsl:template match="@*|node()"> <xsl:copy> <xsl:apply-templates select="@*|node()"/> </xsl:copy> </xsl:template> </xsl:stylesheet>
The Windows msxml.exe utility was used to execute the XSLT processing as follows (other Linux xml tools can be used):
msxsl all_seed_data_17-feb-2019.xml StripAllSpaces.xslt -xw -xe -t -o all_seed_data_17-feb-2019_spaces.xml Microsoft (R) XSLT Processor Version 4.0 Source document load time: 36611 milliseconds Stylesheet document load time: .293 milliseconds Stylesheet compile time: .114 milliseconds Stylesheet execution time: 5511 milliseconds
The original file, all_seed_data_17-feb-2019.xml, was 342 MB (358,757,534 bytes) in size.
The resultant file, all_seed_data_17-feb-2019_spaces.xml, is 241 MB (252,810,486 bytes).
This is as saving of (358,757,534 – 252,810,486) = 105,947,048 which means the new document is 70.5 % of the original size.
Quite a substantial saving!
2.3. Removing Layers Using XSLT
Finally, we can reduce the size of the document further by removing unwanted (Township and Parish) layers.
The following XSLT script enables this pruning.
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" omit-xml-declaration="no" indent="yes" doctype-system="iuf_full.dtd" encoding="UTF-8"/> <xsl:strip-space elements="*"/> <!--Identity transform (aka identity template).--> <xsl:template match="@*|node()"> <xsl:copy> <xsl:apply-templates select="@*|node()"/> </xsl:copy> </xsl:template> <!--templates to match and remove DESCRIPTION, PARISH_POLYGON and TOWNSHIP_POLYGON complete tags.--> <xsl:template match="DESCRIPTION"/> <xsl:template match="*/POLYGON/POLYGON_NAME[POLYGON_ITEM/HEADER/@TABLE='PARISH_POLYGON' or POLYGON_ITEM/HEADER/@TABLE='TOWNSHIP_POLYGON']" /> </xsl:stylesheet>
Processing using msxml.exe is as follows:
msxsl all_seed_data_17-feb-2019.xml RemoveParishTownshipElements.xslt -xw -xe -t -o all_seed_data_17-feb-2019_thinned.xml Microsoft (R) XSLT Processor Version 4.0 Source document load time: 41546 milliseconds Stylesheet document load time: .332 milliseconds Stylesheet compile time: .174 milliseconds Stylesheet execution time: 3601 milliseconds
The size of a complete seed document for VicMap Admin is 358,760,448 bytes.
If we remove the two unwanted layers and the leading spaces, the resultant file size is 144,052,224 bytes.
The difference is a saving of (358,760,448 – 144,052,224) = 214,708,224 bytes. So, the new file is ( 144,052,224 / 358,760,448 ) * 100.0 = 40.15 % of the original.
That is a huge saving!
The document all_seed_data_17-feb-2019.xml is a copy of the VicMap Admin data at a point in time. More recent files can be downloaded from the VicMap servers for free.
3. Processing the XML to produce a table
All the work done in Part 1 involved simple processing of “in line” XML, with the last example showing how to read from a file of a very small XML document(admin_seed_example.xml); neither included writing to a table.
The initial processing of the file with the two layers and spaces removed took quite a while to process (greater than 12 hours).
This was/is probably due to my XPATH processing of the XML document directly from the disk file using OPENROWSET, but also my inexperience with this XML/IUF document. It may also be that the wordiness of the coding of the coordinates (
In the end the following recipe was followed:
- 3.1 Create target table;
- 3.2 Load the XML into a table from the disk file;
- 3.3 Create XML Indexes on this table;
- 3.4 Process the loaded XML one layer at a time to create an output table;
- 3.5 Index and Visualise the results.
3.1 Create target table
-- Create LGA table DROP TABLE IF EXISTS [vicmap].[admin_seed] GO CREATE TABLE [vicmap].[admin_seed] ( id int identity(1,1) not null, iuf xml ); GO
3.2 Load the XML into a table from the disk file
INSERT INTO [vicmap].[admin_seed](iuf) SELECT CONVERT(xml, BulkColumn, 2) as iuf FROM OPENROWSET(BULK N'C:\Users\Simon\Downloads\all_seed_data_17-feb-2019\all_seed_data_17-feb-2019_thinned.xml', SINGLE_BLOB ) AS x; GO ALTER TABLE [vicmap].[admin_seed] ADD CONSTRAINT vicmap_admin_seed_pk PRIMARY KEY CLUSTERED (id); GO
3.3 Create XML Indexes on this table
-- Create Primary XML Index CREATE PRIMARY XML INDEX PXML_admin_seed_iuf ON [vicmap].[admin_seed](iuf); GO -- Create Selective XML Index on table section CREATE SELECTIVE XML INDEX SXI_Admin_Seed_IUF_ix ON [vicmap].[admin_seed](iuf) FOR ( pathHeader = '/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/HEADER/@TABLE' AS XQUERY 'xs:string' SINGLETON ); -- Test by checking explain plan select COUNT(*) from (select c.pfi, c.ufi, CONVERT(datetime, SUBSTRING(c.create_date, 1,4)+'-'+ SUBSTRING(c.create_date, 5,2)+'-'+ SUBSTRING(c.create_date, 7,2)+' '+ SUBSTRING(c.create_date, 9,2)+':'+ SUBSTRING(c.create_date,11,2)+':'+ SUBSTRING(c.create_date,13,2), 20) as create_date, c.polygon_item from (select n.c.value('(HEADER/@PFI)[1]', 'varchar(100)') as pfi, n.c.value('(HEADER/@UFI)[1]', 'varchar(100)') as ufi, n.c.value('(HEADER/@CREATED)[1]','varchar(100)') as create_date, n.c.query('.') as polygon_item from [vicmap].[admin_seed] as d CROSS APPLY d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]') n(c) ) as c ) as f; GO
Examination of the plan shows that the primary and selective indexes are being used. The plan may look like this.
3.4 Process the loaded XML one layer at a time to create the LGA table
The processing of the [vicmap].[admin_seed] document to produce a final LGA (Locality SQL is available on request) table can be achieved as follows.
DROP TABLE [vicmap].[lga]; GO CREATE TABLE [vicmap].[lga] ( id int IDENTITY(1,1) not null, pfi varchar(50), ufi varchar(50), create_date varchar(100), lga_code varchar(100), lga_name varchar(500), lga_official_name varchar(500), gazettal_registration varchar(500), pfi_created varchar(100), geog geography ); GO INSERT INTO [vicmap].[lga] WITH (TABLOCKX) ( pfi, ufi, create_date, lga_code, lga_name, lga_official_name, pfi_created, geog) SELECT a.pfi, a.ufi, a.create_date, lga_code, lga_name, lga_official_name, pfi_created, geography::STGeomFromText(a.exterior_rings,4326) .STSymDifference( geography::STGeomFromText(COALESCE(b.interior_rings,'POLYGON EMPTY'),4326) ) as geog from (select g.pfi, g.ufi, g.create_date, case when count(*) = 1 then 'POLYGON(' + STRING_AGG( g.exterior_ring,',') + ')' else 'MULTIPOLYGON(' + STRING_AGG('('+g.exterior_ring+')',',') + ')' end as exterior_rings from (select f.pfi, f.ufi, f.create_date, '(' + (STRING_AGG(f.XY,',') WITHIN GROUP (ORDER BY f.xy_position ASC)) + ')' as exterior_ring from (select n.c.value('(HEADER/@PFI)[1]', 'varchar(100)') as pfi, n.c.value('(HEADER/@UFI)[1]', 'varchar(100)') as ufi, n.c.value('(HEADER/@CREATED)[1]','varchar(100)') as create_date, o.i.value('(local-name(.))', 'varchar(100)') as coord_tag, o.i.value('for $i in . return count(../*[. << $i]) + 1','int') as ring_position, x.y.value('for $i in . return count(../*[. << $i]) + 1', 'int') as xy_position, REPLACE(x.y.value('(text())[1]', 'varchar(max)'),',',' ') as XY from [vicmap].[admin_seed] as d CROSS APPLY d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]') n(c) CROSS APPLY n.c.nodes('OUTER') o(i) /* Extract OUTER node to enable counting */ CROSS APPLY o.i.nodes('XY') x(y) /* Extract XY coordinates for current OUTER */ ) as f group by f.pfi, f.ufi, f.create_date, f.ring_position ) as g group by g.pfi, g.ufi, g.create_date ) as a left outer join (select g.pfi, g.ufi, case when count(*) = 1 then 'POLYGON(' + STRING_AGG( g.interior_ring,',') + ')' else 'MULTIPOLYGON(' + STRING_AGG('('+g.interior_ring+')',',') + ')' end as interior_rings from (select f.pfi, f.ufi, f.coord_tag, f.ring_position, '('+(STRING_AGG(f.XY,',') WITHIN GROUP (ORDER BY f.xy_position DESC))+')' as interior_ring from (select n.c.value('(HEADER/@PFI)[1]', 'varchar(100)') as pfi, n.c.value('(../HEADER/@UFI)[1]', 'varchar(100)') as ufi, o.i.value('(local-name(.))', 'varchar(100)') as coord_tag, o.i.value('for $i in . return count(../*[. << $i]) + 1','int') as ring_position, x.y.value('for $i in . return count(../*[. << $i]) + 1', 'int') as xy_position, REPLACE(x.y.value('(text())[1]', 'varchar(max)'),',',' ') as XY from [vicmap].[admin_seed] as d CROSS APPLY d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]') n(c) CROSS APPLY n.c.nodes('INNER') o(i) /* Extract OUTER node to enable counting */ CROSS APPLY o.i.nodes('XY') x(y) /* Extract XY coordinates for current OUTER */ ) as f group by f.pfi, f.ufi, f.coord_tag, f.ring_position ) as g group by g.pfi, g.ufi ) as b on (b.pfi = a.pfi and b.ufi = a.ufi) LEFT OUTER JOIN /* Attributes */ (SELECT v.pfi,v.ufi,v.LGA_CODE,v.LGA_NAME,v.LGA_OFFICIAL_NAME,v.GAZETTAL_REGISTRATION,v.PFI_CREATED FROM (select f.pfi,f.ufi,f.attribute_name,f.attribute_value from (select n.c.value('(../HEADER/@PFI)[1]', 'varchar(100)') as pfi, n.c.value('(../HEADER/@UFI)[1]', 'varchar(100)') as ufi, SUBSTRING(n.c.value('(text())[1]','varchar(max)'), 1, CHARINDEX('=',n.c.value('(text())[1]', 'varchar(max)'))-1 ) as attribute_name, REPLACE( SUBSTRING(n.c.value('(text())[1]', 'varchar(max)'), CHARINDEX('=',n.c.value('(text())[1]', 'varchar(max)'))+1, 1000), '"', '' ) as attribute_value from [vicmap].[admin_seed] as d CROSS APPLY d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]/ATT') n(c) ) as f ) AS T PIVOT ( max(attribute_value) FOR attribute_name in ( LGA_CODE, LGA_NAME, LGA_OFFICIAL_NAME, GAZETTAL_REGISTRATION, PFI_CREATED ) ) as v ) as c ON (c.pfi = a.pfi and c.ufi = a.ufi ); GO -- 87 rows in 2 hours 23 minutes -- Check if resultant polygons are valid SELECT l.geog.IsValidDetailed() as vReason, count(*) as eCount FROM vicmap.lga as l GROUP BY l.geog.IsValidDetailed(); GO -- vReason eCount -- 24400: Valid 87
3.5 Index and Visualise the results.
-- Primary key on ID ALTER TABLE [vicmap].[lga] ADD CONSTRAINT lga_pk PRIMARY KEY CLUSTERED (id); GO -- Unique index on PFI CREATE UNIQUE INDEX lga_pfi_ix ON [vicmap].[lga](PFI); GO -- Create Spatial Index CREATE SPATIAL INDEX [lga_geog_spx] ON [vicmap].[lga] ([geog]) USING GEOGRAPHY_AUTO_GRID WITH (CELLS_PER_OBJECT = 16, PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] GO -- Visualise using SSMS SELECT * FROM [vicmap].[lga] as l; GO
This resulting image of all the LGA data is as follows:
In Part 3 we will look at preprocessing of the XML file to improve performance and show how to load the data using XQUERY rather than XPATH.
Documentation
- MySQL Spatial General Functions
- Oracle LRS Objects
- Oracle Spatial Exporter (Java + pl/SQL)
- Oracle Spatial Object Functions
- Oracle Spatial Object Functions (Multi Page)
- PostGIS pl/pgSQL Functions
- SC4O Oracle Java Topology Suite (Java + pl/SQL)
- SQL Server Spatial General TSQL Functions
- SQL Server Spatial LRS TSQL Functions