Top 5 Recent Articles
ARTICLES CATEGORIES
- Algorithms (22)
- All (399)
- Biography (1)
- Blog (44)
- Business Requirements (1)
- Commentary (1)
- Conversion (2)
- Customers (2)
- Data Models (1)
- Education (2)
- GeoRaptor (13)
- GPS (1)
- Image Processing (2)
- Import Export (8)
- Licensing (2)
- LiDAR (1)
- Linear Referencing (4)
- Manifold GIS (3)
- Mapping (1)
- MySQL Spatial (7)
- Networking and Routing (including Optimization) (5)
- Open Source (18)
- Oracle Spatial and Locator (194)
- Partitioning (1)
- PostGIS (36)
- Projections (1)
- Published Articles (1)
- qGIS (1)
- Recommendations (1)
- Services (1)
- Software Change Log (1)
- Source Code (37)
- Space Curves (9)
- Spatial Database Functions (109)
- Spatial DB comparison (1)
- Spatial XML Processing (11)
- SQL Server Spatial (92)
- Standards (3)
- Stored Procedure (17)
- Tessellation or Gridding (10)
- Tools (2)
- Topological Relationships (1)
- Training (2)
- Triangulation (2)
Using SQL Server XML and Spatial to import VicMap seed/Incremental Update Format (IUF) data (Part 3)
This is the 3rd and last blog post about processing VicMap IUF XML data.
(The 6 month delay is because I lost a draft of this article, and have been doing other things.)
While much in this and the previous articles are about an Australian XML data format, the principles outlined remain the same.
This article is about:
PreProcessing XML
The SQL processing the IUF XML in the previous articles is more complex than it needs to be.
File Size Reduction
We have seen (in Part 2) that where XML is formatted for human readability (c.f., “Pretty Printing”), it contains mostly white space and, if that white space is removed the size of the document is greatly reduced.
Coordinate Ordering
One aspect of that complexity is the fact that coordinates encoded in the
That is the coordinates before and after an
The SQL would be easier if a coordinate was encoded with its actual position within a polygon ring encoded as an attribute:
<OUTER> <XY id="1" >148.044619 -36.3917393</XY> <XY id="2" >148.0447565 -36.3917393</XY> <XY id="3" >148.0447565 -36.391678</XY> <XY id="4" >148.044619 -36.391678</XY> <XY id="5" >148.044619 -36.3917393</XY> </OUTER>
The SQL to process this is not provided in this article.
WKT Coordinate Lists
Another way of solving the XY order problem is to conduct some pre-processing of the XML before processing in SQL Server. To convert all the coordinates of a polygon ring into an WKT coordinate list as follows:
<OUTER>(148.044619 -36.3917393,148.0447565 -36.3917393,148.0447565 -36.391678,148.044619 -36.391678,148.044619 -36.3917393)</OUTER>
The formatting of the original IUF file to match this new description is via the following pre-processing (includes removal of datasets and spaces).
ECHO Strip un-need datasets PARISH_POLYGON and TOWNSHIP_POLYGON from XML, and drop spaces... msxsl all_seed_data_17-feb-2019.xml RemoveParishTownshipElements.xslt -xw -xe -t -o all_seed_data_17-feb-2019_thinned.xml ECHO Check validity of thinned XML... xml val all_seed_data_17-feb-2019_thinned.xml ECHO Reformat XML Change OUTER/INNER/XY to WKT coordinate strings. sed "s#,# #g" all_seed_data_17-feb-2019_thinned.xml | sed ":begin;$!N;s#</XY>\n<XY>#,#;tbegin;P;D" | sed ":begin;$!N;s#<OUTER>\n<XY>#<OUTER>(#;tbegin;P;D" | sed ":begin;$!N;s#</XY>\n</OUTER>#)</OUTER>#;tbegin;P;D" | sed ":begin;$!N;s#<INNER>\n<XY>#<INNER>(#;tbegin;P;D" | sed ":begin;$!N;s#</XY>\n</INNER>#)</INNER>#;tbegin;P;D" > all_seed_data_17-feb-2019_modified.xml ECHO Check validity of modified XML... xml val all_seed_data_17-feb-2019_modified.xml
The original file, all_seed_data_17-feb-2019.xml, was 342 MB (358,757,534 bytes) in size.
The final file, all_seed_data_17-feb-2019_modified.xml, after all processing, is 99.2 MB (104,047,912 bytes) in size.
This is as saving of (358,757,534 – 104,047,912) = 254,709,622 which means the new document is 29.0 % of the original size.
Quite a substantial saving!
Here is a Locality object before and after WKT processing (two spaces left before XY tags for readbility):
-- Before <POLYGON_ITEM> <HEADER TABLE="LOCALITY_POLYGON" PFI="205407825" CREATED="20131024153045" UFI="468965778"></HEADER> <OUTER> <XY>148.04446887,-36.39192699</XY> <XY>148.04487,-36.39192699</XY> <XY>148.04487,-36.39175625</XY> <XY>148.04446887,-36.39175625</XY> <XY>148.04446887,-36.39192699</XY> </OUTER> <INNER> <XY>148.04452460167477,-36.391801308813505</XY> <XY>148.04481426832533,-36.391801308813427</XY> <XY>148.04481426826769,-36.391881931347953</XY> <XY>148.04452460173229,-36.391881931347996</XY> <XY>148.04452460167477,-36.391801308813505</XY> </INNER> <OUTER> <XY>148.044619,-36.3917393</XY> <XY>148.0447565,-36.3917393</XY> <XY>148.0447565,-36.391678</XY> <XY>148.044619,-36.391678</XY> <XY>148.044619,-36.3917393</XY> </OUTER> <INNER> <XY>148.04463,-36.39169</XY> <XY>148.04475,-36.39169</XY> <XY>148.04475,-36.39173</XY> <XY>148.04463,-36.39169</XY> </INNER> <INNER> <XY>148.04463073970265441 -36.39171145551856767</XY> <XY>148.0446664800863914 -36.39173171778336524</XY> <XY>148.04463186538401942 -36.39173115494267563</XY> <XY>148.04463073970265441 -36.39171145551856767</XY> </INNER> <ATT>LOCALITY_NAME="WOODFIELD"</ATT> <ATT>GAZETTED_LOCALITY_NAME="WOODFIELD"</ATT> <ATT>VICNAMES_ID="103388"</ATT> <ATT>PFI_CREATED="20050927080612"</ATT> </POLYGON_ITEM> -- After <?xml version="1.0"?> <POLYGON_ITEM> <HEADER TABLE="LOCALITY_POLYGON" PFI="205407825" CREATED="20131024153045" UFI="468965778"></HEADER> <OUTER>(148.04446887 -36.39192699,148.04487 -36.39192699,148.04487 -36.39175625,148.04446887 -36.39175625,148.04446887 -36.39192699)</OUTER> <INNER>(148.04452460167477 -36.391801308813505,148.04452460173229 -36.391881931347996,148.04481426826769 -36.391881931347953,148.04481426832533 -36.391801308813427,148.04452460167477 -36.391801308813505)</INNER> <OUTER>(148.044619 -36.3917393,148.0447565 -36.3917393,148.0447565 -36.391678,148.044619 -36.391678,148.044619 -36.3917393)</OUTER> <INNER>(148.04463 -36.39169,148.04475 -36.39173,148.04475 -36.39169,148.04463 -36.39169)</INNER> <ATT>LOCALITY_NAME="WOODFIELD"</ATT> <ATT>GAZETTED_LOCALITY_NAME="WOODFIELD"</ATT> <ATT>VICNAMES_ID="103388"</ATT> <ATT>PFI_CREATED="20050927080612"</ATT> </POLYGON_ITEM>
Now, processing the modified XML shows a large increase in speed.
USE TESTDB GO DECLARE @xdoc xml, @rdoc int; SELECT @xdoc = convert(XML,BulkColumn,2) FROM OPENROWSET(BULK /* Open WKT "version" of all-seed_data_17-feb-2019 */ 'C:\Users\Simon\Downloads\all_seed_data_17-feb-2019\all_seed_data_17-feb-2019_modified.xml' , SINGLE_BLOB) as xdoc; --Create an internal representation of the XML document. EXEC sp_xml_preparedocument @rdoc OUTPUT, @xdoc; SELECT d.table_name, d.pfi, d.ufi, CONVERT(datetime, substring(d.created, 1,4) + '/' + substring(d.created, 5,2) + '/' + substring(d.created, 7,2) + ' ' + substring(d.created, 9,2) + ':' + substring(d.created,11,2) + ':' + substring(d.created,13,2), 20 ) as created, d.exterior_ring_count,d.interior_ring_count, d.geog FROM (SELECT a.table_name, a.pfi, a.created, a.ufi, exterior_ring_count, interior_ring_count, case when interior_ring_count is not null then geography::STGeomFromText('POLYGON(' + a.exterior_rings + ','+b.interior_rings+')',4283) else geography::STGeomFromText('POLYGON(' + a.exterior_rings + ')',4283) end as geog FROM (SELECT x.table_name, x.pfi, x.created, x.ufi, COUNT(*) as exterior_ring_count, /* Collect up all exterior rings */ STRING_AGG(x.ring,',') as exterior_rings FROM OPENXML (@rdoc, '//IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/OUTER', 2) WITH (table_name varchar(128) '../HEADER/@TABLE', pfi varchar(128) '../HEADER/@PFI', created varchar(128) '../HEADER/@CREATED', ufi varchar(128) '../HEADER/@UFI', ring varchar(max) './text()' ) as x WHERE x.table_name IN ('LGA_POLYGON','LOCALITY_POLYGON','POSTCODE_POLYGON') group by x.table_name, x.pfi, x.ufi, x.created ) as a LEFT OUTER JOIN /* In case no interior rings */ (SELECT x.table_name, x.pfi, x.created, x.ufi, COUNT(*) as interior_ring_count, /* Collect up all interior rings */ STRING_AGG(x.ring,',') as interior_rings FROM OPENXML (@rdoc, '//IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/INNER', 2) WITH (table_name varchar(128) '../HEADER/@TABLE', pfi varchar(128) '../HEADER/@PFI', created varchar(128) '../HEADER/@CREATED', ufi varchar(128) '../HEADER/@UFI', ring varchar(max) './text()' ) as x WHERE x.ring is not null AND x.table_name IN ('LGA_POLYGON','LOCALITY_POLYGON','POSTCODE_POLYGON') group by x.table_name, x.pfi, x.ufi, x.created ) as b on ( b.table_name = a.table_name and b.pfi = a.pfi and b.ufi = a.ufi) LEFT OUTER JOIN /* We are merging all attributes across all layers */ (select x.table_name, x.pfi, x.ufi, x.created from OPENXML (@rdoc, /* all_seed has only INSERT segments */ '//IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/ATT', 2) WITH (table_name varchar(max) '(../HEADER/@TABLE)[1]', pfi varchar(100) '(../HEADER/@PFI)[1]', ufi varchar(100) '(../HEADER/@UFI)[1]', created varchar(100) '(../HEADER/@CREATED)[1]', attributes varchar(max) '(text())[1]' ) as x group by x.table_name, x.pfi, x.ufi, x.created ) as c ON (c.table_name = a.table_name and c.pfi = a.pfi and c.ufi = a.ufi) ) as d; EXEC sp_xml_removedocument @rdoc; GO (3754 rows affected)
The lines were processed on my desktop in 30 seconds, which is a lot faster than processing the original IUF XML.
QED.
So, the benefits of this approaches using in these three articles are:
- While FME is the tool du jour in situations where multiple spatial format datasets need processing (not just XML) it can be an expensive option where a single format needs processing.
- One can use SQL Server’s (PostgreSQL’s or Oracle’s) standard XML tools, along with its spatial data types, to read, process and load spatial data from any XML format.
- This processing is cloud friendly, requiring no other tools than that provided in cloud environments like Azure.
Generally the size of XML files is due to Pretty Printing to improve human readability with substantial file size reductions possible with the removal of unnecessary spaces, comments etc. - Sometimes the source file can contain datasets that are not needed, their removal will reduce file size and speed processing.
- Finally, some encoding choices can increase format and processing complexity as could be seen from the
… tags. If possible, consideration should be given to modifying such encoding. The time to do this can be minimal compared to the decrease in processing complexity resulting in substantial performance improvements.
I hope this is of use to someone out there.
Documentation
- MySQL Spatial General Functions
- Oracle LRS Objects
- Oracle Spatial Exporter (Java + pl/SQL)
- Oracle Spatial Object Functions
- Oracle Spatial Object Functions (Multi Page)
- PostGIS pl/pgSQL Functions
- SC4O Oracle Java Topology Suite (Java + pl/SQL)
- SQL Server Spatial General TSQL Functions
- SQL Server Spatial LRS TSQL Functions