Using SQL Server XML and Spatial to import VicMap seed/Incremental Update Format (IUF) data (Part 3)

This is the 3rd and last blog post about processing VicMap IUF XML data.

(The 6 month delay is because I lost a draft of this article, and have been doing other things.)

While much in this and the previous articles are about an Australian XML data format, the principles outlined remain the same.

This article is about:

PreProcessing XML

The SQL processing the IUF XML in the previous articles is more complex than it needs to be.

File Size Reduction

We have seen (in Part 2) that where XML is formatted for human readability (c.f., “Pretty Printing”), it contains mostly white space and, if that white space is removed the size of the document is greatly reduced.

Coordinate Ordering

One aspect of that complexity is the fact that coordinates encoded in the tags have no order except that which is implied from their location within the IUF.

That is the coordinates before and after an pair are assumed to be in sequential order for the polygon being encoded. Processing by implied ordering was addressed in the processing in the previous articles, but it complicates, and slows, the XML processing.

The SQL would be easier if a coordinate was encoded with its actual position within a polygon ring encoded as an attribute:

<OUTER>
  <XY id="1" >148.044619 -36.3917393</XY>
  <XY id="2" >148.0447565 -36.3917393</XY>
  <XY id="3" >148.0447565 -36.391678</XY>
  <XY id="4" >148.044619  -36.391678</XY>
  <XY id="5" >148.044619  -36.3917393</XY>
</OUTER>

The SQL to process this is not provided in this article.

WKT Coordinate Lists

Another way of solving the XY order problem is to conduct some pre-processing of the XML before processing in SQL Server. To convert all the coordinates of a polygon ring into an WKT coordinate list as follows:

<OUTER>(148.044619 -36.3917393,148.0447565 -36.3917393,148.0447565 -36.391678,148.044619 -36.391678,148.044619 -36.3917393)</OUTER>

The formatting of the original IUF file to match this new description is via the following pre-processing (includes removal of datasets and spaces).

ECHO Strip un-need datasets PARISH_POLYGON and TOWNSHIP_POLYGON from XML, and drop spaces...
msxsl all_seed_data_17-feb-2019.xml RemoveParishTownshipElements.xslt -xw -xe -t -o all_seed_data_17-feb-2019_thinned.xml

ECHO Check validity of thinned XML...
xml val all_seed_data_17-feb-2019_thinned.xml

ECHO Reformat XML Change OUTER/INNER/XY to WKT coordinate strings.

sed "s#,# #g" all_seed_data_17-feb-2019_thinned.xml | 
sed ":begin;$!N;s#</XY>\n<XY>#,#;tbegin;P;D" | 
sed ":begin;$!N;s#<OUTER>\n<XY>#<OUTER>(#;tbegin;P;D" | 
sed ":begin;$!N;s#</XY>\n</OUTER>#)</OUTER>#;tbegin;P;D" | 
sed ":begin;$!N;s#<INNER>\n<XY>#<INNER>(#;tbegin;P;D" | 
sed ":begin;$!N;s#</XY>\n</INNER>#)</INNER>#;tbegin;P;D" > all_seed_data_17-feb-2019_modified.xml

ECHO Check validity of modified XML...
xml val all_seed_data_17-feb-2019_modified.xml

The original file, all_seed_data_17-feb-2019.xml, was 342 MB (358,757,534 bytes) in size.

The final file, all_seed_data_17-feb-2019_modified.xml, after all processing, is 99.2 MB (104,047,912 bytes) in size.

This is as saving of (358,757,534 – 104,047,912) = 254,709,622‬ which means the new document is 29.0 % of the original size.

Quite a substantial saving!

Here is a Locality object before and after WKT processing (two spaces left before XY tags for readbility):

-- Before
<POLYGON_ITEM>
<HEADER TABLE="LOCALITY_POLYGON" PFI="205407825" CREATED="20131024153045" UFI="468965778"></HEADER>
<OUTER>
  <XY>148.04446887,-36.39192699</XY>
  <XY>148.04487,-36.39192699</XY>
  <XY>148.04487,-36.39175625</XY>
  <XY>148.04446887,-36.39175625</XY>
  <XY>148.04446887,-36.39192699</XY>
</OUTER>
<INNER>
  <XY>148.04452460167477,-36.391801308813505</XY>
  <XY>148.04481426832533,-36.391801308813427</XY>
  <XY>148.04481426826769,-36.391881931347953</XY>
  <XY>148.04452460173229,-36.391881931347996</XY>
  <XY>148.04452460167477,-36.391801308813505</XY>
</INNER>
<OUTER>
  <XY>148.044619,-36.3917393</XY>
  <XY>148.0447565,-36.3917393</XY>
  <XY>148.0447565,-36.391678</XY>
  <XY>148.044619,-36.391678</XY>
  <XY>148.044619,-36.3917393</XY>
</OUTER>
<INNER>
  <XY>148.04463,-36.39169</XY>
  <XY>148.04475,-36.39169</XY>
  <XY>148.04475,-36.39173</XY> 
  <XY>148.04463,-36.39169</XY>
</INNER>
<INNER>
  <XY>148.04463073970265441 -36.39171145551856767</XY>
  <XY>148.0446664800863914 -36.39173171778336524</XY>
  <XY>148.04463186538401942 -36.39173115494267563</XY>
  <XY>148.04463073970265441 -36.39171145551856767</XY>
</INNER>
<ATT>LOCALITY_NAME="WOODFIELD"</ATT>
<ATT>GAZETTED_LOCALITY_NAME="WOODFIELD"</ATT>
<ATT>VICNAMES_ID="103388"</ATT>
<ATT>PFI_CREATED="20050927080612"</ATT>
</POLYGON_ITEM>

-- After
<?xml version="1.0"?>
<POLYGON_ITEM>
<HEADER TABLE="LOCALITY_POLYGON" PFI="205407825" CREATED="20131024153045" UFI="468965778"></HEADER>
<OUTER>(148.04446887 -36.39192699,148.04487 -36.39192699,148.04487 -36.39175625,148.04446887 -36.39175625,148.04446887 -36.39192699)</OUTER>
<INNER>(148.04452460167477 -36.391801308813505,148.04452460173229 -36.391881931347996,148.04481426826769 -36.391881931347953,148.04481426832533 -36.391801308813427,148.04452460167477 -36.391801308813505)</INNER>
<OUTER>(148.044619 -36.3917393,148.0447565 -36.3917393,148.0447565 -36.391678,148.044619 -36.391678,148.044619 -36.3917393)</OUTER>
<INNER>(148.04463 -36.39169,148.04475 -36.39173,148.04475 -36.39169,148.04463 -36.39169)</INNER>
<ATT>LOCALITY_NAME="WOODFIELD"</ATT>
<ATT>GAZETTED_LOCALITY_NAME="WOODFIELD"</ATT>
<ATT>VICNAMES_ID="103388"</ATT>
<ATT>PFI_CREATED="20050927080612"</ATT>
</POLYGON_ITEM>

Now, processing the modified XML shows a large increase in speed.

USE TESTDB
GO

DECLARE @xdoc xml,
        @rdoc int;
SELECT @xdoc = convert(XML,BulkColumn,2)
  FROM OPENROWSET(BULK 
                  /* Open WKT "version" of all-seed_data_17-feb-2019 */
                  'C:\Users\Simon\Downloads\all_seed_data_17-feb-2019\all_seed_data_17-feb-2019_modified.xml'
                  , SINGLE_BLOB) as xdoc;
--Create an internal representation of the XML document.  
EXEC sp_xml_preparedocument @rdoc OUTPUT, @xdoc;
SELECT d.table_name, d.pfi, d.ufi,
       CONVERT(datetime,
               substring(d.created, 1,4) + '/' +
               substring(d.created, 5,2) + '/' +
               substring(d.created, 7,2) + ' ' +
               substring(d.created, 9,2) + ':' +
               substring(d.created,11,2) + ':' +
               substring(d.created,13,2),
               20
       ) as created,
       d.exterior_ring_count,d.interior_ring_count,
       d.geog
  FROM (SELECT a.table_name, a.pfi, a.created, a.ufi, exterior_ring_count, interior_ring_count,
               case when interior_ring_count is not null
                    then geography::STGeomFromText('POLYGON(' + a.exterior_rings + ','+b.interior_rings+')',4283) 
                    else geography::STGeomFromText('POLYGON(' + a.exterior_rings + ')',4283) 
                end as geog
          FROM (SELECT x.table_name, x.pfi, x.created, x.ufi, 
                       COUNT(*) as exterior_ring_count,
                       /* Collect up all exterior rings */
                       STRING_AGG(x.ring,',') as exterior_rings
                  FROM OPENXML (@rdoc,
                                '//IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/OUTER',
                                2)
                          WITH (table_name varchar(128) '../HEADER/@TABLE',
                                pfi        varchar(128) '../HEADER/@PFI',
                                created    varchar(128) '../HEADER/@CREATED',
                                ufi        varchar(128) '../HEADER/@UFI',
                                ring       varchar(max) './text()'
                               ) as x
                 WHERE x.table_name IN ('LGA_POLYGON','LOCALITY_POLYGON','POSTCODE_POLYGON')
                 group by x.table_name, x.pfi, x.ufi, x.created
               ) as a
               LEFT OUTER JOIN /* In case no interior rings */
               (SELECT x.table_name, x.pfi, x.created, x.ufi, 
                       COUNT(*) as interior_ring_count,
                       /* Collect up all interior rings */
                       STRING_AGG(x.ring,',') as interior_rings
                  FROM OPENXML (@rdoc,
                                '//IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/INNER',
                                2)
                          WITH (table_name varchar(128) '../HEADER/@TABLE',
                                pfi        varchar(128) '../HEADER/@PFI',
                                created    varchar(128) '../HEADER/@CREATED',
                                ufi        varchar(128) '../HEADER/@UFI',
                                ring       varchar(max) './text()'
                               ) as x
                 WHERE x.ring is not null
                   AND x.table_name IN ('LGA_POLYGON','LOCALITY_POLYGON','POSTCODE_POLYGON')
                 group by x.table_name, x.pfi, x.ufi, x.created
               ) as b
               on ( b.table_name = a.table_name and b.pfi = a.pfi and b.ufi = a.ufi)
               LEFT OUTER JOIN
               /* We are merging all attributes across all layers */
               (select x.table_name, x.pfi, x.ufi, x.created
                  from OPENXML (@rdoc,
                                /* all_seed has only INSERT segments */
                                '//IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/ATT',
                                2)
                          WITH (table_name varchar(max) '(../HEADER/@TABLE)[1]',
                                pfi        varchar(100) '(../HEADER/@PFI)[1]',
                                ufi        varchar(100) '(../HEADER/@UFI)[1]',
                                created    varchar(100) '(../HEADER/@CREATED)[1]',
                                attributes varchar(max) '(text())[1]'
                               ) as x
                 group by x.table_name, x.pfi, x.ufi, x.created
               ) as c
               ON (c.table_name = a.table_name and c.pfi = a.pfi and c.ufi = a.ufi)
       ) as d;
EXEC sp_xml_removedocument @rdoc;
GO

(3754 rows affected)

The lines were processed on my desktop in 30 seconds, which is a lot faster than processing the original IUF XML.

QED.

So, the benefits of this approaches using in these three articles are:

  • While FME is the tool du jour in situations where multiple spatial format datasets need processing (not just XML) it can be an expensive option where a single format needs processing.
  • One can use SQL Server’s (PostgreSQL’s or Oracle’s) standard XML tools, along with its spatial data types, to read, process and load spatial data from any XML format.
  • This processing is cloud friendly, requiring no other tools than that provided in cloud environments like Azure.
    Generally the size of XML files is due to Pretty Printing to improve human readability with substantial file size reductions possible with the removal of unnecessary spaces, comments etc.
  • Sometimes the source file can contain datasets that are not needed, their removal will reduce file size and speed processing.
  • Finally, some encoding choices can increase format and processing complexity as could be seen from the tags. If possible, consideration should be given to modifying such encoding. The time to do this can be minimal compared to the decrease in processing complexity resulting in substantial performance improvements.

I hope this is of use to someone out there.