Using SQL Server XML and Spatial to import VicMap seed/Incremental Update Format (IUF) data (Part 2)

In Part 1 were were introduced to the IUF data, its structure, and processing (for Locality elements).

In Part 2 we will look at ways of reducing the data volume of the seed file, and look at issues when processing whole layers (LGA will be used instead of LOCALITY).

In Part 3 we will look at the effect enhanced pre-processing of the structure of the XY elements of the source seed file can have on data volume, reduced processing complexity and improved speed.

Pruning or Filtering Tables

A piece of functionality that is required when processing XML with SQL is the ability to select from the document only those layers that are needed.

It is also possible to remove unneeded layers from the original document (perhaps one wished to reduce data usage for an Azure Blob Store!)

Both of these approaches are described in the following sections.

1. SQL Filtering using XML Attributes

In part 1 we saw how to process the XML in order to select and process the data a single layer at a time.

This is useful so that we can write to a dedicated table eg [dbo].[Locality] allowing the XPATH/SQL to extract different attributes that are not present in another layer eg LGA.

The filtering can be done using a SQL WHERE clause predicate on the HEADER/@TABLE attribute, or can be done using XPATH against the XML document.

The following extracts the relevant components from the first article.

with data as (
select CONVERT(xml, 
'<?xml version="1.0"?>
<!DOCTYPE IUF SYSTEM "iuf_full.dtd">
<IUF>
...
</IUF>',2) as iuf
)
select n.c.value('(HEADER/@TABLE)[1]',  'varchar(max)') as table_name
  from data as d
       CROSS APPLY 
       d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LOCALITY_POLYGON"]') n(c);
GO

table_name
----------------
LOCALITY_POLYGON

-- OR Filtering via a SQL WHERE clause predicate

with data as (
select CONVERT(xml, 
'<?xml version="1.0"?>
<!DOCTYPE IUF SYSTEM "iuf_full.dtd">
<IUF>
...
</IUF>',2) as iuf
)
SELECT n.c.value('(HEADER/@TABLE)[1]','varchar(max)') as table_name
  FROM data as d
       CROSS APPLY
       d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM') n(c)
 WHERE n.c.value('(HEADER/@TABLE)[1]','varchar(max)') ='LOCALITY_POLYGON'
GO
 
table_name
----------------
LOCALITY_POLYGON

2. Removing layers And Spaces from original document using XSLT
2.1. Removing layers

While filtering against the whole document is simple, the source document is very large (~360M) and can adversely affect the execution times for processing the XML.

Besides the Locality, LGA and Postcode layers, the VicMap Admin file also contains the Parish and Township layers which are less used than the former.

XML processing of the raw XML data includes these additional layers even if that processing is nothing more than filtering.

The unused layers can be removed by filtering using external (to SQL Server) tools.

2.2. Removing Spaces

The supplied XML data is “pretty printed”/Formatted for human consumption and, as we have seen, looks like this:

<?xml version="1.0"?>
<!DOCTYPE IUF SYSTEM "iuf_full.dtd">
<IUF>
   <INSERT>
      <POLYGON>
         <!--   SEED for POLYGON table LGA_POLYGON commencing   -->
         <POLYGON_NAME>
            <POLYGON_ITEM>
               <HEADER TABLE="LGA_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
                 <OUTER><XY>1,2</XY></OUTER>
            </POLYGON_ITEM>
         </POLYGON_NAME>
         <!--   SEED for POLYGON table LOCALITY_POLYGON commencing   -->
         <POLYGON_NAME>
            <POLYGON_ITEM>
               <HEADER TABLE="LOCALITY_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
                 <OUTER><XY>1,2</XY></OUTER>
            </POLYGON_ITEM>
         </POLYGON_NAME>
         <!--   SEED for POLYGON table PARISH_POLYGON commencing   -->
         <POLYGON_NAME>
            <POLYGON_ITEM>
               <HEADER TABLE="PARISH_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
                 <OUTER><XY>1,2</XY></OUTER>
            </POLYGON_ITEM>
         </POLYGON_NAME>
         <!--   SEED for POLYGON table POSTCODE_POLYGON commencing   -->
         <POLYGON_NAME>
            <POLYGON_ITEM>
               <HEADER TABLE="POSTCODE_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
                 <OUTER><XY>1,2</XY></OUTER>
            </POLYGON_ITEM>
         </POLYGON_NAME>
      </POLYGON>
   </INSERT>
</IUF>

Application XML processing does not need the spaces/tabs used in formatting the XML document.

How much can we save by removing the leading spaces so that the document looks like this:

<?xml version="1.0"?>
<!DOCTYPE IUF SYSTEM "iuf_full.dtd">
<IUF>
<INSERT>
<POLYGON>
<!--   SEED for POLYGON table LGA_POLYGON commencing   -->
<POLYGON_NAME>
<POLYGON_ITEM>
<HEADER TABLE="LGA_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
<OUTER><XY>1,2</XY></OUTER>
</POLYGON_ITEM>
</POLYGON_NAME>
<!--   SEED for POLYGON table LOCALITY_POLYGON commencing   -->
<POLYGON_NAME>
<POLYGON_ITEM>
<HEADER TABLE="LOCALITY_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
<OUTER><XY>1,2</XY></OUTER>
</POLYGON_ITEM>
</POLYGON_NAME>
<!--   SEED for POLYGON table PARISH_POLYGON commencing   -->
<POLYGON_NAME>
<POLYGON_ITEM>
<HEADER TABLE="PARISH_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
<OUTER><XY>1,2</XY></OUTER>
</POLYGON_ITEM>
</POLYGON_NAME>
<!--   SEED for POLYGON table POSTCODE_POLYGON commencing   -->
<POLYGON_NAME>
<POLYGON_ITEM>
<HEADER TABLE="POSTCODE_POLYGON" PFI="367" CREATED="20131024152212" UFI="468592502"></HEADER>
<OUTER><XY>1,2</XY></OUTER>
</POLYGON_ITEM>
</POLYGON_NAME>
</POLYGON>
</INSERT>
</IUF>

The following XSLT (StripAllSpaces.xslt) was used:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes" doctype-system="iuf_full.dtd" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <!--Identity transform (aka identity template).-->    
  <xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The Windows msxml.exe utility was used to execute the XSLT processing as follows (other Linux xml tools can be used):

msxsl all_seed_data_17-feb-2019.xml StripAllSpaces.xslt -xw -xe -t -o all_seed_data_17-feb-2019_spaces.xml
Microsoft (R) XSLT Processor Version 4.0

Source document load time:     36611 milliseconds
Stylesheet document load time: .293 milliseconds
Stylesheet compile time:       .114 milliseconds
Stylesheet execution time:     5511 milliseconds

The original file, all_seed_data_17-feb-2019.xml, was 342 MB (358,757,534 bytes) in size.

The resultant file, all_seed_data_17-feb-2019_spaces.xml, is 241 MB (252,810,486 bytes).

This is as saving of (358,757,534 – 252,810,486) = 105,947,048 which means the new document is 70.5 % of the original size.

Quite a substantial saving!

2.3. Removing Layers Using XSLT

Finally, we can reduce the size of the document further by removing unwanted (Township and Parish) layers.

The following XSLT script enables this pruning.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes" doctype-system="iuf_full.dtd" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>

  <!--Identity transform (aka identity template).-->    
  <xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!--templates to match and remove DESCRIPTION, PARISH_POLYGON and TOWNSHIP_POLYGON complete tags.-->    
  <xsl:template match="DESCRIPTION"/>
  <xsl:template match="*/POLYGON/POLYGON_NAME[POLYGON_ITEM/HEADER/@TABLE='PARISH_POLYGON' or POLYGON_ITEM/HEADER/@TABLE='TOWNSHIP_POLYGON']" />

</xsl:stylesheet>

Processing using msxml.exe is as follows:

msxsl all_seed_data_17-feb-2019.xml RemoveParishTownshipElements.xslt -xw -xe -t -o all_seed_data_17-feb-2019_thinned.xml

Microsoft (R) XSLT Processor Version 4.0

Source document load time:     41546 milliseconds
Stylesheet document load time: .332 milliseconds
Stylesheet compile time:       .174 milliseconds
Stylesheet execution time:     3601 milliseconds

The size of a complete seed document for VicMap Admin is 358,760,448 bytes.

If we remove the two unwanted layers and the leading spaces, the resultant file size is 144,052,224 bytes.

The difference is a saving of (358,760,448 – 144,052,224) = 214,708,224 bytes. So, the new file is ( 144,052,224 / 358,760,448 ) * 100.0 = 40.15 % of the original.

That is a huge saving!

The document all_seed_data_17-feb-2019.xml is a copy of the VicMap Admin data at a point in time. More recent files can be downloaded from the VicMap servers for free.

3. Processing the XML to produce a table

All the work done in Part 1 involved simple processing of “in line” XML, with the last example showing how to read from a file of a very small XML document(admin_seed_example.xml); neither included writing to a table.

The initial processing of the file with the two layers and spaces removed took quite a while to process (greater than 12 hours).

This was/is probably due to my XPATH processing of the XML document directly from the disk file using OPENROWSET, but also my inexperience with this XML/IUF document. It may also be that the wordiness of the coding of the coordinates () increases processing complexity and therefore time (see Part Three for an approach to handle this).

In the end the following recipe was followed:

  • 3.1 Create target table;
  • 3.2 Load the XML into a table from the disk file;
  • 3.3 Create XML Indexes on this table;
  • 3.4 Process the loaded XML one layer at a time to create an output table;
  • 3.5 Index and Visualise the results.

3.1 Create target table

-- Create LGA table
DROP TABLE IF EXISTS [vicmap].[admin_seed]
GO

CREATE TABLE [vicmap].[admin_seed] (
  id  int identity(1,1) not null,
  iuf xml
);
GO

3.2 Load the XML into a table from the disk file

INSERT INTO [vicmap].[admin_seed](iuf)  
SELECT CONVERT(xml, BulkColumn, 2) as iuf
  FROM OPENROWSET(BULK N'C:\Users\Simon\Downloads\all_seed_data_17-feb-2019\all_seed_data_17-feb-2019_thinned.xml',
                  SINGLE_BLOB
       ) AS x; 
GO

ALTER TABLE [vicmap].[admin_seed] ADD CONSTRAINT vicmap_admin_seed_pk PRIMARY KEY CLUSTERED (id);
GO

3.3 Create XML Indexes on this table

-- Create Primary XML Index
CREATE PRIMARY XML INDEX PXML_admin_seed_iuf  
    ON [vicmap].[admin_seed](iuf);  
GO  

-- Create Selective XML Index on table section
CREATE SELECTIVE XML INDEX SXI_Admin_Seed_IUF_ix  
    ON [vicmap].[admin_seed](iuf)
FOR  
(  
  pathHeader = '/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM/HEADER/@TABLE' AS XQUERY 'xs:string' SINGLETON
);

-- Test by checking explain plan
select COUNT(*) 
  from (select c.pfi, c.ufi,
	           CONVERT(datetime,
                       SUBSTRING(c.create_date, 1,4)+'-'+
                       SUBSTRING(c.create_date, 5,2)+'-'+
                       SUBSTRING(c.create_date, 7,2)+' '+
                       SUBSTRING(c.create_date, 9,2)+':'+
                       SUBSTRING(c.create_date,11,2)+':'+
                       SUBSTRING(c.create_date,13,2),
		        	   20) as create_date,
               c.polygon_item
          from (select n.c.value('(HEADER/@PFI)[1]',    'varchar(100)') as pfi,
                       n.c.value('(HEADER/@UFI)[1]',    'varchar(100)') as ufi,
                       n.c.value('(HEADER/@CREATED)[1]','varchar(100)') as create_date,
        	           n.c.query('.') as polygon_item
                  from [vicmap].[admin_seed] as d
                       CROSS APPLY 
                       d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]') n(c)
                ) as c
       ) as f;
GO

Examination of the plan shows that the primary and selective indexes are being used. The plan may look like this.

Indexed Query Execution Plan

3.4 Process the loaded XML one layer at a time to create the LGA table

The processing of the [vicmap].[admin_seed] document to produce a final LGA (Locality SQL is available on request) table can be achieved as follows.

DROP TABLE [vicmap].[lga];
GO

CREATE TABLE [vicmap].[lga] (
  id                    int IDENTITY(1,1) not null,
  pfi                   varchar(50),
  ufi                   varchar(50),
  create_date           varchar(100),
  lga_code              varchar(100),
  lga_name              varchar(500),
  lga_official_name     varchar(500),
  gazettal_registration varchar(500),
  pfi_created           varchar(100),
  geog                  geography
);
GO

INSERT INTO [vicmap].[lga] WITH (TABLOCKX) ( 
       pfi, ufi, create_date, lga_code, lga_name, lga_official_name, pfi_created, geog)
SELECT a.pfi, a.ufi, a.create_date, lga_code, lga_name, lga_official_name, pfi_created,
       geography::STGeomFromText(a.exterior_rings,4326)
           .STSymDifference( 
              geography::STGeomFromText(COALESCE(b.interior_rings,'POLYGON EMPTY'),4326)
       ) as geog
  from (select g.pfi, g.ufi, g.create_date, 
               case when count(*) = 1 
                    then 'POLYGON('      + STRING_AGG(    g.exterior_ring,',')     + ')'
                    else 'MULTIPOLYGON(' + STRING_AGG('('+g.exterior_ring+')',',') + ')'
                end as exterior_rings
          from (select f.pfi, f.ufi, f.create_date,  '(' +  (STRING_AGG(f.XY,',') WITHIN GROUP (ORDER BY f.xy_position ASC)) + ')' as exterior_ring
                  from (select n.c.value('(HEADER/@PFI)[1]',    'varchar(100)')                as pfi,
                               n.c.value('(HEADER/@UFI)[1]',    'varchar(100)')                as ufi,
                               n.c.value('(HEADER/@CREATED)[1]','varchar(100)')                as create_date,
                               o.i.value('(local-name(.))',     'varchar(100)')                as coord_tag,
                               o.i.value('for $i in . return count(../*[. << $i]) + 1','int')  as ring_position,
                               x.y.value('for $i in . return count(../*[. << $i]) + 1', 'int') as xy_position,
                               REPLACE(x.y.value('(text())[1]', 'varchar(max)'),',',' ')       as XY
                          from [vicmap].[admin_seed] as d
                               CROSS APPLY 
                               d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]') n(c)
                               CROSS APPLY
                               n.c.nodes('OUTER') o(i)  /* Extract OUTER node to enable counting */
                               CROSS APPLY
                               o.i.nodes('XY') x(y)     /* Extract XY coordinates for current OUTER */
                       ) as f
                 group by f.pfi, f.ufi, f.create_date, f.ring_position
               ) as g
         group by g.pfi, g.ufi, g.create_date
       ) as a
       left outer join
       (select g.pfi, g.ufi,
               case when count(*) = 1 
                    then 'POLYGON('      + STRING_AGG(    g.interior_ring,',')     + ')'
                    else 'MULTIPOLYGON(' + STRING_AGG('('+g.interior_ring+')',',') + ')'
                end as interior_rings
          from (select f.pfi, f.ufi, f.coord_tag, f.ring_position, '('+(STRING_AGG(f.XY,',') WITHIN GROUP (ORDER BY f.xy_position DESC))+')' as interior_ring
                  from (select n.c.value('(HEADER/@PFI)[1]',    'varchar(100)')  as pfi,
                               n.c.value('(../HEADER/@UFI)[1]', 'varchar(100)')  as ufi,
                               o.i.value('(local-name(.))',     'varchar(100)')  as coord_tag,
                               o.i.value('for $i in . return count(../*[. << $i]) + 1','int')  as ring_position,
                               x.y.value('for $i in . return count(../*[. << $i]) + 1', 'int') as xy_position,
                               REPLACE(x.y.value('(text())[1]', 'varchar(max)'),',',' ')       as XY
                         from [vicmap].[admin_seed] as d
                               CROSS APPLY 
                               d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]') n(c)
                               CROSS APPLY
                               n.c.nodes('INNER') o(i)  /* Extract OUTER node to enable counting */
                               CROSS APPLY
                               o.i.nodes('XY') x(y)     /* Extract XY coordinates for current OUTER */
                        ) as f
                        group by f.pfi, f.ufi, f.coord_tag, f.ring_position
                ) as g
             group by g.pfi, g.ufi
       ) as b
       on (b.pfi = a.pfi and b.ufi = a.ufi)
      LEFT OUTER JOIN /* Attributes */
      (SELECT v.pfi,v.ufi,v.LGA_CODE,v.LGA_NAME,v.LGA_OFFICIAL_NAME,v.GAZETTAL_REGISTRATION,v.PFI_CREATED
         FROM (select f.pfi,f.ufi,f.attribute_name,f.attribute_value
                 from (select n.c.value('(../HEADER/@PFI)[1]',  'varchar(100)') as pfi,
                              n.c.value('(../HEADER/@UFI)[1]',  'varchar(100)') as ufi,
                              SUBSTRING(n.c.value('(text())[1]','varchar(max)'),
                                        1,
                                        CHARINDEX('=',n.c.value('(text())[1]', 'varchar(max)'))-1
                              ) as attribute_name,
                              REPLACE(
                                SUBSTRING(n.c.value('(text())[1]', 'varchar(max)'),
                                          CHARINDEX('=',n.c.value('(text())[1]', 'varchar(max)'))+1,
                                          1000),
                                '"',
                                ''
                              ) as attribute_value
                         from [vicmap].[admin_seed] as d
                              CROSS APPLY 
                              d.iuf.nodes('/IUF/INSERT/POLYGON/POLYGON_NAME/POLYGON_ITEM[HEADER/@TABLE="LGA_POLYGON"]/ATT') n(c)
                      ) as f
              ) AS T
         PIVOT (
             max(attribute_value)
             FOR attribute_name in ( LGA_CODE, LGA_NAME, LGA_OFFICIAL_NAME, GAZETTAL_REGISTRATION, PFI_CREATED )
         ) as v
      ) as c
      ON (c.pfi = a.pfi and c.ufi = a.ufi );
GO
-- 87 rows in 2 hours 23 minutes

-- Check if resultant polygons are valid
SELECT l.geog.IsValidDetailed() as vReason, count(*) as eCount
  FROM vicmap.lga as l 
 GROUP BY l.geog.IsValidDetailed();
GO

-- vReason      eCount
-- 24400: Valid 87

3.5 Index and Visualise the results.

-- Primary key on ID
ALTER TABLE [vicmap].[lga] ADD CONSTRAINT lga_pk PRIMARY KEY CLUSTERED (id);
GO

-- Unique index on PFI
CREATE UNIQUE INDEX lga_pfi_ix ON [vicmap].[lga](PFI);
GO

 -- Create Spatial Index
CREATE SPATIAL INDEX [lga_geog_spx] 
    ON [vicmap].[lga] ([geog]) 
 USING  GEOGRAPHY_AUTO_GRID 
  WITH (CELLS_PER_OBJECT = 16, PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
    ON [PRIMARY]
GO

-- Visualise using SSMS
SELECT * FROM [vicmap].[lga] as l;
GO

This resulting image of all the LGA data is as follows:

Victorian Local Government Areas

In Part 3 we will look at preprocessing of the XML file to improve performance and show how to load the data using XQUERY rather than XPATH.