Usefulness of Spatial Metadata as a Foundation for an Australian data.gov and other uses

Disclaimer

What follows is a personal opinion based on experience. While many statements appear critical of the people who create and administer spatial metadata in Australia and overseas, that is not my intention: we all do what we can given time, staff constraints, and education and training background. Some assertions may appear to be binary: that is, I am right and you are wrong. To get attention paid to an issue it is sometimes necessary to use non-passive sentence structure and more polemical arguments. I hope that the way I write doesn’t cloud the issues for those who read this article. Some statements about particular datasets may be out of date, but that is of no particular importance: the statements highlight issues that were, and still are, real problems with the assumption that spatial metadata is great and a solid basis for search engines.

Introduction

This article is written in response to:

Professor Brian Fitzgerald’s blog article on an Australian version of Data.gov;
The submissions by spatial data “experts” to the Victorian Parliament Economic Development and Infrastructure Committee’s Inquiry into Improving Access to Victorian Public Sector Information and Data;
The thread on the LinkedIn group “Geospatial Data Integration” called “What about Spatial Data Search engines?” written by Alan Keown;
My own work on building an active data/metadata search engine for a customer.

In writing this blog comment I hope to explain why I think a common assumption made by spatial data experts is inaccurate and fanciful: namely, that the spatial data collected and “documented” by Government Departments (at all levels) in Australia (and probably overseas) can, or will, provide a good platform for building accurate data discovery search engines such as the US Government’s data.gov website.

The experience and perspective I bring to this debate is three-fold:

As an ex-GIS Manager needing to provide access to descriptions of internally and externally sourced datasets within an organisation’s information systems (internally, data needed documenting in an appropriate, value-adding manner);
As a consultant who is often asked how to find spatial data within Australia (never mind how to manage it and create search engines for it);
As a computer scientist specialising in database management and the integration of spatial data and processing within the existing IT infrastructure within organisations.

Let’s start with externally (eg Government) sourced data and its “documentation”.

Ben Searle, in his comment on Prof Fitzgerald’s article, says:

In simple terms this relates to the description of the data sets in a standard manner that enables structures searches, and helps the user to understand what the data set can be used for. The Australian spatial community has been working on this approach for a number of years and has been very successful in developing this type of capability.

Yes, it is true that metadata generation has been a focus of Australia’s spatial data communities for a number of years, but I would argue that the quality of this documentation, from the perspective of its fitness for purpose for the consumer, is on the whole poor. As a practitioner, I would contest the assertion that this capability has been developed successfully. In my opinion, the metadata is not as good as public statements such as Ben’s assert.

Within the context of the wider computing world, the spatial community’s concentration on simplistic, external metadata statements could never be assessed as best practice.

I would also assert that it is very difficult to quantitatively assess the quality of a dataset through the current “standardised” documentation available for nearly all publicly available datasets. It is only when the actual data arrives and is accessed that the real questions get asked: it is here that metadata is shown to be inadequate to the task.

Aspects

There are three aspects I will cover that help determine the quality of metadata:

How up to date is the metadata?
What context surrounds the data (no data is an island unto itself)?
How accurately does the metadata describe those aspects of the data that the consumer requires?

Up to Date Metadata

Whether metadata is up to date itself has a number of aspects.

A critical one is that the type and source of data often affects the metadata quality.

If the data is the result of a scientific endeavour in which data is collected for a single project, is subject to quantitative analysis, and then finally contributed to a metadata registry such as the ASDI or like those of the Australian Antarctic Division’s Data Center then, generally, the metadata associated with these data are up to date and, at this level, comprehensive.

Scientists wrapping up a strongly statistical/quantitative, self-contained project (often consisting of a single dataset) is one thing; documenting even simple datasets that undergo constant update and modification is another. For the latter, the quality of both the data and the metadata is generally poor because of how the data is managed, who manages it, and how all the aspects of a dataset’s metadata are managed.

For example, when I was a GIS Manager, the threatened species data controlled by a government Department in Tasmania, which I was responsible for making available for the management of these species in forest operations, was provided in pretty much the same spreadsheet it was managed in. That spreadsheet was full of data errors that rendered a lot of the data unusable: species longitude/latitude references put animals that couldn’t swim in the middle of the ocean because of the usual transposition of digits in a number. Yet if all you had to rely on was the metadata, you would think that no problems existed! (This data is now under firm control inside an Oracle database.)
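Errors of this kind are cheap to detect once the data is in a database. The following is only a minimal sketch: the table name `threatened_species_obs`, its columns, and the bounding box (a rough approximation of Tasmania chosen for illustration) are all assumptions, not the actual schema of the dataset described above.

```sql
-- Hypothetical table and column names.
-- The bounding box is a coarse, illustrative approximation of Tasmania.
SELECT species_code,
       longitude,
       latitude
  FROM threatened_species_obs
 WHERE longitude IS NULL
    OR latitude  IS NULL
    OR longitude NOT BETWEEN 143.5 AND 149.0
    OR latitude  NOT BETWEEN -44.0 AND -39.0;
```

A box test like this only catches gross errors such as transposed digits that throw a point far offshore. Finding points that fall in the sea but inside the box would need a spatial test against a coastline polygon.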

Let’s take a look at another of the more important datasets in the state of Tasmania (my state): the road network. This is the publicly accessible metadata statement (in the standard industry format that Ben Searle alludes to in his comment on Prof Fitzgerald’s article): LIST Transport Segment

This metadata statement is woefully inaccurate. For instance, the data’s storage format is not “Digital – ESRI ArcInfo Grid file”. It is stored in Oracle, in ESRI’s SDEBINARY format, accessed only by ESRI ArcSDE compliant software. But really, who cares what it is stored in? (One should prefer data managed in databases, but some vendors and application designers are still capable of building exceedingly poorly defined database models.)

Another aspect of this metadata is that its positional accuracy statement is mostly incorrect. The document says nothing about how the data that goes into the dataset is collected. For example, what GPS data capture standards are employed when capturing the data? And how is the new data integrated into the old so that the data quality improves over time? Is the editing being done by people who just adjust, extend and snip so that it “looks right”? Or is the data being integrated using proper adjustment methods such as “least squares”?

I can tell you that, with this transportation dataset, even intersections accurately defined using sub-meter GPS and documented techniques were still subjected to the “looks right”/“least effort” approach to editing.

But they don’t tell you that in the metadata’s “positional accuracy” or “lineage” sections, do they!

Attribution and Data Structure

The LIST’s Transport metadata also gives no indication as to attribution. For example, does the dataset have attributes for:

Road Name (whose?);
Type of road (dual carriageway etc);
Source;
Road Classification (Tourism or some other form);
Speed;
Surface;
Private/Public/Open/Closed;
Traffic direction;
etc?

The only way to find out is to follow the non-hyperlinked “http://www.dpiw.tas.gov.au/lis/listdata.html” at the bottom of the page and then click on the Volume 2 Road Theme hyperlink to be taken to the document “The LIST Data Compilation Specifications for Data Sets Volume 2 The Road Theme, Draft Version 0.2” (a non-standardised document), which indicates that it was last updated at “13/10/99 10:12”. Eleven years old! (This database model has undergone changes since 1999!)

Complexity

Finally, most data is not created and managed in isolation. Most datasets being interchanged are part of more complex systems because they describe the real world via entities (possibly spatially described) in relationships with others. Complex data models consist of many tables involved in relationships and described by a myriad of reference, attribute or lookup tables. The current metadata and distribution frameworks are inadequate for communicating this: they are organised around the interchange of single, simple spatial datasets delivered in non-database, proprietary file formats such as shapefiles. The industry is slowly becoming aware of this complexity, as shown by software such as GeoServer supporting the delivery of Application Schemas. But even here the concentration is on read-only delivery and not interactive search and update.

The spatial industry’s narrow vertical market, and its need to sell product to a small professional base, seems to have ensured that its products do not align well with non-GIS products and solutions managing real world databases of which spatial is only ever a very small part.

Is Usefulness defined as where the “rubber hits the road”?

Before I discuss the “fitness for purpose” or what metadata an end-user expects, a bit of background.

I was once, for nearly eight years, a GIS Manager in a small forestry company. We had the usual (external) spatial metadata pushed down our throats as being “best practice” for metadata management, but the offerings of the standards bodies and GIS vendors don’t work in commercial entities. Nor do they work well with the sorts of complexities that come from data model context, i.e., a spatial dataset is often one part of the description of a corporate entity (e.g., multiple spatial columns may describe a single entity in a single database table): it does not exist on its own. The spatial description is not synonymous with the dataset. These offerings fail for a number of reasons:

Spatial metadata rarely solves a business problem for non-governmental companies and organisations that are about “reducing cost and maximising profitability”. If your charter isn’t to publish data then metadata as it currently stands is pretty useless;
Staff are too busy doing important day-to-day work. Filling in metadata (e.g., lineage) to reflect changes after the fact only produces wrong descriptions that are out of date even at the time they are written;
The main metadata end-users want is live metadata about real data quality and ownership: not someone’s assessment of what they think the quality is, and not an ownership that has no real responsibility attached to it. (This is not true of all end-users: project based scientists may need the metadata in making an initial assessment about whether to use a piece of data, but even then that assessment does not have access to all the facts.)

Real, active and up to date metadata describes humble things like an attribute named SLOPE describing a sewer pipe in a table. For this, the important questions are things like:

“What is the range of valid values?” and
“Does the data held against this attribute fit that range?”

For example, in a client’s database there was no SQL CHECK constraint giving me the metadata I wanted so I was forced to check the actual attribute:

 SELECT min(slope), max(slope) 
   FROM pipe;
 
 MIN(SLOPE) MAX(SLOPE)
 -787.763   21.509

Great, so I know the max and min, but what is the actual valid range? Is the valid range smaller than this (therefore some data is invalid) or larger (then all data must be correct)? More simply, what is the data definition of the SLOPE item? Is it floating point data? What is its precision (how many decimal places)? And, finally, what are the units of measure (percentage)?
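Some of these questions can be put to the database’s own catalog. As a minimal sketch, assuming the client’s database is Oracle and the table is the `PIPE` table queried above (both assumptions), the standard data dictionary view `USER_TAB_COLUMNS` reports the declared type, precision and scale of the column:

```sql
-- Assumes Oracle; PIPE and SLOPE are the names used in the query above.
-- DATA_PRECISION and DATA_SCALE are NULL when the column is declared
-- without them (e.g. plain NUMBER or FLOAT).
SELECT column_name,
       data_type,
       data_precision,
       data_scale,
       nullable
  FROM user_tab_columns
 WHERE table_name  = 'PIPE'
   AND column_name = 'SLOPE';
```

The catalog can answer the type and precision questions. It cannot answer the valid range or units of measure questions unless someone has recorded them as a constraint or a comment.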

If I checked the GIS metadata I would probably find no reference to this column at all because most GIS metadata doesn’t bother to describe the data structure of a table via attribute definitions (because GIS vendors push spatial data management external to the real business table).

Constrain the Model

What I have done in the past, and continue to do today, is something that is considered “best practice” in the data modelling and IT communities: create a self-referential data model by creating database constraints on all my columns.

 ALTER TABLE WATER_PIPES 
   ADD CONSTRAINT WATER_PIPES_SLOPE_CK 
       CHECK ( SLOPE BETWEEN -30 AND 30 );
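On a table that already holds data, a constraint like this will be rejected if existing rows break the rule. A minimal sketch of how one might deal with that, assuming Oracle and the same illustrative range of -30 to 30: first count the offending rows, then either fix them or, as an alternative to the statement above (not in addition to it), add the constraint so that it is enforced for new and changed rows only.

```sql
-- How many existing rows would violate the proposed rule?
SELECT COUNT(*) AS out_of_range
  FROM water_pipes
 WHERE slope NOT BETWEEN -30 AND 30;

-- Oracle-specific alternative form of the constraint:
-- ENABLE NOVALIDATE checks new and updated rows but does not
-- validate the rows already in the table.
ALTER TABLE water_pipes
  ADD CONSTRAINT water_pipes_slope_ck
      CHECK ( slope BETWEEN -30 AND 30 )
      ENABLE NOVALIDATE;
```

The count is itself a piece of live metadata: it states how much of the column currently fails its own definition.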

One can also attach a comment to this attribute:

 COMMENT ON COLUMN water_pipes.slope 
      IS 'Describes the slope of a sewer pipe in terms of percentage rise over the length of the pipe.';

This constraint information is then stored in the metadata catalog that the database vendor provides. The international standard for this is the SQL92 INFORMATION_SCHEMA.
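As a minimal sketch of reading that information back, assuming Oracle (which exposes its catalog through `USER_*` dictionary views and does not implement INFORMATION_SCHEMA) and the `WATER_PIPES` table above:

```sql
-- Check constraints and column comments for WATER_PIPES.
-- constraint_type = 'C' also returns NOT NULL constraints.
-- SEARCH_CONDITION is a LONG column, so it can be selected but not
-- filtered on in the WHERE clause.
SELECT c.constraint_name,
       cc.column_name,
       c.search_condition,
       m.comments
  FROM user_constraints c
  JOIN user_cons_columns cc
    ON cc.constraint_name = c.constraint_name
  LEFT JOIN user_col_comments m
    ON m.table_name  = cc.table_name
   AND m.column_name = cc.column_name
 WHERE c.table_name      = 'WATER_PIPES'
   AND c.constraint_type = 'C';
```

On databases that do implement the standard, the equivalent information is exposed through views such as `INFORMATION_SCHEMA.CHECK_CONSTRAINTS`.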

The model (and data) thus becomes constrained in a manner that is independent of any client application (or editing system) but with active metadata that is available to any client. For example, constraint and comment information like that above, when created, can be accessed by interactive web pages that access this metadata to provide important search and reporting facilities. Then, when a user clicks on a pipe that person can request active and up to date information about the data and its quality.

Coupling this with a dynamic textual and spatial data search facility that can access the actual, dynamically updated data creates the sort of rich, accurate platform needed for building data access engines. An example of a dynamic spatial search capability is the fact that Oracle Spatial’s RTree index holds the Minimum Bounding Rectangle (MBR) of all indexed data in its root node. This MBR is actually held as an sdo_geometry object that can itself be mapped. You cannot get more up to date than that! An example of an attribute data search capability that can be coupled to spatial data search functionality is described later.
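One documented way to ask Oracle Spatial for the current extent of a layer is the `SDO_TUNE.EXTENT_OF` function, which returns the MBR as an sdo_geometry. This is a minimal sketch: the geometry column name `GEOM` is an assumption, and whether the function reads the index root or scans the data depends on the Oracle version and its options.

```sql
-- Returns the minimum bounding rectangle of all geometries in
-- WATER_PIPES.GEOM as an SDO_GEOMETRY (GEOM is a hypothetical column name).
SELECT SDO_TUNE.EXTENT_OF('WATER_PIPES', 'GEOM') AS layer_extent
  FROM dual;
```

Because the result is computed from the live table and index, it never needs to be typed into a metadata form and so cannot drift out of date.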

Responsibility and Metadata

In addition, the foresters I provided systems for wanted to know, for any “theme” being mapped, who was responsible for its definition and application within the business process. So, we added to our metadata enquiry system attributes that recorded who to contact (name, email, phone number etc) because the GIS Manager isn’t responsible for the definition, edit and application of a dataset to solve a specific business problem.

In the end, I discovered that the majority of my “externally imposed” spatial data metadata responsibilities could be solved by auto-generating in excess of 90% of the simplistic metadata they required dynamically and using free tools.
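To give a flavour of what auto-generation can look like, here is a minimal sketch that assumes an Oracle Spatial database. It is not the tooling I used at the time. It pulls the layer name, coordinate system, declared extent, tolerance and table comment straight from the catalog, which covers several of the fields a typical spatial metadata form asks for.

```sql
-- One row per ordinate (X, Y, ...) of each registered spatial column.
-- USER_SDO_GEOM_METADATA and USER_TAB_COMMENTS are standard Oracle views.
SELECT m.table_name,
       m.column_name,
       m.srid,
       d.sdo_dimname   AS ordinate,
       d.sdo_lb        AS lower_bound,
       d.sdo_ub        AS upper_bound,
       d.sdo_tolerance AS tolerance,
       c.comments      AS abstract
  FROM user_sdo_geom_metadata m,
       TABLE(m.diminfo) d,
       user_tab_comments c
 WHERE c.table_name = m.table_name
 ORDER BY m.table_name, m.column_name;
```

The `abstract` column is only as good as the `COMMENT ON TABLE` statements that have been written, which is another argument for documenting the model inside the database.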

Active Metadata – An example

See my article called New Presentation on Active Spatial metadata

In essence, I built a client-neutral data/metadata search facility for a customer using only SQL Server 2008. The components were:

My own implementation of the OGC GEOMETRY_COLUMNS metadata table with a bunch of T-SQL procedures to actively/passively populate the table (BTW this table made ogr2ogr work properly even though it is not built to access the .NET based geometry/geography data);
SQL Server 2008’s Geography and Geometry data types;
SQL Server 2008’s FULL-TEXT indexing engine;
The Deep Earth open source GIS Silverlight client (I did not write this aspect of the system).

The GEOMETRY_COLUMNS table provided the main spatial metadata table. The FULL-TEXT indexing engine provided direct access to both the GEOMETRY_COLUMNS table and the actual attribute data in the 700+ tables in the database.

The solution allowed a customer to initiate a search from Deep Earth, which sent the spatial extent of the user’s current map view to a search engine along with a FULL-TEXT search string that the Microsoft search engine could understand. The application first filtered the 700+ tables by spatial extent (held in the GEOMETRY_COLUMNS metadata) before searching the candidate tables’ attribute data using the search criteria. The mappable layers that matched were then returned to the Deep Earth client for display. An example question might be: “Is there anything with the name ‘PVC’ and ‘150’ in size anywhere within my current extent?”
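The two-stage search can be illustrated with a short T-SQL sketch. This is not the customer’s code. The `extent` column on `geometry_columns`, the `dbo.water_pipes` table, its `geom` column, the coordinates and the SRID of 0 are all assumptions made for illustration, and the second query presumes a full-text index already exists on the table.

```sql
-- The user's current map view as a geometry (hypothetical coordinates, SRID 0).
DECLARE @view geometry = geometry::STGeomFromText(
  'POLYGON((520000 5250000, 530000 5250000, 530000 5260000,
            520000 5260000, 520000 5250000))', 0);

-- Stage 1: find candidate layers whose stored extent overlaps the view.
-- "extent" is an assumed extra geometry column, not part of the OGC table.
SELECT gc.f_table_schema,
       gc.f_table_name,
       gc.f_geometry_column
  FROM dbo.geometry_columns AS gc
 WHERE gc.extent.STIntersects(@view) = 1;

-- Stage 2: for one candidate table, combine the spatial filter with a
-- full-text search over all full-text indexed columns.
SELECT p.*
  FROM dbo.water_pipes AS p
 WHERE p.geom.STIntersects(@view) = 1
   AND CONTAINS(*, '"PVC" AND "150"');
```

In a real system the second statement would be generated dynamically for each candidate table returned by the first. Note that `STIntersects` returns NULL when the two geometries have different SRIDs, so the view geometry must be built with the layer’s SRID.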

While the customer used Deep Earth as its query and display client, the solution can be used by any client capable of executing a T-SQL procedure.

There was surprisingly little T-SQL written. The main search function is only 120 lines of code. The function that builds a full text index on all columns in a table is 200 lines of code (though a lot of that is discovering and building a list of indexable columns which, on first pass, is then written to a configuration table so that an administrator may remove columns before the next automated index run occurs). The search was fast enough that users did not worry that their search was not being processed: response times were of the order of a few seconds.

The Microsoft FULL-TEXT search engine is fast, flexible and powerful (more powerful than what one can ask of Google). Most database vendors offer similar search engines inside the database (e.g., Oracle Text).

Data Exchange

What I find telling is the way the GIS industry ships data around. Shipping me data in a shapefile with a single, badly filled in, page of metadata (at least it is machine readable in XML! Wow!) is an example of the pathetic state of GIS metadata and data interchange. It is a 1960s form of data management (flat files) that really has no place in modern computing. There is nothing in the shapefile’s dBASE file format that allows for the independent description of data attributes (as one can do with a database constraint).

As a computer scientist who specializes in database technologies and the integration of spatial processing within IT technologies and frameworks (instead of imposing “GIS” solutions from vendors) I am interested in solutions that fully integrate with the data tier of an organization’s data because that is where quality data management occurs. It is also where ALL the organization’s data resides. And that can only be done by creating fully documented and constrained data models (and then using that metadata in dynamic, query-able systems).

However, even in the IT industry itself, the creation of fully documented, fully described, application independent database data models is still not universal. But without it there is no transparency in data management. Asserting that one doesn’t need check constraints because the applications implement all the necessary business rules does two things: 1) it locks business rules up inside application code, reducing accessibility and transparency; 2) it makes data quality services accessible only to the application that implements the rules. Rarely is the programmers’ assertion that their applications produce quality data tested by applying a few declarative data quality statements like CHECK constraints.

Food for thought or useless, overly technical rambling? Your comment.

Simon

The Spatial Database Advisor
