To Constrain or Not to Constrain: There should be NO Question
In my consulting travels I, like many other database-loving specialists, find data models with few constraints (a primary key, perhaps) or none at all, whatever the relational database managing them (Oracle, SQL Server, PostgreSQL, MySQL, etc.).
I recently finalised a report for a Government Department here in Australia where this was the case in their Oracle databases, so I included the following comment in my final report, expecting that it would be ignored:
The sole use of an [external application tool] to test database-based data for the adherence of simple database integrity constraints is not best practice.
I also recommended that database constraints (documented elsewhere) be applied to their data models.
I have made such observations and recommendations many times in the past with other clients, but such recommendations rarely get implemented. The reasons probably relate to database technology being treated as a "foreign tool" by all but the IT department's DBA or the application's data modeler. Certainly databases are not something IT programmers or GIS professionals understand well: if they have to work with a database, GIS professionals mostly seem to prefer to do so via some GIS tool like ArcCatalog. Whatever the reason, the database is the poor cousin in the data management stack of most applications.
While this article is, strictly speaking, not about GIS, some GIS products do what other commercial applications in the normal IT space do: that is, they deliberately do not implement constraints for their data models using the host database's constraint system. Rather, they prefer to implement their model constraints in their own database (metadata) tables and apply those constraints in their own middle- or client-tier technologies.
In all cases the assumption seems to be made that:
- The application is where data quality starts and finishes so let’s make sure it is OK.
- If we get the application right, the database will not need to be constrained.
But this doesn't justify a separate metadata system. Why not constrain the database's data model (say, using the ANSI standard INFORMATION_SCHEMA) and build application rules on the fly by "mining" this catalog?
(Sure, some things will be missing, e.g., spatial "referential integrity" constraints like "pipe falls at end of pipe", but it is only these that should be stored in custom metadata tables.)
Before I go any further: I don't like "either/or" arguments. Data quality is not a single process in which model-specific constraints are applied at only one level (e.g., the application). Data quality should be the goal of an application's whole technology stack.
But, having said this, almost all we do in scientific computing is collect data within one application but modify, manage, analyze and present it with other applications. The common foundation on which all application activity starts and ends is usually a single, independently quality-controlled database. If there is any tier in the stack that should be clearly and fully documented and constrained, it is the DATA TIER. Or, put another way:
Software comes and goes but the data and its structure is far more constant.
I will try and explain the reason for constraining from two perspectives:
- Relational Database Theory
- Database tuning and performance
1. Relational Database Theory
Database theory requires that databases should be ‘self-referential’.
What is meant by ‘self-referential’ is that an application should not be required to impart meaning to the data being stored and managed. To this end all commercially available database software such as SQL Server and Oracle include a system ‘catalog’ that controls and documents all data that is stored within them. This metadata documents all tables stored in the database, their column names and types, the valid data values they may contain, any indexes that may be defined on sets of columns and the relationships between tables. The catalog also controls access to these tables by applications and users.
This approach to database management removes the need to ask a programmer to access application source code to answer basic end-user questions relating to database structure, quality and integrity.
Database metadata is both descriptive and prescriptive. Thus the definition of a 'rule' controlling data values in a specific table/column (e.g., a column holding a percentage should always fall 'BETWEEN 0 AND 100') succinctly describes data quality but, once applied, will also stop any other values from being entered.
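The descriptive-and-prescriptive point can be demonstrated with a small sketch. Here SQLite is used purely as a convenient, runnable stand-in for any of the databases discussed (the table and column names are invented for illustration); the same CHECK clause works, with minor syntax differences, in Oracle, SQL Server, PostgreSQL and MySQL:

```python
import sqlite3

# The CHECK constraint both documents the rule ("completion is a
# percentage") and enforces it: no application can bypass it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_result (
        id         INTEGER PRIMARY KEY,
        completion INTEGER CHECK (completion BETWEEN 0 AND 100)
    )
""")

conn.execute("INSERT INTO survey_result VALUES (1, 75)")       # accepted

try:
    conn.execute("INSERT INTO survey_result VALUES (2, 150)")  # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the valid row reaches the table; the out-of-range value is stopped at the data tier regardless of which application issued the INSERT.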
This was particularly evident during a discussion I once had with staff in an organization about the range of correct values for a SURFACE_LEVEL column in a table called PITS. The range of values actually held by the column was far greater than staff expected. Examination of the associated DATUM column shed no light on the small number of exceedingly large values. In fact, the DATUM column contained mixed-case values (e.g., 'AHD' and 'ahd') and a variety of possibly identical references (is 'AGD/AHD' supposed to be the same as 'AHD' and 'ahd'?). These columns should have had constraints similar to the following defined and applied:
If all surface_levels were metric and between 0 and 500 metres:

```sql
ALTER TABLE pits
  ADD CONSTRAINT pits_surface_level_ck
      CHECK (surface_level BETWEEN 0 AND 500) ENABLE NOVALIDATE;
```

If all DATUM values were drawn from another table (called datum_lookup) acting as a lookup:

```sql
ALTER TABLE pits
  ADD CONSTRAINT datum_lookup_datum_fk
      FOREIGN KEY (datum) REFERENCES datum_lookup (datum);
```

Or, if there is no lookup table:

```sql
ALTER TABLE pits
  ADD CONSTRAINT datum_check
      CHECK (datum IN ('AHD','IMPERIAL')) ENABLE NOVALIDATE;
```

If surface_level's values can only be interpreted together with the datum column:

```sql
ALTER TABLE pits
  ADD CONSTRAINT surface_level_datum_ck
      CHECK (   (datum = 'AHD'      AND surface_level BETWEEN 0 AND 500)
             OR (datum = 'IMPERIAL' AND surface_level BETWEEN 0 AND 1500))
      ENABLE NOVALIDATE;
```
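The constraints above can be exercised end-to-end in a small runnable sketch. SQLite stands in for Oracle here (SQLite has no ENABLE NOVALIDATE, so all rows are checked; the PITS columns are simplified for illustration). Note how the foreign key to the lookup table also catches the mixed-case 'ahd' problem described earlier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datum_lookup (datum TEXT PRIMARY KEY);
    INSERT INTO datum_lookup VALUES ('AHD'), ('IMPERIAL');

    CREATE TABLE pits (
        id            INTEGER PRIMARY KEY,
        datum         TEXT REFERENCES datum_lookup (datum),
        surface_level REAL,
        CONSTRAINT surface_level_datum_ck CHECK (
            (datum = 'AHD'      AND surface_level BETWEEN 0 AND 500) OR
            (datum = 'IMPERIAL' AND surface_level BETWEEN 0 AND 1500)
        )
    );
""")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is opt-in in SQLite

conn.execute("INSERT INTO pits VALUES (1, 'AHD', 123.4)")        # passes

try:
    conn.execute("INSERT INTO pits VALUES (2, 'AHD', 1200.0)")   # out of range
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    conn.execute("INSERT INTO pits VALUES (3, 'ahd', 10.0)")     # mixed case
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the clean row survives; the out-of-range level and the mixed-case datum are both stopped at the database.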
Once constraints such as these have been defined, applications should use these rules when providing access to the database by 'mining' the metadata: this 'data-driven' approach to application development provides the most flexible method for building and deploying applications that use relational databases. It enables a 'Model Driven Architecture' (MDA) approach to software development that reduces application development time by removing the need for developers to implement basic 'rule-driven' data entry and validation.
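The 'mining' idea can be sketched concretely. SQLite exposes each table's full DDL, including its CHECK constraints, in its sqlite_master catalog; Oracle and SQL Server expose the same information through their own catalog views (e.g. the ANSI INFORMATION_SCHEMA on SQL Server). The table here is an invented illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pits (
        surface_level REAL,
        CONSTRAINT pits_surface_level_ck
            CHECK (surface_level BETWEEN 0 AND 500)
    )
""")

# "Mine" the catalog: the rule is recoverable from the database itself,
# with no need to read application source code.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'pits'"
).fetchone()[0]

print("pits_surface_level_ck" in ddl)  # the constraint is discoverable
```

An application could parse the recovered rule text to configure, say, a data-entry widget's valid range on the fly, rather than hard-coding the range a second time in the client tier.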
2. Database tuning and performance
I have often pointed out to clients that a fully constrained database not only provides the final gateway through which all data must pass, but is also a critical component in ensuring database query performance.
Normally, this is asserted to be a "database best practice". But such an assertion is hard to prove, as there is little in the way of a collated Database Book Of Knowledge (DBOK) on the Internet whose findings are based on empirical evidence. This may be because the complexities of individual data models make it hard to create standardized, model-independent use cases. Add to this the complexities of the different commercial databases' query optimizers and architectures. Finally, even if a database were constrained and those constraints improved database performance, who would know other than the DBA and maybe the odd programmer or user? Databases are the hidden, unsung heroes of any organization: no one blows their trumpet.
So, imagine my pleasure when I got back from some interstate consulting to find that the May/June 2009 edition of Oracle Magazine contained a great article on just this issue by Tom Kyte, called On Constraints, Metadata, and Truth.
If there is one person who, for Oracle professionals, provides a walking “book of database best practice” it is Tom Kyte. Tom always bases his recommendations on solid, empirical research. So, it is with pleasure that I let Tom conclude this article on how a constrained data model can aid database performance.
In case you think that constraints help performance only for Oracle, see Sarvesh Singh's article showing how Check and Foreign Key Constraint Improves Query Performance for SQL Server.
I hope someone finds this article useful.