GIS software and Database Primary Keys

There has been some discussion over on the PostGIS email list about the use of primary keys on tables by GIS client software such as uDig, QGis, MapInfo etc. It is a discussion that comes up now and again (perennial, I suppose) especially when new users of PostGIS (or Oracle on the Oracle forum etc) start asking questions.

So, what are primary keys, why are they important to database tables and GIS software, and what limitations exist?

What is a primary key and its importance to a database table?
Firstly, we need to be aware of the notion of a candidate key.

A candidate key is a combination of attributes [When multiple attributes are used in a candidate key, it is also called a composite key] that can be uniquely used to identify a database record without any extraneous data. Each table may have one or more candidate keys.

In addition, we note that a candidate key:

… can either be part of the actual record itself [eg the unique attribute triplet – “customer_name,date_of_birth,sex”] or it can be a single artificial field (one that has nothing to do with the actual record eg object_id).

So, a table can have many candidate keys, composed of one or more attributes, and but that all have to be unique.

Finally, from a list of unique candidate keys for a table ONE is selected to be the table’s primary key. So, depending on design.

… a table may have arbitrarily many unique keys but at most one primary key.

Example of Defining a Table with a Primary and Unique Key
Here is a PostGIS example of how to define a table with geometry data, a single primary key (on the column id) and an alternate unique (candidate) key (on column parkname via a UNIQUE constraint).

 

 DROP TABLE parks;
 Query returned successfully with no result in 62 ms.

 CREATE TABLE parks
 (
   id       serial NOT NULL,
   parkname character varying(100),
   park_type character varying(1),
   suburb_id integer,
   lastinsp date,
   geom geometry,
   CONSTRAINT parks_pkey PRIMARY KEY (id),
   CONSTRAINT parkname_ukey UNIQUE (parkname),
   CONSTRAINT enforce_dims_geom CHECK (ndims(geom) = 2),
   CONSTRAINT enforce_geotype_geom CHECK (geometrytype(geom) = 'MULTIPOLYGON'::text OR geom IS NULL)
 )
 WITH (
   OIDS=FALSE
 );
 NOTICE:  CREATE TABLE will create implicit sequence "parks_id_seq" for serial column "parks.id"
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "parks_pkey" for table "parks"
 NOTICE:  CREATE TABLE / UNIQUE will create implicit index "parkname_ukey" for table "parks"
 Query returned successfully with no result in 250 ms.

 -- Since we have declared the geometry CHECK constraints ourselves, we can insert the spatial metadata into the geometry_columns table rather than use the AddGeometryColumn function.
 --
 insert into public.geometry_columns (f_table_catalog,f_table_schema,f_table_name,f_geometry_column,coord_dimension,srid,type)
 values (' ','public','parks','geom',2,28355,'POLYGON');
 Query returned successfully: one row with OID 395776 inserted, 63 ms execution time.

This table can now be opened in most GIS systems. If, however, instead of using a meaningless valued column, id, we had chosen to use just the parkname column as the primay key, no GIS client package around today would be able to work with this table even though the park name values held in parkname are actually unique.

 -- Executing query:
 DROP TABLE parks;
 Query returned successfully with no result in 16 ms.

 CREATE TABLE parks
 (
   parkname character varying(100),
   park_type character varying(1),
   suburb_id integer,
   lastinsp date,
   geom geometry,
   CONSTRAINT parks_pkey PRIMARY KEY (parkname),
   CONSTRAINT enforce_dims_geom CHECK (ndims(geom) = 2),
   CONSTRAINT enforce_geotype_geom CHECK (geometrytype(geom) = 'MULTIPOLYGON'::text OR geom IS NULL)
 )
 WITH (
   OIDS=FALSE
 );
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "parks_pkey" for table "parks"
 Query returned successfully with no result in 141 ms.

Finally, what if the table had two or more columns in the primary key? The table would also not be able to be dealt with by most/all GIS software around today. Here is an example of a table with a two column primary key.

 DROP TABLE cities;
 Query returned successfully with no result in 16 ms.

 CREATE TABLE cities (
     name            text,
     population      float,
     altitude        int,
     state           char(2),
     geom geometry,
   CONSTRAINT cities_pkey PRIMARY KEY (name,state),
   CONSTRAINT enforce_dims_geom CHECK (ndims(geom) = 2),
   CONSTRAINT enforce_geotype_geom CHECK (geometrytype(geom) = 'POINT'::text OR geom IS NULL)
 )
 WITH (
   OIDS=FALSE
 );
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "cities_pkey" for table "cities"

 Query returned successfully with no result in 110 ms.

The only way we can work with this sort of table is to add a unique integer field (in the table following I have called it gid as in geographic-id) to them and hope that the GIS will allow you to work with the defined, unique field. (For example, ArcSDE would allow this as it simply requires a UNIQUE index on a the integer field, but MapInfo, which requires Primary Keys, may not – I do not have a copy of MapInfo to check this.)

 DROP TABLE cities;
 Query returned successfully with no result in 0 ms.

 CREATE TABLE cities (
     name            text,
     population      float,
     altitude        int,
     state           char(2),
     geom geometry,
     gid  serial not null,
   CONSTRAINT cities_pkey PRIMARY KEY (name,state),
   CONSTRAINT cities_gid UNIQUE (gid),
   CONSTRAINT enforce_dims_geom CHECK (ndims(geom) = 2),
   CONSTRAINT enforce_geotype_geom CHECK (geometrytype(geom) = 'POINT'::text OR geom IS NULL)
 )
 WITH (
   OIDS=FALSE
 );
 NOTICE:  CREATE TABLE will create implicit sequence "cities_gid_seq" for serial column "cities.gid"
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "cities_pkey" for table "cities"
 NOTICE:  CREATE TABLE / UNIQUE will create implicit index "cities_gid" for table "cities"
 Query returned successfully with no result in 110 ms.

This discussion applies to Oracle, SQL Server, MySQL etc. The limitation is not in the database product but, as is usually the case, in the GIS product.

I hope this is of use to someone.