A Bit About Database Keys, Part 1: Natural vs. Surrogate Keys

Keys are values that uniquely identify a single row in a database table.  They are necessary because it is highly possible that a table will have two rows that contain identical data but represent two separate entities in the real world.

Let’s say we wanted to create a table in PostgreSQL to store information about our friends.  The schema for this table might be created with the following code:

CREATE TABLE friends(
first_name text,
last_name text
);

This table has two text columns, one for first name and one for last name.  Now let’s populate our table with some data:

INSERT INTO friends (first_name, last_name)
VALUES ('Monica', 'Geller'),
('Ross', 'Geller'),
('Phoebe', 'Buffet'),
('Chandler','Bing'),
('Joey', 'Tribbiani'),
('Rachel', 'Green');
INSERT 0 6

Our friends table now looks like this:

SELECT * FROM friends;
 first_name | last_name 
------------+-----------
 Monica     | Geller
 Ross       | Geller
 Phoebe     | Buffet
 Chandler   | Bing
 Joey       | Tribbiani
 Rachel     | Green
(6 rows)

Natural Keys

Now we need a way to identify each row. We may want to add additional columns to store additional attributes about our friends, and to do this we need to be able to access each row individually.

Let’s add a column for hair color:

ALTER TABLE friends
ADD COLUMN hair_color text;

We would like to store the value ‘blonde’ in the hair_color column for our friend Phoebe. Looking at the data in our table so far, every individual listed has a different first name, so this value can be used as a unique identifier for a column.

This is an example of a natural key, in that the value we are using as a key for the table is already part of our data.

So let’s use the natural key of first_name to update Phoebe’s row:

UPDATE friends SET hair_color = 'blonde'
WHERE first_name = 'Phoebe';
UPDATE 1
SELECT * FROM friends;

 first_name | last_name | hair_color 
------------+-----------+------------
 Monica     | Geller    | 
 Ross       | Geller    | 
 Chandler   | Bing      | 
 Joey       | Tribbiani | 
 Rachel     | Green     | 
 Phoebe     | Buffet    | blonde
(6 rows)

So far, this is working just fine. However, what if we meet more friends and decide to include their data in the table as well?

INSERT INTO friends (first_name, last_name)
VALUES ('Monica', 'Lewinsky'),
('Al', 'Green'),
('Joey', 'Kangaroo');
INSERT 0 3

SELECT * FROM friends;

 first_name | last_name | hair_color 
------------+-----------+------------
 Monica     | Geller    | 
 Ross       | Geller    | 
 Chandler   | Bing      | 
 Joey       | Tribbiani | 
 Rachel     | Green     | 
 Phoebe     | Buffet    | blonde
 Monica     | Lewinsky  | 
 Al         | Green     | 
 Joey       | Kangaroo  | 
(9 rows)

Now we want to update the table so that Joey Kangaroo’s hair color is orange. How can we access that row? first_name is no longer a unique column, and neither is last_name.

As the table stands, we could use a combination of first and last name, as so far each combination is unique. But what happens if we meet someone with a common name, like John Smith, and then meet another John Smith? We will have two identical first name/last name combinations. We would have to add yet another column, like birthday. But in a world of nearly 8 billion people, eventually that could be duplicated as well.

Surrogate Keys

To solve this problem, we can create another column that contains a unique identifier that we specify. This identifier wasn’t originally part of the data set; it is merely used to identify rows in the table. This way, we can ensure that this key will always be unique. What’s more, it will never change, as names or other real-world data could.  This is called a surrogate key.

Surrogate keys are usually automatically-incrementing integers, which can be created in PostgreSQL with the data type serial.

ALTER TABLE friends
ADD COLUMN id serial PRIMARY KEY;

Now each row in the table has its own id, which can be used to access the row.

SELECT * FROM friends;
 first_name | last_name | hair_color | id 
------------+-----------+------------+----
 Monica     | Geller    |            |  1
 Ross       | Geller    |            |  2
 Chandler   | Bing      |            |  3
 Joey       | Tribbiani |            |  4
 Rachel     | Green     |            |  5
 Phoebe     | Buffet    | blonde     |  6
 Monica     | Lewinsky  |            |  7
 Al         | Green     |            |  8
 Joey       | Kangaroo  |            |  9
(9 rows)

Updating a single row is now much more straightforward:

UPDATE friends SET hair_color = 'orange'
WHERE id = 9;
UPDATE 1

SELECT * FROM friends;
 first_name | last_name | hair_color | id 
------------+-----------+------------+----
 Monica     | Geller    |            |  1
 Ross       | Geller    |            |  2
 Chandler   | Bing      |            |  3
 Joey       | Tribbiani |            |  4
 Rachel     | Green     |            |  5
 Phoebe     | Buffet    | blonde     |  6
 Monica     | Lewinsky  |            |  7
 Al         | Green     |            |  8
 Joey       | Kangaroo  | orange     |  9
(9 rows)

So, to summarize:

  • Keys uniquely identify rows in a table.
  • Natural keys are data that already exists (naturally) as part of a data set. Using them as row identifiers can cause problems because data that seems unique might not actually be so in the long run.
  • Surrogate keys are artificially created row identifiers that get around the problems associated with natural keys.

In Part 2, we’ll be looking at primary and foreign keys.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.