Last week, I took advantage of a few precious days off to stretch my SQL chops and start designing a usable database. This presented a couple of challenges, the first of which was simply remembering the syntax for creating tables, columns of a specific type, etc. The second challenge was figuring out the multifaceted approach to maintaining data integrity. This latter piece involves judiciously-placed constraints on the data, ensuring that it will always make sense in a real world scenario (even without an application to interpret it). More on this later.
I’ve talked about the tabular format of a database before (like an Excel spreadsheet, remember?). What I avoided discussing earlier are the inherent limitations in modeling an arbitrary system as a series of tables. Individual entities fit in well — for instance, you can easily define a person by their most salient details: username, first name, last name, email address, etc. — but relationships among entities becomes more difficult. How would you denote that two people, as two separate rows in the Person table, are friends? There’s no place for that sort of detail in the Person table. I mean, sure, you could try to add a “friend” column, but that would take either only 1 entry (thus restricting every person to have only 1 friend) or an amalgam of all the friends’ names, which would be an unparsable blob of text.
No; it’s best to keep entities and relationships separate. So to implement the example above, you would create a a second table, Friends, in which each row contains the names of two people who are friends. This way, every person in the system has exactly one entry in the Person table, but a variable number of rows in the Friends table depending on how, uh, socially active they are. A misanthrope will not appear in any rows in the Friend table, whereas a social butterfly will have many, many entries.
For this to work, though, we do need to make sure that we can target entries in the entity (People) table from the relationship (Friend) table. To ensure that we’re accurate, we need a way to identify a person — some detail that is unique to them and will never have a NULL value. This is called the Primary Key. A full name won’t work (not unique: pity the John Smiths of the world); a username could be a suitable candidate; the best option, in my humble opinion, would be a a numerical id that is assigned when a person is created. If any new ID is an incrementation of the last one assigned, then all IDs will be unique and non-null. Furthermore, it’s unlikely that such a number would ever be changed later on, which saves us the trouble of propagating an entity update through all its relationships. So using this PK, our tables might look like this:
Person +--------------------------------+ | ID | name | email | +--------------------------------+ | 5 | Alice | firstname.lastname@example.org | | 8 | Bob | email@example.com | +--------------------------------+ Friends +---------------------------+ | friend_1_id | friend_2_id | +---------------------------+ | 5 | 8 | +---------------------------+
The row in the Friends table shows that the person with ID 5 is friends with the person with ID 8. Tie that back to the Person table, and you see that Alice and Bob are friends.
So, the Person(ID) column is a Primary Key, which becomes the single most identifying detail of any entry in that table — the detail by which any relationship can target that specific entry. But there’s something else going on here. You don’t want the columns Friends(friend_1_id) and Friends(friend_2_id) to be numbers that just happen to correspond to entries in Person(ID). If that were the case, they could easily be entered incorrectly; or the ID they refer to might get changed. Either way, you would end up with a data integrity issue: the information in your database would no longer reflect the state of the real world, because an entry in the Friends table won’t refer to an actual person, and two friends won’t be marked as such.
We want to maintain the data integrity of our database independent of any application code that runs on top of it, so we need to force the values in Friends(friend_1_id) and Friends(friend_2_id) to always refer to some value in Person(ID). We explicitly recognize that they are Foreign Keys, values that refer to entries in another table. We also tell the database what do to if an entry in the parent table is changed — for instance, if Bob’s id was changed to 9, or if he is removed altogether. The possible actions in response to such a change are RESTRICT/NO ACTION, which prevents the change from occurring; CASCADE, which forwards the update down to the child table; and SET NULL, which changes the modified/deleted FK’s value to NULL.
Ok, so now with a basic understanding of PKs and FKs, let’s put together a declaration for the tables outlined above. (On a technical note: it turns out that the default MySQL engine doesn’t support FKs, meaning you have to manually declare both the parent and child table to use the InnoDB engine.)
CREATE TABLE Person ( id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(32), email VARCHAR(32) ) ENGINE=INNODB; CREATE TABLE Friends ( friend_1_id INT, friend_2_id INT, FOREIGN KEY (friend_1_id) REFERENCES Person(id) ON UPDATE CASCADE ON DELETE CASCADE, FOREIGN KEY (friend_2_id) REFERENCES Person(id) ON UPDATE CASCADE ON DELETE CASCADE ) ENGINE=INNODB;
Breaking down the example: we first declare the Person table. The column ‘id’ is of type INT (integer) that automatically increments; it also happens to be the PK. The columns ‘name’ and ’email’ are then declared, each as a VARCHAR(32) — a character string of variable length, up to 32 characters. The table declaration is closed, and we specify that it should use the InnoDB engine.
We then go on to declare the Friends table. Both columns are of type INT, to match Person(id). We then say both are FKs, state which table(column) they reference, and decide what to do on update or delete. I selected CASCADE, which makes sense in both cases: if a Person’s ID is changed, we should update the Friends relationship with the new ID; if a Person is deleted entirely, it’s not possible for them to be friends with anyone else, so we CASCADE the delete to the Friends table.
Before signing off, I should point out that the Person/Friends example I’ve been working with is purposefully simplistic. I’m not implying this is how major social networks would implement such a relation; and I certainly wouldn’t recommend its use (my own work uses significantly more constraints to maintain its data integrity). I just wanted to illustrate basic table declarations along with the use of Primary and Foreign Keys.