Refactoring to Third Normal Form
We ended the last chapter with a simple set of tables and models that is seemingly impervious to invalid data. Through careful unit testing of both the data layer and the models in the application layer, we guaranteed that all references between tables will be valid, and that each individual column can contain only appropriate data.
It’s tempting at this point to leave the realm of data modeling and begin writing a front-end for the theatre tickets website. We can imagine additional requirements for even the first version of our site, though, such as saving orders, or knowing how many seats there are for sale in a given auditorium.
As secure as the physical layer we put together seems to be, the design itself is constricting. Features such as those just mentioned will be difficult to add in an elegant way. In this chapter, we will refactor the data model so that it is more open to future changes. First, the concept of third normal form (3NF) will be introduced. Applying 3NF will afford us the flexibility to add additional information to pieces of data that are bound currently, such as auditoriums and movie ratings. We will then add additional tables we know we will need relating to ticket orders; doing so presents additional opportunities for refactoring.
Third Normal Form
In database theory, there are numerous normalization patterns, most of which are numbered ordinally: non-first normal form (NF2), first normal form (1NF), second normal form (2NF), and so on. The study and formalization of normalization patterns has a set of overriding themes. The first is to guarantee the correctness of data across a series of insertions, updates, and deletes. The second is to facilitate the process of querying tables. There’s also a positive by-product: the higher the level of normalization, the less repeated data there is in the data model. In addition to saving space, avoiding repeated data prevents data corruption, because updating a value requires changing it in a single place rather than meticulously updating it in many places. All of these properties of normalization tend to make our lives as programmers much easier.
In this chapter, we will discuss third normal form, which is a sufficient target normalization to make our data model more open to future changes. The principle behind 3NF is that no columns in a table depend on any non-key columns.
Consider the movies table from the previous chapter, shown in Figure 6-1. The id column is the primary key. Technically, this table is 3NF, because no columns depend on any other column that is not the primary key. However, if we decided to add additional information about the rating, we would be in trouble.
Figure 6-2 shows the movies table extended to hold a description of the rating. With this schema, as soon as we have more than one movie with the same rating, we begin to have duplicated data in the table. Not only are we wasting space, but we now run the risk of having two movies that are rated R, but have different rating descriptions. If the descriptions happened to be different, how would we know which was the correct one?
The solution is to normalize the table on the rating column by creating a ratings table, and referencing that table with a rating_id column in the movies table, as shown in Figure 6-3. Now there is a single place where each rating’s description is stored, and 3NF is satisfied.
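The update anomaly, and how normalization removes it, can be sketched in plain Ruby. The sample data below is hypothetical, with hashes standing in for table rows:

```ruby
# Denormalized movies rows: the rating description is repeated per movie
movies = [
  { :title => 'Movie A', :rating => 'R', :rating_description => 'Restricted' },
  { :title => 'Movie B', :rating => 'R', :rating_description => 'Under 17 not admitted' }
]

# Two movies rated 'R' now disagree about what 'R' means
descriptions = movies.select { |m| m[:rating] == 'R' }.map { |m| m[:rating_description] }.uniq
puts descriptions.length  # 2 -- conflicting data the schema cannot prevent

# Normalized: each rating's description lives in exactly one row
ratings = { 'R' => 'Restricted' }
movies_3nf = [
  { :title => 'Movie A', :rating_id => 'R' },
  { :title => 'Movie B', :rating_id => 'R' }
]

# A single update now fixes the description for every 'R' movie at once
ratings['R'] = 'Under 17 requires accompanying adult'
puts movies_3nf.map { |m| ratings[m[:rating_id]] }.uniq.length  # 1
```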
In general, good candidates for normalization are columns that fit the following criteria:
- Not part of the primary key
- Not a foreign key reference into another table
- Not intrinsic data, such as a name
- Not a measured value, such as the time or a temperature
We apply the same logic to the auditoriums table, which allows us to add a column to store the number of seats available in each auditorium. The resulting 3NF data model is shown in Figure 6-4.
Note that we didn’t stop at adding an auditorium_id column to movie_showtimes. We also removed the theatre_id column. That’s because theatre_id is actually functionally dependent on, or bound to, the auditorium_id. We can find the theatre for a showtime by first traversing the auditorium_id relationship to the auditorium table, and then traversing the theatre_id relationship to the theatres table.
If we had instead kept the theatre_id column to make the association simpler, as in Figure 6-5, it would be possible to have a movie_showtimes record that referred directly to some theatre A, but also referred indirectly, through its auditorium, to some other theatre B. This would be anomalous and disastrous, so at this stage, we don’t give any more thought to keeping the theatre_id column around.
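The anomaly is easy to demonstrate with hypothetical in-memory rows, again using hashes as stand-ins for table records:

```ruby
# Auditorium 1 belongs to theatre 10...
auditoriums = { 1 => { :theatre_id => 10 } }

# ...but a showtime with a redundant theatre_id column can disagree
showtime = { :auditorium_id => 1, :theatre_id => 20 }

direct   = showtime[:theatre_id]
indirect = auditoriums[showtime[:auditorium_id]][:theatre_id]
puts direct == indirect  # false: two different answers to "which theatre?"
```

Dropping the redundant column makes the indirect path the only path, so the question has exactly one answer.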
But it does seem strange not to be able to tell which theatre a movie plays in without first examining an auditorium record. In fact, it is strange. We’ll come back to this problem in this chapter’s exercises, and again in Chapter 8.
In any case, our schema is now worlds more flexible. We can easily extend the information we keep about ratings or auditoriums, and we need only update a single record to do so.
Here’s the DDL for these new tables:
create sequence ratings_id_seq;

create table ratings (
  id integer not null default nextval('ratings_id_seq'),
  name varchar(16) not null unique check (length(name) > 0),
  description text,
  primary key (id)
);

create sequence auditoriums_id_seq;

create table auditoriums (
  id integer not null default nextval('auditoriums_id_seq'),
  theatre_id integer not null references theatres(id),
  auditorium_identifier varchar(64) not null
    check (length(auditorium_identifier) >= 1),
  seats_available integer not null,
  primary key (id),
  unique(theatre_id, auditorium_identifier)
);

create index auditoriums_theatre_id_idx on auditoriums(theatre_id);
Because we aren’t done with all of our refactoring yet, we won’t worry about models or tests here. Thinking about the unit tests necessary for these tables, and what validations might be necessary, is left as an exercise for the reader.
Timing Is Key
We could have created a separate ratings table from the start, when we defined our schema in the previous chapter, but we didn’t know then that we might want to store extra information about ratings. Stemming violations of 3NF is something we’d like to do before the violation occurs. The trick is to notice bound data—the rating was literally bound to the movie—early in the design phase of the data layer. Once lots of code is written atop a data model, making changes can be painful. Worse, if your production application has collected lots of data and you can’t start from scratch, the effort and time required to re-arrange data within a live database can be prohibitive. Spotting columns that are likely to need fully fledged tables in the next iteration of feature sets is a skill that requires experience as well as intuition about the problem at hand.
Often, novice data modelers, even those who know about third normal form, skip these sorts of normalizations because the immediate benefits are not apparent. The short-term gain of not having to define a new table, model class, and unit tests seems irresistible. Skipping normalization for a quick gain is highly discouraged, though. While it is absolutely true that a defensively normalized schema takes more effort up front, normalizing after the fact can easily take 10 times the effort of doing it right away. At that point, it’s often tempting to pick a bad solution, such as adding a seats_available column to movie_showtimes rather than to an auditoriums table, where it belongs.
Once a schema goes down a sloppy path, it eventually—and sooner than you might think—becomes unusable. The application layer becomes riddled with bugs because the data layer is too permissive. For example, if the application layer must always update the seats_available column in movie_showtimes, code can creep in that sets it incorrectly: perhaps to zero, or some random value. Or, possibly worse—those doing data entry might be expected to enter the value every time they add a new showtime. Suddenly the business is experiencing massive problems because the system oversold every show, or perhaps refused to sell any tickets at all.
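The safer design reads the seat count through the auditorium rather than copying it onto each showtime. A minimal sketch, using plain-Ruby stand-ins rather than the real ActiveRecord models:

```ruby
# Stand-in for the auditoriums table row, which owns seats_available
class Auditorium
  attr_reader :seats_available
  def initialize(seats_available)
    @seats_available = seats_available
  end
end

# Stand-in for a movie_showtimes row: capacity is read through the
# association, so there is no per-showtime copy to get out of sync
class MovieShowtime
  def initialize(auditorium)
    @auditorium = auditorium
  end

  def seats_available
    @auditorium.seats_available
  end
end

auditorium = Auditorium.new(150)
matinee = MovieShowtime.new(auditorium)
evening = MovieShowtime.new(auditorium)
puts matinee.seats_available  # 150, always consistent with the auditorium
```

Because every showtime in the auditorium reports the same capacity, neither application code nor data-entry staff can introduce a conflicting value.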
Refactoring: Inheritance and Mixins
Since we know our website’s purpose is to sell movie tickets, it makes sense to add tables to hold sales data before we begin coding up the frontend. Figure 6-6 shows our first stab at adding tables to store orders and ticket purchases. The orders table holds each transaction, including the purchaser’s name, address, and credit card information. The purchased_tickets table is a line item table for each ticket purchased in the transaction. Rather than simply providing a column in the orders table to record the number of tickets purchased, we split the tables up, foreseeing the need to account for tickets at different prices, such as student or senior discounted tickets, matinees, and so on.
In adding these tables, we find a glaring violation of the Don’t Repeat Yourself (DRY) principle: the repetition of address data between the theatres table and our new orders table. If we kept the schema as is, with the data duplicated, we would also end up duplicating a lot of other code, including validation code and display logic for addresses.
Instead of duplicating this structure, we create a separate addresses table. To mimic the previous examples, we would create address_id columns in the theatres and orders tables, and refer to individual rows in the addresses table.
We won’t do that here, though. This data is not in violation of 3NF: it depends only on the primary key of the table it is in, not on some other column, as in our earlier examples. The problem we are trying to solve is one of repeated code, not repeated data. In code, we solve this sort of problem with class inheritance (or in Ruby, with mixins). The same can be accomplished at the data layer with Postgres’s table inheritance mechanism. We’ll use this feature to solve our current problem. First, we create the addresses table, without a primary key; the tables that inherit address functionality already have their own.
create table addresses (
  line_1 varchar(256) not null check (length(line_1) > 0),
  line_2 varchar(256),
  city varchar(128) not null check (length(city) > 0),
  state varchar(2) not null check (length(state) = 2),
  zip_code varchar(9) not null references zip_codes(zip),
  phone_number varchar(10) not null check (length(phone_number) = 10)
);
The addresses table does not correspond directly to a Rails model class. Instead, its columns are sucked into tables that need it using the inherits keyword:
create table theatres (
  id integer not null default nextval('theatres_id_seq'),
  name varchar(256) not null unique,
  primary key (id)
) inherits (addresses);

If we describe the theatres table, it now has all of the properties of the addresses table:

movies_development=# \d theatres
        Table "public.theatres"
    Column    |          Type
--------------+------------------------
 line_1       | character varying(256)
 line_2       | character varying(256)
 city         | character varying(128)
 state        | character varying(2)
 zip_code     | character varying(9)
 phone_number | character varying(10)
 id           | integer
 name         | character varying(256)
Inherits: addresses
Because Postgres allows multiple inheritance, we want to allow for the same flexibility within Rails. Ruby provides for multiple inheritance through the use of mixins, and Rails takes this one step further with plugins. Below is our plugin for address support, based on the template from Chapter 3. Example 6-1 shows the file acts_as_address.rb, in the plugin’s lib/ directory.
Example 6-1. acts_as_address.rb, from our acts_as_address plugin
module Addresses
  module ClassMethods
    def acts_as_address
      class_eval do
        # define our validators and associations
        validates_presence_of :line_1, :city, :state, :zip_code, :phone_number
        validates_length_of :line_1, :maximum => 256
        validates_length_of :line_2, :maximum => 256
        validates_length_of :city, :maximum => 128
        validates_length_of :state, :maximum => 2
        belongs_to :zip_code, :foreign_key => :zip_code

        # define the find_within_miles method for all models that contain addresses
        find_within_miles = Proc.new do |zip, miles|
          z = ZipCode.find_by_zip(zip)
          !z ? [] : self.class.find(:all,
            :include => :zip_code,
            :conditions => ["miles_between_lat_long(?, ?,
                               zip_codes.latitude, zip_codes.longitude) < ?",
                            z.latitude, z.longitude, miles])
        end
        define_method :find_within_miles, find_within_miles
      end
    end
  end

  def self.included(base)
    base.extend ClassMethods
  end
end
Example 6-2 shows the init.rb file in the plugin’s top-level directory.
Example 6-2. The init.rb file for our acts_as_address plugin
ActiveRecord::Base.class_eval do
  include Addresses
end
Now we can simply say, within the theatre and order classes:
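Each class simply declares acts_as_address in its body. The sketch below shows the idea; the Addresses stub and ModelBase class are stand-ins for ActiveRecord::Base with the Example 6-1 plugin loaded, so the snippet runs outside Rails:

```ruby
# Minimal stub of the plugin's hook (the real one, in Example 6-1, defines
# validations and associations rather than just setting a flag)
module Addresses
  module ClassMethods
    def acts_as_address
      @acts_as_address = true
    end
    def acts_as_address?
      !!@acts_as_address
    end
  end
  def self.included(base)
    base.extend ClassMethods
  end
end

class ModelBase  # stand-in for ActiveRecord::Base
  include Addresses
end

# What our model classes now look like:
class Theatre < ModelBase
  acts_as_address
end

class Order < ModelBase
  acts_as_address
end

puts Theatre.acts_as_address?  # true
```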
Our new schema diagram, shown in Figure 6-7, is much cleaner now that we have normalized the repetitive address information into its own table. Each table that inherits addresses is much easier to understand as well. In addition, our models that implement the Addresses module stay clean. In both the database and our application code, if we need to make changes to the way addresses work, we need only do it in one place.
Are there more opportunities for normalization here? Before reading on, examine the data model and brainstorm about the business. Put yourself in the shoes of the developer of this website. What are features you might want that would require dependent data?
Drawing from my own experience, I would say that there has never been a time when a zip code was entered in a database and the site in question would not have benefited from knowing additional information about that zip code. For example, in our application, which theatres within 25 miles of 02139 are playing the latest Harry Potter? Certainly, we could solve this by adding latitude and longitude columns to the theatres table, but we risk duplicating data (and therefore creating conflicting data) if we have multiple theatres in the same area. Further, if we ever want to validate user input, such as verifying that a city, state, and zip code combination is valid, we will need an auxiliary table of zip code data.
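The distance test such a query relies on can be sketched in plain Ruby: the haversine great-circle distance between two latitude/longitude points, the same calculation a miles_between_lat_long database function would perform. The coordinates below are approximate values for Cambridge, MA (02139) and downtown Boston, used here only as sample inputs:

```ruby
# Haversine great-circle distance between two points, in miles
def miles_between_lat_long(lat1, long1, lat2, long2)
  radius_miles = 3959.0  # mean radius of the Earth
  dlat  = (lat2  - lat1)  * Math::PI / 180
  dlong = (long2 - long1) * Math::PI / 180
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * Math::PI / 180) * Math.cos(lat2 * Math::PI / 180) *
      Math.sin(dlong / 2)**2
  2 * radius_miles * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
end

# Cambridge, MA (02139) to downtown Boston: a few miles
puts miles_between_lat_long(42.3647, -71.1042, 42.3601, -71.0589)
```

With latitude and longitude stored once per zip code, the "theatres within 25 miles" question becomes a simple filter over this distance.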
We will leave the refactoring of the zip code table for the next chapter.
Exercises

- Based on what you know about the problem domain, for each column in this chapter’s data model, list the ranges of valid and invalid data.
- Write unit tests for those boundary conditions. Where appropriate, add validations to make your unit tests pass.
- Add the theatre_id column back to the movie_showtimes table. Prove to yourself that you can reproduce the anomalous situation described in this chapter. Is there any way you can use a database constraint to prevent the anomalies?
- Find a table where a column takes on one of a set of values, but those values are not foreign key references into another table, as in Figure 6-7.
- Create a new table, where the rows represent the set of values from the previous step, plus any other columns that were dependent on that value rather than on the original table’s primary key (Figure 6-8).
We can create the table shown above with the following SQL statement. Note that it is up to you to account for any data inconsistencies caused by improper normalization:
insert into colors(name, hex) select distinct color, hex from my_table;
- Create a column in the original table for the foreign key reference, then populate it. After doing so, delete the original columns.
alter table my_table add column color_id integer;

update my_table
   set color_id = (select c.id
                     from colors c
                    where my_table.color = c.name
                      and my_table.hex = c.hex);

alter table my_table drop column color;
alter table my_table drop column hex;

alter table my_table add constraint my_table_color_id_fkey
  foreign key (color_id) references colors(id);

alter table my_table alter column color_id set not null;
- Create a model class for the new table. Add association references appropriately. The new model will have a has_many association to the original model class, and the original class will have a belongs_to relationship with the new model class.
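The association structure the last step describes can be sketched in plain Ruby. The Color and MyTable names continue the hypothetical example above; a real Rails solution would use has_many and belongs_to on ActiveRecord models instead:

```ruby
# Plain-Ruby stand-in for the new model class; @my_tables plays the
# role of a has_many :my_tables association
class Color
  attr_reader :name, :my_tables
  def initialize(name)
    @name = name
    @my_tables = []
  end
end

# Stand-in for the original model class; @color plays the role of
# a belongs_to :color association
class MyTable
  attr_reader :color
  def initialize(color)
    @color = color
    color.my_tables << self
  end
end

red = Color.new('red')
row = MyTable.new(red)
puts row.color.name        # red
puts red.my_tables.length  # 1
```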