Maybe Normalizing Isn't Normal

Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need far fewer joins, because the data you need is often already present as foreign keys.

@T.E.D:
I’ve never been a database guy, but this whole issue seems like the classic problem of optimization to me. There’s almost always that trade-off between elegant readable design and optimal execution speed.

Absolutely, although you’ve got another element in the trade-off: data correctness. A database design that allows inconsistent data will have inconsistent data.

I have to agree with other posters: you can denormalize if you have to and you know what you're doing. If you don't, assume that the cost of incorrect data trumps other considerations.

The solution is two-tier storage. Websites tend to be read many, write few. The first tier is a normalized database; the second tier is denormalized data specifically constructed for use by the application. In an app that I am building, the second tier is a combination of memcached and plain old disk using binary serialization.

The expense now is that you have two writes: one to the real database and one to the denormalized tier.

So now you’ve separated the read requirements/performance from the write and relational data storage.
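As a rough sketch of what that write path might look like, assuming a toy schema, with a plain dict standing in for memcached and sqlite standing in for the real database, each save does the two writes the commenter describes: one to the normalized tier and one to the denormalized copy the application actually reads.

```python
import json
import sqlite3

# Tier 1: the normalized, authoritative store (sqlite stands in for the real RDBMS).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phones (user_id INTEGER REFERENCES users(user_id), number TEXT);
""")

# Tier 2: the denormalized read store (a dict stands in for memcached / disk blobs).
cache = {}

def save_user(user_id, name, phone_numbers):
    # Write 1: keep the normalized database correct and consistent.
    db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, name))
    db.execute("DELETE FROM phones WHERE user_id = ?", (user_id,))
    db.executemany("INSERT INTO phones VALUES (?, ?)",
                   [(user_id, n) for n in phone_numbers])
    db.commit()
    # Write 2: rebuild the denormalized blob the application reads from.
    cache[user_id] = json.dumps({"name": name, "phones": phone_numbers})

def load_user(user_id):
    # Reads hit the cheap, pre-built copy; no join happens at read time.
    return json.loads(cache[user_id])

save_user(1, "alice", ["111-1111", "222-2222"])
print(load_user(1))
```

The cost the commenter mentions is visible here: every save touches both tiers, and the two can drift if the second write fails.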

@Leandro:

Your six-way join is only needed because you used surrogate instead of natural keys. When you use natural keys, you will discover you need far fewer joins, because the data you need is often already present as foreign keys.

The whole question of surrogate vs. natural keys is of course a topic of enormous debate. Generally, I’m with you: natural keys are easier, safer, and often don’t hurt performance the way people assume they will.

I think the general advantage of normalization is the same as the DRY principle in coding. When you denormalize stuff you can wind up maintaining copies of data and doing the same thing twice.

Another advantage is that it is easier to go from normalized to denormalized when you need to than the other way around.

Just as well Dare is a social networker and not a database designer!

Look at some of the arbitrary limits in his design:

  • A fixed number of telephone numbers per person
  • A fixed number of job histories per person
  • A fixed number of IM accounts per person

Now, what happens when the big DOH! hand hits the forehead, and our short-sighted denormaliser suddenly realises that out there, in the REAL world, the numbers of things in sets change on a regular and frequent basis … that’s right, we have to change the schema!

It gets worse. The proponents of denormalised schemas are the most likely to write select * … in their queries. That means reviewing and changing all our code as well.

A properly designed database schema should survive most business changes. Denormalised (abnormalised?) schemas don’t.

And as others have pointed out in this thread, the avoidance of surrogate keys can do wonders for your performance.
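To make the schema-change point concrete, here is a small sketch (the table and column names are invented, not Dare's actual schema): with fixed slots, a fourth phone number means an ALTER TABLE plus a pass over all the code; with a child table, it is just another row.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Denormalized: a fixed number of slots baked into the schema.
db.execute("""
    CREATE TABLE users_flat (
        user_id INTEGER PRIMARY KEY,
        phone_1 TEXT,
        phone_2 TEXT,
        phone_3 TEXT  -- a fourth number means ALTER TABLE plus code changes
    )
""")

# Normalized: phone numbers live in a child table, one row per number.
db.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phones (user_id INTEGER REFERENCES users(user_id), number TEXT);
""")

# Any number of phone numbers is just more rows: no schema change, and no
# "select *" queries to hunt down and rewrite.
db.execute("INSERT INTO users VALUES (1, 'alice')")
db.executemany("INSERT INTO phones VALUES (1, ?)",
               [("111-1111",), ("222-2222",), ("333-3333",), ("444-4444",)])
print(db.execute("SELECT number FROM phones WHERE user_id = 1").fetchall())
```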

In my experience, beginning developers get into a lot of trouble because they're afraid of joins. The common symptoms are apps that:

(i) are excessively complex,
(ii) never quite work right, and
(iii) are expensive to change.

It definitely is true that we’re moving into an era of web scale systems that are much greater in size and power than traditional enterprise systems. Many people are pushing the performance limits of the traditional RDBMS, and we’re seeing a move towards distributed main-memory databases and other new technologies.

Workaday business apps benefit greatly from normalization, so I think this post sends a dangerous message to many developers. In the case above, I think it still makes sense to use a normalized database to keep the authoritative copy of the data, but it also makes sense to cache the data in denormalized forms. Rather than creating a table with the notorious phone_number_1, phone_number_2, … phone_number_33 fields that I'm always cleaning up in other people's applications, you can just serialize the data with your language's serialization system, with JSON, or something like that.

Social sites generally have many more reads than writes, so this can improve performance a hundredfold or more while still keeping the advantages of a relational back end. If the cache gets messed up, you can always do a partial or complete rebuild from data you trust.
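As an illustration of that rebuild idea (the table, function, and cache names below are invented): the cached, denormalized profile is derived data, so if it ever gets corrupted you simply regenerate it from the normalized tables you trust.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users        (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phones       (user_id INTEGER, number TEXT);
    CREATE TABLE screen_names (user_id INTEGER, screen_name TEXT);
    INSERT INTO users        VALUES (1, 'alice');
    INSERT INTO phones       VALUES (1, '111-1111'), (1, '222-2222');
    INSERT INTO screen_names VALUES (1, 'name1'), (1, 'name2'), (1, 'name3');
""")

cache = {}  # stands in for memcached or a serialized blob on disk

def rebuild_profile(user_id):
    """Regenerate the denormalized profile from the normalized source of truth."""
    name = db.execute("SELECT name FROM users WHERE user_id = ?",
                      (user_id,)).fetchone()[0]
    phones = [r[0] for r in db.execute(
        "SELECT number FROM phones WHERE user_id = ?", (user_id,))]
    screens = [r[0] for r in db.execute(
        "SELECT screen_name FROM screen_names WHERE user_id = ?", (user_id,))]
    cache[user_id] = json.dumps(
        {"name": name, "phones": phones, "screen_names": screens})

rebuild_profile(1)  # safe to re-run any time the cache is suspect
print(cache[1])
```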

I’ve had programmers tell me that they don’t care about the size of data or normalization. Then they cry when they have 200,000 fat rows and performance crawls.

For those advocating the use of Views: the database does not store the view data. The data is recreated, using the view’s SELECT statement, every time a user queries a view.

Furthermore, see this article on View performance: http://sql-server-performance.com/tips/views_general_p1.aspx
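You can see that behaviour for yourself; the sketch below uses sqlite purely for illustration, but an ordinary (non-indexed, non-materialized) view works the same way in most engines: it stores no rows, only the SELECT text, which is re-run each time the view is queried.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL);
    INSERT INTO orders VALUES (1, 9.99), (2, 25.00);

    -- An ordinary view stores no rows, only its SELECT statement.
    CREATE VIEW big_orders AS SELECT * FROM orders WHERE total > 10;
""")

# Querying the view re-runs its SELECT against the base table each time.
print(db.execute("SELECT * FROM big_orders").fetchall())

# The query plan shows a scan of the underlying table, not of stored view data.
for row in db.execute("EXPLAIN QUERY PLAN SELECT * FROM big_orders"):
    print(row)
```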

What I think is funny is the number of people who think that because they use LINQ or Hibernate they aren’t affected by these issues… In the end you are storing your data in a relational database, right? Unless you are plugging LINQ or Hibernate into something other than Oracle, SQL Server, MySQL or Postgres (or any other RDBMS) this is still an issue for you…

Sure, you might not be coding the six inner joins, but your OO mapper sure is.

Scary.

@Dare
In addition, who says you wouldn’t want all the user data including phone numbers at once? Have you ever seen a profile page on FB? That’s exactly what it does.

I think what Ariel is saying is that if you performed the six-join query, your returned data would be duplicated. For example, if a user had two phone numbers and three screen names, the returned data would look like:

user_id  phone_number  screen_name
1        111-1111      name1
1        111-1111      name2
1        111-1111      name3
1        222-2222      name1
1        222-2222      name2
1        222-2222      name3

With the data in this format, it would take some monkeying around to create the display text:

User: 1
Phone Numbers: 111-1111, 222-2222
Screen Names: name1, name2, name3

It would be easier to select the phone numbers and then the screen names as separate queries instead of trying to get all of the data at once.
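A hedged sketch of that separate-query approach (the schema is invented to mirror the example above): fetching each child table on its own returns 2 + 3 rows rather than the 6-row cross product, and the display text falls out directly.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE phones       (user_id INTEGER, number TEXT);
    CREATE TABLE screen_names (user_id INTEGER, screen_name TEXT);
    INSERT INTO phones       VALUES (1, '111-1111'), (1, '222-2222');
    INSERT INTO screen_names VALUES (1, 'name1'), (1, 'name2'), (1, 'name3');
""")

user_id = 1

# One query per child table: 2 rows plus 3 rows, instead of the 6-row cross
# product a single join across both tables would return.
phones = [r[0] for r in db.execute(
    "SELECT number FROM phones WHERE user_id = ?", (user_id,))]
screens = [r[0] for r in db.execute(
    "SELECT screen_name FROM screen_names WHERE user_id = ?", (user_id,))]

print(f"User: {user_id}")
print("Phone Numbers:", ", ".join(phones))
print("Screen Names:", ", ".join(screens))
```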

@[ICR]: Kudos to you.

Please think! Is this post even about normalization? The two database designs you used to illustrate your point are actually solutions to two different problems. They are not the same database, and they cannot represent the same data. You can argue that the second one is faster, but only if it can actually be used to represent the facts that are meaningful to your problem; otherwise this is just hot air. This kind of lack of thinking is what usually makes life with databases a hellish proposition, because almost invariably someone will ask for data that is not stored, because it was easier to say "I need performance, so I'll just ignore the requirements (or not think them through) and de-normalize!"

I keep my data normalized. I keep it all in memory. I use Prolog to do it.

Jeff,
You make a good point, but there really is an upper limit on how far things can be stretched with denormalization. When stored procedures start to have anywhere from 100 to 200 arguments and SQL Server 2000 warns you that your row size exceeds its 8,060-byte limit, then maybe… just maybe it's time to normalize the tables a bit.

But this design also requires a whopping six joins to retrieve a single user's information.

Yes, if you need all of it, or if you need all of it in a single table.
Very often neither is the case. You don't always need all of the information at once; sometimes you first need some of it and later on some more, so you don't need it in a single table from one joined query, but as separate queries instead.

I hope the people at Twitter listen to this. :P
They got people to care; now it's time to move forward. :P

The link for Findory and Amazon is: http://radar.oreilly.com/2006/05/database-war-stories-8-findory.html

Jeff,

You miss the point of normalization entirely. Normalization is about optimizing large numbers of small CRUD operations, and about retrieving small sets of data in order to support those CRUD operations. Think about people modifying their profiles, recording cash register operations, recording bank deposits and withdrawals. Denormalization is about optimizing retrieval of large sets of data. Choosing an efficient database design is about understanding which of those operations is more important. A system that will primarily be serving content may be a good candidate for denormalization. That doesn't mean that normalization doesn't scale; it just doesn't scale for one particular problem.