Maybe Normalizing Isn't Normal

I wonder if these changes we’re seeing in how important normalization is to applications aren’t a reflection of the changes in the types of applications people are writing today. These days, with most web sites focusing on the infinite possible connections between people and other entities in a system, a denormalized data model may just make more sense right off the bat.

Thanks for the thoughts. I think I’ll write more on this at http://cozenedcognizance.blogspot.com/2008/07/what-kind-of-applications-are-we.html (blatant commercial, I know).

Chris Rikli: Multiple queries will hurt performance much less than the multi-join monstrosity above that will return indistinct and useless data.

Hear, hear!!! You might look into this solution as well. I’ve had two ugly stored procedures of late that I’ve refactored/optimized/whatever by simply going to the core table, applying the major selection criteria there, and then making a bunch of small queries to fill in the gaps.

Table variables (SQL Server 2000 / 8.0 and above; sorry, legacy folks, I have an app still stuck back there 8^D) allow for this and are much better than temporary tables.
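A minimal sketch of that pattern in T-SQL, with table and column names invented purely for illustration:

    -- Hypothetical schema: Users(UserId, Name, Region) and Orders(OrderId, UserId, Total).
    -- Step 1: hit the core table once with the expensive selection criteria.
    DECLARE @Core TABLE (UserId INT PRIMARY KEY, Name NVARCHAR(100));

    INSERT INTO @Core (UserId, Name)
    SELECT UserId, Name
    FROM   dbo.Users
    WHERE  Region = 'EMEA';   -- the "major selection criteria"

    -- Step 2: fill in the gaps with small, targeted queries against the narrowed set.
    SELECT c.UserId, c.Name, o.OrderId, o.Total
    FROM   @Core AS c
    JOIN   dbo.Orders AS o ON o.UserId = c.UserId;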

Yay! BuggyFunBunny’s back!

(back to reading comments now…)

When dealing with this exact issue, I took the approach of using a normalized data repository supplemented with a denormalized view of the data. For data that does not change very often (how often do users change their phone number, really?), this is usually adequate, though you certainly run into problems with stale data unless it is handled carefully.

In my case, it was more than simply doing a multiple-table join across a single database; it was doing a multi-database join across 4 completely incompatible databases: a standard RDBMS, a chemical database (outside the RDBMS), and a biological database (also outside the RDBMS).

Having the denormalized dataset allowed us to take queries that originally needed 5-10 minutes to complete down to about 10 seconds. It all depends on your particular application.

I always normalize. If a group of data (in your case user data) is being queried the same way every time, then either you over-normalized or you need to create a view.
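For example, if the user’s contact details really are fetched the same way every time, a plain view can hide the joins without denormalizing the base tables. A rough sketch, with invented table names:

    -- Hypothetical normalized tables: Users, UserPhoneNumbers, UserScreenNames.
    CREATE VIEW dbo.UserContactInfo AS
    SELECT u.UserId,
           u.DisplayName,
           p.PhoneNumber,
           s.ScreenName
    FROM   dbo.Users               AS u
    LEFT JOIN dbo.UserPhoneNumbers AS p ON p.UserId = u.UserId
    LEFT JOIN dbo.UserScreenNames  AS s ON s.UserId = u.UserId;

    -- Callers then query it as if it were one wide table:
    -- SELECT * FROM dbo.UserContactInfo WHERE UserId = 42;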

I once wrote a CRM (digital Rolodex, like SalesForce.com) application and used it as an opportunity to apply a textbook normalized data model. It turned out to be a huge mistake because the model was wrong for the application.

For CRM, you’re essentially storing business cards. A business card is fully denormalized: the basic unit is the card, and while you may search on a couple of different fields, users just want to view the card. While it makes some sense to have a separate table for phone numbers and email (since there may be multiple), there’s only one mailing address.

Where I really got into trouble was in having an actual person/organization hierarchy instead of just a department column. Traversing arbitrary hierarchies in SQL is painful, and using a Java O/R bridge didn’t help. Doing a search for ATT employees becomes a recursive set of queries. What’s more, 99% of the time, the department was blank or inaccurate anyway.
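(For what it’s worth, recursive common table expressions in modern engines collapse that recursive set of queries into a single statement. A rough sketch, assuming an invented self-referencing Organizations table:)

    -- Hypothetical table: Organizations(OrgId, ParentOrgId, Name).
    -- Find every organization underneath 'ATT', however deep the hierarchy goes.
    WITH OrgTree AS (
        SELECT OrgId, ParentOrgId, Name
        FROM   dbo.Organizations
        WHERE  Name = 'ATT'                                       -- anchor: the root we searched for
        UNION ALL
        SELECT o.OrgId, o.ParentOrgId, o.Name
        FROM   dbo.Organizations AS o
        JOIN   OrgTree           AS t ON o.ParentOrgId = t.OrgId  -- recurse into children
    )
    SELECT OrgId, Name
    FROM   OrgTree;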

Which brings me to the next issue: half the time, the data wouldn’t import cleanly. We’d get business-card-type data from our trade shows in Excel format, essentially completely denormalized. There are a dozen different ways to spell ATT (including, at the time, Bell Labs), and company names were often misspelled.

Eventually, we scrapped the project and switched to SalesForce.com, which uses a data format that is just normalized enough to have a separate company table. I think it even has Phone 1, Phone 2 and Phone 3. But it meant I didn’t have to get pulled off my current project to do a custom data import after every trade show.

The moral of the story is that the correct data model is not the most general (fully normalized) one, but the one that best fits the expectations of the users and/or your data source. Normalizing any further just forces precision that doesn’t exist. Business cards and Excel spreadsheets are essentially single tables, so using several tables to represent them can lead to problems.

…scalability is not your problem, getting people to give a shit is.

IOW, scalability is not a problem, until it is. Strip away the scatological reference, and all you have is a boring truism.

Case in point: Twitter. The idea of mentioning Twitter as a scalability success story is perfectly risible. People have actually stopped giving a shit (to steal an artless phrase) because of their performance problems.

I normalise, then have distinct (conceptually transient) denormalised cache tables which are hammered by the front-end. I let my model deal with the nitty-gritty of the fix-ups where appropriate (handy beginUpdate/endUpdate-type methods mean the denormalised views don’t get rebuilt more than necessary).

Of course, a smart RDBMS would let you use… (drumroll) views…and optimise accordingly. I’d imagine this is the sort of thing that Oracle would do. I haven’t checked.
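Something along those lines does exist in Oracle’s dialect as materialized views. A rough sketch (object names invented, and FAST refresh would additionally require materialized view logs):

    -- Oracle materialized view: a denormalised cache the database maintains for you.
    CREATE MATERIALIZED VIEW user_contact_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
    SELECT u.user_id, u.display_name, p.phone_number
    FROM   users u
    JOIN   user_phone_numbers p ON p.user_id = u.user_id;

    -- Refresh it whenever suits (e.g. from a scheduled job):
    -- EXEC DBMS_MVIEW.REFRESH('USER_CONTACT_MV');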

I could be missing the point of the article, but I don’t think he’s suggesting that we skip past normalization but instead consider if it needs to be done.

What if he only plans to allow those three phone numbers for a user - should that be moved to a separate table? What about only two? Or just one?

At what point do you stop blindly normalizing the table structures and consider if everything really needs to have a table of its own?

Doesn’t seem like an unreasonable question to me…
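To make the trade-off concrete, a rough sketch of the two shapes being weighed (names invented; in practice you would pick one or the other):

    -- Option A: a fixed set of phone number columns, denormalized but dead simple.
    CREATE TABLE dbo.Users (
        UserId      INT IDENTITY PRIMARY KEY,
        DisplayName NVARCHAR(100) NOT NULL,
        Phone1      VARCHAR(25) NULL,
        Phone2      VARCHAR(25) NULL,
        Phone3      VARCHAR(25) NULL
    );

    -- Option B: phone numbers split into their own table, normalized and unbounded.
    CREATE TABLE dbo.UserPhoneNumbers (
        UserId      INT         NOT NULL REFERENCES dbo.Users (UserId),
        PhoneNumber VARCHAR(25) NOT NULL,
        PRIMARY KEY (UserId, PhoneNumber)
    );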

I don’t know SQL, but your diagram looks a hell of a lot like what an ordinary Real Programmer would call a record with some pointers. Obviously the denormalized version (one record per user) is better than the normalized version (many records per user) for pretty much all applications — easier to read, easier to write, easier for the computer to deal with. So maybe this is some big religious issue to SQL clerks, but it’s certainly a no-brainer to normal folks.

The real WTF, of course, is that you’re apparently writing your own software to deal with tagging. Tagging is a solved problem, guys. You don’t have to write code to manipulate tags anymore.

Mo: You should get out more; the two real RDBMSs (SQL Server and Oracle) have views :)

Stop playing with MySQL.

Unfortunately, the SQL given in the article is not a meaningful representation of a real set of data about a single user.

In most cases, our user-retrieving SP would return multiple recordsets for the one-to-many entity relationships, not a cross product of all the combinations.
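Roughly what that looks like (procedure and table names invented for illustration): one round trip that returns several tidy recordsets instead of one giant join product.

    CREATE PROCEDURE dbo.GetUser
        @UserId INT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- One recordset per one-to-many relationship, all for the same user.
        SELECT UserId, DisplayName FROM dbo.Users            WHERE UserId = @UserId;
        SELECT PhoneNumber         FROM dbo.UserPhoneNumbers WHERE UserId = @UserId;
        SELECT ScreenName          FROM dbo.UserScreenNames  WHERE UserId = @UserId;
    END;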

Now if you are talking about returning data for many users for analysis, you are probably still not going to combine the data this way.

In the large data analysis situation, you can start to materialize views and temporary tables, etc., but the best thing I’ve seen is to apply dimensional modeling and build a Kimball-style data warehouse, which gives a very fast and easy de-normalized representation of the data. This is a read-only replica of the data in a completely different model.
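Roughly what such a star looks like (table and column names invented for illustration): flattened dimensions around a narrow fact table, loaded as a read-only copy of the OLTP source.

    -- Dimension: one wide, denormalized row per user, flattened from the source tables.
    CREATE TABLE dbo.DimUser (
        UserKey     INT IDENTITY PRIMARY KEY,   -- surrogate key for the warehouse
        UserId      INT           NOT NULL,     -- natural key from the source system
        DisplayName NVARCHAR(100) NOT NULL,
        Country     NVARCHAR(50)  NULL
    );

    -- Dimension: calendar attributes, keyed as YYYYMMDD integers.
    CREATE TABLE dbo.DimDate (
        DateKey      INT PRIMARY KEY,           -- e.g. 20080714
        CalendarDate DATE NOT NULL,
        [Year]       INT  NOT NULL,
        [Month]      INT  NOT NULL
    );

    -- Fact: one row per user per day, holding only keys and measures.
    CREATE TABLE dbo.FactUserActivity (
        UserKey   INT NOT NULL REFERENCES dbo.DimUser (UserKey),
        DateKey   INT NOT NULL REFERENCES dbo.DimDate (DateKey),
        PostCount INT NOT NULL,
        VoteCount INT NOT NULL
    );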

Normalization is all about reducing redundancy. Achieving great performance requires a balance: if you normalize everything by the book, you will most certainly pay the price in performance. Leaving redundancies in key places can dramatically speed up your system, because the engine can answer queries from a single well-indexed table instead of recomputing the same joins every time.

To the poster who just got out of the db module (course?) at university: you will learn that uni-level db is just the tip of the iceberg, and that the overarching principle in db design is that there is no overarching principle.

@Jim

That’s exactly what you would say if you hadn’t yet been bitten in the ass by a denormalisation decision taken years ago that’s making your life 20 times harder now.

With all of the wonderful technologies described here to mitigate the cost of large joins, I fail to see how anyone can reasonably argue that denormalisation in any form makes sense. I’m pretty sure the brain cycles the entire RDBMS vendor community has spent thinking about this problem trump anything from people who think sysadmins and DBAs are idiots.

@ Paul Thompson

Then I read on and discovered all the loons and morons who think they know a lot more than they do about databases. People who claim that views are recomputed every time a query runs (that’d be you, Steven Bey).

Many thanks for your kind words.

I don’t profess to be a database expert, and my initial thought had been to use a view; however, I had a question mark over my head about when the cached data (in the view) gets updated, so I did a little research via our friend Google. The only place I found stating that view data is cached was Wikipedia, and I prefer not to take Wikipedia at face value. I did, however, find other articles that either explicitly stated or implied that views DON’T cache data (including the URL that I posted, which is on a website called sqlserverperformance.com and written by a SQL Server MVP).

If you can provide me with a credible source, saying that view data is cached, I will happily read the article.

Once again, many thanks.

ATWOOoOD!!

(A cry of frustration from a DBA stuck in legacy de-normalised hell)

I love your blog Jeff, but I have to say that when I first read this post, my first reaction was much like the one you yourself had after reading about Wasabi. If this post leads to even one more de-normalised database in the world, it is already evil. Anyway - before I had time to comment, most of what I wanted to say (and more) had already been said (rudely, even).

There is now no real point in adding another voice to the torch-wielding, pitchfork-carrying DBA posse knocking on your door, so instead I want to make a suggestion: Eat my shorts. No, wait, sorry, that was frustration again - I want to suggest this:

In the spirit of the ideas behind stackoverflow, and in order to spread knowledge and learn, let us help!

Instead of going dark about this, challenge the community! Make posts with examples where you feel you will run into trouble, and allow people to comment. Let the people who throw the tomatoes here put up or shut up! *Hides tomato behind back*

For thisandthat purpose in stackoverflow, I need to have A, B, D, (D*X-4) and H from these six tables (ill.), and I need to handle a load of about 4 executions per second. My current plan is to jam everything up into one huge table because then the execution time drops from 345 ms to 87 ms. I need at least y, preferably less. Comments?

I hold (not even weakly) the strong opinion that normalising is not a collective madness; it is not like the religion of the angle brackets (XML); it is The Right Way™. I am confident that in the lifetime of Stack Overflow you will NEVER encounter a situation where denormalising is the right thing to do.

Let us prove it to you!

The basic decision on the initial design is simple: just ask yourself these two questions before you begin.

  1. Is my site OLTP? If the answer is yes, then normalize.
  2. Is my site OLAP? If the answer is yes, then de-normalize!

Simple as that.
If your answer is yes to both questions, then you need 2 databases.
If your answer is no to both questions, then you need 0 databases.

“This post seems to suggest throwing normalisation away.”

Seriously? Sometimes I wonder if I’m reading the same blog as most of these commenters. This is certainly not the first time either.

I’m with OJ - let us build it :)

And to all the folks suggesting copying data into a denormalized table via nightly batches or triggers etc.:
You are describing a materialized (indexed) view. The functionality exists in all mature RDBMSs. Your idea has (unsurprisingly) already been discovered by RDBMS architects and implemented.
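A rough SQL Server sketch of such an indexed view (table names invented; real indexed views come with a longer list of restrictions):

    -- The view must be schema-bound, and aggregates need COUNT_BIG(*).
    CREATE VIEW dbo.UserPostCounts
    WITH SCHEMABINDING
    AS
    SELECT u.UserId,
           COUNT_BIG(*) AS PostCount
    FROM   dbo.Users AS u
    JOIN   dbo.Posts AS p ON p.UserId = u.UserId
    GROUP BY u.UserId;
    GO

    -- The unique clustered index is what materializes the view;
    -- the engine then keeps the denormalized copy in sync automatically.
    CREATE UNIQUE CLUSTERED INDEX IX_UserPostCounts ON dbo.UserPostCounts (UserId);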

I wonder how many people read the links before posting.
Just WOW. Thanks for all those links; I think I’ll spend some serious time reading all those entries on highscalability.com.

Oh, and I’ve had similar experiences of denormalised awesomeness with high-throughput systems. The data was renormalised every 24 hours, and it just simplified everything.
Sure, it’s not the first thing you should do, but it is an option.