Maybe Normalizing Isn't Normal

Good post about denormalization. I tend to class myself as a zealous software engineer who likes to stick to the software/database principles that I was taught in university, but through working with clients and real world databases there is definitely a place for denormalization.

I would be interested in your views on keeping the integrity of the denormalized fields. Here is a page discussing this http://www.ssw.com.au/ssw/standards/Rules/RulestoBetterSQLServerdatabases.aspx#DenormalizedFields

This is a dangerous article, or perhaps you just haven’t seen the number of horrific denormalised databases I have. People use these posts as excuses to commit some truly horrific crimes against sanity.

Normalise first, denormalise if you have to. If you can’t normalise, then don’t even try to denormalise; get someone else to do it.

A six-way join on 100,000 row tables is better than a single query on a 1,000,000,000 row table. I’ve been there. It wasn’t fun.

Denormalising correctly requires more skill than normalising correctly. Just because Flickr duplicates data doesn’t mean that wholesale denormalisation is automatically a good idea.

Some sort of middle ground (temporary staging tables, very heavy caching, and so on) will gain significant performance advantages without having to corrupt your underlying data.

I don’t think I’ve ever seen a performance problem caused by a normalised database - not when such things as caching have been used correctly.

Nice comments. I think a lot of people who understand databases better than straight coding (myself included) get annoyed with coders telling us how to write databases.

It seems to me Jeff hasn’t got a lot of database experience, and is saying don’t bother with normalisation because it’s difficult to understand.

The article is also confusingly written, criticising normalisation due to concerns over scalability, and then saying that scalability doesn’t matter.

If you understand normalisation, you know when to use it and when not to, because you know what the costs and benefits are. If you don’t understand normalisation, but you need to store a lot of data in a way that can be queried and analysed, you had better learn it.

To pick up on Eric’s comment that denormalisation is for reporting: reporting is just querying a database, so why are there accepted design patterns for reporting, e.g. star and snowflake schemas?

We have them because reporting has to scale to querying much larger volumes of data that is probably distant (i.e. many joins away). So we denormalise into facts and dimensions.
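To make the facts-and-dimensions idea concrete, here is a minimal sketch of a star schema in SQLite from Python; the table and column names are invented purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical star schema: a wide fact table surrounded by small dimension tables.
con.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE fact_pageview (
    date_id INTEGER REFERENCES dim_date(date_id),
    user_id INTEGER REFERENCES dim_user(user_id),
    views   INTEGER          -- pre-aggregated measure, denormalised on purpose
);
""")

# A report joins the fact table to a handful of small dimensions instead of
# walking many joins across the fully normalised transactional schema.
report = con.execute("""
    SELECT d.year, u.country, SUM(f.views)
    FROM fact_pageview f
    JOIN dim_date d ON d.date_id = f.date_id
    JOIN dim_user u ON u.user_id = f.user_id
    GROUP BY d.year, u.country
""").fetchall()
```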

The change that has come about in the last few years is that many internet-scale applications (especially social ones) have reporting-style operations at their very core. Look at the problem of aggregating data from a social graph. In the abstract this is a reporting-style problem, and therefore some level of denormalisation is the only way to get data out quickly.

Now we face the further issue that, unlike traditional business reporting systems, sites like Twitter and Facebook cannot wait minutes for queries to return; they need results in seconds or less. That has led them to implement denormalisation on a massive scale.

After several years of companies starting with a tightly normalised model and iterating to a denormalised one (see the fantastic O’Reilly series linked in the main article), it’s right that people start looking at this wealth of experience and asking the question: when is it right to start with a denormalised model?

Data integrity is the most important aspect of a database. PERIOD. Performance is ALWAYS second.

Always is a dangerous word.

hehe. The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.

This is the crux of the contention: coders who view the database as a code widget to manipulate versus database wonks who view the database as primal and the coders as Huns out to rape, pillage, and destroy the data. Coders not only want what is rightfully theirs (making the data look pretty for the user), but also want to control the data in client code. Always a bad idea.

Coders refuse to accept that Codd’s 12 rules are even rules, much less laws of nature. Database wonks realize that they are laws of nature.

Great article Jeff!

@Ariel

Re: james: are you serious? Tables don’t have 1 row in them; they have many. So you make them plural (1 row singular, multiple rows plural). Rule: table names are plural, column names are singular.

Singular table names are completely acceptable and often preferred by OO developers using object-relational mappers like Hibernate. One would say the User table contains instances of class ‘User’.

Class name to table name mappings are predictable and reliably generated if singular names are used. Think in OO terms and you will never use a plural table name again.

Singular table names eliminate having to know anything about natural language pluralization rules (and exceptions) like Category - Categories, Criterion - Criteria, Radius - Radii, Person - People, Mouse - Mice, Index - Indices. (More exceptions than there are rules, and that’s just in the English language).

There are no conventions for naming join tables when plurals are used for table names. Would it be UserAffiliations, UsersAffiliation or UsersAffiliations? If it is not UsersAffiliations, then why did one of the names suddenly become singular and the other plural? If it is UsersAffiliations, then grammatically that name is wrong too unless you add the possessive apostrophe, making it Users’Affiliations. It’s a whole bowl of wrong. Avoid plurals altogether and keep things simple.
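To make the naming argument concrete, here is a minimal sketch using SQLAlchemy (rather than Hibernate, which the comment above mentions); the class and table names are invented, but the point is that singular names keep the class-to-table mapping mechanical, join table included:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "user"              # class User -> table user
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Affiliation(Base):
    __tablename__ = "affiliation"       # class Affiliation -> table affiliation
    id = Column(Integer, primary_key=True)
    title = Column(String)

class UserAffiliation(Base):
    # The join table name is just the two singulars stuck together;
    # no pluralisation rules or possessive apostrophes to argue about.
    __tablename__ = "user_affiliation"
    user_id = Column(Integer, ForeignKey("user.id"), primary_key=True)
    affiliation_id = Column(Integer, ForeignKey("affiliation.id"), primary_key=True)
```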

Maybe this has already been said, so I apologise for not reading all 212 comments…

Be careful not to confuse a denormalised database with a non-normalised database.

The former exists because a previously normalised database needed to be ‘optimised’ in some way.

The latter exists because it was ‘designed’ that way from scratch.

The difference is subtle, but important.

You don’t optimize anything – code or your database – without first understanding exactly where your performance gains will be. Use object-oriented design when you use a database. I’ve seen people create all sorts of objects, then put direct SQL calls in the middle of their code. Grrr.

Remember the whole MVC style of programming? THIS IS WHAT IT IS ABOUT! Write your code to be easy to maintain and well designed. Push all database access into your object models. That way, when you restructure your database, you don’t have to recode your entire application.
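A rough sketch of what pushing all database access behind an object model can look like; the schema, class, and method names here are hypothetical, and the point is simply that only one class knows the table layout:

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    id: int
    name: str

class UserRepository:
    """The only place in the application that knows how users are stored."""

    def __init__(self, con: sqlite3.Connection):
        self.con = con

    def find(self, user_id: int) -> Optional[User]:
        row = self.con.execute(
            "SELECT id, name FROM user WHERE id = ?", (user_id,)
        ).fetchone()
        return User(*row) if row else None

# If the user table is later restructured or denormalised, only UserRepository
# changes; controllers and views keep calling repo.find(user_id) as before.
```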

Once you’ve finished coding, you can start optimizing both the code and the database. A good SQL database has all sorts of tricks for turning what seem like multiple table lookups into a single operation. Deciding not to normalize your database because you think it is inefficient is a mistake.

In the mid-70s (yes, I’m THAT old) I worked on a rather popular computer called the Cado. The Cado could do things that took other computers costing 10x as much. It was the first computer that many businesses could actually afford.

The secret was the architecture: Cado made the assumption that hardware cost much more than software, so they optimized the hardware, and that optimization was reflected in the OS (it had no file allocation table – we tracked that with paper and pencil) and in the software, which required developers to use scant resources and write their programs in 256-byte segments. This allowed four users to use the Cado in only 48K of memory.

Ever heard of Cado? Nope, they went into a tailspin in the mid-80s because the primary assumption about hardware cost vs. software cost turned out to be wrong. By the 1980s, hardware costs had dropped and software had become the major cost. Over-optimizing the software meant that the programs were almost impossible to maintain. It also meant that programs were written for specific clients, so you had 20 programs for 1000 clients, whereas software written for maintainability, scalability, and configurability could be written for millions of clients, spreading the programming costs over a much larger sales base.

By the mid-80s, the specialized hardware needed to run the Cado programs meant you couldn’t take advantage of the PC mass market. A PC would cost $3,000 while the Cado hardware would cost $15,000. Cado programs would cost thousands of dollars while the equivalent PC programs could be had for $600. Hundreds of Cado VARS went belly up as their sales disappeared. No one wanted to spend $20K on a system that was slower, more buggy, and harder to maintain than an equivalent $7,000 PC based system.

So, don’t do what Cado did and make stupid optimization assumptions. Write good code and then test where optimization might actually do some good. Otherwise, you’ll be writing yourself into a corner.

I recently worked on a project with a distributed database where inventory items were stored in local databases. To know which distributed table to fetch an item from, the item number included the location where that item was stored. If you knew the item number, you knew which database the item was stored in, and you could do a single lookup to fetch it.

It was a great idea until inventory items were moved from one location to another. That meant changing the item’s inventory number, which meant rewriting the entire inventory table, all transactions (since the inventory item number was in each transaction), history tables, customer tables, etc. It turned out that about two to three dozen items moved every month, so every month the entire database, distributed across over 60 locations, had to be updated. That one database optimization technique cost us several days of downtime each and every month.

@BuggyFunBunny
The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.

From http://www.mathpages.com/home/kmath054.htm
So far we have assumed that the acceleration of gravity was constant over the range of interest, but in fact there is a slight reduction in g as we go up in altitude. This can be significant when dealing with the energy of an object …

Also, Codd’s rules say nothing about database design and everything about Codd’s vision of the RDBMS (the system that hosts the database).

Your point about Code Wonks vs. Data Wonks is well taken.
As a Data Wonk, my view is that Code Wonks who don’t know how to View Execution Plan don’t get a say in database design.

@David B
I also checked Wikipedia; the figures I checked all worked out to 32 ft/sec/sec (rounded, standard). But, OK.

I’ve always seen that design (normalization in this thread) can be inferred, albeit distantly, from:

Rule 0 - The system must qualify as relational, as a database, and as a management system.

Rule 9 - Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.

Rule 10 - Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 12 - If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.

Rule 10 came to be embodied in DRI, unique constraints, check constraints and the like.
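As a hedged illustration of Rule 10 in practice (SQLite syntax, hypothetical tables): the integrity rules live in the schema catalog rather than in application code, so no client can quietly ignore them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite only enforces DRI when asked

con.executescript("""
CREATE TABLE department (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE                      -- unique constraint
);
CREATE TABLE employee (
    id            INTEGER PRIMARY KEY,
    email         TEXT NOT NULL UNIQUE,
    salary        INTEGER CHECK (salary > 0),      -- check constraint
    department_id INTEGER NOT NULL
                  REFERENCES department(id)        -- declarative referential integrity
);
""")

# The constraint is in the catalog: a client that tries to subvert it fails,
# no matter what the application code thinks it is doing.
try:
    con.execute("INSERT INTO employee VALUES (1, 'a@b.c', -5, 99)")
except sqlite3.IntegrityError as e:
    print("rejected by the database:", e)
```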

Rule 12 tells the coders to keep their grubby paws off.

Your example is probably the stupidest case for denormalization I’ve ever seen. You didn’t even touch on the problems of one-to-many relationships.

If it’s one-to-one, denormalize.
If it’s one-to-many or zero-to-many, normalize.
If it’s one-to-(specific number) it depends on how large the data you may be repeating is, and how scalable you want to be.

The bottom line here is I/O. If the normalization means doing 6 I/Os for a single person, versus one I/O in the single-table solution, it still isn’t a significant difference in terms of the wall clock if you’re just going after one person. Crank through a million people though, and it is a different story.
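A small sketch of that rule of thumb, with made-up columns: the one-to-one attributes are folded into the user row, while the zero-to-many phone numbers keep their own table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One-to-one attributes: folding them into the row costs nothing and saves a join.
CREATE TABLE user (
    id   INTEGER PRIMARY KEY,
    name TEXT,
    bio  TEXT
);

-- Zero-to-many: keep it normalised, otherwise every extra phone number means
-- another column, or a comma-separated blob you can no longer index or constrain.
CREATE TABLE user_phone_number (
    user_id      INTEGER REFERENCES user(id),
    phone_number TEXT
);
""")
```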

Please keep up the awesome blogging Jeff, you are always a good read!

Thanks!
rcs

A normalized database is good for inserts, updates, and data consistency. A denormalized database is good for fast selects, and works well if the table structure is close to what you’re reporting or loading into memory. That’s why it’s reasonably common to have two databases, a transactional one that’s relational, and a read only reporting database that gets updated from the relational database either by triggers or by a daemon.
If you’re reading lots of data and you have a decent number of joins, you might be better off reading the data a table at a time and doing the joins in memory. I’ve worked on systems where simple selects were essentially “free”, and most of the cost was the network (100 Mbit from the client). If you have several 1-n or n-n joins, the cost of reading the extra rows over the network will dominate the cost of the joins in a normalized database. In those cases, you’re better off caching “static” data locally and doing the join on the client.
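A toy version of that trigger-fed reporting copy (SQLite syntax, invented tables): the transactional tables stay normalised, and a trigger keeps a flat, read-only reporting row in sync:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalised transactional tables
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sale     (id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(id),
                       amount INTEGER);

-- Flat, denormalised reporting table; the application never writes to it directly
CREATE TABLE sale_report (sale_id INTEGER, customer_name TEXT, amount INTEGER);

-- Trigger (a daemon would also work) that keeps the report in sync on insert
CREATE TRIGGER sale_to_report AFTER INSERT ON sale
BEGIN
    INSERT INTO sale_report
    SELECT NEW.id, (SELECT name FROM customer WHERE id = NEW.customer_id), NEW.amount;
END;
""")

con.execute("INSERT INTO customer VALUES (1, 'Ada')")
con.execute("INSERT INTO sale VALUES (1, 1, 42)")
print(con.execute("SELECT * FROM sale_report").fetchall())   # [(1, 'Ada', 42)]
```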

I was intrigued to read that Google’s GQL doesn’t allow joins. I guess they figure they have hit the scalability wall, so instead of denormalizing later, they made it part of their architecture. Or maybe the architecture they are using forced it. Who knows.

When is it right to start with a denormalised model?

Answer: never.

Yes, I know always and never are dangerous words. I use them here nonetheless.

If your application starts out with a denormalised model, it is going to be much more expensive to develop.

You have to make sure you manually keep track of all of the places that data needs to be updated, changed or extended. The likelihood of introducing bugs increases, which forces you to dedicate more time to testing and fixing these bugs. Furthermore, you make it much harder for developers new to the system to understand how it works because they don’t have a simple model to follow - they have to wrap their head around a mess.

It is silly to spend all this extra time, effort and cost on something when you don’t know if it will even pay off. What if the requirements change? A seemingly small request from your client could actually turn out to be a huge amount of work simply because of your implementation. Somehow, I don’t think the client will accept your estimate of 2 weeks for adding a second group of contact details to the user’s profile.

Knuth said that premature optimisation is the root of all evil. He was right. You should optimise only when you can identify tangible areas of an application to optimise through profiling. Otherwise, you are wasting your time because your optimisations may never have been needed in the first place and you’re just creating additional work for yourself.

Dear god, please don’t let there be a generation of database designers who think this is the right way to do it.

Google App Engine’s designers say you should denormalize a Google App Engine database and should avoid relational ideas such as normalization.

For this I abandoned Google App Engine. The Greater Fools on their newsgroup continue to discuss exactly how to achieve this.

Hmmm. I would break interested_in out into another table, just in case your user is interested in more than one thing.
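Something like this hypothetical split, so a user can hold any number of interests:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user (
    id   INTEGER PRIMARY KEY,
    name TEXT
);

-- One row per interest instead of a single interested_in column on user
CREATE TABLE user_interest (
    user_id  INTEGER REFERENCES user(id),
    interest TEXT,
    PRIMARY KEY (user_id, interest)
);
""")
```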

hehe. The gravitational constant is ALWAYS 32 feet/sec/sec. ALWAYS; and anyone designing an aeroplane who chooses another value fails.

Actually, being pedantic, that’s the value of gravitational acceleration near the surface of the Earth. The gravitational constant (G) is (more or less) 6.674 x 10^-11 m^3/(kg*s^2).

While I agree that normalization is the way to teach, as a design principle (RE: SomeGuy), it would be nice to see optimization issues mentioned more often when talking about best design principles in Computer Science.

John,

The reason GQL doesn’t support joins is that it’s not a relational database…