Maybe Normalizing Isn't Normal

[sigh]

The whole issue is pretty simple 99% of the time: normalized databases are write-optimized by nature. Since writing is slower than reading, and most databases are for CRUD, normalizing makes sense. Unless you are doing a lot more reading than writing. Then, if all else fails (indexes, etc.), create denormalized tables (in addition to the normalized ones).
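
For example, something like this as a sketch (all table and column names are invented; CREATE TABLE ... AS SELECT syntax, which not every engine supports):

```sql
-- The normalized tables stay as-is; this is an extra, read-optimized copy.
CREATE TABLE user_affiliations_flat AS
SELECT u.user_id,
       u.screen_name,
       a.affiliation_name
FROM   users u
JOIN   user_affiliations a ON a.user_id = u.user_id;
```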

I've never considered myself a DB(A) guy … so reading some of this is interesting.

Most of the databases I've built myself haven't had the volume of traffic that really mattered much – so I've always done what I thought was the 'intuitive' normalization and, in cases where it was faster or easier to put some of it in a denormalized fashion, did so.

Good read.

"and table names are singular because each row in the table designates one specific person; one row = one user"

This is another religious war. Should table names be singular or plural? I was in a company that mandated that tables should be singular: first, to match the above quote, and second, if nothing else, so that you have one less letter to type for every table name.

However, I am of the opinion that tables, since they normally hold more than one item, are plural, and the object that holds a single row is singular. Collection objects that hold multiple rows should be plural.

Hmm. Does anyone expect a real discussion of tech at Coding Horror? It's as bad as /.

Stupid db design (or at least incomplete), stupid SQL.

If most smart people require whole books to describe the (wonderful) world of RDBMSs, do you really think anyone here can have anything more than a soundbite contest?

I'm surprised some twat hasn't called using databases 'cargo cult programming' yet.

Stick to toys and trivia, that's what this blog is best at.

Sean Bamforth is on the money here.

You need a normalized logical schema. This is what the DBMS should provide to the application. The DBMS should denormalize as necessary to optimize performance WITHOUT exposing those denormalizations to the application.

The problem is that DBMSes don't do this. Sure, they give you indices and caching and materialized views, but they don't go all the way.
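
A materialized view is about as close as the stock tools get today; roughly this (Oracle-style syntax, names invented), where the normalized tables stay the source of truth and the database maintains the denormalized copy – but the application still has to know to query the view:

```sql
-- The database keeps a stored, refreshable result set for this join, so
-- reads hit the denormalized copy while writes go to the normalized tables.
CREATE MATERIALIZED VIEW user_contact_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT u.user_id, u.screen_name, p.phone_number
FROM   users u
JOIN   user_phone_numbers p ON p.user_id = u.user_id;
```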

Your denormalized schema has lost all of the one-to-many relationships that caused the original need for normalization in the first place.

Normalize first, it protects data integrity and makes maintenance a breeze. If you hit the performance wall, then denormalize.

However, your example is still bad, because the denormalized table is nothing like the normalized schema.

@Jeff - You're prematurely optimizing.

@Leandro - Natural keys! I agree, and bless you for mentioning it.

In 25 years of software development as contractor and employee in all kinds of companies and industries, every performance problem I ever encountered with a transactional normalized database was solved by adding indexes or fixing poorly written SQL statements in the application. In a few cases the database was poorly designed, but normalization wasn't the culprit.

Don't fear normalization. Embrace it.

With any denormalization there are some trade-offs.

Assuming that each user has one screen name might be an acceptable restriction. Same for affiliation.

With a separate phone table a user can have as many phone numbers, of as many types, as they want. Not so after denormalization. This might be acceptable, or not.

So, if the perception is that denormalization changes representation without changing functionality, that's incorrect. The functionality changes might be acceptable, or not, but that should be a conscious decision.

Wow…

After I first read this article I wanted to comment that the concept of denormalization being better was not new or all that revolutionary. People working in data warehousing and business intelligence have been practicing denormalization as an optimization for years.

Then I read on and discovered all the loons and morons who think they know a lot more than they do about databases. People who claim that views are recomputed every time a query runs (that'd be you, Steven Bey). Others bragging about their 400,000-pageviews-a-month websites (generously estimated, perhaps all of 12 pages a minute; woah baby, Ben Mills).

Now, before I get too degrading here, let me just get to the point.

As others have already stated, databases have many different purposes. And the way you implement a small or large database is going to be radically different from the way you implement a monstrously huge one. Denormalization is common in production real-world systems.

Wow, I read back and am just astonished at the complete lack of understanding by the average commenter above… Woah…

I don't do much (any) db design, but I do a lot of software design.

It feels to me like there might be a compromise solution that involves creating your source/reference database 100% normalized, then using that to generate pre-calculated tables.

This sounds like even more duplication, but it shouldn't really be, since you only modify your source tables and the optimized tables are 100% generated. This gives you the ability to regenerate the optimized tables any time, and also to change what you generate as you encounter challenges.
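
Something like this is what I'm picturing, as a sketch (table and column names invented): rebuild the optimized table from the normalized source on whatever schedule you like.

```sql
-- Wipe and regenerate the pre-calculated table from the normalized tables.
TRUNCATE TABLE user_profile_flat;

INSERT INTO user_profile_flat (user_id, email, screen_name, phone_number)
SELECT u.user_id, u.email, s.screen_name, p.phone_number
FROM   users u
LEFT JOIN user_screen_names  s ON s.user_id = u.user_id
LEFT JOIN user_phone_numbers p ON p.user_id = u.user_id;
```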

Is this approach at all workable?

"People who claim that views are recomputed every time a query runs (that'd be you Steven Bey)."

In Oracle a view is just a stored query. It is re-executed when it is queried. A materialized view in Oracle has associated storage for the view result set. So maybe you and Steven Bey have a different understanding of the term.

Maybe the example in this posting is just a poor one, but the argument for denormalisation usually comes down to avoiding the overhead of joins. As I posted above, in this case it's trivial to avoid that overhead with a hash cluster. Problem solved.
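
For anyone who hasn't seen one, a hash cluster looks roughly like this (Oracle syntax; the sizes and names are just for illustration):

```sql
-- Both tables are stored in the same hash cluster, keyed on user_id, so a
-- user's row and their phone rows land in the same block: looking up a user
-- is a direct hash probe, and the join costs roughly one extra I/O at most.
CREATE CLUSTER user_cluster (user_id NUMBER)
  SIZE 1024 HASHKEYS 1000000;

CREATE TABLE users (
  user_id     NUMBER PRIMARY KEY,
  screen_name VARCHAR2(64)
)
CLUSTER user_cluster (user_id);

CREATE TABLE user_phone_numbers (
  user_id      NUMBER NOT NULL,
  phone_number VARCHAR2(20)
)
CLUSTER user_cluster (user_id);
```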

For a system like that I don't think there is much question normalization is best. You get the benefits of being able to store multiple phone numbers, addresses, work history… for one user id record, but let's face it, you are almost always going to be finding records based on user_id, which will be the primary key for most tables, so it will be very quick anyway.

As you said, with speed there won't be much difference either way, especially when you factor in the duplication you get filling up extra 8 KB SQL pages with the non-normalized design. By the way, LEFT JOINs, Jeff.
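
i.e. something along these lines, otherwise users missing a phone number or screen name silently drop out of the result (table and column names assumed):

```sql
-- Plain (inner) joins would drop any user who has no phone number or no
-- screen name; LEFT JOINs keep them and return NULLs instead.
SELECT u.user_id, u.email, s.screen_name, p.phone_number
FROM   users u
LEFT JOIN user_screen_names  s ON s.user_id = u.user_id
LEFT JOIN user_phone_numbers p ON p.user_id = u.user_id
WHERE  u.user_id = 42;
```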

The only system I have denormalized for reporting is an electronic attendance system like below:

A register has a number of student IDs
A register has a week range between 1-53 (variable based on when it runs in academic year)
A register has academic year and register ID

By the time you get down to the student marks table it has:
Acad Period, Register ID, Student ID, Week, Mark

You end up with up to 53 mark records per student per register per year. Hundreds of thousands of records which are, for the main part, the key to the table. Pivoting the marks, leaving nulls for the weeks the register wasn't active, cut the time to report students with low attendance to about a twentieth of what it was beforehand.
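
The pivot itself is nothing fancy; roughly this shape, using conditional aggregation (column names approximate):

```sql
-- One row per student per register per year, with one column per week and
-- NULL where the register wasn't active that week.
SELECT acad_period,
       register_id,
       student_id,
       MAX(CASE WHEN week = 1  THEN mark END) AS wk01,
       MAX(CASE WHEN week = 2  THEN mark END) AS wk02,
       -- ... one column per week, up to ...
       MAX(CASE WHEN week = 53 THEN mark END) AS wk53
FROM   student_marks
GROUP  BY acad_period, register_id, student_id;
```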

@Jeff: You hit upon my speciality :slight_smile: See, I said speciality, not specialty, because it sounds cooler.

Anyway, I write social applications for a living. And I've found what works best is:

  1. Put all the indexable stuff into the users table, including zipcode, email_address, even mobile_phone – hey, this is the future! :slight_smile:

  2. Put the rest of the info into a TEXT variable, like extra_info, in JSON format. This could be educational history, or anything else that you would never search by.

  3. If you have specific applications (think facebook), create separate tables for them and join them onto the user table whenever you need.

I actually go beyond this, and implement synchronization with other online identities, like OpenID and Facebook, but that's a bit overkill for this post.
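
Concretely, the layout looks roughly like this (MySQL-flavoured; the column names are just examples):

```sql
CREATE TABLE users (
  user_id       BIGINT UNSIGNED PRIMARY KEY,
  zipcode       VARCHAR(10),
  email_address VARCHAR(255),
  mobile_phone  VARCHAR(20),
  extra_info    TEXT,              -- JSON blob: education history and
                                   -- anything else you never search by
  KEY ix_users_zipcode (zipcode),
  KEY ix_users_email   (email_address)
);

-- App-specific features get their own tables, joined in only when needed.
CREATE TABLE facebook_profiles (
  user_id    BIGINT UNSIGNED PRIMARY KEY,
  fb_user_id BIGINT UNSIGNED NOT NULL,
  CONSTRAINT fk_fb_user FOREIGN KEY (user_id) REFERENCES users (user_id)
);
```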

While this article is very good, there are a number of things it does not address.

How is the data being used? Rapid inserts like Twitter? New user registration? Heavy reporting?

How one stores data vs. how one uses data vs. how one collects data vs. how quickly new data must be visible to the world vs. whether OLTP data should be put into an OLAP cube each night, etc. are all factors that matter.

While it is lovely to normalize data, there is no one-size-fits-all design for every application.

Enjoyed this post; a lot of these comments that treat normalization like a religion are quite the head-scratcher, though. I'm guessing they are from DBAs who are just barely more intelligent than a sysadmin. Learn to think a little.

The best approach usually ends up being a hybrid. You start off with a highly normalized database, work through the usual indexing and views to speed it up, and then start denormalizing a field here or there that requires one too many joins.
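
By "a field here or there" I mean things like this (illustrative names only):

```sql
-- Carry the author's screen name on the posts table so the hot listing
-- query no longer needs the extra join to the screen-name table.
ALTER TABLE posts ADD COLUMN author_screen_name VARCHAR(64);

-- Backfill from the normalized source.
UPDATE posts p
SET    author_screen_name = (SELECT s.screen_name
                             FROM   user_screen_names s
                             WHERE  s.user_id = p.user_id);
```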

So, in essence, Greg, you are throwing up your hands and going, "F this relational crap. I'll do it live."

Sorry, I don't buy it. Sticking serialized data in a column stinks like code smell.

Bottom line is: consider the future expansion of the site. Database design, like programming, is a trade-off. Are you going to introduce multiple X's for every Y? Then perhaps you DO need a join. What will happen when the most common operation happens? What will happen when you introduce a distributed cache, like memcache? You gotta use a cache…

Ah, I did see some sanity in the comments. Your LOGICAL model should always be fully normalized. After all, it is the engine that you derive everything else from. Your PHYSICAL model may be denormalized or use system-specific tools (materialized views, cache tables, etc.) to improve performance, but such things should be done after the application-level solutions are exhausted (page caching, data caches, etc.).

For very large-scale applications I have found that application partitioning can go a long way toward giving levels of scalability that monolithic systems fail to provide: each partition is specialized for the function it provides and can be optimized heavily, and when you need the parts to co-operate you bind the partitions together in higher-level code.

It might be possible to overdo it, but trust me, I have had 20 times the problems with denormalized data than with normalized.

With normalized data, I had a complex join over tables with records in the millions each and had no trouble retrieving results in mere seconds from the 450 MHz server they were still using. I just added the appropriate index (took about 45 seconds) and problem solved. That 2 minutes of work turned into a year of consulting because I fixed it so fast.
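
The fix was nothing more exotic than this (names changed, obviously):

```sql
-- The join column on the multi-million-row child table had no index, so the
-- optimizer was scanning it; one index turned that into an index lookup.
CREATE INDEX ix_orders_customer_id ON orders (customer_id);
```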

But don't get me started on denormalized data. Data is stored willy-nilly throughout the application when it should be in a code table, and then, inevitably, someone wants to do a search on it. Except that people misspelled it, added "The" at the beginning or ", The" at the end, or put two spaces in it a couple of times, etc., etc., etc. At that point, your ENTIRE dataset that the company has spent years gathering is WORTHLESS!

People don't care what you put into a database. They care what you can get out of it. They want the freedom and flexibility to grow their business beyond 3 affiliations. (Oops, you wrote SELECT * in your code; now you might have trouble adding columns.) And have you ever tried to write a query to find everyone in the same affiliation? Hope those are codes and not free text. And I hope the affiliations are indexed if you want any kind of performance on that search. But now you need 3 separate indexes instead of one. And the third one might perform poorly because it is mostly null and contains whatever codes are typical of people with 3 affiliations.
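
Just to spell out what that search ends up looking like against three affiliation columns (column names assumed):

```sql
-- The denormalized design forces an OR across all three columns, and each
-- column needs its own index for this to perform at all.
SELECT user_id
FROM   users
WHERE  affiliation_1 = 'ACM'
   OR  affiliation_2 = 'ACM'
   OR  affiliation_3 = 'ACM';
```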

I could go on and on about the real-world problems I have faced with denormalized data, versus the lack of problems with normalized data.

Say what you want in your blog, but a lifetime of experience says that you are wrong.

And, I've worked with Dare before and he's a cool guy (so no offense, Dare, if you read this), but he writes tools for Microsoft programming languages and stuff. He doesn't maintain software with 10,000,000-row database tables on a daily basis.

I 80% agree with that.
I was always telling my teachers that, trying to convince them they were doing it wrong.
I admit I didn't read all of it, but looking at the diagrams, I can tell you that when you have a 1-to-1 relationship, like UserPhoneNumber, then yes, there is no point, and I would call it an error to have that table.
The only case where you would need to take it outside the [User] table is if that [User] table grows out of proportion AND you often need to fetch user info without the phone number. Otherwise, you are correct to bring the phone number into the main User table.
The tables to the left of [User] (and above and below) should have stayed outside, though. UNLESS you KNOW there will not be more than 3. Also, if you often need to fetch only the phone numbers of users, having them in a mini table will give you another advantage…
But if you don't need the phone number for a query, for example, please use SELECT [field names], and NOT SELECT *!! You will gain another millisecond there!
In the end, it all comes down to a case-by-case db design, there is no formula that will work everywhere with optimum performance.
Consider the number of users connected simultaneously, the expected size of the db, etc., etc.!

dbadmin NNM :stuck_out_tongue: