Why Can't Database Tables Index Themselves?

Here's a thought question for today: why can't database tables index themselves?

Obviously, indexes are central to databases and database performance. But horror tales still abound of naive developers who "forget" to index their tables, and encounter massive performance and scalability problems down the road as their tables grow. I've run into it personally, and I've read plenty of sad tales of woe from other developers who have, too. I've also forgotten to build indexes on non-primary-key columns myself many times. Why aren't databases smart enough to automatically protect themselves from this?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/07/why-cant-database-tables-index-themselves.html

DB2 calls it Configuration Advisor. The notion of self-tuning databases (kind of an oxymoron, if you ask some folk) is gaining currency. But such machinations, of necessity, take a simplistic (e.g. file) view of tables.

Nice idea, but I guess one problem is the overhead of doing this dynamically. SQL Server 2005 can create an index on a table without taking an exclusive lock on it (which would stop anyone else reading the data…), but prior versions can’t do this; for them, creating an index isn’t an online operation…
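
For anyone who hasn’t seen it, the syntax is roughly this (table and column names are invented for illustration, and as far as I know the ONLINE option needs Enterprise Edition):

```sql
-- Hypothetical table/column names, for illustration only.
-- SQL Server 2005 Enterprise Edition can build the index online,
-- so readers and writers keep using dbo.Orders during the build:
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    WITH (ONLINE = ON);

-- Leave off WITH (ONLINE = ON), or run on an earlier version, and you
-- get the offline build, which holds table locks for the duration.
```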

For a big table with millions of rows, creating an index takes quite a while, so you’re much better off spending the time during development making sure everything is indexed the way it should be, rather than getting the double whammy of the system virtually grinding to a halt because it suddenly runs a query that requires a table scan due to a missing index… and then the system jumping in to create the missing index, causing more overhead and/or blocking.

I can’t see this idea flying until databases are smart enough to be trusted with managing their own workload a bit more intelligently. DBAs don’t like the DB server doing things that they didn’t expect.

“If so, determine which field(s) could be indexed, for that particular query, to remove the need for a table scan”

If there are 10 queries, each requiring a different "optimal" index, and each having approximately the same hit rate, the database would end up creating 10 different indexes!

Besides, indexes play a huge role in database capacity planning; an on-the-fly index creation engine might throw a spanner in automated backups, etc.

Just my 2 cents.

I have more or less given up creating indexes myself. I just let SQL Server’s tuning advisor decide for me.

As a programmer, I have long since come to grips with the fact that the compiler will always be better at low-level optimizations than I am. Since indexes are a low-level optimization (as in, there is no functional change), I see no reason not to let the database handle that as well.

The only difference between what you want and what we have is that you have to manually hit the button from time to time. So I think self-tuning indexes are a possibility for the future.

Sorry, wrong Advisor (there are so many): Design Advisor. But this is from a thread at com.databases.ibm-db2:

Tuning is an art. Why would you want to use a robot?

“If there are 10 queries, each requiring a different “optimal” index, and each having approximately the same hit rate, the database would end up creating 10 different indexes!”

Not necessarily. Most of the time, two or three queries will run more often than the others. The indexes for frequently used queries would get a higher priority.

MySQL has this functionality.

Index creation tuning is a scenario-based task. As Buggy Fun Bunny said, it’s an art, the same as network design or app design.

Analysis Services for MS SQL Server (their cube/OLAP product) does this mostly automagically. I always found it to be mediocre at best, but I hear it has gotten better since I played with it.

Similarly, I agree with Jonathan Allen above, who said that he mostly allows the SQL Server Tuning Wizard to do his indexes for him. I will occasionally (read: semi-annually) let the system run for a day with a trace running. Then I just have to come in the next morning and let the analyzer re-tune everything. I always double-check the suggestions for sanity, of course, so I’m not a total quiche-eater.

Considering that the great majority of my applications use queries that are simple SELECTs or SELECTs with JOINs, database index auto-tuning would not be difficult and could save me some work. Good idea :)
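
To make that concrete, here’s the sort of case I mean (schema and names are made up):

```sql
-- Hypothetical schema: Orders references Customers via CustomerId.
-- Run constantly, this query table-scans Orders if the foreign key
-- column has no index:
SELECT  c.Name, o.OrderDate, o.Total
FROM    dbo.Orders AS o
JOIN    dbo.Customers AS c ON c.CustomerId = o.CustomerId
WHERE   o.CustomerId = 42;

-- The "obvious" fix an auto-indexer (or a human) would apply:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);
```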

The idea has been advanced recently on Reg Developer to just index everything instead of indexing nothing. (Of course, you might still have to pick and choose among multi-column indices; you’d hardly want every possible permutation of columns indexed.) As long as reading is 80%+ of requests, it should make performance sense.

Every server already logs performance data for query caches; it can’t be hard to analyse it at idle or off-peak times and start building an index. It wouldn’t drag performance down any more than using the server while rebuilding a RAID array.

Microsoft SQL Server actually has a “statistics” feature that, behind the scenes, evidently acts something like an index in the creation of query plans. (I’m not enough of an expert on the innards of SQL Server to explain it better than that.) This appears to relate to your idea of auto indexes. I recently posted this to my blog, which explains how to get around a “too many indexes” error in a SQL Server table that has a ridiculous number of columns (I have had the pleasure of inheriting a couple of these):

http://www.developerdotstar.com/community/node/503

Most people will want to skip reading the post unless you’re having the specific problem I describe there; the gist of it is that on a table with a huge number of columns, SQL Server maxes out the number of allowed indexes on the table with its behind-the-scenes column-specific “statistics.” My point in bringing it up is that it seems to me that SQL Server treats these column-specific “statistics” as a kind of index; otherwise, why would you have to delete one of the statistics (as described in my post) before creating a new explicit index on the table?
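
If you want to see those auto-created statistics for yourself on SQL Server 2005, something like this should list them (the _WA_Sys_… names are the ones the engine generates on its own; on SQL Server 2000 the equivalent rows live in sysindexes, which is presumably why they collide with the per-table index limit I describe):

```sql
-- List statistics SQL Server created automatically (SQL Server 2005 catalog views).
SELECT  OBJECT_NAME(s.object_id) AS table_name,
        s.name                   AS statistics_name,
        c.name                   AS column_name
FROM    sys.stats         AS s
JOIN    sys.stats_columns AS sc ON sc.object_id = s.object_id
                               AND sc.stats_id  = s.stats_id
JOIN    sys.columns       AS c  ON c.object_id  = sc.object_id
                               AND c.column_id  = sc.column_id
WHERE   s.auto_created = 1
ORDER BY table_name, statistics_name;
```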

Dan

Rick O, SSAS creates read-only OLAP DBs, so it’s a whole lot easier to have many indexes when you are not getting inserts and updates, not to mention that they’re compressed.

There’s a lot of other information available to help decide on potential indexes. For example, the statistics that Mr. Read refers to would inform the system that the spread of values in a given column isn’t selective enough, so an index there wouldn’t help.

However, I don’t see any discussion here about the COST of maintaining an index. For systems that are heavy on data retrieval, this may be fine, but for high-volume transactional systems, the cost of INSERT/UPDATE/DELETE may be critical.
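
That maintenance cost is something the server already tracks, too. On SQL Server 2005 you can compare how often each index is read versus how often it has to be maintained, roughly like this (counters reset when the instance restarts, so it only reflects activity since then):

```sql
-- Reads vs. writes per index in the current database (SQL Server 2005 DMVs).
SELECT  OBJECT_NAME(u.object_id)                      AS table_name,
        i.name                                        AS index_name,
        u.user_seeks + u.user_scans + u.user_lookups  AS reads,
        u.user_updates                                AS writes
FROM    sys.dm_db_index_usage_stats AS u
JOIN    sys.indexes AS i ON i.object_id = u.object_id
                        AND i.index_id  = u.index_id
WHERE   u.database_id = DB_ID()
ORDER BY u.user_updates DESC;  -- heavily written, rarely read indexes are pure overhead
```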

Because the robot gets us 90% of the way there. E.g., it creates obvious missing indexes that are killing perf. That’s a heck of a lot farther than the 0-10% most naive DB users are at!

Why would you sacrifice the 90% for the 10%? Particularly since you can have both, it’s not an either/or scenario.

Jeff Lewis, can you elaborate on this… provide a URL or other reference?


Right, I expected people to bring up the classic tradeoff of indexes: the more indexes you have, the slower your inserts and updates are.

But certainly you can envision a DB server that has certain thresholds of (table scan frequency and cost) divided by (update and insert frequency and cost) and makes a reasonable decision about what to do. If you have a table that’s read 1000x a day but written to once a week, obviously you can err on the side of indexes.

The database servers have all this data; I say they should act on it using some basic, configurable rules.
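
A rough sketch of what such a rule could look like, using the missing-index DMVs SQL Server 2005 already exposes. The scoring formula and the cutoff value here are arbitrary placeholders, not anything the engine actually does; the write-cost side of the ratio would come from sys.dm_db_index_usage_stats, as shown earlier in the thread:

```sql
-- Indexes the optimizer wished it had, scored by a rough benefit estimate.
SELECT  d.[statement]                                 AS table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        gs.user_seeks + gs.user_scans                 AS times_wanted,
        gs.avg_total_user_cost * gs.avg_user_impact
            * (gs.user_seeks + gs.user_scans)         AS rough_benefit
FROM    sys.dm_db_missing_index_details     AS d
JOIN    sys.dm_db_missing_index_groups      AS g  ON g.index_handle  = d.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS gs ON gs.group_handle = g.index_group_handle
WHERE   gs.avg_total_user_cost * gs.avg_user_impact
            * (gs.user_seeks + gs.user_scans) > 10000  -- arbitrary threshold
ORDER BY rough_benefit DESC;
```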

I was reading this blog post and thought: “Where’s the WTF? This sounds reasonable, I don’t get it.”

Then I realised I wasn’t reading The Daily WTF.

More coffee this way please!


Well, see, I sort of agree with the concept of self-tuning, to a degree. I think it would be more helpful for the database to tell the DBA what index to create based on use, as opposed to automatically doing it. Why? Because the creation of an index can be extremely costly if the database is actually being (gasp) used. Imagine you are taking in sales orders on your e-commerce site and all of a sudden the db decides it needs a new index. Table locks… can’t process orders… people get cranky because they are losing out on sales… etc. I think it would be really useful, though, for the database to submit the feedback to a DBA so that the index can be optimized in a controlled way.

just my 2 cents.

I haven’t worked with it myself, but my ETL colleagues tell me Teradata and Netezza are as close to self-indexing as databases can get. Evidently everything is indexed on every column automatically. Of course, they ain’t cheap.