All Abstractions Are Failed Abstractions

This was a good read! I still don’t use LINQ to SQL.

The problem with LINQ is that you have to learn LINQ, and you still have to know SQL in detail in order to optimize and debug LINQ queries.
So if the LINQ abstraction doesn't protect you from the complexity of dealing with SQL, then what's the point of dealing with LINQ?

I'm not surprised that Microsoft is on its way to abandoning LINQ.

Matt’s right; you seem to have shot your thesis to hell by having identical results after removing Linq from the equation. It looks like you want to blame Linq, realize you can’t, then just throw up your hands and say “it’s all just bad abstractions!”

It looks to me like the problem is in the SQL engine. I’d be interested to see the timing of the same operations in different SQL engines (SQL Server, Oracle, MySQL, PostgreSQL), controlling for syntax and feature variations between them.

By the way, the ORM I’m most familiar with is ActiveRecord, which is good about making the common things simple while allowing (or encouraging) dropping to SQL for the more complex things.

I hate to nag, but I think you misunderstood Joel's point. I think he was getting at the very fact you pointed out: you have to be careful about treating abstractions as if they're really as clean as they look. You really need to understand how an abstraction meshes with at least a layer or two above and below it to avoid bad algorithms and performance-sapping holes.

@Dennis,

Can you provide any evidence that Microsoft is abandoning LINQ? I know they are combining the LINQ-to-SQL and the EF teams, but I have not heard anything about the abandonment of LINQ.

Am I the only one here grossed out by this? If I had my DBA hat on, I'd also be carrying a DBA knife to kill you in your sleep.

I understand the programmer allure, but damn… * selects by default?

Can't you just select the ID in the LINQ to SQL query?

var posts =
    (from p in DB.Posts
     where p.PostTypeId == PostTypeId.Question &&
           p.DeletionDate == null &&
           p.Score >= minscore
     orderby p.LastActivityDate descending
     select p.Id)          // magic here: select only the ID
    .Take(maxposts);

If you want perfect abstractions, start programming this: http://opa.yale.edu/news/article.aspx?id=6764

I don’t think the statement: “All non-trivial abstractions, to some degree, are leaky.” is true.

Some abstractions are ill-fitting and leaky, but not all. You can, for instance, use C to complete any work done in assembler. There may actually be a few OS-specific usages, but once encapsulated they don't leak out past the OS. With a higher abstraction like C, you can get more work accomplished (and it is more portable).

Regular expressions are another example of an abstraction that fully covers its expression space. Of course, if you try to go beyond their expressive capabilities and attempt something crazy like writing a C compiler entirely out of REs, you will fail. But that's not leaking; it's just misuse, somewhat equivalent to using a hammer to pound in screws.

Paul.

Why not just use SQL? It’s much more straightforward. I don’t understand the desire to rewrite the SQL to some object notation. All it does is add unnecessary overhead to the application. Is looping through a SQL query really all that difficult?
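For the simple cases it really is just a handful of lines. A rough sketch using plain ADO.NET (the connection string, table, and column names here are only placeholders):

// Sketch only: connection string, table, and column names are placeholders.
using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Blog;Integrated Security=True"))
using (var cmd = new SqlCommand(
    "SELECT TOP 48 Id, Title FROM Posts WHERE Score >= @minscore ORDER BY LastActivityDate DESC",
    conn))
{
    cmd.Parameters.AddWithValue("@minscore", minscore);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            int id = reader.GetInt32(0);        // loop straight over the rows
            string title = reader.GetString(1);
            // ... render the question ...
        }
    }
}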

You always have to understand the details behind an abstraction to use it well. Java resource leaks are a really good example of this. "You don't have to allocate your own memory!" but you'd better understand how you're using it.

I don’t think this means an abstraction has failed, it just means you can’t pretend complexity doesn’t exist.

In my experience Linq doesn't generate * selects if you've made a proper DBML, but I can't say for sure right now.

However, if you want to retrieve only the IDs, LINQ2SQL gives you a couple of pretty good options. You can either use a stored procedure method, in which case your SQL gets validated by the database, or you can just create a smaller PostStub class only with the bits that you need.
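For example, something along these lines, where PostStub is a hypothetical class holding only the fields you care about:

// Hypothetical stub type: just the columns this page actually needs.
public class PostStub
{
    public int Id { get; set; }
    public DateTime LastActivityDate { get; set; }
}

var stubs = (from p in DB.Posts
             where p.Score >= minscore
             orderby p.LastActivityDate descending
             select new PostStub { Id = p.Id, LastActivityDate = p.LastActivityDate })
            .Take(maxposts);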

As far as abstractions go, LINQ2SQL is awesome. It’s not complex enough to be hard to manage, and it’s still got enough features that you can massage it into doing whatever you want in a fairly elegant way. What else could you want?

Also, a good framework just tries to conform to the 90/10 rule: it ought to solve the problem easily 90% of the time, and ensure that the other 10% of the time the problem is still solvable, even if it's awkward or hard.

Jeff, first off I just wanted to say that I’ve been following your blog for a while now and I’ve enjoyed your articles. Keep up the good work.

As for the matter at hand, I won't argue the leaky abstraction part, but I do question your implementation. Why not have the top 48 questions cached in memory on the application server? Then have a background thread that updates that list every N seconds. Is it really so important that every user request gets the exact most recent 48 questions at that moment? Can't you live with a little latency in the name of performance? Can the user not live with a list that is 3 seconds out of date (hell, it might be out of date by the time it takes to service his/her request anyway)?
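Something like this would be enough. A rough sketch, where LoadTopQuestions stands in for the existing query and the 30-second interval is just a guess:

// Rough sketch: keep the hot question list in memory and refresh it in the background.
static System.Threading.Timer refreshTimer;
static volatile List<Post> cachedTopQuestions = new List<Post>();

static void StartRefresh()
{
    refreshTimer = new System.Threading.Timer(
        _ => cachedTopQuestions = LoadTopQuestions(),   // LoadTopQuestions = the existing LINQ/SQL query
        null, TimeSpan.Zero, TimeSpan.FromSeconds(30)); // hypothetical refresh interval
}

// Every page request just reads the cached list -- no database hit.
static List<Post> GetCachedTopQuestions()
{
    return cachedTopQuestions;
}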

I agree that, looked at closely enough, all abstractions leak. However, there may be other approaches to the problem that allow you to sidestep that leak.

I think the native integer size of the processor would affect the bignum class, in that operations that fit within the native size would run more quickly than those that don’t. Here’s a test, in Python:
bignum = 2**20
count = 10000000
while count:
    x = bignum + 1
    count -= 1

This takes about 3.5 seconds on my computer for all bignum values up to 2**30. If you set bignum to 2**31, it takes closer to 5 seconds, and continues to take about that long as you increase bignum. Python's arbitrary-size integer support is a leaky abstraction, in that it's slower to do anything with large numbers than it is with small numbers.

@janos erdelyi

Indeed.

What is actually going on behind the scenes in SQL Server when you SELECT * is an initial lookup against the system tables to get references to all the columns in your specified table, which is then used to build the explicit list of columns for the query you meant.

Now, I expect this process is more than just a bit optimized (SELECT * is, after all, a pretty ubiquitous shortcut), but it is still a shortcut and hence has a cost, and it's one of the first things any performance-obsessed DBA (i.e. all of 'em!) will tell you to stop doing.

As to why just getting the ID column is quicker, that's because for most common SQL Server DB designs an ID/autonumber field serves as both the primary key and the clustered index for that table, which is going to get massively optimized as well.

I find that for any complex query across very large data sets, the only acceptable solution is to first fetch the set of IDs/keys that match the filter criteria, stick them in a temporary table, and then join that back to the original table (and, more likely, any related ones too), selecting only the named list of specific columns you need.

And ideally this should all be done within a stored procedure, so you get the benefit of a pre-compiled SQL query plan as well as not having to make multiple network calls from the web server to the DB server.
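From the application side that collapses to a single round trip, roughly like this (GetTopQuestions is a hypothetical procedure name, and connectionString and the parameter are just illustrative):

// Sketch: one network call; the temp-table work lives inside the stored procedure.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetTopQuestions", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@minscore", minscore);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // read only the named columns the procedure returns
        }
    }
}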

I have no idea whether LINQ supports the column/field mapping + editor integration with stored procedure outputs, but if it doesn’t I won’t even consider it.

I don’t see anyone else commenting on this, but why call ToList on an IQueryable object? IQueryable implements IEnumerable, so calling ToList basically forces you to process the list of results twice:

  • Once to convert the result to a List
  • Once to process the items in the for loop

Since this is an article aimed at performance, wouldn't dropping the ToList call be a quick (if somewhat trivial) win?
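In other words, it's roughly the difference between these two (a sketch; the query here is just illustrative):

// Same query, two ways of walking the results.
var query = from p in DB.Posts
            where p.Score >= minscore
            select p;

// With ToList: every row is first copied into a List, then the loop walks that list again.
foreach (var post in query.ToList()) { /* ... */ }

// Without ToList: each row is handled as it streams back from the provider.
foreach (var post in query) { /* ... */ }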

Most abstractions have a tradeoff. Often the cost is less flexibility and lower performance, but what you get is usually productivity, reliability, safety, etc. Over time, though, the tradeoff goes from being a bad idea in all but trivial situations to being the status quo. You seem to have left out the fact that SQL itself is an abstraction, and it is even used as an example in the linked Joel Spolsky article!

“The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to define merely what you want and let the database figure out the procedural steps to query it. But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries. A famous example of this is that some SQL servers are dramatically faster if you specify “where a=b and b=c and a=c” than if you only specify “where a=b and b=c” even though the result set is the same. You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.”

But the same is true about most programming languages. Look at C. Originally, programmers still dipped down to assembly for the most intensive code. That started to happen less over time as computers got faster. Then people started building on C, and C was the new “lowest level”. It happens all the time.

Your conclusion shows that you and I aren't actually in disagreement. Some leaks get fixed, some leaks stop mattering. Over time, some abstractions get thrown away, and others get built on with more abstractions. What I will disagree with you on is your title, which I'm sure was just intended to be a little inflammatory. Just because it leaks does not mean it failed.

It still seems to me that the cons of this particular abstraction outweigh the pros.

While I agree the database is the main suspect, I don’t think it is guilty of a performance offence.

During the second example, the database might have cached the row ids from the first query knowing you were going to ask for the full rows in the next query.

It might not have had the rows already in hand, but it knew exactly where on disk to retrieve them from.

Disclaimer: I'm not a database developer, but that's what I'd do if I were one.