All Abstractions Are Failed Abstractions

Not to pile on (which of course is exactly what I’m doing) but you can achieve select ID quite easily with Linq to SQL and I’m amazed you’re using Linq to SQL and haven’t realized that yet. There’s a very strong, predictable correlation between your LINQ query and the resulting SQL.

Once you have a better understanding of Linq to SQL, then you might be able to comment intelligently on how leaky of an abstraction it is.

@ Xiani

“Now, I expect this process is more than just a bit optimized - SELECT * is after all a pretty ubiquitous shortcut, but it is still a shortcut and hence has a cost, and it’s one of the first things any performance-obsessed DBA (i.e all of 'em!) will tell you to stop doing.”

IF you are actually getting all the columns, I’d argue that “SELECT *” may be (depending on the DBMS implementation) a MORE performant method than “SELECT COL1, COL2, COL3, COL4, COL5”.

SELECT *:

App passes in simple string.
DBMS parses SQL query string
DBMS sees “*” and translates that to a (cached) array of column references

SELECT [list of named columns]

App constructs longer string (either from a cached array of column references or stored in-memory as the intact SQL string)
DBMS parses SQL query string
DBMS sees “COL1” and translates that to a column reference
DBMS sees “COL2” and translates that to a column reference

The upside in the latter is that the application can rely on column positioning to get specific column values, but that leads to highly fragile code and should generally be avoided by the application programmer.

Not trying to pick nits or anything, but just wanted to point out that optimizing one “layer” of your stack by imposing draconian edicts on other layers isn’t always a net performance gain.

To add: ubiquitous shortcuts are generally a GOOD thing in an environment, as they can be special-cased for performance. Caching the ‘*’ column reference array is a quick and easy special case optimization.

In general, when allowing for highly expressive environments, the more “shortcuts” you provide your developers the more opportunities you have to increase performance when your developers take advantage of those shortcuts. On the opposite perspective, when developing in a well-written system with prominent shortcuts, it is often better to USE the shortcut and thus take advantage of that special casing instead of trying to outthink the system designer. Cases where the above is NOT true should be highly visible in the documentation.

@janos erdelyi - “if i had my DBA hat on, i’d also be carrying a DBA Knife to kill you in your sleep.”

They gave you a DBA Knife? All I got was the stupid Hat! I can see how the Knife would may so many things so much easier, though. Hmmm, I wonder if RedGate sells them…

(seriously, that’s the funniest thing I’ve seen all week)

Great article (again). A willingness to consider abstractions to see if they make life easier, rather than a blanket “that abstraction is leaky so I won’t use it” is essential.
See my post on this mindset in the context of business rules:
http://www.ebizq.net/blogs/decision_management/2009/07/business_rules_are_a_failed_ab.php

JT

There is so much FUD regarding LINQ to SQL in the comments to this post that it would take Several paragraphs to correct them all. I won’t try.

But - going unmentioned - is the use of lambda expressions rather than the pseudo SQL syntax. This allows the build up of complex queries in easy to understand syntax rather. Say goodbye to the typical joins from hell or procedural code in T-SQL - not the best place for it.

Also, the ability to use the SAME abstraction working over in memory collections, XML, and data from the database provide a comprehensive, consistent model for working with collections. This fact seems to be missing from the discussion.

Jeff - I would take up Mr Hanselman on his offer!

Linq2Sql is great for rapid prototyping and throw-away scripts.
But if you have to play the performance game on high volume webservers, you better handtune and use sqldatareaders and other forms of optimizations.
Sometimes you have to import sql-data in your own datastructures to gain the needed performance. An sql-server itself is a big layer of abstraction, which of course often leads to performance-bottlenecks.

@Hanselman, @JohnOpinCar & others:

You have either missed the point entirely or you are just taking a strawman cheapshot at Jeff. Get down from your high horses and re-read what Jeff actually wrote.

Select p vs select p.id doesn’t at all solve Jeffs problem. He wants all the columns. p.id is useful in that once you have found the underlying issue that the abstraction tries to hide you can work around it, but that’s totally beside the point.

The point is that you have to bypass the abstraction in order to find the issue in the first place.

Jeffs claim that lin2sql generates “select *” by default is a valid complaint (if it is accurate, I assume it is). Why doesn’t select p translate into “select [col1], [col2],…” etc? This is a very minor part of what Jeff wrote, though.

That it is possible to work around that with Linq is good in a way, as it means that the leaky abstraction has a band-aid.

If I want to, Linq lets me have to think about every sql weirdness but uses a different syntax that I have to learn - hooray…no wait, that’s just pointless!

However, Linq lets you work with sql in an easy consistent way by abstracting some things for you. This works fine in most cases, which is why linq isn’t pointless.

But is a LEAKY abstraction, and I think Jeff has made a good, valid point. This time…:wink:

"Linq to SQL is a good abstraction"
How so? I’ve never understood how mangling SQL into a new “Linq” structure is somehow an improvement. I love how it fits in to the program - but I don’t see why the structure is so different to SQL.

It reminds me of back in 2002 when one of the VB.Net designers told me that the names of many object oriented features were re-named from C++/C#/standard terminology just so that it was “different”. Their argument was if it wasn’t different it wouldn’t be a different language. Here we have it again with Linq where it is different from SQL just so that it is different. It would be great if they could have based it on SQL - that would have made sense.

So - is Linq’s any good? That comes down to perspective. From a programmer’s perspective it offers ease of programming. From database performance perspective it is a big failure. You point out only some of the performance issues - there are many other performance issues you don’t list.

Linq doesn’t allow cached server side execution, which can be huge for large data sets with complicated relationships. Index and other compiler hints, collation issues and more advanced statements can’t be done (at least not to my knowledge).

I’ve found major slow downs specifically caused by Linq. I’ve done a couple of searches and the general rule of thumb seems to be that data access runs at 50% under Linq compared to not using Linq.

I’m programming for a system that has 4-500 concurrent users on a database with over 100,000 customers and many millions of transactions. So for me the difference can be substantial. A SQL stored procedure, properly tunned and cached, may run in 10 seconds while the same command issued through Linq may take a minute or more to run.

Maybe this wouldn’t matter so much on a smaller system.

@Stephen says…

there is a quite small limit to how often you can
just “add a server” and see a real increase.

Of course that is true. Generally speaking this problem is the inability to run operations in parallel due to interdependency. However, generally speaking database type applications are quite amenable to parallelization. This is due to the fact that reads are far more prevalent than writes (and the fact that SELECT is largely declarative rather than procedural.) Consequently, database performance can often be improved by adding read only servers behind a good load balancing scheme.

In any application there tends to be a few small bottlenecks, and you should apply your “fix the nasty code” process there. Note that often fixing bottlenecks introduces nasty code, of necessity. (For example, introducing code that is highly dependent on the implementation details of a particular server, or breaking the abstraction of the database with such things as nolock.)

I recently upgraded our ERP server from a quad-core …
The net result? A 5% increase in performance.

Two things – I would not have gone with a big iron server, but rather multiple smaller, simpler servers. Secondly, if that is the case your problem is most likely very nasty code with lots of unnecessary dependencies. This is the sort of code that is written by people who obsess about squeezing every ounce of performance out of a server, which usually means they do so with premature optimization, leading to horrible, nasty, unmaintainable, and tangled code.

I know nothing more about your situation that what you told me, but I would make an educated guess that that was your situation. I’d recommend that you build a quality test case suite, and then start refactoring the heck out of the code.

I think it’s because the underlying thing being abstracted is really general (e.g., a pencil, a microprocessor) and the abstraction is less general and more oriented towards a particular way of doing something (e.g., word processor, which works in lines of letters instead of line drawings, or shading, or whatever; higher-level language, which may not let you grab those bits and do things with and to them like you can with assembler), so you end up using the abstraction like a waldo or it’s like using a wrench as a hammer: it feels horribly inefficient and wrong. The abstraction is more specialized. So you have to understand the underlying thing really really well and the idea of the abstraction really really well in order to push the abstraction around to do what you could do pretty easily with the underlying thing. I am thinking of a particular circa 1980s Jovial compiler …

Microsoft WANTS developers to use SQL; thye just recommend that it be done in a controlled manner: through stored procedures. LINQ struck those of us in the DBA realm as being completely contradictory to Microsoft’s prior recommendations, and apparently there’s a bit of a scism in Redmond about it.

In my opinion, it falls in the same category as the “4GL” tools of the 80’s: it’s a tool to allow poor programmers to write bad code faster.

@Philip - LINQ2SQL != LINQ. LINQ is a generalised querying language for working with many different types of data in an identical manner, from in-memory collections to back-end databases. That’s the whole point - make sure you know what you’re criticising first.

LINQ2SQL allows for the use of Stored Procedures (see http://weblogs.asp.net/scottgu/archive/2007/08/16/linq-to-sql-part-6-retrieving-data-using-stored-procedures.aspx) if you want generation of query plans ahead of time etc.

Could you provide the links where a 50% speedup is proven for using raw data readers as opposed to LINQ? I’m betting that you can’t - MS wouldn’t release LINQ2SQL with anything like that sort of overhead.

Enough with the FUD. Learn about the technology before trying to shoot it down.

Oh and where’s Jeff in this discussion? He seems to have gone very quiet. I would have thought he would at least have corrected the post after so many people pointed out his mistakes.

@David R.

Re-reading my post I can see that I should have been a lot more explicit.

Seeing as the topic was specifically about Linq to SQL I thought it was self evident that this was the subject. I like linq, and I’m not saying anything negative about it in general, just the use of SQL.

As for use of stored procedures through linq, and the use of compiled statemetns - they simply don’t work anywhere near as fast. BUT - I was refering to linq statements - compiled or not. (I probably should have stated that).

Please don’t assume that we are random users making comments about things we don’t understand. I’ve been doing DBA on SQL for a decade so I’m capable of listinging to what others say and then doing my own tests to see what is correct. I LIKE LINQ (both generally and for SQL) - but it simply doesn’t perform for me. I HOPE future versions do.

So - no FUD involved - these are tests I have done. I simply can not get LINQ to perform anywhere near the speed of other methods. Please try it yourself before condemming me as a FUD maker. I even explained my situation to make it clear that it may only be upper end users have have to worry about this.

Also, before you condem that we haven’t learnt the techology it would pay for you to do some quick google searches to find out what the actual state of play is.

For instance,I did a google search and turned up something right away
http://www.devtoolshed.com/content/performance-benchmarks-linq-vs-sqldatareader-dataset-selects-part-1

http://www.devtoolshed.com/node/12

I know little about LINQ or SQL, but there seems to be a logical fallacy in this post.

The argument is that LINQ is less efficient than SQL, but to optimise the SQL you have to make 48 individual queries to extract 48 rows. Can this be done in a single SQL statement? If not, doesn’t that make SQL as much of a “failed” abstraction as LINQ?

By the way, you never state what the time is for the actual 48 query option. Why not? You surely don’t think the equation “(48 * 3 ms) + 260 ms = 404 ms”, is realistic.

IMHO abstractions aren’t fundamentally leaky, the humans who implement them are.

If "SELECT TOP 48 * " is really slower (instead of faster) than “SELECT TOP 48 id” plus 48* “SELECT * WHERE id=”, something must be badly broken in your database. The second combination creates by far more work for the database. I can only guess the experienced difference is because of caching effects.

@Dave R

I think we aren’t entirely in disagreement.

I explicitly detailed the size of my data and stated that my findings may only be due to size of the data I have.

You accuse me of FUD in general, and I disagree. But I actually agree that I am generating FUD in relation to LINQ2SQL for large databases/data warehouses. You are right! I am clearly saying that it is an inappropriate technology for large databases (Doubt). I am clearly saying that it runs slower than other more appropriate technologies on larger databases (Fear). I am saying that the cost/benefit is unknown without the developer first doing benchmarking and testing (Uncertainty). YES - I am preaching FUD for Linq2sql for large databases!

I chose those links because they were some of the first ones I googled. I could probably find more - but I don’t have the time to prove myself right when I have already done that ON MY DATA. For all in know LINQ2SQL may be faster in some situations, that’s why I STILL test alternative options in order to get the desired performance.

You said the links only used a few rows, and that’s a valid issue. So, did YOU repeat the tests on that page with extra data? Did you add in 100,000 to 100,000,000 records and then look at the results? If not then give it a go - all the code was there to do just that. You don’t have to believe the site author, you CAN do the test yourself on data that matches the size of tables you deal with.

Chances are you can’t be bothered, and I don’t blame you. I only started benchmarking and testing when I first noticed massive slowdowns when using LINQ2SQL, which is what woke me up to the issue.

I actually agree that there is a balance between the benefits of LINQ2SQL. I DO use Linq2sql for small/quick tasks, the problem is that majority of my data tasks are over large volumes of data. If something takes at least 16 ms to run… you could make it 10 times slower and I’m not going to care. BUT if something takes at least 60 seconds to run, you can put money on the fact I will be doing everything I can to make sure that’s all it takes to run.

You mentioned the “small” penalty of LINQ, and this is where the size of the data becomes important. For large data it isn’t a small penalty to use LINQ, it is a large one.

To be very specific, I have one table where the current row count is 3,505,920, and another that is 22,381,736. These tables are accessed very frequently. I simply can’t get the performance out of LINQ2SQL to justify its use.

I have several hundred concurrent users. I am about to have maybe 800 mobile vehicles that log their gps position every 30 seconds AND provide extra data and GPS positions during specific events - producing a mind blowing amount of daily data.

So for me the speed impact of LINQ matters. StackOverflow? I don’t know - that’s up to them to do their own technology comparisons and tests. For you? I don’t know.

If you take it that I am generating FUD – fair enough, but the fear, uncertainty and doubt can all be eliminated by some very simple benchmarks that data developers should be doing.

@Philip - did you actually read the sites you linked to? The guy’s tests are hardly real-world. He does 10000 SELECTs on a single 25-row table using three different methods. If you’re re-running a single query 10000 times using LINQ without compiling your query, the setup costs are going to be repeated 10000 times and amplified.

If you’re generalising that all LINQ2SQL queries exhibit a 50% slowdown based on this, then, yes, I’m going to accuse you of FUD, or at least lack of knowledge about the technology involved. And it really doesn’t matter how many years you’ve done SQL - this argument is about LINQ2SQL, not T-SQL. I’m still waiting for the stats to support your argument - I’m guessing that you can’t find them.

It would be interesting to see what sort of effect switching from LINQ2SQL to ‘raw’ datareaders would have on a site such as StackOverflow. But I’m betting that the developers are happy to pay a small penalty for the increase in productivity afforded to them.

@Dave R

I think we aren’t entirely in disagreement.

I explicitly detailed the size of my data and stated that my findings may only be due to size of the data I have.

You accuse me of FUD in general, and I disagree. But I actually agree that I am generating FUD in relation to LINQ2SQL for large databases/data warehouses. You are right! I am clearly saying that it is an inappropriate technology for large databases (Doubt). I am clearly saying that it runs slower than other more appropriate technologies on larger databases (Fear). I am saying that the cost/benefit is unknown without the developer first doing benchmarking and testing (Uncertainty). YES - I am preaching FUD for Linq2sql for large databases!

I chose those links because they were some of the first ones I googled. I could probably find more - but I don’t have the time to prove myself right when I have already done that ON MY DATA. For all in know LINQ2SQL may be faster in some situations, that’s why I STILL test alternative options in order to get the desired performance.

You said the links only used a few rows, and that’s a valid issue. So, did YOU repeat the tests on that page with extra data? Did you add in 100,000 to 100,000,000 records and then look at the results? If not then give it a go - all the code was there to do just that. You don’t have to believe the site author, you CAN do the test yourself on data that matches the size of tables you deal with.

Chances are you can’t be bothered, and I don’t blame you. I only started benchmarking and testing when I first noticed massive slowdowns when using LINQ2SQL, which is what woke me up to the issue.

I actually agree that there is a balance between the benefits of LINQ2SQL. I DO use Linq2sql for small/quick tasks, the problem is that majority of my data tasks are over large volumes of data. If something takes at least 16 ms to run… you could make it 10 times slower and I’m not going to care. BUT if something takes at least 60 seconds to run, you can put money on the fact I will be doing everything I can to make sure that’s all it takes to run.

You mentioned the “small” penalty of LINQ, and this is where the size of the data becomes important. For large data it isn’t a small penalty to use LINQ, it is a large one.

To be very specific, I have one table where the current row count is 3,505,920, and another that is 22,381,736. These tables are accessed very frequently. I simply can’t get the performance out of LINQ2SQL to justify its use.

I have several hundred concurrent users. I am about to have maybe 800 mobile vehicles that log their gps position every 30 seconds AND provide extra data and GPS positions during specific events - producing a mind blowing amount of daily data.

So for me the speed impact of LINQ matters. StackOverflow? I don’t know - that’s up to them to do their own technology comparisons and tests. For you? I don’t know.

If you take it that I am generating FUD – fair enough, but the fear, uncertainty and doubt can all be eliminated by some very simple benchmarks that data developers should be doing.

This is my last post on this page - I’ll come back to read if you want to post anything else, but I think I’ve made my case as clear as I’m able to.