Everything Is Fast For Small n

The always forgotten algorithm: “counting sort”. In my opinion it’s the fastest way to sort integers that fall in a known range (e.g. between 0 and 100).
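
A minimal sketch of the idea in Python, assuming non-negative integer keys with a known upper bound (the names here are just for illustration):

```python
def counting_sort(values, max_value):
    """Sort integers in the range [0, max_value] in O(n + k) time and O(k) extra space."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1                 # tally how many times each key appears
    result = []
    for key, count in enumerate(counts):
        result.extend([key] * count)   # emit each key as many times as it was counted
    return result

# Keys known to lie between 0 and 100
print(counting_sort([42, 7, 7, 100, 0, 13], 100))  # [0, 7, 7, 13, 42, 100]
```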

God, the number of times I’ve tried to explain to management why a QA database 1/20th the size of the production one isn’t perfectly adequate for testing…

Movable Type regenerates thousands of static HTML pages with every minor change

I’m not sure that’s the case…

I have zero section entries on this blog. No tags, no sections. I’ve deleted them all, even the templates. There’s the main index, and the individual entry, and the “all posts” page. That’s it.

When a comment is entered, all it needs to do is regenerate the main page (to update the recent comment count) and the actual entry page.

Ahh…the infamous Shlemiel the painter’s algorithm.

(see Joel’s infamous “Back to Basics” post… from way back in 2001)
http://www.joelonsoftware.com/articles/fog0000000319.html

I’ve seen this over and over, both doing QA and as an “optimization engineer” (my current gig). I just can’t quite wrap my brain around the blind spot we all have for small data sets… I’ve been as guilty as the rest.

I think I’m going to take that execution time chart, print it out on our plotter, and post it on my cube wall with the title “Everything is fast for small n”.

It’ll become my new mantra. I’ll use it in meetings. Slip it into casual conversation. My wife will love it. My kids will sagely repeat it. My dog will go on command for it.

Ok, maybe we’ll just stick with the wall chart.

M said:

How can anyone actually manage to get a job as a programmer and not know the basic facts of sorting?

Because there are way more programmer jobs than there are good programmers. Worse, a lot of times, neither the manager nor the HR person knows the first thing about programming themselves. There’s a widespread myth that programming is something you can just “pick up” over a few weekends.

@Aaron G:
Yes, obviously if you are hitting a section of code very frequently and your system runs slow, then that code is a good candidate for optimisation.

My point, though, was that sometimes (for a small dataset) a linear search is a better choice. For example, if I used a dictionary approach then that may cause the dictionary library to be linked in, increasing the size of my program and its memory usage (which is still a consideration for the embedded systems I work on). Other complications and overhead may come from having to wrap primitive variables to store them in the dictionary and from dealing with exceptions or error conditions.
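
As a rough, hypothetical illustration of that trade-off (in Python rather than embedded C, so the linking and memory costs mentioned above don’t show up here): measure both approaches at your actual n before assuming the fancier structure wins.

```python
import timeit

# A handful of entries, as you might have in a small configuration table
items = [("alpha", 1), ("beta", 2), ("gamma", 3), ("delta", 4), ("epsilon", 5)]

def linear_lookup(key):
    # O(n) scan, but no extra structure to build, store, or link in
    for k, v in items:
        if k == key:
            return v
    return None

table = dict(items)  # O(1) average lookups, at the cost of building and carrying the table

print("linear:", timeit.timeit(lambda: linear_lookup("delta"), number=100_000))
print("dict:  ", timeit.timeit(lambda: table.get("delta"), number=100_000))
```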

To those saying “This is obvious. Every programmer should know big O.”:
You might have a perfect grasp of big O and be absolutely convinced that your algorithm will scale well, but you should still test it with large data regardless. There is always a difference between theory and reality (e.g. the theory may not consider the slow down caused by paging or database cursor moving).

re: Mike Ford’s analysis.

Nice work. Optimize only when you have to. The only way to know if you have to is to profile the app so you know where the hot spots are.

Perhaps a bit off topic, but… I just hope when you decided not to optimize that particular routine, you at least placed a comment in the code explaining why you found it to be unnecessary. That would keep other well-meaning coders from coming along later and optimizing it anyway, when the boss says, “make the app faster.” That would be a time waste, and a quick comment could prevent that. If your comment has specifics, like “.1 ms on the sort and 6 ms to draw,” it helps drive the point home.

In fact, the original coder(s) may have done the exact analysis that you did, then decided on the simple approach because the numbers were small. Had they commented their thinking, you wouldn’t have had to spend your time going over the same ground.

Fantastic article on big O notation in layperson’s terms. As a CS dropout, that was one of the things I never really ‘got’.

First, a broad observation - even with small test datasets, and even if you don’t know your algorithm’s O(), you can spot trouble coming by simply testing different sizes of data. If doubling your test set causes your runtime to double, you have evidence of O(n); better than that, and you approach scalability; worse than that, and you need a new algorithm.
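
A quick sketch of that doubling test in Python; `work` is a stand-in for whatever routine you are measuring (this one is deliberately quadratic), and the ratio between successive runs is the tell-tale:

```python
import time

def work(data):
    # Stand-in for the routine under test; deliberately O(n^2)
    return sum(1 for x in data for y in data if x == y)

previous = None
for n in (500, 1_000, 2_000, 4_000):
    data = list(range(n))
    start = time.perf_counter()
    work(data)
    elapsed = time.perf_counter() - start
    ratio = f"{elapsed / previous:.1f}" if previous else "-"
    print(f"n={n:>5}  time={elapsed:.3f}s  ratio vs previous run: {ratio}")
    previous = elapsed

# Ratios near 2 when n doubles suggest O(n); near 4 suggest O(n^2); near 8, O(n^3).
```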

As for tndal’s assertion that quicksort should be ditched for heapsort because of its worst case: don’t be daft.

Firstly, in-place algorithms are only any use on mutable arrays; a pure functional language, or an application that needs to sort sequential data, really ought to be using mergesort.

Secondly, quicksort’s worst case can be avoided by choosing your pivots carefully, which isn’t hard to do in practice: taking the median of the first, middle and last elements of your data set will generally get close enough (and the function you use to choose it can be recycled to sort 3-element sets at the bottom of the recursion).

Thirdly, quicksort is more sympathetic to modern computer architectures than heapsort (e.g. better cache behaviour), and so runs a lot faster on average. Heapsort might be better if predictability of in-place sort time matters more than average speed, but generally, when predictability is critical it’s predictability of latency that matters, which points towards maintaining the data in a sorted state in the first place.

Lastly, dogmatic opinions have no place in an engineer’s toolbox.
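
A small sketch of that median-of-three pivot selection (illustrative Python, not anyone’s production quicksort):

```python
def median_of_three(data, lo, hi):
    """Return the index of the median of data[lo], data[mid], data[hi].

    Using this element as the quicksort pivot makes the O(n^2) worst case
    very unlikely on realistic inputs such as already-sorted data.
    """
    mid = (lo + hi) // 2
    a, b, c = data[lo], data[mid], data[hi]
    if a <= b <= c or c <= b <= a:
        return mid
    if b <= a <= c or c <= a <= b:
        return lo
    return hi

# The candidates for [9, 1, 5, 3, 7] are 9, 5 and 7; the median 7 sits at index 4
print(median_of_three([9, 1, 5, 3, 7], 0, 4))  # 4
```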

Relating to Graham Stewart’s point: some real-world implementations of recursive algorithms actually switch to a “simple” sort when the data size is small enough that the reduced setup time is more significant than the O(n^2) runtime (in particular, there is no point in calling ANY sort algorithm on a 3-element data set). Likewise, the WATCOM compiler only implemented linear search on its symbol tables, because student programs were so small that the setup costs were never recouped by longer searches. Even in the graph in the main post, if you know your problem size will never exceed 4k or so, you might as well stick with the Alpha variant (and yes, sometimes you can say “this will never happen” - by explicitly ensuring it can never happen).
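
A sketch of that cutoff idea, assuming a plain recursive quicksort; the cutoff of 16 is illustrative, since real libraries tune the threshold empirically:

```python
CUTOFF = 16  # below this size, recursion and partitioning overhead outweighs the O(n^2) cost

def insertion_sort(data, lo, hi):
    # Simple in-place sort; very fast on tiny ranges
    for i in range(lo + 1, hi + 1):
        value = data[i]
        j = i - 1
        while j >= lo and data[j] > value:
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = value

def hybrid_quicksort(data, lo=0, hi=None):
    if hi is None:
        hi = len(data) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(data, lo, hi)   # hand small subarrays to the simple sort
        return
    pivot = data[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                      # Hoare-style partition around the pivot value
        while data[i] < pivot:
            i += 1
        while data[j] > pivot:
            j -= 1
        if i <= j:
            data[i], data[j] = data[j], data[i]
            i += 1
            j -= 1
    hybrid_quicksort(data, lo, j)
    hybrid_quicksort(data, i, hi)

nums = list(range(100, 0, -1))
hybrid_quicksort(nums)
print(nums[:5])  # [1, 2, 3, 4, 5]
```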

Part of it is that programmers after a certain year came out of school fiddling about with Access databases and what-not. My first position was with a company, using mainframes, where 1MM records was the norm. It was a good base to work from.

I’m surprised to see everyone talking about “Big O Notation”, and algorithm complexity, but nobody used the name “Order-Of-Complexity”. Isn’t that what the big O stands for? Or did I just make that up? I hope not, because I use the phrase all the time!

Oh well, it’s been a while since my algorithms classes. :)

I had a very important lesson in my first university data structures course. The first few classes were devoted to sorting. We started with the theory, covered the Big-O of bubble sort, insertion sort, quicksort and heapsort. We then drew the curves and talked about predicted behavior. Then we had to implement each sort, and then we were given a file of 20K words in random order and had to graph the actual times for each algorithm.

The first thing I noted was that the predicted best sort never performed better than the others. I was convinced this couldn’t be true, so I created a dictionary of 150K words and ran the tests again, and I tried feeding an already sorted file and a reverse-sorted file for good measure. Now the actual graph was nearer the theory graph. Past a certain value of n the theory was dead on, but when you add in the constant overhead of actual code and real systems the curves intersect at different points. The most surprising bit was that for n < 15 bubble sort always won (for that implementation). Years later I found confirmation of this when porting some Unix code: the author of that sort function always used bubble sort for fewer than 20 values and one of the n log(n) sorts for everything else.

Anyway, that first CS course taught me that I not only needed the theory, the data structures, but sometimes also needed to empirically prove those things and that has served me well for nearly 20 years.

In all fairness, the “Neglect 1 little index in your database and you’re suddenly back in TRS-80 land” isn’t quite fair, as it leads folks to believe that your DBMS is going to perform simple n^2 algorithms for unindexed joins.

A real DBMS (read: not one that you implemented for a CS assignment) will very rarely do a naive n^2 search on your data. What it WILL do is create a hash table of all of your data and use that as an index to avoid this. But that is often quite costly in terms of time and CPU itself, and is something you’d rather avoid…
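
For anyone who hasn’t seen it spelled out, here’s a toy sketch of that hash-join idea in Python; it’s a stand-in for what a query planner does, not how any particular DBMS implements it:

```python
from collections import defaultdict

def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two row sets in roughly O(n + m) instead of the naive O(n * m) nested loop."""
    # Build phase: hash one side on its join key
    index = defaultdict(list)
    for row in left_rows:
        index[row[left_key]].append(row)
    # Probe phase: stream the other side and look up matches in the hash table
    for row in right_rows:
        for match in index.get(row[right_key], []):
            yield {**match, **row}

# Hypothetical tables: users and their orders, joined on user id
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
orders = [{"user_id": 1, "total": 9.99}, {"user_id": 2, "total": 4.50}]
print(list(hash_join(users, orders, "id", "user_id")))
```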

Firstly, to the many people seemingly surprised by Jeff’s explanation of the Big-O notation: yes, most people here should know it. However, I am willing to bet a sizeable portion don’t. Not everyone has done a Comp-Sci degree. I myself, only just embarking on one this term, have been programming for several years and have only just learnt the Big-O notation properly. I know a couple of other people who read this blog for the less hard-core programming aspects who would appreciate the explanation.

Secondly, sometimes the obvious needs to be stated, even for experts. We don’t all deliberate over what specific algorithm to use based on all the different metrics. Sometimes we just go with our instincts, which happen to be wrong.

"On a different note, the tables you presented are probably easier to grok as graphs. You might even want to show one graph in a linear scale, and one graph in logarithmic scale.“
I would disagree. The graph presented does a good job of visualising the difference between different types. It’s hard to read precice values from a graph, and I think the " 16, 4, 25, 64, 256, 12, 20.9 trillion” has more impact :slight_smile:

Is Microsoft aware of this concept? I’m still rather curious as to how my Outlook can lock up when I’ve simply deleted an e-mail, or how XP locks up so often when I’m trying to log off, usually in the “saving settings” phase.

“You might have a perfect grasp of big O and be absolutely convinced that your algorithm will scale well, but you should still test it with large data regardless. There is always a difference between theory and reality (e.g. the theory may not consider the slow down caused by paging or database cursor moving).”

That’s exactly my point-- test, test, test! Use large data in your testing!

Consider how many times you’ve used software that didn’t scale. iTunes on Windows doesn’t seem to scale: http://www.hanselman.com/blog/iTunes7UnspeakablySlow.aspx

Heck, the Movable Type installation that powers this blog is a perfect example of failing to test with large datasets. Why does it take 15-20 seconds to enter a comment? That sure didn’t happen two years ago, before I had 27,404 comments and 946 entries in the MySQL database…

Leons Petrazickis: "Incidentally, an Alpha is a superfast mid-90s processor that DEC used to put out and NT used to run on.

A TRS-80 is an early 80s hobbyist computer. It’s about 10,000 times slower than an Alpha."

Ummm… For some reason you think that the highly technical readers of this blog don’t know that?

And FYI: Just because something is a lot faster sometimes doesn’t make it always a lot faster. For example, a Ferrari is a lot faster than a Vespa, except when they’re both climbing the same mountain and halfway up the Ferrari runs out of gas.

My point is that, even though the Alpha may be about 10K times as fast as the TRS-80, that doesn’t mean it was faster at everything. (I think that might have been Jeff’s point as well.)

Hey Jeff,

The columns in those tables are out of order: n^(7/6) dominates n*lg(n); you just haven’t printed enough rows to show it.

c
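
For the curious, that crossover can be located quickly: dividing both expressions by n, n^(7/6) overtakes n*lg(n) once n^(1/6) exceeds lg(n). A few lines of Python narrow it down:

```python
import math

n = 4  # start past the trivial small-n region
while n ** (7 / 6) <= n * math.log2(n):
    n *= 2
print(f"n^(7/6) first exceeds n*lg(n) between n = {n // 2:,} and n = {n:,}")
# Prints a range around half a billion to a billion elements, far past any row in the table.
```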

The true lesson here: read back through all the posts on this blog.

Jeff posted the exact same thing about three years ago in this same blog. I’m pretty sure it’s 100% identical.

Imagine all the good stuff that’s back there waiting for you to find!

I have seen this happen with badly designed SQL queries as well. Everything is great when the test database has 100 rows in each table, but when production data floods in and there are millions of rows (and a million rows is small in some settings bigger than any I have worked in), the queries grind on and the app gets slower and slower.
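
A small way to watch that happen, using SQLite from Python; the table and column names are made up for the illustration, and the point is just that the plan for the same query changes from a full table scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    ((i % 10_000, i * 0.01) for i in range(500_000)),
)

query = "SELECT COUNT(*) FROM orders WHERE customer_id = ?"

# Without an index, SQLite reports a full scan of the orders table
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index, the plan switches to a search using idx_orders_customer
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
```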