Everything Is Fast For Small n

With generic implementations it will happen when you try to sort data that is already sorted - and that is a pretty likely occurrence.

At least one version of Microsoft C++ exhibited that behavior with qsort().

So much for the people who say, “I don’t need to know anything about sorting, I just call the sort routine in the library”.

Even if you picked a really stupid sort algorithm, I’m willing to bet your problem wasn’t principally the sort algorithm choice but the fact that you were choosing one at all. JavaScript provides Array.sort(compareFunction) since NS/IE 3. If you were having to write your own sort, you must not have been using an Array but instead some DOM collection. Which means you were at least reading from the DOM over and over and possibly also reordering the DOM as you went. That’s tremendously slower than reading and writing plain variables. That’s why your script barfed with a measly 100 rows. A better strategy is to pull out the sorting value and a DOM object reference per row into an array, sort that (which will be effectively instant on 100 items), and then reorder/rewrite the whole table in one go.
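
For what it's worth, here is a minimal sketch of that strategy in TypeScript. The function name, table id, and column index are just illustrative; the point is one read pass over the DOM, one in-memory sort, and one batched write pass.

```typescript
// A sketch of the "extract, sort, rewrite once" approach described above.
// Assumes a plain HTML table; the sort key is taken from one cell per row.
function sortTableByColumn(tableId: string, column: number): void {
  const table = document.getElementById(tableId) as HTMLTableElement | null;
  if (!table) return;
  const body = table.tBodies[0];

  // One pass over the DOM: pull out (key, row) pairs into a plain array.
  const rows = Array.from(body.rows).map(row => ({
    key: row.cells[column].textContent ?? "",
    row,
  }));

  // Sort the array in memory -- effectively instant for a few hundred rows.
  rows.sort((a, b) => a.key.localeCompare(b.key));

  // One pass to write back: appendChild moves each existing row to the end,
  // so the table is reordered in a single batch instead of row-by-row swaps.
  for (const { row } of rows) {
    body.appendChild(row);
  }
}
```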

“Worth noting is insertsort is O(N) for already sorted data. This makes insertsort ideal for applications where you resort data that drifts out of order.”

What makes insertion sort even more ideal for pre-sorted data is the fact that it’s a stable sort, unlike quicksort. That’s another important aspect of sorting that most teach-yourself-on-the-job coders have never heard of. Maybe Jeff should do another blog entry on the subject.

People have mentioned qsort’s O(N^2) worst-case performance and how it typically occurs with naive qsort implementations for already sorted data. Worth noting is insertsort is O(N) for already sorted data. This makes insertsort ideal for applications where you resort data that drifts out of order.
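
For readers who haven't seen it written out, here is a minimal insertion sort sketch (TypeScript, plain number arrays assumed) showing why already-sorted input is the best case: the inner loop never executes, so the whole thing degenerates to a single O(N) scan.

```typescript
// Insertion sort: on already-sorted input the inner loop body never runs,
// so the pass is O(n); the worst case (reverse-sorted input) is O(n^2).
function insertionSort(a: number[]): number[] {
  for (let i = 1; i < a.length; i++) {
    const value = a[i];
    let j = i - 1;
    // Shift larger elements right; for sorted data this condition fails immediately.
    while (j >= 0 && a[j] > value) {
      a[j + 1] = a[j];
      j--;
    }
    a[j + 1] = value;
  }
  return a;
}
```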

Another reason to test with large data sets: interface optimization.

If a program’s interface is only usable with few records, then a more performant sorting algorithm isn’t going to save it.

Jeff posted the exact same thing about three years ago in this same blog. I’m pretty sure it’s 100% identical.

Congratulations on being such a long-time fan and an astute observer. :slight_smile:

I was sort of disappointed that post never got any attention, because it was always one of my favorites. I reformatted it for clarity and added some content to make it what you see today, then deleted the original post.

Quicksort’s worst case (n^2) can be made vanishingly unlikely if you randomize the order of your data before the sort. The shuffle takes O(n), so together with quicksort the expected running time is O(n log n).
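
A sketch of that suggestion (TypeScript, plain number array assumed): a Fisher-Yates shuffle in O(n) before handing the data to whatever quicksort you are using; the library sort stands in for it here. Note this makes pathological pivot sequences unlikely rather than impossible.

```typescript
// Fisher-Yates shuffle followed by a sort: the O(n) shuffle destroys any
// pre-existing order that would trip up a naive first-element pivot.
function shuffleThenSort(a: number[]): number[] {
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a.sort((x, y) => x - y);
}
```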

"God, the number of times I’ve tried to explain to management why a QA database 1/20th the size of the production one isn’t perfectly adequate for testing…
A. Lloyd Flanagan on September 21, 2007 07:20 AM "

Or have legal tell you that even faking sensitive data is forbidden. The only values we could use for a Social Security Number sort/search were 999-99-9999 and 000-00-0000. Anything else might be mistaken for the real thing and open us up to a lawsuit.

I love this post, and I’m surprised by the drift of the comments.

On the other hand, it is perhaps more accurate to say that everything is equally slow for small n, since the constants dominate (setup, teardown, loading, yada yada).

Well, I suppose there are some algorithms with good asymptotic order whose implementations have really terrible small-n performance, but it is difficult for that to escape notice, I trust.

So what’s wrong with JavaScript’s built-in Array.sort([sortcb])? Presumably the JS engine guys have already made all the hard choices for you?

You could have added the O(n) radix sort to your examples. And that’s its worst case :slight_smile:
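
For the curious, here is a rough sketch of an LSD radix sort (TypeScript, assuming non-negative 32-bit integer keys): four counting-sort passes, one per byte, each O(n), so O(n) overall for fixed-width keys.

```typescript
// LSD radix sort over 8-bit digits: histogram, prefix sums, stable scatter.
function radixSort(a: number[]): number[] {
  let src = a.slice();
  let dst = new Array<number>(a.length);
  for (let shift = 0; shift < 32; shift += 8) {
    const counts = new Array<number>(257).fill(0);
    // Histogram of the current byte.
    for (const v of src) counts[((v >>> shift) & 0xff) + 1]++;
    // Prefix sums give each bucket's starting offset.
    for (let i = 1; i < 257; i++) counts[i] += counts[i - 1];
    // Stable scatter into the destination array.
    for (const v of src) dst[counts[(v >>> shift) & 0xff]++] = v;
    [src, dst] = [dst, src];
  }
  return src;
}
```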

In my experience, the reality is often “everything is fast if it doesn’t call the database”. The optimizations to avoid database calls dwarf anything else.

@headshots:
Not necessarily. The worst case in quicksort arises when you consistently manage to choose pivots which divide the data into one large set and one empty-ish set. But happily, you don’t have to choose a perfect pivot to get the average case; it just has to be close enough (i.e. the split data sets have to be within an order of magnitude of each other, on average). Your suggestion of randomising the data order solves the “nearly sorted input” case, but what if your random ordering results in a data set that was previously unsorted becoming nearly sorted? Right back where you started (and a fine example of cargo cult programming to boot).

That’s why I suggested taking three elements from disparate places in the data set and choosing the middle one as the pivot. Most naive (or example) quicksorts select the first element as the pivot - which, if the data is nearly sorted, is guaranteed to be a lousy choice. However, choosing the median of the first, central and last elements will always give you a more or less perfect pivot on sorted data, and do no worse on average than picking the first element for randomised data. And it’s O(1) so long as you have random access to the data set (and as I said before, if you don’t, you really need to be using mergesort).
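
Here is roughly what that looks like (TypeScript sketch, array of numbers assumed): order the first, middle, and last elements, take the middle value as the pivot, then do a standard Hoare-style partition.

```typescript
// Quicksort with median-of-three pivot selection: already-sorted input now
// yields a near-central pivot instead of the worst possible one.
function quicksort(a: number[], lo = 0, hi = a.length - 1): void {
  if (lo >= hi) return;
  const mid = lo + ((hi - lo) >> 1);

  // Order a[lo] <= a[mid] <= a[hi], then use a[mid] as the pivot value.
  if (a[mid] < a[lo]) [a[lo], a[mid]] = [a[mid], a[lo]];
  if (a[hi] < a[lo]) [a[lo], a[hi]] = [a[hi], a[lo]];
  if (a[hi] < a[mid]) [a[mid], a[hi]] = [a[hi], a[mid]];
  const pivot = a[mid];

  // Hoare-style partition around the pivot value.
  let i = lo, j = hi;
  while (i <= j) {
    while (a[i] < pivot) i++;
    while (a[j] > pivot) j--;
    if (i <= j) {
      [a[i], a[j]] = [a[j], a[i]];
      i++;
      j--;
    }
  }
  quicksort(a, lo, j);
  quicksort(a, i, hi);
}
```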

This is very interesting; I am seeing exactly this on my computer science course. Thanks Jeff. The most common complexities I am seeing are O(n), O(log n), O(n log n), and O(n^c).

I’m going to have to side with Eug here and argue that Bubblesort is O(n) in the best case, and not O(n^2). It’s O(n^2) in the worst case. It might sound like a pedantic point to make, but the distinction is critical, particularly for an article intended to educate people on computational complexity.

If you know that your data is typically sorted then paradoxically you may be much better off using Bubblesort than an O(n log n) ‘worst case’ alternative. An example where this might be useful is in a graphics engine where you can exploit frame-to-frame coherence by depth-sorting the list of objects to be rendered each frame, and you expect their relative positions to change infrequently.

There are various other examples in graphics and physics engines where this kind of temporal coherence can be exploited, and undoubtedly in lots of other domains too. Of course, the key lesson should be to understand the complexity of candidate algorithms and understand your data (profile!) so you can make the best selection of algorithm for your needs.
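
For reference, the early-exit Bubblesort being described looks something like this (TypeScript sketch, plain number arrays assumed): a pass that makes no swaps terminates the sort, which is what makes already-sorted or nearly-sorted input cost roughly one O(n) pass.

```typescript
// Bubble sort with an early-exit flag: best case O(n), worst case O(n^2).
function bubbleSort(a: number[]): number[] {
  for (let end = a.length - 1; end > 0; end--) {
    let swapped = false;
    for (let i = 0; i < end; i++) {
      if (a[i] > a[i + 1]) {
        [a[i], a[i + 1]] = [a[i + 1], a[i]];
        swapped = true;
      }
    }
    if (!swapped) break; // no swaps made: the array is already sorted
  }
  return a;
}
```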

I would recommend that everyone also watch Raymond Chen’s talk from the 2005 PDC, “Five Things Every Win32 Programmer Needs to Know”, where he points out how things going on inside of Windows will affect your data structures and algorithms.

We were guilty of this too, but I have begun to appreciate the value of large n. We’re building an app that gives users the chance to create records that are searched as frequently as a user switches applications.

We actually reduced the complexity of the code by testing for large n.

We found that a simple iteration works in large-n cases (n = 5000, where you might see n = 100 from a power user). The value of testing with large n here is in justifying the design. Without testing we would not have been able to show that a simple solution worked and that a more complex algorithm for the sake of performance was not needed.

Great video on sorting techniques, “Sorting Out Sorting” via Joey DeVilla:

http://globalnerdy.com/2007/09/24/sorting-out-sorting/

Sorting’s failure to scale is a specific example of a general problem with interfaces that don’t scale well. One of my least-favorite examples is the way Apple replaced its excellent Sherlock search interface with Spotlight.

Spotlight probably works fine for users with 1 to 1,000 data files. I’m a writer, I’ve owned computers since 1987, and I’m a backup fanatic. At one point my Documents folder had over 300,000 files in it. Instead of getting 10 search results like Joe User, I get 100, or 1,000. So I need an interface that allows me to

  1. Easily choose WHERE I will search (which folder or folders);

  2. Quickly SORT the files on the basis of what I think will help me find what I want;

  3. Quickly SCAN the files; and

  4. Quickly RE-SORT the files by a different parameter if steps 1 and 2 don’t work right away.

Spotlight may have a way to quickly delimit which folders to search, but I haven’t found it. Worse, it sorts results into separate categories, and displays only SOME of the hits. This makes it impossible to just scan the whole list of files for a match. If what I want isn’t in the first few hits, I must also remember what KIND of file it is. Ack! More mental overhead!

Mental overhead does not scale! I can’t just add a faster CPU or more RAM to my brain.

I suspect that Spotlight’s developers came up with their cool idea, implemented it, and tested it … on a machine with a few dozen to a few thousand files. And it worked – in that environment. I ended up turning it off and using FileBuddy for searching. Not ideal, not as good as Sherlock, but it works.

The traffic here in the Bay Area is another example of technology that doesn’t scale well. Trains scale well; autos do not.

Just don’t fall into the “same order of magnitude” mistake people make fresh out of computer science studies:
http://dotmad.blogspot.com/2007/10/every-n-matters.html