Everything Is Fast For Small n

Wilma Keppel:

You’re a writer, but you don’t know when to use “its” instead of “it’s”? For shame…

Reading these comments, I kept wondering, why do a sort in the first place?

If you know the data is usually already sorted, then only sort it when you know it has changed. I mean, changed enough to justify the effort of sorting. Why not keep two tables: a large one known to be sorted, and a small one that isn’t? Then you can do a fast lookup in the large table, and if the item isn’t found there, quickly scan the small one. Do the sorting only occasionally, merging the small table into the large one (a rough sketch follows at the end of this comment).

That is assuming that you do need to sort the data in the first place. If the purpose of doing the sort is simply to be able to use a quick method of finding an item in a sorted table, perhaps you are going about things in an inefficient way. A hash table might be a far better solution.

Sorting is expensive. Try not to do it. Ever.
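A minimal sketch of that two-table idea, assuming string items and an arbitrary merge threshold (the class, field names, and threshold are all illustrative, not from the comment above):

using System.Collections.Generic;

class TwoTableIndex
{
    // Large table kept sorted; small table collects recent, unsorted additions.
    private readonly List<string> sorted = new List<string>();
    private readonly List<string> recent = new List<string>();
    private const int MergeThreshold = 1000; // pay the sorting cost only occasionally

    public void Add(string item)
    {
        recent.Add(item);
        if (recent.Count >= MergeThreshold)
        {
            // Fold the small table into the big one and re-sort, rarely.
            sorted.AddRange(recent);
            recent.Clear();
            sorted.Sort();
        }
    }

    public bool Contains(string item)
    {
        // Binary search the big table first, then a short linear scan of the small one.
        return sorted.BinarySearch(item) >= 0 || recent.Contains(item);
    }
}

If lookup is the only goal, a HashSet&lt;string&gt; would give the same answer with less bookkeeping, which is the hash-table point above.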

The most common time I find myself needing a sort is when looking at search results. Typically you will want to sort by name, price, phone number, location, or any number of things, depending on the situation. You can create a load of presorted lists and keep them up to date as each record is added or deleted, but it might not be worth it if you have too many of them… the cost of adding a record might become impractically high.

I’ve done a lot of data entry, and I have to say the amount of crap software out there is enormous. One of the worst was raising an order where you add a record for each different item. By the time 20 different items are on said order, it takes 20 /seconds/ to add the next record. Clearly a case of using a naive or inappropriate algorithm… although given the severe slowness… I would like to think it is a bug. Unsurprisingly, they had only tested it with up to 5 items. Still not fixed after a year… in which time I have written a /scripting language/ to automate the process (and some others)… at least I don’t need to waste my time on it anymore. :)

Important article Jeff.

One clarification might be to indicate that the “lg n” column is using a base of 2. I had forgotten a lot of high school mathematics, and the Windows calculator has a “log” key, so I was mystified for a while as to how, for example, lg(16) = 4 and not 1.204…

The Windows calculator doesn’t appear to provide a single button for “log base b of n”, so I had to hunt up the expression to calculate base-2 logs in two steps: log_2(x) = log_10(x) / log_10(2).
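If you have a .NET compiler handy rather than the calculator, Math.Log accepts a base argument, so you can sanity-check this in a couple of lines (a throwaway illustration, not part of the original comment):

using System;

class LgDemo
{
    static void Main()
    {
        Console.WriteLine(Math.Log(16, 2));                 // lg(16) = 4, using the built-in base argument
        Console.WriteLine(Math.Log10(16) / Math.Log10(2));  // same result via the change-of-base formula
    }
}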

Another excellent collection of visual sort comparisons – animated GIFs with no Java dependency!

http://www.sorting-algorithms.com/

AJ,

As you say, log_b x = log x / log b

Log b is a constant independent of x.

Big-O notation drops the constant.

So it doesn’t matter what base you use: it’s either logarithmic complexity or it isn’t.
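Concretely: log_2 n = log_10 n / log_10 2 ≈ 3.32 × log_10 n. Changing the base only multiplies the result by a fixed constant (about 3.32 here), and Big-O throws that constant away, so O(log_2 n), O(log_10 n) and O(ln n) are all the same complexity class.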

Microsoft had been sorting badly for many years before Visual C++.

Having learned of quicksort, I decided to pit a quicksort implemented in Microsoft’s QBasic interpreter against their MS-DOS command-line sort tool (which exhibited O(n^2) performance). IIRC, my interpreted code was slightly faster at sorting a list of 8000 short strings.

Amen - I’ve been bitten by this before, most especially in the cases where I’ve thought “yeah, I could index on some parameter and get O(1) lookups rather than do an O(n) linear search for it… but there won’t be any cases where anyone will search on that parameter.” Yeah, WRONG! Learn that “this case will never happen” is always a false statement…

When doing technical interviews I always make a point of asking a question centered around Big-O notation and algorithmic complexity.

Candidates claiming to be “senior” software engineers fail instantly if they know nothing about Big-O and related concepts…

Tables show 7/(6^n); I think you mean (7/6)^n.

To those saying that it’s OK to run a linear search if you “control n”: be very careful. Your linear search over 20 items might end up being executed 10,000 times in a loop (I’ve seen this), and suddenly some batch job takes a full minute when it should take more like 5 seconds. The difference doesn’t have to be as dramatic as 1 second vs. 1 month for users to care.

It’s really not that much harder to use a dictionary in C#, so I don’t understand the complaints about adding all this complexity. In fact, the code for a dictionary is often cleaner, especially if there’s one specific field you usually use as an identifier.

Getting an item from a dictionary:

MyObject FindDictionaryItem(int objectID)
{
    return dic[objectID]; // We’re done!
}

Getting an item from a list:

MyObject FindListItem(int objectID)
{
    // O(n): scans the list until it finds a match
    foreach (MyObject m in list)
    {
        if (m.ID == objectID)
        {
            return m;
        }
    }
    throw new KeyNotFoundException();
}

I think most people would say that the dictionary version is actually much LESS complex.
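For completeness, those two snippets assume backing fields along these lines; the names dic and list come from the code above, but the exact declarations are my guess:

using System.Collections.Generic;

class MyRepository // hypothetical containing class for the two methods above
{
    private readonly Dictionary<int, MyObject> dic = new Dictionary<int, MyObject>();
    private readonly List<MyObject> list = new List<MyObject>();
}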

The simpler answer is that if you do not test on full-scale data before you release your application to the users, you are an idiot.

Testing on full-scale data can be a real challenge. Data protection regulations in Europe say that you can’t use personal data for testing purposes unless you get permission. If you’re dealing with data about millions of members of the public, for example on a government project, where are you going to get all that data from? You need to write a decent data generator early on that can generate vast quantities of realistic data. Also, make sure your data generator doesn’t have any geometric-performance algorithms!
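A rough sketch of such a generator in C#; the field names, row count, and CSV format are all made up for illustration:

using System;
using System.IO;

class TestDataGenerator
{
    static void Main()
    {
        var rng = new Random(42); // fixed seed so test runs are reproducible
        using (var writer = new StreamWriter("customers.csv"))
        {
            writer.WriteLine("Id,Name,Postcode");
            for (int i = 0; i < 5000000; i++)
            {
                // Each row costs O(1) to produce, so n rows cost O(n):
                // the generator itself won't blow up at full scale.
                string name = "Customer" + rng.Next(1, 1000000);
                string postcode = "AB" + rng.Next(10, 99) + " " + rng.Next(1, 9) + "XY";
                writer.WriteLine(i + "," + name + "," + postcode);
            }
        }
    }
}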