# Sorting [log n] different values in linear time

I have an array of n integers that can only assume log n possible values (and the values themselves can be anything). For example, in S = [349, 12, 12, 283, 349, 283, 283, 12] there are only 3 different numbers (log 8 = 3).

I have to sort this array in less than O(n log n) time. Which algorithm should I use? Maybe radix sort combined with counting sort? What about its analysis?

Since it is given that there will be only log(n) unique elements, you can get the sorted list in O(n) time using the algorithm below:

1. Create a mapping of the distinct items in the list and keep a count of each (basically a dictionary/hashmap of (key, count) pairs).
• This is a single iteration over the input list, so O(n) steps.
2. Sort the resulting list of log(n) (key, count) tuples by key.
• Say we use merge sort; its time complexity is k * log(k), where k is the size of the input.
• Replacing k with log(n), the complexity of this step is O(log(n) * log(log(n))).
• Since O(log(n) * log(log(n))) < O(n), the overall complexity up to this step is O(n) + O(log(n) * log(log(n))), which is equivalent to O(n) again.
3. Iterate over the sorted keys and generate the sorted list with a single for loop, repeating each value according to its count. There are at most O(n) iterations here.

So overall, the algorithm above will run in O(n) time.
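The three steps above can be sketched in Python as follows (a minimal sketch; the name `sort_few_distinct` is mine, not from the answer):

```python
from collections import Counter

def sort_few_distinct(xs):
    # Step 1: count each distinct value -- O(n)
    counts = Counter(xs)
    # Step 2: sort the k = log(n) distinct keys -- O(log n * log log n)
    ordered = sorted(counts)
    # Step 3: expand each key by its count -- O(n)
    result = []
    for key in ordered:
        result.extend([key] * counts[key])
    return result

print(sort_few_distinct([349, 12, 12, 283, 349, 283, 283, 12]))
# [12, 12, 12, 283, 283, 283, 349, 349]
```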

Radix sort complexity is O(dn), with d being the number of digits in a number.

The algorithm runs in linear time only when d is constant! In your case d = 3 log(n), so your algorithm will run in O(n log(n)).
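To make the O(dn) bound concrete, here is a minimal LSD radix sort sketch in Python (a generic illustration, not the asker's code): the outer while loop runs once per digit, i.e. d times, and each pass is a stable O(n) bucket distribution.

```python
def radix_sort(xs, base=10):
    # LSD radix sort: one stable bucket-distribution pass per digit,
    # so the total cost is O(d * n) for d-digit numbers.
    if not xs:
        return xs
    max_val = max(xs)
    exp = 1
    while max_val // exp > 0:  # runs d times
        buckets = [[] for _ in range(base)]
        for x in xs:
            buckets[(x // exp) % base].append(x)
        xs = [x for bucket in buckets for x in bucket]
        exp *= base
    return xs

print(radix_sort([49, 25, 19, 27, 87, 67, 22, 90, 47, 91]))
# [19, 22, 25, 27, 47, 49, 67, 87, 90, 91]
```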

I'm honestly not sure how to solve this problem in linear time. I'm wondering if there is some other piece of information missing about the nature of the numbers...

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.

void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    // First pass: move 'A's to the front and 'T's to the back.
    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
            swap(seqs[i], seqs[APos++]);
            i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    // Second pass: within the middle section, move 'C's before 'G's.
    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }

    // Recurse on each of the four buckets using the next character.
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
    }
}

Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.

Edit: I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above is now actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

Let's look at an example with two-digit decimal numbers:

49, 25, 19, 27, 87, 67, 22, 90, 47, 91

Sorting by the first digit yields

19, 25, 27, 22, 49, 47, 67, 87, 90, 91

Next, you sort by the second digit, yielding

90, 91, 22, 25, 27, 47, 67, 87, 19, 49

Seems wrong, doesn't it? Or isn't this what you are doing? Maybe you can show us the code if I got you wrong.

If you are doing the second bucket sort on all groups with the same first digit(s), your algorithm would be equivalent to the recursive version. It would be stable as well. The only difference is that you'd do the bucket sorts breadth-first instead of depth-first.
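The digit order matters when the per-digit sorts are stable: you have to go from least to most significant digit, not the other way around. A quick Python check of the example above, using `sorted` (which is stable):

```python
nums = [49, 25, 19, 27, 87, 67, 22, 90, 47, 91]

# Most significant digit first, then least significant (as in the example):
msd_first = sorted(sorted(nums, key=lambda x: x // 10), key=lambda x: x % 10)
print(msd_first)  # [90, 91, 22, 25, 27, 47, 67, 87, 19, 49] -- not sorted

# Least significant digit first, then most significant (classic LSD order):
lsd_first = sorted(sorted(nums, key=lambda x: x % 10), key=lambda x: x // 10)
print(lsd_first)  # [19, 22, 25, 27, 47, 49, 67, 87, 90, 91] -- sorted
```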

UPDATE

1. Count the number of each element in the list (using a hashtable) (time: O(n)).
2. De-duplicate the list (time: O(n)).
3. Sort the now-de-duped items (time: O(log n * log log n)).
4. Build a list with the right number of copies of each item (time: O(n)).

Overall, this is O(n) and easy to implement.

Here's some Python that implements this:

import collections

def duplicates_sort(xs):
    keys = collections.Counter(xs)
    result = []
    for k in sorted(keys):
        result.extend([k] * keys[k])
    return result