First, you will note that "ones_vector", "flagmask_vector" and "datamask_vector" have all been replaced by functions that work on the fly in O(1). These are "ones", "flagmask" and "datamask" respectively. The idea is that each of these is computable as a geometric sum
C + C*2^width + ... + C*2^((size-1)*width)
for which the closed-form formula is well known. I've also added the convenience function "mask", which gives a number of the desired length that is all ones in binary. With it, one can compute, for example, the "flagmask" of the previous solution.
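For illustration, here is a minimal Python sketch of what these O(1) helpers might look like, using the closed form of the geometric sum; the flagmask/datamask layout assumes, as in the previous solution, one flag bit at the top of each width-bit field:

```python
def mask(length):
    # A number whose binary representation is `length` ones.
    return (1 << length) - 1

def ones(size, width):
    # 1 + 2^width + ... + 2^((size-1)*width), via the closed form
    # (2^(size*width) - 1) / (2^width - 1) of the geometric sum.
    return mask(size * width) // mask(width)

def flagmask(size, width):
    # Assumed layout: the flag bit is the top bit of each width-bit field.
    return ones(size, width) << (width - 1)

def datamask(size, width):
    # Every bit of every field except the flag bit.
    return ones(size, width) * mask(width - 1)
```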
Second, the function "ge" (greater than or equal) is now complemented by a second function, "lt" (less than). Once we know how to compute "ge", additional comparators, such as "lt" or "equals", are trivial extensions of it.
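As a sketch (assuming, as before, that every element leaves its top flag bit clear; normalizing the results to 0/1 flags is my own convention, and the column's actual code may differ):

```python
def ge(a, b, size, width):
    # Per-element a[i] >= b[i]. Assumes every element of a and b fits in
    # width-1 bits, leaving the flag bit free. Setting the flag bit and
    # subtracting leaves the flag set exactly where no borrow occurred.
    f = flagmask(size, width)
    return (((a | f) - b) & f) >> (width - 1)   # normalized to 0/1 per field

def lt(a, b, size, width):
    # a[i] < b[i] is just the complement of a[i] >= b[i].
    return ge(a, b, size, width) ^ ones(size, width)

def equals(a, b, size, width):
    # a[i] == b[i] iff both a[i] >= b[i] and b[i] >= a[i].
    return ge(a, b, size, width) & ge(b, a, size, width)
```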
Next, and more importantly, the first solution presented had a rather awkward handling of numbers repeated in the input. This solution uniquifies the input beforehand by multiplying all numbers by a large enough factor (here, the size of the input) and adding a counter. This ensures that the numbers are distinct modulo n, and are hence distinct. After sorting, we can retrieve the original numbers by simple integer division by n.
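On plain lists, the idea looks like this (the helper name is mine):

```python
def uniquify_list(array):
    # Multiply each element by n and add its index as a tie-breaker.
    n = len(array)
    return [x * n + i for i, x in enumerate(array)]

# uniquify_list([5, 3, 5]) == [15, 10, 17]: the results are distinct modulo 3,
# hence distinct, and [x // 3 for x in sorted([15, 10, 17])] == [3, 5, 5].
```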
Lastly, and most notably, the comparison against a different cyclic shift in each iteration has been replaced. We are still performing all O(n²) comparisons, and are still doing so in O(n) iterations, but the order is now different: in the i'th iteration we compare all array members to the i'th element of the input array. To do this, we compare the vector encoding the original array with a vector representing n copies of the i'th element. This is, again, an easy-to-compute geometric series: it is simply "ones(size,width)*array[i]".
As in the original solution, we sum over the results of all comparisons between "array[i]" and all other elements. Here this means taking the sum over all the elements in an entire vector (the vector of comparisons computed in the i'th step). Importantly, summing over the elements of an entire vector is tractable in O(1) using the following observation:
X_0*2^(0*width) + X_1*2^(1*width) + ... + X_(size-1)*2^((size-1)*width) = X_0 + X_1 + ... + X_(size-1) (mod 2^width - 1)
In our case, all we need to do is to take the integer housing the result of the "less than" operation modulo 2^width - 1. As long as we know that the total sum is less than 2^width - 1, this is known to produce the same answer as summing all elements.
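A quick sanity check of the folding trick, with width 4:

```python
# The vector [3, 1, 2] at width 4 is the integer 3 + 1*2**4 + 2*2**8 == 531,
# and 531 % (2**4 - 1) == 6 == 3 + 1 + 2, as promised.
assert (3 + 1 * 2**4 + 2 * 2**8) % (2**4 - 1) == 3 + 1 + 2
```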
Once we know how many numbers are smaller than "array[i]", we know where to put it in the sorted array, completing this implementation of O(n) "sorting".
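Assembling the pieces, the whole O(n)-iteration sort might look as follows, reusing the sketches above. This is my reconstruction under the assumptions already noted (a spare flag bit per field, per-field sums that fit), not the column's verbatim code:

```python
def sort_in_n_iterations(array):
    # Uniquify, pack into a single integer, then place each element by
    # counting in O(1) how many elements are smaller than it.
    n = len(array)
    unique = uniquify_list(array)
    width = max(unique).bit_length() + 2   # room for a flag bit and for sums
    vector = sum(x << (i * width) for i, x in enumerate(unique))
    result = [0] * n
    for i in range(n):
        copies = ones(n, width) * unique[i]    # n copies of the i'th element
        flags = lt(vector, copies, n, width)   # flags[j] = (unique[j] < unique[i])
        num_less_than = flags % mask(width)    # sum all the flags in O(1)
        result[num_less_than] = array[i]
    return result
```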
In trying to parallelize this algorithm, the first problem that arises is that we packed the input array a little too well. By choosing "width" to be "int(log(max(maximum,1))/log(2))+1" we haven't left any room for a "flag" bit, and so it isn't possible to use "ge" as-is. This was actually a hint for the readers to look for a "change_width" function, by which I mean a function working in O(1) that takes an input
a = a_0 + a_1*2^width + ... + a_(size-1)*2^((size-1)*width)
and returns the output
a' = a_0 + a_1*2^width' + ... + a_(size-1)*2^((size-1)*width')
where all a_i are in the range 0 ≤ a_i < 2^min(width, width'). (It doesn't matter what the behavior of the function is when no such a_i exist.)
It turns out that this function is not difficult to create. If width' ≥ (size+1)*width, we can multiply the original array by "ones(size,width'-width)". This places copies of the original array at regular intervals, chosen so that with a single additional AND we can mask out the superfluous elements and end up with the desired array.
If we want to choose a new width that is much narrower than the original (to be exact, if width ≥ (size+1)*width'), we can do so even more easily, by reducing the original array modulo "mask(width-width')". This is very similar to the trick we used when summing array elements with a modulo operation, only this time the modulus is much larger, because we want an entire vector, not a single element, to survive the folding.
This solves the two extreme cases: making "width" extremely wide or extremely narrow. To solve all the intermediate cases, we simply widen the array to an intermediate width, "temp_width=(size+1)*max(width,width')", and then narrow it back down to the target width. The full implementation of "change_width" (as well as all of the code for this month's solution) is available in Python code here.
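Here is a sketch along those lines; the helper names "widen" and "narrow" are mine, and the column's own implementation may differ in details:

```python
def widen(a, size, width, new_width):
    # Requires new_width >= (size+1)*width. Multiplying by ones() lays down
    # shifted copies of the whole array whose partial products never overlap;
    # the AND keeps only the copy of element i that starts at bit i*new_width.
    spread = a * ones(size, new_width - width)
    return spread & (ones(size, new_width) * mask(width))

def narrow(a, size, width, new_width):
    # Requires width >= (size+1)*new_width. Since 2^width = 2^new_width
    # (mod 2^(width-new_width) - 1), reducing modulo mask(width-new_width)
    # folds element i from offset i*width down to offset i*new_width.
    return a % mask(width - new_width)

def change_width(a, size, width, new_width):
    # Widen to a safely large intermediate width, then narrow to the target.
    temp_width = (size + 1) * max(width, new_width)
    return narrow(widen(a, size, width, temp_width), size, temp_width, new_width)
```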
As a side note, I'd like to mention that "change_width" is an extremely powerful function to have. A person using it together with addition, subtraction and bit shifts can implement any bitwise binary function. Proving this is left as an exercise to the reader.

Another powerful consequence of having "change_width" is that it removes any remaining need for arrays. We have seen that integers housing vector-encoded arrays (which we'll henceforth refer to simply as "vectors") carry the same information as the original arrays, and it is not difficult to see that O(1) "set" and "get" functions can be implemented for them, making element access on vectors as easy as on the original arrays. Until now, however, we had the problem of bounded element size: anyone wanting to set an array position to a value greater than 2^width - 1 would have been stumped by the limits of the vector encoding. With "change_width", we can re-encode the vector to a new width at any time, making all usage of arrays completely superfluous: vectors do everything arrays do, and, as we have seen in the previous month's riddle, they can also do much more.
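For completeness, here is one plausible O(1) implementation of such access functions (named vec_get/vec_set here to avoid shadowing Python built-ins; the signatures are my guesses, not the column's):

```python
def vec_get(v, i, width):
    # Read element i of a vector in O(1): shift its field down and mask.
    return (v >> (i * width)) & mask(width)

def vec_set(v, i, width, value):
    # Overwrite element i in O(1): clear its field, then OR in the new value.
    # Assumes 0 <= value < 2**width; otherwise change_width the vector first.
    return (v & ~(mask(width) << (i * width))) | (value << (i * width))
```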
Now, we can begin parallelizing our existing solution. The original first step was to multiply all array elements by n as part of the uniquification process. We replace this with a multiplication by the larger value 2^n. This makes both the multiplication and the eventual division very easy: each is a simple bit shift.
Next we need to add a counter. This is equivalent to adding the number
0 + 1*2^width + ... + (size-1)*2^((size-1)*width)
which, too, is a number easy to compute in O(1) if one remembers one's geometric series. The function "counting" calculates this number in the code.
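One way to compute "counting" is via the closed form of this arithmetico-geometric sum, obtained from the geometric series in the usual way; "uniquify" below then combines it with the bit shift of the previous paragraph. Both are my reconstructions, not necessarily the column's code:

```python
def counting(size, width):
    # 0 + 1*2^width + 2*2^(2*width) + ... + (size-1)*2^((size-1)*width),
    # as the closed form with x = 2^width; the division below is exact.
    x = 1 << width
    return ((size - 1) * x**(size + 1) - size * x**size + x) // (x - 1)**2

def uniquify(a, size, width):
    # (a << size) multiplies every element by 2^size in one shift; adding
    # counting() puts the index i into the freed low bits of field i.
    # Assumes each element times 2^size still fits in its width-bit field.
    return (a << size) + counting(size, width)
```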
This handles uniquification of the input array. Next, we want to perform all O(n²) comparisons together. We do so by producing two vectors, "repeated_array" and "wide_array". To make the explanation clearer, we will follow it with an example. Let us say that the uniquified array is [4, 9, 7]. "repeated_array" will be [4, 9, 7, 4, 9, 7, 4, 9, 7], whereas "wide_array" will be [4, 4, 4, 9, 9, 9, 7, 7, 7]. "repeated_array" is easy to produce by making copies of the entire array through multiplication by the appropriate "ones" vector. "wide_array" can be produced by first using "change_width" to widen the array into [4, 0, 0, 9, 0, 0, 7, 0, 0], then multiplying by a "ones" vector.
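Concretely, both can be produced by one multiplication each (a sketch built on the helpers above; the function names are mine):

```python
def repeated(a, size, width):
    # [a_0, ..., a_{n-1}] -> n back-to-back copies of the whole array.
    return a * ones(size, size * width)

def wide(a, size, width):
    # [4, 9, 7] -> [4, 0, 0, 9, 0, 0, 7, 0, 0] -> [4, 4, 4, 9, 9, 9, 7, 7, 7]:
    # widen so each element gets a whole block to itself, then fill the block.
    return change_width(a, size, width, size * width) * ones(size, width)
```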
After performing the "less than" operation, we need to sum all results relevant to a particular array element. So, once we have computed where "wide_array" is less than "repeated_array", we have, in our example, the vector [0, 1, 1, 0, 0, 0, 0, 1, 0], and we want to produce from it the vector [0, 2, 1] by summing into position i all the elements whose original positions are i modulo 3. (The result is [0+0+0, 1+0+1, 1+0+0].) In the original solution, we calculated each of these values separately by use of a modulo. To calculate all of them together, we simply use a larger modulo. Taking the result modulo "mask(size*width)" folds the length-9 vector into the appropriate length-3 vector.
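Continuing the sketch, with "vector" housing the uniquified input (and width chosen with the spare flag bit that "lt" needs):

```python
repeated_array = repeated(vector, size, width)              # [4,9,7,4,9,7,4,9,7]
wide_array = wide(vector, size, width)                      # [4,4,4,9,9,9,7,7,7]
flags = lt(wide_array, repeated_array, size * size, width)  # [0,1,1,0,0,0,0,1,0]
counts = flags % mask(size * width)                         # [0,2,1], all in O(1)
```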
The final step in the original sort is the array look-up. The original line was "sorted_array[num_less_than]=array[i]". When performing this in parallel, what we really need is a function that receives two vectors, a and b (in our case [4, 9, 7] and [0, 2, 1]), and produces a vector, result, such that result[b[i]]=a[i]. This sounds more difficult than it really is. First, we use our counter to create the vector [0, 1, 2] and multiply it by a "ones" vector to create the repeated version [0, 1, 2, 0, 1, 2, 0, 1, 2]. We will also need the "wide" version of vector b, which is [0, 0, 0, 2, 2, 2, 1, 1, 1]. Now, we find where these two vectors are equal ([1, 0, 0, 0, 0, 1, 0, 1, 0]), use this as a mask into the original "wide_array" (the "wide" version of vector a), and get the vector [4, 0, 0, 0, 0, 9, 0, 7, 0]. Lastly, a simple modulo operation folds this into the desired sorted array [4, 7, 9].
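Here is the whole look-up step as one sketch function, with the running example in the comments (my reconstruction, reusing the helpers above):

```python
def scatter(a, b, size, width):
    # Build `result` such that result[b[i]] = a[i], using O(1) operations.
    idx = repeated(counting(size, width), size, width)  # [0,1,2,0,1,2,0,1,2]
    wide_b = wide(b, size, width)                       # [0,0,0,2,2,2,1,1,1]
    wide_a = wide(a, size, width)                       # [4,4,4,9,9,9,7,7,7]
    eq = equals(idx, wide_b, size * size, width)        # [1,0,0,0,0,1,0,1,0]
    kept = wide_a & (eq * mask(width))                  # [4,0,0,0,0,9,0,7,0]
    return kept % mask(size * width)                    # [4,7,9]
```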
Again, the full code for this solution is available here.
(Incidentally, this was "reverse" array look-up, but clearly implementing the "forward" version, result[i]=a[b[i]], is equally straightforward.)
What is the "real life" complexity of the algorithm? Well, that depends on what you mean by "real life". This algorithm runs in the desired O(1) time in the described computational model: it uses only a fixed number of atomic operations. If we wish to consider its complexity under more stringent computational models, we should therefore consider the time it takes to accomplish the "heaviest" of these operations. This is (arguably) the multiplication of a vector housing n numbers by a vector housing n² numbers. If one were to perform this on vectors with minimized values of width (which we don't), it would be possible to compute the result of this single atomic operation with O(n³) multiplication/addition operations on integers of size comparable to that of the integers in the input array. I would normally refer to this as an O(n³) operation, but, of course, the actual complexity is a function of the computational model used, and different people will prefer different computational models, so there is no absolute answer to this question.