Environment: Windows 9x, Turbo C, or the MinGW port of GCC.
Preface
My previous article on Insertion Sort (you can find it at www.codeguru.com/algorithms/kmethod.html) drew a lot of heated comments (especially from Peter). That prompted me to write about a more interesting and comprehensive sorting method, and the result is this article. In it I have tried to give in-depth coverage of the entire sorting algorithm; I hope Peter won't mind reading it. The article assumes that you don't already know about iteration, looping, and so forth; hence, it explains these in detail first.
Iterations
While designing and writing applications for the end user, we sometimes find that a set of statements is repeated a pre-determined number of times. When such a condition arises, we use a loop. Some people may ask why functions are not used instead, as they too are executed repeatedly throughout the life span of the program (application). The main difference is that a function is a block of statements that performs a single cohesive task. For example, we may have one function for reading input from the user, while another displays the same input back to the user. Because each of these tasks is cohesive, we write them as functions (hence the modular approach), and from the main function (where the actual execution takes place) we call these newly written functions as many times as we need by means of a loop's iterations.
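To make the idea concrete, here is a minimal sketch in C (the names read_number and show_number are only illustrative, not part of any library): each function performs one cohesive task, and a loop in main() calls the pair a pre-determined number of times.

#include <stdio.h>

/* Each function performs one cohesive task. */
static int read_number(void)
{
    int value = 0;
    printf("Enter a number: ");
    scanf("%d", &value);
    return value;
}

static void show_number(int value)
{
    printf("You entered: %d\n", value);
}

int main(void)
{
    int i;
    for (i = 0; i < 3; i++)      /* the loop repeats the same pair of calls */
    {
        int n = read_number();
        show_number(n);
    }
    return 0;
}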
Recursion
The simplest definition of recursion is "calling a function from within itself (directly or indirectly) is called recursion." Most programmers say that recursion is a luxury in a programming language, but the fact is that recursion is essential for solving some problems. A problem that can be solved by recursion can often (though not always) be solved by iteration as well, but the iterative version may become too cumbersome to understand and debug. Recursive functions fall into two categories, namely primitive recursive functions and nonprimitive recursive functions. Primitive recursive functions are those that can normally also be solved iteratively (examples are the factorial and the sum of digits), whereas nonprimitive recursive functions cannot be solved iteratively, at least not easily (an example is Ackermann's function). A general algorithm model for any recursive procedure (whether primitive or nonprimitive) contains the following steps:
Prologue >> Save the parameters, local variables, and return address.
Body >> If the base criterion has been reached, then perform the final computation and go to Step 3; otherwise, perform the partial computation and go to Step 1 (initiate a recursive call).
Epilogue >> Restore the most recent saved parameters, local variables, and return address. Go to this return address.
A more detailed discussion of the same is beyond the scope of this article. Have a look at the following code to see a recursive function in action:
/* Returns the sum of the decimal digits of n (a primitive recursive function). */
int fun(long n)
{
    if (n >= 10)
        return (int)(n % 10) + fun(n / 10);   /* partial computation plus a recursive call */
    else
        return (int)n;                        /* base criterion: a single digit */
}
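For contrast with the primitive recursive example above, here is a sketch of Ackermann's function, the nonprimitive recursive function mentioned earlier. It has no straightforward iterative formulation and grows so fast that it should only be tried with very small arguments.

/* Ackermann's function: a nonprimitive recursive function.
   It grows extremely fast; only call it with very small m and n (say m <= 3). */
long ackermann(long m, long n)
{
    if (m == 0)
        return n + 1;
    else if (n == 0)
        return ackermann(m - 1, 1);
    else
        return ackermann(m - 1, ackermann(m, n - 1));
}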
Time Complexity
Whenever an algorithm is developed, it is put through several kinds of analysis. The most obvious, correctness, can be checked simply by tracing the algorithm, reading it for logical correctness, implementing it and testing it on some data, or using mathematical techniques to prove it correct. The other factor that is widely used to judge the performance of an algorithm is its time complexity.
Time complexity is a measure of how much time the algorithm takes to finish. It is almost impossible to calculate the exact time for an algorithm (because of the variety of machines available); however, we can always obtain an estimate. Usually, this is most easily done by isolating a particular operation, sometimes called the active operation, that is central to the algorithm and that is executed essentially as often as any other part (for example, the body of the inner and outer iterations). The other operations in the algorithm (the assignments, the manipulation of indexes, and so forth) occur no more often than the body of the inner iteration; collectively these are called bookkeeping operations and are generally not counted.
Let us consider, for example, a simple sort in which two nested loops are required, both running from 0 to N-1, that is, N times each. Together they execute the inner body N * N times, so the time complexity of such an algorithm becomes N²; in other words, O(N²).
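As an illustrative sketch (not part of the original discussion), the following selection sort counts its key comparisons, the active operation of the algorithm. For N records the count is exactly N*(N-1)/2; for N = 100 that is 4,950 comparisons, which is why the method is O(N²). The bookkeeping operations (index updates, the final exchange) are not counted.

/* Selection sort whose active operation is the key comparison in the
   inner loop; *comparisons records how often it executes.
   For n records the count is n*(n-1)/2, hence O(n^2). */
void selection_sort(int k[], int n, long *comparisons)
{
    int i, j, min, tmp;

    *comparisons = 0;
    for (i = 0; i < n - 1; i++)            /* outer iteration */
    {
        min = i;
        for (j = i + 1; j < n; j++)        /* inner iteration */
        {
            (*comparisons)++;              /* the active operation */
            if (k[j] < k[min])
                min = j;
        }
        tmp = k[i]; k[i] = k[min]; k[min] = tmp;   /* bookkeeping: exchange */
    }
}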
Fundamental Concept of Quick Sort
This sorting method performs very well on larger tables. At each step in the method, the goal is to place a particular record in its final position within the table. In so doing, all records that precede this record have smaller keys, while all records that follow it have larger keys. This technique essentially partitions the table into two subtables (and hence is also called partition-sort). The same process can then be applied to each of these subtables and repeated until all records are placed in their final positions.
Each time a table is partitioned into two smaller subtables, these can be processed in turn by using a recursive approach. A general algorithm based on a recursive approach follows:
- Partition the current table into two subtables.
- Invoke Quicksort to sort the left subtable.
- Invoke Quicksort to sort the right subtable.
Algorithm
Given a table K of N records, this recursive procedure sorts the table in ascending order. A dummy record with key K[N+1] is assumed, where K[I] <= K[N+1] for all 1 <= I <= N. The integer parameters LB and UB denote the lower and upper bounds of the current subtable being processed. I and J are used as index variables to select keys. KEY contains the key value that is being placed in its final position within the sorted table. FLAG is a logical variable that indicates the end of the process that places a record in its final position. When FLAG becomes false, the input subtable has been partitioned into two disjoint parts:
QUICK_SORT(K, LB, UB)

1. [Initialize]
       FLAG <- true

2. [Perform sort]
       If LB < UB
       then I <- LB
            J <- UB + 1
            KEY <- K[LB]
            Repeat while FLAG
                I <- I + 1
                Repeat while K[I] < KEY (scan the keys from left to right)
                    I <- I + 1
                J <- J - 1
                Repeat while K[J] > KEY (scan the keys from right to left)
                    J <- J - 1
                If I < J
                then K[I] <-> K[J] (interchange records)
                else FLAG <- false
            K[LB] <-> K[J] (interchange records)
            Call QUICK_SORT(K, LB, J - 1) (sort first subtable)
            Call QUICK_SORT(K, J + 1, UB) (sort second subtable)

3. [Finished]
       Return
The procedure is initially invoked by the statement Call QUICK_SORT(K, 1, N).
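For readers who prefer C to the algorithmic notation, here is a sketch of the same procedure adapted to 0-based arrays. Instead of a dummy record K[N+1], the left-to-right scan is bounded by ub explicitly; otherwise the partitioning follows the steps above.

#include <stdio.h>

static void swap(int *a, int *b)
{
    int t = *a; *a = *b; *b = t;
}

/* Sorts k[lb..ub] (inclusive bounds) in ascending order. */
void quick_sort(int k[], int lb, int ub)
{
    int i, j, key;

    if (lb >= ub)                 /* subtable of 0 or 1 records: nothing to do */
        return;

    i = lb;
    j = ub + 1;
    key = k[lb];                  /* the record being placed in its final position */

    for (;;)
    {
        do { i++; } while (i <= ub && k[i] < key);   /* scan keys from left to right */
        do { j--; } while (k[j] > key);              /* scan keys from right to left */
        if (i < j)
            swap(&k[i], &k[j]);   /* interchange out-of-place records */
        else
            break;                /* the scans have crossed */
    }
    swap(&k[lb], &k[j]);          /* drop KEY into its final position j */

    quick_sort(k, lb, j - 1);     /* sort first subtable */
    quick_sort(k, j + 1, ub);     /* sort second subtable */
}

int main(void)
{
    int keys[] = { 42, 23, 74, 11, 65, 58, 94, 36, 99, 87 };
    int n = (int)(sizeof keys / sizeof keys[0]);
    int i;

    quick_sort(keys, 0, n - 1);
    for (i = 0; i < n; i++)
        printf("%d ", keys[i]);
    printf("\n");
    return 0;
}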
Time Complexity Analysis
Now, let us focus on the time complexity of the Quick sort algorithm. The running time of the procedure QUICK_SORT is given by
T(N) = P(N) + T(J-LB) + T(UB - J)
where P(N), T(J-LB), and T(UB-J) denote the times to partition the given table, sort the left subtable, and sort the right subtable, respectively. Note that the time to partition a table is O(N). The worst case occurs when, at each invocation of the procedure, the current table is partitioned into two subtables with one of them being empty (that is J = LB or J = UB). Such a situation, for example, occurs when the given key set is already sorted. The sorting of the example key set {11, 23, 36, 42} would yield the following sequence of partitions:
11 {23 36 42}
11 23 {36 42}
11 23 36 {42}
11 23 36 42
The point to be noted here is that the present method of sorting is no better than the selection sort. The worst case time analysis, assuming J = LB, then becomes
Tw(N) = P(N) + Tw(0) + Tw(N-1)
      = c*N + Tw(N-1)
      = c*N + c*(N-1) + Tw(N-2)
      = c*N + c*(N-1) + c*(N-2) + Tw(N-3)
      ...
      = c * {N(N+1)} / 2
      = O(N²)
The worst case can sometimes be avoided by choosing the record for final placement more carefully at each stage. Instead of always choosing K[LB], as was done in the previous approach, we could choose a random position in the interval [LB, UB]. Another approach is to take the middle position in the interval, that is, [(LB + UB) / 2]. Finally, the position could be chosen to be the median of the keys in the interval, although this option is costly.
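As a sketch of two of these strategies (again assuming the quick_sort() given earlier), choosing a different record for KEY only requires swapping it into position lb before partitioning; the rest of the procedure is unchanged.

#include <stdlib.h>   /* rand(), used by the random-pivot variant */

/* Middle-position pivot: KEY becomes the record at [(lb+ub)/2]. */
void pivot_middle(int k[], int lb, int ub)
{
    int mid = lb + (ub - lb) / 2;
    int t = k[lb]; k[lb] = k[mid]; k[mid] = t;
}

/* Random pivot: KEY becomes a record chosen at random from [lb, ub]. */
void pivot_random(int k[], int lb, int ub)
{
    int r = lb + rand() % (ub - lb + 1);
    int t = k[lb]; k[lb] = k[r]; k[r] = t;
}

Calling pivot_middle(k, lb, ub) (or pivot_random) as the first statement inside the lb < ub branch of quick_sort() is all that needs to change.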
The best case analysis occurs when the table is always partitioned in half, that is, J = [(LB+UB)/2]. The analysis becomes:
Tb(N) = P(N) + 2*Tb(N/2)
      = c*N + 2*Tb(N/2)
      = c*N + 2*c*(N/2) + 4*Tb(N/4)
      = c*N + 2*c*(N/2) + 4*c*(N/4) + 8*Tb(N/8)
      ...
      = (log₂N)*c*N + 2^(log₂N)*Tb(1)
      = O(N log₂ N)
The average case analysis (albeit difficult to evaluate) also comes out to be O(N log₂ N). I wrote this article after a detailed study of the various sorting methods, and my work is influenced by Paul G. Sorenson and Jean-Paul Tremblay of the Department of Computational Science, University of Saskatchewan, Saskatoon.