Search Algorithms

There are several types of search algorithms, and choosing among them means trading off speed, memory use and implementation complexity.

Linear Search

This is the simplest and most intuitive way of searching through data. We pick up the elements one at a time and check whether each is the element we want. If it is, we stop - else we continue. It is the easiest to implement, but not so good in its cost. In the best case, we get the result on the first attempt - in the worst case, we have to look through the whole list before we can conclude. Thus, it has cost O(N).
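The sequential scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; the function name `linear_search` and the convention of returning -1 for a missing element are assumptions made for the example.

```python
def linear_search(items, target):
    """Return the index of the first element equal to target, or -1 if absent."""
    # Look at the elements one by one, in order.
    for index, value in enumerate(items):
        if value == target:
            return index  # found it: stop immediately
    # We looked through the whole list without a match.
    return -1


if __name__ == "__main__":
    data = [7, 3, 9, 3]
    print(linear_search(data, 9))
    print(linear_search(data, 5))
```

Note that the data does not need to be sorted, which is the main reason linear search remains useful despite its cost.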

Binary Search

This works only if we have a sorted data set that is easy to index. We start with the middle element and compare it with our target. If it matches, we are done. Else, if the target is less than the middle element, we jump to the midpoint of the lower half (around the 1/4 mark), and if the target is more, we jump to the midpoint of the upper half (around the 3/4 mark). Thus, at each step, we decide whether to stop, move down or move up. Each time, the range left to search is cut in half.

With this, the best case is of course a hit on the first comparison, and the worst case needs about log2(N) comparisons, which is far fewer than N. The cost is O(log N).
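The halving procedure can be sketched as follows, assuming the data is held in a sorted Python list (which supports fast indexing). The name `binary_search` and the -1 return value for a missing element are choices made for this illustration.

```python
def binary_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if absent.

    Assumes sorted_items is sorted in ascending order.
    """
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2  # middle of the remaining range
        if sorted_items[mid] == target:
            return mid
        if target < sorted_items[mid]:
            high = mid - 1  # target can only be in the lower half
        else:
            low = mid + 1   # target can only be in the upper half
    return -1


if __name__ == "__main__":
    data = [1, 3, 5, 7, 9, 11]
    print(binary_search(data, 7))
    print(binary_search(data, 4))
```

For everyday use, Python's standard `bisect` module offers the same idea through `bisect.bisect_left`.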

Thus, the cost is not linear - it is logarithmic. But the two major requirements are that the data should be sorted and that it should be easy to index. If we have a singly linked list, we have to traverse from the first element to reach any index. In such a case, it would be a terrible mistake to use binary search. Similarly, if the data is not sorted, binary search cannot be relied upon, because it may skip right past the target.

Should we sort the data just for binary search? That is not as silly as it sounds. It depends upon the data size and the number of times we plan to search. When we plan many searches on the same data, it is often better to sort the data once before we start searching.
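A rough back-of-the-envelope comparison makes this trade-off concrete. The sketch below counts approximate comparisons only. It assumes a comparison sort costing about N log2(N), ignores constant factors and memory effects, and uses hypothetical helper names, so treat the numbers as an order-of-magnitude guide rather than a measurement.

```python
import math


def linear_cost(n, searches):
    """Approximate worst-case comparisons for repeated linear searches."""
    return searches * n


def sort_then_binary_cost(n, searches):
    """Approximate comparisons for one sort followed by binary searches.

    Assumes n >= 2 and a comparison sort costing about n * log2(n).
    """
    return n * math.log2(n) + searches * math.log2(n)


if __name__ == "__main__":
    n = 1_000_000
    for searches in (1, 100):
        print(searches, linear_cost(n, searches), round(sort_then_binary_cost(n, searches)))
```

Under these assumptions, sorting a million elements for a single search costs more than just scanning the list, while for a hundred searches the one-time sort pays for itself.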

Interpolation Search

This further extends the concept of binary search. Instead of jumping right to the middle of the list, it tries to estimate the position of the target based on the assumption that the data is uniformly distributed. That is, in a sorted list of 100 numbers that starts with 1 and ends in 100, it assumes that the number 40 would be around the 40th position. Thus it starts from there. Then, at each step, it estimates the position of the target again based on the two extremes of the remaining range.

If we have a good, uniform distribution of the data, this works wonderfully well and we can reach the target really fast. Its average cost on such data is O(log (log N)) - which is much better than binary search.

But the assumption here is that the data should be uniformly distributed. Else, the extra calculation for interpolation is wasted effort, and in the worst case the cost degrades to O(N).
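The position estimate described above can be sketched like this. It is a minimal illustration that assumes a sorted list of integers; the name `interpolation_search` and the -1 return value are choices made for the example.

```python
def interpolation_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if absent.

    Assumes sorted_items holds integers in ascending order.
    """
    low, high = 0, len(sorted_items) - 1
    # Keep going only while the target can lie between the two extremes.
    while low <= high and sorted_items[low] <= target <= sorted_items[high]:
        lowest, highest = sorted_items[low], sorted_items[high]
        if lowest == highest:
            # All remaining values are equal, so interpolation would divide by zero.
            return low if lowest == target else -1
        # Estimate the position by linear interpolation between the extremes.
        pos = low + (target - lowest) * (high - low) // (highest - lowest)
        if sorted_items[pos] == target:
            return pos
        if sorted_items[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1


if __name__ == "__main__":
    numbers = list(range(1, 101))  # 1..100, perfectly uniform
    print(interpolation_search(numbers, 40))
```

On the uniform list of 1 to 100, the very first estimate for 40 lands on the 40th element (index 39), which is the behaviour the text describes.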

Hash Key

As we saw above, searching requires a lot of effort in comparing the data. We count the complexity and cost of an algorithm based on the number of times we have to compare data elements. But, what if we simplify this comparison itself?

Hashing is an innovative concept that can have a fabulous impact on the search speed. Over the years, hashing has earned its place in many other domains (encryption, security, finance, communication...) as well. But this is where it started.

A hash value is a numeric value obtained for a chunk of data. The hashing algorithm itself is out of our scope. But, essentially, it treats the given data chunk as a sequence of bytes and works numerically on these bytes to get an integer. Of course, this integer is not unique. Just because the hash values match, it does not mean that the data chunks are equal. But, the way it is implemented, the values are spread fairly evenly over a wide range - so there are very few overlaps in a real-world application.

Here, we compute the hash (key) for every element in the available data (values) and note the relation between each key and its data. Since the keys are numbers, we can save them in an array. Now, when we want to search for an element, we start by searching for its hash key. Since this is a numeric search on an array, it can be performed pretty fast. Once we have identified the key, we can then look into the elements with that key and check whether any of them really matches our search object.
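One common way to realise this idea is a bucket array indexed by the hash value. The sketch below assumes that layout, uses Python's built-in `hash()` as a stand-in for the hashing algorithm, and introduces hypothetical names (`build_index`, `hash_search`, `bucket_count`) purely for illustration.

```python
def build_index(values, bucket_count=16):
    """Group values into an array of buckets, indexed by hash value."""
    buckets = [[] for _ in range(bucket_count)]
    for value in values:
        # Reduce the hash to a valid array position.
        buckets[hash(value) % bucket_count].append(value)
    return buckets


def hash_search(buckets, target):
    """Return True if target is present in the index built by build_index."""
    # Jump straight to the one bucket that could hold the target.
    bucket = buckets[hash(target) % len(buckets)]
    # Different values can share a bucket, so confirm with a real comparison.
    return any(value == target for value in bucket)


if __name__ == "__main__":
    index = build_index(["apple", "banana", "cherry"])
    print(hash_search(index, "banana"))
    print(hash_search(index, "durian"))
```

The final equality check matters: a matching hash only narrows the search down to one bucket, it does not prove that the data is equal.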