Searching Algorithms


ICSTARS
Kurt Stephens ks.icstars_AT_kurtstephens_DOT_com
2002/04/03

Overview

Searching is probably the most common task in computing.
Searching is probably the most useful thing that computers can do for people.
Searching is directly related to sorting and the data structures you are searching through.
Some searching algorithms are better than others; some can only be implemented with certain data structures and/or data ordering.
Some searching algorithms are harder to code than others.
Some searching algorithms are good for small data sets, some are better for big data sets.
Some searching algorithms are already implemented for you.
Some searching algorithms are better or worse for the job at hand.
Some searching algorithms and data structures are better than others if you are changing your data sets more often than you are searching.
There is no right way to search.  
This lecture will present some basic searching algorithms and data structures.

Terms

Array
Linked-List
Doubly-Linked-List
Element
Tree
Key
Linear Search
Binary Search
Tree Search
Hash Search
Hash Function
Hash Table
Comparison
Iteration
Initialization
Termination
Pseudo-Code

Linear Search

Close your eyes.
Imagine a library.
Imagine you are looking for a book in the library by its title.
Imagine a library filled with books on shelves without any organization or card catalog.
Imagine the librarians are on strike.
Imagine having to search through all the shelves, book by book,
to find the book your looking for.
Imagine that the books are on the shelf with the spine facing away from you;
You must take each book of the shelf to examine its title.
You might find the book on the first shelf or the last or not at all.
Imagine the library has only 5 books.
Imagine the library has 50 books.
Imagine the library has over 500,000 books.
Imagine it takes 1 second to get the book off the shelf and compare the title.

Imagine book slots on all the shelves each big enough to hold one book.
Imagine each book taking up only one slot.
Imagine no empty slots in the entire library.
Imagine a unique sequential number for each slot, starting with slot #0.

Pseudo-code:
  1. Start at book slot #0.
  2. If we are at the end of the library, we did not find a book.
  3. If book in the book slot has the right title, we found the book.
  4. Move to the next book.
  5. Goto step 2
The steps from 2 and 5 are part of one iteration of the algorithm.
Step 1 is the initialization of the algorithm, it sets up the starting conditions for the algorithm.
Step 2 and 3 are both potential terminators of the algorithm, they control when the algorithm terminates.
Step 2 terminates the algorithm when we do not find a matching book;
Step 3 terminates the algorithm when we find a matching book.
Remember: an algorithm that never terminates is useless.

Q: So how do we represent this problem (data structure) and solution (algorithm) in a computer program?
A: The library is an array or a linked-list.  The book slots are elements in the array or linked-list.  The books are the data in the elements.

Imagine that the library that you go to usually has the books you're looking for.
Imagine that the library usually does not have the books you're looking for.
Imagine the time you could waste just trying to find out if you should try a different library (or amazon.com)?

Searching to find out something is not there is sometimes just as important as finding out if something is there.

Q: On the average how many comparisons will linear searching use to find an element?
A: N/2 where N is the number of elements.

Q: On the average how many comparisons will linear searching use to not find an element?
A: N, where N is the number of elements.  Actually, it's always N.

There must be a better way.
Good librarians sort the books on the shelves.
Why?  Because sorting and organizing is good for searching.
In fact searching and sorting are so closely related that most people sort by searching.

Binary Search

Imagine the books have been painstakingly sorted by title.
If you know that "Java" is before "M" (the middle of the alphabet) you can concentrate your search in the first half the books, books between "A" and "M".
If you know that "Java" is after "F" (the first quarter of the the alphabet) you can concentrate your search in the second quarter of the books, books after "F" but before "M".
Pretty soon you'll narrow your search to one book or no books.

If the data you are searching through is in a particular order searching can be done a lot quicker for big arrays.
 
Binary search is an algorithm that can find a element in a sorted array quicker than the linear search by comparing the middle element of the array and deciding which half of the array to continue searching.  
It is a "divide-and-conquer" algorithm; it splits the array into smaller and smaller arrays until we get to the matching element.  
It's called binary search because the element we are looking for is either on the left (0) or right half (1) of the array at each iteration.

Imagine an array containg 10 letter elements: "A", "D", "F", "H", "K", "M", "O", "R", "T", "Z".
We want to find "R".

"[" marks the left side of the array, "]" marks the right side of the array, bold mark the middle element, each iteration finds the middle element, compares it with the key ("R"), and selects a new left or right side of the array to search.


Elements
Iteration
0
1
2
3
4
5
6
7
8
9
1
[A
D
F
H
K
M
O
R
T
Z]
2






[O
R
T
Z]
3






[O
R]



Fig. 1: Binary search only takes 3 iterations and comparisons to find "R";
a linear search would take 8 iterations and comparisons.


Pseudo-code:
  1. Start with whole array from the beginning (0) to the end (books.length)
  2. Find the element in the middle of the subarray.
  3. Compare the key to the middle element.
  4. If if the key less than the middle element,
    Use the subarray on the left of the middle element.
  5. If the key is greater than the middle element,
    Use the subarray on the right of the middle element.
  6. If the key is the same as the middle element,
    Return the matching element.
  7. If the new subarray has at least one element,
    Go back to step 2 with the new subarray.
  8. Otherwise, there is no match.

Q: On the average how many comparisons will binary searching use to find an element?
A: log2 N, where N is the number of elements.  A binary search of 65,000 elements only takes at most 16 comparisons.

Q: On the average how many comparisons will binary searching use to not find an element?
A: log2 N, where N is the number of elements.

Q: Can you do a binary search on a linked list?
A: You could try it, but you'll spend a lot time scanning though your linked list to find the i'th elements.  Since the first scan to reach the first middle element will cover half the elements, you might as well do some comparisions.  Binary search is good for random-access containers like arrays; linked lists are sequental-access containers.


Binary Search #1
Fig. 2: Binary Search for "M"

Binary Search #2
Fig 3: Binary Search on a Bigger File

Binary Tree Search

A binary tree search requires that elements be stored in a tree data structure.  The tree data structure is such that for any node in the tree, all the elements of left subtree are less the element in the node, and all the right subtree elements are greater than, or equal to, the node element.

Searching a tree is very similar to the binary search because a sorted tree is similar to a sorted array.

Q: On the average how many comparisons will a tree search take to find an element?
A: That depends on what the tree looks like.

Q: What is the most difficult thing about using a tree for searching?
A: Making sure the tree remains sorted and balanced.  A search through a poorly balanced tree looks like a linear search through a linked list.

Q: How do you make a balanced tree?
A: Lazy answer: Wait till the "Sorting" lecture.
    Simple answer: Sort the list (or randomize the list) then build the tree starting with the middle (median) element as the root node.
    Hard answer: Get the Knuth books.

Q: Can you make a tree with an array?
A: Yes.  Wait till the "Sorting" lecture.



Insert Binary Tree
Fig 4. Inserting into a binary search tree.

Build Binary Tree
Fig. 5:  Building a binary search tree

Delete Binary Tree
Fig. 6: Delete from a binary search tree

Wide Tree Search

Actually this is not a binary radix search but a wide tree search.  Maybe I should cut this out, but it leads into hashing nicely.
   
Imagine 26 shelves; each shelf is labeled with a letter and a letter code and has all the books beginning with the first letter from the title.
Imagine 26 bins on each shelf; each bin is labeled with a letter and a letter code and has all the books with the second letter from the title.
Imagine 26 smaller bins in each bin; each smaller bin has all the books with the third letter from the title.
etc.
Imagine trying to find a book in this library.

Imagine a book titled "Java" in this library.
It would be on:

Wide tree search is similar to a tree search; but on a tree with an array of subtrees for each subkey (or letter, in this example).

Q: On the average how many comparisons will a wide tree search take to find an element?
A: Depends on the length of the title.

Radix sorting is related to wide tree searching but it uses a fixed number of bins and does not require a tree.

Q: How much memory would a wide tree use?
A: There could be a lot of waste space in the nodes further down the tree. A wide tree for 3-character keys might need 26 * 26 nodes, each node having 26 subnodes; 17,576 array elements.

Wide tree are not very useful because they tend to use a lot more space than a binary search tree would, without that much speed benefit unless you have short keys.
Imagine if we could give just give each book title its own bin.  
What if the bin for "Java" was simply book slot #100122?
It could really make things quicker if we had a unique, "special" number for each book title.

Hash Search

Imagine again the slot number of each book slot on all the shelves as a unique number.
Similar to the index of an element in an array.
If we had a "special" number for each unique book title we could put each book in the book slot for its "special" number.

First you need an algorithm to give each book title a "special" number.
This type of algorithm is called a hash function.
A simple hash function for a book title string might be to sum all the character codes in the string.
 
Since each string will probably get a different number with a hash function you could use the result of the hash function
as the index to another array. 
An array containing elements indexed by there hash functions is called a hash table.  
The index into the hash table is the hash index.
The hash index is hash(key) MOD hash_table.length.
 
If you have 100,000 books and the longest book title is only 10 characters,
you can (usually) find any book without any searching, just for the cost of computing the hash function of 10 characters.

Q: How long does it take to find to find an element using hash searching?
A: The cost of computing the hash function.

Q: How long does it take to not find an element using hash searching?
A: The cost of computing the hash function.

Q: What would happen if two book titles had the same hash value?
A: You would get a hash collision.  
    Can you find the hash collision in the example code and data?  
    Hard Hint: Addition is commutative.
    Easy Hint: What two book titles have the exact same characters but in a different order.

Q: How can you avoid a hash collision?
A: You can:
   1. Make your hash table bigger (or smaller!)
   2. Make a better hash function or a perfect hash function.
   3. Keep a list of all the elements with the same hash value.
   4. Keep scanning the hash table until you find the matching element or an empty element.
   5. All of the above.

Q: If hash tables are so great, why use anything else?
A: Hash tables have high relative overhead if your table is really small.  The cost of making the hash table might outweigh the search speed-up.

Q: Why don't we just store books using ISBN numbers as book slot numbers?
A: If you only had a few books, you would have a really big, mostly empty library, high lighting bills and a big rent to pay each month.

Hash Chaining
Fig. 7: Hash Chaining

Searching Tools

Resources

Book.java

Searching.java

References

"Algorithms", Robert Sedgewick -- covers many different data structures and algorithms with a simple style, not to heavy on the math.

"The Art Of Computer Programming", Vol 1-3, Donald Knuth -- The "be-all", "end-all" of algorithm books, very heavy on higher math, not for weak constitutions, you've been warned, ask for a copy for your birthday.

Web Resources for Data Structures Animations and Tutorials --  http://www.cs.virginia.edu/~mab8e/structures.html


$Id: Searching.html,v 1.15 2002/07/10 07:04:58 stephens Exp $