Project: Perfect Hashing on Strings Using Cichelli Method

Counting Words Using a Cichelli Minimal Perfect Hashing Function

* Taken originally from a problem developed by the University of Wisconsin


1.0 Problem Statement and Purpose

This project is meant to help you develop skills at working with hash tables and writing entire programs from scratch. In this project you will only be provided with a project description and list of program requirements. If you are doing well in the course up to this piont (you have an A- grade, or better), no code will be provided for this project. Otherwise, here is some additional code to get you started with your design. This means you will need to learn to design and implement an entire program on your own. The video, below, provides a description of the problem statement:


2.0 Project Description

2.1 Basic Program
For this project you will design a program that searches a text file looking for key words, keeps statistics on the results of the search, and prints out these statistics. To make this work, you will need to read all the key words from a file, build a minimal perfect hash table, and then start reading the contents of a second text file and look for these key words. The following sections examine each of these steps.

2.2 Reading from a File
To read from a file you must first create a FileReader object and then pass it to a BufferedReader object. At this point you can start reading from the file using the readLine() method. A few things to be careful of, the creation of a FileReader object can throw a FileNotFoundException. You are required to catch this exception, so you must create the FileReader object inside of a try block. Also, the readLine() method can also throw an exception. The exception it throws is IOException and you must catch this, too.

This all sounds a lot more confusing than it really is. Here is a simple example that opens a file, reads each line, and then prints them all out. Notice that readLine() returns null when there is no more data in the file.

import java.io.*;

class Tester {
	 public static void main(String[] args) {
		  // validate command line arguments
		  if(args.length != 1) {
				System.err.println("Error: usage - Tester file");
				System.exit(1);
		  }
		  
		  try {
				// open up the file for reading
				FileReader input = new FileReader(args[0]);
		      BufferedReader in = new BufferedReader(input);
		  
				// now read each line and print it out
				String line;
				while((line = in.readLine()) != null)
					 System.out.println(line);
		  } catch(FileNotFoundException e) {
				System.err.println("File " + args[0] + " not found.");
				System.exit(1);
		  } catch(IOException e) {
				System.err.println(e);
				System.exit(1);
		  }
	 }
}

If you want, you can cut and paste this program and run it exactly as it is. That is about all there is to opening and reading files in Java.

2.3 The Hash Function
For this project you will be using Cichelli's Method to build a minimal perfect hash table. How to build this table is discussed in the next section. Here we describe the hash function itself.

h(word) = (length(word) + g*(firstletter(word)) + g*(lastletter(word))) % size

The following table is a description of each term in this equation:

TermDescription
h(word)This will be the index into the hash table for the word.
length(word)The number of characters in the word.
gg is the value associated with a given letter. The value ranges between 0 and some maximum value. The max value is one of the inputs to the Cichelli Algorithm.
firstletter(word)The character that is in the first position of the word.
lastletter(word)The character that is at the end of the word.
sizeThe total number of key words. It is also the number of elements in the hash table.

To compute the hash function for any integer is going to require a number of data structures. Obviously, you must have the array that represents the hash table. This array will be filled with the actual key words. Each key word should appear only once in the hash table. Secondly, you will need an array to store the g value of each letter that appears in the beginning or end of a word.

2.4 Building a Minimal Perfect Hash Table
This is one of the major components of your project. You will need to open the key word file provided by the user on the command line and read each of the words into an ArrayList. Once you have them all read, you can start to build your hash table and define your hash function. You will accomplish this by using the Cichelli Method discussed in class. The ArrayList that all the key words were initially read into will be the initial word list for the algorithm. A review of the algorithm can be found in the notes on Perfect Hashing earlier in this course.

2.5 Counting Key Words
Once this minimal perfect hash table is constructed, you are ready to begin reading the text file and counting key words. This should be a fairly simple process. You will read a line from the text file, break the line into tokens (one token for each word in the line), and then examine each token. (What String function would most easily accomplish this?) To examine a token, simply pass it through the hash function and see if the word is currently in the hash table.

2.6 Statistics

Another major part of this project is recording statistics. A summary of all the statistics you must keep are presented in the following table:

StatisticDescription
linesThis is the total number of lines read by your tester program. Blank lines should not be counted in this total. Only lines that actually have at least one word on them should be counted.
wordsThis is the total number of words checked (key words plus non- key words). A word does not have to be in the dictionary. A word is considered to be any string of consecutive characters with no white space in between them. In other words both "hello" and "sadflk" are considered to be words by this program.
keyWordsThis is the total number of key words found in the text. Words should only be counted if they are an exact match to the keyword given in the file. In other words, "Alabama" is not the same as "alabama". Hopefully, this will make things easier.
TimeYou need to record the total time of your program. This time should include the time to read the key words, create the hash table, read the text file, and record all the statistics. It is suggested that you start a timer as one of the first things you do in the program and stop it immediately before printing out the statistics. The following is a little sample code that shows how you can time events. You need to use the Date class to do it so you may want to check out the documentation on this method at the Java API web site. Do not worry about the Thread.sleep() method in the program - you do not (and should not) be using this function. It only appears here to simulate some kind of work being done.

import java.util.*;

class Timer {
   public static void main(String[] args) {
      try {
         // start the timer
         Date timeStart = new Date();

         // do the actual program work
         Thread.sleep(2000);

         // stop the timer
         Date timeEnd = new Date();

         // calculate the total time and print the result
         long totalTime = timeEnd.getTime() - timeStart.getTime();
         System.out.println("Total Time: " + totalTime);
      } catch(InterruptedException e) { }
   }
}

Again, you can copy and run this code exactly as it appears to see what the output is.

When the program is finished reading the text file, it should print out a list of statistics. The output should look exactly like the following:

**********************
***** Statistics *****
**********************
Total Lines Read: xxx
Total Words Read: xxx
Break Down by Key Word
     key1: xx
     key2: xx
     key3: xx
       .
       .
       .
Total Key Words:  xxx
Total Time of Program: xxxxx milliseconds

The x's should be replaced with actual numbers and key1,2,3,... should be replaced with the actual key words. The number following the key word is to indicate how many times that key word appeared in the ananlyzed text.

3.0 Simplifications

  1. All punctuation has been removed from the input file; This will make it easier for you to treat the words "done", "done,", "done?", "done!", "done.", etc. identically. 
  2. You can assume that the input file is formatted properly with one word (token) per line;
  3. There will be no blank lines in the input file;
  4. The input file contains only lowercase letters.

4.0 Program Design Tips

The number one rule about writing a program from scratch is not to write a single line of code until you have developed a good design. Developing this good design is probably the hardest thing to learn how to do in programming. Here are a few of my suggestions on how you should go about designing and writing this project.

  1. Read the program through once to get an idea of what is expected. Then, read it again. The second time through you may want to take some notes. At this point, you should have a thorough understanding of what is expected of you. If you need to read it a third time, do so.
  2. Do a very rough design. This basically means figuring out what classes you are going to want to create and what the purpose of each class will be. For this project, I can think of at least three classes you should build. A class that contains your main() method. A class for the Cichelli hash table. And another class to hold, calculate, and print the statistics. You may also want additional classes for parsing files or for some other kind of work. It is completely up to you; however, be sure you know why you need each class and what its purpose will be for.
  3. Once you have a list of your classes, go through each class and determine what functions it will need and what each of these functions will do. This should include information about what parameters each function will take and what type of value it will return.
  4. Then go through each function and write psuedo-code describing how the function will flow. This is a very critical step because if it is done properly, your code writing will be much simpler. A good piece of advice at this stage is when you are examining a function, assume all the other functions already work - even if you have not written the pseudo-code for a function. If you did Step 3 properly, you will know how each function should work, what arguments it will take, and what values it will return.
  5. The very last step is to go ahead and write your project up. Do not write all the code at once. Write functions one or two at a time, test them and make sure they work as you expect and then go on to the next function or two. It is always best to start with the lowest level functions first and work your way up. In other words, if you have some function that does not call any other of your functions, that is the one you want to start with. You can then test it and make sure it works (this often requires a little creativity). Once you have all of these functions written, you can then move up the chain to functions that only call these lowest level functions. Once you finish all of these, move up another level. Continue to do this until all of your functions are written.

One of the things you will notice about this design is that Step 4 calls for a top-down approach (describe all of the higher level functions assuming the lower level ones are done and then move down to the next level). Where as, Step 5 calls for a bottom-up approach (write and test all of the low level functions before moving on to the higher level functions). I think if you follow this approach, you should find your programs (for this or any other class) get written much more quickly and with far fewer bugs.

5.0 Running Your Program

Once your program has been compiled, it should be run in the following manner:

prompt> java WordCount keyWordFile textFile

The key word file should contain the list of words to search for. The text file is the file you will search looking for the key words. A sample of each file can be found below:

Key Word File
Text File
Results

Additional key word and text files will be provided later, but they will all have a similar format.

6.0 Handing in Your Project

After you have demoed your code to your instructor, place a copy of your code in your on-line notebook under the pages marked "Hashing."

7.0 Grading

The project will be graded out of 100 points and be graded on the following criteria:

  • Non-trivial program compilation
  • Correct Output
  • Performance (running time)

We will grade your code by running it against a series of test files and checking the output of the program against a known, correct implementation. Any differences in your output versus that of the correct implementation will result in a point deduction. Currently, the only test files provided are keys-1.dat and text-1.dat. Links to these files can be found in Section 4.0 above as well as the results of running this test. Without a doubt, much more in depth tests will be run than this initial one provided. This means you should write some of your own test files and check for any errors.

We will also examine parts of your code to make sure that you are implementing the data structures as required. If you are getting the correct results but not implementing the proper data strcucture, your score will suffer severely. You must implement the proper data structures.