Hashing

What is Searching?
Searching is the process by which sets of data are looked through in order to find a specific value. For example, if I had an array of 10 animals;
[ Rabbit, Deer, Ox, Hamster, Giraffe, Sloth, Turtle, Yak, Elephant, Crow ]
... And I wanted to find the position of the Yak in the array, I would need to search. Now, the simplest way to do this would be Linear Searching.

Linear Searching

In Linear Searching, you would tell the computer to look at each element in the array, one by one, until it finds the correct element it is looking for. This method may be inefficient for larger data sets, but ensures that if an element is not found through the search, we can guarantee that it does not exist (as we looked at every element). 
Pseudo code:
search = "Yak" For x in array If x == "Yak" Return x Else Return null
Here are example animations of Linear Searching:
Element Found:


Element Not Found:

Binary Searching

Binary Searching is another way to search a data set. This can be more efficient than Linear Searching, but in order for it to work, the data set must be already sorted. Binary Searching works like a game of Hot or Cold; It will choose an element from the array (usually midmost), ask if the value is more or less, then divide the array to either the left or the right side of the mid element. It will repeat this until it is either found or determined nonexistent in the array. You can see why this tends to be more efficient than Linear searching: rather than looking at all 100 elements, right off the bat you rule out of half the array based on a single comparison. That being said, remember that the data has to be sorted first. This is a disadvantage to Linear Searching, which can perform the search on any array, sorted or not.

Pseudo code:
bSort(int lo, int hi, array[], int search) int mid = (lo + hi)/2 if mid == search return mid if array[mid] > search bSort(lo, mid, array[], search) else bSort(mid, hi, array[], search)

Here is an example animation of Binary Searching:

Element Found:



Element Not Found:


Searching Quiz

In this quiz, you will be asked questions regarding the two searching methods previously talked about. 
  1. 1 point
    Does Linear Search go through each element one by one?
  2. 1 point
    Is Linear Search fast or slow (in an average amount of data)?
  3. 1 point
    What size of data is Linear Search best at searching through?
  4. 1 point
    What searching method plays "Hot and Cold" with the data set?
  5. 1 point
    What searching method is faster with a large amount of data?


What is Hashing?

Hashing is a way of assigning a set of data, a file, a program -- any set of digital data -- a unique fingerprint. 
What makes up a good fingerprint? 
  • It HAS to be unique; I shouldn't have the same fingerprint as someone else.
  • It HAS to be secure; I shouldn't be able to see a fingerprint and immediately know whose it is.
  • It HAS to be easy to utilize; I shouldn't have to run a DNA test to determine the person first.
  • It HAS to be uniform to others; I shouldn't have a fingerprint that covers my entire body, and someone else shouldn't have a fingerprint that covers half their thumb. 
Why do we even hash at all? Well, think of it this way. Let's say that there was some sort of crime scene, and we had to identify the criminal. Would we want to find him fast, or slow? Obviously, we would want to find the criminal as soon as possible. We would't want to interrogate every person to exist to find the criminal, as that is most likely impossible to do, not to mention time consuming. So, how do we analyze crime scenes? Fingerprints! If the forensics department finds a fingerprint in the crime scene, they can compare it to every fingerprint of who they think is a suspect, and determine which person it is. Again, this is much quicker than interrogating ever person.
Hashing is the same thing as analyzing fingerprints. Essential, you are assigning a unique ID to digital data in order to reference and check later on. Suppose you are in the secret sector of the government, working on a top-secret project with a colleague named John. You have a picture of an alien, and you want to send it to John. But, John gets the photo and says it doesn't look correct. You could look at every pixel in the photo side by side, comparing it, but that would take forever. With hashing, you could assign a unique ID to your photo, and then have John check his photo's ID and compare the two. If they are the same, that means that the files are identical. However, if your ID's are different, it means that somehow the photo was altered.
A more common and practical example are programs. Sometimes you will notice if you download a program off of a website, you will see something that says, "MD5: ~~~~~~~~" or, "SHA-256: ~~~~~~~~". This is an example of hashing. Ironically, a website that hosts an MD5 check tool has just this:



Another practical use are passwords. Your email account provider (Google, Yahoo, Outlook, etc.) probably does not store your password. If you sign up and choose the password "password1234", that doesn't necessarily mean they have a .txt file with all of the users passwords letter for letter. If they do, that is a bad thing in terms of hacking. So if you forget your password for a website, you shouldn't get your password emailed to you. If you do, that means the website has no security, and is prone to hacking. Get off of that website before you lose your account.
What makes hashing secure though? Hashing functions are algorithms that convert digital data to a unique fingerprint, usually a combination of letters and numbers. Like the list before, a good fingerprint has to be unique. A picture of an orange cat and a picture of a gray cat should not have the same hashed ID. A good fingerprint also has to be secure, which is the main part of why hashing is used for passwords. Usually, a strong hashing function should be a one-way algorithm, meaning you shouldn't be able to "undo" the algorithm with ease, or even at all. It should also be easy to use; you should be able to throw any type of data at it and get an ID in a respectable amount of time.  Finally, a good fingerprint should also look somewhat like others. Look at the example from above: 

Notice how both hashed ID's are similar -- same length, and both are a combination of letters and numbers.
So how do we hash something? A hashing algorithm can be as simple as taking the first ten characters from a string, and filling in any missing ones with the alphabet ("Elephantab", "Dogabcdefg"). While not secure, it is a simple way of hashing data. Other functions are much more complicated, and much more secure than this one. For example, MD5 was a common hashing algorithm until people found out ways to decrypt it. Nowadays, MD5 hashing is mainly used for file checksums, or non-private data storage. MD5 is not really used for passwords anymore. The rising function is SHA-256. This function is much more secure than MD5, and is mainly used for password storage. Big companies should utilize SHA-256. 
How do people decrypt hashed data? Two methods of decrpyting are pretty easy to understand. You can either make a key to open a door, or break it down. One way is to reverse-engineer the algorithm to find a counter-algorithm that converts hashed data back to the original data. Once discovered, this can be quick, but is also very challenging and requires an insane level of intelligence. The other way is called brute-force. In this, you hash random data and compare it to a hashed data set in hope to match an ID. This requires a lot of time, and a powerful machine. Both of these methods are possible to do, but require a lot of time and money, or a lot of brain power. 
But this doesn't necessarily mean your passwords are at risk. Yes, some hashes can be decrypted, but it can be prevented. The good part about hashing algorithms is that if you hash a string, and hash the hashed ID of it, you get a different hash. Most decryption tools take common words and passwords and hash them, comparing to what you gave the computer as a hashed input. For example, I hashed the word "Cow" in MD5 format. Then, I went to an online decryption tool and entered in the hashed data. It solved it in seconds:

I then hashed the hash of "Cow" and tried to decrypt that, and this was the result:

It couldn't crack it! What you should do as soon as possible, is take your password that you use for everything, hash that, and then use that in a password change. Then, when you need to enter your password, open up a text to hash tool and copy and paste your hash code in. That way, you are never storing your password anywhere. 

Here are two videos about Hashing:

(Not owned by Westhill CS, work is by Tom Scott at Computerphile.)

(Not owned by Westhill CS, work is by Tom Scott at Computerphile.)

To sum up, hashing is a secure fingerprint for digital data. It is used in a wide variety of applications today, and is constantly getting stronger and better. If you were ever forced out of an account to re-login, it probably means they changed their password encryption, and need to re-encrypt your password. But how does hashing relate to searching?

Hashing Quiz


In this quiz, you will be asked questions based on the videos about hashing and passwords, and the materials covered in the section.
  1. 1 point
    What is hashing?
  2. 1 point
    What hashing algorithm was used in a wide variety of places, which was ultimately bad when it was cracked?
  3. 1 point
    What factors are important in a good hashing algorithm?
  4. 1 point
    Why do we hash things?
  5. 1 point
    What way of hash cracking takes a lot of power and time?
  6. 1 point
    How difficult is it to mimic a hash with a different set of data?
  7. 1 point
    What should you do if a website emails you your password because you forgot it?
  8. 1 point
    How can you make your passwords stronger?


What are Hash Tables?


Hash tables are yet another data structure. Unlike arrays, hash tables can expand with ease. And unlike linked lists, hash tables can access an element without having to go through each pointer. Hash Tables are very similar to Hash Maps, but Hash Maps are not synchronized like Hash Tables are, meaning, Hash Maps are faster but are not synchronized, and Hash Tables are slow because they are synchronized. 
Why use hash tables? Lets say we had an array of length 26, one element for each letter in the alphabet. Well, what if we wanted to add "ant" to the array? Recall hash functions from before. By using a hash function as simple as taking the uppercase value of the first letter, and assigning it its corresponding value in the alphabet (A = 0, B=1, C =2, etc), we can add it to where it goes. So, "ant" would go to index 0. Great! Now, we have "bear". Running it through the function, it tells the computer to put "bear" at index 1. We run "cow" through, and it goes to index 2. But, now we have to add "apple". We run it through, and it says to put it at index 0. But, "ant" is already there. This is called a collision. A simple fix is to move down the array until there is an open spot, but this is a problem. This adds more collisions in the future. Right now, "apple" is assigned to index 3. Well, we input "dog" to the function, and it wants to go to index 3. This gets messy really quickly, and is not efficient for searching. How can we solve this? Well, if we create a map of lists, we can have the expandability per index in the array. So, if we wanted to add "apple", the computer would add it to the list at index 0. 
Now, to search, we are able to pass the input to the search query into the hashing function to figure out what linked list it will be in, which is much more efficient than Linear Search or binary search. 

Here is a quick example of how Hash Tables work:


In this example, we take names of people and their phone numbers, and add them to a hash table to be used as a phonebook. The name passes through the hashing function, and is condensed to the first initial + the first three letters of the last name. Then, we show how we can pass through an input to a query to find the associated value to the key.

Hash Tables Quiz

In this quiz, you will be asked questions based off of the materials covered in the Hash Table section.
  1. 1 point
    What are Hash Tables?
  2. 1 point
    What happens when two pieces of data want to go into the same index?
  3. 1 point
    What do you need before inserting a key into a hash table?
  4. 1 point
    How do you search a hash table?
  5. 1 point
    In which of these things would a hash table be used in?

Tying it all Together

We use hashing in order to speed up searching greatly. By organizing data into their own categorized lists, we are able to create a method of finding those pieces of data again through the use of a hash function. This ultimately creates a hash table, that can be used in many applications with the need for fast sorting. It doesn't just have to sort things alphabetically though, it has other practical uses. Let's say you are creating a database in which you need to quickly pull up information on a person. Using a hash function, you can enter their name, "John Doe", and then get something like "jdoe". Placing that in the hash table, you can map certain data to 'jdoe'. Then, when you want to access his birthday, you can run his name through the algorithm, and you would get 'jdoe', which means you know exactly where and what key you need to access in order to get John Doe's information. Or what if you have a real estate program that needs to sort and access houses based on their value? You could pass it through the hashing algorithm (or even through deep learning) to generate the houses value. Then, you could determine which linked list it should go in based on the price range (for example, A = > $5,000,000, B = <$5,000,000 & > $2,500,000, C = <$2,500,000 & >$1,000,000, etc.). Now, this example is a bit more complex, as deep learning would probably have to be implemented in order to generate accurate prices for a house, and the database would have to update whenever there was a fluctuation in the price or effect of a certain factor in a house (for example, instead of 4 windows being standard in 2015, 8 windows are standard in 2016). But once you had all of the data organized in the hash table, you would then be able to access the data at a much faster rate than if you went through one by one, calculating the price of a house and trying to compare it to the other ones. Hash tables make for a much cleaner search and sort. 

Hashing is just the process of assigning an ID to a piece of data for later use. Recalling the information from before, Hashing has a wide variety of applications, to checksums, passwords, encryption, searching, compressing data that doesn't need to be reversed back into regular data, and much, much more. We covered that there are multiple hashing algorithms in practice today, though some are being replaced by new ones, especially in the case for password hashing.

We also covered two searching methods, Linear Search and Binary Search. Each has its own advantage, though we learned that hash tables can make searching much more efficient.