Saturday, September 8, 2018

Intro to Hash Table with Chaining

Hash table is a data structure that combines an array with a linked list. It takes the random access ability of an array and combines it with dynamism of linked list.

If implemented well we are basically taking all the advantages of both the data structure to get insertion,deletion and search in constant time.

Hash table is a combination of two things:

  1.  Hash function: it returns a positive integer that is called hash code.
  2. An Array: it is capable of storing data of the type we want to store
The basic idea of hash table is we take our data and run that data through the hash function, so the data is processed and return a number (called hash code) and with that number we just store the data we want to store in the data structure.

But how can we define our hash function? There is no one particular way to define a hash function. There are lot of good and bad hash function. And as we're trying to get the operation close to constant time we'll have to implement good hash function.

A good hash function should have:
  • Use only the data being hashed -> we don't want to incorporate anything else other than the data
  • Use all the data being hashed -> we want to use all of the data
  • Be deterministic -> every time we pass the same data we should get the same hash code
  • Uniformly distribute data -> data should be spread out through out the able
  • Generate very different hash code for very similar data
Example of a hash table is:
In this we are taking a string ans adding ASCII value of each of the character and returning the sum mod the size of hashTable(i.e HASH_MAX) and that is our hash code and we are saving the string to that position of hash table.

But if there is two input first- "aaop" and second- "paoa" both of them will return the hash code 7 if we use the above hash function. So, which ever data will save in the hash table that will be overwritten by the later one. We call this problem collision.

Collision occurs when two pieces of data run through the same hash function and returns the same hash code.

But we want to save both the data into the hash table. So, how do we do it?  one way can be linear probing.

In linear probing, if we have a collision, we try to place data in the next consecutive element in the array until we find a vacancy.So, if we can not put any of the two strings in hash code 7 then we'll try to put it in 8 and if that is also not vacant we'll increase hash code until we find it. That means  we are stretching away from constant time and leaning toward order of n with this process, and this problem is called clustering.

So, linear probing is subject to a problem called clustering. That means once there is a miss, two adjacent cells will contain data, making more likely that in the future cluster will grow.

The other problem is in our code we are still have room for specific amount of string(i.e HASH_MAX) that means we are still limited. We can only store as much data as we have location in the array.

So, how can we solve these problems? This is where chaining comes in play. And this is where we bring the linked list back to the picture. What if each element of array holding one piece of data, it held multiple pieces of data?

But that doesn't make sense, we know each element of array can only one piece of data of a data type. But what if that data type is a linked list?So, what if every element of an array is a pointer to the head of a linked list? Then we can build these linked list and grow them arbitrarily.

So, we start to grow these chains out of the array locations. Now we can store an arbitrary amount of data into the hash table without ever running into collision.

And we know that when we insert something in the front of the linked list, that is a O(1) operation.

But we know search or find function in linked list takes O(n) time. But instead of having one linked list, we now have HASH_MAX(i.e size of array)  number  of linked list where HASH_MAX can be 10 or 1000  whatever the size of array is, then time complexity is O(n/10) or O(n/1000). 

We know theoretically we disregard  constants when we measure complexity, but in real world these things matter. We will notice this happens to run 10 times or 1000 times faster, because we're distributing one long chain (linked list) across the 1000 smaller chains.So, each time we search through one of those chains we can ignore the other 999 chains we don't care about.

 As we are focusing on only one linked list which on average going to be thousand times shorter. So, we still are tending toward the average case of being constant time.

 Chaining works like this:

We take the hash code and make it point to the node which contains the string. And in collision i.e if we try to put "paoa" in the hash table:

We make a new node with that data and put it in end of the list that was pointed by 7.
 I will show the implementation in my next post.

No comments:

Post a Comment

Introduction To Binary Search Tree

A tree data structure is a way to hold data that looks like a tree when it's visualized. For example: All data points in a tre...