Hash tables provide efficient key-value lookup by mapping keys to array indices with hash functions. However, with a single hash function, collisions occur when multiple keys hash to the same index. Techniques like double hashing resolve these collisions efficiently while keeping hash tables fast. In this article, we will look at what double hashing is, how it works, and a Python example.

## What is Double Hashing?

Hash functions convert arbitrary keys like strings or numbers into fixed-size integer indexes to store values in arrays — this provides constant time O(1) lookup. However, collisions occur with a limited index range when different keys map to the same slot based on the hashing algorithm.

**Double hashing in data structures refers to a collision resolution technique used in hash tables and hash-based collections like sets and maps.**

Collisions require handling before values can be retrieved, reducing performance. Double hashing builds on single hashing to handle collisions with minimal additional cost.

Double hashing involves not just one, but two hash functions. The primary function determines the initial position, and the secondary function comes into play if a collision occurs, dictating the step size for probing the next available slot in the hash table.

## How Double Hashing Works

Double hashing utilizes two different simple hash functions rather than one. When a collision occurs during lookup with the primary hash, the secondary hash calculates another index to probe until an empty slot is found.

For example, suppose the initial hash lands on index 5 but that slot is occupied. If the second hash outputs 7, the probe steps ahead sequentially, checking index 12, index 19, and so on, wrapping around modulo the table size until an empty slot is found.

This probing approach distributes keys uniformly and avoids the primary clustering seen with linear probing. With hash function pairs that minimize recurring probe sequences, double hashing retains the hash table's fast expected lookup time.
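The probe sequence described above can be sketched as follows. The table size of 20 is an assumed value for illustration; the example only fixes the initial index (5) and the step (7):

```python
def probe_sequence(h1, h2, table_size, max_probes):
    """Yield the indices double hashing would check, in order."""
    return [(h1 + i * h2) % table_size for i in range(max_probes)]

# Initial index 5, step 7, and an assumed table of 20 slots
print(probe_sequence(5, 7, 20, 4))  # [5, 12, 19, 6]
```

Note how the sequence wraps around modulo the table size after index 19, continuing to explore alternate locations.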

The effectiveness of double hashing depends heavily on the two hash functions:

- H1 behaves as normal hashing spreading keys uniformly
- H2 serves as a “perturbation” function generating good probe sequence
- A random or checksum-style function for H1 and a simple modulo by a prime for H2 works well

Deletion creates a problem with double hashing: an emptied slot must be explicitly marked as deleted. Otherwise, retrieval probes relying on the H2 skip would stop at the now-empty slot and fail to find entries placed beyond it. A tombstone marker indicating the deleted location addresses this.
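A minimal sketch of tombstone handling, assuming a flat list of `(key, value)` slots and caller-supplied hash functions (the helper names and sentinel are hypothetical):

```python
DELETED = ("<tombstone>",)  # unique sentinel marking a deleted slot

def delete(table, key, h1, h2):
    """Mark the slot for `key` as deleted instead of emptying it."""
    m = len(table)
    index, step = h1(key) % m, h2(key)
    while table[index] is not None:
        if table[index] is not DELETED and table[index][0] == key:
            table[index] = DELETED  # tombstone, so later probes keep going
            return True
        index = (index + step) % m
    return False

def search(table, key, h1, h2):
    """Probe past tombstones; stop only at a truly empty (None) slot."""
    m = len(table)
    index, step = h1(key) % m, h2(key)
    while table[index] is not None:
        if table[index] is not DELETED and table[index][0] == key:
            return table[index][1]
        index = (index + step) % m  # skip tombstones and mismatches alike
    return None
```

If the slot were simply set back to `None`, a later `search` would stop there and miss any entry that had probed past it. (A bounded-probe guard is omitted here for brevity.)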

Though double hashing reduces clustering due to even distribution compared to linear probing, the collision resolution logic still incurs some costs. Rehashing the table to a bigger size is done when the load factor exceeds 0.75 or even 0.5 for performance-sensitive systems.

The choice of H2 controls the probe sequence generated during collision handling. A simple modular hash produces short cyclic sequences that are cheap to compute, while multiplicative schemes (for example, using an irrational constant) yield longer, random-like probe cycles that distribute inserts more evenly across the table.

**Hash Function 1 (H1)**

- H1 is utilized to generate the initial bucket index to check for the key
- It should provide good key distribution across hash table size
- Popular choices include CRC32, FNV-1, SHA-1 that spread keys uniformly
- The goal is to minimize collisions of different keys mapping to the same index
- Output range should match the table capacity (often a prime near a power of 2)

**Hash Function 2 (H2)**

- H2 is only invoked upon collision from H1 to calculate the skip interval
- Skip interval determines the offset between successive probe attempts
- The objective is to minimize cluster probing and cyclic behavior
- A common approach is to make H2 output a number relatively prime to table size
- For example, if the capacity is 500 buckets, suitable skip steps are 7, 11, 13, etc
- Secondary clustering should be avoided by using a different (non-linear) function
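A quick way to check a candidate skip step against the table size is to test coprimality (a hypothetical helper, using the 500-bucket capacity from the example above):

```python
from math import gcd

def valid_step(step, table_size):
    """A step coprime to the table size visits every slot before repeating."""
    return step > 0 and gcd(step, table_size) == 1

# With 500 buckets (= 2^2 * 5^3), even steps and multiples of 5 fail
print([s for s in range(1, 16) if valid_step(s, 500)])
# [1, 3, 7, 9, 11, 13]
```

This is why steps like 7, 11, and 13 are suitable for a 500-bucket table, while 10 or 14 would cycle through only a fraction of the slots.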

## Advantages of Double Hashing

Now, you might be wondering, why bother with this two-step hashing process? Well, the beauty of double hashing is in its ability to provide a more even distribution of elements in the hash table which reduces the chances of collisions:

- **Reduced Clustering:** Unlike some collision resolution techniques that may lead to clustering, where elements clump together in certain areas of the hash table, double hashing distributes elements more evenly. This results in a more efficient use of space and faster retrieval times.
- **Dynamic Step Size:** The secondary hash function acts as a dynamic step size adjuster. This adaptability helps the algorithm to navigate the hash table more effectively.

## Additional Best Practices

- Two key goals are minimizing collisions for H1 and recurrent probes for H2
- If collisions grow beyond tolerance, rehash with a larger capacity
- Load factor typically kept below 0.75 via capacity doubling
- Random seeds can introduce volatility improving distribution

## Double Hashing Example in Python

```python
class DoubleHashTable:
    def __init__(self, size):
        self.size = size
        self.table = [None] * size

    def hash_function_1(self, key):
        # First hash function: initial position
        return key % self.size

    def hash_function_2(self, key):
        # Second hash function: step size for resolving collisions.
        # Must be non-zero and ideally relatively prime to the table size.
        return 1 + (key % (self.size - 1))

    def insert(self, key, value):
        # Insert a key-value pair into the hash table using double hashing
        index = self.hash_function_1(key)
        step = self.hash_function_2(key)
        probes = 0
        while self.table[index] is not None:
            probes += 1
            if probes >= self.size:
                raise RuntimeError("Hash table is full")
            # Collision occurred, probe to the next position
            index = (index + step) % self.size
        # Found an empty slot, insert the key-value pair
        self.table[index] = (key, value)

    def search(self, key):
        # Search for a key in the hash table
        index = self.hash_function_1(key)
        step = self.hash_function_2(key)
        for _ in range(self.size):  # bound probes to avoid cycling forever
            if self.table[index] is None:
                # Empty slot reached: key is not present
                return None
            stored_key, value = self.table[index]
            if stored_key == key:
                # Key found, return the corresponding value
                return value
            # Collision occurred, probe to the next position
            index = (index + step) % self.size
        return None

    def display(self):
        # Display the contents of the hash table
        for i in range(self.size):
            if self.table[i] is not None:
                print(f"Index {i}: {self.table[i]}")
            else:
                print(f"Index {i}: Empty")

# Example usage:
hash_table = DoubleHashTable(size=10)
hash_table.insert(5, "Apple")
hash_table.insert(2, "Banana")
hash_table.insert(15, "Cherry")
hash_table.insert(25, "Doughnut")
hash_table.display()
search_result = hash_table.search(2)
if search_result is not None:
    print(f"Value for key 2: {search_result}")
else:
    print("Key not found.")
```

```
# Output
Index 0: Empty
Index 1: Empty
Index 2: (2, 'Banana')
Index 3: (25, 'Doughnut')
Index 4: Empty
Index 5: (5, 'Apple')
Index 6: Empty
Index 7: Empty
Index 8: Empty
Index 9: (15, 'Cherry')
Value for key 2: Banana
```

In the above code, two hash functions are defined within the class. `hash_function_1` calculates the initial position for a given key, and `hash_function_2` determines the step size for probing in case of collisions. The second hash function must be non-zero and relatively prime to the table size to ensure effective probing.

## Conclusion

Double hashing tackles hash collisions in an efficient manner. I hope you now have a good understanding of double hashing and how it works. The key considerations when implementing double hashing for hash table or hash set collisions are: carefully selecting the hash function pair, marking deletions with tombstones, setting load-factor thresholds, and profiling the key distribution.

## FAQs

### What is double hashing in data structure?

Double hashing in data structures refers to a collision resolution technique used in hash tables and hash-based collections like sets and maps.

### What are the three types of hashing?

Here are three main types of hashing:

– **Division Hashing:** It involves dividing the key (data to be hashed) by the size of the hash table and using the remainder as the hash value.

– **Multiplication Hashing:** In Multiplication Hashing, we multiply the key by a constant, extract the fractional part, and then multiply it by the size of the hash table.

– **Universal Hashing:** Universal hashing involves using a family of hash functions and selecting one at random for each hash table.
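The first two schemes can be sketched in Python (the multiplier 0.618... below is the commonly used golden-ratio constant, an assumption on our part):

```python
def division_hash(key, table_size):
    """Division method: the remainder after dividing by the table size."""
    return key % table_size

def multiplication_hash(key, table_size, A=0.6180339887):
    """Multiplication method: scale the fractional part of key * A."""
    frac = (key * A) % 1
    return int(table_size * frac)

print(division_hash(1234, 97))        # 70
print(multiplication_hash(1234, 97))  # 63
```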

### What is the difference between double hashing and rehashing?

The key difference is that double hashing uses two hash functions to handle collisions within an existing hash table, while rehashing builds a completely new, larger hash table when the load factor crosses a threshold.

### How does the initial position get determined in double hashing?

The initial position in double hashing is determined by the first hash function. This function takes the key (data to be hashed) and computes an initial position in the hash table.

### What role does the second hash function play in double hashing?

The second hash function in double hashing comes into play when a collision occurs, i.e., when the initial position is already occupied. This function dictates the step size for probing the next available slot in the hash table which allows the algorithm to find an empty spot.

### Are there any challenges or considerations when implementing double hashing?

Yes, one important consideration is the selection or design of appropriate hash functions. The effectiveness of double hashing depends on the quality of these functions. Additionally, monitoring and adjusting the load factor, which represents the ratio of occupied slots to the total slots in the hash table, is crucial for maintaining efficiency.

### How does the efficiency of double hashing compare to other collision resolution techniques?

The efficiency of double hashing depends on factors like the quality of hash functions and the load factor of the hash table. In general, double hashing is known for its ability to provide a more even distribution of elements, which can result in faster retrieval times compared to some other collision resolution techniques, especially in cases with high contention for hash table slots.
