
1.2 Hash tables and collision strategies

Use dictionary operations, hash functions, collisions, and chaining to see why hash tables trade ordered structure for fast average-case access.

CSCI2520: Data structures

Structured notes for CSCI2520 data-structure foundations with operation-level reasoning and selective interactive demonstrations.

Once ADTs, pointers, and basic node structures are in place, the next question is how to support dictionary-style operations efficiently. A hash table is one of the standard answers: instead of preserving sorted order, it applies a hash function to jump directly to a bucket.

That jump is powerful, but it is never perfect. The whole chapter exists to explain what the jump can guarantee, where collisions come from, and how the implementation recovers when multiple keys land in the same location.

The ADT viewpoint comes first

The hashtable tutorial starts from operations, not from buckets:

  • insert a value under a key,
  • look up the value for a key,
  • delete an existing key-value entry.

So the logical object is a dictionary or symbol table. The hash table is only one implementation strategy for that ADT.

Definition

Hashtable as a dictionary ADT

A hashtable stores key-value pairs and supports dictionary operations such as insert, lookup, and delete by mapping each key to a bucket index.

This definition already implies two different design questions:

  • how keys are converted into bucket indices,
  • what the implementation does when two keys map to the same index.

What a hash function is supposed to do

Definition

Hash function

A hash function maps a key into an integer index in a fixed range, usually [0, H_SIZE).

The local tutorial material emphasizes four properties of a good hash function:

  • it must always return a legal bucket index,
  • it should reduce collisions,
  • it should distribute keys reasonably evenly,
  • it should be quick to compute.

The last point matters because a beautifully distributed hash rule is not very useful if every lookup spends too long computing the bucket index itself.

Worked example

A simple string hash idea

For a string key, one very simple classroom rule is:

/* Sum the character codes of the string, then reduce modulo the bucket count. */
int Hash(char *s, int nBuckets) {
   int h = 0;
   while (*s != '\0') {
      h += *s;     /* accumulate each character's code */
      s++;
   }
   return h % nBuckets;
}

This kind of rule is easy to compute, but it is also easy to fool. Any two keys with the same character sum, such as anagrams, always collide. So the example is useful for reasoning, even if a production hash table would want something stronger.

Read and try

Test a simple hash-and-bucket model

The lab maps sample keys into buckets with a simple hash rule so you can see collisions and chaining directly. With the sample keys, the rule produces:

  • "cat" -> bucket 4
  • "dog" -> bucket 6
  • "cow" -> bucket 0
  • "cod" -> bucket 2

All four indices are distinct, so the collision count is 0 and buckets 1, 3, and 5 stay empty.

Collision is unavoidable, not exceptional

The most important conceptual shift is that collisions are not a rare bug. They are a normal consequence of sending a large key space into a fixed set of buckets.

Definition

Collision

A collision happens when two different keys are mapped to the same bucket index.

If the bucket count is fixed and the key space is large, collisions are inevitable. So the real implementation question is not “how do we avoid every collision?” but “how do we recover from collisions without breaking dictionary correctness?”

Chaining keeps colliding keys in one bucket list

The lecture material treats chaining as the most direct recovery method.

Definition

Chaining

With chaining, each bucket stores a linked list of all entries whose keys hash to that bucket.

This design fits the pointer-and-node model from the previous section very naturally.

typedef struct cellT {
   char *key;            /* the entry's key */
   void *value;          /* opaque pointer to the client's value */
   struct cellT *next;   /* next entry in this bucket's chain */
} cellT;

Each bucket head points to the first node in its chain. Lookup therefore has two layers:

  1. hash the key to choose a bucket,
  2. scan only that bucket’s chain instead of the whole table.

Worked example

What lookup does under chaining

Suppose Hash("cat") = 3, and bucket 3 stores the chain

("cow", value1) -> ("cat", value2) -> ("cod", value3).

To execute Lookup(table, "cat"), the implementation:

  1. computes bucket 3,
  2. moves along the linked list in bucket 3,
  3. compares keys until "cat" is found,
  4. returns value2.

The collision does not destroy correctness. It only makes that one bucket take longer to search.
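Steps 2 and 3 amount to a linear scan of one chain. A minimal sketch of that scan, repeating the cellT node type so it stands alone:

```c
#include <stddef.h>
#include <string.h>

/* Node type for one chain entry, as defined earlier in the section. */
typedef struct cellT {
   char *key;
   void *value;
   struct cellT *next;
} cellT;

/* Walk one bucket's chain, comparing each stored key with the query key.
   Returns the associated value, or NULL when the key is absent. */
void *ChainLookup(cellT *bucketHead, const char *key) {
   for (cellT *cp = bucketHead; cp != NULL; cp = cp->next) {
      if (strcmp(cp->key, key) == 0) {
         return cp->value;
      }
   }
   return NULL;
}
```

The key comparison is what distinguishes an exact match from an accidental bucket-mate.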

Open addressing uses probing instead of linked chains

The tutorial also introduces open addressing. Instead of storing a linked list in each bucket, the table keeps probing other slots when the preferred one is occupied.

Typical probing styles include:

  • linear probing,
  • quadratic probing,
  • double hashing.

The key difference from chaining is where the extra work lives:

  • chaining keeps multiple entries in a bucket-side list,
  • open addressing keeps searching for another empty slot in the main table.

This means deletion, clustering, and probe-sequence design become central implementation issues.

Theorem

Collision handling is part of correctness, not only performance

If the collision strategy is wrong, the table can return the wrong answer or lose entries entirely, even if the hash function itself is valid.

That is why the collision policy belongs in the conceptual explanation, not as an implementation afterthought.

Why load factor matters

Even a good hash function degrades if the table is too full.

The load factor is usually written as

α = (number of stored entries) / (number of buckets).

As α grows:

  • chains become longer under chaining,
  • probe sequences become longer under open addressing,
  • average-case lookup moves away from the ideal constant-time picture.

So resizing is not just a practical optimization. It is part of protecting the performance model promised by the data structure.

Generic pointers make the ADT reusable

The local hashtable lecture also uses void * values. This is a deliberate ADT decision.

typedef struct hashtableCDT *hashtableADT;

hashtableADT EmptyHashtable();
void Enter(hashtableADT table, char *key, void *value);
void *Lookup(hashtableADT table, char *key);

The table does not need to know the concrete value type. It only stores the association between a key and an opaque pointer to some client-owned data.

That makes the same hashtable ADT usable for many different applications, as long as the client also knows how to interpret the returned pointer correctly.
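Under these declarations, a minimal chaining-based implementation might look like the sketch below. The bucket count, the internal hash rule, and the struct layout are assumptions for illustration, not the course implementation; note how Enter overwrites the value when the key already exists, matching the dictionary contract.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 101   /* assumed bucket count for this sketch */

typedef struct cellT {
   char *key;
   void *value;
   struct cellT *next;
} cellT;

struct hashtableCDT {
   cellT *buckets[NBUCKETS];
};

typedef struct hashtableCDT *hashtableADT;

/* Character-sum hash, as in the earlier worked example. */
static int HashKey(const char *s) {
   int h = 0;
   while (*s != '\0') {
      h += *s;
      s++;
   }
   return h % NBUCKETS;
}

hashtableADT EmptyHashtable(void) {
   return calloc(1, sizeof(struct hashtableCDT));
}

void Enter(hashtableADT table, char *key, void *value) {
   int b = HashKey(key);
   for (cellT *cp = table->buckets[b]; cp != NULL; cp = cp->next) {
      if (strcmp(cp->key, key) == 0) {
         cp->value = value;        /* key exists: overwrite the value */
         return;
      }
   }
   cellT *cp = malloc(sizeof(cellT));   /* new key: prepend to the chain */
   cp->key = malloc(strlen(key) + 1);
   strcpy(cp->key, key);
   cp->value = value;
   cp->next = table->buckets[b];
   table->buckets[b] = cp;
}

void *Lookup(hashtableADT table, char *key) {
   int b = HashKey(key);
   for (cellT *cp = table->buckets[b]; cp != NULL; cp = cp->next) {
      if (strcmp(cp->key, key) == 0) {
         return cp->value;
      }
   }
   return NULL;
}
```

The table never dereferences the stored void * values; interpreting them is entirely the client's job.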

Common mistakes

Common mistake

A collision does not mean two keys are equal

If Hash(key1) == Hash(key2), that only means the keys land in the same bucket. It does not mean the keys are identical.

Common mistake

A valid hash range is necessary but not sufficient

A hash function that always returns a legal bucket index can still be bad if it clusters many unrelated keys into the same few buckets.

Common mistake

Average-case speed depends on keeping the table healthy

Saying “hash table lookup is O(1)” silently assumes a reasonable hash function and a controlled load factor. It is not a license to ignore collisions.

Quick checks

Quick check

What is a collision in hashing?

Answer in terms of keys and bucket indices.


Quick check

Under chaining, what extra work happens after computing the bucket index?

Focus on where the implementation looks next.


Exercises

Quick check

Why does a hashtable still need key comparison after hashing?

Use collision reasoning, not only a slogan about speed.


Quick check

Explain one tradeoff between chaining and open addressing.

Answer in terms of where the collision-handling work is stored.


Reading the implementation more closely

The lecture slides do not present hash tables as a pile of buckets first. They present them as an ADT first and as a storage strategy second. That order matters.

Definition

Dictionary semantics of the table

A hash table is a dictionary structure. For one key, the table should support insertion, lookup, and deletion. If the same key is inserted again, the newer value replaces the older association instead of creating a second copy of the key.

That overwrite behavior is easy to miss, but it is one of the reasons the ADT is a dictionary rather than a multiset.

Worked example

Enter updates an existing key

Suppose a table already stores:

  • "cat" -> 3
  • "dog" -> 8

If Enter(table, "cat", 9) is called again, the intended result is not two copies of "cat". The association for "cat" is updated, so a later Lookup(table, "cat") returns 9.

This is what the lecture means when it says Enter inserts a particular value for a specified key and overwrites the old value if the key already exists.

Why key comparison still matters after hashing

Hashing narrows the search to a bucket, but it does not finish the job. The course examples repeatedly return to this point because it is the first place students often over-generalize.

If two keys hash to the same bucket, the table still has to compare the stored key with the query key. Otherwise it would not know whether the match is exact or only accidental.

Worked example

A bucket can contain several unrelated keys

Assume bucket 3 contains:

("cow", value1) -> ("cat", value2) -> ("cod", value3)

If the lookup key is "cat", hashing only tells us to inspect bucket 3. The implementation must still compare "cow", then "cat", before it can return the correct value.

Chaining is flexible because it stores collision structure separately

The simplest collision strategy in the source material is chaining. It is easy to reason about because each bucket has its own list of entries.

This gives the implementation a useful freedom: it can add new entries at the front of the list, the back of the list, or in another consistent order. The dictionary contract does not care about that local ordering as long as lookup and overwrite behave correctly.

Common mistake

Bucket order is not the dictionary contract

Students sometimes assume the order inside one chain is part of the meaning of the table. It is not. The contract is about finding the correct key-value pair; the chain order is only an implementation detail.

Worked example

Insertion and lookup under chaining

Consider a table with five buckets and a simple hash rule.

  1. Insert "ape".
  2. Insert "ant".
  3. Insert "apple".
  4. Look up "ant".

If "ape" and "ant" collide, they will appear in the same bucket chain. The lookup for "ant" hashes once, follows only that chain, and compares keys until it finds the exact entry.

This is why chaining can still stay fast on average when the buckets remain reasonably balanced.

Open addressing keeps the search inside the table

The tutorial and lecture also show open addressing. Instead of attaching a list to each bucket, open addressing keeps searching for another slot inside the table itself.

The cost of a collision therefore moves into the probing rule.

  • linear probing checks the next slot, then the next, and so on;
  • quadratic probing jumps by square offsets;
  • double hashing uses a second hash function to generate a step size.

Definition

Linear probing

With linear probing, if the preferred slot is full, the table checks the next slot in sequence and wraps around when necessary.

The lecture warns that linear probing tends to create long runs of filled buckets. That effect is called primary clustering, and it degrades the performance of the table because future probes are more likely to collide with the same dense region.

Quadratic probing is one response to that problem.

Definition

Quadratic probing

With quadratic probing, the ith probe uses a square offset such as i^2. The step size grows faster than in linear probing, which helps reduce primary clustering.

Worked example

Quadratic probing with a small table

The tutorial gives a concrete table of size 10 with the hash rule h(k) = k % 10. Insert the entries {89, 18, 49, 58, 69} using quadratic probing.

The probe sequence works as follows:

  • 89 goes to bucket 9
  • 18 goes to bucket 8
  • 49 hashes to 9, collides, then probes bucket 0
  • 58 hashes to 8, collides, then probes 9, then 2
  • 69 hashes to 9, collides, then probes 0, then 3

So the occupied buckets end up as:

  • 0 -> 49
  • 2 -> 58
  • 3 -> 69
  • 8 -> 18
  • 9 -> 89

That example is useful because it shows that the probe rule, not just the raw hash value, determines where each entry finally lives.

Common mistake

Quadratic probing does not magically remove all limits

Quadratic probing reduces primary clustering, but it does not make the table invincible. The lecture shows a load-factor restriction example where, with Nbuckets = 7 and F(i) = i^2, the probe sequence repeats before reaching all slots. Some buckets may never be visited under a given probe rule.

Worked example

Why the probe sequence can miss slots

The lecture example for Nbuckets = 7 and h0 = 0 gives the probe sequence

0, 1, 4, 2, 2, 4, 1

The important point is not the arithmetic itself. The important point is that the probing rule can fail to cover every bucket. That is why open addressing needs a load-factor discipline and why the exact probing formula matters.

Definition

Double hashing

Double hashing uses a second hash function to determine the probe step. In general, it spreads probes better than linear or quadratic probing, but it requires two hash functions.

That tradeoff is classic CSCI2520 material: more structure in the rule can buy better distribution, but it also makes the implementation more complicated.

Rehashing is the cost of changing the table shape

When the bucket count changes, the hash function must change too. Every stored entry has to be reinserted under the new rule. That process is called rehashing.

The lecture notes point out that rehashing is rarely necessary for most applications, but it is conceptually important because it explains why a hash table cannot be treated as a fixed pile of memory locations forever.

Worked example

Why resizing forces rehashing

If a table grows from one bucket count to another, the old bucket index is no longer guaranteed to be valid for the new table size. The entries must be visited again, their keys hashed again, and their new positions computed again.

That is why resizing is not just a matter of copying bytes. It changes the meaning of the bucket index itself.
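For a chaining table, a rehash can be sketched as follows. The cellT node type is repeated so the sketch stands alone, and the character-sum hash is the one from the earlier worked example; the routine relinks existing nodes rather than copying them.

```c
#include <stdlib.h>

typedef struct cellT {
   char *key;
   void *value;
   struct cellT *next;
} cellT;

/* Character-sum hash, parameterized by the bucket count. */
int HashMod(const char *s, int nBuckets) {
   int h = 0;
   while (*s != '\0') {
      h += *s;
      s++;
   }
   return h % nBuckets;
}

/* Rehash: move every cell from the old bucket array into a new one.
   Each key must be hashed again because the modulus changed. */
cellT **Rehash(cellT **oldBuckets, int oldN, int newN) {
   cellT **newBuckets = calloc(newN, sizeof(cellT *));
   for (int b = 0; b < oldN; b++) {
      cellT *cp = oldBuckets[b];
      while (cp != NULL) {
         cellT *next = cp->next;             /* detach before relinking */
         int nb = HashMod(cp->key, newN);    /* index under the new size */
         cp->next = newBuckets[nb];
         newBuckets[nb] = cp;
         cp = next;
      }
   }
   free(oldBuckets);
   return newBuckets;
}
```

Every node is visited exactly once, so a rehash costs time proportional to the number of stored entries plus the old bucket count.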

How to choose between chaining and probing

The source material does not frame one strategy as universally best. It instead makes the tradeoffs visible.

  • Chaining is easier to explain and naturally handles collisions with a linked list.
  • Linear probing keeps the table compact but can form primary clustering.
  • Quadratic probing reduces that clustering but needs stricter load-factor control.
  • Double hashing usually distributes probes better, but it is more expensive because it depends on two hash functions.

If you are reading code, the question to ask is not "Which strategy is faster in the abstract?" but "Which strategy matches the table size, the expected load, and the update pattern of this program?"

A compact code-reading checklist

When you look at a hashtable implementation in C, check these things in order:

  1. What counts as the key, and what counts as the stored value?
  2. What does Enter do if the key already exists?
  3. Does lookup compare the stored key after hashing?
  4. Where do collision results live: in a chain, in a probe sequence, or in both?
  5. What happens when the table gets too full?

If those five answers are clear, the rest of the implementation becomes much less mysterious.

More exercises

Quick check

Why does Enter need to overwrite an old value when the key already exists?

Answer from the dictionary contract, not from the code alone.


Quick check

In quadratic probing, what problem does the lecture say is reduced compared with linear probing?

Name the clustering effect.


Quick check

Using the tutorial example with table size 10, where do 49 and 69 land under quadratic probing?

Use the probe sequence, not just the raw hash values.


Quick check

Why does the lecture say double hashing is better in general but still not free?

Focus on what extra ingredient it needs.


Section mastery checkpoint

Answer each question correctly to complete this section checkpoint.

Skills: hashtable, collision, hash-function

What is a collision in hashing?

Skills: hashtable, chaining, lookup

Fill in the blank: under chaining, once the bucket is chosen, lookup continues by scanning the bucket's ____.

Syntax guidance: enter a short phrase such as `linked list` or `chain`.