Once ADTs, pointers, and basic node structures are in place, the next question is how to support dictionary-style operations efficiently. A hash table is one of the standard answers: instead of preserving sorted order, it applies a hash function to jump directly to a bucket.
That jump is powerful, but it is never perfect. The whole chapter exists to explain what the jump can guarantee, where collisions come from, and how the implementation recovers when multiple keys land in the same location.
The ADT viewpoint comes first
The hashtable tutorial starts from operations, not from buckets:
- insert a value under a key,
- look up the value for a key,
- delete an existing key-value entry.
So the logical object is a dictionary or symbol table. The hash table is only one implementation strategy for that ADT.
Definition
Hashtable as a dictionary ADT
A hashtable stores key-value pairs and supports dictionary operations such as insert, lookup, and delete by mapping each key to a bucket index.
This definition already implies two different design questions:
- how keys are converted into bucket indices,
- what the implementation does when two keys map to the same index.
What a hash function is supposed to do
Definition
Hash function
A hash function maps a key into an integer index in a fixed range, usually .
The local tutorial material emphasizes four properties of a good hash function:
- it must always return a legal bucket index,
- it should reduce collisions,
- it should distribute keys reasonably evenly,
- it should be quick to compute.
The last point matters because a beautifully distributed hash rule is not very useful if every lookup spends too long computing the bucket index itself.
Worked example
A simple string hash idea
For a string key, one very simple classroom rule is:
int Hash(char *s, int nBuckets) {
int h = 0;
while (*s != '\0') {
h += *s;
s++;
}
return h % nBuckets;
}
This kind of rule is easy to compute, but it is also easy to fool. Different keys with the same character-sum pattern can collide often. So the example is useful for reasoning, even if a production hash table would want something stronger.
Read and try
Test a simple hash-and-bucket model
The lab maps sample keys into buckets with a simple hash rule so you can see collisions and chaining directly.
Collision count: 0
| Keys | Bucket index |
|---|---|
| cat | 4 |
| dog | 6 |
| cow | 0 |
| cod | 2 |
Bucket index 0
cow
Bucket index 1
∅
Bucket index 2
cod
Bucket index 3
∅
Bucket index 4
cat
Bucket index 5
∅
Bucket index 6
dog
Collision is unavoidable, not exceptional
The most important conceptual shift is that collisions are not a rare bug. They are a normal consequence of sending a large key space into a fixed set of buckets.
Definition
Collision
A collision happens when two different keys are mapped to the same bucket index.
If the bucket count is fixed and the key space is large, collisions are inevitable. So the real implementation question is not “how do we avoid every collision?” but “how do we recover from collisions without breaking dictionary correctness?”
Chaining keeps colliding keys in one bucket list
The lecture material treats chaining as the most direct recovery method.
Definition
Chaining
With chaining, each bucket stores a linked list of all entries whose keys hash to that bucket.
This design fits the pointer-and-node model from the previous section very naturally.
typedef struct cellT {
char *key;
void *value;
struct cellT *next;
} cellT;
Each bucket head points to the first node in its chain. Lookup therefore has two layers:
- hash the key to choose a bucket,
- scan only that bucket’s chain instead of the whole table.
Worked example
What lookup does under chaining
Suppose , and bucket 3 stores the chain
.
To execute Lookup(table, "cat"), the implementation:
- computes bucket
3, - moves along the linked list in bucket
3, - compares keys until
"cat"is found, - returns
value2.
The collision does not destroy correctness. It only makes that one bucket take longer to search.
Open addressing uses probing instead of linked chains
The tutorial also introduces open addressing. Instead of storing a linked list in each bucket, the table keeps probing other slots when the preferred one is occupied.
Typical probing styles include:
- linear probing,
- quadratic probing,
- double hashing.
The key difference from chaining is where the extra work lives:
- chaining keeps multiple entries in a bucket-side list,
- open addressing keeps searching for another empty slot in the main table.
This means deletion, clustering, and probe-sequence design become central implementation issues.
Theorem
Collision handling is part of correctness, not only performance
If the collision strategy is wrong, the table can return the wrong answer or lose entries entirely, even if the hash function itself is valid.
That is why the collision policy belongs in the conceptual explanation, not as an implementation afterthought.
Why load factor matters
Even a good hash function degrades if the table is too full.
The load factor is usually written as
As grows:
- chains become longer under chaining,
- probe sequences become longer under open addressing,
- average-case lookup moves away from the ideal constant-time picture.
So resizing is not just a practical optimization. It is part of protecting the performance model promised by the data structure.
Generic pointers make the ADT reusable
The local hashtable lecture also uses values. This is a deliberate ADT decision.
typedef struct hashtableCDT *hashtableADT;
hashtableADT EmptyHashtable();
void Enter(hashtableADT table, char *key, void *value);
void *Lookup(hashtableADT table, char *key);
The table does not need to know the concrete value type. It only stores the association between a key and an opaque pointer to some client-owned data.
That makes the same hashtable ADT usable for many different applications, as long as the client also knows how to interpret the returned pointer correctly.
Common mistakes
Common mistake
A collision does not mean two keys are equal
If , that only means the keys land in the same bucket. It does not mean the keys are identical.
Common mistake
A valid hash range is necessary but not sufficient
A hash function that always returns a legal bucket index can still be bad if it clusters many unrelated keys into the same few buckets.
Common mistake
Average-case speed depends on keeping the table healthy
Saying “hash table lookup is O(1)” silently assumes a reasonable hash function and a controlled load factor. It is not a license to ignore collisions.
Quick checks
Quick check
What is a collision in hashing?
Answer in terms of keys and bucket indices.
Solution
Answer
Quick check
Under chaining, what extra work happens after computing the bucket index?
Focus on where the implementation looks next.
Solution
Answer
Exercises
Quick check
Why does a hashtable still need key comparison after hashing?
Use collision reasoning, not only a slogan about speed.
Solution
Guided solution
Quick check
Explain one tradeoff between chaining and open addressing.
Answer in terms of where the collision-handling work is stored.
Solution
Guided solution
Reading the implementation more closely
The lecture slides do not present hash tables as a pile of buckets first. They present them as an ADT first and as a storage strategy second. That order matters.
Definition
Dictionary semantics of the table
A hash table is a dictionary structure. For one key, the table should support insertion, lookup, and deletion. If the same key is inserted again, the newer value replaces the older association instead of creating a second copy of the key.
That overwrite behavior is easy to miss, but it is one of the reasons the ADT is a dictionary rather than a multiset.
Worked example
Enter updates an existing key
Suppose a table already stores:
If Enter(table, "cat", 9) is called again, the intended result is not two
copies of "cat". The association for "cat" is updated, so a later
Lookup(table, "cat") returns 9.
This is what the lecture means when it says Enter inserts a particular value
for a specified key and overwrites the old value if the key already exists.
Why key comparison still matters after hashing
Hashing narrows the search to a bucket, but it does not finish the job. The course examples repeatedly return to this point because it is the first place students often over-generalize.
If two keys hash to the same bucket, the table still has to compare the stored key with the query key. Otherwise it would not know whether the match is exact or only accidental.
Worked example
A bucket can contain several unrelated keys
Assume bucket 3 contains:
If the lookup key is "cat", hashing only tells us to inspect bucket 3. The
implementation must still compare "cow", then "cat", before it can return
the correct value.
Chaining is flexible because it stores collision structure separately
The simplest collision strategy in the source material is chaining. It is easy to reason about because each bucket has its own list of entries.
This gives the implementation a useful freedom: it can add new entries at the front of the list, the back of the list, or in another consistent order. The dictionary contract does not care about that local ordering as long as lookup and overwrite behave correctly.
Common mistake
Bucket order is not the dictionary contract
Students sometimes assume the order inside one chain is part of the meaning of the table. It is not. The contract is about finding the correct key-value pair; the chain order is only an implementation detail.
Worked example
Insertion and lookup under chaining
Consider a table with five buckets and a simple hash rule.
- Insert
"ape". - Insert
"ant". - Insert
"apple". - Look up
"ant".
If "ape" and "ant" collide, they will appear in the same bucket chain. The
lookup for "ant" hashes once, follows only that chain, and compares keys until
it finds the exact entry.
This is why chaining can still stay fast on average when the buckets remain reasonably balanced.
Open addressing keeps the search inside the table
The tutorial and lecture also show open addressing. Instead of attaching a list to each bucket, open addressing keeps searching for another slot inside the table itself.
The cost of a collision therefore moves into the probing rule.
- linear probing checks the next slot, then the next, and so on;
- quadratic probing jumps by square offsets;
- double hashing uses a second hash function to generate a step size.
Definition
Linear probing
With linear probing, if the preferred slot is full, the table checks the next slot in sequence and wraps around when necessary.
The lecture warns that linear probing tends to create long runs of filled buckets. That effect is called primary clustering, and it degrades the performance of the table because future probes are more likely to collide with the same dense region.
Quadratic probing is one response to that problem.
Definition
Quadratic probing
With quadratic probing, the ith probe uses a square offset such as .
The step size grows faster than in linear probing, which helps reduce primary
clustering.
Worked example
Quadratic probing with a small table
The tutorial gives a concrete table of size 10 with the hash rule
. Insert the entries {89, 18, 49, 58, 69} using quadratic
probing.
The probe sequence works as follows:
89goes to bucket918goes to bucket849hashes to9, collides, then probes bucket058hashes to8, collides, then probes9, then269hashes to9, collides, then probes0, then3
So the occupied buckets end up as:
That example is useful because it shows that the probe rule, not just the raw hash value, determines where each entry finally lives.
Common mistake
Quadratic probing does not magically remove all limits
Quadratic probing reduces primary clustering, but it does not make the table invincible. The lecture shows a load-factor restriction example where, with and , the probe sequence repeats before reaching all slots. Some buckets may never be visited under a given probe rule.
Worked example
Why the probe sequence can miss slots
The lecture example for and gives the probe sequence
0, 1, 4, 2, 2, 4, 1
The important point is not the arithmetic itself. The important point is that the probing rule can fail to cover every bucket. That is why open addressing needs a load-factor discipline and why the exact probing formula matters.
Definition
Double hashing
Double hashing uses a second hash function to determine the probe step. In general, it spreads probes better than linear or quadratic probing, but it requires two hash functions.
That tradeoff is classic CSCI2520 material: more structure in the rule can buy better distribution, but it also makes the implementation more complicated.
Rehashing is the cost of changing the table shape
When the bucket count changes, the hash function must change too. Every stored entry has to be reinserted under the new rule. That process is called rehashing.
The lecture notes point out that rehashing is rarely necessary for most applications, but it is conceptually important because it explains why a hash table cannot be treated as a fixed pile of memory locations forever.
Worked example
Why resizing forces rehashing
If a table grows from one bucket count to another, the old bucket index is no longer guaranteed to be valid for the new table size. The entries must be visited again, their keys hashed again, and their new positions computed again.
That is why resizing is not just a matter of copying bytes. It changes the meaning of the bucket index itself.
How to choose between chaining and probing
The source material does not frame one strategy as universally best. It instead makes the tradeoffs visible.
- Chaining is easier to explain and naturally handles collisions with a linked list.
- Linear probing keeps the table compact but can form primary clustering.
- Quadratic probing reduces that clustering but needs stricter load-factor control.
- Double hashing usually distributes probes better, but it is more expensive because it depends on two hash functions.
If you are reading code, the safest question is not “Which strategy is faster in abstract?” The safer question is “Which strategy matches the table size, the expected load, and the update pattern of this program?”
A compact code-reading checklist
When you look at a hashtable implementation in C, check these things in order:
- What counts as the key, and what counts as the stored value?
- What does
Enterdo if the key already exists? - Does lookup compare the stored key after hashing?
- Where do collision results live: in a chain, in a probe sequence, or in both?
- What happens when the table gets too full?
If those five answers are clear, the rest of the implementation becomes much less mysterious.
More exercises
Quick check
Why does Enter need to overwrite an old value when the key already exists?
Answer from the dictionary contract, not from the code alone.
Solution
Guided solution
Quick check
In quadratic probing, what problem does the lecture say is reduced compared with linear probing?
Name the clustering effect.
Solution
Answer
Quick check
Using the tutorial example with table size 10, where do 49 and 69 land under quadratic probing?
Use the probe sequence, not just the raw hash values.
Solution
Guided solution
Quick check
Why does the lecture say double hashing is better in general but still not free?
Focus on what extra ingredient it needs.
Solution