Understanding Hash Collisions
What are hash collisions, why they are mathematically inevitable, how the birthday paradox applies, and which hash algorithms have been broken by practical collision attacks.
Detailed Explanation
A hash collision occurs when two different inputs produce the same hash output. Because hash functions map an infinite input space to a finite output space, collisions must mathematically exist. The security question is whether an attacker can find them efficiently.
The birthday paradox:
The birthday paradox states that in a group of just 23 people, there is a 50% probability that two share a birthday. Applied to hashing, for an n-bit hash function, you need to try approximately 2^(n/2) inputs before finding a collision with 50% probability. For MD5 (128-bit), this is 2^64 operations. For SHA-256 (256-bit), this is 2^128 operations. The difference between 2^64 (feasible) and 2^128 (infeasible) is enormous: 2^128 is about 340 undecillion.
Types of collision attacks:
A classical collision attack finds any two inputs x and y where hash(x) = hash(y). A preimage attack finds an input x that hashes to a specific target value h. A second preimage attack finds a different input y with the same hash as a given input x. Collision attacks are the easiest; preimage attacks are harder. For MD5, collision attacks are trivial, but preimage attacks are still impractical. This distinction matters because some applications (like HMAC) only require preimage resistance.
Real-world collision exploits:
The most impactful MD5 collision exploit was the creation of a rogue CA certificate in 2008 by researchers at CWI Amsterdam. They created two certificates with the same MD5 hash: one benign (submitted for signing) and one malicious (used as a CA certificate). This allowed issuing trusted certificates for any domain. The Flame malware (2012) used a similar technique to forge Microsoft Windows Update signatures.
Collision resistance of current algorithms:
MD5: broken since 2004. Collisions can be generated in seconds. SHA-1: practically broken since 2017 (SHAttered). Chosen-prefix collisions demonstrated in 2020. SHA-256: no known weaknesses. Best-known attack is generic birthday attack at 2^128 operations. SHA-3: no known weaknesses. Different internal structure (sponge construction) provides defense-in-depth against SHA-2 specific attacks.
Protecting against collisions:
Use SHA-256 or SHA-3 for any application requiring collision resistance. For digital signatures, always sign the hash of the document, ensuring the hash algorithm is collision-resistant. For data storage, if using hashes as unique identifiers, choose an algorithm with sufficient collision resistance for your data volume.
Use Case
Understanding hash collisions is essential for security engineers evaluating hash algorithm choices and for developers building systems that use hashes as unique identifiers.