Hashes, checksums and MACs explained
Let's start with the basics.
In cryptography, a hashing algorithm converts many bits into fewer bits through a digest operation. Hashes are used to confirm the integrity of messages and files.
The hash can be considered a fingerprint of the original content, but unlike fingerprints, which are all unique, hashes may produce what we call a collision: the same fingerprint for two different messages. There are two important properties we need to take note of here:
- It is computationally infeasible to derive any of the original content from the hash
- The same data will always produce the same hash
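A minimal sketch of that second property, using SHA-256 from Python's standard hashlib (the input strings are arbitrary examples):

```python
import hashlib

a = hashlib.sha256(b"The quick brown fox").hexdigest()
b = hashlib.sha256(b"The quick brown fox").hexdigest()
c = hashlib.sha256(b"The quick brown fax").hexdigest()  # one letter changed

print(a == b)  # True: the same data always produces the same hash
print(a == c)  # False: even a tiny change yields a completely different digest
```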
So if we publish a hash for a zip download and you get the same hash after downloading the zip to your computer then you can be fairly sure it has not been tampered with during download.
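In code, that verification might look something like this (the filename and published digest are hypothetical placeholders; a real vendor would publish the actual digest alongside the download):

```python
import hashlib

PUBLISHED_SHA256 = "0" * 64   # hypothetical published digest, not a real one
DOWNLOAD = "release.zip"      # hypothetical downloaded file

h = hashlib.sha256()
with open(DOWNLOAD, "rb") as f:
    # Hash in chunks so multi-gigabyte files don't need to fit in memory.
    for chunk in iter(lambda: f.read(65536), b""):
        h.update(chunk)

if h.hexdigest() == PUBLISHED_SHA256:
    print("Hashes match: the download is almost certainly intact")
else:
    print("Hash mismatch: the file was corrupted or tampered with")
```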
When you want to implement message integrity - meaning assurance that the message hasn't been tampered with in transit - the inability to deliberately construct collisions is very important.
A collision is when two different inputs produce the same hash output. Every hashing algorithm has collisions: by the pigeonhole principle, any function mapping a larger space of inputs onto a smaller space of outputs must send some distinct inputs to the same output.
And that's when things get a bit more technical...
- The cryptographic strength of a hashing algorithm is defined by how infeasible it is for an attacker to construct an input that produces a chosen output.

That matters because if an attacker could do that, they could construct a malicious file whose hash matches a legitimate file's and compromise the assumed integrity of the system. The difference between CRC32 and MD5 is that MD5 not only generates a larger hash, it is designed so that collisions are hard to construct on purpose.
A 32-bit hash space has about 4 billion (2^32) unique values, so it can give unique fingerprints to at most about 4 billion different messages or files. If you have 4 billion and 1 files, you are guaranteed at least one collision, and a 1 TB bitspace contains the possibility for billions of collisions. If I'm an attacker and I can predict what that 32-bit hash is going to be, I can construct an infected file that collides with the target file, i.e. one that has the same hash.
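In fact, with only 32 bits of output you don't even need to be clever to find collisions: by the birthday paradox, hashing random inputs produces a repeat after roughly 2^16 (about 80,000) attempts on average. A quick brute-force sketch using Python's zlib.crc32:

```python
import os
import zlib

seen = {}  # maps each CRC32 value to the first input that produced it
attempts = 0
while True:
    data = os.urandom(16)
    crc = zlib.crc32(data)
    attempts += 1
    if crc in seen and seen[crc] != data:
        print(f"Collision after {attempts} attempts:")
        print(seen[crc].hex(), "and", data.hex(), "share CRC32", hex(crc))
        break
    seen[crc] = data
```

This runs in seconds on an ordinary machine; doing the same against a 128-bit hash would take on the order of 2^64 attempts.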
Additionally, if I'm doing 10 Mbps transmission, then the probability of a packet getting corrupted just right to bypass CRC32 and continue along the wire to the destination and execute is very low. Let's say at 10 Mbps I get 10 errors/second. If I ramp that up to 1 Gbps, now I'm getting 1,000 errors per second. If I ramp up to 1 exabit per second, that scales to 10^12 errors per second. Say we have a collision rate of 1/1,000,000 transmission errors, meaning 1 in a million transmission errors results in the corrupt data getting through undetected. At 10 Mbps I'd get bad data slipping through every 100,000 seconds, or about once a day. At 1 Gbps it'd happen about every 17 minutes. At 1 exabit per second, we're talking roughly a million times a second.
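The arithmetic above as a tiny helper, using the made-up illustrative rates from this paragraph (not measured values):

```python
def undetected_error_interval(bitrate_bps: float) -> float:
    """Seconds between undetected errors, assuming 10 errors/s at 10 Mbps,
    errors scaling linearly with bitrate, and a 1-in-a-million collision rate."""
    errors_per_sec = 10.0 * (bitrate_bps / 10e6)
    return 1.0 / (errors_per_sec * 1e-6)

for rate in (10e6, 1e9, 1e18):  # 10 Mbps, 1 Gbps, 1 exabit/s
    print(f"{rate:.0e} bps: one undetected error every "
          f"{undetected_error_interval(rate):g} s")
```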
If you pop open Wireshark you'll see your typical Ethernet frame carries a CRC32 (the frame check sequence), your IP header has a 16-bit checksum, and your TCP header has a 16-bit checksum, and that's in addition to what the higher-layer protocols may do; e.g. IPsec might use MD5 or SHA (as HMACs) for integrity checking on top of the above. There are several layers of error checking in typical network communications, and they STILL goof now and again at sub-10 Mbps speeds.
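For comparison, the IP and TCP checksums are far simpler than even a CRC; a minimal sketch of the RFC 1071 one's-complement sum they use:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum (RFC 1071), as used by IP/TCP/UDP headers."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF
```

A checksum this weak can't even detect two 16-bit words being swapped, which is part of why the layers stack several checks on top of each other.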
CRC (Cyclic Redundancy Check) comes in several common versions and several uncommon ones, but is generally designed just to tell when a message or file has been damaged in transit (multiple bits flipping). CRC32 by itself is not a very good error-checking protocol by today's standards in large-scale enterprise environments because of the collision rate; the average user's hard drive can hold upwards of 100k files, and a company's file shares can hold tens of millions. The ratio of hash space to the number of files is just too low. CRC32 is computationally cheap to compute, whereas MD5 isn't.
MD5 was designed to stop the intentional use of collisions to make a malicious file look legitimate. It's considered insecure because its hash space has been sufficiently mapped to enable some attacks, and some collisions are predictable. SHA-1 and SHA-2 are the newer kids on the block.
For file verification, MD5 is starting to be used by a lot of vendors because it can handle multi-gigabyte or multi-terabyte files quickly, stacking on top of the general OS use and support of CRC32. Don't be surprised if, within the next decade, file systems start using MD5 for error checking.
CRCs versus MD5, SHA-1 and SHA-2.
While properly designed CRCs are good at detecting random errors in the data (due to, e.g., line noise), a CRC is useless as a secure indicator of intentional manipulation of the data, because it's not hard at all to modify the data to produce any CRC you desire (e.g. the same CRC as the original data, to disguise your manipulation).
Therefore, even a 2048-bit CRC would be cryptographically much less secure than a 128-bit MD5.
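The underlying reason: a CRC is a linear (strictly, affine) function of its input over GF(2), so the effect of flipping any set of input bits on the CRC is completely predictable and independent of the rest of the data; an attacker can simply solve for the bit flips that produce whatever CRC they want. You can observe the structure directly; a quick check using Python's zlib (my choice of demo vehicle, not something from the original text):

```python
import os
import zlib

# For equal-length messages, CRC32's affine structure means XOR-combining
# three inputs XOR-combines their CRCs. This predictability is exactly what
# lets an attacker steer the CRC to any value they want.
a, b, c = os.urandom(64), os.urandom(64), os.urandom(64)
xored = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

assert zlib.crc32(xored) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
print("crc(a^b^c) == crc(a) ^ crc(b) ^ crc(c)")
```

No cryptographic hash worth the name has any such relation; with MD5 or SHA-2, changing one input bit scrambles the output unpredictably.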
There is a reason cryptographically strong hashes such as MD5 or SHA require much more computation than a simple CRC.
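You can see the gap with a rough benchmark; absolute numbers will vary by machine, and modern optimized MD5 implementations narrow it considerably, but CRC32 should still come out well ahead:

```python
import hashlib
import os
import time
import zlib

data = os.urandom(100 * 1024 * 1024)  # 100 MB of random data

start = time.perf_counter()
zlib.crc32(data)
crc_time = time.perf_counter() - start

start = time.perf_counter()
hashlib.md5(data).digest()
md5_time = time.perf_counter() - start

print(f"CRC32: {crc_time:.3f} s   MD5: {md5_time:.3f} s")
```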
SHA-1: A 160-bit hash function which resembles the earlier MD5 algorithm. This was designed by the National Security Agency (NSA) to be part of the Digital Signature Algorithm. Cryptographic weaknesses were discovered in SHA-1, and the standard was no longer approved for most cryptographic uses after 2010.
SHA-2: A family of two similar hash functions, with different block sizes, known as SHA-256 and SHA-512. They differ in the word size: SHA-256 uses 32-bit words whereas SHA-512 uses 64-bit words. There are also truncated versions of each standard, known as SHA-224, SHA-384, SHA-512/224 and SHA-512/256. These were also designed by the NSA.
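The core family members are all available in Python's hashlib (the truncated SHA-512/224 and SHA-512/256 variants may also be available, depending on the underlying OpenSSL build):

```python
import hashlib

for name in ("sha224", "sha256", "sha384", "sha512"):
    h = hashlib.new(name, b"hello")  # arbitrary example input
    print(f"{name}: {h.digest_size * 8}-bit digest -> {h.hexdigest()[:16]}...")
```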