I was having a conversation with some colleagues today on the topic of checksums. Someone was quoted as being able to hash a terabyte of data using MD5 in one minute. This is of course ludicrous for a single hard drive on a home computer, but I’m curious to see what it would take to achieve that rate (or at least get close). Thus, I’m setting out on an experiment.

The experiment is simple. I’ll create a sample file of known size and generate an MD5 hash from it on several storage media with different I/O capabilities. To start, I’ll create my sample file. I’m running Ubuntu 12.04, so I’ll use dd to create a file of roughly two gigabytes (2000 blocks of 1M each), using my system disk as effectively random input data. The command is this:

sudo dd if=/dev/sda of=/media/500G-Raid/sample.file bs=1M count=2000

If you wish to replicate this experiment, you’ll just need to adjust your “of” file path to a suitable location to place the sample file.
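If reading the raw system disk with sudo is not an option, a sample file of the same size can also be built from /dev/urandom. This is only a sketch and not the command used for the results in this post: the make_sample name and its arguments are made up for illustration, it assumes a dd that accepts bs=1M (GNU dd on Ubuntu does), and the resulting file will hash to a different value than mine.

```shell
# make_sample OUT_FILE SIZE_MIB
# Hypothetical helper: writes SIZE_MIB blocks of 1 MiB of pseudo-random
# data to OUT_FILE. No root access is needed because it reads
# /dev/urandom instead of a raw disk device.
make_sample() {
    dd if=/dev/urandom of="$1" bs=1M count="$2" 2>/dev/null
}

# Example (commented out because it writes about 2 GB):
# make_sample /tmp/sample.file 2000
```

Generating random data costs some CPU time, so creating the file this way may be slower than copying from a disk, but that only affects setup and not the hashing tests.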

First up, I’ll run the test with the sample file located on my two-disk 500GB RAID0 array. My terminal output is below, with the first line being the test command:

time md5sum sample.file
e273795bcc527a5157ce7f2095cacd42 sample.file

real 0m18.055s
user 0m4.312s
sys 0m0.936s

Doing a little math, we see that the processing rate is 110.77 megabytes per second, or 0.00665 terabytes per minute. While running the test, my CPU usage hovered around 30%, indicating that the bottleneck here is the read speed of my RAID array. This result is rather far from the target, and I think I can do better.
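For anyone who wants to check the math, here is a small sketch of the calculation. It assumes the file counts as 2000 megabytes (the 2000 blocks of 1M written by dd) and that a terabyte is 1,000,000 megabytes; the throughput function name is made up for illustration. The last digit of the terabytes-per-minute figure depends on whether you round or truncate.

```shell
# throughput SIZE_MB SECONDS
# Hypothetical helper: prints the processing rate in MB/s and in TB/min,
# treating 1 TB as 1,000,000 MB.
throughput() {
    awk -v mb="$1" -v s="$2" 'BEGIN {
        rate = mb / s
        printf "%.2f MB/s, %.5f TB/min\n", rate, rate * 60 / 1000000
    }'
}

# The RAID0 run: 2000 MB hashed in 18.055 seconds of real time.
throughput 2000 18.055
```

The same function works for any later run by swapping in its "real" time from the output of time.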

Next up, I’ll run the test with the sample file located on my Samsung 830 128GB SSD. Output is below:

time md5sum sample.file
e273795bcc527a5157ce7f2095cacd42 sample.file

real 0m7.841s
user 0m4.332s
sys 0m0.884s

This brings us up to 255 megabytes per second! This is much improved, but it still only amounts to 0.01530 terabytes per minute. I happen to know that 255 megabytes per second is close to the practical maximum throughput of the SATA-II interface (3 Gbit/s, or roughly 300 megabytes per second in theory) that my SSD is attached to. While the drive could read much faster on a SATA-III interface, we are again limited by disk-related hardware.

I have one more trick up my sleeve. Ubuntu has a ramdisk built into the system (a tmpfs mount), located at /dev/shm. Files placed here are stored in RAM (read: very, very fast memory), and do not require any conventional disk access to be read. As with anything stored in RAM, files located here are lost forever once the computer is shut down. I’ll copy my sample file here now, and run the test again:
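Before copying a multi-gigabyte file into /dev/shm, it is worth checking that it will fit, since the ramdisk has a size limit tied to system memory. A rough sketch of that check follows; stage_in_ramdisk is a made-up helper name, not a standard tool, and the optional second argument exists only so the target directory can be changed.

```shell
# stage_in_ramdisk FILE [DIR]
# Hypothetical helper: copies FILE into DIR (default /dev/shm) only if
# df reports enough free space, then prints the path of the copy.
stage_in_ramdisk() {
    src=$1
    dir=${2:-/dev/shm}
    # File size in KB, rounded up.
    need_kb=$(( ($(wc -c < "$src") + 1023) / 1024 ))
    # Column 4 of "df -Pk" is the available space in KB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$need_kb" -gt "$free_kb" ]; then
        echo "not enough room in $dir: need $need_kb KB, have $free_kb KB" >&2
        return 1
    fi
    cp -- "$src" "$dir/" && echo "$dir/$(basename -- "$src")"
}

# Example (commented out; it copies the full sample file into RAM):
# stage_in_ramdisk /media/500G-Raid/sample.file
```

Remember to delete the copy afterwards, since it occupies RAM until it is removed or the machine is shut down.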

time md5sum sample.file
e273795bcc527a5157ce7f2095cacd42 sample.file

real 0m4.680s
user 0m4.148s
sys 0m0.528s

Now it’s up to 427 megabytes per second, wowza! While impressive, this is still only 0.02564 terabytes per minute, not even close to the mark. Knowing something about RAM, you might have been expecting this number to be much higher (I was). However, the data is not merely being read from RAM; it is also undergoing a complex transformation performed by the CPU. My CPU chart showed 100% utilization, and the user and sys times (4.148 s + 0.528 s) add up to almost exactly the 4.680 s of real time, so I have reached the upper speed limit at which my system can perform the MD5 function.
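To put the three results in perspective, the same numbers can be turned around to ask how long a full terabyte would take at each rate. This is a sketch under the same assumption as before (1 TB = 1,000,000 MB); minutes_per_tb is a made-up name for illustration.

```shell
# minutes_per_tb RATE_MB_PER_S
# Hypothetical helper: minutes needed to hash 1 TB (1,000,000 MB)
# at the given rate in MB/s.
minutes_per_tb() {
    awk -v rate="$1" 'BEGIN { printf "%.1f\n", 1000000 / rate / 60 }'
}

# Rate needed to finish one terabyte in exactly one minute.
awk 'BEGIN { printf "target: %.0f MB/s\n", 1000000 / 60 }'

# The three measured rates: RAID0, SSD on SATA-II, ramdisk.
for rate in 110.77 255 427; do
    printf '%s MB/s -> %s minutes per TB\n' "$rate" "$(minutes_per_tb "$rate")"
done
```

At the ramdisk rate a terabyte would take about 39 minutes, so the one-minute target needs a rate roughly 39 times higher than my best result.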

In conclusion, hashing data with MD5 at one terabyte per minute is a VERY tall order. My computer is a few years old, but I have trouble believing that even a brand new machine would best my ramdisk speed in this test by more than 50%. MD5 processes its input sequentially, with each block depending on the result of the one before it, so hashing a single file is a single-threaded job and all the cores in the world won’t improve its performance. I’m aware that there are suites that will run benchmarks of MD5 and other checksum functions, but I wanted to run these tests in a way that would reflect practical user experience, using common storage media. For anyone interested, the hardware specifications used in the experiment are as follows:

Intel Core 2 Duo e6550 CPU @ 3.0 GHz
500GB RAID0 array composed of two Western Digital 250GB HDDs (disk used in first test)
Samsung 830 128GB SSD, connected via SATA-II interface (disk used in second test)
6GB of DDR2-800 RAM @ 860 MHz (disk used in the third test)
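One caveat to the single-threaded point in the conclusion: extra cores can still help when there are many separate files to hash, because each file is an independent job. The sketch below is not part of the experiment. It assumes an xargs that supports the -0 and -P options (GNU xargs on Ubuntu does), and md5_parallel is a made-up name for illustration.

```shell
# md5_parallel JOBS FILE...
# Hypothetical helper: runs up to JOBS md5sum processes at once, one
# file per process. Output order depends on which process finishes first.
md5_parallel() {
    n=$1
    shift
    # NUL-separated names keep paths with spaces or newlines intact.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$n" md5sum
}

# Example (commented out; the file names are placeholders):
# md5_parallel 2 first.file second.file
```

This does nothing for a single large file, and with slow storage the disks will usually remain the bottleneck long before the CPU cores do.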