Since I’ve started working in digital data preservation, I’ve thought a lot more about the importance of assuring the continued integrity of data files. We use checksum verification all the time at work, but I don’t know many people who hold their own personal data to the same standard. Is it not as significant or worth saving?

I create monthly backup images of the boot drive on each of my own computers, so that if something were to happen to them I would be able to restore my data easily. This all assumes that the image hasn’t become corrupt, of course. When we’re talking about operating system files, a change of just a few bytes could mean the difference between a stable system and one that crashes regularly. Those of you who have tried to install an operating system from a scratched or corrupted optical disc are probably familiar with this pain. Fortunately, checksums allow us to detect such a change. When corruption is detected, one should restore from a (hopefully existing) second backup.

I created a set of bash functions for use within a Linux environment, to ease the creation and verification of MD5 checksum files. They are to be placed in your .bashrc file, which is in your home directory. Bash functions placed there can be called from the terminal like any other command, which makes them handy shortcuts for often-used functionality. After editing .bashrc, open a new terminal (or run source ~/.bashrc) so that the functions are loaded. The code is below:

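md5sc() #creates a .md5 sidecar file next to the target. return 0 == created.
{
	local target="$1"
	if [ ! -f "$target" ]; then
		#without this check a missing file would produce an empty sidecar
		echo "$target is missing or not a regular file."
		return 1
	fi
	local checksumString=$(md5sum "$target")
	local checksum=${checksumString:0:32}
	echo "$checksum" > "$target.md5"
	echo "$checksum $target"
}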
 
md5v() #md5 verifies a file based on its .md5 sidecar file. return 0 == passed.
{
	local target="$1"
	local targetNameLength=${#target}
	let targetNameLength-=4
	local targetExtension=${target:targetNameLength:4}
 
	local controlChecksum=0 #checksum in sidecar file
	local testChecksumTarget=0 #file to create checksum from
	if [ "$targetExtension" == ".md5" ]; then
		controlChecksum=$(cat "$target")
		controlChecksum=${controlChecksum:0:32}
		testChecksumTarget=${target:0:targetNameLength}
	else
		controlChecksum=$(cat "$target.md5")
		controlChecksum=${controlChecksum:0:32}
		testChecksumTarget="$target"
	fi
 
	local testChecksum=0 #checksum of file we're verifying
	if [ -e "$testChecksumTarget" ] && [ "$controlChecksum" != "" ]; then
		local testChecksumString=$(md5sum "$testChecksumTarget")
		testChecksum=${testChecksumString:0:32}
		if [ "$controlChecksum" == "$testChecksum" ]; then
			echo "Control: $controlChecksum"
			echo "Test   : $testChecksum"
			echo "$testChecksumTarget passed verification!"
			return 0;
		else
			echo "Control: $controlChecksum"
			echo "Test   : $testChecksum"
			echo "$testChecksumTarget FAILED verification!"
			return 1;
		fi
	else
		echo "Sidecar or target file missing or unreadable."
		return 2; #could not verify, so don't report success
	fi
}

You can use these functions like any other command in a bash terminal. The md5sc function creates a side-car file containing the checksum of the function’s input file. The input file and the created checksum file should be stored together to ease later verification. The md5v function verifies the checksum in the side-car file against a newly generated checksum of the file under scrutiny. The side-car file name is the same as the input file name, with the addition of a “.md5” extension. Example syntax and usage are shown below.

The first example shows how the checksum side-car file is generated:

user@MyComp:~$ md5sc MyAwesomeFile.zip
a2292e4b988f01bf9961ecdfb1cf3e2f MyAwesomeFile.zip
user@MyComp:~$ ls
MyAwesomeFile.zip MyAwesomeFile.zip.md5

Now to verify our file against the stored checksum:

user@MyComp:~$ md5v MyAwesomeFile.zip
Control: a2292e4b988f01bf9961ecdfb1cf3e2f
Test   : a2292e4b988f01bf9961ecdfb1cf3e2f
MyAwesomeFile.zip passed verification!

We can target the .md5 file as well; its relation to the original file is inferred from the file name.

user@MyComp:~$ md5v MyAwesomeFile.zip.md5
Control: a2292e4b988f01bf9961ecdfb1cf3e2f
Test   : a2292e4b988f01bf9961ecdfb1cf3e2f
MyAwesomeFile.zip passed verification!
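
For a folder full of backup images, checking each file by hand gets tedious. Below is a minimal sketch of a batch check that walks every .md5 side-car in a directory and compares it against a fresh checksum, in the same way md5v does for a single file. The function name md5v_all and its output format are my own assumptions for illustration:

```bash
# Hypothetical helper: verify every *.md5 sidecar in a directory
# (defaults to the current one). Returns 0 only if every file matches.
md5v_all() {
    local dir="${1:-.}"
    local sidecar target control actual failures=0
    for sidecar in "$dir"/*.md5; do
        [ -e "$sidecar" ] || continue          # the glob matched nothing
        target="${sidecar%.md5}"               # strip the .md5 extension
        control=$(head -c 32 "$sidecar")       # digest stored in the sidecar
        if [ ! -f "$target" ]; then
            echo "MISSING $target"
            failures=$((failures + 1))
            continue
        fi
        actual=$(md5sum "$target" | cut -c1-32)  # digest of the file right now
        if [ "$control" = "$actual" ]; then
            echo "OK      $target"
        else
            echo "FAILED  $target"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

Called as md5v_all /path/to/backups, it prints one line per side-car. Because it returns a non-zero status when anything fails or is missing, it could also be run from a scheduled script that alerts you only when something is wrong.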