md5verify: A script to automatically verify file integrity

I have a lot of files on my computer: email archives, personal documents, stuff for work, photos I've taken... the list goes on. I'm sure most people reading this are in a similar boat. On occasion I've found files missing or corrupt, which is disturbing but probably to be expected. The bad part is that I keep backups, but I rotate them out when they reach a certain age, which means that if I don't notice a file is corrupt or missing, I'll eventually lose it forever.

I stayed up late a few nights ago and wrote a script to raise an alert when something has changed. On its first run the script recursively walks a directory tree, hashing each file and storing the hashes in that directory (in an md5sum-compatible file). On subsequent runs it begins tracking new files automatically, and it also prints messages for missing and changed files. Saving the checksums in each directory keeps them portable - you can copy a directory somewhere else and still verify that nothing changed (a quick md5sum -c checksums.txt will let you know).
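
In rough outline, the logic looks something like this (a simplified sketch rather than the actual script; the checksum file name and message format here are illustrative):

    import hashlib
    import os

    CHECKSUM_FILE = "checksums.txt"  # illustrative name; the real script may differ

    def md5sum(path):
        """Hash a file in chunks so large files don't exhaust memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_tree(root):
        """Walk a tree, comparing each directory's files against its checksum file."""
        problems = 0
        for dirpath, dirnames, filenames in os.walk(root):
            listing = os.path.join(dirpath, CHECKSUM_FILE)
            stored = {}
            if os.path.exists(listing):
                with open(listing) as f:
                    for line in f:
                        # md5sum-compatible format: "<hex digest>  <filename>"
                        digest, name = line.rstrip("\n").split("  ", 1)
                        stored[name] = digest
            current = {}
            for name in filenames:
                if name != CHECKSUM_FILE:
                    current[name] = md5sum(os.path.join(dirpath, name))
            for name, digest in stored.items():
                if name not in current:
                    print("MISSING: %s" % os.path.join(dirpath, name))
                    problems += 1
                elif current[name] != digest:
                    print("CHANGED: %s" % os.path.join(dirpath, name))
                    problems += 1
            # Rewrite the listing so new files are tracked from now on.
            with open(listing, "w") as f:
                for name in sorted(current):
                    f.write("%s  %s\n" % (current[name], name))
        return problems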

By default the script only prints messages when it sees something fishy, so it's perfect to drop into cron, and it uses exit statuses so it'll work with Nagios too. I've been running it for a few months and it has found a couple of files that changed - nothing critical yet, but it's nice to know it's there.
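
For cron that means a silent run (and no mail) when everything matches; for Nagios the exit code does the talking. A hypothetical wrapper around verify_tree() from the sketch above (the real script's exit-code mapping may differ):

    import sys

    # Hypothetical wrapper: map findings onto Nagios-style exit codes.
    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        problems = verify_tree(root)  # verify_tree() as sketched earlier
        sys.exit(2 if problems else 0)  # 2 == CRITICAL in Nagios terms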

18 Comments

This reminds me of Tripwire (http://sourceforge.net/projects/tripwire/) and rsync (with --itemize-changes --archive --link-dest). We used the former for file change detection and the latter for incremental backups with hard links (reporting new and changed files), like OS X's Time Machine does.
-- Jan!, 31 Jan 2011
While a home-made script is better than nothing, such programs already exist and are probably packaged for your favorite distro/OS. Take a look at AIDE, for example. It has a lot of options and is plenty fast.

http://aide.sourceforge.net/

Enjoy
-- Rémi, 31 Jan 2011
This is really cool, but inadequately paranoid. For one thing, errors on hard drives (and in memory, hard drive controllers, etc.) are distributed randomly, which means it could be the hash that has the error, not the file that was hashed. Also, you can still lose whole clusters/inodes, or even whole files, if you get an error in the directory entries.

What you need is hierarchical hashes. Hash the files, and store that like you do currently. Then hash all the metadata about those files (including the list of hashes) and store that in the parent directory. Do this recursively by depth-first traversal. Once you're done, the hashes in the top level directories monitor the contents of the lower level ones, allowing you to detect errors in them. Of course, don't forget to recursively update the hashes when you add or remove files, or change their contents.
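
A rough sketch of the recursive idea (the names and on-disk format here are made up, just to show the shape of it):

    import hashlib
    import os

    def md5_file(path):
        """Hash one file's contents in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def tree_hash(directory):
        """Depth-first traversal: each directory's hash covers its files'
        hashes and its subdirectories' hashes, so an error anywhere below
        changes the hash recorded above it."""
        entries = []
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isdir(path):
                entries.append("dir  %s  %s" % (tree_hash(path), name))
            elif os.path.isfile(path):
                entries.append("file %s  %s" % (md5_file(path), name))
        return hashlib.md5("\n".join(entries).encode("utf-8")).hexdigest()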

Also, I would suggest that rather than writing your own implementation in python, you should look into using ZFS; it does all this for you, plus it can correct the errors it finds, not merely detect them. http://en.wikipedia.org/wiki/ZFS#Data_Integrity. I've been using zfs-fuse for a while now, and I really like it. It's not very fast (which is unfortunate, since I put my home directories into my ZFS pool), but it's perfect for archiving important data.
-- db48x, 31 Jan 2011
ZFS is a great idea, but it doesn't give me the (admittedly easy-to-make) checksum files with each directory.

I figured I could deal with the small number of warnings manually (by checking the backup vs the live file and figuring out which was correct). Recursive checksumming is an interesting idea though.
-- Wil Clouser, 31 Jan 2011
Ah. I guess I'm not clear on why you would need the checksum files if you know that the filesystem is managing that for you. Perhaps your backup media doesn't use a ZFS filesystem.
-- db48x, 01 Feb 2011
Yeah, that was the idea.

Every time I rebuild a computer I look at ZFS and it always seems like the stable kernel space version is right around the corner (or so the mailing lists would lead me to believe). One of these days I'll stop believing it and just install it with fuse.
-- Wil Clouser, 01 Feb 2011
Hey, this script was a lifesaver. I'm concerned that my hard disk may be in the early stages of failure, and I want to monitor files to see if their readable data changes. It took a lot of googling to find your script--it might help to mention some more key search terms, like "file integrity checker", "file alteration monitor" (not FAM, haha), "file checksum", "file modification detection", etc. Maybe those will help someone else.

Anyway, I needed to make some changes to suit my needs, so I did:

https://code.launchpad.net/~alphapapa/+junk/chafifi

I also upgraded it to argparse.

I already had a bzr repo for it when I decided to fork it on github...and I don't really want to use both for the same thing. So feel free to pull it out of there if you are interested in my version.

BTW, can we officially GPL it?

Thanks again.
-- Adam, 15 Jan 2012
Thanks Adam, I'll check out your changes. I'd be wary of running the script on a dying hard drive; it's pretty intensive. Sometimes I'm wary of running it on my healthy hard drives. :)

I'll put a BSD license file in the root which is GPL compatible.
-- Wil Clouser, 15 Jan 2012
Thanks, Wil. I added the license to mine.

I'm not sure if my hard drive has recently started losing sectors, or if upgrading my OS just caused sectors to be accessed and written that hadn't been touched in a long time--perhaps they were going bad for a while. Anyway, smartmontools + this script will help me keep an eye on it. :)

BTW, I wonder if using a SQLite db would be better than dumping hashfiles all over the place. What do you think?
-- Adam, 16 Jan 2012
SQLite is an option. I wanted the capability of copying a directory and getting all the checksums under it automatically.
-- Wil Clouser, 16 Jan 2012
I found your script while searching for a better solution for verifying file integrity. My problem is a little different, but it's in the same territory as yours: making sure the files are still the same.
In my case I do several HUGE copies, and sometimes the files get corrupted during the process, which gives me an incredible headache. My previous solution did not work well, since it took several hours to verify each file using cksum (on Solaris and HP-UX).
As I suspected, there's no way to perform a checksum of each file without waiting hundreds of hours for the script to finish.
I know it's a very tough task for the machine to open and hash all the files, but I was eager to find some magic solution for it, and it seems there isn't one. Your script also takes too much time to finish the verification. (OK, I have to check the integrity of 1,000,000 files.) :\
Anyway, thanks for sharing your solution, it's pretty well done. :)
-- Leo, 31 Jan 2012
Thanks for the script. Exactly what I was looking for, and I will be interested to see how often silent corruption is discovered in my large library of media files.

I made simple changes so that changed and missing files' hashes would not be updated (I don't trust myself with just a single warning; I may need constant reminding of a problem). However, I'm not smart enough to fix a problem I'm having: on Debian stable (squeeze) with Python 2.6.6, the recursion into subdirectories does not work. Has something about the os.walk function perhaps changed, so that it would work with a newer Python?

Anyway, thanks again.
-- Henry, 06 Mar 2012
I don't think os.walk has changed in a while, and 2.6.6 is pretty new. I run the script on 2.6.5, for what it's worth.
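
If you want to sanity-check os.walk on your box, something like this (just an illustration) should print every subdirectory it visits:

    import os

    # os.walk should yield every directory under the root, recursively.
    for dirpath, dirnames, filenames in os.walk("."):
        print("%s -> %d files" % (dirpath, len(filenames)))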
-- Wil Clouser, 06 Mar 2012
Yeah, the script does not work on Debian. It cannot recurse through folders. However, it works on a flat directory with only files in it.
-- Thomas, 14 Oct 2013
Nice script though.
-- Thomas, 14 Oct 2013
I just tested this on both Debian squeeze and the newest Wheezy. It does not work on either of those.
This is the error:
ERROR - Error writing checksums file ./tmp/.checksums: [Errno 2] No such file or directory: './tmp/.checksums'
ERROR - [Errno 2] No such file or directory: './foo'
-- Thomas, 14 Oct 2013
Sorry for the troubles, Thomas. Try the latest version: https://github.com/clouserw/scripts/blob/master/md5verify.py
-- Wil Clouser, 30 Oct 2013
Thanks! This is exactly what I was looking for: A nice simple script to make checksum files and verify them later on.
-- David Hogue, 18 Jan 2014
