A feature to detect corrupt files.


I have occasionally downloaded obviously corrupt files and it would be a good thing if I could detect them quickly as I add torrents to the download pool. A quick way of doing this is to measure the compressibility/entropy of the file.

Let's say that I have downloaded a 4 GB movie file. If I want to quickly check its legitimacy, I can try compressing it with, say, WinRAR. If it compresses down to 2 MB, then I know immediately that something is wrong with the file.

This routine could be incorporated into the uTorrent engine as a measure of entropy, which in compression science is the part of a data stream or file that cannot be reduced without loss of information. In contrast, the part of a data stream that can be removed without loss of information is called redundancy.

Any compressed file such as a jpeg, mpeg, gif, png, zip, rar, 7z, mp3, ogg, mpc, flac and so on is expected to have an entropy of 90-100%. A pure text file can have an entropy as low as 5-10%. Therefore a measure of entropy can give a quick indication of whether a file is legit or not.
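To make the idea concrete, here is a rough sketch in Python of how such an entropy figure could be estimated. It computes the byte-level (order-0) Shannon entropy of a sample from the start of a file and expresses it as a percentage of the 8 bits per byte maximum. The function names and the 1 MB sample size are my own assumptions, not anything uTorrent provides. Byte-level entropy is also a cruder measure than running a real compressor, so the percentages will not line up exactly with the figures above (plain text typically scores much higher than 5-10% this way), but a block of zeroes still comes out at 0%.

```python
import math
from collections import Counter


def entropy_percent(data: bytes) -> float:
    """Byte-level Shannon entropy as a percentage of the 8 bits/byte maximum."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum over byte values of p * log2(1/p), in bits per byte
    bits = sum((n / total) * math.log2(total / n) for n in counts.values())
    return 100.0 * bits / 8.0


def file_entropy_percent(path, sample_size=1 << 20):
    """Entropy of the first sample_size bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return entropy_percent(f.read(sample_size))
```

A file that is supposed to be a movie or an archive but scores near 0% on this measure would be an obvious candidate for a closer look.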

A corrupt file does _not_ mean that the torrent itself is corrupt. During the years I have used torrents for downloading, several corrupt files have slipped past both my attention and the hash check system of the torrent client. It's not a rare case; it happens frequently enough for me to be concerned about it.

The following Google search query turns up an example of such corrupt data, a torrent of 18.7 GB that looks 100% legit but is in large part just a bunch of zeroes:

Magical Library filetype:torrent


I could give more examples, but this is what comes to mind off the top of my head.

Yes, I do know that some are fake, but some people may believe that something is wrong with their software when they run into problems. A check on the entropy will reveal fakes immediately.

When it comes to the "magical library", I just compressed it with 7-zip, which took it from 18.7 GB down to 8 GB, so at least half of this library is a pile of zeroes. It would be great to have a tool to filter out all the fakes. Unfortunately, the archiver that comes with Gnome in OpenSolaris doesn't display the compression ratio for each file, so I'm unable to pinpoint the redundant files.
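For pinpointing the redundant files without an archiver that shows per-file ratios, something along these lines might do. It is only a sketch: it walks a directory, compresses a sample from the start of each file with Python's standard zlib module at its fastest level, and reports files whose sample shrinks below a threshold. The function names, the 4 MB sample size and the 5% threshold are all assumptions of mine and would need tuning.

```python
import os
import zlib


def compression_ratio(path, sample_size=4 * 1024 * 1024, chunk_size=1 << 16):
    """Compressed size divided by original size for the start of a file."""
    compressor = zlib.compressobj(1)  # level 1: fastest, good enough for a check
    read = 0
    packed = 0
    with open(path, "rb") as f:
        while read < sample_size:
            chunk = f.read(min(chunk_size, sample_size - read))
            if not chunk:
                break
            read += len(chunk)
            packed += len(compressor.compress(chunk))
    packed += len(compressor.flush())
    # An empty file has nothing to judge, so treat it as incompressible.
    return packed / read if read else 1.0


def suspicious_files(root, threshold=0.05):
    """Yield (path, ratio) for files whose sample compresses below threshold."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            ratio = compression_ratio(path)
            if ratio < threshold:
                yield path, ratio
```

Usage would be something like `for path, ratio in suspicious_files("/path/to/download"): print(path, ratio)`. Since only the start of each file is sampled, a file with a valid beginning and zeroes further in would slip through; sampling several offsets would be the obvious refinement.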

I actually don't like the word "fake", since the seeder himself might be unaware that the files are "empty". uTorrent usually reserves space by creating empty files for the download, and if you interrupt the download you will be left with empty files. It is easy to believe that they are legit if you are not careful.
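For this particular case of preallocated but never filled files, an even simpler check than compression would be to count how much of a file is zero bytes. The snippet below is a minimal sketch of that; `zero_fraction` is a name I made up, and it reads the whole file, so it is slower than sampling on multi-gigabyte files.

```python
def zero_fraction(path, chunk_size=1 << 20):
    """Fraction of bytes in the file that are zero (0.0 for an empty file)."""
    zeros = 0
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            zeros += chunk.count(0)  # bytes.count accepts an int byte value
            total += len(chunk)
    return zeros / total if total else 0.0
```

A result close to 1.0 would suggest that most of the file was reserved but never actually downloaded.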

