Allow matching files in two torrents to be processed together


madkat

Let's say I'm downloading the just-released BigFileCollectionTorrent Version 1.1 and 90% of the files are the same as Version 1.0.

I don't have Version 1.0 yet, and there are LOADS of seeders on that torrent.

Version 1.1, because it's so new, has one seeder who is only online some of the time.

I would like to be able to make use of the seeders/peers with Version 1.0 to reduce the load on the seeder of Version 1.1 and allow me to get the package sooner without having to mess about with merging folders manually. Additionally, I can start seeding Version 1.1 much sooner.

I realise it's not a simple task to coordinate across torrents, but surely if the filename, size and hash match it's a fair bet that the file is the same? This could always be an optional feature that defaults to Off.

(It would also be nice to be able to download into a single folder, and choose to Discard, Overwrite or Label (append to the filename) any non-matching duplicates.)

This could also extend to checking file hashes and sizes within a single torrent, if the "cleanliness" of the original data is questionable.


This is more complicated than it may seem, because in bittorrent, hashes are not calculated per file. Hashes are calculated per "piece", and pieces on the ends of files usually overlap into other files. So it is rather nontrivial to check whether a particular file you have is the same as a particular file in a given torrent. You'd have to have the end of the file before it, and the beginning of the file after it, to get a positive match.
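To make the piece/file overlap concrete, here is a minimal sketch in Python. It assumes the usual v1 layout (all files concatenated into one stream, SHA-1 over fixed-size pieces); the helper name `piece_hashes`, the file names and the tiny piece length are made up for illustration and are not taken from any client.

```python
import hashlib

def piece_hashes(files, piece_length):
    """files: list of (name, bytes). Returns [(sha1_hex, [file names touched])]."""
    # Assumption: pieces are cut from the concatenation of all files, in order.
    stream = b"".join(data for _, data in files)
    # Record the byte range [start, end) that each file occupies in the stream.
    spans, offset = [], 0
    for name, data in files:
        spans.append((name, offset, offset + len(data)))
        offset += len(data)
    result = []
    for start in range(0, len(stream), piece_length):
        end = min(start + piece_length, len(stream))
        digest = hashlib.sha1(stream[start:end]).hexdigest()
        # A file is "touched" if its byte range overlaps this piece.
        touched = [n for n, s, e in spans if s < end and e > start]
        result.append((digest, touched))
    return result

# Toy torrent: three small files and an unrealistically small piece length.
files = [("a.bin", b"A" * 10), ("b.bin", b"B" * 10), ("c.bin", b"C" * 4)]
pieces = piece_hashes(files, piece_length=8)
for index, (digest, touched) in enumerate(pieces):
    print(index, digest[:8], touched)
```

In this toy layout only the first piece lies wholly inside one file; the other two each straddle a file boundary, so verifying b.bin on its own needs the tail of a.bin and the head of c.bin.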

This is further complicated by the fact that in BigFileCollection v1.0, the files may be in a different order than they are in BigFileCollection v1.1, which would make this completely impossible to do accurately, afaik.

If you could specifically match "pieces" (or "chunks", or whatever you want to call them) of individual torrents to each other, then it might be possible through an intermediate semantic layer, but I don't know how the implementation works.


Damn, I thought each file had its own hash too.

Sadly, piece matching would be unworkable - as you quite rightly point out, files may be in a different order (or, just as importantly, an earlier file may be a different size - even 1 byte would be enough to change all the hashes!)
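A quick way to see the one-byte effect is the sketch below. It is only an illustration under the same assumption as before (SHA-1 over fixed-size pieces of the concatenated files); the data and the 16-byte piece length are invented for the demo.

```python
import hashlib

def hashes(stream, piece_length):
    """SHA-1 digest of each fixed-size piece of the byte stream."""
    return [hashlib.sha1(stream[i:i + piece_length]).digest()
            for i in range(0, len(stream), piece_length)]

shared = bytes(range(256)) * 4    # content that is identical in both versions
v1_0 = b"x" * 16 + shared         # first file is 16 bytes long
v1_1 = b"x" * 17 + shared         # same first file, grown by a single byte

old = hashes(v1_0, 16)
new = hashes(v1_1, 16)

# Skip piece 0 (it lies inside the changed first file) and compare the rest,
# i.e. the pieces that cover the unchanged content.
still_matching = set(old[1:]) & set(new[1:])
print(len(old) - 1, "old pieces cover the shared data;",
      len(still_matching), "of them reappear in the new version")
```

Every piece after the grown file starts one byte later than before, so none of the hashes covering the unchanged data survive, even though the data itself is byte-for-byte identical.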

BitComet torrents add "padding" files which I gather align the starts of files with the starts of chunks, which could plausibly help, but they are quite ugly and only a handful of torrents carry them.
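As a rough sketch of why padding helps: if each file is followed by enough filler to reach the next piece boundary, every real file starts at offset 0 of some piece. The code below only does the arithmetic; the function names are hypothetical and it assumes the padding sits directly after each file, which may not match exactly what BitComet writes.

```python
def padding_needed(file_length, piece_length):
    """Filler bytes needed after a file so the next one starts on a piece boundary."""
    return (-file_length) % piece_length

def layout_with_padding(sizes, piece_length):
    """Return (offset, size, pad) per file, assuming padding follows each file."""
    layout, offset = [], 0
    for size in sizes:
        pad = padding_needed(size, piece_length)
        layout.append((offset, size, pad))
        offset += size + pad
    return layout

# Toy example: three files, 8-byte pieces.
for offset, size, pad in layout_with_padding([10, 16, 3], piece_length=8):
    print(f"file at offset {offset}, size {size}, followed by {pad} padding bytes")
```

With aligned starts, a file's piece hashes depend only on that file's own contents (plus padding), so the same file would hash the same way in two different torrents that use the same piece length.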

I guess you could add some protocol customisation which allows a client to ask two of its peers for the hashes of completed files, but then it would only work in uTorrent <<==>> uTorrent transfers.


  • 4 months later...

I'm in no way a BT developer (just a CS student), but it seems that an algorithm could be created. It wouldn't be 100 percent effective, of course, but two torrents that contain one matching file are, on the whole, more likely to contain more than one.

I understand that hashes aren't universally unique, but if the hashes of all the pieces of one torrent compared against all the pieces of another torrent came back with a large number of matches, order excluded, you could assume that there are matching files somewhere in the torrent.

From there, it's another algorithm to determine what files contained the matching hashes, and the fact that the start and end of the identical files were in pieces with unrelated files wouldn't matter in determining identical files across separate torrents.
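A minimal sketch of that two-step idea, in Python: collect the piece hashes of one torrent into a set, then for each file in the other torrent report what fraction of its pieces also occur in that set. It assumes both torrents use the same piece length and builds the "torrents" from raw bytes for the demo; `describe` and `matched_fraction` are hypothetical names, not part of any client.

```python
import hashlib

def describe(files, piece_length):
    """Return (piece hashes, {file name: indices of pieces overlapping that file})."""
    stream = b"".join(data for _, data in files)
    hashes = [hashlib.sha1(stream[i:i + piece_length]).digest()
              for i in range(0, len(stream), piece_length)]
    covering, offset = {}, 0
    for name, data in files:
        if data:  # skip empty files, they overlap no piece
            first = offset // piece_length
            last = (offset + len(data) - 1) // piece_length
            covering[name] = list(range(first, last + 1))
        offset += len(data)
    return hashes, covering

def matched_fraction(torrent_a, torrent_b, piece_length):
    """For each file in B: fraction of its pieces whose hash also occurs in A."""
    hashes_a, _ = describe(torrent_a, piece_length)
    hashes_b, covering_b = describe(torrent_b, piece_length)
    known = set(hashes_a)  # order of pieces in A is deliberately ignored
    return {name: sum(hashes_b[i] in known for i in idx) / len(idx)
            for name, idx in covering_b.items()}

big = bytes(range(64))                          # the file shared by both versions
v1_0 = [("readme", b"r" * 4), ("big", big)]     # big starts on a piece boundary
v1_1_aligned = [("big", big), ("readme", b"R" * 5)]
v1_1_shifted = [("readme", b"R" * 5), ("big", big)]

aligned = matched_fraction(v1_0, v1_1_aligned, piece_length=4)
shifted = matched_fraction(v1_0, v1_1_shifted, piece_length=4)
print("aligned:", aligned)
print("shifted:", shifted)
```

In this toy case the shared file is fully recognised when it happens to start on a piece boundary in both torrents, and not recognised at all once a 5-byte file in front of it shifts it by one byte. So the approach is workable only to the extent that file offsets line up relative to piece boundaries.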

I understand that this is oversimplifying, but the process should still be workable.


If any of the changed files grows by even one byte, it's fairly likely that most of the "pieces" key (which contains the hashes) will change, because every piece after that file now covers a shifted range of data. And that's not even considering newly added or removed files. Attempting to compare the hashes in fundamentally different torrents (even if they differ by only one file) is unlikely to give you what you want -- and if someone has to make a v1.1 of a large torrent, then chances are that more than just a handful of files were added, removed or changed.


Archived

This topic is now archived and is closed to further replies.
