Jump to content

Duplicate detection (reopened)


carnino

Recommended Posts

Ref: http://forum.utorrent.com/viewtopic.php?pid=37321#p37321

Dear "Firon" (and others).

Please don't close a topic just because you think it cannot be done. Please have some decency and allow for the ideas to be exchanged.

I understand that you do not know me from a "bar of soap" or what my knowledge or experience is, in these matters, but please allow for some respect.

I do know what I am talking about and I've been in development for over 20 years. I could write the code for it myself, but just don't have the time for it. I will, however take the time to explain how it can be achieved.

I will describe the process for single file torrents, but can be extended for multiple file torrents

Each file, has in the torrent a sequence of SHA-1 hashes for each of its consecutive pieces, which is used to verify the file as the pieces get downloaded.

When two files (in different single file torrents) have the same overall size and the same piece size, then it is a quick and easy matter of comparing the sequences of hashes to know if the content of the two files is the same, and thus know beforehand if the files match in content.

If they do, then as a piece comes in for the file "A" in torrent "A", then it can be automatically also be used for file "B" in torrent "B". This is effectively the same as using the sources of both to complete both files more quickly.

This is the simple case, and can be implement quite easily. However, it can be complicated further for the case of torrents with different piece sizes in comparison to each other, which would require more intricate development.

In that case, the simple comparison of the hashes would not suffice to detect a duplicate content before the process starts, but can be done at a later stage, during the download process.

Knowing that the files "A" and "B" of different torrents have the same size, would allow them to be placed in an "alert" status, so that as pieces came in for one file, they can be analyzed for the other.

Lets say the torrent "A" has a piece size 1MB and torrent "B" has a piece size of 2MB. When a piece for "B" comes in, one could split it in two, and calculate the hash for each 1MB part and that compared to hash sequence of "A". If they match hash and position, they can be used, and if not, just ignored.

In the opposite direction, if a piece for "A" arrived, it would be concatenated with the adjacent piece (already received beforehand) to be hashed, and the result be compared to the hash sequence and position in torrent "B". If they matched, it could be used, and so on.

However, not all piece sizes would be the same or exact multiples or factors of each other, but the principle is the same even for non factors or non multiples, just the math and analysis would be more complex.

There would naturally be an increase in CPU usage for the extra hash calculations, but nothing out of the ordinary and the download speed advantage could be substantial in these cases.

This whole process could actually be extended to have the whole download process work on a "piece" concept instead of a "file" concept and allow for a piece to be used in any position or file or torrent as long as the hash matches.

This would allow for speed enhancements for areas of files that are duplicates of other sections of the file or in other files. Believe it or not, this is actually quite frequent.

All this is possible, without altering the torrent or the communications with the tracker. All this intelligence and processing takes place in the client software only.

I hope this short explanation is understandable to most, but especially the main developer of "uTorrent", so that he may consider it for inclusion in his application.

Best regards,

Carnino

Link to comment
Share on other sites

  • 2 months later...

I wish you wouldn't Dismiss this so quickly.

Yes, it /can/ be complex... but with the right approach, it might be possible to implement it in a simple way. (In fact, the way it was described above is only one possible implementation.. and not the simplest)

For example, in it's simplest form, you could allow the user to identify possible duplicates manually.

Right click a file.. go to "Find Duplicates".. a window pops up with other currently downloading files which have the same filesize... and you place a checkmark in each one that you feel is the "same" file.

Then, whenever enough data exists in one of the files to make a complete hash-checkable "piece" in one of the other files, you "assume" the files match.. check the hash, and if it matches, copy the data over and mark the piece as finished. If it doesn't match, then you just do nothing, and posibly mark the files as NOT being identical (to prevent wasted work in the future). This would even work for multiple file torrents, and torrents with differing piece sizes.

Honestly, the advantages here could be enormous. Download speeds would be considerably faster. This is a scenario that MANY of us have happen every day. Some new file comes out.. and immediately several torrents show up with varying piece sizes. It is also, (to my knowledge) a feature that is not available in any other BT client. It would really be a STRONG reason to use uTorrent over any of the alternatives, if this were implemented.

-JohnQ

PS. Would it be hard to add a "force re-check" for ONE file at a time instead of the entire torrent?

(Possibly while the torrent is running?)

If this were implemented, And I were downloading the same file from multiple torrents, I would be able to put together an external program to just do a bitwise or of two files periodically.. and then re-hash-check the file. (Assuming that uTorrent writes 0's in areas where nothing has been downloaded yet)

Just some food for thought..

Please consider this.

Link to comment
Share on other sites

Hi JohnQ

I just read the whole thread.

I agree with Ludde that this would be very complicated.

But my questions to you are:

Do you really download the same torrent from different trackers at the same time, where they have different piece sizes? This is the only way for your theory to work.

Because if the torrents are different, the case for having exactly matching pieces in size and bytes (identically) would be very small (0,00001% perhaps) i think.

Why should Ludde implement a feature, which is for no use for 99,999% of the users?

The only case for your idea to work and take effect could be: Loading down a torrent from a tracker (perhaps 1.1.2006)

Redownload it (perhaps a new version of the torrent, who knows) on 1.2.2006.

So you want µtorrent to check, which pieces to we have allready (piece size must be the same, differences of the torrent must be on the end = added)

Is this the way it should work, or do i missunderstand you?

Link to comment
Share on other sites

Actually, The method I am describing Does *NOT* require that the pieces match at all.

Let me try a more concrete example.

Lets say that there are two torrents, Alpha and Beta.

Alpha:

Piece size is 256k.

Contains 4 files (total size is 6,421,543 Bytes)

FileA1 is 1,638,410 Bytes

FileA2 is 1,508,643 Bytes

FileA3 is 1,517,557 Bytes

FileA4 is 1,756,933 Bytes

Pieces(25 of them): abcdefghijklmnopqrstuvwxy

Therefore:

FileA1 spans a through f, and part of g

FileA2 spans the remainder of g, h through l, and part of m

FileA3 spans the remainder of m, n through q, and part of r

FileA4 spans the remainder of r and s through y

Beta:

Piece size is 1Mb.

Contains 3 files (total size is 4,704,662 Bytes):

FileB1 is 1,878,422 Bytes

FileB2 is 1,517,557 Bytes

FileB3 is 1,308,683 Bytes

Pieces(5 of them): ABCDE

Therefore:

FileB1 spans A and part of B

FileB2 spans the remainder of B and C and part of D

FileB3 spans the remainder of D and E

Now... For the sake of argument, lets say that FileA3 and FileB2 are identical files, and that the user can suggest this possibility to uTorrent (or uTorrent can guess, if it is feeling particularly pro-active).

Obviously NONE of the SHA1 hashes in the two torrents are the same. (This is the most general case, and makes no assumptions about the alignment, or size of the pieces)

What I am suggesting is simple...

Lets say that piece C of Beta is downloaded, passes the hash check, and and gets written to disk. This corresponds to bytes 218,730 ~ 1,267,305 (inclusive) of FileB2

Since we are assuming that FileB2 and FileA3 are the same, then the data in piece C in Beta should also correspond to to bytes 218,730 ~ 1,267,305 of FileA3. Which (in terms of pieces in Alpha) corresponds to part of piece m, all of pieces n,o, and p, and part of piece q.

The segment of the data in C which aligns with pieces n,o, and p could immediately be checked using their hashes in Alpha. If they match, great, we have just saved ourselves from having to download them.

What to do with the partial m and q in this example is up for some debate... one option is to simply discard them.

Another option is to keep portions of m and q which fill complete "blocks" in the pieces, and mark them as being partially downloaded... then when the "rest" of m or q is downloaded, it can be combined with the existing parts, and hash checked.. if the hash fails, just discard the part which came from C and continue on your merry way.. But if it succeeds, it saves further downloading.

Obviously, this would work in the other direction as well... If pieces m,n,o,p and q were downloaded correctly in alpha, they could be combined, and the center portion could be checked against C's hash in Beta. Similarly, if only piece o is downloaded in Alpha, then several of the central blocks of C can be tentatively filled in (pending a hash check when the rest of C finishes downloading)

There is a lot of room for improving on this idea as well.. such as automatically detecting potential duplicates.. or prioritizing pieces so as to take into account both sources, etc. But I think that the majority of the benefit would come from a basic implementation as described here.

The simplest explanation I can provide is this:

Every time a full (verrified) piece is finished for torrent which has duplicate files, look to see if any of the information in the file can be copied into missing pieces (or blocks of pieces) in the duplicate. If so, copy them, and hash check as needed.

Effectively, all this is really doing is using YOURSELF as an additional source for blocks of data. Everything still needs hash checking, and so on, but it simply alows the client to share downloaded data between two seperate torrents, which have some comonality.

I hope this makes more sense. I checked my figures above a few times, but obviously it is a lot of math to do by hand and I might have made smal atithmetic errors. Please forgive any math problems I have up there :-)

-JohnQ

Link to comment
Share on other sites

So i think, i unterstood you right

... but it simply alows the client to share downloaded data between two seperate torrents, which have some comonality ...

And that's the problem i think:

... Because if the torrents are different, the case for having exactly matching pieces in size and bytes (identically) would be very small (0,00001% perhaps) i think ...
Link to comment
Share on other sites

If you look at the rest of my example though, you'll see that "matching pieces in size and bytes" is not required at all. In fact, in my example above they DO NO MATCH!

Please note that the implementation I outlined is NOT the same implementation that carnino initially proposed. It has the same overall effect, but is more general, requires fewer assumptions, and is MUCH easier to implement.

The only time that this feature would have any effect is when:

1) The *SAME* identical file appears as a duplicate in two active torrents.

2) The user (or uTorrent) has identified them as a *possible* duplicate.

The piece sizes and offsets are completely irrelavent.

Additionally, in the event that a file is INCORRECTLY identified as a duplicate, it is easily detected, and no corruption can/will occur.

Maybe I am not explaining it clearly enough, but let me assure you that the pieces do *not* need to be the same size, do *not* need to have similar offsets, and do *not* need to have the sme SHA1 hashes.

In fact, it is so very unlikely that all of these things would ever match, that a scheme which depended on them would almost never work. This is why the scheme I outlined above does not require any of these things to be true!

-JohnQ

Link to comment
Share on other sites

Maybe it's my fault for not reading your replies correctly... Or answering you origional question. :-)

YES, I often find myself downloading the *identical* file in several (anywhere from 2 to 6) torrents.

For example, you often find that one person has torrented a file with a 1mb piece size.. and then somone re-torrents it (and includes a text file, or a Thumbs.db or something) with a smaller piece size... And then somone else re-torents it again as part of a multi-file compilation torrent with yet another piece size...

And all you can do is start them all going, and wait until the first one finishes.. And when you are unlucky, all of them race to 90% and stop because there are no seeds.. and you just KNOW in your heart that if you could combine those partial files easily, you could make one good copy, and could seed it to fix the seedless torrents.

And then if the one torrent that finished first was NOT the compilation, then you have to stop the compilation, and copy the finished file over top of the partial download in the complilation, force a re-check (On the whole thing), and then resume it in order to take advantage of the data that you have already downloaded.

So.. yes... the file does need to be identical.. and yes.. this does happen to me on a daily basis.

Hope that makes more sense now ;-)

-JohnQ

Link to comment
Share on other sites

Do you know this option of µTorrent:

Adding a new torrent (in your case with one additional file) and check the files which are included.

Only download the missing (new) file of this torrent is enough!

If you have allready started, go this way:

At "Files" - Tab, CTRL-A (in Windows for select all), then use right-click "Don't download", so no file will be downloaded. Now you can select the wanted one(s) and right-click them the priority to download.

Is this what you are searching for?

Link to comment
Share on other sites

Yeah, I am aware of that option... I will not do what I want it to do.

I'm not trying to download one file out of a multi-file torrent.

I am trying to make use of multiple *different* torrents which contain some files in common, without having to download the same data several times.

All I want to do is prevent uTorrent from having to download the same file multiple times when it appears in several active torrents. The client does not need to download the extra data, why not take advantage of the redundancy and accelerate the download?

Link to comment
Share on other sites

  • 2 weeks later...

If I understand what you are asking, i.e. uTorrent comparing two files in two different torrents to determine if they are the same, it's virtually impossible.

The reason being that files aren't hashed individually. All the data comprising the files to be torrented are concantinated together, as per the BitTorrent specification, and then the resulting blob is chopped up into piece sizes and each piece hashed.

You could end up with a given piece containing the tail end of one file and the beginning of the next and have no way of knowing what a fragemented file like that would hash to unless you had the actual file.

The result is that one can't know the hash of an individual file described in a torrent file using the torrent file itself.

So, until both files were downloaded, there would be no way to compare hashes and if you already downloaded them, there's no real point.

Link to comment
Share on other sites

Two identical files have the same SHA1 information, so comparing the file-hashes would make it possible to determine whether or not the files are the same.

This would mean that it is not impossible to make something of the like, only that it would be rather complicated to do so.

Link to comment
Share on other sites

The reason being that files aren't hashed individually. All the data comprising the files to be torrented are concantinated together, as per the BitTorrent specification, and then the resulting blob is chopped up into piece sizes and each piece hashed.

Animorc, just because two identical files have the same SHA1 information, it doesn't mean two files in two different batches will necessarily have the same piece hash information.

There are far too many variables to consider when dealing with common files across multi-file torrents to make this practical.

Link to comment
Share on other sites

There are far too many variables to consider when dealing with common files across multi-file torrents to make this practical.

just saw this thread right now.

It's funny, exactly what the threadopener is saying is what i am doing right now.

http://forum.utorrent.com/viewtopic.php?id=8273

Starting yesterday with this process. (right now only 1 of them active downloading but force rechecking and thereby "updating" the data for the other torrent)

all that has to be implemented is a fuction to see if data that is about to be requested has allready been written to the contentfile to prevent wasting of bandwith if the pieces are identical

Funny that no client writer has considered this before, But as i understand at least the new "get right" client is able to use pieces that he has loaded from other sources (here HTTP/FTP) also for the torrent swarm.

Link to comment
Share on other sites

Funny that no client writer has considered this before, But as i understand at least the new "get right" client is able to use pieces that he has loaded from other sources (here HTTP/FTP) also for the torrent swarm.

Shareaza has had HTTP since 2.1.0.0 (November 2004) and FTP since 2.2.0.0 (November 2005). It also has GNUTella, GNUTella2 and ED2K sources that can be bridged into the swarm.

GetRight with torrent support is still in beta. Shareaza's HTTP and FTP sources in torrent downloads have been in stable versions for months

Link to comment
Share on other sites

  • 4 months later...

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...