
File piece caching via DHT?



Missing file pieces are often an obstacle to completing a torrent. Is it technically feasible to set up file piece caching via DHT? As long as someone is willing to cache those pieces (using the SHA-1 as the filename), could the DHT in BitTorrent be extended to support something like this?

I realize anyone participating would contribute bandwidth (up and down) to maintain a cache, downloading the least-available cached pieces to keep them available. Maybe it doesn't even have to be a BitTorrent extension; it could be a separate system, and BitTorrent would only need to query that system for missing file pieces after a threshold is exceeded.
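As a rough sketch of that threshold idea (all names here are hypothetical; `query_piece_cache` stands in for whatever lookup the separate system would expose, and `AVAILABILITY_THRESHOLD` is an illustrative knob, not a real client setting):

```python
import hashlib

# Hypothetical names for illustration only -- not part of any real client.
AVAILABILITY_THRESHOLD = 1  # fall back once fewer than this many peers have the piece

def fetch_missing_piece(piece_hash, swarm_availability, query_piece_cache):
    """Query an external piece cache only when the swarm can't supply the piece."""
    if swarm_availability >= AVAILABILITY_THRESHOLD:
        return None  # the swarm itself can still provide the piece
    data = query_piece_cache(piece_hash)  # e.g. a DHT get() or an HTTP GET
    if data is not None and hashlib.sha1(data).hexdigest() == piece_hash:
        return data  # only accept data that actually matches the requested hash
    return None
```

Verifying the returned bytes against the requested SHA-1 means a misbehaving cache can waste bandwidth but can't silently corrupt the download.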

This looks pretty interesting in terms of an open file system:

http://opendht.org/faq.html

http://sysnet.ucsd.edu/octopod/


The sheer VOLUME of data one needs... then again, didn't Archimedes say 'Give me a firm place to stand and a lever large enough and I will move the whole Earth'?

Petabytes of data went across the lines in the time it took me to type that quotation. :/ I don't know the technical hurdles, but if keeping the data intact, rather than chunking it, on personal users' computers were a requirement, I think it would speed adoption of any client supporting such an abstraction for "rare piece availability". :/


How can you retrieve file pieces by their hash unless you know something about the files they belong to?

Only if the pieces came from the same torrent, just shared by different people, would you find matching file pieces. A piece whose start is offset by even one byte within the same file will have a completely different hash. It would be inordinately bandwidth-intensive to map out all the possibilities.
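That offset point is easy to demonstrate: hashing the same underlying bytes at a one-byte offset yields an unrelated digest.

```python
import hashlib

# Two "pieces" taken from the same underlying file data, but offset by one
# byte, hash to completely different SHA-1 values.
file_data = bytes(range(256)) * 1024          # 256 KiB of sample data
piece_a = file_data[0:16384]                  # piece starting at offset 0
piece_b = file_data[1:16385]                  # same length, shifted by 1 byte

hash_a = hashlib.sha1(piece_a).hexdigest()
hash_b = hashlib.sha1(piece_b).hexdigest()
assert hash_a != hash_b  # different piece boundaries -> unrelated hashes
```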


> How can you retrieve file pieces by their hash unless you know something about the files they belong to?

I don't know what you're asking here.

> It would be inordinately bandwidth-intensive to map out all the possibilities.

Yes, there would be some trade-offs for caching. The benefit is that you could connect to any peer that has the piece you need. And since you would merely be caching, that might give you some legal protection, similar to ISPs.


The DHT is only really suitable for storing small amounts of information. Information contained in the DHT is carried by UDP packets, which are limited to roughly 1kB in practice (you really don't want to fragment UDP packets), it is duplicated 8 times (so that it doesn't disappear when a node leaves the network), and needs to be refreshed periodically.

So keeping pieces in the DHT, which are between 128kB and 1MB, is really not a good idea.
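A back-of-envelope sketch of that arithmetic, using the numbers above (~1 kB usable per unfragmented UDP packet, 8-way replication) and a mid-range 256 kB piece as an illustrative assumption:

```python
# Rough cost of storing one piece directly in the DHT, under the
# constraints described above.
PIECE_SIZE = 256 * 1024        # assumed mid-range piece size, in bytes
UDP_PAYLOAD = 1024             # ~1 kB usable per unfragmented UDP packet
REPLICATION = 8                # copies kept so data survives node churn

packets_per_copy = PIECE_SIZE // UDP_PAYLOAD     # 256 packets per copy
total_bytes = PIECE_SIZE * REPLICATION           # ~2 MiB pushed per refresh
print(packets_per_copy, total_bytes)
```

So one piece costs hundreds of packets per stored copy, times eight copies, repeated at every refresh interval — and that is before any lookups.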

-- Juliusz


I'm not sure, but I don't think hermanm is suggesting that the actual payload be sent via DHT/UDP. I think what he's suggesting is that the DHT track individual pieces, so that announces for particular piece hashes would cause the requester and the peers holding the corresponding data to connect and transfer the data like any other normal BitTorrent connection.

While the idea seems okay on the surface, I don't think the problems associated with it make it worth the trouble...

1. Keeping listings of individual piece hashes searchable by all DHT nodes would cause significant overhead. How many pieces would you say are in each torrent, on average? How many torrents would you expect each peer to have loaded, on average? I'd say at least somewhere in the tens of thousands of hashes for individual pieces alone for many users. Searching through that many for each DHT announce a node receives wouldn't be pretty.

2. SHA-1 isn't collision-free. Just because a peer has a piece with the same hash doesn't mean it actually is the same data. Yes, it's normally unlikely that this would ever occur, but one constraint that further mitigates it is the fact that piece hashes are currently scoped to swarms around particular infohashes, so chances are pretty good that if you're requesting data on a particular swarm, it'll be good. If you get rid of that constraint, it becomes that much easier for non-identical data to be sent. I don't have any hard statistics on this; it's just a bit of reasoning.

3. Related to point 2 is the fact that hit-and-run poisoning becomes somewhat easier with this idea. Think of some random guy responding to every single request for a particular hash, saying "yeah, I've got that" and then sending junk. He wouldn't have to join every single swarm to poison everyone. Yes, something similar can be done with DHT as it is, by simply replying with his IP as part of the announce results (effectively causing the node to join the announced swarms), but I'd imagine piece requests would be much more common than individual torrent announces, magnifying the effect.
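One mitigating note on point 3: if the requester verifies whatever it receives against the SHA-1 it asked for, junk responses cost bandwidth but can't actually enter the file (barring a collision, per point 2). A minimal sketch:

```python
import hashlib

def accept_piece(requested_sha1, received_data):
    """Accept a piece only if it hashes to the value we asked for.

    A poisoner answering every request with junk can waste our bandwidth,
    but this check stops the junk from being written into the file.
    """
    return hashlib.sha1(received_data).hexdigest() == requested_sha1
```

The poisoning cost therefore shifts from data corruption to wasted transfer time, which is still a real attack surface if piece requests are frequent.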


Kademlia is efficient in how many nodes it needs to contact, but that's not what I was referring to. I was referring to the number of hashes each node itself would need to search through to determine whether it has a potentially matching piece — outside the scope of Kademlia. At any rate, in retrospect, that point is relatively minor, given that there are efficient methods of searching data depending on how it is stored. I was just wondering how much overhead it would incur for clients to constantly keep track of all their piece hashes just for this functionality to work efficiently. (Edit: Oh wait, it would only work on completed pieces from torrent jobs that are started anyway, so it likely wouldn't be too expensive. That leaves the remaining cases I pointed out above.)
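For instance, a client could keep its completed-piece hashes in an ordinary hash table, making each incoming query a constant-time probe rather than a scan. A sketch, with hypothetical names:

```python
import hashlib

# A hypothetical in-memory index: piece SHA-1 -> (torrent name, piece number).
# Lookup is a single dict probe, so even tens of thousands of entries are
# cheap to query; the real cost is keeping the index current.
piece_index = {}

def register_piece(torrent_name, piece_number, piece_data):
    """Record a completed piece so hash queries can find it."""
    digest = hashlib.sha1(piece_data).hexdigest()
    piece_index[digest] = (torrent_name, piece_number)

def have_piece(sha1_hex):
    """Return (torrent, piece number) if we hold the piece, else None."""
    return piece_index.get(sha1_hex)
```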

Anyhow, a central router can't keep track of which nodes have which pieces in real-time, and it shouldn't anyway, given the fact that DHT is supposed to be decentralized in the first place.


To reduce load on the DHT network, I did suggest the request only occur after all other methods have been exhausted. An alternative, I suppose, could be to use Usenet: uuencode each file piece and put the SHA-1 in the subject line. *sigh*


Even if you only request data from the DHT as a last resort, seeders will need to announce piece data to the DHT for your scheme to be useful. Are you suggesting that every seeder should announce every piece it has?


For the record, on torrents I create, I personally like 3000-piece torrents. That's far more pieces than necessary for files under 1 GiB, but it tends to create "activity" while downloading those dual-layer DVD ISOs. I guess I could switch to 1000, for one piece per 1% update in the listview, but for whatever reason I liked 3000 at the time.

Multiply that by the estimated 4000 torrents I had last year when I created a snapshot of my bittorrenting, that's at least 120 MiB of data to push EVERY update??
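The back-of-envelope arithmetic, assuming 20 raw bytes per SHA-1 digest (an assumption — the "at least 120 MiB" figure above may have used a smaller per-hash size), comes out even larger:

```python
PIECES_PER_TORRENT = 3000
TORRENTS = 4000
SHA1_BYTES = 20                               # raw SHA-1 digest size

total_hashes = PIECES_PER_TORRENT * TORRENTS  # 12,000,000 piece hashes
total_bytes = total_hashes * SHA1_BYTES       # 240,000,000 bytes
total_mib = total_bytes / (1024 * 1024)       # ~229 MiB per full announce
print(total_hashes, round(total_mib, 1))
```

So a full piece-level announce for that library would push on the order of a couple of hundred MiB of hashes alone, consistent with the "at least 120 MiB" lower bound.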


You all bring up good points about scalability. I guess we're back to ISPs hosting torrent content caches. Or, if Amazon S3 were willing to donate cloud space, we could have µTorrent do lookups to s3.amazon.com/torrent-file-piece/4163DF14FD3FDBEEC49458075322F2E0778AEAB3 if the piece's SHA-1 were 4163DF14FD3FDBEEC49458075322F2E0778AEAB3.
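Sketching that lookup scheme (the host and path here are the poster's hypothetical example, not a real Amazon endpoint):

```python
import hashlib

# Hypothetical cache endpoint taken from the example above -- not a real
# Amazon service.
CACHE_BASE = "http://s3.amazon.com/torrent-file-piece/"

def cache_url(piece_sha1_hex):
    """Build the lookup URL for a piece, keyed by its SHA-1 hex digest."""
    return CACHE_BASE + piece_sha1_hex.upper()

def verify_piece(data, piece_sha1_hex):
    """Check that downloaded bytes really hash to the requested SHA-1."""
    return hashlib.sha1(data).hexdigest().lower() == piece_sha1_hex.lower()
```

A client would GET `cache_url(...)` as a last resort and run `verify_piece` on the response before writing anything to disk, so a stale or corrupted cache entry is simply discarded.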


Hehe, indeed, it's simpler to propose the ideas than to expect someone to help implement them. Even OpenDHT on PlanetLab hasn't had great uptime lately. :) But I do like the idea for those torrents stuck at 99% for want of a rare piece. Unfortunately, unless the torrent is old or there's an old peer with the data, it's unlikely the cache would have even one piece of the file.


> Even if you only request data from the DHT as a last resort, seeders will need to announce piece data to the DHT for your scheme to be useful.

I don't see why that has to be so; DHT doesn't have to be push-based. You ask for a hash, nodes search their lists to see whether they have it, or whether a node they're connected to has it, then respond with the results. The exact same idea can easily be applied here.
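A pull-based lookup along those lines might look like this minimal sketch, where `local_pieces` and the neighbour lookup callbacks are hypothetical stand-ins for a node's own piece list and its known peers:

```python
# Pull-based handling: nothing is announced ahead of time; a node only
# consults its local piece list when a query for a hash arrives.

def handle_piece_query(piece_sha1, local_pieces, neighbour_lookups):
    """Return a list of peers believed to hold `piece_sha1`."""
    results = []
    if piece_sha1 in local_pieces:
        results.append("self")           # we can serve the piece directly
    for lookup in neighbour_lookups:     # forward the query to known nodes
        results.extend(lookup(piece_sha1))
    return results
```

Since seeders never pre-announce, the bandwidth cost moves from constant announces to per-query fan-out, which only matters for the rare pieces people actually ask about.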


Ugh. Thank you, Ultima. Indeed, push- vs. pull-based slipped my mind at the time. I was simply astounded at the amount of data I had amassed.

I love this idea, but it all goes back to rarity as I see it. If you've got rare data to begin with, and you're using something additional as a lookup for that data, you're more likely to be struck by lightning than to find what you were looking for in the first place.


Archived

This topic is now archived and is closed to further replies.
