
Similarity-Enhanced Transfer (SET) - 70% faster torrents?


Strider3000


There is a new P2P download technology just announced called SET.

>>> " As a result, SET should greatly expand the available sources of any given file. In practice, it seemed to work pretty well. Using existing P2P networks, they were able to grab a 30MB movie trailer in only a third of the time, since their software was able to find other sources that shared about 50 percent similarity. The rate of an MP3 download shot up by over 70 percent. "

I would like to see uTorrent implement this. Apparently the code will be open and available at announcement time.

Read the article for more information:

http://arstechnica.com/news.ars/post/20070410-accelerated-p2p-by-similarity-searches.html


Despite appearances, this is more a search method to find additional sources rather than a "new" p2p system:

"Using existing P2P networks, they were able to grab a 30MB movie trailer in only a third of the time, since their software was able to find other sources that shared about 50 percent similarity."

There's a problem with such a system: it CAN'T scale indefinitely and counts on the computing power of all those sharing the possibly-similar files to help find matches. Too many people running too many searches at once would bog down already-overloaded search systems. Most regular searches on file-sharing networks only pass an overall hash of a file, not section-based hashes...even if they support section-based hashes. Otherwise, each search could require 4+ times as much data -- as sources parrot back what each file's individual hash sections are.
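To put rough numbers on the "parroting back hash sections" point, here's a quick back-of-the-envelope sketch in Python; the 30 MB file size is borrowed from the trailer example in the article, and the 256 KB piece size and SHA-1 hashes are just assumed typical values:

    # Rough comparison of the hash payload a source would have to return
    # for one file: a single whole-file hash vs. one hash per piece.
    # File size, piece size, and hash type are assumptions for illustration.
    FILE_SIZE  = 30 * 1024 * 1024   # 30 MB (the trailer from the article)
    PIECE_SIZE = 256 * 1024         # 256 KB pieces (assumed typical)
    HASH_SIZE  = 20                 # SHA-1 digest is 20 bytes

    pieces = -(-FILE_SIZE // PIECE_SIZE)        # ceiling division -> 120 pieces
    whole_file_bytes = HASH_SIZE                # 20 bytes
    per_piece_bytes  = pieces * HASH_SIZE       # 120 * 20 = 2400 bytes

    print(pieces, "pieces;", per_piece_bytes, "bytes of hashes instead of", whole_file_bytes)

Even before counting filenames and other metadata in the reply, the hash data alone is two orders of magnitude larger per source.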

A couple of researchers running this will probably do little harm. A refined NEW p2p system built from the ground up with these ideas deeply embedded in it may be possible, but even then it would probably have slightly heavier overheads for searches than the best current p2p networks.


When Shareaza, a multi-network client, says no to something like this, it's time to take a serious look at how realistic it is to expect it to work.

http://forums.shareaza.com/showthread.php?s=&threadid=53643

The Shareaza userbase equates the source-sharing and file-identification system used in SET with the hashing system in Kazaa's FastTrack network.

No.


There have been efforts on various file-sharing networks to ignore MP3 tags for the sake of finding more sources for files. That's all this is in a nutshell...and it's old tech too!

Like others have said, this would cause slightly flawed files to propagate even faster than before...as the unflawed original would not appear to have any more sources than the flawed one.

In a BitTorrent environment where searches for torrents tend to be separated from the torrent download process, this would probably be utterly worthless as the bandwidth needed to find additional partial sources in a timely manner would likely be greater than the speed gains.

It's a little like asking everyone if they know exactly what's on page 2 of their newspaper in the hopes it matches yours...without telling them in advance what newspaper you're talking about. Or even if you state what newspaper, you don't know exactly what day...and are asking a friend of a friend of a friend if they know. The information chain itself tends to require exponential bandwidth use for each additional hop made. (1 hop = asking a friend, 2 hops = asking a friend of a friend) That simply cannot scale indefinitely even if everyone had 1 gigabit/sec internet connections.
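The exponential growth is easy to see with a toy calculation (the fan-out of 5 friends per hop and the 100-byte query size are assumptions, not measured values):

    # Toy model of a flooded search: each peer forwards the question to
    # FANOUT friends, so the number of peers asked grows exponentially
    # with the hop count. Both constants are assumptions for illustration.
    FANOUT      = 5     # friends each peer forwards the query to (assumed)
    QUERY_BYTES = 100   # rough size of one search message (assumed)

    for hops in range(1, 6):
        peers_asked = FANOUT ** hops
        print(f"{hops} hop(s): {peers_asked:>5} peers asked, "
              f"~{peers_asked * QUERY_BYTES} bytes of query traffic")

Five hops is already over 3,000 peers asked per question, and that's before any of them answer.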


"With these parameters in place, they tested their system on a number of simulated networks, and compared it to BitTorrent clients on the same networks. In general, the speedups from SET were most dramatic when the network performance was worst. Speedup was minimal when any server was on a fast network, because a single server could saturate the clients. Put both servers and clients on slow, asymmetric DSL links, and SET provided a huge boost. In the real world, these inefficient conditions appear to predominate: the authors cite studies showing that over 60 percent of P2P downloads are never completed, and the median transfer time for a 100MB file on Kazaa was over a day."

It seems the authors of the paper on SET have already looked at the issue of the huge computational and/or bandwidth needs and have scaled their technology accordingly. You should read all of their paper. I would like to see a "beta" test of this technology before putting it in the graveyard.


I've also read the paper, and it does seem like a great idea. I would like to see it in uTorrent; at the very least, I'd like the developers here to try it out and release it for developers only. One idea I had is to apply SET only to files found via the DHT network rather than via trackers. Also, the paper explains that the overhead is very minimal; I believe they say it would increase by about 0.5%.

I am just sick of downloading files that are stuck at 90% available when that copy is the only one that can be found.


I've read the papers now, and I can add that even if significant amounts of bandwidth are saved, it comes only at the price of hashing/indexing any and all potentially matching files into pieces that are offset from the usual hashing schemes used by BitTorrent and Gnutella. As hashing is already a CPU-intensive process, having to do this on the fly at potentially "random" locations is not scalable. The locations are "random" because, although the search looks for matching starting patterns, the byte offset can fall anywhere in a large range and not land on the even (base-2) standard hash pieces of the file.

Even if the CPU hashing issues were non-existent, searches to find the similar file pieces would not be as trivial bandwidth-wise as suggested without existing network structures already catering to search-by-hash. While Gnutella in theory HAS this ability, it covers only hashes for WHOLE files, and it doesn't scale nicely. I've followed its development and was a long-time participant in BearShare's beta-testing and the Gnutella web developers' forum (at Yahoo); I still post occasionally at the latter. Gnutella v0.6 and beyond already has a very complex search network consisting of multiple layers (UltraPeer-to-leaf, TTH tables passed between UltraPeers to eliminate non-matches, Push routes, Alt-Locations, and DHT via UDP packets directly between sources and downloaders).

BitTorrent's hashing has a weakness: in multi-file torrents, a piece can span both the end of one file and the beginning of the next. Thus, for a similar torrent, if the first file is longer than in the original...EVERY file after that has ALL its hashed sections offset by the length change. So searching for duplicates that way is currently NOT supported by BitTorrent. Even finding single-episode torrents that match at least part of a larger compilation "season 1" torrent would require "intelligent design" or pure blind luck to a degree that's hopelessly impractical. The single-episode torrents would have to use exactly the same piece size as the compilation, otherwise the hash values returned would not match. EACH episode would have to be perfectly padded to end right on a piece boundary so no piece is shared by 2 episodes -- and who would want 0-256 KB of 0's padded onto the end of every file they download just so it exactly fills up even 256 KB intervals? These torrents would also need significant peers and seeds (on multiple trackers?) to see substantial improvements. And lastly, only those that are "aware" of the similar torrents would gain anything by it.
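A tiny Python sketch (with made-up file contents and an unrealistically small piece size) shows the effect: change the length of the first file and almost every later piece hash stops matching:

    import hashlib

    # Pieces are cut from the concatenation of all files in the torrent,
    # so a length change in the first file shifts every later boundary.
    # File contents and the 16-byte piece size are made up for the demo.
    PIECE = 16

    def piece_hashes(files, piece=PIECE):
        data = b"".join(files)   # BitTorrent hashes the concatenated payload
        return [hashlib.sha1(data[i:i + piece]).hexdigest()
                for i in range(0, len(data), piece)]

    episode    = b"E" * 40                                  # the shared file
    original   = piece_hashes([b"intro" * 4, episode])      # 20-byte first file
    re_release = piece_hashes([b"intro" * 5, episode])      # 25-byte first file

    matches = sum(a == b for a, b in zip(original, re_release))
    print(matches, "of", len(original), "piece hashes still match")  # -> 1 of 4

The shared episode is byte-for-byte identical in both torrents, yet only the piece before the length change still hashes the same.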

Giant compilation torrents (over 4 GB) or torrents which have stupidly-small piece size already have .torrent files exceeding 1 MB to contain all the piece hash values. But for Similarity-Enhanced Transfer (SET) style hashes to be really effective, they talked about looking at segments of 16 KB size and generating corresponding hashes for each. This would typically mean a 10-fold increase in .torrent file size (100KB -> 1MB .torrent files) and would VASTLY increase the bandwidth burden on trackers and websites which have .torrent file download links.
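The size claim is simple arithmetic: each SHA-1 piece hash is 20 bytes, so the hash list in the .torrent grows linearly with the piece count. Assuming a 4 GB payload purely for illustration:

    # Hash-list size in the .torrent for a 4 GB payload (assumed size)
    # at a few piece sizes: 20 bytes of SHA-1 per piece.
    PAYLOAD   = 4 * 1024**3
    HASH_SIZE = 20

    for piece_size in (2 * 1024**2, 256 * 1024, 16 * 1024):   # 2 MB, 256 KB, 16 KB
        pieces = PAYLOAD // piece_size
        print(f"{piece_size // 1024:>4} KB pieces -> {pieces:>6} hashes, "
              f"~{pieces * HASH_SIZE // 1024} KB of hash data")

Going from 256 KB pieces down to 16 KB segments is a 16-fold jump in hash data for the same payload, which is in the same ballpark as the 10-fold .torrent growth mentioned above.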

Were these values calculated on the fly instead of pre-computed and stored in the .torrent file, a "lightweight" client such as µTorrent would likely be impossible. The crosstalk of peers+seeds to match all those minute pieces would be increased by at least an order of magnitude over current BitTorrent traffic.

The best you can hope for is only a little better than Shareaza's scheme, which dropped the tags from MP3 and AVI files, since those are often changed AUTOMATICALLY by media players and even Windows itself! (Case in point: there have been a few people here complaining their files aren't perfectly downloaded...due to exactly this kind of "corruption".)
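For what it's worth, the "ignore the tags" trick amounts to hashing only the audio payload, so copies that differ merely in their ID3 tags still look identical. A minimal Python sketch of the idea (my own simplification, not Shareaza's actual code):

    import hashlib
    import struct

    def audio_only_sha1(path):
        """SHA-1 of an MP3 with ID3v1/ID3v2 tags stripped (simplified sketch)."""
        with open(path, "rb") as f:
            data = f.read()
        # Skip a leading ID3v2 tag: 10-byte header, then a 4-byte syncsafe size.
        if data[:3] == b"ID3" and len(data) > 10:
            b = struct.unpack(">4B", data[6:10])
            size = (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]
            data = data[10 + size:]
        # Drop a trailing ID3v1 tag: fixed 128 bytes starting with "TAG".
        if len(data) >= 128 and data[-128:-125] == b"TAG":
            data = data[:-128]
        return hashlib.sha1(data).hexdigest()

Two copies of the same rip that were merely retagged by a media player would then hash the same, and that is about all the "similarity" this particular trick buys you.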


Archived

This topic is now archived and is closed to further replies.
