
v2.x: large PPS (Packets/sec) hurting routers/ISPs- how to decrease it


rafi


Many have observed a larger number of packets per second (PPS) in the 2.x line. The main reasons were:

A- the addition of the uTP protocol (alongside the existing TCP)

B- an increased bt.connect_rate default

C- a much higher half_open default (Win7/Vista)

D- the use of smaller packets (and more of them) for payload traffic

A/D are payload-traffic related

B/C are connection related

All of this causes more packets and a higher PPS, thus:

1- inefficient traffic/data transfer (-> high overhead)

2- failures of older network equipment (SOHO routers/ISPs)

3- more Internet traffic in general

The latest 2.03 beta release seems to improve on issues A/D and 1, which were repeatedly raised in this forum, by increasing the payload packet sizes. Even after this, PPS is still much higher than in pre-2.x releases.

I suggest further decreasing PPS by targeting the connection-related PPS issues (B/C).

uT's increased PPS is also due to:

a- Using concurrent uTP + TCP connection attempts (doubled, x2)

b- Retrying uTP connections - 3 times to unresponsive peers in the list (and most of them are unresponsive, for many reasons). This also triples the effective connect_rate (x3)!

c- Effectively eliminating the '8' max_halfopen limit (= concurrent TCP connection attempts). This makes the number of concurrent attempts theoretically unlimited at times (in steady state it is limited by the max number of peers).
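As a back-of-the-envelope illustration of how the a/b factors multiply, here is a small sketch. The rates and attempt counts are illustrative assumptions taken from the points above, not measured uT values:

```python
# Rough model of connection-attempt packet rate; all numbers are illustrative.

def attempt_pps(connect_rate, tries_per_protocol):
    """Outgoing connection-attempt packets per second.

    connect_rate: new peers attempted per second
    tries_per_protocol: attempt count for each protocol used
    """
    return connect_rate * sum(tries_per_protocol)

# Pre-2.x style: TCP only, counting just the initial attempt
print(attempt_pps(10, [1]))     # -> 10 attempt packets/s

# 2.x per points a/b: concurrent TCP + uTP, with 3 uTP tries
print(attempt_pps(10, [1, 3]))  # -> 40 attempt packets/s
```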

My suggestions for lowering PPS are:

1) For improving "b" (useless uTP retries) - I suggest not retrying uTP connections at all. The main "pro" reasons:

- most peers do not respond anyway (no uTP support, or some internal limit reached)

- the concurrent TCP connection request is a kind of retry/backup in itself

- the next peer in the list is connected to anyway, and this is a kind of retry too

- the non-responsive peer will be retried the next time the peer list is cycled

cons:

- missing some reachable peers due to comm. errors.

2) For improving "c" (huge max_halfopen) - I suggest imposing a lower "max_halfopen" 'window' on all Windows OSes again. A "max_halfopen" of "32" will be good enough for non-XP systems.

Since we have eliminated holdups/retries for uTP, I also suggest adding a similar 'half_open-like window' restraint for uTP as well. A "32" will be good here too.

Altogether we would then have a default of 64 concurrent connection attempts for uTP + TCP (40 for XP).

Pros:

- avoid bombarding network equipment with hundreds of packets

cons:

- slowing connection-generation rate a bit

3) For improving B (high connect_rate) - reduce connect_rate from 10 to 5 connections/sec. My experience with this (on XP) is good.

pros:

- directly decrease PPS (180 connections a minute is good enough)

cons:

- slowing down connection rate

My estimate for the expected decrease (with connection-generation PPS as 100%):

-> #1) no retries: -65% of the uTP retry rate => ~-16 uTP PPS (-??? TCP)

-> #3) lower connect_rate => ~-10 PPS (TCP + uTP)

-> #2) lower TCP max_halfopen, plus a new one for uTP => limits max concurrent attempts to <64 on non-XP / <40 on XP
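The rough arithmetic behind these estimates can be sketched as follows, assuming the default connect_rate of 10/s and 3 uTP tries (the ~-16 figure above suggests not every retry lands in the same second, so treat these as upper bounds):

```python
# Illustrative arithmetic only; connect_rate and try counts are assumptions.
connect_rate = 10     # default attempts/s
utp_tries = 3

# #1) dropping the 2 uTP retries removes 2/3 (~66%) of uTP connect packets
no_retry_saving = connect_rate * (utp_tries - 1)   # ~20 packets/s upper bound

# #3) halving connect_rate to 5/s, applied to both TCP and uTP attempts
lower_rate_saving = (connect_rate - 5) * 2         # ~10 packets/s

print(no_retry_saving, lower_rate_saving)
```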

I hope that all of the above will be added to the final 2.03 release, so as to REALLY make a change for better efficiency and PPS.

PPS (4, 16, 32) vs connect_speed (x1, x5, x10)

(incoming traffic firewalled; uTP only)

[Screenshots: pps4conspeed1.png, pps16conspeed5.png, pps32conspeed10.png]


The latest 2.03 beta release seems to improve on issues A/D and 1, which were repeatedly raised in this forum, by increasing the payload packet sizes. Even after this, PPS is still much higher than in pre-2.x releases.

How high is PPS compared to previous uT releases? Is this amount still too high for some crappy routers?

uT increased PPS is also due to:

a- Using concurrent uTP + TCP connection attempts (doubled, x2)

Ok. Let's get everyone to use uTP and then we can switch off TCP attempts.

b- Retrying uTP connections - 3 times to unresponsive peers in the list (and most of them are unresponsive, for many reasons). This also triples the effective connect_rate (x3)!

Uh, the behavior here is no different from TCP.

c- Effectively eliminating the '8' max_halfopen limit (= concurrent TCP connection attempts). This makes the number of concurrent attempts theoretically unlimited at times (in steady state it is limited by the max number of peers).

Again, newer versions (and for that matter, older versions) of Windows do not have a max halfopen limit anymore. So, this is no different from TCP. Also, since connection attempts do not consume a significant amount of bandwidth, I have a hard time believing PPS is significantly higher as a result of concurrent connection attempts.

1) For improving "b" (useless uTP retries) - I suggest not retrying uTP connections at all. The main "pro" reasons:

- most peers do not respond anyway (no uTP support, or some internal limit reached)

Haha.

A) I believe most peers support uTP.

B) Should we not retry TCP connections either?

- the next peer in the list is connected to anyway, and this is a kind of retry too

- the non-responsive peer will be retried the next time the peer list is cycled

There is no peer list "cycle". Peers are tried in random order if we have not retried them too recently. If uT is retrying a peer, there is no one to move on to.
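Greg's description of the selection logic could be sketched roughly like this. The cooldown value and all names are hypothetical, not uT's actual implementation:

```python
import random

RETRY_COOLDOWN = 60.0  # seconds; hypothetical value

def pick_peer(peers, last_tried, now):
    """Pick a random peer that has not been tried too recently.

    Returns None when every known peer is still in cooldown --
    in that case a retry is the only thing left to do.
    """
    eligible = [p for p in peers
                if now - last_tried.get(p, float("-inf")) >= RETRY_COOLDOWN]
    if not eligible:
        return None
    peer = random.choice(eligible)
    last_tried[peer] = now
    return peer
```

With a short peer list every peer quickly lands in cooldown and pick_peer returns None, which matches "if uT is retrying a peer, there is no one to move on to".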

2) For improving "c" (huge max_halfopen) - I suggest imposing a lower "max_halfopen" 'window' on all Windows OSes again. A "max_halfopen" of "32" will be good enough for non-XP systems.

Since we have eliminated holdups/retries for uTP, I also suggest adding a similar 'half_open-like window' restraint for uTP as well. A "32" will be good here too.

Altogether we would then have a default of 64 concurrent connection attempts for uTP + TCP (40 for XP).

Pros:

- avoid bombarding network equipment with hundreds of packets

cons:

- slowing connection-generation rate a bit

You still haven't shown that this "bombarding" is harmful. Why arbitrarily slow down the connection rate on a hunch?

There's no point in optimizing if the problem is already fixed.


Firon:

Computers had no problem with opening a crapton of connections before XP SP2. And they sure don't after Vista removed the stupid halfopen limits.

I agree. But the issue is (I guess) "crappy network equipment" and possibly redundant connection traffic, not "crappy PCs"... :P

I'll do my best below to answer Greg Hazel's comments.

I also have to emphasize that I have no issue with PPS on my own PC/router. My ISP, though, seems to think otherwise regarding its equipment.

Greg Hazel: How high is PPS compared to previous uT releases?

The topic is about connection-related PPS. So, if I tell you that it's ~500+% higher than 1.8.5 during speed buildup/connection generation (when the number of connections is below the max number of peers to connect), would you believe me?

So, I suggest that you do your own homework and measure it. It's not that difficult. Even for you guys... :P

Anyways, keeping PPS as low as possible is best, regardless.

My theory:

+100% - adding outgoing uTP as a default

+50-100% - the uTP retries

+300% - incrementing max_halfopen from 8 to 400 (actually: more packets, for a longer duration)

Is this amount still too high for some crappy routers?

Dunno, myself. Other people/mods/ISPs in this forum seem to think that it is. You try it out and tell me what an acceptable number is, and I can then tell you whether it's high or not.

I'm sure that you are much better equipped/more capable than me at measuring data on the various OSes (I have only one).

[a.concurrent uTP/TCP connects:]

OK. Let's get everyone to use uTP and then we can switch off TCP attempts.

I agree. But until this utopia is realized (in 3-5 years?), let's deal with the current reality instead.

Maybe improve on this x2/+100% factor (which is increased much further by the retries) by sending TCP requests only after uTP requests time out, or vice versa?

[b.uTP connect retries]

Uh, the behavior here is no different from TCP.

... B) Should we not retry TCP connections either?

OK, then I was wrong; this multiplies the related PPS x6 relative to the connect_rate (I was measuring uTP only, since it's new/added in v2.x).

If you measure statistics on the success rate of the retries (the 2nd/3rd ones), then we can logically deduce whether this contributes more connections, or just more PPS ;) . For now, I take the figure of 90% un-connectable peers as the criterion, and I deduce that we had better save our breath (PPS...) here. And yes, maybe don't retry either protocol...

Not hammering peers would be a good result too. Alternatively, a single retry would also improve PPS.
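To make the "save our breath" argument concrete, here is a crude model. All figures are illustrative assumptions: a fixed fraction of peers is reachable at all, and an attempt to a reachable peer is only lost occasionally, so retries mostly hit peers that will never answer:

```python
def expected_connects(reachable, loss, tries):
    """Expected successful connects per peer attempted.

    reachable: fraction of peers that can be connected to at all
    loss: probability a single attempt to a reachable peer is lost
    tries: total attempts (1 initial + retries)
    """
    return reachable * (1 - loss ** tries)

# Assumed figures: 10% of listed peers connectable (per the 90% figure
# above), 2% chance an attempt to a reachable peer is lost.
gain = expected_connects(0.10, 0.02, 3) - expected_connects(0.10, 0.02, 1)
print(round(gain, 4))  # -> 0.002 extra connects per peer, for 2 extra packets each
```

Under these assumptions the retries buy roughly two extra connections per thousand peers tried, at the cost of two extra packets to nearly every peer.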

Haha.

A) I believe most peers support uTP.

Great. So we'll need fewer TCP connects if they're done sequentially after uTP... (I was going to say "not for long", at first... ;) )

Again, newer versions (and for that matter, older versions) of Windows do not have a max halfopen limit anymore. So, this is no different from TCP. Also, since connection attempts do not consume a significant amount of bandwidth, I have a hard time believing PPS is significantly higher as a result of concurrent connection attempts.

1. Again, the topic is about connection-related PPS, not bandwidth (and I also think it does NOT consume much bandwidth).

2. You don't need to believe anything. Just test it. You tell me if x2 the connect speed (+100%) is significant or not (unless you divide connect_speed by 2 for the concurrent TCP + uTP attempts).

3. halfopen affects the number of packets/PPS differently - it does not affect the PPS directly, but rather the duration of the packet burst.

4. A hugely increased max_halfopen makes the connection attempts continue for a much longer period of time (theoretically, 400/10 = 40 seconds minimum).
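Point 4 is simple arithmetic: at connect_rate attempts per second, it takes max_halfopen / connect_rate seconds just to fill the half-open window, so the attempt burst lasts at least that long:

```python
def min_burst_seconds(max_halfopen, connect_rate):
    """Minimum duration of the connection-attempt burst: the time
    needed to fill the half-open window at connect_rate attempts/s."""
    return max_halfopen / connect_rate

print(min_burst_seconds(400, 10))  # -> 40.0 s (the '400' default discussed above)
print(min_burst_seconds(8, 10))    # -> 0.8 s (the old '8' limit)
```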

[cycling peers' list]

There is no peer list "cycle". Peers are tried in random order if we have not retried them too recently.

I admit I don't know the exact details here (I assumed it was implemented this way). Random is good (equal rights for peers?...). I would try to exhaust the whole list first before retrying an unresponsive peer, to increase the chance of success.

If uT is retrying a peer, there is no one to move on to.

You had better double-check the code then, since I've noticed retries starting right at startup, even with a 10K+ peer list. My impression is that they are triggered simply by a connect timeout. If it should be as you say (and we can assume bugs exist in utopia too), just make it so, and the result will be similar to what I referred to as "cycling" the peer list. Problem solved :)

Outgoing connections, on a 10K+ swarm, all incoming traffic firewalled; TCP-only and uTP-only runs:

http://img30.imageshack.us/img30/4922/tcpretryx336s.png

http://img28.imageshack.us/img28/9035/utpretryx4246s.png

You still haven't shown that this "bombarding" is harmful. Why arbitrarily slow down the connection rate on a hunch?

Me? I have to show it? I thought you were responsible for development, testing and support issues... Just look around the forum. Don't you remember? The ISPs' bitching threads? The multiple other users with what you define as "crappy" equipment? Give me a break, please. My router is fine, thank you... ;)

There's no point in optimizing if the problem is already fixed.

If you mean fixing the payload-related PPS (which, after 6 months, does look promising... :) ) - again, that is not the issue in this thread. I'm not aware of any change made in 2.03/x to improve the connection-related PPS (at least it's not in the changelogs).

The increased PPS from small payload packets and small connection-related packets (and maybe the uTP connect scheme) has done much harm over the past 6 months (PTs, ISPs). Payload traffic/packets are hopefully improved now (and I hope overhead will also be reduced). What I'm saying is: try to improve 2.03's connection logic/rate as well, and it may also help with quickly finding good peers.

And there is always a point in optimizing away Internet traffic that is not actually needed.


Is this amount still too high for some crappy routers?

Dunno, myself. Other people/mods/ISPs in this forum seem to think that it is. You try it out and tell me what an acceptable number is, and I can then tell you whether it's high or not.

I don't believe ISPs have given feedback on 2.0.3 in the wild. Certainly they do not monitor which versions of uT are in use, so they could not be giving feedback on the latest version.

So, if you're just speculating about PPS due to connection attempts and its impact, I would say this is premature optimization, and not anything we're willing to change until we see a need for it.


Yes, we assume it is fixed. I said so myself in the original post. Do we also *assume* the payload causes it (with crappy uploads, routers, and Russian ISPs alike)? Or was it tested and confirmed by you? Did you forget the famous PTs thread?

I think it's up to you to fix/optimize/tune the issues that you see as invalid (like the retries you say you planned differently than they were implemented), and thus avoid further complaints from PTs and ISPs on the various other issues mentioned above.


Yes, we assume it is fixed. I said so myself in the original post. Do we also *assume* the payload causes it (with crappy uploads, routers, and Russian ISPs alike)? Or was it tested and confirmed by you?

I did not write the fix, Arvid did. However it would result in a 4.8x reduction in PPS in most cases, so I believe that is sufficient.

I think it's up to you to fix/optimize/tune the issues that you see as invalid (like the retries you say you planned differently than they were implemented), and thus avoid further complaints from PTs and ISPs on the various other issues mentioned above.

There's no point in fixing things which are not broken.


You might be correct, Greg (and I hope you are), but the really relevant data needed to know this is the answer to your previous very good question:

Greg Hazel:

How high is PPS compared to previous uT releases ? Is this amount still too high for some crappy routers?

Though I've made some tests (as mentioned above), I'd rather you do them on Win7/Vista.

The 3 numbers needed are:

1) How much bigger is the 2.x PPS vs. the 1.8x PPS during download/seeding (and I hope/guess the answer will be 1:1, or even less)

2) How much bigger is the 2.x PPS vs. the 1.8x PPS during connecting (TCP only before, vs. TCP + uTP now, with the rest of the related changes)

3) What is the reference PPS that makes routers suffer

I think that by now you might only have the answer to #1 (if you meant it in reference to 1.8 and not 2.02). Can you provide/review the data for #2/3 as well?


Greg Hazel:

How high is PPS compared to previous uT releases ? Is this amount still too high for some crappy routers?

The 3 numbers needed are:

1) How much bigger is the 2.x PPS vs. the 1.8x PPS during download/seeding (and I hope/guess the answer will be 1:1, or even less)

2) How much bigger is the 2.x PPS vs. the 1.8x PPS during connecting (TCP only before, vs. TCP + uTP now, with the rest of the related changes)

3) What is the reference PPS that makes routers suffer

I disagree. 1 is relevant, 2 is not relevant. 3 would be good to know but is not something we can measure.

I think that by now you might only have the answer to #1 (if you meant it in reference to 1.8 and not 2.02). Can you provide/review the data for #2/3 as well?

Sorry, the burden of proof is on you if you would like to report a problem. If an ISP reports an issue, with data, we can look at it.


I agree. Sorry, I myself don't have Win7, and I also didn't report the problem. Others did, and I just tried to analyze it (unlike with the other bug, the payload overhead).

You can sit and wait for the world to run 2.03 and see what happens (and I am very curious...). Or try to anticipate, and try to keep the data at the proven old values/levels (1.8x-like). Up to you... :)


The old levels are not enough data. We also need data from the ISPs. If we go around assuming that the old levels were the absolute maximum, it will be difficult to do anything more in the future. And if they were at the maximum, who's to say they weren't too high already? Optimizing without data from the ISPs is a great way to spend a lot of time working on potentially the wrong thing.


who's to say they weren't too high already?

Meh, that's being overcautious. I'd say years of 1.8x experience (with the crappy routers of the time) should be good enough.

There is no guarantee for anything, and who else, if not BT Inc., is more qualified to contact ISPs (right, it's best to have data from them)? It is just reasonable to assume that the closer you get to the previous values, the better (maybe even 4-5x those, as I suggested).

Some users will always benefit; others might not. You cannot be perfect (unlike with overhead... :P). And some of the corrective actions - changing some defaults or canceling retries - could take you just 5 minutes. It shouldn't take 7 months as the overhead bug did... ;)


who else, if not BT Inc., is more qualified to contact ISPs (right, it's best to have data from them)?

Great, let me know some ISPs and contact info and I'll shoot them a message.

It is just reasonable to assume that the closer you get to the previous values, the better (maybe even 4-5x those, as I suggested).

We know that the previous values are at least as functional as they were, but we do not know if those systems were still PPS-limited and would have achieved higher throughput with lower PPS. Perhaps higher PPS will even encourage these ISPs to upgrade, improving service for their customers. Getting back to the previous level is much harder now that we're doing more.

Some users will always benefit; others might not. You cannot be perfect (unlike with overhead... :P). And some of the corrective actions - changing some defaults or canceling retries - could take you just 5 minutes. It shouldn't take 7 months as the overhead bug did... ;)

It's 5 minutes to make some blind changes you think might help someone somehow without any real proof, potentially causing more, harder-to-diagnose issues. It's 7 months to find out what's really going on and fix things the right way.

I'll wait.


Would it be possible for you to just spend those 5 minutes running both 1.8.5 and 2.0.3 on Win7, default settings, downloading a well-seeded torrent (~10K seeds) until your max connection count is reached? WireShark it, capture the traffic (.cap), and copy the Statistics -> I/O PPS graph for both to here? Connection logs would be nice as well. If you can, this would be great.
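For reference, the PPS graph boils down to bucketing packet timestamps into 1-second bins; a minimal sketch of that computation (run, for example, over timestamps exported from the .cap) might look like:

```python
from collections import Counter

def pps_histogram(timestamps):
    """Count packets per 1-second bin, like WireShark's
    Statistics -> I/O graph with a 1 s tick interval."""
    return Counter(int(ts) for ts in timestamps)

# Illustrative timestamps (seconds since capture start)
ts = [0.1, 0.2, 0.9, 1.5, 2.0, 2.3, 2.7]
print(sorted(pps_histogram(ts).items()))  # -> [(0, 3), (1, 1), (2, 3)]
```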


Just a comment on:
Greg Hazel wrote:

If uT is retrying a peer, there is no one to move on to.

I can only suggest (again) that you double-check and confirm that. WireSharking it shows otherwise.

Wireshark doesn't show how many peers are in the internal list. There's nothing to check. This system works as designed.


True, WireShark does not show the peer list, only the retries. But with a ~10K peer list (viewable with the "copy peers list" function, or in a debugger if you check), and retries observed after only a few seconds of download time, there are retries to the same peer/IP when there shouldn't be any.

Whether it is "as designed" or not - that's another story. It's just not as you've explained it, or at least not as I understood you.

You know, from recent experience with a similar "by design" assumption related to the design of overhead/packet sizes - if I did understand your design (as quoted), this is either a possible bug, or your memory of the actual implementation/design is not that good... ;)

Have a great week! And I am pretty sure Arvid's upcoming 2.03 release will be a big improvement regardless of this issue.



Archived

This topic is now archived and is closed to further replies.
