
rafi

is uTP less efficient than TCP? The implementation should be improved!


Here are a few observations related to uTP performance/efficiency as reflected in the latest v2.01 beta build. Please feel free to comment.

Since the latest beta build is supposed to be the final pre-RC beta, here are some observations/results from repeating tests I ran on previous beta releases. Since some connectivity-related issues were fixed (hopefully…), I expected the test results to improve.

The focus here is not on absolute speed, but on the efficiency/overhead compared to TCP that was pointed out in those earlier tests of the previous beta versions.

An additional test shows the difference in latency/congestion when downloading/uploading at full line capacity using the two protocols.

The bottom line:

There is still high overhead in uTP compared to TCP (20-30% overhead!). Redundant re-connect messages increase the PPS and the packet count even beyond the +55% uTP packets observed in the previous tests mentioned above. No significant difference from TCP was observed when "pinging" during download or upload with uTP.

As it stands, there seems to be no reason to prefer uTP over TCP. Quite the opposite.

Test conditions:

- XP/SP2

- 2-3 active torrents

- 2.5M/250K ADSL connection

- 22K Upload speed limit, no download speed limit

- using the new uTP header & net.uTP_dynamic_packet_size = false

Overhead numbers are accumulated for both upload and download traffic.

Test descriptions:

1. Download test: 2-3 torrents downloading

2. Upload/seeding test: 3 torrents seeding

3. Two instances of uTorrent over loopback (with 2 "peers"/connections)

4. Same as 3, but the seeder instance set to TCP only and the peer set to uTP+TCP

5. Running a ping of 1024 bytes during download and upload at full bandwidth (300K download, 32K upload)

Tests 1-3 were run in two "modes":

a. uTP (bt.transp_disposition=26, using new uTP header in 2.0+ clients)

b. TCP (bt.transp_disposition=5)

Test results/screenshots:

1. Download test

a. Typical speeds and overhead (uTP):

D: 151.6K O: 24.8K | U: 7.7K O: 12.9K Overhead: 37.7/158.7 =~ 23% overhead

b. Typical speeds and overhead (TCP):

D: 150.7K O: 5.2K | U: 12.0K O: 7.3K Overhead: 12.5/158.7 =~ 8% overhead

http://img708.imageshack.us/img708/8196/dlutptcp.png


2. Upload/seeding test

a. Typical speeds and overhead (uTP):

D: 0.6K O: 0.9K | U: 12K O: 3.2K Overhead: 4.1/12.6 =~ 32%

b. Typical speeds and overhead (TCP):

D: 0.3K O: 1.2K | U: 18.6K O: 0.7K Overhead: 1.9/18.9 =~ 10%

http://img686.imageshack.us/img686/7474/ulutptcp.png


3. LAN/Local loop-back test

a. uTP overhead (UL-overhead/DL-speed) =~ 20%

b. TCP overhead (UL-overhead/DL-speed) =~ 5%

http://img405.imageshack.us/img405/6084/tcputpolgraph.png


4. Monitoring for extra messages/packets

TCP clients (uT and non-uT) seem to receive a connect request message every minute from every uTP-enabled uTorrent peer.

If we extrapolate to a swarm of 600 uT peers, this can reach up to 10 messages/sec, with some active TCP connections disconnected and replaced by uTP connections (if the maximum allowed connections permits).

http://img340.imageshack.us/img340/3625/utpnotallowed.png


5. Pinging during Download or Upload/seeding with uTP/TCP

We'll try to see here whether uTP has less effect on latency than TCP. At the same time, we'll observe the overhead.

Test conditions:

- Ping size: 1024 bytes

- 3 ping destinations: a close one (~40 ms latency) and 2 far ones (~150 ms latency)

- Upload limit: 32 KBps (connection max. 250 Kbps) & unlimited

- Download limit: none (connection max. 2.5 Mbps)

- TCP only & uTP only "modes"

Results: both uTP and TCP increase ping by the same ~40 ms

Overheads: with uTP: 25%/35% during seeding (!); with TCP: 8%

Download | Upload/seeding


http://img202.imageshack.us/img202/9581/dlutptcpping.png http://img714.imageshack.us/img714/6326/ulutptcp1ping.png

Unlimited upload/seeding


http://img708.imageshack.us/img708/1483/ulutptcp3pingz.png

Conclusions and final thoughts

1. There is no real change/improvement in performance compared to the previous beta releases.

2. uTP overhead is significantly larger than TCP's, and even more so on this low-upload-bandwidth connection.

3. One reason is (as shown before) uTP using more packets (at a higher PPS) with a smaller average size.

Another reason is the use of dynamic packet sizes to implement speed limiters.

4. Extra messages, like redundant connect requests, increase the PPS/overhead even more

when thousands of peers participate in a popular torrent.

5. As suggested in previous release tests, under some conditions (peers having larger upload bandwidth),

changing the defaults of net.uTP_dynamic_packet_size to FALSE and net.utp_initial_packet_size to 8

might increase packet size and reduce the overhead a bit.

6. Previous testing also showed uTP using 55% more packets than TCP: http://forum.utorrent.com/viewtopic.php?id=69592

7. I have not yet seen a demonstration of congestion being reduced under these test conditions.

Also, the ping tests seem to produce similar results for both TCP and uTP.

So, there is no reason to prefer uTP over TCP until their performance, at the very least, matches.


Improving congestion by increasing seeding-packet sizes when the upload speed is limited

Here is a suggestion for implementing seeding/upload speed control with uTP, so that it can be more "congestion friendly" and achieve the smallest possible PPS/overhead.

In the current implementation, I observed small packets being used mainly during speed control/limiting. Also, very large 'packets' of 4-16K seem to be built and handed to the OS stack, to be further split by the OS into MTU-sized packets. Both of these techniques may contribute to increased overhead and decreased pacing between packets, which does not help minimize congestion.

I'll use the following principles for my suggestion:

- Make use of maximum packet length whenever possible

- Pace packets as much as possible (insert delays between them)

- Use only the Windows tick resolution (~20 ms), which is claimed to be a system limitation

- Assume there is a fast connection from the PC to the local router/line-modem (>=100Mbps)

This will very likely come at the cost of speed accuracy: the actual rate might not obey the set speed limit so precisely. I don't see that as a big disadvantage.

Let's take this use case as an example:

- A 2.5M/250K ADSL line

- A ~75% UL speed limit of ~25KBps

- Aiming for a packet size similar to the MTU: ~1.5 KB

- The out-of-the-NIC speed/latency X is negligible at this line rate: less than 1 ms/packet (@100 Mbps)

(a) For this, the required PPS is 25K/1.5K = 16.67, and the pacing interval (PI) is 1/16.67 ≈ 60 ms (= 60/20 = 3.0 ticks)

Let's show this for several other speed limits we can set:

(b) Limit: 24K, PPS: 16, PI: 63 ms (= 3.15 ticks)

(c) Limit: 23K, PPS: 15.33, PI: 65 ms (= 3.26 ticks)

Limit: 22K, PPS: 14.66, PI: 68 ms (= 3.4 ticks)

Limit: 21K, PPS: 14, PI: 71 ms (= 3.57 ticks)

Limit: 20K, PPS: 13.33, PI: 75 ms (= 3.75 ticks)

Limit: 18K, PPS: 12, PI: 83 ms (= 4.15 ticks)
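The PPS and pacing-interval figures above follow directly from the speed limit, the packet size, and the assumed 20 ms tick. A small Python sketch (my own illustration, not client code) that reproduces the table:

```python
# Sketch only: reproduce the PPS / pacing-interval (PI) table above for
# ~MTU-sized packets and the assumed ~20 ms Windows tick.

PACKET_SIZE_KB = 1.5   # target payload per packet (~MTU)
TICK_MS = 20           # assumed OS timer granularity

def pacing(limit_kbps):
    """Return (packets/sec, interval in ms, interval in ticks) for a KB/s limit."""
    pps = limit_kbps / PACKET_SIZE_KB
    interval_ms = 1000.0 / pps
    return pps, interval_ms, interval_ms / TICK_MS

for limit in (25, 24, 23, 22, 21, 20, 18):
    pps, ms, ticks = pacing(limit)
    print(f"Limit {limit}K: PPS {pps:.2f}, PI {ms:.1f} ms ({ticks:.2f} ticks)")
```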

Since we would like 1K granularity, we'll need a way to get ~0.1-0.2 tick resolution. This is the most difficult case to handle; we will not need this 1K granularity at higher line rates.

Here is how I propose to get this granularity:

1. Let's define a virtual cycle that includes Y packets. Say Y = 20 packets.

2. For a +0.1 tick (+2 ms) increase, we'll pace/delay 2 of those 20 packets in the 'cycle' one tick longer than the rest.

3. This will repeat itself every virtual cycle.

In our use cases (a)-(c):

(a) All packets in the cycle are paced at 3-tick intervals

(b) To get an average of 3+0.15 ticks, we'll generate a 3:17 duty cycle: 17 packets are paced @3 ticks and 3 @4 ticks

(c) To get an average of 3+0.26 ticks, we'll generate a 5:15 duty cycle: 15 packets are paced @3 ticks and 5 @4 ticks

And so on…
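A minimal sketch of this duty-cycle idea (my own reading of the steps above, with names I made up): split each virtual cycle of 20 packets into a group paced at the base tick count and a group delayed one extra tick, so the average interval hits the fractional target.

```python
import math

def duty_cycle(target_ticks, cycle_len=20):
    """Split cycle_len packets into (n @ base ticks, m @ base+1 ticks)
    so the average pacing interval approximates target_ticks."""
    base = math.floor(target_ticks)
    extra = round((target_ticks - base) * cycle_len)  # packets delayed +1 tick
    return cycle_len - extra, base, extra, base + 1

# Use case (b): ~3.15 ticks -> 17 packets @3 ticks, 3 packets @4 ticks
n_fast, fast_t, n_slow, slow_t = duty_cycle(3.15)
avg = (n_fast * fast_t + n_slow * slow_t) / 20  # (17*3 + 3*4)/20 = 3.15
```

For case (a) the split degenerates to all packets at 3 ticks; for case (c) it gives 15 @3 and 5 @4 ticks, an average of 3.25 ticks, which is a close approximation of the 3.26-tick target.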

Note that at a higher line rate, X (the out-of-the-NIC speed) can be more significant, and should be taken into account in the duty-cycle equation.

We have achieved here:

1. A more efficient data transfer: fewer packets (= fewer ACKs) at larger sizes

2. Pacing and delays that can improve congestion: other applications can 'squeeze in' more of their own packets

> In the current implementation, I observed small packets being used mainly during speed control/limiting. Also, very large 'packets' of 4-16K seem to be built and handed to the OS stack, to be further split by the OS into MTU-sized packets. Both of these techniques may contribute to increased overhead and decreased pacing between packets, which does not help minimize congestion.

Packets larger than the MTU would be a bug. Please post a Wireshark capture where this can be seen.

Small packets do help minimize congestion. Please show the measurements you thought showed otherwise.

> - Pace packets as much as possible (insert delays between them)

We already do this

> - Use only the Windows tick resolution (~20 ms), which is claimed to be a system limitation

This is not true. Windows has microsecond resolution with QPC.

> - Assume there is a fast connection from the PC to the local router/line-modem (>=100Mbps)

This assumption does not help any calculation. There is a bottleneck somewhere, obviously. It doesn't matter where it is.

> 1. A more efficient data transfer: fewer packets (= fewer ACKs) at larger sizes

Fewer packets != fewer ACKs, with delayed ACKs.

> 2. Pacing and delays that can improve congestion: other applications can 'squeeze in' more of their own packets

No; all you have accomplished is a periodic delay and a less-than-maximum transfer rate.


Since I cannot spend more time repeating any of the tests after a delayed response of about a month... :P I'll see what I can locate (saved). I'm sure you are more than capable of reproducing it all (results/captures) with the repro steps/test setup above.

> Packets larger than MTU would be a bug ...

go ahead, find it -


capture: http://www.mediafire.com/?xn2mlmuwmwy

> pacing:

>We already do this

>This is not true. Windows has microsecond resolution with QPC

Arvid wrote:

> you have the OS timer granularity being relatively coarse (on Windows you get timer interrupts with a granularity of 10-20 milliseconds), which makes it harder to evenly distribute the packets.

I adapted my algorithm to those limitations ^. I guess that if you think send-pacing is possible at higher granularity, you should check the code and discuss it with the devs... (and anyway, unsplit 16K packets (TCP or uTP) are a bit hard to pace inside the OS stack...)

>There is a bottleneck, somewhere, obviously. It doesn't matter where it is

Obviously. But it does matter where it is, from the aspect of who is pacing/controlling the rate and how (the host PC (uT/OS), the router, or the ISP).

>No, all you have accomplished is periodic delay, and less than maximum transfer rate.

I suggest you review the proposed algorithm above again; you'll see that the defined (limited) rate is maintained.

>Small packets do help minimize congestion. Please show measurements you thought showed otherwise.

And I'm still waiting to see yours (together with jch here, performance-wise)... :P I'll need them to sync with your methods and tools, so we're working from the same baseline.

( And you can see one example I've posted in the 2.01 thread #81 )


Maybe it's reasonable to assume that anyone who has configured an upload rate limit doesn't care about the delay that full-sized MTU packets could cause?

However, the upload rate limit itself might be an attempt to reduce delay (which seems quite reasonable), and lowering the upload rate limit would not have the desired effect in that case. It would be counter-intuitive and might even make people set artificially low upload rate limits.

Regarding packet pacing: if I understand you correctly, that's the only way you can do it. You cannot ask your computer to wake up reliably at a certain interval; all sorts of things can happen to make that interval longer than what you asked for. The only way is to keep track of how long it took, and then send the number of packets that time period represents at the current rate.

The main problem is that there is no "current rate" in a window-based congestion controller. There's only a window size and a number of bytes in flight. You could estimate the current rate by dividing the window size by the RTT. This is quite noisy and will have unpredictable feedback effects on the window size, since the window size is updated on the same time scale, using the same input, to a certain degree. Once you have this mechanism, you just have to call this tick function as often as you can.
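To illustrate (my own sketch, not libutp code): one way to turn the window/RTT estimate into such a tick function is to track the elapsed time, send however many whole packets the estimated rate has "earned", and carry the fractional remainder into the next tick.

```python
def packets_to_send(window_bytes, rtt_s, elapsed_s, packet_size=1500, carry=0.0):
    """One pacing tick: estimate the rate as window/RTT, convert elapsed
    time into a byte budget, and return (whole packets to send now,
    leftover bytes to carry into the next tick)."""
    rate = window_bytes / rtt_s            # bytes/sec estimate (noisy!)
    budget = carry + rate * elapsed_s      # bytes "earned" since the last tick
    n = int(budget // packet_size)
    return n, budget - n * packet_size

# e.g. a 64 KB window at 100 ms RTT, ticked after 20 ms:
# rate ~640 KB/s -> ~12800-byte budget -> 8 full packets, ~800 bytes carried
```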

The tradeoff here is between smoothness of pacing and the number of times the CPU has to switch to your process. You could set your thread priority to real-time and essentially force the OS to schedule your thread once each quantum, but it might cost you CPU.

The other problem is that what you're doing is essentially delaying filling up your send window. I believe this is what Greg referred to as the periodic delay. This is very likely to be sub-optimal throughput-wise, compared to always filling your window.

The third factor is the ACK traffic going back. For pacing to work well, whether you do it explicitly as you suggest or more implicitly like TCP, you need to ACK as often as possible. ACKing one whole window at a time will reduce your granularity and your ability to send packets evenly.

I'm not trying to put you off coming up with a good solution to packet pacing; I just want to explain that it's not as simple as it might appear. When I first set off to implement pacing I thought I had it all figured out, but in practice there were a number of things breaking some aspect of the protocol. Modems actually act as packet pacers, and just using a straightforward windowed approach, ACKing every packet, will give you decent pacing. It could even get you higher granularity than pacing explicitly, since incoming packets can make Windows switch to your thread in the middle of someone else's quantum (as Greg pointed out as well).


You are correct: the above is in no way a full & tested solution to pacing large UDP packets. It's just an idea for dealing with bad tick granularity; ACKs were not considered at all.

I can think of just one other example: the use of TCP (in uT as well). For some strange (.. ;) ..) reason, uT speed limiting works there, and it seems to work with large packets and small overhead too...

Edit/PS:

1) I've emailed you some more data

2) To be honest, 'cleaning up the act (code)' here could eventually bring uTP to TCP's level of performance/overhead. There is still the possibility that equipment is "sensitive" to UDP as such; uTP could still cause this issue (maybe to a lesser extent) even if it becomes more efficient.
