Tuesday, 16 July 2013

QUIC notes: Rationale, FEC and Head of Line blocking

Having been involved with ZeroMQ for a few years now, and having taken a deeper look at messaging last year, I enjoy occasionally dipping into various network protocols. One of the most interesting recent efforts is QUIC, from the Chrome team (who are just a short bridge away from my Googley home of the Google+ team), which is aimed to provide a post-SPDY protocol to be used instead of HTTP over TCP for high performance web sites.

Things at Google tend to start with a design doc that lays out the rational and the plan for a given project, and happily the QUIC doc is openly available. It's a great read, and its worth highlighting some of the more interesting problems the team are addressing.

To start with, the document lays out the (12) goals of the project and the motivations for them. These roughly break down into two camps for me:

Usable Now

This is a protocol the team intends to deploy, and it is one that has to do the job of existing setups. That means it's got to be efficient routed, switched, and handled by the myriad bits and pieces of network hardware that sit between a server and a client. It has to offer reliability, so that apps which require reliable delivery can get it without layering their own on top. It has to be friendly to the internet, so it will back off when there is congestion, which is vital to getting good performance over busy links. Finally, it's got to be at least as secure as the current SSL/TLS setups we're familiar with, so people can trust the protocol to transport confidential data.

  • Works with todays internet
  • Support reliable transport for multiplexed streams
  • Efficient demux-mux properties
  • Congestion avoidance comparable to TCP (across multiplexed streams)
  • Privacy assurances comparable to TLS (e.g. HTTPS)
  • Reused of existing protocols whereever possible

Low latency/high throughput

These goals are the real meat of the improvements the protocol is aiming to deliver. Some of these are obvious (e.g. reduced packet count is likely to make things quicker, all else being equal), but some of them aren't, so I've picked out a few below.

  • Reliable and safe resource requirements scaling
  • Reduced head-of-line blocking due to packet loss
  • Minimal round trip costs in setup and during packet loss
  • Minimal round trips during setup
  • Forward error correction to avoid retransmission
  • Reduced bandwidth and increased channel status responsiveness
  • Reduced packet count

Minimising Round Trips

Networking technology has consistently improve over the last 30-40 years: we get faster or more capable chips, more memory in devices, higher bandwidth links, and all sorts of other improvements which result in use being able to shift more data, more easily. The one thing that stays constant in this is the speed of light - if we have to send a signal over a given distance, that's the best we can do. This means that while we might be able to send more data down a line at a given moment, the time take to get a response isn't going to improve very much. The round-trip-time (latency, or ping) is the length of time to go to the server and back, and it can be quite costly, particularly when there are high latency hops involved.

Most of the time this is just a fact of life - if you're far away from a server, you're going to have to wait a bit longer. That said, different design choices can mean it has an outsized impact on the performance of certain applications. The web as it stands is built on HTTP over TCP - a protocol which presents a reliable connection stream to the applications. When a browser needs to make a connection, it has to go through a handshake. To take a simplified exchange:

Client -> Server: SYN
Server -> Client: SYN ACK
Client -> Server: ACK
Client -> Server: (segment with http request in it)
Server -> Client: ACK
Server -> Client: (segment with http response)
Client -> Server: ACK
Client -> Server: FIN 
Server -> Client: FIN ACK

In reality there would likely be many more segments (aka packets) with the different parts of the http response in, but the point is that before the client can even begin the process of rendering and displaying the response, it has to go through several round-trips to setup the connection. If the connection were an HTTPS one, there would be even more. All of this means there is a delay is doing the thing the user wanted. This tends to hurt even more on mobile devices, where to save power the radio is often turned down when not being actively used, severing any existing connections.

A lot of work has been put into TCP and HTTP to speed this up. This includes widespread ideas like pipelining, where the browser keeps a TCP connection open and sends further HTTP requests down that same pipe. It also includes more unusual techniques: Chrome's backup TCP method means it will start a new TCP connection to the server if it hasn't had a response in 250ms.

QUIC is looking at this problem in two ways. One of them is by using UDP rather than TCP. TCP is designed for reliable, ordered data. UDP is more of a fire-and-forget protocol that may or may not get the data you sent to the other side, and doesn't make any promises on the order in which it'll be received.

The second thing QUIC aims to do is merge the security layer (e.g. TLS as in HTTPS) with the transport. This means that you can save round-trips. Rather than having to establish a TCP connection then establish TLS on top, you can setup both at the same time and not have to go back and forth so many times. QUIC attempts to minimise these trips by initialising the crypto as it initialises the connection , and by making sensible assumptions such as using previously established credentials if possible.

Packet Loss & Retransmission: Forward Error Correction

One of the other goals QUIC identified was related - avoiding retransmissions on loss. As QUIC uses UDP, it has to build some method for determining whether the other side of a connection has received all the parts of a message - say all of the parts of a web page. TCP does this by having a sequence number, and having the receiving side send ACK messages back for the highest continuous number it has seen - so if it sees segments 1, 2, 3, 5 it can only ACK up to 3, as 4 has not yet made it. If packet 6 then arrives, the receiver can again send out an ACK for 3, so the sender can tell it's stuck and needs some of the earlier data to be retransmitted.

There are other methods. In PGM, a multicast protocol, receivers don't ACK data they have received, instead they look for gaps in the sequence numbers and send out a NACK, a negative acknowledgement. So, if a PGM receives 1, 2, 3, 5 it can send back a NACK message for 4, and the sender can retransmit it.

In both cases, this introduces latency: if the receiver needs all of the parts of the data to have usable chunk of information, it has to wait for the sender to work it out that it should resend, and then for the data to be delivered. In the mean time, none of the subsequent data can be delivered, even if it was usable without the missing chunk.

QUIC tries to avoid having to go back to the sender through the user of Forward Error Correction. This works by having a backup that can be used to regenerate for any one of the packets in a group - like an understudy in a theatre company who has learned the lines for 3 different roles, and so can cover for any one of them. Its spending some extra bandwidth up front, in order to introduce redundancy into the data stream. This allows receivers to fix missing packets on their own without having to go back to the sender and incur extra latency.

We can demonstrate this briefly. If we're sending a sentence as a series of packets, it might look like this:


After our run of packets, we then send a forward error correction packet. To do this, we just need to XOR each byte of the previous packets (which we can do as they go out) with our error correction packet, padding the data packets out to the maximum length with zeros if necessary. So, for example for this run of 4, the first bytes would be the ASCII value of I (73), T (84), W (87) and T again: 73^84^87^84 = 30.

We then send that fifth packet. If any one of the preceding four is lost, we can XOR the rest against the FEC packet, and reconstruct it. For example, if we lost the first packet, for the first byte we would XOR T, W, T and our FEC value of 30: 84^87^84^30 = 73 - or the ASCII code for I! Here's an example in Javascript:

So, for occasionally, uncorrelated, packet loss we can easily recover the missing packets without having to go back to the server or wait for a timeout. QUIC groups its data streams into FEC groups, and sends out check packets at the end of each one.

Head of line blocking due to packet loss

Much of QUIC is based on the work done for SPDY, which included multiplexing. Multiplexing is a convenient solution to a couple of problems. Firstly, the round trip time cost for setting up a TCP and TLS connection is required for each independent connection between a client and a server,. Secondly, most browsers have a limit of the number of TCP connections that can be made to a given domain at a time - so even when more requests could be made in parallel, they wont be.

SPDY attempted to resolve this by combining multiple connections into a single TCP connection. While you can pipeline requests across a single connection, it's easy for that pipeline to become stuck - for example retrieving a slow-to-generate page while it could be grabbing fast-to-get images. Because these go through separate SPDY streams, now everything can be retrieved just as fast as it can be transmitted.

However, just because one slow request can't hold up others doesn't mean that a similar situation can't occur, just at a lower level. Imagine we have 6 streams going from client to server:

1 C <----[]----- S
2 C <-------[]-- S
3 C <---[]------ S
4 C <--[]------- S
5 C <-------[]-- S
6 C <------[]--- S

In a simple world (as in reality the process is sped-up by sending a window-worth of packets before waiting for acknowledgement), the server sends some data, then the client acknowledges the data, and the server doesn't send any more until the client has acknowledged. Now, imagine that we get some loss:

1 C <---[]------ S
2 C <------[]--- S
3 C <--[]------- S
4 C <-X--------- S
5 C <------[]--- S
6 C <-----[]---- S

Now stream 4 has to wait for the server to time out (which defaults to 3 seconds!) or to recognise the lost data, and resend the packet. If the server has sent other packets in the mean time they'll make it to the client, but they wont be delivered to the application. Bad news for connection 4, but no worries for everyone else.

Now, if we imagine the same thing in a SPDY connection, where all 6 streams are combined into one TCP connection:

1 C <-|-[]------ S
2 C <-|----[]--- S
3 C <-|[]------- S
4 C <-X--------- S
5 C <-|----[]--- S
6 C <-|---[]---- S

Because all 6 connections are now flowing over 1 TCP connection, and TCP guarantees the order, now none of the other streams are going to get their data. TCP doesn't know they're unrelated, so it holds the entire flow up until it gets the missing data.

QUIC gets around this by being based on UDP, so if one packet is lost, it doesn't stop the data getting delivered. Because UDP doesn't guarantee order, it's up to QUIC to reassemble the incoming data into useful streams - and the way streams and multiplexing is something I'll talk about in another post.