Weave Networking Performance with the New Fast Data Path

By Bryan Boreham
November 13, 2015

Weave Net 1.2 delivers performance close to standard networking on today’s common x86 servers thanks to “fast data path”, a feature we previewed a few months ago. We’ll be doing further testing on servers with newer 10G network interface cards (NICs) that offer additional hardware acceleration, and will publish that data shortly.

Weave fast data path just works

The most important thing to know about fast data path is that you don’t have to do anything to enable this feature. If you have tried 1.2, you are probably using fast data path already. When we were considering the technical options for improving the performance of Weave Net, many of them implied limitations to the existing feature set, compromising such features as robust operation across the internet, firewall traversal, or IP multicast support. We didn’t want to force users to choose between a “fast” mode and a “features” mode when they set up a Weave network.

In any case where Weave can’t use fast data path between two hosts it will fall back to the slower packet forwarding approach used in prior releases. The selection of which forwarding approach to use is automatic, and made on a connection-by-connection basis. So a Weave network spanning two data centers might use fast data path within the data centers, but not for the more constrained network link between them.

There is one Weave Net feature that, for the moment, won’t work with fast data path: encryption. So if you enable encryption with the --password option to weave launch (or use the WEAVE_PASSWORD environment variable), the performance characteristics will be similar to prior releases. We are investigating options to resolve this, so if this is important to you, we’d like to hear a bit more about your requirements: please get in touch!
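
For instance, here is a minimal sketch of launching with encryption enabled (the password and peer host names are placeholders); connections on an encrypted network will use the fall-back forwarding path described above:

<code># Encryption currently forces the fall-back forwarding path
$ WEAVE_PASSWORD='your-secret-here' weave launch host2 host3
# ...or, equivalently:
$ weave launch --password 'your-secret-here' host2 host3</code>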

How it works

Weave implements an overlay network between Docker hosts, so each packet is encapsulated in a tunnel header and sent to the destination host, where the header is removed. In previous releases, the Weave router added/removed the tunnel header. The Weave router is a user space process, so the packet has to follow a winding path in and out of the Linux kernel:

[Figure: packet path in prior releases, with the user space Weave router adding and removing the tunnel header]

Fast data path uses the Linux kernel’s Open vSwitch datapath (ODP) module, which allows the Weave router to tell the kernel how to process packets, rather than processing them itself:

[Figure: packet path with fast data path, where the kernel’s Open vSwitch datapath module handles encapsulation]

Fast data path reduces CPU overhead and latency because there are fewer copies of packet data and fewer context switches. The packet goes straight from the user’s application to the kernel, which takes care of adding the VXLAN header (or the NIC does, if it offers VXLAN acceleration).
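
If you are curious, and the Open vSwitch userspace tools happen to be installed (Weave Net itself does not need them), you can take a peek at the kernel datapath that the Weave router programs. This is purely a diagnostic sketch:

<code># Confirm the Open vSwitch datapath module is loaded
$ lsmod | grep openvswitch
# List the kernel datapaths and their ports (requires the ovs-dpctl tool)
$ sudo ovs-dpctl show</code>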

[Figure: Weave UDP encapsulation format]

Prior to version 1.2, Weave Net used a custom encapsulation format. But using ODP means using one of the encapsulation formats it supports, and fast data path uses VXLAN. Like Weave Net’s custom format, VXLAN is UDP-based, so it is unlikely to cause problems for network infrastructure. And because VXLAN is an IETF standard, you can use standard tools such as Wireshark to inspect the encapsulated packets.
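
As a rough sketch of how that inspection might look on one of the hosts (the interface name eth0 is an assumption; UDP port 6784 is the port used by the Weave routers, as noted below):

<code># Capture the encapsulated traffic on the underlay network interface
$ sudo tcpdump -ni eth0 udp port 6784
# With tshark, decode the payload as VXLAN to see the inner frames
$ sudo tshark -ni eth0 -f 'udp port 6784' -d udp.port==6784,vxlan</code>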

The required ODP and VXLAN features are present in Linux kernel versions 3.12 and later; if your kernel was built without the necessary modules, Weave Net falls back to the “user space” packet path.
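
A quick way to check whether a host can use fast data path is to look at the kernel version and whether the relevant modules are available; a minimal sketch, using the module names shipped with mainline kernels:

<code># Kernel 3.12 or later is required
$ uname -r
# Check that the Open vSwitch datapath and VXLAN modules can be loaded
$ sudo modprobe openvswitch && sudo modprobe vxlan && echo "modules OK"
$ lsmod | grep -E 'openvswitch|vxlan'</code>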

See which connections use fast data path

Weave automatically uses fast data path for every connection unless it encounters a situation that prevents it from working. To ensure that Weave can use fast data path:

  • Avoid Network Address Translation (NAT) devices between the hosts
  • Open UDP port 6784 between the hosts (used by the Weave routers); see the sketch after this list
  • Ensure that WEAVE_MTU plus the VXLAN tunnel overhead fits within the MTU of the intermediate network (see below)
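
For example, on hosts where iptables is managed directly, the UDP port can be opened like this (a sketch only; adapt it to your firewall or cloud security-group setup):

<code># Allow the UDP port used by the Weave routers for fast data path
$ sudo iptables -A INPUT -p udp --dport 6784 -j ACCEPT</code>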

Use of fast data path is an automated connection-by-connection decision made by Weave, so you can end up with a mixture of connection tunnel types. If fast data path can’t be used for a connection, Weave falls back to the “user space” packet path. Once you have set up a Weave network, you can query the connections with the weave status connections command:

<code>$ weave status connections
<- 192.168.122.25:43889  established fastdp a6:66:4f:a5:8a:11(ubuntu1204)</code>

Here fastdp indicates that fast data path is being used on that connection. Otherwise, that field shows sleeve to indicate Weave Net’s fall-back encapsulation method:

<code>$ weave status connections
<- 192.168.122.25:54782  established sleeve 8a:50:4c:23:11:ae(ubuntu1204)</code>
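
On a larger network it can be handy to summarise the tunnel types in use across all connections; a small sketch that assumes the output columns shown above:

<code># Count connections by forwarding method (fastdp vs sleeve)
$ weave status connections | awk '{print $4}' | sort | uniq -c</code>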

Performance results

This section discusses the network performance that is achievable with fast data path, and how its performance compares to that of prior releases of Weave Net.

But first, a note of caution: we are not promising that similar results will be seen in all network environments. In cloud computing environments in particular, the “physical network” may actually be a sophisticated Software Defined Network. Differences in treatment of VXLAN packets vs. unencapsulated TCP traffic will be reflected in the performance of Weave Net compared to host networking. We have even seen cases where Weave Net fast data path throughput is better than host networking throughput, though we claim no credit for that. None of this is specific to Weave Net: any use of VXLAN can suffer from such effects.

Another point to note is that native TCP network throughput is assisted by the TSO/LRO (TCP Segmentation Offload / Large Receive Offload) support of modern network interface chipsets. With these facilities, the network interface hardware takes on some of the burden of segmenting a TCP data stream into packets, and reassembling it on the receiver. This helps to avoid CPU being the bottleneck to network throughput even when using a traditional 1500 byte MTU. But conventional TSO/LRO does not support the further packet manipulations needed for VXLAN encapsulation (hardware VXLAN offload does exist, but currently only in high-end NICs). So, with VXLAN, more CPU work is required to sustain a given TCP throughput, and this CPU usage may present a bottleneck. Again, any use of VXLAN is subject to this effect.
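
If you want to see which of these offloads your NIC driver exposes, ethtool will list them; a sketch, with the interface name assumed:

<code># Look for tcp-segmentation-offload / large-receive-offload and, for
# hardware VXLAN offload, tx-udp_tnl-segmentation
$ ethtool -k eth0 | grep -E 'segmentation|receive-offload'</code>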

We tested performance between two Amazon EC2 c3.8xlarge instances with enhanced networking (10 Gigabit/sec).

The following chart compares TCP throughput between containers using Weave Net 1.1.2, using Weave Net 1.2.1, and directly on the hosts. iperf3 was used for these measurements:

[Figure: TCP throughput between containers with Weave Net 1.1.2, Weave Net 1.2.1 (default and jumbo MTU), and host networking]

Weave Net 1.2 using VXLAN encapsulation delivers the same throughput as regular traffic using VXLAN tunneling. c3.8xlarge instances don’t support VXLAN hardware acceleration, so overall performance is below that of standard TCP connections, which benefit from TCP hardware acceleration in the network interface chipset.

The only option passed to iperf3 in these throughput tests was -t 30, in order to test over a reasonable amount of time. We also conducted tests with the --parallel option to test throughput over multiple TCP streams, and saw similar overall throughput. qperf’s tcp_bw test also produced consistent results. The Linux distro was Amazon Linux 2015.09.1 (kernel 4.1.10, docker 1.7.1).
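
If you want to run a similar measurement on your own hosts, the following sketch shows one way to do it. It assumes the Weave Docker API proxy is set up via weave env, that weaveDNS resolves the server container’s name, and that an image whose entrypoint is iperf3 (such as networkstatic/iperf3) is available; these specifics are illustrative rather than the exact commands we used.

<code># On host 1: start an iperf3 server container attached to the Weave network
host1$ eval $(weave env)
host1$ docker run -d --name iperf-server networkstatic/iperf3 -s

# On host 2: run the client against the server container for 30 seconds
host2$ eval $(weave env)
host2$ docker run --rm networkstatic/iperf3 -c iperf-server -t 30</code>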

The two results for Weave Net 1.2.1 compare the effect of the MTU setting. The WEAVE_MTU environment variable, set when running weave launch, determines the MTU of the Weave network. There is a trade-off: set the MTU too low and performance suffers; set it too high, and WEAVE_MTU plus the 50-byte VXLAN tunnel overhead causes packets to exceed the MTU of the underlying network, in which case Weave Net falls back to its user space encapsulation. The default value of WEAVE_MTU is 1410, a conservative value that should work in almost all network environments. But if you know that the underlying network supports a higher MTU, you can raise it to obtain improved performance. AWS EC2 supports 9000-byte jumbo frames for many instance types, so we can safely set WEAVE_MTU to 8950.
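
For example, to relaunch with a larger MTU on a network that supports jumbo frames (a sketch; the peer host names are placeholders, and stopping and relaunching the router will briefly interrupt container connectivity):

<code># Relaunch the router with a larger MTU (the underlay must carry ~9000-byte frames)
$ weave stop
$ WEAVE_MTU=8950 weave launch host2 host3
# Verify the MTU applied to the Weave bridge
$ ip link show weave | grep -o 'mtu [0-9]*'</code>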

Measuring CPU usage is tricky when most of the work is being done in the kernel itself, so we took some of the CPU cores offline (through /sys/devices/system/cpu/cpuN/online) to see if that impacted performance. Throughput is consistent even with only one core online:

[Figure: TCP throughput with varying numbers of CPU cores online]
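
A sketch of how cores can be taken offline for this kind of experiment (cpu0 typically cannot be taken offline, and the core numbers here are arbitrary):

<code># Take a core offline, re-run the benchmark, then bring the core back
$ echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online
$ echo 1 | sudo tee /sys/devices/system/cpu/cpu3/online</code>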

Finally, we measured latency using qperf’s udp_lat test:

[Figure: UDP latency compared with host networking and prior Weave Net versions]

As you can see, the latency is only a little over host network latency, and as with the bandwidth, it’s a vast improvement on prior versions of Weave Net.
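
For reference, a minimal qperf invocation of the kind used for these latency and bandwidth measurements (the server address is a placeholder; qperf listens on its default port when run with no arguments):

<code># On the server: start the qperf listener
server$ qperf
# On the client: measure UDP latency and TCP bandwidth against the server
client$ qperf "$SERVER_ADDR" udp_lat tcp_bw</code>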

