Early last week I was asked to look into an intermittent failure of one of the smoke tests for our new gossip-based DNS implementation. The test is quite straightforward – it starts and then kills containers in a loop across a number of hosts, asserting that DNS records and tombstones are propagated in a timely fashion. The distribution of failures was odd, however – some people could reproduce it reliably, others not at all. An initial analysis by Tom and Matthias had revealed that for some reason the TCP connection between the weave routers was flapping, resulting in delayed delivery of gossip. Why?

Of course I was completely unable to reproduce this behaviour on my own workstation – more on that later. Consequently I borrowed Tom’s GCE instances, configured the test to use only two weave peers with a single connection, and set about the problem with tcpdump and conntrack:

host2:~$ sudo tcpdump -i eth0 -tt tcp and port 6783
1438082113.675743 IP host2.us-central1-a.positive-cocoa-90213.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [.], ack 13904, win 626, options [nop,nop,TS val 1844274 ecr 1843667], length 0
1438082113.941917 IP host2.us-central1-a.positive-cocoa-90213.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 11422:11691, ack 13904, win 626, options [nop,nop,TS val 1844340 ecr 1843667], length 269
1438082113.942187 IP host1.us-central1-a.postive-cocoa-90213.6783 > host2.us-central1-a.positive-cocoa-90213.40514: Flags [.], ack 11691, win 580, options [nop,nop,TS val 1843743 ecr 1844340], length 0
1438082127.587805 IP 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 3447297095:3447297364, ack 1698670085, win 626, options [nop,nop,TS val 1847752 ecr 1843743], length 269
1438082140.931569 IP host1.us-central1-a.postive-cocoa-90213.6783 > host2.us-central1-a.positive-cocoa-90213.40514: Flags [P.], seq 13904:14257, ack 11691, win 580, options [nop,nop,TS val 1850491 ecr 1844340], length 353
1438082140.931649 IP host2.us-central1-a.positive-cocoa-90213.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [R], seq 3447297095, win 0, length 0
1438082140.931661 IP host1.us-central1-a.postive-cocoa-90213.6783 > host2.us-central1-a.positive-cocoa-90213.40514: Flags [P.], seq 14257:14903, ack 11691, win 580, options [nop,nop,TS val 1850491 ecr 1844340], length 646
1438082140.931698 IP host2.us-central1-a.positive-cocoa-90213.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [R], seq 3447297095, win 0, length 0
1438082140.931789 IP 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 0:269, ack 1, win 626, options [nop,nop,TS val 1851088 ecr 1843743], length 269
1438082140.931797 IP host1.us-central1-a.postive-cocoa-90213.6783 > host2.us-central1-a.positive-cocoa-90213.40514: Flags [P.], seq 14903:15790, ack 11691, win 580, options [nop,nop,TS val 1850491 ecr 1844340], length 887
1438082140.931828 IP host2.us-central1-a.positive-cocoa-90213.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [R], seq 3447297095, win 0, length 0

host2:~$ sudo conntrack -E -o timestamp -p tcp --dport 6783
[1438082023.557680]     [NEW] tcp      6 120 SYN_SENT src=172.17.0.155 dst=10.240.17.237 sport=40514 dport=6783 [UNREPLIED] src=10.240.17.237 dst=10.240.27.13 sport=6783 dport=40514
[1438082023.558335]  [UPDATE] tcp      6 60 SYN_RECV src=172.17.0.155 dst=10.240.17.237 sport=40514 dport=6783 src=10.240.17.237 dst=10.240.27.13 sport=6783 dport=40514
[1438082023.558364]  [UPDATE] tcp      6 432000 ESTABLISHED src=172.17.0.155 dst=10.240.17.237 sport=40514 dport=6783 src=10.240.17.237 dst=10.240.27.13 sport=6783 dport=40514 [ASSURED]
[1438082114.248538] [DESTROY] tcp      6 src=172.17.0.155 dst=10.240.17.237 sport=40514 dport=6783 src=10.240.17.237 dst=10.240.27.13 sport=6783 dport=40514 [ASSURED]

For some reason, netfilter is destroying the flow mapping prematurely! This has two consequences: firstly, further outbound router traffic to $HOST1 no longer has its source address rewritten (see the packet at 1438082127.587805) – it will be ignored by $HOST1, if it even makes it that far, since the intervening network may filter the unroutable RFC1918 source address. Secondly, further inbound traffic from $HOST1 is rejected with a TCP RST (see the packets at 1438082140.931649, 1438082140.931698 and 1438082140.931828). This explains the connection flap – but why is the kernel prematurely removing an in-use established flow mapping?
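
The first consequence is worth making concrete. Here is a toy model (not real kernel code) of how $HOST1 demultiplexes incoming segments on the TCP 4-tuple, with the addresses taken from the conntrack output above – $HOST1 accepted the connection from the masqueraded source, so a packet bearing the container’s own address matches no socket:

```python
# Toy model of TCP 4-tuple demultiplexing on $HOST1 (10.240.17.237).
# It accepted the connection from the masqueraded source 10.240.27.13,
# per the reply tuple in the conntrack output above.
established = {("10.240.27.13", 40514, "10.240.17.237", 6783)}

def classify(src, sport, dst, dport):
    """What $HOST1's stack does with an incoming segment."""
    if (src, sport, dst, dport) in established:
        return "deliver"
    return "no matching socket"

# The normal, masqueraded packet is delivered...
print(classify("10.240.27.13", 40514, "10.240.17.237", 6783))   # deliver
# ...but the un-NATed packet at 1438082127.587805 matches nothing.
print(classify("172.17.0.155", 40514, "10.240.17.237", 6783))   # no matching socket
```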

The conntrack events we’ve been examining are delivered to userspace via a netlink socket; this gives us a place to start looking in the kernel source. Poking around in the UAPI files (v3.16.0) we find the following in uapi/linux/netfilter/nf_conntrack_common.h:

/* Connection tracking event types */
enum ip_conntrack_events {
 IPCT_NEW, /* new conntrack */
 IPCT_RELATED, /* related conntrack */
 IPCT_DESTROY, /* destroyed conntrack */
 ...

That looks useful – let’s find out where that event is emitted:

awh@mantisal:~/workspace/github.com/torvalds/linux$ ack IPCT_DESTROY net
net/netfilter/nf_conntrack_netlink.c
626: if (events & (1 << IPCT_DESTROY)) {
682: if (events & (1 << IPCT_DESTROY)) {
net/netfilter/nf_conntrack_core.c
363: if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
398: unlikely(nf_conntrack_event_report(IPCT_DESTROY, ct,

Inspection reveals that both emission points for this event (in nf_conntrack_core.c) sit inside nf_ct_delete, which is itself referred to from only two places: the flow timeout handler (death_by_timeout) and the netlink code that processes manual delete requests from userspace. We’re not manually deleting the flow, and since the default timeout for the ESTABLISHED state is five days it’s unlikely to be the timeout either. How to make further progress?
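
That last claim is easy to sanity-check against the timestamps in the conntrack output above:

```python
# Lifetime of the conntrack entry, from the event timestamps above.
created   = 1438082023.557680   # [NEW]
destroyed = 1438082114.248538   # [DESTROY]

lifetime = destroyed - created
print(round(lifetime, 1))       # 90.7 -- the entry lived about 91 seconds

# The ESTABLISHED entry was installed with a 432000 second timeout:
timeout_days = 432000 / 86400
print(timeout_days)             # 5.0 -- five days, so idle expiry is implausible
```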

Clearly nf_ct_delete is being called somehow, and since we’ve accounted for all the static references in the code it must be via a function pointer. At this point we can turn to the kernel’s dynamic function tracing capability to see who is calling nf_ct_delete:

    cd /sys/kernel/debug/tracing
    echo function > current_tracer
    echo nf_ct_delete > set_ftrace_filter
    echo func_stack_trace > trace_options
    echo global > trace_clock
    echo 1 > tracing_on
    cat trace_pipe

Running the tests again, we quickly see from the stack traces that in addition to being called from the kernel timer code, death_by_timeout is also called by __nf_ct_kill_acct by function pointer:

          weaver-20733 [000] ..s1  7677.668694: nf_ct_delete <-death_by_timeout
          weaver-20733 [000] ..s1  7677.668713: <stack trace>
 => nf_ct_delete
 => death_by_timeout
 => __nf_ct_kill_acct
 => nf_nat_ipv4_fn
 => nf_nat_ipv4_out
 => nf_iterate
 => nf_hook_slow
 => br_nf_post_routing
 => nf_iterate
 => nf_hook_slow
 => br_forward_finish
 => br_nf_forward_finish
 => br_nf_forward_ip
 => nf_iterate
 => nf_hook_slow
 => __br_forward
 => deliver_clone
 => br_flood
 => br_flood_forward
 => br_handle_frame_finish
 => br_nf_pre_routing_finish
 => br_nf_pre_routing
 => nf_iterate
 => nf_hook_slow
 => br_handle_frame
 => __netif_receive_skb_core
 => __netif_receive_skb
 => process_backlog
 => net_rx_action
 => __do_softirq
 => do_softirq_own_stack
 => do_softirq
 => __local_bh_enable_ip
 => ip_finish_output
 => ip_output
 => ip_local_out_sk
 => ip_queue_xmit
 => tcp_transmit_skb
 => tcp_write_xmit
 => __tcp_push_pending_frames
 => tcp_push
 => tcp_sendmsg
 => inet_sendmsg
 => sock_aio_write
 => do_sync_write
 => vfs_write
 => SyS_write

The originating code is also in nf_conntrack_core.c:

1297     if (del_timer(&ct->timeout)) {
1298         ct->timeout.function((unsigned long)ct);
1299         return true;
1300     }
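
The idiom in the snippet above – cancel the pending timer and, if we won that race, invoke its handler synchronously – means expiry and early teardown share a single code path. A toy Python model (not the kernel code):

```python
# Toy model of del_timer() + direct handler invocation: whoever
# successfully cancels the pending timer runs its handler (here,
# death_by_timeout) synchronously.
class ConntrackTimer:
    def __init__(self, function):
        self.function = function    # the expiry handler
        self.pending = True

    def del_timer(self):
        """Cancel; returns True only if the timer was still pending."""
        was_pending, self.pending = self.pending, False
        return was_pending

calls = []
timeout = ConntrackTimer(function=lambda: calls.append("death_by_timeout"))

# The early-kill path: fire the expiry handler immediately.
if timeout.del_timer():
    timeout.function()

print(calls)                    # ['death_by_timeout']
```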

Sneaky – it invokes the handler via the function pointer stored inside the timer struct. Anyway, we have another lead – who calls nf_ct_kill_acct, and why? Another quick grep leads us to nf_nat_ipv4_fn in net/ipv4/netfilter/iptable_nat.c:

128     default:
129         /* ESTABLISHED */
130         NF_CT_ASSERT(ctinfo == IP_CT_ESTABLISHED ||
131                  ctinfo == IP_CT_ESTABLISHED_REPLY); 
132         if (nf_nat_oif_changed(ops->hooknum, ctinfo, nat, out))
133             goto oif_changed;
134     }
135 
136     return nf_nat_packet(ct, ctinfo, ops->hooknum, skb);
137 
138 oif_changed:
139     nf_ct_kill_acct(ct, ctinfo, skb);
140     return NF_DROP;

Revelation – if a packet matching an extant masquerade flow attempts to egress through an interface other than the one through which the flow was originally established, the flow is destroyed and the packet dropped. Unfortunately we’re back to square one – why has the kernel suddenly decided to route a packet out of a different interface from the one it used milliseconds before? And what is that interface?

Whilst extremely powerful, dynamic function tracing has some limitations – it’s not possible to inspect the function arguments, which we’d ideally like to do here to determine where the kernel has decided to send our packet. Perhaps we can get the information in another way?

host2:~$ sudo iptables -tnat -I POSTROUTING 1 -p tcp --dest $HOST1 --dport 6783 -j LOG --log-prefix WEAVE:

This will cause iptables to log packets traversing the POSTROUTING chain of the nat table. Initially we see what we expect – this corresponds to the SYN packet sent from the router on $HOST2 to $HOST1:

[ 7586.977937] WEAVE:IN= OUT=eth0 PHYSIN=vethe830ed5 SRC=172.17.0.155 DST=10.240.17.237 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=43692 DF PROTO=TCP SPT=40514 DPT=6783 WINDOW=28400 RES=0x00 SYN URGP=0 

But then something decidedly weird, corresponding to the timestamp of the call to nf_ct_delete:

[ 7677.876096] WEAVE:IN= OUT=docker0 PHYSIN=vethe830ed5 PHYSOUT=veth8cad4e5 SRC=172.17.0.155 DST=10.240.17.237 LEN=321 TOS=0x00 PREC=0x00 TTL=64 ID=43772 DF PROTO=TCP SPT=40514 DPT=6783 WINDOW=626 RES=0x00 ACK PSH URGP=0

For some reason, a TCP packet from the router on $HOST2 to $HOST1 is egressing the docker bridge via the veth interface of one of the other application containers, instead of via eth0 on $HOST2 as expected. That certainly constitutes a change of interface, but what on Earth is going on?

One thing I had observed is that the connection flap did not occur if the routers were left connected without the test running, even though they periodically exchange heartbeat traffic; furthermore, the timing of the failure was apparently random, indicating some kind of racy interaction between the weave router traffic and the starting and stopping of our test containers. Could it be that for a brief window during container startup, an erroneous route was in force that affected our traffic?

Something about this hypothesis didn’t seem quite right. The dynamic function trace of nf_ct_delete invocations revealed that it was being invoked from the bridge forwarding code – layer 2. Perhaps the problem resided there – let’s examine another tcpdump, this time of the traffic on the eth0 interface in the weave router container namespace, with link level headers displayed:

host2:~$ sudo nsenter --target $(pidof weaver) --net tcpdump -i eth0 -e -tt tcp and port 6783
1438082113.675704 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 66: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [.], ack 12117, win 626, options [nop,nop,TS val 1844274 ecr 1843667], length 0
1438082113.941886 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9598:9867, ack 12117, win 626, options [nop,nop,TS val 1844340 ecr 1843667], length 269
1438082113.942204 3a:a4:1b:bf:23:96 (oui Unknown) > 02:42:ac:11:00:9b (oui Unknown), ethertype IPv4 (0x0800), length 66: host1.us-central1-a.postive-cocoa-90213.6783 > 172.17.0.155.40514: Flags [.], ack 9867, win 580, options [nop,nop,TS val 1843743 ecr 1844340], length 0
1438082114.248335 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1844417 ecr 1843743], length 269
1438082114.455707 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1844469 ecr 1843743], length 269
1438082114.663704 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1844521 ecr 1843743], length 269
1438082115.079708 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1844625 ecr 1843743], length 269
1438082115.911727 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1844833 ecr 1843743], length 269
1438082117.579700 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1845250 ecr 1843743], length 269
1438082120.915723 02:42:ac:11:00:9b (oui Unknown) > 3a:a4:1b:bf:23:96 (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1846084 ecr 1843743], length 269
1438082127.587741 02:42:ac:11:00:9b (oui Unknown) > 56:70:1e:fc:91:9e (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1847752 ecr 1843743], length 269
1438082140.931760 02:42:ac:11:00:9b (oui Unknown) > 56:70:1e:fc:91:9e (oui Unknown), ethertype IPv4 (0x0800), length 335: 172.17.0.155.40514 > host1.us-central1-a.postive-cocoa-90213.6783: Flags [P.], seq 9867:10136, ack 12117, win 626, options [nop,nop,TS val 1851088 ecr 1843743], length 269

Interesting – the destination MAC is fluctuating. The weave router container has been configured with the docker bridge IP as default gateway – when it wants to send a packet off-host, it will insert the corresponding MAC entry into the destination hardware address of the ethernet packet. Initially it is 3a:a4:1b:bf:23:96 – we can see we’re sending packets to $HOST1 and getting responses. The packet at 1438082114.248335 does not get a response – incidentally its timestamp matches that of the nf_ct_delete call, the conntrack DESTROY message and the peculiar iptables logging. We then see a number of further packets to $HOST1 that do not receive a response, and then for the last two packets (which are some time later) the destination MAC changes to 56:70:1e:fc:91:9e. For some reason the MAC of docker0 is changing, and before ARP in the weave router netns catches up a packet is sent to the old docker0 MAC which now happens to correspond to one of the application container veth interfaces. What gives?

It turns out that the standard behaviour of a bridge device on Linux is to assume the numerically lowest MAC of its slave interfaces, unless an address has been set explicitly – here is the relevant code from net/bridge/br_stp_if.c:

221 bool br_stp_recalculate_bridge_id(struct net_bridge *br)
222 {
223     const unsigned char *br_mac_zero =
224             (const unsigned char *)br_mac_zero_aligned;
225     const unsigned char *addr = br_mac_zero;
226     struct net_bridge_port *p;
227 
228     /* user has chosen a value so keep it */
229     if (br->dev->addr_assign_type == NET_ADDR_SET)
230         return false;
231     
232     list_for_each_entry(p, &br->port_list, list) {
233         if (addr == br_mac_zero ||
234             memcmp(p->dev->dev_addr, addr, ETH_ALEN) < 0)
235             addr = p->dev->dev_addr;
236 
237     }
238     
239     if (ether_addr_equal(br->bridge_id.addr, addr))
240         return false;   /* no change */
241     
242     br_stp_change_bridge_id(br, addr); 
243     return true;
244 }

Unless addr_assign_type (visible in /sys/class/net/docker0/addr_assign_type) is NET_ADDR_SET, the kernel loops over all ports to find the numerically lowest MAC and uses that. Consequently, whenever one of our test containers was started, there was a chance that the veth pair created to connect it to the docker0 bridge would have a lower MAC than the extant slaves, causing the bridge MAC to change; conversely, killing the container that currently held the lowest MAC would trigger a recalculation to a higher value – which would explain the change to a numerically higher MAC seen in the tcpdump.
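
The selection logic can be modelled in a few lines of Python (a sketch of the memcmp-based comparison, not the kernel code; the first two MACs are the docker0 addresses observed in the tcpdump above, the third is hypothetical):

```python
# Model of br_stp_recalculate_bridge_id's selection: absent an
# explicitly assigned address, the bridge assumes the numerically
# lowest slave MAC, compared byte-by-byte as memcmp does.
def bridge_mac(slave_macs):
    return min(slave_macs, key=lambda m: bytes.fromhex(m.replace(":", "")))

ports = ["3a:a4:1b:bf:23:96", "56:70:1e:fc:91:9e"]
print(bridge_mac(ports))            # 3a:a4:1b:bf:23:96

# A newly started container's veth happens to get a lower MAC:
# the bridge MAC changes out from under established flows...
ports.append("02:42:0a:00:00:01")   # hypothetical veth MAC
print(bridge_mac(ports))            # 02:42:0a:00:00:01

# ...and killing that container changes it back again.
ports.remove("02:42:0a:00:00:01")
print(bridge_mac(ports))            # 3a:a4:1b:bf:23:96
```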

Docker 1.6.2 and 1.7.1 use an ioctl to set the MAC address of the bridge, whereas 1.7.0 uses netlink directly – which evidently did not mark the address as explicitly assigned, leaving the recalculation behaviour above in force and leading to the problems outlined. This is why I couldn’t reproduce the problem – I had rebuilt my test VMs recently and picked up 1.7.1. If you’re using 1.7.0 you may want to upgrade – issue filed here.


Footnote – you may have noticed that the debugfs traces and iptables logging show monotonic time since boot, not epoch time like tcpdump and conntrack. I created this utility to compute an approximate offset which could be added to obtain epoch time – the following value applies to the logs in this article:

host2:~$ ./monotime-offset 
1438074436.579662598
host2:~$ ./monotime-offset 
1438074436.579662455
host2:~$ ./monotime-offset 
1438074436.579662496

If you’re aware of a more accurate method for determining this offset please let me know.
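
The essence of the utility is just sampling both clocks back-to-back – a minimal Python sketch (a simplification; the actual utility may differ in detail):

```python
import time

# Approximate the offset between epoch time (CLOCK_REALTIME) and the
# kernel's monotonic-since-boot clock (CLOCK_MONOTONIC) by sampling
# both as close together as possible; adding the offset to a trace or
# log timestamp yields approximate epoch time.
def monotime_offset():
    real = time.clock_gettime(time.CLOCK_REALTIME)
    mono = time.clock_gettime(time.CLOCK_MONOTONIC)
    return real - mono
```

As a consistency check, applying the offset above to the nf_ct_delete trace timestamp gives 1438074436.579662 + 7677.668694 ≈ 1438082114.248, which lines up with the conntrack DESTROY event at 1438082114.248538 to within a millisecond.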