MiniDNS, TCP, and the internet

We had this bug come in yesterday.

It was a bit unexpected, as we had tested this code pretty extensively while it was being developed.

The line in question was this:

client.send(tcp_response)

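In eventlet 0.17.x, this behaved like the standard socket.sendall() rather than socket.send(): it kept writing until the whole response had gone out. A later eventlet release brought send() in line with the standard library, where it may write only part of the buffer, so we needed to call sendall() explicitly.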

(It was this commit as it turns out - it was noted in the release notes here but we missed it)
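To make the difference concrete, here is a minimal sketch using only the standard library (not eventlet, and not Designate's code). The names `send_all` and `partial_send_demo` are made up for illustration. It shows that a single `send()` can accept fewer bytes than it was given, and what a loop in the spirit of `sendall()` does about it.

```python
import socket


def send_all(sock, data):
    """Keep calling send() until every byte is written, as sendall() does."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # send() returns how many bytes the kernel accepted, which may be
        # fewer than we offered.
        total += sock.send(view[total:])
    return total


def partial_send_demo(size=16 * 1024 * 1024):
    """Return (bytes_accepted, bytes_offered) for one non-blocking send()."""
    left, right = socket.socketpair()
    try:
        # Nobody reads from `right`, so the send buffer fills up quickly.
        left.setblocking(False)
        return left.send(b"x" * size), size
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    accepted, offered = partial_send_demo()
    print(f"send() accepted {accepted} of {offered} bytes")
```

On a fast local link the buffer drains quickly and a short write is rare, which would fit with the bug only showing up over a slow, long-distance connection.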

The other major problem was that the bug did not manifest itself until we pushed the AXFR over a long-range connection.

To test it, I booted a VM in West US (California, I believe), installed Designate, and populated a DB with a large zone using

generate-zone-rrsets.sh

    #!/bin/bash
    for i in $(seq 1 2000); do
        http http://IP:9001/v2/zones/ZONE_ID/recordsets name=ww$i.largetestzone.tld. records:='["10.0.0.1"]' type=A
        http http://IP:9001/v2/zones/ZONE_ID/recordsets name=txt$i.largetestzone.tld. records:='["Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam molestie leo sit amet commodo aliquet. Sed semper felis sit amet egestas euismod. Nulla non elementum orci. Nulla pharetra, ligula eget aliquet sagittis, velit nisl rhoncus nibh, vitae amet.", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam molestie leo sit amet commodo aliquet. Sed semper felis sit amet egestas euismod. Nulla non elementum orci. Nulla pharetra, ligula eget aliquet sagittis, velit nisl rhoncus nibh, vitae amet.", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam molestie leo sit amet commodo aliquet. Sed semper felis sit amet egestas euismod. Nulla non elementum orci. Nulla pharetra, ligula eget aliquet sagittis, velit nisl rhoncus nibh, vitae amet.", "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam molestie leo sit amet commodo aliquet. Sed semper felis sit amet egestas euismod. Nulla non elementum orci. Nulla pharetra, ligula eget aliquet sagittis, velit nisl rhoncus nibh, vitae amet."]' type=TXT
    done

Then, from Dublin, I ran dig to see what was happening:

dig @IP -p 5354 largetestzone.tld. AXFR

What was interesting was that the transfer had to reach its third TCP message before I started to see any issues, and that those issues did not appear when I tested from within the same geographical region.
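For context, DNS over TCP prefixes every message with a two-byte length (RFC 1035, section 4.2.2), and a large AXFR is spread over several such messages. The sketch below is only an illustration of that framing, with hypothetical helper names (`frame_dns_message`, `split_dns_stream`). It shows how a stream that was cut short leaves the client with a length prefix promising more bytes than ever arrive.

```python
import struct


def frame_dns_message(wire):
    """Prefix a DNS message with its two-byte big-endian length."""
    if len(wire) > 0xFFFF:
        raise ValueError("DNS message too large for TCP framing")
    return struct.pack("!H", len(wire)) + wire


def split_dns_stream(stream):
    """Return (complete_messages, leftover_bytes) from a TCP byte stream."""
    messages = []
    offset = 0
    while len(stream) - offset >= 2:
        (length,) = struct.unpack_from("!H", stream, offset)
        if len(stream) - offset - 2 < length:
            # The prefix promises more bytes than we have received.
            break
        messages.append(stream[offset + 2:offset + 2 + length])
        offset += 2 + length
    return messages, stream[offset:]


if __name__ == "__main__":
    # Stand-in payloads, not real DNS wire data.
    payloads = [b"a" * 10, b"b" * 20, b"c" * 30]
    stream = b"".join(frame_dns_message(p) for p in payloads)
    complete, leftover = split_dns_stream(stream[:-5])  # simulate a short write
    print(len(complete), "complete messages,", len(leftover), "bytes left over")
```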

What did we learn

  • Network stuff is hard.
  • Eventlet will change APIs.
  • Read the Eventlet release notes.

I think that we can start looking for ways to test this in the gate as well. Having a gate check that loads a large zone, adds latency and packet loss to fake a long-range connection, and then ensures we get a proper result from an API query would definitely catch real-world errors like this.
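As a starting point, here is a rough sketch of what such a check could be built from. Everything in it is an assumption rather than an existing gate job: it uses Linux `tc` with the netem qdisc to add delay and loss, the interface name and numbers are placeholders, and the function names are invented. The completeness check relies on the fact that a full AXFR begins and ends with the zone's SOA record.

```python
def netem_add_cmd(iface, delay_ms, loss_pct):
    """Build a tc command that adds delay and packet loss to an interface."""
    return ["tc", "qdisc", "add", "dev", iface, "root", "netem",
            "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"]


def netem_del_cmd(iface):
    """Build the tc command that removes the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", iface, "root", "netem"]


def dig_axfr_cmd(server, port, zone):
    """Build the same dig invocation used in the manual test."""
    return ["dig", f"@{server}", "-p", str(port), zone, "AXFR"]


def axfr_record_types(dig_output):
    """Extract the record type column from dig's answer lines."""
    types = []
    for line in dig_output.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and dig's comment lines
        fields = line.split(None, 4)  # name, TTL, class, type, rdata
        if len(fields) >= 4:
            types.append(fields[3])
    return types


def axfr_looks_complete(dig_output):
    """A complete AXFR starts and ends with the zone's SOA record."""
    types = axfr_record_types(dig_output)
    return len(types) >= 2 and types[0] == "SOA" and types[-1] == "SOA"


if __name__ == "__main__":
    # Placeholder values; a real job would run these with subprocess as root.
    print(" ".join(netem_add_cmd("eth0", 150, 1)))
    print(" ".join(dig_axfr_cmd("IP", 5354, "largetestzone.tld.")))
    print(" ".join(netem_del_cmd("eth0")))
```

A real job would also want to compare the number of records returned against what was loaded, since a transfer could in principle drop records in the middle and still end on the SOA.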

If anyone wants to implement this, just hop in to #openstack-dns on freenode :)

Patches are definitely welcome.

Other Bugs Found

So, while testing this fix, we found a new bug. I am sure it will be the focus of a new blog post (whenever we track it down), but it seems a large amount of traffic can cause eventlet's socket.sendall() to explode and not send all of the data.

eventlet-sendall-bug-logs.log

[-] Handling TCP Request from: MY_IP:50401 from (pid=60592) _dns_handle_tcp /home/graham/designate/designate/service.py:250
Policy check succeeded for rule 'all_tenants' on target {}
Unhandled exception while processing request from MY_IP:50401
 Traceback (most recent call last):
   File "/home/graham/designate/designate/service.py", line 342, in _dns_handle
     client.sendall(tcp_response)
   File "<eventlet_path>/eventlet/greenio/base.py", line 388, in sendall
     tail += self.send(data[tail:], flags)
   File "<eventlet_path>/eventlet/greenio/base.py", line 379, in send
     return self._send_loop(self.fd.send, data, flags)
   File "<eventlet_path>/eventlet/greenio/base.py", line 374, in _send_loop
     timeout_exc=socket.timeout("timed out"))
   File "<eventlet_path>/eventlet/greenio/base.py", line 203, in _trampoline
     mark_as_closed=self._mark_as_closed)
   File "<eventlet_path>/eventlet/hubs/__init__.py", line 162, in trampoline
     return hub.switch()
   File "<eventlet_path>/eventlet/hubs/hub.py", line 294, in switch
     return self.greenlet.switch()
 timeout: timed out


We decided that, as all of the DNS servers we support will retry, and Designate has retry semantics for updates, we should be OK merging this fix and finding the new bug's root cause in the near future.
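Until the root cause is found, one way to keep a timed-out send from surfacing as an unhandled exception is to catch it and lean on the peer's retry. This is only a sketch with a hypothetical helper name (`send_response`), not what Designate does, and it uses a plain standard library socket with a timeout in place of eventlet's green socket.

```python
import logging
import socket

LOG = logging.getLogger(__name__)


def send_response(client, payload):
    """Try to send the whole payload; return False if the socket times out."""
    try:
        client.sendall(payload)
        return True
    except socket.timeout:
        # After a timeout there is no way to know how much was sent, so the
        # caller should close the connection and let the peer retry.
        LOG.warning("Timed out sending %d bytes; relying on the peer to retry",
                    len(payload))
        return False


if __name__ == "__main__":
    left, right = socket.socketpair()
    left.settimeout(0.2)
    # Nobody reads from `right`, so a large payload cannot be fully sent.
    print(send_response(left, b"x" * (16 * 1024 * 1024)))
    left.close()
    right.close()
```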
