|
|
# Receiver queue and thread
|
|
|
|
|
|
## Background
|
|
|
|
|
|
Kea processes packets sequentially in plain FIFO order. This works well until there is more traffic than Kea can handle. In that case Kea keeps working through the backlog of packets, even though some of them are old and the clients that sent them may have already given up waiting.
|
|
|
|
|
|
What's even worse, Kea continues processing the oldest packets first, which can trigger an avalanche effect: Kea responds first to packets whose clients have already given up waiting and retransmitted, and those retransmissions add yet more packets to the queue.
|
|
|
|
|
|
There is a mechanism in the DHCP protocol that mitigates the problem to some degree (retransmissions are supposed to carry the same transaction-id, so a client can match a response to its oldest transmission even after sending newer retransmissions), but it only mitigates the problem a bit. Imagine a client that transmitted a packet, did not get a response in time, and retransmitted. Kea eventually processes the original packet and sends back a response. The client gets its configuration and is happy. However, Kea still has the retransmitted packet waiting for processing; it will spend cycles on it and eventually produce a response nobody is waiting for anymore.
|
|
|
|
|
|
Note that this mechanism is not about improving raw performance; it is about making Kea behave better in overload scenarios.
|
|
|
|
|
|
## Problem statement
|
|
|
|
|
|
Copied from Trac ticket #5611 [https://kea.isc.org/ticket/5611]. As this is about the design, see #5555 [https://kea.isc.org/ticket/5555] for tentative code. The GitLab issue is #42 (although the code has probably not been ported from the Trac repository yet).
|
|
|
|
|
|
The current Kea implementation processes the inbound socket buffer as a simple queue - first in, first out. When the server is under pressure and not handling client packets as fast as they are arriving, a backlog will build up.
|
|
|
|
|
|
If the situation continues for long enough, the client packets that the server is handling will have already timed out on the client side, so it is pointless to spend time processing them. Moreover, wasting time on these old packets prevents the server from handling newer packets until they too have timed out. Effectively, the server stops responding to active clients because it never gets through the backlog fast enough to reach the most recent arrivals.
|
|
|
|
|
|
Even though the initial spike in traffic may have subsided, the degraded performance can cause clients to change their behaviour, adding retries to the backlog and/or reverting to initial discovery - thus increasing the backlog of packets to be processed and making recovery unlikely without restarting the server to clear things down.
|
|
|
|
|
|
We need to handle this situation better, so that even when swamped, Kea servers process a proportion of recently received client packets rather than none of them because the server is 'stuck' on the oldest ones.
|
|
|
|
|
|
Suggestions mooted so far involve an independent socket-reading thread (or process) that manages the inbound traffic and pulls it off the sockets/interfaces on which the Kea server is listening. This would prevent the UDP buffers from overflowing, and would also allow the socket reader to apply better logic for:
|
|
|
|
|
|
* discarding the oldest client packets in favour of the most recently received
|
|
|
* managing the 'waiting' buffers appropriately to the throughput capacity of the server
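The discard logic above can be illustrated with a small sketch. This is not Kea code; the names and the "maximum age" policy are illustrative only. The idea is that each packet is stamped on arrival, and the consumer silently drops anything older than a client would plausibly still be waiting for:

```cpp
#include <chrono>
#include <deque>

// Illustrative sketch (not Kea's actual types): stamp each packet on
// arrival so the consumer can skip anything the client has given up on.
struct QueuedPacket {
    std::chrono::steady_clock::time_point received;
    int payload;  // stand-in for the real packet object
};

// Pops the next packet young enough to be worth answering, discarding
// stale ones along the way. Returns false when the queue is exhausted.
bool popFresh(std::deque<QueuedPacket>& q,
              std::chrono::milliseconds max_age,
              QueuedPacket& out) {
    auto now = std::chrono::steady_clock::now();
    while (!q.empty()) {
        QueuedPacket p = q.front();
        q.pop_front();
        if (now - p.received <= max_age) {
            out = p;
            return true;
        }
        // else: the client has almost certainly timed out; drop silently
    }
    return false;
}
```

The threshold would need to be tunable, for the reasons discussed below.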
|
|
|
|
|
|
Maximum per-server throughput is highly dependent on both configuration and the choice of back end (e.g. database or memfile, and if database, how and where, etc.) - so it would be good for the I/O handler to be tunable too: not discarding too soon for a fast server, and so on.
|
|
|
|
|
|
There is no clear operational mitigation strategy for this, other than ensuring sufficient headroom when provisioning so that no peak in client traffic can overwhelm the servers' maximum capacity.
|
|
|
|
|
|
(Notably, increasing inbound UDP buffers is likely to make the situation worse rather than better.)
|
|
|
|
|
|
## Proposed solutions
|
|
|
|
|
|
1. edit the ticket to explain why playing with the SO_RCVBUF socket option is not a good idea and why it won't work. `done`.
|
|
|
|
|
|
2. move the interface socket scan from receive[46] to a thread, replacing it with a watch socket scan.
|
|
|
|
|
|
3. use a dedicated watch socket with a common message buffer to signal errors from the thread.
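A watch socket of the kind item 3 describes can be sketched as follows. This is a POSIX-only illustration, not Kea's actual `util::WatchSocket`: a `socketpair()` where the producer thread writes one byte to raise a flag, and the consumer sees the read end become ready in its `select()` call:

```cpp
#include <cstdint>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative watch socket (POSIX, not Kea's real implementation): the
// thread writes a byte to signal a condition (e.g. "error pending") and
// the main loop sees the read end as ready in select().
class WatchSocketSketch {
public:
    WatchSocketSketch() {
        int fds[2];
        (void)socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);  // real code checks errors
        read_fd_ = fds[0];
        write_fd_ = fds[1];
    }
    ~WatchSocketSketch() {
        close(read_fd_);
        close(write_fd_);
    }

    // Producer side: raise the flag.
    void markReady() {
        uint8_t b = 1;
        (void)write(write_fd_, &b, sizeof(b));
    }

    // Consumer side: drain the byte so the socket reads as not-ready again.
    void clearReady() {
        uint8_t b;
        (void)read(read_fd_, &b, sizeof(b));
    }

    // Non-blocking readiness check using select() with a zero timeout.
    bool isReady() const {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(read_fd_, &fds);
        struct timeval tv = {0, 0};
        return select(read_fd_ + 1, &fds, 0, 0, &tv) > 0;
    }

    // The fd to add to the main loop's select() set.
    int getSelectFd() const { return read_fd_; }

private:
    int read_fd_;
    int write_fd_;
};
```

The "common message buffer" from item 3 would sit alongside this: the thread stores the error text, then marks the socket ready so the main loop knows to read it.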
|
|
|
|
|
|
4. manage a ring buffer filled by the new thread and the 4o6 reader. New packets will be signaled using the watch socket. The consumer is the receive[46] tail code.
|
|
|
|
|
|
5. add a ring buffer of received packets, protected by a lock (push by producers, pop by the consumer). Note that Boost provides a class template for circular (aka ring) buffers.
|
|
|
|
|
|
6. add a configuration parameter for the ring buffer size (a suitable default and a sizing guideline are required).
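Such a knob might look like the following in the server configuration. The parameter name `packet-queue-size` and the default of 500 are purely hypothetical placeholders, not actual Kea syntax:

```json
{
  "Dhcp4": {
    "packet-queue-size": 500
  }
}
```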
|
|
|
|
|
|
7. organize the receiver thread to scan interfaces receiving one packet per socket per loop (i.e. continue to the next socket instead of breaking at the first ready one).
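The scan policy of item 7 can be sketched as below (POSIX, illustrative only): after `select()` reports readiness, read one datagram from every ready socket and continue, instead of returning at the first ready socket. This keeps one busy interface from starving the others:

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

// Illustrative scan pass (not Kea code): read ONE datagram from EVERY
// ready socket, continuing to the next socket rather than breaking out
// at the first ready one. Returns the number of packets read this pass.
int scanOnce(const std::vector<int>& fds) {
    fd_set readset;
    FD_ZERO(&readset);
    int maxfd = -1;
    for (int fd : fds) {
        FD_SET(fd, &readset);
        if (fd > maxfd) {
            maxfd = fd;
        }
    }
    struct timeval tv = {0, 0};  // poll; the real thread would block here
    if (select(maxfd + 1, &readset, 0, 0, &tv) <= 0) {
        return 0;
    }
    int count = 0;
    for (int fd : fds) {
        if (FD_ISSET(fd, &readset)) {
            char buf[1500];
            if (recv(fd, buf, sizeof(buf), 0) > 0) {
                ++count;  // in Kea this packet would be pushed to the ring
            }
            // note: no break - move on to the next ready socket
        }
    }
    return count;
}
```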
|
|
|
|
|
|
8. add a (third) watch socket to signal to the thread when to terminate (typically when the interface manager closes all sockets).
|
|
|
|
|
|
9. recode the watch socket's isReady() to use FIONREAD in place of select() (this should be far faster and avoids nested select() calls). Small ticket candidate.
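The FIONREAD variant from item 9 can be sketched as follows (POSIX, illustrative only): instead of building an fd set and calling `select()` just to learn whether a byte is pending, ask the kernel how many bytes are queued:

```cpp
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative FIONREAD-based readiness check (not Kea's actual code):
// one ioctl() instead of a select() nested inside the main select() loop.
bool isReadyFionread(int fd) {
    int pending = 0;
    if (ioctl(fd, FIONREAD, &pending) < 0) {
        return false;  // treat errors as "not ready"; real code would log
    }
    return pending > 0;
}
```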
|
|
|
|
|
|
10. address multiple packets per buffer in BPF. Small ticket candidate for a reduced version: return the last packet instead of the first.
|
|
|
|
|
|
11. discuss with QA (Quality Assurance) which statistics to maintain and, of course, the performance impact.