Andrei Pavel · e3c1f9fd
--- a/Designs/ha-mt-proposals.md
+++ b/Designs/ha-mt-proposals.md
+# Rough Design proposals for HA with MT
+
+**OBSOLETE**: This page documents early discussions about the HA+MT bottleneck. For the actual design, see [this page](https://gitlab.isc.org/isc-projects/kea/-/wikis/designs/HA-MT-Design-for-Multi-threaded-Http-HA-traffic). This page is kept for historical reasons, as it contains useful comments.
+
+----
+
+**PROBLEM:** Enabling MT provides substantial performance gains. Unfortunately, enabling HA as well eliminates that gain: HA+MT performs roughly the same as single-threaded operation, or in some cases worse.
+
+As discussed on the [2020-07-09 call](https://pad.isc.org/p/kea-ha-perf), we came up with several proposals. Two of them, 4.1 (opening direct TCP connections between servers) and 4.3 (removing CA), were favored over the others. The goal of this document is to refine both proposals so that we can make a more informed decision between them.
+
+# 4.1 Direct TCP connections between servers
+
+Engineers who volunteered to come up with the proposal description: @razvan, @fdupont
+
+# 4.3 Eliminating CA
+
+Engineers who volunteered to come up with the proposal description: @marcin
+
+Kea was initially single-threaded, and for several years into the project we still considered a multi-process architecture a viable way to achieve parallelism in DHCP traffic processing. It was tempting because Kea has supported lease database backends since its early days, and connecting multiple servers to a single database backend would solve many issues with sharing lease information between multiple DHCP instances. With this option in mind, it made sense to have a single interface (process) to the external world that would receive control commands and distribute them to multiple servers. The Kea Control Agent provides such an interface. The choice of HTTP as the communication protocol was obvious, because it is in widespread use and many tools interact with it. Probably the most important aspect of using HTTP is the ability to use third-party reverse proxies to secure the communication. At the time we implemented CA, we didn't want to write and maintain our own version of TLS and the like.
+
+The situation in 2020 is slightly different. We chose the path of a multi-threaded implementation for the DHCP servers. That means that in a typical deployment we'd deal with a single instance of each Kea daemon. It is likely that multiple daemons are running at the same time, e.g. DHCPv4 + DHCPv6 + DDNS, but the number of instances is significantly lower than when parallelism is achieved by running multiple instances of the same kind. That seems to reduce the significance of having a single daemon serving as the HTTP interface and command forwarder.
+
+I am proposing that we consider removing the Kea Control Agent entirely and moving its HTTP server function to the respective Kea daemons, so that they can receive commands over HTTP POST directly. In addition, I am proposing to extend the HTTP server code (currently used in CA) to allow multiple simultaneous connections over HTTP, each connection having its own state. The connections should be persistent by default, i.e. it should be possible to reuse them for subsequent requests. This is also done today, but we allow only a single connection to the server at a time.
+
+The maximum number of connections to the server should be restricted by a configuration parameter. When the number of active connections (those currently used to perform a transaction) reaches the limit, new requests should be queued and then served when a current transaction ends and its connection becomes available.
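+
+The limit-and-queue behaviour described above can be sketched as follows. This is only an illustration under assumed names: `ConnectionLimiter` and `max_connections` are hypothetical and do not correspond to any existing Kea class or configuration parameter.
+
+```python
+from collections import deque
+
+
+class ConnectionLimiter:
+    """Caps active transactions and queues the overflow in arrival order."""
+
+    def __init__(self, max_connections):
+        self.max_connections = max_connections  # hypothetical config parameter
+        self.active = set()     # requests currently performing a transaction
+        self.pending = deque()  # requests waiting for a free connection
+
+    def submit(self, request_id):
+        """Start the request if a connection is free, otherwise queue it."""
+        if len(self.active) < self.max_connections:
+            self.active.add(request_id)
+            return True
+        self.pending.append(request_id)
+        return False
+
+    def finish(self, request_id):
+        """End a transaction and give the freed connection to the oldest queued request."""
+        self.active.remove(request_id)
+        if self.pending:
+            started = self.pending.popleft()
+            self.active.add(started)
+            return started
+        return None
+
+
+limiter = ConnectionLimiter(max_connections=2)
+started = [limiter.submit(r) for r in ("a", "b", "c")]  # "c" has to wait
+promoted = limiter.finish("a")  # the freed connection goes to "c"
+```
+
+Serving the queue in arrival order keeps waiting requests from being starved; a real implementation would also have to decide how long a queued request may wait before it times out.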
+
+Direct communication with the daemons bypasses the overhead of using the unix domain socket to forward the commands. Allowing multiple simultaneous connections increases the efficiency of the communication with the servers.
+
+Removing the Kea Control Agent reduces the number of running processes and therefore the risk of a failure. It also reduces the complexity of the installation, simplifies the configuration of the environment and, finally, lowers the risk of user errors caused by an invalid `service` parameter or the lack thereof.
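+
+To illustrate the point about the `service` parameter, here is a rough sketch of the same command built for the Control Agent and for a daemon that listens on HTTP itself. The helper names, the address and the per-daemon port `8004` are assumptions made for this example; the requests are only constructed, not sent.
+
+```python
+import json
+import urllib.request
+
+
+def build_command(name, arguments=None, service=None):
+    """Return the JSON body of a control command as bytes."""
+    command = {"command": name}
+    if arguments is not None:
+        command["arguments"] = arguments
+    if service is not None:
+        # Only needed when the Control Agent has to pick a target daemon.
+        command["service"] = service
+    return json.dumps(command).encode("utf-8")
+
+
+def build_post(url, body):
+    """Build (but do not send) an HTTP POST request carrying the command."""
+    return urllib.request.Request(
+        url,
+        data=body,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+# Today: a single CA endpoint, with the daemon selected via `service`.
+via_ca = build_post(
+    "http://192.0.2.1:8000/", build_command("config-get", service=["dhcp4"])
+)
+
+# Proposed: the daemon's own (hypothetical) endpoint, so no `service` is needed.
+direct = build_post("http://192.0.2.1:8004/", build_command("config-get"))
+```
+
+In the direct case the target is selected by the endpoint rather than by a field in the command, so a misspelled or missing `service` value can no longer misroute it.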
+
+It is also worth mentioning that in installations where Stork is in use, the Stork Agent process must be running on the monitored machine. The Stork Agent can forward commands to Kea. Currently they are forwarded via the Kea Control Agent. They could be forwarded directly, which would reduce the number of communication layers. Right now, the Stork server talks to the Stork Agent, which talks to Kea CA, which talks to the actual server. Every extra layer brings additional latency and possible communication issues.
+
+The effect of such a solution would be that the DHCP servers (and other servers) would end up having two different command channels, unix domain socket and HTTP, and one could pick either of them by applying the appropriate settings in the server configuration. This seems cleaner, or at least less confusing, than the current situation, in which you use CA if you need HTTP and talk to the server directly if you need the unix domain socket.
+
+With this solution, the HA implementation needs either no changes or only minimal ones, related to establishing multiple HTTP connections when needed. The format of the HA control commands remains the same.
+
+Sticking to HTTP as a communication channel for HA allows for using the same security mechanisms as used for regular commands sent via the control channel. It also avoids exposing additional endpoints (dedicated for HA) as in other solutions described here.
+
+The major drawback of the presented solution is that communication with different daemons would require that they listen on different endpoints.