**This document is obsolete. It is replaced by [Free Lease Queues Design](https://gitlab.isc.org/isc-projects/kea/-/wikis/designs/free-lease-queues-design)**
[[_TOC_]]
# Backend Assisted Lease Selection Design
This is a design document describing a proposed solution to optimize the process by which Kea finds and offers available leases to a requesting DHCP client. It improves the efficiency of the lease allocation stage, in which the allocator iteratively picks candidate leases and checks whether they are available for allocation. The current approach has proven highly inefficient when address pools become exhausted or nearly exhausted.
The features described in this design are sometimes called "lease pre-allocation", because storing information about unallocated leases in the lease database is one of the core ideas for improving lease allocation performance.
The design issue for this feature is #1126.
# Basic Concepts
The new lease lookup scheme is going to shift some logic from the allocation engine to the lease backend implementations. We're migrating away from a lease allocation strategy in which the allocation engine "blindly" picks a candidate lease and then verifies that it is available, to one in which the lease backend finds and offers a candidate lease.
Even though we can't always guarantee that the lease offered by the backend can be allocated to the given client, shifting lease selection to the backend should significantly reduce the number of costly interactions between the allocation engine and the lease backend. In particular, it guarantees that if there is a free lease in a given address range, it will be offered to the allocation engine with a single query, even if it is the last available lease in the pool. We're going to call this strategy Backend Assisted Lease Selection (BALS).

In the current code, the lease backend passively responds to queries asking whether the lease picked by the allocation engine is available. In the extreme case, the allocation engine can make a number of queries equal to the size of the address pool to find the last available lease. If the pool is exhausted, it will make all these queries only to find out that there are no more available leases. With Backend Assisted Lease Selection, the allocation engine will learn that the pool is exhausted with a single query. It can then try other pools or even other subnets, if the subnets are grouped in shared networks.
The lease offered by the backend is available for allocation from the backend's perspective. There may be additional conditions to be tested on the allocation engine's side before the lease is offered to the client, e.g. checking that the lease is not reserved for another client. If any of these conditions is not met, the allocation engine sends additional queries to the lease backend asking for another lease. This makes the scheme similar to the current allocation strategy in that it can take several attempts before a lease is allocated, but the number of such attempts is brought to a minimum. Most of the time a single call should be sufficient to find and allocate a candidate lease. If the server is configured to assign in-pool host reservations, it is possible that the candidate address is reserved for another client, in which case at least two queries are needed. If the pool includes many host reservations, the number of attempts to find an available lease may grow significantly, but this should be no worse than the current situation. In fact, filling an address pool with in-pool reservations should be considered a misconfiguration; in that case it is recommended to use out-of-pool reservations, which minimize the risk of collisions or eliminate them entirely.
# Backends Compatibility
The concepts described in this design will have to be implemented for different lease database backends. Due to the complexity of the allocation engine and the variety of lease database backends supported by Kea, it is impractical to implement BALS for all backends at once, or even within a single Kea release. The likely approach is to start with the Memfile backend, followed by MySQL and PostgreSQL. Support for BALS in Cassandra may be added at the very end, or never. Therefore, we must assume the coexistence of BALS and the current lease selection strategy using the `IterativeAllocator`.
In order to take advantage of the existing allocation engine code and keep the required changes to a minimum, this design proposes that a new allocator be created (in addition to the existing `IterativeAllocator`) which will support BALS. This allocator will be similar to the `IterativeAllocator`, but instead of walking over the addresses in the address pool it will make calls to the lease backend to get the next available address.
Depending on the backend's capabilities, one allocator or the other will be selected upon the server's startup or reload. The changes in the allocation engine to support the new allocator are expected to be minimal.
# BALS Allocator
The BALS allocator, besides providing the capability for lease pre-allocation, will provide the ability to shuffle free leases. This fills a long-standing gap in the Kea implementation: handing out leases in a random order to make network scanning harder. However, both of these capabilities come at some cost in server startup time, reconfiguration time and memory consumption. This is mostly a problem for large pools, chiefly in IPv6 but also in IPv4 for /8 pools.
Suppose we have an IPv6 /64 pool and the BALS allocator is used. This pool is very large, and pre-allocating leases in it is unlikely to be beneficial, because pre-allocation is meant to improve the performance of nearly exhausted pools or pools with high utilization. Using a mixture of small and large pools therefore requires the BALS allocator to enable pre-allocation on selected pools while not using it on others.
BALS comes with a free-lease shuffling capability, but we should also consider whether it should be possible to use pre-allocation with incremental lease allocation.
In order to use the new allocator one would need to configure the server. Currently, there is no configuration knob to select an allocator other than the iterative one. The design proposes a new global parameter, `lease-allocation`, which defines the lease allocation strategy with the following enum values:
- `iterative`
- `random`
- `iterative-pre-allocation`
- `random-pre-allocation`
The last three strategies will be provided by the BALS allocator, i.e. when any of these strategies is selected, the BALS allocator will be used by the allocation engine. Even if a strategy with pre-allocation is in use, it should be possible to selectively disable it for certain pools, subnets and shared networks. This is controlled by the new `lease-pre-allocation` boolean parameter. For example, the following configuration:
```json
{
    "Dhcp4": {
        "lease-allocation": "random-pre-allocation",
        "shared-networks": [
            {
                "name": "felix",
                "lease-pre-allocation": false
            },
            {
                "name": "ivy"
            }
        ]
    }
}
```
instructs the server to use lease pre-allocation and to shuffle the pre-allocated leases; by default this applies to all pools. Pre-allocation is selectively disabled for the shared network "felix", all subnets belonging to it, and all pools belonging to those subnets; for these pools the leases will still be shuffled. The shared network "ivy" uses the default setting, so for all pools in this shared network the leases will be both shuffled and pre-allocated.
# New API Calls
```c++
uint8_t getLeaseSelectionSupport() const;
void preallocateLeases4(IOAddress range_start, IOAddress range_end);
void preallocateLeases6(IOAddress range_start, IOAddress range_end);
Lease4Ptr findAvailableLease4(IOAddress range_start, IOAddress range_end, uint64_t attempt_count);
Lease6Ptr findAvailableLeaseNA(IOAddress range_start, IOAddress range_end, uint64_t attempt_count);
// ... prefix delegation TBD ...
```
The `getLeaseSelectionSupport()` function will be implemented by all backends and will return a bitmask indicating whether the given backend supports BALS and for which lease types. Based on that, the allocation engine will be able to determine which code path to take for lease allocation. The following flags will be pre-defined:
```c++
const uint8_t LEASE_SELECTION_NONE = 0x00;
const uint8_t LEASE_SELECTION_ADDR4 = 0x01;
const uint8_t LEASE_SELECTION_NA = 0x02;
const uint8_t LEASE_SELECTION_PD = 0x04;
const uint8_t LEASE_SELECTION_ALL = 0x07;
```
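To illustrate how the engine could use this bitmask, here is a minimal, self-contained sketch; the `Allocator` class hierarchy and `makeV4Allocator` are illustrative stand-ins, not the actual Kea classes:
```c++
#include <cstdint>
#include <memory>

// Flag as proposed above.
const uint8_t LEASE_SELECTION_ADDR4 = 0x01;

// Stand-in allocator types; the real classes live in the allocation engine.
struct Allocator { virtual ~Allocator() = default; };
struct IterativeAllocator : Allocator {};
struct BalsAllocator : Allocator {};

// Select the IPv4 allocator depending on the capabilities reported by
// the backend's getLeaseSelectionSupport().
std::unique_ptr<Allocator> makeV4Allocator(uint8_t support) {
    if (support & LEASE_SELECTION_ADDR4) {
        // The backend can find available leases itself: use BALS.
        return std::make_unique<BalsAllocator>();
    }
    // Otherwise fall back to the existing iterative strategy.
    return std::make_unique<IterativeAllocator>();
}
```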
The `preallocateLeases4` and `preallocateLeases6` functions are to be triggered upon the server's first configuration, reconfiguration or configuration update, for the individual address pools for which pre-allocation should take place. As a result, the backend-specific implementations will create appropriate lease entries marking those leases that are available.
The remaining functions are used by the allocation engine to get the next available lease of a given type for a given pool. The pool is specified as a pair of IP addresses denoting the beginning and the end of the pool.
The third argument is a counter of the attempts made to get an available address for a given client/packet. It should be set to 0 upon the first attempt and incremented each time the offered address is rejected by the allocation engine and another attempt is made.
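To make the intended call flow concrete, the following hedged sketch shows how the allocation engine might drive `findAvailableLease4()`; the stand-in types and the `isReservedForOtherClient()` helper are hypothetical simplifications of the engine-side checks:
```c++
#include <cstdint>
#include <memory>
#include <string>

// Simplified stand-ins for Kea's IOAddress and Lease4Ptr types.
using IOAddress = std::string;
struct Lease4 { IOAddress addr_; };
using Lease4Ptr = std::shared_ptr<Lease4>;

// Stand-in for the lease backend implementing the proposed call.
struct LeaseMgr {
    virtual Lease4Ptr findAvailableLease4(const IOAddress& range_start,
                                          const IOAddress& range_end,
                                          uint64_t attempt_count) = 0;
    virtual ~LeaseMgr() = default;
};

// Hypothetical engine-side check, e.g. for in-pool host reservations.
bool isReservedForOtherClient(const IOAddress&);

// The counter starts at 0 and is incremented each time the engine
// rejects the candidate and asks the backend for another lease.
Lease4Ptr allocateFromPool(LeaseMgr& backend, const IOAddress& start,
                           const IOAddress& end, uint64_t max_attempts) {
    for (uint64_t attempt = 0; attempt < max_attempts; ++attempt) {
        Lease4Ptr candidate = backend.findAvailableLease4(start, end, attempt);
        if (!candidate) {
            // A null result means the backend found no available lease.
            return (Lease4Ptr());
        }
        if (!isReservedForOtherClient(candidate->addr_)) {
            // Candidate accepted; most of the time this happens on
            // the first attempt.
            return (candidate);
        }
        // Candidate rejected; retry with an incremented counter.
    }
    return (Lease4Ptr());
}
```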
# Memfile
Memfile will be the first backend to support BALS.
Currently, the backend stores information about allocated leases in a multi-index container. It is straightforward and efficient to query this container for a range of allocated leases, a single allocated lease, etc. However, the container is not designed for efficient searches of available leases: that would involve examining both the defined pools and the leases stored in the multi-index container, eliminating the allocated leases from the pool to arrive at the collection of unallocated/available leases. This process is quite inefficient, as proven by some earlier attempts we have made to implement it.
The idea to improve the process of finding an available lease is based on extending the backend with another container holding information about **available** leases. Finding an available lease in such a container is as efficient as finding an allocated lease in the existing multi-index container. However, the additional container brings some complexity both to the backend and to the server's configuration, to ensure data integrity.
## Containers
The design proposes to have two distinct containers: one holding the allocated leases (A-container) and a new one holding unallocated leases (U-container). These containers have different indexing requirements, so it would be impractical to combine information about allocated and available leases in a single container. When a lease gets allocated, it must be removed from the U-container and a new entry must be created in the A-container. Similarly, lease deletion should result in the removal of the allocated lease and the creation of a new entry in the U-container. There are also other interesting cases to consider: a lease that is expired, or expired and reclaimed. In both cases the lease is technically available for allocation, but a lease entry exists in the A-container. The following table summarizes which container holds a lease entry depending on the lease state.
| Lease state | A-container | U-container |
| ----------- | ----------- | ----------- |
| unallocated | no | yes |
| allocated | yes | no |
| expired | yes | no |
| reclaimed | yes | yes |
The `expired` lease in this case is one for which the valid lifetime has elapsed and the client failed to renew it, but the server hasn't yet reclaimed it. The lease is reclaimed via the periodic reclamation process triggered by the server according to the settings in its configuration file. The time window during which a lease is expired but not yet reclaimed therefore depends entirely on the server's configuration; in extreme cases an expired lease may never be reclaimed. It should still be possible to allocate such a lease, though.
The most efficient way to find an available lease is to get one from the U-container. In most cases, the first lease found in this container will be accepted by the allocation engine and there will be no need to reuse an expired lease from the A-container. If the lease reclamation routine is enabled, expired leases will eventually be moved to the U-container by that routine. Looking for expired leases in the A-container brings additional overhead to packet processing, but it can't be completely avoided. One case is when the lease reclamation routine is disabled. Another is when the allocation engine refuses to use the leases returned by the backend, typically because of conflicts with host reservations. In these two situations the backend assisted allocator should be able to reach for the expired leases.
The backend assisted allocator should consult the server's configuration to determine whether the lease reclamation routine is enabled. If it is disabled, the allocator should attempt to transfer expired leases from the A-container to the U-container upon each attempt to find an available lease; without this, it would be unable to offer expired leases. When lease reclamation is enabled, it is better to rely on it to transfer expired leases from the A-container to the U-container, and avoid the additional overhead in packet processing.
The `attempts_count` is used by the backend assisted allocator to determine when the transfer of expired leases should take place in cases when lease reclamation is enabled. This design proposes that the transfer take place when `attempts_count` is greater than 0. The `attempts_count` is increased by the allocation engine while trying to allocate a lease: it is set to 0 upon the first attempt, and if the allocation engine rejects the lease offered by the backend for the given client, it sets the counter to 1 and retries. Such a retry is a signal to the allocator that there may be a shortage of available leases in the U-container and that a transfer of expired leases should be attempted.
Note that the `attempts_count` is maintained per packet (and thus per thread); it is reset to 0 when a new packet comes in.
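The decision logic described above reduces to a simple predicate; a minimal sketch (the function name is illustrative):
```c++
#include <cstdint>

// Decide whether expired leases should be moved from the A-container
// to the U-container before searching for an available lease.
bool shouldTransferExpired(bool reclamation_enabled, uint64_t attempts_count) {
    if (!reclamation_enabled) {
        // No reclamation routine runs, so this is the only opportunity
        // to make expired leases available again.
        return (true);
    }
    // Reclamation normally handles the transfer; a retry (non-zero
    // counter) signals a possible shortage in the U-container.
    return (attempts_count > 0);
}
```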
The U-container must have the following indexes:
- an ordered index sorting elements by IP address, to be able to efficiently remove a lease which got allocated,
- a sequenced index, usable as a bi-directional list, that keeps the free addresses in the order in which they will be offered to the allocator.
The contents of the U-container are not persisted; they are stored in memory only. If the server is restarted, the A-container is restored from the lease file, the pools are read from the configuration storage, and finally the U-container is recreated using the combined information from the A-container and the pools' configurations.
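For illustration, one possible shape of such a container using Boost.MultiIndex, which the Memfile backend already uses for the A-container; the element type and names are hypothetical:
```c++
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/sequenced_index.hpp>
#include <cstdint>

// Hypothetical element; in Kea this would likely hold an IOAddress.
struct FreeLease4 {
    uint32_t address_;
};

namespace mi = boost::multi_index;

// Index 0: ordered by address, for efficient removal when a lease
//          gets allocated.
// Index 1: sequenced, preserving the order in which free addresses
//          are offered to the allocator (e.g. a shuffled order).
typedef mi::multi_index_container<
    FreeLease4,
    mi::indexed_by<
        mi::ordered_unique<
            mi::member<FreeLease4, uint32_t, &FreeLease4::address_> >,
        mi::sequenced<>
    >
> FreeLease4Container;
```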
## Pools Configuration
There are several different ways to configure pools to be served by a Kea server:
- Kea configuration files,
- subnet_cmds hooks library,
- cb_cmds hooks library.
A change in the pools configuration should trigger pre-allocation of leases for the updated pools. Pre-allocation may be a time-consuming operation, depending on the size of the configured pools. The server should be able to verify whether the latest configuration change affected any pools. If no pools were modified as a result of the reconfiguration, the pre-allocation step should be skipped; if only some pools were modified, pre-allocation should be performed only for those pools.
There is no single way to trigger pre-allocation that works for all the configuration methods listed above.
When the server is reloaded, it should be able to compare the old and new configurations and detect which pools have been added. The `preallocateLeasesX` functions should be executed as a final configuration step for all new pools. We could potentially make such checks during the `commit()` call on the `CfgMgr`, but a PoC would be needed to confirm whether this is a good idea.
The subnet_cmds and configuration backend add/modify subnets directly in the current configuration (not the staging configuration), which doesn't involve a commit phase. Therefore, we'd need to trigger `preallocateLeasesX` whenever they update the current configuration directly. Again, the best place to trigger pre-allocation is to be determined.
# MySQL and PostgreSQL
The SQL backends store information about allocated leases in the `lease4` and `lease6` tables. As with the Memfile backend, finding an available lease using only the information about allocated leases is inefficient, so storing information about unallocated leases is required.
The challenging part is how to efficiently populate the `lease4_avail` and `lease6_avail` tables with unallocated leases. Let's start with the considerations for the IPv4 case.
## Tables and Queries for IPv4
The available addresses are populated for individual address pools. For each such pool we have to make a query which finds "gaps" in the `lease4` table and creates the corresponding entries in the `lease4_avail` table. IPv4 addresses are represented as 4-byte integers in the database; they can be sorted in ascending order as a sequence of numbers, and the typical techniques for finding gaps in a sequence of numbers can be applied.
The following query in PostgreSQL:
```sql
SELECT avail FROM generate_series(('192.0.2.10'::inet - '0.0.0.0'::inet),
                                  ('192.0.2.100'::inet - '0.0.0.0'::inet), 1) AS avail
LEFT JOIN lease4 ON avail = lease4.address
WHERE lease4.address IS NULL;
```
selects all unallocated addresses (those not found in the `lease4` table) for the pool `192.0.2.10 - 192.0.2.100`. These addresses can be inserted into the `lease4_avail` table to mark them as available for allocation. The `ON CONFLICT DO NOTHING` construct can be used with this insert to skip duplicates if some of the entries already exist in the `lease4_avail` table.
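A minimal sketch of such an insert, assuming for simplicity a `lease4_avail` table with only an `address` column (the hashed variant described later adds a second column):
```sql
INSERT INTO lease4_avail (address)
SELECT avail FROM generate_series(('192.0.2.10'::inet - '0.0.0.0'::inet),
                                  ('192.0.2.100'::inet - '0.0.0.0'::inet), 1) AS avail
LEFT JOIN lease4 ON avail = lease4.address
WHERE lease4.address IS NULL
ON CONFLICT (address) DO NOTHING;
```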
A trigger should be added so that inserting a lease into the `lease4` table deletes the corresponding entry from the `lease4_avail` table. Similarly, deleting a lease from the `lease4` table should trigger an insertion into the `lease4_avail` table.
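A hedged PostgreSQL sketch of what such triggers could look like (PostgreSQL 11+ syntax; the names are illustrative and assume the single-column table from the sketch above; a full implementation would also need to account for lease states):
```sql
CREATE OR REPLACE FUNCTION lease4_allocated() RETURNS trigger AS $$
BEGIN
    -- The address is no longer available once a lease exists for it.
    DELETE FROM lease4_avail WHERE address = NEW.address;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION lease4_released() RETURNS trigger AS $$
BEGIN
    -- The address becomes available again when its lease is deleted.
    INSERT INTO lease4_avail (address) VALUES (OLD.address)
        ON CONFLICT (address) DO NOTHING;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lease4_insert AFTER INSERT ON lease4
    FOR EACH ROW EXECUTE FUNCTION lease4_allocated();
CREATE TRIGGER lease4_delete AFTER DELETE ON lease4
    FOR EACH ROW EXECUTE FUNCTION lease4_released();
```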
*IPv6 to be described*
## Pseudo Random Lease Selection
Randomizing the selection of available leases from the pool can be achieved by adding a hashing column to the `leaseX_avail` tables.
The following query extends the query from the previous section: besides inserting the available addresses, it also adds their MD5 digests to the table:
```sql
WITH avail_addresses AS (
    SELECT avail FROM generate_series(('192.0.2.10'::inet - '0.0.0.0'::inet),
                                      ('192.0.2.100'::inet - '0.0.0.0'::inet), 1) AS avail
    LEFT JOIN lease4 ON avail = lease4.address
    WHERE lease4.address IS NULL
)
INSERT INTO lease4_avail (address, address_hash)
SELECT avail, md5(avail::text)
FROM avail_addresses;
```
where `lease4_avail` is created as follows:
```sql
CREATE TABLE lease4_avail (
    address bigint NOT NULL PRIMARY KEY,
    address_hash text NOT NULL
);
CREATE UNIQUE INDEX lease4_avail_address_hash_unique ON lease4_avail (address_hash);
```
Then a query similar to the following can be used to retrieve the next available address:
```sql
SELECT * FROM lease4_avail
WHERE address >= 3221225994 AND address <= 3221226084
AND address_hash > md5(3221226082::text)
ORDER BY address_hash;
```
where `3221226082` is the integer representation of the last IP address returned for the given pool, and `3221225994` and `3221226084` designate the pool range.
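One detail worth spelling out: when the last returned hash is the greatest hash in the range, the query above yields no rows, so the selection has to wrap around to the smallest hash. A possible sketch, with `LIMIT 1` so that only a single candidate is fetched per query:
```sql
-- First try: the next candidate after the previously returned hash.
SELECT address FROM lease4_avail
WHERE address >= 3221225994 AND address <= 3221226084
  AND address_hash > md5(3221226082::text)
ORDER BY address_hash
LIMIT 1;

-- If the above returns no rows, wrap around to the smallest hash.
SELECT address FROM lease4_avail
WHERE address >= 3221225994 AND address <= 3221226084
ORDER BY address_hash
LIMIT 1;
```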
This is an example of the order in which the addresses from 192.0.2.10 to 192.0.2.100 are returned with such a hashing algorithm:
```
address
-------------
192.0.2.71
192.0.2.91
192.0.2.70
192.0.2.90
192.0.2.42
192.0.2.47
192.0.2.23
192.0.2.72
192.0.2.76
192.0.2.51
192.0.2.28
192.0.2.38
192.0.2.64
192.0.2.14
192.0.2.36
192.0.2.53
192.0.2.43
192.0.2.21
192.0.2.50
192.0.2.60
192.0.2.87
192.0.2.99
192.0.2.17
192.0.2.57
192.0.2.78
192.0.2.37
192.0.2.100
...
```
## Considerations About Multiple Kea Instances
It appears to be a common use case to have multiple Kea servers connected to the same database, sharing lease information in the `lease4` and `lease6` tables. The SQL trigger mechanism addresses the problem of consistency between the `leaseX` and `leaseX_avail` tables during normal operation. The aspect which requires some consideration is the servers' startup and reconfiguration phases.
Each server connected to the database is configured with a set of address pools. The servers may be configured with different, identical or overlapping address pools. A server connecting to the lease database is unaware of whether any other server is connected to it and what pools that server is configured to serve addresses from. The connecting server is interested in "pre-allocating" the leases for the pools it is configured to serve, but it must not delete any existing entries in the `leaseX_avail` tables, as they might have been inserted by another Kea instance according to its configuration. The connecting server should traverse the addresses in its own pools and create the corresponding entries in the `leaseX_avail` tables. If an entry already exists (because the pools of the two servers overlap), the server skips to the next address, leaving the entry in place.
*Note: skipping an existing entry upon insert can be achieved with the `ON CONFLICT DO NOTHING` construct in PostgreSQL.*
When the server terminates it must not delete any entries from the `leaseX_avail` tables, because another server may be using them. This may obviously lead to a dirty database containing extraneous data for leases which belong to address pools not configured on any server. If this is a concern, it is possible to introduce a configuration parameter which enables cleanup of the stale entries. The cleanup may only be enabled in deployments where a single server is connected to the database at a given time.
# Tradeoffs
Pre-allocation of leases should significantly reduce the time to find the next available lease when pools are getting exhausted. However, it also has some tradeoffs. The server must go over all IP addresses within each configured pool and check whether there is a valid lease for each address. This costs additional time upon the server's startup, when new pools are added, when existing pools are modified via a server reload, and when pools are updated from the Configuration Backend. The larger the pool, the more time it takes to pre-allocate the leases in it. If there are many pools, or the pools are large, this may lengthen the server's startup time. We currently have no data on how much longer the server would take to start up as a function of the number of addresses in the pools; appropriate experiments are to be done.
In many cases the problem of degrading performance in nearly exhausted pools appears in smaller pools, e.g. /24 or /16, simply because they are smaller. Very often, new subnets and shared networks are added to extend the address space in a particular network segment. Pre-allocation is mostly useful in such cases, when there are many small pools from which an address is to be picked. The problem is less common in larger pools, e.g. /8, while pre-allocation in large pools takes considerably more time. Therefore, it seems to be a good idea to be able to disable pre-allocation at the pool level. The appropriate configuration knob should be available at the pool, subnet, shared network and global levels, and its default value should be "off" (pre-allocation disabled).
Creation or deletion of a lease (including lease updates via HA) will also have some additional overhead related to the extra operations on the U-container. This, however, should not be noticeable in the case of the Memfile backend, as these operations are conducted in the server's memory.
We're still debating whether the SQL backends should have dedicated SQL tables holding available leases, or whether the available leases should be stored in memory as in the case of the Memfile backend. The impact on the duration of lease operations will depend on that choice.
# Implementation Tasks
* [ ] Configuration knob: enable BALS at the global, shared network, subnet and pool levels (2d)
* [ ] Update MySQL CB with new config knobs (2d)
* [ ] Trigger pre-allocation during reload (3d)
* [ ] Trigger pre-allocation for subnet_cmds (3d)
* [ ] Trigger pre-allocation for CB (3d)
* [ ] BALS support in Memfile (5d)
* [ ] Create new allocator supporting BALS (5d)
* [ ] BALS support in MySQL (5d)
* [ ] BALS support in PostgreSQL (5d)
* [ ] Collision avoidance with host reservations (4d)
Total: 37 days
Without SQL: 27 days
This does not include the required testing effort, both functional and performance.