# Kea issues
https://gitlab.isc.org/isc-projects/kea/-/issues · 2024-03-14T15:02:26Z

## Clarify application of the ha-scopes command in the actual deployments
https://gitlab.isc.org/isc-projects/kea/-/issues/3290 · Marcin Godzina · 2024-03-14T15:02:26Z

The `ha-scopes` command can modify a server's scopes without changing its role and other HA parameters.
It can be a powerful tool, but its use can put the server in a state that will be very confusing for the administrator.
I think this command requires more documentation and warnings about its usage.
For example:
We have a hot-standby pair and send the `ha-scopes` command to the `standby` server, enabling the scopes of both servers.
This results in both the `primary` and `standby` servers replying to DHCP traffic, but the second server still reports itself as being in the `standby` state.
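For concreteness, the command in the example above could be sent to the standby's control channel as JSON. This sketch only builds the payload; the server names `server1`/`server2` are placeholders, and actually delivering it to the control agent is left out:

```python
import json

def build_ha_scopes_command(scopes, service="dhcp4"):
    """Build the ha-scopes control command enabling the given HA scopes.

    Enabling both servers' scopes on the standby makes it answer DHCP
    traffic for both, while its reported HA state stays unchanged.
    """
    return {
        "command": "ha-scopes",
        "service": [service],
        "arguments": {"scopes": list(scopes)},
    }

# Enable both servers' scopes on the standby (names are examples):
cmd = build_ha_scopes_command(["server1", "server2"])
payload = json.dumps(cmd)
```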
This can lead to massive confusion for administrators.

Milestone: kea2.6.0

## status-get command must return an HA relationship identifier
https://gitlab.isc.org/isc-projects/kea/-/issues/3266 · Marcin Siodelski · 2024-02-26T17:09:03Z

We now support the hub-and-spoke setup with multiple relationships in one server. The HA state can be retrieved using the `status-get` command. The problem is, though, that the `status-get` result lacks an association between the returned local/remote entries and the configured relationships. This makes it nearly impossible to match the returned statuses with the relationships we maintain in the Stork database. The `status-get` response must return identifiers of the HA relationships to enable this matching.

Milestone: kea2.5.7 · Assignee: Marcin Siodelski

## Unattended terminated state and a reboot
https://gitlab.isc.org/isc-projects/kea/-/issues/3250 · Marcin Siodelski · 2024-02-29T14:40:00Z

Consider the following case. The clocks on two HA-enabled servers diverge and the clock skew eventually exceeds 60 seconds. As a result, both servers transition to the terminated state. In this state, the servers continue serving DHCP clients but do not exchange lease updates or heartbeats. An administrator neglects to correct the clocks and one of the servers reboots.
The server enters the `waiting` state and remains there, hoping that the other server is restarted so that they can resume lease database synchronization and return to normal operation. However, the server is unaware that its reboot was not triggered in the course of fixing the clocks, so it will wait for its partner endlessly (or until the administrator comes to work in the morning). Until then, the waiting server does not respond to DHCP traffic.
This situation should not occur in a setup where NTP is enabled. It also should not occur if there is proper monitoring that detects diverging clocks early enough. However, there is a chance this situation may happen when all of this is neglected.
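The 60-second skew threshold described in the scenario above can be sketched as a simple rule. The state names follow the issue text; the function is an illustration, not Kea's actual state machine:

```python
# Once the measured clock skew between HA partners exceeds 60 seconds,
# both servers are expected to enter the "terminated" state.
MAX_CLOCK_SKEW_SECONDS = 60

def next_state(current_state, clock_skew_seconds):
    """Return the HA state after evaluating the partner's clock skew."""
    if abs(clock_skew_seconds) > MAX_CLOCK_SKEW_SECONDS:
        return "terminated"
    return current_state
```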
The proposed solution is to apply a timeout (it could even be several to ten minutes long) for a server in the waiting state. If its partner's transition does not occur before this timeout elapses, the server in the waiting state transitions back to the terminated state and continues serving clients. The waiting server MUST NOT transition back to the terminated state immediately after it detects that its partner is in the terminated state, to allow the administrator enough time to reboot the servers sequentially after correcting the clocks.

Milestone: kea2.5.7

## DHCPRELEASE and lease expiration in active-standby HA setup
https://gitlab.isc.org/isc-projects/kea/-/issues/3246 · Peter Davies · 2024-02-12T09:40:26Z

DHCPRELEASE lease expiration in an active-standby HA setup.
Kea 2.5.5
When a client sends a DHCPRELEASE message to a Kea primary HA server, the expired
lease processing settings are honoured.
However, the primary updates the failover server with instructions to delete the
lease.
This leads to a divergence of lease data between the two servers.
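To illustrate the divergence, here is a toy model (not Kea code; the address is an example): the primary keeps the released lease for expired-lease processing, while the update it forwards tells the peer to delete the lease outright.

```python
# Two in-memory "lease databases", initially in sync.
primary = {"192.0.2.10": {"state": "active"}}
standby = {"192.0.2.10": {"state": "active"}}

def release_on_primary(leases, address):
    # Expired-lease processing keeps the lease around for reclamation.
    leases[address]["state"] = "released"

def ha_update_on_standby(leases, address):
    # The forwarded HA update instructs the peer to delete the lease.
    leases.pop(address, None)

release_on_primary(primary, "192.0.2.10")
ha_update_on_standby(standby, "192.0.2.10")
# The two databases now disagree about the lease.
```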
[SF00001636](https://isc.lightning.force.com/lightning/r/Case/500S6000004XPRy/view)

Milestone: next-stable-2.6

## HA lease updates do not create an accounting entry in v6
https://gitlab.isc.org/isc-projects/kea/-/issues/3226 · Andrei Pavel (andrei@isc.org) · 2024-01-25T15:00:10Z

In v6, HA lease updates are done with the `lease6-bulk-apply` command, which is not handled in the `command_processed` RADIUS callout.
This is unlike v4, which does create accounting entries for HA lease updates sent via `lease4-update`.

Milestone: next-stable-2.6

## subnet-get commands should fetch leases for selected subnets with pagination
https://gitlab.isc.org/isc-projects/kea/-/issues/3206 · Marcin Siodelski · 2024-03-04T09:20:34Z

In HA, we use lease commands to synchronize the database. The lease commands fetch all leases with pagination. However, in the hub-and-spoke model it would be useful to fetch the leases only for selected subnets, because the relationships are partitioned by subnet. Today, all leases have to be fetched by each relationship, and those that do not belong to the relationship are discarded. This is inefficient. One thing to consider is that subnet identifiers are listed explicitly in the commands.

Milestone: kea2.5.7

## HA ignored packets cause DROP statistics counter increment
https://gitlab.isc.org/isc-projects/kea/-/issues/3125 · Darren Ankney · 2024-01-22T14:00:37Z

HA_BUFFER6_RECEIVE_NOT_FOR_US increments drop counters.
- This happens at least with a load-balancing configuration.
- I think maybe not with hot-standby, since I don't think the service logs anything or cares about incoming client packets unless it loses contact with the HA peer?
- I cite BUFFER6 above, but I'm sure the same is true for DHCPv4.
Possible solutions:
- Introduce a new drop status that could be discounted later, or counted as part of a different drop statistic?
- Introduce a new status indicating that the packet is ignored or filtered, instead of dropped?
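The second option might look roughly like this toy accounting sketch. `pkt6-ignored` is a hypothetical statistic name invented here; `pkt6-receive-drop` is the existing drop statistic these packets land in today:

```python
from collections import Counter

stats = Counter()

def account_packet(disposition):
    """Record a packet outcome under a distinct statistic per disposition."""
    if disposition == "not-for-us":
        # Hypothetical new statistic: the packet belongs to the HA peer,
        # so it is ignored rather than dropped.
        stats["pkt6-ignored"] += 1
    else:
        stats["pkt6-receive-drop"] += 1

account_packet("not-for-us")
account_packet("not-for-us")
account_packet("malformed")
```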
[SF1374](https://isc.lightning.force.com/lightning/r/Case/5007V00002YkO0oQAF/view)

Milestone: next-stable-2.6

## Kea HA issue with terminating connection
https://gitlab.isc.org/isc-projects/kea/-/issues/2932 · Nick Hahn · 2023-11-10T09:50:24Z

We recently migrated our DHCP setup from dhcpd to Kea. It runs on two servers with hot standby and a memfile backend for the leases. Kea assigns IP addresses for around 7000 pools.
Over the past few months, the HA connection has terminated at random intervals. From looking at the logs on the passive node, I can see a lot of 'ResourceBusy: IP address ... could not be updated' warnings prior to the connection terminating. Since multithreading is enabled, I suspected this may be due to the threads encountering a resource lock on the memfile. I suppose that after the lease update fails a few times, the connection is terminated. Is the 'ResourceBusy' warning the cause of the terminating HA connection, and is there any way to fix the underlying issue? Any ideas on the issue are greatly appreciated.
Here are the logs from the primary server:
```
Jun 12 15:04:31 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625735366400] ALLOC_ENGINE_V4_ALLOC_FAIL_SUBNET [hwtype=1, cid=[], tid=0x0: failed to allocate an IPv4 lease in the subnet 123.123.123.123/30, subnet-id 30926, shared network (none)
Jun 12 15:04:31 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625735366400] ALLOC_ENGINE_V4_ALLOC_FAIL [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 address after 1 attempt(s)
Jun 12 15:04:31 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625735366400] ALLOC_ENGINE_V4_ALLOC_FAIL_CLASSES [hwtype=1], cid=[], tid=0x0: Failed to allocate an IPv4 address for client with classes: ALL, HA_primary-dhcp, VENDOR_CLASS_MSFT 5.0, UNKNOWN
Jun 12 15:04:39 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625726973696] ALLOC_ENGINE_V4_ALLOC_FAIL_SUBNET [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 lease in the subnet 123.123.123.123/30, subnet-id 30926, shared network (none)
Jun 12 15:04:39 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625726973696] ALLOC_ENGINE_V4_ALLOC_FAIL [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 address after 1 attempt(s)
Jun 12 15:04:39 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625726973696] ALLOC_ENGINE_V4_ALLOC_FAIL_CLASSES [hwtype=1], cid=[], tid=0x0: Failed to allocate an IPv4 address for client with classes: ALL, HA_primary-dhcp, VENDOR_CLASS_MSFT 5.0, UNKNOWN
Jun 12 15:04:45 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.ha-hooks.139625718580992] HA_LEASE_UPDATE_CONFLICT [hwtype=1], cid=[], tid=0x0: lease update to standby-dhcp (http://dhcp-2:8001/) returned conflict status code: ResourceBusy: IP address:123.123.123.123 could not be updated. (error code 4)
Jun 12 15:04:56 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625735366400] ALLOC_ENGINE_V4_ALLOC_FAIL_SUBNET [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 lease in the subnet 123.123.123.123/30, subnet-id 30926, shared network (none)
Jun 12 15:04:56 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625735366400] ALLOC_ENGINE_V4_ALLOC_FAIL [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 address after 1 attempt(s)
Jun 12 15:04:56 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625735366400] ALLOC_ENGINE_V4_ALLOC_FAIL_CLASSES [hwtype=1], cid=[], tid=0x0: Failed to allocate an IPv4 address for client with classes: ALL, HA_primary-dhcp, VENDOR_CLASS_MSFT 5.0, UNKNOWN
Jun 12 15:05:28 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625726973696] ALLOC_ENGINE_V4_ALLOC_FAIL_SUBNET [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 lease in the subnet 123.123.123.123/30, subnet-id 30926, shared network (none)
Jun 12 15:05:28 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625726973696] ALLOC_ENGINE_V4_ALLOC_FAIL [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 address after 1 attempt(s)
Jun 12 15:05:28 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625726973696] ALLOC_ENGINE_V4_ALLOC_FAIL_CLASSES [hwtype=1], cid=[], tid=0x0: Failed to allocate an IPv4 address for client with classes: ALL, HA_primary-dhcp, VENDOR_CLASS_MSFT 5.0, UNKNOWN
Jun 12 15:05:31 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625752151808] ALLOC_ENGINE_V4_ALLOC_FAIL_SUBNET [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 lease in the subnet 123.123.123.123/30, subnet-id 30926, shared network (none)
Jun 12 15:05:31 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625752151808] ALLOC_ENGINE_V4_ALLOC_FAIL [hwtype=1], cid=[], tid=0x0: failed to allocate an IPv4 address after 1 attempt(s)
Jun 12 15:05:31 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.alloc-engine.139625752151808] ALLOC_ENGINE_V4_ALLOC_FAIL_CLASSES [hwtype=1], cid=[], tid=0x0: Failed to allocate an IPv4 address for client with classes: ALL, HA_primary-dhcp, VENDOR_CLASS_MSFT 5.0, UNKNOWN
Jun 12 15:05:39 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.ha-hooks.139625701795584] HA_LEASE_UPDATE_CONFLICT [hwtype=1], cid=[], tid=0x0: lease update to standby-dhcp (http://dhcp-2:8001/) returned conflict status code: ResourceBusy: IP address:123.123.123.123 could not be updated. (error code 4)
Jun 12 15:05:39 dhcp-1 kea-dhcp4[564812]: WARN [kea-dhcp4.ha-hooks.139625718580992] HA_LEASE_UPDATE_CONFLICT [hwtype=1], cid=[], tid=0x0: lease update to standby-dhcp (http://dhcp-2:8001/) returned conflict status code: ResourceBusy: IP address:123.123.123.123 could not be updated. (error code 4)
Jun 12 15:05:39 dhcp-1 kea-dhcp4[564812]: ERROR [kea-dhcp4.ha-hooks.139625718580992] HA_LEASE_UPDATE_REJECTS_CAUSED_TERMINATION too many rejected lease updates cause the HA service to terminate
Jun 12 15:05:39 dhcp-1 kea-dhcp4[564812]: ERROR [kea-dhcp4.ha-hooks.139625718580992] HA_TERMINATED HA service terminated due to an unrecoverable condition. Check previous error message(s), address the problem and restart!
```
Here are the logs from the standby server:
```
Mar 12 19:25:06 dhcp-2 kea-dhcp4[203037]: WARN [kea-dhcp4.lease-cmds-hooks.139670034884352] LEASE_CMDS_UPDATE4_CONFLICT lease4-update command failed due to conflict (parameters: { "client-id": "", "expire": 1678688706, "force-create": true, "fqdn-fwd": false, "fqdn-rev": false, "hostname": "", "hw-address": "", "ip-address": "", "state": 0, "subnet-id": 2907, "valid-lft": 43200 }, reason: ResourceBusy: IP address:123.123.123.123 could not be updated.)
Mar 12 19:25:06 dhcp-2 kea-dhcp4[203037]: WARN [kea-dhcp4.lease-cmds-hooks.139670009706240] LEASE_CMDS_UPDATE4_CONFLICT lease4-update command failed due to conflict (parameters: { "client-id": "", "expire": 1678688706, "force-create": true, "fqdn-fwd": false, "fqdn-rev": false, "hostname": "", "hw-address": "", "ip-address": "", "state": 0, "subnet-id": 2907, "valid-lft": 43200 }, reason: ResourceBusy: IP address:123.123.123.123 could not be updated.)
Mar 12 19:27:28 dhcp-2 kea-dhcp4[203037]: WARN [kea-dhcp4.lease-cmds-hooks.139670009706240] LEASE_CMDS_UPDATE4_CONFLICT lease4-update command failed due to conflict (parameters: { "client-id": "", "expire": 1678688848, "force-create": true, "fqdn-fwd": false, "fqdn-rev": false, "hostname": "", "hw-address": "", "ip-address": "", "state": 0, "subnet-id": 3812, "valid-lft": 43200 }, reason: ResourceBusy: IP address:123.123.123.123 could not be updated.)
Mar 12 19:32:05 dhcp-2 kea-dhcp4[203037]: WARN [kea-dhcp4.lease-cmds-hooks.139670018098944] LEASE_CMDS_UPDATE4_CONFLICT lease4-update command failed due to conflict (parameters: { "client-id": "", "expire": 1678689125, "force-create": true, "fqdn-fwd": false, "fqdn-rev": false, "hostname": "", "hw-address": "", "ip-address": "", "state": 0, "subnet-id": 274, "valid-lft": 43200 }, reason: ResourceBusy: IP address:123.123.123.123 could not be updated.)
Mar 12 19:32:34 dhcp-2 kea-dhcp4[203037]: WARN [kea-dhcp4.lease-cmds-hooks.139670009706240] LEASE_CMDS_UPDATE4_CONFLICT lease4-update command failed due to conflict (parameters: { "client-id": "", "expire": 1678689154, "force-create": true, "fqdn-fwd": false, "fqdn-rev": false, "hostname": "", "hw-address": "", "ip-address": "", "state": 0, "subnet-id": 113, "valid-lft": 43200 }, reason: ResourceBusy: IP address:123.123.123.123 could not be updated.)
Mar 12 19:32:36 dhcp-2 kea-dhcp4[203037]: ERROR [kea-dhcp4.ha-hooks.139670104323840] HA_TERMINATED HA service terminated due to an unrecoverable condition. Check previous error message(s), address the problem and restart!
Mar 12 22:11:09 dhcp-2 kea-dhcp4[203037]: ERROR [kea-dhcp4.packets.139670138794688] DHCP4_BUFFER_RECEIVE_FAIL error on attempt to receive packet: Truncated DHCPv4 packet (len=0) received, at least 236 is expected.
```
The relevant config is the following on both hosts, differing only in the "this-server-name" property.
```
"hooks-libraries": [{
"library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_lease_cmds.so",
"parameters": {}
},
{
"library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_stat_cmds.so",
"parameters": {}
},
{
"library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [{
"this-server-name": "standby-dhcp",
"mode": "hot-standby",
"heartbeat-delay": 10000,
"max-response-delay": 60000,
"max-ack-delay": 5000,
"max-unacked-clients": 5,
"peers": [{
"name": "primary-dhcp",
"url": "http://dhcp-1:8001/",
"role": "primary",
"auto-failover": true
}, {
"name": "standby-dhcp",
"url": "http://dhcp-2:8001/",
"role": "standby",
"auto-failover": true
}]
}]
}
}]
```

Milestone: next-stable-2.6

## Cross-check - server should check its HA partner config
https://gitlab.isc.org/isc-projects/kea/-/issues/2897 · Tomek Mrugalski · 2023-06-15T13:50:50Z

Here's an idea for a new HA capability. On startup (or when an explicit command is called), the server retrieves its partner's configuration with `config-get` and checks it for consistency: whether the subnets and pools are defined the same way, whether the subnet-ids match, etc.
Right now the doc says those should be the same, with the only difference being server-name, but we don't check it.
What to do with spotted differences is to be determined. We could print a warning, refuse the HA connection, shut down, or perhaps even have the primary attempt to fix its partner's config.
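A minimal sketch of such a consistency check, assuming simplified `Dhcp4` config fragments (a real `config-get` response nests the configuration under `arguments`, which is omitted here):

```python
def subnet_index(config):
    """Map subnet-id -> subnet prefix from a Dhcp4 config fragment."""
    return {s["id"]: s["subnet"] for s in config.get("subnet4", [])}

def cross_check(local_cfg, partner_cfg):
    """Return (id, local, partner) tuples for subnets that differ."""
    local, partner = subnet_index(local_cfg), subnet_index(partner_cfg)
    issues = []
    for sid in sorted(set(local) | set(partner)):
        if local.get(sid) != partner.get(sid):
            issues.append((sid, local.get(sid), partner.get(sid)))
    return issues

# Example: the partner has one extra subnet the local server lacks.
a = {"subnet4": [{"id": 1, "subnet": "192.0.2.0/24"}]}
b = {"subnet4": [{"id": 1, "subnet": "192.0.2.0/24"},
                 {"id": 2, "subnet": "198.51.100.0/24"}]}
diffs = cross_check(a, b)
```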
This is merely an idea. If we like it, the first step would be to turn it into a more coherent design. Hence the ~design label.

Milestone: backlog

## HA hook's URLs should support DNS resolution with configurable re-resolution
https://gitlab.isc.org/isc-projects/kea/-/issues/2775 · Tobias Florek · 2023-04-06T13:43:06Z

---
name: Feature request
about: Allow using DNS resolution in HA hook's URLs
---
**Some initial questions**
- Are you sure your feature is not already implemented in the latest Kea version? **yes**
- Are you sure what you would like to do is not possible using some other mechanisms? **not reasonable**
- Have you discussed your idea on kea-users or kea-dev mailing lists? **no**
**Is your feature request related to a problem? Please describe.**
I am deploying HA Kea on Kubernetes where (using SDNs) pod(/container) IPs are not constant. The hostname can be made persistent though.
Now I can create a Kubernetes service per pod, which assigns a so-called cluster IP that is stable and gets redirected to the pod.
This works all right for HA communicating with the control agent, but not when using the dedicated listener.
**Describe the solution you'd like**
Preferably allow using DNS (re-)resolution for HA hook's URLs. Or allow specifying the listener's bind-address.
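A sketch of the requested re-resolution behaviour, with the resolver and clock injected so the TTL logic stays testable; nothing here is existing Kea functionality, and all names are illustrative:

```python
import time

class ReResolvingEndpoint:
    """Cache a peer's resolved address and re-resolve after a TTL."""

    def __init__(self, hostname, resolver, ttl_seconds=30.0,
                 clock=time.monotonic):
        self.hostname = hostname
        self.resolver = resolver          # callable: hostname -> IP string
        self.ttl = ttl_seconds
        self.clock = clock
        self._cached = None
        self._resolved_at = float("-inf")

    def address(self):
        now = self.clock()
        if self._cached is None or now - self._resolved_at >= self.ttl:
            self._cached = self.resolver(self.hostname)
            self._resolved_at = now
        return self._cached

# Fake resolver and clock standing in for DNS and wall time:
answers = iter(["10.0.0.5", "10.0.0.9"])
fake_clock = iter([0.0, 10.0, 40.0])
ep = ReResolvingEndpoint("kea-standby.example", lambda h: next(answers),
                         ttl_seconds=30.0, clock=lambda: next(fake_clock))
first = ep.address()    # resolves
second = ep.address()   # within TTL, served from cache
third = ep.address()    # TTL expired, re-resolves to the new pod IP
```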
**Funding its development**
Kea is run by ISC, which is a small non-profit organization without any government funding or any permanent sponsorship organizations. Are you able and willing to participate financially in the development costs? **no**
**Participating in development**
Are you willing to participate in the feature development? The ISC team always tries to make a feature as generic as possible, so it can be used in a wide variety of situations. That means the proposed solution may be a bit different than you initially thought. Are you willing to take part in the design discussions? Are you willing to test unreleased engineering code? **yes**
**Contacting you**
Preferably via GitLab.

Milestone: backlog

## RFE: HA plugin ability to detect partner inability to receive client requests and transition it to 'partner-down'
https://gitlab.isc.org/isc-projects/kea/-/issues/2714 · Kevin Fleming · 2023-07-31T14:12:57Z

---
name: Feature request
about: HA plugin ability to detect partner inabilty to receive client requests and transition it to 'partner-down'
---
**Some initial questions**
- Are you sure your feature is not already implemented in the latest Kea version? Yes
- Are you sure what you would like to do is not possible using some other mechanisms? Yes
- Have you discussed your idea on kea-users or kea-dev mailing lists? Yes
**Is your feature request related to a problem? Please describe.**
(This issue was created as a result of an extensive thread on kea-users)
When the HA plugin is being used in either hot-standby or load-balancing mode, Kea peers are able to notice some forms of communications failures and force the other peers to the 'partner-down' state in order to provide service to clients supported by the other peer.
However, in a situation where client requests are not being delivered to a peer but it is otherwise fully operational, including the peer-to-peer communications link, clients supported by that peer will not be serviced, and the other peer(s) are unable to notice the issue and take action to correct it. This situation could arise when the Kea peers use separate network links for client traffic and HA traffic, or when the Kea peers receive client traffic via a DHCP relay and the relay configuration is incorrect.
**Describe the solution you'd like**
One (or more) opt-in mechanisms that the Kea admin can choose to enhance the ability to detect peer failures to service clients, even when the peer's Kea daemon is otherwise fully operational.
**Describe alternatives you've considered**
Some discussions about external monitoring solutions have occurred, and that is certainly an option which some admins could choose.
**Funding its development**
Kea is run by ISC, which is a small non-profit organization without any government funding or any permanent sponsorship organizations. Are you able and willing to participate financially in the development costs? Yes
**Participating in development**
Are you willing to participate in the feature development? The ISC team always tries to make a feature as generic as possible, so it can be used in a wide variety of situations. That means the proposed solution may be a bit different than you initially thought. Are you willing to take part in the design discussions? Are you willing to test unreleased engineering code? Yes

Milestone: next-stable-2.6

## HA pool rebalancing
https://gitlab.isc.org/isc-projects/kea/-/issues/2708 · Tomek Mrugalski · 2023-02-02T14:23:33Z

This idea is not new. It was recently brought up by @cathya in Porto (see [notes](https://pad.isc.org/p/porto2022-kea-features-for-stork#L58)). The overall concept is to design and implement a mechanism similar to the one in ISC DHCP. When there are two servers in load-balancing mode, it is possible that one of them will run out of addresses while the other one still has many.
A couple of random comments:
- Pool rebalancing would somehow make both partners negotiate the pools and rebalance them.
- Using a hysteresis approach with high/low thresholds would prevent the mechanism from thrashing when running out of addresses. We don't want it to go crazy when there are only one or two addresses left.
- The pool dynamism would add extra complexity, as the modified pool range would need to be stored somewhere that would survive crashes, reboots, etc.
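The hysteresis idea could be sketched with a low/high water mark pair (thresholds invented for the example): rebalancing starts only when free addresses fall below the low mark and stops only once they climb back above the high mark, so the decision does not flap in between.

```python
LOW_WATER, HIGH_WATER = 10, 50   # illustrative thresholds

def update_rebalance_flag(free_addresses, rebalancing):
    """Return whether rebalancing should be active after this sample."""
    if free_addresses < LOW_WATER:
        return True
    if free_addresses > HIGH_WATER:
        return False
    return rebalancing  # in between the marks: keep the previous decision

state = False
state = update_rebalance_flag(8, state)   # below low water -> start
mid = update_rebalance_flag(30, state)    # between marks -> unchanged
done = update_rebalance_flag(60, mid)     # above high water -> stop
```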
This requires a ~design. It's a complicated feature request with a high potential for endless tweaks, conflicting tuning requests, etc.
We will do it one day, but it would require a lot of design, testing, and tuning.

Milestone: outstanding

## HA Load-Balancing Network issue detection between Relay and Kea
https://gitlab.isc.org/isc-projects/kea/-/issues/2700 · Mathias Aichinger · 2023-01-26T15:22:15Z

Hi,
I have already tried to resolve this issue with the kea users community, but it seems not many are using HA Load Balancing.
I have the following problem.
Scenario:
Multiple DHCP-Relays at different sites with both KEA-Servers as DHCP-Servers. Both servers are available and the load balancing shifts the requests between the two servers.
Incident: Because of a network issue, Kea 1 is not reachable from the clients. The network connection between Kea 1 and Kea 2 still works, so there is no partner-down state.
Expected behaviour: Kea 2 sees the unacked clients of Kea 1, puts Kea 1 in the partner-down state, and handles all requests.
Experienced behaviour: Kea 2 still reports HA_BUFFER4_RECEIVE_NOT_FOR_US and does not handle the requests. Unacked clients are not counted.
Is there a misunderstanding or configuration mistake on my side?
```
{
"library": "/usr/local/lib/kea//hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [
{
"this-server-name": "server2",
"mode": "load-balancing",
"heartbeat-delay": 10000,
"max-response-delay": 60000,
"max-ack-delay": 10000,
"max-unacked-clients": 1,
"delayed-updates-limit": 100,
"peers": [
{
"name": "server1",
"url": "http://192.168.248.1:8080/",
"role": "primary",
"auto-failover": true
},
{
"name": "server2",
"url": "http://192.168.248.2:8080/",
"role": "secondary",
"auto-failover": true
}
]
}
]
}
}
```
Thank you,
Mathias

Milestone: backlog

## Hook service must start before the first packet
https://gitlab.isc.org/isc-projects/kea/-/issues/2692 · Francis Dupont · 2024-03-18T09:23:21Z

We have a general issue with hooks that start services by posting to the server's IO service: because that IO service is only polled, the services start after the first packet has been processed.
I suggest adding a poll at the end of loadConfigFile in ctrl_dhcp[46]_srv.cc to catch late issues, e.g. a bad configuration detected when the service is launched.
Hooks using deferred start: HA, LQ, and soon RADIUS.

Milestone: kea2.5.7 · Assignee: Marcin Siodelski

## partner-down state transition when max-unacked-clients reached
https://gitlab.isc.org/isc-projects/kea/-/issues/2592 · Marcin Siodelski · 2022-11-24T14:45:22Z

Suppose the server has lost the connection with its partner. The server begins the failover procedure by checking whether or not the partner responds to DHCP queries. The `max-unacked-clients` setting controls how many different clients should retry getting a lease with an increased value of the `secs` field before the server considers the partner dead. One would expect the server to make the `partner-down` transition as soon as the number of unacked clients reaches the configured number. In fact, state transitions are generally performed when the server completes a heartbeat or a lease update. It is possible that under heavy traffic there will be a much larger number of unacked clients while the server still sits in the normal state (e.g. hot-standby), waiting for the heartbeat trigger. Assuming the heartbeat interval is reasonable, it should probably be fine. However, we may consider starting the transition as soon as the number of unacked clients reaches the configured maximum.

Milestone: backlog

## HA lease v6 updates use the default hwtype and hwaddr_source
https://gitlab.isc.org/isc-projects/kea/-/issues/2565 · Andrei Pavel (andrei@isc.org) · 2023-07-31T13:51:18Z

Notice the discrepancy in the last two columns:
* `server1`:
```
address,duid,valid_lifetime,expire,subnet_id,pref_lifetime,lease_type,iaid,prefix_len,fqdn_fwd,fqdn_rev,hostname,hwaddr,state,user_context,hwtype,hwaddr_source
2001:db8:50::11,00:03:00:01:01:03:0d:04:0b:01,4000,1663013972,1,3000,0,5946,128,0,0,,01:03:0d:04:0b:01,0,,1,0
2001:db8:50::12,00:03:00:01:01:04:0e:05:0c:02,4000,1663013972,1,3000,0,3512,128,0,0,,01:04:0e:05:0c:02,0,,1,0
2001:db8:50::d,00:03:00:01:01:05:0f:06:0d:03,4000,1663013972,1,3000,0,5918,128,0,0,,01:05:0f:06:0d:03,0,,1,2
2001:db8:50::e,00:03:00:01:01:06:10:07:0e:04,4000,1663013973,1,3000,0,4936,128,0,0,,01:06:10:07:0e:04,0,,1,2
```
* `server2`:
```
address,duid,valid_lifetime,expire,subnet_id,pref_lifetime,lease_type,iaid,prefix_len,fqdn_fwd,fqdn_rev,hostname,hwaddr,state,user_context,hwtype,hwaddr_source
2001:db8:50::11,00:03:00:01:01:03:0d:04:0b:01,4000,1663013972,1,3000,0,5946,128,0,0,,01:03:0d:04:0b:01,0,,1,2
2001:db8:50::12,00:03:00:01:01:04:0e:05:0c:02,4000,1663013972,1,3000,0,3512,128,0,0,,01:04:0e:05:0c:02,0,,1,2
2001:db8:50::d,00:03:00:01:01:05:0f:06:0d:03,4000,1663013972,1,3000,0,5918,128,0,0,,01:05:0f:06:0d:03,0,,1,0
2001:db8:50::e,00:03:00:01:01:06:10:07:0e:04,4000,1663013973,1,3000,0,4936,128,0,0,,01:06:10:07:0e:04,0,,1,0
```
The ones with `hwaddr_source = 0` are updated from the other peer. `hwtype = 1` is also likely a default that happens to match its source in the examples above.
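The mismatch can be verified mechanically from the CSV dumps above (trimmed here to two leases per server; the parsing helper is just for illustration):

```python
import csv
import io

HEADER = ("address,duid,valid_lifetime,expire,subnet_id,pref_lifetime,"
          "lease_type,iaid,prefix_len,fqdn_fwd,fqdn_rev,hostname,hwaddr,"
          "state,user_context,hwtype,hwaddr_source")

# Rows taken from the issue text above.
server1 = HEADER + """
2001:db8:50::11,00:03:00:01:01:03:0d:04:0b:01,4000,1663013972,1,3000,0,5946,128,0,0,,01:03:0d:04:0b:01,0,,1,0
2001:db8:50::d,00:03:00:01:01:05:0f:06:0d:03,4000,1663013972,1,3000,0,5918,128,0,0,,01:05:0f:06:0d:03,0,,1,2"""
server2 = HEADER + """
2001:db8:50::11,00:03:00:01:01:03:0d:04:0b:01,4000,1663013972,1,3000,0,5946,128,0,0,,01:03:0d:04:0b:01,0,,1,2
2001:db8:50::d,00:03:00:01:01:05:0f:06:0d:03,4000,1663013972,1,3000,0,5918,128,0,0,,01:05:0f:06:0d:03,0,,1,0"""

def by_address(text):
    """Index the CSV dump by lease address."""
    return {row["address"]: row for row in csv.DictReader(io.StringIO(text))}

s1, s2 = by_address(server1), by_address(server2)
mismatched = [a for a in s1
              if s1[a]["hwaddr_source"] != s2[a]["hwaddr_source"]]
```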
It looks like `Lease6::toElement()` and `Lease6Parser::parse()` need the `hwtype` and `hwaddr_source` capabilities.

Milestone: backlog

## Kea HA hot-standby mode - standby peer not catching up
https://gitlab.isc.org/isc-projects/kea/-/issues/2427 · favq · 2023-07-31T13:42:46Z

Hi,
I'm testing a Kea HA setup in hot-standby mode, with the following settings:
* Kea 2.0.1 DHCPv4 + control agent.
* Two Kea instances: one "primary" and the other "standby".
* memfile backend with file persistence enabled.
* Lease synchronization enabled in the HA setup.
* The only hooks libraries in use are ha and lease_cmds.
I ran perfdhcp simulating multiple clients against the primary. After a while of sending many requests to the primary, I see that both instances have stored leases, but the standby didn't completely catch up with the primary.
That is, when I inspect the leases on both instances using the lease4-get-all API command, I see that the number of leases did increase on both instances, but the standby has fewer leases than the primary.
If I manually call the ha-sync API command, or restart the standby, or reload its configuration, the standby performs a sync and catches up with the primary, and the lease counts become equal again. However, if I then run perfdhcp repeatedly, the standby eventually starts falling behind again.
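The comparison described here, counting leases on both peers via lease4-get-all, can be sketched as follows. The helper and the data are hypothetical, but the response shape (`result`/`arguments`/`leases` with `ip-address` entries) follows Kea's command response format:

```python
# Hypothetical helper: given the lease4-get-all responses from both peers,
# report the addresses present on the primary but missing on the standby.

def missing_on_standby(primary_resp, standby_resp):
    primary = {l["ip-address"] for l in primary_resp["arguments"]["leases"]}
    standby = {l["ip-address"] for l in standby_resp["arguments"]["leases"]}
    return sorted(primary - standby)

# Made-up example responses illustrating a standby that fell behind.
primary_resp = {"result": 0, "arguments": {"leases": [
    {"ip-address": "192.0.2.10"}, {"ip-address": "192.0.2.11"}]}}
standby_resp = {"result": 0, "arguments": {"leases": [
    {"ip-address": "192.0.2.10"}]}}

print(missing_on_standby(primary_resp, standby_resp))  # → ['192.0.2.11']
```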
Note that, when this happens, if I call the "ha-heartbeat" API command on both instances, they both report an "unsent-update-count" of 0.
A similar thing happens with DHCPv6.
Is this behavior expected? Is it normal for the standby to not catch up with the primary during HA operation, needing manual intervention ("ha-sync", restart or config reload) to catch up?
Thank you.

Milestone: outstanding

**Issue #2339: Memory leak in HA scenario with backup server down** (reported by Branimir Rajtar, 2023-09-07)

---
name: Memory leak in HA scenario with backup server down
about: Memory loss is created on running instances
---
**Describe the bug**
HA mode is configured with three servers (primary, secondary, backup) and is serving clients. When the backup server becomes unavailable, the primary and secondary experience a continuous memory leak, visible as a steady increase in RSS memory use for the isc-kea-dhcp4-server process. The size of the leak correlates directly with the number of active clients: the more clients, the faster the growth. Once the backup server is removed from the configuration or becomes reachable again, memory use stops increasing, but the already-leaked memory is not freed.
**To Reproduce**
Steps to reproduce the behavior:
1. Run Kea (DHCP4 only) in an HA scenario with two load-balancing servers (primary and secondary) and a single backup server
2. Start serving clients (40k in our scenario) and monitor RSS usage for the Kea server process
3. Disable backup server
4. Verify that RSS usage is increasing continuously
5. Enable backup server
6. Verify that RSS usage is stable
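The RSS monitoring in steps 2, 4, and 6 can be done by polling `/proc/<pid>/status` on Linux. A minimal parsing sketch (the sample text below is illustrative, not a real capture):

```python
# Parse the VmRSS line from /proc/<pid>/status; the kernel reports it in kB.

def vmrss_kb(status_text):
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # field after the label is the kB value
    raise ValueError("no VmRSS line found")

SAMPLE = "Name:\tkea-dhcp4\nVmRSS:\t  524288 kB\nThreads:\t4\n"
print(vmrss_kb(SAMPLE))  # → 524288
```

In practice one would read `open(f"/proc/{pid}/status").read()` in a loop and log the value over time to confirm the continuous growth.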
**Expected behavior**
The servers should not have any memory leaks.
**Environment:**
- Kea version: 1.8.2, 2.0.2
- OS: Ubuntu 18.04
- Memfile
- libdhcp_lease_cmds, libdhcp_stat_cmds, libdhcp_ha
**Additional Information**
```
{
"Dhcp4": {
"dhcp-queue-control": {
"enable-queue": true,
"queue-type": "kea-ring4",
"capacity": 256
},
"interfaces-config": {
"interfaces": [
"eth1"
],
"dhcp-socket-type": "udp"
},
"control-socket": {
"socket-type": "unix",
"socket-name": "/tmp/kea-dhcp4-ctrl.sock"
},
"lease-database": {
"type": "memfile",
"persist": true,
"name": "/var/lib/kea/dhcp4.leases",
"lfc-interval": 3600,
"port": 0
},
"expired-leases-processing": {
"reclaim-timer-wait-time": 10,
"flush-reclaimed-timer-wait-time": 25,
"hold-reclaimed-time": 3600,
"max-reclaim-leases": 100,
"max-reclaim-time": 250,
"unwarned-reclaim-cycles": 5
},
"renew-timer": 60,
"rebind-timer": 100,
"valid-lifetime": 120,
"option-data": [],
"hooks-libraries": [
{
"library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_lease_cmds.so",
"parameters": {}
},
{
"library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_stat_cmds.so"
},
{
"library": "/usr/lib/x86_64-linux-gnu/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [
{
"this-server-name": "server3",
"mode": "load-balancing",
"heartbeat-delay": 3000,
"max-response-delay": 7000,
"max-ack-delay": 7000,
"max-unacked-clients": 20,
"peers": [
{
"name": "server2",
"url": "http://<XXX>:8080/",
"role": "secondary",
"auto-failover": true
},
{
"name": "server1",
"url": "http://<YYY>:8080/",
"role": "primary",
"auto-failover": true
},
{
"name": "server3",
"url": "http://<ZZZ>:8080/",
"role": "backup",
"auto-failover": true
}
]
}
]
}
}
],
"option-def": [
{
"name": "classless-static-route",
"code": 121,
"space": "dhcp4",
"type": "record",
"array": true,
"record-types": "uint8, uint8"
}
],
"client-classes": [
// anonymized
],
"subnet4": [
// anonymized
],
"reservations": [],
"loggers": [
{
"name": "kea-dhcp4",
"output_options": [
{
"output": "syslog"
}
],
"severity": "error",
"debuglevel": 0
}
]
}
}
```
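A few invariants of the HA configuration above can be checked mechanically. This is an illustrative sketch, not an official validator; the constraint that `max-response-delay` exceed `heartbeat-delay` follows the Kea documentation, and the config fragment below is reduced from the one above:

```python
# Sanity-check a reduced HA config: load-balancing mode needs exactly one
# primary and one secondary, and max-response-delay must exceed
# heartbeat-delay so several heartbeats can fail before the partner is
# suspected to be down.

def check_ha(ha):
    roles = [p["role"] for p in ha["peers"]]
    assert ha["mode"] == "load-balancing"
    assert roles.count("primary") == 1 and roles.count("secondary") == 1
    assert ha["max-response-delay"] > ha["heartbeat-delay"]
    return True

ha = {
    "this-server-name": "server3",
    "mode": "load-balancing",
    "heartbeat-delay": 3000,
    "max-response-delay": 7000,
    "peers": [
        {"name": "server2", "role": "secondary"},
        {"name": "server1", "role": "primary"},
        {"name": "server3", "role": "backup"},
    ],
}
print(check_ha(ha))  # → True
```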
**Contacting you**
Email/GitHub; telephone is available after contact.

Milestone: next-stable-2.6

**Issue #2223: HA error when partner received duplicated DHCP requests with the same Transaction ID** (reported by Spencer Lowe, 2023-07-31)

**Describe the bug**
Because of our redundant topology, the same Kea server will receive the same DHCP request message with the same transaction ID from different relays. We run our Kea servers in load-balancing HA mode. Server 1 sends a lease4-update to server 2; because server 1 received the same packet twice with the same transaction ID, it sends both updates. Server 2 then responds with "resource busy" because it is still processing the lease update from the first request when the second arrives. Because server 2 errors out, server 1 puts it into the unknown state. Server 1 will then resync, but this breaks again on every duplicated DHCP request. We are storing leases in Postgres.
**To Reproduce**
Steps to reproduce the behavior:
1. Run Kea dhcpv4 in load balancing mode. Send a duplicate DHCP request with the same transaction ID to a server.
2. Server 1 will process both DHCP requests and will send the lease4-update to server 2. Both of these requests happen at almost the same time.
3. Server 1 will then put server 2 into unknown state because the lease4-update command failed.
**Expected behavior**
I would expect the update not to fail, but I am not sure how DHCP servers are supposed to handle duplicate requests with the same transaction ID.
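One defensive approach to the duplicates described above is a short-lived cache keyed on the transaction ID and client hardware address: repeats seen within the window are dropped. This is an illustrative sketch, not Kea's actual behavior:

```python
import time

class DedupCache:
    """Remember recently seen (xid, chaddr) pairs and flag repeats
    arriving within a short time window."""

    def __init__(self, window=2.0):
        self.window = window  # seconds during which a repeat is a duplicate
        self.seen = {}        # (xid, chaddr) -> last-seen timestamp

    def is_duplicate(self, xid, chaddr, now=None):
        now = time.monotonic() if now is None else now
        key = (xid, chaddr)
        last = self.seen.get(key)
        self.seen[key] = now
        return last is not None and (now - last) < self.window

cache = DedupCache(window=2.0)
print(cache.is_duplicate(0x1234, "aa:bb:cc:dd:ee:ff", now=0.0))  # → False
print(cache.is_duplicate(0x1234, "aa:bb:cc:dd:ee:ff", now=1.0))  # → True
print(cache.is_duplicate(0x1234, "aa:bb:cc:dd:ee:ff", now=5.0))  # → False
```

A real deployment would also need periodic eviction of stale entries so the cache itself does not grow without bound.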
**Environment:**
- Kea dhcp4 v 2.0.0
- OS: [e.g. Debian 11 x64]
- Storing leases in Postgres
- `libdhcp_lease_cmds`, `libdhcp_ha`
**Additional Information**
- I have attached a PCAP that shows the DHCP requests and the lease4-update command
- Attached is the kea config that runs on both servers
[ss21-dhcp-debug.pcap](/uploads/f4c0beee834c75b7131e4e485af12d47/ss21-dhcp-debug.pcap)[kea-config.json](/uploads/a5a5a89cd10a736706b93ac7eabd4e9d/kea-config.json)
**Contacting you**
slowe@clairglobal.com

Milestone: next-stable-2.6

**Issue #1914: HAServiceTest.sendSuccessfulUpdatesAuthorizedMultiThreading sometimes fails** (reported by Andrei Pavel, 2023-02-27)

This time it happened on distcheck on CentOS 8.
https://jenkins.aws.isc.org/job/kea-dev/job/distcheck/415/execution/node/136/log/?consoleFull
```
16:04:40 [ RUN ] HAServiceTest.sendSuccessfulUpdatesAuthorizedMultiThreading
16:04:40 ../../../../../../../src/hooks/dhcp/high_availability/tests/ha_service_unittest.cc:1096: Failure
16:04:40 Expected equality of these values:
16:04:40 2
16:04:40 factory3_->getResponseCreator()->getReceivedRequests().size()
16:04:40 Which is: 1
16:04:40 ../../../../../../../src/hooks/dhcp/high_availability/tests/ha_service_unittest.cc:1102: Failure
16:04:40 Value of: update_request3
16:04:40 Actual: false
16:04:40 Expected: true
16:04:40 [ FAILED ] HAServiceTest.sendSuccessfulUpdatesAuthorizedMultiThreading (2 ms)
```

Milestone: backlog