Hooks ParkingLot performance issues cause problems in HA/MT
The crux of the matter is that, depending on how things are structured architecturally, it is possible for DHCP work to be faster than the HA Http work. If the former is allowed to proceed independently of the latter, the queue (i.e. the parking lot) of pending Http work grows. The parking lot is currently implemented as a std::list, which requires sequential searches. This becomes a problem when the DHCP rate is sufficiently faster than the Http rate and the Http work begins to back up. The larger the backlog, the more costly each search.
One way to mitigate the cost of the searches is to replace std::list with std::map, which turns each lookup from a linear scan into a logarithmic one. This helps MT mode code, primarily in times of heavy load or waves (such as what perfdhcp does in avalanche mode). I've attached a patch which does this. Ignore the code added under the PARK_INST conditional compile; that was for other diagnostic work.
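To make the difference concrete, here is a minimal sketch of the two lookup strategies. It is not the attached patch and not the actual ParkingLot code: `Packet`, `ParkingInfo`, `ListLot`, and `MapLot` are hypothetical names, and keying the map by the raw pointer address is an assumption made purely for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <map>
#include <memory>

// Hypothetical stand-in for a parked packet; not a Kea type.
struct Packet { int id; };
using PacketPtr = std::shared_ptr<Packet>;

// Hypothetical per-object bookkeeping kept by the lot.
struct ParkingInfo {
    PacketPtr packet;
    int refcount;
};

// List-backed lot: every lookup walks the list, so cost grows
// linearly with the number of parked objects.
class ListLot {
public:
    void park(const PacketPtr& p) { parked_.push_back(ParkingInfo{p, 1}); }

    bool drop(const PacketPtr& p) {
        auto it = std::find_if(parked_.begin(), parked_.end(),
            [&p](const ParkingInfo& info) { return info.packet == p; });
        if (it == parked_.end()) {
            return false;
        }
        parked_.erase(it);
        return true;
    }

    std::size_t size() const { return parked_.size(); }

private:
    std::list<ParkingInfo> parked_;
};

// Map-backed lot keyed by the object's address (an assumption):
// lookup cost grows logarithmically with the number of parked objects.
class MapLot {
public:
    void park(const PacketPtr& p) {
        parked_.emplace(p.get(), ParkingInfo{p, 1});
    }

    bool drop(const PacketPtr& p) { return parked_.erase(p.get()) > 0; }

    std::size_t size() const { return parked_.size(); }

private:
    std::map<const Packet*, ParkingInfo> parked_;
};
```

Both classes expose the same `park`/`drop` interface, so the container swap stays invisible to callers. Only the cost of finding a parked object changes.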
ST mode does not suffer from this, even though it uses parking, because it naturally rate-limits the DHCP work, so you can't really establish a big backlog.
Imagine if our DHCP layer(s) didn't DROP inbound client packets, i.e. had infinite socket buffers: under severe load we would build an insurmountable backlog. This is where the whole notion of congestion handling comes in.
You can think of the parking lot of pending Http work in a similar fashion. At the moment, the implementation allows an infinite amount of parking, so one can accrue an insurmountable backlog. Bear in mind that ANY hook library that uses packet parking could experience this. It is a characteristic of packet parking, not of HA.
There are lots of ways to improve or alter this. Here are some suggestions (not listed in any order other than as they came to mind):
- Don't park. If the DHCP worker threads own and do the Http work synchronously prior to starting a new DHCP client packet, the problem does not exist. In other words, the leasesX_committed callback becomes a blocking callback. This would help HA but do nothing for other hook libs that park.
- Establish some sort of flow control between DHCP and parked work: if parking reaches some threshold, pause DHCP work.
- Limit the parking lot to a configurable maximum size and use some sort of algorithm to drop packets, the simplest being to drop the oldest ones in favor of newer ones. Other policies are possible, such as discarding parked objects which are beyond some age limit.
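The size-limit and flow-control suggestions above can be sketched together as follows. This is a minimal illustration under stated assumptions, not how any existing ticket implements it: `BoundedLot` and all of its members are hypothetical, packets are represented by plain integer ids, and a `max_size` of 0 is taken to mean "unlimited".

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>

// Hypothetical bounded parking lot with a drop-oldest policy.
class BoundedLot {
public:
    explicit BoundedLot(std::size_t max_size) : max_size_(max_size) {}

    // Parks an id. If the lot is full, the oldest parked id is dropped
    // first. Returns true when such an eviction happened.
    bool park(int id) {
        if (index_.count(id) > 0) {
            return false;  // already parked, nothing to do
        }
        bool evicted = false;
        if (max_size_ > 0 && order_.size() >= max_size_) {
            index_.erase(order_.front());
            order_.pop_front();
            ++dropped_;
            evicted = true;
        }
        order_.push_back(id);
        index_[id] = std::prev(order_.end());
        return evicted;
    }

    // Removes an id once its pending work is done. Returns false if
    // the id is not parked (for example, because it was evicted).
    bool unpark(int id) {
        auto it = index_.find(id);
        if (it == index_.end()) {
            return false;
        }
        order_.erase(it->second);
        index_.erase(it);
        return true;
    }

    // Flow-control hint: callers could pause new DHCP work while true.
    bool aboveThreshold(std::size_t threshold) const {
        return order_.size() >= threshold;
    }

    std::size_t size() const { return order_.size(); }
    std::size_t dropped() const { return dropped_; }

private:
    std::size_t max_size_;
    std::size_t dropped_ = 0;
    std::list<int> order_;  // oldest at the front
    std::unordered_map<int, std::list<int>::iterator> index_;
};
```

The list keeps arrival order so the oldest entry is always at the front, while the index gives constant-time lookup on average. An age-based policy would need a timestamp stored per entry instead of relying on position alone.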
This ticket is a duplicate of #1307 (closed), which implements the third option, i.e. a limit on the parking lot size. Some comments below.