- Combined DHCP and DNS Service Management
- Discussion of these requirements is on-going at https://gitlab.isc.org/isc-projects/stork/issues/18
- Common Requirements
- Kea Monitoring
- Common Tools
- Kea Tools or Use cases
- BIND Status and Activity
- BIND Performance Details
- BIND troubleshooting use cases
- Application Infrastructure
- User Interface - List of Pages/panes from discussion in WAW
- Significant Questions/Issues
Combined DHCP and DNS Service Management
Stork is intended to provide a central point for monitoring and managing BIND 9 and Kea servers and, eventually, DHCP and DNS services in general. Ultimately we want to be able to manage BIND 9 and Kea together in order to address the DDNS issues, so we would like to include BIND support from the first release.
Release 1 - Focus on Monitoring, Health, Activity, Performance (could be 1.0, or 0.8...)
The main benefit in the first release will be aggregating and presenting selected data useful for common operational management of DHCP and DNS services and servers. In later releases we would like to also manage configuration of these servers, but we are starting with monitoring. In this initial release we will need to establish an extensible, high performance infrastructure for collecting, storing and analyzing DNS and DHCP activity data.
Northbound interfaces - API and GUI
Ideally, the northbound interfaces will support both a graphical display (via an ordinary web server) and an API. When we get to the point that we are managing configurations, users will certainly want APIs to integrate with their provisioning systems. In the context of the first release, there may be a native GUI as well as an API to a data visualization tool such as Grafana, and possibly another to Nagios, Cacti, or another fault management and/or alerting system. I expect traditional BIND users will be less interested in a BIND-specific GUI. Kea users have had less time to develop their own systems and will be more interested. Kea users also have relatively more interest in a GUI to drive configuration changes because of the lack of alternatives.
Minimum useful product
- Kea users - Anterius alternative, with pool utilization monitoring at a minimum.
- BIND users - solve one or more use cases wrt metrics not easily accomplished with existing tools. Either performance tuning, zone update timing, zone signing timings or cache management.
- Roles: admin (can do all possible actions), user (has limited access)
|1.1||IPs||As an admin I can add Kea and BIND instances (their addresses and some metadata). I also can browse them and search them by particular fields.||1|
|1.2||Application(s) installed||User would like to see what applications are installed on each server (dhcpv4, dhcpv6, ddns, Kea hooks, MySQL/Postgres/Cassandra backend, BIND 9). Presumably there can be more than one application per server; certainly there can be multiple Kea hooks installed and configured. Includes service IP.||1|
|1.3||Application role||User would like to be able to quickly see the server's role in the network (Kea: failover primary, failover secondary, failover lb, lease backend, host backend, config backend; BIND: zone master, secondary, hidden master, ...). Configured by the Stork admin.||1|
|1.4||Software versions||OS version running, application version, build#. discovered. Many users will have their own build systems, or use multiple OSes, so the 'version' field may need to include OS, build#, etc & we need to allow for version #s that do not match ISC version numbers in case of OS packages with different numbering systems. Config flags the image was built with. Hooks loaded (Kea hooks, BIND hooks, BIND RPZ plug-in)||DB backend versions out of scope for 1.0|
|1.5||ISC application operational status||running/not running, date/time of last reload/reboot, uptime since last reload (computed)||1|
|1.6||Platform status||Do we need OS and/or platform operational status, such as uptime since last reload/restart? Discovered.||1|
|1.8||Application configuration||The user would like to be able to scan the configuration of the server and easily identify which settings are changed from the default value. These settings will ideally be grouped into related settings. list of all significant parameters with their value in the current running application (not in the configuration file, but running), and the default value for that parameter. Does NOT include Kea backend databases. More important for BIND than for Kea.||1-Kea 2-BIND|
|1.9||Import application server list||The user has an inventory of BIND and Kea servers in a spreadsheet and would like to import this into Stork without retyping it. This spreadsheet will have columns that don't match those that Stork wants, so the columns will need to be examined.||1|
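As an illustration of requirement 1.9, the following is a minimal sketch of importing a spreadsheet exported as CSV when its columns do not match Stork's. The column headers, the Stork field names, and the `import_servers` helper are all hypothetical; nothing here reflects an actual Stork schema or API.

```python
import csv
import io

# Hypothetical mapping from spreadsheet column headers to inventory fields.
# Neither the headers nor the field names come from Stork itself.
COLUMN_MAP = {
    "Host": "hostname",
    "IP Address": "address",
    "App": "application",
}


def import_servers(csv_text, column_map=COLUMN_MAP):
    """Parse CSV text and return (records, unmapped_columns).

    Columns listed in column_map are renamed to the target field names.
    Columns not in the map are reported back so the admin can examine them.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    unmapped = [h for h in headers if h not in column_map]
    records = []
    for row in reader:
        record = {
            field: (row.get(col) or "").strip()
            for col, field in column_map.items()
        }
        records.append(record)
    return records, unmapped


sample = (
    "Host,IP Address,App,Rack\n"
    "ns1,192.0.2.53,bind9,A4\n"
    "dhcp1,192.0.2.67,kea-dhcp4,B2\n"
)
records, unmapped = import_servers(sample)
```

Reporting the unmapped columns rather than silently dropping them matches the note above that the columns will need to be examined before import.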
This is a bit more than 'monitoring' because it also requires some reading and analysis of configuration data (pools, host reservations), but it is all still read-only.
|2.1||Leases list||human-readable list of leases sorted by default from most recent to oldest, with sorting by any fields in the lease, search based on MAC address or IP address. Lease database must also include which server owns the lease. If we can also do a reverse DNS lookup on the IP address (this can be a process triggered by the admin, it doesn't have to happen magically) to populate a hostname field, that would be good too. This should not require querying all the DHCP servers; it should come from a central lease db in Stork. I am thinking it is updated by notification from the DHCP servers, after some initialization process where it gets all the current leases.||1|
|2.2||Hosts list||human-readable list of host reservations, with sorting by IP, date assigned, host name. Show if the lease has actually been requested/assigned. Perhaps pxeboot file option value? hostname option value?||1|
|2.3||Kea response times||It needs to be possible for example, to see if there is a backlog of requests building up that are unfilled as an indicator the Kea server is becoming overloaded.||1|
|2.4.0||Subnets configured||Per server, what subnets are configured, with what ranges? SubnetID, pools within the subnet, how many addresses is that? (compute, some of our users have trouble figuring this out!), how many of these addresses are currently assigned, how many of these addresses are available for use?||1|
|2.4.1||Pool utilization||Display pool utilization % by pool||1|
|2.5.0||Shared networks configured||Per server, what shared networks are configured, with what ranges? SubnetID, pools within the subnet, how many addresses is that? (compute, some of our users have trouble figuring this out!), how many of these addresses are currently assigned, how many of these addresses are available for use?||1|
|2.5.1||Pool utilization- shared networks||Display address utilization % by shared network||1|
|2.6||Failover status||green/red with detail view (when was the last failover event? heartbeat status? Active/passive or load balancing?)||1|
|3.4||Log viewer||view log of monitored servers of significant events since (some limited size). if possible, include platform logs (e.g. platform restarts, OS updates...) and stork application logs. This is envisioned initially as a fairly simple display of the log file on an individual server. This is not a massive database of historical logs with analysis.||1|
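Requirements 2.4.0 and 2.4.1 ask Stork to compute how many addresses a pool contains and what percentage is in use. A minimal sketch of that arithmetic follows. It assumes pools are written either as a `first - last` range or as a prefix, and that every address in a prefix counts toward the pool; the `pool_size` and `utilization` helpers are hypothetical names, not Stork or Kea functions.

```python
import ipaddress


def pool_size(pool):
    """Return the number of addresses in a 'first - last' range or a prefix."""
    if "-" in pool:
        first, last = (
            ipaddress.ip_address(part.strip()) for part in pool.split("-", 1)
        )
        return int(last) - int(first) + 1
    # Assumption: every address in the prefix belongs to the pool.
    return ipaddress.ip_network(pool.strip(), strict=False).num_addresses


def utilization(pools, assigned):
    """Summarize total, assigned, and available addresses for a set of pools."""
    total = sum(pool_size(p) for p in pools)
    percent = round(100.0 * assigned / total, 1) if total else 0.0
    return {
        "total": total,
        "assigned": assigned,
        "available": total - assigned,
        "percent": percent,
    }


summary = utilization(["192.0.2.10 - 192.0.2.19", "192.0.2.128/25"], assigned=69)
# summary["total"] is 138 (10 + 128), summary["percent"] is 50.0
```

The same summary could be produced per pool, per subnet, or per shared network by changing which pools are passed in.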
Kea Tools or Use cases
This is a list of things that are not strictly monitoring. Putting them on a separate list does not imply anything about the layout in the gui or workflow. This is where we can add value by facilitating common or difficult operational tasks. Currently this list is pretty speculative.
|4.4||Clients refusing offers||See all addresses declined by clients - troubleshooting||1|
BIND Status and Activity
Most BIND 9 users have added BIND 9 to their existing fault monitoring systems by now. What is lacking is any integrated way to manage application + server performance together, and any way to view the status of events that are not queries or responses, such as interactions between servers (IXFR/AXFRs), journal updates, signing operations and the like.
|5.1||Zone list||human-readable list of zones, sortable by zone name, time of last update (this might be the default sort), zone size? signing status (signed/unsigned/expired?), #RRs. 'dynamic' or 'traditional' zone files||1|
|5.4||Zone signing status||DNSSEC details, key information, signature validity period||1|
|5.5||Zone/rr signing performance||monitoring where BIND is in signing and resigning new or updated zones, both the status and time it takes to complete the signing operation. I realize this is potentially very detailed and complicated, but think of the use case where an auth publisher has a few very large zones - how can they track their signing process?||1|
|5.6||Query activity||(queries per second received and answered), with some time series so you can identify changes from the usual pattern, TOD patterns.||1|
|5.7||RPZ statistics||logging of RPZ 'matches', with the name of the RPZ, name of the answer zone and action taken - rewrites, nxdomain, etc and counters (eg. 15 minute intervals). This for the purpose of proving to management that the RPZ service is worthwhile and impactful. The user wants to know how much of an impact (each RPZ zone) is having.||1|
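Requirement 5.7 calls for counters of RPZ matches per zone and action over fixed intervals such as 15 minutes. One way this could be sketched is shown below. The event tuples (timestamp, RPZ name, action) are a hypothetical input format; how the matches would actually be collected from BIND is not specified here.

```python
from collections import Counter
from datetime import datetime

INTERVAL_MINUTES = 15


def bucket_start(ts):
    """Round a timestamp down to the start of its 15-minute interval."""
    return ts.replace(
        minute=ts.minute - ts.minute % INTERVAL_MINUTES, second=0, microsecond=0
    )


def count_rpz_hits(events):
    """Count events per (interval start, RPZ name, action)."""
    counts = Counter()
    for ts, rpz_name, action in events:
        counts[(bucket_start(ts), rpz_name, action)] += 1
    return counts


# Hypothetical sample events: (timestamp, RPZ name, action taken)
events = [
    (datetime(2019, 10, 7, 12, 3), "rpz.example", "nxdomain"),
    (datetime(2019, 10, 7, 12, 14), "rpz.example", "nxdomain"),
    (datetime(2019, 10, 7, 12, 16), "rpz.example", "rewrite"),
]
counts = count_rpz_hits(events)
```

Summing the counts for one RPZ across intervals gives the per-zone impact figure that the requirement says users want to show to management.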
BIND Performance Details
Two problems that operators want to address: how can I improve performance (and improving cache utilization is one way to improve performance) and what is my memory being used for?
|6.1||Query details||easily monitor the volume of queries and responses, rrtypes, response codes, by TCP vs UDP, perhaps by some response size buckets, this is a baseline function that everyone needs and these statistics should be available on a per-server basis from BIND today. Include queries that are dropped. Ideal if these can be displayed both per-server and aggregated across clusters of servers.||1|
|6.4||Cache hit ratio||% of queries answered from cache (time series)||1|
|6.5||Cache aging||cache size, average ttl of records in cache, # of records pre-fetched and # of those that expired without being re-queried, top 500(?) records most frequently queried, cache cleaning (how dirty is the cache)||1|
|6.7||Memory utilization||what is named's current memory allocation being used for. Esp needed by hybrid server operators (amt used for auth vs recursive)||1|
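For requirement 6.4, a cache hit ratio time series can be derived from cumulative hit and miss counters sampled at intervals. The sketch below assumes samples of the form (timestamp, hits, misses); these names are illustrative and are not taken from BIND's statistics output.

```python
def hit_ratio_series(samples):
    """Turn cumulative (timestamp, hits, misses) samples into per-interval hit %.

    Each result is (timestamp of the later sample, percent or None).
    Intervals where a counter went backwards (for example after a server
    restart) are skipped. Intervals with no queries yield None.
    """
    series = []
    for (_, h0, m0), (t1, h1, m1) in zip(samples, samples[1:]):
        delta_hits, delta_misses = h1 - h0, m1 - m0
        if delta_hits < 0 or delta_misses < 0:
            continue  # counters were reset between the two samples
        total = delta_hits + delta_misses
        percent = round(100.0 * delta_hits / total, 1) if total else None
        series.append((t1, percent))
    return series


# Hypothetical samples taken every 60 seconds; the last one follows a restart.
samples = [(0, 0, 0), (60, 80, 20), (120, 170, 30), (180, 5, 1)]
series = hit_ratio_series(samples)
# series is [(60, 80.0), (120, 90.0)]
```

Working from counter deltas rather than raw totals keeps the series responsive to recent changes, which is what makes deviations from the usual pattern visible.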
BIND troubleshooting use cases
These could be 'tools' or simply test cases. These are some tasks we want to facilitate.
|7.1||Performance troubleshooting||What is BIND doing (while it is eating memory, eating CPU, not responding, apparently twiddling its thumbs or ..?) Do I need to increase any of my throttles because I'm getting close to the limits?||1|
|7.2||Cache analysis||What's in cache (by RTYPE - real entries, not the expired ones, although expired but not yet cleaned up might also be interesting)||1?|
|7.3||Cache cleanup||What's expired in cache and still not cleaned up?|
|7.4||Memory utilization||How much memory are my auth zones occupying? How much memory is RRL using?||1|
|7.5||Throttling||Throttling features include RRL, fetch-limits, client-quotas, and TCP quotas. Is this server being throttled by fetch-limits, or is this zone being throttled by fetch-limits? So, log each instance of crossing a threshold where throttling kicks in, and again when the value crosses the threshold on the way down.||1|
|7.5.1||Cookies||what % of clients are avoiding RRL by providing cookies||?|
|7.6||SRTT++||See a list of servers for a domain and the current and historical srtt values for those servers. Which server will BIND query for this domain and why. Also, which servers are EDNS capable?||1|
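Requirement 7.5 asks for a log entry when a throttling threshold is crossed and another when the value drops back below it. A minimal sketch of that edge detection follows. The sample format (timestamp, value) and the `threshold_crossings` helper are hypothetical, and the choice to treat a value equal to the limit as throttled is an assumption.

```python
def threshold_crossings(samples, limit):
    """Return (timestamp, event) pairs for each crossing of the limit.

    An event is recorded only when the state changes, so a value that
    stays above the limit for many samples produces a single entry.
    """
    events = []
    throttled = False
    for ts, value in samples:
        if value >= limit and not throttled:
            throttled = True
            events.append((ts, "throttling started"))
        elif value < limit and throttled:
            throttled = False
            events.append((ts, "throttling ended"))
    return events


# Hypothetical samples of outstanding fetches against a limit of 100.
samples = [(1, 10), (2, 95), (3, 120), (4, 130), (5, 99), (6, 100), (7, 50)]
crossings = threshold_crossings(samples, limit=100)
```

Logging only the transitions keeps the log small while still showing how long each throttling episode lasted.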
Web app. It is OK to support a limited set of OSes for the platform.
|10.1||Installation||We definitely are going to want a package... with all our dependencies, we may need a SCL approach.|
|10.2||User authentication||local authentication is adequate for 1.0, later versions will require network authentication|
|10.3||User authorization||fine grained access control is a foundation that will be needed in later versions. for 1.0 it would be good if there are at least read-only users and stork administrators|
|10.4||Report generation||for a select number of graphs, it is very useful to be able to save (.pdf) the chart and be able to save it automatically, eventually to be able to schedule email delivery of reports. Mostly the use here is to send these to mgmt. RPZ effectiveness report is a popular item.|
User Interface - List of Pages/panes from discussion in WAW
- Navigation: servers, alerts, activity & utilization, performance details, configuration view, settings
- search button to search for a hostname, zone name, IP address, clientID/MAC
- server list - service - application role - service sw version - HA status
- Host reservations list - half dozen options per reservation, possibly multiple IPs.
- Leases list
- Shared networks/Subnets/Pool with graphical indicator making it easy to see high pool utilization at a glance
- DNS Activity - time series data showing queries per second
- Zone list (name, class, signer status) - would be nice to have visual indicator of signer status
- DNSSEC panel - NTAs
- RPZ stats
- Performance related information - drill down per server - QPS, leases per second, memory, cache, button to go view logs
- Admin page - view, add, remove users, groups, manage user permissions
- Alerts list
- We will have a list of guidelines for accessibility. We must be accessible to users with color blindness at a minimum, and we should also be accessible to people with other vision impairments.
|A||OSes supported||FreeBSD 12 and Ubuntu 18.04; more coming in milestone 2||Oct 7, 2019|
|B||Docker||Supported, but optional.||Oct 9, 2019|
|B||Does Stork support Kea/BIND built and installed by hand, or must they be installed the Stork way only?|
|Product packaging||e.g. is there a single version, multiple versions, add-ons, binary only or source...|
|How do Stork release numbers relate to BIND and Kea versions?|
- Application server: a machine with running Kea or BIND (or MySQL or PostgreSQL or Cassandra, in use by Kea)
- TOD pattern: Time of Day pattern