Stork is intended to provide a centralized focus for monitoring and managing BIND 9 and Kea servers and, we hope, eventually DHCP and DNS services in general. Ultimately we want to be able to manage BIND 9 and Kea together to address the DDNS issues, so we would like to include BIND support from the first release.
Discussion of these requirements is ongoing at #18.
Release 1 - Focus on Monitoring, Health, Activity, Performance (could be 1.0, or 0.8...)
The main benefit in the first release will be aggregating and presenting selected data useful for common operational management of DHCP and DNS services and servers. In later releases we would like to also manage configuration of these servers, but we are starting with monitoring. In this initial release we will need to establish an extensible, high performance infrastructure for collecting, storing and analyzing DNS and DHCP activity data.
Northbound interfaces - API and GUI
Ideally, the northbound interfaces support both a graphical display (via an ordinary web server) and an API. When we get to the point that we are managing configurations, users will certainly want APIs to integrate with their provisioning systems. In the context of the first release, there may be a native GUI as well as an API to a data visualization tool such as Grafana, and possibly another to Nagios, Cacti, or another fault management and/or alerting system. I expect traditional BIND users will be less interested in a BIND-specific GUI; Kea users have had less time to develop their own systems and will be more interested. Kea users also have relatively more interest in a GUI to drive configuration changes because of the lack of alternatives.
Minimum useful product
Kea users - Anterius alternative, with pool utilization monitoring at a minimum.
BIND users - solve one or more use cases involving metrics that are not easily obtained with existing tools: performance tuning, zone update timing, zone signing timings, or cache management.
Roles: admin (can do all possible actions), user (has limited access)
As an admin I can add Kea and BIND instances (their addresses and some metadata). I also can browse them and search them by particular fields.
User would like to see what applications are installed on each server (dhcpv4, dhcpv6, ddns, Kea hooks, MySQL/Postgres/Cassandra backend, BIND 9). Presumably there can be more than one application per server; certainly there can be multiple Kea hooks installed and configured. Includes service IP.
User would like to be able to quickly see the server's role in the network (Kea: failover primary, failover secondary, failover lb, lease backend, host backend, config backend; BIND: zone master, secondary, hidden master, ...). Configured by the Stork admin.
OS version running, application version, build # (discovered). Many users will have their own build systems, or use multiple OSes, so the 'version' field may need to include OS, build #, etc., and we need to allow for version numbers that do not match ISC version numbers in case of OS packages with different numbering systems. Config flags the image was built with. Hooks loaded (Kea hooks, BIND hooks, BIND RPZ plug-in).
DB backend versions out of scope for 1.0
ISC application operational status
running/not running, date/time of last reload/reboot, uptime since last reload (computed)
Do we need OS and/or platform operational status, and uptime since last reload/restart? (discovered)
The user would like to be able to scan the configuration of the server and easily identify which settings have been changed from their default values. These settings will ideally be grouped into related settings. This is a list of all significant parameters with their value in the currently running application (not in the configuration file) and the default value for each parameter. Does NOT include Kea backend databases. More important for BIND than for Kea.
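As a rough illustration of the comparison described above, the following sketch reports every running parameter whose value differs from its default. The function name and the sample values are assumptions made for the example; the defaults shown are not taken from Kea or BIND.

```python
def non_default_settings(running, defaults):
    """Return {parameter: (running_value, default_value)} for changed parameters.

    A parameter with no known default is reported with a default of None.
    """
    changed = {}
    for name, value in running.items():
        default = defaults.get(name)
        if name not in defaults or value != default:
            changed[name] = (value, default)
    return changed


# Illustrative values only (not real product defaults).
running = {"valid-lifetime": 7200, "renew-timer": 900, "rebind-timer": 1800}
defaults = {"valid-lifetime": 7200, "renew-timer": 1000, "rebind-timer": 1800}

changed = non_default_settings(running, defaults)
for name, (value, default) in changed.items():
    print(f"{name}: running={value} default={default}")
```

Grouping into related settings could be layered on top of this by attaching a group label to each parameter name.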
Import application server list
The user has an inventory of BIND and Kea servers in a spreadsheet and would like to import it into Stork without retyping it. This spreadsheet will have columns that don't match those that Stork wants, so the columns will need to be examined and matched to the fields Stork expects.
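One possible shape for such an import, assuming the spreadsheet is exported as CSV: the user supplies a mapping from their column headers to Stork's fields, and any unmapped columns are reported back for review. The field names (`name`, `address`, `app_type`) and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical mapping: spreadsheet column header -> field Stork might store.
COLUMN_MAP = {"Hostname": "name", "IP": "address", "Type": "app_type"}


def import_servers(csv_text, column_map):
    """Parse CSV text; return (server records, spreadsheet columns left unmapped)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    unmapped = [col for col in headers if col not in column_map]
    servers = []
    for row in reader:
        record = {}
        for col, field in column_map.items():
            if col in row:
                # Short rows yield None for missing cells; treat them as empty.
                record[field] = (row[col] or "").strip()
        servers.append(record)
    return servers, unmapped


sample = """Hostname,IP,Type,Rack
dhcp1,192.0.2.10,kea,A4
ns1,192.0.2.53,bind9,B2
"""

servers, unmapped = import_servers(sample, COLUMN_MAP)
print(servers)
print("Columns needing review:", unmapped)
```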
This is a bit more than 'monitoring' because it also requires some reading and analysis of configuration data (pools, host reservations), but it is all still read-only.
Human-readable list of leases, sorted by default from most recent to oldest, with sorting by any field in the lease and search by MAC address or IP address. The lease database must also record which server owns the lease. If we can also do a reverse DNS lookup on the IP address (this can be a process triggered by the admin, it doesn't have to happen magically) to populate a hostname field, that would be good too. This should not require querying all the DHCP servers; it should come from a central lease database in Stork. I am thinking it is updated by notification from the DHCP servers, after some initialization process where it gets all the current leases.
human-readable list of host reservations, with sorting by IP, date assigned, host name. Show if the lease has actually been requested/assigned. Perhaps pxeboot file option value? hostname option value?
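The lease listing described above (newest first by default, sortable by any field, searchable by MAC or IP address) could be sketched against an in-memory stand-in for the central lease database. The `Lease` record and its field names are assumptions for illustration, not a proposed schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Lease:
    ip: str
    mac: str
    server: str          # which DHCP server owns the lease
    assigned_at: datetime
    hostname: str = ""   # could be filled in later by a reverse DNS lookup


def list_leases(leases, query=None, sort_by="assigned_at", descending=True):
    """Filter by exact MAC or IP match (case-insensitive), then sort by one field."""
    if query:
        q = query.lower()
        leases = [l for l in leases if l.ip.lower() == q or l.mac.lower() == q]
    return sorted(leases, key=lambda l: getattr(l, sort_by), reverse=descending)


leases = [
    Lease("192.0.2.10", "00:11:22:33:44:55", "kea-a", datetime(2019, 5, 1, 9, 0)),
    Lease("192.0.2.11", "00:11:22:33:44:66", "kea-b", datetime(2019, 5, 1, 12, 0)),
    Lease("192.0.2.12", "00:11:22:33:44:77", "kea-a", datetime(2019, 5, 1, 10, 30)),
]

newest_first = list_leases(leases)
by_mac = list_leases(leases, query="00:11:22:33:44:77")
```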
Kea response times
It needs to be possible, for example, to see whether a backlog of unfilled requests is building up, as an indicator that the Kea server is becoming overloaded.
Per server, what subnets are configured, with what ranges? SubnetID, pools within the subnet, how many addresses is that? (compute, some of our users have trouble figuring this out!), how many of these addresses are currently assigned, how many of these addresses are available for use?
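One simple way to express the backlog indicator, assuming Stork periodically samples two cumulative counters per server (requests received and requests answered): the backlog is their difference, and a run of strictly increasing backlog values suggests the server is falling behind. The counter pair and the three-sample window are assumptions, not existing Kea statistic names.

```python
def backlog_series(samples):
    """samples: list of (received, answered) cumulative counters, oldest first."""
    return [received - answered for received, answered in samples]


def is_backlog_growing(samples, window=3):
    """True if the backlog strictly increased across the last `window` samples."""
    recent = backlog_series(samples)[-window:]
    if len(recent) < window:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


samples = [(100, 100), (220, 210), (350, 320), (500, 440)]
backlog = backlog_series(samples)
growing = is_backlog_growing(samples)
```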
Display pool utilization % by pool
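Since the address count and utilization are computed values, a small worked example may help. This sketch assumes a pool is given by its first and last address (inclusive) and that the number of assigned leases is known; the function names and figures are illustrative.

```python
import ipaddress


def pool_size(first, last):
    """Number of addresses in an inclusive range; works for IPv4 and IPv6."""
    return int(ipaddress.ip_address(last)) - int(ipaddress.ip_address(first)) + 1


def utilization_percent(assigned, total):
    """Percentage of the pool in use; an empty pool reports 0."""
    return 100.0 * assigned / total if total else 0.0


size = pool_size("192.0.2.10", "192.0.2.200")
assigned = 57
available = size - assigned
percent = utilization_percent(assigned, size)
print(f"{size} addresses, {assigned} assigned, {available} available, {percent:.1f}% used")
```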
Shared networks configured
Per server, what shared networks are configured, with what ranges? SubnetID, pools within the subnet, how many addresses is that? (compute, some of our users have trouble figuring this out!), how many of these addresses are currently assigned, how many of these addresses are available for use?
Pool utilization - shared networks
Display address utilization % by shared network
green/red with detail view (when was the last failover event? heartbeat status? Active/passive or load balancing?)
View a log of significant events on monitored servers (of some limited size). If possible, include platform logs (e.g. platform restarts, OS updates...) and Stork application logs. This is envisioned initially as a fairly simple display of the log file on an individual server. This is not a massive database of historical logs with analysis.
Kea Tools or Use cases
This is a list of things that are not strictly monitoring. Putting them on a separate list does not imply anything about the layout in the GUI or the workflow. This is where we can add value by facilitating common or difficult operational tasks. Currently this list is pretty speculative.
Clients refusing offers
See all addresses declined by clients - troubleshooting
BIND Status and Activity
Most BIND 9 users have added BIND 9 to their existing fault monitoring systems by now. What is lacking is any integrated way to manage application + server performance together, and any way to view the status of events that are not queries or responses, such as interactions between servers (IXFR/AXFRs), journal updates, signing operations and the like.
human-readable list of zones, sortable by zone name, time of last update (this might be the default sort), zone size? signing status (signed/unsigned/expired?), #RRs. 'dynamic' or 'traditional' zone files
Zone signing status
DNSSEC details, key information, signature validity period
Zone/rr signing performance
monitoring where BIND is in signing and resigning new or updated zones, both the status and time it takes to complete the signing operation. I realize this is potentially very detailed and complicated, but think of the use case where an auth publisher has a few very large zones - how can they track their signing process?
Queries per second (received and answered), with some time series so you can identify changes from the usual pattern and time-of-day patterns.
Logging of RPZ 'matches', with the name of the RPZ, the name of the answer zone and the action taken (rewrites, nxdomain, etc.), plus counters (e.g. 15-minute intervals). This is for the purpose of proving to management that the RPZ service is worthwhile and impactful. The user wants to know how much of an impact each RPZ zone is having.
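The interval counters could be built by bucketing match events, as in this sketch. It assumes each event has already been parsed into a (timestamp, RPZ name, action) tuple; that tuple layout and the sample zone names are hypothetical, and parsing BIND's log output is out of scope here.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts, minutes=15):
    """Round a timestamp down to the start of its interval within the hour."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def count_matches(events, minutes=15):
    """Count events per (interval start, RPZ name, action)."""
    counts = Counter()
    for ts, rpz_name, action in events:
        counts[(bucket_start(ts, minutes), rpz_name, action)] += 1
    return counts


events = [
    (datetime(2019, 5, 1, 10, 2, 11), "rpz.example", "nxdomain"),
    (datetime(2019, 5, 1, 10, 14, 59), "rpz.example", "nxdomain"),
    (datetime(2019, 5, 1, 10, 16, 0), "rpz.example", "rewrite"),
]

counts = count_matches(events)
```

Summing the counters per RPZ name over a reporting period gives the per-zone impact figure for a management report.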
BIND Performance Details
Two problems that operators want to address: how can I improve performance (and improving cache utilization is one way to improve performance) and what is my memory being used for?
Easily monitor the volume of queries and responses, rrtypes, response codes, by TCP vs UDP, perhaps by some response size buckets. This is a baseline function that everyone needs, and these statistics should be available on a per-server basis from BIND today. Include queries that are dropped. Ideally these can be displayed both per-server and aggregated across clusters of servers.
Cache hit ratio
% of queries answered from cache (time series)
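A sketch of how the time series might be derived, assuming Stork samples cumulative cache hit and miss counters at regular intervals: each point is the hit percentage over one interval, computed from the difference between consecutive samples. The sample layout is an assumption; intervals with no lookups, or where the counters went backwards (for example after a restart), are reported as None.

```python
def hit_ratio_series(samples):
    """samples: list of (hits, misses) cumulative counters, oldest first.

    Returns one percentage per interval between consecutive samples.
    """
    ratios = []
    for (h0, m0), (h1, m1) in zip(samples, samples[1:]):
        hits, misses = h1 - h0, m1 - m0
        total = hits + misses
        if hits < 0 or misses < 0 or total == 0:
            ratios.append(None)  # counter reset or idle interval
        else:
            ratios.append(100.0 * hits / total)
    return ratios


samples = [(0, 0), (80, 20), (170, 30), (170, 30), (5, 1)]
ratios = hit_ratio_series(samples)
```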
cache size, average ttl of records in cache, # of records pre-fetched and # of those that expired without being re-queried, top 500(?) records most frequently queried, cache cleaning (how dirty is the cache)
What is named's current memory allocation being used for? Especially needed by hybrid server operators (amount used for auth vs recursive).
BIND troubleshooting use cases
These could be 'tools' or simply test cases. These are some tasks we want to facilitate.
What is BIND doing (while it is eating memory, eating CPU, not responding, apparently twiddling its thumbs, or ...)? Do I need to increase any of my throttles because I'm getting close to the limits?
What's in cache (by RTYPE)? Real entries, not the expired ones, although expired but not yet cleaned up might also be interesting.
What's expired in cache and still not cleaned up?
How much memory are my auth zones occupying? How much memory is RRL using?
Throttling features include RRL, fetch-limits, client-quotas and TCP quotas. Is this server being throttled by fetch-limits, or is this zone being throttled by fetch-limits? So, log instances of crossing the thresholds where throttling kicks in, and again when you cross the threshold on the way down.
What % of clients are avoiding RRL by providing cookies?
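The threshold logging could work along these lines, assuming Stork has a series of (timestamp, value) samples for some throttled quantity and a configured limit. The function only records transitions, so a value that stays above the limit produces a single 'up' event until it drops back. The sample data and the meaning of the value are illustrative.

```python
def threshold_crossings(samples, threshold):
    """samples: list of (timestamp, value), oldest first.

    Returns (timestamp, 'up' | 'down') for each crossing of the threshold.
    The series is assumed to start below the threshold.
    """
    events = []
    above = False
    for ts, value in samples:
        now_above = value >= threshold
        if now_above != above:
            events.append((ts, "up" if now_above else "down"))
            above = now_above
    return events


# Timestamps are plain integers here to keep the example short.
samples = [(1, 5), (2, 12), (3, 15), (4, 8), (5, 11)]
crossings = threshold_crossings(samples, threshold=10)
```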
See a list of servers for a domain and the current and historical srtt values for those servers. Which server will BIND query for this domain and why. Also, which servers are EDNS capable?
Web app. OK to support limited OS for the platform
We definitely are going to want a package... with all our dependencies, we may need an SCL approach.
Local authentication is adequate for 1.0; later versions will require network authentication.
Fine-grained access control is a foundation that will be needed in later versions. For 1.0 it would be good if there are at least read-only users and Stork administrators.
For a select number of graphs, it is very useful to be able to save the chart (.pdf), to save it automatically, and eventually to schedule email delivery of reports. The main use here is to send these to management. The RPZ effectiveness report is a popular item.
User Interface - List of Pages/panes from discussion in WAW