Vicky Risk · 24a3d13e
--- a/Stork-1.0-Requirements.md
+++ b/Stork-1.0-Requirements.md
+### Combined DHCP and DNS Service Management
+Stork is intended to provide a central point for monitoring and managing BIND 9 and Kea servers and, we hope, eventually DHCP and DNS services in general. Ultimately we want to be able to manage BIND 9 and Kea together, to address the DDNS issues, so we would like to include BIND support from the first release.
+
+### Release 1 - Focus on Monitoring, Health, Activity, Performance
+The main benefit in the first release will be aggregating and presenting selected data useful for common operational management of DHCP and DNS services and servers. In later releases we would like to also manage configuration of these servers, but we are starting with monitoring. In this initial release we will need to establish an extensible, high performance infrastructure for collecting, storing and analyzing DNS and DHCP activity data.
+
+### Northbound interfaces - API and GUI
+Ideally, the northbound interfaces will support both a graphical display (via an ordinary web server) and an API. Once we are managing configurations, users will certainly want APIs to integrate with their provisioning systems. In the first release, there may be a native GUI as well as an API to a data visualization tool such as Grafana, and possibly another to Nagios, Cacti, or another fault management and/or alerting system.
+
+### Minimum useful product
+* **Kea users** - Anterius alternative, with pool utilization monitoring at a minimum.
+* **BIND users** - solve one or more use cases involving metrics that are not easily obtained with existing tools: performance tuning, zone update timing, zone signing timings, or cache management.
+
+
+##  Common Requirements
+
+| # | Feature | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ |
+| 1.0 | Server list | meta-data per server including: location (geographic), location (physical)-hall:rack:slot, department, dns name for server(?), contact name, email and phone number.  configured by stork admin. | | |
+|  1.1 | IPs | server interface IPs. configured by stork admin.   | | |
+|  1.2 | Application(s) installed | dhcpv4, dhcpv6, ddns, MySQL/Postgres/Cassandra backend, BIND 9. Presumably there can be more than one application per server? Can this be discovered? |  |  |
+|  1.3 | Application role | role in the network (Kea: failover primary, failover secondary, failover lb, lease backend, host backend, config backend; BIND: zone master, secondary, hidden master, ...). configured by stork admin. |  |  |
+|  1.4 | Software versions | OS version running, application version, build#. discovered. Many users will have their own build systems, or use multiple OSes, so the 'version' field may need to include OS, build#, etc & we need to allow for version #s that do not match ISC version numbers in case of OS packages with different numbering systems.  |  |  |
+|  1.5 | ISC application operational status | running/not running, date/time of last reload/reboot, uptime since last reload (computed) |  |  |
+|  1.6 | Platform status | do we need OS and/or platform operational status, uptime since last reload/restart?  discovered. |  |  |
+|  1.7 | Platform information | platform name/type, memory, free memory and memory in use by the application (BIND/KEA), # of CPUs. discovered.   |  |  |
+
+
+1. **Simplistic discovery of network devices**  
+   A minimum for 'discovery' should include import of a csv (not really discovery, but then at least we could import a list of devices from another ipam)
+   1. Must have: manually adding Kea and BIND servers one by one
+   1. CSV import of Kea and BIND services (stretch goal)
+   1. CSV import of end user devices (desktops, laptops) (out of scope for 1.0)
+   1. 2 methods for discovering existing DHCP servers:
+      1. scan for CA TCP port (out of scope 1.0)
+      1. send Discover/Solicit, list requirements (out of scope 1.0)
+
+**:question: Q**: Is it for discovering existing Kea and BIND services (this includes DHCP relay) or discovering hosts in the network?
+
+
+## Kea Monitoring 
+| # | Feature | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ | 
+|  2.1 |  Leases list| human-readable list of leases from most recent to oldest, with sorting by fields in the lease, search based on any of the fields in the lease  |  |  |
+|  2.2 |  Hosts list| human-readable list of host reservations, with sorting by IP, date assigned, host name. Ideally will also show if the lease has been requested. Perhaps pxeboot file option value?  |  |  |
+|  2.3 |  Leases timing| leases/time requested and assigned, with some time series so you can identify changes from the usual pattern, delay between discover and accept, TOD patterns. discovered.  |  |  |
+|  2.4 | Pool utilization | # of IPs available, in use, and 'awaiting release' leases per pool. actual vs. configuration. how to handle prefixes? |  |  |
+|  2.5 | Real time renewals | some kind of running display of renewals coming in would be a fun, showy feature, whether it is illustrated with boxes in a display changing colors or some other way. Could the user click and drill down to look at lease lifetimes? Is there some way of collecting and reporting on these timers, perhaps as the leases are renewing?  |  |  |
+|  2.6 | Failover status | green/red with detail view (when was the last failover event? heartbeat status? Active/passive or load balancing?)  discovered. |  |  |
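+
+The pool utilization figure in 2.4 can be written as a simple ratio. This is only a sketch: the symbols below are illustrative names, not Kea statistics, and whether 'awaiting release' leases count as occupied is an open choice that this document does not decide.
+
+```math
+U_{\text{pool}} = \frac{N_{\text{in use}}}{N_{\text{available}} + N_{\text{in use}} + N_{\text{awaiting}}}
+```
+
+For example, a hypothetical pool of 200 addresses with 150 in use, 10 awaiting release and 40 available has a utilization of 150/200 = 75%, or 160/200 = 80% if addresses awaiting release are counted as occupied. Comparing the denominator with the configured pool size is one way to show 'actual vs. configuration'.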
+
+## Common Tools
+| # | Tool | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ |
+|  3.1 | Software update status | look up (presumably at ISC) whether the application software branch is currently supported (Stable, Extended Support, Development, EOL) and report what the current/latest version on the branch is  |  |  |
+|  3.2 | Software update scheduler| ambitious, but it would be great if we could figure out how to provide an easy way to schedule updates. Would need to include a dialog for indicating where the package or binary repo is.  |  |  |
+|  3.3 | Log viewer| view log of monitored servers of significant events since last restart. if possible, include platform logs (e.g. platform restarts, OS updates...) and stork application logs |  |  |
+
+
+## Kea Tools
+This is a list of things that are not strictly monitoring. Putting them on a separate list does not imply anything about the layout in the gui or workflow. This is where we can add value by facilitating common or difficult operational tasks.  Currently this list is pretty speculative.
+
+| # | Tool | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ |
+|  4.1 | Lease lookup | a tool that facilitates searching for a lease by any field in the lease, active or historical, would be a useful utility. ultimately, when we are also providing configuration support, we may need to store additional information with each lease (e.g. user ID - things that we look up in another database at the time and associate with the dhcp lease).  |  |  |
+|  4.2 | Failover test   | no idea what this could consist of, but it is a common request, how to test failover |  |  |
+|  4.3 | Options evaluator | evaluates what options and/or addresses would be assigned to a given client - how to 'test' the configuration?  |  |  |
+|  4.4 | LFC monitor | is there something to monitor re lease file cleanup? |  |  |
+|  4.5 | device fingerprinting | at some point, we are going to want to start storing information on all the devices Kea has ever seen, along with whatever information we can discern about them (when they last appeared on the network, when they first contacted Kea, dns name associated, device type (from fingerprint), options requested/provided?, incl pxe file location) |  |  |
+
+
+## BIND Status and Activity 
+Most BIND 9 users have added BIND 9 to their existing fault monitoring systems by now. What is lacking is any integrated way to manage application + server performance together, and any way to view the status of events that are not queries or responses, such as interactions between servers (IXFR/AXFRs), journal updates, signing operations and the like. 
+
+| # | Feature | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ | 
+|  5.1 |  Zone list | human-readable list of zones, sortable by zone name, soa timestamp?, zone size? signing status (signed/unsigned/expired?), #RRs. 'dynamic' or 'traditional' zone files |  |  |
+|  5.2 |  Zone xfr status  | See current SOA record, monitor notifies, time since last xfer/ixfr, size, source of transfer. If there is any way to see the time it took for the transfer to become effective in the server, that would be ideal. |  |  |
+|  5.3 |  Zone xfr performance  | See current SOA record, monitor notifies, time since last xfer/ixfr, size, source of transfer. If there is any way to see the time it took for the transfer to become effective in the server, that would be ideal. |  |  |
+|  5.4 |  Zone signing status | DNSSEC details, key information, signature validity period |  |  |
+|  5.5 |  Zone signing performance | monitoring where BIND is in signing and resigning new or updated zones, both the status and time it takes to complete the signing operation. |  |  |
+|  5.6 |  Query activity | (queries per second received and answered), with some time series so you can identify changes from the usual pattern, TOD patterns. |  |  |
+|  5.7 |  RPZ reporting | logging of RPZ 'matches', with the name of the RPZ, name of the answer zone and action taken - rewrites, nxdomain, etc and counters (eg. 15 minute intervals). This may require log analysis. |  |  |
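+
+If the query rate in 5.6 is derived from cumulative counters (an assumption; this document does not fix the data source), it could be computed by differencing two samples. The symbols are illustrative, with C(t) standing for the total number of queries counted up to time t:
+
+```math
+\text{QPS}(t_1, t_2) = \frac{C(t_2) - C(t_1)}{t_2 - t_1}
+```
+
+For example, hypothetical samples of 1,200,000 and 1,230,000 taken 60 seconds apart give 30,000 / 60 = 500 queries per second. A sample pair in which the counter decreases, for instance after a server restart, would need to be discarded rather than plotted.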
+
+
+
+## BIND Performance Details
+Sophisticated DNS operators do extensive data analysis, but so much data is generated, it is expensive to keep it all around. The ideal solution might include streaming query and response data, with periodic analysis and creation of counters based on that data (which is then not preserved). This would enrich the current internal BIND counters and metrics with additional detail. 
+
+| # | Feature | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ | 
+|  6.1 |  Query details | easily monitor the volume of queries and responses, rrtypes, response codes, by TCP vs UDP, perhaps by some response size buckets, this is a baseline function that everyone needs and these statistics should be available on a per-server basis from BIND today. Ideal if these can be displayed both per-server and aggregated across clusters of servers. |  |  |
+|  6.2 |  Query & Response log analysis | One use case here is drilling down to look at actual queries during a spike in one of the usual metrics observed above to find out the query source, name queried for, etc, to identify the source of malicious traffic. This is the sort of data that would ideally be analyzed and discarded, because keeping a lot of it around would become expensive. One issue is that today, enabling DNSTAP on BIND requires restarting BIND.  |  |  |
+|  6.3 |  Response latency | (average, max, min, mode - time between receiving a query and sending a response, as well as for resolvers, whether the response came from cache or not) (Serious providers will use test clients at various locations in the network to continuously test/audit the dns service. We should consider attempting to support that at some point in the future.)  |  |  |
+|  6.4 |  Cache hit ratio | % over time (cache size, average ttl of records in cache, # of records pre-fetched and # of those that expired without being re-queried, top 500(?) records most frequently queried)  |  |  |
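+
+The cache hit ratio in 6.4 is conventionally the share of queries answered from cache. A minimal formulation, with symbol names that are illustrative rather than BIND counter names:
+
+```math
+H = \frac{N_{\text{hit}}}{N_{\text{hit}} + N_{\text{miss}}}
+```
+
+For example, if 9,000 of 10,000 queries in an interval are answered from cache, H = 9,000 / 10,000 = 90%. Computing H separately for each sampling interval would give the '% over time' series.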
+
+
+## Application Infrastructure
+Stork is a web application. It is acceptable to support only a limited set of operating systems for the platform.
+
+| # | Feature | Details | Feasibility | Release or GL#? |
+| ------ | ------ | ------ | ------ | ------ | 
+|  10.1 |  Installation | We definitely are going to want a package... with all our dependencies, we may need a SCL approach. |  |  |
+|  10.2 |  User authentication | local authentication is adequate for 1.0, later versions will require network authentication |  |  |
+|  10.3 |  User authorization | fine grained access control is a foundation that will be needed in later versions. for 1.0 it would be good if there are at least read-only users and stork administrators |  |  |
+
+
+## Significant Questions/Issues
+| # | Question | Response | Date resolved |
+| ------ | ------ | ------ | ------ | 
+| | Does Stork support Kea/BIND built and installed by hand? Or must they be installed the Stork way only? | | |
+| | Is the product named Stork or is that the code name? | | |
+| |  | | |
+| | How do Stork release numbers relate to BIND and Kea versions? | | |
+
+
+Glossary:
+- Application server: a machine running Kea or BIND (or MySQL, PostgreSQL or Cassandra, in use by Kea)
+- TOD pattern: Time of Day pattern
\ No newline at end of file