... | ... | @@ -18,23 +18,14 @@ It is ideal that the northbound interfaces support both a graphical display (via |
|
|
|
|
|
| # | Feature | Details | Feasibility | Milestone |
|
|
|
| ------ | ------ | ------ | ------ | ------ |
|
|
|
| 1.0 | Network devices list | meta-data per server including: location (geographic), location (physical)-hall:rack:slot, department, dns name for server(?), contact name, email and phone number. configured by stork admin. | | 3 |
|
|
|
| 1.1 | IPs | server interface IPs. configured by stork admin. (Bind, kea instances addresses) | | 1 |
|
|
|
| 1.2 | Application(s) installed | (dhcpv4, dhcpv6, ddns, Kea hooks, MySQL/Postgres/Cassandra backend, bind 9 (presumably there can be more than one application per server - certainly there can be multiple Kea hooks installed and configured. Includes *service IP*.| | 1 |
|
|
|
| 1.3 | Application role | role in the network (Kea:failover primary, failover secondary, failover lb, lease backend, host backend, config backend, BIND zone master, secondary, hidden master, ... configured by stork admin. | | 1 |
|
|
|
| 1.2 | Application(s) installed | User would like to see what applications are installed on each server (dhcpv4, dhcpv6, ddns, Kea hooks, MySQL/Postgres/Cassandra backend, BIND 9). Presumably there can be more than one application per server; certainly there can be multiple Kea hooks installed and configured. Includes *service IP*. | | 1 |
|
|
|
| 1.3 | Application role | User would like to be able to quickly see the server's role in the network (Kea: failover primary, failover secondary, failover lb, lease backend, host backend, config backend; BIND: zone master, secondary, hidden master, ...). Configured by stork admin. | | 1 |
|
|
|
| 1.4 | Software versions | OS version running, application version, build#. discovered. Many users will have their own build systems, or use multiple OSes, so the 'version' field may need to include OS, build#, etc & we need to allow for version #s that do not match ISC version numbers in case of OS packages with different numbering systems. Config flags the image was built with. Hooks loaded (Kea hooks, BIND hooks, BIND RPZ plug-in)| | DB backend versions out of scope for 1.0 |
|
|
|
| 1.4.1 | Current ISC versions | Report (via some lookup at ISC) what the current version is. | | 1 |
|
|
|
| 1.5 | ISC application operational status | running/not running, date/time of last reload/reboot, uptime since last reload (computed) | | 1 |
|
|
|
| 1.6 | Platform status | do we need OS and or platform operational status, uptime since last reload/restart? discovered. | | 1 |
|
|
|
| 1.7 | Platform information | platform name/type, memory, free memory and memory in use by the application (BIND/KEA), # of CPUs. discovered. | | 2 |
|
|
|
| 1.8 | Application configuration | list of all significant parameters with their value in the current running application (not in the configuration file, but running), and the default value for that parameter. Does NOT include Kea backend databases. More important for BIND than for Kea. | | 1-Kea 2-BIND |
|
|
|
| 1.8.1 | Diff between running config and saved config file | significant parameters that are differing in the current running config vs the saved config | | 3 |
|
|
|
| 1.9 | Import application server list | CSV import of Kea and BIND services (backend DBs as well?) | | 1 |
|
|
|
| 1.9.2 | Import network device list | CSV import of end user devices (desktops, laptops). Note this assumes we can handle storing additional fields such as contact information, geo and dept location, etc. | | 3 |
|
|
|
| 1.9.3 | Network device discovery | LLDP/CDP type discovery, scan for CA TCP port ... | | out of scope 1.0 |
|
|
|
|
|
|
**:question: Q**: Is it for discovering existing Kea and BIND services (this includes DHCP relay) or discovering hosts in the network?
|
|
|
|
|
|
|
|
|
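Item 1.9 (CSV import of Kea and BIND services) could be prototyped roughly as follows. This is only a sketch: the requirements do not define a file layout, so the column names `hostname`, `address` and `application`, the `ServiceEntry` type and the `parse_service_csv` function are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical column names; the requirements do not define a CSV layout.
REQUIRED_COLUMNS = ("hostname", "address", "application")


@dataclass
class ServiceEntry:
    hostname: str
    address: str
    application: str  # e.g. "kea-dhcp4" or "bind9" (illustrative values)


def parse_service_csv(text):
    """Parse CSV text into ServiceEntry records, rejecting files without the expected header."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))

    entries = []
    for row in reader:
        # Short rows yield None for absent fields, so normalize to empty strings.
        values = {col: (row[col] or "").strip() for col in REQUIRED_COLUMNS}
        entries.append(
            ServiceEntry(
                hostname=values["hostname"],
                address=values["address"],
                application=values["application"].lower(),
            )
        )
    return entries


sample = """hostname,address,application
dhcp1.example.org,192.0.2.10,kea-dhcp4
ns1.example.org,192.0.2.53,BIND9
"""
services = parse_service_csv(sample)
```

Additional fields mentioned in 1.0 and 1.9.2 (contact information, location, department) would simply be further columns in the same scheme.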
## Kea Monitoring
|
... | ... | @@ -42,24 +33,17 @@ This is a bit more than 'monitoring' because it also requires some reading and a |
|
|
|
|
|
| # | Feature | Details | Feasibility | Release or GL#? |
|
|
|
| ------ | ------ | ------ | ------ | ------ |
|
|
|
| 2.0 | Address space | show the total address space assigned to the organization, with chunks allocated to various dhcp servers within that overall allocation. This will become much more useful once we are also supporting configuration, but we should start with this concept in mind. | | 2 |
|
|
|
| 2.1 | Leases list| human-readable list of leases sorted by default from most recent to oldest, with sorting by any fields in the lease, search based on MAC address or IP address. Lease database must also include which server owns the lease. If we can also do a reverse DNS lookup on the IP address (this can be a process triggered by the admin, it doesn't have to happen magically) to populate a hostname field, that would be good too. This should not require querying all the dhcp servers - it should come from a central lease db in Stork. I am thinking it is updated by notification from the dhcp servers, after some initialization process where it gets all the current leases. | | 1 |
|
|
|
| 2.2 | Hosts list| human-readable list of host reservations, with sorting by IP, date assigned, host name. Show if the lease has actually been requested/assigned. Perhaps pxeboot file option value? hostname option value? | | 1 |
|
|
|
| 2.3 | Kea response times | It needs to be possible, for example, to see whether a backlog of unfilled requests is building up, as an indicator that the Kea server is becoming overloaded. | | 1 |
|
|
|
| 2.4 | Pool utilization | # of IPs available, in use, and reclaimed. leases per pool. actual vs. configuration. how to handle prefixes? | | 1 |
|
|
|
| 2.5 | Real time renewals | some kind of running display of renewals coming in would be a fun, showy feature, whether it is illustrated with boxes in a display changing colors or some other way. Is it possible to click and drill down to look at lease lifetimes? Is there some way of collecting and reporting on these timers, perhaps as the leases are renewing? Low priority, not sure how useful this is. | | 4 |
|
|
|
| 2.6 | Failover status | green/red with detail view (when was the last failover event? heartbeat status? Active/passive or load balancing?) | | 1 |
|
|
|
| 2.7 | Failover test | some sort of test process that will force a failover to validate it. This may also be useful for software updates. | | 3 |
|
|
|
|
|
|
|
|
|
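The pool utilization figures in 2.4 (IPs available and in use per pool) reduce to simple arithmetic over the pool's address range. The following is a minimal sketch, assuming a pool is described by its first and last address and that the number of leases currently assigned from it is already known. `pool_size` and `pool_utilization` are hypothetical helper names, not part of Kea or Stork, and prefix delegation pools are not handled.

```python
import ipaddress


def pool_size(first, last):
    """Number of addresses in an inclusive range such as 192.0.2.10 - 192.0.2.19."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    if start.version != end.version or int(end) < int(start):
        raise ValueError("invalid pool range")
    return int(end) - int(start) + 1


def pool_utilization(first, last, in_use):
    """Return total, in-use and available counts plus the in-use percentage."""
    total = pool_size(first, last)
    if in_use < 0 or in_use > total:
        raise ValueError("in_use must be between 0 and the pool size")
    return {
        "total": total,
        "in_use": in_use,
        "available": total - in_use,
        "percent_used": 100 * in_use / total,
    }


stats = pool_utilization("192.0.2.10", "192.0.2.19", in_use=4)
```

The same function works for IPv6 address pools, since `ipaddress.ip_address` accepts both families.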
## Common Tools
|
|
|
| # | Tool | Details | Feasibility | Release or GL#? |
|
|
|
| ------ | ------ | ------ | ------ | ------ |
|
|
|
| 3.1 | Software update status | look up (presumably at ISC) whether the application software branch is currently supported (Stable, Extended Support, Development, EOL) and report what the current/latest version on the branch is. | | 2 |
|
|
|
| 3.2 | Software updater | This was REMOVED from the initial requirements list because it seems like it is too complicated to do it well for the wide variety of deployment environments users have. | | out of scope for 1.0 |
|
|
|
| 3.3 | Configuration variance | report on running configuration parameters that are not set to their default values. This could consist of a filter on the configuration from the view in 1.8 that shows parameters that have a default value, where the active setting is different from the default. (requested by support for troubleshooting) | | 2 |
|
|
|
| 3.4 | Log viewer| view a log of significant events on monitored servers (limited to some size). If possible, include platform logs (e.g. platform restarts, OS updates...) and stork application logs. This is envisioned initially as a fairly simple display of the log file on an individual server. This is not a massive database of historical logs with analysis. | | 1 |
|
|
|
| 3.5 | Log analysis| | | >1 |
|
|
|
| 3.5 | Log alerting | alerting when specific words appear in the log | | >1 |
|
|
|
|
|
|
|
|
|
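The configuration variance report in 3.3 is essentially a filter over the running configuration from 1.8. A minimal sketch, assuming the running configuration and the defaults are both available as flat name-to-value mappings; the `config_variance` function and the parameter names and values in the example are illustrative only, not taken from Kea or BIND.

```python
def config_variance(running, defaults):
    """Return parameters that have a default and whose running value differs from it.

    The result maps each parameter name to a (running_value, default_value) pair.
    Parameters without a known default are skipped, as described in 3.3.
    """
    variance = {}
    for name, default in defaults.items():
        if name in running and running[name] != default:
            variance[name] = (running[name], default)
    return variance


# Illustrative parameter names and values.
defaults = {"lease-lifetime": 3600, "log-level": "info", "threads": 4}
running = {"lease-lifetime": 7200, "log-level": "info", "custom-option": "x"}

report = config_variance(running, defaults)
```

Nested configurations would need to be flattened first (for example into dotted paths) before this kind of comparison applies.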
## Kea Tools or Use cases
|
... | ... | @@ -67,13 +51,8 @@ This is a list of things that are not strictly monitoring. Putting them on a sep |
|
|
|
|
|
| # | Tool | Details | Feasibility | Release or GL#? |
|
|
|
| ------ | ------ | ------ | ------ | ------ |
|
|
|
| 4.1 | Devices database | List of devices that have appeared on the network at some point in the past. We may want a knob to configure how long to save this information for. The idea is to retain information associated with a given MAC address, even if it does not have a current valid lease, so that when it next gets a lease, we have this other information about it. Ultimately, we will want to store additional information with each lease (eg. user ID - things that we look up in another database at the time and associate with the dhcp lease). This should be possible both on a per-server basis and on a global basis. | | >1 |
|
|
|
| 4.2 | Failover test | no idea what this could consist of, but it is a common request, how to test failover. what we need is something that will reassure the admin that failover is 'ready and working' | | >1 |
|
|
|
| 4.3 | Relay drill down | The user wants to know what relay the client is behind. See all requests that came through a specific relay. If all we have is leases per relay, that is ok. | | >1 |
|
|
|
| 4.4 | Clients refusing offers | See all addresses declined by clients - troubleshooting | | 1 |
|
|
|
| 4.5 | | | | |
|
|
|
| 4.6 | device fingerprinting | options requested/provided?, incl pxe file location. It is also important to know the order in which the options were requested, so this will probably require a Kea hook. | | >1 |
|
|
|
|
|
|
|
|
|
## BIND Status and Activity
|
|
|
Most BIND 9 users have added BIND 9 to their existing fault monitoring systems by now. What is lacking is any integrated way to manage application + server performance together, and any way to view the status of events that are not queries or responses, such as interactions between servers (IXFR/AXFRs), journal updates, signing operations and the like.
|
... | ... | @@ -81,27 +60,20 @@ Most BIND 9 users have added BIND 9 to their existing fault monitoring systems b |
|
|
| # | Feature | Details | Feasibility | Release or GL#? |
|
|
|
| ------ | ------ | ------ | ------ | ------ |
|
|
|
| 5.1 | Zone list | human-readable list of zones, sortable by zone name, time of last update (this might be the default sort), zone size? signing status (signed/unsigned/expired?), #RRs. 'dynamic' or 'traditional' zone files | | 1 |
|
|
|
| 5.2 | Zone xfr status | See current SOA record, monitor notifies, time since last xfer/ixfr, size, source of transfer. If there is any way to see the time it took for the transfer to become effective in the server, that would be ideal. | | >1 |
|
|
|
| 5.3 | Zone xfr performance | See current SOA record, monitor notifies, time since last xfer/ixfr, size, source of transfer. If there is any way to see the time it took for the transfer to become effective in the server, that would be ideal. You want to know how dynamic a zone is: how often you are getting updates for the zone, how large the updates are, etc. How much is the transfer traffic impacting the traffic, and which are the zones that are causing the problem? | | >1 |
|
|
|
| 5.4 | Zone signing status | DNSSEC details, key information, signature validity period | | 1 |
|
|
|
| 5.5 | Zone/rr signing performance | monitoring where BIND is in signing and resigning new or updated zones, both the status and time it takes to complete the signing operation. I realize this is potentially very detailed and complicated, but think of the use case where an auth publisher has a few very large zones - how can they track their signing process? | | 1 |
|
|
|
| 5.6 | Query activity | (queries per second received and answered), with some time series so you can identify changes from the usual pattern, TOD patterns. | | 1 |
|
|
|
| 5.7 | RPZ statistics | logging of RPZ 'matches', with the name of the RPZ, name of the answer zone and action taken - rewrites, nxdomain, etc and counters (eg. 15 minute intervals). This for the purpose of proving to management that the RPZ service is worthwhile and impactful. The user wants to know how much of an impact (each RPZ zone) is having. | | 1 |
|
|
|
| 5.7 | RPZ analysis | logging of RPZ 'matches', with the name of the RPZ, name of the answer zone and action taken - rewrites, nxdomain, etc and counters (eg. 15 minute intervals). This is for the purpose of proving to management that the RPZ service is worthwhile and impactful. The user wants to know how much of an impact each RPZ zone is having. | | 2 |
|
|
|
| 5.8 | RPZ client reporting | log of clients that are presumably infected based on their DNS requests for malware zones. | | >1 |
|
|
|
|
|
|
|
|
|
## BIND Performance Details
|
|
|
Sophisticated DNS operators do extensive data analysis, but so much data is generated, it is expensive to keep it all around. The ideal solution might include streaming query and response data, with periodic analysis and creation of counters based on that data (which is then not preserved). This would enrich the current internal BIND counters and metrics with additional detail.
|
|
|
|
|
|
| # | Feature | Details | Feasibility | Release or GL#? |
|
|
|
| # | Feature | Details | Feasibility | Milestone |
|
|
|
| ------ | ------ | ------ | ------ | ------ |
|
|
|
| 6.1 | Query details | easily monitor the volume of queries and responses, rrtypes, response codes, by TCP vs UDP, perhaps by some response size buckets, this is a baseline function that everyone needs and these statistics should be available on a per-server basis from BIND today. Include queries that are dropped. Ideal if these can be displayed both per-server and aggregated across clusters of servers. | | 1 |
|
|
|
| 6.2 | Query & Response log analysis | One use case here is drilling down to look at actual queries during a spike in one of the usual metrics observed above to find out the query source, name queried for, etc, to identify the source of malicious traffic. This is the sort of data that would be ideally analyzed and discarded, because keeping a lot of it around would become expensive. One issue is that today, enabling DNSTAP on BIND requires restarting BIND. | | >1 |
|
|
|
| 6.3 | Response latency | (average, max, min, mode - time between receiving a query and sending a response, as well as for resolvers, **whether the response came from cache or not**) (Serious providers will use test clients at various locations in the network to continuously test/audit the dns service. We should consider attempting to support that at some point in the future.) | | >1 |
|
|
|
| 6.4 | Cache hit ratio | % of queries answered from cache (time series) | | 1 |
|
|
|
| 6.5 | Cache aging | cache size, average ttl of records in cache, # of records pre-fetched and # of those that expired without being re-queried, top 500(?) records most frequently queried, cache cleaning (how dirty is the cache) | | 1 |
|
|
|
| 6.6 | Cache visualization | some chart that helps to visualize what sort of data is in the cache, how much is being renewed with short TTLs, how much is being prefetched, etc. The ultimate goal is to help the user optimize the cache so it is most efficient for their purpose. How much memory is consumed. | | >1 |
|
|
|
| 6.7 | Memory utilization | what is named's current memory allocation being used for. Esp needed by hybrid server operators (amt used for auth vs recursive) | | 1 |
|
|
|
|
|
|
|
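The cache hit ratio time series in 6.4 can be derived from periodically sampled cumulative counters. A minimal sketch, assuming each sample is a `(timestamp, hits, misses)` tuple of monotonically increasing counters. The sample layout and the `hit_ratio_series` function are assumptions for illustration and do not correspond to specific BIND statistics names.

```python
def hit_ratio_series(samples):
    """Turn cumulative (timestamp, hits, misses) samples into per-interval hit percentages.

    Returns a list of (timestamp, percent) pairs, one per interval between
    consecutive samples. Intervals with no queries yield None, and intervals
    where a counter went backwards (e.g. after a restart) are skipped.
    """
    series = []
    for (_, hits0, misses0), (t1, hits1, misses1) in zip(samples, samples[1:]):
        hits = hits1 - hits0
        misses = misses1 - misses0
        if hits < 0 or misses < 0:
            continue  # counter reset; this interval cannot be computed
        total = hits + misses
        percent = 100 * hits / total if total else None
        series.append((t1, percent))
    return series


samples = [(0, 0, 0), (60, 80, 20), (120, 170, 30), (180, 170, 30)]
series = hit_ratio_series(samples)
```

Because only counter differences are kept, the raw query data does not need to be preserved, in line with the approach described at the top of this section.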
... | ... | |