- Topic - short description of the issue
- Details - more details about the issue, some examples, etc
- Severity - how severe the issue is for users (high, medium, low)
- Priority - how quickly we should fix it (high, medium, low)
- GL # - GitLab issue number
| Topic | Details | Severity | Priority | GL # |
|---|---|---|---|---|
| No clear hierarchy of pages and entities | On the app page it is not clearly visible that the app belongs to a machine, so the relation between them is unclear. The menu structure is not clear: there is no explicit link to the dashboard; Kea DHCP (apps) and Machines are in the Services group, but why? What are Services? Are Apps and Machines Services? | high | | |
| Mixing of the machine, app, daemon and service entities | The machine page and the app page are perceived as the same thing | medium | | |
| No good navigation to problem details | Places that indicate a problem with a red mark do not lead to another place with more details, so it is hard to conclude anything about the problem or how to deal with it | high | | |
| Inconsistent links to machine | A machine is shown sometimes by hostname and sometimes by address | medium | | |
| Resolution issues | 1280px width support is expected | low | | |
Comments from Vicky
I think you have identified the primary issues. I know these comments don't really belong here, but I didn't have another place handy to put them.
Screen Resolution: (I think we need to lower our expectations here) I took a poll of our own infrastructure team, and their screen resolution is all over the place. The answer is: ask 4 people and you get 4 very different answers.
- R - High res: "I use a 4K 27" monitor as a second screen to my MBP (13", 2560 x 1600). The resolution on the former is something like 3840 x 2160."
- E - I'm on a 25" at 1440p. My old MBA has some 1280x1000 or so resolution.
- Mr. W - I've been working on a single 24" 1920x1080 monitor
- D - would the snarky answer of "80x25" help? On a more serious note, it would be useful if the thing parsed somehow in lynx/links/etc. or had a lite/mobile-friendly/reactive mode. Lots of NOCs have the company hand-me-downs, and lots of admins are looking at things on their secondary or everyday-carry devices.
Currently, many of our Kea users chose Kea because they want maximum control and they are relatively confident of their technical skills. They are trying to save money by using open source. They are likely to have a senior engineer with good network and UNIX administration skills managing the network. This person may be the ONLY one managing Kea, and they probably have a number of other critical responsibilities. They chose Kea over existing commercial systems that have extensive management capabilities and controls. They may have used ISC DHCP previously, and are thinking they don't want to take on supporting very old software. In some cases they have struggled with the initial Kea installation and configuration, particularly with determining how to design the system to give the right clients the intended options and addresses.
Many of these are greenfield ISPs and they don't have legacy systems to integrate with. Their clients are pretty homogeneous and they have good control over what clients are used, so they think their requirements are basic. However, they are growing fast, or hope to grow fast, so they are focused on scalability and high availability. It is common to also have some idea for differentiating clients for providing tiered service to improve revenue that they hope to implement with Kea.
Some Kea users are universities, and they may choose Kea because it reportedly has better IPv6 support than ISC DHCP. They may have a large and complicated network, but be trialing Kea in one department or location. They may have more administrative systems (e.g. user authentication, established fault monitoring, help desk) to integrate with, but they might have more of an academic interest in Kea, and their users might be more flexible about being 'experimented on.'
Hierarchy of needs
These are requirements in order of urgency, and in the order I think we should be trying to satisfy them. This is pretty much the order in which we ARE and have been doing them. I think we should develop a canonical list of 'tasks' the system administrator has to do. When we are struggling with a UI issue, we should ask ourselves, which is the most important task we are trying to support here, and what information or controls does the user need to do that task?
level 1 - Hello World. Dashboard
- Is the system alive, can I contact it? (multiple panels show this today, but different organizations may have a different 'index' field, IP address, hostname, nickname, physical location, department..)
- Are clients reaching it, and is it configured correctly? Is the system processing requests? (grafana stats)
- Is the system going to run out of resources anytime soon? I need to go work on something else for a while. (see pool, CPU, memory utilization. May also need to see how these stats vary over time, to tell a time-of-day issue from an overall trend.)
- Is the failover system I set up going to work when there is a failure? (see HA status, perform HA 'test'- tbd)
- How can I get this thing to alert me when there is a problem? (check dashboard, set up grafana alerts??)
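The "run out of resources anytime soon" question above could be approximated by fitting a straight line to utilization samples. The following is a minimal sketch, not something Stork provides: the function name `hours_until_full` and the `(hour, utilization_percent)` input format are assumptions made for illustration, and a plain least-squares fit ignores time-of-day cycles unless the samples span several days.

```python
def hours_until_full(samples, limit=100.0):
    """Estimate hours after the last sample until utilization hits `limit`.

    `samples` is a list of (hour, utilization_percent) pairs (hypothetical
    input format). Returns None when utilization is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    # Ordinary least-squares slope: utilization points gained per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    last_t = max(t for t, _ in samples)
    # Time at which the fitted line crosses the limit, relative to now.
    return (limit - intercept) / slope - last_t
```

For example, samples growing by 2 percentage points per hour from 50% would suggest roughly a day of headroom. A dashboard would more likely show this as a trend line than as a single number.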
level 2 - App/System details
Network admin reports some problem, possibly connectivity or reachability.
Now that there is a problem, how do I troubleshoot it?
- is it the app, or the server itself? (look at app and host machine in 1 panel)
- look at the history, when did it start? (look at logs)
- what do these log messages (stats, alarms) mean (check documentation)
- is it a connectivity problem? (ping, traceroute, look at other clients in same subnet)
- is it an issue with the relay, CMTS or CPE? (look at clients/lease data by relay)
- what other things happened at the time the problem started? (check grafana)
- Was there a configuration change? (check last reload time)
- Is this intermittent or on-going?
- If the problem seems to be a client that can contact Kea but isn't getting a lease, why not? What pool would they hit? Is the pool exhausted?
- Is there an increase in NAKs? Can I see which clients are getting NAKed? Can I see the contents of the client request?
- Is there congestion? What is causing it? Can I see which clients are repeatedly sending discovers or requests?
- Given a MAC address, can I tell when that client last had a lease, and when it expired? (this kind of problem is much more likely to be pursued in an enterprise than at an ISP)
- Is this a new customer? How can I tell if this client has a reservation? Is there a problem with our provisioning system?
- What do I have to send to ISC to get help? (OS details, request for save/package data button, possibly screenshot)
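The congestion question in the list above ("which clients are repeatedly sending discovers or requests?") comes down to counting messages per client. Below is a minimal sketch under stated assumptions: the function `noisy_clients` and the `(mac, message_type)` event pairs are hypothetical, and how such pairs would be extracted from Kea logs or packet captures is deliberately left out.

```python
from collections import Counter


def noisy_clients(events, msg_type="DHCPDISCOVER", threshold=3):
    """Return {mac: count} for clients that sent `msg_type` at least
    `threshold` times.

    `events` is an iterable of (mac, message_type) pairs; this input
    format is an assumption made for the example.
    """
    counts = Counter(mac for mac, kind in events if kind == msg_type)
    return {mac: n for mac, n in counts.items() if n >= threshold}


events = [
    ("aa:bb:cc:00:00:01", "DHCPDISCOVER"),
    ("aa:bb:cc:00:00:01", "DHCPDISCOVER"),
    ("aa:bb:cc:00:00:02", "DHCPDISCOVER"),
    ("aa:bb:cc:00:00:01", "DHCPDISCOVER"),
    ("aa:bb:cc:00:00:02", "DHCPREQUEST"),
]
repeaters = noisy_clients(events)
```

Here only the first client crosses the threshold. In a UI this would probably be a sortable "top talkers" list rather than a fixed cutoff.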
level 3 - After the issue - planning for stability
- Is there a way to test to confirm that the problem is gone and ensure on an on-going basis that it hasn't recurred?
- What performance am I getting out of this system? How many transactions per second is each server managing? (LPS on dashboard)
- How many more clients can I support before I need to add more systems? (LPS, # current leases, pool utilization, cpu utilization.. calculator?)
- What parts of the system are the slowest? Is it the db, or the Kea server CPU or something else? (tbd)
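The "calculator?" idea in the capacity question above could look something like the sketch below. Everything in it is an assumption for illustration: the function name `client_headroom`, its inputs, the 80% safety margin, and the simplification that leases per second (LPS) grows linearly with the number of clients.

```python
def client_headroom(current_leases, pool_size, current_lps, max_lps,
                    safety_pct=80):
    """Estimate how many more clients fit before a limit is reached.

    Inputs are hypothetical measurements supplied by the admin:
    pool_size is the total number of addresses, and max_lps is the
    highest leases-per-second rate the server was measured to sustain.
    """
    if current_leases <= 0 or current_lps <= 0:
        raise ValueError("need non-zero current load to extrapolate")
    # Limit 1: addresses left before the pools reach the safety margin.
    by_pool = pool_size * safety_pct // 100 - current_leases
    # Limit 2: clients supportable if LPS scales linearly with clients.
    max_clients = int(max_lps * safety_pct * current_leases
                      / (100 * current_lps))
    by_lps = max_clients - current_leases
    # The tighter of the two limits wins; never report a negative number.
    return max(0, min(by_pool, by_lps))
```

With 5000 leases in a 10000-address pool at 50 LPS and a measured ceiling of 200 LPS, the pool is the tighter limit. CPU utilization could be added as a third limit in the same way.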
level 4 - More efficient operations
- My boss would like some kind of report to see how many devices we have on the network. She would like a breakdown by device type (laptop, mobile, polycom, server, printer). (lease list, fingerprinting) She would like to see the growth in clients over time to use for projections. She would like to see the time to get a lease (responsiveness) or other random statistics.
- We have a process for bringing up a new POP or bringing up a new customer that we would like to integrate with Kea. This might include a custom hook, or custom provisioning step.
- We want to offer a new service, with longer/shorter leases, fewer/more addresses per cpe, IPv6 only, dual-stack...
- We hired another person to help with DHCP. Can we share the configuration tasks and keep track of the changes with this thing? (tbd)
- We are able to plan ahead and foresee needing to do some maintenance. Can I schedule that here? How do I gradually bleed the traffic off a machine to take it out of service? How do I trigger the failover so I can work on the other partner in the pair? (tbd)
- Someone screwed up the DHCP! How do I tell which of my colleagues did that? (tbd)
- The typical BIND user who might try Stork is going to already have other tools for managing BIND. Most BIND users have nagios, zabbix or cacti or something similar for fault management. They will also have provisioning systems. These vary widely, but there are many open source tools out there for zone file provisioning, as well as many home-grown tools.
- BIND users know they have to be able to mine the logs for information. Information about how to set up logs, what to look for - these are the hottest queries in the KB and the most popular webinars. However, over-logging can seriously impact BIND performance.
- BIND users have been asking for a supported Prometheus exporter, because it is essential in DNS to monitor the query make-up to spot DDOS attacks.
These users are looking for solutions to problems they can't address with existing tools. By definition, none of these are simple problems and all will likely require work in BIND as well as in Stork. These include:
- troubleshooting performance problems. This requires looking at the platform, activity, and the application together. Ultimately, it also requires looking at configuration options that affect performance.
- troubleshooting issues related to what is in cache. This may require looking at data in the ADB which is not currently exposed to the user.
- monitoring and troubleshooting issues related to zone file updates. These can impact query performance. In case of signed zones, this may also require monitoring signing operations, which can take a long time (e.g. an hour) in case of very large zone files.