Tomek Mrugalski · 4a37ae7a
--- a/designs/Agent-Server-Communication.md
+++ b/designs/Agent-Server-Communication.md
+[[_TOC_]]
+# Intro
+
+Communication between Agents and the Server is based on [gRPC](https://grpc.io/), an RPC implementation that uses Protocol Buffers (ProtoBuf) for data serialization and HTTP/2 for transport. gRPC supports unary calls (a single request followed by a single response) as well as streaming, where the client, the server, or both sides stream messages.
+
+# Establishing Connection
+
+The Server initiates the connection using the gRPC `Dial` method. The connection is stored in a runtime Agents map for later reuse. From that moment the Server has an established connection to the Agent and can invoke the remote commands defined in the gRPC API.
+
+```mermaid
+sequenceDiagram
+    Server->>Agent: New gRPC connection
+    Server->>Agent: gRPC call e.g. RestartKea
+    Agent-->>Server: gRPC response for RestartKea
+    Server->>Agent: Another gRPC call 
+    Agent-->>Server: Another gRPC response
+```
+
+# Connections Maintenance
+
+An established connection is stored in a map (hash table), the so-called `AgentsMap`. The key is the Agent's IP address and port, e.g. "10.0.0.6:8080", and the value is a structure that holds, among other things, the connection. To call a given Agent, its IP address and port are required: the address is used to look up the connection in `AgentsMap`, and then the call to the Agent can be made.
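+
+The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: `Conn`, `DialFunc`, `Agent`, `NewAgentsMap` and `Get` are hypothetical names, `Conn` stands in for the real gRPC connection so that the example needs nothing beyond the standard library, and dialing on the first lookup is an assumption rather than something this design prescribes.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net"
+	"strconv"
+	"sync"
+)
+
+// Conn stands in for the gRPC connection (hypothetical placeholder type).
+type Conn interface {
+	Close() error
+}
+
+// Agent is the value kept in the map; among other things it holds the connection.
+type Agent struct {
+	Address string
+	Conn    Conn
+}
+
+// DialFunc opens a connection to an "IP:port" address (assumed helper).
+type DialFunc func(address string) (Conn, error)
+
+// AgentsMap maps "IP:port" keys to Agent entries and is safe for concurrent use.
+type AgentsMap struct {
+	mu     sync.Mutex
+	dial   DialFunc
+	agents map[string]*Agent
+}
+
+func NewAgentsMap(dial DialFunc) *AgentsMap {
+	return &AgentsMap{dial: dial, agents: make(map[string]*Agent)}
+}
+
+// Get returns the stored Agent for ip:port, dialing and storing it on first use.
+func (m *AgentsMap) Get(ip string, port int) (*Agent, error) {
+	key := net.JoinHostPort(ip, strconv.Itoa(port))
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if agent, ok := m.agents[key]; ok {
+		return agent, nil // reuse the existing connection
+	}
+	conn, err := m.dial(key)
+	if err != nil {
+		return nil, fmt.Errorf("dialing agent %s: %w", key, err)
+	}
+	agent := &Agent{Address: key, Conn: conn}
+	m.agents[key] = agent
+	return agent, nil
+}
+
+// fakeConn is a no-op connection used only by this example.
+type fakeConn struct{}
+
+func (fakeConn) Close() error { return nil }
+
+func main() {
+	agents := NewAgentsMap(func(string) (Conn, error) { return fakeConn{}, nil })
+	agent, err := agents.Get("10.0.0.6", 8080)
+	fmt.Println(agent.Address, err)
+}
+```
+
+In real code the `DialFunc` would wrap the gRPC dial call and `Conn` would be the gRPC client connection; the mutex matters because several goroutines may look up Agents at the same time.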
+
+# Calls to Agents
+
+Connections to Agents can break at any time, and the Server does not learn about it until it tries to use the connection. To manage calls to Agents, a queue of calls is maintained in StorkDB. When any part of Stork Server wants something from an Agent, it puts its request (task) into a StorkDB table called `agent_task`. Each agent task has:
+1. a reference to the Agent's entry in the `agent` table, which holds the Agent's IP address
+1. task arguments
+1. task result
+1. task state, one of [new, waiting-for-agent-response, completed]
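+
+As an illustration of the fields listed above, an `agent_task` row could be modelled like this. The field names, the use of strings for arguments and result, and the strictly linear state transitions (new, then waiting-for-agent-response, then completed) are assumptions made for this sketch, not part of the design.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// TaskState is one of the three task states listed in the design.
+type TaskState string
+
+const (
+	StateNew       TaskState = "new"
+	StateWaiting   TaskState = "waiting-for-agent-response"
+	StateCompleted TaskState = "completed"
+)
+
+// AgentTask mirrors one row of the agent_task table (hypothetical field names).
+type AgentTask struct {
+	AgentID int64  // reference to the entry in the agent table
+	Args    string // task arguments
+	Result  string // task result, set on completion
+	State   TaskState
+}
+
+// Start marks a new task as sent to the Agent.
+func (t *AgentTask) Start() error {
+	if t.State != StateNew {
+		return errors.New("task already started")
+	}
+	t.State = StateWaiting
+	return nil
+}
+
+// Complete stores the result of a task that was waiting for the Agent.
+func (t *AgentTask) Complete(result string) error {
+	if t.State != StateWaiting {
+		return errors.New("task is not waiting for an agent response")
+	}
+	t.Result, t.State = result, StateCompleted
+	return nil
+}
+
+func main() {
+	task := AgentTask{AgentID: 1, Args: "RestartKea", State: StateNew}
+	_ = task.Start()
+	_ = task.Complete("ok")
+	fmt.Println(task.State, task.Result)
+}
+```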
+
+# Agent Tasks Manager
+
+Agent Tasks Manager is a goroutine that:
+
+1. periodically looks for pending tasks in the `agent_task` table and executes them, or cleans them up if they have expired or are no longer relevant
+1. receives signals over a channel from other parts of Stork Server when a new task is added to the `agent_task` table (so the task is executed sooner, without waiting for the periodic sweep)
+1. on a new task for a given Agent, checks that no task is currently being executed for that Agent and then starts the task's execution in a sub-goroutine
+
+The concept is that a given Agent has only one task being executed at a time. To avoid blocking the Agent Tasks Manager, it spawns a separate goroutine for each task of a given Agent. So there can be multiple goroutines executing different tasks for different Agents, but always only one for a given Agent.
+
+When the goroutine receives a response from the Agent, it stores the result in the `agent_task` table.
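+
+A minimal sketch of the loop described in this section is shown below. It rests on several assumptions: the `agent_task` table is replaced by an in-memory slice, the `call` function stands in for the gRPC call to the Agent, all names (`Manager`, `Task`, `Add`, `Run`, `Pending`) are hypothetical, and cleanup of expired tasks is left out.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"time"
+)
+
+// Task is a simplified stand-in for an agent_task row.
+type Task struct {
+	ID            int
+	Agent         string // "IP:port" of the target Agent
+	State, Result string
+}
+
+// Manager runs at most one task per Agent at a time.
+type Manager struct {
+	mu     sync.Mutex
+	tasks  []*Task         // stand-in for the agent_task table
+	busy   map[string]bool // Agents with a task in flight
+	notify chan struct{}   // wakes the manager before the next periodic sweep
+	call   func(agent string, taskID int) string
+	wg     sync.WaitGroup
+}
+
+func NewManager(call func(string, int) string) *Manager {
+	return &Manager{busy: map[string]bool{}, notify: make(chan struct{}, 1), call: call}
+}
+
+// signal wakes the manager; if a wake-up is already pending, nothing is added.
+func (m *Manager) signal() {
+	select {
+	case m.notify <- struct{}{}:
+	default:
+	}
+}
+
+// Add queues a task and signals the manager.
+func (m *Manager) Add(agent string) *Task {
+	m.mu.Lock()
+	t := &Task{ID: len(m.tasks) + 1, Agent: agent, State: "new"}
+	m.tasks = append(m.tasks, t)
+	m.mu.Unlock()
+	m.signal()
+	return t
+}
+
+// Pending reports how many tasks are not completed yet.
+func (m *Manager) Pending() int {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	n := 0
+	for _, t := range m.tasks {
+		if t.State != "completed" {
+			n++
+		}
+	}
+	return n
+}
+
+// sweep starts one sub-goroutine per idle Agent that has a new task.
+func (m *Manager) sweep() {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	for _, t := range m.tasks {
+		if t.State != "new" || m.busy[t.Agent] {
+			continue
+		}
+		m.busy[t.Agent] = true
+		t.State = "waiting-for-agent-response"
+		m.wg.Add(1)
+		go m.execute(t)
+	}
+}
+
+// execute calls the Agent, stores the result and frees the Agent for its next task.
+func (m *Manager) execute(t *Task) {
+	defer m.wg.Done()
+	result := m.call(t.Agent, t.ID)
+	m.mu.Lock()
+	t.Result, t.State = result, "completed"
+	m.busy[t.Agent] = false
+	m.mu.Unlock()
+	m.signal()
+}
+
+// Run sweeps on every tick and on every signal until stop is closed,
+// then waits for the tasks that are still in flight.
+func (m *Manager) Run(stop <-chan struct{}, period time.Duration) {
+	ticker := time.NewTicker(period)
+	defer ticker.Stop()
+	for {
+		select {
+		case <-stop:
+			m.wg.Wait()
+			return
+		case <-ticker.C:
+		case <-m.notify:
+		}
+		m.sweep()
+	}
+}
+
+func main() {
+	m := NewManager(func(agent string, id int) string { return "done" })
+	stop, done := make(chan struct{}), make(chan struct{})
+	go func() { m.Run(stop, 50*time.Millisecond); close(done) }()
+	t := m.Add("10.0.0.6:8080")
+	for m.Pending() > 0 {
+		time.Sleep(time.Millisecond)
+	}
+	close(stop)
+	<-done
+	fmt.Println(t.State, t.Result)
+}
+```
+
+The `busy` map is what enforces the one-task-per-Agent rule; a completed task signals the manager so that the next task queued for the same Agent starts without waiting for the periodic sweep.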
+
+TODO: how can another part of Stork Server learn that its task is completed? Is it waiting for the completion? Maybe a callback should be registered for a given task type and executed on task completion?
+
+# Discussions
+The question was which party should initiate the connection.
+
+## Agent Initiates Connection
+This means that Agent is a client and Server is a server.
+
+### Pros
+
+1. only one TLS certificate, the Server's, is needed (to prevent man-in-the-middle attacks, so that the Agent can trust the Server)
+1. only one port, on the Server, needs to be opened in the firewall
+
+### Cons
+
+1. the Server cannot call RPC functions on the Agent because it is the Server that exposes them; the Server may only stream messages in response to one long-lived call from the Agent [dev problem]
+1. a large number of Agents may overload the Server [dev problem]
+
+## Server Initiates Connection
+This means that Agent is a server and Server is a client.
+
+### Pros
+
+1. Server can easily throttle calls to Agents
+1. Server can call regular RPC functions exposed by Agents
+
+### Cons
+
+1. each Agent needs a TLS certificate so that the Server can validate that it is talking to the given, real Agent (to prevent man-in-the-middle attacks) [ux problem]
+1. a port must be opened in the firewall for each Agent [ux problem]
+
+## Aspects that may influence the decision
+
+1. Prometheus had a similar problem. They decided to go with server => agents connections because this scales better. If too many agents are connected to a single server, the server will still poll through all of them, only more slowly. If instead the agents connected to the server, the server would be swamped with data, and information would be lost due to full socket buffers.
+
+2. The services (Kea, BIND) will likely be running on machines that are heavily firewalled. It is generally better to punch an outbound hole in a firewall (a connection from the host to the outside, i.e. agent to server), as this avoids opening a possible attack vector.
+
+3. For the certs/keys exchange, we could do something similar to what Apple does. When you connect two devices, one displays a 4-6 digit code and the other expects the code to be entered, to be sure that you are connecting the right device and not a spoofed one.
+
+4. Far in the future, we may implement some sort of discovery mechanism: one entity (server or agent) that is willing to register or accept new agents would send a v4 broadcast or v6 multicast. This would not be bullet-proof and would be prone to errors in routed networks. But it could be convenient, especially if we go ahead with the concept of providing VM images or Raspberry Pis. Obviously, all of this is far in the future, but the prospect of doing something like that may influence the design of simpler things now. [Godfryd] This will be limited to a single subnet; otherwise network admins would need to set up forwarding of such packets on switches, as is done for DHCP. So this requires even more effort from users to set up Stork.
+
+5. We need to take into account that each machine with BIND or Kea will be connected to Prometheus as well. That model assumes connections from the Prometheus server to the machines. We assumed that measurements will be served by Stork Agent, i.e. it will be an exporter. So this may require opening another port in the firewall for all such connections, as described in point 2.
+
+
+Decision: Server initiates connection.
\ No newline at end of file