Statepuller - only one refresh at a time
tl;dr: Statepuller occasionally returns HTTP 500 when updating an agent's state, but it doesn't corrupt the database.
Description
We have a hard-to-resolve race condition when updating machine state in Stork ("statepuller.go", GetMachineAndAppsState). The problem occurs when Stork tries to refresh an application's state from multiple goroutines at the same time.
A refresh may be triggered:
- On Stork start
- Periodically
- On user request
Several refresh procedures may run at the same time. A state refresh consists of these steps:
1. Get the state from an agent
2. Get the state from the database
3. Calculate the diff
4. Apply the changes to the application
5. Apply the changes to the subnets and other data
Points 2, 4, and 5 each run in a separate transaction. It may happen that after one goroutine has fetched the state from the database (2.) and calculated the diffs (3.), another goroutine modifies the application. The calculated diffs are then stale and incorrect. The error is raised at point 4, where the unique index constraints are checked.
The refresh fails at point 4. I tried to fix it by handling the unique constraint violation, but processing then continues to point 5. There are no unique indexes there, so all changes pass through, and the subnets and other data end up duplicated.
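One possible way to enforce "only one refresh at a time" is to serialize refreshes per machine with a keyed mutex, so that steps 1 to 5 of one refresh finish before the next one starts. The following is a minimal sketch under stated assumptions, not the actual Stork implementation: `refreshGuard`, `lockFor`, and `countOverlaps` are hypothetical names, and machines are assumed to be identified by an `int64` ID.

```go
package main

import (
	"fmt"
	"sync"
)

// refreshGuard serializes state refreshes per machine.
// Hypothetical helper for illustration; it is not part of Stork.
type refreshGuard struct {
	mu    sync.Mutex
	locks map[int64]*sync.Mutex
}

func newRefreshGuard() *refreshGuard {
	return &refreshGuard{locks: make(map[int64]*sync.Mutex)}
}

// lockFor returns the mutex dedicated to the given machine ID,
// creating it on first use.
func (g *refreshGuard) lockFor(machineID int64) *sync.Mutex {
	g.mu.Lock()
	defer g.mu.Unlock()
	l, ok := g.locks[machineID]
	if !ok {
		l = &sync.Mutex{}
		g.locks[machineID] = l
	}
	return l
}

// Refresh runs fn while holding the machine's lock, so two refreshes of
// the same machine never overlap. Different machines do not block each other.
func (g *refreshGuard) Refresh(machineID int64, fn func() error) error {
	l := g.lockFor(machineID)
	l.Lock()
	defer l.Unlock()
	return fn()
}

// countOverlaps starts n concurrent refreshes of one machine and reports
// how many of them observed another refresh already in progress.
func countOverlaps(g *refreshGuard, n int) int {
	var wg sync.WaitGroup
	var stateMu sync.Mutex
	active, overlaps := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = g.Refresh(1, func() error {
				stateMu.Lock()
				active++
				if active > 1 {
					overlaps++
				}
				stateMu.Unlock()

				// Steps 1 to 5 of the refresh would run here.

				stateMu.Lock()
				active--
				stateMu.Unlock()
				return nil
			})
		}()
	}
	wg.Wait()
	return overlaps
}

func main() {
	g := newRefreshGuard()
	fmt.Println("overlapping refreshes:", countOverlaps(g, 50))
}
```

This approach only protects refreshes running inside a single server process, and the lock map in the sketch never shrinks. If several processes could write to the same database, a database-level lock or a single transaction spanning steps 2 to 5 would be needed instead.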
Risk analysis
The problem is quite rare, as it occurs only when the state of a new agent is inserted. It should also not be very dangerous, as the unique constraints protect the database against duplicated data.
However, it may be a problem if somebody wants to use the Stork API in an external project (as the API unexpectedly returns HTTP 500), and it may interrupt processing in a function that calls the state update.