Race and constraint errors when adding Kea with many subnets
This issue is a result of sanity checks for the Stork 0.13.0 release. See #441 (comment 174606)
The errors and debugging results are copied below:
While testing the demo on Ubuntu 20.04, I apparently reproduced the issue mentioned earlier by @godfryd here: !230 (comment 173942)
I got the following error trace:
server_1 | ERRO[2020-11-05 10:03:47] statepuller.go:278 cannot store application state: ERROR #23505 duplicate key value violates unique constraint "access_point_unique_idx"
server_1 | problem with adding access point to app 11: &{11 control 8 127.0.0.1 8000 }
server_1 | isc.org/stork/server/database/model.updateAppAccessPoints
server_1 | /repo/build-root/backend/server/database/model/app.go:68
server_1 | isc.org/stork/server/database/model.AddApp
server_1 | /repo/build-root/backend/server/database/model/app.go:226
server_1 | isc.org/stork/server/apps/kea.CommitAppIntoDB
server_1 | /repo/build-root/backend/server/apps/kea/appkea.go:462
server_1 | isc.org/stork/server/apps.GetMachineAndAppsState
server_1 | /repo/build-root/backend/server/apps/statepuller.go:269
server_1 | isc.org/stork/server/apps.(*StatePuller).pullData
server_1 | /repo/build-root/backend/server/apps/statepuller.go:59
server_1 | isc.org/stork/server/agentcomm.(*PeriodicPuller).pullerLoop
server_1 | /repo/build-root/backend/server/agentcomm/puller.go:93
server_1 | runtime.goexit
server_1 | /repo/build-root/tools/1.13.5/go/src/runtime/asm_amd64.s:1357
server_1 | problem with adding access points to app: &{ID:11 CreatedAt:2020-11-05 10:03:46.320564 +0000 UTC MachineID:8 Machine:0xc000e8e160 Type:kea Active:false Meta:{Version:1.7.3 ExtendedVersion:} AccessPoints:[0xc002c34820] Daemons:[0xc0007c00d0 0xc00023e8f0 0xc00037e0d0 0xc00037ef70]}
server_1 | isc.org/stork/server/database/model.AddApp
server_1 | /repo/build-root/backend/server/database/model/app.go:228
server_1 | isc.org/stork/server/apps/kea.CommitAppIntoDB
server_1 | /repo/build-root/backend/server/apps/kea/appkea.go:462
server_1 | isc.org/stork/server/apps.GetMachineAndAppsState
server_1 | /repo/build-root/backend/server/apps/statepuller.go:269
server_1 | isc.org/stork/server/apps.(*StatePuller).pullData
server_1 | /repo/build-root/backend/server/apps/statepuller.go:59
server_1 | isc.org/stork/server/agentcomm.(*PeriodicPuller).pullerLoop
server_1 | /repo/build-root/backend/server/agentcomm/puller.go:93
server_1 | runtime.goexit
server_1 | /repo/build-root/tools/1.13.5/go/src/runtime/asm_amd64.s:1357
server_1 | ERRO[2020-11-05 10:03:47] statepuller.go:62 error occurred while getting info from machine 8: problem with storing application state in the database
server_1 | INFO[2020-11-05 10:03:47] statepuller.go:67 completed pulling information from machines: 0/1 succeeded
server_1 | ERRO[2020-11-05 10:03:47] puller.go:95 errors were encountered while pulling data from apps: problem with storing application state in the database
server_1 | isc.org/stork/server/apps.(*StatePuller).pullData
server_1 | /repo/build-root/backend/server/apps/statepuller.go:61
server_1 | isc.org/stork/server/agentcomm.(*PeriodicPuller).pullerLoop
server_1 | /repo/build-root/backend/server/agentcomm/puller.go:93
server_1 | runtime.goexit
server_1 | /repo/build-root/tools/1.13.5/go/src/runtime/asm_amd64.s:1357
server_1 | INFO[2020-11-05 10:03:48] middleware.go:42 served request
Looking further into the code, I think this is what happens:
- A user adds a new machine via the UI. The machine runs Kea with many subnets.
- The machine entry is added to the database fairly quickly, and then the apps' state is fetched.
- Before the app is added to the db, its configuration is fetched, so it takes a while before the app entry is created in the db.
- The state puller already monitors this machine, so it tries to get its updated state.
- The state puller checks whether the app is already in the db, but it is not there yet because the initial machine-creation operation is still in progress.
- The state puller therefore assumes it found a new Kea app, since there is no such app in the db yet, and proceeds to add a new app.
- The initial create-machine operation finally completes.
- The state pulling also completes, and the state puller attempts to add the "new app" into the db.
- This operation causes no conflict on (app_id, type) because the puller assigned a new app_id, treating it as a new app.
- This leads to an attempt to insert the new app, which fails on the (machine_id, port) unique constraint.
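The race in the steps above boils down to a classic check-then-insert pattern: both flows see "no app in the db" and both insert. A minimal sketch of why the second insert fails (the `db` type and `addAccessPoint` function are hypothetical stand-ins for the Stork model layer, with a map simulating the database's unique index):

```go
package main

import (
	"fmt"
	"sync"
)

// db simulates the unique index on access points: the first insert for a
// given (machineID, port) key succeeds, any later one violates the constraint.
type db struct {
	mu     sync.Mutex
	points map[string]bool
}

func (d *db) addAccessPoint(machineID, port int) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	key := fmt.Sprintf("%d:%d", machineID, port)
	if d.points[key] {
		return fmt.Errorf("duplicate key value violates unique constraint")
	}
	d.points[key] = true
	return nil
}

func main() {
	d := &db{points: map[string]bool{}}

	// Both the create-machine flow and the state puller observed no app in
	// the db, so both proceed to insert one for machine 8, port 8000.
	err1 := d.addAccessPoint(8, 8000) // initial create-machine operation: succeeds
	err2 := d.addAccessPoint(8, 8000) // concurrent state pull: duplicate key
	fmt.Println(err1, err2)
}
```

Assigning a fresh app_id sidesteps the (app_id, type) conflict, but the (machine_id, port) index still catches the duplicate, which matches the `access_point_unique_idx` error in the trace.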
Overall, the database state should remain consistent because the second instance of the app is never added. However, the parallel updates create a lot of unnecessary work for the server and also generate a lot of traffic.
The solution in #409 (closed) apparently worked because it fetched the existing app by (type) rather than (type, app_id). The right approach, however, would be not to start a state pull before the previous pull has completed. We'd need to implement some signaling for that.
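One way such signaling could look, as a sketch only (the `puller` type and its one-slot channel are an assumption, not the actual Stork `PeriodicPuller` API): a new pull cycle is skipped while the previous one still holds the token.

```go
package main

import "fmt"

// puller sketches the proposed signaling: a one-slot channel acts as a
// busy flag, so a new pull is skipped while the previous one still runs.
type puller struct {
	busy chan struct{} // capacity 1; holding the token means a pull is in progress
}

func newPuller() *puller {
	return &puller{busy: make(chan struct{}, 1)}
}

// pull runs work only if no other pull is in progress; it returns false
// when the cycle was skipped because the previous pull has not completed.
func (p *puller) pull(work func()) bool {
	select {
	case p.busy <- struct{}{}:
		defer func() { <-p.busy }() // release the token when done
		work()
		return true
	default:
		return false // previous pull still running; skip this cycle
	}
}

func main() {
	p := newPuller()
	started := p.pull(func() {
		// While this pull runs, a second attempt is rejected.
		fmt.Println("nested pull started:", p.pull(func() {}))
	})
	fmt.Println("first pull started:", started)
}
```

With this in place, the state puller would simply skip its tick while the initial create-machine flow (or a previous pull) is still committing the app, instead of racing it to the insert.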