* Remove explicity state machine code for all but replication.Replication
* Introduce explicit error types that satisfy interfaces which provide
sufficient information for replication.Replication to make intelligent
retry + queuing decisions
* Temporary()
* LocalToFS()
* Remove the queue and replace it with a simple array that we sort each
time (yay no generics :( )
Pruner now backs off as soon as there is an error, making that error the
Error field in the pruner report.
The error is also stored in the specific *fs that failed, and we
maintain an error counter per *fs to de-prioritize those fs that failed.
Like with replication, the de-prioritization on errors is to avoid '
getting stuck' with an individual filesystem until the watchdog hits.
ActiveSide.do() can only run sequentially, i.e. we cannot run
replication and pruning in parallel. Why?
* go-streamrpc only allows one active request at a time
(this is bad design and should be fixed at some point)
* replication and pruning are implemented independently, but work on the
same resources (snapshots)
A: pruning might destroy a snapshot that is planned to be replicated
B: replication might replicate snapshots that should be pruned
We do not have any resource management / locking for A and B, but we
have a use case where users don't want their machine fill up with
snapshots if replication does not work.
That means we _have_ to run the pruners.
A further complication is that we cannot just cancel the replication
context after a timeout and move on to the pruner: it could be initial
replication and we don't know how long it will take.
(And we don't have resumable send & recv yet).
With the previous commits, we can implement the watchdog using context
cancellation.
Note that the 'MadeProgress()' calls can only be placed right before
non-error state transition. Otherwise, we could end up in a live-lock.
Also:
- Defensive measures in control http server (1s timeouts)
(prevent the leak, even if request body is not closed)
- Add prometheus metrics to track control socket latencies
(were used for debugging)