Commit Graph

31 Commits

Author SHA1 Message Date
Christian Schwarz
69bfcb7bed daemon/active: implement watchdog to handle stuck replication / pruners
ActiveSide.do() can only run sequentially, i.e. we cannot run
replication and pruning in parallel. Why?

* go-streamrpc only allows one active request at a time
(this is bad design and should be fixed at some point)
* replication and pruning are implemented independently, but work on the
same resources (snapshots)

A: pruning might destroy a snapshot that is planned to be replicated
B: replication might replicate snapshots that should be pruned

We do not have any resource management / locking for A and B, but we
have a use case where users don't want their machine fill up with
snapshots if replication does not work.
That means we _have_ to run the pruners.

A further complication is that we cannot just cancel the replication
context after a timeout and move on to the pruner: it could be initial
replication and we don't know how long it will take.
(And we don't have resumable send & recv yet).

With the previous commits, we can implement the watchdog using context
cancellation.
Note that the 'MadeProgress()' calls can only be placed right before
non-error state transition. Otherwise, we could end up in a live-lock.
2018-10-19 17:23:00 +02:00
Christian Schwarz
4ede99b08c replication: simpler PermanentError state + handle context cancellation 2018-10-19 17:23:00 +02:00
Christian Schwarz
3c06235dca replication + zfs: leave From field instead of To field empty for initial send 2018-10-14 13:06:23 +02:00
Christian Schwarz
59a4e2db5f replication: regenerate pdu.pb with new protoc-gen-go 2018-10-13 17:23:39 +02:00
Christian Schwarz
af3d96dab8 use enumer generate tool for state strings 2018-10-12 22:10:49 +02:00
Christian Schwarz
cb83a26c90 replication: wakeup + retry handling: make wakeups work in retry wait states
- handle wakeups in Planning state
- fsrep.Replication yields immediately in RetryWait
- once the queue only contains fsrep.Replication in retryWait:
transition replication.Replication into WorkingWait state
- handle wakeups in WorkingWait state, too
2018-10-12 13:12:28 +02:00
Christian Schwarz
d17ecc3b5c replication/fsrep: report Pending[0] problem as fsrep problem in RetryWait state 2018-10-12 12:45:37 +02:00
Christian Schwarz
2990193512 replication: export SleepUntil in report 2018-09-24 19:23:53 +02:00
Christian Schwarz
fa47667f31 bring back prometheus metrics, with new metrics for replication state machine 2018-09-07 22:22:34 -07:00
Christian Schwarz
975fdee217 replication & pruning: ditch replicated-property, use bookmark as cursor instead
A bookmark with a well-known name is used to track which version was
last successfully received by the receiver.
The createtxg that can be retrieved from the bookmark using `zfs get` is
used to set the Replicated attribute of each snap on the sender:
If the snap's CreateTXG > the cursor's, it is not yet replicated,
otherwise it has been.

There is an optional config option to change the behvior to
`CreateTXG >= the cursor's`, and the implementation defaults to that.

The reason: While things work just fine with `CreateTXG > the cursor's`,
ZFS does not provide size estimates in a `zfs send` dry run
(see acd2418).
However, to enable the use case of keeping the snapshot only around for
the replication, the config flag exists.
2018-09-05 19:51:06 -07:00
Christian Schwarz
acd2418803 handle DryRun send size estimate errors with bookmarks 2018-09-05 17:41:25 -07:00
Christian Schwarz
8eade3d20a replication/pdu: fix broken test 2018-09-04 17:01:46 -07:00
Christian Schwarz
be57d6ce8e replication/diff: replace invalid comparison of CreateTXG with Creation 2018-09-04 14:01:48 -07:00
Christian Schwarz
0c4a3f8dc4 pruning/history: properly communicate via rpc if snapshot does not exist 2018-09-04 14:01:48 -07:00
Christian Schwarz
ad28fd1ecb replication: diff does not need special case for receiver/sender == nil 2018-09-02 15:46:42 -07:00
Christian Schwarz
b95e983d0d bump go-streamrpc to 0.2, cleanup logging
logging should be user-friendly in INFO mode
2018-09-02 15:45:18 -07:00
Anton Schirg
f387e23214 fix: at least two snapshots were needed to start replication 2018-08-30 19:20:18 +02:00
Anton Schirg
48feaff054 fix some status display alignment 2018-08-30 15:21:07 +02:00
Anton Schirg
b5957aca37 do dry runs in planning stage to estimate size of all sends 2018-08-30 12:59:16 +02:00
Anton Schirg
98f3f3dfd8 show expected size of current send
Needs to be changed to send sizes for all planned steps
2018-08-30 12:58:13 +02:00
Anton Schirg
6ca11a7391 byte counter for status 2018-08-30 12:54:30 +02:00
Christian Schwarz
22ca80eb7e remote snapshot destruction & replication status zfs property 2018-08-30 11:51:47 +02:00
Christian Schwarz
a2aa8e7bd7 finish pruner implementation 2018-08-29 19:00:45 +02:00
Christian Schwarz
ee5445777d logging format 'human': continue printing prefixed fields if some are missing 2018-08-26 19:13:09 +02:00
Christian Schwarz
7ff72fb6d9 replication: document most important aspects of Endpoint interface 2018-08-26 15:12:43 +02:00
Christian Schwarz
cf01086df5 build: pin protoc version and update protobuf + regenerate 2018-08-26 14:35:18 +02:00
Christian Schwarz
71203ab325 move various timeouts to package-level variables 2018-08-25 22:30:16 +02:00
Christian Schwarz
88de8ba8bb initial repl policy: get rid of unimplemented options 2018-08-25 22:23:47 +02:00
Christian Schwarz
e30ae972f4 gofmt 2018-08-25 21:30:25 +02:00
Christian Schwarz
54c9dcb7c1 move replication policy constants to package replication 2018-08-22 10:11:14 +02:00
Christian Schwarz
7b3a84e2a3 move replication package to project root (independent of cmd package) 2018-08-22 00:19:03 +02:00