zrepl

mirror of https://github.com/zrepl/zrepl.git synced 2024-11-22 08:23:50 +01:00

Author	SHA1	Message	Date
Christian Schwarz	c743c7b03f	refactor snapper & support cron-based snapshotting fixes https://github.com/zrepl/zrepl/issues/554 refs https://github.com/zrepl/zrepl/discussions/547#discussioncomment-1936126	2022-09-25 19:23:44 +02:00
Christian Schwarz	2d8c3692ec	rework resume token validation to allow resuming from raw sends of unencrypted datasets Before this change, resuming from an unencrypted dataset with send.raw=true specified wouldn't work with zrepl due to overly restrictive resume token checking. An initial PR to fix this was made in https://github.com/zrepl/zrepl/pull/503 but it didn't address the core of the problem. The core of the problem was that zrepl assumed that if a resume token contained `rawok=true, compressok=true`, the resulting send would be encrypted. But if the sender dataset was unencrypted, such a resume would actually result in an unencrypted send. Which could be totally legitimate but zrepl failed to recognize that. BACKGROUND ========== The following snippets of OpenZFS code are insightful regarding how the various ${X}ok values in the resume token are handled: - `6c3c5fcfbe/module/zfs/dmu_send.c (L1947-L2012)` - `6c3c5fcfbe/module/zfs/dmu_recv.c (L877-L891)` - https://github.com/openzfs/zfs/blob/6c3c5fc/lib/libzfs/libzfs_sendrecv.c#L1663-L1672 Basically, some zfs send flags make the DMU send code set some DMU send stream featureflags, although it's not a pure mapping, i.e., which DMU send stream flags are used depends somewhat on the dataset (e.g., is it encrypted or not, or, does it use zstd or not). Then, the receiver looks at some (but not all) feature flags and maps them to ${X}ok dataset zap attributes. These are funnelled back to the sender 1:1 through the resume_token. And the sender turns them into lzc flags. As an example, let's look at zfs send --raw. If the sender requests a raw send on an unencrypted dataset, the send stream (and hence the resume token) will not have the raw stream featureflag set, and hence the resume token will not have the rawok field set. Instead, it will have compressok, embedok, and depending on whether large blocks are present in the dataset, largeblockok set. WHAT'S ZREPL'S ROLE IN THIS?
============================ zrepl provides a virtual encrypted sendflag that is like `raw`, but further ensures that we only send encrypted datasets. For any other resume token stuff, it shouldn't do any checking, because it's a futile effort to keep up with ZFS send/recv features that are orthogonal to encryption. CHANGES MADE IN THIS COMMIT =========================== - Rip out a bunch of needless checking that zrepl would do during planning. These checks were there to give better error messages, but actually, the error messages created by the endpoint.Sender.Send RPC upon send args validation failure are good enough. - Add platformtests to validate all combinations of (Unencrypted/Encrypted FS) x (send.encrypted = true | false) x (send.raw = true | false) for both non-resuming and resuming sends. Additional manual testing done: 1. With zrepl 0.5, setup with unencrypted dataset, send.raw=true specified, no send.encrypted specified. 2. Observe that regular non-resuming send works, but resuming doesn't work. 3. Upgrade zrepl to this change. 4. Observe that both regular and resuming send works. closes https://github.com/zrepl/zrepl/pull/613	2022-09-25 17:32:02 +02:00
Cole Helbling	1df0f8912a	Add `--skip-cert-check` flag to `zrepl configcheck` to prevent checking cert files It may be desirable to check that a config is valid without checking for the existence of certificate files (e.g. when validating a config inside a sandbox without access to the cert files). This will be very useful for NixOS so that we can check the config file at nix-build time (e.g. potentially without proper permissions to read cert files for a TLS connection). fixes https://github.com/zrepl/zrepl/issues/467 closes https://github.com/zrepl/zrepl/pull/587	2022-07-08 20:18:41 +02:00
Christian Schwarz	2642c64303	make initial replication policy configurable (most_recent, all, fail) Config: ``` - type: push ... conflict_resolution: initial_replication: most_recent | all | fail ``` The ``initial_replication`` option determines which snapshots zrepl replicates if the filesystem has not been replicated before. If ``most_recent`` (the default), the initial replication will only transfer the most recent snapshot, while ignoring previous snapshots. If all snapshots should be replicated, specify ``all``. Use ``fail`` to make replication of the filesystem fail in case there is no corresponding filesystem on the receiver. Code-Level Changes, apart from the obvious: - Rework IncrementalPath()'s return signature. Now returns an error for initial replications as well. - Rename & rework its consumer, resolveConflict(). Co-authored-by: Graham Christensen <graham@grahamc.com> Fixes https://github.com/zrepl/zrepl/issues/550 Fixes https://github.com/zrepl/zrepl/issues/187 Closes https://github.com/zrepl/zrepl/pull/592	2022-06-26 14:36:59 +02:00
Lapo Luchini	c6a9ebc71c	job/active: add "last completed" metric for error reporting use case: So that I can use a more resilient alerting such as "last complete was sent more than 24h ago". fixes https://github.com/zrepl/zrepl/issues/516 closes https://github.com/zrepl/zrepl/pull/530	2021-11-10 17:35:12 +01:00
InsanePrawn	b2c6e51a43	client/signal: Revert "add signal 'snapshot', rename existing signal 'wakeup' to 'replication'" This was merged to master prematurely as the job components are not decoupled well enough for these signals to be useful yet. This reverts commit `2c8c2cfa14`. closes #452	2021-03-25 22:26:17 +01:00
Calistoc	2c8c2cfa14	add signal 'snapshot', rename existing signal 'wakeup' to 'replication'	2021-03-14 18:16:23 +01:00
Christian Schwarz	0ceea1b792	replication: simplify parallel replication variables & expose them in config closes #140	2021-03-14 17:30:10 +01:00
Christian Schwarz	efe7b17d21	Update to protobuf v1.25 and grpc 1.35; bump CI to go1.12 From: github.com/golang/protobuf v1.3.2 google.golang.org/grpc v1.17.0 To: github.com/golang/protobuf v1.4.3 google.golang.org/grpc v1.35.0 google.golang.org/protobuf v1.25.0 About the two protobuf packages: https://developers.google.com/protocol-buffers/docs/reference/go/faq > Version v1.4.0 and higher of github.com/golang/protobuf wrap the new implementation and permit programs to adopt the new API incrementally. For example, the well-known types defined in github.com/golang/protobuf/ptypes are simply aliases of those defined in the newer module. Thus, google.golang.org/protobuf/types/known/emptypb and github.com/golang/protobuf/ptypes/empty may be used interchangeably. Notable Code Changes in zrepl: - generated protobufs now contain a mutex so we can't copy them by value anymore - grpc.WithDialer is deprecated => use grpc.WithContextDialer instead Go1.12 is now actually required by some of the dependencies.	2021-01-25 00:39:01 +01:00
InsanePrawn	180c3d9ae1	Reformat all files with `make format`. Signed-off-by: InsanePrawn <insane.prawny@gmail.com>	2020-08-31 23:57:45 +02:00
Hans Schulz	83fdffbcef	replication: prometheus metric for number of failed replications in last attempt - package replication: metric - Grafana panel - wiring - changelog Signed-off-by: Christian Schwarz <me@cschwarz.com> closes #341	2020-08-04 01:19:44 +02:00
Christian Schwarz	30cdc1430e	replication + endpoint: replication guarantees: guarantee_{resumability,incremental,nothing} This commit - adds a configuration in which no step holds, replication cursors, etc. are created - removes the send.step_holds.disable_incremental setting - creates a new config option `replication` for active-side jobs - adds the replication.protection.{initial,incremental} settings, each of which can have values - `guarantee_resumability` - `guarantee_incremental` - `guarantee_nothing` (refer to docs/configuration/replication.rst for semantics) The `replication` config from an active side is sent to both endpoint.Sender and endpoint.Receiver for each replication step. Sender and Receiver then act accordingly. For `guarantee_incremental`, we add the new `tentative-replication-cursor` abstraction. The necessity for that abstraction is outlined in https://github.com/zrepl/zrepl/issues/340. fixes https://github.com/zrepl/zrepl/issues/340	2020-07-26 20:32:35 +02:00
Christian Schwarz	1c270b7e39	add option to disable step holds for incremental sends This is a stop-gap solution until we re-write the pruner to support rules for removing step holds. Note that disabling step holds for incremental sends does not affect zrepl's guarantee that incremental replication is always possible: Suppose you yank the external drive during an incremental @from -> @to step: * restarting that step or future incrementals @from -> @to_later will be possible because the replication cursor bookmark points to @from until the step is complete * resuming @from -> @to will work as long as the pruner on your internal pool doesn't come around to destroy @to. * in that case, the replication algorithm should determine that the resumable state on the receiving side is useless because @to no longer exists on the sending side, and consequently clear it, and restart an incremental step @from -> @to_later refs #288	2020-06-14 15:26:05 +02:00
Christian Schwarz	292b85b5ef	[#316] endpoint / replication protocol: more robust step-holds and replication cursor management - drop HintMostRecentCommonAncestor rpc call - it is wrong to put faith into the active side of the replication to always make that call (we might not trust it, ref pull setup) - clean up step holds + step bookmarks + replication cursor bookmarks on send RPC instead - this makes it symmetric with Receive RPC - use a cache (endpoint.sendAbstractionsCache) to avoid the cost of listing the on-disk endpoint abstractions state on every step The "create" methods for endpoint abstractions (CreateReplicationCursor, HoldStep) are now fully idempotent and return an Abstraction. Notes about endpoint.sendAbstractionsCache: - fills lazily from disk state on first `Get` operation - fill from disk is generally only attempted once - unless the `ListAbstractions` fails, in which case the fill from disk is retried on next `Get` (the current `Get` will observe a subset of the actual on-disk abstractions) - the `Invalidate` method is called - it is a global (zrepl process-wide) cache fixes #316	2020-06-14 15:21:36 +02:00
Christian Schwarz	10a14a8c50	[#307] add package trace, integrate it with logging, and adopt it throughout zrepl package trace: - introduce the concept of tasks and spans, tracked as linked list within ctx - see package-level docs for an overview of the concepts - main feature 1: unique stack of task and span IDs - makes it easy to follow a series of log entries in concurrent code - main feature 2: ability to produce a chrome://tracing-compatible trace file - either via an env variable or a `zrepl pprof` subcommand - this is not a CPU profile, we already have go pprof for that - but it is very useful to visually inspect where the replication / snapshotter / pruner spends its time (fixes #307) usage in package daemon/logging: - goal: every log entry should have a trace field with the ID stack from package trace - make `logging.GetLogger(ctx, Subsys)` the authoritative `logger.Logger` factory function - the context carries a linked list of injected fields which `logging.GetLogger` adds to the logger it returns - `logging.GetLogger` also uses package `trace` to get the task-and-span-stack and injects it into the returned logger's fields	2020-05-19 11:30:02 +02:00
InsanePrawn	44bd354eae	Spellcheck all files Signed-off-by: InsanePrawn <insane.prawny@gmail.com>	2020-02-24 16:06:09 +01:00
Christian Schwarz	58c08c855f	new features: {resumable,encrypted,hold-protected} send-recv, last-received-hold - Resumable Send & Recv Support No knobs required, automatically used where supported. - Hold-Protected Send & Recv Automatic ZFS holds to ensure that we can always resume a replication step. - Encrypted Send & Recv Support for OpenZFS native encryption. Configurable at the job level, i.e., for all filesystems a job is responsible for. - Receive-side hold on last received dataset The counterpart to the replication cursor bookmark on the send-side. Ensures that incremental replication will always be possible between a sender and receiver. Design Doc ---------- `replication/design.md` doc describes how we use ZFS holds and bookmarks to ensure that a single replication step is always resumable. The replication algorithm described in the design doc introduces the notion of job IDs (please read the details on this design doc). We reuse the job names for job IDs and use `JobID` type to ensure that a job name can be embedded into hold tags, bookmark names, etc. This might BREAK CONFIG on upgrade. Protocol Version Bump --------------------- This commit makes backwards-incompatible changes to the replication/pdu protobufs. Thus, bump the version number used in the protocol handshake. Replication Cursor Format Change -------------------------------- The new replication cursor bookmark format is: `#zrepl_CURSOR_G_${this.GUID}_J_${jobid}` Including the GUID enables transaction-safe moving-forward of the cursor. Including the job id enables that multiple sending jobs can send the same filesystem without interfering. The `zrepl migrate replication-cursor:v1-v2` subcommand can be used to safely destroy old-format cursors once zrepl has created new-format cursors. 
Changes in This Commit ---------------------- - package zfs - infrastructure for holds - infrastructure for resume token decoding - implement a variant of OpenZFS's `entity_namecheck` and use it for validation in new code - ZFSSendArgs to specify a ZFS send operation - validation code protects against malicious resume tokens by checking that the token encodes the same send parameters that the send-side would use if no resume token were available (i.e. same filesystem, `fromguid`, `toguid`) - RecvOptions support for `recv -s` flag - convert a bunch of ZFS operations to be idempotent - achieved through more differentiated error message scraping / additional pre-/post-checks - package replication/pdu - add field for encryption to send request messages - add fields for resume handling to send & recv request messages - receive requests now contain `FilesystemVersion To` in addition to the filesystem into which the stream should be `recv`d into - can use `zfs recv $root_fs/$client_id/path/to/dataset@${To.Name}`, which enables additional validation after recv (i.e. whether `To.Guid` matched what we received in the stream) - used to set `last-received-hold` - package replication/logic - introduce `PlannerPolicy` struct, currently only used to configure whether encrypted sends should be requested from the sender - integrate encryption and resume token support into `Step` struct - package endpoint - move the concepts that endpoint builds on top of ZFS to a single file `endpoint/endpoint_zfs.go` - step-holds + step-bookmarks - last-received-hold - new replication cursor + old replication cursor compat code - adjust `endpoint/endpoint.go` handlers for - encryption - resumability - new replication cursor - last-received-hold - client subcommand `zrepl holds list`: list all holds and hold-like bookmarks that zrepl thinks belong to it - client subcommand `zrepl migrate replication-cursor:v1-v2`	2020-02-14 22:00:13 +01:00
Christian Schwarz	b5ff1a9926	snapper + client/status: snapshotting reports	2019-09-27 21:31:00 +02:00
Christian Schwarz	5b97953bfb	run golangci-lint and apply suggested fixes	2019-03-27 13:12:26 +01:00
Christian Schwarz	7756c9a55c	config + job: forbid non-overlapping receiver root_fs refs #136 refs #140	2019-03-21 12:07:55 +01:00
Christian Schwarz	4ee00091d6	pull job: support manual-only invocation	2019-03-16 14:24:05 +01:00
Christian Schwarz	07b43bffa4	replication: refactor driving logic (no more explicit state machine)	2019-03-13 15:00:40 +01:00
Christian Schwarz	796c5ad42d	rpc rewrite: control RPCs using gRPC + separate RPC for data transfer transport/ssh: update go-netssh to new version => supports CloseWrite and Deadlines => build: require Go 1.11 (netssh requires it)	2019-03-13 13:53:48 +01:00
Christian Schwarz	98bc8d1717	daemon/job: explicit notice of ZREPL_JOB_WATCHDOG_TIMEOUT environment variable on cancellation	2018-10-22 11:03:31 +02:00
Christian Schwarz	94427d334b	replication + pruner + watchdog: adjust timeouts based on practical experience	2018-10-21 18:37:57 +02:00
Christian Schwarz	190c7270d9	daemon/active + watchdog: simplify control flow using explicit ActiveSideState	2018-10-21 12:53:34 +02:00
Christian Schwarz	f704b28cad	daemon/job: track active side state explicitly	2018-10-21 12:52:48 +02:00
Christian Schwarz	e63ac7d1bb	pruner: log transitions to error state + log info to confirm pruning is done in active job	2018-10-19 17:23:00 +02:00
Christian Schwarz	69bfcb7bed	daemon/active: implement watchdog to handle stuck replication / pruners ActiveSide.do() can only run sequentially, i.e. we cannot run replication and pruning in parallel. Why? * go-streamrpc only allows one active request at a time (this is bad design and should be fixed at some point) * replication and pruning are implemented independently, but work on the same resources (snapshots) A: pruning might destroy a snapshot that is planned to be replicated B: replication might replicate snapshots that should be pruned We do not have any resource management / locking for A and B, but we have a use case where users don't want their machine to fill up with snapshots if replication does not work. That means we _have_ to run the pruners. A further complication is that we cannot just cancel the replication context after a timeout and move on to the pruner: it could be initial replication and we don't know how long it will take. (And we don't have resumable send & recv yet). With the previous commits, we can implement the watchdog using context cancellation. Note that the 'MadeProgress()' calls can only be placed right before a non-error state transition. Otherwise, we could end up in a live-lock.	2018-10-19 17:23:00 +02:00
Christian Schwarz	82f0060eec	Revert "daemon/job/active: push mode: awful hack for handling of concurrent snapshots + stale remote operation" This reverts commit `aeb87ffbcf`.	2018-10-19 09:35:30 +02:00
Christian Schwarz	aeb87ffbcf	daemon/job/active: push mode: awful hack for handling of concurrent snapshots + stale remote operation We have the problem that there are legitimate use cases where a user does not want their machine to fill up with snapshots, even if it means unreplicated snapshots must be destroyed. This can be expressed by not configuring the keep rule `not_replicated` for the snapshot-creating side. This commit only addresses push mode because we don't support pruning in the source job. We advise users in the docs to use push mode if they have above use case, so this is fine - at least for 0.1. Ideally, the replication.Replication would communicate to the pruner which snapshots are currently part of the replication plan, and then we'd need some conflict resolution to determine whether it's more important to destroy the snapshots or to replicate them (destroy should win?). However, we don't have the infrastructure for this yet (we could parse the replication report, but that's just ugly). And we want to get 0.1 out, so showtime for a dirty hack: We start replication, and ideally, replication and pruning is done before new snapshots have been taken. If so: great. However, what happens if snapshots have been taken and we are not done with replication and / or pruning? * If replication is making progress according to its state, let it run. This covers the important situation of initial replication, where replication may easily take longer than a single snapshotting interval. * If replication is in an error state, cancel it through context cancellation. * As with the pruner below, the main problem here is that status output will only contain "context cancelled" after the cancellation, instead of showing the reason why it was cancelled. Not nice, but oh well, the logs provide enough detail for this niche situation... * If we are past replication, we're still pruning * Leave the local (send-side) pruning alone.
Again, we only implement this hack for push, so we know sender is local, and it will only fail hard, not retry. * If the remote (receiver-side) pruner is in an error state, cancel it through context cancellation. * Otherwise, let it run. Note that every time we "let it run", we tolerate a temporary excess of snapshots, but given sufficiently aggressive timeouts and the assumption that the snapshot interval is much greater than the timeouts, this is not a significant problem in practice.	2018-10-12 22:47:06 +02:00
Christian Schwarz	d584e1ac54	daemon/job/active: fix race in updateTasks If concurrent updates strictly modify different members of the tasks struct, the copying + lock-drop still constitutes a race condition: The last updater always wins and sets tasks to its copy + changes. This eliminates the other updater's changes.	2018-10-12 22:15:07 +02:00
Christian Schwarz	89e0103abd	move wakeup subcommand into signal subcommand and add reset subcommand	2018-10-12 20:50:56 +02:00
Christian Schwarz	f9d24d15ed	move wakeup mechanism into separate package	2018-10-12 12:44:40 +02:00
Christian Schwarz	be962998ba	move serve and connecter into transports package	2018-10-11 21:21:46 +02:00
Christian Schwarz	125b561df3	rename root_dataset to root_fs for receiving-side jobs	2018-10-11 18:03:18 +02:00
Christian Schwarz	4e16952ad9	snapshotting: support 'periodic' and 'manual' mode 1. Change config format to support multiple types of snapshotting modes. 2. Implement a hacky way to support periodic or completely manual snapshots. In manual mode, the user has to trigger replication using the wakeup mechanism after they took snapshots using their own tooling. As indicated by the comment, a more general solution would be desirable, but we want to get the release out and 'manual' mode is a feature that some people requested...	2018-10-11 15:59:23 +02:00
Christian Schwarz	75e42fd860	pruner: implement Report method + display in status command	2018-09-24 19:25:40 +02:00
Christian Schwarz	75ba5874a5	active side: track activities in Run() as atomically updated member	2018-09-24 19:23:53 +02:00
Christian Schwarz	d04b9713c4	implement pull + sink modes for active and passive side	2018-09-24 12:36:10 +02:00
Christian Schwarz	e3be120d88	refactor push + source into active + passive 'sides' with push and source 'modes'	2018-09-24 12:36:10 +02:00

41 Commits