Commit Graph

87 Commits

Author SHA1 Message Date
Lapo Luchini
c6a9ebc71c job/active: add "last completed" metric for error reporting
use case:

    So that I can use a more resilient alerting such as "last complete was sent more than 24h ago".

fixes https://github.com/zrepl/zrepl/issues/516
closes https://github.com/zrepl/zrepl/pull/530
2021-11-10 17:35:12 +01:00
Christian Schwarz
1f0f2f8569 pruner + docs: less confusing type names, some comments, better docs for keep: not_replicated
fixes https://github.com/zrepl/zrepl/issues/524
2021-10-10 21:11:38 +02:00
Christian Schwarz
845195b7ed bandwidth limiting: fix crash with SnapJob
zrepl daemon panics when the snap job triggers

fixup for f5f269bfd5 (bandwidth limiting)
fixes #521

Oct 01 16:14:56 cstp zrepl[56563]: panic: invalid config`BandwidthLimit` field invalid: BucketCapacity must not be zero
Oct 01 16:14:56 cstp zrepl[56563]:         panic: end span: span still has active child spans
Oct 01 16:14:56 cstp zrepl[56563]: goroutine 38 [running]:
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/logging/trace.WithSpan.func2()
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/logging/trace/trace.go:341 +0x2ea
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/logging/trace.WithTaskAndSpan.func1()
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/logging/trace/trace_convenience.go:40 +0x2e
Oct 01 16:14:56 cstp zrepl[56563]: panic(0xcee9c0, 0xc000676730)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/go1.16.6/src/runtime/panic.go:965 +0x1b9
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/endpoint.NewSender(0xf5bbc0, 0xc0003840c0, 0xc0000b2c90, 0x4, 0xc0002c5958, 0x0, 0x0, 0x0, 0xc000068cf8)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/endpoint/endpoint.go:68 +0x1ec
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/job.(*SnapJob).doPrune(0xc00039e000, 0xf6e3b8, 0xc0006541b0)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/job/snapjob.go:179 +0x198
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/job.(*SnapJob).Run(0xc00039e000, 0xf6e3b8, 0xc0001d83c0)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/job/snapjob.go:127 +0x329
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0006a4100, 0xf6e3b8, 0xc00022a0f0, 0xf72d18, 0xc00039e000)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/daemon.go:255 +0x15b
Oct 01 16:14:56 cstp zrepl[56563]: created by github.com/zrepl/zrepl/daemon.(*jobs).start
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/daemon.go:251 +0x425
Oct 01 16:14:56 cstp systemd[1]: zrepl.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 01 16:14:56 cstp systemd[1]: zrepl.service: Failed with result 'exit-code'.
2021-10-09 15:52:38 +02:00
Christian Schwarz
f5f269bfd5 send/recv: job-level bandwidth limiting
Sponsored-by: Prominic.NET, Inc.

fixes #339
2021-09-12 20:08:43 +02:00
InsanePrawn
b2c6e51a43 client/signal: Revert "add signal 'snapshot', rename existing signal 'wakeup' to 'replication'"
This was merged to master prematurely as the job components are not decoupled well enough
for these signals to be useful yet.

This reverts commit 2c8c2cfa14.

closes #452
2021-03-25 22:26:17 +01:00
Calistoc
2c8c2cfa14 add signal 'snapshot', rename existing signal 'wakeup' to 'replication' 2021-03-14 18:16:23 +01:00
Christian Schwarz
0ceea1b792 replication: simplify parallel replication variables & expose them in config
closes #140
2021-03-14 17:30:10 +01:00
InsanePrawn
393fc10a69 [#285] support setting zfs send / recv flags in the config (send: -wLcepbS, recv: -ox)
Co-authored-by: Christian Schwarz <me@cschwarz.com>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>

closes #285
closes #276
closes #24
2021-02-20 17:20:45 +01:00
Christian Schwarz
1c937e58f7 zfs.NilBool: document its purpose and move it to its own package 'nodefault' 2021-02-20 17:04:57 +01:00
Christian Schwarz
efe7b17d21 Update to protobuf v1.25 and grpc 1.35; bump CI to go1.12
From:
github.com/golang/protobuf v1.3.2
google.golang.org/grpc v1.17.0

To:
github.com/golang/protobuf v1.4.3
google.golang.org/grpc v1.35.0
google.golang.org/protobuf v1.25.0

About the two protobuf packages:
https://developers.google.com/protocol-buffers/docs/reference/go/faq
> Version v1.4.0 and higher of github.com/golang/protobuf wrap the new
implementation and permit programs to adopt the new API incrementally. For
example, the well-known types defined in github.com/golang/protobuf/ptypes are
simply aliases of those defined in the newer module. Thus,
google.golang.org/protobuf/types/known/emptypb and
github.com/golang/protobuf/ptypes/empty may be used interchangeably.

Notable Code Changes in zrepl:
- generate protobufs now contain a mutex so we can't copy them by value
  anymore
- grpc.WithDialer is deprecated => use grpc.WithContextDialer instead

Go1.12 is now actually required by some of the dependencies.
2021-01-25 00:39:01 +01:00
Mathias Fredriksson
48be4032a2 daemon: fix data race in snapjob pruner report
closes #416
2021-01-25 00:16:01 +01:00
InsanePrawn
180c3d9ae1 Reformat all files with make format.
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
2020-08-31 23:57:45 +02:00
Hans Schulz
83fdffbcef replication: prometheus metric for number of failed replications in last attempt
- package replication: metric
- Grafana panel
- wiring
- changelog

Signed-off-by: Christian Schwarz <me@cschwarz.com>

closes #341
2020-08-04 01:19:44 +02:00
Christian Schwarz
30cdc1430e replication + endpoint: replication guarantees: guarantee_{resumability,incremental,nothing}
This commit

- adds a configuration in which no step holds, replication cursors, etc. are created
- removes the send.step_holds.disable_incremental setting
- creates a new config option `replication` for active-side jobs
- adds the replication.protection.{initial,incremental} settings, each
  of which can have values
    - `guarantee_resumability`
    - `guarantee_incremental`
    - `guarantee_nothing`
  (refer to docs/configuration/replication.rst for semantics)

The `replication` config from an active side is sent to both endpoint.Sender and endpoint.Receiver
for each replication step. Sender and Receiver then act accordingly.

For `guarantee_incremental`, we add the new `tentative-replication-cursor` abstraction.
The necessity for that abstraction is outlined in https://github.com/zrepl/zrepl/issues/340.

fixes https://github.com/zrepl/zrepl/issues/340
2020-07-26 20:32:35 +02:00
Christian Schwarz
95fc299733 daemon/job: test that sample configs are buildable 2020-07-26 20:32:35 +02:00
Christian Schwarz
1c270b7e39 add option to disable step holds for incremental sends
This is a stop-gap solution until we re-write the pruner to support
rules for removing step holds.

Note that disabling step holds for incremental sends does not affect
zrepl's guarantee that incremental replication is always possible:

Suppose you yank the external drive during an incremental @from -> @to step:

* restarting that step or future incrementals @from -> @to_later` will be possible
  because the replication cursor bookmark points to @from until the step is complete
* resuming @from -> @to will work as long as the pruner on your internal pool doesn't come around to destroy @to.
    * in that case, the replication algorithm should determine that the resumable state
      on the receiving side isuseless because @to no longer exists on the sending side,
      and consequently clear it, and restart an incremental step @from -> @to_later

refs #288
2020-06-14 15:26:05 +02:00
Christian Schwarz
292b85b5ef [#316] endpoint / replication protocol: more robust step-holds and replication cursor management
- drop HintMostRecentCommonAncestor rpc call
    - it is wrong to put faith into the active side of the replication to always make that call
      (we might not trust it, ref pull setup)
- clean up step holds + step bookmarks + replication cursor bookmarks on
  send RPC instead
    - this makes it symmetric with Receive RPC
- use a cache (endpoint.sendAbstractionsCache) to avoid the cost of
  listing the on-disk endpoint abstractions state on every step

The "create" methods for endpoint abstractions (CreateReplicationCursor, HoldStep) are now fully
idempotent and return an Abstraction.

Notes about endpoint.sendAbstractionsCache:
- fills lazily from disk state on first `Get` operation
- fill from disk is generally only attempted once
    - unless the `ListAbstractions` fails, in which case the fill from
      disk is retried on next `Get` (the current `Get` will observe a
      subset of the actual on-disk abstractions)
    - the `Invalidate` method is called
- it is a global (zrepl process-wide) cache

fixes #316
2020-06-14 15:21:36 +02:00
Christian Schwarz
10a14a8c50 [#307] add package trace, integrate it with logging, and adopt it throughout zrepl
package trace:

- introduce the concept of tasks and spans, tracked as linked list within ctx
    - see package-level docs for an overview of the concepts
    - **main feature 1**: unique stack of task and span IDs
        - makes it easy to follow a series of log entries in concurrent code
    - **main feature 2**: ability to produce a chrome://tracing-compatible trace file
        - either via an env variable or a `zrepl pprof` subcommand
        - this is not a CPU profile, we already have go pprof for that
        - but it is very useful to visually inspect where the
          replication / snapshotter / pruner spends its time
          ( fixes #307 )

usage in package daemon/logging:

- goal: every log entry should have a trace field with the ID stack from package trace

- make `logging.GetLogger(ctx, Subsys)` the authoritative `logger.Logger` factory function
    - the context carries a linked list of injected fields which
      `logging.GetLogger` adds to the logger it returns
    - `logging.GetLogger` also uses package `trace` to get the
      task-and-span-stack and injects it into the returned logger's fields
2020-05-19 11:30:02 +02:00
InsanePrawn
44bd354eae Spellcheck all files
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
2020-02-24 16:06:09 +01:00
Christian Schwarz
58c08c855f new features: {resumable,encrypted,hold-protected} send-recv, last-received-hold
- **Resumable Send & Recv Support**
  No knobs required, automatically used where supported.
- **Hold-Protected Send & Recv**
  Automatic ZFS holds to ensure that we can always resume a replication step.
- **Encrypted Send & Recv Support** for OpenZFS native encryption.
  Configurable at the job level, i.e., for all filesystems a job is responsible for.
- **Receive-side hold on last received dataset**
  The counterpart to the replication cursor bookmark on the send-side.
  Ensures that incremental replication will always be possible between a sender and receiver.

Design Doc
----------

`replication/design.md` doc describes how we use ZFS holds and bookmarks to ensure that a single replication step is always resumable.

The replication algorithm described in the design doc introduces the notion of job IDs (please read the details on this design doc).
We reuse the job names for job IDs and use `JobID` type to ensure that a job name can be embedded into hold tags, bookmark names, etc.
This might BREAK CONFIG on upgrade.

Protocol Version Bump
---------------------

This commit makes backwards-incompatible changes to the replication/pdu protobufs.
Thus, bump the version number used in the protocol handshake.

Replication Cursor Format Change
--------------------------------

The new replication cursor bookmark format is: `#zrepl_CURSOR_G_${this.GUID}_J_${jobid}`
Including the GUID enables transaction-safe moving-forward of the cursor.
Including the job id enables that multiple sending jobs can send the same filesystem without interfering.
The `zrepl migrate replication-cursor:v1-v2` subcommand can be used to safely destroy old-format cursors once zrepl has created new-format cursors.

Changes in This Commit
----------------------

- package zfs
  - infrastructure for holds
  - infrastructure for resume token decoding
  - implement a variant of OpenZFS's `entity_namecheck` and use it for validation in new code
  - ZFSSendArgs to specify a ZFS send operation
    - validation code protects against malicious resume tokens by checking that the token encodes the same send parameters that the send-side would use if no resume token were available (i.e. same filesystem, `fromguid`, `toguid`)
  - RecvOptions support for `recv -s` flag
  - convert a bunch of ZFS operations to be idempotent
    - achieved through more differentiated error message scraping / additional pre-/post-checks

- package replication/pdu
  - add field for encryption to send request messages
  - add fields for resume handling to send & recv request messages
  - receive requests now contain `FilesystemVersion To` in addition to the filesystem into which the stream should be `recv`d into
    - can use `zfs recv $root_fs/$client_id/path/to/dataset@${To.Name}`, which enables additional validation after recv (i.e. whether `To.Guid` matched what we received in the stream)
    - used to set `last-received-hold`
- package replication/logic
  - introduce `PlannerPolicy` struct, currently only used to configure whether encrypted sends should be requested from the sender
  - integrate encryption and resume token support into `Step` struct

- package endpoint
  - move the concepts that endpoint builds on top of ZFS to a single file `endpoint/endpoint_zfs.go`
    - step-holds + step-bookmarks
    - last-received-hold
    - new replication cursor + old replication cursor compat code
  - adjust `endpoint/endpoint.go` handlers for
    - encryption
    - resumability
    - new replication cursor
    - last-received-hold

- client subcommand `zrepl holds list`: list all holds and hold-like bookmarks that zrepl thinks belong to it
- client subcommand `zrepl migrate replication-cursor:v1-v2`
2020-02-14 22:00:13 +01:00
Matthias Freund
4ba85c40cc daemon/job/build_jobs: fix validateReceivingSidesDoNotOverlap 2020-01-30 10:46:32 +01:00
Juergen Hoetzel
b3231d2bed daemon: fix typos in error messages
closes #255
2019-12-11 21:30:48 +01:00
Christian Schwarz
b5ff1a9926 snapper + client/status: snapshotting reports 2019-09-27 21:31:00 +02:00
Christian Schwarz
77d3a1ad4d build: drop go Dep, switch to modules, support Go 1.13
bump enumer to v1.1.1
bump golangci-lint to v1.17.1

no `go mod tidy` because 1.13 and 1.12 seem to alter each other's output

fixes #112
2019-09-14 13:36:44 +02:00
Christian Schwarz
f31b54582f daemon/job: fix receiving-job root-fs overlap detection
fixup for 7756c9a5
fixes #160
2019-03-27 13:12:26 +01:00
Christian Schwarz
5b97953bfb run golangci-lint and apply suggested fixes 2019-03-27 13:12:26 +01:00
Christian Schwarz
afed762774 format source tree using goimports 2019-03-22 19:41:12 +01:00
Christian Schwarz
7756c9a55c config + job: forbid non-verlapping receiver root_fs
refs #136
refs #140
2019-03-21 12:07:55 +01:00
Christian Schwarz
158d1175e3 rename SinglePruner to LocalPruner 2019-03-17 21:18:25 +01:00
Christian Schwarz
b25da7b9b0 job: snap: comment fix 2019-03-17 21:07:42 +01:00
Christian Schwarz
5cd2593f52 job: snap: workaround for replication cursor requirement 2019-03-17 21:07:01 +01:00
Christian Schwarz
17818439a0 Merge branch 'problame/replication_refactor' into InsanePrawn-master 2019-03-17 17:33:51 +01:00
Christian Schwarz
4ee00091d6 pull job: support manual-only invocation 2019-03-16 14:24:05 +01:00
Christian Schwarz
aff639e87a Merge remote-tracking branch 'origin/master' into InsanePrawn-master 2019-03-15 21:05:20 +01:00
Christian Schwarz
07b43bffa4 replication: refactor driving logic (no more explicit state machine) 2019-03-13 15:00:40 +01:00
Christian Schwarz
796c5ad42d rpc rewrite: control RPCs using gRPC + separate RPC for data transfer
transport/ssh: update go-netssh to new version
    => supports CloseWrite and Deadlines
    => build: require Go 1.11 (netssh requires it)
2019-03-13 13:53:48 +01:00
InsanePrawn
160a3b6d32 more gofmt, drop snapjob.go_prefmt after it was accidentally added 2018-11-21 22:14:43 +01:00
InsanePrawn
3cef76d463 Refactor snapJob() to snapJobFromConfig() 2018-11-21 14:37:03 +01:00
InsanePrawn
e9564a7e5c Inlined a couple legacy leftover functions from the mode copypasta 2018-11-21 14:35:40 +01:00
InsanePrawn
d0f898751f Gofmt snapjob.go 2018-11-21 14:02:21 +01:00
InsanePrawn
22d9830baa Fix prometheus with multiple jobs 2018-11-21 04:26:03 +01:00
InsanePrawn
e10dc129de Make getPruner() private 2018-11-21 03:39:03 +01:00
InsanePrawn
dd11fc96db Touchups in job.go 2018-11-21 03:27:39 +01:00
InsanePrawn
7de3c0a09a Removed the references to a pruning 'side' in the singlepruner logging code and the snapjob prometheus thing. 2018-11-21 02:52:33 +01:00
InsanePrawn
141e49727c Missed a last reference to tasks 2018-11-21 02:51:23 +01:00
InsanePrawn
442d61918b remove most of the watchdog machinery 2018-11-21 02:42:13 +01:00
InsanePrawn
58dcc07430 Added SnapJobStatus 2018-11-21 02:08:39 +01:00
InsanePrawn
19d0916e34 remove snapMode, rename snap_ActiveSide to SnapJob 2018-11-21 01:54:56 +01:00
InsanePrawn
1265cc7934 pruned unused lines and comments ;) 2018-11-21 01:34:50 +01:00
InsanePrawn
3d2688e959 Ugly but working inital snapjob implementation 2018-11-20 19:30:15 +01:00