Commit Graph

1124 Commits

Author SHA1 Message Date
Christian Schwarz
e0c7ceedd5 prevent transient zrepl status error: Post "http://unix/status": EOF
See the comment added to client.go in this commit.

fixes https://github.com/zrepl/zrepl/issues/483
fixes https://github.com/zrepl/zrepl/issues/262
fixes https://github.com/zrepl/zrepl/issues/379
fixes https://github.com/zrepl/zrepl/issues/379
2022-06-26 14:39:35 +02:00
Christian Schwarz
2642c64303 make initial replication policy configurable (most_recent, all, fail)
Config:

```
- type: push
  ...
  conflict_resolution:
    initial_replication: most_recent | all | fali
```

The ``initial_replication`` option determines which snapshots zrepl
replicates if the filesystem has not been replicated before.
If ``most_recent`` (the default), the initial replication will only
transfer the most recent snapshot, while ignoring previous snapshots.
If all snapshots should be replicated, specify ``all``.
Use ``fail`` to make replication of the filesystem fail in case
there is no corresponding fileystem on the receiver.

Code-Level Changes, apart from the obvious:
- Rework IncrementalPath()'s return signature.
  Now returns an error for initial replications as well.
- Rename & rework it's consumer, resolveConflict().

Co-authored-by: Graham Christensen <graham@grahamc.com>

Fixes https://github.com/zrepl/zrepl/issues/550
Fixes https://github.com/zrepl/zrepl/issues/187
Closes https://github.com/zrepl/zrepl/pull/592
2022-06-26 14:36:59 +02:00
JMoVS
1acafabb5b docs: Fix typo in disjoing to disjoint
Signed-off-by: Justin Scholz <git@justinscholz.de>
2022-05-07 22:13:56 +02:00
Christian Schwarz
19b2deb2cf run go mod tidy; go version go1.17.2 linux/amd64
Juding from the (now deleted) comments in go.mod, this might break 1.12
build. If it does, the CI will catch it as it currently build using
1.17 and 1.12.

fixes https://github.com/zrepl/zrepl/issues/586
2022-05-07 21:59:51 +02:00
Christian Schwarz
ce6701fb33 status: fix over-counted step when status != stepping
This is a fixup of

  commit b00b61e967
  Author: Christian Schwarz <me@cschwarz.com>
  Date:   Sun Nov 21 15:15:23 2021 +0100

      status: user-visible replication step number should start at 1

fixes https://github.com/zrepl/zrepl/issues/589
refs https://github.com/zrepl/zrepl/issues/538
2022-04-24 15:24:39 +02:00
Christian Schwarz
0121929164 build: use git+https to fix lazy.sh docdep failures
CircleCI fails like so:

    #!/bin/bash -eo pipefail
    ./lazy.sh docdep

    pip3 is /home/circleci/.pyenv/shims/pip3
    Installing doc build dependencies
    Obtaining sphinxcontrib-versioning from git+git://github.com/rwblair/sphinxcontrib-versioning.git@7e3885a389a809e17ea55261316b7b0e98dbf98f#egg=sphinxcontrib-versioning (from -r ./docs/requirements.txt (line 28))
      Cloning git://github.com/rwblair/sphinxcontrib-versioning.git (to revision 7e3885a389a809e17ea55261316b7b0e98dbf98f) to ./src/sphinxcontrib-versioning
      Running command git clone --filter=blob:none --quiet git://github.com/rwblair/sphinxcontrib-versioning.git /home/circleci/project/src/sphinxcontrib-versioning
      fatal: remote error:
        The unauthenticated git protocol on port 9418 is no longer supported.
      Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
      error: subprocess-exited-with-error

      × git clone --filter=blob:none --quiet git://github.com/rwblair/sphinxcontrib-versioning.git /home/circleci/project/src/sphinxcontrib-versioning did not run successfully.
      │ exit code: 128
      ╰─> See above for output.

      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: subprocess-exited-with-error

    × git clone --filter=blob:none --quiet git://github.com/rwblair/sphinxcontrib-versioning.git /home/circleci/project/src/sphinxcontrib-versioning did not run successfully.
    │ exit code: 128
    ╰─> See above for output.

    note: This error originates from a subprocess, and is likely not a problem with pip.

    Exited with code exit status 1

    CircleCI received exit code 1
2022-03-20 20:23:01 +01:00
Christian Schwarz
bc96f8f212 build/circleci: update to Ubuntu 20.04 image for release-* jobs
Background: `machine: true` is deprecated:

    https://circleci.com/docs/2.0/images/linux-vm/14.04-to-20.04-migration/
2022-02-15 22:55:25 +01:00
Christian Schwarz
459508c9d9 docs: sendrecvoptions: placeholders: fix wrong link name and add summarizing config snippet for recv.placeholders
fixes https://github.com/zrepl/zrepl/issues/573
2022-02-05 10:59:33 +01:00
Lapo Luchini
4a27cc63a8 prometheus: convert zrepl_version_daemon to zrepl_start_time metric
closes https://github.com/zrepl/zrepl/pull/556
fixes #553
2022-01-20 19:33:18 +01:00
Christian Schwarz
0a6840273a build: add tag-release Make target 2022-01-20 19:25:22 +01:00
madbrain76
76ef84f83b docs: fix typo in backup_to_external_disk.rst
closes https://github.com/zrepl/zrepl/pull/568
2022-01-20 19:25:03 +01:00
Christian Schwarz
66946df756 docs: continous_server_backup: simplify by removing need for recv.placeholder 2022-01-09 12:51:00 +01:00
Andrew Gunnerson
556fac3002 docs: document fan-out replication & add quick-start guide
closes https://github.com/zrepl/zrepl/pull/552
fixes https://github.com/zrepl/zrepl/issues/551

Signed-off-by: Andrew Gunnerson <chillermillerlong@hotmail.com>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
2022-01-09 12:45:09 +01:00
Christian Schwarz
1ad7df2df3 docs: badges & links to Matrix chat room
fixes https://github.com/zrepl/zrepl/issues/488
2022-01-09 12:05:19 +01:00
Christian Schwarz
a3d010c5f0 util/optionaldeadline: disable scheduler latency-sensitive tests in CircleCI 2021-12-30 14:41:06 +01:00
Christian Schwarz
12503dc55a rpc/dataconn/timeoutconn: disable TestPartialWriteMockConn in CircleCI 2021-12-18 18:02:34 +01:00
Christian Schwarz
7d10a71cc0 0.5 changelog + front page update 2021-12-18 17:36:54 +01:00
Christian Schwarz
3d3d1b5679 quickstart: sample config uses placeholders, so provide sample value for recv.placeholder.encryption 2021-12-18 17:18:58 +01:00
Christian Schwarz
5240ab4949 docs: quickstart: make users aware that the example rules apply to all snaps, not just zrepl's
fixes https://github.com/zrepl/zrepl/issues/540
2021-12-18 16:30:15 +01:00
Christian Schwarz
19aebd399f docs: add a note that FreeBSD jail zfs userland needs to be kept in sync with kernel module
fixes https://github.com/zrepl/zrepl/issues/500
2021-12-18 16:06:26 +01:00
Christian Schwarz
04e03f4d06 platformtest: retry zpool export if 'pool is busy'
On Ubuntu, something seems to be holding on to the pool for too long.
2021-12-18 15:58:24 +01:00
Christian Schwarz
2e2a8a1d5d docs: add docs on how to run platform tests
fixes https://github.com/zrepl/zrepl/issues/478
2021-12-18 15:55:22 +01:00
Christian Schwarz
a2b2e0fe34 daemon/control: make http server {Read,Write}Timeout envconst-configurable
refs https://github.com/zrepl/zrepl/issues/379
2021-12-18 15:14:33 +01:00
Christian Schwarz
af2905d245 docs: apt repo: deploy gpg to /usr/share/keyrings and use 'signed-by' in repo definition
gpg --dearmor because of note in https://wiki.debian.org/DebianRepository/UseThirdParty

fixes https://github.com/zrepl/zrepl/issues/529
2021-12-18 15:14:33 +01:00
Christian Schwarz
c3f0041efd zrepl test placeholder: fix panic if dataset does not exist
fixes https://github.com/zrepl/zrepl/issues/406
2021-12-18 15:14:33 +01:00
Christian Schwarz
083f6001eb build: freebsd armv7 and arm64 binaries
fixes https://github.com/zrepl/zrepl/issues/539
2021-12-18 15:14:33 +01:00
Christian Schwarz
2d57ec6ee0 docs: changelog: mention upstream ashift 9 => 12 send/recv bug 2021-12-18 15:14:33 +01:00
Christian Schwarz
fb6a9be954 fix encrypt-on-receive with placeholders
fixes https://github.com/zrepl/zrepl/issues/504

Problem:
  plain send + recv with root_fs encrypted + placeholders causes plain recvs
  whereas user would expect encrypt-on-recv
Reason:
  We create placeholder filesytems with -o encryption=off.
  Thus, children received below those placeholders won't inherit
  encryption of root_fs.
Fix:
  We'll have three values for `recv.placeholders.encryption: unspecified (default) | off | inherit`.
  When we create a placeholder, we will fail the operation if  `recv.placeholders.encryption = unspecified`.
  The exception is if the placeholder filesystem is to encode the client identity ($root_fs/$client_identity) in a pull job.
  Those are created in `inherit` mode if the config field is `unspecified` so that users who don't need
  placeholders are not bothered by these details.

Future Work:
  Automatically warn existing users of encrypt-on-recv about the problem
  if they are affected.
  The problem that I hit during implementation of this is that the
  `encryption` prop's `source` doesn't quite behave like other props:
  `source` is `default` for `encryption=off` and `-` when `encryption=on`.
  Hence, we can't use `source` to distinguish the following 2x2 cases:
  (1) placeholder created with explicit -o encryption=off
  (2) placeholder created without specifying -o encryption
  with
  (A) an encrypted parent at creation time
  (B) an unencrypted parent at creation time
2021-12-18 15:12:47 +01:00
Christian Schwarz
c1e2c9826f trace: hint debug env var in error when crashing due to active child tasks
refs https://github.com/zrepl/zrepl/issues/542
2021-12-05 18:57:43 +01:00
Christian Schwarz
b00b61e967 status: user-visible replication step number should start at 1
fixes https://github.com/zrepl/zrepl/issues/538
2021-11-21 15:32:18 +01:00
Christian Schwarz
ac147b5a6f replication: report a filesystem is active vs. blocked on something
- `BlockedOn` prop in JSON report
- Bring back the `*` in front of the filesystem report as an activity indicator.

fixes https://github.com/zrepl/zrepl/issues/505
2021-11-14 17:34:32 +01:00
Samy Mahmoudi
1850a332ed docs: prune: improve docs for 'grid' rule
- Substitute full words for both string name 'gridspec' and short form 'grid spec'
- Fix alignment and make spacing more consistent
- Fix fall of snapshots into buckets for the example to really reflect right-exclusiveness

closes https://github.com/zrepl/zrepl/pull/535
2021-11-14 17:34:32 +01:00
Christian Schwarz
20ff9717bc fix mis-spelled send option for embedded data
fixes https://github.com/zrepl/zrepl/issues/522
2021-11-14 17:34:32 +01:00
Christian Schwarz
c2fbf93365 daemon: provide os.Environ() in zrepl status
Useful for debugging.

fixes https://github.com/zrepl/zrepl/issues/534
2021-11-14 17:34:32 +01:00
Christian Schwarz
cf5e8e8f26 docs: add runbook on how to migrate sending side to new zpool
fixes https://github.com/zrepl/zrepl/issues/525
2021-11-14 17:34:32 +01:00
Christian Schwarz
c600cc1f60 skip timing-sensitive tests on CircleCI
We had too many spurious test failures in the past.
But on a developer machine, the tests don't usually fail because the
system isn't loaded as much.
So, only disable test on CircleCI.
2021-11-14 17:34:32 +01:00
Lapo Luchini
c6a9ebc71c job/active: add "last completed" metric for error reporting
use case:

    So that I can use a more resilient alerting such as "last complete was sent more than 24h ago".

fixes https://github.com/zrepl/zrepl/issues/516
closes https://github.com/zrepl/zrepl/pull/530
2021-11-10 17:35:12 +01:00
Christian Schwarz
1f0f2f8569 pruner + docs: less confusing type names, some comments, better docs for keep: not_replicated
fixes https://github.com/zrepl/zrepl/issues/524
2021-10-10 21:11:38 +02:00
Christian Schwarz
5104ad3d0b build: use go 1.17 for testing & release builds 2021-10-09 16:51:08 +02:00
Christian Schwarz
a6dbda1ea8 go1.17: run goimports to supports the new //go:build lines 2021-10-09 16:51:08 +02:00
Christian Schwarz
1edb8014bc build: circleci: stop storing artifacts
When we need artifacts, we use MinIO anyways.
And we have accumulated about 1TiB of (free) CircleCI artifact storage.
Don't need to waste space unnecessarily.
2021-10-09 16:13:58 +02:00
Christian Schwarz
845195b7ed bandwidth limiting: fix crash with SnapJob
zrepl daemon panics when the snap job triggers

fixup for f5f269bfd5 (bandwidth limiting)
fixes #521

Oct 01 16:14:56 cstp zrepl[56563]: panic: invalid config`BandwidthLimit` field invalid: BucketCapacity must not be zero
Oct 01 16:14:56 cstp zrepl[56563]:         panic: end span: span still has active child spans
Oct 01 16:14:56 cstp zrepl[56563]: goroutine 38 [running]:
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/logging/trace.WithSpan.func2()
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/logging/trace/trace.go:341 +0x2ea
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/logging/trace.WithTaskAndSpan.func1()
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/logging/trace/trace_convenience.go:40 +0x2e
Oct 01 16:14:56 cstp zrepl[56563]: panic(0xcee9c0, 0xc000676730)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/go1.16.6/src/runtime/panic.go:965 +0x1b9
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/endpoint.NewSender(0xf5bbc0, 0xc0003840c0, 0xc0000b2c90, 0x4, 0xc0002c5958, 0x0, 0x0, 0x0, 0xc000068cf8)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/endpoint/endpoint.go:68 +0x1ec
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/job.(*SnapJob).doPrune(0xc00039e000, 0xf6e3b8, 0xc0006541b0)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/job/snapjob.go:179 +0x198
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/job.(*SnapJob).Run(0xc00039e000, 0xf6e3b8, 0xc0001d83c0)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/job/snapjob.go:127 +0x329
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0006a4100, 0xf6e3b8, 0xc00022a0f0, 0xf72d18, 0xc00039e000)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/daemon.go:255 +0x15b
Oct 01 16:14:56 cstp zrepl[56563]: created by github.com/zrepl/zrepl/daemon.(*jobs).start
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/daemon.go:251 +0x425
Oct 01 16:14:56 cstp systemd[1]: zrepl.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 01 16:14:56 cstp systemd[1]: zrepl.service: Failed with result 'exit-code'.
2021-10-09 15:52:38 +02:00
Christian Schwarz
2c9fcd7c14 rpc/dataconn: always close send stream returned from Sender.Send()
discovered while debugging #457
2021-10-09 15:43:31 +02:00
Christian Schwarz
4f9b63aa09 rework size estimation & dry sends
- use control connection (gRPC)
- use uint64 everywhere => fixes https://github.com/zrepl/zrepl/issues/463
- [BREAK] bump protocol version

closes https://github.com/zrepl/zrepl/pull/518
fixes https://github.com/zrepl/zrepl/issues/463
2021-10-09 15:43:27 +02:00
Christian Schwarz
a8e92971d0 zfs: rewrite SendStream, fix bug in Close() on FreeBSD, add platformtests
This commit was motivated by https://github.com/zrepl/zrepl/issues/495
where, on FreeBSD with OpenZFS 2.0, a SendStream.Close() call might wait indefinitely for `zfs send` to exit.
The reason is that, due to the refactoring done for redacted send & recv
(30af21b025),
the `dump_bytes` function, which writes to the pipe, executes in a separate thread (synctask taskq) iff not `HAVE_LARGE_STACKS`.
The `zfs send` process/thread waits for that taskq thread using an uninterruptible primitive.
So when we SIGKILL `zfs send`, that signal doesn't reach the right thread to interrupt the pipe write.

Theoretically this affects both Linux and FreeBSD, but most Linux users `HAVE_LARGE_STACKS` and since https://github.com/penzfs/zfs/pull/12350/files OpenZFS on FreeBSD `HAVE_LARGE_STACKS` as well.
However, at least until FreeBSD 13.1, possibly for the entire 13 lifecycle, we're going to have to live with that oddity.

Measures taken in this commit:
- Report the behavior as an upstream bug https://github.com/openzfs/zfs/issues/12500
- Change SendStream code so that it closes zrepl's read-end of the pipe (see comment in code)
- Clean up and make explicit SendStream's state handling
- Write extensive platformtests for SendStream
    - They pass on my Linux install and on FreeBSD 12
    - FreeBSD 13 still needs testing.

fixes https://github.com/zrepl/zrepl/issues/495
2021-09-19 20:11:31 +02:00
Christian Schwarz
b54e477602 platformtest: fix 'active child tasks' panic for ReceiveForceRollbackWorksUnencrypted
Revealed by rework of SendStream in a prior commit.
2021-09-19 20:03:01 +02:00
Christian Schwarz
959fb08a89 platformtest: fix replication tests (SizeEstimationConcurrency field in PlannerPolicy was not set)
fixup of 0ceea1b792 (parallel replication knobs)
2021-09-19 20:03:01 +02:00
Christian Schwarz
6ac012aa3c platformtest: work around missing feature detection for test 'ReplicationPropertyReplicationWorks' 2021-09-19 20:03:01 +02:00
Christian Schwarz
3e93b31f75 platformtest: fix bandwidth-limiting-related panics (missing BucketCapacity in sender/receiver config)
panic while running test: invalid config`Ratelimit` field invalid: BucketCapacity must not be zero
main.runTestCase.func1.1
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:190
runtime.gopanic
	/home/cs/go1.13/src/runtime/panic.go:679
github.com/zrepl/zrepl/endpoint.NewSender
	/home/cs/zrepl/zrepl/endpoint/endpoint.go:68
github.com/zrepl/zrepl/platformtest/tests.replicationInvocation.Do
	/home/cs/zrepl/zrepl/platformtest/tests/replication.go:87
github.com/zrepl/zrepl/platformtest/tests.ReplicationFailingInitialParentProhibitsChildReplication
	/home/cs/zrepl/zrepl/platformtest/tests/replication.go:925
main.runTestCase.func1
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:193
main.runTestCase
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:194
main.HarnessRun
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:107
main.main
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:42
runtime.main
	/home/cs/go1.13/src/runtime/proc.go:203
runtime.goexit
	/home/cs/go1.13/src/runtime/asm_amd64.s:1357

fixup for f5f269bfd5 (bandwidth limiting)
2021-09-19 20:03:01 +02:00
Christian Schwarz
08df208149 bandwidth limiting: use correct field name on error
fixup for f5f269bfd5 (bandwidth limiting)
2021-09-19 20:03:01 +02:00