Commit Graph

1075 Commits

Author SHA1 Message Date
Christian Schwarz
bbdc6f5465
fix handling of tentative cursor presence if protection strategy doesn't use it (#714)
Before this PR, we would panic in the `check` phase of `endpoint.Send()`'s `TryBatchDestroy` call in the following cases: the current protection strategy does NOT produce a tentative replication cursor AND
  * `FromVersion` is a tentative cursor bookmark
  * `FromVersion` is a snapshot, and there exists a tentative cursor bookmark for that snapshot
  * `FromVersion` is a bookmark != tentative cursor bookmark, but there exists a tentative cursor bookmark for the same snapshot as the `FromVersion` bookmark

In those cases, the `check` concluded that we would delete `FromVersion`.
It came to that conclusion because the tentative cursor isn't part of `obsoleteAbs` if the protection strategy doesn't produce a tentative replication cursor.

The scenarios above can happen if the user changes the protection strategy from "with tentative cursor" to one "without tentative replication cursor", while there is a tentative replication cursor on disk.
The workaround was to rename the tentative cursor.

In all cases above, `TryBatchDestroy` would have destroyed the tentative cursor.

In case 1, that would fail the `Send` step and potentially break replication if the cursor is the last common bookmark. The `check` conclusion was correct.

In cases 2 and 3, deleting the tentative cursor would have been fine because `FromVersion` was a different entity than the tentative cursor. So, destroying the tentative cursor would be the right call.

The solution in this PR is as follows:
* add the `FromVersion` to the `liveAbs` set of live abstractions
* rewrite the `check` closure to use the full dataset path (`fullpath`) to identify the concrete ZFS object instead of the `zfs.FilesystemVersionEqualIdentity`, which is only identified by matching GUID.
  * Holds have no dataset path and are not the `FromVersion` in any case, so disregard them.
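
A minimal sketch of why GUID-only identity is ambiguous here. The types and helpers (`version`, `sameGUID`, `samePath`) are hypothetical illustrations, not zrepl's actual API. A bookmark carries the GUID of the snapshot it was created from, so only the full dataset path tells the tentative cursor apart from `FromVersion`:

```go
package main

import "fmt"

// version is a hypothetical stand-in for a ZFS snapshot or bookmark.
type version struct {
	FullPath string // e.g. "pool/fs@snap" or "pool/fs#bookmark"
	GUID     uint64 // a bookmark carries the GUID of its snapshot
}

// sameGUID mirrors an identity check that only compares GUIDs.
func sameGUID(a, b version) bool { return a.GUID == b.GUID }

// samePath identifies the concrete ZFS object by its full dataset path.
func samePath(a, b version) bool { return a.FullPath == b.FullPath }

func main() {
	from := version{FullPath: "pool/fs@s1", GUID: 42}
	cursor := version{FullPath: "pool/fs#tentative_cursor_s1", GUID: 42}

	fmt.Println(sameGUID(from, cursor)) // true: looks like the same object
	fmt.Println(samePath(from, cursor)) // false: distinct ZFS objects
}
```

With a path-based comparison, destroying the cursor in cases 2 and 3 no longer looks like destroying `FromVersion`.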

fixes #666
2023-07-04 20:21:48 +02:00
Goran Mekic
bc5e1ede04
metric to detect filesystems rules that don't match any local dataset (#653)
This PR adds a Prometheus counter called
`zrepl_zfs_list_unmatched_user_specified_dataset_count`.
Monitor for increases of the counter to detect filesystem filter rules that
have no effect because they don't match any local filesystem.

An example use case for this is the following story:
1. Someone sets up zrepl with `filesystems` filter for `zroot/pg14<`.
2. During the upgrade to Postgres 15, they rename the dataset to `zroot/pg15`,
   but forget to update the zrepl `filesystems` filter.
3. zrepl will not snapshot / replicate the `zroot/pg15<` datasets.

Since `filesystems` rules are always evaluated on the side that has the datasets,
we can smuggle this functionality into the `zfs` module's `ZFSList` function that
is used by all jobs with a `filesystems` filter.
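
The idea can be sketched roughly as follows. This is not the code from the PR: `matches` and `unmatchedRules` are hypothetical helpers, and the rule semantics are simplified to the assumption that a trailing `<` matches a dataset and all of its descendants:

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether a filter rule matches a dataset.
// Simplifying assumption: a trailing "<" matches the dataset itself and
// everything below it; without it, only an exact match counts.
func matches(rule, dataset string) bool {
	if strings.HasSuffix(rule, "<") {
		base := strings.TrimSuffix(rule, "<")
		return dataset == base || strings.HasPrefix(dataset, base+"/")
	}
	return dataset == rule
}

// unmatchedRules returns the rules that match none of the local datasets.
// A metric would count these; here they are simply returned.
func unmatchedRules(rules, datasets []string) []string {
	var unmatched []string
	for _, r := range rules {
		found := false
		for _, d := range datasets {
			if matches(r, d) {
				found = true
				break
			}
		}
		if !found {
			unmatched = append(unmatched, r)
		}
	}
	return unmatched
}

func main() {
	rules := []string{"zroot/pg14<", "zroot/home<"}
	datasets := []string{"zroot", "zroot/pg15", "zroot/pg15/data", "zroot/home"}
	fmt.Println(unmatchedRules(rules, datasets)) // [zroot/pg14<]
}
```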

Dashboard changes:
- histogram with increase in $__interval, one row per job
- table with increase in $__range
- explainer text box, so people know what the previous two are about
We had to re-arrange some panels, hence the Git diff isn't great.

closes https://github.com/zrepl/zrepl/pull/653

Co-authored-by: Christian Schwarz <me@cschwarz.com>
Co-authored-by: Goran Mekić <meka@tilda.center>
2023-05-02 22:13:52 +02:00
Tercio Filho
2b3daaf9f1
zrepl status: hide progress bar once all filesystems reach terminal state (#674)
* Added `IsTerminal` method
* Made rendering of progress bar conditional based on IsTerminal
2023-05-02 19:28:56 +02:00
Sebastian Jäger
2b3df7e342
docs: address setup with two or more external disks (#695) 2023-05-02 18:57:26 +02:00
Christian Schwarz
5e4d4188f4 circleci: use orb circleci/go for module caching 2023-02-26 13:08:05 +01:00
Christian Schwarz
1e8ffe4486 circleci: run platform tests in CircleCI 2023-02-26 13:08:05 +01:00
Christian Schwarz
59389b84a2 platformtest: fix logmockzfs wrapper script / make test-platform for Go 1.19
See the comment in the script.

refs https://github.com/golang/go/issues/53962

 used by make test-platform breaks the test on Go 1.19
2023-02-26 13:08:05 +01:00
Christian Schwarz
4fae0bb68e grafana: update dashboard to Grafana 9.3.6
... by importing the old version of the dashboard JSON into Grafana 9.3.6, then
re-exporting it.
2023-02-26 11:28:57 +01:00
Guillermo Ramos
9777a441e9
dist: add openrc service file
closes https://github.com/zrepl/zrepl/pull/664
2023-01-27 23:59:45 +01:00
InsanePrawn
1a72edea5d docs/jobs: add replication- conflict_resolution-options to active job types 2023-01-26 00:09:28 +01:00
Christian Schwarz
96db636582 build: circleci: don't trigger periodic full pipeline build for problame/circleci-build 2023-01-08 12:35:59 +01:00
Christian Schwarz
190ab7c08d build: circleci: stop using minio for artifact storage
CircleCI artifacts are available publicly.
And regarding expiration of artifacts, it doesn't really
matter because I delete minio artifacts after 30d as well.
2022-12-30 14:24:23 +01:00
Christian Schwarz
6be133f55d remove unused JobDebugSettings along with docs
For this kind of debugging, we switched to env vars a while ago.
For example, ZREPL_RPC_DEBUG.

I don't think we have a substitute for the RPCLog stuff.
However, NetConnLogger is still in the codebase.

obsoletes https://github.com/zrepl/zrepl/pull/661
2022-12-22 18:13:45 +01:00
Christian Schwarz
5ffd470596 docs: update comment on overriding mountpoint properties during zfs recv of ZVOLs
fixes https://github.com/zrepl/zrepl/issues/430
2022-12-10 12:53:24 +01:00
Christian Schwarz
2119dc40ab docs: update supporters list 2022-12-10 12:00:57 +01:00
Christian Schwarz
0df1c4cdcc docs: changelog: move donation banner to 0.6 release 2022-11-01 09:57:24 +01:00
Christian Schwarz
2658695a35 build: bump minimum Go version to 1.18, as a dependency in ./tools requires it
https://app.circleci.com/pipelines/github/zrepl/zrepl/6085/workflows/bf5b11f2-8dc4-40a2-bb7a-fcf3cf8205d4/jobs/42340

  ...
  build github.com/golangci/golangci-lint/cmd/golangci-lint: cannot load io/fs: cannot find module providing package io/fs
  go install github.com/wadey/gocovmerge
  go: downloading github.com/wadey/gocovmerge v0.0.0-20160331181800-b5bfa59ec0ad
  go: extracting github.com/wadey/gocovmerge v0.0.0-20160331181800-b5bfa59ec0ad
  go install golang.org/x/tools/cmd/goimports
  # golang.org/x/mod/module
  ../../go/pkg/mod/golang.org/x/mod@v0.6.0/module/module.go:147:5: undefined: errors.As
  note: module requires Go 1.17
  go install golang.org/x/tools/cmd/stringer
  # golang.org/x/tools/go/internal/gcimporter
  ../../go/pkg/mod/golang.org/x/tools@v0.2.0/go/internal/gcimporter/iimport.go:520:9: undefined: constant.Make
  ../../go/pkg/mod/golang.org/x/tools@v0.2.0/go/internal/gcimporter/iimport.go:616:9: undefined: constant.Make
  note: module requires Go 1.18
  go install google.golang.org/grpc/cmd/protoc-gen-go-grpc
  go: downloading google.golang.org/grpc v1.46.2
  go: extracting google.golang.org/grpc v1.46.2
  go: downloading google.golang.org/grpc/cmd/protoc-gen-go-grpc v1.1.0
  go: extracting google.golang.org/grpc/cmd/protoc-gen-go-grpc v1.1.0
  go install google.golang.org/protobuf/cmd/protoc-gen-go

  Exited with code exit status 123
2022-10-31 20:13:36 +01:00
Christian Schwarz
1ac1635b3d build: circleci: update CA certs in go 1.12 image 2022-10-31 20:13:26 +01:00
Christian Schwarz
4a2806f6d1 build: fix deb-docker performance on newer Docker
See comment in Makefile
2022-10-27 00:47:12 +02:00
Christian Schwarz
0a264b9b41 docs: add announcement for next release 2022-10-27 00:19:06 +02:00
Christian Schwarz
a3379d6785 docs: finalize 0.6 changelog 2022-10-27 00:19:06 +02:00
Christian Schwarz
6260b75031 snapper: fix delayed snapshots caused by system suspend/resume
See explainer comment in periodic.go for details.

fixes https://github.com/zrepl/zrepl/issues/611
2022-10-27 00:19:06 +02:00
Christian Schwarz
3ffb69bfb0 config: support zrepl's day and week units for snapshotting.interval
Originally, I had a patch that would replace all usages of
time.Duration in package config with the new config.Duration
types, but:
1. these are all timeouts/retry intervals that have default values.
   Most users don't touch them, and if they do, they don't need
   day or week units.
2. go-yaml's error reporting for yaml.Unmarshaler is inferior to
   built-in types (line numbers are missing, so the error would not have
   sufficient context)
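
A minimal sketch of what parsing day and week units on top of Go's duration syntax could look like. `parseInterval` is a hypothetical helper, not the `config.Duration` type from this commit, and it assumes a day is exactly 24 hours:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// extraUnits lists the units that time.ParseDuration does not know.
// Assumption: a day is exactly 24h and a week is exactly 7 days.
var extraUnits = map[string]time.Duration{
	"d": 24 * time.Hour,
	"w": 7 * 24 * time.Hour,
}

// parseInterval accepts a positive integer followed by "d" or "w" and
// falls back to time.ParseDuration for everything else.
func parseInterval(s string) (time.Duration, error) {
	for suffix, unit := range extraUnits {
		if !strings.HasSuffix(s, suffix) {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSuffix(s, suffix))
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid interval %q", s)
		}
		return time.Duration(n) * unit, nil
	}
	return time.ParseDuration(s)
}

func main() {
	for _, s := range []string{"10m", "1d", "2w"} {
		d, err := parseInterval(s)
		fmt.Println(s, d, err)
	}
}
```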

fixes https://github.com/zrepl/zrepl/issues/486
2022-10-27 00:19:06 +02:00
Yannick Dylla
1da8f848f2 snapper: support custom timestamp format
fixes https://github.com/zrepl/zrepl/issues/465
closes https://github.com/zrepl/zrepl/pull/639
2022-10-27 00:19:06 +02:00
Christian Schwarz
6ed4626df9 grafana dashboard: remove zrepl version number from title
fixes https://github.com/zrepl/zrepl/issues/624
2022-10-27 00:19:06 +02:00
Christian Schwarz
c07f9ec62e build: use go 1.19 for testing & release builds
New docker image since the old one was deprecated, according
to https://discuss.circleci.com/t/go-lang-docker-image-circleci-golang-1-19-is-missing/44961
2022-10-27 00:19:06 +02:00
Christian Schwarz
fd5b0e6831 build: update golangci-lint
The previous commits were done in response to updating to
the version that we now pin in this commit.
We do the update after the fixes so that each commit builds.
2022-10-27 00:19:06 +02:00
Christian Schwarz
a4cea1b4f3 go1.19: zfs.SendStream.Close() after EOF would return context cancellation error
Before upgrading to Go 1.19, these platform tests would sporadically
fail due to the reason outlined in the comment

  github.com/zrepl/zrepl/platformtest/tests.SendStreamMultipleCloseAfterEOF
  github.com/zrepl/zrepl/platformtest/tests.SendStreamCloseAfterEOFRead
2022-10-27 00:19:06 +02:00
Christian Schwarz
c0b52b92d5 systemd: set GOTRACEBACK=crash so that we have core dumps
They are useful, not least for debugging
SIGSYS caused by overly restrictive settings in the unit file.
(See previous commit for an example.)
2022-10-26 22:39:18 +02:00
Christian Schwarz
12018b3685 go1.19: adjust systemd unit to allow setrlimit
Go 1.19 uses it during startup.

From the Go changelog:

> On Unix operating systems, Go programs that import package os now
> automatically increase the open file limit (RLIMIT_NOFILE) to the
> maximum allowed value; that is, they change the soft limit to match the
> hard limit. This corrects artificially low limits set on some systems
> for compatibility with very old C programs using the select system call.
> Go programs are not helped by that limit, and instead even simple
> programs like gofmt often ran out of file descriptors on such systems
> when processing many files in parallel. One impact of this change is
> that Go programs that in turn execute very old C programs in child
> processes may run those programs with too high a limit. This can be
> corrected by setting the hard limit before invoking the Go program.
2022-10-26 22:39:18 +02:00
Christian Schwarz
a91fb873e4 fix incorrect use of sort.StringSlice
A newer version of staticcheck found these:

> SA4029: sort.StringSlice is a type, not a function, and
> sort.StringSlice(variants) doesn't sort your values; consider using
> sort.Strings instead (staticcheck)
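
To illustrate the pitfall in isolation (the variable and function names below are made up for the example, not taken from the zrepl code): `sort.StringSlice(x)` is only a type conversion, whereas `sort.Strings(x)` sorts in place.

```go
package main

import (
	"fmt"
	"sort"
)

// sortVariants returns a sorted copy of v, using sort.Strings.
func sortVariants(v []string) []string {
	out := append([]string(nil), v...)
	sort.Strings(out)
	return out
}

func main() {
	variants := []string{"c", "a", "b"}

	// Conversion only: the result is discarded and nothing gets sorted.
	_ = sort.StringSlice(variants)
	fmt.Println(variants) // [c a b]

	fmt.Println(sortVariants(variants)) // [a b c]
}
```

Calling `sort.StringSlice(variants).Sort()` would also sort in place; it is the missing method call that makes the bare conversion a no-op.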
2022-10-24 22:22:41 +02:00
Christian Schwarz
a6aa610165 run go1.19 gofmt and make adjustments as needed
(Go 1.19 expanded doc comment syntax)
2022-10-24 22:22:41 +02:00
Christian Schwarz
6c87bdb9fb go1.19: switch to new nolint directive that is compatible with Go 1.19 gofmt 2022-10-24 22:22:11 +02:00
Christian Schwarz
b9250a41a2 go1.18: address net.Error.Temporary() deprecation
Go 1.18 deprecated net.Error.Temporary().
This commit cleans up places where we use it incorrectly.
Also, the rpc layer defines some errors that implement

  interface { Temporary() bool }

I added comments to all of the implementations to indicate
whether they will be required if net.Error.Temporary is
ever removed in the future.

For HandshakeError, the Temporary() return value is actually
important. I moved & rewrote a (previously misplaced) comment
there.

The ReadStreamError changes were
1. necessary to pacify newer staticcheck and
2. technically, an error can implement Temporary()
   without being a net.Error. This applies to some syscall
   errors in the standard library.

Reading list for those interested:
- https://github.com/golang/go/issues/45729
- https://groups.google.com/g/golang-nuts/c/-JcZzOkyqYI
- https://man7.org/linux/man-pages/man2/accept.2.html

Note: This change was prompted by staticcheck:

> SA1019: neterr.Temporary has been deprecated since Go 1.18 because it
> shouldn't be used: Temporary errors are not well-defined. Most
> "temporary" errors are timeouts, and the few exceptions are surprising.
> Do not use this method. (staticcheck)
2022-10-24 22:21:52 +02:00
Christian Schwarz
a967986a18 fixup: fix hooks unit tests
The previous commit c743c7b03f
broke the hooks unit tests.

GitHub was not configured to require passing tests for master merge.
Didn't notice it locally due to Go's test caching.
I amended this before pushing this change.
2022-10-09 15:36:00 +02:00
Christian Schwarz
c743c7b03f refactor snapper & support cron-based snapshotting
fixes https://github.com/zrepl/zrepl/issues/554
refs https://github.com/zrepl/zrepl/discussions/547#discussioncomment-1936126
2022-09-25 19:23:44 +02:00
Christian Schwarz
a9c61b4b0b zrepl status UI: include w shortcut to wrap lines in help bar 2022-09-25 19:23:44 +02:00
Christian Schwarz
206d359dcd docs: sendrecvoptions: fix heading level for section on placeholders 2022-09-25 18:23:54 +02:00
Christian Schwarz
2d8c3692ec rework resume token validation to allow resuming from raw sends of unencrypted datasets
Before this change, resuming from an unencrypted dataset with
send.raw=true specified wouldn't work with zrepl due to overly
restrictive resume token checking.

An initial PR to fix this was made in https://github.com/zrepl/zrepl/pull/503
but it didn't address the core of the problem.
The core of the problem was that zrepl assumed that if a resume token
contained `rawok=true, compressok=true`, the resulting send would be
encrypted. But if the sender dataset was unencrypted, such a resume would
actually result in an unencrypted send.
Which could be totally legitimate but zrepl failed to recognize that.

BACKGROUND
==========

The following snippets of OpenZFS code are insightful regarding how the
various ${X}ok values in the resume token are handled:

- 6c3c5fcfbe/module/zfs/dmu_send.c (L1947-L2012)
- 6c3c5fcfbe/module/zfs/dmu_recv.c (L877-L891)
- https://github.com/openzfs/zfs/blob/6c3c5fc/lib/libzfs/libzfs_sendrecv.c#L1663-L1672

Basically, some zfs send flags make the DMU send code set some DMU send
stream featureflags, although it's not a pure mapping, i.e, which DMU
send stream flags are used depends somewhat on the dataset (e.g., is it
encrypted or not, or, does it use zstd or not).

Then, the receiver looks at some (but not all) feature flags and maps
them to ${X}ok dataset zap attributes.

These are funnelled back to the sender 1:1 through the resume_token.

And the sender turns them into lzc flags.

As an example, let's look at zfs send --raw.
If the sender requests a raw send on an unencrypted dataset, the send
stream (and hence the resume token) will not have the raw stream
featureflag set, and hence the resume token will not have the rawok
field set. Instead, it will have compressok, embedok, and depending
on whether large blocks are present in the dataset, largeblockok set.

WHAT'S ZREPL'S ROLE IN THIS?
============================

zrepl provides a virtual encrypted sendflag that is like `raw`,
but further ensures that we only send encrypted datasets.

For any other resume token stuff, it shouldn't do any checking,
because it's a futile effort to keep up with ZFS send/recv features
that are orthogonal to encryption.

CHANGES MADE IN THIS COMMIT
===========================

- Rip out a bunch of needless checking that zrepl would do during
  planning. These checks were there to give better error messages,
  but actually, the error messages created by the endpoint.Sender.Send
  RPC upon send args validation failure are good enough.
- Add platformtests to validate all combinations of
  (Unencrypted/Encrypted FS) x (send.encrypted = true | false) x (send.raw = true | false)
  for both non-resuming and resuming sends.
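
The size of that test matrix can be illustrated with a short sketch. The `sendCase` type and `allSendCases` function are hypothetical and do not reflect how the platformtests are actually structured:

```go
package main

import "fmt"

// sendCase is a hypothetical description of one combination under test.
type sendCase struct {
	EncryptedFS   bool // is the sender dataset encrypted?
	SendEncrypted bool // send.encrypted
	SendRaw       bool // send.raw
	Resuming      bool // resuming vs. non-resuming send
}

// allSendCases enumerates every combination of the four booleans.
func allSendCases() []sendCase {
	bools := []bool{false, true}
	var cases []sendCase
	for _, fs := range bools {
		for _, enc := range bools {
			for _, raw := range bools {
				for _, resume := range bools {
					cases = append(cases, sendCase{fs, enc, raw, resume})
				}
			}
		}
	}
	return cases
}

func main() {
	fmt.Println(len(allSendCases())) // 16
}
```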

Additional manual testing done:
1. With zrepl 0.5, setup with unencrypted dataset, send.raw=true specified, no send.encrypted specified.
2. Observe that regular non-resuming send works, but resuming doesn't work.
3. Upgrade zrepl to this change.
4. Observe that both regular and resuming send work.

closes https://github.com/zrepl/zrepl/pull/613
2022-09-25 17:32:02 +02:00
Christian Schwarz
7769263c2e platformtest: add QueueSubtest functionality
Use it from a top-level test case to queue the
execution of sub-tests after this test case is complete.

Note that the testing harness executes the subtest
_after_ the current top-level test. Hence, the subtest
cannot use any ZFS state of the top-level test.
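
The ordering guarantee can be sketched with a toy harness. Only the name `QueueSubtest` comes from this commit; the `harness` type, the signatures, and `run` are assumptions made for illustration:

```go
package main

import "fmt"

type testFunc func(h *harness)

type namedTest struct {
	name string
	fn   testFunc
}

// harness is a toy stand-in for the platformtest harness.
type harness struct {
	queue []namedTest
	log   []string
}

// QueueSubtest only records the subtest; it does not run it.
func (h *harness) QueueSubtest(name string, fn testFunc) {
	h.queue = append(h.queue, namedTest{name, fn})
}

// run executes the top-level test, then drains the queue. A subtest starts
// only after the test that queued it has returned.
func run(name string, top testFunc) []string {
	h := &harness{}
	h.QueueSubtest(name, top)
	for len(h.queue) > 0 {
		t := h.queue[0]
		h.queue = h.queue[1:]
		h.log = append(h.log, "begin "+t.name)
		t.fn(h)
		h.log = append(h.log, "end "+t.name)
	}
	return h.log
}

func main() {
	log := run("top", func(h *harness) {
		h.QueueSubtest("sub", func(*harness) {})
	})
	fmt.Println(log) // [begin top end top begin sub end sub]
}
```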
2022-09-25 17:10:53 +02:00
Christian Schwarz
89f7c76c4e lint: allow empty else branches 2022-09-25 17:10:53 +02:00
jtagcat
c7771f98f5 docs: improve overview
There were, and still are, too many words. It's a very white paper vibe.
Docs need to be more brief, exact, and on-point.

closes https://github.com/zrepl/zrepl/pull/618
2022-07-31 15:50:53 +02:00
jtagcat
299f1c906e docs: overview: clarify configs _are_ ordered
Previously, the unordered list and the phrase 'are considered'
left it unclear whether one or all files are 'considered'.
In reality, the first valid is used, so an ordered list and
perhaps better wording communicates this fact.

refs https://github.com/zrepl/zrepl/pull/618
2022-07-31 15:33:23 +02:00
Kiss Károly
d3f68ae4e8 replication: ignore bookmarks when computing incremental path
fixes https://github.com/zrepl/zrepl/issues/490
closes https://github.com/zrepl/zrepl/pull/619

Co-authored-by: Christian Schwarz <me@cschwarz.com>
2022-07-31 15:25:19 +02:00
Christian Schwarz
193abbe6b1 fix active child tasks panic with endpoint.ListAbstractionsStreamed
The goroutine that does endTask() for
"list-abstractions-streamed-producer" can be preempted
after it has closed the out and outErrs channel,
but before it calls endTask().
If the parent ("handler") then gets scheduled
and ends itself, it will observe an active child task
"list-abstractions-streamed-producer".

This is easy to demo by injecting a sleep here:

  --- a/endpoint/endpoint_zfs_abstraction.go
  +++ b/endpoint/endpoint_zfs_abstraction.go
  @@ -575,6 +576,7 @@ func ListAbstractionsStreamed(ctx context.Context, query ListZFSHoldsAndBookmark
          ctx, endTask := trace.WithTask(ctx, "list-abstractions-streamed-producer")
          go func() {
                  defer endTask()
  +               defer time.Sleep(10 * time.Second)
                  defer close(out)
                  defer close(outErrs)
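
The window exists because deferred calls run in last-in-first-out order: both channels are closed before `endTask()` runs. A standalone sketch follows, in which `deferOrder` and the recorded strings are made up for the demo:

```go
package main

import "fmt"

// deferOrder registers three deferred calls in the same order as the
// goroutine above and records the order in which they actually run.
func deferOrder() []string {
	var order []string
	func() {
		defer func() { order = append(order, "endTask") }()
		defer func() { order = append(order, "close(out)") }()
		defer func() { order = append(order, "close(outErrs)") }()
	}()
	return order
}

func main() {
	fmt.Println(deferOrder()) // [close(outErrs) close(out) endTask]
}
```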

fixes https://github.com/zrepl/zrepl/issues/607
2022-07-17 21:44:03 +02:00
Goran Mekić
02b215128e build: consistently use $(MAKE) when invoking it recursively
Not for the `docker run ... make ...` commands though!

closes https://github.com/zrepl/zrepl/pull/615
2022-07-12 00:18:38 +02:00
Christian Schwarz
dc03db7423 rpc/grpcclientidentity/authlistener_grpc_adaptor: don't assume peer.Addr is set
On Illumos, getpeername doesn't work from Go on socketpair sockets.
That's why .RemoteAddr() returns nil on such a socket.
And that `nil` ultimately lands in the `p.Addr`.
So, `p.Addr.String()` would deref `nil`, leading to

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xaea33e]

    goroutine 614 [running]:
    github.com/zrepl/zrepl/rpc/grpcclientidentity.NewInterceptors.func1({0xf1e158, 0xc000631200}, {0xd514c0, 0xc000631230}, 0xc000032740, 0xc000524348)
    	/dpool/export/home/mills/Downloads/code/oi-userland-gh/components/sysutils/zrepl/build/amd64/rpc/grpcclientidentity/authlistener_grpc_adaptor.go:121 +0x13e
    github.com/zrepl/zrepl/replication/logic/pdu._Replication_ListFilesystems_Handler({0xdb30c0, 0xc00001a630}, {0xf1e158, 0xc000631200}, 0xc00052b7a0, 0xc000522000)
    	/dpool/export/home/mills/Downloads/code/oi-userland-gh/components/sysutils/zrepl/build/amd64/replication/logic/pdu/pdu_grpc.pb.go:186 +0x16a
    google.golang.org/grpc.(*Server).processUnaryRPC(0xc00016e700, {0xf2bc00, 0xc0000f2780}, 0xc00011c200, 0xc000522150, 0x1497c78, 0x0)
    	/dpool/export/home/mills/Downloads/code/oi-userland-gh/components/sysutils/zrepl/zrepl-0.5.0/gopath/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1217 +0xe28
    google.golang.org/grpc.(*Server).handleStream(0xc00016e700, {0xf2bc00, 0xc0000f2780}, 0xc00011c200, 0x0)
    	/dpool/export/home/mills/Downloads/code/oi-userland-gh/components/sysutils/zrepl/zrepl-0.5.0/gopath/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:1540 +0xcb3
    google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000373b70, 0xc00016e700, {0xf2bc00, 0xc0000f2780}, 0xc00011c200)
    	/dpool/export/home/mills/Downloads/code/oi-userland-gh/components/sysutils/zrepl/zrepl-0.5.0/gopath/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:878 +0xad
    created by google.golang.org/grpc.(*Server).serveStreams.func1
    	/dpool/export/home/mills/Downloads/code/oi-userland-gh/components/sysutils/zrepl/zrepl-0.5.0/gopath/pkg/mod/google.golang.org/grpc@v1.35.0/server.go:876 +0x1ec

fixes https://github.com/zrepl/zrepl/issues/598
2022-07-10 23:59:40 +02:00
Cole Helbling
1df0f8912a Add --skip-cert-check flag to zrepl configcheck to prevent checking cert files
It may be desirable to check that a config is valid without checking for
the existence of certificate files (e.g. when validating a config inside
a sandbox without access to the cert files).

This will be very useful for NixOS so that we can check the config file
at nix-build time (e.g. potentially without proper permissions to read cert
files for a TLS connection).

fixes https://github.com/zrepl/zrepl/issues/467
closes https://github.com/zrepl/zrepl/pull/587
2022-07-08 20:18:41 +02:00
3nprob
e4112d888c add ZREPL_DESTROY_MAX_BATCH_SIZE env var to control max batch destroy size
fixes #508
closes https://github.com/zrepl/zrepl/pull/604
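
How such a knob might be consumed can be sketched as follows. Only the variable name `ZREPL_DESTROY_MAX_BATCH_SIZE` comes from this commit. The helpers `maxBatchSize` and `batches`, the fallback value, and the handling of invalid values are assumptions:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxBatchSize reads the env var and falls back if it is unset or invalid.
func maxBatchSize(fallback int) int {
	n, err := strconv.Atoi(os.Getenv("ZREPL_DESTROY_MAX_BATCH_SIZE"))
	if err != nil || n <= 0 {
		return fallback
	}
	return n
}

// batches splits items into consecutive chunks of at most size elements.
func batches(items []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for len(items) > 0 {
		n := size
		if n > len(items) {
			n = len(items)
		}
		out = append(out, items[:n])
		items = items[n:]
	}
	return out
}

func main() {
	snaps := []string{"fs@a", "fs@b", "fs@c"}
	// The fallback of 2 is arbitrary and only serves the example.
	fmt.Println(batches(snaps, maxBatchSize(2)))
}
```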
2022-06-30 09:22:26 +02:00
Christian Schwarz
53f9bd6d88 docs: update CLI usage to --mode raw & remove outdated "Limitations" section
fixes https://github.com/zrepl/zrepl/issues/609
2022-06-28 00:17:34 +02:00