Because some jobs add client identity to root_fs and other jobs don't do
that,
we can't reliable detect overlapping of filesystems. And and the same
time we
need an ability to use equal or overlapped root_fs for different jobs.
For
instance see this config:
```
- name: "zdisk"
type: "sink"
root_fs: "zdisk/zrepl"
serve:
type: "local"
listener_name: "zdisk"
```
and
```
- name: "remote-to-zdisk"
type: "pull"
connect:
type: "tls"
root_fs: "zdisk/zrepl/remote"
```
As you can see, two jobs have overlapped root_fs, but actually datasets
are not
overlapped, because job `zdisk` save everything under
`zdisk/zrepl/localhost`,
because it adds client identity. So they actually use two different
filesystems:
`zdisk/zrepl/localhost` and `zdisk/zrepl/remote`. And we can't detect
this
situation during config check. So let's just remove this check, because
it's
admin's duty to configure correct root_fs's.
---------
Co-authored-by: Christian Schwarz <me@cschwarz.com>
Command used:
```
cd build
go get -u github.com/golangci/golangci-lint/cmd/golangci-lint
go mod tidy
```
Further, golangci-lint requires go 1.20 to build, so, use that as the lowest version in the CI.
The sphinxcontrib-versioning seems unmaintainted and I can't
get the fork that we used before this PR working on Python 3.10.
The situation wrt maintenance doesn't seem much better for
sphinx-multiversion, but, at least I could get it to work
with current sphinx versions.
The main problem with sphinx-multiversion is that it doesn't render
anything at `/`. I.e., `https://zrepl.github.io/configuration.html` will
be 404.
That's different from `sphinxcontrib-versioning`, and thus switching
to sphinx-multiversion would break URLs.
We host on GitHub pages and don't control the webserver,
so, we can't use webserver-level redirects to keep the URLs working.
We could create JS-level redirects, or `http-equiv`, but that's ugly as
well.
The simplest solution was to fork sphinx-multiversion and hard-code
zrepl's specific needs into that fork.
The fork is based off v0.2.4 and pinned via requirements.txt.
Here are its unique commits:
https://github.com/Holzhaus/sphinx-multiversion/compare/master...zrepl:sphinx-multiversion:zrepl
We should revisit `sphinx-polyversion` in the future once its docs
improve.
See
https://github.com/Holzhaus/sphinx-multiversion/issues/88#issuecomment-1606221194
This PR updates the various Python packages, as I couldn't get
sphinx-multiversion to work with the (very old) versions that were
pinned in `requirements.txt` prior to this PR.
This PR's `requirements.txt` is from a clean Python 3.10 venv on Ubuntu
22.10 after running
```
pip install sphinx sphinx-rtd-theme
pip install 'git+https://github.com/zrepl/sphinx-multiversion/@52c915d7ad898d9641ec48c8bbccb7d4f079db93#egg=sphinx_multiversion'
```
Before this PR, we would panic in the `check` phase of `endpoint.Send()`'s `TryBatchDestroy` call in the following cases: the current protection strategy does NOT produce a tentative replication cursor AND
* `FromVersion` is a tentative cursor bookmark
* `FromVersion` is a snapshot, and there exists a tentative cursor bookmark for that snapshot
* `FromVersion` is a bookmark != tentative cursor bookmark, but there exists a tentative cursor bookmark for the same snapshot as the `FromVersion` bookmark
In those cases, the `check` concluded that we would delete `FromVersion`.
It came to that conclusion because the tentative cursor isn't part of `obsoleteAbs` if the protection strategy doesn't produce a tentative replication cursor.
The scenarios above can happen if the user changes the protection strategy from "with tentative cursor" to one "without tentative replication cursor", while there is a tentative replication cursor on disk.
The workaround was to rename the tentative cursor.
In all cases above, `TryBatchDestroy` would have destroyed the tentative cursor.
In case 1, that would fail the `Send` step and potentially break replication if the cursor is the last common bookmark. The `check` conclusion was correct.
In cases 2 and 3, deleting the tentative cursor would have been fine because `FromVersion` was a different entity than the tentative cursor. So, destroying the tentative cursor would be the right call.
The solution in this PR is as follows:
* add the `FromVersion` to the `liveAbs` set of live abstractions
* rewrite the `check` closure to use the full dataset path (`fullpath`) to identify the concrete ZFS object instead of the `zfs.FilesystemVersionEqualIdentity`, which is only identified by matching GUID.
* Holds have no dataset path and are not the `FromVersion` in any case, so disregard them.
fixes#666
This PR adds a Prometheus counter called
`zrepl_zfs_list_unmatched_user_specified_dataset_count`.
Monitor for increases of the counter to detect filesystem filter rules that
have no effect because they don't match any local filesystem.
An example use case for this is the following story:
1. Someone sets up zrepl with `filesystems` filter for `zroot/pg14<`.
2. During the upgrade to Postgres 15, they rename the dataset to `zroot/pg15`,
but forget to update the zrepl `filesystems` filter.
3. zrepl will not snapshot / replicate the `zroot/pg15<` datasets.
Since `filesystems` rules are always evaluated on the side that has the datasets,
we can smuggle this functionality into the `zfs` module's `ZFSList` function that
is used by all jobs with a `filesystems` filter.
Dashboard changes:
- histogram with increase in $__interval, one row per job
- table with increase in $__range
- explainer text box, so, people know what the previous two are about
We had to re-arrange some panels, hence the Git diff isn't great.
closes https://github.com/zrepl/zrepl/pull/653
Co-authored-by: Christian Schwarz <me@cschwarz.com>
Co-authored-by: Goran Mekić <meka@tilda.center>
CircleCI artifacts are available publicly.
And regarding expiration of artifacts, it doesn't really
matter because I delete minio artifacts after 30d as well.
For this kind of debugging, we switched to env vars a while ago.
For example, ZREPL_RPC_DEBUG.
I don't think we have a substitute for the RPCLog stuff.
However, NetConnLogger is still in the codebase.
obsoletes https://github.com/zrepl/zrepl/pull/661
Originally, I had a patch that would replace all usages of
time.Duration in package config with the new config.Duration
types, but:
1. these are all timeouts/retry intervals that have default values.
Most users don't touch them, and if they do, they don't need
day or week units.
2. go-yaml's error reporting for yaml.Unmarshaler is inferior to
built-in types (line numbers are missing, so the error would not have
sufficient context)
fixes https://github.com/zrepl/zrepl/issues/486