10 Commits

Author SHA1 Message Date
Goran Mekic
bc5e1ede04 metric to detect filesystems rules that don't match any local dataset (#653)
This PR adds a Prometheus counter called
`zrepl_zfs_list_unmatched_user_specified_dataset_count`.
Monitor for increases of the counter to detect filesystem filter rules that
have no effect because they don't match any local filesystem.

An example use case for this is the following story:
1. Someone sets up zrepl with a `filesystems` filter for `zroot/pg14<`.
2. During the upgrade to Postgres 15, they rename the dataset to `zroot/pg15`,
   but forget to update the zrepl `filesystems` filter.
3. zrepl will not snapshot / replicate the `zroot/pg15<` datasets.

Since `filesystems` rules are always evaluated on the side that has the datasets,
we can smuggle this functionality into the `zfs` module's `ZFSList` function that
is used by all jobs with a `filesystems` filter.
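
The check described above can be sketched roughly as follows. This is a minimal illustration, not zrepl's `ZFSList` code: the `rule` type, its `matches` method, and `countUnmatched` are hypothetical names, the `<` suffix is assumed to mean "the dataset and all of its children", and the real implementation would add the result to the Prometheus counter instead of returning it.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical, simplified stand-in for one `filesystems` entry.
// Assumption: a trailing "<" means "this dataset and all of its children".
type rule struct {
	pattern string
}

// matches reports whether the rule applies to the given dataset name.
func (r rule) matches(dataset string) bool {
	if strings.HasSuffix(r.pattern, "<") {
		root := strings.TrimSuffix(r.pattern, "<")
		return dataset == root || strings.HasPrefix(dataset, root+"/")
	}
	return dataset == r.pattern
}

// countUnmatched returns how many rules match none of the local datasets.
// A real implementation would feed this number into a metrics counter.
func countUnmatched(rules []rule, local []string) int {
	unmatched := 0
	for _, r := range rules {
		found := false
		for _, ds := range local {
			if r.matches(ds) {
				found = true
				break
			}
		}
		if !found {
			unmatched++
		}
	}
	return unmatched
}

func main() {
	// The story from the commit message: the filter still names pg14,
	// but the dataset was renamed to pg15.
	rules := []rule{{pattern: "zroot/pg14<"}, {pattern: "zroot/home<"}}
	local := []string{"zroot", "zroot/pg15", "zroot/pg15/data", "zroot/home"}
	fmt.Println(countUnmatched(rules, local)) // prints 1
}
```

An alert on any increase of such a counter would have caught the stale `zroot/pg14<` rule in the story.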

Dashboard changes:
- histogram with increase in $__interval, one row per job
- table with increase in $__range
- explainer text box, so people know what the previous two are about
We had to re-arrange some panels, hence the Git diff isn't great.

closes https://github.com/zrepl/zrepl/pull/653

Co-authored-by: Christian Schwarz <me@cschwarz.com>
Co-authored-by: Goran Mekić <meka@tilda.center>
2023-05-02 22:13:52 +02:00
Christian Schwarz
a8e92971d0 zfs: rewrite SendStream, fix bug in Close() on FreeBSD, add platformtests
This commit was motivated by https://github.com/zrepl/zrepl/issues/495
where, on FreeBSD with OpenZFS 2.0, a SendStream.Close() call might wait indefinitely for `zfs send` to exit.
The reason is that, due to the refactoring done for redacted send & recv
(30af21b025),
the `dump_bytes` function, which writes to the pipe, executes in a separate thread (synctask taskq) iff not `HAVE_LARGE_STACKS`.
The `zfs send` process/thread waits for that taskq thread using an uninterruptible primitive.
So when we SIGKILL `zfs send`, that signal doesn't reach the right thread to interrupt the pipe write.

Theoretically this affects both Linux and FreeBSD, but most Linux users `HAVE_LARGE_STACKS`, and since https://github.com/openzfs/zfs/pull/12350/files OpenZFS on FreeBSD `HAVE_LARGE_STACKS` as well.
However, at least until FreeBSD 13.1, possibly for the entire 13 lifecycle, we're going to have to live with that oddity.

Measures taken in this commit:
- Report the behavior as an upstream bug https://github.com/openzfs/zfs/issues/12500
- Change SendStream code so that it closes zrepl's read-end of the pipe (see comment in code)
- Clean up and make explicit SendStream's state handling
- Write extensive platformtests for SendStream
    - They pass on my Linux install and on FreeBSD 12
    - FreeBSD 13 still needs testing.
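
The idea behind closing the read end can be illustrated with a small sketch. It is not the SendStream code: a goroutine stands in for the `zfs send` process, and `writeUntilError`, `closeReaderAndWait`, and `isBrokenPipe` are hypothetical helpers. The point is only that a writer blocked on a full pipe is released with `EPIPE` once the last reader goes away, without any signal having to reach it.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// writeUntilError stands in for a producer that keeps writing to the pipe.
// With nobody reading, it blocks as soon as the pipe buffer is full.
func writeUntilError(w *os.File) error {
	buf := make([]byte, 4096)
	for {
		if _, err := w.Write(buf); err != nil {
			return err
		}
	}
}

// closeReaderAndWait creates a pipe, lets a writer fill it, then closes the
// read end and returns the error with which the writer stopped.
func closeReaderAndWait() error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	defer w.Close()

	done := make(chan error, 1)
	go func() { done <- writeUntilError(w) }()

	// Closing the only read end makes pending and future writes fail,
	// which unblocks the writer regardless of signal delivery.
	if err := r.Close(); err != nil {
		return err
	}
	return <-done
}

// isBrokenPipe reports whether err is (or wraps) EPIPE.
func isBrokenPipe(err error) bool {
	return errors.Is(err, syscall.EPIPE)
}

func main() {
	err := closeReaderAndWait()
	fmt.Println("writer stopped with broken pipe:", isBrokenPipe(err))
}
```

In the scenario from this commit the blocked writer is a kernel thread rather than a goroutine, so the sketch only mirrors the pipe semantics, not the taskq behavior.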

fixes https://github.com/zrepl/zrepl/issues/495
2021-09-19 20:11:31 +02:00
Christian Schwarz
af2d6579c5 [#347] zfscmd: fix dangling trace Task on .Start() failure
fixes #347
2020-09-02 22:45:44 +02:00
Christian Schwarz
0f3da73ef1 [#347] zfscmd + zfs: define .Start() semantics, apply to call sites in pkg zfs
fixes #347
2020-09-02 22:45:44 +02:00
Christian Schwarz
10a14a8c50 [#307] add package trace, integrate it with logging, and adopt it throughout zrepl
package trace:

- introduce the concept of tasks and spans, tracked as linked list within ctx
    - see package-level docs for an overview of the concepts
    - **main feature 1**: unique stack of task and span IDs
        - makes it easy to follow a series of log entries in concurrent code
    - **main feature 2**: ability to produce a chrome://tracing-compatible trace file
        - either via an env variable or a `zrepl pprof` subcommand
        - this is not a CPU profile, we already have go pprof for that
        - but it is very useful to visually inspect where the
          replication / snapshotter / pruner spends its time
          ( fixes #307 )
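
To make the "linked list within ctx" idea concrete, here is a minimal sketch. It does not reproduce the API of package trace: `spanNode`, `withSpan`, `spanStack`, and the `/` separator are assumptions chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// spanKey is the private context key under which the list head is stored.
type spanKey struct{}

// spanNode is one element of the linked list carried by the context.
type spanNode struct {
	id     string
	parent *spanNode
}

// withSpan returns a child context whose span list has id pushed on top.
// The parent context is left untouched, so concurrent siblings do not interfere.
func withSpan(ctx context.Context, id string) context.Context {
	parent, _ := ctx.Value(spanKey{}).(*spanNode)
	return context.WithValue(ctx, spanKey{}, &spanNode{id: id, parent: parent})
}

// spanStack renders the IDs from outermost to innermost.
func spanStack(ctx context.Context) string {
	var ids []string
	for n, _ := ctx.Value(spanKey{}).(*spanNode); n != nil; n = n.parent {
		ids = append(ids, n.id)
	}
	// The list is walked innermost-first, so reverse it for display.
	for i, j := 0, len(ids)-1; i < j; i, j = i+1, j-1 {
		ids[i], ids[j] = ids[j], ids[i]
	}
	return strings.Join(ids, "/")
}

// demoStack builds a three-level stack and returns its rendering.
func demoStack() string {
	ctx := withSpan(context.Background(), "job")
	ctx = withSpan(ctx, "replication")
	ctx = withSpan(ctx, "send")
	return spanStack(ctx)
}

func main() {
	fmt.Println(demoStack()) // prints job/replication/send
}
```

Attaching such a string to every log entry is what makes interleaved output from concurrent goroutines attributable to one task.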

usage in package daemon/logging:

- goal: every log entry should have a trace field with the ID stack from package trace

- make `logging.GetLogger(ctx, Subsys)` the authoritative `logger.Logger` factory function
    - the context carries a linked list of injected fields which
      `logging.GetLogger` adds to the logger it returns
    - `logging.GetLogger` also uses package `trace` to get the
      task-and-span-stack and injects it into the returned logger's fields
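
A rough sketch of a context-driven logger factory in this spirit follows. The names `withField`, `getLogger`, and the `logger` type are hypothetical and much simpler than `logging.GetLogger`; the sketch only shows how fields injected into the context end up on the logger that is handed out.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// fieldKey is the private context key for the list of injected fields.
type fieldKey struct{}

// fieldNode is one injected key/value pair, linked to earlier ones.
type fieldNode struct {
	key, value string
	parent     *fieldNode
}

// withField returns a child context carrying one more log field.
func withField(ctx context.Context, key, value string) context.Context {
	parent, _ := ctx.Value(fieldKey{}).(*fieldNode)
	return context.WithValue(ctx, fieldKey{}, &fieldNode{key: key, value: value, parent: parent})
}

// logger is a toy logger that only knows its fields.
type logger struct {
	fields map[string]string
}

// getLogger is the single factory: it collects every field found in ctx.
// When a key was injected more than once, the innermost value wins.
func getLogger(ctx context.Context, subsys string) logger {
	fields := map[string]string{"subsystem": subsys}
	for n, _ := ctx.Value(fieldKey{}).(*fieldNode); n != nil; n = n.parent {
		if _, seen := fields[n.key]; !seen {
			fields[n.key] = n.value
		}
	}
	return logger{fields: fields}
}

// line formats a message with the fields in sorted key order.
func (l logger) line(msg string) string {
	keys := make([]string, 0, len(l.fields))
	for k := range l.fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys)+1)
	for _, k := range keys {
		parts = append(parts, k+"="+l.fields[k])
	}
	return strings.Join(append(parts, "msg="+msg), " ")
}

// demoLine injects two fields and formats one entry.
func demoLine() string {
	ctx := withField(context.Background(), "job", "backup")
	ctx = withField(ctx, "fs", "zroot/pg15")
	return getLogger(ctx, "repl").line("start")
}

func main() {
	fmt.Println(demoLine()) // prints fs=zroot/pg15 job=backup subsystem=repl msg=start
}
```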
2020-05-19 11:30:02 +02:00
Christian Schwarz
0e5c77d2be [#277] rpc + zfs: drop zfs.StreamCopier, use io.ReadCloser instead
2020-05-18 19:46:24 +02:00
Christian Schwarz
aed6149c8c zfscmd: fix crash in zfscmd_prometheus.go due to incorrectly extracted ProcessState
fixup of 96e188d7c4
refs #196
refs #301

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x9a472a]

goroutine 15826 [running]:
os.(*ProcessState).systemTime(...)
        /home/cs/go1.13/src/os/exec_unix.go:98
os.(*ProcessState).SystemTime(...)
        /home/cs/go1.13/src/os/exec.go:141
github.com/zrepl/zrepl/zfs/zfscmd.waitPostPrometheus(0xc000c04800, 0xe21ce0, 0xc000068270, 0xbf9f80d88107e861, 0x19bae710e6, 0x13a8b60)
        /home/cs/zrepl/zrepl/zfs/zfscmd/zfscmd_prometheus.go:69 +0x22a
github.com/zrepl/zrepl/zfs/zfscmd.(*Cmd).waitPost(0xc000c04800, 0xe21ce0, 0xc000068270)
        /home/cs/zrepl/zrepl/zfs/zfscmd/zfscmd.go:155 +0x18a
github.com/zrepl/zrepl/zfs/zfscmd.(*Cmd).CombinedOutput(0xc000c04800, 0xc0004b8270, 0xd02eea, 0x3, 0xc0001f6c40, 0x3)
        /home/cs/zrepl/zrepl/zfs/zfscmd/zfscmd.go:40 +0xb3
github.com/zrepl/zrepl/zfs.ZFSRelease(0xe36aa0, 0xc0004b8270, 0xc0009a3a40, 0x13, 0xc0004a5d00, 0x1, 0x1, 0xed62eb221, 0x13a8b60)
        /home/cs/zrepl/zrepl/zfs/holds.go:102 +0x2a7
github.com/zrepl/zrepl/endpoint.ReleaseStep(0xe36aa0, 0xc0004b8270, 0xc0004befc0, 0xe, 0xd08482, 0x8, 0xc0001cb02f, 0x2, 0x1eeea3bff89dc90b, 0x134d6, ...)
        /home/cs/zrepl/zrepl/endpoint/endpoint_zfs_abstraction_step.go:130 +0x367
github.com/zrepl/zrepl/endpoint.(*Sender).SendCompleted.func2(0xc000459190, 0xc000390e30, 0xc00041fd80, 0xc0004befc0, 0xe, 0xd08482, 0x8, 0xc0001cb02f, 0x2, 0x1eeea3bff89dc90b, ...)
        /home/cs/zrepl/zrepl/endpoint/endpoint.go:419 +0x1c3
created by github.com/zrepl/zrepl/endpoint.(*Sender).SendCompleted
        /home/cs/zrepl/zrepl/endpoint/endpoint.go:413 +0x776
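
The trace above ends in `os.(*ProcessState).SystemTime` being called on a nil pointer. A defensive way to read CPU times from a command might look like the sketch below. It is not the actual fix in zfscmd_prometheus.go: `cpuTimes` and `neverStartedHasNoTimes` are hypothetical helpers, and the binary path is deliberately nonexistent.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// cpuTimes returns the user and system CPU time of a finished command.
// ok is false when no ProcessState is available, for example because the
// command never started, so callers can skip their metrics update.
func cpuTimes(cmd *exec.Cmd) (user, system time.Duration, ok bool) {
	if cmd == nil || cmd.ProcessState == nil {
		return 0, 0, false
	}
	return cmd.ProcessState.UserTime(), cmd.ProcessState.SystemTime(), true
}

// neverStartedHasNoTimes runs a command that cannot start and reports
// whether cpuTimes correctly signals that no times are available.
func neverStartedHasNoTimes() bool {
	cmd := exec.Command("/nonexistent/binary-used-for-illustration")
	_ = cmd.Run() // fails before a process exists; ProcessState stays nil
	_, _, ok := cpuTimes(cmd)
	return !ok
}

func main() {
	fmt.Println("guard handled missing ProcessState:", neverStartedHasNoTimes())
}
```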
2020-04-21 14:10:25 +02:00
Christian Schwarz
0834a184b8 zfscmd: do not do duplicate waitPre callbacks
it just makes sense that if we only dispatch one waitPost, we should
also only dispatch one waitPre
2020-04-21 14:10:18 +02:00
Christian Schwarz
96e188d7c4 zfscmd: fix nil deref in waitPostLogging when command was killed
fixes #301
2020-04-08 00:26:56 +02:00
Christian Schwarz
1336c91865 zfs: introduce pkg zfs/zfscmd for command logging, status, prometheus metrics
refs #196
2020-04-05 20:47:25 +02:00