Commit Graph

983 Commits

Author SHA1 Message Date
Christian Schwarz
845195b7ed bandwidth limiting: fix crash with SnapJob
zrepl daemon panics when the snap job triggers

fixup for f5f269bfd5 (bandwidth limiting)
fixes #521

Oct 01 16:14:56 cstp zrepl[56563]: panic: invalid config`BandwidthLimit` field invalid: BucketCapacity must not be zero
Oct 01 16:14:56 cstp zrepl[56563]:         panic: end span: span still has active child spans
Oct 01 16:14:56 cstp zrepl[56563]: goroutine 38 [running]:
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/logging/trace.WithSpan.func2()
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/logging/trace/trace.go:341 +0x2ea
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/logging/trace.WithTaskAndSpan.func1()
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/logging/trace/trace_convenience.go:40 +0x2e
Oct 01 16:14:56 cstp zrepl[56563]: panic(0xcee9c0, 0xc000676730)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/go1.16.6/src/runtime/panic.go:965 +0x1b9
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/endpoint.NewSender(0xf5bbc0, 0xc0003840c0, 0xc0000b2c90, 0x4, 0xc0002c5958, 0x0, 0x0, 0x0, 0xc000068cf8)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/endpoint/endpoint.go:68 +0x1ec
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/job.(*SnapJob).doPrune(0xc00039e000, 0xf6e3b8, 0xc0006541b0)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/job/snapjob.go:179 +0x198
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon/job.(*SnapJob).Run(0xc00039e000, 0xf6e3b8, 0xc0001d83c0)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/job/snapjob.go:127 +0x329
Oct 01 16:14:56 cstp zrepl[56563]: github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0006a4100, 0xf6e3b8, 0xc00022a0f0, 0xf72d18, 0xc00039e000)
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/daemon.go:255 +0x15b
Oct 01 16:14:56 cstp zrepl[56563]: created by github.com/zrepl/zrepl/daemon.(*jobs).start
Oct 01 16:14:56 cstp zrepl[56563]:         /home/cs/zrepl/zrepl/daemon/daemon.go:251 +0x425
Oct 01 16:14:56 cstp systemd[1]: zrepl.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 01 16:14:56 cstp systemd[1]: zrepl.service: Failed with result 'exit-code'.
2021-10-09 15:52:38 +02:00
Christian Schwarz
2c9fcd7c14 rpc/dataconn: always close send stream returned from Sender.Send()
discovered while debugging #457
2021-10-09 15:43:31 +02:00
Christian Schwarz
4f9b63aa09 rework size estimation & dry sends
- use control connection (gRPC)
- use uint64 everywhere => fixes https://github.com/zrepl/zrepl/issues/463
- [BREAK] bump protocol version

closes https://github.com/zrepl/zrepl/pull/518
fixes https://github.com/zrepl/zrepl/issues/463
2021-10-09 15:43:27 +02:00
Christian Schwarz
a8e92971d0 zfs: rewrite SendStream, fix bug in Close() on FreeBSD, add platformtests
This commit was motivated by https://github.com/zrepl/zrepl/issues/495
where, on FreeBSD with OpenZFS 2.0, a SendStream.Close() call might wait indefinitely for `zfs send` to exit.
The reason is that, due to the refactoring done for redacted send & recv
(30af21b025),
the `dump_bytes` function, which writes to the pipe, executes in a separate thread (synctask taskq) iff not `HAVE_LARGE_STACKS`.
The `zfs send` process/thread waits for that taskq thread using an uninterruptible primitive.
So when we SIGKILL `zfs send`, that signal doesn't reach the right thread to interrupt the pipe write.

Theoretically this affects both Linux and FreeBSD, but most Linux users `HAVE_LARGE_STACKS` and since https://github.com/penzfs/zfs/pull/12350/files OpenZFS on FreeBSD `HAVE_LARGE_STACKS` as well.
However, at least until FreeBSD 13.1, possibly for the entire 13 lifecycle, we're going to have to live with that oddity.

Measures taken in this commit:
- Report the behavior as an upstream bug https://github.com/openzfs/zfs/issues/12500
- Change SendStream code so that it closes zrepl's read-end of the pipe (see comment in code)
- Clean up and make explicit SendStream's state handling
- Write extensive platformtests for SendStream
    - They pass on my Linux install and on FreeBSD 12
    - FreeBSD 13 still needs testing.

fixes https://github.com/zrepl/zrepl/issues/495
2021-09-19 20:11:31 +02:00
Christian Schwarz
b54e477602 platformtest: fix 'active child tasks' panic for ReceiveForceRollbackWorksUnencrypted
Revealed by rework of SendStream in a prior commit.
2021-09-19 20:03:01 +02:00
Christian Schwarz
959fb08a89 platformtest: fix replication tests (SizeEstimationConcurrency field in PlannerPolicy was not set)
fixup of 0ceea1b792 (parallel replication knobs)
2021-09-19 20:03:01 +02:00
Christian Schwarz
6ac012aa3c platformtest: work around missing feature detection for test 'ReplicationPropertyReplicationWorks' 2021-09-19 20:03:01 +02:00
Christian Schwarz
3e93b31f75 platformtest: fix bandwidth-limiting-related panics (missing BucketCapacity in sender/receiver config)
panic while running test: invalid config`Ratelimit` field invalid: BucketCapacity must not be zero
main.runTestCase.func1.1
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:190
runtime.gopanic
	/home/cs/go1.13/src/runtime/panic.go:679
github.com/zrepl/zrepl/endpoint.NewSender
	/home/cs/zrepl/zrepl/endpoint/endpoint.go:68
github.com/zrepl/zrepl/platformtest/tests.replicationInvocation.Do
	/home/cs/zrepl/zrepl/platformtest/tests/replication.go:87
github.com/zrepl/zrepl/platformtest/tests.ReplicationFailingInitialParentProhibitsChildReplication
	/home/cs/zrepl/zrepl/platformtest/tests/replication.go:925
main.runTestCase.func1
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:193
main.runTestCase
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:194
main.HarnessRun
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:107
main.main
	/home/cs/zrepl/zrepl/platformtest/harness/harness.go:42
runtime.main
	/home/cs/go1.13/src/runtime/proc.go:203
runtime.goexit
	/home/cs/go1.13/src/runtime/asm_amd64.s:1357

fixup for f5f269bfd5 (bandwidth limiting)
2021-09-19 20:03:01 +02:00
Christian Schwarz
08df208149 bandwidth limiting: use correct field name on error
fixup for f5f269bfd5 (bandwidth limiting)
2021-09-19 20:03:01 +02:00
Christian Schwarz
936ed73a45 rpc/dataconn: fix log message in closeErr case
refs #457
2021-09-19 18:40:39 +02:00
Christian Schwarz
8fee536260 transport/tcp: ipmap tests: remove tests that cover CIDR normalization
They broke on Go 1.17.
See
https://www.bleepingcomputer.com/news/security/go-rust-net-library-affected-by-critical-ip-address-validation-vulnerability/
for context.

fixes #514
2021-09-15 08:43:08 +02:00
Christian Schwarz
01b4792974 fix make deb-docker for all platforms but amd64 2021-09-13 22:54:21 +02:00
Christian Schwarz
ad80bb3735 status: byteprogresshistory: disable averaging as workaround for #497
refs #497
2021-09-12 20:08:44 +02:00
Christian Schwarz
f5f269bfd5 send/recv: job-level bandwidth limiting
Sponsored-by: Prominic.NET, Inc.

fixes #339
2021-09-12 20:08:43 +02:00
Christian Schwarz
5b16769057 docs: update supporters 2021-08-30 11:01:25 +02:00
Christian Schwarz
009bd410af docs: prune: improve grid example 2021-07-08 19:46:24 +02:00
Christian Schwarz
bcfcd7a134 docs / CI: stop creating churn with doc commits & commit as zreplbot@ 2021-07-08 17:07:24 +02:00
Matthias Freund
bf1276f767 status: port status-v1 ETA calculation patch
Must have forgotten to integrate it into the status-v2 branch at the
time.

refs https://github.com/zrepl/zrepl/issues/98#issuecomment-872154091

cc @dcdamien
2021-07-08 15:00:26 +02:00
James W. Brinkerhoff IV
9fa7a18351 docs: quickstart: external_disk: fix typo in example 'derive -> drive'
closes #472
2021-04-19 23:36:03 +02:00
sre
50e8ee4549 docs: apt repo: use sudo in the snippet that sets up the repo
I generally like when snippets are provided in a way which could be used without running as root, and uses sudo when applicable. This change allows for this.

It will, however print out one extra line, which is possible to remove by adding '>/dev/null' after '/etc/apt/sources.list.d/zrepl.list'.

closes #461
2021-04-17 21:50:36 +02:00
Lapo Luchini
3b5a1a8b9a docs/monitoring: change suggested prometheus port to 9811
Change to 9811 as registered with the prometheus project now.

Closes #444.
2021-03-28 18:18:02 +02:00
Christian Schwarz
f661d9429f pruning/keep_last_n: correctly handle the case where count > matching snaps
fixes #446
2021-03-25 22:42:01 +01:00
InsanePrawn
ac4b109872 status/interactive: Revert to simple wakeup/reset signalling
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>

closes #452
2021-03-25 22:26:17 +01:00
InsanePrawn
b2c6e51a43 client/signal: Revert "add signal 'snapshot', rename existing signal 'wakeup' to 'replication'"
This was merged to master prematurely as the job components are not decoupled well enough
for these signals to be useful yet.

This reverts commit 2c8c2cfa14.

closes #452
2021-03-25 22:26:17 +01:00
Lukas Schauer
ee2336a24b zfs: pipe size: default to value of /proc/sys/fs/pipe-max-siz
Addition by @problame: move getPipeCapacityHint() into platform-specific
code. This has the added benefit of not recognizing the envvar as an
envconst on platform that do not support resizing pipes.  => won't show
up in (zrepl status --raw).Global.Envconst

fixes #424
cloes #449
2021-03-25 22:24:50 +01:00
Cole Helbling
1e85b1cb5f client/status: allow raw mode without a tty
A fairly common pattern when presented with JSON is to pipe it to `jq`.

This also allows one to `zrepl status --mode raw > log` and operate on
the JSON there.

closes #442
2021-03-22 23:58:32 +01:00
Christian Schwarz
5c6d69a69c zfs: PropertySource: set type to uint32 so that enumer-generated code is platform-independent
make zrepl-bin test-platform-bin vet lint GOOS=freebsd   GOARCH=386
make[2]: Entering directory '/src'
GO111MODULE=on go build -mod=readonly  -ldflags "-X github.com/zrepl/zrepl/version.zreplVersion=v0.3.1-20-g07f2bff" -o "artifacts/zrepl-freebsd-386"
zfs/propertysource_enumer.go:41:9: constant 18446744073709551615 overflows PropertySource
zfs/propertysource_enumer.go:48:66: constant 18446744073709551615 overflows PropertySource
zfs/propertysource_enumer.go:57:23: constant 18446744073709551615 overflows PropertySource

fixes #429
2021-03-14 22:32:45 +01:00
Christian Schwarz
299808aaaf docs: 0.4 changelog: mention the 0.3.1 fix to prune:grid
fixes #400
refs 3a4e841c73
2021-03-14 21:13:26 +01:00
Christian Schwarz
7071adc774 docs: 0.4 changelog 2021-03-14 20:56:02 +01:00
InsanePrawn
8d678eed19 docs: add note about zfs recv -x mountpoint with ZVOLs
refs #430
2021-03-14 20:26:39 +01:00
Calistoc
ab7abd9686 status: show the current replication attempt's runtime 2021-03-14 18:24:28 +01:00
Christian Schwarz
a58ce74ed0 implement new 'zrepl status'
Primary goals:

- Scrollable output ( fixes #245 )
- Sending job signals from status view
- Filtering of output by filesystem

Implementation:

- original TUI framework: github.com/rivo/tview
- but: tview is quasi-unmaintained, didn't support some features
- => use fork https://gitlab.com/tslocum/cview
- however, don't buy into either too much to avoid lock-in

- instead: **port over the existing status UI drawing code
  and adjust it to produce strings instead of directly
  drawing into the termbox buffer**

Co-authored-by: Calistoc <calistoc@protonmail.com>
Co-authored-by: InsanePrawn <insane.prawny@gmail.com>

fixes #245
fixes #220
2021-03-14 18:24:25 +01:00
Calistoc
2c8c2cfa14 add signal 'snapshot', rename existing signal 'wakeup' to 'replication' 2021-03-14 18:16:23 +01:00
Christian Schwarz
0ceea1b792 replication: simplify parallel replication variables & expose them in config
closes #140
2021-03-14 17:30:10 +01:00
Christian Schwarz
07f2bfff6a build: bump release build Go version + quickcheck default to Go 1.16 2021-02-20 17:29:17 +01:00
Christian Schwarz
b9ee14f6d7 endpoint.ReceiverConfig: remove obsolete TODO comment 2021-02-20 17:20:58 +01:00
InsanePrawn
393fc10a69 [#285] support setting zfs send / recv flags in the config (send: -wLcepbS, recv: -ox)
Co-authored-by: Christian Schwarz <me@cschwarz.com>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>

closes #285
closes #276
closes #24
2021-02-20 17:20:45 +01:00
Christian Schwarz
1c937e58f7 zfs.NilBool: document its purpose and move it to its own package 'nodefault' 2021-02-20 17:04:57 +01:00
Christian Schwarz
70bbdfe760 zfs: ResumeToken: parse embedok, largeblockok, savedok if available
Developed for #285 but ultimately not used for it.
2021-02-20 17:04:57 +01:00
Christian Schwarz
efe7b17d21 Update to protobuf v1.25 and grpc 1.35; bump CI to go1.12
From:
github.com/golang/protobuf v1.3.2
google.golang.org/grpc v1.17.0

To:
github.com/golang/protobuf v1.4.3
google.golang.org/grpc v1.35.0
google.golang.org/protobuf v1.25.0

About the two protobuf packages:
https://developers.google.com/protocol-buffers/docs/reference/go/faq
> Version v1.4.0 and higher of github.com/golang/protobuf wrap the new
implementation and permit programs to adopt the new API incrementally. For
example, the well-known types defined in github.com/golang/protobuf/ptypes are
simply aliases of those defined in the newer module. Thus,
google.golang.org/protobuf/types/known/emptypb and
github.com/golang/protobuf/ptypes/empty may be used interchangeably.

Notable Code Changes in zrepl:
- generate protobufs now contain a mutex so we can't copy them by value
  anymore
- grpc.WithDialer is deprecated => use grpc.WithContextDialer instead

Go1.12 is now actually required by some of the dependencies.
2021-01-25 00:39:01 +01:00
Christian Schwarz
166e80bb96 lazy.sh: use 'go install' + import paths from build/tools.go 2021-01-25 00:16:01 +01:00
Mathias Fredriksson
6ac537210b daemon: avoid math/rand race by using global source
Unless we're using the global source for math/rand, (*rand.Rand).Read
should not be called concurrently. We seed the rng in daemon.Run to
avoid ambiguity or hiding global side effects inside packages.

closes #414
2021-01-25 00:16:01 +01:00
Christian Schwarz
596a39c0f5 bump golangci-lint to 1.35.2 and fix resulting lint errors
GO111MODULE=on golangci-lint run ./...
endpoint/endpoint.go:487:9: S1039: unnecessary use of fmt.Sprintf (gosimple)
                panic(fmt.Sprintf("ClientIdentityKey context value must be set"))
                      ^
platformtest/platformtest_ops.go:259:41: S1039: unnecessary use of fmt.Sprintf (gosimple)
                                return nil, &LineError{scan.Text(), fmt.Sprintf("unexpected tokens at EOL")}
                                                                    ^
platformtest/platformtest_ops.go:266:41: S1039: unnecessary use of fmt.Sprintf (gosimple)
                                return nil, &LineError{scan.Text(), fmt.Sprintf("unexpected tokens at EOL")}
                                                                    ^
util/optionaldeadline/optionaldeadline_test.go:97:50: SA1029: should not use built-in type string as key for value; define your own type to avoid collisions (staticcheck)
        pctx := context.WithValue(context.Background(), "key", "value")
                                                        ^
rpc/rpc_debug.go:8:5: var `debugEnabled` is unused (unused)
rpc/dataconn/dataconn_debug.go:8:5: var `debugEnabled` is unused (unused)
rpc/dataconn/frameconn/frameconn.go:42:9: S1039: unnecessary use of fmt.Sprintf (gosimple)
                panic(fmt.Sprintf("frame header is 8 bytes long"))
                      ^
platformtest/platformtest_ops.go:322:40: S1039: unnecessary use of fmt.Sprintf (gosimple)
                        return nil, &LineError{scan.Text(), fmt.Sprintf("unexpected tokens at EOL")}
2021-01-25 00:16:01 +01:00
Mathias Fredriksson
d118bcc717 rpc: fix data race in timeoutconn
- `timeoutconn` handles state, yet calls to Read/Write make a copy of
  that state (non-pointer receiver) so any outbound calls will not have
  the state updated

- Even without the copy issue, the renew methods can in edge cases set a
  new deadline _after_ DisableTimeouts have been called, consider the
  following racy behavior:
    1. `renewReadDeadline` is called, checks `renewDeadlinesDisabled`
       (not disabled)
    2. `DisableTimeouts` is called, sets `renewDeadlinesDisabled`
    3. `DisableTimeouts` invokes `c.SetDeadline`
    4. `renewReadDeadline` invokes `c.SetReadDeadline`

To fix the above, the `Conn` receiver was made to be a pointer
everywhere and access to renewDeadlinesDisabled is now guarded
by an RWMutex instead of using atomics.

closes #415
2021-01-25 00:16:01 +01:00
Mathias Fredriksson
48be4032a2 daemon: fix data race in snapjob pruner report
closes #416
2021-01-25 00:16:01 +01:00
Mathias Fredriksson
2d545c87cd rpc: fix read mutex unlock in dataconn stream
closes #413
2021-01-25 00:16:01 +01:00
Christian Schwarz
0b48ba54f8 replication/driver: simplify second-attempt step correlation code & fix statekeeping
Before this change, the step correlation code returned early in several cases:
- did not set f.planning.done in the cases where it was a no-op
- did not set f.planning.err in the cases where correlation did not
  succeed

Reported-by: InsanePrawn <insane.prawny@gmail.com>
2021-01-25 00:16:01 +01:00
Christian Schwarz
c3d87289bb [#388] util/semaphore: fix TestSemaphore test
fixes #388
2021-01-24 22:28:47 +01:00
Christian Schwarz
0d96627ffb [#388] trace: make WithTaskGroup actually concurrent 2021-01-24 22:28:47 +01:00
Christian Schwarz
61acc7494a [#388] endpoint: fix incorrect use of trace.WithTaskGroup in ListAbstractionsStreamed
Capturing behavior was broken.
2021-01-24 13:58:23 +01:00