zrepl

mirror of https://github.com/zrepl/zrepl.git synced 2024-12-11 01:31:02 +01:00

Author	SHA1	Message	Date
Christian Schwarz	e63ac7d1bb	pruner: log transitions to error state + log info to confirm pruning is done in active job	2018-10-19 17:23:00 +02:00
Christian Schwarz	359ab2ca0c	pruner: fail on every error that is not net.OpError.Temporary()	2018-10-19 17:23:00 +02:00
Christian Schwarz	45373168ad	replication: fix retry wait behavior An fsrep.Replication is either Ready, Retry or in a terminal state. The queue prefers Ready over Retry: Ready is sorted by nextStepDate to progress evenly. Retry is sorted by error count, to de-prioritize filesystems that fail often. This way we don't get stuck with individual filesystems and lose other working filesystems to the watchdog. fsrep.Replication no longer blocks in Retry state, we have replication.WorkingWait for that.	2018-10-19 17:23:00 +02:00
Christian Schwarz	69bfcb7bed	daemon/active: implement watchdog to handle stuck replication / pruners ActiveSide.do() can only run sequentially, i.e. we cannot run replication and pruning in parallel. Why? * go-streamrpc only allows one active request at a time (this is bad design and should be fixed at some point) * replication and pruning are implemented independently, but work on the same resources (snapshots) A: pruning might destroy a snapshot that is planned to be replicated B: replication might replicate snapshots that should be pruned We do not have any resource management / locking for A and B, but we have a use case where users don't want their machine to fill up with snapshots if replication does not work. That means we _have_ to run the pruners. A further complication is that we cannot just cancel the replication context after a timeout and move on to the pruner: it could be initial replication and we don't know how long it will take. (And we don't have resumable send & recv yet). With the previous commits, we can implement the watchdog using context cancellation. Note that the 'MadeProgress()' calls can only be placed right before a non-error state transition. Otherwise, we could end up in a live-lock.	2018-10-19 17:23:00 +02:00
Christian Schwarz	4ede99b08c	replication: simpler PermanentError state + handle context cancellation	2018-10-19 17:23:00 +02:00
Christian Schwarz	814fec60f0	endpoint + zfs: context cancellation of util.IOCommand instances (send & recv for now)	2018-10-19 16:12:21 +02:00
Christian Schwarz	ace4f3d892	transport/tlsclientauth: handle cancellation of dialCtx	2018-10-19 16:08:20 +02:00
Christian Schwarz	82f0060eec	Revert "daemon/job/active: push mode: awful hack for handling of concurrent snapshots + stale remote operation" This reverts commit `aeb87ffbcf`.	2018-10-19 09:35:30 +02:00
Christian Schwarz	53ac853cb4	client/configcheck: build jobs for checking config and allow selecting what to print	2018-10-18 16:35:29 +02:00
Christian Schwarz	a5376913fd	daemon/job: fix buildJob returning nil error on job build error Would show up as ugly nil-pointer-deref panic later during daemon startup	2018-10-18 16:19:27 +02:00
Christian Schwarz	6fcf0635a5	zfs: generalize dry send information for normal sends and with resume token This is in preparation for resumable send & recv, thus we just don't use the ResumeToken field for the time being.	2018-10-18 15:56:28 +02:00
Christian Schwarz	1f072936c5	fix default stdout outlet	2018-10-18 15:48:24 +02:00
Christian Schwarz	3c06235dca	replication + zfs: leave From field instead of To field empty for initial send	2018-10-14 13:06:23 +02:00
Christian Schwarz	f13749380d	docs: add warnings of changing semantics for manually created snapshots in 0.1	2018-10-13 18:34:37 +02:00
Christian Schwarz	eadb6f823d	docs: remove unreleased annotation from changelog for 0.1	2018-10-13 17:35:38 +02:00
Christian Schwarz	e7497ab3d0	LICENSE + docs: adjust copyright	2018-10-13 17:34:05 +02:00
Christian Schwarz	59a4e2db5f	replication: regenerate pdu.pb with new protoc-gen-go	2018-10-13 17:23:39 +02:00
Christian Schwarz	2c994e879c	filters: fix broken error message reported by go vet on go 1.11	2018-10-13 17:17:34 +02:00
Christian Schwarz	de2768c91d	build: produce darwin binaries	2018-10-13 16:57:25 +02:00
Christian Schwarz	fb6f58b735	client/status: switch to package tcell which works with solaris Can't cross compile Solaris binaries though: tcell for Solaris needs cgo.	2018-10-13 16:57:05 +02:00
Christian Schwarz	be4e244f1f	build: fixup `af3d96dab8`: syntax error in builddep install	2018-10-13 16:29:33 +02:00
Christian Schwarz	074f989547	Merge branch 'replication_rewrite' (in fact it's a 90% rewrite)	2018-10-13 16:26:23 +02:00
Christian Schwarz	87c8957889	build: fixup `be962998ba`: broken makefile	2018-10-13 16:22:19 +02:00
Christian Schwarz	f6cf23779f	docs: Remove stale TIP for dry-run zrepl test subcommand. Won't make it to 0.1	2018-10-13 16:22:19 +02:00
Christian Schwarz	92a1a6d2ca	docs: fix wrong subcommand for configcheck	2018-10-13 16:22:19 +02:00
Christian Schwarz	63169c51b7	add 'test filesystems' subcommand for testing filesystem filters	2018-10-13 16:22:19 +02:00
Christian Schwarz	5c3c83b2cb	cli: refactor to allow definition of subcommands next to their implementation	2018-10-13 16:22:19 +02:00
Christian Schwarz	aeb87ffbcf	daemon/job/active: push mode: awful hack for handling of concurrent snapshots + stale remote operation We have the problem that there are legitimate use cases where a user does not want their machine to fill up with snapshots, even if it means unreplicated snapshots must be destroyed. This can be expressed by not configuring the keep rule `not_replicated` for the snapshot-creating side. This commit only addresses push mode because we don't support pruning in the source job. We advise users in the docs to use push mode if they have the above use case, so this is fine - at least for 0.1. Ideally, the replication.Replication would communicate to the pruner which snapshots are currently part of the replication plan, and then we'd need some conflict resolution to determine whether it's more important to destroy the snapshots or to replicate them (destroy should win?). However, we don't have the infrastructure for this yet (we could parse the replication report, but that's just ugly). And we want to get 0.1 out, so showtime for a dirty hack: We start replication, and ideally, replication and pruning are done before new snapshots have been taken. If so: great. However, what happens if snapshots have been taken and we are not done with replication and / or pruning? * If replication is making progress according to its state, let it run. This covers the important situation of initial replication, where replication may easily take longer than a single snapshotting interval. * If replication is in an error state, cancel it through context cancellation. * As with the pruner below, the main problem here is that status output will only contain "context cancelled" after the cancellation, instead of showing the reason why it was cancelled. Not nice, but oh well, the logs provide enough detail for this niche situation... * If we are past replication, we're still pruning * Leave the local (send-side) pruning alone. 
Again, we only implement this hack for push, so we know sender is local, and it will only fail hard, not retry. * If the remote (receiver-side) pruner is in an error state, cancel it through context cancellation. * Otherwise, let it run. Note that every time we "let it run", we tolerate a temporary excess of snapshots, but given sufficiently aggressive timeouts and the assumption that the snapshot interval is much greater than the timeouts, this is not a significant problem in practice.	2018-10-12 22:47:06 +02:00
Christian Schwarz	a85abe8bae	client/status: improve hiding of data if current state makes it obsolete	2018-10-12 22:47:06 +02:00
Christian Schwarz	d584e1ac54	daemon/job/active: fix race in updateTasks If concurrent updates strictly modify different members of the tasks struct, the copying + lock-drop still constitutes a race condition: The last updater always wins and sets tasks to its copy + changes. This eliminates the other updater's changes.	2018-10-12 22:15:07 +02:00
Christian Schwarz	af3d96dab8	use enumer generate tool for state strings	2018-10-12 22:10:49 +02:00
Christian Schwarz	89e0103abd	move wakeup subcommand into signal subcommand and add reset subcommand	2018-10-12 20:50:56 +02:00
Christian Schwarz	025fbda984	client/status: only show progress bar in non-planning states	2018-10-12 16:00:37 +02:00
Christian Schwarz	9bb7b19c93	pruner: handle replication cursor being older than any snapshot correctly	2018-10-12 15:29:07 +02:00
Christian Schwarz	cb83a26c90	replication: wakeup + retry handling: make wakeups work in retry wait states - handle wakeups in Planning state - fsrep.Replication yields immediately in RetryWait - once the queue only contains fsrep.Replication in retryWait: transition replication.Replication into WorkingWait state - handle wakeups in WorkingWait state, too	2018-10-12 13:12:28 +02:00
Christian Schwarz	d17ecc3b5c	replication/fsrep: report Pending[0] problem as fsrep problem in RetryWait state	2018-10-12 12:45:37 +02:00
Christian Schwarz	f9d24d15ed	move wakeup mechanism into separate package	2018-10-12 12:44:40 +02:00
Christian Schwarz	1fb59c953a	implement transport protocol handshake (even before streamrpc handshake)	2018-10-11 21:21:46 +02:00
Christian Schwarz	be962998ba	move serve and connecter into transports package	2018-10-11 21:21:46 +02:00
Christian Schwarz	a97684923a	refactor: socketpair into utils package (useful elsewhere)	2018-10-11 21:17:43 +02:00
Christian Schwarz	1643198713	docs: reflect changes in replication_rewrite branch	2018-10-11 18:03:18 +02:00
Christian Schwarz	125b561df3	rename root_dataset to root_fs for receiving-side jobs	2018-10-11 18:03:18 +02:00
Christian Schwarz	0c3a694470	fixup: add test for global section	2018-10-11 17:52:19 +02:00
Christian Schwarz	525a875825	main: better descriptions for root subcommands	2018-10-11 17:52:19 +02:00
Christian Schwarz	4e16952ad9	snapshotting: support 'periodic' and 'manual' mode 1. Change config format to support multiple types of snapshotting modes. 2. Implement a hacky way to support periodic or completely manual snapshots. In manual mode, the user has to trigger replication using the wakeup mechanism after they took snapshots using their own tooling. As indicated by the comment, a more general solution would be desirable, but we want to get the release out and 'manual' mode is a feature that some people requested...	2018-10-11 15:59:23 +02:00
Christian Schwarz	14febbeb4c	config: skip files that do not end in .yml	2018-10-11 13:09:04 +02:00
Christian Schwarz	93c90cd705	pruning: fix YAML representation of PruneKeepRegex	2018-10-11 13:07:52 +02:00
Christian Schwarz	01668a989e	transport local: named listeners + struct renaming	2018-10-11 13:06:47 +02:00
Christian Schwarz	976c1f3929	util.IOCommand: add stderr logging for unexpected crashes in calls to ProcessState.Sys() Crashes observed on a FreeBSD 11.2 system 2018-09-27T05:08:39+02:00 [INFO][csnas]: start replication invocation="62" 2018-09-27T05:08:39+02:00 [INFO][csnas][repl]: start planning invocation="62" 2018-09-27T05:08:58+02:00 [INFO][csnas][repl]: start working invocation="62" 2018-09-27T05:09:57+02:00 [INFO][csnas]: start pruning sender invocation="62" 2018-09-27T05:10:11+02:00 [INFO][csnas]: start pruning receiver invocation="62" 2018-09-27T05:10:32+02:00 [INFO][csnas]: wait for wakeups 2018-09-27T06:08:39+02:00 [INFO][csnas]: start replication invocation="63" 2018-09-27T06:08:39+02:00 [INFO][csnas][repl]: start planning invocation="63" 2018-09-27T06:08:44+02:00 [INFO][csnas][repl]: start working invocation="63" 2018-09-27T06:08:49+02:00 [ERRO][csnas][repl]: receive request failed (might also be error on sender) invocation="63" filesystem="<REDACTED>" err="concurrent use of RPC connection" step="<REDACTED>(@zrepl_20180927_030838_000 => @zrepl_20180927_040835_000)" errType="errors.errorString" panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7d484b] goroutine 3938545 [running]: os.(ProcessState).os.sys(...) /usr/lib/golang/src/os/exec_posix.go:78 os.(ProcessState).Sys(...) 
/usr/lib/golang/src/os/exec.go:157 github.com/zrepl/zrepl/util.(IOCommand).doWait(0xc4201b2d80, 0xc420070060, 0xc420070060) /go/github.com/zrepl/zrepl/util/iocommand.go:91 +0x4b github.com/zrepl/zrepl/util.(IOCommand).Read(0xc4201b2d80, 0xc420790000, 0x8000, 0x8000, 0x800c76d90, 0x0, 0xc420067c10) /go/github.com/zrepl/zrepl/util/iocommand.go:82 +0xe4 github.com/zrepl/zrepl/util.(ByteCounterReader).Read(0xc4202dc580, 0xc420790000, 0x8000, 0x8000, 0x8c6900, 0x7cb201, 0xc420790000) /go/github.com/zrepl/zrepl/util/io.go:118 +0x51 github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(chunkBuffer).readChunk(0xc42057e3c0, 0x800d1bbf0, 0xc4202dc580, 0xc420790000, 0x8000, 0x8000) /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/stream.go:58 +0x5e github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.writeStream(0xa04620, 0xc4204a9c20, 0x9fe340, 0xc4200d6380, 0x800d1bbf0, 0xc4202dc580, 0x8000, 0xc42000e000, 0x900420) /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/stream.go:101 +0x1ce github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(Conn).send(0xc4200d6380, 0xa04620, 0xc4204a9c20, 0xc42057e2c0, 0xc42013d570, 0x800d1bbf0, 0xc4202dc580, 0x0, 0x0) /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/main.go:374 +0x557 github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(Client).RequestReply.func1(0x999741, 0x7, 0xc4200d6380, 0xa04620, 0xc4204a9c20, 0xc42013d570, 0xa00aa0, 0xc4202dc580, 0xc420516480) /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/client.go:169 +0x148 created by github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(Client).RequestReply /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/client.go:167 +0x227	2018-09-27 12:06:59 +02:00
Christian Schwarz	75e42fd860	pruner: implement Report method + display in status command	2018-09-24 19:25:40 +02:00


1041 Commits