zrepl

mirror of https://github.com/zrepl/zrepl.git synced 2024-11-26 18:34:43 +01:00

Author	SHA1	Message	Date
InsanePrawn	3d2688e959	Ugly but working initial snapjob implementation	2018-11-20 19:30:15 +01:00
Christian Schwarz	7ab51fad0d	zfs: add 'received' property source, handle 'any' source correctly and use 'any' for placeholder FS detection we want was first noticed in zfs 0.8rc1 Upstream doc PR: https://github.com/zfsonlinux/zfs/pull/8134	2018-11-16 13:07:13 +01:00
Christian Schwarz	3472145df6	pruner + proto change: better handling of missing replication cursor - don't treat missing replication cursor as an error in protocol - treat it as a per-fs planning error instead	2018-11-16 12:21:54 +01:00
Christian Schwarz	5e1ea21f85	pruning: add 'Negate' option to KeepRegex and expose it in config	2018-11-16 12:21:54 +01:00
Christian Schwarz	2db3977408	cli: add 'test placeholder' subcommand for placeholder debugging	2018-11-16 12:21:54 +01:00
Christian Schwarz	ca6d5d3bb5	build: Travis CI configuration	2018-11-16 12:10:58 +01:00
Christian Schwarz	163c2bc533	docs: update requirements.txt	2018-11-16 12:10:58 +01:00
Christian Schwarz	dd286aa12e	client: fix status bytes per second measurement (still far from perfect, but better than incorrect values)	2018-11-05 01:37:51 +01:00
Christian Schwarz	80babe3ab4	docs/README: update package hierarchy overview	2018-10-26 22:05:57 +02:00
Christian Schwarz	ca0cab0a15	docs/tutorial: fix headlines	2018-10-26 21:52:49 +02:00
JMoVS	ad8be226fd	fix small typo	2018-10-22 11:32:37 +02:00
Christian Schwarz	9b3e5c38e2	docs: fix changelog + invocations of wakeup subcommand	2018-10-22 11:27:00 +02:00
Christian Schwarz	7e1c5f5d1f	docs: discourage use of ssh+stdinserver transport due to inferior error handling	2018-10-22 11:25:16 +02:00
Christian Schwarz	98bc8d1717	daemon/job: explicit notice of ZREPL_JOB_WATCHDOG_TIMEOUT environment variable on cancellation	2018-10-22 11:03:31 +02:00
Christian Schwarz	2889a5d5ff	client/status: current bytes/second + spinning progress bar	2018-10-21 23:15:21 +02:00
Christian Schwarz	0b8c19c620	docs/tutorial: switch to push setup & use mutual TLS (2 machines)	2018-10-21 22:20:35 +02:00
Christian Schwarz	a62b475f46	docs/transport/tls: document self-signed certs procedure for 2-machine setup	2018-10-21 22:20:07 +02:00
Christian Schwarz	1691839c6b	replication: handle context cancellation errors as GlobalError	2018-10-21 19:06:35 +02:00
Christian Schwarz	36265ff349	fixup `438f950be3`: forgotten ErrorCount in printf	2018-10-21 18:37:57 +02:00
Christian Schwarz	94427d334b	replication + pruner + watchdog: adjust timeouts based on practical experience	2018-10-21 18:37:57 +02:00
Christian Schwarz	b2844569c8	replication: rewrite error handling + simplify state machines. Remove explicit state machine code for all but replication.Replication; introduce explicit error types that satisfy interfaces (Temporary(), LocalToFS()) which provide sufficient information for replication.Replication to make intelligent retry + queuing decisions; remove the queue and replace it with a simple array that we sort each time (yay no generics :( )	2018-10-21 18:37:57 +02:00
Christian Schwarz	ae5e60b1ae	client/status: display problems as wrapped + indented if they do not fit the current line	2018-10-21 17:50:08 +02:00
Christian Schwarz	fffda09f67	replication + pruner: progress markers during planning	2018-10-21 17:50:08 +02:00
Christian Schwarz	5ec7a5c078	pruner: report: fix broken checks for state (wrong precedence rules)	2018-10-21 13:37:08 +02:00
Christian Schwarz	190c7270d9	daemon/active + watchdog: simplify control flow using explicit ActiveSideState	2018-10-21 12:53:34 +02:00
Christian Schwarz	f704b28cad	daemon/job: track active side state explicitly	2018-10-21 12:52:48 +02:00
Christian Schwarz	5efeec1819	daemon/control: stop logging status endpoint requests	2018-10-20 12:50:31 +02:00
Christian Schwarz	438f950be3	pruner: improve cancellation + error handling strategy. Pruner now backs off as soon as there is an error, making that error the Error field in the pruner report. The error is also stored in the specific fs that failed, and we maintain an error counter per fs to de-prioritize those fs that failed. Like with replication, the de-prioritization on errors is to avoid 'getting stuck' with an individual filesystem until the watchdog hits.	2018-10-20 12:46:43 +02:00
Christian Schwarz	50c1549865	pruner: fixup `69bfcb7bed`: add missing progress updates for watchdog	2018-10-20 10:58:22 +02:00
Christian Schwarz	6e21a67473	build: detect if generate made things dirty and break release build in that case	2018-10-19 17:52:49 +02:00
Christian Schwarz	17ab39d646	build: add missing subpackages	2018-10-19 17:23:00 +02:00
Christian Schwarz	44d2057df8	client/configcheck: check logging config	2018-10-19 17:23:00 +02:00
Christian Schwarz	3e359aaeda	zfs: fixup `6fcf0635a5`: broken test	2018-10-19 17:23:00 +02:00
Christian Schwarz	8cfeeee23a	config: fixup `1f072936c5`: broken test	2018-10-19 17:23:00 +02:00
Christian Schwarz	f535b2327f	pruner: use envconst to configure retry interval	2018-10-19 17:23:00 +02:00
Christian Schwarz	e63ac7d1bb	pruner: log transitions to error state + log info to confirm pruning is done in active job	2018-10-19 17:23:00 +02:00
Christian Schwarz	359ab2ca0c	pruner: fail on every error that is not net.OpError.Temporary()	2018-10-19 17:23:00 +02:00
Christian Schwarz	45373168ad	replication: fix retry wait behavior. An fsrep.Replication is either Ready, Retry or in a terminal state. The queue prefers Ready over Retry: Ready is sorted by nextStepDate to progress evenly. Retry is sorted by error count, to de-prioritize filesystems that fail often. This way we don't get stuck with individual filesystems and lose other working filesystems to the watchdog. fsrep.Replication no longer blocks in Retry state, we have replication.WorkingWait for that.	2018-10-19 17:23:00 +02:00
Christian Schwarz	69bfcb7bed	daemon/active: implement watchdog to handle stuck replication / pruners. ActiveSide.do() can only run sequentially, i.e. we cannot run replication and pruning in parallel. Why? (1) go-streamrpc only allows one active request at a time (this is bad design and should be fixed at some point). (2) Replication and pruning are implemented independently, but work on the same resources (snapshots): (A) pruning might destroy a snapshot that is planned to be replicated; (B) replication might replicate snapshots that should be pruned. We do not have any resource management / locking for A and B, but we have a use case where users don't want their machine to fill up with snapshots if replication does not work. That means we _have_ to run the pruners. A further complication is that we cannot just cancel the replication context after a timeout and move on to the pruner: it could be initial replication and we don't know how long it will take. (And we don't have resumable send & recv yet). With the previous commits, we can implement the watchdog using context cancellation. Note that the 'MadeProgress()' calls can only be placed right before a non-error state transition. Otherwise, we could end up in a live-lock.	2018-10-19 17:23:00 +02:00
Christian Schwarz	4ede99b08c	replication: simpler PermanentError state + handle context cancellation	2018-10-19 17:23:00 +02:00
Christian Schwarz	814fec60f0	endpoint + zfs: context cancellation of util.IOCommand instances (send & recv for now)	2018-10-19 16:12:21 +02:00
Christian Schwarz	ace4f3d892	transport/tlsclientauth: handle cancellation of dialCtx	2018-10-19 16:08:20 +02:00
Christian Schwarz	82f0060eec	Revert "daemon/job/active: push mode: awful hack for handling of concurrent snapshots + stale remote operation" This reverts commit `aeb87ffbcf`.	2018-10-19 09:35:30 +02:00
Christian Schwarz	53ac853cb4	client/configcheck: build jobs for checking config and allow selecting what to print	2018-10-18 16:35:29 +02:00
Christian Schwarz	a5376913fd	daemon/job: fix buildJob returning nil error on job build error. Would show up as ugly nil-pointer-deref panic later during daemon startup	2018-10-18 16:19:27 +02:00
Christian Schwarz	6fcf0635a5	zfs: generalize dry send information for normal sends and with resume token. This is in preparation for resumable send & recv, thus we just don't use the ResumeToken field for the time being.	2018-10-18 15:56:28 +02:00
Christian Schwarz	1f072936c5	fix default stdout outlet	2018-10-18 15:48:24 +02:00
Christian Schwarz	3c06235dca	replication + zfs: leave From field instead of To field empty for initial send	2018-10-14 13:06:23 +02:00
Christian Schwarz	f13749380d	docs: add warnings of changing semantics for manually created snapshots in 0.1	2018-10-13 18:34:37 +02:00
Christian Schwarz	eadb6f823d	docs: remove unreleased annotation from changelog for 0.1	2018-10-13 17:35:38 +02:00

826 Commits