Commit Graph

153 Commits

SHA1 Message Date
158d1175e3 rename SinglePruner to LocalPruner 2019-03-17 21:18:25 +01:00
b25da7b9b0 job: snap: comment fix 2019-03-17 21:07:42 +01:00
5cd2593f52 job: snap: workaround for replication cursor requirement 2019-03-17 21:07:01 +01:00
d8d9e34914 pruner: single: remove unused member considerSnapAtCursorReplicated 2019-03-17 20:57:34 +01:00
17818439a0 Merge branch 'problame/replication_refactor' into InsanePrawn-master 2019-03-17 17:33:51 +01:00
da3ba50a2c Merge remote-tracking branch 'origin/master' into problame/replication_refactor 2019-03-16 14:48:01 +01:00
4ee00091d6 pull job: support manual-only invocation 2019-03-16 14:24:05 +01:00
aff639e87a Merge remote-tracking branch 'origin/master' into InsanePrawn-master 2019-03-15 21:05:20 +01:00
a0f301d700 syslog logging: fix priority parsing + add test for default facility 2019-03-15 18:18:16 +01:00
fc311a9fd6 syslog logging: support setting facility in config 2019-03-15 17:55:11 +01:00
7584c66bdb pruner: remove retry handling + fix early give-up
Retry handling has been broken since the gRPC changes (wrong error classification).
Will come back at some point, hopefully by merging the replication
driver retry infrastructure.

However, the simpler architecture allows an easy fix for the problem
that the pruner practically gave up on the first error it encountered.

fixes #123
2019-03-13 21:04:39 +01:00
d78d20e2d0 pruner: skip placeholders + FSes without correspondents on source
fixes #126
2019-03-13 20:42:37 +01:00
c87759affe replication/driver: automatic retries on connectivity-related errors 2019-03-13 15:00:40 +01:00
07b43bffa4 replication: refactor driving logic (no more explicit state machine) 2019-03-13 15:00:40 +01:00
796c5ad42d rpc rewrite: control RPCs using gRPC + separate RPC for data transfer
transport/ssh: update go-netssh to new version
    => supports CloseWrite and Deadlines
    => build: require Go 1.11 (netssh requires it)
2019-03-13 13:53:48 +01:00
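The go-netssh update above is about connection semantics: CloseWrite lets the sender signal end-of-stream without tearing down the connection, and deadlines bound how long a stuck peer can block I/O. Below is a minimal sketch of that pattern using only the standard library's *net.TCPConn, not go-netssh itself; the address is a placeholder.

```go
// Minimal sketch of the half-close (CloseWrite) + deadline pattern the commit
// refers to, shown with the standard library's *net.TCPConn rather than
// go-netssh; the address is a placeholder.
package main

import (
	"io"
	"log"
	"net"
	"os"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "backup.example.com:2222", 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	tcp := conn.(*net.TCPConn)

	// Deadlines bound how long reads/writes may block, so a stuck peer
	// cannot hang us forever.
	if err := tcp.SetDeadline(time.Now().Add(30 * time.Second)); err != nil {
		log.Fatal(err)
	}

	// Send the payload, then half-close the write side so the peer sees EOF
	// but can still send its response back to us.
	if _, err := tcp.Write([]byte("request payload")); err != nil {
		log.Fatal(err)
	}
	if err := tcp.CloseWrite(); err != nil {
		log.Fatal(err)
	}

	// Read the peer's response until it closes its side.
	if _, err := io.Copy(os.Stdout, tcp); err != nil {
		log.Fatal(err)
	}
}
```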
160a3b6d32 more gofmt, drop snapjob.go_prefmt after it was accidentally added 2018-11-21 22:14:43 +01:00
3cef76d463 Refactor snapJob() to snapJobFromConfig() 2018-11-21 14:37:03 +01:00
e9564a7e5c Inlined a couple legacy leftover functions from the mode copypasta 2018-11-21 14:35:40 +01:00
b79ad3ddc3 Honour PruneKeepNotReplicated.KeepSnapshotAtCursor in SinglePrunerFactory 2018-11-21 14:17:38 +01:00
d0f898751f Gofmt snapjob.go 2018-11-21 14:02:21 +01:00
22d9830baa Fix prometheus with multiple jobs 2018-11-21 04:26:03 +01:00
e10dc129de Make getPruner() private 2018-11-21 03:39:03 +01:00
dd11fc96db Touchups in job.go 2018-11-21 03:27:39 +01:00
7de3c0a09a Removed the references to a pruning 'side' in the singlepruner logging code and the snapjob prometheus thing. 2018-11-21 02:52:33 +01:00
141e49727c Missed a last reference to tasks 2018-11-21 02:51:23 +01:00
442d61918b remove most of the watchdog machinery 2018-11-21 02:42:13 +01:00
58dcc07430 Added SnapJobStatus 2018-11-21 02:08:39 +01:00
19d0916e34 remove snapMode, rename snap_ActiveSide to SnapJob 2018-11-21 01:54:56 +01:00
1265cc7934 pruned unused lines and comments ;) 2018-11-21 01:34:50 +01:00
3d2688e959 Ugly but working initial snapjob implementation 2018-11-20 19:30:15 +01:00
3472145df6 pruner + proto change: better handling of missing replication cursor
- don't treat missing replication cursor as an error in protocol
- treat it as a per-fs planning error instead
2018-11-16 12:21:54 +01:00
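Commit 3472145df6 above moves the "missing replication cursor" case from a protocol-level error to a planning error recorded on the affected filesystem, so the rest of the prune run can proceed. A hedged sketch of that shape, with hypothetical names rather than zrepl's actual types:

```go
// Hedged sketch of the "per-fs planning error" idea from commit 3472145df6;
// all identifiers here are hypothetical, not zrepl's actual types.
package main

import (
	"errors"
	"fmt"
)

type fsPlan struct {
	Name    string
	PlanErr error // recorded here instead of failing the whole prune run
}

var errNoReplicationCursor = errors.New("filesystem has no replication cursor")

// lookupCursor is a stand-in for asking the source side for the cursor.
func lookupCursor(fs string) (string, error) {
	if fs == "tank/new" {
		return "", errNoReplicationCursor
	}
	return "some-cursor-bookmark", nil
}

func plan(filesystems []string) []fsPlan {
	plans := make([]fsPlan, 0, len(filesystems))
	for _, fs := range filesystems {
		p := fsPlan{Name: fs}
		if _, err := lookupCursor(fs); err != nil {
			// Record the error on this filesystem and keep planning the others,
			// rather than treating it as a protocol error that aborts everything.
			p.PlanErr = err
		}
		plans = append(plans, p)
	}
	return plans
}

func main() {
	for _, p := range plan([]string{"tank/data", "tank/new"}) {
		fmt.Printf("%s: err=%v\n", p.Name, p.PlanErr)
	}
}
```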
5e1ea21f85 pruning: add 'Negate' option to KeepRegex and expose it in config 2018-11-16 12:21:54 +01:00
98bc8d1717 daemon/job: explicit notice of ZREPL_JOB_WATCHDOG_TIMEOUT environment variable on cancellation 2018-10-22 11:03:31 +02:00
94427d334b replication + pruner + watchdog: adjust timeouts based on practical experience 2018-10-21 18:37:57 +02:00
b2844569c8 replication: rewrite error handling + simplify state machines
* Remove explicit state machine code for all but replication.Replication
* Introduce explicit error types that satisfy interfaces which provide
  sufficient information for replication.Replication to make intelligent
  retry + queuing decisions

  * Temporary()
  * LocalToFS()

* Remove the queue and replace it with a simple array that we sort each
  time (yay no generics :( )
2018-10-21 18:37:57 +02:00
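The error types mentioned in b2844569c8 expose just enough information (Temporary(), LocalToFS()) for the driver to choose between retrying the run, failing a single filesystem, or giving up. A rough sketch of such a classification in Go, with hypothetical names rather than zrepl's actual code:

```go
// Rough sketch of the error-classification idea from commit b2844569c8;
// interface and type names are hypothetical, not zrepl's actual code.
package main

import (
	"fmt"
	"net"
)

// retryableError mirrors the two methods named in the commit message.
type retryableError interface {
	error
	Temporary() bool // worth retrying, e.g. a connectivity hiccup
	LocalToFS() bool // scoped to one filesystem, not the whole replication run
}

// netError wraps a network error: possibly temporary, never per-filesystem.
type netError struct{ err net.Error }

func (e netError) Error() string   { return e.err.Error() }
func (e netError) Temporary() bool { return e.err.Temporary() }
func (e netError) LocalToFS() bool { return false }

// planError is a per-filesystem planning failure: no point in a global retry.
type planError struct{ fs, msg string }

func (e planError) Error() string   { return e.fs + ": " + e.msg }
func (e planError) Temporary() bool { return false }
func (e planError) LocalToFS() bool { return true }

// decide shows how a driver could branch on the classification.
func decide(err error) string {
	if c, ok := err.(retryableError); ok {
		switch {
		case c.Temporary():
			return "retry the whole run later"
		case c.LocalToFS():
			return "mark this filesystem failed, continue with the others"
		}
	}
	return "give up"
}

func main() {
	fmt.Println(decide(planError{fs: "tank/data", msg: "no common snapshot"}))
}
```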
fffda09f67 replication + pruner: progress markers during planning 2018-10-21 17:50:08 +02:00
5ec7a5c078 pruner: report: fix broken checks for state (wrong precedence rules) 2018-10-21 13:37:08 +02:00
190c7270d9 daemon/active + watchdog: simplify control flow using explicit ActiveSideState 2018-10-21 12:53:34 +02:00
f704b28cad daemon/job: track active side state explicitly 2018-10-21 12:52:48 +02:00
5efeec1819 daemon/control: stop logging status endpoint requests 2018-10-20 12:50:31 +02:00
438f950be3 pruner: improve cancellation + error handling strategy
Pruner now backs off as soon as there is an error, making that error the
Error field in the pruner report.
The error is also stored in the specific *fs that failed, and we
maintain an error counter per *fs to de-prioritize those fs that failed.
Like with replication, the de-prioritization on errors is to avoid 'getting stuck' with an individual filesystem until the watchdog hits.
2018-10-20 12:46:43 +02:00
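Commit 438f950be3 describes the bookkeeping behind that strategy: remember the last error per filesystem and count failures so that repeatedly failing filesystems are tried last. A hedged sketch of that idea (hypothetical names, not zrepl's code), using the same "sort a plain slice each time" approach the replication rewrite mentions:

```go
// Hedged sketch of per-fs error bookkeeping and de-prioritization as described
// in commit 438f950be3; names are hypothetical, not zrepl's actual code.
package main

import (
	"errors"
	"fmt"
	"sort"
)

type fsState struct {
	Name     string
	LastErr  error // most recent error for this filesystem (surfaces in the report)
	ErrCount int   // how often this filesystem has failed so far
}

// next sorts pending filesystems so the ones with the fewest errors are tried
// first; repeated failures push a filesystem to the back of the line.
func next(pending []*fsState) []*fsState {
	sort.SliceStable(pending, func(i, j int) bool {
		return pending[i].ErrCount < pending[j].ErrCount
	})
	return pending
}

func main() {
	a := &fsState{Name: "tank/a"}
	b := &fsState{Name: "tank/b"}

	// tank/a fails once: remember the error and bump its counter.
	a.LastErr = errors.New("destroy failed: dataset is busy")
	a.ErrCount++

	for _, fs := range next([]*fsState{a, b}) {
		fmt.Printf("%s (errors so far: %d)\n", fs.Name, fs.ErrCount)
	}
	// tank/b is tried first, tank/a is de-prioritized.
}
```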
50c1549865 pruner: fixup 69bfcb7bed: add missing progress updates for watchdog 2018-10-20 10:58:22 +02:00
f535b2327f pruner: use envconst to configure retry interval 2018-10-19 17:23:00 +02:00
e63ac7d1bb pruner: log transitions to error state + log info to confirm pruning is done in active job 2018-10-19 17:23:00 +02:00
359ab2ca0c pruner: fail on every error that is not net.OpError.Temporary() 2018-10-19 17:23:00 +02:00
69bfcb7bed daemon/active: implement watchdog to handle stuck replication / pruners
ActiveSide.do() can only run sequentially, i.e. we cannot run
replication and pruning in parallel. Why?

* go-streamrpc only allows one active request at a time
(this is bad design and should be fixed at some point)
* replication and pruning are implemented independently, but work on the
same resources (snapshots)

A: pruning might destroy a snapshot that is planned to be replicated
B: replication might replicate snapshots that should be pruned

We do not have any resource management / locking for A and B, but we
have a use case where users don't want their machine to fill up with
snapshots if replication does not work.
That means we _have_ to run the pruners.

A further complication is that we cannot just cancel the replication
context after a timeout and move on to the pruner: it could be initial
replication and we don't know how long it will take.
(And we don't have resumable send & recv yet).

With the previous commits, we can implement the watchdog using context
cancellation.
Note that the 'MadeProgress()' calls can only be placed right before
non-error state transitions. Otherwise, we could end up in a live-lock.
2018-10-19 17:23:00 +02:00
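Commit 69bfcb7bed above describes the watchdog mechanism itself: workers call MadeProgress() right before non-error state transitions, and if no progress is observed within a timeout, the job's context is cancelled (the timeout is surfaced via ZREPL_JOB_WATCHDOG_TIMEOUT per 98bc8d1717). A compact sketch of that pattern, with hypothetical names rather than zrepl's actual code:

```go
// Compact sketch of a context-cancelling watchdog as described in commit
// 69bfcb7bed; type and function names are hypothetical, not zrepl's code.
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// progressToken is handed to replication/pruning; they call MadeProgress()
// right before non-error state transitions.
type progressToken struct{ ticks int64 }

func (p *progressToken) MadeProgress() { atomic.AddInt64(&p.ticks, 1) }

// watchdog cancels the job's context if no MadeProgress() call was observed
// within one timeout interval.
func watchdog(ctx context.Context, cancel context.CancelFunc, p *progressToken, timeout time.Duration) {
	last := atomic.LoadInt64(&p.ticks)
	ticker := time.NewTicker(timeout)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cur := atomic.LoadInt64(&p.ticks)
			if cur == last {
				cancel() // stuck: no progress during the last interval
				return
			}
			last = cur
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	var p progressToken
	go watchdog(ctx, cancel, &p, 100*time.Millisecond)

	p.MadeProgress() // pretend some replication step completed
	<-ctx.Done()     // with no further progress, the watchdog eventually cancels
	fmt.Println("watchdog fired:", ctx.Err())
}
```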
ace4f3d892 transport/tlsclientauth: handle cancellation of dialCtx 2018-10-19 16:08:20 +02:00
82f0060eec Revert "daemon/job/active: push mode: awful hack for handling of concurrent snapshots + stale remote operation"
This reverts commit aeb87ffbcf.
2018-10-19 09:35:30 +02:00
a5376913fd daemon/job: fix buildJob returning nil error on job build error
Would show up as an ugly nil-pointer-deref panic later during daemon
startup
2018-10-18 16:19:27 +02:00
2c994e879c filters: fix broken error message
reported by go vet on go 1.11
2018-10-13 17:17:34 +02:00