Commit Graph

606 Commits

Author SHA1 Message Date
Christian Schwarz
d17ecc3b5c replication/fsrep: report Pending[0] problem as fsrep problem in RetryWait state 2018-10-12 12:45:37 +02:00
Christian Schwarz
f9d24d15ed move wakup mechanism into separate package 2018-10-12 12:44:40 +02:00
Christian Schwarz
1fb59c953a implement transport protocol handshake (even before streamrpc handshake) 2018-10-11 21:21:46 +02:00
Christian Schwarz
be962998ba move serve and connecter into transports package 2018-10-11 21:21:46 +02:00
Christian Schwarz
a97684923a refactor: socketpair into utils package (useful elsewhere) 2018-10-11 21:17:43 +02:00
Christian Schwarz
1643198713 docs: reflect changes in replication_rewrite branch 2018-10-11 18:03:18 +02:00
Christian Schwarz
125b561df3 rename root_dataset to root_fs for receiving-side jobs 2018-10-11 18:03:18 +02:00
Christian Schwarz
0c3a694470 fixup: add test for global section 2018-10-11 17:52:19 +02:00
Christian Schwarz
525a875825 main: better descriptions for root subcommands 2018-10-11 17:52:19 +02:00
Christian Schwarz
4e16952ad9 snapshotting: support 'periodic' and 'manual' mode
1. Change config format to support multiple types
   of snapshotting modes.
2. Implement a hacky way to support periodic or completely
   manual snaphots.

In manual mode, the user has to trigger replication using the wakeup
mechanism after they took snapshots using their own tooling.

As indicated by the comment, a more general solution would be desirable,
but we want to get the release out and 'manual' mode is a feature that
some people requested...
2018-10-11 15:59:23 +02:00
Christian Schwarz
14febbeb4c config: skip files that do not end in .yml 2018-10-11 13:09:04 +02:00
Christian Schwarz
93c90cd705 pruning: fix YAML representation of PruneKeepRegex 2018-10-11 13:07:52 +02:00
Christian Schwarz
01668a989e transport local: named listeners + struct renaming 2018-10-11 13:06:47 +02:00
Christian Schwarz
976c1f3929 util.IOCommand: add stderr logging for unexpected crashes in calls to ProcessState.Sys()
Crashes observed on a FreeBSD 11.2 system

2018-09-27T05:08:39+02:00 [INFO][csnas]: start replication invocation="62"
2018-09-27T05:08:39+02:00 [INFO][csnas][repl]: start planning invocation="62"
2018-09-27T05:08:58+02:00 [INFO][csnas][repl]: start working invocation="62"
2018-09-27T05:09:57+02:00 [INFO][csnas]: start pruning sender invocation="62"
2018-09-27T05:10:11+02:00 [INFO][csnas]: start pruning receiver invocation="62"
2018-09-27T05:10:32+02:00 [INFO][csnas]: wait for wakeups
2018-09-27T06:08:39+02:00 [INFO][csnas]: start replication invocation="63"
2018-09-27T06:08:39+02:00 [INFO][csnas][repl]: start planning invocation="63"
2018-09-27T06:08:44+02:00 [INFO][csnas][repl]: start working invocation="63"
2018-09-27T06:08:49+02:00 [ERRO][csnas][repl]: receive request failed (might also be error on sender) invocation="63" filesystem="<REDACTED>" err="concurrent use of RPC connection" step="<REDACTED>(@zrepl_20180927_030838_000 => @zrepl_20180927_040835_000)" errType="*errors.errorString"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7d484b]

goroutine 3938545 [running]:
os.(*ProcessState).os.sys(...)
        /usr/lib/golang/src/os/exec_posix.go:78
os.(*ProcessState).Sys(...)
        /usr/lib/golang/src/os/exec.go:157
github.com/zrepl/zrepl/util.(*IOCommand).doWait(0xc4201b2d80, 0xc420070060, 0xc420070060)
        /go/github.com/zrepl/zrepl/util/iocommand.go:91 +0x4b
github.com/zrepl/zrepl/util.(*IOCommand).Read(0xc4201b2d80, 0xc420790000, 0x8000, 0x8000, 0x800c76d90, 0x0, 0xc420067c10)
        /go/github.com/zrepl/zrepl/util/iocommand.go:82 +0xe4
github.com/zrepl/zrepl/util.(*ByteCounterReader).Read(0xc4202dc580, 0xc420790000, 0x8000, 0x8000, 0x8c6900, 0x7cb201, 0xc420790000)
        /go/github.com/zrepl/zrepl/util/io.go:118 +0x51
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*chunkBuffer).readChunk(0xc42057e3c0, 0x800d1bbf0, 0xc4202dc580, 0xc420790000, 0x8000, 0x8000)
        /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/stream.go:58 +0x5e
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.writeStream(0xa04620, 0xc4204a9c20, 0x9fe340, 0xc4200d6380, 0x800d1bbf0, 0xc4202dc580, 0x8000, 0xc42000e000, 0x900420)
        /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/stream.go:101 +0x1ce
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*Conn).send(0xc4200d6380, 0xa04620, 0xc4204a9c20, 0xc42057e2c0, 0xc42013d570, 0x800d1bbf0, 0xc4202dc580, 0x0, 0x0)
        /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/main.go:374 +0x557
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*Client).RequestReply.func1(0x999741, 0x7, 0xc4200d6380, 0xa04620, 0xc4204a9c20, 0xc42013d570, 0xa00aa0, 0xc4202dc580, 0xc420516480)
        /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/client.go:169 +0x148
created by github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*Client).RequestReply
        /go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/client.go:167 +0x227
2018-09-27 12:06:59 +02:00
Christian Schwarz
75e42fd860 pruner: implement Report method + display in status command 2018-09-24 19:25:40 +02:00
Christian Schwarz
2990193512 replication: export SleepUntil in report 2018-09-24 19:23:53 +02:00
Christian Schwarz
75ba5874a5 active side: track activities in Run() as atomically updated member 2018-09-24 19:23:53 +02:00
Christian Schwarz
9e941d5be5 pruning: implement 'grid' keep rule 2018-09-24 17:33:16 +02:00
Christian Schwarz
328ac687f6 Remove obsolete cmd/** package + subpackages 2018-09-24 14:48:12 +02:00
Christian Schwarz
1ce0c69e4f implement local replication using new local transport
The new local transport uses socketpair() and a switchboard based on
client identities.
The special local job type is gone, which is good since it does not fit
into the 'Active/Passive side ' + 'mode' concept used to implement the
duality of push/sink | pull/source.
2018-09-24 14:43:53 +02:00
Christian Schwarz
f3e8eda04d fixup 4e04f8d3d2: snapper with separate stopped state for clean shutdown
would tight loop in ErrorWait
2018-09-24 14:40:47 +02:00
Christian Schwarz
cf5d63ee88 config: fix broken tests + reduce example configs 2018-09-24 12:41:39 +02:00
Christian Schwarz
4e04f8d3d2 snapper: make error mode an error wait mode
Just because taking one snapshot fails does not mean snapper needs to
stop for all others.
Since users are advised to monitor error logs, snapshot-taking errors
can still be addressed.
The ErrorWait mode allows a potential future Report / Status command to
distinguish normal waits from error waits.
2018-09-24 12:36:10 +02:00
Christian Schwarz
d04b9713c4 implement pull + sink modes for active and passive side 2018-09-24 12:36:10 +02:00
Christian Schwarz
6889f441b2 endpoint: support remote ReplicationCursor endpoint 2018-09-24 12:36:10 +02:00
Christian Schwarz
9c86e03384 endpoint Remote: fix broken Send endpoint for DryRun=true 2018-09-24 12:36:10 +02:00
Christian Schwarz
ffe33aff3d fix pruner: protobuf one-ofs require non-zero value, even if no public fields 2018-09-24 12:36:10 +02:00
Christian Schwarz
e3be120d88 refactor push + source into active + passive 'sides' with push and source 'modes' 2018-09-24 12:36:10 +02:00
Christian Schwarz
9446b51a1f status: infra for reporting jobs instead of just replication.Report 2018-09-23 21:11:33 +02:00
Christian Schwarz
4a6160baf3 update to streamrpc 0.4 & adjust config (not breaking) 2018-09-23 20:28:30 +02:00
Christian Schwarz
9dd662df08 status: raw output subcommand 2018-09-23 14:44:53 +02:00
Christian Schwarz
7f9eb62640 sink: concurrent connection handling 2018-09-18 22:44:00 +02:00
Christian Schwarz
6c3f442f13 daemon control / jsonclient: fix connection leak due to open request body
Also:
- Defensive measures in control http server (1s timeouts)
(prevent the leak, even if request body is not closed)
- Add prometheus metrics to track control socket latencies
(were used for debugging)
2018-09-13 12:44:46 +02:00
Christian Schwarz
fa47667f31 bring back prometheus metrics, with new metrics for replication state machine 2018-09-07 22:22:34 -07:00
Christian Schwarz
ab9446137f fix missing import of errors pacakge 2018-09-07 22:22:34 -07:00
Christian Schwarz
0c2ac3a168 pprof subcommand 2018-09-07 00:04:03 -07:00
Christian Schwarz
bf5099baac version subcommand: unified client & server 2018-09-06 23:52:11 -07:00
Christian Schwarz
7836ea36fc serve TLS: validate client CNs against whitelist in config file 2018-09-06 13:34:39 -07:00
Christian Schwarz
1edf020ce7 status command: better handling of 'nothing to do' Complete state 2018-09-06 11:46:02 -07:00
Christian Schwarz
c60ed78bc5 status subcommand: only draw one big progress bar of the entire replication
more details on progress per step in text form
2018-09-06 11:05:32 -07:00
Christian Schwarz
82d51cd0dc go vet fix 2018-09-05 21:48:52 -07:00
Christian Schwarz
0f75677e59 daemon/pruner: fix exercise (don't call it test) 2018-09-05 21:47:44 -07:00
Christian Schwarz
2c25f28972 simplify mapping & filtering in endpoints (re-rooting only) 2018-09-05 19:51:06 -07:00
Christian Schwarz
1323a30a0c zfs: ability to specify sources for zfsGet
fix use for Placeholder, leave rest as previous behavior
2018-09-05 19:51:06 -07:00
Christian Schwarz
975fdee217 replication & pruning: ditch replicated-property, use bookmark as cursor instead
A bookmark with a well-known name is used to track which version was
last successfully received by the receiver.
The createtxg that can be retrieved from the bookmark using `zfs get` is
used to set the Replicated attribute of each snap on the sender:
If the snap's CreateTXG > the cursor's, it is not yet replicated,
otherwise it has been.

There is an optional config option to change the behvior to
`CreateTXG >= the cursor's`, and the implementation defaults to that.

The reason: While things work just fine with `CreateTXG > the cursor's`,
ZFS does not provide size estimates in a `zfs send` dry run
(see acd2418).
However, to enable the use case of keeping the snapshot only around for
the replication, the config flag exists.
2018-09-05 19:51:06 -07:00
Christian Schwarz
acd2418803 handle DryRun send size estimate errors with bookmarks 2018-09-05 17:41:25 -07:00
Christian Schwarz
9eca269ad8 fixup 308e5e35fb: remove fprintf debug output 2018-09-05 08:35:31 -07:00
Christian Schwarz
c21222ef13 serve/tls: use handshake timeout from config 2018-09-05 08:32:59 -07:00
Christian Schwarz
6c988d0ebb add small subcommand to validate config 2018-09-05 08:32:38 -07:00
Christian Schwarz
4b39a18178 zfs: disable resume token test because it doesn't work in docker 2018-09-04 17:31:46 -07:00