Christian Schwarz
000d8bba66
hotfix: limit concurrency of zfs send & recv commands
...
ATM, the replication logic sends all dry-run requests in parallel,
which might overwhelm the ZFS pool on the sending side.
Since we use rpc/dataconn for dry sends, this also opens one TCP
connection per dry-run request.
Use a semaphore to limit the degree of concurrency where we know it is a
problem ATM.
As indicated by the comments, the cleaner solution would involve some
kind of 'resource exhaustion' error code.
refs #161
refs #164
2019-03-28 22:17:12 +01:00
Christian Schwarz
5b97953bfb
run golangci-lint and apply suggested fixes
2019-03-27 13:12:26 +01:00
Christian Schwarz
afed762774
format source tree using goimports
2019-03-22 19:41:12 +01:00
Christian Schwarz
5aefc47f71
daemon: remove last traces of watchdog mechanism
2019-03-19 18:15:34 +01:00
Christian Schwarz
c87759affe
replication/driver: automatic retries on connectivity-related errors
2019-03-13 15:00:40 +01:00
Christian Schwarz
07b43bffa4
replication: refactor driving logic (no more explicit state machine)
2019-03-13 15:00:40 +01:00
Christian Schwarz
0230c6321f
rpc/dataconn: microbenchmark
2019-03-13 13:57:21 +01:00
Christian Schwarz
796c5ad42d
rpc rewrite: control RPCs using gRPC + separate RPC for data transfer
...
transport/ssh: update go-netssh to new version
=> supports CloseWrite and Deadlines
=> build: require Go 1.11 (netssh requires it)
2019-03-13 13:53:48 +01:00
Christian Schwarz
d281fb00e3
socketpair: directly export *net.UnixConn (and add test for that behavior)
2019-03-13 11:36:34 +01:00
Christian Schwarz
25c974f0b5
envconst: support for int64
2019-03-13 00:07:33 +01:00
Christian Schwarz
7a75a4d384
util/iocommand: timeout kill on close + other hardening
2018-12-11 21:06:54 +01:00
Christian Schwarz
190c7270d9
daemon/active + watchdog: simplify control flow using explicit ActiveSideState
2018-10-21 12:53:34 +02:00
Christian Schwarz
69bfcb7bed
daemon/active: implement watchdog to handle stuck replication / pruners
...
ActiveSide.do() can only run sequentially, i.e. we cannot run
replication and pruning in parallel. Why?
* go-streamrpc only allows one active request at a time
(this is bad design and should be fixed at some point)
* replication and pruning are implemented independently, but work on the
same resources (snapshots)
A: pruning might destroy a snapshot that is planned to be replicated
B: replication might replicate snapshots that should be pruned
We do not have any resource management / locking for A and B, but we
have a use case where users don't want their machine to fill up with
snapshots if replication does not work.
That means we _have_ to run the pruners.
A further complication is that we cannot just cancel the replication
context after a timeout and move on to the pruner: it could be initial
replication and we don't know how long it will take.
(And we don't have resumable send & recv yet).
With the previous commits, we can implement the watchdog using context
cancellation.
Note that the 'MadeProgress()' calls can only be placed right before a
non-error state transition. Otherwise, we could end up in a live-lock.
2018-10-19 17:23:00 +02:00
Christian Schwarz
814fec60f0
endpoint + zfs: context cancellation of util.IOCommand instances (send & recv for now)
2018-10-19 16:12:21 +02:00
Christian Schwarz
a97684923a
refactor: socketpair into utils package (useful elsewhere)
2018-10-11 21:17:43 +02:00
Christian Schwarz
976c1f3929
util.IOCommand: add stderr logging for unexpected crashes in calls to ProcessState.Sys()
...
Crashes observed on a FreeBSD 11.2 system
2018-09-27T05:08:39+02:00 [INFO][csnas]: start replication invocation="62"
2018-09-27T05:08:39+02:00 [INFO][csnas][repl]: start planning invocation="62"
2018-09-27T05:08:58+02:00 [INFO][csnas][repl]: start working invocation="62"
2018-09-27T05:09:57+02:00 [INFO][csnas]: start pruning sender invocation="62"
2018-09-27T05:10:11+02:00 [INFO][csnas]: start pruning receiver invocation="62"
2018-09-27T05:10:32+02:00 [INFO][csnas]: wait for wakeups
2018-09-27T06:08:39+02:00 [INFO][csnas]: start replication invocation="63"
2018-09-27T06:08:39+02:00 [INFO][csnas][repl]: start planning invocation="63"
2018-09-27T06:08:44+02:00 [INFO][csnas][repl]: start working invocation="63"
2018-09-27T06:08:49+02:00 [ERRO][csnas][repl]: receive request failed (might also be error on sender) invocation="63" filesystem="<REDACTED>" err="concurrent use of RPC connection" step="<REDACTED>(@zrepl_20180927_030838_000 => @zrepl_20180927_040835_000)" errType="*errors.errorString"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7d484b]
goroutine 3938545 [running]:
os.(*ProcessState).os.sys(...)
/usr/lib/golang/src/os/exec_posix.go:78
os.(*ProcessState).Sys(...)
/usr/lib/golang/src/os/exec.go:157
github.com/zrepl/zrepl/util.(*IOCommand).doWait(0xc4201b2d80, 0xc420070060, 0xc420070060)
/go/github.com/zrepl/zrepl/util/iocommand.go:91 +0x4b
github.com/zrepl/zrepl/util.(*IOCommand).Read(0xc4201b2d80, 0xc420790000, 0x8000, 0x8000, 0x800c76d90, 0x0, 0xc420067c10)
/go/github.com/zrepl/zrepl/util/iocommand.go:82 +0xe4
github.com/zrepl/zrepl/util.(*ByteCounterReader).Read(0xc4202dc580, 0xc420790000, 0x8000, 0x8000, 0x8c6900, 0x7cb201, 0xc420790000)
/go/github.com/zrepl/zrepl/util/io.go:118 +0x51
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*chunkBuffer).readChunk(0xc42057e3c0, 0x800d1bbf0, 0xc4202dc580, 0xc420790000, 0x8000, 0x8000)
/go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/stream.go:58 +0x5e
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.writeStream(0xa04620, 0xc4204a9c20, 0x9fe340, 0xc4200d6380, 0x800d1bbf0, 0xc4202dc580, 0x8000, 0xc42000e000, 0x900420)
/go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/stream.go:101 +0x1ce
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*Conn).send(0xc4200d6380, 0xa04620, 0xc4204a9c20, 0xc42057e2c0, 0xc42013d570, 0x800d1bbf0, 0xc4202dc580, 0x0, 0x0)
/go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/main.go:374 +0x557
github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*Client).RequestReply.func1(0x999741, 0x7, 0xc4200d6380, 0xa04620, 0xc4204a9c20, 0xc42013d570, 0xa00aa0, 0xc4202dc580, 0xc420516480)
/go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/client.go:169 +0x148
created by github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc.(*Client).RequestReply
/go/github.com/zrepl/zrepl/vendor/github.com/problame/go-streamrpc/client.go:167 +0x227
2018-09-27 12:06:59 +02:00
Anton Schirg
6ca11a7391
byte counter for status
2018-08-30 12:54:30 +02:00
Anton Schirg
add1b69809
move retentiongrid to own package
2018-08-26 22:06:47 +02:00
Christian Schwarz
e30ae972f4
gofmt
2018-08-25 21:30:25 +02:00
Christian Schwarz
a0b320bfeb
streamrpc now requires net.Conn => use it instead of rwc everywhere
2018-08-08 13:09:51 +02:00
Christian Schwarz
1826535e6f
WIP
2018-07-15 17:36:53 +02:00
Christian Schwarz
8cca0a8547
Initial working version
...
Summary:
* Logging is still bad
* test output in a lot of places
* FIXMEs everywhere
Test Plan: None, just review
Differential Revision: https://phabricator.cschwarz.com/D2
2018-06-24 10:44:00 +02:00
Christian Schwarz
b69089a527
Puller: refactor + use Task API
...
* drop rx byte count functionality
* will be re-added to Task as necessary
refs #10
2017-12-27 14:39:47 +01:00
Christian Schwarz
bfcba7b281
cmd: logging using logrus
2017-09-22 17:01:54 +02:00
Christian Schwarz
93a58a36bf
util: add PrefixLogger
2017-09-11 15:37:45 +02:00
Christian Schwarz
ca1a482e9e
sshbytestream & IOCommand: fix handling of dead child process
...
SSH catches SIGTERM, tears down its connection, then exits with
platform-specific exit code.
2017-08-09 21:01:06 +02:00
Christian Schwarz
8eb4a2ba44
Rudimentary progress reporting on send / recv side.
2017-08-06 16:21:54 +02:00
Christian Schwarz
e0d39ddf11
Implement RetentionGrid structure.
2017-07-01 23:19:31 +02:00
Christian Schwarz
5f84d30972
util/ReadWriteCloserLogger: handle unset readlog | writelog
2017-05-20 19:39:32 +02:00
Christian Schwarz
04206ebd8b
util.IOCommand: Close() gracefully via SIGTERM
2017-05-14 14:11:19 +02:00
Christian Schwarz
ee570bb060
refactor: consolidate ForkReader-like implementations to IOCommand
2017-05-14 12:27:15 +02:00
Christian Schwarz
6f84bf665d
cmd: support logging reads & writes from sshbytestream to a file.
2017-05-13 15:34:28 +02:00
Christian Schwarz
74719ad846
rpc: chunk JSON parts of communication + refactoring
...
JSONDecoder was buffering more of connection data than just the JSON.
=> Unchunker didn't bother and just started unchunking.
While chaining JSONDecoder.Buffered() and the connection using
ChainedReader works, it's still not a clean architecture.
=> Every JSON message is now wrapped in a chunked stream
(chunked and unchunked)
=> no special-cases
=> Keep ChainedReader, might be useful later on...
2017-05-13 15:33:46 +02:00
Christian Schwarz
b2de658270
util: fix package name of chunking
2017-05-13 15:25:09 +02:00
Christian Schwarz
61c263b91d
chunking: rewrite to handle EOF events correctly
...
bonus: some tests asserting the chunking protocol is adhered to
2017-05-06 23:41:51 +02:00
Christian Schwarz
d9ecfc8eb4
Gofmt megacommit.
2017-04-26 20:29:54 +02:00
Christian Schwarz
4494afe47f
Finish implementation of RPC.
2017-04-16 21:38:31 +02:00
Christian Schwarz
69f8e7cfc3
Implement chunking.
...
Move from rpc to separate util package.
2017-04-15 17:07:32 +02:00