830 Commits

Author SHA1 Message Date
Christian Schwarz
2927d0ca15 rpc: use grpchelper package, add grpc.KeepaliveEnforcementPolicy, fix 'transport is closing' error
Symptom: zrepl log message:

    rpc error: code = Unavailable desc = transport is closing

Underlying Problem:

* rpc.NewServer was not using grpchelper.NewServer and not setting Server KeepaliveParams by itself
* and even grpchelper.NewServer didn't set a KeepaliveEnforcementPolicy
* However, KeepaliveEnforcementPolicy is necessary if the client keepalive is configured with non-default values
* .. which grpchelper.ClientConn does, and that method is used by rpc.NewClient

* rpc.Client was sending pings
* the lack of a server-side KeepaliveEnforcementPolicy caused grpc's hard-coded `pingStrikes` counter to go past its limit of 2:
  021bd5734e/internal/transport/http2_server.go (L726)

How was this debugged:
* GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info PATH=/root/mockpath:$PATH zrepl daemon
* with a patch on grpc package to get more log messages on pingStrikes increases:

    diff --git a/internal/transport/http2_server.go b/internal/transport/http2_server.go
    index 8b04b039..f68f55ea 100644
    --- a/internal/transport/http2_server.go
    +++ b/internal/transport/http2_server.go
    @@ -214,6 +214,7 @@ func newHTTP2Server(conn net.Conn, config *ServerConfig) (_ ServerTransport, err
            if kep.MinTime == 0 {
                    kep.MinTime = defaultKeepalivePolicyMinTime
            }
    +       errorf("effective keepalive enforcement policy: %#v", kep)
            done := make(chan struct{})
            t := &http2Server{
                    ctx:               context.Background(),
    @@ -696,6 +697,7 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
            t.controlBuf.put(pingAck)

            now := time.Now()
    +       errorf("transport:ping handlePing, last ping %s ago", now.Sub(t.lastPingAt))
            defer func() {
                    t.lastPingAt = now
            }()
    @@ -713,11 +715,13 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
                    // Keepalive shouldn't be active thus, this new ping should
                    // have come after at least defaultPingTimeout.
                    if t.lastPingAt.Add(defaultPingTimeout).After(now) {
    +                       errorf("transport:ping strike ns < 1 && !t.kep.PermitWithoutStream")
                            t.pingStrikes++
                    }
            } else {
                    // Check if keepalive policy is respected.
                    if t.lastPingAt.Add(t.kep.MinTime).After(now) {
    +                       errorf("transport:ping strike !(ns < 1 && !t.kep.PermitWithoutStream) kep.MinTime=%s ns=%d", t.kep.MinTime, ns)
                            t.pingStrikes++
                    }
            }

fixes #181
2020-01-04 21:10:41 +01:00
Juergen Hoetzel
d35e2400b2 transport/{TCP,TLS}: optional IP_FREEBIND / IP_BINDANY bind sockopts
Allows binding to an address even if it is not actually (yet or ever)
configured. Fixes #238

Rationale:
https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/#whatdoesthismeanformeadeveloper
2020-01-04 17:21:48 +01:00
Frans Bergman
47ed599db7 docs: add Void Linux to installation instructions 2019-12-28 12:43:53 +01:00
Christian Schwarz
f899f4cbe4 build: go.mod: bump go-netssh and drop go-critic replaces
(go-netssh vendored util/circlog, so the circular dep is gone)

fixes build failure reported by @poetterl-ric

```
  make ZREPL_VERSION=0.2.1 zrepl-bin
    GO111MODULE=on go build -mod=readonly -ldflags "-X github.com/zrepl/zrepl/version.zreplVersion=0.2.1" -o "artifacts/zrepl-linux-amd64"
    go: github.com/problame/go-netssh@v0.0.0-20191026123024-f34099f4f6b1 requires
            github.com/zrepl/zrepl@v0.2.0 requires
            github.com/golangci/lint-1@v0.0.0-20181222135242-d2cdd8c08219: invalid version: git fetch --unshallow -f origin in /builddir/go/pkg/mod/cache/vcs/ca789ff49d608cda239a48837cfeea6e9dcdb2bce20051383910eef46b623a33: exit status 128:
            fatal: git fetch-pack: expected shallow list
```
2019-12-28 12:42:33 +01:00
Juergen Hoetzel
b3231d2bed daemon: fix typos in error messages
closes #255
2019-12-11 21:30:48 +01:00
Christian Schwarz
c24c327151 build: fix build.Dockerfile + integrate into CircleCI
fixup of 080f2c0616
fixup of 4994b7a9ea
2019-11-28 15:19:46 +01:00
Christian Schwarz
5e17d7ba80 docs: add recent supporters 2019-11-26 00:45:13 +01:00
Christian Schwarz
0261dbfe3d docs: 0.2.1 changelog 2019-11-20 20:16:41 +01:00
Christian Schwarz
4301f741db dist/systemd: remove @privileged from SystemCallFilter + cleanup comments
fixes #237
2019-11-20 18:44:14 +01:00
Christian Schwarz
7e743c74dc docs + samples: adjust ssh 'Compression' arg in examples 2019-11-20 18:19:16 +01:00
Christian Schwarz
ad0b055245 daemon/prometheus: fix crash if listener cannot be created
refs #238

  zrepl version=v0.2.0-11-gdc39c81 GOOS=linux GOARCH=amd64 Compiler=gc
 starting daemon
 [pull_source]: starting job
 [_prometheus]: starting job
 [connection_loss_tidyup]: starting job
 [connection_loss_tidyup]: wait for wakeups
 [_control]: starting job
 [_prometheus]: cannot listen err="listen tcp 10.0.0.200:9091: bind: cannot assign requested add
 [_prometheus]: job exited
 panic: runtime error: invalid memory address or nil pointer dereference
         panic: runtime error: invalid memory address or nil pointer dereference
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x81ea4d]
 goroutine 25 [running]:
 net/http.(*onceCloseListener).close(...)
         /usr/local/go/src/net/http/server.go:3330
 sync.(*Once).doSlow(0xc00018b060, 0xc0000c7bc0)
         /usr/local/go/src/sync/once.go:66 +0xe3
 sync.(*Once).Do(...)
         /usr/local/go/src/sync/once.go:57
 net/http.(*onceCloseListener).Close(0xc00018b050, 0xc0003a6000, 0xe0)
         /usr/local/go/src/net/http/server.go:3326 +0x77
 panic(0xb1d1c0, 0x11d7d90)
         /usr/local/go/src/runtime/panic.go:679 +0x1b2
 net/http.(*onceCloseListener).Accept(0xc00018b050, 0xc000120020, 0xb0fd20, 0x11d7ce0, 0xbee6e0)
         <autogenerated>:1 +0x32
 net/http.(*Server).Serve(0xc0003a6000, 0x0, 0x0, 0x0, 0x0)
         /usr/local/go/src/net/http/server.go:2896 +0x286
 net/http.Serve(...)
         /usr/local/go/src/net/http/server.go:2468
 github.com/zrepl/zrepl/daemon.(*prometheusJob).Run(0xc000109940, 0xd19ec0, 0xc00018a930)
         /go/src/github.com/zrepl/zrepl/daemon/prometheus.go:75 +0x23e
 github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0000f68c0, 0xd22c40, 0xc000116ee0, 0xd1ba4
         /go/src/github.com/zrepl/zrepl/daemon/daemon.go:220 +0x121
 created by github.com/zrepl/zrepl/daemon.(*jobs).start
         /go/src/github.com/zrepl/zrepl/daemon/daemon.go:216 +0x52e
 zrepl.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
 zrepl.service: Failed with result 'exit-code'.
2019-11-16 22:11:13 +01:00
Christian Schwarz
27db3e6f70 docs: supporters: update & add viz for different kinds of support 2019-11-16 22:11:07 +01:00
Christian Schwarz
4994b7a9ea rpc/dataconn + build: support GOOS={solaris,illumos} 2019-11-16 22:07:47 +01:00
Christian Schwarz
080f2c0616 build: Makefile: refactor cross-builds + release, add i386 targets 2019-11-16 22:07:47 +01:00
Christian Schwarz
d469cc04b6 transports/ssh: bump go-netssh to improve dial errors
from go-netssh changelog:

    dial: better error handling if ssh command exits with non-zero exit status

    SSHError.Error() relied on go-rwccmd behavior of returning io.EOF if the
    ssh binary exited with status code 0.

    We no longer use go-rwccmd => capture Stderr ourselves using zrepl's
    circlog (depending on zrepl is not pretty, but since this package is supposedly
    only used by zrepl ATM, this is fine)

    refs https://github.com/zrepl/zrepl/issues/237
2019-11-16 22:07:47 +01:00
Christian Schwarz
e0a25d04ac build: Makefile: set GO111MODULE=on for all go commands 2019-11-16 22:07:47 +01:00
Christian Schwarz
b0f2c79944 build: go mods: split build deps into subgomod, bump prometheus to 1.2.1, tweaked go mod tidy
tweaked go mod tidy: see comment in go.mod
2019-11-16 22:07:47 +01:00
Christian Schwarz
d2bc40f78d docs: transports: ssh: better copy-pastable connect section 2019-11-16 22:07:47 +01:00
Christian Schwarz
9e54f11960 dist/systemd: fix ssh-transport: create stdinserver runtime directory
tested to work on Debian Stretch

refs #237
2019-11-16 22:07:38 +01:00
Andy Fiddaman
6eda1f743f Fix typo in tutorial.rst 2019-11-05 09:57:30 -08:00
Andy Fiddaman
6787decef1 Add OmniOS (illumos distribution) to list of OSs 2019-11-05 09:49:57 -08:00
Christian Schwarz
d56d45a2ab docs: install: apt: fix snippet display & link to packaging repo 2019-10-21 16:35:23 +02:00
Christian Schwarz
fcf16a163a docs: install: apt snippet: idempotent, bash compat, multiarch compat
Co-authored-by: Janis Streib <me@janis-streib.de>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
2019-10-21 16:21:51 +02:00
Christian Schwarz
dc39c819a3 docs: add debian + ubuntu installation 2019-10-18 20:18:42 +02:00
Christian Schwarz
1048b09487 build: include config examples and dist in noarch tarball 2019-10-18 20:18:42 +02:00
Richard Poettler
3806e97404 docs: add copr repo for Fedora/CentOS
closes #229
2019-10-16 10:46:02 +02:00
chenhao
c396f9508a zfs: replace hard coded zfs command in ZFSDestroy
fixes #231
2019-10-16 10:22:53 +02:00
Christian Schwarz
b9933f6cb2 platformtest: add zfsGet bookmark handling & replicationCursor tests
This encodes the observation made in issue #230:
In the ZFS version shipped in Ubuntu 16.04,
`zfs get someprop a#bookmark` does not work.
2019-10-14 17:54:14 +02:00
Christian Schwarz
0ba4b5eda6 zfs: helper for ZFSGet guid and createtxg 2019-10-14 17:54:14 +02:00
Christian Schwarz
18d2c350de platformtest: harness: -failure.stop-and-keep-pool mode, prettier logging 2019-10-14 17:54:14 +02:00
Christian Schwarz
f8f9fd11cd platformtest: logging-related refactorings 2019-10-14 17:32:58 +02:00
John Ramsden
b422e6f12e docs: installation: add Arch Linux 'from source' package 2019-10-13 12:33:27 +02:00
Christian Schwarz
f8d5082bdd docs: remove outdated implementation references + remove 0.2-rc* from published docs 2019-10-13 12:26:39 +02:00
Christian Schwarz
ffe677e55a docs: snapshotting: command hook type: not the only hook type anymore 2019-10-13 12:16:31 +02:00
Christian Schwarz
84eefa57bc rpc/grpcclientidentity: remove hard-coded deadline in listener adapter causing crash
Verified once again that grpc.DialContext is indeed non-blocking.
However, it checks in a defer stmt whether the passed dial context is done (ctx.Done()).
A done context is highly unusual if the dial is non-blocking.
But it might still happen, maybe because of a machine suspend during the function call and before the defer stmt is executed.

panic:
context deadline exceeded
goroutine 49 [running]:
github.com/zrepl/zrepl/rpc/grpcclientidentity/grpchelper.ClientConn(0x1906ea0, 0xc0003ea1e0, 0x1921620, 0xc0002da660, 0x0)
        /gopath/src/github.com/zrepl/zrepl/rpc/grpcclientidentity/grpchelper/authlistener_grpc_adaptor_wrapper.go:49 +0x38c
github.com/zrepl/zrepl/rpc.NewClient(0x1906f00, 0xc0002d60f0, 0x1921620, 0xc0002da640, 0x1921620, 0xc0002da660, 0x1921620, 0xc0002da6a0, 0x1921620)
        /gopath/src/github.com/zrepl/zrepl/rpc/rpc_client.go:53 +0x199
github.com/zrepl/zrepl/daemon/job.(*modePush).ConnectEndpoints(0xc0000d1e90, 0x1921620, 0xc0002da640, 0x1921620, 0xc0002da660, 0x1921620, 0xc0002da6a0, 0x1906f00, 0xc0002d60f0)
        /gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:105 +0x15d
github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do(0xc0000d6120, 0x1918720, 0xc00020f170)
        /gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:356 +0x236
github.com/zrepl/zrepl/daemon/job.(*ActiveSide).Run(0xc0000d6120, 0x1918720, 0xc00009c660)
        /gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:347 +0x289
github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0000fc880, 0x1921620, 0xc0002da120, 0x191a320, 0xc0000d6120, 0x1918720, 0xc0002d6a80)
2019-10-10 14:02:12 +02:00
Juergen Hoetzel
ad77371e38 docs: include Arch Linux installation 2019-10-06 20:38:00 +02:00
Juergen Hoetzel
c524acb2df Fix invalid comment syntax 2019-10-06 16:23:20 +02:00
Christian Schwarz
3edfe535c6 docs: fix typo on index page 2019-10-05 14:59:51 +02:00
Juergen Hoetzel
d3b99e8e39 Fix typo 2019-10-05 14:58:49 +02:00
Christian Schwarz
3c03f21419 docs: SEPA hint, supporters, fix publish script 2019-10-03 11:57:19 +02:00
Christian Schwarz
5c95c21727 transport/local: configurable dial_timeout for connect, default 2s 2019-09-29 19:05:54 +02:00
Christian Schwarz
a6b578b648 rpc/dataconn/stream: Conn: handle concurrent Close calls + goroutine leak fix
* Add Close() in closeState to identify the first closer
* Non-first closers get an error
* Reads and Writes from the Conn get an error if the conn was closed
  while the Read / Write was running
* The first closer starts a _separate_ goroutine draining the c.frameReads channel
* The first closer then waits for the goroutine that fills c.frameReads
  to exit
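
Illustrative sketch (not part of the commit): the close protocol listed above can be modeled with a compare-and-swap that identifies the first closer, a separate draining goroutine, and a wait for the producer to exit. All names here (`conn`, `produce`, `errClosed`, `closeConcurrently`) are hypothetical, and the `stop` channel stands in for closing the underlying connection. Failing in-flight Reads and Writes is not modeled.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var errClosed = errors.New("conn already closed")

// conn models a connection whose producer goroutine fills the frames channel.
type conn struct {
	closed       uint32        // 0 = open, 1 = closed; set by the first closer via CAS
	stop         chan struct{} // closed by the first closer to tell the producer to exit
	frames       chan int      // unbuffered: the producer can block on send
	producerDone chan struct{} // closed when the producer has exited
}

func newConn() *conn {
	c := &conn{
		stop:         make(chan struct{}),
		frames:       make(chan int),
		producerDone: make(chan struct{}),
	}
	go c.produce()
	return c
}

// produce sends frames until it observes stop. The send itself does not
// watch stop, so the producer can be stuck in the send when Close is called.
func (c *conn) produce() {
	defer close(c.producerDone)
	defer close(c.frames)
	for i := 0; ; i++ {
		select {
		case <-c.stop:
			return
		default:
		}
		c.frames <- i
	}
}

// Close succeeds for the first caller only; later callers get errClosed.
func (c *conn) Close() error {
	if !atomic.CompareAndSwapUint32(&c.closed, 0, 1) {
		return errClosed
	}
	close(c.stop)
	// Drain in a separate goroutine so a producer blocked on send can make
	// progress, notice stop, and exit.
	drained := make(chan struct{})
	go func() {
		defer close(drained)
		for range c.frames {
		}
	}()
	<-c.producerDone // wait for the producer to exit
	<-drained        // the drainer ends once the producer has closed frames
	return nil
}

// closeConcurrently calls Close from n goroutines and counts the outcomes.
func closeConcurrently(n int) (ok, failed int32) {
	c := newConn()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := c.Close(); err != nil {
				atomic.AddInt32(&failed, 1)
			} else {
				atomic.AddInt32(&ok, 1)
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	ok, failed := closeConcurrently(8)
	fmt.Println(ok, failed)
}
```

The ordering is what matters in this model: draining starts before the closer waits for the producer, so a producer blocked on its channel send cannot keep Close waiting forever.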

refs 3bfe0c16d0
fixes #174

readFrames would block on `reads <-`
   but only after that would stream.Conn.readFrames close c.waitReadFramesDone
   which was too late because stream.Conn.Close would wait for c.waitReadFramesDone to be closed before draining the channel
                              ^^^^^^ (not frameconn.Conn, that closed successfully)

   195 @ 0x1032ae0 0x1006cab 0x1006c81 0x1006a65 0x15505be 0x155163e 0x1060bc1
           0x15505bd       github.com/zrepl/zrepl/rpc/dataconn/stream.readFrames+0x16d             github.com/zrepl/zrepl/rpc/dataconn/stream/stream.go:220
           0x155163d       github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).readFrames+0xbd      github.com/zrepl/zrepl/rpc/dataconn/stream/stream_conn.go:71

   195 @ 0x1032ae0 0x10078c8 0x100789e 0x100758b 0x1552678 0x1557a4b 0x1556aec 0x1060bc1
           0x1552677       github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).Close+0x77           github.com/zrepl/zrepl/rpc/dataconn/stream/stream_conn.go:191
           0x1557a4a       github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn.func1+0x5a      github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:93
           0x1556aeb       github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn+0x87b           github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:176
2019-09-29 19:05:54 +02:00
Christian Schwarz
8af824df41 docs: promote monetary support in changelog 2019-09-29 19:04:53 +02:00
Christian Schwarz
58ab25919e platformtest: dedicated pool per test, Makefile target, maintainer notice
fixes #216
fixes #211
2019-09-29 18:48:44 +02:00
Christian Schwarz
215848f476 docs: 0.2 changelog 2019-09-28 17:50:07 +02:00
Christian Schwarz
f9c7766073 replication/logic: fix race when reading byte counter pointer for report
fixes #214
2019-09-28 16:16:19 +02:00
Christian Schwarz
f976212ec9 config: validate presence of port in addresses
fixes #213
2019-09-28 14:25:14 +02:00
Christian Schwarz
8c88e168c1 rpc/dataconn/client: ReqRecv to log level Debug
reported by @avg-l
2019-09-28 11:49:20 +02:00
Christian Schwarz
a78c854404 rpc/dataconn/frameconn: mask ECONNRESET error on Close()
fixes #190
2019-09-28 11:49:20 +02:00
Christian Schwarz
8a5af2f80e build/circleci: apt update before installing
hope this fixes the spurious apt install failures
2019-09-27 21:31:05 +02:00