Symptom: zrepl log message:
rpc error: code = Unavailable desc = transport is closing
Underlying Problem:
* rpc.NewServer was not using grpchelper.NewServer and not setting Server KeepaliveParams by itself
* and even grpchelper.NewServer didn't set a KeepaliveEnforcementPolicy
* However, KeepaliveEnforcementPolicy is necessary if the client keepalive is configured with non-default values
* .. which grpchelper.ClientConn does, and that method is used by rpc.NewClient
* rpc.Client was sending pings
* lacking server-side KeepaliveEnforcementPolicy caused grpc-hard-coded `pingStrikes` counter to go past limit of 2:
021bd5734e/internal/transport/http2_server.go (L726)
How was this debugged:
* GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info PATH=/root/mockpath:$PATH zrepl daemon
* with a patch on grpc package to get more log messages on pingStrikes increases:
diff --git a/internal/transport/http2_server.go b/internal/transport/http2_server.go
index 8b04b039..f68f55ea 100644
--- a/internal/transport/http2_server.go
+++ b/internal/transport/http2_server.go
@@ -214,6 +214,7 @@ func newHTTP2Server(conn net.Conn, config *ServerConfig) (_ ServerTransport, err
if kep.MinTime == 0 {
kep.MinTime = defaultKeepalivePolicyMinTime
}
+ errorf("effective keepalive enforcement policy: %#v", kep)
done := make(chan struct{})
t := &http2Server{
ctx: context.Background(),
@@ -696,6 +697,7 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
t.controlBuf.put(pingAck)
now := time.Now()
+ errorf("transport:ping handlePing, last ping %s ago", now.Sub(t.lastPingAt))
defer func() {
t.lastPingAt = now
}()
@@ -713,11 +715,13 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
// Keepalive shouldn't be active thus, this new ping should
// have come after at least defaultPingTimeout.
if t.lastPingAt.Add(defaultPingTimeout).After(now) {
+ errorf("transport:ping strike ns < 1 && !t.kep.PermitWithoutStream")
t.pingStrikes++
}
} else {
// Check if keepalive policy is respected.
if t.lastPingAt.Add(t.kep.MinTime).After(now) {
+ errorf("transport:ping strike !(ns < 1 && !t.kep.PermitWithoutStream) kep.MinTime=%s ns=%d", t.kep.MinTime, ns)
t.pingStrikes++
}
}
fixes#181
from go-netssh changelog:
dial: better error handling if ssh command exits with non-zero exit status
SSHError.Error() relied on go-rwccmd behavior of returning io.EOF if the
ssh binary exited with status code 0.
We no longe ruse go-rwccmd => capture Stderr ourselves using zrepl's
circlog (depending on zrepl is not pretty, but since this package is supposedly
only used by zrepl ATM, this is fine)
refs https://github.com/zrepl/zrepl/issues/237
Verified once again that grpc.DialContext is indeed non-blocking.
However, it checks in a defer stmt that the passed dial is not ctx.Done().
That is highly unusual if the dial is non-blocking.
But it might still happen, maybe because of machine suspend during the function call and before the defer stmt is executed.
panic:
context deadline exceeded
goroutine 49 [running]:
github.com/zrepl/zrepl/rpc/grpcclientidentity/grpchelper.ClientConn(0x1906ea0, 0xc0003ea1e0, 0x1921620, 0xc0002da660, 0x0)
/gopath/src/github.com/zrepl/zrepl/rpc/grpcclientidentity/grpchelper/authlistener_grpc_adaptor_wrapper.go:49 +0x38c
github.com/zrepl/zrepl/rpc.NewClient(0x1906f00, 0xc0002d60f0, 0x1921620, 0xc0002da640, 0x1921620, 0xc0002da660, 0x1921620, 0xc0002da6a0, 0x1921620)
/gopath/src/github.com/zrepl/zrepl/rpc/rpc_client.go:53 +0x199
github.com/zrepl/zrepl/daemon/job.(*modePush).ConnectEndpoints(0xc0000d1e90, 0x1921620, 0xc0002da640, 0x1921620, 0xc0002da660, 0x1921620, 0xc0002da6a0, 0x1906f00, 0xc0002d60f0)
/gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:105 +0x15d
github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do(0xc0000d6120, 0x1918720, 0xc00020f170)
/gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:356 +0x236
github.com/zrepl/zrepl/daemon/job.(*ActiveSide).Run(0xc0000d6120, 0x1918720, 0xc00009c660)
/gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:347 +0x289
github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0000fc880, 0x1921620, 0xc0002da120, 0x191a320, 0xc0000d6120, 0x1918720, 0xc0002d6a80)
* Add Close() in closeState to identify the first closer
* Non-first closers get an error
* Reads and Writes from the Conn get an error if the conn was closed
during the Read / Write was running
* The first closer starts _separate_ goroutine draining the c.frameReads channel
* The first closer then waits for the goroutine that fills c.frameReads
to exit
refs 3bfe0c16d0fixes#174
readFrames would block on `reads <-`
but only after that would stream.Conn.readFrames close c.waitReadFramesDone
which was too late because stream.Conn.Close would wait for c.waitReadFramesDone to be closed before draining the channel
^^^^^^ (not frameconn.Conn, that closed successfully)
195 @ 0x1032ae0 0x1006cab 0x1006c81 0x1006a65 0x15505be 0x155163e 0x1060bc1
0x15505bd github.com/zrepl/zrepl/rpc/dataconn/stream.readFrames+0x16d github.com/zrepl/zrepl/rpc/dataconn/stream/stream.go:220
0x155163d github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).readFrames+0xbd github.com/zrepl/zrepl/rpc/dataconn/stream/stream_conn.go:71
195 @ 0x1032ae0 0x10078c8 0x100789e 0x100758b 0x1552678 0x1557a4b 0x1556aec 0x1060bc1
0x1552677 github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).Close+0x77 github.com/zrepl/zrepl/rpc/dataconn/stream/stream_conn.go:191
0x1557a4a github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn.func1+0x5a github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:93
0x1556aeb github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn+0x87b github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:176