Symptom: zrepl log message:
rpc error: code = Unavailable desc = transport is closing
Underlying Problem:
* rpc.NewServer was not using grpchelper.NewServer and not setting Server KeepaliveParams by itself
* and even grpchelper.NewServer didn't set a KeepaliveEnforcementPolicy
* However, KeepaliveEnforcementPolicy is necessary if the client keepalive is configured with non-default values
* .. which grpchelper.ClientConn does, and that method is used by rpc.NewClient
* rpc.Client was sending pings
* lacking server-side KeepaliveEnforcementPolicy caused grpc-hard-coded `pingStrikes` counter to go past limit of 2:
021bd5734e/internal/transport/http2_server.go (L726)
How was this debugged:
* GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info PATH=/root/mockpath:$PATH zrepl daemon
* with a patch on grpc package to get more log messages on pingStrikes increases:
diff --git a/internal/transport/http2_server.go b/internal/transport/http2_server.go
index 8b04b039..f68f55ea 100644
--- a/internal/transport/http2_server.go
+++ b/internal/transport/http2_server.go
@@ -214,6 +214,7 @@ func newHTTP2Server(conn net.Conn, config *ServerConfig) (_ ServerTransport, err
if kep.MinTime == 0 {
kep.MinTime = defaultKeepalivePolicyMinTime
}
+ errorf("effective keepalive enforcement policy: %#v", kep)
done := make(chan struct{})
t := &http2Server{
ctx: context.Background(),
@@ -696,6 +697,7 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
t.controlBuf.put(pingAck)
now := time.Now()
+ errorf("transport:ping handlePing, last ping %s ago", now.Sub(t.lastPingAt))
defer func() {
t.lastPingAt = now
}()
@@ -713,11 +715,13 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
// Keepalive shouldn't be active thus, this new ping should
// have come after at least defaultPingTimeout.
if t.lastPingAt.Add(defaultPingTimeout).After(now) {
+ errorf("transport:ping strike ns < 1 && !t.kep.PermitWithoutStream")
t.pingStrikes++
}
} else {
// Check if keepalive policy is respected.
if t.lastPingAt.Add(t.kep.MinTime).After(now) {
+ errorf("transport:ping strike !(ns < 1 && !t.kep.PermitWithoutStream) kep.MinTime=%s ns=%d", t.kep.MinTime, ns)
t.pingStrikes++
}
}
fixes#181