Commit Graph

6 Commits

Author SHA1 Message Date
Christian Schwarz
10a14a8c50 [#307] add package trace, integrate it with logging, and adopt it throughout zrepl
package trace:

- introduce the concept of tasks and spans, tracked as linked list within ctx
    - see package-level docs for an overview of the concepts
    - **main feature 1**: unique stack of task and span IDs
        - makes it easy to follow a series of log entries in concurrent code
    - **main feature 2**: ability to produce a chrome://tracing-compatible trace file
        - either via an env variable or a `zrepl pprof` subcommand
        - this is not a CPU profile, we already have go pprof for that
        - but it is very useful to visually inspect where the
          replication / snapshotter / pruner spends its time
          ( fixes #307 )

usage in package daemon/logging:

- goal: every log entry should have a trace field with the ID stack from package trace

- make `logging.GetLogger(ctx, Subsys)` the authoritative `logger.Logger` factory function
    - the context carries a linked list of injected fields which
      `logging.GetLogger` adds to the logger it returns
    - `logging.GetLogger` also uses package `trace` to get the
      task-and-span-stack and injects it into the returned logger's fields
2020-05-19 11:30:02 +02:00
Christian Schwarz
5b52e5e331 rpc: fix missing logger context vars in control connection handlers
use ctxInterceptor in gRPC interceptors
also panic if the unimplemented stream interceptor is used
2020-02-14 21:42:03 +01:00
Christian Schwarz
2927d0ca15 rpc: use grpchelper package, add grpc.KeepaliveEnforcementPolicy, fix 'transport is closing' error
Symptom: zrepl log message:

    rpc error: code = Unavailable desc = transport is closing

Underlying Problem:

* rpc.NewServer was not using grpchelper.NewServer and not setting Server KeepaliveParams by itself
* and even grpchelper.NewServer didn't set a KeepaliveEnforcementPolicy
* However, KeepaliveEnforcementPolicy is necessary if the client keepalive is configured with non-default values
* .. which grpchelper.ClientConn does, and that method is used by rpc.NewClient

* rpc.Client was sending pings
* lacking server-side KeepaliveEnforcementPolicy caused grpc-hard-coded `pingStrikes` counter to go past limit of 2:
  021bd5734e/internal/transport/http2_server.go (L726)

How was this debugged:
* GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info PATH=/root/mockpath:$PATH zrepl daemon
* with a patch on grpc package to get more log messages on pingStrikes increases:

    diff --git a/internal/transport/http2_server.go b/internal/transport/http2_server.go
    index 8b04b039..f68f55ea 100644
    --- a/internal/transport/http2_server.go
    +++ b/internal/transport/http2_server.go
    @@ -214,6 +214,7 @@ func newHTTP2Server(conn net.Conn, config *ServerConfig) (_ ServerTransport, err
            if kep.MinTime == 0 {
                    kep.MinTime = defaultKeepalivePolicyMinTime
            }
    +       errorf("effective keepalive enforcement policy: %#v", kep)
            done := make(chan struct{})
            t := &http2Server{
                    ctx:               context.Background(),
    @@ -696,6 +697,7 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
            t.controlBuf.put(pingAck)

            now := time.Now()
    +       errorf("transport:ping handlePing, last ping %s ago", now.Sub(t.lastPingAt))
            defer func() {
                    t.lastPingAt = now
            }()
    @@ -713,11 +715,13 @@ func (t *http2Server) handlePing(f *http2.PingFrame) {
                    // Keepalive shouldn't be active thus, this new ping should
                    // have come after at least defaultPingTimeout.
                    if t.lastPingAt.Add(defaultPingTimeout).After(now) {
    +                       errorf("transport:ping strike ns < 1 && !t.kep.PermitWithoutStream")
                            t.pingStrikes++
                    }
            } else {
                    // Check if keepalive policy is respected.
                    if t.lastPingAt.Add(t.kep.MinTime).After(now) {
    +                       errorf("transport:ping strike !(ns < 1 && !t.kep.PermitWithoutStream) kep.MinTime=%s ns=%d", t.kep.MinTime, ns)
                            t.pingStrikes++
                    }
            }

fixes #181
2020-01-04 21:10:41 +01:00
Christian Schwarz
84eefa57bc rpc/grpcclientidentity: remove hard-coded deadline in listener adatper causing crash
Verified once again that grpc.DialContext is indeed non-blocking.
However, it checks in a defer stmt that the passed dial is not ctx.Done().
That is highly unusual if the dial is non-blocking.
But it might still happen, maybe because of machine suspend during the function call and before the defer stmt is executed.

panic:
context deadline exceeded
goroutine 49 [running]:
github.com/zrepl/zrepl/rpc/grpcclientidentity/grpchelper.ClientConn(0x1906ea0, 0xc0003ea1e0, 0x1921620, 0xc0002da660, 0x0)
        /gopath/src/github.com/zrepl/zrepl/rpc/grpcclientidentity/grpchelper/authlistener_grpc_adaptor_wrapper.go:49 +0x38c
github.com/zrepl/zrepl/rpc.NewClient(0x1906f00, 0xc0002d60f0, 0x1921620, 0xc0002da640, 0x1921620, 0xc0002da660, 0x1921620, 0xc0002da6a0, 0x1921620)
        /gopath/src/github.com/zrepl/zrepl/rpc/rpc_client.go:53 +0x199
github.com/zrepl/zrepl/daemon/job.(*modePush).ConnectEndpoints(0xc0000d1e90, 0x1921620, 0xc0002da640, 0x1921620, 0xc0002da660, 0x1921620, 0xc0002da6a0, 0x1906f00, 0xc0002d60f0)
        /gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:105 +0x15d
github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do(0xc0000d6120, 0x1918720, 0xc00020f170)
        /gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:356 +0x236
github.com/zrepl/zrepl/daemon/job.(*ActiveSide).Run(0xc0000d6120, 0x1918720, 0xc00009c660)
        /gopath/src/github.com/zrepl/zrepl/daemon/job/active.go:347 +0x289
github.com/zrepl/zrepl/daemon.(*jobs).start.func1(0xc0000fc880, 0x1921620, 0xc0002da120, 0x191a320, 0xc0000d6120, 0x1918720, 0xc0002d6a80)
2019-10-10 14:02:12 +02:00
Christian Schwarz
07b43bffa4 replication: refactor driving logic (no more explicit state machine) 2019-03-13 15:00:40 +01:00
Christian Schwarz
796c5ad42d rpc rewrite: control RPCs using gRPC + separate RPC for data transfer
transport/ssh: update go-netssh to new version
    => supports CloseWrite and Deadlines
    => build: require Go 1.11 (netssh requires it)
2019-03-13 13:53:48 +01:00