10 Commits

Author SHA1 Message Date
Christian Schwarz
fa586b493c [#348] fix crash on early-recv error
* dataconn.Server did not call SendStream.Close(), which left the zfs send process dangling.
* When the source job's ctx interceptor closed the task, the trace package detected the dangling zfs send and panicked.

    020-07-25T19:54:41-04:00 [ERRO][latitude][rpc.data][cyZj$J3Ca$J3Ca.CJwB]: cannot write send stream err="frameconn: shutting down"
    panic: end task: 1 active child tasks: end task: task still has active child tasks

    goroutine 196966 [running]:
    github.com/zrepl/zrepl/daemon/logging/trace.WithTask.func1.1(0xc000320680, 0xcde000)
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/logging/trace/trace.go:221 +0x2f7
    github.com/zrepl/zrepl/daemon/logging/trace.WithTask.func1()
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/logging/trace/trace.go:237 +0x38
    github.com/zrepl/zrepl/daemon/logging/trace.WithTaskAndSpan.func1()
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/logging/trace/trace_convenience.go:41 +0x37
    github.com/zrepl/zrepl/daemon/job.(*PassiveSide).Run.func1(0xdcf780, 0xc0000a3560, 0xdc65a0, 0xc00035e620, 0xc0000a34d0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/job/passive.go:194 +0x2e7
    github.com/zrepl/zrepl/rpc.NewServer.func3(0xdcf780, 0xc0001ce4b0, 0xdc65e0, 0xc00035e600, 0xc0000a34d0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/rpc_server.go:82 +0xd5
    github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn(0xc0000a2ba0, 0xc00018eca0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:149 +0x3be
    github.com/zrepl/zrepl/rpc/dataconn.(*Server).Serve.func3(0xc0000b8180, 0xc0000a2ba0, 0xc00018eca0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:108 +0x5d
    created by github.com/zrepl/zrepl/rpc/dataconn.(*Server).Serve
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:106 +0x24a
    2020-07-25T19:58:55-04:00 [ERRO][latitude][rpc.data][Pt4F$gCWT$gCWT.fzhc]: cannot write send stream err="frameconn: shutting down"
    panic: end task: 1 active child tasks: end task: task still has active child tasks

fixes #348
2020-07-26 20:32:35 +02:00
Christian Schwarz
10a14a8c50 [#307] add package trace, integrate it with logging, and adopt it throughout zrepl
package trace:

- introduce the concept of tasks and spans, tracked as a linked list within ctx
    - see package-level docs for an overview of the concepts
    - **main feature 1**: unique stack of task and span IDs
        - makes it easy to follow a series of log entries in concurrent code
    - **main feature 2**: ability to produce a chrome://tracing-compatible trace file
        - either via an env variable or a `zrepl pprof` subcommand
        - this is not a CPU profile; we already have Go pprof for that
        - but it is very useful to visually inspect where the
          replication / snapshotter / pruner spends its time
          ( fixes #307 )

usage in package daemon/logging:

- goal: every log entry should have a trace field with the ID stack from package trace

- make `logging.GetLogger(ctx, Subsys)` the authoritative `logger.Logger` factory function
    - the context carries a linked list of injected fields which
      `logging.GetLogger` adds to the logger it returns
    - `logging.GetLogger` also uses package `trace` to get the
      task-and-span-stack and injects it into the returned logger's fields
2020-05-19 11:30:02 +02:00
Christian Schwarz
bcb5965617 [#307] rpc: proper handling of context cancellation for transportmux + dataconn
- prior to this patch, context cancellation would leave rpc.Server open
- this did not cause problems because the context was only cancelled by SIGINT,
  which was immediately followed by os.Exit
2020-05-18 19:46:24 +02:00
Christian Schwarz
0e5c77d2be [#277] rpc + zfs: drop zfs.StreamCopier, use io.ReadCloser instead
2020-05-18 19:46:24 +02:00
Christian Schwarz
5b97953bfb run golangci-lint and apply suggested fixes
2019-03-27 13:12:26 +01:00
Christian Schwarz
afed762774 format source tree using goimports
2019-03-22 19:41:12 +01:00
Christian Schwarz
b2c5ffcaea rpc: dataconn: handle incorrect handler return values
refs #137
2019-03-16 14:47:29 +01:00
Christian Schwarz
c87759affe replication/driver: automatic retries on connectivity-related errors
2019-03-13 15:00:40 +01:00
Christian Schwarz
07b43bffa4 replication: refactor driving logic (no more explicit state machine)
2019-03-13 15:00:40 +01:00
Christian Schwarz
796c5ad42d rpc rewrite: control RPCs using gRPC + separate RPC for data transfer
transport/ssh: update go-netssh to new version
    => supports CloseWrite and Deadlines
    => build: require Go 1.11 (netssh requires it)
2019-03-13 13:53:48 +01:00