Commit Graph

25 Commits

Author SHA1 Message Date
Mathias Fredriksson
2d545c87cd rpc: fix read mutex unlock in dataconn stream
closes #413
2021-01-25 00:16:01 +01:00
Christian Schwarz
480176ba2d rpc/dataconn: fix go1.15-discovered recursive Error() method impl
rpc/dataconn/dataconn_client.go:69:9: Sprintf format %s with arg e causes recursive Error method call
2020-08-12 21:41:31 +02:00
Christian Schwarz
fa586b493c [#348] fix crash on early-recv error
* The SendStream.Close() was not called by dataconn.Server, which left the zfs send process dangling.
* When the source job's ctx interceptor closed the task, the dangling zfs send was detect by the trace package and panicked.

    020-07-25T19:54:41-04:00 [ERRO][latitude][rpc.data][cyZj$J3Ca$J3Ca.CJwB]: cannot write send stream err="frameconn: shutting down"
    panic: end task: 1 active child tasks: end task: task still has active child tasks

    goroutine 196966 [running]:
    github.com/zrepl/zrepl/daemon/logging/trace.WithTask.func1.1(0xc000320680, 0xcde000)
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/logging/trace/trace.go:221 +0x2f7
    github.com/zrepl/zrepl/daemon/logging/trace.WithTask.func1()
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/logging/trace/trace.go:237 +0x38
    github.com/zrepl/zrepl/daemon/logging/trace.WithTaskAndSpan.func1()
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/logging/trace/trace_convenience.go:41 +0x37
    github.com/zrepl/zrepl/daemon/job.(*PassiveSide).Run.func1(0xdcf780, 0xc0000a3560, 0xdc65a0, 0xc00035e620, 0xc0000a34d0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/daemon/job/passive.go:194 +0x2e7
    github.com/zrepl/zrepl/rpc.NewServer.func3(0xdcf780, 0xc0001ce4b0, 0xdc65e0, 0xc00035e600, 0xc0000a34d0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/rpc_server.go:82 +0xd5
    github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn(0xc0000a2ba0, 0xc00018eca0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:149 +0x3be
    github.com/zrepl/zrepl/rpc/dataconn.(*Server).Serve.func3(0xc0000b8180, 0xc0000a2ba0, 0xc00018eca0)
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:108 +0x5d
    created by github.com/zrepl/zrepl/rpc/dataconn.(*Server).Serve
            /home/jeremy/go/src/github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:106 +0x24a
    2020-07-25T19:58:55-04:00 [ERRO][latitude][rpc.data][Pt4F$gCWT$gCWT.fzhc]: cannot write send stream err="frameconn: shutting down"
    panic: end task: 1 active child tasks: end task: task still has active child tasks

fixes #348
2020-07-26 20:32:35 +02:00
Christian Schwarz
10a14a8c50 [#307] add package trace, integrate it with logging, and adopt it throughout zrepl
package trace:

- introduce the concept of tasks and spans, tracked as linked list within ctx
    - see package-level docs for an overview of the concepts
    - **main feature 1**: unique stack of task and span IDs
        - makes it easy to follow a series of log entries in concurrent code
    - **main feature 2**: ability to produce a chrome://tracing-compatible trace file
        - either via an env variable or a `zrepl pprof` subcommand
        - this is not a CPU profile, we already have go pprof for that
        - but it is very useful to visually inspect where the
          replication / snapshotter / pruner spends its time
          ( fixes #307 )

usage in package daemon/logging:

- goal: every log entry should have a trace field with the ID stack from package trace

- make `logging.GetLogger(ctx, Subsys)` the authoritative `logger.Logger` factory function
    - the context carries a linked list of injected fields which
      `logging.GetLogger` adds to the logger it returns
    - `logging.GetLogger` also uses package `trace` to get the
      task-and-span-stack and injects it into the returned logger's fields
2020-05-19 11:30:02 +02:00
Christian Schwarz
bcb5965617 [#307] rpc: proper handling of context cancellation for transportmux + dataconn
- prior to this patch, context cancellation would leave rpc.Server open
- did not make problems because context was only cancelled by SIGINT,
  which was immediately followed by os.Exit
2020-05-18 19:46:24 +02:00
Christian Schwarz
0e5c77d2be [#277] rpc + zfs: drop zfs.StreamCopier, use io.ReadCloser instead 2020-05-18 19:46:24 +02:00
Christian Schwarz
b4abebce00 rpc/dataconn/timeoutconn: tests: relax deadline in timeout tests 2020-05-18 19:46:24 +02:00
InsanePrawn
44bd354eae Spellcheck all files
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
2020-02-24 16:06:09 +01:00
Christian Schwarz
02b3b4f80c fix some typos 2020-02-17 18:02:04 +01:00
Christian Schwarz
4994b7a9ea rpc/dataconn + build: support GOOS={solaris,illumos} 2019-11-16 22:07:47 +01:00
Christian Schwarz
a6b578b648 rpc/dataconn/stream: Conn: handle concurrent Close calls + goroutine leak fix
* Add Close() in closeState to identify the first closer
* Non-first closers get an error
* Reads and Writes from the Conn get an error if the conn was closed
  during the Read / Write was running
* The first closer starts _separate_ goroutine draining the c.frameReads channel
* The first closer then waits for the goroutine that fills c.frameReads
  to exit

refs 3bfe0c16d0
fixes #174

readFrames would block on `reads <-`
   but only after that would stream.Conn.readFrames close c.waitReadFramesDone
   which was too late because stream.Conn.Close would wait for c.waitReadFramesDone to be closed before draining the channel
                              ^^^^^^ (not frameconn.Conn, that closed successfully)

   195 @ 0x1032ae0 0x1006cab 0x1006c81 0x1006a65 0x15505be 0x155163e 0x1060bc1
           0x15505bd       github.com/zrepl/zrepl/rpc/dataconn/stream.readFrames+0x16d             github.com/zrepl/zrepl/rpc/dataconn/stream/stream.go:220
           0x155163d       github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).readFrames+0xbd      github.com/zrepl/zrepl/rpc/dataconn/stream/stream_conn.go:71

   195 @ 0x1032ae0 0x10078c8 0x100789e 0x100758b 0x1552678 0x1557a4b 0x1556aec 0x1060bc1
           0x1552677       github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).Close+0x77           github.com/zrepl/zrepl/rpc/dataconn/stream/stream_conn.go:191
           0x1557a4a       github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn.func1+0x5a      github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:93
           0x1556aeb       github.com/zrepl/zrepl/rpc/dataconn.(*Server).serveConn+0x87b           github.com/zrepl/zrepl/rpc/dataconn/dataconn_server.go:176
2019-09-29 19:05:54 +02:00
Christian Schwarz
8c88e168c1 rpc/dataconn/client: ReqRecv to log level Debug
reported by @avg-l
2019-09-28 11:49:20 +02:00
Christian Schwarz
a78c854404 rpc/dataconn/frameconn: mask ECONNRESET error on Close()
fixes #190
2019-09-28 11:49:20 +02:00
Christian Schwarz
77d3a1ad4d build: drop go Dep, switch to modules, support Go 1.13
bump enumer to v1.1.1
bump golangci-lint to v1.17.1

no `go mod tidy` because 1.13 and 1.12 seem to alter each other's output

fixes #112
2019-09-14 13:36:44 +02:00
Christian Schwarz
3bfe0c16d0 rpc/dataconn/stream: fix goroutine leaks & transitive buffer leaks
fixes #174
2019-09-07 22:22:05 +02:00
Christian Schwarz
254a292362 rpc/timeoutconn: platform-independent sizes for computing syscall.Iovec.Len
refs #184
2019-08-19 18:11:25 +02:00
Christian Schwarz
95e16f01c1 rpc/dataconn: audit and comment use of unsafe for readv usage
Go 1.13 will add a more precise escape analysis:
https://tip.golang.org/doc/go1.13#compiler

Reviewed usage of unsafe to ensure we adhere to the unsafe.Pointer safety rules:
https://tip.golang.org/pkg/unsafe/#Pointer
2019-08-19 18:11:25 +02:00
Christian Schwarz
d6304f4f1d rpc/dataconn: fix I/O timeout on variable receive rate
refs #162
2019-03-31 14:20:06 +02:00
Christian Schwarz
5b97953bfb run golangci-lint and apply suggested fixes 2019-03-27 13:12:26 +01:00
Christian Schwarz
afed762774 format source tree using goimports 2019-03-22 19:41:12 +01:00
Christian Schwarz
b2c5ffcaea rpc: dataconn: handle incorrect handler return values
refs #137
2019-03-16 14:47:29 +01:00
Christian Schwarz
c87759affe replication/driver: automatic retries on connectivity-related errors 2019-03-13 15:00:40 +01:00
Christian Schwarz
07b43bffa4 replication: refactor driving logic (no more explicit state machine) 2019-03-13 15:00:40 +01:00
Christian Schwarz
0230c6321f rpc/dataconn: microbenchmark 2019-03-13 13:57:21 +01:00
Christian Schwarz
796c5ad42d rpc rewrite: control RPCs using gRPC + separate RPC for data transfer
transport/ssh: update go-netssh to new version
    => supports CloseWrite and Deadlines
    => build: require Go 1.11 (netssh requires it)
2019-03-13 13:53:48 +01:00