6 Debugging
Christian Schwarz edited this page 2021-02-28 23:50:56 +01:00

Debugging Setup

  • Create 2 VMs, configure a file-backed ZFS pool on each of them
  • The VMs should share a private network or a bridge network with the host
  • Write yourself some scripting to build the zrepl binary on the host and scp it into the guests
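
The build-and-copy step above could be scripted roughly as follows. This is only a sketch under assumptions that are not part of this page: the guest names, the remote install path, and the idea that running go build in the repository root yields the zrepl binary are all placeholders to adapt. The RUN variable is a hypothetical dry-run switch.

```shell
#!/usr/bin/env bash
# Sketch: build zrepl on the host and copy it into the guest VMs.
# Guest names, remote path and the plain `go build` invocation are
# assumptions for illustration, not taken from the zrepl build system.
set -eu

GUESTS=${GUESTS:-"root@zrepl-dev-1 root@zrepl-dev-2"}  # hypothetical guests
REMOTE_PATH=${REMOTE_PATH:-/usr/local/bin/zrepl}       # hypothetical install path
RUN=${RUN:-}                                           # RUN=echo prints commands only

build_and_deploy() {
    local out=./zrepl-dev-build
    local guest
    # Cross-compile for the (assumed) linux/amd64 guests.
    $RUN env GOOS=linux GOARCH=amd64 go build -o "$out" .
    # Copy the fresh binary into every guest.
    for guest in $GUESTS; do
        $RUN scp "$out" "$guest:$REMOTE_PATH"
    done
}
```

Sourcing the file and calling build_and_deploy from the zrepl checkout performs the build and the copies; with RUN=echo set beforehand it only prints the commands it would run.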

Tasks:

  • Reduce network bandwidth between the VMs:
  • Fake zfs errors
    • Create a wrapper shell script and add its directory to the zrepl daemon's PATH
    • Example Script:
      root@zrepl-dev-debian-2:[~]: cat mockpath/zfs
      #!/usr/bin/env bash
      set -eu
      args=("$@")
      
      ZREPL_MOCK_ZFS_PATH=/usr/local/sbin/zfs
      
      
      #if echo "${args[@]}" | grep -E 'list.*snapshot' > /dev/null; then
      #    echo "sleeping ${args[@]}" >> /tmp/mocklog
      #    sleep 700
      #fi
      # Simulate a failing `zfs recv -s`: print to stdout and stderr, stall,
      # pass through 1 MiB of the stream from stdin, then exit non-zero.
      if echo "${args[@]}" | grep -qE 'recv -s'; then
          echo foo
          echo bar 1>&2
          sleep 10
          dd bs=1M count=1
          exit 23
          #echo "unenced is blocked to be received by this mock" 1>&2
          #exit 23
      fi
      
      
      exec "$ZREPL_MOCK_ZFS_PATH" "${args[@]}"
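
The two tasks above might be driven by helpers like the ones below. This is a hedged sketch: it assumes tc from iproute2 is available in the guest, and the interface name, rate, burst, latency, and mock directory are placeholders rather than values from this page. The token bucket filter (tbf) is just one of several qdiscs that can throttle a link. The RUN variable is a hypothetical dry-run switch.

```shell
#!/usr/bin/env bash
# Sketch: throttle the link between the VMs and start the daemon with a
# mock zfs in front of the real one. All concrete values are assumptions.
set -eu
RUN=${RUN:-}   # RUN=echo prints the commands instead of running them

# Throttle outgoing traffic on an interface with a token bucket filter.
limit_bandwidth() {
    local dev=$1 rate=$2
    $RUN tc qdisc add dev "$dev" root tbf rate "$rate" burst 32kbit latency 400ms
}

# Remove the throttle again.
unlimit_bandwidth() {
    local dev=$1
    $RUN tc qdisc del dev "$dev" root
}

# Start the daemon with the mock directory first in PATH, so the wrapper
# named zfs is found before the real binary.
daemon_with_mock_zfs() {
    local mockdir=$1
    $RUN env PATH="$mockdir:$PATH" zrepl daemon
}
```

For example, limit_bandwidth eth0 1mbit inside a guest caps its egress at roughly 1 Mbit/s, and daemon_with_mock_zfs /root/mockpath runs the daemon against the wrapper script.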
      

Memory Leaks & Goroutine Leaks

Good examples of situations where the following instructions helped with debugging:

Instructions

  1. Run zrepl with autostarted pprof server on port :12345 and prometheus endpoint on :22345

    • configure the Prometheus endpoint in the config
    • if the daemon is already running: zrepl pprof listen on :12345 (for example when we want to capture a rare deadlock that restarting the daemon would resolve)
      • or, if restarting the daemon is acceptable: ZREPL_DAEMON_AUTOSTART_PPROF_SERVER=:12345 zrepl daemon
  2. watch 'curl localhost:22345/metrics | grep -v "#" | grep memstats'
    => if memory is leaking, go_memstats_heap_inuse_bytes or go_memstats_heap_alloc_bytes keeps rising over time

  3. watch 'curl "http://localhost:12345/debug/pprof/goroutine?debug=1"'

    9 @ 0x469db0 0x43d5f4 0x43d5ca 0x43d355 0xc767b0 0x498f51
    #	0xc767af	github.com/zrepl/zrepl/rpc/dataconn/stream.doWriteStream.func1+0x2ff	/mnt/zrepl/rpc/dataconn/stream/stream.go:92
    

    => the leading count (9 in this sample) kept increasing over time => look at the doWriteStream implementation and fix the goroutine leak (see commit)
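
Instead of eyeballing watch output, steps 2 and 3 could be logged over time with a small sampler like the one below. It is a sketch, not part of zrepl: it assumes the ports used above, the Prometheus text format for the metrics endpoint, and the "goroutine profile: total N" header line that Go's pprof handler prints with debug=1. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: periodically record heap usage and goroutine count.
# URLs follow the ports used in the instructions above.
set -eu
METRICS_URL=${METRICS_URL:-"http://localhost:22345/metrics"}
PPROF_URL=${PPROF_URL:-"http://localhost:12345/debug/pprof/goroutine?debug=1"}

# Print the value of go_memstats_heap_inuse_bytes from metrics text on stdin.
heap_inuse() {
    awk '$1 == "go_memstats_heap_inuse_bytes" { print $2 }'
}

# Print N from the "goroutine profile: total N" header on stdin.
goroutine_total() {
    awk '/^goroutine profile: total/ { print $4; exit }'
}

# Emit one timestamped sample per interval (seconds) until interrupted.
sample_loop() {
    local interval=${1:-10}
    while true; do
        printf '%s heap_inuse=%s goroutines=%s\n' \
            "$(date +%T)" \
            "$(curl -s "$METRICS_URL" | heap_inuse)" \
            "$(curl -s "$PPROF_URL" | goroutine_total)"
        sleep "$interval"
    done
}
```

Running sample_loop 30 >> /tmp/zrepl-leak.log leaves a trace in which a steadily growing goroutines value points at a leak like the one described above.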