zrepl/replication/report/replication_report.go
Christian Schwarz 2d8c3692ec rework resume token validation to allow resuming from raw sends of unencrypted datasets
Before this change, resuming from an unencrypted dataset with
send.raw=true specified wouldn't work with zrepl due to overly
restrictive resume token checking.

An initial PR to fix this was made in https://github.com/zrepl/zrepl/pull/503
but it didn't address the core of the problem.
The core of the problem was that zrepl assumed that if a resume token
contained `rawok=true, compressok=true`, the resulting send would be
encrypted. But if the sender dataset was unencrypted, such a resume would
actually result in an unencrypted send, which can be totally legitimate,
but zrepl failed to recognize that.

BACKGROUND
==========

The following snippets of OpenZFS code are insightful regarding how the
various ${X}ok values in the resume token are handled:

- 6c3c5fcfbe/module/zfs/dmu_send.c (L1947-L2012)
- 6c3c5fcfbe/module/zfs/dmu_recv.c (L877-L891)
- https://github.com/openzfs/zfs/blob/6c3c5fc/lib/libzfs/libzfs_sendrecv.c#L1663-L1672

Basically, some zfs send flags make the DMU send code set some DMU send
stream feature flags, although it's not a pure mapping, i.e., which DMU
send stream feature flags are used depends somewhat on the dataset (e.g.,
whether it is encrypted, or whether it uses zstd).

Then, the receiver looks at some (but not all) feature flags and maps
them to ${X}ok dataset zap attributes.

These are funnelled back to the sender 1:1 through the resume_token.

And the sender turns them into lzc flags.

As an example, let's look at zfs send --raw.
If the sender requests a raw send on an unencrypted dataset, the send
stream will not have the raw stream feature flag set, and hence the
resume token will not have the rawok field set. Instead, it will have
compressok, embedok, and, depending on whether large blocks are present
in the dataset, largeblockok set.

WHAT'S ZREPL'S ROLE IN THIS?
============================

zrepl provides a virtual encrypted sendflag that is like `raw`,
but further ensures that we only send encrypted datasets.

For any other resume token contents, it shouldn't do any checking,
because it's a futile effort to keep up with ZFS send/recv features
that are orthogonal to encryption.
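
A minimal sketch of the rule described above, under stated assumptions:
`SendFlags`, `checkSend`, and `ErrNotEncrypted` are hypothetical names
chosen for illustration and are not zrepl's actual API. The only check is
the one tied to encryption; everything else is left to ZFS.

```go
package main

import (
	"errors"
	"fmt"
)

// SendFlags is a hypothetical stand-in for the send options discussed above.
type SendFlags struct {
	Encrypted bool // virtual flag: like raw, but refuse unencrypted datasets
	Raw       bool // corresponds to `zfs send --raw`
}

// ErrNotEncrypted is a hypothetical error value for this sketch.
var ErrNotEncrypted = errors.New("send.encrypted=true but dataset is not encrypted")

// checkSend (hypothetical) enforces only the encryption-related rule and
// reports whether a raw send would be requested from ZFS.
func checkSend(flags SendFlags, datasetEncrypted bool) (useRaw bool, err error) {
	if flags.Encrypted && !datasetEncrypted {
		return false, ErrNotEncrypted
	}
	// A raw send of an unencrypted dataset is legitimate; the stream is
	// simply not encrypted.
	return flags.Encrypted || flags.Raw, nil
}

func main() {
	for _, datasetEncrypted := range []bool{false, true} {
		for _, encrypted := range []bool{false, true} {
			for _, raw := range []bool{false, true} {
				f := SendFlags{Encrypted: encrypted, Raw: raw}
				useRaw, err := checkSend(f, datasetEncrypted)
				fmt.Printf("datasetEncrypted=%v %+v -> useRaw=%v err=%v\n",
					datasetEncrypted, f, useRaw, err)
			}
		}
	}
}
```

The loop in `main` walks the same (Unencrypted/Encrypted FS) x
(send.encrypted) x (send.raw) matrix that the platformtests cover.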

CHANGES MADE IN THIS COMMIT
===========================

- Rip out a bunch of needless checking that zrepl would do during
  planning. These checks were there to give better error messages,
  but actually, the error messages created by the endpoint.Sender.Send
  RPC upon send args validation failure are good enough.
- Add platformtests to validate all combinations of
  (Unencrypted/Encrypted FS) x (send.encrypted = true | false) x (send.raw = true | false)
  for both non-resuming and resuming sends.

Additional manual testing done:
1. With zrepl 0.5, set up an unencrypted dataset with send.raw=true specified and send.encrypted not specified.
2. Observe that regular non-resuming send works, but resuming doesn't work.
3. Upgrade zrepl to this change.
4. Observe that both regular and resuming sends work.

closes https://github.com/zrepl/zrepl/pull/613
2022-09-25 17:32:02 +02:00


package report

import (
	"encoding/json"
	"time"
)

type Report struct {
	StartAt, FinishAt                      time.Time
	WaitReconnectSince, WaitReconnectUntil time.Time
	WaitReconnectError                     *TimedError
	Attempts                               []*AttemptReport
}

var _, _ = json.Marshal(&Report{})

type TimedError struct {
	Err  string
	Time time.Time
}

// NewTimedError panics if err is empty or t is the zero time.
func NewTimedError(err string, t time.Time) *TimedError {
	if err == "" {
		panic("error must not be empty")
	}
	if t.IsZero() {
		panic("t must be non-zero")
	}
	return &TimedError{err, t}
}

func (s *TimedError) Error() string {
	return s.Err
}

var _, _ = json.Marshal(&TimedError{})

type AttemptReport struct {
	State             AttemptState
	StartAt, FinishAt time.Time
	PlanError         *TimedError
	Filesystems       []*FilesystemReport
}

type AttemptState string

const (
	AttemptPlanning      AttemptState = "planning"
	AttemptPlanningError AttemptState = "planning-error"
	AttemptFanOutFSs     AttemptState = "fan-out-filesystems"
	AttemptFanOutError   AttemptState = "filesystem-error"
	AttemptDone          AttemptState = "done"
)

type FilesystemState string

const (
	FilesystemPlanning        FilesystemState = "planning"
	FilesystemPlanningErrored FilesystemState = "planning-error"
	FilesystemStepping        FilesystemState = "stepping"
	FilesystemSteppingErrored FilesystemState = "step-error"
	FilesystemDone            FilesystemState = "done"
)

type FsBlockedOn string

const (
	FsBlockedOnNothing           FsBlockedOn = "nothing"
	FsBlockedOnPlanningStepQueue FsBlockedOn = "plan-queue"
	FsBlockedOnParentInitialRepl FsBlockedOn = "parent-initial-repl"
	FsBlockedOnReplStepQueue     FsBlockedOn = "repl-queue"
)

type FilesystemReport struct {
	Info *FilesystemInfo

	State FilesystemState

	// Always valid.
	BlockedOn FsBlockedOn

	// Valid in State = FilesystemPlanningErrored
	PlanError *TimedError
	// Valid in State = FilesystemSteppingErrored
	StepError *TimedError

	// Valid in State = FilesystemStepping
	CurrentStep int
	Steps       []*StepReport
}

type FilesystemInfo struct {
	Name string
}

type StepReport struct {
	Info *StepInfo
}

type StepInfo struct {
	From, To        string
	Resumed         bool
	BytesExpected   uint64
	BytesReplicated uint64
}

func (a *AttemptReport) BytesSum() (expected, replicated uint64, containsInvalidSizeEstimates bool) {
	for _, fs := range a.Filesystems {
		e, r, fsContainsInvalidEstimate := fs.BytesSum()
		containsInvalidSizeEstimates = containsInvalidSizeEstimates || fsContainsInvalidEstimate
		expected += e
		replicated += r
	}
	return expected, replicated, containsInvalidSizeEstimates
}

func (f *FilesystemReport) BytesSum() (expected, replicated uint64, containsInvalidSizeEstimates bool) {
	for _, step := range f.Steps {
		expected += step.Info.BytesExpected
		replicated += step.Info.BytesReplicated
		containsInvalidSizeEstimates = containsInvalidSizeEstimates || step.Info.BytesExpected == 0
	}
	return
}

func (f *AttemptReport) FilesystemsByState() map[FilesystemState][]*FilesystemReport {
	r := make(map[FilesystemState][]*FilesystemReport, 4)
	for _, fs := range f.Filesystems {
		l := r[fs.State]
		l = append(l, fs)
		r[fs.State] = l
	}
	return r
}

func (f *FilesystemReport) Error() *TimedError {
	switch f.State {
	case FilesystemPlanningErrored:
		return f.PlanError
	case FilesystemSteppingErrored:
		return f.StepError
	}
	return nil
}

// may return nil
func (f *FilesystemReport) NextStep() *StepReport {
	switch f.State {
	case FilesystemDone:
		return nil
	case FilesystemPlanningErrored:
		return nil
	case FilesystemSteppingErrored:
		return nil
	case FilesystemPlanning:
		return nil
	case FilesystemStepping:
		// invariant is that this is always correct
		// TODO what about 0-length Steps but short intermediary state?
		return f.Steps[f.CurrentStep]
	}
	panic("unreachable")
}

func (f *StepReport) IsIncremental() bool {
	return f.Info.From != ""
}

// Returns, for the latest replication attempt,
//
//	0 if there have not been any replication attempts,
//	-1 if the replication failed while enumerating file systems,
//	N if N filesystems could not be replicated successfully
func (r *Report) GetFailedFilesystemsCountInLatestAttempt() int {
	if len(r.Attempts) == 0 {
		return 0
	}

	a := r.Attempts[len(r.Attempts)-1]
	switch a.State {
	case AttemptPlanningError:
		return -1
	case AttemptFanOutError:
		var count int
		for _, f := range a.Filesystems {
			if f.Error() != nil {
				count++
			}
		}
		return count
	default:
		return 0
	}
}
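
As a rough usage illustration, the following standalone sketch shows how a caller might turn the byte sums into a progress line. It uses trimmed-down copies of the types above so that it compiles on its own; the `Name` field on `FilesystemReport` and the `progressLine` helper are assumptions made for this example and are not part of the package.

```go
package main

import "fmt"

// Trimmed-down copies of the report types, so the sketch is self-contained.
type StepInfo struct {
	From, To        string
	BytesExpected   uint64
	BytesReplicated uint64
}

type StepReport struct{ Info *StepInfo }

// Name is flattened into the report here for brevity (assumption).
type FilesystemReport struct {
	Name  string
	Steps []*StepReport
}

// BytesSum mirrors FilesystemReport.BytesSum: a step whose BytesExpected
// is 0 marks the overall estimate as unreliable.
func (f *FilesystemReport) BytesSum() (expected, replicated uint64, invalid bool) {
	for _, step := range f.Steps {
		expected += step.Info.BytesExpected
		replicated += step.Info.BytesReplicated
		invalid = invalid || step.Info.BytesExpected == 0
	}
	return
}

// progressLine is a hypothetical helper that formats the sums for display.
func progressLine(f *FilesystemReport) string {
	e, r, invalid := f.BytesSum()
	if invalid || e == 0 {
		return fmt.Sprintf("%s: %d bytes replicated (size estimate unavailable)", f.Name, r)
	}
	return fmt.Sprintf("%s: %d/%d bytes (%.0f%%)", f.Name, r, e, 100*float64(r)/float64(e))
}

func main() {
	fs := &FilesystemReport{
		Name: "zroot/data",
		Steps: []*StepReport{
			{Info: &StepInfo{From: "@a", To: "@b", BytesExpected: 1000, BytesReplicated: 1000}},
			{Info: &StepInfo{From: "@b", To: "@c", BytesExpected: 3000, BytesReplicated: 1000}},
		},
	}
	fmt.Println(progressLine(fs))
}
```

Checking the invalid-estimate flag before computing a percentage avoids presenting a misleading ratio when some step has no size estimate.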