- **Resumable Send & Recv Support**
  No knobs required, automatically used where supported.
- **Hold-Protected Send & Recv**
  Automatic ZFS holds to ensure that we can always resume a replication step.
- **Encrypted Send & Recv Support** for OpenZFS native encryption.
  Configurable at the job level, i.e., for all filesystems a job is responsible for.
- **Receive-side hold on last received dataset**
  The counterpart to the replication cursor bookmark on the send-side.
  Ensures that incremental replication will always be possible between a sender and receiver.

Design Doc
----------

The `replication/design.md` doc describes how we use ZFS holds and bookmarks to ensure that a single replication step is always resumable.

The replication algorithm described in the design doc introduces the notion of job IDs (please read the details in the design doc).
We reuse the job names for job IDs and use the `JobID` type to ensure that a job name can be embedded into hold tags, bookmark names, etc.
This might BREAK CONFIG on upgrade.

Protocol Version Bump
---------------------

This commit makes backwards-incompatible changes to the replication/pdu protobufs.
Thus, bump the version number used in the protocol handshake.

Replication Cursor Format Change
--------------------------------

The new replication cursor bookmark format is: `#zrepl_CURSOR_G_${this.GUID}_J_${jobid}`
Including the GUID enables transaction-safe moving-forward of the cursor.
Including the job id enables multiple sending jobs to send the same filesystem without interfering.
The `zrepl migrate replication-cursor:v1-v2` subcommand can be used to safely destroy old-format cursors once zrepl has created new-format cursors.

Changes in This Commit
----------------------

- package zfs
  - infrastructure for holds
  - infrastructure for resume token decoding
  - implement a variant of OpenZFS's `entity_namecheck` and use it for validation in new code
  - `ZFSSendArgs` to specify a ZFS send operation
    - validation code protects against malicious resume tokens by checking that the token encodes the same send parameters that the send-side would use if no resume token were available (i.e. same filesystem, `fromguid`, `toguid`)
  - `RecvOptions` support for the `recv -s` flag
  - convert a bunch of ZFS operations to be idempotent
    - achieved through more differentiated error message scraping / additional pre-/post-checks
- package replication/pdu
  - add field for encryption to send request messages
  - add fields for resume handling to send & recv request messages
  - receive requests now contain a `FilesystemVersion To` in addition to the filesystem into which the stream should be `recv`d
    - can use `zfs recv $root_fs/$client_id/path/to/dataset@${To.Name}`, which enables additional validation after recv (i.e. whether `To.Guid` matched what we received in the stream)
    - used to set `last-received-hold`
- package replication/logic
  - introduce `PlannerPolicy` struct, currently only used to configure whether encrypted sends should be requested from the sender
  - integrate encryption and resume token support into the `Step` struct
- package endpoint
  - move the concepts that endpoint builds on top of ZFS to a single file `endpoint/endpoint_zfs.go`
    - step-holds + step-bookmarks
    - last-received-hold
    - new replication cursor + old replication cursor compat code
  - adjust `endpoint/endpoint.go` handlers for
    - encryption
    - resumability
    - new replication cursor
    - last-received-hold
- client subcommand `zrepl holds list`: list all holds and hold-like bookmarks that zrepl thinks belong to it
- client subcommand `zrepl migrate replication-cursor:v1-v2`
The goal of this document is to describe what logical steps zrepl takes to replicate ZFS filesystems.

Note that the actual code executing these steps is spread over the `endpoint.{Sender,Receiver}` RPC endpoint implementations.
The code in `replication/{driver,logic}` (mostly `logic.Step.doReplication`) drives the replication and invokes the endpoint methods (locally or via RPC).
Hence, when trying to map the algorithm to the implementation, use the code in package `replication` as the entrypoint and follow the RPC calls.
- The Single Replication Step algorithm is implemented as described
  - step holds are implemented as described
  - step bookmarks rely on bookmark cloning, which is WIP in OpenZFS (see respective TODO comments)
  - the feature to hold the receive-side `to` immediately after replication is used to move the `last-received-hold` forward
    - this is a little more than what we allow in the algorithm (we only allow specifying a hold tag, whereas moving the `last-received-hold` is a little more complex)
  - the algorithm's sender-side callback is used to move the `zrepl_CURSOR` bookmark to point to `to`
- The `zrepl_CURSOR` bookmark and the `last-received-hold` ensure that future replication steps can always be done using incremental replication:
  - Both reference / hold the last successfully received snapshot `to`.
  - Thus, `to` or `#zrepl_CURSOR_G_${to.GUID}_J_${jobid}` can always be used for a future incremental replication step `to => future-to`:
    - incremental send will be possible because `to` doesn't go away, because we claim ownership of the `#zrepl_CURSOR` namespace
    - incremental recv will be possible iff
      - `to` will still be present on the recv-side because of the `last-received-hold`
      - the recv-side fs doesn't get newer snapshots than `to` in the meantime
        - guaranteed because the zrepl model of the receiver assumes ownership of the filesystems it receives into
        - if that assumption is broken, future replication attempts will fail with a conflict
- The Algorithm for Planning and Executing Replication of a Filesystem is a design draft and not used
- We also have Notes on Planning and Executing Replication of Multiple Filesystems
# Algorithm for a Single Replication Step

This section describes how a single replication step must be implemented.
A replication step is a full send of a snapshot `to` for initial replication, or an incremental send (`from => to`) for incremental replication.

The algorithm ensures resumability of the replication step in the presence of

- uncoordinated or unsynchronized destruction of the filesystem, snapshots, or bookmarks involved in the replication step
- network failures at any time
- other instances of this algorithm executing the same step in parallel (e.g. concurrent replication to different destinations)

To accomplish this goal, the algorithm assumes ownership of parts of the ZFS hold tag namespace and the bookmark namespace:

- holds with prefix `zrepl_STEP` on any snapshot are reserved for zrepl
- bookmarks with prefix `zrepl_STEP` are reserved for zrepl

Resumability of a step cannot be guaranteed if these sub-namespaces are manipulated through software other than zrepl.

Note that the algorithm does not ensure that a replication plan, which describes a set of replication steps, can be carried out successfully.
If that is desirable, additional measures outside of this algorithm must be taken.
## Definitions

### Step Completion & Invariants

The replication step (full `to` send or `from => to` send) is **complete** iff the algorithm ran to completion without errors, or a permanent non-network error is reported by sender or receiver.

Specifically, the algorithm may be invoked with the same `from` and `to` arguments, and potentially a `resume_token`, after a temporary (e.g. network-related) failure:
unless permanent errors occur, repeated invocations of the algorithm with an updated resume token will converge monotonically (but not strictly monotonically) toward completion.

Note that the mere existence of `to` on the receiving side does not constitute completion, since there may still be post-recv actions to be performed on sender and receiver.
### Job and Job ID

This algorithm supports multiple instances of itself running in parallel on the same step (full `to` / `from => to` pair).
An exemplary use case for this feature is concurrent replication jobs that replicate the same dataset to different receivers.

We require that all parallel invocations of this algorithm provide different and unique `jobid`s.
Violation of `jobid` uniqueness across parallel jobs may result in interference between instances of this algorithm, potentially compromising resumability.

After a step is completed, `jobid` is guaranteed not to be encoded in on-disk state.
Before a step is completed, there is no such guarantee.
Changing the `jobid` before step completion may compromise resumability and may leak the underlying ZFS holds or step bookmarks (i.e. zrepl won't clean them up).
Note the definition of **complete** above.
### Step Bookmark

A step bookmark is our equivalent of a ZFS hold, but for bookmarks.
A step bookmark is a ZFS bookmark whose name matches the following regex:

```
STEP_BOOKMARK = #zrepl_STEP_bm_G_([0-9a-f]+)_J_(.+)
```

- capture group `1` must be the guid of `zfs get guid ${STEP_BOOKMARK}`, encoded hexadecimally (without leading `0x`) and fixed-length (i.e. padded with leading zeroes)
- capture group `2` must be equal to the `jobid`
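For illustration, here is a minimal Go sketch of constructing and checking step bookmark names against this format. The helper names and the fixed 16-hex-digit zero padding are assumptions for the example, not zrepl's actual implementation. All Go sketches in this document are assumed to live in one illustrative package with the standard library (`fmt`, `regexp`, `strings`, `os/exec`, `io`, `errors`) imported.

```go
// Matches the STEP_BOOKMARK format defined above.
var stepBookmarkRE = regexp.MustCompile(`^#zrepl_STEP_bm_G_([0-9a-f]+)_J_(.+)$`)

// stepBookmarkName builds a step bookmark name for a snapshot guid and a jobid.
// ZFS guids are uint64, so we assume 16 hex digits padded with leading zeroes.
func stepBookmarkName(guid uint64, jobid string) string {
	return fmt.Sprintf("#zrepl_STEP_bm_G_%016x_J_%s", guid, jobid)
}

// parseStepBookmark reports whether name is a step bookmark and, if so,
// returns the encoded guid (as a hex string) and the jobid it belongs to.
func parseStepBookmark(name string) (guidHex, jobid string, ok bool) {
	m := stepBookmarkRE.FindStringSubmatch(name)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}
```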
## Algorithm

INPUT:

- `jobid`
- `from`: snapshot or bookmark; may be nil for a full send
- `to`: snapshot, must never be nil
- `resume_token` (may be nil)
- (`from` and `to` must be on the same filesystem)
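The later sketches in this document assume the following illustrative Go types for these INPUT values; the names are assumptions and do not mirror zrepl's actual `pdu` / `logic` types.

```go
// FilesystemVersion identifies a snapshot or bookmark on a filesystem.
type FilesystemVersion struct {
	// RelName is the name relative to the filesystem, including the
	// leading "@" (snapshot) or "#" (bookmark), e.g. "@snap1".
	RelName string
	Guid    uint64
}

// StepInput bundles the INPUT values listed above.
type StepInput struct {
	JobID       string
	Filesystem  string             // from and to must be versions of this filesystem
	From        *FilesystemVersion // nil => full send of To
	To          FilesystemVersion  // must never be nil, hence not a pointer
	ResumeToken string             // empty if no resumable receive state exists
}
```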
### Prepare-Phase

**Send-side: make sure `to` and `from` don't go away**

- hold `to` using `idempotent_hold(to, zrepl_STEP_J_${jobid})`
- make sure `from` doesn't go away:
  - if `from` is a snapshot: hold `from` using `idempotent_hold(from, zrepl_STEP_J_${jobid})`
  - else `idempotent_step_bookmark(from)` (`from` is a bookmark)
    - PERMANENT ERROR if this fails (e.g. because `from` is a bookmark whose snapshot no longer exists and we don't have bookmark copying yet (the ZFS feature is still in development))
      - Why? We must assume the given bookmark is externally managed (i.e. not in the `zrepl_STEP` namespace).
      - This means the bookmark is an externally created bookmark and cannot be trusted to persist until the replication step succeeds.
      - Maybe provide an override-knob for this behavior.

**Recv-side: no-op**

- `from` cannot go away once we have received enough data for the step to be resumable:

  ```
  # do a partial incremental recv @a=>@b (i.e. recv with -s, and abort the incremental recv in the middle)
  # try destroying the incremental source on the receiving side
  zfs destroy p1/recv@a
  cannot destroy 'p1/recv@a': snapshot has dependent clones
  use '-R' to destroy the following datasets:
  p1/recv/%recv
  # => doesn't work, because zfs recv is implemented as a `clone` internally, that's exactly what we want
  ```

- if recv-side `to` exists, goto cleanup-phase (no replication to do)
- `to` cannot be destroyed while being received, because it isn't visible as a snapshot yet (it isn't one yet, after all)
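As a rough illustration (not zrepl's actual code), the send-side part of the prepare-phase could look like the following sketch. It uses the `StepInput` type from above; `idempotentHold` and `idempotentStepBookmark` stand in for the pseudo-APIs defined under "Notes & Pseudo-APIs" below and are assumptions.

```go
// Send-side prepare-phase: secure `to` and `from` so the step stays resumable.
func prepareSendSide(in StepInput) error {
	stepTag := fmt.Sprintf("zrepl_STEP_J_%s", in.JobID)

	// hold `to` so it cannot be destroyed while the step is in progress
	if err := idempotentHold(in.Filesystem, in.To, stepTag); err != nil {
		return err
	}

	if in.From == nil {
		return nil // full send, nothing else to secure
	}
	if strings.HasPrefix(in.From.RelName, "@") {
		// `from` is a snapshot: secure it with a hold as well
		return idempotentHold(in.Filesystem, *in.From, stepTag)
	}
	// `from` is a bookmark: we can only secure it via a step bookmark, which
	// requires bookmark cloning => treat failure as a PERMANENT ERROR
	if err := idempotentStepBookmark(in.Filesystem, *in.From, in.JobID); err != nil {
		return fmt.Errorf("permanent error: cannot secure bookmark %q: %w", in.From.RelName, err)
	}
	return nil
}
```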
### Replication Phase

Attempt the replication step:
start `zfs send` and pipe its output into `zfs recv`, until an error occurs or `recv` returns without error.
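For the local case, one step corresponds roughly to the following `zfs` invocations, sketched here with Go's `os/exec`. In zrepl the stream actually travels through the RPC layer, so this only illustrates the flags involved: `recv -s` keeps partial receive state, `send -t` resumes from a token, `send -i` does the incremental send.

```go
// runStepLocally pipes `zfs send` into `zfs recv -s` on the same host.
func runStepLocally(in StepInput, recvFS string) error {
	var sendArgs []string
	switch {
	case in.ResumeToken != "":
		// resume a previously interrupted step
		sendArgs = []string{"send", "-t", in.ResumeToken}
	case in.From != nil:
		// incremental send from => to
		sendArgs = []string{"send", "-i", in.Filesystem + in.From.RelName, in.Filesystem + in.To.RelName}
	default:
		// initial full send of to
		sendArgs = []string{"send", in.Filesystem + in.To.RelName}
	}
	send := exec.Command("zfs", sendArgs...)
	recv := exec.Command("zfs", "recv", "-s", recvFS) // -s: save partial receive state

	pipe, err := send.StdoutPipe()
	if err != nil {
		return err
	}
	recv.Stdin = pipe

	if err := send.Start(); err != nil {
		return err
	}
	if err := recv.Run(); err != nil {
		_ = send.Process.Kill()
		return fmt.Errorf("recv failed: %w", err)
	}
	return send.Wait()
}
```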
Let us now think about interferences that could occur during this phase, and ensure that none of them compromise the goals of this algorithm, i.e., monotonic convergence toward step completion using resumable send & recv.
**Safety from External Pruning**

We are safe from pruning during the replication step because we have guarantees that no external action will destroy send-side `from` and `to`, and recv-side `to` (for both snapshot and bookmark `from`s).

**Safety in Presence of Network Failures During Replication**

Network failures during replication can be recovered from using resumable send & recv:

- Network failure before the receive-side could store data:
  - Send-side `from` and `to` are guaranteed to still be present due to zfs holds
  - recv-side `from` may no longer exist because we don't `hold` it explicitly
    - if that is the case, ERROR OUT, the step can never complete
    - if the step planning algorithm does not include the step, for example because a snapshot filter configuration was changed by the user in between, which hides `from` or `to` from the second planning attempt: tough luck, we're leaking all holds
- Network failure during the replication
  - send-side `from` and `to` are still present due to zfs holds
  - recv-side `from` is still present because the partial receive state prevents its destruction (see prepare-phase)
  - if the recv-side has a resume token, the resume token will continue to work on the sender because `from` and `to` are still present
- Network failure at the end of the replication step stream transmission
  - Variant A: failure from the sender's perspective, success from the receiver's perspective
    - receive-side `to` doesn't have a hold and could be destroyed at any time
    - receive-side `from` doesn't have a hold and could be destroyed at any time
    - thus, when the step is invoked again, pattern match on `(from_exists, to_exists)`:
      - `(true, true)`: prepare-phase will early-exit
      - `(false, true)`: prepare-phase will error out bc. `from` does not exist
      - `(true, false)`: entire step will be re-done
        - FIXME monotonicity requirement does not hold
      - `(false, false)`: prepare-phase will error out bc. `from` does not exist
  - Variant B: success from the sender's perspective, failure from the receiver's perspective
    - No idea how this would happen except for bugs in error reporting in the replication protocol
    - Misclassification by the sender, most likely broken error handling in the sender or replication protocol
    - => the sender will release holds and move the replication cursor while the receiver won't => tough luck

If the RPC used for `zfs recv` returned without error, this phase is done.
(Although the replication step (this algorithm) is not yet **complete**, see the definition of complete.)
Cleanup-Phase
At the end of this phase, all intermediate state we built up to support the resumable replication of the step is destroyed. However, consumers of this algorithm might want to take advantage of the fact that we currently still have holds / step bookmarks.
Recv-side: Optionally hold to
with a caller-defined tag
Users of the algorithm might want to depend on to
being available on the receiving side for a longer duration than the lifetime of the current step's algorithm invocation.
For reasons explained in the next paragraph, we cannot guarantee that we have a hold
when recv
exists. If that were the case, we could take our time to provide a generalized callback to the user of the algorithm, and have them do whatever they want with to
while we guarantee that to
doesn't go away through the hold. But that functionality isn't available, so we only provide a fixed operation right after receive: take a hold
with a tag of the algorithm user's choice. That's what's needed to guarantee that a replication plan, consisting of multiple consecutive steps, can be carried out without a receive-side prune job interfering by destroying to
. Technically, this step is racy, i.e., it could happen that to
is destroyed between recv
and hold
. But this is a) unlikely and b) not fatal because we can detect that hold fails because the 'dataset does not existand re-do the entire transmission since we still hold send-side
fromand
to`, i.e. we just lose a lot of work in rare cases.
So why can't we have a hold
at the time recv
exits?
It seems like zfs send --holds
could be used to send the send-side's holds to the receive-side. But that flag always sends all holds, and --holds
is implemented in user-space libzfs, i.e., doesn't happen in the same txg as the recv completion. Thus, the race window is in fact only smaller (unless we oversaw some synchronization in userland zfs).
But if we're willing to entertain the idea a little further, we still hit the problem that --holds
sends all holds, whereas our goal is to temporarily have >= 1 hold that we own until the callback is done, and then release all of the received holds so that no holds created by us exist after this algorithm completes.
Since there could be concurrent zfs hold
invocations on the sender and receiver while this algorithm runs, and because --holds
doesn't provide info about which holds were sent, we cannot correctly destroy exactly those holds that we received.
Send-side: callback to consumer, then cleanup
We can provide the algorithm user with a generic callback because we have holds / step bookmarks for to
and from
, respectively.
Example use case for the sender-side callback is the replication cursor, see algorithm for filesystem replication below.
After the callback is done: cleanup holds & step bookmark:
idempotent_release(to, zrepl_STEP_J_${jobid})
- make sure
from
can now go away:- if
from
is a snapshot:idempotent_release(from, zrepl_STEP_J_${jobid})
- else
idempotent_step_bookmark_destroy(from)
- if
from
fulfills the properties of a step bookmark: destroy it - otherwise: it's a bookmark not created by this algorithm that happened to be used for replication: leave it alone
- that can only happen if user used the override knob
- if
- if
Note that "make sure from
can now go away" is the inverse of "Send-side: make sure to
and from
don't go away". Those are relatively complex operations and should be implemented in the same file next to each other to ease maintainability.
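To make that symmetry explicit, here is a sketch of the send-side cleanup that mirrors the `prepareSendSide` sketch from the prepare-phase; `idempotentRelease` and `idempotentStepBookmarkDestroy` are assumed stand-ins for the pseudo-APIs below.

```go
// Send-side cleanup: the inverse of prepareSendSide; best kept next to it.
func cleanupSendSide(in StepInput) error {
	stepTag := fmt.Sprintf("zrepl_STEP_J_%s", in.JobID)
	if err := idempotentRelease(in.Filesystem, in.To, stepTag); err != nil {
		return err
	}
	if in.From == nil {
		return nil
	}
	if strings.HasPrefix(in.From.RelName, "@") {
		// `from` was secured by a hold => release it
		return idempotentRelease(in.Filesystem, *in.From, stepTag)
	}
	// `from` is a bookmark: destroy it only if it is a step bookmark of this
	// job; externally created bookmarks are left alone.
	return idempotentStepBookmarkDestroy(in.Filesystem, *in.From, in.JobID)
}
```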
## Notes & Pseudo-APIs Used in the Algorithm

- `idempotent_hold(snapshot s, string tag)`: like `zfs hold`, but doesn't error if the hold already exists
- `idempotent_release(snapshot s, string tag)`: like `zfs release`, but doesn't error if the hold has already been released
- `idempotent_step_bookmark(snapshot_or_bookmark b)`: creates a step bookmark of `b`
  - determine step bookmark name `N`
  - if `N` already exists, verify it's a correct step bookmark => DONE or ERROR
  - if `b` is a snapshot, issue `zfs bookmark`
  - if `b` is a bookmark:
    - if bookmark cloning is supported, use it to duplicate the bookmark
    - else ERROR OUT, with an error that upper layers can identify as such, so that they are able to ignore the fact that we couldn't create a step bookmark
- `idempotent_destroy(bookmark #b_$GUID, of a snapshot s)`: must atomically check that `$GUID == s.guid` before destroying the bookmark
- `idempotent_bookmark(snapshot s, $GUID, name #b_$GUID)`: must atomically check that `$GUID == s.guid` at the time of bookmark creation
- `idempotent_destroy(snapshot s)`: must atomically check that zrepl's `s.guid` matches the current `guid` of the snapshot (i.e. destroy by guid)
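As an example of how the first two pseudo-APIs can be approximated on top of the `zfs` CLI, consider the sketch below. The error-string matching is an assumption: a real implementation has to scrape the exact messages of the OpenZFS version it targets, or use additional pre-/post-checks, as the commit notes above mention.

```go
// idempotentHold is like `zfs hold`, but tolerates an already existing hold.
func idempotentHold(fs string, v FilesystemVersion, tag string) error {
	out, err := exec.Command("zfs", "hold", tag, fs+v.RelName).CombinedOutput()
	if err != nil && strings.Contains(string(out), "tag already exists") {
		return nil // hold is already in place => idempotent success
	}
	return err
}

// idempotentRelease is like `zfs release`, but tolerates a missing hold.
func idempotentRelease(fs string, v FilesystemVersion, tag string) error {
	out, err := exec.Command("zfs", "release", tag, fs+v.RelName).CombinedOutput()
	if err != nil && strings.Contains(string(out), "no such tag") {
		return nil // hold was already released => idempotent success
	}
	return err
}
```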
# Algorithm for Planning and Executing Replication of a Filesystem (planned, not implemented yet)

This algorithm describes how a filesystem or zvol is replicated in zrepl.

The algorithm is invoked with a `jobid`, a `sender`, a `receiver`, and a sender-side filesystem `path`.
It builds a diff between the sender's and receiver's filesystem bookmarks + snapshots and determines whether a replication conflict exists or whether fast-forward replication using full or incremental sends is possible.
In case of conflict, the algorithm errors out with a conflict description that can be used to manually or automatically resolve the conflict.
Otherwise, the algorithm builds a list of replication steps that are then worked on sequentially by the "Algorithm for a Single Replication Step".

The algorithm ensures that a plan can be executed exactly as planned by acquiring appropriate zfs holds.
The algorithm can be configured to retry a plan when encountering non-permanent errors (e.g. network errors).
However, permanent errors result in the plan being cancelled.

Regardless of whether a plan can be executed to completion or is cancelled, the algorithm guarantees that leftover artifacts (i.e. holds) of its invocation are cleaned up.
However, since network failure can occur at any point, there might be stale holds on the sending or receiving side after a crash or other error.
These will be cleaned up on a subsequent invocation of this algorithm with the same `jobid`.

The algorithm is fit to be executed in parallel from the same sender to different receivers.
To be clear: replicating in parallel from the same sender to different receivers is supported.
But one instance of the algorithm assumes ownership of the filesystem `path` on the receiver.

The algorithm reserves the following sub-namespaces:

- zfs holds: `zrepl_FS`
- bookmark names: `zrepl_FS`
## Definitions

### ZFS Filesystem Path

We refer to the name of a ZFS filesystem or volume as a *filesystem path*, sometimes just *path*.
The name includes the pool name.

### Sender and Receiver

Sender and receiver are passed as RPC stubs to the algorithm.
One of those RPC stubs will typically be local, i.e., call methods on a struct in the same address space.
The other RPC stub may invoke actual calls over the network (unless local replication is used).
The algorithm does not make any assumption about which object is local and which is an actual remote stub.
This design decouples the replication logic (described in this algorithm) from the question of which side (sender or receiver) initiates the replication.
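A hypothetical Go shape for these two stubs; the method set merely mirrors the calls used in the algorithm below (`receiver.ResumeToken`, `receiver.MostRecentFilesystemVersion`, ...) and is not zrepl's actual RPC interface.

```go
// Sender and Receiver as seen by the planning algorithm.
type Sender interface {
	// ListFilesystemVersions returns snapshots and bookmarks of fs, oldest first.
	ListFilesystemVersions(fs string) ([]FilesystemVersion, error)
	// Send produces the stream for one step
	// (full if from == nil, resumed if resumeToken != "").
	Send(fs string, from *FilesystemVersion, to FilesystemVersion, resumeToken string) (io.ReadCloser, error)
}

type Receiver interface {
	// ResumeToken returns the receive_resume_token of fs, or "" if there is none.
	ResumeToken(fs string) (string, error)
	// MostRecentFilesystemVersion returns the newest snapshot of fs, or nil if there is none.
	MostRecentFilesystemVersion(fs string) (*FilesystemVersion, error)
	// Receive consumes a send stream into fs, creating version to.
	Receive(fs string, to FilesystemVersion, stream io.Reader) error
}
```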
## Algorithm

**Cleanup previous invocations with the same `jobid`**

Do so by scanning the filesystem's list of snapshots and step bookmarks, and destroying any that were created by this algorithm when it was invoked with `jobid`.

TODO HOW

**Build a fast-forward list of replication steps `STEPS`**

`STEPS` may contain an optional initial full send, and subsequent incremental send steps.
`STEPS[0]` may also have a resume token.
If fast-forward is not possible, produce a conflict description and ERROR OUT.

TODOs:

- make it configurable which snapshots are included in the list (i.e. every one we see, only the most recent, at least one every X hours, ...)
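A sketch of how such a fast-forward `STEPS` list could be derived from the sender's snapshot list and the receiver's most recent version, reusing the types sketched earlier. Conflict analysis and the snapshot-selection TODO above are omitted; none of this is zrepl's actual planner code.

```go
// Step is one planned replication step; From == nil means an initial full send.
type Step struct {
	From        *FilesystemVersion
	To          FilesystemVersion
	ResumeToken string // only ever set on STEPS[0]
}

// planFastForward derives STEPS from the sender's snapshots (oldest first)
// and the receiver's most recent filesystem version (nil if none).
func planFastForward(senderSnaps []FilesystemVersion, recvMostRecent *FilesystemVersion) ([]Step, error) {
	if len(senderSnaps) == 0 {
		return nil, errors.New("sender has no snapshots to replicate")
	}
	start := 0
	var steps []Step
	if recvMostRecent == nil {
		// initial replication: full send of the oldest snapshot first
		steps = append(steps, Step{From: nil, To: senderSnaps[0]})
	} else {
		// fast-forward: find the receiver's most recent version on the sender by guid
		idx := -1
		for i, s := range senderSnaps {
			if s.Guid == recvMostRecent.Guid {
				idx = i
				break
			}
		}
		if idx < 0 {
			return nil, errors.New("no common version found => replication conflict")
		}
		start = idx
	}
	// incremental steps between consecutive remaining snapshots
	for i := start + 1; i < len(senderSnaps); i++ {
		steps = append(steps, Step{From: &senderSnaps[i-1], To: senderSnaps[i]})
	}
	return steps, nil
}
```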
**Ensure that we will be able to carry out all steps** by acquiring holds or fsstep bookmarks on the sending side:

- `idempotent_hold([s.to for s in STEPS], zrepl_FS_J_${jobid})`
- `if STEPS[0].from != nil: idempotent_FSSTEP_bookmark(STEPS[0].from, zrepl_FSSTEP_bm_G_${STEPS[0].from.guid}_J_${jobid})`

**Determine which steps have not been completed (`uncompleted_steps`)** (we might be in a second invocation of this algorithm after a network failure, and some steps might already be done):

- `res_tok := receiver.ResumeToken(fs)`
- `rmrfsv := receiver.MostRecentFilesystemVersion(fs)`
- if `res_tok != nil`: ensure that `res_tok` has a corresponding step in `STEPS`, otherwise ERROR OUT
- if `rmrfsv != nil`: ensure that `rmrfsv` has a corresponding step in `STEPS`, otherwise ERROR OUT
- if `(res_tok != nil && rmrfsv != nil)`: ensure that `res_tok` is the step subsequent to the one we found for `rmrfsv`
- if both are nil, we are at the beginning, `uncompleted_steps = STEPS`, and goto the next block
- `rstep := if res_tok != nil { res_tok } else { rmrfsv }`
- `uncompleted_steps := STEPS[find_step_idx(STEPS, rstep).expect("must exist, checked above"):]`
- Note that we do not explicitly check for the completion of prior replication steps.
  All we care about is what needs to be done from `rstep` onward.
  - This is intentional and necessary because we cumulatively release all holds and step bookmarks made for steps that precede a just-completed step (see the next paragraph)
**Execute uncompleted steps**

Invoke the "Algorithm for a Single Replication Step" for each step in `uncompleted_steps`.
Remember to pass the resume token if one exists.

In the wind-down phase of each replication step `from => to`, while the per-step algorithm still holds send-side `from` and `to`, as well as recv-side `to`:

- Sending side: Idempotently move the replication cursor to `to` (see the sketch after this section).
  - Why: the replication cursor tracks the sender's last known replication state, outlives the current plan, and is required to guarantee that future invocations of this algorithm find an incremental path.
  - Impl for 'atomic' move: have two cursors (fixes #177)
    - Idempotently cursor-bookmark `to` using `idempotent_bookmark(to, to.guid, #zrepl_replication_cursor_G_${to.guid}_J_${jobid})`
    - Idempotently destroy the old cursor-bookmark of `from`: `idempotent_destroy(#zrepl_replication_cursor_G_${from.guid}_J_${jobid}, from)`
    - If `from` is a snapshot, release the hold on it using `idempotent_release(from, zrepl_J_${jobid})`
    - FIXME: resumability if we crash in between those operations (or scrap it and wait for channel programs to bring atomicity)
- Receiving side: `idempotent_hold(to, zrepl_FS_J_${jobid})`, because it's going to be the next step's `from`
  - As discussed in the section on the per-step algorithm, this is a feature provided to us by the per-step algorithm.
- Receiving side + Sending side (order doesn't matter): make sure all holds & step bookmarks made by this plan on already-replicated snapshots are released / destroyed:
  - `idempotent_release_prior_and_including(from, zrepl_FS_TODO)`
  - `idempotent_step_bookmark_destroy_prior_and_including(from, zrepl_FS_TODO)`

**Cleanup receiving and sending side (order doesn't matter)**

- `idempotent_release_prior_and_including(STEPS[-1].to, zrepl_FS_TODO)`
- `idempotent_step_bookmark_destroy_prior_and_including(STEPS[-1].from, zrepl_FS_TODO)`
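The 'atomic' two-cursor move from the sending-side wind-down above, as a sketch; `idempotentBookmark` and `idempotentDestroyBookmark` stand for the guid-checking pseudo-APIs from the per-step notes, and the hex encoding of the guid in the cursor name is an assumption.

```go
// moveReplicationCursor creates the new cursor for `to` before destroying the
// old cursor of `from`, so there is always at least one cursor bookmark.
func moveReplicationCursor(fs, jobid string, from *FilesystemVersion, to FilesystemVersion) error {
	newCursor := fmt.Sprintf("#zrepl_replication_cursor_G_%016x_J_%s", to.Guid, jobid)
	// 1. idempotently bookmark `to` (atomically checks to.Guid)
	if err := idempotentBookmark(fs, to, newCursor); err != nil {
		return err
	}
	if from == nil {
		return nil // initial replication: there is no old cursor to destroy
	}
	// 2. only then destroy the old cursor bookmark of `from`
	oldCursor := fmt.Sprintf("#zrepl_replication_cursor_G_%016x_J_%s", from.Guid, jobid)
	return idempotentDestroyBookmark(fs, oldCursor, *from)
}
```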
## Notes

- `idempotent_FSSTEP_bookmark` is like `idempotent_STEP_bookmark`, but with prefix `zrepl_FSSTEP` instead of `zrepl_STEP`
# Notes on Planning and Executing Replication of Multiple Filesystems

## The RPC Interface Uses Send-side Filesystem Names to Identify a Filesystem

The model of the receiver includes the assumption that it will `Receive` all snapshot streams sent to it into filesystems with paths prefixed with `root_fs`.
This is useful for separating filesystem path namespaces for multiple clients.
The receiver hides this prefixing in its RPC interface, i.e., when responding to `ListFilesystems` RPCs, the prefix is removed before sending the response.
This behavior is useful to achieve symmetry in this algorithm: we do not need to take the prefixing into consideration when computing diffs.
For the receiver, it has the advantage that the `Receive` RPC can enforce the namespacing and need not trust this algorithm to apply it correctly.

The receiver is also allowed (and may need) to implement other filtering / namespace transformations.
For example, when only `foo/a/b` has been received, but neither `foo` nor `foo/a`, the receiver must not include the latter two in its `ListFilesystems` response.
The current zrepl receiver endpoint implementation uses the `zrepl:placeholder` property to mark filesystems `foo` and `foo/a` as placeholders, and filters out such filesystems in the `ListFilesystems` response (see the sketch below).
Another approach would be to have a flat list of received filesystems per sender, and a separate table that associates send-side names with receive-side filesystem paths.

Similarly, the sender is allowed to filter the output of its RPC interface to hide certain filesystems or snapshots.
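A sketch of the placeholder filtering described above; `isPlaceholder` is a hypothetical helper that would read the `zrepl:placeholder` property (e.g. via `zfs get`), and the prefix handling assumes the `$root_fs/...` layout described earlier.

```go
// listFilesystems filters placeholder datasets out of a ListFilesystems
// response and strips the root_fs prefix so send-side names are returned.
func listFilesystems(receivedFilesystems []string, rootFS string, isPlaceholder func(fs string) (bool, error)) ([]string, error) {
	var resp []string
	for _, fs := range receivedFilesystems {
		ph, err := isPlaceholder(fs)
		if err != nil {
			return nil, err
		}
		if ph {
			continue // e.g. $root_fs/foo when only foo/a/b has been received
		}
		resp = append(resp, strings.TrimPrefix(fs, rootFS+"/"))
	}
	return resp, nil
}
```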
## Consequences of the Above Design

Namespace filtering and transformations of filesystem paths on sender and receiver are user-configurable.
Both ends of the replication setup have their own config and do not coordinate config changes.
This results in a number of challenges:

- A sender might change its filter to allow additional filesystems to be replicated.
  For example, where `foo/a/b` was initially allowed, the new filter might allow `foo` and `foo/a` as well.
  If the receiver uses the `zrepl:placeholder` approach as sketched out above, this means that the receiver will need to replace the placeholder filesystems `$root_fs/foo` and `$root_fs/foo/a` with the incoming full sends of `foo` and `foo/a`.
- Send-side renames cannot be handled efficiently, because a send-side rename effectively changes the filesystem's identity, since we use its name to identify it.