The motivation for this recatoring are based on two independent issues: - @JMoVS found that the changes merged as part of #259 slowed his OS X based installation down significantly. Analysis of the zfs command logging introduced in #296 showed that `zfs holds` took most of the execution time, and they pointed out that not all of those `zfs holds` invocations were actually necessary. I.e.: zrepl was inefficient about retrieving information from ZFS. - @InsanePrawn found that failures on initial replication would lead to step holds accumulating on the sending side, i.e. they would never be cleaned up in the HintMostRecentCommonAncestor RPC handler. That was because we only sent that RPC if there was a most recent common ancestor detected during replication planning. @InsanePrawn prototyped an implementation of a `zrepl holds release` command to mitigate the situation. As part of that development work and back-and-forth with @problame, it became evident that the abstractions that #259 built on top of zfs in package endpoint (step holds, replication cursor, last-received-hold), were not well-represented for re-use in the `zrepl holds release` subocommand. This commit refactors package endpoint to address both of these issues: - endpoint abstractions now share an interface `Abstraction` that, among other things, provides a uniform `Destroy()` method. However, that method should not be destroyed directly but instead the package-level `BatchDestroy` function should be used in order to allow for a migration to zfs channel programs in the future. - endpoint now has a query facitilty (`ListAbstractions`) which is used to find on-disk - step holds and bookmarks - replication cursors (v1, v2) - last-received-holds By describing the query in a struct, we can centralized the retrieval of information via the ZFS CLI and only have to be clever once. We are "clever" in the following ways: - When asking for hold-based abstractions, we only run `zfs holds` on snapshot that have `userrefs` > 0 - To support this functionality, add field `UserRefs` to zfs.FilesystemVersion and retrieve it anywhere we retrieve zfs.FilesystemVersion from ZFS. - When asking only for bookmark-based abstractions, we only run `zfs list -t bookmark`, not with snapshots. - Currently unused (except for CLI) per-filesystem concurrent lookup - Option to only include abstractions with CreateTXG in a specified range - refactor `endpoint`'s various ZFS info retrieval methods to use `ListAbstractions` - change the `zrepl holds list` command to consume endpoint.ListAbstractions - Add a `ListStale` method which, given a query template, lists stale holds and bookmarks. - it uses replication cursor has different modes - the new `zrepl holds release-{all,stale}` commands can be used to remove abstractions of package endpoint - Adjust HintMostRecentCommonAncestor RPC for stale-holds cleanup: - send it also if no most recent common ancestor exists between sender and receiver - have the sender clean up its abstractions when it receives the RPC with no most recent common ancestor, using `ListStale` - Due to changed semantics, bump the protocol version. - Adjust HintMostRecentCommonAncestor RPC for performance problems encountered by @JMoVS - by default, per (job,fs)-combination, only consider cleaning step holds in the createtxg range `[last replication cursor,conservatively-estimated-receive-side-version)` - this behavior ensures resumability at cost proportional to the time that replication was donw - however, as explained in a comment, we might leak holds if the zrepl daemon stops running - that trade-off is acceptable because in the presumably rare this might happen the user has two tools at their hand: - Tool 1: run `zrepl holds release-stale` - Tool 2: use env var `ZREPL_ENDPOINT_SENDER_HINT_MOST_RECENT_STEP_HOLD_CLEANUP_MODE` to adjust the lower bound of the createtxg range (search for it in the code). The env var can also be used to disable hold-cleanup on the send-side entirely. supersedes closes #293 supersedes closes #282 fixes #280 fixes #278 Additionaly, we fixed a couple of bugs: - zfs: fix half-nil error reporting of dataset-does-not-exist for ZFSListChan and ZFSBookmark - endpoint: Sender's `HintMostRecentCommonAncestor` handler would not check whether access to the specified filesystem was allowed.
28 KiB
The goal of this document is to describe what logical steps zrepl takes to replicate ZFS filesystems.
Note that the actual code executing these steps is spread over the endpoint.{Sender,Receiver}
RPC endpoint implementations. The code in replication/{driver,logic}
(mostly logic.Step.doReplication
) drives the replication and invokes the endpoint
methods (locally or via RPC).
Hence, when trying to map algorithm to implementation, use the code in package replication
as the entrypoint and follow the RPC calls.
-
The Single Replication Step algorithm is implemented as described
- step holds are implemented as described
- step bookmarks rely on bookmark cloning, which is WIP in OpenZFS (see respective TODO comments)
- the feature to hold receive-side
to
immediately after replication is used to move thelast-received-hold
forward- this is a little more than what we allow in the algorithm (we only allow specifying a hold tag whereas moving the
last-received-hold
is a little more complex)
- this is a little more than what we allow in the algorithm (we only allow specifying a hold tag whereas moving the
- the algorithm's sender-side callback is used move the
zrepl_CURSOR
bookmark to point toto
-
The
zrepl_CURSOR
bookmark and thelast-received-hold
ensure that future replication steps can always be done using incremental replication:- Both reference / hold the last successfully received snapshot
to
. - Thus,
to
or#zrepl_CURSOR_G_${to.GUID}_J_${jobid}
can always be used for a future incremental replication stepto => future-to
:- incremental send will be possible because
to
doesn't go away because we claim ownership of the#zrepl_CURSOR
namespace - incremental recv will be possible iff
to
will still be present on the recv-side because oflast-received-hold
- the recv-side fs doesn't get newer snapshots than
to
in the meantime- guaranteed because the zrepl model of the receiver assumes ownership of the filesystems it receives into
- if that assumption is broken, future replication attempts will fail with a conflict
- guaranteed because the zrepl model of the receiver assumes ownership of the filesystems it receives into
- incremental send will be possible because
- Both reference / hold the last successfully received snapshot
-
The Algorithm for Planning and Executing Replication of an Filesystems is a design draft and not used. However, there were some noteworthy lessons learned when implementing the algorithm for a single step:
- In order to avoid leaking
step-hold
s andstep-bookmarks
, if the replication planner is invoked a second time after a replication step (either initial or incremental) has been attempted but failed to completed, the replication planner must- A) either guarantee that it will resume that replication step, and continue as if nothing happened or
- B) release the step holds and bookmarks and clear the partially received state on the sending side.
- Option A is what we want to do: we use the step algorithm to achieve resumability in the first place!
- Option B is not done by zrepl except if the sending side doesn't support resuming. In that case however, we need not release any holds since the behavior is to re-start the send from the beginning.
- However, there is one edge-case to Option A for initial replication:
- If initial replication "
full a
" fails without leaving resumable state, the step holds on the sending side are still present, which makes sense because- a) we want resumability and
- b) the sending side cannot immediately be informed post-failure whether the initial replication left any state that would mandate keeping the step hold, because the network connection might have failed.
- Thus, the sending side must keep the step hold for "
full a
" until it knows more. - In the current implementation, it knows more when the next replication attempt is made, the planner is invoked, the diffing algorithm run, and the
HintMostRecentCommonAncestor
RPC is sent by the active side, communicating the most recent common version shared betwen sender and receiver. - At this point, the sender can safely throw away any step holds with CreateTXG's older than that version.
- If initial replication "
- The
step-hold
,step-bookmark
,last-received-hold
andreplication-cursor
abstractions are currently local concepts of packageendpoint
and not part of the replication protocol- This is not necessarilty the best design decision and should be revisited some point:
- The (quite expensive)
HintMostRecentCommonAncestor
RPC impl on the sender would not be necessary if step holds were part of the replication protocol:- We only need the
HintMostRecentCommonAncestor
info for the aforementioned edge-case during initial replication, where the receive is aborted without any partial received state being stored on the receiver (due to network failure, wrong zfs invocation, bad permissions, etc): - The replication planner does not know about the step holds, thus it cannot deterministically pick up where it left of (right at the start of the last failing initial replication).
- Instead, it will seem like no prior invocation happened at all, and it will apply its policy for initial replication to pick a new
full b != full a
, thereby leaking the step holds offull a
. - In contrast, if the replication planner created the step holds and knew about them, it could use the step holds as an indicator where it left off and re-start from there (of course asserting that the thereby inferred step is compatible with the state of the receiving side).
- We only need the
- (What we do in zrepl right now is to hard-code the initial replication policy, and hard-code that assumption in
endpoint.ListStale
as well.) - The cummulative cleanup done in
HintMostRecentCommonAncestor
provides a nice self-healing aspect, though.
- The (quite expensive)
- This is not necessarilty the best design decision and should be revisited some point:
- In order to avoid leaking
-
We also have Notes on Planning and Executing Replication of Multiple Filesystems
Algorithm for a Single Replication Step
The algorithm described below describes how a single replication step must be implemented.
A replication step is a full send of a snapshot to
for initial replication, or an incremental send (from => to
) for incremental replication.
The algorithm ensures resumability of the replication step in presence of
- uncoordinated or unsynchronized destruction of the filesystem, snapshots, or bookmarks involved in the replication step
- network failures at any time
- other instances of this algorithm executing the same step in parallel (e.g. concurrent replication to different destinations)
To accomplish this goal, the algorithm assumes ownership of parts of the ZFS hold tag namespace and the bookmark namespace:
- holds with prefix
zrepl_STEP
on any snapshot are reserved for zrepl - bookmarks with prefix
zrepl_STEP
are reserved for zrepl
Resumability of a step cannot be guaranteed if these sub-namespaces are manipulated through software other than zrepl.
Note that the algorithm does not ensure that a replication plan, which describes a set of replication steps, can be carried out successfully. If that is desirable, additional measures outside of this algorithm must be taken.
Definitions:
Step Completion & Invariants
The replication step (full to
send or from => to
send) is complete iff the algorithm ran to completion without errors or a permanent non-network error is reported by sender or receiver.
Specifically, the algorithm may be invoked with the same from
and to
arguments, and potentially a resume_token
, after a temporary (like network-related) failure:
Unless permanent errors occur, repeated invocations of the algorithm with updated resume token will converge monotonically (but not strictly monotonically) toward completion.
Note that the mere existence of to
on the receiving side does not constitute completion, since there may still be post-recv actions to be performed on sender and receiver.
Job and Job ID
This algorithm supports that multiple instance of it run in parallel on the same step (full to
/ from => to
pair).
An exemplary use case for this feature are concurrent replication jobs that replicate the same dataset to different receivers.
We require that all parallel invocations of this algorithm provide different and unique jobid
s.
Violation of jobid
uniqueness across parallel jobs may result in interference between instances of this algorithm, resulting in potential compromise of resumability.
After a step is completed, jobid
is guaranteed to not be encoded in on-disk state.
Before a step is completed, there is no such guarantee.
Changing the jobid
before step completion may compromise resumability and may leak the underlying ZFS holds or step bookmarks (i.e. zrepl won't clean them up)
Note the definition of complete above.
Step Bookmark
A step bookmark is our equivalent of ZFS holds, but for bookmarks.
A step bookmark is a ZFS bookmark whose name matches the following regex:
STEP_BOOKMARK = #zrepl_STEP_bm_G_([0-9a-f]+)_J_(.+)
- capture group
1
must be the guid ofzfs get guid ${STEP_BOOKMARK}
, encoded hexadecimal (without leading0x
) and fixed-length (i.e. padded with leading zeroes) - capture group
2
must be equal to thejobid
Algorithm
INPUT:
jobid
from
: snapshot or bookmark: may be nil for full sendto
: snapshot, must never be nilresume_token
(may be nil)- (
from
andto
must be on the same filesystem)
Prepare-Phase
Send-side: make sure to
and from
don't go away
- hold
to
usingidempotent_hold(to, zrepl_STEP_J_${jobid})
- make sure
from
doesn't go away:- if
from
is a snapshot: holdfrom
usingidempotent_hold(from, zrepl_STEP_J_${jobid})
- else
idempotent_step_bookmark(from)
(from
is a bookmark)- PERMANENT ERROR if this fails (e.g. because
from
is a bookmark whose snapshot no longer exists and we don't have bookmark copying yet (ZFS feature is in development) )- Why? we must assume the given bookmark is externally managed (i.e. not in the )
- this means the bookmark is externally created bookmark and cannot be trusted to persist until the replication step succeeds
- Maybe provide an override-knob for this behavior
- PERMANENT ERROR if this fails (e.g. because
- if
Recv-side: no-op
-
from
cannot go away once we received enough data for the step to be resumable:# do a partial incremental recv @a=>@b (i.e. recv with -s, and abort the incremental recv in the middle) # try destroying the incremental source on the receiving side zfs destroy p1/recv@a cannot destroy 'p1/recv@a': snapshot has dependent clones use '-R' to destroy the following datasets: p1/recv/%recv # => doesn't work, because zfs recv is implemented as a `clone` internally, that's exactly what we want
-
if recv-side
to
exists, goto cleanup-phase (no replication to do) -
to
cannot be destroyed while being received, because it isn't visible as a snapshot yet (it isn't yet one after all)
Replication Phase
Attempt the replication step:
start zfs send
and pipe its output into zfs recv
until an error occurs or recv
returns without error.
Let us now think about interferences that could occur during this phase, and ensure that none of them compromise the goals of this algorithm, i.e., monotonic convergence toward step completion using resumable send & recv.
Safety from External Pruning
We are safe from pruning during the replication step because we have guarantees that no external action will destroy send-side from
and to
, and recv-side to
(for both snapshot and bookmark from
s)
Safety In Presence of Network Failures During Replication Network failures during replication can be recovered from using resumable send & recv:
- Network failure before the receive-side could stored data:
- Send-side
from
andto
are guaranteed to be still present due to zfs holds - recv-side
from
may no longer exist because we don'thold
it explicitly- if that is the case, ERROR OUT, step can never complete
- If the step planning algorithm does not include the step, for example because a snapshot filter configuration was changed by the user inbetween which hides
from
orto
from the second planning attempt: tough luck, we're leaking all holds
- Send-side
- Network failure during the replication
- send-side
from
andto
are still present due to zfs holds - recv-side
from
is still present because the partial receive state prevents its destruction (see prepare-phase) - if recv-side has a resume token, the resume token will continue to work on the sender because
from
s andto
are still present
- send-side
- Network failure at the end of the replication step stream transmission
- Variant A: failure from the sender's perspective, success from the receiver's perspective
- receive-side
to
doesn't have a hold and could be destroyed anytime - receive-side
from
doesn't have a hold and could be destroyed anytime - thus, when the step is invoked again, pattern match
(from_exists, to_exists)
(true, true)
: prepare-phase will early-exit(false,true)
: prepare-phase will error out bc.from
does not exist(true, false)
: entire step will be re-done- FIXME monotonicity requirement does not hold
(false, false)
: prepare-phase will error out bc.from
does not exist
- receive-side
- Variant B: success from the sender's perspective, failure from the receiver's perspective
- No idea how this would happen except for bugs in error reporting in the replication protocol
- Misclassification by the sender, most likely broken error handling in the sender or replication protocol
- => the sender will release holds and move the replication cursor while the receiver won't => tough luck
- Variant A: failure from the sender's perspective, success from the receiver's perspective
If the RPC used for zfs recv
returned without error, this phase is done.
(Although the replication step (this algorithm) is not yet complete, see definition of complete).
Cleanup-Phase
At the end of this phase, all intermediate state we built up to support the resumable replication of the step is destroyed. However, consumers of this algorithm might want to take advantage of the fact that we currently still have holds / step bookmarks.
Recv-side: Optionally hold to
with a caller-defined tag
Users of the algorithm might want to depend on to
being available on the receiving side for a longer duration than the lifetime of the current step's algorithm invocation.
For reasons explained in the next paragraph, we cannot guarantee that we have a hold
when recv
exists. If that were the case, we could take our time to provide a generalized callback to the user of the algorithm, and have them do whatever they want with to
while we guarantee that to
doesn't go away through the hold. But that functionality isn't available, so we only provide a fixed operation right after receive: take a hold
with a tag of the algorithm user's choice. That's what's needed to guarantee that a replication plan, consisting of multiple consecutive steps, can be carried out without a receive-side prune job interfering by destroying to
. Technically, this step is racy, i.e., it could happen that to
is destroyed between recv
and hold
. But this is a) unlikely and b) not fatal because we can detect that hold fails because the 'dataset does not existand re-do the entire transmission since we still hold send-side
fromand
to`, i.e. we just lose a lot of work in rare cases.
So why can't we have a hold
at the time recv
exits?
It seems like zfs send --holds
could be used to send the send-side's holds to the receive-side. But that flag always sends all holds, and --holds
is implemented in user-space libzfs, i.e., doesn't happen in the same txg as the recv completion. Thus, the race window is in fact only smaller (unless we oversaw some synchronization in userland zfs).
But if we're willing to entertain the idea a little further, we still hit the problem that --holds
sends all holds, whereas our goal is to temporarily have >= 1 hold that we own until the callback is done, and then release all of the received holds so that no holds created by us exist after this algorithm completes.
Since there could be concurrent zfs hold
invocations on the sender and receiver while this algorithm runs, and because --holds
doesn't provide info about which holds were sent, we cannot correctly destroy exactly those holds that we received.
Send-side: callback to consumer, then cleanup
We can provide the algorithm user with a generic callback because we have holds / step bookmarks for to
and from
, respectively.
Example use case for the sender-side callback is the replication cursor, see algorithm for filesystem replication below.
After the callback is done: cleanup holds & step bookmark:
idempotent_release(to, zrepl_STEP_J_${jobid})
- make sure
from
can now go away:- if
from
is a snapshot:idempotent_release(from, zrepl_STEP_J_${jobid})
- else
idempotent_step_bookmark_destroy(from)
- if
from
fulfills the properties of a step bookmark: destroy it - otherwise: it's a bookmark not created by this algorithm that happened to be used for replication: leave it alone
- that can only happen if user used the override knob
- if
- if
Note that "make sure from
can now go away" is the inverse of "Send-side: make sure to
and from
don't go away". Those are relatively complex operations and should be implemented in the same file next to each other to ease maintainability.
Notes & Pseudo-APIs used in the algorithm
idempotent_hold(snapshot s, string tag)
like zfs hold, but doesn't error if hold already existsidempotent_release(snapshot s, string tag)
like zfs hold, but doesn't error if hold already existsidempotent_step_bookmark(snapshot_or_bookmark b)
creates a step bookmark of b- determine step bookmark name N
- if
N
already exists, verify it's a correct step bookmark => DONE or ERROR - if
b
is a snapshot, issuezfs bookmark
- if
b
is a bookmark:- if bookmark cloning supported, use it to duplicate the bookmark
- else ERROR OUT, with an error that upper layers can identify as such, so that they are able to ignore the fact that we couldn't create a step bookmark
idempotent_destroy(bookmark #b_$GUID, of a snapshot s)
must atomically check that$GUID == s.guid
before destroying sidempotent_bookmark(snapshot s, $GUID, name #b_$GUID)
must atomically check that$GUID == s.guid
at the time of bookmark creationidempotent_destroy(snapshot s)
must atomically check that zrepl'ss.guid
matches the currentguid
of the snapshot (i.e. destroy by guid)
Algorithm for Planning and Executing Replication of a Filesystems (planned, not implemented yet)
This algorithm describes how a filesystem or zvol is replicated in zrepl.
The algorithm is invoked with a jobid
, sender
, receiver
, a sender-side filesystem path
.
It builds a diff between the sender and receiver filesystem bookmarks+snapshots and determines whether a replication conflict exists or whether fast-forward replication using full or incremental sends is possible.
In case of conflict, the algorithm errors out with a conflict description that can be used to manually or automatically resolve the conflict.
Otherwise, the algorithm builds a list of replication steps that are then worked on sequentially by the "Algorithm for a Single Replication Step".
The algorithm ensures that a plan can be executed exactly as planned by acquiring appropriate zfs holds. The algorithm can be configured to retry a plan when encountering non-permanent errors (e.g. network errors). However, permanent errors result in the plan being cancelled.
Regardless of whether a plan can be executed to completion or is cancelled, the algorithm guarantees that leftover artifacts (i.e. holds) of its invocation are cleaned up.
However, since network failure can occur at any point, there might be stale holds on sending or receiving side after a crash or other error.
These will be cleaned up on a subsequent invocation of this algorithm with the same jobid
.
The algorithm is fit to be executed in parallel from the same sender to different receivers.
To be clear: replicating in parallel from the same sender to different receivers is supported.
But one instance of the algorithm assumes ownership of the filesystem path
on the receiver.
The algorithm reserves the following sub-namespaces:
- zfs hold:
zrepl_FS
- bookmark names:
zrepl_FS
Definitions
ZFS Filesystem Path
We refer to the name of a ZFS filesystem or volume as a filesystem path, sometimes just path. The name includes the pool name.
Sender and Receiver
Sender and Receiver are passed as RPC stubs to the algorithm. One of those RPC stubs will typically be local, i.e., call methods on a struct in the same address space. The other RPC stub may invoke actual calls over the network (unless local replication is used). The algorithm does not make any assumption about which object is local or an actual stub. This design decouples the replication logic (described in this algorithm) from the question which side (sender or receiver) initiates the replication.
Algorithm
Cleanup Previous Invocations With Same jobid
by scanning the filesystem's list of snapshots and step bookmarks, and destroying any which was created by this algorithm when it was invoked with jobid
.
TODO HOW
Build a fast-forward list of replication steps STEPS
.
STEPS
may contain an optional initial full send, and subsequent incremental send steps.
STEPS[0]
may also have a resume token.
If fast-forward is not possible, produce a conflict description and ERROR OUT.
TODOs:
- make it configurable what snapshots are included in the list (i.e. every one we see, only most recent, at least one every X hours, ...)
Ensure that we will be able to carry out all steps by acquiring holds or fsstep bookmarks on the sending side
idempotent_hold([s.to for s in STEPS], zrepl_FS_J_${jobid})
if STEPS[0].from != nil: idempotent_FSSTEP_bookmark(STEPS[0].from, zrepl_FSSTEP_bm_G_${STEPS[0].from.guid}_J_${jobid})
Determine which steps have not been completed (uncompleted_steps
) (we might be in an second invocation of this algorithm after a network failure and some steps might already be done):
res_tok := receiver.ResumeToken(fs)
rmrfsv := receiver.MostRecentFilesystemVersion(fs)
- if
res_tok != nil
: ensure thatres_tok
has a corresponding step inSTEPS
, otherwise ERROR OUT - if
rmrfsv != nil
: ensure thatres_tok
has a corresponding step inSTEPS
, otherwise ERROR OUT - if
(res_token != nil && rmrfsv != nil)
: ensure thatres_tok
is the subsequent step to the one we found forrmrfsv
- if both are nil, we are at the beginning,
uncompleted_steps = STEPS
and goto next block rstep := if res_tok != nil { res_tok } else { rmrfsv }
uncompleted_steps := STEPS[find_step_idx(STEPS, rstep).expect("must exist, checked above"):]
- Note that we do not explicitly check for the completion of prior replication steps.
All we care about is what needs to be done from
rstep
. - This is intentional and necessary because we cumulatively release all holds and step bookmarks made for steps that precede a just-completed step (see next paragraph)
- Note that we do not explicitly check for the completion of prior replication steps.
All we care about is what needs to be done from
Execute uncompleted steps
Invoke the "Algorithm for a Single Replication Step" for each step in uncompleted_steps
.
Remember to pass the resume token if one exists.
In the wind-down phase of each replication step from => to
, while the per-step algorithm still holds send-side from
and to
, as well as recv-side to
:
-
Sending side: Idempotently move the replication cursor to
to
.- Why: replication cursor tracks the sender's last known replication state, outlives the current plan, but is required to guarantee that future invocations of this algorithm find an incremental path.
- Impl for 'atomic' move: have two cursors (fixes #177)
- Idempotently cursor-bookmark
to
usingidempotent_bookmark(to, to.guid, #zrepl_replication_cursor_G_${to.guid}_J_${jobid})
- Idempotently destroy old cursor-bookmark of
from
idempotent_destroy(#zrepl_replication_cursor_G_${from.guid}_J_${jobid}, from)
- If
from
is a snapshot, release the hold on it usingidempotent_release(from, zrepl_J_${jobid})
- FIXME: resumability if we crash inbetween those operations (or scrap it and wait for channel programs to bring atomicity)
- Idempotently cursor-bookmark
-
Receiving side:
idempotent_hold(to, zrepl_FS_J_${jobid})
because its going to be the next step'sfrom
- As discussed in the section on the per-step algorithm, this is a feature provided to us by the per-step algorithm.
-
Receiving side + Sending side (order doesn't matter): Make sure all holds & step bookmarks made by this plan on already replicated snapshots are released / destroyed:
idempotent_release_prior_and_including(from, zrepl_FS_TODO)
idempotent_step_bookmark_destroy_prior_and_including(from, zrepl_FS_TODO)
Cleanup receiving and sending side (order doesn't matter)
idempotent_release_prior_and_including(STEPS[-1].to, zrepl_FS_TODO)
idempotent_step_bookmark_destroy_prior_and_including(STEPS[-1].from, zrepl_FS_TODO)
Notes
idempotent_FSSTEP_bookmark
is likeidempotent_STEP_bookmark
, but with prefixzrepl_FSSTEP
instead ofzrepl_STEP
Notes on Planning and Executing Replication of Multiple Filesystems
The RPC Interfaces Uses Send-side Filesystem Names to Identify a Filesystem
The model of the receiver includes the assumption that it will Receive
all snapshot streams sent to it into filesystems with paths prefixed with root_fs
.
This is useful for separating filesystem path namespaces for multiple clients.
The receiver hides this prefixing in its RPC interface, i.e., when responding to ListFilesystems
rpcs, the prefix is removed before sending the response.
This behavior is useful to achieve symmetry in this algorithm: we do not need to take the prefixing into consideration when computing diffs.
For the receiver, it has the advantage that Receive
rpc can enforce the namespacing and need not trust this algorithm to apply it correctly.
The receiver is also allowed (and may need to) do implement other filtering / namespace transformations.
For example, when only foo/a/b
has been received, but not foo
nor foo/a
, the receiver must not include the latter two in its ListFilesystems
response.
The current zrepl receiver endpoint implementation uses the zrepl:placeholder
property to mark filesystems foo
and foo/a
as placeholders, and filters out such filesystems in the ListFilesystems
response.
Another approach would be to have a flat list of received filesystems per sender, and have a separate table that associates send-side names and receive-side filesystem paths.
Similarly, the sender is allowed to filter the output of its RPC interface to hide certain filesystems or snapshots.
Consequences of the Above Design
Namespace filtering and transformations of filesystem paths on sender and receiver are user-configurable. Both ends of the replication setup have their own config, and do not coordinate config changes. This results in a number of challenges:
-
A sender might change its filter to allow additional filesystems to be replicated. For example, where
foo/a/b
was initially allowed, the new filter might allowfoo
andfoo/a
as well. If the receiver uses thezrepl:placeholder
approach as sketched out above, this means that the receiver will need to replace the placeholder filesystems$root_fs/foo
and$root_fs/foo/a
with the incoming full sends offoo
andfoo/a
. -
Send-side renames cannot be handled efficiently because send-side rename effectively changes the filesystem identity because we use its name.