bookmarking: prune policy for bookmarks

refs #34
This commit is contained in:
Christian Schwarz
2018-02-17 20:48:31 +01:00
parent 8e34843eb1
commit aa92261ea7
15 changed files with 149 additions and 48 deletions


@@ -48,14 +48,14 @@ Example: :sampleconf:`pullbackup/productionhost.yml`.
* - ``interval``
- snapshotting interval
* - ``prune``
- |prune| policy for filesystems in ``filesystems`` with prefix ``snapshot_prefix``
- |prune| for versions of filesystems in ``filesystems``, versions prefixed with ``snapshot_prefix``
- Snapshotting Task (every ``interval``, |patient|)
- A snapshot of filesystems matched by ``filesystems`` is taken every ``interval`` with prefix ``snapshot_prefix``.
- A bookmark of that snapshot is created with the same name.
- The ``prune`` policy is triggered on filesystems matched by ``filesystems`` with snapshots matched by ``snapshot_prefix``.
- The ``prune`` policy is evaluated for versions of filesystems matched by ``filesystems``, versions prefixed with ``snapshot_prefix``.
- Serve Task
@@ -65,12 +65,6 @@ A source job is the counterpart to a :ref:`job-pull`.
Make sure you read the |prune| policy documentation.
Note that zrepl does not prune bookmarks, for the following reason:
a pull job may stop replication due to link failure, misconfiguration or administrative action.
The source prune policy will eventually destroy the last common snapshot between source and pull job.
Without bookmarks, the prune policy would need to perform full replication again.
With bookmarks, we can resume incremental replication, only losing the snapshots pruned since the outage.
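A source job of this shape can be sketched as a config fragment; the job name and grid values below mirror the retention-grid example in the pruning documentation, while the ``zrepl_`` prefix and the elided ``serve`` section are illustrative assumptions:

```yaml
jobs:
  - name: pull_backup        # name as in the retention-grid example
    type: source
    interval: 10m            # snapshotting interval
    snapshot_prefix: zrepl_  # assumed prefix; only matching versions are pruned
    prune:
      policy: grid
      grid: 1x1d(keep=all)
      keep_bookmarks: 144
    # serve: ...             # transport configuration elided
```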
.. _job-pull:
Pull Job
@@ -99,7 +93,7 @@ Example: :sampleconf:`pullbackup/backuphost.yml`
* - ``snapshot_prefix``
- prefix snapshots must match to be considered for replication & pruning
* - ``prune``
- |prune| policy for local filesystems reachable by ``mapping``
- |prune| policy for versions of local filesystems reachable by ``mapping``, versions prefixed with ``snapshot_prefix``
* Main Task (every ``interval``, |patient|)
@@ -112,10 +106,11 @@ Example: :sampleconf:`pullbackup/backuphost.yml`
#. If the local target filesystem does not exist, ``initial_repl_policy`` is used.
#. On conflicts, an error is logged, but replication of the other filesystems in the mapping continues.
#. The ``prune`` policy is triggered for all *target filesystems*
#. The ``prune`` policy is evaluated for all *target filesystems*
A pull job is the counterpart to a :ref:`job-source`.
Make sure you read the |prune| policy documentation.
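The corresponding pull job can be sketched the same way; elided fields are marked with ``...`` as in the sample configs, and the ``zrepl_`` prefix is an illustrative assumption:

```yaml
jobs:
  - name: pull_app-srv       # name as in the pruning-policy example
    type: pull
    ...                      # connection, mapping, interval etc. elided
    snapshot_prefix: zrepl_  # assumed prefix for replicated snapshots
    prune:
      policy: grid
      grid: 1x1h(keep=all)   # first interval keeps everything, see the pruning docs
```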
.. _job-local:
@@ -163,8 +158,6 @@ Example: :sampleconf:`localbackup/host1.yml`
#. The ``prune_rhs`` policy is triggered for all *target filesystems*
A local job is a combination of a source and a pull job executed on the same machine.
Note that while snapshots are pruned, bookmarks are not pruned and kept around forever.
Refer to the comments on :ref:`source job <job-source>` for the reasoning behind this.
Terminology
-----------
@@ -188,3 +181,7 @@ patient task
* waits for the last invocation to finish
* logs a warning with the effective task duration
* immediately starts a new invocation of the task
filesystem version
A snapshot or a bookmark.
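The two kinds of filesystem version map directly onto ZFS naming: snapshots are written ``dataset@name``, bookmarks ``dataset#name``, and zrepl creates the bookmark with the same name as its snapshot. A small illustrative helper (the function name and dataset are hypothetical, not part of zrepl):

```python
def version_type(name: str) -> str:
    """Classify a ZFS filesystem version by its name.

    ZFS snapshots are written as dataset@name, bookmarks as dataset#name.
    """
    if "@" in name:
        return "snapshot"
    if "#" in name:
        return "bookmark"
    raise ValueError(f"not a filesystem version: {name!r}")

# A snapshot and the bookmark zrepl derives from it share the short name:
print(version_type("tank/data@zrepl_20180217_194831"))  # snapshot
print(version_type("tank/data#zrepl_20180217_194831"))  # bookmark
```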


@@ -3,9 +3,9 @@
Pruning Policies
================
In zrepl, *pruning* means *destroying snapshots by some policy*.
In zrepl, *pruning* means *destroying filesystem versions by some policy*, where a filesystem version is either a snapshot or a bookmark.
A *pruning policy* takes a list of snapshots and -- for each snapshot -- decides whether it should be kept or destroyed.
A *pruning policy* takes a list of filesystem versions and decides for each whether it should be kept or destroyed.
The job context defines which snapshots are even considered for pruning, for example through the ``snapshot_prefix`` variable.
Check the respective :ref:`job definition <job>` for details.
@@ -25,6 +25,7 @@ Retention Grid
jobs:
- name: pull_app-srv
type: pull
...
prune:
policy: grid
@@ -34,6 +35,15 @@ Retention Grid
└─ 24 adjacent one-hour intervals
- name: pull_backup
type: source
interval: 10m
prune:
policy: grid
grid: 1x1d(keep=all)
keep_bookmarks: 144
The retention grid can be thought of as a time-based sieve:
The ``grid`` field specifies a list of adjacent time intervals:
the left edge of the leftmost (first) interval is the ``creation`` date of the youngest snapshot.
@@ -43,6 +53,11 @@ Each interval carries a maximum number of snapshots to keep.
It is specified via ``(keep=N)``, where ``N`` is either ``all`` (all snapshots are kept) or a positive integer.
The default value is **1**.
Bookmarks are not affected by the above.
Instead, the ``keep_bookmarks`` field specifies the number of bookmarks to be kept per filesystem.
You only need to specify ``keep_bookmarks`` on the source side of a replication setup, since the destination side does not receive bookmarks.
You can specify ``all`` as a value to keep all bookmarks, but be warned that you should then establish some other way to prune bookmarks that are no longer needed (see below).
The following procedure happens during pruning:
#. The list of snapshots eligible for pruning is sorted by ``creation``
@@ -54,14 +69,16 @@ The following procedure happens during pruning:
#. the contained snapshot list is sorted by creation.
#. snapshots from the list, oldest first, are destroyed until the specified ``keep`` count is reached.
#. all remaining snapshots on the list are kept.
#. The list of bookmarks eligible for pruning is sorted by ``createtxg`` and the most recent ``keep_bookmarks`` bookmarks are kept.
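The procedure above can be sketched in Python. The data representation (name/timestamp tuples) and function names are assumptions for illustration, not zrepl's actual implementation; in particular, this sketch simply keeps snapshots older than the last grid interval, which may differ from zrepl's behaviour:

```python
from datetime import datetime, timedelta

def prune_grid(snapshots, grid):
    """Decide keep/destroy for snapshots according to a retention grid.

    snapshots: list of (name, creation) tuples, creation a datetime.
    grid: list of (interval_length, keep) tuples; keep=None means 'all'.
    Returns (kept, destroyed) lists of names.
    """
    remaining = sorted(snapshots, key=lambda s: s[1], reverse=True)
    if not remaining:
        return [], []
    # The left edge of the first interval is the creation date of the
    # youngest snapshot; intervals extend back into the past.
    edge = remaining[0][1]
    kept, destroyed = [], []
    for length, keep in grid:
        lower = edge - length
        bucket = sorted((s for s in remaining if lower < s[1] <= edge),
                        key=lambda s: s[1])  # sort bucket oldest first
        remaining = [s for s in remaining if not (lower < s[1] <= edge)]
        # destroy oldest first until the keep count is reached
        n_destroy = 0 if keep is None else max(0, len(bucket) - keep)
        destroyed += [name for name, _ in bucket[:n_destroy]]
        kept += [name for name, _ in bucket[n_destroy:]]
        edge = lower
    kept += [name for name, _ in remaining]  # older than the grid: kept here
    return kept, destroyed

def prune_bookmarks(bookmarks, keep_bookmarks):
    """Keep the keep_bookmarks most recent bookmarks (sorted by createtxg)."""
    ordered = sorted(bookmarks, key=lambda b: b[1], reverse=True)
    return ([b[0] for b in ordered[:keep_bookmarks]],
            [b[0] for b in ordered[keep_bookmarks:]])
```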
.. _replication-downtime:
.. ATTENTION::
The configuration of the first interval (``1x1h(keep=all)`` in the example) determines the **maximum allowable replication lag** because the source and destination pruning policies do not coordinate:
if replication does not work for whatever reason, the source will continue to execute its prune policy.
Eventually, the source destroys a snapshot that has never been replicated to the destination, degrading the temporal resolution of your backup.
Be aware that ``keep_bookmarks x interval`` (interval of the job level) controls the **maximum allowable replication downtime** between source and destination.
If replication does not work for whatever reason, source and destination will eventually fall out of sync because the source will continue pruning snapshots.
The only recovery in that case is full replication, which may not always be viable due to disk space or traffic constraints.
Thus, **always** configure the first interval to ``1x?(keep=all)``, substituting ``?`` with the maximum time replication may fail due to downtimes, maintenance, connectivity issues, etc.
.. We intentionally do not mention that bookmarks are used to bridge the gap between source and dest that are out of sync snapshot-wise. This is an implementation detail.
Further note that while bookmarks consume a constant amount of disk space, listing them requires temporary dynamic **kernel memory** proportional to the number of bookmarks.
Thus, do not use ``all`` or an inappropriately high value without good reason.
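The downtime bound can be made concrete with the example values from the source job above (``interval: 10m``, ``keep_bookmarks: 144``); the helper name is illustrative:

```python
from datetime import timedelta

def max_replication_downtime(keep_bookmarks: int, interval: timedelta) -> timedelta:
    # One snapshot (and its bookmark) is created per interval, and only the
    # keep_bookmarks most recent bookmarks survive pruning, so incremental
    # replication can be resumed after at most keep_bookmarks * interval.
    return keep_bookmarks * interval

# 144 bookmarks at a 10-minute interval cover one day of downtime.
print(max_replication_downtime(144, timedelta(minutes=10)))  # 1 day, 0:00:00
```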