ZFS, encryption (keys, keystore, LUKS), bootFS, ephemeral root (tmpfs, ZFS, F2FS, ...), testing in qemu, options & debugging, ... and many small things
22 KiB
/*
TempRoot Layout Options
The core concept of NixOS is to maintain the hosts' programs and configuration in a stateless manner. Hand in hand with that goes the need to (but also the advantages of) identifying anything stateful in the system. Such state generally falls into one of four categories:
- It is secret and/or can not be re-generated (without other information of the same kind).
- It may or may not be secret, but re-generating it takes time or is impractical.
- It may or may not be secret and can be re-generated quickly.
Category 1 data can not be included in the nix store and thus can not (directly) be derived from the system configuration.
It needs to be stored in a way that the host (and only the host) can access it an that it can't be lost.
It is therefore referred to as remote
data (even though it would usually also be stored locally).
Category 2 data is required for the system to boot (in reasonable time), and should thus, as local
data, be stored persistently on the host.
Parts that are not secret should be included in or generated from the system configuration.
Anything that is secret likely shouldn't be lost (i.e. is actually category 1) or can be re-generated from randomness or category 1 (and is thus category 3).
Category 3 data is expendable, and, as temp
data, can thus be cleared on reboot or at other times.
Temporary data is the cheapest to (not) maintain, esp. in terms of administrative overhead, and should be used wherever possible.
Though incorrectly assigning data to temp
that should be local
or remote
may break the system or cause data loss.
TempRoot is the concept of defaulting everything to temp
and selectively whitelisting things as local
or remote
, by mounting an ephemeral root (/
) file system and mounting/binding/linking various, sometimes many, nested paths to persistent storage.
This module implements the concept with different filesystem options for remote
, local
and temp
data, while maintaining the same general mount structure regardless, which should make the choice of backing storage largely transparent for anything running on the system.
ZFS is capable of serving all three roles quite well.
Its pooling and datasets allow seamless allocation of storage between the categories, file-system level encryption and sending of encrypted, incremental dataset snapshots make it excellent for the remote
role.
As backed for local
, which primarily holds the nix store, it benefits from transparent compression, and checksumming.
With fsync
disabled, and the ability to roll back snapshots, it also works to create very large storage areas for temp
data.
ZFS though struggles on lower-end systems. BtrFS could probably be configured to serve the roles with similar capability.
F2FS also supports checksumming and compression, though it currently does not automatically reclaim space gained by the latter (but TODO: Nix could be tuned to do this explicitly).
This and its design optimized for flash storage should make it an optimal backend for the local
data, esp. on lower-end hardware.
EXT4 supports checksumming only for metadata, and does not support compression. Block device layers could in principle be used for this.
Using a sime filesystem with external backup tools is possible yet suboptimal for remote
data, unless the system doesn't actually have any/much of it.
As long as the amount of temp
data can be expected to stay within reasonable bounds, tmpfs
es and swap can also be used to back the temp
data.
The disk/partition declaration and the installer tooling refer to disks by GPT partlabel
. They require the labels to be unique not only within a single target host, but also between the host that does the installation and the one being installed. It is therefore highly advisable (and in some places maybe implicitly expected) that the labels contain a unique host identifier, for example:
let hash = builtins.substring 0 8 (builtins.hashString "sha256" config.networking.hostName); in # ...
Examples
This completely configures the disks, partitions, pool, datasets, and mounts for a ZFS rpool
on a three-disk raidz1
with read and write cache on an additional SSD, which also holds the boot partition and swap:
{ wip.fs.disks.devices.primary.size = "16G"; # The default, and only relevant when installing to disk images. The »primary« disk will hold all implicitly created partitions and those not stating a »disk«.
wip.fs.boot.enable = true; wip.fs.boot.size = "512M"; # See »./boot.nix.md«. Creates a FAT boot partition.
wip.fs.keystore.enable = true; # See »./keystore.nix.md«. With this enabled, »remote« will automatically be encrypted, with a random key by default.
wip.fs.temproot.enable = true; # Use ZFS for all categories.
wip.fs.temproot.temp.type = "zfs";
wip.fs.temproot.local.type = "zfs";
wip.fs.temproot.remote.type = "zfs";
# Change/set the pools storage layout (see above), then adjust the partitions disks/sizes. Declaring disks requires them to be passed to the system installer.
wip.fs.zfs.pools."rpool-${hash}".vdevArgs = [ "raidz1" "rpool-rz1-${hash}" "rpool-rz2-${hash}" "rpool-rz3-${hash}" "log" "rpool-zil-${hash}" "cache" "rpool-arc-${hash}" ];
wip.fs.disks.partitions."rpool-rz1-${hash}" = { disk = "raidz1"; };
wip.fs.disks.partitions."rpool-rz2-${hash}" = { disk = "raidz2"; };
wip.fs.disks.partitions."rpool-rz3-${hash}" = { disk = "raidz3"; };
wip.fs.disks.partitions."rpool-zil-${hash}" = { size = "2G"; };
wip.fs.disks.partitions."rpool-arc-${hash}" = { }; } # (this is actually already implicitly declared)
On a less beefy system, but also with less data to manage, tmpfs
works fine for tmp
, and f2fs
promises to get more performance out of the flash/ram/cpu:
{ # See above for these:
#wip.fs.disks.devices.primary.size = "16G"; # (default)
wip.fs.boot.enable = true; wip.fs.boot.size = "512M";
wip.fs.keystore.enable = true;
wip.fs.temproot.enable = true;
wip.fs.temproot.temp.type = "tmpfs"; # Put `/` on a `tmpfs`.
wip.fs.temproot.local.type = "bind"; # `bind`-mount all `local` locations to `/.local/*`, ...
wip.fs.temproot.local.bind.base = "f2fs-encrypted"; # ... where a LUKS-encrypted F2FS is mounted.
#wip.fs.keystore.keys."luks/local-${hash}/0" = "random"; # (implied by the »-encrypted« suffix above)
#wip.fs.disks.partitions."local-${hash}".size = "50%"; # (default, fixed after installation)
wip.fs.temproot.remote.type = "zfs";
wip.fs.keystore.keys."luks/rpool-${hash}/0" = "random";
#wip.fs.keystore.keys."zfs/rpool-${hash}/remote" = "random"; # (default)
#wip.fs.zfs.pools."rpool-${hash}".vdevArgs = [ "rpool-${hash}" ]; # (default)
#wip.fs.disks.partitions."rpool-${hash}" = { type = "bf00"; size = null; order = 500; }; # (default)
# Default mounts/binds can also be removed, in this case causing logs to be removed on reboot:
wip.fs.temproot.local.mounts."/var/log" = lib.mkForce null; }
Implementation
#*/# end of MarkDown, beginning of NixOS module:
dirname: inputs: specialArgs@{ config, pkgs, lib, ... }: let inherit (inputs.self) lib; in let
prefix = inputs.config.prefix;
cfg = config.${prefix}.fs.temproot;
utils = import "${inputs.nixpkgs.outPath}/nixos/lib/utils.nix" { inherit (specialArgs) lib config pkgs; };
hash = builtins.substring 0 8 (builtins.hashString "sha256" config.networking.hostName);
optionsFor = type: desc: {
bind.source = lib.mkOption { description = "Prefix for bind-mount targets."; type = lib.types.str; default = "/.${type}"; };
bind.base = lib.mkOption { description = "Filesystem to automatically create as the ».source« for the bind mounts."; type = lib.types.enum ([ null ] ++ (lib.optionals (type == "local") [ "f2fs" "f2fs-encrypted" ])); default = null; };
zfs.dataset = lib.mkOption { description = "Dataset path under which to create the ${desc} »${type}« datasets."; type = lib.types.str; default = "rpool-${hash}/${type}"; };
mounts = lib.mkOption {
description = "Locations (for »temp« in addition to »/«) where a ${desc} filesystem should be mounted. Some are declared by default but may be removed by setting them to »null«.";
type = lib.types.attrsOf (lib.types.nullOr (lib.types.submodule ({ name, ... }: { options = {
target = lib.mkOption { description = "Attribute name as the mount target path."; type = lib.types.addCheck lib.types.str (name: (builtins.match ''^/.*[^/]$'' name) != null); default = name; readOnly = true; };
source = lib.mkOption { description = "Relative source path of the mount. (Irrelevant for »tmpfs«.)"; type = lib.types.str; default = builtins.substring 1 (builtins.stringLength name - 1) name; };
uid = lib.mkOption { description = "UID owning the mounted target."; type = lib.types.ints.unsigned; default = 0; };
gid = lib.mkOption { description = "GID owning the mounted target."; type = lib.types.ints.unsigned; default = 0; };
mode = lib.mkOption { description = "Permission mode of the mounted target."; type = lib.types.str; default = "750"; };
options = lib.mkOption { description = "Additional mount options to set. Note that not all mount types support all options, they may be silently ignored or cause errors. »bind« supports setting »nosuid«, »nodev«, »noexec«, »noatime«, »nodiratime«, and »relatime«. »zfs« will only heed »noauto« and otherwise use the ».zfsProps«."; type = lib.types.listOf lib.types.str; default = [ ]; };
extraFsOptions = lib.mkOption { description = "Extra options to set on the generated »fileSystems.*« entry (unless this mount is forced to »null«)."; type = lib.types.attrsOf lib.types.anything; default = { }; };
zfsProps = lib.mkOption { description = "ZFS properties to set on the dataset, if mode type is »zfs«."; type = lib.types.attrsOf (lib.types.nullOr lib.types.str); default = { }; };
}; })));
default = { };
apply = lib.filterAttrs (k: v: v != null);
};
mountOptions = lib.mkOption { description = "Mount options that will be placed before ».mounts.*.options«."; type = lib.types.listOf lib.types.str; default = [ "noatime" "nodev" "nosuid" ]; };
};
zfsNoSyncProps = { sync = "disabled"; logbias = "throughput"; }; # According to the documentation, »logbias« should be irrelevant without sync (i.e. no logging), but some claim setting it to »throughput« still improves performance.
in {
options.${prefix} = { fs.temproot = {
enable = lib.mkEnableOption "filesystem layouts with ephemeral root";
# Settings for filesystems that will be cleared on reboot:
temp = {
type = lib.mkOption { description = ''
"tmpfs": Creates »tmpfs« filesystems at »/« and all specified mount points.
"zfs": ...
"bind": Expects a filesystem to be mounted at »/«. Creates a hook to cre-create that filesystem on boot (TODO: implement and then enable this), and bind-mounts any additional mounts to ».bind.source+"/"+<mount.source>« (creating those paths if necessary).
''; type = lib.types.enum [ "tmpfs" "zfs" ]; default = "tmpfs"; };
} // (optionsFor "temp" "temporary");
# Settings for filesystems that persist across reboots:
local = {
type = lib.mkOption { description = ''
"bind": Expects a (locally persistent) filesystem to be mounted at ».bind.target«, and bind-mounts any additional mounts to ».bind.source+"/"+<mount.source>« (creating those paths if necessary).
"zfs": ...
''; type = lib.types.enum [ "bind" "zfs" ]; default = "bind"; };
} // (optionsFor "local" "locally persistent");
# Settings for filesystems that should have remote backups:
remote = {
type = lib.mkOption { description = ''
"bind": Expects a filesystem to be mounted at ».bind.target« that gets backed. Bind-mounts any additional mounts to ».bind.source+"/"+<mount.source>« (creating those paths if necessary).
"zfs": ...
''; type = lib.types.enum [ "bind" "zfs" ]; default = "bind"; };
} // (optionsFor "remote" "remotely backed-up");
swap = {
size = lib.mkOption { description = "Size of the swap partition or file to create."; type = lib.types.nullOr lib.types.str; default = null; };
encrypted = lib.mkOption { description = "Whether to encrypt the swap with a persistent key. Only relevant if ».asPartition = true«."; type = lib.types.bool; default = false; };
asPartition = lib.mkOption { description = "Whether to create a swap partition instead of a file."; type = lib.types.bool; default = cfg.local.type == "zfs"; };
};
persistenceFixes = (lib.mkEnableOption "some fixes to cope with »/« being ephemeral") // { default = true; example = false; };
}; };
config = let
in lib.mkIf cfg.enable (lib.mkMerge ([ ({
${prefix} = {
fs.temproot.temp.mounts = {
"/tmp" = { mode = "1777"; };
};
fs.temproot.local.mounts = {
"/nix" = { zfsProps = zfsNoSyncProps; mode = "755"; }; # this (or /nix/store) is required
"/var/log" = { source = "logs"; };
"/local" = { source = "system"; };
# »/swap« is used by »cfg.swap.asPartition = false«
};
fs.temproot.remote.mounts = {
"/remote" = { source = "system"; extraFsOptions = { neededForBoot = true; }; }; # if any secrets need to be picked up by »activate«, they should be here
};
};
boot.tmpOnTmpfs = false; # This would create a systemd mount unit for »/tmp«.
}) (lib.mkIf cfg.persistenceFixes { # Cope with the consequences of having »/« (including »/{etc,var,root,...}«) cleared on every reboot.
systemd.tmpfiles.rules = [ # keep in mind: this does not get applied super early ...
# nixos config
"L+ /etc/nixos/ - - - - ../../remote/etc/nixos/"
# »/root/.nix-channels« is already being restored.
# »/var/lib/nixos« remains ephemeral. Where it matters, explicitly define a U/GID!
# root's command history
"f /remote/root/.bash_history 0600 root root -"
"L+ /root/.bash_history - - - - ../remote/root/.bash_history"
"f /remote/root/.local/share/nix/repl-history 0600 root root -"
"L+ /root/.local/share/nix/repl-history - - - - ../../../../remote/root/.local/share/nix/repl-history"
];
fileSystems = { # this does get applied early
# (on systems without hardware clock, this allows systemd to provide an at least monolithic time after restarts)
"/var/lib/systemd/timesync" = { device = "/local/var/lib/systemd/timesync"; options = [ "bind" "nofail" ]; }; # TODO: add »neededForBoot = true«?
# save persistent timer states
"/var/lib/systemd/timers" = { device = "/local/var/lib/systemd/timers"; options = [ "bind" "nofail" ]; }; # TODO: add »neededForBoot = true«?
};
security.sudo.extraConfig = "Defaults lecture=never"; # default is »once«, but we'd forget that we did that
}) (lib.mkIf (cfg.swap.size != null && cfg.swap.asPartition) (let # Convenience option to create a local F2FS optimized to host the nix store:
useLuks = config.${prefix}.fs.keystore.keys?"luks/swap-${hash}/0";
in {
${prefix} = {
fs.disks.partitions."swap-${hash}" = { type = "8200"; size = cfg.swap.size; order = 1250; };
fs.keystore.keys."luks/swap-${hash}/0" = lib.mkIf cfg.swap.encrypted (lib.mkOptionDefault "random");
};
swapDevices = [ { device = "/dev/${if useLuks then "mapper" else "disk/by-partlabel"}/swap-${hash}"; } ];
})) (lib.mkIf (cfg.swap.size != null && !cfg.swap.asPartition) {
swapDevices = [ { device = "${cfg.local.bind.source}/swap"; size = (lib.wip.parseSizeSuffix cfg.swap.size) / 1024 / 1024; } ];
}) (lib.mkIf (cfg.temp.type == "tmpfs") { # (only temp can be of type tmpfs)
fileSystems = lib.mapAttrs (target: { options, uid, gid, mode, extraFsOptions, ... }: (extraFsOptions // {
fsType = "tmpfs"; device = "tmpfs"; options = (extraFsOptions.options or [ ]) ++ cfg.temp.mountOptions ++ options ++ [ "uid=${toString uid}" "gid=${toString gid}" "mode=${mode}" ];
})) ({ "/" = { options = [ ]; uid = 0; gid = 0; mode = "755"; extraFsOptions = { }; }; } // cfg.temp.mounts);
}) (lib.mkIf (cfg.temp.type == "zfs") {
boot.initrd.postDeviceCommands = lib.mkAfter ''
echo 'Clearing root ZFS'
( zfs list -H -o name -t snapshot -r ${cfg.temp.zfs.dataset} | grep '@empty$' | xargs -n1 --no-run-if-empty zfs rollback -r )
'';
}) (lib.mkIf (cfg.temp.type == "bind") { # (TODO: this should completely clear or even recreate the »cfg.temp.bind.source«)
boot.cleanTmpDir = true; # Clear »/tmp« on reboot.
}) (lib.mkIf (cfg.local.type == "bind" && (cfg.local.bind.base == "f2fs" || cfg.local.bind.base == "f2fs-encrypted")) (let # Convenience option to create a local F2FS optimized to host the nix store:
useLuks = config.${prefix}.fs.keystore.keys?"luks/local-${hash}/0";
in {
# TODO: fsck
${prefix} = {
fs.keystore.keys."luks/local-${hash}/0" = lib.mkIf (cfg.local.bind.base == "f2fs-encrypted") (lib.mkOptionDefault "random");
fs.disks.partitions."local-${hash}" = {
type = "8300"; order = 1000; disk = "primary"; size = "50%";
};
};
fileSystems.${cfg.local.bind.source} = { fsType = "f2fs"; device = "/dev/${if useLuks then "mapper" else "disk/by-partlabel"}/local-${hash}"; formatOptions = lib.mkDefault (lib.concatStrings [
"-O extra_attr" # required by other options
",inode_checksum" # enable inode checksum
",sb_checksum" # enable superblock checksum
",compression" # allow compression
#"-w ?" # "sector size in bytes"
# sector ? segments < section < zone
]); options = lib.mkDefault ([
# F2FS compresses only for performance and wear. The whole uncompressed space is still reserved (in case the file content needs to get replaced by incompressible data in-place). To free the gained space, »ioctl(fd, F2FS_IOC_RELEASE_COMPRESS_BLOCKS)« needs to be called per file, making the file immutable. Nix could do that when moving stuff into the store.
"compress_mode=fs" # enable compression for all files
"compress_algorithm=lz4" # compress using lz4
"compress_chksum" # verify checksums (when decompressing data blocks?)
"lazytime" # update timestamps asynchronously
] ++ cfg.local.mountOptions); };
# TODO: "F2FS and its tools support various parameters not only for configuring on-disk layout, but also for selecting allocation and cleaning algorithms."
})) ] ++ (map (type: (lib.mkIf (cfg.${type}.type == "bind") {
fileSystems = (lib.mapAttrs (target: args@{ source, uid, gid, mode, extraFsOptions, ... }: extraFsOptions // (rec {
device = "${cfg.${type}.bind.source}/${source}"; options = (extraFsOptions.options or [ ]) ++ [ "bind" ] ++ cfg.${type}.mountOptions ++ args.options;
preMountCommands = ''
${extraFsOptions.preMountCommands or ""}
mkdir -pm 000 -- ${lib.escapeShellArg target}
mkdir -pm 000 -- ${lib.escapeShellArg device}
chown ${toString uid}:${toString gid} -- ${lib.escapeShellArg device}
chmod ${mode} -- ${lib.escapeShellArg device}
'';
})) cfg.${type}.mounts) // {
${cfg.${type}.bind.source} = { neededForBoot = lib.any utils.fsNeededForBoot (lib.attrValues (builtins.intersectAttrs cfg.${type}.mounts config.fileSystems)); };
};
})) [ "temp" "local" "remote" ]) ++ (map (type: (lib.mkIf (cfg.${type}.type == "zfs") (let
dataset = cfg.${type}.zfs.dataset;
in {
${prefix} = {
fs.zfs.enable = true;
fs.zfs.pools.${lib.head (lib.splitString "/" dataset)} = { }; # ensure the pool exists (all properties can be adjusted)
fs.keystore.keys."zfs/${dataset}" = lib.mkIf (type == "remote" && config.${prefix}.fs.keystore.enable) (lib.mkOptionDefault "random"); # the entire point of ZFS remote are backups, and those should be encrypted
fs.zfs.datasets = {
${dataset} = {
mount = false; props = { canmount = "off"; mountpoint = "/"; } // (if type == "temp" then { refreservation = "1G"; } // zfsNoSyncProps else { });
};
} // (if type == "temp" then {
"${dataset}/root" = {
mount = true; props = { canmount = "noauto"; mountpoint = "/"; }; mode = "755";
};
} else { }) // (lib.wip.mapMerge (target: { source, options, zfsProps, uid, gid, mode, ... }: {
"${dataset}/${source}" = {
mount = if lib.elem "noauto" options then "noauto" else true; inherit uid gid mode;
props = { canmount = "noauto"; mountpoint = target; } // zfsProps;
};
}) cfg.${type}.mounts);
};
fileSystems = lib.mapAttrs (target: args@{ extraFsOptions, ... }: (
extraFsOptions
)) cfg.${type}.mounts;
}))) [ "temp" "local" "remote" ])));
}