zrok/docs/guides/self-hosting/metrics-and-limits/configuring-limits.md
2024-06-12 17:06:29 -04:00

Configuring Limits

:::note This guide is current as of zrok version v0.4.31. :::

:::warning If you have not yet configured metrics, please visit the metrics guide first before working through the limits configuration. :::

Understanding the zrok Limits Agent

The limits agent is a component of the zrok controller. It can be enabled and configured through the zrok controller configuration.

The limits agent is responsible for controlling the number of resources in use (environments, shares, etc.) and also for ensuring that accounts are held below the configured data transfer bandwidth thresholds. The limits agent exists to manage resource consumption for larger, multi-user zrok installations.

Types of Limits

Limits can be specified that control the number of environments, shares, reserved shares, and unique names that can be created by an account. Limits that control the allowed number of resources are called resource count limits.

Limits can be specified to control the amount of data that can be transferred within a time period. Limits that control the amount of data that can be transferred are called bandwidth limits.

zrok limits can be specified globally, applying to all users in a service instance. Limit classes can be created to provide different levels of resource allocation. A single limit class can then be applied to multiple accounts, to alter their limit allocation beyond what's configured in the global configuration.

The Global Configuration

The reference configuration for the zrok controller (found at etc/ctrl.yaml in the repository) contains the global limits configuration, which looks like this:

# Service instance limits global configuration.
#
# See `docs/guides/metrics-and-limits/configuring-limits.md` for details.
#
limits:
  environments:     -1
  shares:           -1
  reserved_shares:  -1
  unique_names:     -1
  bandwidth:
    period:         5m
    warning:
      rx:           -1
      tx:           -1
      total:        7242880
    limit:
      rx:           -1
      tx:           -1
      total:        10485760
  enforcing:        false
  cycle:            5m

:::note A value of -1 in the limits configuration means the value is unlimited. :::

The enforcing boolean specifies whether limits are enabled in the service instance. By default, limits are disabled. No matter what else is configured in this stanza, if enforcing is set to false, no limits are placed on any account in the service instance.

The cycle value controls how frequently the limits agent will evaluate enforced limits. When a user exceeds a limit and has their shares disabled, the limits agent will evaluate their bandwidth usage on this interval looking to "relax" the limit once their usage falls below the threshold.

Global Resource Count Limits

The environments, shares, reserved_shares, and unique_names values specify the resource count limits globally for the service instance.

These resource counts will be applied to all users in the service instance by default.

Global Bandwidth Limits

The bandwidth section defines the global bandwidth limits for all users in the service instance.

There are two levels of bandwidth limits that can be specified in the global configuration. The first defines a warning threshold: when it is exceeded, the user receives an email telling them that their data transfer is elevated and that they will ultimately be subject to a limit.

The second defines the actual limit threshold, where the limits agent will disable traffic for the account's shares.

Bandwidth limits can be specified in terms of tx (transmitted data), rx (received data), and total (the bytes sent in either direction). If you only want to limit the total amount transferred, set rx and tx to -1 (unlimited). You can configure any combination of these values at either the limit or warning level.

The period specifies the time window for the bandwidth limit. See the documentation for Go's time.ParseDuration for details about the format used for these durations. If the period is set to 5 minutes, the limits agent monitors the account's transmitted and received traffic over the last 5 minutes, and if the amount of data is greater than either the warning or the limit threshold, action is taken.
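
The threshold check described above can be sketched in a few lines of Python. This is an illustration only, not the limits agent's implementation: the `Thresholds`, `exceeded`, and `evaluate` names are hypothetical, and the byte counts are assumed to have already been summed over one period.

```python
from dataclasses import dataclass

UNLIMITED = -1  # the configuration uses -1 to mean "no limit"


@dataclass
class Thresholds:
    """One level (warning or limit) of the bandwidth stanza; hypothetical type."""
    rx: int = UNLIMITED
    tx: int = UNLIMITED
    total: int = UNLIMITED


def exceeded(rx_bytes: int, tx_bytes: int, t: Thresholds) -> bool:
    """Return True if usage is greater than any configured (non -1) threshold."""
    checks = [
        (t.rx, rx_bytes),
        (t.tx, tx_bytes),
        (t.total, rx_bytes + tx_bytes),
    ]
    return any(limit != UNLIMITED and used > limit for limit, used in checks)


def evaluate(rx_bytes: int, tx_bytes: int, warning: Thresholds, limit: Thresholds) -> str:
    """Classify usage measured over one period as 'ok', 'warning', or 'limit'."""
    if exceeded(rx_bytes, tx_bytes, limit):
        return "limit"
    if exceeded(rx_bytes, tx_bytes, warning):
        return "warning"
    return "ok"


# Thresholds taken from the reference configuration shown earlier.
warning = Thresholds(total=7242880)
limit = Thresholds(total=10485760)

print(evaluate(3_000_000, 2_000_000, warning, limit))  # 5,000,000 bytes total
print(evaluate(5_000_000, 3_000_000, warning, limit))  # 8,000,000 bytes total
print(evaluate(6_000_000, 5_000_000, warning, limit))  # 11,000,000 bytes total
```

With the reference values, the three calls fall below the warning threshold, between the two thresholds, and above the limit threshold respectively.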

Limit Classes

The zrok limits agent includes a concept called limit classes. Limit classes can be used to define resource count and bandwidth limits that can be selectively applied to individual accounts in a service instance.

Limit classes are created by creating a record in the limit_classes table in the zrok controller database. The table has this schema:

CREATE TABLE public.limit_classes (
    id integer NOT NULL,
    backend_mode public.backend_mode,
    environments integer DEFAULT '-1'::integer NOT NULL,
    shares integer DEFAULT '-1'::integer NOT NULL,
    reserved_shares integer DEFAULT '-1'::integer NOT NULL,
    unique_names integer DEFAULT '-1'::integer NOT NULL,
    period_minutes integer DEFAULT 1440 NOT NULL,
    rx_bytes bigint DEFAULT '-1'::integer NOT NULL,
    tx_bytes bigint DEFAULT '-1'::integer NOT NULL,
    total_bytes bigint DEFAULT '-1'::integer NOT NULL,
    limit_action public.limit_action DEFAULT 'limit'::public.limit_action NOT NULL,
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL,
    updated_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL,
    deleted boolean DEFAULT false NOT NULL
);

This schema supports constructing the three different types of limit classes that the system supports.

After defining a limit class in the database, it can be applied to specific user accounts (overriding the relevant parts of the global configuration) by inserting a row into the applied_limit_classes table:

CREATE TABLE public.applied_limit_classes (
    id integer NOT NULL,
    account_id integer NOT NULL,
    limit_class_id integer NOT NULL,
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL,
    updated_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL,
    deleted boolean DEFAULT false NOT NULL
);

Create a row in this table linking the account_id to the limit_class_id to apply the limit class to a specific user account.
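
As a rough sketch of that linking step, the following Python script uses an in-memory SQLite database with simplified stand-ins for the two tables. The trimmed-down columns, the auto-assigned `id` values, and the account id `42` are assumptions made for illustration; a real installation would run the equivalent inserts against the controller database.

```python
import sqlite3

# Simplified stand-ins for the real tables; only the columns needed here.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE limit_classes (
        id INTEGER PRIMARY KEY,
        environments INTEGER NOT NULL DEFAULT -1,
        shares INTEGER NOT NULL DEFAULT -1
    );
    CREATE TABLE applied_limit_classes (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        limit_class_id INTEGER NOT NULL
    );
""")

# Create a limit class and remember the id the database assigned to it.
cur = db.execute(
    "INSERT INTO limit_classes (environments, shares) VALUES (?, ?)", (1, 1)
)
limit_class_id = cur.lastrowid

# Link the class to a hypothetical account id.
account_id = 42
db.execute(
    "INSERT INTO applied_limit_classes (account_id, limit_class_id) VALUES (?, ?)",
    (account_id, limit_class_id),
)
db.commit()

# Read the link back to confirm that the class is applied to the account.
applied = db.execute(
    "SELECT account_id, limit_class_id FROM applied_limit_classes"
).fetchall()
print(applied)
```

The important part is the second insert: one row in applied_limit_classes per (account, limit class) pair.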

Unscoped Resource Count Classes

To support overriding the resource count limits defined in the global limits configuration, a site administrator can create a limit class by inserting a row into the limit_classes table structured like this:

insert into limit_classes (environments, shares, reserved_shares, unique_names) values (1, 1, 1, 1);

This creates a limit class that sets the environments, shares, reserved_shares, and unique_names all to 1.

When this limit class is applied to a user account, those values override the default resource count values configured globally.

Applying an unscoped resource count class does not affect the bandwidth limits (either globally configured, or via a limit class).

Unscoped Bandwidth Classes

To support overriding the bandwidth limits defined in the global configuration, a site administrator can create a limit class by inserting a row into the limit_classes table structured like this:

insert into limit_classes (period_minutes, total_bytes, limit_action) values (2, 204800, 'limit');

This inserts a limit class that allows for a total bandwidth transfer of 204800 bytes every 2 minutes.
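
To get a feel for what such a class allows, the byte budget can be converted into more familiar units. The helper below is a hypothetical convenience, not part of zrok. The limit is a budget for the whole period, so the per-second figure is only an average.

```python
def describe(total_bytes: int, period_minutes: int) -> tuple[float, float]:
    """Return the budget in KiB and the average bytes/second it permits."""
    kib = total_bytes / 1024
    avg_bytes_per_second = total_bytes / (period_minutes * 60)
    return kib, avg_bytes_per_second


kib, rate = describe(204800, 2)
print(f"{kib:.0f} KiB per period, about {rate:.0f} bytes/second on average")
```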

When this limit class is applied to a user account, those values would override the default bandwidth values configured globally.

Applying an unscoped bandwidth class does not affect the resource count limits (either globally configured, or via a limit class).

Scoped Classes

A scoped limit class specifies both the resource counts (shares, reserved_shares, and unique_names, but NOT environments) and the bandwidth limits for a specific backend mode. Insert a row like this:

insert into limit_classes (backend_mode, shares, reserved_shares, unique_names, period_minutes, total_bytes, limit_action) values ('web', 2, 1, 1, 2, 4096000, 'limit');

Scoped limits are designed to increase the limits for a specific backend mode beyond what the global configuration and the unscoped classes provide. The general approach is to use the global configuration and the unscoped classes to provide the general account limits, and then the scoped classes can be used to further increase (or potentially decrease) the limits for a specific backend mode.

If a scoped limit class exists for a specific backend mode, then the limits agent will use that limit in making a decision about limiting the resource count or bandwidth. All other types of shares will fall back to the unscoped classes or the global configuration.
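
The fallback order described above can be illustrated with a short Python sketch for the share count. It is a simplification under stated assumptions rather than the agent's actual logic: `LimitClass` and `share_limit_for` are hypothetical names, every unscoped class in the list is assumed to be a resource count class, and `proxy` is used as an example of a backend mode without a scoped class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LimitClass:
    """Hypothetical, trimmed-down view of a limit_classes row."""
    shares: int
    backend_mode: Optional[str] = None  # None marks an unscoped class


def share_limit_for(backend_mode: str, global_shares: int, applied: list) -> int:
    """Pick the share count limit for one backend mode.

    Order: scoped class for this mode, then unscoped class, then global value.
    """
    for lc in applied:
        if lc.backend_mode == backend_mode:
            return lc.shares
    for lc in applied:
        if lc.backend_mode is None:
            return lc.shares
    return global_shares


# Classes applied to one account: an unscoped class plus a class scoped to 'web'.
applied = [LimitClass(shares=1), LimitClass(shares=2, backend_mode="web")]

print(share_limit_for("web", 3, applied))    # scoped class wins
print(share_limit_for("proxy", 3, applied))  # falls back to the unscoped class
print(share_limit_for("proxy", 3, []))       # falls back to the global value
```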

Limit Actions

When an account exceeds a bandwidth limit, the limits agent will seek to limit the affected shares (based on the combination of global configuration, unscoped limit classes, and scoped limit classes). It applies the limit by removing the underlying OpenZiti dial policies for any frontends that are trying to access the share.

This means that public frontends will simply return a 404 as if the share is no longer there. Private frontends will also return 404 errors. When the limit is relaxed, the dial policies are put back in place and the share will continue operating normally.

Unlimited Accounts

The accounts table in the database includes a limitless column. When this column is set to true the account is not subject to any of the limits in the system.

Experimental Limits Locking

zrok versions prior to v0.4.31 had a potential race condition when enforcing resource count limits. This usually only manifested in cases where shares or environments were being allocated programmatically (and fast enough to win the limits race).

The race occurred because there was no transactional database locking around the limited structures. v0.4.31 includes a pessimistic locking facility that can be enabled only on the PostgreSQL store implementation.

If you're running PostgreSQL for your service instance and you want to enable the new experimental locking facility that eliminates the potential resource count race condition, add the enable_locking: true flag to your store definition:

store:
  enable_locking: true

Caveats

There are a number of caveats that are important to understand when using the limits agent with more complicated limits scenarios:

Aggregate Bandwidth

The zrok limits agent is a work in progress. The system currently does not track bandwidth individually for each backend mode type, which means all bandwidth values are aggregated between all of the share types that an account might be using. This will likely change in an upcoming release.

Administration Through SQL

The current release has no administrative API endpoints (or corresponding CLI tools) for creating and applying limit classes. The limits agent infrastructure was designed to support software integrations that directly manipulate the underlying database structures.

A future release may provide API and CLI tooling to support the human administration of the limits agent.

Performance

Be sure to minimize the number of different periods used for specifying bandwidth limits. Specifying limits in multiple different periods can cause a multiplicity of queries to be executed against the metrics store (InfluxDB). Standardizing on a period like 24h or 6h and using that consistently is the best way to manage the performance of the metrics store.
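
A quick way to audit this is to count how many distinct periods are in use across the global configuration and the limit classes. The sketch below is a hypothetical helper, not a zrok tool. It assumes the periods have already been read out as minutes, for example from the period_minutes column.

```python
from collections import Counter


def period_usage(global_period_minutes: int, class_periods_minutes: list) -> Counter:
    """Count how many limit definitions use each bandwidth period."""
    return Counter([global_period_minutes, *class_periods_minutes])


# Global period of 5 minutes plus three limit classes (2, 2, and 1440 minutes).
usage = period_usage(5, [2, 2, 1440])
print(len(usage), "distinct periods:", sorted(usage))
```

Here three distinct periods are in use. Moving every definition to a single shared period would bring that count down to one.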