Permissions, Rate Limits & Quotas
Three independent mechanisms control what a user can do:
- Permissions — can this user call this model at all?
- Rate limits — how fast can they call it?
- Quotas — how much total cost can they accrue?
All three are matched per request before the upstream call. Each one can deny on its own, returning a specific error code.
Permissions
Section titled “Permissions”A permission row is a (user, provider, model_pattern) tuple:
[[permissions]]user_name = "alice"provider_name = "openai-main"model_pattern = "gpt-*"model_patternis a glob (*,?) against the incoming model name before alias resolution. A permission onchat-*matches the aliaschat-default; a permission ongpt-*matches the raw idgpt-4.1-mini.provider_namecan be omitted to grant access across all providers (the match still goes through permission → rewrite → alias → execute).- A user with no matching permission row gets
403 forbidden: model.
File permissions
Section titled “File permissions”[[file_permissions]] is a separate table that grants a user the ability
to call file-related upstream endpoints (upload, retrieve, delete) on a
provider. The structure is simpler:
[[file_permissions]]user_name = "alice"provider_name = "openai-main"Rate limits
Section titled “Rate limits”Rate limits are scoped to (user, model_pattern) and can enforce any or
all of three counters:
| Field | Meaning |
|---|---|
rpm | Requests per minute. |
rpd | Requests per day. |
total_tokens | Total tokens in a rolling window. |
[[rate_limits]]user_name = "alice"model_pattern = "gpt-*"rpm = 60rpd = 10000total_tokens = 200000Counters are maintained in memory and reconciled through the
RateLimitGC worker, which garbage-collects expired windows so the
process doesn’t accumulate dead counters.
A request that trips any counter returns 429 rate_limited with headers
indicating which counter fired and when the window resets.
Quotas
Section titled “Quotas”A quota is a single USD-denominated ceiling per user:
[[quotas]]user_name = "alice"quota = 100.0 # Ceiling in USDcost_used = 0.0 # Consumed cost; increased by usage accounting- The
UsageSinkworker writes cost intocost_usedas requests complete. - The
QuotaReconcilerworker periodically recomputescost_usedfrom recorded usage rows, so the live counter self-corrects if anything drifts. - A request whose projected cost would push
cost_usedoverquotareturns402 quota_exceeded.
Quota is a billing ceiling, not a safety ceiling — it only exists for
models and aliases that have pricing configured. Models without prices
contribute $0 to cost_used.
Order of evaluation
Section titled “Order of evaluation”For a given request, the checks run in this order:
- Auth — resolve API key to user.
- Permission — is there a matching
[[permissions]]row? If not →403 forbidden: model. - Quota — is
cost_used < quota? If not →402 quota_exceeded. - Rate limit — is the matching counter within its ceiling? If not
→
429 rate_limited. - Execute — route through
rewrite → alias → execute.
So a permission denial always beats a rate-limit denial, which always beats execution errors from upstream.
Runtime management
Section titled “Runtime management”Everything described here can be edited at runtime from the console’s
Users → Permissions / Rate limits / Quota workspace, or via the admin
API under /admin/permissions, /admin/rate_limits, /admin/quotas.
Changes are applied on the next request; there is no need to reload.