mTLS for Core ↔︎ Runner (v0.1)

Decision: - Core↔︎Runner gRPC supports TLS/mTLS in deployment config. - Dev tooling can generate short-lived local certs for test environments. - Core is behind a L7 proxy/LB that terminates TLS for gRPC (HTTP/2).

Current repo status: - Core and Runner can run plaintext on a trusted internal network, or TLS/mTLS when cert paths are configured. - Runner can request a Core-signed certificate renewal by submitting a CSR over the authenticated internal management endpoint when issuer CA material is configured. - Runner can still execute a local renewal command before client-cert expiry and notify Core through the same authenticated internal endpoint as an operator-managed fallback. - Runner authentication now supports Core-issued per-runner tokens with enrollment and rotation; the legacy shared token remains a compatibility fallback. - Core can denylist runner IDs so revoked runners cannot reconnect, rotate tokens, or renew certificates until the denylist entry is cleared.

Goals

  • Strongly authenticate each runner (device) to Core.
  • Allow rapid credential revocation (compromise/employee offboarding).
  • Avoid inbound firewall rules at customer sites (Runner uses outbound 443).

High-level design

Components

  • Core (cloud): job orchestration + policy + audit + runner registry.
  • Edge proxy (cloud): Envoy/ALB/NLB/NGINX that terminates TLS and forwards gRPC to Core.
  • Runner (on-prem): keeps an outbound streaming gRPC connection to Core.

Certificates

  • Root CA: controlled by the service operator.
  • Server certificate: used by the edge proxy (public DNS name).
  • Runner client certificates:
    • 1 certificate per runner
    • Subject/SAN contains runner_id and optional tenant_id
    • short-lived certs are recommended; local dev scripts generate sample certs

Current renewal model

  • Runner has a current client cert + key.
  • Runner reconnects periodically and presents its cert.
  • Before expiry (e.g. T-7 days), runner can execute a configured local renewal command and then report the new cert expiry back to Core telemetry.

Current renewal options

  1. Core-signed CSR renewal
    • Runner generates a CSR locally from its configured client key.
    • Runner sends the CSR over the authenticated internal management path.
    • Core signs and returns a new cert plus CA bundle when issuer config is present.
  2. Operator-managed local renewal
    • Runner executes a local command and reports the new cert expiry back to Core.

Revocation / offboarding

Current pragmatic approach: - Keep a Core-side denylist of runner_id (and/or cert fingerprint). - Use short expiry (30 days) to cap compromise window.

Future options: - CRL/OCSP if needed for stricter compliance.

Operational notes

  • The edge proxy must be configured for:
    • gRPC over HTTP/2
    • require_client_certificate: true
    • trusted client CA bundle
  • LB/proxy forwards client identity to Core (either via mTLS passthrough or validated headers; validated headers only).

Dev/test

See certs/dev/ and scripts/dev-mtls-certs.sh for local self-signed CA + server/client cert generation.