Core ↔︎ Runner Architecture (Hybrid)

Decision: Core–Runner communication uses gRPC with Protobuf messages. Runner–Plugin communication remains HTTP/JSON for maximum on‑prem compatibility.

Why

  • On‑prem networks are locked down; inbound connectivity from the cloud is hard to obtain.
  • The Runner can instead keep an outbound connection to Core open over port 443.
  • Protobuf/gRPC gives a strongly versioned contract, generated SDKs, and efficient streaming.

Topology

  • Core (Cloud)
    • Workflow definitions, scheduling, policy, tenant management, audit logs
    • Dispatches jobs to runners
  • Runner (Customer on‑prem)
    • Executes jobs (calls internal APIs, runs plugins)
    • Reports results/events back to Core
  • Plugins (Customer on‑prem)
    • Slack/Email/Telegram/Kakao provider adapters etc.
    • Exposed locally over HTTP/JSON
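
To make the Runner–Plugin hop concrete, here is a minimal Python sketch of a Runner building a JSON request for a local plugin. It is illustrative only: the base URL, the `/v1/invoke` route, and the `action`/`payload` body shape are assumptions, not part of the documented plugin contract.

```python
import json
import urllib.request

# Hypothetical plugin address; the real host and port depend on the deployment.
PLUGIN_BASE_URL = "http://127.0.0.1:8081"


def build_plugin_request(action: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a local plugin."""
    body = json.dumps({"action": action, "payload": payload}).encode("utf-8")
    return urllib.request.Request(
        url=f"{PLUGIN_BASE_URL}/v1/invoke",  # assumed route, not from the source
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_plugin(action: str, payload: dict, timeout_s: float = 5.0) -> dict:
    """Send the request and decode the JSON reply; raises on HTTP or network errors."""
    request = build_plugin_request(action, payload)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the plugin is reached over plain HTTP/JSON on the local network, an adapter can be written in any language without a gRPC toolchain.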

Current implementation status

  • Core gRPC server and Runner gRPC client are implemented in Haskell.
  • Core dispatches Job messages and accepts JobResult, Heartbeat, and LogEvent over the same stream.
  • Runner reconnect loop is implemented.
  • Runner persists outbound messages in an on-disk queue and replays queued Heartbeat, JobResult, and LogEvent messages after reconnect.
  • Core exposes internal endpoints for runner enrollment, token rotation, and Core-signed CSR renewal. It also enforces runner denylist state and persists runner lifecycle telemetry for inventory views.
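
The single-stream message flow described above can be pictured as a small set of tagged message types. The Python sketch below is a language-neutral illustration, not the Haskell implementation; the message names come from this document, but every field name is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Job:  # Core -> Runner
    job_id: str
    payload: str


@dataclass(frozen=True)
class JobResult:  # Runner -> Core
    job_id: str
    ok: bool


@dataclass(frozen=True)
class Heartbeat:  # Runner -> Core
    sent_at_unix: int


@dataclass(frozen=True)
class LogEvent:  # Runner -> Core
    job_id: str
    line: str


# Everything the Runner may send upstream on the shared stream.
RunnerToCore = Union[JobResult, Heartbeat, LogEvent]


def handle_runner_message(message: RunnerToCore) -> str:
    """Route one Runner -> Core message by its concrete type."""
    if isinstance(message, JobResult):
        status = "succeeded" if message.ok else "failed"
        return f"job {message.job_id} {status}"
    if isinstance(message, Heartbeat):
        return f"heartbeat at {message.sent_at_unix}"
    if isinstance(message, LogEvent):
        return f"log for job {message.job_id}: {message.line}"
    # Job flows in the other direction, so Core rejects it here.
    raise TypeError(f"unexpected message type: {type(message).__name__}")
```

In Protobuf terms this would typically be a `oneof` inside one envelope message per direction, which lets the stream carry several message kinds without extra RPCs.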

Security (baseline)

  • Current implementation supports per-runner auth tokens issued by Core enrollment and rotation endpoints, with shared-token fallback for compatibility.
  • Core can sign runner CSRs when issuer CA material is configured, while the older local renewal-command path remains as an operator-managed fallback.
  • TLS/mTLS is supported by deployment config; local/dev setups may run plaintext inside the trusted network.
  • Enrollment uses a configured registration token plus signed internal auth for the bootstrap HTTP path.
  • Every message carries tenant_id and runner_id, and Core runs an authorization check on each one.
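
As a rough illustration of the per-message check, the Python sketch below combines the denylist, per-runner tokens, and the shared-token fallback. The `RunnerRegistry` shape and the order of the checks are assumptions made for this example. In particular, ignoring the shared token once a per-runner token exists is a design guess, not documented behavior.

```python
import hmac
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class RunnerRegistry:
    """In-memory stand-in for Core's runner state (hypothetical shape)."""

    # (tenant_id, runner_id) -> token issued by enrollment or rotation
    runner_tokens: Dict[Tuple[str, str], str] = field(default_factory=dict)
    denylist: Set[str] = field(default_factory=set)  # denied runner_ids
    shared_token: Optional[str] = None  # compatibility fallback, if configured


def tokens_equal(expected: str, presented: str) -> bool:
    """Compare tokens in constant time to avoid leaking timing information."""
    return hmac.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))


def authorize(registry: RunnerRegistry, tenant_id: str, runner_id: str, token: str) -> bool:
    """Return True if a message from this tenant/runner pair should be accepted."""
    if runner_id in registry.denylist:
        return False  # the denylist wins over any valid token
    expected = registry.runner_tokens.get((tenant_id, runner_id))
    if expected is not None:
        return tokens_equal(expected, token)
    if registry.shared_token is not None:
        return tokens_equal(registry.shared_token, token)  # legacy fallback
    return False
```

Keying the token lookup on the (tenant_id, runner_id) pair means a valid token for one tenant cannot be replayed under another tenant's identifier.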

Versioning

  • Protobuf package versioned as aisopsflow.core.runner.v1.
  • Within v1, only non-breaking changes are allowed.
  • Breaking changes go into a new v2 package.
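
One way to read "non-breaking" is as a check over field numbers. The Python sketch below applies a deliberately conservative rule of thumb, not the full Protobuf compatibility rules: some type changes are wire-compatible, and renames do not change the wire format but do break generated code. The schema representation and the example field names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# field number -> (field name, field type)
Fields = Dict[int, Tuple[str, str]]


def breaking_changes(old: Fields, new: Fields, reserved: Set[int]) -> List[str]:
    """List changes between two versions of one message that should force v2."""
    problems: List[str] = []
    for number, (name, field_type) in sorted(old.items()):
        if number not in new:
            # Removing a field is tolerated only if its number is reserved.
            if number not in reserved:
                problems.append(f"field {number} ({name}) removed without reserving its number")
            continue
        new_name, new_type = new[number]
        if new_type != field_type:
            problems.append(f"field {number} ({name}) changed type {field_type} -> {new_type}")
        if new_name != name:
            problems.append(f"field {number} renamed {name} -> {new_name}")
    # A reserved number must never be reused by a live field.
    for number in sorted(reserved & set(new)):
        problems.append(f"field number {number} is reserved but in use")
    return problems
```

Adding a field under a fresh number yields an empty list, so it can ship within v1; any reported problem is a signal to move the change to the v2 package.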

Failure modes

  • If the connection drops:
    • Runner reconnects automatically.
    • Core marks the runner offline by removing the live stream binding.
    • Durable local queueing on the Runner keeps unsent outbound messages on disk until the stream is healthy again.
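
The queue-and-replay behavior can be sketched as an append-only file that is drained in order once the stream is back. This Python example is a minimal illustration under assumptions: the JSON-lines format, the `DiskQueue` name, and the `send` callback are hypothetical and do not describe the Runner's actual on-disk layout.

```python
import json
import os
from typing import Callable, List


class DiskQueue:
    """Append-only JSON-lines queue for outbound messages (hypothetical format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def enqueue(self, message: dict) -> None:
        """Append one message and force it to disk before returning."""
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(message) + "\n")
            handle.flush()
            os.fsync(handle.fileno())

    def pending(self) -> List[dict]:
        """Return all queued messages, oldest first."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, "r", encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]

    def replay(self, send: Callable[[dict], bool]) -> int:
        """Send queued messages in order, stopping at the first failure.

        Returns the number of messages sent; unsent ones stay on disk.
        """
        messages = self.pending()
        sent = 0
        for message in messages:
            if not send(message):
                break
            sent += 1
        self._rewrite(messages[sent:])
        return sent

    def _rewrite(self, remaining: List[dict]) -> None:
        """Atomically replace the queue file with the unsent messages."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            for message in remaining:
                handle.write(json.dumps(message) + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, self.path)
```

If the process dies after a send but before the rewrite, the same message is replayed on the next reconnect. A queue like this therefore gives at-least-once delivery, and the receiver would need to tolerate duplicates.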

Next concrete steps

  • Add certificate fingerprint or CRL/OCSP-style revocation if the runner-id denylist is not strong enough.
  • Decide whether backup import should gain destructive replace/rollback semantics.