Core ↔︎ Runner Architecture (Hybrid)
Decision: Core–Runner communication uses Protobuf (gRPC). Runner–Plugin communication remains HTTP/JSON for maximum on‑prem compatibility.
Why
- On‑prem environments have strict network policies; inbound connectivity is hard to obtain.
- Runner can maintain outbound connectivity to Core over 443.
- Protobuf/gRPC gives a strongly-versioned contract, codegen SDKs, and efficient streaming.
Topology
- Core (Cloud)
  - Workflow definitions, scheduling, policy, tenant management, audit logs
  - Dispatches jobs to runners
- Runner (Customer on‑prem)
  - Executes jobs (calls internal APIs, runs plugins)
  - Reports results/events back to Core
- Plugins (Customer on‑prem)
  - Slack/Email/Telegram/Kakao provider adapters etc.
  - Exposed locally over HTTP/JSON
Connectivity pattern (recommended)
Runner initiates and keeps a long-lived connection:
- Runner → Core: Connect() streaming RPC
- Core → Runner: sends Job messages over the stream
- Runner → Core: sends JobResult, Heartbeat, LogEvent over the same stream
This avoids inbound firewall rules at customer sites.
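The pattern above can be sketched with an in-memory stand-in for the stream. This is only an illustration, not the actual Haskell implementation: the class names, field names, and the queue-based Stream are assumptions made for the example, and a real deployment would use the generated gRPC stubs instead.

```python
import queue
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    payload: str


@dataclass
class JobResult:
    job_id: str
    ok: bool


@dataclass
class Heartbeat:
    runner_id: str


class Stream:
    """In-memory stand-in for one bidirectional Connect() stream (hypothetical)."""

    def __init__(self):
        self.to_runner = queue.Queue()  # Core -> Runner direction
        self.to_core = queue.Queue()    # Runner -> Core direction


class Core:
    def __init__(self):
        self.streams = {}  # runner_id -> Stream, present only while connected

    def connect(self, runner_id):
        # Called by the Runner, so the connection is outbound from the customer site.
        stream = Stream()
        self.streams[runner_id] = stream
        return stream

    def dispatch(self, runner_id, job):
        # Core can only push work over a stream the Runner has already opened.
        self.streams[runner_id].to_runner.put(job)


def runner_step(runner_id, stream):
    """Send a heartbeat, then handle one pending job if there is one."""
    stream.to_core.put(Heartbeat(runner_id))
    try:
        job = stream.to_runner.get_nowait()
    except queue.Empty:
        return
    # A real runner would call internal APIs or plugins here.
    stream.to_core.put(JobResult(job.job_id, ok=True))
```

In this sketch Core never opens a connection itself: `dispatch` fails with a `KeyError` if the Runner has not called `connect` first, which mirrors the "no inbound firewall rules" property.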
Current implementation status
- Core gRPC server and Runner gRPC client are implemented in Haskell.
- Core dispatches Job messages and accepts JobResult, Heartbeat, and LogEvent over the same stream.
- Runner reconnect loop is implemented.
- Runner now persists an outbound disk queue and replays queued Heartbeat, JobResult, and LogEvent messages after reconnect.
- Core exposes internal runner enrollment, token rotation, and Core-signed CSR renewal endpoints, enforces runner denylist state, and persists runner lifecycle telemetry for inventory views.
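As an illustration of the outbound disk queue and replay behaviour, here is a minimal sketch. The JSON-lines file layout, the DiskQueue class, and the convention that `send` raises ConnectionError on failure are all assumptions for the example; the document does not specify the real on-disk format.

```python
import json
import os


class DiskQueue:
    """Append-only JSON-lines spool for outbound messages (hypothetical layout)."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, message):
        # Append and fsync so the message survives a process crash.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def replay(self, send):
        """Send queued messages in order; keep whatever could not be sent."""
        messages = self.pending()
        sent = 0
        try:
            for message in messages:
                send(message)  # assumed to raise ConnectionError on failure
                sent += 1
        except ConnectionError:
            pass  # stream dropped again; the rest stays queued
        finally:
            self._rewrite(messages[sent:])
        return sent

    def _rewrite(self, remaining):
        # Write a temp file and rename it, so a crash never leaves a half-written spool.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for message in remaining:
                f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
```

This gives at-least-once delivery: a crash between a successful send and the rewrite would replay that message again, so the receiving side would need to tolerate duplicates.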
Security (baseline)
- Current implementation supports per-runner auth tokens issued by Core enrollment and rotation endpoints, with shared-token fallback for compatibility.
- Core can sign runner CSRs when issuer CA material is configured, while the older local renewal-command path remains as an operator-managed fallback.
- TLS/mTLS is supported by deployment config; local/dev setups may run plaintext inside the trusted network.
- Enrollment uses a configured registration token plus signed internal auth for the bootstrap HTTP path.
- Every message includes tenant_id + runner_id and is authorization-checked on Core.
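The per-message check could look roughly like the following sketch. The Envelope and RunnerRegistry names, the choice to store SHA-256 token hashes, and the in-memory denylist are assumptions for illustration; the shared-token fallback and mTLS identity checks are left out.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    tenant_id: str
    runner_id: str
    token: str
    body: str


def _digest(token):
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class RunnerRegistry:
    """Hypothetical Core-side view of enrolled runners."""

    def __init__(self):
        self._token_hashes = {}  # (tenant_id, runner_id) -> sha256 hex of token
        self._denylist = set()   # runner_ids that must be rejected

    def enroll(self, tenant_id, runner_id, token):
        # Calling enroll again for the same runner acts as token rotation.
        self._token_hashes[(tenant_id, runner_id)] = _digest(token)

    def deny(self, runner_id):
        self._denylist.add(runner_id)

    def authorize(self, env):
        if env.runner_id in self._denylist:
            return False
        expected = self._token_hashes.get((env.tenant_id, env.runner_id))
        if expected is None:
            return False  # unknown runner, or runner claimed the wrong tenant
        # Constant-time comparison avoids leaking how much of the token matched.
        return hmac.compare_digest(expected, _digest(env.token))
```

Keying the lookup on the (tenant_id, runner_id) pair means a valid token presented under a different tenant is rejected, which is the point of carrying both identifiers on every message.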
Versioning
- Protobuf package versioned as aisopsflow.core.runner.v1.
- Non-breaking changes only within v1.
- Breaking changes -> v2 package.
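To make "non-breaking" concrete, the sketch below compares two versions of a message described as a map from field name to (tag number, type). Adding a field with a fresh tag number is compatible in Protobuf; this check conservatively treats removals, tag or type changes, and tag reuse as breaking. The function name and the example field numbers are hypothetical and not taken from the real .proto files.

```python
def breaking_changes(old, new):
    """Compare {field_name: (tag, type)} maps; return reasons a change is breaking."""
    problems = []
    old_tags = {tag: name for name, (tag, _) in old.items()}
    for name, (tag, ftype) in old.items():
        if name not in new:
            problems.append(f"field {name} removed")
        elif new[name] != (tag, ftype):
            problems.append(f"field {name} changed tag or type")
    for name, (tag, _) in new.items():
        # A new field must not take over a tag number that v1 already used.
        if name not in old and tag in old_tags:
            problems.append(f"tag {tag} reused by new field {name}")
    return problems
```

An empty result means the change can stay in the v1 package under this conservative rule; anything else would be a candidate for v2.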
Failure modes
- If the connection drops:
  - Runner reconnects automatically
  - Core marks the runner offline by removing the live stream binding
  - Durable local queueing keeps unsent outbound messages on disk until the stream is healthy again
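A reconnect loop of the kind described above might be sketched as follows. The capped exponential backoff, the parameter names, and the use of ConnectionError to signal a failed attempt are assumptions; the document only states that the Runner reconnects automatically.

```python
import random


def backoff_delays(base=1.0, cap=60.0, jitter=0.0, rng=random.random):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    delay = base
    while True:
        # jitter > 0 spreads reconnects out so many runners do not retry in lockstep.
        yield min(cap, delay) * (1.0 + jitter * rng())
        delay = min(cap, delay * 2)


def connect_with_retry(connect, sleep, max_attempts=None, **backoff):
    """Call connect() until it succeeds, sleeping between failed attempts."""
    delays = backoff_delays(**backoff)
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(next(delays))
```

Passing `time.sleep` as `sleep` gives real waiting; injecting it as a parameter keeps the loop easy to exercise without delays. Once `connect_with_retry` returns a stream, the Runner would replay its disk queue before sending new messages.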
Next concrete steps
- Add certificate fingerprint or CRL/OCSP-style revocation if runner-id denylist is not strong enough
- Decide whether backup import should gain destructive replace/rollback semantics