Adding RFD for agent metrics #295
Conversation
@forsythetony thanks for this! However, I think this has high overlap with #276, which just moved to the draft stage. I'd love to consolidate efforts around this. Feel free to open a discussion in Zulip if you want to collaborate with @codefromthecrypt on how to best specify some of this!
Feel free to correct me if I have missed something here, but I think at the protocol level we should likely define how this data is propagated; I'm not sure the protocol needs to define how it gets exported. @codefromthecrypt, feel free to chime in here as well.
**TL;DR:** Instead of tunneling telemetry over ACP (which could block agent messages), clients should run a local OTLP proxy and pass standard OTEL environment variables that point agents at it.

Thanks for starting this discussion. I'll set aside trace propagation for now, since PR #276 already addresses that. For end-to-end telemetry, I see two distinct concerns:

1. Transport: how telemetry gets from the agent process to an observability backend.
2. Semantics: what the spans, metrics, and logs mean (names, attributes, formats).

This comment focuses on transport.

### Why not tunnel telemetry over ACP?

Sending telemetry as JSON-RPC messages on the ACP transport has drawbacks:

- Telemetry volume shares the pipe with system-critical agent messages and can delay or block them.
- Agents would need ACP-specific export code instead of relying on standard OTEL SDK auto-configuration.
### Alternative: OTEL side-channel via environment variables

The standard approach in observability is "do no harm": telemetry should not interfere with the primary value path. A side-channel keeps agent communication separate from telemetry export.

ACP's stdio transport defines how clients launch agents as subprocesses, so the client controls the agent's environment at launch time. We can configure agents using standard OTEL environment variables, as sketched below.
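For example, a minimal sketch of the variables involved (the values are illustrative; the endpoint is whatever local address the client chooses, and none of this is defined by ACP):

```typescript
// Standard OpenTelemetry SDK environment variables -- nothing ACP-specific.
// Values are illustrative; the client substitutes its own local endpoint.
const otelEnv: Record<string, string> = {
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://127.0.0.1:4318", // where the agent's SDK sends OTLP/HTTP
  OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf",          // OTLP encoding over HTTP
  OTEL_SERVICE_NAME: "example-agent",                    // identifies the agent in telemetry
};
```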
This means agent authors write and test against standard OTEL auto-configuration; the cognitive load stays with OTEL documentation, not ACP.

### The problem: where should telemetry go?

The agent cannot know where the client wants telemetry routed, and different deployments have different needs. Sharing the client's OTEL credentials with every agent is problematic: it leaks secrets and forces all agents to use the same destination.

### Proposal: ephemeral collector proxy

I proposed a similar "telemetry hub" idea for MCP in block/goose#2078, and the same pattern applies here: the client runs a lightweight OTLP listener that proxies telemetry to its configured backend. This follows the OpenTelemetry agent/gateway pattern commonly used in container deployments.

### Architecture

```mermaid
flowchart TB
subgraph Client["Client/Editor"]
direction LR
C[ACP Handler]
P[OTLP Proxy]
E[OTLP Exporter]
C ~~~ P
P --> E
end
subgraph Agent["Agent Process"]
A[ACP Agent]
ASDK[OTEL SDK]
A ~~~ ASDK
end
subgraph Backend["Observability Backend"]
B[(Collector)]
end
C <-->|"stdio"| A
ASDK -->|"HTTP"| P
E -->|"HTTP + creds"| B
### Discovery is pre-initialization

There is a chicken-and-egg problem here: environment variables must be set when launching the subprocess, but ACP capability exchange happens after the connection is established. The client needs to know whether to inject OTEL environment variables before it launches the agent. This means OTLP capability discovery happens outside the normal ACP initialization flow. A few options:

- Inject the OTEL environment variables optimistically for every agent, and let agents that emit no telemetry simply ignore them.
- Advertise OTLP support in the agent's registry metadata, so the client knows before launch.
The optimistic approach is pragmatic: environment variables are low-cost, and well-behaved OTEL SDKs gracefully handle misconfiguration. But a proper registry entry provides cleaner semantics.
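Purely as an illustration of what such a registry entry could advertise (the field name and shape are hypothetical, not anything specified today):

```typescript
// Hypothetical registry metadata advertising OTLP support; `otlp: true`
// tells the client it is worth starting the proxy and injecting OTEL env vars.
interface AgentRegistryEntry {
  name: string;      // agent identifier in the registry
  command: string[]; // how the client launches the agent subprocess
  otlp?: boolean;    // agent emits OTLP telemetry when OTEL env vars are set
}
```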
### Sequence

```mermaid
sequenceDiagram
participant Registry as Agent Registry
participant Client as Client/Editor
participant Agent
Client->>Registry: Lookup agent metadata
Registry-->>Client: agent.json (includes otlp: true)
Note over Client: Start OTLP proxy if needed
Client->>Agent: Launch subprocess with OTEL ENV
Note over Client,Agent: Connection established (stdio)
Client->>Agent: initialize request
Agent-->>Client: initialize response
Client->>Agent: session/new
Agent-->>Client: sessionId
Note over Agent: OTEL SDK auto-configured<br/>from environment
Agent-)Client: OTLP/HTTP (out of band)
```

### Environment variable injection

The client passes OTEL configuration when launching the agent subprocess, as sketched below.
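A sketch of what the launch could look like (assuming a Node/TypeScript client; the command name, port, and values are illustrative):

```typescript
import { spawn } from "node:child_process";

// The agent only ever sees the local proxy address. A standard OTEL SDK
// picks these variables up automatically, with no ACP-specific code.
const agent = spawn("example-acp-agent", {
  env: {
    ...process.env,
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://127.0.0.1:4318", // local proxy, not the real backend
    OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf",
    OTEL_SERVICE_NAME: "example-acp-agent",
  },
  stdio: ["pipe", "pipe", "inherit"], // stdin/stdout carry ACP JSON-RPC
});
```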
The client's own configuration might include secrets, such as an API key or authorization header for its backend. The proxy intercepts agent telemetry and forwards it with the real credentials, so no secrets are exposed to agents.

### Implementation cost

The main cost is that clients need code to run an OTLP proxy. This is a well-understood pattern with library support in most languages, and it is simpler than many other features in the editor/agent exchange.

### Next steps

I would suggest moving this to an RFD to work through the details. The pre-initialization discovery question needs more thought, and there may be other edge cases worth exploring (multiple agents, proxy lifecycle, error handling).

Once transport is in place, we can iterate on telemetry semantics (span names, metric definitions, log formats). Those discussions can take time; some tracing semantics evolve over years. Decoupling transport from semantics lets us make progress now.

Happy to draft the RFD if there's interest.
Okay, sorry for the misunderstanding here. Happy to have folks work on exploring what is needed here and figure out how much should be guidance, suggestions, or spec requirements.
@benbrandt I think what I'll do is go ahead and put up the RFD version of my comment, since any schema for metrics or traces (or logs, because believe it or not, they also have a schema!) would layer on the concept of a working transport. Worst case, I close that PR.
As promised, here's the base RFD: #298
Wow, I really appreciate the quick response and the in-depth comment. I just did a quick read-through and agree with what you've laid out; not clogging the system-critical pipe with metrics data is definitely the way to go. I'll take a deeper look at your new PR tomorrow and comment with thoughts.
Awesome. I think we can then close this one in favor of #298 and let you both collaborate on that one.