
@forsythetony commented Dec 3, 2025

$$$$$$$\   $$$$$$\        $$\   $$\  $$$$$$\ $$$$$$$$\       $$\      $$\ $$$$$$$$\ $$$$$$$\   $$$$$$\  $$$$$$$$\ 
$$  __$$\ $$  __$$\       $$$\  $$ |$$  __$$\\__$$  __|      $$$\    $$$ |$$  _____|$$  __$$\ $$  __$$\ $$  _____|
$$ |  $$ |$$ /  $$ |      $$$$\ $$ |$$ /  $$ |  $$ |         $$$$\  $$$$ |$$ |      $$ |  $$ |$$ /  \__|$$ |      
$$ |  $$ |$$ |  $$ |      $$ $$\$$ |$$ |  $$ |  $$ |         $$\$$\$$ $$ |$$$$$\    $$$$$$$  |$$ |$$$$\ $$$$$\    
$$ |  $$ |$$ |  $$ |      $$ \$$$$ |$$ |  $$ |  $$ |         $$ \$$$  $$ |$$  __|   $$  __$$< $$ |\_$$ |$$  __|   
$$ |  $$ |$$ |  $$ |      $$ |\$$$ |$$ |  $$ |  $$ |         $$ |\$  /$$ |$$ |      $$ |  $$ |$$ |  $$ |$$ |      
$$$$$$$  | $$$$$$  |      $$ | \$$ | $$$$$$  |  $$ |         $$ | \_/ $$ |$$$$$$$$\ $$ |  $$ |\$$$$$$  |$$$$$$$$\ 
\_______/  \______/       \__|  \__| \______/   \__|         \__|     \__|\________|\__|  \__| \______/ \________|                                                                                                             

@forsythetony requested a review from a team as a code owner December 3, 2025 06:25
@benbrandt
Member

@forsythetony thanks for this!

However, I think this has high overlap with #276, which just moved to draft stage. I'd love to consolidate efforts around this. Feel free to open a discussion in Zulip if you want to collaborate with @codefromthecrypt on how to best specify some of this!

@benbrandt closed this Dec 3, 2025
@benbrandt
Member

Feel free to correct me if I have missed something here, but I think at the protocol level we should likely define how this data is propagated; I'm not sure the protocol needs to define how it gets exported.

@codefromthecrypt feel free to chime in here as well

@codefromthecrypt
Contributor

TL;DR: Instead of tunneling telemetry over ACP (which could block agent messages), clients should run a local OTLP proxy and pass standard OTEL_* environment variables when launching agents. This keeps telemetry out-of-band, avoids leaking secrets, and lets agent authors use normal OTEL SDKs. Discovery of OTLP support would happen pre-initialization (via registry metadata or manual config) since ENV must be set before subprocess launch. I'd suggest an RFD to iron out details.


Thanks for starting this discussion. I'll set aside trace propagation for now since PR #276 already addresses that via params._meta with W3C trace context. That work is vital for sampling and correlation, but it's separate from the transport question here.

For end-to-end telemetry, I see two distinct concerns:

  1. Telemetry transport: where and how to send logs, metrics, and traces
  2. Telemetry semantics: what spans, metrics, and log formats to standardize

This comment focuses on transport.

Why not tunnel telemetry over ACP?

Sending telemetry as JSON-RPC messages on the ACP transport has drawbacks:

  • Head-of-line blocking: telemetry traffic could delay more important agent messages on stdio
  • Coupling: ties agent authors to ACP-specific telemetry code instead of standard OTEL SDKs
  • Support burden: places telemetry implementation complexity on the ACP project

Alternative: OTEL side-channel via environment variables

The standard approach in observability is "do no harm" - telemetry should not interfere with the primary value path. A side-channel keeps agent communication separate from telemetry export.

ACP's stdio transport defines how clients launch agents as subprocesses. The client controls the agent's environment at launch time. We can configure agents using standard OTEL environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

This means agent authors write and test against standard OTEL auto-configuration. The cognitive load stays with OTEL documentation, not ACP.
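
To make this concrete, here is a minimal Python sketch of client-side injection. The agent command, its flag, and the proxy port are placeholders for illustration, not part of ACP:

# Sketch: client launches the ACP agent over stdio with standard OTEL
# auto-configuration pointed at a local endpoint. "my-agent" and "--acp"
# are hypothetical; 4318 is the conventional OTLP/HTTP port.
import os
import subprocess

env = os.environ.copy()
env.update({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
})

agent = subprocess.Popen(
    ["my-agent", "--acp"],
    stdin=subprocess.PIPE,   # stdio transport: the client owns both pipes
    stdout=subprocess.PIPE,
    env=env,
)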

The problem: where should telemetry go?

The agent cannot know where the client wants telemetry routed. Different deployments have different needs:

  • Some clients (like Goose) may want to aggregate logs into their own logging system
  • Some may want to display token counts locally before forwarding
  • Some want everything sent to their OTEL backend of choice

Sharing the client's OTEL credentials with every agent is problematic: it leaks secrets and forces all agents to use the same destination.

Proposal: ephemeral collector proxy

I proposed a similar "telemetry hub" idea for MCP in block/goose#2078. The same pattern applies here.

The client runs a lightweight OTLP listener that proxies telemetry to its configured backend. This follows the OpenTelemetry agent/gateway pattern commonly used in container deployments.

Architecture

flowchart TB
    subgraph Client["Client/Editor"]
        direction LR
        C[ACP Handler]
        P[OTLP Proxy]
        E[OTLP Exporter]
        C ~~~ P
        P --> E
    end

    subgraph Agent["Agent Process"]
        A[ACP Agent]
        ASDK[OTEL SDK]
        A ~~~ ASDK
    end

    subgraph Backend["Observability Backend"]
        B[(Collector)]
    end

    C <-->|"stdio"| A
    ASDK -->|"HTTP"| P
    E -->|"HTTP + creds"| B

Discovery is pre-initialization

There is a chicken-and-egg problem here: environment variables must be set when launching the subprocess, but ACP capability exchange happens after the connection is established. The client needs to know whether to inject OTEL environment variables before calling initialize.

This means OTLP capability discovery happens outside the normal ACP initialization flow. A few options:

  1. Registry metadata: PR #289 (Add ACP Agent Registry RFD) proposes an agent registry with agent.json manifests. These could include an otlp entry in the capabilities array, letting clients know ahead of time.

  2. Manual configuration: Users configure their client to enable OTLP for specific agents based on agent documentation.

  3. Optimistic injection: Clients could inject the OTEL environment variables unconditionally. Agents using standard OTEL SDKs will auto-configure when they detect OTEL_* prefixed variables (see llama-stack#4281 for an example of this pattern). Agents without OTEL support simply ignore them.

The optimistic approach is pragmatic: environment variables are low-cost, and well-behaved OTEL SDKs gracefully handle misconfiguration. But a proper registry entry provides cleaner semantics.
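
As a rough sketch of how a client might combine options 1 and 3, assuming a hypothetical agent.json shape with a capabilities array (nothing here is a settled schema):

# Sketch: decide whether to inject OTEL env vars before launching the agent.
# The manifest shape ({"capabilities": ["otlp", ...]}) is hypothetical.
import json
from pathlib import Path

def should_inject_otel(manifest_path: Path, optimistic: bool = True) -> bool:
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
        return "otlp" in manifest.get("capabilities", [])
    # No registry entry: optionally fall back to optimistic injection,
    # since agents without an OTEL SDK simply ignore the variables.
    return optimistic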

Sequence

sequenceDiagram
    participant Registry as Agent Registry
    participant Client as Client/Editor
    participant Agent

    Client->>Registry: Lookup agent metadata
    Registry-->>Client: agent.json (includes otlp: true)

    Note over Client: Start OTLP proxy if needed

    Client->>Agent: Launch subprocess with OTEL ENV
    Note over Client,Agent: Connection established (stdio)

    Client->>Agent: initialize request
    Agent-->>Client: initialize response

    Client->>Agent: session/new
    Agent-->>Client: sessionId

    Note over Agent: OTEL SDK auto-configured<br/>from environment
    Agent-)Client: OTLP/HTTP (out of band)

Environment variable injection

The client passes OTEL configuration when launching the agent subprocess:

OTEL_SERVICE_NAME=AgentName
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

The client's own configuration might include secrets:

OTEL_SERVICE_NAME=MyEditor
OTEL_EXPORTER_OTLP_ENDPOINT=https://my-org.apm.us-central1.gcp.elastic.cloud
OTEL_EXPORTER_OTLP_HEADERS=Authorization="ApiKey xxx..."

The proxy intercepts agent telemetry and forwards it with the real credentials. No secrets are exposed to agents.
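
For illustration, a minimal Python sketch of such a proxy, reusing the backend and API key placeholders from the example above. A real client would likely use an OTLP collector library rather than hand-rolling HTTP forwarding:

# Sketch: credential-isolating OTLP/HTTP proxy. Agents post to
# http://localhost:4318; the proxy re-posts to the real backend with
# the client's secret header attached. Values below are illustrative.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKEND = "https://my-org.apm.us-central1.gcp.elastic.cloud"
SECRET_HEADERS = {"Authorization": "ApiKey xxx..."}  # never exposed to agents

class OtlpProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # OTLP/HTTP uses paths like /v1/traces, /v1/metrics, /v1/logs.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            BACKEND + self.path,
            data=body,
            headers={
                "Content-Type": self.headers.get("Content-Type", ""),
                **SECRET_HEADERS,
            },
        )
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

HTTPServer(("127.0.0.1", 4318), OtlpProxy).serve_forever()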

Implementation cost

The main cost is that clients need code to run an OTLP proxy. This is a well-understood pattern with library support in most languages. It is simpler than many other features in the editor/agent exchange.

Next steps

I would suggest moving this to an RFD to work through the details. The pre-initialization discovery question needs more thought, and there may be other edge cases worth exploring (multiple agents, proxy lifecycle, error handling).

Once transport is in place, we can iterate on telemetry semantics (span names, metric definitions, log formats). Those discussions can take time - some tracing semantics evolve over years. Decoupling transport from semantics lets us make progress now.

Happy to draft the RFD if there's interest.

@benbrandt reopened this Dec 4, 2025
@benbrandt
Member

benbrandt commented Dec 4, 2025

Ok, sorry for the misunderstanding here. Happy to have folks work on exploring what is needed here and figure out how much is guidance/suggestions/spec requirements.

@codefromthecrypt
Contributor

@benbrandt I think what I'll do is go ahead and put up the RFD version of my comment, as any schema for metrics and traces (or logs... because, believe it or not, they also have a schema!) would layer on the concept of a working transport. In the worst case, I'll close that PR.

@codefromthecrypt
Contributor

as promised, here's the base RFD #298

@forsythetony
Author

Wow, I really appreciate the quick response and the in-depth comment.

Just did a quick read-through, and I agree with what you've laid out. Not clogging the system-critical pipe with metrics data is definitely the way to go.

I'll take a deeper look at your new PR tomorrow and comment with thoughts.

@benbrandt
Member

Awesome, I think we can then close this one in favor of #298 and let you both collaborate on that one

@benbrandt closed this Dec 5, 2025