
@forsythetony commented Dec 3, 2025

$$$$$$$\   $$$$$$\        $$\   $$\  $$$$$$\ $$$$$$$$\       $$\      $$\ $$$$$$$$\ $$$$$$$\   $$$$$$\  $$$$$$$$\ 
$$  __$$\ $$  __$$\       $$$\  $$ |$$  __$$\\__$$  __|      $$$\    $$$ |$$  _____|$$  __$$\ $$  __$$\ $$  _____|
$$ |  $$ |$$ /  $$ |      $$$$\ $$ |$$ /  $$ |  $$ |         $$$$\  $$$$ |$$ |      $$ |  $$ |$$ /  \__|$$ |      
$$ |  $$ |$$ |  $$ |      $$ $$\$$ |$$ |  $$ |  $$ |         $$\$$\$$ $$ |$$$$$\    $$$$$$$  |$$ |$$$$\ $$$$$\    
$$ |  $$ |$$ |  $$ |      $$ \$$$$ |$$ |  $$ |  $$ |         $$ \$$$  $$ |$$  __|   $$  __$$< $$ |\_$$ |$$  __|   
$$ |  $$ |$$ |  $$ |      $$ |\$$$ |$$ |  $$ |  $$ |         $$ |\$  /$$ |$$ |      $$ |  $$ |$$ |  $$ |$$ |      
$$$$$$$  | $$$$$$  |      $$ | \$$ | $$$$$$  |  $$ |         $$ | \_/ $$ |$$$$$$$$\ $$ |  $$ |\$$$$$$  |$$$$$$$$\ 
\_______/  \______/       \__|  \__| \______/   \__|         \__|     \__|\________|\__|  \__| \______/ \________|                                                                                                             

@forsythetony requested a review from a team as a code owner December 3, 2025 06:25
@benbrandt
Member

@forsythetony thanks for this!

However, I think this has high overlap with #276, which just moved to draft stage. I'd love to consolidate efforts around this. Feel free to open a discussion in Zulip if you want to collaborate with @codefromthecrypt on how to best specify some of this!

@benbrandt closed this Dec 3, 2025
@benbrandt
Member

Feel free to correct me if I have missed something here, but I think at the protocol level we should likely define how this data is propagated; I'm not sure the protocol needs to define how it gets exported.

@codefromthecrypt feel free to chime in here as well

@codefromthecrypt
Contributor

TL;DR: Instead of tunneling telemetry over ACP (which could block agent messages), clients should run a local OTLP proxy and pass standard OTEL_* environment variables when launching agents. This keeps telemetry out-of-band, avoids leaking secrets, and lets agent authors use normal OTEL SDKs. Discovery of OTLP support would happen pre-initialization (via registry metadata or manual config) since ENV must be set before subprocess launch. I'd suggest an RFD to iron out details.


Thanks for starting this discussion. I'll set aside trace propagation for now since PR #276 already addresses that via params._meta with W3C trace context. That work is vital for sampling and correlation, but it's separate from the transport question here.

For end-to-end telemetry, I see two distinct concerns:

  1. Telemetry transport: where and how to send logs, metrics, and traces
  2. Telemetry semantics: what spans, metrics, and log formats to standardize

This comment focuses on transport.

Why not tunnel telemetry over ACP?

Sending telemetry as JSON-RPC messages on the ACP transport has drawbacks:

  • Head-of-line blocking: telemetry traffic could delay more important agent messages on stdio
  • Coupling: ties agent authors to ACP-specific telemetry code instead of standard OTEL SDKs
  • Support burden: places telemetry implementation complexity on the ACP project

Alternative: OTEL side-channel via environment variables

The standard approach in observability is "do no harm" - telemetry should not interfere with the primary value path. A side-channel keeps agent communication separate from telemetry export.

ACP's stdio transport defines how clients launch agents as subprocesses. The client controls the agent's environment at launch time. We can configure agents using standard OTEL environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

This means agent authors write and test against standard OTEL auto-configuration. The cognitive load stays with OTEL documentation, not ACP.
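
To make this concrete, here is a minimal Python sketch of client-side injection. The agent command, its flag, and the proxy port are placeholders for illustration, not part of ACP:

# Sketch: client launches the ACP agent over stdio with standard OTEL
# auto-configuration pointed at a local endpoint. "my-agent" and "--acp"
# are hypothetical; 4318 is the conventional OTLP/HTTP port.
import os
import subprocess

env = os.environ.copy()
env.update({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
})

agent = subprocess.Popen(
    ["my-agent", "--acp"],
    stdin=subprocess.PIPE,   # stdio transport: the client owns both pipes
    stdout=subprocess.PIPE,
    env=env,
)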

The problem: where should telemetry go?

The agent cannot know where the client wants telemetry routed. Different deployments have different needs:

  • Some clients (like Goose) may want to aggregate logs into their own logging system
  • Some may want to display token counts locally before forwarding
  • Some want everything sent to their OTEL backend of choice

Sharing the client's OTEL credentials with every agent is problematic: it leaks secrets and forces all agents to use the same destination.

Proposal: ephemeral collector proxy

I proposed a similar "telemetry hub" idea for MCP in block/goose#2078. The same pattern applies here.

The client runs a lightweight OTLP listener that proxies telemetry to its configured backend. This follows the OpenTelemetry agent/gateway pattern commonly used in container deployments.

Architecture

flowchart TB
    subgraph Client["Client/Editor"]
        direction LR
        C[ACP Handler]
        P[OTLP Proxy]
        E[OTLP Exporter]
        C ~~~ P
        P --> E
    end

    subgraph Agent["Agent Process"]
        A[ACP Agent]
        ASDK[OTEL SDK]
        A ~~~ ASDK
    end

    subgraph Backend["Observability Backend"]
        B[(Collector)]
    end

    C <-->|"stdio"| A
    ASDK -->|"HTTP"| P
    E -->|"HTTP + creds"| B

Discovery is pre-initialization

There is a chicken-and-egg problem here: environment variables must be set when launching the subprocess, but ACP capability exchange happens after the connection is established. The client needs to know whether to inject OTEL environment variables before calling initialize.

This means OTLP capability discovery happens outside the normal ACP initialization flow. A few options:

  1. Registry metadata: PR #289 (Add ACP Agent Registry RFD) proposes an agent registry with agent.json manifests. These could include an otlp entry in the capabilities array, letting clients know ahead of time.

  2. Manual configuration: Users configure their client to enable OTLP for specific agents based on agent documentation.

  3. Optimistic injection: Clients could inject the OTEL environment variables unconditionally. Agents using standard OTEL SDKs will auto-configure when they detect OTEL_* prefixed variables (see llama-stack#4281 for an example of this pattern). Agents without OTEL support simply ignore them.

The optimistic approach is pragmatic: environment variables are low-cost, and well-behaved OTEL SDKs gracefully handle misconfiguration. But a proper registry entry provides cleaner semantics.
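
As a rough sketch of how a client might combine options 1 and 3, assuming a hypothetical agent.json shape with a capabilities array (nothing here is a settled schema):

# Sketch: decide whether to inject OTEL env vars before launching the agent.
# The manifest shape ({"capabilities": ["otlp", ...]}) is hypothetical.
import json
from pathlib import Path

def should_inject_otel(manifest_path: Path, optimistic: bool = True) -> bool:
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
        return "otlp" in manifest.get("capabilities", [])
    # No registry entry: optionally fall back to optimistic injection,
    # since agents without an OTEL SDK simply ignore the variables.
    return optimistic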

Sequence

sequenceDiagram
    participant Registry as Agent Registry
    participant Client as Client/Editor
    participant Agent

    Client->>Registry: Lookup agent metadata
    Registry-->>Client: agent.json (includes otlp: true)

    Note over Client: Start OTLP proxy if needed

    Client->>Agent: Launch subprocess with OTEL ENV
    Note over Client,Agent: Connection established (stdio)

    Client->>Agent: initialize request
    Agent-->>Client: initialize response

    Client->>Agent: session/new
    Agent-->>Client: sessionId

    Note over Agent: OTEL SDK auto-configured<br/>from environment
    Agent-)Client: OTLP/HTTP (out of band)

Environment variable injection

The client passes OTEL configuration when launching the agent subprocess:

OTEL_SERVICE_NAME=AgentName
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

The client's own configuration might include secrets:

OTEL_SERVICE_NAME=MyEditor
OTEL_EXPORTER_OTLP_ENDPOINT=https://my-org.apm.us-central1.gcp.elastic.cloud
OTEL_EXPORTER_OTLP_HEADERS=Authorization="ApiKey xxx..."

The proxy intercepts agent telemetry and forwards it with the real credentials. No secrets are exposed to agents.
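
For illustration, a minimal Python sketch of such a proxy, reusing the backend and API key placeholders from the example above. A real client would likely use an OTLP collector library rather than hand-rolling HTTP forwarding:

# Sketch: credential-isolating OTLP/HTTP proxy. Agents post to
# http://localhost:4318; the proxy re-posts to the real backend with
# the client's secret header attached. Values below are illustrative.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKEND = "https://my-org.apm.us-central1.gcp.elastic.cloud"
SECRET_HEADERS = {"Authorization": "ApiKey xxx..."}  # never exposed to agents

class OtlpProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # OTLP/HTTP uses paths like /v1/traces, /v1/metrics, /v1/logs.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            BACKEND + self.path,
            data=body,
            headers={
                "Content-Type": self.headers.get("Content-Type", ""),
                **SECRET_HEADERS,
            },
        )
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

HTTPServer(("127.0.0.1", 4318), OtlpProxy).serve_forever()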

Implementation cost

The main cost is that clients need code to run an OTLP proxy. This is a well-understood pattern with library support in most languages. It is simpler than many other features in the editor/agent exchange.

Next steps

I would suggest moving this to an RFD to work through the details. The pre-initialization discovery question needs more thought, and there may be other edge cases worth exploring (multiple agents, proxy lifecycle, error handling).

Once transport is in place, we can iterate on telemetry semantics (span names, metric definitions, log formats). Those discussions can take time - some tracing semantics evolve over years. Decoupling transport from semantics lets us make progress now.

Happy to draft the RFD if there's interest.

@benbrandt reopened this Dec 4, 2025
@benbrandt
Member

benbrandt commented Dec 4, 2025

Ok, sorry for the misunderstanding here. Happy to have folks work on exploring what is needed here and figure out how much is guidance/suggestions/spec requirements.

@codefromthecrypt
Contributor

@benbrandt I think what I'll do is go ahead and put up the RFD version of my comment, as any schema for metrics and traces (or logs... because, believe it or not, they also have a schema!) would layer on the concept of a working transport. In the worst case, I'll close that PR.

@codefromthecrypt
Contributor

as promised, here's the base RFD #298

@forsythetony
Author

Wow, I really appreciate the quick response and the in-depth comment.

Just did a quick read-through, and I agree with what you've laid out. Not clogging the system-critical pipe with metrics data is definitely the way to go.

I'll take a deeper look at your new PR tomorrow and comment with thoughts.

@benbrandt
Member

Awesome, I think we can then close this one in favor of #298 and let you both collaborate on that one

@benbrandt closed this Dec 5, 2025