8  Observability

You can’t improve what you can’t measure. Let’s add tracing.

Note

Code Reference: code/v0.5/src/agentsilex/observability.py

8.1 Why Observability?

Agent systems are hard to debug:

  • Which tool was called and when?
  • How long did each LLM call take?
  • What was the full conversation flow?
  • Where did things go wrong?

8.2 OpenTelemetry Setup

We use OpenTelemetry, a vendor-neutral standard for distributed tracing (observability.py):

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from contextlib import contextmanager

# Module-level tracer; stays None until initialize_tracer() is called
_tracer: trace.Tracer | None = None


def setup_tracer_provider():
    """Setup TracerProvider with OTLP exporter if endpoint is configured"""
    endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
    if endpoint:
        # insecure=True sends plaintext gRPC, which suits a local collector
        exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)
        provider = TracerProvider(
            resource=Resource.create({"service.name": "agentsilex"})
        )
        provider.add_span_processor(BatchSpanProcessor(exporter))
        trace.set_tracer_provider(provider)


def initialize_tracer():
    """Initialize the global tracer instance"""
    global _tracer
    if _tracer is None:
        _tracer = trace.get_tracer("agentsilex")

Key points:

  • Uses environment variable OTEL_EXPORTER_OTLP_ENDPOINT for configuration
  • Service name is “agentsilex”
  • Global tracer instance for the module

8.3 The span Context Manager

Simple wrapper for creating spans:

@contextmanager
def span(name: str, **attrs):
    with _tracer.start_as_current_span(name, attributes=attrs) as s:
        yield s

Usage:

with span("llm_call", model="gpt-4o", tokens=150):
    response = completion(...)

with span("tool_execution", tool="get_weather"):
    result = get_weather("Tokyo")

8.4 ManagedSpan: Manual Control

Sometimes you need more control over span lifecycle:

class ManagedSpan:
    def __init__(self, name: str, **attributes):
        self.name = name
        self.attributes = attributes
        self._span = None
        self._context = None

    def start(self):
        self._span = _tracer.start_span(self.name, attributes=self.attributes)
        self._context = trace.use_span(self._span, end_on_exit=False)
        self._context.__enter__()
        return self

    def end(self):
        if self._context:
            self._context.__exit__(None, None, None)
            self._context = None
        if self._span:
            self._span.end()
            self._span = None

Unlike the context manager, ManagedSpan lets you start and end spans at different times.

8.5 SpanManager: Switching Between Spans

For agent execution, we often need to switch spans (e.g., when handoff occurs):

class SpanManager:
    def __init__(self):
        self.current: ManagedSpan | None = None

    def switch_to(self, name: str, **attributes):
        if self.current:
            self.current.end()

        self.current = ManagedSpan(name, **attributes).start()
        return self.current

    def end_current(self):
        if self.current:
            self.current.end()
            self.current = None

    def __del__(self):
        self.end_current()

This is useful for tracking which agent is currently active.
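
To make the switching order visible without a tracing backend, the sketch below reuses the same switching logic with RecordingSpan, a hypothetical stand-in for ManagedSpan that only logs its start and end calls.

```python
from typing import List, Optional

events: List[str] = []


class RecordingSpan:
    """Hypothetical stand-in for ManagedSpan that logs start/end calls."""

    def __init__(self, name: str, **attributes):
        self.name = name
        self.attributes = attributes

    def start(self) -> "RecordingSpan":
        events.append(f"start {self.name}")
        return self

    def end(self) -> None:
        events.append(f"end {self.name}")


class SpanManager:
    """Same switching logic as above, built on the stand-in span."""

    def __init__(self):
        self.current: Optional[RecordingSpan] = None

    def switch_to(self, name: str, **attributes) -> RecordingSpan:
        if self.current:
            self.current.end()  # close the outgoing span first
        self.current = RecordingSpan(name, **attributes).start()
        return self.current

    def end_current(self) -> None:
        if self.current:
            self.current.end()
            self.current = None


manager = SpanManager()
manager.switch_to("agent_main_assistant")
manager.switch_to("agent_weather_specialist")  # simulated handoff
manager.end_current()
manager.end_current()  # calling it again is harmless: nothing is active
```

The outgoing span always ends before the next one starts, so agent spans appear as siblings rather than nesting inside each other.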

8.6 Instrumented Runner

The Runner uses these primitives (runner.py):

import uuid

from agentsilex.observability import (
    setup_tracer_provider,
    initialize_tracer,
    span,
    SpanManager,
)

setup_tracer_provider()
initialize_tracer()

span_manager = SpanManager()


class Runner:
    def run(self, agent: Agent, prompt: str) -> RunResult:
        with span("workflow_run", run_id=str(uuid.uuid4())):
            span_manager.switch_to(f"agent_{agent.name}", agent=agent.name)

            # ... agent loop ...

            # When handoff occurs:
            span_manager.switch_to(f"agent_{new_agent.name}", agent=new_agent.name)

            # At the end:
            span_manager.end_current()
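
If the agent loop raises, the final end_current() call in the excerpt above is skipped and the agent span stays open. One way to guard against that is a try/finally block. The sketch below illustrates the pattern with logging stand-ins for span() and SpanManager. FakeSpanManager, run, and failing_step are hypothetical names, not part of agentsilex.

```python
from contextlib import contextmanager

log = []


@contextmanager
def span(name, **attrs):
    """Stand-in for the real span(): logs entry and exit."""
    log.append(f"start {name}")
    try:
        yield
    finally:
        log.append(f"end {name}")


class FakeSpanManager:
    """Hypothetical stand-in for SpanManager that logs instead of tracing."""

    def __init__(self):
        self.current = None

    def switch_to(self, name, **attrs):
        self.end_current()
        self.current = name
        log.append(f"start {name}")

    def end_current(self):
        if self.current is not None:
            log.append(f"end {self.current}")
            self.current = None


span_manager = FakeSpanManager()


def run(agent_name, step):
    with span("workflow_run"):
        span_manager.switch_to(f"agent_{agent_name}", agent=agent_name)
        try:
            return step()  # placeholder for the agent loop
        finally:
            # Runs on success and on error, before workflow_run closes
            span_manager.end_current()


def failing_step():
    raise RuntimeError("LLM call failed")


try:
    run("weather_bot", failing_step)
except RuntimeError:
    pass
```

Even though the step fails, the agent span closes before the surrounding workflow_run span, which keeps the trace properly nested.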

8.7 Visualizing with Phoenix

Arize Phoenix is a great free visualization tool:

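# Install
pip install arize-phoenix

# Run Phoenix server
phoenix serve

Then, in Python, set the endpoint before importing agentsilex, because setup_tracer_provider() reads the variable when runner.py is imported:

# Set endpoint
import os
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://localhost:6006"

# Now run your agent - traces will appear in Phoenix UI
# If nothing shows up: observability.py uses the gRPC exporter, and Phoenix
# accepts OTLP over gRPC on port 4317 by default (6006 serves the UI and OTLP over HTTP)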
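
When no traces appear, a first thing to rule out is that nothing is listening on the configured endpoint. The helper below is a minimal sketch using only the standard library. collector_reachable is a hypothetical name, and it only checks that a TCP connection can be opened, not that the server speaks OTLP.

```python
import socket
from urllib.parse import urlparse


def collector_reachable(endpoint: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parsed = urlparse(endpoint)
    host = parsed.hostname or "localhost"
    # Assumption: fall back to the scheme's usual port when none is given
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, collector_reachable("http://localhost:6006") can be called before starting a run to decide whether to print a warning.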

8.8 What You See in Traces

workflow_run (2.3s) [run_id: abc-123]
├── agent_main_assistant (1.5s)
│   ├── llm_call (0.8s)
│   └── tool_call: transfer_to_weather_specialist
├── agent_weather_specialist (0.8s)
│   ├── llm_call (0.5s)
│   ├── tool_call: get_weather (0.1s)
│   └── llm_call (0.2s)
└── final_output: "The weather in Tokyo is..."
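
The tree is built from parent links: every span stores the ID of its parent, and the UI groups spans by that ID. The sketch below reproduces the idea with a hypothetical SpanRecord type and made-up IDs. It is not how Phoenix renders traces, only an illustration of how flat span records become a nested view.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]  # None marks a root span
    name: str
    duration_s: float


def render_tree(records: List[SpanRecord]) -> List[str]:
    """Render flat span records as an indented tree, one line per span."""
    children: Dict[Optional[str], List[SpanRecord]] = {}
    for r in records:
        children.setdefault(r.parent_id, []).append(r)

    lines: List[str] = []

    def walk(parent_id: Optional[str], prefix: str) -> None:
        kids = children.get(parent_id, [])
        for i, r in enumerate(kids):
            last = i == len(kids) - 1
            label = f"{r.name} ({r.duration_s:.1f}s)"
            if parent_id is None:
                lines.append(label)  # root spans get no branch marker
                walk(r.span_id, "")
            else:
                lines.append(prefix + ("└── " if last else "├── ") + label)
                walk(r.span_id, prefix + ("    " if last else "│   "))

    walk(None, "")
    return lines


records = [
    SpanRecord("a", None, "workflow_run", 2.3),
    SpanRecord("b", "a", "agent_main_assistant", 1.5),
    SpanRecord("c", "b", "llm_call", 0.8),
    SpanRecord("d", "a", "agent_weather_specialist", 0.8),
    SpanRecord("e", "d", "llm_call", 0.5),
    SpanRecord("f", "d", "tool_call: get_weather", 0.1),
]
tree = render_tree(records)
```

Printing "\n".join(tree) gives a layout similar to the one shown above.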

8.9 Configuration

Environment variables:

Variable                       Purpose
OTEL_EXPORTER_OTLP_ENDPOINT    OTLP collector endpoint (e.g., http://localhost:6006)

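If the variable is not set, no TracerProvider is installed, so OpenTelemetry falls back to its no-op tracer: spans are still created in code, but nothing is recorded or exported.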
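
Because the switch is a single environment variable, it is easy to report at startup whether tracing is active. The helper below is a small sketch with a hypothetical name, tracing_status. It mirrors the `if endpoint:` check in setup_tracer_provider(), so an empty value counts as disabled.

```python
import os
from typing import Mapping

ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def tracing_status(environ: Mapping[str, str] = os.environ) -> str:
    """Describe whether an OTLP endpoint is configured in the given environment."""
    endpoint = environ.get(ENDPOINT_VAR)
    if not endpoint:  # unset or empty: same condition as setup_tracer_provider()
        return "tracing disabled (no endpoint configured)"
    return f"tracing enabled, exporting to {endpoint}"
```

Passing a plain dict instead of os.environ makes the check easy to exercise in tests.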

8.10 Example: Full Traced Run

import os
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://localhost:6006"

from agentsilex import Agent, Runner, Session, tool

@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: 72°F"

agent = Agent(
    name="weather_bot",
    model="gpt-4o",
    instructions="You are a weather assistant.",
    tools=[get_weather],
)

session = Session()
runner = Runner(session)

# This run will be traced
result = runner.run(agent, "What's the weather in Tokyo?")

# Check Phoenix UI at http://localhost:6006

8.11 Key Design Decisions

Decision                     Why
OpenTelemetry                Industry standard, works with any backend
Environment-based config     Easy to enable/disable
SpanManager for switching    Clean handoff tracking
Global tracer                Simple, single instance

Tip

Checkpoint

cd code/v0.5

Observability is now built in! Traces help you understand agent behavior.