Observability

Theme: See what your memories are doing. This guide defines the observability architecture and phased rollout for Aegis Memory, with emphasis on:
  1. Memory analytics (query patterns, hit rates, scope usage)
  2. Prometheus metrics (latency histograms, memory counts, voting stats)
  3. Memory timeline (how memories evolve)
  4. Effectiveness dashboard (which memories help vs hurt outcomes)
  5. Export to Langfuse/LangSmith

Current State (Baseline)

Aegis already has:
  • /metrics endpoint and foundational Prometheus primitives
  • structured logging in server/observability.py
  • ACE activity + dashboard routes (/memories/ace/dashboard/*)
  • evaluation KPIs via EvalRepository
However, telemetry is not yet fully unified across request middleware, repository operations, timeline events, and external tracing providers.

Goals

  • Make memory behavior visible and diagnosable in production.
  • Quantify retrieval quality, not just API uptime.
  • Provide operational feedback loops for ACE memory quality.
  • Export telemetry to existing toolchains (Langfuse/LangSmith) without high coupling.

Architecture Overview

Client Request
    |
    v
FastAPI Middleware (request_id, trace context, latency)
    |
    +--> Prometheus counters/histograms/gauges
    |
    +--> Structured logs
    |
    +--> Memory Event Bus (immutable event envelope)
             |
             +--> Timeline storage (MemoryEvent table)
             +--> Async exporters (Langfuse / LangSmith)
             +--> Dashboard read models
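
The fan-out at the center of this diagram can be sketched as a minimal in-process event bus. The envelope fields and `MemoryEventBus` name below are illustrative assumptions, not the actual Aegis classes; real exporter subscribers would enqueue async work rather than run inline.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass(frozen=True)
class MemoryEventEnvelope:
    """Immutable envelope fanned out to timeline storage, exporters, read models."""
    event_type: str                      # e.g. "created", "queried", "voted_helpful"
    memory_id: str
    payload: Dict[str, Any] = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)

class MemoryEventBus:
    """Synchronous fan-out to registered subscribers."""
    def __init__(self) -> None:
        self._subscribers: List[Callable[[MemoryEventEnvelope], None]] = []

    def subscribe(self, handler: Callable[[MemoryEventEnvelope], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: MemoryEventEnvelope) -> None:
        for handler in self._subscribers:
            handler(event)

# Example wiring: the timeline store is just an append-only list here.
timeline: List[MemoryEventEnvelope] = []
bus = MemoryEventBus()
bus.subscribe(timeline.append)
bus.publish(MemoryEventEnvelope(event_type="created", memory_id="m-1"))
```

Because the envelope is frozen, downstream consumers (timeline, exporters, dashboards) can never mutate history, which is what makes the timeline trustworthy.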

1) Memory Analytics

What to capture

  • Query volume and query shape
    • namespace, agent context, requested scope, filter usage
  • Hit rates
    • zero-hit ratio, low-hit ratio, average retrieved count
  • Scope usage
    • distribution of global, agent_private, agent_shared for writes + reads
  • Retrieval quality signals
    • retrieved memories later voted helpful/harmful

Suggested metrics

  • aegis_memory_queries_total{namespace,scope,agent_mode}
  • aegis_memory_query_hits_total{bucket} where bucket = zero|low|medium|high
  • aegis_memory_query_results_count (histogram)
  • aegis_memory_scope_usage_total{direction,scope} direction = write|read

Dashboard/API additions

Add /memories/ace/dashboard/analytics with:
  • hit rate trends (24h/7d/30d)
  • scope usage breakdown
  • top query patterns (normalized)
  • per-agent retrieval and miss rates

2) Prometheus Metrics Expansion

Required dimensions

  • Request: method, normalized endpoint, status
  • Memory ops: operation, status, namespace
  • Vote stats: helpful/harmful by agent and memory type
  • Counts: total memories by namespace/scope/type

Histograms to standardize

  • request latency
  • memory query latency
  • memory write latency
  • embedding latency
  • vote update latency
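
In production these would use the existing Prometheus primitives; the stdlib-only sketch below just shows the cumulative-bucket ("le") semantics and a shared bucket layout, which keeps query/write/embedding latencies comparable. The bucket bounds are illustrative.

```python
import bisect
from typing import List, Sequence

class LatencyHistogram:
    """Cumulative-bucket histogram in the Prometheus style (le = less-or-equal)."""
    def __init__(self, buckets: Sequence[float]) -> None:
        self.bounds: List[float] = sorted(buckets)          # upper bounds, seconds
        self.counts: List[int] = [0] * (len(self.bounds) + 1)  # last slot = +Inf
        self.total = 0.0
        self.observations = 0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the "le" bucket.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total += seconds
        self.observations += 1

# One shared layout across all five histograms (illustrative bounds).
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5)
query_latency = LatencyHistogram(BUCKETS)
query_latency.observe(0.03)
query_latency.observe(0.3)
```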

Notes

  • Keep label cardinality bounded (avoid raw query text as labels).
  • Use normalized endpoint paths and bounded enums.
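
Endpoint normalization can be a small pure function run before labeling; the placeholder convention below (`{id}`) is an assumption, not an existing Aegis helper.

```python
import re

# Collapse UUIDs and numeric path segments into a single placeholder so the
# endpoint label set stays bounded regardless of how many memories exist.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
INT_RE = re.compile(r"/\d+(?=/|$)")

def normalize_endpoint(path: str) -> str:
    """Replace volatile path segments with {id} before using path as a label."""
    path = UUID_RE.sub("{id}", path)
    path = INT_RE.sub("/{id}", path)
    return path
```

Normalizing before labeling, rather than after scraping, means Prometheus never sees the unbounded raw paths at all.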

3) Memory Timeline

Why

The current activity endpoints expose the latest state of memory rows; they do not record an immutable history of how each memory evolved.

Event model

Create a MemoryEvent timeline table (append-only):
  • identifiers: event_id, memory_id, project_id, namespace, agent_id
  • event metadata: event_type, created_at
  • payload: JSON details (delta, vote context, deprecation reason, retrieval metadata)

Event types

  • created
  • queried
  • voted_helpful
  • voted_harmful
  • delta_updated
  • deprecated
  • reflection_added
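
The event types and the append-only contract can be captured as below. The real `MemoryEvent` table would be a database migration; this in-memory sketch is only meant to show the shape (enum for bounded event types, store that supports append and read but never update or delete).

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

class MemoryEventType(str, Enum):
    CREATED = "created"
    QUERIED = "queried"
    VOTED_HELPFUL = "voted_helpful"
    VOTED_HARMFUL = "voted_harmful"
    DELTA_UPDATED = "delta_updated"
    DEPRECATED = "deprecated"
    REFLECTION_ADDED = "reflection_added"

@dataclass(frozen=True)
class MemoryEvent:
    memory_id: str
    event_type: MemoryEventType
    payload: Dict[str, Any] = field(default_factory=dict)
    created_at: float = field(default_factory=time.time)

class TimelineStore:
    """Append-only: events are recorded, never updated or deleted."""
    def __init__(self) -> None:
        self._events: List[MemoryEvent] = []

    def append(self, event: MemoryEvent) -> None:
        self._events.append(event)

    def for_memory(self, memory_id: str) -> List[MemoryEvent]:
        """Single-memory evolution, in insertion (= chronological) order."""
        return [e for e in self._events if e.memory_id == memory_id]

store = TimelineStore()
store.append(MemoryEvent("m-1", MemoryEventType.CREATED))
store.append(MemoryEvent("m-1", MemoryEventType.VOTED_HELPFUL, {"agent_id": "a-9"}))
store.append(MemoryEvent("m-2", MemoryEventType.CREATED))
```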

APIs

  • /memories/ace/dashboard/timeline (project timeline)
  • /memories/ace/dashboard/timeline/{memory_id} (single memory evolution)

4) Effectiveness Dashboard

Core question

Which memories improve completion and which correlate with failures?

Model

Correlate:
  • retrieval events (what memory IDs were returned)
  • vote signals (helpful/harmful)
  • task outcomes (FeatureTracker.passes, completion time)

Outputs

  • Memory uplift score (helpful for successful tasks)
  • Memory drag score (associated with failed/slower tasks)
  • Leaderboards by memory type, scope, and agent
  • Confidence-aware ranking (minimum sample and smoothing)
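
Confidence-aware ranking can be as simple as a Laplace-smoothed helpful ratio with a minimum sample gate; the priors and threshold below are illustrative defaults, not a decided scoring formula.

```python
from typing import Optional

def uplift_score(helpful: int, harmful: int,
                 prior_helpful: float = 1.0, prior_harmful: float = 1.0,
                 min_samples: int = 5) -> Optional[float]:
    """Laplace-smoothed helpful ratio; None when too few votes to rank.

    The symmetric priors shrink small samples toward 0.5 so a memory with
    one lucky helpful vote cannot top the leaderboard.
    """
    n = helpful + harmful
    if n < min_samples:
        return None  # below minimum sample: exclude from leaderboards
    return (helpful + prior_helpful) / (n + prior_helpful + prior_harmful)
```

A drag score can be the same computation with the roles of helpful/harmful swapped, or extended with task-outcome joins once the attribution data from the model above is available.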

Endpoints

  • /memories/ace/dashboard/effectiveness/overview
  • /memories/ace/dashboard/effectiveness/memories
  • /memories/ace/dashboard/effectiveness/segments

5) Export to Langfuse/LangSmith

Design principles

  • provider-agnostic internal envelope first
  • async export queue (do not add latency to write/query paths)
  • retries + dead-letter behavior
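
A sketch of the retry + dead-letter behavior, assuming a queue drained off the request path; a real exporter would run this in a background task with backoff between attempts rather than retrying inline as here.

```python
from typing import Any, Callable, Dict, List, Tuple

def drain_export_queue(
    events: List[Dict[str, Any]],
    send: Callable[[Dict[str, Any]], None],
    max_retries: int = 3,
) -> Tuple[int, List[Dict[str, Any]]]:
    """Try each event up to max_retries; exhausted events go to dead-letter."""
    sent = 0
    dead_letter: List[Dict[str, Any]] = []
    for event in events:
        for attempt in range(max_retries):
            try:
                send(event)
                sent += 1
                break
            except Exception:
                if attempt == max_retries - 1:
                    dead_letter.append(event)
    return sent, dead_letter

# Flaky sender for illustration: e2 fails twice then recovers, e3 always fails.
failures = {"e2": 2, "e3": 99}
def flaky_send(event: Dict[str, Any]) -> None:
    if failures.get(event["id"], 0) > 0:
        failures[event["id"]] -= 1
        raise ConnectionError("provider unavailable")

sent, dead = drain_export_queue(
    [{"id": "e1"}, {"id": "e2"}, {"id": "e3"}], flaky_send
)
```

Dead-lettered events stay inspectable and replayable, so a provider outage degrades exports without losing telemetry or slowing write/query paths.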

Internal envelope

  • trace/request IDs
  • project/agent/session/task identifiers
  • operation + timestamps
  • input/output metadata (bounded)
  • outcome status and latency
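
The envelope fields above can be carried in a small dataclass with an explicit bound on payload size; field names and the truncation limit below are illustrative assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional

MAX_FIELD_CHARS = 2000  # illustrative bound on input/output metadata size

def bounded(text: str, limit: int = MAX_FIELD_CHARS) -> str:
    """Truncate oversized payload fields so exports stay cheap and safe."""
    return text if len(text) <= limit else text[:limit] + "...[truncated]"

@dataclass
class ExportEnvelope:
    trace_id: str
    request_id: str
    project_id: str
    operation: str                 # e.g. "memory.query", "memory.vote"
    started_at: float
    latency_ms: float
    status: str                    # "ok" | "error"
    agent_id: Optional[str] = None
    input_summary: str = ""
    output_summary: str = ""

    def to_payload(self) -> dict:
        """Provider-agnostic dict; adapters map this onto traces/runs."""
        d = asdict(self)
        d["input_summary"] = bounded(d["input_summary"])
        d["output_summary"] = bounded(d["output_summary"])
        return d
```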

Provider adapters

  • langfuse_exporter maps events into traces/spans/scores
  • langsmith_exporter maps events into runs/feedback artifacts

Config

  • OBS_EXPORT_LANGFUSE_ENABLED
  • OBS_EXPORT_LANGSMITH_ENABLED
  • provider keys/endpoints
  • queue and retry controls
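
Loading this config can look like the sketch below. The two `OBS_EXPORT_*_ENABLED` names come from the list above; the queue/retry variable names (`OBS_EXPORT_QUEUE_MAX`, `OBS_EXPORT_MAX_RETRIES`) and their defaults are assumed placeholders.

```python
import os
from dataclasses import dataclass

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean env var tolerantly (1/true/yes/on)."""
    return os.environ.get(name, str(default)).strip().lower() in (
        "1", "true", "yes", "on"
    )

@dataclass(frozen=True)
class ExportConfig:
    langfuse_enabled: bool
    langsmith_enabled: bool
    queue_max_size: int      # assumed knob; name illustrative
    max_retries: int         # assumed knob; name illustrative

def load_export_config() -> ExportConfig:
    return ExportConfig(
        langfuse_enabled=env_bool("OBS_EXPORT_LANGFUSE_ENABLED"),
        langsmith_enabled=env_bool("OBS_EXPORT_LANGSMITH_ENABLED"),
        queue_max_size=int(os.environ.get("OBS_EXPORT_QUEUE_MAX", "10000")),
        max_retries=int(os.environ.get("OBS_EXPORT_MAX_RETRIES", "3")),
    )
```

Defaulting both exporters to off keeps the rollout toggle-driven: Phase 4 can enable one provider at a time without a deploy.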

Phased Rollout Plan

Phase 1 — Instrumentation wiring

  • unify middleware + request tracing across the API
  • ensure record_operation + latency tracking in add/query/vote/delta
  • add bounded labels and metric taxonomy

Phase 2 — Analytics and timeline

  • ship query-hit and scope usage metrics
  • add MemoryEvent table and timeline APIs

Phase 3 — Effectiveness attribution

  • join retrieval, votes, and outcomes
  • ship effectiveness views and trend endpoints

Phase 4 — External exports

  • async exporters for Langfuse/LangSmith
  • rollout toggles, retries, and monitoring

Phase 5 — Unification hardening

  • converge metrics/logs/events around shared telemetry schema
  • document compatibility and migration notes

Success Criteria

  • Operators can explain why hit rate changed in a given window.
  • Teams can identify top harmful memories and safely deprecate them.
  • Observability data is consumable both in Prometheus/Grafana and Langfuse/LangSmith.
  • Added telemetry has bounded cardinality and low request overhead.

Risks and Guardrails

  • Cardinality explosion: constrain labels to enums/normalized values.
  • Write amplification: event logging should be append-only and batched where possible.
  • Attribution bias: show confidence intervals and sample sizes for effectiveness scoring.
  • Vendor lock-in: keep provider adapters behind internal event schema.

Suggested Implementation Order

  1. Middleware + operation taxonomy unification
  2. Query analytics metrics and dashboard endpoint
  3. MemoryEvent migration and timeline APIs
  4. Effectiveness attribution joins and endpoints
  5. Langfuse/LangSmith async exporters