Observability
Theme: See what your memories are doing. This guide defines the observability architecture and phased rollout for Aegis Memory, with emphasis on:
- Memory analytics (query patterns, hit rates, scope usage)
- Prometheus metrics (latency histograms, memory counts, voting stats)
- Memory timeline (how memories evolve)
- Effectiveness dashboard (which memories help vs hurt outcomes)
- Export to Langfuse/LangSmith
Current State (Baseline)
Aegis already has:
- `/metrics` endpoint and foundational Prometheus primitives
- structured logging in `server/observability.py`
- ACE activity + dashboard routes (`/memories/ace/dashboard/*`)
- evaluation KPIs via `EvalRepository`
Goals
- Make memory behavior visible and diagnosable in production.
- Quantify retrieval quality, not just API uptime.
- Provide operational feedback loops for ACE memory quality.
- Export telemetry to existing toolchains (Langfuse/LangSmith) without high coupling.
Architecture Overview
1) Memory Analytics
What to capture
- Query volume and query shape
  - namespace, agent context, requested scope, filter usage
- Hit rates
  - zero-hit ratio, low-hit ratio, average retrieved count
- Scope usage
  - distribution of `global`, `agent_private`, `agent_shared` for writes and reads
- Retrieval quality signals
  - retrieved memories later voted helpful/harmful
Suggested metrics
- `aegis_memory_queries_total{namespace,scope,agent_mode}`
- `aegis_memory_query_hits_total{bucket}` where bucket = `zero|low|medium|high`
- `aegis_memory_query_results_count` (histogram)
- `aegis_memory_scope_usage_total{direction,scope}` where direction = `write|read`
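The metric taxonomy above can be sketched with `prometheus_client`. This is a non-authoritative illustration: the metric names follow this document, but the bucket thresholds, histogram boundaries, and the `record_query` helper are assumptions, not an existing API.

```python
from prometheus_client import Counter, Histogram

# Counters/histograms named per the suggested taxonomy; all labels are
# bounded enums or normalized values to keep cardinality in check.
MEMORY_QUERIES = Counter(
    "aegis_memory_queries_total",
    "Memory queries by namespace, scope, and agent mode",
    ["namespace", "scope", "agent_mode"],
)
QUERY_HITS = Counter(
    "aegis_memory_query_hits_total",
    "Queries bucketed by result count",
    ["bucket"],  # zero | low | medium | high
)
QUERY_RESULTS = Histogram(
    "aegis_memory_query_results_count",
    "Number of memories returned per query",
    buckets=(0, 1, 3, 5, 10, 25, 50),
)
SCOPE_USAGE = Counter(
    "aegis_memory_scope_usage_total",
    "Scope usage by direction",
    ["direction", "scope"],  # direction: write | read
)

def hit_bucket(n: int) -> str:
    """Map a raw result count onto the bounded bucket enum.

    Thresholds here are illustrative assumptions.
    """
    if n == 0:
        return "zero"
    if n <= 2:
        return "low"
    if n <= 10:
        return "medium"
    return "high"

def record_query(namespace: str, scope: str, agent_mode: str, results: int) -> None:
    """Record one read-path query across all four metrics."""
    MEMORY_QUERIES.labels(namespace, scope, agent_mode).inc()
    QUERY_HITS.labels(hit_bucket(results)).inc()
    QUERY_RESULTS.observe(results)
    SCOPE_USAGE.labels("read", scope).inc()
```

Bucketing result counts before labeling (rather than labeling with the raw count) is what keeps `aegis_memory_query_hits_total` at four label values regardless of traffic shape.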
Dashboard/API additions
Add `/memories/ace/dashboard/analytics` with:
- hit rate trends (24h/7d/30d)
- scope usage breakdown
- top query patterns (normalized)
- per-agent retrieval and miss rates
2) Prometheus Metrics Expansion
Required dimensions
- Request: method, normalized endpoint, status
- Memory ops: operation, status, namespace
- Vote stats: helpful/harmful by agent and memory type
- Counts: total memories by namespace/scope/type
Histograms to standardize
- request latency
- memory query latency
- memory write latency
- embedding latency
- vote update latency
Notes
- Keep label cardinality bounded (avoid raw query text as labels).
- Use normalized endpoint paths and bounded enums.
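A minimal sketch of endpoint normalization, assuming numeric and UUID path segments are the main cardinality sources; the regex and placeholder are hypothetical, not an existing helper in `server/observability.py`.

```python
import re

# Collapse ID-like path segments so /memories/123 and /memories/456
# share one endpoint label instead of creating a label per ID.
_ID_SEGMENT = re.compile(
    r"/(?:\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"(?=/|$)"
)

def normalize_endpoint(path: str) -> str:
    """Return a bounded endpoint label for a concrete request path."""
    return _ID_SEGMENT.sub("/{id}", path)
```

If the API framework already exposes route templates (e.g. the matched route pattern rather than the raw path), preferring those over regex normalization is simpler and exact.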
3) Memory Timeline
Why
Current activity endpoints show current rows; they do not represent an immutable evolution history.
Event model
Create a `MemoryEvent` timeline table (append-only):
- identifiers: `event_id`, `memory_id`, `project_id`, `namespace`, `agent_id`
- event metadata: `event_type`, `created_at`
- payload: JSON details (delta, vote context, deprecation reason, retrieval metadata)
Event types
- `created`
- `queried`
- `voted_helpful`
- `voted_harmful`
- `delta_updated`
- `deprecated`
- `reflection_added`
APIs
- `/memories/ace/dashboard/timeline` (project timeline)
- `/memories/ace/dashboard/timeline/{memory_id}` (single memory evolution)
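The event model above can be sketched as an append-only table plus a timeline read. This uses SQLite only to keep the sketch self-contained; the column set and helper names are assumptions on top of the fields this document lists.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Append-only MemoryEvent table: writes are INSERT-only, never UPDATE
# or DELETE, so the timeline is an immutable evolution history.
SCHEMA = """
CREATE TABLE IF NOT EXISTS memory_events (
    event_id   TEXT PRIMARY KEY,
    memory_id  TEXT NOT NULL,
    project_id TEXT NOT NULL,
    namespace  TEXT NOT NULL,
    agent_id   TEXT,
    event_type TEXT NOT NULL,
    created_at TEXT NOT NULL,
    payload    TEXT NOT NULL
)
"""

EVENT_TYPES = {
    "created", "queried", "voted_helpful", "voted_harmful",
    "delta_updated", "deprecated", "reflection_added",
}

def append_event(conn, memory_id, project_id, namespace, event_type,
                 payload, agent_id=None):
    """Append one immutable event; payload is arbitrary JSON details."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event_type: {event_type}")
    conn.execute(
        "INSERT INTO memory_events VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), memory_id, project_id, namespace, agent_id,
         event_type, datetime.now(timezone.utc).isoformat(),
         json.dumps(payload)),
    )

def timeline(conn, memory_id):
    """Single-memory evolution, oldest first (backs the {memory_id} API)."""
    rows = conn.execute(
        "SELECT event_type, created_at, payload FROM memory_events "
        "WHERE memory_id = ? ORDER BY created_at",
        (memory_id,),
    ).fetchall()
    return [(t, c, json.loads(p)) for t, c, p in rows]
```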
4) Effectiveness Dashboard
Core question
Which memories improve completion, and which correlate with failures?
Model
Correlate:
- retrieval events (which memory IDs were returned)
- vote signals (helpful/harmful)
- task outcomes (`FeatureTracker.passes`, completion time)
Outputs
- Memory uplift score (helpful for successful tasks)
- Memory drag score (associated with failed/slower tasks)
- Leaderboards by memory type, scope, and agent
- Confidence-aware ranking (minimum sample and smoothing)
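One way to implement the confidence-aware ranking is Laplace smoothing over the vote counts plus a minimum-sample gate. The priors and threshold below are illustrative assumptions, not values prescribed by this document.

```python
def effectiveness_score(helpful: int, harmful: int,
                        prior_helpful: float = 1.0,
                        prior_harmful: float = 1.0,
                        min_samples: int = 5):
    """Return (score, confident).

    score: smoothed helpful ratio in (0, 1); with zero votes it sits at
    the prior mean instead of an undefined 0/0.
    confident: whether the minimum sample size was reached, so the
    leaderboard can gray out or exclude thin-sample memories.
    """
    n = helpful + harmful
    score = (helpful + prior_helpful) / (n + prior_helpful + prior_harmful)
    return score, n >= min_samples
```

The smoothing keeps a memory with one helpful vote from outranking one with 90 helpful out of 100, which is the failure mode a raw ratio leaderboard would have.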
Endpoints
- `/memories/ace/dashboard/effectiveness/overview`
- `/memories/ace/dashboard/effectiveness/memories`
- `/memories/ace/dashboard/effectiveness/segments`
5) Export to Langfuse/LangSmith
Design principles
- provider-agnostic internal envelope first
- async export queue (do not add latency to write/query paths)
- retries + dead-letter behavior
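The queue-with-retries principle can be sketched with a bounded in-process queue and a worker thread. This is a minimal assumption-laden sketch (class name, retry count, and drop-on-full policy are all illustrative); a production exporter would likely add backoff and batching.

```python
import queue
import threading

class ExportQueue:
    """Async export path: producers enqueue without blocking the
    request; a background worker retries and dead-letters failures."""

    def __init__(self, exporter, max_retries=3, maxsize=10_000):
        self._q = queue.Queue(maxsize=maxsize)
        self._exporter = exporter          # callable(event) -> None
        self._max_retries = max_retries
        self.dead_letters = []             # events that exhausted retries
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def publish(self, event) -> bool:
        """Non-blocking: drop (and report False) rather than add
        latency to the write/query path when the queue is full."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            event = self._q.get()
            if event is None:  # shutdown sentinel
                return
            for attempt in range(self._max_retries):
                try:
                    self._exporter(event)
                    break
                except Exception:
                    if attempt == self._max_retries - 1:
                        self.dead_letters.append(event)

    def close(self):
        """Flush remaining events, then stop the worker."""
        self._q.put(None)
        self._worker.join()
```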
Internal envelope
- trace/request IDs
- project/agent/session/task identifiers
- operation + timestamps
- input/output metadata (bounded)
- outcome status and latency
Provider adapters
- `langfuse_exporter` maps events into traces/spans/scores
- `langsmith_exporter` maps events into runs/feedback artifacts
Config
- `OBS_EXPORT_LANGFUSE_ENABLED`
- `OBS_EXPORT_LANGSMITH_ENABLED`
- provider keys/endpoints
- queue and retry controls
Phased Rollout Plan
Phase 1 — Instrumentation wiring
- Unify middleware + request tracing across API
- ensure `record_operation` + latency tracking in add/query/vote/delta
- add bounded labels and a metric taxonomy
Phase 2 — Analytics and timeline
- ship query-hit and scope usage metrics
- add `MemoryEvent` table and timeline APIs
Phase 3 — Effectiveness attribution
- join retrieval, votes, and outcomes
- ship effectiveness views and trend endpoints
Phase 4 — External exports
- async exporters for Langfuse/LangSmith
- rollout toggles, retries, and monitoring
Phase 5 — Unification hardening
- converge metrics/logs/events around shared telemetry schema
- document compatibility and migration notes
Success Criteria
- Operators can explain why hit rate changed in a given window.
- Teams can identify top harmful memories and safely deprecate them.
- Observability data is consumable both in Prometheus/Grafana and Langfuse/LangSmith.
- Added telemetry has bounded cardinality and low request overhead.
Risks and Guardrails
- Cardinality explosion: constrain labels to enums/normalized values.
- Write amplification: event logging should be append-only and batched where possible.
- Attribution bias: show confidence intervals and sample sizes for effectiveness scoring.
- Vendor lock-in: keep provider adapters behind internal event schema.
Recommended Next Implementation Ticket Set
- Middleware + operation taxonomy unification
- Query analytics metrics and dashboard endpoint
- `MemoryEvent` migration and timeline APIs
- Effectiveness attribution joins and endpoints
- Langfuse/LangSmith async exporters