How `unfault ask` Does RAG (Without Getting Weird About It)

A practical walk-through of the retrieval flow: from local parsing to graph traversal, semantic search, and optional client-side synthesis.

Most of the time, the trail between two pieces of code is not hard to find. You click through a call site. You follow an import. You land somewhere that makes sense.

Then you hit the familiar gap.

You are in a FastAPI service staring at an outbound HTTP call. You know it hits another service. You do not know what it actually reaches, what else depends on it, or what would break if you change it. You are back to the oldest tool in the industry: assumptions.

unfault ask exists for that gap. It is retrieval over a map that Unfault already builds during review. It answers questions by pulling the relevant context (findings, graph slices, call paths) and returning something you can verify.

This post reworks the previous “pipeline tour” into a story that matches how the system behaves today.

The privacy boundary (and why it shapes everything)

Unfault is built around a boring principle: your source code stays on your machine.

  • The CLI parses and builds the graph locally.
  • The API stores structure (graph topology, metadata) and findings derived from that structure.
  • If you use unfault ask --llm, the CLI talks to your configured provider. Unfault’s servers do not call a public LLM for you.

That boundary is why ask is built around retrieval plans and structural context. It is not a prompt that magically contains your repository.
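To make that boundary concrete, here is a rough sketch of the kind of payload that leaves your machine after a local review. The field names are illustrative, not Unfault's actual wire format; the point is what is absent.

# Illustrative only: a hypothetical shape for what the CLI uploads after a
# local review -- graph topology, metadata, and derived findings.
review_upload = {
    "workspace_id": "wks_waiter",
    "graph": {
        "nodes": [{"id": "app/orders.py::create_order", "kind": "function"}],
        "edges": [{"src": "app/orders.py::create_order",
                   "dst": "GET http://kitchen/api/menu", "kind": "http_call"}],
    },
    "findings": [
        {"rule": "missing-timeout", "file": "app/orders.py", "line": 42},
    ],
    # No source text anywhere in the payload. The code itself stays local.
}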

The setup: two workspaces, one missing trail

Let us reuse the same characters from the “bridge” posts: a Waiter service calling a Kitchen service.

The question we want to answer is not “where is this string referenced”. It is:

“Are we calling everything we think we are calling?”

Concretely:

Terminal window
unfault ask "On waiter, what endpoints from kitchen are we not calling?" --workspace wks_waiter

This is a cross-workspace coverage question. It is the kind of thing you can approximate with traffic logs and docs, but you rarely have both when you are mid-change.

Under the hood: what the API actually does

The /rag/query endpoint returns a context pack. The CLI can print it directly, or (with --llm) ask your configured provider to synthesize a narrative.
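That split is worth a sketch. Assuming a context pack shaped roughly like the JSON the CLI prints (key names are illustrative) and a completion callable standing in for whatever provider your local configuration wires up, client-side synthesis is essentially this:

import json

# A minimal sketch of the print-or-synthesize choice. `complete` stands in for
# whichever LLM client your local configuration provides; the API never sees it.
def render_answer(context_pack: dict, question: str, complete=None) -> str:
    if complete is None:
        # No --llm: the raw, verifiable context pack is the answer.
        return json.dumps(context_pack, indent=2)
    # With --llm: the CLI builds the prompt and calls *your* provider directly.
    prompt = (
        f"Question: {question}\n"
        f"Context from unfault:\n{json.dumps(context_pack, indent=2)}\n"
        "Answer using only the context above, and cite files where possible."
    )
    return complete(prompt)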

Here is the pipeline at a glance: parse and embed the query, route it to an intent plan, take a fast path where one exists, retrieve semantic and structural context, and assemble everything into a context pack.

Now the details, in the order they actually occur.

1. Review gives us the map

unfault ask is downstream of unfault review.

During review the CLI parses your code locally, extracts semantic structure, and produces a graph. The API persists that structure and the findings derived from it.

The important detail for ask is that we have two kinds of retrievable things:

  • sessions (project-level summaries)
  • findings (issue-level facts)
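A rough sketch of those two record types, with field names that are assumptions rather than Unfault's schema; the split between project-level and issue-level is the part that matters:

from dataclasses import dataclass, field

@dataclass
class SessionRecord:
    # Project-level summary, one per review run; the summary text is what
    # gets embedded for semantic search.
    session_id: str
    workspace_id: str
    summary: str
    languages: list[str] = field(default_factory=list)

@dataclass
class FindingRecord:
    # Issue-level fact tied to a session; retrievable on its own.
    finding_id: str
    session_id: str
    rule: str          # e.g. "missing-timeout"
    file: str
    summary: str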

2. Embeddings are created lazily

Embeddings are not required at review time.

On an ask call, the API checks for a small number of completed sessions without embeddings (currently capped) and generates those embeddings on demand.

This keeps review fast and pays the embedding cost only when someone actually asks a question.
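In pseudocode terms, the backfill looks something like the sketch below; the store and embedder helpers and the exact cap are assumptions.

EMBED_BACKFILL_CAP = 5  # "a small number" -- the real cap is an internal detail

def backfill_session_embeddings(store, embedder, workspace_id: str) -> int:
    # Find completed sessions that never got an embedding, up to the cap.
    pending = store.completed_sessions_without_embeddings(
        workspace_id, limit=EMBED_BACKFILL_CAP
    )
    for session in pending:
        store.save_embedding(session.session_id, embedder.embed(session.summary))
    return len(pending)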

3. The query gets parsed, then embedded

Before routing, the API extracts a little bit of structure from your question:

  • language/framework hints (used as filters)
  • file entities (used for file-scoped fallbacks)

Then it embeds your query with the query: prefix, the marker retrieval-tuned embedding models use to tell search queries apart from the stored passages they are matched against.
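A simplified sketch of that pre-processing step; the hint table and file pattern are stand-ins, and embedder is whatever embedding client the API uses.

import re

KNOWN_HINTS = {"python", "fastapi", "typescript", "express"}  # stand-in table
FILE_RE = re.compile(r"\b[\w./-]+\.(?:py|ts|js|go|rs)\b")

def parse_question(question: str, embedder) -> dict:
    words = set(re.findall(r"[a-z]+", question.lower()))
    hints = sorted(words & KNOWN_HINTS)       # language/framework filters
    files = FILE_RE.findall(question)         # targets for file-scoped fallbacks
    vector = embedder.embed("query: " + question)
    return {"hints": hints, "files": files, "vector": vector}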

4. Intent routing picks a plan (ML, then regex)

The router tries a tiny ML classifier first. If confidence is below a threshold, it falls back to regex routing.

Current intents include:

  • overview: “Describe this workspace”. Prefers client graph_data and returns a structural summary (languages, frameworks, entrypoints, hotspots).
  • coverage: Cross-workspace endpoint coverage. “What upstream endpoints are we not calling?” Best-effort, returns an enumerate_context based on stored sessions plus cross-workspace links and outbound HTTP calls.
  • relationship: Cross-workspace dependency direction. “Which depends on which?” Best-effort, returns an enumerate_context for depends_on and depended_on_by, and optionally edge counts between two named workspaces.
  • flow: Call-path tracing. “How does X work?” and “Show me the path to Y” return a flow_context (prefer client graph_data, fall back to stored graph).
  • usage: Caller/usage lookup. “Who calls this?” returns a graph slice (graph_context). If the target is ambiguous, the response includes disambiguation tokens you can paste into a follow-up query.
  • impact: Change blast radius. “If I change this, what breaks?” returns affected files and relationships in graph_context, and can be enriched with findings from dependent files.
  • dependencies: Imports and external dependencies. “What does this depend on?” returns dependency context in graph_context.
  • centrality: Hotspots. “Most central files” or “most central functions” return a ranked view based on heuristic centrality.
  • observability: SLO coverage. “Which routes are monitored?” returns slo_context (prefer client SLO nodes, fall back to stored SLO data).
  • enumerate: Counting and listing. “List all routes” and “How many endpoints?” return enumerate_context, typically from client graph_data.
  • semantic: Everything else. Vector similarity over session summaries and findings, with concept-based rule filtering when possible.

Routing is not only “what intent”. The plan can also say “I need a concrete target”. That is how we decide between a useful answer and a polite disambiguation prompt.
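Condensed into a sketch, with a stand-in classifier, a made-up confidence threshold, and only a few of the regex routes:

import re

REGEX_ROUTES = [
    (re.compile(r"\bwho calls\b", re.I), "usage"),
    (re.compile(r"\bnot calling\b|\bcoverage\b", re.I), "coverage"),
    (re.compile(r"\bif (i|we) change\b|\bwhat breaks\b", re.I), "impact"),
]
CONFIDENCE_THRESHOLD = 0.7  # assumption, not the real number

def route(question: str, classifier) -> dict:
    intent, confidence = classifier.predict(question)
    if confidence < CONFIDENCE_THRESHOLD:
        intent = next(
            (name for pattern, name in REGEX_ROUTES if pattern.search(question)),
            "semantic",  # everything else
        )
    # The plan is more than an intent label: some intents need a concrete target,
    # which is what triggers the disambiguation prompt instead of a guess.
    return {"intent": intent, "needs_target": intent in {"usage", "impact", "flow"}}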

5. Fast paths for the questions that are not really semantic

Some questions should not hit the generic semantic pipeline at all.

Two examples that are implemented as fast paths today:

  • overview: “describe this workspace” when graph_data is available
  • cross-workspace coverage and relationship queries when workspace_id is provided

For our Waiter/Kitchen question, the intent is coverage. If the workspace has a stored graph, the API builds an enumerate_context describing upstream endpoints and which ones appear to be missing.

This is intentionally best-effort. It depends on resolved cross-workspace links and outbound HTTP calls.
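Stripped of the plumbing, the coverage computation is a set difference. The endpoint strings below are invented for illustration.

def endpoint_coverage(upstream_endpoints: set[str], outbound_calls: set[str]) -> dict:
    return {
        "upstream_total": len(upstream_endpoints),
        "called": sorted(upstream_endpoints & outbound_calls),
        "not_called": sorted(upstream_endpoints - outbound_calls),  # the answer
    }

# e.g. endpoint_coverage({"POST /tickets", "GET /menu"}, {"POST /tickets"})
# reports not_called == ["GET /menu"]

Everything hard lives upstream of this function: resolving which workspace is the upstream one, and which outbound calls actually point at it.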

6. Semantic retrieval always runs (sessions and findings)

Even if you asked a graph question, the endpoint still retrieves similar sessions and similar findings.

This serves two purposes:

  • give a default answer shape for “how is this doing” style questions
  • provide supporting evidence alongside a graph slice

Filters can come from a few places:

  • workspace scope
  • language/framework hints
  • concept-derived rule patterns (“error handling”, “timeout”, “sql injection”)

When a query is broad and not concept-targeted, the API diversifies findings by rule type (to avoid returning ten copies of the same pattern).

When a query is concept-targeted, it does not diversify. If you asked about timeouts, you usually want the timeouts.
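The diversification itself is small. A sketch, with the per-rule and total caps as assumptions:

from collections import defaultdict

def diversify_by_rule(findings: list[dict], per_rule: int = 2, limit: int = 10) -> list[dict]:
    # Findings arrive sorted by similarity; keep at most `per_rule` per rule type
    # so one repeated pattern cannot fill the whole list.
    counts: dict[str, int] = defaultdict(int)
    kept = []
    for finding in findings:
        if counts[finding["rule"]] < per_rule:
            counts[finding["rule"]] += 1
            kept.append(finding)
        if len(kept) == limit:
            break
    return kept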

7. A small UX trap: “show me issues in X” means “filter by file”

Users often ask:

“Show me the issues in auth.py”

That reads like semantic search, but the intended action is usually a file filter. The API has an explicit file-scoped fallback for queries that mention a file and words like “risks” or “issues”.

It will override the generic semantic results and return findings for that file token from the latest session.
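The trigger is deliberately simple. A sketch, with the trigger words and the matching both simplified:

def file_scoped_findings(question: str, files_in_question: list[str],
                         latest_session_findings: list[dict]):
    asks_for_issues = any(word in question.lower()
                          for word in ("issue", "issues", "risk", "risks"))
    if not (asks_for_issues and files_in_question):
        return None  # no override; the semantic results stand
    token = files_in_question[0]
    return [f for f in latest_session_findings if token in f["file"]]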

8. Graph slices, flow paths, and SLOs

When the intent calls for structure, the API attempts structural retrieval:

  • usage, impact, dependencies, centrality: use the stored graph
  • flow: prefer client-provided graph_data, fall back to stored graph
  • observability: prefer client-provided SLO nodes, fall back to stored graph SLO data
  • enumerate: can be built from client graph

If the target cannot be resolved, the response contains disambiguation tokens you can paste into a follow-up query.
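The disambiguation side of that looks roughly like this sketch, assuming a graph of nodes with string ids:

def resolve_or_disambiguate(graph_nodes: list[dict], target: str) -> dict:
    matches = [node["id"] for node in graph_nodes if target in node["id"]]
    if len(matches) == 1:
        return {"resolved": matches[0]}
    # Zero or many matches: hand back candidate tokens for a follow-up query.
    return {"disambiguation": matches[:5]}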

9. Graph enrichment: findings from dependent files

When an impact or usage slice returns a set of dependent files, the API can pull findings for those files and attach them as additional context.

They are marked as coming from dependent files, so an LLM (or a human) can explain why they show up.
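The enrichment step amounts to a tagged merge, something like:

def enrich_with_dependent_findings(dependent_files: list[str],
                                   findings_by_file: dict[str, list[dict]]) -> list[dict]:
    enriched = []
    for path in dependent_files:
        for finding in findings_by_file.get(path, []):
            # Tag the provenance so a reader (or an LLM) can explain why
            # a finding from another file showed up in this answer.
            enriched.append({**finding, "source": "dependent_file"})
    return enriched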

A concrete trace: the coverage query

Back to our original question.

Terminal window
unfault ask "On waiter, what endpoints from kitchen are we not calling?" --workspace wks_waiter

What happens:

  1. The API embeds the query.
  2. Routing selects intent coverage.
  3. The endpoint uses the latest stored session with a graph for wks_waiter.
  4. It loads cross-workspace links and outbound HTTP calls.
  5. It returns an enumerate_context summarizing upstream endpoint coverage.

If the stored graph is missing or stale, the response will still include semantic context, plus a hint about what is missing.
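For flavor, here is a hypothetical enumerate_context for this trace. The kitchen endpoints are invented; only the overall shape mirrors what the CLI prints.

enumerate_context = {
    "kind": "cross_workspace_coverage",
    "upstream_workspace": "wks_kitchen",
    "endpoints": [
        {"endpoint": "POST /api/tickets", "called_by_waiter": True},
        {"endpoint": "GET /api/menu", "called_by_waiter": True},
        {"endpoint": "POST /api/refire", "called_by_waiter": False},  # the gap
    ],
}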

Why answers can differ

If two people ask the same question and get slightly different answers, it is usually one of these:

  • freshness: one of you ran unfault review after the last refactor and the other did not
  • fidelity: client graph_data and stored graph do not always match one-to-one
  • routing: low confidence routes may fall back to semantic, which is intentionally broader
  • indexing warmth: embeddings are generated lazily and may not exist for older sessions yet

When an answer feels off, the fix is boring but reliable: run unfault review again.