World Model
When unfault flags a missing timeout, the finding itself is the least interesting part. The interesting part is what happens downstream: which services are affected, which SLO is at risk, and how confident we are in that assessment.
This page explains the machinery behind that: the code graph, the propagation model, and how runtime data from SLOs and distributed traces augments what static analysis can see on its own.
The problem with file-level findings
A rule fires on a file. The finding names a line. That’s useful, but it answers the wrong question. The developer already knows the line exists; what they don’t know is whether it matters enough to fix before shipping.
Traditional static analysis can’t answer that because it has no model of the system. It sees the code but not the architecture. It sees the function but not which HTTP handler calls it, which SLO covers that handler, or what happens to a downstream service if a retry storm starts here at 3am.
unfault builds a graph that lets it answer those questions.
The code graph
The foundation is a directed graph built by the Rust/Tree-sitter parser:
- File nodes: one per source file, with language and path
- Function nodes: functions and methods, with HTTP handler metadata when applicable
- ExternalModule nodes: third-party libraries, categorised (HttpClient, Database, etc.)
- Edges: Imports, ImportsFrom, Calls, Contains, UsesLibrary
This is a static graph. It captures the structural shape of the codebase from source alone, without executing anything. It’s fast, deterministic, and complete within the repository boundary.
The graph is the starting point. On its own it enables blast radius queries (“what imports this file?”) and centrality analysis (“which file is most depended upon?”). What it can’t do alone is connect a code-level finding to a business objective.
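As a sketch, the structure and a blast-radius query can be modelled in Python. This is illustrative only: the real graph is built by the Rust parser, and `Graph`, `add_edge`, and `blast_radius` are hypothetical names, not the actual API.

```python
from collections import defaultdict

class Graph:
    """Illustrative stand-in for the code graph described above."""
    def __init__(self):
        self.reverse = defaultdict(list)  # dst -> [(edge_type, src)]

    def add_edge(self, src, edge_type, dst):
        self.reverse[dst].append((edge_type, src))

    def blast_radius(self, node):
        """Everything that transitively imports or calls `node`."""
        seen, stack = set(), [node]
        while stack:
            for edge_type, src in self.reverse[stack.pop()]:
                if edge_type in ("Imports", "ImportsFrom", "Calls") and src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen

g = Graph()
g.add_edge("auth.py", "Imports", "db.py")    # auth.py imports db.py
g.add_edge("main.py", "Imports", "auth.py")  # main.py imports auth.py
print(sorted(g.blast_radius("db.py")))       # ['auth.py', 'main.py']
```

Walking the reversed edge index is what makes "what imports this file?" a constant-time lookup per hop rather than a scan.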
Adding runtime data: SLOs and traces
Two optional enrichment passes extend the graph beyond the repository boundary.
SLO nodes (MonitoredBy edges)
When GCP Cloud Monitoring, Datadog, or Dynatrace credentials are present, unfault
fetches SLO definitions and matches them to HTTP route handlers in the graph using
path patterns. Each matched handler gets a MonitoredBy edge pointing to a
GraphNode::Slo:
Function(POST /checkout) --[MonitoredBy]--> Slo("Checkout API 99.9%")

SLO nodes are the top tier of the hierarchy: they represent what “success” means for a user journey. When the propagation model reaches an SLO node, it has a concrete answer to “what breaks”: not an inferred entrypoint, but a declared availability target.
For service-level SLOs (those without a specific path pattern), unfault matches the GCP service slug embedded in the SLO resource name against the local workspace directory name. This prevents sibling services in the same GCP project from being incorrectly linked to the wrong codebase.
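A minimal sketch of the path-pattern matching, assuming a single-segment `*` wildcard. The function name and the wildcard semantics are assumptions for illustration, not the real matcher:

```python
import re

def matches_route(slo_path_pattern, handler_route):
    """Hypothetical matcher between an SLO path pattern and a handler route.
    Treats '*' as one path segment; everything else matches literally."""
    regex = "^" + re.escape(slo_path_pattern).replace(r"\*", "[^/]+") + "$"
    return re.fullmatch(regex, handler_route) is not None

print(matches_route("/orders/*", "/orders/123"))        # True
print(matches_route("/orders/*", "/orders/123/items"))  # False
```

Escaping the pattern first, then re-expanding the wildcard, keeps literal regex characters in route paths from being misinterpreted.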
RemoteService nodes (RemoteCall edges)
When GCP Cloud Trace credentials are present, unfault fetches recent spans from
the Cloud Trace v1 API and extracts cross-service call patterns. Each distinct
remote service observed in RPC_CLIENT spans (or outbound HTTP spans, since
Cloud Run’s OTEL exporter omits the kind field) becomes a GraphNode::RemoteService,
linked to the local file that makes the call:
File(payments/client.py) --[RemoteCall]--> RemoteService("inventory-service")

Service name extraction works in layers: peer.service label first, then
/http/host, then span name heuristics (Sent.<Service>, gRPC patterns), then
URL host. Kubernetes FQDNs are stripped to the service name component;
public internet hostnames (.googleapis.com, .github.com, etc.) are kept intact.
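The layered extraction can be sketched as follows. This is illustrative Python drawn from the description above; the label keys, helper names, and public-suffix list are assumptions, not the real Rust implementation:

```python
from urllib.parse import urlparse

def strip_fqdn(host):
    """Strip Kubernetes FQDNs to the service component; keep public hosts."""
    if host.endswith((".googleapis.com", ".github.com")):  # assumed allowlist
        return host
    if ".svc.cluster.local" in host or host.endswith(".svc"):
        return host.split(".")[0]
    return host

def extract_service_name(span):
    """Layered extraction over a span's labels (dict keys are illustrative)."""
    # 1. Explicit peer.service label wins.
    if span.get("peer.service"):
        return strip_fqdn(span["peer.service"])
    # 2. /http/host label.
    if span.get("/http/host"):
        return strip_fqdn(span["/http/host"])
    # 3. Span-name heuristic, e.g. "Sent.InventoryService".
    name = span.get("name", "")
    if name.startswith("Sent."):
        return name[len("Sent."):]
    # 4. Fall back to the URL host.
    if span.get("url"):
        return strip_fqdn(urlparse(span["url"]).hostname or "")
    return None

print(extract_service_name({"peer.service": "inventory.default.svc.cluster.local"}))  # inventory
```

Ordering the layers from most to least explicit means a well-instrumented span never falls through to the fragile URL-host guess.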
The value of RemoteCall edges is that they extend the propagation model across
service boundaries. A finding in a file that calls an external service now has a
propagation path that crosses the repository boundary, which is categorically
different from a local failure, because there’s no local recovery path.
The propagation model
Given a finding at a file, the model asks: if this file breaks, what is the furthest meaningful thing that breaks with it?
The answer is computed by a weighted BFS that traverses the graph in two directions simultaneously.
Reverse propagation (blast radius)
Calls and Imports edges point from dependent to dependency. To find everything
affected when a file breaks, we walk against these edges, collecting everything
that imports or calls the failing file:
File(db.py) <--[Imports]-- File(auth.py) <--[Imports]-- File(main.py)

This is the blast radius direction. It answers “who depends on me.”
Forward propagation (consequence)
MonitoredBy and RemoteCall edges point forward toward consequences. We follow
these in the normal direction to reach anchors:
Function(handler) --[MonitoredBy]--> Slo("Checkout API")
File(payments.py) --[RemoteCall]--> RemoteService("inventory-service")

Contains edges (File → Function) are also traversed forward with zero weight,
so the model can reach MonitoredBy edges on function nodes from findings that
land on the parent file.
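The per-edge direction rule can be sketched in Python (the real traversal is in Rust; the flat edge-list shape here is illustrative):

```python
REVERSE_EDGES = {"Imports", "ImportsFrom", "Calls"}        # walked against the arrow
FORWARD_EDGES = {"Contains", "MonitoredBy", "RemoteCall"}  # walked with the arrow

def neighbours(edges, node):
    """Yield the nodes a failure at `node` propagates to, given a flat
    list of (edge_type, src, dst) tuples."""
    for edge_type, src, dst in edges:
        if edge_type in REVERSE_EDGES and dst == node:
            yield src  # blast radius: the dependent breaks too
        elif edge_type in FORWARD_EDGES and src == node:
            yield dst  # consequence: toward SLOs and remote services

edges = [
    ("Imports", "auth.py", "db.py"),
    ("Contains", "auth.py", "login()"),
    ("MonitoredBy", "login()", 'Slo("Auth API")'),
]
print(list(neighbours(edges, "db.py")))    # ['auth.py']
print(list(neighbours(edges, "auth.py")))  # ['login()']
```

A failure at `db.py` flows backwards to its importer `auth.py`, then forward through the zero-weight Contains edge to `login()`, and from there to the SLO.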
Edge weights
Each edge type carries a propagation weight representing the conditional probability that a failure at the source materialises at the target:
| Edge | Weight | Rationale |
|---|---|---|
| Calls | 0.80 | Direct invocation; caller blocks on callee |
| Imports / ImportsFrom | 0.50 | Structural dependency; indirect but real |
| Contains | 0.00 | Traversal only, no additional risk |
| RemoteCall | 0.90 | Cross-service; no local circuit breaker assumed |
| MonitoredBy | 1.00 | Reaching the SLO confirms macro-goal impact |
The aggregate risk is the complement probability product across hops:
risk = 1 - ∏(1 - weight_i)

This is the “at least one failure propagates” probability under the independence
assumption, expressed as a percentage. A two-hop path through Imports (0.5) and
RemoteCall (0.9) gives 1 - (0.5 × 0.1) = 95%.
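The same computation in a few lines of illustrative Python (the real computation lives in the Rust engine):

```python
def aggregate_risk(weights):
    """Complement-product risk: P(at least one hop propagates),
    assuming each hop fails independently."""
    survive = 1.0
    for w in weights:
        survive *= (1.0 - w)  # probability the failure dies at this hop
    return 1.0 - survive

# Two hops: Imports (0.5) then RemoteCall (0.9)
print(round(aggregate_risk([0.5, 0.9]) * 100))  # 95
```

Because the product is over survival probabilities, a single weight-1.0 hop (MonitoredBy) pins the aggregate at 100%, and a zero-weight hop (Contains) leaves it unchanged.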
Anchor priority
The BFS selects the best anchor found in priority order:
- SLO node: highest confidence. The finding is tied to a declared availability target with a specific percentage and timeframe.
- RemoteService node: present when trace data is available. Signals a cross-service boundary, which matters because there is no local recovery path.
- Inferred entrypoint: fallback. The nearest file with no importers (a root of the import tree) is used as a proxy for the request entry point.
When no anchor is reachable (isolated file, no SLOs configured, no traces), the risk score is zero and the system view line is omitted from the output.
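The selection step can be sketched as follows (illustrative Python; the candidate tuple shape and the risk tie-break are assumptions):

```python
ANCHOR_PRIORITY = {"Slo": 0, "RemoteService": 1, "Entrypoint": 2}  # lower wins

def best_anchor(candidates):
    """Pick the highest-priority anchor the BFS reached.
    `candidates` is a list of (kind, name, risk) tuples."""
    if not candidates:
        return None  # no anchor: risk is zero, system view line omitted
    # Prefer anchor kind first, then the riskier path within a kind.
    return min(candidates, key=lambda c: (ANCHOR_PRIORITY[c[0]], -c[2]))

candidates = [
    ("Entrypoint", "main.py", 0.50),
    ("Slo", "Checkout API 99.9%", 0.95),
]
print(best_anchor(candidates))  # ('Slo', 'Checkout API 99.9%', 0.95)
```

Even if the inferred entrypoint is reached first during traversal, a reachable SLO node always wins the final selection.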
The output
The result of the propagation model is attached to every SystemHazard as a
PropagationPath:
hops: [payments/client.py, checkout_handler.py, SLO: Checkout API]
aggregate_risk: 95.0
macro_goal: "Checkout API 99.9%"
anchored_to_slo: true

This drives the ↳ puts line in the review output:
🟡 payments/client.py:48 · The Retry Storm HTTP call via httpx.AsyncClient has no retry policy ↳ puts Checkout API (99.9%) at risk (95%)

Enrichment cache
SLO and trace fetches are cached at .unfault/cache/enrichment/ with a 5-minute
TTL, keyed on (project_id, workspace_name). The review footer distinguishes
cache hits (cached, green) from live fetches (fetch Xms, yellow), so the
source of latency is always visible.
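A minimal sketch of the freshness check, assuming one cache file per (project_id, workspace_name) pair with an mtime-based TTL. The file layout and names are hypothetical; only the 5-minute TTL comes from the description above:

```python
import os
import time

CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL

def cache_path(project_id, workspace_name):
    # Hypothetical key layout; the real filenames may differ.
    return f".unfault/cache/enrichment/{project_id}--{workspace_name}.json"

def is_fresh(path, now=None):
    """True if a cached enrichment file exists and is within the TTL."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < CACHE_TTL_SECONDS
```

A stale or missing file triggers a live fetch, which is what the yellow "fetch Xms" footer entry reports.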
unfault review                 # uses cache if fresh
unfault review --refresh-cache # bust cache, re-fetch from providers
unfault review --offline       # skip enrichment entirely

Where the ideas came from
The three-tier structure (primitives, sub-goals, macro-goals) was shaped in part by reading two papers from early 2026:
Dupoux, LeCun, Malik et al. (arXiv:2603.15381) proposes a cognitive architecture with three learning modes, including a meta-controller (System M) that switches between passive observation and active exploration based on internal signals. The framing of a system that reasons at multiple levels of abstraction, rather than applying flat rules, is what we were reaching for. The analogy isn’t precise: unfault’s “meta-controller” is just the propagation model deciding which anchor is most relevant, not a learned policy. But the vocabulary was useful for thinking about the problem.
Zhang et al. (arXiv:2604.03208) presents hierarchical planning with latent world models for robotic manipulation. The key result is that planning at multiple temporal scales (a high-level planner generating sub-goal waypoints, a low-level planner executing them) dramatically outperforms single-level planning on compositional tasks. The structural parallel to code analysis is real: a finding at a line (primitive) is only interpretable in the context of the call chain (sub-goal) it belongs to, which is only meaningful against the business objective (macro-goal) it serves. We’re not building a latent world model or doing any learning; the code graph is our world model, and it’s deterministic.
The honest summary: these papers articulated a way of framing the problem that we found useful. The implementation is a weighted BFS over a directed graph, with SLO and trace data bolted on as optional enrichment. Nothing exotic.
Limitations
The independence assumption is wrong. The complement probability product assumes each hop fails independently. In practice, failures are correlated; a database outage hits every service that uses it simultaneously. The risk scores are a relative ranking, not calibrated probabilities.
Static graphs miss dynamic dispatch. If a file calls a function through an interface, the graph may not capture the concrete implementation. The propagation model is conservative (it uses what it can see) but it can miss paths.
Trace coverage is partial. Cloud Trace only captures what was instrumented and
exercised recently. A code path that hasn’t been hit in the last hour won’t appear
as a RemoteCall edge. Enabling OTEL instrumentation on all services and ensuring
regular traffic will improve coverage.
Service matching is heuristic. SLOs are matched to the local workspace by
comparing the GCP service slug or SLO display name against the directory name.
This works in conventional layouts but will produce incorrect results in monorepos
where the directory name doesn’t match the deployed service name. The path pattern
mechanism (setting /path labels on SLOs) is more reliable when it’s available.