Finding our way back

Use Unfault to locate common production-readiness gaps

In our last article, we found a “ghost” in our Waiter-to-Kitchen path. By injecting a 5-second delay, we discovered that our Python service didn’t just slow down; it choked. The lack of a defined timeout meant that worker threads were held hostage by a downstream bottleneck.

Finding a weakness is a win, but in a fast-moving codebase, wins can be temporary. If we don’t codify what we’ve learned, the next person to touch the create_order function might accidentally remove our fix or introduce a similar fragility elsewhere.

Closing the Loop: From fault to Fact

Unfault treats your code as a set of facts. From those facts it derives findings about gaps, like the brittle call between the Waiter service and the Kitchen service.

Use unfault review to get a general overview of the health of your workspace.

Terminal window
$ unfault review
Looks good overall, with a couple spots that deserve a closer look. Two themes
keep showing up: resilience hardening and other cleanup. Starting point: main.py
(FastAPI app 'app' missing correlation ID middleware); then main.py (Large JSON
response loaded into memory).
At a glance
· One call missing a timeout
· Circuit breakers would help fail fast when dependencies are down
· Rate limiting would protect against abuse
· CORS config needed if browsers will call this API
────────────────────────────────────────────────────────────────────────────────
1248ms - python / fastapi - 1 file
Tip: use --output full to drill into hotspots.

At Unfault, we believe in helping developers without inducing more anxiety. Our default approach is to gently highlight gaps and let you decide when you need more specifics. You can then drill down:

Terminal window
$ unfault review --output=json | jq '.findings[] | select(.message | test("create_order"))'
{
"applicability_json": "{\"investment_level\":\"low\",\"min_stage\":\"prototype\",\"decision_level\":\"code\",\"benefits\":[\"reliability\",\"latency\"],\"prerequisites\":[],\"notes\":\"Time bounds are helpful even in demos; pick a sensible default.\"}",
"column": 38,
"description": "This HTTP client call in `create_order` does not specify a timeout. A timeout ensures the call completes within a known time bound, which helps maintain predictable response times for your service. Consider using a sensible timeout value tuned to your requirements.",
"dimension": "Stability",
"end_column": 14,
"end_line": 46,
"file_path": "main.py",
"fix_preview": "# Before:\n# client.post(\n f\"{KITCHEN_URL}/orders\",\n json={\n \"dish\": order.dish,\n \"order_id\": order_id\n }\n )\n# After:\nclient.post(\n f\"{KITCHEN_URL}/orders\",\n json={\n \"dish\": order.dish,\n \"order_id\": order_id\n },\n timeout=5.0,\n )",
"line": 40,
"message": "This HTTP client call in `create_order` does not specify a timeout. A timeout ensures the call completes within a known time bound, which helps maintain predictable response times for your service. Consider using a sensible timeout value tuned to your requirements.",
"rule_id": "python.http.missing_timeout",
"severity": "Medium",
"title": "HTTP call via `httpx`.post has no timeout"
}
{
"applicability_json": "{\"investment_level\":\"medium\",\"min_stage\":\"product\",\"decision_level\":\"code\",\"benefits\":[\"reliability\"],\"prerequisites\":[\"Only retry idempotent operations (or add idempotency keys)\",\"Define which failures are retryable and apply backoff + max attempts\"],\"notes\":\"Retries can increase load during outages; tune carefully and measure.\"}",
"column": 38,
"description": "This HTTP client call in `create_order` does not have a retry mechanism. Transient network failures (connection timeouts, 5xx errors, DNS issues) will propagate directly as user-visible errors. Consider adding a retry policy using tenacity, backoff, or urllib3.Retry with HTTPAdapter.",
"dimension": "Stability",
"end_column": null,
"end_line": null,
"file_path": "main.py",
"fix_preview": "# Option 1: Use tenacity decorator\nfrom tenacity import retry, stop_after_attempt, wait_exponential\n\n@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))\nasync def make_request():\n client.post(\n f\"{KITCHEN_URL}/orders\",\n json={\n \"dish\": order.dish,\n \"order_id\": order_id\n }\n )\n\n# Option 2: Use httpx with transport retries\nimport httpx\n\ntransport = httpx.HTTPTransport(retries=3)\nclient = httpx.Client(transport=transport)\n# Then use client.get(), client.post(), etc.",
"line": 40,
"message": "This HTTP client call in `create_order` does not have a retry mechanism. Transient network failures (connection timeouts, 5xx errors, DNS issues) will propagate directly as user-visible errors. Consider adding a retry policy using tenacity, backoff, or urllib3.Retry with HTTPAdapter.",
"rule_id": "python.http.missing_retry",
"severity": "Medium",
"title": "HTTP call via `httpx`.post has no retry policy"
}
{
"applicability_json": "{\"investment_level\":\"high\",\"min_stage\":\"production\",\"decision_level\":\"architecture\",\"benefits\":[\"reliability\",\"operability\"],\"prerequisites\":[\"Choose a circuit breaker library/pattern\",\"Define fallback behavior and error semantics\",\"Tune thresholds based on real traffic\"],\"notes\":\"Typically unnecessary for small demos; most useful with real traffic and external dependencies.\"}",
"column": 38,
"description": "This HTTP call in `create_order` does not have circuit breaker protection. A circuit breaker allows your service to fail fast and recover gracefully when external dependencies are slow or unavailable. Consider using the `circuitbreaker` or `pybreaker` library.",
"dimension": "Stability",
"end_column": null,
"end_line": null,
"file_path": "main.py",
"fix_preview": "# Install: pip install circuitbreaker\nfrom circuitbreaker import circuit\n\n@circuit(failure_threshold=5, recovery_timeout=60)\ndef create_order():\n # After 5 consecutive failures, the circuit opens for 60 seconds\n # During this time, calls fail fast without hitting the external service\n response = requests.get('https://api.example.com/data', timeout=30)\n return response.json()\n\n# Alternative: Use pybreaker for more control\nfrom pybreaker import CircuitBreaker\n\nbreaker = CircuitBreaker(fail_max=5, reset_timeout=60)\n\n@breaker\ndef create_order_with_pybreaker():\n response = requests.get('https://api.example.com/data', timeout=30)\n return response.json()",
"line": 40,
"message": "This HTTP call in `create_order` does not have circuit breaker protection. A circuit breaker allows your service to fail fast and recover gracefully when external dependencies are slow or unavailable. Consider using the `circuitbreaker` or `pybreaker` library.",
"rule_id": "python.resilience.missing_circuit_breaker",
"severity": "High",
"title": "HTTP call to external service in `create_order` lacks circuit breaker protection"
}

Once we identified that the Waiter -> Kitchen path is sensitive to latency, we updated our Python client to include a strict timeout and a retry policy.
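
Here is a minimal sketch of that change, assuming the call lives in an async handler and that the Kitchen base URL sits in KITCHEN_URL as before. The helper name send_to_kitchen and the exact timeout and retry numbers are ours, illustrative rather than prescriptive; tune them to your own latency budget.

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# send_to_kitchen is a hypothetical wrapper around the call flagged in create_order.
# Retry only transient transport errors, cap the attempts, and reraise the original
# exception so the caller can translate it into a proper HTTP error.
@retry(
    retry=retry_if_exception_type(httpx.TransportError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=2),
    reraise=True,
)
async def send_to_kitchen(order_id: str, dish: str) -> httpx.Response:
    # The explicit timeout is the fix for the original finding: a worker can no
    # longer be held hostage by a slow Kitchen.
    async with httpx.AsyncClient(timeout=httpx.Timeout(2.5)) as client:
        response = await client.post(
            f"{KITCHEN_URL}/orders",
            json={"dish": dish, "order_id": order_id},
        )
        response.raise_for_status()
        return response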

When we run the Unfault reviewer again, it doesn’t just see a “link” between the two services; it registers the resilience requirement as a Finding in the Unfault API.

This turns a transient discovery into a permanent Fact in our semantic graph. The next time you look at this path in the LSP, you won’t just see the route; you’ll see the documented constraint: “This path requires a timeout under 3s to prevent worker exhaustion.”

Continuous Verification

The true power of this workflow is that it’s repeatable. We can now run our fault plan as part of our CI/CD pipeline or as a periodic “health check.”
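
As a sketch of what that gate could look like, the script below wraps the same unfault review --output=json invocation we used earlier and fails the build whenever a High-severity finding reappears. The script itself and its severity threshold are our own choices, not something Unfault ships.

# check_findings.py: a hypothetical CI gate around `unfault review`.
import json
import subprocess
import sys

# Reuse the JSON output explored above.
result = subprocess.run(
    ["unfault", "review", "--output=json"],
    capture_output=True,
    text=True,
    check=True,
)

findings = json.loads(result.stdout).get("findings", [])
high = [f for f in findings if f.get("severity") == "High"]

for finding in high:
    print(f"{finding['file_path']}:{finding['line']} {finding['title']}")

# Fail the pipeline when a High-severity gap sneaks back in.
sys.exit(1 if high else 0)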

Because Unfault understands the graph, it can automatically verify that:

  • The execution path still exists.
  • The code handles the injected latency as expected (e.g., returning a 503 Service Unavailable instead of a 500 Internal Server Error or a timeout crash).
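
For that second check, the behavior we want on the Waiter side looks roughly like this: the handler translates a Kitchen timeout into an explicit 503 instead of letting it bubble up as a 500. This is an illustrative sketch, not our exact service code; the Order model here is a minimal stand-in, and send_to_kitchen is the timeout-and-retry helper sketched above.

import uuid

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Order(BaseModel):
    dish: str

@app.post("/orders")
async def create_order(order: Order):
    order_id = str(uuid.uuid4())
    try:
        # send_to_kitchen is the timeout- and retry-wrapped call from the sketch above.
        await send_to_kitchen(order_id, order.dish)
    except httpx.TimeoutException:
        # Injected latency lands here: a deliberate 503, not a 500 or a hung worker.
        raise HTTPException(status_code=503, detail="Kitchen is unavailable, please retry")
    return {"order_id": order_id, "status": "accepted"}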

Conclusion: A Hike with a Map and a Compass

We started this series talking about code as a hike. Most of the time, we are hiking without a map, guessing at the stability of the bridges we cross.

By treating your codebase as a semantic graph, Unfault gives you the map. By adding fault, it gives you the ability to test the bridges before you put your full weight on them.

You no longer have to wait for production incidents to understand how your services behave. You can explore the paths, find the faults, and turn them into facts, shortening the bridge between coding time and execution time.

Safe coding, y’all.