The router, by the numbers
L0 to L3 in 0.6 ms, then cheapest capable silicon across the mesh. ~75% energy reduction at flat quality on real mixed traffic.
The router is the part of Joule Cloud most customers don't think about, which is exactly the goal. You call model: "auto"; we put your request on the cheapest capable silicon currently available across the mesh. This is what's happening when you do.
L0 to L3, in 0.6 ms
Each inference request hits the gateway and is classified by a small distilled model trained to predict task difficulty from the prompt. Sub-millisecond. The output is a tier:
| Tier | Bucket | What goes here | Typical J/req |
|---|---|---|---|
| L0 | lookup | cache hits, small embeddings, key-value access | ~0.01 |
| L1 | extraction | short summarization, classification, NER, sentiment | ~0.05 |
| L2 | aggregation | RAG, mid-context summarization, structured generation | ~0.3 |
| L3 | reasoning | long-context reasoning, code gen, planning, multi-step | ~6 |
The classifier is intentionally conservative — when it's uncertain, it upgrades. We'd rather pay the extra joules of a too-strong model than risk a quality regression a user notices.
Then: cheapest capable silicon
For the resolved tier, the router maintains a live ranking across every node in the mesh that can serve it. The score is a weighted sum:
- Per-token energy — joules per output token on that silicon (the dominant term)
- Local grid carbon intensity — gCO₂/kWh, refreshed hourly from Electricity Maps
- Operator PUE — per data centre, declared
- Queue depth — backed-up nodes drop in ranking
- Network latency from the gateway — for the user's region preference
The top-ranked node gets the call. If health-check fails mid-flight, automatic failover to the next-ranked node. The decision lands on the response header (X-Routed-To) and the receipt.
What this actually saves
For a sample week of mixed customer traffic (anonymised), "auto" vs pinning llama-3.3-70b-instruct:
| Quantity | Pinned 70B | auto |
|---|---|---|
| L0 lookups | 2.1 M × 0.31 J = 651 kJ | 2.1 M × 0.012 J = 25 kJ |
| L1 extractions | 800k × 0.31 J = 248 kJ | 800k × 0.052 J = 41 kJ |
| L2 aggregations | 140k × 0.31 J = 43 kJ | 140k × 0.29 J = 41 kJ |
| L3 reasoning | 22k × 0.31 J = 7 kJ | 22k × 5.9 J = 130 kJ |
| Total | 949 kJ | 237 kJ |
~75% energy reduction at flat quality (judge-model evaluated). The savings come from the long L0/L1 tail, where pinning 70B was using a sledgehammer on a thumbtack.
When NOT to use auto
Three scenarios where you should pin:
- Brand voice. If your product's UX depends on the chosen model's response style (Claude-y, GPT-y, etc.), pin.
- Determinism for evals. Pin the model when running benchmarks so apples-to-apples.
- Specialty. Pin Qwen for Chinese, DeepSeek for math, FLUX-dev for image quality.
For everything else, "auto" is the default for a reason.
The override
If you disagree with the classifier on a specific call, force the tier:
# this call will be routed to L3-capable silicon regardless of classification
X-Force-Tier: L3
Audit-logged. Useful for "this looks easy but I know it isn't" cases (intentionally adversarial prompts, very-long-context that the classifier underestimates).
What's coming
- Per-account tier calibration. Your traffic patterns are stable; the classifier can learn your account-specific threshold over a couple weeks.
- Speculative decoding at the gateway: small drafter model proposes; large model verifies. Cuts L3 cost ~2× on long generations.
- Cache-aware routing. KV cache locality across requests in the same session, served from the same silicon when possible.
The model "auto" is doing more under the hood than the docs make obvious. We make it look simple because that's the point.