2026-06-23

The router, by the numbers

L0 to L3 in 0.6 ms, then cheapest capable silicon across the mesh. ~75% energy reduction at flat quality on real mixed traffic.

The router is the part of Joule Cloud most customers don't think about, which is exactly the goal. You call model: "auto"; we put your request on the cheapest capable silicon currently available across the mesh. This is what's happening when you do.

L0 to L3, in 0.6 ms

Each inference request hits the gateway and is classified by a small distilled model trained to predict task difficulty from the prompt. Classification takes about 0.6 ms. The output is a tier:

Tier	Bucket	What goes here	Typical J/req
L0	lookup	cache hits, small embeddings, key-value access	~0.01
L1	extraction	short summarization, classification, NER, sentiment	~0.05
L2	aggregation	RAG, mid-context summarization, structured generation	~0.3
L3	reasoning	long-context reasoning, code gen, planning, multi-step	~6

The classifier is intentionally conservative: when it's uncertain, it upgrades to the next tier. We'd rather pay the extra joules of a too-strong model than risk a quality regression a user notices.
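One way to picture "upgrade when uncertain" is a cumulative-confidence rule. This is a minimal sketch, not the production classifier: it assumes the classifier emits a probability for each tier, and the function name pick_tier and the 0.8 threshold are hypothetical.

```python
TIERS = ["L0", "L1", "L2", "L3"]

def pick_tier(probs, min_confidence=0.8):
    """Return the lowest tier t such that P(true tier <= t) >= min_confidence.

    probs is a list of four probabilities, one per tier, in L0..L3 order.
    The 0.8 default is an illustrative assumption, not a published value.
    """
    cumulative = 0.0
    for tier, p in zip(TIERS, probs):
        cumulative += p
        if cumulative >= min_confidence:
            return tier
    # Never confident enough: fall back to the strongest tier.
    return TIERS[-1]

# A plain argmax would choose L0 here; the conservative rule upgrades to L1.
print(pick_tier([0.55, 0.40, 0.04, 0.01]))
```

Under this rule, a prompt is only routed to a cheap tier when the classifier is confident that the cheap tier is enough, which is the trade described above: a few extra joules in exchange for fewer visible quality misses.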

Then: cheapest capable silicon

For the resolved tier, the router maintains a live ranking across every node in the mesh that can serve it. The score is a weighted sum:

Per-token energy — joules per output token on that silicon (the dominant term)
Local grid carbon intensity — gCO₂/kWh, refreshed hourly from Electricity Maps
Operator PUE — per data centre, declared
Queue depth — backed-up nodes drop in ranking
Network latency from the gateway — for the user's region preference

The top-ranked node gets the call. If a health check fails mid-flight, the request fails over automatically to the next-ranked node. The routing decision is recorded in the X-Routed-To response header and on the receipt.
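The ranking and failover can be sketched as follows. Everything here is an assumption made for illustration: the weights, the min-max normalisation, the Node fields, and the simplification of mid-flight failover to a single healthy flag. The only thing taken from the description above is the shape: a weighted sum over five signals, with per-token energy as the dominant term.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    joules_per_token: float    # per-token energy on this silicon
    grid_gco2_per_kwh: float   # local grid carbon intensity
    pue: float                 # operator-declared PUE
    queue_depth: int           # requests currently waiting
    latency_ms: float          # network latency from the gateway
    healthy: bool = True

# Hypothetical weights; energy is given the largest share.
WEIGHTS = {
    "joules_per_token": 0.6,
    "grid_gco2_per_kwh": 0.1,
    "pue": 0.1,
    "queue_depth": 0.1,
    "latency_ms": 0.1,
}

def normalise(values):
    """Min-max scale to [0, 1] so signals with different units are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rank(nodes):
    """Return nodes sorted best-first, i.e. lowest weighted cost first."""
    if not nodes:
        return []
    costs = [0.0] * len(nodes)
    for field, weight in WEIGHTS.items():
        column = normalise([getattr(n, field) for n in nodes])
        for i, value in enumerate(column):
            costs[i] += weight * value
    order = sorted(range(len(nodes)), key=costs.__getitem__)
    return [nodes[i] for i in order]

def route(nodes):
    """Pick the top-ranked healthy node, falling over to the next-ranked one."""
    for node in rank(nodes):
        if node.healthy:
            return node.name
    raise RuntimeError("no healthy node can serve this tier")
```

With these weights, a node that is worse on carbon, PUE, queue depth, and latency can still win if its per-token energy is low enough, which is what "dominant term" means in practice.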

What this actually saves

For a sample week of mixed customer traffic (anonymised), "auto" vs pinning llama-3.3-70b-instruct:

Quantity	Pinned 70B	auto
L0 lookups	2.1 M × 0.31 J = 651 kJ	2.1 M × 0.012 J = 25 kJ
L1 extractions	800k × 0.31 J = 248 kJ	800k × 0.052 J = 41 kJ
L2 aggregations	140k × 0.31 J = 43 kJ	140k × 0.29 J = 41 kJ
L3 reasoning	22k × 0.31 J = 7 kJ	22k × 5.9 J = 130 kJ
Total	949 kJ	237 kJ

~75% energy reduction at flat quality (judge-model evaluated). The savings come from the long L0/L1 tail, where pinning 70B was using a sledgehammer on a thumbtack.
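The totals can be reproduced directly from the table. The request counts and per-request energies below are copied from it; the variable and function names are just for this sketch.

```python
# (requests, pinned J/req, auto J/req) per tier, copied from the table above
TRAFFIC = {
    "L0": (2_100_000, 0.31, 0.012),
    "L1": (800_000, 0.31, 0.052),
    "L2": (140_000, 0.31, 0.29),
    "L3": (22_000, 0.31, 5.9),
}

def totals_kj(traffic):
    """Sum requests x joules-per-request per column and convert J to kJ."""
    pinned = sum(n * p for n, p, _ in traffic.values()) / 1000
    auto = sum(n * a for n, _, a in traffic.values()) / 1000
    return pinned, auto

pinned_kj, auto_kj = totals_kj(TRAFFIC)
reduction = 1 - auto_kj / pinned_kj
print(f"pinned: {pinned_kj:.0f} kJ, auto: {auto_kj:.0f} kJ, reduction: {reduction:.0%}")
# prints "pinned: 949 kJ, auto: 237 kJ, reduction: 75%"
```

Note that under "auto" the L3 row costs more than it did when pinned (130 kJ vs 7 kJ) and makes up over half of the auto total. The net saving comes entirely from the L0 and L1 rows.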

When NOT to use auto

Three scenarios where you should pin:

Brand voice. If your product's UX depends on the chosen model's response style (Claude-y, GPT-y, etc.), pin.
Determinism for evals. Pin the model when running benchmarks so comparisons stay apples-to-apples.
Specialty. Pin Qwen for Chinese, DeepSeek for math, FLUX-dev for image quality.

For everything else, "auto" is the default for a reason.

The override

If you disagree with the classifier on a specific call, force the tier:

# this call will be routed to L3-capable silicon regardless of classification
X-Force-Tier: L3

Every override is audit-logged. Useful for "this looks easy but I know it isn't" cases (intentionally adversarial prompts, very long contexts that the classifier underestimates).
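Attaching the header from client code might look like the sketch below. The endpoint URL and the JSON body shape are placeholders, not the documented API; only the X-Force-Tier header, the X-Routed-To header, the tier names, and model "auto" come from this post. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: substitute the real one from your account docs.
API_URL = "https://api.example.invalid/v1/completions"
VALID_TIERS = {"L0", "L1", "L2", "L3"}

def build_headers(force_tier=None):
    """Return request headers, adding X-Force-Tier only when asked."""
    headers = {"Content-Type": "application/json"}
    if force_tier is not None:
        if force_tier not in VALID_TIERS:
            raise ValueError(f"unknown tier: {force_tier!r}")
        headers["X-Force-Tier"] = force_tier
    return headers

def build_request(prompt, force_tier=None):
    """Build (but do not send) a POST request for model "auto"."""
    # The body shape is an assumption made for this sketch.
    body = json.dumps({"model": "auto", "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body, headers=build_headers(force_tier), method="POST"
    )

req = build_request("Looks short, but needs multi-step planning.", force_tier="L3")
# After sending with urllib.request.urlopen(req), the routing decision
# would be readable via response.headers.get("X-Routed-To").
```

Keeping the override as an optional argument makes it easy to apply to individual calls while everything else stays on the classifier's default.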

What's coming

Per-account tier calibration. Your traffic patterns are stable; the classifier can learn your account-specific threshold over a couple of weeks.
Speculative decoding at the gateway. A small drafter model proposes; the large model verifies. Cuts L3 cost ~2× on long generations.
Cache-aware routing. KV cache locality across requests in the same session, served from the same silicon when possible.

The model "auto" is doing more under the hood than the docs make obvious. We make it look simple because that's the point.