Backend Research — 2026
What model-call backend the claude (agent) invocable
should actually use. The invocable runs in the VM and communicates with
the host engine via stdin/stdout JSON-lines. The pilot's default
assumption is the Anthropic API at per-token rates. The question: can we
swap that for a cheaper path — subscription-backed CLIs, fixed-cost
unlimited buckets, free tiers, or local models — without touching the
invocable contract?
Research conducted April 2026. All dated claims are verified against official docs where possible; flagged where not.
The shape the substrate wants
The invocable protocol is already "spawn a subprocess, talk JSON-lines over pipes." This maps cleanly onto:
- Native CLIs with headless mode — `claude -p`, `codex exec --json`, `gemini -p --output-format stream-json`. Drop-in fit.
- OpenAI-compatible HTTP endpoints — any provider (Groq, Cerebras, OpenRouter, Ollama) behind a ~40-line stdin→HTTP→stdout wrapper (sketched below).
- Proxies — third-party bridges that wrap a subscription CLI as an OpenAI-compatible endpoint. Same wrapper.
Any of these is a valid invocable — the contract doesn't care. The question is cost, ToS, and reliability, not fit.
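A minimal sketch of that wrapper, in Python. The JSON field names (`messages` in, `content` out) and the env-var names are assumptions about the invocable contract, not spec:

```python
#!/usr/bin/env python3
"""stdin→HTTP→stdout invocable wrapper (sketch).

Assumes one JSON request per stdin line carrying a `messages` list,
and emits one JSON reply per stdout line. Field names are illustrative.
"""
import json
import os
import sys
import urllib.request

# Any OpenAI-compatible backend: Groq, Cerebras, OpenRouter, Ollama, a proxy...
BASE_URL = os.environ.get("BACKEND_BASE_URL", "http://localhost:11434/v1")
MODEL = os.environ.get("BACKEND_MODEL", "qwen3-coder")  # illustrative default
API_KEY = os.environ.get("BACKEND_API_KEY", "none")     # Ollama ignores it

def chat(messages: list) -> dict:
    """One non-streaming chat-completions round trip."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps({"model": MODEL, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for line in sys.stdin:  # JSON-lines in...
    if not line.strip():
        continue
    request = json.loads(line)
    reply = chat(request["messages"])
    print(  # ...JSON-lines out
        json.dumps({"role": "assistant",
                    "content": reply["choices"][0]["message"]["content"]}),
        flush=True,
    )
```

Swapping backends is then an env-var change, which is the whole point.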
Paid API (baseline)
Per-token billing against Anthropic or OpenAI API accounts. Works, predictable, expensive.
Reference rates (April 2026):
- Claude Opus 4.7: $5 / $25 per Mtok in/out
- Claude Sonnet 4.5 (Azure Foundry): ~$3 / $15 per Mtok
- GPT-5.3-Codex: ~$1.25 / $10 per Mtok
A typical agent turn (50k in / 3k out on Sonnet) costs ~$0.19 through Azure Foundry or Anthropic direct: 50k × $3/Mtok in + 3k × $15/Mtok out ≈ $0.15 + $0.045.
Subscription-backed CLIs
Claude Code under Anthropic Max — essentially closed
Claude Code's headless mode is excellent: `-p` for one-shot, `--output-format stream-json`, tool use with `--allowedTools`, stdin piping. Maps 1:1 onto our invocable pattern.
But: on April 4 2026, Anthropic explicitly prohibited third-party harnesses from consuming subscription quota. The Agent SDK requires `ANTHROPIC_API_KEY`; OAuth/subscription tokens do not authenticate against it. External automation now bills at full pay-as-you-go rates even under a Max plan.
Flag: exact policy date and wording from third-party summaries; direction (Anthropic tightening) well-attested, but verify before committing.
Verdict: dead path for our invocable. Don't route through Claude Code subscription.
Sources: headless docs, Anthropic pricing, Agent SDK auth issue #559, fusion94 writeup.
OpenAI Codex CLI under ChatGPT Pro — first-party blessed
`codex exec --json --output-last-message <path>` is the sanctioned non-interactive mode. Stdin accepts prompts, stdout emits JSONL (`thread.started`, `turn.started`, `item.*`, `turn.completed`). Tool use, sandbox, approval policy all work headless. This is the cleanest match for our invocable architecture.
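A sketch of driving that from the host side. The flags are the ones quoted above; the event shape (a `type` field on each JSONL line) is an assumption — check the emitted events before relying on it:

```python
import json
import subprocess

# One non-interactive Codex turn: prompt on stdin, JSONL events on stdout.
proc = subprocess.Popen(
    ["codex", "exec", "--json", "--output-last-message", "last.txt"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
proc.stdin.write("Refactor foo.py to remove the global state.\n")
proc.stdin.close()

for line in proc.stdout:
    event = json.loads(line)
    # Assumed shape: {"type": "thread.started" | "turn.started" | "item.*" | "turn.completed", ...}
    if event.get("type") == "turn.completed":
        break

proc.wait()
with open("last.txt") as f:  # final assistant message, per --output-last-message
    print(f.read())
```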
Auth. `codex` supports "Sign in with ChatGPT" via browser OAuth on Plus ($20) / Pro ($100 and $200) / Business / Edu / Enterprise. This consumes plan credits rather than API billing. OpenAI's own docs contrast the two modes and document scripted usage.
Limits (measured in messages/turns, not tokens):
| Plan | GPT-5.4 / 5h | GPT-5.3-Codex / 5h | GPT-5.4-mini / 5h |
|---|---|---|---|
| Plus ($20) | 20–100 | 30–150 | 60–350 |
| Pro ($100, 5x + 2x promo until May 31 2026) | 100–500 | 150–750 | 300–1,750 |
| Pro ($200, 20x) | 400–2,000 | 600–3,000 | 1,200–7,000 |
At cap: buy credits, drop to `-mini`, switch to an API key, or wait.
Effective cost. A turn is 20k–200k tokens including tool I/O. At Pro $200 with 600–3,000 GPT-5.3-Codex turns per 5h window and ~34 windows/month used, that's 20k–100k turns/month. $0.04–$0.20 per Mtok effective vs. $1.25/$10 API — ~10–100× cheaper for chunky agentic turns.
ToS. First-party `codex exec` under ChatGPT auth is explicitly sanctioned. Third-party proxies that re-export this as an endpoint are the arbitrage lane Anthropic shut down — higher ban risk. Stay inside the tool.
Sources: Codex README, non-interactive docs, auth docs, pricing & limits, April 9 2026 limit update.
GitHub Copilot via the official SDK
The Copilot SDK went public preview April 2 2026
(Python/TS/Go/.NET/Java). First-party, in-policy, speaks JSON-RPC to the `copilot` binary in server mode. Models available across tiers:
Claude Sonnet 4.5/4.6, Opus 4.6/4.7, Haiku 4.5,
GPT-5.2/5.3-Codex/5.4/5.4-mini, GPT-4.1, GPT-4o, Gemini.
Billing is per "premium request" (one turn), not per token. Multiplier by model: Sonnet 1x, Opus 4.7 7.5x (promo), o3-class 3–10x. GPT-5 mini, GPT-4.1, GPT-4o are 0x — nominally "included."
Plans: Pro $10 (300 premium/mo), Pro+ $39 (1500), Business $19/user (~300), Enterprise $39/user (1000). Overage $0.04/premium-request.
Cost differential. A Sonnet turn via Pro+ is ~$0.026 ($39 / 1,500 premium requests) vs. ~$0.19 through Azure Foundry — **~7× cheaper typical, 50× plausible for huge-context turns, 500× only on the "included" bucket edge cases.** The "50–500×" figure the user encountered is real but context-dependent, not a general claim.
Sources: Copilot CLI GA, Copilot SDK public preview, premium requests docs, plans page, supported models.
The 0x multiplier bucket — looks unlimited, isn't
GitHub's docs describe GPT-5 mini / GPT-4.1 / GPT-4o as 0x — "included" with no published cap. On paper this is the arbitrage: hammer a coding-capable model for $10/mo flat.
In practice, three gates pierce it:
- Premium counter still gates 0x. When paid premium quota is exhausted, selecting a 0x model returns "you have exceeded your premium request allowance. We have automatically switched you to GPT-4.1" — despite the 0x multiplier. Confirmed: community #170137, #163550, VS Code bug microsoft/vscode#257125.
- Undisclosed weekly global rate limits. `user_weekly_rate_limited` errors lock out all models including 0x, triggered by upstream provider capacity rather than quota. Reported lockout durations: 5h 52m, 67h, 463h (~19 days) — on accounts with 500+ premium requests still available. Discussions #192485, #192419, #192927. Documented in The Register's April 15 2026 coverage of GitHub's March rate-limit tightening.
- Abuse detection bans sustained scripted use. GitHub's AUP prohibits "excessive automated bulk activity." Real warning emails, feature disablement, and outright bans reported for patterns as light as long agent to-do lists (discussion #186764). No published numeric threshold. Proxies (`ericc-ch/copilot-api`) make this worse — a non-IDE user-agent and long sessions correlate with flagging.
The SDK bills identically to the IDE. No hidden cheaper lane. Copilot SDK changelog confirms each prompt counts against premium quota and global rate limits apply equally.
Verdict: 0x arbitrage is real on paper, unreliable in practice. Not a foundation for a programmatic agent loop.
Sources: docs: requests, community #170137, community #192485, community #186764, The Register, 2026-04-15.
The one real fixed-cost-unlimited bucket
Cursor Pro $20 / Ultra $200 — Auto mode. Auto is the only bucket on the market today that is genuinely not credit-metered. Cursor picks the model (Composer 1.5, GPT and Claude variants in rotation), and the pool is bounded only by soft per-minute burst limits, not message or token caps.
Programmatic access. ToS-gray but working. Several actively maintained proxies expose Cursor's gRPC/Connect protocol as a local OpenAI-compatible endpoint, reusing the IDE's OAuth token:
- `Nomadcxx/opencode-cursor`
- `ephraimduncan/opencode-cursor`
- `anyrobert/cursor-api-proxy` (tested against Cursor agent 2026.02.27)
- `R44VC0RP/cursor-opencode-auth`
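Once one of these proxies is running, the invocable's HTTP wrapper points at it unchanged. A sketch, assuming the proxy serves an OpenAI-compatible endpoint on a local port — the port and model name are illustrative, not taken from any proxy's README:

```python
from openai import OpenAI  # pip install openai

# The proxy reuses the Cursor IDE's OAuth token; no real key goes here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="auto",  # Cursor's Auto mode — Cursor picks the underlying model
    messages=[{"role": "user", "content": "Summarize this diff."}],
)
print(resp.choices[0].message.content)
```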
Honorable mentions (fixed-cost, less useful for us):
- Windsurf Pro $15 — SWE-1.5/SWE-grep at 0 credits. Genuinely unmetered but no programmatic API.
- JetBrains AI Ultimate — unlimited only for local Ollama/LM Studio, not cloud.
- Google AI Ultra — generous Flash, Pro still daily-capped.
Sources: Cursor rate limits, Cursor models & pricing, Cursor agent usage blog, Windsurf plans.
Free tiers — what's actually usable
Gemini CLI under personal Google OAuth — the standout free path
Google's official `gemini` CLI authenticates against a personal Google account. Free-tier quota: 60 RPM, 1,000 RPD on Gemini 2.5 Flash. Headless mode (`gemini -p --output-format stream-json`) is first-party blessed for scripts/CI and JSON-lines native — it drops directly into our invocable pattern.
Flag: Gemini 2.5 Pro free-tier access via CLI was effectively pulled April 2026; treat Pro as paid. Flash quotas current as of April 2026.
Sources: Gemini CLI quotas, headless mode, Code Assist quotas.
Inference-as-a-service free tiers (OpenAI-compatible HTTP)
| Provider | Free ceiling | Models |
|---|---|---|
| Cerebras | 1M tokens/day | Llama 3.3 70B, Qwen3 32B/235B, GPT-OSS 120B |
| Groq | 14,400 RPD (Llama 3.1 8B Instant) / 1,000 RPD on larger | Llama 3.3 70B, Llama 4 Scout, DeepSeek R1 Distill 70B, Qwen QwQ, Mistral Saba |
| SambaNova | 10–30 RPM persistent | Llama 3.1 405B, Llama 3.3 70B, Qwen 2.5 72B |
| Mistral "Experiment" | 1 RPS, 1B tokens/month | open-mistral, mixtral variants |
| OpenRouter `:free` | 20 RPM, 200 RPD (combined) | Qwen3 Coder 480B, DeepSeek R1, Llama 3.3 70B, GPT-OSS 120B, Devstral 2 |
All are OpenAI-compatible HTTP. Cerebras and Groq are the two with headroom for a real agent loop; OpenRouter is more variety than throughput.
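Switching among them is a base-URL and key swap in the same thin wrapper. A sketch — the base URLs are as commonly documented at time of writing, and the model ID is illustrative; verify both against each provider's current docs:

```python
import os

from openai import OpenAI

# Base URLs as commonly documented; verify before depending on them.
PROVIDERS = {
    "cerebras":   ("https://api.cerebras.ai/v1",     "CEREBRAS_API_KEY"),
    "groq":       ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "openrouter": ("https://openrouter.ai/api/v1",   "OPENROUTER_API_KEY"),
}

def client_for(name: str) -> OpenAI:
    base_url, key_env = PROVIDERS[name]
    return OpenAI(base_url=base_url, api_key=os.environ[key_env])

resp = client_for("cerebras").chat.completions.create(
    model="llama-3.3-70b",  # model IDs churn quarterly; check the catalog
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```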
Flag: Cerebras scheduled Llama 3.3 70B and Qwen3 32B deprecation for Feb 16 2026 — verify catalog before depending.
Sources: Cerebras models, Groq rate limits, SambaNova limits, Mistral tiers, OpenRouter free models.
GitHub Copilot Free — not usable
50 premium requests/month + 2,000 completions/month. Copilot SDK public preview now reaches Free accounts, but 50 turns/month burns in one agent session. Not a foundation.
Local models (zero marginal)
Ollama exposes an OpenAI-compatible `/v1/chat/completions` endpoint plus a native `/api/chat` with tool calling. On an M3/M4 Max with 64–128GB, the plausibly-good-enough options in April 2026:
- Qwen3-Coder-Next (80B MoE, ~3B active, 256k ctx) — Sonnet-4.5-level coding on 64GB. Best tool-use-ready local option.
- GLM-4.7 (355B / 32B active) — needs 128GB+.
- Kimi K2.5 (1T / 32B active) — runs on 128GB M4 Max at ~1.7 tok/s; too slow for interactive loops.
- DeepSeek-V3.1-Nex-N1 — strong tool use, tight on 64GB.
Zero quota anxiety, offline, no ToS surface. A fallback that's always available.
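Wiring it in is the same one-line base-URL change as any other OpenAI-compatible backend — a sketch against Ollama's documented `/v1` endpoint (the model tag is whatever `ollama list` shows; `qwen3-coder-next` here is illustrative):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost:11434 by default;
# the api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen3-coder-next",  # illustrative tag; use what `ollama list` shows
    messages=[{"role": "user", "content": "Write a one-liner that counts TODOs."}],
)
print(resp.choices[0].message.content)
```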
Proxies and routers
If the invocable speaks OpenAI chat-completions and the backend is swappable, two layers matter:
CLIProxyAPI (router-for-me/CLIProxyAPI)
— wraps Gemini CLI, Antigravity, ChatGPT Codex, and Claude Code behind
one OpenAI/Gemini/Claude/Codex-compatible endpoint, reusing each tool's
OAuth. Most comprehensive. 26.8k stars, actively maintained (v6.9.28 on
2026-04-17). This is the one to watch.
LiteLLM (BerriAI/litellm)
— the router in front of everything. 43.7k stars, normalizes tool-call
schemas, streams reasoning blocks correctly, supports fallback chains.
Security advisory March 2026 — pin versions.
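A sketch of the fallback-chain part, using LiteLLM's Router as documented at time of writing — the provider prefixes and model names are illustrative, and the config schema should be verified against current LiteLLM docs before pinning a version:

```python
from litellm import Router  # pip install litellm (pinned, per the advisory)

router = Router(
    model_list=[
        # Alias → concrete provider/model; names here are illustrative.
        {"model_name": "agent", "litellm_params": {"model": "gemini/gemini-2.5-flash"}},
        {"model_name": "agent-bulk", "litellm_params": {"model": "cerebras/llama-3.3-70b"}},
    ],
    fallbacks=[{"agent": ["agent-bulk"]}],  # on failure, retry on the bulk lane
)

resp = router.completion(
    model="agent",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```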
What breaks across all proxies:
- Auth refresh on OAuth-based wraps (most common failure mode)
- Rate limit bleed-through — upstream quota exhaustion surfaces as odd errors
- Reasoning/thinking blocks round-trip poorly
- Parallel tool calls and Anthropic↔︎OpenAI tool schema translation are the usual fidelity gaps
Stale/dead: reverse-engineered ChatGPT-UI proxies
(acheong08/ChatGPT, PawanOsman/ChatGPT) die to
Cloudflare/Turnstile updates. Don't build on them.
Recommended pilot shape
Keep the invocable contract model-agnostic (accepts session/context/prompt, emits tool-calls), and treat the backend as a ranked fallback chain:
- Default free — Gemini CLI under personal OAuth. 1,000 RPD Flash, first-party blessed, native stdin/stdout JSON. Zero cost, zero ToS risk. Start here.
- Free bulk — Cerebras via a thin HTTP wrapper. 1M tok/day when the pilot wants volume.
- First-party paid — Codex CLI under ChatGPT Pro. When you want OpenAI-grade reasoning. `codex exec --json` matches the invocable pattern exactly. Within ToS.
- Fixed-cost "unlimited" lane — Cursor Pro $20 via an `opencode-cursor` proxy. ToS-gray, accept the risk. The one real hammer-it-flat-fee bucket.
- Always-on fallback — Local Ollama + Qwen3-Coder-Next. Offline, no quota.
- Escape hatch — Raw Anthropic / OpenAI API. When nothing else works and correctness matters.
All six fit the same invocable contract. The backend becomes a config choice, not an architectural coupling.
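As a concrete shape for that config choice — a sketch whose schema is invented for illustration (the six entries mirror the chain above; ports and URLs are the same assumptions as earlier):

```python
# Invented schema: each entry is either a headless CLI or an HTTP backend.
BACKEND_CHAIN = [
    {"name": "gemini-cli", "kind": "cli",  "cmd": ["gemini", "-p", "--output-format", "stream-json"]},
    {"name": "cerebras",   "kind": "http", "base_url": "https://api.cerebras.ai/v1"},
    {"name": "codex",      "kind": "cli",  "cmd": ["codex", "exec", "--json"]},
    {"name": "cursor",     "kind": "http", "base_url": "http://localhost:8000/v1"},   # opencode-cursor proxy
    {"name": "ollama",     "kind": "http", "base_url": "http://localhost:11434/v1"},
    {"name": "api",        "kind": "http", "base_url": "https://api.openai.com/v1"},  # escape hatch
]

def pick_backend(is_healthy) -> dict:
    """Walk the ranked chain; return the first backend whose probe passes."""
    for backend in BACKEND_CHAIN:
        if is_healthy(backend):  # quota check, auth check, liveness ping...
            return backend
    raise RuntimeError("no backend available")
```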
Uncertainty register
- Anthropic's April 4 2026 third-party-harness ban: direction attested across multiple sources, exact date and wording from third-party summaries.
- Copilot Business premium-request allowance: dropped from public plans page; current number is contact-sales.
- Copilot SDK per-turn metering: documented as identical to IDE, not independently verified for SDK-specific discounts.
- Codex Pro $100 "2x bonus until May 31 2026": promo, verify before committing.
- Exact 2026 model naming (GPT-5.4, 5.3-Codex, -Spark, Opus 4.7) accurate at time of writing; churns quarterly.
- Cerebras model deprecation schedule (Feb 16 2026 on Llama 3.3 70B and Qwen3 32B): verify catalog.
- Cost ratios (10–100× Codex vs. API, 7–50× Copilot vs. Azure) are order-of-magnitude, not quotes.