tk_

Model System

Toolkit flagships. tk_ breadth. One router.

Toolkit is a routed AI system: Toolkode-owned product lanes for the flagship experience, plus supplemental tk_ routes that expose named models through one API key, one wallet, and token-level pricing.

Model Ladder

| Model | Params | Role | Input / 1M | Cached / 1M | Output / 1M |
|---|---|---|---|---|---|
| toolkit-chat-turbo | 4B dense | Fast chat + live data (our daily-retrained model) | $0.15 | $0.02 | $1.00 |
| toolkit-chat | 9B dense | Premium chat + reasoning | $0.30 | $0.03 | $1.20 |
| toolkit-code-turbo | 9B dense | Fast code generation | $0.20 | $0.02 | $1.00 |
| toolkit-voice | voice system | Real-time voice | $0.10/min | n/a | $0.10/min |
| toolkit-cam | 9B multimodal | Camera / vision | $0.10/min | n/a | $0.10/min |
| toolkit-base | hosted model | Business analysis + long documents (200K) | $0.40 | $0.04 | $1.60 |
| toolkit-code | hosted model | Multi-file code + architecture (131K) | $1.20 | $0.12 | $4.80 |
| toolkit-code-backend | hosted model | Agent workflows + repo-aware coding (65K) | $1.00 | $0.10 | $4.00 |
| toolkit-think | hosted model | Deep reasoning and research (65K) | $0.80 | $0.08 | $6.00 |
| toolkit-vision | vision model | Image + document understanding (128K) | $0.50 | $0.05 | $2.50 |

Prices are per 1M tokens unless noted. Cached input (repeat context such as system prompts and multi-turn history) is billed at a 90% discount off the input rate. Voice and camera are billed per minute.
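To make the cached discount concrete, here is a small sketch with hypothetical traffic numbers; the rates come from the toolkit-chat row above.

```python
# Cost arithmetic for toolkit-chat: $0.30 input, $0.03 cached input, $1.20 output per 1M tokens.
RATES = {"input": 0.30, "cached": 0.03, "output": 1.20}

def cost_usd(fresh_in: int, cached_in: int, out: int) -> float:
    """Cached tokens bill at the cached rate (90% off input); the rest at full rate."""
    return (fresh_in * RATES["input"]
            + cached_in * RATES["cached"]
            + out * RATES["output"]) / 1_000_000

# A multi-turn session: 50K fresh input, 400K replayed history (cached), 20K output.
print(cost_usd(50_000, 400_000, 20_000))
```

In this example the 400K replayed tokens cost $0.012 instead of the $0.12 they would at the full input rate.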

Domestic Distilled tk_ Routes

These are compact, flash, and research-agent routes from U.S.-headquartered model labs. They broaden domestic choice without claiming Toolkit ownership. Cards show advertised token rates or provider task estimates first.

tk_gpt54_mini

GPT-5.4 mini

Lane: Domestic compact
Params: not disclosed
Active: not disclosed
Context: 400K
Max out: 128K
Status: stable API
I/O: text + image in, text out
Input / 1M: $0.75
Output / 1M: $4.50
Cached input read: $0.075

Compact OpenAI route for coding, computer use, subagents, and high-volume reasoning.

  • Supports reasoning, function calling, web search, file search, computer use, and skills.
  • Regional/data-residency processing carries a 10% provider uplift before Toolkode fee.
Tags: domestic · compact · coding

tk_gpt54_nano

GPT-5.4 nano

Lane: Domestic nano
Params: not disclosed
Active: not disclosed
Context: 400K
Max out: 128K
Status: stable API
I/O: text + image in, text out
Input / 1M: $0.20
Output / 1M: $1.25
Cached input read: $0.02

Lowest-cost OpenAI 5.4-class route for classification, extraction, ranking, and subagents.

  • Supports reasoning and common tool workflows; no computer-use support on this nano tier.
  • Best for support agents and classifiers, not full deep research or large autonomous coding.
Tags: domestic · nano · high volume

tk_gemini3_flash

Gemini 3 Flash

Lane: Domestic flash
Params: not disclosed
Active: not disclosed
Context: 1,048,576
Max out: 65,536
Status: preview
I/O: text, image, video, audio, PDF in; text out
Input / 1M: $0.50
Output / 1M: $3.00
Cached input read: $0.05

Fast multimodal route for search-grounded work, video/image input, and general agent tasks.

  • Supports thinking, code execution, computer use, file search, URL context, search grounding, Maps grounding, function calling, structured outputs, Batch, Flex, Priority, and caching.
  • Preview model: rate limits, availability, and behavior may change before stable release.
Tags: domestic · flash · multimodal

tk_gemini31_flash_lite

Gemini 3.1 Flash-Lite

Lane: Domestic lite
Params: not disclosed
Active: not disclosed
Context: 1,048,576
Max out: 65,536
Status: preview
I/O: text, image, video, audio, PDF in; text out
Input / 1M: $0.25
Output / 1M: $1.50
Cached input read: $0.025

Cost-efficient Gemini route for high-volume agentic tasks, translation, and simple data processing.

  • Supports thinking, code execution, file search, URL context, search grounding, Maps grounding, function calling, structured outputs, Batch, Flex, Priority, and caching.
  • Does not support computer use or image generation; this is the high-volume lite route.
Tags: domestic · lite · high volume

tk_claude45_haiku

Claude Haiku 4.5

Lane: Domestic haiku
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: model-dependent
Status: stable API
I/O: text + image in, text out
Input / 1M: $1.00
Output / 1M: $5.00
Cached input read: $0.10

Fast Anthropic route for coding, computer use, chat, and parallel subagent workflows.

  • Prompt caching: $1.25/MTok 5-minute write, $2/MTok 1-hour write, $0.10/MTok read.
  • US-only inference is available from Anthropic at a 1.1x provider uplift before Toolkode fee.
Tags: domestic · fast · agents

tk_gemini_deep_research

Gemini Deep Research

Lane: Domestic research agent
Params: agent workflow
Active: underlying Gemini + tools
Context: workflow-managed
Max out: cited report
Status: preview / Interactions API only
I/O: text + documents + tools
Typical task: $1-$3
Max task est.: $3-$7

Autonomous research agent that plans, searches, reads, reasons, and returns cited reports.

  • Not a normal chat-completions model: runs through the Gemini Interactions API with background execution.
  • Default tools include Google Search, URL Context, and Code Execution; MCP and File Search can be attached.
  • Google estimates moderate Deep Research at $1-$3/task and Deep Research Max at $3-$7/task; actual debit follows underlying tokens and tools.
Tags: domestic · deep research · agent

Settlement note: advertised provider rates shown. Toolkode adds a 5.5% platform fee at settlement.
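The settlement order implied by the card notes (any provider uplift applies first, then the platform fee) can be sketched as:

```python
PLATFORM_FEE = 0.055  # Toolkode platform fee applied at settlement

def settled_rate(advertised: float, provider_uplift: float = 1.0) -> float:
    """Apply the provider uplift first (e.g. 1.10 for regional/data-residency
    processing), then the 5.5% Toolkode fee, per the card notes above."""
    return advertised * provider_uplift * (1 + PLATFORM_FEE)

# tk_gpt54_mini input rate with the 10% data-residency uplift applied:
print(settled_rate(0.75, provider_uplift=1.10))
```

This is a sketch of the described fee composition, not the billing engine itself; per-call rounding is not specified in this document.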

Supplemental tk_ Model Cards

tk_ models represent themselves: model family, version, architecture, context, and active parameters where known. Cards show the advertised provider token rates first.

tk_glm47_flash

GLM-4.7-Flash

Lane: Fast base
Params: not disclosed
Active: not disclosed
Context: 131,072 on CF / 200K native
Max out: 128K native
Input / 1M: $0.06
Output / 1M: $0.40

Fast general chat, coding, and long-document work.

  • Cloudflare Workers AI currently exposes 131,072 tokens; Z.AI native docs list 200K context and 128K max output.
  • Displayed token price is the Cloudflare Workers AI route price, not Z.AI's native free promotional route.
Tags: fast · base · long context

tk_glm_turbo

GLM-5-Turbo

Lane: Fast GLM
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: 128K
Input / 1M: $1.20
Output / 1M: $4.00

Low-latency GLM route for OpenClaw-style agent, coding, and tool workflows.

  • Z.AI docs list this as GLM-5-Turbo with 200K context, 128K max output, context caching, function calling, structured output, and thinking mode.
Tags: fast · coding · tools

tk_glm51

GLM-5.1

Lane: Reasoning
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: 128K
Input / 1M: $1.40
Output / 1M: $4.40

Reasoning, coding, multilingual work, and long-form synthesis.

  • Z.AI docs list GLM-5.1 at 200K context with English/Chinese support and coding-focused long-horizon task performance.
Tags: reasoning · coding · multilingual

tk_glm5

GLM-5

Lane: Reasoning
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: 128K
Input / 1M: $1.00
Output / 1M: $3.20

General reasoning and tool workflows.

  • Z.AI docs list GLM-5 at 200K context for programming, agentic long-term planning, backend refactoring, and debugging.
Tags: reasoning · tools · general

tk_kimi_k25

Kimi K2.5

Lane: Agentic compute
Params: 1T MoE
Active: not disclosed
Context: 256K
Input / 1M: $0.60
Output / 1M: $3.00
Cached input read: $0.10

Long-horizon coding and agent workflow execution.

  • Moonshot/Cloudflare docs list Kimi K2.5 at 256K context with vision inputs, reasoning, function calling, structured outputs, batch, and cached input pricing.
Tags: agents · coding · long context

tk_kimi_k26

Kimi K2.6

Lane: Agentic compute
Params: 1T MoE
Active: 32B active
Context: 262K
Input / 1M: $0.74
Output / 1M: $3.49

Long-horizon coding, design, and multi-step agent workflows.

  • Public third-party route listings describe K2.6 as 1T MoE, 32B active, 262K context, and a code-preview successor to K2.5; verify direct Moonshot billing before GA.
Tags: MoE · agents · long context

tk_mimo_v2_pro

MiMo V2 Pro

Lane: General work
Params: 1T+ MoE
Active: 42B active
Context: 1,048,576
Max out: 131,072
Input / 1M: $1.00+
Output / 1M: $3.00+

High-quality general work and production assistance.

  • Public MiMo V2 Pro listings show 1M context, 131K max output, over 1T total parameters, 42B active, and tiered long-context pricing.
Tags: general · production · reasoning

tk_mimo_v2_omni

MiMo V2 Omni

Lane: Multimodal
Params: not disclosed
Active: not disclosed
Context: verify before GA
Input / 1M: verify
Output / 1M: verify

Omni route for multimodal and general assistant workflows.

  • Public pricing/context evidence for MiMo V2 Omni was not strong enough for a flat public claim; keep gated until provider docs are verified.
Tags: multimodal · omni · assistant

tk_mimo_v25_pro

MiMo V2.5 Pro

Lane: Reasoning
Params: 1T MoE
Active: 42B active
Context: 1,048,576
Max out: 131,072
Input / 1M: $1.00
Output / 1M: $3.00

Stronger MiMo route for reasoning and production work.

  • OpenRouter/independent listings show MiMo V2.5 Pro at 1M context, 131K max output, $1/M input and $3/M output; Xiaomi direct docs should be rechecked before GA.
Tags: reasoning · production · general

tk_mimo_v25

MiMo V2.5

Lane: General work
Params: not disclosed
Active: not disclosed
Context: 1,048,576
Input / 1M: $0.40
Output / 1M: $2.00

Broad general-use MiMo route.

  • OpenRouter/independent listings show MiMo V2.5 at 1M context, $0.40/M input and $2/M output; direct Xiaomi docs should be rechecked before GA.
Tags: general · fast · assistant

tk_qwen35_plus

Qwen3.5 Plus

Lane: General reasoning
Params: not disclosed
Active: not disclosed
Context: 1,000,000
Max out: 65,536
Input / 1M: $0.115+
Output / 1M: $0.688+

General chat, reasoning, and coding.

  • Alibaba Model Studio lists Qwen3.5 Plus with 1M context and 65,536 max output; billing is tiered by prompt size, so public cards must not show a single flat price for all context lengths.
Tags: reasoning · coding · general

tk_qwen36_plus

Qwen3.6 Plus

Lane: General reasoning
Params: not disclosed
Active: not disclosed
Context: 1,000,000
Max out: 65,536
Input / 1M: $0.325
Output / 1M: $1.95

General chat, reasoning, coding, and production assistance.

  • Public model index listings show Qwen3.6 Plus with 1M context and 66K max output; verify final Alibaba billing table before GA because official pricing docs are easier to read through Qwen Plus routes.
Tags: reasoning · coding · general

tk_minimax_m27

MiniMax M2.7

Lane: Document agents
Params: not disclosed
Active: not disclosed
Context: 196K
Input / 1M: $0.30
Output / 1M: $1.20

Agentic productivity, document workflows, and structured tasks.

  • OpenRouter lists MiniMax M2.7 at 196,608 context, $0.30/M input and $1.20/M output.
Tags: agents · documents · structured

tk_minimax_m25

MiniMax M2.5

Lane: Document agents
Params: not disclosed
Active: not disclosed
Context: 196K
Input / 1M: $0.118
Output / 1M: $0.99

Broad productivity and agent tasks.

  • OpenRouter lists MiniMax M2.5 at 196,608 context, $0.118/M input and $0.99/M output.
Tags: agents · documents · productivity

tk_deepseek_v4_pro

DeepSeek V4 Pro

Lane: Reasoning/code
Params: 1.6T
Active: not disclosed
Context: 1,000,000
Input / 1M: verify
Output / 1M: $3.48

Reasoning, coding, and agentic workflows.

  • Public reporting describes DeepSeek V4 Pro as a 1.6T model with a 1M context window and $3.48/M output; direct API input billing must be verified before GA.
Tags: reasoning · coding · agents

tk_deepseek_v4_flash

DeepSeek V4 Flash

Lane: Fast reasoning
Params: 284B
Active: not disclosed
Context: 1,000,000
Input / 1M: verify
Output / 1M: $0.28

Lower-latency reasoning and coding.

  • Public reporting describes DeepSeek V4 Flash as a smaller V4 variant with 1M context and $0.28/M output; direct API input billing must be verified before GA.
Tags: fast · reasoning · coding

tk_groq_compound_mini

Groq Compound Mini

Lane: Fast tools
Params: 120B / 70B routed
Active: not disclosed
Context: 131K
Max out: 8,192
Input / 1M: $0.59
Output / 1M: $0.79

Current-data, tool-capable, and code-execution workflows.

  • Groq lists 131,072 context and 8,192 max output. Final cost depends on whether GPT-OSS-120B, Llama 3.3 70B, and built-in tools are used.
Tags: fast inference · tools · current data

tk_cerebras_gptoss_120b_moe

Cerebras GPT-OSS 120B Fast

Lane: Fast 120B MoE
Params: 117B MoE
Active: 5.1B active
Context: 131K
Max out: 131K
Input / 1M: $0.25
Output / 1M: $0.69

High-throughput code, math, and agentic reasoning.

  • OpenAI lists gpt-oss-120b at 117B total, 5.1B active, 131,072 context, 131,072 max output, text-only input/output, and configurable reasoning effort.
Tags: MoE · 120B · fast inference

Settlement note: advertised provider rates shown. Toolkode adds a 5.5% platform fee at settlement.

Cheap Model First, Big Brain If Needed

Toolkit routes requests based on task type and difficulty, not just model size. The router attempts the lowest-cost capable model first, then escalates if necessary.

| Task Type | Default Model |
|---|---|
| Simple chat, summaries, extraction | toolkit-chat-turbo |
| Premium chat, planning, business | toolkit-chat |
| Fast code (functions, SQL, endpoints) | toolkit-code-turbo |
| Product logic, general reasoning | toolkit-base |
| Backend architecture, APIs, systems | toolkit-code-backend |
| Large codebase edits, refactors | toolkit-code |
| Deep reasoning, research | toolkit-think |
| Voice | toolkit-voice |
| Camera / screenshots | toolkit-cam |
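A minimal sketch of cheap-first routing with escalation. The lane groupings and the integer difficulty score are illustrative assumptions; the real router's classifier is not described in this document.

```python
# Hypothetical escalation ladders per task lane (cheapest model first).
LADDER = {
    "chat":  ["toolkit-chat-turbo", "toolkit-chat"],
    "code":  ["toolkit-code-turbo", "toolkit-code-backend", "toolkit-code"],
    "think": ["toolkit-base", "toolkit-think"],
}

def route(task_type: str, difficulty: int) -> str:
    """Pick the lowest-cost capable model; higher difficulty climbs the lane."""
    lane = LADDER[task_type]
    return lane[min(difficulty, len(lane) - 1)]

print(route("code", 0))  # easy tasks stay on the turbo model
print(route("code", 9))  # hard tasks escalate to the top of the lane
```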

Training Approach

All Toolkit-trained models use LoRA fine-tuning rather than full fine-tuning. This allows faster retraining cycles, lower GPU cost, per-role specialization, and shared base models with different behavioral adapters.

Inference: AWQ 4-bit quantization
Training: bf16
Infrastructure: Rented GPUs, Cloudflare R2 for data
No user conversations used for model training.
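A back-of-envelope sketch of why LoRA retraining is cheap: instead of updating a full weight matrix, LoRA trains two low-rank factors and serves their product added to the frozen base weights. The 4096 dimension below is an illustrative assumption; r=32 matches the model cards.

```python
# LoRA replaces a full d_out x d_in update with B (d_out x r) and A (r x d_in),
# so only the two small factors are trained.
def lora_fraction(d_out: int, d_in: int, r: int) -> float:
    full = d_out * d_in
    adapter = d_out * r + r * d_in
    return adapter / full

# A hypothetical 4096x4096 projection with the r=32 used by the Toolkit cards:
print(lora_fraction(4096, 4096, 32))  # ~1.6% of the weights are trained
```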

Model Cards

toolkit-chat-turbo (4B dense)
Base: Qwen3-4B-Instruct + our LoRA
Training: LoRA r=32, bf16, ~60K samples daily
Retrains: Daily (4am + 7pm UTC)

Real-time conversational model trained on fresh information — the only model we retrain on our own infra.

Conversational responses · Current events · Local information · Fast inference · Persona awareness

Data: News, sports, markets, weather, jobs, dining, travel

Freshness weighting: today = 100%, this week = 50%, >30 days = dropped. This model reads the news every day.
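The stated weighting can be sketched as a sampling function. Only three tiers are specified (today 100%, this week 50%, older than 30 days dropped); the 0.25 weight for the 8-30 day band is an illustrative assumption, not from this document.

```python
from datetime import date, timedelta

def freshness_weight(sample_date: date, today: date) -> float:
    """Sampling weight per the stated policy: today = 1.0, within the last
    week = 0.5, older than 30 days = dropped. The 8-30 day band is not
    specified in the doc; 0.25 here is an assumed mid-band value."""
    age = (today - sample_date).days
    if age <= 0:
        return 1.0
    if age <= 7:
        return 0.5
    if age <= 30:
        return 0.25  # assumption: unspecified mid band
    return 0.0

today = date(2025, 6, 30)
print(freshness_weight(today, today))                       # 1.0
print(freshness_weight(today - timedelta(days=3), today))   # 0.5
print(freshness_weight(today - timedelta(days=45), today))  # 0.0 (dropped)
```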

toolkit-chat (9B dense)
Base: Qwen3.5-9B + our LoRA (our pod)
Training: LoRA r=32, bf16, full dataset
Retrains: Weekly

Premium conversational model with stronger reasoning. Our fine-tuned 9B with tk_ personality + domain tuning.

Tool calling · Structured outputs · Multi-step reasoning · Business conversations · Planning

Served from our pod. Daily briefing context injected as system prompt.

toolkit-code-turbo (9B dense)
Base: Qwen3.5-9B + our code LoRA (our pod)
Training: LoRA r=32, bf16, curated code dataset
Retrains: Weekly

Fast coding model for autocomplete, small refactors, unit tests.

Fast code generation · SQL · APIs · Functions · IDE integration

toolkit-voice (voice system)
Base: Toolkit voice stack
Training: Private voice tuning
Retrains: Daily voice updates

Full-duplex voice conversation for real-time assistant workflows.

Full-duplex voice · Natural conversation · Speaker profiles · Real-time interaction · Sub-second knowledge lookup

Data: ~13K voice conversation pairs + live briefing context

toolkit-cam (9B multimodal)
Base: MiniCPM-o 4.5
Training: LoRA on .llm backbone
Retrains: Not yet live

Phone camera and screenshot understanding.

Phone camera input · Screenshot understanding · Visual reasoning · Multimodal knowledge

Text-only v1 trained, vision-grounded v2 planned.

toolkit-base (hosted model)
Base: Toolkit hosted base rail
Training: Hosted — no weight updates by us
Retrains: Daily briefing via system prompt

General workhorse for business analysis and long-document tasks (200K runtime context).

200K context · Tool orchestration · Business logic · Structured reasoning

Data: Our daily briefing: news, markets, weather, regional data — injected as system prompt

toolkit-code-backend (hosted model)
Base: Toolkit backend code rail
Training: Hosted — no weight updates by us
Retrains: Not yet retrained

Agent workflows + repo-aware coding with cached-prompt reuse.

65K repo context · Session-affinity cache · Multi-turn agent workflows · Tool calling · System design

toolkit-code (hosted model)
Base: Toolkit hosted code rail
Training: LoRA r=32, bf16 on curated code dataset
Retrains: Biweekly

Multi-file code + architecture with our domain tuning.

Multi-file refactors · Repo-wide changes · Agentic coding · Tool calling · 131K context · Debugging across files

toolkit-think / toolkit-vision (hosted models)
Base: Toolkit reasoning and vision rails
Training: Hosted — no weight updates by us
Retrains: Daily briefing via system prompt

Deep reasoning and multimodal understanding through separate routed rails.

Deep multi-step reasoning · Research · Image + document OCR · 128K vision context · Chart comprehension

Think mode and vision mode are routed separately; public card values follow runtime limits.

Pricing

All token pricing includes the smart router at no extra cost. Cached input tokens (repeated system prompts, multi-turn conversation history) are billed at 90% off the input price.

| Model | Input | Cached Input | Output |
|---|---|---|---|
| toolkit-chat-turbo | $0.15 | $0.02 | $1.00 |
| toolkit-chat | $0.30 | $0.03 | $1.20 |
| toolkit-code-turbo | $0.20 | $0.02 | $1.00 |
| toolkit-voice | $0.10/min | n/a | $0.10/min |
| toolkit-cam | $0.10/min | n/a | $0.10/min |
| toolkit-base | $0.40 | $0.04 | $1.60 |
| toolkit-code | $1.20 | $0.12 | $4.80 |
| toolkit-code-backend | $1.00 | $0.10 | $4.00 |
| toolkit-think | $0.80 | $0.08 | $6.00 |
| toolkit-vision | $0.50 | $0.05 | $2.50 |
Images: $0.03 per image (standard)
Web Search: $0.005 per search call
Code Exec: $0.005 per execution
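Token rates and the flat tool fees above compose per request. A sketch of a per-request estimator, using the toolkit-chat row and the fixed fees listed here (traffic numbers are hypothetical):

```python
TOKEN_RATES = {"toolkit-chat": (0.30, 0.03, 1.20)}  # input, cached, output per 1M tokens
FIXED_FEES = {"image": 0.03, "web_search": 0.005, "code_exec": 0.005}

def estimate(model, tokens_in, tokens_cached, tokens_out, **tool_calls):
    """Estimate one request's cost: token charges plus flat per-call tool fees."""
    i, c, o = TOKEN_RATES[model]
    cost = (tokens_in * i + tokens_cached * c + tokens_out * o) / 1_000_000
    cost += sum(FIXED_FEES[t] * n for t, n in tool_calls.items())
    return cost

# A chat turn with one web search: 8K input, 40K cached, 2K output.
print(estimate("toolkit-chat", 8_000, 40_000, 2_000, web_search=1))
```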

Training Cadence

| Cadence | Models |
|---|---|
| Daily | chat (4B, our LoRA, 4am + 7pm UTC) |
| Daily briefing | chat-pro, base, code-fast, code, think, vision — hosted models, system-prompt refresh |
| Daily voice updates | voice quality and safety refresh |
| Not yet retrained | cam, code-backend — hosted on frontier models |

Key Design Principles

1. Role specialization: Each model has a specific job instead of trying to be a single general model.

2. Routed system architecture: Requests are routed to the lowest-cost capable model and escalated only when needed.

3. LoRA fine-tuning: Enables fast iteration and specialization without full retraining.

4. Freshness training for chat: The chat model is retrained daily on fresh information sources.

5. Separation of storage and training: Conversation memory is stored for context and user experience, not used as training data.

Toolkit is a routed AI system composed of flagship Toolkode product lanes plus supplemental tk_ models for breadth. Toolkit-owned lanes carry our product experience, while tk_ routes expose named models such as GLM, Kimi, MiMo, Qwen Plus, MiniMax, DeepSeek, Groq Compound, and Cerebras GPT-OSS through one API key. The system always selects the lowest-cost capable route, escalating only when necessary.
