tk_

Model System

Toolkit flagships. tk_ breadth. One router.

Toolkit is a routed AI system: Toolkode-owned product lanes for the flagship experience, plus supplemental tk_ routes that expose named models through one API key, one wallet, and token-level pricing.

Model Ladder

| Model | Params | Role | Input / 1M | Cached / 1M | Output / 1M |
|---|---|---|---|---|---|
| toolkit-chat-turbo | 4B dense | Fast chat + live data (our daily-retrained model) | $0.15 | $0.02 | $1.00 |
| toolkit-chat | 9B dense | Premium chat + reasoning | $0.30 | $0.03 | $1.20 |
| toolkit-code-turbo | 9B dense | Fast code generation | $0.20 | $0.02 | $1.00 |
| toolkit-voice | voice system | Real-time voice | $0.10/min | n/a | $0.10/min |
| toolkit-cam | 9B multimodal | Camera / vision | $0.10/min | n/a | $0.10/min |
| toolkit-base | hosted model | Business analysis + long documents (200K) | $0.40 | $0.04 | $1.60 |
| toolkit-code | hosted model | Multi-file code + architecture (131K) | $1.20 | $0.12 | $4.80 |
| toolkit-code-backend | hosted model | Agent workflows + repo-aware coding (65K) | $1.00 | $0.10 | $4.00 |
| toolkit-think | hosted model | Deep reasoning and research (65K) | $0.80 | $0.08 | $6.00 |
| toolkit-vision | vision model | Image + document understanding (128K) | $0.50 | $0.05 | $2.50 |

Prices are per 1M tokens unless noted. Cached input (repeat context such as system prompts and multi-turn history) is billed at a 90% discount off the input rate. Voice and camera are billed per minute.
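To make the cached discount concrete, here is a small sketch with hypothetical traffic numbers; the rates come from the toolkit-chat row above.

```python
# Cost arithmetic for toolkit-chat: $0.30 input, $0.03 cached input, $1.20 output per 1M tokens.
RATES = {"input": 0.30, "cached": 0.03, "output": 1.20}

def cost_usd(fresh_in: int, cached_in: int, out: int) -> float:
    """Cached tokens bill at the cached rate (90% off input); the rest at full rate."""
    return (fresh_in * RATES["input"]
            + cached_in * RATES["cached"]
            + out * RATES["output"]) / 1_000_000

# A multi-turn session: 50K fresh input, 400K replayed history (cached), 20K output.
print(cost_usd(50_000, 400_000, 20_000))
```

In this example the 400K replayed tokens cost $0.012 instead of the $0.12 they would at the full input rate.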

Domestic Distilled tk_ Routes

These are compact, flash, and research-agent routes from U.S.-headquartered model labs. They broaden domestic choice without claiming Toolkit ownership. Cards show advertised token rates or provider task estimates first.

tk_gpt54_mini

GPT-5.4 mini

Lane: Domestic compact
Params: not disclosed
Active: not disclosed
Context: 400K
Max out: 128K
Status: stable API
I/O: text + image in, text out
Input / 1M: $0.75
Output / 1M: $4.50
Cached input read: $0.075

Compact OpenAI route for coding, computer use, subagents, and high-volume reasoning.

  • Supports reasoning, function calling, web search, file search, computer use, and skills.
  • Regional/data-residency processing carries a 10% provider uplift before Toolkode fee.
Tags: domestic · compact · coding

tk_gpt54_nano

GPT-5.4 nano

Lane: Domestic nano
Params: not disclosed
Active: not disclosed
Context: 400K
Max out: 128K
Status: stable API
I/O: text + image in, text out
Input / 1M: $0.20
Output / 1M: $1.25
Cached input read: $0.02

Lowest-cost OpenAI 5.4-class route for classification, extraction, ranking, and subagents.

  • Supports reasoning and common tool workflows; no computer-use support on this nano tier.
  • Best for support agents and classifiers, not full deep research or large autonomous coding.
Tags: domestic · nano · high volume

tk_gemini3_flash

Gemini 3 Flash

Lane: Domestic flash
Params: not disclosed
Active: not disclosed
Context: 1,048,576
Max out: 65,536
Status: preview
I/O: text, image, video, audio, PDF in; text out
Input / 1M: $0.50
Output / 1M: $3.00
Cached input read: $0.05

Fast multimodal route for search-grounded work, video/image input, and general agent tasks.

  • Supports thinking, code execution, computer use, file search, URL context, search grounding, Maps grounding, function calling, structured outputs, Batch, Flex, Priority, and caching.
  • Preview model: rate limits, availability, and behavior may change before stable release.
Tags: domestic · flash · multimodal

tk_gemini31_flash_lite

Gemini 3.1 Flash-Lite

Lane: Domestic lite
Params: not disclosed
Active: not disclosed
Context: 1,048,576
Max out: 65,536
Status: preview
I/O: text, image, video, audio, PDF in; text out
Input / 1M: $0.25
Output / 1M: $1.50
Cached input read: $0.025

Cost-efficient Gemini route for high-volume agentic tasks, translation, and simple data processing.

  • Supports thinking, code execution, file search, URL context, search grounding, Maps grounding, function calling, structured outputs, Batch, Flex, Priority, and caching.
  • Does not support computer use or image generation; this is the high-volume lite route.
Tags: domestic · lite · high volume

tk_claude45_haiku

Claude Haiku 4.5

Lane: Domestic haiku
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: model-dependent
Status: stable API
I/O: text + image in, text out
Input / 1M: $1.00
Output / 1M: $5.00
Cached input read: $0.10

Fast Anthropic route for coding, computer use, chat, and parallel subagent workflows.

  • Prompt caching: $1.25/MTok 5-minute write, $2/MTok 1-hour write, $0.10/MTok read.
  • US-only inference is available from Anthropic at a 1.1x provider uplift before Toolkode fee.
Tags: domestic · fast · agents

tk_gemini_deep_research

Gemini Deep Research

Lane: Domestic research agent
Params: agent workflow
Active: underlying Gemini + tools
Context: workflow-managed
Max out: cited report
Status: preview / Interactions API only
I/O: text + documents + tools
Typical task: $1-$3
Max task est.: $3-$7

Autonomous research agent that plans, searches, reads, reasons, and returns cited reports.

  • Not a normal chat-completions model: runs through the Gemini Interactions API with background execution.
  • Default tools include Google Search, URL Context, and Code Execution; MCP and File Search can be attached.
  • Google estimates moderate Deep Research at $1-$3/task and Deep Research Max at $3-$7/task; actual debit follows underlying tokens and tools.
Tags: domestic · deep research · agent

Settlement note: advertised provider rates shown. Toolkode adds a 5.5% platform fee at settlement.
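The settlement order implied by the card notes (any provider uplift applies first, then the platform fee) can be sketched as:

```python
PLATFORM_FEE = 0.055  # Toolkode platform fee applied at settlement

def settled_rate(advertised: float, provider_uplift: float = 1.0) -> float:
    """Apply the provider uplift first (e.g. 1.10 for regional/data-residency
    processing), then the 5.5% Toolkode fee, per the card notes above."""
    return advertised * provider_uplift * (1 + PLATFORM_FEE)

# tk_gpt54_mini input rate with the 10% data-residency uplift applied:
print(settled_rate(0.75, provider_uplift=1.10))
```

This is a sketch of the described fee composition, not the billing engine itself; per-call rounding is not specified in this document.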

Supplemental tk_ Model Cards

tk_ models represent themselves: model family, version, architecture, context, and active parameters where known. Cards show the advertised provider token rates first.

tk_glm47_flash

GLM-4.7-Flash

Lane: Fast base
Params: not disclosed
Active: not disclosed
Context: 131,072 on CF / 200K native
Max out: 128K native
Input / 1M: $0.06
Output / 1M: $0.40

Fast general chat, coding, and long-document work.

  • Cloudflare Workers AI currently exposes 131,072 tokens; Z.AI native docs list 200K context and 128K max output.
  • Displayed token price is the Cloudflare Workers AI route price, not Z.AI's native free promotional route.
Tags: fast · base · long context

tk_glm_turbo

GLM-5-Turbo

Lane: Fast GLM
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: 128K
Input / 1M: $1.20
Output / 1M: $4.00

Low-latency GLM route for OpenClaw-style agent, coding, and tool workflows.

  • Z.AI docs list this as GLM-5-Turbo with 200K context, 128K max output, context caching, function calling, structured output, and thinking mode.
Tags: fast · coding · tools

tk_glm51

GLM-5.1

Lane: Reasoning
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: 128K
Input / 1M: $1.40
Output / 1M: $4.40

Reasoning, coding, multilingual work, and long-form synthesis.

  • Z.AI docs list GLM-5.1 at 200K context with English/Chinese support and coding-focused long-horizon task performance.
Tags: reasoning · coding · multilingual

tk_glm5

GLM-5

Lane: Reasoning
Params: not disclosed
Active: not disclosed
Context: 200K
Max out: 128K
Input / 1M: $1.00
Output / 1M: $3.20

General reasoning and tool workflows.

  • Z.AI docs list GLM-5 at 200K context for programming, agentic long-term planning, backend refactoring, and debugging.
Tags: reasoning · tools · general

tk_kimi_k25

Kimi K2.5

Lane: Agentic compute
Params: 1T MoE
Active: not disclosed
Context: 256K
Input / 1M: $0.60
Output / 1M: $3.00
Cached input read: $0.10

Long-horizon coding and agent workflow execution.

  • Moonshot/Cloudflare docs list Kimi K2.5 at 256K context with vision inputs, reasoning, function calling, structured outputs, batch, and cached input pricing.
Tags: agents · coding · long context

tk_kimi_k26

Kimi K2.6

Lane: Agentic compute
Params: 1T MoE
Active: 32B active
Context: 262K
Input / 1M: $0.74
Output / 1M: $3.49

Long-horizon coding, design, and multi-step agent workflows.

  • Public third-party route listings describe K2.6 as 1T MoE, 32B active, 262K context, and a code-preview successor to K2.5; verify direct Moonshot billing before GA.
Tags: MoE · agents · long context

tk_mimo_v2_pro

MiMo V2 Pro

Lane: General work
Params: 1T+ MoE
Active: 42B active
Context: 1,048,576
Max out: 131,072
Input / 1M: $1.00+
Output / 1M: $3.00+

High-quality general work and production assistance.

  • Public MiMo V2 Pro listings show 1M context, 131K max output, over 1T total parameters, 42B active, and tiered long-context pricing.
Tags: general · production · reasoning

tk_mimo_v2_omni

MiMo V2 Omni

Lane: Multimodal
Params: not disclosed
Active: not disclosed
Context: verify before GA
Input / 1M: verify
Output / 1M: verify

Omni route for multimodal and general assistant workflows.

  • Public pricing/context evidence for MiMo V2 Omni was not strong enough for a flat public claim; keep gated until provider docs are verified.
Tags: multimodal · omni · assistant

tk_mimo_v25_pro

MiMo V2.5 Pro

Lane: Reasoning
Params: 1T MoE
Active: 42B active
Context: 1,048,576
Max out: 131,072
Input / 1M: $1.00
Output / 1M: $3.00

Stronger MiMo route for reasoning and production work.

  • OpenRouter/independent listings show MiMo V2.5 Pro at 1M context, 131K max output, $1/M input and $3/M output; Xiaomi direct docs should be rechecked before GA.
Tags: reasoning · production · general

tk_mimo_v25

MiMo V2.5

Lane: General work
Params: not disclosed
Active: not disclosed
Context: 1,048,576
Input / 1M: $0.40
Output / 1M: $2.00

Broad general-use MiMo route.

  • OpenRouter/independent listings show MiMo V2.5 at 1M context, $0.40/M input and $2/M output; direct Xiaomi docs should be rechecked before GA.
Tags: general · fast · assistant

tk_qwen35_plus

Qwen3.5 Plus

Lane: General reasoning
Params: not disclosed
Active: not disclosed
Context: 1,000,000
Max out: 65,536
Input / 1M: $0.115+
Output / 1M: $0.688+

General chat, reasoning, and coding.

  • Alibaba Model Studio lists Qwen3.5 Plus with 1M context and 65,536 max output; billing is tiered by prompt size, so public cards must not show a single flat price for all context lengths.
Tags: reasoning · coding · general

tk_qwen36_plus

Qwen3.6 Plus

Lane: General reasoning
Params: not disclosed
Active: not disclosed
Context: 1,000,000
Max out: 65,536
Input / 1M: $0.325
Output / 1M: $1.95

General chat, reasoning, coding, and production assistance.

  • Public model index listings show Qwen3.6 Plus with 1M context and 66K max output; verify final Alibaba billing table before GA because official pricing docs are easier to read through Qwen Plus routes.
Tags: reasoning · coding · general

tk_minimax_m27

MiniMax M2.7

Lane: Document agents
Params: not disclosed
Active: not disclosed
Context: 196K
Input / 1M: $0.30
Output / 1M: $1.20

Agentic productivity, document workflows, and structured tasks.

  • OpenRouter lists MiniMax M2.7 at 196,608 context, $0.30/M input and $1.20/M output.
Tags: agents · documents · structured

tk_minimax_m25

MiniMax M2.5

Lane: Document agents
Params: not disclosed
Active: not disclosed
Context: 196K
Input / 1M: $0.118
Output / 1M: $0.99

Broad productivity and agent tasks.

  • OpenRouter lists MiniMax M2.5 at 196,608 context, $0.118/M input and $0.99/M output.
Tags: agents · documents · productivity

tk_deepseek_v4_pro

DeepSeek V4 Pro

Lane: Reasoning/code
Params: 1.6T
Active: not disclosed
Context: 1,000,000
Input / 1M: verify
Output / 1M: $3.48

Reasoning, coding, and agentic workflows.

  • Public reporting describes DeepSeek V4 Pro as a 1.6T model with a 1M context window and $3.48/M output; direct API input billing must be verified before GA.
Tags: reasoning · coding · agents

tk_deepseek_v4_flash

DeepSeek V4 Flash

Lane: Fast reasoning
Params: 284B
Active: not disclosed
Context: 1,000,000
Input / 1M: verify
Output / 1M: $0.28

Lower-latency reasoning and coding.

  • Public reporting describes DeepSeek V4 Flash as a smaller V4 variant with 1M context and $0.28/M output; direct API input billing must be verified before GA.
Tags: fast · reasoning · coding

tk_groq_compound_mini

Groq Compound Mini

Lane: Fast tools
Params: 120B / 70B routed
Active: not disclosed
Context: 131K
Max out: 8,192
Input / 1M: $0.59
Output / 1M: $0.79

Current-data, tool-capable, and code-execution workflows.

  • Groq lists 131,072 context and 8,192 max output. Final cost depends on whether GPT-OSS-120B, Llama 3.3 70B, and built-in tools are used.
Tags: fast inference · tools · current data

tk_cerebras_gptoss_120b_moe

Cerebras GPT-OSS 120B Fast

Lane: Fast 120B MoE
Params: 117B MoE
Active: 5.1B active
Context: 131K
Max out: 131K
Input / 1M: $0.25
Output / 1M: $0.69

High-throughput code, math, and agentic reasoning.

  • OpenAI lists gpt-oss-120b at 117B total, 5.1B active, 131,072 context, 131,072 max output, text-only input/output, and configurable reasoning effort.
Tags: MoE · 120B · fast inference

Settlement note: advertised provider rates shown. Toolkode adds a 5.5% platform fee at settlement.

Cheap Model First, Big Brain If Needed

Toolkit routes requests based on task type and difficulty, not just model size. The router attempts the lowest-cost capable model first, then escalates if necessary.

| Task Type | Default Model |
|---|---|
| Simple chat, summaries, extraction | toolkit-chat-turbo |
| Premium chat, planning, business | toolkit-chat |
| Fast code (functions, SQL, endpoints) | toolkit-code-turbo |
| Product logic, general reasoning | toolkit-base |
| Backend architecture, APIs, systems | toolkit-code-backend |
| Large codebase edits, refactors | toolkit-code |
| Deep reasoning, research | toolkit-think |
| Voice | toolkit-voice |
| Camera / screenshots | toolkit-cam |
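A minimal sketch of cheap-first routing with escalation. The lane groupings and the integer difficulty score are illustrative assumptions; the real router's classifier is not described in this document.

```python
# Hypothetical escalation ladders per task lane (cheapest model first).
LADDER = {
    "chat":  ["toolkit-chat-turbo", "toolkit-chat"],
    "code":  ["toolkit-code-turbo", "toolkit-code-backend", "toolkit-code"],
    "think": ["toolkit-base", "toolkit-think"],
}

def route(task_type: str, difficulty: int) -> str:
    """Pick the lowest-cost capable model; higher difficulty climbs the lane."""
    lane = LADDER[task_type]
    return lane[min(difficulty, len(lane) - 1)]

print(route("code", 0))  # easy tasks stay on the turbo model
print(route("code", 9))  # hard tasks escalate to the top of the lane
```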

Training Approach

All Toolkit-trained models use LoRA fine-tuning rather than full fine-tuning. This allows faster retraining cycles, lower GPU cost, per-role specialization, and shared base models with different behavioral adapters.

Inference: AWQ 4-bit quantization
Training: bf16
Infrastructure: Rented GPUs, Cloudflare R2 for data
No user conversations used for model training.
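A back-of-envelope sketch of why LoRA retraining is cheap: instead of updating a full weight matrix, LoRA trains two low-rank factors and serves their product added to the frozen base weights. The 4096 dimension below is an illustrative assumption; r=32 matches the model cards.

```python
# LoRA replaces a full d_out x d_in update with B (d_out x r) and A (r x d_in),
# so only the two small factors are trained.
def lora_fraction(d_out: int, d_in: int, r: int) -> float:
    full = d_out * d_in
    adapter = d_out * r + r * d_in
    return adapter / full

# A hypothetical 4096x4096 projection with the r=32 used by the Toolkit cards:
print(lora_fraction(4096, 4096, 32))  # ~1.6% of the weights are trained
```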

Model Cards

toolkit-chat-turbo (4B dense)
Base: Qwen3-4B-Instruct + our LoRA
Training: LoRA r=32, bf16, ~60K samples daily
Retrains: Daily (4am + 7pm UTC)

Real-time conversational model trained on fresh information — the only model we retrain on our own infra.

Conversational responses · Current events · Local information · Fast inference · Persona awareness

Data: News, sports, markets, weather, jobs, dining, travel

Freshness weighting: today = 100%, this week = 50%, >30 days = dropped. This model reads the news every day.
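The stated weighting can be sketched as a sampling function. Only three tiers are specified (today 100%, this week 50%, older than 30 days dropped); the 0.25 weight for the 8-30 day band is an illustrative assumption, not from this document.

```python
from datetime import date, timedelta

def freshness_weight(sample_date: date, today: date) -> float:
    """Sampling weight per the stated policy: today = 1.0, within the last
    week = 0.5, older than 30 days = dropped. The 8-30 day band is not
    specified in the doc; 0.25 here is an assumed mid-band value."""
    age = (today - sample_date).days
    if age <= 0:
        return 1.0
    if age <= 7:
        return 0.5
    if age <= 30:
        return 0.25  # assumption: unspecified mid band
    return 0.0

today = date(2025, 6, 30)
print(freshness_weight(today, today))                       # 1.0
print(freshness_weight(today - timedelta(days=3), today))   # 0.5
print(freshness_weight(today - timedelta(days=45), today))  # 0.0 (dropped)
```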

toolkit-chat (9B dense)
Base: Qwen3.5-9B + our LoRA (our pod)
Training: LoRA r=32, bf16, full dataset
Retrains: Weekly

Premium conversational model with stronger reasoning. Our fine-tuned 9B with tk_ personality + domain tuning.

Tool calling · Structured outputs · Multi-step reasoning · Business conversations · Planning

Served from our pod. Daily briefing context injected as system prompt.

toolkit-code-turbo (9B dense)
Base: Qwen3.5-9B + our code LoRA (our pod)
Training: LoRA r=32, bf16, curated code dataset
Retrains: Weekly

Fast coding model for autocomplete, small refactors, unit tests.

Fast code generation · SQL · APIs · Functions · IDE integration

toolkit-voice (voice system)
Base: Toolkit voice stack
Training: Private voice tuning
Retrains: Daily voice updates

Full-duplex voice conversation for real-time assistant workflows.

Full-duplex voice · Natural conversation · Speaker profiles · Real-time interaction · Sub-second knowledge lookup

Data: ~13K voice conversation pairs + live briefing context

toolkit-cam (9B multimodal)
Base: MiniCPM-o 4.5
Training: LoRA on .llm backbone
Retrains: Not yet live

Phone camera and screenshot understanding.

Phone camera input · Screenshot understanding · Visual reasoning · Multimodal knowledge

Text-only v1 trained, vision-grounded v2 planned.

toolkit-base (hosted model)
Base: Toolkit hosted base rail
Training: Hosted — no weight updates by us
Retrains: Daily briefing via system prompt

General workhorse for business analysis and long-document tasks (200K runtime context).

200K context · Tool orchestration · Business logic · Structured reasoning

Data: Our daily briefing: news, markets, weather, regional data — injected as system prompt

toolkit-code-backend (hosted model)
Base: Toolkit backend code rail
Training: Hosted — no weight updates by us
Retrains: Not yet retrained

Agent workflows + repo-aware coding with cached-prompt reuse.

65K repo context · Session-affinity cache · Multi-turn agent workflows · Tool calling · System design

toolkit-code (hosted model)
Base: Toolkit hosted code rail
Training: LoRA r=32, bf16 on curated code dataset
Retrains: Biweekly

Multi-file code + architecture with our domain tuning.

Multi-file refactors · Repo-wide changes · Agentic coding · Tool calling · 131K context · Debugging across files

toolkit-think / toolkit-vision (hosted models)
Base: Toolkit reasoning and vision rails
Training: Hosted — no weight updates by us
Retrains: Daily briefing via system prompt

Deep reasoning and multimodal understanding through separate routed rails.

Deep multi-step reasoning · Research · Image + document OCR · 128K vision context · Chart comprehension

Think mode and vision mode are routed separately; public card values follow runtime limits.

Pricing

All token pricing includes the smart router at no extra cost. Cached input tokens (repeated system prompts, multi-turn conversation history) are billed at 90% off the input price.

| Model | Input | Cached Input | Output |
|---|---|---|---|
| toolkit-chat-turbo | $0.15 | $0.02 | $1.00 |
| toolkit-chat | $0.30 | $0.03 | $1.20 |
| toolkit-code-turbo | $0.20 | $0.02 | $1.00 |
| toolkit-voice | $0.10/min | n/a | $0.10/min |
| toolkit-cam | $0.10/min | n/a | $0.10/min |
| toolkit-base | $0.40 | $0.04 | $1.60 |
| toolkit-code | $1.20 | $0.12 | $4.80 |
| toolkit-code-backend | $1.00 | $0.10 | $4.00 |
| toolkit-think | $0.80 | $0.08 | $6.00 |
| toolkit-vision | $0.50 | $0.05 | $2.50 |
Images: $0.03 per image (standard)
Web Search: $0.005 per search call
Code Exec: $0.005 per execution
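Token rates and the flat tool fees above compose per request. A sketch of a per-request estimator, using the toolkit-chat row and the fixed fees listed here (traffic numbers are hypothetical):

```python
TOKEN_RATES = {"toolkit-chat": (0.30, 0.03, 1.20)}  # input, cached, output per 1M tokens
FIXED_FEES = {"image": 0.03, "web_search": 0.005, "code_exec": 0.005}

def estimate(model, tokens_in, tokens_cached, tokens_out, **tool_calls):
    """Estimate one request's cost: token charges plus flat per-call tool fees."""
    i, c, o = TOKEN_RATES[model]
    cost = (tokens_in * i + tokens_cached * c + tokens_out * o) / 1_000_000
    cost += sum(FIXED_FEES[t] * n for t, n in tool_calls.items())
    return cost

# A chat turn with one web search: 8K input, 40K cached, 2K output.
print(estimate("toolkit-chat", 8_000, 40_000, 2_000, web_search=1))
```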

Training Cadence

| Cadence | Models |
|---|---|
| Daily | chat (4B, our LoRA, 4am + 7pm UTC) |
| Daily briefing | chat-pro, base, code-fast, code, think, vision — hosted models, system-prompt refresh |
| Daily voice updates | voice quality and safety refresh |
| Not yet retrained | cam, code-backend — hosted on frontier models |

Key Design Principles

1. Role specialization: Each model has a specific job instead of trying to be a single general model.

2. Routed system architecture: Requests are routed to the lowest-cost capable model and escalated only when needed.

3. LoRA fine-tuning: Enables fast iteration and specialization without full retraining.

4. Freshness training for chat: The chat model is retrained daily on fresh information sources.

5. Separation of storage and training: Conversation memory is stored for context and user experience, not used as training data.

Toolkit is a routed AI system composed of flagship Toolkode product lanes plus supplemental tk_ models for breadth. Toolkit-owned lanes carry our product experience, while tk_ routes expose named models such as GLM, Kimi, MiMo, Qwen Plus, MiniMax, DeepSeek, Groq Compound, and Cerebras GPT-OSS through one API key. The system always selects the lowest-cost capable route, escalating only when necessary.
