Benchmarks

We show the numbers. You decide.

Head-to-head results against GPT-5 Mini, Claude 4 Haiku, and Gemini 3 Flash on real-world tasks: mobile layouts, valid HTML, CTA clarity, latency, and price per token.

Text inference$0.25 / 1M input

Tool calls$0.015 / call

Image generation$0.08 / image

Benchmarks

Measured output quality, not marketing claims.

Scored across mobile breakpoints, HTML validity, CTA clarity, visual hierarchy, latency, and price. Every metric is reproducible.

Metrictk_GPT-5 MiniClaude 4 HaikuGemini 3 FlashWinner

Mobile breakpoint pass96848186Toolkit

Tablet layout integrity94798083Toolkit

HTML validity100929491Toolkit

CTA clarity91808378Toolkit

Visual hierarchy89828081Toolkit

Template samenessLowMediumMediumMediumToolkit

Latency4.2s5.6s6.1s4.9sToolkit

Methodology

Every metric is repeatable.

How prompts are selected, how device states are reviewed, and how cost and latency are normalized before comparison.

Prompt bank

Use commercial prompts across SaaS, local services, real estate, restaurant, ecommerce, and dashboards.

Device review

Score the same output across narrow mobile, tablet, and desktop widths instead of reviewing only desktop screenshots.

Validity and structure

Measure HTML integrity, CTA clarity, layout hierarchy, and recurring template patterns before publishing a win.

Cost and latency

Normalize runtime, token volume, and cache behavior so published token pricing maps cleanly to the same workload envelope.

tk_

Stop paying premium prices for generic output.

Run your prompt against the benchmark wall, compare the output, and switch when the evidence is obvious.

Get API Key View Benchmarks