Cheap AI Model Tests — Everything Under $3 / Million Tokens
Picking a budget LLM usually means guessing. So we stopped guessing. We took every model priced under $3 per million output tokens that you can reach from a free trial, and sent each one the exact same prompts through the live All AI Ask API. Every number below — speed, cost, and the model outputs themselves — comes from real API calls, not marketing decks.
The three tests
Writing a Code Snippet
A focused coding task: produce a correct, efficient, 0-indexed iterative Fibonacci function in Python — and nothing but the code.
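For reference, here is a minimal sketch of the kind of answer this prompt asks for. It is an illustration, not any model's actual output. It assumes the common convention F(0) = 0 and F(1) = 1; the function name `fibonacci` and the negative-input check are our own choices and are not specified by the prompt.

```python
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number, 0-indexed (F(0) = 0, F(1) = 1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    previous, current = 0, 1
    # Advance the pair (F(i), F(i+1)) n times; O(n) time, O(1) extra space.
    for _ in range(n):
        previous, current = current, previous + current
    return previous
```

The loop keeps only two integers in memory, which is what "efficient" typically means for this task compared with naive recursion.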
Writing a Short Paragraph
A plain-English writing task: explain what an API is to a non-technical small-business owner in 3–4 jargon-free sentences using one analogy.
Extracting Structured Data
A structured-output task: read one sentence and return strict JSON with a string name, numeric price, and boolean stock flag — no markdown, no prose.
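A check for this task could be sketched roughly as follows. The key names `name`, `price`, and `in_stock` are hypothetical (the task description above only fixes the three value types), and the helper `is_strict_json` is our own illustration rather than the grader actually used.

```python
import json

# Hypothetical key names: the task description only fixes the three value types.
EXPECTED_TYPES = {"name": str, "price": (int, float), "in_stock": bool}


def is_strict_json(raw: str) -> bool:
    """Return True if raw is a bare JSON object with exactly the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Markdown fences or surrounding prose make the whole string unparseable.
        return False
    if not isinstance(data, dict) or set(data) != set(EXPECTED_TYPES):
        return False
    for key, expected in EXPECTED_TYPES.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject True/False as a price.
        if expected is not bool and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True
```

Under these assumptions, a response wrapped in a markdown code fence, or one that returns the price as the string `"19.99"`, would fail the check.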
Overall leaderboard
Accuracy and speed are averaged across all 3 tasks; cost is the total across them. Speed is in tokens per second (t/s). Accuracy is graded 0–100 by the agent against each task's published criteria.
| # | Model | Provider | Avg accuracy | Avg speed | Total cost |
|---|---|---|---|---|---|
| 🥇 | GPT-5.4 Nano | OpenAI | 99 | 57.4 t/s | $0.000274 |
| 2 | Gemini 3.1 Flash Lite | Google | 97 | 82.6 t/s | $0.000339 |
| 3 | Codestral | Mistral | 96.3 | 92.4 t/s | $0.000227 |
| 4 | Mistral Medium 3 | Mistral | 96 | 35.7 t/s | $0.000505 |
| 5 | Llama 3.1 8B | Groq | 95.3 | 274.2 t/s | $0.000029 |
| 6 | Mistral Small 3.1 | Mistral | 95.3 | 77.3 t/s | $0.000161 |
| 7 | Llama 3.3 70B | Groq | 95 | 181.9 t/s | $0.000324 |
| 8 | Amazon Nova Micro | Amazon | 91.3 | 102.6 t/s | $0.000029 |
| 9 | Amazon Nova Lite | Amazon | 90.3 | 110.1 t/s | $0.00006 |
| 10 | Ministral 8B | Mistral | 90 | 60 t/s | $0.000061 |
| 11 | Llama 4 Scout | Groq | 89.7 | 194.8 t/s | $0.000101 |
| 12 | DeepSeek V4 Pro | DeepSeek | 82.7 | 80.5 t/s | $0.000684 |
| 13 | DeepSeek V4 Flash | DeepSeek | 82 | 66.4 t/s | $0.000173 |
| 14 | Grok 4.3 | xAI | 80.7 | 28.9 t/s | $0.000984 |
| 15 | GPT-OSS 120B | Groq | 80.7 | 322.3 t/s | $0.000309 |
| 16 | GPT-OSS 20B | Groq | 80 | 502.9 t/s | $0.000205 |
| 17 | GPT-OSS 120B (Cerebras) | Cerebras | 77.3 | 490.4 t/s | $0.000534 |
| 18 | GLM 4.7 (Cerebras) | Cerebras | 75.7 | 531.4 t/s | $0.00511 |
| 19 | Qwen 3 32B | Groq | 73.7 | 351.5 t/s | $0.001475 |
How we tested
- The cohort: every model under $3 / million output tokens reachable from a trial account (19 models, 8 providers).
- Identical prompts: each model received the same prompt with default settings — one model per request.
- Real metrics: latency, token counts, and cost are returned directly by the API for each run.
- Accuracy: graded 0–100 by the agent against each task's published criteria — the full output for every model is shown so you can check the grading yourself.
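
To make the leaderboard columns concrete, the sketch below shows one way three per-task runs could be rolled up into an average accuracy, an average speed, and a total cost. The `Run` record and its field names are hypothetical, not the API's actual response schema. It also assumes speed is the mean of each task's tokens-per-second figure; the methodology above does not say whether that or total tokens divided by total time was used.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    # Hypothetical record of one API call; field names are illustrative only.
    accuracy: float         # 0-100 grade for the task
    output_tokens: int      # tokens generated in the response
    latency_seconds: float  # time taken to generate the response
    cost_usd: float         # cost reported for the call


def summarize(runs: list[Run]) -> dict[str, float]:
    """Roll per-task runs up into one leaderboard row for a model."""
    # Assumption: per-task tokens/second values are averaged, not pooled.
    speeds = [r.output_tokens / r.latency_seconds for r in runs]
    return {
        "avg_accuracy": round(mean(r.accuracy for r in runs), 1),
        "avg_speed_tps": round(mean(speeds), 1),
        "total_cost_usd": round(sum(r.cost_usd for r in runs), 6),
    }
```

With made-up inputs such as scores of 100, 95, and 94, the averaged accuracy rounds to 96.3, which is the kind of one-decimal figure that appears in the table.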
