M3 foundation model / 1M context · 512K min / 83.5 BrowseComp / MSA sparse attention / native multimodal
§ 01 · A new foundation

Frontier intelligence, delivered.

A foundation model that fuses coding, 1M context, and native multimodality into one coherent, production-grade system.

vs. Claude · 20% of the price of Sonnet 4.6 · 12% of the price of Opus 4.7 · approaching, and in many scenarios surpassing
M3 / Model
EST. 2026
M-001 Figure · M3 logomark
§ Manifesto · A new foundation

A foundation model that ships with production-grade coding, 1M context, and native multimodality — designed end-to-end, deployed in hours.

  • 12h unattended
  • 70T tokens
  • 0 → 1 production
  • 3 frontiers
  • 1 system
  • 0 trade-offs
fig. 01 · Object Drawing No. · M3-A1
§ 02 · Approach

One model. Six frontiers.

Reasoning, coding, multimodal understanding, tool use, long context, general intelligence — engineered as one system, with long context and coding/agentic pushed to the front.
Approach
A single model that integrates coding agents, long-context reasoning, and multimodal understanding — built and trained together from the first step. No bolted-on adapters, no post-hoc stitching.
See capabilities
Company
One research and engineering stack — from sparse-attention pretraining to agentic harness — delivering reliable, traceable, reproducible frontier intelligence.
Inside M3
§ 03 · Architecture

From-scratch sparse attention that holds up at 1M context.

MiniMax Sparse Attention (MSA) is engineered from the very first pretraining step — not retrofitted afterward. It keeps M3 sharp across long contexts and unlocks efficient inference at frontier scale.

01 Benchmark · vs. M2
9.7× Prefill
15.6× Decode
100% GPU util.
02 MSA Forward · PyTorch
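# MSA block forward pass (illustrative sketch, not the production kernel)
import torch.nn.functional as F

def msa_forward(x, k_idx, l_idx, w_q, w_k, w_v):
  # x: (batch, seq, d); w_q, w_k, w_v: d -> d torch.nn.Linear projections
  q = w_q(x)                # queries for every position
  k = w_k(x[:, k_idx])      # keys gathered at the positions in k_idx
  v = w_v(x[:, l_idx])      # values gathered at l_idx (same length as k_idx)
  return F.scaled_dot_product_attention(q, k, v)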
03 Index Hyper-params
block_size k = 128
local_window l = 64
stride_width w = 32
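To make the idea of a sparse index concrete, here is a minimal sketch of a generic "local window plus strided blocks" pattern. It is not MSA's documented selection rule: the function name and the roles given to the three hyper-parameters (k, l, w above) are assumptions made only for illustration.

```python
# Hypothetical sketch of a local + strided sparse attention pattern.
# The meaning assigned to block_size, local_window and stride_width here is an
# assumption for illustration, not the documented behavior of MSA.

def sparse_key_positions(t, block_size=128, local_window=64, stride_width=32):
    """Return the sorted key positions a causal query at position t attends to."""
    # Local part: the most recent `local_window` positions, including t itself.
    local = set(range(max(0, t - local_window + 1), t + 1))
    # Strided part: the first `stride_width` positions of each block up to t.
    strided = set()
    for block_start in range(0, t + 1, block_size):
        block_end = min(block_start + stride_width, t + 1)
        strided.update(range(block_start, block_end))
    return sorted(local | strided)


t = 1_000
keys = sparse_key_positions(t)
print(f"query @ {t}: attends to {len(keys)} of {t + 1} causal positions")
```

Under these assumed roles, the query at position 1,000 keeps 320 of its 1,001 causal positions, and the kept fraction shrinks as the context grows, which is the general reason sparse patterns reduce per-token compute.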
Step 1 / 3 · MSA Architecture
MSA · query @ t
  • From-scratch sparse attention
  • ~100% GPU utilization
  • 9.7× Prefill vs M2
  • 15.6× Decode vs M2
Step 2 / 3 · 70T Token Training
  • 70T pretraining tokens
  • Step 0: multimodal from start
  • 1M context window
  • 512K guaranteed usable
Step 3 / 3 · Benchmarks
  • 83.5 BrowseComp · > Opus 4.7 (79.3)
  • 59.0 SWE-Bench Pro
  • 66.0 Terminal Bench 2.1
  • 37.1 PostTrainBench · rank 3
MSA Reference Diagram Drawing No.: M3-001 Scale: 1/1
§ 04 · Capabilities

Three frontiers, one model.

One checkpoint carries all three — no routing, no hand-offs, no separate models to stitch. Each frontier is trained in, not bolted on.
01 · Coding / Agentic SOTA
Engineering-grade intelligence, ready for production.
Coding and agentic capabilities trained into the same model — for long-horizon tasks that can be decomposed, parallelized, and self-corrected across Producer + Verifier loops running unattended for days.
  • Long-horizon tasks
  • Producer + Verifier loop
  • Computer Use
  • 1M context window
const m3 = await MiniMax.agent({
  model: "M3",
  context: "1M",
  tools: ["shell", "browser", "computer_use"],
  team: true,
});

// runs unattended for days
await m3.run("reproduce-paper-iclr-2025");
02 · Native Multimodal
Native, not bolted-on. No translation layer, no bottleneck.
No separate vision encoder. Visual and text tokens live in the same transformer parameter space, trained jointly from the very first step. Native 1-hour video (1000–3000 frames) with reasoning that flows bidirectionally between vision and language.
  • 1-hour native video
  • Unified token space
  • Bidirectional reasoning
  • Screen-recording ready
TEXT · IMAGE · VIDEO · AUDIO · PDF · CODE · 3D → TOKEN
One transformer, all modalities.
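The "1-hour video (1000–3000 frames)" figure implies a sampling interval that is easy to work out. The sketch below assumes frames are sampled uniformly over the hour, which the page does not state; the function name is illustrative.

```python
# Sampling interval implied by 1000 to 3000 frames over one hour of video.
# Uniform sampling is an assumption; the page only gives the frame range.
VIDEO_SECONDS = 60 * 60  # one hour


def seconds_per_frame(frame_count, duration_s=VIDEO_SECONDS):
    """Average gap between sampled frames, assuming uniform sampling."""
    return duration_s / frame_count


for frames in (1000, 3000):
    print(f"{frames} frames -> one frame every {seconds_per_frame(frames):.1f} s")
```

Under that assumption the range corresponds to roughly one frame every 3.6 seconds at the low end and every 1.2 seconds at the high end.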
03 · 1M Long Context
Sustained intelligence, not just a big window.
A 1M-token context window where intelligence holds — code, logs, and figures for a long-running task can be loaded at once and processed concurrently. Per-token compute is just 1/20 of the previous generation thanks to MSA sparse attention, applied from the first pretraining step. Trained on 70T tokens — more than GLM (28.5T), Kimi K2 (15.5T) and DeepSeek V4 (33T).
  • 1,000,000 tokens
  • Sustained at length
  • Concurrent processing
  • Code + paper + logs in-window
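The pretraining-scale comparison above can be turned into ratios with a few lines of arithmetic. The token counts are the ones quoted in the paragraph; the dictionary and function names are illustrative only.

```python
# Pretraining token counts quoted above, in trillions of tokens.
PRETRAIN_TOKENS_T = {
    "M3": 70.0,
    "GLM": 28.5,
    "Kimi K2": 15.5,
    "DeepSeek V4": 33.0,
}


def ratio_to_m3(name):
    """How many times more pretraining tokens M3 saw than the named model."""
    return PRETRAIN_TOKENS_T["M3"] / PRETRAIN_TOKENS_T[name]


for name in ("GLM", "Kimi K2", "DeepSeek V4"):
    print(f"M3 vs {name}: {ratio_to_m3(name):.1f}x the pretraining tokens")
```

On the quoted figures, M3's 70T tokens come to roughly 2.5× GLM, 4.5× Kimi K2, and 2.1× DeepSeek V4.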
[ chart · accuracy @ length vs. context window · x-axis: 0, 256K, 512K, 768K, 1M ]
[ M3 · in motion ]

Every layer of M3, in one view

From the control plane to the model core — explore the capabilities that ship in production.

01 Control Plane · Real-time observability
02 Agent Orchestration · Autonomous workflows
03 Native Multimodal · One model, every modality
04 Million-Token Context · Long-horizon memory
05 Sparse Attention · Efficient at scale
§ 05 · Agent
MiniMax Code

An agent trained with M3, for M3.

MiniMax Code is an agent product designed for M3 and trained alongside it — tuned to take full advantage of M3's long context, coding, and native multimodal capabilities. It's the recommended agent for working with MiniMax-M3. Built on the open-source OpenCode and Pi Agent harnesses, with plans to open-source the project after launch.

  • 01
    Agent Team workflow
    Producer + Verifier loops decompose, parallelize, and self-correct — running unattended for days on long-horizon tasks.
  • 02
    Deep reflection & correction
    The agent re-aligns plan and priority based on live task progress. You can step in, add requirements, and redirect at any time.
  • 03
    Computer Use
    Native multimodality lets Code operate across applications, files, and systems — say what you need, Code does the rest.
Download MiniMax Code
mcode — reproduce-iclr-paper
12:04:18 [producer] Drafting reproduction plan from PDF…
12:05:02 [ok] 14 sections parsed · 38 figures extracted
12:06:41 [producer] Scaffolding repo + writing training script…
12:09:55 [verifier] Hyper-params drift detected, re-aligning with §4.2…
12:18:09 [ok] First SFT run converged — reproducing Fig.3 trend
12:42:33 [producer] DPO experiment launched (concurrent w/ eval)…
13:11:08 [ok] Squeezing effect reproduced, Extend method validated
13:48:52 [commit] 8 commits · 3 figures · 2 tables · ✅ paper-level match
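The Producer + Verifier loop described above can be sketched in a few lines. This is a generic illustration of the pattern, not MiniMax Code's implementation: every function name and the `max_rounds` parameter are hypothetical, and the toy producer and verifier stand in for real model calls.

```python
from typing import Callable, Optional


def producer_verifier_loop(
    task: str,
    produce: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], Optional[str]],
    max_rounds: int = 5,
) -> str:
    """Alternate producer drafts and verifier checks until a draft is accepted.

    produce(task, feedback) returns a new draft; verify(task, draft) returns
    None when the draft is accepted, or a feedback string otherwise.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        draft = produce(task, feedback)
        feedback = verify(task, draft)
        if feedback is None:
            return draft
    raise RuntimeError(f"no accepted draft after {max_rounds} rounds")


# Toy stand-ins for the two roles: the verifier insists on the word "tested".
def toy_produce(task, feedback):
    return f"{task} (tested)" if feedback else task


def toy_verify(task, draft):
    return None if "tested" in draft else "add a test run"


print(producer_verifier_loop("training script", toy_produce, toy_verify))
```

Keeping the verifier's output as plain feedback text means the producer can self-correct on the next round, and the round cap keeps an unattended run from looping forever.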
§ 06 · Plans

Three tiers. Pick the one that fits your runway.

The Token Plan also applies to MiniMax Code. For the 7-day launch window, M3 API usage at ≤512K context is 50% off.
Plus
$20/mo
Per month, billed monthly
  • ~1.7B tokens / month of M3 usage
  • Full access to the MiniMax model family (M3 / M2.7 / image / speech / music)
  • Run 3–4 concurrent agents
  • Integrates with popular coding tools, with more on the way
  • 1M context window — built for long documents and large codebases
  • Native multimodal understanding: image and video input
  • Text, image, speech, and music share one quota
Purchase
Ultra
$120/mo
Per month, billed monthly
  • ~12.5B tokens / month of M3 usage
  • Full access to the MiniMax model family (M3 / M2.7 / image / speech / music)
  • Run 6–7 concurrent agents
  • Integrates with popular coding tools, with more on the way
  • 1M context window — built for long documents and large codebases
  • Native multimodal understanding: image and video input
  • Text, image, speech, and music share one quota
  • Video generation: 5 clips / day
Purchase
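For comparing the two tiers listed above, the effective price per million tokens can be worked out from the stated monthly price and approximate token allowance. This sketch assumes the full monthly allowance is used and treats the "~" figures as exact; the names are illustrative.

```python
# Monthly price and approximate M3 token allowance for each listed plan.
PLANS = {
    "Plus": {"usd_per_month": 20, "tokens_per_month": 1.7e9},
    "Ultra": {"usd_per_month": 120, "tokens_per_month": 12.5e9},
}


def usd_per_million_tokens(plan):
    """Effective price per million tokens if the whole allowance is used."""
    p = PLANS[plan]
    return p["usd_per_month"] / (p["tokens_per_month"] / 1e6)


for name in PLANS:
    print(f"{name}: ${usd_per_million_tokens(name):.4f} per million tokens")
```

On those assumptions Plus works out to about $0.0118 per million tokens and Ultra to about $0.0096; actual cost per token is higher whenever the allowance is not fully used.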
POST /v1/text/chatcompletion_v2
curl https://api.minimaxi.com/v1/text/chatcompletion_v2 \
  -H "Authorization: Bearer $M3_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M3",
    "context_window": "1M",
    "messages": [
      { "role": "user",
        "content": [
          { "type": "text",  "text": "reproduce this ICLR 2025 paper" },
          { "type": "file",  "file_id": "iclr2025-oa.pdf" },
          { "type": "image_url", "image_url": "fig-1.png" }
        ]
      }
    ]
  }'
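The same request can be assembled from Python using only the standard library. This is a minimal sketch that builds the request without sending it: the endpoint URL and model name come from the curl example above, while the function name and the placeholder key are illustrative. The file and image parts are left out because their exact schema is not confirmed here.

```python
import json
import urllib.request

API_URL = "https://api.minimaxi.com/v1/text/chatcompletion_v2"


def build_chat_request(api_key, prompt, model="MiniMax-M3"):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "YOUR_API_KEY" is a placeholder, not a real credential.
request = build_chat_request("YOUR_API_KEY", "reproduce this ICLR 2025 paper")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call; the response format is not shown on this page, so parsing is left out.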
API ≤512K · 50% off · 7 days / M3 · 1M context · multimodal
§ 07 · API

Drop-in. Multimodal. Priced to scale.

Open the MiniMax platform, pick a Token Plan or top up for usage-based billing, and integrate M3 through a single API key — for any coding agent, IDE, or your own stack.

  • Any coding agent · IDE plugins · Custom harness · SDK & REST
Open platform
  • 1M context · 512K guaranteed
  • 83.5 BrowseComp
  • v2 chatcompletion endpoint
  • MSA sparse attention
Compatible with Claude Code · Roo Code · Kilo Code · Cline · Codex CLI · OpenCode · Droid · TRAE · Grok CLI · Cursor
§ 08 · Demo · two long-horizon tasks

Two unattended tasks. Two domains. One model.

M3 was given two open-ended tasks with no human in the loop: one in research, one in systems engineering. Each ran for half a day or more, with M3 planning, debugging, and self-correcting on its own.

A
From a paper PDF to reproducible results — autonomously.

We handed M3 an ICLR 2025 Outstanding Paper Award winner — Learning Dynamics of LLM Finetuning. M3 ran unattended for nearly 12 hours, produced 18 autonomous commits and 23 experiment figures, and reproduced the paper's core results.

  • ~12h unattended runtime
  • 18 autonomous commits
  • 23 experiment figures
  • SFT trend matched
  • DPO squeezing reproduced
  • Extend method validated
Learning Dynamics of LLM Finetuning
ICLR 2025 · Outstanding Paper Award
fig. 03 · Pred. probability (SFT) · solid line: M3 reproduced · dashed line: paper original · matched · commit 18/18
M3 reproduction of Fig. 3 — SFT prediction-probability trajectory matches the original paper.
B-001 FP8 GEMM · Hopper kernel ● optimized
round     commit  util.   speedup
baseline  #001     7.6%   1.0×
round 1   #024    19.0%   2.5×
round 2   #051    38.4%   4.6×
round 3   #082    52.1%   6.4×
round 4   #109    61.7%   7.8×
round 5   #131    67.0%   8.6×
round 6   #145    71.3%   9.4×
147 submissions · 1,959 tool calls · zero intervention
M3's 6-round CUDA optimization trajectory on an NVIDIA Hopper FP8 GEMM kernel.
B
24 hours alone with an FP8 CUDA kernel.

Given only a task description, a benchmark script and a non-runnable Triton skeleton — no reference implementation — M3 spent ~24 hours optimizing an FP8 GEMM kernel on Hopper GPUs. Through 6 optimization rounds and a long plateau, it pushed peak FP8 utilization from 7.6% to 71.3% — a 9.4× speedup vs. the initial baseline.

  • ~24h unattended runtime
  • 147 benchmark submissions
  • 1,959 tool calls
  • 6 optimization rounds
  • 9.4× vs. baseline
  • 7.6% → 71.3% FP8 peak util.
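As a quick consistency check, the utilization and speedup figures from the trajectory table for task B can be compared round by round. The numbers below are copied from that table; the variable names are illustrative.

```python
# (round, peak FP8 utilization in %, reported speedup) from the task B table.
ROUNDS = [
    ("baseline", 7.6, 1.0),
    ("round 1", 19.0, 2.5),
    ("round 2", 38.4, 4.6),
    ("round 3", 52.1, 6.4),
    ("round 4", 61.7, 7.8),
    ("round 5", 67.0, 8.6),
    ("round 6", 71.3, 9.4),
]

baseline_util = ROUNDS[0][1]
for name, util, speedup in ROUNDS:
    util_ratio = util / baseline_util
    print(f"{name:9s} util {util:5.1f}%  util ratio {util_ratio:4.1f}x  reported {speedup:.1f}x")
```

The final utilization ratio, 71.3 / 7.6 ≈ 9.4, agrees with the headline 9.4× speedup. The intermediate rounds differ slightly from a pure utilization ratio, which suggests the speedup column was measured separately rather than derived from utilization.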
◢ § 06 · M3 RELEASE FRONTIER MODEL ◣ ◤ EST. 2026 · SHANGHAI MSA · 9.7× PREFILL ◥
FRONTIER MODEL · NATIVE MULTIMODAL · RELEASED 2026.06.01
M3

Frontier intelligence, delivered.

1M CONTEXT WINDOW
512K GUARANTEED USABLE
83.5 BROWSECOMP
70T PRETRAINING TOKENS
9.7× MSA PREFILL VS. M2
12h UNATTENDED RUN
ALL SYSTEMS OPERATIONAL / MiniMax-M3 · MSA ARCHITECTURE