M3 foundation model

1M context · 512K min

83.5 BrowseComp

MSA sparse attention

native multimodal

§ 01 · A new foundation

Frontier intelligence, delivered.

A foundation model that fuses coding, 1M context, and native multimodality into one coherent, production-grade system.

vs Claude · 20% of the price of Sonnet 4.6 · 12% of the price of Opus 4.7 · approaching or surpassing it in many scenarios

Read the brief · Try the API

M3 / Model

EST. 2026

M-001 Figure · M3 logomark

§ Manifesto · A new foundation

A foundation model that ships with production-grade coding, 1M context, and native multimodality — designed end-to-end, deployed in hours.

12h unattended
70T tokens
0 → 1 production

3 frontiers
1 system
0 trade-offs

fig. 01 · Object

Drawing No. · M3-A1

A single model that integrates coding agents, long-context reasoning, and multimodal understanding — built and trained together from the first step. No bolted-on adapters, no post-hoc stitching.

See capabilities

One research and engineering stack — from sparse-attention pretraining to agentic harness — delivering reliable, traceable, reproducible frontier intelligence.

Inside M3

§ 03 · Architecture

From-scratch sparse attention that holds up at 1M context.

MiniMax Sparse Attention (MSA) is engineered from the very first pretraining step — not retrofitted afterward. It keeps M3 sharp across long contexts and unlocks efficient inference at frontier scale.

01 Benchmark · vs. M2

9.7× Prefill

15.6× Decode

100% GPU util.

02 MSA Forward · PyTorch

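# MSA block forward pass (illustrative; w_q, w_k, w_v are assumed (d, d) weights)
import torch.nn.functional as F

def msa_forward(x, k_idx, l_idx, w_q, w_k, w_v):
    # x: (batch, seq, d); k_idx and l_idx: equal-length 1-D index tensors
    q = F.linear(x, w_q)              # queries from every position
    k = F.linear(x[:, k_idx], w_k)    # keys gathered from the selected positions
    v = F.linear(x[:, l_idx], w_v)    # values gathered from the selected positions
    return F.scaled_dot_product_attention(q, k, v)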

03 Index Hyper-params

block_size · k = 128

local_window · l = 64

strider_width · w = 32

Step 1 / 3 · MSA Architecture

From-scratch sparse attention
~100% GPU utilization
9.7× Prefill vs M2
15.6× Decode vs M2

Step 2 / 3 · 70T Token Training

70T pretraining tokens
Step 0 · multimodal from start
1M context window
512K guaranteed usable

Step 3 / 3 · Benchmarks

83.5 BrowseComp · > Opus 4.7 (79.3)
59.0 SWE-Bench Pro
66.0 Terminal Bench 2.1
37.1 PostTrainBench · rank 3

MSA Reference Diagram Drawing No.: M3-001 Scale: 1/1

§ 04 · Capabilities

Three frontiers, one model.

One checkpoint carries all three — no routing, no hand-offs, no separate models to stitch. Each frontier is trained in, not bolted on.

01 Coding / Agentic SOTA · production-grade
02 Native Multimodal · no translation layer
03 1M Long Context · sustained, not just big

Coding and agentic capabilities trained into the same model — for long-horizon tasks that can be decomposed, parallelized, and self-corrected across Producer + Verifier loops running unattended for days.

Long-horizon tasks
Producer + Verifier loop
Computer Use
1M context window

const m3 = await MiniMax.agent({
  model: "M3",
  context: "1M",
  tools: ["shell", "browser", "computer_use"],
  team: true,
});

// runs unattended for days
await m3.run("reproduce-paper-iclr-2025");
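The Producer + Verifier loop named above can be pictured roughly as follows. This is a minimal sketch and not MiniMax's implementation: `producer_verifier_loop`, `toy_produce`, and `toy_verify` are hypothetical names, and the round budget is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def producer_verifier_loop(
    task: str,
    produce: Callable[[str, List[str]], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Optional[str]:
    """Alternate a producer and a verifier until the verifier accepts a draft."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        draft = produce(task, feedback)   # producer proposes a candidate
        ok, note = verify(task, draft)    # verifier checks it independently
        if ok:
            return draft
        feedback.append(note)             # rejection notes steer the next draft
    return None                           # budget exhausted without acceptance


# Toy stand-ins (hypothetical): each rejection bumps the draft version.
def toy_produce(task: str, feedback: List[str]) -> str:
    return f"{task} v{len(feedback) + 1}"


def toy_verify(task: str, draft: str) -> Tuple[bool, str]:
    return (draft.endswith("v3"), f"{draft} rejected")


result = producer_verifier_loop("train-script", toy_produce, toy_verify)
print(result)
```

Keeping the verifier separate from the producer is what lets the loop self-correct: the producer only sees the verifier's notes, never its own judgment of the draft.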

No separate vision encoder. Visual and text tokens live in the same transformer parameter space, trained jointly from the very first step. Native 1-hour video (1000–3000 frames) with reasoning that flows bidirectionally between vision and language.

1-hour native video
Unified token space
Bidirectional reasoning
Screen-recording ready
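As a rough reading of the "1-hour video (1000–3000 frames)" figure, the arithmetic below shows the frame spacing it implies. Uniform sampling is an assumption here; the text does not say how frames are chosen.

```python
VIDEO_SECONDS = 60 * 60        # one hour of video, per the text
FRAME_BUDGETS = (1000, 3000)   # frame range quoted in the text


def seconds_per_frame(total_seconds: int, frames: int) -> float:
    """Gap between sampled frames, assuming uniform sampling."""
    return total_seconds / frames


spacing = {frames: seconds_per_frame(VIDEO_SECONDS, frames) for frames in FRAME_BUDGETS}
for frames, gap in spacing.items():
    print(f"{frames} frames: one frame every {gap:.1f} s ({1 / gap:.2f} fps)")
```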

[Diagram: TEXT, IMAGE, VIDEO, AUDIO, PDF, CODE → TOKEN]

One transformer, all modalities.

A 1M-token context window where intelligence holds — code, logs, and figures for a long-running task can be loaded at once and processed concurrently. Per-token compute is just 1/20 of the previous generation thanks to MSA sparse attention, applied from the first pretraining step. Trained on 70T tokens — more than GLM (28.5T), Kimi K2 (15.5T) and DeepSeek V4 (33T).

1,000,000 tokens
Sustained at length
Concurrent processing
Code + paper + logs in-window
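To put the training-token comparison in proportion, the short calculation below divides M3's 70T by each quoted figure. It uses only the numbers stated above and makes no claim about how those token counts were measured.

```python
M3_TOKENS_T = 70.0  # trillions of training tokens, as stated above
OTHERS_T = {"GLM": 28.5, "Kimi K2": 15.5, "DeepSeek V4": 33.0}

# How many times larger M3's quoted token count is than each model's.
ratios = {name: M3_TOKENS_T / tokens for name, tokens in OTHERS_T.items()}
for name, ratio in ratios.items():
    print(f"M3 vs {name}: {ratio:.1f}x the training tokens")
```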

[Chart: accuracy @ length across the context window, 0 to 1M tokens]

[ M3 · in motion ]

Every layer of M3, in one view

From the control plane to the model core — explore the capabilities that ship in production.

01 Control Plane · Real-time observability

02 Agent Orchestration · Autonomous workflows

03 Native Multimodal · One model, every modality

04 Million-Token Context · Long-horizon memory

05 Sparse Attention · Efficient at scale

§ 05 · Agent

MiniMax Code

An agent trained with M3, for M3.

MiniMax Code is an agent product designed for M3 and trained alongside it, tuned to take full advantage of M3's long context, coding, and native multimodal capabilities. It is the recommended agent for working with MiniMax-M3. It is built on the open-source OpenCode and Pi Agent harnesses, and the project is planned to be open-sourced after launch.

01 Agent Team workflow

Producer + Verifier loops decompose, parallelize, and self-correct, running unattended for days on long-horizon tasks.

02 Deep reflection & correction

The agent re-aligns plan and priority based on live task progress. You can step in, add requirements, and redirect at any time.

03 Computer Use

Native multimodality lets Code operate across applications, files, and systems: say what you need, and Code does the rest.

Download MiniMax Code

mcode — reproduce-iclr-paper

12:04:18 [producer] Drafting reproduction plan from PDF…

12:05:02 [ok] 14 sections parsed · 38 figures extracted

12:06:41 [producer] Scaffolding repo + writing training script…

12:09:55 [verifier] Hyper-params drift detected, re-aligning with §4.2…

12:18:09 [ok] First SFT run converged — reproducing Fig.3 trend

12:42:33 [producer] DPO experiment launched (concurrent w/ eval)…

13:11:08 [ok] Squeezing effect reproduced, Extend method validated

13:48:52 [commit] 8 commits · 3 figures · 2 tables · ✅ paper-level match

now [producer] Awaiting next instruction_

Plus

$20/mo

Per month, billed monthly

~1.7B tokens / month of M3 usage
Full access to the MiniMax model family (M3 / M2.7 / image / speech / music)
Run 3–4 concurrent agents
Integrates with popular coding tools, with more on the way
1M context window — built for long documents and large codebases
Native multimodal understanding: image and video input
Text, image, speech, and music share one quota

Purchase

Popular

Max

$50/mo

Per month, billed monthly

~5.1B tokens / month of M3 usage
Full access to the MiniMax model family (M3 / M2.7 / image / speech / music)
Run 4–5 concurrent agents
Integrates with popular coding tools, with more on the way
1M context window — built for long documents and large codebases
Native multimodal understanding: image and video input
Text, image, speech, and music share one quota
Video generation: 3 clips / day

Purchase

Ultra

$120/mo

Per month, billed monthly

~12.5B tokens / month of M3 usage
Full access to the MiniMax model family (M3 / M2.7 / image / speech / music)
Run 6–7 concurrent agents
Integrates with popular coding tools, with more on the way
1M context window — built for long documents and large codebases
Native multimodal understanding: image and video input
Text, image, speech, and music share one quota
Video generation: 5 clips / day

Purchase
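For a rough comparison of the three plans, the sketch below converts each listed price and approximate monthly M3 quota into an effective price per million tokens. It assumes the whole quota is spent on M3 text tokens; since the quota is shared with image, speech, and music, real usage may differ.

```python
# Plan prices (USD/month) and approximate monthly M3 token quotas, as listed above.
PLANS = {
    "Plus": (20, 1.7e9),
    "Max": (50, 5.1e9),
    "Ultra": (120, 12.5e9),
}


def usd_per_million_tokens(price_usd: float, tokens: float) -> float:
    """Effective price per one million tokens if the whole quota is used."""
    return price_usd / (tokens / 1e6)


per_million = {name: usd_per_million_tokens(p, t) for name, (p, t) in PLANS.items()}
for name, cost in per_million.items():
    print(f"{name}: ${cost:.4f} per 1M tokens")
```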

POST /v1/text/chatcompletion_v2

curl https://api.minimaxi.com/v1/text/chatcompletion_v2 \
  -H "Authorization: Bearer $M3_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M3",
    "context_window": "1M",
    "messages": [
      { "role": "user",
        "content": [
          { "type": "text",  "text": "reproduce this ICLR 2025 paper" },
          { "type": "file",  "file_id": "iclr2025-oa.pdf" },
          { "type": "image_url", "image_url": "fig-1.png" }
        ]
      }
    ]
  }'
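The same request can be assembled from Python using only the standard library. This is a sketch under assumptions: the endpoint, model name, and payload fields are copied from the curl example above, while the `Content-Type` header, the `build_request` helper, and the placeholder key are illustrative additions. The request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.minimaxi.com/v1/text/chatcompletion_v2"


def build_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the chat-completion request shown in the curl example."""
    payload = {
        "model": "MiniMax-M3",
        "context_window": "1M",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "reproduce this ICLR 2025 paper"},
                {"type": "file", "file_id": "iclr2025-oa.pdf"},
                {"type": "image_url", "image_url": "fig-1.png"},
            ],
        }],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed; not shown in the curl example
        },
        method="POST",
    )


req = build_request("sk-placeholder")  # placeholder key, not a real credential
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
print(req.get_method(), req.full_url)
```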

API ≤512K · 50% off · 7 days
M3 · 1M context · multimodal

§ 07 · API

Drop-in. Multi-modal. Priced to scale.

Open the MiniMax platform, pick a Token Plan or top up for usage-based billing, and integrate M3 through a single API key — for any coding agent, IDE, or your own stack.

Any coding agent · IDE plugins · Custom harness · SDK & REST

Open platform

1M context · 512K guaranteed
83.5 BrowseComp
v2 chatcompletion endpoint
MSA sparse attention

Compatible with Claude Code · Roo Code · Kilo Code · Cline · Codex CLI · OpenCode · Droid · TRAE · Grok CLI · Cursor

API reference · Endpoints, params & rate limits
Token Plans · Invite-only 10% off

From a paper PDF to reproducible results — autonomously.

We handed M3 an ICLR 2025 Outstanding Paper Award winner — Learning Dynamics of LLM Finetuning. M3 ran unattended for nearly 12 hours, produced 18 autonomous commits and 23 experiment figures, and reproduced the paper's core results.

~12h unattended runtime
18 autonomous commits
23 experiment figures
✓ SFT trend matched
✓ DPO squeezing reproduced
✓ Extend method validated

Learning Dynamics of LLM Finetuning

ICLR 2025 · Outstanding Paper Award

fig. 03

matched ✓

commit 18/18

M3 reproduction of Fig. 3 — SFT prediction-probability trajectory matches the original paper.

B-001 FP8 GEMM · Hopper kernel ● optimized

round      commit   util.    speedup
baseline   #001     7.6%     1.0×
round 1    #024     19.0%    2.5×
round 2    #051     38.4%    4.6×
round 3    #082     52.1%    6.4×
round 4    #109     61.7%    7.8×
round 5    #131     67.0%    8.6×
round 6    #145     71.3%    9.4×

147 submissions · 1,959 tool calls · zero intervention

M3's 6-round CUDA optimization trajectory on an NVIDIA Hopper FP8 GEMM kernel.

24 hours alone with an FP8 CUDA kernel.

Given only a task description, a benchmark script and a non-runnable Triton skeleton — no reference implementation — M3 spent ~24 hours optimizing an FP8 GEMM kernel on Hopper GPUs. Through 6 optimization rounds and a long plateau, it pushed peak FP8 utilization from 7.6% to 71.3% — a 9.4× speedup vs. the initial baseline.

~24h unattended runtime
147 benchmark submissions
1,959 tool calls
6 optimization rounds
9.4× vs. baseline
7.6% → 71.3% FP8 peak util.
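The headline speedup is consistent with the ratio of final to initial FP8 utilization. The check below assumes a fixed workload, so that throughput scales with utilization; the text does not state how the speedup was measured.

```python
BASELINE_UTIL_PCT = 7.6   # peak FP8 utilization at the baseline, per the text
FINAL_UTIL_PCT = 71.3     # peak FP8 utilization after round 6, per the text

# On a fixed workload, throughput scales with utilization (assumption).
speedup = FINAL_UTIL_PCT / BASELINE_UTIL_PCT
print(f"end-to-end speedup: {speedup:.1f}x")
```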

§ 09 · Use cases

Where M3 fits.

A selection of tasks where M3 ships in production today, from autonomous research to Computer Use. Each card is a real working scenario.


P-M3-01 Research