AlphaGo Zero — Every Concept Explained

All the techniques, strategies, and parameters behind teaching a neural network to play 9×9 Go from scratch.
Self-Play (AI vs itself)
Collect Data (positions + outcomes)
Train (update neural net)
Repeat (better net → better games)
1
See
The neural network reads the board
Encoding Planes
Board described as 17 transparent sheets stacked on 9×9. Planes 1–8: Black history. 9–16: White. 17: whose turn.
17 × 9 × 9 input
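A minimal sketch of the 17-plane encoding, assuming board snapshots are stored as 9×9 arrays of 0/1 per color with the most recent snapshot last (the storage format here is an illustrative assumption, not the project's actual data structure):

```python
import numpy as np

def encode_board(black_history, white_history, black_to_move):
    """Stack 17 binary planes: 8 Black history, 8 White history, 1 turn plane.

    black_history / white_history: lists of 9x9 0/1 arrays, newest last
    (hypothetical representation for illustration).
    """
    planes = np.zeros((17, 9, 9), dtype=np.float32)
    for i in range(8):  # planes 0-7: Black stones, newest first
        if i < len(black_history):
            planes[i] = black_history[-1 - i]
    for i in range(8):  # planes 8-15: White stones, newest first
        if i < len(white_history):
            planes[8 + i] = white_history[-1 - i]
    planes[16] = 1.0 if black_to_move else 0.0  # plane 16: side to move
    return planes
```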
Residual Blocks (the “trunk”)
10 stacked layers. Floor 1 sees stones, floor 10 sees strategy. Skip connections prevent depth from hurting basics.
10 blocks, 128ch, 3M params
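Why skip connections stop depth from hurting the basics, in a toy 1-D sketch (a deliberately simplified stand-in for the real conv blocks): the block computes x + f(x), so if the learned branch contributes nothing, the block is the identity and earlier features pass through untouched.

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual block: out = x + relu(x @ weight).

    When `weight` is near zero the block reduces to the identity,
    so stacking more blocks can never erase what earlier ones computed.
    """
    residual = np.maximum(x @ weight, 0.0)  # the learned "correction"
    return x + residual                     # skip connection

x = np.array([1.0, -2.0, 3.0])
out = residual_block(x, np.zeros((3, 3)))  # zero weights: pure pass-through
```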
Policy Head — “Where to play?”
Heat map over 82 actions (81 intersections + pass). Trained on MCTS visit distributions, not just best move.
82-dim probability
Value Head — “Am I winning?”
Single number: −1 (losing) to +1 (winning). Guides MCTS depth evaluation. Collapsed in v7.
scalar ∈ [−1, +1]
8-Fold Augmentation
Go board has 8 symmetries (4 rotations × 2 flips). Every position = 8 free training samples.
8× data ✓
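The 8 symmetries are easy to generate with numpy; the same transform must also be applied to the policy target so moves stay aligned with the board (omitted here for brevity):

```python
import numpy as np

def eight_symmetries(board):
    """All 8 dihedral symmetries of a square board: 4 rotations x 2 flips."""
    syms = []
    for k in range(4):
        rot = np.rot90(board, k)
        syms.append(rot)
        syms.append(np.fliplr(rot))
    return syms
```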
guides
2
Search
MCTS thinks ahead by simulating futures
How MCTS Works
Send scouts down possible futures. Good scouts get reinforcements. After N scouts, take the most-visited path. Each: Select → Expand (ask net) → Backup.
Simulations per Move
How many futures explored before choosing. More = stronger but slower.
Ours: 200 KataGo: 400–800
C_PUCT (Curiosity vs Focus)
Explore new moves vs go deeper on good ones. Score = Q + c_puct × P × √N_parent / (1 + n_child). Higher c_puct = more exploration.
v1–6: 1.5 v7+: 1.1
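The selection rule above, as a sketch (children represented as plain (Q, prior, visits) tuples purely for illustration):

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.1):
    """PUCT: exploitation (Q) plus an exploration bonus that grows with
    the prior P and shrinks as a child accumulates visits."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

def select_child(children, parent_visits, c_puct=1.1):
    """Pick the child maximizing Q + U. children: list of (q, prior, visits)."""
    return max(range(len(children)),
               key=lambda i: puct_score(children[i][0], children[i][1],
                                        parent_visits, children[i][2], c_puct))
```

With equal Q and equal priors, the unvisited child wins the argmax, which is exactly the "curiosity" side of the trade-off.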
Dirichlet Noise (α)
Random spice on root moves. α=10/avg_legal_moves. 9×9: 0.12. We used 0.30 (chess!) for 6 versions — biggest bug.
v1–6: 0.30 !!! v7+: 0.12
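Mixing the noise into the root priors, sketched below; the 25% mixing weight follows the AlphaZero convention and is an assumption here, not something stated on the card:

```python
import numpy as np

def add_root_noise(priors, alpha=0.12, epsilon=0.25, rng=None):
    """Blend Dirichlet noise into root priors so self-play occasionally
    tries moves the policy underrates. alpha=0.12 matches the 9x9 value
    (10 / avg_legal_moves); epsilon=0.25 is the assumed mixing weight."""
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * np.asarray(priors) + epsilon * noise
```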
FPU (First Play Urgency)
Score for untried moves. 0 = “probably fine” → wastes sims. −0.2 = “probably bad” → focus proven moves first.
v1–6: 0 v7+: −0.2
Temperature Threshold
First N moves: sample in proportion to visit counts (explore openings). After N: always the most-visited (exploit).
v1–6: 30 v7+: 15
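A sketch of the threshold in action, assuming moves are chosen from MCTS visit counts:

```python
import numpy as np

def choose_move(visit_counts, move_number, threshold=15, rng=None):
    """Before `threshold` moves: sample proportionally to visits
    (temperature 1, diversifies openings). At or after: play the
    most-visited move (temperature -> 0, pure exploitation)."""
    visits = np.asarray(visit_counts, dtype=np.float64)
    if move_number < threshold:
        rng = rng or np.random.default_rng()
        return int(rng.choice(len(visits), p=visits / visits.sum()))
    return int(visits.argmax())
```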
plays
3
Score
How games end — the only ground truth
Tromp-Taylor Scoring
Area scoring: stones + empty spaces only you surround. No capture history needed. Purely positional — flood-fill, count, done. Computer-friendly.
7.5
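The "flood-fill, count, done" claim, made concrete in a sketch (capture resolution is assumed to have already happened; this only scores a final position):

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

def tromp_taylor_score(board):
    """Area score: stones on the board plus empty regions bordered by a
    single color. Regions touching both colors are neutral.
    Returns (black_points, white_points); komi is applied separately."""
    board = np.asarray(board)
    n = board.shape[0]
    score = {BLACK: int((board == BLACK).sum()),
             WHITE: int((board == WHITE).sum())}
    seen = np.zeros_like(board, dtype=bool)
    for i in range(n):
        for j in range(n):
            if board[i, j] != EMPTY or seen[i, j]:
                continue
            # flood-fill one empty region, noting which colors it touches
            stack, size, borders = [(i, j)], 0, set()
            seen[i, j] = True
            while stack:
                x, y = stack.pop()
                size += 1
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    a, b = x + dx, y + dy
                    if 0 <= a < n and 0 <= b < n:
                        if board[a, b] == EMPTY and not seen[a, b]:
                            seen[a, b] = True
                            stack.append((a, b))
                        elif board[a, b] != EMPTY:
                            borders.add(int(board[a, b]))
            if len(borders) == 1:  # territory of exactly one color
                score[borders.pop()] += size
    return score[BLACK], score[WHITE]
```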
Komi
Compensation points for White (since Black goes first). The 0.5 prevents draws. Was root cause of v7 collapse.
v1–7: fixed 6.5 v8: random around 7.5
Komi Randomization
KataGo innovation. 95%: N(7.5, 1.0). 5% extreme: N(7.5, 10.0). Both colors win → value head must read positions, not memorize “White always wins.”
v8: KataGo-style
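One way to implement the sampler; snapping to the nearest half-point so komi stays fractional (and draws stay impossible) is an assumption of this sketch:

```python
import random

def sample_komi(rng=None):
    """KataGo-style komi randomization: 95% of games draw komi from
    N(7.5, 1.0), 5% from the extreme N(7.5, 10.0). Snapping to x.5
    keeps komi fractional so games cannot end in a draw."""
    rng = rng or random.Random()
    sd = 1.0 if rng.random() < 0.95 else 10.0
    komi = rng.gauss(7.5, sd)
    return round(komi - 0.5) + 0.5  # always ends in .5
```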
45
Forbid Pass Before
Min stones on board before passing allowed. Prevents degenerate 2-move games.
v1–7: 30 v8: 45 stones

Produces z — the anchor

z = +1 (won) or −1 (lost). Objective, external, independent of the network. Breaks the circular loop — without scoring, the system is opinions training on opinions.

v7 Collapse → v8 Fix

Value=0.0000 — not perfection but catastrophe. Games too short, fixed komi → White always wins. Value learned wrong lesson. MCTS lost differentiation.

Komi: 6.5 → N(7.5, 1.0)
Value wt: 1.0 → 0.25
Pass: 30 → 45 stones
Result: policy=2.92, val=0.06 ✓
trains
4
Learn
How the neural network improves
Optimizer
SGD+momentum = ball rolling downhill. AdamW = smarter ball adapting per dimension. We use AdamW; all pros use SGD.
Ours: AdamW Pros: SGD+0.9
Batch Size
Flashcards per weight update. More = stable but slow.
v1–6: 128 v7+: 256
Learning Rate
Step size when correcting. Start big (learn fast), decay small (fine-tune).
1e-3 → 1e-4 linear
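The linear decay on the card is a one-liner; `total_steps` spanning the whole run is an assumption of this sketch:

```python
def learning_rate(step, total_steps, lr_start=1e-3, lr_end=1e-4):
    """Linear decay from lr_start at step 0 to lr_end at total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)
```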
Steps per Iteration
Gradient steps between self-play rounds. Too many = overfit stale data and never escape plateaus. Aim ~1:1 ratio of training to fresh data.
v1–6: 200 v7+: 80
Value Loss Weight
How much the brain cares about “winning?” vs “where?” Too high → value overfits, drags policy.
v1–7: 1.0 v8: 0.25
Replay Buffer
Past game positions. 500K = stale. 100K = no diversity. 300K ≈ 25 iterations of history.
300K ✓ L2: 1e-4 ✓
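A minimal buffer sketch, assuming positions are stored as (state, policy_target, z) tuples; a `deque` with `maxlen` gives the FIFO eviction for free:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO store of training positions. capacity=300_000 keeps roughly
    25 iterations of history: enough diversity, not too stale."""
    def __init__(self, capacity=300_000):
        self.data = deque(maxlen=capacity)  # oldest positions fall off

    def add_game(self, positions):
        self.data.extend(positions)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), batch_size)
```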
Trunk & Heads

Forward: Board → encode (17×9×9) → 10 residual blocks (the trunk) → splits into two heads:

Policy Head
82 probs — “where”
Value Head
scalar — “winning?”

“Head” = NN term for output branch on a shared trunk. Two heads = multi-headed network.

Backward: loss = −Σπ·log(p) + 0.25(v−z)² + L2. Both gradients → shared trunk.
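The loss in code, with the L2 term omitted on the assumption that weight decay lives in the optimizer:

```python
import numpy as np

def combined_loss(pi, p, v, z, value_weight=0.25):
    """loss = -sum(pi * log p) + value_weight * (v - z)^2.

    pi: MCTS visit distribution (target), p: policy head output,
    v: value head output, z: actual game result (+1 or -1)."""
    policy_loss = -np.sum(pi * np.log(p + 1e-10))  # cross-entropy vs visits
    value_loss = (v - z) ** 2                      # MSE vs game outcome
    return policy_loss + value_weight * value_loss
```

A uniform policy over 82 actions gives cross-entropy ln(82) ≈ 4.41, which is exactly the "random" end of the policy-loss scale below.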

The Bootstrapping Loop

Weak net → MCTS refines with 200 sims → train on visits → better net → MCTS stronger → repeat. The network learns in one forward pass what MCTS figured out in 200 sims.

Why most-visited, not highest-value? Visits encode quality and confidence. 4.7★ from 500 reviews > 5.0★ from 2.

MCTS uses the value head to judge positions — the very thing being trained. Circular by design. Even a terrible value head gives MCTS enough signal — a 51/49 preference becomes 70/30 after 200 sims.

Reading the Numbers

Policy loss (cross-entropy): how surprised the net is by MCTS choices.

Scale: 4.41 (random) → ours ↑ 2.50 (~7× better) → 0 (perfect)

Value loss (MSE): how wrong the win prediction is.

Scale: 4.0 (always wrong) → ours ↑ 0.04 → 0.00 (= v7 collapse, not perfection!)
The Full Pipeline (Linear, Not Circular)
1. Self-play generates a complete game using MCTS + current network
2. Game ends → Tromp-Taylor scores → z = +1 or −1
3. Every board position in that game gets labelled with z
4. Policy trains on MCTS visits — “where MCTS looked”
5. Value head trains on z — “who actually won”
6. Better net → better MCTS → better games → repeat

Scoring is the anchor. Without it: opinions training on opinions.

AlphaGo Zero 9×9 — Concept Guide v1–v8 — MLX on M4 Max — March 2026 Prepared by Eric San
Can the Value Head Go Wrong? — each dot = one board position’s predicted value. Vertical axis = prediction on [−1, +1].
Starts Random
Iteration 0 — untrained network
All dots flat at ≈ 0 on the −1…+1 axis.
No opinion on any position. z from first games gives the first real signal. Even a tiny learned bias → MCTS amplifies it next round.
✓ Self-bootstraps
Briefly Wrong?
What if a bad batch teaches wrong correlations?
Dots scattered, some on the wrong side; z pulls wrong predictions back.
Diverse but some wrong. (v−z)² punishes hard — prediction +0.6, reality −1 = massive gradient. 24 fresh games per iteration flush errors.
✓ Self-corrects in 1 iteration
Collapsed (v7)
The only real failure mode
All dots identical: v = −0.99 everywhere, zero variation.
Same prediction for every position. MCTS can’t tell moves apart → search is random → visits are noise → network trains on noise → death spiral.
✗ Death spiral
Key insight: The value head starts from zero (uncommitted), not from an opinion. It’s never “inverted” because it was never oriented — it learns orientation from z. Cards 1 & 2 both have variation across positions, which z can work with. Card 3 has zero variation — z provides a gradient, but every position gets the same correction, so differentiation never emerges. The only failure mode is collapse.
Deep Dive 1 — toward Economist-style visual explainers