AlphaGo Zero — Every Concept Explained

All the techniques, strategies, and parameters behind teaching a neural network to play 9×9 Go from scratch.
Self-Play (AI vs itself)
Collect Data (positions + outcomes)
Train (update neural net)
Repeat (better net → better games)
1
See
The neural network reads the board
Encoding Planes
Board described as 17 transparent sheets stacked on 9×9. Planes 1–8: Black history. 9–16: White. 17: whose turn.
17 × 9 × 9 input
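A minimal sketch of the 17-plane encoding, assuming board snapshots are stored as 9×9 arrays of 0/1 per color with the most recent snapshot last (the storage format here is an illustrative assumption, not the project's actual data structure):

```python
import numpy as np

def encode_board(black_history, white_history, black_to_move):
    """Stack 17 binary planes: 8 Black history, 8 White history, 1 turn plane.

    black_history / white_history: lists of 9x9 0/1 arrays, newest last
    (hypothetical representation for illustration).
    """
    planes = np.zeros((17, 9, 9), dtype=np.float32)
    for i in range(8):  # planes 0-7: Black stones, newest first
        if i < len(black_history):
            planes[i] = black_history[-1 - i]
    for i in range(8):  # planes 8-15: White stones, newest first
        if i < len(white_history):
            planes[8 + i] = white_history[-1 - i]
    planes[16] = 1.0 if black_to_move else 0.0  # plane 16: side to move
    return planes
```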
Residual Blocks (the “trunk”)
10 stacked layers. Floor 1 sees stones, floor 10 sees strategy. Skip connections prevent depth from hurting basics.
10 blocks, 128ch, 3M params
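Why skip connections stop depth from hurting the basics, in a toy 1-D sketch (a deliberately simplified stand-in for the real conv blocks): the block computes x + f(x), so if the learned branch contributes nothing, the block is the identity and earlier features pass through untouched.

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual block: out = x + relu(x @ weight).

    When `weight` is near zero the block reduces to the identity,
    so stacking more blocks can never erase what earlier ones computed.
    """
    residual = np.maximum(x @ weight, 0.0)  # the learned "correction"
    return x + residual                     # skip connection

x = np.array([1.0, -2.0, 3.0])
out = residual_block(x, np.zeros((3, 3)))  # zero weights: pure pass-through
```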
Policy Head — “Where to play?”
Heat map over 82 actions (81 intersections + pass). Trained on MCTS visit distributions, not just best move.
82-dim probability
Value Head — “Am I winning?”
Single number: −1 (losing) to +1 (winning). Guides MCTS depth evaluation. Collapsed in v7.
scalar ∈ [−1, +1]
8-Fold Augmentation
Go board has 8 symmetries (4 rotations × 2 flips). Every position = 8 free training samples.
8× data ✓
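The 8 symmetries are easy to generate with numpy; the same transform must also be applied to the policy target so moves stay aligned with the board (omitted here for brevity):

```python
import numpy as np

def eight_symmetries(board):
    """All 8 dihedral symmetries of a square board: 4 rotations x 2 flips."""
    syms = []
    for k in range(4):
        rot = np.rot90(board, k)
        syms.append(rot)
        syms.append(np.fliplr(rot))
    return syms
```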
guides
2
Search
MCTS thinks ahead by simulating futures
How MCTS Works
Send scouts down possible futures. Good scouts get reinforcements. After N scouts, take the most-visited path. Each: Select → Expand (ask net) → Backup.
Simulations per Move
How many futures explored before choosing. More = stronger but slower.
Ours: 200 KataGo: 400–800
C_PUCT (Curiosity vs Focus)
Explore new moves vs go deeper on good ones. Score = Q + c_puct × P × √N_parent / (1 + n_child). Higher c_puct = more exploration.
v1–6: 1.5 v7+: 1.1
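The selection rule above, as a sketch (children represented as plain (Q, prior, visits) tuples purely for illustration):

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.1):
    """PUCT: exploitation (Q) plus an exploration bonus that grows with
    the prior P and shrinks as a child accumulates visits."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

def select_child(children, parent_visits, c_puct=1.1):
    """Pick the child maximizing Q + U. children: list of (q, prior, visits)."""
    return max(range(len(children)),
               key=lambda i: puct_score(children[i][0], children[i][1],
                                        parent_visits, children[i][2], c_puct))
```

With equal Q and equal priors, the unvisited child wins the argmax, which is exactly the "curiosity" side of the trade-off.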
Dirichlet Noise (α)
Random spice on root moves. α=10/avg_legal_moves. 9×9: 0.12. We used 0.30 (chess!) for 6 versions — biggest bug.
v1–6: 0.30 !!! v7+: 0.12
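Mixing the noise into the root priors, sketched below; the 25% mixing weight follows the AlphaZero convention and is an assumption here, not something stated on the card:

```python
import numpy as np

def add_root_noise(priors, alpha=0.12, epsilon=0.25, rng=None):
    """Blend Dirichlet noise into root priors so self-play occasionally
    tries moves the policy underrates. alpha=0.12 matches the 9x9 value
    (10 / avg_legal_moves); epsilon=0.25 is the assumed mixing weight."""
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * np.asarray(priors) + epsilon * noise
```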
FPU (First Play Urgency)
Score for untried moves. 0 = “probably fine” → wastes sims. −0.2 = “probably bad” → focus proven moves first.
v1–6: 0 v7+: −0.2
Temperature Threshold
First N moves: sample in proportion to visit counts (explore openings). After N: always the most-visited (exploit).
v1–6: 30 v7+: 15
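A sketch of the threshold in action, assuming moves are chosen from MCTS visit counts:

```python
import numpy as np

def choose_move(visit_counts, move_number, threshold=15, rng=None):
    """Before `threshold` moves: sample proportionally to visits
    (temperature 1, diversifies openings). At or after: play the
    most-visited move (temperature -> 0, pure exploitation)."""
    visits = np.asarray(visit_counts, dtype=np.float64)
    if move_number < threshold:
        rng = rng or np.random.default_rng()
        return int(rng.choice(len(visits), p=visits / visits.sum()))
    return int(visits.argmax())
```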
plays
3
Score
How games end — the only ground truth
Tromp-Taylor Scoring
Area scoring: stones + empty spaces only you surround. No capture history needed. Purely positional — flood-fill, count, done. Computer-friendly.
7.5
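The "flood-fill, count, done" claim, made concrete in a sketch (capture resolution is assumed to have already happened; this only scores a final position):

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

def tromp_taylor_score(board):
    """Area score: stones on the board plus empty regions bordered by a
    single color. Regions touching both colors are neutral.
    Returns (black_points, white_points); komi is applied separately."""
    board = np.asarray(board)
    n = board.shape[0]
    score = {BLACK: int((board == BLACK).sum()),
             WHITE: int((board == WHITE).sum())}
    seen = np.zeros_like(board, dtype=bool)
    for i in range(n):
        for j in range(n):
            if board[i, j] != EMPTY or seen[i, j]:
                continue
            # flood-fill one empty region, noting which colors it touches
            stack, size, borders = [(i, j)], 0, set()
            seen[i, j] = True
            while stack:
                x, y = stack.pop()
                size += 1
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    a, b = x + dx, y + dy
                    if 0 <= a < n and 0 <= b < n:
                        if board[a, b] == EMPTY and not seen[a, b]:
                            seen[a, b] = True
                            stack.append((a, b))
                        elif board[a, b] != EMPTY:
                            borders.add(int(board[a, b]))
            if len(borders) == 1:  # territory of exactly one color
                score[borders.pop()] += size
    return score[BLACK], score[WHITE]
```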
Komi
Compensation points for White (since Black goes first). The 0.5 prevents draws. Was root cause of v7 collapse.
v1–7: fixed 6.5 v8: random around 7.5
Komi Randomization
KataGo innovation. 95%: N(7.5, 1.0). 5% extreme: N(7.5, 10.0). Both colors win → value head must read positions, not memorize “White always wins.”
v8: KataGo-style
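One way to implement the sampler; snapping to the nearest half-point so komi stays fractional (and draws stay impossible) is an assumption of this sketch:

```python
import random

def sample_komi(rng=None):
    """KataGo-style komi randomization: 95% of games draw komi from
    N(7.5, 1.0), 5% from the extreme N(7.5, 10.0). Snapping to x.5
    keeps komi fractional so games cannot end in a draw."""
    rng = rng or random.Random()
    sd = 1.0 if rng.random() < 0.95 else 10.0
    komi = rng.gauss(7.5, sd)
    return round(komi - 0.5) + 0.5  # always ends in .5
```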
45
Forbid Pass Before
Min stones on board before passing allowed. Prevents degenerate 2-move games.
v1–7: 30 v8: 45 stones

Produces z — the anchor

z = +1 (won) or −1 (lost). Objective, external, independent of the network. Breaks the circular loop — without scoring, the system is opinions training on opinions.

v7 Collapse → v8 Fix

Value=0.0000 — not perfection but catastrophe. Games too short, fixed komi → White always wins. Value learned wrong lesson. MCTS lost differentiation.

Komi: 6.5 → N(7.5, 1.0)
Value wt: 1.0 → 0.25
Pass: 30 → 45 stones
Result: policy=2.92, val=0.06 ✓
trains
4
Learn
How the neural network improves
Optimizer
SGD+momentum = ball rolling downhill. AdamW = smarter ball adapting per dimension. We use AdamW; all pros use SGD.
Ours: AdamW Pros: SGD+0.9
Batch Size
Flashcards per weight update. More = stable but slow.
v1–6: 128 v7+: 256
Learning Rate
Step size when correcting. Start big (learn fast), decay small (fine-tune).
1e-3 → 1e-4 linear
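The linear decay on the card is a one-liner; `total_steps` spanning the whole run is an assumption of this sketch:

```python
def learning_rate(step, total_steps, lr_start=1e-3, lr_end=1e-4):
    """Linear decay from lr_start at step 0 to lr_end at total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)
```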
Steps per Iteration
Gradient steps between self-play rounds. Too many = overfit stale data and never escape plateaus. Aim ~1:1 ratio of training to fresh data.
v1–6: 200 v7+: 80
Value Loss Weight
How much the brain cares about “winning?” vs “where?” Too high → value overfits, drags policy.
v1–7: 1.0 v8: 0.25
Replay Buffer
Past game positions. 500K = stale. 100K = no diversity. 300K ≈ 25 iterations of history.
300K ✓ L2: 1e-4 ✓
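A minimal buffer sketch, assuming positions are stored as (state, policy_target, z) tuples; a `deque` with `maxlen` gives the FIFO eviction for free:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO store of training positions. capacity=300_000 keeps roughly
    25 iterations of history: enough diversity, not too stale."""
    def __init__(self, capacity=300_000):
        self.data = deque(maxlen=capacity)  # oldest positions fall off

    def add_game(self, positions):
        self.data.extend(positions)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), batch_size)
```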
Trunk & Heads

Forward: Board → encode (17×9×9) → 10 residual blocks (the trunk) → splits into two heads:

Policy Head
82 probs — “where”
Value Head
scalar — “winning?”

“Head” = NN term for output branch on a shared trunk. Two heads = multi-headed network.

Backward: loss = −Σπ·log(p) + 0.25(v−z)² + L2. Both gradients → shared trunk.
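The loss in code, with the L2 term omitted on the assumption that weight decay lives in the optimizer:

```python
import numpy as np

def combined_loss(pi, p, v, z, value_weight=0.25):
    """loss = -sum(pi * log p) + value_weight * (v - z)^2.

    pi: MCTS visit distribution (target), p: policy head output,
    v: value head output, z: actual game result (+1 or -1)."""
    policy_loss = -np.sum(pi * np.log(p + 1e-10))  # cross-entropy vs visits
    value_loss = (v - z) ** 2                      # MSE vs game outcome
    return policy_loss + value_weight * value_loss
```

A uniform policy over 82 actions gives cross-entropy ln(82) ≈ 4.41, which is exactly the "random" end of the policy-loss scale below.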

The Bootstrapping Loop

Weak net → MCTS refines with 200 sims → train on visits → better net → MCTS stronger → repeat. The network learns in one forward pass what MCTS figured out in 200 sims.

Why most-visited, not highest-value? Visits encode quality and confidence. 4.7★ from 500 reviews > 5.0★ from 2.

MCTS uses the value head to judge positions — the very thing being trained. Circular by design. Even a terrible value head gives MCTS enough signal — a 51/49 preference becomes 70/30 after 200 sims.

Reading the Numbers

Policy loss (cross-entropy): how surprised the net is by MCTS choices.

Scale: 4.41 (random) → ours ↑ 2.50 (~7× better) → 0 (perfect)

Value loss (MSE): how wrong the win prediction is.

Scale: 4.0 (always wrong) → ours ↑ 0.04 → 0.00 (= v7 collapse, not perfection!)
The Full Pipeline (Linear, Not Circular)
1. Self-play generates a complete game using MCTS + current network
2. Game ends → Tromp-Taylor scores → z = +1 or −1
3. Every board position in that game gets labelled with z
4. Policy trains on MCTS visits — “where MCTS looked”
5. Value head trains on z — “who actually won”
6. Better net → better MCTS → better games → repeat

Scoring is the anchor. Without it: opinions training on opinions.

AlphaGo Zero 9×9 — Concept Guide v1–v8 — MLX on M4 Max — March 2026 Prepared by Eric San
Can the Value Head Go Wrong? — each dot = one board position’s predicted value. Vertical axis = prediction on [−1, +1].
Starts Random
Iteration 0 — untrained network
All dots flat at ≈ 0 on the −1…+1 axis.
No opinion on any position. z from first games gives the first real signal. Even a tiny learned bias → MCTS amplifies it next round.
✓ Self-bootstraps
Briefly Wrong?
What if a bad batch teaches wrong correlations?
Dots scattered, some on the wrong side; z pulls wrong predictions back.
Diverse but some wrong. (v−z)² punishes hard — prediction +0.6, reality −1 = massive gradient. 24 fresh games per iteration flush errors.
✓ Self-corrects in 1 iteration
Collapsed (v7)
The only real failure mode
All dots identical: v = −0.99 everywhere, zero variation.
Same prediction for every position. MCTS can’t tell moves apart → search is random → visits are noise → network trains on noise → death spiral.
✗ Death spiral
Key insight: The value head starts from zero (uncommitted), not from an opinion. It’s never “inverted” because it was never oriented — it learns orientation from z. Cards 1 & 2 both have variation across positions, which z can work with. Card 3 has zero variation — z provides a gradient, but every position gets the same correction, so differentiation never emerges. The only failure mode is collapse.
Deep Dive 1 — toward Economist-style visual explainers