AlphaGo Zero — How a Neural Network Learns Go from Scratch

No human games, no databases. Just self-play, tree search, and reinforcement learning. 9×9 board, 3M parameters, Apple Silicon.
1. The Learning Loop
How AlphaGo Zero trains itself

Play → Store → Train → Repeat

Self-Play

The AI plays against itself. Both sides use the same neural network. Each game produces training data: "at this board position, MCTS chose these moves."

Replay Buffer

A memory bank of recent games. The network trains on random samples from this buffer — not just the latest game. Size matters: too small = narrow data, too big = stale data.
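A minimal sketch of this idea (class and method names are illustrative, not the project's actual API): a deque with a `maxlen` automatically evicts the oldest positions, which bounds staleness, while uniform random sampling mixes old and new games.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of recent self-play positions."""

    def __init__(self, capacity):
        # Oldest examples are evicted automatically once full
        self.data = deque(maxlen=capacity)

    def add_game(self, examples):
        # examples: (board, mcts_policy_target, game_outcome) tuples
        self.data.extend(examples)

    def sample(self, batch_size):
        # Uniform random sampling mixes recent and older games
        return random.sample(list(self.data), batch_size)

buffer = ReplayBuffer(capacity=5)
buffer.add_game([(f"pos{i}", None, +1) for i in range(8)])
print(len(buffer.data))  # → 5 (capped at capacity; oldest 3 evicted)
```

The capacity knob is exactly the "size matters" trade-off above: it controls how many past iterations of play the network keeps seeing.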

The Network

Two outputs from one brain:

Policy head: "Where should I play?" → probability for each move

Value head: "Am I winning?" → score from −1 to +1
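A toy NumPy version of the two-headed layout (illustrative only, nothing like the real 3M-parameter model): one shared trunk, then a policy head with a softmax over 82 moves (81 points + pass) and a value head squashed by tanh into [−1, +1].

```python
import numpy as np

rng = np.random.default_rng(0)
W_trunk  = rng.standard_normal((81, 64)) * 0.1   # shared features
W_policy = rng.standard_normal((64, 82)) * 0.1   # 9x9 points + pass
W_value  = rng.standard_normal((64, 1))  * 0.1   # scalar win estimate

def forward(board_vec):
    h = np.tanh(board_vec @ W_trunk)          # shared trunk
    logits = h @ W_policy
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                    # "Where should I play?"
    value = float(np.tanh(h @ W_value)[0])    # "Am I winning?" in [-1, 1]
    return policy, value

policy, value = forward(rng.standard_normal(81))
```

One trunk, two heads: both outputs are trained jointly, so the features that predict good moves also feed the win estimate.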

2. How MCTS Works
The tree search that creates training targets

Think of it like thinking ahead in chess

"If I play here, they'll play there, then I'll play..." The tree explores thousands of these futures.

After 200 simulations, the most-visited move is probably the best. The visit count distribution becomes the training target.
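The two pieces described above can be sketched as follows (the constant `c_puct` and the function names are illustrative): a PUCT-style score that balances observed value (Q) against the network's prior (P) discounted by visit counts (N), and the normalization that turns final visit counts into the policy training target.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Selection score: exploit known-good moves, explore prior-favored
    moves that haven't been visited much yet."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

def visits_to_target(visit_counts):
    """Normalize visit counts into the policy training distribution."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}

# After 200 simulations, the most-visited move dominates the target:
target = visits_to_target({"A": 120, "B": 40, "C": 40})
print(target)  # → {'A': 0.6, 'B': 0.2, 'C': 0.2}
```

Unvisited moves get a large exploration bonus, so a flat prior spreads simulations evenly — which is exactly the bootstrapping problem described below.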

Policy Targets & Sharpness

Sharp target: "Move A got 60% of visits, B got 20%." → network learns a clear preference.

Diffuse target: "10 moves each got ~10%." → network learns nothing useful. This is our problem.
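Sharpness can be quantified with entropy — the same metric quoted in the run notes below. A minimal sketch: low entropy means a clear preference, high entropy means a diffuse target.

```python
import math

def entropy(dist):
    """Shannon entropy in nats; lower = sharper distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

sharp   = [0.6, 0.2, 0.1, 0.1]   # "Move A got 60% of visits"
diffuse = [0.1] * 10             # "10 moves each got ~10%"

print(round(entropy(sharp), 3), round(entropy(diffuse), 3))  # → 1.089 2.303
```

A uniform distribution over 10 moves hits the maximum possible entropy for 10 outcomes (ln 10 ≈ 2.303) — a target that teaches the network nothing.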

The Bootstrapping Trap

Weak network → flat policy prior → MCTS explores everything equally → diffuse targets → network can't learn sharp preferences → still weak. A vicious cycle.

The Encoding Mismatch

Training feeds the network the board plus 8 turns of history (17 planes), but MCTS feeds only the current board (3 planes). The network sees different inputs in training than in search. This is what v6 fixes.
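A shape-level sketch of the mismatch (the plane layout shown follows the standard AlphaGo Zero convention — 8 history steps × 2 colors + 1 color-to-play plane; the project's exact layout may differ):

```python
import numpy as np

N = 9  # board size

# Training input: 8 history steps x 2 colors + 1 color-to-play plane
train_input = np.zeros((17, N, N))

# MCTS input: current black stones, current white stones, color-to-play
mcts_input = np.zeros((3, N, N))

# Same network, two different input shapes — it never sees
# consistent data between training and search.
print(train_input.shape, mcts_input.shape)  # → (17, 9, 9) (3, 9, 9)
```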

3. What We Learned (v1–v5)
5 training runs, 350+ iterations
Encoding consistency is #1. v1 had MCTS and training using different encodings — training was useless. v2 fixed it and instantly worked. Now we've found a subtler mismatch (history planes) that's limiting v5.

Batched MCTS = same quality as sequential. Tested directly: entropy 2.155 vs 2.156. Virtual loss with batch=8 works. Safe to use for speed.

More MCTS sims don't sharpen targets. 100→800 sims: top-1 visit share stays ~10%. The bottleneck is the network's prior, not the search depth.

Don't compare logs across code versions. v2 logged a policy loss of 2.96, but actual cross-evaluation showed 3.62. We chased a phantom for 3 runs.

Buffer sweet spot: 300K. 500K = stale data (v3 plateau). 100K = no diversity (v4 plateau). 300K retains ~10 iterations — enough curriculum without staleness.

v5 is our real best. 200 iterations, policy loss still declining. Cross-evaluation: v5 scores 3.14 vs v2's 3.62 on the same test data (lower is better). Still slow — the encoding mismatch is the next wall to break.

Next: v6

Fix the encoding mismatch. Drop history planes from training inputs so they match MCTS. One-line change. Resume from v5. The network already performs better without history (entropy 2.07 vs 3.07) — we just need to stop confusing it during training.
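A hedged sketch of what that change could look like: keep only the current-position planes from a 17-plane training input so it matches the 3-plane MCTS encoding. The plane indices below assume the standard AlphaGo Zero layout (planes 0–7 black history, 8–15 white history, 16 color-to-play); the project's real indices and actual fix may differ.

```python
import numpy as np

def drop_history(planes17):
    """Reduce a 17-plane training input to the 3-plane MCTS encoding.
    Indices are assumptions based on the standard AGZ layout."""
    current_black = planes17[0]    # most recent black-stone plane
    current_white = planes17[8]    # most recent white-stone plane
    to_play       = planes17[16]   # color-to-play indicator
    return np.stack([current_black, current_white, to_play])

planes3 = drop_history(np.zeros((17, 9, 9)))
print(planes3.shape)  # → (3, 9, 9)
```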

AlphaGo Zero 9×9 — MLX on M4 Max — March 2026 — Prepared by Eric San