The AI plays against itself. Both sides use the same neural network. Each game produces training data: "at this board position, MCTS chose these moves."
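A minimal sketch of what one self-play game contributes to training. The `Example` record and `label_game` helper are hypothetical names for illustration; the idea is that every position gets the MCTS visit distribution as its policy target and the eventual game outcome as its value target, with the outcome's sign flipping each ply because players alternate.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record of one self-play position; names are illustrative.
@dataclass
class Example:
    state: list        # board encoding at this position
    pi: List[float]    # MCTS visit-count distribution (the policy target)
    z: float           # final outcome from this player's view, in [-1, +1]

def label_game(states, pis, winner_sign):
    """After the game ends, attach the outcome to every position.
    winner_sign is +1 if the first player won, -1 if they lost, 0 for a draw.
    The sign flips each ply because the players alternate."""
    examples = []
    sign = winner_sign
    for s, p in zip(states, pis):
        examples.append(Example(state=s, pi=p, z=sign))
        sign = -sign
    return examples
```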
A memory bank of recent games. The network trains on random samples from this buffer — not just the latest game. Size matters: too small = narrow data, too big = stale data.
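The memory bank above can be sketched as a bounded deque: new games push old ones out (too big and samples go stale; too small and they all come from the same few games). The capacity value here is an illustrative choice, not from the text.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of recent self-play examples (a sketch;
    the 50_000 capacity is illustrative, not prescribed)."""
    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)  # oldest examples fall off the left

    def add_game(self, examples):
        self.buf.extend(examples)

    def sample(self, batch_size):
        # Uniform random sample across many games, not just the latest one.
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))
```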
Two outputs from one brain:
Policy head: "Where should I play?" → probability for each move
Value head: "Am I winning?" → score from −1 to +1
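The two-headed structure can be sketched in a few lines of NumPy: one shared trunk, then a softmax policy head and a tanh value head. Sizes and random weights are illustrative only; the point is the shared representation feeding two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a flattened 9-cell board, 9 possible moves.
IN, HID, MOVES = 9, 32, 9

# Shared trunk plus two heads; random weights, just to show the shapes.
W_trunk = rng.normal(size=(IN, HID)) * 0.1
W_policy = rng.normal(size=(HID, MOVES)) * 0.1
W_value = rng.normal(size=(HID, 1)) * 0.1

def forward(board_vec):
    h = np.tanh(board_vec @ W_trunk)                 # shared representation
    logits = h @ W_policy
    policy = np.exp(logits) / np.exp(logits).sum()   # softmax: "where should I play?"
    value = float(np.tanh(h @ W_value))              # tanh: "am I winning?", in (-1, +1)
    return policy, value
```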
"If I play here, they'll play there, then I'll play..." The tree explores thousands of these futures.
After 200 simulations, the most-visited move is probably the best. The visit count distribution becomes the training target.
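Turning visit counts into a training target is just normalization; the most-visited move is also the one played. The counts below are made up to match the 200-simulation example.

```python
def visits_to_target(visit_counts):
    """Normalize MCTS visit counts into a probability distribution,
    which becomes the policy head's training target."""
    total = sum(visit_counts)
    return [v / total for v in visit_counts]

# e.g. after 200 simulations spread over 4 legal moves:
target = visits_to_target([120, 40, 25, 15])
best = max(range(len(target)), key=lambda i: target[i])  # play the most-visited move
```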
Sharp target: "Move A got 60% of visits, B got 20%." → network learns a clear preference.
Diffuse target: "10 moves each got ~10%." → network learns nothing useful. This is exactly our problem.
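Sharp vs. diffuse can be made quantitative with Shannon entropy, which is what the entropy numbers later in the text measure. The distributions below mirror the two examples above; a uniform spread over 10 moves sits at the maximum possible entropy, ln(10) ≈ 2.30 nats.

```python
import math

def entropy(p):
    """Shannon entropy in nats; higher = more diffuse."""
    return -sum(x * math.log(x) for x in p if x > 0)

sharp = [0.6, 0.2, 0.1, 0.1]   # clear preference for move A
diffuse = [0.1] * 10           # ten moves, ~10% each: no preference at all
```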
Weak network → flat policy prior → MCTS explores everything equally → diffuse targets → network can't learn sharp preferences → still weak. A vicious cycle.
Training feeds the network the board plus 8 turns of history (17 planes), but MCTS feeds it only the current board (3 planes). The network sees different inputs in training than in search.
The v6 fix: drop the history planes from the training inputs so they match MCTS. A one-line change; resume from v5. The network already performs better without history (entropy 2.07 vs 3.07), so we just need to stop confusing it during training.
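A sketch of the 3-plane encoding that both training and MCTS would then share. The plane order and turn-indicator convention are assumptions for illustration; the key point is that no history planes appear.

```python
import numpy as np

def encode_board(board, player):
    """Three-plane encoding (a sketch; plane order is an assumption):
    current player's stones, opponent's stones, and a constant plane
    marking whose turn it is. `board` is an NxN array of 0 (empty),
    +1, or -1."""
    board = np.asarray(board)
    own = (board == player).astype(np.float32)
    opp = (board == -player).astype(np.float32)
    turn = np.full_like(own, 1.0 if player == 1 else 0.0)
    return np.stack([own, opp, turn])  # shape (3, N, N), no history planes
```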