z = +1 (won) or −1 (lost). Objective, external, independent of the network. Breaks the circular loop — without scoring, the system is opinions training on opinions.
Value = 0.0000 — not perfection but catastrophe. Games too short, fixed komi → White always wins, so the value head learns a constant instead of reading positions. MCTS loses its differentiation between moves.
Forward: Board → encode (17×9×9) → 10 residual blocks (the trunk) → splits into two heads:
“Head” = NN term for output branch on a shared trunk. Two heads = multi-headed network.
Backward: loss = −Σπ·log(p) + 0.25(v−z)² + L2. Both gradients → shared trunk.
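A minimal numpy sketch of this combined loss. Function and argument names, and the 1e-4 L2 coefficient, are illustrative; the 0.25 value weighting and L2 term follow the formula above.

```python
import numpy as np

def combined_loss(p_logits, pi, v, z, trunk_weights, c=1e-4):
    """loss = -sum(pi * log p) + 0.25*(v - z)^2 + L2, as in the text."""
    # policy term: cross-entropy between MCTS visit distribution pi and net policy p
    log_p = p_logits - np.log(np.sum(np.exp(p_logits)))   # log-softmax
    policy_loss = -np.sum(pi * log_p)
    # value term: squared error between predicted v and game outcome z
    value_loss = 0.25 * (v - z) ** 2
    # L2 regularization over the shared trunk weights
    l2 = c * sum(np.sum(w ** 2) for w in trunk_weights)
    return policy_loss + value_loss + l2
```

Both terms depend on the same trunk activations, which is how both gradients flow into the shared trunk.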
Weak net → MCTS refines with 200 sims → train on visits → better net → MCTS stronger → repeat. The network learns in one forward pass what MCTS figured out in 200 sims.
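The loop above can be sketched as a skeleton. Everything here is a placeholder — the stubs fake the search and the gradient step — but the data flow (search → visit counts + outcome z → training) matches the text.

```python
import random

def run_mcts(net, position, sims=200):
    """Stub: the real search runs `sims` simulations guided by `net`
    and returns visit counts over legal moves."""
    moves = ["D4", "C3", "pass"]
    return {m: random.randint(1, sims) for m in moves}

def train_step(net, examples):
    """Stub: fit the policy head to normalized visit counts and the
    value head to the game outcome z."""
    return net + 1   # stand-in for updated weights

def improvement_loop(net=0, generations=3, games_per_gen=2):
    for _ in range(generations):
        examples = []
        for _ in range(games_per_gen):
            visits = run_mcts(net, position="empty 9x9 board")
            z = random.choice([+1, -1])        # external scoring, the anchor
            examples.append((visits, z))
        net = train_step(net, examples)        # net distills what search found
    return net
```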
Why most-visited, not highest-value? Visits encode quality and confidence. 4.7★ from 500 reviews > 5.0★ from 2.
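A toy illustration of that choice (move names, counts, and Q-values are made up):

```python
# hypothetical root statistics after 200 simulations
visits = {"D4": 120, "C3": 50, "E5": 30}         # how often MCTS explored each move
q_value = {"D4": 0.55, "C3": 0.70, "E5": 0.10}   # mean value per move

by_visits = max(visits, key=visits.get)     # "D4": solid value with many sims behind it
by_value = max(q_value, key=q_value.get)    # "C3": high value resting on few sims
```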
MCTS uses the value head to judge positions — the very thing being trained. Circular by design. Even a terrible value head gives MCTS enough signal — a 51/49 preference becomes 70/30 after 200 sims.
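A self-contained PUCT-style sketch of that amplification. The c_puct value, priors, and fixed Q-values are assumptions, and a real search also expands nodes and backs up values; this only shows how visit allocation turns a thin 51/49 value preference into a lopsided visit split.

```python
import math

def allocate_visits(q, prior, sims=200, c_puct=0.25):
    """Distribute simulations by repeatedly picking argmax of Q + U (PUCT)."""
    n = [0] * len(q)
    for _ in range(sims):
        total = sum(n)
        ucb = [q[i] + c_puct * prior[i] * math.sqrt(total + 1) / (1 + n[i])
               for i in range(len(q))]
        n[ucb.index(max(ucb))] += 1
    return n

# value head barely prefers move A (0.51 vs 0.49) ...
visits = allocate_visits(q=[0.51, 0.49], prior=[0.5, 0.5])
# ... but the resulting visit split is far more decisive than 51/49
```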
Policy loss (cross-entropy): how surprised the net is by MCTS choices.
Value loss (MSE): how wrong the win prediction is.
z = +1 or −1. Scoring is the anchor. Without it: opinions training on opinions.