AI for Math in HK Education

Product Offering Research — Essai / Talent Coop · R3.1
19 February 2026
VLM Accuracy (K-12): 57–71% (DrawEduMath leaderboard, Dec 2025)32
HK Primary Schools: 500+ (EDB statistics)39
Teacher AI Demand: >50% want grading help (HKFEW)1
AI Math Effect: g = 0.60 (GenAI meta-analysis)31

I. TL;DR + Key Metrics

Summary (R3 — restructured around "what to build") Math is wide open for Essai. Nobody owns the teach-learn-assess loop for HK primary math. Typed-answer auto-grading is production-ready: GPT-4.1-mini hits 94.5% on worked algebra solutions41, making digital question gen + auto-grade a high-confidence P0. Handwritten math grading is improving fast but not yet reliable: the DrawEduMath leaderboard (Dec 2025) shows Gemini 3 Pro at 71.3%, GPT-5 at 64%, improving ~5–10% every 6 months.32 For HK’s paper-first culture, scan-to-grade is the moat — build the pipeline now (P1), ship with mandatory teacher review, and auto-upgrade as models improve. Lead v1 with digital-input question generation + typed-answer auto-grading + class dashboard at P3–P6.1

R1 (19 Feb): Initial landscape + critical assessment. R2 (19 Feb): Data access estimation, HK pain points, feature-level landscape. R3 (19 Feb): Restructured around “what to build”; added VLM accuracy data; 40 sources. R3.1 (19 Feb): Updated with current model performance. DrawEduMath leaderboard Dec 2025 (Gemini 3 Pro: 71.3%, GPT-5: 64%). Typed grading: GPT-4.1-mini 94.5% on worked algebra (EDM 2025). Scan-to-grade upgraded from P2 back to P1 (with mandatory teacher review). 42 sources.


II. What to Build — Feature Recommendations R3 CORE

Every recommendation below is grounded in three filters: (1) what HK teachers actually ask for, (2) what’s technically feasible at primary level, and (3) what competitors don’t already do well. The narrative spine is: what should Essai build for math in HK schools?

P0 — Ship First (v1 Core)

Feature | What it does | Why P0 | Evidence
Question generation (HK curriculum, P3–P6, by topic + difficulty) | Teacher selects strand/topic/difficulty → AI generates a worksheet in seconds | >50% of HK teachers want AI to generate test questions.1 SmartQuest only covers DSE (secondary).8 Clear gap at primary level. | EdU Jockey Club pilot proved this works for P5–P6 in one HK school.3 Quezzio has ~4K generators with 5 difficulty levels.9
Auto-grading (MC + typed worked solutions) | Machine-mark MC, numeric answers, and typed worked solutions submitted digitally | >50% of teachers want AI grading help.1 GPT-4.1-mini hits 94.5% accuracy on college algebra worked solutions41 — primary-level arithmetic will be higher. Production-ready today. | EDM 2025 study: 18K responses graded, 94.5% human agreement (GPT-4.1-mini w/ self-consistency).41 SmartQuest does OMR for bubble sheets.8 Quezzio uses symbolic equivalence.9
Class dashboard (scores by topic/strand) | Teacher view: average scores, topic breakdown, weakest areas per class | Low build effort — extends existing Essai dashboard architecture.22 Teachers explicitly ask for “who struggled with what.” | SmartQuest8 and Third Space Learning11 both have class dashboards. Table stakes.
Scan-to-grade: P1 with mandatory teacher review (upgraded from R3’s P2) R3 downgraded scan-to-grade to P2 based on older VLM accuracy data (51–66%). Updated data (Dec 2025): Gemini 3 Pro hits 71.3% on real K-12 handwritten math, GPT-5 hits 64%, and the trajectory is ~5–10% improvement every 6 months.32 Still not autonomous-grading territory, but viable as a “pre-mark + teacher review” workflow — AI does a first pass, teacher overrides errors. At 71%, roughly 3 in 10 need correction — tedious but still faster than marking from scratch. Teachers who worry about accuracy (80.3%1) keep control. Build the scan pipeline at P1; ship with review UI; auto-improve as models upgrade. Meanwhile, typed-answer grading is production-ready: GPT-4.1-mini achieves 94.5% on worked algebra solutions with self-consistency.41
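To make the typed-grading claim concrete, below is a minimal sketch of self-consistency grading in the spirit of the EDM 2025 setup41: sample the same verdict several times and take the majority vote, routing low-agreement items to teacher review. The prompt wording, the `grade_once` / `grade_with_self_consistency` helpers, and the use of an OpenAI-style chat client are illustrative assumptions, not Essai's implementation or the paper's exact protocol.

```python
from collections import Counter
from openai import OpenAI  # assumes an OpenAI-style chat-completions client

client = OpenAI()

def grade_once(question: str, rubric: str, student_answer: str, model: str = "gpt-4.1-mini") -> str:
    """Single grading pass: ask the model for a one-word verdict."""
    prompt = (
        f"Question: {question}\n"
        f"Rubric: {rubric}\n"
        f"Student's typed worked solution:\n{student_answer}\n\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # sampling diversity is what makes self-consistency useful
    )
    return resp.choices[0].message.content.strip().upper()

def grade_with_self_consistency(question: str, rubric: str, student_answer: str, k: int = 5):
    """Majority vote over k independent grading passes; returns (verdict, agreement)."""
    votes = Counter(grade_once(question, rubric, student_answer) for _ in range(k))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / k  # low agreement can be routed to teacher review
```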

P1 — Build Next (after v1 trust established)

Feature | What it does | Why P1 | Evidence
Scan-to-grade (handwritten exercise books, teacher review required) | Teacher scans a stack of exercise books; AI pre-marks; teacher reviews and overrides via a review UI | This IS the moat for HK (paper-first culture). VLM accuracy now at 71.3% (Gemini 3 Pro, Dec 2025)32 — viable as a pre-mark + human review workflow. Mach/HKU digit OCR is >97%.6 Build the pipeline now; accuracy auto-improves as models upgrade. | Graded Pro uses a similar pre-mark + review model for 3,000+ UK/US schools.7 FERMAT benchmark: Gemini-1.5-Pro achieved a 77% error correction rate on grades 7–12.42
Per-student tracking + parent report | Auto-generate weekly per-student progress → shareable via WhatsApp | HK parents expect WhatsApp updates. Differentiates from SmartQuest (no parent-facing output).8 Low build cost — generated from grading data. | IXL and Third Space Learning both offer parent reports.11 WhatsApp is HK’s primary parent-school channel.
TSA / HKDSE past-paper variants | Given a real past paper, generate 5 variants with the same difficulty but different numbers; each student gets a unique version | Solves the “students memorise past papers” problem. TSA drives 80% of primary homework.4 High demand. | Quezzio already does algorithmic variation for US standards (10 generators/standard × 5 levels).9 Requires an HKEAA licence (HK$5,275–6,685/year).20
Homework quality optimiser (sketch after this table) | Instead of “30 questions on fractions,” AI recommends “these 8 cover the same learning objectives” | Aligns with the EDB “quality over quantity” homework policy.21 Reduces student burden while maintaining coverage. Unique positioning. | EDB Curriculum Guide explicitly recommends reducing homework quantity at primary level.21 HK students receive 7+ homework pieces/day.5
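The homework quality optimiser above is essentially a set-cover problem: pick the fewest questions whose tagged learning objectives still cover everything the teacher wants practised. A minimal greedy sketch; the question-bank schema and objective tags are hypothetical.

```python
def optimise_homework(question_bank: list[dict], target_objectives: set[str]) -> list[dict]:
    """Greedy set cover: repeatedly pick the question covering the most
    still-uncovered learning objectives, until all targets are covered."""
    uncovered = set(target_objectives)
    chosen: list[dict] = []
    while uncovered:
        best = max(question_bank, key=lambda q: len(uncovered & set(q["objectives"])), default=None)
        if best is None or not (uncovered & set(best["objectives"])):
            break  # remaining objectives cannot be covered by this bank
        chosen.append(best)
        uncovered -= set(best["objectives"])
    return chosen

# Example: one well-chosen question can replace several drill questions.
bank = [
    {"id": "Q1", "objectives": {"add_fractions", "simplify_fractions"}},
    {"id": "Q2", "objectives": {"compare_fractions"}},
    {"id": "Q3", "objectives": {"add_fractions", "compare_fractions", "word_problem"}},
]
print([q["id"] for q in optimise_homework(bank, {"add_fractions", "compare_fractions", "word_problem"})])
```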

P2 — Build When Accuracy Improves Further

Feature | What it does | Why P2 | Gate
Autonomous scan-to-grade (no teacher review required) | Fully automated marking of scanned exercise books with no human in the loop; P1 ships with teacher review, P2 removes it | Requires VLM accuracy >90% on handwritten primary math to maintain teacher trust (80.3% worry about accuracy).1 | Ship when VLM accuracy >90% on primary math handwriting. At the current improvement rate (~5–10%/6mo), estimated mid–late 2027.32
Step-by-step error detection (sketch after this table) | Mark where in a student’s working the error occurred, not just right/wrong | HK users already complain AI “doesn’t check each step.”16 But EMNLP 2025: <10% step-level error detection on harder math.19 Feasible at P3–P6 arithmetic only, not DSE. | Ship with teacher override as required.
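For P3–P6 arithmetic, much of the step-by-step checking above can be done deterministically before any LLM is involved: each typed line of working of the form "expression = value" can be verified exactly. A toy sketch, assuming that line format; real student working is messier.

```python
from fractions import Fraction
import re

_TOKEN = re.compile(r"\d+/\d+|\d+|[+\-*/()]")

def _eval_arithmetic(expr: str) -> Fraction:
    """Evaluate a whitelisted arithmetic expression exactly (no arbitrary code)."""
    tokens = _TOKEN.findall(expr.replace("×", "*").replace("÷", "/"))
    safe = "".join(t if t in "+-*/()" else f"Fraction('{t}')" for t in tokens)
    return eval(safe, {"Fraction": Fraction})  # only Fraction(...) calls and operators remain

def first_wrong_step(working: list[str]):
    """Return the index of the first line whose left side != right side, else None."""
    for i, line in enumerate(working):
        left, _, right = line.partition("=")
        if right and _eval_arithmetic(left) != _eval_arithmetic(right):
            return i
    return None

# The second line carries the error: 68 + 7 is 75, not 76.
print(first_wrong_step(["23 + 45 = 68", "68 + 7 = 76"]))  # -> 1
```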

P3 — Build as Moat (after adoption proven)

Feature | What it does | Why P3 | Evidence
Misconception taxonomy + error twin matching | Model why students err; surface other students making the same mistake; group for targeted reteaching | Nobody asked for this — but Third Space Learning built their entire business on it (700K+ lessons, 93% met Expected Standard).11 Classic innovator’s dilemma: the teachers who’d benefit most don’t know to ask. | MalruleLib: 101 malrules across 498 templates, 1M+ instances.38 Cross-template prediction drops to 40% — needs domain-specific training. ScaffoldiaMyMaths proves HK primary fractions work.13
Cross-subject correlation | “Students weak in math word problems also score low on Chinese reading comprehension.” | Only possible if you own both essay + math data for the same students. Essai is the only platform positioned to do this.22 | Requires meaningful data in both subjects first. v3 at earliest.
Socratic tutoring (AI follow-up) | When a student gets it wrong, AI asks guiding questions instead of showing the answer | Organic demand in HK already.18 Khan Academy’s “Explain Your Thinking”: 20–36% showed more understanding through AI conversation.12 But MathGPT has 50+ schools doing this already.10 | MathGPT is Socratic and “cheat-proof.”10 Graded Pro does viva voce follow-ups.7 Essai should not compete here in v1.
The Engagement Insight: 62% of teachers say student engagement drives edtech adoption, not time-saving.36 73% want fresh content.36 Only 18% of US K-12 teachers have tried AI at all.35 In HK, >50% find AI tools difficult to use.1 Implication: the feature teachers say they want (grading relief) may not be the feature that drives actual adoption. Engagement-first features (question variation, gamified practice) may outperform grading tools in driving sticky usage. Something to test, not assume.

III. Why These Features — HK Teacher Pain Points

1. Marking workload is the #1 thing teachers want AI to help with

>50% of HK teachers surveyed want AI to help with grading homework and generating test questions.1 66.9% want AI to organize teaching resources; 65.8% want AI to design teaching presentations.1 24% of Australian secondary teachers report spending 10+ hours per week on marking.40

2. AI accuracy is the #1 concern blocking adoption

80.3% of HK teachers worry AI-generated content contains errors.1 83.5% worry students will use AI to copy homework.1 Over 50% find it difficult to apply AI tools.1 Individual teachers report schools actively discourage AI use in teaching.14

The Trust Trap: If 80.3% of teachers already worry about AI accuracy1, trust is the binding constraint. VLMs now hit 71% on handwritten math (Gemini 3 Pro, Dec 202532) — viable for “pre-mark + teacher review” but not autonomous grading. Meanwhile, typed grading is at 94.5%.41 Strategy: launch typed grading at P0 (high confidence), scan-to-grade at P1 (with review UI), and never position AI as replacing the teacher’s judgment.

3. Students outpace teachers in AI proficiency

95% of HK students use AI tools vs 90% of teachers, but students rate themselves significantly more proficient.2 Both groups worry about over-reliance weakening critical thinking.2 OHKF recommends: framework for teachers, progressive AI literacy curriculum, one-stop resource platform.2

4. TSA preparation drives extreme workload

HK primary students receive 7+ homework assignments per day — 85% of teachers confirm the same on weekends.5 ~80% of homework is TSA-related drill.4 Nearly half of families report children experiencing “drastic” stress; some children cry during homework.4 Despite most parents not believing TSA is useful, ~50% still prepare children for it — pure “herd mentality.”4

5. Paper-first culture — digital is a minority case

The EdU Jockey Club Primary AI math pilot (P5–P6 only, with Microsoft partner “遊戲湯麵”) is one of very few examples of digital math AI in HK primary schools.3 That pilot generates questions in seconds and marks handwritten answers with a 5-level grading scale. This is at one school only — not a commercial product.3

Real HK Voices (Threads)

“老師用AI出卷, 學生用AI作文做功課, 老師再用返AI改卷, 咁老師同學生存在嘅意義係?” [“Teachers use AI to set papers, students use AI to do homework, teachers use AI to mark — what’s the point of either existing?”] — @wxharp, 3,100 likes, Sept 202515

Implication: AI that removes human involvement is seen as threatening. Essai must position as teacher-augmentation, not teacher-replacement.

“唔係每一個步驟都take嘅,就係一筆take過,有鬼用啊。如果有中間部份寫錯咗點樣批改呢” [“It doesn’t check each step, just sweeps through in one go — useless. What if the middle steps are wrong?”] — @fk_pk_1919, Jan 202616

Implication: Step-by-step verification matters to HK users. “Just marked the final answer” will get rejected.

Math tutors overheard in a HK coffee shop: “ChatGPT 很弱,但 Deepseek 更爛” [“ChatGPT is weak at math, but Deepseek is even worse”]. Private tutors are actively testing AI tools and finding them unreliable for actual math teaching. — @jagolee48, 275 likes, Feb 202517

Implication: The tutor market is testing and rejecting existing generic AI tools. There’s demand, but no trusted product yet.

“數學神技!AI + 蘇格拉底提問法 = 免費私人補習老師。好多人問 AI 數學題,淨係識叫佢俾答案。錯晒!” [“AI math mastery! Socratic method = free private tutor. Most people just ask AI for answers — totally wrong!”] — @studywithai1314, Dec 202518

Implication: Socratic/guided AI (not answer-giving) is already being promoted organically in HK. Khanmigo-style interaction has organic demand.


IV. Why These Features — HK Market Context

HK Math Curriculum Structure

The EDB organises primary math across 5 strands (Number, Algebra, Measures, Shape & Space, Data Handling) with emphasis on “quality over quantity.”21 Secondary reorganises into 3 dimensions. Key assessment gates:

Level | Assessment | What it tests | Stakes
P3 | TSA25 | Basic Competencies in math25 | Low (school feedback only)
P6 | TSA + pre-S125 | Competency + banding readiness | Medium (determines S1 banding)
S3 | Internal + TSA | Junior secondary proficiency | Medium (streaming)
S6 | HKDSE24 | Paper 1 conventional (65%, 2¼ hr) + Paper 2 MC (35%, 1¼ hr)24 | HIGH — university admission

HKDSE Compulsory Math: Paper 1 has 3 sections (A1 elementary, 35 marks; A2 harder, 35 marks; B advanced, 35 marks). Topics span indices, factorization, quadratics, functions, trigonometry, coordinate geometry, statistics, and probability.24

Cultural Context for Product Design

Six HK-Specific Constraints
  • Exam culture: Everything revolves around TSA (P3/P6)25 and HKDSE.24 Features that don’t visibly improve exam results won’t get adopted.
  • Parent pressure: Parents drive tutoring demand. A parent-facing “your child’s weaknesses” report is high-value. HK parents expect WhatsApp updates.
  • 補習 shadow market: Private tutors are early AI adopters and a signal of what schools will eventually want.17
  • Paper-first culture: Students write on paper. Any AI math tool must eventually handle scan → digitise → grade. But launching with digital-first (typed answers) is safer than launching with broken OCR.
  • HKEAA licensing: Commercial use of HKDSE/TSA past papers requires paid HKEAA licence (HK$5,275–6,685/year).20
  • EDB policy: EDB favours “quality over quantity” in homework.21 Fewer-but-better-targeted questions align with policy and reduce parent pushback.

Gov Grant Alignment

“AI for Empowering Learning and Teaching” Programme: HK$500K per school (one-off). Application deadline: Feb 28, 2026. Must implement AI in at least 3 subjects (covering 2+ levels each).23 If a school already uses Essai for Chinese + English, they need a third subject. Math is the obvious third. Essai offering Math lets schools tick all three subjects with one vendor, simplifying the grant application.22

Research Evidence: AI Math Works

Study | Finding | Implication for Essai
IXL Florida (77K+ students)26 | Students outperformed non-users on the FAST assessment; higher usage = bigger gains | Usage intensity matters. Dashboard + tracking drives the engagement loop.
IXL Holland MI RCT (Johns Hopkins)27 | ESSA Tier 1 evidence; significant math gains | Adaptive practice has causal evidence behind it.
DreamBox ESSA studies28 | “Strong” rating; 13K+ students; +0.10 effect size | Even modest effect sizes are meaningful at scale.
Springer meta-analysis 202430 | Effect size 0.343 favouring AI over traditional instruction (21 studies) | AI math instruction consistently outperforms traditional instruction.
GenAI math meta-analysis 202531 | Pooled g = 0.603; moderate-to-large positive impact | GenAI specifically (not just adaptive software) has strong evidence.
Khan Academy “Explain Your Thinking”12 | 20–36% of students showed more understanding through AI conversation | Socratic AI reveals hidden understanding. v2/v3 feature.
Photomath29 | 220M+ downloads; 2.2B problems/month; acquired by Google in 2023 | Consumer demand for math AI is massive. The school-facing product is the gap.

V. Competitive Landscape

HK Direct Competitors

Platform | Level | Focus | Schools | Strength | Weakness
SmartQuest8 | Secondary (DSE) | Paper gen + auto-mark (OMR) | 80+ (free trial) | DSE-specific, Google Classroom integration | No primary, no diagnostic, no teaching loop
AIMaths | Primary (P1–P6) | Adaptive learning + diagnostics | Unknown (new) | 5 HK curriculum strands, learning paths | Very new, limited adoption, no teacher loop
Sayo Academy | All levels | General AI teaching tools | 100+ | Broad subjects, quiz generation | Generic — not math-specialised
Vinci AI | All levels | School-based LLM infrastructure | Unknown | On-premise, 100+ pre-built AI apps | Infrastructure play, not a product

Global Comparables (Feature-Level)

Platform | Key feature | How good | HK gap
Graded Pro7 | Handwritten math grading + viva voce | 3,000+ schools UK/US; “remarkable accuracy” claim; teacher override with annotations | No HK curriculum alignment; no Cantonese support
Quezzio (Wolfram)9 (variant sketch after this table) | Algorithmic question generation | ~4K generators; symbolic equivalence grading; anti-cheating via unique variants | US Common Core only; no HK strands
MathGPT10 | Socratic “cheat-proof” tutoring | 50+ schools; $25/student; never gives direct answers | US-focused; no HK curriculum; no teacher grading loop
Third Space Learning11 | Misconception detection via voice tutoring | 700K+ lessons; 4-stage misconception/calculation distinction; 93% met Expected Standard in 2025 SATs | UK curriculum only; voice-first (not paper); B2C model
IXL26 | Adaptive practice + score tracking | ESSA Tier 1 evidence (Johns Hopkins RCT)27; Florida study: higher usage = bigger gains | US/UK standards; no HK localisation
ScaffoldiaMyMaths13 | HK primary fraction scaffolding | Research-stage; adaptive scaffolding for lower-ability HK primary students | Not a commercial product; fractions only
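Quezzio-style algorithmic variation (and the TSA past-paper variant idea in Section II) reduces to parameterised templates: fix the structure and difficulty, randomise the numbers, and derive the answer from the same parameters. A minimal sketch; the template and its fields are invented for illustration.

```python
import random

def fraction_addition_variant(seed: int) -> dict:
    """One template, many variants: same structure and difficulty, different numbers.
    Seeding per student gives each student a unique but equivalent question."""
    rng = random.Random(seed)
    denom = rng.choice([4, 5, 6, 8, 10])
    a, b = rng.randint(1, denom - 1), rng.randint(1, denom - 1)
    return {
        "question": f"Calculate {a}/{denom} + {b}/{denom}. Give your answer as a fraction.",
        "answer": f"{a + b}/{denom}",   # unsimplified; a real generator would reduce it
        "strand": "Number",
        "difficulty": 2,
    }

# Five students, five different-but-equivalent questions.
for student_id in range(5):
    print(fraction_addition_variant(seed=student_id)["question"])
```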

Adjacent HK Platforms

Platform | What | Math?
LingoTask | Chinese + English essay/oral AI grading | No — language only. 150+ schools. QEF-funded until 2028.
Goodclass.ai (HKUST) | Generic AI education platform | Vaguely — not math-specialised
HKTA AI Homework | Consumer tutoring | Yes — B2C only. HK$138/mo.

Positioning: Essai vs SmartQuest

Essai Advantages

  • Full 教學評 loop (teach + learn + assess)
  • Primary coverage (SmartQuest is DSE-only)8
  • Existing school relationships (~20 schools)22
  • Multi-subject bundling = grant capture23
  • Diagnostic layer (v2) vs. none

SmartQuest Risks

  • 80 schools already engaged (free trial)8
  • Proven OMR scanning + HKDSE format
  • Google Classroom integration live
  • Could expand to primary faster than Essai builds math
SmartQuest’s 80 Schools — Threat or Noise? SmartQuest has 80 schools on a free trial.8 Free trial ≠ paid. They cover DSE only (secondary). But if they expand downward to primary, they’d have distribution + proven OMR tech. The question is: does Essai’s “start at primary” strategy give it a protected beachhead, or is it building for the segment with lower willingness-to-pay? Primary schools have tighter budgets than secondary. The grant (HK$500K) helps, but is a one-off.23

VI. Data Access Estimation — Auro DB Current State → Math Projection

Current State

Subject | Pipeline | Volume | Teacher submissions
English | userdata → pdfdata → imagedata_full → report_score22 | 11.4K essays | 5.6K (11 schools)
Chinese | c_userdata → c_pdfdata → c_imagedata_full → report_score_c22 | 9.3K essays | 959 (8 schools)
Oral | teacherAssignment → oralReport22 | 14.5K reports | —
Math | does not exist | 0 | 0
Zero math data in Auro DB: No tables, no schema, no content.22 This is not a gap — it’s a clean slate. The essay pipeline architecture (teacher batch-uploads scan → AI processes → report generated) is proven and can be adapted for math. But all math content (question bank, rubrics, grading models, misconception taxonomy) must be built from scratch.

Adoption Patterns — What Essays Tell Us About Math

Adoption is school-driven, not teacher-driven. 海怡寶血小學 alone accounts for ~50% of all essay volume.22 A handful of champion schools drive most activity. Expect the same for math.

Chinese adoption is 6× lower than English (959 vs 5,600 teacher submissions).22 Not because fewer schools use Chinese — because the Chinese AI grading product is newer and less trusted. Teachers need to see accuracy before committing workflow. Math will face the same ramp.

Realistic Y1 target for math: 2–3 champion schools, ~500–1K graded exercises/month. Based on Chinese essay pattern: 6-month ramp to steady state, then plateau until next round of school adoption.22

Teacher Workflow Mapping (Current → With Essai Math)

Current workflow | Essai Math role
Teacher sets paper manually (textbook / past papers) | Question generation — generate by topic + difficulty in seconds
Students write answers in exercise books (paper) | v1: typed answers on device. v2+: scan paper32
Teacher collects books, marks with red pen | v1: auto-grade typed MC/numeric. v2+: scan-to-grade when VLM accuracy permits
Teacher records scores in class register | Auto-tracking — scores recorded per student by topic/strand (sketch after this table)
Teacher re-explains weak areas in class | Class dashboard — “class is weakest on fractions; here’s a targeted warm-up”
Parent asks “how is my child?” | Parent report — per-student progress shareable via WhatsApp
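The auto-tracking and class dashboard rows above reduce to simple aggregation once every graded item is tagged with student, strand, and topic. A sketch using pandas; the column names and sample data are assumptions.

```python
import pandas as pd

graded = pd.DataFrame([
    # one row per graded question, tagged at grading time
    {"student": "Chan Tai Man", "strand": "Number", "topic": "fractions", "correct": 0},
    {"student": "Chan Tai Man", "strand": "Measures", "topic": "length", "correct": 1},
    {"student": "Wong Siu Ming", "strand": "Number", "topic": "fractions", "correct": 0},
    {"student": "Wong Siu Ming", "strand": "Number", "topic": "decimals", "correct": 1},
])

# Class view: average score per strand, weakest first.
by_strand = graded.groupby("strand")["correct"].mean().sort_values()
print(by_strand)  # Number ~0.33, Measures 1.00 -> class is weakest on Number

# Per-student view feeding the parent report: accuracy per topic.
per_student = graded.groupby(["student", "topic"])["correct"].mean()
print(per_student.loc["Chan Tai Man"])
```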

OCR Constraint: The Handwriting Problem

Math handwriting recognition requires: digit recognition (0–9), symbol recognition (+, −, ×, ÷, =, fractions, brackets), layout understanding (vertical addition, long division, working steps), and diagram interpretation (geometry shapes, angles).

HKU / Mach Innovation achieved >97% accuracy on digit + symbol recognition using Xception/ResNet/UNet models trained on real HK student handwriting.6 But this is OCR (reading what’s written), not grading (judging whether the working is correct). DrawEduMath (Dec 2025 leaderboard) shows VLMs achieve 57–71% when they need to evaluate handwritten student work, up from 51–66% in the original paper, improving ~5–10% every 6 months.32 A hybrid approach (Mach OCR for recognition + LLM for evaluation) could outperform pure VLM approaches at primary level.
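A hypothetical sketch of that hybrid, separating the two layers discussed in Section IX: a recognition layer (stubbed here, standing in for a Mach/HKU-style OCR model) and an evaluation layer that checks constrained primary answers deterministically and hands anything else to an LLM grader flagged for teacher review. All function names are placeholders.

```python
def recognise(image_bytes: bytes) -> str:
    """Layer 1 (recognition): handwriting -> text. Placeholder for a Mach/HKU-style
    OCR model (>97% reported on HK digits/symbols6); not implemented here."""
    raise NotImplementedError("plug in a partnered or licensed OCR model")

def evaluate(recognised_working: str, expected_answer: str, llm_grader=None) -> dict:
    """Layer 2 (evaluation): judge the recognised working. Constrained primary answers
    are checked deterministically; anything else goes to an LLM grader and is flagged
    for teacher review."""
    lines = recognised_working.strip().splitlines() or [""]
    final_line = lines[-1]
    if final_line.replace(" ", "") == expected_answer.replace(" ", ""):
        return {"verdict": "correct", "method": "exact_match", "needs_review": False}
    if llm_grader is not None:
        return {"verdict": llm_grader(recognised_working, expected_answer),
                "method": "llm", "needs_review": True}
    return {"verdict": "incorrect", "method": "exact_match", "needs_review": True}

# Typical flow (OCR stubbed out): evaluate(recognise(scan_bytes), "3/4")
```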

Partnership Opportunity: Mach Innovation / HKU Data Science Lab. They’ve solved the hard OCR problem for HK primary math handwriting.6 Rather than rebuilding, explore a partnership or licensing arrangement. This would cut months off the eventual scan-to-grade build. Even if VLM grading isn’t ready, the OCR layer is a prerequisite.

VII. Critical Assessment of Feature Choices R3 NEW

Stance being assessed: “P0 should be question generation + auto-grading + class dashboard at P3–P6 level.”

Hidden Assumption #1: Auto-grading is technically feasible at primary level

UPDATED (R3.1): Feasibility is now stratified, not binary
  • Typed worked solutions: GPT-4.1-mini achieves 94.5% accuracy on college algebra worked solutions (18K responses, EDM 2025).41 Primary arithmetic will be higher. PRODUCTION-READY
  • Handwritten K-12 math: DrawEduMath leaderboard (Dec 2025): Gemini 3 Pro 71.3%, GPT-5 64%, improving ~5–10% per 6 months.32 FERMAT benchmark: Gemini-1.5-Pro 77% error correction on grades 7–12.42 VIABLE WITH TEACHER REVIEW
  • Digit/symbol OCR only: Mach/HKU achieves >97%.6 OCR ≠ grading, but combined with constrained answer matching for primary arithmetic, this could exceed VLM-only approaches. PARTIAL — WORTH TESTING
Reformed position: Typed auto-grading at P0 (high confidence). Handwritten scan-to-grade at P1 with mandatory teacher review UI — viable now as a “pre-mark + override” workflow. Autonomous handwriting grading (no review) at P2, gated on >90% accuracy.

Hidden Assumption #2: Question generation is the #1 teacher demand

CHALLENGE: What teachers SAY they want ≠ what drives adoption

Yes, >50% of HK teachers say they want question generation and grading help.1 But 62% of teachers globally say student engagement drives edtech adoption, not time-saving.36 Only 18% of US K-12 teachers have tried AI at all.35 EdTech sits unused because companies fail to research teacher workflows.37

Reformed position: Question generation is still P0 — it’s the lowest-risk entry point. But don’t assume it alone creates sticky usage. Monitor engagement metrics aggressively. If teachers generate questions but students don’t engage, the product dies regardless.

Hidden Assumption #3: Scan-to-grade placement

UPDATED (R3.1): P1 is correct — with the right UX

R3 downgraded scan-to-grade to P2 based on older accuracy data (51–66%). Updated DrawEduMath leaderboard (Dec 2025) shows Gemini 3 Pro at 71.3%, with ~5–10% improvement every 6 months.32 At 71%, roughly 3 in 10 answers need teacher correction — but that’s still faster than marking 10 in 10 from scratch. The key is UX: ship as “pre-mark + review” where AI does the first pass and the teacher overrides. Graded Pro uses this model for 3,000+ schools.7

Reformed position: Scan-to-grade back to P1, but with mandatory teacher review UI. The trust risk is real (80.3% worry about accuracy1), so never position it as “AI grades your homework.” Position it as “AI saves you 60% of marking time.” Autonomous grading (no review) stays at P2, gated on >90% accuracy. Invest in Mach/HKU OCR partnership6 for the recognition layer.
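In data terms, "pre-mark + override" means every AI verdict stays provisional until a teacher confirms or overrides it, and nothing is released to dashboards or parents before review. A minimal sketch of such a record and release gate; the field names are assumptions, not Essai's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreMarkedItem:
    question_id: str
    ai_verdict: str                         # "correct" / "incorrect"
    ai_confidence: float                    # e.g. self-consistency agreement
    teacher_verdict: Optional[str] = None   # set when the teacher confirms or overrides
    teacher_comment: str = ""

    @property
    def final_verdict(self) -> Optional[str]:
        # teacher review is mandatory at P1: no final mark without a teacher decision
        return self.teacher_verdict

def ready_to_release(items: list[PreMarkedItem]) -> bool:
    """Scores reach the dashboard / parent report only after every item is reviewed."""
    return all(item.teacher_verdict is not None for item in items)
```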

Hidden Assumption #4: Starting at primary is a strength

CHALLENGE: SmartQuest has 80 schools at DSE level. Is primary actually safer?

SmartQuest covers only DSE (secondary).8 Essai’s “start at primary” avoids direct competition. But: primary schools have lower budgets than secondary. DSE is a high-stakes exam with strong willingness-to-pay. TSA is a low-stakes feedback assessment25 — parents care intensely, but schools don’t need to spend money on it. ~500+ primary schools in HK39 means a large addressable market, but per-school contract value may be lower.

Reformed position: Primary is still the right beachhead — Essai has existing primary school relationships22, the grant requires 3 subjects23, and competition is weakest here. But be honest: this is a volume play, not a high-ARPU play. Plan pricing accordingly.

Hidden Assumption #5: Nobody asked for misconception diagnosis

INSIGHT: Third Space Learning built their entire business on what nobody asked for

700K+ lessons delivered. 93% of students met Expected Standard in 2025 SATs.11 They distinguish misconceptions from calculation errors through a 4-stage process — and charge premium for it. MalruleLib catalogues 101 malrules across 498 templates38 — proving that systematic error modelling is possible. ScaffoldiaMyMaths proved it works for HK primary fractions specifically.13

HK teachers who would benefit most from misconception diagnosis don’t know to ask for it. They ask for grading relief because that’s the pain they feel. Misconception diagnosis addresses the pain they can’t articulate. Classic innovator’s dilemma.

Reformed position: Still P3 — you can’t sell what teachers don’t understand yet. But start collecting error pattern data from day one (even at P0). When enough data accumulates, the misconception layer is the moat nobody can copy.

Summary: Reformed Priority Matrix (R3.1)

Priority | Feature | R3 position | R3.1 position | Why changed
P0 | Question generation (digital, P3–P6) | P0 | P0 HOLD | Still the right entry point
P0 | Auto-grading (typed worked solutions) | P0 (MC + numeric only) | P0 EXPANDED | GPT-4.1-mini 94.5% on worked algebra41 — can now grade typed steps, not just final answers
P0 | Class dashboard | P0 | P0 HOLD | Low effort, high value
P1 | Scan-to-grade (w/ teacher review) | P2 | P1 UPGRADED | Gemini 3 Pro 71.3%32 — viable as pre-mark + review workflow; Graded Pro model proven7
P1 | Parent WhatsApp report | P1 | P1 HOLD | High HK-specific value
P1 | TSA past-paper variants | P1 | P1 HOLD | High demand, exam-culture fit
P1 | Homework quality optimiser | P1 | P1 HOLD | EDB policy alignment21
P2 | Autonomous scan-to-grade (no review) | — | P2 NEW | Gated on VLM >90% accuracy; estimated mid–late 2027
P2 | Step-by-step error detection | P2 | P2 HOLD | EMNLP 2025: <10% on harder math19
P3 | Misconception taxonomy | P3 | P3 HOLD | The moat, but invest only after adoption proven
P3 | Cross-subject correlation | P3 | P3 HOLD | Needs data in both subjects first22

VIII. What Needs to Be Built

Component | Effort | Priority | Dependency
HK math question bank (P3–P6, 5 strands, tagged by topic + difficulty)21 | Medium — bootstrap with TSA past papers25 + AI generation | P0 | HKEAA licensing for past papers20
Auto-grading engine (MC + typed numeric; symbolic equivalence — sketch after this table) | Medium — LLMs reliable at primary-level arithmetic19 | P0 | Question bank
Class dashboard (math strands — extend existing Essai UI)22 | Low — reuse essay dashboard architecture | P0 | Auto-grading engine
Per-student tracking + parent WhatsApp report | Low — generated from grading data | P1 | Class dashboard
TSA past-paper variant generator25 | Medium — Quezzio model applicable9 | P1 | Question bank + HKEAA licence20
Homework quality optimiser (EDB-aligned)21 | Low–Medium — algorithm over question bank | P1 | Question bank + learning objectives mapping
Scan-to-grade OCR for handwritten math6 | High — Mach/HKU partnership or licence | P1 (mandatory teacher review; autonomous at P2) | Review UI + Mach partnership6 or VLM pre-mark; autonomy gated on >90% accuracy
Step-by-step error detection | Medium — only P3–P6 feasible19 | P2 | Scan-to-grade OCR
Misconception taxonomy (per topic)38 | High — leverage MalruleLib + teacher input | P3 | Error data accumulation from v1
HKDSE paper generator | Medium — SmartQuest benchmark8 | P3 | HKEAA licence20 + secondary curriculum mapping
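For the auto-grading engine row, "symbolic equivalence" (the approach Quezzio uses9) means accepting any answer that is algebraically equal to the key, so "2/4" and "0.5" both match "1/2". A minimal sketch with SymPy; the library choice is an assumption, not a committed dependency.

```python
from sympy import simplify, sympify, SympifyError

def equivalent(student_answer: str, answer_key: str) -> bool:
    """True if the two expressions are algebraically equal (e.g. '2/4' == '1/2')."""
    try:
        return simplify(sympify(student_answer) - sympify(answer_key)) == 0
    except (SympifyError, SyntaxError):
        return False  # unparseable input: no match here; a real system would flag it for review

print(equivalent("2/4", "1/2"))   # True
print(equivalent("0.5", "1/2"))   # True
print(equivalent("3/4", "1/2"))   # False
```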

IX. LLM & VLM Accuracy — The Technical Constraint R3 NEW

This section consolidates the accuracy evidence that gates several feature decisions.

Benchmark | What it tests | Result | Implication
DrawEduMath (NAACL 2025; leaderboard Dec 2025)32 | VLMs on 2,030 real K-12 handwritten math images; teacher-posed questions | 57–71% accuracy (Gemini 3 Pro: 71.3%, GPT-5: 64%); improving ~5–10%/6mo | Viable for pre-mark + teacher review; not yet autonomous
EDM 2025 typed grading41 | GPT-4.1-mini on 18K college algebra worked solutions | 94.5% accuracy (w/ self-consistency); GPT-4.1-nano 93.1%; GPT-4o 91.9% | Typed/digital auto-grading is production-ready for P3–P6
FERMAT (ACL 2025)42 | VLM error detection + correction on handwritten math (grades 7–12) | Gemini-1.5-Pro: 77% error correction rate | Error correction (not just detection) approaching viability
GPT-4o handwritten grading (Nov 2024)33 | GPT-4o on handwritten college math (now superseded by newer models) | “Too inaccurate for classroom deployment” | Older result; GPT-5/Gemini 3 Pro show significant improvement since
EMNLP 202519 | LLM step-level error detection | <10% accuracy on harder math | Step-by-step grading unreliable for complex problems; P3–P6 arithmetic is simpler
GSM1k (NeurIPS 2024)34 | Frontier model generalisation on grade-school math | Genuine generalisation confirmed; up to 8% accuracy drops from data contamination | LLMs can do primary-level math; contamination is a concern for benchmarking
HKU/Mach Innovation6 | Digit + symbol OCR on HK student handwriting | >97% accuracy | OCR is solved for HK; evaluation/grading is the gap
MalruleLib (Jan 2026)38 | Cross-template prediction of student malrules | Drops to 40% cross-template | Misconception detection needs domain-specific training, not general LLMs
The Two-Layer Problem

Layer 1 (OCR): Reading what the student wrote. Solved for digits/symbols (>97%).6 Harder for full expressions and layouts.

Layer 2 (Evaluation): Judging whether the student’s work is correct, identifying where errors occurred, and determining the nature of the error (calculation vs misconception). VLMs are at 57–71% on handwritten input (Dec 2025 leaderboard)32, 94.5% on typed worked solutions.41 Handwritten evaluation is the remaining binding constraint, but improving rapidly.

Essai’s v1 uses typed digital input for P0 auto-grading (94.5% accuracy, production-ready). P1 introduces handwritten scan-to-grade with mandatory teacher review (71% pre-mark accuracy, improving). This layered approach is both technically honest and pragmatically sound.
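One way to operationalise this layered approach is an explicit accuracy gate, so grading modes upgrade as measured benchmark accuracy crosses the thresholds this report proposes rather than by ad-hoc judgement. The function and threshold values below are illustrative, not a specified product behaviour.

```python
def grading_mode(input_type: str, benchmark_accuracy: float) -> str:
    """Route a submission to a grading mode based on the current measured accuracy
    for that input type (typed vs handwritten)."""
    if input_type == "typed":
        return "auto_grade"                        # ~94.5% today: P0, production
    if input_type == "handwritten":
        if benchmark_accuracy >= 0.90:
            return "auto_grade"                    # P2 gate: autonomous scan-to-grade
        if benchmark_accuracy >= 0.70:
            return "pre_mark_with_teacher_review"  # P1 today (~71%)
        return "teacher_marks_manually"
    raise ValueError(f"unknown input type: {input_type}")

print(grading_mode("handwritten", 0.713))  # -> pre_mark_with_teacher_review
```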


X. Creative Differentiators — What Only Essai Can Do

Opportunity | What it is | Why Essai specifically
Multi-subject grant bundle | Schools using Essai for Chinese + English need a 3rd subject for the HK$500K grant23 | Only Essai is positioned to offer 3 subjects from one vendor. LingoTask is language-only. SmartQuest is math-only.
Scan → Grade → Feedback (one step) | Teacher scans exercise books. AI pre-marks + teacher reviews. Returns graded work + per-student feedback + class summary. | Meets HK paper-first reality. Graded Pro does this for UK/US7 but nobody does it for the HK curriculum. Mach proves OCR works locally.6 P1 with teacher review; autonomous at P2.
“Error twin” matching (clustering sketch after this table) | Surface other students who made the same error; group for targeted reteaching | HKU/Mach already uses hierarchical clustering for this concept.6 No commercial product surfaces this for teachers.
Parent WhatsApp report | Auto-generate a weekly per-student report shareable in the parent group | HK parents expect WhatsApp. “Your child improved on fractions this week” would be viral in parent circles. Low build cost.
Cross-subject correlation | “Math word problem weakness correlates with Chinese reading comprehension.” | Only possible with both essay + math data for the same students. Essai is the only platform in position.22 v3 at earliest.
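"Error twin" matching is, mechanically, clustering students by their error-pattern vectors, the same idea as the hierarchical clustering HKU/Mach describes.6 A toy sketch with SciPy; the error-type encoding is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One row per student; one column per tagged error type
# (e.g. "added denominators", "forgot to carry", "misread place value").
students = ["Chan", "Wong", "Lee", "Ho"]
error_matrix = np.array([
    [1, 0, 0],   # Chan: adds denominators when adding fractions
    [1, 0, 0],   # Wong: same misconception -> Chan's "error twin"
    [0, 1, 0],   # Lee: carrying errors
    [0, 0, 1],   # Ho: place-value errors
])

Z = linkage(error_matrix, method="ward")
groups = fcluster(Z, t=3, criterion="maxclust")  # ask for 3 reteaching groups
for name, group in zip(students, groups):
    print(name, "-> group", group)               # Chan and Wong land in the same group
```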

XI. Verdict

CONDITIONAL YES — with honest technical constraints.

Reformed Stance (R3): Essai should build AI Math for HK schools. But the v1 must be scoped tighter than R2 suggested. Lead with digital-input question generation + typed-answer auto-grading + class dashboard at P3–P6. These are technically feasible, match teacher demand1, and competition is weakest here.8

What R3.1 Changed: Two critical data updates. (1) Typed grading is production-ready: GPT-4.1-mini hits 94.5% on worked algebra solutions41 — P0 auto-grading can now handle typed steps, not just final answers. (2) Handwritten grading is closer than R3 suggested: DrawEduMath leaderboard (Dec 2025) shows Gemini 3 Pro at 71.3%, improving ~5–10% every 6 months.32 Scan-to-grade upgraded back to P1 (with mandatory teacher review UI). Autonomous grading (no review) gated at P2 on >90% accuracy. TSA past-paper variants and homework quality optimiser remain at P1.

The Honest V1: A digital-first math tool that generates HK-curriculum-aligned questions, auto-grades typed answers, and shows teachers where the class is weak. Not revolutionary. Not a moat. But shippable, trustworthy, and grant-qualifying. The moat (scan-to-grade + misconception detection) comes in v2/v3 as the technology catches up and error data accumulates.

Remaining Risks: Zero HK math teacher interviews (all pain data is survey-level). Unknown Essai engineering capacity. HKEAA licensing needed for past-paper content.20 Primary schools have lower willingness-to-pay than secondary. “Digital-only” v1 doesn’t match the paper-first classroom reality — this is a deliberate trade-off for trust.

What Would Resolve Uncertainty: (1) 3–5 HK primary math teacher interviews (direct validation of demand and willingness to use a digital-input tool). (2) SmartQuest product teardown (how good is their OMR accuracy?). (3) Mach Innovation/HKU partnership feasibility check. (4) Pilot test: LLM grading on 20 real HK P3–P6 math papers (accuracy benchmark before building). (5) Engineering capacity check (who builds this, when?).


XII. Open Questions for Leslie

  1. Digital-first acceptable? v1 uses typed answers, not scanned handwriting. Will schools accept this or is paper-based grading a non-negotiable? If schools insist on paper, v1 timeline extends significantly — pending VLM accuracy improvements.32
  2. Which grade levels first? Research says P3–P6 (installed base, technically feasible, no competition). Verify: is there school demand for DSE Math? If yes, SmartQuest is already there with 80 schools.8
  3. What do your math teachers actually want? Surveys say grading + question gen.1 Can we interview 3–5 HK primary math teachers directly?
  4. Engineering capacity? Question bank, grading engine, new math UI. Who builds this? What’s realistic in 3–6 months?
  5. Mach Innovation / HKU partnership? They’ve solved HK handwriting OCR for math.6 Is a partnership or licensing arrangement feasible?
  6. TSA past-paper licensing? HKEAA charges HK$5–7K/year for commercial use.20 Does this work for Essai’s model?
  7. Grant timeline? Schools applying by Feb 28 — can Essai be named as intended vendor with a roadmap? Delivery window extends to Aug 2028.23
  8. Parent communication channel? WhatsApp reports would be high-impact for HK parents. Does Essai have WhatsApp integration or is it a new build?

XIII. References

[1] HKFEW (教聯會) Survey on AI in Teaching, 2025 — >50% of teachers want AI for grading; 80.3% worry about AI accuracy; 83.5% worry about student cheating; >50% find AI tools difficult to use
[2] Our Hong Kong Foundation AI Education Survey, Jul–Dec 2025 — 1,200 respondents; 95% of students vs 90% of teachers use AI; students rate themselves more proficient; over-reliance concerns
[3] EDUplus / JobMarket — EdU Jockey Club Primary AI Math Pilot, Jul 2024 — P5–P6 AI math assistant with Microsoft/遊戲湯麵; question generation in seconds; handwritten answer marking; 5-level grading scale
[4] SCMP — “Seven after-school assignments each day,” 2016 — TSA pressure; ~80% of homework TSA-related; child stress; parent herd mentality
[5] Hong Kong Free Press — Homework horror survey, 2018 — 7+ homework pieces/day; 85% of teachers confirm the same on weekends; teachers attribute it to TSA + EDB policies
[6] HKU Data Science Lab / Mach Innovation — AI Math Marking System — >97% accuracy on digit/symbol recognition; Xception/ResNet/UNet; trained on real HK student handwriting; hierarchical clustering for error grouping
[7] Graded Pro — 3,000+ schools UK/US; handwritten math grading; viva voce AI follow-up; teacher override with annotations; up to 95% faster marking
[8] SmartQuest — HKDSE paper gen + auto-marking; OMR scanning; 80+ schools on free trial; Google Classroom integration; error analysis + AI recommendations
[9] Wolfram Quezzio — ~4,000 algorithmic question generators; 10 per standard × 5 difficulty levels; symbolic equivalence grading; LMS integration
[10] MathGPT.ai — Socratic “cheat-proof” tutoring; 50+ schools; $25/student; anti-cheating via copy-paste prevention + algorithmic variation
[11] Third Space Learning — 700K+ lessons; misconception detection via voice AI tutoring; 4-stage process; 93% met Expected Standard in 2025 SATs
[12] Khan Academy — “Explain Your Thinking” AI assessment — 20–36% of students showed more understanding through AI conversation than their initial response; AI scorer aligns with human raters
[13] ScaffoldiaMyMaths — HK Primary Fraction Scaffolding (Durham/APSCE) — adaptive scaffolding for HK lower-ability primary students; AI-based real-time feedback; dynamic virtual manipulatives; fraction focus
[14] HKET — AI Teaching Survey, 2025 — schools actively discourage AI use in teaching; teachers report difficulty applying AI tools
[15] @wxharp on Threads (3,100 likes, Sept 2025) — “Teacher uses AI to set papers, student uses AI to do HW, teacher uses AI to mark — what’s the point?”
[16] @fk_pk_1919 on Threads (Jan 2026) — “AI grading doesn’t check each step — just sweeps through in one go”
[17] @jagolee48 on Threads (275 likes, Feb 2025) — HK math tutors: ChatGPT weak, Deepseek worse; testing and rejecting generic AI for math
[18] @studywithai1314 on Threads (Dec 2025) — advocating Socratic AI prompting for math; organic demand for guided AI in HK
[19] EMNLP 2025 — LLMs achieve <10% accuracy on step-level error detection for harder math; primary arithmetic is simpler but complex problems remain unsolved
[20] HKEAA — Past Paper Licensing — commercial use of HKDSE/TSA past papers requires a HK$5,275–6,685/year licence
[21] EDB Curriculum Guide & Homework Policy — “quality over quantity” in homework; math has 5 strands (primary), 3 dimensions (secondary); schools encouraged to reduce test frequency in lower grades
[22] Auro DB Exploration, Internal (Feb 9, 2026) — 42 tables; essay volumes; 海怡寶血小學 = ~50% of volume; Chinese 6× lower than English; school-driven adoption; zero math tables
[23] LegCo Panel on Education, Oct 2025 — HK$500K per school for AI in 3+ subjects; data security concerns; HK funding vs Singapore/UK centralised models
[24] HKEAA HKDSE Math Subject Information — Paper 1 (65%, 2¼ hr) + Paper 2 (35%, 1¼ hr); 3 sections in Paper 1 (A1/A2/B); topics from indices through probability
[25] HKEAA BCA/TSA — TSA at P3/P6; Basic Competencies framework for math assessment; low-stakes school feedback
[26] IXL Florida Study, Jan 2025 — 77K+ students; outperformed non-users on FAST; higher usage = bigger gains
[27] IXL RCT Holland MI, Johns Hopkins 2023 — ESSA Tier 1 evidence; randomised controlled trial; significant math gains
[28] DreamBox ESSA Studies — “Strong” ESSA rating; 13K+ students; +0.10 effect size
[29] Photomath (Wikipedia) — 220M+ downloads; 2.2B problems/month; acquired by Google in 2023
[30] AI Math Meta-Analysis, Springer 2024 — Effect size 0.343 favouring AI over traditional instruction; 21 studies
[31] GenAI Math Meta-Analysis 2025, MDPI Education — Pooled g=0.603; moderate-to-large positive impact of GenAI on math learning
[32] DrawEduMath, NAACL 2025 (Outstanding Paper Award) — 2,030 images of K-12 handwritten math; leaderboard Dec 2025: Gemini 3 Pro 71.3%, GPT-5 64.0%, Claude Opus 4.5 57.8%; improving ~5–10% per 6 months
[33] GPT-4o Handwritten Math Grading, Nov 2024 (arXiv 2411.05231) — GPT-4o on handwritten college math; too inaccurate for classroom deployment
[34] GSM1k Benchmark, NeurIPS 2024 (arXiv 2405.00332) — frontier models show genuine generalisation on grade-school math; up to 8% accuracy drops from data contamination
[35] RAND Teacher AI Usage Survey, Fall 2023 — 18% of US K-12 teachers use AI; 15% tried once and stopped
[36] eSpark EdTech Survey 2024 — 62% of teachers say student engagement drives edtech adoption; 73% want fresh content; engagement > time-saving
[37] EdWeek Market Brief 2024 — EdTech sits unused because companies fail to research teacher workflows before building
[38] MalruleLib, Chen/Liu/Sonkar, Jan 2026 — 101 malrules across 498 templates; 1M+ instances; cross-template prediction drops to 40%; systematic error modelling for math
[39] EDB School Statistics — ~500+ primary schools in Hong Kong
[40] Australian Teacher Workload Survey — 24% of secondary teachers spend 10+ hours/week marking
[41] Bhandari & Pardos, EDM 2025 — GPT-4.1-mini achieves 94.47% agreement with human grading on 18K college algebra worked solutions (self-consistency); GPT-4.1-nano 93.07%; GPT-4o 91.93%
[42] FERMAT Benchmark, ACL 2025 — 2,200+ handwritten math solutions, grades 7–12; Gemini-1.5-Pro 77% error correction rate; evaluates error detection, localisation, and correction across 4 dimensions