AI for Math in HK Education

Product Offering Research — Essai / Talent Coop · R3.1
19 February 2026
VLM Accuracy (K-12): 57–71% (DrawEduMath leaderboard, Dec 2025)32
HK Primary Schools: 500+ (EDB statistics)39
Teacher AI Demand: >50% want grading help (HKFEW)1
AI Math Effect: g = 0.60 (GenAI meta-analysis)31

I. TL;DR + Key Metrics

Summary (R3 — restructured around "what to build") Math is wide open for Essai. Nobody owns the teach-learn-assess loop for HK primary math. Typed-answer auto-grading is production-ready: GPT-4.1-mini hits 94.5% on worked algebra solutions41, making digital question gen + auto-grade a high-confidence P0. Handwritten math grading is improving fast but not yet reliable: the DrawEduMath leaderboard (Dec 2025) shows Gemini 3 Pro at 71.3%, GPT-5 at 64%, improving ~5–10% every 6 months.32 For HK’s paper-first culture, scan-to-grade is the moat — build the pipeline now (P1), ship with mandatory teacher review, and auto-upgrade as models improve. Lead v1 with digital-input question generation + typed-answer auto-grading + class dashboard at P3–P6.1

R1 (19 Feb): Initial landscape + critical assessment. R2 (19 Feb): Data access estimation, HK pain points, feature-level landscape. R3 (19 Feb): Restructured around “what to build”; added VLM accuracy data; 40 sources. R3.1 (19 Feb): Updated with current model performance. DrawEduMath leaderboard Dec 2025 (Gemini 3 Pro: 71.3%, GPT-5: 64%). Typed grading: GPT-4.1-mini 94.5% on worked algebra (EDM 2025). Scan-to-grade upgraded from P2 back to P1 (with mandatory teacher review). 42 sources.


II. What to Build — Feature Recommendations R3 CORE

Every recommendation below is grounded in three filters: (1) what HK teachers actually ask for, (2) what’s technically feasible at primary level, and (3) what competitors don’t already do well. The narrative spine is: what should Essai build for math in HK schools?

P0 — Ship First (v1 Core)

Feature | What it does | Why P0 | Evidence
Question generation (HK curriculum, P3–P6, by topic + difficulty) | Teacher selects strand/topic/difficulty → AI generates a worksheet in seconds | >50% of HK teachers want AI to generate test questions.1 SmartQuest only covers DSE (secondary).8 Clear gap at primary level. | EdU Jockey Club pilot proved this works for P5–P6 in one HK school.3 Quezzio has ~4K generators with 5 difficulty levels.9
Auto-grading (MC + typed worked solutions) | Machine-mark MC, numeric answers, and typed worked solutions submitted digitally | >50% of teachers want AI grading help.1 GPT-4.1-mini hits 94.5% accuracy on college algebra worked solutions41 — primary-level arithmetic will be higher. Production-ready today. | EDM 2025 study: 18K responses graded, 94.5% human agreement (GPT-4.1-mini w/ self-consistency).41 SmartQuest does OMR for bubble sheets.8 Quezzio uses symbolic equivalence.9
Class dashboard (scores by topic/strand) | Teacher view: average scores, topic breakdown, weakest areas per class | Low build effort — extends existing Essai dashboard architecture.22 Teachers explicitly ask for “who struggled with what.” | SmartQuest8 and Third Space Learning11 both have class dashboards. Table stakes.
Scan-to-grade: P1 with mandatory teacher review (upgraded from R3’s P2) R3 downgraded scan-to-grade to P2 based on older VLM accuracy data (51–66%). Updated data (Dec 2025): Gemini 3 Pro hits 71.3% on real K-12 handwritten math, GPT-5 hits 64%, and the trajectory is ~5–10% improvement every 6 months.32 Still not autonomous-grading territory, but viable as a “pre-mark + teacher review” workflow — AI does a first pass, teacher overrides errors. At 71%, roughly 3 in 10 need correction — tedious but still faster than marking from scratch. Teachers who worry about accuracy (80.3%1) keep control. Build the scan pipeline at P1; ship with review UI; auto-improve as models upgrade. Meanwhile, typed-answer grading is production-ready: GPT-4.1-mini achieves 94.5% on worked algebra solutions with self-consistency.41
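To make the typed-grading claim concrete, below is a minimal sketch of self-consistency grading in the spirit of the EDM 2025 setup41: sample the same verdict several times and take the majority vote, routing low-agreement items to teacher review. The prompt wording, the `grade_once` / `grade_with_self_consistency` helpers, and the use of an OpenAI-style chat client are illustrative assumptions, not Essai's implementation or the paper's exact protocol.

```python
from collections import Counter
from openai import OpenAI  # assumes an OpenAI-style chat-completions client

client = OpenAI()

def grade_once(question: str, rubric: str, student_answer: str, model: str = "gpt-4.1-mini") -> str:
    """Single grading pass: ask the model for a one-word verdict."""
    prompt = (
        f"Question: {question}\n"
        f"Rubric: {rubric}\n"
        f"Student's typed worked solution:\n{student_answer}\n\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # sampling diversity is what makes self-consistency useful
    )
    return resp.choices[0].message.content.strip().upper()

def grade_with_self_consistency(question: str, rubric: str, student_answer: str, k: int = 5):
    """Majority vote over k independent grading passes; returns (verdict, agreement)."""
    votes = Counter(grade_once(question, rubric, student_answer) for _ in range(k))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / k  # low agreement can be routed to teacher review
```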

P1 — Build Next (after v1 trust established)

Feature | What it does | Why P1 | Evidence
Scan-to-grade (handwritten exercise books, teacher review required) | Teacher scans a stack of exercise books; AI pre-marks; teacher reviews and overrides via a review UI | This IS the moat for HK (paper-first culture). VLM accuracy now at 71.3% (Gemini 3 Pro, Dec 2025)32 — viable as a pre-mark + human review workflow. Mach/HKU digit OCR is >97%.6 Build the pipeline now; accuracy auto-improves as models upgrade. | Graded Pro uses a similar pre-mark + review model for 3,000+ UK/US schools.7 FERMAT benchmark: Gemini-1.5-Pro achieved a 77% error correction rate on grades 7–12.42
Per-student tracking + parent report | Auto-generate weekly per-student progress → shareable via WhatsApp | HK parents expect WhatsApp updates. Differentiates from SmartQuest (no parent-facing output).8 Low build cost — generated from grading data. | IXL and Third Space Learning both offer parent reports.11 WhatsApp is HK’s primary parent-school channel.
TSA / HKDSE past-paper variants | Given a real past paper, generate 5 variants with the same difficulty but different numbers; each student gets a unique version | Solves the “students memorise past papers” problem. TSA drives 80% of primary homework.4 High demand. | Quezzio already does algorithmic variation for US standards (10 generators/standard × 5 levels).9 Requires an HKEAA licence (HK$5,275–6,685/year).20
Homework quality optimiser (sketch after this table) | Instead of “30 questions on fractions,” AI recommends “these 8 cover the same learning objectives” | Aligns with the EDB “quality over quantity” homework policy.21 Reduces student burden while maintaining coverage. Unique positioning. | EDB Curriculum Guide explicitly recommends reducing homework quantity at primary level.21 HK students receive 7+ homework pieces/day.5
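The homework quality optimiser above is essentially a set-cover problem: pick the fewest questions whose tagged learning objectives still cover everything the teacher wants practised. A minimal greedy sketch; the question-bank schema and objective tags are hypothetical.

```python
def optimise_homework(question_bank: list[dict], target_objectives: set[str]) -> list[dict]:
    """Greedy set cover: repeatedly pick the question covering the most
    still-uncovered learning objectives, until all targets are covered."""
    uncovered = set(target_objectives)
    chosen: list[dict] = []
    while uncovered:
        best = max(question_bank, key=lambda q: len(uncovered & set(q["objectives"])), default=None)
        if best is None or not (uncovered & set(best["objectives"])):
            break  # remaining objectives cannot be covered by this bank
        chosen.append(best)
        uncovered -= set(best["objectives"])
    return chosen

# Example: one well-chosen question can replace several drill questions.
bank = [
    {"id": "Q1", "objectives": {"add_fractions", "simplify_fractions"}},
    {"id": "Q2", "objectives": {"compare_fractions"}},
    {"id": "Q3", "objectives": {"add_fractions", "compare_fractions", "word_problem"}},
]
print([q["id"] for q in optimise_homework(bank, {"add_fractions", "compare_fractions", "word_problem"})])
```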

P2 — Build When Accuracy Improves Further

Feature | What it does | Why P2 | Gate
Autonomous scan-to-grade (no teacher review required) | Fully automated marking of scanned exercise books with no human in the loop; P1 ships with teacher review, P2 removes it | Requires VLM accuracy >90% on handwritten primary math to maintain teacher trust (80.3% worry about accuracy).1 | Ship when VLM accuracy >90% on primary math handwriting. At the current improvement rate (~5–10%/6mo), estimated mid–late 2027.32
Step-by-step error detection (sketch after this table) | Mark where in a student’s working the error occurred, not just right/wrong | HK users already complain AI “doesn’t check each step.”16 But EMNLP 2025: <10% step-level error detection on harder math.19 Feasible at P3–P6 arithmetic only, not DSE. | Ship with teacher override as required.
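For P3–P6 arithmetic, much of the step-by-step checking above can be done deterministically before any LLM is involved: each typed line of working of the form "expression = value" can be verified exactly. A toy sketch, assuming that line format; real student working is messier.

```python
from fractions import Fraction
import re

_TOKEN = re.compile(r"\d+/\d+|\d+|[+\-*/()]")

def _eval_arithmetic(expr: str) -> Fraction:
    """Evaluate a whitelisted arithmetic expression exactly (no arbitrary code)."""
    tokens = _TOKEN.findall(expr.replace("×", "*").replace("÷", "/"))
    safe = "".join(t if t in "+-*/()" else f"Fraction('{t}')" for t in tokens)
    return eval(safe, {"Fraction": Fraction})  # only Fraction(...) calls and operators remain

def first_wrong_step(working: list[str]):
    """Return the index of the first line whose left side != right side, else None."""
    for i, line in enumerate(working):
        left, _, right = line.partition("=")
        if right and _eval_arithmetic(left) != _eval_arithmetic(right):
            return i
    return None

# The second line carries the error: 68 + 7 is 75, not 76.
print(first_wrong_step(["23 + 45 = 68", "68 + 7 = 76"]))  # -> 1
```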

P3 — Build as Moat (after adoption proven)

Feature | What it does | Why P3 | Evidence
Misconception taxonomy + error twin matching | Model why students err; surface other students making the same mistake; group for targeted reteaching | Nobody asked for this — but Third Space Learning built their entire business on it (700K+ lessons, 93% met Expected Standard).11 Classic innovator’s dilemma: the teachers who’d benefit most don’t know to ask. | MalruleLib: 101 malrules across 498 templates, 1M+ instances.38 Cross-template prediction drops to 40% — needs domain-specific training. ScaffoldiaMyMaths proves HK primary fractions work.13
Cross-subject correlation | “Students weak in math word problems also score low on Chinese reading comprehension.” | Only possible if you own both essay + math data for the same students. Essai is the only platform positioned to do this.22 | Requires meaningful data in both subjects first. v3 at earliest.
Socratic tutoring (AI follow-up) | When a student gets it wrong, AI asks guiding questions instead of showing the answer | Organic demand in HK already.18 Khan Academy’s “Explain Your Thinking”: 20–36% showed more understanding through AI conversation.12 But MathGPT has 50+ schools doing this already.10 | MathGPT is Socratic and “cheat-proof.”10 Graded Pro does viva voce follow-ups.7 Essai should not compete here in v1.
The Engagement Insight: 62% of teachers say student engagement drives edtech adoption, not time-saving.36 73% want fresh content.36 Only 18% of US K-12 teachers have tried AI at all.35 In HK, >50% find AI tools difficult to use.1 Implication: the feature teachers say they want (grading relief) may not be the feature that drives actual adoption. Engagement-first features (question variation, gamified practice) may outperform grading tools in driving sticky usage. Something to test, not assume.

III. Why These Features — HK Teacher Pain Points

1. Marking workload is the #1 thing teachers want AI to help with

>50% of HK teachers surveyed want AI to help with grading homework and generating test questions.1 66.9% want AI to organize teaching resources; 65.8% want AI to design teaching presentations.1 24% of Australian secondary teachers report spending 10+ hours per week on marking.40

2. AI accuracy is the #1 concern blocking adoption

80.3% of HK teachers worry AI-generated content contains errors.1 83.5% worry students will use AI to copy homework.1 Over 50% find it difficult to apply AI tools.1 Individual teachers report schools actively discourage AI use in teaching.14

The Trust Trap: If 80.3% of teachers already worry about AI accuracy1, trust is the binding constraint. VLMs now hit 71% on handwritten math (Gemini 3 Pro, Dec 202532) — viable for “pre-mark + teacher review” but not autonomous grading. Meanwhile, typed grading is at 94.5%.41 Strategy: launch typed grading at P0 (high confidence), scan-to-grade at P1 (with review UI), and never position AI as replacing the teacher’s judgment.

3. Students outpace teachers in AI proficiency

95% of HK students use AI tools vs 90% of teachers, but students rate themselves significantly more proficient.2 Both groups worry about over-reliance weakening critical thinking.2 OHKF recommends: framework for teachers, progressive AI literacy curriculum, one-stop resource platform.2

4. TSA preparation drives extreme workload

HK primary students receive 7+ homework assignments per day — 85% of teachers confirm the same on weekends.5 ~80% of homework is TSA-related drill.4 Nearly half of families report children experiencing “drastic” stress; some children cry during homework.4 Despite most parents not believing TSA is useful, ~50% still prepare children for it — pure “herd mentality.”4

5. Paper-first culture — digital is a minority case

The EdU Jockey Club Primary AI math pilot (P5–P6 only, with Microsoft partner “遊戲湯麵”) is one of very few examples of digital math AI in HK primary schools.3 That pilot generates questions in seconds and marks handwritten answers with a 5-level grading scale. This is at one school only — not a commercial product.3

Real HK Voices (Threads)

“老師用AI出卷, 學生用AI作文做功課, 老師再用返AI改卷, 咁老師同學生存在嘅意義係?” [“Teachers use AI to set papers, students use AI to do homework, teachers use AI to mark — what’s the point of either existing?”] — @wxharp, 3,100 likes, Sept 202515

Implication: AI that removes human involvement is seen as threatening. Essai must position as teacher-augmentation, not teacher-replacement.

“唔係每一個步驟都take嘅,就係一筆take過,有鬼用啊。如果有中間部份寫錯咗點樣批改呢” [“It doesn’t check each step, just sweeps through in one go — useless. What if the middle steps are wrong?”] — @fk_pk_1919, Jan 202616

Implication: Step-by-step verification matters to HK users. “Just marked the final answer” will get rejected.

Math tutors overheard in a HK coffee shop: “ChatGPT 很弱,但 Deepseek 更爛” [“ChatGPT is weak at math, but Deepseek is even worse”]. Private tutors are actively testing AI tools and finding them unreliable for actual math teaching. — @jagolee48, 275 likes, Feb 202517

Implication: The tutor market is testing and rejecting existing generic AI tools. There’s demand, but no trusted product yet.

“數學神技!AI + 蘇格拉底提問法 = 免費私人補習老師。好多人問 AI 數學題,淨係識叫佢俾答案。錯晒!” [“AI math mastery! Socratic method = free private tutor. Most people just ask AI for answers — totally wrong!”] — @studywithai1314, Dec 202518

Implication: Socratic/guided AI (not answer-giving) is already being promoted organically in HK. Khanmigo-style interaction has organic demand.


IV. Why These Features — HK Market Context

HK Math Curriculum Structure

The EDB organises primary math across 5 strands (Number, Algebra, Measures, Shape & Space, Data Handling) with emphasis on “quality over quantity.”21 Secondary reorganises into 3 dimensions. Key assessment gates:

Level | Assessment | What it tests | Stakes
P3 | TSA25 | Basic Competencies in math25 | Low (school feedback only)
P6 | TSA + pre-S125 | Competency + banding readiness | Medium (determines S1 banding)
S3 | Internal + TSA | Junior secondary proficiency | Medium (streaming)
S6 | HKDSE24 | Paper 1 conventional (65%, 2¼ hr) + Paper 2 MC (35%, 1¼ hr)24 | HIGH — university admission

HKDSE Compulsory Math: Paper 1 has 3 sections (A1 elementary, 35 marks; A2 harder, 35 marks; B advanced, 35 marks). Topics span indices, factorization, quadratics, functions, trigonometry, coordinate geometry, statistics, and probability.24

Cultural Context for Product Design

Six HK-Specific Constraints
  • Exam culture: Everything revolves around TSA (P3/P6)25 and HKDSE.24 Features that don’t visibly improve exam results won’t get adopted.
  • Parent pressure: Parents drive tutoring demand. A parent-facing “your child’s weaknesses” report is high-value. HK parents expect WhatsApp updates.
  • 補習 shadow market: Private tutors are early AI adopters and a signal of what schools will eventually want.17
  • Paper-first culture: Students write on paper. Any AI math tool must eventually handle scan → digitise → grade. But launching with digital-first (typed answers) is safer than launching with broken OCR.
  • HKEAA licensing: Commercial use of HKDSE/TSA past papers requires paid HKEAA licence (HK$5,275–6,685/year).20
  • EDB policy: EDB favours “quality over quantity” in homework.21 Fewer-but-better-targeted questions align with policy and reduce parent pushback.

Gov Grant Alignment

“AI for Empowering Learning and Teaching” Programme: HK$500K per school (one-off). Application deadline: Feb 28, 2026. Must implement AI in at least 3 subjects (covering 2+ levels each).23 If a school already uses Essai for Chinese + English, they need a third subject. Math is the obvious third. Essai offering Math lets schools tick all three subjects with one vendor, simplifying the grant application.22

Research Evidence: AI Math Works

Study | Finding | Implication for Essai
IXL Florida (77K+ students)26 | Students outperformed non-users on the FAST assessment; higher usage = bigger gains | Usage intensity matters. Dashboard + tracking drives the engagement loop.
IXL Holland MI RCT (Johns Hopkins)27 | ESSA Tier 1 evidence; significant math gains | Adaptive practice has causal evidence behind it.
DreamBox ESSA studies28 | “Strong” rating; 13K+ students; +0.10 effect size | Even modest effect sizes are meaningful at scale.
Springer meta-analysis 202430 | Effect size 0.343 favouring AI over traditional instruction (21 studies) | AI math instruction consistently outperforms traditional instruction.
GenAI math meta-analysis 202531 | Pooled g = 0.603; moderate-to-large positive impact | GenAI specifically (not just adaptive software) has strong evidence.
Khan Academy “Explain Your Thinking”12 | 20–36% of students showed more understanding through AI conversation | Socratic AI reveals hidden understanding. v2/v3 feature.
Photomath29 | 220M+ downloads; 2.2B problems/month; acquired by Google in 2023 | Consumer demand for math AI is massive. The school-facing product is the gap.

V. Competitive Landscape

HK Direct Competitors

Platform | Level | Focus | Schools | Strength | Weakness
SmartQuest8 | Secondary (DSE) | Paper gen + auto-mark (OMR) | 80+ (free trial) | DSE-specific, Google Classroom integration | No primary, no diagnostic, no teaching loop
AIMaths | Primary (P1–P6) | Adaptive learning + diagnostics | Unknown (new) | 5 HK curriculum strands, learning paths | Very new, limited adoption, no teacher loop
Sayo Academy | All levels | General AI teaching tools | 100+ | Broad subjects, quiz generation | Generic — not math-specialised
Vinci AI | All levels | School-based LLM infrastructure | Unknown | On-premise, 100+ pre-built AI apps | Infrastructure play, not a product

Global Comparables (Feature-Level)

Platform | Key feature | How good | HK gap
Graded Pro7 | Handwritten math grading + viva voce | 3,000+ schools UK/US; “remarkable accuracy” claim; teacher override with annotations | No HK curriculum alignment; no Cantonese support
Quezzio (Wolfram)9 (variant sketch after this table) | Algorithmic question generation | ~4K generators; symbolic equivalence grading; anti-cheating via unique variants | US Common Core only; no HK strands
MathGPT10 | Socratic “cheat-proof” tutoring | 50+ schools; $25/student; never gives direct answers | US-focused; no HK curriculum; no teacher grading loop
Third Space Learning11 | Misconception detection via voice tutoring | 700K+ lessons; 4-stage misconception/calculation distinction; 93% met Expected Standard in 2025 SATs | UK curriculum only; voice-first (not paper); B2C model
IXL26 | Adaptive practice + score tracking | ESSA Tier 1 evidence (Johns Hopkins RCT)27; Florida study: higher usage = bigger gains | US/UK standards; no HK localisation
ScaffoldiaMyMaths13 | HK primary fraction scaffolding | Research-stage; adaptive scaffolding for lower-ability HK primary students | Not a commercial product; fractions only
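Quezzio-style algorithmic variation (and the TSA past-paper variant idea in Section II) reduces to parameterised templates: fix the structure and difficulty, randomise the numbers, and derive the answer from the same parameters. A minimal sketch; the template and its fields are invented for illustration.

```python
import random

def fraction_addition_variant(seed: int) -> dict:
    """One template, many variants: same structure and difficulty, different numbers.
    Seeding per student gives each student a unique but equivalent question."""
    rng = random.Random(seed)
    denom = rng.choice([4, 5, 6, 8, 10])
    a, b = rng.randint(1, denom - 1), rng.randint(1, denom - 1)
    return {
        "question": f"Calculate {a}/{denom} + {b}/{denom}. Give your answer as a fraction.",
        "answer": f"{a + b}/{denom}",   # unsimplified; a real generator would reduce it
        "strand": "Number",
        "difficulty": 2,
    }

# Five students, five different-but-equivalent questions.
for student_id in range(5):
    print(fraction_addition_variant(seed=student_id)["question"])
```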

Adjacent HK Platforms

Platform | What | Math?
LingoTask | Chinese + English essay/oral AI grading | No — language only. 150+ schools. QEF-funded until 2028.
Goodclass.ai (HKUST) | Generic AI education platform | Vaguely — not math-specialised
HKTA AI Homework | Consumer tutoring | Yes — B2C only. HK$138/mo.

Positioning: Essai vs SmartQuest

Essai Advantages

  • Full 教學評 loop (teach + learn + assess)
  • Primary coverage (SmartQuest is DSE-only)8
  • Existing school relationships (~20 schools)22
  • Multi-subject bundling = grant capture23
  • Diagnostic layer (v2) vs. none

SmartQuest Risks

  • 80 schools already engaged (free trial)8
  • Proven OMR scanning + HKDSE format
  • Google Classroom integration live
  • Could expand to primary faster than Essai builds math
SmartQuest’s 80 Schools — Threat or Noise? SmartQuest has 80 schools on a free trial.8 Free trial ≠ paid. They cover DSE only (secondary). But if they expand downward to primary, they’d have distribution + proven OMR tech. The question is: does Essai’s “start at primary” strategy give it a protected beachhead, or is it building for the segment with lower willingness-to-pay? Primary schools have tighter budgets than secondary. The grant (HK$500K) helps, but is a one-off.23

VI. Data Access Estimation — Auro DB Current State → Math Projection

Current State

Subject | Pipeline | Volume | Teacher submissions
English | userdata → pdfdata → imagedata_full → report_score22 | 11.4K essays | 5.6K (11 schools)
Chinese | c_userdata → c_pdfdata → c_imagedata_full → report_score_c22 | 9.3K essays | 959 (8 schools)
Oral | teacherAssignment → oralReport22 | 14.5K reports | —
Math | does not exist | 0 | 0
Zero math data in Auro DB: No tables, no schema, no content.22 This is not a gap — it’s a clean slate. The essay pipeline architecture (teacher batch-uploads scan → AI processes → report generated) is proven and can be adapted for math. But all math content (question bank, rubrics, grading models, misconception taxonomy) must be built from scratch.

Adoption Patterns — What Essays Tell Us About Math

Adoption is school-driven, not teacher-driven. 海怡寶血小學 alone accounts for ~50% of all essay volume.22 A handful of champion schools drive most activity. Expect the same for math.

Chinese adoption is 6× lower than English (959 vs 5,600 teacher submissions).22 Not because fewer schools use Chinese — because the Chinese AI grading product is newer and less trusted. Teachers need to see accuracy before committing workflow. Math will face the same ramp.

Realistic Y1 target for math: 2–3 champion schools, ~500–1K graded exercises/month. Based on Chinese essay pattern: 6-month ramp to steady state, then plateau until next round of school adoption.22

Teacher Workflow Mapping (Current → With Essai Math)

Current workflow | Essai Math role
Teacher sets paper manually (textbook / past papers) | Question generation — generate by topic + difficulty in seconds
Students write answers in exercise books (paper) | v1: typed answers on device. v2+: scan paper32
Teacher collects books, marks with red pen | v1: auto-grade typed MC/numeric. v2+: scan-to-grade when VLM accuracy permits
Teacher records scores in class register | Auto-tracking — scores recorded per student by topic/strand (sketch after this table)
Teacher re-explains weak areas in class | Class dashboard — “class is weakest on fractions; here’s a targeted warm-up”
Parent asks “how is my child?” | Parent report — per-student progress shareable via WhatsApp
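The auto-tracking and class dashboard rows above reduce to simple aggregation once every graded item is tagged with student, strand, and topic. A sketch using pandas; the column names and sample data are assumptions.

```python
import pandas as pd

graded = pd.DataFrame([
    # one row per graded question, tagged at grading time
    {"student": "Chan Tai Man", "strand": "Number", "topic": "fractions", "correct": 0},
    {"student": "Chan Tai Man", "strand": "Measures", "topic": "length", "correct": 1},
    {"student": "Wong Siu Ming", "strand": "Number", "topic": "fractions", "correct": 0},
    {"student": "Wong Siu Ming", "strand": "Number", "topic": "decimals", "correct": 1},
])

# Class view: average score per strand, weakest first.
by_strand = graded.groupby("strand")["correct"].mean().sort_values()
print(by_strand)  # Number ~0.33, Measures 1.00 -> class is weakest on Number

# Per-student view feeding the parent report: accuracy per topic.
per_student = graded.groupby(["student", "topic"])["correct"].mean()
print(per_student.loc["Chan Tai Man"])
```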

OCR Constraint: The Handwriting Problem

Math handwriting recognition requires: digit recognition (0–9), symbol recognition (+, −, ×, ÷, =, fractions, brackets), layout understanding (vertical addition, long division, working steps), and diagram interpretation (geometry shapes, angles).

HKU / Mach Innovation achieved >97% accuracy on digit + symbol recognition using Xception/ResNet/UNet models trained on real HK student handwriting.6 But this is OCR (reading what’s written), not grading (judging whether the working is correct). DrawEduMath (Dec 2025 leaderboard) shows VLMs achieve 57–71% when they need to evaluate handwritten student work, up from 51–66% in the original paper, improving ~5–10% every 6 months.32 A hybrid approach (Mach OCR for recognition + LLM for evaluation) could outperform pure VLM approaches at primary level.
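A hypothetical sketch of that hybrid, separating the two layers discussed in Section IX: a recognition layer (stubbed here, standing in for a Mach/HKU-style OCR model) and an evaluation layer that checks constrained primary answers deterministically and hands anything else to an LLM grader flagged for teacher review. All function names are placeholders.

```python
def recognise(image_bytes: bytes) -> str:
    """Layer 1 (recognition): handwriting -> text. Placeholder for a Mach/HKU-style
    OCR model (>97% reported on HK digits/symbols6); not implemented here."""
    raise NotImplementedError("plug in a partnered or licensed OCR model")

def evaluate(recognised_working: str, expected_answer: str, llm_grader=None) -> dict:
    """Layer 2 (evaluation): judge the recognised working. Constrained primary answers
    are checked deterministically; anything else goes to an LLM grader and is flagged
    for teacher review."""
    lines = recognised_working.strip().splitlines() or [""]
    final_line = lines[-1]
    if final_line.replace(" ", "") == expected_answer.replace(" ", ""):
        return {"verdict": "correct", "method": "exact_match", "needs_review": False}
    if llm_grader is not None:
        return {"verdict": llm_grader(recognised_working, expected_answer),
                "method": "llm", "needs_review": True}
    return {"verdict": "incorrect", "method": "exact_match", "needs_review": True}

# Typical flow (OCR stubbed out): evaluate(recognise(scan_bytes), "3/4")
```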

Partnership Opportunity: Mach Innovation / HKU Data Science Lab. They’ve solved the hard OCR problem for HK primary math handwriting.6 Rather than rebuilding, explore a partnership or licensing arrangement. This would cut months off the eventual scan-to-grade build. Even if VLM grading isn’t ready, the OCR layer is a prerequisite.

VII. Critical Assessment of Feature Choices R3 NEW

Stance being assessed: “P0 should be question generation + auto-grading + class dashboard at P3–P6 level.”

Hidden Assumption #1: Auto-grading is technically feasible at primary level

UPDATED (R3.1): Feasibility is now stratified, not binary
  • Typed worked solutions: GPT-4.1-mini achieves 94.5% accuracy on college algebra worked solutions (18K responses, EDM 2025).41 Primary arithmetic will be higher. PRODUCTION-READY
  • Handwritten K-12 math: DrawEduMath leaderboard (Dec 2025): Gemini 3 Pro 71.3%, GPT-5 64%, improving ~5–10% per 6 months.32 FERMAT benchmark: Gemini-1.5-Pro 77% error correction on grades 7–12.42 VIABLE WITH TEACHER REVIEW
  • Digit/symbol OCR only: Mach/HKU achieves >97%.6 OCR ≠ grading, but combined with constrained answer matching for primary arithmetic, this could exceed VLM-only approaches. PARTIAL — WORTH TESTING
Reformed position: Typed auto-grading at P0 (high confidence). Handwritten scan-to-grade at P1 with mandatory teacher review UI — viable now as a “pre-mark + override” workflow. Autonomous handwriting grading (no review) at P2, gated on >90% accuracy.

Hidden Assumption #2: Question generation is the #1 teacher demand

CHALLENGE: What teachers SAY they want ≠ what drives adoption

Yes, >50% of HK teachers say they want question generation and grading help.1 But 62% of teachers globally say student engagement drives edtech adoption, not time-saving.36 Only 18% of US K-12 teachers have tried AI at all.35 EdTech sits unused because companies fail to research teacher workflows.37

Reformed position: Question generation is still P0 — it’s the lowest-risk entry point. But don’t assume it alone creates sticky usage. Monitor engagement metrics aggressively. If teachers generate questions but students don’t engage, the product dies regardless.

Hidden Assumption #3: Scan-to-grade placement

UPDATED (R3.1): P1 is correct — with the right UX

R3 downgraded scan-to-grade to P2 based on older accuracy data (51–66%). Updated DrawEduMath leaderboard (Dec 2025) shows Gemini 3 Pro at 71.3%, with ~5–10% improvement every 6 months.32 At 71%, roughly 3 in 10 answers need teacher correction — but that’s still faster than marking 10 in 10 from scratch. The key is UX: ship as “pre-mark + review” where AI does the first pass and the teacher overrides. Graded Pro uses this model for 3,000+ schools.7

Reformed position: Scan-to-grade back to P1, but with mandatory teacher review UI. The trust risk is real (80.3% worry about accuracy1), so never position it as “AI grades your homework.” Position it as “AI saves you 60% of marking time.” Autonomous grading (no review) stays at P2, gated on >90% accuracy. Invest in Mach/HKU OCR partnership6 for the recognition layer.
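In data terms, "pre-mark + override" means every AI verdict stays provisional until a teacher confirms or overrides it, and nothing is released to dashboards or parents before review. A minimal sketch of such a record and release gate; the field names are assumptions, not Essai's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreMarkedItem:
    question_id: str
    ai_verdict: str                         # "correct" / "incorrect"
    ai_confidence: float                    # e.g. self-consistency agreement
    teacher_verdict: Optional[str] = None   # set when the teacher confirms or overrides
    teacher_comment: str = ""

    @property
    def final_verdict(self) -> Optional[str]:
        # teacher review is mandatory at P1: no final mark without a teacher decision
        return self.teacher_verdict

def ready_to_release(items: list[PreMarkedItem]) -> bool:
    """Scores reach the dashboard / parent report only after every item is reviewed."""
    return all(item.teacher_verdict is not None for item in items)
```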

Hidden Assumption #4: Starting at primary is a strength

CHALLENGE: SmartQuest has 80 schools at DSE level. Is primary actually safer?

SmartQuest covers only DSE (secondary).8 Essai’s “start at primary” avoids direct competition. But: primary schools have lower budgets than secondary. DSE is a high-stakes exam with strong willingness-to-pay. TSA is a low-stakes feedback assessment25 — parents care intensely, but schools don’t need to spend money on it. ~500+ primary schools in HK39 means a large addressable market, but per-school contract value may be lower.

Reformed position: Primary is still the right beachhead — Essai has existing primary school relationships22, the grant requires 3 subjects23, and competition is weakest here. But be honest: this is a volume play, not a high-ARPU play. Plan pricing accordingly.

Hidden Assumption #5: Nobody asked for misconception diagnosis

INSIGHT: Third Space Learning built their entire business on what nobody asked for

700K+ lessons delivered. 93% of students met Expected Standard in 2025 SATs.11 They distinguish misconceptions from calculation errors through a 4-stage process — and charge premium for it. MalruleLib catalogues 101 malrules across 498 templates38 — proving that systematic error modelling is possible. ScaffoldiaMyMaths proved it works for HK primary fractions specifically.13

HK teachers who would benefit most from misconception diagnosis don’t know to ask for it. They ask for grading relief because that’s the pain they feel. Misconception diagnosis addresses the pain they can’t articulate. Classic innovator’s dilemma.

Reformed position: Still P3 — you can’t sell what teachers don’t understand yet. But start collecting error pattern data from day one (even at P0). When enough data accumulates, the misconception layer is the moat nobody can copy.

Summary: Reformed Priority Matrix (R3.1)

Priority | Feature | R3 position | R3.1 position | Why changed
P0 | Question generation (digital, P3–P6) | P0 | P0 HOLD | Still the right entry point
P0 | Auto-grading (typed worked solutions) | P0 (MC + numeric only) | P0 EXPANDED | GPT-4.1-mini 94.5% on worked algebra41 — can now grade typed steps, not just final answers
P0 | Class dashboard | P0 | P0 HOLD | Low effort, high value
P1 | Scan-to-grade (w/ teacher review) | P2 | P1 UPGRADED | Gemini 3 Pro 71.3%32 — viable as pre-mark + review workflow; Graded Pro model proven7
P1 | Parent WhatsApp report | P1 | P1 HOLD | High HK-specific value
P1 | TSA past-paper variants | P1 | P1 HOLD | High demand, exam-culture fit
P1 | Homework quality optimiser | P1 | P1 HOLD | EDB policy alignment21
P2 | Autonomous scan-to-grade (no review) | — | P2 NEW | Gated on VLM >90% accuracy; estimated mid–late 2027
P2 | Step-by-step error detection | P2 | P2 HOLD | EMNLP 2025: <10% on harder math19
P3 | Misconception taxonomy | P3 | P3 HOLD | The moat, but invest only after adoption proven
P3 | Cross-subject correlation | P3 | P3 HOLD | Needs data in both subjects first22

VIII. What Needs to Be Built

Component | Effort | Priority | Dependency
HK math question bank (P3–P6, 5 strands, tagged by topic + difficulty)21 | Medium — bootstrap with TSA past papers25 + AI generation | P0 | HKEAA licensing for past papers20
Auto-grading engine (MC + typed numeric; symbolic equivalence — sketch after this table) | Medium — LLMs reliable at primary-level arithmetic19 | P0 | Question bank
Class dashboard (math strands — extend existing Essai UI)22 | Low — reuse essay dashboard architecture | P0 | Auto-grading engine
Per-student tracking + parent WhatsApp report | Low — generated from grading data | P1 | Class dashboard
TSA past-paper variant generator25 | Medium — Quezzio model applicable9 | P1 | Question bank + HKEAA licence20
Homework quality optimiser (EDB-aligned)21 | Low–Medium — algorithm over question bank | P1 | Question bank + learning objectives mapping
Scan-to-grade OCR for handwritten math6 | High — Mach/HKU partnership or licence | P1 (mandatory teacher review; autonomous at P2) | Review UI + Mach partnership6 or VLM pre-mark; autonomy gated on >90% accuracy
Step-by-step error detection | Medium — only P3–P6 feasible19 | P2 | Scan-to-grade OCR
Misconception taxonomy (per topic)38 | High — leverage MalruleLib + teacher input | P3 | Error data accumulation from v1
HKDSE paper generator | Medium — SmartQuest benchmark8 | P3 | HKEAA licence20 + secondary curriculum mapping
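For the auto-grading engine row, "symbolic equivalence" (the approach Quezzio uses9) means accepting any answer that is algebraically equal to the key, so "2/4" and "0.5" both match "1/2". A minimal sketch with SymPy; the library choice is an assumption, not a committed dependency.

```python
from sympy import simplify, sympify, SympifyError

def equivalent(student_answer: str, answer_key: str) -> bool:
    """True if the two expressions are algebraically equal (e.g. '2/4' == '1/2')."""
    try:
        return simplify(sympify(student_answer) - sympify(answer_key)) == 0
    except (SympifyError, SyntaxError):
        return False  # unparseable input: no match here; a real system would flag it for review

print(equivalent("2/4", "1/2"))   # True
print(equivalent("0.5", "1/2"))   # True
print(equivalent("3/4", "1/2"))   # False
```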

IX. LLM & VLM Accuracy — The Technical Constraint R3 NEW

This section consolidates the accuracy evidence that gates several feature decisions.

Benchmark | What it tests | Result | Implication
DrawEduMath (NAACL 2025; leaderboard Dec 2025)32 | VLMs on 2,030 real K-12 handwritten math images; teacher-posed questions | 57–71% accuracy (Gemini 3 Pro: 71.3%, GPT-5: 64%); improving ~5–10%/6mo | Viable for pre-mark + teacher review; not yet autonomous
EDM 2025 typed grading41 | GPT-4.1-mini on 18K college algebra worked solutions | 94.5% accuracy (w/ self-consistency); GPT-4.1-nano 93.1%; GPT-4o 91.9% | Typed/digital auto-grading is production-ready for P3–P6
FERMAT (ACL 2025)42 | VLM error detection + correction on handwritten math (grades 7–12) | Gemini-1.5-Pro: 77% error correction rate | Error correction (not just detection) approaching viability
GPT-4o handwritten grading (Nov 2024)33 | GPT-4o on handwritten college math (now superseded by newer models) | “Too inaccurate for classroom deployment” | Older result; GPT-5/Gemini 3 Pro show significant improvement since
EMNLP 202519 | LLM step-level error detection | <10% accuracy on harder math | Step-by-step grading unreliable for complex problems; P3–P6 arithmetic is simpler
GSM1k (NeurIPS 2024)34 | Frontier model generalisation on grade-school math | Genuine generalisation confirmed; up to 8% accuracy drops from data contamination | LLMs can do primary-level math; contamination is a concern for benchmarking
HKU/Mach Innovation6 | Digit + symbol OCR on HK student handwriting | >97% accuracy | OCR is solved for HK; evaluation/grading is the gap
MalruleLib (Jan 2026)38 | Cross-template prediction of student malrules | Drops to 40% cross-template | Misconception detection needs domain-specific training, not general LLMs
The Two-Layer Problem

Layer 1 (OCR): Reading what the student wrote. Solved for digits/symbols (>97%).6 Harder for full expressions and layouts.

Layer 2 (Evaluation): Judging whether the student’s work is correct, identifying where errors occurred, and determining the nature of the error (calculation vs misconception). VLMs are at 57–71% on handwritten input (Dec 2025 leaderboard)32, 94.5% on typed worked solutions.41 Handwritten evaluation is the remaining binding constraint, but improving rapidly.

Essai’s v1 uses typed digital input for P0 auto-grading (94.5% accuracy, production-ready). P1 introduces handwritten scan-to-grade with mandatory teacher review (71% pre-mark accuracy, improving). This layered approach is both technically honest and pragmatically sound.
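One way to operationalise this layered approach is an explicit accuracy gate, so grading modes upgrade as measured benchmark accuracy crosses the thresholds this report proposes rather than by ad-hoc judgement. The function and threshold values below are illustrative, not a specified product behaviour.

```python
def grading_mode(input_type: str, benchmark_accuracy: float) -> str:
    """Route a submission to a grading mode based on the current measured accuracy
    for that input type (typed vs handwritten)."""
    if input_type == "typed":
        return "auto_grade"                        # ~94.5% today: P0, production
    if input_type == "handwritten":
        if benchmark_accuracy >= 0.90:
            return "auto_grade"                    # P2 gate: autonomous scan-to-grade
        if benchmark_accuracy >= 0.70:
            return "pre_mark_with_teacher_review"  # P1 today (~71%)
        return "teacher_marks_manually"
    raise ValueError(f"unknown input type: {input_type}")

print(grading_mode("handwritten", 0.713))  # -> pre_mark_with_teacher_review
```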


X. Creative Differentiators — What Only Essai Can Do

Opportunity | What it is | Why Essai specifically
Multi-subject grant bundle | Schools using Essai for Chinese + English need a 3rd subject for the HK$500K grant23 | Only Essai is positioned to offer 3 subjects from one vendor. LingoTask is language-only. SmartQuest is math-only.
Scan → Grade → Feedback (one step) | Teacher scans exercise books. AI pre-marks + teacher reviews. Returns graded work + per-student feedback + class summary. | Meets HK paper-first reality. Graded Pro does this for UK/US7 but nobody does it for the HK curriculum. Mach proves OCR works locally.6 P1 with teacher review; autonomous at P2.
“Error twin” matching (clustering sketch after this table) | Surface other students who made the same error; group for targeted reteaching | HKU/Mach already uses hierarchical clustering for this concept.6 No commercial product surfaces this for teachers.
Parent WhatsApp report | Auto-generate a weekly per-student report shareable in the parent group | HK parents expect WhatsApp. “Your child improved on fractions this week” would be viral in parent circles. Low build cost.
Cross-subject correlation | “Math word problem weakness correlates with Chinese reading comprehension.” | Only possible with both essay + math data for the same students. Essai is the only platform in position.22 v3 at earliest.
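"Error twin" matching is, mechanically, clustering students by their error-pattern vectors, the same idea as the hierarchical clustering HKU/Mach describes.6 A toy sketch with SciPy; the error-type encoding is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One row per student; one column per tagged error type
# (e.g. "added denominators", "forgot to carry", "misread place value").
students = ["Chan", "Wong", "Lee", "Ho"]
error_matrix = np.array([
    [1, 0, 0],   # Chan: adds denominators when adding fractions
    [1, 0, 0],   # Wong: same misconception -> Chan's "error twin"
    [0, 1, 0],   # Lee: carrying errors
    [0, 0, 1],   # Ho: place-value errors
])

Z = linkage(error_matrix, method="ward")
groups = fcluster(Z, t=3, criterion="maxclust")  # ask for 3 reteaching groups
for name, group in zip(students, groups):
    print(name, "-> group", group)               # Chan and Wong land in the same group
```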

XI. Verdict

CONDITIONAL YES — with honest technical constraints.

Reformed Stance (R3): Essai should build AI Math for HK schools. But the v1 must be scoped tighter than R2 suggested. Lead with digital-input question generation + typed-answer auto-grading + class dashboard at P3–P6. These are technically feasible, match teacher demand1, and competition is weakest here.8

What R3.1 Changed: Two critical data updates. (1) Typed grading is production-ready: GPT-4.1-mini hits 94.5% on worked algebra solutions41 — P0 auto-grading can now handle typed steps, not just final answers. (2) Handwritten grading is closer than R3 suggested: DrawEduMath leaderboard (Dec 2025) shows Gemini 3 Pro at 71.3%, improving ~5–10% every 6 months.32 Scan-to-grade upgraded back to P1 (with mandatory teacher review UI). Autonomous grading (no review) gated at P2 on >90% accuracy. TSA past-paper variants and homework quality optimiser remain at P1.

The Honest V1: A digital-first math tool that generates HK-curriculum-aligned questions, auto-grades typed answers, and shows teachers where the class is weak. Not revolutionary. Not a moat. But shippable, trustworthy, and grant-qualifying. The moat (scan-to-grade + misconception detection) comes in v2/v3 as the technology catches up and error data accumulates.

Remaining Risks: Zero HK math teacher interviews (all pain data is survey-level). Unknown Essai engineering capacity. HKEAA licensing needed for past-paper content.20 Primary schools have lower willingness-to-pay than secondary. “Digital-only” v1 doesn’t match the paper-first classroom reality — this is a deliberate trade-off for trust.

What Would Resolve Uncertainty: (1) 3–5 HK primary math teacher interviews (direct validation of demand and willingness to use a digital-input tool). (2) SmartQuest product teardown (how good is their OMR accuracy?). (3) Mach Innovation/HKU partnership feasibility check. (4) Pilot test: LLM grading on 20 real HK P3–P6 math papers (accuracy benchmark before building). (5) Engineering capacity check (who builds this, when?).


XII. Open Questions for Leslie

  1. Digital-first acceptable? v1 uses typed answers, not scanned handwriting. Will schools accept this or is paper-based grading a non-negotiable? If schools insist on paper, v1 timeline extends significantly — pending VLM accuracy improvements.32
  2. Which grade levels first? Research says P3–P6 (installed base, technically feasible, no competition). Verify: is there school demand for DSE Math? If yes, SmartQuest is already there with 80 schools.8
  3. What do your math teachers actually want? Surveys say grading + question gen.1 Can we interview 3–5 HK primary math teachers directly?
  4. Engineering capacity? Question bank, grading engine, new math UI. Who builds this? What’s realistic in 3–6 months?
  5. Mach Innovation / HKU partnership? They’ve solved HK handwriting OCR for math.6 Is a partnership or licensing arrangement feasible?
  6. TSA past-paper licensing? HKEAA charges HK$5–7K/year for commercial use.20 Does this work for Essai’s model?
  7. Grant timeline? Schools applying by Feb 28 — can Essai be named as intended vendor with a roadmap? Delivery window extends to Aug 2028.23
  8. Parent communication channel? WhatsApp reports would be high-impact for HK parents. Does Essai have WhatsApp integration or is it a new build?

XIII. References

[1] HKFEW (教聯會) Survey on AI in Teaching, 2025 — >50% of teachers want AI for grading; 80.3% worry about AI accuracy; 83.5% worry about student cheating; >50% find AI tools difficult to use
[2] Our Hong Kong Foundation AI Education Survey, Jul–Dec 2025 — 1,200 respondents; 95% of students vs 90% of teachers use AI; students rate themselves more proficient; over-reliance concerns
[3] EDUplus / JobMarket — EdU Jockey Club Primary AI Math Pilot, Jul 2024 — P5–P6 AI math assistant with Microsoft/遊戲湯麵; question generation in seconds; handwritten answer marking; 5-level grading scale
[4] SCMP — “Seven after-school assignments each day,” 2016 — TSA pressure; ~80% of homework TSA-related; child stress; parent herd mentality
[5] Hong Kong Free Press — Homework horror survey, 2018 — 7+ homework pieces/day; 85% of teachers confirm the same on weekends; teachers attribute it to TSA + EDB policies
[6] HKU Data Science Lab / Mach Innovation — AI Math Marking System — >97% accuracy on digit/symbol recognition; Xception/ResNet/UNet; trained on real HK student handwriting; hierarchical clustering for error grouping
[7] Graded Pro — 3,000+ schools UK/US; handwritten math grading; viva voce AI follow-up; teacher override with annotations; up to 95% faster marking
[8] SmartQuest — HKDSE paper gen + auto-marking; OMR scanning; 80+ schools on free trial; Google Classroom integration; error analysis + AI recommendations
[9] Wolfram Quezzio — ~4,000 algorithmic question generators; 10 per standard × 5 difficulty levels; symbolic equivalence grading; LMS integration
[10] MathGPT.ai — Socratic “cheat-proof” tutoring; 50+ schools; $25/student; anti-cheating via copy-paste prevention + algorithmic variation
[11] Third Space Learning — 700K+ lessons; misconception detection via voice AI tutoring; 4-stage process; 93% met Expected Standard in 2025 SATs
[12] Khan Academy — “Explain Your Thinking” AI assessment — 20–36% of students showed more understanding through AI conversation than their initial response; AI scorer aligns with human raters
[13] ScaffoldiaMyMaths — HK Primary Fraction Scaffolding (Durham/APSCE) — adaptive scaffolding for HK lower-ability primary students; AI-based real-time feedback; dynamic virtual manipulatives; fraction focus
[14] HKET — AI Teaching Survey, 2025 — schools actively discourage AI use in teaching; teachers report difficulty applying AI tools
[15] @wxharp on Threads (3,100 likes, Sept 2025) — “Teacher uses AI to set papers, student uses AI to do HW, teacher uses AI to mark — what’s the point?”
[16] @fk_pk_1919 on Threads (Jan 2026) — “AI grading doesn’t check each step — just sweeps through in one go”
[17] @jagolee48 on Threads (275 likes, Feb 2025) — HK math tutors: ChatGPT weak, Deepseek worse; testing and rejecting generic AI for math
[18] @studywithai1314 on Threads (Dec 2025) — advocating Socratic AI prompting for math; organic demand for guided AI in HK
[19] EMNLP 2025 — LLMs achieve <10% accuracy on step-level error detection for harder math; primary arithmetic is simpler but complex problems remain unsolved
[20] HKEAA — Past Paper Licensing — commercial use of HKDSE/TSA past papers requires a HK$5,275–6,685/year licence
[21] EDB Curriculum Guide & Homework Policy — “quality over quantity” in homework; math has 5 strands (primary), 3 dimensions (secondary); schools encouraged to reduce test frequency in lower grades
[22] Auro DB Exploration, Internal (Feb 9, 2026) — 42 tables; essay volumes; 海怡寶血小學 = ~50% of volume; Chinese 6× lower than English; school-driven adoption; zero math tables
[23] LegCo Panel on Education, Oct 2025 — HK$500K per school for AI in 3+ subjects; data security concerns; HK funding vs Singapore/UK centralised models
[24] HKEAA HKDSE Math Subject Information — Paper 1 (65%, 2¼ hr) + Paper 2 (35%, 1¼ hr); 3 sections in Paper 1 (A1/A2/B); topics from indices through probability
[25] HKEAA BCA/TSA — TSA at P3/P6; Basic Competencies framework for math assessment; low-stakes school feedback
[26] IXL Florida Study, Jan 2025 — 77K+ students; outperformed non-users on FAST; higher usage = bigger gains
[27] IXL RCT Holland MI, Johns Hopkins 2023 — ESSA Tier 1 evidence; randomised controlled trial; significant math gains
[28] DreamBox ESSA Studies — “Strong” ESSA rating; 13K+ students; +0.10 effect size
[29] Photomath (Wikipedia) — 220M+ downloads; 2.2B problems/month; acquired by Google in 2023
[30] AI Math Meta-Analysis, Springer 2024 — Effect size 0.343 favouring AI over traditional instruction; 21 studies
[31] GenAI Math Meta-Analysis 2025, MDPI Education — Pooled g=0.603; moderate-to-large positive impact of GenAI on math learning
[32] DrawEduMath, NAACL 2025 (Outstanding Paper Award) — 2,030 images of K-12 handwritten math; leaderboard Dec 2025: Gemini 3 Pro 71.3%, GPT-5 64.0%, Claude Opus 4.5 57.8%; improving ~5–10% per 6 months
[33] GPT-4o Handwritten Math Grading, Nov 2024 (arXiv 2411.05231) — GPT-4o on handwritten college math; too inaccurate for classroom deployment
[34] GSM1k Benchmark, NeurIPS 2024 (arXiv 2405.00332) — frontier models show genuine generalisation on grade-school math; up to 8% accuracy drops from data contamination
[35] RAND Teacher AI Usage Survey, Fall 2023 — 18% of US K-12 teachers use AI; 15% tried once and stopped
[36] eSpark EdTech Survey 2024 — 62% of teachers say student engagement drives edtech adoption; 73% want fresh content; engagement > time-saving
[37] EdWeek Market Brief 2024 — EdTech sits unused because companies fail to research teacher workflows before building
[38] MalruleLib, Chen/Liu/Sonkar, Jan 2026 — 101 malrules across 498 templates; 1M+ instances; cross-template prediction drops to 40%; systematic error modelling for math
[39] EDB School Statistics — ~500+ primary schools in Hong Kong
[40] Australian Teacher Workload Survey — 24% of secondary teachers spend 10+ hours/week marking
[41] Bhandari & Pardos, EDM 2025 — GPT-4.1-mini achieves 94.47% agreement with human grading on 18K college algebra worked solutions (self-consistency); GPT-4.1-nano 93.07%; GPT-4o 91.93%
[42] FERMAT Benchmark, ACL 2025 — 2,200+ handwritten math solutions, grades 7–12; Gemini-1.5-Pro 77% error correction rate; evaluates error detection, localisation, and correction across 4 dimensions