What I learned by OCRing an Afrikaans dictionary

At this point, I am confident in saying that Lector is the single most comprehensive language learning tool for Afrikaans on the market. I believe Steve Kaufmann (Founder of Lingq) is correct when he says that extensive reading from broad sources is the most important way to learn a language. Lingq would be my pick for the best paid resource for Afrikaans (not that there's a lot of competition in the area), but frankly, it is far too expensive for what I could uncharitably say is merely a text reader with a dictionary attached to it.

It didn't take long to prototype this app with Claude at my side (I've probably only written 10% of the LoCs in this app). It ships with an on-device dictionary based off the EN-AFR Wiktionary (~9,000 entries), which I thought would capture at least 80% of word lookups. But now that it's fairly stable and I've shifted a lot of effort from building the app to using it, I have found it highly frustrating to use. Even on relatively simple texts (news articles, young adult books), it was falling back to the LLM for translations on perhaps the majority of words. Once I'd hit about a 500-word vocabulary in-app (i.e. words I had encountered and marked as known), most of the words I still needed to look up were missing from the on-board dictionary.

So I did what a normal human does in this situation and, rather than going back to paying $13 a month for Lingq, took a 346-page bilingual dictionary I had, scanned every page in vFlat, and built a pipeline to convert it into ~10,000 structured entries. This is what I learned about coverage, about what OCR does to a language with lots of diacritics, and about Afrikaans itself.

This is a build note, not a how-to. Much to my chagrin, dictionary glosses are copyrighted... One day I will find a public domain dictionary and use it to ship a comprehensive open-source dictionary sidecar for this app. We'll just have to learn to live without a first-party translation of the word "selfie" into Afrikaans.

Why a niche language needs this

There are about 7 million people who speak Afrikaans as a first language globally. More than Norwegian or Finnish. But the digital resources a learner's app leans on are far thinner than that suggests.

Resource	Afrikaans	German	Spanish
Native speakers	~7 M	~90 M	~490 M
Tatoeba sentence pairs	4,930	772,823	443,429
English-Wiktionary lemmas	6,247	103,717	115,681

The sentence banks tell the starkest story. Tatoeba is a crowdsourced repository of translated sentences, and it's where I sourced the Cloze practice. It holds just 4,930 Afrikaans sentences against 443,000 Spanish and 773,000 German; a ~150× gap to German that dwarfs the ~13× difference in speakers. But that thinness is less attributable to the language than it is to its speakers. A few prolific translators (read: Germans being German) whose work compounds over several years can reshape the whole space. Afrikaans is thin because almost nobody has done the typing, which is, more or less, why this post exists. Hopefully this post can inflame the passions of some extremely online people and we can build a larger corpus.

Tatoeba sentence pairs

German	772,823
Spanish	443,429
Afrikaans	4,930

A learner needs a certain absolute number of words and example sentences to read with. German and Spanish cleared it by one to two orders of magnitude; Afrikaans hasn't. The dictionary side is the same shape: ~6,000 Afrikaans lemmas in English Wiktionary against ~104,000 (German) and ~116,000 (Spanish).

One caveat: free-dictionary size is a lottery, not a scarcity gauge. A language doesn't have fewer words because fewer people speak it. FreeDict's German-English dictionary has 517,534 headwords, but its Spanish-English one has only 4,502, smaller than its Afrikaans-English (5,129). Size there reflects whether a volunteer ingested a good source, not the language itself. So FreeDict's Afrikaans dictionary is actually a decent one; the real scarcity is in the sentence banks above, and in the curated, example-rich vocabulary that crowd-sourced sources simply don't carry, which is exactly what a published book has.

Published output is no rescue either. Germany puts out ~66,000 new book titles a year and Spain ~90,000; for Afrikaans there isn't even a clean annual figure. It's a minority slice of South Africa's output (~11% of trade-book revenue). The data scarcity is itself the point.

Add it up: one published learner's dictionary holds about as many headwords (~6,000) as the entire Afrikaans side of English Wiktionary — and ~2,600 of its words appear in no free source at all. That's why a weekend with a phone app meaningfully moves the needle (the coverage numbers are next).

Figures as of June 2026. Sources: Tatoeba, English Wiktionary lemma categories, FreeDict, speaker counts via Wikipedia/Ethnologue, book output via national publishers' bodies.

A small book beat a big wordlist

Here's the counter-intuitive part. The book has fewer headwords (~5,500) than the dictionary it was meant to improve (~10,000). By raw count it looks like a downgrade. But headword count is the wrong measure. What matters is how little the two overlap, and how many of the words you meet in real reading each one covers.

On a frequency-weighted benchmark of common words, merging the book should lift the share of distinct words the dictionary can resolve from about 67% to about 81%: nearly double the gain a free digital Afrikaans-English dictionary gave. Both are projections from a merge simulation, so 81% is just a rough ceiling.

The reason is the lack of overlap: about 59% of the book was new to the existing data. Wiktionary's Afrikaans is a scattershot of rare and inflected forms; the book is the curated everyday core. The words it recovered were exactly the ones a reader actually clicks - e.g. gebore (born), huiswerk (homework), terugkeer (return), omgee (to care), and the pronominal adverbs daaroor / daaraan that a scraped wordlist never has. A curated book, even a small one, is dense with the right words.
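The two measurements in this section (how much of the book is new, and how much of a benchmark the dictionary resolves before and after a merge) boil down to set arithmetic. Here is a minimal sketch; the function names and the toy word sets are invented for illustration, and how the benchmark was built or weighted in the actual simulation is not shown here.

```python
def new_share(book: set[str], existing: set[str]) -> float:
    """Fraction of the book's headwords that the existing data lacks."""
    return len(book - existing) / len(book)


def coverage(benchmark: list[str], dictionary: set[str]) -> float:
    """Share of distinct benchmark words the dictionary can resolve."""
    distinct = set(benchmark)
    return len(distinct & dictionary) / len(distinct)


# Toy data, made up for this example (not the real word lists).
existing = {"huis", "kat", "hond", "boek"}
book = {"huis", "gebore", "huiswerk", "terugkeer", "omgee"}
benchmark = ["huis", "kat", "gebore", "huiswerk", "daaroor", "huis"]

before = coverage(benchmark, existing)
after = coverage(benchmark, existing | book)  # simulate the merge
print(f"new to existing data: {new_share(book, existing):.0%}")
print(f"coverage before merge: {before:.0%}, after: {after:.0%}")
```

The point of the sketch is that a small source with low overlap can lift coverage more than a large source that mostly repeats what is already there.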

OCR is more-or-less blind to diacritics

This was the biggest surprise, and the most important lesson. Afrikaans uses diacritics fairly heavily: the circumflex, or little hat (ô, ê), and the deelteken, or diaeresis (ë, ï, ö). I expected OCR to miss a few, but it missed or mangled them in almost all cases.

Across 6,436 headwords, the circumflex-o (ô) appeared exactly zero times. Every môre (tomorrow), every oormôre (day after tomorrow) had quietly become an é or ê, or lost its hat entirely. The whole "exotic" accent set (ô, ö, ï, ü) had collapsed toward plain ê/ë or vanished.

The tempting fix is a rule: find a vowel that should carry a diaeresis and add it back. But I couldn't find a general rule that didn't misfire. The rule that turned assosieer into assosiëer also "corrects" baie (one of the most common words in the language) into the non-word baië. Thankfully, while chasing this, I only had to manually verify 120 or so entries.

So the real fix isn't in the data at all — it's in the lookup. Make dictionary lookup diacritic-insensitive as a last resort: if an exact match fails, strip the accents off both the query and the stored keys and try once more. Then môre, mére and mêre all find the same entry no matter how mangled the stored spelling, and you never risk a wrong "correction." The damaged data becomes a non-issue without touching a single headword.
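A last-resort, accent-folding lookup of this kind can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Lector's actual code: the `Dictionary` class, its method names and the sample entries are hypothetical. It uses the standard library's `unicodedata` module to decompose each character and drop the combining marks.

```python
import unicodedata
from typing import Optional


def strip_accents(text: str) -> str:
    """Decompose to NFD and drop combining marks (ô -> o, ë -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class Dictionary:
    """Hypothetical lookup table: exact match first, folded match last."""

    def __init__(self, entries: dict[str, str]):
        self.exact = {key.lower(): gloss for key, gloss in entries.items()}
        # Folded index; if two keys fold to the same string, the first wins.
        self.folded: dict[str, str] = {}
        for key, gloss in self.exact.items():
            self.folded.setdefault(strip_accents(key), gloss)

    def lookup(self, word: str) -> Optional[str]:
        word = word.lower()
        if word in self.exact:  # exact spelling always takes priority
            return self.exact[word]
        return self.folded.get(strip_accents(word))  # last resort


# Stored key lost its circumflex in OCR; the query still carries it.
d = Dictionary({"more": "tomorrow", "geinteresseerd": "interested"})
print(d.lookup("môre"))            # resolved via the folded index
print(d.lookup("geïnteresseerd"))  # resolved via the folded index
```

Keeping the exact index in front means correctly accented entries are never shadowed by a folded collision. One limitation of this sketch: stripping marks only reconciles spellings that share the same base letters (môre and more). A stored form whose base vowel changed in the scan (ô read as é) folds to a different string, so it would need an extra matching step that this sketch does not attempt.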

The one trick that did work for placing a diaeresis: the book prints syllabification, and in Afrikaans the deelteken always sits on a vowel that starts a new syllable after another vowel. So the hyphens in ge-ïn-te-res-seerd tell you exactly where the dots belong. This worked for the entries where the book bothered to print syllabification, which turned out to be only about 64% of them.
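As a rough sketch of that trick: split the printed syllabification on hyphens, and wherever a syllable starts with a vowel directly after a syllable ending in a vowel, put the dots on the first letter of the new syllable. The function name is hypothetical, and the `AMBIGUOUS_PAIRS` filter (only act when the two vowels would otherwise read as one sound) is my own assumption and an incomplete list, not something taken from the book or the pipeline. Treat the output as a candidate to verify, not a final spelling.

```python
# Plain vowel -> vowel with deelteken.
DIAERESIS = {"e": "ë", "i": "ï", "o": "ö", "u": "ü"}

# Assumption: vowel pairs that would be read as a single sound if left
# unmarked. Illustrative only; not a complete or authoritative list.
AMBIGUOUS_PAIRS = {"ee", "ie", "oe", "ae", "ei", "ui", "ai", "oi", "oo", "eu", "ou"}


def restore_deelteken(syllabified: str) -> str:
    """Rebuild a word from 'ge-in-te-res-seerd' style syllabification,
    adding a deelteken at vowel-to-vowel syllable boundaries."""
    syllables = [s for s in syllabified.lower().split("-") if s]
    if not syllables:
        return ""
    rebuilt = [syllables[0]]
    for prev, cur in zip(syllables, syllables[1:]):
        pair = prev[-1] + cur[0]
        if pair in AMBIGUOUS_PAIRS and cur[0] in DIAERESIS:
            cur = DIAERESIS[cur[0]] + cur[1:]
        rebuilt.append(cur)
    return "".join(rebuilt)


print(restore_deelteken("ge-in-te-res-seerd"))  # geïnteresseerd
print(restore_deelteken("huis-werk"))           # huiswerk (no vowel boundary)
```

Because the condition can still fire on words that take no deelteken, a sensible follow-up is to check each candidate against a wordlist before accepting it.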

Trust nothing; triangulate everything

OCR can't invent a word, but it can mangle one into a different real word, and the LLM that restructures the raw text can hallucinate. For a reference work, neither is acceptable. So no entry was trusted on the model's say-so. Each added entry had to pass at least one of the following checks:

Is the headword a real Afrikaans word? Check it against the LibreOffice Hunspell af_ZA spellcheck wordlist (~100k words).
If it's not in the wordlist, is it attested in an independent text corpus (Tatoeba sentences)?
Does it cleanly split into two known words? Afrikaans compounds endlessly (e.g. huis + werk).
Is it a separable verb (a known particle plus a known stem), like af + betaal?

Anything that passed none of these went to a human review pile. Out of ~1,000 questionable entries, exactly one was true garbage. The book OCR'd far cleaner than I expected. The errors that did survive were almost all the invisible diacritics from the last section, which by their nature don't match a wordlist and so self-select into the review pile anyway.
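The cascade of checks can be sketched as a small validator that returns the first kind of evidence it finds, or nothing at all, in which case the entry goes to the review pile. Everything here is hypothetical: the `Validator` class, the `min_part` threshold and the tiny word sets are stand-ins, and loading the real Hunspell af_ZA wordlist or the Tatoeba tokens is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Validator:
    wordlist: set[str]       # e.g. words from a spellcheck dictionary
    corpus_tokens: set[str]  # e.g. tokens seen in an independent corpus
    particles: set[str]      # separable-verb particles such as "af"

    def is_compound(self, word: str, min_part: int = 3) -> bool:
        """True if the word splits into two known words."""
        return any(
            word[:i] in self.wordlist and word[i:] in self.wordlist
            for i in range(min_part, len(word) - min_part + 1)
        )

    def is_separable_verb(self, word: str) -> bool:
        """True if the word is a known particle plus a known stem."""
        return any(
            word.startswith(p) and word[len(p):] in self.wordlist
            for p in self.particles
        )

    def check(self, word: str) -> str | None:
        """Return the first evidence found, or None for human review."""
        word = word.lower()
        if word in self.wordlist:
            return "wordlist"
        if word in self.corpus_tokens:
            return "corpus"
        if self.is_compound(word):
            return "compound"
        if self.is_separable_verb(word):
            return "separable-verb"
        return None


# Toy data, invented for the example.
v = Validator(
    wordlist={"huis", "werk", "betaal"},
    corpus_tokens={"lekker"},
    particles={"af", "op", "uit"},
)
for candidate in ["huis", "lekker", "huiswerk", "afbetaal", "hulswerk"]:
    print(candidate, "->", v.check(candidate) or "REVIEW")
```

Returning the reason rather than a bare boolean makes the review pile easier to audit, since each accepted entry records which independent source vouched for it.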

A dictionary is mostly not headwords

Splitting the entries by type was clarifying. Of ~6,400 distinct entries, only 5,446 were "main" headwords. Another 660 were derived forms - words like aandeelhouer (shareholder) or aanklag (charge) - that the book prints as run-ons under a root verb, but which are real standalone words. And 333 were idioms (uitdrukkings): voëls van eenderse vere (birds of a feather), 'n bydrae lewer (to make a contribution).

The idioms are lovely, but they taught me a product lesson: in Lector you can click a word to see a definition, or highlight a phrase. A user will probably never have the accuracy to highlight just the idiom, but they might highlight a phrase or sentence which includes idiomatic phrasing. In these cases, it will always fall back to the LLM for translation. As lookup entries they'd be dead weight. So they become example sentences attached to their parent word instead.

What the book says about Afrikaans

Once it was clean, it produced an interesting dataset that I thought warranted some statistical exploration.

First letter: share of headwords

s	11.8%
v	10.4%
b	8.3%
k	7.4%
o	6.4%
a	5.9%
t	5.2%
g	5.1%
d	4.8%
h	4.5%
m	4.4%
w	3.9%

s and v run away with it. The v- prefixes alone carry a big share (ver-, voor-, vir-), and s- words are everywhere. The tail end, x / y / c, is purely reserved for loanwords from other languages.

Vowel pairs (doubled vowels and diphthongs): % of words containing each

aa	12.4%
ie	12.1%
ee	10.6%
oo	8.1%
oe	6.3%
ui	4.6%
ei	4.6%
ou	2.2%
eu	1.9%

The doubled vowels (aa, ee, oo) and ie dominate. Long vowels are everywhere, and you can hear why spoken Afrikaans sounds so open.
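Both tallies are simple counts over the headword list. A minimal sketch, assuming the headwords are already available as a list of strings; the function names and the four sample words are made up for illustration.

```python
from collections import Counter


def first_letter_share(words: list[str]) -> dict[str, float]:
    """Share of words starting with each letter, most common first."""
    counts = Counter(w[0].lower() for w in words if w)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}


def vowel_pair_share(words: list[str], pairs: list[str]) -> dict[str, float]:
    """Share of words that contain each vowel pair at least once."""
    return {p: sum(p in w.lower() for w in words) / len(words) for p in pairs}


# Toy sample, not the real headword list.
sample = ["skool", "straat", "vergeet", "boek"]
print(first_letter_share(sample))
print(vowel_pair_share(sample, ["aa", "ee", "oo", "oe"]))
```

Note that the second function counts words, not occurrences, so a word with the same pair twice still counts once, which matches the "% of words containing each" framing.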

Compounding stretches the long tail. The average headword is 7.3 letters, but Afrikaans (and other Germanic languages fwiw) glues nouns together without limit. The results look impenetrable at first, but it's a real turning point in your learning journey when you realise you can decode most compound nouns fairly easily:

Word	Letters
liefdadigheidsorganisasie	25
kommunikasievaardigheid	23
uitbreidingsaktiwiteit	22
spysverteringstelsel	20
omgewingsvriendelik	19
verantwoordelikheid	19

And from the book's own syllabification, the average word is about 2½ syllables, with two-syllable words by far the most common. By part of speech it's noun-heavy: ~42% nouns, ~17% verbs, the rest adjectives, adverbs and the small closed classes. Exactly the shape of a learner's dictionary.

Where this goes

The book is now the best Afrikaans coverage source I have, and it'll fold into Lector's dictionary build the same way the free sources do. More than that, the processing pipeline is generic. It'll digitise the next book on my shelf with far less hand-holding, and the diacritic-insensitive lookup quietly absorbs whatever accents the camera couldn't see.

The longer-term idea is to ship the pipeline itself — so anyone can point Lector at a dictionary they own and grow their own offline vocabulary, one shelf at a time.
