Spell Checking¶

Dictionary-based spell checking with Unicode-aware tokenization and light text normalization. Supports 62 languages via compressed Marisa-Trie dictionaries. Returns a compact report with the total number of misspellings and the list of offending tokens.

Behavior¶

Normalizes common Unicode quirks (e.g., smart quotes, zero-width joiners).
Ignores numbers and leading/trailing punctuation when deciding correctness.
Treats the straight apostrophe (') and the typographic apostrophe (’) as equivalent.
Looks up each token against the selected language dictionary.
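
The steps above can be sketched in plain Python. This is an illustration of the described behavior, not TextWizard's actual implementation: the helper names (`normalize`, `candidate_tokens`, `check`), the character tables, and the plain `set` standing in for a Marisa-Trie dictionary are all assumptions.

```python
import unicodedata

# Hypothetical tables; the library's real normalization rules are not documented here.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
_ZERO_WIDTH = ("\u200b", "\u200c", "\u200d", "\ufeff")


def normalize(text: str) -> str:
    """Apply NFC, map curly quotes to straight ones, drop zero-width characters."""
    text = unicodedata.normalize("NFC", text).translate(_QUOTES)
    for ch in _ZERO_WIDTH:
        text = text.replace(ch, "")
    return text


def _is_punct(ch: str) -> bool:
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def _is_number(word: str) -> bool:
    # Digits with optional "." or "," separators count as a number.
    return word.replace(",", "").replace(".", "").isdigit()


def candidate_tokens(text: str) -> list[str]:
    """Return the case-folded tokens that would be looked up in a dictionary."""
    tokens = []
    for raw in normalize(text).split():
        start, end = 0, len(raw)
        while start < end and _is_punct(raw[start]):
            start += 1
        while end > start and _is_punct(raw[end - 1]):
            end -= 1
        word = raw[start:end]
        if word and not _is_number(word):
            tokens.append(word.casefold())
    return tokens


def check(text: str, dictionary: set[str]) -> dict:
    """Build a report shaped like the one described on this page."""
    errors = [tok for tok in candidate_tokens(text) if tok not in dictionary]
    return {"errors_count": len(errors), "errors": errors}


print(check("Thiss sentense has a typo.", {"has", "a", "typo"}))
```

Whitespace splitting is only adequate for languages that separate words with spaces; the real tokenizer is described as Unicode-aware and presumably handles more cases.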

Parameters¶

Parameter	Description
`text`	(str) Raw input text.
`language`	(str, default `"en"`) ISO-639 code.
`dict_dir`	(str | Path | None) Directory containing one or more `.marisa.zst` (or decompressed `.marisa`) dictionaries. If `None`, a per-user cache directory is used and the required dictionary is downloaded automatically if it is missing.
`use_mmap`	(bool, default `False`) True → memory-map the on-disk `.marisa` file (lowest RAM; fastest startup; OS page cache warms on first queries). False → load the entire trie into RAM (higher RAM; highest steady-state throughput).
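
To make the `use_mmap` trade-off concrete, here is a minimal sketch that uses only Python's standard `mmap` module on an arbitrary file. It shows the general difference between mapping a file and reading it fully into memory; `open_bytes` is a hypothetical helper, and how TextWizard loads its tries internally is not shown here.

```python
import mmap
import tempfile
from pathlib import Path


def open_bytes(path: Path, use_mmap: bool):
    """Return a bytes-like view of a file using one of the two strategies."""
    if use_mmap:
        with path.open("rb") as fh:
            # Pages are read lazily by the OS; the mapping stays valid
            # after the file object is closed.
            return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    # Eager strategy: the whole file is copied into process memory up front.
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.bin"
    path.write_bytes(b"marisa-demo-bytes")
    eager = open_bytes(path, use_mmap=False)
    mapped = open_bytes(path, use_mmap=True)
    same = bytes(mapped[:6]) == eager[:6]
    mapped.close()  # release the mapping before the directory is removed

print(same)
```

Both calls expose the same bytes. The difference is when the data is read and where it lives, which is the RAM versus throughput trade-off described in the table.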

Return value¶

dict with:

errors_count – int total misspellings
errors – list[str] of misspelled tokens (normalized/case-folded)
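
Because `errors` holds normalized, case-folded tokens, a caller who wants the original spellings has to map them back to the input. A minimal sketch, assuming a report shaped as described above; `locate_errors` is a hypothetical helper and its regular expression is a simplified stand-in for the library's tokenizer.

```python
import re


def locate_errors(text: str, report: dict) -> dict[str, list[str]]:
    """Map each case-folded error token to the original-cased words it matches."""
    found: dict[str, list[str]] = {tok: [] for tok in report["errors"]}
    # Words may contain internal apostrophes (straight or typographic).
    for word in re.findall(r"\w+(?:['’]\w+)*", text):
        key = word.casefold().replace("’", "'")
        if key in found:
            found[key].append(word)
    return found


text = "Thiss sentense has a typo. THISS again."
report = {"errors_count": 2, "errors": ["thiss", "sentense"]}
print(locate_errors(text, report))
```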

Examples¶

Basic¶

import textwizard as tw

res = tw.correctness_text("Thiss sentense has a typo.", language="en")
print(res)

Output

{"errors_count": 2, "errors": ["thiss", "sentense"]}

import textwizard as tw
print(tw.correctness_text("Queso è un tes , di preva.", language="it"))

Output

{"errors_count": 3, "errors": ["queso", "tes", "preva."]}

Custom dictionary directory & mmap¶

import textwizard as tw
from pathlib import Path

res = tw.correctness_text(
    "Coloar centre thetre",
    language="en",
    dict_dir=Path("~/textwizard_dicts").expanduser(),  # pathlib does not expand "~" on its own
    use_mmap=True,
)
print(res)

Output

{"errors_count": 2, "errors": ["coloar", "thetre"]}

Operational notes¶

Cache location (when dict_dir=None): a per-user data directory is used. To override it, set one of the environment variables TEXTWIZARD_DATA_DIR, TEXTWIZARD_DICT_DIR, or TEXTWIZARD_HOME; they are checked in that order and the first one found is used.
Auto-download: when a dictionary is missing and dict_dir is not set, TextWizard downloads the compressed *.marisa.zst once and reuses it subsequently.
File formats:
- *.marisa.zst files are decompressed on the fly (into memory), or to an adjacent *.marisa file when use_mmap=True.
- If you already have an uncompressed *.marisa file in dict_dir, it is used directly.
Performance:
- use_mmap=True → minimal RAM, fastest startup; excellent for large dictionaries or constrained environments.
- use_mmap=False → maximal throughput once loaded; best when RAM is plentiful.
Chinese requires jieba; all other languages work out-of-the-box.
Output tokens in errors are normalized/case-folded; they may differ in casing from the original text.
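
The lookup rules in these notes can be summarized as a short sketch. It is not TextWizard's code: `resolve_dict_dir` and `pick_dictionary_file` are hypothetical helpers, the `~/.textwizard` fallback is a placeholder for the undocumented per-user directory, and the `<language>.marisa` file naming is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional

ENV_VARS = ("TEXTWIZARD_DATA_DIR", "TEXTWIZARD_DICT_DIR", "TEXTWIZARD_HOME")


def resolve_dict_dir(dict_dir=None, environ: Optional[Mapping[str, str]] = None) -> Path:
    """An explicit dict_dir wins; otherwise use the first environment variable that is set."""
    if dict_dir is not None:
        return Path(dict_dir).expanduser()
    environ = os.environ if environ is None else environ
    for name in ENV_VARS:
        value = environ.get(name)
        if value:
            return Path(value).expanduser()
    # Placeholder default; the real per-user location is not documented here.
    return Path.home() / ".textwizard"


def pick_dictionary_file(directory: Path, language: str) -> Optional[Path]:
    """Prefer an uncompressed .marisa file, fall back to .marisa.zst, else None."""
    for suffix in (".marisa", ".marisa.zst"):
        candidate = directory / f"{language}{suffix}"  # assumed naming scheme
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "en.marisa.zst").touch()
    first = pick_dictionary_file(d, "en").name
    (d / "en.marisa").touch()
    second = pick_dictionary_file(d, "en").name

print(first, second)
```

In this sketch a `None` result from `pick_dictionary_file` corresponds to the case where the auto-download would be triggered.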

Available dictionaries¶

Code	Language
`af`	Afrikaans
`an`	Aragonese
`ar`	Arabic
`as`	Assamese
`be`	Belarusian
`bg`	Bulgarian
`bn`	Bengali
`bo`	Tibetan
`br`	Breton
`bs`	Bosnian
`ca`	Catalan
`cs`	Czech
`da`	Danish
`de`	German
`el`	Greek
`en`	English
`eo`	Esperanto
`es`	Spanish
`fa`	Persian
`fr`	French
`gd`	Scottish Gaelic
`gn`	Guarani
`gu`	Gujarati (`gu_IN`)
`he`	Hebrew
`hi`	Hindi
`hr`	Croatian
`id`	Indonesian
`is`	Icelandic
`it`	Italian
`ja`	Japanese
`kmr`	Kurmanji Kurdish
`kn`	Kannada
`ku`	Central Kurdish
`lo`	Lao
`lt`	Lithuanian
`lv`	Latvian
`mr`	Marathi
`nb`	Norwegian Bokmål
`ne`	Nepali
`nl`	Dutch
`nn`	Norwegian Nynorsk
`oc`	Occitan
`or`	Odia
`pa`	Punjabi
`pl`	Polish
`pt`	Portuguese (EU)
`ro`	Romanian
`ru`	Russian
`sa`	Sanskrit
`si`	Sinhala
`sk`	Slovak
`sl`	Slovenian
`sq`	Albanian
`sr`	Serbian
`sv`	Swedish
`sw`	Swahili
`ta`	Tamil
`te`	Telugu
`th`	Thai
`tr`	Turkish
`uk`	Ukrainian
`vi`	Vietnamese
