Spell Checking
Dictionary-based spell checking with Unicode-aware tokenization and light text normalization. Supports 62 languages via compressed Marisa-Trie dictionaries. Returns a compact report with the total number of misspellings and the list of offending tokens.
Behavior
Normalizes common Unicode quirks (e.g., smart quotes, zero-width joiners).
Ignores numbers and leading/trailing punctuation when deciding correctness.
Treats the ' and ’ apostrophe variants as equivalent.
Looks up each token against the selected language dictionary.
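The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not TextWizard's implementation: `SAMPLE_WORDS`, `normalize`, and `check` are hypothetical names, the word set stands in for a real Marisa-Trie dictionary, and the exact normalization rules the library applies may differ.

```python
import re
import unicodedata

# Hypothetical stand-in for a language dictionary; the library itself uses
# compressed Marisa-Trie files rather than a Python set.
SAMPLE_WORDS = {"this", "sentence", "has", "a", "typo", "don't", "it's"}

# Map typographic apostrophes to the ASCII one and drop zero-width characters.
_TRANSLATE = {
    ord("\u2018"): "'",   # left single quotation mark
    ord("\u2019"): "'",   # right single quotation mark
    ord("\u200b"): None,  # zero-width space
    ord("\u200c"): None,  # zero-width non-joiner
    ord("\u200d"): None,  # zero-width joiner
}


def normalize(text: str) -> str:
    """Apply NFC normalization, unify apostrophes, remove zero-width characters."""
    return unicodedata.normalize("NFC", text).translate(_TRANSLATE)


def check(text: str, words: set) -> dict:
    """Return a report shaped like the one described on this page."""
    errors = []
    for raw in normalize(text).split():
        # Strip leading/trailing punctuation, then case-fold for lookup.
        token = re.sub(r"^\W+|\W+$", "", raw).casefold()
        # Skip empty tokens and tokens without letters (numbers such as "42").
        if not token or not any(ch.isalpha() for ch in token):
            continue
        if token not in words:
            errors.append(token)
    return {"errors_count": len(errors), "errors": errors}


print(check("Thiss sentense has a typo.", SAMPLE_WORDS))
```

Whitespace splitting is the simplest possible tokenizer and is only adequate for space-delimited languages; it is used here purely to keep the sketch short.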
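Parameters
| Parameter | Description |
|---|---|
| (first argument) | (str) Raw input text. |
| language | (str, default …) Language code of the dictionary to use (see Available dictionaries). |
| dict_dir | (str, Path, or None) Directory containing one or more dictionary files (*.marisa.zst or *.marisa). |
| use_mmap | (bool, default …) Whether to memory-map the dictionary instead of loading it into memory (see Operational notes). |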
Return value
dict with:
errors_count – int, total misspellings
errors – list[str] of misspelled tokens (normalized/case-folded)
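Because the tokens in errors are case-folded, a caller who wants to highlight them in the original text has to map them back. The sketch below shows one way to do that, assuming that case-folding is the only difference between a reported token and its source word (the library's normalization may do more). `locate_errors` is a hypothetical helper, and the sample report is copied from the basic example's output rather than produced by a live call.

```python
import re


def locate_errors(text: str, report: dict) -> list:
    """Return (original_token, start_offset) for every word in `text`
    whose case-folded form appears in report["errors"]."""
    folded = set(report["errors"])
    hits = []
    # Words, optionally joined by an ASCII or typographic apostrophe.
    for match in re.finditer(r"\w+(?:['\u2019]\w+)*", text):
        if match.group().casefold() in folded:
            hits.append((match.group(), match.start()))
    return hits


# Sample report copied from the documented output of the basic example.
text = "Thiss sentense has a typo."
report = {"errors_count": 2, "errors": ["thiss", "sentense"]}

for token, offset in locate_errors(text, report):
    print(f"{offset:>3}  {token}")
```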
Examples
Basic
import textwizard as tw
res = tw.correctness_text("Thiss sentense has a typo.", language="en")
print(res)
Output
{"errors_count": 2, "errors": ["thiss", "sentense"]}
import textwizard as tw
print(tw.correctness_text("Queso è un tes , di preva.", language="it"))
Output
{"errors_count": 3, "errors": ["queso", "tes", "preva."]}
Custom dictionary directory & mmap
import textwizard as tw
from pathlib import Path

res = tw.correctness_text(
    "Coloar centre thetre",
    language="en",
    # pathlib does not expand "~" by itself, so expand it explicitly
    dict_dir=Path("~/textwizard_dicts").expanduser(),
    use_mmap=True,
)
print(res)
Output
{"errors_count": 2, "errors": ["coloar", "thetre"]}
Operational notes
Cache location (when dict_dir=None): a per-user data directory is used. You can override it with an environment variable; the first one that is set among TEXTWIZARD_DATA_DIR, TEXTWIZARD_DICT_DIR, and TEXTWIZARD_HOME is used.
Auto-download: when a dictionary is missing and dict_dir is not set, TextWizard downloads the compressed *.marisa.zst file once and reuses it afterwards.
File formats:
- *.marisa.zst files are decompressed on the fly (into memory), or to an adjacent *.marisa file when use_mmap=True.
- If you already have an uncompressed *.marisa file in dict_dir, it is used directly.
Performance:
- use_mmap=True → minimal RAM, fastest startup; excellent for large dictionaries or constrained environments.
- use_mmap=False → maximal throughput once loaded; best when RAM is plentiful.
Chinese requires jieba; all other languages work out-of-the-box.
Output tokens in errors are normalized/case-folded; they may differ in casing from the original text.
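The lookup order described in these notes can be sketched as follows. This is an illustration under assumptions, not the library's code: `resolve_dict_dir` and `pick_dictionary` are hypothetical helpers, the `~/.textwizard` fallback is a placeholder for the real per-user data directory, and the `<language>.marisa` file naming is assumed. Only the three environment variable names and the two file extensions come from this page.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Environment variables named in the notes above, in priority order.
ENV_VARS = ("TEXTWIZARD_DATA_DIR", "TEXTWIZARD_DICT_DIR", "TEXTWIZARD_HOME")


def resolve_dict_dir(dict_dir=None,
                     env: Mapping[str, str] = os.environ,
                     default: Optional[Path] = None) -> Path:
    """An explicit dict_dir wins; otherwise use the first environment
    variable that is set; otherwise fall back to a per-user directory."""
    if dict_dir is not None:
        return Path(dict_dir).expanduser()
    for name in ENV_VARS:
        value = env.get(name)
        if value:
            return Path(value).expanduser()
    # Placeholder fallback (assumption): the real per-user location may differ.
    return default or Path.home() / ".textwizard"


def pick_dictionary(directory: Path, language: str) -> Optional[Path]:
    """Prefer an uncompressed <language>.marisa file, then the compressed
    <language>.marisa.zst file. The filename pattern is an assumption."""
    for suffix in (".marisa", ".marisa.zst"):
        candidate = directory / f"{language}{suffix}"
        if candidate.is_file():
            return candidate
    return None  # a real implementation would trigger the download here


print(resolve_dict_dir(env={"TEXTWIZARD_DICT_DIR": "/opt/dicts"}))
```

Passing the environment as a mapping keeps the resolution logic easy to test without touching the real process environment.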
Available dictionaries
| Code | Language |
|---|---|
| | Afrikaans |
| | Aragonese |
| | Arabic |
| | Assamese |
| | Belarusian |
| | Bulgarian |
| | Bengali |
| | Tibetan |
| | Breton |
| | Bosnian |
| | Catalan |
| | Czech |
| | Danish |
| | German |
| | Greek |
| en | English |
| | Esperanto |
| | Spanish |
| | Persian |
| | French |
| | Scottish Gaelic |
| | Guarani |
| | Gujarati (…) |
| | Hebrew |
| | Hindi |
| | Croatian |
| | Indonesian |
| | Icelandic |
| it | Italian |
| | Japanese |
| | Kurmanji Kurdish |
| | Kannada |
| | Central Kurdish |
| | Lao |
| | Lithuanian |
| | Latvian |
| | Marathi |
| | Norwegian Bokmål |
| | Nepali |
| | Dutch |
| | Norwegian Nynorsk |
| | Occitan |
| | Odia |
| | Punjabi |
| | Polish |
| | Portuguese (EU) |
| | Romanian |
| | Russian |
| | Sanskrit |
| | Sinhala |
| | Slovak |
| | Slovenian |
| | Albanian |
| | Serbian |
| | Swedish |
| | Swahili |
| | Tamil |
| | Telugu |
| | Thai |
| | Turkish |
| | Ukrainian |
| | Vietnamese |
See also
TextWizard — Overview and quick start
Language Detection — Language identification