Language Detection¶
Language identification via character n-gram profiles. Candidate gating guided by priors and linguistic cues, then probability estimation for each language. Supports 161 languages. Returns a top-1 ISO code or a probability-ordered list.
Parameters¶
text: Input string (Unicode).top_k: How many candidates to return (default3).profiles_dir: Optional path overriding the bundled language profiles.use_mmap: IfTrue, memory-map the profile tries (lower RAM; slightly slower first access).return_top1: IfTrue, return only the best language code; otherwise a list of(lang, prob).
Return value¶
strwhenreturn_top1=True(e.g.,"it").list[tuple[str, float]]whenreturn_top1=False(sorted by probability).
Examples¶
Top-1 (single code)¶
import textwizard as tw
text = "Ciao, come stai oggi?"
lang = tw.lang_detect(text, return_top1=True)
print(lang)
Output
it
Top-k distribution¶
import textwizard as tw
text = "The quick brown fox jumps over the lazy dog."
langs = tw.lang_detect(text, top_k=5, return_top1=False)
print(langs)
Output
[('en', 0.9999376335362183), ('mg', 4.719212057614953e-05), ('fy', 1.4727973350205069e-05), ('rm', 2.8718519851832537e-07), ('la', 1.5918465665694727e-07)]
Batch examples¶
import textwizard as tw
for s in [
"これは日本語のテスト文です。",
"Alex parle un peu français, aber nicht so viel.",
"¿Dónde está la estación de tren?",
]:
print("TOP1:", tw.lang_detect(s, return_top1=True))
Output
TOP1: ja TOP1: fr TOP1: es
Profiles directory & mmap¶
from pathlib import Path
import textwizard as tw
langs = tw.lang_detect(
"Buongiorno a tutti!",
profiles_dir=Path("/opt/textwizard/profiles"), # custom profiles
use_mmap=True, # lower RAM
top_k=3,
)
print(langs)
Operational notes¶
Lazy loading: the model loads on first call and is cached for reuse.
Short/ASCII texts: ambiguity is common; provide longer samples for better confidence.
Profiles: if you keep profiles outside the package, pass
profiles_dir.Probabilities are softmax-normalised over candidates returned by the gate.
Supported languages (161)¶
aa — Afar |
ab — Abkhazian |
af — Afrikaans |
am — Amharic |
an — Aragonese |
ar — Arabic |
as — Assamese |
av — Avaric |
ay — Aymara |
az — Azerbaijani |
ba — Bashkir |
be — Belarusian |
bg — Bulgarian |
bm — Bambara |
bn — Bengali |
bo — Tibetan |
br — Breton |
bs — Bosnian |
ca — Catalan |
ce — Chechen |
ch — Chamorro |
cs — Czech |
cv — Chuvash |
cy — Welsh |
da — Danish |
de — German |
dz — Dzongkha |
ee — Ewe |
el — Greek |
en — English |
eo — Esperanto |
es — Spanish |
et — Estonian |
eu — Basque |
fa — Persian |
ff — Fula |
fi — Finnish |
fj — Fijian |
fo — Faroese |
fr — French |
fy — Western Frisian |
ga — Irish |
gd — Scottish Gaelic |
gl — Galician |
gn — Guarani |
gu — Gujarati |
gv — Manx |
ha — Hausa |
he — Hebrew |
hi — Hindi |
hr — Croatian |
ht — Haitian Creole |
hu — Hungarian |
hy — Armenian |
id — Indonesian |
ig — Igbo |
io — Ido |
is — Icelandic |
it — Italian |
iu — Inuktitut |
ja — Japanese |
jv — Javanese |
ka — Georgian |
kg — Kongo |
ki — Kikuyu |
kk — Kazakh |
kl — Kalaallisut |
km — Khmer |
kn — Kannada |
ko — Korean |
kr — Kanuri |
ks — Kashmiri |
ku — Kurdish |
kv — Komi |
kw — Cornish |
ky — Kyrgyz |
la — Latin |
lb — Luxembourgish |
lg — Ganda |
li — Limburgan |
ln — Lingala |
lo — Lao |
lt — Lithuanian |
lu — Luba-Kasai |
lv — Latvian |
mg — Malagasy |
mh — Marshallese |
mi — Māori |
mk — Macedonian |
ml — Malayalam |
mn — Mongolian |
mr — Marathi |
ms — Malay |
mt — Maltese |
my — Burmese |
ne — Nepali |
nl — Dutch |
nn — Norwegian Nynorsk |
no — Norwegian |
nv — Navajo |
ny — Chichewa / Nyanja |
oc — Occitan |
om — Oromo |
or — Odia |
os — Ossetian |
pa — Punjabi |
pl — Polish |
ps — Pashto |
pt — Portuguese |
qu — Quechua |
rm — Romansh |
rn — Kirundi |
ro — Romanian |
ru — Russian |
rw — Kinyarwanda |
sa — Sanskrit |
sc — Sardinian |
sd — Sindhi |
se — Northern Sami |
sg — Sango |
si — Sinhala |
sk — Slovak |
sl — Slovenian |
sm — Samoan |
sn — Shona |
so — Somali |
sq — Albanian |
sr — Serbian |
ss — Swati |
st — Sotho |
su — Sundanese |
sv — Swedish |
sw — Swahili |
ta — Tamil |
te — Telugu |
tg — Tajik |
th — Thai |
ti — Tigrinya |
tk — Turkmen |
tl — Tagalog |
tn — Tswana |
to — Tonga |
tr — Turkish |
ts — Tsonga |
tt — Tatar |
tw — Twi |
ty — Tahitian |
ug — Uyghur |
uk — Ukrainian |
ur — Urdu |
uz — Uzbek |
ve — Venda |
vi — Vietnamese |
vo — Volapük |
wa — Walloon |
wo — Wolof |
xh — Xhosa |
yi — Yiddish |
yo — Yoruba |
zh — Chinese |
zu — Zulu |