TextWizard
TextWizard is a Python library to extract, clean, and analyze text from PDFs, DOCX, images, CSV, HTML/XML, and more. It includes local OCR (Tesseract), cloud OCR with Azure Document Intelligence, multi-backend NER, language detection, lexical statistics, and HTML utilities.
Installation
Requires Python 3.9+.
pip install textwizard
Optional extras:
# Azure OCR
pip install "textwizard[azure]"
# NER engines
pip install "textwizard[ner]"
# Everything
pip install "textwizard[all]"
Note
OCR requires a local Tesseract installation.
NER with spaCy requires a downloaded model, e.g. python -m spacy download en_core_web_sm.
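Before enabling OCR it can help to confirm that a Tesseract binary is reachable. The following is a minimal sketch using only the standard library. `tesseract_available` is a hypothetical helper, not part of TextWizard, and it assumes the executable is named `tesseract` and is found through PATH.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH."""
    return shutil.which(binary) is not None


if tesseract_available():
    print("Tesseract found; OCR can be enabled.")
else:
    print("Tesseract not found on PATH; install it before using OCR.")
```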
Quick start
import textwizard as tw
text = tw.extract_text("example.pdf")
print(text)
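API overview
| Method | Purpose |
|---|---|
| extract_text | Local text extraction with optional Tesseract OCR |
| extract_text_azure | Cloud extraction via Azure (text, tables, key-value) |
| clean_html | High-level HTML cleaning with semantic flags |
| clean_xml | XML cleanup and normalization |
| clean_csv | CSV cleanup with configurable dialect |
| extract_entities | NER via spaCy / Stanza / spaCy-Stanza |
| correctness_text | Spell checking |
| lang_detect | Language detection |
| analyze_text_statistics | Lexical metrics (entropy, Zipf, Gini, …) |
| text_similarity | Text similarity (e.g. Levenshtein, Jaccard, cosine) |
| beautiful_html | Pretty-print HTML |
| html_to_markdown | Convert HTML → Markdown |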
Text extraction
Parameters
- input_data: str | bytes | Path
- extension: Required only if input_data is bytes.
- pages: Page/sheet selection.
  - Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
  - Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
- ocr: Enable Tesseract OCR for images and scanned PDFs/DOCX.
- language_ocr: OCR language, default "eng".
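To make the page-selection forms concrete, here is a minimal sketch of how a spec such as `1`, `"1-3"`, or `[1, 3, "5-8"]` could be expanded into explicit page numbers. `expand_pages` is a hypothetical helper written for illustration; it is not TextWizard's parser, and the library may handle edge cases differently.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand 1, "1-3", "1,3,5-7" or [1, 3, "5-8"] into sorted page numbers."""
    if isinstance(spec, int):
        return [spec]
    pages = set()
    if isinstance(spec, str):
        for part in spec.split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                # An inclusive range such as "5-8"
                start, end = (int(x) for x in part.split("-", 1))
                pages.update(range(start, end + 1))
            else:
                pages.add(int(part))
        return sorted(pages)
    # Any other iterable: expand each item and merge the results
    for item in spec:
        pages.update(expand_pages(item))
    return sorted(pages)


print(expand_pages([1, 3, "5-8"]))  # [1, 3, 5, 6, 7, 8]
```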
Examples
Basic:
import textwizard as tw
txt = tw.extract_text("docs/report.pdf")
print(txt)
From bytes:
from pathlib import Path
import textwizard as tw
raw = Path("img.png").read_bytes()
txt_img = tw.extract_text(raw, extension="png")
print(txt_img)
Page selection and OCR:
import textwizard as tw
sel = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = tw.extract_text("scan.tiff", ocr=True, language_ocr="ita")
print(sel)
print(ocr_txt)
Supported formats
| Format | OCR |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |
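The table can be turned into a small lookup that tells a caller whether `ocr=True` is worth passing for a given file. This is only a sketch: `OCR_SUPPORT` and `ocr_mode` are hypothetical names, and the PDF entry assumes OCR is optional for PDFs, as the `ocr` parameter description suggests.

```python
from pathlib import Path

# Transcribed from the table above; the "pdf" entry is an assumption.
OCR_SUPPORT = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no",
    "csv": "no", "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: str) -> str:
    """Return "optional", "default", "no", or "unsupported" for a file name."""
    ext = Path(filename).suffix.lstrip(".").lower()
    return OCR_SUPPORT.get(ext, "unsupported")


print(ocr_mode("scan.TIFF"))    # default
print(ocr_mode("report.docx"))  # optional
```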
Azure OCR
Parameters
- input_data: str | bytes | Path
- extension: File extension when bytes are passed.
- language_ocr: OCR language code (ISO-639).
- pages: Page selection (int, "1,3,5-7", or list).
- azure_endpoint: Azure Document Intelligence endpoint URL.
- azure_key: Azure API key.
- azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
- hybrid: If True, for PDFs, use native text for text pages and OCR for raster pages.
Example
import textwizard as tw
res = tw.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)
print(res.text)
print(res.pretty_tables)
print(res.key_value)
Output
Fattura n. 2025-031 — Cliente: ACME S.p.A. — Data: 14/03/2025 — Totale: €1.234,56 …
[{'rows': 3, 'cols': 3, 'preview': [['Item', 'Qty', 'Total'], ['Widget A', '2', '€200'], ['Widget B', '1', '€150']]}]
{'InvoiceNumber': '2025-031', 'InvoiceDate': '2025-03-14', 'Customer': 'ACME S.p.A.', 'Total': '€1.234,56'}
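In practice the endpoint and key are usually kept out of source code. A minimal sketch of reading them from the environment follows. The variable names `AZURE_DI_ENDPOINT` and `AZURE_DI_KEY` and the helper `azure_settings` are assumptions made for this example; TextWizard does not define them.

```python
import os
from typing import Dict, Mapping, Optional


def azure_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the azure_endpoint / azure_key keyword arguments from env vars."""
    env = os.environ if env is None else env
    try:
        return {
            "azure_endpoint": env["AZURE_DI_ENDPOINT"],  # hypothetical name
            "azure_key": env["AZURE_DI_KEY"],            # hypothetical name
        }
    except KeyError as missing:
        raise RuntimeError(
            f"Missing environment variable: {missing.args[0]}"
        ) from None


# Possible usage, assuming both variables are set:
#   res = tw.extract_text_azure("invoice.pdf", **azure_settings())
```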
HTML cleaning
See Clean HTML for A/B/C modes (text-only, structural clean, text+preserve), wildcard tag/attribute handling, and examples.
A) Text-only (no params)
import textwizard as tw
txt = tw.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)
Output
Hello
B) Structural clean (HTML out)
import textwizard as tw
html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
<article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = tw.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)
Output
<html>
<body>
<article><h1>Title</h1><p>hello</p></article>
</body></html>
C) Text with preservation (False flags)
import textwizard as tw
html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = tw.clean_html(
    html,
    remove_sectioning_tags=False,  # keep <article> in output
    remove_heading_tags=False,     # keep <h1> in output
    remove_comments=False,         # keep comments
)
print(txt)
Output
<article><h1>T</h1>Body<!-- c --></article>
Wildcard selectors
import textwizard as tw
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = tw.clean_html(
    html,
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_specific_tags=["im_"],
)
print(out)
Output
<html><head></head><body><div></div></body></html>
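The attribute patterns above behave like shell-style globs. As an illustration only, the sketch below reproduces that matching with the standard library's `fnmatch`. It is an assumption that TextWizard's wildcard rules resemble glob matching; `strip_attributes` is a hypothetical helper, and tag patterns are not covered here.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable


def strip_attributes(attrs: Dict[str, str], patterns: Iterable[str]) -> Dict[str, str]:
    """Drop every attribute whose name matches at least one glob pattern."""
    patterns = list(patterns)
    return {
        name: value
        for name, value in attrs.items()
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    }


attrs = {"id": "hero", "data-track": "x", "onclick": "h()", "class": "big"}
print(strip_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'big'}
```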
XML cleaning
import textwizard as tw
xml = "<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"
fixed = tw.clean_xml(
    xml,
    remove_namespaces=True,
    remove_empty_tags=True,
    remove_comments=True,
    normalize_entities=True,
)
print(fixed)
Output
<root><b>ok</b></root>
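For readers who want to see what these three clean-up steps amount to, here is a rough equivalent built on `xml.etree.ElementTree`. It is a sketch, not TextWizard's implementation: `clean_xml_sketch` is a hypothetical name, namespaced attributes are left alone, and tail text following a removed element is discarded.

```python
import xml.etree.ElementTree as ET


def _drop_empty(parent: ET.Element) -> None:
    """Recursively remove children with no text, no children, no attributes."""
    for child in list(parent):
        _drop_empty(child)
        has_text = bool((child.text or "").strip())
        if len(child) == 0 and not has_text and not child.attrib:
            parent.remove(child)


def clean_xml_sketch(xml: str) -> str:
    # The default parser already discards comments while building the tree.
    root = ET.fromstring(xml)
    # Strip the "{namespace}" prefix that ElementTree puts on tag names.
    for elem in root.iter():
        if "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    _drop_empty(root)
    return ET.tostring(root, encoding="unicode")


print(clean_xml_sketch("<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"))
# <root><b>ok</b></root>
```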
CSV cleaning
import textwizard as tw
csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    remove_columns=["id", "salary"],
    remove_values=["John", "50000"],
    trim_whitespace=True,
    remove_empty_columns=True,
    remove_empty_rows=True,
    remove_duplicates_rows=True,
)
print(out)
Output
name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York
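The output above can be reproduced step by step with the standard `csv` module, which shows the order of operations: drop columns, blank out listed values, then drop empty and duplicate rows. This is a sketch under that assumed ordering; `clean_rows` is a hypothetical helper and not how TextWizard is implemented.

```python
import csv
import io
from typing import Iterable


def clean_rows(text: str, drop_columns: Iterable[str], blank_values: Iterable[str]) -> str:
    drop_columns, blank_values = set(drop_columns), set(blank_values)
    reader = csv.DictReader(io.StringIO(text))
    keep = [c for c in reader.fieldnames if c not in drop_columns]

    seen, rows = set(), []
    for row in reader:
        cells = [(row[c] or "").strip() for c in keep]              # trim whitespace
        cells = tuple("" if v in blank_values else v for v in cells)  # blank listed values
        if not any(cells) or cells in seen:                          # skip empty / duplicate rows
            continue
        seen.add(cells)
        rows.append(cells)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(keep)
    writer.writerows(rows)
    return out.getvalue()


data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
print(clean_rows(data, ["id", "salary"], ["John", "50000"]))
```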
Named-Entity Recognition (NER)
import textwizard as tw
sample = (
    "Alex Rivera traveled to Springfield to meet the research team at Northstar Analytics on 14 March 2025. "
    "The next day, he signed a pilot agreement with Horizon Bank and gave a talk at the University of Westland at 10:30 AM."
)
res = tw.extract_entities(sample)
print([e.text for e in res.entities["PERSON"]])
print([e.text for e in res.entities["GPE"]])
print([e.text for e in res.entities["ORG"]])
Output
['Alex Rivera']
['Springfield']
['Northstar Analytics', 'Horizon Bank', 'the University of Westland']
Spell checking
import textwizard as tw
check = tw.correctness_text("Thiss sentense has a typo.", language="en")
print(check)
Output
{"errors_count": 2, "errors": ["thiss", "sentense"]}
Language detection
Character n-gram detector with smart gating, priors, and linguistic hints. Supports 161 ISO-639-1 languages. Returns either a single top-1 code or a ranked list with probabilities.
import textwizard as tw
print("LANGS:", tw.lang_detect("Ciao, come stai oggi?", return_top1=True))
print("LANGS:", tw.lang_detect("The quick brown fox jumps over the lazy dog.", return_top1=True))
print("LANGS:", tw.lang_detect("これは日本語のテスト文です。", return_top1=True))
Output
LANGS: it
LANGS: en
LANGS: ja
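To illustrate the general idea behind character n-gram detection, the toy sketch below compares trigram counts of the input against tiny per-language profiles using cosine similarity. It is deliberately simplistic and is not TextWizard's detector: the profiles, `char_ngrams`, `cosine`, and `detect` are all made up for this example, and there is no gating, prior, or linguistic hint.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy profiles built from one short sentence each (real profiles need far more text).
PROFILES = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "it": char_ngrams("ciao come stai oggi io sto bene grazie e tu"),
}


def detect(text: str) -> str:
    """Return the language code whose profile is closest to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


print(detect("come stai"))  # it
print(detect("the dog"))    # en
```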
Text statistics
import textwizard as tw
stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats)
Output
{"entropy": 2.646, "zipf": {"slope": -0.605, "r2": 0.838}, "vocab_gini": 0.229, "type_token_ratio": 0.7, "hapax_ratio": 0.5, "simpson_index": 0.82, "yule_k": 800.0, "avg_word_length": 1.0}
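Several of these metrics are easy to check by hand. The sketch below uses common textbook definitions (Shannon entropy in bits, Gini-Simpson index, Yule's K, hapax count over total tokens) and whitespace tokenization. For the sample text these definitions give the same values as the output above, but that TextWizard computes them this way is an assumption. `lexical_stats` is a hypothetical helper; the Zipf and Gini fields are omitted.

```python
import math
from collections import Counter
from typing import Dict


def lexical_stats(text: str) -> Dict[str, float]:
    tokens = text.split()
    if not tokens:
        raise ValueError("text contains no tokens")
    n = len(tokens)
    counts = Counter(tokens)
    probs = [c / n for c in counts.values()]
    return {
        # Shannon entropy of the word distribution, in bits
        "entropy": -sum(p * math.log2(p) for p in probs),
        # Distinct words divided by total words
        "type_token_ratio": len(counts) / n,
        # Words occurring exactly once, divided by total words
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / n,
        # Gini-Simpson diversity: 1 - sum(p^2)
        "simpson_index": 1 - sum(p * p for p in probs),
        # Yule's K: 10^4 * (sum(freq^2) - N) / N^2
        "yule_k": 1e4 * (sum(c * c for c in counts.values()) - n) / (n * n),
        "avg_word_length": sum(len(t) for t in tokens) / n,
    }


stats = lexical_stats("a a a b b c d e f g")
print({k: round(v, 3) for k, v in stats.items()})
```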
Text similarity
import textwizard as tw
print(
    tw.text_similarity("kitten", "sitting", method="levenshtein"),
    tw.text_similarity("hello world", "hello brave world", method="jaccard"),
    tw.text_similarity("abc def", "abc xyz", method="cosine"),
)
Output
0.5714285714285714 0.6666666666666666 0.33333333333333337
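The first two scores can be reproduced with standard definitions: edit distance normalized by the longer string's length, and Jaccard overlap of whitespace-separated word sets. The sketch below assumes those definitions, which match the printed values for these inputs, though the library's internals may differ. The function names are hypothetical, and the cosine variant is left out because its vectorization is not described here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return 1.0 if not union else len(words_a & words_b) / len(union)


print(levenshtein_similarity("kitten", "sitting"))
print(jaccard_similarity("hello world", "hello brave world"))
```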
HTML tools
Pretty-print HTML
import textwizard as tw
html = """
<body>
<button id='btn1' class="primary" disabled="disabled">
Click <b>me</b>
</button>
<img alt="Logo" src="/static/logo.png">
</body>
"""
print(tw.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
))
Output
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <button class="primary" disabled id="btn1">
            Click
            <b>
                me
            </b>
        </button>
        <img alt="Logo" src="/static/logo.png">
    </body>
</html>
HTML → Markdown
import textwizard as tw
print(tw.html_to_markdown("<h1>Hello</h1><p>World</p>"))
Output
# Hello
World